# RAG Methods - Parsing

---

Embedding models are so robust and powerful that they seem to perform well with little effort. Because it is hard to quantify the quality of the results in an intuitive sense, it is easy to assume that the results are optimal.

In reality, there are optimizations to make throughout the workflow. This starts with how well we parse the data source and continues with how we cut up and load the data for the embeddings. Clearly, embedding 100 tokens at a time will give different results from embedding 500 tokens at a time.

One idea is that a smaller passage, such as a single sentence, encodes a more precise semantic meaning. This could help us identify the most pertinent text, but afterwards we want to pass the surrounding text to the LLM as well, so there is more context from which to generate a response.

Here we examine how to accomplish this with LlamaIndex.


## $\color{blue}{Sections:}$
* Admin
* Setup
* Data
* Small to Big
* Reranking Algorithms
* Metadata filter
* Evaluation

---
## $\color{blue}{Admin}$
---

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
%%capture
!pip install llama_index pypdf -q -U

In [None]:
import os
import getpass

In [None]:
from google.colab import drive

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

---
## $\color{blue}{Setup}$
---
Here we initialise the LLM and the embedding model. We connect to the Hugging Face API to use Zephyr, so the LLM runs in the cloud. The embedding model is small, so we can download it and run it locally.

In [None]:
%%capture
!pip install llama-index-llms-huggingface

In [None]:
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

In [None]:
from huggingface_hub import login
import os

In [None]:

HF_TOKEN = getpass.getpass('Hugging Face token please: ')

In [None]:
login(token=HF_TOKEN)
os.environ['HUGGINGFACEHUB_API_TOKEN'] = HF_TOKEN

In [None]:
# The token can also be passed directly to the constructor
llm = HuggingFaceInferenceAPI(
    model_name = 'HuggingFaceH4/zephyr-7b-alpha',
    api_key = HF_TOKEN
)

In [None]:
%%capture
%pip install llama-index-embeddings-huggingface

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [None]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

**We can attach the LLM and the embedding model to the llama_index Settings object**

In [None]:
from llama_index.core import Settings

In [None]:
Settings.llm = llm

In [None]:
Settings.embed_model = embed_model


---
## $\color{blue}{Data}$
---

The training data is a PDF guide to the new version of Microsoft Excel, Excel 2010, approximately 80 pages long.

The validation data is a university-issued how-to guide for MS Excel.

Get the train and validation nodes.

In [None]:
from llama_index.core import SimpleDirectoryReader

In [None]:
# Substitute the path to your own PDF document here
train_reader = SimpleDirectoryReader(
    input_files=["RAG_tutorial/Data/excel_train.pdf"]
)

In [None]:
train_data = train_reader.load_data()

In [None]:
len(train_data)

76

**At present the reader has split the data into one document per page by default; we still need to parse these documents into nodes of the size we want.**

In [None]:
from llama_index.core.node_parser import SimpleNodeParser, SentenceWindowNodeParser, SentenceSplitter

The main features of the simple node parser are the ability to set the chunk size and the chunk overlap.

After forming the nodes, text cleaning can be implemented in a loop by modifying the `.text` field. This is also the occasion to get rid of any irrelevant text, such as the contents page. The final data is a list, so elements can be removed at will.
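A minimal sketch of such a cleaning loop follows. It assumes a stand-in `Node` class with a `.text` attribute rather than real LlamaIndex nodes; the `clean_text` helper, its regex rules, and the `looks_like_contents_page` heuristic are illustrative assumptions, not part of the library, and would need tuning for a real document.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a LlamaIndex node; only .text matters here."""
    text: str


def clean_text(text: str) -> str:
    """Drop private-use glyphs (such as PDF bullet symbols) and tidy whitespace."""
    text = re.sub(r"[\uf000-\uf8ff]", "", text)   # private-use code points U+F000 to U+F8FF
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around newlines
    return text.strip()


def looks_like_contents_page(text: str) -> bool:
    """Illustrative heuristic: most lines end in dotted leaders and a page number."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    dotted = [ln for ln in lines if re.search(r"\.{4,}\s*\d+\s*$", ln)]
    return bool(lines) and len(dotted) / len(lines) > 0.5


nodes = [
    Node("Contents\nIntroduction ........ 1\nSparklines ........ 4\nSlicers ........ 7"),
    Node("\uf0b7 Open XML I:  Exploring the   Open XML File Formats  \n"),
]

# Clean every node in place, then drop the ones we do not want to embed.
for node in nodes:
    node.text = clean_text(node.text)
nodes = [n for n in nodes if not looks_like_contents_page(n.text)]

print(nodes[0].text)
```

Because the nodes live in an ordinary list, the filtering step is just a list comprehension; the same pattern should carry over to the real node list.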

The document reader used earlier also has behavior to consider. For instance, `train_data` is split into 76 document objects, which correspond to the pages of our PDF. This seems to affect node creation: no node will span separate documents.
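The interaction between chunk size, chunk overlap and page boundaries can be sketched in plain Python. This is not the LlamaIndex implementation (which respects sentence boundaries and counts tokens with a tokenizer); it assumes whitespace-separated words as tokens, and `chunk_document` and `chunk_documents` are hypothetical helpers.

```python
from typing import List


def chunk_document(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split one document into word windows of chunk_size sharing chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


def chunk_documents(documents: List[str], chunk_size: int, chunk_overlap: int) -> List[str]:
    """Chunk each document on its own, so no chunk spans two documents."""
    chunks = []
    for doc in documents:
        chunks.extend(chunk_document(doc, chunk_size, chunk_overlap))
    return chunks


pages = [
    " ".join(f"a{i}" for i in range(10)),  # page 1: 10 words
    " ".join(f"b{i}" for i in range(4)),   # page 2: 4 words
]
chunks = chunk_documents(pages, chunk_size=6, chunk_overlap=2)
for c in chunks:
    print(c)
```

The two chunks from the first page share two words, while the chunk from the second page starts fresh, which mirrors the observation that overlap does not carry across page documents.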

In [None]:
parser = SimpleNodeParser(chunk_size=500, chunk_overlap=100)
train_nodes = parser.get_nodes_from_documents(train_data, show_progress=True)

In [None]:
print('Train: ', len(train_nodes))

Train:  83


In [None]:
train_nodes[50]

TextNode(id_='d116fadb-276c-450d-83c0-7c4f8e76c6e2', embedding=None, metadata={'page_label': '48', 'file_name': 'excel_train.pdf', 'file_path': 'RAG_tutorial/Data/excel_train.pdf', 'file_type': 'application/pdf', 'file_size': 3074978, 'creation_date': '2024-05-07', 'last_modified_date': '2024-05-07'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='6a11e810-b3f5-4bdb-b865-83a58e196875', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '48', 'file_name': 'excel_train.pdf', 'file_path': 'RAG_tutorial/Data/excel_train.pdf', 'file_type': 'application/pdf', 'file_size': 3074978, 'creation_date': '2024-05-07', 'last_modified_date': '2024-05-07'}, hash='bb0ac41b970503633514e079acd627166c83edbc0b963653

In [None]:
train_nodes[50].text

"These are just a few of many tasks that advanced Office 2010 users can easily accomplish using \nbasic Office Open XML. Explore the resources that follow to help you get started with Office Open \nXML and for the steps you need to accompl ish these tasks and more:  \nNote : The following resources were written for Office 2007 but are equally applicable to the tasks \ndiscussed here for Office 2010.  \n\uf0b7 Open XML I: Exploring the Open XML File Formats  \n(http://office.microsoft.com/training/training.aspx?AssetID=RC102435331033 ) \n\uf0b7 Open XML II: Editing documents in the XML  \n(http://office.microsoft.com/training/training.aspx?AssetID=RC103570001033 ) \n\uf0b7 A Guide to Customizing the Office 2007 Ribbon  \n(http://technet.microsoft.com/en -us/magazine/2009.05.ri bbon.aspx )  \n\uf0b7 Using Office Open XML to Customize Document Formatting in the 2007 Office System  \n(http://msdn.microsoft.com/en -us/library/dd560821.aspx )  \n\uf0b7 Getting More from Document Themes in th

In [None]:
train_nodes[50].get_metadata_str()

'page_label: 48\nfile_name: excel_train.pdf\nfile_path: RAG_tutorial/Data/excel_train.pdf\nfile_type: application/pdf\nfile_size: 3074978\ncreation_date: 2024-05-07\nlast_modified_date: 2024-05-07'

---
## $\color{blue}{Small-to-Big}$
---

Now we look at how to embed a small window of text while still retrieving a larger portion of the original text around it.

The sentence splitter can be customized, with a regex for example, depending on the document.

In [None]:
# bullet_splitter = SentenceSplitter(paragraph_separator=r"\n●|\n-|\n", chunk_size=250)

One concept that makes intuitive sense is that more information is lost when embedding larger passages of text. This implies that it is a good idea to embed small passages. The disadvantage is that the LLM may not have enough context from a single short sentence.

The solution (sometimes called small-to-big) is to encode small passages of text and then, at inference time, return not only the text from which the embedding was made but also the surrounding text, specified by a window. In `SentenceWindowNodeParser`, `window_size` counts the sentences captured on each side of the embedded sentence: a window size of 1 gives access to the sentences immediately before and after, and a window size of 3 gives access to the three preceding and three following sentences.
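The windowing idea can be sketched in plain Python. In this sketch `window_size` counts sentences on each side of the target sentence, which is how the `SentenceWindowNodeParser` used below treats the parameter; the naive regex splitter and the `build_window_records` helper are assumptions made for illustration only.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_window_records(text: str, window_size: int) -> List[Dict[str, str]]:
    """One record per sentence: the sentence to embed plus its wider window."""
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append({
            "original_text": sentence,             # the small text that gets embedded
            "window": " ".join(sentences[lo:hi]),  # the larger text handed to the LLM
        })
    return records


text = (
    "Excel has a ribbon. Sparklines are tiny charts. Slicers filter pivot tables. "
    "Backstage view manages files. Workbooks can be shared."
)
records = build_window_records(text, window_size=1)
print(records[2]["original_text"])
print(records[2]["window"])
```

Sentences at the start or end of the text simply get a shorter window, since there is nothing further to include on that side.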



The variation of chunk length must take place in the lower-level object, the sentence splitter (which controls all the rules for breaking up the document); this can then be passed to the parser.
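As a hedged illustration, the splitter can be any callable that maps a string to a list of strings. The `bullet_aware_splitter` function below is a hypothetical example for bullet-heavy documents, and its regex rules are assumptions that would need adjusting to the actual PDF.

```python
import re
from typing import List


def bullet_aware_splitter(text: str) -> List[str]:
    """Split on sentence-ending punctuation and on bullet markers at line starts."""
    # First break the text wherever a new line starts with a bullet marker.
    blocks = re.split(r"\n\s*[●•\uf0b7\-]\s*", text)
    pieces = []
    for block in blocks:
        # Then split each block into sentences on ., ! or ? followed by whitespace.
        pieces.extend(re.split(r"(?<=[.!?])\s+", block))
    # Normalise internal whitespace and drop empty fragments.
    return [" ".join(p.split()) for p in pieces if p.strip()]


sample = "Key features. Try these:\n● Sparklines show trends.\n- Slicers filter data.\nDone now."
print(bullet_aware_splitter(sample))
```

To try it with the parser, the function could be passed as `sentence_splitter=bullet_aware_splitter` to `SentenceWindowNodeParser.from_defaults`, alongside `window_size`.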

In [None]:
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)

new_nodes = parser.build_window_nodes_from_documents(train_data)

In [None]:
len(new_nodes)

953

In [None]:
new_nodes[50]

TextNode(id_='5dea6275-f492-4a33-8540-999a14e08d1f', embedding=None, metadata={'window': '  \n \n 2 \n \nExcel 2010: At -a-Glance  \nTake a glance at how Excel  2010 is designed  to give you the best productivity experience across \nPC, phone , and browser.  Get a closer look at the new and improved features in the sections t hat \nfollow.  \n Today, a spreadsheet application is used for a variety of tasks, such as statistical analysis, \nforecasting revenue, managing business and personal finances, and maintaining address lists or \nstudent records.  Your needs m ay be increasing but there’s no need to outsource and hire a \nconsultant to meet them.  With Excel 2010, you can quickly create polished and professional \nwork. ', 'original_text': 'Get a closer look at the new and improved features in the sections t hat \nfollow.  \n'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date', 'window', 'original_text'

In [None]:
new_nodes[50].text

'Get a closer look at the new and improved features in the sections t hat \nfollow.  \n'

In [None]:
new_nodes[50].get_metadata_str()

'window:   \n \n 2 \n \nExcel 2010: At -a-Glance  \nTake a glance at how Excel  2010 is designed  to give you the best productivity experience across \nPC, phone , and browser.  Get a closer look at the new and improved features in the sections t hat \nfollow.  \n Today, a spreadsheet application is used for a variety of tasks, such as statistical analysis, \nforecasting revenue, managing business and personal finances, and maintaining address lists or \nstudent records.  Your needs m ay be increasing but there’s no need to outsource and hire a \nconsultant to meet them.  With Excel 2010, you can quickly create polished and professional \nwork. \noriginal_text: Get a closer look at the new and improved features in the sections t hat \nfollow.  \n'

We now have a very short text for the embedding, held in `original_text`, while `window` contains the much wider context of the sentences before and after.

**With our nodes we can now go ahead and make the embeddings**

In [None]:
from llama_index.core import VectorStoreIndex

In [None]:
index = VectorStoreIndex(new_nodes)

LlamaIndex does the heavy lifting.

In [None]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

At inference time, we supply a postprocessor object to the query engine; it replaces the text of each retrieved node with the surrounding window stored in its metadata before the nodes are passed to the LLM.
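Conceptually, the replacement step is small. The sketch below uses a hypothetical `RetrievedNode` stand-in rather than the library's own classes, and keeping the original text when the key is missing is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrievedNode:
    """Hypothetical stand-in for a retrieved node with a similarity score."""
    text: str
    score: float
    metadata: Dict[str, str] = field(default_factory=dict)


def replace_with_metadata(nodes: List[RetrievedNode], target_key: str) -> List[RetrievedNode]:
    """Swap each node's text for metadata[target_key]; keep the text if the key is missing."""
    return [
        RetrievedNode(
            text=node.metadata.get(target_key, node.text),
            score=node.score,
            metadata=node.metadata,
        )
        for node in nodes
    ]


retrieved = [
    RetrievedNode(
        text="Workbooks can be shared.",
        score=0.83,
        metadata={
            "original_text": "Workbooks can be shared.",
            "window": "Backstage view manages files. Workbooks can be shared. Teams can co-author.",
        },
    ),
    RetrievedNode(text="No window stored here.", score=0.41),
]

processed = replace_with_metadata(retrieved, target_key="window")
for node in processed:
    print(f"{node.score:.2f} -> {node.text}")
```

The similarity score still comes from the short embedded sentence; only the text shown to the LLM changes.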

In [None]:
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    node_postprocessors = [postproc],
)

In [None]:
response = query_engine.query('Where can I share my workbooks?')

In [None]:
print(response.response)



The new Microsoft Office Backstage™ view in Excel 2010 can help you share your workbooks easily. You can now more easily print, share and manage your workbooks, and customize your Excel 2010 experience, all from one convenient location.


In [None]:
window = response.source_nodes[0].node.metadata["window"]
sentence = response.source_nodes[0].node.metadata["original_text"]

In [None]:
response

In [None]:
print('Original Text: ', sentence,'\n')
print('Expanded Text: ', window)

Original Text:  Sometimes you want to share your workbooks with friends or co -workers.  

Expanded Text:  Effortlessly reuse content by previewing how information will look 
before actually pasting using Paste with Live Preview .  
  Add polished and professional images to your workbooks.   With new and imp roved picture 
editing tools  you don't have to be a graphic designer or use additional photo -editing 
programs.  
 Sometimes you want to share your workbooks with friends or co -workers.  At other times, you 
need to work together wi th a team on school or work projects.  In either instance, you want to 
focus on what needs to be done  as opposed to the processes that make sharing and 
communicating easy and convenient.  Excel 2010 provides new and enhanced features to help 
you work on team  projects or show your work to other people.  

