# RAG Workshop

**Skills: HuggingFace, LangChain, Pinecone**

**Other Resources:**
- [Get your Groq API Key](https://console.groq.com/keys)
- [Get your Pinecone API Key](https://www.pinecone.io/)


### What is RAG anyway?


![withoutRAG](https://github.com/user-attachments/assets/649d6101-b63a-4750-997a-b6abc25e5609)

![withRAG](https://github.com/user-attachments/assets/e6dd9c46-0bf9-4c31-bd72-a27939ef82b8)

Retrieval-Augmented Generation (RAG) is a technique primarily used in GenAI applications to improve the quality and accuracy of generated text by LLMs by combining two key processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system then uses it to help generate a response. This is where the model, like GPT, creates new text (answers, explanations, etc.) based on the retrieved information.

# Install libraries

In [1]:
! pip install langchain langchain-community openai groq tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference sentence-transformers

Collecting langchain
  Downloading langchain-0.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.0-py3-none-any.whl.metadata (2.8 kB)
Collecting openai
  Downloading openai-1.46.1-py3-none-any.whl.metadata (24 kB)
Collecting groq
  Downloading groq-0.11.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting unstructured
  Downloading unstructured-0.15.12-py3-none-any.whl.metadata (29 kB)
Collecting pdfminer==20191125
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.2/4.2 MB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.p

In [5]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os
from groq import Groq

pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

# openai_api_key = userdata.get("OPENAI_API_KEY")
# os.environ['OPENAI_API_KEY'] = openai_api_key

groq_api_key = userdata.get("GROQ_API_KEY")
os.environ['GROQ_API_KEY'] = groq_api_key

# Initialize the HuggingFace Embeddings client

In [4]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]



1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [6]:
text = "Hello my name is Faizan"

query_result = embeddings.embed_query(text)

In [7]:
len(query_result)

384

# Initialize the Groq client

In [8]:
# Free Llama 3.1 API via Groq

groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Calculating sentence similarity with embeddings

In [9]:
def get_huggingface_embeddings(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    return model.encode(text)


def cosine_similarity_between_sentences(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_huggingface_embeddings(sentence1))
    embedding2 = np.array(get_huggingface_embeddings(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like walking to the park"
sentence2 = "I like running to the office"


similarity = cosine_similarity_between_sentences(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")

Embedding for Sentence 1: [[-7.94647262e-04 -4.52190749e-02  5.60034551e-02  4.00062464e-02
   7.82356039e-02 -3.10016028e-03  1.56902865e-01 -1.61643641e-03
   8.40177536e-02  7.29586333e-02 -2.27428153e-02 -1.00336559e-02
  -4.77766357e-02  5.78007065e-02  6.89263120e-02  2.29866221e-03
   3.41052189e-02  8.23902860e-02 -4.47453046e-03  1.18202856e-02
  -7.44135678e-02  2.10828464e-02  1.92200206e-02  5.48400655e-02
  -1.07110761e-01  8.79157037e-02 -1.64800771e-02  6.51672296e-03
  -6.67020795e-05 -4.27562976e-03 -8.20703059e-02  7.05852956e-02
  -1.80556532e-02  3.27348486e-02 -4.36549522e-02  9.93786659e-03
   5.78057803e-02 -6.92316219e-02  4.53142636e-02  4.96660285e-02
  -1.49475699e-02  5.79100735e-02  8.14058036e-02  2.62879906e-03
  -1.49136577e-02 -4.37886156e-02  2.26743110e-02 -3.19027528e-02
   1.00592583e-01  3.10835298e-02  1.30596399e-01  7.27660581e-03
   8.58721696e-03  7.95205031e-03 -7.91899022e-03  4.98277741e-03
  -8.22421089e-02  2.46388651e-02  5.11084683e-02 

# Load in the Data

Learn more about the dataset [here](https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset)

In [10]:
! kaggle datasets download -d ayoubcherguelaine/company-documents-dataset
! unzip company-documents-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
License(s): apache-2.0
Downloading company-documents-dataset.zip to /content
  0% 0.00/9.34M [00:00<?, ?B/s]
100% 9.34M/9.34M [00:00<00:00, 137MB/s]
Archive:  company-documents-dataset.zip
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_1.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_2.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_3.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_4.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_5.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_6.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/mont

In [11]:
def process_directory(directory_path):
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:

            file_path = os.path.join(root, file)
            print(f"Processing file: {file_path}")
            loader = PyPDFLoader(file_path)
            data.append({"File": file_path, "Data": loader.load()})

    return data

directory_path = "/content/CompanyDocuments"
documents = process_directory(directory_path)


Processing file: /content/CompanyDocuments/Shipping orders/order_10491.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10921.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10504.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10332.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10673.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10783.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10928.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10804.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10644.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10636.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10999.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10476.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10991.pdf
Processing file: /content

In [12]:
documents

[{'File': '/content/CompanyDocuments/Shipping orders/order_10491.pdf',
  'Data': [Document(metadata={'source': '/content/CompanyDocuments/Shipping orders/order_10491.pdf', 'page': 0}, page_content='Order ID: 10491\nShipping Details:\nShip Name: Furia Bacalhau e Frutos do Mar\nShip Address: Jardim das rosas n. 32\nShip City: Lisboa\nShip Region: Southern Europe\nShip Postal Code: 1675\nShip Country: Portugal\nCustomer Details:\nCustomer ID: FURIB\nCustomer Name: Furia Bacalhau e Frutos do Mar\nEmployee Details:\nEmployee Name: Laura Callahan\nShipper Details:\nShipper ID: 3\nShipper Name: Federal Shipping\nOrder Details:\nOrder Date: 2017-03-31\nShipped Date: 2017-04-08\nProducts:\n--------------------------------------------------------------------------------------------------\nProduct: Gula Malacca\nQuantity: 15\nUnit Price: 15.5\nTotal: 232.5\n--------------------------------------------------------------------------------------------------\nProduct: Original Frankfurter grüne Soße\

# Initialize Pinecone

In [13]:
# Make sure to create a Pinecone index with 384 dimensions

index_name = "rag-workshop"

namespace = "company-documents"

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Insert data into Pinecone

In [14]:
for document in documents:
  print(document['File'], document['Data'])

/content/CompanyDocuments/Shipping orders/order_10491.pdf [Document(metadata={'source': '/content/CompanyDocuments/Shipping orders/order_10491.pdf', 'page': 0}, page_content='Order ID: 10491\nShipping Details:\nShip Name: Furia Bacalhau e Frutos do Mar\nShip Address: Jardim das rosas n. 32\nShip City: Lisboa\nShip Region: Southern Europe\nShip Postal Code: 1675\nShip Country: Portugal\nCustomer Details:\nCustomer ID: FURIB\nCustomer Name: Furia Bacalhau e Frutos do Mar\nEmployee Details:\nEmployee Name: Laura Callahan\nShipper Details:\nShipper ID: 3\nShipper Name: Federal Shipping\nOrder Details:\nOrder Date: 2017-03-31\nShipped Date: 2017-04-08\nProducts:\n--------------------------------------------------------------------------------------------------\nProduct: Gula Malacca\nQuantity: 15\nUnit Price: 15.5\nTotal: 232.5\n--------------------------------------------------------------------------------------------------\nProduct: Original Frankfurter grüne Soße\nQuantity: 7\nUnit Pric

In [16]:
document_data = []
for document in documents:
  document_source = document['Data'][0].metadata['source']
  document_content = document['Data'][0].page_content

  file_name = document_source.split("/")[-1]
  folder_names = document_source.split("/")[2:-1]

  # print('DOCUMENT SOURCE', document_source)
  # print('DOCUMET CONTENT', document_content)

  doc = Document(
      page_content= f"<Source>\n{document_source}\n</Source>\n\n<Content>\n{document_content}\n</Content>",
      metaddata = {
          "filename": file_name,
          "parent_folder": folder_names[-1]
      }
  )
  document_data.append(doc)

In [17]:
document_data

[Document(metadata={}, page_content='<Source>\n/content/CompanyDocuments/Shipping orders/order_10491.pdf\n</Source>\n\n<Content>\nOrder ID: 10491\nShipping Details:\nShip Name: Furia Bacalhau e Frutos do Mar\nShip Address: Jardim das rosas n. 32\nShip City: Lisboa\nShip Region: Southern Europe\nShip Postal Code: 1675\nShip Country: Portugal\nCustomer Details:\nCustomer ID: FURIB\nCustomer Name: Furia Bacalhau e Frutos do Mar\nEmployee Details:\nEmployee Name: Laura Callahan\nShipper Details:\nShipper ID: 3\nShipper Name: Federal Shipping\nOrder Details:\nOrder Date: 2017-03-31\nShipped Date: 2017-04-08\nProducts:\n--------------------------------------------------------------------------------------------------\nProduct: Gula Malacca\nQuantity: 15\nUnit Price: 15.5\nTotal: 232.5\n--------------------------------------------------------------------------------------------------\nProduct: Original Frankfurter grüne Soße\nQuantity: 7\nUnit Price: 10.4\nTotal: 72.8\nTotal Price:\n\n</Conte

In [18]:
vectorstore_from_documents = PineconeVectorStore.from_documents(
    document_data,
    embeddings,
    index_name=index_name,
    namespace=namespace
)

# Perform RAG

In [20]:
# Initialize pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"), )

# Connect to Pinecone index
pinecone_index = pc.Index(index_name)

In [21]:
query = "What are some common products bought by Mary Saveley"

In [23]:
raw_query_embedding = get_huggingface_embeddings(query)
raw_query_embedding

array([-4.26725596e-02, -6.77617192e-02,  1.35143986e-02, -9.34239384e-03,
       -3.53645310e-02,  1.50140822e-01,  2.53909770e-02, -9.55799315e-03,
       -3.04681305e-02, -8.55066180e-02,  7.04338029e-02, -3.25601851e-03,
        1.76474210e-02, -6.33681267e-02, -1.97862964e-02,  5.99328689e-02,
       -3.36771784e-03,  1.15839437e-01, -2.16768794e-02, -2.27578115e-02,
       -4.50396240e-02,  2.58584227e-02,  4.44791801e-02,  8.40743557e-02,
        2.40690890e-03, -1.55011723e-02,  6.56717038e-03,  2.42126826e-02,
       -4.94253030e-03, -8.99974853e-02, -2.78666709e-02,  1.11753186e-02,
       -3.28647792e-02,  2.51273136e-03,  1.29573699e-02,  3.35337520e-02,
        4.09512520e-02,  3.65742072e-02,  4.78686020e-02, -4.06755414e-03,
       -3.53386849e-02, -1.24937393e-01, -4.32422245e-03, -5.11763711e-03,
       -7.36645460e-02,  2.81967465e-02,  6.48829807e-03,  9.16031599e-02,
        2.23698728e-02, -4.23583426e-02, -6.91724867e-02,  7.57981511e-03,
       -3.12628709e-02, -

In [26]:
top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)
top_matches

{'matches': [{'id': 'ec1aed05-dbdf-47bc-9576-683e529cabc6',
              'metadata': {'text': '<Source>\n'
                                   '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10478.pdf\n'
                                   '</Source>\n'
                                   '\n'
                                   '<Content>\n'
                                   'Purchase Orders\n'
                                   'Order ID Order Date Customer Name\n'
                                   '10478 2017-03-18 Mary Saveley\n'
                                   'Products\n'
                                   'Product ID: Product: Quantity: Unit '
                                   'Price:\n'
                                   '10 Ikura 20 24.8\n'
                                   'Page 1\n'
                                   '</Content>'},
              'score': 0.564242482,
              'values': []},
             {'id': '32ec967e-4e9e-4204-b33f-ea995f5c3f81',
      

In [27]:
contexts = [item['metadata']['text'] for item in top_matches['matches']]
contexts

['<Source>\n/content/CompanyDocuments/PurchaseOrders/purchase_orders_10478.pdf\n</Source>\n\n<Content>\nPurchase Orders\nOrder ID Order Date Customer Name\n10478 2017-03-18 Mary Saveley\nProducts\nProduct ID: Product: Quantity: Unit Price:\n10 Ikura 20 24.8\nPage 1\n</Content>',
 '<Source>\n/content/CompanyDocuments/PurchaseOrders/purchase_orders_10806.pdf\n</Source>\n\n<Content>\nPurchase Orders\nOrder ID Order Date Customer Name\n10806 2017-12-31 Mary Saveley\nProducts\nProduct ID: Product: Quantity: Unit Price:\n2 Chang 20 19\n65 Louisiana Fiery Hot Pepper Sauce 2 21.05\n74 Longlife Tofu 15 10\nPage 1\n</Content>',
 '<Source>\n/content/CompanyDocuments/PurchaseOrders/purchase_orders_10334.pdf\n</Source>\n\n<Content>\nPurchase Orders\nOrder ID Order Date Customer Name\n10334 2016-10-21 Mary Saveley\nProducts\nProduct ID: Product: Quantity: Unit Price:\n52 Filo Mix 8 5.6\n68 Scottish Longbreads 10 10\nPage 1\n</Content>',
 '<Source>\n/content/CompanyDocuments/PurchaseOrders/purchase_o

In [30]:
augmented_query = "<CONTEXT>\n" + "\n\n---------------\n\n".join(contexts[:10]) + "\n-------------------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
print(augmented_query)

<CONTEXT>
<Source>
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10478.pdf
</Source>

<Content>
Purchase Orders
Order ID Order Date Customer Name
10478 2017-03-18 Mary Saveley
Products
Product ID: Product: Quantity: Unit Price:
10 Ikura 20 24.8
Page 1
</Content>

---------------

<Source>
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10806.pdf
</Source>

<Content>
Purchase Orders
Order ID Order Date Customer Name
10806 2017-12-31 Mary Saveley
Products
Product ID: Product: Quantity: Unit Price:
2 Chang 20 19
65 Louisiana Fiery Hot Pepper Sauce 2 21.05
74 Longlife Tofu 15 10
Page 1
</Content>

---------------

<Source>
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10334.pdf
</Source>

<Content>
Purchase Orders
Order ID Order Date Customer Name
10334 2016-10-21 Mary Saveley
Products
Product ID: Product: Quantity: Unit Price:
52 Filo Mix 8 5.6
68 Scottish Longbreads 10 10
Page 1
</Content>

---------------

<Source>
/content/CompanyDocuments/PurchaseOrders/

In [31]:
system_prompt = f'''You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.
Answer any questions I have, based on the data provided. Always consider all of the context provided when forming a response.
'''

llm_response = groq_client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": augmented_query}
    ]
)

response = llm_response.choices[0].message.content

In [32]:
print(response)

Based on the provided purchase orders and invoices, Mary Saveley has purchased the following products from different orders:

- Ikura (Product ID: 10) on two separate occasions: 
  1. In Order ID 10450 on 2017-02-19, 20 units at a price of 24.8 per unit.
  2. In Order ID 10478 on 2017-03-18, 20 units at a price of 24.8 per unit.

- Uncle Bob's Organic Dried Pears (Product ID: 7) on two separate occasions:
  1. In Order ID 10459 on 2017-02-27 (also present as an invoice), 16 units at a price of 24 per unit.
  2. In Order ID 10546 on 2017-05-23, 10 units at a price of 30 per unit.

- Louisiana Fiery Hot Pepper Sauce (Product ID: 65) on two separate occasions:
  1. In Order ID 10251 on 2016-07-08, 20 units at a price of 16.8 per unit.
  2. In Order ID 10806 on 2017-12-31, 2 units at a price of 21.05 per unit.


These products are common purchases made by Mary Saveley.


# Putting it all together

In [35]:
def perform_rag(query):
  raw_query_embedding = get_huggingface_embeddings(query)
  top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)
  contexts = [item['metadata']['text'] for item in top_matches['matches']]
  augmented_query = "<CONTEXT>\n" + "\n\n---------------\n\n".join(contexts[:10]) + "\n-------------------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

  system_prompt = f'''You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.
  Answer any questions I have, based on the data provided. Always consider all of the context provided when forming a response.
  '''

  llm_response = groq_client.chat.completions.create(
      model="llama-3.1-70b-versatile",
      messages=[
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": augmented_query}
      ]
  )

  return llm_response.choices[0].message.content

In [36]:
response = perform_rag("What are some trends with Ricardo Adocicados purchase orders?")
print(response)



Based on the provided shipping orders data for Ricardo Adocicados, here are some trends that can be observed:

1. **Shipper Preference**: Ricardo Adocicados primarily uses two shipping companies: United Package (Shipper ID: 2) and Speedy Express (Shipper ID: 1). United Package is used for about 70% of the shipping orders, suggesting a strong partnership or preferred shipping arrangement.
2. **Product Variety**: Ricardo Adocicados purchases a diverse range of products, including food and beverages (e.g., Chang, Spegesild, Inlagd Sill, Gustaf's Knäckebröd, Teatime Chocolate Biscuits, and Outback Lager). This indicates a diverse product portfolio, possibly catering to different customer segments or market demands.
3. **Quantity Fluctuations**: Order quantities vary significantly, ranging from small orders (e.g., 2 units of Flotemysost in order 10447) to larger orders (e.g., 70 units of Filo Mix in order 10563). This might indicate varying demand patterns or restocking schedules.
4. **Freq