![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter RAG Workshop

**Skills: HuggingFace, LangChain, Pinecone**

**Other Resources:**
- [Get your Groq API Key](https://console.groq.com/keys)
- [Get your Pinecone API Key](https://www.pinecone.io/)


### What is RAG anyway?


![withoutRAG](https://github.com/user-attachments/assets/649d6101-b63a-4750-997a-b6abc25e5609)

![withRAG](https://github.com/user-attachments/assets/e6dd9c46-0bf9-4c31-bd72-a27939ef82b8)

Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of LLM-generated text by combining two processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system passes it to the model together with the question. The model (GPT or Llama, for example) then writes new text (answers, explanations, etc.) grounded in the retrieved information.
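
To make the two steps concrete, here is a minimal sketch that uses plain Python only. The documents, the word-overlap scoring, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are all made up for illustration; the rest of this notebook uses embeddings and Pinecone for retrieval instead of word overlap.

```python
import re

# Toy knowledge base; these documents are invented for illustration.
KNOWLEDGE_BASE = [
    "Order 1 was shipped by Speedy Express.",
    "Order 2 contained ten units of Tofu.",
    "The office is closed on public holidays.",
]


def tokenize(text):
    # Lowercase the text and return its set of alphanumeric words
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    # Retrieval: rank documents by how many words they share with the question
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved):
    # Generation input: the retrieved text is placed in front of the question
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"


question = "Which order contained Tofu?"
prompt = build_prompt(question, retrieve(question, KNOWLEDGE_BASE, top_k=1))
print(prompt)
```

The prompt printed at the end is what would be sent to the LLM, which then answers using the retrieved context rather than only its training data.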

# Install libraries

In [1]:
! pip install langchain langchain-community openai groq tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference sentence-transformers

Collecting langchain
  Downloading langchain-0.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.0-py3-none-any.whl.metadata (2.8 kB)
Collecting openai
  Downloading openai-1.46.1-py3-none-any.whl.metadata (24 kB)
Collecting groq
  Downloading groq-0.11.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting unstructured
  Downloading unstructured-0.15.12-py3-none-any.whl.metadata (29 kB)
Collecting pdfminer==20191125
  Downloading pdfminer-20191125.tar.gz (4.2 MB)

In [74]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os
from groq import Groq

pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

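# Optional: uncomment these three lines if you want to use perform_rag_openai later in the notebook
# openai_api_key = userdata.get("OPENAI_API_KEY")
# os.environ['OPENAI_API_KEY'] = openai_api_key
# openai_client = OpenAI()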

groq_api_key = userdata.get("GROQ_API_KEY")
os.environ['GROQ_API_KEY'] = groq_api_key

# Initialize the HuggingFace Embeddings client

In [4]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


In [5]:
text = "Hello my name is Faizan"

query_result = embeddings.embed_query(text)

In [7]:
query_result

[-0.0720648542046547,
 -0.016787143424153328,
 -0.005367891397327185,
 0.03257559984922409,
 -0.05403665080666542,
 -0.10101474076509476,
 0.15696749091148376,
 -0.01565767079591751,
 0.057388581335544586,
 -0.006620048079639673,
 0.00459709856659174,
 -0.05613341182470322,
 -0.0031999684870243073,
 -0.05476828292012215,
 0.05880908668041229,
 -0.04973414167761803,
 0.04378809779882431,
 0.04029259830713272,
 -0.06241477280855179,
 -0.057850539684295654,
 -0.023712923750281334,
 -0.019358182325959206,
 0.02101016417145729,
 0.0038826086092740297,
 -0.12229358404874802,
 -0.07189083844423294,
 0.00021962873870506883,
 0.11653251945972443,
 -0.00862892996519804,
 -0.06328410655260086,
 0.043614570051431656,
 0.05763636529445648,
 0.05562926083803177,
 0.03188147768378258,
 -0.05472175404429436,
 0.06748832762241364,
 -0.10776611417531967,
 -0.06072349473834038,
 -0.028926299884915352,
 -0.026966897770762444,
 0.029739413410425186,
 -0.07454313337802887,
 -0.025948744267225266,
 -0.058356

In [6]:
len(query_result)

384
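
The length matters because the Pinecone index used later is created with 384 dimensions, and vectors of any other size will not fit it. A small guard like the one below can catch a mismatch early. This is only a sketch: `check_dimension` is a hypothetical helper, and the zero-filled list stands in for a real embedding.

```python
def check_dimension(vector, expected_dim=384):
    # Raise early if the vector would not fit an index created with expected_dim
    if len(vector) != expected_dim:
        raise ValueError(f"Expected {expected_dim} dimensions, got {len(vector)}")
    return vector


# Stand-in for an embedding; real values would come from embeddings.embed_query(...)
fake_embedding = [0.0] * 384
check_dimension(fake_embedding)
```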

# Initialize the Groq client

In [10]:
# Free Llama 3.1 API via Groq

groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Calculating sentence similarity with embeddings

In [12]:
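from functools import lru_cache


@lru_cache(maxsize=None)
def load_embedding_model(model_name):
    # Load each SentenceTransformer once; reloading it on every call is slow
    return SentenceTransformer(model_name)


def get_huggingface_embeddings(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    model = load_embedding_model(model_name)
    return model.encode(text)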


def cosine_similarity_between_sentences(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_huggingface_embeddings(sentence1))
    embedding2 = np.array(get_huggingface_embeddings(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like running to the park"
sentence2 = "I like running to the office"


similarity = cosine_similarity_between_sentences(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")



Embedding for Sentence 1: [[ 1.23337246e-02 -5.13279764e-03  3.57208848e-02  3.30768302e-02
   7.82396719e-02 -4.72629350e-03  6.77640066e-02 -1.38818845e-02
   5.46722263e-02  3.72701623e-02 -6.69813827e-02 -1.25908284e-02
  -3.56482230e-02  1.04401549e-02  7.53818303e-02  2.34690451e-04
   5.66774085e-02  5.25491722e-02 -3.24604735e-02  5.10555133e-02
  -7.36738369e-02  2.87814029e-02  4.96469587e-02  1.52121764e-02
  -1.48359686e-01  9.21717212e-02 -3.28955650e-02 -1.19502367e-02
  -1.94677711e-02 -1.07078562e-02 -7.19258264e-02 -2.60948092e-02
   3.90276383e-03  2.24410407e-02 -2.83260699e-02  3.94298211e-02
   3.03058382e-02 -6.47537112e-02 -2.83420784e-03  6.29675165e-02
   5.27709723e-03 -2.52472498e-02  6.72674701e-02 -6.20138133e-04
  -2.60539651e-02 -1.77838989e-02 -1.57480724e-02 -1.01696076e-02
   1.10002562e-01 -2.35369720e-04  1.14311509e-01 -1.48504437e-03
   1.67892501e-02 -2.43191961e-02  3.09247011e-03  3.71284708e-02
  -8.57886225e-02  4.56232540e-02  3.15336883e-02 
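
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it by hand with NumPy on tiny made-up vectors; `cosine_similarity_manual` is a name chosen here for illustration and is not part of scikit-learn.

```python
import numpy as np


def cosine_similarity_manual(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Same direction gives a value near 1, perpendicular near 0, opposite near -1
same_direction = cosine_similarity_manual([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity_manual([1, 0], [0, 1])
opposite = cosine_similarity_manual([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Sentences with similar meaning should land close to the "same direction" end of that range.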

# Load in the Data

Learn more about the dataset [here](https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset)

In [None]:
! kaggle datasets download -d ayoubcherguelaine/company-documents-dataset
! unzip company-documents-dataset.zip

In [None]:
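def process_directory(directory_path):
    # Walk the directory tree and load every PDF into LangChain Document objects
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:
            # Skip anything that is not a PDF; PyPDFLoader cannot parse other file types
            if not file.lower().endswith(".pdf"):
                continue

            file_path = os.path.join(root, file)
            print(f"Processing file: {file_path}")
            loader = PyPDFLoader(file_path)
            # loader.load() returns one Document per page
            data.append({"File": file_path, "Data": loader.load()})

    return data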

directory_path = "/content/CompanyDocuments"
documents = process_directory(directory_path)


In [75]:
documents

[{'File': '/content/CompanyDocuments/Shipping orders/order_10491.pdf',
  'Data': [Document(metadata={'source': '/content/CompanyDocuments/Shipping orders/order_10491.pdf', 'page': 0}, page_content='Order ID: 10491\nShipping Details:\nShip Name: Furia Bacalhau e Frutos do Mar\nShip Address: Jardim das rosas n. 32\nShip City: Lisboa\nShip Region: Southern Europe\nShip Postal Code: 1675\nShip Country: Portugal\nCustomer Details:\nCustomer ID: FURIB\nCustomer Name: Furia Bacalhau e Frutos do Mar\nEmployee Details:\nEmployee Name: Laura Callahan\nShipper Details:\nShipper ID: 3\nShipper Name: Federal Shipping\nOrder Details:\nOrder Date: 2017-03-31\nShipped Date: 2017-04-08\nProducts:\n--------------------------------------------------------------------------------------------------\nProduct: Gula Malacca\nQuantity: 15\nUnit Price: 15.5\nTotal: 232.5\n--------------------------------------------------------------------------------------------------\nProduct: Original Frankfurter grüne Soße\

# Initialize Pinecone

In [21]:
# Create a Pinecone index with 384 dimensions first, matching the embedding size above
# The namespace does not need to be created in Pinecone beforehand; we name it here and use it later

index_name = "rag-workshop"

namespace = "company-documents"

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Insert data into Pinecone

In [None]:
for document in documents:
    print(document['File'], document['Data'])

In [22]:
# Prepare the text for embedding
document_data = []
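for document in documents:

    # Only the first page of each PDF is used; later pages are ignored
    document_source = document['Data'][0].metadata['source']
    document_content = document['Data'][0].page_content

    # e.g. ".../Shipping orders/order_10491.pdf" -> "order_10491.pdf"
    file_name = document_source.split("/")[-1]
    # Folders below /content, e.g. ["CompanyDocuments", "Shipping orders"]
    folder_names = document_source.split("/")[2:-1]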

    doc = Document(
        page_content = f"<Source>\n{document_source}\n</Source>\n\n<Content>\n{document_content}\n</Content>",
        metadata = {
            "file_name": file_name,
            "parent_folder": folder_names[-1],
            "folder_names": folder_names
        }
    )
    document_data.append(doc)
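
The path handling in the cell above can be checked on its own, without loading any PDFs. The sketch below derives the same three metadata fields with `pathlib` instead of `split("/")`. `path_metadata` is a hypothetical helper, and like the cell above it assumes every source path starts with `/content/`.

```python
from pathlib import PurePosixPath


def path_metadata(source):
    # parts looks like ('/', 'content', 'CompanyDocuments', ..., 'file.pdf'),
    # so parts[2:-1] drops '/', 'content' and the file name itself
    path = PurePosixPath(source)
    return {
        "file_name": path.name,
        "parent_folder": path.parent.name,
        "folder_names": list(path.parts[2:-1]),
    }


metadata = path_metadata("/content/CompanyDocuments/Shipping orders/order_10491.pdf")
print(metadata)
```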

In [None]:
document_data

In [None]:
# Insert documents into Pinecone
vectorstore_from_documents = PineconeVectorStore.from_documents(
    document_data,
    embeddings,
    index_name=index_name,
    namespace=namespace
)

# Perform RAG

In [24]:
# Initialize Pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))

# Connect to your Pinecone index
pinecone_index = pc.Index(index_name)

In [50]:
query = "What are some items that Pirkko Koskitalo is likely to buy next? What incentives can I put in place to ensure he orders more?"

In [51]:
raw_query_embedding = get_huggingface_embeddings(query)

In [76]:
raw_query_embedding

array([-2.37192772e-02,  5.11897393e-02, -3.76846157e-02, -3.59025188e-02,
       -1.70774478e-02,  4.99673262e-02,  5.41750118e-02,  1.63438041e-02,
       -7.25995526e-02,  1.91715304e-02,  4.41450216e-02, -1.21413413e-02,
       -5.31499051e-02,  1.80768650e-02,  4.38538678e-02, -2.18482707e-02,
        8.62548500e-02,  2.99322605e-02, -5.04300892e-02, -3.99649851e-02,
        9.35384538e-03, -9.97161716e-02,  8.71505290e-02,  1.87936903e-03,
       -2.41057333e-02, -1.08202472e-02, -1.57260206e-02, -2.97939051e-02,
        1.20479465e-02, -7.24623278e-02,  3.19638886e-02,  6.17180690e-02,
        7.11756572e-03, -5.66110983e-02,  5.71597740e-03,  8.44609085e-03,
        2.97988509e-03, -5.20805381e-02,  2.58033983e-02,  2.08102539e-02,
        3.32184136e-02, -5.05116731e-02, -5.20432591e-02,  2.99860332e-02,
        2.27520196e-03,  2.01680623e-02,  5.08307479e-03,  3.94049240e-03,
        4.68822531e-02,  1.60782076e-02, -1.24224432e-01, -3.86481136e-02,
        3.41078863e-02, -

In [53]:
top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

In [77]:
top_matches

{'matches': [{'id': '465b4d36-921d-4b22-8050-200ae41a3630',
              'metadata': {'file_name': 'purchase_orders_10333.pdf',
                           'folder_names': ['CompanyDocuments',
                                            'PurchaseOrders'],
                           'parent_folder': 'PurchaseOrders',
                           'text': '<Source>\n'
                                   '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10333.pdf\n'
                                   '</Source>\n'
                                   '\n'
                                   '<Content>\n'
                                   'Purchase Orders\n'
                                   'Order ID Order Date Customer Name\n'
                                   '10333 2016-10-18 Pirkko Koskitalo\n'
                                   'Products\n'
                                   'Product ID: Product: Quantity: Unit '
                                   'Price:\n'
                     

In [55]:
contexts = [item['metadata']['text'] for item in top_matches['matches']]
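
Each match returned by Pinecone also carries a similarity `score`, so weak matches can be dropped before they reach the prompt. The sketch below shows one way to do that. The response dictionary, its scores, and the `0.4` cutoff are all made up for illustration, and `select_contexts` is a hypothetical helper; a sensible threshold depends on your data.

```python
def select_contexts(response, min_score=0.0):
    # Keep the stored text of every match whose score reaches the threshold
    return [
        match["metadata"]["text"]
        for match in response["matches"]
        if match["score"] >= min_score
    ]


# Hand-written stand-in for a query response; ids, scores and texts are invented
mock_response = {
    "matches": [
        {"id": "a", "score": 0.82, "metadata": {"text": "Order 10333 for Pirkko Koskitalo"}},
        {"id": "b", "score": 0.41, "metadata": {"text": "Order 10412 for Pirkko Koskitalo"}},
        {"id": "c", "score": 0.12, "metadata": {"text": "Unrelated inventory report"}},
    ]
}

filtered = select_contexts(mock_response, min_score=0.4)
print(filtered)
```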

In [78]:
contexts

["<Source>\n/content/CompanyDocuments/PurchaseOrders/purchase_orders_10333.pdf\n</Source>\n\n<Content>\nPurchase Orders\nOrder ID Order Date Customer Name\n10333 2016-10-18 Pirkko Koskitalo\nProducts\nProduct ID: Product: Quantity: Unit Price:\n14 Tofu 10 18.6\n21 Sir Rodney's Scones 10 8\n71 Flotemysost 40 17.2\nPage 1\n</Content>",
 '<Source>\n/content/CompanyDocuments/PurchaseOrders/purchase_orders_10583.pdf\n</Source>\n\n<Content>\nPurchase Orders\nOrder ID Order Date Customer Name\n10583 2017-06-30 Pirkko Koskitalo\nProducts\nProduct ID: Product: Quantity: Unit Price:\n29 Thüringer Rostbratwurst 10 123.79\n60 Camembert Pierrot 24 34\n69 Gudbrandsdalsost 10 36\nPage 1\n</Content>',
 '<Source>\n/content/CompanyDocuments/PurchaseOrders/purchase_orders_10412.pdf\n</Source>\n\n<Content>\nPurchase Orders\nOrder ID Order Date Customer Name\n10412 2017-01-13 Pirkko Koskitalo\nProducts\nProduct ID: Product: Quantity: Unit Price:\n14 Tofu 20 18.6\nPage 1\n</Content>',
 "<Source>\n/content/C

In [57]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

In [79]:
print(augmented_query)

<CONTEXT>
<Source>
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10333.pdf
</Source>

<Content>
Purchase Orders
Order ID Order Date Customer Name
10333 2016-10-18 Pirkko Koskitalo
Products
Product ID: Product: Quantity: Unit Price:
14 Tofu 10 18.6
21 Sir Rodney's Scones 10 8
71 Flotemysost 40 17.2
Page 1
</Content>

-------

<Source>
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10583.pdf
</Source>

<Content>
Purchase Orders
Order ID Order Date Customer Name
10583 2017-06-30 Pirkko Koskitalo
Products
Product ID: Product: Quantity: Unit Price:
29 Thüringer Rostbratwurst 10 123.79
60 Camembert Pierrot 24 34
69 Gudbrandsdalsost 10 36
Page 1
</Content>

-------

<Source>
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10412.pdf
</Source>

<Content>
Purchase Orders
Order ID Order Date Customer Name
10412 2017-01-13 Pirkko Koskitalo
Products
Product ID: Product: Quantity: Unit Price:
14 Tofu 20 18.6
Page 1
</Content>

-------

<Source>
/content/CompanyDocuments
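
The same prompt layout can be wrapped in a small function, which makes it easy to inspect with short placeholder strings. This is a sketch only: `build_augmented_query` is a name introduced here, and the two one-word contexts are stand-ins for retrieved documents.

```python
def build_augmented_query(contexts, query, max_contexts=10):
    # Join the retrieved texts with a visible separator, wrap them in
    # <CONTEXT> tags, and append the user's question at the end
    joined = "\n\n-------\n\n".join(contexts[:max_contexts])
    return "<CONTEXT>\n" + joined + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query


example = build_augmented_query(["first", "second"], "What was ordered?")
print(example)
```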

In [59]:
system_prompt = f"""You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.

Answer any questions I have, based on the data provided. Always consider all of the context provided when forming a response.
"""

llm_response = groq_client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": augmented_query}
    ]
)

response = llm_response.choices[0].message.content

In [60]:
print(response)

Based on the purchase order data provided, we can see that Pirkko Koskitalo is a frequent customer who has been placing orders for various products over a period of time. To identify potential items that Pirkko Koskitalo may buy next, we can look at the following:

1. **Recurring products:** We can see that Pirkko Koskitalo has been ordering Flotemysost (Product ID: 71) multiple times (Order IDs: 10320, 10333, 10455). This suggests a consistent demand for this product, and it's likely they may order it again.
2. **Similar product categories:** Perth Pasties (Product ID: 53) and Gnocchi di nonna Alice (Product ID: 56) have also been ordered multiple times (Order IDs: 10416, 10437, 10526, 10781). This indicates a preference for certain types of products, and Pirkko Koskitalo may be interested in other similar products.
3. **Recent purchase history:** Looking at the most recent orders, we can see that Pirkko Koskitalo has been ordering a mix of products, including Chai, Konbu, Tourtière, 

# Putting it all together

In [64]:
def perform_rag(query):
    raw_query_embedding = get_huggingface_embeddings(query)

    query_embedding = np.array(raw_query_embedding)

    top_matches = pinecone_index.query(vector=query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = f"""You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.

    Answer any questions I have, based on the data provided. Always consider all parts of the context provided when forming a response.
    """

    res = groq_client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return res.choices[0].message.content

In [70]:
# If you have access to the o1 model through the OpenAI API, you can use this function to compare the quality of responses.
# It needs openai_client, which is commented out in the setup cell near the top of the notebook.
def perform_rag_openai(query):
    raw_query_embedding = get_huggingface_embeddings(query)

    query_embedding = np.array(raw_query_embedding)

    top_matches = pinecone_index.query(vector=query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = f"""You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.

    Answer any questions I have, based on the data provided. Always consider all parts of the context provided when forming a response.
    """

    res = openai_client.chat.completions.create(
        model="o1-preview",
        messages=[
            # The instructions go in the user message rather than a separate system message
            {"role": "user", "content": f"{system_prompt} {augmented_query}"}
        ]
    )

    return res.choices[0].message.content
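
The two functions above differ only in the final model call, so one possible refactor is to pass the retrieval and generation steps in as functions. The sketch below illustrates the idea with stubs so that it runs without Pinecone or an API key. `run_rag`, `fake_retrieve`, and `fake_generate` are hypothetical names, not part of the notebook.

```python
def run_rag(query, retrieve, generate, top_k=10):
    # retrieve(query, top_k) returns a list of context strings;
    # generate(prompt) returns the model's answer as a string
    contexts = retrieve(query, top_k)
    prompt = (
        "<CONTEXT>\n"
        + "\n\n-------\n\n".join(contexts)
        + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n"
        + query
    )
    return generate(prompt)


def fake_retrieve(query, top_k):
    # Stub: a real version would embed the query and search the vector index
    return ["doc one", "doc two"][:top_k]


def fake_generate(prompt):
    # Stub: a real version would call the Groq or OpenAI chat completions API
    return f"(stub answer for a prompt of {len(prompt)} characters)"


print(run_rag("What was ordered?", fake_retrieve, fake_generate))
```

With this shape, switching models means supplying a different `generate` function while the retrieval and prompt code stays in one place.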

In [65]:
response = perform_rag("What are some trends with Ricardo Adocicados purchase orders?")

print(response)

Based on the provided data, I can identify some trends in Ricardo Adocicados' purchase orders. Here are some observations:

1. **Shipper preference**: Ricardo Adocicados seems to have a preference for using Speedy Express (Shipper ID: 1) as their primary shipper. Out of the 11 order documents provided, 3 of them (orders 10813, 10851, and 10877) were shipped via Speedy Express.

However, it's also evident that United Package (Shipper ID: 2) was a close second, with a total of 5 orders (orders 10563, 10648, 10299, 10447, and 10481).

Federal Shipping (Shipper ID: 3) only appeared in 2 orders (orders 10622 and 10287). This might suggest that Ricardo Adocicados primarily works with Speedy Express and United Package for their shipping needs.

2. **Product consistency**: Ricardo Adocicados frequently ordered certain products from different suppliers. Some of these products include:
   - Chang: This product was ordered in orders 10813, 10622, and 10851.
   - Spegesild: This product was ordere

In [71]:
response = perform_rag_openai("What are some trends with Ricardo Adocicados purchase orders?")

print(response)



Based on the provided shipping order data for Ricardo Adocicados, several noteworthy trends emerge regarding their purchasing behavior between 2016 and early 2018:

1. **Shift in Shipping Preferences**:
   - **2016 to 2017**: The company predominantly used **United Package** (Shipper ID 2) and **Federal Shipping** (Shipper ID 3).
     - *United Package* was used for shipments on:
       - September 6, 2016
       - February 14, 2017
       - March 20, 2017
       - June 10, 2017
       - August 28, 2017
     - *Federal Shipping* was chosen on:
       - August 22, 2016
       - August 6, 2017
   - **2018 Onwards**: A noticeable shift to **Speedy Express** (Shipper ID 1) is evident.
     - *Speedy Express* was used for all recorded shipments in 2018:
       - January 5, 2018
       - January 26, 2018
       - February 9, 2018
   - **Implication**: This change may indicate a new partnership, better rates, improved service quality, or a strategic decision to consolidate shipping providers.