![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter RAG Workshop

- Follow along with the [Google Doc here](https://docs.google.com/document/d/1RF-_JdPRMKL7JQgKa5R54L9LtNEofuPZJ1SX31d2Xik/edit?usp=sharing)

- **Skills: HuggingFace, LangChain, Pinecone**




### What is RAG anyway?


![withoutRAG](https://github.com/user-attachments/assets/649d6101-b63a-4750-997a-b6abc25e5609)

![withRAG](https://github.com/user-attachments/assets/e6dd9c46-0bf9-4c31-bd72-a27939ef82b8)

Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of LLM-generated text. It combines two key processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system uses it to help generate a response. This is where a model such as GPT creates new text (answers, explanations, etc.) based on the retrieved information.
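
The two steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the knowledge base sentences are made up, word overlap stands in for a real embedding search, and the "generation" step stops at building the prompt an LLM would receive. The names `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of any library used in this workshop.

```python
import re

# Toy knowledge base (made-up sentences, for illustration only)
KNOWLEDGE_BASE = [
    "Pinecone is a vector database used to store embeddings.",
    "LangChain provides document loaders and text splitters.",
    "Cosine similarity measures the angle between two vectors.",
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Retrieval step: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, contexts):
    """Generation step input: retrieved text is placed ahead of the question."""
    return "Context:\n" + "\n".join(contexts) + "\n\nQuestion: " + query


question = "What is a vector database?"
top_docs = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In the rest of this workshop, the word-overlap ranking is replaced by embeddings stored in Pinecone, and the prompt is sent to an LLM.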

# Install libraries

In [1]:
! pip install langchain langchain-community openai groq tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference sentence-transformers

Collecting langchain-community
  Downloading langchain_community-0.3.8-py3-none-any.whl.metadata (2.9 kB)
Collecting groq
  Downloading groq-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting unstructured
  Downloading unstructured-0.16.8-py3-none-any.whl.metadata (24 kB)
Collecting pdfminer==20191125
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
Collecting pdfminer.six==20221105
  Downloading pdfminer.six-20221105-py3-none-any.whl.metadata (4.0 kB)
Collecting pillow_heif
  Downloading p

In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os
from groq import Groq



# Initialize the HuggingFace Embeddings client

In [3]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


In [4]:
text = "Hello my name is Faizan"

query_result = embeddings.embed_query(text)

In [5]:
len(query_result)

768

# Calculating sentence similarity with embeddings

In [6]:
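# Cache loaded models so repeated calls do not reload the weights each time
_loaded_models = {}


def get_huggingface_embeddings(text, model_name="sentence-transformers/all-mpnet-base-v2"):
    if model_name not in _loaded_models:
        _loaded_models[model_name] = SentenceTransformer(model_name)
    return _loaded_models[model_name].encode(text)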


def cosine_similarity_between_sentences(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_huggingface_embeddings(sentence1))
    embedding2 = np.array(get_huggingface_embeddings(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like walking to the park"
sentence2 = "I like walking to the office"


similarity = cosine_similarity_between_sentences(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")

Embedding for Sentence 1: [[-5.18317223e-02  5.11822924e-02  1.72791979e-03 -1.36199202e-02
  -1.06869487e-03  2.96393428e-02 -4.72495109e-02 -2.11421009e-02
   5.48422784e-02  2.37766728e-02 -8.88856407e-03  1.03983462e-01
   1.87567491e-02 -6.70846319e-03 -3.84319052e-02 -7.80754834e-02
  -5.44624683e-03  6.69373479e-03 -1.80737115e-02  3.50141115e-02
  -3.07590049e-02  3.44667286e-02 -5.48802782e-03 -2.29204204e-02
   9.91370343e-03 -1.50746480e-02  1.37100741e-02 -3.11791096e-02
   7.79691711e-02  3.52224708e-02 -1.94613449e-02 -1.78903583e-02
   2.13377569e-02 -1.85624994e-02  1.29274099e-06  7.14496849e-03
  -7.68434315e-04  1.04230363e-02  3.67814861e-02 -3.46986540e-02
   3.50453630e-02  1.30667230e-02  1.00722872e-02 -4.18642862e-03
   2.04598345e-02 -2.74207480e-02  3.01958937e-02  2.14188918e-02
  -6.43193796e-02  1.04757305e-02 -4.66440478e-03 -4.05048616e-02
  -5.80140166e-02  1.99005734e-02 -2.49033840e-03  8.85135308e-02
   6.04227521e-02  1.96583439e-02  5.06717786e-02 
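
To make the similarity score less of a black box, here is a minimal sketch of the cosine similarity formula itself, written with only the standard library. The function name `cosine_similarity_manual` and the short example vectors are assumptions for illustration; the workshop code above uses `sklearn`'s `cosine_similarity` on 768-dimensional embeddings instead.

```python
import math


def cosine_similarity_manual(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity_manual([1.0, 0.0], [1.0, 0.0])
perpendicular = cosine_similarity_manual([1.0, 0.0], [0.0, 1.0])
close = cosine_similarity_manual([3.0, 4.0], [4.0, 3.0])

print(same_direction, perpendicular, close)
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and `[3, 4]` against `[4, 3]` gives 24 / (5 * 5) = 0.96. Sentences with similar meaning should produce embeddings that score closer to 1.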

# Load in the Data

Learn more about the dataset [here](https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset)

In [7]:
! kaggle datasets download -d ayoubcherguelaine/company-documents-dataset
! unzip company-documents-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
License(s): apache-2.0
Downloading company-documents-dataset.zip to /content
 96% 9.00M/9.34M [00:00<00:00, 88.4MB/s]
100% 9.34M/9.34M [00:00<00:00, 90.0MB/s]
Archive:  company-documents-dataset.zip
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_1.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_2.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_3.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_4.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_5.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_6.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Cat

In [8]:
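def process_directory(directory_path):
    """Walk directory_path and load every PDF found into LangChain Documents."""
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:
            # Skip anything that is not a PDF, since PyPDFLoader cannot parse it
            if not file.lower().endswith(".pdf"):
                continue

            file_path = os.path.join(root, file)
            print(f"Processing file: {file_path}")
            loader = PyPDFLoader(file_path)
            # loader.load() returns one Document per page
            data.append({"File": file_path, "Data": loader.load()})

    return data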

directory_path = "/content/CompanyDocuments"
documents = process_directory(directory_path)


Processing file: /content/CompanyDocuments/Shipping orders/order_10640.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10466.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10712.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10510.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10459.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10716.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_11066.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_11013.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10674.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10746.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10355.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10552.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10727.pdf
Processing file: /content

In [9]:
documents

[{'File': '/content/CompanyDocuments/Shipping orders/order_10640.pdf',
  'Data': [Document(metadata={'source': '/content/CompanyDocuments/Shipping orders/order_10640.pdf', 'page': 0}, page_content='Order ID: 10640\nShipping Details:\nShip Name: Die Wandernde Kuh\nShip Address: Adenauerallee 900\nShip City: Stuttgart\nShip Region: Western Europe\nShip Postal Code: 70563\nShip Country: Germany\nCustomer Details:\nCustomer ID: WANDK\nCustomer Name: Die Wandernde Kuh\nEmployee Details:\nEmployee Name: Margaret Peacock\nShipper Details:\nShipper ID: 1\nShipper Name: Speedy Express\nOrder Details:\nOrder Date: 2017-08-21\nShipped Date: 2017-08-28\nProducts:\n--------------------------------------------------------------------------------------------------\nProduct: Gudbrandsdalsost\nQuantity: 20\nUnit Price: 36.0\nTotal: 720.0\n--------------------------------------------------------------------------------------------------\nProduct: Outback Lager\nQuantity: 15\nUnit Price: 15.0\nTotal: 225

# Setting up Pinecone
**1. Create an account on [Pinecone.io](https://app.pinecone.io/)**

**2. Create a new index called "rag-workshop" and set the dimensions to 768. Leave the rest of the settings as they are.**

![Screenshot 2024-11-28 at 12 01 30 AM](https://github.com/user-attachments/assets/548657af-ad75-4767-9bcf-41998e01a33e)


**3. Create an API Key for Pinecone**

![Screenshot 2024-11-24 at 10 44 37 PM](https://github.com/user-attachments/assets/e7feacc6-2bd1-472a-82e5-659f65624a88)


**4. Store your Pinecone API Key within Google Colab's secrets section, and then enable access to it (see the blue checkmark)**


![Screenshot 2024-11-24 at 10 45 25 PM](https://github.com/user-attachments/assets/eaf73083-0b5f-4d17-9e0c-eab84f91b0bc)
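
The index dimension in step 2 is 768 because that is the length of the vectors produced by `all-mpnet-base-v2` (see `len(query_result)` earlier). Pinecone rejects vectors whose length differs from the index dimension, so a small guard like the one below can catch a mismatch before any upload. `check_dimension` and `EXPECTED_DIMENSION` are hypothetical names used only for this sketch.

```python
EXPECTED_DIMENSION = 768  # assumed to match the "dimensions" setting of the index


def check_dimension(vector, expected=EXPECTED_DIMENSION):
    """Raise ValueError if the vector length does not match the index dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, but the index expects {expected}"
        )
    return True


# A vector of the right length passes the check
print(check_dimension([0.0] * 768))
```

If you switch to a different embedding model, create the index with that model's output size instead.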




In [12]:
pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

index_name = "rag-workshop"

namespace = "company-documents"

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Insert Data into Pinecone

In [13]:
for document in documents:
    print(document['File'])
    print(document['Data'])
    print("\n")

Streaming output truncated to the last 5000 lines.
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10405.pdf
[Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10405.pdf', 'page': 0}, page_content='Purchase Orders\nOrder ID Order Date Customer Name\n10405 2017-01-06 Felipe Izquierdo\nProducts\nProduct ID: Product: Quantity: Unit Price:\n3 Aniseed Syrup 50 8\nPage 1')]


/content/CompanyDocuments/PurchaseOrders/purchase_orders_10443.pdf
[Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10443.pdf', 'page': 0}, page_content='Purchase Orders\nOrder ID Order Date Customer Name\n10443 2017-02-12 Maurizio Moroni\nProducts\nProduct ID: Product: Quantity: Unit Price:\n11 Queso Cabrales 6 16.8\n28 Rössle Sauerkraut 12 36.4\nPage 1')]


/content/CompanyDocuments/PurchaseOrders/purchase_orders_11008.pdf
[Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_11008.pdf',

In [14]:
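document_data = []

for document in documents:
    document_source = document['File']
    # Only the first page of each PDF is used here; later pages are not indexed
    document_content = document['Data'][0].page_content

    # Prefix the text with its source path so the path is embedded with the content
    doc = Document(
        metadata={"source": document_source},
        page_content=f"Source: {document_source}\n{document_content}"
    )

    document_data.append(doc)

    print(doc)
    print("\n")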


Streaming output truncated to the last 5000 lines.
City: Lisboa
Postal Code: 1675
Country: Portugal
Phone: (1) 354-2534
Fax: (1) 354-2535
Product Details:
Product ID Product Name Quantity Unit Price
59 Raclette Courdavault 9 44.0
65 Louisiana Fiery Hot Pepper Sauce 40 16.8
68 Scottish Longbreads 10 10.0
  TotalPrice 1168.0
Page 1' metadata={'source': '/content/CompanyDocuments/invoices/invoice_10328.pdf'}


page_content='Source: /content/CompanyDocuments/invoices/invoice_10479.pdf
Invoice
Order ID: 10479
Customer ID: RATTC
Order Date: 2017-03-19
Customer Details:
Contact Name: Paula Wilson
Address: 2817 Milton Dr.
City: Albuquerque
Postal Code: 87110
Country: USA
Phone: (505) 555-5939
Fax: (505) 555-3620
Product Details:
Product ID Product Name Quantity Unit Price
38 Côte de Blaye 30 210.8
53 Perth Pasties 28 26.2
59 Raclette Courdavault 60 44.0
64 Wimmers gute Semmelknödel 30 26.6
  TotalPrice 10495.6
Page 1' metadata={'source': '/content/CompanyDocuments/invoices/invoic

In [15]:
document_data

[Document(metadata={'source': '/content/CompanyDocuments/Shipping orders/order_10640.pdf'}, page_content='Source: /content/CompanyDocuments/Shipping orders/order_10640.pdf\nOrder ID: 10640\nShipping Details:\nShip Name: Die Wandernde Kuh\nShip Address: Adenauerallee 900\nShip City: Stuttgart\nShip Region: Western Europe\nShip Postal Code: 70563\nShip Country: Germany\nCustomer Details:\nCustomer ID: WANDK\nCustomer Name: Die Wandernde Kuh\nEmployee Details:\nEmployee Name: Margaret Peacock\nShipper Details:\nShipper ID: 1\nShipper Name: Speedy Express\nOrder Details:\nOrder Date: 2017-08-21\nShipped Date: 2017-08-28\nProducts:\n--------------------------------------------------------------------------------------------------\nProduct: Gudbrandsdalsost\nQuantity: 20\nUnit Price: 36.0\nTotal: 720.0\n--------------------------------------------------------------------------------------------------\nProduct: Outback Lager\nQuantity: 15\nUnit Price: 15.0\nTotal: 225.0\nTotal Price:\n'),
 Do

In [None]:
for idx, document in enumerate(document_data):
    print("Processing document:", idx)
    vectorstore_from_documents = PineconeVectorStore.from_documents(
        [document],
        embeddings,
        index_name=index_name,
        namespace=namespace
    )
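
The loop above sends one document per call, which means one round trip per document. Since `from_documents` takes a list, a possible alternative is to group the documents into batches first. The `batched` helper and the batch size of 50 below are assumptions for illustration, not something the original notebook uses.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Small demonstration with plain integers
example_batches = list(batched(list(range(5)), 2))
print(example_batches)

# Sketch of how it could be used with the objects defined in this notebook:
# for batch in batched(document_data, 50):
#     PineconeVectorStore.from_documents(
#         batch,
#         embeddings,
#         index_name=index_name,
#         namespace=namespace
#     )
```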

# Initialize the Groq client

1. Get your Groq API Key [here](https://console.groq.com/keys)

2. Paste your Groq API Key into your Google Colab secrets, and make sure to enable permissions for it

![Screenshot 2024-11-25 at 12 00 16 AM](https://github.com/user-attachments/assets/e5525d29-bca6-4dbd-892b-cc770a6b281d)

In [27]:
groq_api_key = userdata.get("GROQ_API_KEY")
os.environ['GROQ_API_KEY'] = groq_api_key

groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Perform RAG

In [20]:
# Initialize Pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))

# Connect to your Pinecone index
pinecone_index = pc.Index(index_name)

In [21]:
query = "What are some items that Pirkko Koskitalo is likely to buy next? What incentives can I put in place to ensure he orders more?"

In [22]:
raw_query_embedding = get_huggingface_embeddings(query)

In [23]:
top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

In [24]:
contexts = [item['metadata']['text'] for item in top_matches['matches']]

In [25]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
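
To see exactly what layout this string concatenation produces, the same logic can be wrapped in a small function and run on placeholder text. `build_augmented_query` and its `max_contexts` parameter are hypothetical names for this sketch; the separators are copied from the line above.

```python
def build_augmented_query(contexts, query, max_contexts=10):
    """Wrap the retrieved contexts in <CONTEXT> tags and append the question."""
    joined = "\n\n-------\n\n".join(contexts[:max_contexts])
    return (
        "<CONTEXT>\n"
        + joined
        + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n"
        + query
    )


example = build_augmented_query(["first document", "second document"], "What changed?")
print(example)
```

Each retrieved document is separated by a dashed line, and the question always comes last so the model reads the context first.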

In [31]:
print(augmented_query)

<CONTEXT>
Source: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10553.pdf
Purchase Orders
Order ID Order Date Customer Name
10553 2017-05-30 Pirkko Koskitalo
Products
Product ID: Product: Quantity: Unit Price:
11 Queso Cabrales 15 21
16 Pavlova 14 17.45
22 Gustaf's Knäckebröd 24 21
31 Gorgonzola Telino 30 12.5
35 Steeleye Stout 6 18
Page 1

-------

Source: /content/CompanyDocuments/PurchaseOrders/purchase_orders_11025.pdf
Purchase Orders
Order ID Order Date Customer Name
11025 2018-04-15 Pirkko Koskitalo
Products
Product ID: Product: Quantity: Unit Price:
1 Chai 10 18
13 Konbu 20 6
Page 1

-------

Source: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10526.pdf
Purchase Orders
Order ID Order Date Customer Name
10526 2017-05-05 Pirkko Koskitalo
Products
Product ID: Product: Quantity: Unit Price:
1 Chai 8 18
13 Konbu 10 6
56 Gnocchi di nonna Alice 30 38
Page 1

-------

Source: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10455.pdf
Purchase Orders
Order

In [28]:
system_prompt = """You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.

Answer any questions I have, based on the data provided. Always consider all of the context provided when forming a response.
"""

llm_response = groq_client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": augmented_query}
    ]
)

response = llm_response.choices[0].message.content

In [29]:
print(response)

Based on the provided purchase order data, it appears that Pirkko Koskitalo is a frequent customer who places orders with a regular pattern throughout the year. By analyzing the purchase history, we can identify some item categories and specific products that Pirkko is likely to buy in the future.

**Items likely to be purchased next:**

1. **Perth Pasties (Product ID: 53)**: Pirkko has purchased this product multiple times in different orders (10416, 10437, and 10455), indicating a regular demand for this item.
2. **Chai (Product ID: 1)**: This product has been ordered in two separate instances (10526 and 11025), suggesting a possible re-order pattern.
3. **Flotemysost (Product ID: 71)**: Pirkko has purchased this item twice (10333 and 10455), which may indicate a steady demand for this product.
4. **Gnocchi di nonna Alice (Product ID: 56)**: Although this product has been ordered only twice (10526 and 10781), its relatively high unit price and quantity suggest it could be a regular p

# Putting it all together

In [35]:
def perform_rag(query):
    raw_query_embedding = get_huggingface_embeddings(query)

    query_embedding = np.array(raw_query_embedding)

    top_matches = pinecone_index.query(vector=query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = """You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.

    Answer any questions I have, based on the data provided. Always consider all parts of the context provided when forming a response.
    """

    res = groq_client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return res.choices[0].message.content

In [36]:
response = perform_rag("What are some trends with Ricardo Adocicados purchase orders?")

print(response)

There is no purchase order data available for Ricardo Adocicados in the provided context. The context only includes purchase order data for customers such as Mario Pontes, Carlos González, André Fonseca, Guillermo Fernández, Antonio Moreno, and Roland Mendel.
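
The answer above suggests the ten retrieved documents belonged to other customers, so the model had nothing relevant to work with. One possible mitigation is to drop low-scoring matches before building the prompt, since each match returned by the Pinecone query carries a similarity `score`. The sketch below runs on a made-up response dictionary; `filter_matches` and the `min_score` threshold of 0.5 are assumptions that would need tuning on real data.

```python
def filter_matches(query_response, min_score=0.5):
    """Keep only the text of matches whose similarity score reaches min_score."""
    return [
        match["metadata"]["text"]
        for match in query_response["matches"]
        if match["score"] >= min_score
    ]


# Made-up response in the same shape the notebook reads from (matches -> metadata -> text)
sample_response = {
    "matches": [
        {"score": 0.82, "metadata": {"text": "relevant purchase order"}},
        {"score": 0.31, "metadata": {"text": "unrelated invoice"}},
    ]
}

kept = filter_matches(sample_response)
print(kept)
```

If the filtered list comes back empty, the application can tell the user that nothing relevant was found instead of sending unrelated context to the LLM.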
