<a href="https://colab.research.google.com/github/tj-guruvelli/HeadstarterRAGLab/blob/main/Headstarter_RAG_Workshop_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter RAG Workshop

**Skills: HuggingFace, LangChain, Pinecone**

**Other Resources:**
- [Get your Groq API Key](https://console.groq.com/keys)
- [Get your Pinecone API Key](https://www.pinecone.io/)


### What is RAG anyway?


![withoutRAG](https://github.com/user-attachments/assets/649d6101-b63a-4750-997a-b6abc25e5609)

![withRAG](https://github.com/user-attachments/assets/e6dd9c46-0bf9-4c31-bd72-a27939ef82b8)

Retrieval-Augmented Generation (RAG) is a technique primarily used in GenAI applications to improve the quality and accuracy of generated text by LLMs by combining two key processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system then uses it to help generate a response. This is where the model, like GPT, creates new text (answers, explanations, etc.) based on the retrieved information.

#### Using PINECONE and GROQ as open-source tools for demonstration and accessibility

# Install libraries

In [1]:
! pip install langchain langchain-community openai groq tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference sentence-transformers



In [2]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os
from groq import Groq

pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

# openai_api_key = userdata.get("OPENAI_API_KEY")
# os.environ['OPENAI_API_KEY'] = openai_api_key

groq_api_key = userdata.get("GROQ_API_KEY")
os.environ['GROQ_API_KEY'] = groq_api_key



# Initialize the HuggingFace Embeddings client

- Can use OpenAI Embedding model, or others like NVIDIA if you want to experiment other than HuggingFace.
- Converts text into numbers Retains its semantic meaning

In [12]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



In [11]:

text = "Hello my name is Teja"

query_result = embeddings.embed_query(text)

Here is how the query looks like after getting converted

In [13]:
query_result

[-0.07406413555145264,
 0.012264702469110489,
 0.025599660351872444,
 0.04017011076211929,
 -0.09220544248819351,
 -0.08846322447061539,
 0.15335537493228912,
 0.002915057586506009,
 0.025351107120513916,
 0.024370102211833,
 0.010215126909315586,
 -0.09724365919828415,
 0.026195887476205826,
 -0.02894514426589012,
 0.037661343812942505,
 -0.050083551555871964,
 0.07666006684303284,
 -0.010352257639169693,
 0.012071364559233189,
 -0.04377083480358124,
 -0.05398358404636383,
 0.023461850360035896,
 -0.04079563543200493,
 -0.029599731788039207,
 -0.11574237048625946,
 -0.016673732548952103,
 0.022969799116253853,
 0.08957386761903763,
 -0.006595910061150789,
 -0.032935887575149536,
 -0.022078178822994232,
 0.024496393278241158,
 0.05894646421074867,
 0.020998159423470497,
 -0.04572267457842827,
 0.06009586900472641,
 -0.13809846341609955,
 -0.037111781537532806,
 -0.056543927639722824,
 0.012932652607560158,
 -0.03299641236662865,
 -0.06587936729192734,
 -0.0696813240647316,
 -0.02247665

In [5]:
len(query_result)

384

# Initialize the Groq client

In [6]:
# Free Llama 3.1 API via Groq

groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Calculating sentence similarity with embeddings

- Only need to be pulling from relevant documents and not all the documents or info that an company might have on internal data, SO we need a way to define WHICH documents are relevant
- The sentences produced here are basically the sentences in matrix form like in linear algebra

In [7]:
def get_huggingface_embeddings(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    return model.encode(text)


def cosine_similarity_between_sentences(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_huggingface_embeddings(sentence1))
    embedding2 = np.array(get_huggingface_embeddings(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like walking to the park"
sentence2 = "I like running to the office"


similarity = cosine_similarity_between_sentences(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")

Embedding for Sentence 1: [[-7.94643245e-04 -4.52190675e-02  5.60034849e-02  4.00062427e-02
   7.82356560e-02 -3.10017494e-03  1.56902850e-01 -1.61643466e-03
   8.40177163e-02  7.29586780e-02 -2.27428079e-02 -1.00336457e-02
  -4.77766395e-02  5.78006804e-02  6.89262897e-02  2.29864451e-03
   3.41052301e-02  8.23903084e-02 -4.47455328e-03  1.18202902e-02
  -7.44136199e-02  2.10828688e-02  1.92200597e-02  5.48400506e-02
  -1.07110791e-01  8.79156739e-02 -1.64800640e-02  6.51677838e-03
  -6.67075947e-05 -4.27565910e-03 -8.20703283e-02  7.05852807e-02
  -1.80556215e-02  3.27348337e-02 -4.36549708e-02  9.93788615e-03
   5.78057729e-02 -6.92315474e-02  4.53141518e-02  4.96660061e-02
  -1.49475960e-02  5.79100400e-02  8.14057887e-02  2.62878463e-03
  -1.49136772e-02 -4.37886156e-02  2.26742886e-02 -3.19028087e-02
   1.00592643e-01  3.10834609e-02  1.30596489e-01  7.27661699e-03
   8.58718809e-03  7.95199908e-03 -7.91897438e-03  4.98278998e-03
  -8.22420716e-02  2.46388949e-02  5.11084795e-02 

# Load in the Data

Learn more about the dataset [here](https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset)

In [8]:
! kaggle datasets download -d ayoubcherguelaine/company-documents-dataset
! unzip company-documents-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
License(s): apache-2.0
company-documents-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  company-documents-dataset.zip
replace CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_1.pdf? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

#### Processing each file and EXTRACTING THE TEXT that was processed

In [9]:
def process_directory(directory_path):
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:

            file_path = os.path.join(root, file)
            print(f"Processing file: {file_path}")
            loader = PyPDFLoader(file_path)
            data.append({"File": file_path, "Data": loader.load()})

    return data

directory_path = "/content/CompanyDocuments"
documents = process_directory(directory_path)


Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10284.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10889.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10402.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10468.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10690.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10727.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10396.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10835.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10596.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_11037.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_11041.pdf
Processing file: /content/CompanyDocuments/PurchaseOrders/purchase_orders_10

In [14]:
documents

[{'File': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10284.pdf',
  'Data': [Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10284.pdf', 'page': 0}, page_content='Purchase Orders\nOrder ID Order Date Customer Name\n10284 2016-08-19 Renate Messner\nProducts\nProduct ID: Product: Quantity: Unit Price:\n27 Schoggi Schokolade 15 35.1\n44 Gula Malacca 21 15.5\n60 Camembert Pierrot 20 27.2\n67 Laughing Lumberjack Lager 5 11.2\nPage 1')]},
 {'File': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10889.pdf',
  'Data': [Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10889.pdf', 'page': 0}, page_content='Purchase Orders\nOrder ID Order Date Customer Name\n10889 2018-02-16 Paula Wilson\nProducts\nProduct ID: Product: Quantity: Unit Price:\n11 Queso Cabrales 40 21\n38 Côte de Blaye 40 263.5\nPage 1')]},
 {'File': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10402.pdf',
  'Data': [Do

# Initialize Pinecone

- It takes a few minutes to intialize
- Way to parition data together
- Split different comapnies into different NAMESPACES

In [15]:
# Make sure to create a Pinecone index with 384 dimensions
# USE 384 for the length of the EMBEDDING model when defining a index in a vector database

index_name = "rag-workshop"

namespace = "company-documents"

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Insert data into Pinecone

- This prints out the documents strutcures
- 'File' is the path to data files
- 'Data' is the actual content in the file
- Way to represent pdfs and extracted text

In [16]:
for document in documents:
    print(document['File'], document['Data'])

/content/CompanyDocuments/PurchaseOrders/purchase_orders_10284.pdf [Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10284.pdf', 'page': 0}, page_content='Purchase Orders\nOrder ID Order Date Customer Name\n10284 2016-08-19 Renate Messner\nProducts\nProduct ID: Product: Quantity: Unit Price:\n27 Schoggi Schokolade 15 35.1\n44 Gula Malacca 21 15.5\n60 Camembert Pierrot 20 27.2\n67 Laughing Lumberjack Lager 5 11.2\nPage 1')]
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10889.pdf [Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10889.pdf', 'page': 0}, page_content='Purchase Orders\nOrder ID Order Date Customer Name\n10889 2018-02-16 Paula Wilson\nProducts\nProduct ID: Product: Quantity: Unit Price:\n11 Queso Cabrales 40 21\n38 Côte de Blaye 40 263.5\nPage 1')]
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10402.pdf [Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purc

In [10]:
### UNCOMMENT TO VIEW THE DATA BEING STRUCTURED into PINECONE

# document_data = []
# for document in documents:
#     # need to get the file path and access metadata and then the path
#     document_source = document['Data'][0].metadata['source']
#     document_content = document['Data'][0].page_content

#     # Printing out the contents to visual how the data is getting stored in PINECONE
#     print("DOCUMENT SOURCE:", document_source)
#     print("DOCUMENT CONTENT:", document_content)

#     doc = Document(
#         page_content= f"",
#     )

In [17]:
 # Prepare the text for EMBEDDING
document_data = []
for document in documents:

    # need to get the file path and access metadata and then the path
    document_source = document['Data'][0].metadata['source']
    document_content = document['Data'][0].page_content
    # IN ORDER To insert into PINECONE need to structure the METADATA more refined
    file_name = document_source.split("/")[-1]
    # Splitting the file path name to get the last bit of the string
    folder_names = document_source.split("/")[2:-1]

   # Document class is from Langchain
   # when you insert data into pinecone there make sure its optimized for the LLM
   # Using XML tags helps LLMS track properties into a prompt; know when the content starts and ends to strutcure your code better
    doc = Document(
       page_content=  f"\n{document_source}\n\n\n\n{document_content}\n",
       metadata={
           "file_name": file_name,
           "parent_folder": folder_names[-1],
           "folder_names": folder_names
       }

   )

document_data.append(doc)

In [18]:
document_data


[Document(metadata={'file_name': 'StockReport_2018-02.pdf', 'parent_folder': 'monthly', 'folder_names': ['CompanyDocuments', 'Inventory Report', 'monthly', 'monthly']}, page_content="\n/content/CompanyDocuments/Inventory Report/monthly/monthly/StockReport_2018-02.pdf\n\n\n\nStock Report for 2018-02\nCategory\nProduct\nUnits Sold\nUnits in Stock\nUnit Price\nBeverages\nChai\n90\n39\n18\nBeverages\nChang\n61\n17\n19\nBeverages\nChartreuse verte\n133\n69\n18\nBeverages\nCôte de Blaye\n100\n17\n263.5\nBeverages\nGuaraná Fantástica\n146\n20\n4.5\nBeverages\nLakkalikööri\n10\n57\n18\nBeverages\nLaughing Lumberjack..\n30\n52\n14\nBeverages\nOutback Lager\n83\n15\n15\nBeverages\nRhönbräu Klosterbier\n134\n125\n7.75\nBeverages\nSasquatch Ale\n10\n111\n14\nBeverages\nSteeleye Stout\n37\n20\n18\nCondiments\nChef Anton's Cajun..\n30\n53\n22\nCondiments\nGrandma's Boysenberry..\n50\n120\n25\nCondiments\nLouisiana Fiery Hot..\n64\n76\n21.05\nCondiments\nNorthwoods Cranberry..\n30\n6\n40\nCondiments\

In [19]:
# Insert documents into Pinecone
# Usually takes 4 min running first time
# All the files 2000+ are being processed and being converted to an embedding
# and each file takes about 1 to 3 secd
vectorstore_from_documents = PineconeVectorStore.from_documents(
    document_data,
    embeddings,
    index_name=index_name,
    namespace=namespace
)

# Perform RAG

# Putting it all together