<a href="https://colab.research.google.com/github/tj-guruvelli/HeadstarterRAGLab/blob/main/Headstarter_RAG_Workshop_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter RAG Workshop

**Skills: HuggingFace, LangChain, Pinecone**

**Other Resources:**
- [Get your Groq API Key](https://console.groq.com/keys)
- [Get your Pinecone API Key](https://www.pinecone.io/)


### What is RAG anyway?


![withoutRAG](https://github.com/user-attachments/assets/649d6101-b63a-4750-997a-b6abc25e5609)

![withRAG](https://github.com/user-attachments/assets/e6dd9c46-0bf9-4c31-bd72-a27939ef82b8)

Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of LLM-generated text by combining two key processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system then uses it to help generate a response. This is where the model, like GPT, creates new text (answers, explanations, etc.) based on the retrieved information.
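
The two steps above can be sketched with a toy example. This is only an illustration, not how the workshop retrieves documents: it ranks documents by word overlap instead of embeddings, and the names `tokenize`, `retrieve`, `build_prompt`, and the sample `knowledge_base` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, knowledge_base, top_k=1):
    """Retrieval step: rank documents by how many words they share with the query.

    Word overlap is a stand-in for embedding similarity (an assumption made
    to keep this sketch dependency-free).
    """
    query_words = tokenize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, retrieved_docs):
    """Generation step input: put the retrieved text in front of the question."""
    context = "\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base, for illustration only
knowledge_base = [
    "The stock report for July lists 120 units of product A.",
    "The office is closed on public holidays.",
    "Invoices are due within 30 days of the order date.",
]

query = "When are invoices due?"
top_docs = retrieve(query, knowledge_base)
prompt = build_prompt(query, top_docs)
print(prompt)
```

The resulting `prompt` is what would be sent to the LLM, so the model answers from the retrieved text rather than from memory alone.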


### Using Pinecone and Groq, which both offer free tiers, for demonstration and accessibility

# Install libraries

In [46]:
! pip install langchain langchain-community openai groq tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference sentence-transformers



In [47]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os
from groq import Groq

pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

# openai_api_key = userdata.get("OPENAI_API_KEY")
# os.environ['OPENAI_API_KEY'] = openai_api_key

groq_api_key = userdata.get("GROQ_API_KEY")
os.environ['GROQ_API_KEY'] = groq_api_key

You can also use an OpenAI embedding model, or others such as NVIDIA's, if you want to experiment with something other than HuggingFace.

An embedding model:
- Converts text into numbers
- Retains the text's semantic meaning



# Initialize the HuggingFace Embeddings client

In [48]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



In [49]:
text = "Hello my name is Teja"

query_result = embeddings.embed_query(text)

Here is what the query looks like after being converted to an embedding:

In [50]:
query_result

[-0.07406411319971085,
 0.012264770455658436,
 0.025599660351872444,
 0.04017006233334541,
 -0.09220541268587112,
 -0.08846328407526016,
 0.15335537493228912,
 0.002915000542998314,
 0.025351058691740036,
 0.024370085448026657,
 0.01021509524434805,
 -0.09724367409944534,
 0.026195870712399483,
 -0.02894514799118042,
 0.03766138479113579,
 -0.050083499401807785,
 0.07666008919477463,
 -0.010352355428040028,
 0.012071335688233376,
 -0.04377082735300064,
 -0.053983528167009354,
 0.023461839184165,
 -0.040795665234327316,
 -0.029599737375974655,
 -0.11574239283800125,
 -0.016673805192112923,
 0.022969834506511688,
 0.08957386016845703,
 -0.006595887243747711,
 -0.03293594345450401,
 -0.022078193724155426,
 0.024496443569660187,
 0.0589464008808136,
 0.02099817618727684,
 -0.045722734183073044,
 0.0600958950817585,
 -0.13809847831726074,
 -0.03711182624101639,
 -0.05654394254088402,
 0.012932590208947659,
 -0.03299634903669357,
 -0.06587941199541092,
 -0.06968126446008682,
 -0.022476643323

In [51]:
len(query_result)

384

# Initialize the Groq client

In [52]:
# Free Llama 3.1 API via Groq

groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Calculating sentence similarity with embeddings


We only want to pull from relevant documents, not from every document or piece of internal data a company might have, so we need a way to define which documents are relevant.

Each sentence is converted to an embedding, which is a vector of numbers in the linear-algebra sense. Cosine similarity measures how closely two such vectors point in the same direction.
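
Cosine similarity itself needs no model: it is the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch with made-up 3-dimensional vectors (the embeddings in this notebook have 384 dimensions); the function name `cosine` is illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way, at right angles, and in opposite directions
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(same_direction, orthogonal, opposite)
```

The three results are approximately 1, 0, and -1. Sentences with similar meaning should produce embeddings whose score is closer to 1.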

In [53]:
def get_huggingface_embeddings(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    return model.encode(text)


def cosine_similarity_between_sentences(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_huggingface_embeddings(sentence1))
    embedding2 = np.array(get_huggingface_embeddings(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like walking to the park"
sentence2 = "I like running to the office"


similarity = cosine_similarity_between_sentences(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")

Embedding for Sentence 1: [[-7.94647262e-04 -4.52190749e-02  5.60034551e-02  4.00062464e-02
   7.82356039e-02 -3.10016028e-03  1.56902865e-01 -1.61643641e-03
   8.40177536e-02  7.29586333e-02 -2.27428153e-02 -1.00336559e-02
  -4.77766357e-02  5.78007065e-02  6.89263120e-02  2.29866221e-03
   3.41052189e-02  8.23902860e-02 -4.47453046e-03  1.18202856e-02
  -7.44135678e-02  2.10828464e-02  1.92200206e-02  5.48400655e-02
  -1.07110761e-01  8.79157037e-02 -1.64800771e-02  6.51672296e-03
  -6.67020795e-05 -4.27562976e-03 -8.20703059e-02  7.05852956e-02
  -1.80556532e-02  3.27348486e-02 -4.36549522e-02  9.93786659e-03
   5.78057803e-02 -6.92316219e-02  4.53142636e-02  4.96660285e-02
  -1.49475699e-02  5.79100735e-02  8.14058036e-02  2.62879906e-03
  -1.49136577e-02 -4.37886156e-02  2.26743110e-02 -3.19027528e-02
   1.00592583e-01  3.10835298e-02  1.30596399e-01  7.27660581e-03
   8.58721696e-03  7.95205031e-03 -7.91899022e-03  4.98277741e-03
  -8.22421089e-02  2.46388651e-02  5.11084683e-02 

# Load in the Data

Learn more about the dataset [here](https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset)

In [None]:
! kaggle datasets download -d ayoubcherguelaine/company-documents-dataset
! unzip company-documents-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
License(s): apache-2.0
company-documents-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  company-documents-dataset.zip
replace CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_1.pdf? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

Process each file and extract its text:

In [55]:
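def process_directory(directory_path):
    """Walk the directory tree and load every PDF with PyPDFLoader."""
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:
            # Skip anything that is not a PDF; PyPDFLoader cannot parse it
            if not file.lower().endswith(".pdf"):
                continue
            file_path = os.path.join(root, file)
            print(f"Processing file: {file_path}")
            loader = PyPDFLoader(file_path)
            # loader.load() returns a list of Documents, one per page
            data.append({"File": file_path, "Data": loader.load()})

    return data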

directory_path = "/content/CompanyDocuments"
documents = process_directory(directory_path)


Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2017-07_8.pdf
Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2017-03_2.pdf
Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2017-03_1.pdf
Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2017-04_5.pdf
Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2017-05_7.pdf
Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_6.pdf
Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2017-09_2.pdf
Processing file: /content/CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2017-08_3.pdf
Processing file: /content/CompanyDocumen

KeyboardInterrupt: 

In [None]:
documents

# Initialize Pinecone


- It takes a few minutes to initialize
- Namespaces are a way to partition data within an index
- Split different companies into different namespaces
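
To make the namespace idea concrete, here is a toy in-memory sketch. It is not Pinecone's implementation; the class `ToyVectorStore` and the company names are hypothetical. It only shows that a query searches one partition and never sees records stored in another.

```python
from collections import defaultdict


class ToyVectorStore:
    """In-memory stand-in for a vector index partitioned by namespace."""

    def __init__(self):
        # namespace -> list of (record_id, vector) pairs
        self._partitions = defaultdict(list)

    def upsert(self, namespace, record_id, vector):
        """Store a record inside the given namespace."""
        self._partitions[namespace].append((record_id, vector))

    def query(self, namespace, vector, top_k=1):
        """Return ids of the best-matching records, searching one namespace only."""
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        candidates = self._partitions.get(namespace, [])
        ranked = sorted(candidates, key=lambda rec: dot(rec[1], vector), reverse=True)
        return [record_id for record_id, _ in ranked[:top_k]]


store = ToyVectorStore()
store.upsert("company-a", "a-invoice", [1.0, 0.0])
store.upsert("company-b", "b-invoice", [1.0, 0.0])

result_a = store.query("company-a", [1.0, 0.0])
result_missing = store.query("company-c", [1.0, 0.0])
print(result_a, result_missing)
```

Both records have identical vectors, yet a query against `company-a` returns only that company's record, and an unknown namespace returns nothing.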

In [None]:
# Make sure to create a Pinecone index with 384 dimensions:
# the index dimension must match the length of the embedding model's vectors

index_name = "rag-workshop"

namespace = "company-documents"

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)
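
The comment above asks for a 384-dimension index. One way to create it from code rather than from the Pinecone console is sketched below, assuming pinecone-client v3 or later. The helper name `ensure_index` is hypothetical, and the cloud and region in the example call are assumptions, not values taken from this workshop.

```python
def ensure_index(pc, index_name, dimension, spec, metric="cosine"):
    """Create the index if it is missing; return True when it was created.

    `pc` is expected to be a pinecone.Pinecone client and `spec` a
    pinecone.ServerlessSpec. Both are passed in so this helper itself
    has no import-time dependencies.
    """
    if index_name in pc.list_indexes().names():
        return False
    pc.create_index(name=index_name, dimension=dimension, metric=metric, spec=spec)
    return True


# Example call (not executed here; cloud and region are assumptions,
# pick the ones your Pinecone plan supports):
#
#   from pinecone import Pinecone, ServerlessSpec
#   pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
#   ensure_index(pc, "rag-workshop", 384,
#                ServerlessSpec(cloud="aws", region="us-east-1"))
```

Passing `dimension=384` keeps the index consistent with `len(query_result)` from the embedding cell earlier.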

# Insert data into Pinecone


- This prints out the structure of the documents
- 'File' is the path to the data file
- 'Data' is the content extracted from the file
- Together they represent each PDF and its extracted text

In [None]:
for document in documents:
    print(document['File'], document['Data'])

In [None]:
### UNCOMMENT TO VIEW THE DATA BEING STRUCTURED into PINECONE

# document_data = []
# for document in documents:
#     # Get the file path from the metadata of the first page
#     document_source = document['Data'][0].metadata['source']
#     document_content = document['Data'][0].page_content

#     # Print the contents to visualize how the data is stored in Pinecone
#     print("DOCUMENT SOURCE:", document_source)
#     print("DOCUMENT CONTENT:", document_content)

#     doc = Document(
#         page_content= f"<Source>",
#     )





In [None]:
# Prepare the text for embedding
document_data = []
for document in documents:
    # The source path lives in the metadata of the first page;
    # note that only the first page's text is used here
    document_source = document['Data'][0].metadata['source']
    document_content = document['Data'][0].page_content

    # Pinecone needs more refined metadata: take the last path segment
    # as the file name and the segments before it as folder names
    file_name = document_source.split("/")[-1]
    folder_names = document_source.split("/")[2:-1]

    # Document is LangChain's document class.
    # Wrapping the text in XML-style tags helps the LLM tell where the
    # source and the content start and end inside a prompt.
    doc = Document(
        page_content=f"<Source>\n{document_source}\n</Source>\n\n<Content>\n{document_content}\n</Content>",
        metadata={
            "file_name": file_name,
            "parent_folder": folder_names[-1],
            "folder_names": folder_names
        }
    )

    # Append inside the loop so every document is kept, not just the last one
    document_data.append(doc)
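
To see what the path splitting and XML-style wrapping produce without loading any PDFs, the same logic can be run on a single made-up path. This is a sketch: the helper name `build_document_fields`, the sample path, and the sample text are hypothetical.

```python
def build_document_fields(source_path, content):
    """Return (page_content, metadata) for one extracted file."""
    parts = source_path.split("/")
    file_name = parts[-1]
    # Drop the leading "" and "content" segments of an absolute Colab path,
    # and the file name at the end
    folder_names = parts[2:-1]
    page_content = (
        f"<Source>\n{source_path}\n</Source>\n\n"
        f"<Content>\n{content}\n</Content>"
    )
    metadata = {
        "file_name": file_name,
        "parent_folder": folder_names[-1],
        "folder_names": folder_names,
    }
    return page_content, metadata


# Hypothetical path and text, for illustration only
page_content, metadata = build_document_fields(
    "/content/CompanyDocuments/Invoices/invoice_1.pdf",
    "Order total: 100",
)
print(page_content)
print(metadata)
```

The `[2:-1]` slice assumes an absolute path that starts with `/content/`; a path with a different prefix would need a different slice.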

In [None]:
document_data

In [None]:
# Insert documents into Pinecone
# Usually takes about 4 minutes on the first run:
# all 2000+ files are processed and converted to embeddings,
# and each file takes about 1 to 3 seconds
vectorstore_from_documents = PineconeVectorStore.from_documents(
    document_data,
    embeddings,
    index_name=index_name,
    namespace=namespace
)

# Perform RAG

# Putting it all together