<a href="https://colab.research.google.com/github/smruthi-sreenivas/RAG-Personal-Resource-Assistant/blob/main/RAG_Personal_Resource_Assistant_Langchain%2C_Cohere_and_ChromaDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing necessary libraries

Cohere offers free trial API keys for its LLMs. Generate one at dashboard.cohere.com and save it in Colab's Secrets panel under the name `COHERE_KEY`, which is the name the import cell reads.

In [2]:
!pip install langchain-cohere langchain pdfminer.six chromadb


langchain-cohere: Enables integration of Cohere's language models with LangChain for advanced text generation and processing workflows.

langchain: Provides a modular framework for building language model-powered applications, such as chatbots, question-answering systems, and conversational agents.

pdfminer.six: Facilitates text extraction from PDF files, making it useful for document analysis and preprocessing tasks.

chromadb: A vector database library designed for efficient storage and retrieval of embeddings, ideal for tasks like semantic search and recommendation systems.

## Importing libraries

In [3]:
import os
from google.colab import userdata

# Read the Cohere API key from Colab's Secrets panel (stored under the name COHERE_KEY)
os.environ["COHERE_API_KEY"] = userdata.get('COHERE_KEY')

# Prompting, parsing, and chain-composition helpers
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from pydantic import BaseModel, Field

# Cohere chat model and embeddings (both ship in the langchain-cohere package)
from langchain_cohere import ChatCohere, CohereEmbeddings

# PDF text extraction, chunking, and the Chroma vector store
from pdfminer.high_level import extract_text as extract_text_pdf_miner
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

## VectorDB setup

In [4]:
# Define the directory where the Chroma database will persist data
persist_directory = "/content/chroma_db"

# Initialize Cohere embeddings with the specified model
# "embed-english-v3.0" is a pre-trained English language embedding model by Cohere
# The user_agent parameter specifies the tool or library using the Cohere API, in this case, LangChain
embedding = CohereEmbeddings(
    model="embed-english-v3.0",
    user_agent="langchain"
)




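We process two research papers here: "Attention Is All You Need" (the Transformer paper, arXiv 1706.03762) and "You Only Look Once" (the YOLO paper, arXiv 1506.02640). You can substitute your own PDFs.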

In [10]:
# Create the text splitter once and reuse it for every PDF
# Each chunk has a maximum size of 2048 characters with a 512-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=512)

# Collect the chunks from all PDFs in a single list
docs = []

# Loop through the list of PDF files to process
for pdf_name in ["/content/1706.03762v7.pdf", "/content/1506.02640v5.pdf"]:
    # Open each PDF file in binary mode and extract its text with pdfminer
    with open(pdf_name, 'rb') as f:
        text = extract_text_pdf_miner(f)

    # Clean the extracted text by replacing newline characters with spaces
    cleaned_text = " ".join(text.split("\n"))

    # Split the cleaned text into chunks and wrap each chunk in a Document object
    # The source path is kept as metadata so each answer can be traced back to its PDF
    for chunk in splitter.split_text(cleaned_text):
        docs.append(Document(page_content=chunk, metadata={"source": pdf_name}))

# Create a Chroma collection from the processed documents
# Use the specified persist directory and embedding model for storage and retrieval
vector_collection_fixed_size = Chroma.from_documents(
    documents=docs,
    persist_directory=persist_directory,
    embedding=embedding
)
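To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The helper `chunk_with_overlap` is a hypothetical name introduced for illustration; it is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break at separators such as paragraphs and spaces and therefore produces chunks of varying length.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy values so the overlap is easy to see; the notebook uses 2048 and 512
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

In this simplified scheme, the notebook's settings (2048 and 512) would advance each window by 1536 characters, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.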

In [11]:
# Initialize a Chroma vector database
# The persist_directory specifies the location where the database is stored
# The embedding_function parameter provides the embedding model used for vector representation
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)



In [12]:
# Perform a similarity search on the vector database
# The query "What is YOLO?" is used to find the most relevant documents
# k=1 specifies that the top 1 most similar document should be retrieved
# The method also returns relevance scores indicating how closely each document matches the query
vectordb.similarity_search_with_relevance_scores("What is YOLO?", k=1)

[(Document(metadata={'source': '/content/1506.02640v5.pdf'}, page_content='Detection In The Wild  Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and  YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance,  \x0cVOC 2007 AP 59.2 54.2 43.2 36.5 -  Picasso AP Best F1 0.590 53.3 0.226 10.4 0.458 37.8 0.271 17.8 0.051 1.9  People-Art AP 45 26 32  YOLO R-CNN DPM Poselets [2] D&T [4]  (a) Picasso Dataset precision-recall curves.  (b) Quantitative results on the VOC 2007, Picasso, and People-Art Datasets. The Picasso Dataset evaluates on both AP and best F1 score.  Figure 5: Generalization results on Picasso and People-Art datasets.  Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate a
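Conceptually, a similarity search embeds the query, compares it against every stored chunk vector, and returns the closest matches with their scores. The sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. The vectors, the texts, and the helper names `cosine_similarity` and `top_k` are made up for illustration; the distance metric Chroma actually uses, and how LangChain turns it into a relevance score, depend on the collection's configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=1):
    """Score every stored vector against the query and keep the k best."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "vector store": (chunk text, embedding) pairs with invented values
toy_store = [
    ("chunk about object detection", [0.9, 0.1, 0.0]),
    ("chunk about attention", [0.1, 0.9, 0.2]),
    ("chunk about training schedules", [0.2, 0.2, 0.9]),
]

# Pretend this is the embedding of the query "What is YOLO?"
toy_query = [1.0, 0.0, 0.1]

results = top_k(toy_query, toy_store, k=1)
print(results)
```

A real vector database avoids the full scan shown here by using an approximate nearest-neighbor index, but the ranking idea is the same.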

## RAG pipeline

In [13]:
# Initialize an LLM instance using Cohere's "command-r" model, a large language model
# that Cohere optimizes for retrieval-augmented generation (RAG) tasks, including:
#   - Answering questions using external documents
#   - Summarizing retrieved content
#   - Multi-hop reasoning across sources

# The temperature parameter controls randomness in the generated responses;
# 0 makes the output as deterministic as possible
llm = ChatCohere(model="command-r", temperature=0)

# Define a prompt template for generating answers based on a given context and question
prompt_str = """Answer the question below using the context:

Context: {context}

Question: {question}

Answer: """

# Create a ChatPromptTemplate from the string template, enabling dynamic input for context and question
prompt = ChatPromptTemplate.from_template(prompt_str)

# Create a retrieval pipeline to fetch relevant context and pass through the user's question
retrieval = RunnableParallel(
    {
        # Use the vector database as a retriever to fetch relevant context for the question
        "context": vectordb.as_retriever(),

        # Pass through the user's input question without modification
        "question": RunnablePassthrough()
    }
)

# Define an output parser to format the generated response into a string
output_parser = StrOutputParser()

# Create a processing chain that retrieves context, formats the prompt, generates an LLM response, and parses the output
chain = retrieval | prompt | llm | output_parser
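One detail worth noting: the retriever returns a list of `Document` objects, so the `{context}` slot of the prompt appears to receive the string form of that list, metadata included. A common alternative is to join only the text of the chunks before it reaches the prompt. The sketch below shows such a helper; `format_docs` and `FakeDocument` are hypothetical names, and `FakeDocument` is only a stand-in so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for LangChain's Document, used only so this sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Invented retrieval results for illustration
retrieved = [
    FakeDocument("YOLO is a fast object detector.", {"source": "a.pdf"}),
    FakeDocument("The Transformer relies on attention.", {"source": "b.pdf"}),
]

context = format_docs(retrieved)
print(context)
```

Assuming the usual LCEL behavior of coercing plain functions into runnables, the helper could be wired into the chain by writing `"context": vectordb.as_retriever() | format_docs` in the `RunnableParallel` dictionary.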

In [15]:
# Invoke the chain of components (retrieval, prompt generation, LLM processing, and output parsing)
# The question "What is attention mechanism?" is passed through the chain to generate the response
response = chain.invoke("What is attention mechanism?")

# Print the response generated by the chain
print(response)

The attention mechanism is a process relating to different positions of a single sequence that is used to compute a representation of the sequence. Self-attention, or intra-attention, is used in tasks like reading comprehension and abstractive summarization. The Transformer is an example of a model that uses self-attention to compute representations of its input and output.


## Other ways to invoke the chain

.invoke(): Passes a single input through the chain and returns a single output.

.batch(): Takes a list of inputs and returns a list of outputs in the same order. It is usually faster than calling .invoke() once per input because it runs the calls in parallel for you.

.stream(): Yields the response in chunks as it is generated, so we can start printing before the full response is complete.
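The difference between the three calling styles can be sketched with a toy class that needs no API key. `ToyChain` is a hypothetical stand-in, not LangChain code: its `batch` uses a thread pool, which is how LangChain's default `batch` works as far as we understand it, and its `stream` only imitates streaming by splitting a finished answer into words, whereas a real model yields tokens as they are produced.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyChain:
    """Hypothetical stand-in for a chain; it only mimics the three calling styles."""

    def invoke(self, question: str) -> str:
        time.sleep(0.05)  # pretend to wait on a network call
        return f"answer to: {question}"

    def batch(self, questions: list[str]) -> list[str]:
        # Run the invoke calls concurrently; map() keeps results in input order
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self.invoke, questions))

    def stream(self, question: str):
        # Yield the answer piece by piece instead of returning it all at once
        for word in self.invoke(question).split(" "):
            yield word + " "


toy = ToyChain()

single = toy.invoke("What is YOLO?")      # one input, one output
many = toy.batch(["q1", "q2", "q3"])      # list in, list out
streamed = "".join(toy.stream("What is YOLO?"))  # pieces arrive one at a time

print(single)
print(many)
print(streamed)
```

Because the simulated waits overlap inside the thread pool, the three batched calls finish in roughly the time of one, which is the same reason `.batch()` tends to beat a loop of `.invoke()` calls when each call mostly waits on the network.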

In [16]:
response_with_batch = chain.batch(["What is Transformers", "How is Transformer different than YOLO?"])

for response in response_with_batch:
  print(response)
  print("\n")

Transformers are a type of model architecture that do not use recurrence but instead rely entirely on an attention mechanism to establish global dependencies between the input and output. They are particularly useful for sequence modeling and transduction problems such as language modeling and machine translation. Transformers allow for much more parallelization compared to recurrent models, which in turn increases computational efficiency. So much so, that it reaches a new high standard in translation quality with just twelve hours of training on eight P100 GPUs.


YOLO (You Only Look Once) is a simple object detection system that simultaneously predicts multiple bounding boxes and class probabilities for an image. It trains on full images and directly optimizes detection performance. YOLO is extremely fast, achieving more than twice the mean average precision of other real-time systems. It sees the entire image, encoding contextual information about classes and their appearance. It l

In [17]:
for chunk in chain.stream("What are the 3 vectors in Transformers architecture?"):
  print(chunk, flush=True, end="")

In the context of the provided text, the term "vectors" is not explicitly mentioned in relation to Transformers architecture. However, the text does refer to various vector sizes and dimensions. 

The three key vectors in the Transformer architecture, which the text alludes to, are likely: 

1. **Query Vector:** This could be considered the primary vector in the Transformer's attention mechanism. It is often denoted as a Q (query) and is responsible for querying relevant context from the input sequence.

2. **Key Vector:** The secondary vector, often denoted as K (key), holds information about the input sequence and helps align the query with relevant segments of the input.

3. **Value Vector:** The value vector, V, contains the actual data or information associated with the key vectors.

These three vectors are fundamental to the attention mechanism in Transformers, enabling the model to weigh the importance of different input elements when making predictions.