Data source: https://www.kaggle.com/datasets/danielihenacho/amazon-reviews-dataset?resource=download

# Start with RAG

## STEP 1: Loading the PDF (PyPDFLoader already returns the pages as documents)

In [63]:
from langchain_community.document_loaders.pdf import PyPDFLoader

loader = PyPDFLoader("data/football_tutorial.pdf")

docs = loader.load()

docs

[Document(metadata={'producer': 'Microsoft® Word 2016', 'creator': 'Microsoft® Word 2016', 'creationdate': '2022-06-27T13:05:06+05:30', 'author': 'jami', 'moddate': '2022-06-27T13:05:06+05:30', 'source': 'data/football_tutorial.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}, page_content=''),
 Document(metadata={'producer': 'Microsoft® Word 2016', 'creator': 'Microsoft® Word 2016', 'creationdate': '2022-06-27T13:05:06+05:30', 'author': 'jami', 'moddate': '2022-06-27T13:05:06+05:30', 'source': 'data/football_tutorial.pdf', 'total_pages': 19, 'page': 1, 'page_label': '2'}, page_content='Football \ni \n \nAbout the T utorial \nFootball or soccer is the most popular ball game around the world. Football requires a lot of \nstamina and staying power on the ground as it is all about foot speed, and the confidence to \nskillfully maneuver the ball to score a goal. This tutorial explains the simple yet fundamental \nrules of the game and various terminolog ies involved. It also provides
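
The first entry in the output above has an empty `page_content`. A minimal sketch of dropping such pages before splitting, assuming only that each page exposes a `page_content` attribute. The helper name `drop_empty_pages` and the stand-in objects are hypothetical and used so the snippet runs without the PDF:

```python
from types import SimpleNamespace


def drop_empty_pages(pages):
    """Keep only pages whose text contains something other than whitespace."""
    return [page for page in pages if page.page_content.strip()]


# Stand-in objects with the same `page_content` attribute as the loaded pages
sample_pages = [
    SimpleNamespace(page_content="", metadata={"page": 0}),
    SimpleNamespace(page_content="Football \ni \n \nAbout the Tutorial", metadata={"page": 1}),
]

non_empty = drop_empty_pages(sample_pages)
print(len(non_empty))
```

In the notebook the same helper could be applied to `docs` right after `loader.load()`.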

## STEP 2: Splitting the Document into CHUNKS

In [75]:


from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split documents into chunks (not really necessary)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = splitter.split_documents(docs) 
chunks[:5]

[Document(metadata={'producer': 'Microsoft® Word 2016', 'creator': 'Microsoft® Word 2016', 'creationdate': '2022-06-27T13:05:06+05:30', 'author': 'jami', 'moddate': '2022-06-27T13:05:06+05:30', 'source': 'data/football_tutorial.pdf', 'total_pages': 19, 'page': 1, 'page_label': '2'}, page_content='Football \ni \n \nAbout the T utorial \nFootball or soccer is the most popular ball game around the world. Football requires a lot of \nstamina and staying power on the ground as it is all about foot speed, and the confidence to \nskillfully maneuver the ball to score a goal. This tutorial explains the simple yet fundamental \nrules of the game and various terminolog ies involved. It also provides information on the'),
 Document(metadata={'producer': 'Microsoft® Word 2016', 'creator': 'Microsoft® Word 2016', 'creationdate': '2022-06-27T13:05:06+05:30', 'author': 'jami', 'moddate': '2022-06-27T13:05:06+05:30', 'source': 'data/football_tutorial.pdf', 'total_pages': 19, 'page': 1, 'page_label': '
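
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines). The function name `sliding_window_chunks` is hypothetical, and the small numbers are chosen only to keep the output readable:

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window starts 6 characters after the previous one, so neighbouring chunks share 4 characters. The overlap keeps a sentence that straddles a boundary visible in both chunks.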

## STEP 3: Creating Embeddings for the Chunks

For this project, mxbai-embed-large was chosen because the retrieval step in RAG depends on accurate semantic search, which is its main strength in the comparison below.

| Model | Size | Strength | Best Use Case |
| :--- | :--- | :--- | :--- |
| **mxbai-embed-large** | ~670MB | Accuracy | High-quality search/RAG |
| **qwen3-embedding** | ~1.1GB | Intelligence | Messy data, slang, multi-language |
| **nomic-embed-text** | ~270MB | Context | Long documents or large batches |
| **all-minilm** | ~45MB | Speed | Ultra-fast real-time processing |


In [76]:
from langchain_ollama import OllamaEmbeddings

embedding_model = OllamaEmbeddings(model='mxbai-embed-large')

## STEP 4: Store embeddings in the Vector store

In [77]:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=chunks, 
                                    embedding=embedding_model)
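
Conceptually, a vector store keeps one vector per chunk and returns the chunks whose vectors are closest to the query vector. The sketch below illustrates that idea in plain Python with hand-written 3-dimensional vectors instead of real embeddings. Cosine similarity is an assumption made for illustration; the distance metric Chroma actually uses depends on how the collection is configured. The helper names are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the k (text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-written vectors standing in for real embeddings
stored = [
    ("rules of the game", [1.0, 0.0, 0.0]),
    ("participating countries", [0.0, 1.0, 0.0]),
    ("history of football", [0.0, 0.0, 1.0]),
]
query = [0.1, 0.9, 0.2]
results = top_k(query, stored, k=2)
print([text for text, _ in results])
```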

## STEP 5: Semantic Search

In [None]:
# A user asks a question

from pydantic import BaseModel

class UserInput(BaseModel):
    question: str
    
# Validate the raw input with the model defined above
user_input = UserInput(question=input("Enter your question: ")).question

# Retrieve the three most relevant chunks
context = vectorstore.similarity_search(user_input, k=3)
context


[Document(metadata={'total_pages': 19, 'producer': 'Microsoft® Word 2016', 'page': 5, 'page_label': '6', 'source': 'data/football_tutorial.pdf', 'author': 'jami', 'creationdate': '2022-06-27T13:05:06+05:30', 'creator': 'Microsoft® Word 2016', 'moddate': '2022-06-27T13:05:06+05:30'}, page_content='associations in whole of Europe. It is a commemorate  of 54 nations and some of those are \nCroatia, England, France, Germany, Greece, Italy, Netherlands, Portugal, Russia, Spain, \nSwitzerland, Belgium, Bosnia, and Herzegovina. \n \n2. Football – Participating Countries'),
 Document(metadata={'author': 'jami', 'moddate': '2022-06-27T13:05:06+05:30', 'total_pages': 19, 'source': 'data/football_tutorial.pdf', 'page': 5, 'page_label': '6', 'creationdate': '2022-06-27T13:05:06+05:30', 'producer': 'Microsoft® Word 2016', 'creator': 'Microsoft® Word 2016'}, page_content='represented in international events. Indonesia is first Asian country to have qualified for World \nCup. However, India has made 
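
The search returns `Document` objects, while an LLM prompt needs plain text. A possible sketch of turning the results into one labeled string, assuming each document has `page_content` and a `metadata` dict with a `page_label` key as in the output above. The helper name `format_context` and the stand-in objects are hypothetical:

```python
from types import SimpleNamespace


def format_context(documents):
    """Join retrieved chunks into one string, labeling each with its page."""
    parts = []
    for doc in documents:
        page = doc.metadata.get("page_label", "?")  # fall back if no label exists
        parts.append(f"[page {page}]\n{doc.page_content.strip()}")
    return "\n\n".join(parts)


# Stand-ins shaped like the retrieved documents
retrieved = [
    SimpleNamespace(metadata={"page_label": "6"}, page_content="2. Football Participating Countries"),
    SimpleNamespace(metadata={}, page_content="  text without a page label  "),
]
context_block = format_context(retrieved)
print(context_block)
```

Keeping the page label next to each chunk makes it easier to check an answer against the PDF later.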

## Context for the LLM

In [None]:
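context_text = "\n\n".join(doc.page_content for doc in context)  # retrieved chunks as plain text
prompt_template = f"""You are a helpful assistant who answers the user's questions by reasoning with the context that is given to you.\n\nContext:\n{context_text}\n\nQuestion: {user_input}"""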

In [None]:
from langchain_ollama import ChatOllama

llm = ChatOllama(model='mistral-nemo')
response = llm.invoke(prompt_template)  # invoke() needs the prompt as input
print(response.content)

In [None]:
from langchain_ollama import ChatOllama

# Good reviews
llm = ChatOllama(model='mistral-nemo')

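# f-string so the placeholders are filled in; `question` and `negative_context`
# are assumed to hold the best and worst reviews from an earlier cell
response = llm.invoke(f"""Give me a reply for each of these best reviews: {question} 
                      and then do the same for these worst reviews: {negative_context}""")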
print(response.content)

**Best Reviews:**

1. **ID: 36 (Rating: 5.0)**
   - Reply: "Thank you for your kind words! We're glad to hear that you found our product easy to use. Your satisfaction is our top priority, and we appreciate your feedback!"

2. **ID: 44 (Rating: 2.0)**
   - Reply: "Hello! We're really happy to know that you find our quiet clicks and LED lights useful. We strive to make products that meet your needs. Thank you for sharing your favorite features with us."

3. **ID: 50 (Rating: 1.0)**
   - Reply: "Thank you so much for your positive feedback! We're glad to hear that you've been enjoying the battery life and color-changing feature of our product. Your happiness means a lot to us!"

**Worst Reviews:**

1. **ID: 47 (Rating: 3.0)**
   - Reply: "We're sorry to hear that you're having trouble with your device. Can you please provide more details about the issue? We'd like to help resolve this problem as soon as possible."

2. **ID: 3 (Rating: 1.0)**
   - Reply: "We apologize for the frustration 
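
A prompt with `{question}`-style placeholders only receives its values if the string is formatted before it is sent to the model. Otherwise the model sees the literal braces and has no real reviews to work from. A minimal sketch of filling a template with `str.format` and failing loudly when a value is missing. The helper name `fill_prompt` is hypothetical, and the angle-bracket values are placeholders, not real review data:

```python
import string


def fill_prompt(template, **values):
    """Fill every {placeholder} in the template, raising if one has no value."""
    needed = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = needed - set(values)
    if missing:
        raise KeyError(f"missing prompt values: {sorted(missing)}")
    return template.format(**values)


template = (
    "Give me a reply for each of these best reviews: {question}\n"
    "and then do the same for these worst reviews: {negative_context}"
)
filled = fill_prompt(
    template,
    question="<best review text>",
    negative_context="<worst review text>",
)
print(filled)
```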

In [None]:
# Bad reviews