In [2]:
!pip install transformers --quiet
!pip install langchain --quiet
!pip install docarray --quiet
!pip install pypdf --quiet
!pip install langchain_huggingface --quiet
!pip install bitsandbytes --quiet
!pip install langchain-community --quiet

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transformers import BitsAndBytesConfig
import torch

In [4]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4"
)
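
As a rough back-of-the-envelope sketch of why the 4-bit configuration above matters, the snippet below estimates the memory taken by the weights alone. The parameter count of 9 billion is a nominal assumption read off the model name used later in this notebook, and the helper `weight_memory_gib` is hypothetical. The estimate ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights in GiB for a given precision."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: roughly 9 billion parameters (nominal figure, not measured).
NOMINAL_PARAMS = 9e9

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)  # float16 weights
nf4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{nf4_gib:.1f} GiB")
```

The 4x ratio is exact for the raw weights; the absolute numbers are only an order-of-magnitude guide.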

**Note:** the model is loaded from the *unsloth* repository on the Hugging Face Hub rather than from the *google* one.

**Unsloth** is an open-source project focused on **memory-efficient** and **fast** loading and fine-tuning of large language models, especially for LoRA and QLoRA training.
- It is developed independently of Google. In this notebook the checkpoint is loaded with plain `transformers` and `bitsandbytes`, so the memory savings come from the 4-bit quantization configured above.

In [5]:
model_name = "unsloth/gemma-2-9b-it"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
    dtype=torch.float16
)

In [None]:
text_gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=64
)

In [8]:
from langchain_huggingface import HuggingFaceEmbeddings

In [9]:
embedding_name = 'sentence-transformers/all-mpnet-base-v2'

`embeddings` will be used by the **retriever**.

In [10]:
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_name,
    # model_kwargs={"trust_remote_code": True},
)

Convert the **transformers** `pipeline` into a LangChain `HuggingFacePipeline`.

In [13]:
from langchain_huggingface.llms import HuggingFacePipeline

In [14]:
llm = HuggingFacePipeline(pipeline=text_gen)

In [15]:
def apply_chat_template_and_response(prompt):
    # wrap the prompt as a single user turn
    messages = [
        {'role': 'user', 'content': prompt}
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False  # only matters for chat templates with a thinking mode; Gemma 2's template most likely ignores it
    )

    # invoke(): needed to get the output from HuggingFacePipeline
    # replace(): by default the pipeline returns the prompt followed by the completion, so the prompt is removed from the generated text
    return llm.invoke(text).replace(text, '')
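
The prompt removal in the function above relies on `str.replace`, which deletes every occurrence of the prompt text. A slightly more defensive variant, sketched here with a hypothetical helper `strip_prompt` and made-up strings, removes the prompt only when it appears as a prefix of the generated text:

```python
def strip_prompt(full_text: str, prompt_text: str) -> str:
    """Return only the completion if full_text starts with prompt_text."""
    if full_text.startswith(prompt_text):
        return full_text[len(prompt_text):]
    # Prompt not found at the start: leave the text unchanged.
    return full_text


# Made-up example in which the model happens to repeat the prompt.
prompt_text = "Q: Who are you?\nA:"
full_text = prompt_text + " I am a language model. Q: Who are you?\nA: (repeated)"

completion = strip_prompt(full_text, prompt_text)
print(completion)
```

The `transformers` text-generation pipeline also has a `return_full_text` argument; how it interacts with the LangChain wrapper is not verified here.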

## Test: Get the Output

### Format 1
To get the output in `HuggingFacePipeline` format:

In [16]:
response = apply_chat_template_and_response("Who are you?")
print(response)

I am Gemma, an open-weights AI assistant. I am a large language model trained by Google DeepMind.

Here are some key things to know about me:

* **Open-weights:** My weights are publicly accessible, meaning anyone can see and use the underlying code that makes me work. This promotes transparency


### Format 2
To get the output in the **LangChain** standard format:

In [17]:
from langchain_core.output_parsers import StrOutputParser

In [18]:
parser = StrOutputParser()
response_from_model = apply_chat_template_and_response("Who are you?")
parsed_response = parser.parse(response_from_model)
print(parsed_response)

I am Gemma, an open-weights AI assistant. I am a large language model trained by Google DeepMind.

Here are some key things to know about me:

* **Open-Weights:** My weights are publicly accessible. This means anyone can see and use the underlying code that makes me work.
*


In [19]:
from langchain_core.prompts import PromptTemplate

In [20]:
template = """
You are a helpful and knowledgeable AI assistant. Use only the information retrieved from the documents to answer the user's question in English.
If the answer is not found in the retrieved context, respond with: "Sorry, I don't have that information!" Do not use your own knowledge beyond the provided context.
Be accurate, clear, and polite. Never mention the documents or the retrieval process in your response.
Context: {context}

Question: {question}

Answer:

"""

prompt = PromptTemplate.from_template(template)
prompt.format(context="Here is some context", question="Here is a question")

'\nYou are a helpful and knowledgeable AI assistant. Use only the information retrieved from the documents to answer the user\'s question in English.\nIf the answer is not found in the retrieved context, respond with: "Sorry, I don\'t have that information!" Do not use your own knowledge beyond the provided context.\nBe accurate, clear, and polite. Never mention the documents or the retrieval process in your response. \nContext: Here is some context\n\nQuestion: Here is a question\n\nAnswer:\n\n'

### Test the Response (After Giving the Context to LLM)

In [21]:
context = "I am Sara, and I work as an ML engineer."

In [22]:
formatted_prompt = prompt.format(context=context, question="Who are you?")
response_from_model = apply_chat_template_and_response(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response.replace(formatted_prompt, ""))

I am a helpful and knowledgeable AI assistant.  




In [23]:
formatted_prompt = prompt.format(context=context, question="How old are you?")
response_from_model = apply_chat_template_and_response(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response.replace(formatted_prompt, ""))

Sorry, I don't have that information! 



In [24]:
formatted_prompt = prompt.format(context=context, question="Who am I?")
response_from_model = apply_chat_template_and_response(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response.replace(formatted_prompt, ""))

You are Sara, and you work as an ML engineer. 



In [25]:
formatted_prompt = prompt.format(context=context, question="How old am I?")
response_from_model = apply_chat_template_and_response(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response.replace(formatted_prompt, ""))

Sorry, I don't have that information! 



## Context as PDF

In [26]:
!wget -O sample_doc.pdf "https://github.com/saraLatifi/LLMs/raw/main/rag/Agentic_AI.pdf"


--2025-12-12 15:07:24--  https://github.com/saraLatifi/LLMs/raw/main/rag/Agentic_AI.pdf
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/saraLatifi/LLMs/main/rag/Agentic_AI.pdf [following]
--2025-12-12 15:07:24--  https://raw.githubusercontent.com/saraLatifi/LLMs/main/rag/Agentic_AI.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3921 (3.8K) [application/octet-stream]
Saving to: ‘sample_doc.pdf’


2025-12-12 15:07:24 (38.0 MB/s) - ‘sample_doc.pdf’ saved [3921/3921]



## Load the PDF with LangChain

In [27]:
from langchain_community.document_loaders import PyPDFLoader

In [28]:
loader = PyPDFLoader("sample_doc.pdf")
pages = loader.load_and_split()
#pages = loader.load()
pages

[Document(metadata={'producer': 'ReportLab PDF Library - (opensource)', 'creator': 'anonymous', 'creationdate': '2025-12-12T14:31:08+01:00', 'author': 'anonymous', 'keywords': '', 'moddate': '2025-12-12T14:31:08+01:00', 'subject': 'unspecified', 'title': 'untitled', 'trapped': '/False', 'source': 'sample_doc.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Agentic AI\nIn the context of generative artificial intelligence, AI agents (also referred to as compound AI systems or agentic AI) are a class of intelligent agents distinguished by their ability to operate autonomously in complex environments. Agentic AI tools prioritize decision-making over content creation and do not require human prompts or continuous oversight.\nOverview\nAI agents possess several key attributes, including complex goal structures, natural language interfaces, the capacity to act independently of user supervision, and the integration of software tools or planning systems. Their control flow i

In [None]:
!pip install --upgrade langchain langchain-community langchain-core

In [38]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

Choose `chunk_size` (measured in characters) so that the retrieved chunks, together with the rest of the prompt, fit into the **context window** of the LLM.

In [39]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
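
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break at separators such as paragraphs and sentences); the function `sliding_window_chunks` is a hypothetical stand-in that only shows how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.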

In [40]:
# keep only the first five chunks; these are the ones indexed for retrieval
text_documents = text_splitter.split_documents(pages)[:5]

In [41]:
text_documents

[Document(metadata={'producer': 'ReportLab PDF Library - (opensource)', 'creator': 'anonymous', 'creationdate': '2025-12-12T14:31:08+01:00', 'author': 'anonymous', 'keywords': '', 'moddate': '2025-12-12T14:31:08+01:00', 'subject': 'unspecified', 'title': 'untitled', 'trapped': '/False', 'source': 'sample_doc.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Agentic AI\nIn the context of generative artificial intelligence, AI agents (also referred to as compound AI systems or agentic AI) are a class of intelligent agents distinguished by their ability to operate autonomously in complex environments. Agentic AI tools prioritize decision-making over content creation and do not require human prompts or continuous oversight.\nOverview\nAI agents possess several key attributes, including complex goal structures, natural language interfaces, the capacity to act independently of user supervision, and the integration of software tools or planning systems. Their control flow i

The external data source is a small PDF file, so an in-memory **vector store** is sufficient. *DocArrayInMemorySearch* is used here because the embedded chunks fit comfortably in RAM.

For larger datasets, a similarity-search library such as *FAISS* or a persistent **vector database** can be used instead.

In [42]:
from langchain_community.vectorstores import DocArrayInMemorySearch

# it stores the embeddings of the text_documents
vectorstore = DocArrayInMemorySearch.from_documents(text_documents, embedding=embeddings)
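
Conceptually, an in-memory vector store keeps one embedding per chunk and ranks chunks by similarity to the query embedding. The sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. The vectors, chunk labels, and the helpers `cosine_similarity` and `top_k` are all assumptions for illustration; real embeddings from the model above have far more dimensions, and `DocArrayInMemorySearch` may use a different implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy store: (chunk label, made-up embedding).
toy_store = [
    ("chunk about agent memory", [0.9, 0.1, 0.0]),
    ("chunk about history", [0.1, 0.9, 0.1]),
    ("chunk about training", [0.0, 0.2, 0.9]),
]

toy_query = [0.2, 0.8, 0.0]  # made-up query embedding, closest to "history"
results = top_k(toy_query, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```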

In [50]:
query = "Who is Andrew Ng?"
# the vector store acts as a retriever
retriever = vectorstore.as_retriever()
# retrieves the chunks most similar to the query
retrieved_context = retriever.invoke(query)


In [51]:
retrieved_context

[Document(metadata={'producer': 'ReportLab PDF Library - (opensource)', 'creator': 'anonymous', 'creationdate': '2025-12-12T14:31:08+01:00', 'author': 'anonymous', 'keywords': '', 'moddate': '2025-12-12T14:31:08+01:00', 'subject': 'unspecified', 'title': 'untitled', 'trapped': '/False', 'source': 'sample_doc.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Memory systems for agents include Mem0, MemGPT, and MemOS.\nHistory\nAI agents have been traced back to research from the 1990s, with Harvard professor Milind Tambe noting that the definition of an AI agent was not clear at the time either. Researcher Andrew Ng has been credited with spreading the term "agentic" to a wider audience in 2024.\nTraining and testing\nResearchers have attempted to build world models and reinforcement learning environments to train or evaluate AI agents. For example, video games such as Minecraft and No Man\'s Sky as well as replicas of company websites, have also been used for training

## RAG Pipeline

In [55]:
questions = [
    "Who is Andrew Ng?",
    "Explain AI agents."
]

for question in questions:
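    ### RAG pipeline
    # retrieve the context from the external data source
    retrieved_context = retriever.invoke(question)
    # keep only the text of each chunk, so metadata does not end up in the prompt
    context_text = "\n\n".join(doc.page_content for doc in retrieved_context)
    # add the context to the prompt
    formatted_prompt = prompt.format(context=context_text, question=question)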
    # query the prompt & get the response
    response_from_model = apply_chat_template_and_response(formatted_prompt)
    # parse the response
    parsed_response = parser.parse(response_from_model)

    print(f"Question: {question}")
    print(f"Answer: {parsed_response}")
    print()
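
The retrieve, format, generate flow of the loop above can be sketched without any model by swapping in stubs. Everything in this snippet is hypothetical: `keyword_retrieve` is a naive word-overlap stand-in for the embedding-based retriever, `stub_generate` stands in for the LLM call, and the two chunks are shortened from the sample PDF.

```python
from typing import Callable, List

TEMPLATE = "Context: {context}\n\nQuestion: {question}\n\nAnswer:"

# Two shortened chunks standing in for the indexed document.
CHUNKS = [
    'Researcher Andrew Ng has been credited with spreading the term "agentic".',
    "Memory systems for agents include Mem0, MemGPT, and MemOS.",
]


def _words(text: str) -> set:
    """Lower-cased words with surrounding punctuation stripped."""
    return {w.strip('?.,"').lower() for w in text.split()}


def keyword_retrieve(question: str) -> List[str]:
    """Stub retriever: return chunks sharing at least one word with the question."""
    question_words = _words(question)
    return [chunk for chunk in CHUNKS if question_words & _words(chunk)]


seen_prompts: List[str] = []


def stub_generate(prompt_text: str) -> str:
    """Stub LLM: record the prompt it receives and return a fixed string."""
    seen_prompts.append(prompt_text)
    return "stub answer"


def rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, fill the template, and pass the prompt to the generator."""
    context = "\n\n".join(retrieve(question))
    prompt_text = TEMPLATE.format(context=context, question=question)
    return generate(prompt_text)


answer = rag_answer("Who is Andrew Ng?", keyword_retrieve, stub_generate)
print(seen_prompts[0])
```

Passing the retriever and generator as arguments keeps the flow testable: the recorded prompt shows exactly what context the model would have seen.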

Question: Who is Andrew Ng?
Answer: Researcher Andrew Ng has been credited with spreading the term "agentic" to a wider audience in 2024. 




Question: Explain AI agents.
Answer: AI agents, also known as compound AI systems or agentic AI, are intelligent agents capable of operating autonomously in complex environments.  

They prioritize decision-making over content creation and don't require continuous human supervision.  

Key attributes of AI agents include:

* Complex goal structures
* Natural language interfaces



## Simple Chatbot with RAG

Sample question: *Who used the term “agentic” for the first time?*

In [56]:
while True:
    print("Say 'exit' or 'quit' to exit the loop")
    question = input('User question: ')
    print(f"Question: {question}")
    if question.lower() in ["exit", "quit"]:
        print("Exiting the conversation. Goodbye!")
        break
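    # retrieve the context for the current question
    retrieved_context = retriever.invoke(question)
    context_text = "\n\n".join(doc.page_content for doc in retrieved_context)
    formatted_prompt = prompt.format(context=context_text, question=question)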
    response_from_model = apply_chat_template_and_response(formatted_prompt)
    parsed_response = parser.parse(response_from_model)
    print(f"Answer: {parsed_response}")
    print()

Say 'exit' or 'quit' to exit the loop
User question: Hi
Question: Hi
Answer: Hello!  How can I help you? 




Say 'exit' or 'quit' to exit the loop
User question: Who used the term “agentic” for the first time?
Question: Who used the term “agentic” for the first time?
Answer: Andrew Ng.  


Say 'exit' or 'quit' to exit the loop
User question: exit
Question: exit
Exiting the conversation. Goodbye!
