# A standard Retrieval-Augmented Generation (RAG) pipeline, from data ingestion to LLM answer generation

**1) Install Dependencies**

In [None]:
!pip install langchain langchain-community langchain-huggingface langchain_groq chromadb pdfminer.six sentence-transformers transformers groq langchain-core


**2) Import necessary packages**

In [None]:
import os

# Document loading and chunking
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Embeddings and vector store
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Chat model, prompt, and chain building blocks
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

**3) Data Ingestion**

In [None]:
# Load the PDF into a list of LangChain Document objects
loader = PDFMinerLoader('/content/Statistics for Machine Learning.pdf')
doc = loader.load()

**4) Chunking**

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100)
chunks = text_splitter.split_documents(doc)
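
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally: the real splitter also tries to break on separators such as paragraphs and sentences. The function name `chunk_text` and the toy input are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters of toy input
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece repeats the last three characters of the previous one, which is the point of the overlap: a sentence cut at a chunk boundary still appears whole in at least one chunk.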

**5) View sample chunks**

In [None]:
print(chunks[0])
print(chunks[1])
print(chunks[2])

page_content='Statistics: Definition, Importance, Limitation 

Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set 
of  experimental  data  or  real-life  studies.  Statistics  studies  methodologies  to  gather,  review,  analyze  and  draw 
conclusions from data. Some statistical measures include mean, regression analysis, skewness, kurtosis, variance and 
analysis of variance. 

Statistics is a term used to summarize a process that an analyst uses to characterize a data set. If the data set depends 
on a sample of a larger population, then the analyst can develop interpretations about the population primarily based 
on the statistical outcomes from the sample. Statistical analysis involves the process of gathering and evaluating data 
and then summarizing the data into a mathematical form.' metadata={'producer': 'PDFMiner', 'creator': 'PDFMiner', 'creationdate': '', 'total_pages': 47, 'source': '/content/Statistics 

**6) View the number of chunks**

In [None]:
print(len(chunks))

264


**7) Create embeddings**

In [None]:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


**8) Storing embeddings in ChromaDB**

In [None]:
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings)

**9) Collection count in ChromaDB**

In [None]:
print(vectorstore._collection.count())

264


**10) Inspect vector collection**

In [None]:
print(vectorstore._collection.get())

{'ids': ['a514701d-c01f-4b5a-ba91-6941e2e9795d', '16242ac7-525f-40c8-bc8d-cc99d27f9847', 'f8fba297-4513-4627-a38d-1d77c588547c', '8b0b10f5-b1cb-4e1c-bef8-78c0387eedb8', 'f8c735c8-2305-4a0f-8c1a-25a844535f72', '5ec32a30-c3a4-4529-b0f7-169f8460eed2', '8588194a-2fda-4482-80d4-eabcfe7a8895', 'f37b3c90-89fb-4ba9-9b9c-676f97da94c0', 'd8a4349c-2c04-4abd-81f0-4acc56f6caba', '4a0236e2-7474-4347-b452-5737d3743ed4', 'f303105d-35ff-43ef-a107-250954b7ddd9', 'c73a7307-139d-4101-898b-48966e2193e5', 'feb52a0a-f607-4005-ad8c-d02b157f0590', '81747832-72db-4ac4-8dde-32bffa9d14e3', 'eeb5a38c-f89e-4fd1-b699-78dcf54cc04b', '0b93c177-0057-414c-a1dc-930d605ae795', 'ee633597-ed28-43de-bb60-d61f90075c1b', '735c6703-a34d-48f7-83ed-875d5f8a91d4', 'b835ba0e-76f3-4eca-a31a-c46b1f4f5d54', 'a2723bf7-45d8-4ea9-aabe-ca1514dfca98', '36dee6d0-8a44-4151-9c69-e2d11613f3af', '007da46c-a127-4d3b-87d7-e7f0f1485253', '748b0a6b-b200-4694-9618-d63bc77b35af', '2e86707f-4757-4425-a98b-4cab3c967d37', '8340d75b-fca4-4f3f-a992-ab0325

**11) Set Retriever for semantic retrieval of document chunks**

In [None]:
retriever = vectorstore.as_retriever(search_kwargs={"k":4})
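
The retriever returns the `k` chunks whose embeddings are closest to the embedding of the query. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. Both are assumptions for illustration: real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the distance metric actually used depends on how the Chroma collection is configured. The names `cosine_similarity`, `top_k`, `toy_index`, and `toy_query` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical toy "embeddings", one per chunk of text.
toy_index = [
    ("correlation ranges from -1 to +1", [0.9, 0.1, 0.0]),
    ("regression line: Y = a + bX", [0.1, 0.9, 0.0]),
    ("statistics in economics", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]  # pretend embedding of a question about correlation

results = top_k(toy_query, toy_index, k=2)
print(results)
```

With `k=4`, as in the cell above, the retriever would hand the four best-matching chunks to the rest of the chain.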

**12) Initialise Groq API to access Llama model for text generation**

In [None]:
from getpass import getpass
import os

api_key = getpass("Enter your Groq API key: ")
os.environ["GROQ_API_KEY"] = api_key

Enter your Groq API key: ··········


In [None]:
chat_model = ChatGroq(
    model_name="llama-3.3-70b-versatile",
    temperature=0.1,
)

**13) Define a custom prompt template**

In [None]:
template = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question asked only.
Explain the answer in detail. DO NOT provide additional commentary or questions.
Context:
{context}

Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
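
As a rough illustration of what happens when the template is filled in, the sketch below substitutes values for the two placeholders with plain `str.format`. This only approximates the behaviour of `PromptTemplate`; the shortened `toy_template` and the sample values are assumptions made for the example.

```python
# A shortened stand-in for the template above, with the same two placeholders.
toy_template = """Context:
{context}

Question: {question}

Answer:"""

# Fill both placeholders; in the pipeline, context comes from the retriever.
filled = toy_template.format(
    context="Y = a + bX, where a is the intercept and b is the slope.",
    question="what is the equation of regression line?",
)
print(filled)
```

The filled string ends with `Answer:`, which cues the model to continue with the answer itself.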

**14) Format document chunks**

In [None]:
def format_docs(docs):
    """Join the text of the retrieved chunks into one newline-separated context string."""
    return "\n".join(doc.page_content for doc in docs)
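
To see what `format_docs` produces without running the retriever, the sketch below calls the same logic on a hypothetical `FakeDocument` class. That class is an assumption for illustration: it only mimics the `page_content` and `metadata` attributes that a retrieved document exposes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the one attribute format_docs reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each chunk with a newline, dropping metadata."""
    return "\n".join(doc.page_content for doc in docs)


toy_docs = [
    FakeDocument("Correlation coefficient ranges from -1 to +1."),
    FakeDocument("Correlation is affected by scale."),
]
context = format_docs(toy_docs)
print(context)
```

Because the separator is a single newline, chunk boundaries are not visible in the resulting context. Joining with a blank line instead would be one possible variation if the boundaries should stand out.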

**15) Create RAG Chain**

In [None]:
rag_chain = ({"context" : retriever | format_docs, "question" : RunnablePassthrough()}
             | prompt
             | chat_model
             | StrOutputParser())
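
The chain above can be read as a sequence of ordinary function calls. The sketch below mirrors that data flow with plain Python functions. Every name in it (`fake_retriever`, `join_chunks`, `build_prompt`, `fake_model`, `parse_output`, `toy_rag_chain`) is hypothetical, and the retriever and model return canned values, so it shows the wiring only, not real retrieval or generation.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns canned chunks regardless of the question."""
    return ["Y = a + bX", "a is the intercept, b is the slope."]


def join_chunks(chunks: list[str]) -> str:
    """Plays the role of format_docs."""
    return "\n".join(chunks)


def build_prompt(inputs: dict) -> str:
    """Plays the role of the prompt template."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\n\nAnswer:"


def fake_model(prompt_text: str) -> dict:
    """Hypothetical model: wraps a canned reply in a message-like dict."""
    return {"content": f"(reply based on {len(prompt_text)} prompt characters)"}


def parse_output(message: dict) -> str:
    """Plays the role of the string output parser."""
    return message["content"]


def toy_rag_chain(question: str) -> str:
    # Step 1: the same question feeds both entries of the input mapping.
    inputs = {
        "context": join_chunks(fake_retriever(question)),  # retriever | format_docs
        "question": question,  # passed through unchanged
    }
    # Steps 2-4: prompt -> model -> parser, each consuming the previous output.
    return parse_output(fake_model(build_prompt(inputs)))


toy_answer = toy_rag_chain("what is the equation of regression line?")
print(toy_answer)
```

The key point is the first step: the user's question is used twice, once to fetch context and once as the question text in the prompt.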

**16) Display retrieved chunks for the given query**

In [None]:
# get_relevant_documents is deprecated; invoke is the current retriever entry point
retrieved_docs = retriever.invoke("What is the importance of statistics?")
print("Retrieved Context:\n")
for i, doc in enumerate(retrieved_docs):
    print(f"Doc chunk {i+1}:\n{doc.page_content}\n{'-'*40}")

Retrieved Context:

Doc chunk 1:
(ii) Statistics in Mathematics 

Statistics is intimately related to and essentially dependent upon mathematics. The modern theory of Statistics has its 
foundations in the theory of probability which in turn is a particular branch of more advanced mathematical theory of 
Measures and Integration. Ever increasing role of mathematics in statistics has led to the development of a new branch 
of statistics called Mathematical Statistics. 

Thus, Statistics may be considered to be an important member of the mathematics family. In the words of Connor, 
"Statistics is a branch of applied mathematics which specializes in data." 

(iii) Statistics in Economics 

Statistics  and  Economics  are  so  intermixed  with  each  other  that  it  looks  foolish  to  separate  them.  DeveThe 
development modern statistical methods has led to an extensive use of statistics in Economics.
----------------------------------------
Doc chunk 2:
Statistics: Definition, Importa



**17) Generate RAG output for the given question**

In [None]:
answer = rag_chain.invoke("What is the importance of statistics?")
print(answer)

The importance of statistics lies in its ability to summarize and analyze data, providing valuable insights and conclusions that can inform decision-making in various fields, including planning, economics, and mathematics. Statistics is indispensable in planning, whether at the business, economic, or government level, as it provides crucial data on production, consumption, birth, death, investment, and income, which are essential for efficient working and formulating policy decisions.

In economics, statistics plays a vital role in comparison, presentation, interpretation, and analysis of economic data, enabling the understanding of complex economic concepts such as consumption, production, exchange, distribution, and public finance. Statistical analysis helps in addressing problems like spending of income, production of national wealth, adjustment of demand and supply, and the impact of economic policies on the economy.

Furthermore, statistics is essential in public finance, as it en

**18) Additional examples**

In [None]:
retrieved_docs = retriever.invoke("What are the properties of correlation?")
print("Retrieved Context:\n")
for i, doc in enumerate(retrieved_docs):
    print(f"Doc chunk {i+1}:\n{doc.page_content}\n{'-'*40}")

Retrieved Context:

Doc chunk 1:
Correlation analysis can be used in a variety of fields, such as psychology, economics, and biology, to investigate the 
relationship between variables and to help make predictions or inform decision-making. However, correlation does 
not necessarily imply causation, and other factors may be responsible for the observed correlation. Therefore, it is 
important to use caution when interpreting the results of correlation analysis. 

The properties of correlation are as follows: 

Correlation coefficient ranges from -1 to +1: The correlation coefficient is a standardized measure that ranges from -
1 to +1. A correlation of -1 indicates a perfect negative correlation, while a correlation of +1 indicates a perfect 
positive correlation. A correlation coefficient of 0 indicates no correlation between the two variables.
----------------------------------------
Doc chunk 2:
Correlation is affected by scale: Correlation is affected by the scale of measurement of

In [None]:
answer = rag_chain.invoke("What are the properties of correlation?")
print(answer)


The properties of correlation are as follows: 

1. Correlation coefficient ranges from -1 to +1: The correlation coefficient is a standardized measure that ranges from -1 to +1. A correlation of -1 indicates a perfect negative correlation, while a correlation of +1 indicates a perfect positive correlation. A correlation coefficient of 0 indicates no correlation between the two variables.

2. Correlation is affected by scale: Correlation is affected by the scale of measurement of the variables. For example, if one variable is measured in inches and the other variable is measured in centimeters, the correlation coefficient will be affected by the difference in scale.

3. Correlation can be influenced by sample size: The correlation coefficient can be influenced by sample size, with larger sample sizes leading to more accurate estimates of the true correlation between variables.

4. Correlation is not affected by the units of measurement: Correlation is a unitless measure, which means tha

In [None]:
retrieved_docs = retriever.invoke("Explain regression analysis in detail")
print("Retrieved Context:\n")
for i, doc in enumerate(retrieved_docs):
    print(f"Doc chunk {i+1}:\n{doc.page_content}\n{'-'*40}")

Retrieved Context:

Doc chunk 1:
Regression analysis uses statistical measures such as the coefficient of determination (R-squared), which indicates 
the proportion of variation in the dependent variable that can be explained by the independent variables, and the 
coefficients of the independent variables, which indicate the magnitude and direction of the effect of each 
independent variable on the dependent variable. 

Regression analysis can be used in many different fields, including economics, finance, social sciences, engineering, 
and biology, to name a few. It is a powerful tool for making predictions and understanding the relationships between 
variables.
----------------------------------------
Doc chunk 2:
Fitting a regression line involves finding the equation of a line that best fits the data points in a scatter plot. The 
regression line represents the linear relationship between the independent variable and the dependent variable. The 
line can then be used to make predic

In [None]:
answer = rag_chain.invoke("Explain regression analysis in detail")
print(answer)


Regression analysis is a statistical method used to establish a relationship between two or more variables. It involves using statistical measures such as the coefficient of determination (R-squared) and coefficients of independent variables to understand the magnitude and direction of the effect of each independent variable on the dependent variable. The goal of regression analysis is to develop an equation that can be used to predict the value of the dependent variable based on the value of the independent variable.

Regression analysis can be applied in various fields, including economics, finance, social sciences, engineering, and biology. It is a powerful tool for making predictions and understanding the relationships between variables. The analysis involves fitting a regression line, which represents the linear relationship between the independent variable and the dependent variable. The equation of a regression line can be written as Y = a + bX, where Y is the dependent variable

In [None]:
retrieved_docs = retriever.invoke("what is the equation of regression line?")
print("Retrieved Context:\n")
for i, doc in enumerate(retrieved_docs):
    print(f"Doc chunk {i+1}:\n{doc.page_content}\n{'-'*40}")

Retrieved Context:

Doc chunk 1:
Fitting a regression line involves finding the equation of a line that best fits the data points in a scatter plot. The 
regression line represents the linear relationship between the independent variable and the dependent variable. The 
line can then be used to make predictions about the value of the dependent variable for a given value of the 
independent variable. 

The equation of a regression line can be written as: 

Y = a + bX
Where Y is the dependent variable, X is the independent variable, a is the intercept, and b is the slope of the line. 
The slope of the line (b) represents the change in Y for every unit change in X, while the intercept (a) represents the 
value of Y when X is equal to zero.
----------------------------------------
Doc chunk 2:
The equation of a simple linear regression line can be written as: 

Y = a + bX 

Where Y is the dependent variable, X is the independent variable, a is the intercept, and b is the slope of the line

In [None]:
answer = rag_chain.invoke("what is the equation of regression line?")
print(answer)


The equation of a regression line can be written as Y = a + bX, where Y is the dependent variable, X is the independent variable, a is the intercept, and b is the slope of the line. The slope of the line (b) represents the change in Y for every unit change in X, while the intercept (a) represents the value of Y when X is equal to zero. This equation represents the linear relationship between the independent variable and the dependent variable, and it can be used to make predictions about the value of the dependent variable for a given value of the independent variable.
