# **Gagan's RAG Chatbot for Loan Approval Data**

**Retrieval-Augmented Generation (RAG)** lets a chatbot answer questions by combining a language model with an **external knowledge base**.  

In a RAG pipeline, an information retrieval step first finds relevant data in a corpus, and a generative model then produces a fluent answer using that data.  

In our case, the “knowledge base” is a loan dataset (Kaggle’s Loan Approval dataset), and we build a simple RAG system to answer questions about it.

## **Loading and Exploring the Data**

In [5]:
!pip install pandas scikit-learn transformers

import pandas as pd
import numpy as np

df = pd.read_csv('/content/Training Dataset.csv')
print(df.shape)
print(df.head(3))

(614, 13)
    Loan_ID Gender Married Dependents Education Self_Employed  \
0  LP001002   Male      No          0  Graduate            No   
1  LP001003   Male     Yes          1  Graduate            No   
2  LP001005   Male     Yes          0  Graduate           Yes   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             1.0         Urban           Y  


Each row corresponds to one loan application, with features such as Loan_ID, Gender, Married, ApplicantIncome, LoanAmount, Credit_History, Property_Area, and the target Loan_Status (Y / N).
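The `head()` output already shows a `NaN` in `LoanAmount`, so a quick look at missing values and the class balance is a natural first check. The sketch below uses a three-row stand-in frame (called `sample` here, with values copied from the output above) so that it runs on its own; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the loan table (values copied from the head() output above).
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [np.nan, 128.0, 66.0],
    "Credit_History": [1.0, 1.0, 1.0],
    "Loan_Status": ["Y", "N", "Y"],
})

# Count missing values per column; in this sample only LoanAmount has one.
missing_counts = sample.isna().sum()
print(missing_counts)

# Share of approved (Y) vs. rejected (N) applications in the sample.
status_share = sample["Loan_Status"].value_counts(normalize=True)
print(status_share)
```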

We will treat each row as a “document” by concatenating its fields into a text string.

We store all such documents in a list.

In [2]:
docs = []
for _, row in df.iterrows():
    #replace missing values with empty strings, then convert every field to str
    vals = row.fillna('').astype(str)
    text = (
        f"Loan_ID: {vals['Loan_ID']}, Gender: {vals['Gender']}, Married: {vals['Married']}, "
        f"Dependents: {vals.get('Dependents', '')}, Education: {vals['Education']}, Self_Employed: {vals['Self_Employed']}, "
        f"ApplicantIncome: {vals['ApplicantIncome']}, CoapplicantIncome: {vals['CoapplicantIncome']}, "
        f"LoanAmount: {vals['LoanAmount']}, Loan_Amount_Term: {vals['Loan_Amount_Term']}, "
        f"Credit_History: {vals['Credit_History']}, Property_Area: {vals['Property_Area']}, Loan_Status: {vals['Loan_Status']}"
    )
    docs.append(text)

#example doc string
print(docs[0][:100], "...")

Loan_ID: LP001002, Gender: Male, Married: No, Dependents: 0, Education: Graduate, Self_Employed: No, ...


We now have docs, a list of strings where each string encodes one row’s information.  

These will serve as “knowledge documents” for retrieval.

## **Simple TF-IDF Retriever**

We use a TF-IDF vectorizer to turn each document into a vector.  

This allows us to compute similarity between a user’s query and each row.

We choose TF-IDF for simplicity; in practice one could also use semantic embeddings. Here, we configure the vectorizer to ignore common English stop words.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#TF IDF matrix for documents
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(docs)   #shape: (#docs, vocab_size)

We define a function to retrieve the top-K most relevant documents given a text query.  

It vectorizes the query and computes cosine similarity with all row vectors, then returns the top matches.
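For reference, the quantities involved can be written out as follows. This assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`), which the call in this notebook does not override. Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

Each document vector is then scaled to unit L2 norm. The ranking score between a query vector $q$ and a document vector $d$ is their cosine similarity:

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
$$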

In [6]:
def retrieve_top_docs(query, top_k=3):

    q_vec = vectorizer.transform([query])
    #computes cosine similarity with all documents
    scores = cosine_similarity(q_vec, doc_vectors).flatten()
    #gets indices of top k scores
    top_indices = np.argsort(scores)[-top_k:][::-1]
    #return the documents at those indices
    return [docs[i] for i in top_indices]

#example retrieval test
query = "high income and good credit history"
top_docs = retrieve_top_docs(query, top_k=2)
print("Top retrieved rows for query:", query)
for doc in top_docs:
    print(doc)

Top retrieved rows for query: high income and good credit history
Loan_ID: LP001002, Gender: Male, Married: No, Dependents: 0, Education: Graduate, Self_Employed: No, ApplicantIncome: 5849, CoapplicantIncome: 0.0, LoanAmount: , Loan_Amount_Term: 360.0, Credit_History: 1.0, Property_Area: Urban, Loan_Status: Y
Loan_ID: LP002990, Gender: Female, Married: No, Dependents: 0, Education: Graduate, Self_Employed: Yes, ApplicantIncome: 4583, CoapplicantIncome: 0.0, LoanAmount: 133.0, Loan_Amount_Term: 360.0, Credit_History: 0.0, Property_Area: Semiurban, Loan_Status: N
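One caveat worth checking is how the vectorizer tokenizes these "Field: value" strings. With scikit-learn's default token pattern, underscores count as word characters, so `Credit_History` becomes the single token `credit_history`, and `ApplicantIncome` becomes `applicantincome`. A natural-language query may therefore share no tokens with any row, in which case every score is zero and the rows returned are arbitrary. The sketch below illustrates this on two shortened stand-in documents (`toy_docs` and the other `toy_*` names are illustrative, not part of the notebook).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two shortened stand-in documents in the same "Field: value" format as docs.
toy_docs = [
    "Loan_ID: LP001002, ApplicantIncome: 5849, Credit_History: 1.0, Loan_Status: Y",
    "Loan_ID: LP001003, ApplicantIncome: 4583, Credit_History: 1.0, Loan_Status: N",
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# Underscores count as word characters, so field names stay single tokens.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# None of the query words appear in the vocabulary, so every score is 0.
nl_scores = cosine_similarity(
    toy_vectorizer.transform(["high income and good credit history"]), toy_vectors
).flatten()
print(nl_scores)

# A query that reuses a literal token from the documents does match.
id_scores = cosine_similarity(
    toy_vectorizer.transform(["LP001003"]), toy_vectors
).flatten()
print(id_scores)
```

Printing the scores alongside the retrieved rows is a cheap way to spot this situation. Possible remedies, such as rewriting rows as plain sentences or switching to semantic embeddings, are design choices outside this sketch.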


## **Generating Answers with an LLM**

We use a **pretrained, lightweight generative model** to produce answers: **Hugging Face’s `pipeline` with a small T5 model (google/flan-t5-small)** for text generation.

This runs locally and is free to use. In practice, larger models or APIs (GPT-3.5, Llama2, etc.) could improve quality, but FLAN-T5-small suffices for a demo.

In [None]:
from transformers import pipeline

#Init the text-to-text generation pipeline
generator = pipeline("text2text-generation", model="google/flan-t5-small")

#example: Ask a question and generate an answer
user_query = "Which applicants have loans approved?"
retrieved = retrieve_top_docs(user_query, top_k=3)
prompt = (
    "Based on the following loan records:\n" +
    "\n".join(retrieved) +
    "\nAnswer the question: " + user_query
)
result = generator(prompt, max_length=100)[0]['generated_text']

print("Generated Answer:", result)

This code feeds the retrieved context and question to the model.  
The LLM then outputs a natural-sounding answer.  

For instance, it might say something like “According to the data above, loans are approved for applicants with [some pattern]...”, thereby “referencing” the dataset content.  

(The actual output depends on the model; FLAN may give a generic answer. In a real chatbot, one could further prompt the model to explicitly cite the loan IDs or features to increase trust.)
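As a rough illustration of the citation idea, the prompt can number the retrieved rows and ask the model to name the `Loan_ID` of each record it relies on. `build_cited_prompt` is a hypothetical helper, and the instruction wording is an assumption rather than a tested recipe; a model as small as FLAN-T5-small may not follow it reliably.

```python
def build_cited_prompt(question, rows):
    """Build a prompt that asks the model to cite Loan_IDs (hypothetical helper)."""
    # Number the rows so individual records are easy to refer to.
    numbered = "\n".join(f"[{i}] {row}" for i, row in enumerate(rows, start=1))
    return (
        "Answer the question using only the loan records below. "
        "Mention the Loan_ID of every record you rely on. "
        "If the records do not contain the answer, say so.\n\n"
        f"Loan records:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Shortened example rows in the same format as the notebook's documents.
example_rows = [
    "Loan_ID: LP001002, Married: No, Credit_History: 1.0, Loan_Status: Y",
    "Loan_ID: LP001003, Married: Yes, Credit_History: 1.0, Loan_Status: N",
]
cited_prompt = build_cited_prompt("Which applicants have loans approved?", example_rows)
print(cited_prompt)
```

In the notebook, the rows would come from `retrieve_top_docs(user_query)` and the resulting string would be passed to `generator` in place of `prompt`.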

## **Example Queries**

In [8]:
#sample q&a
queries = [
    "Is a married applicant more likely to get a loan?",
    "What was the loan amount of the applicant with the highest income?",
]
for q in queries:
    top_rows = retrieve_top_docs(q, top_k=3)
    prompt = (
        "Loan records:\n" + "\n".join(top_rows) +
        f"\nQuestion: {q}\nAnswer:"
    )
    ans = generator(prompt, max_length=80)[0]['generated_text']
    print(f"\nQ: {q}\nA: {ans}")

Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Q: Is a married applicant more likely to get a loan?
A: Dependents: 0, Education: Not Graduate, Self_Employed: No, ApplicantIncome: 2000, CoapplicantIncome: 0.0, Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: , Loan_Amount: ,


Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Q: What was the loan amount of the applicant with the highest income?
A: 133.0
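Top-k retrieval only shows the model three rows, so a question about a maximum over the whole table may not be answerable from the retrieved context alone. A direct pandas computation can serve as a cross-check for such aggregate questions. The sketch below runs on a three-row stand-in frame (`sample`, with values copied from the earlier `head()` output); on the full data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Stand-in rows copied from the head() output earlier in the notebook.
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

# Locate the row with the largest ApplicantIncome and read its LoanAmount.
top_row = sample.loc[sample["ApplicantIncome"].idxmax()]
print(top_row["Loan_ID"], top_row["LoanAmount"])
```

In this small sample the highest earner's `LoanAmount` happens to be missing, so the lookup returns `NaN`; that is also a reminder that missing values need handling before such answers are trusted.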


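**Note:** The answers generated depend on how well the model uses the retrieved context. The first answer above degenerates into a repeated `Loan_Amount: ,` fragment, which suggests that FLAN-T5-small struggles with this prompt format. The warning in the output appears because `max_length` is passed while `max_new_tokens` is already set (to 256); as the message states, `max_new_tokens` takes precedence. In practice, one might add instructions like “Answer with reference to the data above.” The key idea is that by augmenting the prompt with actual rows from the dataset, the model’s response can be grounded in that data.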
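The recorded output shows a warning that both `max_new_tokens` and `max_length` were set, and an answer that loops on `Loan_Amount: ,`. One possible adjustment is to pass `max_new_tokens` directly and add the standard repetition controls that `generate()` accepts. The values below are illustrative guesses, not tuned settings. `generate_answer` is a hypothetical helper, and `fake_generator` is a stand-in so the sketch runs without downloading a model.

```python
# Keyword arguments forwarded by the pipeline to the model's generate() call.
GEN_KWARGS = {
    "max_new_tokens": 80,       # cap on generated tokens; used instead of max_length
    "no_repeat_ngram_size": 3,  # forbid repeating any 3-token sequence
    "repetition_penalty": 1.2,  # down-weight tokens that were already produced
}

def generate_answer(generator, prompt, **overrides):
    """Call a text2text pipeline with the settings above (hypothetical helper)."""
    kwargs = {**GEN_KWARGS, **overrides}
    return generator(prompt, **kwargs)[0]["generated_text"]

# Stand-in for the real pipeline; it only reports which settings it received.
def fake_generator(prompt, **kwargs):
    return [{"generated_text": f"(generated with {sorted(kwargs)})"}]

print(generate_answer(fake_generator, "Loan records: ...\nQuestion: ...\nAnswer:"))
```

With the real pipeline, the call would be `generate_answer(generator, prompt)`. Whether these settings actually improve the answers for this dataset would need to be checked by rerunning the queries.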

Summary: This notebook demonstrated a basic RAG pipeline:
  1. treating each table row as a document and building a TF-IDF index for retrieval  
  2. using a generative LLM to answer questions with the retrieved rows as context  

This allows the chatbot to produce natural answers grounded in the loan dataset, which can improve factual accuracy and enables reference to specific data entries.