<a href="https://colab.research.google.com/github/uxama-jamil/Langchain-RAG-LLM/blob/master/LM_NewHire_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ✅ Install Required Libraries

This cell installs all necessary Python packages:
- `langchain`, `langchain-community`, `langchain-huggingface`, `langchain-groq` for building LLM applications.
- `chromadb` for the vector store.
- `huggingface-hub`, `pandas`, and `tiktoken` for embeddings, data handling, and token management.


In [1]:
!pip install langchain langchain-community langchain-huggingface langchain-groq chromadb huggingface-hub pandas tiktoken


Collecting langchain-community
  Downloading langchain_community-0.3.23-py3-none-any.whl.metadata (2.5 kB)
Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Collecting langchain-groq
  Downloading langchain_groq-0.3.2-py3-none-any.whl.metadata (2.6 kB)
Collecting chromadb
  Downloading chromadb-1.0.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting groq<1,>=0.4.1 (fr
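
Because the pip log is long, it can help to confirm afterwards which versions were actually resolved. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Same package list as the pip install cell above.
    required = [
        "langchain",
        "langchain-community",
        "langchain-huggingface",
        "langchain-groq",
        "chromadb",
        "huggingface-hub",
        "pandas",
        "tiktoken",
    ]
    for name, ver in installed_versions(required).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` suggests the install cell should be re-run before continuing.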

### 📥 Download Employees Dataset

The `download_meta` function downloads a CSV file (`Employees.csv`) from Google Drive using a file ID, unless the file already exists locally. The dataset lists employees who are potential line managers.


In [2]:
# Download a file from Google Drive by file ID, unless it already exists locally.
import os

import requests


def download_meta(filename, file_id):
    # Skip the download when the file is already present.
    if os.path.exists(filename):
        print(f"📄 '{filename}' already exists.")
        return

    # Direct-download URL for a shared Drive file.
    url = f"https://drive.google.com/uc?export=download&id={file_id}"

    # A timeout keeps the cell from hanging on a stalled connection.
    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"✅ '{filename}' downloaded successfully!")
    else:
        print(f"❌ Failed to download '{filename}' (HTTP {response.status_code})")


# Download Employees.csv from the shared Google Drive link
download_meta("Employees.csv", "1wYNx4Xaf4wEj8nfTEn22XyWs4_QQV6eU")

✅ 'Employees.csv' downloaded successfully!
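
A 200 response does not by itself guarantee that the saved file is a CSV. For files that are not publicly shared, Google Drive may return an HTML page instead. The sketch below checks the header row for the column names the notebook reads later (`Name`, `Tech Stack`, `Year of experience`, `Division`). The helper `missing_columns` and the sample bytes are hypothetical and only illustrate the check.

```python
import csv
import io

# Column names the notebook reads later in load_employees_as_documents.
EXPECTED_COLUMNS = {"Name", "Tech Stack", "Year of experience", "Division"}


def missing_columns(raw_bytes, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the CSV header row."""
    # utf-8-sig drops a leading byte-order mark if the file has one.
    text = raw_bytes.decode("utf-8-sig", errors="replace")
    header = next(csv.reader(io.StringIO(text)), [])
    present = {column.strip() for column in header}
    return sorted(set(expected) - present)


# Invented sample payloads, not taken from the real dataset.
good_payload = b"Name,Tech Stack,Year of experience,Division\nA,Python,5,Data\n"
html_payload = b"<!DOCTYPE html><html><body>Sign in</body></html>"

print(missing_columns(good_payload))  # no columns missing
print(missing_columns(html_payload))  # every expected column missing
```

In the notebook, the same check could be applied to the bytes read from `Employees.csv` right after the download.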


In [None]:
# Remove the db directory so the vector store is rebuilt from scratch
!rm -rf db/

### ⚙️ Setup Directories and File Paths

Imports the LangChain components used in the rest of the notebook and defines paths for:
- the CSV file location (`Employees.csv`)
- the local directory to store the Chroma vector database (`line_manager_chroma_db`)


In [3]:
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.docstore.document import Document
import pandas as pd
import os

# --- Configuration ---
current_dir = os.path.dirname('/content/')
data_path = os.path.join(current_dir, "Employees.csv")
db_dir = os.path.join(current_dir, "db")
persistent_directory = os.path.join(db_dir, "line_manager_chroma_db")
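
To make the resulting layout explicit, the sketch below evaluates the same path expressions. It uses `posixpath` (the implementation behind `os.path` on Linux, which Colab runs on) so that the output is the same on any operating system; that substitution is an assumption made for illustration only.

```python
import posixpath

# Same expressions as the configuration cell, with posixpath standing in for os.path.
current_dir = posixpath.dirname("/content/")
data_path = posixpath.join(current_dir, "Employees.csv")
db_dir = posixpath.join(current_dir, "db")
persistent_directory = posixpath.join(db_dir, "line_manager_chroma_db")

print(current_dir)           # /content
print(data_path)             # /content/Employees.csv
print(persistent_directory)  # /content/db/line_manager_chroma_db
```

Assuming the notebook's working directory is `/content`, the `db/` directory removed by the earlier `rm -rf db/` cell is the same `db_dir` defined here.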

### 🧠 Prompt Template for LLM

Defines a prompt template that takes two inputs:
- `documents`: the retrieved employee records
- `question`: the new hire’s profile

It asks the model to recommend the most suitable line manager based on Tech Stack, Years of Experience, and Division.


In [4]:
prompt = PromptTemplate(
    template="""
    You are an assistant to match new hires with line managers.
    Based on the new hire's Tech Stack, Years of Experience, and Division,
    suggest the best matching line manager from the available documents.

    Documents:
    {documents}

    New Hire Details:
    {question}

    Recommend the most suitable Line Manager Name:
    """,
    input_variables=["question", "documents"],
)
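
To see what the model actually receives, the sketch below fills the same two placeholders with plain `str.format`. This is a stand-in for `PromptTemplate`, not the library itself. The names `TEMPLATE` and `render_prompt` and the sample employee and new-hire strings are invented for illustration.

```python
# Same wording as the notebook's template, kept as a plain string.
TEMPLATE = """
You are an assistant to match new hires with line managers.
Based on the new hire's Tech Stack, Years of Experience, and Division,
suggest the best matching line manager from the available documents.

Documents:
{documents}

New Hire Details:
{question}

Recommend the most suitable Line Manager Name:
"""


def render_prompt(documents, question):
    """Substitute both placeholders, as the prompt step would before calling the model."""
    return TEMPLATE.format(documents=documents, question=question)


# Invented sample values, not rows from Employees.csv.
sample_documents = "Name: Jane Doe, Tech Stack: Python, Experience: 8 years, Division: Data"
sample_question = "Tech Stack: Python, Experience: 1 years, Division: Data"

rendered = render_prompt(sample_documents, sample_question)
print(rendered)
```

Printing the rendered prompt like this is a quick way to check that the retrieved records and the new hire details land in the intended sections.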

### 🔐 Load Groq LLM (LLaMA 3.1)

Initializes `ChatGroq` with the LLaMA 3.1 8B Instant model (`llama-3.1-8b-instant`), reading the API key from Colab’s `userdata` secrets. The prompt, model, and output parser are then combined into a single `rag_chain`.


In [5]:
from google.colab import userdata

llm = ChatGroq(api_key=userdata.get('GROQ_API_KEY'), model_name="llama-3.1-8b-instant")

rag_chain = prompt | llm | StrOutputParser()
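
The `|` operator chains the three steps so that each one's output becomes the next one's input. The sketch below imitates that idea with a tiny hypothetical `Step` class and stand-in functions. It is an illustration of the composition pattern under those assumptions, not LangChain's implementation, and it makes no API calls.

```python
class Step:
    """A callable wrapper that supports `a | b` composition (illustrative only)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Run this step first, then feed its output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-ins for the prompt, the model, and the output parser.
fake_prompt = Step(lambda inputs: f"Documents: {inputs['documents']} | Hire: {inputs['question']}")
fake_llm = Step(lambda text: {"content": f"(model reply to: {text})"})
fake_parser = Step(lambda message: message["content"])

chain = fake_prompt | fake_llm | fake_parser
result = chain.invoke({"documents": "doc A", "question": "hire B"})
print(result)
```

The stand-in parser plays the role of `StrOutputParser` here: it reduces the model's message object to a plain string.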

### 🚀 RAGApplication Class

Wraps the retrieval and generation logic in a single `run()` method, which:
- fetches relevant documents using the retriever
- joins their text and passes it into the LLM prompt to generate a line manager recommendation


In [None]:
class RAGApplication:
    def __init__(self, retriever, rag_chain):
        self.retriever = retriever
        self.rag_chain = rag_chain

    def run(self, question):
        documents = self.retriever.invoke(question)
        doc_texts = "\n".join([doc.page_content for doc in documents])
        answer = self.rag_chain.invoke({"question": question, "documents": doc_texts})
        return answer
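
Because `RAGApplication` only relies on an `invoke` method on each collaborator, it can be exercised without a vector store or an API key. The sketch below repeats the class so it runs on its own and plugs in hypothetical stubs (`FakeDocument`, `FakeRetriever`, `RecordingChain`) that are not part of the notebook.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Minimal stand-in exposing the one attribute run() reads."""
    page_content: str


class FakeRetriever:
    """Returns a fixed list of documents regardless of the question."""

    def __init__(self, documents):
        self.documents = documents

    def invoke(self, question):
        return self.documents


class RecordingChain:
    """Records the inputs it receives and returns a canned answer."""

    def __init__(self):
        self.last_inputs = None

    def invoke(self, inputs):
        self.last_inputs = inputs
        return "stub answer"


class RAGApplication:
    """Same shape as the notebook's class, repeated so this sketch runs alone."""

    def __init__(self, retriever, rag_chain):
        self.retriever = retriever
        self.rag_chain = rag_chain

    def run(self, question):
        documents = self.retriever.invoke(question)
        doc_texts = "\n".join([doc.page_content for doc in documents])
        return self.rag_chain.invoke({"question": question, "documents": doc_texts})


stub_chain = RecordingChain()
app = RAGApplication(FakeRetriever([FakeDocument("first"), FakeDocument("second")]), stub_chain)
answer = app.run("Tech Stack: Python")
print(answer)
print(stub_chain.last_inputs)
```

The recorded inputs show that the retrieved documents reach the chain as a single newline-joined string.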

### 📊 Load Employee Data from CSV

Reads `Employees.csv` and converts each row into a `Document` that carries both a readable text summary and metadata (Name, Tech Stack, Experience, Division) for use in vector search.


In [None]:
# --- Load CSV Data ---
def load_employees_as_documents(csv_path):
    df = pd.read_csv(csv_path)
    documents = []
    for _, row in df.iterrows():
        # Readable one-line summary of the employee; this text is what gets embedded.
        content = (
            f"Name: {row['Name']}, "
            f"Tech Stack: {row['Tech Stack']}, "
            f"Experience: {row['Year of experience']} years, "
            f"Division: {row['Division']}"
        )
        metadata = {
            "name": row["Name"],
            "tech_stack": row["Tech Stack"],
            "experience": row["Year of experience"],
            "division": row["Division"],
        }
        documents.append(Document(page_content=content, metadata=metadata))
    return documents
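
The row-to-text conversion can be tried without pandas or the real dataset. The sketch below uses the standard `csv` module on an invented two-row sample with the same column names, and produces roughly the same text layout and metadata keys. The `row_to_record` helper and the sample rows are hypothetical.

```python
import csv
import io

# Invented sample with the column names the notebook expects.
SAMPLE_CSV = """Name,Tech Stack,Year of experience,Division
Jane Doe,"Python, Django",8,Data
John Roe,React,5,Web
"""


def row_to_record(row):
    """Build the (content, metadata) pair for one CSV row."""
    content = (
        f"Name: {row['Name']}, "
        f"Tech Stack: {row['Tech Stack']}, "
        f"Experience: {row['Year of experience']} years, "
        f"Division: {row['Division']}"
    )
    metadata = {
        "name": row["Name"],
        "tech_stack": row["Tech Stack"],
        # csv yields strings, so convert to int to keep the value numeric.
        "experience": int(row["Year of experience"]),
        "division": row["Division"],
    }
    return content, metadata


records = [row_to_record(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for content, metadata in records:
    print(content)
    print(metadata)
```

Keeping the fields in metadata as well as in the text leaves room for filtering on them later, while the text summary is what the embedding model sees.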

### 💾 Setup or Load Chroma Vector Store

Checks whether a Chroma vector store already exists in the persistent directory:
- If not, it splits the documents into chunks and creates a new vector store with HuggingFace embeddings (`all-MiniLM-L6-v2`).
- If it exists, it loads the existing vector store.

It then sets up a retriever that returns up to 4 documents whose similarity score is at least 0.1 (`k=4`, `score_threshold=0.1`) and wires it into `RAGApplication`.


In [None]:
# --- Setup VectorStore ---
if not os.path.exists(persistent_directory):
    print("Vectorstore not found. Initializing...")

    documents = load_employees_as_documents(data_path)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=200
    )
    doc_splits = []
    for doc in documents:
        splits = text_splitter.split_text(doc.page_content)
        for split in splits:
            doc_splits.append(Document(page_content=split, metadata=doc.metadata))

    embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    vectorstore = Chroma.from_documents(
        doc_splits, embedding, persist_directory=persistent_directory
    )
    print("Vectorstore created and persisted!")
else:
    print("Vectorstore already exists. Loading...")

    embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = Chroma(persist_directory=persistent_directory, embedding_function=embedding)

    print("Vectorstore Loaded!")

retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 4, "score_threshold": 0.1},
)

rag_application = RAGApplication(retriever, rag_chain)
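
The effect of `k` and `score_threshold` can be illustrated with toy vectors. The sketch below scores candidates with plain cosine similarity, drops those below the threshold, and keeps the top `k`. This is an assumption for illustration: the relevance scores that LangChain derives from Chroma's distances may be normalized differently, so the numbers here are not comparable to the notebook's. The function names and the toy index are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, indexed, k=4, score_threshold=0.1):
    """Return up to k (score, label) pairs whose score is at least score_threshold."""
    scored = [(cosine_similarity(query_vector, vec), label) for label, vec in indexed]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return kept[:k]


# Toy 3-dimensional "embeddings"; real ones come from all-MiniLM-L6-v2.
toy_index = [
    ("python-backend", [1.0, 0.0, 0.0]),
    ("python-data", [0.8, 0.6, 0.0]),
    ("frontend", [0.0, 0.0, 1.0]),
]

hits = retrieve([1.0, 0.0, 0.0], toy_index)
print([label for _, label in hits])
```

With a threshold as low as 0.1, only clearly unrelated entries are filtered out; raising it trades recall for precision.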

### 🧪 Run RAG Matching Loop

Runs an interactive loop that:
- reads the new hire's Tech Stack, Years of Experience, and Division
- passes the combined description to the RAG pipeline
- prints the best matching line manager from the embedded knowledge base

Type "exit" at any prompt to quit the loop.


In [None]:
# --- Query Loop ---
print("\nReady to match new hires! Type 'exit' to quit.")
while True:
    print("\nPlease input the new hire details:")
    tech_stack = input("Tech Stack: ")
    if tech_stack.lower() == "exit":
        break
    experience = input("Years of Experience: ")
    if experience.lower() == "exit":
        break

    division = input("Division: ")
    if division.lower() == "exit":
        break

    new_hire_description = f"Tech Stack: {tech_stack}, Experience: {experience} years, Division: {division}"
    answer = rag_application.run(new_hire_description)
    print("\nRecommended Line Manager:", answer)
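
The string-building part of the loop can be separated from the `input()` calls so that it is easy to test and reuse. The sketch below is one possible way to do that. The helpers `is_exit` and `build_description` are hypothetical, and the whitespace stripping and empty-field check are additions that the loop above does not perform.

```python
def is_exit(value):
    """True when the user typed 'exit', ignoring case and surrounding spaces."""
    return value.strip().lower() == "exit"


def build_description(tech_stack, experience, division):
    """Combine the three answers into the query string format used by the loop."""
    fields = [tech_stack.strip(), experience.strip(), division.strip()]
    if any(not field for field in fields):
        raise ValueError("Tech Stack, Years of Experience, and Division are all required.")
    return f"Tech Stack: {fields[0]}, Experience: {fields[1]} years, Division: {fields[2]}"


description = build_description(" Python ", "3", "Data")
print(description)
```

The resulting string has the same shape as the document text stored in the vector store, which keeps the query and the indexed records comparable.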
