# Assignment 2: Personalized Course Recommendation Engine

## 1. Background & Context
Online learning platforms host thousands of courses across domains, making it difficult for learners to pick the right course.  
A personalized recommender that understands both course content and user profiles can suggest the most relevant courses and improve engagement.

This notebook demonstrates how to:

- Load the course dataset
- Generate embeddings using Gemini
- Build a vector store using Chroma
- Construct a RetrievalQA chain for semantic search and recommendations

## 2. Install Required Libraries

In [1]:
%pip install --upgrade langchain langchain-community langchain-google-genai mlflow chromadb python-dotenv pandas

Note: you may need to restart the kernel to use updated packages.


## 3. Import Libraries

In [4]:
# Standard libraries
import pandas as pd
from typing import List, Dict

# LangChain imports
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma  # Chroma integration lives in langchain-community
from langchain_core.documents import Document
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.chat_models import init_chat_model


# MLflow for experiment tracking
import mlflow

# Import dotenv and other required libraries
from dotenv import load_dotenv
import os
import google.generativeai as genai

# Load environment variables from .env file
load_dotenv()

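# Fetch the API key and fail early if it is missing
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file.")

# Configure Gemini with the API key
genai.configure(api_key=api_key)

print("Gemini API key loaded and configured successfully.")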


Gemini API key loaded and configured successfully.


## 4. Configure Models
Set up the embedding and generative model names.

In [20]:
embedding_model_name = "models/gemini-embedding-001"
model_name = "gemini-2.0-flash"

## 5. Load Course Dataset
Load the course catalog CSV from the provided URL.

In [6]:
data_url = "https://raw.githubusercontent.com/Bluedata-Consulting/GAAPB01-training-code-base/refs/heads/main/Assignments/assignment2dataset.csv"
df_courses = pd.read_csv(data_url)

# Display the first few rows
df_courses.head()

|   | course_id | title | description |
|---|-----------|-------|-------------|
| 0 | C001 | Foundations of Machine Learning | Understand foundational machine learning algor... |
| 1 | C002 | Deep Learning with TensorFlow and Keras | Explore neural network architectures using Ten... |
| 2 | C003 | Natural Language Processing Fundamentals | Dive into NLP techniques for processing and un... |
| 3 | C004 | Computer Vision and Image Processing | Learn the principles of computer vision and im... |
| 4 | C005 | Reinforcement Learning Basics | Get introduced to reinforcement learning parad... |


## 6. Prepare Course Documents
Combine course title and description, then convert to LangChain Document objects.

In [7]:
# Combine title + description for embedding
df_courses["content"] = df_courses["title"] + ". " + df_courses["description"]

# Convert to LangChain Document objects
course_documents = [
    Document(page_content=text, metadata={"course_id": cid})
    for text, cid in zip(df_courses["content"], df_courses["course_id"])
]

# Split long documents into chunks (optional here, because each course text is short)
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(course_documents)

print(f"Prepared {len(texts)} course documents for embedding.")

Prepared 25 course documents for embedding.
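
The title and description concatenation above assumes every row has a description. With pandas, a missing description turns the whole concatenated value into `NaN`, which would likely be rejected as `page_content`. The sketch below shows one way to guard against that. The helper name `build_content` and the sample rows are hypothetical, and the dataset used here may not contain any missing values.

```python
import math
from typing import Optional


def build_content(title: str, description: Optional[object]) -> str:
    """Join title and description, tolerating a missing description."""
    # pandas represents a missing cell as a float NaN; treat it like None.
    if description is None or (isinstance(description, float) and math.isnan(description)):
        return title.strip()
    text = str(description).strip()
    # Fall back to the title alone when the description is empty.
    return f"{title.strip()}. {text}" if text else title.strip()


# Hypothetical rows for illustration only (not taken from the dataset).
rows = [
    ("Sample Course A", "An example description."),
    ("Sample Course B", float("nan")),
    ("Sample Course C", ""),
]
contents = [build_content(title, desc) for title, desc in rows]
print(contents)
```

In the notebook, a helper like this could be applied row by row to build the `content` column in place of the direct string concatenation.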


## 7. Create Embeddings
Initialize the Gemini embedding model. The embeddings themselves are computed in the next step, when the documents are added to the vector store.

In [8]:
# Initialize embeddings model
embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model_name)

## 8. Build Chroma Vector Store
Use Chroma to index course embeddings for fast semantic similarity search.

In [9]:
# Initialize Chroma vector store from documents and embeddings
docsearch = Chroma.from_documents(texts, embeddings)

print(f"Vector store created with {len(texts)} documents.")

Vector store created with 25 documents.
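
To make "semantic similarity search" concrete, here is a minimal pure-Python sketch of what a vector store does conceptually: score every stored vector against the query vector and keep the best `k`. It is not how Chroma is implemented. Cosine similarity is used as one common choice (Chroma's actual metric and index are configurable and may differ), and the three-dimensional vectors and `course_*` keys are toy values invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: List[float], index: Dict[str, List[float]], k: int = 4
) -> List[Tuple[str, float]]:
    """Brute-force search: score every entry, sort descending, keep the first k."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical); real Gemini embeddings have far more dimensions.
toy_index = {
    "course_a": [1.0, 0.0, 0.0],
    "course_b": [0.0, 1.0, 0.0],
    "course_c": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.1, 0.0]
ranked = top_k_similar(toy_query, toy_index, k=2)
print([key for key, _ in ranked])
```

A real vector store replaces the brute-force loop with an approximate nearest-neighbor index so that lookups stay fast as the collection grows.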


## 9. Initialize Language Model
Use the Gemini generative model to produce recommendations based on the retrieved courses.

In [10]:
llm = init_chat_model(model_name, model_provider="google_genai")

## 10. Construct RetrievalQA Chain
This chain uses the vector store to retrieve relevant courses and the LLM to generate a human-readable answer.

In [11]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # combine all retrieved content
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)

print("RetrievalQA chain is ready.")

RetrievalQA chain is ready.
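
The `"stuff"` chain type places all retrieved documents into a single prompt. The sketch below illustrates that idea in plain Python. The function name `build_stuffed_prompt` and the template wording are hypothetical and are not the prompt LangChain actually uses.

```python
from typing import List


def build_stuffed_prompt(question: str, retrieved_texts: List[str]) -> str:
    """Concatenate all retrieved texts into a single prompt (the "stuff" idea)."""
    # Every retrieved document goes into one context block, separated by blank lines.
    context = "\n\n".join(retrieved_texts)
    # Hypothetical template wording, not the one LangChain ships with.
    return (
        "Use the following course descriptions to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder texts for illustration only (not taken from the dataset).
example_texts = [
    "Sample Course A. First description.",
    "Sample Course B. Second description.",
]
prompt = build_stuffed_prompt("Which course should I take next?", example_texts)
print(prompt)
```

Because everything is sent in one call, this approach suits short documents like course descriptions. Long documents could exceed the model's context window.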


## 11. Define Interactive Recommendation Function

This function takes a user profile as a free-text query and returns the LLM's text recommendation along with the course IDs of the retrieved documents.

In [17]:
def recommend_courses(profile: str, top_k: int = 5) -> Dict:
    """
    Recommend courses based on user profile.
    
    Parameters
    ----------
    profile : str
        User's completed courses + interests.
    top_k : int
        Maximum number of course IDs to return. The result is also capped by
        the number of documents the retriever returns (4 by default).
    
    Returns
    -------
    Dict
        - 'answer': text recommendation
        - 'recommended_course_ids': list of top course IDs
    """
    # Run the RetrievalQA chain ("query" is its input key)
    result = qa.invoke({"query": profile})
    
    # Extract course_ids from retrieved documents
    retrieved_courses = [doc.metadata["course_id"] for doc in result["source_documents"]][:top_k]
    
    return {
        "answer": result["result"],
        "recommended_course_ids": retrieved_courses
    }
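
If course texts were ever split into several chunks, the same `course_id` could appear more than once among the retrieved documents. A minimal sketch of de-duplicating the IDs while preserving retrieval order follows. The helper name `unique_top_k` and the sample ID list are hypothetical.

```python
from typing import Iterable, List


def unique_top_k(course_ids: Iterable[str], top_k: int) -> List[str]:
    """Drop repeated IDs, keep first-seen order, and return at most top_k of them."""
    # dict.fromkeys preserves insertion order, so the retrieval ranking is kept.
    unique_ids = list(dict.fromkeys(course_ids))
    # Guard against negative values, which would otherwise slice from the end.
    return unique_ids[: max(top_k, 0)]


# Hypothetical retrieval result containing repeated IDs.
sample_ids = ["C016", "C016", "C014", "C017", "C014"]
deduplicated = unique_top_k(sample_ids, top_k=3)
print(deduplicated)
```

In `recommend_courses`, a helper like this could wrap the list comprehension that extracts `course_id` values from `result["source_documents"]`.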

## 12. Test the Interactive Function

In [21]:
# user query
user_query = "I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization."

recommendation = recommend_courses(user_query)

print("Recommended Answer:\n", recommendation["answer"])
print("Top Course IDs:\n", recommendation["recommended_course_ids"])


Recommended Answer:
 Given your interest in data visualization and your completion of the "Python Programming for Data Science" course, you might find the "Data Visualization with Tableau" course a good fit. You could also consider learning more about the matplotlib library in Python.
Top Course IDs:
 ['C016', 'C014', 'C017', 'C012']


## 13. Evaluation Report: Test Recommendations for 5 Profiles

We will run the recommendation engine for 5 sample user profiles, display the top recommended courses, and add brief comments on relevance.

In [19]:
# Define the 5 test profiles
test_profiles = [
    "I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization.",
    "I know Azure basics and want to manage containers and build CI/CD pipelines.",
    "My background is in ML fundamentals; I’d like to specialize in neural networks and production workflows.",
    "I want to learn to build and deploy microservices with Kubernetes—what courses fit best?",
    "I’m interested in blockchain and smart contracts but have no prior experience. Which courses do you suggest?"
]

# Prepare an empty list to store evaluation results
evaluation_results = []

# Run recommendation for each profile
for idx, profile in enumerate(test_profiles, 1):
    rec = recommend_courses(profile)
    
    # Placeholder label: this only checks that at least one course was retrieved,
    # not whether the retrieved courses are actually relevant
    relevance_comment = "Highly relevant" if rec["recommended_course_ids"] else "Needs review"
    
    evaluation_results.append({
        "Profile No.": idx,
        "User Query": profile,
        "Recommended Courses": rec["recommended_course_ids"],
        "Answer Text": rec["answer"],
        "Relevance Comment": relevance_comment
    })

# Convert to DataFrame for display
eval_report_df = pd.DataFrame(evaluation_results)
eval_report_df

|   | Profile No. | User Query | Recommended Courses | Answer Text | Relevance Comment |
|---|-------------|------------|---------------------|-------------|-------------------|
| 0 | 1 | I’ve completed the ‘Python Programming for Dat... | [C016, C014, C017, C012] | Given your interest in data visualization afte... | Highly relevant |
| 1 | 2 | I know Azure basics and want to manage contain... | [C008, C007, C009, C025] | Based on your interest in managing containers ... | Highly relevant |
| 2 | 3 | My background is in ML fundamentals; I’d like ... | [C025, C002, C005, C001] | Based on your background in ML fundamentals an... | Highly relevant |
| 3 | 4 | I want to learn to build and deploy microservi... | [C009, C010, C008, C007] | Based on the provided course descriptions, the... | Highly relevant |
| 4 | 5 | I’m interested in blockchain and smart contrac... | [C023, C021, C022, C024] | Based on your interest in blockchain and smart... | Highly relevant |


### Notes on Relevance Comments

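- **Highly relevant:** Assigned automatically whenever the retriever returns at least one course. The label does not measure how closely the courses match the user's completed courses and stated interests, so confirm the match manually.
- **Needs review:** Assigned when no courses are retrieved. Retrieved lists can still include irrelevant or duplicate courses and should be checked manually.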
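
The automatic label in the evaluation loop only checks that something was retrieved. One way to measure relevance directly is precision@k against hand-labeled relevant courses. The sketch below assumes such labels exist. The `hypothetical_relevant` set is made up purely to show the arithmetic and is not a real judgment about the dataset.

```python
from typing import List, Set


def precision_at_k(recommended: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k recommended IDs that appear in the relevant set."""
    top = recommended[:k]
    if not top:
        return 0.0
    hits = sum(1 for course_id in top if course_id in relevant)
    return hits / len(top)


# IDs retrieved for the first test profile in this notebook.
retrieved = ["C016", "C014", "C017", "C012"]
# Made-up labels for illustration only; real labels need a human reviewer.
hypothetical_relevant = {"C016", "C014"}

score = precision_at_k(retrieved, hypothetical_relevant, k=4)
print(f"precision@4 = {score:.2f}")
```

With labels for all five profiles, the per-profile scores could replace or sit alongside the "Relevance Comment" column in the report.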