# Course Recommendation Engine with RAG and ChromaDB (In-Memory) Using text-embedding-3-small

This notebook implements a course recommendation engine for Assignment 2 using Retrieval-Augmented Generation (RAG). It fetches a course catalog from a provided URL, generates embeddings with Azure OpenAI's `text-embedding-3-small` model, stores them in an in-memory ChromaDB vector store (`chromadb.Client()`, collection `rajcourserecs`), and retrieves relevant courses via semantic search. Because `text-embedding-3-small` only produces embeddings and cannot generate text, the retrieved courses are formatted and passed to a separate Azure OpenAI chat model (`AzureChatOpenAI`), which writes the reasoned recommendations. The setup cell validates the required Azure environment variables after loading `.env`; note that `python-dotenv` still reports a parse warning for one line of the `.env` file used here.

In [24]:
!python -m pip install langchain langchain-openai pandas openai python-dotenv chromadb --quiet

In [25]:
!python -m pip install pypdf --quiet

In [26]:
import os
from dotenv import load_dotenv
from datetime import datetime
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import PromptTemplate

# Load environment variables from .env file
load_dotenv()

embedding_model_name = "text-embedding-3-small"
embedding_deployment_name = "text-embedding-3-small"  # Replace with your Azure deployment name for text-embedding-3-small

# Verify environment variables
assert os.environ.get("AZURE_OPENAI_ENDPOINT"), "AZURE_OPENAI_ENDPOINT not set in .env"
assert os.environ.get("AZURE_OPENAI_API_KEY"), "AZURE_OPENAI_API_KEY not set in .env"
# The API version is optional: fall back to a default when .env does not define it
os.environ.setdefault("AZURE_OPENAI_API_VERSION", "2024-06-01")

print("Environment variables loaded successfully.")
print(f"Current date and time: {datetime.now().strftime('%I:%M %p %Z, %B %d, %Y')}")

Python-dotenv could not parse statement starting at line 26


Environment variables loaded successfully.
Current date and time: 11:52 AM , September 26, 2025
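
The gap before the comma in the timestamp above comes from `%Z`: `datetime.now()` returns a naive datetime with no timezone attached, so `%Z` formats to an empty string. A minimal sketch of the difference, assuming the system's local timezone is what should be displayed:

```python
from datetime import datetime

fmt = "%I:%M %p %Z, %B %d, %Y"

naive = datetime.now()               # no tzinfo attached
aware = datetime.now().astimezone()  # attaches the system's local timezone

print(naive.strftime(fmt))  # %Z expands to an empty string here
print(aware.strftime(fmt))  # %Z expands to the zone name reported by the OS

naive_zone = naive.strftime("%Z")
aware_zone = aware.strftime("%Z")
```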


In [27]:
course_catalog_url = "https://raw.githubusercontent.com/Bluedata-Consulting/GAAPB01-training-code-base/refs/heads/main/Assignments/assignment2dataset.csv"

In [28]:
import pandas as pd

# Load the course catalog
try:
    courses = pd.read_csv(course_catalog_url)
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise

# Display first few rows to verify
courses.head()

| Unnamed: 0 | course_id | title | description |
|---|---|---|---|
| 0 | C001 | Foundations of Machine Learning | Understand foundational machine learning algor... |
| 1 | C002 | Deep Learning with TensorFlow and Keras | Explore neural network architectures using Ten... |
| 2 | C003 | Natural Language Processing Fundamentals | Dive into NLP techniques for processing and un... |
| 3 | C004 | Computer Vision and Image Processing | Learn the principles of computer vision and im... |
| 4 | C005 | Reinforcement Learning Basics | Get introduced to reinforcement learning parad... |


In [29]:
# Verify the number of courses loaded
print("Number of courses loaded:")
print(len(courses))

Number of courses loaded:
25


In [30]:
# Display a sample course description
print("Sample course description:")
print(courses['description'][0])

Sample course description:
Understand foundational machine learning algorithms including regression, classification, clustering, and dimensionality reduction. This course covers data pre-processing, feature engineering, model selection, hyperparameter tuning, and evaluation metrics. Hands-on labs use scikit-learn and Python to implement end-to-end workflows on real-world datasets, preparing learners for practical machine learning applications with interactive engaging exercises.


In [31]:
# Check average description length
avg_length = sum(len(str(desc)) for desc in courses['description']) / len(courses)
print("Average description length (characters):", avg_length)

Average description length (characters): 408.96


In [32]:
# Split long descriptions to avoid exceeding embedding limits
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=3500, chunk_overlap=500)
docs = []
doc_ids = []
metadatas = []
for i, (title, desc) in enumerate(zip(courses['title'], courses['description'])):
    desc = str(desc)  # Ensure description is a string
    if len(desc.strip()) > 100:  # Skip short or empty descriptions
        split_docs = text_splitter.split_text(desc)
        for j, split in enumerate(split_docs):
            docs.append(split)
            doc_ids.append(f"{i}_{j}")
            metadatas.append({"title": str(title)})

print("Number of document splits:")
print(len(docs))

Number of document splits:
25
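
With descriptions averaging about 409 characters and `chunk_size=3500`, no description is long enough to be split, which is consistent with 25 courses producing 25 documents. The sketch below illustrates the size/overlap arithmetic with a simplified fixed-width chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph breaks and spaces); `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 3500, chunk_overlap: int = 500) -> list[str]:
    """Cut text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window reached the end of the text
        start += step
    return chunks


short_desc = "x" * 409   # about the average description length in this catalog
long_desc = "y" * 8000   # hypothetical long description

print(len(chunk_text(short_desc)))  # 1
print(len(chunk_text(long_desc)))   # 3
```

A description would need to exceed 3,500 characters before the overlap setting has any effect.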


In [33]:
# Display a sample split document
print("Sample document:")
print(f"ID: {doc_ids[0]}")
print(f"Metadata: {metadatas[0]}")
print(f"Content: {docs[0][:200]}...")

Sample document:
ID: 0_0
Metadata: {'title': 'Foundations of Machine Learning'}
Content: Understand foundational machine learning algorithms including regression, classification, clustering, and dimensionality reduction. This course covers data pre-processing, feature engineering, model s...


In [34]:
import os
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    model=embedding_model_name,
    deployment=embedding_deployment_name,
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-06-01"),
)

In [35]:
import chromadb

# Initialize ChromaDB client (in-memory)
chroma_client = chromadb.Client()

# Create collection
try:
    # Delete existing collection if it exists to avoid conflicts
    chroma_client.delete_collection(name="rajcourserecs")
except Exception:
    # The collection did not exist yet (for example, on a fresh session); nothing to delete
    pass
collection = chroma_client.create_collection(name="rajcourserecs", metadata={"use_type": "COURSE_RECOMMENDATION"})

# Generate embeddings for documents
try:
    embeddings_list = embeddings.embed_documents(docs)
except Exception as e:
    print(f"Error generating embeddings: {e}")
    raise

# Add documents to the collection
try:
    collection.add(
        documents=docs,
        ids=doc_ids,
        metadatas=metadatas,
        embeddings=embeddings_list
    )
except Exception as e:
    print(f"Error adding documents to collection: {e}")
    raise

In [36]:
# Verify the number of documents in the collection
print("Number of docs in vector DB:")
print(collection.count())

Number of docs in vector DB:
25
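
Conceptually, the collection now holds one vector per course, and a query is answered by finding the stored vectors closest to the query vector. The toy version below uses cosine similarity in plain Python purely for illustration; Chroma's actual index and distance metric depend on how the collection is configured. The three-dimensional vectors are made up, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (title, score) pairs most similar to query_vec."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional vectors standing in for real embeddings.
toy_store = {
    "Data Visualization with Tableau": [0.9, 0.1, 0.0],
    "R Programming and Statistical Analysis": [0.6, 0.4, 0.1],
    "Reinforcement Learning Basics": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]  # pretend embedding of "data visualization courses"

results = top_k(toy_query, toy_store, k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```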


In [37]:
# Custom retriever function for ChromaDB
def chroma_retriever(query, k=10):
    try:
        query_embedding = embeddings.embed_query(query)
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=k,
            include=["documents", "metadatas", "distances"]
        )
        # Convert to LangChain Document format for compatibility
        from langchain_core.documents import Document
        return [Document(page_content=doc, metadata=meta)
                for doc, meta in zip(results['documents'][0], results['metadatas'][0])]
    except Exception as e:
        print(f"Error in retriever: {e}")
        return []
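
The retriever asks Chroma for `distances` but drops them when building the `Document` list. If the scores are useful downstream (for example, to discard weak matches), one option is to unpack the result dictionary as sketched below. `rows_from_query_result` and the `max_distance` cutoff are hypothetical names, and the sample dictionary is made-up data that only mirrors the nested-list shape `collection.query` returns for a single query.

```python
def rows_from_query_result(results, max_distance=None):
    """Flatten a single-query result into (title, distance, text) rows."""
    rows = []
    # Each field holds one inner list per query; index 0 is the only query here.
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        if max_distance is not None and dist > max_distance:
            continue  # a smaller distance means a closer match
        rows.append((meta["title"], dist, doc))
    return rows


fake_results = {
    "documents": [["Transform raw data into visual stories...", "Get introduced to R..."]],
    "metadatas": [[
        {"title": "Data Visualization with Tableau"},
        {"title": "R Programming and Statistical Analysis"},
    ]],
    "distances": [[0.8, 1.3]],
}

rows = rows_from_query_result(fake_results, max_distance=1.0)
for title, dist, _ in rows:
    print(f"{dist:.2f}  {title}")
```

A sensible cutoff depends on the distance metric and the embedding model, so it would need to be tuned against real queries rather than taken from this example.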

In [38]:
# Test the retriever with a sample query
test_query = "data visualization courses"
retrieved_docs = chroma_retriever(test_query)
print("Number of retrieved documents:")
print(len(retrieved_docs))

Number of retrieved documents:
10


In [39]:
# Display a sample retrieved course
if retrieved_docs:
    print(retrieved_docs[0])
else:
    print("No documents retrieved.")

page_content='Transform raw data into compelling visual stories using Tableau. Learn to connect to diverse data sources, create interactive dashboards, and apply best practices in chart selection. Topics include calculated fields, parameters, LOD expressions, and storytelling features. Through real-world case studies, you’ll design user-driven analytics that reveal trends and drive data-informed decision making.' metadata={'title': 'Data Visualization with Tableau'}


In [40]:
# Display another sample retrieved course
if len(retrieved_docs) > 1:
    print(retrieved_docs[1])
else:
    print("Not enough documents retrieved.")

page_content='Get introduced to R for statistical computing and graphics. Topics include data structures, control flow, and functional programming. Use tidyverse libraries—dplyr, ggplot2, tidyr—for data manipulation and visualization. Explore hypothesis testing, regression analysis, and ANOVA. Through labs, apply statistical methods to real-world datasets and communicate results with reproducible R Markdown reports.' metadata={'title': 'R Programming and Statistical Analysis'}


### Implementing RAG Chain with LLM

In [41]:
# Initialize the text-generation LLM
llm = AzureChatOpenAI(model='gpt4o')

In [42]:
# Define the prompt template for the LLM
prompt_template = PromptTemplate(
    input_variables=["query", "context"],
    template="""You are a course recommendation assistant. Based on the user's query and their background, recommend up to 5 relevant courses from the provided course descriptions. Provide a clear, concise explanation for each recommendation, explaining why it suits the user's needs or interests. Ensure the response is tailored to the user's background and interests as mentioned in the query.

User Query: {query}

Retrieved Courses:
{context}

Recommendations:"""
)

In [43]:
# Function to format retrieved documents and deduplicate by title
def format_docs(docs):
    unique_docs = []
    seen_titles = set()
    for doc in docs:
        title = doc.metadata['title']
        if title not in seen_titles:
            unique_docs.append(f"Course: {title}\nDescription: {doc.page_content}")
            seen_titles.add(title)
        if len(unique_docs) == 5:
            break
    return "\n\n".join(unique_docs) if unique_docs else "No relevant courses found."

In [44]:
# Define the RAG chain with LLM call
def rag_chain(query):
    try:
        # Retrieve relevant documents
        retrieved_docs = chroma_retriever(query, k=10)
        if not retrieved_docs:
            return "No relevant courses found."
        
        # Format retrieved documents
        context = format_docs(retrieved_docs)
        
        # Create the prompt with query and context
        prompt = prompt_template.format(query=query, context=context)
        
        # Call the LLM to generate recommendations
        response = llm.invoke(prompt)
        
        # Return the LLM's response
        return response.content
    except Exception as e:
        return f"Error invoking RAG chain: {e}"
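
Because `rag_chain` reaches for the global `chroma_retriever` and `llm`, it can only be exercised against live Azure services. A possible variant, sketched here under the assumption that the retriever and model are passed in as plain callables, can be tested offline with stubs. `make_rag_chain`, `fake_retriever`, `fake_format`, and `fake_llm` are hypothetical names that do not appear in the notebook.

```python
def make_rag_chain(retrieve, format_context, generate, prompt_text):
    """Build a query -> answer function from injected components."""
    def chain(query):
        docs = retrieve(query)
        if not docs:
            return "No relevant courses found."
        prompt = prompt_text.format(query=query, context=format_context(docs))
        return generate(prompt)
    return chain


# Offline stand-ins for the real retriever, formatter, and LLM.
def fake_retriever(query):
    return [("Data Visualization with Tableau", "Build dashboards in Tableau.")]


def fake_format(docs):
    return "\n\n".join(f"Course: {t}\nDescription: {d}" for t, d in docs)


def fake_llm(prompt):
    return f"(stub answer for a prompt of {len(prompt)} characters)"


PROMPT = "User Query: {query}\n\nRetrieved Courses:\n{context}\n\nRecommendations:"

chain = make_rag_chain(fake_retriever, fake_format, fake_llm, PROMPT)
answer = chain("I enjoy data visualization. What next?")
print(answer)

# A retriever that finds nothing short-circuits before the model is called.
empty_chain = make_rag_chain(lambda q: [], fake_format, fake_llm, PROMPT)
print(empty_chain("anything"))
```

With the notebook's own objects, the same factory could presumably be wired as `make_rag_chain(lambda q: chroma_retriever(q, k=10), format_docs, lambda p: llm.invoke(p).content, prompt_template.template)`.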

In [45]:
# Sample queries
sample_queries = [
    "I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization. What should I take next?",
    "I know Azure basics and want to manage containers and build CI/CD pipelines. Recommend courses.",
    "My background is in ML fundamentals; I’d like to specialize in neural networks and production workflows.",
    "I want to learn to build and deploy microservices with Kubernetes—what courses fit best?",
    "I’m interested in blockchain and smart contracts but have no prior experience. Which courses do you suggest?"
]

In [46]:
# Evaluate each query using the RAG chain
for i, query in enumerate(sample_queries, 1):
    print(f"\n### Test Profile {i}")
    print(f"**Query:** {query}")
    try:
        response = rag_chain(query)
        print("**Recommended Courses:**")
        print(response)
    except Exception as e:
        print(f"Error processing query {i}: {e}")
    print("\n**Relevance Evaluation:** The system retrieves up to 5 courses based on semantic similarity and uses an LLM to generate reasoned recommendations tailored to the user's query and background.")
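
The relevance line printed by the loop is a fixed description, not a measurement. A lightweight way to put a number on retrieval quality is to hand-label the titles expected for each query and compute the fraction of retrieved titles that fall in that set. The sketch below assumes such labels exist; the retrieved list and the relevant set are made-up examples, and `precision_at_k` is a hypothetical helper.

```python
def precision_at_k(retrieved_titles, relevant_titles, k=5):
    """Fraction of the first k retrieved titles that appear in the relevant set."""
    top = retrieved_titles[:k]
    if not top:
        return 0.0
    hits = sum(1 for title in top if title in relevant_titles)
    return hits / len(top)


# Made-up retrieval result and hand-written labels for one query.
retrieved = [
    "Data Visualization with Tableau",
    "R Programming and Statistical Analysis",
    "Reinforcement Learning Basics",
    "Foundations of Machine Learning",
]
relevant = {
    "Data Visualization with Tableau",
    "R Programming and Statistical Analysis",
}

score = precision_at_k(retrieved, relevant, k=4)
print(f"precision@4 = {score:.2f}")
```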


### Test Profile 1
**Query:** I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization. What should I take next?
**Recommended Courses:**
Based on your completion of the 'Python Programming for Data Science' course and your interest in data visualization, here are five course recommendations that will help you advance your skills in data visualization and expand your data analytics capabilities:

1. **Data Visualization with Tableau**
   - **Why it suits you:** This course will build on your existing Python skills by equipping you with the ability to create interactive and compelling visual representations of data using Tableau. It focuses on practical applications and storytelling through data, which is essential for any data professional. Learning Tableau will also enhance your ability to communicate insights effectively to various stakeholders.

2. **R Programming and Statistical Analysis**
   - **Why it suits you:** While this course introduces 

## Notes

- The system uses `chromadb.Client()` with an in-memory collection named 'rajcourserecs' for the vector store.
- Embeddings are generated using `text-embedding-3-small`, and retrieval is based on semantic similarity.
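- Because `text-embedding-3-small` only produces embeddings, text generation is handled by a separate chat model (`AzureChatOpenAI`), which turns the retrieved courses into reasoned recommendations.
- The `python-dotenv` message "could not parse statement starting at line 26" still appears when the setup cell runs, which points to a malformed line in the .env file. The required variables are validated with assertions after loading, so the notebook proceeds as long as they are present.
- Ensure the .env file is correctly formatted and that both the `text-embedding-3-small` deployment and the chat model deployment are set up in the Azure Portal before running.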
- If the dataset URL is inaccessible, download the CSV and load it via `pd.read_csv('local_path.csv')`.
- Note: `chromadb.Client()` stores data in memory, so the collection resets if the notebook session ends.
- Potential improvement: Add a precomputed explanation field to the dataset if available, though this requires external setup.
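
As a follow-up to the .env formatting note, a rough pre-check can flag lines that do not look like `KEY=value` before `load_dotenv()` runs. This is a heuristic sketch, not python-dotenv's actual grammar, which is more permissive; `suspicious_env_lines` is a hypothetical helper, and the sample content uses placeholder values.

```python
import re

# KEY=value, optionally prefixed with "export "; KEY uses letters, digits, underscores.
_ASSIGNMENT = re.compile(r"^(export\s+)?[A-Za-z_][A-Za-z0-9_]*\s*=")


def suspicious_env_lines(text):
    """Return (line_number, line) pairs that are not blank, comments, or KEY=value."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not _ASSIGNMENT.match(stripped):
            flagged.append((number, line))
    return flagged


sample = """# Azure settings (placeholder values)
AZURE_OPENAI_ENDPOINT=https://example.invalid/
AZURE_OPENAI_API_KEY=placeholder
this line has no equals sign
"""

flagged = suspicious_env_lines(sample)
print(flagged)  # [(4, 'this line has no equals sign')]
```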