# Course Recommendation Engine with RAG and ChromaDB (In-Memory) Using text-embedding-3-small

This notebook implements a course recommendation engine for Assignment 2 using a retrieval-only variant of Retrieval-Augmented Generation (RAG). It fetches a course catalog from a provided URL, generates embeddings with Azure OpenAI's `text-embedding-3-small` model, stores them in an in-memory ChromaDB vector store (`chromadb.Client()`, collection `rajcourserecs`), and retrieves relevant courses via semantic search. Because `text-embedding-3-small` only produces embeddings and the notebook uses no text-generation model, the pipeline returns formatted retrieved documents instead of generated, reasoned recommendations. The setup cell checks that the required Azure environment variables are present; `python-dotenv` still warns about one unparseable line in the `.env` file, but the required variables load. The embedding deployment name must match the deployment configured in your Azure resource.

In [23]:
!python -m pip install langchain langchain-openai pandas openai python-dotenv chromadb --quiet

In [24]:
!python -m pip install pypdf --quiet

In [25]:
import os
from dotenv import load_dotenv
from datetime import datetime

# Load environment variables from .env file
load_dotenv()

embedding_model_name = "text-embedding-3-small"
embedding_deployment_name = "text-embedding-3-small"  # Replace with your Azure deployment name for text-embedding-3-small

# Verify environment variables
assert os.environ.get("AZURE_OPENAI_ENDPOINT"), "AZURE_OPENAI_ENDPOINT not set in .env"
assert os.environ.get("AZURE_OPENAI_API_KEY"), "AZURE_OPENAI_API_KEY not set in .env"
# AZURE_OPENAI_API_VERSION is optional: fall back to a default if it is not set
os.environ.setdefault("AZURE_OPENAI_API_VERSION", "2024-06-01")

print("Environment variables loaded successfully.")
print(f"Current date and time: {datetime.now().strftime('%I:%M %p, %B %d, %Y')}")

Python-dotenv could not parse statement starting at line 26


Environment variables loaded successfully.
Current date and time: 01:52 PM, September 23, 2025
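
The `Python-dotenv could not parse statement starting at line 26` warning above points at a specific line of the `.env` file. A minimal sketch for locating such lines is shown here. It assumes a simplified `KEY=VALUE` grammar rather than python-dotenv's actual parser (which also handles quoting and multi-line values), so it may flag lines that python-dotenv accepts. The helper name `find_suspect_env_lines` is hypothetical.

```python
import re

# Simplified pattern: optional "export ", a shell-style name, "=", then anything.
# This is an approximation, not python-dotenv's real grammar.
_ASSIGNMENT = re.compile(r"^\s*(?:export\s+)?[A-Za-z_][A-Za-z0-9_]*\s*=.*$")


def find_suspect_env_lines(text):
    """Return (line_number, line) pairs that are not blank, a comment, or KEY=VALUE."""
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not _ASSIGNMENT.match(line):
            suspects.append((number, line))
    return suspects


# Made-up file content for illustration only.
sample = "\n".join([
    "AZURE_OPENAI_ENDPOINT=https://example.invalid/",
    "# a comment",
    "",
    "this line has no equals sign",
    "export AZURE_OPENAI_API_VERSION=2024-06-01",
])
print(find_suspect_env_lines(sample))  # [(4, 'this line has no equals sign')]
```

To check a real file, pass its contents, for example `find_suspect_env_lines(open(".env").read())`.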


In [26]:
course_catalog_url = "https://raw.githubusercontent.com/Bluedata-Consulting/GAAPB01-training-code-base/refs/heads/main/Assignments/assignment2dataset.csv"

In [27]:
import pandas as pd

# Load the course catalog
try:
    courses = pd.read_csv(course_catalog_url)
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise

# Display first few rows to verify
courses.head()

| Unnamed: 0 | course_id | title | description |
|---|---|---|---|
| 0 | C001 | Foundations of Machine Learning | Understand foundational machine learning algor... |
| 1 | C002 | Deep Learning with TensorFlow and Keras | Explore neural network architectures using Ten... |
| 2 | C003 | Natural Language Processing Fundamentals | Dive into NLP techniques for processing and un... |
| 3 | C004 | Computer Vision and Image Processing | Learn the principles of computer vision and im... |
| 4 | C005 | Reinforcement Learning Basics | Get introduced to reinforcement learning parad... |


In [28]:
# Verify the number of courses loaded
print("Number of courses loaded:")
print(len(courses))

Number of courses loaded:
25


In [29]:
# Display a sample course description
print("Sample course description:")
print(courses['description'][0])

Sample course description:
Understand foundational machine learning algorithms including regression, classification, clustering, and dimensionality reduction. This course covers data pre-processing, feature engineering, model selection, hyperparameter tuning, and evaluation metrics. Hands-on labs use scikit-learn and Python to implement end-to-end workflows on real-world datasets, preparing learners for practical machine learning applications with interactive engaging exercises.


In [30]:
# Check average description length
avg_length = sum(len(str(desc)) for desc in courses['description']) / len(courses)
print("Average description length (characters):", avg_length)

Average description length (characters): 408.96


In [31]:
# Split long descriptions to avoid exceeding embedding limits
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=3500, chunk_overlap=500)
docs = []
doc_ids = []
metadatas = []
for i, (title, desc) in enumerate(zip(courses['title'], courses['description'])):
    desc = str(desc)  # Ensure description is a string
    if len(desc.strip()) > 100:  # Skip short or empty descriptions
        split_docs = text_splitter.split_text(desc)
        for j, split in enumerate(split_docs):
            docs.append(split)
            doc_ids.append(f"{i}_{j}")
            metadatas.append({"title": str(title)})

print("Number of document splits:")
print(len(docs))

Number of document splits:
25
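
With `chunk_size=3500` and an average description length of about 409 characters, descriptions fit in a single chunk, which is consistent with 25 courses producing 25 splits. The sketch below illustrates how `chunk_size` and `chunk_overlap` interact using a plain fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs and spaces); the function name `split_with_overlap` and the 8000-character input are assumptions for illustration.

```python
def split_with_overlap(text, chunk_size=3500, chunk_overlap=500):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short texts pass through as a single chunk
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


short = "x" * 409       # roughly the average description length in this catalog
long_text = "y" * 8000  # a hypothetical, much longer description

print(len(split_with_overlap(short)))                   # 1
print([len(c) for c in split_with_overlap(long_text)])  # [3500, 3500, 2000]
```

Each window after the first starts 3000 characters after the previous one, so consecutive chunks share 500 characters.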


In [32]:
# Display a sample split document
print("Sample document:")
print(f"ID: {doc_ids[0]}")
print(f"Metadata: {metadatas[0]}")
print(f"Content: {docs[0][:200]}...")

Sample document:
ID: 0_0
Metadata: {'title': 'Foundations of Machine Learning'}
Content: Understand foundational machine learning algorithms including regression, classification, clustering, and dimensionality reduction. This course covers data pre-processing, feature engineering, model s...


In [33]:
import os
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    model=embedding_model_name,
    deployment=embedding_deployment_name,
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-06-01"),
)

In [34]:
import chromadb

# Initialize ChromaDB client (in-memory)
chroma_client = chromadb.Client()

# Create collection
try:
    # Delete existing collection if it exists to avoid conflicts
    chroma_client.delete_collection(name="rajcourserecs")
except Exception:
    # The collection did not exist yet, so there is nothing to delete
    pass
collection = chroma_client.create_collection(name="rajcourserecs", metadata={"use_type": "COURSE_RECOMMENDATION"})

# Generate embeddings for documents
try:
    embeddings_list = embeddings.embed_documents(docs)
except Exception as e:
    print(f"Error generating embeddings: {e}")
    raise

# Add documents to the collection
try:
    collection.add(
        documents=docs,
        ids=doc_ids,
        metadatas=metadatas,
        embeddings=embeddings_list
    )
except Exception as e:
    print(f"Error adding documents to collection: {e}")
    raise

In [35]:
# Verify the number of documents in the collection
print("Number of docs in vector DB:")
print(collection.count())

Number of docs in vector DB:
25


In [36]:
# Custom retriever function for ChromaDB
def chroma_retriever(query, k=10):
    try:
        query_embedding = embeddings.embed_query(query)
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=k,
            include=["documents", "metadatas", "distances"]
        )
        # Convert to LangChain Document format for compatibility
        from langchain_core.documents import Document
        return [Document(page_content=doc, metadata=meta)
                for doc, meta in zip(results['documents'][0], results['metadatas'][0])]
    except Exception as e:
        print(f"Error in retriever: {e}")
        return []
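
Conceptually, `collection.query` ranks the stored vectors by their distance to the query embedding and returns the `k` nearest. The brute-force sketch below shows that ranking in plain Python. It assumes squared Euclidean distance, which is Chroma's default when no distance space is configured (worth confirming against the installed version), and it uses made-up 3-dimensional vectors in place of real embeddings. The function names are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=2):
    """Return the k (distance, title) pairs closest to query_vec, nearest first."""
    scored = ((squared_l2(query_vec, vec), title) for title, vec in stored.items())
    return heapq.nsmallest(k, scored)


# Toy 3-dimensional vectors standing in for real 1536-dimensional embeddings.
stored = {
    "Data Visualization with Tableau": [0.9, 0.1, 0.0],
    "R Programming and Statistical Analysis": [0.7, 0.3, 0.1],
    "Reinforcement Learning Basics": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "data visualization courses"

for distance, title in top_k(query_vec, stored):
    print(f"{distance:.2f}  {title}")
```

A real vector store replaces the linear scan with an approximate nearest-neighbor index, but the ordering it aims to return is the same.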

In [37]:
# Test the retriever with a sample query
test_query = "data visualization courses"
retrieved_docs = chroma_retriever(test_query)
print("Number of retrieved documents:")
print(len(retrieved_docs))

Number of retrieved documents:
10


In [38]:
# Display a sample retrieved course
if retrieved_docs:
    print(retrieved_docs[0])
else:
    print("No documents retrieved.")

page_content='Transform raw data into compelling visual stories using Tableau. Learn to connect to diverse data sources, create interactive dashboards, and apply best practices in chart selection. Topics include calculated fields, parameters, LOD expressions, and storytelling features. Through real-world case studies, you’ll design user-driven analytics that reveal trends and drive data-informed decision making.' metadata={'title': 'Data Visualization with Tableau'}


In [39]:
# Display another sample retrieved course
if len(retrieved_docs) > 1:
    print(retrieved_docs[1])
else:
    print("Not enough documents retrieved.")

page_content='Get introduced to R for statistical computing and graphics. Topics include data structures, control flow, and functional programming. Use tidyverse libraries—dplyr, ggplot2, tidyr—for data manipulation and visualization. Explore hypothesis testing, regression analysis, and ANOVA. Through labs, apply statistical methods to real-world datasets and communicate results with reproducible R Markdown reports.' metadata={'title': 'R Programming and Statistical Analysis'}


## Implementing the RAG-like Chain (No Generation)

In [40]:
# Function to format retrieved documents and deduplicate by title
def format_docs(docs):
    unique_docs = []
    seen_titles = set()
    for doc in docs:
        title = doc.metadata['title']
        if title not in seen_titles:
            unique_docs.append(f"Course: {title}\nDescription: {doc.page_content}")
            seen_titles.add(title)
        if len(unique_docs) == 5:
            break
    return "\n\n".join(unique_docs) if unique_docs else "No relevant courses found."
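
Because the retriever always asks for the 10 nearest documents, the `"No relevant courses found."` fallback is only reached when the retriever returns an empty list, for example after an error. One possible way to make the fallback meaningful is to drop results whose distance exceeds a cutoff. The sketch below assumes the nested-list result shape that `collection.query` returns for a single query; the helper `filter_by_distance`, the sample distances, and the cutoff of 1.2 are all hypothetical.

```python
def filter_by_distance(results, max_distance):
    """Keep (title, document, distance) triples whose distance is within max_distance.

    `results` is assumed to have the nested-list shape returned for a single
    query: results["documents"][0], results["metadatas"][0], and so on.
    """
    rows = zip(results["metadatas"][0], results["documents"][0], results["distances"][0])
    return [(meta["title"], doc, dist) for meta, doc, dist in rows if dist <= max_distance]


# Hand-written stand-in for a query result; the distances are made up.
fake_results = {
    "documents": [[
        "Transform raw data into compelling visual stories using Tableau.",
        "Get introduced to reinforcement learning paradigms.",
    ]],
    "metadatas": [[
        {"title": "Data Visualization with Tableau"},
        {"title": "Reinforcement Learning Basics"},
    ]],
    "distances": [[0.85, 1.60]],
}

kept = filter_by_distance(fake_results, max_distance=1.2)  # 1.2 is an arbitrary cutoff
print([title for title, _, _ in kept])  # ['Data Visualization with Tableau']
```

A suitable cutoff depends on the embedding model and distance metric, so it would need to be tuned on real queries.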

In [41]:
# Build the RAG-like chain (no LLM, just retrieval)
def rag_chain(query):
    return format_docs(chroma_retriever(query))

In [42]:
# Test the RAG-like chain with a sample query
try:
    test_response = rag_chain("data visualization courses")
    print("Sample RAG response:")
    print(test_response)
except Exception as e:
    print(f"Error invoking RAG chain: {e}")

Sample RAG response:
Course: Data Visualization with Tableau
Description: Transform raw data into compelling visual stories using Tableau. Learn to connect to diverse data sources, create interactive dashboards, and apply best practices in chart selection. Topics include calculated fields, parameters, LOD expressions, and storytelling features. Through real-world case studies, you’ll design user-driven analytics that reveal trends and drive data-informed decision making.

Course: R Programming and Statistical Analysis
Description: Get introduced to R for statistical computing and graphics. Topics include data structures, control flow, and functional programming. Use tidyverse libraries—dplyr, ggplot2, tidyr—for data manipulation and visualization. Explore hypothesis testing, regression analysis, and ANOVA. Through labs, apply statistical methods to real-world datasets and communicate results with reproducible R Markdown reports.

Course: Big Data Analytics with Spark
Description: Proce

## Evaluation with Sample Queries

We test the RAG-like engine with the five provided sample queries. The system retrieves relevant courses but does not generate reasoned recommendations due to the limitation of using only `text-embedding-3-small`.
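
Relevance here is judged by reading the output. If hand-labeled relevant courses were available for each query, a simple metric such as precision at k could summarize retrieval quality. The sketch below is one way to compute it; the `relevant` labels and the `retrieved` ordering are hypothetical and are not results from this notebook.

```python
def precision_at_k(retrieved_titles, relevant_titles, k=5):
    """Fraction of the first k retrieved titles that appear in relevant_titles."""
    top = retrieved_titles[:k]
    if not top:
        return 0.0
    hits = sum(1 for title in top if title in relevant_titles)
    return hits / len(top)


# Hypothetical ordering and hand labels for one query; a real evaluation
# needs a reviewer's judgment about which courses are relevant.
retrieved = [
    "Python Programming for Data Science",
    "Data Visualization with Tableau",
    "R Programming and Statistical Analysis",
    "Big Data Analytics with Spark",
]
relevant = {"Data Visualization with Tableau", "R Programming and Statistical Analysis"}

print(precision_at_k(retrieved, relevant, k=4))  # 0.5
```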

In [43]:
# Sample queries
sample_queries = [
    "I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization. What should I take next?",
    "I know Azure basics and want to manage containers and build CI/CD pipelines. Recommend courses.",
    "My background is in ML fundamentals; I’d like to specialize in neural networks and production workflows.",
    "I want to learn to build and deploy microservices with Kubernetes—what courses fit best?",
    "I’m interested in blockchain and smart contracts but have no prior experience. Which courses do you suggest?"
]

In [44]:
# Evaluate each query using the RAG-like chain
for i, query in enumerate(sample_queries, 1):
    print(f"\n### Test Profile {i}")
    print(f"**Query:** {query}")
    try:
        response = rag_chain(query)
        print("**Retrieved Courses:**")
        print(response)
    except Exception as e:
        print(f"Error processing query {i}: {e}")
    print("\n**Relevance Evaluation:** Review the retrieved courses to assess relevance. The output lists up to 5 courses based on semantic similarity to the query, but no explanations are generated due to using only text-embedding-3-small.")


### Test Profile 1
**Query:** I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization. What should I take next?
**Retrieved Courses:**
Course: Python Programming for Data Science
Description: Learn Python fundamentals for data science: variables, control flow, functions, and object-oriented programming. Advance to data handling with pandas, numerical computing with NumPy, and basic plotting with matplotlib. You’ll build reproducible data workflows, clean and transform datasets, and perform exploratory analysis, laying the groundwork for machine learning and statistical modeling projects.

Course: Data Visualization with Tableau
Description: Transform raw data into compelling visual stories using Tableau. Learn to connect to diverse data sources, create interactive dashboards, and apply best practices in chart selection. Topics include calculated fields, parameters, LOD expressions, and storytelling features. Through real-world case studies, you’ll 
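
In Test Profile 1 the top hit is the course the user says they have already completed, since its title appears verbatim in the query. A minimal sketch of one possible post-filter is shown below: it drops any retrieved title that is named in the query. The helper `drop_completed` is hypothetical, and the rule is crude: it would also remove a course the user mentions for any other reason.

```python
def drop_completed(query, titles):
    """Remove titles that are named verbatim (case-insensitively) in the query."""
    lowered = query.lower()
    return [title for title in titles if title.lower() not in lowered]


query = (
    "I’ve completed the ‘Python Programming for Data Science’ course "
    "and enjoy data visualization. What should I take next?"
)
titles = ["Python Programming for Data Science", "Data Visualization with Tableau"]

print(drop_completed(query, titles))  # ['Data Visualization with Tableau']
```

Applied before `format_docs` caps the list at five courses, such a filter would free a slot for another candidate.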

## Notes

- The system uses `chromadb.Client()` with an in-memory collection named `rajcourserecs` for the vector store.
- Embeddings are generated using `text-embedding-3-small`, and retrieval is based on semantic similarity.
- Because `text-embedding-3-small` has no text generation capability, the pipeline returns formatted retrieved documents instead of generating reasoned recommendations.
- Ensure the `.env` file is correctly formatted and the `text-embedding-3-small` deployment is set up in the Azure Portal before running.
- If the dataset URL is inaccessible, download the CSV and load it via `pd.read_csv('local_path.csv')`.
- `chromadb.Client()` stores data in memory, so the collection is lost when the notebook session ends.
