# **Model Development and Evaluation**

# **SAMUEL ABOYE**

# **Model Selection and Training Evaluation**
The recommendation system uses a hybrid approach that combines content-based filtering with collaborative filtering. The content-based component applies TF-IDF vectorization to the course descriptions and computes a cosine similarity matrix to identify courses with similar content. The 'University' and 'Difficulty Level' features were label-encoded so that they could be used to diversify the recommendations. For collaborative filtering, user and item IDs were generated to simulate user interactions with courses, and the SVD algorithm was used to predict user preferences.
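As a small illustration of why a plain dot product can stand in for cosine similarity: scikit-learn's `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for unit-length vectors the two quantities coincide. The sketch below uses only the standard library; the toy vectors and helper names are illustrative assumptions, not values from the dataset.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

# Hypothetical TF-IDF-like weights for two course descriptions
course_a = [0.0, 2.0, 1.0, 0.0]
course_b = [1.0, 1.0, 0.0, 3.0]

unit_a, unit_b = l2_normalize(course_a), l2_normalize(course_b)

# After normalization, the dot product agrees with the cosine similarity
print(math.isclose(dot(unit_a, unit_b), cosine(course_a, course_b)))  # True
```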
The dataset was split into training and testing sets so that the collaborative filtering model could be evaluated on held-out data. The hybrid recommendation process generates content-based recommendations, applies the diversity step, and combines the result with the collaborative filtering recommendations. The combined approach aims to draw on the strengths of both methods: content similarity for relevance to a given course, and predicted ratings for personalization.
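One way a diversity step could work is a greedy pass over the similarity-ranked candidates that caps how many courses any single university may contribute. This is a sketch under stated assumptions: the per-university cap, the function name `select_diverse`, and the toy candidate list are all hypothetical, and the notebook code may handle diversity differently.

```python
from collections import Counter

def select_diverse(ranked_candidates, num_recommendations, max_per_university=2):
    """Greedily pick candidates in rank order, skipping any whose
    university has already reached the cap.

    ranked_candidates: list of (course_index, university_code), best first.
    """
    picked = []
    per_university = Counter()
    for course_index, university in ranked_candidates:
        if len(picked) >= num_recommendations:
            break
        if per_university[university] >= max_per_university:
            continue  # this university is already well represented
        picked.append(course_index)
        per_university[university] += 1
    return picked

# Toy candidates, ordered by descending similarity (values are made up)
candidates = [(10, 73), (11, 73), (12, 73), (13, 80), (14, 41), (15, 73)]
print(select_diverse(candidates, num_recommendations=4))  # [10, 11, 13, 14]
```

The cap trades a little similarity for breadth: course 12 is skipped even though it ranks third, because its university already has two picks.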


In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import numpy as np
import math

# Load the cleaned dataset
df = pd.read_csv('cleaned_dataset.csv')

# Ensure 'Course Rating' is numeric
df['Course Rating'] = pd.to_numeric(df['Course Rating'], errors='coerce')

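# Encode categorical variables (one encoder per column so each mapping can be inverted later)
university_encoder = LabelEncoder()
difficulty_encoder = LabelEncoder()
df['University'] = university_encoder.fit_transform(df['University'])
df['Difficulty Level'] = difficulty_encoder.fit_transform(df['Difficulty Level'])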

# Extract important features from the 'Course Description' with adjusted parameters
tfidf = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=5, ngram_range=(1, 2))
df['Course Description'] = df['Course Description'].fillna('')
tfidf_matrix = tfidf.fit_transform(df['Course Description'])

# Combine the original features with the TF-IDF features
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df_combined = pd.concat([df, df_tfidf], axis=1)

# Compute the cosine similarity matrix (TF-IDF rows are L2-normalized by default, so the linear kernel equals cosine similarity)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get content-based recommendations based on course index
def get_content_based_recommendations(course_index, cosine_sim=cosine_sim, df=df, num_recommendations=10):
    # Get pairwise similarity scores of all courses with the given course
    sim_scores = list(enumerate(cosine_sim[course_index]))
    
    # Sort the courses based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Drop the query course by index (it is not guaranteed to sort first when descriptions are duplicated), then keep the top matches
    sim_scores = [s for s in sim_scores if s[0] != course_index][:num_recommendations]
    
    # Get the course indices
    course_indices = [i[0] for i in sim_scores]
    
    # Return the top most similar courses
    return df.iloc[course_indices][['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL']]

# Get recommendations for a sample course
course_index = 0  # Index of the course for which recommendations are needed
recommendations = get_content_based_recommendations(course_index)

print("Content-Based Recommendations for course index", course_index)
print(recommendations)

Content-Based Recommendations for course index 0
                                            Course Name  University  \
1444  script writing: write a pilot episode for a tv...          73   
1590                             write your first novel          73   
3387                                 transmedia writing          73   
2131       presentation skills: public speaking project          80   
3353                 better business writing in english          41   
1185                               capstone: your story         172   
2664  writing professional email and memos (project-...         166   
1044  writing for young readers: opening the treasur...          22   
601                    writing in english at university          69   
833                     songwriting: writing the lyrics          11   

      Difficulty Level  Course Rating  \
1444                 0            4.3   
1590                 1            3.5   
3387                 0            4.1   
2131 

In [2]:
from sklearn.cluster import KMeans

# Use the TF-IDF features for clustering
num_clusters = 5  # Adjust this based on your dataset
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(tfidf_matrix)

# Function to get cluster-based recommendations
def get_cluster_based_recommendations(course_index, df=df, num_recommendations=10):
    cluster_id = df.iloc[course_index]['Cluster']
    cluster_courses = df[df['Cluster'] == cluster_id]
    cluster_courses = cluster_courses[cluster_courses.index != course_index]  # Exclude the course itself
    
    return cluster_courses.head(num_recommendations)[['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL']]

# Get cluster-based recommendations for a sample course
cluster_recommendations = get_cluster_based_recommendations(course_index)

print("Cluster-Based Recommendations for course index", course_index)
print(cluster_recommendations)


Cluster-Based Recommendations for course index 0
                                          Course Name  University  \
1   business strategy: business model canvas analy...          26   
5   building test automation framework using selen...          26   
7                       programming languages, part a         163   
10       agile projects:  developing tasks with taiga          26   
12                               hacking and patching         141   
16                      python programming essentials          99   
17  creating dashboards and storytelling with tableau         136   
18                               parallel programming         183   
27  aws elastic beanstalk: build & deploy a node.j...          26   
36                      cobol programming with vscode          52   

    Difficulty Level  Course Rating  \
1                  1            4.8   
5                  1            4.7   
7                  3            4.9   
10                 1            4.0
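
Because `head(num_recommendations)` returns the first cluster members in dataset order, the list above is not ranked by closeness to the query course. A possible refinement, sketched here with plain Python lists, is to sort the cluster members by their similarity to the query course before truncating. The function name and the toy data are assumptions for illustration.

```python
def rank_within_cluster(course_index, cluster_labels, similarity_row, num_recommendations=10):
    """Return indices of courses in the same cluster as course_index,
    ordered by descending similarity to it."""
    target = cluster_labels[course_index]
    members = [
        i for i, label in enumerate(cluster_labels)
        if label == target and i != course_index
    ]
    members.sort(key=lambda i: similarity_row[i], reverse=True)
    return members[:num_recommendations]

# Toy data: five courses, their cluster labels, and similarities to course 0
labels = [2, 2, 0, 2, 2]
sims_to_course_0 = [1.0, 0.1, 0.9, 0.7, 0.4]
print(rank_within_cluster(0, labels, sims_to_course_0, num_recommendations=2))  # [3, 4]
```

In the notebook, `cluster_labels` would correspond to `df['Cluster']` and `similarity_row` to `cosine_sim[course_index]`.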

In [3]:
# Function to get combined recommendations
def get_combined_recommendations(course_index, num_recommendations=10):
    content_recommendations = get_content_based_recommendations(course_index, num_recommendations=num_recommendations)
    cluster_recommendations = get_cluster_based_recommendations(course_index, num_recommendations=num_recommendations)
    
    combined_recommendations = pd.concat([content_recommendations, cluster_recommendations]).drop_duplicates().head(num_recommendations)
    
    return combined_recommendations

# Get combined recommendations for a sample course
combined_recommendations = get_combined_recommendations(course_index)

print("Combined Recommendations for course index", course_index)
print(combined_recommendations)

Combined Recommendations for course index 0
                                            Course Name  University  \
1444  script writing: write a pilot episode for a tv...          73   
1590                             write your first novel          73   
3387                                 transmedia writing          73   
2131       presentation skills: public speaking project          80   
3353                 better business writing in english          41   
1185                               capstone: your story         172   
2664  writing professional email and memos (project-...         166   
1044  writing for young readers: opening the treasur...          22   
601                    writing in english at university          69   
833                     songwriting: writing the lyrics          11   

      Difficulty Level  Course Rating  \
1444                 0            4.3   
1590                 1            3.5   
3387                 0            4.1   
2131      
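
The combined list above matches the content-based list: `pd.concat(...).drop_duplicates().head(n)` keeps the first `n` rows, and here the content-based recommendations already supply all of them, so the cluster-based rows are never reached. If both sources should contribute, one option is to interleave them. The sketch below is a hypothetical round-robin merge over index lists, not the notebook's method.

```python
from itertools import zip_longest

def interleave_unique(primary, secondary, num_recommendations):
    """Alternate between two ranked lists of course indices,
    dropping repeats and stopping at num_recommendations."""
    merged, seen = [], set()
    for pair in zip_longest(primary, secondary):
        for idx in pair:
            if idx is None or idx in seen:
                continue
            merged.append(idx)
            seen.add(idx)
            if len(merged) == num_recommendations:
                return merged
    return merged

content_ids = [1444, 1590, 3387, 2131]  # first rows of the content-based output
cluster_ids = [1, 5, 1444, 7]           # cluster rows, plus one made-up overlap (1444)
print(interleave_unique(content_ids, cluster_ids, num_recommendations=6))
# [1444, 1, 1590, 5, 3387, 2131]
```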

In [4]:
# Function to evaluate recommendations
def evaluate_recommendations(num_recommendations=10):
    precision_at_k = []
    recall_at_k = []
    f1_score_at_k = []
    map_at_k = []
    ndcg_at_k = []

    for idx in range(len(df)):
        # Get recommendations for each course
        recommendations = get_combined_recommendations(idx, num_recommendations=num_recommendations)
        
        # Treat every other course rated 4.0 or higher as relevant (a proxy for ground truth, since no real user feedback is available)
        ground_truth = df[(df['Course Rating'] >= 4) & (df.index != idx)].index.tolist()
        
        if not ground_truth:
            continue
        
        recommended_indices = recommendations.index.tolist()
        relevant_items = set(ground_truth).intersection(recommended_indices)
        
        precision = len(relevant_items) / num_recommendations
        recall = len(relevant_items) / len(ground_truth)
        f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
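        # Note: this counts hits in the top k without weighting by rank, so it equals precision@k rather than a rank-aware MAP
        map_score = sum([1 if i < len(recommended_indices) and recommended_indices[i] in relevant_items else 0 for i in range(num_recommendations)]) / num_recommendations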
        dcg = sum([(1 if i < len(recommended_indices) and recommended_indices[i] in relevant_items else 0) / math.log2(i + 2) for i in range(num_recommendations)])
        idcg = sum([1 / math.log2(i + 2) for i in range(min(len(ground_truth), num_recommendations))])
        ndcg = dcg / idcg if idcg > 0 else 0

        precision_at_k.append(precision)
        recall_at_k.append(recall)
        f1_score_at_k.append(f1_score)
        map_at_k.append(map_score)
        ndcg_at_k.append(ndcg)

    return np.mean(precision_at_k), np.mean(recall_at_k), np.mean(f1_score_at_k), np.mean(map_at_k), np.mean(ndcg_at_k)

# Evaluate the recommendations
precision_at_k, recall_at_k, f1_score_at_k, map_at_k, ndcg_at_k = evaluate_recommendations()

print("Precision at K:", precision_at_k)
print("Recall at K:", recall_at_k)
print("F1-score at K:", f1_score_at_k)
print("MAP at K:", map_at_k)
print("NDCG at K:", ndcg_at_k)

Precision at K: 0.9350759345794394
Recall at K: 0.002935815693433195
F1-score at K: 0.005853254188854786
MAP at K: 0.9350759345794394
NDCG at K: 0.9347709124197553
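
Two observations on these numbers. The MAP value is identical to the precision because `map_score` counts hits in the top k without weighting them by rank. Recall is small because every other course rated 4 or above counts as relevant, so the ground-truth set is far larger than the 10 courses recommended. For reference, a rank-aware average precision at k can be computed as below. This is a minimal standard-library sketch; the function name and toy inputs are assumptions, and normalizing by min(k, number of relevant items) is one common convention among several.

```python
def average_precision_at_k(recommended, relevant, k):
    """Average of precision@i over the 1-based ranks i where a hit occurs,
    normalized by min(k, number of relevant items)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant))

# Same two hits at different positions: the rank-aware score differs
early_hits = average_precision_at_k(['a', 'b', 'x', 'y'], {'a', 'b'}, k=4)
late_hits = average_precision_at_k(['x', 'y', 'a', 'b'], {'a', 'b'}, k=4)
print(early_hits, late_hits)
```

Both toy lists have the same precision at 4, but the list with hits at ranks 1 and 2 scores higher than the one with hits at ranks 3 and 4, which is the behavior a plain hit count cannot capture.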


In [5]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans
import numpy as np
import math
from surprise import Dataset, Reader, SVD, KNNBasic
from surprise.model_selection import train_test_split

# Load the cleaned dataset
df = pd.read_csv('cleaned_dataset.csv')

# Ensure 'Course Rating' is numeric
df['Course Rating'] = pd.to_numeric(df['Course Rating'], errors='coerce')

# Add a 'Popularity' feature: each course rating divided by the mean rating (rating * count / sum)
df['Popularity'] = df['Course Rating'] * df['Course Rating'].count() / df['Course Rating'].sum()

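# Encode categorical variables (one encoder per column so each mapping can be inverted later)
university_encoder = LabelEncoder()
difficulty_encoder = LabelEncoder()
df['University'] = university_encoder.fit_transform(df['University'])
df['Difficulty Level'] = difficulty_encoder.fit_transform(df['Difficulty Level'])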

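# Create mock user IDs and item IDs (synthetic interactions, not real user data)
np.random.seed(42)  # Fix the seed so the mock user assignment is reproducible
df['user_id'] = np.random.randint(1, 101, df.shape[0])  # Assuming 100 unique users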
df['item_id'] = df.index  # Using index as a unique item ID

# Extract important features from the 'Course Description' with adjusted parameters
tfidf = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=5, ngram_range=(1, 2))
df['Course Description'] = df['Course Description'].fillna('')
tfidf_matrix = tfidf.fit_transform(df['Course Description'])

# Combine the original features with the TF-IDF features
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df_combined = pd.concat([df, df_tfidf], axis=1)

# Compute the cosine similarity matrix (TF-IDF rows are L2-normalized by default, so the linear kernel equals cosine similarity)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get content-based recommendations based on course index with diversity
def get_content_based_recommendations(course_index, cosine_sim=cosine_sim, df=df, num_recommendations=20):
    # Get pairwise similarity scores of all courses with the given course
    sim_scores = list(enumerate(cosine_sim[course_index]))
    
    # Sort the courses based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the most similar courses
    sim_scores = sim_scores[1:num_recommendations * 3 + 1]  # Get more recommendations to ensure diversity
    
    # Get the course indices
    course_indices = [i[0] for i in sim_scores]
    
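    # Track the universities and difficulty levels seen among the selected courses.
    # Note: these sets are recorded but not used to filter candidates, so the loop
    # below simply keeps the top num_recommendations courses by similarity.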
    selected_courses = []
    universities = set()
    difficulty_levels = set()
    
    for idx in course_indices:
        course = df.iloc[idx]
        if len(selected_courses) >= num_recommendations:
            break
        selected_courses.append(idx)
        universities.add(course['University'])
        difficulty_levels.add(course['Difficulty Level'])
    
    # Return the selected courses
    return df.iloc[selected_courses][['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL', 'Popularity']]

# Prepare the data for collaborative filtering
cf_df = df[['user_id', 'item_id', 'Course Rating']].dropna()

# Create a Surprise reader object
reader = Reader(rating_scale=(1, 5))

# Convert the data into a Surprise dataset
data = Dataset.load_from_df(cf_df[['user_id', 'item_id', 'Course Rating']], reader)

# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2)

# Experiment with different collaborative filtering algorithms
algo_svd = SVD()
algo_knn = KNNBasic()

# Train both models
algo_svd.fit(trainset)
algo_knn.fit(trainset)

# Function to get collaborative filtering recommendations
def get_collaborative_recommendations(user_id, num_recommendations=20, algo=algo_svd):
    # Predict ratings for every item (items the user has already rated are not filtered out)
    predictions = [algo.predict(user_id, item_id) for item_id in df['item_id'].unique()]
    
    # Sort the predictions by estimated rating
    predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
    
    # Get the top N recommended item IDs
    recommended_item_ids = [int(pred.iid) for pred in predictions[:num_recommendations]]
    
    # item_id mirrors the DataFrame index, so .loc returns the rows in ranked order
    return df.loc[recommended_item_ids, ['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL', 'Popularity']]

# Function to get hybrid recommendations
def get_hybrid_recommendations(course_index, user_id, num_recommendations=20):
    # Get content-based recommendations
    content_recommendations = get_content_based_recommendations(course_index, num_recommendations=num_recommendations)
    
    # Get collaborative filtering recommendations
    collaborative_recommendations = get_collaborative_recommendations(user_id, num_recommendations=num_recommendations, algo=algo_svd)
    
    # Combine the results, ensuring no duplicates
    hybrid_recommendations = pd.concat([content_recommendations, collaborative_recommendations]).drop_duplicates().head(num_recommendations)
    
    return hybrid_recommendations

# Function to evaluate recommendations
def evaluate_recommendations(num_recommendations=20):
    precision_at_k = []
    recall_at_k = []
    f1_score_at_k = []
    map_at_k = []
    ndcg_at_k = []

    for idx in range(len(df)):
        # Get recommendations for each course
        recommendations = get_hybrid_recommendations(idx, user_id=1, num_recommendations=num_recommendations)
        
        # Treat every other course rated 4.0 or higher as relevant (a proxy for ground truth, since no real user feedback is available)
        ground_truth = df[(df['Course Rating'] >= 4) & (df.index != idx)].index.tolist()
        
        if not ground_truth:
            continue
        
        recommended_indices = recommendations.index.tolist()
        relevant_items = set(ground_truth).intersection(recommended_indices)
        
        precision = len(relevant_items) / num_recommendations
        recall = len(relevant_items) / len(ground_truth)
        f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
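        # Note: this counts hits in the top k without weighting by rank, so it equals precision@k rather than a rank-aware MAP
        map_score = sum([1 if i < len(recommended_indices) and recommended_indices[i] in relevant_items else 0 for i in range(num_recommendations)]) / num_recommendations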
        dcg = sum([(1 if i < len(recommended_indices) and recommended_indices[i] in relevant_items else 0) / math.log2(i + 2) for i in range(num_recommendations)])
        idcg = sum([1 / math.log2(i + 2) for i in range(min(len(ground_truth), num_recommendations))])
        ndcg = dcg / idcg if idcg > 0 else 0

        precision_at_k.append(precision)
        recall_at_k.append(recall)
        f1_score_at_k.append(f1_score)
        map_at_k.append(map_score)
        ndcg_at_k.append(ndcg)

    return np.mean(precision_at_k), np.mean(recall_at_k), np.mean(f1_score_at_k), np.mean(map_at_k), np.mean(ndcg_at_k)

# Get recommendations for a sample course and user
course_index = 0  # Index of the course for which recommendations are needed
user_id = 1  # Sample user ID
recommendations = get_hybrid_recommendations(course_index, user_id)

print("Hybrid Recommendations for course index", course_index)
print(recommendations)

# Evaluate the recommendations
precision_at_k, recall_at_k, f1_score_at_k, map_at_k, ndcg_at_k = evaluate_recommendations()

print("Precision at K:", precision_at_k)
print("Recall at K:", recall_at_k)
print("F1-score at K:", f1_score_at_k)
print("MAP at K:", map_at_k)
print("NDCG at K:", ndcg_at_k)

Computing the msd similarity matrix...
Done computing similarity matrix.
Hybrid Recommendations for course index 0
                                            Course Name  University  \
1444  script writing: write a pilot episode for a tv...          73   
1590                             write your first novel          73   
3387                                 transmedia writing          73   
2131       presentation skills: public speaking project          80   
3353                 better business writing in english          41   
1185                               capstone: your story         172   
2664  writing professional email and memos (project-...         166   
1044  writing for young readers: opening the treasur...          22   
601                    writing in english at university          69   
833                     songwriting: writing the lyrics          11   
1536    how to write a resume (project-centered course)         117   
1931  design and make infographic
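
The collaborative step scores every item for the user, including any the user has already rated. A common refinement is to drop already-seen items before taking the top N. The sketch below shows that logic on plain Python data; `predict` is a stand-in callable for a trained model's estimate, and all names and values are illustrative assumptions.

```python
def top_n_unseen(user_id, all_item_ids, seen_by_user, predict, n):
    """Rank the items a user has not interacted with by predicted rating."""
    seen = seen_by_user.get(user_id, set())
    candidates = [item for item in all_item_ids if item not in seen]
    candidates.sort(key=lambda item: predict(user_id, item), reverse=True)
    return candidates[:n]

# Toy predicted ratings standing in for a fitted model's estimates
fake_estimates = {(1, 0): 4.9, (1, 1): 3.2, (1, 2): 4.5, (1, 3): 4.7}

def predict(user_id, item_id):
    """Look up the made-up estimate for a (user, item) pair."""
    return fake_estimates[(user_id, item_id)]

seen_by_user = {1: {0}}  # user 1 has already rated item 0
print(top_n_unseen(1, [0, 1, 2, 3], seen_by_user, predict, n=2))  # [3, 2]
```

With Surprise, `predict` could wrap `algo.predict(user_id, item_id).est`, as the notebook code already does when sorting predictions.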

In [6]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans
import numpy as np
import math
from surprise import Dataset, Reader, SVD, KNNBasic
from surprise.model_selection import train_test_split

# Load the cleaned dataset
df = pd.read_csv('cleaned_dataset.csv')

# Ensure 'Course Rating' is numeric
df['Course Rating'] = pd.to_numeric(df['Course Rating'], errors='coerce')

# Add a 'Popularity' feature: each course rating divided by the mean rating (rating * count / sum)
df['Popularity'] = df['Course Rating'] * df['Course Rating'].count() / df['Course Rating'].sum()

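# Encode categorical variables (one encoder per column so each mapping can be inverted later)
university_encoder = LabelEncoder()
difficulty_encoder = LabelEncoder()
df['University'] = university_encoder.fit_transform(df['University'])
df['Difficulty Level'] = difficulty_encoder.fit_transform(df['Difficulty Level'])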

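# Create mock user IDs and item IDs (synthetic interactions, not real user data)
np.random.seed(42)  # Fix the seed so the mock user assignment is reproducible
df['user_id'] = np.random.randint(1, 101, df.shape[0])  # Assuming 100 unique users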
df['item_id'] = df.index  # Using index as a unique item ID

# Extract important features from the 'Course Description' with adjusted parameters
tfidf = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=5, ngram_range=(1, 2))
df['Course Description'] = df['Course Description'].fillna('')
tfidf_matrix = tfidf.fit_transform(df['Course Description'])

# Combine the original features with the TF-IDF features
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df_combined = pd.concat([df, df_tfidf], axis=1)

# Compute the cosine similarity matrix (TF-IDF rows are L2-normalized by default, so the linear kernel equals cosine similarity)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get content-based recommendations based on course index with diversity
def get_content_based_recommendations(course_index, cosine_sim=cosine_sim, df=df, num_recommendations=30):
    # Get pairwise similarity scores of all courses with the given course
    sim_scores = list(enumerate(cosine_sim[course_index]))
    
    # Sort the courses based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the most similar courses
    sim_scores = sim_scores[1:num_recommendations * 3 + 1]  # Get more recommendations to ensure diversity
    
    # Get the course indices
    course_indices = [i[0] for i in sim_scores]
    
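    # Track the universities and difficulty levels seen among the selected courses.
    # Note: these sets are recorded but not used to filter candidates, so the loop
    # below simply keeps the top num_recommendations courses by similarity.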
    selected_courses = []
    universities = set()
    difficulty_levels = set()
    
    for idx in course_indices:
        course = df.iloc[idx]
        if len(selected_courses) >= num_recommendations:
            break
        selected_courses.append(idx)
        universities.add(course['University'])
        difficulty_levels.add(course['Difficulty Level'])
    
    # Return the selected courses
    return df.iloc[selected_courses][['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL', 'Popularity']]

# Prepare the data for collaborative filtering
cf_df = df[['user_id', 'item_id', 'Course Rating']].dropna()

# Create a Surprise reader object
reader = Reader(rating_scale=(1, 5))

# Convert the data into a Surprise dataset
data = Dataset.load_from_df(cf_df[['user_id', 'item_id', 'Course Rating']], reader)

# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2)

# Experiment with different collaborative filtering algorithms
algo_svd = SVD()
algo_knn = KNNBasic()

# Train both models
algo_svd.fit(trainset)
algo_knn.fit(trainset)

# Function to get collaborative filtering recommendations
def get_collaborative_recommendations(user_id, num_recommendations=30, algo=algo_svd):
    # Predict ratings for every item (items the user has already rated are not filtered out)
    predictions = [algo.predict(user_id, item_id) for item_id in df['item_id'].unique()]
    
    # Sort the predictions by estimated rating
    predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
    
    # Get the top N recommended item IDs
    recommended_item_ids = [int(pred.iid) for pred in predictions[:num_recommendations]]
    
    # item_id mirrors the DataFrame index, so .loc returns the rows in ranked order
    return df.loc[recommended_item_ids, ['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL', 'Popularity']]

# Function to get hybrid recommendations
def get_hybrid_recommendations(course_index, user_id, num_recommendations=30):
    # Get content-based recommendations
    content_recommendations = get_content_based_recommendations(course_index, num_recommendations=num_recommendations)
    
    # Get collaborative filtering recommendations
    collaborative_recommendations = get_collaborative_recommendations(user_id, num_recommendations=num_recommendations, algo=algo_svd)
    
    # Combine the results, ensuring no duplicates
    hybrid_recommendations = pd.concat([content_recommendations, collaborative_recommendations]).drop_duplicates().head(num_recommendations)
    
    return hybrid_recommendations

# Function to evaluate recommendations
def evaluate_recommendations(num_recommendations=30):
    precision_at_k = []
    recall_at_k = []
    f1_score_at_k = []
    map_at_k = []
    ndcg_at_k = []

    for idx in range(len(df)):
        # Get recommendations for each course
        recommendations = get_hybrid_recommendations(idx, user_id=1, num_recommendations=num_recommendations)
        
        # Treat every other course rated 4.0 or higher as relevant (a proxy for ground truth, since no real user feedback is available)
        ground_truth = df[(df['Course Rating'] >= 4) & (df.index != idx)].index.tolist()
        
        if not ground_truth:
            continue
        
        recommended_indices = recommendations.index.tolist()
        relevant_items = set(ground_truth).intersection(recommended_indices)
        
        precision = len(relevant_items) / num_recommendations
        recall = len(relevant_items) / len(ground_truth)
        f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
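        # Note: this counts hits in the top k without weighting by rank, so it equals precision@k rather than a rank-aware MAP
        map_score = sum([1 if i < len(recommended_indices) and recommended_indices[i] in relevant_items else 0 for i in range(num_recommendations)]) / num_recommendations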
        dcg = sum([(1 if i < len(recommended_indices) and recommended_indices[i] in relevant_items else 0) / math.log2(i + 2) for i in range(num_recommendations)])
        idcg = sum([1 / math.log2(i + 2) for i in range(min(len(ground_truth), num_recommendations))])
        ndcg = dcg / idcg if idcg > 0 else 0

        precision_at_k.append(precision)
        recall_at_k.append(recall)
        f1_score_at_k.append(f1_score)
        map_at_k.append(map_score)
        ndcg_at_k.append(ndcg)

    return np.mean(precision_at_k), np.mean(recall_at_k), np.mean(f1_score_at_k), np.mean(map_at_k), np.mean(ndcg_at_k)

# Get recommendations for a sample course and user
course_index = 0  # Index of the course for which recommendations are needed
user_id = 1  # Sample user ID
recommendations = get_hybrid_recommendations(course_index, user_id)

print("Hybrid Recommendations for course index", course_index)
print(recommendations)

# Evaluate the recommendations
precision_at_k, recall_at_k, f1_score_at_k, map_at_k, ndcg_at_k = evaluate_recommendations()

print("Precision at K:", precision_at_k)
print("Recall at K:", recall_at_k)
print("F1-score at K:", f1_score_at_k)
print("MAP at K:", map_at_k)
print("NDCG at K:", ndcg_at_k)


Computing the msd similarity matrix...
Done computing similarity matrix.
Hybrid Recommendations for course index 0
                                            Course Name  University  \
1444  script writing: write a pilot episode for a tv...          73   
1590                             write your first novel          73   
3387                                 transmedia writing          73   
2131       presentation skills: public speaking project          80   
3353                 better business writing in english          41   
1185                               capstone: your story         172   
2664  writing professional email and memos (project-...         166   
1044  writing for young readers: opening the treasur...          22   
601                    writing in english at university          69   
833                     songwriting: writing the lyrics          11   
1536    how to write a resume (project-centered course)         117   
1931  design and make infographic