### Classify Me
This notebook contains the source code for building a course recommendation system using collaborative filtering and the Surprise library (scikit-surprise). The data used herein is simulated.


> Mounting the Google Drive containing the data to the Colab notebook environment.

Otherwise, the notebook reads the data directly from the working directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Installing the required libraries

In [None]:
# Install required libraries
!pip install scikit-surprise
!pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3095459 sha256=83742f889834a38cb1ef8eab184c1b00b07dcb12e8eb8ac591108d165275c17e
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Co

Importing the relevant libraries and modules

In [None]:

# Importing libraries
import pandas as pd
import math
import numpy as np
import seaborn as sns
from surprise import dump
import os
import random
from datetime import datetime

# Ignore printing warnings for general readability
import warnings 
warnings.filterwarnings("ignore")

# For model selection (requires scikit-surprise, installed above)
from surprise import Reader, Dataset
# Note: train_test_split here is Surprise's version, not scikit-learn's
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise import KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline, SVD, SVDpp, SlopeOne, NMF, NormalPredictor, BaselineOnly, CoClustering
from surprise import accuracy


Loading data from the mounted Drive folders.

In [None]:
courses = pd.read_csv('/content/drive/MyDrive/ClassifyMe/course_catalogue.csv')
historical = pd.read_csv('/content/drive/MyDrive/ClassifyMe/historical_data.csv')

Alternatively, loading the data directly from the same working directory as the notebook.

In [None]:
# Loading the course catalogue and historical data into dataframes
courses = pd.read_csv('course_catalogue.csv')
historical = pd.read_csv('historical_data.csv')
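
Since the notebook can read from either the mounted Drive folder or the working directory, the choice between the two loading cells could be made automatically. The following is a minimal sketch, assuming the Drive path from the cell above; `resolve_data_dir` is a hypothetical helper, not part of the original notebook.

```python
import os

# Drive folder used in the Colab loading cell above
DRIVE_DIR = '/content/drive/MyDrive/ClassifyMe'

def resolve_data_dir(drive_dir=DRIVE_DIR, fallback='.'):
    """Return drive_dir if it exists, otherwise the fallback directory (hypothetical helper)."""
    return drive_dir if os.path.isdir(drive_dir) else fallback

# Build the two CSV paths from whichever directory is available
data_dir = resolve_data_dir()
courses_path = os.path.join(data_dir, 'course_catalogue.csv')
historical_path = os.path.join(data_dir, 'historical_data.csv')
print(courses_path)
```

The resulting paths can then be passed to `pd.read_csv` in a single loading cell.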

In [None]:
# Display the first few rows of the dataframes to ensure they are loaded correctly

courses.head()

Unnamed: 0,Course_name,Description,Keywords,Core_units,Reviews,Weighted_points,Job_satisfaction
0,Medicine and Surgery,This course trains students to become medical ...,"Medicine, Surgery, Health","Anatomy, Physiology, Pharmacology, Pathology",4.7,43.68,5.0
1,Pharmacy,This course trains students to become licensed...,"Pharmacy, Medications, Health","Pharmacology, Medicinal Chemistry, Pharmacy Pr...",4.2,42.1,4.5
2,Nursing,This course trains students to become register...,"Anatomy, Physiology, Medical-Surgical Nursing","Anatomy, Physiology, Pharmacology, Pathology",4.5,39.47,4.0
3,Medical Laboratory Sciences,This course trains students to become medical ...,"Medical Laboratory, Diagnostics, Health","Medical Microbiology, Hematology, Clinical Che...",4.5,37.89,4.5
4,Nutrition and Dietetics,This course trains students to become register...,"Nutrition, Dietetics, Health","Human Nutrition, Medical Nutrition Therapy, Fo...",4.3,36.84,4.0


In [None]:
historical.head()

Unnamed: 0,ID,First_Name,Last_Name,English,Kiswahili,Mathematics,Physics,Biology,Chemistry,Overall Grade,Interest,Cluster Points,Course
0,S001,Scarlett,Layla,10,9,10,7,6,10,61,Public Health,35.424,Health Records and Information Technology
1,S002,Amelia,Zoey,12,11,6,7,12,7,67,Pharmacy,38.143,Not elligible
2,S003,Emma,Aubrey,7,8,7,8,12,12,81,Laboratories,43.028,Medical Laboratory Sciences
3,S004,Luna,Cooper,11,8,7,6,11,12,63,Nursing,38.419,Palliative Care and Hospice Services
4,S005,Jackson,Adams,7,6,10,10,10,9,64,Surgery,36.285,Not elligible


In [None]:
# Displaying the shape of the historical dataframe
historical.shape

(1000, 13)

# Data Preprocessing



Replacing spaces in the column names with underscores for easier manipulation

In [None]:
# Replacing the spaces between the column names with "_"
courses.columns = courses.columns.str.replace(' ', '_')
historical.columns = historical.columns.str.replace(' ', '_')

Defining functions to pick the higher grade between English and Kiswahili, and between Mathematics and Physics

In [None]:
# lang returns the higher of a student's English and Kiswahili scores.
def lang(student):
  return max(student['English'], student['Kiswahili'])

# science returns the higher of a student's Mathematics and Physics scores.
def science(student):
  return max(student['Mathematics'], student['Physics'])
  



In [None]:
#Add a new column 'languages' to the DataFrame.
historical['languages'] = historical.apply(lang, axis=1)
#Add a new column 'science' to the DataFrame.
historical['science'] = historical.apply(science, axis=1)
historical.sample(5)

Unnamed: 0,ID,First_Name,Last_Name,English,Kiswahili,Mathematics,Physics,Biology,Chemistry,Overall_Grade,Interest,Cluster_Points,Course,languages,science
794,S795,Brown,Sophia,6,12,7,9,8,8,63,Medical Research,36.497,Not elligible,12,9
741,S742,Avery,Nelson,12,7,10,6,8,7,64,Medical Research,36.785,Clinical Research,12,10
748,S749,Wright,Avery,10,10,12,6,7,8,71,Laboratories,38.745,Medical Laboratory Sciences,10,12
151,S152,Edwards,Wilson,10,9,9,8,11,12,61,Surgery,38.262,Not elligible,10,9
652,S653,Riley,Barnes,6,8,8,8,7,8,80,Laboratories,37.645,Medical Illustration and Animation,8,8
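
The same two columns can also be derived without a row-wise `apply`, by taking the maximum across each pair of subject columns. This is a sketch on made-up toy rows (not the simulated dataset), using the same column names as the historical data.

```python
import pandas as pd

# Toy rows with the same subject columns as the historical data (values are made up)
toy = pd.DataFrame({
    'English':     [10, 6, 7],
    'Kiswahili':   [9, 12, 8],
    'Mathematics': [10, 7, 7],
    'Physics':     [7, 9, 8],
})

# Row-wise maximum over each pair of subjects
toy['languages'] = toy[['English', 'Kiswahili']].max(axis=1)
toy['science'] = toy[['Mathematics', 'Physics']].max(axis=1)
print(toy)
```

On a dataframe of this size either approach is fast; the column-wise form mainly removes the need for the two helper functions.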


# Courses

This dataset contains, for each course: the course name, a brief description, associated keywords, the core units covered, reviews by students who have taken the course, weighted points representing cluster points, and Job_satisfaction, a rating of the satisfaction derived from pursuing a career along the lines of the course.

In [None]:
courses.sample(5)

Unnamed: 0,Course_name,Description,Keywords,Core_units,Reviews,Weighted_points,Job_satisfaction
20,Orthotics and Prosthetics,"This course focuses on the design, fabrication...","orthotics, prosthetics, physical disability, a...","Anatomy and Kinesiology, Orthotics and Prosthe...",4.6,38.36,4.8
17,Health Education and Promotion,This course focuses on promoting health and pr...,"health promotion, disease prevention, educatio...","Public Health Education, Health Communication,...",4.3,32.75,4.5
21,Medical Biotechnology,This course covers the application of biotechn...,"medical biotechnology, biotechnology, medical ...","Molecular Biology, Genetics, Bioinformatics",4.4,36.24,4.5
14,Occupational Therapy,This course focuses on the rehabilitation of p...,"rehabilitation, physical therapy, mental therapy","Human Anatomy and Physiology, Rehabilitation T...",4.7,39.54,4.8
28,Medical Anthropology,This course focuses on the study of the relati...,"medical anthropology, culture and health, heal...","Medical Anthropology, Cross-Cultural Health an...",4.3,39.32,4.5


Assigning each course to a category based on its interest area

In [None]:
# Defines a dictionary mapping course names to interest areas (broader fields of study).
course_categories = {'Nursing': 'Nursing','Palliative Care and Hospice Services': 'Nursing',
                     'Pharmacy': 'Pharmacy','Public Health': 'Public Health',
                     'Health Systems Management': 'Public Health',
                     'Health Records and Information Technology': 'Public Health',
                     'Health Education and Promotion': 'Public Health','Reproductive Health': 'Public Health',
                     'Environmental Health': 'Public Health','Medical Laboratory Sciences': 'Laboratories',
                     'Medical Biotechnology': 'Laboratories', 'Medical Physics': 'Laboratories',
                     'Forensic Science': 'Laboratories','Pathology': 'Medical Research',
                     'Medical Imaging Sciences': 'Medical Research','Epidemiology and Biostatistics': 'Medical Research',
                     'Medical Illustration and Animation': 'Medical Research','Clinical Research': 'Medical Research',
                     'Medical Anthropology': 'Medical Research','Physiotherapy': 'Therapy',
                     'Radiography': 'Therapy','Optometry': 'Therapy',
                     'Occupational Therapy': 'Therapy','Speech Therapy': 'Therapy',
                     'Clinical Psychology': 'Therapy','Orthotics and Prosthetics': 'Therapy',
                     'Nutrition and Dietetics': 'Therapy','Medicine and Surgery': 'Surgery',
                     'Dental Surgery': 'Surgery','Clinical Medicine and Surgery': 'Surgery'
                     }
# Creates a new column named 'Interest' in the DataFrame 'courses', 
# mapping course names to broader fields of study based on the 'course_categories' dictionary
courses['Interest']=courses['Course_name'].map(course_categories)
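
Because `Series.map` returns NaN for any key missing from the dictionary, a course name that is misspelled or absent from the mapping would silently end up without an interest area. A quick check along these lines can catch that; the toy frame and the course name 'Astrophysics' are illustrative assumptions, not part of the catalogue.

```python
import pandas as pd

# Toy catalogue; 'Astrophysics' is a made-up name that is deliberately missing from the mapping
toy_courses = pd.DataFrame({'Course_name': ['Nursing', 'Pharmacy', 'Astrophysics']})
toy_categories = {'Nursing': 'Nursing', 'Pharmacy': 'Pharmacy'}

toy_courses['Interest'] = toy_courses['Course_name'].map(toy_categories)

# Course names that did not receive an interest area
unmapped = toy_courses.loc[toy_courses['Interest'].isna(), 'Course_name'].tolist()
print(unmapped)
```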


In [None]:
#Showing a statistical summary of the numerical columns
courses.describe()

Unnamed: 0,Reviews,Weighted_points,Job_satisfaction
count,30.0,30.0,30.0
mean,4.47,38.328,4.6
std,0.233637,2.757507,0.257307
min,4.0,32.75,4.0
25%,4.3,36.84,4.5
50%,4.45,38.505,4.6
75%,4.675,39.5225,4.8
max,4.9,44.74,5.0


In [None]:
historical.sample(5)

Unnamed: 0,ID,First_Name,Last_Name,English,Kiswahili,Mathematics,Physics,Biology,Chemistry,Overall_Grade,Interest,Cluster_Points,Course,languages,science
789,S790,Rodriguez,Hazel,11,8,9,10,10,7,72,Nursing,39.54,Nursing,11,10
464,S465,Parker,Luna,9,7,6,11,6,6,82,Nursing,38.722,Palliative Care and Hospice Services,9,11
168,S169,Aurora,Brooklyn,8,6,7,7,10,6,68,Therapy,34.707,Not elligible,8,7
149,S150,Aubrey,Claire,9,11,9,8,9,10,73,Nursing,40.334,Nursing,11,9
993,S994,Layla,Evans,6,7,10,9,6,8,79,Public Health,37.409,Health Records and Information Technology,7,10


Merging the courses and historical dataframes on course

In [None]:
# Merge the datasets on the 'Course' column
merged_df = pd.merge(historical, courses[['Course_name', 'Weighted_points', 'Job_satisfaction',  'Description', 'Keywords' ]],
                     left_on='Course', right_on='Course_name', how='left')
# Drop the extra 'Course_name' column
merged_df = merged_df.drop(['Course_name'], axis=1)

# Print the merged dataset
merged_df.sample(5)


Unnamed: 0,ID,First_Name,Last_Name,English,Kiswahili,Mathematics,Physics,Biology,Chemistry,Overall_Grade,Interest,Cluster_Points,Course,languages,science,Weighted_points,Job_satisfaction,Description,Keywords
201,S202,Hannah,Young,11,12,12,8,9,9,82,Nursing,44.362,Nursing,12,12,39.47,4.0,This course trains students to become register...,"Anatomy, Physiology, Medical-Surgical Nursing"
52,S053,Lee,Hall,7,11,9,7,9,12,76,Laboratories,42.197,Forensic Science,11,9,38.65,4.4,This course focuses on the application of scie...,"forensic science, crime investigation, legal p..."
889,S890,Ella,Davis,7,7,9,8,10,9,77,Laboratories,39.243,Medical Laboratory Sciences,7,9,37.89,4.5,This course trains students to become medical ...,"Medical Laboratory, Diagnostics, Health"
311,S312,Natalie,Moore,7,8,7,11,8,8,70,Nursing,37.417,Palliative Care and Hospice Services,8,11,34.21,4.6,This course focuses on the care of patients wi...,"palliative care, hospice services, serious ill..."
860,S861,Gonzalez,Mila,7,11,10,9,7,9,55,Public Health,34.101,Health Records and Information Technology,11,10,33.89,4.5,This course focuses on the management of patie...,"health records, healthcare technology"
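
A left merge keeps every row of the historical data, including rows whose Course value has no match in the catalogue; for those rows the catalogue columns are filled with NaN. The sketch below, on toy frames rather than the real data, uses the `indicator=True` option of `pd.merge` to list such values.

```python
import pandas as pd

# Toy stand-ins for the historical and courses dataframes
toy_hist = pd.DataFrame({'ID': ['S001', 'S002'],
                         'Course': ['Nursing', 'Not elligible']})
toy_courses = pd.DataFrame({'Course_name': ['Nursing'],
                            'Weighted_points': [39.47]})

# indicator=True adds a '_merge' column saying where each row was found
checked = pd.merge(toy_hist, toy_courses, left_on='Course', right_on='Course_name',
                   how='left', indicator=True)

# Course values present in the historical data but absent from the catalogue
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Course'].unique().tolist()
print(unmatched)
```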


In [None]:
#Counting unique values to understand distribution of courses
merged_df['Course'].value_counts()

Not elligible                                326
Palliative Care and Hospice Services         118
Medical Illustration and Animation            45
Nutrition and Dietetics                       43
Epidemiology and Biostatistics                40
Health Education and Promotion                37
Clinical Research                             37
Nursing                                       34
Health Records and Information Technology     29
Public Health                                 27
Medical Biotechnology                         26
Pharmacy                                      25
Environmental Health                          21
Medical Physics                               20
Health Systems Management                     17
Clinical Psychology                           16
Medical Laboratory Sciences                   15
Forensic Science                              15
Medical Anthropology                          14
Radiography                                   14
Medical Imaging Scie

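The label 'Not elligible' has 326 entries. These rows need to be dropped. Among the remaining courses, Palliative Care and Hospice Services (118 entries) is clearly over-represented, while the other listed courses each have 45 entries or fewer.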

In [None]:
# Dropping all rows with courses as not elligible and printing the shape of the new dataframe
df = merged_df.loc[merged_df['Course'] != 'Not elligible']
df.shape

(674, 19)

Keeping only the rows where the student's cluster points are greater than or equal to the course's weighted points

In [None]:
# Define the filter condition on Cluster_Points and Weighted_points, then filter the dataset
filtered_df = df.loc[df['Cluster_Points'] >= df['Weighted_points']]
filtered_df.shape

(674, 19)

# Feature Engineering

Making a copy of the filtered dataframe to preserve the original before feature engineering

In [None]:
# Making a copy of the filtered dataset
df_copy = filtered_df.copy()

In [None]:
#Displaying the unique entries in the interest column
df_copy['Interest'].unique()

array(['Public Health', 'Laboratories', 'Nursing', 'Medical Research',
       'Therapy', 'Pharmacy', 'Surgery'], dtype=object)

Encoding the Interest column with an ordinal encoder to convert the categorical values to numerical ones

In [None]:
#Importing necessary module
from category_encoders import OrdinalEncoder

# Define a mapping of category labels to integer codes, one distinct code per label
# (the codes match those used for the courses catalogue)
mapping = [{'col': 'Interest', 'mapping': {'Public Health': 1,  'Laboratories': 2,
                                         'Nursing': 3, 'Medical Research': 4,
                                         'Therapy': 5, 'Pharmacy': 6, 'Surgery': 7}}]
# Create an OrdinalEncoder object and fit it to the DataFrame
encoder = OrdinalEncoder(cols=['Interest'], mapping=mapping)
encoder.fit(df_copy)


# Transform the DataFrame using the fitted encoder
df_encoded = encoder.transform(df_copy)


Encoding the interest column on the courses dataset and saving the new dataframe with encoded interest column to a csv file to be used later.


In [None]:
#Importing necessary module
from category_encoders import OrdinalEncoder

# Define a mapping of category labels to integer codes, where each integer code represents the rank of the category label
mapping = [{'col': 'Interest', 'mapping': {'Public Health': 1,  'Laboratories': 2,
                                         'Nursing': 3, 'Medical Research': 4,
                                         'Therapy': 5, 'Pharmacy': 6, 'Surgery': 7}}]
# Create an OrdinalEncoder object and fit it to the DataFrame
course_encoder = OrdinalEncoder(cols=['Interest'], mapping=mapping)
course_encoder.fit(courses)


# Transform the DataFrame using the fitted encoder
course_encoded = course_encoder.transform(courses)

In [None]:
# Save the encoded courses DataFrame to a CSV file named 'courses_catalogue.csv'
course_encoded.to_csv('courses_catalogue.csv', index=False)
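
Both dataframes need to agree on the integer code of every interest area, since the encoded catalogue is reused later. One way to guarantee this is to define the mapping once and apply it to both frames. The sketch below swaps in plain `Series.map` from pandas in place of the `OrdinalEncoder` used above; `INTEREST_CODES` and `encode_interest` are hypothetical names, and the frames are toy data.

```python
import pandas as pd

# One shared mapping (codes copied from the courses encoding cell above)
INTEREST_CODES = {'Public Health': 1, 'Laboratories': 2, 'Nursing': 3,
                  'Medical Research': 4, 'Therapy': 5, 'Pharmacy': 6, 'Surgery': 7}

def encode_interest(frame, codes=INTEREST_CODES):
    """Return a copy of frame with 'Interest' replaced by its integer code (hypothetical helper)."""
    out = frame.copy()
    out['Interest'] = out['Interest'].map(codes)
    return out

# Toy stand-ins for the student and course dataframes
toy_students = pd.DataFrame({'ID': ['S001', 'S002'], 'Interest': ['Therapy', 'Pharmacy']})
toy_courses = pd.DataFrame({'Course_name': ['Pharmacy'], 'Interest': ['Pharmacy']})

print(encode_interest(toy_students)['Interest'].tolist())
print(encode_interest(toy_courses)['Interest'].tolist())
```

With a single dictionary, 'Pharmacy' receives the same code in both frames by construction.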

## Model Selection

In [None]:
#Displaying the columns of the new encoded dataframe
df_encoded.columns

Index(['ID', 'First_Name', 'Last_Name', 'English', 'Kiswahili', 'Mathematics',
       'Physics', 'Biology', 'Chemistry', 'Overall_Grade', 'Interest',
       'Cluster_Points', 'Course', 'languages', 'science', 'Weighted_points',
       'Job_satisfaction', 'Description', 'Keywords'],
      dtype='object')

Building a train-test split using the surprise library

In [None]:
# Define the rating scale for the data
reader = Reader(rating_scale=(4.0, 5.0))

# Load the data into a Surprise Dataset object, containing the 'ID', 'Interest', and 'Job_satisfaction' columns
data   = Dataset.load_from_df(df_encoded[['ID','Interest','Job_satisfaction']], reader)

# Extract the raw ratings from the Surprise Dataset object
raw_ratings = data.raw_ratings

# Shuffle the raw ratings so that the train-test split below is random
random.shuffle(raw_ratings)

# Determine the threshold index to split the raw ratings into train and test sets
threshold   = int(len(raw_ratings)*0.8)

# Split the raw ratings into train and test sets based on the threshold index
train_raw_ratings = raw_ratings[:threshold] # 80% of data is trainset
test_raw_ratings  = raw_ratings[threshold:] # 20% of data is testset

# Use the train and test raw ratings to create a new trainset and testset
data.raw_ratings = train_raw_ratings        # data is now the trainset
trainset         = data.build_full_trainset() 
testset          = data.construct_testset(test_raw_ratings)
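
The shuffle above is not seeded, so the 80/20 split (and every score derived from it) changes from run to run. The following sketch shows the same shuffle-then-slice idea in a reproducible form; `shuffle_split` is a hypothetical helper, the seed value is arbitrary, and the tuples are stand-ins rather than the real ratings.

```python
import random

def shuffle_split(ratings, train_fraction=0.8, seed=42):
    """Shuffle a copy of ratings reproducibly and split it in two (hypothetical helper)."""
    rng = random.Random(seed)          # private generator, leaves the global one untouched
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    threshold = int(len(shuffled) * train_fraction)
    return shuffled[:threshold], shuffled[threshold:]

# Stand-in (user, item, rating) tuples; not the real data
toy_ratings = [(f'S{i:03d}', i % 7 + 1, 4.5) for i in range(1, 11)]

train, test = shuffle_split(toy_ratings)
print(len(train), len(test))
```

Calling `shuffle_split` twice with the same seed returns the same partition, which makes reported test scores repeatable.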


Training and evaluating multiple recommender system models using the surprise library

In [None]:
# Define a list of recommender system models to evaluate
models=[KNNBasic(),KNNWithMeans(),KNNWithZScore(),KNNBaseline(),SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), CoClustering()]

# Create an empty dictionary to store the evaluation results for each model
results = {}

# Iterate over each model in the models list
for model in models:
    # Perform 5-fold cross-validation
    # Evaluation metrics: root mean square error (RMSE) and mean absolute error (MAE)
    CV_scores = cross_validate(model, data, measures=["RMSE","MAE"], cv=5, n_jobs=-1)  
    
    # Store the average score across the 5 folds for each model
    result = pd.DataFrame.from_dict(CV_scores).mean(axis=0).\
             rename({'test_rmse': 'RMSE', 'test_mae': 'MAE'})
    # Add the average scores to the results dictionary, using the name of the model as the key
    results[str(model).split("algorithms.")[1].split("object ")[0]] = result
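
The dictionary key in the loop is carved out of the model's default string representation. The sketch below walks through that string manipulation on a literal; the exact representation (including the memory address) is an assumption about what `str(model)` looks like for a Surprise algorithm object.

```python
# Assumed form of str(model) for a KNNBaseline instance; the address is made up
repr_text = '<surprise.prediction_algorithms.knns.KNNBaseline object at 0x7f00>'

# Keep what follows "algorithms." and precedes "object "
key = repr_text.split("algorithms.")[1].split("object ")[0]
print(repr(key))          # note the trailing space left by the split

# Stripping whitespace gives a cleaner label
print(repr(key.strip()))
```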



In [None]:
# Create a Pandas DataFrame from the results dictionary
performance_df = pd.DataFrame.from_dict(results)

# Transpose the DataFrame and sort it by RMSE in ascending order and print it out
print("Model Performance: \n")
performance_df.T.sort_values(by='RMSE')

Model Performance: 



Unnamed: 0,RMSE,MAE,fit_time,test_time
knns.KNNBaseline,0.229674,0.183819,0.008491,0.00092
matrix_factorization.SVD,0.22969,0.185918,0.01086,0.000869
matrix_factorization.SVDpp,0.230926,0.182996,0.008899,0.00088
knns.KNNBasic,0.25333,0.18491,0.004942,0.001107
knns.KNNWithMeans,0.253435,0.185517,0.015395,0.001059
matrix_factorization.NMF,0.253552,0.184837,0.050212,0.000823
slope_one.SlopeOne,0.253718,0.185605,0.010643,0.001085
knns.KNNWithZScore,0.253737,0.185138,0.041348,0.001195
co_clustering.CoClustering,0.253879,0.185022,0.078714,0.000691
random_pred.NormalPredictor,0.355739,0.286307,0.001412,0.001406


- **KNNBaseline** has the lowest RMSE (0.2297), narrowly ahead of SVD, and a comparatively short fit time. It is therefore our preferred model. Its performance can, however, still be improved.
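
For reference, the two metrics in the table can be computed by hand: MAE averages the absolute errors, while RMSE takes the square root of the average squared error and so penalises large errors more. This is a sketch on made-up ratings and predictions, not the notebook's data.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up true ratings and predictions on the 4.0 to 5.0 scale
actual = [4.0, 4.5, 5.0, 4.5]
predicted = [4.5, 4.5, 4.5, 4.5]

print(mae(actual, predicted), rmse(actual, predicted))
```

Here the MAE is 0.25 while the RMSE is about 0.354, which illustrates why RMSE is never smaller than MAE on the same predictions.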



## Hyperparameter Tuning

Performing hyperparameter tuning for the KNNBaseline model using the GridSearchCV function from the surprise library. This helps determine parameters for the best model performance.

In [None]:
# Hyperparameter tuning - KNNBaseline

# Define a grid of hyperparameters to search over
param_grid = {'k': [10, 20, 30],
              'sim_options' : {'name': ['msd','cosine','pearson'], \
                                'min_support': [3,5], \
                                'user_based': [False, True]}
             }

# Define a GridSearchCV object for the KNNBaseline model, using the specified hyperparameter grid and evaluation metrics
gridsearchKNNBaseline = GridSearchCV(KNNBaseline, param_grid, measures=['mae', 'rmse'], \
                                      cv=5, n_jobs=-1)
# Fit the GridSearchCV object to the data to search for the best hyperparameters                            
gridsearchKNNBaseline.fit(data)

# Print the best hyperparameters and corresponding performance metrics for the MAE and RMSE evaluation metric
print(f'MAE Best Parameters:  {gridsearchKNNBaseline.best_params["mae"]}')
print(f'MAE Best Score:       {gridsearchKNNBaseline.best_score["mae"]}\n')

print(f'RMSE Best Parameters: {gridsearchKNNBaseline.best_params["rmse"]}')
print(f'RMSE Best Score:      {gridsearchKNNBaseline.best_score["rmse"]}\n')

MAE Best Parameters:  {'k': 10, 'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': False}}
MAE Best Score:       0.18311679894875404

RMSE Best Parameters: {'k': 10, 'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': False}}
RMSE Best Score:      0.22805290688519916
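
To get a feel for the cost of the search, the grid above can be enumerated: every value of `k` is combined with every combination of the `sim_options` entries, and each combination is fitted once per fold. The sketch below only counts combinations with `itertools.product`; it does not call Surprise.

```python
from itertools import product

# Values copied from param_grid above
k_values = [10, 20, 30]
sim_names = ['msd', 'cosine', 'pearson']
min_supports = [3, 5]
user_based = [False, True]

# Every hyperparameter combination the grid search evaluates
combinations = list(product(k_values, sim_names, min_supports, user_based))
n_folds = 5

print(len(combinations), len(combinations) * n_folds)
```

That is 36 combinations and 180 model fits, which stays cheap here only because each fit takes milliseconds.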



Training and fitting the best model from the hyperparameter tuning.


In [None]:
# Model fit & prediction - KNNBaseline
k=10
sim_options = {'name':'msd','min_support':3,'user_based':False}
final_model = KNNBaseline(k=k, sim_options=sim_options)

# Fitting the model on trainset & predicting on testset, printing test accuracy
pred = final_model.fit(trainset).test(testset)

print(f'\nUnbiased Testing Performance:')
print(f'MAE: {accuracy.mae(pred)}, RMSE: {accuracy.rmse(pred)}')


The parameters 'k': 10 and 'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': False} are selected because they give the best cross-validated scores in the grid search: an MAE of about 0.183 and an RMSE of about 0.228.



# Building the recommendation system

In [None]:
# measure the time taken to train the model
startTraining = datetime.now()
print("> Training...")

# Create an instance of the KNNBaseline algorithm with the tuned hyperparameters
model = KNNBaseline(k=10, sim_options={'name':'msd', 'min_support':3, 'user_based':False})

# fit the algorithm to the training set
model.fit(trainset)

# Get the current time, calculate the time taken, and print it
endTraining = datetime.now()
print("It took:", (endTraining-startTraining).total_seconds(), "seconds")
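
When timing fits as short as the ones in the comparison table (a few milliseconds), it helps to know that a `timedelta` exposes its duration in two ways. The sketch below uses a made-up 8 ms duration to show the difference.

```python
from datetime import timedelta

# Made-up elapsed time of 8 milliseconds
elapsed = timedelta(milliseconds=8)

print(elapsed.seconds)          # whole-second component only
print(elapsed.total_seconds())  # full duration as a float
```

The `seconds` attribute reports 0 for any duration under one second, whereas `total_seconds()` keeps the fractional part.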


Serializing the trained model using pickle

In [None]:
# Specify the file path and name for the pickled model file
model_filename = "./KNNBaseline_pickled_model"

# Expand a leading '~' to the user's home directory (a no-op for this relative path)
file_name = os.path.expanduser(model_filename)

# Use the dump function from the surprise library to save the trained KNNBaseline model to the specified file and print file name
dump.dump(file_name, algo=model)
print(model_filename)
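
Pickling writes the object to disk as bytes so that it can be restored later without retraining; it does not by itself compress anything. The round trip can be sketched with the standard `pickle` module and a stand-in object. The dictionary below is a placeholder for a trained model, and the temporary file path is chosen for illustration only.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable object behaves the same way
toy_model = {'algo': 'KNNBaseline', 'k': 10}

# Write the object to a file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'toy_pickled_model')
with open(path, 'wb') as handle:
    pickle.dump(toy_model, handle)

# Read it back and confirm nothing was lost
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == toy_model)
```

For a model saved with `dump.dump`, the Surprise counterpart is `dump.load(file_name)`, which returns a `(predictions, algo)` tuple.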