#### Course Recommendation System using Udemy Dataset

#### Algorithms
+ Cosine Similarity
+ ML Models


#### Workflow
+ Dataset
+ Vectorize the dataset
+ Cosine Similarity Matrix
+ ID, Score
+ Train ML Model
+ Recommend


In [1]:
# Load EDA Pkgs
import pandas as pd
import neattext.functions as nfx

In [2]:
# Load ML/Rc Pkgs
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity,linear_kernel
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

In [3]:
# Load our dataset
df = pd.read_csv("Courses_rec.csv")

In [4]:
df.head(10)

Unnamed: 0,course_title,course_field,diff_level
0,Data Engineer,School of Data Science,Intermediate
1,Data Scientist,School of Data Science,Advanced
2,Data Analyst,School of Data Science,Intermediate
3,C++,School of Autonomous Systems,Intermediate
4,Product Manager,School of Product Management,Beginner
5,Business Analytics,School of Business,Beginner
6,Introduction to Programming,School of Programming & Development,Beginner
7,Digital Marketing,School of Business,Beginner
8,Deep Learning,School of Artificial Intelligence,Intermediate
9,Blockchain Developer,School of Programming & Development,Intermediate


In [5]:
df['course_title']

0                                           Data Engineer
1                                          Data Scientist
2                                            Data Analyst
3                                                     C++
4                                         Product Manager
                              ...                        
3936    Learn jQuery from Scratch - Master of JavaScri...
3937    How To Design A WordPress Website With No Codi...
3938                        Learn and Build using Polymer
3939    CSS Animations: Create Amazing Effects on Your...
3940    Using MODX CMS to Build Websites: A Beginner's...
Name: course_title, Length: 3941, dtype: object

#  Data Cleaning & Preparing

In [11]:
# Clean text: remove stopwords
df['clean_course_title'] = df['course_title'].apply(nfx.remove_stopwords)

In [12]:
# Clean text: remove special characters
df['clean_course_title'] = df['clean_course_title'].apply(nfx.remove_special_characters)

In [13]:
df[['course_title','clean_course_title']]

Unnamed: 0,course_title,clean_course_title
0,Data Engineer,Data Engineer
1,Data Scientist,Data Scientist
2,Data Analyst,Data Analyst
3,C++,C
4,Product Manager,Product Manager
...,...,...
3936,Learn jQuery from Scratch - Master of JavaScri...,Learn jQuery Scratch Master JavaScript library
3937,How To Design A WordPress Website With No Codi...,Design WordPress Website Coding
3938,Learn and Build using Polymer,Learn Build Polymer
3939,CSS Animations: Create Amazing Effects on Your...,CSS Animations Create Amazing Effects Website


#  Building the Vectorizer and Cosine Similarity Matrix for Course Titles

In [14]:
# Vectorize our Text
vect = TfidfVectorizer()
cv_mat = vect.fit_transform(df['clean_course_title'])

In [16]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_cv_words = pd.DataFrame(cv_mat.todense(),columns=vect.get_feature_names_out())

In [17]:
df_cv_words.head()

Unnamed: 0,000005,001,01,02,10,100,101,101master,102,10k,...,zend,zero,zerotohero,zf2,zinsen,zoho,zombie,zu,zuhause,zur
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
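One detail worth checking at this step: the cleaned title for `C++` is just `C`, and `TfidfVectorizer` with its default token pattern ignores one-character tokens. The sketch below uses hypothetical toy titles (not rows from the dataset) to show that such a title ends up with an all-zero vector, and that a more permissive `token_pattern` (an assumption, not something the notebook uses) keeps it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned titles; "C" mimics what remains of "C++" after cleaning
toy_clean_titles = ["C", "Data Analyst", "R Programming"]

# Default token pattern only keeps tokens of two or more word characters
default_mat = TfidfVectorizer().fit_transform(toy_clean_titles)
default_nnz = int((default_mat.toarray()[0] != 0).sum())

# Assumption: a permissive pattern that also accepts one-character tokens
permissive_vect = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
permissive_mat = permissive_vect.fit_transform(toy_clean_titles)
permissive_nnz = int((permissive_mat.toarray()[0] != 0).sum())

print("non-zero features for 'C' (default):   ", default_nnz)
print("non-zero features for 'C' (permissive):", permissive_nnz)
```

A course with an all-zero vector has cosine similarity 0 with every other course, so it can never be matched on its title.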


In [18]:
# Cosine Similarity Matrix
cosine_sim_mat = cosine_similarity(cv_mat)

In [21]:
cosine_sim_mat

array([[1.        , 0.36448689, 0.40875416, ..., 0.        , 0.        ,
        0.        ],
       [0.36448689, 1.        , 0.34747136, ..., 0.        , 0.        ,
        0.        ],
       [0.40875416, 0.34747136, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.12063291],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.12063291, 0.        ,
        1.        ]])
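To make the matrix above easier to read, here is a minimal sketch on three hypothetical titles (not taken from the dataset). It also shows that `linear_kernel`, which is imported in cell 2 but not used, gives the same result here, because `TfidfVectorizer` L2-normalizes each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy titles for illustration only
toy_titles = ["Options Trading Basics", "Basics of Trading", "Learn Guitar Chords"]

toy_vect = TfidfVectorizer()
toy_mat = toy_vect.fit_transform(toy_titles)

cos = cosine_similarity(toy_mat)   # normalizes rows, then takes dot products
lin = linear_kernel(toy_mat)       # plain dot products between rows

# Rows are already unit length, so both matrices agree
same = np.allclose(cos, lin)
print(same)
print(cos.round(2))
```

Titles that share words get a score above 0, titles with no words in common get exactly 0, and every title has similarity 1 with itself.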

#  Calculate Cosine Similarity

In [29]:
# Get Course ID/Index
course_indices = pd.Series(df.index,index=df['course_title']).drop_duplicates()

In [30]:
print(course_indices)
type(course_indices)

course_title
Data Engineer                                                  0
Data Scientist                                                 1
Data Analyst                                                   2
C++                                                            3
Product Manager                                                4
                                                            ... 
Learn jQuery from Scratch - Master of JavaScript library    3936
How To Design A WordPress Website With No Coding At All     3937
Learn and Build using Polymer                               3938
CSS Animations: Create Amazing Effects on Your Website      3939
Using MODX CMS to Build Websites: A Beginner's Guide        3940
Length: 3941, dtype: int64


pandas.core.series.Series
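Note that `drop_duplicates()` on this Series compares the values (the row numbers), which are already unique, so repeated titles stay in the index; the length of 3941 above matches the full dataset. The sketch below, on a hypothetical toy frame, contrasts that with deduplicating on the title index instead. `Index.duplicated` is a real pandas method; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy frame with one repeated title
toy_df = pd.DataFrame({"course_title": ["Python Basics", "Excel 101", "Python Basics"]})

# Same construction as the notebook cell: values 0, 1, 2 are unique, nothing is dropped
by_value = pd.Series(toy_df.index, index=toy_df["course_title"]).drop_duplicates()

# Alternative: keep the first row for each title by deduplicating on the index
lookup = pd.Series(toy_df.index, index=toy_df["course_title"])
by_title = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value))                # repeated title is still present
print(len(by_title))                # one entry per title
print(by_title["Python Basics"])    # a single row number
```

With the first form, looking up a repeated title returns a Series of several row numbers rather than one integer, which changes what `cosine_sim_mat[idx]` returns later on.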

#  Feature Extraction & Model Training

In [48]:
df['course_field'].value_counts()

Web Development                        1200
Business Finance                       1195
Musical Instruments                     680
Graphic Design                          603
School of Programming & Development     143
School of Artificial Intelligence        42
School of Business                       16
School of Data Science                   15
School of Product Management             13
Career Advancement                       11
School of Autonomous Systems             10
School of Cloud Computing                 9
School of Cybersecurity                   4
Name: course_field, dtype: int64

In [49]:
#plt.scatter(df.course_field,df.course_title)

In [51]:
import time
from nltk.classify.scikitlearn import SklearnClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
x_train, x_test, y_train, y_test = train_test_split(cv_mat, df['course_field'], random_state=42, test_size=0.2)


In [53]:
def train_models(X_train,Y_train,X_test,Y_test):
    
    print('---------------------Start Training-------------------------------')
    
    start_time = time.time()
    
    # Define models to train
    
    names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
             "Naive Bayes", "SVM Linear"]

    classifiers = [
        KNeighborsClassifier(),
        DecisionTreeClassifier(),
        RandomForestClassifier(),
        LogisticRegression(solver='lbfgs', max_iter=1000),
        SGDClassifier(max_iter = 100),
        MultinomialNB(),
        SVC(kernel = 'linear')
    ]

    models = zip(names, classifiers)
    
    scored_models=dict()

    for name, model in models:
        model.fit(X_train,Y_train)
        pred = model.predict(X_test)
        scored_models[name]=[model,pred]
        # Score against the Y_test argument, not the global y_test
        score=f1_score(Y_test, pred,average='micro')
        accuracy = accuracy_score(Y_test,pred) 
        per_score=precision_score(Y_test, pred,average='micro')
        rec_score=recall_score(Y_test, pred,average='micro')
        print(name,"has trained. Accuracy: ", accuracy," F1 score: ",score)

    print('---------------------End of Training-------------------------------')
    
    print("-------- ",(time.time() - start_time),' seconds --------')
    
    return scored_models

In [54]:
trained_models=train_models(X_train=x_train,Y_train=y_train,X_test=x_test,Y_test=y_test)


---------------------Start Training-------------------------------
K Nearest Neighbors has trained. Accuracy:  0.49302915082382764  F1 score:  0.49302915082382764
Decision Tree has trained. Accuracy:  0.8403041825095057  F1 score:  0.8403041825095057
Random Forest has trained. Accuracy:  0.8770595690747782  F1 score:  0.8770595690747782
Logistic Regression has trained. Accuracy:  0.8694550063371356  F1 score:  0.8694550063371356
SGD Classifier has trained. Accuracy:  0.9112801013941698  F1 score:  0.9112801013941698
Naive Bayes has trained. Accuracy:  0.8517110266159695  F1 score:  0.8517110266159695
SVM Linear has trained. Accuracy:  0.8973384030418251  F1 score:  0.8973384030418251
---------------------End of Training-------------------------------
--------  2.8906896114349365  seconds --------
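The two numbers printed for each model are identical, and that is expected: for single-label multiclass data, micro-averaged F1 equals accuracy. Since `course_field` is heavily imbalanced (1200 courses in the largest class, 4 in the smallest), a macro average can be more informative. The sketch below uses hypothetical labels, not the notebook's predictions, to show the difference.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for an imbalanced three-class problem
y_true = ["web"] * 6 + ["finance"] * 3 + ["cloud"] * 1
# The classifier never predicts the rare "cloud" class
y_pred = ["web"] * 6 + ["finance"] * 2 + ["web"] * 2

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
# zero_division=0 scores the never-predicted class as 0 without a warning
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy:", acc)
print("micro F1:", micro)
print("macro F1:", round(macro, 3))
```

Macro F1 weights every class equally, so a model that ignores the small classes scores visibly lower than its accuracy suggests.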


#  Test Model Against Real Data

In [61]:
course_name='Trading Options Basics'

In [62]:
result = trained_models['Naive Bayes'][0].predict(vect.transform([course_name]))

In [63]:
print('This Course Is ->',result[0],'<-Course')

This Course Is -> Business Finance <-Course
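The prediction above needs the fitted `vect` and the classifier to be kept in sync by hand. One possible alternative, sketched here on hypothetical toy data, is to bundle both into a scikit-learn pipeline so that raw titles can be passed directly to `predict`. The names `toy_titles`, `toy_fields`, and `toy_clf` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, two fields with three titles each
toy_titles = [
    "Options Trading Basics",
    "Forex Trading for Beginners",
    "Stock Market Investing",
    "Learn HTML and CSS",
    "JavaScript for Web Developers",
    "Build Websites with WordPress",
]
toy_fields = ["Business Finance"] * 3 + ["Web Development"] * 3

# The pipeline fits the vectorizer and the classifier together
toy_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_clf.fit(toy_titles, toy_fields)

# Raw text goes in; vectorization happens inside the pipeline
toy_pred = toy_clf.predict(["Trading Options Basics"])
print(toy_pred[0])
```

Fitting the vectorizer inside the pipeline on training rows only also avoids learning vocabulary and IDF weights from the test split.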


#  Recommend Top-N Courses 

In [66]:
def recommend_course(title,field,num_of_rec=10):
    # Row index of the requested title
    idx = course_indices[title]
    # Pair every course index with its similarity to the requested course
    scores = list(enumerate(cosine_sim_mat[idx]))
    # Sort by similarity, highest first
    sorted_scores = sorted(scores,key=lambda x:x[1],reverse=True)
    # Skip the first entry, normally the requested course itself (similarity 1.0)
    selected_course_indices = [i[0] for i in sorted_scores[1:]]

    result = {'Course':df['course_title'].iloc[selected_course_indices],'Field':df['course_field'].iloc[selected_course_indices]}
    rec_df = pd.DataFrame(result)
    # Keep only recommendations from the requested field
    dd=rec_df.loc[rec_df.Field == field]

    return dd.head(num_of_rec) 
    

In [67]:
recommend_course(course_name,result[0],20)

Unnamed: 0,Course,Field
358,Options Trading 101: The Basics,Business Finance
1124,Basics of Trading,Business Finance
456,Trading Options For Consistent Returns: Option...,Business Finance
1063,Trading: Basics of Trading for Beginners,Business Finance
1072,Learn Call Options and Put Options - Introduct...,Business Finance
1216,Options Basics & Trading With Small Capital! -...,Business Finance
329,Options Trading Basics (3-Course Bundle),Business Finance
999,Advanced Options Trading Course,Business Finance
396,Forex Trading for Beginners - Basics,Business Finance
357,Intermediate Options trading concepts for Stoc...,Business Finance
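As a variation on the function above, the ranking can be done with `numpy.argsort` instead of sorting a Python list of tuples, and the similarity score can be returned alongside each course. This is a minimal sketch on hypothetical toy data; `recommend_top_n` and the toy frame are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue
toy_df = pd.DataFrame({
    "course_title": [
        "Options Trading Basics",
        "Basics of Trading",
        "Forex Trading for Beginners",
        "Learn Guitar Chords",
        "Trading Website Design",
    ],
    "course_field": [
        "Business Finance",
        "Business Finance",
        "Business Finance",
        "Musical Instruments",
        "Web Development",
    ],
})
toy_sim = cosine_similarity(TfidfVectorizer().fit_transform(toy_df["course_title"]))

def recommend_top_n(idx, field, sim, frame, n=10):
    # Indices sorted by descending similarity to row idx
    order = np.argsort(-sim[idx], kind="stable")
    # Drop the query course explicitly instead of assuming it sorts first
    order = order[order != idx]
    ranked = frame.iloc[order].assign(score=sim[idx][order])
    # Keep only courses from the requested field
    return ranked[ranked["course_field"] == field].head(n)

toy_rec = recommend_top_n(0, "Business Finance", toy_sim, toy_df, n=2)
print(toy_rec)
```

Removing the query row by index rather than by position keeps the result correct even when another course has an identical title vector and ties at similarity 1.0.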
