### 1. Defining Objective
- This recommendation system uses content-based filtering: courses are compared by the similarity of their tags, and recommendations are based on the courses a user watches or searches for.

- The dataset used is the [Courses dataset](https://www.kaggle.com/khusheekapoor/coursera-courses-dfset-2021), which contains over 3,000 courses.

Another approach is collaborative filtering, which is not used in this notebook. A more advanced approach would combine both techniques in a hybrid model.
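
To make the idea of tag similarity concrete, here is a minimal sketch on three invented tag strings. The strings and the `toy_` variable names are assumptions made up for illustration and are not part of the dataset; the sketch only shows how word counts and cosine similarity interact.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical tag strings, invented for illustration only
toy_tags = [
    "python data analysis pandas",
    "python machine learning data",
    "screenwriting film drama",
]

# Bag-of-words counts: one row per course, one column per distinct word
toy_vectors = CountVectorizer().fit_transform(toy_tags)

# Pairwise cosine similarity between all rows (a 3 x 3 matrix)
toy_similarity = cosine_similarity(toy_vectors)
print(toy_similarity.round(2))
```

The first two strings share two of their four words, so their score is 0.5, while the screenwriting string shares no words with either and scores 0.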

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.feature_extraction.text import CountVectorizer
import nltk 
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import pickle
warnings.filterwarnings('ignore')

### 2. Data Collection

In [3]:
df = pd.read_csv("Datastore/Reocords.csv")
df.head(5)

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...


### 3. Data Pre-processing

In [4]:
df.shape

(3522, 7)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB


In [6]:
df.isnull().sum() 

Course Name           0
University            0
Difficulty Level      0
Course Rating         0
Course URL            0
Course Description    0
Skills                0
dtype: int64

In [7]:
for index,i in enumerate(df.columns):
    if i != 'Course Description' and i != 'Skills':
        print(f'--------column {index}----------')
        print(df[i].value_counts())
    

--------column 0----------
Course Name
Google Cloud Platform Fundamentals: Core Infrastructure    8
Introduction to Artificial Intelligence (AI)               4
Python for Data Science and AI                             4
The Art of Music Production                                4
What is Data Science?                                      4
                                                          ..
Technical Support Fundamentals                             1
Understanding the Music Business: What is Music Worth?     1
Gut Check: Exploring Your Microbiome                       1
Managerial Accounting Fundamentals                         1
Architecting with Google Kubernetes Engine: Production     1
Name: count, Length: 3416, dtype: int64
--------column 1----------
University
Coursera Project Network                      562
University of Illinois at Urbana-Champaign    138
Johns Hopkins University                      110
University of Michigan                        101
University o

#### Features to be used: 

- Course Name : Name of the course
- Course Description : Similar courses tend to have similar descriptions
- Skills : Users may want to see courses that teach the same skills
- Difficulty Level : Lets courses be matched by difficulty level

#### Features not used for the recommendation system:

- Course Rating : Numerical column; ratings can be biased and are not evenly distributed
- University : The same university may offer courses in unrelated domains that the user does not want to see
- Course URL : Carries no information useful for recommendations
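
One way to inspect a ratings column that `df.info()` reports as `object` is to coerce it to numbers first. The sketch below uses an invented series; the non-numeric entry is an assumption for illustration, not a value confirmed to be in the dataset.

```python
import pandas as pd

# Hypothetical ratings; the non-numeric entry is invented for illustration
ratings = pd.Series(["4.8", "4.1", "Not Calibrated", "4.6"])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(ratings, errors="coerce")

print(numeric.isna().sum())   # number of entries that were not numeric
print(numeric.describe())     # summary statistics of the parsed ratings
```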

In [8]:
df = df[['Course Name','Difficulty Level','Course Description','Skills']].copy()  # copy so later column assignments act on an independent frame

In [9]:
# Replace spaces with commas, collapse doubled commas, and drop colons.
# regex=False treats each pattern as a literal string, which matters for '(' and ')'.
df.loc[:, 'Course Name'] = df['Course Name'].str.replace(' ', ',', regex=False)
df.loc[:, 'Course Name'] = df['Course Name'].str.replace(',,', ',', regex=False)
df.loc[:, 'Course Name'] = df['Course Name'].str.replace(':', '', regex=False)

df.loc[:, 'Course Description'] = df['Course Description'].str.replace(' ', ',', regex=False)
df.loc[:, 'Course Description'] = df['Course Description'].str.replace(',,', ',', regex=False)
df.loc[:, 'Course Description'] = df['Course Description'].str.replace('_', '', regex=False)
df.loc[:, 'Course Description'] = df['Course Description'].str.replace(':', '', regex=False)
df.loc[:, 'Course Description'] = df['Course Description'].str.replace('(', '', regex=False)
df.loc[:, 'Course Description'] = df['Course Description'].str.replace(')', '', regex=False)

# Removing parentheses from the 'Skills' column
df.loc[:, 'Skills'] = df['Skills'].str.replace('(', '', regex=False)
df.loc[:, 'Skills'] = df['Skills'].str.replace(')', '', regex=False)
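
The chain of replacements above could also be written as one helper. The following is a sketch of that idea and not an exact equivalent of the cell: `clean_text` is a hypothetical name, it applies the same rules to every column, and it produces space-separated text directly instead of going through commas.

```python
import re

import pandas as pd

def clean_text(text):
    """Drop ':', '_', '(' and ')', turn commas into spaces, collapse whitespace."""
    text = re.sub(r"[:_()]", "", text)
    text = text.replace(",", " ")
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical strings modelled on the notebook's columns
demo = pd.Series([
    "Business Strategy: Business Model Canvas",
    "select (sql),  database",
])
cleaned = demo.map(clean_text).tolist()
print(cleaned)
```

On these two strings the result is `['Business Strategy Business Model Canvas', 'select sql database']`.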


In [10]:
df.head()

Unnamed: 0,Course Name,Difficulty Level,Course Description,Skills
0,"Write,A,Feature,Length,Screenplay,For,Film,Or,...",Beginner,"Write,a,Full,Length,Feature,Film,Script,In,thi...",Drama Comedy peering screenwriting film D...
1,"Business,Strategy,Business,Model,Canvas,Analys...",Beginner,"By,the,end,of,this,guided,project,you,will,be,...",Finance business plan persona user experienc...
2,"Silicon,Thin,Film,Solar,Cells",Advanced,"This,course,consists,of,a,general,presentation...",chemistry physics Solar Energy film lambda...
3,"Finance,for,Managers",Intermediate,"When,it,comes,to,numbers,there,is,always,more,...",accounts receivable dupont analysis analysis...
4,"Retrieve,Data,using,Single-Table,SQL,Queries",Beginner,"In,this,course,you'll,learn,how,to,effectively...",Data Analysis select sql database management...


In [11]:
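# Note: no separator is added between columns, so the words at each column boundary fuse (e.g. 'MiroBeginnerBy').
df['tags'] = df['Course Name'] + df['Difficulty Level'] + df['Course Description'] + df['Skills']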
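
Because the columns are concatenated directly, the last word of one column runs into the first word of the next. A minimal sketch of a separator-aware variant follows. The `demo` frame is hypothetical and only mimics the notebook's column names; applying this to the real data would change the tags and therefore the similarity scores shown later.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the notebook's columns
demo = pd.DataFrame({
    "Course Name": ["Finance,for,Managers", "Silicon,Thin,Film,Solar,Cells"],
    "Difficulty Level": ["Intermediate", "Advanced"],
    "Course Description": ["When,it,comes,to,numbers", "This,course,consists"],
    "Skills": ["accounts receivable", "chemistry physics"],
})

# Put a single space between columns so boundary words stay separate tokens
demo["tags"] = (
    demo["Course Name"] + " "
    + demo["Difficulty Level"] + " "
    + demo["Course Description"] + " "
    + demo["Skills"]
)
print(demo["tags"].iloc[0])
```

With the separator in place, "Managers", "Intermediate" and "When" remain three distinct words instead of the single token "ManagersIntermediateWhen".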

In [12]:
df.head()

Unnamed: 0,Course Name,Difficulty Level,Course Description,Skills,tags
0,"Write,A,Feature,Length,Screenplay,For,Film,Or,...",Beginner,"Write,a,Full,Length,Feature,Film,Script,In,thi...",Drama Comedy peering screenwriting film D...,"Write,A,Feature,Length,Screenplay,For,Film,Or,..."
1,"Business,Strategy,Business,Model,Canvas,Analys...",Beginner,"By,the,end,of,this,guided,project,you,will,be,...",Finance business plan persona user experienc...,"Business,Strategy,Business,Model,Canvas,Analys..."
2,"Silicon,Thin,Film,Solar,Cells",Advanced,"This,course,consists,of,a,general,presentation...",chemistry physics Solar Energy film lambda...,"Silicon,Thin,Film,Solar,CellsAdvancedThis,cour..."
3,"Finance,for,Managers",Intermediate,"When,it,comes,to,numbers,there,is,always,more,...",accounts receivable dupont analysis analysis...,"Finance,for,ManagersIntermediateWhen,it,comes,..."
4,"Retrieve,Data,using,Single-Table,SQL,Queries",Beginner,"In,this,course,you'll,learn,how,to,effectively...",Data Analysis select sql database management...,"Retrieve,Data,using,Single-Table,SQL,QueriesBe..."


In [13]:
df['tags'].iloc[1]

'Business,Strategy,Business,Model,Canvas,Analysis,with,MiroBeginnerBy,the,end,of,this,guided,project,you,will,be,fluent,in,identifying,and,creating,Business,Model,Canvas,solutions,based,on,previous,high-level,analyses,and,research,data.,This,will,enable,you,to,identify,and,map,the,elements,required,for,new,products,and,services.,Furthermore,it,is,essential,for,generating,positive,results,for,your,business,venture.,This,guided,project,is,designed,to,engage,and,harness,your,visionary,and,exploratory,abilities.,You,will,use,proven,models,in,strategy,and,product,development,with,the,Miro,platform,to,explore,and,analyse,your,business,propositions.,,We,will,practice,critically,examining,results,from,previous,analysis,and,research,results,in,deriving,the,values,for,each,of,the,business,model,sections.Finance  business plan  persona user experience  business model canvas  Planning  Business  project  Product Development  presentation  Strategy business business-strategy'

In [14]:
df = df[['Course Name','tags']]

In [15]:
df.head()

Unnamed: 0,Course Name,tags
0,"Write,A,Feature,Length,Screenplay,For,Film,Or,...","Write,A,Feature,Length,Screenplay,For,Film,Or,..."
1,"Business,Strategy,Business,Model,Canvas,Analys...","Business,Strategy,Business,Model,Canvas,Analys..."
2,"Silicon,Thin,Film,Solar,Cells","Silicon,Thin,Film,Solar,CellsAdvancedThis,cour..."
3,"Finance,for,Managers","Finance,for,ManagersIntermediateWhen,it,comes,..."
4,"Retrieve,Data,using,Single-Table,SQL,Queries","Retrieve,Data,using,Single-Table,SQL,QueriesBe..."


In [16]:
df['tags'] = df['tags'].str.replace(',',' ')

In [17]:
df['Course Name'] = df['Course Name'].str.replace(',',' ')

In [18]:
df.rename(columns = {'Course Name':'course_name'}, inplace = True)

In [19]:
df['tags'] = df['tags'].str.lower()  # lowercase the tags column

In [20]:
df.head(5)

Unnamed: 0,course_name,tags
0,Write A Feature Length Screenplay For Film Or ...,write a feature length screenplay for film or ...
1,Business Strategy Business Model Canvas Analys...,business strategy business model canvas analys...
2,Silicon Thin Film Solar Cells,silicon thin film solar cellsadvancedthis cour...
3,Finance for Managers,finance for managersintermediatewhen it comes ...
4,Retrieve Data using Single-Table SQL Queries,retrieve data using single-table sql queriesbe...


In [21]:
df.shape

(3522, 2)

### Using NLP to Build the Recommender

In [22]:
cv = CountVectorizer(max_features=5000,stop_words='english')

In [23]:
vectors = cv.fit_transform(df['tags']).toarray()

In [24]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [25]:
ps = PorterStemmer()

In [26]:
def stem(text):
    y=[]
    
    for i in text.split():
        y.append(ps.stem(i))
    
    return " ".join(y)

In [27]:
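# Note: `vectors` was already computed in In [23], so this stemming does not affect `similarity` unless the vectorizer is re-fit on the stemmed tags.
df['tags'] = df['tags'].apply(stem)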
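
Stemming only influences the recommendations if it runs before the text is vectorized. The sketch below illustrates that ordering on two invented strings. `stem_text`, `similarity_of` and `raw_tags` are hypothetical names, and the vectorizer is used with default settings for brevity.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated token and re-join with spaces."""
    return " ".join(stemmer.stem(token) for token in text.split())

def similarity_of(tags):
    """Cosine similarity between the first two strings in `tags`."""
    vectors = CountVectorizer().fit_transform(tags)
    return cosine_similarity(vectors)[0, 1]

# Hypothetical tags, invented for illustration
raw_tags = ["managing projects", "project management"]

raw_score = similarity_of(raw_tags)                                # vectorize as-is
stemmed_score = similarity_of([stem_text(t) for t in raw_tags])    # stem, then vectorize
print(raw_score, stemmed_score)
```

The raw strings share no exact tokens, so their score is 0. After stemming, related word forms collapse to common stems and the score rises.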

In [28]:
similarity = cosine_similarity(vectors)

In [29]:
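def recommend(course):
    # Row index of the first course whose name matches exactly (raises IndexError if none does)
    course_index = df[df['course_name'] == course].index[0]
    # Despite the name, these are cosine similarity scores: higher means more similar
    distances = similarity[course_index]
    # Sort highest first, skip position 0 (normally the course itself) and keep the next six
    course_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:7]

    for i in course_list:
        print(df.iloc[i[0]].course_name)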

In [30]:
recommend('Business Strategy Business Model Canvas Analysis with Miro') 

Product Development Customer Persona Development with Miro
Product and Service Development Empathy Mapping with Miro
Product Development Customer Journey Mapping with Miro
Analyzing Macro-Environmental Factors Using Creately
Business Strategy in Practice (Project-centered Course)
Innovating with the Business Model Canvas
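
The `recommend` function raises an `IndexError` for an unknown title, and because some course names occur more than once in the data, the same title can appear in its own result list. A more defensive variant might look like the sketch below. `recommend_titles` and the toy data are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def recommend_titles(course, frame, sim, top_n=6):
    """Return up to top_n distinct course names most similar to `course`.

    `frame` needs a 'course_name' column whose row order matches `sim`.
    """
    matches = np.flatnonzero(frame["course_name"].to_numpy() == course)
    if matches.size == 0:
        return []  # unknown title: return nothing instead of raising IndexError
    scores = sim[matches[0]]
    # Stable sort on negated scores: highest first, ties keep row order
    order = np.argsort(-scores, kind="stable")
    results = []
    for position in order:
        name = frame["course_name"].iloc[position]
        # Skip the query itself and repeated listings of any title
        if name == course or name in results:
            continue
        results.append(name)
        if len(results) == top_n:
            break
    return results

# Hypothetical data: course "A" is listed twice
demo = pd.DataFrame({"course_name": ["A", "B", "A", "C"]})
demo_sim = np.array([
    [1.0, 0.2, 1.0, 0.7],
    [0.2, 1.0, 0.2, 0.1],
    [1.0, 0.2, 1.0, 0.7],
    [0.7, 0.1, 0.7, 1.0],
])
print(recommend_titles("A", demo, demo_sim, top_n=2))  # ['C', 'B']
```

Returning a list instead of printing also makes the function easier to reuse from an application that loads the exported files.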


### Exporting Model

In [31]:
with open('content_sim.pkl', 'wb') as f:
    pickle.dump(similarity, f)  # similarity matrix
with open('content_list.pkl', 'wb') as f:
    pickle.dump(df.to_dict(), f)  # the DataFrame as a plain dict
with open('content.pkl', 'wb') as f:
    pickle.dump(df, f)  # the DataFrame itself
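
For completeness, this is a minimal round-trip sketch of how an exported file could be read back. It writes to a temporary directory and uses a small identity matrix as a stand-in for the real similarity matrix; both are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for the notebook's similarity matrix
demo_similarity = np.eye(3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "content_sim.pkl")

    # Write: the context manager closes the file even if dump() raises
    with open(path, "wb") as fh:
        pickle.dump(demo_similarity, fh)

    # Read the object back, as a consuming application would
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

print(np.array_equal(loaded, demo_similarity))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.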