# Coursera Course Recommendations

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
df = pd.read_csv("./data/Coursera.csv")

In [7]:
# Display head of dataframe
df.head()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...


The Skills column mixes capitalized and lowercase words, so it may be better to convert everything to lower case. Moreover, each entry is a single string rather than a list of separate skill tokens.
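
As a rough illustration of that cleanup, the sketch below lowercases and tokenizes a toy stand-in for the Skills column. The sample values are hypothetical, and splitting on whitespace is an assumption that breaks multi-word skills such as "Data Analysis" into separate tokens, so the real column may need a different delimiter.

```python
import pandas as pd

# Toy stand-in for the Skills column (hypothetical values, not the real data)
skills = pd.Series([
    "Drama Comedy screenwriting film",
    "Data Analysis select (sql) database",
])

# Lowercase every string, then split on whitespace into a list of tokens
skill_tokens = skills.str.lower().str.split()

print(skill_tokens.tolist())
```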

Note that the "Course Description" column also contains free text.
Using NLP techniques, the Course Description column will be used to recommend courses based on their content. Term Frequency-Inverse Document Frequency (TF-IDF) vectors will be computed for each description. This gives a matrix where each row represents a course and each column represents a word in the course description vocabulary. The TfidfVectorizer in Scikit-Learn will be used.
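
To make the layout of a TF-IDF matrix concrete, here is a minimal sketch on three made-up descriptions. The names `docs`, `toy_tfidf`, and `toy_matrix` are illustrative only and are not used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up course descriptions (illustrative, not from the dataset)
docs = [
    "learn finance and accounting basics",
    "learn python programming basics",
    "advanced python for finance",
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(docs)

# One row per document, one column per vocabulary term
print(toy_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))
```

Words that appear in many documents receive a lower weight than words that are specific to one document, which is what lets the vectors separate courses by topic.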

In [12]:
# Import TfidfVectorizer from Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Exclude English stopwords
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with empty string
df['Course Description'] = df['Course Description'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming data
tfidf_matrix = tfidf.fit_transform(df['Course Description'])

# Output the shape of tfidf_matrix, which gives dimensions
tfidf_matrix.shape

(3522, 20074)

The 3522 course descriptions in the dataset contain a vocabulary of 20074 distinct words. We can now compute a similarity score. Several similarity metrics could be used, such as Manhattan distance, Euclidean distance, Pearson correlation, and cosine similarity. Cosine similarity will be used in this system. Because TfidfVectorizer L2-normalizes each row by default, the dot product between any two rows directly gives their cosine similarity, which is why Scikit-Learn's linear_kernel() will be used instead of cosine_similarity(), for the sake of faster computation.
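
The dot-product shortcut can be checked on a small example. The sketch below uses made-up documents and assumes the vectorizer's default `norm='l2'` setting; with a different `norm` the two functions would no longer agree.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up documents, each containing at least one non-stopword
docs = [
    "learn finance and accounting basics",
    "learn python programming basics",
    "advanced python for finance",
]
X = TfidfVectorizer(stop_words='english').fit_transform(docs)

# With the default norm='l2', every row vector has length 1
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

# So the plain dot product matches the cosine similarity
same = np.allclose(linear_kernel(X, X), cosine_similarity(X, X))
print(row_norms, same)
```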

In [14]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [15]:
cosine_sim.shape

(3522, 3522)

This gives a 3522 x 3522 matrix in which entry (i, j) is the similarity score between the descriptions of course i and course j.

In [16]:
# Show row index 1 as an example of similarity scores
# Note that row 1, column 1 is just similarity score with itself = 1
# Diagonal scores in the matrix are all 1
cosine_sim[1]

array([0.03123665, 1.        , 0.00858915, ..., 0.0313672 , 0.00488239,
       0.04560336])

The following code defines a function that takes a course title and returns the 10 most similar courses. For this, we need a reverse mapping from course titles to DataFrame indices, that is, a way to look up the index of a course in df given its title.

In [17]:
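# Construct reverse map from course names to DataFrame indices
indices = pd.Series(df.index, index=df['Course Name'])
# Keep the first row for any repeated course name, so each lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]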

In [18]:
# Display the first 10 entries of the mapping
indices[:10]

Course Name
Write A Feature Length Screenplay For Film Or Television                                         0
Business Strategy: Business Model Canvas Analysis with Miro                                      1
Silicon Thin Film Solar Cells                                                                    2
Finance for Managers                                                                             3
Retrieve Data using Single-Table SQL Queries                                                     4
Building Test Automation Framework using Selenium and TestNG                                     5
Doing Business in China Capstone                                                                 6
Programming Languages, Part A                                                                    7
The Roles and Responsibilities of Nonprofit Boards of Directors within the Governance Process    8
Business Russian Communication. Part 3                                                           
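
One caveat with a title-to-index Series: if the dataset contains repeated course names, a label lookup returns several rows instead of a single index. The sketch below uses a hypothetical three-row frame to show that `drop_duplicates()` on such a Series compares the values (the row positions), not the labels, and how deduplicating on the index labels behaves instead.

```python
import pandas as pd

# Hypothetical frame with one repeated course name
toy = pd.DataFrame({"Course Name": ["Intro to SQL", "Finance 101", "Intro to SQL"]})
lookup = pd.Series(toy.index, index=toy["Course Name"])

# drop_duplicates() compares values (0, 1, 2), which are already unique,
# so the repeated label survives and the lookup returns two rows
print(len(lookup.drop_duplicates()), len(lookup["Intro to SQL"]))

# Deduplicate on the index labels instead, keeping the first occurrence
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["Intro to SQL"])
```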

Define the recommendation function:
- Get the index of the course given its name
- Get the cosine similarity scores of that course with all courses. Convert them into a list of tuples where the first element is the course position (index) and the second element is its similarity score.
- Sort the list of tuples by similarity score (the second element), in descending order
- Skip the first element, because the course with the highest similarity score of 1 will be the course itself, and keep the next 10 elements
- Return the titles corresponding to the indices of those 10 elements

In [28]:
indices['Finance for Managers']

3

In [29]:
# Define the recommendation function
def get_recommendations(course, cosine_sim=cosine_sim):
    # Get index of course that matches the title
    idx = indices[course]
    
    # Get pairwise similarity score of all courses with that course
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort courses based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get scores of the 10 most similar courses
    sim_scores = sim_scores[1:11]
    
    # Get course indices
    course_indices = [i[0] for i in sim_scores]
    
    # Return top 10 most similar courses
    return df['Course Name'].iloc[course_indices]

In [30]:
# Example - call function for the course 'Finance for Managers'
get_recommendations('Finance for Managers')

1839    Fundamentals of financial and management accou...
1891          Accounting and Finance for IT professionals
1985                  Introduction to Finance: The Basics
419                    Finance for Non-Financial Managers
1164                         Corporate Finance Essentials
708     Understanding Financial Statements: Company Po...
1090                    Financial Accounting Fundamentals
590                Corporate finance: Know your numbers 2
3119    Introduction to Finance: The Role of Financial...
3463    Operations Management: Analysis and Improvemen...
Name: Course Name, dtype: object

Although we know little about the course 'Finance for Managers', we can see that nearly all of the recommended courses cover finance or accounting.

In [32]:
# Another example - call function for the course 'Programming Languages, Part A'
get_recommendations('Programming Languages, Part A')

3505                        Programming Languages, Part C
1930                        Programming Languages, Part B
1706                   Functional Program Design in Scala
3042           Functional Programming Principles in Scala
1258               Introduction to Programming in Swift 5
1000                               Crash Course on Python
2364         Mastering Software Development in R Capstone
3362    Miracles of Human Language: An Introduction to...
857                              Programming with Scratch
16                          Python Programming Essentials
Name: Course Name, dtype: object

Calling the function with 'Programming Languages, Part A' mostly returns other programming courses. 'Programming Languages, Part C' and 'Programming Languages, Part B', which appear to belong to the same series as the course we searched for, are at the top of the ten-course recommendation list.
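
The `[1:11]` slice in `get_recommendations` assumes that the queried course always sorts first. If another course has an identical description (similarity 1), the sort may place that course first and the queried course would then appear in its own recommendations. The sketch below is one possible variant, not the method used above: the helper `top_similar` and the toy data are hypothetical, and it excludes the query row by index before ranking with `np.argsort`.

```python
import numpy as np
import pandas as pd

def top_similar(names, sim, idx, k=10):
    """Return the k names most similar to row idx, excluding idx itself."""
    scores = np.asarray(sim[idx], dtype=float).copy()
    scores[idx] = -np.inf               # exclude the query course explicitly
    top = np.argsort(scores)[::-1][:k]  # positions of the k highest scores
    return names.iloc[top]

# Toy data: rows 0 and 1 have identical descriptions (similarity 1)
toy_names = pd.Series(["A", "A (copy)", "B", "C"])
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
])

result = top_similar(toy_names, toy_sim, idx=1, k=2)
print(result.tolist())
```

Here the query "A (copy)" is never returned, while its duplicate "A" still ranks first because it is a distinct row.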

# Skills-based Recommendation System