# Collaborative Filtering

#### Import libraries

In [1]:
import os
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, KNNBasic

### Example 1: Lecture - Netflix Prize

**Data from Sheet 1 in `Collaborative_Filtering_Examples.xlsx` workbook.**

Set Customer ID as the index and make sure the data types are float.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Customer ID,1,5,8,17,18,28,30,44,48
30878,4.0,1.0,,,3.0,3.0,4.0,5.0,
124105,4.0,,,,,,,,
822109,5.0,,,,,,,,
823519,3.0,,1.0,4.0,,4.0,5.0,,
885013,4.0,5.0,,,,,,,
893988,3.0,,,,,,4.0,4.0,
1248029,3.0,,,,,2.0,4.0,,3.0
1503895,4.0,,,,,,,,
1842128,4.0,,,,,,3.0,,
2238063,3.0,,,,,,,,


**Calculate average rating for customers 30878, 823519.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Customer ID
30878     3.33
823519    3.40
dtype: float64
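
A minimal sketch of one way to get these averages. It assumes the two rows are typed in by hand from the table above rather than read from the workbook, and the variable names `ratings` and `avg` are illustrative only.

```python
import numpy as np
import pandas as pd

# Ratings for the two customers, copied from the table above (NaN = not rated).
movies = [1, 5, 8, 17, 18, 28, 30, 44, 48]
ratings = pd.DataFrame(
    [[4, 1, np.nan, np.nan, 3, 3, 4, 5, np.nan],
     [3, np.nan, 1, 4, np.nan, 4, 5, np.nan, np.nan]],
    index=pd.Index([30878, 823519], name="Customer ID"),
    columns=movies,
    dtype=float,
)

# mean(axis=1) skips NaN by default, so each average uses only the rated movies.
avg = ratings.loc[[30878, 823519]].mean(axis=1).round(2)
print(avg)
```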

**Calculate correlation similarity for customers 30878, 823519.**

A helper function is provided because each customer's average must be computed over all the items that customer rated, not just the co-rated ones.

In [4]:
def calc_corr_sim(x, y):
    # Calculate mean for all values in x
    x_m = round(np.nanmean(x),2)
    # Calculate mean for all values in y
    y_m = round(np.nanmean(y),2)
    # Create a mask for shared ratings
    mask = ~np.isnan(x) & ~np.isnan(y)
    if np.sum(mask) > 1:  # Need at least two common elements for correlation
        # Numerator
        num = np.sum((x[mask] - x_m) * (y[mask] - y_m))
        # Denominator
        den = np.sqrt(sum((x[mask] - x_m)**2)) * np.sqrt(sum((y[mask] - y_m)**2))
        if den > 0:
            return np.divide(num, den)
        else:
            return np.nan
    else:
        return np.nan

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Corr(30878, 823519): 0.3441
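
The arithmetic behind this value can be sketched step by step as follows. The block assumes the two rating vectors are typed in from the table in Example 1 and mirrors the helper's logic (means over all rated items, rounded to two decimals, then sums over co-rated items only); it is an illustration, not the required solution.

```python
import numpy as np

nan = np.nan
x = np.array([4, 1, nan, nan, 3, 3, 4, 5, nan])    # customer 30878
y = np.array([3, nan, 1, 4, nan, 4, 5, nan, nan])  # customer 823519

# Means over ALL rated items, rounded to 2 decimals as the helper does.
x_m = round(np.nanmean(x), 2)
y_m = round(np.nanmean(y), 2)

# Only the co-rated movies (1, 28 and 30) enter the sums.
mask = ~np.isnan(x) & ~np.isnan(y)
dx, dy = x[mask] - x_m, y[mask] - y_m
corr = np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))
print(f"Corr(30878, 823519): {corr:.4f}")
```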


**Calculate cosine similarity for customers 30878, 823519.**

Helper function provided that accounts for nans for items that are not co-rated.

**Demonstrate the output format of sklearn's `cosine_similarity` function.**

In [6]:
def calc_cos_sim(u, v):
    # Create a mask for shared ratings
    mask = ~np.isnan(u) & ~np.isnan(v)
    if np.sum(mask) > 0:  # Need at least one co-rated item for cosine similarity
        return 1 - cosine(u[mask], v[mask])
    else:
        return np.nan

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

array([[1.        , 0.97179743],
       [0.97179743, 1.        ]])

Cos(30878, 823519): 0.9718
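
As a cross-check, the same numbers can be reproduced with plain NumPy. This is a sketch that assumes the two rating vectors are typed in from the table in Example 1; it builds the 2 x 2 pairwise matrix by hand rather than calling sklearn, so it only illustrates the shape of a pairwise similarity matrix.

```python
import numpy as np

nan = np.nan
u = np.array([4, 1, nan, nan, 3, 3, 4, 5, nan])    # customer 30878
v = np.array([3, nan, 1, 4, nan, 4, 5, nan, nan])  # customer 823519

# Keep only the co-rated movies and stack them as rows of a 2 x 3 matrix.
mask = ~np.isnan(u) & ~np.isnan(v)
X = np.vstack([u[mask], v[mask]])

# Scale each row to unit length; the product of the unit rows with their
# transpose then holds every pairwise cosine similarity.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T
print(sim)
print(f"Cos(30878, 823519): {sim[0, 1]:.4f}")
```

The off-diagonal entry is the similarity between the two customers, and the diagonal is each customer's similarity with themselves.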


**Create a similarity matrix for all customers.**

Helper function provided that accounts for nans for items that are not co-rated.

In [9]:
def sim_matrix_nan(data, name):
    m = data.shape[0]
    # Initialize the similarity matrix to np.nan
    result = np.full((m, m), np.nan)
    # Iterate over all pairs of rows
    for i in range(m):
        for j in range(i, m):
            if name == 'cosine':
                result[i, j] = calc_cos_sim(data.iloc[i], data.iloc[j])
            elif name == 'pearson':
                result[i, j] = calc_corr_sim(data.iloc[i], data.iloc[j])
            else:
                raise ValueError(f"Unknown similarity name: {name!r}")
            result[j, i] = result[i, j]
    return pd.DataFrame(result, columns=data.index, index=data.index)

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Customer ID,30878,124105,822109,823519,885013,893988,1248029,1503895,1842128,2238063
30878,1.0,,,0.34415,-0.874981,0.206216,0.70475,,0.0,
124105,,,,,,,,,,
822109,,,,,,,,,,
823519,0.34415,,,1.0,,0.646233,0.402911,,-0.857493,
885013,-0.874981,,,,1.0,,,,,
893988,0.206216,,,0.646233,,1.0,0.44185,,-0.946773,
1248029,0.70475,,,0.402911,,0.44185,1.0,,-0.707107,
1503895,,,,,,,,,,
1842128,0.0,,,-0.857493,,-0.946773,-0.707107,,1.0,
2238063,,,,,,,,,,


Customer ID,30878,124105,822109,823519,885013,893988,1248029,1503895,1842128,2238063
30878,1.0,1.0,1.0,0.971797,0.795432,0.992915,0.986025,1.0,0.989949,1.0
124105,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
822109,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
823519,0.971797,1.0,1.0,1.0,1.0,0.994692,0.971668,1.0,0.926092,1.0
885013,0.795432,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
893988,0.992915,1.0,1.0,0.994692,1.0,1.0,1.0,1.0,0.96,1.0
1248029,0.986025,1.0,1.0,0.971668,1.0,1.0,1.0,1.0,0.96,1.0
1503895,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1842128,0.989949,1.0,1.0,0.926092,1.0,0.96,0.96,1.0,1.0,1.0
2238063,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Create a correlation and cosine similarity matrix for all items.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,1,5,8,17,18,28,30,44,48
1,1.0,0.0,,,,0.0,-0.550482,0.928477,
5,0.0,1.0,,,,,,,
8,,,,,,,,,
17,,,,,,,,,
18,,,,,,,,,
28,0.0,,,,,1.0,0.707107,,
30,-0.550482,,,,,0.707107,1.0,,
44,0.928477,,,,,,,1.0,
48,,,,,,,,,


Unnamed: 0,1,5,8,17,18,28,30,44,48
1,1.0,0.83205,1.0,1.0,1.0,0.955395,0.963256,0.999512,1.0
5,0.83205,1.0,,,1.0,1.0,1.0,1.0,
8,1.0,,1.0,1.0,,1.0,1.0,,
17,1.0,,1.0,1.0,,1.0,1.0,,
18,1.0,1.0,,,1.0,1.0,1.0,1.0,
28,0.955395,1.0,1.0,1.0,1.0,1.0,0.983838,1.0,1.0
30,0.963256,1.0,1.0,1.0,1.0,0.983838,1.0,0.993884,1.0
44,0.999512,1.0,,,1.0,1.0,0.993884,1.0,
48,1.0,,,,,1.0,1.0,,1.0
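
Item-to-item similarities can be obtained by transposing the rating frame so that items become rows. The sketch below assumes a hypothetical helper `nan_cosine_matrix` (not part of the notebook) and a small subset of the data, three customers and four movies, so some entries differ from the full matrix above.

```python
import numpy as np
import pandas as pd

def nan_cosine_matrix(df):
    """Pairwise cosine similarity between rows, using only co-rated entries."""
    vals = df.to_numpy(dtype=float)
    n = len(df)
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            mask = ~np.isnan(vals[i]) & ~np.isnan(vals[j])
            if mask.any():
                a, b = vals[i, mask], vals[j, mask]
                out[i, j] = out[j, i] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return pd.DataFrame(out, index=df.index, columns=df.index)

# Subset of the Example 1 table: three customers, movies 1, 5, 28 and 30.
ratings = pd.DataFrame(
    [[4, 1, 3, 4], [3, np.nan, 4, 5], [3, np.nan, 2, 4]],
    index=[30878, 823519, 1248029],
    columns=[1, 5, 28, 30],
    dtype=float,
)

# Transposing puts movies on the rows, so the result is item-by-item.
item_sim = nan_cosine_matrix(ratings.T)
print(item_sim.round(6))
```

Movies 28 and 30 were co-rated only by these three customers, so that entry agrees with the full matrix.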


### Problem 14.3: Course Ratings

We again consider the data in _CourseTopics.csv_ describing course purchases at Statistics.com (see Problem 14.2 and data sample in Table). We want to provide a course recommendation to a student who purchased the Regression and Forecast courses. Apply user-based collaborative filtering to the data.

**Read in data from `coursetopics.csv` file.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,Intro,DataMining,Survey,Cat Data,Regression,Forecast,DOE,SW
0,1,1,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0
2,0,1,0,1,1,0,0,1
3,1,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0


For the surprise Dataset loader, data should be presented in columns with a customer/user, item and purchase/rating.

**Transform data from matrix to columns.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,customer,course,purchase
0,0,Intro,1
365,0,DataMining,1
731,1,Survey,1
1097,2,Cat Data,1
367,2,DataMining,1
...,...,...,...
2916,361,SW,1
726,361,DataMining,1
2917,362,SW,1
1458,363,Cat Data,1
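
One possible way to do this reshaping is with `DataFrame.melt`, sketched below. It assumes only the first five rows of the purchase matrix shown earlier (typed in by hand), so the index values differ from the full expected output; the names `course_df` and `purchases` follow the later cell.

```python
import pandas as pd

# First five rows of the purchase matrix shown earlier (1 = purchased).
course_df = pd.DataFrame(
    [[1, 1, 0, 0, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0, 0, 0],
     [0, 1, 0, 1, 1, 0, 0, 1],
     [1, 0, 0, 0, 0, 0, 0, 0],
     [1, 1, 0, 0, 0, 0, 0, 0]],
    columns=["Intro", "DataMining", "Survey", "Cat Data",
             "Regression", "Forecast", "DOE", "SW"],
)

# Wide to long: one row per (customer, course) pair.
purchases = (
    course_df.rename_axis("customer")
    .reset_index()
    .melt(id_vars="customer", var_name="course", value_name="purchase")
)

# Keep only actual purchases and order them by customer.
purchases = purchases[purchases["purchase"] == 1].sort_values("customer", kind="stable")
print(purchases)
```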


**Make predictions for all users.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [16]:
reader = Reader(rating_scale=(1, 1))
data = Dataset.load_from_df(purchases, reader)
trainset = data.build_full_trainset()
sim_options = {'name': 'cosine', 'user_based': True}  # compute cosine similarities between users
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

predictions = []
for user in course_df.index:
    predictions.append([algo.predict(user, course).est for course in course_df])
predictions = pd.DataFrame(predictions, columns=course_df.columns)
predictions.head()

Computing the cosine similarity matrix...
Done computing similarity matrix.


Unnamed: 0,Intro,DataMining,Survey,Cat Data,Regression,Forecast,DOE,SW
0,1,1,1,1,1,1,1,1
1,1,1,1,1,1,1,1,1
2,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1
4,1,1,1,1,1,1,1,1


**Interpret the results.**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

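The resulting predictions are all 1. This is because the input is a binary purchase matrix rather than a rating matrix, and only the purchased rows (value 1) were passed to Surprise. Every known rating is therefore 1, so any weighted average of neighbors' ratings is also 1.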
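
A small numeric sketch of this effect follows. It assumes the usual k-NN prediction form, a similarity-weighted average of neighbors' ratings, and the similarity values are made up purely for illustration.

```python
import numpy as np

# Hypothetical similarities between the target user and three neighbors.
sims = np.array([0.9, 0.5, 0.2])

# In the purchases data every known rating is 1.
neighbor_ratings = np.array([1.0, 1.0, 1.0])

# A weighted average of identical values returns that value,
# whatever the weights are.
pred = np.sum(sims * neighbor_ratings) / np.sum(sims)
print(pred)
```

Including the non-purchases as explicit 0 ratings, or using a similarity-based ranking instead of a rating prediction, would be needed to get informative scores.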

### Problem 14.5: Course Ratings

The Institute for Statistics Education at Statistics.com asks students to rate a variety of aspects of a course as soon as the student completes it. The Institute is contemplating instituting a recommendation system that would provide students with recommendations for additional courses as soon as they submit their rating for a completed course. Consider the excerpt from student ratings of online statistics courses shown in Table 14.7, and the problem of what to recommend to student E.N.

**Read in data from `courserating.csv` file.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

User,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
LN,4.0,,,,3.0,2.0,4.0,,2.0
MH,3.0,4.0,,,4.0,,,,
JH,2.0,2.0,,,,,,,
EN,4.0,,,4.0,,,4.0,,3.0
DU,4.0,4.0,,,,,,,
FL,,4.0,,,,,,,
GL,,4.0,,,,,,,
AH,,3.0,,,,,,,
SA,,,4.0,,,,,,
RW,,,2.0,,,,,4.0,


#### 14.5.a
First consider a user-based collaborative filter.  This requires computing correlations between all student pairs. 
For which students is it possible to compute correlations with E.N.? Compute them.

We need to identify the users that share ratings with E.N. These are: L.N., M.H., J.H., D.U., and D.S. However, only L.N. and D.S. share more than one rating with E.N. 

To compute this correlation, we first compute average rating by each of these 
students.  Note that the average is computed over a different number of 
courses for each of these students, because they each rated a different set 
of courses.

Average ratings:

- LN: (4 + 3 + 2 + 4 + 2) / 5 = 3
- EN: (4 + 4 + 4 + 3) / 4 = 3.75
- DS: (4 + 2 + 4) / 3 = 3.33

Co-rated courses for users EN and LN: SQL, R Prog, Regression.

- Denominator LN: sqrt((4-3)^2 + (4-3)^2 + (2-3)^2) = 1.732051
- Denominator EN: sqrt((4-3.75)^2 + (4-3.75)^2 + (3-3.75)^2) = 0.8291562

**Corr(LN, EN) = ((4-3)*(4-3.75) + (4-3)*(4-3.75) + (2-3)*(3-3.75)) / (1.732051 * 0.8291562) = 0.8703882**

Co-rated courses for users EN and DS: SQL, DM in R, R Prog.

- Denominator EN: sqrt((4-3.75)^2 + (4-3.75)^2 + (4-3.75)^2) = 0.4330127
- Denominator DS: sqrt((4-3.33)^2 + (2-3.33)^2 + (4-3.33)^2) = 1.633003

**Corr(EN, DS) = ((4-3.75)*(4-3.33) + (4-3.75)*(2-3.33) + (4-3.75)*(4-3.33)) / (0.4330127 * 1.633003) = 0.003535513**

**Calculate correlation similarity for users LN, EN and EN, DS.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Corr(LN, EN): 0.8704
Corr(EN, DS): 0.0035
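
The hand calculation above can be checked with the short sketch below. It assumes the three rating vectors are typed in by hand in the column order of the ratings table, with D.S.'s ratings reconstructed from the worked arithmetic (SQL 4, DM in R 2, R Prog 4). The hypothetical `decimals` argument shows the effect of rounding the means to two places.

```python
import numpy as np

nan = np.nan
# Column order: SQL, Spatial, PA1, DM in R, Python, Forecast, R Prog, Hadoop, Regression
ln = np.array([4, nan, nan, nan, 3, 2, 4, nan, 2])
en = np.array([4, nan, nan, 4, nan, nan, 4, nan, 3])
ds = np.array([4, nan, nan, 2, nan, nan, 4, nan, nan])  # reconstructed from the text

def corr_sim(x, y, decimals=2):
    """Correlation over co-rated items, with means taken over all rated items."""
    x_m, y_m = np.nanmean(x), np.nanmean(y)
    if decimals is not None:
        x_m, y_m = round(x_m, decimals), round(y_m, decimals)
    mask = ~np.isnan(x) & ~np.isnan(y)
    dx, dy = x[mask] - x_m, y[mask] - y_m
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

print(f"Corr(LN, EN): {corr_sim(ln, en):.4f}")
print(f"Corr(EN, DS): {corr_sim(en, ds):.4f}")
print(f"Corr(EN, DS), unrounded means: {corr_sim(en, ds, decimals=None):.4f}")
```

With unrounded means the EN/DS correlation is zero up to floating-point noise, which suggests the small positive value comes from rounding D.S.'s average to 3.33.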


#### 14.5.c
Use _scikit-learn_ function `sklearn.metrics.pairwise.cosine_similarity` to compute the cosine similarity between users. 

Co-rated courses for users EN and LN: SQL, R Prog, Regression.

- Denominator LN: sqrt(4^2 + 4^2 + 2^2) = 6
- Denominator EN: sqrt(4^2 + 4^2 + 3^2) = 6.403124

**Cosine(LN, EN) = (4*4 + 4*4 + 2*3) / (6 * 6.403124) = 0.9891005**

Co-rated courses for users EN and DS: SQL, DM in R, R Prog.

- Denominator EN: sqrt(4^2 + 4^2 + 4^2) = 6.928203
- Denominator DS: sqrt(4^2 + 2^2 + 4^2) = 6

**Cosine(EN, DS) = (4*4 + 4*2 + 4*4) / (6.928203 * 6) = 0.9622505**

**Calculate cosine similarity for users LN, EN and EN, DS.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Cos(LN, EN): 0.9891
Cos(EN, DS): 0.9623
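
These values can be verified with a few lines of NumPy. The sketch assumes hand-typed rating vectors (D.S.'s ratings reconstructed from the worked arithmetic) and computes the cosine directly instead of calling sklearn, restricting each pair to its co-rated courses.

```python
import numpy as np

nan = np.nan
# Column order: SQL, Spatial, PA1, DM in R, Python, Forecast, R Prog, Hadoop, Regression
ln = np.array([4, nan, nan, nan, 3, 2, 4, nan, 2])
en = np.array([4, nan, nan, 4, nan, nan, 4, nan, 3])
ds = np.array([4, nan, nan, 2, nan, nan, 4, nan, nan])  # reconstructed from the text

def cos_sim(u, v):
    """Cosine similarity over the co-rated entries of u and v."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    a, b = u[mask], v[mask]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Cos(LN, EN): {cos_sim(ln, en):.4f}")
print(f"Cos(EN, DS): {cos_sim(en, ds):.4f}")
```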


**We can convert the rating matrix into binary form (course taken or not) and calculate cosine similarity.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [20]:
binary_df = rating_df.copy()
binary_df[~np.isnan(binary_df)] = 1
binary_df[np.isnan(binary_df)] = 0
binary_df

User,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
LN,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
MH,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
JH,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
EN,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
DU,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FL,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GL,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AH,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SA,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
RW,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


**Now calculate the cosine similarity using the binary matrix.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Cos(LN, EN): 0.6708
Cos(EN, DS): 0.8660
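
For binary vectors the cosine reduces to the number of shared courses divided by the square root of the product of the two course counts. A sketch under the same assumptions as before (hand-typed vectors, D.S. reconstructed from the worked arithmetic):

```python
import numpy as np

nan = np.nan
# Column order: SQL, Spatial, PA1, DM in R, Python, Forecast, R Prog, Hadoop, Regression
ln = np.array([4, nan, nan, nan, 3, 2, 4, nan, 2])
en = np.array([4, nan, nan, 4, nan, nan, 4, nan, 3])
ds = np.array([4, nan, nan, 2, nan, nan, 4, nan, nan])  # reconstructed from the text

def binary_cos(u, v):
    """Cosine similarity on taken / not-taken indicators."""
    a = (~np.isnan(u)).astype(float)
    b = (~np.isnan(v)).astype(float)
    return a @ b / np.sqrt(a.sum() * b.sum())

print(f"Cos(LN, EN): {binary_cos(ln, en):.4f}")  # 3 shared of 5 and 4 courses
print(f"Cos(EN, DS): {binary_cos(en, ds):.4f}")  # 3 shared of 4 and 3 courses
```

Unlike the rating-based version, the binary version uses every course, so courses taken by only one of the two students lower the similarity.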


#### 14.5.f
With large datasets, it is computationally difficult to compute user-based recommendations in real time, and an item-based approach is used instead. Returning to the rating data (not the binary matrix), let's now take that approach.

**Create a cosine similarity matrix for all courses.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
SQL,1.0,0.990375,,0.948683,0.96,1.0,1.0,,0.980581
Spatial,0.990375,1.0,,,1.0,,,,
PA1,,,1.0,,,1.0,,1.0,
DM in R,0.948683,,,1.0,,,0.948683,,1.0
Python,0.96,1.0,,,1.0,1.0,1.0,,1.0
Forecast,1.0,,1.0,,1.0,1.0,1.0,,1.0
R Prog,1.0,,,0.948683,1.0,1.0,1.0,,0.980581
Hadoop,,,1.0,,,,,1.0,
Regression,0.980581,,,1.0,1.0,1.0,0.980581,,1.0


#### 14.5.g

**Apply item-based collaborative filtering to this dataset (using Python) and based on the results, recommend a course to E.N.**

**Convert the rating_df dataframe into a format suitable for the Surprise package.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,User,course,rating
0,AF,PA1,4.0
1,AH,Spatial,3.0
2,BA,PA1,4.0
3,DS,R Prog,4.0
4,DS,DM in R,2.0


**Make predictions for EN.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [24]:
reader = Reader(rating_scale=(1, 4))
data = Dataset.load_from_df(ratings, reader)
trainset = data.build_full_trainset()
# compute cosine similarities between items
sim_options = {'name': 'cosine', 'user_based': False}  
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

courses = rating_df.columns
for course in courses: 
    print(course, algo.predict('EN', course).est)

Computing the cosine similarity matrix...
Done computing similarity matrix.
SQL 3.7504416393899813
Spatial 4
PA1 3.433333333333333
DM in R 3.743416490252569
Python 3.6621621621621623
Forecast 3.6666666666666665
R Prog 3.7504416393899813
Hadoop 3.433333333333333
Regression 3.747548783981962


**Interpret the results.**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

Item-based collaborative filtering recommends the **Spatial** course to E.N.: its predicted rating of 4 is the highest among the courses E.N. has not yet rated.
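
The selection step can be sketched as follows. It assumes the predicted ratings are copied from the printed output above and that courses E.N. has already rated (SQL, DM in R, R Prog, Regression) should be excluded from the candidates.

```python
# Predicted ratings for E.N., copied from the output above.
predictions = {
    "SQL": 3.7504416393899813,
    "Spatial": 4,
    "PA1": 3.433333333333333,
    "DM in R": 3.743416490252569,
    "Python": 3.6621621621621623,
    "Forecast": 3.6666666666666665,
    "R Prog": 3.7504416393899813,
    "Hadoop": 3.433333333333333,
    "Regression": 3.747548783981962,
}

# Courses E.N. has already rated are not useful recommendations.
already_rated = {"SQL", "DM in R", "R Prog", "Regression"}
candidates = {c: p for c, p in predictions.items() if c not in already_rated}

# Recommend the unrated course with the highest predicted rating.
best = max(candidates, key=candidates.get)
print(best, candidates[best])
```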