# Chapter 14: Association Rules and Collaborative Filtering


> (c) 2019-2020 Galit Shmueli, Peter C. Bruce, Peter Gedeck 
>
> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) 
> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019.
>
> Date: 2020-03-08
>
> Python Version: 3.8.2
> Jupyter Notebook Version: 5.6.1
>
> Packages:
>   - numpy: 1.18.1
>   - pandas: 1.0.1
>   - scipy: 1.4.1
>   - scikit-learn: 0.22.2
>   - scikit-surprise: 1.1.0
>
> The assistance from Mr. Kuber Deokar and Ms. Anuja Kulkarni in preparing these solutions is gratefully acknowledged.

> Edited and presented by Dillon Orr for University of San Diego, 2022

In [36]:
# Import required packages for this chapter
import pandas as pd
import numpy as np

from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

from surprise import Dataset
from surprise import Reader
from surprise import KNNBasic

%matplotlib inline

# Problem 14.5: Course ratings
The Institute for Statistics Education at Statistics.com asks students to rate a variety of aspects of a course as soon as the student completes it. The Institute is contemplating instituting a recommendation system that would provide students with recommendations for additional courses as soon as they submit their rating for a completed course.  Consider the excerpt from student ratings of online statistics courses shown in Table 14.7, and the problem of what to recommend to student E.N.

In [38]:
rating_df = pd.read_csv('courserating.csv')
rating_df = rating_df.rename(columns={'Unnamed: 0' : "Student"})
rating_df = rating_df.set_index('Student')
rating_df

Student,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
LN,4.0,,,,3.0,2.0,4.0,,2.0
MH,3.0,4.0,,,4.0,,,,
JH,2.0,2.0,,,,,,,
EN,4.0,,,4.0,,,4.0,,3.0
DU,4.0,4.0,,,,,,,
FL,,4.0,,,,,,,
GL,,4.0,,,,,,,
AH,,3.0,,,,,,,
SA,,,4.0,,,,,,
RW,,,2.0,,,,,4.0,


## 14.5.a First consider a user-based collaborative filter.  This requires computing correlations between all student pairs. For which students is it possible to compute correlations with E.N.? Compute them.

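We need to identify the students who share rated courses with E.N. These are L.N., M.H., J.H., D.U., and D.S. However, only L.N. and D.S. share more than one course with E.N., and at least two co-rated courses are needed to compute a correlation, so these are the only two students for which it is possible.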

In [39]:
rating_df.loc[['EN', 'LN', 'DS']]

Student,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
EN,4.0,,,4.0,,,4.0,,3.0
LN,4.0,,,,3.0,2.0,4.0,,2.0
DS,4.0,,,2.0,,,4.0,,


![Correlation equation](c14_M0009.gif)

To compute this correlation, we first compute average rating by each of these 
students.  Note that the average is computed over a different number of 
courses for each of these students, because they each rated a different set 
of courses.
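
The equation referenced in the image above is presumably the usual user-based correlation. A LaTeX rendering that is consistent with the hand calculations in this section (sums taken over co-rated courses only, averages taken over all courses each student rated) would be:

```latex
\mathrm{Corr}(U_1, U_2) =
\frac{\sum_{i} (r_{1,i} - \bar{r}_1)(r_{2,i} - \bar{r}_2)}
     {\sqrt{\sum_{i} (r_{1,i} - \bar{r}_1)^2}\;\sqrt{\sum_{i} (r_{2,i} - \bar{r}_2)^2}}
```

Here $r_{1,i}$ and $r_{2,i}$ are the ratings of course $i$ by the two students, $i$ runs over the courses rated by both, and $\bar{r}_1$, $\bar{r}_2$ are each student's average rating over all courses that student rated. The symbol names are chosen for this sketch and may differ from the book's notation.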

Average ratings:

- LN: (4 + 3 + 2 + 4 + 2) / 5 = 3
- EN: (4 + 4 + 4 + 3) / 4 = 3.75
- DS: (4 + 2 + 4) / 3 = 3.33

### Co-rated courses for EN and LN: SQL, R Prog, Regression.

- Denominator EN: sqrt((4-3.75)^2 + (4-3.75)^2 + (3-3.75)^2) = 0.8291562
- Denominator LN: sqrt((4-3)^2 + (4-3)^2 + (2-3)^2) = 1.732051

**Corr(EN, LN) = ((4-3)*(4-3.75) + (4-3)*(4-3.75) + (2-3)*(3-3.75)) / (1.732051 * 0.8291562) = 0.8703882**

### Co-rated courses for EN and DS: SQL, DM in R, R Prog.

- Denominator EN: sqrt((4-3.75)^2 + (4-3.75)^2 + (4-3.75)^2) = 0.4330127
- Denominator DS: sqrt((4-3.33)^2 + (2-3.33)^2 + (4-3.33)^2) = 1.633003

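**Corr(EN, DS) = ((4-3.75)*(4-3.33) + (4-3.75)*(2-3.33) + (4-3.75)*(4-3.33)) / (0.4330127 * 1.633003) = 0.003535513**

The small nonzero value is an artifact of rounding DS's average to 3.33. With the exact average 10/3, the numerator is 0.25 * (2/3 - 4/3 + 2/3) = 0, so Corr(EN, DS) is exactly 0.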
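
The hand calculations above can be checked with a short script. This is a minimal sketch, not the book's code: the ratings are typed in from the EN, LN, and DS rows shown earlier, and `user_correlation` is a hypothetical helper name. It assumes the convention used above, where sums run over co-rated courses and each average is taken over all courses the student rated.

```python
import numpy as np

# Ratings copied from the EN, LN, and DS rows shown above (np.nan = not rated).
# Column order: SQL, Spatial, PA1, DM in R, Python, Forecast, R Prog, Hadoop, Regression
ratings = {
    "EN": np.array([4, np.nan, np.nan, 4, np.nan, np.nan, 4, np.nan, 3]),
    "LN": np.array([4, np.nan, np.nan, np.nan, 3, 2, 4, np.nan, 2]),
    "DS": np.array([4, np.nan, np.nan, 2, np.nan, np.nan, 4, np.nan, np.nan]),
}


def user_correlation(a, b):
    """Hypothetical helper: correlation over co-rated courses,
    centered on each student's average over all rated courses."""
    co_rated = ~np.isnan(a) & ~np.isnan(b)
    dev_a = a[co_rated] - np.nanmean(a)
    dev_b = b[co_rated] - np.nanmean(b)
    denominator = np.sqrt(np.sum(dev_a**2)) * np.sqrt(np.sum(dev_b**2))
    return float(np.sum(dev_a * dev_b) / denominator)


corr_en_ln = user_correlation(ratings["EN"], ratings["LN"])
corr_en_ds = user_correlation(ratings["EN"], ratings["DS"])
print("Corr(EN, LN) =", corr_en_ln)
print("Corr(EN, DS) =", corr_en_ds)
```

Because the averages are not rounded here, Corr(EN, DS) comes out as zero up to floating-point error rather than 0.0035.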

##  14.5.b Based on the single nearest student to E.N., which single course should we recommend to E.N.? Explain why. 

In [5]:
rating_df.loc[['EN', 'LN']][['Python', 'Forecast']]

Student,Python,Forecast
EN,,
LN,3.0,2.0


From the correlations computed in (a) above, student LN is nearest to EN (0.87, versus roughly 0 for DS). Of the two courses that LN has taken and EN has not, LN rated Python (3) higher than Forecast (2). So Python should be recommended to EN.

## 14.5.c Use _scikit-learn_ function `sklearn.metrics.pairwise.cosine_similarity` to compute the cosine similarity between users. 

Co-rated courses for users EN and LN: SQL, R Prog, Regression.

- Denominator LN: sqrt(4^2 + 4^2 + 2^2) = 6
- Denominator EN: sqrt(4^2 + 4^2 + 3^2) = 6.403124

**Cosine(LN, EN) = (4*4 + 4*4 + 2*3) / (6 * 6.403124) = 0.9891005**

Co-rated courses for users EN and DS: SQL, DM in R, R Prog.

- Denominator EN: sqrt(4^2 + 4^2 + 4^2) = 6.928203
- Denominator DS: sqrt(4^2 + 2^2 + 4^2) = 6

**Cosine(EN, DS) = (4*4 + 4*2 + 4*4) / (6.928203 * 6) = 0.9622505**

In [44]:
print('cosine(LN, EN) = ', cosine_similarity(rating_df.loc[['LN', 'EN'], ['SQL', 'R Prog', 'Regression']])[0, 1])

print('cosine(EN, DS) = ', cosine_similarity(rating_df.loc[['EN', 'DS'], ['SQL', 'DM in R', 'R Prog']])[0, 1])

cosine(LN, EN) =  0.9891004919611718
cosine(EN, DS) =  0.9622504486493764


## 14.5.d Based on the cosine similarities of the nearest students to E.N., which course should be recommended to E.N.?

In [7]:
rating_df.loc[['EN', 'LN']][['Python', 'Forecast']]

Student,Python,Forecast
EN,,
LN,3.0,2.0


From the cosine similarities based on course ratings, student LN is again nearest to EN (0.989, versus 0.962 for DS). Of the courses
that LN has taken and EN has not, LN rated Python (3) higher than Forecast (2).
So Python should be recommended to EN.

## 14.5.e What is the conceptual difference between using the correlation as opposed to cosine similarities?
\[_Hint_: how are the missing values in the matrix handled in each case?\]



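In the rating matrix, both measures use only the co-rated courses in their sums. The difference is that correlation first subtracts each student's average rating, and that average is computed over all courses the student rated, including those that are not co-rated. The courses that are not co-rated therefore still affect the correlation, whereas cosine similarity uses only the raw ratings of the co-rated courses.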
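
One way to see the difference is that the correlation used in (a) is the cosine similarity of the mean-centered co-rated ratings. The sketch below illustrates this for EN and DS, with values typed in from the table above. The `cosine` helper and the variable names are assumptions made for this illustration.

```python
import numpy as np

# Co-rated courses for EN and DS (SQL, DM in R, R Prog), taken from the table above.
en_co = np.array([4.0, 4.0, 4.0])
ds_co = np.array([4.0, 2.0, 4.0])

# Averages over ALL courses each student rated, not just the co-rated ones.
en_mean = np.mean([4, 4, 4, 3])  # EN also rated Regression
ds_mean = np.mean([4, 2, 4])     # DS rated only the three co-rated courses


def cosine(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cos_raw = cosine(en_co, ds_co)                            # raw ratings
corr_centered = cosine(en_co - en_mean, ds_co - ds_mean)  # mean-centered ratings
print("cosine      =", cos_raw)
print("correlation =", corr_centered)
```

On the raw ratings the two students look very similar (about 0.96), but after centering the similarity is zero up to floating-point error, because EN's Regression rating, which is not co-rated, pulls EN's average below all three co-rated ratings.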

## 14.5.f With large datasets, it is computationally difficult to compute user-based recommendations in real time, and an item-based approach is used instead. Returning to the rating data (not the binary matrix), let's now take that approach.

### 14.5.f.i If the goal is still to find a recommendation for E.N., for which course pairs is it possible and useful to calculate correlations?  

In [8]:
rating_df

Student,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
LN,4.0,,,,3.0,2.0,4.0,,2.0
MH,3.0,4.0,,,4.0,,,,
JH,2.0,2.0,,,,,,,
EN,4.0,,,4.0,,,4.0,,3.0
DU,4.0,4.0,,,,,,,
FL,,4.0,,,,,,,
GL,,4.0,,,,,,,
AH,,3.0,,,,,,,
SA,,,4.0,,,,,,
RW,,,2.0,,,,,4.0,


There is enough data to find correlations for the following pairs:    
- SQL - Spatial     
- SQL - Python    
- Spatial - Python

In [45]:
rating_df.corr()

,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
SQL,1.0,0.866025,,,-1.0,,,,
Spatial,0.866025,1.0,,,,,,,
PA1,,,1.0,,,,,,
DM in R,,,,1.0,,,,,
Python,-1.0,,,,1.0,,,,
Forecast,,,,,,1.0,,,
R Prog,,,,,,,,,
Hadoop,,,,,,,,,
Regression,,,,,,,,,1.0


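However, EN has already taken SQL, DM in R, R Prog, and Regression, so a recommendation must come from the remaining courses. Of the pairs listed above, `rating_df.corr()` returns a value only for SQL - Spatial (0.866) and SQL - Python (-1.0). Hence, only the correlations involving Spatial and Python are useful.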

In [46]:
en_not_taken = rating_df.columns[rating_df.loc['EN'].isna()].to_list()
en_not_taken

['Spatial', 'PA1', 'Python', 'Forecast', 'Hadoop']

###  14.5.f.ii Just looking at the data, and without yet calculating course pair correlations, which course would you recommend to E.N., relying on item-based filtering?  Calculate two course pair correlations involving your guess and report the results. 

The SQL and Spatial ratings agree most closely, and this pair has the most students who rated both courses,
so Spatial would be the best guess.

In [47]:
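# cosine similarity between the SQL and Spatial columns (students MH and JH only; DU also rated both)
print(cosine_similarity(rating_df.loc[['MH', 'JH'], ['SQL', 'Spatial']].transpose())[0,1])
# cosine similarity between the SQL and Python columns (students MH and LN)
print(cosine_similarity(rating_df.loc[['MH', 'LN'], ['SQL', 'Python']].transpose())[0,1])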

0.9922778767136676
0.96
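
The question asks for correlations, whereas the cell above reports cosine similarities. A minimal sketch of the corresponding Pearson correlations follows. It uses the ratings of every student in the table above who rated both courses of a pair, and `pair_correlation` is a hypothetical helper name. The means here are taken over the co-rating students only, which is also what `DataFrame.corr()` does.

```python
import numpy as np

# Ratings from students who rated both courses in each pair (from the table above).
sql_spatial = {"MH": (3, 4), "JH": (2, 2), "DU": (4, 4)}  # (SQL, Spatial)
sql_python = {"LN": (4, 3), "MH": (3, 4)}                 # (SQL, Python)


def pair_correlation(pairs):
    """Hypothetical helper: Pearson correlation of two courses
    over the students who rated both."""
    first, second = zip(*pairs.values())
    return float(np.corrcoef(first, second)[0, 1])


corr_sql_spatial = pair_correlation(sql_spatial)
corr_sql_python = pair_correlation(sql_python)
print("Corr(SQL, Spatial) =", corr_sql_spatial)
print("Corr(SQL, Python)  =", corr_sql_python)
```

These values should agree with the SQL - Spatial (0.866025) and SQL - Python (-1.0) entries in the `rating_df.corr()` output shown earlier. With only two co-rating students, a correlation can only be +1 or -1, so the SQL - Python value carries little information.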


## 14.5.g Apply item-based collaborative filtering to this dataset (using Python) and based on the results, recommend a course to E.N. 

In [28]:
# convert the rating_df dataframe into a format suitable for the Surprise package
ratings = []

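for customer, row in rating_df.iterrows():
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for course, value in row.items():
        # skip courses the student has not rated
        if np.isnan(value):
            continue
        ratings.append([customer, course, value])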
        
ratings = pd.DataFrame(ratings, columns=['customer', 'course', 'rating'])
ratings

,customer,course,rating
0,LN,SQL,4.0
1,LN,Python,3.0
2,LN,Forecast,2.0
3,LN,R Prog,4.0
4,LN,Regression,2.0
5,MH,SQL,3.0
6,MH,Spatial,4.0
7,MH,Python,4.0
8,JH,SQL,2.0
9,JH,Spatial,2.0
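
The same wide-to-long conversion can also be written without an explicit loop. The sketch below is an alternative, not the notebook's method. It uses a three-student, three-course excerpt of the table above so that it runs on its own, and the names `excerpt` and `long_ratings` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Small excerpt of the rating matrix (values copied from the table above).
excerpt = pd.DataFrame(
    {
        "SQL": [4.0, 3.0, 2.0],
        "Spatial": [np.nan, 4.0, 2.0],
        "Python": [3.0, 4.0, np.nan],
    },
    index=pd.Index(["LN", "MH", "JH"], name="Student"),
)

# Wide -> long: one row per (student, course) pair, then drop the unrated pairs.
long_ratings = (
    excerpt.reset_index()
    .melt(id_vars="Student", var_name="course", value_name="rating")
    .dropna(subset=["rating"])
    .rename(columns={"Student": "customer"})
    .reset_index(drop=True)
)
print(long_ratings)
```

The rows come out ordered by course rather than by student, which should not matter for `Dataset.load_from_df`, since it only requires the columns in user, item, rating order.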


In [48]:
reader = Reader(rating_scale=(1, 4))
data = Dataset.load_from_df(ratings, reader)
trainset = data.build_full_trainset()

# compute cosine similarities between items
sim_options = {'name': 'cosine', 'user_based': False}  
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x16c0afa4280>

In [49]:
courses = rating_df.columns
for course in courses: 
    print(course, algo.predict('EN', course).est)

SQL 3.7504416393899813
Spatial 4
PA1 3.433333333333333
DM in R 3.743416490252569
Python 3.6621621621621623
Forecast 3.6666666666666665
R Prog 3.7504416393899813
Hadoop 3.433333333333333
Regression 3.747548783981962


Among the courses that E.N. has not yet taken, Spatial has the highest predicted rating (4), so the item-based collaborative filtering recommends the **Spatial** course to E.N.
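
The printed predictions include courses that E.N. has already taken, so the comparison should be restricted to the untaken ones. A small sketch of that final step follows. The predicted values are copied from the output above and rounded to four decimals, and the variable names are chosen for this illustration.

```python
# Predicted ratings for E.N. as printed above (rounded to 4 decimals).
predicted = {
    "SQL": 3.7504, "Spatial": 4.0, "PA1": 3.4333, "DM in R": 3.7434,
    "Python": 3.6622, "Forecast": 3.6667, "R Prog": 3.7504,
    "Hadoop": 3.4333, "Regression": 3.7475,
}
# Courses E.N. has not taken, from the en_not_taken output earlier.
en_not_taken = ["Spatial", "PA1", "Python", "Forecast", "Hadoop"]

# Keep only the candidate courses and rank them by predicted rating.
candidates = {course: predicted[course] for course in en_not_taken}
ranked = sorted(candidates, key=candidates.get, reverse=True)
best = ranked[0]
print("Recommended course:", best)
print("Ranking:", ranked)
```

In the notebook itself, the same filtering could be applied directly to the `algo.predict('EN', course).est` values instead of typing them in.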