This notebook is a sneak peek at a collaborative filtering implementation whose implicit assumption is that users who have interacted with similar courses are themselves similar. The implementation uses the Surprise library to find the k nearest neighbors (lazy evaluation). Many variations of this approach are worth experimenting with, for example:

- Different algorithms (other nearest neighbor algorithms, Matrix Factorization, Neural Network)
- Different Similarity measures
- Number of nearest neighbors to consider
- Minimum number of course view overlap
- Rescaling the view time
- Handling the bimodal distribution of the target view time, which may call for an approach tailored to this aspect of the data
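
As a minimal sketch of the "rescaling the view time" idea, assuming a log transform is an acceptable choice (the notebook itself does not apply one), the helper below compresses the long right tail of raw view times. The name `rescale_view_time` and the sample values are hypothetical.

```python
import math


def rescale_view_time(seconds):
    """Compress raw view times (in seconds) with log1p; illustrative only."""
    return [math.log1p(s) for s in seconds]


# Hypothetical sample of per-course view times in seconds.
raw = [102, 83, 2665, 4884, 4959]
scaled = rescale_view_time(raw)

# The ordering of the values is preserved, but the spread is much smaller.
print([round(x, 2) for x in scaled])
print(max(raw) / min(raw), max(scaled) / min(scaled))
```

Whether a log transform, standardization, or a binary viewed/not-viewed signal works best is an empirical question for this data set.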

Additionally, it may be worth experimenting with neural network implementations to see whether they can capture the non-linear patterns in the data more closely.

When evaluating the model in an implicit-data setting, rank metrics such as those used in information retrieval are the appropriate choice, for example MAP@N, Precision@N, Recall@N and DCG. Because the data provided is implicit and contains nothing like a rating, standard error metrics such as MSE and MAE do not make sense. One can still generate a set of predicted values for the target, but those values are meaningful only insofar as they allow the items to be ranked for recommendation; the values themselves do not matter, so MSE is of little use here. Note also that the method is susceptible to odd results because of the sparsity problem: when two users share very few courses, a similarity computed on that small overlap can be misleadingly high.
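
The rank metrics named above can be sketched in a few lines of plain Python. This is an illustrative sketch with binary relevance (a course is either relevant or not), not the evaluation code of this project; the function names and the toy course ids are hypothetical. MAP@N is the mean of `average_precision_at_n` over all users.

```python
import math


def precision_at_n(recommended, relevant, n):
    """Fraction of the top-n slots filled with relevant items."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n


def recall_at_n(recommended, relevant, n):
    """Fraction of the relevant items that appear in the top-n."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / len(relevant)


def average_precision_at_n(recommended, relevant, n):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), n)
    return score / denom if denom else 0.0


def dcg_at_n(recommended, relevant, n):
    """Discounted cumulative gain with binary gains and a log2 discount."""
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:n], start=1)
        if item in relevant
    )


# Toy example: a ranked recommendation list and the courses the user viewed.
recommended = ["course-a", "course-b", "course-c", "course-d"]
relevant = {"course-a", "course-c", "course-e"}

print(precision_at_n(recommended, relevant, 4))          # 0.5
print(recall_at_n(recommended, relevant, 4))             # 2 of 3 relevant found
print(average_precision_at_n(recommended, relevant, 4))  # (1/1 + 2/3) / 3
print(dcg_at_n(recommended, relevant, 4))                # 1.5
```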

In [32]:
NUM_SIMILAR_USERS=10
RANDOM_STATE=101

In [33]:
import os
import sys
import heapq
import numpy as np
import pandas as pd

In [34]:
from sklearn.preprocessing import StandardScaler

In [35]:
from surprise import Dataset, KNNBasic, Reader

In [36]:
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

# Data Ingestion

In [37]:
from generator.data.data_utils import create_user_indices_from_user_handle
from generator.data.process_data import UserInterestDataProcessor
from paths import user_course_views_file

In [38]:
user_course_views_df=pd.read_csv(user_course_views_file)

In [39]:
user_interest_df=UserInterestDataProcessor.load_data()

In [40]:
user_course_views_df.head()

Unnamed: 0,user_handle,view_date,course_id,author_handle,level,view_time_seconds
0,1,2017-06-27,cpt-sp2010-web-designers-branding-intro,875,Beginner,3786
1,1,2017-06-28,cpt-sp2010-web-designers-branding-intro,875,Beginner,1098
2,1,2017-06-28,cpt-sp2010-web-designers-css,875,Intermediate,4406
3,1,2017-07-27,cpt-sp2010-web-designers-css,875,Intermediate,553
4,1,2017-09-12,aws-certified-solutions-architect-professional,281,Advanced,102


In [41]:
total_view_time_df=user_course_views_df.groupby(['user_handle', 'course_id']).agg({'view_time_seconds':'sum'}).reset_index()
total_view_time_df=total_view_time_df[total_view_time_df.view_time_seconds>0]

In [42]:
total_view_time_df.head()

Unnamed: 0,user_handle,course_id,view_time_seconds
0,1,aws-certified-solutions-architect-professional,102
1,1,aws-certified-sysops-admin-associate,83
2,1,aws-system-admin-fundamentals,2665
3,1,cpt-sp2010-web-designers-branding-intro,4884
4,1,cpt-sp2010-web-designers-css,4959


In [43]:
input_df=total_view_time_df[['user_handle','course_id', 'view_time_seconds']]

In [44]:
import logging
logger = logging.getLogger(__name__)


class KNNUserViewTimeSimilarityCFModel:
    def __init__(self, **params):
        logger.info('Entering Class {} '.format(self.__class__.__name__))
        self.params=params
        self.sim_options = {'name': self.params['similarity'],
                   'user_based': True
                           }
        self.algo = KNNBasic(sim_options=self.sim_options)
        self.target=self.params['target']
        self.data=None
        self.surprise_data_set=None
        self.X=None
        self.user_index_dict={}
        self.sims_matrix=None
        
    def fit(self, data):
        logger.debug('Entering fit method of class {} '.format(self.__class__.__name__))
        self.data=data
        self.user_index_dict=create_user_indices_from_user_handle(self.data)
        self.surprise_data_set=create_surprise_data_set(self.data, self.target)
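        # KNNBasic.fit() already computes the similarity matrix (stored as algo.sim);
        # the explicit call below recomputes it, so the progress messages print twice.
        self.algo.fit(self.surprise_data_set)
        self.sims_matrix = self.algo.compute_similarities()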
        return self
    
    def predict_similar_users(self, user_handle, num_similar_users=NUM_SIMILAR_USERS):
        user_handle_inner_id = self.surprise_data_set.to_inner_uid(user_handle)
        similarity_row = self.sims_matrix[user_handle_inner_id]
        similar_users = []
        for inner_id, score in enumerate(similarity_row):
            if (inner_id != user_handle_inner_id):
                similar_users.append( (inner_id, score))
        k_neighbors = heapq.nlargest(num_similar_users, similar_users, key=lambda t: t[1])
        data=[(self.surprise_data_set.to_raw_uid(user[0]),user[1]) for user in k_neighbors]
        logger.debug(f'Predictions for user_handle {user_handle} {data}')
        logger.debug(f'Predicted similar users for {user_handle}')
        return pd.DataFrame(data, columns=['similar_users', 'view_time_sim_score'])

    
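def create_surprise_data_set(df, target_col):
    # df must hold exactly three columns in the order (user, item, target).
    # rating_scale is the (min, max) range of the target column.
    reader = Reader(rating_scale=(df[target_col].min(), df[target_col].max()))
    data = Dataset.load_from_df(df, reader)
    # Use every interaction for training; no hold-out split is made here.
    surprise_data_set = data.build_full_trainset()
    return surprise_data_set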


# Cosine Similarity

In [45]:
cosine_params={'similarity':'cosine','target':'view_time_seconds'}
cosine_knn=KNNUserViewTimeSimilarityCFModel(**cosine_params)
cosine_knn.fit(data=input_df)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


<__main__.KNNUserViewTimeSimilarityCFModel at 0x136b8ceb8>

In [46]:
user_handle=100
cosine_similar_users=cosine_knn.predict_similar_users(user_handle=user_handle)
cosine_similar_users

Unnamed: 0,similar_users,view_time_sim_score
0,27,1.0
1,39,1.0
2,49,1.0
3,60,1.0
4,69,1.0
5,70,1.0
6,73,1.0
7,89,1.0
8,126,1.0
9,134,1.0


In [47]:
user_interest_df.merge(cosine_similar_users, left_on='user_handle', right_on='similar_users').sort_values(by='view_time_sim_score', ascending=False)

Unnamed: 0,user_handle,interest_tag,similar_users,view_time_sim_score
0,27,"zbrush,3d-modeling,3d-animation,3ds-max,3d-tex...",27,1.0
1,39,"mvc2,mvc3,mvc5,design-patterns,mvc-scaffolding...",39,1.0
2,49,"javascript-frameworks,html5,data-analysis,java...",49,1.0
3,60,"css,mongodb,javascript-frameworks,junit,projec...",60,1.0
4,69,it-fundamentals,69,1.0
5,70,"css,performance,javascript,test-driven-develop...",70,1.0
6,73,"javascript-frameworks,design-patterns,javascri...",73,1.0
7,89,"unit-testing,javascript-frameworks,css,html5,j...",89,1.0
8,126,"async,machine-learning,data-modeling,mvc-scaff...",126,1.0
9,134,"css3,css,azure-deployment,javascript,nodejs,mv...",134,1.0
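
A likely reason every cosine neighbor above scores exactly 1.0 is sparsity: Surprise computes cosine similarity only over the courses two users have in common, and when that overlap is a single course the score is 1.0 whatever the view times are. The sketch below is a plain-Python illustration of that effect, not the library's code. `cosine_on_common_items` and its `min_support` argument are hypothetical names, and the view times are taken from the visible columns of the user-item matrix inspected later in this notebook, treating those columns as the users' full history.

```python
import math


def cosine_on_common_items(u, v, min_support=1):
    """Cosine similarity restricted to the courses both users viewed.

    u, v: dict mapping course_id -> view time in seconds.
    Returns 0.0 when fewer than min_support courses are shared.
    """
    common = set(u) & set(v)
    if len(common) < min_support:
        return 0.0
    dot = sum(u[c] * v[c] for c in common)
    norm_u = math.sqrt(sum(u[c] ** 2 for c in common))
    norm_v = math.sqrt(sum(v[c] ** 2 for c in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


user_100 = {
    "angular-2-getting-started-update": 387.0,
    "clean-architecture-patterns-practices-principles": 401.0,
    "react-redux-react-router-es6": 677.0,
}
user_69 = {"angular-2-getting-started-update": 286.0}

# One shared course: two positive scalars always point the same way.
sim_one = cosine_on_common_items(user_100, user_69)
# Requiring at least two shared courses filters this pair out.
sim_filtered = cosine_on_common_items(user_100, user_69, min_support=2)
print(sim_one, sim_filtered)
```

Surprise accepts a comparable `min_support` key in `sim_options`, which is one way to experiment with the minimum course-view overlap mentioned at the top of this notebook.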


# Pearson Baseline

In [48]:
pearson_params={'similarity':'pearson_baseline','target':'view_time_seconds'}
pearson_knn=KNNUserViewTimeSimilarityCFModel(**pearson_params)
pearson_knn.fit(data=input_df)
pearson_similar_users=pearson_knn.predict_similar_users(user_handle=user_handle)
pearson_similar_users

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


Unnamed: 0,similar_users,view_time_sim_score
0,8982,0.028173
1,2298,0.024789
2,839,0.019597
3,8851,0.019522
4,9640,0.0195
5,9865,0.019355
6,1012,0.019303
7,3504,0.019068
8,7121,0.017133
9,256,0.015758
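
The Pearson-baseline scores above are much smaller than the cosine ones. One contributing factor, as far as the Surprise documentation describes this measure, is shrinkage: the baseline-centered correlation is multiplied by (n - 1) / (n - 1 + shrinkage), where n is the number of courses the two users share and the default shrinkage is 100. The sketch below shows only that shrinkage step; `shrunk_similarity` is a hypothetical helper and the correlation values passed in are made up.

```python
def shrunk_similarity(rho, n_common, shrinkage=100):
    """Shrink a correlation rho toward 0 when few items are shared."""
    return (n_common - 1) / (n_common - 1 + shrinkage) * rho


# Even a perfect correlation is heavily discounted on a small overlap.
for n_common in (1, 3, 10, 200):
    print(n_common, round(shrunk_similarity(1.0, n_common), 4))
```

With a perfect correlation on three shared courses the shrunk score is 2/102, about 0.02, which is the same order of magnitude as the scores in the table. The `shrinkage` key of `sim_options` controls this behavior and is another parameter worth tuning.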


# Mean Squared Distance

In [49]:
msd_params={'similarity':'msd','target':'view_time_seconds'}
msd_knn=KNNUserViewTimeSimilarityCFModel(**msd_params)
msd_knn.fit(data=input_df)
msd_similar_users=msd_knn.predict_similar_users(user_handle=user_handle)
msd_similar_users

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Unnamed: 0,similar_users,view_time_sim_score
0,722,1.0
1,1875,1.0
2,2434,1.0
3,3237,1.0
4,7393,1.0
5,9674,1.0
6,467,0.5
7,8896,0.1
8,3645,0.058824
9,5819,0.02
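
The MSD scores follow from the formula Surprise documents for this measure: similarity = 1 / (msd + 1), where msd is the mean squared difference of the view times over the courses two users share. The sketch below is a plain-Python illustration, not the library's code; `msd_similarity` and the toy view times are hypothetical. It shows one way the values 0.5, 0.1, 0.058824 and 0.02 can arise, namely a single shared course whose view times differ by 1, 3, 4 and 7 seconds.

```python
def msd_similarity(u, v):
    """MSD similarity over the courses both users viewed.

    u, v: dict mapping course_id -> view time in seconds.
    Returns 0.0 when the users share no course.
    """
    common = set(u) & set(v)
    if not common:
        return 0.0
    msd = sum((u[c] - v[c]) ** 2 for c in common) / len(common)
    return 1.0 / (msd + 1.0)


target = {"some-course": 100}
for diff in (0, 1, 3, 4, 7):
    other = {"some-course": 100 + diff}
    print(diff, round(msd_similarity(target, other), 6))
```

Because raw seconds are squared, only near-identical view times on a tiny overlap score highly, which may explain why the MSD neighbors have interests that look unrelated to those of the target user.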


# Qualitative Assessment: Sanity Check for Predictions

## Cosine Similarity

In [50]:
cosine_similar_users.loc[-1] = [user_handle, 1] 
cosine_similar_users.index = cosine_similar_users.index + 1 
cosine_similar_users = cosine_similar_users.sort_index()
cosine_similar_users.sort_values(by=['view_time_sim_score'], ascending=False)

Unnamed: 0,similar_users,view_time_sim_score
0,100,1.0
1,27,1.0
2,39,1.0
3,49,1.0
4,60,1.0
5,69,1.0
6,70,1.0
7,73,1.0
8,89,1.0
9,126,1.0


In [53]:
cosine_inspect_df=input_df[input_df.user_handle.isin(cosine_similar_users.similar_users)]
cosine_user_item_matrix=pd.pivot_table(cosine_inspect_df, values='view_time_seconds', index='user_handle', columns='course_id')
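# Keep only courses viewed by at least two of the inspected users.
# (Recent pandas versions reject passing both `thresh` and `how`.)
cosine_user_item_matrix.dropna(thresh=2, axis=1)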

user_handle,angular-2-getting-started-update,angular-fundamentals,angularjs-get-started,aws-developer-big-picture,clean-architecture-patterns-practices-principles,data-science-big-picture,getting-started-kubernetes,html-fundamentals,java-fundamentals-language,java-microservices-spring-cloud-coordinating-services,...,modern-software-architecture-domain-models-cqrs-event-sourcing,node-intro,nodejs-express-web-applications,oauth2-json-web-tokens-openid-connect-introduction,react-flux-building-applications,react-fundamentals,react-js-getting-started,react-redux-react-router-es6,spring-cloud-fundamentals,xhttp-fund
27,,,,,,,,,,,...,,,,,,,,,,
39,6757.0,10185.0,4193.0,1029.0,,,,,,,...,,522.0,,,,,,,,
49,122.0,,,,,,,184.0,,,...,,,,,,,,,,
60,,,,,,,,,,,...,,11693.0,4866.0,155.0,,,,325.0,,9272.0
69,286.0,,,,,,,,,,...,,,,,,,,,,
70,,,,4318.0,430.0,2485.0,12119.0,428.0,2467.0,24357.0,...,1736.0,,,5422.0,1402.0,5850.0,2652.0,,12773.0,4170.0
73,,5434.0,,,3254.0,,3319.0,,,3168.0,...,988.0,,,,,,,,7114.0,
89,,,1449.0,,,,,,8713.0,,...,,,2919.0,,1389.0,,13928.0,2089.0,,
100,387.0,,,,401.0,,,,,,...,,,,,,,,677.0,,
126,,,,,,282.0,,,,,...,,,,,,50.0,1544.0,367.0,,


## Pearson Baseline

In [54]:
pearson_similar_users.loc[-1] = [user_handle, 1] 
pearson_similar_users.index = pearson_similar_users.index + 1 
pearson_similar_users = pearson_similar_users.sort_index()
pearson_similar_users.sort_values(by=['view_time_sim_score'], ascending=False)

Unnamed: 0,similar_users,view_time_sim_score
0,100,1.0
1,8982,0.028173
2,2298,0.024789
3,839,0.019597
4,8851,0.019522
5,9640,0.0195
6,9865,0.019355
7,1012,0.019303
8,3504,0.019068
9,7121,0.017133


In [55]:
user_interest_df.merge(pearson_similar_users, left_on='user_handle', right_on='similar_users').sort_values(by='view_time_sim_score', ascending=False)

Unnamed: 0,user_handle,interest_tag,similar_users,view_time_sim_score
0,100,"vue,javascript-libraries,javascript,javascript...",100,1.0
8,8982,"it-fundamentals,responsive-design,css,machine-...",8982,0.028173
4,2298,"javascript-frameworks,javascript-libraries,jav...",2298,0.024789
2,839,"information-security,cryptography,javascript-f...",839,0.019597
7,8851,"responsive-design,mobile-design,art-and-design...",8851,0.019522
9,9640,"information-security,website-security,cryptogr...",9640,0.0195
10,9865,react.js,9865,0.019355
3,1012,"async,css,orm,performance-optimization,mvc-htm...",1012,0.019303
5,3504,big-data,3504,0.019068
6,7121,".net,asp.net-core,asp.net",7121,0.017133


In [60]:
pearson_inspect_df=input_df[input_df.user_handle.isin(pearson_similar_users.similar_users)]
pearson_user_item_matrix=pd.pivot_table(pearson_inspect_df, values='view_time_seconds', index='user_handle', columns='course_id')
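# Keep only courses viewed by at least two of the inspected users.
# (Recent pandas versions reject passing both `thresh` and `how`.)
pearson_user_item_matrix.dropna(thresh=2, axis=1)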

user_handle,ads-part1,advanced-javascript,advanced-js-jquery-pure-dom-scripting-fundamentals,advanced-python,algorithmics-introduction,android-fundamentals-activities,android-start-developing,angular-2-end-to-end,angular-2-first-look,angular-2-getting-started-update,...,unity-vuforia-building-ar-experience,using-wireshark-analyze-troubleshoot-wifi-networks,ux-driven-software-design,vuejs-getting-started,web-app-pentesting-fundamentals,web-perf,webapi-v2-security,webpack-fundamentals,what-is-programming,writing-clean-code-humans
100,,,,,,,,,,387.0,...,,,,,,,1024.0,,,
256,,,,,,,,,,5194.0,...,,,14410.0,,,,,1430.0,,
839,,,,,,,,,,108.0,...,,,,,,,,,,
1012,,,,,,938.0,,8779.0,,,...,3309.0,135.0,,7903.0,,,,3974.0,,
2298,366.0,864.0,1205.0,,65.0,70.0,,,,3420.0,...,,,,216.0,,,,,,
3504,,3631.0,282.0,130.0,,,,57.0,1223.0,583.0,...,,,,,,1251.0,166.0,,,
7121,,,,,,,2988.0,,9590.0,,...,,,,474.0,,,4285.0,,,9457.0
8851,,1029.0,,,,,344.0,1764.0,240.0,314.0,...,,,102.0,,,,,,,
8982,374.0,6635.0,,2283.0,1365.0,,,,,89.0,...,1161.0,,,44.0,,,,,4892.0,247.0
9640,,,,,,,,,,120.0,...,,135.0,,,22113.0,240.0,1455.0,,2649.0,


## MSD

In [56]:
msd_similar_users.loc[-1] = [user_handle, 1] 
msd_similar_users.index = msd_similar_users.index + 1 
msd_similar_users = msd_similar_users.sort_index()
msd_similar_users.sort_values(by=['view_time_sim_score'], ascending=False)

Unnamed: 0,similar_users,view_time_sim_score
0,100,1.0
1,722,1.0
2,1875,1.0
3,2434,1.0
4,3237,1.0
5,7393,1.0
6,9674,1.0
7,467,0.5
8,8896,0.1
9,3645,0.058824


In [57]:
user_interest_df.merge(msd_similar_users, left_on='user_handle', right_on='similar_users').sort_values(by='view_time_sim_score', ascending=False)

Unnamed: 0,user_handle,interest_tag,similar_users,view_time_sim_score
0,100,"vue,javascript-libraries,javascript,javascript...",100,1.0
2,722,"after-effects,environment-modeling,mudbox,uv-m...",722,1.0
3,1875,"after-effects,environment-modeling,uv-mapping,...",1875,1.0
4,2434,"3d-sculpting,marvelous-designer,zbrush,quixel-...",2434,1.0
5,3237,"zbrush,pipeline,3d-modeling,3d-animation,maya",3237,1.0
8,7393,"3d-sculpting,zbrush,game-programming,environme...",7393,1.0
10,9674,illustrator,9674,1.0
1,467,"javascript,nodejs,mvc-scaffolding,c#,typescrip...",467,0.5
9,8896,"digital-audio,environment-modeling,uv-mapping,...",8896,0.1
6,3645,devops,3645,0.058824


In [59]:
# Inspect the courses viewed by the MSD neighbors of user_handle.
msd_inspect_df=input_df[input_df.user_handle.isin(msd_similar_users.similar_users)]
msd_user_item_matrix=pd.pivot_table(msd_inspect_df, values='view_time_seconds', index='user_handle', columns='course_id')
# Keep only courses viewed by at least two of the inspected users.
msd_user_item_matrix.dropna(thresh=2, axis=1)


# Summary

The collaborative filtering models built above capture the implicit interest that a user shows by consuming content.

# References

http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf