# Baseline Recommenders

This notebook covers two families of prediction algorithms that serve as baselines for model comparison.

**Note:** The descriptions below are taken from the [Surprise docs](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

## Baseline

BaselineOnly - predicts the baseline estimate for a given user and item (this is the most important one to compare against).
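As a rough illustration of what a baseline estimate is: the prediction for user u and item i has the form mu + b_u + b_i, where mu is the global mean rating and b_u, b_i are per-user and per-item offsets. The sketch below fits the offsets with a few rounds of regularized alternating updates on toy data. The names `fit_baseline` and `predict_baseline`, the toy ratings, and the regularization values are assumptions for illustration; Surprise's own `BaselineOnly` may differ in defaults and details.

```python
from collections import defaultdict

def fit_baseline(ratings, reg_u=1.0, reg_i=1.0, n_epochs=10):
    """Fit mu, b_u, b_i for the estimate mu + b_u + b_i (illustrative sketch)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = defaultdict(float)
    b_i = defaultdict(float)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Update item offsets with user offsets held fixed, then the reverse
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(b_u), dict(b_i)

def predict_baseline(mu, b_u, b_i, user, item):
    # Unknown users or items contribute a zero offset
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

toy = [("ann", "a", 5.0), ("ann", "b", 3.0), ("bob", "a", 4.0), ("bob", "c", 2.0)]
mu, b_u, b_i = fit_baseline(toy)
print(predict_baseline(mu, b_u, b_i, "ann", "c"))

# When every rating is identical, all offsets stay at zero
mu1, bu1, bi1 = fit_baseline([("ann", "a", 1.0), ("bob", "a", 1.0)])
```

With identical ratings the estimate collapses to the global mean, which is worth keeping in mind when reading error metrics on implicit (all-ones) data.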

## Collaborative Filtering

KNNBasic - basic KNN-based collaborative filtering.

KNNWithMeans - KNN-based collaborative filtering that takes the mean rating of each user into account.

KNNBaseline - KNN-based collaborative filtering that takes a _baseline_ rating into account.
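To make the difference between these variants concrete, here is a toy user-based sketch. `knn_basic` takes a similarity-weighted average of the neighbors' ratings; `knn_with_means` does the same on mean-centered ratings and adds the target user's mean back. KNNBaseline follows the same pattern with baseline estimates in place of user means. The function names, similarities, and ratings are made up for illustration and are not Surprise's implementation.

```python
def knn_basic(neighbors, k=2):
    """neighbors: list of (similarity, rating) for users who rated the item."""
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    total_sim = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / total_sim

def knn_with_means(user_mean, neighbors, k=2):
    """neighbors: list of (similarity, rating, neighbor_mean)."""
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    total_sim = sum(s for s, _, _ in top)
    # Average the neighbors' deviations from their own means
    return user_mean + sum(s * (r - m) for s, r, m in top) / total_sim

basic = knn_basic([(0.9, 4.0), (0.1, 2.0), (0.05, 1.0)])
centered = knn_with_means(3.0, [(0.9, 4.0, 3.5), (0.1, 2.0, 2.5), (0.05, 1.0, 3.0)])
print(basic, centered)
```

Only the two most similar neighbors contribute here because `k=2`; the third is dropped before averaging.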


In [122]:
import pandas as pd
from surprise import Dataset, Reader, BaselineOnly, KNNBasic, KNNWithMeans, KNNBaseline
from surprise.model_selection import cross_validate  # used by the cross-validation cells below
import numpy as np
from sklearn.model_selection import train_test_split

In [21]:
all_users = pd.read_csv('./top_users.csv')
all_repos = pd.read_csv('./top_repos.csv')
all_ratings = pd.read_csv('./user-item-ratings.csv')

all_ratings.head()

,User,Repos,Rating
0,0x00evil,atom/atom,1.0
1,0x00evil,scrapy/scrapy,1.0
2,0x00evil,jekyll/jekyll,1.0
3,0x00evil,git/git,1.0
4,0x00evil,torvalds/linux,1.0


In [56]:
codes, uniques = pd.factorize(all_ratings['User'].unique())
user_ids = pd.Series(codes, index=uniques)

codes, uniques = pd.factorize(all_ratings['Repos'].unique())
repo_ids = pd.Series(codes, index=uniques)

display(user_ids.head(), repo_ids.head())

0x00evil    0
0xWDG       1
11ph22il    2
1pete       3
1suming     4
dtype: int64

atom/atom         0
scrapy/scrapy     1
jekyll/jekyll     2
git/git           3
torvalds/linux    4
dtype: int64

In [49]:
# Look up the integer id assigned to a single user
user_ids['jgarcia']


606

In [22]:
n_users = all_ratings.User.unique().shape[0]
n_repos = all_ratings.Repos.unique().shape[0]

print(f'Number of Users: {n_users}')
print(f'Number of Repos: {n_repos}')

Number of Users: 1162
Number of Repos: 272


In [59]:
train_df, test_df = train_test_split(all_ratings, test_size=0.2)

display(train_df)
display(test_df)

,User,Repos,Rating
1492,alekpopovic,microsoft/TypeScript,1.0
6864,q4323636,angular/angular.js,1.0
4654,johnnyreilly,microsoft/TypeScript,1.0
4364,israelsantiago,jquery/jquery,1.0
6439,omargourari,apache/superset,1.0
...,...,...,...
3863,hackhowtofaq,facebook/hhvm,1.0
3822,grimen,videojs/video.js,1.0
702,KyleCharters,denoland/deno,1.0
6696,pmarin,Developer-Y/cs-video-courses,1.0


,User,Repos,Rating
8298,tsega,ossu/computer-science,1.0
3131,edifierx666,justjavac/free-programming-books-zh_CN,1.0
6628,pengyunchou,tensorflow/tensorflow,1.0
2177,boiyoo,microsoft/vscode,1.0
8541,vitorsilverio,arduino/Arduino,1.0
...,...,...,...
3776,gpicchiarelli,SerenityOS/serenity,1.0
4243,ikkira,ionic-team/ionic-framework,1.0
3470,flipflop,videojs/video.js,1.0
844,ParaXY,tensorflow/tensorflow,1.0


In [88]:
# User-item matrix for train
trainset = np.zeros((n_users, n_repos))
for row in train_df.itertuples():
    # Map user and repo names to their integer ids
    trainset[user_ids[row.User], repo_ids[row.Repos]] = row.Rating

# User-item matrix for test
testset = np.zeros((n_users, n_repos))
for row in test_df.itertuples():
    testset[user_ids[row.User], repo_ids[row.Repos]] = row.Rating

In [92]:
trainset.T

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
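DELTA = 25      # not used below; the original cell stopped before using it
EPSILON = 1e-9  # avoids division by zero for users with constant ratings

# Pearson correlation between users; rows of trainset are users
# (loop body is an assumed completion of the unfinished cell)
user_pearson_corr = np.zeros((n_users, n_users))
centered = trainset - trainset.mean(axis=1, keepdims=True)
norms = np.linalg.norm(centered, axis=1)
for i, user_i in enumerate(centered):
    for j, user_j in enumerate(centered):
        user_pearson_corr[i, j] = user_i @ user_j / (norms[i] * norms[j] + EPSILON)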
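The pairwise user correlation that the cell above sets up can also be computed without Python loops. A minimal sketch using `np.corrcoef`, which treats each row as one variable; the small matrix `toy` stands in for `trainset` and is an assumption for illustration. Rows with zero variance (for example, a user with no ratings in the split) produce NaN, which is replaced with 0 here.

```python
import numpy as np

# Toy user-item matrix: rows are users, columns are repos (stand-in for trainset)
toy = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],  # user with no ratings: zero variance
])

# Suppress the floating-point warnings raised by the zero-variance row
with np.errstate(invalid="ignore", divide="ignore"):
    corr = np.corrcoef(toy)  # shape (n_users, n_users)

# Undefined correlations become 0 (treated as "no similarity")
corr = np.nan_to_num(corr, nan=0.0)
print(corr.round(2))
```

For a matrix of roughly a thousand users this runs in a single call, whereas the nested loop visits every user pair in Python.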

In [133]:
# Data for Surprise
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(all_ratings[['User', 'Repos', 'Rating']], reader)


# bsl_options = {
#     "reg_u": 1,
#     "reg_i": 1,
# }
# cross_validate(BaselineOnly(bsl_options), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# trainset_full = data.build_full_trainset()
# bl = BaselineOnly()
# model = bl.fit(trainset_full)
# preds = bl.test(testset)

In [20]:
from surprise import NormalPredictor

cross_validate(NormalPredictor(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)



Evaluating RMSE, MAE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  
MAE (testset)     0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  
Fit time          0.04    0.03    0.04    0.03    0.03    0.04    0.01    
Test time         0.05    0.03    0.06    0.03    0.03    0.04    0.01    


{'test_rmse': array([0., 0., 0., 0., 0.]),
 'test_mae': array([0., 0., 0., 0., 0.]),
 'fit_time': (0.0444638729095459,
  0.03491091728210449,
  0.0435178279876709,
  0.03423190116882324,
  0.028619050979614258),
 'test_time': (0.04812312126159668,
  0.03122997283935547,
  0.06273603439331055,
  0.03461813926696777,
  0.03024911880493164)}
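The zeros in this table are expected rather than a sign of a good model: every rating displayed from `user-item-ratings.csv` is 1.0, so a predictor that returns the only value it has seen makes no error. A small sketch of that arithmetic, with made-up lists standing in for the real ratings:

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

train = [1.0] * 8            # stand-in: all observed ratings equal 1.0
test = [1.0] * 2
mean = sum(train) / len(train)
preds = [mean] * len(test)   # constant predictor fitted on train

print(rmse(test, preds), mae(test, preds))  # 0.0 0.0
```

Error metrics only become informative once the data contains more than one rating value, for example after adding negative samples.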

In [11]:
# benchmark = []

# algorithms = [BaselineOnly(), KNNBasic(), KNNWithMeans(), KNNBaseline()]
# for algorithm in algorithms:
#     # Perform cross validation
#     results = cross_validate(algorithm, 
#                              data, 
#                              measures=['MAE', 'RMSE'], 
#                              cv=5, 
#                              return_train_measures=True, 
#                              verbose=False)

#     # Get results & append algorithm name
#     tmp = pd.DataFrame.from_dict(results).mean(axis=0)
#     tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
#     benchmark.append(tmp)
    
# pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')


Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...



Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.



Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.



Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.



Algorithm,test_mae,train_mae,test_rmse,train_rmse,fit_time,test_time
BaselineOnly,0.0,0.0,0.0,0.0,0.059746,0.032281
KNNBasic,0.0,0.0,0.0,0.0,0.119249,0.35364
KNNWithMeans,0.0,0.0,0.0,0.0,0.123956,0.445351
KNNBaseline,0.0,0.0,0.0,0.0,0.110794,0.467172


In [167]:

# A reader is still needed but only the rating_scale param is required.
# reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
# data = Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)

# We can now use this dataset as we please, e.g. calling cross_validate
# cross_validate(KNNWithMeans(sim_options={
#     'name': 'cosine',
#     'user_based': False
# }), data, cv=2)

reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(all_ratings[['User', 'Repos', 'Rating']], reader)

training_set = data.build_full_trainset()

sim_options = {
    "name": "cosine",
    "user_based": True
}
algo = KNNWithMeans(sim_options=sim_options)

algo.fit(training_set)

# 'asbdd' is not a repo in the training set, so this returns a default estimate
pred = algo.predict('0x00evil', 'asbdd')
pred.est

Computing the cosine similarity matrix...
Done computing similarity matrix.


1

In [170]:
training_set

<surprise.trainset.Trainset at 0x136b37f70>

In [100]:
all_ratings.head(20)

,User,Repos,Rating
0,0x00evil,atom/atom,1.0
1,0x00evil,scrapy/scrapy,1.0
2,0x00evil,jekyll/jekyll,1.0
3,0x00evil,git/git,1.0
4,0x00evil,torvalds/linux,1.0
5,0x00evil,rails/rails,1.0
6,0x00evil,spree/spree,1.0
7,0x00evil,SwiftGGTeam/the-swift-programming-language-in-...,1.0
8,0x00evil,sinatra/sinatra,1.0
9,0x00evil,golang/go,1.0
