Davide Sbetti - 14032

# PDA19 Challenge - KNNZScore Predictions

## Libraries

We start importing various different libraries that will then be used to import and process the given dataset

In [1]:
import pandas as pd
import numpy as np
from surprise import Dataset, Reader, KNNBaseline

## Data pre-processing

We start importing the dataset from the local folder

In [2]:
movie_rating = pd.read_csv("data/train-PDA2019.csv")
movie_rating.head()

Unnamed: 0,userID,itemID,rating,timeStamp
0,5,648,5,978297876
1,5,1394,5,978298237
2,5,3534,5,978297149
3,5,104,4,978298558
4,5,2735,5,978297919


Let's try to generate the dense matrix used then to understand which movies have not been rated by each user

In [3]:
movie_rating_full = movie_rating.pivot(index='userID',
                                       columns='itemID', 
                                       values='rating')

In [4]:
movie_rating_full.fillna(0).astype(int)

itemID,89,93,94,95,97,98,100,101,102,104,...,3929,3930,3931,3932,3937,3938,3945,3946,3950,3952
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12069,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12071,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12073,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12077,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we read the user test data set so we know which users are we interested into

In [5]:
users_test = pd.read_csv("data/test-PDA2019.csv")
users_test.head()

Unnamed: 0,userID,recommended_itemIDs
0,1,
1,3,
2,11,
3,29,
4,31,


Extracting all users IDs we are interested into

In [6]:
users = users_test.loc[:,'userID']

## SVD Recommender

We can now build the SVD recommender, using only the user's rating data, that will then be used for predictions

In [7]:
reader = Reader(rating_scale=(1,5))

In [8]:
data = Dataset.load_from_df(movie_rating.iloc[:,0:3], reader)

In [9]:
trainset = data.build_full_trainset()

In [10]:
my_sym_options = {'name' : 'pearson', 
                 'user_based': True}
recommender = KNNBaseline(sim_options=my_sym_options)

In [11]:
recommender.fit(trainset)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7fb9d3345e90>

Try to predict rating of unseen movies for user 1

Retrieving all his movies, consider only ratings = nan and predict those. We then convert the predictions into a strint and we append it in the users' test table

In [12]:
users[0:9]

0     1
1     3
2    11
3    29
4    31
5    33
6    35
7    51
8    53
Name: userID, dtype: int64

In [None]:
columns_name = movie_rating_full.columns


#print("Considering user", user)
for j in range(0,len(users)):
    user_predictions = {}
    user = users[j]
    rating = movie_rating_full.loc[user,:]
    for i in range(0, len(rating)):
        current_rating = rating.iloc[i]
        if pd.isna(current_rating):
            prediction = recommender.predict(user,columns_name[i],0)
            user_predictions[columns_name[i]] = prediction.est

    top10 = sorted(user_predictions, key = user_predictions.__getitem__)[:10]
    rec_string = " ".join(str(item) for item in top10)
    users_test.loc[j,'recommended_itemIDs'] = " " + rec_string

In [15]:
users_test.head(10)

Unnamed: 0,userID,recommended_itemIDs
0,1,89 93 94 95 97 98 100 101 102 104
1,3,1311 3376 3847 2516 1323 2449 3593 1739 3945 ...
2,11,2449 2462 2650 2955 2992 3012 3043 3140 3309 ...
3,29,1311 1323 3880 1826 2955 2449 1595 181 3945 1739
4,31,98 152 1323 1812 1900 2179 2351 2462 2515 2816
5,33,1311 2449 3376 3641 2955 1323 1739 1595 2515 ...
6,35,1311 3376 1323 1826 3641 2955 867 181 1595 2555
7,51,1311 3641 3945 1739 2449 1323 3563 3390 98 1826
8,53,1311 2462 2955 3437 3563 3587 1323 2655 2555 ...
9,55,1311 1891 2955 1323 2655 2449 3945 181 1826 2516


In [16]:
users_test.to_csv(path_or_buf = 'generated/KNNBaseline_recommendations.csv', 
                  index = False,
                  header = True, sep = ',')

In [44]:
for j in range(0,len(users)):
    print(users[j])

1
3
11
29
31
33
35
51
53
55
57
67
73
79
85
89
91
97
117
119
125
127
129
137
139
157
159
161
173
179
181
185
187
189
199
201
209
213
229
231
235
259
261
263
267
269
271
273
275
279
287
291
293
307
311
327
337
353
377
387
389
397
407
409
415
425
443
447
455
457
463
467
473
475
485
487
495
513
519
525
527
529
533
539
567
571
573
583
589
595
613
615
623
625
629
633
641
647
649
651
657
659
665
667
679
683
687
691
695
697
701
715
717
719
729
735
737
741
749
751
757
759
771
773
777
779
785
791
797
801
817
819
825
829
849
857
859
861
863
869
875
883
885
887
889
893
903
905
909
911
917
923
933
935
937
941
943
947
959
985
997
1005
1007
1009
1025
1039
1045
1049
1051
1053
1057
1061
1069
1073
1077
1079
1081
1085
1089
1091
1093
1105
1113
1119
1133
1135
1139
1141
1149
1157
1167
1171
1175
1177
1183
1189
1197
1201
1205
1219
1221
1227
1229
1233
1239
1247
1255
1257
1273
1277
1281
1285
1293
1295
1303
1309
1311
1323
1325
1331
1333
1341
1345
1347
1353
1361
1365
1379
1385
1391
1395
1397
1399
1417
1423
1431
1

11339
11341
11343
11347
11349
11359
11367
11371
11373
11379
11381
11391
11403
11407
11427
11443
11449
11451
11455
11459
11461
11471
11473
11485
11491
11495
11497
11501
11509
11513
11525
11531
11541
11551
11553
11555
11559
11563
11569
11577
11581
11589
11591
11593
11597
11601
11603
11605
11617
11627
11629
11633
11639
11641
11643
11649
11661
11665
11671
11673
11689
11693
11703
11713
11715
11717
11723
11727
11729
11731
11747
11757
11761
11769
11775
11779
11781
11795
11801
11805
11811
11813
11831
11833
11835
11837
11839
11851
11861
11865
11869
11871
11873
11879
11883
11891
11903
11909
11919
11931
11933
11939
11941
11955
11961
11965
11973
11975
11981
12005
12009
12011
12015
12021
12025
12029
12031
12037
12043
12047
12051
12061
12063
12073
