Davide Sbetti - 14032

# PDA19 Challenge - SVD Predictions

## Libraries

We start by importing the libraries used to load and process the dataset and to train the recommender.

In [1]:
import pandas as pd
import numpy as np
from surprise import Dataset, Reader, SVDpp

## Data pre-processing

We start by loading the training dataset from the local `data` folder.

In [2]:
movie_rating = pd.read_csv("data/train-PDA2019.csv")
movie_rating.head()

,userID,itemID,rating,timeStamp
0,5,648,5,978297876
1,5,1394,5,978298237
2,5,3534,5,978297149
3,5,104,4,978298558
4,5,2735,5,978297919


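Next we pivot the ratings into a dense user-by-item matrix, which is used later to find the movies each user has not rated yet. Missing ratings appear as `NaN`; the second cell below replaces them with 0 for display only and does not modify the matrix.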

In [3]:
movie_rating_full = movie_rating.pivot(index='userID',
                                       columns='itemID', 
                                       values='rating')

In [4]:
movie_rating_full.fillna(0).astype(int)

userID / itemID,89,93,94,95,97,98,100,101,102,104,...,3929,3930,3931,3932,3937,3938,3945,3946,3950,3952
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12069,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12071,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12073,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12077,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
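
To make the role of the pivoted matrix concrete, here is a minimal sketch on a made-up three-row table (the values are illustrative, not taken from the challenge data). It shows that `pivot` leaves `NaN` wherever a user has not rated an item, which is what the prediction step relies on to find unseen movies.

```python
import pandas as pd

# Toy ratings, invented for illustration only
toy = pd.DataFrame({
    "userID": [1, 1, 3],
    "itemID": [89, 93, 89],
    "rating": [4, 5, 2],
})

# One row per user, one column per item; missing pairs become NaN
toy_full = toy.pivot(index="userID", columns="itemID", values="rating")

# Items that user 3 has not rated yet
row = toy_full.loc[3]
unseen_by_3 = row[row.isna()].index.tolist()
print(unseen_by_3)  # [93]
```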


Now we read the test set, which lists the users we need recommendations for.

In [5]:
users_test = pd.read_csv("data/test-PDA2019.csv")
users_test.head()

,userID,recommended_itemIDs
0,1,
1,3,
2,11,
3,29,
4,31,


We extract the IDs of these target users.

In [6]:
users = users_test.loc[:,'userID']

## SVD Recommender

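We can now build the recommender that will be used for predictions, using only the users' rating data. Note that the model instantiated below is Surprise's `SVDpp` (SVD++), an extension of SVD that also takes implicit ratings into account, with its default hyperparameters.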

In [7]:
reader = Reader(rating_scale=(1,5))

In [8]:
data = Dataset.load_from_df(movie_rating[['userID', 'itemID', 'rating']], reader)
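
Surprise's `Dataset.load_from_df` expects exactly three columns in the order user, item, rating, and the `Reader` declares the scale they live on. A small sanity check before loading could look like the sketch below; the helper name `check_ratings` is hypothetical and not part of Surprise or of this notebook, and the two rows are copied from the preview of the training data above.

```python
import pandas as pd

def check_ratings(df, low=1, high=5):
    """Return the (user, item, rating) columns after basic validation.

    Hypothetical helper, written for illustration only.
    """
    triples = df[["userID", "itemID", "rating"]]  # select by name, in order
    if triples.isna().any().any():
        raise ValueError("missing values in rating triples")
    if not triples["rating"].between(low, high).all():
        raise ValueError("rating outside the declared scale")
    return triples

toy = pd.DataFrame({
    "userID": [5, 5],
    "itemID": [648, 1394],
    "rating": [5, 5],
    "timeStamp": [978297876, 978298237],
})
triples = check_ratings(toy)
print(list(triples.columns))
```

Selecting the columns by name rather than by position keeps the check valid even if the column order of the CSV changes.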

In [9]:
trainset = data.build_full_trainset()

In [10]:
recommender = SVDpp()

In [11]:
recommender.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x7fde9c949e90>

For each target user, we now predict the rating of every movie the user has not rated yet and store the estimates in a dictionary keyed by item ID. Sorting the dictionary by estimated rating, in descending order, gives the top 10 recommended movies. We then join their IDs into a space-separated string and write it into the data frame of target users, so that the results can be exported to a CSV file directly at the end of the procedure.

In [13]:
columns_name = movie_rating_full.columns

for j in range(0, len(users)):
    user_predictions = {}
    user = users[j]
    # Ratings of the current user; NaN marks movies not rated yet
    rating = movie_rating_full.loc[user, :]
    for i in range(0, len(rating)):
        current_rating = rating.iloc[i]
        if pd.isna(current_rating):
            # The third argument (r_ui) is a placeholder for the unknown true rating
            prediction = recommender.predict(user, columns_name[i], 0)
            user_predictions[columns_name[i]] = prediction.est

    # Item IDs of the 10 highest estimated ratings, best first
    top10 = sorted(user_predictions, key=user_predictions.__getitem__,
                   reverse=True)[:10]
    rec_string = " ".join(str(item) for item in top10)
    users_test.loc[j, 'recommended_itemIDs'] = " " + rec_string
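
The ranking step can be looked at in isolation. The sketch below uses made-up estimates and `heapq.nlargest` from the standard library as an alternative to sorting the whole dictionary; it is not the code used above, and the helper name `format_top_n` is hypothetical.

```python
import heapq

def format_top_n(predictions, n=10):
    """Return the n item IDs with the highest estimate, space-separated."""
    # Iterating over a dict yields its keys; the key function looks up the estimate
    top = heapq.nlargest(n, predictions, key=predictions.__getitem__)
    return " ".join(str(item) for item in top)

# Invented estimates for illustration
estimates = {318: 4.9, 2931: 4.7, 858: 4.8, 104: 3.1}
print(format_top_n(estimates, n=3))  # 318 858 2931
```

`heapq.nlargest` returns the same items as `sorted(..., reverse=True)[:n]` but avoids a full sort, which may matter when the number of unseen items per user is large.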

In [14]:
users_test.head(10)

,userID,recommended_itemIDs
0,1,318 2931 3338 3307 858 1148 1207 593 922 2762
1,3,318 1207 2931 2762 1148 908 3307 3338 260 1262
2,11,318 2762 1207 3338 2329 260 1148 593 2931 1262
3,29,2931 2762 3338 2329 260 593 1148 858 1262 1207
4,31,2931 318 3338 3307 1207 1148 858 908 922 2329
5,33,318 3338 2931 1207 1148 3307 908 912 858 2762
6,35,3030 3307 1178 858 923 3338 912 908 913 608
7,51,318 3338 1207 3307 2931 1148 908 912 858 913
8,53,3338 2931 3307 1207 858 912 922 1148 1178 3030
9,55,3338 2931 858 318 3307 1193 2858 593 1178 1207


With all predictions in place, we export them to a CSV file with pandas so that it can be submitted to Kaggle for evaluation.

In [15]:
users_test.to_csv(path_or_buf = 'generated/SVD++_recommendations.csv', 
                  index = False,
                  header = True, sep = ',')
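
Before uploading, it can be worth checking the layout of the exported file. The following sketch writes a small made-up frame to an in-memory buffer with the same `to_csv` options and reads it back; the rows are shortened to three items each for brevity, and the variable names are assumptions made for this example.

```python
import io
import pandas as pd

# Small stand-in for the submission frame, invented for illustration
submission = pd.DataFrame({
    "userID": [1, 3],
    "recommended_itemIDs": ["318 2931 3338", "318 1207 2931"],
})

buffer = io.StringIO()
submission.to_csv(buffer, index=False, header=True, sep=",")

# Read the text back and count the recommended items per user
buffer.seek(0)
check = pd.read_csv(buffer)
items_per_user = check["recommended_itemIDs"].str.split().str.len().tolist()
print(list(check.columns), items_per_user)
```

On the real file, the same count should be 10 for every user.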