<a href="https://colab.research.google.com/github/rodolfoarruda/MachineLearning/blob/main/recom_sys_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Rodolfo Arruda - 6381848

### **SCC5966 – Sistemas de Recomendação**

## **Setup**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
np.set_printoptions(suppress=True)

import pandas as pd
import matplotlib.pyplot as plt

# calculate sparsity
from numpy import array
from numpy import count_nonzero

# calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import pairwise_distances

# Split train test
from sklearn.model_selection import train_test_split

# Machine learning 
from sklearn.metrics import mean_squared_error as mse
from xgboost import XGBRegressor

## **1 - Data Preparation**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/train_data.csv',sep=',')
df.head()

In [None]:
plt.bar(df['rating'].value_counts().index,df['rating'].value_counts())
plt.title('Ratings Distribution')
plt.xlabel('Rating')
plt.ylabel('# Evaluations');

In [None]:
df['rating'].mean()

#### **1.1 - Dummy submission - by movie average**

In [None]:
avg_movie = pd.DataFrame(df['rating'].groupby(df['movie_id']).mean())
avg_movie.reset_index(inplace=True)

In [None]:
avg_movie.head()

In [None]:
df_test = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/test_data.csv',sep=',')
df_test.head()

In [None]:
df_test.count()

In [None]:
pred_dummy_movie = pd.merge(df_test, avg_movie, on="movie_id",how="left").fillna(4)

In [None]:
pred_dummy_movie.head()

In [None]:
pred = pred_dummy_movie[['id','rating']]
pred.head()

In [None]:
pred.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/s1.csv',index=False)
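The fallback value `4` used above for movies without training ratings is hard-coded. A minimal sketch of the same left merge, assuming the training global mean is an acceptable fallback instead; `toy_train` and `toy_test` are hypothetical frames that only mirror the column names of the real files:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's train/test files.
toy_train = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [5, 3, 4, 2, 4],
})
toy_test = pd.DataFrame({
    "id":       [0, 1, 2],
    "user_id":  [1, 2, 3],
    "movie_id": [10, 30, 99],   # movie 99 never appears in training
})

# Per-movie mean rating learned from the training data.
movie_mean = toy_train.groupby("movie_id", as_index=False)["rating"].mean()

# Left merge keeps every test row; movies unseen in training get NaN.
toy_pred = toy_test.merge(movie_mean, on="movie_id", how="left")

# Fill only the rating column, using the training global mean as the fallback.
global_mean = toy_train["rating"].mean()
toy_pred["rating"] = toy_pred["rating"].fillna(global_mean)

print(toy_pred[["id", "rating"]])
```

Filling only the `rating` column avoids silently overwriting missing values in other columns, which a frame-wide `fillna` would do.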

#### **1.2 - Dummy submission - by user average**

In [None]:
avg_user = pd.DataFrame(df['rating'].groupby(df['user_id']).mean())
avg_user.reset_index(inplace=True)

In [None]:
avg_user.head()

In [None]:
pred_dummy_user = pd.merge(df_test, avg_user, on="user_id",how="left").fillna(3.603814223642363)

In [None]:
pred_dummy_user.count()

In [None]:
pred = pred_dummy_user[['id','rating']]
pred.head()

In [None]:
pred.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/s2.csv',index=False)

## **2 - Collaborative Filtering based on movie**


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/train_data.csv',sep=',')
df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1160,5,974769817
1,1,1129,3,974769817
2,1,3328,4,974769817
3,1,2659,2,974769817
4,1,980,3,974769817


#### **2.1 - Data Normalization**

In [None]:
def sub_mean(df):
  ## Normalize rating by movie
  avg_movie = pd.DataFrame(df['rating'].groupby(df['movie_id']).mean())
  avg_movie = avg_movie.rename(columns = {'rating': 'avg_movie'})
  avg_movie.reset_index(inplace=True)
  result1 = pd.merge(df, avg_movie, on="movie_id")
  result1['rating_avgr_movie'] = result1['rating'] - result1['avg_movie']

  ## Normalize rating by user
  avg_user= pd.DataFrame(df['rating'].groupby(df['user_id']).mean())
  avg_user = avg_user.rename(columns = {'rating': 'avg_user'})
  avg_user.reset_index(inplace=True)
  result2 = pd.merge(result1, avg_user, on="user_id")
  result2['rating_avgr_user'] = result2['rating'] - result2['avg_user']

  return result2

In [None]:
df_norm = sub_mean(df)

In [None]:
df_norm.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,rating_avgr_movie,avg_user,rating_avgr_user
0,1,1160,5,974769817,3.937198,1.062802,3.769231,1.230769
1,1,1129,3,974769817,3.99332,-0.99332,3.769231,-0.769231
2,1,3328,4,974769817,3.662202,0.337798,3.769231,0.230769
3,1,2659,2,974769817,3.688333,-1.688333,3.769231,-1.769231
4,1,980,3,974769817,3.927287,-0.927287,3.769231,-0.769231
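The two merges inside `sub_mean` can also be written with `groupby(...).transform('mean')`, which returns the group mean aligned to each original row. The sketch below uses a hypothetical four-row frame, not the notebook's data, and keeps the same output column names:

```python
import pandas as pd

# Hypothetical toy ratings: two users, two movies.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2],
    "movie_id": [10, 20, 10, 20],
    "rating":   [5.0, 3.0, 4.0, 2.0],
})

# Mean rating of each movie, broadcast back to every row of that movie.
toy["avg_movie"] = toy.groupby("movie_id")["rating"].transform("mean")
toy["rating_avgr_movie"] = toy["rating"] - toy["avg_movie"]

# Mean rating of each user, broadcast back to every row of that user.
toy["avg_user"] = toy.groupby("user_id")["rating"].transform("mean")
toy["rating_avgr_user"] = toy["rating"] - toy["avg_user"]

print(toy)
```

After centring, the deviations within each movie (or user) sum to zero, which is a quick sanity check on the normalization.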


#### **2.2 - user x item matrix**


In [None]:
A = df_norm.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)

In [None]:
A

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,3525,3526,3527,3528,3529,3530,3531,3532,3533,3534,3535,3536,3537,3538,3539,3540,3541,3542,3543,3544,3545,3546,3547,3548,3549,3550,3551,3552,3553,3554,3555,3556,3557,3558,3559,3560,3561,3562,3563,3564
user_id
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,5.0,0.0,0.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3971,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3972,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### **2.3 - Sparsity evaluation**
##### The sparsity of a matrix is the number of zero-valued entries divided by the total number of entries. Here a zero means that the user has not rated the movie.

In [None]:
sparsity = 1.0 - count_nonzero(A) /A.size
print(sparsity)

0.9619391144037263
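As a small worked example of the same score, the sketch below computes the sparsity of a hypothetical 3 x 4 rating matrix (the values are made up for illustration):

```python
import numpy as np

# Hypothetical 3 users x 4 movies rating matrix; 0 marks "not rated".
toy_ratings = np.array([
    [5, 0, 0, 3],
    [0, 0, 4, 0],
    [0, 2, 0, 0],
])

n_rated = np.count_nonzero(toy_ratings)        # observed ratings
sparsity = 1.0 - n_rated / toy_ratings.size    # fraction of empty cells
print(sparsity)
```

Counting non-zeros works here only because raw ratings lie between 1 and 5. On mean-centred values a genuine rating can be exactly zero, so the count should be taken on the raw ratings.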


##### A sparse matrix in CSR (compressed sparse row) format can be created with the csr_matrix() function, either from a dense NumPy array or, as done here, directly from (value, (row, column)) triplets.

In [None]:
from scipy import sparse

In [None]:
train = sparse.csr_matrix((df_norm.rating_avgr_movie, (df_norm.user_id, df_norm.movie_id)))
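The call above passes `(values, (row_indices, column_indices))` to `csr_matrix`. A toy sketch of that constructor with made-up IDs and values; because rows and columns are addressed by the raw IDs, index 0 stays empty when IDs start at 1:

```python
import numpy as np
from scipy import sparse

# Hypothetical triplets: (user_id, movie_id, mean-centred rating).
user_ids  = np.array([1, 1, 2, 3])
movie_ids = np.array([1, 3, 2, 3])
values    = np.array([0.5, -1.0, 0.25, 1.5])

# Shape is inferred as (max row index + 1, max column index + 1).
toy_csr = sparse.csr_matrix((values, (user_ids, movie_ids)))

print(toy_csr.shape)
print(toy_csr.toarray())
```

The unused row and column 0 are harmless for the similarity step, but they do mean that matrix positions equal the IDs, which is what the later lookups by `movie_id` rely on.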

#### **2.4 - Compute similar movies**

##### A similarity matrix measures how alike two movies (or two user profiles) are, and is the basis for generating recommendations. To remove movie and user bias, ratings are rescaled by subtracting the corresponding average.

In [None]:
similarity = cosine_similarity(train.T, dense_output = False)

In [None]:
print(similarity)
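To make the item-item similarity concrete, here is a minimal sketch on a hypothetical 3 users x 3 movies matrix of mean-centred ratings. Transposing puts movies in rows, so `cosine_similarity` compares movie columns with each other:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mean-centred ratings: rows are users, columns are movies.
toy_centered = sparse.csr_matrix(np.array([
    [ 1.0,  1.0, -1.0],
    [-1.0, -1.0,  1.0],
    [ 0.5,  0.0,  0.0],
]))

# Movie-by-movie cosine similarity, kept sparse as in the notebook.
toy_sim = cosine_similarity(toy_centered.T, dense_output=False)
toy_sim_dense = toy_sim.toarray()

print(toy_sim_dense.round(3))
```

Movies 1 and 2 receive opposite deviations from every user who rated them, so their similarity is -1, while each movie has similarity 1 with itself. Missing ratings are stored as 0, which after centring is equivalent to assuming the user would have given the average rating.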

In [None]:
# Reference rating parameters
avg_movie = pd.DataFrame(df_norm['rating'].groupby(df_norm['movie_id']).mean())
avg_movie.reset_index(inplace=True)

# Reference movies
moviex=df_norm['movie_id'].unique()

In [None]:
#moviex = [1160, 1129, 3328]

#### **2.5 - Compute top similar movies**

In [None]:
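def sim_knearb(movie,k,similarity):
  # Non-zero entries of the similarity matrix as (row, column, value) triplets
  y = pd.DataFrame(np.column_stack(sparse.find(similarity)),columns=['similar','base','w'])
  # Drop each movie's similarity with itself
  z = y[y['base'] != y['similar']]

  # Keep the k most similar movies for the requested movie
  return z[z['base'] == movie].sort_values(by='w',ascending=False).head(k)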

In [None]:
def avg_knearb(moviex,k,similarity):

  # auxiliary variables
  j = 0
  aux  = {'similar': [0.0], 'base': [0], 'w':[0.0]}
  base = pd.DataFrame(aux, columns = ['similar','base','w'])

  for i in moviex:

    top = sim_knearb(i , k , similarity)
    base = pd.concat([base, top])

    j += 1
    print('Iteration #:', j)

  # Rank the neighbours of each base movie by descending similarity
  base['sub_group_rank'] = base.groupby('base')['w'].rank(ascending=False)

  return base

In [None]:
base = avg_knearb(moviex,10,similarity)
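The loop above extracts the non-zero triplets of the whole similarity matrix once per movie. A possible alternative, sketched here on a hypothetical 4 x 4 similarity matrix, extracts them once and takes the top `k` rows per group. The function name `top_k_neighbours` is illustrative and not part of the notebook:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical symmetric similarity matrix (matrix index = item ID).
toy_similarity = sparse.csr_matrix(np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.4, 0.7],
    [0.9, 0.4, 1.0, 0.1],
    [0.0, 0.7, 0.1, 1.0],
]))

def top_k_neighbours(similarity, k):
    """Return the k most similar items for every item in a single pass."""
    rows, cols, weights = sparse.find(similarity)      # triplets, computed once
    pairs = pd.DataFrame({"similar": rows, "base": cols, "w": weights})
    pairs = pairs[pairs["similar"] != pairs["base"]]   # drop self-similarity
    pairs = pairs.sort_values(["base", "w"], ascending=[True, False])
    top = pairs.groupby("base").head(k).copy()         # first k rows per base item
    top["sub_group_rank"] = top.groupby("base")["w"].rank(ascending=False)
    return top.reset_index(drop=True)

toy_top = top_k_neighbours(toy_similarity, k=2)
print(toy_top)
```

The result has the same columns as `base` (`similar`, `base`, `w`, `sub_group_rank`) but without the placeholder first row, so it could be saved and reloaded in the same way.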

In [None]:
base.head()

Unnamed: 0,similar,base,w,sub_group_rank
0,0.0,0.0,0.0,1.0
2970687,1112.0,1160.0,0.293818,1.0
2971797,2368.0,1160.0,0.267087,2.0
2971799,2370.0,1160.0,0.214225,3.0
2971802,2373.0,1160.0,0.211034,4.0


In [None]:
base.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/pto_checagem_sim2.csv',index=False)

#### **2.6 - Predictions**

In [None]:
base = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/pto_checagem_sim2.csv')

In [None]:
base.head()

Unnamed: 0,similar,base,w,sub_group_rank
0,0.0,0.0,0.0,1.0
1,1112.0,1160.0,0.293818,1.0
2,2368.0,1160.0,0.267087,2.0
3,2370.0,1160.0,0.214225,3.0
4,2373.0,1160.0,0.211034,4.0


In [None]:
base.count()

similar           34021
base              34021
w                 34021
sub_group_rank    34021
dtype: int64

In [None]:
base = base[base['sub_group_rank'] <= 6]

In [None]:
df_norm.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,rating_avgr_movie,avg_user,rating_avgr_user
0,1,1160,5,974769817,3.937198,1.062802,3.769231,1.230769
1,1,1129,3,974769817,3.99332,-0.99332,3.769231,-0.769231
2,1,3328,4,974769817,3.662202,0.337798,3.769231,0.230769
3,1,2659,2,974769817,3.688333,-1.688333,3.769231,-1.769231
4,1,980,3,974769817,3.927287,-0.927287,3.769231,-0.769231


In [None]:
x_train, x_test = train_test_split(df_norm, test_size=0.3, random_state=0)

In [None]:
x_train.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,rating_avgr_movie,avg_user,rating_avgr_user
71608,227,34,5,974738104,3.884694,1.115306,3.357553,1.642447
276128,2979,171,2,966267076,3.577406,-1.577406,3.503378,-1.503378
270463,2920,967,4,965277713,4.065617,-0.065617,3.645,0.355
263232,2784,1866,3,965340485,3.051471,-0.051471,3.320099,-0.320099
234148,2457,2007,4,965941989,3.571429,0.428571,3.951027,0.048973


In [None]:
avg_movie = pd.DataFrame(x_train['rating'].groupby(df_norm['movie_id']).mean())
avg_movie.reset_index(inplace=True)

In [None]:
#x_test_min = x_test[['user_id','movie_id','rating']]

x_test_min = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/test_data.csv')

In [None]:
x_test_min.head()

Unnamed: 0,id,user_id,movie_id,timestamp
0,0,5,2962,974769784
1,1,5,3177,974769768
2,2,5,3153,974769768
3,3,5,501,974769768
4,4,5,3159,974769768


In [None]:
pred_movie_avg = pd.merge(x_test_min,avg_movie, on="movie_id",how="left").fillna(3.603814223642363)

In [None]:
pred_movie_avg.head()


Unnamed: 0,id,user_id,movie_id,timestamp,rating
0,0,5,2962,974769784,3.637931
1,1,5,3177,974769768,3.637931
2,2,5,3153,974769768,3.637931
3,3,5,501,974769768,3.637931
4,4,5,3159,974769768,3.637931


In [None]:
pred_movie_avg2 = pd.merge(pred_movie_avg,base, how='left',left_on=['movie_id'],right_on=['base'])

In [None]:
pred_movie_avg2.head()

Unnamed: 0,id,user_id,movie_id,timestamp,rating,similar,base,w,sub_group_rank
0,0,5,2962,974769784,3.637931,104.0,5.0,0.51903,1.0
1,0,5,2962,974769784,3.637931,725.0,5.0,0.497589,2.0
2,0,5,2962,974769784,3.637931,112.0,5.0,0.478498,3.0
3,0,5,2962,974769784,3.637931,611.0,5.0,0.460403,4.0
4,0,5,2962,974769784,3.637931,360.0,5.0,0.439991,5.0


In [None]:
df_norm[['rating_avgr_movie','user_id','movie_id']].dtypes

rating_avgr_movie    float64
user_id                int64
movie_id               int64
dtype: object

In [None]:
pred_movie_avg2['similar'] = pred_movie_avg2['similar'].fillna(0).astype(int)
pred_movie_avg2.dtypes

id                  int64
user_id             int64
movie_id            int64
timestamp           int64
rating            float64
similar             int64
base              float64
w                 float64
sub_group_rank    float64
dtype: object

In [None]:
pred_movie_avg3 = pd.merge(pred_movie_avg2,df_norm[['rating_avgr_movie','user_id','movie_id']],
                           how='left',left_on=['user_id','similar'],right_on=['user_id','movie_id']).fillna(0)

In [None]:
pred_movie_avg3.head()

Unnamed: 0,id,user_id_x,movie_id,timestamp,rating,similar,base,w,sub_group_rank,avg_user,user_id_y
0,0,5,2962,974769784,3.637931,104,5.0,0.51903,1.0,0.0,0.0
1,0,5,2962,974769784,3.637931,725,5.0,0.497589,2.0,0.0,0.0
2,0,5,2962,974769784,3.637931,112,5.0,0.478498,3.0,3.111111,112.0
3,0,5,2962,974769784,3.637931,611,5.0,0.460403,4.0,0.0,0.0
4,0,5,2962,974769784,3.637931,360,5.0,0.439991,5.0,0.0,0.0


In [None]:
pred_movie_avg3['avg_pond'] = pred_movie_avg3['w']* pred_movie_avg3['rating_avgr_movie']

In [None]:
pred_movie_avg3.head()

Unnamed: 0,id,user_id_x,movie_id,timestamp,rating,similar,base,w,sub_group_rank,avg_user,user_id_y,avg_pond
0,0,5,2962,974769784,3.637931,104,5.0,0.51903,1.0,0.0,0.0,0.0
1,0,5,2962,974769784,3.637931,725,5.0,0.497589,2.0,0.0,0.0,0.0
2,0,5,2962,974769784,3.637931,112,5.0,0.478498,3.0,3.111111,112.0,1.48866
3,0,5,2962,974769784,3.637931,611,5.0,0.460403,4.0,0.0,0.0,0.0
4,0,5,2962,974769784,3.637931,360,5.0,0.439991,5.0,0.0,0.0,0.0


In [None]:
pred_movie_avg3.count()

id                23250
user_id_x         23250
movie_id          23250
timestamp         23250
rating            23250
similar           23250
base              23250
w                 23250
sub_group_rank    23250
avg_user          23250
user_id_y         23250
avg_pond          23250
dtype: int64

In [None]:
pred_movie_avg3.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/escoragem_v3.csv',index=False)
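This section ends with the per-neighbour products `avg_pond`. One common way to turn them into a rating, shown here as an assumption rather than as the method used for the submission, is to add the similarity-weighted average deviation to the baseline rating and clip to the 1 to 5 scale. The frame `toy_rows`, its `dev` column and the `rated` flag are hypothetical:

```python
import pandas as pd

# Hypothetical rows shaped like `pred_movie_avg3`: one row per (test id, neighbour).
# `rating` is the baseline; `dev` is the user's mean-centred rating of the
# neighbour; `rated` says whether the user actually rated that neighbour.
toy_rows = pd.DataFrame({
    "id":     [0, 0, 0, 1, 1],
    "rating": [3.5, 3.5, 3.5, 4.0, 4.0],
    "w":      [0.5, 0.3, 0.2, 0.6, 0.4],
    "dev":    [1.0, 0.0, -0.5, 0.0, 0.0],
    "rated":  [True, False, True, False, False],
})

toy_rows["avg_pond"] = toy_rows["w"] * toy_rows["dev"]
# Only neighbours the user rated contribute to the normalising weight.
toy_rows["w_used"] = toy_rows["w"].abs() * toy_rows["rated"]

agg = toy_rows.groupby("id").agg(
    baseline=("rating", "first"),
    num=("avg_pond", "sum"),
    den=("w_used", "sum"),
)

# Where no neighbour was rated (den == 0), fall back to the baseline alone.
adjustment = (agg["num"] / agg["den"]).where(agg["den"] > 0, 0.0)
agg["prediction"] = (agg["baseline"] + adjustment).clip(1, 5)

print(agg)
```

Because the merge above fills missing neighbours with 0, an unrated neighbour is indistinguishable from a deviation of exactly zero. A flag such as `rated` could be obtained from the `indicator=True` option of `pd.merge` before calling `fillna`.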

## **3 - Collaborative filtering based on user**


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/train_data.csv',sep=',')
df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1160,5,974769817
1,1,1129,3,974769817
2,1,3328,4,974769817
3,1,2659,2,974769817
4,1,980,3,974769817


#### **3.1 - Data Normalization**

In [None]:
def sub_mean(df):
  ## Normalize rating by movie
  avg_movie = pd.DataFrame(df['rating'].groupby(df['movie_id']).mean())
  avg_movie = avg_movie.rename(columns = {'rating': 'avg_movie'})
  avg_movie.reset_index(inplace=True)
  result1 = pd.merge(df, avg_movie, on="movie_id")
  result1['rating_avgr_movie'] = result1['rating'] - result1['avg_movie']

  ## Normalize rating by user
  avg_user= pd.DataFrame(df['rating'].groupby(df['user_id']).mean())
  avg_user = avg_user.rename(columns = {'rating': 'avg_user'})
  avg_user.reset_index(inplace=True)
  result2 = pd.merge(result1, avg_user, on="user_id")
  result2['rating_avgr_user'] = result2['rating'] - result2['avg_user']

  return result2

In [None]:
df_norm = sub_mean(df)

In [None]:
df_norm.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,rating_avgr_movie,avg_user,rating_avgr_user
0,1,1160,5,974769817,3.937198,1.062802,3.769231,1.230769
1,1,1129,3,974769817,3.99332,-0.99332,3.769231,-0.769231
2,1,3328,4,974769817,3.662202,0.337798,3.769231,0.230769
3,1,2659,2,974769817,3.688333,-1.688333,3.769231,-1.769231
4,1,980,3,974769817,3.927287,-0.927287,3.769231,-0.769231


#### **3.2 - item x user matrix**


In [None]:
A = df_norm.pivot(index='movie_id', columns='user_id', values='rating').fillna(0)

In [None]:
A

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,3934,3936,3937,3938,3939,3940,3941,3942,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952,3953,3954,3955,3956,3957,3958,3959,3960,3961,3962,3963,3964,3965,3966,3967,3968,3969,3970,3971,3972,3973,3974
movie_id
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3560,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
3561,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3562,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3563,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### **3.3 - Sparsity evaluation**
##### The sparsity of a matrix is the number of zero-valued entries divided by the total number of entries. Here a zero means that the user has not rated the movie.

In [None]:
sparsity = 1.0 - count_nonzero(A) /A.size
print(sparsity)

0.9619391144037263


##### A sparse matrix in CSR (compressed sparse row) format can be created with the csr_matrix() function, either from a dense NumPy array or, as done here, directly from (value, (row, column)) triplets.

In [None]:
from scipy import sparse

In [None]:
train = sparse.csr_matrix((df_norm.rating, (df_norm.movie_id,df_norm.user_id)))

#### **3.4 - Compute similar users**

##### A similarity matrix measures how alike two user profiles (or two movies) are, and is the basis for generating recommendations. To remove movie and user bias, ratings need to be rescaled by subtracting the corresponding average.

In [None]:
similarity = cosine_similarity(train.T, dense_output = False)

In [None]:
print(similarity)
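The matrix built in this section stores the raw `rating` values, while the similarity text above refers to rescaled ratings. The sketch below, on two hypothetical users with opposite tastes, illustrates how much the choice matters for cosine similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical users x movies ratings; both users rated all three movies.
raw = np.array([
    [5.0, 4.0, 1.0],   # user A
    [1.0, 2.0, 5.0],   # user B, opposite taste
])

# Subtract each user's own mean rating (row-wise centring).
centered = raw - raw.mean(axis=1, keepdims=True)

sim_raw = cosine_similarity(raw)[0, 1]
sim_centered = cosine_similarity(centered)[0, 1]

print(round(sim_raw, 3), round(sim_centered, 3))
```

With raw ratings every entry is positive, so the cosine is positive (about 0.51 here) even for opposite tastes. After centring, the same two users get a similarity of -1. Using `rating_avgr_user` instead of `rating` when building `train` would be one way to apply this in the notebook.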

In [None]:
# Reference rating parameters
avg_movie = pd.DataFrame(df_norm['rating'].groupby(df_norm['user_id']).mean())
avg_movie.reset_index(inplace=True)

# Reference users
userx=df_norm['user_id'].unique()

In [None]:
userx

array([   1,   32,  107, ..., 3943, 3851, 3933])

In [None]:
#userx = [3943, 3851, 3933]

#### **3.5 - Compute top similar users**

In [None]:
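def sim_knearb(user,k,similarity):
  # Non-zero entries of the similarity matrix as (row, column, value) triplets
  y = pd.DataFrame(np.column_stack(sparse.find(similarity)),columns=['similar','base','w'])
  # Drop each user's similarity with themselves
  z = y[y['base'] != y['similar']]

  # Keep the k most similar users for the requested user
  return z[z['base'] == user].sort_values(by='w',ascending=False).head(k)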

In [None]:
def avg_knearb(userx,k,similarity):

  # auxiliary variables
  j = 0
  aux  = {'similar': [0.0], 'base': [0], 'w':[0.0]}
  base = pd.DataFrame(aux, columns = ['similar','base','w'])

  # Iterate over the users passed as argument (previously read from a global)
  for i in userx:

    top = sim_knearb(i , k , similarity)
    base = pd.concat([base, top])

    j += 1
    print('Iteration #:', j)

  # Rank the neighbours of each base user by descending similarity
  base['sub_group_rank'] = base.groupby('base')['w'].rank(ascending=False)

  return base

In [None]:
base = avg_knearb(userx,10,similarity)

In [None]:
base.head()

Unnamed: 0,similar,base,w,sub_group_rank
0,0.0,0.0,0.0,1.0
1669,1988.0,1.0,0.65691,1.0
2340,2743.0,1.0,0.550155,2.0
3184,3722.0,1.0,0.549234,3.0
2420,2834.0,1.0,0.511894,4.0


In [None]:
base.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/pto_checagem_user.csv',index=False)

#### **3.6 - Predictions**

In [None]:
base = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/pto_checagem_user.csv')

In [None]:
base.head()

Unnamed: 0,similar,base,w,sub_group_rank
0,0.0,0.0,0.0,1.0
1,1988.0,1.0,0.65691,1.0
2,2743.0,1.0,0.550155,2.0
3,3722.0,1.0,0.549234,3.0
4,2834.0,1.0,0.511894,4.0


In [None]:
base.count()

similar           39521
base              39521
w                 39521
sub_group_rank    39521
dtype: int64

In [None]:
base = base[base['sub_group_rank'] <= 6]

In [None]:
df_norm.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,rating_avgr_movie,avg_user,rating_avgr_user
0,1,1160,5,974769817,3.937198,1.062802,3.769231,1.230769
1,1,1129,3,974769817,3.99332,-0.99332,3.769231,-0.769231
2,1,3328,4,974769817,3.662202,0.337798,3.769231,0.230769
3,1,2659,2,974769817,3.688333,-1.688333,3.769231,-1.769231
4,1,980,3,974769817,3.927287,-0.927287,3.769231,-0.769231


In [None]:
x_train, x_test = train_test_split(df_norm, test_size=0.3, random_state=0)

In [None]:
x_train.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,rating_avgr_movie,avg_user,rating_avgr_user
71608,227,34,5,974738104,3.884694,1.115306,3.357553,1.642447
276128,2979,171,2,966267076,3.577406,-1.577406,3.503378,-1.503378
270463,2920,967,4,965277713,4.065617,-0.065617,3.645,0.355
263232,2784,1866,3,965340485,3.051471,-0.051471,3.320099,-0.320099
234148,2457,2007,4,965941989,3.571429,0.428571,3.951027,0.048973


In [None]:
avg_movie = pd.DataFrame(df_norm['rating'].groupby(df_norm['user_id']).mean())
avg_movie.reset_index(inplace=True)

In [None]:
avg_movie.head()

Unnamed: 0,user_id,rating
0,1,3.769231
1,2,3.428571
2,3,3.818182
3,4,4.375
4,5,3.637931


In [None]:
#x_test_min = x_test[['user_id','movie_id','rating']]

x_test_min = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/test_data.csv')

In [None]:
x_test_min.head()

Unnamed: 0,id,user_id,movie_id,timestamp
0,0,5,2962,974769784
1,1,5,3177,974769768
2,2,5,3153,974769768
3,3,5,501,974769768
4,4,5,3159,974769768


In [None]:
pred_movie_avg = pd.merge(x_test_min,avg_movie, on="user_id",how="left").fillna(3.603814223642363)

In [None]:
pred_movie_avg.head()


Unnamed: 0,id,user_id,movie_id,timestamp,rating
0,0,5,2962,974769784,3.637931
1,1,5,3177,974769768,3.637931
2,2,5,3153,974769768,3.637931
3,3,5,501,974769768,3.637931
4,4,5,3159,974769768,3.637931


In [None]:
pred_movie_avg2 = pd.merge(pred_movie_avg,base, how='left',left_on=['user_id'],right_on=['base'])

In [None]:
pred_movie_avg2.head()

Unnamed: 0,id,user_id,movie_id,timestamp,rating,similar,base,w,sub_group_rank
0,0,5,2962,974769784,3.637931,104.0,5.0,0.51903,1.0
1,0,5,2962,974769784,3.637931,725.0,5.0,0.497589,2.0
2,0,5,2962,974769784,3.637931,112.0,5.0,0.478498,3.0
3,0,5,2962,974769784,3.637931,611.0,5.0,0.460403,4.0
4,0,5,2962,974769784,3.637931,360.0,5.0,0.439991,5.0


In [None]:
df_norm[['rating_avgr_movie','user_id','movie_id']].dtypes

rating_avgr_movie    float64
user_id                int64
movie_id               int64
dtype: object

In [None]:
pred_movie_avg2['similar'] = pred_movie_avg2['similar'].fillna(0).astype(int)
pred_movie_avg2.dtypes

id                  int64
user_id             int64
movie_id            int64
timestamp           int64
rating            float64
similar             int64
base              float64
w                 float64
sub_group_rank    float64
dtype: object

In [None]:
pred_movie_avg3 = pd.merge(pred_movie_avg2,df_norm[['rating_avgr_user','user_id','movie_id']],\
                           how='left',left_on=['movie_id','similar'],right_on=['movie_id','user_id']).fillna(0)

In [None]:
pred_movie_avg3.head()

Unnamed: 0,id,user_id_x,movie_id,timestamp,rating,similar,base,w,sub_group_rank,rating_avgr_user,user_id_y
0,0,5,2962,974769784,3.637931,104,5.0,0.51903,1.0,0.0,0.0
1,0,5,2962,974769784,3.637931,725,5.0,0.497589,2.0,0.0,0.0
2,0,5,2962,974769784,3.637931,112,5.0,0.478498,3.0,-2.111111,112.0
3,0,5,2962,974769784,3.637931,611,5.0,0.460403,4.0,0.0,0.0
4,0,5,2962,974769784,3.637931,360,5.0,0.439991,5.0,0.0,0.0


In [None]:
pred_movie_avg3['avg_pond'] = pred_movie_avg3['w']* pred_movie_avg3['rating_avgr_user']

In [None]:
pred_movie_avg3.head()

Unnamed: 0,id,user_id_x,movie_id,timestamp,rating,similar,base,w,sub_group_rank,rating_avgr_user,user_id_y,avg_pond
0,0,5,2962,974769784,3.637931,104,5.0,0.51903,1.0,0.0,0.0,0.0
1,0,5,2962,974769784,3.637931,725,5.0,0.497589,2.0,0.0,0.0,0.0
2,0,5,2962,974769784,3.637931,112,5.0,0.478498,3.0,-2.111111,112.0,-1.010162
3,0,5,2962,974769784,3.637931,611,5.0,0.460403,4.0,0.0,0.0,0.0
4,0,5,2962,974769784,3.637931,360,5.0,0.439991,5.0,0.0,0.0,0.0


In [None]:
pred_movie_avg3.count()

id                  23250
user_id_x           23250
movie_id            23250
timestamp           23250
rating              23250
similar             23250
base                23250
w                   23250
sub_group_rank      23250
rating_avgr_user    23250
user_id_y           23250
avg_pond            23250
dtype: int64

In [None]:
pred_movie_avg3.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/escoragem_v4.csv',index=False)

## **4 - Baseline**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/train_data.csv',sep=',')
df.head()

x_train, x_test = train_test_split(df, test_size=0.3, random_state=0)

In [None]:
x_train.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
71608,697,2348,1,974754467
276128,2214,1351,4,967388817
270463,2180,1134,1,967022501
263232,2143,742,3,967234398
234148,1953,1708,4,967919937


In [None]:
global_mean = x_train['rating'].mean()
global_mean

3.6036080715001813

In [None]:
def sub_mean(df):

  ## Global mean
  global_mean = df['rating'].mean()

  ## Movie bias: average deviation of each movie's ratings from the global mean
  avg_movie = pd.DataFrame((df['rating']-global_mean).groupby(df['movie_id']).mean())
  avg_movie = avg_movie.rename(columns = {'rating': 'avg_movie'})
  avg_movie.reset_index(inplace=True)
  
  return global_mean, avg_movie

def sub_mean2(df,global_mean):
  ## User bias: average residual per user after removing the global mean and the movie bias
  avg_user= pd.DataFrame((df['rating'] - global_mean- df['avg_movie']).groupby(df['user_id']).mean())
  avg_user = avg_user.rename(columns = {0: 'avg_user'})
  avg_user.reset_index(inplace=True)

  return avg_user

In [None]:
global_mean, avg_movie = sub_mean(x_train)

In [None]:
x_train1 = pd.merge(x_train, avg_movie, on="movie_id")

In [None]:
x_train1.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie
0,697,2348,1,974754467,-1.237629
1,1801,2348,1,968814334,-1.237629
2,2402,2348,3,966068330,-1.237629
3,2921,2348,2,965277359,-1.237629
4,648,2348,2,974677635,-1.237629


In [None]:
avg_user = sub_mean2(x_train1,global_mean)

In [None]:
avg_user.head()

Unnamed: 0,user_id,avg_user
0,1,-0.149574
1,2,-0.182651
2,3,0.237893
3,4,0.346358
4,5,-0.00886


In [None]:
x_train2 = pd.merge(x_train1, avg_user, on="user_id")

In [None]:
x_train2['predict'] = global_mean + x_train2['avg_user'] + x_train2['avg_movie']
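
The prediction above is the sum of three terms: the global mean, a per-movie offset and a per-user offset. As a minimal sketch of the same decomposition on a toy table (the five ratings below are made up for illustration; only the column names follow the notebook), the steps could look like this:

```python
import pandas as pd

# Toy ratings; values are invented, column names mirror the notebook.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [5, 3, 4, 2, 1],
})

# Global mean of all ratings.
mu = ratings["rating"].mean()

# Movie bias: mean of (rating - mu) per movie.
b_i = (ratings["rating"] - mu).groupby(ratings["movie_id"]).mean().rename("avg_movie")
ratings = ratings.join(b_i, on="movie_id")

# User bias: mean of (rating - mu - movie bias) per user.
b_u = (ratings["rating"] - mu - ratings["avg_movie"]).groupby(ratings["user_id"]).mean().rename("avg_user")
ratings = ratings.join(b_u, on="user_id")

# Baseline prediction: global mean + movie bias + user bias.
ratings["predict"] = mu + ratings["avg_movie"] + ratings["avg_user"]
print(ratings)
```

Because the user bias is estimated on the residual left after the movie bias, the order of the two steps matters: swapping them would generally give different offsets.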

In [None]:
x_train2.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,avg_user,predict
0,697,2348,1,974754467,-1.237629,-0.087227,2.278753
1,697,2924,4,974756147,0.102836,-0.087227,3.619217
2,697,2286,5,974755277,0.742121,-0.087227,4.258502
3,697,1690,2,974751475,-0.436941,-0.087227,3.07944
4,697,562,5,974755555,0.30397,-0.087227,3.820352


In [None]:
((x_train2['rating'] - x_train2['predict']) ** 2).mean() ** .5

0.8892404516947805
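
The expression above is the root mean squared error (RMSE) written inline. A small helper, sketched here with NumPy and made-up numbers (the function name `rmse` is an illustrative choice, not something defined elsewhere in the notebook), makes the same computation reusable:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented values: errors are -1, 0 and 2, so the mean squared error is 5/3.
example = rmse([1, 4, 5], [2.0, 4.0, 3.0])
print(example)
```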

In [None]:
global_mean

3.6036080715001813

In [None]:
# Estimates
#global_mean = 3.6036080715001813
avg_user.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/baseline_avg_user.csv',index=False)
avg_movie.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/baseline_avg_movie.csv',index=False)

#### **4.1 - Predictions**

In [None]:
avg_user['avg_user'].mean()

0.025793103889861925

In [None]:
df_test1 = pd.merge(x_test, avg_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())
df_test2 = pd.merge(df_test1, avg_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_test2['predict'] = global_mean + df_test2['avg_user'] + df_test2['avg_movie']
df_test2.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,avg_user,predict
0,3142,2842,3,964983390,0.172511,-0.31159,3.464529
1,316,3531,2,974710002,-1.300578,0.442251,2.745282
2,1425,1675,5,972958612,0.648728,0.291974,4.54431
3,3022,2027,5,965165835,0.156857,0.3441,4.104565
4,2218,1775,5,966694023,0.73741,-0.268923,4.072095


In [None]:
((df_test2['rating'] - df_test2['predict']) ** 2).mean() ** .5

0.9088475951084722
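
Movies or users that appear only in the held-out rows get no bias from the left merges, and the notebook fills those gaps with the mean bias. Calling `fillna` with a single value fills every column that has missing entries, so a variant that targets each bias column separately may be easier to reason about. The sketch below uses invented frames and an invented global mean; only the column names follow the notebook:

```python
import pandas as pd

# Invented data: user 9 and movie 99 were never seen in training.
test = pd.DataFrame({"user_id": [1, 9], "movie_id": [10, 99]})
avg_movie = pd.DataFrame({"movie_id": [10, 20], "avg_movie": [0.5, -0.3]})
avg_user = pd.DataFrame({"user_id": [1, 2], "avg_user": [0.2, -0.4]})
global_mean = 3.6  # illustrative value, not the notebook's estimate

out = (test.merge(avg_movie, on="movie_id", how="left")
           .merge(avg_user, on="user_id", how="left"))

# Fill each bias column with its own fallback and leave other columns untouched.
out = out.fillna({
    "avg_movie": avg_movie["avg_movie"].mean(),
    "avg_user": avg_user["avg_user"].mean(),
})
out["predict"] = global_mean + out["avg_movie"] + out["avg_user"]
print(out)
```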

In [None]:
df_valid = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/test_data.csv',sep=',')
df_valid.count()

id           3970
user_id      3970
movie_id     3970
timestamp    3970
dtype: int64

In [None]:
df_valid1 = pd.merge(df_valid, avg_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())
df_valid2 = pd.merge(df_valid1, avg_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_valid2['predict'] = global_mean + df_valid2['avg_user'] + df_valid2['avg_movie']
df_valid2.head()

Unnamed: 0,id,user_id,movie_id,timestamp,avg_movie,avg_user,predict
0,0,5,2962,974769784,-0.152741,-0.00886,3.442007
1,1,5,3177,974769768,-0.298675,-0.00886,3.296073
2,2,5,3153,974769768,-0.736941,-0.00886,2.857807
3,3,5,501,974769768,0.018014,-0.00886,3.612761
4,4,5,3159,974769768,-0.567894,-0.00886,3.026854


In [None]:
df_valid2[['id','predict']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/baseline_v2.csv',index=False)

## **5 - Gradient Boosting**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/train_data.csv',sep=',')
df.head()

x_train, x_test = train_test_split(df, test_size=0.3, random_state=0)

In [None]:
df_movie = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/movies_data.csv',sep=',')
df_movie.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
df_user = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/users_data.csv',sep=',')
df_user.head()

Unnamed: 0,user_id,gender,age,occupation,zip_code
0,1,M,35,17,49508
1,2,M,35,1,10918
2,3,M,25,20,14895
3,4,F,25,0,97401
4,5,M,35,12,75069


#### **5.1 - Feature Creation**

In [None]:
x_train1 = pd.merge(x_train, df_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())

In [None]:
x_train1.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/expxloratoria1.csv',index=False)

In [None]:
x_train2 = pd.merge(x_train, df_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())

In [None]:
x_train2.to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/expxloratoria2.csv',index=False)

#### **5.2 - Data with features**

In [None]:
df_movie = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/movies_data2.csv',sep=';')
df_movie.head()

Unnamed: 0,movie_id,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy
0,1,1995,3.898734,0.707834,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
1,2,1995,3.32315,0.458927,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1
2,3,1995,3.557447,0.554742,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,1995,3.73847,0.63207,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,5,1995,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
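
The feature table above has a release year (`ano`) and one 0/1 column per genre, while the raw `movies_data.csv` only has `title` and a pipe-separated `genres` string. The file `movies_data2.csv` was prepared outside this notebook, so the following is only a sketch of one way such columns could be derived with pandas, not a record of how the file was actually built:

```python
import pandas as pd

# Two rows copied from the raw movies table shown earlier in the notebook.
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Release year: the four digits inside the trailing parentheses of the title.
movies["ano"] = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)

# One indicator column per genre, split on the pipe separator.
genre_flags = movies["genres"].str.get_dummies(sep="|")

features = pd.concat([movies[["movie_id", "ano"]], genre_flags], axis=1)
print(features)
```

The `genres_ratings` and `genres_score` columns are not reproduced here because their definition is not given in the notebook.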


In [None]:
df_user = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/users_data2.csv',sep=';')
df_user.head()

Unnamed: 0,user_id,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2
0,1,35,0,0,0,1,0,0,0,1,0.457415
1,2,35,0,0,0,0,1,0,0,1,0.596626
2,3,25,0,0,0,0,0,0,0,1,0.594607
3,4,25,0,0,0,0,0,1,0,0,0.599375
4,5,35,0,1,0,0,0,0,0,1,0.609027


In [None]:
# Estimates
global_mean = 3.6036080715001813
avg_user = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/baseline_avg_user.csv')
avg_movie = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/baseline_avg_movie.csv')

In [None]:
avg_user.head()

Unnamed: 0,user_id,avg_user
0,1,-0.149574
1,2,-0.182651
2,3,0.237893
3,4,0.346358
4,5,-0.00886


In [None]:
avg_movie.head()

Unnamed: 0,movie_id,avg_movie
0,1,0.514974
1,2,-0.382264
2,3,-0.544436
3,4,-0.849222
4,5,-0.480013


In [None]:
df_1 = pd.merge(df, avg_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())
df_2 = pd.merge(df_1, avg_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_2['baseline'] = global_mean + df_2['avg_user'] + df_2['avg_movie']
df_3 = pd.merge(df_2, df_movie, on="movie_id",how='left').fillna(0)
df_4 = pd.merge(df_3, df_user, on="user_id",how='left').fillna(0)
#df_4.set_index(['user_id','movie_id'],inplace=True)

df_4.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2
0,1,1160,5,974769817,0.268732,-0.149574,3.722766,1956,3.638651,0.590337,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415
1,1,1129,3,974769817,0.394473,-0.149574,3.848506,1985,3.560698,0.576774,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415
2,1,3328,4,974769817,0.068232,-0.149574,3.522266,1979,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415
3,1,2659,2,974769817,0.097541,-0.149574,3.551575,1981,3.692946,0.607884,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415
4,1,980,3,974769817,0.284284,-0.149574,3.738318,1982,3.887892,0.702915,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415


#### **5.3 - Dataset creation**

##### **5.3.1 - train**

In [None]:
df_train1 = pd.merge(x_train, avg_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())
df_train2 = pd.merge(df_train1, avg_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_train2['baseline'] = global_mean + df_train2['avg_user'] + df_train2['avg_movie']
df_train3 = pd.merge(df_train2, df_movie, on="movie_id",how='left').fillna(0)
df_train4 = pd.merge(df_train3, df_user, on="user_id",how='left').fillna(0)
df_train4.set_index(['user_id','movie_id'],inplace=True)
df_train4.head()

##### **5.3.2 - test**

In [None]:
df_test1 = pd.merge(x_test, avg_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())
df_test2 = pd.merge(df_test1, avg_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_test2['baseline'] = global_mean + df_test2['avg_user'] + df_test2['avg_movie']
df_test3 = pd.merge(df_test2, df_movie, on="movie_id",how='left').fillna(0)
df_test4 = pd.merge(df_test3, df_user, on="user_id",how='left').fillna(0)
df_test4.set_index(['user_id','movie_id'],inplace=True)
df_test4.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2
user_id,movie_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1
3142,2842,3,964983390,0.172511,-0.31159,3.464529,1999,3.735943,0.619573,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,25,0,0,0,0,0,0,0,1,0.528647
316,3531,2,974710002,-1.300578,0.442251,2.745282,1992,3.068596,0.404398,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,18,0,0,0,0,1,0,0,1,0.567772
1425,1675,5,972958612,0.648728,0.291974,4.54431,1930,3.92402,0.715686,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,56,0,0,0,0,1,0,0,1,0.570423
3022,2027,5,965165835,0.156857,0.3441,4.104565,1992,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,45,0,0,1,0,0,0,0,1,0.609027
2218,1775,5,966694023,0.73741,-0.268923,4.072095,1998,4.061375,0.760638,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,18,0,0,0,0,0,1,0,1,0.576065


##### **5.3.3 - valid**

In [None]:
df_valid = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/test_data.csv',sep=',')
df_valid.head()

Unnamed: 0,id,user_id,movie_id,timestamp
0,0,5,2962,974769784
1,1,5,3177,974769768
2,2,5,3153,974769768
3,3,5,501,974769768
4,4,5,3159,974769768


In [None]:
df_valid1 = pd.merge(df_valid, avg_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())
df_valid2 = pd.merge(df_valid1, avg_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_valid2['baseline'] = global_mean + df_valid2['avg_user'] + df_valid2['avg_movie']
df_valid3 = pd.merge(df_valid2, df_movie, on="movie_id",how='left').fillna(0)
df_valid4 = pd.merge(df_valid3, df_user, on="user_id",how='left').fillna(0)
df_valid4.set_index(['id','user_id','movie_id'],inplace=True)
df_valid4.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2
id,user_id,movie_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1
0,5,2962,974769784,-0.152741,-0.00886,3.442007,2000,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
1,5,3177,974769768,-0.298675,-0.00886,3.296073,2000,3.273504,0.487179,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
2,5,3153,974769768,-0.736941,-0.00886,2.857807,1999,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
3,5,501,974769768,0.018014,-0.00886,3.612761,1993,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
4,5,3159,974769768,-0.567894,-0.00886,3.026854,2000,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027


##### **5.3.4 - Model fitting**

In [None]:
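# Data
# Full labelled data, used to fit the final model. user_id and movie_id are moved to the
# index so that the feature columns match those of train, test and valid.
df = df_4.set_index(['user_id','movie_id'])
del df['rating']
y = np.array(df_4['rating'].copy())

train = df_train4.copy()
del train['rating']
test = df_test4.copy()
del test['rating']
valid = df_valid4.copy()
y_train = np.array(df_train4['rating'].copy())
y_test  = np.array(df_test4['rating'].copy())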

In [None]:
from xgboost import XGBRegressor

model = XGBRegressor(nthread=-1,n_estimators=1000,max_depth=25, 
                      colsample_bytree=0.8, 
                      learning_rate=0.1,
                      subsample=0.8,
                      gamma=1,
                      min_child_weight = 30,
                      reg_lambda = 22,
                      reg_alpha = 22,
                      objective='reg:squarederror',

                      seed=123)
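# Fit on the full labelled data (df, y). This includes the x_test rows, so the
# held-out RMSE computed below is optimistic.
model.fit(df, y)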

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, gamma=1,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=30, min_child_weight=5, missing=None, n_estimators=1000,
             n_jobs=1, nthread=-1, objective='reg:squarederror', random_state=0,
             reg_alpha=22, reg_lambda=22, scale_pos_weight=1, seed=123,
             silent=None, subsample=0.8, verbosity=1)

In [None]:
y_pred = model.predict(train) # Predictions
y_true = y_train # True values

In [None]:
from sklearn.metrics import mean_squared_error as mse
MSE = mse(y_true, y_pred)
RMSE = np.sqrt(MSE)
RMSE

0.6971275911816086

In [None]:
y_pred = model.predict(test) # Predictions
y_true = y_test # True values

MSE = mse(y_true, y_pred)
RMSE = np.sqrt(MSE)
RMSE

0.7047505720714365

In [None]:
valid['rating'] = model.predict(valid) # Predictions
valid.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2,rating
id,user_id,movie_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1
0,5,2962,974769784,-0.152741,-0.00886,3.442007,2000,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.527803
1,5,3177,974769768,-0.298675,-0.00886,3.296073,2000,3.273504,0.487179,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.975123
2,5,3153,974769768,-0.736941,-0.00886,2.857807,1999,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,2.785301
3,5,501,974769768,0.018014,-0.00886,3.612761,1993,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.502699
4,5,3159,974769768,-0.567894,-0.00886,3.026854,2000,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.514776


In [None]:
valid.reset_index(inplace=True)

In [None]:
valid[['id','rating']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/boost7.csv',index=False)

##### **5.3.5 - Fitting segmented models**

In [None]:
# Data: restrict each split to Fantasy movies

train = df_train4.loc[df_train4["Fantasy"] == 1].copy()
y_train = np.array(train['rating'].copy())
del train['rating']
test = df_test4.loc[df_test4["Fantasy"] == 1].copy()
y_test  = np.array(test['rating'].copy())
del test['rating']

valid = df_valid4.loc[df_valid4["Fantasy"] == 1].copy()


In [None]:
model = XGBRegressor(nthread=-1,n_estimators=300,max_depth=7, 
                      colsample_bytree=0.8, 
                      learning_rate=0.1,
                      subsample=0.8,
                      gamma=5,
                      min_child_weight = 30,
                      reg_lambda = 22,
                      reg_alpha = 22,
                      objective='reg:squarederror',

                      seed=123)
model.fit(train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, gamma=5,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=7, min_child_weight=30, missing=None, n_estimators=300,
             n_jobs=1, nthread=-1, objective='reg:squarederror', random_state=0,
             reg_alpha=22, reg_lambda=22, scale_pos_weight=1, seed=123,
             silent=None, subsample=0.8, verbosity=1)

In [None]:
y_pred = model.predict(train) # Predictions
y_true = y_train # True values

In [None]:
MSE = mse(y_true, y_pred)
RMSE = np.sqrt(MSE)
RMSE

0.8855421756179384

In [None]:
y_pred = model.predict(test) # Predictions
y_true = y_test # True values

MSE = mse(y_true, y_pred)
RMSE = np.sqrt(MSE)
RMSE

0.9094839052541972

In [None]:
valid['rating'] = model.predict(valid) # Predictions
valid.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2,rating
id,user_id,movie_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1
49,16,2334,974768764,-0.182992,0.082134,3.50275,1999,3.929251,0.673976,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,25,0,0,0,0,1,0,0,0,0.588821,3.363832
79,39,1889,974765618,-0.340834,-0.116032,3.146742,1985,3.521368,0.561254,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,25,0,1,0,0,0,0,0,0,0.588293,3.101764
120,93,1889,974769772,-0.340834,0.371767,3.634541,1985,3.521368,0.561254,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,35,0,0,0,0,0,0,0,1,0.570423,3.613564
121,93,1907,974769712,-0.023455,0.371767,3.95192,1984,3.32315,0.458927,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,35,0,0,0,0,0,0,0,1,0.570423,3.919712
122,93,2176,974769688,-0.389322,0.371767,3.586053,1986,3.369048,0.452381,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,35,0,0,0,0,0,0,0,1,0.570423,3.554451


In [None]:
valid.reset_index(inplace=True)

In [None]:
valid[['id','rating']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/boost_Fantasy.csv',index=False)
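
The segmented model only scores rows where `Fantasy == 1`, so its file covers a subset of the submission ids. The notebook does not show how the segment file is combined with a full submission. One possible approach, sketched below with invented frames standing in for the two CSV files, is to start from the global predictions and overwrite the ids that the segment model covers:

```python
import pandas as pd

# Hypothetical stand-ins for a full submission and a segment-only submission.
global_pred = pd.DataFrame({"id": [0, 1, 2, 3], "rating": [3.5, 4.0, 2.8, 3.1]})
fantasy_pred = pd.DataFrame({"id": [1, 3], "rating": [3.7, 2.9]})

combined = global_pred.set_index("id")
# DataFrame.update aligns on the index and overwrites in place with non-NaN values.
combined.update(fantasy_pred.set_index("id"))
combined = combined.reset_index()
print(combined)
```

Whether the override helps should be checked on held-out data, since the Fantasy model's test RMSE above is higher than the global model's.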

## **6 - Matrix Factorization**

#### **6.1 - Data**

In [13]:
df = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/train_data.csv',sep=',')
df.head()

x_train, x_test = train_test_split(df, test_size=0.3, random_state=0)

In [14]:
df_valid = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/test_data.csv',sep=',')
df_valid.head()

Unnamed: 0,id,user_id,movie_id,timestamp
0,0,5,2962,974769784
1,1,5,3177,974769768
2,2,5,3153,974769768
3,3,5,501,974769768
4,4,5,3159,974769768


In [None]:
df[['user_id','movie_id','rating']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/x_train.csv',index=False,sep='\t',header=None)
#x_test[['user_id','movie_id','rating']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/x_test.csv',index=False,sep='\t',header=None)
df_valid[['user_id','movie_id','id']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valiacao.csv',index=False,sep='\t',header=None)

In [4]:
# Metadata
df_movie = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/movies_data2.csv',sep=';')
df_movie.head()

Unnamed: 0,movie_id,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy
0,1,1995,3.898734,0.707834,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
1,2,1995,3.32315,0.458927,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1
2,3,1995,3.557447,0.554742,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,1995,3.73847,0.63207,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,5,1995,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
df_movie['fg'] = 1

In [6]:
df_movie.head()

Unnamed: 0,movie_id,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,fg
0,1,1995,3.898734,0.707834,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1
1,2,1995,3.32315,0.458927,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,1
2,3,1995,3.557447,0.554742,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,4,1995,3.73847,0.63207,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
4,5,1995,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [7]:
df_movie[['movie_id','fg']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/metadata.csv',index=False,sep='\t',header=False)

In [3]:
#!pip install -U git+git://github.com/caserec/CaseRecommender.git

Collecting git+git://github.com/caserec/CaseRecommender.git
  Cloning git://github.com/caserec/CaseRecommender.git to /tmp/pip-req-build-flez3n1r
  Running command git clone -q git://github.com/caserec/CaseRecommender.git /tmp/pip-req-build-flez3n1r
Building wheels for collected packages: CaseRecommender
  Building wheel for CaseRecommender (setup.py) ... done
  Created wheel for CaseRecommender: filename=CaseRecommender-1.1.0-py2.py3-none-any.whl size=102476 sha256=35898d83c7936848c84122bde8f48505cd4952c0ecad4b567ef019f1adfd0a98
  Stored in directory: /tmp/pip-ephem-wheel-cache-nl2tyuwo/wheels/ec/77/4d/eb41f89bb045567e0471af1099690c3886bbcdb045d80b75d0
Successfully built CaseRecommender
Installing collected packages: CaseRecommender
Successfully installed CaseRecommender-1.1.0


##### **6.2 - Model fitting - Matrix Factorization**

In [10]:
from caserec.recommenders.rating_prediction.matrixfactorization import MatrixFactorization

In [11]:
train = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/x_train.csv'
test = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valiacao.csv'
#output_file = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valiacaoXX.csv'
output_file = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valid_test.csv'


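# The third column of the test file holds the submission id, not a rating,
# so the Eval metrics printed below are not meaningful.
MatrixFactorization(train, test, output_file, factors=15).compute()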

[Case Recommender: Rating Prediction > Matrix Factorization]

train data:: 3952 users and 3562 items (535784 interactions) | sparsity:: 96.19%
test data:: 418 users and 1611 items (3970 interactions) | sparsity:: 99.41%

training_time:: 129.520665 sec
prediction_time:: 0.013535 sec


Eval:: MAE: 1980.863908 RMSE: 2288.496491 


In [17]:
df_pred = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valid_test.csv',sep='\t',header=None)
df_pred = df_pred.rename(columns = {0: 'user_id', 1: 'movie_id',2:'rating'}, inplace = False)
df_pred.head()

Unnamed: 0,user_id,movie_id,rating
0,5,2962,3.233787
1,5,3177,3.406981
2,5,3153,2.978621
3,5,501,3.526279
4,5,3159,3.196067


In [18]:
df_pred2 = pd.merge(df_valid, df_pred, on=['user_id','movie_id'])
df_pred2.head()

Unnamed: 0,id,user_id,movie_id,timestamp,rating
0,0,5,2962,974769784,3.233787
1,1,5,3177,974769768,3.406981
2,2,5,3153,974769768,2.978621
3,3,5,501,974769768,3.526279
4,4,5,3159,974769768,3.196067


In [19]:
df_pred2.count()

id           3970
user_id      3970
movie_id     3970
timestamp    3970
rating       3970
dtype: int64

In [21]:
df_pred2[['id','rating']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/naovaidaremnada.csv',index=False)
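
For readers unfamiliar with what a matrix factorization model learns, the sketch below fits user and item factor vectors by stochastic gradient descent on a handful of invented ratings. It is a generic textbook-style illustration, not Case Recommender's implementation; the function names, hyperparameters and data are all assumptions made for the example:

```python
import numpy as np

def fit_mf(triples, n_users, n_items, factors=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD on the regularized squared error."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, size=(n_users, factors))  # user factors
    Q = rng.normal(0.0, 0.1, size=(n_items, factors))  # item factors
    mu = float(np.mean([r for _, _, r in triples]))    # global mean
    for _ in range(epochs):
        for u, i, r in triples:
            err = r - (mu + P[u] @ Q[i])
            p_old = P[u].copy()  # use the pre-update user vector for the item step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, P, Q

def rmse(triples, mu, P, Q):
    """RMSE of the model on (user index, item index, rating) triples."""
    errs = [r - (mu + P[u] @ Q[i]) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

# Invented (user index, item index, rating) triples.
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 2), (2, 1, 1)]
mu, P, Q = fit_mf(triples, n_users=3, n_items=3)

baseline_rmse = rmse(triples, mu, np.zeros_like(P), np.zeros_like(Q))  # mean-only model
trained_rmse = rmse(triples, mu, P, Q)
print(baseline_rmse, trained_rmse)
```

The comparison here is on the training triples only, so it shows that the factors fit the data, not that they generalize.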

##### **6.3 - Model fitting - Matrix Factorization (using Stochastic Gradient Descent)**

In [8]:
from caserec.recommenders.rating_prediction.gsvdplusplus import GSVDPlusPlus

In [9]:
train = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/x_train.csv'
test = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valiacao.csv'
#output_file = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valiacaoXX.csv'
output_file = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valid_testx2.csv'
metadata = '/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/metadata.csv'

GSVDPlusPlus(train, test,output_file, metadata,learn_rate=0.1, stop_criteria=0.9).compute()

[Case Recommender: Rating Prediction > GSVDPlusPlus]

train data:: 3952 users and 3562 items (535784 interactions) | sparsity:: 96.19%
test data:: 418 users and 1611 items (3970 interactions) | sparsity:: 99.41%



  error_final += (eui ** 2.0)
  part_2_user = (np.multiply(eui, pi) - np.multiply(delta3, self.p[user]))
  part_2_item = (np.multiply(eui, pu) - np.multiply(delta4, self.q[item]))
  part_2 = (eui * self.n_g[item] * pu - delta5 * self.x[g])
  part_2 = (eui * self.n_u[user] * pi - delta6 * self.y[j])
  self.p[user] = self.p[user] + learn_rate2 * part_2_user
  self.x[g] = self.x[g] + learn_rate2 * part_2
  self.y[j] = self.y[j] + learn_rate2 * part_2


KeyboardInterrupt: ignored

##### **6.4 - Combination with BOOST**

In [None]:
df_1 = pd.merge(df, avg_movie, on="movie_id",how='left').fillna(avg_movie['avg_movie'].mean())
df_2 = pd.merge(df_1, avg_user, on="user_id",how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_2['baseline'] = global_mean + df_2['avg_user'] + df_2['avg_movie']
df_3 = pd.merge(df_2, df_movie, on="movie_id",how='left').fillna(0)
df_4 = pd.merge(df_3, df_user, on="user_id",how='left').fillna(0)
#df_4.set_index(['user_id','movie_id'],inplace=True)
df_4.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2
0,1,1160,5,974769817,0.268732,-0.149574,3.722766,1956,3.638651,0.590337,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415
1,1,1129,3,974769817,0.394473,-0.149574,3.848506,1985,3.560698,0.576774,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415
2,1,3328,4,974769817,0.068232,-0.149574,3.522266,1979,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415
3,1,2659,2,974769817,0.097541,-0.149574,3.551575,1981,3.692946,0.607884,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415
4,1,980,3,974769817,0.284284,-0.149574,3.738318,1982,3.887892,0.702915,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415


In [None]:
df_4.count()

user_id           535784
movie_id          535784
rating            535784
timestamp         535784
avg_movie         535784
avg_user          535784
baseline          535784
ano               535784
genres_ratings    535784
genres_score      535784
Comedy            535784
Action            535784
Crime             535784
Thriller          535784
Romance           535784
Adventure         535784
Horror            535784
Children's        535784
Drama             535784
Sci-Fi            535784
Musical           535784
Animation         535784
Documentary       535784
Western           535784
Mystery           535784
Film-Noir         535784
War               535784
Fantasy           535784
age               535784
occupationA       535784
occupationB       535784
occupationC       535784
occupationD       535784
occupationE       535784
occupationF       535784
occupationG       535784
sexo_M            535784
CEP2              535784
dtype: int64

In [None]:
df_pred = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valiacaoXX.csv',sep='\t',header=None)
df_pred = df_pred.rename(columns = {0: 'user_id', 1: 'movie_id',2:'pred'}, inplace = False)
df_pred.head()

Unnamed: 0,user_id,movie_id,pred
0,1,1160,4.160847
1,1,1129,3.905909
2,1,3328,3.93645
3,1,2659,3.837109
4,1,980,4.042356


In [None]:
df_pred.count()

user_id     535784
movie_id    535784
pred        535784
dtype: int64

In [None]:
df_pred2 = pd.merge(df_4, df_pred, on=['user_id','movie_id'])
df_pred2.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2,pred
0,1,1160,5,974769817,0.268732,-0.149574,3.722766,1956,3.638651,0.590337,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415,4.160847
1,1,1129,3,974769817,0.394473,-0.149574,3.848506,1985,3.560698,0.576774,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415,3.905909
2,1,3328,4,974769817,0.068232,-0.149574,3.522266,1979,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415,3.93645
3,1,2659,2,974769817,0.097541,-0.149574,3.551575,1981,3.692946,0.607884,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415,3.837109
4,1,980,3,974769817,0.284284,-0.149574,3.738318,1982,3.887892,0.702915,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415,4.042356


In [None]:
df_pred2.count()

user_id           535784
movie_id          535784
rating            535784
timestamp         535784
avg_movie         535784
avg_user          535784
baseline          535784
ano               535784
genres_ratings    535784
genres_score      535784
Comedy            535784
Action            535784
Crime             535784
Thriller          535784
Romance           535784
Adventure         535784
Horror            535784
Children's        535784
Drama             535784
Sci-Fi            535784
Musical           535784
Animation         535784
Documentary       535784
Western           535784
Mystery           535784
Film-Noir         535784
War               535784
Fantasy           535784
age               535784
occupationA       535784
occupationB       535784
occupationC       535784
occupationD       535784
occupationE       535784
occupationF       535784
occupationG       535784
sexo_M            535784
CEP2              535784
pred              535784
dtype: int64
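The `count()` output above confirms that the merge kept all 535784 rows. A minimal sketch of making that check explicit, assuming toy stand-ins (`features`, `preds`) for `df_4` and `df_pred`: `validate` and `indicator` are standard `pd.merge` options, and the left join is a variation on the inner join used in the notebook so that unmatched keys show up instead of being dropped silently.

```python
import pandas as pd

# Hypothetical toy stand-ins for df_4 (features) and df_pred (predictions).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
preds = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "pred": [4.1, 3.9, 3.7],
})

# validate="one_to_one" raises MergeError if a (user_id, movie_id) key repeats
# on either side; indicator=True adds a "_merge" column with each row's origin.
merged = pd.merge(
    features,
    preds,
    on=["user_id", "movie_id"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

unmatched = int((merged["_merge"] != "both").sum())
rows_preserved = len(merged) == len(features)
print(f"rows preserved: {rows_preserved}, unmatched rows: {unmatched}")
```

If `unmatched` is greater than zero, some feature rows have no prediction and would end up with `NaN` in `pred`.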

In [None]:
# Attach movie and user biases; movies/users missing from the lookup get the mean bias
df_valid1 = pd.merge(df_valid, avg_movie, on="movie_id", how='left').fillna(avg_movie['avg_movie'].mean())
df_valid2 = pd.merge(df_valid1, avg_user, on="user_id", how='left').fillna(avg_user['avg_user'].mean())
# Predictions
df_valid2['baseline'] = global_mean + df_valid2['avg_user'] + df_valid2['avg_movie']
# Attach movie and user features; missing features are filled with 0
df_valid3 = pd.merge(df_valid2, df_movie, on="movie_id", how='left').fillna(0)
df_valid4 = pd.merge(df_valid3, df_user, on="user_id", how='left').fillna(0)
#df_valid4.set_index(['id','user_id','movie_id'],inplace=True)
df_valid4.head()

Unnamed: 0,id,user_id,movie_id,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2
0,0,5,2962,974769784,-0.152741,-0.00886,3.442007,2000,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
1,1,5,3177,974769768,-0.298675,-0.00886,3.296073,2000,3.273504,0.487179,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
2,2,5,3153,974769768,-0.736941,-0.00886,2.857807,1999,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
3,3,5,501,974769768,0.018014,-0.00886,3.612761,1993,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
4,4,5,3159,974769768,-0.567894,-0.00886,3.026854,2000,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027
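The `baseline` column above is the global mean plus a user term and a movie term. A minimal sketch of that predictor on toy data, under the assumption (not confirmed by this section) that `avg_movie` and `avg_user` are plain mean deviations from the global mean. The frames `train` and `test` are hypothetical, and each bias column is filled separately with its own mean.

```python
import pandas as pd

# Hypothetical toy ratings.
train = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [5, 3, 4, 2, 4],
})
global_mean = train["rating"].mean()

# Assumption: a bias is the group's mean rating minus the global mean.
movie_bias = (
    (train.groupby("movie_id")["rating"].mean() - global_mean)
    .rename("avg_movie")
    .reset_index()
)
user_bias = (
    (train.groupby("user_id")["rating"].mean() - global_mean)
    .rename("avg_user")
    .reset_index()
)

# User 4 never appears in train, so its bias is missing after the left join.
test = pd.DataFrame({"user_id": [1, 4], "movie_id": [30, 10]})
test = test.merge(movie_bias, on="movie_id", how="left")
test = test.merge(user_bias, on="user_id", how="left")

# Fill each bias column with the mean of that bias only.
test["avg_movie"] = test["avg_movie"].fillna(movie_bias["avg_movie"].mean())
test["avg_user"] = test["avg_user"].fillna(user_bias["avg_user"].mean())

test["baseline"] = global_mean + test["avg_user"] + test["avg_movie"]
print(test)
```

Filling column by column keeps a missing user bias from being replaced by a value computed from another column, which a frame-wide `fillna` with a single scalar cannot guarantee once several columns contain `NaN`.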


In [None]:
df_valid4.count()

id                3970
user_id           3970
movie_id          3970
timestamp         3970
avg_movie         3970
avg_user          3970
baseline          3970
ano               3970
genres_ratings    3970
genres_score      3970
Comedy            3970
Action            3970
Crime             3970
Thriller          3970
Romance           3970
Adventure         3970
Horror            3970
Children's        3970
Drama             3970
Sci-Fi            3970
Musical           3970
Animation         3970
Documentary       3970
Western           3970
Mystery           3970
Film-Noir         3970
War               3970
Fantasy           3970
age               3970
occupationA       3970
occupationB       3970
occupationC       3970
occupationD       3970
occupationE       3970
occupationF       3970
occupationG       3970
sexo_M            3970
CEP2              3970
dtype: int64

In [None]:
df_predval = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/valid_test.csv', sep='\t', header=None)
df_predval = df_predval.rename(columns={0: 'user_id', 1: 'movie_id', 2: 'pred'})
df_predval.head()

Unnamed: 0,user_id,movie_id,pred
0,5,2962,3.216521
1,5,3177,3.259308
2,5,3153,3.024161
3,5,501,3.486193
4,5,3159,3.199285


In [None]:
df_predval.count()

user_id     3970
movie_id    3970
pred        3970
dtype: int64

In [None]:
# Inner join on (user_id, movie_id); note that this rebinds the name df_valid2
df_valid2 = pd.merge(df_valid4, df_predval, on=['user_id', 'movie_id'])
df_valid2.head()

Unnamed: 0,id,user_id,movie_id,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2,pred
0,0,5,2962,974769784,-0.152741,-0.00886,3.442007,2000,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.216521
1,1,5,3177,974769768,-0.298675,-0.00886,3.296073,2000,3.273504,0.487179,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.259308
2,2,5,3153,974769768,-0.736941,-0.00886,2.857807,1999,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.024161
3,3,5,501,974769768,0.018014,-0.00886,3.612761,1993,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.486193
4,4,5,3159,974769768,-0.567894,-0.00886,3.026854,2000,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.199285


In [None]:
df_valid2.count()

id                3970
user_id           3970
movie_id          3970
timestamp         3970
avg_movie         3970
avg_user          3970
baseline          3970
ano               3970
genres_ratings    3970
genres_score      3970
Comedy            3970
Action            3970
Crime             3970
Thriller          3970
Romance           3970
Adventure         3970
Horror            3970
Children's        3970
Drama             3970
Sci-Fi            3970
Musical           3970
Animation         3970
Documentary       3970
Western           3970
Mystery           3970
Film-Noir         3970
War               3970
Fantasy           3970
age               3970
occupationA       3970
occupationB       3970
occupationC       3970
occupationD       3970
occupationE       3970
occupationF       3970
occupationG       3970
sexo_M            3970
CEP2              3970
pred              3970
dtype: int64

In [None]:
# Data: training features (df), training target (y), and the rows to score (valid)

df = df_pred2.copy()
del df['rating']
y = np.array(df_pred2['rating'].copy())

valid = df_valid2.copy()


In [None]:
df.set_index(['user_id','movie_id','timestamp'],inplace=True)

In [None]:
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2,pred
user_id,movie_id,timestamp,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1
1,1160,974769817,0.268732,-0.149574,3.722766,1956,3.638651,0.590337,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415,4.160847
1,1129,974769817,0.394473,-0.149574,3.848506,1985,3.560698,0.576774,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415,3.905909
1,3328,974769817,0.068232,-0.149574,3.522266,1979,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,0,0,1,0,0,0,1,0.457415,3.93645
1,2659,974769817,0.097541,-0.149574,3.551575,1981,3.692946,0.607884,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415,3.837109
1,980,974769817,0.284284,-0.149574,3.738318,1982,3.887892,0.702915,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,35,0,0,0,1,0,0,0,1,0.457415,4.042356


In [None]:
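from xgboost import XGBRegressor

# Gradient-boosted trees on the engineered features plus the 'pred' column
# (the prediction read from valiacaoXX.csv), with strong regularization.
model = XGBRegressor(nthread=-1, n_estimators=300, max_depth=15,
                     colsample_bytree=0.8,   # fraction of features sampled per tree
                     learning_rate=0.1,
                     subsample=0.8,          # fraction of rows sampled per tree
                     gamma=1,                # minimum loss reduction required to split
                     min_child_weight=30,
                     reg_lambda=22,          # L2 regularization
                     reg_alpha=22,           # L1 regularization
                     objective='reg:squarederror',
                     seed=123)
model.fit(df, y)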

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, gamma=1,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=15, min_child_weight=30, missing=None, n_estimators=300,
             n_jobs=1, nthread=-1, objective='reg:squarederror', random_state=0,
             reg_alpha=22, reg_lambda=22, scale_pos_weight=1, seed=123,
             silent=None, subsample=0.8, verbosity=1)
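The model above is fitted on every training row, so no error estimate comes out of this cell. A minimal sketch of a hold-out RMSE estimate, using NumPy only: the synthetic data, the 80/20 split, and the least-squares stand-in model are all assumptions made for illustration, and any regressor with `fit`/`predict` (such as the `XGBRegressor` above) could take the stand-in's place.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic features and target (hypothetical, for illustration only).
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 3))
y = 3.5 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200)

# Shuffle the row indices and hold out the last 20% for evaluation.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, hold_idx = idx[:cut], idx[cut:]

# Stand-in model: ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
A_hold = np.column_stack([np.ones(len(hold_idx)), X[hold_idx]])

holdout_rmse = rmse(y[hold_idx], A_hold @ coef)
# Reference point: always predicting the training mean.
mean_rmse = rmse(y[hold_idx], np.full(len(hold_idx), y[train_idx].mean()))
print(f"hold-out RMSE: {holdout_rmse:.4f}  (mean predictor: {mean_rmse:.4f})")
```

Once a configuration looks good on the held-out rows, the model can be refitted on all rows before scoring the submission file.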

In [None]:
valid.head()

Unnamed: 0,id,user_id,movie_id,timestamp,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2,pred
0,0,5,2962,974769784,-0.152741,-0.00886,3.442007,2000,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.216521
1,1,5,3177,974769768,-0.298675,-0.00886,3.296073,2000,3.273504,0.487179,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.259308
2,2,5,3153,974769768,-0.736941,-0.00886,2.857807,1999,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.024161
3,3,5,501,974769768,0.018014,-0.00886,3.612761,1993,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.486193
4,4,5,3159,974769768,-0.567894,-0.00886,3.026854,2000,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.199285


In [None]:
valid.set_index(['id','user_id','movie_id','timestamp'],inplace=True)

valid['rating'] = model.predict(valid) # Predictions
valid.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,avg_movie,avg_user,baseline,ano,genres_ratings,genres_score,Comedy,Action,Crime,Thriller,Romance,Adventure,Horror,Children's,Drama,Sci-Fi,Musical,Animation,Documentary,Western,Mystery,Film-Noir,War,Fantasy,age,occupationA,occupationB,occupationC,occupationD,occupationE,occupationF,occupationG,sexo_M,CEP2,pred,rating
id,user_id,movie_id,timestamp,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1
0,5,2962,974769784,-0.152741,-0.00886,3.442007,2000,3.244162,0.455037,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.216521,3.236366
1,5,3177,974769768,-0.298675,-0.00886,3.296073,2000,3.273504,0.487179,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.259308,3.388315
2,5,3153,974769768,-0.736941,-0.00886,2.857807,1999,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.024161,2.978996
3,5,501,974769768,0.018014,-0.00886,3.612761,1993,3.792322,0.653395,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.486193,3.33595
4,5,3159,974769768,-0.567894,-0.00886,3.026854,2000,3.493419,0.541901,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,0,1,0,0,0,0,0,1,0.609027,3.199285,3.267828


In [None]:
valid.reset_index(inplace=True)

In [None]:
valid[['id','rating']].to_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/boost_p_matrixf15_v2.csv',index=False)

## **9 - Movie Reviews**

In [None]:
pd.set_option('display.max_colwidth', None)  # show full review text; -1 is deprecated
df_review = pd.read_csv('/content/drive/MyDrive/Doutorado/disciplinas/recom_sys/scc5966/movie_reviews.csv', sep=',')
df_review.head()



Unnamed: 0,movie_id,text
0,1,"Andy's toys live a reasonable life of fun and peace, their only worries are birthdays and Christmases, when new toys could easily replace those already there. One such birthday Andy's top toy, Woody the cowboy, finds himself in direct competition with Andy's new Buzz Lightyear doll. When rivalries boil over Woody tries to hide Buzz down the side of the bed but accidentally pushes him out the window, the other tops expel Woody, and he leaves with no choice but to find Buzz and return him to the house. But with only two days before Andy moves house, time is of the essence. Given how often the same mix of animation, wit, jokes and kids humour has been used since Toy Story (Ice Age, Monsters Inc, Bugs Life) it is easy to forget how refreshing it was when it first came out. I have just watched it again and it is dating a little in comparison to more recent twists on the formula. It seems each one has to be sharper and have more references etc in the background. However it is still very funny and deserves praise for being the first of a successful formula. The plot is simple but effective and actually has genuine drama and excitement to it. The main story is fun but the degree of character development is what really shores it up. The conflict between Buzz and Woody is taken deeper than this and, when confronted by the truth of his status as a toy, Buzz's turmoil is very real as opposed to him being a cartoon character and nothing more. Despite the two strong leads there is a real depth in the support cast. They may not actually have that many lines, but they have all the funniest lines. Most of the `adult' wit comes from the Potato Head, dinosaur, the pig and slinky dog. They are funny and are very well used. In fact the majority of this humour and plot will go right over kids heads. Looking back on it, I do feel a cynical edge on it in so much as this film must really have helped sales of the toy companies in the film. 
It's hard not to see the marketing department standing behind this film rubbing their hands. However the actual product is so wonderfully fun that I forgot this quickly. The voice work is excellent and the characters match the actors. Hanks is good as Woody and Allen has a good B-movie type voice for Buzz. Varney, Ratzenberger, Ermey (doing his usual), Rickles and others are all really good in the support roles and, probably, come out as the favourite characters for adults. Overall this is a classic film that will appeal to adults as much as to kids (if not more). A good plot and a really sharp script make the already short running time fly by. The only downside is that your kids will want you to go out and buy the damn things!"
1,1,"I am a big fan of the animated movies coming from the Pixar Studios. They are always looking for the newest technological possibilities to use in their movies, creating movies that are more than just worth a watch, even when they were made a decade ago. The movie is about toys that come to life when their owner is asleep or not in the same room. When the young boy's birthday is coming up, all the toys are nervous. They don't want to be ignored when the new one arrives. Woody the cowboy is their ""leader"" because he's the most popular one of them all. He's the only one that hasn't got to be afraid, but than a new favorite arrives ... Buzz Lightyear. He hates him and tries everything possible to get rid of him, but as the time passes by they learn to appreciate each other... When you see Toy Story, you may think that the different human like characters (Woody the cowboy for instance) aren't always as perfect as we are used to see in todays animated movies. Perhaps that's true, but if you keep in mind that all this was done in 1995, when computers weren't yet as strong and the technology for creating such movies was almost unknown, than you can only have a lot of respect for what the creators did. I loved the story and liked the animations a lot. I give it an 8.5/10."
2,1,"This is a very clever animated story that was a big hit, and justifiably so. It had a terrific sequel and if a third film came out, that would probably be a hit, too. When this came out, computer technology just was beginning to strut its stuff. Man, this looked awesome. Now, it's routine because animation, which took a giant leap with this movie, has made a lot more giant strides. The humor in here, however, is what made this so popular. There are tons of funny lines, issued by characters voiced by Tom Hanks, Tim Allen, Jim Varney, Don Rickles, Wallace Shawn and John Ratzenberger, among others. As good as Hanks is as ""Woody"" and Allen as ""Buzz Armstrong,"" I think the supporting characters just about stole the show: Mr. Potato Head, Slinky, Rex the dinosaur, etc. Multiple viewings don't diminish the entertainment, either. There are so many things to catch, audibly and visually, that you always seem to discover something new. The colors in here are beautiful, too. This is a guaranteed ""winner"" as is the sequel."
3,1,"Toy Story – 5/5 stars Children play with toys. It is a known fact. At one time or another, we all played with toys, whether they were action figures, dolls, little green soldiers, etc But what if toys were real? What if they could talk? Pixar and Disney serve us this theory in what was the first full-length computer-animated film ever, 'Toy Story,' chronicling the events in the life of a cowboy doll, Woody (voiced by Tom Hanks). Woody is the favorite toy of his owner, a small child named Andy. Andy brings Woody everywhere, and cherishes him, as we see in the beginning of the film. However, this all changes on Andy's birthday when Andy gets a new toy: a Buzz Lightyear doll (voiced by Tim Allen). Woody is suddenly forgotten, left with the rest of his friends: Mr. Potato Head (Don Rickles), Rex (Wallace Shawn), Slinky Dog (Jim Varney, better known as Ernest) and Ham (see if you can guess the voice of this one? I'll give you a hint: 'Cheers'). But after Buzz accidentally gets knocked out an upstairs window, Woody is the prime suspect. Now, after Woody and Buzz end up next door, in toy killer Sid's house, Woody must prove his innocence by getting both Buzz and him back to Andy's house safely. 'Toy Story' builds on an element we all shrug off carelessly and thoughtlessly. Much like they did last year with monsters under the bed, Pixar took the theory of live toys to a new level in 'Toy Story,' filling our minds with endless possibilities. What Pixar does is a strange thing. It doesn't just try to expand our mind, but also out world. I respect and enjoy that. In 'Monsters, Inc.,' Pixar managed to preach to us 'What if monsters under the bed are real, and what if they have a world much like ours, and have feelings like humans,' while never forgetting the equally important formula of humor. Much is the same with their earlier film 'Toy Story.' What if those wooden and plastic toys we all played with as kids are real? 
What if they have feelings, emotions, voices, and human qualities? An interesting idea by itself, but when mixed with a wicked sense of humor and reality, you've got yourself one of the best films ever. Tom Hanks is perfect as Woody. Pixar must have modeled the doll's expressions and movements after Hanks, because after a while, I feel like I AM watching Hanks on screen, and NOT a computer-generated image. When you get to the point of not being able to tell animation from reality, you know that the voices are good. The same goes for Tim Allen, though the body gestures were most likely not modeled after Allen's physical expressions (Buzz is a short, pot-bellied toy). The rest of the cast is excellent, all very believable and entertaining. You begin to love each character for their distinguishing traits, and that is always refreshing. I can safely say that I have not enjoyed animated films quite so much over the years as I have enjoyed Pixar films. The only film they made that I named forgettable was 'A Bug's Life,' which was in and of it not horrible, but lacking the sense of humor the other Pixar films have and had. Pixar makes very refreshing films. In an era of cheap, made-for-video Disney sequels, rip-off cartoons and television babysitters (i.e. 'The Jungle Book 2), Pixar holds true to the values that made Disney films so entertaining back in the 30's-60's: Respect for the audience's intelligence, humor, provocative ideas to base the film upon, and respect for the audience (not the exact same thing as the first element), all of which are forgotten in this day and age of money-makers. I respect Pixar very much, and after hearing how little Disney does in helping with their films, I feel that Disney is just trying to cash in on their ideas by having their name branded on the posters for Pixar films. Shame on you, Disney. 
Proof that Disney has no respect for audiences is the fact that they will not let another sequel be made – something that fans like me would rather have than something like 'Finding Nemo.' 'Toy Story' 1 & 2 are both on my 'favorite films' list. It may sound stupid, but if I made up a top 250 list like IMDb.com, both of those films would be on there; so would 'Monsters Inc.' After an unpromising trailer for Pixar's upcoming film 'Finding Nemo,' I think that after their licensing deal with Disney is disputed (they have to cough up five more ORIGINAL films – not sequels – by 2005), they should definitely try to make a 'Toy Story 3.' I'll be first in line for it, anyway."
4,1,"Y'know, I always suspected that my toys were coming to life when I wasn't looking! In Andy's Room, his toys lead lives of noisy desperation come every birthday and Christmas - no one wants to be one-upped by a new addition to the toy box. Nominally led by Cowboy Woody (there's a Brokeback joke in there just waiting to happen), Mr. Potato Head, Rex the Dinosaur, Ham the piggybank, Bo Peep, Slinky the dog and a smattering of other playthings go about their toy business of playing checkers, hanging with the hometoys and ""plastic corrosion awareness meetings,"" until Andy's birthday party, when they gather expectantly around a transistor radio, listening to the reports of their toy soldier troops ""in the field"" (downstairs watching Andy's gift-opening), hoping that no gift will be exciting enough to cause Andy to neglect *them.* There is. His name is Buzz Lightyear, Space Ranger. Directed by Pixar mainstay John Lasseter, with the voice talents of Tom Hanks (as Woody), Don Rickles, John Ratzenberger (forever Cliff from *Cheers*), R. Lee Ermey, Annie Potts, Jim Varney and Tim Allen (as Buzz), *Toy Story* is that *rara avis* that succeeds on all levels  in its animation, storyline, character development, its messages of friendship and self-realization and, most importantly, its entertainment value. The fact that this is a cartoon (or animated feature  just what DO we call this new wave of computer-generated movies?) is incidental. Which makes the slightly dodgy animation (of the ""real people"") irrelevant - it gets the point across with or without the technological finesse. The ""Disney Movie"" has become synonymous with maudlin messages, redneck fundamentalism, anachronistic family values, boneheaded parents, smart-mouthing youngsters, too-hip-to-be-smart teens and insufferable pets. 
Though Disney's tyrannical umbrella overarches this film's production studio, Pixar Animation, *Toy Story* somehow avoided all trace of Disney's craven hand, which is doubly surprising, considering this is Pixar's first feature length film, after years of experimentation. Right outa the gate and right outa the field. Sure, there are ""messages,"" but they are heartfelt, rather than maudlin (Woody tells Buzz during Buzz's greatest depression that it matters not what Buzz thinks of himself, what makes him important is what his owner, Andy, thinks of him); there are emotional segments, which are truly heartbreaking, rather than cheesy (when Buzz's escape attempt lands him with a broken arm, proving he is Not A Flying Toy, the lyric, ""Clearly I will go sailing no more,"" launches a thousand hankies); and the portrayal of Andy's family was Pixar's triumphal achievement. Boldly contravening Disney's *idée fixe* of the 1950's nuclear family and Norman Rockwell fantasies, one of the many incarnations of a modern-day family is presented: a single mother with two kids, who are neither geniuses nor monsters, just normal children; happy to visit Pizza Planet and disappointed when favorite toys are lost. Buzz  who believes he is a real life space ranger on a mission to save the universe - become Andy's favorite toy over Woody. The funny thing is: though Buzz believes he is real, he still adheres to toy protocol of ""playing inert"" when humans are in the area. (Maybe it's instinct?) When he mentions saving a toy from Sid, the vicious boy next door, how does he propose to do it if he is to adhere to the inert protocol? Buzz's ingenuousness regarding his role as a toy infuriates Woody to the point of attempted toy-assassination. 
Through a concatenation of accidents, both he and Buzz become lost and must use teamwork, trust and ingenuity to beat their path back to Andy, which finds them ensconced in scorchingly funny vignettes (Buzz fastening himself in an over-sized seatbelt; both falling in with green, three-eyed aliens; Buzz hyperventilating as ""Mrs. Nesbitt""). During a climactic rocket ride, the callback line, ""This is not flying - this is falling with style,"" simply seals this movie's greatness. At least I now have a plausible explanation as to why my toys always got lost: after going about their toy business, they would just go inert anywhere they happened to be, instead of paying attention to their master's infallible toy filing system."


## **10 - Results**

##### Movie-based collaborative filtering with K = 4: RMSE 1.04849
##### Movie-based collaborative filtering with K = 6: RMSE 1.04461
##### Average of movie-based collaborative filtering with K = 4 and the dummy user average: RMSE 1.00042
##### Average of movie-based collaborative filtering with K = 6 and the dummy user average: RMSE 0.99707
##### User-based collaborative filtering with K = 6: RMSE 1.12605
##### User-based collaborative filtering with K = 3: RMSE 1.14487
##### Baseline: RMSE 0.95653
##### SVD with 2 factors: RMSE 0.96758
##### SVD with 5 factors: RMSE 0.95662
##### SVD with 10 factors: RMSE 0.94786
##### SVD with 15 factors: RMSE 0.93649
##### SVD with 20 factors: RMSE 0.93942

##### Gradient boosting with added features: RMSE 0.94647; refit on the full dataset: RMSE 0.93756
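
Two of the entries above average the predictions of two models. A minimal sketch of that kind of blend and of the RMSE metric, on toy numbers constructed so that the two models err in opposite directions. The arrays are hypothetical and unrelated to the scores reported above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true ratings and predictions from two models.
y_true = np.array([5.0, 3.0, 4.0, 2.0])
pred_item_cf = np.array([4.0, 4.0, 3.0, 3.0])   # e.g. movie-based CF
pred_user_avg = np.array([5.5, 2.5, 4.5, 1.5])  # e.g. dummy user average

# Equal-weight blend of the two prediction vectors.
blend = 0.5 * (pred_item_cf + pred_user_avg)

rmse_item_cf = rmse(y_true, pred_item_cf)    # 1.0
rmse_user_avg = rmse(y_true, pred_user_avg)  # 0.5
rmse_blend = rmse(y_true, blend)             # 0.25
print(rmse_item_cf, rmse_user_avg, rmse_blend)
```

Averaging helps here only because the toy errors partly cancel. When two models make strongly correlated errors, the blend's RMSE stays close to that of the individual models.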



## **References**

##### https://machinelearningmastery.com/sparse-matrices-for-machine-learning/

##### https://pub.towardsai.net/recommendation-system-in-depth-tutorial-with-python-for-netflix-using-collaborative-filtering-533ff8a0e444