<a href="https://colab.research.google.com/github/russro/anime-recommendation-bot/blob/main/cosineSimilarityModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anime Recommendation System using a Cosine Similarity Model

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from google.colab import files
import zipfile
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

## Upload data

In [2]:
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving sample.zip to sample.zip
User uploaded file "sample.zip" with length 76022667 bytes


## Unzip and read data

In [3]:
zf = zipfile.ZipFile('/content/sample.zip') 

anime = pd.read_csv(zf.open("anime.csv"),sep='\t')
users = pd.read_csv(zf.open("user.csv"),sep='\t')
userScores = pd.read_csv(zf.open("user_anime000000000000.csv"),sep='\t')

## Concatenate and Clean userScores

In [5]:
# TODO
# Change

# Filter out unused data
userScores = userScores.loc[:,['user_id','anime_id','score']]
userScores

,user_id,anime_id,score
0,-------,1,8.0
1,-------,1000,
2,-------,1002,
3,-------,1003,8.0
4,-------,1004,7.0
...,...,...,...
4003076,_vampirek_,30911,
4003077,_vampirek_,31043,9.0
4003078,_vampirek_,31229,9.0
4003079,_vampirek_,31240,10.0


## Split data into train-validation-test partitions
The dataset is split into two parts: a single held-out test set, which is not touched until the model is ready for final evaluation, and a larger remainder used for training and validation. K-fold cross-validation is applied to that remainder.

From the EDA in *cosineSimilarityMethodAnalysis.ipynb*, the data is not only sparse overall but also contains very sparse vectors (e.g. users or anime with only one score), so the choice of splitting strategy matters. The following approaches are considered:

1. **Random Split**: Split all entries at random. This can mask too many, or even all, of the entries for a given user or anime.

2. **Stratified Split**: Mask *n* entries for each user or anime, where *n* is an arbitrarily chosen number. This approach does not account for the difference between users/anime with many entries and those with very few.

3. **Proportional Split**: Take some percentage of the entries from each user/anime. Though this partly addresses the imbalance between high and low entry-count users/anime, it still breaks down for users/anime with very few entries (e.g. fewer than 3 entries may be infeasible to split).

4. **Class Split**: Hold out the anime/users that fall into some class, and see how well the model generalizes to predicting their scores. Though this approach is useful for certain models/studies, the aim of this model is to generalize to all users and predict/suggest anime for users given their scoring history.

5. **Time Split**: Hold out a proportion of the most recent entries (e.g. the latest 10% of the data), then train the model and test how well it predicts these 'future' entries. This method may suit predicting user scores for anime that were recently released or will be released in the future. Because the model is simple (it uses few features) and there is currently no known MAL data pipeline, this method does not fit the goal of the model.

This model simply splits at random, because protecting low-entry users during the split would itself introduce bias, and evaluation could not be conducted properly if low-entry users were excluded from the validation and test sets. At one extreme, if *all* entries from low-entry users are used for training, there is no way to evaluate how well the model generalizes to them, since none would appear in the validation or test sets. At the other extreme, if all low-entry users are removed, the model may not generalize well to low-entry users in general. A random split is therefore a simple compromise.
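
The risk named for the random split (a user losing every entry to the held-out set) can be checked empirically. The following is a minimal sketch on made-up data; `toy_scores` and the helper `users_missing_from_train` are hypothetical names introduced only for this illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: user "a" has many scores, user "c" has only one.
toy_scores = pd.DataFrame({
    "user_id":  ["a"] * 6 + ["b"] * 3 + ["c"],
    "anime_id": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1],
    "score":    [8, 7, 9, 6, 10, 5, 7, 8, 6, 9],
})

def users_missing_from_train(df, test_size=0.3, seed=0):
    """Randomly split the rows and return the users with no rows left in train."""
    train, _ = train_test_split(df, test_size=test_size, random_state=seed)
    return sorted(set(df["user_id"]) - set(train["user_id"]))

# For several seeds, record which users were fully masked from training.
masked_by_seed = {seed: users_missing_from_train(toy_scores, seed=seed)
                  for seed in range(20)}
n_affected = sum(1 for users in masked_by_seed.values() if users)
print(f"{n_affected} of {len(masked_by_seed)} random splits "
      "left at least one user with no training rows")
```

Running the same check on `userScores` would give a rough sense of how many low-entry users a given `test_size` is likely to mask entirely.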

In [None]:
RANDOM_SEED = 42069

TODO: 
1. Split data into training+validation and testing sets.
2. Create cosineSimilarity object with method to fit input data.
3. Perform k-fold cross-validation on all iterations of models.
4. Choose best model and validate using test set.
5. Save model for later deployment (save the training folds for the particular iteration and model class/object).
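
Steps 1 and 3 of the list above can be sketched as follows. This is only an illustration under assumptions: `toy_scores` is a made-up stand-in for `userScores`, `SEED` is an arbitrary seed, and no model is fitted inside the loop since the estimator has not been written yet.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

SEED = 0  # illustrative seed, not the notebook's RANDOM_SEED

# Hypothetical stand-in for userScores with 100 (user, anime, score) rows.
rng = np.random.default_rng(SEED)
toy_scores = pd.DataFrame({
    "user_id": rng.integers(0, 10, size=100),
    "anime_id": rng.integers(0, 20, size=100),
    "score": rng.integers(1, 11, size=100),
})

# Step 1: set the test rows aside once; cross-validation never sees them.
train_val, test = train_test_split(toy_scores, test_size=0.10, random_state=SEED)

# Step 3: k-fold over the remaining rows. shuffle=True is required whenever
# a random_state is passed to KFold.
kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
fold_sizes = []
for train_idx, val_idx in kf.split(train_val):
    fold_train = train_val.iloc[train_idx]
    fold_val = train_val.iloc[val_idx]
    fold_sizes.append((len(fold_train), len(fold_val)))

print(len(test), fold_sizes)
```

Each row of `train_val` lands in exactly one validation fold, while `test` stays untouched until the final evaluation in step 4.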

In [7]:
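def kfold_train_val_test(df, k=5, val_size=0.15, test_size=0.10, rand_seed=RANDOM_SEED):
  # KFold only accepts random_state when shuffle=True
  # (kf is not used yet; the k-fold loop is still TODO)
  kf = KFold(n_splits=k, shuffle=True, random_state=rand_seed)

  # Hold out the test set first, then carve the validation set out of the
  # remainder; val_size is rescaled so it stays a fraction of the full dataset
  temp, test = train_test_split(df, test_size=test_size, random_state=rand_seed)
  train, val = train_test_split(temp, test_size=val_size/(1-test_size), random_state=rand_seed)
  return train, val, test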

NameError: ignored

In [None]:
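# Planned k-fold loop. `model` is a placeholder for the cosine similarity
# estimator that is still TODO, so this cell will not run until it exists.
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=RANDOM_SEED)
fold_rmse = []

for train_index, val_index in kf.split(userScores):
    fold_train = userScores.iloc[train_index, :]
    fold_val = userScores.iloc[val_index, :]

    model.fit(fold_train)
    pred_values = model.predict(fold_val)

    # Scores are numeric ratings, so use RMSE rather than classification accuracy
    rmse = np.sqrt(np.mean((pred_values - fold_val['score']) ** 2))
    fold_rmse.append(rmse)

avg_rmse = sum(fold_rmse) / k

print('RMSE of each fold - {}'.format(fold_rmse))
print('Avg RMSE : {}'.format(avg_rmse))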

In [None]:
userScoresTrain, userScoresVal, userScoresTest = kfold_train_val_test(userScores)

## Create pivot table from userScores

In [None]:
scoreMatrix = pd.pivot_table(userScoresTrain, values='score',index='user_id',columns='anime_id')
scoreMatrix.sample(5)

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,51194,51195,51198,51218,51221,51222,51224,51225,51234,51236
_jetsy,,,,,,,,,,,...,,,,,,,,,,
4shiryu,10.0,8.0,9.0,,,,,,,,...,,,,,,,,,,
_dukedevlin_,,,,,,,,,,6.0,...,,,,,,,,,,
35arata,,,,,,,,,,,...,,,,,,,,,,
-karasu,7.0,,,,,,,,,,...,,,,,,,,,,
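
To make the shape of the pivoted matrix concrete, here is a minimal sketch on made-up scores. `toy_scores` and `toy_matrix` are hypothetical names used only for this example. Each user becomes a row, each anime a column, and every (user, anime) pair without a score becomes NaN.

```python
import pandas as pd

# Hypothetical long-format scores: one row per (user, anime) pair.
toy_scores = pd.DataFrame({
    "user_id":  ["a", "a", "b", "c"],
    "anime_id": [1, 5, 1, 6],
    "score":    [10.0, 8.0, 7.0, 6.0],
})

# Same call shape as for scoreMatrix: users as rows, anime as columns.
toy_matrix = pd.pivot_table(toy_scores, values="score",
                            index="user_id", columns="anime_id")

# Fraction of (user, anime) cells that have no score.
sparsity = toy_matrix.isna().to_numpy().mean()
print(toy_matrix)
print(f"share of missing entries: {sparsity:.2f}")
```

The same `isna().to_numpy().mean()` expression applied to `scoreMatrix` gives a quick measure of how sparse the real data is.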


## Column-Wise Collaborative Filtering Model

Create estimator object using sklearn

In [None]:
#TODO

Center scoreMatrix

In [None]:
# Subtract each user's mean score (row-wise) from their existing entries so the
# centering matches the row-wise means added back after prediction
userCenteredCos = scoreMatrix.sub(scoreMatrix.mean(axis=1), axis=0)

# Replace NaN's with 0s
userCenteredCos = userCenteredCos.fillna(0)
userCenteredCos

In [None]:
# Vector lengths of each user vector
l2Norms = np.sqrt(np.square(userCenteredCos).sum(axis=1))
l2Norms

In [None]:
# Normalize user vectors by dividing each row/user vector with their vector lengths
normUserCenteredCos = userCenteredCos.copy()
normUserCenteredCos[l2Norms != 0] = userCenteredCos[l2Norms != 0].divide(l2Norms, axis=0)
normUserCenteredCos

In [None]:
# User-User Similarity Matrix
uuSim = np.dot(normUserCenteredCos,normUserCenteredCos.transpose())
uuSim
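
Dividing each row by its length and then taking the dot product of the result with its own transpose gives the cosine similarity between every pair of user rows. A minimal sketch on a toy matrix follows; the array `centered` and its values are made up for illustration, and the zero-norm handling shown is one possible choice, not necessarily the notebook's.

```python
import numpy as np

# Hypothetical centered score matrix: rows are users, columns are anime,
# zeros stand for missing scores.
centered = np.array([
    [ 1.0, -1.0, 0.0],
    [ 2.0, -2.0, 0.0],   # same direction as user 0
    [-1.0,  1.0, 0.0],   # opposite direction to user 0
    [ 0.0,  0.0, 0.0],   # user with no usable scores
])

norms = np.sqrt(np.square(centered).sum(axis=1))
# Divide zero-length rows by 1 so they stay all-zero instead of becoming NaN.
safe_norms = np.where(norms == 0, 1.0, norms)
normalized = centered / safe_norms[:, None]

# Pairwise cosine similarity between user rows.
similarity = normalized @ normalized.T
print(np.round(similarity, 3))
```

Users 0 and 1 rate in the same direction and get similarity 1, users 0 and 2 rate in opposite directions and get -1, and the all-zero user has similarity 0 with everyone, including itself.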

In [None]:
# L1 norms of similarity matrix for normalizing predicted ratings
l1Norms = abs(uuSim).sum(axis=0)
l1Norms

Reconstruct the predictions by taking the inner product of the user-user similarity matrix and the centered (not normalized) rating matrix, dividing each user's row by the corresponding entry of l1Norms, and then adding back the row-wise means.
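
The reconstruction can also be written as a formula. The notation is assumed here and not taken from the notebook: $s_{uv}$ is the entry of `uuSim` for users $u$ and $v$, $c_{vi}$ is the entry of `userCenteredCos` for user $v$ and anime $i$ (0 where the score is missing), and $\bar{r}_u$ is the mean of user $u$'s existing scores.

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v} s_{uv}\, c_{vi}}{\sum_{v} |s_{uv}|}$$

Both sums run over all users $v$, including $u$ itself. The denominator is the entry of `l1Norms` for user $u$.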

In [None]:
# Inner product
predictions = pd.DataFrame(uuSim,index=userCenteredCos.index,
                           columns=userCenteredCos.index).dot(userCenteredCos)
predictions

In [None]:
# Divide each user's row by that user's entry in l1Norms (axis=0 matches l1Norms to the rows)
predictions = predictions.divide(l1Norms, axis=0)

# Add original row means
predictions = np.add(predictions, 
                     np.asarray(
                         scoreMatrix[pd.notna(scoreMatrix)].mean(axis=1).to_frame()
                     ))
predictions