<a href="https://colab.research.google.com/github/russro/anime-recommendation-bot/blob/main/cosineSimilarityModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from google.colab import files
import zipfile
from sklearn.model_selection import train_test_split

## Upload data

In [3]:
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving sample.zip to sample.zip
User uploaded file "sample.zip" with length 76022667 bytes


## Unzip and read data

In [8]:
zf = zipfile.ZipFile('/content/sample.zip')

# The files carry a .csv extension but are read as tab-separated
anime = pd.read_csv(zf.open("anime.csv"), sep='\t')
users = pd.read_csv(zf.open("user.csv"), sep='\t')
userScores = pd.read_csv(zf.open("user_anime000000000000.csv"), sep='\t')

## Concatenate userScores

In [None]:
# TODO
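
The shard name `user_anime000000000000.csv` suggests the scores may be exported as several numbered files. A minimal sketch of how such shards could be concatenated, assuming every shard matches the pattern `user_anime*.csv` and shares the same columns. The helper name `read_score_shards` and the in-memory toy archive are hypothetical and only serve the illustration:

```python
import fnmatch
import io
import zipfile

import pandas as pd


def read_score_shards(zf, pattern="user_anime*.csv"):
    """Read every archive member matching `pattern` and stack them row-wise."""
    names = sorted(n for n in zf.namelist() if fnmatch.fnmatch(n, pattern))
    frames = []
    for n in names:
        with zf.open(n) as fh:
            frames.append(pd.read_csv(fh, sep="\t"))
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Toy archive with two shards (made-up data, for illustration only)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as toy:
    toy.writestr("user_anime000000000000.csv", "user_id\tanime_id\tscore\nu1\t1\t8\n")
    toy.writestr("user_anime000000000001.csv", "user_id\tanime_id\tscore\nu2\t5\t6\n")
    toy.writestr("anime.csv", "anime_id\ttitle\n1\tA\n")

with zipfile.ZipFile(buf) as toy:
    toyScores = read_score_shards(toy)

print(toyScores)
```

With the real archive, the call would presumably be `userScores = read_score_shards(zf)`, provided the remaining shards are present in `sample.zip`.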

## Split data into train-test-validation partitions
The EDA in *cosineSimilarityMethodAnalysis.ipynb* shows that the data is sparse overall and also contains extremely sparse vectors (e.g. users or anime with only one score). The choice of splitting strategy therefore matters a great deal. The following approaches are considered:

1. **Random Split**: Split all entries at random. This risks masking out most or all of the entries for a given user or anime.

2. **Stratified Split**: Mask *n* entries for each user or anime, where *n* is some arbitrarily chosen number. This approach does not normalize for users/anime that have many entries compared to those that have very few.

3. **Proportional Split**: Take some percentage of entries from each user/anime. Though this partly accounts for the imbalance between high entry-count and low entry-count users/anime, it still breaks down for users/anime with very few entries (e.g. fewer than 3 entries may be infeasible to split).

4. **Class Split**: Hold out anime/users that fall into some class, and see how well the model generalizes when predicting their scores. Though this approach is useful for certain models/studies, the aim of this model is to generalize to all users and predict/suggest anime for users given their scoring history.

5. **Time Split**: Hold out a proportion of the most recent entries (e.g. the latest 10% of the data), then train the model and test how well it predicts these 'future' entries. This method may be suitable for predicting user scores for anime that have been recently released or will be released in the future. Because the model is simple (it uses few features) and there is no currently known MAL data pipeline, this method does not fit the objectives of the model.
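
As an illustration of the proportional split (approach 3), here is a minimal sketch that holds out a fraction of each user's scores and leaves users with too few entries entirely in the training set. The function name `proportional_split`, the parameters `frac` and `min_entries`, and the toy data are assumptions made for this example, not a strategy the notebook has settled on:

```python
import numpy as np
import pandas as pd


def proportional_split(scores, frac=0.2, min_entries=3, seed=0):
    """Hold out roughly `frac` of each user's scores.

    Users with fewer than `min_entries` scores are kept whole in the
    training set, since splitting them would leave almost nothing to learn from.
    """
    rng = np.random.default_rng(seed)
    test_idx = []
    for _, group in scores.groupby("user_id"):
        if len(group) < min_entries:
            continue
        # At least one held-out entry per eligible user
        n_test = max(1, int(round(frac * len(group))))
        test_idx.extend(rng.choice(group.index.to_numpy(), size=n_test, replace=False))
    test = scores.loc[test_idx]
    train = scores.drop(index=test_idx)
    return train, test


# Toy long-format scores: user "a" has 10 entries, user "b" only 2
toy = pd.DataFrame({
    "user_id": ["a"] * 10 + ["b"] * 2,
    "anime_id": list(range(10)) + [0, 1],
    "score": [7] * 12,
})
trainScores, testScores = proportional_split(toy, frac=0.2)
print(len(trainScores), len(testScores))
```

The split is done on the long-format frame (one row per score) rather than on the pivoted matrix, so that masked entries simply disappear from the training pivot table. The same idea could be applied per anime by grouping on `anime_id` instead.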

## Create pivot table from userScores

In [13]:
scoreMatrix = pd.pivot_table(userScores, values='score',index='user_id',columns='anime_id')
scoreMatrix.sample(5)

anime_id,1,5,6,7,8,15,16,17,18,19,...,51194,51195,51198,51218,51221,51222,51224,51225,51234,51236
user_id,,,,,,,,,,,,,,,,,,,,,
_jetsy,,,,,,,,,,,...,,,,,,,,,,
4shiryu,10.0,8.0,9.0,,,,,,,,...,,,,,,,,,,
_dukedevlin_,,,,,,,,,,6.0,...,,,,,,,,,,
35arata,,,,,,,,,,,...,,,,,,,,,,
-karasu,7.0,,,,,,,,,,...,,,,,,,,,,
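
To make the shape of this matrix concrete, the following toy sketch pivots a handful of made-up scores and measures the fraction of missing cells. The names `toyScores`, `toyMatrix`, and `sparsity` are hypothetical. Note that `pd.pivot_table` aggregates with the mean by default, so a duplicated (user, anime) pair would be averaged rather than raise an error:

```python
import pandas as pd

# Made-up long-format scores: one row per (user, anime) rating
toyScores = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "anime_id": [1, 5, 1, 6],
    "score": [10, 8, 7, 6],
})

# Rows are users, columns are anime; unrated pairs become NaN
toyMatrix = pd.pivot_table(toyScores, values="score", index="user_id", columns="anime_id")

# Share of cells with no score
sparsity = toyMatrix.isna().to_numpy().mean()

print(toyMatrix)
print(f"Fraction of missing entries: {sparsity:.2f}")
```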


## Check if all users and anime have at least one non-NaN entry

In [10]:
u = pd.isna(scoreMatrix).all(axis=1) # Returns boolean series to check if rows have ALL NaN's
display(u)
not u.any() # True: every user has provided at least one score

user_id
-------             False
----------yea       False
-------m-------     False
------____------    False
-----____----       False
                    ...  
_valio              False
_valkyrie_          False
_vall               False
_valuwu             False
_vampirek_          False
Length: 17352, dtype: bool

True

In [11]:
a = pd.isna(scoreMatrix).all(axis=0) # Returns boolean series to check if columns have ALL NaN's  
display(a)
not a.any() # True: every anime has received at least one score

anime_id
1        False
5        False
6        False
7        False
8        False
         ...  
51222    False
51224    False
51225    False
51234    False
51236    False
Length: 11700, dtype: bool

True
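
For reference, the same all-NaN check can also report which users or anime are empty instead of only returning a boolean. A small sketch on a made-up matrix that deliberately contains one empty row and one empty column (`toyMatrix`, `emptyUsers`, and `emptyAnime` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy score matrix: user "u2" and anime 5 have no scores at all
toyMatrix = pd.DataFrame(
    [[7.0, np.nan], [np.nan, np.nan]],
    index=pd.Index(["u1", "u2"], name="user_id"),
    columns=pd.Index([1, 5], name="anime_id"),
)

# Labels of rows/columns whose entries are all NaN
emptyUsers = toyMatrix.index[toyMatrix.isna().all(axis=1)].tolist()
emptyAnime = toyMatrix.columns[toyMatrix.isna().all(axis=0)].tolist()

print(emptyUsers, emptyAnime)
```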