# Anime Recommendation System Using Collaborative Filtering

This project uses [data from MyAnimeList](https://www.kaggle.com/datasets/svanoo/myanimelist-dataset) to recommend anime to users. Possible approaches include:

1. **Cosine Similarity (Memory-Based)**: This method predicts a user's missing scores as a weighted combination of other users' scores, where the most similar users receive the largest weights. The weights are computed as normalized inner products between all pairs of mean-centered user vectors, which makes them effectively the Pearson correlations (R) between users. Because missing entries have to be filled with something, here the mean of each row or column, this arbitrary choice introduces bias.
2. **Matrix Factorization** or **Singular Value Decomposition**
3. **Deep Learning Methods**

This notebook explores **Cosine Similarity** as a method for recommending anime.
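
As a small sanity check of the Pearson connection mentioned in item 1, the sketch below uses a made-up, fully observed ratings matrix (the values and variable names are illustrative and not taken from the dataset). With no missing entries, the cosine similarity of mean-centered user vectors coincides with `np.corrcoef`; once missing scores are zero-filled, the two need not coincide exactly.

```python
import numpy as np

# Toy ratings: 3 users x 4 items, fully observed (made-up values).
ratings = np.array([
    [5.0, 3.0, 4.0, 4.0],
    [3.0, 1.0, 2.0, 3.0],
    [4.0, 3.0, 4.0, 5.0],
])

# Center each user vector on its own mean, then scale it to unit length.
centered = ratings - ratings.mean(axis=1, keepdims=True)
unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)

# Inner products of unit vectors are the centered cosine similarities.
cosine_sim = unit @ unit.T

# np.corrcoef treats each row as one variable, i.e. one user here.
pearson = np.corrcoef(ratings)
print(np.allclose(cosine_sim, pearson))  # prints True
```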

**References**:
1. [User-Based Collaborative Filtering](https://www.geeksforgeeks.org/user-based-collaborative-filtering/): Implementation of Pearson Correlation.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path

## Import MyAnimeList Data

[Dataset Source](https://www.kaggle.com/datasets/svanoo/myanimelist-dataset)

In [2]:
def tsvToDF(path, fileName, sep='\t'):
    """
    Function to convert tsv file to DataFrame.
    
    Parameters
    ----------
    path: pathlib.Path
        Directory containing the file
    fileName: string
        Name of file to be extracted
    sep: string
        Separator for input file
    
    Returns
    -------
    df: pandas DataFrame
    """
    
    df = pd.read_csv(path.joinpath(fileName), sep=sep)
    
    return df

In [3]:
path = Path(r"C:\Users\prrus\Downloads\archive")

In [4]:
anime = tsvToDF(path,"anime.csv")
users = tsvToDF(path,"user.csv")
userScores = tsvToDF(path,"user_anime000000000001.csv")

In [5]:
userScores.sample(5)

Unnamed: 0,user_id,anime_id,favorite,review_id,review_date,review_num_useful,review_score,review_story_score,review_animation_score,review_sound_score,review_character_score,review_enjoyment_score,score,status,progress,last_interaction_date
931690,abo_f7el,14813,0,,,,,,,,,,10.0,completed,13.0,2013-06-28 00:00:00
1887765,adelina-elena,38691,0,,,,,,,,,,,completed,24.0,2019-12-13 00:00:00
187491,a1i0s,50631,0,,,,,,,,,,,plan_to_watch,0.0,
605135,abakerjo,32182,0,,,,,,,,,,8.0,completed,12.0,2016-09-27 00:00:00
2307140,adsas1234,37221,0,,,,,,,,,,,completed,12.0,2018-12-23 00:00:00


In [6]:
# Create Pivot Table from userScores
scoreMatrix = pd.pivot_table(userScores, values='score',index='user_id',columns='anime_id')
scoreMatrix.sample(5)

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
adopted-potato,,,,,,,,,,,...,,,,,,,,,,
adrokx,8.0,9.0,,,,,,,,,...,,,,,,,,,,
aenubrix326,,,,,,,,,,,...,,,,,,,,,,
adelmukhlisin,,,,,,,,,,,...,,,,,,,,,,
abcboy75,,,,,,,,,,,...,,,,,,,,,,


In [7]:
# Check for users whose scores are all NaN
u = pd.isna(scoreMatrix).all(axis=1) # Boolean Series: True where a row is entirely NaN
display(u)
u[u].empty # True means every user has provided at least one score

user_id
_vampirek_       False
_vampirelord_    False
_vander_         False
_vanivani_       False
_vanix_          False
                 ...  
afinty           False
afiownz          False
afipax           False
afiq_            False
afiq_456         False
Length: 12461, dtype: bool

True

In [8]:
# Check for anime whose scores are all NaN
a = pd.isna(scoreMatrix).all(axis=0) # Boolean Series: True where a column is entirely NaN
display(a)
a[a].empty # True means every anime has received at least one score

anime_id
1        False
5        False
6        False
7        False
8        False
         ...  
51150    False
51162    False
51225    False
51234    False
51236    False
Length: 11117, dtype: bool

True

## Cosine Similarity

One drawback of this method is that the missing entries must be filled with an arbitrary choice of either the row mean or the column mean. Matrix Factorization is a better method in this respect, but it is more computationally expensive.

### Row-Wise Mean as Initial Condition

The mean of each user's scores fills in the missing entries.
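
Read together, the cells in this subsection appear to implement the prediction rule below. The notation is introduced here for illustration and does not come from the notebook: $r_{ui}$ is the score user $u$ gave anime $i$, $\bar{r}_u$ is the mean of user $u$'s observed scores, $\tilde{r}_{ui}$ is the centered and zero-filled score, $s_{uv}$ is the similarity between users $u$ and $v$, and $\hat{r}_{ui}$ is the predicted score.

```latex
\tilde{r}_{ui} =
\begin{cases}
r_{ui} - \bar{r}_u & \text{if user } u \text{ scored anime } i \\
0 & \text{otherwise}
\end{cases}
\qquad
s_{uv} = \frac{\tilde{r}_u \cdot \tilde{r}_v}{\lVert \tilde{r}_u \rVert_2 \, \lVert \tilde{r}_v \rVert_2}
\qquad
\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v} s_{uv} \, \tilde{r}_{vi}}{\sum_{v} \lvert s_{uv} \rvert}
```

The sums run over all users $v$, including $u$ itself, because the code does not exclude the diagonal of the similarity matrix. When $\tilde{r}_u$ is the zero vector, the code leaves it unnormalized, so every $s_{uv}$ is 0 and the denominator vanishes.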

In [9]:
# Center scoreMatrix

# Subtract the mean of each row (user) from the existing entries;
# mean() skips NaN by default, so missing scores stay NaN here
userCenteredCos = scoreMatrix.sub(scoreMatrix.mean(axis=1), axis=0)

# Replace NaN's with 0s
userCenteredCos = userCenteredCos.fillna(0)
userCenteredCos

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vampirelord_,-1.031250,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.968750,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vander_,1.146067,1.146067,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.146067,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanivani_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanix_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,2.625000,1.625000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiownz,0.526786,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.473214,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afipax,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiq_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
# Vector lengths of each user vector
l2Norms = np.sqrt(np.square(userCenteredCos).sum(axis=1))
l2Norms

user_id
_vampirek_        8.238858
_vampirelord_     7.808249
_vander_         19.212005
_vanivani_        0.000000
_vanix_          22.567072
                   ...    
afinty           11.185929
afiownz          22.357086
afipax           13.493018
afiq_             0.925820
afiq_456          9.275692
Length: 12461, dtype: float64

In [11]:
# Normalize user vectors by dividing each row (user vector) by its length;
# rows of length zero are left as zero vectors
normUserCenteredCos = userCenteredCos.copy()
normUserCenteredCos[l2Norms != 0] = userCenteredCos[l2Norms != 0].divide(l2Norms, axis=0)
normUserCenteredCos

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vampirelord_,-0.132072,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.252137,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vander_,0.059654,0.059654,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.163755,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanivani_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanix_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,0.234670,0.145272,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiownz,0.023562,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.021166,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afipax,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiq_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# User-User Similarity Matrix
uuSim = np.dot(normUserCenteredCos,normUserCenteredCos.transpose())
uuSim

array([[ 1.        ,  0.00285573,  0.        , ...,  0.01542859,
         0.03632238,  0.        ],
       [ 0.00285573,  1.        , -0.01868061, ..., -0.00449936,
         0.        ,  0.00811494],
       [ 0.        , -0.01868061,  1.        , ...,  0.07252705,
         0.        ,  0.06818708],
       ...,
       [ 0.01542859, -0.00449936,  0.07252705, ...,  1.        ,
        -0.01407482,  0.00575844],
       [ 0.03632238,  0.        ,  0.        , ..., -0.01407482,
         1.        ,  0.        ],
       [ 0.        ,  0.00811494,  0.06818708, ...,  0.00575844,
         0.        ,  1.        ]])

In [13]:
# L1 norms of similarity matrix for normalizing predicted ratings
l1Norms = abs(uuSim).sum(axis=0)
l1Norms

array([267.86244091, 292.10339024, 685.41829374, ..., 532.5377448 ,
       246.83478659, 381.71095242])

In [14]:
# Reconstruct predictions with inner product between user-user sim and 
# centered (not normalized) rating matrix, dividing each row by that
# user's L1 norm, and then adding back row-wise means

# Inner product
predictions = pd.DataFrame(uuSim,index=userCenteredCos.index,
                           columns=userCenteredCos.index).dot(userCenteredCos)
predictions

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,21.896407,3.606579,4.153230,0.008526,-0.145385,-1.008150,0.356548,0.063844,2.012492,9.733911,...,-0.023279,-0.023279,0.004699,-0.090762,0.018892,-0.023279,0.023591,-0.018954,-0.023279,-0.023279
_vampirelord_,-60.550817,-10.089116,-7.295273,0.358153,0.006566,0.573492,-1.736165,0.163408,-0.236011,45.115406,...,0.010139,0.010139,0.002586,0.014574,-0.008228,0.010139,-0.005642,0.007497,0.010139,0.010139
_vander_,243.757680,60.770614,40.988720,-1.180544,-0.435133,2.760497,9.954429,0.241059,8.983084,186.788098,...,-0.019687,-0.019687,-0.011892,0.008715,0.015977,-0.019687,0.004085,-0.010640,-0.019687,-0.019687
_vanivani_,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
_vanix_,145.402715,33.227414,28.577988,-1.992401,-0.271743,3.569623,5.539685,-0.013384,9.283917,99.262650,...,-0.021363,-0.021363,-0.004895,-0.033690,0.017337,-0.021363,0.012442,-0.010604,-0.021363,-0.021363
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,283.388022,66.444909,44.267445,-1.383462,-0.169481,3.486171,9.640364,0.152645,8.024216,128.759164,...,-0.013039,-0.013039,-0.008518,0.009228,0.010582,-0.013039,0.002064,-0.004844,-0.013039,-0.013039
afiownz,173.061794,41.817242,39.279139,-1.948390,-1.141341,4.491712,15.961108,0.366903,9.148301,97.003579,...,-0.036906,-0.036906,-0.005841,-0.072297,0.029951,-0.036906,0.024110,-0.026084,-0.036906,-0.036906
afipax,112.588853,27.250311,17.216791,-0.646741,-0.047603,1.086640,1.074748,-0.001656,5.242119,94.997559,...,-0.007137,-0.007137,-0.006289,0.013815,0.005792,-0.007137,-0.000497,-0.000944,-0.007137,-0.007137
afiq_,-2.214236,-0.883475,-0.823748,0.390114,-0.049925,-0.116119,-0.704335,-0.097421,0.274163,-4.704585,...,0.019191,0.019191,0.002207,0.042066,-0.015575,0.019191,-0.013368,0.017077,0.019191,0.019191


In [15]:
# Divide each row (user) by its L1 norm
predictions = predictions.divide(l1Norms, axis=0)

# Add original row means
predictions = np.add(predictions, 
                     np.asarray(
                         scoreMatrix[pd.notna(scoreMatrix)].mean(axis=1).to_frame()
                     ))
predictions

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,8.142351,8.074070,8.076111,8.060638,8.060063,8.056842,8.061937,8.060844,8.068119,8.096945,...,8.060519,8.060519,8.060624,8.060267,8.060677,8.060519,8.060694,8.060535,8.060519,8.060519
_vampirelord_,7.823958,7.996710,8.006275,8.032476,8.031272,8.033213,8.025306,8.031809,8.030442,8.185700,...,8.031285,8.031285,8.031259,8.031300,8.031222,8.031285,8.031231,8.031276,8.031285,8.031285
_vander_,7.209566,6.942595,6.913734,6.852210,6.853298,6.857960,6.868456,6.854284,6.867039,7.126450,...,6.853904,6.853904,6.853915,6.853945,6.853956,6.853904,6.853939,6.853917,6.853904,6.853904
_vanivani_,,,,,,,,,,,...,,,,,,,,,,
_vanix_,7.497577,7.324110,7.316920,7.269646,7.272307,7.278247,7.281294,7.272707,7.287084,7.426226,...,7.272694,7.272694,7.272720,7.272675,7.272754,7.272694,7.272747,7.272711,7.272694,7.272694
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,7.851627,7.486753,7.449453,7.372673,7.374715,7.380863,7.391214,7.375257,7.388496,7.591559,...,7.374978,7.374978,7.374986,7.375016,7.375018,7.374978,7.375003,7.374992,7.374978,7.374978
afiownz,7.835606,7.560780,7.555465,7.469134,7.470824,7.482620,7.506637,7.473983,7.492371,7.676340,...,7.473137,7.473137,7.473202,7.473063,7.473277,7.473137,7.473265,7.473160,7.473137,7.473137
afipax,7.765266,7.605017,7.586176,7.552632,7.553757,7.555887,7.555864,7.553843,7.563690,7.732233,...,7.553833,7.553833,7.553834,7.553872,7.553857,7.553833,7.553845,7.553844,7.553833,7.553833
afiq_,9.848172,9.853564,9.853806,9.858723,9.856941,9.856672,9.854289,9.856748,9.858254,9.838083,...,9.857221,9.857221,9.857152,9.857313,9.857080,9.857221,9.857089,9.857212,9.857221,9.857221


In [16]:
s = scoreMatrix.loc['_vanivani_']
s[pd.notna(s)] # Contains a single score

anime_id
21    10.0
Name: _vanivani_, dtype: float64

#### Analysis

This approach fails to predict scores for users whose scores have no variance, such as `_vanivani_`, who has given a single score. Centering such a row on its own mean turns every observed entry into zero, and the missing entries are filled with zero as well, so the user is left with the *zero vector*. A zero vector has zero similarity with every user, including itself, so its L1 norm is zero and the normalization step divides zero by zero, which yields NaN. Therefore, I switch to column-based means, since the scores given to an anime are much more likely to have variance than the scores given by a single user.
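
The failure can be reproduced on a toy matrix. The sketch below is only an illustration under assumed data: the three users, their scores, and the variable names are made up, and the steps mirror the notebook's cells in condensed form rather than reproducing them exactly.

```python
import numpy as np
import pandas as pd

# Toy score matrix (made-up values); user "c" has rated a single anime.
toy = pd.DataFrame(
    [[8.0, 6.0, np.nan],
     [7.0, np.nan, 9.0],
     [np.nan, 10.0, np.nan]],
    index=["a", "b", "c"], columns=[1, 5, 6],
)

row_means = toy.mean(axis=1)
centered = toy.sub(row_means, axis=0).fillna(0)  # "c" becomes the zero vector
l2 = np.sqrt((centered ** 2).sum(axis=1))

# Normalize only rows with non-zero length (divide zero rows by 1 instead).
unit = centered.div(l2.where(l2 != 0, 1.0), axis=0)
sim = unit.dot(unit.T)
l1 = sim.abs().sum(axis=1)                       # 0 for user "c"

# 0 / 0 yields NaN for every prediction of the zero-vector user.
pred = sim.dot(centered).div(l1, axis=0).add(row_means, axis=0)
print(pred.loc["c"].isna().all())  # prints True
```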

### Column-Wise Mean as Initial Condition

The computations are the same as in the row-wise case, but the missing entries are filled using the mean of each column (the mean score given to each anime).
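
Since only the centering step differs between the two variants, it can be isolated in a small helper. `center_scores` and its `by` parameter are hypothetical names introduced for this sketch, and the toy scores are made up; the notebook itself performs the centering inline.

```python
import numpy as np
import pandas as pd

def center_scores(score_matrix: pd.DataFrame, by: str = "anime") -> pd.DataFrame:
    """Subtract the per-user or per-anime mean from observed scores, then fill NaN with 0."""
    if by == "user":
        centered = score_matrix.sub(score_matrix.mean(axis=1), axis=0)
    elif by == "anime":
        centered = score_matrix.sub(score_matrix.mean(axis=0), axis=1)
    else:
        raise ValueError("by must be 'user' or 'anime'")
    # After centering, 0 stands for "equal to the mean", so it is the fill value.
    return centered.fillna(0)

# Toy score matrix (made-up values); users "b" and "c" each rated one anime.
toy = pd.DataFrame(
    [[8.0, 6.0], [6.0, np.nan], [np.nan, 10.0]],
    index=["a", "b", "c"], columns=[1, 5],
)
by_user = center_scores(toy, by="user")
by_anime = center_scores(toy, by="anime")
print(by_user)
print(by_anime)
```

In this toy case the single-score users `b` and `c` collapse to zero vectors under user centering but keep a non-zero entry under anime centering, because their scores differ from the anime means.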

In [17]:
# Center scoreMatrix

# Subtract the mean of each column (anime) from the existing entries
userCenteredCos = (scoreMatrix[pd.notna(scoreMatrix)] - 
    scoreMatrix[pd.notna(scoreMatrix)].mean(axis=0))

# Replace NaN's with 0s
userCenteredCos = userCenteredCos.fillna(0)
userCenteredCos

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vampirelord_,-1.632188,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.1728,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vander_,-0.632188,-0.331909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.1728,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanivani_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanix_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,1.367812,0.668091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiownz,-0.632188,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.8272,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afipax,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiq_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# Vector lengths of each user vector
l2Norms = np.sqrt(np.square(userCenteredCos).sum(axis=1))
l2Norms

user_id
_vampirek_        9.364916
_vampirelord_    11.046957
_vander_         16.033453
_vanivani_        1.207033
_vanix_          20.105464
                   ...    
afinty            8.014844
afiownz          17.670293
afipax           11.601547
afiq_             6.189572
afiq_456          8.591590
Length: 12461, dtype: float64

In [19]:
# Normalize user vectors by dividing each row (user vector) by its length;
# rows of length zero are left as zero vectors
normUserCenteredCos = userCenteredCos.copy()
normUserCenteredCos[l2Norms != 0] = userCenteredCos[l2Norms != 0].divide(l2Norms, axis=0)
normUserCenteredCos

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vampirelord_,-0.147750,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.106165,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vander_,-0.039429,-0.020701,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.073147,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanivani_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
_vanix_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,0.170660,0.083357,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiownz,-0.035777,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.103405,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afipax,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
afiq_,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# User-User Similarity Matrix
uuSim = np.dot(normUserCenteredCos,normUserCenteredCos.transpose())
uuSim

array([[ 1.        , -0.00256296,  0.        , ..., -0.00296078,
         0.02386162,  0.        ],
       [-0.00256296,  1.        , -0.00706964, ...,  0.01097259,
         0.        ,  0.00144753],
       [ 0.        , -0.00706964,  1.        , ...,  0.00114657,
         0.        , -0.00204342],
       ...,
       [-0.00296078,  0.01097259,  0.00114657, ...,  1.        ,
         0.00316481, -0.00485213],
       [ 0.02386162,  0.        ,  0.        , ...,  0.00316481,
         1.        ,  0.        ],
       [ 0.        ,  0.00144753, -0.00204342, ..., -0.00485213,
         0.        ,  1.        ]])

In [21]:
# L1 norms of similarity matrix for normalizing predicted ratings
l1Norms = abs(uuSim).sum(axis=0)
l1Norms

array([254.60805648, 256.01396555, 775.26940792, ..., 380.4478476 ,
       197.87484986, 282.15380908])

In [22]:
# Reconstruct predictions with inner product between user-user sim and 
# centered (not normalized) rating matrix, dividing each row by that
# user's L1 norm, and then adding back row-wise means

# Inner product
predictions = pd.DataFrame(uuSim,index=userCenteredCos.index,
                           columns=userCenteredCos.index).dot(userCenteredCos)
predictions

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,9.104611,1.160548,6.365385,0.851781,0.272714,1.898668,2.203838,0.448710,2.091812,1.101944,...,0.0,0.0,0.0,-0.044750,0.0,0.0,0.089500,-0.033174,0.0,0.0
_vampirelord_,-56.310182,-5.350399,-8.520138,-0.299803,-0.293587,-0.394999,-2.341945,-0.120365,-1.638308,10.652612,...,0.0,0.0,0.0,-0.010590,0.0,0.0,0.021181,-0.011925,0.0,0.0
_vander_,-50.525419,-12.032095,-24.601230,-0.240153,-0.948919,-9.703222,-4.779473,-1.068195,-11.061943,-0.260034,...,0.0,0.0,0.0,0.021607,0.0,0.0,-0.043215,0.025864,0.0,0.0
_vanivani_,11.204120,0.595214,7.428254,-0.171256,0.418475,5.357588,0.504580,0.382272,1.684821,13.188219,...,0.0,0.0,0.0,-0.006109,0.0,0.0,0.012219,-0.002627,0.0,0.0
_vanix_,-15.135435,-5.954356,-11.054872,-0.876287,-0.539872,-3.156412,-3.078781,-0.447441,-3.448736,-3.051003,...,0.0,0.0,0.0,0.018091,0.0,0.0,-0.036183,0.002631,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,52.357111,7.920244,3.361568,0.023382,-0.059830,-1.466679,-2.820803,-0.225745,-0.834125,6.962837,...,0.0,0.0,0.0,0.023659,0.0,0.0,-0.047318,0.031169,0.0,0.0
afiownz,-11.443406,0.410848,0.436265,-0.476486,0.193383,1.263529,2.547213,0.085223,2.044493,-16.627895,...,0.0,0.0,0.0,-0.010709,0.0,0.0,0.021417,-0.008779,0.0,0.0
afipax,-15.784991,-3.515395,-8.582935,-0.514212,-0.690309,-4.076502,-5.782462,-0.333068,-4.568547,1.118387,...,0.0,0.0,0.0,0.025516,0.0,0.0,-0.051033,0.015814,0.0,0.0
afiq_,10.775590,2.139552,3.472178,0.239590,0.197296,1.356509,2.133794,0.229884,2.818481,2.844889,...,0.0,0.0,0.0,-0.002798,0.0,0.0,0.005595,0.005944,0.0,0.0


In [23]:
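# Divide each row (user) by its L1 norm
predictions = predictions.divide(l1Norms, axis=0)

# Add original row (user) means. Note: the matrix above was centered on
# column (anime) means, so adding column means back may be more consistent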
predictions = np.add(predictions, 
                     np.asarray(
                         scoreMatrix[pd.notna(scoreMatrix)].mean(axis=1).to_frame()
                     ))
predictions

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
_vampirek_,8.096365,8.065164,8.085607,8.063952,8.061677,8.068063,8.069262,8.062368,8.068822,8.064934,...,8.060606,8.060606,8.060606,8.060430,8.060606,8.060606,8.060958,8.060476,8.060606,8.060606
_vampirelord_,7.811300,8.010351,7.997970,8.030079,8.030103,8.029707,8.022102,8.030780,8.024851,8.072859,...,8.031250,8.031250,8.031250,8.031209,8.031250,8.031250,8.031333,8.031203,8.031250,8.031250
_vander_,6.788761,6.838413,6.822200,6.853623,6.852709,6.841417,6.847768,6.852555,6.839664,6.853597,...,6.853933,6.853933,6.853933,6.853960,6.853933,6.853933,6.853877,6.853966,6.853933,6.853933
_vanivani_,10.037536,10.001994,10.024886,9.999426,10.001402,10.017949,10.001690,10.001281,10.005645,10.044183,...,10.000000,10.000000,10.000000,9.999980,10.000000,10.000000,10.000041,9.999991,10.000000,10.000000
_vanix_,7.235908,7.258242,7.245835,7.270596,7.271414,7.265049,7.265238,7.271639,7.264338,7.265305,...,7.272727,7.272727,7.272727,7.272771,7.272727,7.272727,7.272639,7.272734,7.272727,7.272727
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
afinty,7.572601,7.404892,7.387687,7.375088,7.374774,7.369465,7.364354,7.374148,7.371852,7.401278,...,7.375000,7.375000,7.375000,7.375089,7.375000,7.375000,7.374821,7.375118,7.375000,7.375000
afiownz,7.434928,7.474589,7.474674,7.471620,7.473861,7.477442,7.481736,7.473499,7.480055,7.417583,...,7.473214,7.473214,7.473214,7.473178,7.473214,7.473214,7.473286,7.473185,7.473214,7.473214
afipax,7.512356,7.544606,7.531286,7.552495,7.552032,7.543131,7.538647,7.552971,7.541838,7.556786,...,7.553846,7.553846,7.553846,7.553913,7.553846,7.553846,7.553712,7.553888,7.553846,7.553846
afiq_,9.911599,9.867956,9.874690,9.858354,9.858140,9.863998,9.867926,9.858305,9.871387,9.871520,...,9.857143,9.857143,9.857143,9.857129,9.857143,9.857143,9.857171,9.857173,9.857143,9.857143


In [24]:
# Check the diagonal of the user-user similarity matrix for zero-vector
# users (a self-similarity of 0 instead of 1)
chk = uuSim.diagonal()
chk[chk == 0]

array([0.])

In [25]:
# Check for NaNs in prediction matrix
predictions[predictions.isnull().any(axis=1)]

anime_id,1,5,6,7,8,15,16,17,18,19,...,51018,51036,51048,51109,51149,51150,51162,51225,51234,51236
user_id,,,,,,,,,,,...,,,,,,,,,,
ab1111977,,,,,,,,,,,...,,,,,,,,,,


In [26]:
bug = scoreMatrix.loc['ab1111977']
bug[bug.notna()]

anime_id
853    8.0
Name: ab1111977, dtype: float64

In [27]:
anm = scoreMatrix.loc[:,853]
anm[anm.notna()].mean() # The average of this anime is the same as this user's lone score!

8.0

#### Analysis

This error occurred in the rare case where a user has given only one score and that score happens to equal the anime's mean score. Centering the matrix therefore leaves this user with a *zero vector*. One way to resolve this is to filter out users that match this edge case and to require that a user's scores show some variance relative to the anime's mean score (if column-wise means are used to fill missing entries) or relative to their own scoring distribution (if row-wise means are used to fill missing entries).
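
One possible way to apply such a filter is sketched below. `drop_zero_vector_users` is a hypothetical helper, the toy centered matrix is made up, and dropping the users entirely is an assumption; falling back to a mean score for them would be another option.

```python
import numpy as np
import pandas as pd

def drop_zero_vector_users(centered: pd.DataFrame) -> pd.DataFrame:
    """Keep only users whose centered score vector has non-zero length."""
    l2 = np.sqrt((centered ** 2).sum(axis=1))
    return centered.loc[l2 > 0]

# Toy centered matrix (made-up values); user "c" is a zero vector.
centered = pd.DataFrame(
    [[1.0, -2.0], [-1.0, 0.0], [0.0, 0.0]],
    index=["a", "b", "c"], columns=[1, 5],
)
kept = drop_zero_vector_users(centered)
print(list(kept.index))  # prints ['a', 'b']

# With the zero vector removed, the remaining steps no longer divide by zero.
unit = kept.div(np.sqrt((kept ** 2).sum(axis=1)), axis=0)
sim = unit.dot(unit.T)
offsets = sim.dot(kept).div(sim.abs().sum(axis=1), axis=0)
print(offsets.isna().any().any())  # prints False
```

The `offsets` frame holds only the predicted deviations from the mean; the means would still have to be added back as in the cells above.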