## Other Modeling Approaches

After trying Association Rule Mining, I decided to try other methods of producing anime recommendations as well, so that the final report can compare and contrast the approaches.

In this notebook, we will explore the following methods:

- User-based collaborative filtering using the `scikit-surprise` library. We take the profile of anime a person has watched and rated and look for other users with a similar profile, based only on the anime they have watched and rated. We then use those similar users' ratings to predict how the person would rate anime they have not watched yet, and return the anime with the highest predicted ratings.
- Feature similarity using KNN. The data mixes categorical variables (genre and tags) with numerical variables (number of episodes), which makes the problem a bit tricky. To use both kinds of data, we will need to do three things:
    - Normalize the number of episodes to [0,1], since we know that the distribution is NOT Gaussian and there are many outliers in the data. The distribution is exponential.
    - One-hot encode the genre categorical variables. There are not that many genres, so this is feasible.
    - One-hot encode the tag variables. This is trickier, because there are a LOT of distinct tags. We will need more preprocessing here, and potentially dimensionality reduction that keeps the principal components explaining the most variance in the data and drops the ones explaining the least.
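
The three preprocessing steps above can be sketched on a toy table. This is only an illustration under stated assumptions: the column names (`episodes`, `genre`, `tags`), the comma-separated label format, and the helper names `min_max`, `multi_hot`, and `pca_reduce` are all hypothetical, and PCA is done here with a plain NumPy SVD rather than any particular library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy catalogue; column names and values are assumptions, not the real schema.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "episodes": [1, 12, 24, 500],
    "genre": ["Action, Comedy", "Comedy", "Action, Drama", "Drama"],
    "tags": ["mecha, space", "school", "mecha, war", "war, school"],
})

def min_max(series):
    """Scale a numeric column linearly onto [0, 1]."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)

def multi_hot(series):
    """One 0/1 column per distinct label in a comma-separated string column."""
    return series.str.get_dummies(sep=", ")

def pca_reduce(matrix, n_components):
    """Project rows onto the top principal components (PCA via SVD)."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

episodes_scaled = min_max(toy["episodes"])
genre_hot = multi_hot(toy["genre"])
tag_hot = multi_hot(toy["tags"])
tags_reduced = pca_reduce(tag_hot.to_numpy(dtype=float), n_components=2)

# Final feature matrix: 1 episode column + 3 genre columns + 2 tag components.
features = np.hstack([
    episodes_scaled.to_numpy(dtype=float).reshape(-1, 1),
    genre_hot.to_numpy(dtype=float),
    tags_reduced,
])
print(features.shape)
```

Note that min-max scaling is sensitive to outliers: in the toy data the 500-episode show pushes every other value close to 0, which is worth keeping in mind for the real episode counts.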

## User-based Collaborative Filtering
### Use the `scikit-surprise` package for a more advanced recommendation system

In [24]:
import pandas as pd
import pickle

In [2]:
df_anime_lists = pd.read_csv("./data/animelists_cleaned.csv")

In [3]:
df_users_cleaned = pd.read_csv("./data/users_cleaned.csv")

In [4]:
# create map from name to id (if a username appears more than once, the first user_id wins)
userName2userIDMap = (
    df_users_cleaned.drop_duplicates(subset='username', keep='first')
    .set_index('username')['user_id']
    .to_dict()
)

In [5]:
# create reverse map from id to name
userID2userNameMap = {v: k for k, v in userName2userIDMap.items()}

In [25]:
# save the username maps
with open('userName2userIDMap.pkl', 'wb') as f:
    pickle.dump(userName2userIDMap, f)

with open('userID2userNameMap.pkl', 'wb') as f:
    pickle.dump(userID2userNameMap, f)

In [6]:
# convert the usernames into user_ids (raises KeyError if a username is missing from the map)
user_ids = [userName2userIDMap[name] for name in df_anime_lists['username']]

In [26]:
# save the ordered list of all user ids for the anime list observations (really long)
with open('user_ids.pkl', 'wb') as f:
    pickle.dump(user_ids, f)

In [7]:
df_anime = pd.read_csv("./data/anime_cleaned.csv")

# build both lookup maps (if an id or title repeats, the last row wins)
animeID2animeNameMap = dict(zip(df_anime['anime_id'], df_anime['title']))
animeName2animeIDMap = dict(zip(df_anime['title'], df_anime['anime_id']))

del df_anime

In [27]:
with open('animeID2animeNameMap.pkl', 'wb') as f:
    pickle.dump(animeID2animeNameMap, f)

with open('animeName2animeIDMap.pkl', 'wb') as f:
    pickle.dump(animeName2animeIDMap, f)

In [8]:
df = pd.DataFrame(data={
    'user_id': user_ids,
    'anime_id': df_anime_lists['anime_id'],
    'score': df_anime_lists['my_score'],
})

In [9]:
del df_anime_lists
del df_users_cleaned

In [10]:
with open('df_custom.pkl', 'wb') as f:
    pickle.dump(df, f)

In [11]:
df

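|  | user_id | anime_id | score |
|---|---|---|---|
| 0 | 2255153 | 21 | 9 |
| 1 | 2255153 | 59 | 7 |
| 2 | 2255153 | 74 | 7 |
| 3 | 2255153 | 120 | 7 |
| 4 | 2255153 | 178 | 7 |
| ... | ... | ... | ... |
| 31284025 | 4862000 | 15611 | 9 |
| 31284026 | 4862000 | 27815 | 9 |
| 31284027 | 299167 | 5945 | 8 |
| 31284028 | 263803 | 1316 | 9 |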


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31284030 entries, 0 to 31284029
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   score     int64
dtypes: int64(3)
memory usage: 716.0 MB


In [13]:
df['score'].value_counts()

score
0     12111905
8      4834595
7      4234726
9      3443674
10     2507404
6      2128502
5      1085660
4       480871
3       223202
2       130314
1       103177
Name: count, dtype: int64

Remove all observations in which the score is 0, which means the user did not give a rating.

In [14]:
df = df[df['score'] != 0]

In [15]:
df['score'].value_counts()

score
8     4834595
7     4234726
9     3443674
10    2507404
6     2128502
5     1085660
4      480871
3      223202
2      130314
1      103177
Name: count, dtype: int64

In [16]:
# check how many user_ids have fewer than 5 anime rated.

# it looks like we will filter out 3926 users if we drop them.
(df['user_id'].value_counts() < 5).value_counts()

count
False    102476
True       3926
Name: count, dtype: int64

In [17]:
# Count the number of ratings per user
ratings_per_user = df['user_id'].value_counts()

# Identify users with fewer than 5 ratings
users_with_few_ratings = ratings_per_user[ratings_per_user < 5].index

# Filter out users with fewer than 5 ratings
df = df[~df['user_id'].isin(users_with_few_ratings)]

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19162287 entries, 0 to 31284016
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   score     int64
dtypes: int64(3)
memory usage: 584.8 MB


In [18]:
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate, GridSearchCV

In [19]:
# Define a Reader object to parse the file
reader = Reader(rating_scale=(1, 10))  # Assuming scores are from 1 to 10 (ignore 0 scores, which mean no rating given.)

# Load the dataset
data = Dataset.load_from_df(df[['user_id', 'anime_id', 'score']], reader)

In [20]:
# # Use SVD algorithm
# algo = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)

# # Run 5-fold cross-validation and print results
# cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [21]:
# SVD
# https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

# GridSearchCV
# https://surprise.readthedocs.io/en/stable/model_selection.html#surprise.model_selection.search.GridSearchCV
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

# parameters = {
#     'n_factors': [100, 200, 300],
#     'n_epochs': [20, 30],
#     'lr_all': [0.005, 0.001],
#     'reg_all': [0.02, 0.03],
# }

# grid = GridSearchCV(algo_class = SVD, 
#                    param_grid = parameters,
#                    measures = ['rmse', 'mae'],
#                    n_jobs = 2,
#                    cv = 5,
#                    refit = True)

In [23]:
# grid_result = grid.fit(data)
# grid_result

In [None]:
# Ok, if we got good enough results, train on everything.
# trainset = data.build_full_trainset()
# algo = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
# algo.fit(trainset)

In [None]:
def get_top_n_recommendations(algo, user_id, n=10):
    # Assume we have a list of all anime_ids in the dataset
    all_anime_ids = set(df['anime_id'].unique())
    
    # Get the list of anime_ids that the user has already rated
    rated_anime_ids = set(df[df['user_id'] == user_id]['anime_id'].unique())
    
    # Predict ratings for all anime the user hasn't rated
    predictions = [algo.predict(user_id, anime_id) for anime_id in all_anime_ids if anime_id not in rated_anime_ids]

    # print(predictions)
    
    # Sort the predictions in descending order of the estimated rating
    predictions.sort(key=lambda x: x.est, reverse=True)
    
    # Return the top N anime_ids
    top_n_anime_ids = [pred.iid for pred in predictions[:n]]
    return top_n_anime_ids


In [None]:
# only consider users that are still in df after the filtering above
unique_user_ids = df['user_id'].unique().tolist()
unique_user_ids[:10]

In [None]:
# use an untrained algo for comparison?
# algo_untrained = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)

In [None]:
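# NOTE: `grid` must have been fit with refit=True (see the grid search cell above) so that it can predict.
for test_user_id in unique_user_ids[:5]:
    # pass the raw integer id: the data was loaded with integer user_ids, so a str id
    # would match no rows in df and would be treated as an unknown user by the model
    top_recommendations = get_top_n_recommendations(algo=grid, user_id=test_user_id, n=10)
    print(f"Recommendations for {userID2userNameMap[test_user_id]}: ")
    for rec in top_recommendations:
        print(animeID2animeNameMap[rec])
    print()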

This approach requires the user to already be in the dataset, so that we know which anime they have watched and how they rated them. From that, we predict anime they have not watched yet but would likely rate highly, based on how other users with similar rating histories rated those anime. This is collaborative filtering. Strictly speaking, `SVD` is a matrix-factorization (model-based) variant rather than a neighborhood-based user-to-user method, but it draws on the same signal.
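
To make the idea concrete, here is a minimal NumPy sketch of the kind of model `SVD` fits: the predicted rating is the global mean plus a user bias, an item bias, and the dot product of a user and an item latent-factor vector, all learned by stochastic gradient descent. This is a simplified illustration and not the library's code; the function names `fit_funk_svd` and `predict_rating`, the toy ratings, and the hyperparameters are assumptions chosen for the example.

```python
import numpy as np

def fit_funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    mu = float(np.mean([r for _, _, r in ratings]))  # global mean rating
    bu = np.zeros(n_users)                           # user biases
    bi = np.zeros(n_items)                           # item biases
    p = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
    q = rng.normal(0, 0.1, (n_items, n_factors))     # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + q[i] @ p[u])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = p[u].copy()  # use the pre-update user vector for the item step
            p[u] += lr * (err * q[i] - reg * p[u])
            q[i] += lr * (err * pu_old - reg * q[i])
    return mu, bu, bi, p, q

def predict_rating(model, u, i, low=1.0, high=10.0):
    """Predicted rating for user u and item i, clipped to the rating scale."""
    mu, bu, bi, p, q = model
    return float(np.clip(mu + bu[u] + bi[i] + q[i] @ p[u], low, high))

# Hypothetical toy data: (user index, anime index, score on a 1-10 scale).
toy_ratings = [
    (0, 0, 9), (0, 1, 8), (0, 2, 2),
    (1, 0, 10), (1, 1, 9), (1, 3, 3),
    (2, 2, 9), (2, 3, 8), (2, 0, 3),
]
model = fit_funk_svd(toy_ratings)

# Compare the training error against always predicting the global mean.
mean_score = np.mean([r for _, _, r in toy_ratings])
baseline_rmse = float(np.sqrt(np.mean([(r - mean_score) ** 2 for _, _, r in toy_ratings])))
train_rmse = float(np.sqrt(np.mean(
    [(r - predict_rating(model, u, i)) ** 2 for u, i, r in toy_ratings]
)))
print(baseline_rmse, train_rmse)
print(predict_rating(model, 0, 3))  # user 0 has not rated anime 3
```

A user who never appears in the training triples has no learned bias or factor vector, which is why the model cannot personalize predictions for a brand-new user without retraining.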


Below, we will explore content-based filtering, which uses the features of the content (anime) to find anime similar to those in a new user's anime list. We would need to fall back to this approach if the new user is not already in the database, so it is one way to address the "cold start" problem in recommendation systems.

### kNNs

This will be the fallback approach. If a new user who has never been in the database comes along with a list of anime they liked, how can we use the SVD algo? We can't, unless we add the new data to the dataset and retrain, which is slow. Therefore, we need a different, fallback method. To keep things simple, we can use only the genre as a feature of each anime and compute a distance metric on those features between different anime. Anime that are close to each other are similar, and those are the ones we will recommend.
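
As a rough sketch of this fallback, the snippet below scores every anime by its average cosine similarity to the anime a new user liked, using only 0/1 genre flags. The toy titles, the genre matrix, the choice of cosine similarity, and the function names `cosine_similarity_matrix` and `recommend_from_liked` are all assumptions made for illustration; other distance metrics (Jaccard, for instance) would fit the same structure.

```python
import numpy as np

# Hypothetical toy data: rows are anime, columns are genre flags.
genres = ["Action", "Comedy", "Drama", "Romance"]
titles = ["A", "B", "C", "D", "E"]
X = np.array([
    [1, 1, 0, 0],  # A: Action, Comedy
    [1, 0, 0, 0],  # B: Action
    [0, 0, 1, 1],  # C: Drama, Romance
    [0, 1, 0, 1],  # D: Comedy, Romance
    [1, 1, 0, 1],  # E: Action, Comedy, Romance
], dtype=float)

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a feature matrix."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)  # guard all-zero rows
    return unit @ unit.T

def recommend_from_liked(liked_titles, k=2):
    """Return the k anime most similar, on average, to the liked ones."""
    sims = cosine_similarity_matrix(X)
    liked_idx = [titles.index(t) for t in liked_titles]
    scores = sims[liked_idx].mean(axis=0)
    scores[liked_idx] = -np.inf  # never recommend what the user already listed
    top = np.argsort(-scores)[:k]
    return [titles[i] for i in top]

print(recommend_from_liked(["A"]))  # -> ['E', 'B']
```

Because this only needs the genre matrix and the new user's list, it requires no retraining when a new user shows up.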