Hello there! In this kernel, I'll be building a simple but effective recommender system using matrix factorization. Let's get to it. 

The chapter names will describe what happens in each chapter.

Contents:
    1. A brief introduction to:
        - Anime
        - Recommender Systems
    2. Workflow
    3. Matrix Factorization: A Brief Introduction
    4. Building the recommender system
    5. Making recommendations

MyAnimeList is a website where anime fans gather and share their views on anime/manga they've watched. The site currently has information on about 17K anime (it did when I last checked). For those of you who don't know what anime are, they are Japanese animated serials / movies which are generally created from Japanese manga. Manga are Japanese comics. But, some anime are different. They are created from what are called light novels. Light novels are pretty awesome. You should check 'em out. Sword Art Online or Violet Evergarden would be a great place to start. 
Then there are other anime like Pokemon that are created from games.

Anime are pretty popular. If you were born before 2005 and watched Cartoon Network, chances are you've watched anime (remember Pokemon, Dragon Ball Z, Beyblade, etc.? They're all anime).

A recommender system or a recommendation system (sometimes replacing "system" with a synonym such as platform or engine) is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. That's from [Wikipedia](https://en.wikipedia.org/wiki/Recommender_system).

There are many algorithms to build recommender systems. In this kernel, I'll be building a recommender system that makes use of a particular set of algorithms called [Matrix Factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)).

I'll first write a simple algorithm myself. In later versions, I'll use the Surprise Python library to factorize the matrix and make recommendations.

Let's begin!

# Workflow
I'll be using a simple approach in this kernel. First, I'll load the data and wrangle it to deal with missing values and unneeded columns. Next, we'll generate the rating matrix from which the recommendations will be made. Third comes the matrix factorization step, and fourth is making the actual recommendations.

# Matrix Factorization: A Brief Introduction
In 2006, Netflix launched the Netflix Prize competition. The goal of the competition was to improve the accuracy of their recommendation algorithm by 10%. The first team to achieve this improvement would receive a million dollars. It was during this competition that the matrix factorization approach was popularized. Out of all the algorithms that were used, matrix factorization techniques stood out for their power and accuracy.

If you don't know what matrix factorization is, then I suggest that you read about it. [Nicolas Hug](http://nicolas-hug.com/blog/matrix_facto_1)'s series of blog posts on the topic does an amazing job of acquainting the reader with the concept of matrix factorization. It gives you a good intuition about the process while keeping the math as easy as possible. I'll only give you a really short introduction here and dive into the actual process in the kernel. 

Matrix factorization is closely related to singular value decomposition (SVD): SVD is one particular way of factorizing a matrix, and the techniques used in recommender systems are inspired by it. The idea is to describe users and items with a small number of latent features, which reduces the dimension of your feature space. It enables you to express a really big matrix as the product of two (SVD technically gives three, but for the sake of simplicity, let's keep it to two) smaller matrices. Here's a neat mathematical formula. I'm including it for no other reason than coolness. It just looks cool.
$$R = P Q^{T}$$

where:
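    $R$ is the original rating matrix of order $m \times n$ ($m$ users, $n$ items); $P$ is the user-factor matrix of order $m \times k$; $Q$ is the item-factor matrix of order $n \times k$; $k$ is the number of latent factors. (In a true SVD the factors have orthonormal columns; the factors learned by gradient descent in this kernel need not.)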
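
To make the shapes concrete, here is a minimal NumPy sketch. The sizes (4 users, 5 items, 2 factors) and the random values are made up for illustration; they are not taken from the anime data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, k = 4, 5, 2      # toy sizes, chosen only for illustration

P = rng.normal(size=(n_users, k))  # one k-dimensional row per user
Q = rng.normal(size=(n_items, k))  # one k-dimensional row per item

R_hat = P @ Q.T                    # reconstructed rating matrix, users x items
print(R_hat.shape)

# a single predicted rating is the dot product of a user row and an item row
print(np.isclose(R_hat[1, 3], P[1] @ Q[3]))
```

Each user and each item is described by just `k` numbers, so the two small matrices hold far fewer values than the full matrix once the number of users and items grows.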

For more information on singular value decomposition, check a linear algebra textbook. For more information on recommender systems and matrix factorization, check this [paper](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf).

I'm limiting the amount of information I include here about matrix factorization because I don't want this kernel to be about MF. It's about making anime recommendations and I'll stick to that. 

# Building the Recommender System
I'll start with the implementation detail for the recommender system here. First we'll start by loading the data and doing a little exploratory analysis. Maybe plot a few things. Next, we'll prepare the rating matrix and write our [Stochastic Gradient Descent](https://www.coursera.org/learn/machine-learning/lecture/DoRHJ/stochastic-gradient-descent) algorithm, which we'll use to train our model. Lastly, I'll use the model to make recommendations given an anime. 

In [None]:
# import required libraries
import numpy as np
import pandas as pd

In [None]:
# load in the data
# the anime dataset
anime = pd.read_csv("../input/anime-recommendations-database/anime.csv")

# the users rating dataset
user_ratings = pd.read_csv("../input/anime-recommendations-database/rating.csv")

In [None]:
print(anime.shape)
anime.info()

The dataset on anime has about 12.3K anime and 7 columns. One of the columns is the name of the anime. We'll be using this later. 

In [None]:
print(user_ratings.shape)
user_ratings.info()

The rating database has about 8 million rows and three columns: The user id, the anime id and the rating given by the user.

The rating column contains a -1 for those anime that a particular user watched but didn't rate. We'll start off our data prep by replacing these values with NaN's.

In [None]:
# replacing the -1's in user_ratings.rating with np.nan
user_ratings.loc[user_ratings.rating == -1, "rating"] = np.nan

In [None]:
# fraction of nulls in each column of user_ratings
user_ratings.isnull().mean() # about 19% of the values in user_ratings.rating are missing.

Next up, we'll look at the number of missing values in anime to get a good idea about what's happening over there. 

In [None]:
anime.isnull().mean()

There are not a lot of missing values, and they don't really concern us in this kernel anyway. I'll need only the anime ids and the anime names. So, I'll merge the two datasets and drop all the columns that are not required.

In [None]:
# merging anime and user_ratings
user_ratings = pd.merge(user_ratings, anime, on = "anime_id")

# dropping the unnecessary columns
user_ratings.drop(["genre", "type", "episodes", "rating_y", "members"], axis = 1, inplace = True)

# renaming rating_x to rating
user_ratings.rename(columns = {"rating_x": "rating"}, inplace = True)
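
Both tables have a column called `rating`, so `pd.merge` renames them with its default suffixes: `_x` for the left table and `_y` for the right one. A toy sketch with made-up rows shows where `rating_x` and `rating_y` come from:

```python
import pandas as pd

# made-up rows, only to illustrate the column names after the merge
toy_ratings = pd.DataFrame({"user_id": [1, 1, 2],
                            "anime_id": [10, 20, 10],
                            "rating": [8, 6, 9]})
toy_anime = pd.DataFrame({"anime_id": [10, 20],
                          "name": ["A", "B"],
                          "rating": [7.5, 6.9]})

# default is an inner join on the shared key
merged = pd.merge(toy_ratings, toy_anime, on="anime_id")
print(list(merged.columns))
```

Here `rating_x` is the rating given by the user and `rating_y` is the anime's average rating, which is why only `rating_x` is kept.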

Now that we have the required data loaded, I can start with building the actual recommender system. The first step is to get the matrix of ratings from the data. We can use the pivot_table method of pd.DataFrame to get that. 

But, before that, I'll cut the data down to make the computation manageable. SGD is an iterative process that works well when there's more data, but more data also means that it's going to take a long time to run. So, for computational reasons, I'm going to keep only the users with a user id of 1000 or less. (More on this at the end.)

In [None]:
# keeping only the users with user_id <= 1000
user_ratings = user_ratings[user_ratings.user_id <= 1000]

In [None]:
user_ratings.head()

In [None]:
# getting the rating matrix
rating_matrix = user_ratings.pivot_table(values = "rating", index = "user_id", columns = "anime_id")

In [None]:
rating_matrix.shape

The rating matrix we have here has a lot of missing values and only a few non-missing values. This is what we call a sparse matrix. Our job is to predict those missing values.
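
A toy version of the same pivot, with made-up ratings, shows how the missing cells arise and one way to measure how sparse the matrix is:

```python
import pandas as pd

# made-up ratings: 3 users, 3 anime, only 4 of the 9 cells are rated
toy = pd.DataFrame({"user_id": [1, 1, 2, 3],
                    "anime_id": [10, 20, 20, 30],
                    "rating": [8.0, 6.0, 9.0, 7.0]})

toy_matrix = toy.pivot_table(values="rating", index="user_id", columns="anime_id")
print(toy_matrix)

# share of user/anime cells that have no rating
sparsity = toy_matrix.isnull().to_numpy().mean()
print(sparsity)
```

The same `isnull().to_numpy().mean()` call on `rating_matrix` (before the missing values are filled) would give the sparsity of the real data.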

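We're going to finish up with our data preparation by filling in those missing values with 0's. Since valid ratings run from 1 to 10, a 0 simply marks an unrated cell, and the training code only uses cells with a rating greater than 0.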

In [None]:
rating_matrix.fillna(0, inplace = True)

Now that we have everything we need to make predictions, the final missing ingredient is the algorithm that will give us our recommendations. For the implementation details, I suggest you go through Andrew Ng's lecture on [Stochastic Gradient Descent](https://www.coursera.org/learn/machine-learning/lecture/DoRHJ/stochastic-gradient-descent) and this amazing [post]() on Analytics Vidhya, which gives you the code for implementing SGD. I've borrowed that code here.
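
For reference, the update performed for each observed rating $r_{ui}$ in the `sgd` method can be written as follows, where $p_u$ and $q_i$ are the rows of $P$ and $Q$ for user $u$ and anime $i$, $\alpha$ is the learning rate and $\lambda$ is the regularization coefficient (`reg_coef` in the code):

$$e_{ui} = r_{ui} - p_u \cdot q_i$$
$$p_u \leftarrow p_u + \alpha \, (e_{ui} \, q_i - \lambda \, p_u)$$
$$q_i \leftarrow q_i + \alpha \, (e_{ui} \, p_u - \lambda \, q_i)$$

These are gradient steps on the regularized squared error $e_{ui}^2 + \lambda (\lVert p_u \rVert^2 + \lVert q_i \rVert^2)$, with the constant factor of 2 absorbed into $\alpha$.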

In [None]:
# setting to raise exceptions
np.seterr(all = "raise")
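
With this setting, NumPy raises a `FloatingPointError` instead of only emitting a warning, so a training run whose values blow up stops right away. A small sketch with a made-up overflow:

```python
import numpy as np

old_settings = np.seterr(all="raise")   # same setting as above
try:
    np.array([1e308]) * 10.0            # result is too large for float64
    overflow_raised = False
except FloatingPointError:
    overflow_raised = True
finally:
    np.seterr(**old_settings)           # restore whatever was set before

print(overflow_raised)
```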

In [None]:
class MF():
    
    def __init__(self, rating_matrix, learning_rate = 0.01, reg_coef = 0.02, n_factors = 10,
                 n_epochs = 5):
        self.R = rating_matrix
        self.alpha = learning_rate
        self.reg_coef = reg_coef
        self.k = n_factors
        self.n_epochs = n_epochs
        self.n_users, self.n_items = self.R.shape
        
    def getTrainset(self):
        # (user index, item index, rating) for every observed rating (> 0)
        users, items = np.nonzero(self.R > 0)
        self.trainset = [(u, i, self.R[u, i]) for u, i in zip(users, items)]
        
    def fit(self):
        self.getTrainset()
        
        self.p = np.random.normal(0, 0.1, size = (self.n_users, self.k))
        self.q = np.random.normal(0, 0.1, size = (self.n_items, self.k))
        
        training_errors = []
        
        for epoch in range(self.n_epochs):
            np.random.shuffle(self.trainset)
            
            self.sgd()
            
            mse = self.mse()
            
            training_errors.append((epoch + 1, mse))
            
        self.training_errors = training_errors
        
    def mse(self):
        # mean squared error over the observed (non-zero) ratings only
        xs, ys = self.R.nonzero()

        predicted = self.getPredictedMatrix()

        errors = self.R[xs, ys] - predicted[xs, ys]

        return np.mean(errors ** 2)
        
    def sgd(self):
        for u, i, r_ui in self.trainset:
            prediction = np.dot(self.p[u, :], self.q[i, :])
            err = r_ui - prediction

            # keep the old user vector so both updates use pre-update values
            p_u = self.p[u, :].copy()

            # gradient steps with L2 regularization
            self.p[u, :] += self.alpha * (err * self.q[i, :] - self.reg_coef * self.p[u, :])
            self.q[i, :] += self.alpha * (err * p_u - self.reg_coef * self.q[i, :])
                
    def getPrediction(self, u, i):
        return np.dot(self.p[u, :], self.q[i, :].T)
        
    def getPredictedMatrix(self):
        return np.dot(self.p, self.q.T)

For computational efficiency, it's better to pass the rating matrix as a numpy array. So, I'll convert it into a numpy array and then run the training process.

In [None]:
# converting the rating matrix into a numpy array
R = np.array(rating_matrix)

In [None]:
mf = MF(rating_matrix = R, n_epochs = 10)

In [None]:
mf.fit()

In [None]:
mf.training_errors

The Stochastic Gradient Descent algorithm seems to work. I'm considering only 1000 users because I ran into problems with 5000 users: the values generated by np.random.normal led to floating point errors during training. So, I settled on testing the algorithm with 1000 users first.

The model needs to be tuned a lot more. For the next version, here's what I have planned:
    1. Write another class that makes recommendations but uses the Surprise library and its advanced algorithms.
    2. Create a function to recommend anime.
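
One possible shape for the recommendation function in item 2, as a sketch only: it ranks anime by the cosine similarity between rows of the learned item-factor matrix (`mf.q`). The function name `recommend_similar`, the toy factor matrix and the titles are all hypothetical. In the kernel, the rows of `mf.q` follow the column order of `rating_matrix`, so the names would have to be lined up with those anime ids first.

```python
import numpy as np

def recommend_similar(item_factors, names, query, top_n=3):
    """Return the top_n names whose factor vectors are closest (cosine) to query."""
    names = list(names)
    idx = names.index(query)
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1.0                  # avoid dividing by zero
    unit = item_factors / norms[:, None]     # unit-length factor vectors
    sims = unit @ unit[idx]                  # cosine similarity to the query
    order = np.argsort(-sims)                # most similar first
    return [names[j] for j in order if j != idx][:top_n]

# hypothetical item factors and titles, standing in for mf.q and the anime names
toy_q = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [-1.0, 0.0]])
toy_names = ["A", "B", "C", "D"]

print(recommend_similar(toy_q, toy_names, "A", top_n=2))
```

Cosine similarity is just one choice here; ranking a user's unrated anime by predicted rating from `getPredictedMatrix` would be another.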