# Introduction (TBD)

# Part 0: Loading dependencies and data

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # plotting
from scipy import stats # statistics
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
anime = pd.read_csv("../input/anime.csv") # load data

In [3]:
anime.head(10)

# Part 1a: Exploratory Data Analysis

The dataset contains information regarding the title, genres, type, number of episodes, average user rating, and number of members for each anime. The default sort of this dataset appears to output the same sorted list as what is displayed at https://myanimelist.net/topanime.php, where the data was obtained approximately a year ago. 

Interesting note: at the time, *Kimi no Na wa* was newly in theaters (I remember watching it in April 2017). The movie was very well received and highly praised, so unsurprisingly it nabbed the top spot. However, some of the entries in the top 10 have shuffled around since then. In particular, *Kimi no Na wa* has been dethroned by the former #1 title *Fullmetal Alchemist: Brotherhood*, which came out in 2009. There's also a new entry, *Steins;Gate 0*, which holds position 3 at the time of writing and is currently airing in Spring 2018. Unfortunately, this dataset lacks airing/release date information. It could be interesting to investigate whether recency artificially inflates the ratings of newer shows, as they seem to converge to a lower equilibrium rating as time passes. Perhaps it is a good idea to rescrape and update the data to include air dates as another predictor column, but I digress.

Let's see what we can do with this data for now.

In [4]:
anime.describe()

## Data cleaning: Episode counts

The describe() function provides summary statistics (count, mean, standard deviation, and quartiles) for numeric variables. However, episode count, which should be numeric, doesn't show up. Let's find out why.

The datatype is object, and converting it to numeric results in an error due to the presence of "Unknown" strings. We need to clean the data to fill these values. An easy fix is to replace the unknown values with the median episode count.

In [5]:
# anime.episodes.describe() # dtype: object
# anime['episodes'] = pd.to_numeric(anime['episodes']) # ValueError: invalid literal for int() with base 10: 'Unknown'

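# Coerce "Unknown" (and any other non-numeric entry) to NaN, then fill with the median.
# Converting first matters: the median of a string (object) column is not well defined.
anime["episodes"] = pd.to_numeric(anime["episodes"], errors="coerce")
anime["episodes"] = anime["episodes"].fillna(anime["episodes"].median())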
anime.describe()

Excellent, now it appears as a numeric variable. Now let's see if there's any relationship between the episode count and the rating.

## Analysis: Rating vs episode count

In [6]:
sns.jointplot(x='episodes', y='rating', data=anime)

Wow, there are shows with crazy numbers of episodes! Let's see what they are.

In [7]:
anime[(anime.episodes > 200)].sort_values(by='episodes', ascending=False)

Oyako Club has 1818 episodes! That's insane. But let's check out the distribution of shows that are less insane (i.e. fewer than 100 episodes).

In [8]:
sns.jointplot(x='episodes', y='rating', data=anime[(anime.episodes < 100)])

Very cool! There appears to be a high density of points at lower episode counts, with the highest amount at 1. There appears to be some equally spaced out bands of higher density around 13, 26, 39 (sort of) and 52 episodes. This data captures the trend that anime tends to be either a standalone movie (1 episode), OVA (few episodes), or a longer series that follows seasonal "cours" which are multiples of 12 or 13 episodes. We can look at the most common values of episodes to confirm this.

In [9]:
anime.episodes.value_counts().head(10)

## Analysis: Rating vs members

Next, let's take a look at the highly rated shows. We'll return a table for the shows that have average rating > 9, and see if we can find anything interesting.

In [10]:
anime.loc[(anime.rating > 9)].sort_values(by='rating', ascending=False)

There are 5 animes on this list that "belong" on the top 10 if going by pure average user rating; in fact, 3 of them have a higher average rating than the actual top anime. However, they all have suspiciously low (2 digit) members while the rest of the shows in this list have at least mid to high 5 digit members. The algorithm that MyAnimeList (MAL) uses to display its leaderboard appears robust against the shenanigans of a small group of raters who want to inflate the ranking of certain animes.
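The dataset does not say how MAL actually computes its ranking, so the following is only a sketch of one common technique that produces this kind of robustness: a damped (Bayesian-style) average that pulls the rating of a show with few raters toward a site-wide prior. The toy titles, the prior mean, the prior weight, and the use of `members` as a stand-in for the number of ratings are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "name": ["Niche show", "Popular show"],
    "rating": [9.6, 9.2],
    "members": [30, 500000],
})

prior_mean = 6.5     # assumed site-wide average rating
prior_weight = 1000  # assumed number of "virtual" ratings placed at the prior mean

# Damped mean: shows with few members stay close to the prior,
# shows with many members keep (almost) their raw average.
toy["damped_rating"] = (
    toy["members"] * toy["rating"] + prior_weight * prior_mean
) / (toy["members"] + prior_weight)

print(toy.sort_values("damped_rating", ascending=False))
```

Under these assumed parameters the niche show drops below the popular one even though its raw average is higher, which is the qualitative behavior seen in the leaderboard ordering.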

Let's investigate whether the rating distribution for the suspiciously low member animes differs from the rating distribution overall. First, we can plot the distributions split on member count.

In [11]:
sns.distplot(anime.rating.dropna(), bins=20) # overall distribution
sns.jointplot(x='members', y='rating', data=anime) # scatterplot of rating vs members

In [12]:
sns.distplot(anime[(anime.members >= 100)].rating.dropna(), bins=20) # distribution for animes with at least 100 members
sns.jointplot(x='members', y='rating', data=anime[(anime.members >= 100)]) # scatterplot of rating vs members >= 100

In [13]:
sns.distplot(anime[(anime.members < 100)].rating.dropna(), bins=20) # distribution for animes with less than 100 members
sns.jointplot(x='members', y='rating', data=anime[(anime.members < 100)]) # scatterplot of rating vs members < 100

The rating distribution for animes with fewer than 100 members is visually different from that of the overall distribution. Setting the cutoff at 100 members and up seems to roughly capture the same overall shape as the full distribution at first glance.

To quantitatively assess this difference, we can turn to the Two-Sample Kolmogorov-Smirnov Test (https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), which tests whether two samples are drawn from the same distribution. For this test, we will randomly sample 100 values from the low members (< 100), normal members (>= 100), and the overall distribution. The KS statistic is defined as the maximum absolute difference between the empirical cumulative distribution functions (CDFs) of the two samples. If the samples are drawn from the same distribution, we would expect this value to be very low as their CDFs should be approximately equal.
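To make the definition concrete, here is a minimal sketch that computes the KS statistic by hand from two empirical CDFs and compares it with `stats.ks_2samp`. The two samples are made-up normal draws rather than ratings from the dataset, and `ks_statistic` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def ks_statistic(a, b):
    """Maximum absolute gap between the empirical CDFs of two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)  # fraction of a <= x
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)  # fraction of b <= x
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=6.5, scale=1.0, size=100)  # made-up "ratings"
sample_b = rng.normal(loc=7.0, scale=1.0, size=100)

manual_stat = ks_statistic(sample_a, sample_b)
scipy_stat, p_value = stats.ks_2samp(sample_a, sample_b)
print(manual_stat, scipy_stat, p_value)
```

The hand-rolled value should match the statistic returned by scipy; the p-value is what scipy adds on top.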

In [14]:
low_members = anime[(anime.members < 100)].rating.dropna().sample(100, random_state = 0)
all_members = anime.rating.dropna().sample(100, random_state = 0)
stats.ks_2samp(low_members, all_members)

In [15]:
normal_members = anime[(anime.members >= 100)].rating.dropna().sample(100, random_state = 0)
stats.ks_2samp(normal_members, all_members)

We can see here that the KS statistic for the difference between the low member distribution and the overall distribution is quite high, corresponding to a p-value of 1.78 x 10^-3. At the 5% significance level, we therefore reject the null hypothesis that the low member ratings are drawn from the same overall rating distribution.
Meanwhile, the analogous KS statistic for the difference between the normal member distribution and the overall distribution is very low, corresponding to a p-value of 7.94 x 10^-1. At the 5% significance level, we cannot reject the null hypothesis that the normal member ratings are drawn from the same distribution as the overall rating distribution.

We have set a sample seed of 0 to ensure reproducibility in this example, but we can also repeatedly sample and calculate the KS statistics and associated p-value. Taking 1000 such random samples and storing the results of the KS Test for the low member distribution vs overall distribution, we see that we almost never sample a p-value higher than 0.05 (approx 0 or 1 times out of 1000). Analogously for the normal member distribution vs overall distribution, we will sample a p-value lower than 0.05 about 10% of the time (approx 100 times out of 1000).

In [16]:
low = []
norm = []
for i in range(1000):
    low_members = anime[(anime.members < 100)].rating.dropna().sample(100)
    all_members = anime.rating.dropna().sample(100)
    norm_members = anime[(anime.members >= 100)].rating.dropna().sample(100)
    low_stat, low_p = stats.ks_2samp(low_members, all_members)
    norm_stat, norm_p = stats.ks_2samp(norm_members, all_members)
    low.append((low_stat, low_p))
    norm.append((norm_stat, norm_p))
low_values = pd.DataFrame.from_records(low, columns = ['KS_stat', 'p_value'])
norm_values = pd.DataFrame.from_records(norm, columns = ['KS_stat', 'p_value'])

In [17]:
low_values[low_values.p_value >= 0.05].p_value.count() / len(low_values)

In [18]:
norm_values[norm_values.p_value < 0.05].p_value.count() / len(norm_values)

# Part 1b:  Content-based Recommender System

There is a ton of cool stuff to find out about this dataset, and we have barely scratched the surface. However, we can start implementing a recommender system using the content-based approach. In this dataset, each anime has a list of associated genre tags in the 'genre' column. We can perform feature extraction to convert these tags into a binary feature vector of genres for each anime show (i.e. 1 if it belongs to a given genre or 0 if it does not). For example *Kimi no Na Wa* will have a value of 1 for Drama, Romance, School, and Supernatural and 0 for all other genres. One easy way to do this is by using the get_dummies function in pandas.
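As a small illustration of what this feature extraction produces, here is a sketch on a toy frame. The titles are hypothetical; the first row reuses the four tags mentioned above for *Kimi no Na Wa*, and the last row shows what happens to a missing genre entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the tag format ("A, B, C") mirrors the dataset.
toy = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C"],
    "genre": ["Drama, Romance, School, Supernatural", "Action, Sci-Fi", np.nan],
})

# One column per distinct tag, 1 if the show carries the tag and 0 otherwise.
toy_features = toy["genre"].str.get_dummies(sep=", ")
print(toy_features)
```

Note that the separator includes the space after the comma, and that a show with no genre string ends up as an all-zero row.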

## Feature extraction of genre data

In [19]:
genre_features = anime.genre.str.get_dummies(sep=', ')
print(genre_features.shape)
genre_features.head()

We now have information for all 43 genres stored in the genre_features dataframe, and we can do some analysis on it. Let's plot the genres that are most highly represented.

In [20]:
genre_counts = pd.DataFrame(genre_features.sum()).reset_index().rename(columns={'index':'genre', 0:'count'})
genre_counts.sort_values('count', ascending = False, inplace = True)
sns.barplot(x = 'count', y = 'genre', data = genre_counts.head(20))

Comedy is by far the most common genre tag. Genre tags are not mutually exclusive: a show can be tagged Comedy and also belong to a different genre (e.g. Romance). Meanwhile, other pairs of tags rarely occur together. For example, a Fantasy show is unlikely to also be Sci-Fi, since the two take place in different kinds of fictional settings. A Shounen show (targeted at an approximately PG-13 male audience) would probably be correlated with Action or Adventure, while a Shoujo show (targeted at an approximately PG-13 female audience) would probably be correlated with Romance or Slice of Life.

## Genre correlations

We can investigate the correlations further by plotting a heatmap of the correlation matrix between the genres. For the sake of preventing overcrowding, let's focus on the 20 most common genres.

In [21]:
genre_list = [genre for genre in genre_counts.genre.head(20)]
sns.heatmap(genre_features[genre_list].corr(), vmax = 0.6)

The brightest spots correspond to the highest correlation scores between the genres. The diagonal is the correlation of each genre with itself, so it is always 1. However, there are some other bright spots on here that we didn't mention earlier. Mecha (giant robot animes) is correlated with Action and especially with Sci-Fi. Magic is correlated with Fantasy (as a thematic element) and also with Shoujo (presumably due to the popularity of the Magical Girl sub-genre of shows, such as Sailor Moon).

Now let's use the genre features to calculate the similarity between shows. The similarity function we will use is the cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity), which takes in two vectors and returns the dot product of the vectors over the product of their magnitudes. The cosine similarity of a vector with itself will be 1, while the cosine similarity of a vector with an orthogonal vector is 0. Here, we generate the cosine similarity for all 12294 anime shows.
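Before running this on the full feature matrix, a toy example may help show what the numbers mean. The three binary vectors below are hypothetical genre vectors over four made-up columns, and `cosine` is a hand-written helper used only to check the sklearn result.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical columns: Comedy, Drama, Romance, School
show_a = np.array([1, 1, 1, 1])
show_b = np.array([0, 1, 1, 0])  # shares two tags with show_a
show_c = np.array([1, 0, 0, 1])  # shares no tags with show_b

print(cosine(show_a, show_a))  # identical tag sets
print(cosine(show_a, show_b))  # partial overlap
print(cosine(show_b, show_c))  # no overlap

# sklearn computes every pairwise value at once, one row per show.
toy_matrix = cosine_similarity(np.vstack([show_a, show_b, show_c]))
print(toy_matrix.shape)
```

For binary tag vectors the formula reduces to the number of shared tags divided by the square root of the product of the two tag counts, so two shows only score 1 when their tag sets are identical.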

In [22]:
cosine_matrix = cosine_similarity(genre_features)
print(cosine_matrix.shape)

## Naive approach: Weighted cosine similarity

Let's use this cosine similarity matrix to generate relevant shows based on a naive approach. For a given input show, we'll calculate a "cosine_score" which is the cosine similarity with all shows weighted by the average rating of each show, and return a sorted list of the top 15 shows that have the highest cosine score. Naturally, the highest cosine score will come from the input entry, which has perfect cosine similarity of 1 with itself, and likely a high average score (since the user enjoyed the show), so we should drop this value from consideration.

In [55]:
def recommend_anime(anime_index):
    cos_similarity = pd.DataFrame(cosine_matrix[anime_index]).rename({0: 'cos_sim'}, axis = 'columns')
    weighted_score = pd.DataFrame(anime.rating * cos_similarity.cos_sim, columns = ['cos_score'])
    result = pd.concat([anime.name, anime.genre, cos_similarity.cos_sim, weighted_score.cos_score], axis = 'columns')
    title, genres = anime.loc[anime_index, ['name', 'genre']]
    print("Shows similar to '%s', which has genres '%s' are:" % (title, genres))
    return result.drop([anime_index]).sort_values(by='cos_score', ascending = False)[['name', 'cos_score']].head(15)

In [61]:
my_favorite_animes = ['Monster', 'Full Metal Panic? Fumoffu', 'Juuni Kokuki', 'Mononoke Hime', 'Ghost in the Shell: Stand Alone Complex', 'Nana', 'Higurashi no Naku Koro ni', 'Clannad', 'Tengen Toppa Gurren Lagann', 'Steins;Gate', 'Kimi no Na wa.']
#my_favorite_animes = ['Boku no Hero Academia', 'Shingeki no Kyojin', 'One Punch Man', 'Fullmetal Alchemist', 'Nanatsu no Taizai', 'One Piece']

anime_list_indices = anime.loc[anime.name.isin(my_favorite_animes)].index
for index in anime_list_indices:
    output = recommend_anime(index)
    print(output)

This naive approach seems to work quite well. It is very good at finding sequels, prequels, movies, and other titles in the same series as the input title if they exist. This is probably because titles in the same series usually share the same genre tags, so their cosine similarity is at or near 1 and their cosine scores are close unless the ratings or the genres varied significantly between seasons. For one-off shows like *Kimi no Na wa* or *Mononoke Hime*, it was able to recommend other shows that do have similar themes. Curiously, I noticed that *Clannad: After Story*, the sequel to *Clannad*, did not appear in the list of recommendations for the latter show. Let's see if we can figure out why.

In [62]:
my_favorite_animes = ['Clannad: After Story', 'Clannad']

anime_list_indices = anime.loc[anime.name.isin(my_favorite_animes)].index
for index in anime_list_indices:
    output = recommend_anime(index)
    print(output)

Aha! The genre tags did change significantly between the two seasons (Comedy and School were dropped, and Fantasy was added in the more serious sequel), and therefore they are not similar enough using this cosine score metric. In any case, this seems to be a reasonable baseline implementation of the content-based recommendation system.

Next, we will want to use the user rating data to implement a collaborative filtering approach to recommending new animes.

# Part 2a: Exploratory Data Analysis

Let's check out the user ratings dataset.

In [63]:
user = pd.read_csv('../input/rating.csv')
user.head()

There seem to be a lot of values of -1 on a 1 to 10 scale. The documentation explains that -1 indicates that the user watched the anime and didn't assign a rating, so it's a placeholder for NaN. We can replace the values with NaN.

In [64]:
user["rating"] = user["rating"].replace(-1, np.nan) # -1 means "watched but not rated"; regex matching is not needed for a numeric value
user.head()

In [65]:
user.describe()

In [66]:
user.rating.value_counts()

In [67]:
user[(user.rating.isin([7,8,9]))].rating.value_counts().sum() / user.rating.value_counts().sum()

In [68]:
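sns.countplot(x='rating', data=user) # pass the column by keyword; newer seaborn versions treat the first positional argument as data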

Most users appear to give shows fairly high ratings. Only 2.9% of the ratings were below 5, and 67.5% of the ratings were either 7, 8 or 9, with 8 being the most common rating. This likely reflects a selection effect: users who enjoyed a show are more likely to rate it.
MAL requires that all ratings be integer values between 1 and 10, and this type of coarse graining makes it difficult to differentiate between ratings when most (97.1%) of the data is in the higher end of this range. Perhaps MAL should allow 0.5 point increments from 10 down to 5, plus a catch-all for ratings less than 5/10. This would allow us to pick out finer details about the distribution that are lost when using bins that are too large.

Let's take a look at the average ratings of these users.

In [69]:
avg = user.groupby('user_id').rating.mean().sort_values(ascending=False).dropna()
print(avg.head())
print(avg.tail())

There appear to be a lot of very round numbers. This can arise from users only rating one show, or rating everything the same value. Looking at the rating distributions of these users will not give anything meaningful, as subtracting away the user's mean leaves exactly 0 for every rating. We can determine the number of people who fall into this category as follows:

In [70]:
user.groupby('user_id').rating.mean().isin([float(i+1) for i in range(10)]).value_counts()
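One caveat with checking for whole-number means is that it also counts users whose mixed ratings happen to average to an integer (for example a 7 and a 9). A stricter sketch, shown here on hypothetical toy data rather than the real ratings file, counts the distinct rating values per user instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; user 5 watched a show but never rated anything.
toy_user = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4, 4, 5],
    "rating":  [10, 10, 7, 7, 9, 8, 9, np.nan],
})

per_user = toy_user.groupby("user_id").rating.agg(["mean", "count", "nunique"])

# Check used above: is the user's mean rating a whole number from 1 to 10?
integer_mean = per_user["mean"].isin([float(i) for i in range(1, 11)])

# Stricter check: the user gave exactly one distinct rating value.
constant_rater = per_user["nunique"] == 1

print(per_user)
print(integer_mean.sum(), constant_rater.sum())
```

In this toy frame, user 3 has an integer mean without being a constant rater, which is exactly the case the stricter check separates out.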

In [71]:
sns.distplot(user.groupby('user_id').rating.mean().dropna(), bins = 20)

As expected, there is a heavy spike for the people who gave only ratings of 10, but the overall peak is around 8.

# Part 2b: Collaborative filtering
TBD
- Implement collaborative filtering
- Write Introduction/motivation
- Do content-based analysis without low member data
- Rescrape data to include more features (e.g. airdate)