Anime_Rec

A Data Science Portfolio Project by Sam Kenkel

Purpose: Build a recommendation engine to recommend Anime (Japanese Cartoons) based on the existing data on MyAnimeList.net.

Anime fans (or 'Otaku', as they prefer to be called), like consumers of any media, can use assistance from modern machine learning and data science techniques in deciding Anime to watch.

The project consist of three phases, Data Collection (webscraping large quantities of user reviews and score from myanimelist.net), item-item filtering(using user scores, tf-idf of user reviews, and latent factors to return anime that are similar to anime user's like), and score prediction (5 different neural nets who are trained on differently processed data).

The Data

Collection Process

A number of scripts are used to gather data, because MyAnimeList (referred to as mal from here on out) has a history of DDOS attacks, they no longer give out production API keys. As such, all data for this project was acquired via scraping.

Usernames

MAL does not give any way to enumerate users, except via a search function. MAL_username Scrape generates a pseudo random collection of valid mal usernames by randomly choosing a three digit alphanumeric string, searching for usernames containing that string, getting every user, and repeating. I stopped collecting usernames once I had 1.5 million user_names, as I deemed that to be a large enough sample to allow for adequate data collection.

User Scores

Once I have a set of usernames (A list->set-> list conversion removes duplicates nicely) Mal-Score-Scraper is used. This script uses Mal-Scraper (https://github.com/QasimK/mal-scraper ) to get scores (a number from 1-10) and status (Watching, Completed, On Hold, etc) for each user. Once I have this information I use pandas to save it to a dataframe in memory, then write out to a csv. It's important to note that I parallelized this aspect of the scrape (with three systems using a vpn and different vpn endpoints) and that this could be easily parallelized to an arbitrary number of systems.

The raw data of user scores looks like:

After running a scraper, I can convert this to a data frame:

Anime_Score_Scraper.ipynb shows the code I used for this. It's important to note that I have a UNKEY (Which is unique key) field. This allows for multiple score files to be merged later and duplicates easily removed.

Reviews

Users can give reviews of shows:

We can scrape into a dataframe

This gives me plaintext reviews of 6000 out of 11000 shows (the other 5000 shows lack reviews because they are too obscure for a user to have actively written a review. This is another example of nonresponse bias in this dataset).
Anime_Review_scraper is used for this purpose. To prep for this, another tool (Anime_Info_Scraper) had to pull the URLs of every Anime's review page. (While each Anime's info page can be found by using myanimelist.net/ANIME_ID/ to get to an anime's review page you must have the correct myanimelist.net/ANIME_ID/Anime_NAME/Reviews address. Anime_Info_Scrape was used to generate that url, get a a description and genre for every anime, and get the url of the user-recommendations pages (for future use).

EDA

Exploratory Data analysis

We can look at this data, and notice some important trends:

Here are scores with the 0's removed.

Here is a comparison of average scores based on status.

AVG Scores a user gives as a function of how many reviews the user has done.

Notably: There are significant differences in how otaku score shows that they have not seen, but plan to vs shows they are seeing vs shows have seen and decided to stop watching.

In the Data, any time a user indicates they are watching, will watch or have begun watching but stopped a show without giving it a score, the score is represented by a 0.

Data (Continued)

Other Notes:

It's important to note that all of this data is explicit user feedback from a site that requires users to register in order to rate shows.

This means that the "Average" user of this site is likely not a casual fan of Anime (The 'nonresponse' bias of fans who are not willing to register on a website is significant). Any analysis, or modeling done on this data may not apply to casual Anime fans.

Finally, as with all explicit feedback based data on media 'Guilty Pleasures' are not modeled well. Anime that is popular and controversial, but that Otaku do not rate highly in a public forum, will not be well served.

The data does not have any time elements to it, so separating scores that users have given to shows recently, or shows that they cannot have seen (because they have not yet aired) or shows that they likely have seen years ago is impossible with the data available.

Architecture

As a brief explanation of the overall architecture, Once the Score and Review scrapers have been used to generate full data-lake of user scores and Anime Reviews, there is multi-step process used to generate a recommendation. To show the larger architecture before going into the components:

In short, Each user's N highest scored shows (scaled by the user's avg score, and scaled by the Anime's avg score) are used to make seed shows.

In a 'production' environment, if the user had rated less than N these would be determined by multi-arm bandit testing, or by business needs.

Each of these seed shows is fed to three different item-item similarity filters, which return the K most similar Anime (ignoring Anime that the user has either scores or marked a status for.

This generates between K (if the K most similar non-completed anime from each of the three filters were the same, for every show) and 3KN (if there is no overlap) of 'possible recommendations'. In a production environment new releases or a static list of shows could be inserted at this step.

The 'possible recommendations' are all fed to 3 different user score prediction systems (Neural Nets). All of these neural nets have the same architecture, but are trained to predict a different kind of score. Each of these networks predicts the user's score for each of 'possible recommendations', and returns a ranked list.

Those T-3T Anime are the recommendations for the user.

Generating Possible Recommendations

Three Types of Item-Item Similarity filtering

Type 1: User Score Cosine Distance collaborative filtering

Anime_Score_Sim shows the code for this. First, we drop all 0's. Next, we find the average score for each user, and subtract that score from every user's score. This is normalize the data (Without doing this, user's having different ideas of what constitutes a 6 vs a 7 vs 8, adds a lot of noise to the data). Then make a sparse matrix of every user x every anime (with the user's scaled scores being at the intersection). After some remapping to get those into anime names for eda, we can spot check to confirm that our similar anime are logical: If we look at Death Note, one of the most popular and successful anime of all time, we get a number of other very popular and successful shows, including Attack On Titan (which is directed by the same director, Tetsuro Araki). However, it's hard to judge these recommendations as this almost just a list of 'Crossover Anime'.

Let's look at something more obscure: Girls Und Panzer, the twee show about 'Cute Girls riding in Tanks as a highschool sport' is a little more informative. We see a great number of spin-off and sequel works (MAL lists prequels and sequels as different anime), as well as a number of other 'Cute Girls doing Cute Things shows', such as Hidamari Sketch and Non Non Biyori.

If we examine the polarizing show Bakemonogatari: We see one of the downsides of this form of recommendation system. We see 10 recommendations for the sequels to Bakemonogatari, and 1 recommendation for another work by the same author (Katanagatari). While this certainly indicates that these are similar shows, it is likely that using this system of similarity with the MAL dataset will give 'possible recs' that the user has already seen, but not scored, or is already aware of.

Type 2: Anime Review NLP TF-IDF Cosine Distance collaborative filtering

Review_Sim shows the code for this. We begin by merging all of our reviews into a one dataframe and removing duplicates. Then we generate a dataframe that consists of each row containing one anime with every review's text merged together. Then we use a TF-IDF (Term Frequency, Inverse Document Frequency) vectorizer to generate a sparse matrix for each anime. We use some custom stop-words (such as anime), which we want the vectorizer to ignore. N-grams are set from 1-3, which means that two word phrases (Such as 'moe trash', are treated as a word to vectorized and counted) In short, this is taking every 1 to 3 word phrase in every review, determining how often that 'word' appears in review text for one anime (such as cowboy bebop) and compares it to how often that 'word' appears in all reviews.

There are few ways we can vectorize this, namely, we can do all words, with all counts, all words with a binary(count of 0 or 1) or we can slice this down to 900 most common words. To clean up the text before we vectorize it, we remove some common stop words, and replace that name of the anime with the word 'anime_name' (which is also ignored) With this sparse matrix of values, we can once again compute cosine pairwise distance between all shows. (So shows which are described by anime fans using similar uncommon words will be closer to each other). We then turn these similarity scores into a dataframe which allows us to look at shows most similar to each other.

Then we check the results. For Death Note we see: A who's who of very popular, well liked anime, but this is a different list than user score distance gave us. (Specifically, this list tilts towards older, 'darker' shows.

If we check Girls Und Panzer: we see excellent results. This list is not only different from the user score distances, but is a wide variety of "Moe" (Cute girls doing cute things) shows, with the highest scored show being "High School Fleet" which is a show that has the same premise, but with battleships rather than tanks.

When we look at Bakemonogatari, we also see a marked improvement. We see the shows direct sequel (Nisemonogatari), and a show based on a novel by the same author (Katanagatari), but we see a number of similar shows (Durara, Spice and Wolf, Mawaru Penguindrum) that are different from what user score distance returned.

Type 3: Aime Hidden Factors( HF) Cosine Distance collaborative filtering

Mal_Keras_Fm_Clustering shows the code for this. What I do is make a factorization machine neural net. This neural net embeds each user and each anime as N hidden factors, then makes the score a simple dot product of these hidden factors. After training this neural network, we use the embedding for each show (which are the shows 'hidden factors') and do cosine pairwise distance. By changing the number of hidden factors, and the amount of regularization, we can modify how 'hidden' these factors become.

Here is the network that is used to generate this

The more Latent factors and the heavier the regularization, the stranger the similar anime become. Without regularization and too few hidden factors the results overlap too much with prior 2 methods. (The shows similar to bakemonogatari are all bakemonogatari sequels) In a production environment, multi-arm bandit testing would help tweak these hyperparameters (number of hidden factors, regularization). The best results I have found so far are regularizing User factors but not anime factors. These 'logical' groupings found have been with 56 hidden factors, user l2(.0065), anime l2(0) It's important to note that because there are two alternative methods which provide 'traditional' answers, this can be used to help with 'novel' or 'wildcard' recs. If this was the core of a recommende system without either of the other two item-item similarity methods, this would be better left with less hidden factors and less regularization.

Here are some examples of these 'wildcard' similarities

Predicting User Scores

3 Neurals Nets for collaborative filtering

There are 3 pre-trained neural nets. Each net has been trained to predict one type of score: Score, Usr Scaled Score, Anime Scaled score. The Nets are loaded with their trained weights, only the user embedding layer is unlocked . Each neural net predicts the user’s score for every possible rec (from the item-item methods). These predicted scores are sorted to give recommendations.

Every neural net has the same architecture:

Architecture:

Neural Net 1: Predicting user score

Data Prep:All scores of 0 (status but no score) are dropped. Scores from outlier (<25 scores, >1500 scores) users are dropped.
No Modifications are made for Watching, On-hold, Dropped, etc. Both users and Animes are embedded with N hidden factors. I use 36 factors in the final model. (These factors are not shared between models)

The point of this neural net is simply try to chase the highest score (Completed, In Progress and Want to Watch will be ignored in this net). Because it is trained against the entire dataset, this will tilt towards 'popular' shows.

Neural Net 2: Only 'completed' or 'Dropped' shows matters: User_scaled Score

Data Prep:All scores not listed with a status of ‘Completed’ or ‘Dropped’ are ignored. Scores from outlier (<25 scores, >1500 scores) users are dropped. Both users and Animes are embedded with 36 latent factors. (These factors are not shared between models) On the 1 million score test set, the best MAE was: 0.916866 The final MAE was 0.949197 This neural net is trying to predict shows that users will like more than their personal average.

Neural Net 3: Only 'completed' or 'Dropped' shows matters: User_scaled Score

Data Prep:All scores not listed with a status of ‘Completed’ or ‘Dropped’ are ignored. Scores from outlier (<25 scores, >1500 scores) users are dropped. Both users and Animes are embedded with 36 latent factors. (These factors are not shared between models) On the 1 million score test set, the best MAE was:0.932877 The final MAE was 0.98392

The Guilty Pleasure Variation: Predict completed rather than score

New Hyper-Parameter( Max # of reviews in the dataset to be recommended): The Neural Net has the same architecture as the score predictor but with a sigmoid function on the final layer. Trained on the entire 22 million User, Anime, Status data. This model requires different hyperparameter tuning than the score predictors: The best I've found so far: n_highest_rated_shows= 7 k_nearest_shows = 5 t_highest_predicted_score=15 l_max_reviews = 30000

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Data		Data
Readmeimg		Readmeimg
readmepics		readmepics
Anime_Score_Sim-Production.ipynb		Anime_Score_Sim-Production.ipynb
Anime_Score_Sim.ipynb		Anime_Score_Sim.ipynb
Anime_info_Scraper.ipynb		Anime_info_Scraper.ipynb
Data_Prep_1.ipynb		Data_Prep_1.ipynb
KerasEnsemble.ipynb		KerasEnsemble.ipynb
Keras_Score_Pred_Production_original.ipynb		Keras_Score_Pred_Production_original.ipynb
Keras_Score_Pred_Production_working_hipster.ipynb		Keras_Score_Pred_Production_working_hipster.ipynb
Keras_Score_Pred_Weight_generation.ipynb		Keras_Score_Pred_Weight_generation.ipynb
Keras_Weights_Status_Pred.ipynb		Keras_Weights_Status_Pred.ipynb
Live_Demo_Item-Item similarity.ipynb		Live_Demo_Item-Item similarity.ipynb
Live_Demo_Recommender-most-watchable.ipynb		Live_Demo_Recommender-most-watchable.ipynb
Live_Demo_Recommender.ipynb		Live_Demo_Recommender.ipynb
MalScores.png		MalScores.png
Mal_Keras_FM_Clustering.ipynb		Mal_Keras_FM_Clustering.ipynb
Mal_Keras_FM_CosimilarDistanceV2.ipynb		Mal_Keras_FM_CosimilarDistanceV2.ipynb
Mal_Keras_LF_COSINE_Distance.ipynb		Mal_Keras_LF_COSINE_Distance.ipynb
Mal_Review_NLP.ipynb		Mal_Review_NLP.ipynb
Mal_username_scrape.ipynb		Mal_username_scrape.ipynb
README.md		README.md
Recommender_hip.ipynb		Recommender_hip.ipynb
Recommender_original.ipynb		Recommender_original.ipynb
Review_Scrapper.ipynb		Review_Scrapper.ipynb
Review_Sim.ipynb		Review_Sim.ipynb
Review_Sim_production.ipynb		Review_Sim_production.ipynb
Score_Scraper.ipynb		Score_Scraper.ipynb
Screenshot from 2017-07-24 23-20-24.png		Screenshot from 2017-07-24 23-20-24.png
Screenshot from 2017-07-24 23-31-02.png		Screenshot from 2017-07-24 23-31-02.png
Screenshot from 2017-07-25 08-58-36.png		Screenshot from 2017-07-25 08-58-36.png
Screenshot from 2017-07-25 09-01-00.png		Screenshot from 2017-07-25 09-01-00.png
Untitled.ipynb		Untitled.ipynb
anime_rec_arch3.jpg		anime_rec_arch3.jpg
dataframe.png		dataframe.png
dot1.png		dot1.png
dot2.png		dot2.png
example_anime_info.png		example_anime_info.png
example_ratings.png		example_ratings.png
lfbake.png		lfbake.png
lfdeathnote.png		lfdeathnote.png
lfgirlsundpanz.png		lfgirlsundpanz.png
nn1_1.png		nn1_1.png
nn5_1.png		nn5_1.png
readme.html		readme.html
review.png		review.png
reviewframe.png		reviewframe.png
reviews per user.png		reviews per user.png
score by status.png		score by status.png
tfidf3bakemonogatari.png		tfidf3bakemonogatari.png
tfidf3deathnote.png		tfidf3deathnote.png
tfidf3girlsundpanzer.png		tfidf3girlsundpanzer.png
tfidf3katanagatari.png		tfidf3katanagatari.png
tfidf3sao.png		tfidf3sao.png
tfidf900deathnote.png		tfidf900deathnote.png
tfidf900girlsundpanzer.png		tfidf900girlsundpanzer.png
user reviews.png		user reviews.png
user_score_cosim_girlsundpanzer.png		user_score_cosim_girlsundpanzer.png
user_score_cosim_haikyu.png		user_score_cosim_haikyu.png
user_score_cosim_sao.png		user_score_cosim_sao.png
user_score_cosrim_bakemono.png		user_score_cosrim_bakemono.png
users_cosim_deathnote.png		users_cosim_deathnote.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Anime_Rec

A Data Science Portfolio Project by Sam Kenkel

The Data

Collection Process

Usernames

User Scores

Reviews

EDA

Exploratory Data analysis

Data (Continued)

Other Notes:

Architecture

Generating Possible Recommendations

Three Types of Item-Item Similarity filtering

Type 1: User Score Cosine Distance collaborative filtering

Type 2: Anime Review NLP TF-IDF Cosine Distance collaborative filtering

Type 3: Aime Hidden Factors( HF) Cosine Distance collaborative filtering

Predicting User Scores

3 Neurals Nets for collaborative filtering

Neural Net 1: Predicting user score

Neural Net 2: Only 'completed' or 'Dropped' shows matters: User_scaled Score

Neural Net 3: Only 'completed' or 'Dropped' shows matters: User_scaled Score

The Guilty Pleasure Variation: Predict completed rather than score

About

Releases

Packages

Languages

Skenkel/Anime_Rec

Folders and files

Latest commit

History

Repository files navigation

Anime_Rec

A Data Science Portfolio Project by Sam Kenkel

The Data

Collection Process

Usernames

User Scores

Reviews

EDA

Exploratory Data analysis

Data (Continued)

Other Notes:

Architecture

Generating Possible Recommendations

Three Types of Item-Item Similarity filtering

Type 1: User Score Cosine Distance collaborative filtering

Type 2: Anime Review NLP TF-IDF Cosine Distance collaborative filtering

Type 3: Aime Hidden Factors( HF) Cosine Distance collaborative filtering

Predicting User Scores

3 Neurals Nets for collaborative filtering

Neural Net 1: Predicting user score

Neural Net 2: Only 'completed' or 'Dropped' shows matters: User_scaled Score

Neural Net 3: Only 'completed' or 'Dropped' shows matters: User_scaled Score

The Guilty Pleasure Variation: Predict completed rather than score

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages