In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns



## EDA

Let's learn about our users.

In [19]:
df_users_cleaned = pd.read_csv("./data/users_cleaned.csv")
df_users_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108711 entries, 0 to 108710
Data columns (total 17 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   username                  108710 non-null  object 
 1   user_id                   108711 non-null  int64  
 2   user_watching             108711 non-null  int64  
 3   user_completed            108711 non-null  int64  
 4   user_onhold               108711 non-null  int64  
 5   user_dropped              108711 non-null  int64  
 6   user_plantowatch          108711 non-null  int64  
 7   user_days_spent_watching  108711 non-null  float64
 8   gender                    108711 non-null  object 
 9   location                  108706 non-null  object 
 10  birth_date                108711 non-null  object 
 11  access_rank               0 non-null       float64
 12  join_date                 108711 non-null  object 
 13  last_online               108711 non-null  o
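
The summary above shows that `access_rank` has no non-null values and that the date columns load as `object`. A minimal sketch of one way to tidy this, run on a tiny made-up frame that only mimics a few of these columns (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Tiny stand-in for df_users_cleaned; all rows are made up for illustration.
df = pd.DataFrame({
    "username": ["a", None, "c"],
    "access_rank": [float("nan")] * 3,
    "join_date": ["2013-03-03 00:00:00", "2012-01-15 00:00:00", "2010-07-01 00:00:00"],
    "last_online": ["2014-02-04 01:32:00", "2015-06-30 12:00:00", "2011-01-01 08:15:00"],
})

# Drop columns that contain no data at all (access_rank here).
df = df.dropna(axis="columns", how="all")

# Parse the timestamp strings into datetime columns.
for col in ["join_date", "last_online"]:
    df[col] = pd.to_datetime(df[col])

# Count the remaining missing values per column.
missing = df.isna().sum()
print(missing)
```

The same calls should apply to the full frame, assuming the date strings there follow the format seen in the sample row.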

In [20]:
df_users_cleaned['gender'].value_counts()

gender
Male          70880
Female        37330
Non-Binary      501
Name: count, dtype: int64

In [21]:
df_users_cleaned['location'].value_counts()

location
Poland                      1656
Germany                     1132
Brazil                      1022
Canada                       900
California                   848
                            ... 
Melacca, Malaysia              1
not important                  1
Mexico city, Mexico            1
Vancouver, Washington486       1
nhollywood, california         1
Name: count, Length: 40438, dtype: int64
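
With 40438 distinct free-text values, `location` probably needs normalizing before it is useful. The sketch below shows one rough heuristic (strip, lowercase, keep the text after the last comma) on a made-up sample modeled loosely on the values above; it is not a geocoder and will still mix countries with states.

```python
import pandas as pd

# Made-up sample of free-text locations, loosely modeled on the output above.
locations = pd.Series([
    "Poland", "poland ", "Chennai, India ", "Melacca, Malaysia",
    "California", "nhollywood, california", None,
])

# Normalize whitespace and case.
cleaned = locations.str.strip().str.lower()

# Keep the text after the last comma as a rough "region" guess.
region = cleaned.str.split(",").str[-1].str.strip()

region_counts = region.value_counts()
print(region_counts)
```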

In [23]:
df_users_cleaned.iloc[0]

username                               karthiga
user_id                                 2255153
user_watching                                 3
user_completed                               49
user_onhold                                   1
user_dropped                                  0
user_plantowatch                              0
user_days_spent_watching              55.091667
gender                                   Female
location                        Chennai, India 
birth_date                  1990-04-29 00:00:00
access_rank                                 NaN
join_date                   2013-03-03 00:00:00
last_online                 2014-02-04 01:32:00
stats_mean_score                           7.43
stats_rewatched                             0.0
stats_episodes                             3391
Name: 0, dtype: object

In [12]:
# The user data doesn't include the list of anime each user watched, so let's explore the other CSVs.

df_anime_lists_cleaned = pd.read_csv("./data/animelists_cleaned.csv")
df_anime_lists_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31284030 entries, 0 to 31284029
Data columns (total 11 columns):
 #   Column               Dtype  
---  ------               -----  
 0   username             object 
 1   anime_id             int64  
 2   my_watched_episodes  int64  
 3   my_start_date        object 
 4   my_finish_date       object 
 5   my_score             int64  
 6   my_status            int64  
 7   my_rewatching        float64
 8   my_rewatching_ep     int64  
 9   my_last_updated      object 
 10  my_tags              object 
dtypes: float64(1), int64(5), object(5)
memory usage: 2.6+ GB
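
At 31 million rows and 2.6+ GB, this frame is expensive to keep in memory. One possible mitigation is to read only the columns needed and request narrower dtypes. The sketch uses a small in-memory CSV with made-up rows in place of the real file, and the dtype choices are assumptions (ids fitting in `int32`, scores and statuses in `int8`) that would need checking against the actual data.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for ./data/animelists_cleaned.csv (made-up rows).
csv_text = """username,anime_id,my_watched_episodes,my_score,my_status,my_tags
karthiga,21,586,9,1,
karthiga,59,26,7,2,
other_user,21,100,8,1,fav
"""

# Assumed dtypes: narrower integers and a categorical username.
dtypes = {
    "username": "category",
    "anime_id": "int32",
    "my_watched_episodes": "int32",
    "my_score": "int8",
    "my_status": "int8",
}

# usecols skips every column not listed; dtype avoids the default int64/object.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes)
df_small.info(memory_usage="deep")
```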


In [17]:
df_anime_lists_cleaned.iloc[0]
# Each row has an anime_id, how many episodes of that anime the user watched, and the score they gave it.
# This is good. Now we just need the mapping from anime_id to the anime's name in English or Japanese.

username                          karthiga
anime_id                                21
my_watched_episodes                    586
my_start_date                   0000-00-00
my_finish_date                  0000-00-00
my_score                                 9
my_status                                1
my_rewatching                          NaN
my_rewatching_ep                         0
my_last_updated        2013-03-03 10:52:53
my_tags                                NaN
Name: 0, dtype: object
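
The sample row uses `0000-00-00` for its start and finish dates, which looks like a placeholder for "not set". A small sketch of how such values could be turned into missing timestamps, assuming the column otherwise follows the `YYYY-MM-DD` format (the series below is made up):

```python
import pandas as pd

# Made-up sample mixing the placeholder with ordinary dates.
dates = pd.Series(["0000-00-00", "2013-03-03", "2014-02-04"])

# errors="coerce" turns anything that is not a valid date into NaT.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
```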

In [25]:
df_anime_cleaned = pd.read_csv("./data/anime_cleaned.csv")
df_anime_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6668 entries, 0 to 6667
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   anime_id         6668 non-null   int64  
 1   title            6668 non-null   object 
 2   title_english    3438 non-null   object 
 3   title_japanese   6663 non-null   object 
 4   title_synonyms   4481 non-null   object 
 5   image_url        6666 non-null   object 
 6   type             6668 non-null   object 
 7   source           6668 non-null   object 
 8   episodes         6668 non-null   int64  
 9   status           6668 non-null   object 
 10  airing           6668 non-null   bool   
 11  aired_string     6668 non-null   object 
 12  aired            6668 non-null   object 
 13  duration         6668 non-null   object 
 14  rating           6586 non-null   object 
 15  score            6668 non-null   float64
 16  scored_by        6668 non-null   int64  
 17  rank          

In [26]:
df_anime_cleaned[df_anime_cleaned['anime_id'] == 21]

|  | anime_id | title | title_english | title_japanese | title_synonyms | image_url | type | source | episodes | status | ... | broadcast | related | producer | licensor | studio | genre | opening_theme | ending_theme | duration_min | aired_from_year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 34 | 21 | One Piece | One Piece | ONE PIECE | OP | https://myanimelist.cdn-dena.com/images/anime/... | TV | Manga | 0 | Currently Airing | ... | Sundays at 09:30 (JST) | {'Adaptation': [{'mal_id': 13, 'type': 'manga'... | Fuji TV, TAP, Shueisha | Funimation, 4Kids Entertainment | Toei Animation | Action, Adventure, Comedy, Super Power, Drama,... | ['#01: "We Are! (ウィーアー!)" by Hiroshi Kitadani ... | ['#01: "memories" by Maki Otsuki (eps 1-30)', ... | 24.0 | 1999.0 |
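
Since `anime_id` 21 resolves to One Piece, the list rows can be joined to titles on `anime_id`. A minimal sketch with two made-up frames standing in for `df_anime_lists_cleaned` and `df_anime_cleaned`; the id `999999` is invented to show what an unmatched row looks like.

```python
import pandas as pd

# Made-up stand-ins for the two real frames.
lists = pd.DataFrame({
    "username": ["karthiga", "karthiga"],
    "anime_id": [21, 999999],  # 999999 is a hypothetical id with no title row
    "my_score": [9, 7],
})
anime = pd.DataFrame({"anime_id": [21], "title": ["One Piece"]})

# A left join keeps every list row; indicator=True adds a _merge column
# that marks rows without a matching title as "left_only".
merged = lists.merge(anime[["anime_id", "title"]], on="anime_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

Checking `unmatched` on the real data would show how many list entries point at anime missing from `anime_cleaned.csv`.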


## Brainstorming:

Ideas:
1. Use k-nearest neighbors (KNN) to find similar anime from item features such as genre, along the lines of this notebook: https://github.com/Mohitkumar6122/Anime-Recommendation/blob/master/Anime_Recommend_using_KNN.ipynb

2. Use frequent pattern mining (very similar to our hw1 with movies, except that now we want to generate recommendations from an input anime).
    > What if the input anime is NOT in our list? Do we scrape the web for it?
    > If the input anime is in our list, the task is easy...

3. Use RNNs / Transformers (more advanced and not covered in this class).
> Ideas taken from here: https://ieeexplore.ieee.org/document/9873070
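
To make idea 1 concrete, here is a rough brute-force nearest-neighbor sketch over genre sets using Jaccard similarity. It is not the linked notebook's method; the titles other than One Piece and all genre sets are made up, and a real version would parse the `genre` column of `df_anime_cleaned`.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up genre sets; "Show B/C/D" are hypothetical titles.
genres = {
    "One Piece": {"Action", "Adventure", "Comedy"},
    "Show B": {"Action", "Adventure"},
    "Show C": {"Romance", "Drama"},
    "Show D": {"Comedy", "Slice of Life"},
}


def nearest(title, genres, k=2):
    """Return the k titles whose genre sets are most similar to `title`."""
    target = genres[title]
    scored = [(jaccard(target, g), other) for other, g in genres.items() if other != title]
    # Highest similarity first; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


neighbors = nearest("One Piece", genres, k=2)
print(neighbors)
```

Brute force is fine for a few thousand titles; a library KNN implementation would mainly help if more features are added.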
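
Idea 2 can be sketched in its simplest form as counting which pairs of anime co-occur in enough user lists, then recommending the partners of the input anime. The user lists, the ids other than 21, and the `min_support` threshold below are all made up; a full frequent pattern miner would also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Made-up user lists as sets of anime_id values.
user_lists = {
    "u1": {21, 20, 1535},
    "u2": {21, 20},
    "u3": {21, 1535},
    "u4": {20, 1535, 21},
}
min_support = 3  # hypothetical threshold: a pair must appear in at least 3 lists

# Count every unordered pair of anime that appears together in a list.
pair_counts = Counter()
for items in user_lists.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}


def recommend(anime_id, frequent_pairs):
    """Return anime that frequently co-occur with anime_id, most frequent first."""
    scores = Counter()
    for (a, b), count in frequent_pairs.items():
        if a == anime_id:
            scores[b] += count
        elif b == anime_id:
            scores[a] += count
    return [item for item, _ in scores.most_common()]


print(recommend(21, frequent_pairs))
print(recommend(999999, frequent_pairs))  # unseen id: nothing to recommend
```

An input anime that never appears in the lists yields an empty result, which is the open question noted under idea 2.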