**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

- Tony Bai
- Ryan Regala
- Jiwon Kim
- Colin Isidro
- Rambharath Saravanan

# Research Question

Based on an anime's genre(s), number of episodes, animation studio(s), streaming platform(s), and source material (manga, light novel, visual novel, etc.), can we predict its score on MyAnimeList, a platform that scores each anime from 1 to 10 by aggregating the scores given by its users?

## Background and Prior Work

For our project, we want to look at several attributes of an anime and see whether we can use them to predict its score on MyAnimeList. From our research, we know that MyAnimeList has the largest collection of anime data with the least missing data. The same source mentions that there were no strict web scraping policies as of early 2024, but further research suggests that scraping the website would not be allowed, although we would be allowed to get similar data from their API if we wanted. We also know a lot about how MyAnimeList calculates its weighted average anime scores from previous projects. <a name="ref-2"></a>[<sup>2</sup>](#ref-2) On top of that, we found that a non-linear regression model would likely be more accurate than a linear one, since the scores themselves are calculated with a non-linear formula.<a name="ref-1"></a>[<sup>1</sup>](#ref-1)

Another project suggests that multimodal data decreases the error in prediction models for MyAnimeList scores. <a name="ref-2"></a>[<sup>2</sup>](#ref-2) So, our project could take that into consideration alongside the genre, episodes, licensing platforms, and source material mentioned in the research question. One project also demonstrates the good practice of checking the collinearity of the variables being used.<a name="ref-1"></a>[<sup>1</sup>](#ref-1)
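
As a rough illustration of what a collinearity check could look like for us, the sketch below flags pairs of numeric features whose absolute correlation is above a cutoff. The data is randomly generated, the column names `Members` and `Favorites` are hypothetical stand-ins, and the 0.9 cutoff is an assumption rather than a value taken from the prior work.

```python
import numpy as np
import pandas as pd

# Randomly generated toy features for illustration only (not real data).
rng = np.random.default_rng(0)
members = rng.normal(size=200)
features = pd.DataFrame({
    "Members": members,
    # Nearly a rescaled copy of Members, so the two should be flagged.
    "Favorites": members * 0.9 + rng.normal(scale=0.1, size=200),
    "Episodes": rng.normal(size=200),
})

corr = features.corr().abs()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Assumed cutoff: pairs above 0.9 are candidates for dropping one feature.
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

Any pair that shows up here would be a candidate for keeping only one of the two features before fitting a regression model.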

## Metis Project 2: Prediction Model on Anime Rating Score
In this project, the author builds a prediction model for MyAnimeList scores. They first scraped the data from the website and created a collinearity heat map to remove highly collinear features. They then tried multiple linear regression models and settled on the one with the least error (Polynomial). They found that being based on a manga, being added as a Favorite on MAL, and being produced by Production I.G. correlated with popular anime rating scores. However, they also noted that MAL calculates its anime scores with a non-linear formula, while this prediction model was more linear.

1. <a name="ref1"></a> [^](#ref-1)Ting, K. S. (2021, December 16). Metis Project 2: Prediction Model on Anime Rating Score. Medium. https://medium.com/@sitingkoh1808/metis-project-2-prediction-model-on-anime-rating-score-65d9b5e3a6

## Anime Popularity Prediction Before Huge Investments: a Multimodal Approach Using Deep Learning
This project aims to predict the popularity of an anime (based on MAL scores) using multimodal text-image data. They used a three-input deep neural network that takes the synopsis, main character descriptions, and main character portraits as input. They found that their model worked best with the multimodal data when predicting MAL ratings. However, they note that the model could be improved with more RAM allocated to the analysis, allowing images to be encoded into more tokens.

2. <a name="ref2"></a> [^](#ref-2) Armenta-Segura, J., & Sidorov, G. (2024). Anime Popularity Prediction Before Huge Investments: a Multimodal Approach Using Deep Learning. arXiv preprint arXiv:2406.16961.




# Hypothesis


Our hypothesis is that the more popular genres, more well known animation studios, and more well known streaming platforms will be strongly positively correlated with a higher score compared to less popular genres, animation studios, and streaming platforms; an anime's episode count will be positively correlated with its score; and the source material of an anime will have no correlation with its score. 

More popular genres and well-known animation studios and streaming platforms, by virtue of being more popular and well known, will have higher scores because they already have a fanbase that will inevitably be biased and think more highly of them. We think an anime's episode count will be positively correlated with its score because most of the highest-rated animes on MyAnimeList have more episodes than the typical episode count nowadays, which is 12. We think an anime's source material will have no correlation with its score because we think all the source materials have similar amounts of good and bad adaptations.
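
One possible way to check the episode-count part of this hypothesis is a rank correlation, which does not assume the relationship is linear. The sketch below uses made-up numbers and a hypothetical helper named `spearman`; it computes Spearman's rho as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy values for illustration only; these are not rows from the real dataset.
toy = pd.DataFrame({
    "Episodes": [12, 13, 24, 26, 52, 64],
    "Score": [6.8, 7.1, 7.0, 7.9, 8.2, 8.6],
})

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())

rho = spearman(toy["Episodes"], toy["Score"])
print(f"Spearman rho between Episodes and Score: {rho:.3f}")
```

A value near +1 would be consistent with the hypothesis, a value near 0 would suggest no monotonic relationship, and a negative value would contradict it.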

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Anime Dataset 2023
  - Link to the dataset: https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset/data
  - Number of observations: 24905
  - Number of variables: 24

In Anime Dataset 2023, we have data on animes as recent as September 2022 (Summer 2022 season) and as far back as the 1900s. There are a total of 24 columns in this dataset, but the ones we care about are name, score, genres, episodes, studio, source, and licensors. 

Name contains the English pronunciation of the name of every anime, score contains the aggregated score from every user that rated the anime, genres contains the genres of each anime (often several per anime), episodes contains the number of episodes, studio contains the name of the animation studio(s) that adapted the show into an anime (e.g. Bones, Madhouse, Toei Animation, etc.), source contains the source material that each anime was adapted from (anime original, manga, light novel, etc.), and licensors contains the platform(s) the anime was streamed on (e.g. Crunchyroll, Funimation, Bandai Entertainment, etc.). 

The first cleaning step is to filter out all the animes in the dataset that don't have "TV" in the Type column, since we want to focus only on animes that aired as TV shows; including other types (movies, OVAs, ONAs, etc.) in our model could skew our results since those are a different medium. In addition, there is a lot of missing data (marked as UNKNOWN) in the dataset, which we can fill in ourselves using MyAnimeList's API. If any values are still UNKNOWN after that, we can remove those rows because the data doesn't exist; the most common explanations are that the anime is still airing (unknown episode count) or that no one has ever scored it (unknown aggregate score).
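
Before running the real cleaning, a small sketch like the one below can show how much data is affected. It uses a toy table rather than the Kaggle file; only the column names and the `UNKNOWN` placeholder follow the dataset described above.

```python
import pandas as pd

# Toy rows for illustration; the values are made up.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type": ["TV", "Movie", "TV", "OVA"],
    "Score": ["7.5", "8.1", "UNKNOWN", "6.0"],
    "Episodes": ["12", "1", "UNKNOWN", "2"],
})

# Keep only shows that aired as TV series.
tv_only = toy[toy["Type"] == "TV"]

# Count how many cells in each column still hold the 'UNKNOWN' placeholder.
unknown_counts = (tv_only == "UNKNOWN").sum()
print(unknown_counts)
```

Columns with a large count are the ones worth filling in through an API before deciding which rows to drop.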

## Anime Dataset 2023

In [None]:
import pandas as pd
import requests
import numpy as np
import time

In [None]:
def scoreScraping(subset : pd.DataFrame):
    '''For reading in MyAnimeList scores using its API'''
    url = "https://api.myanimelist.net/v2/anime/"
    clientHeader = {'X-MAL-CLIENT-ID':'be63c0a5e8517ce10df18a744cbf9045'}
    for i in range(subset.shape[0]):
        row = subset.iloc[i]

        if row.get('Score') == 'UNKNOWN': # if score is missing
            anime_id = row.get('anime_id')
            index = row.name

            response = requests.get(url + str(anime_id) + '?fields=mean', headers=clientHeader)
            if response.status_code == 200: # if response is returned successfully
                json = response.json()
                if 'mean' in json: # if a score was read in
                    subset.loc[index, 'Score'] = json['mean']

    return subset

In [None]:
def aniListScraping(dataset : pd.DataFrame):
    '''For reading in genres, episode count, studios, streaming website, and source off AniList API'''
    url = 'https://graphql.anilist.co'
    query = '''
    query ($idMal: Int!) { # Define which variables will be used in the query (id)
        Media (idMal: $idMal, type: ANIME) { # Insert our variables into the query arguments (id) (type: ANIME is hard-coded in the query)
            genres
            episodes
            studios{
                nodes{
                    name
                    isAnimationStudio
                }
            }   
            streamingEpisodes{
                site
            }
            source
        }
    }
    '''
    # get the subset of rows with missing data and the index labels of that subset
    subset = dataset[(dataset.get('Genres') == 'UNKNOWN') | (dataset.get('Episodes') == 'UNKNOWN') | (dataset.get('Studios') == 'UNKNOWN') 
                     | (dataset.get('Licensors') == 'UNKNOWN') | (dataset.get('Source') == 'Unknown')]
    indices = subset.index

    for index in indices:
        entry = dataset.loc[index]
        variables = {'idMal': int(entry.get('anime_id'))} # plain int so the value is JSON serializable
        response = requests.post(url, json={'query': query, 'variables': variables})

        if response.status_code == 429: # if ratelimit was reached
            time.sleep(61)
            response = requests.post(url, json={'query': query, 'variables': variables})

        if response.status_code == 200: # if response is returned successfully
            data = response.json()['data']['Media']

            if entry.get('Genres') == 'UNKNOWN' and 'genres' in data:
                dataset.loc[index, 'Genres'] = ','.join(data['genres'])
            if entry.get('Episodes') == 'UNKNOWN' and 'episodes' in data:
                dataset.loc[index, 'Episodes'] = data['episodes']
            if entry.get('Studios') == 'UNKNOWN' and 'studios' in data:
                studios = data['studios']['nodes']
                cellEntry = []
                for studio in studios:
                    if studio['isAnimationStudio']:
                        cellEntry.append(studio['name'])
                dataset.loc[index, 'Studios'] = ','.join(cellEntry)
            if entry.get('Licensors') == 'UNKNOWN' and 'streamingEpisodes' in data:
                if len(data['streamingEpisodes']) != 0:
                    dataset.loc[index, 'Licensors'] = data['streamingEpisodes'][0]['site']
            if entry.get('Source') == 'Unknown' and 'source' in data:
                dataset.loc[index, 'Source'] = data['source']


In [None]:
# import initial dataset
dataset = pd.read_csv('anime-dataset-2023.csv')
dataset = dataset[dataset.get('Type') == 'TV'].get(['anime_id', 'Name', 'Score', 'Genres', 'Episodes', 'Studios', 'Licensors', 'Source'])

Because MyAnimeList's API forces a cooldown if it is accessed too many times in a short period, we wait 5 minutes between chunks of API requests.

In [None]:
# process the rows in chunks of 500, using half-open positional ranges [start, stop)
chunk_size = 500
n_rows = dataset.shape[0]

for start in range(0, n_rows, chunk_size):
    stop = min(start + chunk_size, n_rows)
    print(f'Starting chunk {start}')
    # work on a copy of the chunk, then write the filled-in rows back by position
    dataset.iloc[start:stop] = scoreScraping(dataset.iloc[start:stop].copy())
    print(f'Finished chunk {start}')
    if stop < n_rows: # no need to wait after the last chunk
        print('Starting 5 min cooldown')
        time.sleep(300)
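
Chunking by position is easy to get off by one, so a quick sanity check on the chunk boundaries can help. The sketch below uses a hypothetical helper named `chunk_bounds` and only checks the arithmetic; it makes no API calls.

```python
def chunk_bounds(n_rows: int, chunk_size: int):
    """Yield (start, stop) pairs of half-open positional ranges covering 0..n_rows."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# 7597 rows in chunks of 500, matching the sizes used in this notebook.
bounds = list(chunk_bounds(7597, 500))

# Every row should be covered exactly once: the sizes add up to the total
# and each chunk starts where the previous one stopped.
covered = sum(stop - start for start, stop in bounds)
contiguous = all(bounds[i][1] == bounds[i + 1][0] for i in range(len(bounds) - 1))
print(bounds[0], bounds[-1], len(bounds), covered, contiguous)
```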


Because this single cell took more than 1 hour to run, we save the edited dataset as 'edited-anime-dataset.csv' so we won't have to run it again.

In [None]:
dataset.to_csv('edited-anime-dataset.csv', index=False)

In [None]:
# import edited dataset
editedset = pd.read_csv('edited-anime-dataset.csv')
# check how many unknown scores we have now
editedset[editedset.get('Score') == 'UNKNOWN'].shape

(2807, 8)

Although we ran our function to call the MAL API for the missing scores, many animes still have no score because no one has ever scored them. Since an anime's score is crucial to our model, we will remove the rows where it is missing because they are unhelpful.

In [None]:
editedset = editedset[editedset.get('Score') != 'UNKNOWN']
editedset[editedset.get('Score') == 'UNKNOWN'].shape

(0, 8)
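
Because the Score column started out holding the text `UNKNOWN`, its values may be stored as text rather than numbers. A minimal sketch of converting such a column is shown below, on toy values rather than the real column.

```python
import pandas as pd

# Toy values for illustration: strings as read from a CSV, a float such as one
# filled in from an API, and the 'UNKNOWN' placeholder.
scores = pd.Series(["7.55", 6.39, "UNKNOWN", "8.69"], name="Score")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_scores = pd.to_numeric(scores, errors="coerce")

print(numeric_scores.dtype)
print(numeric_scores.isna().sum())  # how many entries were not numeric
```

Checking the NaN count after conversion is a cheap way to confirm that no placeholder values slipped through the filter.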

We will now use AniList's API to fill in any missing data in Genres, Episodes, Studios, Licensors, and Source, since AniList's API is more sophisticated and lets us get Licensors, a piece of information that we cannot get from MyAnimeList's API.

In [None]:
aniListScraping(editedset)

This one function call also took more than an hour because AniList's API has a rate limit of 30 calls per minute, and once it is reached we are put on a 1-minute cooldown. To save time, we will also save this final dataset. After this, we no longer need to fetch information from APIs.
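
The rate-limit handling could also be pulled into a small reusable helper. The sketch below is one possible shape, not what our function does: `post_with_retry` is a hypothetical name, it assumes the server may send a `Retry-After` header giving a wait in seconds (common for HTTP 429 responses, but an assumption here), and it falls back to a fixed wait otherwise. The demonstration uses stub responses so it runs without network access.

```python
import time

def post_with_retry(send, max_attempts=3, default_wait=60, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send: zero-argument callable returning an object with .status_code and
          .headers, for example lambda: requests.post(url, json=payload).
    """
    response = send()
    for _ in range(max_attempts - 1):
        if response.status_code != 429:
            break
        # Assumption: Retry-After, if present, is a number of seconds.
        wait = int(response.headers.get("Retry-After", default_wait))
        sleep(wait)
        response = send()
    return response

# Offline demonstration with stub responses.
class StubResponse:
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}

queue = iter([StubResponse(429, {"Retry-After": "2"}), StubResponse(200)])
waits = []  # record the requested waits instead of actually sleeping
result = post_with_retry(lambda: next(queue), sleep=waits.append)
print(result.status_code, waits)
```

Passing the request as a callable keeps the retry logic separate from the query itself, so the same helper could wrap both the MyAnimeList and AniList calls.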

In [None]:
editedset.to_csv('final-anime-dataset.csv', index=False)

In [None]:
finalset = pd.read_csv('final-anime-dataset.csv')

In [None]:
print(finalset[(finalset.get('Genres') == 'UNKNOWN') | (finalset.get('Episodes') == 'UNKNOWN') | (finalset.get('Studios') == 'UNKNOWN') | 
                     (finalset.get('Licensors') == 'UNKNOWN')].shape)
print(finalset[(finalset.get('Licensors') == 'UNKNOWN')].shape)
finalset[(finalset.get('Licensors') == 'UNKNOWN')].head()

(1732, 8)
(1732, 8)


| | anime_id | Name | Score | Genres | Episodes | Studios | Licensors | Source |
|---|---|---|---|---|---|---|---|---|
| 6 | 17 | Hungry Heart: Wild Striker | 7.55 | Comedy, Slice of Life, Sports | 52.0 | Nippon Animation | UNKNOWN | Manga |
| 12 | 23 | Ring ni Kakero 1 | 6.39 | Action, Sports | 12.0 | Toei Animation | UNKNOWN | Manga |
| 33 | 62 | D.C.: Da Capo | 6.72 | Drama, Romance | 26.0 | feel., Zexcs | UNKNOWN | Visual novel |
| 64 | 102 | Aishiteruze Baby★★ | 7.44 | Comedy, Drama, Romance | 26.0 | TMS Entertainment | UNKNOWN | Manga |
| 65 | 103 | Akazukin Chacha | 7.49 | Adventure, Comedy, Fantasy, Romance | 74.0 | Gallop | UNKNOWN | Manga |


After pulling information from AniList's API, we still have a lot of unknown information, and every affected row is missing at least the Licensors column. This is most likely because these animes aired on TV before streaming platforms like Crunchyroll existed, so they aren't on any streaming platform. We are free to remove these animes since that information simply doesn't exist.

In [None]:
finalset = finalset[(finalset.get('Licensors') != 'UNKNOWN')]
finalset.shape

(3058, 8)

In [None]:
print(finalset.isnull().sum(axis=0))

anime_id      0
Name          0
Score         0
Genres        4
Episodes      7
Studios      16
Licensors     0
Source        0
dtype: int64


Looks like we also got some nulls from the function call

In [None]:
finalset[finalset.get('Genres').isnull()]

| | anime_id | Name | Score | Genres | Episodes | Studios | Licensors | Source |
|---|---|---|---|---|---|---|---|---|
| 1968 | 8753 | Ultraman Kids: Haha wo Tazunete 3000-man Kounen | 6.09 | NaN | 26.0 | NaN | Mill Creek Entertainment | Other |
| 2399 | 15547 | Cross Fight B-Daman eS | 6.45 | NaN | 52.0 | SynergySP | ADK Emotions NY | Unknown |
| 2685 | 21835 | Majin Bone | 6.62 | NaN | 52.0 | Toei Animation | Crunchyroll | Game |
| 2731 | 22735 | Oreca Battle | 5.86 | NaN | 51.0 | Xebec, OLM | Crunchyroll | Game |


Searching for these animes on AniList shows that they simply have no genres listed; one possible reason is that they are all children's shows. Since we want genres in our analysis, we will drop these animes because they are unhelpful for our prediction.

In [None]:
finalset[finalset.get('Episodes').isnull()]

| | anime_id | Name | Score | Genres | Episodes | Studios | Licensors | Source |
|---|---|---|---|---|---|---|---|---|
| 10 | 21 | One Piece | 8.69 | Action, Adventure, Fantasy | NaN | Toei Animation | Funimation, 4Kids Entertainment | Manga |
| 159 | 235 | Detective Conan | 8.17 | Adventure, Comedy, Mystery | NaN | TMS Entertainment | Funimation, Crunchyroll | Manga |
| 522 | 966 | Crayon Shin-chan | 7.77 | Comedy, Ecchi | NaN | Shin-Ei Animation | Funimation | Manga |
| 1488 | 4459 | Ojarumaru | 6.32 | Adventure, Award Winning, Comedy, Fantasy | NaN | Gallop | Enoki Films | Original |
| 1732 | 6149 | Chibi Maruko-chan (1995) | 7.27 | Comedy, Slice of Life | NaN | Nippon Animation | Crunchyroll | Manga |
| 3172 | 32353 | Bonobono (TV 2016) | 6.33 | Comedy, Slice of Life | NaN | Eiken | Crunchyroll | 4-koma manga |
| 4451 | 50250 | Chiikawa | 5.68 | Comedy | NaN | Doga Kobo | Sentai Filmworks | Web manga |


These animes are still airing, so it makes sense that they don't have a final episode count. Since we want episode counts in our analysis, we will need to drop these animes.

In [None]:
finalset[finalset.get('Studios').isnull()].head()

| | anime_id | Name | Score | Genres | Episodes | Studios | Licensors | Source |
|---|---|---|---|---|---|---|---|---|
| 1223 | 3202 | Daisuki! Hello Kitty | 6.27 | Fantasy | 26.0 | NaN | ADV Films | Original |
| 1298 | 3519 | Garakuta-doori no Stain | 6.37 | Award Winning, Comedy | 13.0 | NaN | Funimation | Unknown |
| 1394 | 3880 | Makyou Densetsu Acrobunch | 5.9 | Sci-Fi | 24.0 | NaN | Discotek Media | Original |
| 1416 | 4025 | Asobou! Hello Kitty | 5.88 | Adventure, Fantasy | 26.0 | NaN | ADV Films | Unknown |
| 1467 | 4244 | Ginga Shippuu Sasuraiger | 6.4 | Action, Adventure | 43.0 | NaN | Discotek Media | Original |


Searching for these animes on AniList shows that they either have no studios listed or list only producers instead of the traditional animation studios we want. We are free to drop these animes as well.

In [None]:
finalset = finalset.dropna()
finalset.shape

(3032, 8)
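
Genres, Studios, and Licensors can each hold several values in one cell, and the values filled in by our AniList function are joined with `,` while the original Kaggle values use `, `. One possible way to prepare such a column for analysis is sketched below on toy rows: normalize the separator, then build one 0/1 indicator column per genre.

```python
import pandas as pd

# Toy rows for illustration; note the two separator styles (", " and ",").
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Genres": ["Action, Sports", "Comedy,Drama", "Action"],
})

# Remove whitespace around commas so both styles produce the same labels.
normalized = toy["Genres"].str.replace(r"\s*,\s*", ",", regex=True)

# One indicator column per genre: 1 if the anime has that genre, else 0.
genre_flags = normalized.str.get_dummies(sep=",")
print(genre_flags)
```

The same steps would apply to Studios and Licensors, and the resulting indicator columns can be used directly for counts, group averages, or as model inputs.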

# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

### Section 1 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

### Section 2 of EDA if you need it  - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

We got our data from a Kaggle dataset that was built from MyAnimeList.com. Viewer privacy is preserved because we only work with the anonymous, aggregated opinions of viewers on each show. However, there is potential bias because genre preferences depend on the viewer.
For instance, if the most common shows are shonen, then highly rated animes on MyAnimeList that belong to other genres could be underrepresented.
Throughout the project, we will also make sure the data we collect does not expose anyone's personal information.
One problem that can affect our data analysis is that the ratings might be skewed towards popular genres, which impacts the fairness of rankings. To handle this issue, we can analyze the distribution of genres in the dataset to check for any disproportion.
1. Calculate the distribution of anime genres in the dataset. We can do this by visualizing the distribution with bar charts and comparing the genres.
2. Analyze ratings by genre. Segment the data by genre and compute the average rating for each. By doing so, we can identify any genres with significantly higher or lower ratings, which could be overrepresented.
3. Check if there is a correlation between popularity and ratings. This can be done by examining the relationship between the two, and checking if highly popular anime genres have higher ratings.
4. Ensure the sample of anime included in the dataset is representative of the full range of genres. If certain genres are underrepresented, consider stratified sampling methods to balance the dataset.
5. Check for bias through statistical testing for significant differences in ratings across genres.
6. Mitigate any bias by weighting ratings by genre representation to balance their impact. Alternatively, we could keep separate rankings for each genre to avoid cross-genre comparison bias.
7. Clearly document any biases detected and the steps taken to address them. Ensure transparency by explaining how the data was sourced and any limitations it might have.
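
The first two steps above could be sketched roughly as follows. The rows are made up, and the sketch assumes genres are separated by `, ` as in the original Kaggle values; it counts how many animes carry each genre and averages their scores.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Score": [7.5, 6.0, 8.0, 6.5],
    "Genres": ["Action, Comedy", "Comedy", "Action", "Drama, Comedy"],
})

# One row per (anime, genre) pair.
per_genre = toy.assign(Genre=toy["Genres"].str.split(", ")).explode("Genre")

# Steps 1 and 2: how many animes carry each genre, and their average score.
summary = per_genre.groupby("Genre")["Score"].agg(count="count", mean_score="mean")
print(summary.sort_values("count", ascending=False))
```

Genres with a very small count have unreliable averages, which is one reason to look at the counts and the means side by side.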

# Team Expectations 

* *We will use Discord as our main form of communication, and a reasonable response time is a day as we are busy uni students that have our own schedules*
* *We will meet virtually once a week over Discord call*
* *Decision making will be unanimous*

Tentative Project Responsibility
* *Tony will work on wrangling the data and modeling*
* *Ryan will work on discussing Ethics & Privacy*
* *Colin will work on discussing similar works done prior*
* *Jiwon and Rambharath will work on EDA*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/19  |  4:30 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  |  Use Discord to call about topics to brainstorm | 
| 2/5  |  7:00 PM |  Decide on the topic; Search for datasets on Kaggle | Draft project proposal, have everyone contribute to writing the proposal | 
| 2/9  | 10:30 AM  | Edit, finalize, and submit proposal | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part  |
| 2/16  | 3 PM  | Import & Wrangle Data; EDA (AnimeList Prediction) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 3 PM  | Finalize wrangling/EDA; Begin Analysis (Anime Prediction Analysis) | Discuss/edit Predictions Analysis; Complete project check-in |
| 3/2   | 3 PM  | Complete analysis; Draft results/conclusion/discussion (AnimeList on predictions)| Discuss/edit full project |
| 3/9   | 3 PM  | Get project around 90% done | Discuss final changes we want to make |
| 3/16  | 3 PM  | N/A | Discuss last minute changes before turning the project in |