# Step 3: Scrape MyAnimeList Information

**Metis Project 2, Andrew Zhou**

Now that we've linked the anime from our sales database to MyAnimeList, we scrape all the information we want from MAL.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import pickle
import time
import sys

# Make the project-level `utilities` package importable from this notebook
sys.path.append('..')

from utilities.scraping_utilities import \
    get_anime_link, get_anime_data, create_mal_info_df

import numpy as np

In [2]:
anime_sales_df = pd.read_pickle("../data/anime_sales_df_matched.pickle")
studio_df = pd.read_pickle("../data/studio_df.pickle")

## Looking at the Studio DataFrame

Let's take a look at the format of the data.

In [19]:
studio_df.head(5)

| | link | anime_info |
|---|---|---|
| 8bit | https://myanimelist.net/anime/producer/441/8bit | {'Tensei shitara Slime Datta Ken': 'https://my... |
| A-1 Pictures | https://myanimelist.net/anime/producer/56/A-1_... | {'Sword Art Online': 'https://myanimelist.net/... |
| A-Real | https://myanimelist.net/anime/producer/1257/A-... | {'Kenka Banchou Otome: Girl Beats Boys': 'http... |
| A.C.G.T. | https://myanimelist.net/anime/producer/179/ACGT | {'Freezing': 'https://myanimelist.net/anime/93... |
| Acca effe | https://myanimelist.net/anime/producer/2085/Ac... | {'Strike Witches: 501 Butai Hasshin Shimasu!':... |

In [20]:
studio_df.iloc[20]["anime_info"]

{'Uchuu Koukyoushi Maetel: Ginga Tetsudou 999 Gaiden': 'https://myanimelist.net/anime/1377/Uchuu_Koukyoushi_Maetel__Ginga_Tetsudou_999_Gaiden',
 'Chou Kuse ni Narisou': 'https://myanimelist.net/anime/2771/Chou_Kuse_ni_Narisou',
 'Shima Shima Tora no Shimajirou': 'https://myanimelist.net/anime/9768/Shima_Shima_Tora_no_Shimajirou'}

This is an example of the `anime_info` entry for a single studio: a dictionary that maps each anime title to its MAL URL.

## Matching Anime to MAL Links

Now let's match each anime in the sales data to its MAL link. We use fuzzy matching because some titles differ slightly between the two sources. Anime that fail to match are dropped. A few will match erroneously and require manual cleaning, which is left for future work.

In [21]:
def get_anime_link_helper(x):
    """
    Row-wise helper passed to DataFrame.apply below. Preferred to a
    lambda function for clarity.
    """
    # Look up the {title: link} dictionary for this row's studio
    studio_anime_info = studio_df.loc[x["studio"], "anime_info"]
    return get_anime_link(x["title"], studio_anime_info,
                          fuzzy_match=True, ratio=90)

anime_sales_df["link"] = anime_sales_df.apply(get_anime_link_helper, axis=1)
anime_sales_df = anime_sales_df.dropna(subset=["link"])
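
The implementation of `get_anime_link` lives in `utilities.scraping_utilities` and is not shown in this notebook. As a rough illustration of what threshold-based fuzzy matching can look like, here is a minimal sketch using the standard library's `difflib`. The function name `fuzzy_lookup` and its scoring are assumptions made for this example, not the project's actual code; the real utility may use a different similarity measure.

```python
from difflib import SequenceMatcher


def fuzzy_lookup(title, anime_info, ratio=90):
    """Hypothetical stand-in for fuzzy title matching (not get_anime_link).

    Returns the link whose title is most similar to `title`, or None when
    the best similarity (on a 0-100 scale) falls below `ratio`.
    """
    best_link, best_score = None, 0.0
    for candidate, link in anime_info.items():
        # SequenceMatcher.ratio() is in [0, 1]; rescale to 0-100
        score = 100 * SequenceMatcher(
            None, title.lower(), candidate.lower()
        ).ratio()
        if score > best_score:
            best_link, best_score = link, score
    return best_link if best_score >= ratio else None


# Two entries taken from the studio example shown earlier
anime_info = {
    "Chou Kuse ni Narisou":
        "https://myanimelist.net/anime/2771/Chou_Kuse_ni_Narisou",
    "Shima Shima Tora no Shimajirou":
        "https://myanimelist.net/anime/9768/Shima_Shima_Tora_no_Shimajirou",
}

close_match = fuzzy_lookup("Chou Kuse ni Narisou!", anime_info)  # near-identical title
no_match = fuzzy_lookup("Cowboy Bebop", anime_info)              # unrelated title
```

In this sketch an unmatched title yields `None`, which is the kind of missing value that the `dropna(subset=["link"])` call would remove. A high threshold such as 90 trades recall for precision: fewer wrong links, but more titles dropped.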

## Scraping MAL Anime Data

Now we finally scrape anime data from MAL, specifying the information we want to keep.

In [None]:
keep_cols = ["episodes", "broadcast", "genres", "duration", "rating",
             "score", "members", "favorites"]

mal_info_df = create_mal_info_df(anime_sales_df[["title", "link"]], keep_cols)

Because `create_mal_info_df` can fail on some pages when we're rate-limited, we include code that finds the anime whose query failed (those with a null `score`), scrapes MAL information for only those anime, and merges the result into our existing dataframe. If some queries fail again, we can rerun this code until we have scraped everything we need.

In [35]:
# Uncomment and rerun as needed to re-scrape rows whose score is still missing.
# additional_info = create_mal_info_df(anime_sales_df[mal_info_df["score"].isnull()][["title", "link"]], keep_cols)
# mal_info_df = mal_info_df.combine_first(additional_info)
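
The rerun-until-complete idea can also be written as a loop. The sketch below is a plain-Python illustration under stated assumptions: `scrape_with_retries` and `flaky_fetch` are hypothetical names, the fetcher is a fake that simulates rate-limiting, and none of this reflects how `create_mal_info_df` works internally.

```python
import time


def scrape_with_retries(links, fetch, max_rounds=5, base_delay=0.0):
    """Fetch every link, retrying only the failures in each later round.

    `links` maps title -> URL. `fetch` is any callable returning a dict of
    fields, or None on failure (a hypothetical stand-in for a real scraper).
    Returns (results, pending), where `pending` holds titles that never
    succeeded within `max_rounds`.
    """
    results = {}
    pending = dict(links)
    for round_number in range(max_rounds):
        if not pending:
            break
        if round_number > 0:
            # Wait longer before each successive retry round
            time.sleep(base_delay * 2 ** (round_number - 1))
        still_failing = {}
        for title, link in pending.items():
            info = fetch(link)
            if info is None:
                still_failing[title] = link
            else:
                results[title] = info
        pending = still_failing
    return results, pending


# Fake fetcher: fails the first time it sees a URL, succeeds afterwards.
# The URLs and the score value are made up for illustration.
_seen = set()


def flaky_fetch(link):
    if link not in _seen:
        _seen.add(link)
        return None
    return {"score": 7.5}


results, unresolved = scrape_with_retries(
    {"Anime A": "https://example.com/a", "Anime B": "https://example.com/b"},
    flaky_fetch,
)
```

Capping the number of rounds keeps the loop from running forever if a page is permanently unavailable; anything left in `unresolved` can then be inspected by hand. For real scraping, a nonzero `base_delay` would be needed to give the rate limit time to reset.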

In [37]:
anime_sales_df.to_pickle("../data/anime_sales_df_linked.pickle")
mal_info_df.to_pickle("../data/mal_info_df.pickle")