# My Anime List Web Scraping 2021
### *Kim Santos*

<font size ='4' /> This notebook demonstrates web scraping of the Top Anime section of My Anime List using Python in JupyterLab. The data is scraped and then wrangled into a clean data set for future use. <font/>

---

## **Set Up**


<font size ='4' /> The libraries needed for web scraping and for creating dataframes are imported.   <br />
**Selenium:** Automates web browsers. It navigates to specific areas of a web page and scrapes the data. <br />
**Pandas:** Used to create a dataframe, which is then saved to a .csv file. <br />
<font />

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import pandas as pd

## **Web Scraping**

<font size = '4'> This section scrapes over 4,000 anime from the 'Top Anime' section of https://myanimelist.net/. The following information is scraped: **titles, scores, genres, types, episodes, status, aired, premiered, broadcasts, producers, licensors, studios, sources, duration, rating, popularity, members, favorite and rank**. <br />
    <br />
To scrape the data, the Selenium library and the ChromeDriver executable are used. Selenium is a browser automation tool, and it uses ChromeDriver to control the Chrome web browser. ChromeDriver is installed separately and can be downloaded from https://chromedriver.chromium.org/; however, its version must match the version of the Chrome browser. Selenium controls Chrome by visiting multiple URLs, and the statements below search for and store the elements that contain the information we need.
<font />

In [None]:
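# ChromeDriver is located on the desktop and used to control Chrome. A wait time of up to 15 seconds is defined
# for conditions to load and become visible before moving on to other conditions.
# Selenium 4 expects the driver path to be wrapped in a Service object instead of being passed positionally.
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('/Users/kimochi/Desktop/chromedriver'))
wait = WebDriverWait(driver, 15)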

<span style="background-color: #FFFF00">  **<font size = '4'>Note**</span>:<font size = '4'> Since this project collects data on over 4,000 anime, I believe it is best to scrape the information in intervals. Each interval gathers 300 anime; once all of their information is gathered, we move on to the next set of 300, and so forth. This ensures that the information already collected is stored safely. If there is an interruption, we do not have to start over from the beginning of the data collection; we can restart from the beginning of the current interval.

In [None]:
# A list variable is defined to store the anime URLs collected from 'Top Anime' on MAL.
# The for-loop steps through 300 anime in increments of 50 (one page at a time). It locates the elements that contain each URL and stores the URLs in the list.
urls = []
for page in range(0, 300, 50):
    driver.get('https://myanimelist.net/topanime.php?limit=' + str(page))
    links = driver.find_elements(By.CSS_SELECTOR, 'div[class ="detail"] h3 a')
    for item in links:
        urls.append(item.get_attribute('href'))
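
As a rough sketch of how the page range might be advanced from one interval to the next, the helper below returns the `limit` offsets for a given interval. The name `interval_offsets` and its parameters are hypothetical and not part of the original notebook; the sketch assumes each Top Anime page lists 50 entries and each interval covers 300.

```python
def interval_offsets(interval_index, interval_size=300, page_size=50):
    """Return the `limit` offsets for one scraping interval.

    interval_index 0 covers ranks 1-300, index 1 covers 301-600, and so on.
    """
    start = interval_index * interval_size
    return list(range(start, start + interval_size, page_size))


BASE_URL = 'https://myanimelist.net/topanime.php?limit='

# Offsets and page URLs for the second interval (interval_index=1).
second_interval = interval_offsets(1)
page_urls = [BASE_URL + str(offset) for offset in second_interval]
```

With something like this, only the interval index would need to change between runs, rather than the numbers inside `range`.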

In [None]:
# A list is defined for each field; values are appended to these lists as they are scraped
titles = [] 
scores = [] 
genres = []
anime_types = []
episodes = []
statuses = []
aired = []
premiered = []
broadcasts = []
producers = []
licensors = []
studios = []
sources = []
durations = []
ratings = []
popularity = []
members = []
favorites = []
ranks = []

In [None]:
# The for-loop visits every anime URL in the list and locates each piece of information by waiting until its element is visible.
# Premiered and Broadcast are optional: if their elements are not found, the variable is set to None.
for anime_url in urls:
    driver.get(anime_url)
    
    title = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div[class="h1-title"]'))).text
    score = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div[class*="score-label"]'))).text
    genre = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Genre")]/parent::div'))).text
    anime_type = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'span[class="information type"]'))).text
    episode = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Episodes")]/parent::div'))).text
    status = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Status")]/parent::div'))).text
    air = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Air")]/parent::div'))).text

    try:
        premier = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Premiered")]/parent::div'))).text
    except Exception:
        premier = None

    try:
        broadcast = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Broadcast")]/parent::div'))).text
    except Exception:
        broadcast = None

    producer = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Producers")]/parent::div'))).text
    licensor = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Licensors")]/parent::div'))).text
    studio = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Studios")]/parent::div'))).text
    source = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Source")]/parent::div'))).text
    duration = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Duration")]/parent::div'))).text
    rating = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Rating")]/parent::div'))).text
    pop = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Popularity")]/parent::div'))).text
    member = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'span[class="numbers members"]'))).text
    favorite = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[contains(text(), "Favorites")]/parent::div'))).text
    rank = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'span[class*="numbers ranked"]'))).text
    

    # Unnecessary label strings are removed in the following block after the elements are located.
    genre = genre.replace("Genres: ", "")
    episode = episode.replace("Episodes: ", "")
    status = status.replace("Status: ", "")
    air = air.replace("Aired: ", "")
    # Premiered and Broadcast may be None, so they are cleaned only when a value was found.
    if premier is not None:
        premier = premier.replace("Premiered: ", "")
    if broadcast is not None:
        broadcast = broadcast.replace("Broadcast: ", "")
    producer = producer.replace("Producers: ", "")
    licensor = licensor.replace("Licensors: ", "")
    studio = studio.replace("Studios: ", "")
    source = source.replace("Source: ", "")
    duration = duration.replace("Duration: ", "")
    rating = rating.replace("Rating: ", "")
    pop = pop.replace("Popularity: #", "")
    member = member.replace("Members ", "")
    favorite = favorite.replace("Favorites: ", "")
    rank = rank.replace("Ranked #", "")

    
    # Variables are then appended to their respective lists
    titles.append(title)
    scores.append(score)
    genres.append(genre)
    anime_types.append(anime_type)
    episodes.append(episode)
    statuses.append(status)
    aired.append(air)
    premiered.append(premier)
    broadcasts.append(broadcast)
    producers.append(producer)
    licensors.append(licensor)
    studios.append(studio)
    sources.append(source)
    durations.append(duration)
    ratings.append(rating)
    popularity.append(pop)
    members.append(member)
    favorites.append(favorite)
    ranks.append(rank)
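
The label-stripping step in the loop above could also be expressed with one small helper. The following is a minimal sketch, not the notebook's own method: `strip_label` is a hypothetical name, and unlike `str.replace` it removes the label only when it appears at the start of the text. The sample strings are made up for illustration.

```python
def strip_label(text, label):
    """Remove a leading label such as 'Genres: ' from scraped text.

    None is returned unchanged so optional fields can pass through safely.
    """
    if text is None:
        return None
    if text.startswith(label):
        return text[len(label):].strip()
    return text.strip()


# Sample strings are invented for illustration, not real scraped output.
genre = strip_label("Genres: Action, Drama", "Genres: ")
premier = strip_label(None, "Premiered: ")
pop = strip_label("Popularity: #3", "Popularity: #")
```

Because the helper accepts `None`, the optional Premiered and Broadcast fields would not need separate handling.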

<font size= '4'> <span style="background-color: #FFFF00"> **Note**</span>: The last three cells are repeated to start the next interval of 300 anime. Once all 4,000+ entries are collected, the next cell is executed. <font/><br />
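
Re-running the cell that defines the lists resets them to empty, so one way to keep each finished interval safe is to write it to disk before starting the next one. The sketch below is an assumption about how that could be done, not a step from the original workflow: `append_checkpoint` is a hypothetical helper built on the standard-library `csv` module, and the sample data and file name are invented.

```python
import csv
import os
import tempfile


def append_checkpoint(path, columns):
    """Append the rows held in `columns` (a dict of equal-length lists) to a CSV.

    The header row is written only when the file does not exist yet.
    """
    fieldnames = list(columns)
    rows = zip(*columns.values())
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(fieldnames)
        writer.writerows(rows)


# Made-up sample data standing in for one finished interval.
sample = {"Title": ["Example A", "Example B"], "Score": ["9.0", "8.5"]}

# A temporary directory keeps this demonstration from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "anime_checkpoint.csv")
    append_checkpoint(path, sample)  # first interval: header + 2 rows
    append_checkpoint(path, sample)  # second interval: 2 more rows
    with open(path, newline="", encoding="utf-8") as f:
        saved_rows = list(csv.reader(f))
```

In the notebook itself, the dictionary passed in would map column names to the lists filled by the scraping loop, and the path would point to a permanent file.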

<font size ='4'> The last few cells create a dataframe from all the information gathered. The dataframe is then exported to a .csv file.

In [None]:
# Before defining a dataframe, we create the column names and assign each column its corresponding list
my_data = {"Title": titles, "Score": scores, "Genres": genres, "Type": anime_types, "Episodes": episodes,
          "Status": statuses, "Aired": aired, "Premiered": premiered, "Broadcast": broadcasts, "Producers": producers,
          "Licensors": licensors, "Studios": studios, "Source": sources, "Duration": durations, 
          "Rating": ratings, "Popularity": popularity, "Members": members, "Favorites": favorites,
          "Ranked": ranks}

In [None]:
# Dataframe created
df = pd.DataFrame(data = my_data)

In [None]:
# saving dataframe into a .csv file
df.to_csv('4K Anime.csv')
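
Every scraped value is stored as text, so a later wrangling step would likely need to convert counts to numbers. The sketch below shows one possible conversion. It assumes that counts such as Members are written with thousands separators (for example `1,234,567`) and that missing values appear as non-numeric text; both are assumptions about the page format, and `parse_count` is a hypothetical helper name.

```python
def parse_count(text):
    """Convert a string like '1,234,567' to an int; return None if not numeric."""
    if text is None:
        return None
    cleaned = text.replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        return None


# Invented sample values, not real scraped output.
members_raw = ["1,234,567", "987", "Unknown", None]
members_clean = [parse_count(value) for value in members_raw]
```

A function like this could be applied to a dataframe column with `df["Members"].map(parse_count)` once the scraped strings have been inspected.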

In [None]:
# quit ChromeDriver as it is no longer needed
driver.quit()