<a href="https://colab.research.google.com/github/wiherreira/webscraperIMDb/blob/main/imbd_webScraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The IMDb Web Scraper

**Adapted by**: [Willian Herreira Lima](https://www.linkedin.com/in/willianherreira/).

To build this scraper, I followed the step-by-step guide written by [Angelica Dietzel](https://medium.com/better-programming/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a) and published on Medium. The entire [template code](https://github.com/angelicadietzel/data-projects/blob/master/single-page-web-scraper/imdb_scraper.py) is available in Angelica's data-projects repository.

What is web scraping? Web scraping consists of gathering data available on websites to build your own databases, and it is a powerful skill to have in your portfolio. A scraper can be started manually or triggered automatically by a function call.

The article covers topics such as understanding HTML web pages, building a web scraper using Python, and creating a DataFrame with pandas. It also covers data quality, data cleaning, and data-type conversion, with step-by-step instructions, code, and explanations of how every piece works. I added a few twists of my own: I added the "Director" field to the data storage, and I generate a list of URLs that a nested "for" loop then scrapes page by page.

In [1]:
# Import Libraries 
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from google.colab import files

## Look for patterns

An important aspect of automation is looking for patterns. Note that the URL for the first fifty titles differs from the URLs of the following pages after the ampersand ("&") character, and that the only difference among the later addresses is the `start` parameter, the item number at which the page's list begins. Check the addresses listed below to spot these patterns:

* 1-50 of 1,000 titles.
> https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv

* Next » 51-100 of 1,000 titles
> https://www.imdb.com/search/title/?groups=top_1000&start=51&ref_=adv_nxt

* Next » 101-150 of 1,000 titles
> https://www.imdb.com/search/title/?groups=top_1000&start=101&ref_=adv_nxt
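
One way to confirm the pattern is to parse the query strings and compare them. The sketch below uses Python's standard `urllib.parse` module on the three addresses listed above; the helper name `query_params` is only illustrative and is not part of the original tutorial.

```python
from urllib.parse import urlsplit, parse_qs

page_urls = [
    "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv",
    "https://www.imdb.com/search/title/?groups=top_1000&start=51&ref_=adv_nxt",
    "https://www.imdb.com/search/title/?groups=top_1000&start=101&ref_=adv_nxt",
]

def query_params(url):
    """Return the query string of `url` as a {name: value} dict."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per parameter; each one appears only once here
    return {name: values[0] for name, values in parsed.items()}

for url in page_urls:
    print(query_params(url))
```

Only the first address lacks `start`; from the second page on, `start` grows by 50 while `groups` stays the same.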

As a side note, if you run the code from a country where English is not the main language, it's very likely that you'll get some of the movie names translated into the main language of that country. Most likely, this happens because the server infers your location from your IP address. Even if you are located in a country where English is the main language, you may still get translated content if you're using a VPN while making the GET requests. For example, because I'm in Australia and was connected to a VPN server somewhere in Asia, the scraper's first output was in what I believe to be hanzi (Chinese characters).

If you run into this issue, pass the following values to the headers parameter of the get() function:

**headers = {"Accept-Language": "en-US, en;q=0.5"}**
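
To see what that header value asks for, here is a minimal sketch that splits it into language tags and their `q` weights (in HTTP, a tag without an explicit `q` counts as 1.0). The function name `parse_accept_language` is hypothetical, and the parsing is simplified for illustration.

```python
def parse_accept_language(value):
    """Split an Accept-Language value into (tag, weight) pairs, highest weight first."""
    preferences = []
    for part in value.split(","):
        tag, _, params = part.strip().partition(";")
        weight = 1.0  # default when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            weight = float(params[2:])
        preferences.append((tag.strip(), weight))
    # sorted() is stable, so tags with equal weight keep their original order
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)

headers = {"Accept-Language": "en-US, en;q=0.5"}
print(parse_accept_language(headers["Accept-Language"]))
# → [('en-US', 1.0), ('en', 0.5)]
```

In other words, the request prefers US English and accepts any other English variant at half the weight.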

## Create Dataframe

In [2]:
# Get a list of all urls to the IMDb "Top 1000"
#urls = ["https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv", 
#        "https://www.imdb.com/search/title/?groups=top_1000&start=51&ref_=adv_nxt",
#        "https://www.imdb.com/search/title/?groups=top_1000&start=101&ref_=adv_nxt"]

urls = ["https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"]
url_fp = "https://www.imdb.com/search/title/?groups=top_1000&start="
url_sp = "&ref_=adv_nxt"
# Pages 2 to 20 start at items 51, 101, ..., 951
for start in range(51, 1000, 50):
  urls.append(url_fp + str(start) + url_sp)

# Initiate data storage
titles = []
years = []
time = []
imdb_ratings = []
metascores = []
votes = []
us_gross = []
# Director names (an addition to the original tutorial)
directors = []

# Nested loops: the outer loop requests and parses each URL.
for url in urls:
  headers = {"Accept-Language": "en-US, en;q=0.5"}
  results = requests.get(url, headers=headers)
  #Beautiful Soup is a Python library for pulling data out of HTML and XML files. 
  soup = BeautifulSoup(results.text, "html.parser")
  movie_div = soup.find_all('div',class_='lister-item mode-advanced')

  # This tells the scraper to iterate through
  # every div container we stored in movie_div
  #
  # for each lister-item mode-advanced div container:
  #     scrape these elements
  
  for container in movie_div:
    # Name
    titles.append(container.h3.a.text)
  
    #year
    years.append(container.h3.find('span', class_='lister-item-year').text)
  
    #runtime
    # find() returns None when the span is missing, so test the tag before reading .text
    runtime_tag = container.p.find('span', class_='runtime')
    runtime = runtime_tag.text if runtime_tag else '-'
    time.append(runtime)
  
    #IMDb Rating
    imdb_ratings.append(float(container.strong.text))
  
    #Metascore
    # Some movies might not have a Metascore. To avoid a ValueError during
    # data cleaning, we store -1, which can be treated as missing (NULL).
    scores = container.find('span', class_='metascore').text if container.find('span', class_='metascore') else -1
    metascores.append(scores)
    # Votes
    nv = container.find_all('span', attrs={'name':'nv'})
    votes.append(nv[0].text) # filter number of votes
    # Revenue
    grosses = nv[1].text if len(nv) > 1 else '-'
    us_gross.append(grosses)
    #Director
    director = container.find_all('p',class_='')
    director = director[1].find('a').text
    directors.append(director)

## Build a DataFrame using Pandas

Next, we use pandas to join the lists we created and clean the data. For example, printing the first five entries of `years` gives output similar to this:

>> ['(2020)', '(2020)', '(1978)', '(1993)', '(1940)']

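Therefore, we use a regular expression to extract only the digits from each string and then convert the result to an integer. The method changes slightly for the revenue data, where the "$" prefix and "M" suffix are stripped before the value is converted to a number.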
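
Here is the same idea in plain Python, as a rough sketch. The `years` list is the sample output shown above, while the `gross` strings are made-up examples in the "$...M" shape the scraper collects, with '-' standing for a missing value. The helper names `extract_int` and `parse_gross` are illustrative only; the pandas calls in the next cell apply the equivalent steps to whole columns.

```python
import re

years = ['(2020)', '(2020)', '(1978)', '(1993)', '(1940)']
# Hypothetical revenue strings; '-' marks a movie with no reported gross
gross = ['$28.34M', '-', '$134.97M']

def extract_int(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

def parse_gross(text):
    """Strip the '$' prefix and 'M' suffix; return a float, or None if not numeric."""
    stripped = text.lstrip('$').rstrip('M')
    try:
        return float(stripped)
    except ValueError:
        return None

print([extract_int(y) for y in years])   # → [2020, 2020, 1978, 1993, 1940]
print([parse_gross(g) for g in gross])   # → [28.34, None, 134.97]
```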



In [3]:
# Join the lists into a DataFrame
movies = pd.DataFrame({
    'movie': titles,
    'year': years,
    'timeMin': time,
    'imdb': imdb_ratings,
    'metascores': metascores,
    'votes': votes,
    'director': directors,
    'us_grossMillions': us_gross,
})

#Clean Data
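# Raw strings keep '\d' from being read as an invalid escape sequence
movies['year'] = movies['year'].str.extract(r'(\d+)').astype(int)
# Rows without a runtime get -1, the same sentinel used for missing Metascores
movies['timeMin'] = movies['timeMin'].str.extract(r'(\d+)', expand=False).fillna(-1).astype(int)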
movies['metascores'] = movies['metascores'].astype(int)
movies['votes'] = movies['votes'].str.replace(',','').astype(int)
movies['us_grossMillions'] = movies['us_grossMillions'].map(lambda x: x.lstrip('$').rstrip('M'))
movies['us_grossMillions'] = pd.to_numeric(movies['us_grossMillions'], errors='coerce')

# Write the DataFrame to a CSV file named 'movies.csv'
movies.to_csv('movies.csv')

## Save the file

In [4]:
# Download the CSV file for future use.
files.download('movies.csv')
