# Web Scraping

Ideally, the data you need is easy to get, and you don't have to scour the web for a source that supports your analysis.

We'd all prefer one of these:

<img src="images/other_options.png" alt="image showcasing a downloadable csv, a database connection, or an API" width=650>

_Image source unknown; taken from materials provided by another instructor._

But we're not always so lucky! Sometimes we need data that's less accessible.

Enter...

<img alt="beautiful soup logo" src="images/bs.png" width=500>

> "You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects."

- From the Beautiful Soup [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)

## Grabbing Movie Data

We might think about grabbing more movie data as we gear up for our Phase 1 project, which uses movie data.

If we go to [IMDB](https://www.imdb.com/), the only API access on offer seems expensive, while the advanced search returns tabular results that look _extremely_ scrapable.

**BUT** 

Enter - [conditions of use pages](https://www.imdb.com/conditions) ... and ethics!

> "**Robots and Screen Scraping:** You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below."

**Let's Discuss**

- Do people scrape sites they shouldn't? Sure, all the time. But am I going to tell you to ignore conditions/terms of use? Absolutely not. Make good choices.


Instead, let's scrape Wikipedia for movie data. Its text is published under a Creative Commons license (CC BY-SA) that permits reuse with attribution!
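Terms of use are one signal; a site's `robots.txt` file is another. Here is a minimal sketch of reading one with Python's built-in `urllib.robotparser`. The rules, the example.org URLs, and the `my-class-scraper` user agent are all made up for illustration. Against a real site you would call `set_url()` with the site's `robots.txt` address and then `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written by hand for this demo
sample_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Hypothetical user agent and URLs
allowed = parser.can_fetch("my-class-scraper", "https://example.org/wiki/2000_in_film")
blocked = parser.can_fetch("my-class-scraper", "https://example.org/private/data")

print(allowed, blocked)
```

A permissive `robots.txt` does not override a site's conditions of use, so check both.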

Let's explore a few [years in film](https://en.wikipedia.org/wiki/Table_of_years_in_film).

## Task: Grab the top 10 highest-grossing films for each year, 2000-2021

### Imports

Our goal is to collect data into a pandas DataFrame. We're still working with websites, so we'll still need the requests library.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup # note the odd import: the package is installed as beautifulsoup4 but imported from bs4

In [None]:
# you may also need lxml - https://lxml.de/index.html
# helps process html or xml in python
# !pip install lxml

Test case - [the year 2000](https://en.wikipedia.org/wiki/2000_in_film).

In [2]:
# Get the response from the website, using requests
resp = requests.get("https://en.wikipedia.org/wiki/2000_in_film")

In [4]:
# Let's check out the text attribute of that response...
resp.text[:1000]
# (ew)

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>2000 in film - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"43a79920-5e13-4174-9f0a-c06c0b8bc006","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"2000_in_film","wgTitle":"2000 in film","wgCurRevisionId":1012865260,"wgRevisionId":1012865260,"wgArticleId":167091,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Use mdy dates from May 2019","2000 in film","Film by year"],"wgPageContentLanguage":

In [5]:
# And now... beautiful soup! Let's soup-ify that text attribute
# Naming the parser explicitly avoids bs4's GuessedAtParserWarning
soup = BeautifulSoup(resp.text, "html.parser")

In [16]:
# Now we need to find the table we want in the soup - use .find()
# Can pass a dictionary in the attributes argument
grossing_2000 = soup.find('table', {'class': 'wikitable sortable'})
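One thing to keep in mind with that dictionary: a multi-word class string such as `'wikitable sortable'` is matched against the exact text of the `class` attribute. A table whose classes appear in another order, or with an extra class, will not match. The sketch below uses three hand-written tables (not the Wikipedia page) to compare that with a CSS selector passed to `.select()`, which only requires that both classes be present.

```python
from bs4 import BeautifulSoup

# Hand-written demo HTML: same two classes, written three different ways
html = """
<table class="wikitable sortable"><caption>A</caption></table>
<table class="sortable wikitable"><caption>B</caption></table>
<table class="wikitable sortable plainrowheaders"><caption>C</caption></table>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute
exact = demo.find_all("table", {"class": "wikitable sortable"})

# CSS selector: any table that has both classes, in any order
css = demo.select("table.wikitable.sortable")

print([t.caption.get_text() for t in exact])
print([t.caption.get_text() for t in css])
```

If a page for some year comes back without the table you expected, a class mismatch like this is one possible cause worth checking.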

In [62]:
grossing_2000

<table class="wikitable sortable" style="margin:auto; margin:auto;">
<caption>Highest-grossing films of 2000
</caption>
<tbody><tr>
<th>Rank</th>
<th>Title</th>
<th>Distributor</th>
<th>Worldwide gross
</th></tr>
<tr>
<th style="text-align:center;">1
</th>
<td><i><a href="/wiki/Mission:_Impossible_2" title="Mission: Impossible 2">Mission: Impossible 2</a></i>
</td>
<td><a href="/wiki/Paramount_Pictures" title="Paramount Pictures">Paramount</a>
</td>
<td>$546,388,105
</td></tr>
<tr>
<th style="text-align:center;">2
</th>
<td><i><a href="/wiki/Gladiator_(2000_film)" title="Gladiator (2000 film)">Gladiator</a></i>
</td>
<td><a href="/wiki/Universal_Pictures" title="Universal Pictures">Universal</a>
</td>
<td>$460,583,960
</td></tr>
<tr>
<th style="text-align:center;">3
</th>
<td><i><a href="/wiki/Cast_Away" title="Cast Away">Cast Away</a></i>
</td>
<td><a class="mw-redirect" href="/wiki/20th_Century_Fox" title="20th Century Fox">Fox</a>
</td>
<td>$429,632,142
</td></tr>
<tr>
<th style="te

In [24]:
# Explore that result
len(grossing_2000.find_all("tr"))

11

In [40]:
example = grossing_2000.find_all('tr')[1]

In [None]:
example.get_text()

In [46]:
# Check out the first real row in the table
# .stripped_strings yields each piece of text with surrounding whitespace removed
list(grossing_2000.find_all('tr')[1].stripped_strings)

['1', 'Mission: Impossible 2', 'Paramount', '$546,388,105']
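A caveat when turning a row into a list: joining the text with a separator character and splitting on it again misfires whenever that character appears in the data, and this table has a title with a hyphen in it. A minimal sketch on a hand-written row (a simplified version of the X-Men row, not the live HTML) compares that with reading the row cell by cell.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified row for illustration
row_html = "<tr><th>9</th><td><i>X-Men</i></td><td>$296,339,527</td></tr>"
row = BeautifulSoup(row_html, "html.parser").find("tr")

# Join with "-" and split again: the hyphen in the title also gets split
naive = row.get_text(separator="-", strip=True).split("-")

# Read each header/data cell on its own instead
cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]

print(naive)
print(cells)
```

Going cell by cell also keeps one list item per column, even when a cell contains several pieces of text.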

In [37]:
# Check out the last row... what's missing?
grossing_2000.find_all("tr")[-1].get_text().split("\n\n")

['\n10', 'What Lies Beneath', '$291,420,351\n']

In [51]:
pd.DataFrame([list(grossing_2000.find_all('tr')[1].stripped_strings)], 
             columns=['Number', 'Name', 'Studio', "Gross"])

| | Number | Name | Studio | Gross |
|---|---|---|---|---|
| 0 | 1 | Mission: Impossible 2 | Paramount | $546,388,105 |


**But wait...** there's a shortcut: `pd.read_html` (thanks, pandas)
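Before pointing it at the real table, here is a small sketch of what `pd.read_html` does with a hand-written table. The two data rows reuse values from the 2000 table, but the `rowspan` on the distributor cell is my assumption about why the last row came up one item short earlier. `pd.read_html` returns a list of DataFrames and needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hand-written demo table; the rowspan is an assumption about the real markup
html = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Distributor</th><th>Worldwide gross</th></tr>
  <tr><th>9</th><td>X-Men</td><td rowspan="2">Fox</td><td>$296,339,527</td></tr>
  <tr><th>10</th><td>What Lies Beneath</td><td>$291,420,351</td></tr>
</table>
"""

# read_html wants a file-like object (or URL), so wrap the string in StringIO
tables = pd.read_html(StringIO(html))
demo_df = tables[0]

print(demo_df)
```

The cell that spans two rows is copied into both, so the second row gets its distributor back without any manual patching.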

In [65]:
print(grossing_2000.prettify())

<table class="wikitable sortable" style="margin:auto; margin:auto;">
 <caption>
  Highest-grossing films of 2000
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Title
   </th>
   <th>
    Distributor
   </th>
   <th>
    Worldwide gross
   </th>
  </tr>
  <tr>
   <th style="text-align:center;">
    1
   </th>
   <td>
    <i>
     <a href="/wiki/Mission:_Impossible_2" title="Mission: Impossible 2">
      Mission: Impossible 2
     </a>
    </i>
   </td>
   <td>
    <a href="/wiki/Paramount_Pictures" title="Paramount Pictures">
     Paramount
    </a>
   </td>
   <td>
    $546,388,105
   </td>
  </tr>
  <tr>
   <th style="text-align:center;">
    2
   </th>
   <td>
    <i>
     <a href="/wiki/Gladiator_(2000_film)" title="Gladiator (2000 film)">
      Gladiator
     </a>
    </i>
   </td>
   <td>
    <a href="/wiki/Universal_Pictures" title="Universal Pictures">
     Universal
    </a>
   </td>
   <td>
    $460,583,960
   </td>
  </tr>
  <tr>
   <th style="text-align:ce

In [77]:
soup.find_all('table', {'class': 'wikitable sortable'})[2].prettify()

'<table class="wikitable sortable" style="width:100%">\n <caption>\n </caption>\n <tbody>\n  <tr>\n   <th colspan="2">\n    Opening\n   </th>\n   <th style="width:17%">\n    Title\n   </th>\n   <th style="width:16%">\n    Studio\n   </th>\n   <th>\n    Cast and crew\n   </th>\n   <th style="width:10%">\n    Genre\n   </th>\n   <th style="width:8%">\n    Medium\n   </th>\n  </tr>\n  <tr>\n   <th rowspan="24" style="text-align:center; background:#ffa07a">\n    <b>\n     A\n     <br/>\n     P\n     <br/>\n     R\n     <br/>\n     I\n     <br/>\n     L\n    </b>\n   </th>\n   <td style="text-align:center; background:#ffdacc">\n    5\n   </td>\n   <td>\n    <i>\n     <a href="/wiki/Black_and_White_(1999_drama_film)" title="Black and White (1999 drama film)">\n      Black and White\n     </a>\n    </i>\n   </td>\n   <td>\n    <a href="/wiki/Screen_Gems" title="Screen Gems">\n     Screen Gems\n    </a>\n   </td>\n   <td>\n    <a href="/wiki/James_Toback" title="James Toback">\n     James Toba

In [73]:
for x in soup.find_all('table', {'class': 'wikitable sortable'}):
    print(x.prettify())

<table class="wikitable sortable" style="margin:auto; margin:auto;">
 <caption>
  Highest-grossing films of 2000
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Title
   </th>
   <th>
    Distributor
   </th>
   <th>
    Worldwide gross
   </th>
  </tr>
  <tr>
   <th style="text-align:center;">
    1
   </th>
   <td>
    <i>
     <a href="/wiki/Mission:_Impossible_2" title="Mission: Impossible 2">
      Mission: Impossible 2
     </a>
    </i>
   </td>
   <td>
    <a href="/wiki/Paramount_Pictures" title="Paramount Pictures">
     Paramount
    </a>
   </td>
   <td>
    $546,388,105
   </td>
  </tr>
  <tr>
   <th style="text-align:center;">
    2
   </th>
   <td>
    <i>
     <a href="/wiki/Gladiator_(2000_film)" title="Gladiator (2000 film)">
      Gladiator
     </a>
    </i>
   </td>
   <td>
    <a href="/wiki/Universal_Pictures" title="Universal Pictures">
     Universal
    </a>
   </td>
   <td>
    $460,583,960
   </td>
  </tr>
  <tr>
   <th style="text-align:ce

In [79]:
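from io import StringIO

list_of_res = []
for x in soup.find_all('table', {'class': 'wikitable sortable'}):
    # Newer pandas versions expect a file-like object rather than a raw HTML string
    list_of_res.append(pd.read_html(StringIO(x.prettify())))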

In [85]:
list_of_res[3][0]

| | Opening | Opening.1 | Title | Studio | Cast and crew | Genre | Medium |
|---|---|---|---|---|---|---|---|
| 0 | J U L Y | 7 | But I'm a Cheerleader | Lions Gate Films | Jamie Babbit (director); Brian Wayne Peterso... | Satire , Romance , Comedy | Live action |
| 1 | J U L Y | 7 | The Kid | Walt Disney Pictures | Jon Turteltaub (director); Audrey Wells (sc... | Comedy , Family | Live action |
| 2 | J U L Y | 7 | Scary Movie | Dimension Films | Keenen Ivory Wayans (director); Shawn Wayans... | Comedy , Horror , Spoof | Live action |
| 3 | J U L Y | 13 | Los Pintin al rescate | Artear / Patagonik Film Group / Pol-Ka Pro... | Arturo Maly , Alfredo Casero , Diego Peret... | Adventure , Family | Animation |
| 4 | J U L Y | 14 | X-Men | 20th Century Fox / Marvel Enterprises | Bryan Singer (director); David Hayter (scre... | Action , Sci-Fi | Live action |
| 5 | J U L Y | 14 | Ready to Run | Buena Vista Television | Duwayne Dunham (director); John Wierick (scre... | Family , Sports | Live action |
| 6 | J U L Y | 14 | Thomas and the Magic Railroad | Destination Films / Gullane Pictures / Ico... | Britt Allcroft (director/screenplay); Mr. Co... | Fantasy , Comedy | Live action |
| 7 | J U L Y | 19 | The In Crowd | Warner Bros. Pictures / Morgan Creek Product... | Mary Lambert (director); Mark Gibson , Phil... | Thriller | Live action |
| 8 | J U L Y | 21 | Loser | Columbia Pictures | Amy Heckerling (director/screenplay); Jason ... | Romance , Comedy | Live action |
| 9 | J U L Y | 21 | Pokémon: The Movie 2000 | Warner Bros. Pictures / Nintendo / 4Kids E... | Kunihiko Yuyama (director); Takeshi Shudo (... | Adventure , Fantasy , Family | Animation |


In [68]:
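# Note - pd.read_html needs HTML text, not a bs4 Tag, so convert with .prettify() first
# Newer pandas versions also expect that text wrapped in a file-like object
from io import StringIO
df = pd.read_html(StringIO(grossing_2000.prettify()))[0]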

In [98]:
# Add a Year column so rows stay identifiable once we combine years
df['Year'] = 2000

In [99]:
df

| | Rank | Title | Distributor | Worldwide gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | Mission: Impossible 2 | Paramount | $546,388,105 | 2000 |
| 1 | 2 | Gladiator | Universal | $460,583,960 | 2000 |
| 2 | 3 | Cast Away | Fox | $429,632,142 | 2000 |
| 3 | 4 | What Women Want | Paramount | $374,111,707 | 2000 |
| 4 | 5 | Dinosaur | Disney | $349,822,765 | 2000 |
| 5 | 6 | How the Grinch Stole Christmas | Universal | $345,141,403 | 2000 |
| 6 | 7 | Meet the Parents | Universal | $330,444,045 | 2000 |
| 7 | 8 | The Perfect Storm | Warner Bros. | $328,718,434 | 2000 |
| 8 | 9 | X-Men | Fox | $296,339,527 | 2000 |
| 9 | 10 | What Lies Beneath | Fox | $291,420,351 | 2000 |


### Now Loop It!

2000 - 2021

In [97]:
resp.status_code == 200

True

In [100]:
import time

In [102]:
# My preference - create a list of dataframes, then concat afterwards
# Are there other ways to create one big df from this? OF COURSE!
from io import StringIO

list_of_dfs = []

# range() excludes its stop value, so this covers 2000 through 2021
for year in range(2000, 2022):
    url = f"https://en.wikipedia.org/wiki/{year}_in_film"
    resp = requests.get(url)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumes the first 'wikitable sortable' table is the highest-grossing one
        table = soup.find('table', {'class': 'wikitable sortable'})
        df = pd.read_html(StringIO(table.prettify()))[0]
        df["Year"] = year
        list_of_dfs.append(df)
    else:
        print(f"Error in response for {year}: {resp.status_code}")
    # Pause between requests to go easy on the server
    time.sleep(.25)
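The loop trusts that the first matching table on every page is the one we want. A hedged alternative is to let `pd.read_html` pick the table by its text with the `match` parameter, here keyed on the caption we saw for 2000. The helper name `highest_grossing_table` and the sample HTML are made up for illustration, and I have not checked that every year's page uses the same caption wording.

```python
from io import StringIO

import pandas as pd


def highest_grossing_table(html: str, year: int) -> pd.DataFrame:
    """Hypothetical helper: return the table whose text mentions 'Highest-grossing films'."""
    # match= keeps only tables containing text that matches this string/regex;
    # pandas raises ValueError if no table matches
    tables = pd.read_html(StringIO(html), match="Highest-grossing films")
    df = tables[0]
    df["Year"] = year
    return df


# Hand-written page with a decoy table first
sample_html = """
<table class="wikitable"><tr><th>Event</th></tr><tr><td>Some award</td></tr></table>
<table class="wikitable sortable">
<caption>Highest-grossing films of 2000</caption>
<tr><th>Rank</th><th>Title</th></tr>
<tr><th>1</th><td>Mission: Impossible 2</td></tr>
</table>
"""

sample_df = highest_grossing_table(sample_html, 2000)
print(sample_df)
```

Inside the loop you could pass `resp.text` and `year` to a helper like this, and wrap the call in `try`/`except ValueError` to log the years where no table matched.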


In [105]:
# Check our work
list_of_dfs[21]

| | Rank | Title | Distributor | Worldwide gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | Hi, Mom | Lian Ray Pictures | $818,430,000 [1] | 2021 |
| 1 | 2 | Detective Chinatown 3 | Wanda Pictures | $687,480,000 [1] | 2021 |
| 2 | 3 | A Writer's Odyssey | CMC Productions | $157,180,000 [1] | 2021 |
| 3 | 4 | Godzilla vs. Kong | Warner Bros. , Toho | $123,100,000 [2] | 2021 |
| 4 | 5 | Endgame | Enlight Pictures | $114,380,000 [1] | 2021 |
| 5 | 6 | Raya and the Last Dragon | Disney | $90,753,487 [3] [4] | 2021 |
| 6 | 7 | Boonie Bears: The Wild Life | Huaqiang Fangte Pictures | $90,240,000 [1] | 2021 |
| 7 | 8 | Tom & Jerry | Warner Bros. | $85,600,000 [5] | 2021 |
| 8 | 9 | New Gods: Nezha Reborn | Bona Film | $69,310,000 [1] | 2021 |
| 9 | 10 | Evangelion: 3.0+1.0 Thrice Upon a Time | Toho | $45,325,099 [6] | 2021 |


In [110]:
# Now to concat...
full_df = pd.concat(list_of_dfs, ignore_index=True)

In [111]:
full_df

| | Rank | Title | Distributor | Worldwide gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | Mission: Impossible 2 | Paramount | $546,388,105 | 2000 |
| 1 | 2 | Gladiator | Universal | $460,583,960 | 2000 |
| 2 | 3 | Cast Away | Fox | $429,632,142 | 2000 |
| 3 | 4 | What Women Want | Paramount | $374,111,707 | 2000 |
| 4 | 5 | Dinosaur | Disney | $349,822,765 | 2000 |
| ... | ... | ... | ... | ... | ... |
| 215 | 6 | Raya and the Last Dragon | Disney | $90,753,487 [3] [4] | 2021 |
| 216 | 7 | Boonie Bears: The Wild Life | Huaqiang Fangte Pictures | $90,240,000 [1] | 2021 |
| 217 | 8 | Tom & Jerry | Warner Bros. | $85,600,000 [5] | 2021 |
| 218 | 9 | New Gods: Nezha Reborn | Bona Film | $69,310,000 [1] | 2021 |


In [113]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Rank             220 non-null    int64 
 1   Title            220 non-null    object
 2   Distributor      220 non-null    object
 3   Worldwide gross  220 non-null    object
 4   Year             220 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 8.7+ KB


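Let's practice some data cleaning on the `Worldwide gross` column. Right now it is stored as text (dtype `object`), with dollar signs, commas, and footnote markers such as `[1]` mixed in.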
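Here is one possible way to clean it, sketched on three values copied from the output above rather than on the full DataFrame. It assumes every entry is a single dollar amount, optionally followed by footnote markers, which may not hold for every row Wikipedia serves.

```python
import pandas as pd

# Three sample values copied from the scraped output
gross = pd.Series(["$546,388,105", "$818,430,000 [1]", "$90,753,487 [3] [4]"])

cleaned = (
    gross.str.replace(r"\[\d+\]", "", regex=True)  # drop footnote markers like [1]
         .str.replace(r"[$,\s]", "", regex=True)   # drop dollar signs, commas, spaces
         .astype("int64")                          # convert the remaining digits to integers
)

print(cleaned.tolist())
```

The same chain could be applied to `full_df["Worldwide gross"]`. If a value does not fit the assumed pattern, `astype` raises an error, which is a handy way to find the odd rows.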

## Discussion Time!

What else could we do with web scraping? Do any project ideas come to mind? Is there anything else useful on that page we could use to grab more data? Let's discuss!

- 
