# Web-Scraping

## Objectives

- Parse HTML elements in webpages
- Use requests and BeautifulSoup to get and process webpage contents
- Apply ethical practices when scraping websites

Ideally, the data you need is easy to access, and you don't have to scour the web for a source that supports your analysis.

We'd all prefer one of these:

<img src="images/other_options.png" alt="image showcasing a downloadable csv, database connection, or API, but we're not always so lucky. not sure of image source, took from materials provided by another instructor" width=650>

But we're not always so lucky! Sometimes we need data that's less accessible.

Enter...

<img alt="beautiful soup logo" src="images/bs.png" width=500>

> "You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects."

- From the Beautiful Soup [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)

## The components of a web page

When we visit a web page, our web browser makes a GET request to a web server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

- HTML: contains the main content of the page.
- CSS: adds styling to make the page look nicer.
- JS: JavaScript files add interactivity to web pages.
- Images: image formats such as JPG and PNG allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping.

### HTML

HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn't a programming language like Python; instead, it's a markup language that tells a browser how to lay out content. Let's take a quick tour through HTML so we know enough to scrape effectively.

HTML consists of elements called tags.

Tags have commonly used names that depend on their position in relation to other tags:

- **child**: a tag nested inside another tag.
- **parent**: the tag that another tag is nested inside.
- **sibling**: a tag nested inside the same parent as another tag.

Here's some example HTML. Which tags are parents? Children? Siblings?

~~~html
<html>
  <head></head>
  <body>
    <p>
      Here's a paragraph of text!
      <a href="https://www.dataquest.io">Learn Data Science Online</a>
    </p>
    <p>
      Here's a second paragraph of text!
      <a href="https://www.python.org">Python</a>        
    </p>
  </body>
</html>
~~~
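
One way to check your answers is to load the snippet above into Beautiful Soup and walk the relationships directly. This is a minimal sketch, assuming the built-in `html.parser` backend; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head></head>
  <body>
    <p>
      Here's a paragraph of text!
      <a href="https://www.dataquest.io">Learn Data Science Online</a>
    </p>
    <p>
      Here's a second paragraph of text!
      <a href="https://www.python.org">Python</a>
    </p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")        # first <p> tag in the document
first_link = first_p.find("a")  # the <a> tag nested inside it

# Parent: the tag that directly contains the link
parent_name = first_link.parent.name

# Children: tags directly inside <body> (recursive=False skips grandchildren)
child_tags = [tag.name for tag in soup.body.find_all(recursive=False)]

# Sibling: the next <p> that shares the same parent as the first <p>
second_p = first_p.find_next_sibling("p")
sibling_link_text = second_p.a.get_text()

print(parent_name, child_tags, sibling_link_text)
```

Here each `<a>` is a child of its `<p>`, the two `<p>` tags are siblings, and `<body>` is their parent.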

## Grabbing Movie Data

We might think about grabbing more movie data, as we gear up for our Phase 1 project which uses movie data. 

If we go to [IMDB](https://www.imdb.com/), the only API access on offer seems expensive, while the advanced search returns tabular results that seem _extremely_ scrapable.

**BUT** 

Enter - [conditions of use pages](https://www.imdb.com/conditions) ... and ethics!

> "**Robots and Screen Scraping:** You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below."


### Ethical Concerns

- Terms of Service (includes the conditions of use shown above)
- Denial of Service Attacks
- Confidentiality
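
On the denial-of-service point: a scraper that sends requests as fast as it can may overload a small site, so a common courtesy is to pause between requests. The sketch below is only an illustration; `fetch_politely`, its `fetch` argument, and the one-second default are hypothetical choices, not a standard.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function, so no real request is made
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"contents of {url}",
    delay_seconds=0.01,
)
print(pages)
```

In real use, `fetch` could be something like `lambda url: requests.get(url).text`, with a delay chosen to suit the site.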

[This article](https://oxylabs.io/blog/is-web-scraping-legal) discusses legal issues related to web scraping.

Key points: 

- Don't log into a site and then scrape content that's only available after logging in. Doing so likely violates the Terms of Service (you can always check whether the ToS actually covers it)
- Don't scrape copyrighted data 

**Let's Discuss**

- Do people scrape sites they shouldn't? Sure, all the time. But am I going to tell you to ignore conditions/terms of use? Absolutely not. Make good choices.

_We are not lawyers - this does not constitute legal advice._

Instead, let's scrape Wikipedia for movie data - Wikipedia has a very accessible Creative Commons license for use!

Let's explore a few [years in film](https://en.wikipedia.org/wiki/Table_of_years_in_film).

## Task: Grab the top 10 highest-grossing films for each year, 2000-2020

### Imports

Our goal is to collect data into a Pandas dataframe. Plus we're still working with websites, so we'll still need the requests library.

In [63]:
import pandas as pd
import requests
from bs4 import BeautifulSoup # note this odd import statement structure

In [64]:
# we may also need lxml - https://lxml.de/index.html
# helps process html or xml in python
# !pip install lxml

Test case - [the year 2000](https://en.wikipedia.org/wiki/2000_in_film).

In [65]:
# Get the response from the website, using requests
resp = requests.get("https://en.wikipedia.org/wiki/2000_in_film")

In [66]:
# Let's check out the text attribute of that response...
resp.text
# (ew)

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>2000 in film - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6028534f-bb2e-49b5-83f8-a22ae4ad9abe","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"2000_in_film","wgTitle":"2000 in film","wgCurRevisionId":1091618353,"wgRevisionId":1091618353,"wgArticleId":167091,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Use mdy dates from May 2019","2000 in film","Film by year"],"wgPageCon

In [67]:
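# And now... beautiful soup! Let's soup-ify that text attribute
# Naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(resp.text, "html.parser")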

In [68]:
# Can use a prettify function to pretty print
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   2000 in film - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6028534f-bb2e-49b5-83f8-a22ae4ad9abe","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"2000_in_film","wgTitle":"2000 in film","wgCurRevisionId":1091618353,"wgRevisionId":1091618353,"wgArticleId":167091,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Use mdy dates from May 2019","2000 in film","Film by year"

In [69]:
# Now we need to find the table we want in the soup - use .find()
# Can pass a dictionary in the attributes argument
table = soup.find("table", attrs={"class":"wikitable"})
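
It helps to know how Beautiful Soup matches the `class` attribute, since a tag can carry several classes at once. The sketch below uses two made-up tables (not Wikipedia's real markup) to compare three ways of searching; it assumes the `html.parser` backend.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><td>a</td></tr></table>
<table class="sortable wikitable"><tr><td>b</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that has that class among its classes
by_one_class = soup.find_all("table", attrs={"class": "wikitable"})

# A string with a space is compared against the exact attribute value,
# so the order of the class names matters
by_exact_string = soup.find_all("table", attrs={"class": "wikitable sortable"})

# A CSS selector matches tags that have both classes, in any order
by_css = soup.select("table.wikitable.sortable")

print(len(by_one_class), len(by_exact_string), len(by_css))
```

Searching on one class name is the more forgiving option if a page ever reorders or adds classes.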

In [70]:
# Explore that result
table

<table class="wikitable sortable" style="margin:auto; margin:auto;">
<caption>Highest-grossing films of 2000
</caption>
<tbody><tr>
<th>Rank</th>
<th>Title</th>
<th>Distributor</th>
<th>Worldwide gross
</th></tr>
<tr>
<th style="text-align:center;">1
</th>
<td><i><a href="/wiki/Mission:_Impossible_2" title="Mission: Impossible 2">Mission: Impossible 2</a></i>
</td>
<td><a href="/wiki/Paramount_Pictures" title="Paramount Pictures">Paramount</a>
</td>
<td>$546,388,105
</td></tr>
<tr>
<th style="text-align:center;">2
</th>
<td><i><a href="/wiki/Gladiator_(2000_film)" title="Gladiator (2000 film)">Gladiator</a></i>
</td>
<td><a href="/wiki/DreamWorks_Pictures" title="DreamWorks Pictures">DreamWorks</a> / <a href="/wiki/Universal_Pictures" title="Universal Pictures">Universal</a>
</td>
<td>$460,583,960
</td></tr>
<tr>
<th style="text-align:center;">3
</th>
<td><i><a href="/wiki/Cast_Away" title="Cast Away">Cast Away</a></i>
</td>
<td><a class="mw-redirect" href="/wiki/20th_Century_Fox" titl

In [71]:
# Check out the first real row in the table
# Can use get_text as a method, then use string methods!
table.find_all('tr')[1].get_text().split("\n\n")

['\n1', 'Mission: Impossible 2', 'Paramount', '$546,388,105\n']

In [72]:
# Check out the last row
table.find_all('tr')[-1].get_text().split("\n\n")

['\n10', 'What Lies Beneath', 'DreamWorks / Fox', '$291,420,351\n']

In [73]:
# Can make into a dataframe like we did with an API result
# First need to define the columns
columns = table.find_all('tr')[0].get_text().split('\n')[1:5]
columns

['Rank', 'Title', 'Distributor', 'Worldwide gross']

In [74]:
# Now let's make a list of rows for the data
row_list = []
for row in table.find_all('tr')[1:]:
    row_list.append(row.get_text().split('\n\n'))
    
row_list

[['\n1', 'Mission: Impossible 2', 'Paramount', '$546,388,105\n'],
 ['\n2', 'Gladiator', 'DreamWorks / Universal', '$460,583,960\n'],
 ['\n3', 'Cast Away', 'Fox / DreamWorks', '$429,632,142\n'],
 ['\n4', 'What Women Want', 'Paramount', '$374,111,707\n'],
 ['\n5', 'Dinosaur', 'Disney', '$349,822,765\n'],
 ['\n6', 'How the Grinch Stole Christmas', 'Universal', '$345,141,403\n'],
 ['\n7', 'Meet the Parents', 'Universal / DreamWorks', '$330,444,045\n'],
 ['\n8', 'The Perfect Storm', 'Warner Bros.', '$328,718,434\n'],
 ['\n9', 'X-Men', 'Fox', '$296,339,527\n'],
 ['\n10', 'What Lies Beneath', 'DreamWorks / Fox', '$291,420,351\n']]
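
Splitting on `"\n\n"` works for this table, but it depends on exactly how whitespace falls between cells. An alternative is to pull the text out of each cell separately. This is a sketch on a trimmed-down copy of the table above, not a replacement for the approach shown; it assumes the `html.parser` backend.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Rank</th>
<th>Title</th>
<th>Distributor</th>
<th>Worldwide gross
</th></tr>
<tr>
<th style="text-align:center;">1
</th>
<td><i><a href="/wiki/Mission:_Impossible_2" title="Mission: Impossible 2">Mission: Impossible 2</a></i>
</td>
<td><a href="/wiki/Paramount_Pictures" title="Paramount Pictures">Paramount</a>
</td>
<td>$546,388,105
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "wikitable"})

rows = []
for tr in table.find_all("tr"):
    # Each row mixes <th> and <td> cells, so ask for both
    cells = [cell.get_text().strip() for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, data = rows[0], rows[1:]
print(header)
print(data)
```

Because each cell is stripped on its own, no stray `\n` characters are left to clean up afterwards.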

In [75]:
df = pd.DataFrame(row_list, columns=columns)
df

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,\n1,Mission: Impossible 2,Paramount,"$546,388,105\n"
1,\n2,Gladiator,DreamWorks / Universal,"$460,583,960\n"
2,\n3,Cast Away,Fox / DreamWorks,"$429,632,142\n"
3,\n4,What Women Want,Paramount,"$374,111,707\n"
4,\n5,Dinosaur,Disney,"$349,822,765\n"
5,\n6,How the Grinch Stole Christmas,Universal,"$345,141,403\n"
6,\n7,Meet the Parents,Universal / DreamWorks,"$330,444,045\n"
7,\n8,The Perfect Storm,Warner Bros.,"$328,718,434\n"
8,\n9,X-Men,Fox,"$296,339,527\n"
9,\n10,What Lies Beneath,DreamWorks / Fox,"$291,420,351\n"


In [76]:
# Clean that up... Using an applymap!
# (In pandas 2.1+, applymap is deprecated in favor of df.map, which works the same way here)
df.applymap(lambda x: x.replace("\n", ""))

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,1,Mission: Impossible 2,Paramount,"$546,388,105"
1,2,Gladiator,DreamWorks / Universal,"$460,583,960"
2,3,Cast Away,Fox / DreamWorks,"$429,632,142"
3,4,What Women Want,Paramount,"$374,111,707"
4,5,Dinosaur,Disney,"$349,822,765"
5,6,How the Grinch Stole Christmas,Universal,"$345,141,403"
6,7,Meet the Parents,Universal / DreamWorks,"$330,444,045"
7,8,The Perfect Storm,Warner Bros.,"$328,718,434"
8,9,X-Men,Fox,"$296,339,527"
9,10,What Lies Beneath,DreamWorks / Fox,"$291,420,351"
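
Another way to do the same cleanup is to apply the string method `.str.strip()` column by column. This is a small sketch on a two-row stand-in for the dataframe above, and it assumes every column holds strings.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Rank": ["\n1", "\n2"],
        "Title": ["Mission: Impossible 2", "Gladiator"],
        "Worldwide gross": ["$546,388,105\n", "$460,583,960\n"],
    }
)

# apply() passes each column (a Series) to the function;
# .str.strip() removes leading and trailing whitespace, including "\n"
cleaned = df.apply(lambda col: col.str.strip())
print(cleaned)
```

Unlike `replace("\n", "")`, `strip()` only touches the ends of each string, so it would leave a newline in the middle of a value alone.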


**But wait...** there's a shortcut (thanks pandas)

In [78]:
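# Check out the read_html function
# Note - pandas likes the prettify objects better
# Newer pandas versions (2.1+) warn when given a raw HTML string, so wrap it in StringIO
from io import StringIO
pd.read_html(StringIO(table.prettify()))[0]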

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,1,Mission: Impossible 2,Paramount,"$546,388,105"
1,2,Gladiator,DreamWorks / Universal,"$460,583,960"
2,3,Cast Away,Fox / DreamWorks,"$429,632,142"
3,4,What Women Want,Paramount,"$374,111,707"
4,5,Dinosaur,Disney,"$349,822,765"
5,6,How the Grinch Stole Christmas,Universal,"$345,141,403"
6,7,Meet the Parents,Universal / DreamWorks,"$330,444,045"
7,8,The Perfect Storm,Warner Bros.,"$328,718,434"
8,9,X-Men,Fox,"$296,339,527"
9,10,What Lies Beneath,DreamWorks / Fox,"$291,420,351"
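
To see what `read_html` does without touching the network, here is a sketch that feeds it a two-row table written by hand in the same shape as the Wikipedia one. It assumes an HTML parser that pandas can use is installed (for example `lxml`), and it wraps the string in `StringIO` because recent pandas versions expect a file-like object.

```python
from io import StringIO

import pandas as pd

html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Rank</th>
<th>Title</th>
<th>Distributor</th>
<th>Worldwide gross
</th></tr>
<tr>
<th style="text-align:center;">1
</th>
<td><i>Mission: Impossible 2</i>
</td>
<td>Paramount
</td>
<td>$546,388,105
</td></tr>
<tr>
<th style="text-align:center;">2
</th>
<td><i>Gladiator</i>
</td>
<td>DreamWorks / Universal
</td>
<td>$460,583,960
</td></tr>
</tbody></table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

The first row, made up entirely of `<th>` cells, becomes the column header, and the gross values stay as strings because of the dollar sign.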


In [79]:
df = pd.read_html(table.prettify())[0]
df

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,1,Mission: Impossible 2,Paramount,"$546,388,105"
1,2,Gladiator,DreamWorks / Universal,"$460,583,960"
2,3,Cast Away,Fox / DreamWorks,"$429,632,142"
3,4,What Women Want,Paramount,"$374,111,707"
4,5,Dinosaur,Disney,"$349,822,765"
5,6,How the Grinch Stole Christmas,Universal,"$345,141,403"
6,7,Meet the Parents,Universal / DreamWorks,"$330,444,045"
7,8,The Perfect Storm,Warner Bros.,"$328,718,434"
8,9,X-Men,Fox,"$296,339,527"
9,10,What Lies Beneath,DreamWorks / Fox,"$291,420,351"


In [80]:
# Can add a column saying which year this ranking is from
df['Year'] = 2000

In [81]:
df

Unnamed: 0,Rank,Title,Distributor,Worldwide gross,Year
0,1,Mission: Impossible 2,Paramount,"$546,388,105",2000
1,2,Gladiator,DreamWorks / Universal,"$460,583,960",2000
2,3,Cast Away,Fox / DreamWorks,"$429,632,142",2000
3,4,What Women Want,Paramount,"$374,111,707",2000
4,5,Dinosaur,Disney,"$349,822,765",2000
5,6,How the Grinch Stole Christmas,Universal,"$345,141,403",2000
6,7,Meet the Parents,Universal / DreamWorks,"$330,444,045",2000
7,8,The Perfect Storm,Warner Bros.,"$328,718,434",2000
8,9,X-Men,Fox,"$296,339,527",2000
9,10,What Lies Beneath,DreamWorks / Fox,"$291,420,351",2000


### Now Loop It!

In [82]:
# My preference - create a list of dataframes, then concat afterwards
# Are there other ways to create one big df from this? OF COURSE!

import time
from io import StringIO

list_of_dfs = []

for year in range(2000, 2021):  # range excludes the stop value, so this covers 2000-2020
    url = f"https://en.wikipedia.org/wiki/{year}_in_film"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', {'class':"wikitable sortable"})
    df = pd.read_html(StringIO(table.prettify()))[0]
    df['Year'] = year
    list_of_dfs.append(df)
    # Short pause between requests so we don't hammer the server
    time.sleep(.1)
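
One thing the loop takes for granted is that every page has the table we expect. If `soup.find` comes back empty it returns `None`, and the next line raises an error. A hedged sketch of a guard follows, where `year_table_from_html` is a hypothetical helper and the sample HTML is a stand-in for a real page:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def year_table_from_html(html, year):
    """Return the first wikitable on a page as a DataFrame, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "wikitable"})
    if table is None:
        # No matching table: let the caller decide what to do
        return None
    df = pd.read_html(StringIO(str(table)))[0]
    df["Year"] = year
    return df


sample_html = """
<html><body>
<table class="wikitable sortable">
<tbody><tr><th>Rank</th><th>Title</th></tr>
<tr><th>1</th><td>Mission: Impossible 2</td></tr>
</tbody></table>
</body></html>
"""

found = year_table_from_html(sample_html, 2000)
missing = year_table_from_html("<html><body><p>No tables here</p></body></html>", 2001)
print(found)
print(missing)
```

Inside the loop, a `None` result could be skipped or logged instead of stopping the whole run.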

In [83]:
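# The table at index 6 (the year 2006) labels this column 'Distributor(s)'
# Rename it so the columns line up when we concat
list_of_dfs[6] = list_of_dfs[6].rename(columns={'Distributor(s)':'Distributor'})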
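
Why the rename matters: `pd.concat` lines frames up by column name, so a differently spelled header becomes its own column padded with missing values. A small sketch with two made-up one-row frames:

```python
import pandas as pd

a = pd.DataFrame({"Title": ["Film A"], "Distributor": ["Fox"]})
b = pd.DataFrame({"Title": ["Film B"], "Distributor(s)": ["Disney"]})

# Without renaming, the two spellings end up as separate columns
mismatched = pd.concat([a, b], ignore_index=True)

# Renaming first gives a single, complete column
# (rename ignores frames that don't have the old name)
renamed = [d.rename(columns={"Distributor(s)": "Distributor"}) for d in [a, b]]
fixed = pd.concat(renamed, ignore_index=True)

print(mismatched)
print(fixed)
```

Applying the rename to every frame, as in the list comprehension here, also avoids having to know which index holds the odd one out.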

In [84]:
# Now to concat...
full_df = pd.concat(list_of_dfs, ignore_index=True)
full_df

Unnamed: 0,Rank,Title,Distributor,Worldwide gross,Year
0,1,Mission: Impossible 2,Paramount,"$546,388,105",2000
1,2,Gladiator,DreamWorks / Universal,"$460,583,960",2000
2,3,Cast Away,Fox / DreamWorks,"$429,632,142",2000
3,4,What Women Want,Paramount,"$374,111,707",2000
4,5,Dinosaur,Disney,"$349,822,765",2000
...,...,...,...,...,...
205,6,Sonic the Hedgehog,Paramount,"$319,715,683",2020
206,7,Dolittle,Universal,"$251,409,960",2020
207,8,Jiang Ziya,Beijing enlight,"$243,883,429",2020
208,9,A Little Red Flower,HG Entertainment,"$238,600,000 [8] [9]",2020


Let's practice some data cleaning on the Worldwide Gross column:

In [85]:
# Can see a hint of some gross data in the Gross column
full_df['Worldwide gross']

0                $546,388,105
1                $460,583,960
2                $429,632,142
3                $374,111,707
4                $349,822,765
                ...          
205              $319,715,683
206              $251,409,960
207              $243,883,429
208    $238,600,000  [8]  [9]
209         $226,400,000  [a]
Name: Worldwide gross, Length: 210, dtype: object

In [86]:
# Let's check out that complicated example
full_df['Worldwide gross'][208]

'$238,600,000  [8]  [9]'

In [87]:
# Test out how to clean it...
full_df['Worldwide gross'][208].split(" ")[0]

'$238,600,000'

In [88]:
# Now let's do it on the whole column
full_df['Worldwide gross clean'] = full_df['Worldwide gross'].str.split(" ").str[0]

In [89]:
# Check our work
full_df.tail()

Unnamed: 0,Rank,Title,Distributor,Worldwide gross,Year,Worldwide gross clean
205,6,Sonic the Hedgehog,Paramount,"$319,715,683",2020,"$319,715,683"
206,7,Dolittle,Universal,"$251,409,960",2020,"$251,409,960"
207,8,Jiang Ziya,Beijing enlight,"$243,883,429",2020,"$243,883,429"
208,9,A Little Red Flower,HG Entertainment,"$238,600,000 [8] [9]",2020,"$238,600,000"
209,10,Shock Wave 2,Universe Films,"$226,400,000 [a]",2020,"$226,400,000"


In [90]:
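# Can also remove commas and dollar signs, and make the column integers!
# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
full_df['Worldwide gross clean'] = full_df['Worldwide gross clean'].str.replace(
    "$", "", regex=False).str.replace(",", "", regex=False).astype("int64")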
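
The split-then-replace steps can also be folded into one pass with a regular expression. This is an alternative sketch, not the method used above: it pulls out the run of digits and commas after the dollar sign, which drops footnote markers like `[8]` at the same time.

```python
import pandas as pd

gross = pd.Series(
    ["$546,388,105", "$238,600,000  [8]  [9]", "$226,400,000  [a]"]
)

clean = (
    gross.str.extract(r"\$([\d,]+)", expand=False)  # keep digits and commas after "$"
    .str.replace(",", "", regex=False)              # drop the thousands separators
    .astype("int64")
)
print(clean)
```

A value with no dollar amount would come back as missing from `extract`, and the final `astype("int64")` would then raise, which is a useful signal that the column needs a closer look.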

In [92]:
# Sanity check
full_df

Unnamed: 0,Rank,Title,Distributor,Worldwide gross,Year,Worldwide gross clean
0,1,Mission: Impossible 2,Paramount,"$546,388,105",2000,546388105
1,2,Gladiator,DreamWorks / Universal,"$460,583,960",2000,460583960
2,3,Cast Away,Fox / DreamWorks,"$429,632,142",2000,429632142
3,4,What Women Want,Paramount,"$374,111,707",2000,374111707
4,5,Dinosaur,Disney,"$349,822,765",2000,349822765
...,...,...,...,...,...,...
205,6,Sonic the Hedgehog,Paramount,"$319,715,683",2020,319715683
206,7,Dolittle,Universal,"$251,409,960",2020,251409960
207,8,Jiang Ziya,Beijing enlight,"$243,883,429",2020,243883429
208,9,A Little Red Flower,HG Entertainment,"$238,600,000 [8] [9]",2020,238600000


## Discussion Time!

What else could we do with web scraping? Any project ideas pop into mind? Any useful things on that page we could also use to grab more data? Let's discuss!

- The results contain URLs, so we could grab even more data on each movie by following them
- Can loop through any kind of repeatable URL, provided you figure out the pattern!
- The possibilities are endless... (but don't forget to check the terms of use or copyright!)
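
As a starting point for the first idea, the film links can be collected from the table rows and turned into full URLs. This is a sketch on a hand-trimmed copy of the 2000 table; it assumes the title link sits in the first `<td>` of each row, and it does not request any of the pages.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tbody>
<tr><th>Rank</th><th>Title</th></tr>
<tr><th>1</th><td><i><a href="/wiki/Mission:_Impossible_2" title="Mission: Impossible 2">Mission: Impossible 2</a></i></td></tr>
<tr><th>2</th><td><i><a href="/wiki/Gladiator_(2000_film)" title="Gladiator (2000 film)">Gladiator</a></i></td></tr>
</tbody></table>
"""

base_url = "https://en.wikipedia.org/wiki/2000_in_film"

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "wikitable"})

film_links = {}
for row in table.find_all("tr")[1:]:  # skip the header row
    link = row.find("td").find("a")
    if link is not None and link.has_attr("href"):
        # hrefs on the page are relative, so resolve them against the page URL
        film_links[link.get_text()] = urljoin(base_url, link["href"])

print(film_links)
```

Each URL in `film_links` could then be fetched and parsed the same way as the yearly pages, with a pause between requests.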
