**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 22. Web Scraping with Pandas

## Overview

- **Web scraping** is the process of collecting structured data from web pages in an automated fashion


- In this lesson, we'll see how we can use some functionality built into Pandas to read tabular data from web pages

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Reading data from the clipboard 

- Let's start by importing Pandas:

```python
import pandas as pd
```

- One easy way to read tabular data from a web page is to perform a slightly fancier version of copy-and-paste 

- `pd.read_clipboard()` reads the text in your clipboard and passes it to `pd.read_csv()` to create a DataFrame
    - [Documentation for `pd.read_clipboard()`](https://pandas.pydata.org/docs/reference/api/pandas.read_clipboard.html)

- In this way, we can "manually" scrape data from a web page
    - This method is good to use in a pinch
    - Be careful, though, since this method isn't easily automated

- As an example, let's take a look at the [Wikipedia page for Super Bowl LIV](https://en.wikipedia.org/wiki/Super_Bowl_LIV)


- Highlight the "Team-to-team comparison" table in your browser, and copy it


- Now let's use `pd.read_clipboard()`:
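- What lands on the clipboard depends on what you just copied, so here is a minimal sketch that simulates the step instead. The `copied_text` string and its values are made up for illustration (they are not the actual Super Bowl LIV statistics), and the sketch assumes the browser copies the table as tab-separated text:

```python
import io

import pandas as pd

# Made-up stand-in for a table copied from a browser (tab-separated text)
copied_text = (
    "Statistic\tTeam A\tTeam B\n"
    "First downs\t20\t25\n"
    "Total yards\t350\t400\n"
)

# pd.read_clipboard() hands the clipboard text to pd.read_csv(),
# so parsing the same text directly gives the same kind of result
df = pd.read_csv(io.StringIO(copied_text), sep="\t")
print(df)

# With a real table on the clipboard, the call would simply be:
# df = pd.read_clipboard()
```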

- Looks good! 😎

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Reading data from a web page

- Instead of using the clipboard, we can ask Pandas to look for all the tables in a web page

- `pd.read_html()` reads any tables it finds in an HTML document, such as a web page given by its URL, into a *list of DataFrames*
    - [Documentation for `pd.read_html()`](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)

- For example, we can grab all the tables from the Wikipedia page on Super Bowl LIV like this:

- We can see how many tables Pandas found and converted to DataFrames:
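- As a rough sketch of these two steps that runs without a network connection, the example below parses a small made-up HTML page instead of the Wikipedia article. `SAMPLE_HTML` and `read_tables()` are hypothetical names introduced only for this illustration:

```python
import io

import pandas as pd

# Made-up miniature web page containing two tables
SAMPLE_HTML = """
<html><body>
  <table>
    <tr><th>Team</th><th>Points</th></tr>
    <tr><td>Team A</td><td>10</td></tr>
    <tr><td>Team B</td><td>20</td></tr>
  </table>
  <table>
    <tr><th>Quarter</th><th>Plays</th></tr>
    <tr><td>1</td><td>30</td></tr>
  </table>
</body></html>
"""


def read_tables(html_text):
    """Return every table found in html_text as a list of DataFrames."""
    # pd.read_html() needs an HTML parser such as lxml to be installed
    return pd.read_html(io.StringIO(html_text))
```

- With a parser installed, `tables = read_tables(SAMPLE_HTML)` returns a list, and `len(tables)` reports how many tables were found. For a live page, the URL can be passed to `pd.read_html()` directly as a string.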

- We can inspect each DataFrame to figure out which one we want


- For example, it turns out the "Team-to-team comparison" table above is the 7th table in the list:
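- One way to inspect the tables and pick one out by position is sketched below. The `tables` list here is a made-up stand-in for the list that `pd.read_html()` returns:

```python
import pandas as pd

# Made-up stand-in for the list that pd.read_html() returns
tables = [
    pd.DataFrame({"Team": ["Team A", "Team B"], "Points": [10, 20]}),
    pd.DataFrame({"Quarter": [1, 2, 3], "Plays": [30, 25, 28]}),
]

# Print each table's position, shape, and column names to spot the one we want
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))

# Lists are zero-indexed, so tables[0] is the first table found
first_table = tables[0]
```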

- Sometimes, Pandas doesn't convert the table to a DataFrame so cleanly

- For example, if we look at the "Scoring summary" table, which happens to be the 5th table in the list: 

- All the information is there, but it's kind of a mess


- Let's clean it up!

1. First, let's remove rows 0, 1, 2, and 12
    - In the past, we've used `.drop(columns=...)` to delete columns
    - We can delete rows using `.drop(index=...)`
    - [Documentation for `.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)

2. Let's rename the columns
    - In the past, we've used `.rename(columns=...)` to rename a few columns at a time
    - However, we want to rename all the columns, and there are many, so spelling out every old-to-new name pair would be cumbersome
    - We can use `.set_axis(..., axis='columns')` to rename all the columns at once
    - [Documentation for `.set_axis()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_axis.html)

3. Let's reset the index after all our work
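- The three steps above can be chained together. Here is a minimal sketch on a made-up messy table; the `messy` DataFrame, its junk rows, and the new column names are assumptions for illustration, not the actual "Scoring summary" table, and using `drop=True` in `.reset_index()` is one reasonable choice rather than the only one:

```python
import pandas as pd

# Made-up messy table: junk rows at the top and bottom, generic column labels
messy = pd.DataFrame(
    [
        ["Scoring summary", "Scoring summary", "Scoring summary"],
        ["Quarter", "Team", "Play"],
        [1, "Team A", "Field goal"],
        [2, "Team B", "Touchdown"],
        ["Notes", "Notes", "Notes"],
    ]
)

clean = (
    messy
    .drop(index=[0, 1, 4])  # 1. remove the junk rows by index label
    .set_axis(["quarter", "team", "play"], axis="columns")  # 2. rename every column
    .reset_index(drop=True)  # 3. renumber the rows from 0, discarding the old index
)
print(clean)
```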

- Much better! 👍


<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## More advanced web scraping in Python

- This web scraping functionality built into Pandas can be quite useful!

- However, if you have more demanding web scraping needs &ndash; especially for data that is not tabular &ndash; you may need to look elsewhere

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for pulling data out of HTML (and XML) files
    - It is possibly the most popular Python library for these kinds of tasks, with many tutorials and guides available

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- Lesson inspired by [this article by Lynn Leifker](https://github.com/LBBL96/Pandas-Web-Scraping-Tutorial)