# Scraping with Pandas

In [1]:
import pandas as pd

We can use the Pandas `read_html` function to scrape the tables on a page; it parses every HTML `<table>` element it finds.

In [2]:
url = 'https://www.ttilgb.com/uiAgs03Action/openScreen.do'

In [3]:
tables = pd.read_html(url)
tables

HTTPError: HTTP Error 500: Internal Server Error
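
The 500 error above comes from the server rather than from Pandas, so the same call may work on another attempt or against another page. A minimal sketch of a more defensive fetch follows. The helper name `fetch_tables`, the timeout, and the `User-Agent` value are assumptions for illustration and are not part of the original notebook.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd


def fetch_tables(url, timeout=10):
    """Return the list of DataFrames found at `url`, or [] if the request fails."""
    # Hypothetical User-Agent value; some servers reject urllib's default one.
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urlopen(request, timeout=timeout) as response:
            html = response.read().decode('utf-8', errors='replace')
    except OSError as err:  # covers URLError, HTTPError, and timeouts
        print(f'Request failed: {err}')
        return []
    # Wrap the markup in StringIO; recent Pandas versions deprecate raw strings.
    return pd.read_html(StringIO(html))
```

Calling `tables = fetch_tables(url)` then gives an empty list instead of a traceback when the server is unavailable. Note that `read_html` still raises `ValueError` if the page loads but contains no tables.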

When the request succeeds, `read_html` returns a list of DataFrames, one for each table Pandas found on the page.

In [None]:
type(tables)

We can select any of those DataFrames with normal list indexing.

In [None]:
df = tables[0]
df.columns = ['State', 'Abr.', 'State-hood Rank', 'Capital', 
              'Capital Since', 'Area (sq-mi)', 'Municipal Population', 'Metropolitan', 
              'Metropolitan Population', 'Population Rank', 'Notes']
df.head()
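
Position `0` is only correct if the table we want happens to come first. As a rough sketch, we can list each DataFrame's shape and columns before choosing one. The two small DataFrames here are made-up stand-ins for whatever `read_html` returns.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html (illustrative data only).
tables = [
    pd.DataFrame({'Key': ['a'], 'Value': [1]}),
    pd.DataFrame({'State': ['Alabama', 'Alaska'],
                  'Capital': ['Montgomery', 'Juneau']}),
]

# Print a one-line summary of every table to see which one we need.
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))

# Pick the first table that has a 'Capital' column instead of hard-coding 0.
capitals = next(t for t in tables if 'Capital' in t.columns)
```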

Clean up the extra rows by dropping the first two.

In [None]:
df = df.iloc[2:]
df.head()
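
Note that `iloc[2:]` selects by position and keeps the original row labels, so the remaining rows start at label 2. A small sketch with made-up rows (the `junk` values are placeholders, not the real page content) shows the effect, plus an optional `reset_index` to renumber from 0.

```python
import pandas as pd

raw = pd.DataFrame({
    'State': ['junk', 'junk', 'Alabama', 'Alaska'],
    'Capital': ['junk', 'junk', 'Montgomery', 'Juneau'],
})

trimmed = raw.iloc[2:]  # drop the first two rows by position
print(list(trimmed.index))  # the labels still start at 2

renumbered = trimmed.reset_index(drop=True)  # relabel the rows from 0
print(list(renumbered.index))
```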

Set the index to the `State` column

In [None]:
df.set_index('State', inplace=True)
df.head()

In [None]:
df.loc['Alabama']
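
With the state names as the index, `loc` looks rows up by label, and it also accepts a column name to pull out a single value. A brief sketch on made-up stand-in data:

```python
import pandas as pd

states = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'Capital': ['Montgomery', 'Juneau'],
}).set_index('State')  # returns a new DataFrame indexed by state name

row = states.loc['Alabama']                # a Series holding one row
capital = states.loc['Alaska', 'Capital']  # a single cell
print(capital)
```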

## DataFrames as HTML

Pandas also has a `to_html` method that we can use to generate HTML tables from DataFrames.

In [None]:
html_table = df.to_html()
html_table

You may have to strip unwanted newlines to clean up the table.

In [None]:
html_table.replace('\n', '')
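
`to_html` also takes options that shape the markup. The sketch below, on made-up data, uses the `index` and `classes` parameters; the class name `table` is just an example value.

```python
import pandas as pd

df_small = pd.DataFrame({'State': ['Alabama'], 'Capital': ['Montgomery']})

# index=False omits the row labels; classes adds CSS classes to the <table> tag.
html = df_small.to_html(index=False, classes='table')
compact = html.replace('\n', '')  # strip the newlines, as above
print(compact)
```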

You can also save the table directly to a file.

In [None]:
df.to_html('table.html')

In [None]:
# macOS users can run this to open the file in a browser;
# otherwise, find the file manually and open it in a browser
!open table.html
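
On other platforms, Python's standard `webbrowser` module can open the file instead of the shell's `open` command. A minimal sketch follows. The file name `example_table.html`, the helper `open_in_browser`, and the data are assumptions chosen so the cell runs without the DataFrame built earlier.

```python
import webbrowser
from pathlib import Path

import pandas as pd

# Write a small made-up table so this cell is self-contained.
pd.DataFrame({'State': ['Alabama'],
              'Capital': ['Montgomery']}).to_html('example_table.html')

path = Path('example_table.html').resolve()


def open_in_browser(file_path):
    """Open a local file in the default browser; returns True on success."""
    return webbrowser.open(file_path.as_uri())


# open_in_browser(path)  # uncomment to launch the browser
```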