# Scraping with Pandas

Pandas has native [HTML scraping capabilities](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) that are useful when the data is stored in an HTML table.

### Import dependencies

In [None]:
import pandas as pd

### Define a variable to store the [URL for US capitals](https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States)

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'

### Read HTML page with Pandas
We can use the [`read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) function in Pandas to parse every HTML table it finds on a page, with no manual scraping code.

In [None]:
tables = pd.read_html(url)
len(tables)
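
Because live pages change over time, it can help to try `read_html` on a small, fixed HTML snippet first. The sketch below is an illustration rather than part of the original walkthrough: the `html` string and its two tables are made up, and it assumes an HTML parser that `read_html` can use (such as `lxml`) is installed. It also uses the `match` argument to keep only the tables whose text matches a given string.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables, standing in for a downloaded page.
html = """
<table>
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Texas</td><td>Austin</td></tr>
  <tr><td>Ohio</td><td>Columbus</td></tr>
</table>
<table>
  <tr><th>Planet</th><th>Moons</th></tr>
  <tr><td>Mars</td><td>2</td></tr>
</table>
"""

# Without `match`, every table on the page is returned.
all_tables = pd.read_html(StringIO(html))

# With `match`, only tables containing the given text are returned.
capital_tables = pd.read_html(StringIO(html), match="Capital")

print(len(all_tables), len(capital_tables))
print(capital_tables[0])
```

Wrapping the string in `StringIO` hands it to `read_html` as a file-like object, which is accepted in the same way as a URL or a file path.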

### Observe resulting content types
What we get in return is a list of DataFrames, one for each table that Pandas found on the page.

In [None]:
print(type(tables))
print(type(tables[0]))

### Grab one single DataFrame to examine
We can pick any of those DataFrames out of the list using normal indexing. Which one we choose is up to us; here we take the first one and give its columns readable names.

In [None]:
df = tables[0]
df.columns = ['State', 'Abr.', 'State-hood Rank', 'Capital', 
              'Capital Since', 'Area (sq-mi)', 'Municipal Population', 'Metropolitan', 
              'Metropolitan Population', 'Population Rank', 'Notes']
df.head()
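
Assigning a hard-coded list to `df.columns` raises an error if the table on the page ever gains or loses a column. One possible guard is sketched below. The helper name `assign_columns` and the two-column sample data are hypothetical and used only for illustration.

```python
import pandas as pd


def assign_columns(frame, names):
    """Return a copy of `frame` with new column names.

    Hypothetical helper: fails with a descriptive message when the
    number of names does not match the number of columns.
    """
    if len(names) != frame.shape[1]:
        raise ValueError(
            f"expected {frame.shape[1]} column names, got {len(names)}"
        )
    renamed_frame = frame.copy()
    renamed_frame.columns = names
    return renamed_frame


# Small stand-in for a scraped table with unnamed columns.
sample = pd.DataFrame([["Texas", "Austin"], ["Ohio", "Columbus"]])
renamed = assign_columns(sample, ["State", "Capital"])
print(renamed)
```

Checking `df.shape` before renaming serves the same purpose in an interactive session.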

### Remove excess header rows

In [None]:
df = df.iloc[2:]
df.head()
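
`iloc` selects rows by position, so `iloc[2:]` keeps everything from the third row onward. The sketch below shows the same step on made-up data in which the first two rows are leftover header text. The `reset_index(drop=True)` call is an optional extra, not something the step above requires.

```python
import pandas as pd

# Made-up frame whose first two rows repeat the header text.
raw = pd.DataFrame(
    {
        "State": ["State", "State", "Texas", "Ohio"],
        "Capital": ["Capital", "Capital", "Austin", "Columbus"],
    }
)

# Keep rows from position 2 onward; the original labels (2, 3) are retained.
trimmed = raw.iloc[2:]
print(list(trimmed.index))

# Optionally renumber the rows from zero.
renumbered = trimmed.reset_index(drop=True)
print(renumbered)
```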

### Set the index to the `State` column

In [None]:
# Reassign instead of using inplace=True, since df was created by slicing another DataFrame
df = df.set_index('State')
df.head()
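
As a small illustration on made-up data, `set_index` returns a new DataFrame by default and leaves the original untouched, and `reset_index` moves the index back into a regular column. The variable names here are arbitrary.

```python
import pandas as pd

capitals = pd.DataFrame(
    {"State": ["Texas", "Ohio"], "Capital": ["Austin", "Columbus"]}
)

# Returns a new frame indexed by state; `capitals` itself is unchanged.
indexed = capitals.set_index("State")
print(indexed)

# Move the index back into an ordinary column.
restored = indexed.reset_index()
print(restored)
```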

### View a single state's series

In [None]:
df.loc['Texas']
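
With the state names as the index, `.loc` looks rows up by label. The sketch below uses a made-up three-column frame to show the three common shapes of result: a Series for one row, a scalar for one cell, and a DataFrame when lists of labels are passed.

```python
import pandas as pd

capitals = pd.DataFrame(
    {
        "State": ["Texas", "Ohio"],
        "Abr.": ["TX", "OH"],
        "Capital": ["Austin", "Columbus"],
    }
).set_index("State")

# One row label returns a Series whose index is the column names.
row = capitals.loc["Texas"]

# Row label plus column label returns a single value.
cell = capitals.loc["Texas", "Capital"]

# Lists of labels return a DataFrame.
subset = capitals.loc[["Texas", "Ohio"], ["Capital"]]

print(row)
print(cell)
print(subset)
```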

## DataFrames as HTML

Pandas also has a `to_html` method that we can use to generate HTML tables from DataFrames.

In [None]:
html_table = df.to_html()
html_table
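
`to_html` accepts a few options that shape the generated markup. The sketch below, on made-up data, drops the index column with `index=False` and adds CSS class names with `classes`. The class names `table table-striped` are only an example and assume a stylesheet that defines them.

```python
import pandas as pd

capitals = pd.DataFrame(
    {"State": ["Texas", "Ohio"], "Capital": ["Austin", "Columbus"]}
)

# index=False omits the row labels; classes adds extra CSS classes
# alongside the default "dataframe" class on the <table> element.
styled_html = capitals.to_html(index=False, classes="table table-striped")
print(styled_html)
```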

### You may have to strip unwanted newlines to clean up the table.

[`.replace()` Documentation](https://www.tutorialspoint.com/python/string_replace.htm)

In [None]:
html_table.replace('\n', '')
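
Python strings are immutable, so `replace` returns a new string and the result has to be assigned to a name to be kept. The sketch below uses a short hand-written HTML string rather than real `to_html` output, and also shows one way to drop the indentation that `replace` alone leaves behind.

```python
# Short hand-written stand-in for the output of to_html().
html_table = "<table>\n  <tr>\n    <td>Austin</td>\n  </tr>\n</table>"

# replace() returns a new string; html_table itself is unchanged.
no_newlines = html_table.replace("\n", "")

# Stripping each line first also removes the leading indentation.
compact = "".join(line.strip() for line in html_table.splitlines())

print(no_newlines)
print(compact)
```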

### You can also save the table directly to a file.

In [None]:
df.to_html('table.html')

In [None]:
# macOS users can run this to open the file in a browser;
# on other systems, find the file manually and open it in a browser
!open table.html
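
For a platform-independent alternative to the `open` command, Python's standard `webbrowser` module can be used. The sketch below is one possible approach: the helper name `open_in_browser`, the sample data, and the temporary output directory are all illustrative assumptions.

```python
import tempfile
import webbrowser
from pathlib import Path

import pandas as pd


def open_in_browser(path):
    """Open a local file in the default browser and return the URI used.

    Hypothetical helper built on the standard-library webbrowser module.
    """
    uri = Path(path).resolve().as_uri()
    webbrowser.open(uri)
    return uri


capitals = pd.DataFrame(
    {"State": ["Texas", "Ohio"], "Capital": ["Austin", "Columbus"]}
)

# Write the table to a file in a temporary directory.
output_path = Path(tempfile.mkdtemp()) / "table.html"
capitals.to_html(output_path)
print(output_path)

# Uncomment to launch the default browser:
# open_in_browser(output_path)
```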