In [1]:
%matplotlib inline

# Exercise 09: Web Scraping Wikipedia

I would like you to examine whether there is a linear correlation between the land area of a US state and the year it was admitted to the union.

Objectives: 
+ Scraping a table from a webpage
+ Storing that data in a dataframe
+ Performing a linear regression on that data

## Part A
Using the URL I've provided below, I want you to scrape:
1. The name of each state
2. The year of admittance for each state
3. The land area for each state

Examine the webpage at the URL I've provided using your browser's element inspector to determine how to parse the relevant table.

Store the collected data in a pandas DataFrame.

## Part B
Once you have scraped the necessary data, I would like you to perform a linear regression of the land area of each state (y-axis) on the year of admittance for each state (x-axis) using the `LinearRegression` model from scikit-learn.

You may use the [API reference](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and [this example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) to assist you with your regression.

Plot the data points and regression line. Print out the coefficients, mean squared error, and $r^2$ value of this model.
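As a rough illustration of what those three quantities mean for a single-feature model, here is a minimal pure-Python sketch of an ordinary least squares fit. The helper names `fit_line` and `mse_and_r2` and the toy data are made up for this example; they are not part of scikit-learn, and the exercise itself should still use `LinearRegression`.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse_and_r2(xs, ys, slope, intercept):
    """Return (mean squared error, r^2) of the fitted line on the given points."""
    preds = [slope * x + intercept for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # residual sum of squares
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
    return ss_res / len(ys), 1 - ss_res / ss_tot


# Toy data, invented for illustration only (not scraped values)
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [1.0, 3.0, 2.0, 4.0]

slope, intercept = fit_line(toy_x, toy_y)
mse, r2 = mse_and_r2(toy_x, toy_y, slope, intercept)
print(slope, intercept, mse, r2)
```

The slope plays the role of the single entry in `coef_`, and the two returned metrics correspond to what `mean_squared_error` and `r2_score` report when given the true and predicted values.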


In [4]:
import urllib.request  # urlopen lives in the urllib.request submodule
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import bs4
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

In [5]:
wiki_url='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
with urllib.request.urlopen(wiki_url) as response:
    # Name the parser explicitly so the result does not depend on what is installed
    territory = bs4.BeautifulSoup(response, 'html.parser')
    

In [4]:
territory.title

<title>List of states and territories of the United States - Wikipedia</title>

In [6]:
all_tables = territory.find_all('table')

In [7]:
len(all_tables)

18

In [8]:
states_table = territory.find_all('table',
                                  {'class': 'wikitable sortable plainrowheaders'})
len(states_table)

4

In [9]:
states_table[0]


<table class="wikitable sortable plainrowheaders" style="text-align: center;">
<caption>States of the United States of America
</caption>
<tbody><tr>
<th colspan="2" rowspan="2" scope="col">Flag, name and<br/><a class="mw-redirect" href="/wiki/List_of_U.S._state_abbreviations" title="List of U.S. state abbreviations">postal abbreviation</a><sup class="reference" id="cite_ref-USPSabbreviations_14-0"><a href="#cite_note-USPSabbreviations-14">[12]</a></sup>
</th>
<th colspan="2" scope="col">Cities
</th>
<th rowspan="2" scope="col">Ratification or<br/>admission<sup class="reference" id="cite_ref-16"><a href="#cite_note-16">[C]</a></sup>
</th>
<th rowspan="2" scope="col">Population<br/><sup class="reference" id="cite_ref-17"><a href="#cite_note-17">[D]</a></sup><sup class="reference" id="cite_ref-AnnualEstUS_18-0"><a href="#cite_note-AnnualEstUS-18">[14]</a></sup>
</th>
<th colspan="2" scope="col">Total area<sup class="reference" id="cite_ref-areameasurements_19-0"><a href="#cite_note-aream

In [None]:
import requests

In [11]:
states = states_table[0].find_all('a')
states

[<a class="mw-redirect" href="/wiki/List_of_U.S._state_abbreviations" title="List of U.S. state abbreviations">postal abbreviation</a>,
 <a href="#cite_note-USPSabbreviations-14">[12]</a>,
 <a href="#cite_note-16">[C]</a>,
 <a href="#cite_note-17">[D]</a>,
 <a href="#cite_note-AnnualEstUS-18">[14]</a>,
 <a href="#cite_note-areameasurements-19">[15]</a>,
 <a href="#cite_note-areameasurements-19">[15]</a>,
 <a href="#cite_note-areameasurements-19">[15]</a>,
 <a href="/wiki/List_of_United_States_congressional_districts" title="List of United States congressional districts">Number<br/>of Reps.</a>,
 <a href="#cite_note-State_and_Local_Government_Finances_and_Employment-20">[16]</a>,
 <a href="/wiki/Alabama" title="Alabama">Alabama</a>,
 <a href="/wiki/Montgomery,_Alabama" title="Montgomery, Alabama">Montgomery</a>,
 <a href="/wiki/Birmingham,_Alabama" title="Birmingham, Alabama">Birmingham</a>,
 <a href="/wiki/Alaska" title="Alaska">Alaska</a>,
 <a href="/wiki/Juneau,_Alaska" title="Juneau

In [10]:
table_rows = states_table[0].find_all('tr')
table_rows

[<tr>
 <th colspan="2" rowspan="2" scope="col">Flag, name and<br/><a class="mw-redirect" href="/wiki/List_of_U.S._state_abbreviations" title="List of U.S. state abbreviations">postal abbreviation</a><sup class="reference" id="cite_ref-USPSabbreviations_14-0"><a href="#cite_note-USPSabbreviations-14">[12]</a></sup>
 </th>
 <th colspan="2" scope="col">Cities
 </th>
 <th rowspan="2" scope="col">Ratification or<br/>admission<sup class="reference" id="cite_ref-16"><a href="#cite_note-16">[C]</a></sup>
 </th>
 <th rowspan="2" scope="col">Population<br/><sup class="reference" id="cite_ref-17"><a href="#cite_note-17">[D]</a></sup><sup class="reference" id="cite_ref-AnnualEstUS_18-0"><a href="#cite_note-AnnualEstUS-18">[14]</a></sup>
 </th>
 <th colspan="2" scope="col">Total area<sup class="reference" id="cite_ref-areameasurements_19-0"><a href="#cite_note-areameasurements-19">[15]</a></sup>
 </th>
 <th colspan="2" scope="col">Land area<sup class="reference" id="cite_ref-areameasurements_19-1">

In [11]:
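master_list = []
for row in table_rows:
    cells = row.find_all('td')
    # Header rows contain only <th> cells, so there is nothing to collect
    if not cells:
        continue
    # Append each row once, after all of its cells have been read
    master_list.append(tuple(cell.text.strip() for cell in cells))
# Drop duplicate rows (note that a set does not preserve the table order)
final = list(set(master_list))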
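Converting the rows to a set removes duplicates but also discards the order in which they appeared in the table. If the original order matters, one possible alternative is `dict.fromkeys`, which keeps the first occurrence of each row. The sample rows below are hypothetical stand-ins for the scraped tuples.

```python
# Hypothetical rows standing in for the scraped tuples
rows = [("AL", "Dec 14, 1819"), ("AK", "Jan 3, 1959"), ("AL", "Dec 14, 1819")]

# dict keys are unique and keep insertion order, so this de-duplicates
# while preserving the order of first appearance
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```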

In [32]:
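import numpy as np
nosubrow = []
for m in final:
    m = np.array(m)
    # Rows with 12 <td> cells have one extra cell before the admission date
    # (presumably separate capital and largest-city cells), so the column
    # positions shift by one. Keep abbreviation, admission date and land area.
    if len(m) == 12:
        nosubrow.append(m[[0,3,7]])
    else:
        nosubrow.append(m[[0,2,6]])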

In [17]:
df1 = pd.DataFrame(data = nosubrow, columns = ['state name', 
                                              'admittance','land area'])

In [18]:
df1.sort_values('state name')

Unnamed: 0,state name,admittance,land area
36,AK,"Jan 3, 1959",570641
12,AL,"Dec 14, 1819",50645
23,AR,"Jun 15, 1836",52035
16,AZ,"Feb 14, 1912",113594
43,CA,"Sep 9, 1850",155779
31,CO,"Aug 1, 1876",103642
18,CT,"Jan 9, 1788",4842
14,DE,"Dec 7, 1787",1949
34,FL,"Mar 3, 1845",53625
39,GA,"Jan 2, 1788",57513
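The admittance column holds full dates such as "Dec 14, 1819", and scraped areas can carry thousands separators such as "43,204", so both need cleaning before they can be used as numbers. A minimal sketch of that cleaning step follows; the helper names `parse_year` and `parse_area` are made up for this example.

```python
import re


def parse_year(date_text):
    """Return the first four-digit number in a date string as an int."""
    match = re.search(r'\d{4}', date_text)
    if match is None:
        raise ValueError(f'no year found in {date_text!r}')
    return int(match.group())


def parse_area(area_text):
    """Convert an area string to an int, ignoring thousands separators."""
    return int(area_text.replace(',', ''))


print(parse_year("Dec 14, 1819"))
print(parse_area("43,204"))
```

The same idea carries over to a DataFrame column through the pandas string methods (`.str.extract` and `.str.replace`).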


In [40]:
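# create linear regression model
# create x (year of admittance) and y (land area)
# Strip thousands separators such as "43,204" before converting to numbers
land = pd.to_numeric(df1['land area'].str.replace(',', '', regex=False))
# Pull the four-digit year out of dates such as "Dec 14, 1819"
year = df1['admittance'].str.extract(r'(\d{4})', expand=False).astype(int)

# scikit-learn expects a 2-D feature matrix, so pass x as a one-column DataFrame
x = year.to_frame()
y = land

# Fit the model and compute the predictions used for the regression line
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)

In [None]:

# Plot the data points and the fitted regression line
plt.scatter(year, land, color='black')
plt.plot(year, y_pred, color='blue', linewidth=3)
plt.xlabel('Year of admittance')
plt.ylabel('Land area (sq mi)')

plt.show()

In [None]:
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(y, y_pred))
# The coefficient of determination
print('Coefficient of determination (r^2): %.2f'
      % r2_score(y, y_pred))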