In [1]:
%matplotlib inline

# Exercise 09: Web Scraping Wikipedia

I would like you to examine whether or not there is a linear correlation between the size of a US state and the year it was admitted to the union.

Objectives: 
+ Scraping a table from a webpage
+ Storing that data in a dataframe
+ Performing a linear regression on that data

## Part A
Using the URL I've provided below, I want you to scrape:
1. The name of each state
2. The year of admittance for each state
3. The land area for each state

Open the page in your browser and use the element inspector to work out how to parse the relevant table.

Store the collected data in a pandas DataFrame.

## Part B
Once you have scraped the necessary data, I would like you to perform a linear regression of the land area of each state (y-axis) on its year of admittance (x-axis) using the `LinearRegression` model from scikit-learn.

You may use the [API reference](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and [this example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) to assist you with your regression.

Plot the data points and regression line.  Print out the coefficients, mean squared error, and $r^2$ values of this model.


In [2]:
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import statsmodels.api as sm
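import requests
import scrapy
import numpy as np

Read the HTML content with the `requests` package and wrap it in a Scrapy selector

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
# .text returns a decoded string, which is what Selector(text=...) expects
html = requests.get(url).text
# The selector lets us query the page with CSS expressions
sel = scrapy.Selector(text=html)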

* Initialize three lists for states, year of admission and area
* Get the tr/td elements as necessary
* Clean the year and area lists
* Create a dataframe
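
The cleaning step can be tried in isolation before running the scraper. The sketch below assumes the raw cells look like `'Dec 14, 1819'` and `'52,420\n'`; these strings and the helper names `parse_year` and `parse_area` are illustrative, not actual scraped output.

```python
# Hypothetical raw cell strings, shaped like what the scraper is assumed to return
raw_dates = ["Dec 14, 1819", "Jan 3, 1959"]
raw_areas = ["52,420\n", "665,384\n"]

def parse_year(cell: str) -> int:
    """Return the year from a 'Mon D, YYYY' string."""
    return int(cell.split(", ")[1])

def parse_area(cell: str) -> int:
    """Keep the text before the first newline, drop thousands separators, convert to int."""
    return int(cell.split("\n")[0].replace(",", ""))

years = [parse_year(d) for d in raw_dates]
areas = [parse_area(a) for a in raw_areas]
print(years, areas)
```

If the page layout changes (for example, a footnote marker is appended to a date), these helpers would raise an error, which is a useful signal that the selectors need revisiting.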

In [None]:
states_list = []
admin_list = []
area_list = []
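# Walk the rows of the first table on the page, skipping its two header rows
for tri in sel.css("table")[0].css('tbody tr')[2:]:
    # The state name is the title of the link in the row header
    states_list.append(tri.css("th a::attr(title)").extract()[0])
    # Rows with 12 <td> cells have one extra cell ahead of the date, so the indices shift by one
    if len(tri.css("td"))==12:
        admin_list.append(tri.css("td")[3].css("::text").extract()[0])
        area_list.append(tri.css("td")[5].css("::text").extract()[0])
    else:
        admin_list.append(tri.css("td")[2].css("::text").extract()[0])
        area_list.append(tri.css("td")[4].css("::text").extract()[0])

# Keep only the year from dates of the form "Mon D, YYYY"
admin_list = [int(i.split(", ")[1]) for i in admin_list]
# Keep the text before the newline and strip thousands separators
area_list = [int(i.split("\n")[0].replace(",","")) for i in area_list]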
states = pd.DataFrame({"State":states_list,"Year_of_admission":admin_list,"Area":area_list})
states.head()

An easier way to pull the table is the pandas built-in function `read_html`

In [None]:
tbl = pd.read_html(url)
states1 = tbl[0].iloc[:,[0,4,6]].copy()
states1.columns = ["State","Year_of_admission","Area"]
states1.Year_of_admission = states1.Year_of_admission.apply(lambda x:x.split(", ")[1])
states1.head()

Build a linear regression model with scikit-learn and report the coefficient, $r^2$, and mean squared error

In [None]:
X = np.array(states["Year_of_admission"]).reshape(-1,1)
y = np.array(states["Area"])
model = linear_model.LinearRegression().fit(X, y)
print(f"The coefficient of the model is: {model.coef_}")
print(f"R-squared is: {model.score(X,y)}")
print(f"Mean squared error is: {mean_squared_error(y,model.predict(X))}")
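
To see what these three numbers mean, the sketch below reproduces them by hand with the ordinary least squares formulas. The data are made up for illustration (they are not the scraped state data), and the variable names are my own.

```python
# Toy data, made up for illustration: x plays the role of year, y of area
x = [1800.0, 1850.0, 1900.0, 1950.0]
y = [10.0, 35.0, 45.0, 70.0]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares slope and intercept for a single predictor
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Predictions and the two error summaries
y_pred = [intercept + slope * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
mse = ss_res / n              # mean squared error
r2 = 1.0 - ss_res / ss_tot    # coefficient of determination

print(f"slope={slope:.4f}, intercept={intercept:.2f}, mse={mse:.4f}, r2={r2:.4f}")
```

On the same inputs, `model.coef_[0]`, `mean_squared_error`, and `model.score` should agree with `slope`, `mse`, and `r2` up to floating-point error.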

We can also fit the model with statsmodels, whose summary reports the p-value of each coefficient and so tells us whether the slope is statistically significant

In [None]:
X2 = sm.add_constant(X)
est = sm.OLS(y,X2).fit()
est.summary()

The p-value of the slope is below 0.05, so the coefficient is statistically significant at the 5% level
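
The p-value in the summary comes from a t-statistic: the slope divided by its standard error, compared against a t-distribution with n - 2 degrees of freedom. The sketch below computes that t-statistic by hand on made-up data (not the state data); turning it into a p-value needs a t-distribution CDF, which is left out here.

```python
import math

# Toy data, made up for illustration (not the scraped state data)
x = [1800.0, 1850.0, 1900.0, 1950.0]
y = [10.0, 35.0, 45.0, 70.0]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Residual variance uses n - 2 degrees of freedom (slope and intercept are both estimated)
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
sigma2 = ss_res / (n - 2)

# Standard error of the slope and the resulting t-statistic
std_err = math.sqrt(sigma2 / sxx)
t_stat = slope / std_err
print(f"slope={slope:.4f}, std_err={std_err:.4f}, t={t_stat:.2f}")
```

A large absolute t-statistic corresponds to a small p-value; the `t` and `P>|t|` columns of `est.summary()` report both for the fitted model.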

The plot below shows the data points and the fitted regression line

In [None]:
plt.scatter(X,y,color="black")
plt.plot(X,model.predict(X),color="red",linewidth = 3)
plt.show()

The only outliers are Alaska and Hawaii, which were both admitted late but have a very large and a very small area, respectively