## 1. Web Scraping

Modify the scripts we used in class to make a program that downloads both tables from the Wikipedia page on Anscombe's quartet (https://en.wikipedia.org/wiki/Anscombe%27s_quartet). Save each table in its own CSV file. **Note:** The file for the first table should contain the column names; the file for the second table does not need them.

In [3]:
# BeautifulSoup approach
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/Anscombe%27s_quartet')
soup = BeautifulSoup(page.text, 'html5lib')


def table_to_dataframe(table):
    """Convert a <table> element to a DataFrame, using header labels as column names when they fit."""
    columnNames = []
    fullTable = []
    for tr in table.find_all('tr'):
        headers = tr.find_all('th')
        if headers:
            # Header row: keep its labels (a later header row overwrites an earlier one)
            columnNames = [th.get_text().strip() for th in headers]
        else:
            # Data row: collect the text of every cell
            fullTable.append([td.get_text().strip() for td in tr.find_all('td')])
    # Use the header labels only when their count matches the width of the data rows
    if fullTable and len(columnNames) == len(fullTable[0]):
        return pd.DataFrame(fullTable, columns=columnNames)
    return pd.DataFrame(fullTable)


for table in soup.find_all("table"):
    # The table with a <caption> is saved as table 2; the one without as table 1
    filename = 'anscombe_table2.csv' if table.find('caption') else 'anscombe_table1.csv'
    table_to_dataframe(table).to_csv(filename)
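
The files written above keep the DataFrame index as an unnamed first column, which is why they are later read back with `index_col=[0]`. As a minimal sketch of how `to_csv` controls what gets written, the example below uses a small hypothetical stand-in table (not the scraped data) and returns the CSV text as a string instead of writing a file:

```python
import pandas as pd

# Hypothetical stand-in for a scraped table (values are for illustration only)
df = pd.DataFrame({"Property": ["Mean of x", "Mean of y"], "Value": ["9", "7.50"]})

# With no path argument, to_csv returns the CSV text as a string
with_index = df.to_csv()                           # index and column names
no_index = df.to_csv(index=False)                  # column names, no index
no_header = df.to_csv(index=False, header=False)   # data rows only

print(with_index)
print(no_index)
print(no_header)
```

Passing `header=False` is one way to write the second table without column names, as the note in the assignment allows.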

In [4]:
# pandas way to check:
import pandas as pd

# read in the tables
wiki_tables = pd.read_html('https://en.wikipedia.org/wiki/Anscombe%27s_quartet')

# name each table
property_table = wiki_tables[0]
anscombe_table = wiki_tables[1]

property_table
anscombe_table

Unnamed: 0,I,I.1,II,II.1,III,III.1,IV,IV.1
0,x,y,x,y,x,y,x,y
1,10.0,8.04,10.0,9.14,10.0,7.46,8.0,6.58
2,8.0,6.95,8.0,8.14,8.0,6.77,8.0,5.76
3,13.0,7.58,13.0,8.74,13.0,12.74,8.0,7.71
4,9.0,8.81,9.0,8.77,9.0,7.11,8.0,8.84
5,11.0,8.33,11.0,9.26,11.0,7.81,8.0,8.47
6,14.0,9.96,14.0,8.10,14.0,8.84,8.0,7.04
7,6.0,7.24,6.0,6.13,6.0,6.08,8.0,5.25
8,4.0,4.26,4.0,3.10,4.0,5.39,19.0,12.50
9,12.0,10.84,12.0,9.13,12.0,8.15,8.0,5.56


In [17]:
pd.read_csv("anscombe_table1.csv", index_col=[0])

Unnamed: 0,Property,Value,Accuracy
0,Mean of x,9,exact
1,Sample variance of x: s2x,11,exact
2,Mean of y,7.50,to 2 decimal places
3,Sample variance of y: s2y,4.125,±0.003
4,Correlation between x and y,0.816,to 3 decimal places
5,Linear regression line,y = 3.00 + 0.500x,"to 2 and 3 decimal places, respectively"
6,Coefficient of determination of the linear reg...,0.67,to 2 decimal places


In [15]:
pd.read_csv("anscombe_table2.csv",index_col=[0])

Unnamed: 0,0,1,2,3,4,5,6,7
0,x,y,x,y,x,y,x,y
1,10.0,8.04,10.0,9.14,10.0,7.46,8.0,6.58
2,8.0,6.95,8.0,8.14,8.0,6.77,8.0,5.76
3,13.0,7.58,13.0,8.74,13.0,12.74,8.0,7.71
4,9.0,8.81,9.0,8.77,9.0,7.11,8.0,8.84
5,11.0,8.33,11.0,9.26,11.0,7.81,8.0,8.47
6,14.0,9.96,14.0,8.10,14.0,8.84,8.0,7.04
7,6.0,7.24,6.0,6.13,6.0,6.08,8.0,5.25
8,4.0,4.26,4.0,3.10,4.0,5.39,19.0,12.50
9,12.0,10.84,12.0,9.13,12.0,8.15,8.0,5.56


## 2. Pandas and Stats

The Iris dataset is one of the most famous datasets in statistics. Read about it on Wikipedia: https://en.wikipedia.org/wiki/Iris_flower_data_set.

Download the dataset from the table on the Wikipedia page using BeautifulSoup or pandas, and create a pandas DataFrame containing the dataset (including column names). **Note:** The first column of the table contains only the order of the points in the dataset; it should become the index of your DataFrame.

In [5]:
import pandas as pd

# read_html returns a list of DataFrames; [0] selects the first table on the page
iris = pd.read_html("https://en.wikipedia.org/wiki/Iris_flower_data_set", header=0, index_col=0)[0]

Your DataFrame might have string values in its columns. If so, convert each column that should contain numbers to numeric values (check the function `pd.to_numeric`).

After converting the columns to numeric, use the `describe()` method to calculate the average and standard deviation for each variable.

In [6]:
# Check whether any of the variables that should be numeric are stored as strings
iris.info()

# All the data types (Dtype) are float64 except for Species, so the pd.to_numeric step can be skipped.
# Next, call describe() to calculate the average (mean) and standard deviation (std) for each numeric variable.
# To make the result easier to read, only the mean and std rows are returned.
# To see the full describe() output, remove .loc[["mean","std"]].
iris.describe().loc[["mean","std"]]

<class 'pandas.core.frame.DataFrame'>
Index: 150 entries, 1 to 150
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal length  150 non-null    float64
 1   Sepal width   150 non-null    float64
 2   Petal length  150 non-null    float64
 3   Petal width   150 non-null    float64
 4   Species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 7.0+ KB


Unnamed: 0,Sepal length,Sepal width,Petal length,Petal width
mean,5.843333,3.057333,3.758,1.2
std,0.828066,0.435866,1.765298,0.761401
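
Had any of the measurement columns been read as strings, the `pd.to_numeric` conversion mentioned in the instructions could be sketched as follows. The small frame and the `"n/a"` entry are hypothetical, chosen only to show what `errors="coerce"` does:

```python
import pandas as pd

# Hypothetical frame in which the measurements were read as strings
raw = pd.DataFrame({
    "Sepal length": ["5.1", "4.9", "n/a"],
    "Species": ["I. setosa", "I. setosa", "I. setosa"],
})

numeric_columns = ["Sepal length"]
converted = raw.copy()
# errors="coerce" turns entries that cannot be parsed into NaN instead of raising
converted[numeric_columns] = converted[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
```

Coercing hides bad entries as `NaN`, so it is worth counting missing values afterwards (for example with `converted.isna().sum()`) before computing statistics.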


Use the `.groupby()` method to group the data by species and calculate the average and standard deviation for each variable based on the iris species.

In [7]:
iris.groupby(["Species"]).describe().loc[:,(slice(None),["mean","std"])] 
# The .loc selection keeps only the mean and std of each numeric variable so the output is easier to read.
# Remove .loc[:,(slice(None),["mean","std"])] to get the full .describe() output.

Unnamed: 0_level_0,Sepal length,Sepal length,Sepal width,Sepal width,Petal length,Petal length,Petal width,Petal width
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
Species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
I. setosa,5.006,0.35249,3.428,0.379064,1.462,0.173664,0.248,0.105444
I. versicolor,5.936,0.516171,2.77,0.313798,4.26,0.469911,1.326,0.197753
I. virginica,6.588,0.63588,2.974,0.322497,5.552,0.551895,2.026,0.27465
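
An alternative to slicing the `describe()` output is to ask for only the two statistics with `.agg`. This is a minimal sketch on a hypothetical four-row dataset with the same layout as the iris table, not the real data:

```python
import pandas as pd

# Hypothetical miniature dataset (values are for illustration only)
sample = pd.DataFrame({
    "Species": ["I. setosa", "I. setosa", "I. virginica", "I. virginica"],
    "Petal length": [1.4, 1.6, 5.0, 6.0],
    "Petal width": [0.2, 0.4, 2.0, 2.2],
})

# One row per species; columns are (variable, statistic) pairs
summary = sample.groupby("Species").agg(["mean", "std"])
print(summary)
```

The result has the same two-level column layout as the table above, with `std` being the sample standard deviation (pandas uses `ddof=1` by default).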


Make a scatter plot matrix showing how the variables covary. Check plotly's `create_scatterplotmatrix` function from the `figure_factory` module. Your graph should look like this:

<img src="iris.png"></img>

In [1]:
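# This cell needs the `iris` DataFrame created above; run that cell first,
# otherwise the call below raises a NameError.
from plotly.figure_factory import create_scatterplotmatrix
import plotly.graph_objects as go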

fig = create_scatterplotmatrix(iris, index = "Species", diag = "histogram",
                        height=800, width=800)

for trace in fig['data']:
    if trace.type == 'histogram':
        trace.update(nbinsx=25)  

fig.update_layout(template='plotly_white',
                  title={'x':0.5})

fig.show()

NameError: name 'iris' is not defined
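
A numeric complement to the scatter plot matrix is the covariance (or correlation) matrix of the numeric columns. The sketch below uses a small hypothetical frame in place of `iris`; with the real data, the same calls could be applied to the four measurement columns:

```python
import pandas as pd

# Hypothetical measurements (the second column is exactly twice the first)
measurements = pd.DataFrame({
    "Petal length": [1.0, 2.0, 3.0, 4.0],
    "Petal width": [2.0, 4.0, 6.0, 8.0],
})

covariance = measurements.cov()    # pairwise sample covariances
correlation = measurements.corr()  # pairwise Pearson correlations

print(covariance)
print(correlation)
```

Each off-diagonal entry corresponds to one off-diagonal panel of the scatter plot matrix.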