# Lab 5 Exercise


## Outline 

1. Get data into Python Notebook
   * Open files from local file system
   * Open files from Web
  
2. Preprocess data
    * Pandas

3. Visualize data
    * **matplotlib**
    * **seaborn**

4. References

## 1. Get data into Python


[Download Data](https://gist.githubusercontent.com/wjidea/9617d9f9d36ce6343124f538709332ab/raw/ec71e921ee43b02d8ec830d0f758482f459bef92/iris_data.csv)

This step is OPTIONAL. Download the file only if you want to inspect the data in a local program such as MS Excel.

### Iris dataset

Source: https://archive.ics.uci.edu/ml/datasets/iris  
About: https://en.wikipedia.org/wiki/Iris_flower_data_set

<img src="http://drive.google.com/uc?export=view&id=1-OZp7Bw4sNE2Qpk2o6StvVSHn4Vb0zly" width="700" />


In [None]:
# Download data into local directory
!curl -s -H 'Accept: application/vnd.github.v3.raw+csv' -o 'iris_data.csv' \
https://gist.githubusercontent.com/wjidea/9617d9f9d36ce6343124f538709332ab/raw/ec71e921ee43b02d8ec830d0f758482f459bef92/iris_data.csv
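
As a complement to the `curl` download, `pd.read_csv` accepts a local path, a URL string, or a file-like object, so the gist URL above could also be passed to it directly. The sketch below uses an in-memory buffer so it runs offline; `csv_text` and `sample` are illustrative names, and the three rows are only a small stand-in that assumes the same column layout as `iris_data.csv`.

```python
import io
import pandas as pd

# Illustrative stand-in for the downloaded file (assumed column layout).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv treats the buffer like an open file; a path or URL string
# would be passed in the same position.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.dtypes)
```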

## 2. Data Processing


### Import dependent packages and load data

In [None]:
import numpy as np
import pandas as pd
iris_data = pd.read_csv('iris_data.csv')

In [None]:
iris_data.info();


In [None]:
iris_data[50:].head()

In [None]:
iris_data[100:].head()

In [None]:
iris_data.groupby('species').mean()

In [None]:
iris_data.groupby('species').std()

In [None]:
iris_data.describe()

### Data reshape

In [None]:
df2 = pd.melt(iris_data, id_vars=['species'], value_vars=['petal_width','petal_length','sepal_width','sepal_length'])

In [None]:
df2.head()
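
To make the reshape easier to follow, here is a minimal sketch of what `pd.melt` does on a tiny, made-up frame (`wide` and `long_df` are illustrative names, not part of the lab data). Each measurement column becomes a pair of `variable` and `value` entries, while the `id_vars` column is repeated.

```python
import pandas as pd

# Made-up two-row frame in "wide" format: one column per measurement.
wide = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "petal_width": [0.2, 2.5],
    "petal_length": [1.4, 6.0],
})

# "Long" format: one row per (species, measurement) pair.
long_df = pd.melt(wide, id_vars=["species"],
                  value_vars=["petal_width", "petal_length"])
print(long_df)
```

The long format is what `sns.catplot` expects when one column names the measurement and another holds its value.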

## 3. Data Visualization
  *   matplotlib
  *   seaborn

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
sns.countplot(data=iris_data, x="species");

In [None]:
sns.pairplot(data=iris_data, kind='scatter', hue='species');

In [None]:
# Get the current figure (a new one is created if none exists) and resize it
fig=plt.gcf()
fig.set_size_inches(10,7)
sns.scatterplot(x="petal_length", y="petal_width", data=iris_data, hue='species', s=50);

In [None]:
# Get the current figure (a new one is created if none exists) and resize it
fig=plt.gcf()
fig.set_size_inches(10,7)
sns.boxplot(x="species", y="petal_length", data=iris_data, whis=np.inf);
sns.swarmplot(x="species", y="petal_length", data=iris_data, color="0.2", s=6);

In [None]:
fig=plt.gcf()
fig.set_size_inches(10,7)
sns.violinplot(x="species", y="petal_length", data=iris_data, inner=None)
sns.swarmplot(x="species", y="petal_length", data=iris_data, color="0.2", edgecolor="black");

In [None]:
# cluster map (heatmap with row and column dendrograms)
df = iris_data.iloc[:,:4]
df1 = iris_data.species
x = dict(zip(df1.unique(),"rgb"))
row_colors = df1.map(x)
cg = sns.clustermap(df,row_colors=row_colors,figsize=(12, 12),metric="correlation")
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(),rotation = 0,size =12)
plt.setp(cg.ax_heatmap.xaxis.get_majorticklabels(),rotation = 0,size =12)
plt.show()
print(x)

In [None]:
# bar plot
g = sns.catplot(x="variable", y="value", hue="species", data=df2,
                height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("Length/Width (cm)");

## **Summary**

1. Load data into the Colab notebook
2. Manipulate data with pandas
3. Plot data with Matplotlib and seaborn

# Sandbox

In [None]:
iris_data.head()

In [None]:
iris_data.tail()

In [None]:
iris_data.describe()

In [None]:
iris_data.info()

In [None]:
iris_data['sepal_length']

In [None]:
iris_data[0:3]

In [None]:
iris_data.loc[0:4, 'sepal_length']

In [None]:
iris_data.loc[:, 'sepal_width']

In [None]:
iris_data.loc[0, 'sepal_width']

In [None]:
iris_data.iloc[0:5, :]

In [None]:
iris_data.iloc[0, 0]
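
One detail in the `loc` and `iloc` cells above is easy to miss: label-based slices with `loc` include the end label, while position-based slices with `iloc` exclude the end position. A small sketch on a made-up frame (`toy` is an illustrative name) shows the difference.

```python
import pandas as pd

# Made-up frame with the default integer index 0..5.
toy = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]})

by_label = toy.loc[0:4, "sepal_length"]   # labels 0 through 4, end included
by_position = toy.iloc[0:4, 0]            # positions 0 through 3, end excluded

print(len(by_label), len(by_position))
```

This is why `iris_data.loc[0:4, 'sepal_length']` and `iris_data.iloc[0:5, :]` both cover the first five rows.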

In [None]:
iris_data[iris_data.sepal_length>7]

In [None]:
iris_data[iris_data["species"].isin(['setosa', 'virginica'])]

In [None]:
iris_data["species"].isin(['setosa', 'virginica'])

In [None]:
sns.regplot(data=iris_data, x="sepal_length", y="petal_length");


In [41]:
from collections import Counter

In [42]:
Counter(iris_data.sepal_length > 7)

Counter({False: 138, True: 12})

In [36]:
import pandas as pd

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6],
    'sepal_width': [3.4, 3.5, 3.2, 3.1],
    'petal_length': [1.4, 1.4, 1.3, 1.5],
    'petal_width': [0.2, 0.2, 0.2, 0.2],
    'species': ['setosa','setosa','setosa','setosa']
})


In [37]:
df

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.4 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |


In [49]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One-way ANOVA: does mean sepal_length differ between species?
model = ols('sepal_length ~ C(species)', data=iris_data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

anova_table

|   | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(species) | 63.212133 | 2.0 | 119.264502 | 1.669669e-31 |
| Residual | 38.9562 | 147.0 | NaN | NaN |
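
To see where the F value comes from, the one-way ANOVA statistic can be computed by hand as the between-group mean square divided by the within-group mean square. Applied to the table above, that is (63.212133 / 2) / (38.9562 / 147) ≈ 119.26. The sketch below repeats the calculation in plain Python on made-up toy numbers (not the iris data); `groups` and the helper `mean` are illustrative names.

```python
# Made-up measurements for three groups (not the iris data).
groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 3.0, 4.0],
    "c": [5.0, 6.0, 7.0],
}

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

all_values = [v for vals in groups.values() for v in vals]
grand_mean = mean(all_values)

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean.
ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                 for vals in groups.values())
# Within-group sum of squares: squared distance of each value
# from its own group mean.
ss_within = sum((v - mean(vals)) ** 2
                for vals in groups.values() for v in vals)

df_between = len(groups) - 1               # number of groups minus 1
df_within = len(all_values) - len(groups)  # observations minus groups

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 6))
```

The two degrees-of-freedom values correspond to the `df` column that `anova_lm` reports (2 and 147 for the iris data).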
