## Goals in Exploratory Data Analysis
* ###  Understand the data types
* ###  Detect any problems in the data file
* ###  Detect any outliers or questionable data
* ###  Understand the range/distribution of the variables
* ###  Visualize the relationship between the variables
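
Several of these goals can be checked with a few pandas calls. The sketch below is only an illustration: it uses a tiny hand-made table (hypothetical values, not the iris data) with one deliberately missing entry, and the column names are assumptions chosen to resemble the iris file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (NOT the iris data); one value is deliberately missing.
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, np.nan, 6.0],
    "petal_width": [0.2, 1.4, 1.5, 2.5],
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
})

# Understand the data types: one dtype per column
print(toy.dtypes)

# Detect problems in the file: count missing values per column
missing = toy.isna().sum()
print(missing)

# Understand the range/distribution: summary statistics of the numeric columns
summary = toy.describe()
print(summary)
```

The same three calls (`dtypes`, `isna().sum()`, `describe()`) can be run on any dataframe once it has been loaded.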

## Iris Flower

[UCI Data Repository](https://archive.ics.uci.edu/)

[Iris Flower Images](https://en.wikipedia.org/wiki/Iris_flower_data_set)

### A small, classic data set widely used in teaching data science


In [None]:
import pandas as pd
import numpy as np 
from matplotlib import pyplot as plt 
import seaborn as sns 
df = pd.read_csv('data/iris.csv')

### Pandas 
### We use pandas to load the data.
### I am not a big user of pandas: it is designed for 2-dimensional data, and as a neuroscientist I typically work with 3- or 4-dimensional data, so I end up using my own tools.
### You might find this tutorial useful:
### [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

### Let's look at the dataframe we just read in

In [None]:
df.head()


In [None]:
df.keys()

In [None]:
# How many data points do we have? (one row per flower)
print("Number of data points:", df.shape[0])

# How many features do we have? (every column except the species label)
print("Number of features:", df.shape[1] - 1)


In [None]:
# How many samples do we have of each species?
df["species"].value_counts()

## Making plots using Seaborn
### The strength of seaborn over matplotlib is that it combines many plots that we would normally make individually into a single function call.

In [None]:

sns.set_style("whitegrid")
# height (formerly called size) sets the height of each panel in inches
sns.pairplot(df, hue="species", height=3)
plt.show()


### A boxplot can be extremely useful for detecting outliers in the data.

In [None]:
sns.boxplot(data=df)
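
With its default settings, a boxplot draws as individual points the values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule directly so the flagged values can be listed rather than just seen. The function name `iqr_outliers` and the sample values are hypothetical, introduced only for illustration.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    # Keep only the points outside the [lower, upper] fences
    return values[(values < lower) | (values > upper)]

# Hypothetical measurements (not taken from the iris file);
# 9.9 is planted as an obvious outlier.
sample = [3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 9.9]
print(iqr_outliers(sample))
```

Calling `iqr_outliers(df["sepal_width"])` would apply the same rule to one column of the dataframe loaded earlier.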

In [None]:
sns.violinplot(data=df)

### But my favorite is the swarm plot 

In [None]:

plt.figure(figsize=(10,8)) 
sns.swarmplot(y='petal_length', x='species', data=df)



In [None]:

plt.figure(figsize=(10,8)) 
sns.swarmplot(y='sepal_width', x='species', data=df)


## It's also useful to look at histograms with kernel density estimates

In [None]:
plt.figure(figsize=(10,8))
sns.histplot(df, x='petal_length',
             kde=True)
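
A kernel density estimate is a smoothed version of the histogram: one small bump is placed on every data point and the bumps are averaged. The sketch below shows the idea with Gaussian bumps in plain NumPy. It is a simplified illustration, not seaborn's implementation: the function name, the synthetic data, and the hand-picked bandwidth of 0.3 are all assumptions, whereas seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Synthetic, seeded sample with two clusters (not the iris data)
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(1.5, 0.2, 50), rng.normal(5.0, 0.8, 100)])

grid = np.linspace(-2.0, 10.0, 601)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over a wide enough grid
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth follows the data more closely and gives a bumpier curve, while a larger one smooths the two clusters together.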

### But what is really useful is to look at joint plots 

In [None]:
sns.jointplot(x='petal_length', y='petal_width', data=df)

### It can be very important to understand the correlation between variables.

In [None]:
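# Correlate only the numeric columns; the species labels are not numbers
sns.heatmap(df.select_dtypes(include="number").corr(), vmin=-1, vmax=1, annot=True)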
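
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The sketch below computes that coefficient by hand for two variables, to show what the numbers between -1 and 1 mean. The function name `pearson_r` and the data values are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Hypothetical values (not the iris data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.1, 3.9, 6.2, 8.1, 9.8]      # rises with x, almost on a line
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]   # falls with x, exactly on a line

print(pearson_r(x, y_up))    # close to +1
print(pearson_r(x, y_down))  # -1 up to rounding
```

Values near +1 or -1 mean the two variables lie close to a straight line, and values near 0 mean there is little linear relationship.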