# Table of Contents 

- The IRIS dataset:
    - Load the dataset
    - Explore the dataset: Descriptive statistics
    - Explore the dataset: Visualization
    


In [None]:
import os
import pandas as pd

# The IRIS dataset

This is perhaps the best known database to be found in the pattern recognition literature. 

- The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
- There are four numeric attributes and the class attribute:
    1. sepal length in cm   
    2. sepal width in cm   
    3. petal length in cm   
    4. petal width in cm   
    5. class: {Iris Setosa, Iris Versicolour, Iris Virginica}
    

![irisdataset](https://setscholars.net/wp-content/uploads/2020/01/iris-768x576.png)

## Load the dataset

In [None]:
pd.read_csv?

Common pitfalls in `pd.read_csv`:
- Which separator character (`sep`) does the file use?
- Is there a header row?
- Is there an index column?
- How are missing/unknown values denoted?
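
The questions above map onto `read_csv` parameters. The sketch below is only an illustration: the file content, the `;` separator, the `?` missing-value marker and the column names are made up for this example and do not describe `iris.csv`.

```python
import io
import pandas as pd

# Hypothetical file content (not iris.csv): ';' as separator, no header row,
# the first column is a row id, and '?' denotes a missing value.
raw = "0;5.1;3.5;setosa\n1;?;3.0;setosa\n2;6.3;3.3;virginica\n"

demo_df = pd.read_csv(
    io.StringIO(raw),        # a file path would normally go here
    sep=";",                 # separator character
    header=None,             # the file has no header row
    names=["id", "sepallength", "sepalwidth", "class"],  # assumed column names
    index_col="id",          # use the id column as the index
    na_values="?",           # treat '?' as missing
)

print(demo_df)
print(demo_df.isna().sum())
```

Without `na_values="?"`, the `sepallength` column would be read as strings instead of floats, which is a common sign that missing values were not declared.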

In [None]:
iris_df = pd.read_csv(os.path.join('dataset','iris.csv'))

## Explore the dataset: descriptive statistics


In [None]:
iris_df

In [None]:
iris_df.shape

In [None]:
pd.set_option('display.max_rows', 150)
iris_df

In [None]:
iris_df.head(10)

In [None]:
iris_df.head(10).T

Check whether there are any missing values:

In [None]:
iris_df.isna().sum()

In [None]:
iris_df.describe()

In [None]:
iris_df.drop('class',axis = 1).describe()

In [None]:
iris_df.info()

In [None]:
iris_df['class'].value_counts()

## Explore the dataset: Visualization

In [None]:
from matplotlib import pyplot as plt

### Histogram and boxplots with matplotlib


In [None]:
for att in iris_df.columns[:-1]:
    plt.figure()
    plt.hist(iris_df[att])
    plt.ylabel('occurrences')
    plt.xlabel(att)
    plt.title(f'histogram of {att} attribute')

In [None]:
for att in iris_df.columns[:-1]:
    plt.figure()
    plt.boxplot(iris_df[att])
    plt.ylabel('values')
    plt.xlabel(att)
    plt.title(f'boxplot of {att} attribute')

# With IQR = Q3 - Q1 (the interquartile range):
# - the upper whisker extends to the last datum less than Q3 + whis*IQR
# - the lower whisker extends to the first datum greater than Q1 - whis*IQR
# By default, whis = 1.5; points beyond the whiskers are drawn as outliers.
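
The whisker rule described in the comments can be reproduced by hand. This is a minimal sketch on made-up values (not taken from the iris data), using `np.percentile` for the quartiles.

```python
import numpy as np

# Made-up sample values, used only to illustrate the whisker rule
values = np.array([4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 6.9, 9.5])
whis = 1.5  # matplotlib's default

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Limits beyond which a point is considered an outlier
lower_limit = q1 - whis * iqr
upper_limit = q3 + whis * iqr

# The whiskers end at the most extreme data points that are still inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
lower_whisker = inside.min()
upper_whisker = inside.max()
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(f"Q1={q1:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print(f"whiskers: [{lower_whisker}, {upper_whisker}], outliers: {outliers}")
```

Note that the whiskers stop at actual data points, not at the limits themselves.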

### Histogram and boxplots with pandas
[Check the user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)


In [None]:
iris_df.drop('class',axis = 1).hist(bins = 7,layout = (1,4),figsize = (20,4))
plt.show()


In [None]:
iris_df['sepallength'].plot(kind='hist', bins=7, figsize=(10,6), title='iris hist plot: sepallength')
plt.show()

In [None]:
iris_df.hist(column='sepallength', by='class', bins = 7, figsize=(15,4),layout = (1,3))
plt.show()

In [None]:
iris_df.boxplot(column='sepallength', by='class', figsize=(10,6))
plt.show()

In [None]:
iris_df.drop('class',axis = 1).plot(kind ='box', 
                                    subplots = True, 
                                    figsize=(20,4),
                                    layout = (1,4),
                                    sharey=False)
plt.show()

In [None]:
iris_df.drop('class',axis = 1).plot(kind ='box', 
                                    subplots = True, 
                                    figsize=(16,4),
                                    layout = (1,4), 
                                    sharey=True)
plt.show()

### Scatter plot with matplotlib

In [None]:
dict_names = {1:'Iris-setosa', 2:'Iris-versicolor', 3:'Iris-virginica'}
dict_names.values()

In [None]:
x_index = 0
y_index = 2
for curr_class,color in zip(range(1,4),'rgb'):
    scatterplot = plt.scatter(iris_df[iris_df["class"] == curr_class].iloc[:,x_index],
                              iris_df[iris_df["class"] == curr_class].iloc[:,y_index], 
                              c=color,
                              # here you can customize the marker size or style, for instance 
                              label = dict_names[curr_class])
plt.xlabel(iris_df.columns[x_index])
plt.ylabel(iris_df.columns[y_index])
plt.legend()
plt.show()

In [None]:
# Alternative: a single scatter call is simpler to plot, but building the legend takes more work.
x_index = 0
y_index = 2

scatterplot = plt.scatter(iris_df.iloc[:,x_index],iris_df.iloc[:,y_index], c=iris_df['class'])
plt.xlabel(iris_df.columns[x_index])
plt.ylabel(iris_df.columns[y_index])
plt.legend(handles=scatterplot.legend_elements()[0], labels=dict_names.values())
plt.show()

In [None]:
fig,axes = plt.subplots(4,4,figsize = (20,20))
for ix in range(4):
    for iy in range(4):
        scatterplot = axes[ix,iy].scatter(iris_df.iloc[:,ix],iris_df.iloc[:,iy], c=iris_df['class'])
        axes[ix,iy].set_xlabel(iris_df.columns[ix])
        axes[ix,iy].set_ylabel(iris_df.columns[iy])
        axes[ix,iy].legend(handles=scatterplot.legend_elements()[0], labels=dict_names.values())
        

### Scatter plot with pandas

In [None]:
iris_df.plot(x='sepallength', 
             y='petallength', 
             kind='scatter', 
             c='class',
             colormap = 'viridis',
             colorbar = False, 
             figsize=(6,6),
             title='iris scatter plot')

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(iris_df.drop('class',axis = 1),figsize = (16,16),alpha = 1,diagonal = 'hist',c = iris_df['class'])
plt.show()

### More plotting libraries: Seaborn

In [None]:
import seaborn as sns

[Overview of seaborn plotting functions](https://seaborn.pydata.org/tutorial/function_overview.html)

In [None]:
f, axes = plt.subplots(1, 2, figsize=(8, 4))

sns.scatterplot(data=iris_df, 
                x="sepallength", 
                y="petallength", 
                hue="class", 
                ax=axes[0]) # hue = Grouping variable that will produce points with different colors

sns.histplot(data=iris_df, 
             x="class", 
             hue="class", 
             legend=False, 
             ax=axes[1])
f.tight_layout()

In [None]:
sns.jointplot(data=iris_df, x="sepallength", y="petallength")
# Draw a plot of two variables with bivariate and univariate graphs.

Assigning a `hue` variable adds conditional colors to the scatterplot and draws separate density curves on the marginal axes.

Internally, it uses `kdeplot()`, which plots univariate or bivariate distributions using kernel density estimation.
- A **kernel density estimate** (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions.
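
To make the idea concrete, here is a minimal one-dimensional Gaussian KDE written with NumPy only. It is a sketch, not seaborn's implementation: the sample values are made up, and the bandwidth is fixed by hand (an assumption for this example), whereas seaborn selects it automatically.

```python
import numpy as np

# Made-up observations, used only for illustration
samples = np.array([1.4, 1.5, 1.3, 4.5, 4.7, 5.1, 5.9, 6.0])
bandwidth = 0.5  # assumed smoothing parameter, chosen by hand

# Evaluation grid, extended a little beyond the data range
grid = np.linspace(samples.min() - 3 * bandwidth,
                   samples.max() + 3 * bandwidth, 200)

# Place one Gaussian bump on each observation and average them
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (
    len(samples) * bandwidth * np.sqrt(2 * np.pi)
)

# The estimated density should integrate to approximately 1
area = density.sum() * (grid[1] - grid[0])
print(f"area under the curve: {area:.3f}")
```

A larger bandwidth yields a smoother curve, while a smaller one follows the individual observations more closely.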


In [None]:
sns.jointplot(data=iris_df, x="sepallength", y="petallength",hue = 'class')


In [None]:
sns.pairplot(data=iris_df, hue="class")


In [None]:
sns.set_theme(style="whitegrid")
ax = sns.boxplot(data=iris_df.iloc[:,:-1], orient="h")

See the [example gallery](https://seaborn.pydata.org/examples/index.html) for an overview of seaborn plotting options.

Seaborn is tightly integrated with matplotlib.

While you can be productive using only seaborn functions, full customization of your graphics will require some knowledge of matplotlib’s concepts and API. 

High-quality visualizations can be obtained by combining the two:
- **Seaborn** provides a powerful high-level interface for creating visually appealing plots quickly
- **Matplotlib** provides deep customizability



If you specifically want interactive or animated web-based plots, go for [**plotly**](https://plotly.com/python/plotly-express/).
We will not cover this library in our lectures.

In [None]:
# see plotly_iris.py

### Correlation Analysis

`SciPy` is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data.

SciPy features include, but are not limited to:
- statistics
- linear algebra
- Fourier transforms
- optimization algorithms
- ...


In [None]:
from scipy.stats import pearsonr
pearsonr(iris_df.sepallength,iris_df.sepalwidth)

The `pearsonr` function returns:
- The Pearson product-moment correlation coefficient.
- The p-value associated with the chosen alternative: it roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.
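
As a sanity check on the first returned value, the coefficient can be computed directly from its definition (the covariance of the two variables divided by the product of their standard deviations). The arrays below are made-up numbers, not the iris columns.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up paired measurements, used only for illustration
x = np.array([5.1, 4.9, 6.3, 5.8, 7.1, 6.5])
y = np.array([1.4, 1.4, 6.0, 5.1, 5.9, 5.8])

# Center both variables, then apply the definition of Pearson's r
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c**2).sum() * (y_c**2).sum())

# pearsonr returns the coefficient and the p-value
r_scipy, p_value = pearsonr(x, y)

print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p-value = {p_value:.4f}")
```

The coefficient always lies in [-1, 1]; the p-value should be read with caution on samples this small.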

The Pearson correlation coefficient can also be obtained with `pandas.DataFrame.corr()`, which computes it for every pair of columns.

In [None]:
iris_df.corr()
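
`corr()` operates on numeric columns. If a copy of the dataset stores `class` as string labels rather than numeric codes (an assumption, not the case for the scatter-plot cells above), one option is to select the numeric columns first. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up rows with a string-valued class column, for illustration only
toy_df = pd.DataFrame({
    "sepallength": [5.1, 4.9, 6.3, 5.8],
    "petallength": [1.4, 1.4, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Keep only numeric columns before computing pairwise Pearson correlations
corr_matrix = toy_df.select_dtypes(include="number").corr()
print(corr_matrix)
```

The result is a symmetric matrix with ones on the diagonal, which is the form the heatmap in the next cell expects.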

In [None]:
f,ax = plt.subplots(figsize=(10, 8))
sns.heatmap(iris_df.corr(), annot=True, linewidths=.5, fmt= '.2f',ax=ax)
plt.show()
