# <center>Exploratory Data Analysis (EDA) </center>
References:
* https://www.datacamp.com/community/tutorials/exploratory-data-analysis-python
* Check good EDA notebooks published at Kaggle, e.g.
   * https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python
* Visualization:
   * https://towardsdatascience.com/pyviz-simplifying-the-data-visualisation-process-in-python-1b6d2cb728f1
   * https://medium.com/search?q=python%20visualization

Packages used in this notebook:
- Visualization: matplotlib, seaborn, plotly 
- Statistical analysis: statsmodels

##  1. What is Exploratory Data Analysis (EDA)

- EDA is an approach to analyzing data sets to 
  * prepare data for modeling, e.g. 
    - dealing with missing values
    - feature engineering
    - correlation analysis, etc.
  * summarize their main characteristics, often with visual methods (i.e. **data profiling**)
  * generate hypotheses for subsequent modeling stage 

## 2. Example
- Data set: UCI Auto MPG dataset (https://archive.ics.uci.edu/ml/datasets/Auto+MPG)
- Goals:
  * Analyze variable correlation
  * Data profiling (visualization)

In [None]:
# Exercise 2.1. Load the libraries and the data

# display the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

In [None]:
df = pd.read_csv('../../dataset/auto-mpg.csv', header=0)
df.head()
df.info()  # get detailed information of each column

## 3. Deal with Missing Values
- Find variables with missing values
- How to deal with variables that have missing values
  - drop samples (rows)
  - drop variables (columns)
  - interpolate
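
One way to choose between these options is to look at how much of each column is missing. The sketch below uses a small made-up frame; the names `toy`, `missing_frac`, and `max_missing`, and the 50% cutoff, are illustrative assumptions rather than a rule for the Auto MPG data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is mostly missing, column "a" has one gap
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 2.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Fraction of missing values per column
missing_frac = toy.isnull().mean()

# Assumed cutoff: drop columns with more than half of their values missing
max_missing = 0.5
kept = toy.loc[:, missing_frac <= max_missing]

# Fill the remaining gaps by linear interpolation between neighboring rows
filled = kept.interpolate(method="linear")

print(missing_frac)
print(filled)
```

Here column `b` is dropped and the single gap in `a` is interpolated. Interpolation only makes sense when neighboring rows are related (e.g. ordered data), so dropping rows may be the safer choice otherwise.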

In [None]:
# Exercise 3.1. Create a simple dataframe
# Missing values are shown as NaN (not a number)

import numpy as np

df1 = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, 6, 1],
                   [np.nan, np.nan, np.nan, 5], [5, 8, 2, 5]],
                   columns=list('ABCD'))
df1

In [None]:
# Exercise 3.2. Find missing values

# determine which value is null
df1.isnull()

# get number of null values in each row
df1.isnull().sum(axis=1)

In [None]:
# determine which columns (axis=0) and rows (axis=1) have null values
df1.isnull().any(axis=0)
df1.isnull().any(axis=1)

# return the rows that have at least one null value
df1.loc[df1.isnull().any(axis=1)]

# return the columns that have at least one null value
df1.loc[:, df1.isnull().any(axis=0)]

In [None]:
# Exercise 3.3. Drop missing values
df1
# to drop every column (axis=1) that contains any NaN, use how='any'
df1.dropna(axis=1, how='any')

# to drop only the columns in which all values are NaN, set how='all'
#df1.dropna(axis=1, how='all')

# drop the rows (axis=0) which have
# fewer than 3 good values (not missing)

df1.dropna(axis=0, thresh=3)

In [None]:
# replace every missing value with a constant (here 0)
df1.fillna(0)

In [None]:
# Exercise 3.4. interpolate missing values
df1
df1.interpolate(method='linear')

In [None]:
#  now get back to our auto-mpg dataset

df.isnull().sum(axis=0)

In [None]:
# Exercise 3.5. Drop samples (rows) that have missing values

df = df.dropna(axis=0, how='any')
df.info()

## 4. Visualization

- Typical graphical techniques used in EDA:
  - Bar chart
  - Histogram
  - Line chart
  - Scatter plot
  - Heatmap
  - ...
  
- Plot libraries
  - Matplotlib: a Python 2D plotting library which produces publication quality figures  
  - Seaborn:  a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics
  - Pandas plot: a convenient plotting interface on DataFrames and Series, built on matplotlib
  - Plotly: interactive plots https://plotly.com/chart-studio-help/tutorials/

In [None]:
# Exercise 4.1. Import plotting libraries

import seaborn as sns
import matplotlib.pyplot as plt

# plot charts inline
%matplotlib inline


### 4.1 Bar chart
- Different forms: single bar, stacked, vertical, horizontal
- Bar charts can be conveniently created using pandas
  * by default, **x axis is the index, and y can be columns**
  * value_counts, groupby, agg, pivot_table, or crosstab can be used to create values for plotting
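
As an illustration of the last point, the sketch below builds a stacked bar chart from `pd.crosstab`. The frame `toy` is made-up data that only borrows the column names `cylinders` and `origin`; it is not the Auto MPG data.

```python
import pandas as pd

# Hypothetical toy data with two categorical-like columns
toy = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 4, 6, 8, 8],
    "origin":    [1, 2, 1, 1, 3, 1, 1, 1],
})

# Rows become the x axis (index), each column becomes one bar segment
counts = pd.crosstab(toy["cylinders"], toy["origin"])
print(counts)

ax = counts.plot.bar(stacked=True, figsize=(6, 4),
                     title="Model count by cylinders and origin")
ax.set(xlabel="cylinders", ylabel="model count");
```

Because the index of `counts` is `cylinders`, pandas places it on the x axis without any explicit `x=` argument.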

In [None]:
df.cylinders.value_counts()

In [None]:
# Exercise 4.1.1 Plot number of cars by cylinders

ax=df.cylinders.value_counts().sort_index(axis=0).\
   plot.bar(figsize=(6,4), title="Model count by cylinders");

# set labels: the x axis shows cylinders (the index), the y axis the counts
ax.set(xlabel="cylinders", ylabel="model count");

# note: ";" to suppress unwanted output

In [None]:
# Exercise 4.1.2: create a bar chart to
# count models per origin


In [None]:
# Exercise 4.1.3. Use seaborn to generate attractive plots

# count cars by year
count_by_year=df.model_year.value_counts().reset_index()
count_by_year.columns=["year", "model_count"]
count_by_year

In [None]:
# set style
sns.set_style("whitegrid");

plt.figure(figsize=(8,5));

# Note: seaborn needs the x and y parameters to be set explicitly,
# whereas pandas plotting sets x (the index) and y (the columns) automatically

sns.barplot(x='year',y='model_count', data=count_by_year);
plt.show();   


In [None]:
# Plotly: interactive version of the same bar chart
import plotly.express as px

fig = px.bar(count_by_year, x='year', y='model_count',
             title="model by year")
fig.show()


### 4.2. Line chart

In [None]:
df.groupby('model_year')[["mpg","acceleration"]].mean()

In [None]:
# Exercise 4.2.1. line chart

# How does mpg/acceleration change over time?

# show the relationship between
# average mpg/acceleration and model year
# also note that two lines can be plotted without explicitly setting x and y

df.groupby('model_year')[["mpg","acceleration"]].mean()\
.plot(kind='line', figsize=(8,4))\
.legend(loc='center left', bbox_to_anchor=(1, 0.5));  # set legend

# what finding can be seen here?

### 4.3 Histogram: learn the distribution of variables

In [None]:
# Exercise 4.3.1. Histogram

# plot histogram using matplotlib
plt.figure(figsize=(5, 5));
plt.hist(df['weight'], color='g', bins=20);
plt.xlabel("weight");
plt.ylabel("Count");

In [None]:
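# Exercise 4.3.2. plot histogram and PDF (density estimate) using seaborn

sns.set_style("whitegrid")
# displot creates its own figure, so size it with height instead of
# plt.figure; kde=True overlays the estimated density curve
sns.displot(df["horsepower"], color='g', bins=20, kde=True, height=5);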

In [None]:
# Exercise 4.3.3. plot multiple histogram plots using pandas plot

df[['horsepower', 'weight','acceleration','displacement']]\
.hist(figsize=(8, 4), bins=50);

### 4.4. Scatterplot: show interaction between variables
- Pairwise scatter plot: discover interaction between any pair of variables
- Check variable correlation using `df.corr()`
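
To make the correlation check concrete, the sketch below computes the Pearson coefficient by hand and compares it with pandas. The frame `toy` and its values are made up for illustration; only the column names mirror the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: heavier cars with lower mpg
toy = pd.DataFrame({
    "weight": [2000.0, 2500.0, 3000.0, 3500.0, 4000.0],
    "mpg":    [34.0, 30.0, 24.0, 20.0, 15.0],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the column means
dx = toy["weight"] - toy["weight"].mean()
dy = toy["mpg"] - toy["mpg"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same coefficient (Pearson is the default method)
r_pandas = toy["weight"].corr(toy["mpg"])

print(r_manual, r_pandas)
print(toy.corr())  # full pairwise correlation matrix
```

A value close to -1 or 1 indicates a strong linear relationship; it says nothing about nonlinear patterns, which is why the scatter plots are worth checking alongside the matrix.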

In [None]:
# Exercise 4.4.1. Pairwise scatterplot

sns.pairplot(data=df);

# mpg, weight, displacement, ... 
# seem to be highly correlated

In [None]:
# Exercise 4.4.2. pairwise scatterplot with selected columns

# select variables for x and y axis
# color the points by origin (hue)

sns.pairplot(data=df, x_vars=['mpg', 'weight', 'displacement'],\
             y_vars=['mpg', 'weight', 'displacement'], \
             hue="origin");

# can anything interesting be found here?
# how about mpg/weight/displacement by origin?


In [None]:
# Exercise 4.4.3. Variable correlation

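# Keep 2 decimals when printing float numbers
pd.options.display.float_format = '{:,.2f}'.format

# numeric_only=True skips text columns (e.g. a car-name column, if present);
# recent pandas versions raise an error on them otherwise
df.corr(numeric_only=True)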

### 4.5. FacetGrid: Show variable relationship by facet
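
A facet plot shows each group's relationship visually; a per-group correlation gives a numeric counterpart. The sketch below is one possible way to compute it on made-up data (`toy` and `corr_by_origin` are illustrative names, and the values are not from the Auto MPG data).

```python
import pandas as pd

# Hypothetical toy data: two origins with three cars each
toy = pd.DataFrame({
    "origin":     [1, 1, 1, 2, 2, 2],
    "horsepower": [100.0, 150.0, 200.0, 60.0, 80.0, 100.0],
    "mpg":        [30.0, 25.0, 20.0, 35.0, 36.0, 31.0],
})

# Correlation between horsepower and mpg, computed separately per origin
corr_by_origin = (
    toy.groupby("origin")[["horsepower", "mpg"]]
       .apply(lambda g: g["horsepower"].corr(g["mpg"]))
)
print(corr_by_origin)
```
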

In [None]:
# Exercise 4.5.1.: How are horsepower and mpg correlated
#                 for cars from different origins?

# Generate grid by origin
g = sns.FacetGrid(df, col="origin");
# plot a scatterplot between horsepower and mpg
# for each facet in the grid
g.map(plt.scatter, "horsepower", "mpg");

# what insights can be found from this facet plot?


## 5. Regression
- For details, see http://www.statsmodels.org/dev/example_formulas.html
- For categorical variables, you can use R style formulas
- Model interpretation: http://connor-johnson.com/2014/02/18/linear-regression-with-python/
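
In a formula, `C(origin)` tells statsmodels to treat `origin` as categorical and expand it into indicator (dummy) columns, with the first level serving as the reference under the default treatment coding. The sketch below imitates that expansion with `pd.get_dummies` on made-up data; it is a stand-in for illustration, not what statsmodels runs internally.

```python
import pandas as pd

# Hypothetical toy data: origin is a code (1, 2, 3), not a quantity
toy = pd.DataFrame({
    "origin": [1, 2, 3, 1, 2],
    "weight": [3500.0, 2200.0, 2100.0, 4000.0, 2500.0],
})

# One indicator column per level; drop_first=True drops level 1,
# which then acts as the reference category
dummies = pd.get_dummies(toy["origin"], prefix="origin",
                         drop_first=True).astype(int)

# Design matrix: the numeric predictor plus the indicator columns
design = pd.concat([toy[["weight"]], dummies], axis=1)
print(design)
```

Each dummy coefficient in the fitted model is then read as the difference from the reference level, holding the other predictors fixed.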

In [None]:
# Exercise 5.1. OLS (ordinary least squares) regression

# linear regression between mpg and other factors


import statsmodels.api as sm


X = df[['cylinders', 'displacement', 'horsepower',
        'weight', 'acceleration']]

# add the intercept (constant term)
X = sm.add_constant(X)
Y = df.mpg

model = sm.OLS(Y,X).fit()


# Print out the statistics
model.summary()


In [None]:
# Exercise 5.2.
# Use C() to treat origin as categorical (dummy variables)

import statsmodels.formula.api as smf


model = smf.ols(formula='mpg ~ cylinders + displacement + horsepower + weight'
                        ' + acceleration + C(origin)',
                data=df).fit();

# Print out the statistics
model.summary()