# Agenda
 

- PART-I:
    
    - What is data exploration?
    - Why is data exploration important?
    - Questions to answer with EDA 

- PART-II:

    - Using Pandas for exploratory data analysis
    - Using visualization libraries for exploratory data analysis

- PART-III:

    - Exploratory data analysis with housing data.

# PART-I

## What is data exploration?


> "Exploratory data analysis is detective work: numerical detective work or counting detective work or graphical detective work." (John Tukey)

__Exploratory Data Analysis__ (EDA) is an approach to data analysis that aims to:

- Gain insight,

- Detect anomalies,

- Understand variables and their relations,

- Identify the more informative features,

- Check/test assumptions.



## Why is EDA important?

> "To get a 'feel' for the data, it is not enough for the analyst to know what is in the data; the analyst also must know what is not in the data, and the only way to do that is to draw on our own human pattern-recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data."

[Source](https://www.itl.nist.gov/div898/handbook/eda/section1/eda14.htm)


## Questions to Answer


1. What is a typical value?

2. What is a good distributional fit for a set of numbers?

3. What are the most important factors?

4. Are the measurements from different categories equivalent?

5. Does the data have outliers?

[For a bigger list of questions and more details](https://www.itl.nist.gov/div898/handbook/eda/section3/eda32.htm)

## Steps of EDA





__Variable Identification__

- Identify input features (independent features, inputs) and target features (dependent features, outputs)

- Check whether any input feature is an exact copy of the output feature.

- Check the data types of the input and output. 

- Check whether the variables are categorical, ordinal or continuous.
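
The checks above can be sketched with pandas on a small made-up table. The column names (`rooms`, `city`, `price`, `price_copy`) and the choice of `price` as the target are assumptions for illustration only, not part of any dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data; every column name here is made up for illustration.
toy = pd.DataFrame({
    "rooms": [3, 4, 2, 5],
    "city": ["A", "B", "A", "C"],
    "price": [200.0, 310.0, 150.0, 420.0],
    "price_copy": [200.0, 310.0, 150.0, 420.0],  # deliberately duplicates the target
})
target = "price"  # assumed target (output) feature

# Data type of every column
dtypes = toy.dtypes

# Input features that are exact copies of the target
leaky = [c for c in toy.columns if c != target and toy[c].equals(toy[target])]

# Rough split into numeric and non-numeric columns.
# Telling ordinal from nominal variables still needs domain knowledge.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(dtypes)
print("Copies of the target:", leaky)
print("Numeric:", numeric_cols, "| Non-numeric:", categorical_cols)
```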


__Univariate Analysis__

Focus on individual variables. Possible techniques: 

- Mean, median, mode, variance, range, counts, box plots, histograms, bar plots, etc.

__Bivariate Analysis__

Focus on relations between two variables. Possible techniques:

- Scatter plots, heatmaps, [correlation statistics](https://en.wikipedia.org/wiki/Correlation_and_dependence#Rank_correlation_coefficients), Chi-square tests, stacked column plots etc.
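
As a rough illustration of correlation statistics, the sketch below compares a linear (Pearson) and a rank-based (Spearman) coefficient on made-up numbers. The data are hypothetical, and the Spearman value is computed here as the Pearson correlation of the ranks, which is its definition, rather than through a dedicated library call.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically but non-linearly with x.
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3

# Pearson correlation measures the strength of a linear relationship.
pearson = x.corr(y)

# Spearman rank correlation is the Pearson correlation of the ranks,
# so it equals 1 for any strictly increasing relationship.
spearman = x.rank().corr(y.rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

A gap between the two coefficients is a hint that the relationship is monotonic but not linear, which a scatter plot can confirm.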

__Missing Value Treatment__

Possible techniques:

- Deletion

- Mean/median/mode imputation

- Prediction

- Similarity-based imputation
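
A minimal sketch of the two simplest options, deletion and median/mode imputation, on a made-up table. The column names only mimic a salary-style dataset and the values are invented; prediction-based and similarity-based imputation are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in both columns.
toy = pd.DataFrame({
    "salary": [100.0, np.nan, 300.0, 200.0, np.nan],
    "league": ["A", "N", None, "A", "A"],
})

# How many values are missing in each column?
missing_counts = toy.isnull().sum()

# Deletion: drop every row that has at least one missing value.
dropped = toy.dropna()

# Median imputation for a numeric column.
salary_filled = toy["salary"].fillna(toy["salary"].median())

# Mode imputation for a categorical column (mode() can return ties; take the first).
league_filled = toy["league"].fillna(toy["league"].mode()[0])

print(missing_counts)
print(dropped)
print(salary_filled.tolist(), league_filled.tolist())
```

Deletion is the simplest choice but shrinks the data (here from five rows to two), while imputation keeps every row at the cost of distorting the variable's distribution.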


__Outlier Treatment__

An outlier is a data point that differs significantly from other observations. 

- Model-based methods for detecting outliers

- Graph-based methods for detecting outliers.

- Hybrid methods
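
One common rule of thumb, and the one box plots conventionally use to place their whiskers, flags points lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies it to invented numbers; the 1.5 multiplier is a convention, not something prescribed by these notes.

```python
import pandas as pd

# Hypothetical toy data with one value far from the rest.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as a potential outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"Fences: [{lower}, {upper}]")
print("Flagged:", outliers.tolist())
```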

__Variable Transformation__

Possible situations where variable transformation might be needed:

- Change of scale
- Converting a non-linear relationship to a linear one
- Changing the distribution


__Possible methods for variable transformation__

- Applying a certain function (logarithm, square root, exponential etc.)

- Binning

- Hand-made modifications

- Creating dummy-variables
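
Three of the methods above (applying a function, binning, and creating dummy variables) can be sketched as follows. The data, the bin edges and the band labels are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a right-skewed numeric column.
toy = pd.DataFrame({
    "salary": [70.0, 150.0, 480.0, 2000.0],
    "league": ["A", "N", "A", "N"],
})

# Applying a function: log1p (log of 1 + x) compresses the long right tail.
toy["log_salary"] = np.log1p(toy["salary"])

# Binning: cut the values into labelled ranges (the edges here are arbitrary).
toy["salary_band"] = pd.cut(
    toy["salary"],
    bins=[0, 100, 500, np.inf],
    labels=["low", "mid", "high"],
)

# Dummy variables: one indicator column per category.
dummies = pd.get_dummies(toy["league"], prefix="league")

print(toy)
print(dummies)
```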



## Resources

[NIST: Exploratory Data Analysis](https://www.itl.nist.gov/div898/handbook/eda/eda.htm)

[Tukey - Exploratory Data Analysis](http://www.ru.ac.bd/wp-content/uploads/sites/25/2019/03/102_05_01_Tukey-Exploratory-Data-Analysis-1977.pdf)

[IBM - Exploratory Data Analysis](https://www.ibm.com/cloud/learn/exploratory-data-analysis)

[Analytics Vidhya - Data Exploration](https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/)

[Omnisci - EDA](https://www.omnisci.com/learn/data-exploration)

# PART-II

## Loading Data

In [None]:
import requests # for loading data from an online resource
from io import StringIO # for reading inputs
import pandas as pd # for manipulating data


[Baseball  Hitters Dataset - Kaggle](https://www.kaggle.com/floser/hitters)

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS601_Fall21/main/Week05/data/Hitters.csv', index_col= 0).reset_index()

In [None]:
df.head()

## Variable Identification

In [None]:
## we can check object types
df.info()

Target Variable: Salary

Input variables: AtBat, Hits, Years, ...

In [None]:
df.Salary.isnull().sum()

## Univariate Analysis

__Descriptive Statistics__

In [None]:
## pandas has a very handy method that returns descriptive statistics

df.describe()

In [None]:
df[df.Hits == 1]

Let's choose a variable, say `Hits` and examine it.

__Histograms__

In [None]:
## Histograms are very useful tools to understand the distribution of a variable
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
## let's use seaborn's histplot function to create a histogram for Hits
sns.histplot(data= df.Hits)
plt.xticks(range(df.Hits.min(), df.Hits.max(), 10), rotation= 90)
plt.title('Hits Counts')
plt.show()

In [None]:
df[df.Hits < 10]

In [None]:
df[(df.Hits > 21) & (df.Hits < 31)]

In [None]:
## Let's investigate histogram more

In [None]:
## Also note that a pandas Series has a quantile method


df.Hits.quantile((0.01, 0.03))

In [None]:
9/322

In [None]:
sorted(df.Hits.values)

__Box Plots__

[Box Plots Explained](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)

In [None]:
## Another very useful visual method is the box plot.
## seaborn's `boxplot` function draws one and, by default,
## marks points beyond the whiskers as potential outliers.

sns.set_theme(style="whitegrid")
sns.boxplot(y = df.Hits)
plt.title('Boxplot for Hits')
plt.show()

__Violin Plots__

[Violin Plots Explained](https://towardsdatascience.com/violin-plots-explained-fb1d115e023d)

In [None]:
sns.set_theme(style="whitegrid")
## Note that although box plots show the median, quartiles and range,
## they don't show the whole shape of the distribution.

## seaborn's `violinplot` function adds a density estimate to show that shape.


sns.violinplot(y = df.Salary)

plt.title('Salary Violin Plot')
plt.show()

__Working with categorical variables__

In [None]:
## For categorical variables, start with the count of each category;

## histplots of such variables are then very straightforward.

df.League.value_counts()

In [None]:
sns.histplot(x = df.League)
plt.title('Counts for Different Leagues')
plt.show()

## Bivariate Analysis

Continuous variable vs Continuous variable

In [None]:
## scatterplots are one of the most popular and useful methods 
## to see the relationship between two variables

## Let's check AtBat vs RBI

In [None]:
sns.scatterplot(x = df.AtBat, y = df.RBI)

Categorical variable vs Continuous variable

In [None]:
## sometimes we might want to see how a continuous variable varies across the categories of a categorical one.

## seaborn has a `catplot` method for this kind of plots.

In [None]:
sns.catplot(y = 'Hits', x = 'Division', data = df, kind = 'violin')
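
A numeric companion to the violin plot is a per-category summary table. The sketch below uses a small made-up frame that only borrows the `Division` and `Hits` column names; the values are invented, and the same `groupby` call could be tried on `df`.

```python
import pandas as pd

# Hypothetical toy data reusing the Division and Hits column names.
toy = pd.DataFrame({
    "Division": ["E", "W", "E", "W", "E"],
    "Hits": [100, 80, 120, 60, 110],
})

# Count, median and mean of Hits within each Division
summary = toy.groupby("Division")["Hits"].agg(["count", "median", "mean"])

print(summary)
```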

# PART-III

## Lab

This part of the notebook is inspired by [Hands-On Machine Learning with Scikit-Learn & TensorFlow, ch. 2](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb)


Use the EDA techniques we learned above to gain better insight into the California Housing dataset.


# Readings before working on your midterm

Read Chapter 7 of the [R for Data Science book](https://r4ds.had.co.nz/exploratory-data-analysis.html) (skip all the coding parts).

Some questions you can ask yourself as you read:

From 7.1:

- What are the steps of the EDA cycle?
- What is EDA?
- What are the tools of EDA?

From 7.2:

- What is the key to generating good questions?
- What are the two types of questions that might be helpful to make discoveries within your data?
- Definitions: Variable, variation, value, observation, etc.

From 7.3:

- When can you get variation?
- Why do you think even constant quantities might have variation when they are measured multiple times?
- What is a categorical variable?
- What is a continuous variable?
- What are the ways to visualize categorical and continuous variables?
- How do you identify `typical values` in categorical and numerical variables?
- What are outliers?

From 7.4:

- What are the two ways of dealing with unusual values?

From 7.5:

- What is covariation?

- Explain density, box plot and IQR