In [None]:
! pip install seaborn --upgrade

# Exploratory Data Analysis

An Exploratory Data Analysis (EDA) is one of the first steps in the machine learning process. During an EDA, you examine data to determine its quality and applicability to the problem you are attempting to solve.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Plotting libraries
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Getting to Know Your Data

One of the first things you need to do when conducting an exploratory data analysis is to load the data

In [None]:
iris_data = pd.read_csv('/kaggle/input/iris-flower-dataset/IRIS.csv')

and verify that it has been correctly loaded.

In [None]:
iris_data

Once the data has been loaded, we can begin asking questions.

For example,
- What type of data is stored in each column?
- What is the distribution of values for each feature?
- Are there any missing or extreme values?

A **data quality report** can be used to answer these questions. This report includes,
- Summary statistics for quantitative values: count, mean, median, mode, min, max, standard deviation, percentiles, number of missing values, number of unique values
- Summary statistics for categorical values: count, count and % missing, how many unique values (cardinality), number of values in each category (along with the %)
- Basic distribution plots: histograms or bar plots, box plots
- Basic relationship plots: scatter plot matrix
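The tabular part of such a report can be sketched with pandas. The example below is a minimal illustration, not a complete report: the helper name `data_quality_report` and the small `toy` frame are assumptions made for this sketch, and the plots are left out.

```python
import numpy as np
import pandas as pd

def data_quality_report(df):
    """Return (continuous_report, categorical_report) for a DataFrame.

    Hypothetical helper: numeric columns are treated as continuous,
    everything else as categorical.
    """
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")

    continuous_report = pd.DataFrame({
        "count": numeric.count(),                   # non-missing values
        "pct_missing": numeric.isna().mean() * 100,
        "cardinality": numeric.nunique(),
        "min": numeric.min(),
        "median": numeric.median(),
        "max": numeric.max(),
        "mean": numeric.mean(),
        "std": numeric.std(),
    })

    categorical_report = pd.DataFrame({
        "count": categorical.count(),
        "pct_missing": categorical.isna().mean() * 100,
        "cardinality": categorical.nunique(),
        "mode": categorical.mode().iloc[0],         # most common level
        "mode_freq": categorical.apply(lambda col: col.value_counts().iloc[0]),
    })
    return continuous_report, categorical_report

# Made-up data with one missing value in a numeric and a categorical column
toy = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 6.3, 5.8],
    "width": [3.5, 3.0, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "setosa", "virginica", None],
})
continuous_report, categorical_report = data_quality_report(toy)
print(continuous_report)
print(categorical_report)
```

The same function could be called on `iris_data` once it is loaded; the toy frame is only used here so that the sketch runs without the Kaggle input files.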

The `info` method can give us an indication of the datatype for each feature. We are primarily interested in knowing which variables are continuous and which are discrete.

In [None]:
iris_data.info()

The `describe` method can be used to calculate the summary statistics of a data frame.

The specific output of the `describe` function varies depending on the nature of the data.

For continuous data, we are given: 
- count of non-empty values; 
- sample mean and standard deviation; 
- minimum and maximum values; 
- quantiles.

In [None]:
iris_data.describe()

For discrete data, we are given: 
- count of non-empty values; 
- number of unique "levels" (also called classes when we talk about the target vector); 
- most common level
- frequency of the most common level. 

In [None]:
iris_data.loc[:, 'species'].describe()

In addition to the summary statistics, it is also useful to dig a bit deeper into the data.

For categorical variables:
- Examine the 1st and 2nd mode and percentage of representation to identify the most common values

In [None]:
iris_data.loc[:, 'species'].value_counts().sort_values(ascending=False)

In [None]:
iris_data.loc[:, 'species'].value_counts(normalize=True)

- Create bar plots of each variable to visualize the distribution of the data

In [None]:
sns.countplot(data=iris_data, x='species')

For continuous variables:
- Examine the mean and standard deviation, which describe the central tendency and variation of the distribution. (We can either refer to the summary statistics, or recalculate them if necessary)

In [None]:
iris_data.loc[:, iris_data.columns[:-1]].mean()

In [None]:
iris_data.loc[:, iris_data.columns[:-1]].std()

- Because the mean and standard deviation say little about the shape of a distribution on their own, we want to create histograms and box plots to visualize the distribution of the data.

In [None]:
f, axs = plt.subplots(2, 2, figsize=(8, 6))
sns.histplot(data=iris_data, x='petal_length', ax=axs[0,0])
sns.histplot(data=iris_data, x='petal_width', ax=axs[0,1])
sns.histplot(data=iris_data, x='sepal_length', ax=axs[1,0])
sns.histplot(data=iris_data, x='sepal_width', ax=axs[1,1])
f.tight_layout()

The distribution of a variable is important because it helps us determine the best type of model to use for our solution. 

Three of the most common distributions are the **uniform distribution**, the **normal distribution**, and the **exponential distribution**.
- Uniform distributions occur when each value of a random variable is equally likely
- Normal distributions occur most often for naturally occurring phenomena
- Exponential distributions occur most often when dealing with how long it takes for an event to occur

In [None]:
xs0 = np.random.uniform(size=1000)
xs1 = np.random.normal(size=1000)
xs2 = np.random.exponential(size=1000)

f, axs = plt.subplots(1, 3, figsize=(10, 3))
sns.histplot(x=xs0, ax=axs[0])
axs[0].set_title("Uniform")
sns.histplot(x=xs1, ax=axs[1])
axs[1].set_title("Normal")
sns.histplot(x=xs2, ax=axs[2])
axs[2].set_title("Exponential")
f.tight_layout()

Distributions can be unimodal or multimodal, and can be skewed right (peak is on the left side) or skewed left (peak is on the right side).
- When the distribution is multimodal, the mean is a very misleading value; but the presence of a multimodal distribution may indicate multiple clearly distinct "groups" within the data

In [None]:
gen = np.random.default_rng()

xs0 = gen.exponential(scale=0.2, size=1000)
xs1 = -gen.exponential(scale=0.1, size=1000)
xs2 = np.concatenate((gen.normal(loc=0, scale=1, size=500), gen.normal(loc=5, scale=1.5, size=500)))

f, axs = plt.subplots(1, 3, figsize=(10, 3))
sns.histplot(x=xs0, ax=axs[0])
axs[0].set_title("Skew Right")
sns.histplot(x=xs1, ax=axs[1])
axs[1].set_title("Skew Left")
sns.histplot(x=xs2, ax=axs[2])
axs[2].set_title("Multimodal")
f.tight_layout()
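Skew can also be checked numerically rather than only by eye. The sketch below uses the pandas `skew` method on synthetic samples (the variable names and the fixed seed are assumptions for illustration): a positive value indicates a right skew, a negative value a left skew, and in a right-skewed sample the mean is pulled above the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Exponential samples have a long right tail; negating them flips the tail
right_skewed = pd.Series(rng.exponential(scale=0.2, size=1000))
left_skewed = -right_skewed

print("right skew:", right_skewed.skew())
print("left skew:", left_skewed.skew())
print("mean above median:", right_skewed.mean() > right_skewed.median())
```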

## Identifying Data Quality Issues

Once the data has been summarized, it can be analyzed for quality issues.
- Missing values
- Irregular counts of unique values (cardinality)
- Outliers

If quality issues within the data are not resolved, then the data will not yield a good model.

Missing values
- Why are the values missing? Collection problems? Integration problems? Intentionally missing?
- If a large portion of the values for a feature are missing (~60% is a good rule of thumb), then it may be best to not use that feature
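As a rough sketch of the rule of thumb above, the fraction of missing values per feature can be computed with `isna().mean()`. The frame `toy_missing` is made up for this example, and the 0.6 cutoff is the rule of thumb from the text rather than a fixed standard.

```python
import numpy as np
import pandas as pd

# Made-up data: column "b" is mostly empty
toy_missing = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values in each column
missing_fraction = toy_missing.isna().mean()
print(missing_fraction)

# Features that exceed the ~60% rule of thumb
mostly_missing = missing_fraction[missing_fraction > 0.6].index.tolist()
print(mostly_missing)
```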

Irregular Cardinality
- Occurs when there are more or fewer unique values for a categorical variable than expected
- If all of the values in a feature are the same (Cardinality 1), then that feature should be removed if there are no errors - it will not be useful in the model
- If the cardinality is close to the number of instances in the dataset, then the feature may be continuous and not categorical (and vice versa)
- Cardinality values larger than expected may indicate invalid levels that need to be recoded
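The cardinality checks above might look like the following sketch. The frame `toy_players`, its column names, and the 90% "near unique" threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data: "league" is constant, "id" is unique per row,
# and "position" contains inconsistent spellings of the same level
toy_players = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "league": ["NBA"] * 6,
    "position": ["guard", "Guard", "center", "forward", "g", "center"],
})

cardinality = toy_players.nunique()
print(cardinality)

# Cardinality 1: the feature carries no information
constant = cardinality[cardinality == 1].index.tolist()

# Cardinality close to the number of rows: probably not a useful category
near_unique = cardinality[cardinality >= 0.9 * len(toy_players)].index.tolist()

# Higher cardinality than expected: inspect the levels, then recode invalid ones
print(toy_players["position"].value_counts())
recode = {"Guard": "guard", "g": "guard"}
toy_players["position"] = toy_players["position"].replace(recode)
position_levels = toy_players["position"].nunique()
```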

Outliers
- Outliers can be invalid (actual errors), or valid (correct values, but unusual for some reason)
- To detect outliers, you can
    - Examine the minimum and maximum values for sensibility
    - Look at a box plot to see where the whiskers are
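The box plot check can be reproduced numerically. The sketch below assumes the common convention that whiskers extend at most 1.5 times the interquartile range (IQR) beyond the first and third quartiles; the sample values are made up.

```python
import pandas as pd

# Made-up measurements with one suspicious value
widths = pd.Series([2.9, 3.0, 3.1, 3.2, 3.0, 3.4, 2.8, 3.1, 9.5])

# Sensibility check on the extremes
print(widths.min(), widths.max())

# Whisker limits under the 1.5 * IQR convention
q1, q3 = widths.quantile([0.25, 0.75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Values outside the whiskers are candidate outliers
outliers = widths[(widths < lower_whisker) | (widths > upper_whisker)]
print(outliers)
```

Whether a flagged value is invalid or merely unusual still has to be decided by looking at how the data was collected.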

## Handling Data Quality Issues

How to handle Missing Values
- Drop the entire row; this might be acceptable if there is a large amount of data, but can lead to bias
- Drop the feature if a large number of its values are missing (generally more than 60%)
- Create a new feature that indicates if the value is present or missing - this can be used to train the model when it should ignore the feature
- Impute the missing value by replacing it with a mean, median, mode, or other aggregate value
    - This is normally not the best idea as it can lead to bias in the data
    - Only consider it if a small number of values are missing (generally less than 30%)
- Build a predictive [regression] model based on the dataset to try and predict the missing values
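A few of these options can be sketched with pandas as follows. The frame `toy_ages` and its column names are hypothetical, and the median is used only as one example of an aggregate value.

```python
import numpy as np
import pandas as pd

# Made-up feature with two missing values
toy_ages = pd.DataFrame({"age": [23.0, np.nan, 31.0, 27.0, np.nan, 40.0]})

# Option: drop the rows where the value is missing
dropped = toy_ages.dropna(subset=["age"])

# Option: add an indicator feature that records whether the value was present
toy_ages["age_missing"] = toy_ages["age"].isna()

# Option: impute with an aggregate of the observed values (here, the median)
median_age = toy_ages["age"].median()
toy_ages["age_imputed"] = toy_ages["age"].fillna(median_age)

print(toy_ages)
```

Keeping the indicator column alongside the imputed column lets a model distinguish observed values from filled-in ones.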

Handling Outliers
- Clamp the values
    - $a_i = \begin{cases} 
            \text{lower} & \text{if } a_i < \text{lower} \\
            \text{upper} & \text{if } a_i > \text{upper} \\
            a_i & \text{otherwise}
        \end{cases}$
    - Method 1: Use the whisker limits of a box plot as the lower and upper values (1st quartile $- 1.5 \times$ IQR and 3rd quartile $+ 1.5 \times$ IQR, where IQR is the interquartile range)
    - Method 2: Set the upper and lower values to the mean plus/minus a multiple of the standard deviation
    - Always inspect the data to see how much of an impact clamping will have. If too many values will be changed, you may need to do something else. 
- Consider leaving them alone if the values are valid
- Consider removing the observations if the values are invalid and the feature is important
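Clamping can be sketched with the pandas `clip` method. The two helper functions below are illustrations written for this notebook, not library functions; `clamp_iqr` assumes whisker limits of 1.5 times the interquartile range beyond the quartiles, and `clamp_std` uses a multiple `k` of the standard deviation around the mean.

```python
import pandas as pd

def clamp_iqr(values, k=1.5):
    """Clamp to [Q1 - k * IQR, Q3 + k * IQR] (box plot whisker limits)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def clamp_std(values, k=3.0):
    """Clamp to the mean plus/minus k standard deviations."""
    mean, std = values.mean(), values.std()
    return values.clip(lower=mean - k * std, upper=mean + k * std)

# Made-up measurements with one extreme value
raw_widths = pd.Series([2.9, 3.0, 3.1, 3.2, 3.0, 3.4, 2.8, 3.1, 9.5])
clamped_widths = clamp_iqr(raw_widths)

# Inspect how many values clamping would change before committing to it
n_changed = int((clamped_widths != raw_widths).sum())
print(n_changed, "of", len(raw_widths), "values changed")
```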

Regardless of how missing values and outliers are handled, you should try to do it in a way that does not change the underlying distribution of the data.

## Advanced Data Exploration

Scatter plots can be used to visualize the relationship between two variables
- Positive covariance (increase or decrease together)
- Negative covariance (as one goes up, the other goes down)
- No apparent relationship

In [None]:
sns.relplot(data=iris_data, x='sepal_width', y='petal_length', hue='species')

A scatter plot matrix displays scatter plots across all features in one visualization.

In [None]:
g = sns.PairGrid(iris_data, hue='species')
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend()

For categorical data, you can draw a bar plot, factored across the levels of a feature. 
- If there is a relationship present, then the distribution should differ across the levels of the second variable. 
    - Why? Because if there is no relationship, then the value of the first variable should have no impact on the second

In [None]:
bball_data = pd.read_csv('/kaggle/input/table37basketballteam/Table3-7BasketballTeam.csv')

If we consider the career stage of players split up by whether or not they have a shoe sponsor, we see that there is no apparent relationship between the two variables.

In [None]:
sns.catplot(kind='count', data=bball_data, x='Career Stage', height=3)
sns.catplot(kind='count', data=bball_data, x='Career Stage', col='Shoe Sponsor', height=3)

However, if we look at position by shoe sponsorship, there does appear to be a relationship: guards have more sponsorship deals.

In [None]:
sns.catplot(kind='count', data=bball_data, x='Position', height=3)
sns.catplot(kind='count', data=bball_data, x='Position', col='Shoe Sponsor', height=3)

Stacked bar charts can also be used to compare categorical variables. If there is a relationship, then the proportions of each level should differ by a large margin.
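The proportions behind a stacked bar chart can be tabulated with `pd.crosstab`. The sketch below uses a made-up `toy_sponsors` frame (not the basketball dataset loaded above) in which the proportion of sponsored players clearly differs by position.

```python
import pandas as pd

# Made-up data: 30 guards and 30 centers with different sponsorship rates
toy_sponsors = pd.DataFrame({
    "position": ["guard"] * 30 + ["center"] * 30,
    "sponsor": ["yes"] * 20 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 20,
})

# Each row sums to 1: the share of each sponsor level within a position
proportions = pd.crosstab(
    toy_sponsors["position"], toy_sponsors["sponsor"], normalize="index"
)
print(proportions)
```

Calling `proportions.plot(kind="bar", stacked=True)` on this table should draw the corresponding stacked bar chart with matplotlib.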

To look for relationships between categorical and continuous variables, you can compare histograms when holding the level of the categorical variable steady.

In the following example, we look at age segmented by position and notice that while centers have some older players, it is not entirely clear whether there is a strong relationship between the variables.

In [None]:
sns.displot(data=bball_data, x='Age', height=3)
sns.displot(data=bball_data, x='Age', hue='Position', col='Position', height=3)

Alternatively, you can also use box plots.

In [None]:
sns.catplot(kind='box', data=bball_data, x='Position', y='Age')

In [None]:
sns.catplot(kind='box', data=bball_data, x='Position', y='Height')

## Measuring Covariance and Correlation

Plots alone are not sufficient for understanding relationships between variables. Covariance and correlation provide numerical metrics of the strength of these relationships.

Covariance is defined as

$$\text{cov}(a,b) = \frac{1}{n - 1} \sum_{i=1}^{n} \left( (a_i - \bar{a})(b_i - \bar{b}) \right)$$

where $\bar{a}$ and $\bar{b}$ are the sample means of the feature vectors $a$ and $b$.

Covariance measures the *linear* relationship between the variables. Values near 0 indicate that there is little to no linear relationship between the variables.

Covariance maintains the units of each variable, which may not make sense when compared to one another. 

Correlation is the normalized covariance, removing units and limiting the range to $[-1, 1]$

$$\text{corr}(a, b) = \frac{\text{cov}(a,b)}{\bar{\sigma}_a \bar{\sigma}_b}$$

where $\bar{\sigma}_a$ and  $\bar{\sigma}_b$ are the sample standard deviations of the feature vectors $a$ and $b$.

The covariance and correlation functions can be used to generate a covariance or correlation matrix, showing how every variable is linearly related to every other variable.
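As a quick check of the two formulas, the sketch below implements them directly with NumPy and compares the results with `np.cov` and `np.corrcoef`. The function names and the sample vectors are made up for this example.

```python
import numpy as np
import pandas as pd

def covariance(a, b):
    """Sample covariance with the 1 / (n - 1) normalization."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

def correlation(a, b):
    """Covariance divided by the product of the sample standard deviations."""
    return covariance(a, b) / (a.std(ddof=1) * b.std(ddof=1))

# Made-up feature vectors
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_ab = covariance(a, b)
corr_ab = correlation(a, b)
print(cov_ab, np.cov(a, b)[0, 1])
print(corr_ab, np.corrcoef(a, b)[0, 1])

# pandas builds the full matrices for every pair of numeric columns
pairs = pd.DataFrame({"a": a, "b": b})
cov_matrix = pairs.cov()
corr_matrix = pairs.corr()
print(corr_matrix)
```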

Correlation is not causation.
- Many times, the relationship between two variables exists because of confounding features that may or may not be directly observed
- Only careful experimentation can tease out causation
	
	
To measure relationships that involve categorical variables, statistical techniques such as Chi-Squared tests (between two categorical variables) and ANOVA (between a categorical and a continuous variable) can be used. 
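For two categorical variables, the chi-squared statistic compares the observed contingency table with the counts expected if the variables were independent. The sketch below computes only the statistic, by hand, on made-up data; the function name is an assumption for this example, and a full test (including a p-value) would normally come from a statistics library such as SciPy.

```python
import pandas as pd

def chi_squared_statistic(x, y):
    """Pearson chi-squared statistic for two categorical Series."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: row total * column total / grand total
    expected = row_totals @ col_totals / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

# Made-up data: 30 guards and 30 centers with different sponsorship rates
toy_position = pd.Series(["guard"] * 30 + ["center"] * 30)
toy_sponsor = pd.Series(["yes"] * 20 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 20)

chi2_stat = chi_squared_statistic(toy_position, toy_sponsor)
print(chi2_stat)
```

Larger values indicate a larger departure from independence; a value of 0 means the observed counts match the expected counts exactly.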

## Data Preparation

Once the features have been identified, and their quality assessed and corrected, the final step is to conduct any transformations necessary to facilitate learning and model building.

### Normalization

There are two standard methods of normalization: range normalization and the z-transform.

To conduct range normalization, modify the range of a feature to be within $[low, high]$
	
$$a_i = \frac{a_i - \min(a)}{\max(a) - \min(a)} (high - low) + low$$

To apply a z-transform, rescale the feature so that it has a mean of 0 and a standard deviation of 1 (the result is only a standard normal distribution if the feature is normally distributed to begin with)

$$a_i = \frac{a_i - \bar{a}}{\text{sd}(a)}$$

If the feature is not normally distributed to begin with, the transformed values keep the shape of the original distribution, skew included. The z-transform might still be useful, but you will need to check.
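Both transformations are short enough to write by hand. The functions below are a minimal sketch (the names are made up for this example) and use the sample standard deviation, which is the pandas default.

```python
import pandas as pd

def range_normalize(a, low=0.0, high=1.0):
    """Rescale the values of a Series linearly into [low, high]."""
    return (a - a.min()) / (a.max() - a.min()) * (high - low) + low

def z_transform(a):
    """Rescale a Series to a mean of 0 and a standard deviation of 1."""
    return (a - a.mean()) / a.std()

# Made-up feature values
feature = pd.Series([2.0, 4.0, 6.0, 8.0])

unit_range = range_normalize(feature)
symmetric_range = range_normalize(feature, low=-1.0, high=1.0)
z_scores = z_transform(feature)

print(unit_range.tolist())
print(symmetric_range.tolist())
print(z_scores.tolist())
```

Note that `range_normalize` divides by zero when every value in the feature is the same, so a constant feature would need to be handled separately.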

### Binning

- Converts a continuous feature into a categorical feature
    - Helps some algorithms handle continuous features "better"
    - Helps handle outliers
    - Discards information in the process
- If the number of bins is too low, then information is lost with respect to the distribution of the original values
- If the number of bins is too high, some of those bins may be empty
- Ideally, you want a number of bins that produces a representation close to the original distribution.

In [None]:
f, axs = plt.subplots(1, 3, figsize=(10, 3))
sns.histplot(data=iris_data, x='sepal_width', ax=axs[0], bins=3)
sns.histplot(data=iris_data, x='sepal_width', ax=axs[1], bins=10)
sns.histplot(data=iris_data, x='sepal_width', ax=axs[2], bins=30)
f.tight_layout()

In [None]:
f, axs = plt.subplots(1, 3, figsize=(10, 3))
sns.histplot(data=iris_data, x='sepal_width', ax=axs[0], binwidth=0.25)
sns.histplot(data=iris_data, x='sepal_width', ax=axs[1], binwidth=1)
sns.histplot(data=iris_data, x='sepal_width', ax=axs[2], binwidth=2)
f.tight_layout()

There are two standard techniques for determining the bin boundaries when manually binning a feature:

Equal width binning
- Splits the values into b bins, each of size range / b
    - E.g. [0, 10), [10, 20), …, [90, 100]
- Good for uniform distributions, but may produce many empty bins for non-uniform distributions

Equal frequency binning
- Sorts the values from smallest to largest and then puts an equal number of values in each bin
- The total number of instances in each bin is [count of instances] / [number of bins]
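In pandas, `pd.cut` performs equal width binning and `pd.qcut` performs equal frequency binning. The sketch below applies both to a synthetic, right-skewed sample (the seed and variable names are assumptions) to show how differently the instances are spread across four bins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
skewed_values = pd.Series(rng.exponential(scale=1.0, size=1000))

# Equal width: four bins that each cover a quarter of the value range
equal_width = pd.cut(skewed_values, bins=4)
width_counts = equal_width.value_counts(sort=False)

# Equal frequency: four bins that each hold a quarter of the instances
equal_freq = pd.qcut(skewed_values, q=4)
freq_counts = equal_freq.value_counts(sort=False)

print(width_counts)
print(freq_counts)
```

With skewed data, most instances land in the first equal width bin, while the equal frequency bins hold the same number of instances but have very different widths.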

The Seaborn library for Python usually picks a reasonable number of bins by default, but the default does not suit every dataset. See the Seaborn documentation for details.

## Sampling

Once the data has been transformed, there is a question of how much of that data to use. When conducting a machine learning experiment, you generally only work with a subset of the data - called a training set. Sampling is the process used to select that subset of data.

If sampling is not done carefully, the sample will be biased and not accurately represent the population.

Top Sampling
- Selects a flat % from the top of the dataset
- Is almost always biased and impacted by the ordering of the data

Random Sampling
- Randomly selects a flat % from the dataset
- Does not guarantee that the relative frequency of each group in the data is preserved

Stratified Sampling
- The dataset is grouped by one or more particular variables, and then s% of each group (called a stratum) is selected for the sample. 
- Maintains the relative frequency of each group within the dataset

Under-Sampling
- Creates a sample where all groups are equally represented
- Group the dataset by one or more variables; from each group, randomly sample (without replacement) N instances, where N is the number of instances in the smallest group

Over-Sampling
- Creates a sample where all groups are equally represented
- Group the dataset by one or more variables; from each group, randomly sample (with replacement) N instances, where N is the number of instances in the largest group

Under-sampling and over-sampling can be used to train predictive models that try to ignore sampling bias in the original dataset.
- For example, when data is gathered unequally from different subpopulations
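The sampling strategies above can be sketched with pandas. The frame `toy_groups`, the 20% sampling rate, and the fixed `random_state` are assumptions for this example; the grouped `sample` method requires a reasonably recent pandas version.

```python
import pandas as pd

# Made-up, ordered dataset with three groups of unequal size
toy_groups = pd.DataFrame({
    "group": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
    "value": range(100),
})

# Top sampling: biased by the ordering (only group "a" is selected here)
top_sample = toy_groups.head(20)

# Random sampling: a flat 20% of the rows
random_sample = toy_groups.sample(frac=0.2, random_state=0)

# Stratified sampling: 20% of each group, so relative frequencies are kept
stratified = toy_groups.groupby("group").sample(frac=0.2, random_state=0)

# Under-sampling: N rows per group without replacement, N = smallest group size
n_smallest = toy_groups["group"].value_counts().min()
under = toy_groups.groupby("group").sample(n=n_smallest, random_state=0)

# Over-sampling: N rows per group with replacement, N = largest group size
n_largest = toy_groups["group"].value_counts().max()
over = toy_groups.groupby("group").sample(n=n_largest, replace=True, random_state=0)

for name, sample in [("top", top_sample), ("random", random_sample),
                     ("stratified", stratified), ("under", under), ("over", over)]:
    print(name, sample["group"].value_counts().to_dict())
```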

## Summary

The key outcomes of the data exploration process should include:
1. Have gotten to know the features, especially their central tendencies, variations, and distributions
2. Have identified any data quality issues, in particular missing values, irregular cardinality, and outliers
3. Have corrected any data quality issues related to invalid data
4. Have recorded any data quality issues due to valid data in a data quality plan, along with potential handling strategies
5. Be confident enough that good-quality data exists to continue with a project

Any steps taken to transform the data must be recorded so that they can also be applied as new data is made available.
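One way to record a transformation is to store the parameters learned from the original data and reuse them on new data, rather than recomputing them. The sketch below does this for a z-transform; the function names, the dictionary layout, and the sample values are all assumptions for illustration.

```python
import json
import pandas as pd

def fit_z_transform(train_values):
    """Record the parameters of a z-transform from the training data."""
    return {"mean": float(train_values.mean()), "std": float(train_values.std())}

def apply_z_transform(values, params):
    """Apply a previously recorded z-transform to any data."""
    return (values - params["mean"]) / params["std"]

# Made-up training data and newly arrived data
train_values = pd.Series([2.0, 4.0, 6.0, 8.0])
new_values = pd.Series([5.0, 10.0])

z_params = fit_z_transform(train_values)
new_scaled = apply_z_transform(new_values, z_params)

# The recorded parameters can be saved alongside the model
saved = json.dumps(z_params)
print(saved)
print(new_scaled.tolist())
```

The new data is scaled with the training mean and standard deviation, so a value equal to the training mean maps to 0 regardless of what else arrives later.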