In [None]:
import sys
sys.path.append('../src')

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

import seaborn as sns
sns.set()

## Exploratory Data Analysis

When you start working with data, particularly data you haven't worked with before, there are a few questions you should ask first:

+ what are the values of various summary statistics (mean, variance, etc…)?
+ how is the data distributed?
+ are there outliers?
+ do some data points have null/NaN values? how might we deal with those?

In this lesson we will talk about a few different approaches to exploratory data analysis (EDA). EDA is essential for you to discover the character of your data, the relationships between variables, and the potential problems inherent in the dataset.
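
As a rough preview, the four questions above can each be turned into a quick check. This is a minimal sketch on made-up data, not the lesson's dataset: the Poisson counts, the injected NaNs and outlier, and the name `demo` are all assumptions for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: Poisson counts with mean 100 (an assumption,
# not the lesson's dataset), with a few NaNs and one extreme value injected.
rng = np.random.default_rng(0)
values = rng.poisson(lam=100, size=500).astype(float)
values[[10, 20, 30]] = np.nan   # three missing values
values[40] = 400.0              # one obvious outlier
demo = pd.Series(values, name="counts")

# 1. Summary statistics (pandas skips NaN by default)
summary = demo.describe()

# 2. Distribution: a coarse histogram as counts per bin
hist_counts, bin_edges = np.histogram(demo.dropna(), bins=10)

# 3. Outliers: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = demo.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = demo[(demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)]

# 4. Missing values: count them, then either drop or fill them
n_missing = demo.isna().sum()
cleaned = demo.dropna()
filled = demo.fillna(demo.median())

print(summary)
print("missing:", n_missing)
print("outliers:", outliers.tolist())
```

Whether to drop or fill missing values depends on the analysis; both are shown here only to make the options concrete.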

The data we will be working with are either generated using functions or are contained within CSV files in this repo.

Here we import a function called `counts_data` from our `utils` library and use it to generate some samples.

In [None]:
from utils import counts_data

sample_mean = 100
samples = counts_data(mean=sample_mean)

### What is in `samples` now?

In [None]:
# how many data points are there?
display(samples.shape)

In [None]:
display(samples.mean())

In [None]:
display(samples.var())
display(samples.std())

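The mean and variance are the first two [moments of a distribution](https://en.wikipedia.org/wiki/Moment_(mathematics)); skewness and kurtosis are derived from the third and fourth.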
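
To make the link between moments and these statistics concrete, here is a small sketch that computes them by hand and compares the results with NumPy and SciPy. The sample `x` is a hypothetical Poisson stand-in (an assumption, not the output of `counts_data`), and the formulas are the biased (population) versions, which is what `stats.skew` and `stats.kurtosis` return by default.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in sample (Poisson, mean 100); not the lesson's data.
rng = np.random.default_rng(0)
x = rng.poisson(lam=100, size=1000).astype(float)

mean = x.mean()               # first raw moment
dev = x - mean                # deviations from the mean
m2 = np.mean(dev**2)          # second central moment (population variance)
m3 = np.mean(dev**3)          # third central moment
m4 = np.mean(dev**4)          # fourth central moment

skewness = m3 / m2**1.5               # standardized third moment
excess_kurtosis = m4 / m2**2 - 3.0    # Fisher definition: 0 for a normal distribution

# Each comparison should print True
print(np.isclose(m2, x.var(ddof=0)))
print(np.isclose(skewness, stats.skew(x)))
print(np.isclose(excess_kurtosis, stats.kurtosis(x)))
```

One detail to keep in mind: `var()` and `std()` on a pandas object default to `ddof=1` (the sample estimate), while NumPy's default to `ddof=0`, so the two can differ slightly on the same data.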

In [None]:
display(samples.describe())

In [None]:
display(stats.skew(samples))
display(stats.kurtosis(samples))

In [None]:
bins = np.linspace(0.5*sample_mean, 1.5*sample_mean, 35)
samples = counts_data(mean=sample_mean, seed=None)
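# `sns.distplot` is deprecated since seaborn 0.11; `histplot` plus `rugplot` replaces it
ax = sns.histplot(
    samples,
    bins=bins,
    stat='density',  # normalize so the y-axis is a density, as distplot did
    kde=True,
    color='midnightblue',
)
sns.rugplot(samples, color='midnightblue', ax=ax)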
ax.axis([
    0.5*sample_mean,
    1.5*sample_mean,
    0,
    0.05
])
_ = ax.set_xlabel('Value')
_ = ax.set_ylabel('Density')
_ = ax.set_title('Distribution of Counts Data')

In [None]:
counts_data??

In [None]:
from utils import counts_data_nan