### Exploratory Data Analysis (EDA) is an important part of the Analytics Workflow. 
- It is the stage where you get to know your data very well. 
- EDA makes use of both dataframe methods (numeric summaries) and visualisations.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

#### For this notebook, we are using a fabricated dataset called `people.csv`

In [None]:
people_df = pd.read_csv('../data/people.csv')
people_df.head()

### The pandas module has several dataframe methods that are useful for getting to know your data.
- `info()` tells you how many rows and columns you have, which columns (variables) have missing values, datatypes for each variable and how much memory the data requires.
- `describe()` provides statistical information about each variable in the dataset

1. Which columns have missing values?
2. What's the maximum sibling count for people in the dataset?
3. What is the average BMI for all observations (rows) in the dataset? What's the median?

In [None]:
people_df.info()

In [None]:
people_df.describe()
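
The three questions above can also be answered with a few targeted calls rather than by reading the `info()` and `describe()` output. The sketch below uses a small made-up dataframe (`toy_df` is hypothetical and is not `people.csv`); its column names are assumed to mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for people_df; all values are invented for illustration
toy_df = pd.DataFrame({
    'name': ['Ann', 'Bo', 'Cy', 'Di'],
    'sibling_count': [0, 3, 1, 2],
    'bmi': [21.0, np.nan, 27.0, 24.0],
})

# 1. columns that contain at least one missing value
cols_with_missing = toy_df.columns[toy_df.isnull().any()].tolist()

# 2. maximum sibling count
max_siblings = toy_df['sibling_count'].max()

# 3. mean and median BMI (missing values are skipped by default)
mean_bmi = toy_df['bmi'].mean()
median_bmi = toy_df['bmi'].median()

print(cols_with_missing, max_siblings, mean_bmi, median_bmi)
```

Swapping `toy_df` for `people_df` should give the answers for the real data, assuming the column names match.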

### Another handy way to get an idea of missingness in the data is to call `isnull( ).sum( )` on the dataframe
- how does this work?

In [None]:
people_df.isnull().sum()
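
One way to see how this works is to run the two steps separately. The sketch below uses a tiny invented dataframe (`toy_df` is hypothetical): `isnull()` produces a dataframe of `True`/`False` values, and `sum()` then adds down each column, counting `True` as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values
toy_df = pd.DataFrame({
    'height': [64.0, np.nan, 70.0],
    'weight': [np.nan, np.nan, 180.0],
})

# Step 1: same shape as toy_df, True wherever a value is missing
mask = toy_df.isnull()

# Step 2: column-wise sum; True counts as 1 and False as 0
missing_per_column = mask.sum()

print(mask)
print(missing_per_column)
```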

### The `value_counts( )` method gives you a quick idea of the distribution of  _unique values_ in a column

In [None]:
people_df.sex.value_counts()
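
Two optional arguments are often useful here. The sketch below, on an invented series rather than the notebook's data, shows `dropna=False` (count missing values too) and `normalize=True` (return proportions instead of counts).

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real column may contain different labels
sex = pd.Series(['F', 'M', 'F', np.nan, 'F'], name='sex')

counts = sex.value_counts()                        # missing values are excluded by default
counts_with_nan = sex.value_counts(dropna=False)   # adds a row for NaN
proportions = sex.value_counts(normalize=True)     # fractions of the non-missing values

print(counts)
print(counts_with_nan)
print(proportions)
```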

### Uniqueness can be examined with `unique( )` which returns all unique values or `nunique( )` which counts them

In [None]:
people_df.name.nunique()
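
As a small illustration of the difference between the two methods, the sketch below uses an invented list of names. Comparing `nunique()` with the number of rows is one simple way to check whether a column contains duplicates.

```python
import pandas as pd

# Hypothetical names, with one repeat
names = pd.Series(['Ann', 'Bo', 'Ann', 'Cy'])

distinct = names.unique()      # the distinct values, in order of first appearance
n_distinct = names.nunique()   # how many distinct values there are

# fewer distinct values than rows means at least one value repeats
has_duplicates = n_distinct < len(names)

print(list(distinct), n_distinct, has_duplicates)
```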

### You can call `.plot( )` on a dataframe to see all numerical variables plotted at the same time
- not so useful if variables are on different scales!

In [None]:
people_df.plot();
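
When the variables are on different scales, one possible workaround is to pass `subplots=True`, which draws each numeric column in its own panel with its own y-axis. This is a minimal sketch on invented numbers (`toy_df` is hypothetical), not the notebook's data.

```python
import pandas as pd

# Hypothetical numeric columns on quite different scales
toy_df = pd.DataFrame({
    'height': [62, 66, 70, 74],
    'weight': [120, 150, 180, 210],
    'bmi': [21.9, 24.2, 25.8, 27.0],
})

# one panel per column; returns an array with one Axes object per column
axes = toy_df.plot(subplots=True, figsize=(6, 6))
```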

### A more useful way to plot all variables at the same time is a pairplot or correlation plot
- distribution of the variable is shown in histograms along the diagonal
- scatterplots plot two variables at a time and give an indication of which variables may be correlated

In [None]:
sns.set(style="ticks", color_codes=True)
sns.pairplot(people_df);
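
The visual impression from the scatterplots can be backed up with numbers. The sketch below computes pairwise Pearson correlations on an invented dataframe (`toy_df` is hypothetical); non-numeric columns are dropped first with `select_dtypes`, since a correlation is only defined for numeric variables.

```python
import pandas as pd

# Hypothetical data; 'name' is included to show that text columns are excluded
toy_df = pd.DataFrame({
    'name': ['Ann', 'Bo', 'Cy', 'Di'],
    'height': [62, 66, 70, 74],
    'weight': [120, 155, 175, 210],
})

# keep numeric columns only, then compute the correlation matrix
corr = toy_df.select_dtypes(include='number').corr()

print(corr)
```

Values near 1 or -1 suggest a strong linear relationship; values near 0 suggest little or none.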

### We'll use matplotlib and seaborn to examine additional plots and what they help communicate

- to look at the distribution of continuous data, a histogram is most frequently used

In [None]:
plt.hist('bmi', data = people_df);

- to look at counts of discrete data, a barplot is often used

In [None]:
plt.bar('name', 'sibling_count', data = people_df)
plt.xticks(rotation = 70)
plt.title('Number of Siblings');

- sorting the data first on the variable you are plotting helps the readability of a barplot

In [None]:
data = people_df.sort_values('sibling_count')
plt.bar('name', 'sibling_count', data = data)
plt.xticks(rotation = 70)
plt.title('Number of Siblings');

### The basic `plot( )` method in the `matplotlib.pyplot` module is versatile.
- It is generally used to create line plots, but can be used for scatterplots with the `linestyle =` argument set to `'none'` as below

In [None]:
plt.plot('height', 'weight', marker = 'o', linestyle = 'none', data = people_df)
plt.xlabel('height in inches')
plt.ylabel('weight in pounds');

### The `scatter( )` method is a more direct way to create a scatterplot

In [None]:
plt.scatter('height', 'weight', data = people_df)
plt.xlabel('height in inches')
plt.ylabel('weight in pounds');

In [None]:
plt.scatter('weight', 'bmi', data = people_df)
plt.xlabel('weight in pounds')
plt.ylabel('BMI');

### Matplotlib can get pretty fancy as in the plot below
- horizontal lines to show boundaries for `overweight` and `obese`
- annotations to label those boundaries

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter('weight', 'bmi', data = people_df, color = 'darkblue')
plt.hlines(y=25, xmin = 0, xmax = 250, color = 'orange')
plt.hlines(y=30, xmin = 0, xmax = 250, color = 'red')
plt.xlim(0, 250)
plt.ylim(0, 34)
# pass the label positionally: the keyword was renamed from `s` to `text` in Matplotlib 3.3
plt.annotate(' overweight', xy = (0, 26))
plt.annotate(' obese', xy = (0, 31))
plt.xlabel('weight in pounds')
plt.ylabel('BMI')
plt.title('Observed BMI for People');
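
An alternative sketch of the same idea uses `axhline`, which draws a horizontal line across the full width of the axes so no `xmin`/`xmax` data values are needed. The three points are invented for illustration, and the label is passed to `annotate` as the first positional argument, which should work whichever name the installed Matplotlib uses for that keyword.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

# Hypothetical (weight, bmi) points standing in for people_df
ax.scatter([120, 160, 200], [21, 26, 31], color='darkblue')

# boundary lines spanning the whole x-axis
ax.axhline(y=25, color='orange')
ax.axhline(y=30, color='red')

# labels placed just above each boundary
ax.annotate(' overweight', xy=(0, 26))
ax.annotate(' obese', xy=(0, 31))

ax.set_xlim(0, 250)
ax.set_ylim(0, 34)
ax.set_xlabel('weight in pounds')
ax.set_ylabel('BMI')
```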

### The `seaborn` package makes prettier plots
- boxplots are another way to look at the distribution of a variable. The top and bottom borders of the rectangle mark the 3rd and 1st quartiles, and the line inside it is the 2nd quartile (median). The lines extending from the rectangle (called whiskers) reach, by default, to the most extreme values within 1.5 times the interquartile range of the box. Outliers appear as dots beyond the whiskers.

In [None]:
sns.boxplot(y=people_df.weight);
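
The numbers behind a boxplot can be computed by hand. The sketch below uses invented weights and assumes the common 1.5 × IQR convention for the whisker limits (the default in seaborn and matplotlib); points outside those limits are the ones that would be drawn as outlier dots.

```python
import pandas as pd

# Hypothetical weights in pounds, with one extreme value
weights = pd.Series([110, 130, 140, 150, 160, 170, 180, 190, 300])

# the three quartiles that define the box and its middle line
q1, median, q3 = weights.quantile([0.25, 0.5, 0.75])

# interquartile range and the limits the whiskers cannot extend beyond
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# values outside the fences would be plotted as individual dots
outliers = weights[(weights < lower_fence) | (weights > upper_fence)]

print(q1, median, q3, lower_fence, upper_fence, outliers.tolist())
```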

### Adding a value for `x` tells seaborn to create multiple boxplots, one for each unique value in the specified `x`

In [None]:
sns.boxplot(x=people_df.sex, y=people_df.weight)
# plt.xlabel('')  # uncomment to hide the x-axis label
plt.title('distribution of weights by sex');

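### Two more plots that can be helpful in getting to know the data (particularly with small to medium-sized datasets, since swarm plots do not scale well to many points) are strip plots and swarm plots.
- adding jitter to a strip plot randomly moves the dots a little along the categorical axis so that overlapping points become visible; a swarm plot achieves a similar effect by positioning the dots so that they do not overlap. The first cell below turns jitter off for comparison.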

In [None]:
sns.stripplot(x=people_df.sex, y=people_df.height, jitter = False);

In [None]:
sns.swarmplot(x=people_df.sex, y=people_df.height);
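
To make the idea of jitter concrete, the sketch below applies it by hand with NumPy on invented values. It is only a conceptual illustration: seaborn's own implementation may differ in its details, and `jitter_width` is a hypothetical parameter chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: category position on the x-axis (0 or 1) and heights in inches
group_positions = np.array([0, 0, 0, 1, 1, 1])
heights = np.array([64, 64, 66, 70, 70, 72])

# shift each point sideways by a small random amount; heights are left untouched
jitter_width = 0.1
offsets = rng.uniform(-jitter_width, jitter_width, size=len(group_positions))
jittered_x = group_positions + offsets

print(jittered_x)
```

Plotting `heights` against `jittered_x` instead of `group_positions` separates points that would otherwise sit exactly on top of each other, such as the two people at 64 inches.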