![](https://www.cranfield.ac.uk/~/media/Images/mastheads/masthead-bix-mh344x810.ashx?h=344&la=en&mw=810&w=810&hash=4AD63583A5EA5BF7E5E561A1C62A8F7879F36571)

<h1 style="text-align:center; text-transform:uppercase">Exploratory Data Analysis</h1>
<h2 style="text-align:center">Getting to know your Dataset with Plot.ly</h2>

<div style="width:400px; text-align:left; margin: 0 auto"><blockquote>Doing statistics is like doing crosswords except that one cannot know for sure whether one has found the solution.</blockquote>
<footer style="text-align:right">~  John W. Tukey</footer></div>

<h3 style="text-align:center">Sanity Check</h3>

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

In [None]:
import plotly
from plotly.offline import init_notebook_mode, iplot

In [None]:
# !pip install plotly

In [None]:
import cufflinks as cf

In [None]:
# !pip install cufflinks

In [None]:
import seaborn as sns

In [None]:
# !pip install seaborn

In [None]:
# Go Offline
init_notebook_mode()
cf.go_offline()

In [None]:
# Demo
cf.datagen.histogram().iplot(kind='histogram')

#### Version Check

In [None]:
# Check https://pyformat.info/ for string formatting
for module in [np, pd, plotly, cf, sns]:
    print("{:16}{}".format(module.__name__, module.__version__))

<h2 style="text-align:center">Motivating Example</h2>

* The exam was administered in two separate medium-sized rooms.
* A report was received from the invigilators in Room 1 about suspicious behaviour. 
* The exam is entirely multiple choice.
* The graph shows the number of identical right answers and the number of identical wrong answers for each pair of students (~18K pairs).
* The line corresponds to forty total shared answers (two students having identical test papers).

![](http://lalashan.mcmaster.ca/theobio/math/index.php/Special:GetProjectFile?project=Answer_matching&make=false&display=raw&random-number=367&filename=cplot.Rout-0.png)

<a href="http://jd-mathbio.blogspot.hk/2015/02/finding-cheaters-using-multiple-choice.html?utm_source=marketo&utm_medium=email&utm_campaign=DA-NL-202&mkt_tok=3RkMMJWWfF9wsRokvq3MZKXonjHpfsX%2B7%2BooW6Gg38431UFwdcjKPmjr1YEETcB0aPyQAgobGp5I5FEOS7PYS6V6t6EOUg%3D%3D" style="text-align:center; margin:0 auto; display:block">SOURCE</a>
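
The pairwise counts plotted in the graph can be computed along these lines. This is a minimal sketch, not the method used by the source: the answer key, the student ids, and the helper `shared_answers` are all made up for illustration.

```python
from itertools import combinations

# Hypothetical answer key and responses (made up for illustration).
answer_key = "ABCDA"
responses = {
    "s1": "ABCDA",
    "s2": "ABCDB",
    "s3": "ABDDB",
    "s4": "CBCAA",
}


def shared_answers(a, b, key):
    """Count identical right and identical wrong answers for two students."""
    right = sum(x == y == k for x, y, k in zip(a, b, key))
    wrong = sum(x == y != k for x, y, k in zip(a, b, key))
    return right, wrong


# One (right, wrong) point per pair of students, as in the scatter plot.
pairs = {
    (p, q): shared_answers(responses[p], responses[q], answer_key)
    for p, q in combinations(sorted(responses), 2)
}

for pair, (right, wrong) in pairs.items():
    print(pair, right, wrong)
```

Shared right answers are expected among well-prepared students, so it is the number of shared wrong answers that makes a pair stand out.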

<h4 style="text-align:center">Can you catch the cheaters?</h4>

<h2 style="text-align:center">The Goals of Exploratory Data Analysis</h2>

1. Summarize the main characteristics of datasets with Summary Statistics (Initial Data Analysis)
1. Visually inspect the structure and nature of the data
1. Find what the data can tell us beyond the formal modeling or hypothesis testing task.

Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you’ll form after looking at just a few simple visualizations of that data.

<h2 style="text-align:center">The EDA Checklist</h2>


1. Initial Data Analysis
    1. Quality of data
    1. Quality of measurements
    1. Initial transformations
1. Univariate statistics (single variable)
1. Bivariate associations (correlations)
1. Multivariate patterns (analysis)
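
As a rough illustration of the last three checklist items, the sketch below runs one univariate, one bivariate, and one multivariate summary with pandas. The data frame and its column names are invented for this example.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "years_experience": [1, 3, 5, 7, 9],
    "salary": [30000.0, 38000.0, 52000.0, 61000.0, 75000.0],
    "team_size": [4, 4, 6, 5, 8],
})

# Univariate: count, mean, std, min, quartiles and max of a single variable.
univariate = toy["salary"].describe()

# Bivariate: Pearson correlation between two variables.
bivariate = toy["years_experience"].corr(toy["salary"])

# Multivariate: pairwise correlation matrix over all numeric columns.
multivariate = toy.corr()

print(univariate)
print(bivariate)
print(multivariate)
```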

### Initial Data Analysis

The initial steps of all data analyses consist of checking consistency and accuracy of the data, describing and exploring the study sample and preparing the data for further analyses. It is crucial this is done before embarking on complex analyses.

### Quality of Data

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), and normality checks (skewness, kurtosis, frequency histograms). Coding schemes should be checked as well: variables are compared with the coding schemes of variables external to the data set, and possibly corrected if the coding schemes are not comparable.
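
The checks named above map onto a handful of pandas calls. A minimal sketch, assuming a made-up frame with one categorical and one numeric column (not the salary data used below):

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "department": ["A", "B", "A", "C", "A", "B"],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0, 30.0],
})

# Frequency counts for a categorical variable.
freq = toy["department"].value_counts()

# Descriptive statistics for a numeric variable.
summary = toy["score"].agg(["mean", "std", "median"])

# Normality indicators: skewness above 0 points to a long right tail,
# and kurt() reports excess kurtosis (0 for a normal distribution).
skewness = toy["score"].skew()
kurtosis = toy["score"].kurt()

print(freq)
print(summary)
print(skewness, kurtosis)
```

A mean far above the median, as here, is a quick hint that a few extreme values dominate the variable.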

<h3 style="text-align:center">Example : Salary Data</h3>

In [None]:
df = pd.read_csv('salaries.csv')

In [None]:
df.head(10)

In [None]:
df.sample(10)

In [None]:
df.info()

In [None]:
df.describe()

<h4 style="text-align:center">Are there any duplicates in the data?</h4>

In [None]:
df.duplicated('email').value_counts()

In [None]:
# Only show the non-duplicated ones (~ negates the boolean mask).
df[~df.duplicated('email')]

In [None]:
df_unique = df.drop_duplicates(['email','ip_address'])
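
What counts as a duplicate depends on the columns passed as `subset` and on the `keep` argument. A small sketch on invented rows shows the difference:

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "email": ["a@x.org", "a@x.org", "b@x.org", "a@x.org"],
    "ip_address": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"],
})

# Duplicates judged on email alone: every repeat after the first is flagged.
by_email = toy.duplicated("email")

# Duplicates judged on email AND ip_address: both values must repeat.
by_both = toy.duplicated(["email", "ip_address"])

# keep=False flags every member of a duplicate group, including the first.
all_copies = toy.duplicated("email", keep=False)

# drop_duplicates keeps the first row of each (email, ip_address) group.
deduped = toy.drop_duplicates(["email", "ip_address"])

print(by_email.tolist())
print(by_both.tolist())
print(all_copies.tolist())
print(len(deduped))
```

Checking on one column and dropping on two, as the cells above do, can therefore leave repeated emails in the result.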

<h4 style="text-align:center">Is the data complete?</h4>

Data is incomplete when some attributes of an entity are not available. An example would be missing zip codes in address data.

In [None]:
df_unique.info()

In [None]:
df_unique.gender.value_counts(dropna=False)

In [None]:
# Inspecting missing values
df_unique[df_unique.gender.isnull()].head()

In [None]:
# Dropping all missing values
df_unique_complete = df_unique.dropna(subset=['gender'])
df_unique_complete.info()

In [None]:
# Insert the majority class (the most frequent value, i.e. the mode)
df_unique_majority_fill = df_unique.copy()
majority_gender = df_unique_majority_fill.gender.mode()[0]
df_unique_majority_fill.loc[df_unique_majority_fill.gender.isnull(), 'gender'] = majority_gender
df_unique_majority_fill.info()

In [None]:
df_unique_majority_fill.gender.value_counts().iplot(kind='bar')
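
When filling with the majority class, note that the most frequent value is the mode. On a column of strings, `max()` returns the alphabetically last value instead, which need not be the majority. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.Series(["Female", "Male", "Female", np.nan, "Female", "Male", np.nan])

# mode() ignores missing values and returns the most frequent value(s).
majority = toy.mode()[0]

# max() on strings is an alphabetical comparison, not a frequency count.
alphabetical_last = toy.dropna().max()

# Fill the gaps with the majority class.
filled = toy.fillna(majority)

print(majority, alphabetical_last)
print(filled.value_counts())
```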

In [None]:
import random
# Fill according to the existing distribution
df_unique_distribution_fill = df_unique.copy()

dfx = df_unique_distribution_fill

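def random_sample_with_distribution(series):
    # Draw a replacement for every missing entry from the observed (non-null) values,
    # so the filled column keeps roughly the same class proportions.
    observed = series.dropna().tolist()
    return series[series.isnull()].apply(lambda x: random.choice(observed))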

df_unique_distribution_fill.gender = dfx.gender.fillna(random_sample_with_distribution(dfx.gender))

df_unique_distribution_fill.info()

In [None]:
df_unique_distribution_fill.gender.value_counts().iplot(kind='bar')
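
The same idea can be written without a per-row `apply`, and made reproducible, by sampling with replacement from the observed values. This is a sketch of one possible alternative on invented data; the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.Series(["Female", "Male", np.nan, "Female", np.nan, "Male", "Female"])

missing = toy.isnull()

# Draw one observed value per missing entry; random_state fixes the draw.
draws = toy.dropna().sample(n=int(missing.sum()), replace=True, random_state=0)

# Write the draws into the gaps by position, ignoring the sampled index labels.
filled = toy.copy()
filled[missing] = draws.to_numpy()

print(filled.value_counts())
```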

In [None]:
# Could these be saved?
first_name = df_unique[df_unique.gender.isnull()].first_name
first_name.unique()

In [None]:
# https://pypi.python.org/pypi/gender-guesser/
!pip install gender-guesser

In [None]:
import gender_guesser.detector as gender
d = gender.Detector(case_sensitive=False)

first_name.apply(d.get_gender)[:10]

In [None]:
first_name.apply(d.get_gender).replace({'mostly_male':'male','mostly_female':'female'}).value_counts()

In [None]:
gender_fill = df_unique[df_unique.gender.isnull()].first_name.apply(d.get_gender).replace({'mostly_male':'male','mostly_female':'female'})
gender_fill = gender_fill.str.capitalize()

# This is wrong because it writes it into a *copy* of the data, instead of the original dataframe 
df_unique[df_unique.gender.isnull()].gender = gender_fill

In [None]:
# Look at the gender NA count - nothing changes!
df_unique.info()

In [None]:
# instead you have to use *.loc* indexers for both columns and rows to _set_ the values in the original dataframe
df_unique.loc[df_unique.gender.isnull(),'gender'] = gender_fill

In [None]:
# That's better!
df_unique.info()
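
The reason the first attempt has no effect is that boolean indexing returns a new object, so an assignment to it never reaches the original frame. A minimal sketch on a made-up frame (warnings are silenced only to keep the output clean):

```python
import warnings

import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cy"],
    "gender": ["Female", np.nan, np.nan],
})
mask = toy["gender"].isnull()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    subset = toy[mask]          # boolean indexing returns a new object
    subset["gender"] = "Male"   # changes the subset only

# The original frame still has its two missing values.
unchanged = int(toy["gender"].isnull().sum())

# .loc with a row mask and a column label writes into the original frame.
toy.loc[mask, "gender"] = "Male"

print(unchanged, int(toy["gender"].isnull().sum()))
```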

<h4 style="text-align:center">Is the data consistent?</h4>

An example would be a system or database where phone numbers are stored in different formats, such as 9999999999, +1 999-999-9999, 999-999-9999, and 99999 99999. Similar issues can exist in address data when the addresses are not standardized.
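
One way to make such values consistent is to reduce every variant to a single canonical form. The sketch below assumes 10-digit numbers with an optional leading country code 1; the helper `normalize_phone` is hypothetical and would need different rules for other numbering plans.

```python
import re


def normalize_phone(raw):
    """Return the 10-digit form of a phone number, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop the assumed country code
    return digits if len(digits) == 10 else None


formats = ["9999999999", "+1 999-999-9999", "999-999-9999", "99999 99999"]
normalized = [normalize_phone(f) for f in formats]
print(normalized)
```

Values that come back as `None` are worth inspecting by hand rather than silently dropping.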

In [None]:
df_unique.salary.iplot(kind='hist')

In [None]:
df_unique.query('salary < 0').salary.iplot(kind='hist')
# This doesn't provide the desired results - why?

In [None]:
df_unique.info()
# Salary is an 'object' type, but it's actually a float value!

In [None]:
try:
    df_unique.salary.astype(int)
except ValueError as e:
    print("Whoops :", e)

In [None]:
df_unique.salary.sample(10)
# Ahhh! It's the $ symbol.

In [None]:
# regex=False treats '$' as a literal character, not as the end-of-string anchor
df_unique.salary = df_unique.salary.str.replace('$', '', regex=False).astype(float)
# Much better!
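
Instead of sampling until the offending character shows up, `pd.to_numeric` with `errors="coerce"` can point directly at the values that fail to parse. A minimal sketch on invented strings; the `"n/a"` entry is an assumption added to show a value that stays unparsed.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.Series(["$1200.50", "$-9999.00", "$830.00", "n/a"])

# Remove the currency symbol as a literal character.
cleaned = toy.str.replace("$", "", regex=False)

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(cleaned, errors="coerce")

# The original strings that could not be converted.
unparsed = toy[numeric.isnull()]

print(numeric)
print(unparsed.tolist())
```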

In [None]:
df_unique.head(10)

In [None]:
# Apply the same fix to the original dataframe and check the result: salary is now a float.
df.salary = df.salary.str.replace('$', '', regex=False).astype(float)
df.info()

In [None]:
df_unique.query('salary < 0').iplot(kind='scatter', y='salary',mode='markers',text='first_name')
# Let's try again - oh, no difference! All -9999.

In [None]:
df_unique.head()

In [None]:
# Apparently, missing values were encoded as -9999, so let's fill them with the sample mean.
mean_salary = df_unique[df_unique.salary > 0].salary.mean()
df_unique.loc[df_unique.salary < 0,'salary'] = mean_salary

df_unique.salary.iplot(kind='hist')

In [None]:
# More granularity
df_unique.salary.iplot(kind='hist',bins=100)
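
An alternative is to turn the sentinel into a real missing value first, so that later statistics skip it automatically. The sketch below uses invented salaries and also shows a side effect of mean imputation: the filled column has a smaller spread than the observed values.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only; -9999 marks a missing salary.
toy = pd.Series([40000.0, -9999.0, 60000.0, 50000.0, -9999.0])

# Replace the sentinel with NaN so mean() and std() ignore it.
with_nan = toy.replace(-9999.0, np.nan)
mean_salary = with_nan.mean()

# Fill the gaps with the mean of the observed values.
filled = with_nan.fillna(mean_salary)

print(mean_salary)
print(with_nan.std(), filled.std())
```

Because every imputed value sits exactly at the mean, the histogram gains a spike there. This is worth remembering when reading the plots above.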

### Quality of Measurements

The quality of the measurement instruments should be checked during the initial data analysis phase only when it is not itself the focus or research question of the study. Check whether the structure of the measurement instruments corresponds to the structure you expect from experience, or to the structure reported in other studies.

<h3 style="text-align:center">Example : Measurement Precision</h3>

![](assets/rounding1.png)

![](assets/rounding2.png)

![](assets/rounding3.png)
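
A quick numeric check for rounding is to measure how many values fall on "round" numbers. This is only a sketch on invented values, assuming that rounding to whole units or to hundreds is the concern; it is not derived from the figures above.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.Series([1200.0, 1500.0, 1347.25, 2000.0, 1810.5, 3000.0])

# Share of values with no fractional part.
share_whole = 1 - (toy % 1 != 0).mean()

# Share of values that are exact multiples of 100.
share_hundreds = (toy % 100 == 0).mean()

print(share_whole, share_hundreds)
```

An unexpectedly high share of round values suggests that the measurements were recorded with less precision than the data type implies.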