<a href="https://colab.research.google.com/github/young-hwanlee/my-practical-statistics-for-data-scientists/blob/main/Chapter_1_Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Practical Statistics for Data Scientists (Python)**
# **Chapter 1. Exploratory Data Analysis**
> (c) 2019 Peter C. Bruce, Andrew Bruce, and Peter Gedeck

Import the required Python packages.

In [1]:
%matplotlib inline

from pathlib import Path

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust

!pip install wquantiles
import wquantiles

import seaborn as sns
import matplotlib.pylab as plt



Collecting wquantiles
  Downloading wquantiles-0.6-py3-none-any.whl (3.3 kB)
Installing collected packages: wquantiles
Successfully installed wquantiles-0.6


In [2]:
# try:
#     import common
#     DATA = common.dataDirectory()
# except ImportError:
#     DATA = Path().resolve() / 'data'

Define paths to data sets. If you don't keep your data in the same directory as the code, adapt the path names.

In [3]:
# AIRLINE_STATS_CSV = DATA / 'airline_stats.csv'
# KC_TAX_CSV = DATA / 'kc_tax.csv.gz'
# LC_LOANS_CSV = DATA / 'lc_loans.csv'
# AIRPORT_DELAYS_CSV = DATA / 'dfw_airline.csv'
# SP500_DATA_CSV = DATA / 'sp500_data.csv.gz'
# SP500_SECTORS_CSV = DATA / 'sp500_sectors.csv'
# STATE_CSV = DATA / 'state.csv'

DATA = 'https://raw.githubusercontent.com/young-hwanlee/practical-statistics-for-data-scientists/master/data/'

AIRLINE_STATS_CSV = DATA + 'airline_stats.csv'
KC_TAX_CSV = DATA + 'kc_tax.csv.gz'
LC_LOANS_CSV = DATA + 'lc_loans.csv'
AIRPORT_DELAYS_CSV = DATA + 'dfw_airline.csv'
SP500_DATA_CSV = DATA + 'sp500_data.csv.gz'
SP500_SECTORS_CSV = DATA + 'sp500_sectors.csv'
STATE_CSV = DATA + 'state.csv'
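
If you keep some files locally and want to fall back to the remote copy for the rest, one option is a small helper that checks for a local file first. This is only a sketch: the function name `data_source` and the default local directory `'data'` are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path

# Remote base URL as used in this notebook
REMOTE_BASE = 'https://raw.githubusercontent.com/young-hwanlee/practical-statistics-for-data-scientists/master/data/'


def data_source(filename, local_dir='data', remote_base=REMOTE_BASE):
    """Return a local path if the file exists there, otherwise the remote URL.

    Hypothetical helper: `local_dir` is an assumed location for local copies.
    """
    local_path = Path(local_dir) / filename
    if local_path.exists():
        return str(local_path)
    return remote_base + filename


# The result can be passed directly to pd.read_csv, which accepts paths and URLs
print(data_source('state.csv'))
```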

## **Estimates of Location**
### **Example: Location Estimates of Population and Murder Rates**

In [4]:
# Table 1-2
state = pd.read_csv(STATE_CSV)
print(state.head(8))

         State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE


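Compute the mean, trimmed mean, and median of Population. The mean and median are available as pandas methods on the column. The trimmed mean requires the `trim_mean` function from `scipy.stats`; its second argument is the fraction of values dropped from each end, so 0.1 removes the lowest 10% and the highest 10% before averaging.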

In [5]:
state = pd.read_csv(STATE_CSV)
print(state['Population'].mean())

6162876.3


In [6]:
print(trim_mean(state['Population'], 0.1))

4783697.125


In [7]:
print(state['Population'].median())

4436369.5
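
To see what the trimmed mean does, it may help to compute it by hand on a small array. The sketch below uses made-up toy data (not the state data) and a hypothetical helper `manual_trim_mean`. It sorts the values, drops the given fraction from each end, and averages the rest, which should agree with `trim_mean` here.

```python
import numpy as np
from scipy.stats import trim_mean


def manual_trim_mean(values, proportion):
    """Drop `proportion` of the sorted values from each end, then average the rest."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n_cut = int(proportion * len(ordered))  # number of values dropped per end
    return ordered[n_cut:len(ordered) - n_cut].mean()


# Hypothetical toy data with one extreme value
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(np.mean(toy))                # 14.5, pulled up by the value 100
print(manual_trim_mean(toy, 0.1))  # 5.5, after dropping 1 and 100
print(trim_mean(toy, 0.1))         # 5.5, scipy result for comparison
print(np.median(toy))              # 5.5
```

As with Population above, the trimmed mean and the median are far less affected by the extreme value than the plain mean.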


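The weighted mean is available in numpy through `np.average` with the `weights` argument. For the weighted median, we can use the specialized package wquantiles (https://pypi.org/project/wquantiles/). Here each state's murder rate is weighted by its population.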

In [8]:
print(state['Murder.Rate'].mean())

4.066


In [9]:
print(np.average(state['Murder.Rate'], weights=state['Population']))

4.445833981123393


In [10]:
print(wquantiles.median(state['Murder.Rate'], weights=state['Population']))

4.4
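
The following sketch illustrates the idea behind the weighted mean and weighted median on made-up toy data. The helper names are hypothetical, and `simple_weighted_median` uses one common definition (the smallest value at which the cumulative weight reaches half the total). It is not necessarily the algorithm used by wquantiles, which may interpolate and give slightly different results.

```python
import numpy as np


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (values * weights).sum() / weights.sum()


def simple_weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total.

    Assumed definition for illustration; other packages may interpolate.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[idx]


# Hypothetical toy data: three regions with rates and population weights
rates = [2.0, 4.0, 10.0]
populations = [1, 1, 8]

print(weighted_mean(rates, populations))           # same as np.average(rates, weights=populations)
print(simple_weighted_median(rates, populations))  # 10.0, the heavily weighted region dominates
print(np.median(rates))                            # 4.0, the unweighted median
```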


## **Estimates of Variability**

In [11]:
# Table 1-2
print(state.head(8))

         State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE


Standard deviation

In [12]:
print(state['Population'].std())

6848235.347401142
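
One detail worth keeping in mind: the pandas `std` method defaults to the sample standard deviation (`ddof=1`, dividing by n - 1), whereas `np.std` defaults to the population version (`ddof=0`, dividing by n). The sketch below shows the difference on made-up toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mean 5 and squared deviations summing to 32
toy = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = toy.std()                   # pandas default: ddof=1, divides by n - 1
population_std = np.std(toy.to_numpy())  # numpy default: ddof=0, divides by n

print(sample_std)       # about 2.138, i.e. sqrt(32 / 7)
print(population_std)   # 2.0, i.e. sqrt(32 / 8)
print(toy.std(ddof=0))  # 2.0, pandas matches numpy once ddof is set explicitly
```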


The interquartile range (IQR) is calculated as the difference between the 75th and 25th percentiles.

In [13]:
print(state['Population'].quantile(0.75) - state['Population'].quantile(0.25))

4847308.0
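
The same calculation can be written in a few equivalent ways. The sketch below uses made-up toy data and compares the pandas `quantile` method with `np.percentile` and `scipy.stats.iqr`. All three use linear interpolation by default, so they should agree here.

```python
import numpy as np
import pandas as pd
from scipy.stats import iqr

# Hypothetical toy data
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# pandas: request both quantiles at once and subtract
q1, q3 = toy.quantile([0.25, 0.75])
print(q3 - q1)  # 4.0

# numpy: percentiles are given on a 0-100 scale
p25, p75 = np.percentile(toy.to_numpy(), [25, 75])
print(p75 - p25)  # 4.0

# scipy: convenience function for the same quantity
print(iqr(toy.to_numpy()))  # 4.0
```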


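The median absolute deviation from the median (MAD) can be calculated with the `robust.scale.mad` function in statsmodels. By default, this function divides the raw MAD by 0.6744897501960817, the 75th percentile of the standard normal distribution, so that the result is comparable to the standard deviation for normally distributed data. The second line reproduces the calculation by hand.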

In [14]:
print(robust.scale.mad(state['Population']))
print(abs(state['Population'] - state['Population'].median()).median() / 0.6744897501960817)

3849876.1459979336
3849876.1459979336
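
The scaling constant and the steps of the calculation can be sketched on made-up toy data as follows. The example uses `scipy.stats.median_abs_deviation` with `scale='normal'` as a cross-check, which assumes a reasonably recent SciPy version (1.5 or later).

```python
import numpy as np
from scipy.stats import norm, median_abs_deviation

# Hypothetical toy data with one extreme value
toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Step 1: absolute deviations from the median, then their median
raw_mad = np.median(np.abs(toy - np.median(toy)))
print(raw_mad)  # 1.0

# Step 2: divide by the 75th percentile of the standard normal distribution
scale_constant = norm.ppf(0.75)
print(scale_constant)  # about 0.6745

scaled_mad = raw_mad / scale_constant
print(scaled_mad)  # about 1.4826

# Cross-check with SciPy's built-in function
print(median_abs_deviation(toy, scale='normal'))

# The standard deviation, by contrast, is dominated by the value 100
print(toy.std(ddof=1))
```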


## **Percentiles and Boxplots**

## **Frequency Table and Histograms**

## **Density Estimates**

## **Exploring Binary and Categorical Data**

## **Correlation**

## **Scatterplots**

## **Exploring Two or More Variables**

### **Hexagonal binning and Contours**
#### **Plotting numeric versus numeric data**

## **Two Categorical Variables**

## **Categorical and Numeric Data**

## **Visualizing Multiple Variables**