
##Exploratory Data Analysis

##Setup

In [2]:
!pip -q install wquantiles

In [3]:
import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
!wget https://github.com/gedeck/practical-statistics-for-data-scientists/raw/master/data/state.csv

##Estimates of Location

A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

**Mean**

The most basic estimate of location is the mean, or average value. The mean is the sum of all values divided by the number of values.

$
mean = \bar x = \frac {\sum_{i=1}^n x_i}{n}
$

**Trimmed Mean**

A variation of the mean is a trimmed mean, which you calculate by dropping a fixed
number of sorted values at each end and then taking an average of the remaining values.

Representing the sorted values by $x_1 , x_2 , ..., x_n$ where $x_1$ is the smallest value and $x_n$ the largest, the formula to compute the trimmed mean with $p$ smallest and largest values omitted is:

$
trimmed \space mean = \bar x = \frac {\sum_{i=p+1}^{n-p} x_i}{n-2p}
$

A trimmed mean eliminates the influence of extreme values.
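
The trimmed-mean formula above can be written out directly. This is a minimal sketch in plain Python; the function name `trimmed_mean` and the sample values are made up for illustration and are not part of the notebook's data.

```python
def trimmed_mean(values, p):
    """Mean of the values after dropping the p smallest and p largest."""
    n = len(values)
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= p < n / 2")
    ordered = sorted(values)
    kept = ordered[p:n - p]  # x_{p+1} .. x_{n-p} in the 1-based notation above
    return sum(kept) / (n - 2 * p)


# Made-up sample; 100 is an extreme value.
data = [2, 4, 5, 6, 7, 9, 100]

print(sum(data) / len(data))   # plain mean, pulled upward by 100
print(trimmed_mean(data, 1))   # mean after dropping 2 and 100
```

With `p = 0` the function reduces to the ordinary mean, which is a quick sanity check on the indexing.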

**Weighted Mean**

Another type of mean is a weighted mean, which you calculate by multiplying each data value $x_i$ by a user-specified weight $w_i$ and dividing their sum by the sum of the weights.

$
weighted \space mean = \bar x_w = \frac {\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}
$

There are two main motivations for using a weighted mean:

- Some values are intrinsically more variable than others, and highly variable
observations are given a lower weight.
- The data collected does not equally represent the different groups that we are interested in measuring.
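
As a rough illustration of the weighted-mean formula and the first motivation, the sketch below down-weights a reading from a sensor assumed to be less accurate. The function name, readings, and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical readings; the third sensor is assumed noisier, so it gets half weight.
readings = [10.0, 12.0, 20.0]
weights = [1.0, 1.0, 0.5]

print(sum(readings) / len(readings))      # unweighted mean
print(weighted_mean(readings, weights))   # noisy reading counts for less
```

When all weights are equal, the weighted mean is the same as the ordinary mean.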

**Median**

The median is the middle number in a sorted list of the data. Since the mean is much more sensitive to extreme values than the median, there are many instances in which the median is a better metric for location.
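
A small sketch of the median using Python's standard `statistics` module, with made-up values. When the number of values is even there is no single middle value, and `statistics.median` returns the average of the two middle values.

```python
import statistics

odd_count = [7, 1, 5]        # sorted: 1, 5, 7
even_count = [7, 1, 5, 3]    # sorted: 1, 3, 5, 7

median_odd = statistics.median(odd_count)     # the middle value, 5
median_even = statistics.median(even_count)   # average of 3 and 5

print(median_odd)
print(median_even)
```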

**Outliers**

The median is referred to as a robust estimate of location since it is not influenced by
outliers (extreme cases) that could skew the results. An outlier is any value that is very
distant from the other values in a data set.

When outliers are the
result of bad data, the mean will result in a poor estimate of location, while the
median will still be valid.

The median is not the only robust estimate of location. In fact, a trimmed mean is
widely used to avoid the influence of outliers.

The trimmed mean can be thought of as a compromise
between the median and the mean: it is robust to extreme values in the data, but uses
more data to calculate the estimate for location.
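
To make the robustness argument concrete, the sketch below corrupts a single value in a made-up sample and compares how the three estimates react. The helper `drop_one_each_end_mean` is a hypothetical name for a trimmed mean that removes one value from each end.

```python
import statistics


def drop_one_each_end_mean(values):
    """Trimmed mean that drops the single smallest and largest value."""
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)


clean = [10, 11, 12, 13, 14, 15, 16]
corrupted = clean[:-1] + [1600]   # last value mis-recorded (made-up bad data)

for name, sample in [("clean", clean), ("corrupted", corrupted)]:
    print(
        name,
        round(statistics.mean(sample), 1),   # shifts a lot when the outlier appears
        statistics.median(sample),           # unchanged
        drop_one_each_end_mean(sample),      # unchanged, yet uses five of the seven values
    )
```

In this sample the mean jumps from 13 to roughly 239, while the median and the trimmed mean both stay at 13.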


###Example: Location Estimates

Let's see the first few rows in the data set containing population and murder
rates (in units of murders per 100,000 people per year) for each US state (2010
Census).

In [6]:
state = pd.read_csv("state.csv")
state.head(8)

Unnamed: 0,State,Population,Murder.Rate,Abbreviation
0,Alabama,4779736,5.7,AL
1,Alaska,710231,5.6,AK
2,Arizona,6392017,4.7,AZ
3,Arkansas,2915918,5.6,AR
4,California,37253956,4.4,CA
5,Colorado,5029196,2.8,CO
6,Connecticut,3574097,2.4,CT
7,Delaware,897934,5.8,DE


Let's compute the mean, trimmed mean, and median of the population. The trimmed mean requires the `trim_mean` function in `scipy.stats`:

In [7]:
state["Population"].mean()

6162876.3

In [8]:
trim_mean(state["Population"], 0.1) # 10% trimming

4783697.125

In [9]:
state["Population"].median()

4436369.5

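The mean is bigger than the trimmed mean, which is bigger than the median. This is because the trimmed mean excludes the five largest and five smallest states (a trimming proportion of 0.1 drops 10% from each end), so the most populous states, which pull the mean upward, no longer contribute.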
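
The count of five states follows from simple arithmetic. The sketch below assumes the number of values cut from each end is the proportion times the number of observations, rounded down; the 50 integers are a stand-in for the sorted state populations, not the real data.

```python
n_states = 50
proportion = 0.1

# Assumption: values dropped per end = proportion * n, rounded down.
cut = int(proportion * n_states)

stand_in = list(range(1, n_states + 1))       # placeholder for 50 sorted values
kept = sorted(stand_in)[cut:n_states - cut]   # drop `cut` values from each end

print(cut)         # values dropped at each end
print(len(kept))   # values that remain in the average
```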

If we want to compute the average murder rate for the country, we need to use a weighted mean or median to account for the different populations of the states.

The weighted mean is available through NumPy's `np.average`. For the weighted median, we can use the specialized package `wquantiles`:

In [10]:
np.average(state["Murder.Rate"], weights=state["Population"])

4.445833981123393

In [11]:
wquantiles.median(state["Murder.Rate"], weights=state["Population"])

4.4

In this case, the weighted mean and the weighted median are about the same.
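
For intuition about what a weighted median does, here is a simplified sketch: sort the values, accumulate their weights, and return the first value at which the running weight reaches half of the total. This is one common definition and is not necessarily the algorithm `wquantiles` uses, so results on real data may differ slightly. The function name, rates, and populations are made up.

```python
def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total weight."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= total / 2:
            return value


# Made-up rates for three regions; the third region holds most of the population.
rates = [2.0, 4.0, 9.0]
populations = [1, 1, 10]

print(weighted_median(rates, [1, 1, 1]))     # equal weights: the ordinary median
print(weighted_median(rates, populations))   # dominated by the large region
```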

###Key Ideas

* The basic metric for location is the mean, but it can be sensitive to extreme values (outliers).
* Other metrics (median, trimmed mean) are less sensitive to outliers and unusual distributions and hence are more robust.

##Estimates of Variability