# Practical Statistics

In [54]:
%matplotlib inline

from pathlib import Path

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles
import statistics
import seaborn as sns
import matplotlib.pyplot as plt

In [13]:
AIRLINE_STATS_CSV = pd.read_csv('data/airline_stats.csv')
KC_TAX_CSV = pd.read_csv('data/kc_tax.csv.gz')
LC_LOANS_CSV = pd.read_csv('data/lc_loans.csv')
AIRPORT_DELAYS_CSV = pd.read_csv('data/dfw_airline.csv')
SP500_DATA_CSV = pd.read_csv('data/sp500_data.csv.gz')
SP500_SECTORS_CSV = pd.read_csv('data/sp500_sectors.csv')
STATE_CSV = pd.read_csv('data/state.csv')

### Mean

The most basic estimate of location is the mean, or average value. The mean is the
sum of all the values divided by the number of values. Consider the following set
of numbers: {3 5 1 2}. The mean is (3 + 5 + 1 + 2) / 4 = 11 / 4 = 2.75.
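
As a quick check of the arithmetic above, here is a minimal sketch using only the standard library. The variable names are chosen for illustration and do not appear elsewhere in this notebook.

```python
import statistics

values = [3, 5, 1, 2]

# Mean = sum of the values divided by the number of values
mean_by_hand = sum(values) / len(values)
mean_stdlib = statistics.mean(values)

print(mean_by_hand)  # 2.75
print(mean_stdlib)   # 2.75
```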


### Trimmed Mean

A variation of the mean is a trimmed mean, which you calculate by dropping a
fixed number of sorted values at each end and then taking an average of the
remaining values.

<li>A trimmed mean eliminates the influence of extreme values.</li>
<li>Trimmed means are widely used and, in many cases, are
preferable to the ordinary mean.</li>
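
The definition above (sort, drop a fixed number of values at each end, average the rest) can be sketched by hand. The helper `trimmed_mean_by_count` and the sample data are hypothetical, made up only to show the effect of one extreme value.

```python
# Hypothetical sample with one extreme value (100)
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

def trimmed_mean_by_count(data, p):
    """Drop the p smallest and p largest values, then average the rest."""
    kept = sorted(data)[p:len(data) - p]
    return sum(kept) / len(kept)

plain_mean = sum(values) / len(values)
trimmed = trimmed_mean_by_count(values, 1)

print(plain_mean)  # 14.5
print(trimmed)     # 5.5
```

Dropping a single value at each end is enough here to bring the estimate back to the bulk of the data.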

### Weighted Mean

Another type of mean is a weighted mean, which you calculate by multiplying
each data value x_i by a weight w_i and dividing their sum by the sum of the
weights.

#### There are two main motivations for using a weighted mean:

<li>Some values are intrinsically more variable than others, and highly variable
observations are given a lower weight. For example, if we are taking the
average from multiple sensors and one of the sensors is less accurate, then
we might downweight the data from that sensor.</li>
    
    
<li>The data collected does not equally represent the different groups that we are
interested in measuring. For example, because of the way an online
experiment was conducted, we may not have a set of data that accurately
reflects all groups in the user base. To correct that, we can give a higher
weight to the values from the groups that were underrepresented.</li>
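
The sensor example from the first motivation could look like the following sketch. The readings and weights are invented for illustration; only `np.average` with its `weights` argument is taken from NumPy itself.

```python
import numpy as np

# Hypothetical readings from three sensors; the third is assumed to be
# less accurate, so it gets a lower weight
readings = np.array([10.0, 12.0, 20.0])
weights = np.array([2.0, 2.0, 1.0])

# Weighted mean = sum(w_i * x_i) / sum(w_i)
weighted_by_hand = (weights * readings).sum() / weights.sum()
weighted_numpy = np.average(readings, weights=weights)

print(readings.mean())   # 14.0
print(weighted_by_hand)  # 12.8
print(weighted_numpy)    # 12.8
```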

### Median and Robust Estimates

The median is the middle number in a sorted list of the data. If there is an even
number of data values, the middle value is one that is not actually in the data set,
but rather the average of the two values that divide the sorted data into upper and
lower halves. Compared to the mean, which uses all observations, the median
depends only on the values in the center of the sorted data.
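
A small sketch of the odd and even cases, using the standard library's `statistics.median` on made-up lists:

```python
import statistics

odd_values = [3, 5, 1, 2, 4]   # sorted: 1 2 3 4 5
even_values = [3, 5, 1, 2]     # sorted: 1 2 3 5

median_odd = statistics.median(odd_values)
median_even = statistics.median(even_values)

print(median_odd)   # 3
print(median_even)  # 2.5, the average of 2 and 3, not itself a data value
```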

### Outliers
The median is referred to as a robust estimate of location since it is not influenced
by outliers (extreme cases) that could skew the results. An outlier is any value
that is very distant from the other values in a data set.

The median is not the only robust estimate of location. In fact, a trimmed mean is
widely used to avoid the influence of outliers. For example, trimming the bottom
and top 10% (a common choice) of the data will provide protection against
outliers in all but the smallest data sets. The trimmed mean can be thought of as a
compromise between the median and the mean: it is robust to extreme values in
the data, but uses more data to calculate the estimate for location.
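
To make the comparison concrete, the sketch below replaces one value in a made-up sample with an extreme one and recomputes all three estimates. The helper `trimmed_mean` is hypothetical; it drops `int(proportion * n)` sorted values from each end before averaging.

```python
import statistics

values = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9]
with_outlier = values[:-1] + [900]   # replace the largest value with an extreme one

def trimmed_mean(data, proportion):
    """Drop int(proportion * n) sorted values from each end, then average."""
    k = int(proportion * len(data))
    kept = sorted(data)[k:len(data) - k]
    return sum(kept) / len(kept)

for label, data in [("original", values), ("with outlier", with_outlier)]:
    print(label,
          statistics.mean(data),
          statistics.median(data),
          trimmed_mean(data, 0.1))
```

In this sample the mean moves from 5.2 to 94.3, while the median (5.0) and the 10% trimmed mean (5.125) are unchanged.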

## Location Estimates of Population and Murder Rates

In [33]:
state_csv = STATE_CSV.rename(columns={"Murder.Rate": "Murder_Rate"})

In [34]:
print(STATE_CSV.head(8))

         State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE


In [36]:
# Mean 
state_csv["Population"].mean()

6162876.3

In [37]:
trim_mean(state_csv['Population'], 0.1)

4783697.125

The mean is bigger than the trimmed mean, which is bigger than the median
(computed in the next cell). The trimmed mean is lower than the mean because it
excludes the five largest and five smallest states (a proportion of 0.1 drops
10% of the 50 states from each end).

In [38]:
state_csv['Population'].median()

4436369.5

In [39]:
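# Weighted mean of the murder rate, using population as the weight
def weighted_average_m1(Murder_Rate, Population):
    # sum(rate * population) / sum(population), rounded to 2 decimals;
    # zip pairs values by position, so it does not depend on the index labels
    weighted_sum = sum(rate * pop for rate, pop in zip(Murder_Rate, Population))
    return round(weighted_sum / sum(Population), 2)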

weighted_average_m1(state_csv['Murder_Rate'], state_csv['Population'])

4.45

In [42]:
# Murder_Rate Mean
state_csv['Murder_Rate'].mean()

4.066

In [41]:
#Murder_Rate Median
state_csv['Murder_Rate'].median()

4.0

In [48]:
# Weighted Mean
print(np.average(state_csv['Murder_Rate'], weights=state_csv['Population']))

4.445833981123393


In [49]:
# Weighted Median
print(wquantiles.median(state_csv['Murder_Rate'], weights=state_csv['Population']))

4.4


In [50]:
# Murder_Rate Trimmed Mean 
trim_mean(state_csv['Murder_Rate'], 0.1)

3.9450000000000003

## Estimates of Variability

At the heart of statistics lies variability: measuring
it, reducing it, distinguishing random from real variability, identifying the various
sources of real variability, and making decisions in the presence of it.

The best-known estimates for variability are the variance and the standard
deviation, which are based on squared deviations. The variance is an average of
the squared deviations, and the standard deviation is the square root of the
variance.

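Neither the variance, the standard deviation, nor the mean absolute deviation is
robust to outliers and extreme values. Robust alternatives include the
interquartile range and the median absolute deviation from the median.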

### Variance
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. In other words, it measures how far a set of numbers is spread out from their average value. 

### Standard Deviation

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
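
The two definitions can be worked through by hand on the small set {3, 5, 1, 2} used earlier. This is only a sketch with illustrative variable names. One detail worth noting: `statistics.variance` and the pandas `.std()` method divide by n - 1 (the sample form) by default, whereas `statistics.pvariance` divides by n.

```python
import math
import statistics

values = [3, 5, 1, 2]
mean = sum(values) / len(values)   # 2.75

# Squared deviation of each value from the mean
squared_deviations = [(x - mean) ** 2 for x in values]

# Population variance divides by n; sample variance divides by n - 1
population_variance = sum(squared_deviations) / len(values)
sample_variance = sum(squared_deviations) / (len(values) - 1)

# Standard deviation is the square root of the variance
sample_std = math.sqrt(sample_variance)

print(population_variance, statistics.pvariance(values))
print(sample_variance, statistics.variance(values))
print(sample_std, statistics.stdev(values))
```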

In [51]:
state_csv.head(8)

         State  Population  Murder_Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE


In [53]:
state_csv['Population'].std()

6848235.347401142

In [55]:
statistics.variance(state_csv['Population'])

46898327373394.45

In [59]:
# Quartiles: the 0th, 25th, 50th and 75th percentiles
np.percentile(state_csv['Population'],np.arange(0,100,25))

array([ 563626.  , 1833004.25, 4436369.5 , 6680312.25])

In [60]:
# Deciles: the 0th, 10th, ..., 90th percentiles
np.percentile(state_csv['Population'],np.arange(0,100,10))

array([  563626. ,   889558.6,  1353913. ,  2508139.4,  3014731.8,
        4436369.5,  5457149.4,  6419552.5,  8940611.8, 12715204.3])

In [64]:
# IQR (interquartile range), using midpoint interpolation for the quartiles
from scipy import stats

IQR = stats.iqr(state_csv['Population'], interpolation='midpoint')

print(IQR)

4796417.0


In [66]:
# The interquartile range is the difference between the 75% and 25% quantiles.
# pandas interpolates linearly by default, so this differs from the midpoint-based value above.
print(state_csv['Population'].quantile(0.75) - state_csv['Population'].quantile(0.25))

4847308.0


In [68]:
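# The median absolute deviation from the median (MAD) can be calculated with statsmodels
print(robust.scale.mad(state_csv['Population']))
# Equivalent manual calculation: dividing by 0.6744897501960817 (the 0.75 quantile of the
# standard normal) puts the MAD on the same scale as the standard deviation for normal data
print(abs(state_csv['Population'] - state_csv['Population'].median()).median() / 0.6744897501960817)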

3849876.1459979336
3849876.1459979336
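
The scaling constant in the manual calculation is the 0.75 quantile of the standard normal distribution. The sketch below recovers it with the standard library and applies it in a hypothetical helper `mad`, on made-up data with one extreme value. It is an illustration under those assumptions, not the statsmodels implementation.

```python
from statistics import NormalDist, median

# 0.75 quantile of the standard normal distribution (about 0.6745)
NORMAL_Q75 = NormalDist().inv_cdf(0.75)

def mad(data, normalize=True):
    """Median absolute deviation from the median (hypothetical helper)."""
    center = median(data)
    raw = median([abs(x - center) for x in data])
    # Optionally rescale so the result is comparable to a standard deviation
    return raw / NORMAL_Q75 if normalize else raw

values = [1, 2, 3, 4, 100]   # made-up data with one extreme value

print(NORMAL_Q75)
print(mad(values, normalize=False))  # 1
print(mad(values))
```

The extreme value 100 contributes only one large absolute deviation, so it does not move the median of the deviations.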
