# Lesson 3: Statistical Fundamentals (Part 1)
## Starter code for guided practice & demos
Topics covered:

#### 1. Codealong: Summary statistics in pandas
- Basic stats (min, max, mean, median, mode, count)
- Box plots (interquartile range, quantiles, outliers)
- Standard deviation, variance, `DataFrame.describe()`
- Correlation
- Anscombe's Quartet

#### 2. Demo: Median & mean
- Generating random data using statistical distributions
- Density plots using matplotlib

#### 3. Demo: Skewness & kurtosis
- Normality
- Random seeds

#### 4. Demo: Types of distribution

#### 5. Demo: Dummy variables
- Using masks to randomly divide a dataset into two categories (useful later when we talk about cross-validation)
- Using maps to code categorical variables as numeric
- Using dummy variables to code a single categorical variable of k categories as k-1 dummy variables using `pd.get_dummies()`

<div style='background-color: #fcf2f2; border-color: #dFb5b4; border-left: 5px solid #dfb5b4; padding: 0.5em;'>
**Warning:** You will need to install ggplot before you run the next cell (run `conda install ggplot` in your shell/Terminal).
</div>

In [1]:
# Import the modules we'll be using today
from ggplot import mtcars
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import metrics

%matplotlib inline

## Codealong: Summary statistics in pandas
	
Methods available include:

    .min() - Compute minimum value
    .max() - Compute maximum value
    .mean() - Compute mean value
    .median() - Compute median value
    .mode() - Compute mode value(s)
    .count() - Count the number of observations

### Part 1. Basic stats
#### Read in the examples

In [2]:
# This is one way of creating a dataframe, by specifying a dictionary of lists
df = pd.DataFrame({
    'example1': [18, 24, 17, 21, 24, 16, 29, 18],
    'example2': [75, 87, 49, 68, 75, 84, 98, 92],
    'example3': [55, 47, 38, 66, 56, 64, 44, 39]})
df

Unnamed: 0,example1,example2,example3
0,18,75,55
1,24,87,47
2,17,49,38
3,21,68,66
4,24,75,56
5,16,84,64
6,29,98,44
7,18,92,39


#### Instructor example: Calculate the mean for each column

In [3]:
df.mean()

example1    20.875
example2    78.500
example3    51.125
dtype: float64

In [4]:
df.median()

example1    19.5
example2    79.5
example3    51.0
dtype: float64

In [5]:
df.mode()

Unnamed: 0,example1,example2,example3
0,18,75.0,
1,24,,
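
Unlike `.mean()` and `.median()`, `.mode()` can return more than one value per column, which is why its result is a table rather than a single row; columns with fewer modes are padded with missing values. A minimal sketch using only the `example1` values from above (the standalone Series name `example1` is just for illustration):

```python
import pandas as pd

# Same values as the 'example1' column above
example1 = pd.Series([18, 24, 17, 21, 24, 16, 29, 18])

# Count how often each value occurs; 18 and 24 both occur twice
counts = example1.value_counts()
print(counts)

# .mode() returns every value tied for the highest count, sorted ascending
modes = example1.mode()
print(modes.tolist())
```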


#### Students: Calculate the median, mode, max and min for each example

Note: All answers should match your hand calculations.

In [None]:
#maximum

In [None]:
#minimum

In [None]:
#median

In [None]:
#mode

### Part 2. Quartiles, interquartile range and box plots

#### Instructor: Interquartile range 

In [None]:
print "50% quantile (second quartile):"
print df.quantile(.50)
print

print "Median (red line of a box plot)"
print df.median()

In [None]:
print "25% (bottom of the box)"
print df.quantile(0.25)
print

print "75% (top of the box)"
print df.quantile(0.75)

In [None]:
df['example1'].plot(kind='box')
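
The box spans the 25th to the 75th percentile, and its height is the interquartile range (IQR). As a rough sketch of how this relates to outliers: by default, matplotlib's box plot draws any point lying more than 1.5 × IQR beyond the box as an individual marker. The helper name `iqr_fences` below is made up for illustration, and the example uses the `example1` values only.

```python
import pandas as pd

example1 = pd.Series([18, 24, 17, 21, 24, 16, 29, 18])

def iqr_fences(values, k=1.5):
    """Return (q1, q3, iqr, lower_fence, upper_fence) for a pandas Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr

q1, q3, iqr, lower, upper = iqr_fences(example1)
print("Q1 = {}, Q3 = {}, IQR = {}".format(q1, q3, iqr))

# Observations outside the fences would be drawn as separate points
outliers = example1[(example1 < lower) | (example1 > upper)]
print("fences: [{}, {}], outliers: {}".format(lower, upper, outliers.tolist()))
```

The same helper can be applied to the other columns to check what their box plots show.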

#### Student: Create plots for examples 2 and 3 and check the quartiles

#### What does the cross in example 2 represent?

Answer: 

### Part 3. Standard deviation and variance

#### In pandas
Methods include:

    .std() - Compute standard deviation
    .var() - Compute variance

#### Let's calculate variance by hand first.

<img src='https://dl.dropboxusercontent.com/u/3404204/samplevarstd.png' style="width: 50%; height: 50%">
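
For reference, the standard formulas for the sample variance and the sample standard deviation, which the cells below work through step by step, can be written as follows (with $x_i$ the observations, $\bar{x}$ their mean and $n$ the number of observations):

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
```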

In [None]:
# example1
mean = df["example1"].mean()
n = df["example1"].count()

print "example1:"
print df["example1"], "\n"

print "mean:", mean, "\n"
print "n:", n

In [None]:
# Written out by hand for instructional purposes 
# If you have time, try refactoring this to create a function to calculate variance for any dataset

# Find the squared distance from the mean
obs0 = (18 - mean) ** 2
obs1 = (24 - mean) ** 2
obs2 = (17 - mean) ** 2
obs3 = (21 - mean) ** 2
obs4 = (24 - mean) ** 2
obs5 = (16 - mean) ** 2
obs6 = (29 - mean) ** 2
obs7 = (18 - mean) ** 2

print obs0, obs1, obs2, obs3, obs4, obs5, obs6, obs7
print

# Sum each observation's squared distance from the mean 
numerator = obs0 + obs1 + obs2 + obs3 + obs4 + obs5 + obs6 + obs7
denominator = n - 1
variance = numerator/denominator
print "numerator:", numerator, "\n"
print "denominator:", denominator, "\n"
print "variance:", variance
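
One possible way to refactor the hand calculation into a reusable function is sketched below. The function name `sample_variance` is an illustrative choice, not part of pandas, and the result is compared against `.var()` as a sanity check.

```python
import pandas as pd

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    values = [float(v) for v in values]
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / (n - 1)

example1 = [18, 24, 17, 21, 24, 16, 29, 18]
by_hand = sample_variance(example1)
with_pandas = pd.Series(example1).var()  # pandas also divides by n - 1 by default

print("by hand: {}, pandas: {}".format(by_hand, with_pandas))
```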

In [None]:
# Using pandas
print "Variance"
print df["example1"].var()

#### Students: Calculate the standard deviation for each sample

Recall that the standard deviation is the square root of the variance. 

In [None]:
# Find the variance for each dataset

In [None]:
# Calculate standard deviation by hand from the variance of each dataset


In [None]:
# Now do it with pandas!


#### Shortcut!

In [None]:
# We can use describe() method to do lots of things at once:
# gives us count of non-missing values, mean, std dev, min/max + quartiles
df.describe()

#### Student: Check understanding 
Which value in the above table is the median? 

Answer: 

### Part 4. Correlation

In [None]:
# Correlations between example1, example2, and example3 as a correlation matrix
df.corr()
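
By default, `.corr()` computes the Pearson correlation coefficient: the covariance of two columns divided by the product of their standard deviations. A sketch of that calculation for `example1` and `example2`, with the helper name `pearson_r` chosen purely for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'example1': [18, 24, 17, 21, 24, 16, 29, 18],
    'example2': [75, 87, 49, 68, 75, 84, 98, 92]})

def pearson_r(x, y):
    """Pearson correlation from deviations about the mean."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson_r(df['example1'], df['example2'])
r_pandas = df['example1'].corr(df['example2'])
print("manual: {:.6f}, pandas: {:.6f}".format(r_manual, r_pandas))
```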

In [None]:
# Let's explore this dataset
anscombe = pd.read_csv('anscombe.csv')
anscombe = anscombe.drop('Unnamed: 0', axis=1)  # bit of data munging, we don't need this column so drop it
anscombe

In [None]:
# Huh, this looks like a weird dataset. Two x columns, four y columns...let's get some more intuition about it
# by looking at the aggregate statistics. Before you read on, what do you notice?
anscombe.describe()

When looking at the dataframe, the data looks quite different, but when looking at the output from the `.describe()` call, we notice the columns share some similar features:
- mean(x) = 9, i.e. both x cols have mean of 9
- var(x) = 11, i.e. both x cols have variance of 11 (std dev = sqrt(11) = 3.316625)
- mean(y) = 7.50, i.e. all y columns have mean of 7.50 (or close to it)
- var(y) = 4.12, i.e. all y columns have variance of 4.12 (std dev = sqrt(4.12) = ~2.03)


In [None]:
# Let's check out the correlation matrix to try to understand this dataset further
anscombe.corr()

If we were to plot x against y1, y2 and y3, and x4 against y4, what could we expect? One way to anticipate this is to look at the correlation for each pair, as well as the line of best fit (a.k.a. "linear regression") for each:
- corr(x, y) = 0.816 for all plots mentioned above
- linear regression for all plots is `y = 3 + 0.5*x`
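
These two numbers can be checked with a short sketch. To keep it self-contained, the first dataset is typed in using the values of Anscombe's quartet as commonly published; it is an assumption that `anscombe.csv` holds the same values in its `x` and `y1` columns.

```python
import numpy as np

# First of the four Anscombe datasets (x, y1), as commonly published;
# in the notebook these would come from anscombe['x'] and anscombe['y1']
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# A degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y1, 1)
r = np.corrcoef(x, y1)[0, 1]

print("y = {:.2f} + {:.2f}*x, corr = {:.3f}".format(intercept, slope, r))
```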

From inspecting summary statistics, these look pretty identical. Let's now visualise and inspect further.

In [None]:
# Visualise: y1, y2 and y3 are plotted against x, while y4 is plotted against x4
for y in ['y1', 'y2', 'y3', 'y4']:
    x = 'x4' if y == 'y4' else 'x'
    anscombe.plot(kind='scatter', x=x, y=y)

Visualising is critical! Even though all four of the above plots have (nearly) equal means, variances, correlations and regression coefficients:
- `x ~ y1` shows a reasonable, if slightly noisy, linear relationship between x & y1
- `x ~ y2` shows a perfect but non-linear (curved, roughly parabolic) relationship between x & y2
- `x ~ y3` shows a perfect linear relationship between x & y3, except for one outlier
- `x4 ~ y4` shows no relationship between x4 & y4: every x4 value is 8 except for one rogue point!

These graphs were created in 1973 by statistician Francis Anscombe to demonstrate the importance of graphing data before analyzing it. Read more here: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

---

## Demo: Mean & median

Although the mean and median both give us some sense of the centre of a distribution, they aren't always the same. The *median* gives us a value that **splits the data into two halves** while the *mean* is a **numeric average,** so extreme values can have a significant impact on the mean. 

In a symmetric distribution, the mean and median will be the same. Let's investigate with a density plot:

In [None]:
# First, we'll use a "random seed" to ensure your randomly generated numbers are the same as mine
np.random.seed(12345)

In [None]:
bunch_of_random_but_normally_distributed_numbers = np.random.normal(size=100000)
norm_data = pd.DataFrame(bunch_of_random_but_normally_distributed_numbers)
norm_data.head()

In [None]:
# Visualise
norm_data.plot(kind="density", figsize=(10,5))

plt.vlines(norm_data.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.4,
           linewidth=5.0,
           color="black")        # Set explicitly; the default colour depends on the matplotlib version

plt.vlines(norm_data.median(),   # Plot red line at median
           ymin=0, 
           ymax=0.4, 
           linewidth=2.0,
           color="red")

In the plot above, the mean and median are both so close to zero that the red median line lies on top of the thicker black line drawn at the mean. 

In skewed distributions, the mean tends to get pulled in the direction of the skew, while the median tends to resist the effects of skew:
 

In [None]:
# Generate skewed data from an exponential distribution
skewed_data = pd.DataFrame(np.random.exponential(size=100000))
skewed_data.head()

In [None]:
skewed_data.plot(kind="density", figsize=(10,5), xlim=(-1,5))

plt.vlines(skewed_data.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.8,
           linewidth=5.0,
           color="black")          # Set explicitly; the default colour depends on the matplotlib version

plt.vlines(skewed_data.median(),   # Plot red line at median
           ymin=0, 
           ymax=0.8, 
           linewidth=2.0,
           color="red")

Notice that the mean is also influenced heavily by outliers, while the median resists the influence of outliers:

In [None]:
norm_data = np.random.normal(size=50)
some_outliers = np.random.normal(15, size=3)
combined_data = pd.DataFrame(np.concatenate((norm_data, some_outliers), axis=0))

combined_data.plot(kind="density", figsize=(10,5), xlim=(-5,20))

plt.vlines(combined_data.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.2,
           linewidth=5.0,
           color="black")            # Set explicitly; the default colour depends on the matplotlib version

plt.vlines(combined_data.median(),   # Plot red line at median
           ymin=0, 
           ymax=0.2, 
           linewidth=2.0,
           color="red")

Since the median tends to resist the effects of skewness and outliers, it is known as a "robust" statistic.

The median generally gives a better sense of the typical value in a distribution with significant skew or outliers.
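
A tiny numeric sketch of the same point, using made-up values: adding a single extreme observation moves the mean a long way but barely shifts the median.

```python
import numpy as np

incomes = np.array([30, 32, 35, 38, 40], dtype=float)  # made-up values, in thousands
with_outlier = np.append(incomes, 1000.0)               # add one extreme value

mean_before, mean_after = np.mean(incomes), np.mean(with_outlier)
median_before, median_after = np.median(incomes), np.median(with_outlier)

print("mean:   {:.1f} -> {:.1f}".format(mean_before, mean_after))
print("median: {:.1f} -> {:.1f}".format(median_before, median_after))
```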

---

## Demo: Skewness and Kurtosis
*Skewness* measures the **skew or asymmetry of a distribution** while *Kurtosis* measures the **"peakedness" of a distribution**. 

We won't go into the exact calculations behind these, but they are essentially just statistics that take the idea of variance a step further: while variance involves squaring deviations from the mean, skewness involves cubing deviations from the mean, and kurtosis involves raising deviations from the mean to the 4th power.
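
For the curious, here is a minimal sketch of the moment-based definitions just described. It uses the plain (uncorrected) population formulas, whereas pandas' `.skew()` and `.kurt()` apply sample-size corrections, so the numbers will differ slightly on small samples. The function name `moment_skew_kurt` and the toy data are illustrative only.

```python
import numpy as np

def moment_skew_kurt(values):
    """Population (uncorrected) skewness and excess kurtosis from central moments."""
    x = np.asarray(values, dtype=float)
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)  # variance (divides by n, not n - 1)
    m3 = np.mean(dev ** 3)  # cubed deviations keep their sign
    m4 = np.mean(dev ** 4)  # 4th powers emphasise extreme deviations
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # subtract 3 so a normal distribution scores 0
    return skew, excess_kurt

symmetric = [-2, -1, 0, 1, 2]
right_tailed = [0, 0, 0, 1, 9]

sym_skew, sym_kurt = moment_skew_kurt(symmetric)
tail_skew, tail_kurt = moment_skew_kurt(right_tailed)

print("symmetric:    skew = {:.3f}, excess kurtosis = {:.3f}".format(sym_skew, sym_kurt))
print("right-tailed: skew = {:.3f}, excess kurtosis = {:.3f}".format(tail_skew, tail_kurt))
```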

Pandas has built-in methods for checking skewness and kurtosis, `df.skew()` and `df.kurt()` respectively:

In [None]:
comp1 = np.random.normal(loc=0, scale=1, size=200) # N(0, 1)
comp2 = np.random.normal(loc=10, scale=2, size=200) # N(10, 4)

df1 = pd.Series(comp1)
df2 = pd.Series(comp2)

In [None]:
print "N(0,1) with 200 variates:\t\t", df1.skew()  # Normal distribution, so skew is close to zero
print "N(10,4) with 200 variates:\t\t", df2.skew()  # Normal distribution, so skew is close to zero
print "exponential(1.0) with 100k variates:\t", skewed_data.skew()[0]  # Positively skewed data

In [None]:
print "N(0,1) with 200 variates:\t\t", df1.kurt()  # Normal distribution, so kurtosis is close to zero (up to sampling noise)
print "N(10,4) with 200 variates:\t\t", df2.kurt()  # Kurtosis does not depend on location or scale, so this is also close to zero
print "exponential(1.0) with 100k variates:\t", skewed_data.kurt()[0]  # Skewed & peaked, should have higher kurtosis

#### Example: mtcars

In [None]:
# Let's now use an example dataset (mtcars) from the ggplot library
mtcars.head()

In [None]:
mtcars['mpg'].plot(kind="density")

In [None]:
mtcars["mpg"].describe()  # Check out basic stats

In [None]:
mtcars["mpg"].skew()  # Check skewness

In [None]:
mtcars["mpg"].kurt()  # Check kurtosis 

---

## Demo: Types of distribution
To explore these measures further, let's create some dummy data and inspect it:

In [None]:
norm_data = np.random.normal(size=100000)
skewed_data = np.concatenate((np.random.normal(size=35000) + 2, 
                              np.random.exponential(size=65000)), 
                              axis=0)
uniform_data = np.random.uniform(0, 2, size=100000)
peaked_data = np.concatenate((np.random.exponential(size=50000),
                              np.random.exponential(size=50000) * (-1)),
                              axis=0)

data_df = pd.DataFrame({"norm": norm_data,
                        "skewed": skewed_data,
                        "uniform": uniform_data,
                        "peaked": peaked_data})

data_df.head()

In [None]:
# Visualise the normal distribution
data_df["norm"].plot(kind="density", xlim=(-5,5))

In [None]:
# Visualise the peaked distribution (two exponentials back to back)
data_df["peaked"].plot(kind="density", xlim=(-5,5))

In [None]:
# Visualise the skewed distribution (a mix of shifted normal and exponential samples)
data_df["skewed"].plot(kind="density", xlim=(-5,5))

In [None]:
# Visualise the uniform distribution
data_df["uniform"].plot(kind="density", xlim=(-5,5))

In [None]:
# We can visualise all the columns of the dataframe (i.e. all the distributions) in one go
data_df.plot(kind="density", xlim=(-5,5))

### Skewness
Now let's check the skewness of each of these distributions. 

Since skewness measures asymmetry, we'd expect to see low skewness for all of the distributions except the skewed one, because all the others are roughly symmetric:

In [None]:
data_df.skew()

### Kurtosis
Now let's check kurtosis. Since kurtosis measures peakedness, we'd expect the flat (uniform) distribution to have low kurtosis while the distributions with sharper peaks should have higher kurtosis.

In [None]:
data_df.kurt()

As we can see from the output, the normally distributed data has a kurtosis near zero, the flat distribution has negative kurtosis, and the two pointier distributions have positive kurtosis.
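
These signs agree with theory: excess kurtosis is 0 for any normal distribution and -1.2 for any uniform distribution, regardless of location or scale. A quick sketch that checks this on freshly generated samples (the seed, sample size and column names are arbitrary choices for this example):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
samples = pd.DataFrame({
    "norm": np.random.normal(size=100000),
    "wide_norm": np.random.normal(loc=10, scale=2, size=100000),
    "uniform": np.random.uniform(0, 2, size=100000)})

# .kurt() reports excess kurtosis, i.e. relative to a normal distribution
kurt = samples.kurt()
print(kurt)
```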

---

## Demo: Dummy variables
We want to represent categorical variables numerically. Take a feature, Area, with three categories: rural, suburban and urban. We can't simply code them as 0=rural, 1=suburban, 2=urban because that would imply an **ordered relationship** between the categories (suggesting that urban is somehow "twice" the suburban category, which doesn't make sense). Instead, we code the categories as 0/1 dummy variables.

Why do we only need **two dummy variables, not three?** Because two dummies capture all of the information about the Area feature, and implicitly define rural as the reference level.

In general, if you have a categorical feature with k levels, you create k-1 dummy variables.
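
To make the k-1 rule concrete, here is a small sketch on a hand-made Series (the five values are invented for illustration). It uses the `drop_first=True` option of `pd.get_dummies()`, which is an alternative to slicing off the first column with `.iloc[:, 1:]`.

```python
import pandas as pd

area = pd.Series(['rural', 'suburban', 'urban', 'suburban', 'rural'], name='Area')

# drop_first=True drops the first level in sorted order ('rural'),
# leaving k - 1 = 2 dummy columns for k = 3 levels
dummies = pd.get_dummies(area, prefix='Area', drop_first=True).astype(int)
print(dummies)

# A row with zeros in both columns can only be the reference level, 'rural'
print(dummies.iloc[0].tolist())
```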

In [None]:
# read data into a DataFrame
data = pd.read_csv('advertising.csv', index_col=0)
data.head()

### Dummying categorical variables with two categories
Let's create a new feature called "Size," and randomly assign observations to be small or large:

In [None]:
# Reset random seed for reproducibility
np.random.seed(12345)

# Create an array of booleans in which roughly half are True
# (one value per row of data, so the mask lines up with the DataFrame)
nums = np.random.rand(len(data))
mask_large = nums > 0.5

print "nums:", nums[0:6]
print "mask_large:", mask_large[0:6]

In [None]:
# Initially set Size to small, then change roughly half to be large
data['Size'] = 'small'
data.loc[mask_large, 'Size'] = 'large'
data.head()

We will soon encounter scikit-learn, which requires all features to be represented numerically.

If a feature only has two categories, we can simply create a dummy variable that represents the categories as a binary value.

In [None]:
# create a new Series called IsLarge
data['IsLarge'] = data.Size.map({'small': 0, 'large': 1})  # this is new, can you figure out what .map() is doing?
data.head()

### Dummying categorical variables with more than two categories
Let's create a new feature called Area, and randomly assign observations to be rural, suburban, or urban:

In [None]:
# assign roughly one third of observations to each group
nums = np.random.rand(len(data))
mask_suburban = (nums > 0.33) & (nums < 0.66)
mask_urban = nums > 0.66
data['Area'] = 'rural'
data.loc[mask_suburban, 'Area'] = 'suburban'
data.loc[mask_urban, 'Area'] = 'urban'
data.head()

We have to represent Area numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban because that would imply an ordered relationship between suburban and urban (and thus urban is somehow "twice" the suburban category).

Instead, we create a set of dummy variables, one fewer than the number of categories:

#### Common pattern: create multiple dummy variables using get_dummies(), then exclude the first dummy column
    my_categorical_var_dummies = pd.get_dummies(my_categorical_var, prefix='Area').iloc[:, 1:]

In [None]:
# create three dummy variables using get_dummies, then exclude the first dummy column
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:, 1:]

# now concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
data = pd.concat([data, area_dummies], axis=1)
data.head()