<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Experiments and Hypothesis Testing

_Authors: Alexander Egorenkov (DC)_

---

<a id="data-source"></a>
## Experiments and Hypothesis Testing

---

Today, we’ll use advertising data from an example in the book [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/).
- This is a well-known, standard introduction to machine learning.
- The book has a more advanced version — [Elements of Statistical Learning](http://web.stanford.edu/~hastie/ElemStatLearn/) — if you are comfortable with linear algebra and statistics at the graduate level.

#### Code-Along: Bring in Today's Data

In [1]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This allows plots to appear directly in the notebook.
%matplotlib inline
plt.style.use('fivethirtyeight') 

In [2]:
# Read data into a DataFrame.

# We use index_col to tell Pandas that the first column in the data has row labels.
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head() 

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


#### Questions About the Advertising Data

Let's pretend you work for the company that manufactures and markets the product in this dataset. The company might ask you the following: "On the basis of this data, how should we spend our advertising money in the future?"

<a id="what-are-the-featurescovariatespredictors"></a>
### What Are the Features/Covariates/Predictors?

In [4]:
# Answer:

<a id="what-is-the-outcomeresponse"></a>
### What Is the Outcome/Response?

In [5]:
# Answer:

<a id="math-review"></a>
## Math Review
---

### Variance:

The sample variance of $n$ observations $x_{1}, \dots, x_{n}$ with mean $\bar{x}$:

$$ \frac{1}{n - 1} \sum\limits_{i=1}^n (x_{i} - \bar{x})^2 $$
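As a rough illustration of the formula, here is a minimal sketch on a made-up Series (the values in `x` are hypothetical and are not taken from the advertising data). It computes the sample variance step by step and compares the result with the pandas `var()` aggregator.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT from the advertising data.
x = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0])

n = len(x)
x_bar = x.sum() / n                  # sample mean
squared_devs = (x - x_bar) ** 2      # (x_i - x_bar)^2 for each observation
manual_var = squared_devs.sum() / (n - 1)

print(manual_var)                            # 7.0
print(np.isclose(manual_var, x.var()))       # pandas var() divides by n - 1 by default

# NumPy divides by n unless told otherwise, so pass ddof=1 to match the formula.
print(np.isclose(manual_var, np.var(x.to_numpy(), ddof=1)))
```

The same steps should carry over to a real column once the toy Series is swapped for it.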

### Your Turn:  

Manually calculate the variance of the `radio` column. Confirm that you did the calculation correctly by using the `var()` aggregator.

### Standard Deviation:

The sample standard deviation is the square root of the sample variance:

$$ \sqrt{\frac{1}{n - 1} \sum\limits_{i=1}^n (x_{i} - \bar{x})^2} $$
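A minimal sketch of the same calculation in code, again on hypothetical toy values rather than the advertising data. It takes the square root of the manually computed sample variance and checks it against the pandas `std()` aggregator.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT from the advertising data.
x = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0])

n = len(x)
manual_var = ((x - x.mean()) ** 2).sum() / (n - 1)   # sample variance
manual_std = np.sqrt(manual_var)                     # square root of the variance

print(np.isclose(manual_std, x.std()))               # pandas std() uses n - 1 by default

# On a plain NumPy array, request the n - 1 denominator explicitly with ddof=1.
print(np.isclose(manual_std, np.std(x.to_numpy(), ddof=1)))
```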

### Your Turn: 

Calculate the standard deviation of the `radio` column. Confirm that you did it correctly by using the `std()` aggregator.

### Question Prompt:

 - What are the units used to report the variance and standard deviation?

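### Covariance:

$$ \frac{1}{n - 1} \sum\limits_{i=1}^n (x_{i} - \bar{x}) (y_{i} - \bar{y}) $$

 - Measures how two variables vary together: its sign gives the direction of their linear relationship, while its magnitude depends on the units of both variables.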
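The covariance formula can be sketched in code as follows. The `x` and `y` values are hypothetical toy numbers, not the advertising data. Note that `np.cov()` returns a 2x2 matrix, so the covariance between the two variables is the off-diagonal entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT from the advertising data.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)
# Multiply each pair of deviations from the mean, sum, and divide by n - 1.
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
print(manual_cov)                     # 1.5

# np.cov returns [[var(x), cov(x, y)], [cov(x, y), var(y)]] with an n - 1 denominator.
cov_matrix = np.cov(x, y)
print(cov_matrix[0, 1])               # off-diagonal entry: cov(x, y)

# pandas offers the same calculation directly on two Series.
print(x.cov(y))
```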
 
**Questions:**

 - What sign does an observation contribute to the covariance when both variables are above their averages?
  - When both are below average?
  - When one is above and the other is below?
 - What happens if you take the covariance of a variable with itself?

### Your Turn:  

Calculate the covariance between the `radio` and `sales` columns. Use the `np.cov()` function to double-check your results.

### Correlation:

$$ \frac{\frac{1}{n - 1} \sum\limits_{i=1}^n (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sigma_{X}\sigma_{Y}} $$

Here $\sigma_{X}$ and $\sigma_{Y}$ are the standard deviations of $x$ and $y$, computed with the same $n - 1$ denominator as the numerator.

 - What relation, if any, does it have with the covariance?
 - Will its sign match, oppose, or be unrelated to the sign of the covariance of the same two variables?
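To make the formula concrete, here is a minimal sketch on hypothetical toy values (not the advertising data). It divides the sample covariance by the product of the two sample standard deviations, then compares the result with the Pearson correlation that pandas computes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT from the advertising data.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Numerator: sample covariance (n - 1 denominator).
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Denominator: product of the sample standard deviations (also n - 1).
manual_corr = cov_xy / (x.std() * y.std())
print(manual_corr)

# Series.corr and DataFrame.corr both default to the Pearson correlation.
print(np.isclose(manual_corr, x.corr(y)))
toy = pd.DataFrame({"x": x, "y": y})
print(np.isclose(manual_corr, toy.corr().loc["x", "y"]))
```

Because the units cancel in the division, the result is unitless and always falls between -1 and 1.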

### Your Turn:

Calculate the correlation between the `radio` and `sales` columns. Use the DataFrame's `corr()` method to confirm that your answer is correct.

### Your Turn:

Take 10-15 minutes to do some exploratory data analysis (EDA) and figure out the relationship between the different types of advertising and overall sales.

Try the following:
 - the seaborn heatmap
 - the seaborn pairplot
 - plots of various aggregations (mean, median, etc.)
 - histograms and boxplots of the individual variables, to get a feel for their distributions
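One possible starting point for this exploration is sketched below. It runs on a synthetic stand-in DataFrame (`toy`, with made-up values and a made-up relationship to `sales`) so that it is self-contained. In the notebook you would use the `data` DataFrame loaded earlier instead.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in with the same column names; values are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "TV": rng.uniform(0, 300, size=50),
    "radio": rng.uniform(0, 50, size=50),
    "newspaper": rng.uniform(0, 100, size=50),
})
toy["sales"] = 5 + 0.05 * toy["TV"] + 0.1 * toy["radio"] + rng.normal(0, 1, size=50)

# Aggregations: one row per statistic, one column per variable.
summary = toy.agg(["mean", "median", "std"])
print(summary)

# Heatmap of the pairwise Pearson correlations.
corr = toy.corr()
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
ax.set_title("Correlation matrix (synthetic data)")

# Scatterplots of each advertising channel against sales.
grid = sns.pairplot(toy, x_vars=["TV", "radio", "newspaper"], y_vars=["sales"])
```

Histograms and boxplots can be added in the same way, for example with `toy.hist()` or `sns.boxplot(data=toy)`.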