# Lesson 3: Data Analysis

## Statistics


## Terminology
- `Significance level`
    In statistical hypothesis testing, a result is statistically significant when it is very unlikely to have occurred given the null hypothesis. More precisely, the significance level defined for a study, α, is the probability of the study rejecting the null hypothesis, given that it is true; and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true. The result is statistically significant, by the standards of the study, when p < α.
    [Link to wikipedia article](https://en.wikipedia.org/wiki/Statistical_significance)
- `Normal Distribution`


In [7]:
# Kurt's Introduction
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/umJQ6gVT8kY" frameborder="0" allowfullscreen></iframe>')

In [8]:
# Why is Statistics Useful?
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/DyeRm96wH5M" frameborder="0" allowfullscreen></iframe>')

In [10]:
# Introduction to Normal (Gauss Distribution)
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ZfOTcwXAdEw" frameborder="0" allowfullscreen></iframe>')

The equation for the normal distribution is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
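
As a quick check of the formula above, the density can be evaluated directly and compared against SciPy. This is only a minimal sketch: the helper name `normal_pdf` and its default arguments are made up for illustration, while `scipy.stats.norm.pdf` is the library's own implementation.

```python
import math
import scipy.stats

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal density f(x) for mean mu and standard deviation sigma."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Density of the standard normal at its mean, computed by hand and by SciPy.
peak = normal_pdf(0.0)
reference = scipy.stats.norm.pdf(0.0, loc=0.0, scale=1.0)
print(peak, reference)
```

The two printed values should agree up to floating-point rounding.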

### T-Test
To be more explicit:
- It is important to note that you cannot "accept" a null hypothesis.
- You can only "retain" it or "fail to reject" it.

If you would like to learn more about the t-test, check out [this lesson](https://classroom.udacity.com/courses/ud201/lessons/1333678604/concepts/1470193200923) in Intro to Inferential Statistics.

Welch's T-Test In Python
You can check out additional information about the scipy implementation of the t-test below:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
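
The snippet below is a small, hedged illustration of calling `scipy.stats.ttest_ind` as Welch's t-test. The two samples are synthetic (drawn from normal distributions chosen only for this example), and the 5% significance level `alpha` is an assumption; passing `equal_var=False` is what selects Welch's variant rather than the standard two-sample t-test.

```python
import numpy
import scipy.stats

# Synthetic samples with different means, variances and sizes (illustrative only).
rng = numpy.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=3.0, scale=3.0, size=50)

# equal_var=False requests Welch's t-test, which does not assume equal variances.
t_stat, p_value = scipy.stats.ttest_ind(sample_a, sample_b, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(t_stat, p_value, reject_null)
```

When `p_value` is below `alpha` we reject the null hypothesis that the two population means are equal; otherwise we fail to reject it.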


In [12]:
# t-Test video
from IPython.display import HTML
HTML('<iframe width="369" height="208" src="https://www.youtube.com/embed/tjSj2OkV51A" frameborder="0" allowfullscreen></iframe>')

In [13]:
# Welch's Two-Sample t-Test
from IPython.display import HTML
HTML('<iframe width="369" height="208" src="https://www.youtube.com/embed/B_1cnwYn7so" frameborder="0" allowfullscreen></iframe>')

### 14. Quiz - Welch's t-Test Exercise

Perform a **t-test** on two sets of baseball data (left-handed and right-handed hitters).

You will be given a csv file that has *three* columns:
a player's `name`, their `handedness` (L for left-handed or R for right-handed), and their
career batting average (called `avg`).
You can look at the csv file by downloading the baseball_stats file from Downloadables below.

Write a function that will:
- read the csv file into a pandas data frame, and
- run Welch's t-test on the two cohorts defined by handedness.
  - One cohort should be a data frame of right-handed batters.
  - The other cohort should be a data frame of left-handed batters.

We have included the `scipy.stats` library to help you write
or implement **Welch's t-test**:
http://docs.scipy.org/doc/scipy/reference/stats.html

With a significance level of 5% (that is, a 95% confidence level), if there is no
difference between the two cohorts, return a tuple consisting of
True, and then the tuple returned by `scipy.stats.ttest_ind`.

If there is a difference, return a tuple consisting of
False, and then the tuple returned by `scipy.stats.ttest_ind`.

For example, the tuple that you return may look like:
(False, (9.93570222, 0.000023))

Supporting materials
[baseball_stats.csv](https://www.udacity.com/api/nodes/702578673/supplemental_media/baseball-statscsv/download)

In [None]:
import numpy
import scipy.stats
import pandas

def compare_averages(filename):
    """
    Run Welch's t-test on the batting averages of left-handed and
    right-handed players; see the quiz description above.
    """
    baseball_data = pandas.read_csv(filename)
    lh_player = baseball_data.loc[baseball_data['handedness'] == 'L', 'avg']
    rh_player = baseball_data.loc[baseball_data['handedness'] == 'R', 'avg']

    # Welch's t-test: equal_var=False drops the equal-variance assumption.
    (t, p) = scipy.stats.ttest_ind(lh_player, rh_player, equal_var=False)

    # True means "no significant difference" at the 5% significance level.
    result = (p > 0.05, (t, p))

    return result
    

Your calculated t-statistic is 9.93570222624
The correct t-statistic is +/-9.93570222624

In [14]:
# Explanation for Welch's t-Test exercise
from IPython.display import HTML
HTML('<iframe width="550" height="309" src="https://www.youtube.com/embed/TrSU-GH7TDY" frameborder="0" allowfullscreen></iframe>')

### Non-normal Data

When performing the t-test, we assume that our data is normal.
In the wild, you'll often encounter probability distributions that are distinctly not normal.
They might look like the two diagrams below, or completely different.

![non-normal data example](image/l3-1-1.png)

As you might imagine, there are still statistical tests that we can use when our data is not normal.

First off, we should have some machinery in place for determining whether or not our data is **Gaussian** in the first place. A crude, inaccurate way of determining whether or not our data is normal is simply to plot a histogram of our data and ask: does this look like a bell curve? In both of the cases above, the answer would definitely be no. But we can do a little bit better than that. There are some statistical tests that we can use to measure the likelihood that a sample is drawn from a normally distributed population. One such test is the **Shapiro-Wilk** test. The theory of this test is outside the scope of this course, but you can run it easily like this:

```Python
(W, p) = scipy.stats.shapiro(data)
```
- `W` is the Shapiro-Wilk test statistic, and
- `p` is the p-value, which should be interpreted the same way as we would interpret the p-value for our t-test.

That is, given the null hypothesis that this data is drawn from a normal distribution, what is the probability that we would observe a value of W at least as extreme as the one that we see?
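
To make the interpretation above concrete, here is a rough sketch that runs `scipy.stats.shapiro` on two synthetic samples: one drawn from a normal distribution and one from a strongly skewed exponential distribution. The samples, the seed and the variable names are assumptions made only for this example.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

# Null hypothesis in both cases: the sample comes from a normal distribution.
W_normal, p_normal = scipy.stats.shapiro(normal_sample)
W_skewed, p_skewed = scipy.stats.shapiro(skewed_sample)

print("normal sample:", W_normal, p_normal)
print("skewed sample:", W_skewed, p_skewed)
```

A small p-value, as we would expect for the skewed sample, is evidence against normality. A large p-value does not prove that the data is normal; it only means we fail to reject that hypothesis.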


### Non-Parametric Test

A non-parametric test is a statistical test that does not assume our data is drawn from any particular underlying probability distribution.

The Mann-Whitney U test is one such test. It tests the null hypothesis that the two populations are the same:

```Python
(U, P) = scipy.stats.mannwhitneyu(x, y)
```
- `x` and `y` are the two samples.
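
As an illustrative sketch, the call above can be tried on two synthetic, clearly non-normal samples. The exponential samples and the seed are assumptions for this example. The `alternative="two-sided"` argument is passed explicitly because the default behaviour has differed between SciPy releases.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100)  # skewed sample, smaller scale
y = rng.exponential(scale=3.0, size=100)  # skewed sample, larger scale

# Null hypothesis: both samples come from the same population.
U, p = scipy.stats.mannwhitneyu(x, y, alternative="two-sided")
print(U, p)
```

A p-value below the chosen significance level suggests that the two populations differ, without assuming that either one is normally distributed.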

### Note
These have just been some of the methods that we can use when performing statistical tests on data. As you can imagine, there are a number of additional ways to handle data from different probability distributions or data that looks like it came from no probability distribution whatsoever.

Data scientists can perform many statistical procedures. But it's vital to understand the underlying structure of the data set and, consequently, which statistical tests are appropriate given the data that we have.

There are many different types of statistical tests and even many different schools of thought within statistics regarding the correct way to analyze data. This has really just been an opportunity to get your feet wet with statistical analysis. *It's just the tip of the iceberg*.


## 2. What is Machine Learning?
In addition to statistics, many data scientists are well versed in machine learning.
Machine Learning is a branch of artificial intelligence that's focused on constructing systems that learn from large amounts of data to make predictions.

Machine learning can be useful, for example, to predict which movies you might like on Netflix or how many home runs a batter may hit over the course of his career.

These are all potential applications of machine learning.

In [16]:
# Why is Machine Learning Useful?

from IPython.display import HTML
HTML('<iframe width="846" height="476" src="https://www.youtube.com/embed/uKEm9_HvkKQ" frameborder="0" allowfullscreen></iframe>')

### Statistics vs. Machine Learning
What is the difference between statistics and machine learning?

## Gradient Descent in Python


In [None]:
import numpy
import pandas

def compute_cost(features, values, theta):
    """
    Compute the cost of a list of parameters, theta, given a list of features 
    (input data points) and values (output data points).
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)

    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """

    # Write code here that performs num_iterations updates to the elements of theta.
    # Every time you compute the cost for a given list of thetas, append it
    # to cost_history.
    # See the Instructor notes for hints.
    
    cost_history = []

    ###########################
    ### YOUR CODE GOES HERE ###
    ###########################

    return theta, pandas.Series(cost_history) # leave this line for the grader
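
For reference, one possible way to fill in the exercise is sketched below. It is not the official solution: it assumes a plain batch gradient descent update for the squared-error cost used in `compute_cost`, the function name `gradient_descent_sketch` is hypothetical, and the toy data (points on the line y = 1 + 2x) is made up for illustration.

```python
import numpy
import pandas

def compute_cost(features, values, theta):
    """Squared-error cost, as defined in the exercise above."""
    m = len(values)
    errors = numpy.dot(features, theta) - values
    return numpy.square(errors).sum() / (2 * m)

def gradient_descent_sketch(features, values, theta, alpha, num_iterations):
    """Batch gradient descent: theta <- theta - alpha * gradient of the cost."""
    m = len(values)
    theta = numpy.asarray(theta, dtype=float)
    cost_history = []
    for _ in range(num_iterations):
        predictions = numpy.dot(features, theta)
        # Gradient of the cost with respect to theta.
        gradient = numpy.dot(features.T, predictions - values) / m
        theta = theta - alpha * gradient
        cost_history.append(compute_cost(features, values, theta))
    return theta, pandas.Series(cost_history)

# Toy data: a column of ones (intercept) plus one feature, with y = 1 + 2x.
x = numpy.linspace(0.0, 1.0, 50)
features = numpy.column_stack([numpy.ones_like(x), x])
values = 1.0 + 2.0 * x

theta, cost_history = gradient_descent_sketch(
    features, values, numpy.zeros(2), alpha=0.5, num_iterations=5000)
print(theta)
```

On this toy problem the recovered `theta` should be close to the intercept 1 and slope 2, and the recorded cost should shrink over the iterations. The learning rate `alpha` usually needs tuning for real data: too large a value can make the cost diverge.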
