# Statistical Rigor

### Significance Tests
Using our data, can we disprove an assumption with a pre-defined level of confidence?

* Null Hypothesis : It is a statement that we're trying to disprove by running our test.

* Null Hypothesis, put another way : The hypothesis we keep unless there is strong evidence against it.

* p-value : Probability of obtaining a test statistic at least as extreme as ours if the null hypothesis were true.

* p-value, put another way : The smallest significance level at which the null hypothesis can be rejected, given the observed test statistic.
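
As a minimal sketch of the p-value definition above, the snippet below turns a test statistic into a two-sided p-value under a standard normal null distribution. The value of `z_observed` is a hypothetical number chosen only for illustration; it does not come from the batting data.

```python
import scipy.stats

# Hypothetical observed z-statistic, chosen only for illustration.
z_observed = 1.96

# Two-sided p-value: probability, under a standard normal null distribution,
# of a statistic at least as extreme as the observed one in either direction.
p_value = 2 * scipy.stats.norm.sf(abs(z_observed))
print(p_value)
```

If `p_value` is at or below the significance level chosen in advance (for example 0.05), the null hypothesis is rejected.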

### Why is statistics useful in data science?
-> Statistics provides a formalized framework for comparing and evaluating data

-> It lets us evaluate whether effects perceived in our dataset reflect real differences across the whole population

### Example : Is there any difference between the batting averages of lefties and righties?

1. Many tests make assumptions about the distribution of the data.

2. A very common distribution is the Normal Distribution (Gaussian Distribution).

H_null : The two samples come from the same population.

## 1. T-test

When doing a t-test, we assume the data are normally distributed.

If the two groups have different variances, we use Welch's t-test; otherwise we use the standard (Student's) t-test.

In [None]:
# T-test between two groups

import pandas
import numpy
import scipy.stats

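def compare_averages(filename) :
    """
    Run Welch's t-test on the batting averages of left- and right-handed batters.
    Returns (True, result) if the null hypothesis is not rejected at the 5% level,
    and (False, result) otherwise.
    """
    df = pandas.read_csv(filename)
    result = scipy.stats.ttest_ind(df['avg'][df['handedness']=='L'],
                                   df['avg'][df['handedness']=='R'], equal_var=False)

    # result[1] is the p-value
    if result[1] <= .05 :
        return (False, result)
    else :
        return (True, result)

if __name__ == '__main__' :
    filename = 'baseball_data.csv'  # placeholder: path to the batting CSV, not given in the original
    result = compare_averages(filename)
    print (result)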
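
Since the batting CSV is not included here, the following is a rough sketch of the same comparison on synthetic data. The column names `avg` and `handedness` follow the cell above; the group sizes, means, and spreads are assumptions made up for illustration, not real batting statistics.

```python
import numpy
import pandas
import scipy.stats

# Synthetic stand-in for the batting data (all numbers are illustrative assumptions).
rng = numpy.random.default_rng(0)
df = pandas.DataFrame({
    'handedness': ['L'] * 200 + ['R'] * 200,
    'avg': numpy.concatenate([
        rng.normal(0.280, 0.030, 200),   # left-handed batters
        rng.normal(0.260, 0.030, 200),   # right-handed batters
    ]),
})

lefties = df['avg'][df['handedness'] == 'L']
righties = df['avg'][df['handedness'] == 'R']

# Welch's t-test (does not assume equal variances)
welch = scipy.stats.ttest_ind(lefties, righties, equal_var=False)
# Student's t-test (assumes equal variances)
student = scipy.stats.ttest_ind(lefties, righties, equal_var=True)

print(welch.pvalue, student.pvalue)
```

When the two groups have similar variances and sizes, as in this synthetic sample, both variants should give nearly the same p-value; they diverge when the variances or group sizes differ.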

## 2. Shapiro-Wilk Test

A test of whether a sample is normally distributed. It is available in SciPy as scipy.stats.shapiro.

In [None]:
w, p = scipy.stats.shapiro(data)

# w is the Shapiro-Wilk test statistic, p is the p-value
# H_null : The data come from a normal distribution
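
As an illustrative sketch, the test can be tried on two synthetic samples, one drawn from a normal distribution and one from a clearly skewed (exponential) distribution. The sample sizes and the seed are arbitrary assumptions.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

# A small p-value is evidence against the null hypothesis of normality.
w_normal, p_normal = scipy.stats.shapiro(normal_sample)
w_skewed, p_skewed = scipy.stats.shapiro(skewed_sample)

print(w_normal, p_normal)
print(w_skewed, p_skewed)
```

A large p-value does not prove the data are normal; it only means the test found no strong evidence against normality.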

## 3. Non-parametric Test

A statistical test that does not assume our data is drawn from any particular underlying probability distribution.

-> Mann-Whitney-Wilcoxon U test : Tests the null hypothesis that the two populations are the same.

u, p = scipy.stats.mannwhitneyu(x,y)
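
A minimal sketch of the call above on synthetic, non-normal data follows. The two samples and the shift between them are assumptions for illustration. The `alternative` argument is passed explicitly because its default has differed between SciPy versions.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)
# Two skewed samples; the second is shifted upward by 2.
x = rng.exponential(scale=1.0, size=100)
y = rng.exponential(scale=1.0, size=100) + 2.0

# Two-sided test of the null hypothesis that both samples come from the same population
u, p = scipy.stats.mannwhitneyu(x, y, alternative='two-sided')
print(u, p)
```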

## Statistics vs. Machine Learning?

-> Statistics is focused on analyzing existing data, and drawing valid conclusions.

-> Machine learning is focused on making predictions.

## Prediction with Linear Regression and Gradient Descent

Can we write an equation that takes a bunch of info and predicts Home Runs?

First, we define a cost function. In this case, the cost function is the least-squares error term.

Second, how do we minimize the cost function? -> Gradient descent


**Gradient Descent - Cost Function**

Cost function = 1/(2m) * sum ( ( Y_predicted - Y_observed )^2 ), where m is the number of data points
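
Gradient descent needs the gradient of this cost with respect to the parameters. The sketch below assumes the cost is the sum of squared errors divided by 2m (the form computed in `compute_cost`), in which case the gradient is the errors dotted with the features, divided by m. It checks that formula against a finite-difference approximation. The helper names `cost` and `cost_gradient` and the random data are hypothetical, introduced only for this check.

```python
import numpy

def cost(features, values, theta):
    # Sum of squared errors divided by 2m
    m = len(values)
    errors = numpy.dot(features, theta) - values
    return numpy.square(errors).sum() / (2 * m)

def cost_gradient(features, values, theta):
    # Analytic gradient of the cost above with respect to theta
    m = len(values)
    errors = numpy.dot(features, theta) - values
    return numpy.dot(errors, features) / m

rng = numpy.random.default_rng(0)
features = rng.normal(size=(50, 3))
values = rng.normal(size=50)
theta = numpy.zeros(3)

# Central finite differences, one coordinate of theta at a time
eps = 1e-6
numeric = numpy.array([
    (cost(features, values, theta + eps * e) - cost(features, values, theta - eps * e)) / (2 * eps)
    for e in numpy.eye(3)
])
analytic = cost_gradient(features, values, theta)
max_gap = numpy.abs(numeric - analytic).max()
print(max_gap)
```

The two gradients should agree up to floating-point rounding, which supports using the analytic form in the update step.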

In [1]:
import numpy
import pandas

def compute_cost(features, values, theta):
    """
    Compute the cost of a list of parameters, theta, given a list of features 
    (input data points) and values (output data points).
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)

    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """

    # Perform num_iterations updates to theta. After every update, compute the
    # cost for the new theta and append it to cost_history.

    m = len(values)
    cost_history = []

    for i in range(num_iterations) :
        predicted_values = numpy.dot(features, theta)
        # Step against the gradient of the cost, scaled by the learning rate alpha
        theta = theta - alpha / m * numpy.dot((predicted_values - values), features)
        cost_history.append(compute_cost(features, values, theta))

    return theta, pandas.Series(cost_history)
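
A possible way to exercise these functions is sketched below on synthetic data. The two functions are repeated in compact form so the snippet runs on its own. The data (a noiseless line with intercept 2 and slope 3), the learning rate, and the iteration count are assumptions chosen for illustration.

```python
import numpy
import pandas

def compute_cost(features, values, theta):
    m = len(values)
    return numpy.square(numpy.dot(features, theta) - values).sum() / (2 * m)

def gradient_descent(features, values, theta, alpha, num_iterations):
    m = len(values)
    cost_history = []
    for _ in range(num_iterations):
        predicted_values = numpy.dot(features, theta)
        theta = theta - alpha / m * numpy.dot(predicted_values - values, features)
        cost_history.append(compute_cost(features, values, theta))
    return theta, pandas.Series(cost_history)

# Synthetic data: values = 2 + 3 * x, with a column of ones for the intercept
rng = numpy.random.default_rng(0)
x = rng.normal(size=100)
features = numpy.column_stack([numpy.ones(len(x)), x])
values = 2.0 + 3.0 * x

theta, cost_history = gradient_descent(
    features, values, theta=numpy.zeros(2), alpha=0.1, num_iterations=1000
)
print(theta)
print(cost_history.iloc[0], cost_history.iloc[-1])
```

Plotting `cost_history` is a quick way to see whether the learning rate is reasonable: the cost should fall steadily rather than oscillate or grow.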


### Calculating R^2

R^2 = 1 - SSR / SSTO

SSR = sum ( ( data - prediction )^2 ), the sum of squared residuals

SSTO = sum ( ( data - avg )^2 ), the total sum of squares

In [2]:
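def compute_r_squared(data, predictions) :
    # Total sum of squares: squared deviations of the data from its mean
    SSTO = ((data - numpy.mean(data))**2).sum()
    # Sum of squared residuals: squared deviations of the data from the predictions
    SSR = ((data - predictions)**2).sum()
    r_squared = 1 - SSR / SSTO

    return r_squared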
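
Two boundary cases make the formula easier to read, sketched here with a tiny made-up array: perfect predictions give R^2 = 1, and always predicting the mean gives R^2 = 0. The function is repeated so the snippet runs on its own.

```python
import numpy

def compute_r_squared(data, predictions):
    ss_total = ((data - numpy.mean(data)) ** 2).sum()
    ss_residual = ((data - predictions) ** 2).sum()
    return 1 - ss_residual / ss_total

# Made-up values, for illustration only
data = numpy.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = compute_r_squared(data, data)                               # predictions equal the data
r2_mean = compute_r_squared(data, numpy.full_like(data, data.mean()))    # always predict the mean

print(r2_perfect, r2_mean)
```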

### Additional Considerations

- Other types of linear regression -> ordinary least squares regression

- Parameter Estimation

- Over / Underfitting

- Multiple local minima
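
As a rough sketch of the ordinary least squares option listed above, the parameters can also be obtained directly with `numpy.linalg.lstsq` instead of iterating with gradient descent. The synthetic data (intercept 2, slope 3, no noise) are an assumption for illustration.

```python
import numpy

# Synthetic data: values = 2 + 3 * x, with a column of ones for the intercept
rng = numpy.random.default_rng(0)
x = rng.normal(size=100)
features = numpy.column_stack([numpy.ones(len(x)), x])
values = 2.0 + 3.0 * x

# Least-squares solution of features @ theta ~= values
theta_ols, residuals, rank, singular_values = numpy.linalg.lstsq(features, values, rcond=None)
print(theta_ols)
```

For a least-squares cost like this one, the direct solution avoids choosing a learning rate or an iteration count, though gradient descent can remain practical when the feature matrix is very large.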