# GA Data Science 31
Instructor: Amy Roberts, PhD

###Course Part 1 review:   
**Algorithms:** 
1. KNN 
2. Regression
3. Logistic Regression


**Key Concepts:**  

1. bias and variance
2. standard deviation
3. standard error
4. MSE/RMSE
5. Confidence intervals
6. R-square
7. Under/overfitting
8. Cross validation
9. dummy coding 

**Classification model considerations**
1. Imbalanced Classes
2. ROC curves
3. Precision vs Recall  
--plus a note on Probability, Odds, and Odds Ratios--  




In [None]:
#General imports
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt


## Algorithms

### 1. KNN 
Classifying unknown observations based on their neighbors
 
| | Continuous | Categorical |
| --- | --- | --- |
| supervised | regression | **classification** |
| unsupervised | dimension reduction | clustering |





Algorithm | Type | Outcome | Model | Key Steps | Model fit | Interpretation
--- | --- | --- | --- | --- | --- | ---
kNN | Supervised | Categorical (classifier) | Non-parametric, lazy | Train/Test, selecting K | Accuracy, cross-validation | This observation belongs to group X


####Pseudocode:

In [None]:
# scikit-learn
from sklearn.neighbors import KNeighborsClassifier

# train, test, features, and outcome are placeholders for your own data
model = KNeighborsClassifier(n_neighbors=1)
model.fit(train[features], train[outcome])

# determine how many neighbors (k) to use: try odd values and compare test accuracy
results = []
for n in range(1, 51, 2):
    clf = KNeighborsClassifier(n_neighbors=n)
    clf.fit(train[features], train[outcome])
    preds = clf.predict(test[features])
    score = clf.score(test[features], test[outcome])

    print("Neighbors: %d, Score: %.3f" % (n, score))

    results.append([n, score])
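
The loop above assumes `train`, `test`, `features`, and `outcome` already exist. As a minimal self-contained sketch, the same idea can be tried on the iris data bundled with scikit-learn. The dataset, the 70/30 split, the range of k, and the `random_state` value are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: the iris dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Try odd values of k and keep the test accuracy for each
results = []
for k in range(1, 26, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    results.append((k, clf.score(X_test, y_test)))

best_k, best_score = max(results, key=lambda pair: pair[1])
print("Best k: %d, accuracy: %.3f" % (best_k, best_score))
```

Picking k from a single test split can be optimistic; cross-validation (Key Concept 8) is usually a safer way to choose it.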


### 2. Linear Regression
Fit a line to our data where coefficients are estimated using the least squares criterion.  This means we find the line (mathematically) which minimizes the sum of squared residuals (or "sum of squared errors") 

Algorithm | Type | Outcome | Model | Key Steps | Model fit | Interpretation
--- | --- | --- | --- | --- | --- | ---
Linear Regression | Supervised | Continuous | Parametric | Determining covariates | R-squared and adjusted R-squared | **categorical covar:** Compared to [reference category], we would expect [category of interest] to have [beta coefficient] more/less of the outcome, on average (e.g., compared to shrubs in partial sun, we would expect shrubs in full sun to be 11 cm taller, on average, at the same level of soil bacteria, where 11 is the beta coefficient for full sun). **continuous covar:** For every 1-unit increase in the covariate, we would expect the outcome to change by [beta coefficient], on average, holding the other covariates constant


####Pseudocode:

In [None]:
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf

# create a fitted model in one line
# formula notation is equivalent to writing out our model as 'outcome = predictor'
# with the following syntax: formula = 'outcome ~ predictor1 + predictor2 ... predictorN'
lm = smf.ols(formula='outcome ~ covariate', data=data).fit()

#print the full summary
lm.summary()
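
To make the least squares criterion concrete, here is a small sketch that fits a line with NumPy and checks that nudging the fitted slope increases the sum of squared residuals. The synthetic data (true intercept 2, true slope 3) and all variable names are assumptions for illustration; it is a stand-in for, not a copy of, the statsmodels call above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): outcome = 2 + 3 * covariate + noise
covariate = rng.uniform(0, 10, size=200)
outcome = 2.0 + 3.0 * covariate + rng.normal(0, 1.0, size=200)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(covariate), covariate])
coef, _, _, _ = np.linalg.lstsq(X, outcome, rcond=None)
intercept, slope = coef

def sum_squared_residuals(b0, b1):
    """Sum of squared differences between observed and predicted outcomes."""
    residuals = outcome - (b0 + b1 * covariate)
    return float(np.sum(residuals ** 2))

best = sum_squared_residuals(intercept, slope)
nudged = sum_squared_residuals(intercept, slope + 0.05)

print("intercept=%.2f slope=%.2f" % (intercept, slope))
print("SSR at the fit: %.1f, SSR with a nudged slope: %.1f" % (best, nudged))
```

The slope reads as: for every 1-unit increase in the covariate, the outcome changes by about `slope` units, on average.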


### 3. Logistic Regression
Fit an S-shaped (logistic) curve to our data: the log-odds of the outcome are modeled as a linear function of the covariates, and the coefficients are estimated by maximum likelihood rather than by least squares. This means we find the coefficients that make the observed outcomes most probable under the model.

Algorithm | Type | Outcome | Model | Key Steps | Model fit | Interpretation
--- | --- | --- | --- | --- | --- | ---
Logistic Regression | Supervised | Categorical | Parametric | Determining covariates, converting coefficients to odds ratios or predicted probabilities | Pseudo R-squared or maximum likelihood (ML) statistics | **categorical covar:** the exposed group has [OR] "times the odds" of the outcome compared to the unexposed group. **continuous covar:** every 1-unit increase in the covariate multiplies the odds of the outcome by the OR (e.g., every additional year of age multiplies the odds of survival by 0.94)

####Pseudocode:

In [None]:
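import statsmodels.api as sm

# data[outcome] is the 0/1 outcome; data[train_cols] holds the covariates
# (sm.Logit does not add an intercept; include a constant column if you want one)
logit = sm.Logit(data[outcome], data[train_cols])
result = logit.fit()  # coefficients are estimated by maximum likelihood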
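
The "convert coefficients" step can be sketched with plain Python. The intercept and age coefficient below are hypothetical values, picked so that the odds ratio lands near the 0.94 used in the table; they do not come from a fitted model.

```python
import math

# Hypothetical fitted values (assumptions, not real model output)
intercept = 2.0        # log-odds of the outcome when age is 0
beta_age = -0.0619     # change in log-odds per additional year of age

# Exponentiating a coefficient turns a log-odds change into an odds ratio
odds_ratio = math.exp(beta_age)

def predicted_probability(age):
    """Convert the linear predictor (log-odds) into a probability."""
    log_odds = intercept + beta_age * age
    return 1.0 / (1.0 + math.exp(-log_odds))

print("odds ratio per year: %.2f" % odds_ratio)
print("odds ratio per 10 years: %.2f" % odds_ratio ** 10)
print("P(outcome) at age 30: %.3f" % predicted_probability(30))
print("P(outcome) at age 40: %.3f" % predicted_probability(40))
```

With a fitted statsmodels result, the analogous step would be `np.exp(result.params)`.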

### Algorithms covered so far

Algorithm | Type | Outcome | Model | Key Steps | Model fit | Interpretation
--- | --- | --- | --- | --- | --- | ---
kNN | Supervised | Categorical (classifier) | Non-parametric, lazy | Train/Test, selecting K | Accuracy, cross-validation | This observation belongs to group X
Linear Regression | Supervised | Continuous | Parametric | Determining covariates | R-squared and adjusted R-squared | **categorical covar:** Compared to [reference category], we would expect [category of interest] to have [beta coefficient] more/less of the outcome, on average (e.g., compared to shrubs in partial sun, we would expect shrubs in full sun to be 11 cm taller, on average, at the same level of soil bacteria, where 11 is the beta coefficient for full sun). **continuous covar:** For every 1-unit increase in the covariate, we would expect the outcome to change by [beta coefficient], on average, holding the other covariates constant
Logistic Regression | Supervised | Categorical | Parametric | Determining covariates, converting coefficients to odds ratios or predicted probabilities | Pseudo R-squared or maximum likelihood (ML) statistics | **categorical covar:** the exposed group has [OR] "times the odds" of the outcome compared to the unexposed group. **continuous covar:** every 1-unit increase in the covariate multiplies the odds of the outcome by the OR (e.g., every additional year of age multiplies the odds of survival by 0.94)


## Key Concepts

###1. Bias and Variance

**Error due to Bias:** The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models' predictions are from the correct value.  
**Error due to Variance:** The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

<img src='images/biasVsVarianceImage.png' style="width: 30%; height: 30%">
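
The thought experiment of repeating the whole model building process can be simulated. In the sketch below, the true function, the noise level, the polynomial degrees (1 and 5), and the comparison point `x0` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_function(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)   # same x values in every repetition
x0 = 0.25                         # point where we compare predictions
n_repeats = 500

predictions = {1: [], 5: []}      # polynomial degree -> predictions at x0
for _ in range(n_repeats):
    # "Gather new data": fresh random noise around the true function
    y_train = true_function(x_train) + rng.normal(0, 0.3, size=x_train.size)
    for degree in predictions:
        coefs = np.polyfit(x_train, y_train, degree)
        predictions[degree].append(np.polyval(coefs, x0))

summary = {}
for degree, preds in predictions.items():
    preds = np.array(preds)
    bias = preds.mean() - true_function(x0)   # average prediction minus truth
    variance = preds.var()                    # spread of predictions across refits
    summary[degree] = (bias, variance)
    print("degree %d: bias=%.3f variance=%.4f" % (degree, bias, variance))
```

Under these assumptions the straight line is consistently off at `x0` (more bias), while the degree-5 fit is closer on average but moves around more from one dataset to the next (more variance).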

###2. Standard Deviation 
In statistics, the standard deviation (SD, also represented by the Greek letter sigma, σ for the population standard deviation or s for the sample standard deviation) is a measure that is used to quantify the amount of variation or  dispersion of a set of data values. **It is the square root of the variance.**

###3. Standard Error
The standard error of the mean (SEM) quantifies the precision of the mean. It is a measure of how far your sample mean is likely to be from the true population mean. It is expressed in the same units as the data.
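
As a quick sketch of how the two quantities relate, the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The shrub heights below are made-up numbers.

```python
import math
import statistics

# Hypothetical sample of shrub heights in cm
heights = [52.0, 47.5, 61.0, 55.5, 49.0, 58.0, 53.5, 50.5]

sample_variance = statistics.variance(heights)   # divides by n - 1
sample_sd = statistics.stdev(heights)            # square root of the variance
sem = sample_sd / math.sqrt(len(heights))        # standard error of the mean

print("mean = %.2f cm" % statistics.mean(heights))
print("SD   = %.2f cm" % sample_sd)
print("SEM  = %.2f cm" % sem)
```

The SD describes the spread of individual shrubs; the SEM describes how precisely the mean is estimated, and it shrinks as the sample grows.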

As the standard error of an estimated value generally increases with the size of the estimate, a large standard error may not necessarily result in an unreliable estimate. Therefore it is often better to compare the error in relation to the size of the estimate.

Recall that the regression line is the line that minimizes the sum of squared deviations of prediction (also called the sum of squares error). The standard error of the estimate is closely related to this quantity and is defined below:

$$\sigma_{est} = \sqrt{\frac{\sum (Y - Y')^2}{N}}$$

where σest is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores. The numerator is the sum of squared differences between the actual scores and the predicted scores.


###4. MSE/RMSE
MSE: the mean (average) of the squared errors. It reflects both the variance and the bias of the predictions.  
RMSE: the square root of the MSE, which puts the error back in the same units as the outcome.

The use of RMSE is very common and it makes an excellent general purpose error metric for numerical predictions.
Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors.

For an unbiased estimator, the RMSE is the square root of the variance, known as the standard deviation.
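
A small sketch of why RMSE punishes large errors more than Mean Absolute Error does. The numbers are invented: the first set of predictions is off by 1 everywhere, the second has a single error of 10.

```python
import math

actual = [10.0, 12.0, 9.0, 11.0, 10.0]
predicted = [11.0, 11.0, 10.0, 10.0, 11.0]      # every error has size 1
with_outlier = [11.0, 11.0, 10.0, 10.0, 20.0]   # last error has size 10

def mse(y, y_hat):
    """Mean of the squared errors."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Square root of the MSE, in the units of the outcome."""
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    """Mean of the absolute errors."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

for name, y_hat in [("even errors", predicted), ("one large error", with_outlier)]:
    print("%s: MAE=%.2f RMSE=%.2f" % (name, mae(actual, y_hat), rmse(actual, y_hat)))
```

When all errors are the same size, MAE and RMSE agree; once one error is much larger than the rest, RMSE rises faster than MAE.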

###5. Confidence Intervals/P-values
**Confidence Intervals** The 95% confidence interval extends roughly two (1.96) standard errors on either side of the estimate. If we drew 100 samples from the same population and built a confidence interval from each, approximately 95 of those intervals would contain the "true" coefficient.
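
The repeated-sampling reading of a confidence interval can be checked by simulation. This sketch uses a sample mean rather than a regression coefficient to keep it short; the population values, the sample size, and the number of repetitions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, true_sd = 10.0, 2.0      # assumed population
n_samples, sample_size = 1000, 50

covered = 0
for _ in range(n_samples):
    sample = rng.normal(true_mean, true_sd, size=sample_size)
    estimate = sample.mean()
    standard_error = sample.std(ddof=1) / np.sqrt(sample_size)
    lower = estimate - 1.96 * standard_error
    upper = estimate + 1.96 * standard_error
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / n_samples
print("Share of intervals containing the true mean: %.3f" % coverage)
```

The share should come out close to 0.95, which is what the "95%" refers to: the long-run behavior of the procedure, not the probability for any single interval.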

**Hypothesis testing** Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.
(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)

As it relates to model coefficients, here is the conventional hypothesis test:  
null hypothesis: There is no relationship between TV ads and Sales (and thus β1 equals zero)  
alternative hypothesis: There is a relationship between TV ads and Sales (and thus β1 is not equal to zero)  

**How do we test this hypothesis?** Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero (or 1 for ratio measures). Relatedly, the p-value is the probability of seeing a coefficient at least as far from zero as the one we estimated if the null hypothesis were true; it is not the probability that the coefficient is actually zero. Generally we look for a p-value less than 0.05.


###6. R squared
The most common way to evaluate the overall fit of a linear model is by the R-squared value. **R-squared is the proportion of variance explained**, meaning the proportion of variance in the observed data that is explained **by the model**, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)

R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model. 

**Limitations of R-squared**

1. Linear/logistic models rely upon a lot of assumptions (such as the features being independent), and if those assumptions are violated (which they usually are), R-squared and p-values are less reliable.

2. R-squared is susceptible to overfitting, and thus there is no guarantee that a model with a high R-squared value will generalize.

3. R-squared will always increase as you add more features to the model, even if they are unrelated to the response.

Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
There is an alternative to R-squared called **adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.**
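
Limitation 3 is easy to see in a sketch: adding pure-noise features cannot lower R-squared for a least squares fit. The synthetic data and the helper `r_squared` are assumptions for illustration; the adjusted R-squared formula used is the standard 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # one real feature plus noise

def r_squared(features, y):
    """R-squared and adjusted R-squared for a least squares fit with intercept."""
    X = np.column_stack([np.ones(len(y)), features])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coef) ** 2)      # error of our model
    ss_tot = np.sum((y - y.mean()) ** 2)      # error of the null model
    r2 = 1.0 - ss_res / ss_tot
    p = X.shape[1] - 1                        # features, excluding the intercept
    adj_r2 = 1.0 - (1.0 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj_r2

noise_features = rng.normal(size=(n, 10))     # unrelated to the response
r2_base, adj_base = r_squared(x.reshape(-1, 1), y)
r2_more, adj_more = r_squared(np.column_stack([x, noise_features]), y)

print("1 feature:   R2=%.4f adjusted R2=%.4f" % (r2_base, adj_base))
print("11 features: R2=%.4f adjusted R2=%.4f" % (r2_more, adj_more))
```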

**For feature selection, consider using cross-validation**  
Cross-validation provides a more reliable estimate of out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models. Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.

###7. Under and Overfitting
Typically our goal is to build a model with the data we currently have and use it to predict outside --new-- data. Under- and overfitting describe two ways such a model can fail:
1. Underfitting: the model is too simple to capture the underlying pattern, so it predicts poorly even on our current data.
2. Overfitting: the model captures our current data well, noise included, but predicts the outside --new-- data poorly, i.e., it is too specific to our current data.

**We use Train and Test techniques to protect against overfitting**

<img src='images/trainTest.png' style="width: 50%; height: 50%">
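
A minimal sketch of the train/test idea, assuming synthetic data and polynomial models of degree 1, 3, and 9 (all illustrative choices). The training error can only go down as the model gets more flexible, so the test error is what tells us whether the extra flexibility helps.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Synthetic data: a smooth curve plus noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, size=n)
    return x, y

x_train, y_train = make_data(20)    # the data we currently have
x_test, y_test = make_data(200)     # stand-in for outside, new data

def mse(coefs, x, y):
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

errors = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print("degree %d: train MSE=%.3f test MSE=%.3f"
          % (degree, errors[degree][0], errors[degree][1]))
```

A large gap between train and test error at the highest degree is the signature of overfitting; high error on both at the lowest degree points to underfitting.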

###8. Cross Validation
Often, splitting the data into a training and test set only once is not sufficient to ensure that your model fits well. Cross-validation is a standardized way to repeat the train/test split to increase our confidence in our prediction model.
How many folds are needed? [source](http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf)

**With a large number of folds:**  
1. The **bias of the true error rate estimator** will be **small** (the estimator will be very accurate)
2. The **variance of the true error rate estimator** will be **large**
3. The **computational time will be very large** as well (many experiments)

**With a small number of folds:**  
1. The number of experiments, and therefore the **computation time**, is **reduced**
2. The **variance of the estimator** will be **small**
3. The **bias of the estimator** will be **large** (conservative or higher than the true error rate)

In practice, the choice of the number of folds depends on the size of the dataset n.   
**For large datasets, even 3-Fold Cross Validation will be quite accurate**  
**For very sparse datasets,** we may have to use leave-one-out in order to train on as many examples as possible  
**A common choice for K-Fold Cross Validation is K=10**

<img src='images/crossValidationImage.png' style="width: 50%; height: 50%">

####Pseudocode:

In [None]:
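# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier_or_model, data_to_fit, optional_variable_to_predict, cv=10)
# note: cv is the number of folds (K) and is optional.
# If not specified, 5 is used (3 in scikit-learn versions before 0.22)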
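
For a version that runs as-is, here is a sketch using the iris data bundled with scikit-learn and a kNN classifier. The dataset and `n_neighbors=5` are illustrative assumptions standing in for the placeholders above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# 10-fold cross-validation: one accuracy score per held-out fold
scores = cross_val_score(model, X, y, cv=10)

print("Fold accuracies:", scores.round(2))
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread across folds gives a sense of how stable the estimate is.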

###9. Class/Dummy Variables
We have to represent categorical variables numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban because that would imply an **ordered relationship** between suburban and urban (and thus urban is somehow "twice" the suburban category).

Why do we only need **two dummy variables, not three?** Because two dummies capture all of the information about the Area feature and implicitly define rural as the reference level. (In general, if you have a categorical feature with k levels, you create k-1 dummy variables.)


In [None]:
# create three dummy variables using get_dummies, then exclude the first dummy column
my_categorical_var_dummies = pd.get_dummies(my_categorical_var, prefix='Area').iloc[:, 1:]
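
A small runnable sketch of the same step, assuming a made-up `Area` column with the three levels discussed above.

```python
import pandas as pd

# Hypothetical data with a three-level categorical feature
data = pd.DataFrame({"Area": ["rural", "suburban", "urban", "suburban", "rural"]})

# One dummy column per level, in sorted order of the level names
all_dummies = pd.get_dummies(data["Area"], prefix="Area")

# Drop the first column so that "rural" becomes the reference level
area_dummies = all_dummies.iloc[:, 1:]

print(list(all_dummies.columns))
print(list(area_dummies.columns))
print(area_dummies)
```

A rural row has zeros in both remaining columns, which is how the reference level is encoded. Passing `drop_first=True` to `pd.get_dummies` is an alternative to slicing with `.iloc`.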

### Special considerations and tools for Classifiers

###1. Imbalanced Classes
If one category has far more observations than the comparison category, the imbalance will confuse many classifiers: they will perform well on the dominant class and poorly on the minority class. This can be a major problem if you are mainly interested in the uncommon class (e.g., cancer, fraud).

Solutions: 
1. Undersampling the dominant class - remove some of the majority class so it has less weight  
    *Drawback: Removing datapoints could lose important information*
2. Oversampling the minority class - add more of the minority class so it has more weight.  
    *Drawback: Simply replicating random minority-class observations could cause overfitting*
3. Hybrid - doing both is often better --> [SMOTE](https://www.jair.org/media/953/live-953-2037-jair.pdf)
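
Solutions 1 and 2 can be sketched with NumPy index sampling. The 950/50 class counts are made up, and this is plain random resampling, not SMOTE (which synthesizes new minority points instead of duplicating existing ones).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 950 of the majority class (0), 50 of the minority class (1)
y = np.array([0] * 950 + [1] * 50)
majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Undersampling: keep a random subset of the majority class, without replacement
kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])

# Oversampling: draw minority rows with replacement until the classes match
drawn_minority = rng.choice(minority_idx, size=majority_idx.size, replace=True)
over_idx = np.concatenate([majority_idx, drawn_minority])

print("original:    ", np.bincount(y))
print("undersampled:", np.bincount(y[under_idx]))
print("oversampled: ", np.bincount(y[over_idx]))
```

The resulting index arrays would be used to select rows of the feature matrix as well. Resampling is normally applied to the training set only, so that the test set keeps the real class balance.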



###2. ROC- a more sophisticated way to capture misclassification

There are 4 key considerations when it comes to measuring the goodness of a classification model: 

<img src='images/typesOfMisclassificationImage.png' style="width: 50%; height: 50%">

Sensitivity: True Positive Rate  
Specificity: True Negative Rate (equal to 1 - False Positive Rate)

<img src='images/ROCurveImage.png' style="width: 25%; height: 25%">

We evaluate a classifier by measuring the area under its ROC curve (AUC). The greater the area under the curve, the more effective the classifier.

Then, for our chosen classifier, we pick an appropriate decision threshold. In general, we pick the decision threshold that gets us closest to the upper left corner.
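
To show where the curve comes from, here is a hand-rolled sketch that sweeps the decision threshold over a tiny set of made-up labels and scores, then measures the area with the trapezoid rule. The function names are hypothetical; in practice a library routine would be used.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up example: 1 = positive class, scores are predicted probabilities
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(labels, scores)
auc_value = area_under_curve(points)
print("ROC points:", points)
print("AUC = %.3f" % auc_value)
```

Each threshold gives one point on the curve; a threshold near the upper left corner trades a low false positive rate against a high true positive rate.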


####Pseudocode:

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

def plot_roc_curve(target_test, target_predicted_proba, this_label):
    fpr, tpr, thresholds = roc_curve(target_test, target_predicted_proba[:, 1])
    
    roc_auc = auc(fpr, tpr)
    # Plot ROC curve
    plt.plot(fpr, tpr, label= this_label + ', ROC Area = %0.3f' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate or (1 - Specificity)')
    plt.ylabel('True Positive Rate or (Sensitivity)')
    plt.title('ROC')
    plt.legend(loc="lower right")

###3. Precision vs Recall

**Precision** is the share of predicted positives that are truly positive: True Positives / (True Positives + False Positives)  
**Recall** is True Positives / Total Actual Positives (aka the True Positive Rate, or sensitivity)
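
Both definitions can be checked by counting directly. The `expected` and `predicted` labels below are made up for illustration.

```python
# Made-up labels: 1 = positive class, 0 = negative class
expected = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(expected, predicted))
true_positives = sum(1 for e, p in pairs if e == 1 and p == 1)
false_positives = sum(1 for e, p in pairs if e == 0 and p == 1)
false_negatives = sum(1 for e, p in pairs if e == 1 and p == 0)

# Precision: of everything we flagged as positive, how much was right?
precision = true_positives / (true_positives + false_positives)
# Recall: of everything that really was positive, how much did we find?
recall = true_positives / (true_positives + false_negatives)

print("precision = %.3f" % precision)
print("recall    = %.3f" % recall)
```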

####Pseudocode:

In [None]:
from sklearn import metrics
print(metrics.classification_report(expected, predicted))

# related 2x2 confusion matrix
print(metrics.confusion_matrix(expected, predicted))

### A Note on Probability, Odds, and Odds Ratios
In logistic regression, β1 represents the change in the log-odds for a unit change in x.

This means that e^(β1) is the factor by which the odds are multiplied for a unit change in x (the odds ratio).

Odds ratios can be calculated as (odds in the exposed group)/(odds in the unexposed group)

Probability: the number of ways that an event can occur divided by the total number of possible outcomes. 

The probability of drawing a red card from a standard deck of cards is 26/52 (50 percent). 
The probability of drawing a club from that deck is 13/52 (25 percent)

#### What's the probability of getting heads in a fair coin flip? 

In [2]:
1/2.0

0.5

The odds for an event is the ratio of the number of ways the event can occur to the number of ways it does not occur.

For example, using the same events as above, the odds for:  
1. drawing a red card from a standard deck of cards are 1:1  
2. drawing a club from that deck are 1:3
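
Probability and odds carry the same information, and converting between them is a one-liner each way. The helper names below are hypothetical.

```python
import math

def probability_to_odds(p):
    """Odds = chance the event occurs divided by the chance it does not."""
    return p / (1.0 - p)

def odds_to_probability(odds):
    """Inverse conversion: probability = odds / (1 + odds)."""
    return odds / (1.0 + odds)

p_red = 26 / 52.0    # red card
p_club = 13 / 52.0   # club

print("red card: probability %.2f, odds %.3f" % (p_red, probability_to_odds(p_red)))
print("club:     probability %.2f, odds %.3f" % (p_club, probability_to_odds(p_club)))
```

Odds of 1.0 correspond to 1:1, and odds of one third correspond to 1:3.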

####What are the odds of heads in a fair coin flip?

1:1

#### Suppose that 18 out of 20 patients in an experiment lost weight while using diet A, while 16 out of 20  lost weight using diet B.  

#####What's the probability of weight loss with diet A? What are the odds?

In [11]:
# 90 percent probability
probabilityA = 18/20.0
print(probabilityA)

# odds = (number who lost weight) / (number who did not)
oddsA = 18/2.0
print(oddsA)
# often stated "the odds of weight loss are 9:1"

0.9
9.0


#####What's the probability of weight loss with diet B? What are the odds?

In [10]:
# 80 percent probability, odds of 4:1
probabilityB = 16/20.0
print(probabilityB)
oddsB = 16/4.0
print(oddsB)


0.8
4.0


#####What's the odds ratio?

In [13]:
OR_A = oddsA/oddsB
print(OR_A)

2.25


Interpretation: Participants on study diet A had 2.25 times the odds of weight loss compared to those on study diet B. 

In [14]:
OR_B = oddsB/oddsA
print(OR_B)

0.4444444444444444


Interpretation: Participants on study diet B had 0.44 times the odds of weight loss compared to those on study diet A. 
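
The same arithmetic can be wrapped in a small helper that takes the four counts of a 2x2 table. The function names are hypothetical; the counts are the diet A and diet B numbers from above.

```python
def odds(events, non_events):
    """Odds = number with the outcome / number without it."""
    return events / float(non_events)

def odds_ratio(group1_events, group1_non_events, group2_events, group2_non_events):
    """Odds in group 1 divided by odds in group 2."""
    return odds(group1_events, group1_non_events) / odds(group2_events, group2_non_events)

# Diet A: 18 lost weight, 2 did not. Diet B: 16 lost weight, 4 did not.
or_a_vs_b = odds_ratio(18, 2, 16, 4)
or_b_vs_a = odds_ratio(16, 4, 18, 2)

print("A vs B: %.2f" % or_a_vs_b)
print("B vs A: %.2f" % or_b_vs_a)
```

Swapping the reference group gives the reciprocal, so the two odds ratios multiply to 1.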

#####What if this was a continuous variable?

Suppose we fit a logistic regression that uses maternal age to predict whether a child is born without birth defects. 

######Write out the model

In [None]:
log( P(no defect) / (1 - P(no defect)) ) = alpha + Beta1*(maternal age)

#####We run the regression and find an OR of 0.94 with a 95% confidence interval of (0.90, 0.98)

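Interpretation: For every 1 year increase in age, mothers have 0.94 times the odds of having a healthy baby. Because the 95% confidence interval does not include 1, we reject the null hypothesis of no relationship.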

---


#So what's next? 
Week | Tuesday | Thursday
--- | --- | ---
6 | 2/23: Flex |  **UNIT 3** 2/25:  Decision trees and random forest
 7 | 3/1: Natural Language Processing | 3/3: Dimensionality reduction
 8 | 3/8: Time series data | 3/10:  Create models with time series data
 9 | 3/15: Database technologies | 3/17: Final project work session
10 | 3/22: What’s next? | 3/24: Final project presentations
11 | 3/29: Final project presentations


##Reading for Thursday

Resources
scikit-learn documentation: [Decision Trees](http://scikit-learn.org/stable/modules/tree.html)  
Wikipedia: http://en.wikipedia.org/wiki/Decision_tree