# Data Science Project Template

Adapted from 36-462: Data Mining S17

## Getting started

- Importing data and libraries
- Figuring out what the data is: variable types, distribution, relationships
- Looking for problems: outliers, missing data, inconsistent coding

In [2]:
# basic libraries, more imports below as needed
import pandas as pd
import numpy as np

# import/load data 
from sklearn import datasets
iris = datasets.load_iris()

## Explore the data

- Dimensions of the data: number of variables and observations
- Variable types: factors, dates, numeric
- Variable distributions: Number of factor levels, imbalance of factor levels, skewness of variables. Base rate and balance of outcome
- Simple relationships between variables
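A minimal sketch of these checks with pandas, assuming the iris data is loaded as a DataFrame (`as_frame=True` is used here for convenience and is not part of the original cell; the variable names are illustrative):

```python
import pandas as pd
from sklearn import datasets

# Load iris as a DataFrame so that pandas summaries are available
iris = datasets.load_iris(as_frame=True)
df = iris.frame  # four numeric features plus an integer "target" column
features = df.drop(columns="target")

n_obs, n_vars = df.shape                 # dimensions of the data
var_types = df.dtypes                    # variable types
summary = features.describe()            # location and spread of numeric variables
skewness = features.skew()               # skewness of each feature
class_balance = df["target"].value_counts(normalize=True)  # base rate of each class
correlations = features.corr()           # simple pairwise relationships

print(n_obs, n_vars)
print(var_types)
print(class_balance)
print(correlations.round(2))
```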

## Plot the data

Can help to inform modeling and feature extraction

- Plot univariate and bivariate distributions of variables: are they continuous? Bumpy? Mixed?
- Look at conditional plots and investigate differences
- Is there temporal or spatial structure?
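One possible starting point for the plots listed above, sketched on the iris data. The choice of features, bin count, and figure size are arbitrary assumptions:

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
df = iris.frame
features = iris.feature_names

# Univariate: one histogram per feature
fig_uni, axes = plt.subplots(1, len(features), figsize=(14, 3))
for ax_hist, name in zip(axes, features):
    ax_hist.hist(df[name], bins=20)
    ax_hist.set_title(name)
fig_uni.tight_layout()

# Conditional bivariate plot: two features, one colour per class
fig_bi, ax = plt.subplots()
for label, group in df.groupby("target"):
    ax.scatter(
        group[features[0]],
        group[features[2]],
        label=iris.target_names[label],
        alpha=0.7,
    )
ax.set_xlabel(features[0])
ax.set_ylabel(features[2])
ax.legend()
```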

In [None]:
import matplotlib.pyplot as plt

## What is the outcome of interest?

Is it a continuous regression problem?
- Do we care what the value is? Or only whether it is large?
- How does the variable behave? Heavily skewed? Are there many big values we care about?

Is it a classification problem?
- How many classes are there?
- What are the default class proportions? What is the base rate? Is there severe imbalance?
    - Do we want a lower overall misclassification rate, or just enriched subsets of the data?
    - Are your costs asymmetric enough that you should be thinking about different cutoffs? (Sensitivity, specificity, etc.)
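To make the base rate and cutoff questions concrete, here is a rough sketch on synthetic imbalanced data. The data, the logistic regression model, the cutoffs, and the helper `sensitivity_specificity` are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data, roughly 90% / 10% (illustration only)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=514
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=514
)

base_rate = y_train.mean()  # proportion of the rare (positive) class
print(f"base rate: {base_rate:.3f}")

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob_pos = model.predict_proba(X_test)[:, 1]


def sensitivity_specificity(y_true, prob, cutoff):
    """Predict positive when prob >= cutoff; return (sensitivity, specificity)."""
    pred = (prob >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)


for cutoff in (0.5, 0.3, 0.1):
    sens, spec = sensitivity_specificity(y_test, prob_pos, cutoff)
    print(f"cutoff {cutoff:.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the cutoff can only keep or raise sensitivity, at the price of specificity. Which trade is acceptable depends on the relative costs of the two kinds of error.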

## Constraints on the classifier

- Is training time severely limited? (Pretty rare)
- Is evaluation/prediction time severely limited? (Very common)
- Limits to types of variables to use
- Does the model need to be interpretable?

## Outliers

Worry about points abnormal enough that they unduly influence the model. These may be:

- Mistakes
- Observations that are not coming from the process we want to model

Outliers can be identified in plots, or by a variety of rules (large Mahalanobis distance, high leverage, etc.). Sometimes they are fixable; other times they should just be treated as missing.
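As a sketch of the Mahalanobis rule, the squared distance of each row from the sample mean can be computed directly with NumPy. The 97.5th percentile threshold used here is an assumed rule of thumb, not a recommendation from the course:

```python
import numpy as np
from sklearn import datasets

X = datasets.load_iris().data

# Squared Mahalanobis distance of every row from the sample mean
center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Assumed rule of thumb: flag rows beyond the 97.5th percentile of the distances
threshold = np.quantile(d2, 0.975)
flagged = np.where(d2 > threshold)[0]
print("rows to inspect:", flagged)
```

Flagged rows are candidates for inspection, not automatic deletions.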

## Missing data

- Remove whole observations. If the missingness is not informative, this only costs data. If it is informative, dropping those observations biases the sample.
- Try to fill in the missing values: imputation

    1). Use a strongly correlated variable to predict the missing values
    
    2). Nearest neighbors: find the k closest other points and average their value for this entry
    
- Can code missing value as another value (sometimes)
- Use a method that is robust to missing values
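The first two options above can be sketched as follows. Missing entries are knocked out at random purely for illustration, and `KNNImputer` with `n_neighbors=5` is one possible choice for the nearest-neighbor approach:

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import KNNImputer

X = datasets.load_iris().data.copy()

# Knock out about 10% of the entries at random to simulate missingness
rng = np.random.default_rng(514)
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Option 1: drop every observation with at least one missing value
complete_rows = ~np.isnan(X_missing).any(axis=1)
X_dropped = X_missing[complete_rows]

# Option 2: fill each missing entry with the average over the k nearest neighbors
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X_missing)

print("rows kept after dropping:", X_dropped.shape[0], "of", X.shape[0])
print("mean absolute imputation error:", np.abs(X_imputed[mask] - X[mask]).mean())
```

In a real project the imputer should be fit on the training data only, like any other estimated step.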

## Correcting for badly imbalanced data

Suppose the data is 90% class 1 and 10% class 2

- Downsampling: Sample one (or a few) elements of class 1 for each element of class 2 to make the classes more balanced
- Upsampling: Duplicate elements of class 2 to make the classes more balanced
- Artificially change prior weights, class weights, or case weights
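A minimal sketch of downsampling and upsampling with `sklearn.utils.resample`, on synthetic data with the 90/10 split described above. The array names are illustrative:

```python
import numpy as np
from sklearn.utils import resample

# Synthetic data: 900 rows of class 1 and 100 rows of class 2 (illustration only)
rng = np.random.default_rng(514)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 900 + [2] * 100)

X_major, y_major = X[y == 1], y[y == 1]
X_minor, y_minor = X[y == 2], y[y == 2]

# Downsample: keep as many class 1 rows as there are class 2 rows, without replacement
X_major_down, y_major_down = resample(
    X_major, y_major, replace=False, n_samples=len(y_minor), random_state=514
)
X_down = np.vstack([X_major_down, X_minor])
y_down = np.concatenate([y_major_down, y_minor])

# Upsample: draw class 2 rows with replacement until they match class 1
X_minor_up, y_minor_up = resample(
    X_minor, y_minor, replace=True, n_samples=len(y_major), random_state=514
)
X_up = np.vstack([X_major, X_minor_up])
y_up = np.concatenate([y_major, y_minor_up])

print("downsampled class counts:", np.bincount(y_down)[1:])
print("upsampled class counts:", np.bincount(y_up)[1:])
```

Resampling should be applied to the training set only, so that the test set keeps the true class proportions. For the third option, many scikit-learn classifiers accept a `class_weight` argument (for example `class_weight="balanced"`).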

In [None]:
from sklearn.utils import resample

# downsample


## Start learning

### Validation: Plan workflow from the start

Validation plan:

1). Split the data into a training set and test set [(basic train test split)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) [(stratified splitting)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)

2). Do all model selection and experimentation on the training set. Use [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html) or out-of-bag errors, on this set only, to pick a model

3). Fit the chosen model on the whole training set

4). At the very end, use your testing set to get an accurate picture of how well the model actually performs

5). For production, fit the model on the whole dataset
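The five steps can be sketched end to end as below. The two candidate models and their settings are placeholders for whatever is being compared, and accuracy (the default score) is an assumed metric:

```python
from sklearn import datasets
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = datasets.load_iris(return_X_y=True)

# 1) hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=514
)

# 2) compare candidates with cross-validation on the training set only
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=514),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# 3) refit the chosen model on the whole training set
best_model = clone(candidates[best_name]).fit(X_train, y_train)

# 4) a single, final look at the test set
test_accuracy = best_model.score(X_test, y_test)
print(best_name, cv_scores, test_accuracy)

# 5) for production, refit the chosen model on all of the data
production_model = clone(candidates[best_name]).fit(X, y)
```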

### Warnings about cross-validation

1). Can have a high computational cost, especially as the number of parameters/models grows

2). When the number of models is very large, it can give misleading results. It is not immune from the usual problems with multiple testing

3). The folds need to include the entire estimation pipeline. A frequent mistake is to carry out part of the model selection before breaking the data into folds, which invalidates the result
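The third warning can be illustrated on pure noise, where no model should beat chance. In this sketch (synthetic data, with an arbitrary choice of 20 selected features), selecting features on the full data before cross-validating typically yields an optimistic score, while putting the selection inside a `Pipeline` re-estimates it within every fold:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: the labels are unrelated to the 2000 features
rng = np.random.default_rng(514)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# Mistake: pick the "best" features using all of the data, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Correct: feature selection is part of the pipeline, so it is redone in each fold
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_score = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"selection outside the folds: {leaky_score:.2f}")
print(f"selection inside the folds:  {honest_score:.2f}")
```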

In [None]:
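from sklearn.model_selection import KFold, StratifiedShuffleSplit, train_test_split

# X: feature matrix, y: outcome vector (here taken from the iris data loaded above)
X, y = iris.data, iris.target

# basic train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=514)

# or, if there are imbalanced classes, might want to use a stratified split
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.33, random_state=514)
sss.get_n_splits(X, y)  # number of stratified re-shuffles; sss.split(X, y) yields the indices

# cross-validation folds, built on the training set only
kf = KFold(n_splits=5)
kf.get_n_splits(X_train)  # number of folds; kf.split(X_train) yields the indices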

### Features: Extract reasonable first set of features to use for prediction

- With modern model selection algorithms (Lasso, Random Forest), having extra features does not hurt much
- Make sure to only use information that will be in new samples
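A small sketch of the first point, using synthetic data with 3 useful and 30 useless features. The coefficients, noise level, and `LassoCV` settings are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(514)
n = 300
X_useful = rng.normal(size=(n, 3))
X_noise = rng.normal(size=(n, 30))  # extra features unrelated to the outcome
y = X_useful @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=0.5, size=n)

X = np.hstack([X_useful, X_noise])

# Lasso with the penalty strength chosen by 5-fold cross-validation
lasso = LassoCV(cv=5, random_state=514).fit(X, y)

print("useful coefficients:", np.round(lasso.coef_[:3], 2))
print("largest noise coefficient:", np.abs(lasso.coef_[3:]).max().round(3))
print("noise coefficients set exactly to zero:", int((lasso.coef_[3:] == 0).sum()), "of 30")
```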

### What to use: Start with something simple: interpretable, flexible


| Model                           | Robust to outliers | Robust to useless variables | Easy to use |
|---------------------------------|--------------------|-----------------------------|-------------|
| Linear/logistic regression      | &nbsp;             | &nbsp;                      | &nbsp;      |
| Ridge regression/logistic ridge | &nbsp;             | &nbsp;                      | &nbsp;      |
| Lasso regression/logistic lasso | &nbsp;             | X                           | &nbsp;      |
| Splines/additive models         | &nbsp;             | &nbsp;                      | &nbsp;      |
| K-nearest neighbors             | X                  | &nbsp;                      | &nbsp;      |
| Trees                           | X                  | X                           | &nbsp;      |
| Random Forests                  | X                  | X                           | X           |
| Boosted trees                   | X                  | X                           | &nbsp;      |
| SVM                             | X                  | &nbsp;                      | &nbsp;      |
| LDA                             | &nbsp;             | &nbsp;                      | &nbsp;      |

## Bias-Variance Tradeoff

- If the training error is much smaller than the test (CV) error, the model is overfit: bias is low but variance is high, so accept some more bias to lower the variance
- If the training error is about the same as the test (CV) error, the model is not yet overfit: variance is low, so there is room to lower the bias at the cost of some variance
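One way to see the two regimes is to compare training and CV error as a model becomes more flexible. This sketch uses decision trees of increasing depth on synthetic data; the data and the depth grid are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (illustration only)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=514
)

errors = {}
for depth in (1, 3, 5, 10, None):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=514)
    result = cross_validate(tree, X, y, cv=5, return_train_score=True)
    train_error = 1 - result["train_score"].mean()
    cv_error = 1 - result["test_score"].mean()
    errors[depth] = (train_error, cv_error)
    print(f"max_depth={depth}: train error {train_error:.3f}, CV error {cv_error:.3f}")
```

A gap that widens as the depth grows is the overfit regime; similar training and CV errors at small depths suggest there is still room for a more flexible model.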

## Ways to change the model

- Get more data: same bias, decreases variance
- Make more/better features: increases flexibility, increases variance (maybe)
- Use a more flexible model: decreases bias, usually increases variance
- Use a more regularized model: decreases variance, usually increases bias
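Two of these levers can be checked empirically with `learning_curve` (more data) and `validation_curve` (more or less regularization). The sketch below assumes synthetic data and a logistic regression whose `C` parameter controls the penalty (smaller `C` means stronger regularization):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, validation_curve

X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=5, random_state=514
)
model = LogisticRegression(max_iter=2000)

# More data: training and CV accuracy as the training set grows
sizes, size_train, size_cv = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)
for n, train_acc, cv_acc in zip(sizes, size_train.mean(axis=1), size_cv.mean(axis=1)):
    print(f"n={n}: train accuracy {train_acc:.3f}, CV accuracy {cv_acc:.3f}")

# More regularization: training and CV accuracy across penalty strengths
C_values = np.logspace(-3, 2, 6)
reg_train, reg_cv = validation_curve(
    model, X, y, param_name="C", param_range=C_values, cv=5
)
for C, train_acc, cv_acc in zip(C_values, reg_train.mean(axis=1), reg_cv.mean(axis=1)):
    print(f"C={C:g}: train accuracy {train_acc:.3f}, CV accuracy {cv_acc:.3f}")
```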

## Ensemble learning

Build _M_ different classifiers, combine guesses by:

- Voting, weighted voting
- Averaging
- Stacking: using guesses as input to another classifier
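A rough sketch of the three ways of combining guesses with scikit-learn's `VotingClassifier` and `StackingClassifier`. The base classifiers, the weights, and the final estimator are arbitrary choices for illustration:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# M = 3 different base classifiers
base = [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=514)),
]

ensembles = {
    # voting: each classifier casts one vote for a class
    "hard vote": VotingClassifier(base, voting="hard"),
    # averaging: predicted class probabilities are averaged, here with weights
    "weighted soft vote": VotingClassifier(base, voting="soft", weights=[1, 1, 2]),
    # stacking: out-of-fold guesses become the inputs of a final classifier
    "stacking": StackingClassifier(
        base, final_estimator=LogisticRegression(max_iter=1000), cv=5
    ),
}

ensemble_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in ensembles.items()
}
for name, score in ensemble_scores.items():
    print(f"{name}: CV accuracy {score:.3f}")
```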