
## Model Training, Tuning, and Debugging

### Supervised Learning: Neural Networks

* Simplest NN is a perceptron - single layer
* Bias acts like an intercept: take a linear combination of the features, add the bias, and pass the result through an activation function (e.g., sigmoid)
* Layers of nodes
* Each node is one multivariate linear function followed by a univariate nonlinear transformation
* Trained via (stochastic) gradient descent
* Can approximate any continuous non-linear function, given enough hidden units (very expressive)
* Generally hard to interpret
* Expensive to train, fast to predict
* scikit-learn: `sklearn.neural_network.MLPClassifier`
* Deep learning frameworks:
    * MXNet
    * TensorFlow
    * Caffe
    * PyTorch
* Convolutional NNs - input is an image or a sequence of images
* Use filters to create the next layer
* Pooling layer - reduce the size w/ max or average pooling
* Reduce the size of the data to aid convergence
* Turn into fully connected layer(s) at end
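
The single-node computation described in the bullets above (linear combination of features, plus a bias, passed through a sigmoid) can be sketched as follows. This is a minimal illustration with NumPy, not a training procedure; the function names `sigmoid` and `node_output` and the input values are hypothetical and chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    """Univariate nonlinear activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def node_output(x, weights, bias):
    """One node: linear combination of the features plus a bias, then the activation."""
    return sigmoid(np.dot(weights, x) + bias)

# Hypothetical example values, chosen only for illustration.
x = np.array([1.0, 2.0])
weights = np.array([0.5, -0.25])
bias = 0.0

# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
output = node_output(x, weights, bias)
```

A full network stacks many such nodes into layers; frameworks like those listed above handle the gradient-descent training.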

#### Recurrent Neural Network (RNN)

* Works well with sequence or time-series features


### Supervised Learning: K-Nearest Neighbors 

* Define a distance metric
    * Euclidean
    * Manhattan
    * Any vector norm can be used as a measure of distance
* Choose the number of K neighbors
* Find the K nearest neighbors of the new observation that we want to classify
* Assign class label by majority vote
* Important to find the right K
* Commonly use $ K = \frac{\sqrt{n}}{2} $ where n = number of samples
    * K depends on your data
* Smaller K = more local behavior, larger K = more global behavior
* Non-parametric, instance-based, lazy
    * Non parametric - model is not defined by fixed set of parameters
    * Instance-based or lazy learning - Model is the result of effectively memorizing training data
* Requires keeping the original data set - can be very expensive
* Space complexity and prediction-time complexity grow with size of training data
* Suffers from curse of dimensionality - points become increasingly isolated with more dimensions, for a fixed-size training dataset
* scikit-learn: `sklearn.neighbors.KNeighborsClassifier`
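
The steps above (pick a distance metric, find the K nearest neighbors, take a majority vote) can be sketched from scratch as below. This is an illustrative sketch assuming Euclidean distance and K = 3; the helper `knn_predict` and the toy data are hypothetical. In practice `sklearn.neighbors.KNeighborsClassifier` would normally be used instead.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from the new observation to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
```

Note that the whole training set is kept around and scanned at prediction time, which is the cost the bullets above describe.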

### Supervised Learning: Linear and Non-Linear Support Vector Machines

![SVM](images/svm.png)

#### Linear SVM

* What really matters is the points that lie on the margins, or the support vectors
* Optimal hyperplane separates two classes
* Very popular in research
* Simplest case: maximize the margin - the distance between the decision boundary (hyperplane) and the support vectors (training examples closest to boundary)
* Max margin picture not applicable in non-separable case
* scikit-learn: `sklearn.svm.SVC`

#### Non-linear SVM

* Also popular approach in research
* "Kernelize" for nonlinear problems:
    * Choose a similarity function called a "kernel"
    * Map the learning task to a higher-dimension space
    * Apply a linear SVM classifier in the new space
* Not memory-efficient, because it stores the support vectors, which grow with the size of the training data
* Computation is expensive
* scikit-learn: `sklearn.svm.SVC`
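
To illustrate why kernelizing helps, here is a small sketch comparing a linear kernel with an RBF kernel on data that is not linearly separable. The dataset (`make_circles`) and its settings are assumptions picked for the illustration, and accuracy is measured on the training data only to keep the example short.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM: a single hyperplane in the original feature space
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# Kernelized SVM: the RBF kernel implicitly maps to a higher-dimensional space
rbf_clf = SVC(kernel="rbf").fit(X, y)
rbf_acc = rbf_clf.score(X, y)

# The fitted model stores its support vectors, which is where the memory cost comes from
n_support = rbf_clf.support_vectors_.shape[0]
```

Comparing `linear_acc` with `rbf_acc` shows the gap between the two kernels on this kind of data, and `n_support` gives a sense of how many training points the model has to keep.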

### Supervised Learning: Decision Trees and Random Forests

#### Decision Trees

* Algorithm decides what to use as splits at what layer
* Entropy - relative measure of disorder in the data source

$$ H(X) = - \sum_{i=1}^N P(x_i)\log_2 P(x_i) $$

* Try to get data "pure" - each leaf has only 0's or 1's
* Entropy is low when all classes in a node are the same
* Nodes are split based on the feature that has the largest information gain (IG) between parent node and its split nodes
* One metric to quantify IG is to compare entropy before and after splitting
* In a binary case, entropy is 0 if all samples belong to the same class for a node (i.e., pure)
* Entropy is 1 if samples contain both classes with equal proportion (i.e., 50% for each class, chaos)
* The splitting procedure can go iteratively at each child node until the end-nodes (or leaves) are pure (i.e., there is only one class in each node)
    * But the splitting procedure usually stops at certain criteria to prevent overfitting
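
The entropy and information-gain calculations described above can be sketched as below. This assumes base-2 logarithms (so a 50/50 binary node has entropy 1) and defines IG as parent entropy minus the size-weighted entropy of the child nodes; the function names and toy labels are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Hypothetical binary node with equal class proportions
parent = [0, 0, 1, 1]

mixed_entropy = entropy(parent)        # both classes at 50%
pure_entropy = entropy([1, 1, 1])      # a single class

# A split that separates the classes perfectly vs. one that separates nothing
perfect_gain = information_gain(parent, [0, 0], [1, 1])
useless_gain = information_gain(parent, [0, 1], [0, 1])
```

A tree builder would evaluate `information_gain` for each candidate split and keep the one with the largest value.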

* In summary
    * Train / build the tree by maximizing IG to choose splits (i.e., the impurity of the split sets is lower)
    * Easy to interpret (superficially)
    * Expressive = flexible
    * Less need for feature transformations
    * Susceptible to overfitting
    * Must "prune" the tree to reduce potential overfitting
    * scikit-learn: `sklearn.tree.DecisionTreeClassifier`

#### Random Forest

* Ensemble methods - learn multiple models and combine results, usually via majority vote or averaging
* Set of decision trees, each learned from a different randomly sampled subset with replacement
* For each tree, the features to split on are a randomly selected subset of the original features
* Prediction: average output probabilities
* Increases diversity through random selection of training dataset and subset of features for each tree
* Reduces variance through averaging
* Each tree typically does not need to be pruned
* More expensive to train and run
* scikit-learn: `sklearn.ensemble.RandomForestClassifier`
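
As a sketch of the "average output probabilities" step, the snippet below fits a small forest and checks that its predicted probabilities match the mean of the individual trees' probabilities. The synthetic dataset and the hyperparameter values (`n_estimators=25`, `max_features="sqrt"`) are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# bootstrap=True: each tree sees a sample drawn with replacement
# max_features="sqrt": a random subset of features is considered when splitting
forest = RandomForestClassifier(
    n_estimators=25, max_features="sqrt", bootstrap=True, random_state=0
)
forest.fit(X, y)

# Average the per-tree class probabilities by hand
tree_probs = np.array([tree.predict_proba(X) for tree in forest.estimators_])
manual_avg = tree_probs.mean(axis=0)

# The forest's own prediction
forest_probs = forest.predict_proba(X)
```

Comparing `manual_avg` with `forest_probs` (for example with `np.allclose`) confirms that the ensemble prediction is the average over the trees.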

### Unsupervised Learning

#### K-means clustering

* Iteratively separates data into K clusters, minimizing the sum of squared distances to the center of the closest cluster
    * Step 1 - assign each instance to closest cluster
    * Step 2 - recompute each center from assigned instances
* Guaranteed to converge to local optimum
* Suffers from curse of dimensionality
* scikit-learn: `sklearn.cluster.KMeans`
* User must determine or provide number of clusters (K)
* Error is determined using sum of squared errors (SSE)

$$ SSE = \sum_{j=1}^n\sum_{i=1}^m\left\lVert x_i - c_j \right\rVert^2_2 $$

* Where $ n $ is the number of clusters, $ c_j $ is the $ j^{th} $ cluster centroid, $ m $ is the number of samples belonging to the $ j^{th} $ cluster, and $ x_i $ is the $ i^{th} $ of those samples
* Calculate this once cluster structure has been stabilized
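
The SSE formula above can be checked against scikit-learn, which exposes the same quantity as the `inertia_` attribute of a fitted `KMeans`. The sketch below computes SSE by hand, summing squared distances from each sample to its own cluster centroid; the blob data and its centers are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated blobs
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Outer sum over clusters j, inner sum over the samples assigned to cluster j
sse = 0.0
for j, c_j in enumerate(km.cluster_centers_):
    members = X[km.labels_ == j]
    sse += np.sum(np.linalg.norm(members - c_j, axis=1) ** 2)
```

Up to floating-point error, `sse` should agree with `km.inertia_`.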

* Elbow method - use the elbow point as a starting point to determine how many clusters you should use
    * More clusters implies smaller within-cluster SSE
    * The decline of SSE (y-axis) slows down after the optimum number of clusters (x-axis) (i.e. the elbow point)

![elbow](images/elbow.png)

* Remember that if each cluster reaches size = 1, SSE will be zero but it's pretty useless
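
A minimal sketch of the elbow method: fit K-means for a range of K values and record the within-cluster SSE for each. The data (three hypothetical blobs) and the range of K are assumptions for the illustration; plotting `sse_by_k` against `ks` would give a curve like the one pictured above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with three true clusters
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0
)

ks = list(range(1, 7))
# inertia_ is the within-cluster sum of squared errors for the fitted model
sse_by_k = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]

# Improvement up to the true cluster count vs. improvement beyond it
drop_before_elbow = sse_by_k[0] - sse_by_k[2]   # K=1 -> K=3
drop_after_elbow = sse_by_k[2] - sse_by_k[5]    # K=3 -> K=6
```

On data like this, `drop_before_elbow` dwarfs `drop_after_elbow`, which is what makes K = 3 the elbow point.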

#### Hierarchical clustering

##### Agglomerative or "Bottom-up"
* Bottom-up approach
* Each data point begins as its own cluster

##### Divisive
* Top-down approach
* Start with all points in a single cluster

* Nested clusters with hierarchy
* User doesn't need to provide number of clusters but needs to find a place to cut the dendrogram

![dendrogram](images/dendrogram.png)
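
As a sketch of the agglomerative approach and of "cutting the dendrogram", the snippet below uses SciPy (an assumption; these notes do not name a library for hierarchical clustering). `linkage` builds the bottom-up merge history and `fcluster` cuts it into a chosen number of flat clusters. The data and the choice of Ward linkage are hypothetical.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated blobs
X, _ = make_blobs(
    n_samples=60, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0
)

# Bottom-up: every point starts as its own cluster, then the closest pairs are merged.
# Z has one row per merge, so its shape is (n_samples - 1, 4).
Z = linkage(X, method="ward")

# Cut the hierarchy so that at most 3 flat clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
n_clusters_found = len(set(labels.tolist()))
```

`scipy.cluster.hierarchy.dendrogram(Z)` can draw the tree when matplotlib is available, which helps when choosing where to cut.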

### Model Training: Validation Set

* Model training - improve model by optimizing parameters or data
* Model tuning - tweak hyperparameters, looking for overfitting or underfitting
* Motivation - model training and tuning involve comparing performance for different model or data settings
* Problem - when you use the test set for these comparisons, that effectively makes it part of a training set (model may learn the patterns from the test set during training)
* Solution - split training data into two parts - training and validation set
    * Use training set to train candidate models, etc.
    * The validation set plays the role of the test set during debugging and tuning
    * Save the test set for measuring generalization of your final model
* Validation set
    * Issue - Splitting the training data into training and validation sets may make it too small or unrepresentative
    * Solution - Use the holdout method to get the test set, then use k-fold cross validation on the training set for debugging and tuning
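
The workflow above (hold out a test set, tune with k-fold cross-validation on the training set, then score once on the test set) can be sketched as follows. The synthetic data, the logistic regression model, and the candidate `C` values are stand-ins chosen for the illustration, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical synthetic data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Holdout method: set the test set aside before any tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Compare candidate settings with 5-fold cross-validation on the training set only
cv_means = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000)
    cv_means[C] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_C = max(cv_means, key=cv_means.get)

# Refit the chosen setting on all training data, then touch the test set exactly once
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
```

Because the test set plays no part in choosing `best_C`, `test_accuracy` remains an honest estimate of generalization.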
    

### Model Training: Bias Variance Tradeoff

![bias-variance tradeoff](images/biasvar1.png)

#### Using learning curves to evaluate the model

* Motivation - detect if model is under- or overfitting, and the impact of training data size on the error
* learning curves - plot training dataset and validation dataset error or accuracy against training set size
* scikit-learn: `sklearn.model_selection.learning_curve`
    * Uses stratified k-fold cross-validation by default if the estimator is a classifier and the output is binary or multiclass (preserves percentage of samples in each class)
    * Note: this function lived in `sklearn.learning_curve` before v0.18; the old module was later removed
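
A minimal sketch of computing learning-curve data with `sklearn.model_selection.learning_curve`. The synthetic data and the unpruned decision tree are assumptions chosen because such a tree tends to overfit, which makes the gap between training and validation scores easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Score the model at 5 training-set sizes, each with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# One row per training size, one column per fold; average across folds to plot
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
```

Plotting `train_mean` and `val_mean` against `train_sizes` gives the learning curve; a persistent gap between the two suggests overfitting (high variance), while two low, converged curves suggest underfitting (high bias).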

![bias-variance tradeoff](images/biasvar2.png)

#### Learning Curves

![learning curves](images/learningcurves.png)

### Model Debugging: Error Analysis

### Model Tuning: Regularization

### Model Tuning: Hyperparameter Tuning

### Model Tuning

### Model Tuning: Feature Extraction

### Model Tuning: Feature Selection

### Model Tuning: Bagging/Boosting

