In [2]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning) 

In [3]:
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import pandas as pd
import time
from scipy import stats
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn import metrics
%matplotlib inline

### Recap: basic operations in Jupyter notebook, types of cells, executing cells.

### Data import

Read the data into a data frame using pandas, take a look at it, check the size of the data set, and rename the columns to something easier to type.

In [20]:
dataset = pd.read_table('LAE_OII_Strata.txt')
dataset.shape

(5436, 5)

In [21]:
dataset.columns

Index(['type', 'wavelength of EL (angstroms)', 'EL flux (erg/cm^2/s)',
       'continuum flux density', 'EW observed'],
      dtype='object')

In [22]:
dataset.rename(columns={'wavelength of EL (angstroms)': 'wl', 'EW observed': 'ew',
                        'EL flux (erg/cm^2/s)': 'el', 'continuum flux density': 'cont_flux'},
                inplace=True)
dataset.columns

Index(['type', 'wl', 'el', 'cont_flux', 'ew'], dtype='object')

### Data exploration

Look at data properties divided by type to figure out some differences between LAEs and OIIs. Change settings to visualize all the columns in a data frame. Eliminate outliers.
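A minimal sketch of these steps is below. It runs on a small synthetic stand-in for the catalogue, so the numbers, the `'LAE'`/`'OII'` labels in `type`, and the 3-sigma outlier rule are all assumptions made for illustration, not the workshop's actual choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real catalogue (hypothetical values and labels).
rng = np.random.RandomState(0)
n = 200
demo = pd.DataFrame({
    'type': rng.choice(['LAE', 'OII'], size=n),
    'wl': rng.uniform(3500, 5500, size=n),
    'el': rng.lognormal(mean=-37, sigma=0.5, size=n),
    'cont_flux': rng.normal(0.0, 1.0, size=n),
    'ew': rng.lognormal(mean=3, sigma=1, size=n),
})

# Show every column when printing wide data frames.
pd.set_option('display.max_columns', None)

# Summary statistics split by object type.
summary = demo.groupby('type').describe()
print(summary['ew'])

# One simple outlier rule (an assumption, not the workshop's criterion):
# keep rows whose features all lie within 3 standard deviations of the mean.
features = ['wl', 'el', 'cont_flux', 'ew']
z = (demo[features] - demo[features].mean()) / demo[features].std()
clean = demo[(z.abs() < 3).all(axis=1)]
print(len(demo), '->', len(clean))
```

With the real data, `demo` would simply be replaced by `dataset`.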

### Transform pandas data frame into a numpy array that can be fed to sklearn methods.
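As a rough sketch, the conversion could look like the following. The four rows are made-up values, and encoding LAE as 1 and OII as 0 is an assumption; the real `type` column may already be numeric.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with the renamed columns used in this notebook.
demo = pd.DataFrame({
    'type': ['LAE', 'OII', 'OII', 'LAE'],
    'wl': [3600.0, 4100.0, 5200.0, 4700.0],
    'el': [1e-16, 3e-16, 2e-16, 5e-17],
    'cont_flux': [0.1, 1.2, 0.8, -0.05],
    'ew': [150.0, 12.0, 20.0, 90.0],
})

# Feature matrix of shape (n_samples, n_features).
X = demo[['wl', 'el', 'cont_flux', 'ew']].values

# Target vector: 1 = LAE, 0 = OII (assumed encoding).
y = (demo['type'] == 'LAE').astype(int).values

print(X.shape, y)
```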

### A quick way to see what variables are important

Create a simple linear model and use the coefficients of the different variables to track their importance. This is useful when there are many variables and we want to get rid of some of them, or just to build understanding of the model. Note that this does not properly tell us which variables are redundant.
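One possible way to do this is sketched below, using logistic regression as the linear model (an assumption; any linear classifier would do). The data come from `make_classification`, and the feature names are only borrowed from this notebook for readability. The features are standardized first, since otherwise the coefficients are not comparable across variables with different ranges.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 4 features, of which 1 is redundant by construction.
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=1, random_state=0)
names = ['wl', 'el', 'cont_flux', 'ew']  # labels borrowed for illustration only

# Put all features on the same scale so coefficients can be compared.
X_std = StandardScaler().fit_transform(X)
linear = LogisticRegression().fit(X_std, y)

# Rank features by the absolute value of their coefficient.
importance = np.abs(linear.coef_[0])
for i in np.argsort(importance)[::-1]:
    print(f"{names[i]:>10s}  {linear.coef_[0][i]:+.3f}")
```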

### Summary

### First steps with models: Decision Tree Classifier.

Decision trees are nice because they can be interpreted easily. For example, this is a decision tree showing how scientists might decide whether a newly found planet has a good chance to harbor life:

Figure from [here](http://www.machinelearningtutorial.net/2017/01/17/decisiontree/).

<br><div style="text-align: center ">  <b> IS ANYBODY OUT THERE?</b></div>

<img src="Strata_images/exoplanets.svg" width="500"/>

Decision trees work by deciding where to split the data set using values of different features, and where to stop.

Mathematically, a good decision tree is one that maximizes the information gain (i.e. the reduction in the impurity of the class labels, measured for example by entropy or the Gini index) at every "split".
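To make the idea of information gain concrete, here is a small worked example that uses entropy as the impurity measure. The helper names `entropy` and `information_gain` are hypothetical, written only for this illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly.
perfect = information_gain(parent, parent[:4], parent[4:])

# A split that leaves both children as mixed as the parent.
useless = information_gain(parent, parent[::2], parent[1::2])

print(perfect, useless)
```

The first split recovers the full 1 bit of uncertainty in the parent node, while the second gains nothing, so a tree would pick the first.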

<b> Pros </b> Easy to interpret, fast.

<b> Cons </b> Prone to overfitting.

#### Let's get coding!

-  Import model, fit using k-fold (k = 5) cross validation, establish benchmark performance.

-  Consider the metric and its potential pitfalls by comparing to a "dummy" estimator;

-  Calculate and plot the confusion matrix.
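The three steps above might look roughly like this. The data are a synthetic, imbalanced stand-in generated with `make_classification` (an assumption, chosen so that the dummy comparison is informative), and the confusion matrix is printed rather than plotted.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 80% of objects in class 0.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)
dummy = DummyClassifier(strategy='most_frequent')  # always predicts the majority class

# Benchmark accuracy for the tree and for the dummy estimator.
tree_acc = cross_val_score(tree, X, y, cv=cv, scoring='accuracy')
dummy_acc = cross_val_score(dummy, X, y, cv=cv, scoring='accuracy')
print(f"tree:  {tree_acc.mean():.3f} +/- {tree_acc.std():.3f}")
print(f"dummy: {dummy_acc.mean():.3f} +/- {dummy_acc.std():.3f}")

# Out-of-fold predictions give one confusion matrix for the whole data set.
y_pred = cross_val_predict(tree, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)
print(cm)
```

On imbalanced data the dummy estimator already reaches a high accuracy, which is why accuracy alone can be a misleading metric.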

### Another easy to interpret algorithm is KNN (K nearest neighbors). 

<div style="text-align: center ">  <b> LET'S FIND SOME NEIGHBORS!</b></div>
<table><tr>
<td> <img src="Strata_images/KNN_1.png" width="350"/> </td>
<td>  <img src="Strata_images/KNN_2.png" width="350"/> </td>
</tr></table>

#### Let's get coding!

Import model, fit using k-fold (k = 5) cross validation, establish benchmark performance, play with basic parameters.
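A short sketch of this exercise follows, again on synthetic stand-in data (an assumption). The only parameter varied here is `n_neighbors`; the three values tried are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 4 features.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# 5-fold cross-validated accuracy for a few values of k.
knn_scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn_scores[k] = cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()
    print(f"k = {k:2d}: accuracy = {knn_scores[k]:.3f}")
```

KNN relies on distances, so with real features that span very different ranges it is usually worth standardizing them first.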

### Summary and 10-minute break

## Part 2: Advanced Algorithms

### Support Vector Machines (classifier)

Support Vector Machines are a long-standing staple of machine learning. Parameter tuning is very important in SVMs, and it is both the curse and the blessing of this algorithm.

<b>Pros: </b> Accurate, Powerful

<b>Cons: </b>  SLOW, need standardization

It's usually a good idea to do parameter optimization on a (representative) selection of your data.

### SVMs in a nutshell

In a classification problem such as this one, SVMs attempt to find the ideal boundary to separate the two classes.

<img src="Strata_images/SVM_1.png" width="300"/>

This looks easy, but even in this simple case of completely separable variables, there are many possible choices, with different resulting boundaries.

<img src="Strata_images/SVM_2.png" width="300"/> 

SVM's strategy is to 1. Maximize the separation between classes, called the <b> margin </b> and 2. Use slack variables to attribute a "penalty" to misclassifications (soft margin).

<img src="Strata_images/SVM_3.png" width="300"/>

In the more general case of non-linearly-separable variables, SVMs attempt to map the original feature space (for us a 4D space) to a higher-dimensional space, where instances are more separable. The set of functions used for the mapping is called the <b>kernel</b>.
<br>
<br>
<img src="Strata_images/SVM_4.png" width="500"/>


The most important parameters of an SVM are:

- The type of kernel (linear, polynomial, or Gaussian, "rbf" in sklearn);

<img src="Strata_images/SVM_5.png" width="400"/>

- Gamma, the "wiggliness" of the boundary (small gammas = more linear);

<img src="Strata_images/SVM_6.png" width="300"/>

- C, the soft margin parameter (smaller C values assign a smaller penalty to misclassifications near the boundary, and generate a wider margin).

<img src="Strata_images/SVM_7.png" width="300"/>

All the figures in this section are from [here](https://www.ncbi.nlm.nih.gov/pubmed/20221922) and [here](https://www.cs.utexas.edu/~mooney/cs391L/slides/svm.ppt).

### Let's get coding! 

-  Import model;

-  Establish benchmark performance for 5 fold cross validation;

-  Visualize and briefly describe the parameters.
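A possible version of these steps is sketched below on synthetic stand-in data (an assumption). Because SVMs need standardized features, the scaler and the classifier are chained in a pipeline, so the scaling is re-fit inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 4 features.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Standardize, then fit an SVM with default parameters.
svm = make_pipeline(StandardScaler(), SVC())

# Benchmark: 5-fold cross-validated accuracy.
svm_acc = cross_val_score(svm, X, y, cv=5, scoring='accuracy')
print(f"accuracy: {svm_acc.mean():.3f} +/- {svm_acc.std():.3f}")

# Inspect the parameters discussed above; pipeline steps are prefixed with 'svc__'.
params = svm.get_params()
for name in ('svc__kernel', 'svc__C', 'svc__gamma'):
    print(name, '=', params[name])
```

The default value printed for `gamma` depends on the installed scikit-learn version.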

<b> TASKS (10-15 mins) </b>

-  Play with different parameters, such as type of kernel (for time scaling reasons, use only poly and rbf), soft margin C, and gamma, to see if you can beat the benchmark performance above. Tip 1: Trying 2-3 values per parameter will be sufficient for now, especially if your machine is taking long. Tip 2: Use low values of gamma (< 1.0) to reduce fitting time.

-  Now do the same thing, but using precision as your scoring method.

#### Coding Solution.

Introduce Grid Search CV (params, cv, scoring, verbose, n_jobs) as a method to optimize various parameters simultaneously; use timings to get an idea of the speed of various methods.
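One way this could look is sketched here. The data are synthetic and the grid values are arbitrary small choices meant to keep the run short, not recommended settings. The search is repeated with accuracy and with precision to show that the "best" parameters can change with the metric.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in data set to keep fitting times low.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# Every combination of these values is tried (2 x 3 x 2 = 12 candidates).
param_grid = {
    'svc__kernel': ['poly', 'rbf'],
    'svc__C': [0.1, 1.0, 10.0],
    'svc__gamma': [0.01, 0.1],
}

results = {}
for scoring in ('accuracy', 'precision'):
    search = GridSearchCV(pipe, param_grid, cv=5, scoring=scoring,
                          verbose=0, n_jobs=1)
    start = time.time()
    search.fit(X, y)
    elapsed = time.time() - start
    results[scoring] = (search.best_params_, search.best_score_)
    print(f"{scoring}: best score {search.best_score_:.3f} "
          f"with {search.best_params_} ({elapsed:.1f} s)")
```

Setting `n_jobs=-1` uses all available cores, and a higher `verbose` value prints progress while the grid is being searched.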

### SVM Summary

-  After parameter optimization, SVM's performance improves a bit over the baseline.

-  We learned how to optimize parameters with Grid Search.

-  The best model depends A LOT on the metric.

### Ensemble methods: 1. Random Forest Classifiers

Random Forest Classifiers are combinations of decision trees. The "random" part refers to the fact that different trees in the forest are created using random splits of the data, and random subsets of the features. This randomization process makes the algorithm more robust against overfitting, compared to single trees.

<b> Pros: </b> Fast (parallel), robust, insensitive to data range.

<b> Cons: </b> Fast but not fastest when compared to other ensemble methods.

Random Forests have many adjustable parameters. One can tune the parameters of each tree, and the way they are combined.

<img src="Strata_images/DT1.png" width="700"/>

#### Tree Parameters

The figure above shows an example of a possible split. The parameters associated with it are:

-  The minimum number of instances in a leaf node;

-  The minimum number of instances required in a split node;

- The maximum depth of tree.

They all reduce overfitting by preventing each tree from growing "too deep". Because their effects overlap, it usually makes sense to tune only two of the three.

Additional parameters are:

-  The criterion chosen to decide whether a split is "worth it", expressed in terms of information gain;

-  The number of features that are used in building trees.

#### Forest Parameters

In Random Forests, the predictions generated by all the trees are simply averaged to produce the final results. The number of trees in the forest can be adjusted, with the general understanding that more trees are better, but at some point performance will plateau, so one can find the trade-off between having more trees and lower runtime.

### Let's get coding! 

-  Import model;

-  Establish benchmark performance for 5 fold cross validation.

<b> TASKS (10-15 minutes) </b> 

-  Use the get_params() method to find out the names and signatures of different parameters, and their default values.

-  Play with different values of the number of trees (estimators, using values between 5 and 50), the maximum depth of the tree (usually around 3-8), the minimum number of instances in a split (2-10), and the maximum number of features (you can decide this one!) allowed in building individual trees, to see if you can beat the benchmark performance above.

-  Now do the same thing, but using recall as your scoring method.
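A sketch of one way to approach these tasks is given below, using synthetic stand-in data and a deliberately small grid (both are assumptions). The grid is searched with recall as the scoring method; swapping in `scoring='accuracy'` covers the first version of the task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 4 features.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

rf = RandomForestClassifier(random_state=0)

# Names and default values of all adjustable parameters.
for name, value in sorted(rf.get_params().items()):
    print(f"{name} = {value}")

# A small grid over the parameters mentioned in the task.
param_grid = {
    'n_estimators': [5, 50],
    'max_depth': [3, 8],
    'min_samples_split': [2, 10],
    'max_features': [2, 4],
}

search = GridSearchCV(rf, param_grid, cv=5, scoring='recall', n_jobs=1)
search.fit(X, y)
print("best recall:", round(search.best_score_, 3))
print("best parameters:", search.best_params_)
```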

### RF Summary

### Ensemble methods 2: Gradient Boosting Models

Gradient Boosting models are another ensemble method where different decision trees are combined together.

Unlike Random Forests, the model is built by <b> adding individual trees in a sequential fashion, </b>
but choosing which trees we add to the model in a way that minimizes the current loss function. The "Gradient" part refers to the fact that we try to move along the gradient of the objective function (by calculating its numerical derivative) as we add more trees.
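The sequential idea can be illustrated with a hand-rolled toy version. This is a simplified sketch for a regression problem with squared loss, where the negative gradient is just the residual; it is not how sklearn's classifier is implemented, and the data and settings are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
losses = []

for _ in range(20):
    # For the loss 0.5 * (y - F)^2, the negative gradient with respect to
    # the current prediction F is simply the residual.
    residual = y - prediction
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add the new tree, scaled down by the learning rate.
    prediction += learning_rate * small_tree.predict(X)
    losses.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 1 tree:   {losses[0]:.4f}")
print(f"training MSE after 20 trees: {losses[-1]:.4f}")
```

Each tree corrects what the previous ones got wrong, which is why the trees cannot be built in parallel the way they are in a Random Forest.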

The parameters depend on the particular implementation.

In the sklearn formulation, the parameters of each tree are essentially the same ones we saw above; additionally we have the "learning_rate" parameter, which dictates how much each tree contributes to the final estimator, and the "subsample" parameter, which allows one to fit each tree on a fraction (< 1.0) of the samples.

I liked this blog post about parameter tuning for GBMs:

https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

#### We'll do the usual import and benchmarking:

<b> TASKS (10 minutes) </b>

-  Use the get_params() method to find out the names and signatures of different parameters, and their default values.

-  Play with different values of the number of trees (estimators: 5, 10, 20), max depth of tree (2-8), learning rate (0.1-0.5), and the maximum number of features allowed to see how much you can improve the benchmark performance above.

-  Compare the timings to Random Forests.
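A rough sketch of the benchmark and the timing comparison is shown below. The data are synthetic, and the parameter values are arbitrary picks from the ranges suggested above, so the timings only indicate how one might measure, not what to expect on the real data.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 4 features.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    'random forest': RandomForestClassifier(
        n_estimators=20, max_depth=4, random_state=0),
    'gradient boosting': GradientBoostingClassifier(
        n_estimators=20, max_depth=4, learning_rate=0.1,
        subsample=0.8, random_state=0),
}

# Time a 5-fold cross validation for each model.
timings = {}
for name, model in models.items():
    start = time.time()
    acc = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    timings[name] = time.time() - start
    print(f"{name}: accuracy = {acc:.3f}, time = {timings[name]:.2f} s")
```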

### A note about xgboost (vs sklearn's GBM)

Sometimes known as "regularized" GBM, more robust to overfitting.

Has more flexibility in defining weak learners, and objective function.

Reputation of being very fast.

From the same author as the one above:

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

### Summary so far

#### Subtleties in parameter optimization: See additional notebook

-  Use cv_results_ to look at how scores change across parameter values and build understanding;

-  Push the edges of your parameter grid search; 

-  Do nested cross validation to optimize parameters in order to avoid leakage between the parameter optimization and the cross validation procedure. 
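As a minimal sketch of the nested scheme, a grid search (inner loop) can be passed to `cross_val_score` (outer loop) as if it were a single estimator. The decision tree, the `max_depth` grid and the synthetic data are placeholder choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 4 features.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # picks parameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 4, 8]},
                      cv=inner_cv, scoring='accuracy')

# Each outer fold re-runs the whole grid search on its own training part,
# so the held-out part never influences the choice of parameters.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring='accuracy')
print(f"nested accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")

# cv_results_ shows how the score varies across the grid.
search.fit(X, y)
for depth, score in zip(search.cv_results_['param_max_depth'],
                        search.cv_results_['mean_test_score']):
    print(f"max_depth = {depth}: mean test score = {score:.3f}")
```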

### Tips for advanced optimization (know your data).

Flip data so less common class becomes the positive one and check performance (in particular, recall). Introduce the "class weight" parameter for unbalanced data sets where we are interested in the "uncommon" class; define and use ad-hoc metrics.

#### The class weight parameter

In SVMs, C, the soft margin parameter, can take different values according to class. <b> This is helpful for imbalanced data sets, where we are interested in the less common objects. </b> This parameter is available for other estimators too!

<img src="Strata_images/SVM_8.png" width="500"/>
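A quick way to see the effect of this parameter is sketched below on a synthetic data set in which the positive class is rare (about 10% of objects; an assumption). The weights tried are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced synthetic stand-in data: class 1 is the rare class of interest.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

# None: same C for both classes; 'balanced': weights inversely proportional
# to class frequencies; a dict sets the multiplier on C for each class by hand.
recalls = {}
for class_weight in (None, 'balanced', {0: 1, 1: 5}):
    model = make_pipeline(StandardScaler(), SVC(class_weight=class_weight))
    recall = cross_val_score(model, X, y, cv=5, scoring='recall').mean()
    recalls[str(class_weight)] = recall
    print(f"class_weight = {class_weight}: recall = {recall:.3f}")
```

Raising the weight of the rare class typically trades some precision for recall, so it is worth checking both.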

### My advice: Define your own evaluation metric 

This is an example of what we did for this paper (Leung, VA et al 2016), where x0 = 1 - precision and x1 = 1 - recall.

<img src="Strata_images/Formula_Leung.jpg" width="300"/>


#### How to do that in code?
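One way to plug a custom metric into sklearn is `make_scorer`, sketched below. The function `distance_from_ideal` is a hypothetical placeholder that combines x0 = 1 - precision and x1 = 1 - recall as a Euclidean distance; it is NOT the formula from the paper, which should be substituted in its place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def distance_from_ideal(y_true, y_pred):
    """Placeholder metric (not the paper's formula): 0 is a perfect classifier."""
    x0 = 1.0 - precision_score(y_true, y_pred)
    x1 = 1.0 - recall_score(y_true, y_pred)
    return np.sqrt(x0 ** 2 + x1 ** 2)

# Lower is better for this metric, so tell sklearn to flip the sign.
custom_scorer = make_scorer(distance_from_ideal, greater_is_better=False)

# Synthetic stand-in data and a simple model to score.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

# The scorer can be used anywhere a scoring string is accepted.
scores = cross_val_score(tree, X, y, cv=5, scoring=custom_scorer)
print(scores)
```

Because of `greater_is_better=False`, the reported scores are the negative of the metric, so values closer to zero are better. The same `custom_scorer` can be passed as the `scoring` argument of a grid search.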

### Summary

### Additional Content

-  Notebook 1: Nested Cross Validation (the proper way to optimize parameters), Grid Search best practices, Randomized Grid Search.

-  Notebook 2: Diagnostic Tools (Learning Curves, Bias/Variance tradeoff, Feature Importance)

In [83]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')


    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')