# CSE 255 Programming Assignment 7 - Data Analysis using XGBoost

------needs to be put in theory------ 

We are going to use boosted trees. The question is: why don't we use the gradient boosted in Spark?

THe reason is that, in our experience, XGBoost running on a single machine is much faster than Spark running on 10 machines. So we use XGBoost and am showing you how to use it here.

## Problem Statment

The code above computes the average ranges for each example.

Your Task is to generate a range for **individual examples**.

More specifically, you are to write a function that takes in as input:

1. **Training set**
1. **Test set**
1. **n_bootstrap** Number of bootstrap samples on which you will run XGBoost.
1. **minR, maxR** two numbers such that $0 < minR < maxR < 1$ that define the fractions of the `n_bootstrap` scores that define the range.

The output should be a confidence interval for each example in the test set. Together with a prediction that is `Gervais / Cuviers / Unsure`. THe prediction `unsure` is to be output if the confidence interval contains the point 0.

**Note:** The output from your program depends on the random bootstrap samples, and is therefor not deterministic. Don't try to get the test cases exactly correct. Instead, you should aim for the endpoints that you calculate to be close to the endpoints given in the test's assert command.

## Theory

## Notebook Setup

### Importing Required Libraries

In [1]:
%matplotlib inline
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from lib import XGBHelper as xgbh

### Loading Data

The data files were preprocessed on PySpark (10 nodes) cluster. The code for the same can be found [here](Data_Processing_Whales.ipynb). The preprocessed is a numpy array with `4175` rows (for the 10mb file) with following columns (zero-indexed):
* Col 0-9: projections on first 10 eigen vectors
* Col 10: rmse
* Col 11: peak2peak
* Col 12: label (`0 if row.species==u'Gervais' else 1`)

In [2]:
#Use Data/processed_data_150mb.np for a slightly bigger file
data  = np.load("Data/processed_data_15mb.np")
X = data[:, :-1]
y = np.array(data[:, -1], dtype=int)

## Setting Parameters for XG Boost
* Maximum Depth of the Tree = 3 _(maximum depth of each decision trees)_
* Step size shrinkage used in update to prevents overfitting = 0.3 _(how to weigh trees in subsequent iterations)_
* Evaluation Criterion= Maximize Loglikelihood according to the logistic regression _(logitboost)_
* Maximum Number of Iterations = 1000 _(total number trees for boosting)_
* Early Stop if score on Validation does not improve for 5 iterations

[Full description of options](https://xgboost.readthedocs.io/en/latest//parameter.html)

In [6]:
def xgboost_setup():
    param = {}
    param['max_depth']= 3   # depth of tree
    param['eta'] = 0.3      # shrinkage parameter
    param['silent'] = 0     # not silent
    param['objective'] = 'binary:logistic'
    param['nthread'] = 7 # Number of threads used
    param['eval_metric'] = 'logloss'

    plst = param.items()
    return plst

In [27]:
def calc_stats(margin_scores):
    score_mean = np.mean(margin_scores, axis=1)
    score_std = np.std(margin_scores, axis=1)
    min_score = score_mean + 2*score_std
    max_score = score_mean - 2*score_std
    return min_score, max_score

In [32]:
def bootstrap_pred(X_train, X_test, y_train, y_test, n_bootstrap, minR, maxR):
    score_list = []
    bootstrap_size = 500
    plst = xgboost_setup()
    for i in range(n_bootstrap):
        samp_indices = np.random.randint(len(X_train), size=bootstrap_size)
        
        X_samp = X_train[samp_indices]
        y_samp = np.array(y_train[samp_indices], dtype=int)
        
        num_round = 100
        
        dtrain = xgb.DMatrix(X_samp, label=y_samp)
        dtest = xgb.DMatrix(X_test, label=y_test)
        
        evallist = [(dtrain, 'train'), (dtest, 'eval')]
        
        bst = xgb.train(plst, dtrain, num_round, evallist, verbose_eval=False)
        
        y_pred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit, output_margin=True)
        
        scores = sorted(np.round(y_pred, 2))
        
        normalizing_factor = (max(scores) - min(scores))
        scores = scores/normalizing_factor
        score_list.append(scores)
        
    margin_scores = np.sort(np.array(score_list), axis=0).T
    min_ind = int(minR * n_bootstrap)
    max_ind = int(maxR * n_bootstrap)
    margin_scores = margin_scores[:, min_ind:max_ind]
    
    min_scr, max_scr = calc_stats(margin_scores)    
    print(min_scr)
    print(max_scr)
    pred = prediction(min_scr, max_scr)
    print(pred)
#     pred = np.zeros(margin_scores.shape[0], dtype=int)
    
#     pred[np.intersect1d(np.where(min_scr < 0), np.where(max_scr > 0))] = 0
#     pred[np.intersect1d(np.where(min_scr < 0), np.where(max_scr < 0))] = -1
#     pred[np.intersect1d(np.where(min_scr > 0), np.where(max_scr > 0))] = 1
    

In [33]:
def prediction(min_scr, max_scr):
    pred = np.zeros(min_scr.shape, dtype=int)
    pred[np.intersect1d(np.where(min_scr < 0), np.where(max_scr > 0))] = 0
    pred[np.intersect1d(np.where(min_scr < 0), np.where(max_scr < 0))] = -1
    pred[np.intersect1d(np.where(min_scr > 0), np.where(max_scr > 0))] = 1
    return pred

In [36]:
#Sampling Random indices for selection

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

#samp_indices = np.random.randint(len(X_test), size=10)
print(samp_indices)
#Test data and labels
X_test = X_test[samp_indices]
y_test = np.array(y_test[samp_indices], dtype=int)


bootstrap_pred(X_train, X_test, y_train, y_test, n_bootstrap=100, minR=0.1, maxR=0.9)

[1076   14 1206  101 1144  862  315  858  756  912]
[-0.49037403 -0.39128804 -0.33350122 -0.25335822 -0.20072483 -0.10861206
 -0.04009901  0.10376799  0.29661256  0.50962585]
[-0.7791341  -0.68435764 -0.6014159  -0.53394866 -0.46818763 -0.4171182
 -0.31703883 -0.20936123 -0.06620912  0.22086589]
[-1 -1 -1 -1 -1 -1 -1  0  0  1]
