# Ensemble Methods and Pipelines

Ensemble methods could make up the content of an entire course. **Ensembles combine more than one model systematically.** In practice, data scientists use them because a well-built ensemble often predicts outcomes more accurately, and with less bias, than any single method. Since thousands of algorithms can be combined in essentially limitless ways, we cannot cover every ensemble method. Instead, we will discuss one widely used ensemble method, Random Forest. Note that the idea of combining multiple relatively "weak" machine learning methods into a single "strong" one is quite common, and many sophisticated machine learning models do this.
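To make the "weak to strong" idea concrete, here is a minimal sketch of bagging done by hand: several shallow decision trees are each fit on a bootstrap sample and then vote. The synthetic data from `make_classification`, the tree depth, and the number of trees are all assumptions chosen for illustration. A Random Forest works along similar lines but also randomizes which features each split may consider.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the lesson's rainfall data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice; odd, so a majority vote cannot tie
votes = np.zeros((n_trees, len(X_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=3, random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    votes[i] = tree.predict(X_test)

# Majority vote across the trees gives the ensemble prediction
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Ensemble accuracy:", (ensemble_pred == y_test).mean())
```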

Also in this lesson, we will introduce the idea of a **"pipeline"**: a collection of steps that automates much of your data science process, so that testing modifications and implementing variations is relatively straightforward. Pipelines take some getting used to, but are worth the effort in the end. **Pipelines play a role similar to loops: they let us try out data-specific hyperparameters.**
* How do we pick the number of folds for cross-validation? With pipelines, we can iterate through the options more easily.
* We can use grid search.

Rather than go through lots of explanation, we're going to build an ensemble model and talk about what it does. Then we're going to do the same for a pipeline.

In [1]:
# load in packages/libraries
import pandas as pd
import warnings
warnings.filterwarnings('ignore')   # want to ignore the warnings for this assignment
import numpy as np
import math
import sklearn
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifiers

```clf = LogisticRegression() ``` 

versus

```clf = RandomForestClassifier()```

In [2]:
# import data
# source: https://data.gov.in/catalog/rainfall-india?filters%5Bfield_catalog_reference%5D=1090541&format=json&offset=0&limit=6&sort%5Bcreated%5D=desc
import os
os.chdir(r'C:\Users\livsh\Downloads')  # raw string, so single backslashes are enough
transposed = pd.read_csv('binary_and_precip_transposed.csv')
transposed = transposed.drop(transposed.columns[0], axis = 1)
transposed.head()

Unnamed: 0,precip,binary,month,year
0,6.7,1,1,1901
1,0.0,0,1,1901
2,1.7,0,1,1901
3,3.8,1,1,1901
4,6.3,1,1,1901


The file `binary_and_precip_transposed.csv` comes from Lesson_8_Cantrell_Project_Application. I added this cell of code to lesson 8: `transposed.to_csv('binary_and_precip_transposed.csv')`

The file `df_binary.csv` comes from Lesson_8_Cantrell_Project Application. I added this cell of code to lesson 8: `df_binary.to_csv('df_binary.csv')`

Because `to_csv` also writes the row index by default, the loading cells drop the first column after reading each file.

In [3]:
df_binary = pd.read_csv('df_binary.csv')
df_binary = df_binary.drop(df_binary.columns[0], axis = 1)
df_binary.head()

Unnamed: 0,JAN_bi,FEB_bi,MAR_bi,APR_bi,MAY_bi,JUN_bi,JUL_bi,AUG_bi,SEP_bi,OCT_bi,NOV_bi,DEC_bi
0,1,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,1,0,0,0
2,0,0,1,0,0,0,1,0,1,0,0,0
3,1,0,1,0,1,0,0,0,0,0,1,1
4,1,0,0,0,0,0,0,0,1,0,0,0


In [4]:
# Split the data into 10 folds of roughly equal size
# shuffle=True shuffles the rows before splitting

# random_state is a seed for the shuffle, so the same split is produced every time (repeatable)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # 10 folds, each holding about 10% of the data
for train_index, test_index in cv.split(transposed):
    print("TRAIN:", train_index, "TEST:", test_index)

# the indices in train should not appear in test

TRAIN: [   0    1    2 ... 1399 1400 1402] TEST: [   4    5   19   27   31   34   45   52   54   55   85  108  141  142
  148  152  159  161  184  192  202  211  224  227  231  233  241  247
  254  268  278  298  303  308  312  326  362  363  376  412  418  420
  426  438  445  458  461  467  471  472  477  487  500  526  528  529
  533  536  542  554  557  563  565  568  569  572  587  608  609  610
  634  638  642  649  654  656  678  704  708  711  717  740  757  758
  759  761  762  768  788  792  795  798  826  846  877  887  901  911
  918  920  922  935  946  963  980  983  986 1000 1002 1010 1024 1032
 1034 1038 1041 1063 1070 1127 1150 1154 1168 1174 1179 1183 1222 1235
 1252 1257 1259 1261 1270 1298 1299 1332 1347 1349 1372 1375 1396 1401
 1403]
TRAIN: [   0    3    4 ... 1401 1402 1403] TEST: [   1    2    8    9   14   18   29   39   40   47   53   56   58   61
   75   80   92  124  140  156  182  186  198  204  215  253  260  270
  279  295  299  302  310  315  317  319  3

In [5]:
# looking at the size of the test and train
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(transposed):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))

TRAIN: 1263 TEST: 141
TRAIN: 1263 TEST: 141
TRAIN: 1263 TEST: 141
TRAIN: 1263 TEST: 141
TRAIN: 1264 TEST: 140
TRAIN: 1264 TEST: 140
TRAIN: 1264 TEST: 140
TRAIN: 1264 TEST: 140
TRAIN: 1264 TEST: 140
TRAIN: 1264 TEST: 140
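
The two properties noted above (no overlap between train and test, and folds of roughly equal size) can be checked directly. This is a small sketch that uses a plain index array of 1,404 entries as a stand-in for `transposed`; the row count is taken from the fold sizes printed above.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(1404)  # stand-in for the 1,404 rows of `transposed`
cv = KFold(n_splits=10, shuffle=True, random_state=0)

test_sizes = []
seen_in_test = []
for train_index, test_index in cv.split(rows):
    # No row may appear in both the training and the test set of a fold
    assert len(np.intersect1d(train_index, test_index)) == 0
    test_sizes.append(len(test_index))
    seen_in_test.extend(test_index)

print(test_sizes)
# Across the 10 folds, every row should be used for testing exactly once
print(sorted(seen_in_test) == list(range(1404)))
```

Since 1,404 is not divisible by 10, the first four folds receive one extra test row, which matches the 141/140 pattern in the output above.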


In [6]:
from sklearn.metrics import precision_score 
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

#### Logistic Regression

In [7]:
transposed.head()

Unnamed: 0,precip,binary,month,year
0,6.7,1,1,1901
1,0.0,0,1,1901
2,1.7,0,1,1901
3,3.8,1,1,1901
4,6.3,1,1,1901


In [8]:
# Define the cross-validation splitter
cv = KFold(n_splits=10, shuffle=True, random_state=None)  # no seed: the folds, and the precision values below, change on every run

# Create for-loop
for train_index, test_index in cv.split(transposed):

    # Define training and test sets
    X_train = transposed.loc[train_index].drop(['year', 'month', 'binary'], axis=1)
    y_train = transposed.loc[train_index]['binary']    # what we want to predict
    X_test = transposed.loc[test_index].drop(['year', 'month', 'binary'], axis=1)
    y_test = transposed.loc[test_index]['binary']
    
        
    # Fit model
    clf = LogisticRegression(max_iter = 10000)  # logistic regression classifier with a raised iteration cap
    clf.fit(X_train, y_train)

    # Generate predictions
    predicted = clf.predict(X_test)  # predicted outcomes for the held-out test fold

    # Compare to actual outcomes and print precision as a percentage, rounded to one decimal
    print('Precision: ', (round(precision_score(y_test, predicted)*100,1)))

Precision:  52.4
Precision:  66.7
Precision:  71.4
Precision:  70.8
Precision:  80.8
Precision:  52.2
Precision:  55.6
Precision:  54.5
Precision:  76.0
Precision:  75.0


#### Random Forest Classifier

In [9]:
# Define the cross-validation splitter
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Create for-loop
for train_index, test_index in cv.split(transposed):

    # Define training and test sets
    X_train = transposed.loc[train_index].drop(['year', 'month', 'binary'], axis=1)
    y_train = transposed.loc[train_index]['binary']    # what we want to predict
    X_test = transposed.loc[test_index].drop(['year', 'month', 'binary'], axis=1)
    y_test = transposed.loc[test_index]['binary']
    
        
    # Fit model
    clf = RandomForestClassifier(random_state=1)
    clf.fit(X_train, y_train)

    # Generate predictions
    predicted = clf.predict(X_test)  # predicted outcomes for the held-out test fold

    # Compare to actual outcomes and print precision as a percentage, rounded to one decimal
    print('Precision: ', (round(precision_score(y_test, predicted)*100,1)))

Precision:  59.5
Precision:  55.3
Precision:  68.8
Precision:  62.2
Precision:  57.5
Precision:  67.4
Precision:  68.0
Precision:  71.1
Precision:  56.8
Precision:  74.4
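
The manual loops above can be condensed with `cross_val_score`, which was imported at the top of the notebook but not used. The sketch below assumes synthetic data from `make_classification` in place of the precipitation data, so the numbers it prints are not comparable to the outputs above; it only shows the shape of the call.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the features and binary target (an assumption)
X, y = make_classification(n_samples=400, n_features=4, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Random Forest": RandomForestClassifier(random_state=1),
}

mean_precision = {}
for name, model in models.items():
    # One precision value per fold, computed on that fold's held-out rows
    scores = cross_val_score(model, X, y, cv=cv, scoring="precision")
    mean_precision[name] = scores.mean()
    print(name, "mean precision:", round(scores.mean() * 100, 1))
```

Averaging over the folds gives a single number per model, which is easier to compare than ten separate precision values.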


##### Adjust Parameters for Random Forest Classifier

In [10]:
# Define the cross-validation splitter
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Create for-loop
for train_index, test_index in cv.split(transposed):

    # Define training and test sets
    X_train = transposed.loc[train_index].drop(['year', 'month', 'binary'], axis=1)
    y_train = transposed.loc[train_index]['binary']    # what we want to predict
    X_test = transposed.loc[test_index].drop(['year', 'month', 'binary'], axis=1)
    y_test = transposed.loc[test_index]['binary']
    
        
    # Fit model
    clf = RandomForestClassifier(random_state=1, oob_score = True)
    clf.fit(X_train, y_train)

    # Generate predictions
    predicted = clf.predict(X_test)  # predicted outcomes for the held-out test fold

    # Compare to actual outcomes and print precision as a percentage, rounded to one decimal
    print('Precision: ', (round(precision_score(y_test, predicted)*100,1)))


# oob_score=True did not change the precision values compared to the random forest classifier above:
# it only adds an out-of-bag accuracy estimate (clf.oob_score_) and leaves the fitted trees unchanged

Precision:  59.5
Precision:  55.3
Precision:  68.8
Precision:  62.2
Precision:  57.5
Precision:  67.4
Precision:  68.0
Precision:  71.1
Precision:  56.8
Precision:  74.4
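
The cell above sets `oob_score=True` but never reads the result. A minimal sketch of what the option provides, assuming synthetic data from `make_classification` rather than the rainfall data: the fitted forest gains an `oob_score_` attribute, while its predictions match those of an otherwise identical forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

plain = RandomForestClassifier(random_state=1).fit(X_train, y_train)
with_oob = RandomForestClassifier(random_state=1, oob_score=True).fit(X_train, y_train)

# Same seed and same data: the two forests should predict identically
same_predictions = np.array_equal(plain.predict(X_test), with_oob.predict(X_test))
print("Same predictions:", same_predictions)

# Accuracy estimated from the rows each tree did not see in its bootstrap sample
print("Out-of-bag accuracy estimate:", round(with_oob.oob_score_, 3))
```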


In [11]:
# Define the cross-validation splitter
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Create for-loop
for train_index, test_index in cv.split(transposed):

    # Define training and test sets
    X_train = transposed.loc[train_index].drop(['year', 'month', 'binary'], axis=1)
    y_train = transposed.loc[train_index]['binary']    # what we want to predict
    X_test = transposed.loc[test_index].drop(['year', 'month', 'binary'], axis=1)
    y_test = transposed.loc[test_index]['binary']
    
        
    # Fit model
    clf = RandomForestClassifier(random_state=1, n_estimators=100)
    clf.fit(X_train, y_train)

    # Generate predictions
    predicted = clf.predict(X_test)  # predicted outcomes for the held-out test fold

    # Compare to actual outcomes and print precision as a percentage, rounded to one decimal
    print('Precision: ', (round(precision_score(y_test, predicted)*100,1)))


# n_estimators=100 --> the per-fold precision values shift by a few points but stay in the same overall range as the random forest classifier above

Precision:  55.6
Precision:  59.5
Precision:  71.7
Precision:  57.5
Precision:  59.5
Precision:  66.0
Precision:  66.7
Precision:  68.1
Precision:  60.0
Precision:  73.3


# Pipelines

Pipelines help you automate some of your work and make the process a bit more systematic. As you can likely tell, even though you aren't a software engineer, machine learning is all about letting machines do the work of "learning".  

That's what pipelines allow us to do. You build a pipeline, tell the computer to use that pipeline to run many combinations of different features/parameters, and the machine tells you what works best.   

Pipelines can do two things:
1. Transform data (typically for feature engineering)
2. Estimate with the data (predicting outcomes given some data)
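
As a minimal sketch of those two roles, the pipeline below chains one transform step and one estimator step. The step names `scale` and `logreg`, the choice of `StandardScaler`, and the synthetic data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # 1. transform step
    ("logreg", LogisticRegression(max_iter=1000)),  # 2. estimator step
])

pipe.fit(X_train, y_train)          # fits the scaler, then the model on the scaled data
predictions = pipe.predict(X_test)  # applies the fitted scaler, then predicts
print(predictions[:10])
```

Because the scaler is fit only on the training rows inside `fit`, the test rows never influence the transformation.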

## Grid Search & Parameter Tuning

Note that where pipelines really shine is in tuning hyperparameters. We can do this with for loops, but pipelines make it much easier.  


**What is a hyperparameter?** https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/

For example, with Random Forest Classification, we might want to adjust hyperparameters like the number of estimators, or the minimum number of samples (for example, per split or per leaf). Each of these is configurable through sklearn's pipeline and grid search tools. The only trick is that we refer to them by the name we gave the step in the pipeline, then two underscores, then the parameter name. So instead of using:

```
random_forest.n_estimators  
```

we should use (two underscores and no `.`):

```
random_forest__n_estimators
```

We follow that with a list of the values we want to try. So, if we wanted to try all the values between 5 and 10 we could use either: 

``` 
random_forest__n_estimators=[5, 6, 7, 8, 9, 10]
```

or

``` 
random_forest__n_estimators=list(range(5,11))
```

which produces the same thing. If trying every value would take too long (note that each added parameter multiplies the number of combinations by its number of values!), then try just a few.
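
To see how quickly the combinations multiply, `ParameterGrid` from `sklearn.model_selection` can count them before anything is fit. The second parameter and its values are an illustrative assumption.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "random_forest__n_estimators": list(range(5, 11)),  # 6 values
    "random_forest__min_samples_split": [2, 5, 10],     # 3 values (illustrative)
}

n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 6 * 3 = 18 candidate settings
```

With 10-fold cross-validation, each of those 18 settings is fit 10 times, so even this small grid means 180 model fits.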

The cell below builds the pipeline on your precipitation data and binary values, the starting point for a grid search over a few parameters.

In [12]:
import sklearn.pipeline
import sklearn.feature_selection

X = transposed.drop(['binary', 'year', 'month'], axis=1).values  # independent variable 
y = transposed['binary'].values                 # dependent variable                          

select = sklearn.feature_selection.SelectKBest(k='all')
clf = sklearn.ensemble.RandomForestClassifier(random_state=1)

steps = [('feature_selection', select),
         ('random_forest', clf)]

pipeline = sklearn.pipeline.Pipeline(steps) 
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)  # splitting the data

pipeline.fit(X_train, y_train)  # fit the pipeline on X_train and y_train
y_prediction = pipeline.predict(X_test)  # predict on X_test to get a set of test predictions
report = sklearn.metrics.classification_report(y_test, y_prediction)  # evaluate the predictions with sklearn.metrics.classification_report()
print(report)  # and print the report

              precision    recall  f1-score   support

           0       0.85      0.89      0.87       247
           1       0.70      0.63      0.67       104

    accuracy                           0.81       351
   macro avg       0.78      0.76      0.77       351
weighted avg       0.81      0.81      0.81       351
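
The cell above fits the pipeline once with its default settings. A minimal sketch of the grid search step itself follows, assuming synthetic data from `make_classification` as a stand-in for the rainfall data and a small illustrative grid; the step names match the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(k="all")),
    ("random_forest", RandomForestClassifier(random_state=1)),
])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    "random_forest__n_estimators": [5, 10, 20],
    "random_forest__min_samples_split": [2, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="precision")
search.fit(X_train, y_train)  # fits every combination on every fold

print("Best parameters:", search.best_params_)
print("Held-out precision:", round(search.score(X_test, y_test), 3))
```

After fitting, `search` behaves like the best pipeline it found, refit on all of the training data, so `search.predict(X_test)` can be used directly.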



# Broadening Your Horizons

We don't have time to cover every classifier you could use, but the following code runs through a few that you might find useful:

In [13]:
from sklearn.neural_network import MLPClassifier  # there are classifiers and regressors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

names = ["Nearest Neighbors", "Linear SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(),
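    SVC(),  # note: the default kernel is RBF; pass kernel="linear" for a truly linear SVM
    GaussianProcessClassifier(),  # probabilistic classifier built on a Gaussian process prior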
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(),
    MLPClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]


for name, clf in zip(names, classifiers):
    cv = KFold(n_splits=5, shuffle=True, random_state=1) # not iterating through the number of splits
    for train_index, test_index in cv.split(transposed):
        X_train = transposed.loc[train_index].drop(['year', 'month', 'binary'], axis=1)
        y_train = transposed.loc[train_index]['binary']
        X_test = transposed.loc[test_index].drop(['year', 'month', 'binary'], axis=1)
        y_test = transposed.loc[test_index]['binary']

        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        print(name, score)

Nearest Neighbors 0.8042704626334519
Nearest Neighbors 0.7437722419928826
Nearest Neighbors 0.800711743772242
Nearest Neighbors 0.7580071174377224
Nearest Neighbors 0.7714285714285715
Linear SVM 0.8113879003558719
Linear SVM 0.7793594306049823
Linear SVM 0.7935943060498221
Linear SVM 0.8256227758007118
Linear SVM 0.7535714285714286
Gaussian Process 0.8185053380782918
Gaussian Process 0.7793594306049823
Gaussian Process 0.797153024911032
Gaussian Process 0.8256227758007118
Gaussian Process 0.7571428571428571
Decision Tree 0.8042704626334519
Decision Tree 0.7686832740213523
Decision Tree 0.7829181494661922
Decision Tree 0.8291814946619217
Decision Tree 0.8071428571428572
Random Forest 0.7935943060498221
Random Forest 0.7864768683274022
Random Forest 0.7900355871886121
Random Forest 0.7330960854092526
Random Forest 0.7392857142857143
Neural Net 0.7295373665480427
Neural Net 0.7580071174377224
Neural Net 0.7829181494661922
Neural Net 0.701067615658363
Neural Net 0.7642857142857142
AdaBoost
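
The loop above prints one accuracy score per fold, which makes the classifiers hard to compare at a glance. A sketch that averages the folds and ranks the models, using three of the classifiers and synthetic data (both choices are assumptions for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Naive Bayes": GaussianNB(),
}

# With no `scoring` argument, cross_val_score uses each classifier's own
# score method, which is accuracy for these models
mean_accuracy = {
    name: cross_val_score(clf, X, y, cv=cv).mean()
    for name, clf in classifiers.items()
}

# Print the classifiers from highest to lowest mean accuracy
for name, score in sorted(mean_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```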

#### Observations
* Did not switch to the regressor versions because the binary values are not continuous data
* the various classifiers score in a similar range
* the "worst" classifiers, with the lowest scores, are QDA (similar to the in-class example) and Naive Bayes
* using literature review to help develop models for our project (looking at time series for next week)