# Introduction

During this workshop we'll cover:

* Some definitions
* The prerequisites of predictive analytics and machine learning
* Some exploration
* An overview of some popular techniques, and some under-the-hood
* A hands-on example

## Definitions
Many of the terms associated with **analytics** are, sadly, overused or blatantly misused.  **Descriptive analytics** is the summarization of data in dimensions sufficient for visualization and inspection.  Any inferences drawn from descriptive analysis are left to the individual, and are not computationally informed.  

For the purpose of this discussion we'll adopt the [INFORMS definition](https://www.informs.org/Community/Analytics) of **predictive analytics**.  It is quantitative analysis that:

* Predicts future probabilities and trends, and
* Finds relationships in data that may not be readily apparent with descriptive analysis.

To define **Big Data Analytics** we'll borrow from IBM's three Vs (Volume, Velocity, and Variety) and add that these large, disparate datasets are brought together for the purpose of analysis under a common objective.  When that objective involves predicting future probabilities and trends, and finding relationships beyond descriptive analysis, our Big Data Analytics becomes predictive.  

**Machine Learning (ML)** is the method by which we train a model to produce predictions or draw inferences from data.  This is similar to the concept of **Artificial Intelligence (AI)**, but distinct.  The distinction is not well-defined, and some well-meaning people use the terms interchangeably.  In my view, the difference is that ML can be used with a human in the loop to inform decisions, while AI seeks to allow the machine to learn on its own.  There is a dynamic and automatic aspect to AI; it is the result of a certain type of machine learning.  It's worth noting that algorithms are used to train our models; if you walk away from this class knowing the difference between those two terms (algorithm and model), we'll declare it a success!  

An **algorithm** is a set of mathematical and/or computational procedures that, in this case, produce a model.  Generally, an ML algorithm will utilize some math to find the model that optimizes predictions on the training set of data.  That optimization minimizes the error on the training set (with a nod to the danger of overfitting).  

A **model** is the result of the ML algorithm.  It is a function (in the formal sense of the term) whose domain is the independent variables in the data, and whose range is the output that will inform the desired inference.  It could take the form of a linear mathematical formula, or a black box, or something in-between.  
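
To make the algorithm/model distinction concrete, here's a minimal sketch using ordinary least squares on made-up numbers.  The names `fit_line` and `model`, and the data itself, are illustrative assumptions and have nothing to do with the workshop dataset.  `fit_line` plays the role of the algorithm; the function it hands back is the model.

```python
def fit_line(xs, ys):
    """Algorithm: ordinary least squares for a single predictor.

    Returns a model, i.e. a function that maps x to a predicted y.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope and intercept that minimize squared error on the training set
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    def model(x):
        return intercept + slope * x

    return model


# Hypothetical training data that lies exactly on y = 1 + 2x
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = fit_line(train_x, train_y)  # run the algorithm once...
print(model(4.0))                   # ...then use the model on new input
```

The algorithm runs once, on the training data.  The model is what you keep and call on observations it has never seen.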

Above I've presumed that we're trying to predict or forecast something that is observable.  That is, I've assumed that we have data that includes a dependent variable.  Actually, ML comes in two flavors: Supervised and Unsupervised.  **Supervised Machine Learning (SML)** is a set of techniques used when there is a dependent variable available within the data (either directly or by proxy).  It's called "supervised" because the algorithm gets to "know" when it's right or wrong.  **Unsupervised Machine Learning (UML)** is a set of techniques that do not require a dependent variable.  While UML can be very useful, we'll focus on SML today.

## The prerequisites
Machine Learning is part statistics and part computer science.  Below are some topic areas to start with if you want to become proficient in ML.  **Note: these are NOT prerequisites for this workshop, but rather some areas of study for the interested and uninitiated participant.**

### Coding
This is a given for this audience, but worth mentioning nonetheless.

### Exploring
Predictive analytics goes beyond descriptive statistics.  That said, describing the data is an important step, often called Exploratory Data Analysis (EDA).  This means drawing pictures of your data that are interpretable, and forming and testing hypotheses pertinent to the data and your problem.  Learning how to explore and ask questions of the data is a valuable skill.

Hypothesis testing is a basic component of inferential statistics.  While most ML projects don't involve explicit hypothesis testing, the results of many algorithms contain computations that are the result of hypothesis tests, whether the modeler knows it or not.  Further, any SML model is, itself, the result of a generalized (if not formal) hypothesis test.  For that reason we'll take a minute to understand (or refresh) what formal hypothesis testing is.  

A hypothesis test follows these steps:

* Determine the appropriate question.  We could ask "are these two variables related?" or "does this particular modification change an expected outcome?" or something like that.  
* Formulate the hypothesis statement.  In its simplest form, a hypothesis statement contains a Null Hypothesis $(H_{0})$ and an Alternative Hypothesis $(H_{a})$.  $H_{0}$ assumes that there is nothing different, or that no relationship exists.  $H_{a}$ is written to reflect the relationship or change that you're trying to detect.
* Collect the data.  Data should be independent from one observation to the next and controlled.
* Find the appropriate technique.  There are many techniques available and selecting the appropriate one depends on the type of data and the question being asked.
* Compute the probability.  The math here assumes that the null hypothesis is true, and then computes the probability (the p-value) of observing data at least as extreme as what you actually observed.  
* Make an inference.  If the probability is too small, then you can reject the null hypothesis.  In other words, what you observed is very unlikely given what you assumed and therefore the assumption is very likely to be faulty.  If the probability value is not small enough then you fail to reject the null.  The threshold is generally set at 5%.   
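
The steps above can be sketched with a permutation test, which is one technique among many and is chosen here only because it needs no distributional formulas.  The function name `permutation_test` and the two small samples are hypothetical.  The null hypothesis is that the group labels have nothing to do with the measurements, so shuffling the labels simulates a world where the null is true.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in means.

    H0: the group labels are unrelated to the measurements.
    Returns the estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the observations as if H0 were true
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for an untreated and a treated group
control = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.7]
treated = [6.9, 7.2, 6.8, 7.5, 7.1, 6.7, 7.3, 7.0]

p_value = permutation_test(control, treated)
alpha = 0.05  # the conventional 5% threshold
print("p-value: %.4f" % p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```

In practice you'd more likely reach for a packaged test from a statistics library, but the logic is the same: assume the null, ask how surprising the data are, and compare against the threshold.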

## Exploring and Modeling with real data

The World Development Indicators data has over 1400 metrics that capture many features of a country's performance and growth year over year.  I've formatted the data for our purposes today, but it can be used in many other ways.
We'll spend a little time exploring, and then get into the modeling.

First, we'll read in the curated data and remove some of the features that are relatively incomplete.
Note that pandas is not a terribly efficient dataframe, but it's a pretty simple reader.  If size and speed are issues there are lots of better/more efficient ways to read in data.

We'll use [plotly](https://plot.ly/python/) to draw some pictures of the data.  

I'll cover the basics of ensemble methods, and then I'll also take some time to go under the hood of a couple of algorithms.  In particular, we'll investigate trees, as they're the basis for two very popular algorithms.  That's where it'll get a little math-y.  

Then we'll build a model using an out of the box algorithm, and see how it does.  

We'll then talk about training and testing including parameter search. 

Finally, you'll train and test a model using GBM (or something different if you like).

Let's import some stuff, and get started.

In [None]:
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import Scatter, Layout, Bar
import numpy as np

init_notebook_mode(connected = True) # Allows us to use plotly in offline mode

bDF = pd.read_csv(r'builtDF.csv', encoding = "ISO-8859-1") # A dataset that I've curated
descriptions = pd.read_csv(r'wdiSeries.csv', encoding = "ISO-8859-1") # Descriptions of each field name
desc_dict = descriptions.set_index('Code')['Name'].to_dict()

def check_complete(ser):
    # Keep a column only if at most 10% of its values are missing
    return ser.isnull().mean() <= 0.1
    
keep = bDF.apply(check_complete, axis = 0)
print(bDF.shape)
bDF_keep = bDF.iloc[:,keep.tolist()]
print(bDF_keep.shape)

for code in bDF_keep.columns:
    if code in desc_dict.keys():
        print("%s : %s" %(code, desc_dict[code]))

print(bDF_keep.columns)


# Drawing some pictures
Here we'll use plotly to take a look at some of the data features against one another, and maybe get some ideas about ways we could augment the data.

In [None]:
trace_ind = Scatter(
    x = bDF_keep['NV.IND.TOTL.ZS'], 
    y = bDF_keep['NY.GDP.MKTP.KD'],
    mode = 'markers',
    name = 'Industry',
    marker = dict(
        color = '#cc7c2c'
        )
    )

trace_agr = Scatter(
    x = bDF_keep['NV.AGR.TOTL.ZS'], 
    y = bDF_keep['NY.GDP.MKTP.KD'],
    mode = 'markers',
    name = 'Agriculture',
    marker = dict(
        color = '#636b77'
        )
    )

trace_srv = Scatter(
    x = bDF_keep['NV.SRV.TETC.ZS'], 
    y = bDF_keep['NY.GDP.MKTP.KD'],
    mode = 'markers',
    name = 'Services',
    marker = dict(
        color = '#3270d3'
        )
    )

data = [trace_ind, trace_agr, trace_srv]

layout = dict(
    title = 'Relationship between the percentage of type of value add and overall GDP',
    hovermode = 'closest',
    yaxis = dict(zeroline = False,
        title = 'GDP in constant 2010 US$',
        type = 'log'),
    xaxis = dict(zeroline = False,
        title = 'Percent value add',
        )
    )

fig = dict(data = data, layout = layout)
iplot(fig)


In [None]:
def create_trace(resp, x_name, y_name, sz_name, tx_name):
    x = bDF_keep[x_name][bDF_keep.region == resp]
    y = bDF_keep[y_name][bDF_keep.region == resp]
    sz = bDF_keep[sz_name][bDF_keep.region == resp]
    tx = bDF_keep[tx_name][bDF_keep.region == resp]
    return (x, y, sz, tx)

regions = bDF_keep.region.unique()
data = []
colors = ['#839cc4', '#ad83c4', '#c48383', '#93c483', '#c4a283', '#83c4b8', '#7f8187']
for i, r in enumerate(regions):
    x_val, y_val, sz_val, tx_val = create_trace(r, 'NE.TRD.GNFS.ZS', 'NY.GDP.PCAP.CD', 
                                        'IT.NET.USER.P2', 'country_name')
    trace_val = Scatter(
        x = x_val,
        y = y_val,
        text = tx_val,
        name = r,
        mode = 'markers',
        marker = dict(
            color = colors[i],
            size = sz_val,
            sizemode = 'area',
            sizemin = 1
            ),
        line = dict(
            width = 2,
            color = 'rgb(0,0,0)')
        )
    data.append(trace_val)

layout = dict(
    title = 'GDP, internet users per 100 ppl (size), and trade for various regions (color)',
    hovermode = 'closest',
    yaxis = dict(zeroline = False,
        title = 'GDP per capita in current US$', # NY.GDP.PCAP.CD is reported in current US$

        ),
    xaxis = dict(zeroline = False,
        title = 'Trade as a percentage of GDP',

        )
    )

fig = dict(data = data, layout = layout)
iplot(fig)

In [None]:
# Build your own visual using either Scatter, Bar, or something of your choosing!


## Advanced Modeling Techniques

So we've spent a bunch of time munging, reshaping, and exploring our data.  This is where the fun happens.  Building models is usually the pay-off of all the hard work of the previous steps.  Before we do so, however, let's take a little bit of time to introduce the algorithms.  We're only going to implement a couple that come with the scipy stack, but I'll point you toward a couple others that are really interesting.

### Ensemble Methods

Ensembles are built on the premise that many weak learners can be combined to produce a single strong learner. Consider this example.  Let's say that I have a population of 1,000 people who are all pretty dumb.  Since they're dumb we only give them one job, but (also since they're dumb) they're not very good at it.  That job is to pick football games.  A good game picker can be right 75% of the time.  Our guys aren't very good, but I can train them to a level where they're right about the outcome of the game 51% of the time pretty regularly.  So I let them vote on games, and I pick the winner based on their votes.  With those parameters, and assuming each vote is independent of the others, the majority of my 1,000 voting morons will be right about 73% of the time, which nearly matches my single expert.  Add more voters (or train them a hair better) and the crowd pulls ahead.

In [None]:
# This calculates the probability that, out of my 1,000 voters, 499 or fewer will
# pick the winner, and thus the vote will be wrong.
from scipy.stats import binom
binom.cdf(499, 1000, 0.51)

In principle, that's what ensembles are doing.  A weak learner only has to be right slightly better than random.  The act of averaging not only makes it more likely that they'll be right, but (because of a version of the Central Limit Theorem) also reduces the variance of the prediction.  Low variance and low bias are two very good properties for an algorithm to have.  
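
Here's a small sketch of how the vote improves as the crowd grows.  It assumes every voter is right with the same probability and votes independently of the others, which real learners never quite manage.  The function name `majority_vote_accuracy` is mine, and the crowd sizes are odd so ties can't happen.

```python
from math import exp, lgamma, log


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is right.

    Each voter is right with probability p_correct (strictly between 0 and 1).
    Use an odd n_voters so that ties cannot occur.
    """
    log_p, log_q = log(p_correct), log(1.0 - p_correct)
    total = 0.0
    for k in range(n_voters // 2 + 1, n_voters + 1):
        # Log of the binomial probability of exactly k correct votes;
        # working in logs avoids overflow for large crowds
        log_pmf = (lgamma(n_voters + 1) - lgamma(k + 1)
                   - lgamma(n_voters - k + 1)
                   + k * log_p + (n_voters - k) * log_q)
        total += exp(log_pmf)
    return total


for n in (1, 101, 1001, 10001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The independence assumption is doing the heavy lifting.  If the voters all make the same mistakes, adding more of them buys you nothing, which is why ensemble methods work so hard to make their learners different from one another.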

#### Trees

Many useful ensemble methods are based on tree algorithms.  A decision tree algorithm starts with all the data available to it, and finds the variable and split that minimizes some measure in the resulting nodes.  We'll restrict our conversation to the classification setting for simplicity.  The measure for classification is generally nodal purity.  That is, we split our dataset based on the variable and value that will result in the best subsets of data.  We then repeat the process at each resulting node until a stopping criterion is reached.  For example:

![SimpleTree](tree_img.jpg)

Here we're trying to find a model that will eventually take new observations of `X1`, `X2`, and `X3` and predict yes or no.  What the algorithm found is that splitting the dataset based on `X1` at the value `5` results in the best child nodes, and subsequently splitting `X2` on the left and `X3` on the right results in a good set of predictions.  Obviously this is a simple example with very limited data.  We could run an algorithm like this against millions of observations of hundreds of variables to tune a model.

That description is pretty much true, but also not terribly rigorous.  I'll describe two tree algorithms that take slightly different approaches to finding optimal splits, and along the way we'll establish some of the rigor. 

Please note that I'm purposely not establishing notation because in my view doing so in relatively short form only adds more confusion than it alleviates.  Also note that in order to really understand what's going on you'll want to get comfortable with the notation... learn the Greek!

In the previous paragraphs I've been a bit cavalier with the word 'optimal'.  In fact, constructing an optimal binary decision tree has been proven to be NP-complete.  That's a fancy way of saying that it's relatively easy to verify a solution but not easy to solve it, and by 'solve it' I mean find an optimal solution.  Let's think about why that is.  

The reason it's difficult is because of the potential for complex relationships in the data.  In fact, with enough data I'd argue that these complex relationships become inevitable.  That is, even when you find a good split in the data, it's possible that the overall tree will suffer because a locally sub-optimal split can result in an optimal tree with the addition of other child nodes further down the tree.

Thus decision tree algorithms do NOT guarantee an optimal solution.  They can, however, do some good by greedily taking the locally optimal choice at each split, and then pruning the results.  

It's useful to recognize that there are two things that the algorithm has to find in order to build the tree.  First, it has to find the best variable to split, and then it has to find the best way to split the data based on that variable.  That's relatively easy given the nature of the metrics.
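
As a rough sketch of that search at a single node, the code below tries every variable and every observed value, and keeps the split whose child nodes have the lowest size-weighted Gini impurity (described under Measures).  The function names and the toy data are hypothetical, and real implementations are far more efficient than this brute-force loop.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Exhaustively search every variable and every observed value.

    Returns (variable index, threshold, weighted impurity) for the split
    'row[variable] <= threshold' with the lowest weighted Gini impurity.
    """
    best = (None, None, float("inf"))
    n = len(rows)
    for var in range(len(rows[0])):
        for threshold in sorted(set(row[var] for row in rows)):
            left = [y for row, y in zip(rows, labels) if row[var] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            if not left or not right:
                continue  # a split has to send data both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (var, threshold, score)
    return best


# Hypothetical toy data: two predictors and a yes/no response
rows = [(2, 9), (3, 1), (4, 8), (6, 2), (7, 7), (8, 3)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(rows, labels))
```

A full tree builder would call `best_split` again on each child node until a stopping criterion is hit.  Note that nothing here looks ahead, which is exactly the greedy behavior that gives up the optimality guarantee.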

##### Measures 

Gini impurity is a measure of a distribution's uncertainty.  That is, if I choose an observation at random from a population and assign its predicted class at random according to the underlying class distribution, what is the probability that I misclassify it?  Samples dominated by one class will have lower impurity.

Entropy is also a measure of a distribution's uncertainty.  In our context we use Shannon's entropy to measure the uncertainty in the source node, and compare that to the weighted (by the resulting node sizes) sum of each child node's entropy.  A bigger drop in entropy means more information gain, which is the quantity that we want to maximize at each split.
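
Both measures are short enough to write out.  This is a minimal sketch with function names of my own choosing, using base-2 logs so entropy comes out in bits.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Chance of mislabeling a random draw when guessing by class frequency."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    return -sum(p * log2(p) for p in class_probabilities(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# A hypothetical node with a 50/50 class mix, split two different ways
parent = [1, 1, 1, 1, 0, 0, 0, 0]
good_split = [[1, 1, 1, 1], [0, 0, 0, 0]]  # children are pure
poor_split = [[1, 1, 0, 0], [1, 1, 0, 0]]  # children look just like the parent

print(gini_impurity(parent), entropy(parent))
print(information_gain(parent, good_split))
print(information_gain(parent, poor_split))
```

The pure split recovers all of the parent's entropy as information gain, while the split that leaves the class mix unchanged gains nothing.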

The two pictures below show both how the tree is built and how it can miss signal.

In [None]:
fd = pd.read_csv(r"fake_data.csv")

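def cum_prob(ser):
    # Empirical CDF of predictor `ser` among the observations where Y == 1
    ret = []
    df = fd.sort_values(ser)
    n_pos = len(df[df.Y == 1])
    for s in df[ser]: # Series.iteritems() was removed in pandas 2.0
        l_pos = len(df[(df[ser] <= s) & (df['Y'] == 1)])
        ret.append(float(l_pos) / n_pos)
    return ret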

x_x1 = cum_prob('X1')
x_x2 = cum_prob('X2')
x_x3 = cum_prob('X3')

traceX1 = Scatter(
    x = fd.sort_values('X1').X1 / fd.X1.max(), 
    y = x_x1,
    mode = 'lines',
    name = 'X1',
    line = dict(
        color = '#cc7c2c',
        shape = 'spline'
        )
    )

traceX2 = Scatter(
    x = fd.sort_values('X2').X2 / fd.X2.max(), 
    y = x_x2,
    mode = 'lines',
    name = 'X2',
    line = dict(
        color = '#636b77',
        shape = 'spline'
        )
    )

traceX3 = Scatter(
    x = fd.sort_values('X3').X3 / fd.X3.max(), 
    y = x_x3,
    mode = 'lines',
    name = 'X3',
    line = dict(
        color = '#3270d3',
        shape = 'spline'
        )
    )

data = [traceX1, traceX2, traceX3]

layout = dict(
    title = 'Cumulative probability distributions of the degenerate example',
    hovermode = 'closest',
    yaxis = dict(zeroline = False,
        title = 'Empirical CDF of predictors for response equal 1',
        ),
    xaxis = dict(zeroline = False,
        title = 'Normalized predictors X1, X2, and X3 respectively',
        )
    )

fig = dict(data = data, layout = layout)
iplot(fig)


def create_trace(resp, x_name, y_name, sz_name, tx_name):
    x = fd[x_name][fd.Y == resp]
    y = fd[y_name][fd.Y == resp]
    sz = fd[sz_name][fd.Y == resp]
    tx = fd[tx_name][fd.Y == resp]
    return (x, y, sz, tx)

responses = [0, 1]
data = []
colors = ['#a02022', '#2055a0']
for i, r in enumerate(responses):
    x_val, y_val, sz_val, tx_val = create_trace(r, 'X1', 'X2', 
                                        'X3', 'name')
    trace_val = Scatter(
        x = x_val,
        y = y_val,
        text = tx_val,
        name = r,
        mode = 'markers',
        marker = dict(
            color = colors[i],
            size = sz_val,
            sizemode = 'diameter',
            sizemin = 1
            ),
        line = dict(
            width = 2,
            color = 'rgb(0,0,0)')
        )
    data.append(trace_val)

layout = dict(
    title = 'A degenerate example of why it is hard to find a guaranteed optimal decision tree',
    hovermode = 'closest',
    yaxis = dict(zeroline = False,
        title = 'X2',

        ),
    xaxis = dict(zeroline = False,
        title = 'X1',

        )
    )

fig = dict(data = data, layout = layout)
iplot(fig)





#### Random Forest

Random Forests (RF) are a tree-based ensemble method that, as the name suggests, trains a bunch of trees on random subsets of the data (typically a bootstrap sample of the rows, with a random subset of the variables considered at each split).  Each of those trees gets to vote, and the votes are combined into the forest's prediction for new data.
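
The mechanics can be sketched in a few lines if we cheat a little.  The code below is a stripped-down cousin of a random forest, not the real thing: each "tree" is a single split (a stump), each stump sees a bootstrap sample of the rows and one randomly chosen variable, and predictions are made by majority vote.  All names and the synthetic data are assumptions for illustration.

```python
import random


def fit_stump(rows, labels, var):
    """Best single split on one variable: predict 1 when row[var] > threshold."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(row[var] for row in rows)):
        correct = sum((row[var] > threshold) == bool(y)
                      for row, y in zip(rows, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return var, best_threshold


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_vars = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample n rows with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        # Random subspace: each stump only sees one randomly chosen variable
        var = rng.randrange(n_vars)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], var))
    return forest


def predict(forest, row):
    votes = sum(row[var] > threshold for var, threshold in forest)
    return 1 if 2 * votes > len(forest) else 0


# Hypothetical data: the response is 1 when the two predictors sum past 1
rng = random.Random(42)
rows = [(rng.random(), rng.random()) for _ in range(300)]
labels = [1 if x1 + x2 > 1.0 else 0 for x1, x2 in rows]

forest = fit_forest(rows, labels)
accuracy = sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(rows)
print("training accuracy of the vote: %.2f" % accuracy)
```

A real forest grows much deeper trees and re-draws the candidate variables at every split, so don't read anything into the accuracy here.  The point is only the recipe: resample, restrict, fit, vote.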

#### Gradient Boosting Methods

A relative of RF, Gradient Boosting Methods (GBM) start with a weak learner and iteratively fit a new learner to the previous step's residuals, eventually arriving at a model that approximates the training data well.  The models at each step are saved, and combined according to a weight that is found at each stage.
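
Here's a minimal sketch of that loop for a one-variable regression problem with squared-error loss, where the thing each new learner chases is simply the current residual.  It simplifies in one important way: instead of finding a weight at each stage, it uses a fixed `learning_rate`.  The function names and the toy parabola are assumptions for illustration.

```python
def fit_regression_stump(xs, residuals):
    """One-split regression tree: predict the mean residual on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # step 0: a constant model
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data: a parabola sampled on a grid
xs = [i / 10.0 for i in range(40)]
ys = [(x - 2.0) ** 2 for x in xs]

constant_only = boost(xs, ys, n_rounds=0)
boosted = boost(xs, ys, n_rounds=50)
print("MSE, constant model: %.4f" % mse(constant_only, xs, ys))
print("MSE, 50 boosting rounds: %.4f" % mse(boosted, xs, ys))
```

Every stump on its own is a terrible model of a parabola.  Stacked up, each one correcting what the previous ones got wrong, they trace it reasonably well.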

##### XGBoost

To quote the [GitHub Site](https://github.com/dmlc/xgboost/blob/master/README.md):

> XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment(Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

For the eager learner, I suggest reading the [XGBoost paper](https://arxiv.org/pdf/1603.02754v3.pdf).  

### Other methods

Not all state-of-the-art methods are ensembles, though many solutions still aggregate models at the point of implementation, so the distinction is often slim.  

#### Support Vector Machines

Support Vector Machines (SVM) are a very interesting, non-probabilistic approach to classification.  Generally used in SML contexts, SVMs basically construct hyperplanes in the data-cloud that separate classes of data.  If the data aren't separable the algorithm projects to higher dimensions until it can separate them.  The hyperplanes are chosen to maximize the distance to the nearest observation on either side (or, in the case of non-separable data, minimize the miss distance to the farthest mis-classified observation).  Using the [kernel trick](http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html) the distances are computationally feasible, and the optimization is convex.  That's just a fancy way of saying that it can train on many observations quickly.  

![SVM](svm_example.jpg)

#### Neural Networks

Alternatively, Artificial Neural Networks (ANN) are a class of methods that seek to replicate the human brain's method of decision making.  The approach fell out of vogue in the early 2000s but has seen a resurgence in recent years.  With advancements including Google's [TensorFlow project](https://www.tensorflow.org/), ANNs are again popular and successful.  

![NN](neural_net.jpeg)

#### Even More

* Logistic Regression... tried and true, and still advancing.
* Cubist... a combination of a tree and linear regression.

### A Word of Caution

I haven't really gone into much mathematical or theoretical detail here.  One would spend multiple semesters in graduate school to do it justice.  That doesn't mean that someone without a degree cannot apply these algorithms.  It does, however, mean that anyone who uses them must take the time to understand what's going on under the hood.  A real project requires months of effort.  Some considerations:

* Data acquisition and munging can take weeks to months
    * You all are well aware of this
* Know your data and your algorithm
    * Some algorithms don't like meaningless predictors.
    * Some can't handle too much dependence within the predictors (see the classic Least Squares regression problem).
    * Some don't like sparse data, and others require variable scaling.
* Tuning a model requires some math
    * You should know how the algorithm finds its answer (e.g. how is it optimizing?).  
    * Know how to efficiently search the hyperparameters associated with the algorithm.  
    * Pick a solid, robust cross-validation routine and be VERY careful to avoid a data leak between the training and test sets.  
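
To make the data-leak warning concrete, here's a small sketch using variable scaling as the example.  The scaling parameters are learned from the training set alone and then applied, unchanged, to the test set.  The helper names and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the scaling parameters from the training data only."""
    return mean(values), pstdev(values)


def apply_scaler(values, params):
    center, spread = params
    return [(v - center) / spread for v in values]


# Hypothetical single predictor, already split into train and test
train = [10.0, 12.0, 14.0, 16.0]
test = [30.0, 32.0]

params = fit_scaler(train)  # the test set is never consulted here
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)
print(train_scaled)
print(test_scaled)
```

The leaky version would call `fit_scaler(train + test)`.  It looks harmless, but it lets information about the test set shape the inputs the model trains on, and your test score stops being an honest estimate.  The same rule applies to imputation, feature selection, and anything else that is fit to data.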
    
If you find yourself confronted with a problem that is well-suited to treatment by SML, I recommend getting deep on one algorithm and getting a working model before trying others.  Start with RandomForest.  Its theoretical underpinnings are relatively easy to understand, it parallelizes well, and it performs well in general across many data scenarios.  

## Modeling

Okay, with that out of the way we can train a couple of models using the algorithms we've listed here.  The first thing is to build training and test sets from our input data.  Normally, we'd want to hold out a validation set as well, but we're skipping that step here for simplicity.  I will, however, explain it.

### Training and Test Sets

In the simplest case (what we'll do below) we need a training set of data with which to train the model, and then a test set that we can check the results against.  The test set gives us an idea of the ability of the model to characterize signal in the data based on observations that were not used to train it.  

If you're training a model for production use then cross-validation is a best practice.  In fact, k-fold cross-validation is now an industry standard.  We start by holding out some of the data as a test set, just like the simple case.  Next, however, we'll further divide the remaining data into training and validation sets.  In k-fold cross-validation we do this k times: the training sets can overlap from one split to the next, so long as each training set never overlaps the validation set it is paired with.  Each split is used to train a model and validate its parameters.  In this way we home in on the most robust model.  At the very end, when we have a model that doesn't over-fit, and performs acceptably, we use the test set (never before seen by the model) to determine overall performance.  
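
The bookkeeping behind k-fold splitting is simple enough to sketch by hand.  The generator below (the name `k_fold_indices` is mine) carves row indices into k validation folds and pairs each with the remaining rows as its training set.  It deliberately skips shuffling to keep the output easy to read.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread any remainder across the first few folds
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every row lands in exactly one validation fold, and no row is ever in both halves of the same pair.  With real data you'd shuffle first (or stratify by class), which scikit-learn's splitters in `sklearn.model_selection` handle for you.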

Below we construct a simple train/test split, and then build an out-of-the-box RandomForest model.

In [None]:
# create a training and test set
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV
# For older versions of sklearn, import GridSearchCV from sklearn.grid_search
# and train_test_split from sklearn.cross_validation instead.
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, roc_auc_score

bDF_filled = bDF_keep.fillna(bDF_keep.mean(numeric_only = True)) # fill gaps with column means; text columns are skipped

y = bDF_filled.growthInd
dropCols = ['growthInd','country_name', 'GDPdelta',
            'GDPpercentgrowth', 'income_group', 'region', 'country code']
X = bDF_filled.drop(dropCols, axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 272)

print("The shape of the training set")
print(X_train.shape, y_train.shape)
print("The shape of the test set")
print(X_test.shape, y_test.shape)
print(X_train.dtypes)

In [None]:
# build a model
RFClassifier = RandomForestClassifier(n_estimators = 250, n_jobs = 1)
RFfit = RFClassifier.fit(X_train, y_train)
yRF_pred = RFfit.predict(X_test)

print("###### Some performance metrics for the RandomForest model. #######")
print("The confusion matrix: \n")
print(confusion_matrix(y_test, yRF_pred))
print("\n Precision and Recall \n")
print(classification_report(y_test, yRF_pred, labels = [-1, 0, 1])) # labels must match the (numeric) type of growthInd


### Model performance metrics

It's also worth noting that you have to decide which model performance metric you care about.  That's as much a qualitative judgement as anything, and depends on the nature of your problem.  For instance, precision is what you'd measure if you really cared about minimizing false positives, and recall if you need to minimize false negatives.  We'll see results based on three different metrics below.  

Credit to Wikipedia for the following image.  
![Precision and Recall](Precisionrecall.png)
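
Both metrics fall straight out of three counts.  The sketch below uses a helper name of my own and a made-up set of ten predictions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 4 actual positives, 3 predicted positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print("precision: %.3f, recall: %.3f" % (precision, recall))
```

Here two of the three positive calls were right (precision of 2/3), but only two of the four actual positives were found (recall of 1/2).  Which of those numbers should keep you up at night depends on what a false alarm costs compared to a miss.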

### Hyperparameter tuning

The [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) documentation describes the parameters that can be passed to the algorithm.  These are referred to as hyperparameters.  That is, values that the algorithm uses to decide how to proceed.  Just like we need to find optimal splits at each node in the tree, we also need to find the optimal values of these hyperparameters.  Unfortunately this is even less well defined than our decision tree problem.  That is, there isn't any closed form solution for how to optimize hyperparameters (if you think about it, it'll be obvious why).  That means we're left with more primitive search methods.  The most brute-force of these is grid search, which we'll employ below.  In essence what we're doing is repeatedly breaking the training set into train/validation folds, and building and testing a bunch of models by iterating through each of the combinations of hyperparameters.  Scoring each combination on folds it was not trained on is the cross-validation described earlier.  We can then find the combination that performs best, and test it on the hold-out set.  
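
Before running a grid search it's worth counting what you're signing up for.  This sketch enumerates a grid by hand with `itertools.product`; the grid values are an assumption chosen to have the same shape as the one we'll search, and the variable names are mine.

```python
from itertools import product

# A hypothetical grid: 3 x 3 x 2 x 2 candidate values
grid = {
    'n_estimators': [10, 30, 50],
    'max_depth': [2, 5, 10],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', None],
}

names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

n_folds = 5
print("%d combinations x %d folds = %d model fits"
      % (len(combinations), n_folds, len(combinations) * n_folds))
print(combinations[0])
```

The count multiplies with every hyperparameter you add, and again with the number of folds, which is why a grid that looks modest on paper can take a long time to run.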


In [None]:
tuning = {'n_estimators': [10, 30, 50], 'max_depth': [2, 5, 10], 
          'criterion': ['gini', 'entropy'], 'max_features': ['sqrt', None]}
# 'sqrt' replaces 'auto', which newer versions of sklearn no longer accept
tuned_parameters = [tuning]

scores = ['roc_auc', 'precision', 'recall']

bdf_dum = pd.get_dummies(bDF_filled, columns = ['region', 'income_group'])
bdf_dum['Y'] = bdf_dum.growthInd.apply(lambda x: 0 if x <= 0 else 1) # binary target: 1 when growthInd > 0
features_to_use = [col for col in bdf_dum.columns if col not in ['region', 'country code', 'country_name',
                                                                'income_group', 'growthInd', 'GDPdelta',
                                                                'GDPpercentgrowth', 'Y']]

X_train, X_test, y_train, y_test = train_test_split(bdf_dum[features_to_use], bdf_dum.Y,
                                                    test_size = 0.15)

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    
    clf = GridSearchCV(RandomForestClassifier(), tuned_parameters, cv=5,
                       scoring=score)

    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

    #y_est = pd.Series(y_pred, index = X_test.index.values, name = 'estimated')


winning_model = clf.best_estimator_ # already refit on all of X_train, using the best parameters for the last metric in scores