# Classification and k-fold Validation
We are going to introduce the idea of k-fold validation using our binary classifier.   For this study, we will use a one-versus-all analysis: our classifier attempts to identify a single predefined digit (our signal) versus all of the other digits (our background). 

Up to this point, we have been developing our models by splitting our data into test and train subsets, generally using a 20%/80% test/train split.   The idea is that we train the model on 80% of the data and reserve 20% to test it, on the assumption that the results we obtain on the test sample are representative of what we should expect when we apply our trained model to new (truly unseen) data.

But there are two major problems with this strategy:
1.  What if we got "lucky" (or "unlucky" as the case may be) with the 20% we had reserved for testing?  That is, the particular samples that we have reserved for testing may have features that happen to be distributed in a way that makes them much easier (or harder) to classify.  Then our model may perform significantly differently from our expectations when we apply it to new data.   
2.  We don't use all of our available data in training our model.   It does seem unfortunate that we don't take advantage of the 20% of the data that we reserved for testing.  

k-fold validation attempts to deal with each of these issues.


# Get the data
First let's get the data.  As usual we will have a hook in our code to allow us to switch between the smaller dataset (file names containing `short_`) and the larger dataset.

In [0]:
import pandas as pd
#
#short = ""
short = "short_"

#
# Read in all of the digits
dfAll = pd.DataFrame()
for digit in range(10):
    print("Processing digit ",digit)
    fname = 'https://raw.githubusercontent.com/big-data-analytics-physics/data/master/ch3/digit_' + short + str(digit) + '.csv'
    df = pd.read_csv(fname,header=None)
    df['digit'] = digit
    dfAll = pd.concat([dfAll, df])
#
# Define our "signal" digit
digitSignal = 5
dfA = dfAll[dfAll['digit']==digitSignal].copy()   # copy, so adding a column below does not trigger a SettingWithCopyWarning
dfB = dfAll[dfAll['digit']!=digitSignal].copy()
#
# Define the signal column
dfA['signal'] = 1
dfB['signal'] = 0
#
# Shuffle our background
from sklearn.utils import shuffle
dfB = shuffle(dfB)
#
# Limit the background (dfB) to the same number of samples as the signal (dfA)
dfB_use = dfB.head(len(dfA))
dfCombined = dfB_use
dfCombined = pd.concat([dfCombined, dfA])
print("Total data size",len(dfCombined))


# The "Old" Approach: A single test/train split
Before we proceed with k-fold validation, let's revisit our standard, single-split test/train approach.   We will split the data, fit it with a model, predict results, and get performance metrics and our ROC curves.   Also, thinking ahead, we will put **ALL** of the code specific to fitting **AND** performance measures for the binary classifier into their own functions.
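
As a reminder of the performance metrics used throughout this notebook, here is a minimal sketch of how precision, recall, and the F1 score follow from the four confusion-matrix counts. The helper name `confusion_counts` and the toy labels are illustrative assumptions, not part of the notebook's own code.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for a binary problem with 1=signal, 0=background."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Toy labels, made up for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)     # fraction of predicted signal that is truly signal
recall = tp / (tp + fn)        # fraction of true signal that was found
f1 = 2.0 * precision * recall / (precision + recall)   # harmonic mean of the two
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("precision, recall, F1:", precision, recall, f1)
```

Note that precision is undefined when nothing is predicted as signal (TP + FP = 0), so real code may need to guard against a division by zero.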

In [0]:
from collections import defaultdict
def autovivify(levels=1, final=dict):
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels-1, final)))


In [0]:
#
# Determine the performance
def binaryPerformance(y,y_pred,y_score,debug=False):
#
# Assuming a binary classifier with 1=signal, 0=background
  confusionMatrix = autovivify(2,int)
  for i in range(len(y_pred)):
    trueClass = y[i]
    predClass = y_pred[i]
    confusionMatrix[trueClass][predClass] += 1

  if debug:
    for trueClass in range(2):
      print("True: ",trueClass,end="")
      for predClass in range(2):
        print("\t",confusionMatrix[trueClass][predClass],end="")
      print()
    print()
  TP = confusionMatrix[1][1]
  FP = confusionMatrix[0][1]
  FN = confusionMatrix[1][0]
  TN = confusionMatrix[0][0]
  
  if debug:
    print("TP predicted true, actually true   ",TP)
    print("FP predicted true, actually false  ",FP)
    print("TN predicted false, actually false ",TN)
    print("FN predicted false, actually true  ",FN)


  precision = TP / (TP + FP)
  recall = TP / (TP + FN)
  f1_score = 2.0 / ( (1.0/precision) + (1.0/recall) )
  
  if debug:
    print("Precision = TP/(TP+FP) = fraction of predicted true actually true ",precision)
    print("Recall = TP/(TP+FN) = fraction of true class predicted to be true ",recall)
    print("F1 score = ",f1_score)

  #
  # Get the ROC curve.  We will use the sklearn function to do this
  from sklearn import metrics
  #fpr_train, tpr_train, thresholds_train = metrics.roc_curve(y_train, y_train_score, pos_label=1)
  fpr, tpr, thresholds = metrics.roc_curve(y, y_score, pos_label=1)
  #
  # Get the auc
  auc = metrics.roc_auc_score(y, y_score)
  if debug:
    print("AUC this sample: ",auc)
  
  return precision,recall,auc,fpr, tpr, thresholds

# The runFitter method
This looks similar to what we have used before.  **NOTE** however that the results are stored in a dictionary, and it is this single dictionary that is returned by calling this method.
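
The `runFitter` code in this notebook calls `decision_function`, which `LinearSVC` provides but some other estimators do not. As a hedged sketch, a small helper (the name `get_scores` is hypothetical, and the data here is synthetic) could fall back to `predict_proba` when `decision_function` is missing:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def get_scores(estimator, X):
    """Return a continuous score for class 1 from a fitted binary classifier."""
    if hasattr(estimator, "decision_function"):
        return estimator.decision_function(X)
    # predict_proba has one column per class; column 1 is P(class 1) when classes are [0, 1]
    return estimator.predict_proba(X)[:, 1]

# Synthetic stand-in data (an assumption; not the digit dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

svc_scores = get_scores(LinearSVC(dual=False, random_state=42).fit(X_demo, y_demo), X_demo)
tree_scores = get_scores(DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo), X_demo)
print(svc_scores[:3], tree_scores[:3])
```

Either kind of score can be passed to `roc_curve` and `roc_auc_score`, since both only need a ranking of the samples.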

In [0]:

def runFitter(estimator,X_train,y_train,X_test,y_test):
#
# Now fit to our training set
  estimator.fit(X_train,y_train)
#
# Now predict the classes and get the score for our training set
  y_train_pred = estimator.predict(X_train)
  y_train_score = estimator.decision_function(X_train)   # NOTE: some estimators have a predict_proba method instead of decision_function
#
# Now predict the classes and get the score for our test set
  y_test_pred = estimator.predict(X_test)
  y_test_score = estimator.decision_function(X_test)

#
# Now get the performance
  precision_test,recall_test,auc_test,fpr_test, tpr_test, thresholds_test = binaryPerformance(y_test,y_test_pred,y_test_score,debug=False)
  precision_train,recall_train,auc_train,fpr_train, tpr_train, thresholds_train = binaryPerformance(y_train,y_train_pred,y_train_score,debug=False)
#
# Decide what you want to return: for now, just precision, recall, and auc for both test and train
  results = {
      'precision_train':precision_train,
      'recall_train':recall_train,
      'auc_train':auc_train,
      'fpr_train':fpr_train, 
      'tpr_train':tpr_train, 
      'thresholds_train':thresholds_train,
      'precision_test':precision_test,
      'recall_test':recall_test,
      'auc_test':auc_test,
      'fpr_test':fpr_test, 
      'tpr_test':tpr_test, 
      'thresholds_test':thresholds_test}

  return results
  

# Do the Fit!
The following code performs a single test/train split fit.   Note how the results returned from **runFitter** are accessed.


In [0]:

#
# Form our test and train data
from sklearn.model_selection import train_test_split
train_digits,test_digits = train_test_split(dfCombined, test_size=0.3, random_state=42)

# The first 784 columns are the pixel values (as_matrix was removed in pandas 1.0, so slice with iloc instead)
X_train = train_digits.iloc[:, :784].values
y_train = train_digits['signal'].values

X_test = test_digits.iloc[:, :784].values
y_test = test_digits['signal'].values


#
# Get our estimator and predict
from sklearn.svm import LinearSVC

estimator = LinearSVC(random_state=42,dual=False,max_iter=100,tol=0.001)    # use dual=False when  n_samples > n_features which is what we have


results = runFitter(estimator,X_train,y_train,X_test,y_test)

print("AUC training data: ",results['auc_train'])
print("AUC testing data:  ",results['auc_test'])



# Plot results

In [0]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

In [0]:
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()

trace0 = go.Scatter(
    x=results['fpr_train'],
    y=results['tpr_train'],
    text=results['thresholds_train'],
    mode='lines',
    name='Training set'
)

trace1 = go.Scatter(
    x=results['fpr_test'],
    y=results['tpr_test'],
    text=results['thresholds_test'],
    mode='lines',
    name='Testing set'
)

layout = dict(
    title='ROC Curve',
    xaxis=dict(title='FPR'),
    yaxis=dict(title='TPR')
)

data = [trace0,trace1]      #   this is a list because you might want to plot many data sets
iplot(dict(data=data,layout=layout),validate=False)

# k-fold Validation
Now let's introduce a different approach to the simple single test/train split: **k-fold** validation.

The basic idea is this: 
1.   We first randomly shuffle our dataset (very important as it may be ordered).
2.   We divide our dataset up into "k" approximately even subsamples, or folds. 
3.   We iterate over each fold:
      *   We use that fold as our **testing** sample, and the remaining (k-1) subsamples are combined and used as the training sample
      *   We fit our model on the training sample and evaluate it on the testing sample as usual
      *   We store the results for each fold.
      
Note that each sample in our dataset is used once in the testing, and k-1 times in the training.
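
The steps above can be sketched by hand, without sklearn, to check the claim that every sample is tested exactly once and used for training k-1 times. This is only an illustration: the helper `make_folds` and the sample count are made up for the example.

```python
import random

def make_folds(n_samples, k, seed=42):
    """Shuffle the indices 0..n_samples-1 and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1
    return [indices[i::k] for i in range(k)]

n_samples, k = 23, 5
folds = make_folds(n_samples, k)

test_count = [0] * n_samples
train_count = [0] * n_samples
for i, test_idx in enumerate(folds):
    # The training sample is everything that is not in the current test fold
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    for j in test_idx:
        test_count[j] += 1
    for j in train_idx:
        train_count[j] += 1
    print("fold", i, "test size", len(test_idx), "train size", len(train_idx))

# Every sample is tested exactly once and trained on k-1 times
print(set(test_count), set(train_count))
```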

## Shuffle our dataset

In [0]:
from sklearn.utils import shuffle
dfCombinedShuffle = shuffle(dfCombined,random_state=42)    # by setting the random state we will get reproducible results

X = dfCombinedShuffle.iloc[:, :784].values    # first 784 columns are the pixels (as_matrix was removed in pandas 1.0)
y = dfCombinedShuffle['signal'].values

## Create the folds
You can do the training and folds simultaneously (this is what the book does), but it is more general to make the folds then do the training.  I prefer this approach, although the downside is that you need to store intermediate results, and then average afterwards.

We will use an sklearn function called **StratifiedKFold**.   This will stratify the test and train samples based on the class labels that you give the **split** method of this function.

Note that **split** yields a pair of arrays (train and test, in that order) of **indices** for each fold.  You use these indices to pull the actual samples and labels out of your dataset.
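
To make the index-based interface concrete, here is a small sketch on toy data (the arrays `X_toy` and `y_toy` are invented for illustration). It assumes we also ask `StratifiedKFold` to shuffle, and it shows that each test fold keeps the same signal fraction as the full sample.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data, made up for illustration: 20 samples, 25% signal
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 5 + [0] * 15)

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
signal_fractions = []
for train_index, test_index in skf_toy.split(X_toy, y_toy):
    # split yields (train, test) index arrays, in that order
    X_test_toy, y_test_toy = X_toy[test_index], y_toy[test_index]
    signal_fractions.append(y_test_toy.mean())
    print("test indices", test_index, "signal fraction", y_test_toy.mean())
```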

In [0]:
from sklearn.model_selection import StratifiedKFold
kfolds = 5
skf = StratifiedKFold(n_splits=kfolds)

debug = False
if debug:
  for train_index, test_index in skf.split(X, y):
    print()
    print(X[test_index])
    print(test_index)

## Loop over the folds and fit
Remember that the sklearn function **split** returns **indices** into the dataset for the train and test samples.   We use these to form temporary inputs for our fitting code.   We store the intermediate results from each fold (for example to form averages).

In [0]:
#
# Get our estimator and predict
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

estimator = LinearSVC(random_state=42,dual=False,max_iter=100,tol=0.001)    # use dual=False when  n_samples > n_features which is what we have
#
# Create some vars to keep track of everything
fpr_test_list = []
tpr_test_list = []
thresh_test_list = []
avg_auc_test = 0.0
avg_auc_train = 0.0
numSplits = 0.0
#
# Also keep track of the 
#
# Now loop
for train_index, test_index in skf.split(X, y):
  print("Training")
  X_train = X[train_index]
  y_train = y[train_index]
  X_test = X[test_index]
  y_test = y[test_index]
  
#
# Now fit to our training set
  results = runFitter(estimator,X_train,y_train,X_test,y_test)
#
# 
  fpr_test_list.append(results['fpr_test'])
  tpr_test_list.append(results['tpr_test'])
  thresh_test_list.append(results['thresholds_test'])
  avg_auc_test += results['auc_test']
  avg_auc_train += results['auc_train']
  numSplits += 1.0
  print("   Split ",numSplits,"; test AUC ",results['auc_test'],"; train AUC ",results['auc_train'])
#
avg_auc_test /= numSplits
avg_auc_train /= numSplits
print("average AUC test:  ",avg_auc_test)
print("average AUC train: ",avg_auc_train)

  
  

In [0]:
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()

data = []
for kfold in range(kfolds):
  data.append(go.Scatter(
                x=fpr_test_list[kfold],
                y=tpr_test_list[kfold],
                text=thresh_test_list[kfold],
                mode='lines',
                name='K-fold '+str(kfold)
                ))
  
  

layout = dict(
    title='ROC Curve for Folds of Test Data',
    xaxis=dict(title='FPR'),
    yaxis=dict(title='TPR')
)
iplot(dict(data=data,layout=layout),validate=False)

# What is the primary purpose and use of k-fold validation?
This article explains it best:  
https://machinelearningmastery.com/train-final-machine-learning-model/

I will summarize the main points of why k-fold cross validation is used:

* It is a method used to estimate the performance of a proposed model on unseen data.
* By creating multiple models and testing them on multiple subsets of the data, we can obtain the mean and standard deviation of a performance metric (such as AUC or recall), and these can be used to estimate a confidence interval on the expected performance of your final model on unseen data.
* You can use this same procedure to compare **different** parameter choices for the same underlying model (such as how many layers in an Artificial Neural Network).   The basic idea is this:
    * You loop over a set of choices of parameter settings
    * For each choice, you run k-fold validation.   Store the mean and standard deviation of your performance metric.
    * Choose that parameter setting (or combination of settings) which yield(s) the best performance metric.
* You can use this same procedure to compare **different proposed models** (such as a SVM vs a Decision Tree).
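
As a rough sketch of the parameter-comparison loop described above, the following uses sklearn's `cross_val_score` to get one AUC per fold and then summarizes each setting by its mean and standard deviation. The synthetic data and the particular values of `C` tried here are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the digit dataset used in this notebook)
X_syn, y_syn = make_classification(n_samples=400, n_features=20, random_state=42)

skf_syn = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_summary = {}
for C in [0.01, 0.1, 1.0]:
    est = LinearSVC(C=C, dual=False, random_state=42)
    # One AUC value per fold
    aucs = cross_val_score(est, X_syn, y_syn, cv=skf_syn, scoring="roc_auc")
    cv_summary[C] = (aucs.mean(), aucs.std(ddof=1))
    print("C =", C, " mean AUC =", round(aucs.mean(), 3), " std =", round(aucs.std(ddof=1), 3))

# Pick the setting with the best mean metric
best_C = max(cv_summary, key=lambda c: cv_summary[c][0])
print("Best C by mean AUC:", best_C)
```

When two settings differ by less than their fold-to-fold standard deviation, the difference may not be meaningful, which is one reason to store the spread as well as the mean.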

# What is the "Final" model?
Generally it is **not** one of the models from the k-fold procedure.   Once you are satisfied you have chosen the best model, and the best settings for that model, you discard all of the intermediate models.   You train your chosen model on **ALL** of the data.   It is this **final** model which you then run on new unseen data.
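
A minimal sketch of this last step, assuming cross validation has already led us to a particular setting (the value `C=0.1` and the synthetic data are hypothetical): we take a fresh, unfitted copy of the chosen estimator, fit it once on all available data, and use that single model for new samples.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption): the last 5 rows play the role of "new" samples
X_full, y_full = make_classification(n_samples=305, n_features=10, random_state=0)
X_all, y_all = X_full[:300], y_full[:300]
X_new = X_full[300:]

# Suppose cross validation led us to these settings (hypothetical choice)
chosen = LinearSVC(C=0.1, dual=False, random_state=42)

# clone() returns an unfitted copy with the same parameters
final_model = clone(chosen)
final_model.fit(X_all, y_all)      # train on ALL of the available data

new_predictions = final_model.predict(X_new)
print(new_predictions)
```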


## Example of using cross validation to guide model-making decisions

Usually, when given the opportunity, you use **ALL** of the data to fit your model.   But what if it was very costly (in time or money) to run all of your data?  It might be helpful to know that only a subset of the data is needed for the accuracy you require.   Or maybe you have a small amount of real data, and you are using simulated data to determine how much real data you will need to collect.  In these circumstances you would like to be able to compare your model's performance under these different circumstances.

We start with 2000 total samples (approximately evenly distributed between signal and background), and increase the data we use in increments, until we have all of the data available (again such that we have equal amounts of signal and background).   How do the performance measures of AUC, recall, and precision, vary as we do this?

**NOTE**: This section must be done with the full data - not the "short" data sample!
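
For comparison, sklearn also provides a `learning_curve` function that runs this kind of study in one call. The sketch below uses synthetic data (an assumption) and is not identical to the loop in this notebook: `learning_curve` keeps the test folds fixed and grows only the training portion of each fold, rather than re-splitting a smaller dataset each time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the digit dataset)
X_lc, y_lc = make_classification(n_samples=1000, n_features=20, random_state=42)

train_sizes, train_scores, test_scores = learning_curve(
    LinearSVC(dual=False, random_state=42),
    X_lc, y_lc,
    train_sizes=[160, 320, 480, 640, 800],   # absolute sizes; 800 is the training part of a 5-fold split
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)

# Rows are training sizes, columns are folds: average over the folds
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))
```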

In [0]:
print(len(dfCombined))

In [0]:
from sklearn.utils import shuffle
from sklearn.model_selection import StratifiedKFold
#
# Choose data set sizes we will loop over
dataSizes = [2000,4000,6000,8000,10000,12626]
#
# Define our estimator
estimator = LinearSVC(random_state=42,dual=False,max_iter=200,tol=0.001)    # use dual=False when  n_samples > n_features which is what we have

#
# Define the folds
kfolds = 5
skf = StratifiedKFold(n_splits=kfolds)

#
# Lists for storage of intermediate results
auc_test_list = []
auc_train_list = []
rec_test_list = []
rec_train_list = []
pre_test_list = []
pre_train_list = []

#
# Loop over datasets
for dataSize in dataSizes:
  print("Running data size ",dataSize)
#
# Grab the first "dataSize" rows
  dfUse = dfCombinedShuffle.head(dataSize)

  X = dfUse.iloc[:, :784].values    # as_matrix was removed in pandas 1.0
  y = dfUse['signal'].values
#
# Create some vars to keep track of everything
  avg_auc_test = 0.0
  avg_auc_train = 0.0
  avg_precision_test = 0.0
  avg_precision_train = 0.0  
  avg_recall_test = 0.0
  avg_recall_train = 0.0
  numSplits = 0.0
#
# Now loop over the folds
  for train_index, test_index in skf.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    print("size ",len(X_train))
  
#
# Now fit to our training set
    results = runFitter(estimator,X_train,y_train,X_test,y_test)
#
# 
    avg_auc_test += results['auc_test']
    avg_auc_train += results['auc_train']
    
    avg_precision_test += results['precision_test']
    avg_precision_train += results['precision_train']
    
    avg_recall_test += results['recall_test']
    avg_recall_train += results['recall_train']
    
    numSplits += 1.0
    print("   Split ",numSplits,"; this AUC ",results['auc_test'],results['auc_train'])
#
  avg_auc_test /= numSplits
  avg_auc_train /= numSplits
  avg_precision_test /= numSplits
  avg_precision_train /= numSplits
  avg_recall_test /= numSplits
  avg_recall_train /= numSplits

  auc_test_list.append(avg_auc_test)
  auc_train_list.append(avg_auc_train)
  rec_test_list.append(avg_recall_test)
  rec_train_list.append(avg_recall_train)
  pre_test_list.append(avg_precision_test)
  pre_train_list.append(avg_precision_train)

  
  print("   test/train auc",avg_auc_test,avg_auc_train)


In [0]:
for dataSize,aucTrain,aucTest,recTrain,recTest,preTrain,preTest in zip(dataSizes,auc_train_list,auc_test_list,rec_train_list,rec_test_list,pre_train_list,pre_test_list ):
  print(round(dataSize,3),round(aucTrain,3),round(aucTest,3),round(recTrain,3),round(recTest,3),round(preTrain,3),round(preTest,3))

## Performance plot using AUC

In [0]:
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()

# Create a trace
trace1 = go.Scatter(
    x = dataSizes,
    y = auc_train_list,
    mode = 'lines',
    name = "AUC Train"
)
# Create a trace
trace2 = go.Scatter(
    x = dataSizes,
    y = auc_test_list,
    mode = 'lines',
    name = "AUC Test"
)

data = [trace1,trace2]
layout = dict(
    title='AUC vs Data Set Size',
    xaxis=dict(title='Data Set Size'),
    yaxis=dict(title='AUC')
)
iplot(dict(data=data,layout=layout),validate=False)

# Plot and embed in ipython notebook!
iplot(data, filename='basic-scatter')

## Performance plot using Recall

In [0]:
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()
# Create a trace
trace1 = go.Scatter(
    x = dataSizes,
    y = rec_train_list,
    mode = 'lines',
    name = "Recall Train"
)
# Create a trace
trace2 = go.Scatter(
    x = dataSizes,
    y = rec_test_list,
    mode = 'lines',
    name = "Recall Test"
)

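data = [trace1,trace2]      # rebuild the list, otherwise the AUC traces from the previous cell are plotted again
layout = dict(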
    title='Recall vs Data Set Size',
    xaxis=dict(title='Data Set Size'),
    yaxis=dict(title='Recall')
)
iplot(dict(data=data,layout=layout),validate=False)


# Another test: using ALL of the data for the background
Throughout this workbook, we restricted the background so that it was the same size as the signal data sample.   Was this a good idea?  Let's use k-fold validation to check this!   We already have the result for restricting the data:

In [0]:
print("Data size:      ",dataSizes[-1])
print("AUC test:       ",auc_test_list[-1])
print("Recall test:    ",rec_test_list[-1])
print("Precision test: ",pre_test_list[-1])


How much data do we have?  The dataframes **dfA** and **dfB** contain all of our signal and background:

In [0]:
print("Length dfA",len(dfA))
print("Length dfB",len(dfB))

## Using ALL of the data: an imbalanced Dataset
Using all of the data necessarily means that we will have an imbalanced dataset: whichever digit we pick as our signal, we have roughly 9 times as much background when using all of the other digits as our background.   But we also have a lot more background with which to train our classifier.   Will this help?

In [0]:

dfB = shuffle(dfB)
dfCombined = dfB   # this is ALL of the background
dfCombined = pd.concat([dfCombined, dfA])
dfUse = shuffle(dfCombined)

print("Length ",len(dfUse))

estimator = LinearSVC(random_state=42,dual=False,max_iter=200,tol=0.001)    # use dual=False when  n_samples > n_features which is what we have

from sklearn.model_selection import StratifiedKFold
kfolds = 5
skf = StratifiedKFold(n_splits=kfolds)

X = dfUse.iloc[:, :784].values    # as_matrix was removed in pandas 1.0
y = dfUse['signal'].values
#
# Create some vars to keep track of everything
avg_auc_test = 0.0
avg_auc_train = 0.0
avg_precision_test = 0.0
avg_precision_train = 0.0  
avg_recall_test = 0.0
avg_recall_train = 0.0
numSplits = 0.0


#
# Now loop
for train_index, test_index in skf.split(X, y):
  X_train = X[train_index]
  y_train = y[train_index]
  X_test = X[test_index]
  y_test = y[test_index]
  print("size ",len(X_train))
  
#
# Now fit to our training set
  results = runFitter(estimator,X_train,y_train,X_test,y_test)
#
# 
  avg_auc_test += results['auc_test']
  avg_auc_train += results['auc_train']
    
  avg_precision_test += results['precision_test']
  avg_precision_train += results['precision_train']
    
  avg_recall_test += results['recall_test']
  avg_recall_train += results['recall_train']
    
  numSplits += 1.0
  print("   Split ",numSplits,"; this AUC ",results['auc_test'],results['auc_train'])
  print("                      ; this Rec ",results['recall_test'],results['recall_train'])
  print("                      ; this Pre ",results['precision_test'],results['precision_train'])
#
avg_auc_test /= numSplits
avg_auc_train /= numSplits
avg_precision_test /= numSplits
avg_precision_train /= numSplits
avg_recall_test /= numSplits
avg_recall_train /= numSplits

print("Data size:      ",len(dfUse))
print("AUC test:       ",avg_auc_test)
print("Recall test:    ",avg_recall_test)
print("Precision test: ",avg_precision_test)


## Rebalance the data?
Let's try something slightly different.   We will use ALL of the background (except for a small part) and ALL of one digit (again except for a small part).   We will then directly resample the single-digit sample, reusing the data (nine copies in the code below) until it is roughly comparable in size to the background.

We will train and test on this new sample.   The two small portions we removed will be combined into a **validation** sample (called dfCombinedTest below) which we will use to test if this procedure is reasonable.
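
One caveat worth keeping in mind: if rows are duplicated before the folds are formed, copies of the same row can end up in both the training and testing folds, so the cross-validation numbers may look better than they should. This is one reason the held-out validation sample is useful. As a hedged alternative sketch (synthetic data and all variable names here are assumptions, not the notebook's code), the resampling can be done inside each fold, on the training portion only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.utils import resample

# Synthetic imbalanced stand-in data (an assumption): roughly 10% signal
X_imb, y_imb = make_classification(n_samples=2000, n_features=20,
                                   weights=[0.9, 0.1], random_state=42)

skf_imb = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_aucs = []
for train_index, test_index in skf_imb.split(X_imb, y_imb):
    X_tr, y_tr = X_imb[train_index], y_imb[train_index]
    X_te, y_te = X_imb[test_index], y_imb[test_index]

    # Resample the signal rows (with replacement) up to the background count,
    # using the training fold only so the test fold stays untouched
    sig, bkg = X_tr[y_tr == 1], X_tr[y_tr == 0]
    sig_up = resample(sig, replace=True, n_samples=len(bkg), random_state=42)
    X_bal = np.vstack([bkg, sig_up])
    y_bal = np.concatenate([np.zeros(len(bkg), dtype=int),
                            np.ones(len(sig_up), dtype=int)])

    est = LinearSVC(dual=False, random_state=42).fit(X_bal, y_bal)
    fold_aucs.append(roc_auc_score(y_te, est.decision_function(X_te)))

print("mean test AUC:", np.mean(fold_aucs))
```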

In [0]:
print("Length dfA",len(dfA))
print("Length dfB",len(dfB))
dfB = shuffle(dfB)
dfB1 = dfB.head(62687)
dfB2 = dfB.tail(1000)
dfA1 = dfA.head(5313)
dfA2 = dfA.tail(1000)
# All of the background except the 1000 held-out rows, followed by nine copies of the signal sample
dfCombinedNew = pd.concat([dfB1] + [dfA1] * 9)
dfUseNew = shuffle(dfCombinedNew)

dfCombinedTest = dfB2
dfCombinedTest = pd.concat([dfCombinedTest, dfA2])
print("Length ",len(dfCombinedTest))

## Now use cross validation to estimate performance of the rebalanced sample

In [0]:
estimator = LinearSVC(random_state=42,dual=False,max_iter=200,tol=0.001)    # use dual=False when  n_samples > n_features which is what we have



from sklearn.model_selection import StratifiedKFold
kfolds = 5
skf = StratifiedKFold(n_splits=kfolds)

X = dfUseNew.iloc[:, :784].values    # as_matrix was removed in pandas 1.0
y = dfUseNew['signal'].values
#
# Create some vars to keep track of everything
avg_auc_test = 0.0
avg_auc_train = 0.0
avg_precision_test = 0.0
avg_precision_train = 0.0  
avg_recall_test = 0.0
avg_recall_train = 0.0
numSplits = 0.0


#
# Now loop
for train_index, test_index in skf.split(X, y):
  X_train = X[train_index]
  y_train = y[train_index]
  X_test = X[test_index]
  y_test = y[test_index]
  print("size ",len(X_train))
  
#
# Now fit to our training set
  results = runFitter(estimator,X_train,y_train,X_test,y_test)
#
# 
  avg_auc_test += results['auc_test']
  avg_auc_train += results['auc_train']
    
  avg_precision_test += results['precision_test']
  avg_precision_train += results['precision_train']
    
  avg_recall_test += results['recall_test']
  avg_recall_train += results['recall_train']
    
  numSplits += 1.0
  print("   Split ",numSplits,"; this AUC ",results['auc_test'],results['auc_train'])
  print("                      ; this Rec ",results['recall_test'],results['recall_train'])
  print("                      ; this Pre ",results['precision_test'],results['precision_train'])
#
avg_auc_test /= numSplits
avg_auc_train /= numSplits
avg_precision_test /= numSplits
avg_precision_train /= numSplits
avg_recall_test /= numSplits
avg_recall_train /= numSplits

print("Data size:      ",len(dfUseNew))
print("AUC test:       ",avg_auc_test)
print("Recall test:    ",avg_recall_test)
print("Precision test: ",avg_precision_test)


## Train using the full rebalanced sample, and test using the validation sample

In [0]:
#
# Get our estimator and predict
from sklearn.svm import LinearSVC
estimator = LinearSVC(random_state=42,dual=False,max_iter=200,tol=0.001)    # use dual=False when  n_samples > n_features which is what we have



X_train = dfCombinedNew.iloc[:, :784].values    # as_matrix was removed in pandas 1.0
y_train = dfCombinedNew['signal'].values
X_test = dfCombinedTest.iloc[:, :784].values
y_test = dfCombinedTest['signal'].values
#
  
#
# Now fit to our training set
results = runFitter(estimator,X_train,y_train,X_test,y_test)

print("Results using full sample and held-out sample: ")
print("AUC test:       ",results['auc_test'])
print("Recall test:    ",results['recall_test'])
print("Precision test: ",results['precision_test'])
print("AUC train:       ",results['auc_train'])
print("Recall train:    ",results['recall_train'])
print("Precision train: ",results['precision_train'])

