# Tiny Training Trial
This notebook supports the [Tiny Training Trial](https://w.amazon.com/bin/view/MLSciences/Community/ML_Challenge/TinyTraining/). The associated Leaderboard is available [here](https://leaderboard.corp.amazon.com/tasks/292).

## Loading the Data into Eider:
Several people asked about the best way to import the data into Eider, so here is a snippet that makes it trivial. It loads the dataset from S3 and takes a first look at it. Loading from S3 this way is highly recommended, since it avoids a needless local import.

First, make sure your credentials are set to ```ml-eider-shared-1```, then load the data.

In [0]:
### Download Training Data and Test Features ###

import pandas as pd
eider.s3.download('s3://eider-datasets/mlu/projects/DontOverFitChallenge/TTT_train.csv', '/tmp/TTT_train.csv')
eider.s3.download('s3://eider-datasets/mlu/projects/DontOverFitChallenge/TTT_test_features.csv', '/tmp/TTT_test_features.csv')

train = pd.read_csv('/tmp/TTT_train.csv')
test = pd.read_csv('/tmp/TTT_test_features.csv',index_col = 'ID')

In [0]:
#what is the data like?
#train.describe()
'''
is_class0 = train['label'] == 0
is_class8 = train['label'] == 8
is_class9 = train['label'] == 9

is_class1 = train['label'] == 1
is_class2 = train['label'] == 2
is_class3 = train['label'] == 3
is_class4 = train['label'] == 4
is_class5 = train['label'] == 5
is_class6 = train['label'] == 6
is_class7 = train['label'] == 7

train_class0 = train[is_class0]
train_class8 = train[is_class8]
train_class9 = train[is_class9]
dataframes089 = [train_class0, train_class8, train_class9]

train_class1 = train[is_class1]
train_class2 = train[is_class2]
train_class3 = train[is_class3]
train_class4 = train[is_class4]
train_class5 = train[is_class5]
train_class6 = train[is_class6]
train_class7 = train[is_class7]
dataframes_remaining = [train_class1, train_class2, train_class3, train_class4, train_class5, train_class6, train_class7]

#selective_train = pd.concat(dataframes089)
selective_train = pd.concat(dataframes_remaining)

print selective_train.shape
print selective_train.head(4)
selective_train['label'].hist()
'''
train['label'].hist()
print(train.head(n=10))
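
The disabled block above builds one boolean mask per class to pick out subsets of labels. As a hedged alternative, a single `isin` mask does the same job. The sketch below uses a toy DataFrame (`toy`, made up for illustration) in place of `train`; only the `label` column name is taken from the notebook.

```python
import pandas as pd

# Toy stand-in for `train`; only the 'label' column matters here.
toy = pd.DataFrame({"label": [0, 1, 2, 8, 9, 3, 0, 7], "f0": range(8)})

# One boolean mask replaces the per-class masks.
focus = toy[toy["label"].isin([0, 8, 9])]
rest = toy[~toy["label"].isin([0, 8, 9])]

print(focus["label"].value_counts().sort_index())
print(rest.shape)
```

The same two lines should work on `train` directly, with the class list adjusted to whichever labels are of interest.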

In [0]:
#import modules
import numpy as np
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import uniform, randint

from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

#classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV

from sklearn.pipeline import Pipeline

#gridsearch
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

In [0]:
print(train.label.value_counts())
#selective_train only exists when the disabled block in the data-exploration cell is enabled
#print(selective_train.label.value_counts())


In [0]:

#split data and labels
Features, labels = train.drop(["label"], axis = 1), train["label"]
print("Features-labels shape", Features.shape, labels.shape, test.shape)
Features_train, Features_test, labels_train, labels_test = train_test_split(Features, labels, test_size = 0.2, random_state = 54321)
print("train-test shape", Features_train.shape, Features_test.shape, labels_train.shape, labels_test.shape)

'''
#split selective train that is focussed on class 0, 8 and 9
selective_Features, selective_labels = selective_train.drop(["label"], axis = 1), selective_train["label"]
selective_Features_train, selective_Features_test, selective_labels_train, selective_labels_test = train_test_split(selective_Features, selective_labels, test_size = 0.2, random_state = 42)
print("selective-Features-labels shape", selective_Features.shape, selective_labels.shape, test.shape)
print("selective-train-test shape", selective_Features_train.shape, selective_Features_test.shape, selective_labels_train.shape, selective_labels_test.shape)

Features = selective_Features
labels = selective_labels

Features_train = selective_Features_train
Features_test = selective_Features_test
labels_train = selective_labels_train
labels_test = selective_labels_test
print("Features-labels shape", Features.shape, labels.shape, test.shape)
print("train-test shape", Features_train.shape, Features_test.shape, labels_train.shape, labels_test.shape)
'''
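
The split above is purely random, so with few training rows the class mix in the hold-out set can drift from the full data. A possible variation, not what this notebook does, is to pass `stratify` to `train_test_split`. The sketch below uses synthetic arrays (`X`, `y`) invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 rows of class 0, 20 rows of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=54321, stratify=y)

# Both parts keep the original 80/20 class ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```

In the notebook this would correspond to adding `stratify=labels` to the existing call.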

In [0]:
#pretty print best scores for hypertuning
def report_best_scores(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
           for i in np.arange(1, 2 ** num_classifiers)]
    

def Stacking(model,train,y,test,n_fold):
  #returns (test predictions, out-of-fold predictions for train)
  #random_state is omitted: it has no effect unless shuffle=True
  folds=StratifiedKFold(n_splits=n_fold)
  #preallocate so every out-of-fold prediction lands on its original row position
  train_pred=np.empty(train.shape[0],float)
  for train_indices,val_indices in folds.split(train,y.values):
    x_train,x_val=train.iloc[train_indices],train.iloc[val_indices]
    y_train=y.iloc[train_indices]
    model.fit(X=x_train,y=y_train)
    train_pred[val_indices]=model.predict(x_val)

  #test predictions come from the model fitted on the last fold only
  test_pred=model.predict(test).astype(float)
  return test_pred.reshape(-1,1),train_pred
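
The `Stacking` helper gathers, for each training row, a prediction from a model that was not fitted on that row. scikit-learn's `cross_val_predict` produces the same kind of out-of-fold predictions and returns them in the original row order. The sketch below is an illustration on synthetic data, with `LogisticRegression` standing in for the notebook's base models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic 3-class problem standing in for Features_train / labels_train.
X, y = make_classification(n_samples=120, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

folds = StratifiedKFold(n_splits=5)

# oof[i] is predicted by a model that never saw row i during fitting.
oof = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=folds)

print(oof.shape)
print("out-of-fold accuracy:", np.mean(oof == y))
```

Because `oof` is aligned with the rows of `X`, it can be placed next to the original features as a meta-feature column without reordering.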

In [0]:

#stacking ensemble
rf = RandomForestClassifier(n_estimators=500, max_features=26, min_samples_split=4, bootstrap=False, criterion="gini", max_depth=None)
test_pred_rf ,train_pred_rf=Stacking(model=rf,n_fold=10, train=Features_train,test=Features_test,y=labels_train)
train_pred_rf = pd.DataFrame(train_pred_rf)
test_pred_rf = pd.DataFrame(test_pred_rf)

svc = SVC(kernel='linear', C=2.2, gamma=1, probability=True)
test_pred_svc ,train_pred_svc=Stacking(model=svc,n_fold=10, train=Features_train,test=Features_test,y=labels_train)
train_pred_svc = pd.DataFrame(train_pred_svc)
test_pred_svc = pd.DataFrame(test_pred_svc)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=None, random_state=42)
test_pred_gb ,train_pred_gb=Stacking(model=gb,n_fold=10, train=Features_train,test=Features_test,y=labels_train)
train_pred_gb = pd.DataFrame(train_pred_gb)
test_pred_gb = pd.DataFrame(test_pred_gb)

xgb = XGBClassifier(objective="binary:logistic", colsample_bytree=0.9161548184174394, learning_rate=0.074321727571868, n_estimators=102, subsample=0.9377762690534515, max_depth=4, gamma=0.031170681255620725, scale_pos_weight=3)
test_pred_xgb ,train_pred_xgb=Stacking(model=xgb,n_fold=10, train=Features_train,test=Features_test,y=labels_train)
train_pred_xgb = pd.DataFrame(train_pred_xgb)
test_pred_xgb = pd.DataFrame(test_pred_xgb)

#one prediction column per base model, in the order rf, svc, gb, xgb
df = pd.concat([train_pred_rf, train_pred_svc, train_pred_gb, train_pred_xgb], axis=1)
df_test = pd.concat([test_pred_rf, test_pred_svc, test_pred_gb, test_pred_xgb], axis=1)

print(df.head())
print(Features_train.head())

df.columns = ["rf_pred", "svc_pred", "gb_pred", "xgb_pred"]
df_test.columns = ["rf_pred", "svc_pred", "gb_pred", "xgb_pred"]
data = pd.concat([df, Features_train.reset_index().drop(["index"], axis = 1)], axis = 1)
data_test = pd.concat([df_test, Features_test.reset_index().drop(["index"], axis = 1)], axis = 1)
print(data.head(n=50))

#model = LogisticRegression(random_state=42)
#model = RandomForestClassifier(n_estimators=10, max_features=26, min_samples_split=4, bootstrap=False, criterion="gini", max_depth=None)
#model = SVC(kernel='linear', C=2.2, gamma=1, probability=True)
#model = XGBClassifier(objective="binary:logistic", colsample_bytree=0.9161548184174394, learning_rate=0.074321727571868, n_estimators=102, subsample=0.9377762690534515, max_depth=4, gamma=0.031170681255620725, scale_pos_weight=3)
model = GradientBoostingClassifier(n_estimators=102, learning_rate=0.074321727571868, max_depth=4, random_state=42)
model.fit(data,labels_train)
labels_pred = model.predict(data_test)

print(accuracy_score(labels_test, labels_pred))
print(confusion_matrix(labels_test, labels_pred))

#note: this cross-validates the meta-model on the raw features only, not the full stack
print(stats.describe(cross_val_score(model, Features, labels, cv=5, scoring='accuracy')))

print(model.score(data_test, labels_test))

In [0]:
#extreme boost classifier 
xgb = XGBClassifier(objective="binary:logistic", colsample_bytree=0.9161548184174394, learning_rate=0.074321727571868, n_estimators=102, subsample=0.9377762690534515, max_depth=4, gamma=0.031170681255620725, scale_pos_weight=3)
#xgb = XGBClassifier(objective="binary:logistic")
'''
params = {
    "colsample_bytree": uniform(0.7, 0.3),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3), # default 0.1 
    "max_depth": randint(2, 6), # default 3
    "n_estimators": randint(100, 150), # default 100
    "subsample": uniform(0.6, 0.4),
    "min_child_weight": [5,6],
    "scale_pos_weight": [1,2,3,4]
}
print "searching...."
search = RandomizedSearchCV(xgb, param_distributions=params, random_state=42, n_iter=200, cv=5, verbose=1, n_jobs=-1, return_train_score=True)
print "fitting...."
search.fit(Features, labels)
print "reporting...."
report_best_scores(search.cv_results_, 1)
'''
xgb.fit(Features_train, labels_train)
labels_pred = xgb.predict(Features_test)
print(accuracy_score(labels_test, labels_pred))
plt.matshow(confusion_matrix(labels_test, labels_pred))
plt.title('Confusion matrix - XGBClassifier')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

print(confusion_matrix(labels_test, labels_pred))

stats.describe(cross_val_score(xgb, Features, labels, cv=5, scoring='accuracy'))
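
The disabled search block above describes its parameter ranges with `scipy.stats` distributions, whose arguments are easy to misread. `uniform(loc, scale)` samples from `[loc, loc + scale]`, so `uniform(0.7, 0.3)` covers 0.7 to 1.0, and `randint(low, high)` excludes `high`. A small sketch to check this, with arbitrary sample sizes and seeds:

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], here [0.7, 1.0].
col = uniform(0.7, 0.3).rvs(size=1000, random_state=0)

# randint(low, high) draws integers from low up to high - 1, here 2..5.
depth = randint(2, 6).rvs(size=1000, random_state=0)

print(col.min(), col.max())
print(sorted(set(depth)))
```

By the same rule, `uniform(0.03, 0.3)` for `learning_rate` spans 0.03 to 0.33.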


In [0]:
#Naive Bayes
mNB = MultinomialNB(alpha=0.22, fit_prior=True, class_prior=None)
mNB.fit(Features_train, labels_train)
labels_pred = mNB.predict(Features_test)

print(accuracy_score(labels_test, labels_pred))
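print(confusion_matrix(labels_test, labels_pred))
print(stats.describe(cross_val_score(mNB, Features, labels, cv=5, scoring='accuracy')))
#log_loss needs class probabilities, not hard labels
log_loss(labels_test, mNB.predict_proba(Features_test), labels=mNB.classes_)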
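
Log loss scores predicted class probabilities rather than hard labels: it is the mean negative log of the probability given to the true class. The sketch below spells out that computation with a hypothetical helper, `multiclass_log_loss`, and made-up probabilities. It assumes the labels are integers 0..K-1 that index the probability columns.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip so that a probability of exactly 0 does not produce log(0).
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(proba[rows, np.asarray(y_true)]))

y_true = [0, 2, 1]
proba = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8],
         [0.2, 0.5, 0.3]]

loss = multiclass_log_loss(y_true, proba)
print(loss)
```

A confident wrong prediction is penalised heavily, which is why this metric pairs naturally with `predict_proba` output.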

In [0]:
#SVM Classifier

#svm = SVC(kernel='linear', C=2.21, gamma=1)
svm = SVC(kernel='linear', C=2.2, gamma=0.001)
'''
params = {
        'C': randint(1,10),
        'gamma': [0.0001, 0.001, 0.01, 1, 10],
        'kernel': ['linear','rbf']
    }
search = RandomizedSearchCV(svm, param_distributions=params, random_state=42, n_iter=200, cv=5, verbose=1, n_jobs=-1, return_train_score=True)
print "fitting...."
search.fit(Features, labels)
print "reporting...."
report_best_scores(search.cv_results_, 1)
'''
svm.fit(Features_train, labels_train)
labels_pred = svm.predict(Features_test)
print(accuracy_score(labels_test, labels_pred))
print(confusion_matrix(labels_test, labels_pred))
stats.describe(cross_val_score(svm, Features, labels, cv=5, scoring='accuracy'))

In [0]:
rf = RandomForestClassifier(n_estimators=500, max_features=25, min_samples_split=9, bootstrap=False, criterion="gini", max_depth=None)
'''rf = RandomForestClassifier(n_estimators=500)
params = {"max_depth": [3, None],
              "max_features": randint(1, 30),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

print "searching...."
search = RandomizedSearchCV(rf, param_distributions=params, random_state=42, n_iter=200, cv=5, verbose=1, n_jobs=-1, return_train_score=True)
print "fitting...."
search.fit(Features, labels)
print "reporting...."
report_best_scores(search.cv_results_, 1)

'''
rf.fit(Features_train, labels_train)
labels_pred = rf.predict(Features_test)
print(accuracy_score(labels_test, labels_pred))
plt.matshow(confusion_matrix(labels_test, labels_pred))
plt.title('Confusion matrix - Random Forest')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(confusion_matrix(labels_test, labels_pred))
	
stats.describe(cross_val_score(rf, Features, labels, cv=5, scoring='accuracy'))


In [0]:
#Weighted soft-voting ensemble: Random Forest + SVM + Gradient Boosting + XGBoost
rf = RandomForestClassifier(n_estimators=500, max_features=26, min_samples_split=4, bootstrap=False, criterion="gini", max_depth=None)
svc = SVC(kernel='linear', C=2.2, gamma=1, probability=True)
#xgb = XGBClassifier(objective="binary:logistic", colsample_bytree=0.9514986114133412, learning_rate=0.15444585070129954, n_estimators=147, subsample=0.9458889505020213, max_depth=2, gamma=0.23434657989748514, scale_pos_weight=3)
#rf = RandomForestClassifier(n_estimators=500, max_features=25, min_samples_split=9, bootstrap=False, criterion="gini", max_depth=None)
#svc = SVC(kernel='linear', C=2.2, gamma=0.001, probability=True)
xgb = XGBClassifier(objective="binary:logistic", colsample_bytree=0.9161548184174394, learning_rate=0.074321727571868, n_estimators=102, subsample=0.9377762690534515, max_depth=4, gamma=0.031170681255620725, scale_pos_weight=3)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=None, random_state=42)

'''
classifiers = [
    ("rf", rf),
    ("svc", svc)
    #("xgb", xgb)
]

mixclf = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])

param_grid = dict(
    #voting__weights=combinations_on_off(len(classifiers))
    voting__weights=[[5,3],[4.1,3],[4.2,3],[4.3,3],[4.4,3],[4.5,3],[4.6,3],[4.7,3],[4.8,3],[4.9,3],[4.1,2.8],[4.2,2.8],[4.3,2.8],[4.4,2.8],[4.5,2.8],[4.6,2.8],[4.7,2.8],[4.8,2.8],[4.9,2.8]]
)

grid_search = GridSearchCV(mixclf, param_grid=param_grid, n_jobs=-1, verbose=10, scoring="neg_log_loss")

grid_search.fit(Features_train, labels_train)

cv_results = grid_search.cv_results_

for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

'''    
#mixclf = VotingClassifier(estimators=[('rf', rf), ('svc', svc)], voting='soft', weights=[4.95, 2.91])
mixclf = VotingClassifier(estimators=[('rf', rf), ('svc', svc),('gb',gb),('xgb',xgb)], voting='soft', weights=[4.94,2.9,1,2])

mixclf.fit(Features_train, labels_train)
labels_pred = mixclf.predict(Features_test)
print(accuracy_score(labels_test, labels_pred))

plt.matshow(confusion_matrix(labels_test, labels_pred))
plt.title('Confusion matrix - Voting Classifier')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
print(confusion_matrix(labels_test, labels_pred))

#stats.describe(cross_val_score(mixclf, Features, labels, cv=5, scoring='accuracy'))
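
With `voting='soft'`, the ensemble averages each model's class probabilities using the given weights and then picks the class with the highest average. The sketch below reproduces that arithmetic in NumPy for two models. The probability arrays are invented for illustration; only the weights 4.94 and 2.9 come from the cell above.

```python
import numpy as np

# Hypothetical predict_proba outputs of two models: 2 samples, 3 classes.
p_rf = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3]])
p_svc = np.array([[0.3, 0.6, 0.1],
                  [0.1, 0.2, 0.7]])
weights = [4.94, 2.9]

# Weighted mean over the model axis; each row still sums to 1.
avg = np.average([p_rf, p_svc], axis=0, weights=weights)

# The predicted class is the one with the highest averaged probability.
pred = avg.argmax(axis=1)
print(avg)
print(pred)
```

Only the ratio between the weights matters, since the average is normalised by their sum.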


In [0]:
#what is the test data like?
test.head()

## Outputting data
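Here is a quick snippet of code to write the data to disk, and then we'll talk about how to upload it to Leaderboard. The cells below refit the voting ensemble (`mixclf`) on all of the training data and write its predictions for the test features.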

In [0]:
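#drop a 'label' column left over from an earlier run of the next cell; do nothing if it is absent
test = test.drop(["label"], axis=1, errors="ignore")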
#test =test.reset_index()
print(test.shape)
test.head()

In [0]:
mixclf.fit(Features, labels)
test['label'] = mixclf.predict(test)
test[['label']].to_csv('/tmp/TTT_fake_sub_trial28.csv', index=True, index_label = 'ID')
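
Before uploading, it can help to read the file back and confirm its shape. The sketch below assumes the submission should contain exactly an `ID` column and a `label` column, as written by the cell above. The actual Leaderboard requirements are not stated in this notebook, so treat these checks as a starting point. The toy frame and file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy predictions indexed by ID, shaped like test[['label']].
sub = pd.DataFrame({"label": [3, 0, 9]},
                   index=pd.Index([101, 102, 103], name="ID"))

path = os.path.join(tempfile.mkdtemp(), "submission_check.csv")
sub.to_csv(path, index=True, index_label="ID")

# Read the file back and run basic sanity checks.
check = pd.read_csv(path)
assert list(check.columns) == ["ID", "label"]
assert check["ID"].is_unique
assert check["label"].notnull().all()
print(check.shape)
```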

## Getting our model output out of Eider and into Leaderboard
Great. We now have a sample submission in Eider. Export it locally and then upload it to Leaderboard in the following steps:
1. Within the Eider console top bar, select [Files](https://eider.corp.amazon.com/file)
2. You should now see 'Files', 'TMP' and 'Exported notebooks' tabs. 
3. Select 'TMP' then select 'Connect to workspace'. You should now see any files from your last run of your workspace. If there was no 'Connect to workspace' option, your files from the last run should already be present. *Files in the 'TMP' should be considered temporary as they will expire after an hour's worth of idle time.*
4. Go to the submission file written above (```TTT_fake_sub_trial28.csv``` in this notebook) and select Save
5. This file will now be permanently saved to your Eider account and available for local download.
6. Go to the 'Files' tab, and click 'download' to save it to your local machine.

We now have our model's output .csv and are ready to upload it to Leaderboard:
1. Search for your [Leaderboard instance](https://leaderboard.corp.amazon.com/tasks/292) and go to the 'Make a Submission' section
2. Upload your local file and include your notebook version URL for tracking.
3. Your score on the public leaderboard should now appear. 

The private leaderboard contains the vast majority of the data, and so your final rankings in this competition will be a bit of a surprise! Take care and avoid overfitting!