# Final Project - Zarina

## Choosing the models
I wanted to focus on k-mer data, so both of the models I chose are trained on k-mers instead of presence/absence data. I was originally interested in the gradient boosting model and, given the processing power available to me, chose basic decision trees as my base estimator. I did not want to use linear models because I could not guarantee that the k-mers would satisfy the linear regression assumptions. For curiosity's sake, I also built a random forest model to see how it compares to the gradient boosting model with the same base estimator.

I am using KNN (K Nearest Neighbours) as a baseline model. In week 4, we found that it has a peak accuracy of 84%, so I am hoping to build a model with an accuracy greater than 84%.

### Imports

In [1]:
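import pandas as pd
import numpy as np
import sklearn
import sklearn.metrics  # Explicit import for balanced_accuracy_score
import sklearn.model_selection  # Explicit import for StratifiedKFold
from sklearn import tree
from sklearn import ensemble
import bayes_opt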

### Load data

In [2]:
seed = 130

def load_data():
    
    # Load Kmer data
    train_kmers = np.load('../data/train_test_data/train_kmers.npy', allow_pickle=True)
    test_kmers = np.load('../data/train_test_data/test_kmers.npy', allow_pickle=True)

    # Load target data & IDs
    y_train = np.load('../data/train_test_data/y_train.npy', allow_pickle=True)
    y_train_ids = np.load('../data/train_test_data/train_ids.npy', allow_pickle=True).astype(str)
    y_test_ids = np.load('../data/train_test_data/test_ids.npy', allow_pickle=True).astype(str)
    
    return train_kmers, test_kmers, y_train, y_train_ids, y_test_ids

X_train_kmers, X_test_kmers, y_train, y_train_ids, y_test_ids = load_data()
y_train = y_train.reshape(-1)

### K-Fold CV
Below is a stratified K-fold cross-validation split. I stratified the folds so that each one keeps roughly the same class proportions as the full training set.

In [3]:
K = 3

kfold = sklearn.model_selection.StratifiedKFold(
    n_splits = K,
    shuffle = True, # Want to shuffle as seen in slides
    random_state = seed, # To ensure reproducible results
)

kfold_dfs = {}
val_idx = {}
for i, (train_index, val_index) in enumerate(kfold.split(X_train_kmers, y_train)):
    
    val_idx[i] = val_index

    kfold_dfs[i] = (X_train_kmers[train_index], X_train_kmers[val_index], y_train[train_index], y_train[val_index])

# Specify fold 0
X_train_fold_0 = kfold_dfs[0][0]
X_val_fold_0 = kfold_dfs[0][1]
y_train_fold_0 = kfold_dfs[0][2]
y_val_fold_0 = kfold_dfs[0][3]

# Specify fold 1
X_train_fold_1 = kfold_dfs[1][0]
X_val_fold_1 = kfold_dfs[1][1]
y_train_fold_1 = kfold_dfs[1][2]
y_val_fold_1 = kfold_dfs[1][3]

# Specify fold 2
X_train_fold_2 = kfold_dfs[2][0]
X_val_fold_2 = kfold_dfs[2][1]
y_train_fold_2 = kfold_dfs[2][2]
y_val_fold_2 = kfold_dfs[2][3]
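
To illustrate what stratification buys us, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not sklearn's StratifiedKFold implementation: the helper name stratified_folds and the class counts (60 "S", 30 "R") are hypothetical and chosen only to make the proportions easy to check.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed):
    """Assign each sample index to one of k folds, one class at a time."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class round-robin so every fold gets a near-equal share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced labels (not the real training data)
labels = ["S"] * 60 + ["R"] * 30
folds = stratified_folds(labels, k=3, seed=130)

for i, fold in enumerate(folds):
    counts = Counter(labels[idx] for idx in fold)
    print(f"fold {i}: {dict(counts)}")
```

Every fold ends up with 20 "S" and 10 "R" samples, the same 2:1 ratio as the full label list, which is the property the stratified split above relies on.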

## Building a Random Forest Model
A random forest combines many decision trees, which makes it more robust than a single decision tree. The hyperparameters of interest are the number of trees in the forest (n_estimators) and the maximum tree depth (max_depth).

In [4]:
def random_forest_model_fit_0(numberTrees, sizeTrees):
    # Create model
    randomForestModel = ensemble.RandomForestClassifier(n_estimators=int(numberTrees), max_depth=int(sizeTrees), random_state = seed)
    
    # Fit model
    randomForestModel.fit(X_train_fold_0, y_train_fold_0)

    # Evaluate the model and return the evaluation score
    score = sklearn.metrics.balanced_accuracy_score(y_val_fold_0, randomForestModel.predict(X_val_fold_0))
    
    return score
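
The score returned above is a balanced accuracy, which is the mean of the per-class recalls rather than the overall fraction of correct predictions. The hand-rolled sketch below shows the difference on a small made-up validation fold; the function names and the labels are hypothetical, and in the notebook the value comes from sklearn.metrics.balanced_accuracy_score.

```python
def plain_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)

# Hypothetical validation fold: 8 "S" samples and 2 "R" samples
y_true = ["S"] * 8 + ["R"] * 2
# A model that predicts "S" for every sample
y_pred = ["S"] * 10

print(plain_accuracy(y_true, y_pred))     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A model that ignores the minority class still looks good on plain accuracy but scores no better than chance on balanced accuracy, which is why the latter is a reasonable target for the optimizer when the classes are imbalanced.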

### Model Optimization
I went with a Bayesian optimizer to find the hyperparameters for my random forest model.

In [5]:
# Bounded region of parameter space
parameter_limits = {'numberTrees': (1, 50), 'sizeTrees': (1, 25)}

optimizer0 = bayes_opt.BayesianOptimization(
    f = random_forest_model_fit_0,
    pbounds = parameter_limits,
    random_state=seed,
)

# Fit the model using our custom optimizer
optimizer0.maximize(
    init_points=10, # Arbitrary larger number to increase spread
    n_iter=20, # Arbitrary large-ish number to optimize search
)

|   iter    |  target   | number... | sizeTrees |
-------------------------------------------------
| 1         | 0.8125    | 7.94      | 8.477     |
| 2         | 0.8325    | 28.18     | 4.019     |
| 3         | 0.8325    | 25.2      | 23.11     |
| 4         | 0.8425    | 19.61     | 4.035     |
| 5         | 0.835     | 49.94     | 19.5      |
| 6         | 0.82      | 36.78     | 1.759     |
| 7         | 0.82      | 40.71     | 2.236     |
| 8         | 0.855     | 28.99     | 9.147     |
| 9         | 0.815     | 11.67     | 19.34     |
| 10        | 0.8375    | 47.44     | 17.95     |
| 11        | 0.815     | 25.69     | 12

### Best Random Forest Model for Fold 0
Based on the above, the best model on fold 0 is a forest of 28 trees with a maximum depth of 9, resulting in a balanced accuracy of 0.855.

I refit this model on the full training set and submitted its test predictions to Kaggle, which gave a private/public score of 0.95625/0.91875.
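
The optimizer proposes floating-point values, so the 28 trees and depth of 9 quoted above come from the int(...) casts in the objective function. The sketch below shows that conversion; the dict is hypothetical and only assumed to mirror the shape of the optimizer's best-result record, with values copied from iteration 8 of the fold 0 log.

```python
# Hypothetical record of the best iteration (values from iteration 8 of the fold 0 log)
best = {"target": 0.855, "params": {"numberTrees": 28.99, "sizeTrees": 9.147}}

# int() truncates toward zero, matching the casts inside the objective function
n_estimators = int(best["params"]["numberTrees"])
max_depth = int(best["params"]["sizeTrees"])
print(n_estimators, max_depth)  # 28 9

# Rounding instead of truncating would give a different forest size
print(round(best["params"]["numberTrees"]))  # 29
```

Because of the truncation, every proposal between 28.0 and 28.99 trains the same 28-tree forest, so the refit model should use the truncated values rather than rounded ones.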

In [6]:
randomForestFold0 = ensemble.RandomForestClassifier(n_estimators=28, max_depth=9, random_state = seed)
randomForestFold0.fit(X_train_kmers, y_train)

# Make test predictions and save out as a dataframe
test_preds = randomForestFold0.predict(X_test_kmers)

# Save
test_preds_df = pd.DataFrame(data={"genome_id":y_test_ids, "y_pred":test_preds})
test_preds_df.to_csv("randomForestFold0.csv", index=False) # IMPORTANT: Do not save the index
test_preds_df.head()

   genome_id y_pred
0  562.42833      R
1  562.42739      R
2  562.22823      S
3  562.45646      S
4  562.22547      S


### Modeling the Random Forest Model on Fold 1

In [7]:
def random_forest_model_fit_1(numberTrees, sizeTrees):
    
    randomForestModel = ensemble.RandomForestClassifier(n_estimators=int(numberTrees), max_depth=int(sizeTrees), random_state = seed)
    randomForestModel.fit(X_train_fold_1, y_train_fold_1)
    score = sklearn.metrics.balanced_accuracy_score(y_val_fold_1, randomForestModel.predict(X_val_fold_1))

    return score

# We do not need to restate the parameter limits as they are the same

optimizer1 = bayes_opt.BayesianOptimization(
    f = random_forest_model_fit_1,
    pbounds = parameter_limits,
    random_state=seed,
)

optimizer1.maximize(
    init_points=10, # Arbitrary larger number to increase spread
    n_iter=20, # Arbitrary large-ish number to optimize search
)

|   iter    |  target   | number... | sizeTrees |
-------------------------------------------------
| 1         | 0.8473    | 7.94      | 8.477     |
| 2         | 0.8797    | 28.18     | 4.019     |
| 3         | 0.8498    | 25.2      | 23.11     |
| 4         | 0.8497    | 19.61     | 4.035     |
| 5         | 0.8723    | 49.94     | 19.5      |
| 6         | 0.8473    | 36.78     | 1.759     |
| 7         | 0.8273    | 40.71     | 2.236     |
| 8         | 0.8648    | 28.99     | 9.147     |
| 9         | 0.8948    | 11.67     | 19.34     |
| 10        | 0.8623    | 47.44     | 17.95     |
| 11        | 0.8897    | 10.22     | 22.55

### Best Random Forest Model for Fold 1
This suggests that the best model for fold 1 is a forest of 11 trees with a maximum depth of 19, resulting in a balanced accuracy of 0.8948.

Again, I saved the predictions to submit to Kaggle, which gave a score of 0.95000/0.90000.

In [8]:
randomForestFold1 = ensemble.RandomForestClassifier(n_estimators=11, max_depth=19, random_state = seed)
randomForestFold1.fit(X_train_kmers, y_train)

test_preds = randomForestFold1.predict(X_test_kmers)

test_preds_df = pd.DataFrame(data={"genome_id":y_test_ids, "y_pred":test_preds})
test_preds_df.to_csv("randomForestFold1.csv", index=False) # IMPORTANT: Do not save the index
test_preds_df.head()

   genome_id y_pred
0  562.42833      R
1  562.42739      R
2  562.22823      S
3  562.45646      S
4  562.22547      S


### Modeling the Random Forest Model on Fold 2

In [9]:
def random_forest_model_fit_2(numberTrees, sizeTrees):
    randomForestModel = ensemble.RandomForestClassifier(n_estimators=int(numberTrees), max_depth=int(sizeTrees), random_state = seed)
    randomForestModel.fit(X_train_fold_2, y_train_fold_2)
    score = sklearn.metrics.balanced_accuracy_score(y_val_fold_2, randomForestModel.predict(X_val_fold_2))

    return score

optimizer2 = bayes_opt.BayesianOptimization(
    f = random_forest_model_fit_2,
    pbounds = parameter_limits,
    random_state=seed,
)

optimizer2.maximize(
    init_points=10, # Arbitrary larger number to increase spread
    n_iter=20, # Arbitrary large-ish number to optimize search
)

|   iter    |  target   | number... | sizeTrees |
-------------------------------------------------
| 1         | 0.7699    | 7.94      | 8.477     |
| 2         | 0.7899    | 28.18     | 4.019     |
| 3         | 0.7524    | 25.2      | 23.11     |
| 4         | 0.7799    | 19.61     | 4.035     |
| 5         | 0.7649    | 49.94     | 19.5      |
| 6         | 0.7773    | 36.78     | 1.759     |
| 7         | 0.7749    | 40.71     | 2.236     |
| 8         | 0.7849    | 28.99     | 9.147     |
| 9         | 0.7524    | 11.67     | 19.34     |
| 10        | 0.7649    | 47.44     | 17.95     |
| 11        | 0.6698    | 1.0       | 1.0

### Best Random Forest Model for Fold 2
This suggests that the best model for fold 2 is a forest of 36 trees with a maximum depth of 6, resulting in a balanced accuracy of 0.8149.

Kaggle score: 0.91875/0.87500

In [10]:
randomForestFold2 = ensemble.RandomForestClassifier(n_estimators=36, max_depth=6, random_state = seed)
randomForestFold2.fit(X_train_kmers, y_train)

test_preds = randomForestFold2.predict(X_test_kmers)

test_preds_df = pd.DataFrame(data={"genome_id":y_test_ids, "y_pred":test_preds})
test_preds_df.to_csv("randomForestFold2.csv", index=False) # IMPORTANT: Do not save the index
test_preds_df.head()

   genome_id y_pred
0  562.42833      R
1  562.42739      R
2  562.22823      S
3  562.45646      S
4  562.22547      S


### Final Random Forest Model
Based on the results of the three optimized models above, the best model is the one tuned on fold 1 (a forest of 11 trees with a maximum depth of 19), as it had the highest validation balanced accuracy, 0.8948.
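
Because all three optimizers share the same random_state, their first ten iterations evaluate identical hyperparameter pairs, which allows another way of comparing configurations: average each pair's score over the three folds instead of taking the best single fold. The sketch below does this with the values copied from iterations 1 to 10 of the logs above. It is not what was run in this notebook, it ignores the later iterations (which differ between folds and are truncated in the output), and the variable names are my own.

```python
# (numberTrees, sizeTrees) -> balanced accuracy on folds 0, 1 and 2,
# copied from iterations 1-10 of the three random forest optimizer logs
init_scores = {
    (7.94, 8.477): (0.8125, 0.8473, 0.7699),
    (28.18, 4.019): (0.8325, 0.8797, 0.7899),
    (25.2, 23.11): (0.8325, 0.8498, 0.7524),
    (19.61, 4.035): (0.8425, 0.8497, 0.7799),
    (49.94, 19.5): (0.835, 0.8723, 0.7649),
    (36.78, 1.759): (0.82, 0.8473, 0.7773),
    (40.71, 2.236): (0.82, 0.8273, 0.7749),
    (28.99, 9.147): (0.855, 0.8648, 0.7849),
    (11.67, 19.34): (0.815, 0.8948, 0.7524),
    (47.44, 17.95): (0.8375, 0.8623, 0.7649),
}

# Truncate the parameters as the objective function does, then average over folds
mean_scores = {
    (int(n_trees), int(depth)): sum(scores) / len(scores)
    for (n_trees, depth), scores in init_scores.items()
}

best_params = max(mean_scores, key=mean_scores.get)
print(best_params, round(mean_scores[best_params], 4))  # (28, 9) 0.8349
```

Among these ten shared points, the pair with the highest mean is 28 trees at depth 9, while the fold 1 winner (11 trees, depth 19) averages about 0.8207. The gaps are small, so this is a hint rather than a firm ranking.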

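## Building a Gradient Boost Model
Note that the code below uses sklearn's AdaBoostClassifier, which implements adaptive boosting (AdaBoost) rather than gradient boosting (sklearn's GradientBoostingClassifier). I keep the name gradient boost for this model throughout.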

In [11]:
def gradient_boost_model_fit_0(numberTrees, sizeTrees, learnRate):
    # Create model
    gradientBoostModel = sklearn.ensemble.AdaBoostClassifier(
        estimator = tree.DecisionTreeClassifier(max_depth=int(sizeTrees)), # Can choose any simple base estimator
        n_estimators = int(numberTrees),
        learning_rate = float(learnRate), # Another parameter to tune
        algorithm="SAMME",
    )
    
    # Fit model
    gradientBoostModel.fit(X_train_fold_0, y_train_fold_0)

    # Evaluate the model and return the evaluation score
    score = sklearn.metrics.balanced_accuracy_score(y_val_fold_0, gradientBoostModel.predict(X_val_fold_0))
    
    return score

### Gradient Boost Model Optimization
Through trial and error, I found a rough upper bound for the learning rate of approximately 25.
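
To get a feel for what such a large learning rate does, the sketch below works through the estimator weight under the standard SAMME formulation, alpha = learning_rate * (ln((1 - err) / err) + ln(K - 1)), where misclassified samples have their weights multiplied by exp(alpha). The weighted error of 0.1 is a made-up value, the function name is hypothetical, and I am assuming sklearn's implementation follows this textbook form; this growth may be related to the np.exp warnings that appear in the optimizer output.

```python
import math

def samme_estimator_weight(error, learning_rate, n_classes=2):
    """Weight of one weak learner under the standard SAMME update."""
    return learning_rate * (
        math.log((1.0 - error) / error) + math.log(n_classes - 1.0)
    )

error = 0.1  # hypothetical weighted error of a single weak learner
for learning_rate in (1.0, 4.019, 19.5, 25.0):
    alpha = samme_estimator_weight(error, learning_rate)
    # Misclassified samples have their weight multiplied by exp(alpha)
    factor = math.exp(alpha)
    print(f"learning_rate={learning_rate:>6}: alpha={alpha:6.2f}, factor={factor:.3g}")
```

With a learning rate of 1.0 (sklearn's default) the multiplier is 9, but at 25 it exceeds 1e23 in a single boosting round, so the sample weights can become numerically unstable very quickly.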

In [12]:
# Bounded region of parameter space
parameter_limits = {'numberTrees': (1, 50), 'sizeTrees': (1, 25), 'learnRate': (1.0, 25.0)}

optimizer0 = bayes_opt.BayesianOptimization(
    f = gradient_boost_model_fit_0,
    pbounds = parameter_limits,
    random_state=seed,
)

# Fit the model using our custom optimizer
optimizer0.maximize(
    init_points=10, # Arbitrary larger number to increase spread
    n_iter=20, # Arbitrary large-ish number to optimize search
)

|   iter    |  target   | learnRate | number... | sizeTrees |
-------------------------------------------------------------
| 1         | 0.7875    | 4.399     | 16.27     | 14.32     |
| 2         | 0.7925    | 4.019     | 25.2      | 23.11     |
| 3         | 0.795     | 10.12     | 7.196     | 24.97     |
| 4         | 0.82      | 19.5      | 36.78     | 1.759     |
| 5         | 0.795     | 20.45     | 3.523     | 14.71     |
| 6         | 0.785     | 9.147     | 11.67     | 19.34     |
| 7         | 0.8175    | 23.75     | 35.6      | 23.36     |
| 8         | 0.78      | 11.53     | 31.72     | 17.02     |
| 9         | 0.7925    |

  sample_weight = np.exp(
  return fit_method(estimator, *args, **kwargs)

| 12        | 0.255     | 25.0      | 32.69     | 1.0       |
| 13        | 0.6875    | 19.77     | 40.5      | 3.821     |
| 14        | 0.8075    | 22.83     | 35.79     | 20.38     |
| 15        | 0.795     | 20.24     | 38.59     | 23.43     |
| 16        | 0.77      | 19.05     | 33.18     | 22.96     |
| 17        | 0.785     | 25.0      | 40.54     | 23.52     |
| 18        | 0.82      | 14.17     | 36.1      | 1.703     |
| 19        | 0.7025    | 12.86     | 32.6      | 5.656     |
| 20        | 0.82      | 11.64     | 40.29     | 1.045     |
| 21        | 0.82      | 6.95      | 37.9      |

### Best Gradient Boost Model for Fold 0
Based on the above, the best model on fold 0 is an ensemble of 36 trees with a maximum depth of 1 and a learning rate of 19.5, resulting in a balanced accuracy of 0.82.

Kaggle score: 0.85625/0.83125

In [13]:
gradientBoostFold0 = sklearn.ensemble.AdaBoostClassifier(estimator = tree.DecisionTreeClassifier(max_depth=1), n_estimators = 36, learning_rate = 19.5, algorithm="SAMME")

gradientBoostFold0.fit(X_train_kmers, y_train)

# Make test predictions and save out as a dataframe
test_preds = gradientBoostFold0.predict(X_test_kmers)

# Save
test_preds_df = pd.DataFrame(data={"genome_id":y_test_ids, "y_pred":test_preds})
test_preds_df.to_csv("gradientBoostFold0.csv", index=False) # IMPORTANT: Do not save the index
test_preds_df.head()

   genome_id y_pred
0  562.42833      S
1  562.42739      R
2  562.22823      S
3  562.45646      S
4  562.22547      S


Let's repeat the same steps for the other two folds.

In [14]:
def gradient_boost_model_fit_1(numberTrees, sizeTrees, learnRate):

    gradientBoostModel = sklearn.ensemble.AdaBoostClassifier(
        estimator = tree.DecisionTreeClassifier(max_depth=int(sizeTrees)), # Can choose any simple base estimator
        n_estimators = int(numberTrees),
        learning_rate = float(learnRate), # Another parameter to tune
        algorithm="SAMME",
    )
    
    gradientBoostModel.fit(X_train_fold_1, y_train_fold_1)
    score = sklearn.metrics.balanced_accuracy_score(y_val_fold_1, gradientBoostModel.predict(X_val_fold_1))
    
    return score

optimizer1 = bayes_opt.BayesianOptimization(
    f = gradient_boost_model_fit_1,
    pbounds = parameter_limits,
    random_state=seed,
)

optimizer1.maximize(
    init_points=10, # Arbitrary larger number to increase spread
    n_iter=20, # Arbitrary large-ish number to optimize search
)

|   iter    |  target   | learnRate | number... | sizeTrees |
-------------------------------------------------------------
| 1         | 0.8022    | 4.399     | 16.27     | 14.32     |
| 2         | 0.8247    | 4.019     | 25.2      | 23.11     |
| 3         | 0.8047    | 10.12     | 7.196     | 24.97     |
| 4         | 0.8423    | 19.5      | 36.78     | 1.759     |
| 5         | 0.7997    | 20.45     | 3.523     | 14.71     |
| 6         | 0.8372    | 9.147     | 11.67     | 19.34     |
| 7         | 0.8272    | 23.75     | 35.6      | 23.36     |
| 8         | 0.8147    | 11.53     | 31.72     | 17.02     |
| 9         | 0.7847    | 2.88

### Best Gradient Boost Model for Fold 1
Based on the above, the best model on fold 1 is an ensemble of 25 trees with a maximum depth of 23 and a learning rate of 4.019, resulting in a balanced accuracy of 0.8523.

Kaggle score: 0.86250/0.86250

In [15]:
gradientBoostFold1 = sklearn.ensemble.AdaBoostClassifier(estimator = tree.DecisionTreeClassifier(max_depth=23), n_estimators = 25, learning_rate = 4.019, algorithm="SAMME")

gradientBoostFold1.fit(X_train_kmers, y_train)

test_preds = gradientBoostFold1.predict(X_test_kmers)

test_preds_df = pd.DataFrame(data={"genome_id":y_test_ids, "y_pred":test_preds})
test_preds_df.to_csv("gradientBoostFold1.csv", index=False) # IMPORTANT: Do not save the index
test_preds_df.head()

   genome_id y_pred
0  562.42833      S
1  562.42739      R
2  562.22823      S
3  562.45646      S
4  562.22547      S


In [16]:
def gradient_boost_model_fit_2(numberTrees, sizeTrees, learnRate):

    gradientBoostModel = sklearn.ensemble.AdaBoostClassifier(
        estimator = tree.DecisionTreeClassifier(max_depth=int(sizeTrees)), # Can choose any simple base estimator
        n_estimators = int(numberTrees),
        learning_rate = float(learnRate), # Another parameter to tune
        algorithm="SAMME",
    )
    
    gradientBoostModel.fit(X_train_fold_2, y_train_fold_2)
    score = sklearn.metrics.balanced_accuracy_score(y_val_fold_2, gradientBoostModel.predict(X_val_fold_2))
    
    return score

optimizer2 = bayes_opt.BayesianOptimization(
    f = gradient_boost_model_fit_2,
    pbounds = parameter_limits,
    random_state=seed,
)

optimizer2.maximize(
    init_points=10, # Arbitrary larger number to increase spread
    n_iter=20, # Arbitrary large-ish number to optimize search
)

|   iter    |  target   | learnRate | number... | sizeTrees |
-------------------------------------------------------------
| 1         | 0.7274    | 4.399     | 16.27     | 14.32     |
| 2         | 0.7124    | 4.019     | 25.2      | 23.11     |
| 3         | 0.7249    | 10.12     | 7.196     | 24.97     |
| 4         | 0.7873    | 19.5      | 36.78     | 1.759     |
| 5         | 0.7149    | 20.45     | 3.523     | 14.71     |
| 6         | 0.7624    | 9.147     | 11.67     | 19.34     |
| 7         | 0.7224    | 23.75     | 35.6      | 23.36     |
| 8         | 0.7174    | 11.53     | 31.72     | 17.02     |
| 9         | 0.7249    | 2.883

  sample_weight = np.exp(
  return fit_method(estimator, *args, **kwargs)

| 13        | 0.7873    | 24.32     | 33.35     | 1.0       |
| 14        | 0.7873    | 20.18     | 25.23     | 1.0       |
| 15        | 0.7873    | 15.56     | 49.8      | 1.0       |

  sample_weight = np.exp(
  return fit_method(estimator, *args, **kwargs)

| 16        | 0.2203    | 25.0      | 50.0      | 1.0       |
| 17        | 0.7873    | 8.898     | 47.68     | 1.139     |
| 18        | 0.7149    | 11.26     | 50.0      | 8.698     |
| 19        | 0.7298    | 22.45     | 29.88     | 9.005     |
| 20        | 0.7873    | 12.78     | 41.55     | 1.0       |
| 21        | 0.7873    | 10.61     | 28.2      | 1.0       |
| 22        | 0.2203    | 12.74     | 18.54     | 1.0       |
| 23        | 0.7873    | 6.695     | 34.76     | 1.629     |
| 24        | 0.7674    | 12.46     | 33.18     | 6.739     |
| 25        | 0.7349    | 1.0       | 30.76     |

### Best Gradient Boost Model for Fold 2
Based on the above, the best model on fold 2 is an ensemble of 36 trees with a maximum depth of 1 and a learning rate of 19.5, resulting in a balanced accuracy of 0.7873.
These are the same parameters as fold 0!

Kaggle score: 0.85625/0.83125

In [17]:
gradientBoostFold2 = sklearn.ensemble.AdaBoostClassifier(estimator = tree.DecisionTreeClassifier(max_depth=1), n_estimators = 36, learning_rate = 19.5, algorithm="SAMME")

gradientBoostFold2.fit(X_train_kmers, y_train)

test_preds = gradientBoostFold2.predict(X_test_kmers)

test_preds_df = pd.DataFrame(data={"genome_id":y_test_ids, "y_pred":test_preds})
test_preds_df.to_csv("gradientBoostFold2.csv", index=False) # IMPORTANT: Do not save the index
test_preds_df.head()



   genome_id y_pred
0  562.42833      S
1  562.42739      R
2  562.22823      S
3  562.45646      S
4  562.22547      S


### Final Gradient Boost Model
Based on the results of the three optimized models, the best model is the one tuned on fold 1 (an ensemble of 25 trees with a maximum depth of 23 and a learning rate of 4.019), with a resulting balanced accuracy of 0.8523.

## Conclusion
Perhaps simple is better in this case, as the random forest yielded higher validation and Kaggle scores than the gradient boost model on every fold. The best model that I came up with was the random forest tuned on fold 1, the middle fold of data.

This is, of course, only the best model in the narrow scope of the data used and the default sklearn packages.

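I also did not have the chance to compare the models against each other more rigorously. The validation scores reported above are balanced accuracies (computed with balanced_accuracy_score), so they may not be directly comparable to the KNN baseline accuracy of 84%.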