# Forest Cover Type Prediction - Final Report
The goal of this project is to build a model that uses cartographic features about a cell of forest land to accurately predict the predominant kind of tree cover for the cell.

## The Inference Problem
**X:** Cartographic features of forest land, such as elevation, slope, distance to water, shade, and soil type.

**y:** Predominant class of tree cover for the given cell.

**Model:** We will train a variety of models, including KNN, Naive Bayes, Decision Trees, and others.

**Parameters:** Each model encodes the training data in its learned parameters. KNN is the exception: it learns no parameters and instead stores the training examples themselves for lookup at prediction time.

**Cost Functions:** Each model will employ a different cost function. For example, decision trees will use entropy.
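To illustrate the entropy criterion mentioned above, the following sketch computes the Shannon entropy of the class labels at a tree node. The helper name `node_entropy` is hypothetical and is not part of the project code.

```python
import numpy as np

def node_entropy(labels):
    """Shannon entropy (in bits) of the class labels at a tree node."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# A node split evenly between two cover types is maximally impure for two classes
mixed = node_entropy([1, 1, 2, 2])
# A node containing a single cover type is pure
pure = node_entropy([3, 3, 3, 3])
print(mixed, pure)
```

A decision tree trained with `criterion='entropy'` prefers splits that reduce this quantity the most.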

**Objective:** Each model has its own objective, such as maximizing likelihood in the case of Naive Bayes.

## [Data Source](https://www.kaggle.com/c/forest-cover-type-prediction)
The outcome variable (forest cover type) comes from the US Forest Service (USFS), while the feature variables come from a combination of the US Geological Survey and the USFS. The data covers four wilderness areas in the Roosevelt National Forest; because these areas are preserved from most human disturbance, we assume forest cover types are a result of natural processes represented by the independent variables (although this assumption is not necessary to generate an effective model).

## Feature Definitions
The raw data contains a mixture of continuous and binary variables, as defined below: 

- `Elevation` - Elevation in meters
- `Aspect` - Aspect in degrees azimuth
- `Slope` - Slope in degrees
- `Horizontal_Distance_To_Hydrology` - Horizontal distance to nearest surface water feature
- `Vertical_Distance_To_Hydrology` - Vertical distance to nearest surface water feature
- `Horizontal_Distance_To_Roadways` - Horizontal distance to nearest roadway
- `Hillshade_9am` (0 to 255 index) - Hillshade index at 9am, summer solstice
- `Hillshade_Noon` (0 to 255 index) - Hillshade index at noon, summer solstice
- `Hillshade_3pm` (0 to 255 index) - Hillshade index at 3pm, summer solstice
- `Horizontal_Distance_To_Fire_Points` - Horizontal distance to nearest wildfire ignition point
- `Wilderness_Area` (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
- `Soil_Type` (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
- `Cover_Type` (7 types, integers 1 to 7) - Forest Cover Type designation

Additional transformed variables to potentially train our models:
- `Wilderness_Area` - combining the 4 binary columns into one categorical variable (1-4), making an assumption about exclusivity of areas that needs to be checked
- `Soil_Type` - combining the 40 binary columns into one categorical variable (1-40), making an assumption about exclusivity of soil types that needs to be checked
- `Total_Distance_To_Hydrology` - Euclidean distance using "Horizontal" and "Vertical" distances
- Binned versions of continuous variables
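
As a rough illustration of the binned variables listed above, the sketch below bins a toy `Elevation` column in two ways. The values, the bin count of 3, and the new column names are assumptions made for the example, not choices taken from this project.

```python
import pandas as pd

# Toy elevations in meters (illustrative values, not taken from the dataset)
toy = pd.DataFrame({'Elevation': [2000, 2400, 2800, 3200, 3600, 3800]})

# Equal-width bins: each bin spans the same range of elevations
toy['Elevation_Bin_Width'] = pd.cut(toy['Elevation'], bins=3, labels=False)

# Quantile bins: each bin holds roughly the same number of rows
toy['Elevation_Bin_Quantile'] = pd.qcut(toy['Elevation'], q=3, labels=False)
print(toy)
```

Equal-width bins keep the physical scale interpretable, while quantile bins avoid nearly empty bins when a feature is skewed.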

## Testing Plan
We plan to tune and compare a variety of models, optimizing toward the highest possible $F_1$ score (a metric which balances precision and recall). All models will be trained on labeled data and evaluated against held-out "development" data using an 80/20 split (12,096 training rows and 3,024 development rows).
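
To make the metric concrete, here is a small sketch of how scikit-learn's `f1_score` behaves on a multiclass problem. The labels are toy values; the `average='weighted'` option is the one used in the evaluation cells later in this report.

```python
from sklearn.metrics import f1_score

# Toy labels (illustrative only): three cover types
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 3]

per_class = f1_score(y_true, y_pred, average=None)       # one F1 per class
weighted = f1_score(y_true, y_pred, average='weighted')  # mean weighted by true class counts
macro = f1_score(y_true, y_pred, average='macro')        # unweighted mean over classes
print(per_class, weighted, macro)
```

Note that `f1_score` expects the true labels first and the predictions second; with `average='weighted'` the class weights come from the first argument, so swapping the two can change the result.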

Potential models to test include:
- k Nearest Neighbors
- Naive Bayes
- Logistic Regression
- Decision Trees
- Support Vector Machines

## <a id = 0> </a>Navigation
- [Data Load](#1)
- [Data Split](#2)
- [Exploratory Data Analyses](#3)
    - [Histogram](#4)
    - [Scatter Plots](#5)
    - [Correlation Matrix](#6)
    - [Box Plots](#7)
    - [Violin Plots](#8)
    - [Wilderness Area and Soil Types](#9)
- [Confusion Matrix](#9.5)
- [Model Building](#10)
- [Result Analyses](#11)

## <a id = 1> </a> Data Load
[Back to Navigation](#0)

In [1]:
# General libraries
import numpy as np
import pandas as pd
import re
import time

# SK-learn - learning libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.neural_network import MLPClassifier

# SK-learn - feature processing libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# SK-learn - evaluation libraries
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Producing Decision Tree diagrams
from IPython.display import Image, display
import pydotplus
from subprocess import call

# Other
import copy
from textwrap import wrap

# Expand rows/columns in df outputs
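# Use full option names; the short forms can be ambiguous in recent pandas versions
pd.set_option(
#     'display.max_rows', None, 
    'display.max_columns', None,
    'display.max_colwidth', None
)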

import warnings
warnings.filterwarnings(action='ignore')

## <a id = 2> </a> Data Split
[Back to Navigation](#0)

In [2]:
train_data = pd.read_csv('../data/processed/train_data.csv').set_index('Id')
train_labels = pd.read_csv('../data/processed/train_labels.csv').set_index('Id')
dev_data = pd.read_csv('../data/processed/dev_data.csv').set_index('Id')
dev_labels = pd.read_csv('../data/processed/dev_labels.csv').set_index('Id')

In [3]:
print(
    f'Train Data Shape: {train_data.shape}'
    f'\nTrain Labels Shape: {train_labels.shape}'
    f'\nDev Data Shape: {dev_data.shape}'
    f'\nDev Labels Shape: {dev_labels.shape}'
     )

Train Data Shape: (12096, 54)
Train Labels Shape: (12096, 1)
Dev Data Shape: (3024, 54)
Dev Labels Shape: (3024, 1)
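
The exploratory cells below refer to `X_train_df` and `X_dev_df`, which are not defined in this report. A minimal sketch of one plausible definition, assuming they are the feature frames joined with their labels on `Id` and that the label column is named `Cover_Type`; the toy frames stand in for the CSV files loaded above.

```python
import pandas as pd

# Toy stand-ins for train_data / train_labels (illustrative values only)
toy_data = pd.DataFrame(
    {'Elevation': [2596, 2804], 'Slope': [3, 9]},
    index=pd.Index([1, 2], name='Id'),
)
toy_labels = pd.DataFrame({'Cover_Type': [5, 2]},
                          index=pd.Index([1, 2], name='Id'))

# Join features and labels on the shared Id index
X_train_toy = toy_data.join(toy_labels)
print(X_train_toy.columns.tolist())
```

Under the same assumption, the real frames would be built as `X_train_df = train_data.join(train_labels)` and `X_dev_df = dev_data.join(dev_labels)`.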


## <a id = 3> </a>Exploratory Data Analyses
[Back to Navigation](#0)

In [None]:
# Column List
train_data.columns

In [None]:
# Statistics Summary
train_data.describe()

In [None]:
# Check data types for each field
train_data.dtypes

In [None]:
# Check null values
train_data.isna().sum()

**Observations**
- All data fields have dtype int64
- `Wilderness_Area` and `Soil_Type` are binary features
- `Cover_Type` is categorized from 1-7
- The rest of the fields are continuous
- No null values

### <a id = 4> </a>Histograms of each non-binary feature
[Back to Navigation](#0)

Given the random split between train and dev, we expect the training distributions to resemble those of the dev data. This is key to generalization, both to the dev data and to the final test data.
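
Besides comparing histograms by eye, the similarity of two samples can be checked numerically. The sketch below uses SciPy's two-sample Kolmogorov-Smirnov test on synthetic data; the feature values are generated for illustration, and this check is not part of the original analysis.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-in for one continuous feature, split 80/20 at random
feature = rng.normal(loc=2750, scale=400, size=5000)
shuffled = rng.permutation(feature)
train_part, dev_part = shuffled[:4000], shuffled[4000:]

# A small statistic (and a large p-value) is consistent with both samples
# coming from the same distribution
result = ks_2samp(train_part, dev_part)
print(f'KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3f}')
```

With the real data, the same call could be applied per column, for example to `train_data['Elevation']` and `dev_data['Elevation']`.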

In [None]:
# Note: When you export this notebook and run it in Jupyter Lab, you need to
# reset the first column of test_kaggle as you did with train and test:
# test_data = test_kaggle.set_index('Id')
# Otherwise, the last row of graphs are one off

# Strip underscores from feature names for nice printing
formatted_cols = copy.deepcopy(X_train_df.columns).str.replace('_', ' ')

# Plot Formatting
plt.rcParams.update({'text.color' : "dimgrey",
                     'axes.labelcolor' : "grey"})

# include dev_data in plots for comparison
# datasets = [train_data, dev_data]    
# data_names = ['train', 'dev']

datasets = [X_train_df]
data_names = ['train']

# Plot each non-binary feature for every dataset in `datasets`
fig, axes = plt.subplots(1, 10, figsize = (20, 5))

# Loop through to show hist of non-binary for each 
for d, data in enumerate(datasets):    # For each dataset (only needed when comparing dev and train)
    for i in np.arange(0, 10):          # For each non-binary figure in dataset
        data.iloc[:, i].plot.hist(ax = axes[i], 
                                    sharex = True, color = '#1c4966')

        
        # Column and Row names for each plot
        if (i == 0) and (d == 0):    # Top Left Corner
            axes[i].set_ylabel(data_names[d])
            axes[i].set_title("\n".join(wrap(formatted_cols[i], 12)))
        
        elif i == 0:    # First Column
            axes[i].set_ylabel(data_names[d])
    
        elif d == 0:    # First Row
            axes[i].set_ylabel('')
            axes[i].set_title("\n".join(wrap(formatted_cols[i], 12)))
        else:
            axes[i].set_ylabel('')
            
        # For All Plots
        axes[i].set_yticks([])
        axes[i].spines['top'].set_visible(False)
        axes[i].spines['right'].set_visible(False)
        axes[i].spines['left'].set_color('grey')
        axes[i].spines['bottom'].set_color('grey')
        axes[i].tick_params(colors = 'grey')

plt.show()

### <a id = 5> </a> Scatterplots comparing each feature
[Back to Navigation](#0)

Scatterplots may reveal correlational relationships between features. Additionally, the color of each datapoint represents a forest cover type. This will also help reveal if the relationship between two features varies by forest cover type.

In [None]:
# Scatterplot Matrix
# ------------------------------------------------------------------------------
# This isn't meant to be a final output (obviously it's too much in its current
# state); just wanted to see all of the distributions at once so we could pick
# out meaningful ones 
# Currently, this takes a long time to run.

train_data_copy = X_train_df.copy().iloc[:, :10]
train_data_copy["Cover_Type"] = X_train_df.Cover_Type

# The different colors indicate Cover_Type
sns.pairplot(train_data_copy, kind="scatter", hue="Cover_Type", palette="Set1")
plt.show()

In [None]:
# Cutting down the number of columns
columns = ["Elevation", "Aspect", "Slope", "Hillshade_9am",
           "Hillshade_Noon", "Hillshade_3pm", "Cover_Type"]

train_data_copy2 = X_train_df.copy().loc[:, columns]
train_data_copy2

# The different colors indicate Cover_Type
sns.pairplot(train_data_copy2, kind="scatter", hue="Cover_Type", palette="Set1")
plt.show()

In [None]:
# Rest of the columns
columns = ["Horizontal_Distance_To_Hydrology",
           "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways",
           "Horizontal_Distance_To_Fire_Points", "Hillshade_9am",
           "Hillshade_Noon", "Hillshade_3pm", "Cover_Type"]

train_data_copy3 = X_train_df.copy().loc[:, columns]
train_data_copy3

# The different colors indicate Cover_Type
sns.pairplot(train_data_copy3, kind="scatter", hue="Cover_Type", palette="Set1")
plt.show()

### <a id = 6> </a>Correlation Matrix - Relationships between each non-binary feature
[Back to Navigation](#0)

Comparing the training heatmap to the dev heatmap, the two sets have largely the same correlation structure. This is expected given the random 80/20 split; a substantial difference in structure would suggest that models fit on the training data may generalize poorly to the dev data.
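
The visual comparison can be complemented with a single number: the largest absolute difference between the two correlation matrices. The sketch below runs on synthetic data; the frame and column names are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for two correlated continuous features
x = rng.normal(size=2000)
frame = pd.DataFrame({'a': x, 'b': 0.5 * x + rng.normal(size=2000)})

# Random 80/20 split, mirroring the train/dev split
dev_part = frame.sample(frac=0.2, random_state=0)
train_part = frame.drop(dev_part.index)

# Largest absolute gap between the two correlation matrices
max_gap = (train_part.corr() - dev_part.corr()).abs().to_numpy().max()
print(f'Largest correlation difference: {max_gap:.3f}')
```

A value near zero supports the reading that both splits share the same correlation structure.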

In [None]:
fig, axes = plt.subplots(1, 2, sharey = True, figsize = (20,10))

datasets = [X_train_df, X_dev_df]    
data_names = ['train', 'dev']

# Correlation plot for each dataset - numeric values
for i, data in enumerate(datasets):    # For each dataset

    corr = data.iloc[:, :10].corr()    # Set the correlation matrix
    
    # Mask to upper triangular
    mask = np.zeros_like(corr)
    mask[np.triu_indices_from(mask)] = True
    
    # Plot correlation heatmap
    sns.heatmap(corr, 
                xticklabels = corr.columns.values,
                yticklabels = corr.columns.values, 
                cmap = 'bwr', 
                annot = True, 
                mask = mask, 
                fmt = '.2f',
                ax = axes[i],
                cbar = False).set(title = data_names[i])

plt.show()

### <a id = 7> </a>Boxplots for Numeric Features
[Back to Navigation](#0)

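Box plots of the continuous cartographic features, such as `Elevation`, `Aspect`, and `Slope`, show each feature's spread and highlight potential outliers.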

In [None]:
fig, ax = plt.subplots(10, 1, figsize = (20, 35))

feature_cols = ["Elevation", "Aspect", "Slope", 
                "Horizontal_Distance_To_Roadways", 
                "Horizontal_Distance_To_Fire_Points", "Hillshade_9am",
                "Hillshade_Noon", "Hillshade_3pm", 
                "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Hydrology"] # Added these to feature_cols

for i, var in enumerate(feature_cols):
    sns.boxplot(x = var, data = X_train_df, ax = ax[i])
    
plt.show()

### <a id = 8> </a>Violin Plot on Continuous Features
[Back to Navigation](#0)

In [None]:
# Violin plot
fig, axes = plt.subplots(5, 2, figsize = (20, 20))
col_list = ['Elevation', 'Aspect', 'Slope',
            'Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology', 
            'Horizontal_Distance_To_Roadways','Hillshade_9am',
            'Hillshade_Noon', 'Hillshade_3pm','Horizontal_Distance_To_Fire_Points']
i = 0
for col_name in col_list:
    row = i // 2
    col = i % 2
    sns.violinplot(x='Cover_Type', y=col_name, data=X_train_df , ax=axes[row][col])
    i = i + 1

**Observations**
- `Elevation` varies according to `Cover_Type` indicating that this will be an important variable for prediction
- `Horizontal_Distance_To_Hydrology` and `Horizontal_Distance_To_Roadways` have similar distributions

### <a id = 9> </a>`Wilderness_Area` and `Soil_Type` Binary Features Exploration
[Back to Navigation](#0)

Unpivot `Wilderness_Area` and `Soil_Type` Variables

In [None]:
soil_list = []
for i in range(40):
    soil_list.append(f'Soil_Type{i+1}')

wild_area_list = ['Wilderness_Area1','Wilderness_Area2','Wilderness_Area3','Wilderness_Area4']

In [None]:
# Unpivot df from wide to long format by combining `Soil_Type#` and `Wilderness_Area#` to one column each
X_train_df_comb = pd.melt(X_train_df, 
                          id_vars=col_list+soil_list+['Cover_Type'], 
                          value_vars=wild_area_list, 
                          var_name='Wilderness_Area')
X_train_df_comb2 = X_train_df_comb[X_train_df_comb.value != 0].drop(columns=['value'])

X_train_df_comb3 = pd.melt(X_train_df_comb2, 
                          id_vars=col_list+['Wilderness_Area', 'Cover_Type'], 
                          value_vars=soil_list, 
                          var_name='Soil_Type')
X_train_df_comb4 = X_train_df_comb3[X_train_df_comb3.value != 0].drop(columns=['value'])

In [None]:
# Count plot - Combined Wilderness Area
plt.figure(figsize=(15, 8))
ax = sns.countplot(x='Wilderness_Area', hue='Cover_Type', data=X_train_df_comb4)
for p in ax.patches:
    ax.annotate(f'{p.get_height():.0f}', (p.get_x()-0.001, p.get_height()+10))
plt.legend(loc='upper right', title='Cover Type')
plt.show()

**Observations**
- Cover Type 4 only exists in Wilderness Area 4
- Fairly equal representation of wilderness areas, except for Wilderness Area 2

In [None]:
# Count plot - Soil Type
plt.figure(figsize=(50, 10))
ax = sns.countplot(x='Soil_Type', hue='Cover_Type', data=X_train_df_comb4)
# for p in ax.patches:
#     ax.annotate(f'{p.get_height():.0f}', (p.get_x(), p.get_height()+10))
plt.legend(loc='upper right', title='Cover Type')
plt.show()

**Observations**
- No training rows have Soil Type 7 or Soil Type 15, so no cover types are observed for them

### Explore Wilderness Area Binary Counts
Fairly equal representation of wilderness areas, except for Wilderness Area 2

In [None]:
X_train_df.groupby(['Wilderness_Area1','Wilderness_Area2','Wilderness_Area3','Wilderness_Area4'])['Cover_Type'].count()

### Determining if one Soil type exists for each data point

In [None]:
# Determining if one soil type exists for each data point
soil_type_cols = [col_name for col_name in X_train_df.columns if 'Soil_Type' in col_name]

X_train_df_soil = X_train_df.copy()
X_train_df_soil['Soil_Type_Count'] = X_train_df_soil[soil_type_cols].sum(axis = 1)
X_train_df_soil['Soil_Type_Count'].value_counts()

# Only 1 soil type exists for each row - no mix of different soil types

In [None]:
# try:
#     dev_data.drop(columns = 'Soil_Type_Count', axis = 1)
# except:
#     print('Soil_Type_Count not yet created')

X_dev_df_soil = X_dev_df.copy()
X_dev_df_soil['Soil_Type_Count'] = X_dev_df_soil[soil_type_cols].sum(axis = 1)
X_dev_df_soil['Soil_Type_Count'].value_counts()


Both the `Soil_Type` and the `Wilderness_Area` columns appear to be mutually exclusive: each row has exactly one wilderness area and one soil type.


In [None]:
# Wilderness Areas and Soil Types
# ------------------------------------------------------------------------------
# Combining wilderness areas into one column, soil types into one column

def get_feature_number(r, col_prefix):
    # Collect the one-hot columns for this prefix (names containing a digit,
    # e.g. Soil_Type12); derived columns such as Soil_Type_Count are skipped
    cols = [col_name for col_name in r.index if (col_prefix in col_name) and re.search(r'\d', col_name) is not None]

    # Position of the column that is set to 1 for this row
    feature_subix = r[cols].argmax()

    # Extract the numeric suffix from that column's name
    feature_name = cols[feature_subix]
    feature_num = ''.join([i for i in feature_name if i.isdigit()])
    return int(feature_num)

X_train_df_DR = X_train_df_soil.copy()
X_dev_df_DR = X_dev_df_soil.copy()

X_train_df_DR['Wilderness_Area'] = X_train_df_DR.apply(lambda x:get_feature_number(x, 'Wilderness_Area'), axis = 1)
X_train_df_DR['Soil_Type'] = X_train_df_DR.apply(lambda x:get_feature_number(x, 'Soil_Type'), axis = 1)
X_dev_df_DR['Wilderness_Area'] = X_dev_df_DR.apply(lambda x:get_feature_number(x, 'Wilderness_Area'), axis = 1)
X_dev_df_DR['Soil_Type'] = X_dev_df_DR.apply(lambda x:get_feature_number(x, 'Soil_Type'), axis = 1)
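
Applying a Python function row by row can be slow on larger frames. A vectorized alternative, sketched here on a toy one-hot frame, uses `idxmax` to find the active column and then extracts its numeric suffix. This is an illustration under the exclusivity assumption checked above, not the code used in this report.

```python
import pandas as pd

# Toy one-hot frame (illustrative values only)
toy = pd.DataFrame({
    'Wilderness_Area1': [1, 0, 0],
    'Wilderness_Area2': [0, 0, 1],
    'Wilderness_Area3': [0, 1, 0],
})

# Build the column list before adding the combined column
area_cols = [c for c in toy.columns if c.startswith('Wilderness_Area')]

# Guard the exclusivity assumption before collapsing the columns
assert (toy[area_cols].sum(axis=1) == 1).all()

# idxmax returns the name of the column holding the 1; keep its numeric suffix
toy['Wilderness_Area'] = (
    toy[area_cols].idxmax(axis=1).str.extract(r'(\d+)$', expand=False).astype(int)
)
print(toy['Wilderness_Area'].tolist())
```

The same pattern would apply to the 40 `Soil_Type` columns.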

In [None]:
# Total_Distance_to_Hydrology
# ------------------------------------------------------------------------------
# Create Total_Distance_to_Hydrology based on Euclidean distance
X_train_df_DR['Total_Distance_To_Hydrology'] = np.sqrt(X_train_df_DR["Horizontal_Distance_To_Hydrology"]**2 + X_train_df_DR['Vertical_Distance_To_Hydrology']**2)
X_dev_df_DR['Total_Distance_To_Hydrology'] = np.sqrt(X_dev_df_DR["Horizontal_Distance_To_Hydrology"]**2 + X_dev_df_DR['Vertical_Distance_To_Hydrology']**2)
X_train_df_DR[["Total_Distance_To_Hydrology", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology"]].head(10)

In [None]:
X_train_df_DR['Wilderness_Area'].value_counts()

In [None]:
X_train_df_DR['Soil_Type'].value_counts().sort_index().head()

## <a id = 9.5> </a>Confusion Matrix 
[Back to Navigation](#0)

Gaussian NB

In [None]:
def gauss_NB_confusion_matrix():
    
    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(train_data.iloc[:, :10])
    X_dev_std = scaler.transform(dev_data.iloc[:, :10])
    
    model = GaussianNB()
    model.fit(X_train_std, train_labels.values.ravel())
    dev_pred = model.predict(X_dev_std)
    
    # f1_score expects (y_true, y_pred); the weighted average uses true class counts
    nb_f1_score = metrics.f1_score(dev_labels, dev_pred, average = 'weighted')
    
    print(f'Gaussian NB f1_score: {nb_f1_score:.4f}\n')
    
    # Print confusion matrix in ASCII form
    conf_matrix = confusion_matrix(dev_labels, dev_pred)
    print('Confusion Matrix:')
    print(conf_matrix)
    
    # Produce confusion matrix in the form of heatmap
    fig = plt.figure(figsize=(10, 10))
    
    ax = fig.add_subplot(111)
    cmx = ax.matshow(conf_matrix, cmap=plt.cm.Accent)
    plt.colorbar(cmx)
    
    plt.title('Confusion Matrix Heat Map')
    plt.xlabel('Predicted', fontsize=14)
    plt.ylabel('Actual', fontsize=14)
    plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Accent)
    classNames = [str(i+1) for i in range(conf_matrix.shape[0])]    
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=0, fontsize=14)
    plt.yticks(tick_marks, classNames, fontsize=14)
   
    for i in range(len(classNames)):
        for j in range(len(classNames)):
            plt.text(j,i, str(conf_matrix[i][j]), size='large', horizontalalignment='center')    

gauss_NB_confusion_matrix()
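
Raw counts can be hard to compare across classes. One option, sketched below on toy labels, is to normalize each row of the confusion matrix by the number of actual samples in that class, so the diagonal reads as per-class recall. The labels are illustrative and unrelated to the model output above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only)
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 1]

# normalize='true' divides each row by the number of actual samples in that
# class, so each row sums to 1 and the diagonal holds per-class recall
cm_norm = confusion_matrix(y_true, y_pred, normalize='true')
per_class_recall = np.diag(cm_norm)
print(np.round(cm_norm, 2))
```

In the function above, the equivalent change would be passing `normalize='true'` to the existing `confusion_matrix(dev_labels, dev_pred)` call.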

## <a id = 10> </a>Model Building
[Back to Navigation](#0)

For the purposes of encapsulation and avoiding conflicts, model construction, feature scaling, and grid search are wrapped in a single `Model` class. Eventually we will resolve conflicts and combine the tests.

In [4]:
class Model():
    def __init__(self, model_type):
        self.model_type = model_type
        self.scaler_type = None
        self.X_train = None
        self.X_dev = None
        
        if model_type == 'kNN':
            self.model = KNeighborsClassifier()
        elif model_type == 'Gaussian_NB':
            self.model = GaussianNB()
        elif model_type == 'Logistic_Regression':
            self.model = LogisticRegression(random_state=0, max_iter=10000)
        elif model_type == 'Decision_Tree':
            self.model = DecisionTreeClassifier(random_state=0, criterion='entropy')
        elif model_type == 'SVC':
            self.model = SVC(random_state=0, kernel='rbf')
        elif model_type == 'XGBoost':
            self.model = xgb.XGBClassifier(eval_metric='mlogloss', random_state=0)
        elif model_type == 'Neural_Net':
            self.model = MLPClassifier(random_state=0, max_iter=500)
                  
    def featurePreprocessingScale(self, scaler_type, X_train, X_dev):
        self.scaler_type = scaler_type

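        if scaler_type == 'MinMax':
            scaler = MinMaxScaler()
        elif scaler_type == 'Standard':
            scaler = StandardScaler()
        elif scaler_type == 'Robust':
            scaler = RobustScaler()
        else:
            # Fail early instead of hitting an undefined `scaler` below
            raise ValueError(f'Unknown scaler_type: {scaler_type}')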

        X_train_scaled = scaler.fit_transform(X_train)
        X_dev_scaled = scaler.transform(X_dev)
        
        self.X_train_scaled = X_train_scaled
        self.X_dev_scaled = X_dev_scaled

        return([X_train_scaled, X_dev_scaled])
    
    def gridSearchCv(self, train_data, dev_data, train_labels, dev_labels,
                     params=None, scaler_type=None):
        
        self.X_train = train_data
        self.X_dev = dev_data        
        
        gscv = GridSearchCV(self.model, param_grid=params, cv=3, n_jobs=-1)
        
        if scaler_type is not None:
            [self.X_train, self.X_dev] = self.featurePreprocessingScale(scaler_type, train_data, dev_data)

        gscv.fit(self.X_train, train_labels.values.ravel())
        dev_predict = gscv.predict(self.X_dev)
        
        self.best_model = gscv
        self.best_f1score = metrics.f1_score(dev_labels, dev_predict, average='weighted')
        self.best_accuracy = metrics.accuracy_score(dev_labels, dev_predict)
        self.dev_predict = dev_predict
        # classification_report expects (y_true, y_pred)
        self.classification_report = classification_report(dev_labels, dev_predict)
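
`gridSearchCv` fits the scaler on the full training set before cross-validation and lets `GridSearchCV` select by its default score, which is accuracy for classifiers. One alternative, sketched here on synthetic data and not the approach used in this report, wraps the scaler and estimator in a `Pipeline` so that scaling is refit inside each fold, and selects on weighted F1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the forest cover data (3 classes, 10 features)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipe, param_grid={'clf__n_neighbors': [1, 3, 5]},
                      scoring='f1_weighted', cv=3)
search.fit(X_tr, y_tr)

# score() uses the same weighted-F1 scorer on the held-out split
dev_score = search.score(X_dev, y_dev)
print(search.best_params_, round(dev_score, 4))
```

Selecting on `f1_weighted` keeps model selection aligned with the metric named in the testing plan.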

In [5]:
scaler_options = ['MinMax', 'Standard', 'Robust', None]
model_options = [
    {'model_type':'kNN','params':{'n_neighbors':list(range(1, 3))}}
    ,{'model_type':'Gaussian_NB','params':{'var_smoothing':[0.001]}}
    ,{'model_type':'Logistic_Regression','params':{'C':[500, 1000]}}
    ,{'model_type':'Decision_Tree','params':{'max_leaf_nodes':[50]}}
    ,{'model_type':'SVC','params':{'C':[10],'gamma':[0.5]}}   
    ,{'model_type':'XGBoost','params':{'max_depth':[7],'subsample':[0.8],'n_estimators':[200]}}
    ,{'model_type':'Neural_Net','params':{'hidden_layer_sizes':[(100,),(100,20)]}}    
]

In [6]:
i = 1
finalResult_df = pd.DataFrame()
for scalerType in scaler_options:
    for modelType in model_options:
        start_time = time.time()
        model = Model(model_type=modelType['model_type'])
        model.gridSearchCv(train_data, dev_data, train_labels, dev_labels,
                     params=modelType['params'], scaler_type=scalerType)
        end_time = time.time()
        print(
        f'''
        Model Number {i}: {modelType['model_type']}
        Scaler Type: {scalerType}
        Parameters: {modelType['params']}
        Highest F1-Score: {model.best_f1score:.4f}
        Highest Accuracy: {model.best_accuracy:.4f}
        Optimal Parameters: {model.best_model.best_params_}
        Run Time: {end_time-start_time:.2f}s
        '''
        )
        
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        finalResult_df = pd.concat([finalResult_df, pd.DataFrame(
        {
            'Model Number':[i]
            ,'Model Type':[model.model_type]
            ,'Scaler Type':[scalerType]
            ,'F1-Score':[round(model.best_f1score, 4)]
            ,'Optimal Parameters':[model.best_model.best_params_]
            ,'Run Time':[round(end_time-start_time, 2)]
        }
        )], ignore_index=True)
        i+=1


        Model Number 1: kNN
        Scaler Type: MinMax
        Parameters: {'n_neighbors': [1, 2]}
        Highest F1-Score: 0.8171
        Optimal Parameters: {'n_neighbors': 1}
        Run Time: 9.72s
        

        Model Number 2: Gaussian_NB
        Scaler Type: MinMax
        Parameters: {'var_smoothing': [0.001]}
        Highest F1-Score: 0.5032
        Optimal Parameters: {'var_smoothing': 0.001}
        Run Time: 0.29s
        

        Model Number 3: Logistic_Regression
        Scaler Type: MinMax
        Parameters: {'C': [500, 1000]}
        Highest F1-Score: 0.6985
        Optimal Parameters: {'C': 1000}
        Run Time: 63.25s
        

        Model Number 4: Decision_Tree
        Scaler Type: MinMax
        Parameters: {'max_leaf_nodes': [50]}
        Highest F1-Score: 0.6908
        Optimal Parameters: {'max_leaf_nodes': 50}
        Run Time: 0.48s
        

        Model Number 5: SVC
        Scaler Type: MinMax
        Parameters: {'C': [10], 'gamma': [0.5]}
  

## <a id = 11> </a>Result Analyses
[Back to Navigation](#0)

In [14]:
finalResult_df.sort_values(by='F1-Score', ascending=False).reset_index(drop=True)

| | Model Number | Model Type | Scaler Type | F1-Score | Optimal Parameters |
|---|---|---|---|---|---|
| 0 | 6 | XGBoost | MinMax | 0.8701 | `{'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}` |
| 1 | 20 | XGBoost | Robust | 0.8689 | `{'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}` |
| 2 | 13 | XGBoost | Standard | 0.8667 | `{'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}` |
| 3 | 27 | XGBoost | None | 0.8654 | `{'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}` |
| 4 | 22 | kNN | None | 0.8474 | `{'n_neighbors': 1}` |
| 5 | 19 | SVC | Robust | 0.8461 | `{'C': 10, 'gamma': 0.5}` |
| 6 | 21 | Neural_Net | Robust | 0.8436 | `{'hidden_layer_sizes': (100, 20)}` |
| 7 | 14 | Neural_Net | Standard | 0.8351 | `{'hidden_layer_sizes': (100, 20)}` |
| 8 | 12 | SVC | Standard | 0.831 | `{'C': 10, 'gamma': 0.5}` |
| 9 | 7 | Neural_Net | MinMax | 0.8205 | `{'hidden_layer_sizes': (100, 20)}` |
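
One way to read the table is to ask how much the choice of scaler matters for a given model. The sketch below does this for the four XGBoost rows, with the F1 scores copied from the results above; the `spread` frame and its `range` column are names introduced for the example.

```python
import pandas as pd

# XGBoost rows copied from the results table above
results = pd.DataFrame({
    'Model Type': ['XGBoost'] * 4,
    'Scaler Type': ['MinMax', 'Robust', 'Standard', 'None'],
    'F1-Score': [0.8701, 0.8689, 0.8667, 0.8654],
})

# Spread of F1 across scalers for each model type
spread = results.groupby('Model Type')['F1-Score'].agg(['min', 'max'])
spread['range'] = spread['max'] - spread['min']
print(spread)
```

For XGBoost the four scores differ by less than 0.005, which suggests the scaler has little effect on this model. The same grouping could be applied to the full `finalResult_df`.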
