# Benchmark Ensemble

## Wilson Peguero Rosario

Ensemble learning is a general approach where the combination of related methods provides better predictions or improves overall performance. Some real-world examples of its use include the Netflix Challenge, gene classification, image segmentation, and video retrieval.

In this assignment, you will implement ensemble learning, combining a variety of learning methods such as max voting, averaging, weighted averaging, bagging, boosting (gradient boosting, random forest, XGBoost, etc.), stacking, blending, and other variations.

You will have the freedom to choose between implementing classification or regression machine learning, or a combination of the two, so choose your ensemble techniques accordingly.

1. Access the "UCI Machine Learning Repository," located in the topic Resources. Note: There are about 120 datasets that are suitable for use in a clustering task. For this part of the exercise, you must choose one of these datasets, provided it includes at least 10 attributes and 10,000 instances.
2. Ensure that the datasets are suitable for clustering using this method.
3. You may search for data in other repositories, such as Data.gov or Kaggle.

For your selected dataset, build an ensemble model as follows:

1. Explain the dataset and the type of information you wish to gain by applying an ensemble method.
1. Explain the ensemble components and how you will be using them in your analysis (list the steps, intuition behind the mathematical representation, and address its assumptions). Specifically, which of max voting, averaging, weighted averaging, bagging, boosting (gradient boosting, random forest, XGBoost, etc.), stacking, blending, and/or other variations have you chosen, and why?
1. Import necessary libraries, then read the dataset into a data frame and perform initial statistical exploration.
1. Clean the data and address unusual phenomena (e.g., normalization, feature scaling, outliers); use illustrative diagrams and plots and explain them.
1. Formulate two questions that can be answered by employing ensemble learning.
1. If appropriate and relevant to your model, split the data into training and testing sets.
1. Provide a diagram that illustrates how the ensemble components are combined into one learning model.
1. Implement and execute the ensemble learning model. Explain the intuition behind each mathematical step.
1. Answer the questions you formulated using the results obtained from executing the ensemble model.
1. Interpret the predictions made by the model in the context of the questions you asked.
1. Validate your model using relevant validation metrics such as a confusion matrix, accuracy score, ROC-AUC curves, and k-fold cross validation. Then, explain the results.
1. Explain how the ensemble system reduced the variance.
1. Include all mathematical formulas used and graphs representing the final outcomes.

Prepare a comprehensive technical report as a markdown document or Jupyter notebook, including all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) problem statement, b) algorithm of the solution, c) analysis of the findings, and d) references.

### Problem Statement

Currently, 99% (3.96 billion) of the four billion species that have ever lived on Earth have gone extinct. Even now, multiple species are at risk of becoming extinct. As a result, multiple variants of a single species may have gone extinct without ever being identified. Using genetic material and data on living species, one can determine the closest living relative of an extinct species. Knowing the relationship between extinct and current species allows one to understand how a species has adapted to its environment (whether through regression or through a new mutation that allowed the current species to thrive in its environment).

From the codon usage dataset, one hopes to extract enough information to determine whether an organism belongs to a certain species, family, kingdom, etc. At the moment, the dataset provides the DNA type of each species as well as the kingdom to which the species belongs.

In [66]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [46]:
df__codons = pd.read_csv(r'codon_usage_dataset/codon_usage.csv')
df__codons.describe()

Unnamed: 0,DNAtype,SpeciesID,Ncodons,UUA,UUG,CUU,CUC,CUA,CUG,AUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
count,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,...,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0,13027.0
mean,0.367237,130460.164888,79611.78,0.020637,0.014104,0.017821,0.018286,0.019044,0.018451,0.028353,...,0.005454,0.009929,0.006423,0.024179,0.021162,0.028291,0.021683,0.001643,0.000592,0.006178
std,0.688745,124787.591843,719728.3,0.02071,0.00928,0.010587,0.014572,0.024251,0.016578,0.017507,...,0.006605,0.008574,0.006387,0.013827,0.01304,0.014343,0.015018,0.001824,0.000907,0.010345
min,0.0,7.0,1000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,28851.5,1602.0,0.00561,0.007105,0.01089,0.00783,0.005305,0.00718,0.016365,...,0.00122,0.00169,0.00117,0.012385,0.01186,0.01736,0.00971,0.00056,0.0,0.00041
50%,0.0,81972.0,2929.0,0.01526,0.01336,0.01613,0.01456,0.00968,0.0128,0.02548,...,0.00353,0.00927,0.00455,0.02542,0.01907,0.02609,0.02054,0.00138,0.00042,0.00113
75%,1.0,222891.5,9120.0,0.02949,0.01981,0.02273,0.02511,0.01725,0.02432,0.038115,...,0.00715,0.015925,0.01025,0.03419,0.02769,0.0368,0.031125,0.00237,0.00083,0.00289
max,12.0,465364.0,40662580.0,0.15133,0.10119,0.08978,0.10035,0.16392,0.10737,0.15406,...,0.05554,0.09883,0.05843,0.18566,0.11384,0.14489,0.15855,0.0452,0.02561,0.1067


In [47]:
df__codons.columns

Index(['Kingdom', 'DNAtype', 'SpeciesID', 'Ncodons', 'SpeciesName', 'UUU',
       'UUC', 'UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG', 'AUU', 'AUC', 'AUA',
       'AUG', 'GUU', 'GUC', 'GUA', 'GUG', 'GCU', 'GCC', 'GCA', 'GCG', 'CCU',
       'CCC', 'CCA', 'CCG', 'UGG', 'GGU', 'GGC', 'GGA', 'GGG', 'UCU', 'UCC',
       'UCA', 'UCG', 'AGU', 'AGC', 'ACU', 'ACC', 'ACA', 'ACG', 'UAU', 'UAC',
       'CAA', 'CAG', 'AAU', 'AAC', 'UGU', 'UGC', 'CAU', 'CAC', 'AAA', 'AAG',
       'CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG', 'GAU', 'GAC', 'GAA', 'GAG',
       'UAA', 'UAG', 'UGA'],
      dtype='object')

### Data Processing

Now that the data has been loaded, it can be processed into a more functional format. The first change is to convert the *Kingdom* feature into its categorical version and then into integer codes.

In [48]:
df__codons['Kingdom'] = pd.Categorical(df__codons['Kingdom'])
dict__cat = dict(enumerate(df__codons['Kingdom'].cat.categories))
df__codons['Kingdom'] = df__codons['Kingdom'].cat.codes
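
The encoding step above can be illustrated on a toy frame. This is a minimal sketch, assuming a small hand-made `toy` DataFrame with illustrative labels (not rows from the real dataset); `code_to_label` plays the same role as `dict__cat` and shows how integer codes, such as model predictions, can be mapped back to the original labels.

```python
import pandas as pd

# Toy frame standing in for df__codons; the labels are illustrative only.
toy = pd.DataFrame({'Kingdom': ['bct', 'vrl', 'pln', 'bct']})

toy['Kingdom'] = pd.Categorical(toy['Kingdom'])
# Build the code -> label lookup before the labels are overwritten by codes.
code_to_label = dict(enumerate(toy['Kingdom'].cat.categories))
toy['Kingdom'] = toy['Kingdom'].cat.codes

# Decode integer codes (for example, model predictions) back into labels.
decoded = toy['Kingdom'].map(code_to_label)
print(code_to_label)
print(decoded.tolist())
```

Categories are ordered alphabetically by default, so the integer codes carry no biological ordering; they are only identifiers for the classes.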

Now that the data is fully categorized, a correlation matrix can be created.

In [49]:
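# numeric_only=True skips non-numeric columns such as SpeciesName (required in pandas >= 2.0)
corr_matrix = df__codons.corr(numeric_only=True)
corr_matrix\
    .style\
        .background_gradient(cmap='coolwarm', axis=None, vmin=-1, vmax=1)\
            .format(precision=2)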

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,UUA,UUG,CUU,CUC,CUA,CUG,AUU,AUC,AUA,AUG,GUU,GUC,GUA,GUG,GCU,GCC,GCA,GCG,CCU,CCC,CCA,CCG,UGG,GGU,GGC,GGA,GGG,UCU,UCC,UCA,UCG,AGU,AGC,ACU,ACC,ACA,ACG,UAU,UAC,CAA,CAG,AAU,AAC,UGU,UGC,CAU,CAC,AAA,AAG,CGU,CGC,CGA,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
Kingdom,1.0,0.17,0.06,-0.08,-0.03,-0.07,0.3,0.23,0.35,-0.23,0.0,0.06,0.18,-0.11,-0.07,-0.18,0.05,-0.18,0.0,-0.07,0.14,-0.42,0.25,0.28,0.41,-0.38,-0.11,-0.26,-0.31,0.12,0.03,0.13,0.31,0.31,-0.28,-0.04,-0.04,0.26,0.17,0.38,-0.29,-0.07,0.09,0.15,-0.16,-0.06,0.16,0.21,0.14,0.0,0.18,-0.1,-0.13,-0.28,-0.37,0.18,-0.27,0.06,0.19,-0.23,-0.24,-0.26,-0.19,0.05,0.02,0.26
DNAtype,0.17,1.0,-0.02,-0.05,0.46,-0.04,0.38,0.07,0.46,-0.31,0.46,0.14,0.45,-0.37,-0.08,-0.33,0.39,-0.44,-0.05,-0.12,0.04,-0.32,0.04,0.03,0.16,-0.33,-0.25,-0.15,-0.29,0.16,-0.12,0.32,0.12,0.37,-0.27,-0.05,-0.38,0.13,-0.01,0.21,-0.39,0.31,-0.19,0.14,-0.5,0.1,-0.18,-0.09,-0.31,0.23,-0.02,0.1,-0.49,-0.08,-0.29,0.49,-0.22,-0.09,-0.32,-0.27,-0.46,-0.1,-0.5,0.12,-0.03,0.46
SpeciesID,0.06,-0.02,1.0,0.05,0.03,-0.05,0.06,0.02,0.05,-0.04,0.04,0.01,0.03,-0.05,0.01,-0.01,0.04,-0.06,-0.05,0.0,0.03,0.03,-0.01,0.01,0.01,0.01,-0.02,-0.05,-0.01,-0.09,-0.02,-0.0,-0.03,0.04,-0.0,-0.03,-0.06,0.02,0.02,0.06,0.03,0.04,-0.06,0.05,-0.04,0.03,-0.02,-0.03,-0.07,0.04,0.0,0.02,-0.08,0.04,0.02,0.02,0.01,-0.03,-0.0,-0.01,-0.03,-0.01,-0.08,0.05,-0.01,0.05
Ncodons,-0.08,-0.05,0.05,1.0,-0.03,0.01,-0.04,-0.0,-0.05,0.08,-0.04,-0.0,-0.05,0.03,-0.02,0.03,-0.04,0.07,-0.03,0.04,-0.02,0.1,-0.04,-0.01,-0.05,0.08,0.02,-0.01,0.06,-0.04,0.02,-0.04,-0.02,-0.06,0.06,0.0,0.05,-0.06,-0.0,-0.05,0.06,-0.03,-0.03,-0.03,0.06,-0.02,-0.03,-0.02,-0.0,-0.01,-0.01,-0.01,0.02,0.02,0.08,-0.04,0.07,-0.01,0.0,0.02,0.03,0.03,0.05,-0.03,0.0,-0.05
UUA,-0.03,0.46,0.03,-0.03,1.0,0.02,0.26,-0.36,0.19,-0.55,0.82,-0.29,0.67,-0.26,0.3,-0.61,0.58,-0.54,0.13,-0.48,0.12,-0.43,0.21,-0.36,0.14,-0.47,-0.33,0.11,-0.54,0.21,-0.2,0.46,-0.36,0.47,-0.42,0.31,-0.5,0.38,-0.4,0.3,-0.45,0.62,-0.48,0.3,-0.59,0.56,-0.43,0.1,-0.43,0.29,-0.41,0.44,-0.5,-0.06,-0.49,0.18,-0.4,0.23,-0.27,0.08,-0.63,0.19,-0.58,0.26,0.01,0.33
UUG,-0.07,-0.04,-0.05,0.01,0.02,1.0,-0.01,-0.48,-0.42,-0.15,0.03,-0.49,-0.17,0.35,0.47,-0.2,-0.04,0.16,0.27,-0.41,-0.16,-0.06,0.14,-0.41,-0.21,-0.08,0.33,0.31,-0.3,-0.09,0.13,0.4,-0.31,-0.12,0.11,0.49,-0.13,0.16,-0.46,-0.31,0.02,0.38,-0.26,0.11,0.05,0.31,-0.33,0.41,-0.03,0.42,-0.42,0.23,0.24,0.3,-0.15,-0.01,-0.04,0.4,0.29,0.48,-0.16,0.32,0.12,-0.13,0.05,-0.47
CUU,0.3,0.38,0.06,-0.04,0.26,-0.01,1.0,0.19,0.36,-0.42,0.43,-0.03,0.29,-0.26,0.19,-0.35,0.24,-0.49,0.23,-0.13,0.29,-0.44,0.3,0.08,0.2,-0.45,-0.27,-0.07,-0.32,0.17,-0.14,0.31,0.11,0.33,-0.38,-0.05,-0.26,0.32,0.01,0.33,-0.41,0.11,-0.21,0.23,-0.44,0.06,-0.15,-0.07,-0.26,0.1,-0.09,0.07,-0.33,-0.13,-0.42,0.25,-0.37,-0.08,-0.18,-0.14,-0.51,-0.03,-0.43,0.16,0.01,0.42
CUC,0.23,0.07,0.02,-0.0,-0.36,-0.48,0.19,1.0,0.46,0.23,-0.3,0.66,-0.08,-0.34,-0.52,0.44,-0.29,0.01,-0.31,0.64,-0.0,0.14,-0.21,0.72,0.1,0.14,-0.24,-0.46,0.42,-0.12,0.02,-0.43,0.64,-0.08,0.05,-0.62,0.21,-0.3,0.71,0.13,0.05,-0.62,0.36,-0.2,-0.0,-0.64,0.4,-0.44,0.09,-0.49,0.59,-0.51,-0.16,-0.35,0.25,0.08,0.17,-0.56,-0.19,-0.67,0.22,-0.53,0.01,0.01,-0.01,0.47
CUA,0.35,0.46,0.05,-0.05,0.19,-0.42,0.36,0.46,1.0,-0.31,0.28,0.6,0.55,-0.57,-0.31,-0.24,0.39,-0.53,-0.23,0.12,0.34,-0.4,-0.06,0.33,0.59,-0.41,-0.56,-0.41,-0.18,0.27,-0.21,-0.11,0.36,0.54,-0.45,-0.39,-0.2,0.05,0.39,0.66,-0.44,-0.17,0.08,0.17,-0.56,-0.23,0.24,-0.31,-0.22,-0.22,0.32,-0.07,-0.6,-0.42,-0.34,0.4,-0.35,-0.37,-0.39,-0.55,-0.39,-0.29,-0.59,0.23,-0.03,0.84
CUG,-0.23,-0.31,-0.04,0.08,-0.55,-0.15,-0.42,0.23,-0.31,1.0,-0.59,0.22,-0.52,0.21,-0.45,0.59,-0.56,0.74,-0.34,0.65,-0.34,0.68,-0.37,0.32,-0.43,0.7,0.28,-0.22,0.7,-0.43,0.31,-0.5,0.2,-0.58,0.47,-0.34,0.55,-0.57,0.4,-0.46,0.49,-0.5,0.25,-0.5,0.66,-0.52,0.14,-0.21,0.31,-0.32,0.33,-0.5,0.26,0.03,0.74,-0.28,0.67,-0.33,0.06,-0.26,0.63,-0.29,0.56,-0.22,-0.03,-0.24
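
A matrix of this size is hard to read by eye. One way to summarize it is to list only the strongly correlated pairs. The following is a sketch, assuming a hypothetical helper `top_correlated_pairs` and a 3x3 matrix built from three coefficients shown in the table above (UUA/AUU, UUA/CUA, CUA/AUU); on the real data it would be called on `corr_matrix`.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(corr: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never satisfy the comparison, so they are filtered out here.
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


# Small matrix using three coefficients copied from the table above.
cols = ['UUA', 'AUU', 'CUA']
demo_corr = pd.DataFrame(
    [[1.00, 0.82, 0.19],
     [0.82, 1.00, 0.28],
     [0.19, 0.28, 1.00]],
    index=cols, columns=cols,
)

strong_pairs = top_correlated_pairs(demo_corr, threshold=0.8)
print(strong_pairs)
```

Lowering `threshold` widens the list; the cut-off of 0.8 is an arbitrary choice for illustration.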


Taking the above correlation matrix into account, one can observe clear correlation among several of the explanatory variables (multicollinearity), as well as very low correlation between many of the explanatory variables and the response variables (*Kingdom* and *DNAtype*). The next step is to divide the explanatory variables (codon concentrations) by the number of codons (*Ncodons*).

In [50]:
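# Keep only rows whose UUU value parses as a number; .copy() avoids SettingWithCopyWarning later
df__codons = df__codons[pd.to_numeric(df__codons['UUU'], errors='coerce').notnull()].copy()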


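Note that the filter above keeps only the rows whose *UUU* value can be parsed as a number; rows with non-numeric entries in that column are dropped.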

In [51]:
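df__codons['Ncodons'] = df__codons['Ncodons'].astype('float64')
# Columns that hold labels or metadata rather than codon values
non_codon_cols = ('Kingdom', 'DNAtype', 'Ncodons', 'SpeciesID', 'SpeciesName')
# DataFrame.iteritems() was removed in pandas 2.0, so iterate over the column names instead
for col in df__codons.columns:
    if col not in non_codon_cols:
        df__codons[col] = df__codons[col].astype('float64')
        df__codons[col] = df__codons[col].div(df__codons['Ncodons'], axis=0)

# numeric_only=True skips non-numeric columns such as SpeciesName (required in pandas >= 2.0)
corr_matrix = df__codons.corr(numeric_only=True)
corr_matrix\
    .style\
        .background_gradient(cmap='coolwarm', axis=None, vmin=-1, vmax=1)\
            .format(precision=2)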

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,UUU,UUC,UUA,UUG,CUU,CUC,CUA,CUG,AUU,AUC,AUA,AUG,GUU,GUC,GUA,GUG,GCU,GCC,GCA,GCG,CCU,CCC,CCA,CCG,UGG,GGU,GGC,GGA,GGG,UCU,UCC,UCA,UCG,AGU,AGC,ACU,ACC,ACA,ACG,UAU,UAC,CAA,CAG,AAU,AAC,UGU,UGC,CAU,CAC,AAA,AAG,CGU,CGC,CGA,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
Kingdom,1.0,0.17,0.06,-0.08,0.1,0.22,0.08,0.05,0.28,0.28,0.32,-0.02,0.13,0.18,0.22,0.1,0.08,0.04,0.15,0.0,0.13,0.14,0.23,-0.19,0.24,0.31,0.33,-0.14,0.05,-0.03,-0.03,0.18,0.16,0.15,0.29,0.28,-0.05,0.08,0.13,0.25,0.25,0.34,-0.05,0.07,0.18,0.22,0.03,0.09,0.22,0.18,0.16,0.1,0.24,0.09,0.02,-0.05,-0.15,0.19,-0.08,0.11,0.19,-0.0,-0.01,0.01,-0.01,0.16,0.1,0.26
DNAtype,0.17,1.0,-0.02,-0.05,0.5,0.27,0.52,0.22,0.41,0.16,0.43,-0.11,0.48,0.23,0.48,0.01,0.13,-0.08,0.42,-0.17,0.16,0.06,0.18,-0.17,0.18,0.15,0.23,-0.14,0.05,0.05,-0.08,0.26,0.08,0.4,0.24,0.4,-0.03,0.17,-0.11,0.26,0.13,0.24,-0.15,0.41,0.07,0.26,-0.21,0.27,0.06,0.08,-0.12,0.33,0.14,0.3,-0.18,0.12,-0.12,0.47,-0.04,0.11,-0.14,0.04,-0.15,0.18,-0.2,0.16,0.02,0.39
SpeciesID,0.06,-0.02,1.0,0.05,0.08,0.08,0.07,0.04,0.1,0.07,0.09,-0.0,0.09,0.07,0.09,0.06,0.09,0.06,0.09,-0.0,0.06,0.05,0.09,0.02,0.09,0.09,0.11,0.03,0.06,0.03,0.02,0.04,0.05,0.09,0.08,0.12,0.05,0.06,0.04,0.1,0.08,0.12,0.08,0.09,0.05,0.12,0.04,0.1,0.09,0.06,0.03,0.1,0.1,0.08,0.03,0.08,0.02,0.09,0.04,0.07,0.1,0.07,0.04,0.05,0.01,0.1,0.04,0.08
Ncodons,-0.08,-0.05,0.05,1.0,-0.09,-0.1,-0.08,-0.09,-0.09,-0.08,-0.06,-0.08,-0.1,-0.1,-0.08,-0.11,-0.1,-0.09,-0.1,-0.09,-0.11,-0.09,-0.11,-0.07,-0.1,-0.09,-0.1,-0.07,-0.09,-0.09,-0.09,-0.11,-0.09,-0.09,-0.1,-0.1,-0.08,-0.09,-0.1,-0.11,-0.09,-0.09,-0.08,-0.08,-0.11,-0.1,-0.08,-0.09,-0.11,-0.08,-0.08,-0.08,-0.1,-0.1,-0.09,-0.08,-0.07,-0.08,-0.06,-0.08,-0.07,-0.1,-0.1,-0.1,-0.09,-0.07,-0.05,-0.05
UUU,0.1,0.5,0.08,-0.09,1.0,0.37,0.83,0.68,0.64,0.14,0.27,0.08,0.83,0.25,0.65,0.45,0.63,0.14,0.66,0.18,0.51,0.11,0.41,0.02,0.57,0.23,0.42,0.07,0.48,0.42,0.06,0.54,0.38,0.81,0.35,0.64,0.26,0.67,0.15,0.62,0.15,0.35,0.14,0.86,0.3,0.59,0.15,0.77,0.27,0.54,0.16,0.73,0.26,0.71,0.23,0.43,0.03,0.65,0.18,0.57,0.26,0.57,0.13,0.6,0.16,0.34,0.17,0.26
UUC,0.22,0.27,0.08,-0.1,0.37,1.0,0.32,0.33,0.56,0.72,0.64,0.46,0.49,0.87,0.48,0.52,0.38,0.67,0.56,0.39,0.49,0.68,0.59,0.3,0.48,0.7,0.64,0.35,0.45,0.37,0.58,0.64,0.43,0.43,0.78,0.61,0.41,0.26,0.5,0.47,0.69,0.53,0.38,0.37,0.8,0.47,0.37,0.34,0.81,0.26,0.46,0.43,0.81,0.42,0.4,0.33,0.4,0.62,0.32,0.22,0.27,0.32,0.6,0.42,0.39,0.39,0.25,0.55
UUA,0.08,0.52,0.07,-0.08,0.83,0.32,1.0,0.44,0.58,0.13,0.4,-0.07,0.87,0.26,0.77,0.29,0.52,0.01,0.71,-0.02,0.45,0.07,0.45,-0.08,0.5,0.18,0.46,-0.05,0.26,0.36,-0.03,0.53,0.24,0.67,0.28,0.68,0.1,0.52,0.05,0.58,0.15,0.48,0.02,0.73,0.2,0.56,-0.03,0.7,0.2,0.37,0.02,0.59,0.22,0.64,0.03,0.33,-0.08,0.57,0.02,0.46,0.09,0.42,-0.01,0.52,-0.03,0.38,0.16,0.43
UUG,0.05,0.22,0.04,-0.09,0.68,0.33,0.44,1.0,0.48,0.06,0.01,0.27,0.52,0.18,0.3,0.64,0.67,0.3,0.47,0.47,0.58,0.13,0.33,0.26,0.53,0.16,0.3,0.28,0.66,0.55,0.15,0.43,0.51,0.74,0.34,0.44,0.49,0.74,0.3,0.55,0.12,0.15,0.37,0.74,0.34,0.54,0.41,0.7,0.3,0.6,0.28,0.73,0.24,0.68,0.54,0.61,0.21,0.54,0.37,0.67,0.48,0.73,0.32,0.69,0.47,0.21,0.21,-0.13
CUU,0.28,0.41,0.1,-0.09,0.64,0.56,0.58,0.48,1.0,0.57,0.54,0.17,0.74,0.5,0.61,0.43,0.55,0.29,0.59,0.15,0.61,0.45,0.65,0.05,0.64,0.56,0.55,0.09,0.38,0.38,0.29,0.58,0.38,0.63,0.61,0.67,0.23,0.43,0.32,0.67,0.52,0.61,0.18,0.55,0.46,0.64,0.18,0.53,0.48,0.34,0.2,0.54,0.49,0.58,0.24,0.35,0.11,0.62,0.15,0.35,0.21,0.41,0.21,0.53,0.19,0.42,0.24,0.54
CUC,0.28,0.16,0.07,-0.08,0.14,0.72,0.13,0.06,0.57,1.0,0.65,0.42,0.31,0.78,0.36,0.3,0.15,0.59,0.27,0.27,0.31,0.79,0.53,0.26,0.35,0.82,0.51,0.28,0.21,0.13,0.62,0.4,0.32,0.15,0.77,0.43,0.27,0.01,0.49,0.34,0.83,0.55,0.29,0.05,0.64,0.36,0.28,0.05,0.68,0.03,0.33,0.11,0.72,0.18,0.23,0.08,0.38,0.38,0.25,-0.05,0.11,0.03,0.46,0.17,0.27,0.35,0.23,0.65


With the codon columns divided by *Ncodons*, the data modeling process can begin.
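
The column-by-column division by *Ncodons* can also be written as a single vectorized operation. This is a minimal sketch, assuming a two-row `toy` frame with made-up values and a hypothetical `meta_cols` list naming the non-codon columns.

```python
import pandas as pd

# Made-up values for illustration; the real frame has 64 codon columns.
toy = pd.DataFrame({
    'Kingdom': [0, 1],
    'Ncodons': [1000.0, 2000.0],
    'UUU': [0.02, 0.04],
    'UUC': [0.01, 0.03],
})

# Columns that should not be divided (labels and metadata).
meta_cols = ['Kingdom', 'DNAtype', 'Ncodons', 'SpeciesID', 'SpeciesName']
codon_cols = [c for c in toy.columns if c not in meta_cols]

# axis=0 aligns Ncodons with the rows, so each row is divided by its own codon count.
toy[codon_cols] = toy[codon_cols].div(toy['Ncodons'], axis=0)
print(toy)
```

Selecting the codon columns once and dividing them together avoids reassigning columns inside a loop and keeps the list of excluded columns in one place.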

### Data Modeling

Now that the fields have been normalized, the data will be split into training and test sets.

In [60]:
y = df__codons['Kingdom']
X = df__codons.drop(['Kingdom','DNAtype','SpeciesName'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3
)
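
The split above is random and unseeded, so the reported metrics will change from run to run, and rare classes may be unevenly represented in the test set. The following sketch, assuming synthetic imbalanced data (`X_demo`, `y_demo`) and an arbitrary seed, shows how `stratify` and `random_state` could be used to keep class proportions and make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data: 160 samples of class 0 and 40 of class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array([0] * 160 + [1] * 40)

# stratify keeps the 80/20 class ratio in both parts; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print('train class counts:', np.bincount(y_tr))
print('test class counts :', np.bincount(y_te))
```

Stratification requires at least two samples in every class, so it may need adjusting if a class in the real data is extremely rare.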

Now that the data is properly split, the model can be created using the gradient boosting algorithm, chosen because it tends to deliver high predictive accuracy with relatively little processing time.

In [61]:
gbc = GradientBoostingClassifier(n_estimators=100).fit(X_train, y_train)
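
To give some intuition for what the fitted model does internally, the sketch below builds a boosted ensemble by hand. It is a simplification under stated assumptions: it uses squared-error loss on a synthetic regression problem, whereas `GradientBoostingClassifier` optimizes a classification loss, and the helper names `fit_boosted_trees` and `predict_boosted_trees` are hypothetical. The shared idea is that each new tree is fitted to the errors left by the trees before it, and its prediction is added with a small learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Fit squared-error gradient boosting; returns (initial value, list of trees)."""
    init = float(np.mean(y))              # stage 0: a constant prediction
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred              # negative gradient of 0.5 * (y - pred)**2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)   # small corrective step
        trees.append(tree)
    return init, trees


def predict_boosted_trees(init, trees, X, learning_rate=0.1):
    """Add every tree's scaled correction to the constant starting value."""
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Synthetic one-feature regression problem (illustrative data only).
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 6, 200).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

init, trees = fit_boosted_trees(X_demo, y_demo)
mse_constant = np.mean((y_demo - init) ** 2)
mse_boosted = np.mean((y_demo - predict_boosted_trees(init, trees, X_demo)) ** 2)
print(f'MSE of constant model: {mse_constant:.4f}')
print(f'MSE after boosting   : {mse_boosted:.4f}')
```

The learning rate must be the same at fit and predict time; a smaller value usually needs more trees to reach the same training error.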

Now that the model has been created, one can extract predictions from it. These predictions can then be used to see how effective the model is at determining the *Kingdom* to which each species belongs.

In [64]:
y_pred = gbc.predict(X_test)
print(f'Accuracy                  :{accuracy_score(y_test, y_pred)}')
print(f'error rate                :{1-accuracy_score(y_test,y_pred)}')
print(f'precision                 :{precision_score(y_test,y_pred, average=None).mean()}')
print(f'recall                    :{recall_score(y_test, y_pred, average=None).mean()}')

Accuracy                  :0.7443705220061413
error rate                :0.2556294779938587
precision                 :0.6799054747775475
recall                    :0.5863583750827073
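
The precision and recall figures above are per-class scores averaged with equal weight, which is what scikit-learn calls macro averaging. The sketch below, using a small hand-made set of labels (`y_true`, `y_hat`) rather than the notebook's predictions, shows that `average=None` followed by `.mean()` matches `average='macro'`, and prints the confusion matrix from which the per-class scores are derived.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for three classes (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 0, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)

per_class_precision = precision_score(y_true, y_hat, average=None)
macro_precision = precision_score(y_true, y_hat, average='macro')
per_class_recall = recall_score(y_true, y_hat, average=None)
macro_recall = recall_score(y_true, y_hat, average='macro')

print(cm)
print('per-class precision:', per_class_precision)
print('macro precision    :', macro_precision)
print('macro recall       :', macro_recall)
```

Because every class counts equally in a macro average, a small class that is predicted poorly pulls the score down more than it pulls down overall accuracy.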


Using a simple decision tree, we obtain the following:

In [68]:
dtmodel = DecisionTreeClassifier(criterion='entropy').fit(X_train, y_train)
y_simple_pred = dtmodel.predict(X_test)

print(f'Accuracy                  :{accuracy_score(y_test, y_simple_pred)}')
print(f'error rate                :{1-accuracy_score(y_test,y_simple_pred)}')
print(f'precision                 :{precision_score(y_test,y_simple_pred, average=None).mean()}')
print(f'recall                    :{recall_score(y_test, y_simple_pred, average=None).mean()}')

Accuracy                  :0.7525588536335721
error rate                :0.2474411463664279
precision                 :0.5841209213047078
recall                    :0.5799186201320178


In this case, the ensemble method substantially improved only the precision of the model (0.68 versus 0.58 for the single decision tree). Recall was essentially unchanged (0.586 versus 0.580), and accuracy was somewhat lower (0.744 versus 0.753). This may be deemed a success in terms of how often a predicted label is correct; on the other hand, the fact that the other metrics did not significantly improve may warrant exploring different models.
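
Because the two models differ by less than one percentage point in accuracy on a single random split, k-fold cross validation would give a steadier comparison. The following is a sketch on synthetic data from `make_classification`, not on the codon dataset; the fold count, seeds, and `n_estimators` value are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# The same folds are reused for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    'gradient boosting': GradientBoostingClassifier(n_estimators=50, random_state=0),
    'decision tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
}

scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(f'{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}')
```

The standard deviation across folds gives a rough sense of how much of the gap between the two models could be due to the particular split.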

Source(s):

- https://ourworldindata.org/extinctions
- https://www.sciencedirect.com/science/article/abs/pii/S1673852721001855