Ensemble learning is a general approach where the combination of related methods provides better predictions or improves overall performance. Some real-world examples of its use include the Netflix Challenge, gene classification, image segmentation, and video retrieval.

In this assignment, you will implement ensemble learning, combining a variety of learning methods such as max voting, averaging, weighted averaging, bagging, boosting (gradient boosting, random forest, XGBoost, etc.), stacking, blending, and other variations.

You will have the freedom to choose between implementing classification or regression machine learning, or a combination of the two, so choose your ensemble techniques accordingly.

1. Access the "UCI Machine Learning Repository," located in the topic Resources. Note: There are about 120 datasets that are suitable for use in a clustering task. For this part of the exercise, you must choose one of these datasets, provided it includes at least 10 attributes and 10,000 instances.
2. Ensure that the datasets are suitable for clustering using this method.
3. You may search for data in other repositories, such as Data.gov or Kaggle.

For your selected dataset, build an ensemble model as follows:

1. Explain the dataset and the type of information you wish to gain by applying an ensemble method.
1. Explain the ensemble components and how you will be using them in your analysis (list the steps, the intuition behind the mathematical representation, and address its assumptions). Specifically, which of max voting, averaging, weighted averaging, bagging, boosting (gradient boosting, random forest, XGBoost, etc.), stacking, blending, and/or other variations have you chosen, and why?
1. Import necessary libraries, then read the dataset into a data frame and perform initial statistical exploration.
1. Clean the data and address unusual phenomena (e.g., normalization, feature scaling, outliers); use illustrative diagrams and plots and explain them.
1. Formulate two questions that can be answered by employing ensemble learning.
1. If appropriate and relevant to your model, split the data into training and testing sets.
1. Provide a diagram that illustrates how the ensemble components are combined into one learning model.
1. Implement and execute the ensemble learning model. Explain the intuition behind each mathematical step.
1. Answer the questions you formulated using the results obtained from executing the ensemble model.
1. Interpret the predictions made by the model in the context of the questions you asked.
1. Validate your model using relevant validation metrics such as a confusion matrix, accuracy score, ROC-AUC curves, and k-fold cross validation. Then, explain the results.
1. Explain how the ensemble system reduced the variance.
1. Include all mathematical formulas used and graphs representing the final outcomes.
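
The splitting, implementation, and validation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the required solution: the data is synthetic (generated with `make_classification`), the three base learners and their settings are arbitrary choices, and soft voting (averaging predicted class probabilities) stands in for whichever ensemble technique you select.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; replace with the chosen dataset).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Hold out 25% of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Soft voting averages the class probabilities of the base learners.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# Validation: held-out accuracy, confusion matrix, and 5-fold CV on the training set.
y_pred = ensemble.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cv_scores = cross_val_score(ensemble, X_train, y_train, cv=5)

print(f"Test accuracy: {test_accuracy:.3f}")
print(cm)
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Comparing the spread of `cv_scores` for the ensemble against the same scores for each base learner on its own is one simple way to discuss variance reduction.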

Prepare a comprehensive technical report as a markdown document or Jupyter notebook, including all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) problem statement, b) algorithm of the solution, c) analysis of the findings, and d) references.

### Problem Statement

An estimated 99% (3.96 billion) of the four billion species that have ever lived on Earth have gone extinct, and many living species are currently at risk of extinction. As a result, multiple variants of a single species may have gone extinct without ever being identified. Using genetic material and data on living species, one can determine the closest living relative of an extinct species. Knowing the relationship between extinct and living species helps one understand how a species adapted to its environment (whether through regression or through a new mutation that allowed it to thrive in that environment).

From the codon usage dataset, one hopes to extract enough information to determine whether a species belongs to a particular taxonomic group (species type, family, kingdom, etc.). At the moment, the dataset provides the DNA type of each species as well as the kingdom to which it belongs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the codon usage data and summarize its numeric columns.
df__codons = pd.read_csv(r'codon_usage_dataset/codon_usage.csv')
df__codons.describe()
```


| | DNAtype | SpeciesID | Ncodons | UUA | UUG | CUU | CUC | CUA | CUG | AUU | ... | CGG | AGA | AGG | GAU | GAC | GAA | GAG | UAA | UAG | UGA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | ... | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 | 13028.0 |
| mean | 0.367209 | 130451.105926 | 79605.76 | 0.020637 | 0.014104 | 0.01782 | 0.018288 | 0.019044 | 0.01845 | 0.028352 | ... | 0.005454 | 0.009929 | 0.006422 | 0.024178 | 0.021164 | 0.02829 | 0.021683 | 0.001645 | 0.000592 | 0.006178 |
| std | 0.688726 | 124787.086107 | 719701.0 | 0.020709 | 0.00928 | 0.010586 | 0.014572 | 0.02425 | 0.016578 | 0.017507 | ... | 0.006605 | 0.008574 | 0.006387 | 0.013828 | 0.013041 | 0.014342 | 0.015018 | 0.001834 | 0.000907 | 0.010344 |
| min | 0.0 | 7.0 | 1000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 28850.75 | 1602.0 | 0.00561 | 0.007108 | 0.01089 | 0.00783 | 0.005307 | 0.00718 | 0.01636 | ... | 0.00122 | 0.00169 | 0.00117 | 0.01238 | 0.01186 | 0.01736 | 0.00971 | 0.00056 | 0.0 | 0.00041 |
| 50% | 0.0 | 81971.5 | 2927.5 | 0.01526 | 0.01336 | 0.01613 | 0.01456 | 0.009685 | 0.0128 | 0.025475 | ... | 0.00353 | 0.00927 | 0.004545 | 0.02542 | 0.01907 | 0.026085 | 0.02054 | 0.00138 | 0.00042 | 0.00113 |
| 75% | 1.0 | 222891.25 | 9120.0 | 0.029485 | 0.01981 | 0.02273 | 0.025112 | 0.017245 | 0.024315 | 0.038113 | ... | 0.00715 | 0.015922 | 0.01025 | 0.03419 | 0.02769 | 0.0368 | 0.031122 | 0.00237 | 0.00083 | 0.00289 |
| max | 12.0 | 465364.0 | 40662580.0 | 0.15133 | 0.10119 | 0.08978 | 0.10035 | 0.16392 | 0.10737 | 0.15406 | ... | 0.05554 | 0.09883 | 0.05843 | 0.18566 | 0.11384 | 0.14489 | 0.15855 | 0.0452 | 0.02561 | 0.1067 |
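
By default `describe()` summarizes only numeric columns, so any column that pandas parsed as text (for example, a label column such as *Kingdom*) does not appear in the summary above. The sketch below shows one way to list such columns and to coerce a column that should be numeric. It runs on a small hypothetical frame, `df_example`, whose values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame; the values are illustrative only.
df_example = pd.DataFrame({
    "Kingdom": ["bct", "vrl", "pln"],
    "Ncodons": [1995, 1474, 4862],
    "UUA": ["0.01654", "0.02714", "bad-value"],  # numeric column parsed as text
})

# describe() keeps only the numeric columns by default.
numeric_summary = df_example.describe()

# List the columns that were not stored as numbers.
non_numeric = df_example.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Coerce a column that should be numeric; unparseable entries become NaN.
df_example["UUA"] = pd.to_numeric(df_example["UUA"], errors="coerce")
print(df_example["UUA"].isna().sum())
```

Rows that end up with `NaN` after coercion can then be inspected and either corrected or dropped during cleaning.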


### Data Processing

Now that the data has been loaded, it can be processed into a more usable format. The first change is to convert the *Kingdom* feature into its categorical form.
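
A minimal sketch of that conversion, assuming *Kingdom* is stored as text labels. The frame `df_example` and its label values are hypothetical stand-ins; the same `astype("category")` call would apply to the *Kingdom* column of the loaded data frame.

```python
import pandas as pd

# Hypothetical stand-in for the Kingdom column (labels are illustrative).
df_example = pd.DataFrame({"Kingdom": ["bct", "vrl", "pln", "bct"]})

# Convert the text labels to pandas' categorical dtype.
df_example["Kingdom"] = df_example["Kingdom"].astype("category")

# Integer codes suitable as a model target, plus the label each code maps to.
kingdom_codes = df_example["Kingdom"].cat.codes
kingdom_labels = list(df_example["Kingdom"].cat.categories)

print(kingdom_labels)          # ['bct', 'pln', 'vrl']
print(kingdom_codes.tolist())  # [0, 2, 1, 0]
```

Categories are ordered alphabetically by default, so keeping `kingdom_labels` alongside the codes makes it possible to translate predictions back to kingdom names.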

Source(s):

- https://ourworldindata.org/extinctions
- https://www.sciencedirect.com/science/article/abs/pii/S1673852721001855