# Binary Prediction of Poisonous Mushrooms

### Artificial Intelligence 2nd Project

The aim of this project is to implement and evaluate machine learning models for predicting whether a mushroom is **poisonous** or **edible** based on its physical characteristics. As such, this is a binary classification problem, where the target variable is the toxicity of the mushroom.

To achieve our goal, we will follow the standard machine learning pipeline, which consists of analyzing the data, preprocessing it to ensure higher accuracy, and, finally, training and comparing the models.

## Table of Contents

1. [Coding environment](#coding-environment)
    1. [Importing the Libraries](#importing-the-libraries)
    2. [Loading the Dataset](#loading-the-dataset)
2. [Data Analysis and Preprocessing](#data-analysis-and-preprocessing)
    1. [Exploring the Dataset](#exploring-the-dataset)
    2. [Removing Duplicates](#removing-duplicates)
    3. [Filling in Missing Values](#filling-in-missing-values)
    4. [Removing Outliers](#remove-outliers)
    5. [Encoding Qualitative Data](#encoding-qualitative-data)
3. [Training the Models](#training-the-models)

## Coding environment

### Importing the Libraries

Due to its extensive machine learning ecosystem, we have opted to use [Python](https://www.python.org/) for this project. As such, before proceeding, it is imperative to prepare our coding environment by importing the libraries we will be working with, namely:

* **[Pandas](https://pandas.pydata.org/)** - For data manipulation and preprocessing.
* **[NumPy](https://numpy.org/)** - For numerical array operations.
* **[Scikit-learn](https://scikit-learn.org/stable/)** - For implementing machine learning models and evaluation metrics.
* **[Matplotlib](https://matplotlib.org/)** - For creating plots and other data visualizations.
* **[Seaborn](https://seaborn.pydata.org/)** - For statistical plots built on top of Matplotlib.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

### Loading the Dataset

Next, we must load the data itself, which is stored in a compressed CSV file. There is no need to decompress it manually, as Pandas handles that automatically.

In [None]:
df = pd.read_csv('data/train.zip')
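
The automatic decompression can be checked on a throwaway archive. The sketch below is only an illustration: the toy frame and the file names `toy.zip` and `toy.csv` are hypothetical and unrelated to the real dataset. Note that Pandas expects a ZIP archive to contain a single data file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the real training set.
toy = pd.DataFrame({"id": [0, 1, 2], "class": ["e", "p", "e"]})

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "toy.zip"

    # Write a ZIP archive that contains a single CSV file.
    toy.to_csv(archive, index=False,
               compression={"method": "zip", "archive_name": "toy.csv"})

    # The default compression="infer" picks the codec from the ".zip" extension.
    restored = pd.read_csv(archive)

print(restored)
```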

## Data Analysis and Preprocessing

Having finished the setup, the following steps are to **analyze** and **preprocess** the dataset. While it is common to separate the two, we opted to apply the preprocessing as soon as we deem it necessary during our analysis. We believe this decision will enable extra preprocessing opportunities based on the insights gained during the initial exploration.

### Exploring the Dataset

It goes without saying that a solid understanding of the dataset is paramount to training accurate models. Below is a small excerpt from our dataset:

In [None]:
print("First rows from our dataset:")
df.head()

Before proceeding, it is clear that the `id` column offers no significant information, as it simply indicates the index of a row. As such, we can safely drop it from the dataset.

In [None]:
df.drop('id', axis=1, inplace=True)

With that out of the way, we should analyze the data types of the remaining columns to get a better picture of the complete dataset.

In [None]:
df.info()

According to this summary, our dataset contains 21 columns and over 3 million rows. Of these columns, only three hold **quantitative** data, whereas the remaining 18, including the target, hold **qualitative** data.

### Removing Duplicates

Next, it is important to determine if the dataset contains **duplicate** rows, as those can be safely excluded without affecting the accuracy of our models.

In [None]:
print("The dataset contains {} duplicates.".format(df.duplicated().sum()))

As the dataset contains no duplicates, every row provides relevant information, so none needs to be removed.
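
Had any duplicates been found, they could have been removed with `drop_duplicates`. The following is a minimal sketch on a hypothetical toy frame, not on the project's data:

```python
import pandas as pd

# Hypothetical toy frame in which the second row repeats the first.
toy = pd.DataFrame({
    "cap-shape": ["x", "x", "f"],
    "cap-color": ["n", "n", "w"],
})

# duplicated() marks every repeat of an earlier row (the first copy is kept).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame without the marked rows.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate(s), {len(deduplicated)} rows kept")
```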

### Filling in Missing Values

Another key concern has to do with **missing values**, that is, entries that are absent from the dataset. Seeing as these provide no information, it might be sensible to either delete the rows where they appear or replace the missing entries with meaningful values.

The following highlights the percentage of missing values per column:

In [None]:
print("Missing values per column (%):")
100 * df.isna().sum() / len(df)

In [None]:
plt.figure(figsize=(18,12))
plt.title("Visualizing Missing Values")
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
plt.show()

It is evident that the dataset has an overabundance of missing values, with some columns missing over half of their entries. Because of this, we will opt to **fill in** the missing values, as removing the rows where they appear would result in a tremendous data loss.
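
The cost of deleting rows can be estimated before committing to either strategy. The sketch below uses a hypothetical toy frame (the values are made up, only the column names come from our dataset) to show how a few sparse columns can wipe out every row:

```python
import pandas as pd

# Hypothetical toy frame: one nearly complete column and two sparse ones.
toy = pd.DataFrame({
    "cap-shape": ["x", None, "f", "x"],
    "veil-color": [None, None, "w", None],
    "stem-root": [None, "b", None, None],
})

# Percentage of missing entries per column.
missing_pct = 100 * toy.isna().mean()

# Rows that would survive if every row with a missing entry were deleted.
rows_kept = len(toy.dropna())

print(missing_pct)
print(f"Rows kept after dropna(): {rows_kept} of {len(toy)}")
```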

Given our dataset has both quantitative and qualitative data, we have to deal with them separately. To facilitate this, we will categorize and extract the columns into distinct variables based on their data type.

In [None]:
target_column = 'class'

# compute the quantitative columns
quantitative_columns = df.select_dtypes(include=['number']).columns
print("Quantitative columns: ", quantitative_columns.tolist())

# compute the qualitative columns EXCEPT the target column
qualitative_columns = df.select_dtypes(include=['object']).columns.drop(target_column)
print("\nQualitative columns: ", qualitative_columns.tolist())


#### Quantitative Data

There are several methods to fill in missing numerical data. However, the most appropriate for each column depends on the **distribution** of its values:

* If the values are symmetrically distributed, it is appropriate to fill the missing entries with the **mean** as it represents the central tendency more accurately.
* If the values are asymmetrically distributed (**skewed**), then the **median** should be used because it is less affected by outliers.
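
To make the difference concrete, here is a small sketch on a hypothetical right-skewed sample (the numbers are invented for illustration). A single extreme value drags the mean far from the bulk of the data, while the median barely moves:

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is extreme.
sample = pd.Series([2.0, 3.0, 3.0, 4.0, 5.0, 60.0, None])

mean = sample.mean()      # pulled upwards by the extreme value
median = sample.median()  # stays close to the bulk of the data

print(f"skew={sample.skew():.2f} mean={mean:.2f} median={median:.2f}")

# Impute the missing entry with the median, as one would for a skewed column.
filled = sample.fillna(median)
```

Here the median is 3.5, whereas the mean is roughly 12.83, a value that none of the typical entries come close to.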

The following depicts the distribution of each quantitative column:

In [None]:
# compute the skewness
print("Skewness by column:")
print(df[quantitative_columns].skew())

# plot the distribution
plt.figure(figsize=(4 * len(quantitative_columns), 4))

for index, column in enumerate(quantitative_columns):
    plt.subplot(1, len(quantitative_columns), index+1)
    sns.histplot(data=df, x=column, kde=True, bins=20, stat='probability')
    plt.title(f'Distribution of {column}')
    plt.ylabel('Frequency')
    sns.despine()

plt.tight_layout() # adjust subplots to fit into figure area
plt.show()

Considering all quantitative columns are right-skewed, we must fill their missing values with the median.

In [None]:
for column in quantitative_columns:
    # compute the median of the column's values
    median = df[column].median()

    # fill the missing values with the median
    df[column] = df[column].fillna(median)

#### Qualitative Data

Handling missing values in qualitative data requires imputation strategies that consider the nature of the data, such as using the **mode**, creating a new **category**, or employing more advanced techniques based on relationships within the data.

As became apparent in ..., there are plenty of qualitative columns where over half the entries are missing (`stem-root`, `veil-type`, `veil-color`, etc.), but there are also a few where only a small percentage is absent (`cap-shape`, `cap-color`, `does-bruise-or-bleed`, etc.). We will take this into account when replacing the missing entries:
* If more than a predetermined percentage of data is missing, we create a new category - `Unspecified` - to group these unspecified values.
* Otherwise, we fill the missing values with the column's mode so as to preserve the distribution as much as possible.

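As for the threshold, we set it to 1%, so that only columns with a negligible share of missing entries are filled with the mode. We believe this will help preserve the accuracy.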

In [None]:
def fill_missing_qualitative_data(data: pd.Series, threshold: float) -> pd.Series:
    '''Fills missing qualitative data based on the fraction of missing values.'''
    missing_values = data.isna().sum() / len(data)
    mode = data.mode()

    return data.fillna('Unspecified' if missing_values > threshold or mode.empty else mode[0])


# fill the missing values with either 'Unspecified' or the column's mode
for column in qualitative_columns:
    df[column] = fill_missing_qualitative_data(df[column], 0.01)
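
As a quick sanity check of the two branches, the rule can be applied to two hypothetical columns, one missing 0.5% of its entries and one missing 50%. The helper is repeated so that the snippet runs on its own, and the data is invented for illustration:

```python
import pandas as pd


def fill_missing_qualitative_data(data: pd.Series, threshold: float) -> pd.Series:
    '''Same rule as above, repeated so that this snippet runs on its own.'''
    missing_ratio = data.isna().mean()
    mode = data.mode()
    use_new_category = missing_ratio > threshold or mode.empty

    return data.fillna('Unspecified' if use_new_category else mode[0])


# Hypothetical columns: 1 of 200 entries missing (0.5%) and 100 of 200 missing (50%).
mostly_complete = pd.Series(['x'] * 150 + ['f'] * 49 + [None])
mostly_missing = pd.Series(['b'] * 100 + [None] * 100)

filled_with_mode = fill_missing_qualitative_data(mostly_complete, 0.01)
filled_with_category = fill_missing_qualitative_data(mostly_missing, 0.01)

print(filled_with_mode.value_counts().to_dict())
print(filled_with_category.value_counts().to_dict())
```

The first column falls below the 1% threshold and receives its mode (`x`), while the second exceeds it and receives the `Unspecified` category.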

### Remove Outliers

Having ensured all dataset entries have a value, it is appropriate to remove any **outliers** to avoid training our models with unrepresentative data.

#### Quantitative Data

To detect outliers in quantitative data, we can start by plotting the box plot of the respective columns as this type of graph is ideal for easily identifying extremes.

In [None]:
# plot the box plots
plt.figure(figsize=(10, 6))
sns.boxplot(data=df[quantitative_columns])
plt.title('Box Plots of Quantitative Columns')
sns.despine()
plt.show()

From the plots above, we can conclude that there are several outliers. However, in order to decide how to deal with them, we need to understand just how many there are. 

In [None]:
def get_outliers(data: pd.Series, lower_quantile: float) -> pd.Series:
    '''Computes the outliers of a data column, using fences placed 1.5 times the inter-quantile range beyond the chosen quantiles.'''
    Q1 = data.quantile(lower_quantile)
    Q3 = data.quantile(1 - lower_quantile)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    return data[(data < lower_bound) | (data > upper_bound)]


# calculate the percentage of outliers for each quantitative column
print("Outliers by column (%):")

for column in quantitative_columns:
    outliers = get_outliers(df[column], 0.25)
    outliers_percentage = 100 * len(outliers) / len(df[column])

    print(f'{column}\t{round(outliers_percentage, 2)}')

As evidenced, when the fences are computed from $Q_{0.25}$ and $Q_{0.75}$, each quantitative column contains fewer than 5% outliers. Therefore, removing the rows where they appear would not incur a severe data loss for individual columns. However, assuming the worst-case scenario of only one outlier per row, we would be losing over 8% of the dataset, which is not ideal. Consequently, while we will still adopt the outlier removal strategy, we will instead compute the fences from $Q_{0.10}$ and $Q_{0.90}$, which flags fewer rows.
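
The effect of widening the quantiles can be seen on a small hypothetical column. In this sketch the helper is repeated so that the snippet runs on its own, and the values are invented for illustration:

```python
import pandas as pd


def get_outliers(data: pd.Series, lower_quantile: float) -> pd.Series:
    '''Same helper as above, repeated so that this snippet runs on its own.'''
    low = data.quantile(lower_quantile)
    high = data.quantile(1 - lower_quantile)
    spread = high - low

    return data[(data < low - 1.5 * spread) | (data > high + 1.5 * spread)]


# Hypothetical column: the values 1..100 plus three increasingly extreme entries.
values = pd.Series(list(range(1, 101)) + [160, 250, 400])

strict = get_outliers(values, 0.25)   # fences built from Q_0.25 and Q_0.75
relaxed = get_outliers(values, 0.10)  # fences built from Q_0.10 and Q_0.90

print(f"0.25/0.75: {len(strict)} flagged, 0.10/0.90: {len(relaxed)} flagged")
```

On this toy column, the wider quantiles stop flagging 160, while 250 and 400 are still treated as outliers.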

In [None]:
print(f'W/ outliers:\t{df.shape}')

for column in quantitative_columns:
    # compute the outliers
    outliers = get_outliers(df[column], 0.1)

    # remove the rows where the outliers appear
    df.drop(index=outliers.index, inplace=True)

print(f'W/o outliers:\t{df.shape}')

### Encoding Qualitative Data

As most machine learning algorithms and statistical models can only process numerical input, the final preprocessing step is transforming qualitative data into quantitative data.

There are several encoding techniques that achieve this, but we will go with **label encoding**, which merely consists of assigning unique integer values to distinct categories.

In [None]:
# initialize the label encoder
encoder = LabelEncoder()

# apply label encoding to every qualitative column (the encoder is refitted for each one)
for column in qualitative_columns:
    df[column] = encoder.fit_transform(df[column])
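
For reference, the sketch below shows what label encoding does to a single hypothetical column (the colour names are invented for illustration). The codes follow the alphabetical order of the labels rather than any meaningful ranking, and an encoder that is refitted per column only remembers the mapping of the last column it saw:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a single qualitative column.
colors = ['red', 'green', 'blue', 'green']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted categories; each code is an index into it.
mapping = {str(label): index for index, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform(codes)
```

If the mappings are needed later, for instance to decode predictions or to encode unseen data consistently, one option is to keep a separate fitted encoder per column.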

## Training the Models

In [None]:
# Features (X) and target (y)
y = df['class']
X = df.drop('class', axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

In [None]:
# Initialize the model
model = DecisionTreeClassifier(random_state=21)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')