# Binary Prediction of Poisonous Mushrooms

### Artificial Intelligence 2nd Project

The aim of this project is to implement and evaluate machine learning models for predicting whether a mushroom is **poisonous** or **edible** based on its physical characteristics.

To achieve our goal, we will follow the standard machine learning pipeline, which consists of analyzing the data, preprocessing it to ensure higher accuracy, and, finally, training and comparing the models.

## Table of Contents

TODO

## Coding environment

### Importing the libraries

Due to its extensive machine learning ecosystem, we have opted to use [Python](https://www.python.org/) for this project. As such, before proceeding, it is imperative to prepare our coding environment by importing the libraries we will be working with, namely:

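* **[Pandas](https://pandas.pydata.org/)** - For data manipulation and preprocessing.
* **[NumPy](https://numpy.org/)** - For numerical operations on arrays.
* **[Scikit-learn](https://scikit-learn.org/stable/)** - For implementing machine learning models and evaluation metrics.
* **[Matplotlib](https://matplotlib.org/)** - For creating graphs and other data visualizations.
* **[Seaborn](https://seaborn.pydata.org/)** - For statistical plots built on top of Matplotlib.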

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

### Loading the dataset

Next, we load the data, which is stored in a compressed CSV file. There is no need to decompress it manually, as Pandas handles that automatically.

In [None]:
df = pd.read_csv('data/train.zip')
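As a minimal sketch of why no manual decompression is needed: `pd.read_csv` infers the compression from the file extension by default, provided the ZIP archive contains a single CSV file. The toy data and file names below are hypothetical and only stand in for the real training set.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical toy data standing in for the real training set
toy = pd.DataFrame({"id": [0, 1], "class": ["e", "p"], "cap-diameter": [8.8, 4.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy.zip")

    # Write a ZIP archive that contains exactly one CSV file
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("toy.csv", toy.to_csv(index=False))

    # compression='infer' is the default, so the '.zip' extension is enough
    loaded = pd.read_csv(zip_path)

print(loaded)
```

If the archive held more than one file, Pandas would raise an error, and the CSV would have to be extracted or opened explicitly.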

## Data Analysis and Preprocessing

Having finished the setup, the following steps are to **analyze** and **preprocess** the dataset. While it is common to separate the two, we opted to apply the preprocessing as soon as we deem it necessary during our analysis. We believe this decision will enable extra preprocessing opportunities based on the insights gained during the initial exploration.

### Exploring the dataset

A solid understanding of the dataset is paramount to training accurate models. Below is a small excerpt from our dataset:

In [None]:
print("First rows from our dataset:")
df.head()

Before proceeding, it is clear that the `id` column offers no significant information, as it simply indicates the index of a row. As such, we can safely drop it from the dataset.

In [None]:
df.drop('id', axis=1, inplace=True)

With that out of the way, we should analyze the data types of the remaining columns to get a better picture of the complete dataset.

In [None]:
df.info()

The output shows that our dataset contains 21 columns and over 3 million rows. Of the columns, only three contain **quantitative** data, whereas the remaining 18 contain **qualitative** data.

In [None]:
df.describe()
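The row and column counts, and the split between quantitative and qualitative columns, can also be computed directly rather than read off the `df.info()` output. The sketch below uses a hypothetical toy frame; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two text columns
toy = pd.DataFrame({
    "class": ["e", "p", "e"],
    "cap-diameter": [8.8, 4.5, 6.9],
    "cap-shape": ["f", "x", "f"],
})

n_rows, n_columns = toy.shape

# Columns with a numeric dtype are treated as quantitative, the rest as qualitative
n_quantitative = toy.select_dtypes(include="number").shape[1]
n_qualitative = toy.select_dtypes(exclude="number").shape[1]

print(f"{n_rows} rows, {n_columns} columns")
print(f"{n_quantitative} quantitative, {n_qualitative} qualitative")
```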

### Removing Duplicates

Next, it is important to determine if the dataset contains **duplicate** rows, as those can be safely excluded without affecting the accuracy of our models.

In [None]:
print("The dataset contains {} duplicates.".format(df.duplicated().sum()))

As the dataset contains no duplicates, every row provides relevant information, so none need to be removed.
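Had duplicates been present, they could have been removed with `drop_duplicates`. The following is a small sketch on a hypothetical toy frame, not a step applied to the project's dataset.

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the first
toy = pd.DataFrame({
    "class": ["e", "p", "e"],
    "cap-shape": ["f", "x", "f"],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())

# keep the first occurrence of each row and rebuild the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduplicated)} remain.")
```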

### Filling in Missing Values

Another key concern has to do with **missing values**, that is, entries that are absent from the dataset. Seeing as these provide no information, it might be sensible to either delete the rows where they appear or replace the missing entries with meaningful values.

In [None]:
print("Missing values per column:")
df.isna().sum()

It is evident that the dataset has an overabundance of missing values, with some columns missing over half of their entries. Because of this, we opt to **fill in** the missing values, as removing the rows where they appear would result in a tremendous loss of data.
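To make the "over half" observation concrete, the share of missing entries per column can be computed with `isna().mean()`. The sketch below runs on a hypothetical toy frame; the column names and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing entries in both columns
toy = pd.DataFrame({
    "cap-diameter": [8.8, np.nan, 6.9, 3.1],
    "veil-type": [np.nan, np.nan, "u", np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns where more than half of the entries are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print("Mostly missing:", mostly_missing)
```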

Given our dataset has both quantitative and qualitative data, we have to deal with them separately. To facilitate this, we will categorize and extract the columns into distinct variables based on their data type.

In [None]:
target_column = 'class'

# compute the quantitative columns
quantitative_columns = df.select_dtypes(include=['number']).columns
print("Quantitative columns: ", quantitative_columns.tolist())

# compute the qualitative columns EXCEPT the target column
qualitative_columns = df.select_dtypes(include=['object']).columns.drop(target_column)
print("\nQualitative columns: ", qualitative_columns.tolist())


#### Quantitative Data

There are several methods to fill missing numerical data. However, the most appropriate for each column depends on the **distribution** of its values:

* If the values are symmetrically distributed, it is appropriate to fill the missing entries with the **mean** as it represents the central tendency more accurately.
* If the values are asymmetrically distributed (**skewed**), then the **median** should be used because it is less affected by outliers.
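The difference between the two choices can be illustrated with a small sketch on hypothetical values: a single large outlier pulls the mean far from the bulk of the data, while the median stays put.

```python
import pandas as pd

# Hypothetical samples: identical except for one extreme value
symmetric = pd.Series([4.0, 5.0, 6.0, 5.0, 5.0])
skewed = pd.Series([4.0, 5.0, 6.0, 5.0, 50.0])

for name, values in [("symmetric", symmetric), ("skewed", skewed)]:
    print(
        f"{name}: mean={values.mean():.1f}, "
        f"median={values.median():.1f}, "
        f"skew={values.skew():.2f}"
    )
```

In the skewed sample the mean rises to 14 while the median remains 5, which is why the median is the safer fill value for skewed columns.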

The following depicts the distribution of each quantitative column:

In [None]:
# compute the skewness
print("Skewness by column:")
print(df[quantitative_columns].skew())

# plot the distribution
num_columns = len(quantitative_columns)
plt.figure(figsize=(4 * num_columns, 4))

for index, column in enumerate(quantitative_columns):
    plt.subplot(1, num_columns, index+1)
    sns.histplot(data=df, x=column, kde=True, bins=20, stat='probability')
    plt.title(f'Distribution of {column}')
    plt.ylabel('Probability')  # matches stat='probability' above
    sns.despine()

plt.tight_layout()  # adjust subplots to fit into figure area
plt.show()

Considering all quantitative columns are right-skewed, we must fill their missing values with the median.

In [None]:
for column in quantitative_columns:
    # compute the median of the column's values
    median = df[column].median()

    # fill the missing values with the median
    df[column] = df[column].fillna(median)
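An equivalent approach, shown here only as a hedged alternative, is scikit-learn's `SimpleImputer` with `strategy="median"`. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing entry per column
toy = pd.DataFrame({
    "cap-diameter": [2.0, np.nan, 4.0, 30.0],
    "stem-height": [1.0, 3.0, np.nan, 5.0],
})

# Learn the per-column medians, then fill the missing entries with them
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print("Learned medians:", imputer.statistics_)
print(filled)
```

One possible advantage of this design is that the imputer can be fitted on the training split only and then reused on the test split, so that the test data does not influence the fill values.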

#### Qualitative Data
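
For qualitative columns, two common options are to fill the missing entries with the most frequent category (the **mode**) or with an explicit placeholder category. The sketch below illustrates both on a hypothetical toy frame; it is an assumption about a reasonable approach, not necessarily the one used in this project, and the `"missing"` label is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing categorical entries
toy = pd.DataFrame({
    "cap-shape": ["f", np.nan, "f", "x"],
    "veil-type": [np.nan, np.nan, "u", np.nan],
})

# Option A: fill each column with its most frequent category.
# mode() returns a frame; its first row holds one mode per column.
mode_filled = toy.fillna(toy.mode().iloc[0])

# Option B: treat "missing" as a category of its own
placeholder_filled = toy.fillna("missing")

print(mode_filled)
print(placeholder_filled)
```

Option B may be preferable for columns that are mostly empty, since filling them with the mode would make nearly every row look identical in that column.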

### Encoding qualitative data

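As became clear in the previous section, the dataset is composed mostly of qualitative data. Since the models we train expect numeric inputs, we convert each category to an integer using **label encoding**.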

In [None]:
# initialize the label encoder
encoder = LabelEncoder()

# encode every qualitative column (including the target) as integers;
# fit_transform refits the encoder on each column, so only the last mapping is retained
for column in df.select_dtypes(include=['object']).columns:
    df[column] = encoder.fit_transform(df[column])

df.head()
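
To illustrate what label encoding does, the sketch below encodes a hypothetical toy column and recovers the original labels. `LabelEncoder` sorts the distinct labels and assigns them consecutive integers starting at 0.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy target column: 'e' for edible, 'p' for poisonous
toy = pd.Series(["p", "e", "e", "p"], name="class")

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ lists the labels in sorted order; a label's position is its code
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)

print("Codes:  ", codes)
print("Mapping:", mapping)
print("Decoded:", decoded)
```

Keeping one fitted encoder per column, for example in a dictionary keyed by column name, would make it possible to decode predictions later. Note also that the integer codes imply an ordering that the categories do not actually have, which tree-based models tolerate better than linear ones.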

In [None]:
# Features (X) and target (y)
y = df['class']
X = df.drop('class', axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

In [None]:
# Initialize the model
model = DecisionTreeClassifier(random_state=21)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')