# Binary Prediction of Poisonous Mushrooms

### Artificial Intelligence 2nd Project

The aim of this project is to implement and evaluate machine learning models for predicting whether a mushroom is **poisonous** or **edible** based on its physical characteristics.

To achieve our goal, we will follow the standard machine learning pipeline, which consists of analyzing the data, preprocessing it to ensure higher accuracy, and, finally, training and comparing the models.

## Table of Contents

TODO

## Coding environment

Due to its extensive machine learning ecosystem, we have opted to use [Python](https://www.python.org/) for this project. As such, before proceeding, it is imperative to prepare our coding environment by importing the libraries we will be working with, namely:

* **[Pandas](https://pandas.pydata.org/)** - For data manipulation and preprocessing.
* **[Scikit-learn](https://scikit-learn.org/stable/)** - For implementing machine learning models and evaluation metrics.
* **[Matplotlib](https://matplotlib.org/)** - For creating graphs, tables, and numerous other data visualization methods.
* **[NumPy](https://numpy.org/)** - For numerical arrays and vectorized operations.
* **[Seaborn](https://seaborn.pydata.org/)** - For statistical plots built on top of Matplotlib.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

Next, we must load the data itself, which is stored in a compressed CSV file. However, there is no need to manually uncompress it, as Pandas handles that automatically.

In [None]:
df = pd.read_csv('data/train.zip')
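To illustrate the automatic decompression mentioned above, here is a minimal sketch that writes a tiny, made-up CSV into a zip archive and reads it back. The file names `toy.zip` and `toy.csv` and the two rows of data are hypothetical and unrelated to the real dataset. Pandas infers the compression from the file extension and expects the archive to contain a single data file.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny, made-up CSV and store it inside a zip archive (hypothetical names).
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("toy.csv", "id,class\n0,e\n1,p\n")

    # compression="infer" is the default; it is spelled out here only for clarity.
    toy_df = pd.read_csv(zip_path, compression="infer")

print(toy_df)
```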

## Data Analysis and Preprocessing

Having finished the setup, the following steps are to analyze and preprocess the dataset. While it is common to separate these two phases, we decided we would apply the preprocessing as soon as we deem it necessary during our analysis.

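We begin by inspecting the overall structure of the dataset, namely its columns, their data types, and the number of entries.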

In [None]:
df.info()

The summary shows that our dataset contains 22 columns and over 3 million rows. Of these columns, only three (excluding the `id`) contain **quantitative** data, whereas the remaining 18 contain **qualitative** data.
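The split between quantitative and qualitative columns can also be obtained programmatically. The sketch below uses a small toy frame with made-up values; its column names are illustrative assumptions rather than a listing of the real dataset.

```python
import pandas as pd

# Toy data with made-up values; column names are illustrative assumptions.
toy_df = pd.DataFrame({
    "id": [0, 1, 2],
    "class": ["e", "p", "e"],
    "cap-diameter": [8.8, 4.5, 6.9],
    "cap-shape": ["f", "x", "f"],
})

# Numeric columns (minus the identifier) are treated as quantitative.
quantitative = toy_df.select_dtypes(include="number").columns.drop("id").tolist()
# Everything that is not numeric is treated as qualitative.
qualitative = toy_df.select_dtypes(exclude="number").columns.tolist()

print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
```

Running the same two selections on `df` should reproduce the counts read off the `df.info()` output.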

Below is an excerpt from the dataset:

In [None]:
df.head()

Next, it is important to determine if the dataset contains duplicate rows, as those can be safely excluded without affecting the accuracy of our models.

In [None]:
print("The dataset contains {} duplicates.".format(df.duplicated().sum()))

As the dataset contains no duplicates, we will not have to eliminate duplicate rows during the data preprocessing.
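One caveat is worth keeping in mind: a unique identifier column makes every row distinct, so a check over the full frame cannot flag rows that differ only by their identifier. The sketch below shows the difference on made-up toy data. Whether the same effect occurs in the real dataset is not something this example establishes.

```python
import pandas as pd

# Toy data with made-up values; rows 0 and 2 differ only by their identifier.
toy_df = pd.DataFrame({
    "id": [0, 1, 2],
    "cap-shape": ["f", "x", "f"],
    "season": ["a", "w", "a"],
})

# Checking the full frame: the unique id hides the repeated feature values.
duplicates_with_id = int(toy_df.duplicated().sum())

# Checking the features only: the repeated row is now detected.
features = toy_df.drop(columns="id")
duplicates_without_id = int(features.duplicated().sum())

# If duplicates were found, they could be removed like this.
deduplicated = features.drop_duplicates().reset_index(drop=True)

print(duplicates_with_id, duplicates_without_id, len(deduplicated))
```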

Finally, we must verify if any columns have missing values.

In [None]:
print("Missing values per column:")
df.isna().sum()

It is evident that the dataset has many missing values, with some columns missing over half of their entries. This issue will need to be addressed during data preprocessing.
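Absolute counts can be hard to compare, so it may help to express missing values as a share of each column. The sketch below does this on made-up toy data. The column names and the 0.5 threshold (echoing the "over half" observation) are assumptions for illustration, not a decision about how the real columns will be handled.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values and deliberately missing entries.
toy_df = pd.DataFrame({
    "stem-root": [np.nan, "b", np.nan, np.nan],
    "cap-diameter": [8.8, np.nan, 6.9, 5.1],
    "season": ["a", "w", "a", "u"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share.
missing_share = toy_df.isna().mean().sort_values(ascending=False)

# Columns where more than half of the entries are missing (assumed threshold).
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_missing)
```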

In [None]:
df.drop('id', axis=1, inplace=True)

In [None]:
df.head()

In [None]:
# Label-encode every categorical (object) column, including the target `class`.
# The same encoder is refit for each column, so only the last column's mapping is retained.
encoder = LabelEncoder()

for column in df.select_dtypes(include=['object']).columns:
    df[column] = encoder.fit_transform(df[column])

# Features (X) and target (y)
y = df['class']
X = df.drop('class', axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
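If the integer codes need to be mapped back to the original categories later (for instance, to report predictions as letters again), one option is to keep a separate encoder per column. The following is a minimal sketch on made-up toy data. The `encoders` dictionary is a hypothetical helper and not part of the code above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with made-up values and no missing entries.
toy_df = pd.DataFrame({
    "class": ["e", "p", "e"],
    "cap-shape": ["f", "x", "f"],
})

# Hypothetical helper: one fitted encoder per column, keyed by column name.
encoders = {}
for column in toy_df.columns:
    encoders[column] = LabelEncoder()
    toy_df[column] = encoders[column].fit_transform(toy_df[column])

# Each stored encoder can translate its column's codes back to the labels.
decoded = encoders["class"].inverse_transform(toy_df["class"])

print(toy_df)
print(list(decoded))
```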

In [None]:
# Initialize the model
model = DecisionTreeClassifier(random_state=21)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
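For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this against `accuracy_score` using made-up label arrays, which are unrelated to the model's actual output.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: four of the five predictions are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Fraction of positions where the prediction equals the true label.
manual_accuracy = float((y_true == y_hat).mean())
library_accuracy = float(accuracy_score(y_true, y_hat))

print(f"Manual: {manual_accuracy:.4f}, scikit-learn: {library_accuracy:.4f}")
```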