# Homework 1

Due Thursday 1/22

The main objectives of this assignment are to test the setup of your Python environment, introduce Jupyter-style cells, and compare classical machine learning models against a basic neural network on a toy classification problem (the Iris dataset).

## 1. Setup 
Load the necessary Python packages and the Iris dataset.

In [None]:
# Load necessary libraries
from sklearn.datasets import load_iris
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Load the dataset using seaborn for EDA and visualization
iris_df = sns.load_dataset('iris')

# Load the dataset
iris = load_iris(as_frame=True)
# Access the features (DataFrame) and target (Series) for ML tasks
X = iris.data
y = iris.target

## 2. Basic Exploratory Data Analysis

* Understand the structure of the dataset. 
* Calculate basic statistics for each feature
* Explore the correlation between features
* Explore the distribution of samples by their norm

In [None]:
# Display the structure of the DataFrame
iris_df.info()

In [None]:
# Calculate summary statistics of the dataset
print(iris_df.describe())
# Count the number of instances for each species
print(iris_df['species'].value_counts())

In [None]:
# Show the correlation between features using a correlation Heatmap
sns.heatmap(iris_df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Feature Correlation")
plt.show()

In [None]:
# Plot the distribution of the L2-norm of the feature vectors per species
#
# Calculate the L2 norm for the feature vectors (first 4 columns)
iris_df['feature_norm_l2'] = np.linalg.norm(iris_df.iloc[:, 0:4], axis=1)

# Plot the distribution
plt.figure(figsize=(10, 6))
sns.kdeplot(data=iris_df, x='feature_norm_l2', hue='species', fill=True, palette='viridis')
plt.title('Distribution of Feature Vector L2 Norms per Species')
# Raw string so the LaTeX backslashes are not treated as escape sequences
plt.xlabel(r'L2 Norm ($\|x\|_2$)')
plt.ylabel('Density')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()


## 3. Machine Learning Modeling and Evaluation

### Simple Train/Test Split (Holdout Validation Method)
Split the dataset into two parts: training (70%) and testing (30%). Make sure to stratify the split.

### Training
Train the following models using scikit-learn:

* Perceptron (use `Perceptron()`)
* Decision Tree (use `DecisionTreeClassifier()`)
* K-Nearest Neighbors (use `KNeighborsClassifier(n_neighbors=3)`)
* Support Vector Machine (use `SVC(kernel='linear')`)
* Multi-Layer Perceptron (use `MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000)`)
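
As a starting point, the five models listed above can be collected in a dictionary and fitted in a loop. This is only a minimal sketch: the names `models` and `fitted` are illustrative choices, not required by the assignment, and the split is recreated here so the cell runs on its own. No `random_state` is set on the models, so the Perceptron, Decision Tree, and MLP results may vary between runs.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Recreate the stratified 70/30 split so this cell is self-contained
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Constructor arguments are exactly the ones listed above;
# the dictionary keys are illustrative display names
models = {
    "Perceptron": Perceptron(),
    "Decision Tree": DecisionTreeClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "SVM (linear)": SVC(kernel='linear'),
    "MLP": MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000),
}

# Fit each model on the training set and report test accuracy
fitted = {}
for name, model in models.items():
    fitted[name] = model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

Each fitted model can then be passed through the same evaluation steps, one model at a time.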

### Evaluation

For each model, you must output:
- A Confusion Matrix (visualized using `ConfusionMatrixDisplay`).
- A Classification Report showing:
    - Accuracy
    - Precision (Macro Average)
    - Recall (Macro Average)
    - F1-Score (Macro Average)
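
To see what "macro average" means, the sketch below computes the four metrics on a small made-up label vector (the values in `y_true` and `y_pred` are invented for illustration and are not Iris results). Macro averaging computes the metric for each class separately and then takes the unweighted mean, so every class counts equally regardless of its size.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels for a 3-class problem (not taken from the Iris data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Accuracy: fraction of samples predicted correctly
accuracy = accuracy_score(y_true, y_pred)

# Macro average: per-class metric, then the unweighted mean over classes
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")

print(f"Accuracy:          {accuracy:.3f}")
print(f"Precision (macro): {precision:.3f}")
print(f"Recall (macro):    {recall:.3f}")
print(f"F1-score (macro):  {f1:.3f}")
```

The same numbers appear in the `macro avg` and `accuracy` rows printed by `classification_report`.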

In [None]:
# Machine Learning Code
#
# Create a stratified train/test split for X and y
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42,
    stratify=y)
# Print the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

# Function to evaluate the classifier by plotting the confusion matrix
def evaluate_classifier(y_true, y_pred, class_names):
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=class_names)

# Function to calculate overall accuracy, precision, recall, F1-score
def classification_metrics(y_true, y_pred, class_names):
    print(classification_report(y_true, y_pred, target_names=class_names))

In [None]:
# Train a perceptron classifier
from sklearn.linear_model import Perceptron
percep = Perceptron()
percep.fit(X_train, y_train)

# Print the evaluation metrics
y_pred = percep.predict(X_test)
evaluate_classifier(y_test, y_pred, iris.target_names)
classification_metrics(y_test, y_pred, iris.target_names)


### Analysis

* Which model performed the worst? Looking at the Confusion Matrix, which two species did it struggle to distinguish?
* Why is the Perceptron limited in its ability to solve non-linearly separable problems?
* The MLP is technically a "Deep Learning" model. Did it significantly outperform the SVM or KNN on this specific dataset? Why or why not?


## 4. K-Fold Cross-Validation
Performance results obtained with the holdout validation method can be highly unreliable for small datasets. 
Instead, evaluate the models using Stratified 5-Fold Cross-Validation. This ensures that every data point is used for both training and testing across different iterations.

* Use `cross_val_score` to calculate the accuracy for each of the 5 folds.
* Report the mean accuracy and the standard deviation for each model.
* Generate one confusion matrix for each model using `cross_val_predict` to see the aggregate errors across all folds.
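
The three steps above can be sketched for a single model as follows. This is a minimal example under stated assumptions: it uses the K-Nearest Neighbors model only, and the `shuffle=True` and `random_state=42` arguments to `StratifiedKFold` are choices made here for reproducibility, not requirements of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    StratifiedKFold, cross_val_predict, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Stratified 5-fold splitter; shuffling with a fixed seed is an assumption
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = KNeighborsClassifier(n_neighbors=3)

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")

# Each sample is predicted exactly once, by the model that did not train on it
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)
print(cm)
```

Because every sample lands in exactly one test fold, the aggregate confusion matrix covers all 150 samples, with 50 per row.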

### Analysis
* Why is the mean accuracy from 5-Fold Cross-Validation more "trustworthy" than the accuracy from a single 70/30 split for a small dataset like Iris?

* If a model has a very high mean accuracy but a very large standard deviation across folds, what does that tell you about the model's reliability?

* Looking at your aggregate Confusion Matrices, which species was most commonly misidentified as another? Did all models struggle with the same pair of species?

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict, StratifiedKFold