# Homework 3 — Wine Classification with K-Nearest Neighbors (KNN)

## Objective
In this homework, you will build a **KNN classifier** for the **Wine** dataset.

You will practice the supervised learning workflow:

1. Understand the task (features vs. target)  
2. Load and inspect the dataset  
3. Split into train/test  
4. Train a KNN model  
5. Evaluate performance on the test set  

---

## Deliverables
- Your completed notebook
- Short written answers to the reflection questions (provided later in the notebook)

> **Note:** In this starter notebook, we provide the setup and dataset inspection.  
> You will complete the remaining sections (currently left as TODOs).


## 1) What is the Wine dataset?

The Wine dataset is a classic supervised learning dataset.

### Task
Given chemical measurements of a wine sample, predict its **class** (cultivar).

### Features (X)
Each sample has **13 numeric features** (e.g., alcohol, magnesium, flavanoids).

### Target (y)
The target is a **categorical label** with **3 classes**.

So this is a **classification** problem.


## 2) Import libraries

In [5]:
from sklearn.datasets import load_wine
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

## 3) Load and inspect the dataset

In [2]:
wine = load_wine()

X = wine.data       # features
y = wine.target     # target labels

print("Feature names:", wine.feature_names)
print("Target names:", wine.target_names)
print("X shape:", X.shape)
print("y shape:", y.shape)


Feature names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Target names: ['class_0' 'class_1' 'class_2']
X shape: (178, 13)
y shape: (178,)


In [3]:
# Create a DataFrame for a quick look
df = pd.DataFrame(X, columns=wine.feature_names)
df["target"] = y
df.head()


| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


---

## 4) Train/Test split (TODO)

**Your task:** Split the dataset into training and test sets.

Requirements:
- Use `train_test_split`
- Use `test_size = 0.20`
- Use `random_state = 42`
- Use `stratify = y`

Write your code below.


In [7]:
# Hold out 20% of the samples for testing; stratify=y keeps the class
# proportions the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42,
    stratify=y,
)
print("Train size:", X_train.shape, "\nTest size:", X_test.shape)

Train size: (142, 13) 
Test size: (36, 13)
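
To check what `stratify = y` does, the per-class sample counts can be compared across the full dataset and the two splits. The following is a minimal sketch that repeats the split above in a self-contained form; the variable names `full_counts`, `train_counts`, and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split settings as required by the assignment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Count how many samples of each class (0, 1, 2) land in each set
full_counts = np.bincount(y)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Full :", full_counts, full_counts / full_counts.sum())
print("Train:", train_counts, train_counts / train_counts.sum())
print("Test :", test_counts, test_counts / test_counts.sum())
```

If stratification worked, the class proportions printed for the three sets should be close to each other, which matters here because the classes are not equally sized.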


## 5) Train a KNN classifier (TODO)

**Your task:** Train a KNN classifier.

Requirements:
- Use `KNeighborsClassifier`
- Start with `n_neighbors = 5`

Write your code below.


In [8]:
# Fit a KNN classifier that votes among the 5 nearest training samples
KNN = KNeighborsClassifier(n_neighbors=5)
print(" ---- Training KNN ---- ")
KNN.fit(X_train, y_train)

 ---- Training KNN ---- 
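
To make the idea behind KNN concrete, here is a minimal from-scratch sketch of a single prediction: measure the Euclidean distance to every training sample, keep the `k` closest, and take a majority vote. The helper `knn_predict` and the toy arrays are hypothetical and only illustrate the principle; `KNeighborsClassifier` uses optimized neighbor search and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=5):
    """Predict one label by majority vote among the k nearest samples (illustrative helper)."""
    # Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for this example): two well-separated clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]), k=3)
print(pred_a, pred_b)  # prints: 0 1
```

Note that "training" a KNN model mostly means storing the training data; the distance computations happen at prediction time.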


## 6) Predict on the test set (TODO)

**Your task:** Use your trained model to predict labels for the test set.

Write your code below.


In [9]:
# Predict class labels for the held-out test set and preview the first 10
y_pred = KNN.predict(X_test)
print(" ---- Predictions ---- ")
print(y_pred[:10])

 ---- Predictions ---- 
[0 2 0 1 2 0 0 1 1 2]
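
The predictions are integer labels. As a small illustration, they can be mapped back to the dataset's target names by indexing. This sketch hardcodes the target names and the ten predictions printed above so that it runs on its own; in the notebook, `wine.target_names[y_pred]` would do the same thing.

```python
import numpy as np

# Target names and first ten predictions, copied from the outputs above
target_names = np.array(["class_0", "class_1", "class_2"])
first_preds = np.array([0, 2, 0, 1, 2, 0, 0, 1, 1, 2])

# Indexing an array of names with integer labels returns one name per prediction
named_preds = target_names[first_preds]
print(named_preds)
```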


## 7) Evaluate the model (TODO)

**Your task:** Evaluate your model using accuracy.

Minimum requirement:
- Report test accuracy using `accuracy_score`

Optional (recommended):
- Show confusion matrix
- Show classification report

Write your code below.


In [11]:
# Accuracy: fraction of test samples whose predicted label matches the true label
test_acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy (K = 5): {test_acc:.3f}")

Test accuracy (K = 5): 0.806


In [14]:
from sklearn.metrics import confusion_matrix, classification_report

# Print Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[12  0  0]
 [ 0 10  4]
 [ 0  3  7]]
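
The accuracy reported earlier can be recovered from this matrix: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total. A short sketch, with the matrix values copied from the output above:

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn's convention)
cm = np.array([[12, 0, 0],
               [0, 10, 4],
               [0, 3, 7]])

correct = np.trace(cm)   # samples on the diagonal were classified correctly
total = cm.sum()         # all test samples
accuracy = correct / total

print(f"{correct}/{total} = {accuracy:.3f}")  # prints: 29/36 = 0.806
```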


In [15]:
# Print Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.77      0.71      0.74        14
           2       0.64      0.70      0.67        10

    accuracy                           0.81        36
   macro avg       0.80      0.80      0.80        36
weighted avg       0.81      0.81      0.81        36
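
The per-class numbers in this report follow directly from the confusion matrix. As a sketch (matrix values copied from the earlier output): precision divides each diagonal entry by its column sum (everything predicted as that class), and recall divides it by its row sum (everything that truly belongs to that class).

```python
import numpy as np

# Rows are true classes, columns are predicted classes
cm = np.array([[12, 0, 0],
               [0, 10, 4],
               [0, 3, 7]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums: samples predicted as each class
recall = true_pos / cm.sum(axis=1)     # row sums: samples truly in each class
f1 = 2 * precision * recall / (precision + recall)

for label in range(3):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f}")
```

For class 1, for example, this gives a precision of 10/13 and a recall of 10/14, matching the 0.77 and 0.71 shown in the report.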



## 8) Reflection Questions (Answer in Markdown)

1. What test accuracy did you get with **k = 5**?
2. Which class seems easiest to predict? Which seems hardest? Use your confusion matrix to justify.
3. Why is it important to keep the test set “unseen” during training and model selection?


### Answers
**1. The test accuracy of the KNN model with k = 5 was *80.6%* (29 of 36 test samples classified correctly).**

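**2. Class 0 is the easiest to predict: according to the confusion matrix, all 12 of its test samples were classified correctly. Classes 1 and 2 are the hardest because the model confuses them with each other. Class 1 has the most errors (4 of its 14 samples were predicted as class 2), while class 2 has the lowest recall (3 of its 10 samples were predicted as class 1).**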

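**3. The test set must stay unseen so that test accuracy is an honest estimate of how the model performs on new data. If test samples influence training or model selection (for example, choosing k by comparing test scores), the model ends up tuned to those samples, the overfitting goes undetected, and the reported accuracy is overly optimistic.**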
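
One common way to keep the test set unseen during model selection is to choose `k` with cross-validation on the training set only and to score the test set once at the end. The following is a minimal sketch of that idea, not part of the assignment requirements; the candidate list `candidate_ks` and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

candidate_ks = [1, 3, 5, 7, 9]  # hypothetical candidates, chosen for illustration

# Compare candidates using 5-fold cross-validation on the training data only
cv_scores = {}
for k in candidate_ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    cv_scores[k] = scores.mean()

best_k = max(cv_scores, key=cv_scores.get)

# Only now touch the test set, once, with the selected k
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)

print("Mean CV accuracy per k:", {k: round(v, 3) for k, v in cv_scores.items()})
print(f"Selected k = {best_k}, test accuracy = {test_acc:.3f}")
```

Because the test samples play no role in choosing `k`, the final test accuracy remains an estimate of performance on data the model has not seen.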
