# 03 -- Exercise + Solution (Data Processing in Python)

In [1]:
%load_ext watermark

In [2]:
%watermark -a "Sebastian Raschka" -p numpy,scikit-learn

Author: Sebastian Raschka

numpy       : 1.23.5
scikit-learn: 1.2.2



## About the Dataset

Dataset URL: https://archive.ics.uci.edu/dataset/602/dry+bean+dataset

Images of 13,611 grains from 7 registered varieties of dry beans were taken with a high-resolution camera. A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

## Tasks

1. Load the dataset from `data/Dry_Bean_Dataset.csv`

2. Identify whether categorical data needs to be encoded

3. Remove missing data if applicable (Hint: use `df.isna().sum()`)

4. Divide the DataFrame into features `X` and labels `y` (Hint: `df_without_B = df.drop(columns=['B'])`)

5. Split the dataset into 60% train data, 10% validation data, 30% test data

6. Scale the dataset using z-score normalization (standardization)

7. Train a k-nearest neighbors classifier (Hint: feel free to reuse code from the previous exercise, `02-2_exercise.ipynb`)


## Solution

In [13]:
import pandas as pd

df = pd.read_csv("data/Dry_Bean_Dataset.csv")
df.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715.0,190.141097,0.763923,0.988855999,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,SEKER
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172.0,191.272751,0.783968,0.984985603,0.887034,0.953861,0.006979,0.003564,0.909851,0.99843,SEKER
2,29380,624.11,212.82613,175.931143,1.209713,0.562727,29690.0,193.410904,0.778113,0.989558774,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,SEKER
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724.0,195.467062,0.782681,0.976695743,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,SEKER
4,30140,620.134,201.847882,190.279279,1.060798,0.33368,30417.0,195.896503,0.773098,0.99089325,0.984877,0.970516,0.006697,0.003665,0.9419,0.999166,SEKER


In [14]:
df.isna().sum()

Area               0
Perimeter          0
MajorAxisLength    0
MinorAxisLength    2
AspectRation       3
Eccentricity       3
ConvexArea         3
EquivDiameter      2
Extent             3
Solidity           4
roundness          1
Compactness        2
ShapeFactor1       1
ShapeFactor2       2
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64

In [4]:
# drop rows with missing values:
df = df.dropna(axis=0)
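
The per-column counts from `df.isna().sum()` can overlap within a row, so their total is only an upper bound on the number of rows that `dropna(axis=0)` removes. The following is a minimal sketch on a hypothetical toy DataFrame (the values are made up for illustration and are not taken from the dry bean data) that counts the affected rows directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the dry bean data
toy = pd.DataFrame({
    "Area": [1.0, 2.0, 3.0, 4.0],
    "Extent": [0.7, np.nan, 0.8, np.nan],
    "Solidity": [0.9, np.nan, np.nan, 0.95],
})

# Missing values per column (what df.isna().sum() reports)
missing_per_column = toy.isna().sum()

# Rows containing at least one missing value (what dropna(axis=0) removes)
rows_with_missing = toy.isna().any(axis=1).sum()

toy_clean = toy.dropna(axis=0)

print(missing_per_column)
print(f"Rows with at least one missing value: {rows_with_missing}")
print(f"Rows kept: {len(toy_clean)} of {len(toy)}")
```

In this toy example the columns report four missing values in total, but only three rows are affected because one row is missing two values at once.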

In [5]:
df_X = df.drop(columns=["Class"])
df_y = df["Class"]
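
For task 2, one way to check whether anything needs encoding is to look for non-numeric columns. The sketch below uses a hypothetical three-row DataFrame (illustrative values, with column names borrowed from the dataset). In the dry bean data, `Class` appears to be the only non-numeric column, and it is the label. scikit-learn classifiers accept string class labels directly, so encoding is optional here. The `LabelEncoder` step only shows what integer encoding would look like if it were wanted.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; values are for illustration only
toy = pd.DataFrame({
    "Area": [28395, 28734, 40000],
    "Perimeter": [610.291, 638.018, 700.5],
    "Class": ["SEKER", "SEKER", "BOMBAY"],
})

# Columns that do not hold numbers are candidates for encoding
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Optional: map the string labels to integers (classes are sorted alphabetically)
le = LabelEncoder()
y_encoded = le.fit_transform(toy["Class"])
print(list(le.classes_), y_encoded)
```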

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(
    df_X, df_y, test_size=0.4, random_state=123, stratify=df_y)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.75, random_state=123, stratify=y_temp)
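
The two-step split gives the requested proportions because the second call takes 75% of the 40% held out by the first call: 0.4 × 0.75 = 0.3 for testing, leaving 0.4 × 0.25 = 0.1 for validation. The sketch below checks the arithmetic on synthetic data (a hypothetical 1,000-sample array, not the dry bean dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 samples, two balanced classes
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.repeat([0, 1], 500)

# Step 1: hold out 40% of the data
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=123, stratify=y_toy)

# Step 2: split the held-out 40% into 25% validation and 75% test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.75, random_state=123, stratify=y_tmp)

for name, part in [("train", X_tr), ("val", X_va), ("test", X_te)]:
    print(f"{name}: {len(part)} samples ({len(part) / len(X_toy):.0%})")
```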

**Unnormalized**

In [7]:
from sklearn.neighbors import KNeighborsClassifier

# Compare validation accuracy for k = 1, ..., 10 on the unscaled features
for i in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=i)
    clf.fit(X_train, y_train)
    acc = clf.score(X_val, y_val)
    print(f"N neighbors: {i} | Acc: {acc:.2f}")

N neighbors: 1 | Acc: 0.72
N neighbors: 2 | Acc: 0.68
N neighbors: 3 | Acc: 0.71
N neighbors: 4 | Acc: 0.73
N neighbors: 5 | Acc: 0.73
N neighbors: 6 | Acc: 0.73
N neighbors: 7 | Acc: 0.72
N neighbors: 8 | Acc: 0.72
N neighbors: 9 | Acc: 0.72
N neighbors: 10 | Acc: 0.71
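
Instead of reading the best setting off the printed list, the validation accuracies can be collected and the best `n_neighbors` picked programmatically. This is a minimal sketch on synthetic data from `make_classification`. The names `val_scores` and `best_k` are introduced here for illustration and are not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the dry bean dataset)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=123)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123, stratify=y_toy)

# Map each candidate k to its validation accuracy
val_scores = {}
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_tr, y_tr)
    val_scores[k] = clf.score(X_va, y_va)

# max() returns the first key with the highest score,
# so ties resolve to the smallest k
best_k = max(val_scores, key=val_scores.get)
print(f"Best k: {best_k} | Validation acc: {val_scores[best_k]:.2f}")
```

Selecting on the validation set and reserving the test set for a single final evaluation keeps the test accuracy an unbiased estimate.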


In [8]:
clf = KNeighborsClassifier(n_neighbors=1)

clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.725735294117647
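
A likely reason for the modest accuracy on the raw features is scale. k-nearest neighbors relies on distances (Euclidean by default in scikit-learn), and features with large magnitudes such as `Area` dominate features with small magnitudes such as `ShapeFactor1`. The sketch below illustrates this with the `Area` and `ShapeFactor1` values of the first two rows from the `df.head()` output. Restricting the comparison to two features is a simplification for illustration.

```python
import numpy as np

# Area and ShapeFactor1 for the first two rows shown in the df.head() output
bean_0 = np.array([28395.0, 0.007332])
bean_1 = np.array([28734.0, 0.006979])

# Per-feature contribution to the squared Euclidean distance
squared_diffs = (bean_0 - bean_1) ** 2
share = squared_diffs / squared_diffs.sum()

print(f"Squared difference in Area:         {squared_diffs[0]:.1f}")
print(f"Squared difference in ShapeFactor1: {squared_diffs[1]:.2e}")
print(f"Share of squared distance from Area: {share[0]:.6f}")
```

With unscaled inputs, the distance between these two beans is determined almost entirely by `Area`.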

**Normalized**

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train_std = scaler.transform(X_train)
X_val_std = scaler.transform(X_val)
X_test_std = scaler.transform(X_test)
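
Standardization replaces each feature value x with z = (x - μ) / σ, where μ and σ are the mean and standard deviation of that feature estimated from the training set only. The sketch below, on synthetic data (hypothetical arrays, not the dry bean dataset), reproduces the `StandardScaler` output by hand to make the computation explicit:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(123)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Fit on the training data only
scaler = StandardScaler().fit(X_train_toy)

# Manual z-score using training-set statistics
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (X_test_toy - mu) / sigma

print(np.allclose(manual_test, scaler.transform(X_test_toy)))
```

Reusing the training-set μ and σ for the validation and test sets, as the solution does, avoids leaking information from held-out data into preprocessing.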

In [10]:
# Compare validation accuracy for k = 1, ..., 10 on the standardized features
for i in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=i)
    clf.fit(X_train_std, y_train)
    acc = clf.score(X_val_std, y_val)
    print(f"N neighbors: {i} | Acc: {acc:.2f}")

N neighbors: 1 | Acc: 0.90
N neighbors: 2 | Acc: 0.89
N neighbors: 3 | Acc: 0.91
N neighbors: 4 | Acc: 0.92
N neighbors: 5 | Acc: 0.92
N neighbors: 6 | Acc: 0.91
N neighbors: 7 | Acc: 0.92
N neighbors: 8 | Acc: 0.92
N neighbors: 9 | Acc: 0.92
N neighbors: 10 | Acc: 0.92


In [12]:
clf = KNeighborsClassifier(n_neighbors=10)

clf.fit(X_train_std, y_train)
clf.score(X_test_std, y_test)

0.924264705882353
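
As an alternative to transforming each subset by hand, the scaler and classifier can be chained in a scikit-learn pipeline, so that fitting the scaler on the training data and applying it to new data happen automatically. This is a sketch of that alternative on synthetic data, not the approach used in the solution above. The dataset and the rescaled first column are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the dry bean dataset)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=123)
X_toy[:, 0] *= 1000  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy)

# The pipeline fits the scaler on the training data only,
# then applies the same transformation before every predict/score call
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
pipe.fit(X_tr, y_tr)

test_acc = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.2f}")
```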