<a href="https://colab.research.google.com/github/thedatadj/FruitClassifier/blob/main/using_KNeighborsClassifier_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Binary fruit classification
In this project I used scikit-learn's KNeighborsClassifier to classify fruits as "orange" or "not orange".

## What I learned
- Perform a hyperparameter search.
- Visualize the true and predicted outputs for comparison.
- Use a KNeighborsClassifier for a binary classification problem.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# I loaded the data

fruit_data = pd.read_table("/content/drive/MyDrive/Colab Notebooks/Workspace/Fruit Classifier using KNN/fruit_data_with_colors.txt")

In [None]:
fruit_data.shape

(59, 7)

This dataset has 59 examples and seven columns.

In [None]:
fruit_data.head()

Unnamed: 0,fruit_label,fruit_name,fruit_subtype,mass,width,height,color_score
0,1,apple,granny_smith,192,8.4,7.3,0.55
1,1,apple,granny_smith,180,8.0,6.8,0.59
2,1,apple,granny_smith,176,7.4,7.2,0.6
3,2,mandarin,mandarin,86,6.2,4.7,0.8
4,2,mandarin,mandarin,84,6.0,4.6,0.79


In [None]:
# I split the data into features (X) and target labels (y)

X = fruit_data[["mass", "width", "height", "color_score"]]
y = fruit_data["fruit_label"]


# Store the fruit names in a separate variable
target_class_names = fruit_data.fruit_name.unique()

In [None]:
target_class_names

array(['apple', 'mandarin', 'orange', 'lemon'], dtype=object)
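Since the classifier predicts the numeric `fruit_label`, a lookup from label to name can make predictions easier to read. The following is a minimal sketch, assuming only the five rows shown in the `fruit_data.head()` output; the names `sample` and `label_to_name` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Five rows copied from the fruit_data.head() output above.
sample = pd.DataFrame({
    "fruit_label": [1, 1, 1, 2, 2],
    "fruit_name": ["apple", "apple", "apple", "mandarin", "mandarin"],
})

# Build a label -> name lookup from the distinct (label, name) pairs.
label_to_name = (
    sample[["fruit_label", "fruit_name"]]
    .drop_duplicates()
    .set_index("fruit_label")["fruit_name"]
    .to_dict()
)

print(label_to_name)
```

Running the same chain on the full `fruit_data` frame would cover all four fruit names.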

In [None]:
# I split X and y into training and test sets (by default, 25% of the rows go to the test set)

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


# This shows the shape of each split; the percentages are approximate (44/59 and 15/59)
shapes = {"X": [X.shape, X_train.shape, X_test.shape],
          "y": [y.shape, y_train.shape, y_test.shape],
          "Percent": ["100%", "75%", "25%"]}

pd.DataFrame(shapes, index=["Data", "Train", "Test"])

Unnamed: 0,X,y,Percent
Data,"(59, 4)","(59,)",100%
Train,"(44, 4)","(44,)",75%
Test,"(15, 4)","(15,)",25%
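The percentages in the table are typed in by hand. As a hedged sketch, they could instead be derived from the split itself. The data below is a synthetic stand-in with the same shape as `X` and `y` (59 rows, 4 features), and the class counts are an assumption. The `stratify` argument is not used in the original notebook; it is shown only as an option that keeps class proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as X and y above (59 rows, 4 features).
# The values and class counts are synthetic placeholders, not the fruit data.
rng = np.random.default_rng(0)
X_demo = rng.random((59, 4))
y_demo = np.repeat([1, 2, 3, 4], [15, 15, 15, 14])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Derive the shares from the actual shapes instead of typing them in.
train_share = len(X_tr) / len(X_demo)
test_share = len(X_te) / len(X_demo)
print(f"train: {len(X_tr)} rows ({train_share:.0%}), "
      f"test: {len(X_te)} rows ({test_share:.0%})")
```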


In [None]:
# I trained a KNN classifier

from sklearn.neighbors import KNeighborsClassifier


model = KNeighborsClassifier(n_neighbors=5)

model.fit(X_train, y_train)

In [None]:
# I want to see the model's performance on the training and test sets

X_train_score = model.score(X_train, y_train)
X_test_score = model.score(X_test, y_test)

print("Train score", X_train_score, "\nTest score", X_test_score)

Train score 0.7954545454545454 
Test score 0.5333333333333333


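This is poor performance 😞: about 80% accuracy on the training set and only 53% on the test set. The low training score suggests the model is underfitting. KNN relies on distances between examples, and the features are on very different scales (mass is in the hundreds, while color_score is below 1), so mass is likely to dominate the distance.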
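To see how much a single feature can contribute to a Euclidean distance on the raw data, here is a minimal sketch using two rows from the `fruit_data.head()` output (the first apple and the first mandarin). It assumes the default Euclidean metric of KNeighborsClassifier; the variable names are illustrative.

```python
import numpy as np

# Rows 0 (apple) and 3 (mandarin) from the fruit_data.head() output:
# mass, width, height, color_score.
features = ["mass", "width", "height", "color_score"]
a = np.array([192, 8.4, 7.3, 0.55])
b = np.array([86, 6.2, 4.7, 0.80])

# Each feature's share of the squared Euclidean distance.
squared_diff = (a - b) ** 2
share = squared_diff / squared_diff.sum()

for name, s in zip(features, share):
    print(f"{name:12s} {s:.4f}")
```

For this pair, mass accounts for more than 99% of the squared distance, so the other three features barely affect which neighbors are considered close.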

In [None]:
# Here I normalized the features to improve the performance

from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()

Z_train = scaler.fit_transform(X_train)

Z_test = scaler.transform(X_test)

In [None]:
# Here I compare the features after and before scaling (scaled rows first, then the original rows)

print(Z_train[:5],"\n", "\n", X_train.head())

[[0.27857143 0.41176471 0.49230769 0.72972973]
 [0.35       0.44117647 0.93846154 0.45945946]
 [0.         0.         0.         0.7027027 ]
 [0.27142857 0.52941176 0.50769231 0.37837838]
 [0.31428571 0.41176471 0.46153846 0.67567568]] 
 
     mass  width  height  color_score
42   154    7.2     7.2         0.82
48   174    7.3    10.1         0.72
7     76    5.8     4.0         0.81
14   152    7.6     7.3         0.69
32   164    7.2     7.0         0.80
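MinMaxScaler rescales each column to the range [0, 1] with (x - min) / (max - min). The sketch below checks that formula against the scaler on the five training rows printed above. Because the minimum and maximum here come from these five rows only, the scaled values will differ from the `Z_train` output, which was fit on all 44 training rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The five training rows printed above (mass, width, height, color_score).
X_small = np.array([
    [154, 7.2, 7.2, 0.82],
    [174, 7.3, 10.1, 0.72],
    [76, 5.8, 4.0, 0.81],
    [152, 7.6, 7.3, 0.69],
    [164, 7.2, 7.0, 0.80],
])

# Manual min-max scaling, column by column.
col_min = X_small.min(axis=0)
col_max = X_small.max(axis=0)
manual = (X_small - col_min) / (col_max - col_min)

# The same transformation done by scikit-learn.
scaled = MinMaxScaler().fit_transform(X_small)
print(np.allclose(manual, scaled))
```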


In [None]:
# I trained the classifier with the normalized features

model.fit(Z_train, y_train)


# This code showed me the performance on the normalized input features
Z_train_score = model.score(Z_train, y_train)
Z_test_score = model.score(Z_test, y_test)

print("Normalized Train score", Z_train_score, "\nNormalized Test score", Z_test_score)

Normalized Train score 0.9545454545454546 
Normalized Test score 1.0


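Performance is much better now 🙂: test accuracy rose from 0.53 to 1.0 and training accuracy from 0.80 to 0.95. Since the test set has only 15 examples, this estimate is noisy.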
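One way to get a steadier estimate than a single train/test split is cross-validation with the scaler and the classifier wrapped in a pipeline. This is a sketch, not what the notebook does: it uses scikit-learn's bundled wine dataset as a stand-in (an assumption, since the fruit file is not available here), and the names `pipeline` and `scores` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in dataset (assumption): replace with the fruit features and labels.
X_demo, y_demo = load_wine(return_X_y=True)

# The scaler is re-fit on the training part of every fold, so no
# statistics from the held-out part leak into the scaling.
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

# Five-fold cross-validation; each score is the accuracy on one held-out fold.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

With the fruit data, passing `X` and `y` in place of `X_demo` and `y_demo` would give one accuracy per fold instead of a single 15-example test score.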

In [None]:
# This is a summary of the results

results_data = {"Unnormalized": [X_train_score, X_test_score],
                "Normalized": [Z_train_score, Z_test_score]}

results = pd.DataFrame(results_data, index=["Train", "Test"])
results.columns.name = "Performance"

results

Performance,Unnormalized,Normalized
Train,0.795455,0.954545
Test,0.533333,1.0
