


#### Intended Learning Outcomes:
This notebook provides a gentle introduction to classification, a type of supervised machine learning problem.
By the end of this notebook you will know how to:

- Train and evaluate a machine learning classifier using the K-Nearest Neighbours (KNN) model

#### Libraries to be used:
You can activate your previously used environment, though you will not need most of its packages. In this tutorial, we will use only the most common Python libraries, such as `pandas`, `numpy`, `matplotlib`, `scipy` and `seaborn`.

We will also use scikit-learn, a machine learning library for Python. You can install it with `pip`; see the instructions here: https://scikit-learn.org/stable/install.html

## 1. Classification

This is an object recognition task in which you will train a supervised machine learning classifier to identify a fruit type based on certain features such as mass, width, height and color score.

#### 1.1 Read the dataset
The dataset is available here: https://raw.githubusercontent.com/rnanda17/data_science_BE/refs/heads/main/fruit_dataset.csv
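
The later cells refer to a DataFrame named `df_fruits`, so one way to create it is with `pandas.read_csv`. The sketch below assumes the file is comma-separated with a header row; the helper `load_fruits` and the two toy rows are hypothetical and only illustrate the call without needing network access.

```python
import io
import pandas as pd

# URL given in the text above
DATA_URL = (
    "https://raw.githubusercontent.com/rnanda17/data_science_BE/"
    "refs/heads/main/fruit_dataset.csv"
)

def load_fruits(source=DATA_URL):
    """Read the fruit table from a URL, a file path or a file-like object.

    Hypothetical helper; assumes a comma-separated file with a header row.
    """
    return pd.read_csv(source)

# Offline illustration with made-up rows that only mimic the column names
# used later in this notebook. These are NOT rows from the real dataset.
toy_csv = io.StringIO(
    "fruit_label,fruit_name,mass,width,height,color_score\n"
    "1,apple,180,8.0,7.0,0.60\n"
    "3,orange,160,7.5,7.5,0.80\n"
)
df_toy = load_fruits(toy_csv)
print(df_toy.head())

# With network access, the real dataset would be loaded as:
# df_fruits = load_fruits()
```

After loading, `df_fruits.head()` and `df_fruits.shape` are quick ways to confirm that the columns and row count look as expected.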

#### 1.2 Find out how many unique types of fruit are in the dataset
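
One possible approach uses `nunique`, `unique` and `value_counts` on the `fruit_name` column. The sketch runs on a made-up stand-in for `df_fruits` (the rows and fruit names are assumptions, not real data); on the real DataFrame the same calls would apply.

```python
import pandas as pd

# Made-up stand-in for df_fruits (not real data)
df_toy = pd.DataFrame({
    "fruit_label": [1, 1, 2, 3, 3, 4],
    "fruit_name": ["apple", "apple", "mandarin", "orange", "orange", "lemon"],
})

n_types = df_toy["fruit_name"].nunique()      # number of distinct fruit names
names = df_toy["fruit_name"].unique()         # distinct names, in order of first appearance
counts = df_toy["fruit_name"].value_counts()  # number of rows per fruit name

print(n_types)
print(names)
print(counts)
```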

#### 1.3 Which label belongs to which fruit?

So, there are 4 types of fruit and 4 fruit labels. Imagine that this dataset contained hundreds of fruits, as it might in real life: how would you know which label belongs to which fruit?

To answer this, we will create a dictionary that maps fruit labels to fruit names.

In [None]:
dict_mapping_fruits = dict(zip(df_fruits['fruit_label'],df_fruits['fruit_name']))
dict_mapping_fruits
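
The `dict(zip(...))` construction silently keeps the last name seen for each label, so it can be worth checking that labels and names really pair up one-to-one. A minimal sketch of such a check, on made-up rows standing in for `df_fruits` (the data and the variable names `pairs` and `is_one_to_one` are assumptions):

```python
import pandas as pd

# Made-up stand-in for df_fruits (not real data)
df_toy = pd.DataFrame({
    "fruit_label": [1, 1, 2, 3, 3, 4],
    "fruit_name": ["apple", "apple", "mandarin", "orange", "orange", "lemon"],
})

# Keep each (label, name) combination once
pairs = df_toy[["fruit_label", "fruit_name"]].drop_duplicates()

# One-to-one means no label and no name appears in more than one pair
is_one_to_one = pairs["fruit_label"].is_unique and pairs["fruit_name"].is_unique

mapping = dict(zip(pairs["fruit_label"], pairs["fruit_name"]))
print(is_one_to_one)
print(mapping)
```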

#### 1.4 Choose the features and the target variable
X represents the input features and y represents the target variable whose value needs to be predicted.

In [None]:
features_columns = ['mass','width','height','color_score'] # the features we have selected from the dataset
X = df_fruits[features_columns] # X contains the columns of the selected features: mass, width, height, color_score
y = df_fruits['fruit_label'] # y contains the fruit label

In [None]:
#Print the features, X
X.head()

In [None]:
#Print the target variable, y
y.head()

#### 1.5 Split the dataset into training and test sets (75% training and 25% test)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)
len_train = len(X_train)
len_test = len(X_test)
print("Length of the training set: {}".format(len_train))
print("Length of the test set: {}".format(len_test))
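
To see what `test_size` and `random_state` do, here is a small sketch on made-up arrays (10 samples, not the fruit data). With a fractional `test_size`, scikit-learn rounds the test set size up, and fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same random_state produce the same split
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0, test_size=0.25)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=0, test_size=0.25)

# 25% of 10 is 2.5, which is rounded up to 3 test samples
print(len(X_tr), len(X_te))

same_split = np.array_equal(y_te, y_te2)
print(same_split)
```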

#### 1.6 Train the KNN Classifier


In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 5) # we set n_neighbors=5
clf.fit(X_train, y_train)
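
To make the idea behind KNN concrete, the sketch below predicts one point by hand: compute the Euclidean distance to every training point, take the `k` closest, and return the majority label. The function `knn_predict_one` and the toy points are hypothetical; this is a simplified illustration, not scikit-learn's implementation, and tie-breaking may differ.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2D points forming two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

def knn_predict_one(X_train, y_train, x, k):
    """Majority vote among the k training points closest to x."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]              # indices of the k smallest distances
    votes = Counter(y_train[nearest].tolist())       # count labels among the neighbours
    return votes.most_common(1)[0][0]

query = np.array([1.1, 1.0])
manual = knn_predict_one(X_toy, y_toy, query, k=3)

# The same query through scikit-learn, which uses Euclidean distance by default
clf_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
sk = clf_toy.predict([query])[0]

print(manual, sk)
```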

#### 1.7 Compute the accuracy of the classifier on test data

In [None]:
clf.score(X_test, y_test)
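
For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted label equals the true label. A small sketch with made-up label arrays (not results from the fruit model) shows the same number computed by hand and with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels
y_true = np.array([1, 1, 2, 3, 3, 4, 4, 4])
y_hat = np.array([1, 2, 2, 3, 1, 4, 4, 3])

manual_acc = np.mean(y_true == y_hat)     # fraction of matching positions
sk_acc = accuracy_score(y_true, y_hat)    # same quantity via scikit-learn

print(manual_acc, sk_acc)
```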

#### 1.8 Print the confusion matrix. Compute the precision, recall and F-Score

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot()

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
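
The per-class numbers in the report can be derived from the confusion matrix, whose rows are true labels and whose columns are predicted labels. A sketch on made-up labels (not results from the fruit model): precision divides the diagonal by the column sums, recall divides it by the row sums, and the F-score is their harmonic mean. It assumes every class is predicted at least once, so no division by zero occurs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up true and predicted labels
y_true = np.array([1, 1, 2, 3, 3, 4, 4, 4])
y_hat = np.array([1, 2, 2, 3, 1, 4, 4, 3])
labels = [1, 2, 3, 4]

cm_toy = confusion_matrix(y_true, y_hat, labels=labels)
true_positives = np.diag(cm_toy)

precision = true_positives / cm_toy.sum(axis=0)  # correct / all predicted as that class
recall = true_positives / cm_toy.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

# Reference values from scikit-learn for comparison
sk_p, sk_r, sk_f, _ = precision_recall_fscore_support(y_true, y_hat, labels=labels)

print(cm_toy)
print(precision, recall, f1)
```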

#### 1.9 Use the trained classifier to classify new unseen instances

In [None]:
unknown_fruit_1 = [300,200,300,0.7] # values in the same order as the features: mass, width, height, color_score
clf.predict([unknown_fruit_1])

In [None]:
#Query the dictionary to look up the name of the fruit for a given label
dict_mapping_fruits[3]
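
Rather than typing a label by hand, predicted labels can be passed straight through the dictionary. The sketch below uses a hypothetical mapping and a made-up prediction array in place of `dict_mapping_fruits` and the output of `clf.predict`.

```python
import numpy as np

# Hypothetical stand-in for dict_mapping_fruits
label_to_name = {1: "apple", 2: "mandarin", 3: "orange", 4: "lemon"}

# Made-up array of the kind clf.predict returns
predicted_labels = np.array([3, 1, 4])

# Convert each NumPy integer to a plain int before the lookup
predicted_names = [label_to_name[int(label)] for label in predicted_labels]
print(predicted_names)
```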

#### 1.10 Bonus Exercise: Visualize the KNN Decision Boundary

In [None]:
!pip install yellowbrick

The visualization library Yellowbrick can only plot decision boundaries in 2 dimensions, so we must predict the fruit label from just 2 features. We select `width` and `height` and train the KNN model on these 2 features.

In [None]:
from yellowbrick.contrib.classifier import DecisionViz
select_two_features = ['width','height'] # yellowbrick is limited to 2 dimensions
X = df_fruits[select_two_features]
y = df_fruits['fruit_label'] 



In [None]:
X = X.to_numpy() # convert to NumPy arrays, which the yellowbrick module requires
y = y.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  #default split is 75 / 25

Visualize the decision boundary of the KNN classifier on the training set.

In [None]:
viz = DecisionViz(
    KNeighborsClassifier(4), title="Nearest Neighbors",
    features=['width', 'height'], classes=[1,2,3,4]
)
viz.fit(X_train, y_train)
viz.draw(X_train, y_train) 
viz.show()
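
Conceptually, a decision boundary plot comes from asking the classifier for a prediction at every point of a regular grid and colouring each point by its predicted class. The sketch below shows only that grid step, on made-up 2D points and without any plotting; it is an illustration of the general idea, not a description of Yellowbrick's internals.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2D points forming two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Regular 7 x 7 grid covering the square [0, 6] x [0, 6]
xs = np.linspace(0, 6, 7)
ys = np.linspace(0, 6, 7)
xx, yy = np.meshgrid(xs, ys)
grid_points = np.column_stack([xx.ravel(), yy.ravel()])

# Predicted class at each grid point, reshaped back to the grid layout
grid_labels = model.predict(grid_points).reshape(xx.shape)
print(grid_labels)
```

The boundary is where neighbouring grid cells switch from one label to another; a finer grid gives a smoother picture.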