# **1. Load Iris Plants Dataset**

In [None]:
import time
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
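# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'; fall back to the old name on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')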

# Load dataset
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['target_names'] = [iris.target_names[t] for t in iris.target]
target_col = 'target'
feature_col = iris.feature_names

In [None]:
iris_df.head()

<u>**Data Set Characteristics**</u>

**Number of Instances**: 150 (50 in each of three classes)  
**Number of Attributes**: 4 numeric, predictive attributes and the class  

<u>**Attribute Information**</u>
- sepal length in cm  
- sepal width in cm  
- petal length in cm  
- petal width in cm  
- class:  
  - Iris-Setosa  
  - Iris-Versicolor  
  - Iris-Virginica  

<u>**Summary Statistics**</u>

| Features | Min | Max | Mean | SD | Class Correlation |
|-:|-:|-:|-:|-:|-:|
| sepal length | 4.3 | 7.9 | 5.84 | 0.83 | 0.7826 |
| sepal width | 2.0 | 4.4 | 3.05 | 0.43 | -0.4194 |
| petal length | 1.0 | 6.9 | 3.76 | 1.76 | 0.9490 (high!) |
| petal width | 0.1 | 2.5 | 1.20 | 0.76 | 0.9565 (high!) |

**Missing Attribute Values**: None  
**Class Distribution**: 33.3% for each of 3 classes.  
**Creator**: R.A. Fisher  
**Donor**: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)  
**Date**: July, 1988  

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.  

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.  
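
The separability of one class can be checked directly. The sketch below is a minimal illustration, not part of the original dataset description: it assumes class 0 is setosa (as in `iris.target_names`) and picks petal length as the single feature to inspect. It does not test the other two classes against each other.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
petal_length = X[:, 2]  # column 2 is 'petal length (cm)'

setosa_max = petal_length[y == 0].max()  # class 0 is setosa
others_min = petal_length[y != 0].min()  # versicolor and virginica together

print(f"Largest setosa petal length: {setosa_max} cm")
print(f"Smallest petal length in the other two classes: {others_min} cm")
# If the two ranges do not overlap, one threshold on this feature separates setosa
print("Single threshold separates setosa:", setosa_max < others_min)
```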

**References**
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments".  IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al.'s AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...

In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(iris_df, test_size=0.3, random_state=0)
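
The split above is purely random, so the three classes are not guaranteed to appear in equal numbers in the test set. As a hedged alternative (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both parts. The sketch works on the raw arrays rather than `iris_df` to stay self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the 1/3 : 1/3 : 1/3 class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print("Train / test shapes:", X_train.shape, X_test.shape)
print("Test samples per class:", np.bincount(y_test))
```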

# **2. K-Nearest Neighbors**

For Classification: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html  
For Regression: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html  

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Suggested Parameters**  

* **n_neighbors**: 4
* **weights**: distance

In [None]:
n_neighbors = 4 #@param {type:"slider", min:1, max:10, step:1}
weights = 'distance'  #@param ['uniform', 'distance']

knn = KNeighborsClassifier(
    n_neighbors=n_neighbors,
    weights=weights,
)

knn.fit(train_df[feature_col], train_df[target_col])
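
To make the two parameters concrete, here is a hand-written sketch of the voting step. It is a simplified illustration on toy data, not scikit-learn's implementation; `knn_vote`, `X_toy`, `y_toy` and `query` are hypothetical names. With `weights='uniform'` every neighbor gets one vote; with `weights='distance'` each vote is weighted by the inverse of its distance.

```python
import numpy as np

def knn_vote(X_train, y_train, x, k, weights="uniform"):
    """Predict the label of one point x by voting among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    if weights == "uniform":
        w = np.ones(len(nearest))                # one vote per neighbor
    elif np.any(dists[nearest] == 0):
        w = (dists[nearest] == 0).astype(float)  # exact matches take the whole vote
    else:
        w = 1.0 / dists[nearest]                 # closer neighbors count more
    classes = np.unique(y_train)
    votes = np.array([w[y_train[nearest] == c].sum() for c in classes])
    return classes[np.argmax(votes)]

# One class-0 point close to the query, two class-1 points farther away
X_toy = np.array([[0.0], [3.0], [3.1]])
y_toy = np.array([0, 1, 1])
query = np.array([0.5])

uniform_label = knn_vote(X_toy, y_toy, query, k=3, weights="uniform")
distance_label = knn_vote(X_toy, y_toy, query, k=3, weights="distance")
print("uniform:", uniform_label, "| distance:", distance_label)
```

With uniform weights the two distant class-1 points outvote the single nearby class-0 point; with distance weights the nearby point wins.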

In [None]:
# Predict as Class
train_predict = knn.predict(train_df[feature_col])
test_predict = knn.predict(test_df[feature_col])

# Predict as Probability
train_predict_prob = knn.predict_proba(train_df[feature_col])

In [None]:
# Show the first 10 predictions
print(train_predict[:10])

# Probability of each class (one column per class, in the order of knn.classes_: 0, 1, 2)
print(train_predict_prob[:10])

In [None]:
# Actual Target
train_df[target_col].iloc[:10].values
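
The relationship between `predict` and `predict_proba` can be sketched as follows. This is a small self-contained example fitted on the full Iris arrays with uniform weights (both choices are assumptions made for illustration, not the notebook's settings). Each row of probabilities has one column per class, ordered as in `classes_`, and the predicted class is the column with the highest value.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=4, weights="uniform").fit(X, y)

proba = clf.predict_proba(X[:5])
print("Column order:", clf.classes_)
print(proba)  # with uniform weights, each entry is the fraction of neighbors in that class

# Picking the most probable column reproduces the class prediction
predicted = clf.classes_[np.argmax(proba, axis=1)]
print(predicted)
```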

In [None]:
from sklearn.metrics import confusion_matrix

# Rows are actual classes and columns are predicted classes; list all three labels
confusion_matrix(test_df[target_col], test_predict, labels=[0, 1, 2])

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_df[target_col], test_predict))
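
How the confusion matrix relates to the precision and recall columns of the report can be shown on a toy example. The labels below are made up for illustration (`y_true` and `y_hat` are hypothetical, not model output).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: one class-1 sample is misclassified as class 2
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]
labels = [0, 1, 2]

cm = confusion_matrix(y_true, y_hat, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"actual {c}" for c in labels],
    columns=[f"predicted {c}" for c in labels],
)
print(cm_df)

# Precision: diagonal divided by column sum. Recall: diagonal divided by row sum.
precision = precision_score(y_true, y_hat, labels=labels, average=None)
recall = recall_score(y_true, y_hat, labels=labels, average=None)
print("precision per class:", precision)
print("recall per class:   ", recall)
```

Class 2 has perfect recall (both actual 2s were found) but precision 2/3, because one of the three samples predicted as 2 was really a 1.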

# **3. Visualization**

Adapted from: https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html

In [None]:
#@markdown # **Select parameters to plot**
#@markdown **(On a simplified model for visualization - 2 variables)**
n_neighbors = 7 #@param {type:"slider", min:1, max:10, step:1}
weights = 'distance'  #@param ['uniform', 'distance']
x_label = 'sepal length (cm)' #@param ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
y_label = 'sepal width (cm)' #@param ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

x_idx = iris_df.columns.to_list().index(x_label)
y_idx = iris_df.columns.to_list().index(y_label) 

from matplotlib.colors import ListedColormap
cmap_val = np.linspace(0.0, 1.0, 20)
# tab20 alternates dark and light shades; take the first three of each.
# plt.get_cmap is used directly because cm.get_cmap was removed in recent Matplotlib releases.
tab20 = plt.get_cmap('tab20')
cmap_light = ListedColormap(tab20(cmap_val)[1:7:2])
cmap_bold = ListedColormap(tab20(cmap_val)[:6:2])
h = .02  # step size in the mesh

model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
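# Fit on plain arrays so that predicting on the mesh grid below does not raise a feature-name warning
model.fit(iris_df.iloc[:, [x_idx, y_idx]].values, iris_df[target_col])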
x_min, x_max = iris_df.iloc[:, x_idx].min() - 1, iris_df.iloc[:, x_idx].max() + 1
y_min, y_max = iris_df.iloc[:, y_idx].min() - 1, iris_df.iloc[:, y_idx].max() + 1
xx, yy = np.meshgrid(
  np.arange(x_min, x_max, h),
  np.arange(y_min, y_max, h)
)
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 8), dpi=80)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(
  iris_df.iloc[:, x_idx],
  iris_df.iloc[:, y_idx],
  c=iris_df[target_col],
  cmap=cmap_bold,
  edgecolor='k',
  s=60
)
plt.xlabel(x_label)
plt.xlim(xx.min(), xx.max())
plt.ylabel(y_label)
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i, weights = '%s')"
          % (n_neighbors, weights))
plt.show()

# **4. Find the best K value**

## 4.1 Grid Search

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=dict(
        n_neighbors=[1,2,3,4,5,6,7,8,9,10],
        weights=['uniform', 'distance'],
    ),
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1 # Parallel
)

grid_start_time = time.time()
grid_search.fit(train_df[feature_col], train_df[target_col])
grid_end_time = time.time()
print(f"Search time: {datetime.timedelta(seconds=grid_end_time-grid_start_time)}")

In [None]:
# Get the search results (one row per parameter combination)
grid_search_result = grid_search.cv_results_
pd.DataFrame.from_dict(grid_search_result)
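
The full `cv_results_` table has many columns. A minimal sketch of how to read it, assuming the same parameter grid as above but fitted on the full Iris arrays to stay self-contained (so the scores will differ from the notebook's `train_df` run):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={
        "n_neighbors": list(range(1, 11)),
        "weights": ["uniform", "distance"],
    },
    scoring="f1_weighted",
    cv=5,
)
search.fit(X, y)

# Keep only the columns needed to compare candidates, best rank first
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_neighbors", "param_weights",
        "mean_test_score", "std_test_score", "rank_test_score"]
top = results[cols].sort_values("rank_test_score").head()
print(top)

# Shortcuts to the winning combination and its mean cross-validated score
print(search.best_params_, round(search.best_score_, 4))
```

The grid has 10 x 2 = 20 combinations, and each is evaluated on 5 folds, so the search fits 100 models plus one final refit of the best combination.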

In [None]:
# Best Trained Model 
model = grid_search.best_estimator_

In [None]:
# Predict with the best model
y_pred = model.predict(test_df[feature_col])
y_pred[:10]

## 4.2 Randomized Search

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
rand_search = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=dict(
        n_neighbors=[1,2,3,4,5,6,7,8,9,10],
        weights=['uniform', 'distance'],
    ),
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1
)

rand_start_time = time.time()
rand_search.fit(train_df[feature_col], train_df[target_col])
rand_end_time = time.time()
print(f"Search time: {datetime.timedelta(seconds=rand_end_time-rand_start_time)}")

In [None]:
# Get the search results (one row per sampled parameter combination)
rand_search_result = rand_search.cv_results_
pd.DataFrame.from_dict(rand_search_result)
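
Unlike the grid search, the randomized search only tries a sample of the combinations. The number tried is set by `n_iter`, which defaults to 10, so the search above evaluates 10 of the 20 possible combinations. The sketch below makes this explicit. The values `n_iter=8` and `random_state=0` are illustrative assumptions, and the model is fitted on the full Iris arrays to stay self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rs = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        "n_neighbors": list(range(1, 11)),
        "weights": ["uniform", "distance"],
    },
    n_iter=8,         # try 8 of the 20 combinations
    scoring="f1_weighted",
    cv=5,
    random_state=0,   # make the sampled combinations reproducible
)
rs.fit(X, y)

sampled = rs.cv_results_["params"]
n_candidates = len(sampled)
print("Combinations tried:", n_candidates)
print("Best of those:", rs.best_params_)
```

Because only a subset is tried, the best combination found here may differ from the grid search result.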

In [None]:
# Best Trained Model 
model = rand_search.best_estimator_

In [None]:
# Predict with the best model
y_pred = model.predict(test_df[feature_col])
y_pred[:10]

# **5. Handwritten Digits Dataset (OCR - Optical Character Recognition)**



**Data Set Characteristics:**

**Number of Instances**: 1797 (the copy loaded by `load_digits`; the full UCI dataset has 5620)  
**Number of Attributes**: 64  
**Attribute Information**: 8x8 image of integer pixels in the range 0..16.  
**Missing Attribute Values**: None  
**Creator**: E. Alpaydin (alpaydin '@' boun.edu.tr)  
**Date**: July 1998  

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits  

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.  

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and a different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels is counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.  
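
The block-counting step described above can be sketched with NumPy. This is an illustration only: the bitmap is random stand-in data, not a real NIST scan, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))  # stand-in 32x32 binary image (0 = off, 1 = on)

# Split into an 8x8 grid of 4x4 blocks: axes are (block row, row in block, block col, col in block)
blocks = bitmap.reshape(8, 4, 8, 4)

# Count the on pixels inside each block, giving an 8x8 matrix of integers in 0..16
features = blocks.sum(axis=(1, 3))

print(features.shape)
print(features)
```

Each of the 64 resulting counts is one feature, which is why the dataset has 64 attributes.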

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469, 1994.  

**References**
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionality reduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

digits_df = pd.DataFrame(digits.data)
digits_df['image'] = list(digits.images)
digits_df['target'] = digits.target
digits_df['target_names'] = [digits.target_names[t] for t in digits.target]
digits_target_col = 'target'
digits_feature_col = np.arange(0,64)

from sklearn.model_selection import train_test_split
digits_train_df, digits_test_df = train_test_split(digits_df, test_size=0.3, random_state=0)

In [None]:
#@markdown ## **Press <u>run</u> to show sample images**
for i in range(3):
  plt.figure(dpi=100)
  plt.imshow(np.vstack([np.hstack(digits.images[(i*10):(i*10) + 10])]))
  plt.axis('off')
  plt.show()

We train KNN on the 64 pixel values (8x8) of each image.

**Suggested Parameters**  

* **n_neighbors**: 8  

In [None]:
n_neighbors = 8 #@param {type:"slider", min:1, max:10, step:1}
knn = KNeighborsClassifier(
    n_neighbors=n_neighbors
)

knn.fit(digits_train_df[digits_feature_col], digits_train_df[digits_target_col])

## **5.1 Example of Nearest Neighbors**

In [None]:
select_sample = 556 #@param {type:"slider", min:0, max:1256, step:1}
print('Sample:', select_sample, '\nNumber:', digits_train_df.iloc[select_sample]['target'])
plt.figure(dpi=20)
plt.imshow(digits_train_df.iloc[select_sample]['image']) 
plt.axis('off')
plt.show()

In [None]:
# Find nearest neighbors by .kneighbors
# Input: 1 image (from above)
# Return: Distance and Index of nearest neighbors
neighbors_dist, neighbors_idx = knn.kneighbors(
    digits_train_df.iloc[select_sample][digits_feature_col].values.reshape(1, -1)
)
# .values.reshape(1, -1) turns the single row into a 2-D array of shape (1, 64), the input shape kneighbors expects

In [None]:
#@markdown ## **Press <u>run</u> to show nearest neighbors**
merge_image = [] 
for i in neighbors_idx[0]:
  merge_image.append(digits_train_df.iloc[i]['image'])

print('Number  :', *[f'{digits_train_df.iloc[i]["target"]:5d}' for i in neighbors_idx[0]])
print('Distance:', *[f'{d:5.2f}' for d in neighbors_dist[0]])

plt.figure(dpi=100)
plt.imshow(np.hstack(merge_image)) 
plt.axis('off')
plt.show()
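
The distances printed above are Euclidean distances between pixel vectors, which is the default metric of `KNeighborsClassifier`. A minimal check, fitted on the full digits arrays rather than `digits_train_df` to stay self-contained (so the indices differ from the notebook's):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=8).fit(X, y)

query = X[0].reshape(1, -1)        # one image as a (1, 64) array
dist, idx = clf.kneighbors(query)  # both have shape (1, 8), nearest first

# Recompute the same distances by hand from the 64 pixel values
manual = np.linalg.norm(X[idx[0]] - query, axis=1)

print("kneighbors:", np.round(dist[0], 2))
print("manual:    ", np.round(manual, 2))
print("labels:    ", y[idx[0]])
```

Because the query image is itself part of the fitted data, the nearest neighbor is at distance 0. The same happens in the cell above when the selected sample comes from the training set.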

## **5.2 Prediction Result**

In [None]:
prediction = knn.predict(digits_test_df[digits_feature_col])
print(classification_report(digits_test_df[digits_target_col], prediction))

**98% Accuracy !!**