# `KNeighborsClassifier With sklearn`
-----------------


## High-Level End-to-End ML Project
- Get data (files, RDBMS, NoSQL databases, graph databases)
- Preprocess the data
    - Missing values (e.g. a respondent declines to reveal their age)
    - Outliers (e.g. an age of 300)
    - Normalization / scaling to unit variance
    - Feature identification (dimensionality reduction)
    - Converting unbalanced data into balanced data
    - etc.
- Identify X (independent variables) and y (dependent variable)
    * X (2D NumPy array)
    * y (1D NumPy array)

- Split the data into train and test sets
- Fit / train the model using the train data
- Predict on the test data
- Compute metrics (accuracy)
- If the accuracy is not good enough, do `hyperparameter tuning` and rebuild the model
- Save the model
- Create a REST API using this model
- Test the REST API using Postman
- UI developers use this API to build the web application
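
The preprocessing steps listed above (missing values, scaling to unit variance) can be chained with the classifier so that they are learned from the train data only. The following is a minimal sketch, not part of the original notebook: it assumes the iris copy bundled with scikit-learn instead of `iris.csv`, and it injects one artificial missing value purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for the CSV used in this notebook.
X, y = load_iris(return_X_y=True)

# Simulate one missing value so the imputation step has something to do.
X = X.copy()
X[0, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Impute -> scale to zero mean / unit variance -> classify.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because KNN relies on distances, features on larger scales would otherwise dominate the neighbor search, which is why scaling is usually placed before the classifier.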

## Import Required Modules

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

## Load `csv file` and Understand `X` and `y` Data

In [2]:
# Change directory to the location of the CSV file
os.chdir("C:\\Users\\ramreddymyla\\Google Drive\\01 DS ML DL NLP and AI With Python Lab Copy\\02 Lab Data\\Python")

In [3]:
# Load csv file into DataFrame
df = pd.read_csv("iris.csv")

In [4]:
# Get top 5 Rows
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [5]:
# Observe all the columns
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
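
If `iris.csv` is not at hand, an equivalent DataFrame can be built from the copy of the dataset that ships with scikit-learn. This is a sketch under assumptions: the column names are renamed by hand to match the ones shown above, and the CSV used in this notebook may differ from the bundled data in small details.

```python
from sklearn.datasets import load_iris

# Load the bundled dataset as a DataFrame (feature columns plus a numeric "target").
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Rename the columns to match the CSV layout used in this notebook.
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Replace the numeric target codes with the species names.
code_to_name = dict(enumerate(iris.target_names.tolist()))
df["species"] = df["species"].map(code_to_name)

print(df.head())
```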

In [6]:
# create a dataframe X with required input columns
X=df.loc[:,df.columns!="species"] 

In [7]:
type(X)

pandas.core.frame.DataFrame

In [8]:
X.head()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [9]:
y=df.species # create a series with target values

In [10]:
y.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [11]:
y.replace(['setosa', 'versicolor', 'virginica'],[0,1,2],inplace=True)

In [12]:
y.unique()

array([0, 1, 2], dtype=int64)
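
An alternative way to encode the labels is an explicit dictionary with `Series.map`, which avoids modifying a column of `df` in place and keeps the mapping available for decoding predictions later. A small sketch on a toy Series (the names `label_map` and `inverse_map` are illustrative, not from the notebook):

```python
import pandas as pd

# Toy stand-in for df.species.
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# Explicit mapping from class name to integer code.
label_map = {"setosa": 0, "versicolor": 1, "virginica": 2}
y = species.map(label_map)

# The inverse mapping turns predicted codes back into names.
inverse_map = {code: name for name, code in label_map.items()}
decoded = y.map(inverse_map)

print(y.tolist())
print(decoded.tolist())
```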

In [13]:
X=X.values # converting df to 2d Numpy Array

In [14]:
type(X)

numpy.ndarray

In [15]:
X.ndim

2

In [16]:
X[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [17]:
y=y.values # converting Series to 1d Numpy Array

In [18]:
type(y)

numpy.ndarray

In [19]:
y.ndim

1

In [20]:
y[:5]

array([0, 0, 0, 0, 0], dtype=int64)

## Split Data for training and testing

In [21]:
seed=42

In [1]:
#train_test_split?

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.3, # train : 105 # test : 45
                                                    random_state=seed, # reproduce # seed
                                                    stratify=y) # input data ratio(50:50:50) = train data ratio(35:35:35) = test data ratio(15:15:15)

1. Do you understand `stratify`?

    `Must read:` https://en.wikipedia.org/wiki/Stratified_sampling
2. Do you understand `random_state`?
3. What is balanced data?

In [24]:
# Count of each class in original data
unique, counts = np.unique(y, return_counts=True)
dict(zip(unique, counts))

{0: 50, 1: 50, 2: 50}

In [25]:
# Count of each class in train sample data
unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))

{0: 35, 1: 35, 2: 35}

In [26]:
# Count of each class in test sample data
unique, counts = np.unique(y_test, return_counts=True)
dict(zip(unique, counts))

{0: 15, 1: 15, 2: 15}
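
To see what `stratify` and `random_state` each contribute, the split can be repeated with and without stratification. A minimal sketch, assuming the iris data bundled with scikit-learn; the helper `class_counts_in_test` is a hypothetical name introduced here:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)


def class_counts_in_test(stratify):
    """Return the number of samples per class in the test split."""
    _, _, _, y_te = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=stratify
    )
    return np.bincount(y_te, minlength=3)


# With stratify=y the class ratio of the input is preserved in the test split.
stratified_counts = class_counts_in_test(stratify=y)
# Without it, the counts are whatever the random shuffle happens to produce.
plain_counts = class_counts_in_test(stratify=None)

# The same random_state reproduces exactly the same split.
split_a = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
same_split = np.array_equal(split_a[1], split_b[1])

print("stratified:", stratified_counts)
print("not stratified:", plain_counts)
print("reproducible:", same_split)
```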

## Fit The Model

In [27]:
knn = KNeighborsClassifier(n_neighbors=8)

In [28]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=8, p=2,
                     weights='uniform')

## Predict labels of test data

In [29]:
# Observe the predicted class probabilities of the first 10 test samples
knn.predict_proba(X_test)[:10]

array([[0.   , 0.   , 1.   ],
       [0.   , 0.875, 0.125],
       [0.   , 0.5  , 0.5  ],
       [0.   , 0.75 , 0.25 ],
       [0.   , 0.375, 0.625],
       [0.   , 0.   , 1.   ],
       [0.   , 1.   , 0.   ],
       [0.   , 1.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   ]])
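
With the default `weights='uniform'`, each probability is simply the fraction of the 8 nearest neighbors that belong to a class, for example 0.875 = 7/8. In the third row above the vote is tied at 0.5 / 0.5, and the prediction printed further down resolves it to class 1. The sketch below recomputes the probabilities by hand from `kneighbors`; it assumes the iris data bundled with scikit-learn, so individual rows may differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

k = 8
knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

# Indices (into X_train) of the k nearest neighbors of every test sample.
_, neighbor_idx = knn.kneighbors(X_test)
neighbor_labels = y_train[neighbor_idx]  # shape: (n_test, k)

# Fraction of neighbors voting for each class.
manual_proba = np.stack(
    [(neighbor_labels == c).mean(axis=1) for c in knn.classes_], axis=1
)

matches = np.allclose(manual_proba, knn.predict_proba(X_test))
print("manual vote fractions match predict_proba:", matches)
```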

In [30]:
y_pred = knn.predict(X_test)

In [31]:
print(f"\nTest set predictions:\n\n {y_pred}")


Test set predictions:

 [2 1 1 1 2 2 1 1 0 2 0 0 2 2 0 2 1 0 0 0 1 0 1 2 1 1 1 1 1 0 2 2 1 0 2 0 0
 0 0 1 1 0 1 2 1]


In [32]:
y_test

array([2, 1, 2, 1, 2, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0,
       1, 2, 2, 1, 1, 1, 1, 0, 2, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1, 0, 2, 2,
       1], dtype=int64)

In [33]:
#np.bincount?

In [34]:
np.bincount(y_pred)

array([15, 18, 12], dtype=int64)

## Accuracy

In [35]:
accuracy_score(y_test,y_pred)

0.9333333333333333

> **or**

In [36]:
print(knn.score(X_test, y_test))

0.9333333333333333
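
Accuracy is the share of test samples whose predicted label equals the true label. The sketch below recomputes it by hand from the `y_pred` and `y_test` arrays printed above and adds a confusion matrix, which shows where the three mistakes occur (`confusion_matrix` is not used in the original notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Arrays copied from the notebook output above.
y_pred = np.array([2, 1, 1, 1, 2, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0,
                   1, 2, 1, 1, 1, 1, 1, 0, 2, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1, 0, 1, 2,
                   1])
y_test = np.array([2, 1, 2, 1, 2, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0,
                   1, 2, 2, 1, 1, 1, 1, 0, 2, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1, 0, 2, 2,
                   1])

# 42 of the 45 predictions are correct.
manual_accuracy = (y_pred == y_test).mean()
print(manual_accuracy, accuracy_score(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

All three errors are class 2 (virginica) samples predicted as class 1 (versicolor).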


In [2]:
#accuracy_score?

> In the exercise above, how do we know that `n_neighbors=8` is a good choice? Is there a way to find the `best parameter`?

## Hyperparameter Tuning
1. Using `our own code`
2. Using `GridSearchCV`
3. Using `RandomizedSearchCV`

### Method 1: Use your `own code`

In [38]:
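rs = []
for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    # Caution: fitting on the full X, y lets the model see the test rows, so the
    # scores below are optimistic. Fit on X_train, y_train for an honest estimate.
    knn.fit(X, y)
    y_test_pred = knn.predict(X_test)
    rs.append((i, accuracy_score(y_test, y_test_pred)))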

In [39]:
rs[1]

(2, 0.9555555555555556)
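
A variant of this loop that keeps the test data out of the tuning step is sketched below: each candidate `n_neighbors` is scored with cross-validation on the train data only, and the test set is used once at the end. This is an illustration under assumptions (bundled iris data, 5 folds), not the notebook's original method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Mean cross-validated accuracy on the train data for each candidate k.
cv_means = {}
for k in range(1, 20):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_means[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

# Pick the k with the highest mean score (the smallest k wins ties).
best_k = max(cv_means, key=cv_means.get)

# Refit on all train data and evaluate once on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.3f}")
```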

#### Change multiple arguments

In [40]:
scores = [] # empty list
for i in range(1,11):
    for w in ['uniform','distance']:
        for a in ['ball_tree', 'kd_tree', 'brute']:
            knn_clf_obj=KNeighborsClassifier(n_neighbors=i,
                                             weights=w,
                                            algorithm=a)
            knn_clf_obj.fit(X_train,y_train)
            y_test_pred=knn_clf_obj.predict(X_test)
            scores.append((i,w,a,accuracy_score(y_test,y_test_pred)))
scores    

[(1, 'uniform', 'ball_tree', 0.9333333333333333),
 (1, 'uniform', 'kd_tree', 0.9333333333333333),
 (1, 'uniform', 'brute', 0.9333333333333333),
 (1, 'distance', 'ball_tree', 0.9333333333333333),
 (1, 'distance', 'kd_tree', 0.9333333333333333),
 (1, 'distance', 'brute', 0.9333333333333333),
 (2, 'uniform', 'ball_tree', 0.9111111111111111),
 (2, 'uniform', 'kd_tree', 0.9111111111111111),
 (2, 'uniform', 'brute', 0.9111111111111111),
 (2, 'distance', 'ball_tree', 0.9333333333333333),
 (2, 'distance', 'kd_tree', 0.9333333333333333),
 (2, 'distance', 'brute', 0.9333333333333333),
 (3, 'uniform', 'ball_tree', 0.9555555555555556),
 (3, 'uniform', 'kd_tree', 0.9555555555555556),
 (3, 'uniform', 'brute', 0.9555555555555556),
 (3, 'distance', 'ball_tree', 0.9555555555555556),
 (3, 'distance', 'kd_tree', 0.9555555555555556),
 (3, 'distance', 'brute', 0.9555555555555556),
 (4, 'uniform', 'ball_tree', 0.9555555555555556),
 (4, 'uniform', 'kd_tree', 0.9555555555555556),
 (4, 'uniform', 'brute', 0.93
 ... (output truncated)

### Method 2: Use `GridSearchCV`

#### What is CV?

[Refer to the scikit-learn documentation on cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)
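
In short, k-fold cross-validation splits the data into k folds, trains on k-1 of them, scores on the remaining one, and repeats so that every fold serves as validation data once. The sketch below writes the loop out by hand and compares it with `cross_val_score`; the fold count, the `n_neighbors` value and the bundled iris data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5 stratified folds; the fixed random_state makes the folds reproducible.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: train on 4 folds, score on the held-out fold.
fold_scores = []
for train_idx, val_idx in skf.split(X, y):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The same procedure in one call.
shortcut_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=skf)

print("manual:  ", np.round(fold_scores, 3))
print("shortcut:", np.round(shortcut_scores, 3))
print("mean CV accuracy:", np.mean(fold_scores))
```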

In [41]:
from sklearn.model_selection import GridSearchCV

In [42]:
#GridSearchCV?

In [43]:
param_grid = {'n_neighbors': np.arange(2, 10),
              'weights': ['uniform','distance'],
              'algorithm':['ball_tree', 'kd_tree', 'brute']}

In [44]:
knn_cv = GridSearchCV(knn,param_grid)

In [45]:
knn_cv.fit(X, 
           y)



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=19, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid={'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                         'n_neighbors': array([2, 3, 4, 5, 6, 7, 8, 9]),
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [46]:
#knn_cv.cv_results_

In [47]:
knn_cv.best_estimator_

KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [48]:
knn_cv.best_params_

{'algorithm': 'ball_tree', 'n_neighbors': 5, 'weights': 'uniform'}

In [49]:
knn_cv.best_score_

0.9866666666666667

In [50]:
knn_cv.best_index_

6

In [51]:
knn_cv.n_splits_

3
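
The `n_splits_` value of 3 reflects the default of the scikit-learn version used in this notebook; newer releases default to 5 folds, so passing `cv` explicitly keeps results comparable. The sketch below also fits the search on the train data only and tabulates `cv_results_`, which the notebook leaves commented out. It assumes the bundled iris data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

param_grid = {
    "n_neighbors": list(range(2, 10)),
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
}

# cv is set explicitly so the number of folds does not depend on the version.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

# One row per parameter combination, best rank first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "param_weights", "param_algorithm",
     "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())

# refit=True (the default) retrains the best model on all train data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```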

### Method 3: Use `RandomizedSearchCV`

In [52]:
from sklearn.model_selection import RandomizedSearchCV
#RandomizedSearchCV?

In [53]:
knn_cv_rand = RandomizedSearchCV(knn,param_grid,random_state=seed,cv=5)

In [54]:
knn_cv_rand.fit(X,y)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=KNeighborsClassifier(algorithm='auto',
                                                  leaf_size=30,
                                                  metric='minkowski',
                                                  metric_params=None,
                                                  n_jobs=None, n_neighbors=19,
                                                  p=2, weights='uniform'),
                   iid='warn', n_iter=10, n_jobs=None,
                   param_distributions={'algorithm': ['ball_tree', 'kd_tree',
                                                      'brute'],
                                        'n_neighbors': array([2, 3, 4, 5, 6, 7, 8, 9]),
                                        'weights': ['uniform', 'distance']},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring=None, verbose=0)

In [55]:
knn_cv_rand.best_params_

{'weights': 'distance', 'n_neighbors': 7, 'algorithm': 'kd_tree'}

In [56]:
knn_cv_rand.best_score_

0.98
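
`RandomizedSearchCV` evaluates only `n_iter` sampled combinations (10 above) instead of the whole grid, and it also accepts distributions in place of lists. A minimal sketch, assuming the bundled iris data and using `scipy.stats.randint`, which is not part of the original notebook:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

param_distributions = {
    "n_neighbors": randint(2, 10),  # integers 2..9, upper bound excluded
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=10,        # number of sampled parameter combinations
    cv=5,
    random_state=42,  # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)

tried = search.cv_results_["params"]
print(f"{len(tried)} combinations tried, best: {search.best_params_}")
```

The trade-off is coverage for speed: the best combination found is only the best among those sampled, which is why the result above differs from the grid search result.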

## Save The model


In [57]:
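import pickle
# Save the model to disk. Note: `knn` is the last model fitted in the Method 1 loop
# (n_neighbors=19); use knn_cv.best_estimator_ to save the tuned model instead.
filename = 'knn_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(knn, f)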

## Load The model

In [58]:
# Load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(result)

0.9555555555555556


In [59]:
result1 = loaded_model.predict([[5.1,3.5,1.4,0.2]])
print(result1)
result2 = loaded_model.predict([[4.9,2.4,3.3,1]]) 
print(result2)
result3 = loaded_model.predict([[7.7,3.8,6.7,2.2]])
print(result3)

[0]
[1]
[2]
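
To persist the tuned model rather than an arbitrary one, the `best_estimator_` of a search can be pickled in the same way. The sketch below is an assumption-laden illustration: it uses the bundled iris data, a reduced parameter grid, and a temporary directory with the hypothetical file name `knn_best_model.pkl`. Only unpickle files from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Tune on the train data, then take the refitted best model.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(2, 10))}, cv=5)
search.fit(X_train, y_train)
best_model = search.best_estimator_

# Hypothetical location; a temporary directory keeps the example self-contained.
model_path = Path(tempfile.mkdtemp()) / "knn_best_model.pkl"

# Context managers make sure the file handles are closed.
with open(model_path, "wb") as f:
    pickle.dump(best_model, f)

with open(model_path, "rb") as f:
    restored_model = pickle.load(f)

# The restored model should behave exactly like the original.
same_predictions = np.array_equal(
    best_model.predict(X_test), restored_model.predict(X_test)
)
print("restored model gives identical predictions:", same_predictions)
```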


## Create a REST API on the Model
Please read the Flask documentation first: https://flask.palletsprojects.com
1. Open **Anaconda Prompt** > if needed, change the environment with **conda activate dl** > change directory with **cd C:\Users\ramreddymyla\RRITEC_TRAINING_ASSETS\Machine-Learning\Level 04_of_06_Machine_Learning_Algorithms**
1. Run the command **python app_knn_model.py**
1. Test the API using Postman
    * Download and install Postman https://www.postman.com/downloads/
    * Open **Postman**
    * Create a **New Request** as shown below
        ![](https://github.com/rritec/powerbi/blob/master/images/PBI_0142.png?raw=true)
    * Click **Send**
    * Observe the **Result**
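
The contents of `app_knn_model.py` are not shown in this notebook. The following is a hypothetical minimal sketch of what such a Flask app could look like: the `/predict` route, the `features` JSON key and the inline training on the bundled iris data are all assumptions made for illustration; a real app would more likely load `knn_model.pkl`.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: train a small model inline so the sketch is self-contained.
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical endpoint name
def predict():
    # Expected body: {"features": [[5.1, 3.5, 1.4, 0.2], ...]}
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not features:
        return jsonify(error="expected a JSON body with a 'features' list"), 400
    # Convert NumPy integers to plain Python ints for JSON serialization.
    prediction = model.predict(features).tolist()
    return jsonify(prediction=prediction)


# Quick check without starting a server, using Flask's built-in test client.
client = app.test_client()
response = client.post("/predict", json={"features": [[5.1, 3.5, 1.4, 0.2]]})
print(response.status_code, response.get_json())
```

To serve it for Postman, `app.run()` would be added under an `if __name__ == "__main__":` guard, and the request in Postman would be a POST with a raw JSON body in the same shape as above.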
    

## Consume the API in the UI
* The UI team takes care of consuming the API.

## **Homework:**
* Reading Exercise 1: [Read more in the scikit-learn documentation](https://scikit-learn.org/stable/modules/neighbors.html)

* Reading Exercise 2: https://www.kaggle.com/dkim1992/grid-search-vs-random-search

* Once you complete the reading above, answer the questions below.

    1. Which of the following values does the KNN argument **algorithm** not accept?

        <input type="radio" disabled> brute

        <input type="radio" disabled> kd_tree

        <input type="radio" disabled> ball_tree

        <input type="radio" disabled checked> random tree

    2. KNN is a:

        <input type="radio" disabled checked> non-parametric method

        <input type="radio" disabled> parametric method

        [Read about nonparametric statistics](https://en.wikipedia.org/wiki/Nonparametric_statistics#Applications_and_purpose)

        [Read about parametric statistics](https://en.wikipedia.org/wiki/Parametric_statistics)
        
        
    3. What is an estimator?
    4. When should you use brute, kd tree and ball tree?
    5. In KNN, the weights parameter has two values, uniform and distance. What is the difference?
    6. If p=1, which type of distance is used?
    7. If p=2, which type of distance is used?
