# 03/11/22 k-Nearest-Neighbors (k-NN)

k-NN is a supervised machine learning model. In supervised learning, a model learns from data that is already labeled: it takes a set of input objects together with their output values and trains on them to learn a mapping from inputs to outputs, so that it can make predictions on unseen data.
A k-NN model classifies a data point by looking at its ‘k’ closest labeled data points and assigning it the label held by the majority of them.
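
The majority-vote rule described above can be sketched in a few lines of plain Python. This is only an illustration, not how Scikit-learn implements it: the function name `knn_predict` and the toy points are made up for this example, Euclidean distance is assumed, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points (sorted by distance only).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those k points.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small, well-separated clusters.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.2, 1.4), k=3))
print(knn_predict(train_points, train_labels, (6.8, 7.0), k=3))
```

The first query sits inside the cluster labeled 0 and the second inside the cluster labeled 1, so the votes are unanimous in both cases.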

Scikit-learn is a machine learning library for Python. In this tutorial, we will build a k-NN model using Scikit-learn to predict whether or not a patient has diabetes.

Question: As k increases, what happens to the score/accuracy on the training data?
What happens to the score/accuracy on the test data?

What to do if the number of features is high? Use PCA (principal component analysis) to reduce the dimension.
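
A minimal sketch of that idea, using Scikit-learn's `PCA` on synthetic data. The data here is an assumption made up for illustration (10 features generated from 2 hidden factors plus a little noise), and the choice of 2 components is arbitrary. Because PCA is sensitive to feature scale, the features are standardized first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 10 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_high = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Standardize, then project onto the first 2 principal components.
X_scaled = StandardScaler().fit_transform(X_high)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # reduced from 10 columns to 2
print(pca.explained_variance_ratio_.sum())  # share of variance kept
```

The reduced matrix `X_reduced` could then be passed to a k-NN classifier in place of the original features.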

Distance definitions:
1. Euclidean distance between two points $x_n$ and $x_m$ with $p$ features,
$d=\sqrt{\sum_{i=1}^{p}(x_{ni}-x_{mi})^2}$

2. Manhattan distance, calculated as the sum of the absolute differences between the two vectors,
$d=\sum_{i=1}^{p}|x_{ni}-x_{mi}|$

3. Correlation.

4. Cosine similarity.
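
The four measures listed above can be written out in plain Python as a rough sketch. The function names and the example vectors are chosen here for illustration. Note that correlation and cosine similarity are similarities rather than distances; a common convention is to use one minus the similarity as the distance.

```python
import math

def euclidean(a, b):
    # Square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    # Cosine similarity of the mean-centered vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # v is exactly 2 * u

print(euclidean(u, v))
print(manhattan(u, v))
print(cosine_similarity(u, v))
print(pearson_correlation(u, v))
```

Because `v` is a scaled copy of `u`, the Euclidean and Manhattan distances are nonzero while the cosine similarity and correlation both come out at 1 (up to floating-point rounding), which shows how the choice of measure changes what "close" means.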



# **Import package**

In [3]:

# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder # for encoding categorical variables as integers
from sklearn.metrics import classification_report, confusion_matrix # metrics for evaluating predictions


# **Read in data**

In [None]:

#read in the data using pandas
filein="https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
df = pd.read_csv(filein)
#check data has been read in properly
df.head()

# **Split up into target and predictors**

In [None]:
df.groupby('Outcome').describe() #500 of 0, 268 of 1

In [None]:
#create a dataframe with all training data except the target column
X = df.drop(columns=['Outcome'])
#check that the target variable has been removed
X.head()

#separate target values
y = df['Outcome'].values
#view target values
y[0:5]

In [10]:

#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

## **KNN with K=3**

In [11]:
# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit the classifier to the data
knn.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=3)

In [None]:
#show first 5 model predictions on the test data
knn.predict(X_test)[0:5]

In [None]:
#check accuracy of our model on the test data
knn.score(X_test, y_test)

# **Program to choose the best K**

In [14]:

#Setup arrays to store training and test accuracies for k = 1..24
neighbors_K = np.arange(1,25)
train_accuracy = np.empty(len(neighbors_K))
test_accuracy = np.empty(len(neighbors_K))

for i,k in enumerate(neighbors_K):
    #Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    
    #Fit the model
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    
    #Compute accuracy on the test set
    test_accuracy[i] = knn.score(X_test, y_test) 

In [None]:
#Generate plot
plt.title('k-NN Varying number of neighbors')
plt.plot(neighbors_K, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors_K, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

In [17]:
test_accuracy=test_accuracy.tolist()
train_accuracy=train_accuracy.tolist()

In [None]:

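#pick the k with the highest test accuracy; neighbors_K starts at 1,
#so the position of the best score must be mapped back to the actual k value
k = int(neighbors_K[np.argmax(test_accuracy)])
#Setup a knn classifier with k neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train,y_train)

test_score = knn.score(X_test,y_test)

y_pred = knn.predict(X_test)
cm=confusion_matrix(y_test,y_pred)
k,cm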

In [None]:
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [None]:
plt.figure(figsize=(6,8))
sns.heatmap(data=cm,linewidths=.5, annot=True,square = True,  cmap = 'Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Neighbor K=' +str(k)+ ', with Accuracy Score: {0}'.format(round(knn.score(X_test,y_test),3))
plt.title(all_sample_title, size = 15)
plt.show()

In [None]:
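#predict on the full dataset (training and test rows together); the resulting
#scores are optimistic because the model was fit on most of these rows
y_pred = knn.predict(X)
cm=confusion_matrix(y,y_pred)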
plt.figure(figsize=(6,8))
sns.heatmap(data=cm,linewidths=.5, annot=True,square = True,  cmap = 'Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Neighbor K=' +str(k)+ ', with Accuracy Score: {0}'.format(round(knn.score(X,y),3))
plt.title(all_sample_title, size = 15)
plt.show()

In [None]:
pd.DataFrame(y_pred).sum()

In [None]:
cm

In [None]:
#classification_report was imported at the top; y_pred currently holds
#predictions for all of X, so predict on the test set again before scoring
print(classification_report(y_test, knn.predict(X_test)))

# **k-Fold Cross-Validation: Cross validation to find the best model**

Cross-validation splits the dataset randomly into ‘k’ groups (folds); this ‘k’ is unrelated to the number of neighbors in k-NN. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. The process is then repeated until each group has been used as the test set once.
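
The splitting procedure just described can be sketched in plain Python. The helper name `k_fold_indices` is hypothetical and this is not how Scikit-learn builds its folds; it only illustrates that every sample lands in the test set exactly once.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing spreads the shuffled samples over folds
    # whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), "held out;", len(train_idx), "used for training")
```

With 10 samples and 5 folds, each round holds out 2 samples and trains on the remaining 8.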


In [None]:

knn_cv = KNeighborsClassifier(n_neighbors=4)
#train and score the model with 10-fold cross-validation
cv_scores = cross_val_score(knn_cv, X, y, cv=10)
#print each cv score (accuracy) and average them
print(cv_scores)
print('cv_scores mean:{}'.format(np.mean(cv_scores)))

# **Hypertuning model parameters using GridSearchCV**

GridSearchCV works by training our model multiple times over a range of parameter values that we specify, scoring each one with cross-validation. That way, we can compare the candidate values and pick the one that gives the best accuracy.
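
Conceptually, the search amounts to a loop like the one below. This is a simplified sketch rather than Scikit-learn's actual implementation; the synthetic data from `make_classification` and the short list of candidate values are assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Average accuracy over 5 cross-validation folds for this value of k.
    mean_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# The candidate with the highest mean cross-validated accuracy wins.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

Beyond this loop, `GridSearchCV` by default also refits the best model on the full data passed to `fit`, so the fitted search object can be used directly for prediction.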


In [None]:
#import GridSearchCV
from sklearn.model_selection import GridSearchCV
#In case of classifier like knn the parameter to be tuned is n_neighbors
param_grid = {'n_neighbors':np.arange(1,50)}

knn = KNeighborsClassifier()
knn_cv= GridSearchCV(knn,param_grid,cv=5)
knn_cv.fit(X,y)
print("Best Parameter and best Score are ", [knn_cv.best_params_, knn_cv.best_score_])

# **HW, For the fruit data located at**

filein="https://raw.githubusercontent.com/susanli2016/Machine-Learning-with-Python/master/fruit_data_with_colors.txt"

Run the models KNN, LogisticRegression, decision tree, and SVC, and compare them.

In [None]:
filein="https://raw.githubusercontent.com/susanli2016/Machine-Learning-with-Python/master/fruit_data_with_colors.txt"

##or this location
#filein="https://storage.googleapis.com/kagglesdsdata/datasets/9590/13660/fruit_data_with_colors.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20220208%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220208T204559Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=99552a3b8e675b58c3b303da71b49ee2e5f4a498cc443ee2372308ca4aedfd7c20b886b02a1186e1997d9006e8b78308dd1738de75a7447f585878b63731667cbd166b255327ce2989faf85fa7ba9c70f9fdfd5923dafb3d98ba42db3a6e609339abcd64d00ea81df55a28d873aafad730a5efeedca69e536f01ef71c617efd602843d5cb4eb81604e31218350ed4726633c9081f1333f12944c6ee256015b506fb89d461c6c02c3f446296f6918ac1d8196d9b184402c485293839799ddfcb14d954af8e49e6d6b6ebb7b4d8eb7c042dc40313939ad9556e6bfac39871bfda8f27b30697e3380462640f7577954eae19f1523e0240a7cdc629015363eb6fa1c"
df=pd.read_table(filein)
df.head()

In [None]:

feature_names = ['mass', 'width', 'height', 'color_score']
X = df[feature_names]
y = df['fruit_label']

In [None]:
X.describe()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_sca=scaler.transform(X)


In [None]:
pd.DataFrame(X_sca).describe()

In [None]:

#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X_sca, y, test_size=0.2, random_state=1, stratify=y)
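
The cells above fit the `StandardScaler` on all of `X` before splitting, so the test rows contribute to the scaling statistics. One common way to avoid this is to split first and wrap the scaler and the classifier in a pipeline, so that the scaler is fit on the training rows only. The sketch below shows the pattern on synthetic data; the data and the variable names `X_demo`, `X_tr`, and so on are assumptions for illustration, not the fruit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 4 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# fit() learns the scaling from X_tr only; score() reuses it on X_te.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, in which case the scaler is refit inside every fold.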

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit the classifier to the data
knn.fit(X_train,y_train)
print('Accuracy of KNN on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of KNN on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

In [None]:
from sklearn.svm import SVC 
from sklearn.svm import LinearSVC

In [None]:
svclassifier = SVC(kernel='linear') #Linear
svclassifier.fit(X_train, y_train)
print('Accuracy of SVC classifier on training set: {:.2f}'
     .format(svclassifier.score(X_train, y_train)))
print('Accuracy of SVC classifier on test set: {:.2f}'
     .format(svclassifier.score(X_test, y_test)))
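
For the comparison itself, a single train/test split on a small dataset can be noisy. One possible approach, sketched here on synthetic three-class data rather than the fruit data, is to score every model with the same cross-validation folds in a loop. The dataset, the dictionary of models, and the hyperparameters are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 3 classes, 4 features.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic regression": LogisticRegression(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVC (linear)": SVC(kernel="linear"),
}

results = {}
for name, estimator in models.items():
    # Scale inside the pipeline so each fold is standardized on its own training part.
    pipeline = make_pipeline(StandardScaler(), estimator)
    results[name] = cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()

# Print the models from best to worst mean accuracy.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Swapping `X_demo, y_demo` for the fruit features and labels would give a like-for-like comparison of the four classifiers.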