# K-Nearest Neighbors

K-Nearest Neighbors (KNN) is an algorithm for classification and regression problems. It stores all the available data and classifies a new data point based on its similarity to the stored points. This means that when new data appears, the K-NN algorithm can assign it to the best-suited category.
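
To make the idea concrete, here is a minimal sketch of the classification step, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are hypothetical and are not part of the flags dataset or of scikit-learn.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label appears
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction_near_a = knn_predict(points, labels, (1.1, 1.0), k=3)
prediction_near_b = knn_predict(points, labels, (5.1, 5.0), k=3)
print(prediction_near_a, prediction_near_b)
```

The notebook itself relies on scikit-learn's `KNeighborsClassifier`, which does the same kind of neighbor lookup with more efficient data structures.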

Introduction

We have a dataset describing national flags, which we use to predict the religion of each flag's country. We also have images of different flags that look similar to the flags in the dataset. The K-Nearest Neighbors model finds the dataset flags whose features are most similar to a new flag's description and, based on those similar features, assigns the new flag to a religion: Christian, Muslim, Buddhist, Hindu, or Other.

Steps to implement the K-NN algorithm

1) Data pre-processing
2) Fitting the K-NN algorithm to the training set
3) Predicting the test result
4) Testing the accuracy of the result
5) Visualizing the test set result

In [None]:
#Imports
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import os

In [None]:
#Loading Dataset
flag = pd.read_csv("flags_with_headers_v_5.csv")
flag.head()

In [None]:
flag.info()

In [None]:
flag = pd.get_dummies(flag)
flag.head()

In [None]:
#Drop 'religion' and the unnecessary 'Unnamed: 0' column from the input set.
#Religion becomes the y value (dependent variable).
X = flag.drop(columns=['religion', 'Unnamed: 0'])
y = flag.religion

# Scaling the Data

In [None]:
#Split the training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=35)

In [None]:
#Create a StandardScaler model and fit it to the training data
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)

In [None]:
#Transform the training and testing data using the X_scaler model
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
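
Scaling matters for K-NN because the algorithm is distance based: a feature measured on a large scale can outweigh one measured on a small scale. The sketch below illustrates this with two made-up features and a hand-written standardization that uses the training mean and population standard deviation, which is the same formula `StandardScaler` applies. All values and helper names here are hypothetical.

```python
import math
import statistics


def standardize(column_values, value):
    """Scale `value` using the mean and population std of `column_values`."""
    mean = statistics.fmean(column_values)
    std = statistics.pstdev(column_values)
    return (value - mean) / std


def nearest_index(points, query):
    """Return the index of the point closest to `query` (Euclidean distance)."""
    distances = [math.dist(point, query) for point in points]
    return distances.index(min(distances))


# Hypothetical training rows: (large-scale feature, small-scale feature)
train = [(1000.0, 1.0), (1010.0, 9.0), (990.0, 5.0), (1020.0, 5.0)]
query = (1009.0, 1.0)

# Statistics come from the training data only, then apply to the query too
columns = list(zip(*train))
train_scaled = [
    tuple(standardize(col, v) for col, v in zip(columns, row)) for row in train
]
query_scaled = tuple(standardize(col, v) for col, v in zip(columns, query))

raw_nearest = nearest_index(train, query)
scaled_nearest = nearest_index(train_scaled, query_scaled)
print(f"nearest before scaling: row {raw_nearest}, after scaling: row {scaled_nearest}")
```

In this toy case the nearest neighbor changes once both features are on a comparable scale, which is why the scaler is fit on `X_train` and then reused unchanged for `X_test`.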

# Testing K-Nearest Neighbors

In [None]:
#Loop through different k values to see which has the highest accuracy
train_scores = []
test_scores = []
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    train_score = knn.score(X_train_scaled, y_train)
    test_score = knn.score(X_test_scaled, y_test)
    train_scores.append(train_score)
    test_scores.append(test_score)
    print(f"k: {k}, Train/Test Score: {train_score:.3f}/{test_score:.3f}")
    
    
#Plot train and test accuracy for each k
k_values = range(1, 20, 2)
plt.plot(k_values, train_scores, marker='o', label="Train")
plt.plot(k_values, test_scores, marker="x", label="Test")
plt.xlabel("k neighbors")
plt.ylabel("Accuracy score")
plt.legend()
plt.show()

In [None]:
#Note that k: 9 provides the best accuracy where the classifier starts to stabilize
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_scaled, y_train)
print('K=9 Test Acc: %.3f' % knn.score(X_test_scaled, y_test))

In [None]:
#Calculate classification report
from sklearn.metrics import classification_report
predictions = knn.predict(X_test_scaled)
print(classification_report(y_test, predictions,
                            target_names=['Christian', 'Muslim', 'Buddhist', 'Hindu', 'Other']))

# Classification Report Interpretation

Precision is the proportion of positive identifications that were actually correct. A model that produces no false positives has a precision of 1.0. For the Christian class, the K-Nearest Neighbors precision is .63, which is moderate and indicates that there were some false positives in our classification.

Recall is the proportion of actual positives that were correctly classified. A model that produces no false negatives has a recall of 1.0. For the Christian class, the K-Nearest Neighbors recall is .96, which means the model identified nearly all of the actual positives, close to perfect.

The F1 score is the harmonic mean of precision and recall. A perfect model has an F1 score of 1.0. For the Christian class, the K-Nearest Neighbors F1 score is .76, which is fairly high.
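
The three metrics above can be computed directly from counts of true positives, false positives, and false negatives. The sketch below uses illustrative counts that were chosen only so the results round to the reported .63 precision and .96 recall for the Christian class; the actual confusion counts are not shown in this notebook, so treat them as an assumption.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f1(p, r):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Assumed counts for illustration only (not taken from the notebook output)
tp, fp, fn = 26, 15, 1

p = precision(tp, fp)
r = recall(tp, fn)
f1_value = f1(p, r)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_value:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1 score sits closer to the precision than to the recall.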

Support is the number of samples each metric was calculated on. In the K-Nearest Neighbors results, the accuracy metric was calculated on 49 samples, whereas each religion was calculated on its own samples: Christian on 27, Muslim on 9, Buddhist on 3, Hindu on 1, and Other on 9. Added together, these make up the 49 samples used for the accuracy metric.

Accuracy is the proportion of all predictions that were correct. Perfect accuracy is 1.0. For the K-Nearest Neighbors model, our accuracy is .63, which indicates that the model classifies a little under two-thirds of the test data correctly.

The macro average is the unweighted mean of the precision, recall, and F1 score across classes. It treats every class equally regardless of any imbalance in sample sizes, which makes it a useful measure when the classes vary in size.

The weighted average is also a mean of the precision, recall, and F1 score across classes, but each class's metric is weighted by its sample size (support). For example, it will give a high number when one religion outperforms the others and also has more samples.
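
The difference between the two averages is easiest to see with numbers. In the sketch below, the supports are the ones listed above and the Christian recall of .96 comes from the report; the other four recall values are made up for illustration and should not be read as the notebook's actual output.

```python
# Supports from the classification report
supports = {"Christian": 27, "Muslim": 9, "Buddhist": 3, "Hindu": 1, "Other": 9}

# Per-class recall: only the Christian value is reported; the rest are hypothetical
recalls = {"Christian": 0.96, "Muslim": 0.33, "Buddhist": 0.0, "Hindu": 0.0, "Other": 0.22}

# Macro average: every class counts equally
macro_avg = sum(recalls.values()) / len(recalls)

# Weighted average: each class counts in proportion to its support
total_support = sum(supports.values())
weighted_avg = sum(recalls[name] * supports[name] for name in recalls) / total_support

print(f"macro={macro_avg:.2f} weighted={weighted_avg:.2f}")
```

With one large, well-classified class and several small, poorly classified ones, the weighted average ends up far above the macro average.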

# Prediction Example

In [None]:
knn.predict(X_test_scaled)

In [None]:
y_test.values
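
The two arrays above are easier to read when compared element by element. A minimal sketch of that comparison, assuming integer-encoded labels and using short made-up lists in place of the real `knn.predict(X_test_scaled)` and `y_test.values` output:

```python
# Hypothetical stand-ins for the predicted and actual label arrays
predicted = [0, 0, 1, 0, 4, 0]
actual = [0, 1, 1, 0, 4, 2]

# Fraction of positions where the prediction matches the true label
matches = [p == a for p, a in zip(predicted, actual)]
accuracy = sum(matches) / len(matches)

# (position, predicted, actual) for every misclassified sample
mismatches = [
    (i, p, a) for i, (p, a) in enumerate(zip(predicted, actual)) if p != a
]

print(f"accuracy={accuracy:.3f}")
print(mismatches)
```

The same accuracy figure is what `knn.score(X_test_scaled, y_test)` reports for the real data.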

# Analysis

Per the KNN model, we have an accuracy of 63% in using flag colors, shapes, images, and text to predict the country's religion. I chose K = 9 because, per the graph, that is where the score starts to stabilize, and I chose an odd number to reduce the chance of ties between classes. Per the classification report, classifying the religion Christian was reliable: precision (the percentage of Christian predictions that were correct) is 63%, recall (the percentage of actual Christian cases classified correctly) is 96%, and the F1 score (the harmonic mean of precision and recall) is 76%. However, that has to do with the support (sample size) of 27 for Christian versus single digits for the other religions.


In conclusion, because the support for the religion classes is imbalanced, the macro average best reflects our overall results. Although the KNN model has an accuracy of 63% in classifying religion, our imbalanced data set leaves the macro-averaged precision at 27%, recall at 30%, and F1 score at 28% on a sample size of 49. To improve the overall result, we will need a larger data set or to narrow the religions to 3-4 categories. The percentages are fairly low overall; however, I can conclude that the KNN model can classify countries with Christianity as their religion effectively based on the attributes of the country’s flag.