## K-Nearest Neighbor Classification

### What is actually going on?  

In [None]:
# Data with Different Scales: Normalization
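def min_max_normalize(lst):
    """Rescale every value in lst to the range [0, 1]."""
    minimum = min(lst)
    maximum = max(lst)
    value_range = maximum - minimum
    # Guard against division by zero when every value is identical
    if value_range == 0:
        return [0.0 for _ in lst]
    normalized = []
    for nb in lst:
        normalized.append((nb - minimum) / value_range)
    return normalized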

# Finding the Nearest Neighbors
def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance

# Classify the new point based on those neighbors
def classify(unknown, dataset, labels, k):
  distances = []
  # Looping through all points in the dataset
  for title in dataset:  # title = str
    movie = dataset[title]  # movie = list of numbers
    distance_to_point = distance(movie, unknown)
    # Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  # Taking only the k closest points
  neighbors = distances[0:k]
  num_good = 0
  num_bad = 0
  for neighbor in neighbors:  # neighbor = [distance, title]
    title = neighbor[1]
    if labels[title] == 1:
      num_good += 1
    else:
      num_bad += 1
  # Vote only after every neighbor has been counted; ties go to class 0
  if num_good > num_bad:
    return 1
  else:
    return 0
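
A small worked example may help show why normalization matters before the neighbors are counted. The sketch below re-implements the distance and majority-vote steps compactly and classifies the same point twice, once on raw features and once on min-max scaled features. The movie titles, budgets, runtimes, labels, and the `scale` helper are all invented for illustration and are not part of the original dataset.

```python
def distance(point1, point2):
    """Euclidean distance between two equal-length lists of numbers."""
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def classify(unknown, dataset, labels, k):
    """Majority vote (1 = good, 0 = bad) among the k nearest titles."""
    by_distance = sorted(dataset, key=lambda title: distance(dataset[title], unknown))
    neighbors = by_distance[:k]
    num_good = sum(1 for title in neighbors if labels[title] == 1)
    num_bad = len(neighbors) - num_good
    return 1 if num_good > num_bad else 0


# Hypothetical movies: title -> [budget in dollars, runtime in minutes]
movies = {
    "A": [10_000_000, 90],
    "B": [12_000_000, 95],
    "C": [20_000_000, 150],
    "D": [22_000_000, 155],
    "E": [40_000_000, 160],
}
labels = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1}
unknown = [14_000_000, 150]

# Per-feature minimum and maximum, taken from the known movies only
columns = list(zip(*movies.values()))
minimums = [min(col) for col in columns]
maximums = [max(col) for col in columns]


def scale(point):
    """Min-max scale one point with the minimums and maximums computed above."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(point, minimums, maximums)]


scaled_movies = {title: scale(point) for title, point in movies.items()}

raw_guess = classify(unknown, movies, labels, 3)
scaled_guess = classify(scale(unknown), scaled_movies, labels, 3)
print(raw_guess, scaled_guess)
```

On the raw features the budget column dwarfs the runtime column, so the three nearest movies are chosen almost entirely by budget. After scaling, both features contribute comparably and the vote changes.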


### By using sklearn

In [2]:
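# Import by module name: the ".py" suffix is not allowed in an import statement
# and raises the ModuleNotFoundError recorded in this cell's output.
# Assumption: movie_dataset.py and labels.py sit next to this notebook and each
# defines a variable named after its module.
from movie_dataset import movie_dataset
from labels import labels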

from sklearn.neighbors import KNeighborsClassifier

# Create a KNeighborsClassifier named classifier that uses k=5
classifier = KNeighborsClassifier(n_neighbors = 5)

# Train your classifier using movie_dataset as the training points and labels as the training labels
classifier.fit(movie_dataset, labels)

# Classify some movies
unknown_movies = [
  [.45, .2, .5], 
  [.25, .8, .9],
  [.1, .1, .9]
]
guesses = classifier.predict(unknown_movies)
print(guesses)

ModuleNotFoundError: No module named 'movie_dataset.py'; 'movie_dataset' is not a package

- Data with n features can be conceptualized as points lying in n-dimensional space.  


- Data points can be compared by using the distance formula. Data points that are similar will have a smaller distance between them.  


- A point with an unknown class can be classified by finding the k nearest neighbors.  


- To verify the effectiveness of a classifier, data with known classes can be split into a training set and a validation set. Validation error can then be calculated.  


- Classifiers have parameters that can be tuned to increase their effectiveness. In the case of K-Nearest Neighbors, k can be changed.  


- A classifier can be trained improperly and suffer from overfitting or underfitting. In the case of K-Nearest Neighbors, a low k often leads to overfitting and a large k often leads to underfitting.  


- Python’s sklearn library can be used for many classification and machine learning algorithms.  
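
The points about validation and tuning k can be sketched without any library. The example below is a minimal illustration, assuming two well-separated hypothetical clusters with binary labels; the data and the helper names `knn_classify` and `validation_accuracy` are made up for this sketch.

```python
def distance(p, q):
    """Euclidean distance between two equal-length lists of numbers."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def knn_classify(unknown, points, labels, k):
    """Majority vote among the k training points closest to unknown (labels are 0 or 1)."""
    order = sorted(range(len(points)), key=lambda i: distance(points[i], unknown))
    votes = [labels[i] for i in order[:k]]
    ones = sum(votes)
    return 1 if ones > len(votes) - ones else 0


def validation_accuracy(train_points, train_labels, val_points, val_labels, k):
    """Fraction of validation points whose predicted label matches the known label."""
    correct = sum(
        1
        for point, label in zip(val_points, val_labels)
        if knn_classify(point, train_points, train_labels, k) == label
    )
    return correct / len(val_points)


# Hypothetical training set: three points of class 0, five points of class 1
train_points = [
    [0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
    [0.9, 0.8], [0.8, 0.9], [0.85, 0.85], [0.95, 0.9], [0.9, 0.95],
]
train_labels = [0, 0, 0, 1, 1, 1, 1, 1]

# Hypothetical validation set with known classes
val_points = [[0.1, 0.1], [0.2, 0.2], [0.9, 0.9], [0.8, 0.8]]
val_labels = [0, 0, 1, 1]

accuracy_by_k = {
    k: validation_accuracy(train_points, train_labels, val_points, val_labels, k)
    for k in (1, 3, 5, 7)
}
for k, accuracy in accuracy_by_k.items():
    print(k, accuracy)
```

With k = 7 nearly the whole training set votes, so the larger class wins for every validation point and accuracy drops. This illustrates only the large-k (underfitting) end of the trade-off; overfitting at small k needs noisier data than this toy set contains.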

## K-Nearest Neighbor Regressor

### What is actually going on? (K-Nearest Neighbor Regressor with **Weighted Regression**)

In [None]:
def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance

def predict(unknown, dataset, movie_ratings, k):
  distances = []
  #Looping through all points in the dataset
  for title in dataset:
    movie = dataset[title]
    distance_to_point = distance(movie, unknown)
    #Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  #Taking only the k closest points
  neighbors = distances[0:k]
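  numerator = 0
  denominator = 0
  for neighbor in neighbors:  # each neighbor is a [distance, title] pair
    rating = movie_ratings[neighbor[1]]
    distance_to_neighbor = neighbor[0]
    # Simplification: an exact match has distance 0, so return its rating
    # directly rather than dividing by zero
    if distance_to_neighbor == 0:
      return rating
    numerator += rating / distance_to_neighbor
    denominator += 1 / distance_to_neighbor
  return numerator / denominator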

### By using sklearn

In [3]:
from movies import movie_dataset, movie_ratings
from sklearn.neighbors import KNeighborsRegressor

# If weights equals "uniform", all neighbors will be considered equally in the average. 
# If weights equals "distance", then a weighted average is used.
regressor = KNeighborsRegressor(n_neighbors = 5, weights = "distance")

# Train the regressor using movie_dataset as the training points and movie_ratings as the training values
regressor.fit(movie_dataset, movie_ratings)

# Predict ratings for some unseen movies
unknown_movies = [
    [0.016, 0.300, 1.022],
    [0.0004092981, 0.283, 1.0112],
    [0.00687649, 0.235, 1.0112]
]
guesses = regressor.predict(unknown_movies)
print(guesses)

ModuleNotFoundError: No module named 'movies'

- The K-Nearest Neighbor algorithm can be used for regression. Rather than returning a classification, it returns a number.  


- By using a weighted average, data points that are extremely similar to the input point will have more of a say in the final result.  


- scikit-learn has an implementation of a K-Nearest Neighbor regressor named KNeighborsRegressor.
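
To make the weighted average concrete, here is a short arithmetic sketch. The three neighbors, their distances, and their ratings are hypothetical values chosen for illustration; each rating is weighted by the inverse of its distance, as in the hand-written `predict` function.

```python
# Hypothetical neighbors as [distance, rating] pairs
neighbors = [[3.2, 5.0], [11.5, 6.8], [1.1, 9.0]]

# Uniform average: every neighbor counts equally
uniform = sum(rating for _, rating in neighbors) / len(neighbors)

# Distance-weighted average: each rating is weighted by 1 / distance
numerator = sum(rating / dist for dist, rating in neighbors)
denominator = sum(1 / dist for dist, _ in neighbors)
weighted = numerator / denominator

print(round(uniform, 2), round(weighted, 2))
```

The closest neighbor (distance 1.1, rating 9.0) pulls the weighted result to roughly 7.9, above the uniform average of about 6.93.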

# Tutorial

#### Explore the data

In [6]:
from sklearn.datasets import load_breast_cancer
breast_cancer_data = load_breast_cancer()

print(breast_cancer_data.data[0])
print(breast_cancer_data.feature_names)

print(breast_cancer_data.target)
print(breast_cancer_data.target_names)

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1

#### Splitting the data into Training and Validation Sets

In [15]:
from sklearn.model_selection import train_test_split
# train_test_split returns four values in the following order:
#   1. the training set
#   2. the validation set
#   3. the training labels
#   4. the validation labels
training_data, validation_set, training_labels, validation_labels = train_test_split(
    breast_cancer_data.data, breast_cancer_data.target,
    test_size = 0.2, random_state = 100)

# Check that the training data and the training labels have the same length
print(len(training_data), len(training_labels))

455 455
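
The return order and the point-to-label pairing can be checked on a tiny input. The sketch below assumes scikit-learn is installed and uses ten made-up points whose label can be recomputed from the point itself, so any misalignment after the shuffle would be visible.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten points, labeled by whether the first feature is odd
points = [[i, i * 10] for i in range(10)]
labels = [i % 2 for i in range(10)]

train_points, val_points, train_labels, val_labels = train_test_split(
    points, labels, test_size=0.2, random_state=100
)

# With test_size=0.2, two of the ten points are held out for validation
print(len(train_points), len(val_points))

# Every point should still be paired with its own label after the split
all_points = list(train_points) + list(val_points)
all_labels = list(train_labels) + list(val_labels)
aligned = all(point[0] % 2 == label for point, label in zip(all_points, all_labels))
print(aligned)
```

The same 80/20 proportion explains the 455 above: the breast cancer dataset has 569 samples, and holding out 20% leaves 455 for training.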


#### Running the classifier

In [16]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 3)
classifier.fit(training_data, training_labels)
print(classifier.score(validation_set, validation_labels))

for k in range(1,101):
    classifier = KNeighborsClassifier(n_neighbors = k)
    classifier.fit(training_data, training_labels)
    print("for ", k, ": ", classifier.score(validation_set, validation_labels))

0.9473684210526315
for  1 :  0.9298245614035088
for  2 :  0.9385964912280702
for  3 :  0.9473684210526315
for  4 :  0.9473684210526315
for  5 :  0.9473684210526315
for  6 :  0.9473684210526315
for  7 :  0.9473684210526315
for  8 :  0.9473684210526315
for  9 :  0.956140350877193
for  10 :  0.956140350877193
for  11 :  0.956140350877193
for  12 :  0.956140350877193
for  13 :  0.956140350877193
for  14 :  0.956140350877193
for  15 :  0.956140350877193
for  16 :  0.956140350877193
for  17 :  0.956140350877193
for  18 :  0.956140350877193
for  19 :  0.956140350877193
for  20 :  0.956140350877193
for  21 :  0.956140350877193
for  22 :  0.956140350877193
for  23 :  0.9649122807017544
for  24 :  0.9649122807017544
for  25 :  0.956140350877193
for  26 :  0.956140350877193
for  27 :  0.956140350877193
for  28 :  0.956140350877193
for  29 :  0.9473684210526315
for  30 :  0.9473684210526315
for  31 :  0.9473684210526315
for  32 :  0.9473684210526315
for  33 :  0.9473684210526315
for  34 :  0.94736
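
Scanning a long printout for the best k is error-prone, so the scores can be collected and searched instead. The sketch below is a minimal illustration that uses only a hand-copied excerpt of the accuracies printed above; the listing is cut off after k = 34, so the result says nothing about larger k.

```python
# Excerpt of the validation accuracies printed above (k -> accuracy)
scores = {
    1: 0.9298245614035088,
    2: 0.9385964912280702,
    3: 0.9473684210526315,
    9: 0.956140350877193,
    23: 0.9649122807017544,
    24: 0.9649122807017544,
    25: 0.956140350877193,
    29: 0.9473684210526315,
}

best_accuracy = max(scores.values())
# Several values of k can tie for the best score, so collect all of them
best_ks = [k for k, accuracy in scores.items() if accuracy == best_accuracy]
print(best_ks, best_accuracy)
```

In the notebook itself, the same search could run over the scores gathered inside the loop rather than over copied values.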

#### Graphing the results
This repeats the loop from the previous cell, but plots the validation accuracy against k instead of printing it.

In [18]:
%matplotlib notebook
import matplotlib.pyplot as plt
k_list = range(1, 101)
accuracies = []
for k in k_list:
  classifier = KNeighborsClassifier(n_neighbors = k)
  classifier.fit(training_data, training_labels)
  accuracies.append(classifier.score(validation_set, validation_labels))

plt.plot(k_list, accuracies)
plt.xlabel("k")
plt.ylabel("Validation Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()

[Plot: "Breast Cancer Classifier Accuracy", with k on the x-axis and Validation Accuracy on the y-axis]