# K-Nearest Neighbor

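Given a query point, the K-Nearest Neighbor (KNN) algorithm finds the k closest training points and bases its prediction on them. For classification, the prediction is the most common label among those neighbors.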
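
To make the idea concrete, here is a minimal from-scratch sketch of a KNN classifier. The function name `knn_predict` and the toy points are hypothetical, chosen only for illustration; the rest of this tutorial uses scikit-learn's implementation instead.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)) ** 0.5, label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two "small" points and three "large" points
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5), (5.5, 6.0)]
labels = ["small", "small", "large", "large", "large"]

prediction = knn_predict(points, labels, query=(5.2, 5.1), k=3)
print(prediction)  # large
```

With k=3, the query (5.2, 5.1) has all three "large" points as its nearest neighbors, so the vote is unanimous.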

## Using the K-Nearest Neighbor Algorithm

## 1. Imports:

1. pandas: to use and manipulate DataFrames
2. KNeighborsClassifier: the KNN algorithm
3. seaborn: to load the dataset

In [1]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from seaborn import load_dataset

## 2. Load Dataset and view first few rows

In [2]:
df = load_dataset('penguins')

print(df.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female  


## 3. Drop Missing Values and Select Features

In [3]:
df = df.dropna()

X = df[['bill_length_mm']]
y = df['species']

## 4. Split data into Training and Testing Data

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)
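
When no size is given, `train_test_split` holds out 25% of the rows for testing, and a fixed `random_state` makes the split reproducible. The sketch below uses placeholder data rather than the penguins; the row count of 333 is an assumption about how many rows remain after `dropna()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the features and labels (assumed 333 rows)
X_demo = np.arange(333).reshape(-1, 1)
y_demo = np.arange(333) % 3

# Default test_size is 0.25; random_state fixes the shuffle
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, random_state=100)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, random_state=100)

print(len(a_train), len(a_test))  # 249 84
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```

Passing `stratify=y_demo` is an option when each class should keep roughly the same proportion in both sets.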

## Understanding KNeighborsClassifier in Sklearn

In [5]:
KNeighborsClassifier(
    n_neighbors=5,          # The number of neighbours to consider
    weights='uniform',      # How neighbours' votes are weighted ('uniform' or 'distance')
    algorithm='auto',       # Algorithm to compute the neighbours
    leaf_size=30,           # Leaf size for the tree-based search algorithms
    p=2,                    # Minkowski power parameter: 1 = Manhattan, 2 = Euclidean
    metric='minkowski',     # The type of distance to use
    metric_params=None,     # Keyword arguments for the metric function
    n_jobs=None             # How many parallel jobs to run
)
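
The effect of the `weights` parameter is easiest to see on a tiny example. The sketch below uses hypothetical one-dimensional data in which the single closest neighbour disagrees with the two farther ones: `'uniform'` counts every neighbour equally, while `'distance'` weights each vote by the inverse of its distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: one 'A' point near the query, two 'B' points farther away
X_toy = np.array([[0.0], [1.0], [1.1]])
y_toy = np.array(["A", "B", "B"])
query = np.array([[0.1]])

uniform_clf = KNeighborsClassifier(n_neighbors=3, weights="uniform")
uniform_clf.fit(X_toy, y_toy)
uniform_prediction = uniform_clf.predict(query)[0]

distance_clf = KNeighborsClassifier(n_neighbors=3, weights="distance")
distance_clf.fit(X_toy, y_toy)
distance_prediction = distance_clf.predict(query)[0]

print(uniform_prediction)   # B: two votes against one
print(distance_prediction)  # A: the very close neighbour outweighs the other two
```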

## 5. Create a KNN object

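clf is the conventional name for a classifier object.

Setting p=1 measures distances with the Manhattan (city-block) metric. The default, p=2, is the Euclidean distance.

The default number of neighbors (n_neighbors) is 5.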
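
The `p` parameter is the exponent of the Minkowski distance: `p=1` sums absolute differences (Manhattan distance) and `p=2` gives the straight-line Euclidean distance. A small sketch with a hypothetical helper, `minkowski_distance`, shows the difference on two points:

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance between two points: (sum |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a, point_b = [0.0, 0.0], [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(9 + 16)

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```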

In [6]:
clf = KNeighborsClassifier(p=1)

## 6. Fit model to training data

In [7]:
clf.fit(X_train, y_train)

## 7. Make predictions about test values

In [8]:
predictions = clf.predict(X_test)

print("Predictions:\n",predictions[:5])

print("\nTrue Values:\n",y_test.head())

Predictions:
 ['Chinstrap' 'Gentoo' 'Chinstrap' 'Adelie' 'Gentoo']

True Values:
 184    Chinstrap
181    Chinstrap
340       Gentoo
52        Adelie
296       Gentoo
Name: species, dtype: object


## 8. Evaluate Accuracy

In [9]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)

print("Accuracy:",accuracy)

Accuracy: 0.6666666666666666
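
Accuracy is the fraction of predictions that match the true labels. The sketch below recomputes it by hand using only the five predictions and true values printed in step 7, so the result differs from the accuracy over the full test set.

```python
from sklearn.metrics import accuracy_score

# Only the first five test rows, copied from the output in step 7
shown_predictions = ["Chinstrap", "Gentoo", "Chinstrap", "Adelie", "Gentoo"]
shown_truth = ["Chinstrap", "Chinstrap", "Gentoo", "Adelie", "Gentoo"]

# Count matching positions and divide by the number of rows
matches = sum(p == t for p, t in zip(shown_predictions, shown_truth))
manual_accuracy = matches / len(shown_truth)
library_accuracy = accuracy_score(shown_truth, shown_predictions)

print(matches, "of", len(shown_truth), "correct")  # 3 of 5 correct
print(manual_accuracy)  # 0.6
```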


## Using Multiple Dimensions

Use more than one feature to make the prediction.

**1.** Load dataset

In [10]:
df = load_dataset('penguins')
df = df.dropna()

**2.** Set X to contain all numeric data

In [11]:
X = df.select_dtypes(include="number")
y = df['species']
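
As a quick check of what `select_dtypes(include="number")` keeps, the sketch below rebuilds the first two rows from the output in section 2 as a small DataFrame (the variable name `sample` is only for illustration). The string columns are dropped and the four measurements remain.

```python
import pandas as pd

# First two rows of the penguins data, copied from the df.head() output
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5],
    "bill_depth_mm": [18.7, 17.4],
    "flipper_length_mm": [181.0, 186.0],
    "body_mass_g": [3750.0, 3800.0],
    "sex": ["Male", "Female"],
})

numeric = sample.select_dtypes(include="number")
numeric_columns = list(numeric.columns)
print(numeric_columns)
# ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
```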

**3.** Separate data into training and testing

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

**4.** Create a classifier and fit it to the training data 

In [13]:
clf = KNeighborsClassifier(p=1)
clf.fit(X_train, y_train)

**5.** Make predictions of test data

In [14]:
predictions = clf.predict(X_test)

print("Predictions:\n",predictions[:5])

print("\nTrue Values:\n",y_test.head())

Predictions:
 ['Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Gentoo']

True Values:
 184    Chinstrap
181    Chinstrap
340       Gentoo
52        Adelie
296       Gentoo
Name: species, dtype: object


**6.** Determine accuracy

In [15]:
accuracy = accuracy_score(y_test, predictions)

print("Accuracy:",accuracy)

Accuracy: 0.7738095238095238


## Working with Categorical Data

KNN computes distances between feature values, so non-numeric columns must be converted to numbers first.

**1.** Imports to make this possible

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

**2.** Change X to include every feature except the species

In [17]:
X = df.drop(columns=['species'])

**3.** Create new training and testing sets

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

**4.** Convert strings to numeric data

In [19]:
column_transformer = make_column_transformer(
    (OneHotEncoder(), ['sex', 'island']),
    remainder='passthrough')

X_train = column_transformer.fit_transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names_out())

# Reuse the encoder fitted on the training data; do not refit it on the test set
X_test = column_transformer.transform(X_test)
X_test = pd.DataFrame(data=X_test, columns=column_transformer.get_feature_names_out())

**5.** Create and train model

In [20]:
clf = KNeighborsClassifier(p=1)
clf.fit(X_train, y_train)

**6.** Make predictions

In [21]:
predictions = clf.predict(X_test)

print("Predictions:\n",predictions[:5])

print("\nTrue Values:\n",y_test.head())

Predictions:
 ['Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Gentoo']

True Values:
 184    Chinstrap
181    Chinstrap
340       Gentoo
52        Adelie
296       Gentoo
Name: species, dtype: object


**7.** Evaluate Accuracy

In [22]:
accuracy = accuracy_score(y_test, predictions)

print("Accuracy:",accuracy)

Accuracy: 0.7738095238095238


## Scaling Data

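Features with larger numeric ranges have a larger impact on the distance calculation. In the penguins data, body mass is in the thousands while bill depth is in the tens.

Scaling every numeric feature to a common range keeps any single feature from dominating the distances.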
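
`MinMaxScaler` maps each feature to the range 0 to 1 by computing `(x - min) / (max - min)` from the data it was fitted on. The sketch below checks that formula against the four body-mass values visible in the `df.head()` output; it is a small illustration rather than the scaling of the full dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The four non-missing body_mass_g values from the df.head() output
body_mass_g = np.array([[3750.0], [3800.0], [3250.0], [3450.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(body_mass_g)

# The same result computed by hand
low, high = body_mass_g.min(), body_mass_g.max()
manual = (body_mass_g - low) / (high - low)

print(np.allclose(scaled, manual))  # True
print(scaled.min(), scaled.max())   # 0.0 1.0
```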

**1.** Imports

In [23]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from seaborn import load_dataset

**2.** Load dataset, drop missing values, and select features

In [24]:
df = load_dataset('penguins')
df = df.dropna()

X = df.drop(columns=['species'])
y = df['species']

**3.** Separate data into training and testing

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

**4.** Encode the categorical columns and scale the numeric columns

In [26]:
column_transformer = make_column_transformer(
    (OneHotEncoder(), ['sex', 'island']),
    (MinMaxScaler(), ['bill_depth_mm', 'bill_length_mm', 'flipper_length_mm', 'body_mass_g']),
    remainder='passthrough')

X_train = column_transformer.fit_transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names_out())

X_test = column_transformer.transform(X_test)
X_test = pd.DataFrame(data=X_test, columns=column_transformer.get_feature_names_out())

**5.** Make a model and fit it to the data

In [27]:
clf = KNeighborsClassifier(p=1)
clf.fit(X_train, y_train)

**6.** Make predictions

In [28]:
predictions = clf.predict(X_test)

print("Predictions:\n",predictions[:5])

print("\nTrue Values:\n",y_test.head())

Predictions:
 ['Chinstrap' 'Chinstrap' 'Gentoo' 'Adelie' 'Gentoo']

True Values:
 184    Chinstrap
181    Chinstrap
340       Gentoo
52        Adelie
296       Gentoo
Name: species, dtype: object


**7.** Evaluate Accuracy

In [29]:
accuracy = accuracy_score(y_test,predictions)

print("Accuracy:",accuracy)

Accuracy: 1.0


## Hyper-Parameter Tuning

Search a grid of candidate values to find the best-performing combination of hyperparameters.

**1.** Import GridSearchCV

In [30]:
from sklearn.model_selection import GridSearchCV

**2.** Create a dictionary of possible parameter values

In [31]:
params = {
    'n_neighbors': range(1,15,2),
    'p':[1,2],
    'weights':['uniform','distance']
}

**3.** Set up a grid search with 5-fold cross-validation

In [32]:
clf = GridSearchCV(
    estimator = KNeighborsClassifier(),
    param_grid = params,
    cv = 5,
    n_jobs = 5,
    verbose = 1,
)

**4.** Fit model to training data and view the best parameters

In [33]:
clf.fit(X_train, y_train)

print(clf.best_params_)

Fitting 5 folds for each of 28 candidates, totalling 140 fits
{'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}
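
Besides `best_params_`, a fitted `GridSearchCV` exposes the mean cross-validated score of the winning combination (`best_score_`) and can score held-out data with the refitted best model. The sketch below uses scikit-learn's bundled iris dataset as a stand-in so it runs without downloading anything; the variable name `search` is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (iris), not the penguins data used in this tutorial
X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, random_state=100
)

params = {
    "n_neighbors": range(1, 15, 2),  # 7 values
    "p": [1, 2],                     # 2 values
    "weights": ["uniform", "distance"],  # 2 values
}

search = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=5)
search.fit(Xi_train, yi_train)

n_candidates = len(search.cv_results_["params"])
test_accuracy = search.score(Xi_test, yi_test)  # uses the refitted best model

print("Candidates tried:", n_candidates)  # 7 * 2 * 2 = 28
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```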


## Putting It All Together

The full code, using every technique covered in this tutorial.

**1.** Imports

In [34]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from seaborn import load_dataset

**2.** Load data, remove missing values, and select features

In [35]:
df = load_dataset('penguins')
df = df.dropna()

X = df.drop(columns=['species'])
y = df['species']

**3.** Get training and testing data

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

**4.** Convert non-numeric data to numeric and scale numeric data down

In [44]:
column_transformer = make_column_transformer(
    (OneHotEncoder(), ['sex', 'island']),
    (MinMaxScaler(), ['bill_depth_mm', 'bill_length_mm', 'flipper_length_mm', 'body_mass_g']),
    remainder='passthrough')

X_train = column_transformer.fit_transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names_out())

X_test = column_transformer.transform(X_test)
X_test = pd.DataFrame(data=X_test, columns=column_transformer.get_feature_names_out())

**5.** Create model using best parameters and fit it to data

In [45]:
clf = KNeighborsClassifier(n_neighbors=5,
                          p=1,
                          weights='uniform')

clf.fit(X_train, y_train)

**6.** Make predictions

In [46]:
predictions = clf.predict(X_test)

print("Predictions:\n",predictions[:5])

print("\nTrue Values:\n",y_test.head())

Predictions:
 ['Chinstrap' 'Chinstrap' 'Gentoo' 'Adelie' 'Gentoo']

True Values:
 184    Chinstrap
181    Chinstrap
340       Gentoo
52        Adelie
296       Gentoo
Name: species, dtype: object


**7.** Evaluate accuracy

In [47]:
accuracy = accuracy_score(y_test, predictions)

print("Accuracy:",accuracy)

Accuracy: 1.0
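
One possible way to package the same steps is scikit-learn's `make_pipeline`, which chains the column transformer and the classifier so that calling `fit` only ever fits the preprocessing on the training data. This is a sketch on a hypothetical synthetic DataFrame (the column names `group`, `length_mm`, `mass_g`, and `label` are invented for illustration), not a rerun of the penguins example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical synthetic data: one categorical and two numeric features
rng = np.random.default_rng(0)
n_rows = 120
toy = pd.DataFrame({
    "group": rng.choice(["a", "b"], size=n_rows),
    "length_mm": rng.normal(40, 5, size=n_rows),
    "mass_g": rng.normal(4000, 500, size=n_rows),
})
toy["label"] = np.where(toy["length_mm"] > 40, "long", "short")

X_toy = toy.drop(columns=["label"])
y_toy = toy["label"]
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, random_state=100
)

# Preprocessing and model in one object
model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["group"]),
        (MinMaxScaler(), ["length_mm", "mass_g"]),
    ),
    KNeighborsClassifier(n_neighbors=5, p=1, weights="uniform"),
)

model.fit(Xt_train, yt_train)                    # fits encoder, scaler, and KNN
pipeline_accuracy = model.score(Xt_test, yt_test)  # transforms, then predicts
print("Accuracy:", pipeline_accuracy)
```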
