<h1><span style='color: #FFA500; font-family: Arial; font-size: 3em;'>K-Nearest Neighbors</span></h1>

First, we import the main libraries we'll be using.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # KNN is distance-based, so the data must be standardized first
from sklearn.neighbors import KNeighborsClassifier  # the model
from sklearn.metrics import classification_report, confusion_matrix  # metrics used to evaluate the finished model

------

### Import the dataset

In [10]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


In [11]:
df.count()

species              344
island               344
bill_length_mm       342
bill_depth_mm        342
flipper_length_mm    342
body_mass_g          342
sex                  333
dtype: int64

-----

### Data Scrubbing

In [13]:
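# Drop the 'sex' column, which has the most missing values (333 of 344 present)
df = df.drop(columns='sex')
# Drop every row that still contains at least one NA value
df = df.dropna(axis=0, how='any')
# One-hot encode the categorical 'island' column
df = pd.get_dummies(df, columns=['island'])
df.count()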

species              342
bill_length_mm       342
bill_depth_mm        342
flipper_length_mm    342
body_mass_g          342
island_Biscoe        342
island_Dream         342
island_Torgersen     342
dtype: int64

>As the counts above show, only 2 rows contained NA values once the `sex` column was removed, so almost the whole dataset (342 of 344 rows) remains usable. Real-world datasets usually need far more cleaning than this, which is why it's worth practicing _data scrubbing_ on raw datasets.
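Before dropping rows, it can help to count the missing values per column to see what would be lost. The snippet below is a minimal sketch on a tiny hand-made frame (the name `demo` and its contents are illustrative, loosely based on the first rows shown earlier), not the full penguins dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the penguins table, with one incomplete row
demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie"],
    "bill_length_mm": [39.1, 39.5, np.nan, 36.7],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = demo.dropna()
print(len(demo), "rows before,", len(cleaned), "rows after")
```

On the real dataset, the same `isna().sum()` call applied to `df` gives the complement of the `df.count()` output.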

----

### Standardizing independent variables

K-nearest neighbors, like several other algorithms (_K-means clustering_, _hierarchical clustering_, _support vector machines_), bases its calculations on distances between data points, typically the Euclidean distance. When variables are measured on very different scales, the ones with the largest numeric range dominate the distance and can badly distort the model's predictions. That's why we standardize the data.
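To make the effect of scale concrete, here is a small sketch with three made-up penguins described by bill length (mm) and body mass (g). The means and standard deviations are round illustrative numbers, assumed for the example rather than computed from the dataset.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Hypothetical penguins: [bill_length_mm, body_mass_g]
a = np.array([39.0, 3750.0])
b = np.array([46.0, 3760.0])  # very different bill, almost the same mass
c = np.array([39.5, 4150.0])  # almost the same bill, moderately different mass

# Raw units: body mass dominates because its numbers are much larger
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)
print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")

# Standardize with assumed (illustrative) mean and standard deviation per feature
mean = np.array([44.0, 4200.0])
std = np.array([5.5, 800.0])
za, zb, zc = [(p - mean) / std for p in (a, b, c)]

scaled_ab = euclidean(za, zb)
scaled_ac = euclidean(za, zc)
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

In raw units `b` looks far closer to `a` than `c` does, purely because grams produce bigger numbers than millimetres. After standardizing, the ordering flips and `c` becomes the nearer neighbor.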

In [15]:
Scaler = StandardScaler()
Scaler.fit(df.drop('species',axis=1))
scaled_df = Scaler.transform(df.drop('species',axis=1))
# axis=0 (or 'index'): the operation runs vertically; you are looking up an index label (a row).
# axis=1 (or 'columns'): the operation runs horizontally. When you drop with axis=1, you are looking up a column name.
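As a quick check of what `StandardScaler` does, the sketch below compares its output with the formula (x - mean) / std applied column by column. It uses a small array built from bill length and body mass values in the first rows shown earlier; the names `X_demo`, `by_sklearn` and `by_hand` are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# [bill_length_mm, body_mass_g] for four complete rows from df.head()
X_demo = np.array([
    [39.1, 3750.0],
    [39.5, 3800.0],
    [40.3, 3250.0],
    [36.7, 3450.0],
])

# Standardize with scikit-learn
by_sklearn = StandardScaler().fit_transform(X_demo)

# Standardize by hand; np.std defaults to the population formula (ddof=0),
# which is what StandardScaler uses
by_hand = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(by_sklearn, by_hand))
print(by_sklearn.mean(axis=0).round(6), by_sklearn.std(axis=0).round(6))
```

After the transformation every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation.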

---

### Assigning X and Y variables and splitting the data

In [17]:
X = scaled_df
Y = df['species']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, shuffle = True)
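The cells above fit the scaler on the full dataset before splitting, so the test rows influence the mean and standard deviation used for scaling. One common alternative, sketched here as a suggestion rather than as this notebook's method, is to split first and wrap the scaler and the classifier in a pipeline so the scaler only ever sees the training rows. To keep the example self-contained it uses scikit-learn's bundled wine dataset as a stand-in; `stratify` and `random_state` are additions not present in the original cell.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (not the penguins data): 178 rows, 13 features, 3 classes
X_demo, y_demo = load_wine(return_X_y=True)

# Split first; stratify keeps class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics whenever it predicts
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_tr, y_tr)

test_accuracy = knn_pipeline.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

With the penguins data, the same pattern would apply to `df.drop('species', axis=1)` and `df['species']` in place of `X_demo` and `y_demo`.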

---

### Model and testing

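It is worth highlighting that the number of neighbors should be chosen systematically, for example by comparing cross-validated scores for several values of `n_neighbors`, rather than picked by simple inspection. The value 5 used here is scikit-learn's default.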

In [21]:
model = KNeighborsClassifier(n_neighbors = 5)

# Fit the model to the training data
model.fit(X_train,Y_train)

# Predict the species of the penguins in the test set
model_test = model.predict(X_test)
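The number of neighbors used above can be compared against other candidates with cross-validation. The following is a minimal sketch using `GridSearchCV`; the candidate range (odd values from 1 to 21), the 5-fold setting and the bundled wine dataset used as a stand-in are all assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (not the penguins data)
X_demo, y_demo = load_wine(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values for k; odd numbers reduce the chance of tied votes
param_grid = {"knn__n_neighbors": list(range(1, 22, 2))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print("best k:", best_k, "mean CV accuracy:", round(search.best_score_, 3))
```

In this notebook the search would be run on the training portion only, so the held-out test set still gives an unbiased estimate afterwards.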

---

### Evaluating the model's predictions

In [22]:
print(confusion_matrix(Y_test,model_test))
print(classification_report(Y_test,model_test))

[[50  0  0]
 [ 0 13  0]
 [ 0  0 40]]
              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        50
   Chinstrap       1.00      1.00      1.00        13
      Gentoo       1.00      1.00      1.00        40

    accuracy                           1.00       103
   macro avg       1.00      1.00      1.00       103
weighted avg       1.00      1.00      1.00       103
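The figures in the classification report can be derived directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. The sketch below recomputes them with NumPy from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [50, 0, 0],   # Adelie
    [0, 13, 0],   # Chinstrap
    [0, 0, 40],   # Gentoo
])

true_pos = np.diag(cm)                 # correctly classified samples per class
precision = true_pos / cm.sum(axis=0)  # divide by column sums (predicted counts)
recall = true_pos / cm.sum(axis=1)     # divide by row sums (actual counts, i.e. support)
accuracy = true_pos.sum() / cm.sum()   # correct predictions over all samples

print("precision:", precision)
print("recall:   ", recall)
print("accuracy: ", accuracy)
```

Every off-diagonal entry is zero here, which is why all three metrics come out as 1.00 for each class.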



---

### Testing

In [24]:
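# Data point to predict (same column order as the training features)
penguin = [
	39, #bill_length_mm
	18.5, #bill_depth_mm
	180, #flipper_length_mm
	3750, #body_mass_g
	0, #island_Biscoe
	0, #island_Dream
	1, #island_Torgersen
]

# The model was trained on standardized features, so the new point
# must go through the same fitted scaler before prediction
feature_names = df.drop('species', axis=1).columns
penguin_scaled = Scaler.transform(pd.DataFrame([penguin], columns=feature_names))

# Make prediction
new_penguin = model.predict(penguin_scaled)
new_penguin

>The new data point has to be scaled with the same `Scaler` that was fitted for training. Passing the raw values straight to `model.predict` returns `array(['Gentoo'], dtype=object)`, which is not a meaningful prediction: on the standardized scale a body mass of 3750 is an extreme value, even though these measurements are almost identical to the first row of the dataset, an Adelie from Torgersen.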