# Unit 3 K-Nearest Neighbors (KNN) Basics

## K-Nearest Neighbors (KNN) Basics

### Lesson Introduction

Welcome to the lesson on K-Nearest Neighbors (KNN)! Today, we'll explore this simple and intuitive algorithm. KNN can be used for both classification and regression tasks; this lesson focuses on classification. Our goal is to understand how KNN works and implement it in Python using Scikit-Learn. By the end, you'll be able to classify data points based on their features.

### K-Nearest Neighbors (KNN) Basics

What is KNN? Imagine identifying a fruit as an apple or an orange. Instead of looking it up in a reference book, you ask the people nearby for their opinions, and the majority wins. This is the idea behind KNN: classify a data point based on the classes of its nearest neighbors.

Let's take a look at an example:

In this image, we see a target point (black cross) whose class we want to predict. Its three nearest neighbors are two red points and one green point. Since the majority of the neighbors are red, the target point will also be classified as red.
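The majority vote described above can be sketched by hand in a few lines. This is only an illustration, not how Scikit-Learn implements KNN internally; the points, labels, and target below are made-up values.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-D training points and their classes (made up for illustration)
points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 3.0], [5.0, 5.0], [6.0, 5.5]])
labels = ["red", "red", "green", "green", "green"]

# The target point we want to classify
target = np.array([2.0, 2.0])

# Euclidean distance from the target to every training point
distances = np.linalg.norm(points - target, axis=1)

# Indices of the k closest points
k = 3
nearest = np.argsort(distances)[:k]

# Majority vote among the k nearest neighbors
votes = Counter(labels[i] for i in nearest)
predicted = votes.most_common(1)[0][0]
print(predicted)  # red
```

Here the three closest points are two red ones and one green one, so the vote goes to red, just like in the example above.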

Why use KNN? It's easy to understand and implement, and it is used in applications such as product recommendation and recognizing patterns in medical data.

### Loading Dataset

Let's load the Iris dataset, which contains information about different flowers. Here's how we do it using Scikit-Learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
```

This code loads the Iris dataset as features `X` and labels `y`, then splits it into training and testing sets, with 40% of the samples held out for testing. The dataset describes each flower by four measurements: sepal length, sepal width, petal length, and petal width. Our goal is to predict the type of the flower, which is one of the following: Setosa, Versicolour, or Virginica.
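If you want to look at the dataset before modeling, a quick inspection could look like the sketch below. It calls `load_iris()` without `return_X_y=True` so that the feature and class names are available; the variable name `iris` is just a choice made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the full dataset object to get access to the names
iris = load_iris()
print(iris.feature_names)
print(iris.target_names)

# Features and labels, same data as load_iris(return_X_y=True)
X, y = iris.data, iris.target
print(X.shape, y.shape)

# Same split as in the lesson: 40% of the samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(X_train.shape, X_test.shape)
```

With 150 samples in total, this split leaves 90 samples for training and 60 for testing.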

Note that because we predict three classes instead of two, this is not binary classification but multiclass classification. Luckily for us, the KNN classifier is perfectly suitable for this type of task. Decision trees can also be used for it, as we saw in the previous lesson.

One important thing to know is that logistic regression, in its original form, is not suitable for multiclass classification. However, it can be adapted for this type of task using techniques like One-vs-Rest (OvR) or Softmax Regression (Multinomial Logistic Regression).
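As a rough illustration of the One-vs-Rest idea, the sketch below wraps a logistic regression in Scikit-Learn's `OneVsRestClassifier`, which trains one binary classifier per class. The `max_iter=1000` setting is an assumption made here to help the solver converge; it is not part of the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# One binary logistic regression per class (max_iter=1000 is an assumed value)
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_clf.fit(X_train, y_train)

# Three classes, so three underlying binary classifiers
print(len(ovr_clf.estimators_))
print(f"OvR accuracy: {ovr_clf.score(X_test, y_test) * 100:.2f}%")
```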

### Using the KNN Classifier

Now, let's initialize and fit our KNN classifier. KNN classifies a data point based on the majority class among its nearest neighbors. We’ll start with `k=3`, meaning we will check the three nearest neighbors when making a prediction.

Here’s how to fit a KNN classifier:

```python
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with 3 neighbors
knn_clf = KNeighborsClassifier(n_neighbors=3)

# Fit the KNN classifier
knn_clf.fit(X_train, y_train)
```

This initializes the `KNeighborsClassifier` with 3 neighbors. Unlike the models we studied earlier, KNN does not learn any parameters during fitting. It simply stores the training data (Scikit-Learn may also organize it into a search structure for faster lookups) and defers the real work to the prediction phase, where each new point is compared against the stored samples.
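To see what happens at prediction time, you can ask the fitted classifier which training points it considers nearest. The sketch below uses the `kneighbors` method for the first test sample and repeats the majority vote by hand; the variable names are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)

# Take the first test sample and look up its 3 nearest training points
sample = X_test[:1]
distances, indices = knn_clf.kneighbors(sample)

# Labels of those neighbors in the training set
neighbor_labels = y_train[indices[0]]
print("Distances:", distances[0])
print("Neighbor labels:", neighbor_labels)

# Majority vote over the neighbor labels, compared with predict()
majority = np.bincount(neighbor_labels).argmax()
print("Majority vote:", majority, "| predict():", knn_clf.predict(sample)[0])
```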

### Evaluating the Model

We can evaluate our model's performance by calculating accuracy to see how often it predicts correctly.

```python
from sklearn.metrics import accuracy_score

# Predict using the KNN classifier
y_pred = knn_clf.predict(X_test)

# Calculate accuracy using accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")  # Model accuracy: 98.33%
```

The `accuracy_score` function compares the true test labels with the predictions and returns the fraction of correct predictions, which we print as a percentage.
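Accuracy is just the number of correct predictions divided by the total number of predictions. The small sketch below checks this on made-up labels (`y_true` and `y_hat` are hypothetical values, not model output).

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions: 4 of the 5 match
y_true = [0, 1, 2, 2, 1]
y_hat = [0, 1, 2, 1, 1]

# Count matching positions and divide by the total
correct = sum(t == p for t, p in zip(y_true, y_hat))
manual_accuracy = correct / len(y_true)

print(manual_accuracy)                # 0.8
print(accuracy_score(y_true, y_hat))  # 0.8
```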

### Performance Comparison

Let's also train a decision tree model on the same data and see how it performs. This time we will use the `.score` method instead of calling `.predict` and `accuracy_score` separately. The `score` method combines two steps:

1.  Calculate the predictions
2.  Calculate the accuracy

We use the `.score` method to make the code shorter and easier to maintain when we don't need the predictions themselves but only care about the model's accuracy, like here.

```python
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
dt_accuracy = dt_clf.score(X_test, y_test)
print(f"Decision Tree model accuracy: {dt_accuracy * 100:.2f}%")
# Decision Tree model accuracy: 96.67%
```

In this case, KNN outperforms the Decision Tree. However, if we tune the decision tree a bit by limiting its depth, we can achieve the same result:

```python
dt_clf = DecisionTreeClassifier(random_state=42, max_depth=3)
dt_clf.fit(X_train, y_train)
dt_accuracy = dt_clf.score(X_test, y_test)
print(f"Decision Tree model accuracy: {dt_accuracy * 100:.2f}%")
# Decision Tree model accuracy: 98.33%
```

Note that we added the `max_depth=3` parameter to the model initialization and improved the model's performance. This shows the importance of tuning your models by choosing the best parameters.

In this case, the `max_depth` value was chosen arbitrarily. In the last course of this course path, we will learn how to find the best possible parameters using a more systematic approach.
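As a preview of what a more systematic search could look like, here is a minimal sketch using Scikit-Learn's `GridSearchCV`. This is one common approach and not necessarily the one taught later in the course path; the range of depths and the 5-fold cross-validation are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Candidate depths and 5-fold cross-validation are assumed choices for this sketch
param_grid = {"max_depth": list(range(1, 11))}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)

# The search only sees the training data; the test set stays untouched
search.fit(X_train, y_train)

print("Best max_depth:", search.best_params_["max_depth"])
print(f"Cross-validated accuracy: {search.best_score_ * 100:.2f}%")
print(f"Test accuracy: {search.score(X_test, y_test) * 100:.2f}%")
```

The search picks the depth using cross-validation on the training set only, so the test accuracy remains an honest estimate of how the chosen model performs on unseen data.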

### Lesson Summary

Great job! You've learned the basics of K-Nearest Neighbors (KNN) and how to implement it using Python and Scikit-Learn. We covered:

  * The concept of KNN
  * Loading and understanding the Iris dataset
  * Splitting the dataset into training and testing sets
  * Fitting a KNN classifier
  * Brief model evaluation

Now it's time to practice. You'll engage in hands-on activities to solidify your understanding of KNN and see how it works in different scenarios. Get ready to classify data points and measure your model's performance!

Dive into the practice exercises, and good luck!

## KNN Flower Classification with Iris Dataset

Explore how well a K-Nearest Neighbors (KNN) model can classify different types of flowers using the Iris dataset. The given code loads the dataset, splits it, trains a KNN classifier with 5 neighbors, and evaluates its accuracy.

Click Run to discover the model's accuracy!

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Initialize the KNN classifier with 5 neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)

# Train the KNN classifier
knn_clf.fit(X_train, y_train)

# Evaluate the model
accuracy = knn_clf.score(X_test, y_test)
print(f"Model accuracy: {accuracy * 100:.2f}%")

```

**KNN Flower Classification with Iris Dataset Report**

The K-Nearest Neighbors (KNN) model, with `n_neighbors=5`, was used to classify different types of flowers in the Iris dataset. After loading the dataset and splitting it into training and testing sets (60% training, 40% testing), the model was trained and evaluated.

The model achieved an accuracy of **98.33%**. This indicates that the KNN classifier performed very well in predicting the correct flower type based on its features in the Iris dataset.
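Beyond measuring accuracy, the fitted model can classify a single new flower. In the sketch below, `new_flower` is a hypothetical set of measurements chosen for illustration; its short petals are typical of Setosa.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=42
)

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Hypothetical measurements in cm: sepal length, sepal width, petal length, petal width
new_flower = [[5.1, 3.5, 1.4, 0.2]]

# predict() returns a class index; target_names maps it to a readable name
predicted_class = knn_clf.predict(new_flower)[0]
print("Predicted type:", iris.target_names[predicted_class])

# Share of the 5 neighbors that voted for each class
vote_shares = knn_clf.predict_proba(new_flower)[0]
print("Neighbor vote shares:", vote_shares)
```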

## Adjust K Value for KNN Classifier

Hey there, Space Voyager!

Let's tweak our KNN model a bit. Change the `n_neighbors` parameter from 3 to 4 in the `KNeighborsClassifier` initialization. This will help you understand how the number of neighbors influences the model's performance.

Good luck!

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=42)

# TODO: change initialization to use n_neighbors=4
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)

# Evaluate the model
accuracy = knn_clf.score(X_test, y_test)
print(f"Model accuracy: {accuracy * 100:.2f}%")

```

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=42)

# Change initialization to use n_neighbors=4
knn_clf = KNeighborsClassifier(n_neighbors=4)
knn_clf.fit(X_train, y_train)

# Evaluate the model
accuracy = knn_clf.score(X_test, y_test)
print(f"Model accuracy: {accuracy * 100:.2f}%")
```

Hey there, Space Voyager!

You've successfully tweaked the `n_neighbors` parameter for our KNN model. By changing it from 3 to 4, we can observe the impact of this hyperparameter on the model's performance.

Here's the accuracy with `n_neighbors=4`:

Model accuracy: 96.67%

## Complete the KNN Classifier for Wine Dataset

Astro Explorer, it's time to level up! Fill in the blanks to complete the code. Implement the missing pieces to load the wine dataset and fit a KNN classifier.

Remember, the dataset contains features such as a wine's hue and other chemical measurements.

Get coding and see if you can achieve high accuracy!

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the Wine dataset
X, y = load_wine(return_X_y=True)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# TODO: Initialize KNN classifier with 5 neighbors

# TODO: Fit the KNN classifier

# TODO: Make predictions for the test data and print the obtained accuracy score

```

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the Wine dataset
X, y = load_wine(return_X_y=True)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize KNN classifier with 5 neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)

# Fit the KNN classifier
knn_clf.fit(X_train, y_train)

# Make predictions for the test data
y_pred = knn_clf.predict(X_test)

# Calculate and print the obtained accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")
```

Astro Explorer, you've successfully completed the KNN classifier for the Wine dataset!

The code now initializes a `KNeighborsClassifier` with 5 neighbors, fits it to the training data, makes predictions on the test set, and calculates the accuracy.

Here's the accuracy achieved by your KNN model on the Wine dataset:

Model accuracy: 74.07%

This accuracy shows how well your model classifies different wines based on their chemical properties. Great work!
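Because KNN relies on distances, features measured on very different scales can dominate the comparison, and the Wine features vary widely in magnitude. Feature scaling is not covered in this lesson, so treat the following as an optional sketch: it standardizes the features with `StandardScaler` inside a pipeline and compares the result with the unscaled model.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN on the raw features, as in the exercise
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
raw_accuracy = raw_knn.score(X_test, y_test)

# Same classifier, with every feature standardized first
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
scaled_accuracy = scaled_knn.score(X_test, y_test)

print(f"Raw features:    {raw_accuracy * 100:.2f}%")
print(f"Scaled features: {scaled_accuracy * 100:.2f}%")
```

The pipeline fits the scaler on the training data only and reuses those statistics for the test data, which keeps the evaluation fair.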

## Classify Wines with KNN

Nice progress, Space Explorer!

Now, let's dive deeper. Your task is to complete the missing pieces in the code to classify the Wine dataset using K-Nearest Neighbors (KNN). This time, we will try different values of `k` and see how the model performs in order to choose the best one.

This process is called hyperparameter tuning (or hypertuning for short), and it essentially involves searching for the model's optimal initialization parameters. Sounds cool, right?

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load dataset and split to train and test
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Iterate over different k values
for k in range(2, 10):
    # TODO: initialize the KNeighborsClassifier with n_neighbors=k
    # TODO: make predictions and calculate the accuracy for the test data. Save the result in the "accuracy" variable
    print(f'k={k} -> Accuracy: {accuracy * 100:.2f}%')
```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load dataset and split to train and test
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Iterate over different k values
for k in range(2, 10):
    # Initialize the KNeighborsClassifier with n_neighbors=k
    knn_clf = KNeighborsClassifier(n_neighbors=k)

    # Fit the KNN classifier to the training data
    knn_clf.fit(X_train, y_train)

    # Make predictions and calculate the accuracy for the test data. Save the result in the "accuracy" variable
    accuracy = knn_clf.score(X_test, y_test)
    print(f'k={k} -> Accuracy: {accuracy * 100:.2f}%')
```

Nice progress, Space Explorer!

You've successfully completed the code to classify the Wine dataset using K-Nearest Neighbors (KNN) and performed hypertuning by iterating through different `k` values.

Running this code will show you the accuracy of the KNN model for each `k` from 2 to 9, allowing you to observe how the number of neighbors influences the model's performance. This is a great practical example of finding the optimal model parameters!

## Wine Classification with KNN

Alright, Space Wanderer, it’s time to put your classification skills to a real test! You will use the Wine dataset, fit both a KNN and a Decision Tree classifier, hypertune them, and compare their best accuracies.

1.  Load and split the Wine dataset.
2.  Initialize and fit the KNN classifier: iterate over different `k` values (e.g., from 2 to 10) and find the best `k` that yields the highest accuracy.
3.  Initialize and train the Decision Tree classifier: iterate over different depths (e.g., from 1 to 10) and find the best depth that yields the highest accuracy.
4.  Compare their accuracies: print the best accuracies for both classifiers.

It's time to show what you've got!


```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load dataset and split to train and test
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Hypertune KNN
best_knn_acc = 0
best_k = 0
for k in range(2, 11):
    # TODO: calculate the accuracy with the given k
    if accuracy > best_knn_acc:
        best_knn_acc = accuracy
        best_k = k
print(f'Best K for KNN: {best_k} -> Accuracy: {best_knn_acc * 100:.2f}%')

# Hypertune Decision Tree
best_dt_acc = 0
best_depth = 0
for depth in range(1, 11):
    # TODO: calculate the accuracy with the given max_depth
    if accuracy > best_dt_acc:
        best_dt_acc = accuracy
        best_depth = depth
print(f'Best Depth for Decision Tree: {best_depth} -> Accuracy: {best_dt_acc * 100:.2f}%')

```

Sure, Space Wanderer! Let's get these classifiers trained and tuned.

Here's the completed code to hypertune both the KNN and Decision Tree classifiers on the Wine dataset, find their best accuracies, and compare them:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load dataset and split to train and test
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Hypertune KNN
best_knn_acc = 0
best_k = 0
for k in range(2, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    if accuracy > best_knn_acc:
        best_knn_acc = accuracy
        best_k = k
print(f'Best K for KNN: {best_k} -> Accuracy: {best_knn_acc * 100:.2f}%')

# Hypertune Decision Tree
best_dt_acc = 0
best_depth = 0
for depth in range(1, 11):
    dtree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dtree.fit(X_train, y_train)
    accuracy = dtree.score(X_test, y_test)
    if accuracy > best_dt_acc:
        best_dt_acc = accuracy
        best_depth = depth
print(f'Best Depth for Decision Tree: {best_depth} -> Accuracy: {best_dt_acc * 100:.2f}%')

```