# **The k-Nearest Neighbors (KNN) Algorithm**

## **Introduction**

The k-Nearest Neighbors (KNN) algorithm is a widely used machine learning method for both classification and regression. As an instance-based method, KNN assumes that points close to one another in feature space tend to share similar outcomes. It is non-parametric and lazy: it has no separate training phase beyond storing the training data, and it makes each prediction from the nearest neighbors of the query point within that data.

### k-NN Algorithm

The k-Nearest Neighbors algorithm can be summarized as follows:

1. **Initialization:** It begins with a pre-labeled training dataset alongside a new, unlabeled point that requires classification or value prediction.

2. **Distance Calculation:** The algorithm computes the distance between the new point and every point in the training dataset using a chosen distance metric, such as Euclidean distance.

3. **Neighbor Selection:** It identifies the 'k' closest points (neighbors) to the new point based on the calculated distances.

4. **Outcome Prediction:**
    - In classification tasks, KNN assigns the new point the most common class among its 'k' nearest neighbors (a majority vote).

    - For regression tasks, it predicts a value by averaging the target values of the 'k' nearest neighbors.
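
The four steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not an optimized implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are invented for this example, and Euclidean distance is assumed as the metric.

```python
from collections import Counter

import numpy as np


def k_nearest(X_train, x_new, k):
    """Steps 2 and 3: return the indices of the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every point
    return np.argsort(distances, kind="stable")[:k]      # indices of the k smallest distances


def knn_classify(X_train, y_train, x_new, k=3):
    """Step 4 (classification): majority vote among the k nearest neighbors."""
    neighbors = k_nearest(X_train, x_new, k)
    votes = Counter(str(y_train[i]) for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k=3):
    """Step 4 (regression): average the target values of the k nearest neighbors."""
    neighbors = k_nearest(X_train, x_new, k)
    return float(np.mean(y_train[neighbors]))


# Step 1: a hypothetical labeled training set and a new, unlabeled point
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.5], [6.5, 6.0]])
labels = np.array(["A", "A", "A", "B", "B", "B"])
values = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
x_new = np.array([1.8, 1.4])

predicted_class = knn_classify(X_train, labels, x_new, k=3)
predicted_value = knn_regress(X_train, values, x_new, k=3)
print(predicted_class, predicted_value)
```

Because the new point sits inside the first cluster, its three nearest neighbors all carry label `A`, and the regression estimate is the mean of their three target values.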

## Advantages and Disadvantages

### Advantages

1. **Simplicity**: KNN is easy to understand and implement, since a prediction needs only a distance metric and a value for 'k'.

2. **No Training Requirement**: Because KNN is a lazy learner with no explicit training phase, new data points can be added to the training set without refitting model parameters, which suits datasets that change over time.

3. **Versatility**: The same neighbor-based approach applies to classification, regression, and even anomaly detection.

4. **Adaptability**: The algorithm makes no assumption about the underlying data distribution and can capture non-linear relationships.

### Disadvantages

1. **Computational Demand**: A brute-force prediction computes the distance to every training point, so the cost of each query grows with the size of the training set and becomes significant for large datasets.

2. **Sensitivity to k**: The performance of KNN depends heavily on the chosen 'k' value. A small 'k' makes predictions sensitive to noise, while a large 'k' smooths over local structure, so 'k' needs careful selection.

3. **Feature Scaling Necessity**: Because KNN relies on distance calculations, features with larger numeric ranges dominate the distance unless the features are scaled appropriately.

4. **Curse of Dimensionality**: KNN's accuracy and efficiency may decrease in high-dimensional spaces, where data become sparse, distances become less informative, and computation increases.
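
To make the feature scaling point concrete, here is a small sketch with made-up numbers (an income column in dollars and an age column in years, both hypothetical). It compares the nearest neighbor found on raw features with the one found after standardizing each column to zero mean and unit variance, which is the same transformation `StandardScaler` applies.

```python
import numpy as np

# Hypothetical toy data: column 0 is income in dollars, column 1 is age in years
X = np.array([
    [50_000.0, 25.0],
    [51_000.0, 60.0],
    [80_000.0, 26.0],
])
query = np.array([50_600.0, 27.0])


def nearest_index(points, point):
    """Index of the row in `points` with the smallest Euclidean distance to `point`."""
    return int(np.argmin(np.linalg.norm(points - point, axis=1)))


# On raw features the income column dominates the distance
raw_nearest = nearest_index(X, query)

# Standardize each column using statistics from X, then repeat the search
mean, std = X.mean(axis=0), X.std(axis=0)
scaled_nearest = nearest_index((X - mean) / std, (query - mean) / std)

print("Nearest on raw features:   ", raw_nearest)
print("Nearest on scaled features:", scaled_nearest)
```

On the raw data the 60-year-old (row 1) is nearest because a few hundred dollars of income outweighs a 33-year age gap. After scaling, the 25-year-old (row 0) becomes the nearest neighbor.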

## Dataset Overview

For our exploration, we move from the Swiss Roll dataset to the "Life Expectancy of the World" dataset. It records life expectancy across countries, which makes it a useful basis for studying global health trends. Whereas the Swiss Roll is a continuous, non-linear manifold in three dimensions, the Life Expectancy dataset is a real-world example in which each life expectancy figure is tied to a specific country and its geographical location.

### Dataset Characteristics
1. **Dimensionality:** The analysis here focuses on three numeric features per country: life expectancy for the overall, male, and female populations.

2. **Geographical Association:** Each country's data point is associated with a continent, a categorical label that can serve as the target in classification tasks.

3. **Real-World Application:** The dataset's real-world applicability extends to public health analysis, demographic studies, and geographic classification tasks, offering a stark contrast to the Swiss Roll's abstract nature.

4. **Analytical Potential:** The life expectancy figures allow for regression analysis to predict life expectancy based on various factors, while the country-to-continent mapping supports classification tasks.

## Implementing K-Nearest Neighbors (KNN) with the Dataset

The objective here is not to unfold a manifold but to use KNN to predict a country's continent from its life expectancy metrics, applying the algorithm to real-world data rather than a theoretical example.

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the Life Expectancy dataset
data = pd.read_csv('Life_expectancy_dataset.csv')

# Prepare the dataset
features = data[['Overall Life', 'Male Life', 'Female Life']]
target = data['Continent']

# Split the dataset: hold out 30% for testing (the string continent labels are used as-is, so no encoding happens here)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Make predictions and evaluate the model
y_pred = knn.predict(X_test_scaled)
print('Classification Report:\n', classification_report(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy Score:', accuracy_score(y_test, y_pred))


In [12]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the dataset in this cell too, so it runs on its own without raising a NameError for 'data'
data = pd.read_csv('Life_expectancy_dataset.csv')

# Encode the 'Continent' column as integer class labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(data['Continent'])
class_ids = list(range(len(label_encoder.classes_)))

# Split the dataset, holding out 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    data[['Overall Life', 'Male Life', 'Female Life']], y_encoded, test_size=0.3, random_state=42)

# Scale the features: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# K-Nearest Neighbors model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Make predictions and evaluate the model
# Passing labels explicitly keeps the report valid even if a continent is missing from the test split
y_pred = knn.predict(X_test_scaled)
print('Classification Report:\n', classification_report(
    y_test, y_pred, labels=class_ids, target_names=label_encoder.classes_))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred, labels=class_ids))
print('Accuracy Score:', accuracy_score(y_test, y_pred))

# Visualize the test-set predictions on the first two scaled features
plt.figure(figsize=(10, 8))
plt.title('Life Expectancy Data Classification with KNN')
plt.scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], c=y_pred, cmap='viridis',
            vmin=0, vmax=len(class_ids) - 1, marker='o', edgecolor='k', s=20, alpha=0.5)
plt.xlabel('Scaled Overall Life Expectancy')
plt.ylabel('Scaled Male Life Expectancy')
# The formatter receives each tick value as a float, so cast to int before indexing the class names
plt.colorbar(label='Continent', ticks=class_ids,
             format=FuncFormatter(lambda val, loc: label_encoder.classes_[int(val)]))
plt.show()
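
Both cells above fix `n_neighbors=5`, yet KNN is sensitive to that choice. One common way to pick 'k' is to compare cross-validated accuracy over a few candidate values. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled Iris data as a stand-in so that it runs without the life expectancy CSV, and the candidate list is arbitrary. Placing the scaler inside a pipeline means it is refit on the training part of each fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): swap in the life expectancy features and continent labels
X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]  # arbitrary candidates chosen for illustration
mean_scores = {}
for k in candidate_ks:
    # Scaling and KNN are chained so each fold is scaled using its own training data
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("Best k:", best_k)
```

With the real dataset, the same loop would run on the training split only, leaving the test split untouched for the final evaluation.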