## [K-Nearest Neighbor Regressor](https://medium.com/data-science/k-nearest-neighbor-regressor-explained-a-visual-guide-with-code-examples-df5052c8c889)

> Finding the neighbors FAST with KD Trees and Ball Trees

The K-Nearest Neighbor (KNN) Regressor is a straightforward predictive model that estimates a value by averaging the outcomes of nearby data points. This method builds on the idea that similar inputs likely yield similar outputs.

The **Nearest Neighbor Regressor** works similarly to its classifier counterpart, but instead of voting on a class, it averages the target values. Here's the basic process:

- Calculate the distance between the new data point and all points in the training set.
- Select the K nearest neighbors based on these distances.
- Calculate the average of the target values of these K neighbors.
- Assign this average as the predicted value for the new data point.
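
The four steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy arrays are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a target for x_new by averaging its k nearest training targets."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Steps 3 and 4: the mean of those neighbors' targets is the prediction
    return y_train[nearest].mean()

# Toy data, invented for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])

prediction = knn_predict(X_toy, y_toy, x_new=[1.2], k=2)
print(prediction)  # the two closest points are x=1.0 and x=2.0, so this prints 25.0
```

Because every prediction scans the whole training set, the cost grows linearly with the number of training points, which is the motivation for the tree structures described next.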

In [1]:
!pip install -q pandas numpy scikit-learn matplotlib

In [2]:
import warnings
warnings.filterwarnings('ignore')

#### KD Tree for KNN Regression

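A KD Tree (K-Dimensional Tree) is a binary tree that organizes points in a k-dimensional space by repeatedly splitting the data along one feature axis at a time. Here k is the number of dimensions, not the K in "K nearest neighbors". It’s particularly useful for nearest neighbor searches and range searches in multidimensional data, because branches that cannot contain a closer point can be skipped instead of checked one by one.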
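
As a rough illustration, scikit-learn exposes this structure directly as `sklearn.neighbors.KDTree`. The sketch below builds a tree on random 2-D points (synthetic data, assumed only for this example), asks for the 3 nearest neighbors of one query point, and checks the result against a plain brute-force search.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Synthetic 2-D points, generated only for this illustration
rng = np.random.default_rng(0)
points = rng.random((200, 2))

# Build the tree once; leaf_size controls when splitting stops
kd_tree = KDTree(points, leaf_size=5)

# Query the 3 nearest neighbors of a single point
query_point = np.array([[0.5, 0.5]])
kd_dist, kd_idx = kd_tree.query(query_point, k=3)

# Brute-force check: sort all distances and keep the 3 smallest
brute_idx = np.argsort(np.linalg.norm(points - query_point[0], axis=1))[:3]

print("KD Tree neighbors:    ", sorted(kd_idx[0].tolist()))
print("Brute-force neighbors:", sorted(brute_idx.tolist()))
```

The tree returns the same neighbors as the exhaustive search; the difference is in how many distances have to be computed to find them.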

#### Ball Tree for KNN Regression

A Ball Tree is another space-partitioning data structure that organizes points in a series of nested hyperspheres ("balls"). It’s often a better choice for higher-dimensional data, where KD Trees tend to become less efficient.
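
To connect the tree back to regression, here is a small sketch using `sklearn.neighbors.BallTree` on synthetic 10-dimensional data (the data and variable names are assumptions for this example). It looks up the 5 nearest neighbors for a few query points, averages their targets by hand, and compares the result with `KNeighborsRegressor(algorithm='ball_tree')`.

```python
import numpy as np
from sklearn.neighbors import BallTree, KNeighborsRegressor

# Synthetic 10-dimensional data, generated only for this illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 10))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)
X_query = rng.normal(size=(5, 10))

# Find the 5 nearest training points for each query point
ball_tree = BallTree(X_demo, leaf_size=5)
_, neighbor_idx = ball_tree.query(X_query, k=5)  # shape (5, 5)

# KNN regression by hand: average the targets of those neighbors
manual_pred = y_demo[neighbor_idx].mean(axis=1)

# The same thing through the regressor API
reg = KNeighborsRegressor(n_neighbors=5, algorithm='ball_tree', leaf_size=5)
reg.fit(X_demo, y_demo)
sklearn_pred = reg.predict(X_query)

print("Manual averages match the regressor:", np.allclose(manual_pred, sklearn_pred))
```

The tree only changes how the neighbors are found; the prediction itself is still the plain average of their target values.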

In [3]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error  # requires scikit-learn >= 1.4
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}

df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'])

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Identify numerical columns
numerical_columns = ['Temperature', 'Humidity']

# Create a ColumnTransformer to scale only numerical columns
ct = ColumnTransformer([
    ('scaler', StandardScaler(), numerical_columns)
], remainder='passthrough')

# Fit the ColumnTransformer on the training data and transform both training and test data
X_train_transformed = ct.fit_transform(X_train)
X_test_transformed = ct.transform(X_test)

# Convert the transformed data back to DataFrames
feature_names = numerical_columns + [col for col in X_train.columns if col not in numerical_columns]
X_train_scaled = pd.DataFrame(X_train_transformed, columns=feature_names, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_transformed, columns=feature_names, index=X_test.index)

# Initialize and train KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5,
                          algorithm='kd_tree',  # alternatives: 'ball_tree', 'brute'
                          leaf_size=5)  # default is 30
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")

RMSE: 8.6698
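
The `algorithm` parameter changes how neighbors are searched, not which neighbors are found, so the three options should normally give the same predictions. The sketch below checks this on synthetic data (randomly generated for this example, not the golf dataset above); exact ties in distance could in principle be broken differently, which continuous random data makes unlikely.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, generated only for this comparison
rng = np.random.default_rng(7)
X_syn = rng.random((500, 3))
y_syn = X_syn.sum(axis=1)
X_new = rng.random((20, 3))

# Fit one model per search strategy and collect the predictions
predictions = {}
for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    model = KNeighborsRegressor(n_neighbors=5, algorithm=algorithm, leaf_size=5)
    model.fit(X_syn, y_syn)
    predictions[algorithm] = model.predict(X_new)

# Compare each strategy against the exhaustive search
same_as_brute = {
    name: bool(np.allclose(pred, predictions['brute']))
    for name, pred in predictions.items()
}
print(same_as_brute)
```

On a dataset as small as the 14 training rows used above, the choice makes little practical difference; the trees pay off as the training set grows.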
