# KNeighbors Regressor (KNN Regressor)
KNeighbors Regressor is a type of instance-based learning method used for regression. It predicts the value of a new sample based on a defined number of nearest neighbors in the training data.

## Advantages:
- Simplicity: Easy to understand and implement.
- No Assumptions: Makes no assumptions about the underlying data distribution.
- Versatility: Can be used for both classification and regression tasks.

## Disadvantages:
- Computationally Expensive: Slow for large datasets because it requires distance calculations for each query.
- Memory Intensive: Requires storing the entire training dataset.
- Sensitive to Irrelevant Features: Performance can be degraded by irrelevant or redundant features.

## Use Case:
- Recommendation Systems: Predicting user preferences.
- Predicting House Prices: Based on similar houses in the neighborhood.
- Patient Diagnosis: Predicting health metrics based on similar patients' data.

## Scaling (necessary)
Yes, scaling is necessary for KNeighbors Regressor because it uses distance metrics (like Euclidean distance) that are sensitive to the magnitude of the features.

## Encoding (necessary)
If you have categorical features, they need to be encoded into numerical values.

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from scipy.stats import uniform, loguniform

# Read Dataset

In [2]:
df = pd.read_csv('50_StartUp_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit,Florida,New York
0,0,165349.2,136897.8,471784.1,192261.83,0.0,1.0
1,1,162597.7,151377.59,443898.53,191792.06,0.0,0.0
2,2,153441.51,101145.55,407934.54,191050.39,1.0,0.0
3,3,144372.41,118671.85,383199.62,182901.99,0.0,1.0
4,4,142107.34,91391.77,366168.42,166187.94,1.0,0.0


# get X , Y

In [3]:
x=df.drop('Profit',axis=1)
y=df['Profit']

## Get train, test and valid data

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=.1, random_state=42)
x_train, x_valid, y_train, y_valid=train_test_split(x_train,y_train,test_size=.1, random_state=42)

In [5]:
print('x_train shape =',x_train.shape)
print('x_test shape =',x_test.shape)
print('x_valid shape =',x_valid.shape)
print('y_train shape =',y_train.shape)
print('y_test shape =',y_test.shape)
print('y_valid shape =',y_valid.shape)

x_train shape = (40, 6)
x_test shape = (5, 6)
x_valid shape = (5, 6)
y_train shape = (40,)
y_test shape = (5,)
y_valid shape = (5,)


# Scaling

In [6]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x_train=scaler.fit_transform(x_train)
x_valid=scaler.transform(x_valid)
x_test=scaler.transform(x_test)

# Train

## Grid Search

In [7]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

knn = KNeighborsRegressor()

params = {
    'n_neighbors' : [3,5,7,9,11,13] , 
    'p' : [1,2] , 
    'weights': ["uniform", "distance"]
}

param_grid = {
    'n_neighbors' : [3, 5, 7, 9, 11, 13, 15, 17, 19, 21] , 
    'p' : [1,2] , 
    'weights': ["uniform", "distance"],
    'regressor__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'regressor__leaf_size': [10, 20, 30, 40, 50]
}

grid_search = GridSearchCV(knn, params, scoring='r2', cv=5, n_jobs=-1)

# Train the grid search
grid_search.fit(x_train, y_train)  

In [8]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Best Hyperparameter Index: 1
Best Hyperparameters: {'n_neighbors': 3, 'p': 1, 'weights': 'distance'}
Best Cross-Validated Score: 0.8529792162338712


In [9]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)

## Randomized Search

In [10]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV

knn = KNeighborsRegressor()

params = {
    'n_neighbors' : np.arange(1, 31) , 
    'p' : [1,2] , 
    'weights': ["uniform", "distance"]
}

param_dist = {
    'n_neighbors': np.arange(1, 31),
    'p': [1, 2],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': np.arange(10, 51)
}

random_search = RandomizedSearchCV(knn, params, scoring='r2', n_iter=10, cv=5, n_jobs=-1, random_state=42)

# Train the random search
random_search.fit(x_train, y_train)

found 0 physical cores < 1
  File "c:\Users\PC\anaconda3\Lib\site-packages\joblib\externals\loky\backend\context.py", line 282, in _count_physical_cores
    raise ValueError(f"found {cpu_count_physical} physical cores < 1")


In [11]:
# print("Best Hyperparameter Index:", random_search.best_index_)
# print("Best Hyperparameters:", random_search.best_params_)
# print("Best Cross-Validated Score:", random_search.best_score_)

In [12]:
# model = random_search.best_estimator_
# y_pred = model.predict(x_test)

## Train KNeighborsRegressor without search

In [13]:
from sklearn.neighbors import KNeighborsRegressor
model=KNeighborsRegressor(n_neighbors = 5, p = 1, weights = 'uniform')
# model=KNeighborsRegressor(n_neighbors = 5, p = 1, weights = 'uniform', algorithm = 'auto', leaf_size = 30)
model.fit(x_train, y_train)

# Check overfiiting

In [14]:
y_train_pred=model.predict(x_train)
r2_score(y_train_pred , y_train)

0.8396388914270418

In [15]:
y_valid_pred=model.predict(x_valid)
r2_score(y_valid_pred , y_valid)

-0.6164730823773017

# Evaluate model

In [16]:
y_pred = model.predict(x_test)

## r2_score

In [17]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

0.9467415950366211

## mean_squared_error

In [18]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
mse

36273880.75052003

## mean_absolute_error

In [19]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
mae

4079.308000000002