# KNeighborsClassifier
KNeighborsClassifier (k-NN) is an instance-based (non-generalizing) learner: it does not build an explicit model but stores the training instances. At prediction time, it assigns a test instance the most common class among its k nearest neighbors.
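The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation; the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # B
print(knn_predict(train_points, train_labels, (5.1, 5.0)))  # M
```

Note that every prediction scans the whole training set, which is why prediction cost grows with dataset size.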

## Advantages
- Simple to Implement: Easy to understand and implement.
- No Training Phase: Fitting only stores the training data, so training is very fast; the computational cost is deferred to prediction.
- Flexibility: Can adapt to any problem as long as a suitable distance metric is defined.
- Interpretable: The classification results are easy to interpret.
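The distance metrics that this notebook later searches over can be written out directly. The sketch below is plain Python with hypothetical helper names; it is meant to show how the metrics relate, in particular that Minkowski distance with `p=2` is the Euclidean distance and with `p=1` is the Manhattan distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)

def euclidean(a, b):
    return minkowski(a, b, 2)

def chebyshev(a, b):
    """Largest coordinate-wise difference (the limit of Minkowski as p grows)."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
print(chebyshev(a, b))  # 4.0
```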

## Disadvantages
- Computationally Intensive: Requires high computation time and memory during the prediction phase, especially for large datasets.
- Sensitive to Noise: Can be sensitive to noisy data and irrelevant features.
- Curse of Dimensionality: Performance degrades with high-dimensional data as distances become less meaningful.
- No Explicit Model: There is no fitted model to inspect, so feature importances and similar insights cannot be derived directly.

## Use Cases
- Pattern Recognition: Image and handwriting recognition.
- Recommendation Systems: Suggesting products based on user similarity.
- Medical Diagnosis: Predicting disease based on symptoms.
- Anomaly Detection: Identifying outliers in data.

## Scaling (necessary)
KNeighborsClassifier requires feature scaling because it relies on distance metrics such as Euclidean distance, which are dominated by the features with the largest magnitudes.
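To make the effect concrete, here is a small sketch with made-up numbers (the feature names and values are hypothetical). It standardizes each column to zero mean and unit variance using the population standard deviation, and shows that the nearest neighbor of the first row changes once the large-magnitude feature no longer dominates.

```python
import math
from statistics import mean, pstdev

def standardize(columns):
    """Rescale each feature column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        scaled.append([(v - mu) / sigma for v in col])
    return scaled

def nearest_to_first(columns):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    rows = list(zip(*columns))
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Hypothetical features on very different scales
area = [500.0, 510.0, 560.0, 600.0]
smoothness = [0.08, 0.20, 0.081, 0.14]

print(nearest_to_first([area, smoothness]))               # 1: area dominates
print(nearest_to_first(standardize([area, smoothness])))  # 2: both features count
```

On the raw values, row 1 looks closest only because its area differs by 10 units; after standardization, row 2, whose smoothness is almost identical, becomes the nearest neighbor.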

## Encoding (necessary)
Categorical features must be encoded as numerical values before distances can be computed.
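One common encoding is one-hot encoding, which is what the `OneHotEncoder` imported in this notebook provides. The sketch below is a plain-Python illustration of the idea only; the `one_hot` helper and the example column are hypothetical.

```python
def one_hot(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical categorical column
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

With this encoding, any two different categories are the same distance apart, which avoids implying an order that integer codes such as 0, 1, 2 would introduce.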

# Import library

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Read data

In [2]:
# Load the dataset and separate the features (x) from the target (y)
df = pd.read_csv('Breast_Cancer.csv')
x = df.drop('diagnosis', axis=1)
y = df['diagnosis']

In [3]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Scale data

In [4]:
# Fit the scaler on the training split only, then reuse its statistics on the test split
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Train

## Grid Search

In [5]:
knn = KNeighborsClassifier()

# Search space: 3 metrics x 30 values of k x 2 weighting schemes = 180 candidates.
# Note: 'minkowski' with the default p=2 is equivalent to 'euclidean'.
# A wider grid (n_neighbors up to 50, plus 'chebyshev') could also be tried.
params = {
    'n_neighbors': range(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Exhaustive search with 5-fold cross-validation, scored by accuracy
grid_search = GridSearchCV(knn, params, scoring='accuracy', cv=5, n_jobs=-1)

# Train the grid search
grid_search.fit(x_train, y_train)

In [6]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Best Hyperparameter Index: 19
Best Hyperparameters: {'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'distance'}
Best Cross-Validated Score: 0.9670329670329672
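The reported index can be related back to the grid by enumerating the candidates by hand. The sketch below assumes that the grid is expanded with keys sorted alphabetically and the last key varying fastest, which is my understanding of how scikit-learn orders grid candidates; treat that ordering as an assumption rather than a documented guarantee.

```python
from itertools import product

params = {
    'n_neighbors': range(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
}

# Assumed ordering: keys sorted alphabetically, last key varying fastest
keys = sorted(params)
candidates = [
    dict(zip(keys, combo))
    for combo in product(*(params[key] for key in keys))
]

print(len(candidates))  # 180 candidates, each fitted once per fold (cv=5)
print(candidates[19])   # {'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'distance'}
```

Under that assumption, index 19 corresponds to the same combination that the search printed as its best hyperparameters.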


In [7]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)



## Randomized Search

In [10]:
knn = KNeighborsClassifier()

# Same search space as the grid search (180 possible combinations).
# A wider space (n_neighbors up to 50, plus 'chebyshev') could also be tried.
params = {
    'n_neighbors': range(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Evaluate only n_iter=10 randomly sampled combinations, each with 5-fold cross-validation
random_search = RandomizedSearchCV(knn, params, scoring='accuracy', n_iter=10, cv=5, n_jobs=-1, random_state=42)

# Train the random search
random_search.fit(x_train, y_train)

In [12]:
print("Best Hyperparameter Index:", random_search.best_index_)
print("Best Hyperparameters:", random_search.best_params_)
print("Best Cross-Validated Score:", random_search.best_score_)

Best Hyperparameter Index: 0
Best Hyperparameters: {'weights': 'distance', 'n_neighbors': 10, 'metric': 'euclidean'}
Best Cross-Validated Score: 0.9670329670329672
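The difference from the grid search is only in how many candidates are evaluated. The sketch below illustrates the idea with the standard library: it draws 10 of the 180 combinations without replacement. The seed is hypothetical and this does not reproduce the draws that `RandomizedSearchCV` makes with `random_state=42`.

```python
import random
from itertools import product

params = {
    'n_neighbors': range(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
}

# Every possible combination in the search space
keys = sorted(params)
all_candidates = [
    dict(zip(keys, combo))
    for combo in product(*(params[key] for key in keys))
]

# Draw 10 distinct candidates (hypothetical seed, illustration only)
rng = random.Random(42)
sampled = rng.sample(all_candidates, 10)

print(f"{len(sampled)} of {len(all_candidates)} candidates evaluated")
```

Only about 6% of the space is tried, so the result depends on the seed; here the random search happened to land on the same combination as the exhaustive search.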


In [13]:
model = random_search.best_estimator_
y_pred = model.predict(x_test)

## Train KNeighborsClassifier without search

In [14]:
from sklearn.neighbors import KNeighborsClassifier

# Fit directly with the best hyperparameters reported by the searches above
model = KNeighborsClassifier(metric='euclidean', n_neighbors=10, weights='distance')
model.fit(x_train, y_train)
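None of the cells above score predictions against `y_test`. A minimal sketch of how that could be done is shown below in plain Python; the label lists are hypothetical stand-ins for `y_test` and `model.predict(x_test)`, not results from this dataset. In practice, `sklearn.metrics.accuracy_score` and `sklearn.metrics.confusion_matrix` provide the same quantities.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))

# Hypothetical labels standing in for y_test and model.predict(x_test)
y_true_demo = ["M", "B", "B", "M", "B", "M", "B", "B"]
y_pred_demo = ["M", "B", "B", "B", "B", "M", "B", "M"]

print(accuracy(y_true_demo, y_pred_demo))  # 0.75
for (true_label, pred_label), count in sorted(confusion_counts(y_true_demo, y_pred_demo).items()):
    print(f"true={true_label} predicted={pred_label}: {count}")
```

Test-set accuracy is the number to compare against the cross-validated score printed by the searches, since the test split was never used for tuning.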