🔍 What is KNN?
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that handles both classification and regression, though it is most often applied to classification.

KNN is a non-parametric, instance-based learning algorithm. It doesn't learn an explicit model; instead, it memorizes the training data and makes decisions at the time of prediction.

🧠 How Does KNN Work?
Choose the number of neighbors, K.

Calculate the distance between the query point and every training point (Euclidean distance is the most common choice).

Select the K nearest neighbors.

For classification: assign the class with the majority vote among the neighbors.

For regression: compute the average of the target values of the K nearest neighbors.
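
The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names and the toy data are made up for the example, distances are Euclidean, and a tied vote simply goes to whichever tied label was seen first among the neighbors.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query`."""
    distances = [
        (math.dist(point, query), index)  # Euclidean distance
        for index, point in enumerate(train_X)
    ]
    distances.sort()  # closest first
    return [index for _, index in distances[:k]]


def knn_classify(train_X, train_y, query, k):
    """Classification: majority vote among the k nearest neighbors."""
    labels = [train_y[i] for i in k_nearest(train_X, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Regression: mean target value of the k nearest neighbors."""
    targets = [train_y[i] for i in k_nearest(train_X, query, k)]
    return sum(targets) / len(targets)


# Toy data (made up for illustration): two clusters in 2D.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
classes = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(points, classes, (1.8, 1.8), k=3))  # a
print(knn_regress(points, values, (6.4, 6.4), k=3))    # 11.0
```

Every prediction scans the full training set, which is why the cost grows with the amount of stored data.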

📐 Common Distance Metrics
For two points (x1, y1) and (x2, y2); with more features, each sum extends over every coordinate:

Euclidean Distance: sqrt((x1 - x2)² + (y1 - y2)²)

Manhattan Distance: |x1 - x2| + |y1 - y2|
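
Both metrics generalize to any number of features by summing over all coordinates. The helper names below are illustrative, and the sketch assumes both points have the same length.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


a = (0.0, 0.0, 0.0)
b = (3.0, 4.0, 12.0)
print(euclidean(a, b))  # 13.0
print(manhattan(a, b))  # 19.0
```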



⚙️ Choosing K
A small K (e.g., 1) makes predictions sensitive to noisy or mislabeled points and can lead to overfitting.

A large K smooths out predictions but may underfit.

Typically, an odd number is chosen to avoid ties in binary classification. With more than two classes, ties can still occur even when K is odd.
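
A tiny made-up example shows all three effects at once. The 1D data below contains one deliberately mislabeled point, and `neighbor_votes` is a hypothetical helper that only counts the labels of the K closest points.

```python
from collections import Counter

# Toy 1D data (made up): class "a" near 1-3, class "b" near 7-9,
# plus one mislabeled "b" at 2.5 acting as noise.
train = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (2.5, "b"),
         (7.0, "b"), (8.0, "b"), (9.0, "b")]


def neighbor_votes(query, k):
    """Count the class labels of the k points closest to `query`."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest)


for k in (1, 2, 3, 7):
    print(k, dict(neighbor_votes(2.6, k)))
```

For the query 2.6, which sits inside the "a" cluster, K=1 follows the noisy point and votes "b", K=2 produces a one-to-one tie, K=3 recovers "a", and K=7 uses every point and returns the overall majority "b" regardless of where the query lies.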

✅ Advantages
Simple to implement and understand.

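No explicit training phase (lazy learner): fitting only stores or indexes the training data, and the real work happens at prediction time.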

Adaptable to multi-class problems.

❌ Disadvantages
Computationally expensive for large datasets (a brute-force prediction computes the distance to every training point).

Sensitive to irrelevant features and the scale of the data.

Poor performance on high-dimensional data due to the "curse of dimensionality".
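
The sensitivity to scale is easy to reproduce. In the made-up data below, the first feature spans thousands while the second spans fractions, so the first one dominates the raw Euclidean distance. The helpers are illustrative; min-max scaling is done by hand here only to keep the sketch dependency-free.

```python
import math

# Made-up data: feature 1 spans thousands, feature 2 spans fractions.
train = [(1000.0, 0.20), (1100.0, 0.90), (2000.0, 0.25)]
query = (1900.0, 0.90)


def nearest_index(points, q):
    """Index of the point with the smallest Euclidean distance to q."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))


def min_max_scale(points, reference):
    """Rescale each feature to [0, 1] using the min and max of `reference`."""
    lows = [min(col) for col in zip(*reference)]
    highs = [max(col) for col in zip(*reference)]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]


raw_nearest = nearest_index(train, query)
scaled_train = min_max_scale(train, train)
scaled_query = min_max_scale([query], train)[0]  # scale with training stats
scaled_nearest = nearest_index(scaled_train, scaled_query)

print(raw_nearest, scaled_nearest)  # 2 1
```

On the raw values the nearest neighbor is the third point, which matches the query only on the large-valued feature. After scaling, both features contribute comparably and the second point becomes the nearest.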

🛠️ Tips
Always normalize or scale your features (e.g., using StandardScaler or MinMaxScaler).

Use cross-validation to choose K rather than fixing it by hand.

Use KD-Trees or Ball Trees for faster nearest-neighbor search in large datasets.
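
The three tips can be combined in one scikit-learn sketch. It runs on synthetic data from `make_classification` rather than on any real dataset, and the names `X_demo`, `y_demo`, the step names, and the candidate grid of K values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own feature matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled using
# statistics from its own training portion only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(algorithm="kd_tree")),
])

# 5-fold cross-validation over a few odd values of K.
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the scaler inside the pipeline, instead of scaling the whole dataset up front, keeps information from the validation folds out of the scaling step. `algorithm="ball_tree"` is the corresponding setting for ball trees, and the default `"auto"` lets scikit-learn choose.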



In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

In [2]:
# Load the glass identification dataset from a local CSV file
data = pd.read_csv("glass.csv")

In [3]:
data.head()

,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [4]:
data["Type"].value_counts()

Type
2    76
1    70
7    29
3    17
5    13
6     9
Name: count, dtype: int64

In [5]:
data.duplicated().sum()

1

In [6]:
# Drop the duplicate row in place; the original index labels are kept
data.drop_duplicates(inplace=True)

In [7]:
# Features: every column except the target
X = data.drop("Type", axis=1)

In [8]:
X

,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe
0,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.0
1,1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.0
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.0
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.0
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.0
...,...,...,...,...,...,...,...,...,...
209,1.51623,14.14,0.00,2.88,72.61,0.08,9.18,1.06,0.0
210,1.51685,14.92,0.00,1.99,73.06,0.00,8.40,1.59,0.0
211,1.52065,14.36,0.00,2.02,73.42,0.00,8.44,1.64,0.0
212,1.51651,14.38,0.00,1.94,73.61,0.00,8.48,1.57,0.0


In [10]:
# Target: the glass type label
y = data["Type"]

In [11]:
y


0      1
1      1
2      1
3      1
4      1
      ..
209    7
210    7
211    7
212    7
213    7
Name: Type, Length: 213, dtype: int64