# Feature Scaling

Feature scaling is a data preprocessing technique used in machine learning to bring the independent variables (features) of a dataset onto a comparable range. It keeps features with large numeric ranges from dominating model training, which matters most for algorithms that are sensitive to the scale of the data, such as:

- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Gradient Descent-based models (Linear/Logistic Regression)
- Principal Component Analysis (PCA)

🔹 Why is Feature Scaling Needed?
If features have different scales, those with larger ranges can dominate distance computations and gradient updates, so the model effectively ignores the smaller-scale features.

Example:

Age: 20–60

Salary: 20,000–100,000

Distance: 0–5 km

Here, Salary would dominate any distance-based comparison because its values are orders of magnitude larger than those of Age and Distance.
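To make the effect concrete, here is a minimal sketch with two hypothetical people. The values are invented for illustration and only chosen to fall inside the ranges listed above. It computes the Euclidean distance (the default metric for KNN in scikit-learn) and reports how much of the squared distance each feature contributes.

```python
import numpy as np

# Two hypothetical people: [age in years, salary, distance in km].
# These values are invented for illustration only.
a = np.array([25.0, 30_000.0, 1.0])
b = np.array([55.0, 32_000.0, 4.5])

diff = a - b
euclidean = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance contributed by each feature.
share = diff ** 2 / np.sum(diff ** 2)

print(f"Euclidean distance: {euclidean:.2f}")
for name, s in zip(["Age", "Salary", "Distance"], share):
    print(f"{name:>8}: {s:.4%} of the squared distance")
```

Even though the two people differ by 30 years in age, almost all of the distance comes from the 2,000 difference in salary.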



# 📊 Types of Feature Scaling

**1. Min-Max Scaling (Normalization)**
X_scaled = (X - X_min) / (X_max - X_min)
Range: [0, 1]

Used when: You want features bounded between 0 and 1.
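As a quick check of the formula, the following sketch applies min-max scaling by hand with NumPy to the same four salaries used in the scikit-learn example later in this notebook. The variable names are illustrative.

```python
import numpy as np

# Same four salaries as in the scikit-learn example later in the notebook.
salary = np.array([70_000, 65_000, 52_000, 45_000], dtype=float)

x_min, x_max = salary.min(), salary.max()

# (X - X_min) / (X_max - X_min)
min_max_scaled = (salary - x_min) / (x_max - x_min)
print(min_max_scaled)
```

Note that a constant feature makes the denominator zero, so a hand-written version would need a guard for that case.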

**2. Standardization (Z-score Normalization)**
X_scaled = (X - μ) / σ
Where:
μ is the mean of the feature

σ is the standard deviation of the feature

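Range: Not bounded; the scaled feature has mean = 0 and std = 1

Used when: Features are roughly normally distributed, or the algorithm expects zero-centered inputs with unit variance.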
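The standardization formula can be verified by hand in the same way. This sketch assumes the population standard deviation (`ddof=0`, NumPy's default), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

salary = np.array([70_000, 65_000, 52_000, 45_000], dtype=float)

mu = salary.mean()
sigma = salary.std()  # population standard deviation (ddof=0)

# (X - mean) / standard_deviation
z_scores = (salary - mu) / sigma
print(z_scores)
print("mean:", z_scores.mean(), "std:", z_scores.std())
```

After scaling, the mean is 0 and the standard deviation is 1 up to floating-point error, but the individual values are not confined to any fixed interval.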

**3. MaxAbs Scaling**
X_scaled = X / max(|X|)
Range: [-1, 1]

Used when: Data is sparse or already centered at 0.
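MaxAbs scaling divides every value by the largest absolute value in the feature, so it never shifts the data. The sketch below, with illustrative variable names and a made-up second array, shows that zeros stay zero (which is why the method suits sparse data) and that negative values map into [-1, 0).

```python
import numpy as np

# All-positive feature: results land in (0, 1].
salary = np.array([70_000, 65_000, 52_000, 45_000], dtype=float)
salary_scaled = salary / np.max(np.abs(salary))
print(salary_scaled)

# Made-up feature with negatives and a zero: the zero is preserved
# and the value with the largest magnitude maps to -1.
centered = np.array([-4.0, 0.0, 2.0, 1.0])
centered_scaled = centered / np.max(np.abs(centered))
print(centered_scaled)
```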

**4. Robust Scaling (using Median and IQR)**
X_scaled = (X - Median) / IQR

Where IQR = Interquartile Range (Q3 - Q1)

Used when: Data contains outliers.
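A hand computation of robust scaling, as a sketch: it uses `np.percentile` with its default linear interpolation to get Q1 and Q3, and the five salaries (including the 2,000,000 outlier) from the scikit-learn example later in this notebook.

```python
import numpy as np

# Four ordinary salaries plus one extreme outlier.
salary = np.array([70_000, 65_000, 52_000, 45_000, 2_000_000], dtype=float)

median = np.median(salary)
q1, q3 = np.percentile(salary, [25, 75])
iqr = q3 - q1

# (X - Median) / IQR
robust_scaled = (salary - median) / iqr
print("median:", median, "IQR:", iqr)
print(robust_scaled)
```

The outlier still ends up far from the rest, but it does not affect the median or the IQR, so the ordinary values keep a sensible spread.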



In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler


In [None]:
df = pd.DataFrame({"Salary": [70000, 65000, 52000, 45000]})
print(df)

   Salary
0   70000
1   65000
2   52000
3   45000


In [None]:
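# MinMaxScaler: rescales each feature to the range [0, 1]
scalar = MinMaxScaler()
df_scaled = df.copy()  # keep the raw salaries in df for the scalers below
df_scaled['Salary'] = scalar.fit_transform(df[['Salary']])
df_scaled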

Unnamed: 0,Salary
0,1.0
1,0.8
2,0.28
3,0.0


In [None]:
# StandardScaler: mean 0, std 1 (unbounded; roughly normal data mostly lands within [-3, 3])
ss = StandardScaler()
ss.fit_transform(df[['Salary']])

array([[ 1.2030113 ],
       [ 0.70175659],
       [-0.60150565],
       [-1.30326224]])

In [None]:
# MaxAbsScaler: divides by the largest absolute value, so results lie within [-1, 1]
ma = MaxAbsScaler()
ma.fit_transform(df[['Salary']])

array([[1.        ],
       [0.92857143],
       [0.74285714],
       [0.64285714]])

In [None]:
df = pd.DataFrame({"Salary": [70000, 65000, 52000, 45000, 2000000]})
print(df)

    Salary
0    70000
1    65000
2    52000
3    45000
4  2000000


In [None]:
rs = RobustScaler()
rs.fit_transform(df[['Salary']])

array([[  0.27777778],
       [  0.        ],
       [ -0.72222222],
       [ -1.11111111],
       [107.5       ]])
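For contrast, this sketch runs `MinMaxScaler` and `RobustScaler` on the same outlier-contaminated column and compares the spread of the four ordinary salaries under each. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

salary = np.array(
    [[70_000], [65_000], [52_000], [45_000], [2_000_000]], dtype=float
)

min_max = MinMaxScaler().fit_transform(salary).ravel()
robust = RobustScaler().fit_transform(salary).ravel()

# Spread (max - min) of the four ordinary salaries, ignoring the outlier.
typical = slice(0, 4)
min_max_spread = min_max[typical].max() - min_max[typical].min()
robust_spread = robust[typical].max() - robust[typical].min()

print("Min-max:", min_max)
print("Robust: ", robust)
print("Spread of ordinary values, min-max:", min_max_spread)
print("Spread of ordinary values, robust: ", robust_spread)
```

With min-max scaling the outlier takes the value 1 and squeezes the ordinary salaries into a tiny interval near 0, while robust scaling keeps them clearly separated.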

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report

In [None]:
df = pd.read_csv("diabetes.csv")
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [None]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [None]:
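# Min-max scale every column in place, reusing the MinMaxScaler created earlier.
# Note: this also touches Outcome, which is already 0/1 and only becomes 0.0/1.0.
for col in df.columns:
    df[col] = scalar.fit_transform(df[[col]])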

In [None]:
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.352941,0.743719,0.590164,0.353535,0.000000,0.500745,0.234415,0.483333,1.0
1,0.058824,0.427136,0.540984,0.292929,0.000000,0.396423,0.116567,0.166667,0.0
2,0.470588,0.919598,0.524590,0.000000,0.000000,0.347243,0.253629,0.183333,1.0
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.038002,0.000000,0.0
4,0.000000,0.688442,0.327869,0.353535,0.198582,0.642325,0.943638,0.200000,1.0
...,...,...,...,...,...,...,...,...,...
763,0.588235,0.507538,0.622951,0.484848,0.212766,0.490313,0.039710,0.700000,0.0
764,0.117647,0.613065,0.573770,0.272727,0.000000,0.548435,0.111870,0.100000,0.0
765,0.294118,0.608040,0.590164,0.232323,0.132388,0.390462,0.071307,0.150000,0.0
766,0.058824,0.633166,0.491803,0.000000,0.000000,0.448584,0.115713,0.433333,1.0


In [None]:
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = df['Outcome']

In [None]:
# Train/test split: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
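The scaling loop earlier in this notebook fits the scaler on the full dataset before the split, so the minimum and maximum of the test rows influence how the training rows are scaled. A common alternative is to split first, fit the scaler on the training rows only, and reuse it on the test rows. The sketch below illustrates that order on synthetic stand-in data from `make_classification`, because `diabetes.csv` is not bundled here; the `_demo`, `_tr`, and `_te` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the diabetes features (8 numeric columns).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# 1. Split first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2. Learn min and max from the training rows only.
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# 3. Apply the same transformation to the test rows (no refitting).
X_te_scaled = scaler.transform(X_te)

print("train range:", X_tr_scaled.min(), X_tr_scaled.max())
print("test range: ", X_te_scaled.min(), X_te_scaled.max())
```

With this order, scaled test values may fall slightly outside [0, 1], which is expected: the test rows are treated like unseen data.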


In [None]:
# KNN with scikit-learn's defaults (n_neighbors=5, Euclidean distance)
model = KNeighborsClassifier()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.74      0.79      0.76        99
         1.0       0.57      0.51      0.54        55

    accuracy                           0.69       154
   macro avg       0.66      0.65      0.65       154
weighted avg       0.68      0.69      0.68       154
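To see how much scaling can matter for KNN, the sketch below compares an unscaled `KNeighborsClassifier` with the same classifier wrapped in a pipeline that applies `MinMaxScaler` first. It is an illustration on synthetic data, not a rerun of the diabetes experiment: `make_classification` is called with `shuffle=False` so that the last column is known to be pure noise, and that column is multiplied by 10,000 to mimic a large-range feature. All names here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# With shuffle=False the columns are ordered: informative, redundant, noise.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=2, n_redundant=2,
    shuffle=False, random_state=42,
)
# Blow up the scale of the last (noise) column.
X_demo[:, -1] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# KNN on raw features versus KNN behind a scaler fitted on training data only.
unscaled_knn = KNeighborsClassifier().fit(X_tr, y_tr)
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

acc_unscaled = accuracy_score(y_te, unscaled_knn.predict(X_te))
acc_scaled = accuracy_score(y_te, scaled_knn.predict(X_te))

print(f"accuracy without scaling: {acc_unscaled:.2f}")
print(f"accuracy with scaling:    {acc_scaled:.2f}")
```

Because the pipeline fits the scaler inside `fit`, it can also be passed to cross-validation helpers without leaking test statistics into the scaling step.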

