In [1]:
import pandas as pd
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [2]:
df = pd.DataFrame({"Names": ["Varun", "Sagar", "Tejshree", "Nilesh"],
                   "Salary": [70000, 60000, 52000, 45000]})
df

      Names  Salary
0     Varun   70000
1     Sagar   60000
2  Tejshree   52000
3    Nilesh   45000


In [3]:
# 1  MinMaxScaler

scalar = MinMaxScaler()

In [4]:
scalar.fit_transform(df[['Salary']])

array([[1.  ],
       [0.6 ],
       [0.28],
       [0.  ]])

In [5]:
# 2  StandardScaler

scalar = StandardScaler()

In [6]:
scalar.fit_transform(df[['Salary']])

array([[ 1.42310728],
       [ 0.34906405],
       [-0.51017053],
       [-1.26200079]])

In [7]:
# 3  MaxAbsScaler

scalar = MaxAbsScaler()

In [8]:
scalar.fit_transform(df[['Salary']])

array([[1.        ],
       [0.85714286],
       [0.74285714],
       [0.64285714]])


---

# 📌 1. **Min-Max Scaling (Normalization)**

**Definition:**
Rescales the feature values into a fixed range, usually \[0,1].

**Equation:**

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

**Range:**
$[0,1]$ (or any custom range $[a,b]$)

**Example:**
Data = $[10, 20, 30, 40, 50]$
For $x = 30$:

$$
x' = \frac{30 - 10}{50 - 10} = 0.5
$$

**When to use:**

* When features have **different units** (e.g., height in cm vs weight in kg).
* Algorithms that rely on **distance metrics** (e.g., KNN, K-Means, SVM) because scale impacts distance.
* Neural networks (helps faster convergence).

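⚠️ Sensitive to **outliers**: a single extreme value sets $x_{\min}$ or $x_{\max}$ and compresses all remaining values into a narrow band.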
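
The formula above can be checked with a small from-scratch sketch. The helper name `min_max_scale` and its `new_min` / `new_max` parameters are hypothetical and use only the standard library; in scikit-learn the corresponding transformer is `MinMaxScaler`, as used in the notebook cells earlier.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


data = [10, 20, 30, 40, 50]
scaled = min_max_scale(data)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Custom range [a, b] = [-1, 1]: the midpoint 30 maps to 0.
centered = min_max_scale(data, -1.0, 1.0)

# One extreme value pushes every other point toward 0.
with_outlier = min_max_scale([10, 20, 30, 40, 1000])
print(with_outlier)
```

In the last call, the value 40 ends up at roughly 0.03 instead of 0.75, which illustrates the outlier sensitivity noted above.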

---

# 📌 2. **Standardization (Z-Score Normalization)**

**Definition:**
Centers the data by subtracting the mean and scales it to have unit variance.

**Equation:**

$$
x' = \frac{x - \mu}{\sigma}
$$

**Range:**
Unbounded $(-\infty, \infty)$; after scaling the feature has mean 0 and standard deviation 1.

**Example:**
Data = $[10, 20, 30, 40, 50]$
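Mean = 30, Std = 15.81 (sample standard deviation, dividing by $n - 1$; the population standard deviation used by scikit-learn's `StandardScaler` is 14.14, which would give $10 / 14.14 \approx 0.71$ instead)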
For $x = 40$:

$$
x' = \frac{40 - 30}{15.81} = 0.63
$$

**When to use:**

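* When features are roughly **Gaussian** (bell-shaped). The transform also works on other distributions, but it does not make them normal.
* Works well with algorithms that are sensitive to feature scale or benefit from centered inputs (e.g., **Linear Regression, Logistic Regression, PCA**).
* Less distorted by outliers than Min-Max, although the mean and standard deviation are still pulled by them.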
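
The worked example depends on which standard deviation is used, so a quick check may help. This sketch uses only the standard library's `statistics` module; the variable names are illustrative. As far as the notebook output earlier shows, `StandardScaler` behaves like the population version (dividing by $n$).

```python
import statistics

data = [10, 20, 30, 40, 50]
mean = statistics.mean(data)

pop_std = statistics.pstdev(data)    # divides by n
sample_std = statistics.stdev(data)  # divides by n - 1

z_pop = [(v - mean) / pop_std for v in data]
z_sample = [(v - mean) / sample_std for v in data]

print(round(pop_std, 2), round(sample_std, 2))    # 14.14 15.81
print(round(z_pop[3], 2), round(z_sample[3], 2))  # 0.71 0.63
```

Either convention centers the data at zero; only the population version yields a standard deviation of exactly 1 when measured the same way.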

---

# 📌 3. **Mean Normalization**

**Definition:**
Centers the data around zero and rescales within -1 to 1.

**Equation:**

$$
x' = \frac{x - \mu}{x_{\max} - x_{\min}}
$$

**Range:**
Within $[-1, 1]$; the scaled values always span a total width of exactly 1.

**Example:**
Data = $[10, 20, 30, 40, 50]$
Mean = 30, Min = 10, Max = 50
For $x = 40$:

$$
x' = \frac{40 - 30}{50 - 10} = 0.25
$$

**When to use:**

* When you need **zero-centered data** (mean=0).
* Rarely used in practice (Z-score or Min-Max are more common).
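
Since none of the scalers imported in the notebook implements this variant, a minimal from-scratch sketch may be useful. The function name `mean_normalize` is hypothetical.

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max - min)."""
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    if span == 0:
        # A constant feature carries no information; return zeros.
        return [0.0 for _ in values]
    return [(v - mean) / span for v in values]


data = [10, 20, 30, 40, 50]
normalized = mean_normalize(data)
print(normalized)  # [-0.5, -0.25, 0.0, 0.25, 0.5]
```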

---

# 📌 4. **Max-Abs Scaling**

**Definition:**
Scales each feature by its maximum absolute value. The data is not shifted or centered, so zeros stay zero and signs are preserved.

**Equation:**

$$
x' = \frac{x}{\max_i |x_i|}
$$

**Range:**
$[-1, 1]$.

**Example:**
Data = $[-50, -20, 0, 20, 50]$
Max abs = 50
For $x = -20$:

$$
x' = \frac{-20}{50} = -0.4
$$

**When to use:**

* For **sparse data** (many zeros), e.g., text data with **TF-IDF / Bag-of-Words**.
* When negative values are meaningful and should be preserved.
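
A short sketch of the scaling, assuming a plain Python list as input. The helper name `max_abs_scale` is hypothetical; the scikit-learn equivalent is `MaxAbsScaler`. The second call shows why the divisor has to be the largest absolute value rather than the largest value.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the feature."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        # All zeros: nothing to scale.
        return list(values)
    return [v / largest for v in values]


data = [-50, -20, 0, 20, 50]
scaled = max_abs_scale(data)
print(scaled)  # [-1.0, -0.4, 0.0, 0.4, 1.0]

# The most extreme value is negative here, so the divisor is 80, not 20.
skewed = max_abs_scale([-80, 0, 0, 20])
print(skewed)  # [-1.0, 0.0, 0.0, 0.25]
```

Zero entries remain exactly zero in both results, which is the property that keeps sparse data sparse.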

---

# 📌 5. **Robust Scaling (Median & IQR Scaling)**

**Definition:**
Uses **median** and **interquartile range (IQR = Q3 - Q1)** instead of mean/std, making it less sensitive to outliers.

**Equation:**

$$
x' = \frac{x - \text{median}}{IQR}
$$

**Range:**
Unbounded; the central 50% of the data (Q1 to Q3) spans a width of 1 after scaling, and outliers land well outside it.

**Example:**
Data = $[10, 20, 30, 40, 100]$
Median = 30, Q1 = 20, Q3 = 40 → IQR = 20
For $x = 100$:

$$
x' = \frac{100 - 30}{20} = 3.5
$$

**When to use:**

* When dataset has **outliers** (e.g., salaries, housing prices).
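* Useful ahead of scale-sensitive models (linear, distance-based, or robust regression). Tree-based models are largely unaffected by feature scaling, so they gain little from it.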
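
The example can be reproduced with the standard library. This is a sketch under one assumption: quartiles are computed with linear interpolation between data points (`method="inclusive"` in `statistics.quantiles`), which matches the Q1 and Q3 quoted above. The function name `robust_scale` is hypothetical; the scikit-learn equivalent is `RobustScaler`.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    # n=4 returns the three quartile cut points: Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # No spread in the middle half of the data; return zeros.
        return [0.0 for _ in values]
    return [(v - median) / iqr for v in values]


data = [10, 20, 30, 40, 100]
scaled = robust_scale(data)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 3.5]
```

The outlier 100 maps to 3.5, while the four ordinary values keep an even spacing of 0.5.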

---

# 📌 6. **Unit Vector Scaling (Normalization using Norms)**

**Definition:**
Scales each sample vector so that its length (L2 norm) becomes 1. Useful for **directional data**, where only the orientation of the vector matters.

**Equation:**

$$
x' = \frac{x}{||x||_2}, \quad ||x||_2 = \sqrt{\sum x_i^2}
$$

**Range:**
Each sample vector (row) has length 1; its individual components lie in $[-1,1]$.

**Example:**
Vector = $[3, 4]$
Norm = $\sqrt{3^2 + 4^2} = 5$

$$
x' = \left[\frac{3}{5}, \frac{4}{5}\right] = [0.6, 0.8]
$$

**When to use:**

* In **text classification / NLP** (TF-IDF vectors, cosine similarity).
* When the **magnitude** of features is less important than **direction**.
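
The following sketch works through the $[3, 4]$ example and shows that vectors pointing the same way collapse onto the same unit vector. The helper name `l2_normalize` is hypothetical and the code uses only the standard library.

```python
import math


def l2_normalize(vector):
    """Divide a vector by its Euclidean (L2) norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        # The zero vector has no direction; return it unchanged.
        return list(vector)
    return [v / norm for v in vector]


unit = l2_normalize([3, 4])
print(unit)  # [0.6, 0.8]

# Ten times longer, same direction: the result is identical.
same_direction = l2_normalize([30, 40])
print(same_direction == unit)  # True
```

Once vectors have unit length, their dot product equals their cosine similarity, which is why this scaling pairs naturally with TF-IDF features.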

---

✅ **Comparison Table**

| Method                    | Range              | Sensitive to Outliers? | When to Use                                     |
| ------------------------- | ------------------ | ---------------------- | ----------------------------------------------- |
| Min-Max                   | \[0,1] (or \[a,b]) | ✅ Yes                  | Distance-based models, Neural Nets              |
| Z-Score (Standardization) | (-∞,∞), mean=0     | ⚠️ Somewhat            | Roughly Gaussian features, Regression, PCA      |
| Mean Normalization        | \[-1,1] approx     | ✅ Yes                  | Rarely used, when zero-centering needed         |
| Max-Abs                   | \[-1,1]            | ✅ Yes                  | Sparse data (TF-IDF, Bag-of-Words)              |
| Robust Scaling            | Unbounded          | ❌ No                   | Outlier-heavy data (salaries, prices)           |
| Unit Vector               | \[-1,1], length=1  | ⚠️ Somewhat            | Text/NLP, cosine similarity                     |
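
The outlier column of the table can be probed with a small experiment. This sketch measures how far apart two ordinary values (20 and 40) end up after scaling, once on clean data and once with an extreme value added. The function name `spread_after_scaling` and the chosen datasets are illustrative assumptions. Because all three scalers are linear, the scaled distance is simply the raw gap divided by each scaler's divisor.

```python
import statistics


def spread_after_scaling(values, lo=20, hi=40):
    """Scaled distance between two inliers under three scalers."""
    gap = hi - lo
    value_range = max(values) - min(values)  # Min-Max divisor
    std = statistics.pstdev(values)          # Z-score divisor
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min_max": gap / value_range,
        "z_score": gap / std,
        "robust": gap / (q3 - q1),           # IQR divisor
    }


clean = spread_after_scaling([10, 20, 30, 40, 50])
dirty = spread_after_scaling([10, 20, 30, 40, 1000])

for name in clean:
    print(f"{name:8s} clean={clean[name]:.3f}  with_outlier={dirty[name]:.3f}")
```

With the outlier present, the Min-Max and Z-score gaps shrink sharply, while the robust gap is unchanged because the quartiles do not move.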

---


