# 🔧 Transformations for Numerical Data

Some models work better when continuous numerical features are first converted into **discrete or categorical forms**.  
Two common approaches are **binarization** and **k-bins discretization**, available in scikit-learn as `Binarizer` and `KBinsDiscretizer`.

## 🔹 Binarization

- Converts numerical values into **0 or 1** based on a **threshold**.  

$x' = \begin{cases} 1 & \text{if } x > \text{threshold} \\ 0 & \text{otherwise}\end{cases}$

**Example**:  
- Income → {> 3000 = 1, ≤ 3000 = 0}  

**Pros**:  
- Simplifies numeric features into binary categories.  
- Useful for **rule-based learning** or highlighting presence/absence.  

**Cons**:  
- Loses information about magnitude.  
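
The threshold rule above can be sketched directly in NumPy. The income values below are made up for illustration; only the 3000 threshold comes from the example. Note that the comparison is strict, so a value exactly equal to the threshold maps to 0.

```python
import numpy as np

# Hypothetical monthly incomes (illustrative values only).
income = np.array([1200.0, 3000.0, 3000.01, 4500.0, 800.0])
threshold = 3000.0  # threshold from the income example

# Strict comparison: x > threshold -> 1, otherwise 0.
income_bin = np.where(income > threshold, 1, 0)
print(income_bin)  # [0 0 1 1 0]
```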

## 🔹 KBinsDiscretizer

- Divides continuous values into **k discrete bins (intervals)**.  
- Each bin can be encoded in different ways:  
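  - **Ordinal** (`encode="ordinal"`) → each bin gets an integer label (0,1,2,…).  
  - **One-Hot** (`encode="onehot"`) → each bin gets its own binary column, returned as a sparse matrix.  
  - **One-Hot, dense** (`encode="onehot-dense"`) → same columns, returned as a dense array.  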
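
As a rough illustration, the ordinal and one-hot encodings can be compared on a toy array. The data and variable names here are assumptions made for the sketch; `encode="onehot-dense"` is used only so the result prints as a plain array.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy single-feature data (illustrative values only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Ordinal: one integer bin label per value.
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_ord = ordinal.fit_transform(X)

# One-hot (dense): one indicator column per bin.
onehot = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
X_hot = onehot.fit_transform(X)

print(X_ord.ravel())  # bin label per row
print(X_hot)          # one column per bin, a single 1 in each row
```

Each row of `X_hot` has its 1 in the column whose index matches the ordinal label of that row.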

**Example**:  
- Age values → binned into intervals:  
  - [0, 18) → 0, [18, 35) → 1, [35, 60) → 2, [60, ∞) → 3  
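
The age intervals above use hand-picked edges. `KBinsDiscretizer` learns its edges from the data through its `strategy` parameter, so one way to apply fixed edges like these is `np.digitize`, sketched below. The ages are made up for illustration.

```python
import numpy as np

# Hypothetical ages (illustrative values only).
ages = np.array([5, 18, 34, 35, 59, 60, 90])

# Inner edges of the intervals [0, 18), [18, 35), [35, 60), [60, inf).
edges = [18, 35, 60]

# With the default right=False, a value equal to an edge goes to the upper bin.
age_bins = np.digitize(ages, edges)
print(age_bins)  # [0 1 1 2 2 3 3]
```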

**Pros**:  
- Lets linear models capture **non-linear relationships**, since each bin can carry its own effect.  
- Makes continuous data more interpretable.  

**Cons**:  
- Choice of number of bins **k** strongly affects results.  
- May lose fine-grained details.  
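
A small sketch of how **k** changes the result: the same made-up, skewed sample is discretized with two different bin counts, and the learned edges and labels are printed for comparison. The data and the `edges_by_k` name are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative skewed sample: most values are small, one is large.
x = np.array([[1.0], [2.0], [2.0], [3.0], [3.0], [4.0], [20.0]])

edges_by_k = {}
for k in (2, 4):
    est_k = KBinsDiscretizer(n_bins=k, encode="ordinal", strategy="uniform").fit(x)
    edges_by_k[k] = est_k.bin_edges_[0]      # k + 1 edges for k bins
    labels_k = est_k.transform(x).ravel()    # bin label per value
    print(k, edges_by_k[k], labels_k)
```

With equal-width bins, a single large value stretches the range, so changing k mostly subdivides the empty region. Trying `strategy="quantile"` on the same sample is a quick way to see how much the binning depends on these choices.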

---


#### » Load the "tips" dataset

In [4]:
import seaborn as sns
tips = sns.load_dataset("tips")
df = tips.copy()
df.head()

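|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |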


## Binarization

In [24]:
df_new = df[["size"]].astype("float")
df_new

|   | size |
|---|---|
| 0 | 2.0 |
| 1 | 3.0 |
| 2 | 3.0 |
| 3 | 2.0 |
| 4 | 4.0 |
| ... | ... |
| 239 | 3.0 |
| 240 | 2.0 |
| 241 | 2.0 |
| 242 | 2.0 |


In [26]:
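from sklearn import preprocessing
# Party sizes strictly greater than 2 map to 1.0; sizes of 2 or fewer map to 0.0.
binarizer = preprocessing.Binarizer(threshold=2).fit(df_new)  # fit learns nothing; kept for API consistency
binarizer.transform(df_new)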

array([[0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
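
As a quick sanity check, `Binarizer` should agree with a plain `> threshold` comparison. The sketch below rebuilds the first five `size` values from the table above by hand, so it runs without downloading the dataset; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

# First five party sizes from the table above, rebuilt by hand.
sizes = pd.DataFrame({"size": [2.0, 3.0, 3.0, 2.0, 4.0]})

by_sklearn = Binarizer(threshold=2).fit_transform(sizes)
by_hand = (sizes.to_numpy() > 2).astype(float)

# Both approaches should give the same 0/1 column.
print(np.array_equal(by_sklearn, by_hand))
```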

## KBinsDiscretizer

In [29]:
from sklearn.preprocessing import KBinsDiscretizer
df = tips.copy()
dff = df.select_dtypes(include=["float64", "int64"])
dff.head()

|   | total_bill | tip | size |
|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 |
| 1 | 10.34 | 1.66 | 3 |
| 2 | 21.01 | 3.5 | 3 |
| 3 | 23.68 | 3.31 | 2 |
| 4 | 24.59 | 3.61 | 4 |


In [30]:
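# One bin count per column: 3 for total_bill, 2 for tip, 2 for size.
# strategy="quantile" places edges so each bin holds roughly the same number of rows.
est = KBinsDiscretizer(n_bins=[3, 2, 2], encode="ordinal", strategy="quantile").fit(dff)
est.transform(dff)[0:10]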

array([[1., 0., 1.],
       [0., 0., 1.],
       [1., 1., 1.],
       [2., 1., 1.],
       [2., 1., 1.],
       [2., 1., 1.],
       [0., 0., 1.],
       [2., 1., 1.],
       [1., 0., 1.],
       [0., 1., 1.]])
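
The last column in the output above is 1 in every row shown. A plausible explanation, assuming the median party size in the data is 2: with two quantile bins the inner edge sits at the median, and a value equal to an edge is assigned to the upper bin. The sketch below reproduces that effect on made-up sizes (not the real `size` column) and inspects the learned edges through `bin_edges_`.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical party sizes with a median of 2 (not the real tips column).
sizes = np.array([[1.0], [2.0], [2.0], [2.0], [2.0], [3.0], [4.0]])

est_size = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile").fit(sizes)
edges = est_size.bin_edges_[0]              # learned edges: min, median, max
labels = est_size.transform(sizes).ravel()  # only values below the median get label 0

print(edges)
print(labels)
```

Running `est.bin_edges_` on the fitted estimator from the cell above would show the actual edges for `total_bill`, `tip`, and `size`.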