# Encoding Numerical Data
- Numerical encoding mainly means scaling and transforming numeric features so that ML models can learn patterns more effectively.

## Can be done by:
# Binning (Discretization)
- Converts continuous numerical data into intervals (bins)

- Reduces noise and simplifies data

- Example:
- Age → 0–18, 19–35, 36–60

- Use: When exact values are not important, only ranges matter.
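
The age example above can be sketched with `pd.cut`. The sample ages are invented for illustration, and the edges are an assumption chosen so that the right-closed intervals line up with the 0–18, 19–35 and 36–60 ranges.

```python
import pandas as pd

# Hypothetical sample ages, invented for illustration only.
ages = pd.Series([5, 18, 19, 30, 35, 36, 60])

# pd.cut uses right-closed intervals by default: (0, 18], (18, 35], (35, 60].
# include_lowest=True also keeps an age of exactly 0 in the first bin.
age_bins = pd.cut(
    ages,
    bins=[0, 18, 35, 60],
    labels=["0-18", "19-35", "36-60"],
    include_lowest=True,
)

print(age_bins.tolist())
```

Each exact age is replaced by the label of the range it falls into, so 30 and 35 become indistinguishable to the model.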

# Binarization
- Converts numerical data into binary values (0 or 1)

- Based on a threshold

- Example:
- Income ≥ 50,000 → 1, else → 0

- Use: When only presence/absence or yes/no information is needed.
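
A minimal sketch of the income rule above, using a plain NumPy comparison. The income values are made up for illustration.

```python
import numpy as np

# Hypothetical incomes, invented for illustration only.
incomes = np.array([20_000, 49_999, 50_000, 75_000])
THRESHOLD = 50_000

# Income >= threshold maps to 1, everything else to 0.
high_income = (incomes >= THRESHOLD).astype(int)

print(high_income)  # [0 0 1 1]
```

scikit-learn's `Binarizer` does the same job inside a pipeline, but it maps only values strictly greater than the threshold to 1, so a value exactly equal to the threshold would become 0 there.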

## In short:

- Binning → many ranges

- Binarization → only two values (0 and 1)

# Types of Binning

## 1. Equal-Width (Uniform) Binning

- Divides the entire data range into bins of equal size

- Bin width is calculated as:

$$\text{Width} = \frac{\max - \min}{\text{number of bins}}$$

- Simple and easy to implement

- Drawback: Sensitive to outliers; bins may be unevenly populated

- Example:
- Scores 0–100 → 0–20, 20–40, 40–60, 60–80, 80–100
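
The width formula and the score example can be checked with a short NumPy sketch. The scores are invented, and the choice of left-closed bins (so that a score of exactly 20 goes into the 20–40 bin) is an assumption, since the ranges above share their boundary values.

```python
import numpy as np

# Hypothetical scores, invented for illustration only.
scores = np.array([0, 15, 20, 55, 79, 100])
n_bins = 5

# Width = (max - min) / number of bins
width = (scores.max() - scores.min()) / n_bins

# n_bins equal-width bins need n_bins + 1 edges.
edges = np.linspace(scores.min(), scores.max(), n_bins + 1)

# Digitize against the interior edges only, so the maximum value
# lands in the last bin instead of opening an extra one.
bin_index = np.digitize(scores, edges[1:-1])

print(width)      # 20.0
print(edges)
print(bin_index)  # [0 0 1 2 3 4]
```

A single extreme value would stretch `max - min` and widen every bin, which is the outlier sensitivity mentioned above.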

## 2. Equal-Frequency (Quantile) Binning

- Each bin contains (approximately) the same number of data points

- Bin widths vary, but the frequency per bin is equal

- Works well for skewed data

- Example:
- 100 data points → 4 bins with 25 values each
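
The 100-points-into-4-bins example can be reproduced with `pd.qcut`. The data here is a synthetic, right-skewed sample generated only for this sketch.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample of 100 values (illustration only).
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=10.0, size=100))

# q=4 places the edges at the quartiles; labels=False returns bin indices 0..3.
quartile = pd.qcut(values, q=4, labels=False)

# Number of points per bin, in bin order.
counts = quartile.value_counts().sort_index()
print(counts.tolist())
```

With 100 distinct values every bin receives 25 points. Heavily repeated values can make the counts uneven, because identical values always share a bin.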

## 3. Custom (Manual) Binning

- Bin boundaries are defined using domain knowledge

- Useful when natural cutoffs exist

- Example:
- Age → Child (0–17), Adult (18–59), Senior (60+)
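
One way to express the Child/Adult/Senior cutoffs is `pd.cut` with left-closed intervals and an open-ended last bin. The sample ages are invented, and using `np.inf` as the upper edge for "60+" is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, invented for illustration only.
ages = pd.Series([0, 17, 18, 59, 60, 85])

# right=False makes the intervals left-closed: [0, 18), [18, 60), [60, inf).
groups = pd.cut(
    ages,
    bins=[0, 18, 60, np.inf],
    right=False,
    labels=["Child", "Adult", "Senior"],
)

print(groups.tolist())
```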

## 4. K-Means Binning

- Uses k-means clustering to form bins

- Groups values based on similarity, not fixed ranges

- Adapts well to complex data distributions

- More computationally expensive
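
A small sketch of k-means binning with scikit-learn's `KBinsDiscretizer` and `strategy="kmeans"`. The values are a toy array with three obvious clusters, chosen so that the result is easy to read.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy 1-D data with three well-separated groups (illustration only).
values = np.array([1, 2, 3, 20, 21, 22, 50, 51, 52], dtype=float).reshape(-1, 1)

# strategy="kmeans" places the bin edges midway between 1-D cluster centers.
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
labels = kbins.fit_transform(values).ravel()

print(labels)               # bin index for every value
print(kbins.bin_edges_[0])  # edges learned from the clusters
```

The edges follow the gaps in the data instead of splitting the 1 to 52 range into equal widths or equal counts.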

In [1]:

import pandas as pd
import numpy as np


In [29]:
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer

In [30]:

df = pd.read_csv('train.csv',usecols=['Age','Fare','Survived'])


In [31]:
df.dropna(inplace = True)

In [32]:
df.shape

(714, 3)

In [33]:
df.head()

|   | Survived | Age  | Fare    |
|---|----------|------|---------|
| 0 | 0        | 22.0 | 7.25    |
| 1 | 1        | 38.0 | 71.2833 |
| 2 | 1        | 26.0 | 7.925   |
| 3 | 1        | 35.0 | 53.1    |
| 4 | 0        | 35.0 | 8.05    |


In [34]:
x = df.iloc[:,1:]
y = df.iloc[:,0]


In [35]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)

In [36]:
x_train.head(2)

|     | Age  | Fare    |
|-----|------|---------|
| 328 | 31.0 | 20.525  |
| 73  | 26.0 | 14.4542 |


In [37]:
clf = DecisionTreeClassifier()

In [38]:
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)

In [39]:
accuracy_score(y_test,y_pred)

0.6293706293706294

In [40]:
np.mean(cross_val_score(DecisionTreeClassifier(),x,y,cv=10,scoring='accuracy'))


np.float64(0.630281690140845)

In [42]:
kbin_age = KBinsDiscretizer(n_bins = 10, encode  = 'ordinal', strategy = 'quantile')
kbin_fare = KBinsDiscretizer(n_bins = 10, encode  = 'ordinal', strategy = 'quantile')

In [43]:
trf = ColumnTransformer([
    ('first',kbin_age,[0]),
    ('second',kbin_fare,[1])
])

In [46]:
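# Learn the bin edges from the training split only
x_train_trf = trf.fit_transform(x_train)
# Reuse those edges on the test split (transform, not fit_transform)
x_test_trf = trf.transform(x_test)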



In [47]:
trf.named_transformers_['first'].bin_edges_


array([array([ 1. , 11. , 17. , 20.6, 24. , 28. , 30.1, 34.4, 38.6, 47.8, 62. ])],
      dtype=object)

In [49]:

output = pd.DataFrame({
    'age':x_train['Age'],
    'age_trf':x_train_trf[:,0],
    'fare':x_train['Fare'],
    'fare_trf':x_train_trf[:,1]
})

In [51]:
output['age_labels'] = pd.cut(x=x_train['Age'],
                                    bins=trf.named_transformers_['first'].bin_edges_[0].tolist())
output['fare_labels'] = pd.cut(x=x_train['Fare'],
                                    bins=trf.named_transformers_['second'].bin_edges_[0].tolist())



In [52]:
output.sample(5)

|     | age  | age_trf | fare    | fare_trf | age_labels   | fare_labels       |
|-----|------|---------|---------|----------|--------------|-------------------|
| 266 | 16.0 | 1.0     | 39.6875 | 7.0      | (11.0, 17.0] | (38.1, 57.783]    |
| 1   | 38.0 | 7.0     | 71.2833 | 8.0      | (34.4, 38.6] | (57.783, 512.329] |
| 787 | 8.0  | 0.0     | 29.125  | 7.0      | (1.0, 11.0]  | (28.39, 38.1]     |
| 731 | 11.0 | 0.0     | 18.7875 | 5.0      | (1.0, 11.0]  | (14.454, 22.62]   |
| 92  | 46.0 | 8.0     | 61.175  | 8.0      | (38.6, 47.8] | (57.783, 512.329] |


In [54]:

clf = DecisionTreeClassifier()
clf.fit(x_train_trf,y_train)
y_pred2 = clf.predict(x_test_trf)

In [55]:
accuracy_score(y_test,y_pred2)

0.6853146853146853
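
The same fit-on-train-only discipline can be carried into cross-validation by wrapping the discretizers and the tree in a `Pipeline`, so that the bin edges are re-learned inside every fold. The sketch below is an assumption-laden illustration: it runs on synthetic age-like and fare-like columns with a toy label (not the Titanic data), and the names `X_demo`, `y_demo`, `binner` and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Age/Fare columns (illustration only).
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.uniform(1, 80, size=300),          # age-like column
    rng.exponential(30.0, size=300),       # fare-like, right-skewed column
])
y_demo = (X_demo[:, 1] > 25).astype(int)   # toy label, not real survival data

# One discretizer per column, mirroring the ColumnTransformer used above.
binner = ColumnTransformer([
    ("age", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"), [0]),
    ("fare", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"), [1]),
])

# Inside cross_val_score the whole pipeline is refit on each training fold,
# so the held-out fold never influences the bin edges.
pipe = Pipeline([
    ("bin", binner),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring="accuracy")
print(scores.mean())
```

Passing the real `x` and `y` in place of `X_demo` and `y_demo` would give a cross-validated score for the binned Titanic features without fitting the edges on the full dataset first.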

In [57]:
# Bin the full feature set, then cross-validate on the binned features (x_trf, not x)
x_trf = trf.fit_transform(x)
np.mean(cross_val_score(DecisionTreeClassifier(),x_trf,y,cv=10,scoring='accuracy'))


In [58]:
def discretize(bins,strategy):
    # One discretizer per column, using the requested bin count and strategy
    kbin_age = KBinsDiscretizer(n_bins=bins,encode='ordinal',strategy=strategy)
    kbin_fare = KBinsDiscretizer(n_bins=bins,encode='ordinal',strategy=strategy)
    
    trf = ColumnTransformer([
        ('first',kbin_age,[0]),
        ('second',kbin_fare,[1])
    ])
    
    # Bin the features and cross-validate the tree on the binned version
    x_trf = trf.fit_transform(x)
    print(np.mean(cross_val_score(DecisionTreeClassifier(),x_trf,y,cv=10,scoring='accuracy')))
    
    # Age: raw values vs. bin indices
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    plt.hist(x['Age'])
    plt.title("Age: Before")

    plt.subplot(122)
    plt.hist(x_trf[:,0],color='red')
    plt.title("Age: After")

    plt.show()
    
    # Fare: raw values vs. bin indices
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    plt.hist(x['Fare'])
    plt.title("Fare: Before")

    plt.subplot(122)
    plt.hist(x_trf[:,1],color='red')
    plt.title("Fare: After")

    plt.show()
    

In [59]:
discretize(5,'kmeans')