## [Discretization](https://towardsdatascience.com/discretization-explained-a-visual-guide-with-code-examples-for-beginners-f056af9102fa/?gi=c1bf25229f86)

> 6 fun ways to categorize numbers into bins!

Discretization, also known as binning, is the process of transforming continuous numerical variables into discrete categorical features. It involves dividing the range of a continuous variable into intervals (bins) and assigning data points to these bins based on their values.
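
As a minimal sketch of that definition, the snippet below assigns a few made-up values to bins with `np.digitize`. The values and bin edges are hypothetical and chosen only for illustration; they are not part of the dataset used later.

```python
import numpy as np

# Hypothetical continuous values and bin edges (illustration only)
values = np.array([1.2, 3.7, 5.0, 8.9])
edges = np.array([0, 3, 6, 9])

# np.digitize returns, for each value x, the index i such that
# edges[i-1] <= x < edges[i]
bin_ids = np.digitize(values, edges)

for v, b in zip(values, bin_ids):
    print(f"{v} falls in bin {b}: [{edges[b - 1]}, {edges[b]})")
```

Each continuous value is replaced by the index of the interval it falls into, which is all that binning does at its core.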

#### Why Binning Is Needed

1. **Handling Outliers**: Binning can reduce the impact of outliers without removing data points.
2. **Improving Model Performance**: Some algorithms expect discrete inputs (for example, Categorical Naive Bayes, or Bernoulli Naive Bayes once the bins are one-hot encoded).
3. **Simplifying Visualization**: Binned data can be easier to visualize and interpret.
4. **Reducing Overfitting**: It can prevent models from fitting to noise in high-precision data.

In [1]:
!pip install -q numpy pandas scikit-learn scipy matplotlib

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Create the dataset as a dictionary
data = {
    'UVIndex': [2, 10, 1, 7, 3, 9, 5, 11, 1, 8, 3, 9, 11, 5, 7],
    'Humidity': [15, 95, 10, 98, 18, 90, 25, 80, 95, 40, 20, 30, 85, 92, 12],
    'WindSpeed': [2, 90, 1, 30, 3, 10, 40, 5, 60, 15, 20, 45, 25, 35, 50],
    'RainfallAmount': [5, 2, 7, 3, 18, 3, 0, 1, 25, 0, 9, 0, 18, 7, 0],
    'Temperature': [68, 60, 63, 55, 50, 56, 57, 65, 66, 68, 71, 72, 79, 83, 81],
    'Crowdedness': [0.15, 0.98, 0.1, 0.85, 0.2, 0.9, 0.92, 0.25, 0.12, 0.99, 0.2, 0.8, 0.05, 0.3, 0.95]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

df.head()

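|   | UVIndex | Humidity | WindSpeed | RainfallAmount | Temperature | Crowdedness |
|---|---|---|---|---|---|---|
| 0 | 2 | 15 | 2 | 5 | 68 | 0.15 |
| 1 | 10 | 95 | 90 | 2 | 60 | 0.98 |
| 2 | 1 | 10 | 1 | 7 | 63 | 0.1 |
| 3 | 7 | 98 | 30 | 3 | 55 | 0.85 |
| 4 | 3 | 18 | 3 | 18 | 50 | 0.2 |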


#### Equal-Width Binning
Equal-width binning divides the range of a variable into a specified number of intervals, all with the same width.

In [3]:
# 1. Equal-Width Binning for UVIndex
df['UVIndexBinned'] = pd.cut(df['UVIndex'], bins=4, 
                             labels=['Low', 'Moderate', 'High', 'Very High'])

df.head()

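|   | UVIndex | Humidity | WindSpeed | RainfallAmount | Temperature | Crowdedness | UVIndexBinned |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 15 | 2 | 5 | 68 | 0.15 | Low |
| 1 | 10 | 95 | 90 | 2 | 60 | 0.98 | Very High |
| 2 | 1 | 10 | 1 | 7 | 63 | 0.1 | Low |
| 3 | 7 | 98 | 30 | 3 | 55 | 0.85 | High |
| 4 | 3 | 18 | 3 | 18 | 50 | 0.2 | Low |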
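
To see where the equal-width edges come from, one option is to ask `pd.cut` for them with `retbins=True`. The sketch below repeats the `UVIndex` values from the dataset so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

uv = pd.Series([2, 10, 1, 7, 3, 9, 5, 11, 1, 8, 3, 9, 11, 5, 7], name="UVIndex")

# retbins=True also returns the edges that pandas computed
binned, edges = pd.cut(uv, bins=4, retbins=True)

# Every bin spans (max - min) / 4. pandas nudges the lowest edge slightly
# below the minimum so that the smallest value still lands in the first bin.
width = (uv.max() - uv.min()) / 4
print("bin width:", width)
print("edges:", edges)

# Equal width does not mean equal counts
counts = binned.value_counts().sort_index()
print(counts)
```

The counts per bin differ even though every bin covers the same span, which is the main contrast with equal-frequency binning.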


#### Equal-Frequency Binning (Quantile Binning)
Equal-frequency binning creates bins that contain approximately the same number of observations.

In [4]:
# 2. Equal-Frequency Binning for Humidity
df['HumidityBinned'] = pd.qcut(df['Humidity'], q=3, 
                               labels=['Low', 'Medium', 'High'])

df.head()

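|   | UVIndex | Humidity | WindSpeed | RainfallAmount | Temperature | Crowdedness | UVIndexBinned | HumidityBinned |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 15 | 2 | 5 | 68 | 0.15 | Low | Low |
| 1 | 10 | 95 | 90 | 2 | 60 | 0.98 | Very High | High |
| 2 | 1 | 10 | 1 | 7 | 63 | 0.1 | Low | Low |
| 3 | 7 | 98 | 30 | 3 | 55 | 0.85 | High | High |
| 4 | 3 | 18 | 3 | 18 | 50 | 0.2 | Low | Low |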
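
A quick way to check the "same number of observations" claim is to count the rows per bin and look at the quantile edges. This is a small standalone sketch that repeats the `Humidity` values from the dataset.

```python
import pandas as pd

humidity = pd.Series([15, 95, 10, 98, 18, 90, 25, 80, 95, 40, 20, 30, 85, 92, 12],
                     name="Humidity")

# retbins=True returns the quantile edges (0, 1/3, 2/3 and 1 quantiles here)
binned, edges = pd.qcut(humidity, q=3, labels=["Low", "Medium", "High"],
                        retbins=True)

print("edges:", edges)

# With 15 rows and 3 bins, each bin should hold 5 observations
counts = binned.value_counts().sort_index()
print(counts)
```

Unlike equal-width bins, the edges here are not evenly spaced: they move to wherever the data is dense.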


#### Custom Binning
Custom binning allows you to define your own bin edges based on domain knowledge or specific requirements.

In [5]:
# 3. Custom Binning for RainfallAmount
df['RainfallAmountBinned'] = pd.cut(df['RainfallAmount'], bins=[-np.inf, 2, 4, 12, np.inf], 
                                    labels=['No Rain', 'Drizzle', 'Rain', 'Heavy Rain'])

df.head()

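|   | UVIndex | Humidity | WindSpeed | RainfallAmount | Temperature | Crowdedness | UVIndexBinned | HumidityBinned | RainfallAmountBinned |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 15 | 2 | 5 | 68 | 0.15 | Low | Low | Rain |
| 1 | 10 | 95 | 90 | 2 | 60 | 0.98 | Very High | High | No Rain |
| 2 | 1 | 10 | 1 | 7 | 63 | 0.1 | Low | Low | Rain |
| 3 | 7 | 98 | 30 | 3 | 55 | 0.85 | High | High | Drizzle |
| 4 | 3 | 18 | 3 | 18 | 50 | 0.2 | Low | Low | Heavy Rain |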
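
With custom edges it matters which side of each interval is closed. By default `pd.cut` uses right-closed intervals `(a, b]`, so a value exactly on an edge goes to the lower bin; `right=False` switches to `[a, b)`. The sketch below uses a few hypothetical rainfall values placed on the edges to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical values sitting on or next to the bin edges (illustration only)
rain = pd.Series([0, 2, 4, 12, 13])
edges = [-np.inf, 2, 4, 12, np.inf]
labels = ["No Rain", "Drizzle", "Rain", "Heavy Rain"]

right_closed = pd.cut(rain, bins=edges, labels=labels)              # (a, b]
left_closed = pd.cut(rain, bins=edges, labels=labels, right=False)  # [a, b)

comparison = pd.DataFrame({
    "RainfallAmount": rain,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)
```

Which convention fits depends on how the thresholds are defined in the domain, so it is worth checking edge values explicitly.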


#### Logarithmic Binning
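Logarithmic binning creates bins whose widths grow exponentially. It applies a log transformation first and then performs equal-width binning on the transformed values. The code below uses `np.log1p`, which computes log(1 + x), so a value of zero stays well defined.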

In [6]:
# 4. Logarithmic Binning for WindSpeed
df['WindSpeedBinned'] = pd.cut(np.log1p(df['WindSpeed']), bins=3, 
                               labels=['Light', 'Moderate', 'Strong'])

df.head()

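|   | UVIndex | Humidity | WindSpeed | RainfallAmount | Temperature | Crowdedness | UVIndexBinned | HumidityBinned | RainfallAmountBinned | WindSpeedBinned |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 15 | 2 | 5 | 68 | 0.15 | Low | Low | Rain | Light |
| 1 | 10 | 95 | 90 | 2 | 60 | 0.98 | Very High | High | No Rain | Strong |
| 2 | 1 | 10 | 1 | 7 | 63 | 0.1 | Low | Low | Rain | Light |
| 3 | 7 | 98 | 30 | 3 | 55 | 0.85 | High | High | Drizzle | Strong |
| 4 | 3 | 18 | 3 | 18 | 50 | 0.2 | Low | Low | Heavy Rain | Light |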
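
The bins are equal-width only on the log scale. Mapping the edges back to the original units with `np.expm1` (the inverse of `np.log1p`) shows how each bin gets wider than the previous one. This is a standalone sketch that repeats the `WindSpeed` values.

```python
import numpy as np
import pandas as pd

wind = pd.Series([2, 90, 1, 30, 3, 10, 40, 5, 60, 15, 20, 45, 25, 35, 50],
                 name="WindSpeed")

# Equal-width binning on log1p(x); retbins=True returns the log-scale edges
binned, log_edges = pd.cut(np.log1p(wind), bins=3,
                           labels=["Light", "Moderate", "Strong"], retbins=True)

# Undo the transform to express the edges in the original units
edges = np.expm1(log_edges)
widths = np.diff(edges)

print("edges in original units:", np.round(edges, 2))
print("bin widths:", np.round(widths, 2))
```

Reporting the edges in original units also makes the labels easier to explain, since "Moderate" then corresponds to a concrete wind-speed range.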


#### Standard Deviation-based Binning
Standard deviation-based binning creates bins according to how many standard deviations a value lies from the mean. This approach is useful when the data is roughly normally distributed, or when you want to bin values by how far they deviate from the central tendency.

In [7]:
# 5. Standard Deviation-Based Binning for Temperature
mean_temp, std_dev = df['Temperature'].mean(), df['Temperature'].std()
bin_edges = [
    float('-inf'),  # Ensure all values are captured
    mean_temp - 2.5 * std_dev,
    mean_temp - 1.5 * std_dev,
    mean_temp - 0.5 * std_dev,
    mean_temp + 0.5 * std_dev,
    mean_temp + 1.5 * std_dev,
    mean_temp + 2.5 * std_dev,
    float('inf')   # Ensure all values are captured
]
df['TemperatureBinned'] = pd.cut(df['Temperature'], bins=bin_edges,
                                 labels=['Very Low', 'Low', 'Below Avg', 'Average',
                                         'Above Avg', 'High', 'Very High'])

df.head()

|   | UVIndex | Humidity | WindSpeed | RainfallAmount | Temperature | Crowdedness | UVIndexBinned | HumidityBinned | RainfallAmountBinned | WindSpeedBinned | TemperatureBinned |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 15 | 2 | 5 | 68 | 0.15 | Low | Low | Rain | Light | Average |
| 1 | 10 | 95 | 90 | 2 | 60 | 0.98 | Very High | High | No Rain | Strong | Below Avg |
| 2 | 1 | 10 | 1 | 7 | 63 | 0.1 | Low | Low | Rain | Light | Average |
| 3 | 7 | 98 | 30 | 3 | 55 | 0.85 | High | High | Drizzle | Strong | Below Avg |
| 4 | 3 | 18 | 3 | 18 | 50 | 0.2 | Low | Low | Heavy Rain | Light | Low |


#### K-Means Binning
K-Means binning uses the K-Means clustering algorithm to create bins. It groups data points into clusters based on how similar the data points are to each other, with each cluster becoming a bin.

In [8]:
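# 6. K-Means Binning for Crowdedness
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42).fit(df[['Crowdedness']])

# K-Means cluster ids are arbitrary, so rank the clusters by their centers
# before attaching ordered labels (lowest center -> 'Low').
center_order = np.argsort(kmeans.cluster_centers_.ravel())
rank_of_cluster = np.argsort(center_order)
df['CrowdednessBinned'] = pd.Categorical.from_codes(
    rank_of_cluster[kmeans.labels_], categories=['Low', 'Medium', 'High'], ordered=True)

df.head()

|   | UVIndex | Humidity | WindSpeed | RainfallAmount | Temperature | Crowdedness | UVIndexBinned | HumidityBinned | RainfallAmountBinned | WindSpeedBinned | TemperatureBinned | CrowdednessBinned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 15 | 2 | 5 | 68 | 0.15 | Low | Low | Rain | Light | Average | Low |
| 1 | 10 | 95 | 90 | 2 | 60 | 0.98 | Very High | High | No Rain | Strong | Below Avg | High |
| 2 | 1 | 10 | 1 | 7 | 63 | 0.1 | Low | Low | Rain | Light | Average | Low |
| 3 | 7 | 98 | 30 | 3 | 55 | 0.85 | High | High | Drizzle | Strong | Below Avg | High |
| 4 | 3 | 18 | 3 | 18 | 50 | 0.2 | Low | Low | Heavy Rain | Light | Low | Medium |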
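
Cluster ids from `KMeans` carry no inherent order, so any label attached to them needs care. One alternative sketch is scikit-learn's `KBinsDiscretizer` with `strategy='kmeans'`, which returns ordinal codes that increase with the value. Its initialisation differs from a plain `KMeans` call, so the resulting bins are not guaranteed to match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

crowd = np.array([0.15, 0.98, 0.1, 0.85, 0.2, 0.9, 0.92, 0.25, 0.12,
                  0.99, 0.2, 0.8, 0.05, 0.3, 0.95]).reshape(-1, 1)

# encode='ordinal' yields one integer code per value: 0 for the lowest bin
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
codes = disc.fit_transform(crowd).ravel().astype(int)

# Codes are already ordered, so labels can be attached directly
binned = pd.Categorical.from_codes(codes, categories=["Low", "Medium", "High"],
                                   ordered=True)

print("bin edges:", disc.bin_edges_[0])
print(pd.DataFrame({"Crowdedness": crowd.ravel(), "Bin": binned}))
```

Because the discretizer exposes `bin_edges_`, the same fitted edges can be reused on new data with `disc.transform`.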


In [9]:
# Print only the binned columns
binned_columns = [col for col in df.columns if col.endswith('Binned')]
print(df[binned_columns])

   UVIndexBinned HumidityBinned RainfallAmountBinned WindSpeedBinned  \
0            Low            Low                 Rain           Light   
1      Very High           High              No Rain          Strong   
2            Low            Low                 Rain           Light   
3           High           High              Drizzle          Strong   
4            Low            Low           Heavy Rain           Light   
5      Very High           High              Drizzle        Moderate   
6       Moderate         Medium              No Rain          Strong   
7      Very High         Medium              No Rain           Light   
8            Low           High           Heavy Rain          Strong   
9           High         Medium              No Rain        Moderate   
10           Low            Low                 Rain        Moderate   
11     Very High         Medium              No Rain          Strong   
12     Very High         Medium           Heavy Rain          St