In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv(r"../Datasets/hungary_chickenpox.csv")
df

Unnamed: 0,Date,BUDAPEST,BARANYA,BACS,BEKES,BORSOD,CSONGRAD,FEJER,GYOR,HAJDU,...,JASZ,KOMAROM,NOGRAD,PEST,SOMOGY,SZABOLCS,TOLNA,VAS,VESZPREM,ZALA
0,03/01/2005,168,79,30,173,169,42,136,120,162,...,130,57,2,178,66,64,11,29,87,68
1,10/01/2005,157,60,30,92,200,53,51,70,84,...,80,50,29,141,48,29,58,53,68,26
2,17/01/2005,96,44,31,86,93,30,93,84,191,...,64,46,4,157,33,33,24,18,62,44
3,24/01/2005,163,49,43,126,46,39,52,114,107,...,63,54,14,107,66,50,25,21,43,31
4,31/01/2005,122,78,53,87,103,34,95,131,172,...,61,49,11,124,63,56,7,47,85,60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
517,01/12/2014,95,12,41,6,39,0,16,15,14,...,56,7,13,122,4,23,4,11,110,10
518,08/12/2014,43,39,31,10,34,3,2,30,25,...,34,20,18,70,36,5,23,22,63,9
519,15/12/2014,35,7,15,0,0,0,7,7,4,...,30,36,4,72,5,21,14,0,17,10
520,22/12/2014,30,23,8,0,11,4,1,9,10,...,27,17,21,12,5,17,1,1,83,2


Discretizing Data

Discretizing data means converting continuous variables (which can take any value within a range) into discrete variables (which can only take specific, fixed values). This process groups numerical values into intervals or categories.

For example:

Continuous variable: Heights of people (e.g., 160.5 cm, 172.8 cm, etc.).

Discretized version: Group heights into categories like Short, Medium, Tall.
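A minimal sketch of the height example above. The 160 cm and 175 cm cut-offs and the sample heights are assumptions for illustration only; the text does not specify any thresholds. `np.digitize` returns the index of the interval each value falls into, and that index is then mapped to a category name.

```python
import numpy as np

# Hypothetical heights in cm; the 160/175 cut-offs are assumed, not from the text
heights = np.array([160.5, 172.8, 150.0, 181.2])
edges = [160, 175]
names = np.array(["Short", "Medium", "Tall"])

# np.digitize gives 0 for values below 160, 1 for [160, 175), 2 for 175 and above
interval_index = np.digitize(heights, edges)
height_categories = names[interval_index]
print(height_categories)
```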

_____________________________________________________________________________________________________________________

Binning Data

Binning is the process of dividing a continuous variable into intervals (bins) and assigning a label or category to each interval. Each data point falls into one of these bins based on its value.

For example: If you have a continuous variable of test scores ranging from 0 to 100, you can bin it into:

Bins: [0–50], [51–75], [76–100]

Labels: Low, Medium, High
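The test-score example can be sketched with `pd.cut()`. The scores below are made up, and `include_lowest=True` is added here so that a score of exactly 0 still lands in the first bin.

```python
import pandas as pd

# Hypothetical test scores, including both endpoints of the 0-100 range
scores = [0, 50, 51, 75, 76, 100]

# Edges 0, 50, 75, 100 give the intervals [0, 50], (50, 75], (75, 100];
# include_lowest=True keeps a score of exactly 0 in the first bin
score_levels = pd.cut(scores, bins=[0, 50, 75, 100],
                      labels=["Low", "Medium", "High"], include_lowest=True)
print(score_levels)
```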

_____________________________________________________________________________________________________________________

Pandas provides two main methods for binning:

pd.cut(): Used for creating bins of fixed width or custom-defined bins.

pd.qcut(): Used for creating quantile-based bins where each bin contains approximately the same number of data points.


# __pandas.cut()__

The cut() function is used to segment and sort data values into bins. This method is ideal when you already know the bin edges or want equally spaced intervals.

Parameters

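x: The data to bin (a one-dimensional array, list, or Series).

bins: Either an integer (the number of equal-width bins) or a sequence of bin edges.

labels: Names for the resulting bins (optional). When given, there must be one label per bin, that is, one fewer than the number of edges.

right: Whether each bin includes its right edge (default is True, giving intervals of the form (a, b]).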
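A short sketch of how `right` and an integer `bins` behave. The ages and edges below are illustrative values, not taken from the dataset.

```python
import pandas as pd

ages = [5, 12, 17, 18]
edges = [0, 12, 18, 60]
names = ["Child", "Teen", "Adult"]

# right=True (default): intervals are (0, 12], (12, 18], (18, 60]
right_closed = pd.cut(ages, bins=edges, labels=names)

# right=False: intervals are [0, 12), [12, 18), [18, 60)
left_closed = pd.cut(ages, bins=edges, labels=names, right=False)

# An integer asks for that many equal-width bins over the data's range
equal_width = pd.cut([1, 5, 10], bins=3)

print(list(right_closed))
print(list(left_closed))
print(equal_width.categories)
```

The value 12 sits exactly on an edge, so it moves from Child to Teen when `right` is switched to False.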

In [3]:
# Example data
ages = [5, 12, 17, 24, 35, 50, 70, 85]

# Define bin edges
bins = [0, 12, 18, 60, 100]
labels = ["Child", "Teen", "Adult", "Senior"]

# Apply pd.cut()
age_categories = pd.cut(ages, bins=bins, labels=labels)
print(age_categories)

['Child', 'Child', 'Teen', 'Adult', 'Adult', 'Adult', 'Senior', 'Senior']
Categories (4, object): ['Child' < 'Teen' < 'Adult' < 'Senior']


__New Task:__ 

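Bin `BUDAPEST` using the predetermined edges [10, 50, 100].

Values that fall outside every bin become NaN, so make sure the bins cover all possible values; otherwise, extend the edges (below, 0 and 1000 are added around the requested ones). Note that with the default right=True the lowest edge itself is excluded, so a count of exactly 0 falls outside (0, 10] unless include_lowest=True is passed.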
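The coverage issue can be illustrated on a small made-up Series (these are not values read from the dataset). Using open-ended outer edges with `np.inf` is one possible way to guarantee coverage; it is shown here as an option, not as the notebook's approach.

```python
import numpy as np
import pandas as pd

# Made-up weekly counts, chosen to fall below, inside, and above the edges
counts = pd.Series([0, 30, 75, 479])

# With only the requested edges, 0 and 479 fall outside every bin and become NaN
narrow = pd.cut(counts, bins=[10, 50, 100])

# Open-ended outer edges cover every possible value
covered = pd.cut(counts, bins=[-np.inf, 10, 50, 100, np.inf])

print(narrow.isna().sum(), covered.isna().sum())
```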

In [4]:
# 0 and 1000 are added around the requested edges [10, 50, 100] to widen the covered range
pd.cut(df['BUDAPEST'], bins=[0, 10, 50, 100, 1000])

0      (100, 1000]
1      (100, 1000]
2        (50, 100]
3      (100, 1000]
4      (100, 1000]
          ...     
517      (50, 100]
518       (10, 50]
519       (10, 50]
520       (10, 50]
521    (100, 1000]
Name: BUDAPEST, Length: 522, dtype: category
Categories (4, interval[int64, right]): [(0, 10] < (10, 50] < (50, 100] < (100, 1000]]

In [5]:
# cut with labels
pd.cut(df['BUDAPEST'], bins=[0, 10, 50, 100, 1000], labels=[1,2,3,4])

0      4
1      4
2      3
3      4
4      4
      ..
517    3
518    2
519    2
520    2
521    4
Name: BUDAPEST, Length: 522, dtype: category
Categories (4, int64): [1 < 2 < 3 < 4]

# __pandas.qcut()__
The qcut() function divides data into quantile-based bins, so each bin holds approximately the same number of observations. This is useful when the distribution is skewed, because equal-width bins would leave some bins crowded and others nearly empty.
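To make the contrast concrete, here is a small sketch on a made-up, right-skewed Series (not the chickenpox data) that compares bin counts from `pd.cut()` with equal-width bins against `pd.qcut()`.

```python
import pandas as pd

# Hypothetical skewed data: most values are small, a few are very large
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100, 200, 400])

# Four equal-width bins: almost everything lands in the first bin
equal_width = pd.cut(skewed, bins=4)

# Four quantile bins: the edges adapt so the counts stay balanced
equal_freq = pd.qcut(skewed, q=4)

print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```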

Example 1: Quantile-Based Binning

Suppose we want to split the data into quartiles (four bins with roughly equal counts).

In [7]:
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Divide into 4 quantile bins
quantile_bins = pd.qcut(data, q=4)
print(quantile_bins)


[(9.999, 32.5], (9.999, 32.5], (9.999, 32.5], (32.5, 55.0], (32.5, 55.0], (55.0, 77.5], (55.0, 77.5], (77.5, 100.0], (77.5, 100.0], (77.5, 100.0]]
Categories (4, interval[float64, right]): [(9.999, 32.5] < (32.5, 55.0] < (55.0, 77.5] < (77.5, 100.0]]
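If the quantile edges themselves are needed, for instance to apply the same bins to other data, `pd.qcut()` accepts `retbins=True`. The sketch below reuses the edges with `pd.cut()`; the `new_values` list is hypothetical.

```python
import pandas as pd

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# retbins=True also returns the computed quantile edges
quartiles, edges = pd.qcut(data, q=4, retbins=True)
print(edges)

# Reuse the same edges on new, hypothetical values;
# include_lowest=True keeps a value equal to the lowest edge in the first bin
new_values = [10, 45, 99]
reused = pd.cut(new_values, bins=edges, include_lowest=True)
print(list(reused.codes))
```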


Example 2: Quantile Binning with Labels

In [8]:
labels = ["Q1", "Q2", "Q3", "Q4"]
quantile_bins = pd.qcut(data, q=4, labels=labels)
print(quantile_bins)


['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3', 'Q4', 'Q4', 'Q4']
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']


In [9]:
# Without labels, qcut returns the quartile interval that each value falls into
quantile = pd.qcut(df['BUDAPEST'], q=4)
quantile

0       (149.0, 479.0]
1       (149.0, 479.0]
2        (93.0, 149.0]
3       (149.0, 479.0]
4        (93.0, 149.0]
            ...       
517      (93.0, 149.0]
518      (34.25, 93.0]
519      (34.25, 93.0]
520    (-0.001, 34.25]
521     (149.0, 479.0]
Name: BUDAPEST, Length: 522, dtype: category
Categories (4, interval[float64, right]): [(-0.001, 34.25] < (34.25, 93.0] < (93.0, 149.0] < (149.0, 479.0]]

## Numbers to labels

In [10]:
quantile = pd.qcut(df['BUDAPEST'], q=4, labels=['LOW', 'MEDIUM', 'HIGH', 'VERY HIGH'])
quantile

0      VERY HIGH
1      VERY HIGH
2           HIGH
3      VERY HIGH
4           HIGH
         ...    
517         HIGH
518       MEDIUM
519       MEDIUM
520          LOW
521    VERY HIGH
Name: BUDAPEST, Length: 522, dtype: category
Categories (4, object): ['LOW' < 'MEDIUM' < 'HIGH' < 'VERY HIGH']
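Because the labelled result is an ordered categorical, it can be counted and compared directly. The sketch below uses a short made-up Series standing in for a column such as `BUDAPEST`, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up weekly counts standing in for a column such as BUDAPEST
weekly_counts = pd.Series([0, 12, 30, 43, 95, 122, 157, 168])
levels = pd.qcut(weekly_counts, q=4,
                 labels=["LOW", "MEDIUM", "HIGH", "VERY HIGH"])

# Each label should cover roughly a quarter of the rows
print(levels.value_counts(sort=False))

# The categories are ordered, so comparisons against a label work
high_or_above = weekly_counts[levels >= "HIGH"]
print(high_or_above.tolist())
```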