# 🌳 🔄 Complete Guide to Data Transformation A to Z

Welcome to this comprehensive guide on data transformation, designed to equip you with the knowledge and skills to effectively preprocess and transform your datasets. Whether you're a budding data scientist or a seasoned professional looking to refine your data transformation techniques, this notebook is tailored for you!

## What Will You Learn?

In this guide, we will explore various methods to normalize, construct, and discretize features, ensuring you have the tools to confidently prepare your data for any analysis or modeling task. Here's what we'll cover:

### 1. Feature Normalization

Learn how to scale and adjust the statistical distribution of feature values to improve the performance and accuracy of your models.

- **Min-Max Scaling**: Scale features to a fixed range, typically 0 to 1.
- **Z-Score Standardization**: Transform features to have a mean of 0 and a standard deviation of 1.
- **Robust Scaling**: Scale features using statistics that are robust to outliers, such as the median and interquartile range.
- **Yeo-Johnson Transformation**: Apply a transformation that can handle both positive and negative values to achieve normality.
- **Box-Cox Transformation**: Apply a transformation that works only with strictly positive values to bring the distribution closer to normal and reduce skewness.

### 2. Feature Construction

Learn techniques to create new features from existing ones, enhancing the predictive power of your models.

- **Use of Domain Knowledge**: Incorporate insights from the specific field or industry to construct meaningful features.
- **Using Statistical Relationships Between Features**: Identify and utilize correlations and interactions between features.
- **Numerical Coding of Nominal Values**:
  - **One-Hot Encoding**: Convert categorical variables into a series of binary variables.
  - **Ordinal or Label Encoding**: Assign integer values to categories based on their order or labels.
  - **Probability Ratio Encoding**: Encode categorical features based on the probability ratio of the target variable.

### 3. Feature Discretization

Learn how to transform continuous features into discrete ones to simplify models and capture nonlinear relationships.

- **Domain Knowledge**: Use expert knowledge to define meaningful bins.
- **Unsupervised Methods**:
  - **Equal-Width Binning**: Divide the range of values into equal-width bins.
  - **Equal-Frequency Binning**: Divide the range of values so that each bin has approximately the same number of observations.
  - **K-Means Binning**: Use k-means clustering to create bins based on feature similarity.
- **Supervised Methods**:
  - **ChiMerge**: Merge bins based on the chi-squared statistic to ensure similarity with respect to the target variable.
  - **Decision Tree Binning**: Use decision trees to create bins based on target variable splits.


## Why This Guide?

- **Step-by-Step Tutorials**: Each section includes clear explanations followed by practical examples, ensuring you not only learn but also apply your knowledge.
- **Interactive Learning**: Engage with interactive code cells that allow you to see the effects of data transformation methods in real-time.

### How to Use This Notebook

- **Run the Cells**: Follow along with the code examples by running the cells yourself. Modify the parameters to see how the results change.
- **Explore Further**: After completing the guided sections, try applying the methods to your own datasets to reinforce your learning.

Prepare to unlock the full potential of data transformation in data analysis. Let's dive in and transform data into valuable insights!


In [None]:
import pandas as pd

# Load the dataset
file_path = '/kaggle/input/loans-and-liability/LoanData_Preprocessed_v1.2.csv'
data = pd.read_csv(file_path)

# Convert 'ed' and 'default' columns to object type
data['ed'] = data['ed'].astype('object')
data['default'] = data['default'].astype('object')

# Display a summary of the columns, their dtypes, and non-null counts
data.info()


In [None]:
data.describe()

# Dataset Overview

The dataset contains information about loan applicants and includes the following columns:

- **age**: The age of the applicant in years.
  - **Range**: 18 - 66
  - **Mean**: 34.40
  - **Skewness**: Slightly skewed to the right (positive skew).


- **employ**: The number of years the applicant has been employed, which can indicate their job stability and experience.
  - **Range**: 0 - 31
  - **Mean**: 8.21
  - **Skewness**: Right-skewed (positive skew).


- **address**: The number of years the applicant has lived at their current address, providing insights into their residential stability.
  - **Range**: 0 - 28
  - **Mean**: 5.58
  - **Skewness**: Right-skewed (positive skew).


- **income**: The annual income of the applicant (in thousands), representing their earning capacity.
  - **Range**: 10 - 330
  - **Mean**: 55.50
  - **Skewness**: Right-skewed (positive skew).


- **debtinc**: The debt-to-income ratio of the applicant, calculated as the percentage of their income that goes towards paying debts. This ratio helps assess their financial burden.
  - **Range**: 0.00 - 37.30
  - **Mean**: 10.27
  - **Skewness**: Right-skewed (positive skew).


- **creddebt**: The amount of credit card debt the applicant has (in thousands), showing their reliance on credit and their debt levels.
  - **Range**: 0.00 - 22.12
  - **Mean**: 3.51
  - **Skewness**: Right-skewed (positive skew).


- **othdebt**: The amount of other debt the applicant has (in thousands), which includes all other forms of debt apart from credit card debt.
  - **Range**: 0.00 - 57.03
  - **Mean**: 5.05
  - **Skewness**: Right-skewed (positive skew).


- **ed**: The education level of the applicant (encoded numerically), where higher numbers may represent higher levels of education.
  - **Unique Values**: 1.0, 2.0, 3.0, 4.0, 5.0
  - **Most Frequent Value (Mode)**: 1.0


- **default**: A binary indicator of whether the applicant defaulted on the loan (1 for default, 0 for no default), indicating their credit risk.
  - **Unique Values**: 0, 1
  - **Most Frequent Value (Mode)**: 0 (majority did not default)


# 1. Feature Normalization

Feature normalization is a crucial step in the data preprocessing pipeline. It involves adjusting the values of numerical features to ensure they have a common scale, which can improve the performance and training stability of machine learning models. In this section, we will explore various normalization techniques and demonstrate how to apply them using practical examples.

### Why Normalize Features?

Normalization can help in:
- **Improving Model Performance**: Algorithms such as gradient descent converge faster with normalized data.
- **Enhancing Accuracy**: Normalization can reduce the impact of features with larger scales on the model.
- **Stability**: Models can become more stable and less sensitive to variations in the data.

### Techniques Covered:

1. **Min-Max Scaling**: This technique scales the features to a fixed range, typically [0, 1]. The formula is:

   $$
   X' = \frac{X - X_{min}}{X_{max} - X_{min}}
   $$

2. **Z-Score Standardization**: Also known as standardization, this method transforms features to have a mean of 0 and a standard deviation of 1. The formula is:

   $$
   X' = \frac{X - \mu}{\sigma}
   $$

   where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.

3. **Robust Scaling**: This method scales features using statistics that are robust to outliers, such as the median and the interquartile range. The formula is:

   $$
   X' = \frac{X - Q2}{Q3 - Q1}
   $$

   where $Q1$ and $Q3$ are the 1st and 3rd quartiles, respectively, and $Q2$ is the median.

4. **Yeo-Johnson Transformation**: This technique can handle both positive and negative values and transforms the data to be more normally distributed.

5. **Box-Cox Transformation**: This method works only with strictly positive values and transforms the data to be more normally distributed, reducing skewness.
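
The three scaling formulas above can be checked by hand on a small array. The following is a minimal sketch using NumPy only; the toy values are hypothetical and are not taken from the loan dataset. It uses the population standard deviation, and quartiles computed with NumPy's default linear interpolation.

```python
import numpy as np

# Toy feature with one large outlier (hypothetical values, not from the loan data)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (X - X_min) / (X_max - X_min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (X - mu) / sigma, with the population standard deviation
x_zscore = (x - x.mean()) / x.std()

# Robust scaling: (X - Q2) / (Q3 - Q1)
q1, q2, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - q2) / (q3 - q1)

print("min-max:", np.round(x_minmax, 3))
print("z-score:", np.round(x_zscore, 3))
print("robust: ", np.round(x_robust, 3))
```

In this sketch the outlier compresses the first four min-max values into a narrow band near 0, while robust scaling keeps them evenly spread around the median.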

### Why Split Train and Test Data?

Splitting the data into training and testing sets is a crucial step in the machine learning pipeline. It ensures that the model's performance can be evaluated on unseen data, providing a more realistic estimate of its effectiveness in real-world scenarios. By keeping the test data separate:
- **Avoid Data Leakage**: Ensures that information from the test set does not influence the model during training.
- **Model Evaluation**: Provides an unbiased evaluation metric for how well the model generalizes to new data.
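- **Hyperparameter Tuning**: Hyperparameters should be tuned on a separate validation set or with cross-validation on the training data, not on the test set; tuning against the test set leaks information and makes the final evaluation overly optimistic.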
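
To make the leakage point concrete for transformations, here is a small sketch in which the scaling parameters are learned from the training split only and then reused on the test split. The arrays and the helper `min_max` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical train and test values for a single feature
train = np.array([20.0, 30.0, 40.0, 50.0])
test = np.array([25.0, 60.0])

# Learn the scaling parameters from the training split only
train_min, train_max = train.min(), train.max()

def min_max(values):
    """Scale values with the minimum and maximum learned from the training split."""
    return (values - train_min) / (train_max - train_min)

train_scaled = min_max(train)
test_scaled = min_max(test)

print("train:", train_scaled)
print("test: ", test_scaled)
```

Because the test split never influences `train_min` and `train_max`, a test value larger than anything seen in training is mapped above 1. That is expected behavior, not a bug.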

Let's see how to apply these transformations to our dataset after splitting the data into training and testing sets.


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Split the dataset into training and testing sets
train_data, test_data = train_test_split(data.copy(), test_size=0.2, random_state=42)


# 1.1 Min-Max Scaling

Min-Max Scaling transforms features by scaling them to a given range, usually [0, 1]. This technique is useful when the features have different ranges and you want to ensure they contribute equally to the analysis.

### Columns

For our dataset, the following columns are suitable for Min-Max Scaling:
- `age`
- `debtinc`
- `creddebt`

### Applying Min-Max Scaling

Let's apply Min-Max Scaling to these columns.


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Select the columns to scale
columns_to_scale = ['age', 'debtinc', 'creddebt']

# Fit the scaler to the training data and transform both training and testing data
train_data_min_max_scaled = train_data.copy()
test_data_min_max_scaled = test_data.copy()

train_data_min_max_scaled[columns_to_scale] = scaler.fit_transform(train_data[columns_to_scale])
test_data_min_max_scaled[columns_to_scale] = scaler.transform(test_data[columns_to_scale])

# Display the first few rows of the transformed training dataset to verify the scaling
print("Min-Max Scaled Data (Train):")
display(train_data_min_max_scaled.head(20))


# 1.2 Z-Score Standardization

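Z-Score Standardization, also known as standardization, transforms features to have a mean of 0 and a standard deviation of 1. Unlike Min-Max Scaling, it does not squeeze values into a fixed range, so a single extreme value distorts the result less. The mean and standard deviation are still affected by outliers, however, which is the motivation for Robust Scaling. Standardization puts features on a comparable scale so that no feature dominates purely because of its units.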

### Columns

For our dataset, the following columns are suitable for Z-Score Standardization:
- `age`
- `debtinc`
- `creddebt`

### Applying Z-Score Standardization

Let's apply Z-Score Standardization to these columns.


In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Select the columns to scale
columns_to_scale = ['age', 'debtinc', 'creddebt']

# Fit the scaler to the training data and transform both training and testing data
train_data_z_score_scaled = train_data.copy()
test_data_z_score_scaled = test_data.copy()

train_data_z_score_scaled[columns_to_scale] = scaler.fit_transform(train_data[columns_to_scale])
test_data_z_score_scaled[columns_to_scale] = scaler.transform(test_data[columns_to_scale])

# Display the first few rows of the transformed training dataset to verify the scaling
print("Z-Score Standardized Data (Train):")
display(train_data_z_score_scaled.head(20))


# 1.3 Robust Scaling

Robust Scaling uses statistics that are robust to outliers, such as the median and the interquartile range, to scale features. This method is useful when the dataset contains outliers that could skew the results of standard scaling methods.

### Columns

For our dataset, the following columns are suitable for Robust Scaling:
- `income`
- `othdebt`

### Applying Robust Scaling

Let's apply Robust Scaling to these columns.


In [None]:
from sklearn.preprocessing import RobustScaler

# Initialize the RobustScaler
scaler = RobustScaler()

# Select the columns to scale
columns_to_scale = ['income', 'othdebt']

# Fit the scaler to the training data and transform both training and testing data
train_data_robust_scaled = train_data.copy()
test_data_robust_scaled = test_data.copy()

train_data_robust_scaled[columns_to_scale] = scaler.fit_transform(train_data[columns_to_scale])
test_data_robust_scaled[columns_to_scale] = scaler.transform(test_data[columns_to_scale])

# Display the first few rows of the transformed training dataset to verify the scaling
print("Robust Scaled Data (Train):")
display(train_data_robust_scaled.head(20))



# 1.4 Yeo-Johnson and Box-Cox Transformations

The Yeo-Johnson and Box-Cox transformations are power transformations used to stabilize variance and make the data more normally distributed. Yeo-Johnson can handle both positive and negative values, whereas Box-Cox is only applicable to positive values.
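
The difference between the two transformations is easiest to see from their definitions. Below is a minimal NumPy sketch of both for a fixed power parameter; the function names `box_cox` and `yeo_johnson` and the parameter `lam` are illustrative choices, not part of any library. Note that `PowerTransformer` additionally estimates the parameter from the data and, by default, standardizes the output, which this sketch does not do.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for a fixed lambda; defined only for strictly positive x."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson transform for a fixed lambda; defined for any real x."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative branch
    if lam == 0:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    # Negative branch
    if lam == 2:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    return out

print(box_cox([1.0, 4.0, 9.0], 0.5))
print(yeo_johnson([-1.0, 0.0, 1.0], 0.0))
```

In this sketch, passing an array that contains a zero to `box_cox` raises an error, whereas `yeo_johnson` accepts zeros and negative values.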

### Columns

For our dataset, we will check the following columns for non-positive values (zero or negative) and apply the appropriate transformation:
- `age`
- `income`
- `debtinc`
- `creddebt`
- `othdebt`

### Applying Transformations

Box-Cox is defined only for strictly positive values, so a column that contains zeros cannot use it (several debt columns in this dataset have a minimum of 0.00). Let's apply the Yeo-Johnson Transformation to columns with zero or negative values and the Box-Cox Transformation to columns with strictly positive values.


In [None]:
from sklearn.preprocessing import PowerTransformer
import pandas as pd

# Flag columns with non-positive values in either split:
# Box-Cox fails on zeros as well as on negative values
columns_to_check = ['age', 'income', 'debtinc', 'creddebt', 'othdebt']
combined = pd.concat([train_data[columns_to_check], test_data[columns_to_check]])
contains_non_positive = {col: bool((combined[col] <= 0).any()) for col in columns_to_check}

# Initialize the PowerTransformers
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson')
box_cox_transformer = PowerTransformer(method='box-cox')

# Make copies of the train and test data for transformations
train_data_yeo_johnson = train_data.copy()
test_data_yeo_johnson = test_data.copy()
train_data_box_cox = train_data.copy()
test_data_box_cox = test_data.copy()

# Apply Yeo-Johnson transformation to columns with zero or negative values
columns_to_transform_yeo_johnson = [col for col, non_positive in contains_non_positive.items() if non_positive]
if columns_to_transform_yeo_johnson:
    train_data_yeo_johnson[columns_to_transform_yeo_johnson] = yeo_johnson_transformer.fit_transform(train_data[columns_to_transform_yeo_johnson])
    test_data_yeo_johnson[columns_to_transform_yeo_johnson] = yeo_johnson_transformer.transform(test_data[columns_to_transform_yeo_johnson])
    print("Yeo-Johnson Transformed Data (Train):")
    display(train_data_yeo_johnson.head(20))
else:
    print("No columns with zero or negative values for Yeo-Johnson transformation.")

# Apply Box-Cox transformation to columns with strictly positive values
columns_to_transform_box_cox = [col for col, non_positive in contains_non_positive.items() if not non_positive]
if columns_to_transform_box_cox:
    train_data_box_cox[columns_to_transform_box_cox] = box_cox_transformer.fit_transform(train_data[columns_to_transform_box_cox])
    test_data_box_cox[columns_to_transform_box_cox] = box_cox_transformer.transform(test_data[columns_to_transform_box_cox])
    print("Box-Cox Transformed Data (Train):")
    display(train_data_box_cox.head(20))
else:
    print("No columns with strictly positive values for Box-Cox transformation.")

# 2. Feature Construction

Feature construction involves creating new features from existing ones to enhance the predictive power of your models. This can be done using domain knowledge, statistical relationships between features, and various encoding techniques. In this section, we will explore various methods of feature construction and demonstrate how to apply them using practical examples.

### Why Construct New Features?

Constructing new features can help in:
- **Improving Model Performance**: New features can provide additional information that helps the model make better predictions.
- **Capturing Non-Linear Relationships**: Interactions and transformations of existing features can capture non-linear relationships.
- **Reducing Dimensionality**: Aggregating features can reduce the number of dimensions, making models simpler and faster.

### Techniques Covered:

1. **Use of Domain Knowledge**: Incorporate insights from the specific field or industry to construct meaningful features.
2. **Using Statistical Relationships Between Features**: Identify and utilize correlations and interactions between features.
3. **Numerical Coding of Nominal Values**:
   - **One-Hot Encoding**: Convert categorical variables into a series of binary variables.
   - **Ordinal or Label Encoding**: Assign integer values to categories based on their order or labels.
   - **Probability Ratio Encoding**: Encode categorical features based on the probability ratio of the target variable.


Let's explore and apply various feature construction techniques to our dataset.


# 2.1 Use of Domain Knowledge

Incorporating domain knowledge can help create meaningful features that enhance model performance. For our loan dataset, we can create several new features based on common financial metrics used to evaluate a borrower's creditworthiness.

### New Features:

1. **Debt-to-Income Ratio**: A metric to evaluate a borrower's ability to manage monthly payments and repay debts.


   $$
   \text{Debt-to-Income Ratio} = \frac{\text{creddebt} + \text{othdebt}}{\text{income}}
   $$
   

2. **Income per Year of Employment**: Relates a person's income to the length of their employment. The denominator adds 1 so that applicants with 0 years of employment do not cause a division by zero.


   $$
   \text{Income per Year of Employment} = \frac{\text{income}}{\text{employ} + 1}
   $$


3. **Credit-to-Income Ratio**: Assesses how much of a person's income is used to pay off credit debt.


   $$
   \text{Credit-to-Income Ratio} = \frac{\text{creddebt}}{\text{income}}
   $$


4. **Total Debt**: Sum of all debts to gauge the total debt burden.


   $$
   \text{Total Debt} = \text{creddebt} + \text{othdebt}
   $$


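5. **Monthly Debt Payment Burden**: Expresses total debt as a multiple of monthly income, that is, how many months of income the total debt represents. Note that this is exactly 12 times the Debt-to-Income Ratio above, so the two features are perfectly collinear.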


   $$
   \text{Monthly Debt Payment Burden} = \frac{\text{Total Debt}}{\text{income} / 12}
   $$

Let's create these new features in our dataset.


In [None]:
# Make a copy of the original dataset
data_domain_features = data.copy()

# 1. Debt-to-Income Ratio
data_domain_features['debt_to_income_ratio'] = (data_domain_features['creddebt'] + data_domain_features['othdebt']) / data_domain_features['income']

# 2. Income per Year of Employment
data_domain_features['income_per_year_employ'] = data_domain_features['income'] / (data_domain_features['employ'] + 1)  # +1 to avoid division by zero

# 3. Credit-to-Income Ratio
data_domain_features['credit_to_income_ratio'] = data_domain_features['creddebt'] / data_domain_features['income']

# 4. Total Debt
data_domain_features['total_debt'] = data_domain_features['creddebt'] + data_domain_features['othdebt']

# 5. Monthly Debt Payment Burden
data_domain_features['monthly_debt_payment_burden'] = data_domain_features['total_debt'] / (data_domain_features['income'] / 12)

# Display the first 20 rows to verify the new features
data_domain_features[['debt_to_income_ratio', 'income_per_year_employ', 'credit_to_income_ratio', 'total_debt', 'monthly_debt_payment_burden']].head(20)


# 2.2 Using Statistical Relationships

Utilizing statistical relationships between features can help create new features that capture interactions and dependencies within the data. This can enhance the predictive power of your models by providing additional insights and improving their ability to detect patterns.

### New Features:

1. **Interaction Terms**: Capturing the interaction between two or more features can help in understanding the combined effect of these features.
2. **Polynomial Features**: Creating polynomial features can help in capturing non-linear relationships.
3. **Ratios and Differences**: Creating features based on ratios and differences between existing features.

Let's create these new features in our dataset.


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer
import pandas as pd

# Make a copy of the original dataset
data_stat_features = data.copy()

# Impute missing values with the mean for 'age' and 'income' 
imputer = SimpleImputer(strategy='mean')
data_stat_features[['age', 'income']] = imputer.fit_transform(data_stat_features[['age', 'income']])

# 1. Interaction Terms
# Create interaction terms between 'income' and 'debtinc'
data_stat_features['income_debtinc_interaction'] = data_stat_features['income'] * data_stat_features['debtinc']

# 2. Polynomial Features
# Create degree-2 polynomial features for 'age' and 'income'
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data_stat_features[['age', 'income']])
poly_feature_names = poly.get_feature_names_out(['age', 'income'])
poly_df = pd.DataFrame(poly_features, columns=poly_feature_names, index=data_stat_features.index)
# Drop the degree-1 terms, which already exist in the dataset, to avoid duplicate column names
poly_df = poly_df.drop(columns=['age', 'income'])
data_stat_features = pd.concat([data_stat_features, poly_df], axis=1)

# 3. Ratios and Differences
# Create ratio and difference features between 'creddebt' and 'othdebt'
# Replace zero denominators with NaN so the ratio is missing rather than infinite
data_stat_features['creddebt_othdebt_ratio'] = data_stat_features['creddebt'] / data_stat_features['othdebt'].replace(0, float('nan'))
data_stat_features['creddebt_othdebt_diff'] = data_stat_features['creddebt'] - data_stat_features['othdebt']

# Display the first 20 rows to verify the new features
data_stat_features[['income_debtinc_interaction', 'age', 'income', 'age^2', 'age income', 'income^2', 'creddebt_othdebt_ratio', 'creddebt_othdebt_diff']].head(20)


# 2.3 Numerical Coding of Nominal Values

Numerical coding of nominal values is a method to convert categorical data into numerical format, which can be easily used by machine learning models. This process is crucial when dealing with categorical features as many algorithms cannot handle non-numeric data directly.

### Techniques Covered:

1. **One-Hot Encoding**: Convert categorical variables into a series of binary variables.
2. **Ordinal or Label Encoding**: Assign integer values to categories based on their order or labels.
3. **Probability Ratio Encoding**: Encode categorical features based on the probability ratio of the target variable.

Let's apply these techniques to our dataset.


# 2.3.1 One-Hot Encoding

One-Hot Encoding is a technique used to convert categorical variables into a series of binary variables (0 or 1). Each category in the original variable is represented as a separate column, with a 1 indicating the presence of that category and a 0 indicating its absence. This technique is useful when there is no ordinal relationship between the categories.
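
As a quick illustration of the idea, the sketch below one-hot encodes a tiny hypothetical column with `pandas.get_dummies`. The toy frame is made up for this example. The second call drops the first category, which mirrors the purpose of `drop='first'`: the dropped category is implied when all remaining indicators are 0.

```python
import pandas as pd

# Hypothetical toy column with three education levels
toy = pd.DataFrame({"ed": [1, 2, 3, 1, 2]})

# One indicator column per category
one_hot = pd.get_dummies(toy["ed"], prefix="ed", dtype=int)

# Same encoding with the first category dropped to avoid redundant columns
one_hot_reduced = pd.get_dummies(toy["ed"], prefix="ed", drop_first=True, dtype=int)

print(one_hot)
print(one_hot_reduced)
```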

### Example

For our dataset, we'll apply one-hot encoding to the `ed` column, which represents education levels, and the `default` column, which represents default status.

Let's apply one-hot encoding to our dataset.


In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Make a copy of the original dataset
data_one_hot_encoded = data.copy()

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity

# Select the columns to encode
columns_to_encode = ['ed', 'default']

# Fit and transform the encoder on the selected columns
encoded_data = encoder.fit_transform(data_one_hot_encoded[columns_to_encode])

# Create a DataFrame with the encoded data
encoded_columns = encoder.get_feature_names_out(columns_to_encode)
encoded_df = pd.DataFrame(encoded_data, columns=encoded_columns, index=data_one_hot_encoded.index)

# Concatenate the encoded columns with the original dataset (excluding the original categorical columns)
data_one_hot_encoded = pd.concat([data_one_hot_encoded.drop(columns_to_encode, axis=1), encoded_df], axis=1)

# Display the first 20 rows to verify the one-hot encoding
data_one_hot_encoded.head(20)


# 2.3.2 Ordinal or Label Encoding

Ordinal or Label Encoding is a technique used to convert categorical variables into integer values based on their order or labels. This technique is useful when there is an ordinal relationship between the categories, or when simply converting categorical data into a numerical format.
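
When the categories are strings, the integer assigned to each one depends on the order used. The sketch below, with hypothetical education labels, contrasts an explicitly specified order with the default sorted order (which is also how `LabelEncoder` assigns integers).

```python
import pandas as pd

# Hypothetical education labels, listed from lowest to highest level
levels = ["high school", "bachelor", "master"]
toy = pd.Series(["bachelor", "high school", "master", "bachelor"])

# Ordinal encoding with an explicit, meaningful order
codes_ordered = pd.Categorical(toy, categories=levels, ordered=True).codes

# Encoding with the default sorted (alphabetical) order
codes_alpha = pd.Categorical(toy).codes

print(list(codes_ordered))
print(list(codes_alpha))
```

With the explicit order, a larger code always means a higher education level; with the alphabetical order, "bachelor" receives a smaller code than "high school", so the integers no longer reflect the ranking.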

### Example

For our dataset, we'll apply label encoding to the `ed` column, which represents education levels, and the `default` column, which represents default status.

Let's apply label encoding to our dataset.


In [None]:
from sklearn.preprocessing import LabelEncoder

# Make a copy of the original dataset
data_label_encoded = data.copy()

# Initialize the LabelEncoder for 'ed' and 'default'
label_encoder_ed = LabelEncoder()
label_encoder_default = LabelEncoder()

# Fit and transform the 'ed' column
data_label_encoded['ed'] = label_encoder_ed.fit_transform(data_label_encoded['ed'])

# Fit and transform the 'default' column
data_label_encoded['default'] = label_encoder_default.fit_transform(data_label_encoded['default'])

# Display the first 20 rows to verify the label encoding
data_label_encoded[['ed', 'default']].head(20)


# 2.3.3 Probability Ratio Encoding

Probability Ratio Encoding replaces each category with the ratio of the probability of the positive target class to the probability of the negative target class within that category, that is, P(target = 1) / P(target = 0). This method is particularly useful when the categories have a predictive relationship with the target variable.
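
The ratio itself can be computed directly with a group-by. The following is a minimal sketch on a hypothetical toy frame; the small constant `eps` is an assumption added here to guard against division by zero when a category contains no negative examples. In practice the ratios would be estimated on the training split only and then mapped onto the test split.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target
toy = pd.DataFrame({
    "ed":      [1, 1, 1, 1, 2, 2, 2, 2],
    "default": [0, 0, 0, 1, 0, 1, 1, 1],
})

# P(default = 1) within each category
p_default = toy.groupby("ed")["default"].mean()

# Probability ratio P(default = 1) / P(default = 0), guarded against a zero denominator
eps = 1e-6
ratio = p_default / (1.0 - p_default).clip(lower=eps)

# Map the per-category ratio back onto the rows
toy["ed_prob_ratio"] = toy["ed"].map(ratio)
print(toy)
```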

### Example

For our dataset, we'll apply probability ratio encoding to the `ed` column, which represents education levels, using the `default` column as the target variable.

Let's apply probability ratio encoding to our dataset.


In [None]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

import category_encoders as ce
import pandas as pd

# Make a copy of the original dataset
data_prob_ratio_encoded = data.copy()

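# Initialize the TargetEncoder
# Note: TargetEncoder replaces each category with a smoothed mean of the target, i.e. an estimate
# of P(default = 1). This is closely related to, but not the same as, the ratio P(1) / P(0).
prob_ratio_encoder = ce.TargetEncoder(cols=['ed'])

# Make sure the target is numeric ('default' was converted to object dtype when the data was loaded)
target = data_prob_ratio_encoded['default'].astype(float)

# Fit the encoder to the data and transform the 'ed' column
data_prob_ratio_encoded['ed_prob_ratio'] = prob_ratio_encoder.fit_transform(data_prob_ratio_encoded[['ed']], target)['ed']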

# Display the first 20 rows to verify the probability ratio encoding
data_prob_ratio_encoded[['ed', 'ed_prob_ratio']].head(20)


# 3. Feature Discretization

Feature discretization involves transforming continuous features into discrete ones. This process can simplify models, capture non-linear relationships, and improve interpretability. In this section, we will explore various discretization techniques and demonstrate how to apply them using practical examples.

### Techniques Covered:

1. **Domain Knowledge**: Use expert knowledge to define meaningful bins.
2. **Equal-Width Binning**: Divide the range of values into equal-width bins.
3. **Equal-Frequency Binning**: Divide the range of values so that each bin has approximately the same number of observations.
4. **K-Means Binning**: Use k-means clustering to create bins based on feature similarity.
5. **ChiMerge**: Merge bins based on the chi-squared statistic to ensure similarity with respect to the target variable.
6. **Decision Tree Binning**: Use decision trees to create bins based on target variable splits.

Let's apply these discretization techniques to our dataset.


# 3.1 Domain Knowledge

Using domain knowledge for feature discretization involves leveraging expert insights to define meaningful bins. This method ensures that the bins are relevant and can capture important patterns in the data.


For our dataset, we'll apply domain knowledge to discretize the following columns:
- `income`
- `age`

Let's apply domain knowledge-based discretization to these columns.


In [None]:
import pandas as pd

# Make a copy of the original dataset
data_discretized = data.copy()

# Define bins for 'income' based on domain knowledge ('income' is recorded in thousands)
# Example bins: Low (<= 30k), Medium (30k-60k), High (60k-100k), Very High (> 100k)
income_bins = [0, 30, 60, 100, float('inf')]
income_labels = [1, 2, 3, 4]  # Using numbers for labels
data_discretized['income_bin_domain'] = pd.cut(data_discretized['income'], bins=income_bins, labels=income_labels)

# Define bins for 'age' based on domain knowledge
# Example bins: Young (< 25), Adult (25-40), Middle-Aged (40-60), Senior (> 60)
age_bins = [0, 25, 40, 60, float('inf')]
age_labels = [1, 2, 3, 4]  # Using numbers for labels
data_discretized['age_bin_domain'] = pd.cut(data_discretized['age'], bins=age_bins, labels=age_labels)

# Display the first 20 rows to verify the domain knowledge-based discretization
data_discretized[['income', 'income_bin_domain', 'age', 'age_bin_domain']].head(20)


# 3.2 Equal-Width Binning

Equal-Width Binning divides the range of values into equal-width bins. This technique is useful for creating evenly spaced intervals, which can help in simplifying the model and capturing important patterns.
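
To see what equal-width bin edges look like, the sketch below computes them by hand with NumPy on a small, right-skewed set of hypothetical values.

```python
import numpy as np

# Hypothetical right-skewed values
values = np.array([10.0, 12.0, 15.0, 20.0, 40.0, 90.0])
n_bins = 4

# Equal-width edges span the range in equal steps
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Assign each value to a bin using the interior edges,
# so the maximum falls into the last bin
bin_index = np.digitize(values, edges[1:-1])

print("edges:", edges)
print("bins: ", bin_index)
```

Here most values land in the first bin and one bin stays empty, which is the typical weakness of equal-width binning on skewed features.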


For our dataset, we'll apply equal-width binning to the following columns:
- `income`
- `age`

We'll use the `KBinsDiscretizer` from `sklearn.preprocessing` to perform the equal-width binning.



In [None]:
from sklearn.preprocessing import KBinsDiscretizer
import pandas as pd

# Make a copy of the original dataset
data_discretized = data.copy()

# Initialize the KBinsDiscretizer for equal-width binning
# Strategy 'uniform' is used for equal-width binning
kbin_discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')

# Select the columns to discretize
columns_to_discretize = ['income', 'age']

# Fit the discretizer to the data and transform the selected columns
data_discretized[columns_to_discretize] = kbin_discretizer.fit_transform(data_discretized[columns_to_discretize])

# Display the first 20 rows to verify the equal-width binning
data_discretized[['income', 'age']].head(20)

# 3.3 Equal-Frequency Binning

Equal-Frequency Binning divides the range of values so that each bin has approximately the same number of observations. This technique is useful for creating bins with an equal number of data points.
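
As a small illustration, the sketch below uses `pandas.qcut` on hypothetical skewed values and counts how many observations fall into each bin.

```python
import pandas as pd

# Hypothetical right-skewed values
values = pd.Series([10.0, 12.0, 15.0, 20.0, 40.0, 90.0, 25.0, 30.0])

# Quantile-based bins: edges are placed at the quartiles of the data
bins = pd.qcut(values, q=4, labels=False)

# Count the observations in each bin
counts = bins.value_counts().sort_index()
print(counts)
```

Each of the four bins receives two observations here, even though the values themselves are far from evenly spaced.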


Let's apply equal-frequency binning to the `income` and `age` columns.

We'll use the `KBinsDiscretizer` from `sklearn.preprocessing` to perform the equal-frequency binning.


In [None]:
from sklearn.preprocessing import KBinsDiscretizer
import pandas as pd

# Strategy 'quantile' is used for equal-frequency binning
kbin_discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')

# Select the columns to discretize
columns_to_discretize = ['income', 'age']

# Fit the discretizer to the training data and transform both training and testing data
train_data_discretized = train_data.copy()
test_data_discretized = test_data.copy()

train_data_discretized[columns_to_discretize] = kbin_discretizer.fit_transform(train_data[columns_to_discretize])
test_data_discretized[columns_to_discretize] = kbin_discretizer.transform(test_data[columns_to_discretize])

# Display the first 20 rows to verify the equal-frequency binning on training data
train_data_discretized[['income', 'age']].head(20)


# 3.4 K-Means Binning

K-Means Binning uses k-means clustering to create bins based on feature similarity. This technique is useful for capturing clusters within the data.
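
As an alternative sketch, `KBinsDiscretizer` also offers `strategy='kmeans'`, which runs one-dimensional k-means per feature and returns bins ordered from lowest to highest. The toy values below are hypothetical and form three well-separated groups.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical values forming three well-separated groups
values = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0], [18.0], [19.0], [20.0]])

# Bin edges are placed midway between neighboring cluster centers
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
bins = disc.fit_transform(values).ravel()

print("bins: ", bins)
print("edges:", disc.bin_edges_[0])
```

Because the bins follow the sorted cluster centers, a larger bin index always corresponds to larger feature values.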


Let's apply k-means binning to the `income` and `age` columns.

We'll use the `KMeans` from `sklearn.cluster` to perform the k-means binning.


In [None]:
from sklearn.cluster import KMeans
import pandas as pd


# Initialize the KMeans model for binning
kmeans = KMeans(n_clusters=4, random_state=42)

# Select the columns to discretize
columns_to_discretize = ['income', 'age']

# Fit the KMeans model to the training data and transform both training and testing data
train_data_discretized = train_data.copy()
test_data_discretized = test_data.copy()

for column in columns_to_discretize:
    # KMeans expects 2-D input of shape (n_samples, 1)
    train_col = train_data[[column]].values
    test_col = test_data[[column]].values

    # Fit the KMeans model to the training data
    kmeans.fit(train_col)

    # Cluster labels are arbitrary, so rank the clusters by their centers
    # to make bin 0 the lowest-valued bin and bin 3 the highest
    center_rank = kmeans.cluster_centers_.ravel().argsort().argsort()

    # Assign the ordered bins to the training and testing data
    train_data_discretized[column + '_bin_kmeans'] = center_rank[kmeans.predict(train_col)]
    test_data_discretized[column + '_bin_kmeans'] = center_rank[kmeans.predict(test_col)]

# Display the first 20 rows to verify the k-means binning on training data
train_data_discretized[[f'{col}_bin_kmeans' for col in columns_to_discretize]].head(20)
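
As an alternative, `KBinsDiscretizer` also accepts `strategy='kmeans'`, which runs one-dimensional k-means per feature and returns bin indices ordered by value. The sketch below uses hypothetical `age` values drawn from three well-separated groups (not the notebook's dataset) and checks that the bin index never decreases as the value increases.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Hypothetical ages drawn from three well-separated groups
age = np.concatenate([
    rng.normal(25, 2, 100),
    rng.normal(45, 2, 100),
    rng.normal(70, 2, 100),
]).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
age_bins = discretizer.fit_transform(age).ravel().astype(int)

# Sort by value and check that the bin index is non-decreasing
order = np.argsort(age.ravel())
is_ordered = bool(np.all(np.diff(age_bins[order]) >= 0))
bin_counts = np.bincount(age_bins, minlength=3)
print(is_ordered, bin_counts, discretizer.bin_edges_[0])
```

Ordered bin indices matter when the binned column is later treated as an ordinal feature; raw `KMeans` labels carry no such order.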


# 3.5 ChiMerge

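ChiMerge is a supervised, bottom-up discretization method. It starts from many narrow intervals and repeatedly merges the pair of adjacent intervals whose target distributions are most similar, as measured by a chi-squared statistic. This technique is useful for creating bins whose target distributions differ clearly from those of their neighbors.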
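
For reference, the chi-squared statistic for two adjacent intervals is usually written as below. The symbols are introduced here for illustration and are not taken from the notebook: $A_{ij}$ is the number of observations in interval $i$ with target class $j$, $R_i$ is the total count in interval $i$, $C_j$ is the total count of class $j$ across both intervals, $N$ is the total count across both intervals, and $k$ is the number of target classes.

$$
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{N}
$$

A value near zero means the two intervals have nearly the same class proportions, which makes them candidates for merging.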


Let's apply ChiMerge to the `income` and `age` columns using the `default` target variable.

We will use the `woebin` function from the `scorecardpy` package to perform the discretization.


In [None]:
!pip install scorecardpy

In [None]:
import pandas as pd
import scorecardpy as sc

# Assuming train_data and test_data have already been defined and split

# Make copies of the training and testing datasets
train_data_discretized = train_data.copy()
test_data_discretized = test_data.copy()

# Select the columns to discretize
columns_to_discretize = ['income', 'age']

# Apply woebin to the selected columns
bins = sc.woebin(train_data, y='default', x=columns_to_discretize, max_num_bin=4, method='chimerge')

# Function to apply bins to a dataset
def apply_bins(data, bins):
    for col in bins.keys():
        bin_df = bins[col]
        bin_edges = bin_df['breaks'].apply(lambda x: float(x.split(",")[0].replace("(", "").replace("[", "")))
        bin_edges = [-float('inf')] + sorted(set(bin_edges)) + [float('inf')]  # Ensure unique edges
        # scorecardpy reports left-closed bins such as [a, b), so use right=False to match them
        data[col + '_bin_chimerge'] = pd.cut(data[col], bins=bin_edges, labels=False, right=False, duplicates='drop')
    return data

# Apply the bins to the training and testing data
train_data_discretized = apply_bins(train_data_discretized, bins)
test_data_discretized = apply_bins(test_data_discretized, bins)

# Display the first 20 rows to verify the ChiMerge binning on the training data
train_data_discretized[[f'{col}_bin_chimerge' for col in columns_to_discretize]].head(20)
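
To make the merging logic concrete, here is a simplified, self-contained sketch of the ChiMerge idea. It is not scorecardpy's implementation. The function names `chi2_adjacent` and `chimerge_edges` are hypothetical, and the sketch stops only when `max_bins` is reached, whereas ChiMerge as usually described also stops on a chi-squared significance threshold.

```python
import numpy as np

def chi2_adjacent(a, b):
    """Chi-squared statistic for two adjacent intervals given their class counts."""
    table = np.vstack([a, b]).astype(float)
    row_totals = table.sum(axis=1, keepdims=True)
    col_totals = table.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / table.sum()
    mask = expected > 0  # skip classes absent from both intervals
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())

def chimerge_edges(x, y, max_bins=4):
    """Return interior cut points; bin i covers [cut[i-1], cut[i])."""
    x, y = np.asarray(x), np.asarray(y)
    classes = np.unique(y)
    values = np.unique(x)  # sorted distinct values, one starting interval each
    # counts[i][k] = number of samples with x == values[i] and y == classes[k]
    counts = [np.array([np.sum((x == v) & (y == c)) for c in classes]) for v in values]
    lowers = list(values)  # lower bound of each interval
    while len(counts) > max_bins:
        chi = [chi2_adjacent(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = int(np.argmin(chi))  # most similar adjacent pair
        counts[i] = counts[i] + counts[i + 1]
        del counts[i + 1]
        del lowers[i + 1]
    return [float(v) for v in lowers[1:]]

# Toy data: the target flips from 0 to 1 between x = 4 and x = 5
cut_points = chimerge_edges([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 0, 1, 1, 1, 1], max_bins=2)
print(cut_points)
```

Intervals with identical class proportions have a statistic of zero and are merged first, so on the toy data the only cut that survives is the one where the target changes.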


# 3.6 Decision Tree Binning

Decision Tree Binning fits a shallow decision tree on a single feature to predict the target and uses the tree's split thresholds as bin edges. Because each split is chosen to reduce impurity, the resulting bins separate the target classes as well as a tree with the allowed number of leaves can.


Let's apply decision tree binning to the `income` and `age` columns using the `default` target variable.


In [None]:
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


# Ensure the target variable is properly encoded
label_encoder = LabelEncoder()
train_data['default'] = label_encoder.fit_transform(train_data['default'])
test_data['default'] = label_encoder.transform(test_data['default'])

# Make copies of the training and testing datasets
train_data_discretized = train_data.copy()
test_data_discretized = test_data.copy()

# Select the columns to discretize
columns_to_discretize = ['income', 'age']

# Function to apply decision tree binning to a dataset
def apply_decision_tree_binning(train_data, test_data, column, target, max_leaf_nodes):
    # Initialize the decision tree model
    tree_model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=42)
    
    # Fit the model to the training data
    tree_model.fit(train_data[[column]], train_data[target])
    
    # Extract the split thresholds from the internal nodes (leaves have feature index -2)
    binning_points = tree_model.tree_.threshold[tree_model.tree_.feature >= 0]
    
    # Sort, deduplicate, and add open-ended outer boundaries
    binning_points = np.unique(binning_points)
    binning_points = [-float('inf')] + binning_points.tolist() + [float('inf')]
    
    # Apply the bins to the training and testing data
    train_data[column + '_bin_tree'] = pd.cut(train_data[column], bins=binning_points, labels=False, include_lowest=True)
    test_data[column + '_bin_tree'] = pd.cut(test_data[column], bins=binning_points, labels=False, include_lowest=True)
    
    return train_data, test_data

# Apply decision tree binning to the selected columns
for column in columns_to_discretize:
    train_data_discretized, test_data_discretized = apply_decision_tree_binning(train_data_discretized, test_data_discretized, column, 'default', max_leaf_nodes=4)

# Display the first 20 rows to verify the decision tree binning on the training data
train_data_discretized[[f'{col}_bin_tree' for col in columns_to_discretize]].head(20)
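
To see how the thresholds become bin edges, here is a minimal sketch on hypothetical data (not the notebook's dataset) in which the target flips from 0 to 1 once `x` reaches 50. A tree limited to two leaves should place its single split between 49 and 50.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the target flips from 0 to 1 once x reaches 50
x = np.arange(100, dtype=float)
y = (x >= 50).astype(int)

tree = DecisionTreeClassifier(max_leaf_nodes=2, random_state=42)
tree.fit(x.reshape(-1, 1), y)

# Internal nodes have a non-negative feature index; leaves are marked with -2
thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])

# The tree sends x <= threshold to the left child, so right-closed intervals
# (the pd.cut default) reproduce its partition
edges = [-np.inf] + thresholds.tolist() + [np.inf]
x_bins = pd.cut(x, bins=edges, labels=False)
bin_counts = np.bincount(x_bins)
print(thresholds, bin_counts)
```

Raising `max_leaf_nodes` allows more splits and therefore more bins, at the cost of bins that may fit noise in the training target.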


### Note: You can also use the Feature-engine library for decision tree binning and more
https://feature-engine.trainindata.com/en/latest/

# Thank You for Exploring This Notebook!

If you have any questions, suggestions, or just want to discuss any of the topics further, please don't hesitate to reach out or leave a comment. Your feedback is not only welcome but also invaluable! If you have any additional insights or methods that were not covered in this notebook, please suggest them in the comments. This notebook will be updated regularly to include more helpful tips and techniques!

Happy analyzing, and stay curious!

Best regards,

[Matin Mahmoudi](https://www.kaggle.com/matinmahmoudi)
