**Case Study: The Feature Engineering Process**

This case study walks through a banking-specific use case that covers each process of feature engineering (Feature Creation, Feature Transformation, Feature Extraction, and Feature Selection), using small sample datasets to illustrate each step.

**Use Case: Customer Credit Risk Prediction**

Objective: Predict the likelihood of a customer defaulting on a credit card payment based on various customer features and transaction data.

Dataset Description:
- Customer Data: Includes demographic and financial information about the customer.
- Transaction Data: Includes details about each transaction made by the customer.

**1. Feature Creation**

Scenario: You want to create features that might better capture a customer’s financial behavior for predicting credit risk.

Original Features:
- `Total_Income`
- `Total_Debt`
- `Number_of_Transactions`
- `Total_Transaction_Value`
- `Monthly_Income`
- `Account_Age`

Feature Creation Step:
  - Feature: `Debt_to_Income_Ratio`
  - Description: The ratio of total debt to total income, which might indicate financial strain.
  - Formula: `Debt_to_Income_Ratio = Total_Debt / Total_Income`

  - Feature: `Average_Transaction_Value`
  - Description: Average value of transactions over a period.
  - Formula: `Average_Transaction_Value = Total_Transaction_Value / Number_of_Transactions`

```python
# 1. Feature Creation - Implementation:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Total_Income': [50000, 60000, 70000],
    'Total_Debt': [10000, 20000, 30000],
    'Number_of_Transactions': [50, 60, 70],
    'Total_Transaction_Value': [100000, 120000, 140000],
    'Account_Age': [5, 10, 15]
})

# Feature Creation
df['Debt_to_Income_Ratio'] = df['Total_Debt'] / df['Total_Income']
df['Average_Transaction_Value'] = df['Total_Transaction_Value'] / df['Number_of_Transactions']
print(df)
```

```
   Total_Income  Total_Debt  Number_of_Transactions  Total_Transaction_Value  \
0         50000       10000                      50                   100000   
1         60000       20000                      60                   120000   
2         70000       30000                      70                   140000   

   Account_Age  Debt_to_Income_Ratio  Average_Transaction_Value  
0            5              0.200000                     2000.0  
1           10              0.333333                     2000.0  
2           15              0.428571                     2000.0  
```
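Both ratios above divide by a column (`Total_Income`, `Number_of_Transactions`) that could be zero in real data, for example for a newly opened account with no transactions yet. The following is a minimal sketch of one way to guard against that, assuming an undefined ratio should be recorded as missing rather than as infinity. The helper name `safe_ratio` and the sample rows are hypothetical and not part of the original dataset.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, yielding NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Hypothetical sample: the second customer has no income and no transactions.
df_safe = pd.DataFrame({
    "Total_Income": [50000, 0],
    "Total_Debt": [10000, 5000],
    "Number_of_Transactions": [50, 0],
    "Total_Transaction_Value": [100000, 0],
})

df_safe["Debt_to_Income_Ratio"] = safe_ratio(
    df_safe["Total_Debt"], df_safe["Total_Income"]
)
df_safe["Average_Transaction_Value"] = safe_ratio(
    df_safe["Total_Transaction_Value"], df_safe["Number_of_Transactions"]
)
print(df_safe[["Debt_to_Income_Ratio", "Average_Transaction_Value"]])
```

How the resulting missing values are then handled (imputation, a sentinel value, or dropping the row) depends on the downstream model and is left open here.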


**2. Feature Transformation**

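Scenario: Rescale features to a common range so that features with large magnitudes do not dominate the model. Note that Min-Max Scaling changes the range of a feature, not the shape of its distribution.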

Original Features:
- `Monthly_Income`
- `Total_Debt`
- `Total_Transaction_Value`

Feature Transformation Step:
  - Transformation: Apply Min-Max Scaling to rescale each feature to the range [0, 1].
  - Formula: `Normalized_Feature = (Feature - min(Feature)) / (max(Feature) - min(Feature))`

```python
# 2. Feature Transformation - Implementation:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Monthly_Income': [4000, 5000, 6000],
    'Total_Debt': [10000, 20000, 30000],
    'Total_Transaction_Value': [100000, 120000, 140000]
})

# Feature Transformation
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_features, columns=df.columns)
print(df_scaled)
```

```
   Monthly_Income  Total_Debt  Total_Transaction_Value
0             0.0         0.0                      0.0
1             0.5         0.5                      0.5
2             1.0         1.0                      1.0
```
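To connect the Min-Max formula to the library call, the sketch below applies the formula by hand with pandas and checks that it agrees with `MinMaxScaler` on the same sample data. It also scales one new row with the already fitted scaler; the `new_customer` values are hypothetical, chosen to fall outside the range seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same sample data as in the transformation cell.
df_train = pd.DataFrame({
    "Monthly_Income": [4000, 5000, 6000],
    "Total_Debt": [10000, 20000, 30000],
    "Total_Transaction_Value": [100000, 120000, 140000],
})

# Formula applied column by column: (x - min) / (max - min).
manual = (df_train - df_train.min()) / (df_train.max() - df_train.min())

# Library equivalent, fitted once on the same data.
scaler = MinMaxScaler().fit(df_train)
matches = np.allclose(manual.to_numpy(), scaler.transform(df_train))
print("Manual formula matches MinMaxScaler:", matches)

# Hypothetical new customer, scaled with the min/max learned above.
new_customer = pd.DataFrame({
    "Monthly_Income": [7000],
    "Total_Debt": [5000],
    "Total_Transaction_Value": [120000],
})
new_scaled = scaler.transform(new_customer)
print(new_scaled)
```

Because the scaler reuses the minimum and maximum learned during fitting, values outside that range map outside [0, 1]. In practice the scaler would typically be fitted on training data only and then reused for validation and production data.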


**3. Feature Extraction**

Scenario: Extract meaningful features from transaction data for analyzing customer behavior.

Original Data: Transaction details including `Transaction_Description`, `Transaction_Amount`, and `Transaction_Date`.

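Feature Extraction Step:
  - Feature Extraction: Extract features such as `Transaction_Frequency` and `Transaction_Category`.
  - Transaction Frequency: Number of transactions per month.
  - Transaction Category: Categories of transactions (e.g., groceries, utilities, dining) derived from the text of `Transaction_Description`. The `CountVectorizer` implementation in this section takes the first step toward this by turning each description into bag-of-words token counts.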
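`Transaction_Frequency` as defined above could be computed along the following lines. This is a minimal sketch under stated assumptions: the `Customer_ID` column and all sample rows are hypothetical (the original data description does not name a customer key), and the frequency is averaged over months in which the customer made at least one transaction.

```python
import pandas as pd

# Hypothetical transaction log; Customer_ID is an assumed key column.
tx = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 2, 2],
    "Transaction_Date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-03",
        "2024-01-11", "2024-03-02",
    ]),
    "Transaction_Amount": [120.0, 45.5, 300.0, 80.0, 60.0],
})

# Count transactions per customer per calendar month.
monthly_counts = tx.groupby(
    ["Customer_ID", tx["Transaction_Date"].dt.to_period("M")]
).size()

# Average the monthly counts per customer. Months with no activity do not
# appear in monthly_counts, so they are not included in this average.
transaction_frequency = (
    monthly_counts.groupby(level="Customer_ID")
    .mean()
    .rename("Transaction_Frequency")
)
print(transaction_frequency)
```

If inactive months should count as zero, the monthly counts would need to be reindexed over the full observation window before averaging.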

```python
# 3. Feature Extraction - Implementation:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample transaction descriptions
transactions = [
    "Grocery store purchase", "Electricity bill payment",
    "Restaurant meal", "Grocery store purchase", "Rent payment"
]

# Bag-of-words token counts as a basis for Transaction Categories
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(transactions)
feature_names = vectorizer.get_feature_names_out()
df_features = pd.DataFrame(X.toarray(), columns=feature_names)
print(df_features)
```

```
   bill  electricity  grocery  meal  payment  purchase  rent  restaurant  \
0     0            0        1     0        0         1     0           0   
1     1            1        0     0        1         0     0           0   
2     0            0        0     1        0         0     0           1   
3     0            0        1     0        0         1     0           0   
4     0            0        0     0        1         0     1           0   

   store  
0      1  
1      0  
2      0  
3      1  
4      0  
```
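The token counts above are not yet categories. One simple way to get from description text to a `Transaction_Category` label is a keyword lookup, sketched below. The `CATEGORY_KEYWORDS` mapping, the `categorize` helper, and the "housing" and "other" labels are hypothetical; a production system might instead use a trained text classifier or merchant category codes.

```python
import pandas as pd

# Hypothetical keyword lists; a real mapping would need to be curated.
CATEGORY_KEYWORDS = {
    "groceries": ["grocery", "supermarket"],
    "utilities": ["electricity", "water", "gas"],
    "dining": ["restaurant", "meal"],
    "housing": ["rent", "mortgage"],
}


def categorize(description: str) -> str:
    """Return the first category whose keyword appears as a token."""
    tokens = description.lower().split()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in tokens for keyword in keywords):
            return category
    return "other"


transactions = [
    "Grocery store purchase", "Electricity bill payment",
    "Restaurant meal", "Grocery store purchase", "Rent payment",
]

df_tx = pd.DataFrame({"Transaction_Description": transactions})
df_tx["Transaction_Category"] = df_tx["Transaction_Description"].map(categorize)

# One-hot encode the label so it can be used as model input.
category_dummies = pd.get_dummies(
    df_tx["Transaction_Category"], prefix="Transaction_Category"
)
print(df_tx)
print(category_dummies)
```

The one-hot columns play the same role as the `Transaction_Category_*` columns used as inputs in the feature selection step.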


**4. Feature Selection**

Scenario: Select the subset of candidate features that is most useful for a model that predicts credit default risk.

Original Features:
- `Debt_to_Income_Ratio`
- `Average_Transaction_Value`
- `Normalized_Monthly_Income`
- `Normalized_Total_Debt`
- `Transaction_Category_1`
- `Transaction_Category_2`
- `Transaction_Category_3`

Feature Selection Technique: Use Recursive Feature Elimination (RFE) with a Logistic Regression model to select the most relevant features for predicting credit default.

```python
# 4. Feature Selection - Implementation:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Sample DataFrame (three rows only, so the selection is purely illustrative)
df = pd.DataFrame({
    'Debt_to_Income_Ratio': [0.2, 0.3, 0.4],
    'Average_Transaction_Value': [200, 300, 400],
    'Normalized_Monthly_Income': [0.4, 0.5, 0.6],
    'Normalized_Total_Debt': [0.3, 0.5, 0.7],
    'Transaction_Category_1': [1, 0, 1],
    'Transaction_Category_2': [0, 1, 0],
    'Transaction_Category_3': [0, 0, 1],
    'Credit_Default': [0, 1, 1]
})

# Features and target
X = df.drop('Credit_Default', axis=1)
y = df['Credit_Default']

# Feature Selection with RFE: repeatedly fit the model and drop the weakest feature
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=4)
fit = rfe.fit(X, y)
print("Selected Features: %s" % list(X.columns[fit.support_]))
```

```
Selected Features: ['Average_Transaction_Value', 'Normalized_Total_Debt', 'Transaction_Category_1', 'Transaction_Category_2']
```
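With Logistic Regression, RFE ranks features by the magnitude of their fitted coefficients, and those magnitudes depend on each feature's scale. The sketch below standardizes the features before running RFE and prints the full elimination ranking via `ranking_`. The standardization step is an addition for illustration, not part of the cell above, and the selected set it produces may differ from the output shown there.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Same toy data as in the feature selection cell.
df_sel = pd.DataFrame({
    "Debt_to_Income_Ratio": [0.2, 0.3, 0.4],
    "Average_Transaction_Value": [200, 300, 400],
    "Normalized_Monthly_Income": [0.4, 0.5, 0.6],
    "Normalized_Total_Debt": [0.3, 0.5, 0.7],
    "Transaction_Category_1": [1, 0, 1],
    "Transaction_Category_2": [0, 1, 0],
    "Transaction_Category_3": [0, 0, 1],
    "Credit_Default": [0, 1, 1],
})

X_sel = df_sel.drop(columns="Credit_Default")
y_sel = df_sel["Credit_Default"]

# Put all features on a comparable scale so coefficient sizes are comparable.
X_std = pd.DataFrame(
    StandardScaler().fit_transform(X_sel), columns=X_sel.columns
)

rfe_std = RFE(LogisticRegression(), n_features_to_select=4).fit(X_std, y_sel)

# Rank 1 marks a selected feature; higher ranks were eliminated earlier.
ranking = pd.Series(rfe_std.ranking_, index=X_sel.columns).sort_values()
print(ranking)
```

With only three rows, any ranking here is unstable; on a realistic dataset the number of features to keep would usually be chosen with cross-validation rather than fixed in advance.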


**Summary:**

- Feature Creation: Generated `Debt_to_Income_Ratio` and `Average_Transaction_Value` to better capture financial behavior.
- Feature Transformation: Rescaled `Monthly_Income`, `Total_Debt`, and `Total_Transaction_Value` to the range [0, 1] using Min-Max Scaling.
- Feature Extraction: Converted transaction descriptions into bag-of-words token counts with `CountVectorizer`, as a basis for `Transaction_Category` features.
- Feature Selection: Used RFE with Logistic Regression to select the most relevant features for predicting credit risk.

These examples illustrate how each feature engineering process can be applied to customer and transaction data to prepare inputs for a credit risk model.