# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response and justification, we will also ask you to apply the corresponding data transformation.

If you state that you will not apply any data transformation for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.

In [23]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder



In [24]:
# Load data
transactions = pd.read_csv("../data/bank_transactions.csv")


In [25]:
# Drop non-predictive columns
transactions = transactions.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])


In [26]:
# Check column names
print("\nColumn names:")
print(transactions.columns)

# Non-predictive columns: account identifiers and the naive rule-based flag
non_predictive_cols = ['nameOrig', 'nameDest', 'isFlaggedFraud']

# Only drop the columns that exist in the DataFrame
cols_to_drop = [col for col in non_predictive_cols if col in transactions.columns]
transactions = transactions.drop(columns=cols_to_drop)

# Confirm drop
print("\nColumns after dropping non-predictive ones:")
print(transactions)

# Confirm data shape
print("\nShape after drop:", transactions.shape)



Column names:
Index(['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud'],
      dtype='object')

Columns after dropping non-predictive ones:
            type      amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0        PAYMENT      983.09       36730.24        35747.15            0.00   
1        PAYMENT    55215.25       99414.00        44198.75            0.00   
2        CASH_IN   220986.01     7773074.97      7994060.98       924031.48   
3       TRANSFER  2357394.75           0.00            0.00      4202580.45   
4       CASH_OUT    67990.14           0.00            0.00       625317.04   
...          ...         ...            ...             ...             ...   
999995   PAYMENT    13606.07      114122.11       100516.04            0.00   
999996   PAYMENT     9139.61           0.00            0.00            0.00   
999997  CASH_OUT   153650.41       50677.00            0.00            0.00   
999998  CASH_

In [27]:
# Encode transaction type as integer codes
encoder = LabelEncoder()
transactions['type'] = encoder.fit_transform(transactions['type'])

In [None]:
# Create engineered features
transactions['errorOrig'] = transactions['oldbalanceOrg'] - transactions['newbalanceOrig'] - transactions['amount']
transactions['errorDest'] = transactions['newbalanceDest'] - transactions['oldbalanceDest'] - transactions['amount']


In [None]:
# Confirm changes
print("\nTransformed Dataset Preview:")
print(transactions.head())
print("\nShape after transformation:", transactions.shape)


Transformed Dataset Preview:
   type      amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0     3      983.09       36730.24        35747.15            0.00   
1     3    55215.25       99414.00        44198.75            0.00   
2     0   220986.01     7773074.97      7994060.98       924031.48   
3     4  2357394.75           0.00            0.00      4202580.45   
4     1    67990.14           0.00            0.00       625317.04   

   newbalanceDest  isFraud     errorOrig  errorDest  
0            0.00        0 -3.524292e-12    -983.09  
1            0.00        0  0.000000e+00  -55215.25  
2       703045.48        0 -4.419720e+05 -441972.01  
3      6559975.19        0 -2.357395e+06      -0.01  
4       693307.19        0 -6.799014e+04       0.01  

Shape after transformation: (1000000, 9)


In [None]:
# Check for class imbalance
print("\nClass Balance (isFraud):")
print(transactions['isFraud'].value_counts())


Class Balance (isFraud):
isFraud
0    998703
1      1297
Name: count, dtype: int64


## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

There are **no missing values** in any of the columns. However, there are **non-predictive columns** that should be removed to improve model generalization:

- `nameOrig` and `nameDest` are account identifiers, which do not contain meaningful patterns that generalize to new data. Including them could lead to overfitting.
- `isFlaggedFraud` is a naive flag based on a fixed threshold. It is not helpful as a predictor because it is already derived from the `amount` column and performs poorly at flagging actual fraud.

To improve predictive capabilities, we drop these columns. The cleaned dataset now includes only the transaction `type`, the numerical amount and balance features, and the target variable `isFraud`.
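The check and the drop described above can be sketched as follows. This is a minimal illustration on a toy DataFrame with made-up identifier values, not the full `bank_transactions.csv` file; the `toy`, `missing_per_column`, and `cleaned` names are assumptions introduced for the example.

```python
import pandas as pd

# Toy stand-in for the transactions table; identifier values are made up.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [983.09, 2357394.75, 67990.14],
    "nameOrig": ["C1", "C2", "C3"],
    "nameDest": ["M1", "C4", "C5"],
    "isFraud": [0, 0, 0],
    "isFlaggedFraud": [0, 0, 0],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# errors="ignore" keeps the drop safe to re-run if the columns are already gone.
non_predictive_cols = ["nameOrig", "nameDest", "isFlaggedFraud"]
cleaned = toy.drop(columns=non_predictive_cols, errors="ignore")
print(cleaned.columns.tolist())
```

Running the same `isna().sum()` check on the real `transactions` DataFrame is one way to back up the "no missing values" claim with output.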

## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Yes, certain transaction types (like TRANSFER and CASH_OUT) are more strongly associated with fraudulent behavior. Meanwhile, types like PAYMENT, DEBIT, and CASH_IN are rarely or never fraudulent.

To make this pattern usable by a machine learning model, we should encode the `type` column numerically. Integer label encoding (as with `LabelEncoder`) is acceptable for tree-based models like Random Forest or XGBoost, which split on thresholds and do not treat the codes as evenly spaced quantities. For linear models like logistic regression, one-hot encoding is the safer choice because it avoids introducing an unintended ordinal relationship between transaction types.
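The two encodings can be compared side by side on a small example. This sketch uses only pandas: the integer codes come from a categorical dtype, which, like `LabelEncoder`, numbers the sorted category labels from 0. The `toy`, `label_encoded`, and `one_hot` names are assumptions for illustration.

```python
import pandas as pd

# One row per transaction type that appears in the dataset.
toy = pd.DataFrame(
    {"type": ["PAYMENT", "TRANSFER", "CASH_IN", "CASH_OUT", "DEBIT"]}
)

# Integer codes: categories are sorted alphabetically, then numbered from 0.
label_encoded = toy["type"].astype("category").cat.codes
print(label_encoded.tolist())

# One-hot encoding: one 0/1 indicator column per transaction type.
one_hot = pd.get_dummies(toy, columns=["type"], prefix="type", dtype=int)
print(one_hot.columns.tolist())
```

The integer codes here agree with the transformed dataset preview, where PAYMENT appears as 3 and CASH_IN as 0. The one-hot version adds four extra columns but gives each type its own coefficient in a linear model.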

## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Fraudulent transactions make up about 0.13% of the dataset (1,297 of 1,000,000 rows), creating a severe class imbalance. If left unaddressed, a machine learning model may learn to always predict "not fraud" (class 0), achieving high accuracy but zero recall on fraudulent cases.

To overcome this, we can apply techniques like:

- Resampling, such as:
  - SMOTE (Synthetic Minority Over-sampling Technique)
  - Random undersampling of the majority class
- Using class weights, which penalize misclassifying minority class examples more
- Evaluating with metrics like F1 score, precision, and recall, not just accuracy

These techniques help the model learn meaningful fraud-detection patterns instead of defaulting to the majority class.
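Two of these strategies, class weights and random undersampling, can be sketched with pandas alone. The toy data below is made up to mimic the imbalance; the weight formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic, and the `toy`, `class_weights`, and `balanced` names are assumptions for illustration.

```python
import pandas as pd

# Toy labels with a strong imbalance: 10 fraud rows out of 1000.
toy = pd.DataFrame({"amount": range(1000), "isFraud": [1] * 10 + [0] * 990})

# "Balanced" class weights: rarer classes receive proportionally larger weights.
counts = toy["isFraud"].value_counts()
class_weights = (len(toy) / (len(counts) * counts)).to_dict()
print(class_weights)

# Random undersampling: keep every fraud row and an equal number of
# randomly chosen non-fraud rows, then shuffle the result.
fraud = toy[toy["isFraud"] == 1]
non_fraud = toy[toy["isFraud"] == 0].sample(n=len(fraud), random_state=42)
balanced = (
    pd.concat([fraud, non_fraud])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["isFraud"].value_counts())
```

Resampling is usually applied to the training split only, so that evaluation metrics still reflect the real class distribution.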

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

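Yes. The feature-engineering cell earlier in this notebook adds two balance-discrepancy features that the raw columns do not expose directly:

- `errorOrig` = `oldbalanceOrg` - `newbalanceOrig` - `amount`
- `errorDest` = `newbalanceDest` - `oldbalanceDest` - `amount`

For an outgoing transaction, `errorOrig` is close to zero when the change in the origin balance matches the transaction amount, so a large nonzero value marks a transaction whose balances do not reconcile. `errorDest` plays the same role for the destination account.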
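One way to explore this question is to build interaction features by hand. The sketch below recomputes the `errorOrig` discrepancy from the feature-engineering cell earlier in the notebook and adds a hypothetical interaction flag for high-amount TRANSFER or CASH_OUT transactions. The `isRiskyType` and `riskyHighAmount` column names and the `HIGH_AMOUNT` threshold are assumptions chosen for illustration, not values derived from the EDA.

```python
import pandas as pd

# A few rows modeled on the dataset preview.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN"],
    "amount": [983.09, 2357394.75, 67990.14, 220986.01],
    "oldbalanceOrg": [36730.24, 0.0, 0.0, 7773074.97],
    "newbalanceOrig": [35747.15, 0.0, 0.0, 7994060.98],
})

# Balance discrepancy at the origin account, as in the earlier cell.
toy["errorOrig"] = toy["oldbalanceOrg"] - toy["newbalanceOrig"] - toy["amount"]

# Hypothetical indicator for the types most associated with fraud.
toy["isRiskyType"] = toy["type"].isin(["TRANSFER", "CASH_OUT"]).astype(int)

# Hypothetical interaction: risky type AND amount above an illustrative cutoff.
HIGH_AMOUNT = 200_000  # assumption; tune from the amount distribution
toy["riskyHighAmount"] = toy["isRiskyType"] * (toy["amount"] > HIGH_AMOUNT).astype(int)

print(toy[["type", "amount", "errorOrig", "isRiskyType", "riskyHighAmount"]])
```

Tree-based models can discover such interactions on their own, so explicit flags like these tend to matter more for linear models.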

In [None]:
# write out newly transformed dataset to your folder
...
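A minimal sketch of the write-out step, assuming a hypothetical helper `write_transformed` and a hypothetical filename `transactions_transformed.csv`. Neither name comes from the assignment, and the demo writes to a temporary directory rather than the project's `data/` folder so that it can run anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd


def write_transformed(
    df: pd.DataFrame,
    out_dir: Path,
    filename: str = "transactions_transformed.csv",  # hypothetical name
) -> Path:
    """Write df as CSV under out_dir, creating the folder if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(out_path, index=False)
    return out_path


# Demo on a toy frame; read the file back to confirm the round trip.
toy = pd.DataFrame(
    {"type": [3, 4], "amount": [983.09, 2357394.75], "isFraud": [0, 0]}
)
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_transformed(toy, Path(tmp) / "data")
    round_trip = pd.read_csv(out_path)

print(round_trip)
```

In the notebook itself, the equivalent call would pass the `transactions` DataFrame and the relative path to your `data/` folder.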