# Data Transform

In this notebook, you will answer a series of questions that evaluate your findings from your EDA. For each question, apply the data transformation that follows from your response and justification.

If you decide not to apply a data transformation for a step, you must **justify** why your dataset or machine learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project, we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: This dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.
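As a minimal sketch of the subsetting idea in the note above, the snippet below draws a reproducible random subset with `DataFrame.sample()`. The `toy` dataframe, its synthetic values, the 10% fraction, and the `random_state` seed are all assumptions for illustration; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transactions dataframe (synthetic values, not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 10_000, size=1_000),
    "isFraud": rng.integers(0, 2, size=1_000),
})

# Draw a 10% random subset; random_state makes the selection repeatable
subset = toy.sample(frac=0.10, random_state=42)
print(subset.shape)
```

Because fraudulent rows are rare in the real data, it is worth checking that the subset still contains enough of them before transforming it.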

In [3]:
import pandas as pd
import numpy as np

transactions = pd.read_csv("../data/bank_transactions.csv")


In [4]:
# Check for missing values
print("Missing values per column:")
print(transactions.isnull().sum())

# Check data types
print("\nData types:")
print(transactions.dtypes)

# Check column names
print("\nColumn names:")
print(transactions.columns)

# Drop non-predictive columns (like account names and naive flag)
non_predictive_cols = ['nameOrig', 'nameDest', 'isFlaggedFraud']

# Drop the columns from the dataset
transactions_cleaned = transactions.drop(columns=non_predictive_cols)

# Confirm drop
print("\nColumns after dropping non-predictive ones:")
print(transactions_cleaned.columns)

# Confirm data shape
print("\nShape after drop:", transactions_cleaned.shape)


Missing values per column:
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

Data types:
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

Column names:
Index(['type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

Columns after dropping non-predictive ones:
Index(['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud'],
      dtype='object')

Shape after drop: (1000000, 7)


## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

There are **no missing values** in any of the columns. However, there are **non-predictive columns** that should be removed to improve model generalization:

- `nameOrig` and `nameDest` are account identifiers, which do not contain meaningful patterns that generalize to new data. Including them could lead to overfitting.
- `isFlaggedFraud` is a naive flag based on a fixed threshold. It is not helpful as a predictor because it is already derived from the `amount` column and performs poorly at flagging actual fraud.

To improve predictive capabilities, we drop these columns. The cleaned dataset now includes only the categorical `type` column, the numerical features, and the target variable `isFraud`.

## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Answer here
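One possible way to approach this question, as a minimal sketch: compare the mean amount and fraud rate per transaction type, then one-hot encode `type` with `pd.get_dummies` so a model can use it. The `toy` rows and the type labels (`PAYMENT`, `TRANSFER`, `CASH_OUT`) are assumptions for illustration; check the actual labels with `transactions_cleaned['type'].unique()`.

```python
import pandas as pd

# Toy rows (synthetic); the type labels here are assumed, not read from the data
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER"],
    "amount": [120.0, 5400.0, 980.0, 15000.0],
    "isFraud": [0, 1, 0, 1],
})

# Compare average amount and fraud rate for each transaction type
summary = toy.groupby("type").agg(
    mean_amount=("amount", "mean"),
    fraud_rate=("isFraud", "mean"),
)
print(summary)

# One-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=["type"], prefix="type", dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between transaction types, which an integer label encoding would do.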

## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Answer here
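One strategy among several, sketched under stated assumptions: randomly undersample the majority class so that fraud rows make up a larger share of the training data. The `toy` dataframe, its 2% fraud rate, the 4:1 target `ratio`, and the `random_state` seed are hypothetical choices, not values from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy imbalanced data (synthetic): 20 fraud rows out of 1,000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 10_000, size=1_000),
    "isFraud": [1] * 20 + [0] * 980,
})

fraud = toy[toy["isFraud"] == 1]
legit = toy[toy["isFraud"] == 0]

# Keep every fraud row and sample non-fraud rows down to a 4:1 ratio
ratio = 4  # hypothetical non-fraud to fraud ratio
legit_down = legit.sample(n=len(fraud) * ratio, random_state=42)

# Combine, shuffle, and reset the index
balanced = (
    pd.concat([fraud, legit_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["isFraud"].value_counts())
```

Resampling should generally be applied to the training split only, so that evaluation reflects the original class balance. Precision and recall are usually more informative than accuracy when the positive class is this rare.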

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer here
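As a hedged sketch of what engineered interaction features might look like, the snippet below derives two balance-gap features and one type-by-amount flag from the columns kept after Q1. The new column names (`origBalanceGap`, `destBalanceGap`, `isLargeTransfer`), the `TRANSFER` label, the `threshold` value, and the `toy` rows are all hypothetical.

```python
import pandas as pd

# Toy rows (synthetic) using the column names kept after the Q1 cleanup
toy = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [1000.0, 50.0, 300.0],
    "oldbalanceOrg": [1000.0, 500.0, 300.0],
    "newbalanceOrig": [0.0, 450.0, 100.0],
    "oldbalanceDest": [0.0, 0.0, 200.0],
    "newbalanceDest": [0.0, 0.0, 500.0],
})

features = toy.copy()

# Gap between the expected and the recorded origin balance after the transaction
features["origBalanceGap"] = (
    features["oldbalanceOrg"] - features["amount"] - features["newbalanceOrig"]
)

# Gap between the expected and the recorded destination balance
features["destBalanceGap"] = (
    features["oldbalanceDest"] + features["amount"] - features["newbalanceDest"]
)

# Interaction of transaction type and amount: flag large transfers
threshold = 500.0  # hypothetical cut-off; choose one from your EDA
features["isLargeTransfer"] = (
    (features["type"] == "TRANSFER") & (features["amount"] > threshold)
).astype(int)

print(features[["origBalanceGap", "destBalanceGap", "isLargeTransfer"]])
```

Whether these features help is an empirical question; compare model performance with and without them.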

In [5]:
# write out newly transformed dataset to your folder
...
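A minimal sketch of the write-out step, assuming a hypothetical file name (`transactions_transformed.csv`) and a toy dataframe in place of the transformed data. It writes to a temporary directory so that it runs anywhere; in the notebook, the target would be the `data/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the transformed dataframe (synthetic values)
toy = pd.DataFrame({"amount": [120.0, 5400.0], "isFraud": [0, 1]})

# A temporary directory keeps this sketch self-contained
out_dir = Path(tempfile.mkdtemp()) / "data"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "transactions_transformed.csv"  # hypothetical file name

# index=False avoids writing the row index as an extra unnamed column
toy.to_csv(out_path, index=False)

# Read the file back to confirm the columns survived the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```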