# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response & justification, we will ask you to also apply a subsequent data transformation. 

If you state that you will not apply any data transformation for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.
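
As a minimal sketch of the subsampling suggestion in the note, assuming a small toy dataframe in place of the real file (the values below are made up for illustration), `DataFrame.sample` can draw a reproducible fraction of the rows:

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "amount": [10.0, 250.0, 31.5, 990.0, 4.2, 77.0, 12.0, 560.0, 8.8, 300.0],
    "isFraud": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})

# Draw 50% of the rows; random_state makes the draw reproducible.
subset = toy.sample(frac=0.5, random_state=42)
print(len(subset))
```

Because fraud is rare, a random subset may contain very few fraud rows, so it is worth checking `subset["isFraud"].mean()` before relying on it.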

In [51]:
import pandas as pd
import numpy as np

## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

Answer: The dataset has no missing values, as confirmed by `transactions.isna().sum()` below. We do remove three non-predictive columns, `nameOrig`, `nameDest`, and `isFlaggedFraud`, because they are not features we can use to predict the target variable (`isFraud`).

In [52]:
transactions = pd.read_csv("../data/bank_transactions.csv")
transactions.isna().sum()

type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [53]:
# Drop the columns identified above as non-predictive.
transactions.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'], inplace=True)

In [54]:
# Safeguard only: no column has missing values, so dropna() removes no rows here.
transactions.dropna(inplace=True)
transactions.isna().sum()

type              0
amount            0
oldbalanceOrg     0
newbalanceOrig    0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
dtype: int64

## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Answer: Yes. Fraud rates differ sharply by transaction type. As the two outputs below show, fraud occurs only in TRANSFER and CASH_OUT transactions, with fraud rates of about 0.76% and 0.19% respectively, while CASH_IN, DEBIT, and PAYMENT have a fraud rate of 0% in this dataset. TRANSFER also has by far the largest average amount (about 911,827). The CASH_OUT average (about 175,585) is only slightly above that of CASH_IN (about 168,929), so amount alone does not separate the fraud-prone types. Fraud is therefore concentrated in TRANSFER and CASH_OUT transactions, and TRANSFER in particular combines the largest amounts with the highest fraud rate.

In [55]:
type_amount_mean = transactions.groupby('type')['amount'].mean().sort_values(ascending=False)
type_amount_mean

type
TRANSFER    911827.155179
CASH_OUT    175584.659320
CASH_IN     168928.914668
PAYMENT      13055.592085
DEBIT         5445.890813
Name: amount, dtype: float64

In [56]:
transactions.groupby('type')['isFraud'].mean().sort_values(ascending=False)

type
TRANSFER    0.007647
CASH_OUT    0.001870
CASH_IN     0.000000
DEBIT       0.000000
PAYMENT     0.000000
Name: isFraud, dtype: float64
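
One way to make the per-type pattern usable by a model is to one-hot encode the `type` column. The following is a minimal sketch on a toy dataframe (the rows are made up for illustration, and this is not necessarily the transformation applied in this notebook), using `pd.get_dummies`:

```python
import pandas as pd

# Toy rows standing in for the real transactions; values are illustrative only.
toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "CASH_IN", "DEBIT", "TRANSFER"],
    "amount": [1000.0, 200.0, 15.0, 170.0, 5.0, 900.0],
    "isFraud": [1, 0, 0, 0, 0, 0],
})

# One-hot encode `type`: one 0/1 indicator column per transaction type.
# The original `type` column is replaced by the indicator columns.
encoded = pd.get_dummies(toy, columns=["type"], prefix="type", dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding avoids imposing an artificial order on the five categories, which an integer label encoding would do.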

## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Answer here
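
As a hedged illustration of two common strategies for the imbalance described in the question (not the notebook author's answer), the sketch below computes balanced class weights and randomly undersamples the majority class. The toy data and the 3:1 ratio are assumptions chosen for the example. The weight formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
import pandas as pd

# Toy data: 2 fraud rows among 20; values are illustrative only.
toy = pd.DataFrame({
    "amount": [float(i) for i in range(1, 21)],
    "isFraud": [1, 1] + [0] * 18,
})

# Option 1: balanced class weights, n_samples / (n_classes * class_count).
counts = toy["isFraud"].value_counts()
class_weight = (len(toy) / (len(counts) * counts)).to_dict()

# Option 2: random undersampling of the majority class
# (keeping 3 non-fraud rows per fraud row is an arbitrary choice).
fraud = toy[toy["isFraud"] == 1]
legit = toy[toy["isFraud"] == 0].sample(n=3 * len(fraud), random_state=0)
balanced = pd.concat([fraud, legit]).sample(frac=1, random_state=0)  # shuffle rows

print(class_weight)
print(balanced["isFraud"].value_counts().to_dict())
```

Any resampling of this kind should be applied to the training split only, so that the test set keeps the original class distribution.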

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here
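
As one possible sketch of interaction features (not the notebook author's answer), the example below builds three engineered columns on toy data. The column names `isHighRiskType`, `highRiskAmount`, and `origBalanceError` are hypothetical and chosen for illustration. The input columns follow the dataset's schema, and the high-risk types are the two in which fraud was observed in Q2.

```python
import pandas as pd

# Toy rows following the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [1000.0, 50.0, 300.0],
    "oldbalanceOrg": [1000.0, 200.0, 500.0],
    "newbalanceOrig": [0.0, 150.0, 500.0],
})

# Indicator for the two types in which fraud was observed (hypothetical name).
toy["isHighRiskType"] = toy["type"].isin(["TRANSFER", "CASH_OUT"]).astype(int)

# Interaction term: the amount, kept only for the high-risk types.
toy["highRiskAmount"] = toy["amount"] * toy["isHighRiskType"]

# Balance discrepancy: 0 when the origin balance dropped by exactly `amount`.
toy["origBalanceError"] = toy["oldbalanceOrg"] - toy["amount"] - toy["newbalanceOrig"]

print(toy[["isHighRiskType", "highRiskAmount", "origBalanceError"]])
```

Whether such features actually help would need to be checked against a validation set; the sketch only shows how they could be constructed.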

In [57]:
# write out newly transformed dataset to your folder
transactions.to_csv('../data/filtered_transactions.csv', index=False)