# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response and justification, you will also be asked to apply a corresponding data transformation.

If you state that you will not apply any data transformations for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: This dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.
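As a minimal sketch of the sampling note above, the snippet below draws a reproducible subset from a small stand-in frame. The toy data, the `demo` name, and the 10% fraction are illustrative assumptions; on the real data the same call would be applied to the frame loaded from `bank_transactions.csv`.

```python
import pandas as pd

# Toy stand-in for the real transactions table (values are made up).
demo = pd.DataFrame({
    "amount": [float(i) for i in range(1, 101)],
    "isFraud": [1 if i % 25 == 0 else 0 for i in range(1, 101)],
})

# Draw 10% of the rows; random_state makes the subset reproducible.
subset = demo.sample(frac=0.1, random_state=42)

print(len(subset))  # 10 rows out of 100
```

Because fraud is rare, a purely random subset may contain very few fraud rows, so it is worth checking the class counts of the sample before relying on it.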

In [1]:
import pandas as pd
import numpy as np

In [2]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")

## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

The dataset does not contain any missing values in the numerical columns used for modeling. However, it includes the non-predictive columns *nameOrig* and *nameDest*, which are just **unique account names or identifiers**. They showed no patterns during EDA that affected the results or added insight to our findings, so it is better to drop them.
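One way to back up both claims is to count missing values per column and to check how many distinct values each column holds. The sketch below runs on a tiny made-up frame: the rows and the `toy` name are assumptions, and only the column names come from the dataset.

```python
import pandas as pd

# Tiny made-up frame that reuses the dataset's column names.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M1", "M2", "C9", "C8"],
    "amount": [983.09, 55215.25, 220986.01, 67990.14],
    "isFraud": [0, 0, 0, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Share of distinct values per column. A ratio near 1.0 means the column
# behaves like an identifier, which gives a model little to generalise from.
unique_ratio = toy.nunique() / len(toy)

print(missing_per_column)
print(unique_ratio)
```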

In [6]:
# Drop the non-predictive identifier columns
transactions_cleaned = transactions.drop(columns=["nameOrig", "nameDest"])

transactions_cleaned.columns



Index(['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'isFlaggedFraud'],
      dtype='object')

In [7]:
transactions_cleaned.head()

| | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| 0 | PAYMENT | 983.09 | 36730.24 | 35747.15 | 0.0 | 0.0 | 0 | 0 |
| 1 | PAYMENT | 55215.25 | 99414.0 | 44198.75 | 0.0 | 0.0 | 0 | 0 |
| 2 | CASH_IN | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 | 0 |
| 3 | TRANSFER | 2357394.75 | 0.0 | 0.0 | 4202580.45 | 6559975.19 | 0 | 0 |
| 4 | CASH_OUT | 67990.14 | 0.0 | 0.0 | 625317.04 | 693307.19 | 0 | 0 |


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Transaction types showed strong differences in both amount and fraud likelihood. Specifically, **transfer** and **cash-out** transactions are the only types associated with fraud, and they involve higher transaction amounts than the other types.
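A per-type summary like the one below is one way to check this kind of claim. It is a sketch on made-up rows (the `toy` frame and its values are assumptions, not results from the dataset); the same `groupby` would be run on the cleaned transactions frame.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_IN"],
    "amount": [100.0, 300.0, 5000.0, 7000.0, 4000.0, 200.0],
    "isFraud": [0, 0, 1, 0, 1, 0],
})

# Mean amount, fraud rate, and row count for each transaction type.
summary = toy.groupby("type").agg(
    mean_amount=("amount", "mean"),
    fraud_rate=("isFraud", "mean"),
    n=("isFraud", "size"),
)

print(summary)
```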

To use the *type* column for modeling, it must be converted into a numeric format. This can be done with **one-hot encoding**.

In [11]:
# One-hot encode the "type" column; drop_first=True drops the first category (CASH_IN) as the baseline
transactions_encoded = pd.get_dummies(transactions_cleaned, columns=["type"], drop_first=True)
transactions_encoded.columns

Index(['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'type_CASH_OUT',
       'type_DEBIT', 'type_PAYMENT', 'type_TRANSFER'],
      dtype='object')

In [12]:
transactions_encoded

| | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 983.09 | 36730.24 | 35747.15 | 0.00 | 0.00 | 0 | 0 | False | False | True | False |
| 1 | 55215.25 | 99414.00 | 44198.75 | 0.00 | 0.00 | 0 | 0 | False | False | True | False |
| 2 | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 | 0 | False | False | False | False |
| 3 | 2357394.75 | 0.00 | 0.00 | 4202580.45 | 6559975.19 | 0 | 0 | False | False | False | True |
| 4 | 67990.14 | 0.00 | 0.00 | 625317.04 | 693307.19 | 0 | 0 | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | 13606.07 | 114122.11 | 100516.04 | 0.00 | 0.00 | 0 | 0 | False | False | True | False |
| 999996 | 9139.61 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 | False | False | True | False |
| 999997 | 153650.41 | 50677.00 | 0.00 | 0.00 | 380368.36 | 0 | 0 | True | False | False | False |
| 999998 | 163810.52 | 0.00 | 0.00 | 357850.15 | 521660.67 | 0 | 0 | True | False | False | False |
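The `CASH_IN` row (index 2) in the output above has `False` in every dummy column. That is the effect of `drop_first=True`: the first category becomes an implicit baseline. The small sketch below illustrates this on a three-row toy frame (the `toy` frame and the `is_baseline` helper column are assumptions for illustration).

```python
import pandas as pd

# Three made-up rows, one per transaction type.
toy = pd.DataFrame({"type": ["CASH_IN", "CASH_OUT", "PAYMENT"]})

# Categories are ordered alphabetically, so CASH_IN is the one dropped.
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)
print(encoded.columns.tolist())  # ['type_CASH_OUT', 'type_PAYMENT']

# Rows where every dummy is off belong to the dropped baseline category.
is_baseline = ~encoded.any(axis=1)
print(is_baseline.tolist())  # [True, False, False]
```

Dropping one category avoids a redundant column for linear models. For tree-based models it is usually harmless either way, so keeping all dummies would also be a defensible choice.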


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Fraudulent transactions are rare in this dataset, which introduces a class imbalance problem. A model trained on the raw data could learn to always predict the majority class (here, *non-fraudulent transactions*) and achieve high accuracy while completely failing to detect actual fraud.
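The effect described above can be seen with a few lines of arithmetic. This is a sketch with a made-up class ratio (5 fraud cases in 1,000 rows is an assumption, not the ratio in the dataset) and a trivial "model" that always predicts non-fraud.

```python
# 1,000 toy labels with 5 fraud cases (made-up proportion).
y_true = [1] * 5 + [0] * 995

# A "model" that always predicts the majority class (non-fraud).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: share of actual fraud cases that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}")    # accuracy = 0.995
print(f"fraud recall = {recall:.3f}")  # fraud recall = 0.000
```

This is why precision and recall on the fraud class are more informative than accuracy for this problem.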

We can address this issue with a resampling technique such as SMOTE, applied only to the training split so that the test set keeps the original class distribution.
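To give an intuition for what SMOTE does, the sketch below creates synthetic minority points by interpolating between a point and its nearest minority neighbour. It is a deliberately simplified illustration, not the `imblearn` implementation (which picks at random among several nearest neighbours); the `synthesize` function and the toy points are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four made-up minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])


def synthesize(points, n_new, rng):
    """Create n_new points by interpolating between a point and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from the chosen point to every point in the set.
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf  # exclude the chosen point itself
        j = int(np.argmin(dists))
        lam = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)


synthetic = synthesize(minority, n_new=6, rng=rng)
print(synthetic.shape)  # (6, 2)
```

Because each new point lies on a segment between two existing minority points, interpolation can produce fractional values for one-hot columns such as `type_TRANSFER`. This is worth keeping in mind when inspecting the resampled data.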

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


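# Separate the features from the target
X = transactions_encoded.drop("isFraud", axis=1)
y = transactions_encoded["isFraud"]

# Stratified split so train and test keep the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)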


In [None]:

# Oversample the minority class on the training split only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_resampled, y_train_resampled)

# Predict
y_pred = clf.predict(X_test)


# print() renders the report as a table instead of a raw string with \n escapes
print(classification_report(y_test, y_pred))

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

One useful engineered feature is the difference between the old and new balances, which reflects the actual money movement. This is particularly valuable for origin accounts, since a drop in balance after a large **transfer** or **cash-out** might indicate potential fraud. We can add this difference to the dataset as a new column.

In [None]:
transactions_encoded["balanceChangeOrig"] = transactions_encoded["oldbalanceOrg"] - transactions_encoded["newbalanceOrig"]
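The same idea can be pushed a little further toward explicit interactions. The sketch below is illustrative only: `amountMismatchOrig` and `amount_x_TRANSFER` are hypothetical column names that do not appear elsewhere in this notebook, and the toy rows are made up.

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names.
toy = pd.DataFrame({
    "amount": [1000.0, 500.0, 2500.0],
    "oldbalanceOrg": [1000.0, 800.0, 0.0],
    "newbalanceOrig": [0.0, 300.0, 0.0],
    "type_TRANSFER": [True, False, True],
})

# Same feature as in the cell above: how much left the origin account.
toy["balanceChangeOrig"] = toy["oldbalanceOrg"] - toy["newbalanceOrig"]

# Hypothetical feature: gap between the stated amount and the observed balance change.
toy["amountMismatchOrig"] = toy["amount"] - toy["balanceChangeOrig"]

# Hypothetical interaction: the amount, counted only for TRANSFER rows.
toy["amount_x_TRANSFER"] = toy["amount"] * toy["type_TRANSFER"].astype(int)

print(toy[["balanceChangeOrig", "amountMismatchOrig", "amount_x_TRANSFER"]])
```

Whether such features help would need to be checked against the fraud recall of a model trained with and without them.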


In [None]:
# write out newly transformed dataset to your folder

transactions_encoded.to_csv("../data/transactions_transformed.csv", index=False)


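# Also save the SMOTE-resampled training split. It was built before
# balanceChangeOrig was added, so it does not contain that column.
resampled_df = pd.concat([
    pd.DataFrame(X_train_resampled, columns=X_train.columns),
    pd.Series(y_train_resampled, name="isFraud")
], axis=1)

resampled_df.to_csv("../data/transactions_resampled.csv", index=False)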