# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response and justification, we will also ask you to apply a corresponding data transformation.

If you state that you will not apply any data transformations for this step, you must **justify** why your dataset or machine-learning model does not require the mentioned data preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.

In [31]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder


orig_transactions = pd.read_csv("../data/bank_transactions.csv")
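# Note: this call does not change the column name. rename() needs the columns= keyword to
# target column labels, and it returns a new DataFrame rather than modifying in place, so
# the outputs in this notebook still show "oldbalanceOrg". To actually rename, use:
# orig_transactions = orig_transactions.rename(columns={"oldbalanceOrg": "oldbalanceOrig"})
orig_transactions.rename({"oldbalanceOrg": "oldbalanceOrig"})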

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,C1454812978,36730.24,35747.15,M1491308340,0.00,0.00,0,0
1,PAYMENT,55215.25,C1031766358,99414.00,44198.75,M2102868029,0.00,0.00,0,0
2,CASH_IN,220986.01,C1451868666,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,C458368123,0.00,0.00,C620979654,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,C1098978063,0.00,0.00,C142246322,625317.04,693307.19,0,0
...,...,...,...,...,...,...,...,...,...,...
999995,PAYMENT,13606.07,C768838592,114122.11,100516.04,M1593119373,0.00,0.00,0,0
999996,PAYMENT,9139.61,C1912748675,0.00,0.00,M842968564,0.00,0.00,0,0
999997,CASH_OUT,153650.41,C1494179549,50677.00,0.00,C1560012502,0.00,380368.36,0,0
999998,CASH_OUT,163810.52,C116856975,0.00,0.00,C1348490647,357850.15,521660.67,0,0


## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

ANS: The dataset does not contain any missing values. However, I am treating two columns as "non-predictive" in their raw form: `nameOrig` and `amount`. I drop `nameOrig` in the code below. Later, in the bonus step, I replace `amount` with a simpler `isLargeAmt` flag: 1 if the amount is greater than or equal to 100,000 and 0 if it is less than 100,000.

In the EDA, `nameOrig` showed no predictive value. I used it to look for repeated transactions from the same origin account: no account appears more than twice, and none of the repeated accounts were involved in fraudulent activity, so the column does not appear to play a role in predicting fraud.
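The two checks behind this answer (missing values, and repeated origin accounts) can be reproduced along the following lines. This is a minimal sketch on a made-up toy frame, not the original EDA code. The variable names `toy`, `orig_counts`, and `fraud_in_repeats` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for bank_transactions.csv (values are made up for illustration)
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C1", "C3", "C4"],
    "amount": [100.0, 250.0, 75.0, 9000.0, 12.5],
    "isFraud": [0, 0, 0, 1, 0],
})

# Missing values per column; all zeros means there is nothing to impute or drop
missing_per_column = toy.isna().sum()

# How many times each origin account appears
orig_counts = toy["nameOrig"].value_counts()
max_repeats = int(orig_counts.max())

# Fraud count among transactions whose origin account appears more than once
repeated_accounts = orig_counts[orig_counts > 1].index
fraud_in_repeats = int(
    toy.loc[toy["nameOrig"].isin(repeated_accounts), "isFraud"].sum()
)

print(missing_per_column)
print(max_repeats, fraud_in_repeats)
```

On the real data, the same three quantities would be computed on `orig_transactions` instead of `toy`.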

In [32]:
transactions = orig_transactions.drop(columns=["nameOrig"])

transactions.head()

Unnamed: 0,type,amount,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,PAYMENT,55215.25,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,CASH_IN,220986.01,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,0.0,0.0,C142246322,625317.04,693307.19,0,0


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

ANS: CASH_OUT and TRANSFER transactions are more likely to be fraudulent. I found no evidence of, or plausible reason for, fraud in CASH_IN or PAYMENT transactions. DEBIT transactions contained no fraud in this dataset, although fraud of that type could still occur. I will one-hot encode the `type` column into dummy variables, which ML models can use more easily than string categories.
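A per-type summary is one way to back up this kind of claim. The sketch below uses a made-up toy frame, so the rates it prints say nothing about the real dataset. The column names match the notebook, while `summary` and its output column names are assumptions.

```python
import pandas as pd

# Toy data (made up): two fraud cases, one CASH_OUT and one TRANSFER
toy = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "PAYMENT", "CASH_IN",
             "TRANSFER", "CASH_OUT", "DEBIT", "PAYMENT"],
    "amount": [5000.0, 120000.0, 80.0, 300.0,
               250000.0, 700.0, 45.0, 120.0],
    "isFraud": [1, 0, 0, 0, 1, 0, 0, 0],
})

# Transaction count, fraud rate, and median amount for each transaction type
summary = (
    toy.groupby("type")
    .agg(
        n_transactions=("isFraud", "size"),
        fraud_rate=("isFraud", "mean"),
        median_amount=("amount", "median"),
    )
    .sort_values("fraud_rate", ascending=False)
)

print(summary)
```

Types with a fraud rate of zero in such a table still need their own dummy column, because the encoder must represent every category it sees.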

In [33]:
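# sparse_output=False makes fit_transform return a dense NumPy array, which can be passed
# straight to pd.DataFrame. A sparse matrix stores only the non-zero entries, which mainly
# saves memory when a column has many categories; with five transaction types, dense is fine.
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)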

encoded_features = encoder.fit_transform(transactions[["type"]])

feature_names = encoder.get_feature_names_out(["type"])
encoded_transactions = pd.DataFrame(encoded_features, columns=feature_names, index=transactions.index)

transactions = pd.concat([encoded_transactions, transactions.drop(columns=["type"])], axis=1)

transactions.head()

Unnamed: 0,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,amount,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,0.0,0.0,0.0,1.0,0.0,983.09,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,0.0,0.0,0.0,1.0,0.0,55215.25,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,1.0,0.0,0.0,0.0,0.0,220986.01,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,0.0,0.0,0.0,0.0,1.0,2357394.75,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,0.0,1.0,0.0,0.0,0.0,67990.14,0.0,0.0,C142246322,625317.04,693307.19,0,0


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

ANS: This imbalance between fraudulent and non-fraudulent transactions makes it more likely that a machine learning model will simply label every transaction as not fraud, since that class is the overwhelming majority of the dataset. Before training my K-Nearest Neighbors model, I will use SMOTE to artificially increase the number of fraudulent transactions in the training data, so that new observations with similar properties are more likely to be classified as fraudulent. If this does not work, I will attempt a different approach.

In [34]:
# SMOTE will be applied in the model_train.ipynb
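
To make the idea behind SMOTE concrete, here is a minimal NumPy sketch of its core step: each synthetic row is an interpolation between a minority-class row and one of its nearest minority-class neighbors. This is an illustration only, not the code used in `model_train.ipynb`. The function `smote_sketch` and the toy array `X_fraud` are hypothetical, and a real run would more likely rely on a library implementation such as `SMOTE` from imbalanced-learn.

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbors (requires k < number of rows)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority rows
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbor

    # Indices of the k nearest neighbors of each row
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base row and one of its neighbors for every synthetic row
    base_idx = rng.integers(0, n, size=n_synthetic)
    neigh_idx = neighbors[base_idx, rng.integers(0, k, size=n_synthetic)]

    # Interpolation weight in [0, 1): 0 stays at the base row
    gaps = rng.random((n_synthetic, 1))
    base = X_minority[base_idx]
    return base + gaps * (X_minority[neigh_idx] - base)

# Toy minority class (made-up, already-scaled feature values)
X_fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_sketch(X_fraud, n_synthetic=6)
print(synthetic.shape)
```

Because the interpolation relies on distances, it is usually applied to numeric, scaled features and only to the training split, so that the test set keeps the real class balance.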

## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

ANS: A high amount can be flagged as 0 or 1, turning the continuous, floating-point `amount` column into a simple yes-or-no column. This should help the model treat high-amount transactions as more suspicious.
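The question also asks about interactions between amount and transaction type. One possible way to encode such an interaction, shown here purely as a sketch, is a flag that is 1 only when a transaction is both large and of a type in which fraud was observed. The column name `isLargeRiskyType` is hypothetical and is not part of the transformed dataset written out by this notebook.

```python
import pandas as pd

# Toy rows using the column names produced earlier in this notebook (values made up)
toy = pd.DataFrame({
    "type_CASH_OUT": [0.0, 1.0, 0.0, 1.0],
    "type_TRANSFER": [1.0, 0.0, 0.0, 0.0],
    "isLargeAmt": [1, 1, 1, 0],
})

# True when the transaction type is CASH_OUT or TRANSFER
risky_type = (toy["type_CASH_OUT"] + toy["type_TRANSFER"]) > 0

# Hypothetical interaction feature: large amount AND risky type
toy["isLargeRiskyType"] = (risky_type & (toy["isLargeAmt"] == 1)).astype(int)

print(toy)
```

Whether this adds anything beyond the separate columns depends on the model, so it would need to be checked against validation results.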

In [35]:
# 1 if the amount is at least 100,000, otherwise 0
# (vectorized comparison instead of a row-by-row apply)
transactions["amount"] = (transactions["amount"] >= 100000).astype(int)
transactions.rename(columns={"amount": "isLargeAmt"}, inplace=True)
transactions.head()

Unnamed: 0,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,isLargeAmt,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,0.0,0.0,0.0,1.0,0.0,0,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,0.0,0.0,0.0,1.0,0.0,0,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,1.0,0.0,0.0,0.0,0.0,1,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,0.0,0.0,0.0,0.0,1.0,1,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,0.0,1.0,0.0,0.0,0.0,0,0.0,0.0,C142246322,625317.04,693307.19,0,0


In [36]:
# write out newly transformed dataset to your folder
transactions.to_csv("../data/transactions.csv", index=False)

## NOTE:
- Is there a way to incorporate another pattern? Transactions that share a destination account appear more likely to be fraudulent.
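
One possible answer, sketched on made-up data, is to turn `nameDest` into a frequency feature: count how many transactions share each destination account and, optionally, flag destinations that appear more than once. The column names `destTxnCount` and `isRepeatDest` are hypothetical.

```python
import pandas as pd

# Toy data (made up): destination C10 appears three times, M20 twice, C30 once
toy = pd.DataFrame({
    "nameDest": ["C10", "M20", "C10", "C30", "C10", "M20"],
    "isFraud": [1, 0, 0, 0, 1, 0],
})

# Number of transactions that share this row's destination account
dest_counts = toy["nameDest"].value_counts()
toy["destTxnCount"] = toy["nameDest"].map(dest_counts)

# Simple flag: 1 if the destination account appears more than once
toy["isRepeatDest"] = (toy["destTxnCount"] > 1).astype(int)

print(toy)
```

If this were adopted, the counts would ideally be computed from the training split only and then mapped onto the test split, to avoid leaking information. The raw `nameDest` strings could then be dropped, since a distance-based model such as K-Nearest Neighbors cannot use them directly.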