# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response and justification, you will then apply the corresponding data transformation.

If you state that you will not apply any data transformation for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: This dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to work on a subset of your data.
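As a minimal sketch of the sampling suggestion above, the cell below draws a reproducible 10% subset with `DataFrame.sample()`. The frame `full_df` and its values are made up for illustration; in practice you would call `sample()` on your own dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full transactions table (all values are made up).
rng = np.random.default_rng(0)
full_df = pd.DataFrame({
    "amount": rng.uniform(1, 10_000, size=1_000),
    "isFraud": rng.choice([0, 1], size=1_000, p=[0.99, 0.01]),
})

# Draw a 10% subset; random_state makes the same rows come back on every run.
subset = full_df.sample(frac=0.10, random_state=42)

print(len(full_df), len(subset))
```

Because fraud rows are rare, a small random subset may contain very few of them, so it is worth checking `subset["isFraud"].sum()` before relying on it.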

In [1]:
import pandas as pd
import numpy as np

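In [3]:
# Load the sampled transactions
df = pd.read_csv('/Users/talgat/Downloads/detect-fraud/data/sample.csv')


In [4]:
# Drop the identifier columns and the isFlaggedFraud column (see Q1)
df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)


In [5]:
# One-hot encode the categorical 'type' column (see Q2)
df = pd.get_dummies(df, columns=['type'])


In [8]:
# Write the transformed dataframe to the data/ folder
df.to_csv('/Users/talgat/Downloads/detect-fraud/data/transformed_dataset.csv', index=False)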


## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

This dataset has no significant missing values that need fixing. However, the `nameOrig` and `nameDest` columns are only account identifiers, so they do not help the model find fraud. The `isFlaggedFraud` column is not useful either: it matches the actual fraud label poorly and could mislead the model. It is best to remove these three columns so that the model focuses on informative features and can better learn to detect fraud.

In [11]:
import pandas as pd

# Load your dataset
df = pd.read_csv('/Users/talgat/Downloads/detect-fraud/data/sample.csv')

# Drop columns that don't help the prediction
df = df.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])

# Optionally, save the cleaned data for modeling
df.to_csv('/Users/talgat/Downloads/detect-fraud/data/cleaned_transactions.csv', index=False)
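The claims in the Q1 answer can be checked before dropping anything. The sketch below runs on a toy frame that reuses this notebook's column names with made-up values. It counts missing values, measures how identifier-like the name columns are, and cross-tabulates `isFlaggedFraud` against `isFraud`.

```python
import pandas as pd

# Toy frame using the notebook's column names; all values are made up.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M1", "C9", "C8", "M2"],
    "amount": [100.0, 2500.0, 900000.0, 50.0],
    "isFraud": [0, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 1, 0],
})

# Missing values per column: all zeros means there is nothing to impute.
missing_per_column = toy.isna().sum()

# Share of distinct values: close to 1.0 suggests an identifier-like column.
unique_ratio = toy[["nameOrig", "nameDest"]].nunique() / len(toy)

# Rows are the true label, columns are the flag; off-diagonal cells are disagreements.
flag_vs_label = pd.crosstab(toy["isFraud"], toy["isFlaggedFraud"])

print(missing_per_column, unique_ratio, flag_vs_label, sep="\n\n")
```

On the real data, the same three checks would be run on `df` right after `pd.read_csv`.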


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

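The `type` column is categorical, and most models cannot use its string values directly. One-hot encoding it with `pd.get_dummies` gives each transaction type its own indicator column, so the model can learn type-specific differences in amount and fraud likelihood.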

In [12]:
import pandas as pd

# Load the data
df = pd.read_csv('/Users/talgat/Downloads/detect-fraud/data/sample.csv')

# Drop non-predictive columns (if not already done)
df = df.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])

# Apply one-hot encoding to the 'type' column
df = pd.get_dummies(df, columns=['type'])

# Save the transformed dataset for modeling
df.to_csv('/Users/talgat/Downloads/detect-fraud/data/cleaned_transactions.csv', index=False)
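To answer the first half of Q2, a per-type summary is a quick check. The sketch below groups a toy frame by `type` and reports the mean amount and fraud rate for each category, then applies the same one-hot encoding as the cell above. The category names and all values are assumptions for illustration, not results from the real data.

```python
import pandas as pd

# Toy transactions; the category names and values are made up.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [50.0, 9000.0, 4000.0, 150.0, 1000.0, 6000.0],
    "isFraud": [0, 1, 0, 0, 0, 1],
})

# Mean amount, fraud rate, and row count per transaction type.
summary = toy.groupby("type").agg(
    mean_amount=("amount", "mean"),
    fraud_rate=("isFraud", "mean"),
    n_rows=("isFraud", "size"),
)

# One-hot encode 'type'; dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["type"], dtype=int)

print(summary)
print(encoded.columns.tolist())
```

If the fraud rate differs clearly between types in the real summary, that supports keeping `type` as a feature in encoded form.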


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

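Fraud is very rare: the test split reported below contains only 4 fraudulent rows out of 2,500. With so few positives, a model can reach very high accuracy while missing most fraud, so accuracy alone is misleading. In the report below, accuracy is 0.9988 but recall on the fraud class is only 0.25. The cell below uses a stratified train/test split so that both splits keep the same fraud ratio, and `class_weight='balanced'` so that errors on the rare class count for more during training.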

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load cleaned and transformed data
df = pd.read_csv('/Users/talgat/Downloads/detect-fraud/data/transformed_dataset.csv')
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Use class_weight to help model focus on rare frauds
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, digits=4))


              precision    recall  f1-score   support

           0     0.9988    1.0000    0.9994      2496
           1     1.0000    0.2500    0.4000         4

    accuracy                         0.9988      2500
   macro avg     0.9994    0.6250    0.6997      2500
weighted avg     0.9988    0.9988    0.9984      2500
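The report above shows why accuracy is a weak metric here. As a rough illustration, the sketch below builds labels with the same class counts as the test-set support (2496 and 4) and scores a baseline that always predicts "not fraud". It also computes the weights that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`. The labels are synthetic, and in real training the weights come from the training labels, not the test set.

```python
import numpy as np

# Synthetic labels with the same class counts as the test-set support above.
y_toy = np.array([0] * 2496 + [1] * 4)

# A baseline "model" that always predicts the majority class (not fraud).
always_legit = np.zeros_like(y_toy)
baseline_accuracy = (always_legit == y_toy).mean()
baseline_recall = ((always_legit == 1) & (y_toy == 1)).sum() / (y_toy == 1).sum()

# Balanced class weights: n_samples / (n_classes * count_per_class).
counts = np.bincount(y_toy)
balanced_weights = len(y_toy) / (len(counts) * counts)

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline fraud recall: {baseline_recall:.4f}")
print("balanced weights:", balanced_weights)
```

The baseline catches no fraud at all yet scores almost the same accuracy as the random forest, which is why recall and precision on class 1 are the numbers to watch.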



## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here
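One possible way to encode the interaction mentioned in the question (high amount combined with transaction type) is sketched below. It is not a required or verified solution. The 75th-percentile threshold, the `is_high_amount` name, the `_x_high_amount` suffix, and the toy values are all hypothetical choices for illustration.

```python
import pandas as pd

# Toy transactions; the category names and values are made up.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER"],
    "amount": [50.0, 9000.0, 400.0, 120.0],
})

# Flag rows whose amount exceeds the 75th percentile (threshold is an assumption).
threshold = toy["amount"].quantile(0.75)
toy["is_high_amount"] = (toy["amount"] > threshold).astype(int)

# One-hot encode the type, then multiply each indicator by the high-amount flag
# so that a column is 1 only for high-amount rows of that type.
type_dummies = pd.get_dummies(toy["type"], prefix="type", dtype=int)
interactions = type_dummies.mul(toy["is_high_amount"], axis=0).add_suffix("_x_high_amount")

features = pd.concat([toy, type_dummies, interactions], axis=1)
print(features)
```

Tree-based models such as the random forest in Q3 can learn such interactions on their own, so explicit interaction columns tend to matter more for linear models.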

In [2]:
# write out newly transformed dataset to your folder
...