In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline   # <-- imblearn pipeline (handles samplers)

In [None]:
# load file
fraud = pd.read_csv('/content/drive/MyDrive/fraudTest.csv')

print(fraud.head())
print(fraud.shape)
fraud.info()  # info() prints its report directly and returns None, so wrapping it in print() adds a stray "None"

# Task
To prepare the `fraud` dataset for model training, first inspect the data for missing values and convert the `trans_date_trans_time` column to datetime objects. Then, drop the specified unnecessary columns: 'Unnamed: 0', 'cc_num', 'trans_num', 'first', 'last', 'street', 'city', 'state', 'zip', 'lat', 'long', 'job', 'dob', 'unix_time', 'merch_lat', and 'merch_long'. Afterwards, identify the remaining categorical and numerical features. Apply `OneHotEncoder` to the categorical features and `StandardScaler` to the numerical features using a `ColumnTransformer`. Split the preprocessed data into training and testing sets, and finally, apply SMOTE to the training data to handle class imbalance.

## Inspect Data for Missing Values and Data Types

### Subtask:
Examine the 'fraud' DataFrame for missing values and review the data types of all columns to identify potential issues and determine appropriate preprocessing steps.


**Reasoning**:
To inspect for missing values and review data types, I will call the `.isnull().sum()` method on the `fraud` DataFrame to count missing values per column and then use the `.info()` method to display data types and non-null counts.



In [5]:
print("Missing values per column:")
print(fraud.isnull().sum())

print("\nDataFrame Info (datatypes and non-null counts):")
fraud.info()

Missing values per column:
Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64

DataFrame Info (datatypes and non-null counts):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_tran

## Convert 'trans_date_trans_time' to Datetime

### Subtask:
Convert the 'trans_date_trans_time' column to a datetime object to enable time-based feature extraction if needed.


**Reasoning**:
To convert the 'trans_date_trans_time' column to datetime objects, I will use the `pd.to_datetime()` function and then verify the conversion by printing the column's data type.



In [6]:
fraud['trans_date_trans_time'] = pd.to_datetime(fraud['trans_date_trans_time'])
print("Data type of 'trans_date_trans_time' after conversion:")
print(fraud['trans_date_trans_time'].dtype)

Data type of 'trans_date_trans_time' after conversion:
datetime64[ns]
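
The subtask mentions time-based feature extraction as an option. A minimal sketch of what that could look like, on a toy frame standing in for `fraud`, is shown below. The column names `trans_hour`, `trans_dayofweek`, and `is_night`, and the 22:00 to 06:00 night window, are illustrative assumptions and are not part of this notebook.

```python
import pandas as pd

# Toy frame standing in for `fraud`; timestamps follow the format seen in the notebook.
toy = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime([
        "2020-06-21 12:14:25",
        "2020-06-21 23:59:59",
        "2020-06-22 03:05:00",
    ])
})

ts = toy["trans_date_trans_time"]

# The .dt accessor exposes calendar components of a datetime64 column.
toy["trans_hour"] = ts.dt.hour            # 0-23
toy["trans_dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6

# Hypothetical flag: 1 if the transaction falls between 22:00 and 06:00.
toy["is_night"] = ((ts.dt.hour >= 22) | (ts.dt.hour < 6)).astype(int)

print(toy)
```

Integer features like these could be passed to the numerical branch of a `ColumnTransformer` instead of dropping the timestamp entirely.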


## Drop Unnecessary Columns

### Subtask:
Drop the columns 'Unnamed: 0', 'cc_num', 'trans_num', 'first', 'last', 'street', 'city', 'state', 'zip', 'lat', 'long', 'job', 'dob', 'unix_time', 'merch_lat', and 'merch_long' from the `fraud` DataFrame.

**Reasoning**:
To drop the specified columns, I will collect them in a list and pass it to the `drop()` method of the `fraud` DataFrame. Afterwards, I will display the first few rows of the DataFrame and its info to confirm that the columns have been removed.



In [7]:
columns_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'first', 'last', 'street', 'city', 'state', 'zip', 'lat', 'long', 'job', 'dob', 'unix_time', 'merch_lat', 'merch_long']
fraud = fraud.drop(columns=columns_to_drop, axis=1)

print("DataFrame after dropping columns (first 5 rows):")
print(fraud.head())
print("\nDataFrame Info after dropping columns:")
fraud.info()

DataFrame after dropping columns (first 5 rows):
  trans_date_trans_time                              merchant        category  \
0   2020-06-21 12:14:25                 fraud_Kirlin and Sons   personal_care   
1   2020-06-21 12:14:33                  fraud_Sporer-Keebler   personal_care   
2   2020-06-21 12:14:53  fraud_Swaniawski, Nitzsche and Welch  health_fitness   
3   2020-06-21 12:15:15                     fraud_Haley Group        misc_pos   
4   2020-06-21 12:15:17                 fraud_Johnston-Casper          travel   

     amt gender  city_pop  is_fraud  
0   2.86      M    333497         0  
1  29.84      F       302         0  
2  41.28      F     34496         0  
3  60.05      M     54767         0  
4   3.19      M      1126         0  

DataFrame Info after dropping columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------               

## Identify Categorical and Numerical Features

### Subtask:
Identify the remaining categorical and numerical features in the `fraud` DataFrame after dropping the specified columns.

**Reasoning**:
To identify categorical and numerical features, I will select columns by data type with `select_dtypes()`: `object` columns are treated as categorical, and `int64` and `float64` columns as numerical, excluding the target variable `is_fraud`. The datetime column `trans_date_trans_time` falls into neither group.



In [8]:
categorical_features = fraud.select_dtypes(include=['object']).columns.tolist()
numerical_features = fraud.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Remove 'is_fraud' from numerical features if it's there, as it's the target variable
if 'is_fraud' in numerical_features:
    numerical_features.remove('is_fraud')

# Defensive check: 'trans_date_trans_time' is datetime64 after the earlier conversion, so it should not appear among the object columns
if 'trans_date_trans_time' in categorical_features:
    categorical_features.remove('trans_date_trans_time')

print("Categorical Features:", categorical_features)
print("Numerical Features:", numerical_features)

Categorical Features: ['merchant', 'category', 'gender']
Numerical Features: ['amt', 'city_pop']
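
Because one-hot encoding creates one column per distinct value, it can help to check the cardinality of each categorical feature before encoding. The sketch below uses a small toy frame (the merchant names are made up) and lists the categorical columns explicitly rather than deriving them from dtypes.

```python
import pandas as pd

# Toy data; the merchant values are invented for illustration.
toy = pd.DataFrame({
    "merchant": ["fraud_A", "fraud_B", "fraud_C", "fraud_A"],
    "category": ["travel", "misc_pos", "travel", "travel"],
    "gender": ["M", "F", "F", "M"],
    "amt": [2.86, 29.84, 41.28, 60.05],
})

categorical_features = ["merchant", "category", "gender"]

# Number of distinct values per categorical column.
cardinality = toy[categorical_features].nunique()
print(cardinality)

# With default OneHotEncoder settings, each distinct value becomes one output column.
expected_onehot_columns = int(cardinality.sum())
print("One-hot columns:", expected_onehot_columns)
```

Run against the real `fraud` DataFrame, the same two lines would show which feature contributes most of the encoded columns.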


## Apply Preprocessing to Features

### Subtask:
Apply `OneHotEncoder` to the categorical features and `StandardScaler` to the numerical features using a `ColumnTransformer`.

**Reasoning**:
I will define a `ColumnTransformer` with `OneHotEncoder` for the identified categorical features, converting them into a numerical format suitable for machine learning models, and `StandardScaler` for the numerical features, standardizing their scales so that features with larger values do not dominate model training. Then, I will fit and transform the relevant columns of the `fraud` DataFrame and display the shape of the resulting array to confirm that the transformations were applied.



In [10]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

X = fraud.drop('is_fraud', axis=1)

# Drop 'trans_date_trans_time' from X as it's not meant for direct scaling or one-hot encoding
if 'trans_date_trans_time' in X.columns:
    X = X.drop('trans_date_trans_time', axis=1)

y = fraud['is_fraud']

X_preprocessed = preprocessor.fit_transform(X)

print("Shape of preprocessed data:", X_preprocessed.shape)


Shape of preprocessed data: (555719, 711)
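
The cell above fits the `ColumnTransformer` on every row before the data is split, so the scaler's mean and variance also reflect rows that later end up in the test set. One common alternative, sketched here on toy data under the assumption that the split happens first, is to call `fit_transform` on the training rows and `transform` on the test rows. The variable names `X_train_raw` and `X_test_raw` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with a subset of the notebook's column names.
toy = pd.DataFrame({
    "amt": [2.86, 29.84, 41.28, 60.05, 3.19, 500.0, 12.5, 75.0],
    "city_pop": [333497, 302, 34496, 54767, 1126, 2000, 800, 150000],
    "category": ["personal_care", "personal_care", "health_fitness", "misc_pos",
                 "travel", "misc_pos", "travel", "health_fitness"],
    "gender": ["M", "F", "F", "M", "M", "F", "M", "F"],
    "is_fraud": [0, 0, 0, 0, 0, 1, 0, 1],
})
X = toy.drop(columns="is_fraud")
y = toy["is_fraud"]

# Split the raw rows first, so the test rows stay unseen by the preprocessor.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["amt", "city_pop"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["category", "gender"]),
    ]
)

# Learn scaling statistics and categories from the training rows only.
X_train = preprocessor.fit_transform(X_train_raw)
# Reuse them on the test rows; unseen categories encode as all zeros.
X_test = preprocessor.transform(X_test_raw)

print(X_train.shape, X_test.shape)
```

`handle_unknown='ignore'` matters more in this arrangement, since a category that appears only in the test rows would otherwise raise an error at transform time.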


## Split Data and Apply SMOTE

### Subtask:
Split the preprocessed data into training and testing sets, and then apply SMOTE to the training data to handle class imbalance.

**Reasoning**:
I will first use `train_test_split` to divide the preprocessed features (`X_preprocessed`) and the target variable (`y`) into training and testing sets, so that the model is evaluated on unseen data. I will then apply `SMOTE` (Synthetic Minority Over-sampling Technique) to the *training data only* to address the class imbalance in the target variable (`is_fraud`), which would otherwise bias the model towards the majority class; the test set keeps its original distribution. Finally, I will display the shapes of the resulting datasets to confirm the operations.



In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42, stratify=y)

print("Shape of X_train before SMOTE:", X_train.shape)
print("Shape of y_train before SMOTE:", y_train.shape)
print("Class distribution of y_train before SMOTE:\n", y_train.value_counts())

sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print("\nShape of X_train after SMOTE:", X_train_res.shape)
print("Shape of y_train after SMOTE:", y_train_res.shape)
print("Class distribution of y_train after SMOTE:\n", y_train_res.value_counts())

Shape of X_train before SMOTE: (444575, 711)
Shape of y_train before SMOTE: (444575,)
Class distribution of y_train before SMOTE:
 is_fraud
0    442859
1      1716
Name: count, dtype: int64

Shape of X_train after SMOTE: (885718, 711)
Shape of y_train after SMOTE: (885718,)
Class distribution of y_train after SMOTE:
 is_fraud
0    442859
1    442859
Name: count, dtype: int64


## Final Task

### Subtask:
Confirm that all data preprocessing steps are completed and the data is ready for model training, including a summary of the transformations applied.


## Summary:

### Data Analysis Key Findings

*   **Missing Values and Data Types**: The initial inspection revealed no missing values across all 23 columns. The `trans_date_trans_time` column was identified as an `object` type requiring conversion to a datetime format.
*   **Datetime Conversion**: The `trans_date_trans_time` column was successfully converted to the `datetime64[ns]` data type.
*   **Column Removal**: A total of 16 unnecessary columns were successfully dropped, including 'Unnamed: 0', 'cc_num', 'trans_num', 'first', 'last', 'street', 'city', 'state', 'zip', 'lat', 'long', 'job', 'dob', 'unix_time', 'merch_lat', and 'merch_long', resulting in a DataFrame with 7 columns.
*   **Feature Identification**: After dropping columns, the remaining features were classified:
    *   Categorical features: `['merchant', 'category', 'gender']`
    *   Numerical features: `['amt', 'city_pop']`
*   **Preprocessing**:
    *   A `ColumnTransformer` was used to apply `StandardScaler` to numerical features and `OneHotEncoder` to categorical features.
    *   The `trans_date_trans_time` column was explicitly dropped from the feature set `X` before preprocessing to resolve a `ValueError` related to `datetime64[ns]` type not being supported by sparse matrices.
    *   The final preprocessed feature matrix `X_preprocessed` had a shape of (555719, 711).
*   **Data Splitting and Imbalance Handling**:
    *   The preprocessed data was split into training and testing sets with an 80/20 ratio, stratified by the target variable `y`.
    *   Before SMOTE, the training data was heavily imbalanced, with 442,859 instances of class 0 and only 1,716 instances of class 1 (about 0.39% of the training rows).
    *   SMOTE was applied to the training data, successfully balancing the classes to 442,859 instances for both class 0 and class 1. The resulting `X_train_res` and `y_train_res` had a shape of (885718, 711) and (885718,), respectively.

### Insights or Next Steps

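*   The data is now preprocessed for model training: numerical features are scaled, categorical variables are one-hot encoded, and class imbalance is addressed in the training set only. One caveat: the `ColumnTransformer` was fit on the full dataset before the split, so the scaling statistics and encoder categories include test rows; fitting it on the training split alone would avoid this leakage.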
*   The next logical step is to train a machine learning model using the `X_train_res` and `y_train_res` datasets, followed by evaluating its performance on the unseen `X_test` and `y_test` datasets.
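
For the evaluation step, the metrics imported in the first cell (`classification_report`, `roc_auc_score`, `confusion_matrix`) could be applied roughly as follows. The labels and scores here are hand-written toy values, not model output, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Hand-written toy labels and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90])

# Convert probabilities to hard labels with an assumed 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# ROC AUC is computed from the scores, not the thresholded labels.
auc = roc_auc_score(y_true, y_score)
print("ROC AUC:", auc)

print(classification_report(y_true, y_pred))
```

On data as imbalanced as this, precision and recall for class 1 are usually more informative than overall accuracy.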
