**Import the libraries needed for data handling, preprocessing, model building, and evaluation: pandas for data handling, scikit-learn modules for train/test splitting, imputation, encoding, scaling, modelling, and metrics, and SMOTE from imbalanced-learn to handle class imbalance. The os module is used only to check the size of the data file.**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, roc_auc_score
import os

In [2]:
file_path = './content/Fraud.csv'
file_size = os.path.getsize(file_path)
file_size_mb = file_size / (1024 * 1024)

print(f"File size: {file_size_mb:.2f} MB")

File size: 470.67 MB


**Load the dataset from the CSV file into a pandas DataFrame, the format used for all later analysis and manipulation. Because the file is about 470 MB, only the first 1,000,000 rows are read (nrows=1000000).**

In [3]:
df = pd.read_csv('./content/Fraud.csv', nrows=1000000)

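**Here we define which columns are numerical and which are categorical. This distinction matters because the two kinds of data are preprocessed differently: numerical columns are imputed with the mean and scaled, while categorical columns are imputed with the most frequent value and one-hot encoded.**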

In [4]:
numerical_columns = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
categorical_columns = ['type', 'nameOrig', 'nameDest']

**We initialize imputer objects for handling missing values in the dataset. For numerical columns, missing values are filled with the column mean. For categorical columns, they are filled with the most frequent value.**

In [5]:
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')

**Then we apply the imputers to the numerical and categorical columns to fill all the missing values. This step ensures that our dataset does not have any missing values which could cause errors in subsequent steps.**

In [6]:
df[numerical_columns] = num_imputer.fit_transform(df[numerical_columns])
df[categorical_columns] = cat_imputer.fit_transform(df[categorical_columns])
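
**As a minimal sketch of what the two imputers do, the toy DataFrame below (hypothetical values, not rows from Fraud.csv) has one gap in each column. The mean strategy fills the numeric gap with the average of the observed values, and the most_frequent strategy fills the categorical gap with the most common label.**

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0],
    "type": pd.Series(["TRANSFER", "TRANSFER", np.nan], dtype=object),
})

num_imp = SimpleImputer(strategy="mean")           # numeric: column mean
cat_imp = SimpleImputer(strategy="most_frequent")  # categorical: mode

toy[["amount"]] = num_imp.fit_transform(toy[["amount"]])
toy[["type"]] = cat_imp.fit_transform(toy[["type"]])

print(toy)
print("Remaining missing values:", int(toy.isna().sum().sum()))
```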

**Next, we initialize a OneHotEncoder to convert the categorical variables into a format that machine learning models can use. One-hot encoding replaces each categorical column with one binary column per distinct value, and the result is kept as a sparse matrix to save memory.**

In [7]:
encoder = OneHotEncoder(sparse_output=True)
encoded_cats = encoder.fit_transform(df[categorical_columns])
encoded_cat_columns = encoder.get_feature_names_out(categorical_columns)

**We convert the encoded categorical data into a sparse DataFrame, drop the original categorical columns, and concatenate the encoded columns with the remaining ones. This step integrates the encoded categorical data into our dataset.**

In [8]:
df_encoded = pd.DataFrame.sparse.from_spmatrix(encoded_cats, columns=encoded_cat_columns)
df = df.drop(categorical_columns, axis=1)
df = pd.concat([df, df_encoded], axis=1)

**We separate the features from the target variable, storing the features in X and the target (the isFraud label) in y. The isFlaggedFraud column is also dropped from the features.**

In [9]:
X = df.drop(['isFraud', 'isFlaggedFraud'], axis=1)
y = df['isFraud']

**We handle missing values in the target variable by filling them with zeros and converting the target to an integer type. Missing labels are therefore treated as non-fraud, and the target is in the correct format for model training.**

In [10]:
y = y.fillna(0).astype('int')

**We collect the indices of all fraudulent transactions and randomly sample an equal number of non-fraudulent transactions. This allows us to create a balanced subset of the data for initial training.**

In [11]:
fraud_indices = y[y == 1].index
non_fraud_indices = y[y == 0].sample(n=len(fraud_indices), random_state=42).index

**We combine the indices of fraudulent and non-fraudulent transactions to create a balanced subset of our data. Training on this subset keeps the model from being biased towards the much more common non-fraudulent class.**

In [12]:
sample_indices = fraud_indices.union(non_fraud_indices)
X_sample = X.loc[sample_indices]
y_sample = y.loc[sample_indices]
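
**The undersampling step above can be sketched on a toy label Series (hypothetical values): keep every positive label, draw the same number of negatives at random, and select the union of both index sets.**

```python
import pandas as pd

# Hypothetical labels: 2 fraud cases among 10 transactions.
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

fraud_idx = y_toy[y_toy == 1].index
# Draw as many non-fraud rows as there are fraud rows.
non_fraud_idx = y_toy[y_toy == 0].sample(n=len(fraud_idx), random_state=42).index

idx = fraud_idx.union(non_fraud_idx)  # union returns a sorted index
y_balanced = y_toy.loc[idx]

print(y_balanced.value_counts())
```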

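**We then apply SMOTE to the subset. SMOTE generates synthetic examples of the minority class (fraudulent transactions) by interpolating between existing minority samples. Because the subset already contains equal numbers of both classes, SMOTE with its default settings is expected to add few or no synthetic samples here.**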

In [13]:
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_sample, y_sample)
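
**To illustrate the idea behind SMOTE, here is a simplified NumPy sketch of its interpolation step. It is not imbalanced-learn's implementation: the function smote_like_sample and the toy points are hypothetical, and it assumes the minority points are distinct.**

```python
import numpy as np

def smote_like_sample(minority, k, rng):
    """Return one synthetic point on the segment between a random
    minority row and one of its k nearest minority neighbours."""
    i = rng.integers(len(minority))
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

rng = np.random.default_rng(0)
# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

**Each synthetic point lies between two real minority points, so the new samples stay inside the region the minority class already occupies.**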



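**We standardize the features by scaling each one to unit variance. The scaler is created with with_mean=False, so the features are not centered; centering would turn the sparse one-hot columns into dense ones. Tree-based models such as random forests are generally insensitive to feature scaling, so this step matters mainly if other algorithms are tried later.**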

In [14]:
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_resampled)
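
**A short check on a hypothetical sparse matrix shows the effect of with_mean=False: each column is divided by its standard deviation, zeros stay zero, and the result remains sparse.**

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical 3x2 sparse matrix with three zero entries.
X_toy = sparse.csr_matrix(np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 4.0]]))

scaler_demo = StandardScaler(with_mean=False)  # scale only, no centering
X_out = scaler_demo.fit_transform(X_toy)

print(X_out.toarray())
print("Still sparse:", sparse.issparse(X_out))
print("Non-zero count unchanged:", X_out.nnz == X_toy.nnz)
```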



**I use a RandomForestClassifier as the model for detecting fraudulent transactions.**

In [15]:
model = RandomForestClassifier()

**The model is trained and evaluated using GridSearchCV to find the best hyperparameters. I split the data into training and testing sets (70/30), tune n_estimators and max_depth with 5-fold cross-validation, and evaluate the best model using classification metrics and the AUC-ROC score.**
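
**One possible variation, shown here only as a sketch on synthetic stand-in data from make_classification (not the fraud dataset): splitting before any fitted preprocessing and wrapping the scaler and model in a Pipeline means the scaler is fitted on training folds only, so no statistics from the test set reach the model. The step names "scale" and "rf" and the small grid are illustrative choices.**

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the fraud dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),  # refitted inside every CV fold
    ("rf", RandomForestClassifier(random_state=42)),
])

# Small illustrative grid to keep the run short.
grid = {"rf__n_estimators": [25, 50], "rf__max_depth": [5, 10]}
search = GridSearchCV(pipe, grid, cv=3, scoring="roc_auc")
search.fit(X_tr, y_tr)

proba = search.predict_proba(X_te)[:, 1]
demo_auc = roc_auc_score(y_te, proba)
print(search.best_params_)
print(f"Held-out AUC-ROC: {demo_auc:.3f}")
```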

In [17]:
print("Training RandomForest model...")
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_resampled, test_size=0.3, random_state=42)

# Hyperparameter tuning: grid over number of trees and maximum tree depth
param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [10, 20, 30]}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print(f"Best parameters for RandomForest: {grid_search.best_params_}")
print(classification_report(y_test, y_pred))
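# Note: AUC is computed from hard class predictions here; scores from
# best_model.predict_proba(X_test)[:, 1] are the more usual input.
auc_roc = roc_auc_score(y_test, y_pred)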
print(f"AUC-ROC for RandomForest: {auc_roc}")

result = {
    'best_params': grid_search.best_params_,
    'classification_report': classification_report(y_test, y_pred, output_dict=True),
    'auc_roc': auc_roc
}

Training RandomForest model...
Best parameters for RandomForest: {'max_depth': 30, 'n_estimators': 100}
              precision    recall  f1-score   support

           0       0.96      0.85      0.90       156
           1       0.87      0.97      0.92       165

    accuracy                           0.91       321
   macro avg       0.92      0.91      0.91       321
weighted avg       0.92      0.91      0.91       321

AUC-ROC for RandomForest: 0.9111305361305362
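
**The AUC-ROC above is computed from predicted class labels. As a small illustration with hypothetical values, the same metric computed from predicted probabilities can differ, because probabilities preserve the ranking of samples that thresholding at 0.5 discards.**

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted fraud probabilities.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.45, 0.40, 0.70, 0.90]
labels = [int(s >= 0.5) for s in scores]  # hard predictions at a 0.5 threshold

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)

print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

**In the notebook, the probability-based version would use best_model.predict_proba(X_test)[:, 1] in place of y_pred.**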
