# Title: Project ML: Fraud Detection for Online Transactions

##### The project follows these steps
1) Data Collection  
2) Data Cleaning and Preprocessing
3) Visualize the Data  
4) Split the Data  
5) Label encoding  
6) Model Training  
7) Model Evaluation  
8) Optimize the Model  
9) Web Application Development  
10) Deployment and Group Presentation

## 1) Data Collection

https://www.kaggle.com/competitions/ieee-fraud-detection/data

## 2) Data Cleaning and Preprocessing

#### Step 1: Load required libraries and data

In [1]:
# Import dependencies
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder


In [2]:
# Load the data
file_path1 = Path("Resources/test_identity.csv")
file_path2 = Path("Resources/test_transaction.csv")
file_path3 = Path("Resources/train_identity.csv")
file_path4 = Path("Resources/train_transaction.csv")

In [None]:
# Read the data
test_identity = pd.read_csv(file_path1)
test_transaction = pd.read_csv(file_path2)
train_identity = pd.read_csv(file_path3)
train_transaction = pd.read_csv(file_path4)

### Note: 

**The dataset contains identity and transaction information for the individuals in the train and test sets.**

#### Step 2: Exploring the Data

In [None]:
# get shape of the data
print("test_identity Shape: ", test_identity.shape)
print("test_transaction Shape: ", test_transaction.shape)
print("train_identity Shape: ", train_identity.shape)
print("train_transaction Shape: ", train_transaction.shape)

In [None]:
# Print the first two rows of each dataset
print(test_identity.head(2))
print(test_transaction.head(2))
print(train_identity.head(2))
print(train_transaction.head(2))

In [None]:
# get information about the data
print(test_identity.info())
print(test_transaction.info())
print(train_identity.info())
print(train_transaction.info())

### Note:  

**The train and test datasets have a `TransactionID` column, which can be used as the unique identifier for each transaction.**

**The transaction files contain information such as transaction amount, time, and card information, while the identity files contain information such as device type, device info, and several ID columns.**

**The train dataset has a target column called `isFraud`.**
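The two claims above (unique `TransactionID`, binary `isFraud` target) can be checked directly. A minimal sketch, assuming a small made-up frame named `toy_transaction` in place of `train_transaction`; the values are illustrative only and do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for train_transaction; values are made up for illustration
toy_transaction = pd.DataFrame(
    {
        "TransactionID": [1001, 1002, 1003, 1004],
        "TransactionAmt": [50.0, 20.0, 75.5, 10.0],
        "isFraud": [0, 0, 1, 0],
    }
)

# True when every TransactionID appears exactly once
ids_are_unique = toy_transaction["TransactionID"].is_unique

# Share of each class in the target column
fraud_share = toy_transaction["isFraud"].value_counts(normalize=True)

print("TransactionID unique:", ids_are_unique)
print(fraud_share)
```

Running the same two checks on `train_transaction` would confirm whether the identifier is safe to use as a merge key.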

In [None]:
# get descriptive statistics for each dataset
print(test_identity.describe())
print(test_transaction.describe())
print(train_identity.describe())
print(train_transaction.describe())

#### Step 3: Duplicate Values

In [None]:
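# Check for duplicate rows; print each count, since a cell only displays its last expression
print("test_identity duplicates: ", test_identity.duplicated().sum())
print("test_transaction duplicates: ", test_transaction.duplicated().sum())
print("train_identity duplicates: ", train_identity.duplicated().sum())
print("train_transaction duplicates: ", train_transaction.duplicated().sum())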

#### Step 4: Merge data

In [None]:
# Merge identity and transaction datasets
train = train_transaction.merge(train_identity, on="TransactionID", how="left")
test = test_transaction.merge(test_identity, on="TransactionID", how="left")

In [None]:
# Check the shape of the merged data
print("train_data Shape: ", train.shape)
print("test_data Shape: ", test.shape)
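A left merge keeps every transaction even when it has no identity record. The sketch below, on made-up toy frames, shows two optional `merge` arguments that may help verify this: `validate="one_to_one"` raises an error if `TransactionID` repeats on either side, and `indicator=True` adds a `_merge` column showing which rows found a match. The frame names and values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables
toy_transaction = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmt": [10.0, 20.0, 30.0]}
)
toy_identity = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)

merged = toy_transaction.merge(
    toy_identity,
    on="TransactionID",
    how="left",
    validate="one_to_one",  # fail loudly if the key is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Number of transactions that have a matching identity row
has_identity = int((merged["_merge"] == "both").sum())
print(merged)
print("Rows with identity info:", has_identity)
```

Transactions without an identity row get `NaN` in the identity columns, which is one reason those columns show high missing-value percentages after merging.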

In [None]:
# Check for missing values in train and test data
missing_train = train.isnull().sum().sort_values(ascending=False)
missing_test = test.isnull().sum().sort_values(ascending=False)

In [None]:
# Display the percentage of missing values in each column
print("Missing values in train (%):")
print((missing_train / len(train)) * 100)
print("\nMissing values in test_data (%):")
print((missing_test / len(test)) * 100)

### Note:
**Drop columns whose percentage of missing values exceeds a chosen threshold (40% in the next step).**
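One possible way to apply the threshold is to choose the columns from the training data only and drop the same columns from the test data, so both frames keep matching features. This is a sketch under that assumption, not the notebook's own method; the helper name `columns_over_missing_threshold` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def columns_over_missing_threshold(df, threshold):
    """Return names of columns whose fraction of missing values exceeds threshold."""
    missing_fraction = df.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


# Toy frames; values are made up for illustration
toy_train = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],  # 50% missing
        "b": [1.0, 2.0, np.nan, 4.0],     # 25% missing
        "isFraud": [0, 1, 0, 0],
    }
)
toy_test = pd.DataFrame({"a": [np.nan, 2.0], "b": [1.0, 2.0]})

# Decide on train, then apply the same decision to test
to_drop = columns_over_missing_threshold(toy_train, threshold=0.40)
toy_train_reduced = toy_train.drop(columns=to_drop)
toy_test_reduced = toy_test.drop(columns=to_drop, errors="ignore")

print("Dropped:", to_drop)
```

Using `errors="ignore"` lets the same list be applied to a test frame that lacks some training columns, such as the target.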

#### Step 5: Handle Missing Values

In [None]:
# Drop columns with more than 40% missing values
train_data = train.drop(columns=missing_train[missing_train > 0.40 * len(train)].index)
test_data = test.drop(columns=missing_test[missing_test > 0.40 * len(test)].index)

In [None]:
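# Impute missing values in the remaining numeric columns with their respective means
# numeric_only=True skips text columns, which have no mean
train_data.fillna(train_data.mean(numeric_only=True), inplace=True)
test_data.fillna(test_data.mean(numeric_only=True), inplace=True)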

In [None]:
# Replace infinity values with NaN
train_data.replace([np.inf, -np.inf], np.nan, inplace=True)
test_data.replace([np.inf, -np.inf], np.nan, inplace=True)

# Fill NaN values with the mean of each numeric column
train_data.fillna(train_data.mean(numeric_only=True), inplace=True)
test_data.fillna(test_data.mean(numeric_only=True), inplace=True)
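The cells above fill each frame with its own column means. A common alternative, sketched here as an assumption rather than as the notebook's approach, is to compute the means on the training data and reuse them for the test data, so no statistic is derived from test rows. The toy frames below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames with one numeric and one text column
toy_train = pd.DataFrame(
    {"amount": [10.0, np.nan, 30.0], "card4": ["visa", None, "visa"]}
)
toy_test = pd.DataFrame(
    {"amount": [np.nan, 100.0], "card4": ["mastercard", None]}
)

# Means of numeric columns only, computed on the training data
train_means = toy_train.mean(numeric_only=True)

# fillna with a Series fills each column by matching its name;
# text columns are absent from train_means and stay untouched
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

Text columns such as `card4` still contain missing values afterwards and need separate handling, for example in the encoding step.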

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    train_data.drop(columns=["isFraud"]),
    train_data["isFraud"],
    test_size=0.2,
    random_state=42,
)
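If fraud cases turn out to be a small minority of rows (an assumption to verify on the data), a plain random split can leave the two parts with different fraud rates. The sketch below shows the `stratify` argument of `train_test_split` on a made-up frame; the names `toy`, `X_tr`, `X_val`, `y_tr`, and `y_val` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 50 rows, 10 of them labeled as fraud (20%)
toy = pd.DataFrame(
    {"TransactionAmt": range(50), "isFraud": [1] * 10 + [0] * 40}
)
X = toy.drop(columns=["isFraud"])
y = toy["isFraud"]

# stratify=y keeps the fraud ratio the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Fraud rate in train:", y_tr.mean())
print("Fraud rate in validation:", y_val.mean())
```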

In [None]:
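# Label encode categorical variables
categorical_cols = ["ProductCD", "card4", "card6", "P_emaildomain", "R_emaildomain"]
# Keep only the columns that survived the missing-value drop
categorical_cols = [col for col in categorical_cols if col in X_train.columns]
for col in categorical_cols:
    # Fit a fresh encoder per column on the values from both splits,
    # so transform() does not fail on categories absent from X_train
    le = LabelEncoder()
    le.fit(pd.concat([X_train[col], X_test[col]]).astype(str))
    X_train[col] = le.transform(X_train[col].astype(str))
    X_test[col] = le.transform(X_test[col].astype(str))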
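`LabelEncoder.transform` raises an error when it meets a category it did not see during fitting. One alternative, shown here as a sketch and not as the notebook's method, is scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"`, which maps unseen categories to a chosen placeholder. The toy frames and the placeholder `-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; "discover" appears only in the test split
toy_train = pd.DataFrame({"card4": ["visa", "mastercard", "visa"]})
toy_test = pd.DataFrame({"card4": ["visa", "discover"]})

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Categories are sorted alphabetically: mastercard -> 0, visa -> 1
train_codes = encoder.fit_transform(toy_train[["card4"]].astype(str))
test_codes = encoder.transform(toy_test[["card4"]].astype(str))

print(train_codes.ravel())
print(test_codes.ravel())
```

Because the encoder is fitted on the training split only, the same fitted object can later be reused on new data at prediction time.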

#### Step 6: Save cleaned data to new CSV files

In [None]:
# Save cleaned data to new CSV files
X_train.to_csv("Resources/clean_train_data.csv", index=False)
X_test.to_csv("Resources/clean_test_data.csv", index=False)

In [None]:
train_data.shape

In [None]:
test_data.shape

In [None]:
from sqlalchemy import create_engine
# Create a SQLAlchemy engine for the PostgreSQL database
engine = create_engine('postgresql://postgres:3720@localhost:5432/Fraud-detection')


In [None]:
# # Save cleaned data to the PostgreSQL database
# train_data.to_sql('clean_train_data', engine, index=False, if_exists='replace')
# test_data.to_sql('clean_test_data', engine, index=False, if_exists='replace')
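The commented-out cell writes the cleaned frames with `to_sql`. The round trip can be tried without a database server: the sketch below swaps the PostgreSQL engine for an in-memory SQLite connection from the standard library, purely for illustration. The frame `toy_clean` and its values are made up.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the cleaned training data
toy_clean = pd.DataFrame(
    {"TransactionID": [1, 2], "TransactionAmt": [10.0, 20.0]}
)

# In-memory SQLite database (an assumption for this sketch, not the project's PostgreSQL)
conn = sqlite3.connect(":memory:")

# Write the frame to a table, replacing it if it already exists
toy_clean.to_sql("clean_train_data", conn, index=False, if_exists="replace")

# Read it back to confirm the contents survived the round trip
round_trip = pd.read_sql("SELECT * FROM clean_train_data", conn)
conn.close()

print(round_trip)
```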