## Python Code for the ML Fraud Detection Model

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.impute import SimpleImputer

# Load the dataset
df = pd.read_csv('Fraud.csv')

# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

# Data Cleaning

# Impute missing balances with the column mean
# (a no-op on this dataset: the check above reports no missing values)
balance_columns = ['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
imputer = SimpleImputer(strategy='mean')
df[balance_columns] = imputer.fit_transform(df[balance_columns])

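# Check multicollinearity using VIF (diagnostic only: no columns are dropped)
# Note: X_numeric still contains the target 'isFraud', so it appears in the VIF table
X_numeric = df.select_dtypes(include=[np.number])
X_const_numeric = add_constant(X_numeric)  # Add a constant column for the intercept term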
vif_data = pd.DataFrame()
vif_data["feature"] = X_const_numeric.columns
vif_data["VIF"] = [variance_inflation_factor(X_const_numeric.values, i) for i in range(X_const_numeric.shape[1])]
print("VIF Values:\n", vif_data)

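# Remove rows that fall outside the 1.5 * IQR fences.
# The filter is applied column by column, so each column's quartiles are computed on the already-filtered rows.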
def handle_outliers(df, columns):
    for column in columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df

# Specify columns to check for outliers
outlier_columns = ['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

# Handle outliers
df = handle_outliers(df, outlier_columns)

# Data Preprocessing
# Convert categorical variables to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['type'], drop_first=True)

# Create a new feature indicating if the recipient is a merchant
df['isMerchant'] = df['nameDest'].apply(lambda x: 1 if x.startswith('M') else 0)

# Drop 'nameDest' and 'nameOrig' columns
df = df.drop(['nameDest', 'nameOrig'], axis=1)

# Convert data types to ensure compatibility
numeric_columns = ['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
df[numeric_columns] = df[numeric_columns].astype(float)

# Split the data into features (X) and target variable (y)
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

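# Feature scaling (tree-based models such as Random Forest do not require it)
# Work on copies so the assignments below do not trigger pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
numeric_columns = X_train.select_dtypes(include=[np.number]).columns
scaler = StandardScaler()
X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])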

# Model Training - Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Model Evaluation - Random Forest
y_pred_rf = rf_model.predict(X_test)

# Evaluate the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

# Display metrics for Random Forest
print("Random Forest Metrics:")
print(f"Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")
print(f"F1 Score: {f1_rf:.4f}")


Missing Values:
 step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
VIF Values:
           feature         VIF
0           const    4.137111
1            step    1.003137
2          amount    3.771634
3   oldbalanceOrg  502.913267
4  newbalanceOrig  504.282321
5  oldbalanceDest   66.101079
6  newbalanceDest   76.200749
7         isFraud    1.186855
8  isFlaggedFraud    1.002562
Random Forest Metrics:
Accuracy: 0.9996
Precision: 0.9783
Recall: 0.5902
F1 Score: 0.7362


## Questions

### 1. Data Cleaning including Missing Values, Outliers, and Multi-collinearity

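- Missing values: the isnull() check reports zero missing values in every column, so the mean imputation applied to 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', and 'newbalanceDest' is a safeguard that changes nothing on this dataset.
- Outliers: rows that fall outside the 1.5 × IQR (interquartile range) fences are removed, column by column, for the same four balance columns.
- Multi-collinearity: the Variance Inflation Factor (VIF) is computed for every numeric column. The origin balances ('oldbalanceOrg' at about 503, 'newbalanceOrig' at about 504) and the destination balances (about 66 and 76) are strongly collinear. The code reports these values but does not drop any feature. Random Forest predictions are largely insensitive to collinearity, although it can blur feature-importance rankings.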
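
The code above only reports VIF values. One common way to act on them is to drop the feature with the largest VIF repeatedly until every remaining value falls under a threshold. The sketch below is an illustration on synthetic data, not the method used in this notebook: the helper names `vif_table` and `drop_high_vif` and the threshold of 10 are assumptions. It computes VIF as the diagonal of the inverse correlation matrix, which matches the usual definition 1 / (1 - R²) when the regression includes an intercept.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")

def drop_high_vif(features: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Drop the column with the largest VIF until all VIFs are <= threshold."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vif = vif_table(kept)
        if vif.max() <= threshold:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept

# Synthetic stand-in data: the two balance columns are nearly identical by construction
rng = np.random.default_rng(0)
old_balance = rng.normal(1000, 300, size=500)
demo = pd.DataFrame({
    "oldbalanceOrg": old_balance,
    "newbalanceOrig": old_balance + rng.normal(0, 5, size=500),
    "amount": rng.normal(50, 10, size=500),
})

pruned = drop_high_vif(demo)
print(vif_table(demo))
print(list(pruned.columns))
```

An alternative that keeps the information from both columns is to replace the pair with their difference (for example, the change in the origin balance), which is worth testing before deleting a feature outright.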

### 2. Describe Your Fraud Detection Model in Elaboration

The fraud detection model implemented in this code is a Random Forest Classifier. Here's a more detailed breakdown:

Random Forest Classifier:

- A Random Forest is an ensemble method that builds many decision trees during training. For classification, its prediction aggregates the trees' votes; scikit-learn averages the trees' class-probability estimates and picks the class with the highest mean.
- Each tree is fit on a bootstrap sample of the training data, and each split considers only a random subset of the features. This randomness decorrelates the trees, which makes the ensemble less prone to overfitting than a single tree.

Supervised Learning:

- The model is supervised: it is trained on a labeled dataset in which the target variable ('isFraud') is known.
- During training it learns relationships between the features and the target.

Feature Engineering:

- The features include the numerical columns and the one-hot encoded 'type' column.
- A new feature, 'isMerchant', flags merchant recipients, identified by a 'nameDest' that starts with 'M'.

Training and Evaluation:

- The dataset is split 80/20 into training and testing sets (test_size=0.2, random_state=42), so performance is measured on data the model has not seen.
- The model is trained with the fit method and then used to predict labels for the testing set.

Performance Metrics:

- Accuracy: the proportion of correctly classified instances.
- Precision: the proportion of predicted frauds that are actual frauds.
- Recall (sensitivity): the proportion of actual frauds that the model detects.
- F1 score: the harmonic mean of precision and recall.

Scikit-Learn Library:

- The model is the RandomForestClassifier from scikit-learn, used with its default hyperparameters apart from random_state=42.

In summary, a Random Forest suits fraud detection because it captures non-linear relationships and feature interactions, resists overfitting, and exposes feature importances. The metrics must be read together. The model misses about 41% of the frauds in the test set (recall 0.5902) yet is 99.96% accurate, which is only possible because fraud is rare there (under 0.1% of rows), so accuracy alone says little. Precision of 0.9783 means that nearly all transactions flagged as fraud are truly fraudulent; the weak point is recall.
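
The aggregation step described in this section can be checked directly. The sketch below uses synthetic data from `make_classification` as a stand-in for the fraud dataset (an assumption for illustration; the sample size, class weights, and tree count are arbitrary) and compares a hand-computed average of the per-tree probabilities with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the fraud data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# Pick the class with the highest mean probability
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

probas_match = np.allclose(mean_proba, forest.predict_proba(X_demo))
labels_match = np.array_equal(manual_pred, forest.predict(X_demo))
print(probas_match, labels_match)
```

Because the forest exposes averaged probabilities, the decision threshold does not have to stay at 0.5. Lowering it is one way to trade some precision for higher recall, which is the weaker metric in the results above.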

### 3. How Did You Select Variables to be Included in the Model

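Variables were kept when they were plausibly relevant to fraud detection. In practice every column is used as a feature except the account identifiers 'nameOrig' and 'nameDest', which are dropped, and the target 'isFraud'; 'step', 'amount', the four balance columns, and 'isFlaggedFraud' all remain.
The categorical 'type' column is converted to numerical form using one-hot encoding (with drop_first=True, so one category serves as the baseline).
A new feature, 'isMerchant', is derived from 'nameDest' before that column is dropped: it is 1 when the recipient ID starts with 'M' and 0 otherwise.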
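
The encoding steps above can be illustrated on a toy frame. The rows and account IDs below are made up for the example; only the column names come from the dataset. The sketch uses the vectorized `str.startswith` accessor, which gives the same result as the `apply` with a lambda used in the notebook.

```python
import pandas as pd

# Toy rows with hypothetical IDs and amounts
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M10", "C20", "C30", "M40"],
    "amount": [10.0, 2500.0, 2500.0, 35.5],
})

# Flag merchant recipients: IDs that start with 'M'
toy["isMerchant"] = toy["nameDest"].str.startswith("M").astype(int)

# One-hot encode 'type'; drop_first=True removes the first category (CASH_OUT here)
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)

# Drop the identifier columns once the derived feature exists
encoded = encoded.drop(columns=["nameOrig", "nameDest"])
print(encoded)
```

The order matters: 'isMerchant' has to be computed before 'nameDest' is dropped.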

### 4. Demonstrate the Performance of the Model by Using the Best Set of Tools

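The model, a RandomForestClassifier from the scikit-learn library, is trained on the training set and evaluated on a separate testing set.
On that testing set it reaches an accuracy of 0.9996, a precision of 0.9783, a recall of 0.5902, and an F1 score of 0.7362.
Because fraud is rare, precision, recall, and F1 are more informative than accuracy: the model's alerts are almost always correct, but it detects only about 59% of the fraudulent transactions.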
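
As a quick consistency check, the F1 score can be recomputed from the precision and recall printed in the notebook output. The two input values are copied from that output; nothing else is assumed.

```python
# Values copied from the "Random Forest Metrics" output above
precision = 0.9783
recall = 0.5902

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Share of actual frauds that the model fails to flag
miss_rate = 1 - recall

print(f"F1 from precision and recall: {f1:.4f}")
print(f"Share of frauds missed: {miss_rate:.4f}")
```

The recomputed F1 agrees with the reported 0.7362, and the miss rate makes the cost of the low recall explicit.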

### 5. What Are the Key Factors that Predict Fraudulent Customer

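Key factors can be read from the feature importances of the fitted Random Forest (the feature_importances_ attribute), which rank the variables by how much they contribute to the model's splits. The code above does not compute or print them, so the ranking for this dataset still has to be extracted.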
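
A minimal sketch of how such a ranking could be extracted is shown below. It trains a small forest on synthetic data in which the label depends on one column only, so the expected winner is known in advance. The data, the labeling rule, and the reuse of the column names 'amount', 'oldbalanceOrg', and 'step' are assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(500, 3)),
    columns=["amount", "oldbalanceOrg", "step"],
)
# Hypothetical rule for the demo only: the label depends on 'amount' alone
y_demo = (X_demo["amount"] > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Impurity-based importances, labeled with column names and sorted
importances = (
    pd.Series(model.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importances)
```

In the notebook the same pattern would apply to rf_model and X_train.columns. Impurity-based importances can be split between collinear features, which is relevant here given the high VIF values of the balance columns.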

### 6. Do These Factors Make Sense? If Yes, How? If Not, How Not

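The factors are patterns that the model finds in the data, so judging whether they make sense requires domain knowledge: a variable with high importance should correspond to a plausible fraud mechanism.
Feature importance shows which variables contribute most to the predictions, not why they do, so it is a starting point for interpretation and not an explanation in itself.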

### 7. What Kind of Prevention Should be Adopted While the Company Updates its Infrastructure

Without specific details on the current infrastructure, only general suggestions are possible: implement anomaly detection on transactions, monitor for unusual patterns during the update, and strengthen security measures.
Regularly updating the fraud prevention algorithms and collaborating with cybersecurity experts are also important.

### 8. Assuming These Actions Have Been Implemented, How Would You Determine if They Work

Effectiveness can be assessed in three ways:

- Continuously monitor fraud incidents and compare them with pre-implementation statistics.
- Re-evaluate the model regularly and update it as new data arrives.
- Collect feedback from fraud detection teams and analyze false positives and false negatives.
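
One simple way to compare fraud incidents before and after the changes is a two-proportion z-test on the fraud rates. The sketch below uses only the standard library. The function name `two_proportion_z_test` and all counts are hypothetical and chosen for illustration.

```python
import math

def two_proportion_z_test(events_before, n_before, events_after, n_after):
    """Return (z, two-sided p-value) for a difference between two proportions."""
    p_before = events_before / n_before
    p_after = events_after / n_after
    # Pooled rate under the null hypothesis of no change
    pooled = (events_before + events_after) / (n_before + n_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (p_before - p_after) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: frauds per 100,000 transactions before and after
z, p_value = two_proportion_z_test(120, 100_000, 80, 100_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A significant drop is only suggestive: seasonality, changes in transaction volume, or shifts in fraud tactics can move the rate independently of the new measures, so the comparison works best alongside the false-positive and false-negative analysis listed above.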