The fraud detection model implemented here is based on a Random Forest classifier, which is an ensemble learning method widely used for classification tasks. Let's break down the key components and characteristics of this fraud detection model:

1. **Data Preprocessing**: The model starts by selecting relevant features from the dataset, including transaction type, transaction amount, balances before and after the transaction for both the origin and destination accounts, and the flag indicating potential fraud. Categorical variables like transaction type are encoded using one-hot encoding to convert them into a numerical format suitable for machine learning algorithms.

2. **Model Selection**: Random Forest is chosen as the classifier because it handles complex, non-linear relationships between features well. It combines many decision trees: each tree is trained on a bootstrap sample of the training data, and the final prediction aggregates the predictions of all trees (e.g., by taking a majority vote in classification tasks).

3. **Training**: The model is trained on the training data, where it learns to classify transactions as fraudulent or non-fraudulent based on the input features. During training, the Random Forest algorithm builds multiple decision trees; each tree is fit on a random sample of the rows and considers a random subset of the features at each split. This randomness decorrelates the trees, which helps to reduce overfitting and improve generalization.

4. **Prediction**: After training, the model is used to predict the fraud label (fraudulent or non-fraudulent) for new transactions in the test dataset. The trained Random Forest classifier examines the input features of each transaction and makes a prediction based on the collective decision of all trees in the forest.

5. **Evaluation**: The model's performance is evaluated with accuracy, precision, recall, and F1-score; the report also lists support, which is the number of test samples in each class rather than a performance metric. Accuracy measures the overall share of correct predictions and can look high on its own when fraud is rare. Precision is the fraction of transactions flagged as fraudulent that really are fraudulent, recall is the fraction of fraudulent transactions the model catches, and F1-score is the harmonic mean of the two.

Overall, this fraud detection model relies on Random Forest's ability to capture complex relationships in the data to identify potentially fraudulent transactions and support fraud prevention efforts.
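
To make the evaluation metrics from step 5 concrete, here is a minimal sketch that computes them by hand on ten made-up labels. The labels and the helper name `binary_metrics` are illustrative assumptions, not part of this notebook. It shows how accuracy can look good on imbalanced data while recall exposes a missed fraud case.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 8 genuine (0) and 2 fraudulent (1) transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical predictions: one of the two fraud cases is missed.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.9 even though half of the fraud cases are missed (recall 0.5), which is why the per-class metrics matter more than accuracy for this task.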

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load the dataset
data = pd.read_csv('/kaggle/input/data-source-transac/Fraud.csv')

# 1. Missing Values
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)

Missing Values:
step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


In [2]:
# 2. Outliers
numerical_columns = data.select_dtypes(include=[np.number]).columns
outliers = {}
for col in numerical_columns:
    z_scores = np.abs(stats.zscore(data[col]))
    outliers[col] = np.where(z_scores > 3)[0]

print("\nOutliers:")
print(outliers)

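# Handle outliers by removing every row flagged in at least one column.
# Caution: numerical_columns also contains the label columns isFraud and
# isFlaggedFraud, so rows flagged as outliers on those columns are dropped too.
# np.where returns positional indices; they match the index labels here
# because the DataFrame still has its default RangeIndex.
outlier_indices = np.unique(np.concatenate(list(outliers.values())))
data.drop(outlier_indices, inplace=True)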


Outliers:
{'step': array([6296000, 6296001, 6296002, ..., 6362617, 6362618, 6362619]), 'amount': array([    359,     375,     376, ..., 6362599, 6362616, 6362617]), 'oldbalanceOrg': array([    662,    1329,    1330, ..., 6362581, 6362582, 6362583]), 'newbalanceOrig': array([    661,     662,    1328, ..., 6362576, 6362578, 6362580]), 'oldbalanceDest': array([    375,     376,     432, ..., 6362134, 6362256, 6362553]), 'newbalanceDest': array([     84,      88,      89, ..., 6362134, 6362256, 6362507]), 'isFraud': array([      2,       3,     251, ..., 6362617, 6362618, 6362619]), 'isFlaggedFraud': array([2736446, 3247297, 3760288, 5563713, 5996407, 5996409, 6168499,
       6205439, 6266413, 6281482, 6281484, 6296014, 6351225, 6362460,
       6362462, 6362584])}
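
One thing worth checking in the output above is that `isFraud` and `isFlaggedFraud` appear among the columns scanned for outliers. For a rare binary label, every positive row can sit more than three standard deviations from the mean, so a z-score filter may drop exactly the rows the model needs to learn from. The sketch below illustrates this on synthetic data and shows one possible way to restrict the check to feature columns. The frame `toy`, its 1% positive rate, and the names `zscore_outlier_mask` and `feature_columns` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic stand-in for the transaction table: one feature and a rare label.
toy = pd.DataFrame({
    "amount": rng.exponential(scale=100.0, size=n),
    "isFraud": np.zeros(n, dtype=int),
})
toy.loc[:99, "isFraud"] = 1  # .loc slices are inclusive, so 100 rows (1%)


def zscore_outlier_mask(series, threshold=3.0):
    """True where |z| exceeds the threshold (population std, like scipy's zscore)."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold


# Applied to the label itself, the filter flags every fraud row and nothing else.
label_mask = zscore_outlier_mask(toy["isFraud"])
print("rows flagged via the label:", int(label_mask.sum()))

# One option: only scan feature columns, never the target.
feature_columns = [c for c in toy.select_dtypes(include=[np.number]).columns
                   if c != "isFraud"]
keep = pd.Series(True, index=toy.index)
for col in feature_columns:
    keep &= ~zscore_outlier_mask(toy[col])

filtered = toy[keep]
print(filtered["isFraud"].value_counts())
```

Whether to remove extreme amounts at all is a separate judgment call, since unusually large transactions may themselves be a fraud signal.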


In [3]:
# 3. Multi-collinearity
correlation_matrix = data[numerical_columns].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

# Calculate VIF
def calculate_vif(data):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = data.columns
    vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(len(data.columns))]
    return vif_data

vif_scores = calculate_vif(data[numerical_columns])
print("\nVIF Scores:")
print(vif_scores)


Correlation Matrix:
                    step    amount  oldbalanceOrg  newbalanceOrig  \
step            1.000000 -0.012889       0.001992        0.002714   
amount         -0.012889  1.000000       0.018753        0.025331   
oldbalanceOrg   0.001992  0.018753       1.000000        0.997895   
newbalanceOrig  0.002714  0.025331       0.997895        1.000000   
oldbalanceDest  0.003059  0.273661       0.124684        0.128072   
newbalanceDest -0.006528  0.374371       0.080616        0.080097   
isFraud              NaN       NaN            NaN             NaN   
isFlaggedFraud       NaN       NaN            NaN             NaN   

                oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
step                  0.003059       -0.006528      NaN             NaN  
amount                0.273661        0.374371      NaN             NaN  
oldbalanceOrg         0.124684        0.080616      NaN             NaN  
newbalanceOrig        0.128072        0.080097      NaN      

VIF Scores:
          Feature         VIF
0            step    1.428556
1          amount    3.782318
2   oldbalanceOrg  402.449844
3  newbalanceOrig  415.412624
4  oldbalanceDest  130.370763
5  newbalanceDest  142.959408
6         isFraud         NaN
7  isFlaggedFraud         NaN
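
The NaN entries for `isFraud` and `isFlaggedFraud` are consistent with those columns being constant after the outlier step: a column with no variation has no defined correlation or VIF. The very large values for the two balance pairs indicate that each balance is almost a linear function of its partner. As a rough illustration of what the statistic measures, the sketch below computes VIF = 1 / (1 - R²) directly with NumPy on synthetic columns. The data and the helper name `vif_table` are assumptions, and each auxiliary regression here includes an intercept, unlike calling `variance_inflation_factor` on raw columns without an added constant, so the numbers are not comparable to the table above.

```python
import numpy as np


def vif_table(X, names):
    """VIF per column, using an intercept in every auxiliary regression."""
    X = np.asarray(X, dtype=float)
    result = {}
    for j, name in enumerate(names):
        y = X[:, j]
        ss_tot = float(((y - y.mean()) ** 2).sum())
        if ss_tot == 0.0:
            # Constant column: R^2 is 0/0, so the VIF is undefined.
            result[name] = float("nan")
            continue
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / ss_tot
        result[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return result


rng = np.random.default_rng(0)
n = 5_000
old_balance = rng.normal(1000.0, 200.0, n)
new_balance = old_balance - rng.normal(50.0, 5.0, n)  # nearly a copy of old_balance
amount = rng.normal(100.0, 30.0, n)                   # unrelated to the balances
constant = np.zeros(n)                                # a label with one class left

vifs = vif_table(
    np.column_stack([old_balance, new_balance, amount, constant]),
    ["old_balance", "new_balance", "amount", "constant"],
)
for name, value in vifs.items():
    print(f"{name:12s} {value:.2f}")
```

High VIF matters most for linear models; tree ensembles such as Random Forest still train on collinear inputs, although feature importances can end up split between the correlated columns.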


To select features for fraudulent transaction detection, we need to consider their potential relevance and importance in identifying fraudulent behavior. Here's how we can evaluate each feature:

1. **step**: This feature represents time, which could be useful for identifying patterns in fraudulent activity over time. For example, fraud might be more prevalent during certain hours or days.

2. **type**: The type of transaction could be highly relevant. For instance, certain types of transactions, like TRANSFER or CASH_OUT, might be more commonly associated with fraud.

3. **amount**: The transaction amount is often indicative of fraudulent behavior, as fraudulent transactions may involve unusually large or small amounts compared to legitimate ones.

4. **nameOrig** and **nameDest**: These features identify the sender and receiver of the transaction, respectively. Analyzing the behavior of these entities could help detect fraudulent activity, especially if certain accounts are frequently involved in fraudulent transactions.

5. **oldbalanceOrg** and **newbalanceOrig**: Changes in the originating account's balance before and after the transaction could signal fraudulent behavior, such as unauthorized transfers or account takeovers.

6. **oldbalanceDest** and **newbalanceDest**: Similar to the previous pair, changes in the destination account's balance could indicate potential fraudulent activity, such as money laundering or fraudulent fund transfers.

7. **isFlaggedFraud**: This feature flags transactions that exceed a certain threshold amount, potentially indicating high-risk or fraudulent transactions.

Considering the above criteria, the features selected for fraudulent transaction detection are:

- **type**
- **amount**
- **oldbalanceOrg**
- **newbalanceOrig**
- **oldbalanceDest**
- **newbalanceDest**
- **isFlaggedFraud**

These features provide a comprehensive view of transaction behavior, account balances, and transaction types, which are crucial for identifying fraudulent activity.

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

selected_columns = ['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud']
X = data[selected_columns]
y = data['isFraud']

X_encoded = pd.get_dummies(X, columns=['type'])

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
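
Because fraud labels are typically rare, a plain random split can leave the test set with very few positives. A common option is to pass `stratify=y` to `train_test_split` so that both splits keep the same class ratio. The sketch below shows `pd.get_dummies` followed by a stratified split on a synthetic frame; the column values, the 5% positive rate, and the frame name `toy` are assumptions, and this is not the split used to produce the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1_000

# Synthetic stand-in with one categorical and one numeric feature.
toy = pd.DataFrame({
    "type": rng.choice(["TRANSFER", "CASH_OUT", "PAYMENT"], size=n),
    "amount": rng.exponential(100.0, size=n),
})
y = pd.Series(np.zeros(n, dtype=int), name="isFraud")
y.iloc[:50] = 1  # 5% positives

# One-hot encode the categorical column, as in the cell above.
X_encoded = pd.get_dummies(toy, columns=["type"])

# stratify=y keeps the 5% positive rate in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```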

In [5]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

In [6]:
y_pred = rf_model.predict(X_test)

In [7]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1204020

    accuracy                           1.00   1204020
   macro avg       1.00      1.00      1.00   1204020
weighted avg       1.00      1.00      1.00   1204020
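
The report above lists only class 0 and has no row for class 1, which suggests that neither the test labels nor the predictions contain a single fraudulent transaction. In that situation an accuracy of 1.0 says nothing about the model's ability to detect fraud. A quick sanity check before trusting such a score is to count the classes in each split. A minimal sketch, with made-up label lists standing in for `y_train` and `y_test` and a hypothetical helper `missing_classes`:

```python
from collections import Counter


def missing_classes(labels, expected=(0, 1)):
    """Return the class counts and the expected classes that never occur."""
    counts = Counter(labels)
    missing = [c for c in expected if counts.get(c, 0) == 0]
    return counts, missing


# Hypothetical splits: the test split has lost every positive example.
y_train_demo = [0] * 95 + [1] * 5
y_test_demo = [0] * 20

report = {}
for name, labels in [("train", y_train_demo), ("test", y_test_demo)]:
    counts, missing = missing_classes(labels)
    report[name] = missing
    status = "ok" if not missing else f"missing classes {missing}"
    print(f"{name}: {dict(counts)} -> {status}")
```

With pandas objects, `y_test.value_counts()` gives the same information in one call.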

