## Accredian Screening Round Assignment

### Candidate Name: Anshul Choudhary (choudharyanshul@iitgn.ac.in)


#### README: I have developed a model for predicting fraudulent transactions for a financial compacy using five different Machine Learning Algorithms. The details are in the code snippet given below. For easy code readabilty, I have tried to write step-wise code and include comments as many as possible everywhere. I have first written down the answers of the questions written under "Candidate Expectations" in the given assignment. I'm hopeful going through those answers will definetely make things more simpler. 

## Candidate Expectation

### 1. Data cleaning including missing values, outliers, and multi-collinearity.

**Answer:**

- **Missing Values:** I checked for missing values and handled them by filling with the mean for numeric columns (`amount`, `oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, `newbalanceDest`).
- **Outliers:** Outliers were identified and removed using the Interquartile Range (IQR) method. This involved calculating the lower and upper whiskers and filtering out values outside this range.
- **Multi-Collinearity:** I examined the correlation matrix to identify and handle highly correlated features to reduce redundancy and improve model performance.

### 2. Describe your fraud detection model in elaboration.

**Answer:**

My fraud detection model utilized multiple machine learning algorithms: Logistic Regression, Decision Tree, K-Nearest Neighbors, Support Vector Machine, and a Neural Network. The steps included:

- **Data Preprocessing:** Data was normalized using MinMaxScaler to ensure all features were on a similar scale. The data was split into 65% for training and 35% for testing using `train_test_split`.
- **Class Imbalance Handling:** I used the NearMiss technique to address the imbalance in the dataset.
- **Model Training and Evaluation:** Each model was trained on the processed data, and performance was evaluated using metrics like ROC AUC Score, F1 Score, Confusion Matrix, and Classification Report.

### 3. How did you select variables to be included in the model?

**Answer:**

- **Feature Engineering:** I created new features such as balance differences (`balance_orig_diff` and `balance_dest_diff`).
- **One-Hot Encoding:** The `type` column was converted into multiple binary columns (`type_CASH_OUT`, `type_DEBIT`, `type_PAYMENT`, `type_TRANSFER`).
- **Correlation Analysis:** Features with high correlation were handled to reduce multi-collinearity.

### 4. Demonstrate the performance of the model by using the best set of tools.

**Answer:**

Here are the performance metrics for each model:

#### Accuracy Scores:
| Model                    | Accuracy   |
|--------------------------|------------|
| Logistic Regression      | 0.867652   |
| Decision Tree            | 0.934783   |
| K-Nearest Neighbors      | 0.983826   |
| Support Vector Machine   | 0.968522   |
| Neural Network           | 0.976348   |

#### F1 Scores:
| Model                    | F1 Score   |
|--------------------------|------------|
| Logistic Regression      | 0.869580   |
| Decision Tree            | 0.937965   |
| K-Nearest Neighbors      | 0.983704   |
| Support Vector Machine   | 0.967753   |
| Neural Network           | 0.976307   |

#### Confusion Matrices:

| Model                    | True Positive | True Negative | False Positive | False Negative |
|--------------------------|---------------|---------------|----------------|----------------|
| Logistic Regression      | 2537          | 2452          | 423            | 338            |
| Decision Tree            | 2835          | 2540          | 335            | 40             |
| K-Nearest Neighbors      | 2807          | 2850          | 25             | 68             |
| Support Vector Machine   | 2716          | 2853          | 22             | 159            |
| Neural Network           | 2802          | 2812          | 63             | 73             |





### 5. What are the key factors that predict fraudulent customer?

**Answer:**

Key factors identified were:
- **Transaction Amount:** Higher amounts are more indicative of fraud.
- **Balance Differences:** Significant differences in balances before and after transactions.
- **Transaction Type:** Certain transaction types like `TRANSFER` and `CASH_OUT` are more prone to fraud.

### 6. Do these factors make sense? If yes, How? If not, How not?

**Answer:**

Yes, these factors make sense based on domain knowledge:
- **Transaction Amount:** Large amounts are often targeted in fraud.
- **Balance Differences:** Large discrepancies indicate unusual activities.
- **Transaction Type:** Some types are inherently riskier and more susceptible to fraud.

### 7. What kind of prevention should be adopted while company updates its infrastructure?

**Answer:**

Preventive measures include:
- **Real-time Monitoring:** Implement real-time transaction monitoring systems.
- **Anomaly Detection:** Use advanced anomaly detection algorithms to flag suspicious activities.
- **User Authentication:** Strengthen user authentication processes, such as multi-factor authentication.
- **Regular Audits:** Conduct regular audits and reviews of transaction data to detect and prevent fraud.

### 8. Assuming these actions have been implemented, how would you determine if they work?

**Answer:**

Evaluation of Prevention Measures:
- **Reduction in Fraudulent Transactions:** Track the number of fraudulent transactions over time to see if there is a reduction.
- **Improvement in Detection Metrics:** Monitor metrics such as ROC AUC Score, Precision, Recall, and F1 Score to assess the effectiveness of the fraud detection model.
- **User Feedback:** Collect feedback from users regarding the security and efficiency of the transaction process.


## Code Work Involved

### 1. Data Loading and Initial Exploration

In [21]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, f1_score
from sklearn.preprocessing import MinMaxScaler
from scipy import stats
import imblearn
from imblearn.under_sampling import NearMiss

# Load dataset
file_path = '/Users/anshul/Downloads/Fraud.csv'
df = pd.read_csv(file_path)
df.head()


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


### 2. Data Cleaning and Handling Outliers

#### 2.1 Handle Categorical Features

In [22]:
# Handle categorical features
payment_types = pd.get_dummies(df['type'], prefix='type', drop_first=True)
df = pd.concat([df, payment_types], axis=1)
df.drop('type', axis=1, inplace=True)

# Convert encoded columns to int64
df['type_CASH_OUT'] = df['type_CASH_OUT'].astype(np.int64)
df['type_DEBIT'] = df['type_DEBIT'].astype(np.int64)
df['type_PAYMENT'] = df['type_PAYMENT'].astype(np.int64)
df['type_TRANSFER'] = df['type_TRANSFER'].astype(np.int64)

df.drop(columns=['nameOrig', 'nameDest'], inplace=True)


#### 2.2 Remove Outliers

In [23]:
# Function to remove outliers
def remove_outliers(df, col):
    lower_quantile = df[col].quantile(0.25)
    upper_quantile = df[col].quantile(0.75)
    IQR = upper_quantile - lower_quantile
    lower_whisker = lower_quantile - 1.5 * IQR
    upper_whisker = upper_quantile + 1.5 * IQR
    temp = df.loc[(df[col] > lower_whisker) & (df[col] < upper_whisker)]
    return temp[col]

# Remove outliers for specified columns
df['amount'] = remove_outliers(df, 'amount')
df['oldbalanceOrg'] = remove_outliers(df, 'oldbalanceOrg')
df['newbalanceOrig'] = remove_outliers(df, 'newbalanceOrig')
df['oldbalanceDest'] = remove_outliers(df, 'oldbalanceDest')
df['newbalanceDest'] = remove_outliers(df, 'newbalanceDest')


#### 2.3 Feature Engineering


In [24]:
# Handle missing values by filling with mean for numeric columns
columns_to_fill = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
for col in columns_to_fill:
    if col in df.columns:
        df[col].fillna(df[col].mean(), inplace=True)
    else:
        print(f"Column '{col}' does not exist in the dataframe")

# Remove outliers for specified columns
columns_to_check = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
for col in columns_to_check:
    df[col] = remove_outliers(df, col)


#### 2.4 Handling Missing Values and Class Imbalance 

In [25]:
# Identify columns with missing values
missing_columns = df.columns[df.isnull().any()]

# Fill missing values for numeric columns
for col in missing_columns:
    if df[col].dtype == 'float64' or df[col].dtype == 'int64':
        df[col].fillna(df[col].mean(), inplace=True)

# Separate features and target
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# Apply NearMiss for undersampling
nm = NearMiss()
X_nm, y_nm = nm.fit_resample(X, y)


### 3. Model Training and Evaluation

#### 3.1 General Steps

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_nm, y_nm, test_size=0.35, stratify=y_nm, random_state=2022)

# Normalize data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)


#### 3.2 Train and Evaluate Multiple Models

#### (a) Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, f1_score

# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

# Evaluation
print("Logistic Regression")
print('ROC AUC Score:', roc_auc_score(y_test, lr_pred))
print('F1 Score:', f1_score(y_test, lr_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, lr_pred))
print('Classification Report:\n', classification_report(y_test, lr_pred))
print('Accuracy Score:', accuracy_score(y_test, lr_pred))


Logistic Regression
ROC AUC Score: 0.8676521739130435
F1 Score: 0.869580119965724
Confusion Matrix:
 [[2452  423]
 [ 338 2537]]
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.85      0.87      2875
           1       0.86      0.88      0.87      2875

    accuracy                           0.87      5750
   macro avg       0.87      0.87      0.87      5750
weighted avg       0.87      0.87      0.87      5750

Accuracy Score: 0.8676521739130435


#### (b) Decision Tree Classifier

In [10]:
from sklearn.tree import DecisionTreeClassifier

# Decision Tree Classifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
dtree_pred = dtree.predict(X_test)

# Evaluation
print("Decision Tree Classifier")
print("ROC AUC Score:", roc_auc_score(y_test, dtree_pred))
print("F1 Score:", f1_score(y_test, dtree_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, dtree_pred))
print("Classification Report:\n", classification_report(y_test, dtree_pred))
print("Accuracy Score:", accuracy_score(y_test, dtree_pred))


Decision Tree Classifier
ROC AUC Score: 0.9347826086956523
F1 Score: 0.9379652605459057
Confusion Matrix:
 [[2540  335]
 [  40 2835]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.88      0.93      2875
           1       0.89      0.99      0.94      2875

    accuracy                           0.93      5750
   macro avg       0.94      0.93      0.93      5750
weighted avg       0.94      0.93      0.93      5750

Accuracy Score: 0.9347826086956522


#### (c) K-Nearest Neighbors

In [11]:
from sklearn.neighbors import KNeighborsClassifier

# K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

# Evaluation
print("K-Nearest Neighbors")
print("ROC AUC Score:", roc_auc_score(y_test, knn_pred))
print("F1 Score:", f1_score(y_test, knn_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, knn_pred))
print("Classification Report:\n", classification_report(y_test, knn_pred))
print("Accuracy Score:", accuracy_score(y_test, knn_pred))


K-Nearest Neighbors
ROC AUC Score: 0.9838260869565217
F1 Score: 0.9837042228841774
Confusion Matrix:
 [[2850   25]
 [  68 2807]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.98      2875
           1       0.99      0.98      0.98      2875

    accuracy                           0.98      5750
   macro avg       0.98      0.98      0.98      5750
weighted avg       0.98      0.98      0.98      5750

Accuracy Score: 0.9838260869565217


#### (d) Support Vector Machine

In [12]:
from sklearn.svm import SVC

# Support Vector Machine
svm = SVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)

# Evaluation
print("Support Vector Machine")
print("ROC AUC Score:", roc_auc_score(y_test, svm_pred))
print("F1 Score:", f1_score(y_test, svm_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, svm_pred))
print("Classification Report:\n", classification_report(y_test, svm_pred))
print("Accuracy Score:", accuracy_score(y_test, svm_pred))


Support Vector Machine
ROC AUC Score: 0.9685217391304347
F1 Score: 0.9677534295385711
Confusion Matrix:
 [[2853   22]
 [ 159 2716]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97      2875
           1       0.99      0.94      0.97      2875

    accuracy                           0.97      5750
   macro avg       0.97      0.97      0.97      5750
weighted avg       0.97      0.97      0.97      5750

Accuracy Score: 0.9685217391304348


#### (e) Nueral Network

In [None]:
pip install tensorflow


In [14]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score

# Define the neural network model
def create_nn_model(input_dim):
    model = Sequential()
    model.add(Dense(64, input_dim=input_dim, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Create the model
input_dim = X_train.shape[1]
nn_model = create_nn_model(input_dim)

# Train the model
history = nn_model.fit(X_train, y_train, epochs=20, batch_size=64, validation_split=0.2, verbose=1)

# Evaluate the model
nn_pred_prob = nn_model.predict(X_test)
nn_pred = (nn_pred_prob >= 0.5).astype(int)

# Evaluation Metrics
print("Neural Network")
print('ROC AUC Score:', roc_auc_score(y_test, nn_pred))
print('F1 Score:', f1_score(y_test, nn_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, nn_pred))
print('Classification Report:\n', classification_report(y_test, nn_pred))
print('Accuracy Score:', accuracy_score(y_test, nn_pred))


Epoch 1/20


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m134/134[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 815us/step - accuracy: 0.6360 - loss: 0.6370 - val_accuracy: 0.8717 - val_loss: 0.3920
Epoch 2/20
[1m134/134[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 426us/step - accuracy: 0.8484 - loss: 0.3977 - val_accuracy: 0.9129 - val_loss: 0.2348
Epoch 3/20
[1m134/134[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 422us/step - accuracy: 0.8984 - loss: 0.2636 - val_accuracy: 0.9265 - val_loss: 0.1657
Epoch 4/20
[1m134/134[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 410us/step - accuracy: 0.9135 - loss: 0.2075 - val_accuracy: 0.9284 - val_loss: 0.1349
Epoch 5/20
[1m134/134[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 418us/step - accuracy: 0.9183 - loss: 0.1677 - val_accuracy: 0.9349 - val_loss: 0.1136
Epoch 6/20
[1m134/134[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 415us/step - accuracy: 0.9315 - loss: 0.1390 - val_accuracy: 0.9574 - val_loss: 0.1016
Epoch 7/20
[1m134/134[0m 

#### Conclusion

In [17]:
# Performance of ML Models
print("Performance of ML Models:")
print('Predictive Accuracy of Logistic Regression:', str(np.round(accuracy_score(y_test, lr_pred) * 100, 2)) + '%')
print('Predictive Accuracy of K Neighbors Classifier:', str(np.round(accuracy_score(y_test, knn_pred) * 100, 2)) + '%')
print('Predictive Accuracy of Support Vector Classifier:', str(np.round(accuracy_score(y_test, svm_pred) * 100, 2)) + '%')
print('Predictive Accuracy of Decision Tree Classifier:', str(np.round(accuracy_score(y_test, dtree_pred) * 100, 2)) + '%')
print('Predictive Accuracy of Nueral Network Model:', str(np.round(accuracy_score(y_test, nn_pred) * 100, 2)) + '%')

Performance of ML Models:
Predictive Accuracy of Logistic Regression: 86.77%
Predictive Accuracy of K Neighbors Classifier: 98.38%
Predictive Accuracy of Support Vector Classifier: 96.85%
Predictive Accuracy of Decision Tree Classifier: 93.48%
Predictive Accuracy of Nueral Network Model: 97.63%
