# Logistic Regression in Financial Data

This script uses logistic regression as a binary classifier to identify fraudulent transactions in a financial dataset.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

financial = pd.read_csv('financial.csv')

financial.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [20]:
model = LogisticRegression()

In [22]:
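X = financial[['amount', 'oldbalanceOrg', 'newbalanceOrig']]  # Features: amount and the originator's balance before and after
y = financial['isFraud']  # Target: 1 for a fraudulent transaction, 0 otherwise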

In [24]:
# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
model.fit(X_train, y_train)

In [26]:
predictions = model.predict(X_test)

In [27]:
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.71      0.98      0.83      1620

    accuracy                           1.00   1272524
   macro avg       0.86      0.99      0.91   1272524
weighted avg       1.00      1.00      1.00   1272524

[[1270270     634]
 [     33    1587]]
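
The figures in the report can be reproduced from the confusion matrix, which also shows that the 1.00 entries are rounded rather than exact. The following is a minimal sketch with the four counts copied from the output above; the variable names are illustrative and not part of the original script.

```python
# Counts copied from the confusion matrix output.
# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 1270270, 634
fn, tp = 33, 1587

# Class 1 (fraud) metrics
precision_1 = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall_1 = tp / (tp + fn)      # share of real frauds that were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (non-fraud) metrics
precision_0 = tn / (tn + fn)   # share of predicted non-frauds that are real non-frauds
recall_0 = tn / (tn + fp)      # share of real non-frauds that were left unflagged

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 1: precision={precision_1:.4f} recall={recall_1:.4f} f1={f1_1:.4f}")
print(f"class 0: precision={precision_0:.6f} recall={recall_0:.6f}")
print(f"accuracy={accuracy:.6f}")
```

Printing more decimal places than the report does makes the 33 false negatives and 634 false positives visible in the class 0 and accuracy figures.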


## Classification Report

### Class 0 (Negative Class)

- Precision: 1.00 (rounded). Of the 1,270,303 instances predicted as class 0, 33 were actually class 1, so precision is about 99.997% rather than exactly 100%.
- Recall: 1.00 (rounded). The model identified 1,270,270 of the 1,270,904 actual class 0 instances (about 99.95%); the remaining 634 were flagged as class 1.
- F1-Score: 1.00 (rounded). Near-perfect precision and recall give a near-perfect F1-score for this class.

### Class 1 (Positive Class)

- Precision: 71%. Of the 2,221 instances predicted as class 1, 1,587 were actually class 1; the other 634 are false positives.
- Recall: 98%. The model identified 1,587 of the 1,620 actual class 1 instances and missed 33.
- F1-Score: 83%. This is the harmonic mean of precision and recall for class 1, so the lower precision pulls it well below the 98% recall.

### Overall Model Performance

- Accuracy: 1.00 (rounded). The model classified 1,271,857 of 1,272,524 instances correctly (about 99.95%). Class 0 makes up about 99.87% of the test set, so this figure mostly reflects performance on class 0.
- Macro Avg: 86% precision, 99% recall, 91% F1-score. These are the unweighted means of the per-class scores, so class 1 counts as much as class 0 and its lower precision shows up here.
- Weighted Avg: 1.00 (rounded) for all three metrics. These averages weight each class's score by its support. With 1,270,904 class 0 instances against 1,620 class 1 instances, they are almost entirely determined by class 0 and say little about fraud detection.
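
The difference between the two averages can be checked by hand. The sketch below recomputes macro and weighted precision from the confusion matrix counts, assuming the usual definitions (unweighted mean versus support-weighted mean of the per-class scores); the dictionary names are illustrative.

```python
# Counts copied from the confusion matrix output
# (rows: true class, columns: predicted class).
tn, fp, fn, tp = 1270270, 634, 33, 1587

# Per-class precision and support (number of true instances of each class).
precision = {0: tn / (tn + fn), 1: tp / (tp + fp)}
support = {0: tn + fp, 1: fn + tp}
total = sum(support.values())

# Macro average: every class counts equally.
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print(f"macro precision:    {macro_precision:.4f}")
print(f"weighted precision: {weighted_precision:.4f}")
```

Because class 1 contributes only 1,620 of 1,272,524 instances, its 71% precision barely moves the weighted figure, while it pulls the macro figure down to about 86%.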

## Interpretation

The model is excellent at identifying class 0 instances (negatives), with precision and recall that both round to 1.00. Performance on class 1 (positives) is good but weaker, particularly in precision: the model over-predicts class 1, producing 634 false positives against 1,587 true positives. The high recall for class 1 (98%) is desirable where missing a positive instance is costly, as in fraud detection or disease diagnosis. The overall accuracy is misleading here because the dataset is heavily imbalanced: class 0 instances outnumber class 1 instances by roughly 785 to 1 in the test set.
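
One way to see why accuracy is a weak measure here is to compare the model with a trivial baseline that labels every transaction as class 0. The sketch below uses only the counts from the report and confusion matrix above; the baseline is hypothetical and was not run in the original script.

```python
# Test-set class counts, taken from the support column of the classification report.
n_negative, n_positive = 1270904, 1620
total = n_negative + n_positive

# Hypothetical baseline: predict class 0 (not fraud) for every transaction.
baseline_accuracy = n_negative / total   # every negative is right, every positive is wrong
baseline_recall_fraud = 0 / n_positive   # no fraud is ever caught

# Fitted model: correct predictions are the diagonal of the confusion matrix.
model_accuracy = (1270270 + 1587) / total
model_recall_fraud = 1587 / n_positive

print(f"baseline: accuracy={baseline_accuracy:.4f} fraud recall={baseline_recall_fraud:.2f}")
print(f"model:    accuracy={model_accuracy:.4f} fraud recall={model_recall_fraud:.2f}")
```

The two accuracy figures differ by less than a tenth of a percentage point, while fraud recall goes from 0 to about 98%. For a dataset this imbalanced, the class 1 precision and recall are the more informative numbers.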