### EDA and a Baseline Logistic Regression Model

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('creditcard.csv') # Load the dataset into a pandas dataframe : df
df.head() # Display the first 5 rows of the dataframe

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [3]:
df.info() # Display the information about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [4]:
# checking the number of missing values in each column
df.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [5]:
df['Class'].value_counts() # Display the number of fraud and non-fraud transactions

Class
0    284315
1       492
Name: count, dtype: int64

0 -> Non Fraud Transactions (Normal Transactions)

1 -> Fraud Transactions

**The data is highly imbalanced: 284315 records for non-fraud (0) transactions versus only 492 records for fraud (1) transactions, so fraud makes up about 0.17% of all records.**

Q. What is an imbalanced dataset?
- A dataset in which the classes or categories are not represented equally.

Q. What problems does a highly imbalanced dataset cause?
- Biased Model Predictions – The model becomes biased towards the majority class and fails to classify the minority class correctly.
- Poor Recall for Minority Class – The model might achieve high accuracy yet miss most minority-class cases (many false negatives).
- Misleading Performance Metrics – Accuracy can be deceptive; a model that predicts only the majority class still scores high accuracy while being useless in practice.
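To make the last point concrete, here is a minimal sketch that uses only the class counts printed above. The "model" is hypothetical: it simply predicts non-fraud for every record, which is enough to show how accuracy and recall diverge on data this skewed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the value_counts() output above
n_normal, n_fraud = 284315, 492
y_true = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)])

# Hypothetical "model" that always predicts the majority class (non-fraud)
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # fraction of fraud cases that were detected

print(f"accuracy     = {accuracy:.4f}")
print(f"fraud recall = {recall:.4f}")
```

Accuracy comes out at roughly 99.8% while fraud recall is 0, which is why accuracy alone is not a useful metric here.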

***We will use undersampling to balance the data: all 492 fraud (1) records plus 492 randomly sampled non-fraud (0) records.***

This new DataFrame is then used for model building.

In [6]:
normal = df[df['Class'] == 0] # Get all the normal transactions
fraud = df[df['Class'] == 1] # Get all the fraud transactions

print("Shape of dataframe containing normal (non-fraud) transactions : ",normal.shape) # Display the shape (rows, columns) of the normal subset
print("Shape of dataframe containing fraud transactions : ",fraud.shape)    # Display the shape (rows, columns) of the fraud subset

Shape of dataframe containing normal (non-fraud) transactions :  (284315, 31)
Shape of dataframe containing fraud transactions :  (492, 31)


In [7]:
normal.Amount.describe() # Summary statistics of Amount for the normal transactions

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [8]:
fraud.Amount.describe() # Summary statistics of Amount for the fraud transactions

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

1. Fraudulent Transactions Have a Higher Average Amount
- Mean for Normal Transactions: $88.29
- Mean for Fraud Transactions: $122.21
- On average, fraudulent transactions are larger than normal ones. Because the fraud median ($9.25) sits far below the fraud mean, this average is pulled up by a minority of large fraudulent transactions rather than describing a typical one.

2. Fraudulent Transactions Have a Lower 25th and 50th Percentile
- 25th Percentile (Q1 - Lower Quartile)
  - Normal: $5.65
  - Fraud: $1.00
- 50th Percentile (Median - Q2)
  - Normal: $22.00
  - Fraud: $9.25
- The median fraud amount is less than half the median normal amount. So while some fraudulent transactions involve large amounts, many involve very small ones, possibly to avoid detection.

3. Fraudulent Transactions Have a Lower Maximum Amount
- Max for Normal Transactions: $25,691.16
- Max for Fraudulent Transactions: $2,125.87
- The largest fraudulent transaction is far smaller than the largest normal one. This could mean that fraudsters avoid extremely large transactions that might trigger alerts, although with only 492 fraud records an extreme value is also simply less likely to appear.

4. Fraudulent Transactions Have a Slightly Higher Standard Deviation
- Std for Normal Transactions: $250.11
- Std for Fraud Transactions: $256.68
- Fraud amounts vary slightly more than normal amounts, even though they stay within a much smaller overall range.
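The comparison above is easier to read when both classes sit in one table. The sketch below shows one way to do that with `groupby(...).agg(...)`. The `toy` frame and its amounts are hypothetical values chosen for illustration, not rows from `creditcard.csv`.

```python
import pandas as pd

# Hypothetical toy amounts, NOT taken from creditcard.csv
toy = pd.DataFrame({
    "Amount": [5.0, 20.0, 25.0, 80.0, 1000.0, 1.0, 1.0, 9.0, 100.0, 500.0],
    "Class":  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# One row per class, one column per statistic
summary = toy.groupby("Class")["Amount"].agg(["mean", "median", "max"])

# A mean far above the median indicates a right-skewed distribution
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary)
```

On the real data the same call would presumably be `df.groupby('Class')['Amount'].agg([...])`, run while `df` still holds the full dataset.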


In [9]:
df.groupby('Class').mean() # Display the mean of all the columns of the dataframe grouped by Class (0 and 1)

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321
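One way to read this table is to rank features by how far apart the two class means are. The sketch below does that for a handful of columns, with the values copied from the output above. The choice of columns is arbitrary, and raw mean gaps ignore each feature's variance, so treat this as a rough first look rather than a feature-importance measure.

```python
import pandas as pd

# Class means copied from the groupby output above (an arbitrary subset of columns)
class_means = pd.DataFrame(
    {
        "V1": [0.008258, -4.771948],
        "V2": [-0.006271, 3.623778],
        "V3": [0.012171, -7.033281],
        "V4": [-0.00786, 4.542029],
        "V7": [0.009637, -5.568731],
        "V8": [-0.000987, 0.570636],
    },
    index=pd.Index([0, 1], name="Class"),
)

# Absolute gap between the fraud and non-fraud means, largest first
gap = (class_means.loc[1] - class_means.loc[0]).abs().sort_values(ascending=False)
print(gap)
```

Among these columns, V3 and V7 show the largest gaps and V8 the smallest.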


Next, we build a balanced sample with an equal number of fraud and non-fraud records by undersampling the majority (non-fraud) class.

In [10]:
from sklearn.utils import resample

normal_under_sample = resample(normal, replace = False, n_samples = len(fraud), random_state = 27) # Undersample the normal transactions
print("Shape of undersampled normal transactions : ",normal_under_sample.shape) # Display the shape of the undersampled normal transactions

Shape of undersampled normal transactions :  (492, 31)


In [11]:
normal_under_sample.head() # Display the first 5 rows of the undersampled normal transactions

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
162213,114934.0,1.995938,0.041506,-1.626617,0.35973,0.328712,-0.670507,0.070409,-0.077927,0.290942,...,-0.298886,-0.842195,0.34581,0.657588,-0.333355,0.142894,-0.071543,-0.037694,13.48,0
75678,56197.0,0.945842,-0.400033,-0.554717,1.77809,1.832547,4.439277,-0.71645,1.068586,0.011658,...,-0.111339,-0.47234,-0.201955,1.025456,0.688154,0.063892,0.006502,0.038923,127.78,0
279712,169046.0,1.735539,-0.483406,-0.845973,1.402665,0.185065,0.613135,-0.052874,0.017813,0.987641,...,-0.476199,-1.326599,0.262618,-0.01674,-0.22176,-1.111037,0.036587,-0.005262,137.48,0
88272,62068.0,1.253755,-0.657127,0.411747,-1.46809,-0.877031,-0.330262,-0.597605,-0.011242,1.955442,...,-0.055549,-0.00312,-0.182191,-0.550678,0.489173,0.02934,0.037419,0.024202,48.4,0
231933,146985.0,0.242337,0.805488,-0.118981,0.734465,1.330932,-1.0846,1.476898,-0.574761,-0.408175,...,0.126925,0.720698,-0.202297,0.005675,-0.187671,-0.574096,-0.031813,-0.121123,1.0,0


In [12]:
df_new = pd.concat([normal_under_sample, fraud]) # Concatenate the undersampled normal transactions and fraud transactions

df_new.Class.value_counts() # Display the number of fraud and non-fraud transactions in the new dataframe

Class
0    492
1    492
Name: count, dtype: int64

In [13]:
df_new.to_csv('data.csv', index = False) # Save the new BALANCED dataframe to a csv file

# From now on, we will use the new balanced dataframe for training the model

df = pd.read_csv('data.csv') # Load the new balanced dataframe into a pandas dataframe : df

In [14]:
df.Class.value_counts()

Class
0    492
1    492
Name: count, dtype: int64

In [15]:
X = df.drop('Class', axis = 1) # Get all the columns except the Class column
y = df['Class'] # Get the Class column

print("Shape of X : ",X.shape) # Display the shape of X
print("Shape of y : ",y.shape) # Display the shape of y

Shape of X :  (984, 30)
Shape of y :  (984,)


In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)

print("Shape of X_train : ",X_train.shape) # Display the shape of X_train
print("Shape of y_train : ",y_train.shape) # Display the shape of y_train
print("Shape of X_test : ",X_test.shape) # Display the shape of X_test
print("Shape of y_test : ",y_test.shape) # Display the shape of y_test

Shape of X_train :  (787, 30)
Shape of y_train :  (787,)
Shape of X_test :  (197, 30)
Shape of y_test :  (197,)
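Note that because undersampling happened before `train_test_split`, the test set here is also balanced 50/50, unlike the original data. As a hedged sketch of an alternative ordering, the example below splits first and undersamples only the training portion, so the held-out set keeps the original class ratio. It runs on synthetic data: the `toy` frame, its `feature` column, and the class sizes are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 5000 normal rows, 50 fraud rows
rng = np.random.default_rng(0)
n_normal, n_fraud = 5000, 50
toy = pd.DataFrame({
    "feature": rng.normal(size=n_normal + n_fraud),
    "Class": [0] * n_normal + [1] * n_fraud,
})

# 1) Split first; stratify keeps the fraud ratio identical in both parts
train, test = train_test_split(toy, test_size=0.2, stratify=toy["Class"], random_state=2)

# 2) Undersample the majority class in the training part only
train_normal = train[train["Class"] == 0]
train_fraud = train[train["Class"] == 1]
train_normal_down = resample(
    train_normal, replace=False, n_samples=len(train_fraud), random_state=27
)
train_balanced = pd.concat([train_normal_down, train_fraud])

print(train_balanced["Class"].value_counts())  # balanced training set
print(test["Class"].value_counts())            # test set keeps the original imbalance
```

Metrics computed on such a test set would reflect the class ratio the model faces in practice, at the cost of having very few fraud cases to evaluate on.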


In [17]:
import dagshub
import mlflow

mlflow.set_tracking_uri('https://dagshub.com/therealabhishek/Credit_Card_Fraud_MLOPS.mlflow')
dagshub.init(repo_owner='therealabhishek', repo_name='Credit_Card_Fraud_MLOPS', mlflow=True)

mlflow.set_experiment('Baseline Logistic Regression Model') # Set the MLflow experiment name

2025/06/12 10:34:26 INFO mlflow.tracking.fluent: Experiment with name 'Baseline Logistic Regression Model' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/c099692ff5da4e24ad30606f597bf61b', creation_time=1749704667636, experiment_id='0', last_update_time=1749704667636, lifecycle_stage='active', name='Baseline Logistic Regression Model', tags={}>

In [18]:
import mlflow
import mlflow.sklearn  # explicit import of the submodule used by mlflow.sklearn.log_model below
import logging
import os
import time
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

logging.info("Starting MLflow run...")

with mlflow.start_run():
    start_time = time.time()
    
    try:
        logging.info("Logging preprocessing parameters...")
        mlflow.log_param("test_size", 0.2)

        logging.info("Initializing Logistic Regression model...")
        model = LogisticRegression(max_iter=1000)  # Raised from the default of 100; the log below shows the solver still hits this limit on the unscaled features

        logging.info("Fitting the model...")
        model.fit(X_train, y_train)
        logging.info("Model training complete.")

        logging.info("Logging model parameters...")
        mlflow.log_param("model", "Logistic Regression")

        logging.info("Making predictions...")
        y_pred = model.predict(X_test)

        logging.info("Calculating evaluation metrics...")
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        logging.info("Logging evaluation metrics...")
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)

        logging.info("Saving and logging the model...")
        mlflow.sklearn.log_model(model, "model")

        # Log execution time
        end_time = time.time()
        logging.info(f"Model training and logging completed in {end_time - start_time:.2f} seconds.")

        # Save and log the notebook
        # notebook_path = "exp1_baseline_model.ipynb"
        # logging.info("Executing Jupyter Notebook. This may take a while...")
        # os.system(f"jupyter nbconvert --to notebook --execute --inplace {notebook_path}")
        # mlflow.log_artifact(notebook_path)

        # logging.info("Notebook execution and logging complete.")

        # Print the results for verification
        logging.info(f"Accuracy: {accuracy}")
        logging.info(f"Precision: {precision}")
        logging.info(f"Recall: {recall}")
        logging.info(f"F1 Score: {f1}")

    except Exception as e:
        logging.error(f"An error occurred: {e}", exc_info=True)

2025-06-12 10:38:38,021 - INFO - Starting MLflow run...
2025-06-12 10:38:39,616 - INFO - Logging preprocessing parameters...
2025-06-12 10:38:40,099 - INFO - Initializing Logistic Regression model...
2025-06-12 10:38:40,100 - INFO - Fitting the model...
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
2025-06-12 10:38:41,730 - INFO - Model training complete.
2025-06-12 10:38:41,732 - INFO - Logging model parameters...
2025-06-12 10:38:42,155 - INFO - Making predictions...
2025-06-12 10:38:42,160 - INFO - Calculating evaluation metrics...
2025-06-12 10:38:42,195 - INFO - Logging evaluation metrics...
2025-06-12 10:3

🏃 View run big-crow-969 at: https://dagshub.com/therealabhishek/Credit_Card_Fraud_MLOPS.mlflow/#/experiments/0/runs/8a72cc2bdf434a0db49fbeb7168b8477
🧪 View experiment at: https://dagshub.com/therealabhishek/Credit_Card_Fraud_MLOPS.mlflow/#/experiments/0
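The convergence warning in the log above suggests scaling the data. A minimal sketch of that idea, assuming a `StandardScaler` placed in front of the same `LogisticRegression(max_iter=1000)`, is shown below. It uses synthetic data from `make_classification`, with one column inflated to mimic a large-magnitude feature such as `Time`; none of it comes from the notebook's data, and this is not the configuration that was logged to MLflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not creditcard.csv)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X[:, 0] *= 100_000  # inflate one column to mimic a large-magnitude feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# The scaler is fit on the training data only, then applied before the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# Number of solver iterations actually used
n_iter = int(pipe.named_steps["logisticregression"].n_iter_[0])
test_accuracy = pipe.score(X_te, y_te)

print(f"iterations used: {n_iter}")
print(f"test accuracy:   {test_accuracy:.3f}")
```

Wrapping both steps in one pipeline means the same object can be passed to `mlflow.sklearn.log_model`, so the scaling travels with the model.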
