# Name : Shubham Chavhan 


These two lines import the pandas library and the glob module.
glob is used to find every CSV file in a directory whose name matches a given pattern, so that the files can be read one by one.

In [1]:
import pandas as pd
import glob

This block of code reads all the CSV files in the "simulated-data-csv" directory and concatenates them into a single pandas DataFrame called "df". It first creates an empty list called "all_data". Then, for each CSV file that matches the pattern, it reads the file with pandas' read_csv() function and appends the resulting DataFrame to "all_data". Finally, it combines all the DataFrames in the list with pandas' concat() function; ignore_index=True discards the per-file row indices and numbers the rows of "df" from 0.

In [2]:
all_data = []

In [3]:
for i in glob.glob("simulated-data-csv/*.csv"):
    data = pd.read_csv(i)
    all_data.append(data)

In [4]:
df = pd.concat(all_data, ignore_index=True)
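
The three cells above can be folded into one helper. The following is a minimal sketch, not the notebook's own code: the name `load_csv_dir` is hypothetical, and sorting the paths is an added assumption (glob.glob returns matches in arbitrary order, so sorting keeps the row order reproducible). The demonstration uses two throwaway files with made-up values.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_dir(pattern):
    """Read every CSV matching `pattern` and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(pattern))  # sorted so the row order is reproducible
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration on two temporary files (made-up values, not the real data).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"TX_AMOUNT": [1.0, 2.0]}).to_csv(os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame({"TX_AMOUNT": [3.0]}).to_csv(os.path.join(tmp, "b.csv"), index=False)
    demo = load_csv_dir(os.path.join(tmp, "*.csv"))

print(demo)
```

Raising an error when nothing matches is a design choice: pd.concat() fails on an empty list anyway, and an explicit message makes a wrong directory name easier to spot.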

This line of code displays the first 5 rows of the DataFrame "df" to verify that the data was loaded correctly.

In [5]:
df.head()

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO
0,0,2018-04-01 00:00:31,596,3156,57.16,31,0,0,0
1,1,2018-04-01 00:02:10,4961,3412,81.51,130,0,0,0
2,2,2018-04-01 00:07:56,2,1365,146.0,476,0,0,0
3,3,2018-04-01 00:09:29,4128,8737,64.49,569,0,0,0
4,4,2018-04-01 00:10:34,927,9906,50.99,634,0,0,0


This line of code displays information about the DataFrame "df", such as the number of rows and columns, the data types of the columns, and the amount of memory used by the DataFrame.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754155 entries, 0 to 1754154
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   TRANSACTION_ID     int64  
 1   TX_DATETIME        object 
 2   CUSTOMER_ID        int64  
 3   TERMINAL_ID        int64  
 4   TX_AMOUNT          float64
 5   TX_TIME_SECONDS    int64  
 6   TX_TIME_DAYS       int64  
 7   TX_FRAUD           int64  
 8   TX_FRAUD_SCENARIO  int64  
dtypes: float64(1), int64(7), object(1)
memory usage: 120.4+ MB
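
The info() output lists TX_DATETIME as object, meaning the timestamps are stored as text. The notebook drops this column later, so the conversion below is only an optional sketch for the case where the timestamp is kept; it uses the first two timestamps from the head() output above.

```python
import pandas as pd

# The first two timestamps shown in the df.head() output.
toy = pd.DataFrame({"TX_DATETIME": ["2018-04-01 00:00:31", "2018-04-01 00:02:10"]})

# Parse the text into real datetime values.
toy["TX_DATETIME"] = pd.to_datetime(toy["TX_DATETIME"])

# Datetime columns expose components through the .dt accessor.
minutes = toy["TX_DATETIME"].dt.minute
print(toy["TX_DATETIME"].dtype)
print(list(minutes))
```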


This line of code checks for missing values in the DataFrame "df" and returns the number of missing values in each column. Every count is 0, so no rows need to be dropped or imputed.

In [7]:
df.isnull().sum()

TRANSACTION_ID       0
TX_DATETIME          0
CUSTOMER_ID          0
TERMINAL_ID          0
TX_AMOUNT            0
TX_TIME_SECONDS      0
TX_TIME_DAYS         0
TX_FRAUD             0
TX_FRAUD_SCENARIO    0
dtype: int64

This line of code drops the columns of the DataFrame "df" that are not used by the fraud detection model. The axis=1 parameter specifies that columns (rather than rows) are dropped, and the inplace=True parameter specifies that the change is made to "df" itself instead of returning a new DataFrame.

In [8]:
df.drop(["TX_DATETIME", "TRANSACTION_ID", "CUSTOMER_ID", "TERMINAL_ID", "TX_TIME_DAYS"], axis=1, inplace=True)

In [9]:
df.head()

Unnamed: 0,TX_AMOUNT,TX_TIME_SECONDS,TX_FRAUD,TX_FRAUD_SCENARIO
0,57.16,31,0,0
1,81.51,130,0,0
2,146.0,476,0,0
3,64.49,569,0,0
4,50.99,634,0,0
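
An equivalent way to drop columns is the `columns=` keyword, which returns a new DataFrame and leaves the original untouched. The sketch below shows this on a two-row frame that reuses the notebook's column names and the first two rows of the head() output; `toy` and `trimmed` are illustrative names only.

```python
import pandas as pd

# Two-row frame that reuses some of the notebook's column names.
toy = pd.DataFrame({
    "TRANSACTION_ID": [0, 1],
    "TX_DATETIME": ["2018-04-01 00:00:31", "2018-04-01 00:02:10"],
    "TX_AMOUNT": [57.16, 81.51],
    "TX_FRAUD": [0, 0],
})

# `columns=` replaces `axis=1`; the result is a new frame and `toy` is unchanged.
trimmed = toy.drop(columns=["TRANSACTION_ID", "TX_DATETIME"])

print(list(trimmed.columns))
print(list(toy.columns))
```

Keeping the original frame around can be convenient in a notebook, because re-running a cell that drops columns in place raises a KeyError the second time.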


This block of code prints the unique values of each column in the DataFrame "df", which shows what values each feature can take. Here TX_FRAUD is binary (0 or 1) and TX_FRAUD_SCENARIO takes four values (0 to 3).

In [10]:
for i in df:
    print(i,"-",df[i].unique())

TX_AMOUNT - [ 57.16  81.51 146.   ... 569.4  199.19 358.2 ]
TX_TIME_SECONDS - [      31      130      476 ... 15811101 15811192 15811197]
TX_FRAUD - [0 1]
TX_FRAUD_SCENARIO - [0 1 3 2]
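
unique() shows which values occur but not how often. For a label column, value_counts() gives the class balance. The sketch below uses made-up labels, since the notebook does not report the real class proportions.

```python
import pandas as pd

# Made-up labels for illustration only; not taken from the real dataset.
toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 2, 3], name="TX_FRAUD_SCENARIO")

counts = toy.value_counts()                # number of rows per class
shares = toy.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```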


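These two lines of code split the DataFrame "df" into two parts: the features (stored in "X") and the target variable (stored in "y"). The target variable is the "TX_FRAUD_SCENARIO" column, and it is dropped from the features because it is the variable we want to predict. Since it takes four values, this is a multiclass problem. Note that "X" still contains TX_AMOUNT, TX_TIME_SECONDS, and TX_FRAUD; TX_FRAUD is itself a fraud label, so its presence among the features should be kept in mind when reading the accuracy scores below.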

In [11]:
X = df.drop("TX_FRAUD_SCENARIO", axis = 1)
y = df.TX_FRAUD_SCENARIO
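
Because only TX_FRAUD_SCENARIO is dropped, "X" keeps the TX_FRAUD column. One way to see how closely a remaining column tracks the target is a cross-tabulation. The sketch below runs on made-up rows; how the two columns relate in the real data is not shown in the notebook.

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names.
toy = pd.DataFrame({
    "TX_AMOUNT": [57.16, 81.51, 146.0, 64.49],
    "TX_FRAUD": [0, 1, 0, 1],
    "TX_FRAUD_SCENARIO": [0, 1, 0, 3],
})

X = toy.drop(columns="TX_FRAUD_SCENARIO")
y = toy["TX_FRAUD_SCENARIO"]

# Rows: values of the remaining flag; columns: values of the target.
table = pd.crosstab(X["TX_FRAUD"], y)
print(table)
```

If every non-zero target value falls in a single row of such a table, that feature alone nearly determines whether the target is zero, which would inflate accuracy.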

This line of code splits the features ("X") and target variable ("y") into training and testing sets. The training set contains 80% of the data, and the testing set contains 20% of the data. The random_state=42 parameter ensures that the data is split in a reproducible way.

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
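
With rare classes, a purely random split can leave the test set with a different class mix than the training set. train_test_split accepts a `stratify` argument that keeps class proportions the same in both parts. The following sketch uses made-up labels; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 zeros and 10 ones.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# The 10% share of positives is preserved in both parts.
print(int(y_tr.sum()), "positives among", len(y_tr), "training rows")
print(int(y_te.sum()), "positives among", len(y_te), "test rows")
```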

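These lines of code import the logistic regression algorithm from scikit-learn and use it to fit a model on the training set. The score() method then reports the mean accuracy on the test set.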

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
lr = LogisticRegression(max_iter=100000)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9918450764043086
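
An accuracy figure is easier to judge next to the accuracy of always predicting the most common class. The sketch below computes that baseline in plain Python; `majority_baseline_accuracy` is a hypothetical helper and the labels are made up, since the notebook does not report the class proportions of the test set.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up test labels: 990 legitimate rows and 10 fraud rows.
y_toy = [0] * 990 + [1] * 10
print(majority_baseline_accuracy(y_toy))  # 0.99
```

On the real data the same function could be called as majority_baseline_accuracy(list(y_test)).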

The next two cells repeat the same fit-and-score steps with a Bernoulli naive Bayes classifier.

In [16]:
from sklearn.naive_bayes import BernoulliNB

In [17]:
berno = BernoulliNB()
berno.fit(X_train, y_train)
berno.score(X_test, y_test)

0.9969073428516865
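
BernoulliNB works on binary features. Its `binarize` parameter (default 0.0) maps every feature value above the threshold to 1 and everything else to 0 before fitting. The sketch below imitates that step in plain Python on the first rows of the head() output above; `binarize_row` is a hypothetical helper, not part of scikit-learn.

```python
def binarize_row(row, threshold=0.0):
    """Map each value to 1 if it is strictly greater than `threshold`, else 0."""
    return [1 if value > threshold else 0 for value in row]


# Rows from the df.head() output: TX_AMOUNT, TX_TIME_SECONDS, TX_FRAUD.
rows = [
    [57.16, 31, 0],
    [81.51, 130, 0],
    [146.0, 476, 0],
]

for row in rows:
    print(row, "->", binarize_row(row))
```

Positive amounts and timestamps all become 1 at this threshold, so for rows like these only TX_FRAUD still varies after binarization.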

The same steps are then repeated with a multinomial naive Bayes classifier, which reaches a noticeably lower test accuracy.

In [18]:
from sklearn.naive_bayes import MultinomialNB

In [19]:
multi = MultinomialNB()
multi.fit(X_train, y_train)
multi.score(X_test, y_test)

0.802828142324937
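
score() returns one overall accuracy, which hides how each class fares. A per-class recall (the fraction of rows of each class that were predicted correctly) gives a finer picture. The sketch below computes it in plain Python on made-up labels; `per_class_recall` is a hypothetical helper.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of rows with that true label predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Made-up labels and predictions for the four scenario classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 0, 2, 0]
print(per_class_recall(y_true, y_pred))  # {0: 0.75, 1: 0.5, 2: 1.0, 3: 0.0}
```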

The remaining cells tune the BernoulliNB hyperparameters alpha and binarize with a 5-fold cross-validated grid search, then fit a new model with the best parameters found. The tuned model reaches the same test accuracy as the default BernoulliNB.

In [20]:
from sklearn.model_selection import GridSearchCV

In [21]:
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'binarize': [0.0, 0.5, 1.0]
}

In [22]:
grid_search = GridSearchCV(berno, param_grid, cv=5)

In [23]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=BernoulliNB(),
             param_grid={'alpha': [0.1, 0.5, 1.0, 2.0],
                         'binarize': [0.0, 0.5, 1.0]})
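
A grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold. The sketch below enumerates the grid from the cell above with itertools to show how the cost adds up; it only counts fits and does not train anything.

```python
from itertools import product

param_grid = {
    "alpha": [0.1, 0.5, 1.0, 2.0],
    "binarize": [0.0, 0.5, 1.0],
}
cv = 5

# One dict per combination of alpha and binarize.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combos) * cv

print(len(combos), "combinations,", n_fits, "cross-validation fits")
```

With the default refit=True, GridSearchCV performs one additional fit of the best combination on the full training set.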

In [24]:
grid_search.best_params_

{'alpha': 0.1, 'binarize': 0.0}

In [25]:
berno1 = BernoulliNB(alpha = grid_search.best_params_["alpha"], binarize = grid_search.best_params_["binarize"])
berno1.fit(X_train, y_train)
berno1.score(X_test, y_test)

0.9969073428516865
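
Since GridSearchCV refits the best configuration on the whole training set by default (refit=True), the fitted model is also available as `best_estimator_` without building a new classifier by hand. The sketch below shows this on made-up binary data with a smaller grid; the data, the grid, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Made-up binary data: the label simply copies the first feature.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(60, 3))
y_toy = X_toy[:, 0]

search = GridSearchCV(BernoulliNB(), {"alpha": [0.1, 1.0]}, cv=3)
search.fit(X_toy, y_toy)

# The best configuration has already been retrained on all of X_toy.
best = search.best_estimator_
print(search.best_params_, best.score(X_toy, y_toy))
```

In the notebook this would correspond to calling grid_search.best_estimator_.score(X_test, y_test).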