# Credit Card Transaction Fraud Detection Dataset EDA

In this notebook, we explore the dataset to identify the attributes that most influence our dependent variable, is_fraud:

* is_fraud = 1 indicates a fraudulent transaction
* is_fraud = 0 indicates a non-fraudulent transaction

We also need to generate new features. Intuitively, the following could be informative:
* The hour of the day. Fraudulent transactions may be more likely to occur at odd hours, so we extract the hour from the transaction timestamp.
* An encoded hour. Having extracted the hour, we flag transactions made between 21:00 and 05:00 as abnormal, on the assumption that they are more likely to be fraudulent.
* The transaction frequency. A sudden increase in the number of transactions over the last 1/7/30 days might indicate fraud.
* The time since the previous transaction on the same card (time_diff).

In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/kaggle/input/fraud-detection/fraudTrain.csv',parse_dates=['trans_date_trans_time',])

In [None]:
df.columns

# Add Features

## Add hour feature

Here, we first extract the hour. We then encode transactions made during normal hours (05:00-21:00) as normal (0) and transactions made during abnormal hours (21:00-05:00) as abnormal (1).

In [None]:
df['hour'] = df.trans_date_trans_time.dt.hour

In [None]:
df['hourEnc'] = 0
# hour < 5 covers 00:00-04:59; hour >= 21 covers 21:00-23:59
df.loc[df.hour < 5,'hourEnc'] = 1
df.loc[df.hour >= 21,'hourEnc'] = 1
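
As a quick check of the window boundaries, the encoding can be tried on a handful of made-up timestamps. This is a minimal sketch, assuming the 21:00-05:00 window described above with 21:00 itself counted as abnormal; the `toy` frame and its timestamps are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy timestamps (hypothetical, not from the dataset) around the window edges
toy = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime([
        "2020-01-01 04:59:00",
        "2020-01-01 05:00:00",
        "2020-01-01 20:59:00",
        "2020-01-01 21:00:00",
        "2020-01-01 23:30:00",
    ])
})
toy["hour"] = toy["trans_date_trans_time"].dt.hour

# Abnormal (1) if the hour falls in 21:00-04:59, normal (0) otherwise
toy["hourEnc"] = ((toy["hour"] < 5) | (toy["hour"] >= 21)).astype(int)
print(toy)
```

Building the flag from a single boolean expression gives the same kind of 0/1 column as the `.loc` assignments and makes the two boundary conditions visible side by side.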

## Add frequencies of transactions

Next, we count the transactions each card made in the preceding 1/7/30 days. For this, we use the pandas rolling function with a time-based window.

In [None]:
# Extract frequencies of transactions in last 1/7/30 days
def last1DayTransactionCount(x):
    # Row labels as values, timestamps as a sorted index, so rolling() accepts a time window
    temp = pd.Series(x.index, index = x.trans_date_trans_time, name='count_1_day').sort_index()
    # count() includes the current transaction, so subtract 1 to keep only the earlier ones
    count_1_day = temp.rolling('1d').count() - 1
    # Map the counts back to the original row labels
    count_1_day.index = temp.values
    x['count_1_day'] = count_1_day.reindex(x.index)
    return x
# Same logic with a 7-day window
def last7DaysTransactionCount(x):
    temp = pd.Series(x.index, index = x.trans_date_trans_time, name='count_7_days').sort_index()
    count_7_days = temp.rolling('7d').count() - 1
    count_7_days.index = temp.values
    x['count_7_days'] = count_7_days.reindex(x.index)
    return x
# Same logic with a 30-day window
def last30DaysTransactionCount(x):
    temp = pd.Series(x.index, index = x.trans_date_trans_time, name='count_30_days').sort_index()
    count_30_days = temp.rolling('30d').count() - 1
    count_30_days.index = temp.values
    x['count_30_days'] = count_30_days.reindex(x.index)
    return x

In [None]:
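# group_keys=False keeps the original row index instead of adding cc_num as an index level
df1 = df.groupby('cc_num', group_keys=False).apply(last1DayTransactionCount)

In [None]:
df1 = df1.groupby('cc_num', group_keys=False).apply(last7DaysTransactionCount)

In [None]:
df1 = df1.groupby('cc_num', group_keys=False).apply(last30DaysTransactionCount)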
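
To see what these counts look like, here is a small worked example for a single card. It is a sketch under stated assumptions: the timestamps are made up, `prior_transaction_count` is a hypothetical helper that mirrors the functions above, and the window is passed as a `pd.Timedelta` rather than a string.

```python
import pandas as pd

# Toy transactions for a single card (hypothetical timestamps, not from the dataset)
toy = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime([
        "2020-01-01 10:00",
        "2020-01-01 18:00",
        "2020-01-02 09:00",
        "2020-01-05 12:00",
        "2020-01-08 08:00",
    ])
})

def prior_transaction_count(frame, window):
    """Count earlier transactions inside a trailing time window (hypothetical helper)."""
    # Row labels as values, timestamps as a sorted index, so rolling() accepts a time window
    labels = pd.Series(frame.index, index=frame["trans_date_trans_time"]).sort_index()
    # count() includes the current row, so subtract 1 to keep only the earlier ones
    counts = labels.rolling(window).count() - 1
    # Map the counts back onto the original row labels
    counts.index = labels.values
    return counts.reindex(frame.index)

toy["count_1_day"] = prior_transaction_count(toy, pd.Timedelta(days=1))
toy["count_7_days"] = prior_transaction_count(toy, pd.Timedelta(days=7))
print(toy)
```

Each window is open on the left and closed on the right, so a transaction exactly one window length earlier is not counted.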

## Add times since last transaction - time_diff

In [None]:
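def timeDifference(x):
    # shift() takes the previous row, so rows must be in chronological order within each card
    x['time_diff'] = x.trans_date_trans_time - x.trans_date_trans_time.shift()
    return x

In [None]:
# group_keys=False keeps the original row index instead of adding cc_num as an index level
df1 = df1.groupby('cc_num', group_keys=False).apply(timeDifference)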

In [None]:
# total_seconds() gives the full gap in seconds; .dt.seconds would drop whole days
df1['time_diff'] = df1['time_diff'].dt.total_seconds()
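
The per-card gap can also be sketched without a custom function, using a grouped `diff()`. This is an illustrative alternative on made-up data, not the notebook's own method; the `toy` frame and its card numbers are hypothetical.

```python
import pandas as pd

# Toy transactions for two cards (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    "cc_num": [111, 111, 222, 111, 222],
    "trans_date_trans_time": pd.to_datetime([
        "2020-01-01 10:00:00",
        "2020-01-01 10:05:00",
        "2020-01-01 11:00:00",
        "2020-01-03 10:05:00",
        "2020-01-01 11:00:30",
    ]),
})

# Sort so that "previous row" means "previous transaction in time" for each card
toy = toy.sort_values(["cc_num", "trans_date_trans_time"])

# Per-card gap to the previous transaction; a card's first transaction has no gap (NaT)
gap = toy.groupby("cc_num")["trans_date_trans_time"].diff()

# total_seconds() keeps whole days in the result, unlike the .seconds component
toy["time_diff"] = gap.dt.total_seconds()
print(toy)
```

The two-day gap for card 111 shows why the full duration matters: its seconds component alone would be 0.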

# Display correlation heatmaps

In [None]:
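fig, ax = plt.subplots(figsize=(20,10))
# hour and hourEnc were added to df above, so drop them for the baseline heatmap;
# numeric_only=True skips the string and datetime columns
baseline_corr = df.drop(columns=['hour','hourEnc']).corr(numeric_only=True)
sns.heatmap(baseline_corr,annot=True,ax=ax).set_title('Correlation heatmap without generated features')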

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
# numeric_only=True skips the string and datetime columns
sns.heatmap(df1.corr(numeric_only=True),annot=True,ax=ax).set_title('Correlation heatmap with generated features')

# is_fraud correlation

As the following correlation series shows, amount, normal/abnormal hour (hourEnc), count_30_days, count_7_days, time_diff and hour are the features most strongly correlated with is_fraud.

In [None]:
df1.corr(numeric_only=True)['is_fraud'].abs().sort_values(ascending=False)
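
The same ranking can be sketched on a toy frame, leaving out the target's trivial correlation with itself. The values and the `category` column below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values, not from the dataset) with one non-numeric column
toy = pd.DataFrame({
    "amt": [10.0, 250.0, 12.5, 900.0, 30.0, 700.0],
    "hourEnc": [0, 1, 0, 1, 0, 0],
    "category": ["food", "shopping", "food", "travel", "food", "travel"],
    "is_fraud": [0, 1, 0, 1, 0, 1],
})

# Keep numeric columns only; Pearson correlation is not defined for strings
numeric = toy.select_dtypes(include="number")

# Absolute correlation with the target, without the self-correlation of is_fraud
ranking = (
    numeric.corr()["is_fraud"]
    .drop("is_fraud")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Keep in mind that this ranks linear association only; a feature with a low Pearson correlation may still be useful to a non-linear model.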

Fin.