# Loading data

In [38]:
import pandas as pd
df = pd.read_csv("fraud.csv")

## Cleaning data

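These data are mostly clean, but we need to add a new field for transaction interarrival time: the time elapsed since the same user's previous transaction.  Unlike the rest of the work in this notebook, we'll do this for *all* our data (i.e., we'll do this before holding out a test set).  Each value depends only on a user's earlier transaction, so computing it up front does not leak information from the test period into the training set.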

In [65]:
df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

In [66]:
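# Pair each row with the user and timestamp of the row before it.
shifted = df.shift(1)[['user_id', 'timestamp']]

df['prev_user_id'] = shifted['user_id']
df['prev_timestamp'] = shifted['timestamp']
# Keep the gap only when the previous row belongs to the same user;
# where() fills every other row (each user's first transaction) with NaN.
df['interarrival'] = (df['timestamp'] - df['prev_timestamp']).where(df['user_id'] == df['prev_user_id'])

# Drop the helper columns.
del df['prev_user_id']
del df['prev_timestamp']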

In [67]:
df

| | timestamp | label | user_id | amount | merchant_id | trans_type | foreign | prev_user_id | prev_timestamp | interarrival |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1581668911 | legitimate | 0 | 21.95 | 18943 | chip_and_pin | False | NaN | NaN | NaN |
| 1 | 1581678337 | legitimate | 0 | 13.06 | 12373 | online | False | 0.0 | 1.581669e+09 | 9426.0 |
| 2 | 1581687106 | legitimate | 0 | 85.17 | 2235 | online | False | 0.0 | 1.581678e+09 | 8769.0 |
| 3 | 1581695596 | legitimate | 0 | 12.73 | 3257 | manual | False | 0.0 | 1.581687e+09 | 8490.0 |
| 4 | 1581704538 | legitimate | 0 | 84.85 | 12362 | chip_and_pin | False | 0.0 | 1.581696e+09 | 8942.0 |
| 5 | 1581713742 | legitimate | 0 | 164.92 | 17530 | chip_and_pin | False | 0.0 | 1.581705e+09 | 9204.0 |
| 6 | 1581721836 | legitimate | 0 | 13.97 | 12373 | swipe | False | 0.0 | 1.581714e+09 | 8094.0 |
| 7 | 1581764483 | legitimate | 0 | 16.34 | 1730 | contactless | False | 0.0 | 1.581722e+09 | 42647.0 |
| 8 | 1581772625 | legitimate | 0 | 12.74 | 384 | chip_and_pin | False | 0.0 | 1.581764e+09 | 8142.0 |
| 9 | 1581782097 | legitimate | 0 | 18.40 | 13413 | online | True | 0.0 | 1.581773e+09 | 9472.0 |
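
The same per-user gap can probably be computed more compactly with `groupby(...).diff()`. The sketch below is not the notebook's own code; it runs on a small made-up frame (`toy`, with invented values) and assumes only the `user_id` and `timestamp` column names used above.

```python
import pandas as pd

# Toy data, invented for illustration: two users, unsorted timestamps.
toy = pd.DataFrame({
    "user_id":   [1, 0, 0, 1, 0],
    "timestamp": [50, 10, 25, 80, 45],
})

# Same ordering as the notebook: by user, then by time.
toy = toy.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# diff() within each user group subtracts that user's previous timestamp.
# A user's first transaction has no predecessor, so its value is NaN.
toy["interarrival"] = toy.groupby("user_id")["timestamp"].diff()

print(toy)
```

This form never compares a row with a different user's row, so it needs no temporary `prev_*` columns.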


## Train/test split

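We're using time-series data, so we'll split based on time: transactions in the first 70% of the observed time span go to the training set, and later transactions go to the test set.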

In [53]:
first = df['timestamp'].min()
last = df['timestamp'].max()
cutoff = first + ((last - first) * 0.7)

In [62]:
train = df[df['timestamp'] <= cutoff]
len(train)

16055561

In [63]:
test = df[df['timestamp'] > cutoff]
len(test)

6890570

In [64]:
len(train) / (len(train) + len(test))

0.6997066738614889
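
The cutoff above is placed 70% of the way through the time span, not at the 70th percentile of rows, so the share of rows in the training set only matches 0.7 when transactions arrive at a roughly even rate. As a minimal sketch, the split can be wrapped in a helper and checked on made-up data; the function name `split_by_time` and its `frac` and `column` parameters are hypothetical, not part of the notebook.

```python
import pandas as pd

def split_by_time(df, frac=0.7, column="timestamp"):
    """Split df at the point `frac` of the way through its time span."""
    first = df[column].min()
    last = df[column].max()
    cutoff = first + (last - first) * frac
    train = df[df[column] <= cutoff]
    test = df[df[column] > cutoff]
    return train, test, cutoff

# Toy data, invented for illustration: most rows fall early in the span.
toy = pd.DataFrame({"timestamp": [0, 1, 2, 3, 4, 5, 6, 7, 90, 100]})
train, test, cutoff = split_by_time(toy)

# Every training row precedes every test row, and no row is lost.
assert train["timestamp"].max() <= cutoff < test["timestamp"].min()
assert len(train) + len(test) == len(toy)

# Share of rows that landed in the training set.
print(len(train) / len(toy))
```

With this uneven toy data the training set holds more than 70% of the rows even though the cutoff sits at 70% of the time span. If an exact row fraction matters, one option is to take the cutoff from `df['timestamp'].quantile(0.7)` instead.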