### Goals
Classify whether a transaction is fraudulent.


### Data Description 
| Features  | Description |
| ------------- | ------------- |
| step  | Maps a unit of time in the real world: 1 step is 1 hour. There are 744 steps in total (described as a 30-day simulation, although 744 hours is 31 days) |
| type | Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER |
| amount | amount of the transaction in local currency  |
| nameOrig  | customer who started the transaction  | 
| oldbalanceOrg  |initial balance before the transaction |
| newbalanceOrig  | new balance after the transaction |
| nameDest  |customer who is the recipient of the transaction |
| oldbalanceDest | initial balance of the recipient before the transaction. No information is available for recipients whose names start with M (merchants) |
| newbalanceDest | new balance of the recipient after the transaction. No information is available for recipients whose names start with M (merchants) |
| isFraud  | Marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying them: they transfer the funds to another account and then cash out of the system |
| isFlaggedFraud  | The business model aims to control massive transfers from one account to another and flags illegal attempts. In this dataset, an illegal attempt is an attempt to transfer more than 200,000 in a single transaction |
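
To make the `step` and `isFlaggedFraud` descriptions concrete, here is a minimal sketch on a tiny hand-made frame. It assumes that steps are numbered from 1 and that the flag rule is exactly "a TRANSFER with an amount above 200,000", as the table states. The helper names `add_time_features` and `flag_large_transfers` and the columns `day`, `hour` and `flag_rule` are hypothetical and are not part of the dataset or of this notebook.

```python
import pandas as pd

# Threshold quoted in the isFlaggedFraud description above.
FLAG_THRESHOLD = 200_000


def add_time_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Derive day and hour-of-day from `step` (assumes step 1 is the first hour)."""
    out = frame.copy()
    out["day"] = (out["step"] - 1) // 24 + 1   # 1-based day of the simulation
    out["hour"] = (out["step"] - 1) % 24       # 0-23 hour within that day
    return out


def flag_large_transfers(frame: pd.DataFrame) -> pd.Series:
    """Apply the rule as described: a TRANSFER above the threshold gets flag 1."""
    is_transfer = frame["type"] == "TRANSFER"
    return (is_transfer & (frame["amount"] > FLAG_THRESHOLD)).astype(int)


# Hand-made rows for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "step": [1, 24, 25, 49],
    "type": ["PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 250_000.0, 181.0, 500_000.0],
})
toy = add_time_features(toy)
toy["flag_rule"] = flag_large_transfers(toy)
print(toy)
```

The real `isFlaggedFraud` column may depend on additional conditions, so treat `flag_rule` as an illustration of the written description rather than a reproduction of that column.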

In [1]:
# Common necessary libraries
# Data processing
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import sidetable as stb

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
df = pd.read_csv(r'D:\PURWADHIKA\MODUL 03\DATASET\syntetic_fraud.csv')
df.head()

|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [4]:
# Check for missing values in each column
df.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [5]:
# Check the class balance of the target (percentage of rows per class)
((df['isFraud'].value_counts())/len(df))*100

0    99.870918
1     0.129082
Name: isFraud, dtype: float64

### Stratified Sampling

Because the dataset is very large (about 6.36 million rows), I used stratified sampling to draw a smaller sample of 49,794 rows that keeps the original proportion of both classes (fraud and non-fraud).
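
As a small illustration of why stratification matters with a rare class, the sketch below draws a stratified sample from a synthetic frame and compares the positive rate before and after sampling. The synthetic data, the 1% positive rate and the variable names `toy`, `sample`, `full_rate` and `sample_rate` are assumptions made for this example only; it is not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows with 1% positives.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=10_000),
    "isFraud": np.r_[np.ones(100, dtype=int), np.zeros(9_900, dtype=int)],
})

# Keep 1,000 rows as the "test" part of the split; stratifying on the label
# makes each class keep (approximately) its share of the sample.
_, sample = train_test_split(
    toy, test_size=1_000, stratify=toy["isFraud"], random_state=42
)

full_rate = toy["isFraud"].mean()
sample_rate = sample["isFraud"].mean()
print(f"full: {full_rate:.4f}  sample: {sample_rate:.4f}")
```

Passing an integer `test_size` asks for an absolute number of rows, which is how a fixed-size sample can be requested through `train_test_split`.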

In [6]:
from sklearn.model_selection import train_test_split
X = df.drop(columns='isFraud')
y = df['isFraud']

X_unselected, X_selected, y_unselected, y_selected = train_test_split(X, y, stratify=y, random_state=42, test_size=49794)

In [7]:
data = X_selected.copy()
data['isFraud'] = y_selected

In [8]:
data.shape

(49794, 11)

In [9]:
data.head()

|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFlaggedFraud | isFraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 491589 | 19 | CASH_OUT | 72872.51 | C2043380801 | 189.0 | 0.0 | C1384249206 | 676319.22 | 493973.74 | 0 | 0 |
| 2593402 | 207 | PAYMENT | 3200.11 | C426535344 | 0.0 | 0.0 | M724865366 | 0.0 | 0.0 | 0 | 0 |
| 3595023 | 262 | CASH_OUT | 73968.59 | C1280271806 | 202404.0 | 128435.41 | C652886948 | 5660.77 | 79629.36 | 0 | 0 |
| 3256215 | 251 | CASH_IN | 1071.73 | C1331515829 | 32106.0 | 33177.73 | C88206578 | 261916.77 | 260845.04 | 0 | 0 |
| 4553717 | 327 | CASH_IN | 11680.07 | C150656696 | 6957081.08 | 6968761.15 | C807268977 | 18094.63 | 6414.56 | 0 | 0 |


In [10]:
# Check the class balance of the target after stratified sampling
((data['isFraud'].value_counts())/len(data))*100

0    99.87147
1     0.12853
Name: isFraud, dtype: float64

In [11]:
# Export the sampled DataFrame to CSV (commented out to avoid overwriting the file)
# data.to_csv(r'D:\syntetic_fraud_sample.csv', index = False)