# Tabular Synthetic data generation - Credit Fraud transactions

## Credit Fraud - A highly imbalanced dataset
This example uses the Kaggle ["Credit Card Fraud detection"](https://www.kaggle.com/mlg-ulb/creditcardfraud) dataset, because for demonstration purposes we can only use datasets from the public domain.
The dataset contains labeled transactions from European credit card holders. The features provided are the result of a dimensionality reduction: 28 continuous features (`V1` to `V28`), plus a `Time` column holding the number of seconds elapsed between each transaction and the first transaction in the dataset, the transaction `Amount`, and the `Class` label (1 for fraud, 0 otherwise).
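
To make "highly imbalanced" concrete, the sketch below measures the class distribution of a label column with pandas. The `toy` frame and its values are invented for illustration and are not taken from the real dataset; only the `Class` column name and its 0/1 encoding come from this notebook.

```python
import pandas as pd

# Toy stand-in for the dataset: Class is 1 for fraud, 0 for a legitimate transaction.
# All values here are illustrative only.
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "Class":  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})

# Absolute and relative frequency of each class.
class_counts = toy["Class"].value_counts()
class_share = toy["Class"].value_counts(normalize=True)

# Ratio of majority to minority class; large values indicate strong imbalance.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts.to_dict())
print(class_share.to_dict())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

Running the same three computations on the real `Class` column (after converting the dataset to pandas) gives the actual fraud share.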

In [34]:
### Import the required packages
import os

from ydata.connectors import GCSConnector, LocalConnector
from ydata.connectors.filetype import FileType
from ydata.utils.formats import read_json

from ydata.metadata import Metadata
from ydata.utils.data_types import DataType

import json
import pickle as pkl
import pandas as pd

# Create the output directory for this pipeline step
try:
    os.mkdir('outputs')
except FileExistsError:
    print('Directory already exists')

#dataset_path = os.environ['DATASET_PATH']

Directory already exists
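
The directory-creation step can also be written without a `try`/`except`. The sketch below uses `os.makedirs` with `exist_ok=True`, which is idempotent. It runs inside a temporary directory (an assumption made here so the example leaves nothing behind); in the notebook the target would simply be `'outputs'`.

```python
import os
import tempfile

# Work inside a temporary directory so the sketch has no side effects.
with tempfile.TemporaryDirectory() as workdir:
    outputs_dir = os.path.join(workdir, "outputs")

    # exist_ok=True means no error is raised if the directory already exists.
    os.makedirs(outputs_dir, exist_ok=True)
    os.makedirs(outputs_dir, exist_ok=True)  # second call is a no-op

    created = os.path.isdir(outputs_dir)

print(created)
```

The trade-off is that this variant no longer prints a message when the directory is already present.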


In [35]:
#Reading the credentials for Google Cloud Storage
keyfile_dict = read_json('credentials/gcs_credentials.json')
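
For readers without the `ydata` utilities at hand, here is a minimal standard-library sketch of loading a JSON key file into a dictionary. It assumes that `read_json` does essentially this, which the notebook does not confirm. `load_keyfile` is a hypothetical helper, and the file content is a placeholder written to a temporary directory.

```python
import json
import os
import tempfile


def load_keyfile(path):
    """Parse a JSON key file into a dict (hypothetical helper, not part of ydata)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Placeholder content; a real key file holds secrets and should never be committed.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "gcs_credentials.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump({"type": "service_account", "project_id": "demo-project"}, fh)

    demo_keyfile = load_keyfile(demo_path)

print(sorted(demo_keyfile))
```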

#### Reading the Credit Fraud dataset from remote storage

In [36]:
gcs_connector = GCSConnector(project_id='ydatasynthetic',
                             keyfile_dict=keyfile_dict)

#Get the credit fraud dataset using ydata's GCS connector
data = gcs_connector.read_file('gs://ydata_testdata/tabular/credit_fraud/data.csv', file_type=FileType.CSV)


+---------+----------------+----------------+----------------+
| Package | client         | scheduler      | workers        |
+---------+----------------+----------------+----------------+
| python  | 3.7.11.final.0 | 3.7.10.final.0 | 3.7.10.final.0 |
+---------+----------------+----------------+----------------+


INFO: 2022-02-20 23:22:41,571 [CONNECTOR] - Init data types inference.
INFO: 2022-02-20 23:22:43,520 [CONNECTOR] - Data types infered.


In [37]:
data.head(15)

| Unnamed: 0 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
| 5 | 2 | -0.425966 | 0.960523 | 1.141109 | -0.168252 | 0.420987 | -0.029728 | 0.476201 | 0.260314 | -0.568671 | ... | -0.208254 | -0.559825 | -0.026398 | -0.371427 | -0.232794 | 0.105915 | 0.253844 | 0.08108 | 3.67 | 0 |
| 8 | 7 | -0.894286 | 0.286157 | -0.113192 | -0.271526 | 2.669599 | 3.721818 | 0.370145 | 0.851084 | -0.392048 | ... | -0.073425 | -0.268092 | -0.204233 | 1.011592 | 0.373205 | -0.384157 | 0.011747 | 0.142404 | 93.2 | 0 |
| 9 | 9 | -0.338262 | 1.119593 | 1.044367 | -0.222187 | 0.499361 | -0.246761 | 0.651583 | 0.069539 | -0.736727 | ... | -0.246914 | -0.633753 | -0.120794 | -0.38505 | -0.069733 | 0.094199 | 0.246219 | 0.083076 | 3.68 | 0 |
| 11 | 10 | 0.384978 | 0.616109 | -0.8743 | -0.094019 | 2.924584 | 3.317027 | 0.470455 | 0.538247 | -0.558895 | ... | 0.049924 | 0.238422 | 0.00913 | 0.99671 | -0.767315 | -0.492208 | 0.042472 | -0.054337 | 9.99 | 0 |
| 12 | 10 | 1.249999 | -1.221637 | 0.38393 | -1.234899 | -1.485419 | -0.75323 | -0.689405 | -0.227487 | -2.094011 | ... | -0.231809 | -0.483285 | 0.084668 | 0.392831 | 0.161135 | -0.35499 | 0.026416 | 0.042422 | 121.5 | 0 |
| 13 | 11 | 1.069374 | 0.287722 | 0.828613 | 2.71252 | -0.178398 | 0.337544 | -0.096717 | 0.115982 | -0.221083 | ... | -0.036876 | 0.074412 | -0.071407 | 0.104744 | 0.548265 | 0.104094 | 0.021491 | 0.021293 | 27.5 | 0 |


### Metadata calculation

In [38]:
meta = Metadata()
print(meta(data))



### Splitting the dataset based on the Class column

In [39]:
# The Dataset object runs on a Dask engine; if you prefer the pandas interface, convert it to a pandas DataFrame
data = data.to_pandas()

In [43]:
#Creating the fraud and non-fraud datasets
fraud = data[data['Class']==1]
non_fraud = data[data['Class']==0]
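
A quick sanity check after a split like this is to confirm that every row landed in exactly one subset. The sketch below does so on a small invented frame (`toy`, with illustrative values); the same checks would apply to `fraud`, `non_fraud`, and `data`.

```python
import pandas as pd

# Toy frame with the same label convention: Class == 1 is fraud. Values are illustrative.
toy = pd.DataFrame({
    "Amount": [12.0, 250.0, 7.5, 40.0, 999.0, 3.2],
    "Class":  [0, 1, 0, 0, 1, 0],
})

# Boolean-mask split, mirroring the cell above.
fraud_toy = toy[toy["Class"] == 1]
non_fraud_toy = toy[toy["Class"] == 0]

# The subsets must partition the data: sizes add up and no index is shared.
sizes_add_up = len(fraud_toy) + len(non_fraud_toy) == len(toy)
no_overlap = fraud_toy.index.intersection(non_fraud_toy.index).empty

fraud_share = len(fraud_toy) / len(toy)
print(sizes_add_up, no_overlap, round(fraud_share, 3))
```

If the label column contained missing or unexpected values, the size check would fail, because those rows match neither mask.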

## Setting the pipeline step outputs

In [48]:
from ydata.connectors import LocalConnector

# Write the fraud, non-fraud, and full datasets to the outputs directory
conn = LocalConnector()
conn.write_file(fraud, path='outputs/fraud.csv', file_type=FileType.CSV)

conn.write_file(non_fraud, path='outputs/non_fraud.csv', file_type=FileType.CSV)

conn.write_file(data, path='outputs/data.csv', file_type=FileType.CSV)
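
To check that a written file round-trips cleanly, one option is to read it back and compare it with the in-memory frame. The sketch below uses plain pandas (`to_csv` and `read_csv`) as a stand-in for `LocalConnector`, whose exact write behavior is not shown in this notebook; the frame and the temporary path are illustrative.

```python
import os
import tempfile

import pandas as pd

# Illustrative frame standing in for one of the step outputs.
toy_fraud = pd.DataFrame({"Amount": [10.0, 25.5], "Class": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "fraud.csv")

    # index=False avoids writing the row index as an extra unnamed column.
    toy_fraud.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

# True when values, column names, and dtypes all match.
round_trip_ok = restored.equals(toy_fraud)
print(round_trip_ok, list(restored.columns))
```

An `Unnamed: 0` column like the one in the preview earlier in this notebook typically appears when a CSV was saved with its index and then read back without `index_col`, so it may be worth checking how the outputs are written.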