# Transaction Data Simulator
This notebook simulates transaction data to demonstrate the [Transaction Data Simulator](). The data is generated using the [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/code/kerneler/starter-synthetic-jane-street-dataset-d7c7c87e-f) dataset from Kaggle and will consist of over 500,000,000 transactions with 30 features. The data will be saved to a CSV file and used in the [Credit Card Fraud Detection System]().



## Transaction features
- TRANSACTION ID: Unique transaction identifier
- TRANSACTION TIME: Time of transaction 
- TRANSACTION AMOUNT: Amount of transaction
- TRANSACTION TYPE: Type of transaction
- TERMINAL ID: Identifier for each unique terminal
- MERCHANT ID: Identifier for each unique merchant
- CUSTOMER ID: Identifier for each unique customer
- TRANSACTION STATUS (approved, declined, etc.): Status of transaction
- TRANSACTION CATEGORY (groceries, gas, etc.): Category of transaction
- TRANSACTION DESCRIPTION: Description of transaction
- TRANSACTION LOCATION (city, state, etc.): Location of transaction
- TRANSACTION FRAUDULENT (yes, no): Whether the transaction is fraudulent, encoded as a binary value (1, 0)


- TRANSACTION DATE (month, day, year): Date of transaction
- TRANSACTION TIME (hour, minute, second): Time of transaction
- TRANSACTION DAY OF WEEK: Day of week of transaction
- TRANSACTION HOUR OF DAY: Hour of day of transaction
- TRANSACTION WEEK OF YEAR: Week of year of transaction
- TRANSACTION MONTH OF YEAR: Month of year of transaction
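
The date-related features in the second group can all be derived from a single timestamp column. The following is a minimal sketch, assuming the timestamp is stored in a column named `TX_DATETIME` (the name used in the configuration parameters); the helper name `add_calendar_features` is hypothetical and not part of the simulator.

```python
import pandas as pd

def add_calendar_features(df, datetime_col="TX_DATETIME"):
    """Return a copy of df with calendar features derived from datetime_col.

    Hypothetical helper: the output column names follow the TX_* names
    listed in the configuration parameters.
    """
    out = df.copy()
    dt = pd.to_datetime(out[datetime_col])
    out["TX_DATE"] = dt.dt.date
    out["TX_TIME"] = dt.dt.time
    out["TX_DAY_OF_WEEK"] = dt.dt.day_name().str.upper()
    out["TX_HOUR_OF_DAY"] = dt.dt.hour
    # ISO week number (weeks start on Monday)
    out["TX_WEEK_OF_YEAR"] = dt.dt.isocalendar().week.astype(int)
    out["TX_MONTH_OF_YEAR"] = dt.dt.month
    return out

example = pd.DataFrame(
    {"TX_DATETIME": ["2020-01-01 00:00:00", "2020-03-15 13:45:10"]}
)
features = add_calendar_features(example)
print(features[["TX_DAY_OF_WEEK", "TX_HOUR_OF_DAY", "TX_MONTH_OF_YEAR"]])
```

The week number uses the ISO calendar, so the first days of January can fall in week 52 or 53 of the previous year.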

### CONFIGURATION PARAMETERS
TRANSACTION_ID, TX_DATETIME, CUSTOMER_ID, TERMINAL_ID, TX_AMOUNT, and TX_FRAUD are the only required parameters. The rest are optional and will be generated if not provided. The default values are shown below. 

- TRANSACTION_ID: 1
- TX_DATETIME: 2020-01-01 00:00:00
- CUSTOMER_ID: 1
- TERMINAL_ID: 1
- TX_AMOUNT: 1.00
- TX_FRAUD: 0
- TX_TYPE: ['PURCHASE', 'CASH WITHDRAWAL', 'CASH DEPOSIT', 'TRANSFER', 'BILL PAYMENT', 'REVERSAL']
- TX_STATUS: ['APPROVED', 'DECLINED', 'PENDING', 'REVERSED']
- TX_CATEGORY: ['GROCERIES', 'GAS', 'RESTAURANT', 'ENTERTAINMENT', 'TRAVEL', 'OTHER']
- TX_DESCRIPTION: ['GROCERIES', 'GAS', 'RESTAURANT', 'ENTERTAINMENT', 'TRAVEL', 'OTHER']
- TX_LOCATION: ['CITY', 'STATE', 'COUNTRY']
- MERCHANT_ID: 1
- TX_DATE: 2020-01-01
- TX_TIME: 00:00:00
- TX_DAY_OF_WEEK: ['MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY', 'SUNDAY']
- TX_HOUR_OF_DAY: 0
- TX_WEEK_OF_YEAR: 1
- TX_MONTH_OF_YEAR: 1
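
One possible way to handle the required/optional split is sketched below. It assumes that a transaction is held as a plain dictionary and that missing categorical fields are drawn uniformly from the lists above; the function `complete_record` and both constants are hypothetical names, not part of the simulator.

```python
import random

# Required parameters, as listed above
REQUIRED_FIELDS = ["TRANSACTION_ID", "TX_DATETIME", "CUSTOMER_ID",
                   "TERMINAL_ID", "TX_AMOUNT", "TX_FRAUD"]

# Optional categorical parameters and their possible values (from the list above)
OPTIONAL_CHOICES = {
    "TX_TYPE": ["PURCHASE", "CASH WITHDRAWAL", "CASH DEPOSIT",
                "TRANSFER", "BILL PAYMENT", "REVERSAL"],
    "TX_STATUS": ["APPROVED", "DECLINED", "PENDING", "REVERSED"],
    "TX_CATEGORY": ["GROCERIES", "GAS", "RESTAURANT",
                    "ENTERTAINMENT", "TRAVEL", "OTHER"],
}

def complete_record(record, rng=None):
    """Check the required fields and fill in missing optional ones.

    Assumption: missing categorical fields are drawn uniformly at random.
    """
    rng = rng or random.Random(0)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")
    completed = dict(record)
    for name, choices in OPTIONAL_CHOICES.items():
        if name not in completed:
            completed[name] = rng.choice(choices)
    return completed

record = {"TRANSACTION_ID": 1, "TX_DATETIME": "2020-01-01 00:00:00",
          "CUSTOMER_ID": 1, "TERMINAL_ID": 1, "TX_AMOUNT": 1.00,
          "TX_FRAUD": 0, "TX_STATUS": "APPROVED"}
completed = complete_record(record)
print(completed)
```

Values that are already provided, such as `TX_STATUS` in the example, are left untouched.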

### Transaction Data Simulator
The Transaction Data Simulator is a Python class that generates transaction data. The class is initialized with the following parameters:
  
- transaction_id: Unique transaction identifier
- tx_datetime: Time of transaction
- customer_id: Identifier for each unique customer
- terminal_id: Identifier for each unique terminal
- tx_amount: Amount of transaction
- tx_fraud: Whether the transaction is fraudulent, encoded as a binary value (1, 0)
- tx_type: Type of transaction
- tx_status: Status of transaction
- tx_category: Category of transaction
- tx_description: Description of transaction
- tx_location: Location of transaction
- merchant_id: Identifier for each unique merchant
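
The parameter list above could map onto a class along the following lines. This is only a sketch: the class name `TransactionRecord`, the use of a dataclass, and the `to_row` method are assumptions made for illustration, and the simulator class itself is not shown in this notebook.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional

@dataclass
class TransactionRecord:
    """Hypothetical container for the parameters listed above."""
    # Required parameters
    transaction_id: int
    tx_datetime: datetime
    customer_id: int
    terminal_id: int
    tx_amount: float
    tx_fraud: int
    # Optional parameters (None means "not provided")
    tx_type: Optional[str] = None
    tx_status: Optional[str] = None
    tx_category: Optional[str] = None
    tx_description: Optional[str] = None
    tx_location: Optional[str] = None
    merchant_id: Optional[int] = None

    def to_row(self):
        """Return the record as a dict with upper-case column names."""
        return {name.upper(): value for name, value in asdict(self).items()}

record = TransactionRecord(
    transaction_id=1,
    tx_datetime=datetime(2020, 1, 1),
    customer_id=1,
    terminal_id=1,
    tx_amount=1.00,
    tx_fraud=0,
)
print(record.to_row())
```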

In [None]:
"""
TRANSACTION_ID, TX_DATETIME, CUSTOMER_ID, TERMINAL_ID, TX_AMOUNT, and TX_FRAUD
"""
# Necessary imports for this notebook
import os

import numpy as np
import pandas as pd

import datetime
import time

import random

# For plotting
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid', {'axes.facecolor': '0.9'})

##### SCRATCH WORK

In [None]:
"""
SCRATCH CODE FOR TESTING
"""

# Imports for the scratch experiments
get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import datetime
import time
import os
import sys
import json
import pickle
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score

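# Create features for transaction data
def create_features(df):
    """Add calendar features and make amounts non-negative (modifies df in place).

    Expects a datetime64 'transaction_dttm' column and a numeric 'amount' column.
    """
    # Calendar components extracted with the pandas .dt accessor
    df['hour'] = df['transaction_dttm'].dt.hour
    df['day_of_week'] = df['transaction_dttm'].dt.dayofweek  # 0 = Monday, 6 = Sunday
    df['day'] = df['transaction_dttm'].dt.day
    df['month'] = df['transaction_dttm'].dt.month
    df['year'] = df['transaction_dttm'].dt.year
    # Store absolute values so that all amounts are non-negative
    df['amount'] = df['amount'].abs()
    return df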


### Generate a customer table with the following features:
- CUSTOMER_ID
- CUSTOMER NAME
- X_CUSTOMER_ID, Y_CUSTOMER_ID
- MEAN_AMOUNT, STD_AMOUNT, MIN_AMOUNT, MAX_AMOUNT
- MEAN_NBR_TRX_PER_DAY
- MEAN_NBR_TRX_PER_WEEK
- MEAN_NBR_TRX_PER_MONTH
- MEAN_NBR_TRX_PER_YEAR
- MEAN_NBR_TRX_PER_HOUR

In [None]:
# Generate customer profiles table
def generate_customer_profiles(num_customers):
    customer_profiles = pd.DataFrame()
    # Customer IDs run from 1 to num_customers (inclusive)
    customer_profiles['customer_id'] = range(1, num_customers + 1)
    return customer_profiles

In [None]:
# Establish features for the dataset of synthetic transactions
# TRAN
def create_synthetic_features(df):
    # Placeholder: the body of this function has not been written yet
    raise NotImplementedError

In [None]:
def get_list_terminals_within_radius(customer_profile, x_y_terminals, r):
    
    # Use numpy arrays in the following to speed up computations
    
    # Location (x,y) of customer as numpy array
    x_y_customer = customer_profile[['x_customer_id','y_customer_id']].values.astype(float)
    
    # Squared difference in coordinates between customer and terminal locations
    squared_diff_x_y = np.square(x_y_customer - x_y_terminals)
    
    # Sum along rows and compute the square root to get the distance
    dist_x_y = np.sqrt(np.sum(squared_diff_x_y, axis=1))
    
    # Get the indices of terminals which are at a distance less than r
    available_terminals = list(np.where(dist_x_y<r)[0])
    
    # Return the list of terminal IDs
    return available_terminals
    

In [None]:
# We first get the geographical locations of all terminals as a numpy array
# (terminal_profiles_table and customer_profiles_table must already be defined)
x_y_terminals = terminal_profiles_table[['x_terminal_id','y_terminal_id']].values.astype(float)
# And get the list of terminals within a radius of 50 of the customer at index 4
get_list_terminals_within_radius(customer_profiles_table.iloc[4], x_y_terminals=x_y_terminals, r=50)
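
To check the distance logic by hand, the function can be run on a few hand-picked coordinates. The sketch below repeats the function in compact form so that it runs on its own; the customer and terminal coordinates are made up for illustration. The terminals sit at distances 5, 50, and 100 from the customer.

```python
import numpy as np
import pandas as pd

def get_list_terminals_within_radius(customer_profile, x_y_terminals, r):
    # Compact copy of the function defined above
    x_y_customer = customer_profile[['x_customer_id', 'y_customer_id']].values.astype(float)
    dist_x_y = np.sqrt(np.sum(np.square(x_y_customer - x_y_terminals), axis=1))
    return list(np.where(dist_x_y < r)[0])

# Made-up customer located at the origin
toy_customer = pd.Series({'CUSTOMER_ID': 0, 'x_customer_id': 0.0, 'y_customer_id': 0.0})

# Made-up terminals at distances 5, 50 and 100 from the customer
toy_terminals = np.array([[3.0, 4.0], [30.0, 40.0], [60.0, 80.0]])

within_50 = get_list_terminals_within_radius(toy_customer, toy_terminals, r=50)
within_51 = get_list_terminals_within_radius(toy_customer, toy_terminals, r=51)
print(within_50, within_51)
```

Because the comparison is strict (`dist_x_y < r`), the terminal at a distance of exactly 50 is excluded for `r=50` and only included once the radius is larger.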

## Transaction Generation Process
The simulation process will consist of 5 main steps:
 1. **Generate a customer profile**: Every customer is different in their spending habits. This will be simulated by defining some properties for each customer. The main properties will be their geographical location, their spending frequency, and their spending amounts. The customer properties will be represented as a table, referred to as the customer profile table.
 2. **Generation of terminal profiles**: Each terminal is characterized by its geographical location. The terminal properties will be represented as a table, referred to as the terminal profile table.
 3. **Relationship of customers to terminals**: The customer profile table will be joined to the terminal profile table to create a customer terminal relationship table. This table will be used to determine the location of each transaction.
 4. **Generation of transaction data**: The customer terminal relationship table will be used to generate the transaction data. The transaction data will be generated by randomly selecting a customer and terminal from the customer terminal relationship table. The transaction amount will be randomly selected from a normal distribution with a mean and standard deviation based on the customer profile. The transaction date and time will be randomly selected from a normal distribution with a mean and standard deviation based on the customer profile. The transaction type, status, category, description, and location will be randomly selected from a list of possible values.
 5. **Generation of fraud data**: The transaction data will be used to generate fraud data. The fraud data will be generated by randomly selecting a transaction from the transaction data and changing the transaction status to 'REVERSED' and the transaction fraudulent flag to '1'.
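
Steps 4 and 5 can be sketched on a tiny, hand-built relationship table. This is a simplified illustration under several assumptions: the relationship table has one row per (customer, terminal) pair with the customer's `mean_amount` and `std_amount` attached, negative amount draws are folded with the absolute value, the share of fraudulent transactions is set by a hypothetical `fraud_rate` parameter, and the date, time, and categorical fields are left out. The function name `generate_toy_transactions` is also hypothetical.

```python
import numpy as np
import pandas as pd

def generate_toy_transactions(customer_terminals, n_transactions,
                              fraud_rate=0.1, random_state=0):
    rng = np.random.default_rng(random_state)

    # Step 4: pick (customer, terminal) pairs at random, with replacement
    picks = customer_terminals.sample(
        n=n_transactions, replace=True, random_state=random_state
    ).reset_index(drop=True)

    # Draw each amount from a normal distribution using the customer's
    # mean and standard deviation; abs() keeps amounts non-negative
    amounts = np.abs(rng.normal(picks['mean_amount'].to_numpy(),
                                picks['std_amount'].to_numpy()))

    transactions = pd.DataFrame({
        'TRANSACTION_ID': np.arange(n_transactions),
        'CUSTOMER_ID': picks['CUSTOMER_ID'],
        'TERMINAL_ID': picks['TERMINAL_ID'],
        'TX_AMOUNT': np.round(amounts, 2),
        'TX_STATUS': 'APPROVED',
        'TX_FRAUD': 0,
    })

    # Step 5: pick transactions at random and mark them as fraudulent
    n_fraud = int(round(fraud_rate * n_transactions))
    fraud_rows = rng.choice(n_transactions, size=n_fraud, replace=False)
    transactions.loc[fraud_rows, 'TX_STATUS'] = 'REVERSED'
    transactions.loc[fraud_rows, 'TX_FRAUD'] = 1
    return transactions

# Made-up relationship table: customer 0 can reach terminals 2 and 5,
# customer 1 can reach terminal 5
toy_relationships = pd.DataFrame({
    'CUSTOMER_ID': [0, 0, 1],
    'TERMINAL_ID': [2, 5, 5],
    'mean_amount': [20.0, 20.0, 60.0],
    'std_amount': [10.0, 10.0, 30.0],
})

toy_transactions = generate_toy_transactions(toy_relationships, n_transactions=20)
print(toy_transactions.head())
```

Sampling rows of the relationship table uniformly means that customers with more reachable terminals appear more often; weighting by `mean_nb_tx_per_day` would be one way to refine this.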

In [None]:
def generate_customer_profiles_table(n_customers, random_state=0):
    
    np.random.seed(random_state)
        
    customer_id_properties=[]
    
    # Generate customer properties from random distributions 
    for customer_id in range(n_customers):
        
        x_customer_id = np.random.uniform(0,100)
        y_customer_id = np.random.uniform(0,100)
        
        mean_amount = np.random.uniform(5,100) # Arbitrary (but sensible) value 
        std_amount = mean_amount/2 # Arbitrary (but sensible) value
        
        mean_nb_tx_per_day = np.random.uniform(0,4) # Arbitrary (but sensible) value 
        
        customer_id_properties.append([customer_id,
                                      x_customer_id, y_customer_id,
                                      mean_amount, std_amount,
                                      mean_nb_tx_per_day])
        
    customer_profiles_table = pd.DataFrame(customer_id_properties, columns=['CUSTOMER_ID',
                                                                      'x_customer_id', 'y_customer_id',
                                                                      'mean_amount', 'std_amount',
                                                                      'mean_nb_tx_per_day'])
    
    return customer_profiles_table
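
The terminal profile table from step 2 could be generated in the same way. The sketch below is an assumption modeled on the customer function above: the function name `generate_terminal_profiles_table` is hypothetical, the column names `x_terminal_id` and `y_terminal_id` match those used in the radius cell, and terminals are placed uniformly on the same 100 x 100 grid as customers. It uses a local `np.random.default_rng` generator rather than `np.random.seed`, so it does not touch NumPy's global random state.

```python
import numpy as np
import pandas as pd

def generate_terminal_profiles_table(n_terminals, random_state=0):
    # Local random generator, seeded for reproducibility
    rng = np.random.default_rng(random_state)

    terminal_profiles_table = pd.DataFrame({
        'TERMINAL_ID': np.arange(n_terminals),
        # Assumption: terminals are spread uniformly over a 100 x 100 grid
        'x_terminal_id': rng.uniform(0, 100, size=n_terminals),
        'y_terminal_id': rng.uniform(0, 100, size=n_terminals),
    })
    return terminal_profiles_table

terminal_profiles_table = generate_terminal_profiles_table(n_terminals=5, random_state=0)
print(terminal_profiles_table)
```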