# Generating synthetic payments data

In this notebook, we'll build a very simple simulator that generates payments data for legitimate and fraudulent transactions.  (There are many ways you could improve this generator, and we'll call some of them out.)  We'll start with some general-purpose functionality for running simulations.

## An (extremely) basic discrete-event simulation framework

The next function is all you need to run simple discrete-event simulations.  Here's how to use it:

- you'll define several streams of events, each of which is modeled by a Python generator,
- each event stream generator will `yield` a tuple consisting of *an offset* (the amount of time that has passed since the previous event in that stream) and *a result* (an arbitrary Python value associated with the event),
- the generator returned by the `simulate` function will merge all of the event streams and yield `(timestamp, result)` tuples in timestamp order, indefinitely.

In [1]:
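import heapq
import itertools

def simulate(event_generators):
    # Heap entries are (timestamp, tiebreaker, result, stream).  The unique
    # tiebreaker keeps heapq from comparing results (which may be dicts)
    # or generators when two events share a timestamp.
    tiebreaker = itertools.count()
    pq = []
    for event in event_generators:
        offset, result = next(event)
        heapq.heappush(pq, (offset, next(tiebreaker), result, event))

    while True:
        # Pop the earliest event, then schedule that stream's next event.
        timestamp, _, result, event = heapq.heappop(pq)
        offset, next_result = next(event)
        heapq.heappush(pq, (timestamp + offset, next(tiebreaker), next_result, event))
        yield (timestamp, result)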

It may be easier to see how this works with an example.  In the next three cells, we

1. define a generator for event streams, which samples interarrival times from a Poisson distribution and returns a predefined string as the result of each event,
2. set up a simulation with four streams, each with a different mean interarrival time and a different name, and
3. take the first twenty events from the simulation.

In [2]:
from scipy import stats

def bedrockstream(mu, name):
    while True:
        offset, = stats.poisson.rvs(mu, size=1)
        yield (offset, name)

In [3]:
sim = simulate([bedrockstream(10, "fred"), 
                bedrockstream(12, "betty"), 
                bedrockstream(20, "wilma"), 
                bedrockstream(35, "barney")])

In [4]:
for i in range(20):
    print(next(sim))

(9, 'fred')
(14, 'betty')
(17, 'fred')
(18, 'wilma')
(23, 'betty')
(24, 'fred')
(30, 'wilma')
(33, 'fred')
(34, 'betty')
(37, 'barney')
(44, 'fred')
(47, 'betty')
(49, 'wilma')
(62, 'fred')
(65, 'betty')
(73, 'wilma')
(74, 'fred')
(79, 'barney')
(80, 'betty')
(82, 'fred')
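
Because the Poisson offsets above are random, the interleaving is hard to check by hand.  The sketch below uses a hypothetical fixed-interval stream (`ticker`) and converts offsets to absolute timestamps itself (`absolute_times`, also a name made up for this sketch).  It does not call `simulate`; it uses the standard library's `heapq.merge` as a cross-check, which should give the same ordering as `simulate` for streams like these whose timestamps do not tie.

```python
import heapq
import itertools

def ticker(interval, name):
    """Hypothetical fixed-interval stream: yields (offset, result) forever."""
    while True:
        yield (interval, name)

def absolute_times(stream):
    """Turn (offset, result) pairs into (timestamp, result) pairs."""
    timestamp = 0
    for offset, result in stream:
        timestamp += offset
        yield (timestamp, result)

# Each input is already sorted by timestamp, so heapq.merge can
# interleave them lazily, even though both streams are infinite.
merged = heapq.merge(absolute_times(ticker(3, "a")),
                     absolute_times(ticker(5, "b")))
first_events = list(itertools.islice(merged, 6))
print(first_events)
# [(3, 'a'), (5, 'b'), (6, 'a'), (9, 'a'), (10, 'b'), (12, 'a')]
```

Stream `a` fires at 3, 6, 9, 12 and stream `b` at 5 and 10, so each timestamp is the running sum of that stream's offsets, which is the same bookkeeping `simulate` does with `timestamp + offset`.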


## Modeling transactions

The first thing we have to do is decide what data we'll generate for each transaction.  Some interesting possibilities include:

- user ID
- merchant ID
- merchant type
- transaction amount (assuming a single currency)
- card entry mode (e.g., contactless, chip and pin, swipe, card manually keyed, or online transaction)
- foreign transaction (whether or not the user's home country matches the country in which the transaction is taking place)

We'll also generate a label for each transaction (`legitimate` or `fraud`).  We'll start with a very basic user event stream generator:  all of the transactions we generate will be legitimate, and we won't do anything particularly interesting with most of the fields.

In [5]:
import numpy as np
MERCHANT_COUNT = 20000

# a small percentage of merchants account for most transactions
COMMON_MERCHANT_COUNT = MERCHANT_COUNT // 7

common_merchants = np.random.choice(MERCHANT_COUNT, 
                                    size=COMMON_MERCHANT_COUNT, 
                                    replace=True)

def basic_user_stream(user_id, mu):
    favorite_merchants = np.random.choice(common_merchants,
                                         size=len(common_merchants) // 5)
    while True:
        amount = 100.00
        entry = "chip_and_pin"
        foreign = False
        
        merchant_id, = np.random.choice(favorite_merchants, size=1)
        offset, = stats.poisson.rvs(mu, size=1)
        result = {
            "user_id": user_id,
            "amount": amount,
            "merchant_id": merchant_id,
            "entry": entry,
            "foreign": foreign
        }
        yield (offset, ("legitimate", result))

In [6]:
sim = simulate([basic_user_stream(1, 700), basic_user_stream(2, 105), basic_user_stream(3, 40)])

In [7]:
for i in range(20):
    print(next(sim))

(44, ('legitimate', {'user_id': 3, 'amount': 100.0, 'merchant_id': 10071, 'entry': 'chip_and_pin', 'foreign': False}))
(80, ('legitimate', {'user_id': 3, 'amount': 100.0, 'merchant_id': 8545, 'entry': 'chip_and_pin', 'foreign': False}))
(101, ('legitimate', {'user_id': 2, 'amount': 100.0, 'merchant_id': 15716, 'entry': 'chip_and_pin', 'foreign': False}))
(117, ('legitimate', {'user_id': 3, 'amount': 100.0, 'merchant_id': 12719, 'entry': 'chip_and_pin', 'foreign': False}))
(157, ('legitimate', {'user_id': 3, 'amount': 100.0, 'merchant_id': 10346, 'entry': 'chip_and_pin', 'foreign': False}))
(198, ('legitimate', {'user_id': 3, 'amount': 100.0, 'merchant_id': 10816, 'entry': 'chip_and_pin', 'foreign': False}))
(216, ('legitimate', {'user_id': 2, 'amount': 100.0, 'merchant_id': 4159, 'entry': 'chip_and_pin', 'foreign': False}))
(245, ('legitimate', {'user_id': 3, 'amount': 100.0, 'merchant_id': 9330, 'entry': 'chip_and_pin', 'foreign': False}))
(280, ('legitimate', {'user_id': 3, 'amount': ...
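
Each event is a nested `(timestamp, (label, fields))` tuple.  If you want one flat record per transaction, a helper along these lines may be convenient.  This is only a sketch: `flatten_event` and the `timestamp` and `label` key names are assumptions of this example, not part of the simulator.

```python
def flatten_event(event):
    """Turn (timestamp, (label, fields)) into a single flat dict."""
    timestamp, (label, fields) = event
    row = {"timestamp": timestamp, "label": label}
    row.update(fields)  # copy the transaction fields alongside the metadata
    return row

# The first event from the output above.
event = (44, ("legitimate", {"user_id": 3, "amount": 100.0,
                             "merchant_id": 10071,
                             "entry": "chip_and_pin", "foreign": False}))
row = flatten_event(event)
print(row["timestamp"], row["label"], row["merchant_id"])
# 44 legitimate 10071
```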

## Some quick improvements

1.  Users don't always buy things from a few favorite merchants.  Change `basic_user_stream` so that they occasionally buy from any merchant.
2.  Most people probably buy many inexpensive things and relatively few expensive things.  Use this insight to generate (more) realistic transaction amounts.
3.  Some small percentage of online sales will be foreign transactions.  When a user is traveling abroad, nearly all of their transactions will be foreign transactions.  Add some state to `basic_user_stream` to model occasional international travel.
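
One possible starting point for the three improvements above is sketched below.  Everything here is an assumption made for illustration: the helper names (`pick_merchant`, `sample_amount`, `travel_flags`), the log-normal shape for amounts, and every probability are made up rather than fitted to real data.  The sketch uses the standard library's `random` module to stay dependency-free, and it applies the small at-home foreign probability to every transaction rather than only to online sales.

```python
import random

MERCHANT_COUNT = 20000  # same value as earlier in the notebook

def pick_merchant(rng, favorite_merchants, explore_prob=0.1):
    """Usually pick a favorite merchant; occasionally pick any merchant.

    explore_prob is a made-up parameter for this sketch.
    """
    if rng.random() < explore_prob:
        return rng.randrange(MERCHANT_COUNT)
    return rng.choice(favorite_merchants)

def sample_amount(rng, mu=3.0, sigma=1.0):
    """Draw a right-skewed amount: many small purchases, few large ones.

    The log-normal shape and its parameters are assumptions.
    """
    return round(rng.lognormvariate(mu, sigma), 2)

def travel_flags(rng, depart_prob=0.01, return_prob=0.2,
                 foreign_home=0.02, foreign_abroad=0.95):
    """Yield one `foreign` flag per transaction from a home/abroad state."""
    abroad = False
    while True:
        # Possibly switch state before each transaction.
        if abroad:
            if rng.random() < return_prob:
                abroad = False
        elif rng.random() < depart_prob:
            abroad = True
        p_foreign = foreign_abroad if abroad else foreign_home
        yield rng.random() < p_foreign

# Example: draw the varying fields for five transactions.
rng = random.Random(42)
favorites = [10071, 8545, 15716]  # merchant IDs borrowed from the output above
flags = travel_flags(rng)
sample = [{"merchant_id": pick_merchant(rng, favorites),
           "amount": sample_amount(rng),
           "foreign": next(flags)}
          for _ in range(5)]
```

Inside `basic_user_stream`, these helpers would replace the constant `amount`, the `np.random.choice` over `favorite_merchants`, and the constant `foreign` value.  Keeping the travel state in its own generator means each user stream can carry its own home/abroad state.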

## Simulating fraud

WIP