## Generating Customers

In this notebook we build an event generator that creates or changes personally identifiable information (PII) for bank customers.

Each event is either a `new` event or a `change` event.

When a `new` event occurs, we will generate a new customer and a set of PII for them, including their:

- Name
- Date of birth
- Social Security number
- Bank account details

When a `change` event occurs, we will select an existing customer and alter one piece of their PII. 

#### Making Repeatable Simulations

We follow the methodology laid out [here](https://chapeau.freevariable.com/2020/02/repeatable-simulation-without-boilerplate.html) to make every simulation deterministic, and thus repeatable. This lets us replay the same data against new models to test them.
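
As a minimal sketch of the idea, the example below uses only the standard library's `random.Random` in place of NumPy's `RandomState` (an assumption made to keep it dependency-free). The names `with_rng` and `draw` are hypothetical and do not come from the linked post. A decorator supplies a seeded generator whenever the caller does not pass one, so two runs with the same seed produce identical draws.

```python
import functools
import random
import time


def with_rng(func):
    """Inject a seeded random.Random instance if the caller did not pass one."""
    @functools.wraps(func)
    def wrapper(*args, rng=None, seed=None, **kwargs):
        if rng is None:
            if seed is None:
                # Fall back to a time-based 32-bit seed, so unseeded runs differ.
                seed = int(time.time()) & ((1 << 32) - 1)
            rng = random.Random(seed)
        return func(*args, rng=rng, **kwargs)
    return wrapper


@with_rng
def draw(n, rng=None):
    """Return n pseudo-random integers in [0, 99]."""
    return [rng.randint(0, 99) for _ in range(n)]


first_run = draw(5, seed=123)
second_run = draw(5, seed=123)
print(first_run == second_run)  # same seed, same sequence
```

Because the generator is passed explicitly rather than stored in global state, a caller can also share one generator across several functions by passing `rng=...` directly.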

In [1]:
### Setting up repeatability
### This is copied directly from Will's notebook
### TODO: move into a separate file?

import time

import numpy as np

def makeprng(func):
    # Wrap func so that it always receives a seeded PRNG via the `prng` keyword.
    def call_with_prng(*args, prng=None, seed=None, **kwargs):
        if prng is None:
            if seed is None:
                # No seed given: derive a 32-bit seed from the current time.
                seed = int(time.time()) & ((1 << 32) - 1)
            prng = np.random.RandomState(seed)
        return func(*args, prng=prng, seed=seed, **kwargs)
    return call_with_prng

# Again copied verbatim from Will's notebook.
# This makes it simple to set seeds in the scipy distributions.
def makedist(dist_cls, seed=None, prng=None, **kwargs):
    d = dist_cls(**kwargs)
    # Use the explicit seed if one is given; otherwise draw a seed from prng.
    d.random_state = seed or prng.randint((1 << 32) - 1)
    return d

## Simulating Names

We begin by simulating user names. For this notebook we use a set of Faroese names, which we scrape from a webpage using `BeautifulSoup`. We follow the Faroese naming convention: a surname is built from a male first name plus a suffix, so female surnames end in _dóttir_ while male surnames end in _son_.

In [2]:
## TODO: move this to a separate file and save the data sets we generate

import random

import numpy as np
from bs4 import BeautifulSoup


def scrape_names(path):
    # Collect the first word of every link title in the saved page.
    with open(path) as f:
        soup = BeautifulSoup(f, "html.parser")
    scraped = []
    for link in soup.find_all('a'):
        try:
            scraped.append(link['title'].rsplit(' ')[0])
        except KeyError:
            pass  # this link has no title attribute
    del scraped[0]  # discard the first scraped entry
    return scraped


female_names = scrape_names("data/faroese-female.htm")
male_names = scrape_names("data/faroese-male.htm")

n_male = len(male_names)
n_female = len(female_names)
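
To illustrate the extraction step without the data files, here is a small sketch that runs on an inline HTML snippet. It uses the standard library's `html.parser` rather than `BeautifulSoup`, and the snippet, the `TitleCollector` class, and the sample titles are all made up for illustration; the real pages may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely imitating a page of links whose titles start with a name.
SAMPLE_HTML = """
<ul>
  <li><a href="/index" title="Index page">Index</a></li>
  <li><a href="/n/1" title="Berta (name)">Berta</a></li>
  <li><a href="/n/2" title="Brá (name)">Brá</a></li>
  <li><a href="/about">No title here</a></li>
</ul>
"""


class TitleCollector(HTMLParser):
    """Collect the first word of every <a> tag's title attribute."""

    def __init__(self):
        super().__init__()
        self.first_words = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        title = dict(attrs).get("title")
        if title:  # skip anchors without a title instead of catching exceptions
            self.first_words.append(title.split(" ")[0])


collector = TitleCollector()
collector.feed(SAMPLE_HTML)

# Drop the first entry, mirroring the cell above, which deletes the first
# scraped entry (assumed here to be a non-name link).
sample_names = collector.first_words[1:]
print(sample_names)
```

Checking for the attribute explicitly makes it clear which links are skipped, whereas a broad `try`/`except` would also hide unrelated errors.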


In this next cell we set up a generator which simulates names from the set we scraped above.

In [3]:
## this isn't computing the genitive form of the male names.
@makeprng
def names(males, females, dist=0.5, prng=None, seed=None):
    # The gender draw uses the global `random` module, seeded here;
    # the name draws use `prng`.
    random.seed(seed)
    
    n_male = len(males)
    n_female = len(females)


    while True:
        rand = random.random()
        ## this is cycling about change events. 
        print(rand)
        # randint excludes its upper bound, so with `n - 1` as the bound
        # the last entry of each list is never selected.
        surname = males[prng.randint(0, n_male-1)]
        if rand < dist:
            gender = 'male'
            first_name = males[prng.randint(0, n_male-1)]
            surname = surname + 'son'
        else:
            first_name = females[prng.randint(0, n_female-1)]
            surname = surname + 'dóttir'
            gender = 'female'
            
        # Each yielded value is a one-element tuple: (full_name,)
        result = {'name':" ".join([first_name,surname]) }
        yield tuple(r for r in result.values())
        
name = names(males = male_names, females = female_names, dist = 0.5, seed=123) # this is a name generator

In [4]:
for i in range(10):
    print(next(name))

0.052363598850944326
('Herbjartur Kolfinnurson',)
0.08718667752263232
('Hákun Hervarðurson',)
0.4072417636703983
('Sólmundur Benadiktson',)
0.10770023493843905
('Niklái Aleksson',)
0.9011988779516946
('Brá Bergurdóttir',)
0.0381536661023224
('Eyðbjartur Matsson',)
0.5362020400339269
('Berta Snævardóttir',)
0.33219769850967984
('Pál Birgarson',)
0.8520866189293687
('Ata Arnbjørndóttir',)
0.1596623967219699
('Eyðsvein Liasson',)


## Change events

Before generating more PII, we set up a simple `surname_change` function. Given a customer's current name, it draws a new surname while keeping the first name and the gendered suffix. Selecting which customer to change is handled by the customer generator below.

In [5]:
@makeprng
def surname_change(old_name, males, prng=None, seed=None):
    ## takes in the old name
    ## assumes the name is of the form 'first_name surname'
    ## uses len(males) rather than the global n_male
    surname = males[prng.randint(0, len(males)-1)]
    
    ## checks the suffix to see if the customer is male or female
    if 'dóttir' in old_name:
        surname = surname + 'dóttir'
        
    else:
        surname = surname + 'son'
    
    ## keep the first name and attach the new surname
    new_name = " ".join([old_name.split()[0], surname])
    return new_name
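
The following self-contained sketch shows the same suffix-preserving idea on a tiny, made-up list of names. It uses the standard library's `random.Random` instead of NumPy, and `change_surname` and `SAMPLE_MALES` are hypothetical names introduced only for this illustration.

```python
import random

# Hypothetical sample of male first names, used as the pool of patronymic stems.
SAMPLE_MALES = ["Aleks", "Bergur", "Birgar", "Lias", "Mats"]


def change_surname(old_name, males, rng):
    """Return old_name with a freshly drawn patronymic, keeping the gendered suffix."""
    # Assumes the name has exactly the form 'first_name surname'.
    first_name, old_surname = old_name.split(" ", 1)
    # Check the end of the surname rather than the whole string, so a first
    # name containing 'dóttir' could not confuse the test.
    suffix = "dóttir" if old_surname.endswith("dóttir") else "son"
    stem = rng.choice(males)  # choice() can return any element, including the last
    return f"{first_name} {stem}{suffix}"


rng = random.Random(42)
print(change_surname("Brá Bergurdóttir", SAMPLE_MALES, rng))
print(change_surname("Pál Birgarson", SAMPLE_MALES, rng))
```

Note that nothing here prevents the new surname from matching the old one; whether that should count as a change is a design decision left open.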

## Customer generator

We now have enough functionality to make a basic customer generator. The generator first creates `N` new users. Once `N` users have been created, each subsequent event will:

- generate another new user with probability `p`
- change the surname of an existing user with probability `1-p`

Events take the form `(timestamp, (event_type, (new_name, old_name)))` where:

- `timestamp` is an arbitrary time offset for the generation or change, drawn from a Poisson distribution with mean `mu`
- `event_type` is either `change` or `new`
- `new_name` and `old_name` are the same for `new` events and differ when the name has changed

In the current implementation, `new` events are emitted as `(timestamp, ('new', name, name))`, with each name wrapped in a one-element tuple, as the output below shows.
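
The event shape described above can be sketched with a toy generator. This is an illustration only: it uses the standard library's `random` module, placeholder customer names, and a uniform integer offset where the notebook uses a Poisson draw. `toy_event_stream` and its parameters are hypothetical names.

```python
import itertools
import random


def toy_event_stream(n_initial=3, p_new=0.85, seed=0):
    """Yield (offset, (event_type, (new_name, old_name))) tuples indefinitely."""
    rng = random.Random(seed)
    customers = []  # current name of every customer created so far
    for i in itertools.count():
        # Placeholder offset; the notebook draws this from a Poisson distribution.
        offset = rng.randint(0, 5)
        if len(customers) < n_initial or rng.random() < p_new:
            name = f"customer-{i}"  # stand-in for a generated name
            customers.append(name)
            yield offset, ("new", (name, name))
        else:
            idx = rng.randrange(len(customers))  # any existing customer
            old_name = customers[idx]
            new_name = f"{old_name}-renamed"
            customers[idx] = new_name
            yield offset, ("change", (new_name, old_name))


events = list(itertools.islice(toy_event_stream(seed=1), 20))
for event in events[:5]:
    print(event)
```

Keeping the list of customers inside the generator, rather than in a module-level counter, means two streams created with the same seed replay the same events independently of each other.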

In [23]:
from itertools import count
from scipy import stats
numb_users = count()

@makeprng
def name_stream(mu, names_str, N=10, p=0.85, prng=None, seed=None):
    nusers = 0
    dic_names = {}  # maps customer index -> current name
    poiss = makedist(stats.poisson, prng=prng, mu = mu)
    # Re-create the PRNG from the seed (or a fixed default); this replaces
    # the one supplied by @makeprng.
    prng = np.random.RandomState(seed or 0xda7aba5e)
    print(prng)

    while True:
        # Initial phase: nusers takes the values 0..N here, so N+1 customers
        # are created before the first random new/change decision.
        if nusers < N: 
            nusers = next(numb_users)
            offset, = poiss.rvs(size = 1)
            name = next(names_str)
            y = ('new', name, name)
            dic_names[nusers]= name[0]

        else:
            offset, = poiss.rvs(size = 1)
            rand = prng.random_sample()
            ## this is cycling after change events. 
            print(rand)
            if rand < p:
                nusers = next(numb_users)
                name = next(names_str)
                y = ('new', name, name)
                dic_names[nusers]= name[0]
            else:
                # randint excludes its upper bound, so the most recently
                # added customer is never picked.
                to_edit = prng.randint(0, nusers)
                print("to edit", to_edit, sep = " ")
                new_name = surname_change(old_name = dic_names[to_edit], males = male_names, prng=prng)
                y = ('change', (new_name, dic_names[to_edit]))
                dic_names[to_edit] = new_name

        yield (offset, y)

In [24]:
name = names(males = male_names, females = female_names, dist = 0.5, seed=123) # this is a name generator
sim= name_stream(mu=3, names_str = name, N=10, p=0.85, seed=143)

In [25]:
for i in range(200):
    print(next(sim))

<mtrand.RandomState object at 0x111a401e0>
0.052363598850944326
(3, ('new', ('Herbjartur Kolfinnurson',), ('Herbjartur Kolfinnurson',)))
0.08718667752263232
(5, ('new', ('Hákun Hervarðurson',), ('Hákun Hervarðurson',)))
0.4072417636703983
(4, ('new', ('Sólmundur Benadiktson',), ('Sólmundur Benadiktson',)))
0.10770023493843905
(2, ('new', ('Niklái Aleksson',), ('Niklái Aleksson',)))
0.9011988779516946
(4, ('new', ('Brá Bergurdóttir',), ('Brá Bergurdóttir',)))
0.0381536661023224
(3, ('new', ('Eyðbjartur Matsson',), ('Eyðbjartur Matsson',)))
0.5362020400339269
(0, ('new', ('Berta Snævardóttir',), ('Berta Snævardóttir',)))
0.33219769850967984
(3, ('new', ('Pál Birgarson',), ('Pál Birgarson',)))
0.8520866189293687
(4, ('new', ('Ata Arnbjørndóttir',), ('Ata Arnbjørndóttir',)))
0.1596623967219699
(3, ('new', ('Eyðsvein Liasson',), ('Eyðsvein Liasson',)))
0.3372166571092755
(3, ('new', ('Hørður Bessison',), ('Hørður Bessison',)))
0.7377549616506409
0.3337963946289553
(3, ('new', ('Valdi Hannis