## Simulating Personally Identifiable Information 


In this notebook, we simulate fake bank customers. For each customer we generate a range of personally identifiable information (PII), including:

- First name
- Surname
- Gender
- Date of birth
- Country of residence
- Social Security number
- Bank account details
    
We will also note whether the customer is on any watchlists. 

Later in the notebook, we simulate each user's details changing over time. Once a user exists, they may change their surname or state of residence, or they may be added to a watchlist.

We follow the methodology laid out [here](https://chapeau.freevariable.com/2020/02/repeatable-simulation-without-boilerplate.html) so that every simulation is deterministic, and thus repeatable. This is desirable because it lets us replay the same data against new models over time.

In [1]:
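### Setting up repeatability
### This is copied directly from Will's notebook
### to do - ? put into a separate file?
import time
import numpy as np  # imported here so this cell does not rely on a later cell

def makeprng(func):
    # Wrap a generator function so that it always receives a seeded RandomState.
    def call_with_prng(*args, prng=None, seed=None, **kwargs):
        if prng is None:
            if seed is None:
                # No seed supplied: derive a 32-bit seed from the clock.
                seed = int(time.time()) & ((1 << 32) - 1)
            prng = np.random.RandomState(seed)
        return func(*args, prng=prng, seed=seed, **kwargs)
    return call_with_prng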
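
To illustrate what the decorator provides, here is a minimal, self-contained sketch. It repeats `makeprng` so that it runs on its own, and `coin_flips` is a hypothetical toy generator that is not part of the notebook. Two generators created with the same seed should yield the same sequence:

```python
import time
import numpy as np

def makeprng(func):
    # Same decorator as above, repeated so this sketch is self-contained.
    def call_with_prng(*args, prng=None, seed=None, **kwargs):
        if prng is None:
            if seed is None:
                seed = int(time.time()) & ((1 << 32) - 1)
            prng = np.random.RandomState(seed)
        return func(*args, prng=prng, seed=seed, **kwargs)
    return call_with_prng

@makeprng
def coin_flips(prng=None, seed=None):
    # Hypothetical toy generator: yields 0 or 1 forever.
    while True:
        yield int(prng.randint(0, 2))

first = coin_flips(seed=42)
second = coin_flips(seed=42)
run_a = [next(first) for _ in range(8)]
run_b = [next(second) for _ in range(8)]
print(run_a == run_b)  # True: same seed, same sequence
```

Because each generator owns its own `RandomState`, drawing from one does not disturb the sequence produced by another.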

## Simulating Names

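We begin by simulating customers' names and genders. First names are drawn from lists of Faroese names, and each surname is a patronymic built from a male name plus `son` or `dóttir`: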

In [2]:
## to do - put this BeautifulSoup code in a separate file
## then just import the lists of names. 

from bs4 import BeautifulSoup
import random
import numpy as np


with open("data/faroese-female.htm") as ff:
    soup = BeautifulSoup(ff)

names = soup.find_all('a')

female_names = []
for n in names:
    try:
        female_names.append(n['title'].rsplit(' ')[0])
    except Exception:
        pass
del female_names[0]

with open("data/faroese-male.htm") as fm:
    soup = BeautifulSoup(fm)
    
mnames = soup.find_all('a')
male_names = []
for n in mnames:
    try:
        male_names.append(n['title'].rsplit(' ')[0])
    except Exception:
        pass
del male_names[0]

n_male = len(male_names)
n_female = len(female_names)

#invoking the decorator
@makeprng
def names(males, females, dist=0.5, prng=None, seed=None):
    # Gender is drawn from the stdlib `random` module; names come from the NumPy prng.
    random.seed(seed)
    while True:
        rand = random.random()
        # Note: RandomState.randint excludes its upper bound, so with `n - 1`
        # the last name in each list is never drawn.
        surname = males[prng.randint(0, n_male-1)]
        if rand < dist:
            gender = 'male'
            first_name = males[prng.randint(0, n_male-1)]
            surname = surname + 'son'
        else:
            first_name = females[prng.randint(0, n_female-1)]
            surname = surname + 'dóttir'
            gender = 'female'
        result = {'first_name':first_name, 
                  'surname':surname, 
                  'gender':gender}
        yield tuple(result.values())

In [3]:
sim = names(male_names, female_names, 0.5, seed = 104)

In [4]:
for i in range(10):
    print(next(sim))

('Sunnfríð', 'Ásfinnurdóttir', 'female')
('Gerda', 'Sjúrðurdóttir', 'female')
('Eir', 'Rasmusdóttir', 'female')
('Sjúrði', 'Bjørgheðinson', 'male')
('Vilmundur', 'Kristjanson', 'male')
('Fípa', 'Ormsteindóttir', 'female')
('Nanný', 'Ovidóttir', 'female')
('Saul', 'Hjørmundurson', 'male')
('Adrian', 'Syftunson', 'male')
('Oyvør', 'Hjørturdóttir', 'female')
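
The generator above mixes two random sources: the stdlib `random` module for gender and the NumPy prng for names. As a hedged alternative, the sketch below draws everything from a single `RandomState` and uses `len(...)` as the exclusive upper bound so that every name can be selected. `names_single_prng` and the toy name lists are hypothetical, chosen only for illustration, and this variant will not reproduce the output shown above.

```python
import numpy as np

def names_single_prng(males, females, dist=0.5, seed=None):
    # One RandomState drives both the gender draw and the name draws.
    prng = np.random.RandomState(seed)
    while True:
        # randint's upper bound is exclusive, so len(males) covers the whole list.
        surname_root = males[prng.randint(0, len(males))]
        if prng.random_sample() < dist:
            first_name = males[prng.randint(0, len(males))]
            yield (first_name, surname_root + 'son', 'male')
        else:
            first_name = females[prng.randint(0, len(females))]
            yield (first_name, surname_root + 'dóttir', 'female')

# Toy lists for illustration only; the notebook scrapes the real lists from HTML.
toy_males = ['Saul', 'Adrian', 'Jens']
toy_females = ['Gerda', 'Eir', 'Berta']

gen_a = names_single_prng(toy_males, toy_females, seed=104)
gen_b = names_single_prng(toy_males, toy_females, seed=104)
sample_a = [next(gen_a) for _ in range(5)]
sample_b = [next(gen_b) for _ in range(5)]
print(sample_a == sample_b)  # True: the whole record is reproducible from one seed
```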


## Simulating Dates of Birth

We also want to simulate realistic dates of birth for each customer. We assume that customers' birth dates are uniformly distributed between 1 January 1920 and 1 January 2005. The code below draws an offset from a range of `365*85` days, which ignores leap days, so the latest date it can produce falls in December 2004.

We can use the `datetime64` functionality built into NumPy to achieve this. We could write this as a standard function, but for the repeatability benefits discussed [here](https://chapeau.freevariable.com/2020/02/repeatable-simulation-without-boilerplate.html) we set up a generator:

In [5]:
@makeprng
def dob(prng=None, seed=None):
    while True:
        date = np.datetime64('1920-01-01') + np.timedelta64(prng.randint(0, 365*85), 'D')
        yield date

In [6]:
dates = dob(seed=101)

In [7]:
for i in range(20):
    print(next(dates))

1956-01-03
1969-02-03
1968-08-01
1941-12-02
1935-08-05
1966-07-01
1981-09-02
1953-11-06
1976-01-10
1967-07-10
1980-05-21
1980-08-21
1927-03-08
1978-08-20
1997-10-08
1937-09-08
1955-08-25
1980-10-20
1973-07-27
1984-06-30
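
The `365*85` range in the generator above does not account for leap days. As a hedged sketch, the variant below computes the exact number of days between the two endpoints with `datetime64` arithmetic instead. `dob_exact` and its parameters are hypothetical names, and it will not reproduce the dates printed above.

```python
import numpy as np

def dob_exact(start='1920-01-01', end='2005-01-01', seed=None):
    prng = np.random.RandomState(seed)
    start_day = np.datetime64(start, 'D')
    # Exact day count between the endpoints, leap days included.
    n_days = int((np.datetime64(end, 'D') - start_day) / np.timedelta64(1, 'D'))
    while True:
        # Offsets run from 0 to n_days - 1, so `end` itself is excluded.
        yield start_day + np.timedelta64(int(prng.randint(0, n_days)), 'D')

exact_dates = dob_exact(seed=101)
sample_dates = [next(exact_dates) for _ in range(1000)]
print(min(sample_dates), max(sample_dates))
```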


## Simulate a Location

We're going to assume all customers are uniformly distributed across the USA, and will assign them a State at random upon generation.

To do this we first need to define a list of States:

In [8]:
us_states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 
            'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO',
            'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 
            'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 
            'VA', 'WA', 'WV', 'WI', 'WY']

In [9]:
@makeprng
def us_state(states=us_states, prng=None, seed=None):
    while True:
        state = prng.choice(states)
        yield state

In [10]:
trying = us_state(seed=293)

In [11]:
for i in range(10):
    print(next(trying))

RI
MS
IN
ID
OR
TX
NJ
NY
NY
PA


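## Simulating Social Security Numbers

We also set up a generator to simulate Social Security Numbers (SSNs). Note that the generator below produces zero-padded 10-digit strings, whereas real SSNs have nine digits.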

In [12]:
@makeprng
def ssn(prng= None, seed = None):
    while True:
        ssn = f'{prng.randint(0, (10**10)-1):010}'
        yield ssn

In [13]:
ssns = ssn(seed = 143)

In [14]:
for i in range(10):
    print(next(ssns))

8229052961
6247600733
4568348801
7409279376
8685241312
4924994265
5452556245
3878666205
0023109492
9839588751
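
The generator above yields 10-digit strings. Real SSNs have nine digits, conventionally written as `AAA-GG-SSSS`. The sketch below is one hedged way to produce strings in that shape. `ssn_formatted` is a hypothetical name, and the sketch makes no attempt to enforce the rules about which number ranges are actually issued.

```python
import numpy as np

def ssn_formatted(seed=None):
    prng = np.random.RandomState(seed)
    while True:
        # Nine zero-padded digits; randint's upper bound is exclusive.
        digits = f'{int(prng.randint(0, 10**9)):09}'
        yield f'{digits[:3]}-{digits[3:5]}-{digits[5:]}'

formatted = ssn_formatted(seed=143)
ssn_sample = [next(formatted) for _ in range(5)]
for s in ssn_sample:
    print(s)
```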


In [15]:
def ids(prng= None, seed = None):
    uid = 0
    while True:
        uid = uid+1
        yield uid

In [16]:
u_ids = ids()
name = names(male_names, female_names, 0.5, seed=123) #- this is a name generator
dates = dob(seed=456)
states = us_state(seed=432)
ssns = ssn(seed=987)


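def initial_ppi():
    # Yield one initial PII record per new customer, drawing each field
    # from the seeded generators defined above.
    while True:
        uid = next(u_ids)
        user_name = next(name)
        dob = next(dates)
        location = next(states)
        ssn = next(ssns)
        # Bank and account details are fixed placeholders for now.
        bank = "001"
        account = 1223334444

        result = {
            "user_name": user_name,
            "location": location,
            "DOB": dob,
            "SSN": ssn,
            "bank": bank,
            "account": account}

        yield (uid, *result.values())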

In [17]:
testing = initial_ppi()

In [18]:
for i in range(100):
    print(next(testing))

(1, ('Herbjartur', 'Kolfinnurson', 'male'), 'CT', numpy.datetime64('2003-10-31'), '8173579990', '001', 1223334444)
(2, ('Hákun', 'Hervarðurson', 'male'), 'AL', numpy.datetime64('1921-09-05'), '0240364399', '001', 1223334444)
(3, ('Sólmundur', 'Benadiktson', 'male'), 'UT', numpy.datetime64('1963-12-04'), '5244376387', '001', 1223334444)
(4, ('Niklái', 'Aleksson', 'male'), 'OR', numpy.datetime64('1994-07-16'), '7813837217', '001', 1223334444)
(5, ('Brá', 'Bergurdóttir', 'female'), 'UT', numpy.datetime64('1981-10-17'), '0597960811', '001', 1223334444)
(6, ('Eyðbjartur', 'Matsson', 'male'), 'NC', numpy.datetime64('1949-05-23'), '8830221438', '001', 1223334444)
(7, ('Berta', 'Snævardóttir', 'female'), 'PA', numpy.datetime64('1989-10-25'), '9164906441', '001', 1223334444)
(8, ('Pál', 'Birgarson', 'male'), 'CO', numpy.datetime64('1953-02-27'), '4448122658', '001', 1223334444)
(9, ('Ata', 'Arnbjørndóttir', 'female'), 'KS', numpy.datetime64('2001-08-21'), '9150365426', '001', 1223334444)
(10, (

In [19]:
### This would be the structure for simulating PII (named `ppi` in the code) for one user at a time.


name = names(male_names, female_names, 0.5) #- this is a name generator
dates = dob()
states = us_state()
ssns = ssn()


def ppi(user_id):
        user_name = next(name)
        dob = next(dates)
        location = next(states)
        ssn = next(ssns)
        bank = "001"
        account = 1223334444
        
        first_run = True
        
        while True:
            
            if first_run: 
                print("first run")
                result = {
                "user_id":user_id,
                "user_name": user_name,
                "location": location,
                "DOB": dob,
                "SSN": ssn,
                "bank": bank,
                "account": account}
                print(result)
                first_run = False
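            else:
                #### This is where we sample an update to the information.
                ### to do - sample a time (offset)
                # For now only the name is redrawn; every other field is kept.
                result = {
                    "user_id":user_id,
                    "user_name": next(name),
                    "location": location,
                    "DOB": dob,
                    "SSN": ssn,
                    "bank": bank,
                    "account": account
                }
            # Note: user_id is also the first entry of result, so it appears
            # twice in the yielded tuple.
            yield ((user_id, *result.values()))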
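
The update step above redraws the whole name on every iteration. As a hedged sketch of the behaviour described in the introduction, where a user may move state or be added to a watchlist, the generator below changes one field per step and leaves everything else alone. `pii_updates`, the `watchlist` field, and the event names are all hypothetical and are not part of the notebook.

```python
import numpy as np

def pii_updates(record, states, seed=None):
    """Yield successive versions of `record`, changing one field per step."""
    prng = np.random.RandomState(seed)
    current = dict(record)
    while True:
        yield dict(current)  # hand out a copy so earlier versions stay intact
        event = str(prng.choice(['move', 'watchlist']))
        if event == 'move':
            current['location'] = str(prng.choice(states))
        else:
            current['watchlist'] = True

# Hypothetical starting record for one user.
start = {'user_id': 1, 'location': 'SC', 'SSN': '2549020093', 'watchlist': False}
versions = pii_updates(start, ['SC', 'NY', 'TX'], seed=7)
history = [next(versions) for _ in range(5)]
for version in history:
    print(version)
```

Keeping the identifier, date of birth, and SSN out of the update branch means only the fields that can plausibly change are ever touched.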

In [20]:
trial=ppi(1)

In [21]:
for i in range(10):
    print(next(trial))

first run
{'user_id': 1, 'user_name': ('Natasja', 'Bergleivurdóttir', 'female'), 'location': 'SC', 'DOB': numpy.datetime64('1937-02-07'), 'SSN': '2549020093', 'bank': '001', 'account': 1223334444}
(1, 1, ('Natasja', 'Bergleivurdóttir', 'female'), 'SC', numpy.datetime64('1937-02-07'), '2549020093', '001', 1223334444)
(1, 1, ('Kolgrímur', 'Valdison', 'male'), 'SC', numpy.datetime64('1937-02-07'), '2549020093', '001', 1223334444)
(1, 1, ('Marja', 'Kornusdóttir', 'female'), 'SC', numpy.datetime64('1937-02-07'), '2549020093', '001', 1223334444)
(1, 1, ('Fríðbjørg', 'Brendandóttir', 'female'), 'SC', numpy.datetime64('1937-02-07'), '2549020093', '001', 1223334444)
(1, 1, ('Jensia', 'Jensdóttir', 'female'), 'SC', numpy.datetime64('1937-02-07'), '2549020093', '001', 1223334444)
(1, 1, ('Fríðborg', 'Hjørleivurdóttir', 'female'), 'SC', numpy.datetime64('1937-02-07'), '2549020093', '001', 1223334444)
(1, 1, ('Gudný', 'Eyðtórdóttir', 'female'), 'SC', numpy.datetime64('1937-02-07'), '2549020093', '0