## Dummify Data!!

### Problem:
Developing code to process data classified as confidential or above is inconvenient, because the data must reside in a GSIB environment that is ill-equipped for software development (no IDE, no plugins, no Docker, version control is...).

Sometimes masking the identifiable fields alone is not enough to declassify the data sufficiently. For example, an income distribution can remain sensitive even after names are masked.

### Solution:
An automated process that generates safe fake data from the real table. It auto-detects column types and returns a pseudo-realistic version of the original table.

### Limitations:
The dummified data is meant only for testing that code runs, not for analysis, since all realistic distributions and correlations are destroyed in the dummification process.

## Imports

In [1]:
import pandas as pd

## Read "original"

In [2]:
df_orig = pd.read_csv("data/sample_data.csv")

In [3]:
df_orig.head(8)

Unnamed: 0,name,age,income,gender
0,Aaron Lee,26,7077.895077,M
1,Abdul Halim bin Haron,38,9148.225942,M
2,Abdullah Tarmugi,21,13313.700704,m
3,Adnan Bin Saidi,33,9197.348845,M
4,Adrian Pang,41,7225.228331,m
5,Agu Casmir,22,9977.319644,F
6,Albert Winsemius,68,12649.718205,F
7,Aloysius Pang (Wei Chong),55,8206.125254,f


## Functions

In [4]:
### TODO: Handle imperfect data, e.g. a column that is actually numerical but tainted with a few strings

In [5]:
### Categorical columns: a column counts as categorical when it has few distinct values.
### Every category is guaranteed to appear at least once in the generated column.
import numpy as np

max_cats = 8

def looksCategorical(ser, max_cats=max_cats):
    return len(ser.unique()) <= max_cats

def genCategoricals(ser, gen_size):
    cat_set = ser.unique().tolist()
    gen_remainder = gen_size - len(cat_set)

    ### Can make as warning instead...?
    assert gen_remainder >= 0, "gen_size must be at least the number of categories"

    ### One of each category first, then fill the rest by sampling the original column
    dummy_cats = cat_set + np.random.choice(ser, gen_remainder).tolist()
    np.random.shuffle(dummy_cats)

    return dummy_cats

In [6]:
### Uses markov model to generate dummy text
import markovify

def tryUntilCondition(func, cond):
    ### Loop instead of recursing, so a long run of rejected samples
    ### cannot hit Python's recursion limit
    while True:
        a = func()
        if cond(a):
            return a

def genDummyStrings(ser, gen_size, state_size=3):
    char_model = markovify.Chain(
        [list(strg) for strg in ser],
        state_size = state_size
    )
    
    min_len = min([len(strg) for strg in ser])
    
    dummy_strs = [
        tryUntilCondition(
            lambda: ''.join(char_model.gen()),
            lambda s: len(s) > min_len and s not in ser.values
        ) for _ in range(gen_size)]
    
    return dummy_strs

In [7]:
### Use a uniform distribution model to generate floats or integers

### Wanna be smart and detect distribution?
### Detect decimal place.. truncate to same d.p.

import numpy as np

def genNumericals(ser, gen_size):
    dummy_numbers = np.random.uniform(
        ser.quantile(q=0.05), ser.quantile(q=0.95), size=gen_size
    )
    
    return dummy_numbers
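
The "truncate to same d.p." idea noted above could look roughly like the sketch below. The names `count_decimal_places` and `gen_numericals_rounded` are hypothetical and not part of the notebook, and the decimal count is inferred from Python's shortest float representation, which is an assumption that may overcount for values carrying floating-point noise.

```python
import numpy as np
import pandas as pd

def count_decimal_places(ser):
    """Largest number of decimal places among the non-null values of a numeric Series."""
    def places(x):
        text = repr(float(x))
        # Scientific notation or no fractional part: treat as zero decimals
        if "e" in text or "." not in text:
            return 0
        frac = text.split(".")[1]
        return 0 if frac == "0" else len(frac)

    counts = ser.dropna().map(places)
    return int(counts.max()) if len(counts) else 0

def gen_numericals_rounded(ser, gen_size, rng=None):
    """Uniform samples between the 5th and 95th percentiles, rounded like the original column."""
    rng = np.random.default_rng() if rng is None else rng
    low, high = ser.quantile(q=0.05), ser.quantile(q=0.95)
    return np.round(rng.uniform(low, high, size=gen_size), count_decimal_places(ser))
```

Passing a seeded `np.random.default_rng` as `rng` makes the dummy values reproducible between runs, which may help when the dummy table is used in tests.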

In [8]:
### Pepper with nulls?
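
One possible way to pepper the dummy table with nulls is sketched below. The helper `pepper_with_nulls` and its `null_frac` and `seed` parameters are hypothetical. It blanks each cell independently with a fixed probability, which is an assumption; matching the per-column null rate of the original table would be an alternative.

```python
import numpy as np
import pandas as pd

def pepper_with_nulls(df, null_frac=0.05, seed=None):
    """Return a copy of df in which each cell is replaced by NaN with probability null_frac."""
    rng = np.random.default_rng(seed)
    # Boolean array with the same shape as df; True marks a cell to blank out
    null_mask = rng.random(df.shape) < null_frac
    return df.mask(null_mask)
```

Note that integer columns that receive a NaN are upcast to float by pandas, so downstream dtype checks may see `float64` where the original had `int64`.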

In [9]:
### Very specific generators e.g. IC, UEN... to conform to some regex?
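
A format-specific generator could be sketched as below. The letter, seven digits, letter shape in `ID_PATTERN` is purely illustrative and is not claimed to match the real IC or UEN rules; `gen_dummy_id` and `gen_dummy_ids` are hypothetical names. No checksum is computed, so the values only satisfy the regex.

```python
import random
import re
import string

# Hypothetical ID shape for illustration only: one letter, seven digits, one letter
ID_PATTERN = re.compile(r"[A-Z]\d{7}[A-Z]")

def gen_dummy_id(rng):
    """Build one random string that matches ID_PATTERN."""
    prefix = rng.choice(string.ascii_uppercase)
    digits = "".join(rng.choice(string.digits) for _ in range(7))
    suffix = rng.choice(string.ascii_uppercase)
    return prefix + digits + suffix

def gen_dummy_ids(gen_size, seed=None):
    """Generate gen_size random IDs, reproducible when a seed is given."""
    rng = random.Random(seed)
    return [gen_dummy_id(rng) for _ in range(gen_size)]
```

Checking the output with `ID_PATTERN.fullmatch` is a cheap way to confirm that the generator and the regex stay in sync if the pattern is changed later.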

## Generate dummy

In [10]:
### Params
size_dummy = 50
df_dummy = pd.DataFrame()

In [11]:
for col in df_orig:
    ser = df_orig[col]
    dtype_name = ser.dtypes.name

    ### If categorical (few distinct values, any dtype)
    if looksCategorical(ser):
        df_dummy[col] = genCategoricals(ser, size_dummy)

    ### If string
    elif dtype_name == "object":
        df_dummy[col] = genDummyStrings(ser, size_dummy)

    ### If float64
    elif dtype_name == "float64":
        df_dummy[col] = genNumericals(ser, size_dummy)

    ### If int64
    elif dtype_name == "int64":
        df_dummy[col] = [int(n) for n in genNumericals(ser, size_dummy)]

    ### Columns of any other dtype are skipped

In [12]:
df_dummy.head(20)

Unnamed: 0,name,age,income,gender
0,Lim Heng Tee,28,11936.404084,m
1,Vivia Khian,27,8490.249465,M
2,Chia Thian Seng,59,9186.113519,M
3,Davin Pang,47,8335.586088,m
4,Toh Choh,22,8720.765552,m
5,Natashak,52,11431.046348,m
6,Lee Huei Ming,51,8431.490648,F
7,Abdullah Kee,24,9593.843673,M
8,Fandan,54,13042.247801,F
9,Kelvinder Chin,21,8274.527587,M


## Imperfect data

In [17]:
s = pd.Series(list(range(1000)) + ["n.a."])

In [18]:
s

0          0
1          1
2          2
3          3
4          4
        ... 
996      996
997      997
998      998
999      999
1000    n.a.
Length: 1001, dtype: object

In [19]:
### Should be integer..... any better way to check?
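
One way to check could be to coerce the column with `pd.to_numeric(..., errors="coerce")` and measure how much of it parses. The sketch below assumes that approach; `numeric_share`, `looks_integer` and the `min_share` threshold of 0.95 are hypothetical choices, not part of the notebook.

```python
import pandas as pd

def numeric_share(ser):
    """Fraction of the non-null values in ser that parse as numbers."""
    non_null = ser.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")
    return float(parsed.notna().mean())

def looks_integer(ser, min_share=0.95):
    """True if at least min_share of the values are numeric and all numeric values are whole."""
    parsed = pd.to_numeric(ser, errors="coerce").dropna()
    if parsed.empty or numeric_share(ser) < min_share:
        return False
    return bool((parsed % 1 == 0).all())

# Same shape of data as the series above: 1000 integers plus one stray string
s = pd.Series(list(range(1000)) + ["n.a."])
print(numeric_share(s), looks_integer(s))
```

A column that passes `looks_integer` could then be routed to the numerical generator after dropping the unparseable values, instead of being treated as free text because of its `object` dtype.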
