### Creating a fake case-control data set for neonatal metabolomics studies 
We often work with case-control studies, where the goal is to identify factors (here, metabolites) that may be associated with, or contribute to, a certain medical condition. This is done by comparing subjects who have the condition (the "cases") with subjects who do not have it but are otherwise similar (the "controls").

In studies involving dried blood spots sampled from newborns (as part of the national newborn screening program), controls are selected to have the same date of birth as the cases, or one very close to it. This is because the metabolomic profiles of newborns show a strong variation with season [***reference***?], possibly reflecting how the mother's diet and lifestyle (like everyone else's) vary with the time of year.
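To make the matching criterion concrete, here is a minimal sketch of how candidate controls could be filtered by date of birth for a single case. It is not part of the notebook's pipeline: the function name `match_controls`, the `max_diff_days` tolerance and the toy subject IDs are all assumptions made for illustration, and only the standard library is used.

```python
from datetime import date, timedelta


def match_controls(case_dob, candidates, max_diff_days=7):
    """Return the IDs of candidates born within max_diff_days of case_dob.

    `candidates` maps a (hypothetical) subject ID to a date of birth.
    """
    window = timedelta(days=max_diff_days)
    return [cid for cid, dob in candidates.items() if abs(dob - case_dob) <= window]


# toy data: hypothetical subject IDs and dates of birth
candidates = {
    "c1": date(2020, 1, 3),    # 2 days after the case
    "c2": date(2020, 1, 20),   # 19 days after the case
    "c3": date(2019, 12, 28),  # 4 days before the case
}

matched = match_controls(date(2020, 1, 1), candidates)
print(matched)
```

In a real study the candidate pool would also be restricted on the other matching variables; the sketch only covers the date-of-birth window.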


#### The code
The code below creates a fake data set with the typical structure of such a study: a table of subjects paired into cases and controls. We store the data in a pandas dataframe to visualize the resulting table and to allow easy export to a CSV and/or Excel file.

For all random data we use the [`Faker` library](https://faker.readthedocs.io/en/master/index.html), with the [faker-biology plugin](https://pypi.org/project/faker-biology/) available for Python. 

##### Parameters
We create `n_pairs` case-control pairs, each subject having a random date of birth that corresponds to an age between `min_age` and `max_age`. The maximum difference in date of birth within a pair can be set in days, months and years; below we allow a maximum difference of seven days.

We also define sub groups by assigning each pair a random organ from a list of `n_groups` organs; this could, for example, be the tissue where a certain condition is being monitored in the study. Each subject is also assigned a random barcode, which could serve as the identifier of the sample in, say, a biobank.






### Setup libs

In [42]:
import pandas as pd
import numpy as np

import faker
from faker_biology.physiology import CellType, Organ, Organelle

from dateutil.relativedelta import relativedelta


# init fake object and load plugins
fake = faker.Faker()

# add organ data plugin 
fake.add_provider(Organ)

# or the organelle and cell type plugins
# fake.add_provider(Organelle)
# fake.add_provider(CellType)

### Set parameters

In [43]:
# save
saveop = True

# number of case control pairs in fake study 
n_pairs = 523 

# age of subjects
min_age = 15 # years
max_age = 40 # years

# max age difference between case and control
max_diff_years = 0
max_diff_months = 0
max_diff_days = 7

# number of sub groups; here organs to simulate some attribute of the disease/condition
n_groups = 5
organs = [fake.organ() for _ in range(0,n_groups)]

variables = ["pair_ID", "specimen_ID", "object", "date_of_birth", "year", "barcode", "organ"]

# create dict to hold the fake data 
fake_data = {}
for v in variables:
    fake_data.setdefault(v, [])

In [44]:
organs

['Ganglia', 'Epididymis', 'Pineal gland', 'Lymph node', 'Cochlea']

### Create fake data

In [64]:
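# specimen IDs 1 .. 2*n_pairs
spec_id = list(np.arange(1, n_pairs*2+1))

# group consecutive IDs into [control, case] pairs
spec_id_pairs = [ [spec_id[i], spec_id[i+1]] for i in range(0, n_pairs*2, 2) ]

# shuffle the order of the pairs in place; this is an alias, not a copy,
# so spec_id_pairs is shuffled as well
spec_id_pairs_permuted = spec_id_pairs
np.random.shuffle(spec_id_pairs_permuted)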

spec_id_pairs_permuted

[[635, 636],
 [37, 38],
 [259, 260],
 [221, 222],
 [793, 794],
 [499, 500],
 [965, 966],
 [443, 444],
 [653, 654],
 [921, 922],
 [945, 946],
 [781, 782],
 [1031, 1032],
 [73, 74],
 [1015, 1016],
 [809, 810],
 [655, 656],
 [829, 830],
 [77, 78],
 [689, 690],
 [617, 618],
 [545, 546],
 [501, 502],
 [387, 388],
 [823, 824],
 [157, 158],
 [113, 114],
 [841, 842],
 [935, 936],
 [803, 804],
 [681, 682],
 [855, 856],
 [835, 836],
 [99, 100],
 [329, 330],
 [41, 42],
 [771, 772],
 [97, 98],
 [137, 138],
 [831, 832],
 [27, 28],
 [199, 200],
 [701, 702],
 [959, 960],
 [165, 166],
 [269, 270],
 [557, 558],
 [517, 518],
 [281, 282],
 [643, 644],
 [417, 418],
 [857, 858],
 [509, 510],
 [669, 670],
 [581, 582],
 [495, 496],
 [839, 840],
 [639, 640],
 [909, 910],
 [777, 778],
 [1033, 1034],
 [49, 50],
 [399, 400],
 [861, 862],
 [637, 638],
 [865, 866],
 [225, 226],
 [431, 432],
 [405, 406],
 [79, 80],
 [531, 532],
 [515, 516],
 [541, 542],
 [55, 56],
 [741, 742],
 [159, 160],
 [203, 204],
 [409, 410],
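The pairing of specimen IDs can also be sketched with the standard library alone. This is an illustrative rewrite rather than the notebook's cell: `make_id_pairs` and its `seed` argument are hypothetical names. Unlike a plain assignment of one list name to another, it shuffles a shallow copy, so the ordered list of pairs stays available.

```python
import random


def make_id_pairs(n_pairs, seed=None):
    """Build contiguous [control, case] ID pairs and a shuffled copy of them."""
    rng = random.Random(seed)  # local generator, so the result is reproducible
    pairs = [[i, i + 1] for i in range(1, 2 * n_pairs + 1, 2)]
    permuted = list(pairs)     # shallow copy: `pairs` keeps its original order
    rng.shuffle(permuted)
    return pairs, permuted


pairs, permuted = make_id_pairs(5, seed=0)
print(pairs)
print(permuted)
```

Passing a fixed `seed` gives the same permutation on every run, which can be convenient when the fake data set is used in tests.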

In [65]:

for pair_id in range(0, n_pairs):
    
    # Choose a random sub group
    case_control_organ = np.random.choice(organs)
    
    # --- Controls ---
    
    dob_control = fake.date_of_birth(minimum_age=min_age, maximum_age=max_age)  # datetime.date object
    
    spec_id_control = spec_id_pairs_permuted[pair_id][0]
    
    fake_data["pair_ID"].append(pair_id)
    fake_data["object"].append("Control")
    fake_data["specimen_ID"].append(spec_id_control)
    fake_data["date_of_birth"].append(dob_control)
    fake_data["year"].append(dob_control.year)
    fake_data["barcode"].append(fake.ean8())
    fake_data["organ"].append(case_control_organ)
    
    # --- Cases ---
    
    # let the case specimen be born within relativedelta time after the control
    # (use the plural keywords: singular `month=` would set an absolute month)
    dob_case = fake.date_between_dates(dob_control, 
                            dob_control + relativedelta(years=max_diff_years, 
                                          months=max_diff_months, 
                                          days=max_diff_days)
                            )
    
    # let case and control specimen numbers be contiguous
    spec_id_case = spec_id_pairs_permuted[pair_id][1]
    
    fake_data["pair_ID"].append(pair_id)
    fake_data["object"].append("Case")
    fake_data["specimen_ID"].append(spec_id_case)
    fake_data["date_of_birth"].append(dob_case)
    fake_data["year"].append(dob_case.year)
    fake_data["barcode"].append(fake.ean8())
    fake_data["organ"].append(case_control_organ)
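A quick sanity check on the generated records can catch pairing mistakes early, for example if the generation cell is run more than once without re-initialising `fake_data`, in which case each `pair_ID` appears more than twice. The sketch below is an assumption-laden illustration: `check_pairs` is a hypothetical helper, it expects a dict of lists with the same `pair_ID`, `object` and `date_of_birth` keys as above, and it only handles a tolerance expressed in days.

```python
from datetime import date


def check_pairs(data, max_diff_days=7):
    """Return the pair_IDs that break the pairing rules.

    A valid pair has exactly one "Case" and one "Control", with the case
    born 0 .. max_diff_days days after the control.
    """
    by_pair = {}
    for pid, kind, dob in zip(data["pair_ID"], data["object"], data["date_of_birth"]):
        by_pair.setdefault(pid, []).append((kind, dob))

    bad = []
    for pid, members in by_pair.items():
        kinds = sorted(kind for kind, _ in members)
        if kinds != ["Case", "Control"]:
            bad.append(pid)  # wrong number or type of members
            continue
        dobs = dict(members)
        diff = (dobs["Case"] - dobs["Control"]).days
        if not 0 <= diff <= max_diff_days:
            bad.append(pid)  # dates of birth too far apart
    return bad


# toy records in the same column layout as fake_data
toy = {
    "pair_ID": [0, 0, 1, 1],
    "object": ["Control", "Case", "Control", "Case"],
    "date_of_birth": [date(1990, 11, 10), date(1990, 11, 11),
                      date(2007, 5, 20), date(2007, 6, 20)],
}
bad_pairs = check_pairs(toy)
print(bad_pairs)
```

An empty list means every pair passed; here the second toy pair is flagged because its dates of birth are 31 days apart.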

Create dataframe

In [66]:
fake_data_df = pd.DataFrame(fake_data)
fake_data_df

| | pair_ID | specimen_ID | object | date_of_birth | year | barcode | organ |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 418 | Control | 1990-11-10 | 1990 | 97774679 | Ganglia |
| 1 | 0 | 419 | Case | 1990-11-11 | 1990 | 70231397 | Ganglia |
| 2 | 1 | 670 | Control | 2007-05-20 | 2007 | 14167485 | Pineal gland |
| 3 | 1 | 671 | Case | 2007-05-20 | 2007 | 79245111 | Pineal gland |
| 4 | 2 | 261 | Control | 1988-05-29 | 1988 | 45595707 | Ganglia |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2087 | 520 | 600 | Case | 1990-10-25 | 1990 | 24104166 | Pineal gland |
| 2088 | 521 | 911 | Control | 1995-08-06 | 1995 | 00470292 | Cochlea |
| 2089 | 521 | 912 | Case | 1995-08-08 | 1995 | 75388669 | Cochlea |
| 2090 | 522 | 971 | Control | 2003-04-15 | 2003 | 18131697 | Epididymis |


Sort groups by year of birth

In [67]:

# sort the rows by year of birth, then by pair, and renumber the row index
fake_data_df = fake_data_df.sort_values(by=["year", "pair_ID"]).reset_index(drop=True)
fake_data_df

| | pair_ID | specimen_ID | object | date_of_birth | year | barcode | organ |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 259 | Control | 1982-10-21 | 1982 | 73474296 | Ganglia |
| 1 | 2 | 260 | Case | 1982-10-21 | 1982 | 88135328 | Ganglia |
| 2 | 43 | 175 | Control | 1982-04-29 | 1982 | 70297355 | Lymph node |
| 3 | 43 | 176 | Case | 1982-05-03 | 1982 | 79049610 | Lymph node |
| 4 | 62 | 330 | Control | 1982-05-28 | 1982 | 34809341 | Pineal gland |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2087 | 428 | 611 | Case | 2008-03-02 | 2008 | 41413920 | Pineal gland |
| 2088 | 447 | 754 | Control | 2008-02-16 | 2008 | 36346493 | Pineal gland |
| 2089 | 447 | 755 | Case | 2008-02-18 | 2008 | 49420029 | Pineal gland |
| 2090 | 453 | 833 | Control | 2008-03-04 | 2008 | 15612373 | Ganglia |


In [68]:
if saveop:
    # write the same table to Excel and CSV; the "data/" folder must exist,
    # and writing .xlsx requires an Excel engine such as openpyxl
    basename = f"data/fake_case_control_Npairs_{n_pairs}_Ngroups_{n_groups}"
    fake_data_df.to_excel(basename + ".xlsx")
    fake_data_df.to_csv(basename + ".csv")
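One detail to watch when reading the exported CSV back in: barcodes are text, and some start with zeros (for example `00470292` in the table above). The sketch below uses only the standard library and two hand-written rows in the exported layout, so the CSV text itself is an assumption made for illustration. It shows that the leading zeros survive as long as the field stays a string, and are lost once it is converted to a number.

```python
import csv
import io

# two rows in the exported layout; the first (index) column has an empty header
csv_text = (
    ",pair_ID,specimen_ID,object,date_of_birth,year,barcode,organ\n"
    "0,521,911,Control,1995-08-06,1995,00470292,Cochlea\n"
    "1,521,912,Case,1995-08-08,1995,75388669,Cochlea\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

barcode_text = rows[0]["barcode"]   # csv returns every field as a string
barcode_number = int(barcode_text)  # numeric conversion drops the leading zeros

print(barcode_text, barcode_number)
```

When reading the file with pandas instead, passing `dtype={"barcode": str}` to `pd.read_csv` should keep the column as text in the same way.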