## Project: Development of a reduced pediatric injury prediction model
Created by: Thomas Hartka, MD, MS  
Date created: 12/5/20  
  
This notebook combines the data from NASS and CISS. It also adds age groups and fold numbers for cross-validation. Fold numbers are assigned separately for NASS and CISS, so that the cases from each database are spread evenly across the folds and each database can also be evaluated separately.  

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
import itertools

## Read in NASS and CISS data

In [2]:
nass = pd.read_csv("../Data/NASS/NASSPeds-2000_2015.csv")
ciss = pd.read_csv("../Data/CISS/CISSPeds-2017_2018.csv")

# label the source dataset
nass['dataset'] = "NASS"
ciss['dataset'] = "CISS"

## Combine datasets

In [3]:
# combine datasets (DataFrame.append was removed in pandas 2.0)
peds = pd.concat([nass, ciss]).reset_index(drop=True)

In [4]:
# number of cases
print("Total cases: ", len(peds))
print("Total cases (weighted): ", peds.casewgt.sum())

# number of injury cases
print("ISS>=16: ", len(peds[peds.iss>=16]))
print("ISS>=16 (weighted): ", peds[peds.iss>=16].casewgt.sum())

# number of non-injury cases
print("ISS<16: ", len(peds[peds.iss<16]))
print("ISS<16 (weighted): ", peds[peds.iss<16].casewgt.sum())

Total cases:  13560
Total cases (weighted):  6225662.612867111
ISS>=16:  678
ISS>=16 (weighted):  79763.68413127595
ISS<16:  12882
ISS<16 (weighted):  6145898.928735835


In [5]:
# weighted proportion of cases with ISS<16
6145899/6225663

0.9871878705930597
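
The ratio above is typed in from the printed totals. It could also be computed directly from the data frame. The following is a minimal sketch on a small made-up frame that reuses the `iss` and `casewgt` column names; the values and the helper name `weighted_fraction` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def weighted_fraction(df, mask, weight_col="casewgt"):
    """Return the share of total case weight that falls under a boolean mask."""
    return df.loc[mask, weight_col].sum() / df[weight_col].sum()

# toy data, not taken from NASS or CISS
toy = pd.DataFrame({
    "iss": [1, 4, 17, 25, 9],
    "casewgt": [100.0, 50.0, 10.0, 5.0, 35.0],
})

# weighted proportion of cases with ISS<16
frac_not_severe = weighted_fraction(toy, toy["iss"] < 16)
print(frac_not_severe)
```

On the real data, the equivalent call would presumably be `weighted_fraction(peds, peds.iss < 16)`.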

## Add age groups
Based on the developmental designations by Doud et al.

In [6]:
# age groups
age_labels = ['0_4','5_9','10_14','15_18']

# cut based on groups
peds['age_cat']=pd.cut(peds['age'],bins=[-1,4,9,14,18],labels=age_labels)

# add one-hot encoding of ages
peds = peds.join(pd.get_dummies(peds.age_cat,prefix="age"))
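
To make the bin edges concrete, here is a small sketch on made-up ages (the `toy_` names are illustrative). `pd.cut` uses right-closed intervals by default, so the bins `[-1,4,9,14,18]` put ages 0 to 4 in the first group, and any age above 18 gets no category.

```python
import pandas as pd

age_labels = ['0_4', '5_9', '10_14', '15_18']
toy_ages = pd.Series([0, 4, 5, 9, 10, 14, 15, 18, 19])

# right-closed bins: (-1, 4], (4, 9], (9, 14], (14, 18]
toy_cat = pd.cut(toy_ages, bins=[-1, 4, 9, 14, 18], labels=age_labels)

# ages outside the bins (here, 19) become NaN
print(toy_cat.tolist())

# one-hot columns are named with the prefix, e.g. age_0_4
toy_dummies = pd.get_dummies(toy_cat, prefix="age")
print(list(toy_dummies.columns))
```

A row whose age falls outside every bin ends up with all of its one-hot columns set to zero.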

## Make sex binary (male=0, female=1)

In [7]:
peds['sex'] = peds.apply(lambda x: 1 if (x['sex']>=2) else 0, axis=1)

## Make variable for front row (versus all other rows)

In [8]:
peds['front_row'] = peds.apply(lambda x: 0 if (x['seat_row']>=2) else 1, axis=1)

## Add outcome flags 

In [9]:
# AIS 2+ 
peds['mais_head2'] = peds.apply(lambda x: 1 if (x['mais_head']>=2) else 0, axis=1)
peds['mais_thorax2'] = peds.apply(lambda x: 1 if (x['mais_thorax']>=2) else 0, axis=1)
peds['mais_abd2'] = peds.apply(lambda x: 1 if (x['mais_abd']>=2) else 0, axis=1)
peds['mais2'] = peds.apply(lambda x: 1 if ((x['mais_head']>=2)|(x['mais_thorax']>=2)|(x['mais_abd']>=2)) else 0, axis=1)

# AIS 3+ 
peds['mais_head3'] = peds.apply(lambda x: 1 if (x['mais_head']>=3) else 0, axis=1)
peds['mais_thorax3'] = peds.apply(lambda x: 1 if (x['mais_thorax']>=3) else 0, axis=1)
peds['mais_abd3'] = peds.apply(lambda x: 1 if (x['mais_abd']>=3) else 0, axis=1)
peds['mais3'] = peds.apply(lambda x: 1 if ((x['mais_head']>=3)|(x['mais_thorax']>=3)|(x['mais_abd']>=3)) else 0, axis=1)

# ISS 24+
peds['iss24'] = peds.apply(lambda x: 1 if (x['iss']>=24) else 0, axis=1)
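
The row-wise `apply` calls above work, but they can be slow on larger frames. One possible alternative is to vectorize the comparisons. The sketch below uses a made-up frame with the same `mais_*` column names. Missing values compare as False and therefore map to 0, which matches the behavior of the lambdas.

```python
import numpy as np
import pandas as pd

# toy data, not taken from NASS or CISS
toy = pd.DataFrame({
    "mais_head":   [0, 2, 3, np.nan],
    "mais_thorax": [1, 0, 0, 2],
    "mais_abd":    [0, 0, 1, np.nan],
})

regions = ["mais_head", "mais_thorax", "mais_abd"]

# per-region AIS 2+ flags
for col in regions:
    toy[col + "2"] = (toy[col] >= 2).astype(int)

# AIS 2+ in any of the three regions
toy["mais2"] = (toy[regions] >= 2).any(axis=1).astype(int)
print(toy)
```

The same pattern would apply to the AIS 3+ and ISS flags by changing the threshold.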

## Add folds
Assign folds within each age group and dataset so that each fold contains approximately the same number of cases from each age group.  This is done separately for each dataset.  Folds are created for 10-fold and 5-fold cross-validation; the 5-fold number is the 10-fold number modulo 5.

In [10]:
num_folds = 10

# initialize fold column (-1 = not yet assigned)
folds = pd.Series(-1, index=peds.index)

# set up k-fold generator
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# loop through each age group and dataset
for ages in age_labels:
    for db in ['NASS','CISS']:
        # get subset of data to create folds
        dat_subset = peds[(peds['age_cat']==ages) & (peds['dataset']==db)]
    
        # get indices for samples
        sub_idx = peds.index[(peds['age_cat']==ages) & (peds['dataset']==db)]
    
        # iterate through the folds and assign the fold number to each fold's held-out cases
        for i,row_list in enumerate(kf.split(dat_subset)):
            folds.loc[sub_idx[row_list[1]]] = i
    
peds['fold10x'] = folds
peds['fold5x'] = folds % 5
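
As a sanity check on this scheme, the sketch below repeats the assignment on synthetic data and confirms that fold sizes differ by at most one case within each age group and dataset. The stratum sizes (37 and 13) and the `toy` frame are assumptions chosen for illustration, and `groupby` is used here in place of the nested loops.

```python
import pandas as pd
from sklearn.model_selection import KFold

age_labels = ['0_4', '5_9', '10_14', '15_18']

# synthetic strata: 37 NASS and 13 CISS cases per age group
rows = []
for ages in age_labels:
    rows += [(ages, "NASS")] * 37 + [(ages, "CISS")] * 13
toy = pd.DataFrame(rows, columns=["age_cat", "dataset"])

folds = pd.Series(-1, index=toy.index)  # -1 = not yet assigned
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# assign folds within each (age group, dataset) stratum
for _, grp in toy.groupby(["age_cat", "dataset"]):
    for i, (_, test_pos) in enumerate(kf.split(grp)):
        folds.loc[grp.index[test_pos]] = i

toy["fold10x"] = folds
toy["fold5x"] = folds % 5

# cases per fold within each stratum
counts = toy.groupby(["age_cat", "dataset", "fold10x"]).size().unstack(fill_value=0)
spread = counts.max(axis=1) - counts.min(axis=1)
print(counts)
print("unassigned rows:", (toy["fold10x"] < 0).sum())
```

One caveat: a row that never receives a fold (for example, an age outside the bins) keeps -1 in `fold10x`, and because `-1 % 5` evaluates to 4 in Python and pandas, it would land in fold 4 of `fold5x`. Checking for negative values in `fold10x` before saving would catch this.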

## Store data

In [11]:
peds.to_csv("../Data/Peds-2010_2018.csv", index=False)