# Milestone: Feature Ready

This notebook walks through the steps for feature selection, feature engineering, and modeling on the IVF dataset. The data visualization used for feature selection is done in a separate notebook. The process here runs up to the feature-ready stage.

## Import Library

In [1]:
import pandas as pd
import numpy as np
import joblib

%matplotlib inline
pd.set_option('display.max_rows', 500)

## Data Pre-Processing

### Read Files

In [2]:
df1 = pd.read_excel('../data/ar-2010-2014-xlsb.xlsb', engine='pyxlsb')
df2 = pd.read_excel('../data/ar-2015-2016-xlsb.xlsb', engine='pyxlsb')

### Combining the df1 and df2 DataFrames

The dataset above is split in two: df1 (data from 2010 through 2014) and df2 (data from 2015 through 2016).
In this step we combine df1 and df2. Both have the same features (a check of the columns in df1 and df2 showed that the two DataFrames share the same columns). This is how we combine them.

In [4]:
df = pd.concat([df1, df2])
joblib.dump(df, '../output/data_concat.pkl')

['../output/data_concat.pkl']
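The column check mentioned above is not shown in the notebook. A minimal sketch of one way to do it, using small made-up frames (`df1_demo` and `df2_demo` are hypothetical stand-ins, not the IVF data), could look like this. Passing `ignore_index=True` is an optional assumption here; the notebook's own `pd.concat` call keeps the original row labels.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (hypothetical values, not the IVF data)
df1_demo = pd.DataFrame({'Patient Age at Treatment': ['18 - 34'],
                         'Sperm From': ['Partner']})
df2_demo = pd.DataFrame({'Patient Age at Treatment': ['35-37'],
                         'Sperm From': ['Donor']})

# Columns present in only one of the two frames (an empty set means they match)
mismatch = set(df1_demo.columns) ^ set(df2_demo.columns)
assert not mismatch, f"Column mismatch: {mismatch}"

# ignore_index=True gives the combined frame a fresh 0..n-1 index
df_demo = pd.concat([df1_demo, df2_demo], ignore_index=True)
print(df_demo.shape)
```

Without `ignore_index=True`, row labels from both source frames are kept, so the combined index can contain duplicates.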

### Feature Selection and Engineering

Functions for feature selection on the specified columns, and for feature engineering: type casting and creating dummy variables.

In [4]:
def feature_selection(df, column_use):

    # keep only the specified columns
    df = df[column_use]

    # drop observations with missing values
    df = df.dropna()

    return df

def feature_engineering(df, patient_age_value, to_int_feature, to_dummy_feature):

    # convert patient age at treatment into an ordinal value
    df['Patient Age at Treatment'] = df['Patient Age at Treatment'].replace(patient_age_value)

    # replace the value '>=5' with 6
    for col in to_int_feature:
        df[col] = df[col].replace({'>=5':6})

    # convert features of type object to int
    for col in to_int_feature:
        df[col] = pd.to_numeric(df[col])

    # convert categorical features into dummy features
    for col in to_dummy_feature:
        dum = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df, dum], axis=1)
        df.drop(col, axis=1, inplace=True)

    return df
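To make the three transformations concrete, here is a small sketch on made-up rows. The column names follow the notebook, but the values and the `toy` frame are hypothetical. Unlike the functions above, it casts the count column to strings before replacing `'>=5'`, and it uses the `columns=` argument of `pd.get_dummies` instead of a manual concat; both are swapped-in variants chosen for brevity.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the notebook, values are made up
toy = pd.DataFrame({
    'Patient Age at Treatment': ['18 - 34', '40-42', '35-37'],
    'Total Number of Previous IVF cycles': [0, '>=5', 2],
    'Sperm From': ['Partner', 'Donor', 'Partner'],
})

age_map = {'18 - 34': 0, '35-37': 1, '38-39': 2,
           '40-42': 3, '43-44': 4, '45-50': 5}

# Ordinal encoding of the age bands
toy['Patient Age at Treatment'] = toy['Patient Age at Treatment'].map(age_map)

# Work on strings so the replacement is string-to-string,
# then cast the whole column to a numeric dtype ('>=5' becomes 6)
col = 'Total Number of Previous IVF cycles'
toy[col] = pd.to_numeric(toy[col].astype(str).replace('>=5', '6'))

# One-hot encode the categorical column; the original column is dropped
toy = pd.get_dummies(toy, columns=['Sperm From'])

print(toy.columns.tolist())
```

Note that coding `'>=5'` as 6 follows the notebook's choice; it treats the open-ended top category as a single numeric value.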

In [23]:
def feature_selection(params):
    df = joblib.load(params['out_path'])

    # keep only the specified columns
    df = df[params['col_use']]

    # drop observations with missing values
    df = df.dropna()

    return df

def feature_engineering(df, params):

    # convert patient age at treatment into an ordinal value
    df['Patient Age at Treatment'] = df['Patient Age at Treatment'].replace(params['patient_age_value'])

    # replace the value '>=5' with 6
    for col in params['to_int_feature']:
        df[col] = df[col].replace({'>=5':6})

    # convert features of type object to int
    for col in params['to_int_feature']:
        df[col] = pd.to_numeric(df[col])

    # convert categorical features into dummy features
    for col in params['to_dummy_feature']:
        dum = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df, dum], axis=1)
        df.drop(col, axis=1, inplace=True)

    # persist the engineered features for later steps
    joblib.dump(df, '../output/data_fe.pkl')

    return df

In [24]:
params = {'out_path' : '../output/data_concat.pkl',
          'col_use' : ['Patient Age at Treatment',
                     'Total Number of Previous cycles, Both IVF and DI',
                     'Total Number of Previous treatments, Both IVF and DI at clinic',
                     'Total Number of Previous IVF cycles',
                     'Total Number of Previous DI cycles',
                     'Total number of previous pregnancies, Both IVF and DI',
                     'Total number of IVF pregnancies', 'Total number of DI pregnancies',
                     'Total number of live births - conceived through IVF or DI',
                     'Total number of live births - conceived through IVF',
                     'Total number of live births - conceived through DI',
                     'Type of Infertility - Female Primary',
                     'Type of Infertility - Female Secondary',
                     'Type of Infertility - Male Primary',
                     'Type of Infertility - Male Secondary',
                     'Type of Infertility -Couple Primary',
                     'Type of Infertility -Couple Secondary',
                     'Cause  of Infertility - Tubal disease',
                     'Cause of Infertility - Ovulatory Disorder',
                     'Cause of Infertility - Male Factor',
                     'Cause of Infertility - Patient Unexplained',
                     'Cause of Infertility - Endometriosis',
                     'Stimulation used', 'Type of treatment - IVF or DI',
                     'Specific treatment type', 'Live Birth Occurrence',
                     'Sperm From', 'Number of Live Births',
                     'Number of foetal sacs with fetal pulsation'],
         'patient_age_value' : {'18 - 34':0, '35-37':1, '38-39':2, '40-42':3, '43-44':4, '45-50':5},
          'to_int_feature' : ['Patient Age at Treatment', 
                           'Total Number of Previous cycles, Both IVF and DI',
                           'Total Number of Previous treatments, Both IVF and DI at clinic', 
                           'Total Number of Previous IVF cycles',
                           'Total Number of Previous DI cycles',
                           'Total number of previous pregnancies, Both IVF and DI',
                           'Total number of IVF pregnancies',
                           'Total number of live births - conceived through IVF or DI'],
         'to_dummy_feature' : ['Type of treatment - IVF or DI', 
                             'Specific treatment type',
                             'Sperm From']}

In [25]:
df_fs = feature_selection(params)
feature_engineering(df_fs, params)

Unnamed: 0,Patient Age at Treatment,"Total Number of Previous cycles, Both IVF and DI","Total Number of Previous treatments, Both IVF and DI at clinic",Total Number of Previous IVF cycles,Total Number of Previous DI cycles,"Total number of previous pregnancies, Both IVF and DI",Total number of IVF pregnancies,Total number of DI pregnancies,Total number of live births - conceived through IVF or DI,Total number of live births - conceived through IVF,...,Specific treatment type_IVF / AH,Specific treatment type_IVF / BLASTOCYST,Specific treatment type_IVF:ICSI,Specific treatment type_IVF:IVF,Specific treatment type_IVF:Unknown,Specific treatment type_IVI,Specific treatment type_Unknown,Sperm From_Donor,Sperm From_Partner,Sperm From_not assigned
0,0,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
13,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
26,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
28,0,2,2,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158447,1,2,2,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
158475,2,3,3,0,3,1,0,1,1,0,...,0,0,0,0,0,0,0,1,0,0
158476,0,2,2,0,2,1,0,1,1,0,...,0,0,0,0,0,0,0,1,0,0
158479,0,2,2,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0



Clean the data so that it is ready for processing by applying feature selection and engineering.

In [5]:
col_use = ['Patient Age at Treatment',
       'Total Number of Previous cycles, Both IVF and DI',
       'Total Number of Previous treatments, Both IVF and DI at clinic',
       'Total Number of Previous IVF cycles',
       'Total Number of Previous DI cycles',
       'Total number of previous pregnancies, Both IVF and DI',
       'Total number of IVF pregnancies', 'Total number of DI pregnancies',
       'Total number of live births - conceived through IVF or DI',
       'Total number of live births - conceived through IVF',
       'Total number of live births - conceived through DI',
       'Type of Infertility - Female Primary',
       'Type of Infertility - Female Secondary',
       'Type of Infertility - Male Primary',
       'Type of Infertility - Male Secondary',
       'Type of Infertility -Couple Primary',
       'Type of Infertility -Couple Secondary',
       'Cause  of Infertility - Tubal disease',
       'Cause of Infertility - Ovulatory Disorder',
       'Cause of Infertility - Male Factor',
       'Cause of Infertility - Patient Unexplained',
       'Cause of Infertility - Endometriosis',
       'Stimulation used', 'Type of treatment - IVF or DI',
       'Specific treatment type', 'Live Birth Occurrence',
           'Sperm From', 'Number of Live Births',
       'Number of foetal sacs with fetal pulsation']

df_new = feature_selection(df, col_use)

In [6]:
age_replace = {'18 - 34':0, '35-37':1, '38-39':2, '40-42':3, '43-44':4, '45-50':5}

to_int = ['Patient Age at Treatment', 
          'Total Number of Previous cycles, Both IVF and DI',
          'Total Number of Previous treatments, Both IVF and DI at clinic', 
          'Total Number of Previous IVF cycles',
          'Total Number of Previous DI cycles',
          'Total number of previous pregnancies, Both IVF and DI',
          'Total number of IVF pregnancies',
         'Total number of live births - conceived through IVF or DI']

to_dummy = ['Type of treatment - IVF or DI', 
          'Specific treatment type',
          'Sperm From']

df_new = feature_engineering(df_new, age_replace, to_int, to_dummy)

In [7]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119210 entries, 0 to 158491
Data columns (total 54 columns):
 #   Column                                                          Non-Null Count   Dtype  
---  ------                                                          --------------   -----  
 0   Patient Age at Treatment                                        119210 non-null  int64  
 1   Total Number of Previous cycles, Both IVF and DI                119210 non-null  int64  
 2   Total Number of Previous treatments, Both IVF and DI at clinic  119210 non-null  int64  
 3   Total Number of Previous IVF cycles                             119210 non-null  int64  
 4   Total Number of Previous DI cycles                              119210 non-null  int64  
 5   Total number of previous pregnancies, Both IVF and DI           119210 non-null  int64  
 6   Total number of IVF pregnancies                                 119210 non-null  int64  
 7   Total number of DI pregnancies        

Split the data into train, validation, and test sets so that it is ready to be fed into the model.

In [8]:
from sklearn.model_selection import train_test_split

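def split(df, target, rand, testsize):
    # testsize is the share of rows for EACH of the validation and test sets
    y = df[target]
    X = df.loc[:, df.columns != target]

    # first cut: hold out 2 * testsize of the rows
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=testsize*2, random_state=rand)
    # second cut: halve the hold-out into validation and test
    X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=rand)

    return X_train, X_val, X_test, y_train, y_val, y_test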

In [9]:
target = 'Live Birth Occurrence'
rand = 42
test_size = 0.2

X_train, X_val, X_test, y_train, y_val, y_test = split(df_new, target, rand, test_size)
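With `test_size = 0.2`, the two-step split yields roughly 60% train, 20% validation, and 20% test. The sketch below checks this on a made-up array (`X` and `y` here are hypothetical, not the IVF features) using the same two `train_test_split` calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # hypothetical feature matrix
y = np.arange(1000) % 2              # hypothetical binary target
test_size = 0.2

# First cut: hold out 2 * test_size (40%) of the rows
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=test_size * 2, random_state=42)

# Second cut: halve the hold-out into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```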

Save the train, validation, and test data as pickle files in the output folder.

In [11]:
joblib.dump(X_train, '../output/X_train.pkl')
joblib.dump(X_val, '../output/X_val.pkl')
joblib.dump(X_test, '../output/X_test.pkl')
joblib.dump(y_train, '../output/y_train.pkl')
joblib.dump(y_val, '../output/y_val.pkl')
joblib.dump(y_test, '../output/y_test.pkl')

['../output/y_test.pkl']
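One way to keep each file name paired with its own object is to loop over a name-to-object mapping and then reload the files as a sanity check. The sketch below assumes small made-up splits and writes to a temporary directory; the `splits` dictionary and `out_dir` are hypothetical names, and the notebook itself writes to `../output`.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

# Hypothetical stand-ins for the six split objects
splits = {
    'X_train': pd.DataFrame({'a': [1, 2, 3]}),
    'X_val': pd.DataFrame({'a': [4]}),
    'X_test': pd.DataFrame({'a': [5]}),
    'y_train': pd.Series([0, 1, 0]),
    'y_val': pd.Series([1]),
    'y_test': pd.Series([0]),
}

# Temporary directory used only for this sketch
out_dir = Path(tempfile.mkdtemp())

# Each object is written under its own name
for name, obj in splits.items():
    joblib.dump(obj, str(out_dir / f'{name}.pkl'))

# Reload and confirm every file holds the object it is named after
reloaded = {name: joblib.load(str(out_dir / f'{name}.pkl')) for name in splits}
assert all(reloaded[name].equals(splits[name]) for name in splits)
```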