This project follows the reference **Machine learning predicts live-birth
occurrence before in-vitro fertilization treatment**

by Ashish Goyal, Maheshwar Kuchana & Kameswari Prasada Rao Ayyagari

GitHub link: https://github.com/zain3ie/pm_dev

In [54]:
import numpy as np
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics

In [2]:
df1 = pd.read_excel('../data/ar-2010-2014-xlsb.xlsb', engine='pyxlsb')

In [3]:
df2 = pd.read_excel('../data/ar-2015-2016-xlsb.xlsb', engine='pyxlsb')

In [4]:
df = pd.concat([df1.copy(), df2.copy()])
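
A side note on the concatenation: by default `pd.concat` keeps each frame's original row labels, which is why the `df.info()` output further down reports 495628 entries with an index that only runs from 0 to 158518. The sketch below uses two toy frames (`a` and `b` are illustrative stand-ins, not project data) to show the effect and the `ignore_index=True` option, which would give unique labels if that is wanted.

```python
import pandas as pd

# Toy frames standing in for df1 and df2 (illustrative data only).
a = pd.DataFrame({"value": [10, 20, 30]})
b = pd.DataFrame({"value": [40, 50]})

# Default behaviour: row labels are kept, so 0 and 1 appear twice.
kept = pd.concat([a, b])
print(kept.index.tolist())        # [0, 1, 2, 0, 1]
print(kept.index.is_unique)       # False

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```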

In [5]:
df.head()

Unnamed: 0,Patient Age at Treatment,Date patient started trying to become pregnant OR date of last pregnancy,"Total Number of Previous cycles, Both IVF and DI","Total Number of Previous treatments, Both IVF and DI at clinic",Total Number of Previous IVF cycles,Total Number of Previous DI cycles,"Total number of previous pregnancies, Both IVF and DI",Total number of IVF pregnancies,Total number of DI pregnancies,Total number of live births - conceived through IVF or DI,...,Heart Three Birth Weight,Heart Three Sex,Heart Three Delivery Date,Heart Three Birth Congenital Abnormalities,Heart Four Weeks Gestation,Heart Four Birth Outcome,Heart Four Birth Weight,Heart Four Sex,Heart Four Delivery Date,Heart Four Birth Congenital Abnormalities
0,18 - 34,,1,1,0,1,0,0,0,0,...,,,,,,,,,,
1,35-37,,0,0,0,0,0,0,0,0,...,,,,,,,,,,
2,18 - 34,,0,0,0,0,0,0,0,0,...,,,,,,,,,,
3,38-39,,1,1,0,1,0,0,0,0,...,,,,,,,,,,
4,35-37,,0,0,0,0,0,0,0,0,...,,,,,,,,,,


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 495628 entries, 0 to 158518
Data columns (total 95 columns):
 #   Column                                                                    Non-Null Count   Dtype  
---  ------                                                                    --------------   -----  
 0   Patient Age at Treatment                                                  495628 non-null  object 
 1   Date patient started trying to become pregnant OR date of last pregnancy  12891 non-null   float64
 2   Total Number of Previous cycles, Both IVF and DI                          495628 non-null  object 
 3   Total Number of Previous treatments, Both IVF and DI at clinic            495628 non-null  object 
 4   Total Number of Previous IVF cycles                                       495628 non-null  object 
 5   Total Number of Previous DI cycles                                        495628 non-null  object 
 6   Total number of previous pregnancies, Both IVF and D

## Preprocessing

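There are a great many features. We reduce them to the 25 features used in the reference; according to the reference, the other features have no significant effect on the outcome. The list below also includes the label **Live Birth Occurrence** and the auxiliary column **Number of Live Births**, for 27 columns in total.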

In [7]:
ref_feature = [
    'Patient Age at Treatment',
    'Total Number of Previous IVF cycles',
    'Total number of IVF pregnancies',
    'Total number of live births - conceived through IVF',
    'Type of Infertility - Female Primary',
    'Type of Infertility - Female Secondary',
    'Type of Infertility - Male Primary',
    'Type of Infertility - Male Secondary',
    'Type of Infertility -Couple Primary',
    'Type of Infertility -Couple Secondary',
    'Cause  of Infertility - Tubal disease',
    'Cause of Infertility - Ovulatory Disorder',
    'Cause of Infertility - Male Factor',
    'Cause of Infertility - Patient Unexplained',
    'Cause of Infertility - Endometriosis',
    'Cause of Infertility - Cervical factors',
    'Cause of Infertility - Partner Sperm Concentration',
    'Cause of Infertility -  Partner Sperm Morphology',
    'Causes of Infertility - Partner Sperm Motility',
    'Fresh Cycle',
    'Frozen Cycle',
    'Eggs Thawed',
    'Fresh Eggs Collected',
    'Eggs Mixed With Partner Sperm',
    'Embryos Transfered',
    'Live Birth Occurrence',
    'Number of Live Births'
]

In [8]:
ref_df = df[ref_feature].copy()
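
The column names in the source file have irregular spacing (for example the double space in `'Cause  of Infertility - Tubal disease'`), so the selection above only works with the exact strings. As an optional alternative, not what this notebook does, the names could be normalised first. The sketch below runs on a toy frame; `toy` and `normalized` are hypothetical names.

```python
import pandas as pd

# Toy frame whose column names mimic the irregular spacing in the data.
toy = pd.DataFrame({
    "Cause  of Infertility - Tubal disease": [0, 1],
    "Cause of Infertility -  Partner Sperm Morphology": [0, 0],
    "Fresh Cycle": [1.0, None],
})

# Collapse runs of whitespace into one space and trim both ends.
normalized = toy.copy()
normalized.columns = (
    normalized.columns.str.replace(r"\s+", " ", regex=True).str.strip()
)

print(list(normalized.columns))
```

If names are normalised this way, every later cell has to use the normalised spelling as well.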

In [9]:
ref_df.head()

Unnamed: 0,Patient Age at Treatment,Total Number of Previous IVF cycles,Total number of IVF pregnancies,Total number of live births - conceived through IVF,Type of Infertility - Female Primary,Type of Infertility - Female Secondary,Type of Infertility - Male Primary,Type of Infertility - Male Secondary,Type of Infertility -Couple Primary,Type of Infertility -Couple Secondary,...,Cause of Infertility - Partner Sperm Morphology,Causes of Infertility - Partner Sperm Motility,Fresh Cycle,Frozen Cycle,Eggs Thawed,Fresh Eggs Collected,Eggs Mixed With Partner Sperm,Embryos Transfered,Live Birth Occurrence,Number of Live Births
0,18 - 34,0,0,0,0,0,0,0,0,0,...,0,0,,,,,,,1.0,2
1,35-37,0,0,0,0,0,0,0,0,0,...,0,0,,,,,,,,0
2,18 - 34,0,0,0,0,0,0,0,0,0,...,0,0,,,,,,,1.0,2
3,38-39,0,0,0,0,0,0,0,0,0,...,0,0,,,,,,,,0
4,35-37,0,0,0,0,0,0,0,0,0,...,0,0,,,,,,,,0


In [10]:
ref_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 495628 entries, 0 to 158518
Data columns (total 27 columns):
 #   Column                                               Non-Null Count   Dtype  
---  ------                                               --------------   -----  
 0   Patient Age at Treatment                             495628 non-null  object 
 1   Total Number of Previous IVF cycles                  495628 non-null  object 
 2   Total number of IVF pregnancies                      495628 non-null  object 
 3   Total number of live births - conceived through IVF  495628 non-null  int64  
 4   Type of Infertility - Female Primary                 495628 non-null  int64  
 5   Type of Infertility - Female Secondary               495628 non-null  int64  
 6   Type of Infertility - Male Primary                   495628 non-null  int64  
 7   Type of Infertility - Male Secondary                 495628 non-null  int64  
 8   Type of Infertility -Couple Primary                  4

### Feature : Patient Age at Treatment

The **Patient Age at Treatment** feature is converted to ordinal category codes using the following rules:

1. 18–34 is converted to 0
2. 35–37 is converted to 1
3. 38–39 is converted to 2
4. 40–42 is converted to 3
5. 43–44 is converted to 4
6. 45–50 is converted to 5

Rows whose value is 999 are dropped.

In [11]:
prep_df = ref_df.copy()

In [12]:
prep_df['Patient Age at Treatment'].unique()

array(['18 - 34', '35-37', '38-39', '45-50', '40-42', '43-44', '999'],
      dtype=object)

In [13]:
prep_df = prep_df[prep_df['Patient Age at Treatment'] != '999']

In [14]:
age_mapping = {
    '18 - 34': 0,
    '35-37': 1,
    '38-39': 2,
    '40-42': 3,
    '43-44': 4,
    '45-50': 5
}

prep_df['Patient Age at Treatment'] = prep_df['Patient Age at Treatment'].map(age_mapping)

In [15]:
prep_df['Patient Age at Treatment'].unique()

array([0, 1, 2, 5, 3, 4])
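
`Series.map` silently turns any value that is missing from the dictionary into NaN, so it can be worth guarding the encoding step. The following is a minimal sketch, assuming the same `age_mapping` as above; the helper `encode_age` is a hypothetical name and not part of the notebook.

```python
import pandas as pd

age_mapping = {"18 - 34": 0, "35-37": 1, "38-39": 2,
               "40-42": 3, "43-44": 4, "45-50": 5}

def encode_age(series: pd.Series, mapping: dict) -> pd.Series:
    """Map age bands to ordinal codes, raising if a band is not in the mapping."""
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"unmapped age bands: {list(unmapped)}")
    return encoded.astype(int)

ages = pd.Series(["18 - 34", "45-50", "38-39"])
print(encode_age(ages, age_mapping).tolist())  # [0, 5, 2]
```

With this guard, a leftover `'999'` or a differently spelled band would raise an error instead of producing NaN.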

### Label : Live Birth Occurrence

The target/label of this dataset is **Live Birth Occurrence**. The column contains only **1** and **nan**, meaning:
**1** = the IVF treatment succeeded
**nan** = the IVF treatment did not succeed

However, *nan* is ambiguous here: some *nan* values could be genuinely missing labels. The label is therefore cross-checked against **Number of Live Births**: if **Number of Live Births** > 0, then **Live Birth Occurrence** = 1, otherwise 0.

In [16]:
prep_df['Live Birth Occurrence'].unique()

array([ 1., nan])

In [17]:
prep_df['Number of Live Births'].unique()

array([2, 0, 1, 3, 4])

In [18]:
# check whether any row has Live Birth Occurrence = 1 but Number of Live Births = 0
len(prep_df[
    (prep_df['Live Birth Occurrence'] == 1)
    & (prep_df['Number of Live Births'] == 0)
])

0

In [19]:
# check whether any row has Live Birth Occurrence = nan but Number of Live Births > 0
len(prep_df[
    (prep_df['Live Birth Occurrence'].isnull())
    & (prep_df['Number of Live Births'] > 0)
])

0

The checks above show that the two columns agree on every row, so no row lacks a label. All that remains is to replace **nan** with **0**.

In [20]:
prep_df['Live Birth Occurrence'] = prep_df['Live Birth Occurrence'].fillna(0)

In [21]:
prep_df['Live Birth Occurrence'].unique()

array([1., 0.])
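
Since the two columns agree, the same label could also be derived from **Number of Live Births** alone. The sketch below illustrates that equivalence on a toy frame (the data is made up for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Live Birth Occurrence": [1.0, np.nan, 1.0, np.nan],
    "Number of Live Births": [2, 0, 1, 0],
})

# Label derived from the count alone.
from_count = (toy["Number of Live Births"] > 0).astype(int)
# Label obtained as in the notebook, by filling missing values with 0.
from_fill = toy["Live Birth Occurrence"].fillna(0).astype(int)

print(from_count.tolist())           # [1, 0, 1, 0]
print(from_count.equals(from_fill))  # True
```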

In [22]:
# drop the Number of Live Births column, since Live Birth Occurrence already represents it
prep_df = prep_df.drop('Number of Live Births', axis=1)

### Incomplete data

In [23]:
prep_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 489787 entries, 0 to 158518
Data columns (total 26 columns):
 #   Column                                               Non-Null Count   Dtype  
---  ------                                               --------------   -----  
 0   Patient Age at Treatment                             489787 non-null  int64  
 1   Total Number of Previous IVF cycles                  489787 non-null  object 
 2   Total number of IVF pregnancies                      489787 non-null  object 
 3   Total number of live births - conceived through IVF  489787 non-null  int64  
 4   Type of Infertility - Female Primary                 489787 non-null  int64  
 5   Type of Infertility - Female Secondary               489787 non-null  int64  
 6   Type of Infertility - Male Primary                   489787 non-null  int64  
 7   Type of Infertility - Male Secondary                 489787 non-null  int64  
 8   Type of Infertility -Couple Primary                  4

The output above shows that several features are incomplete, all with the same number of missing values:
1. Fresh Cycle
2. Frozen Cycle
3. Eggs Thawed
4. Fresh Eggs Collected
5. Eggs Mixed With Partner Sperm
6. Embryos Transfered

In [24]:
# check whether the null values of all six columns sit at the same index labels
(
    list(prep_df[prep_df['Fresh Cycle'].isnull()].index)
    == list(prep_df[prep_df['Frozen Cycle'].isnull()].index)
    == list(prep_df[prep_df['Eggs Thawed'].isnull()].index)
    == list(prep_df[prep_df['Fresh Eggs Collected'].isnull()].index)
    == list(prep_df[prep_df['Eggs Mixed With Partner Sperm'].isnull()].index)
    == list(prep_df[prep_df['Embryos Transfered'].isnull()].index)
)

True

The check above confirms that the incomplete values sit at the same index labels in all six features.
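
Because the concatenated frame contains repeated index labels, comparing lists of labels is a slightly weaker test than comparing rows by position. A row-wise variant could look like the sketch below; it uses a toy frame with only three of the six columns, and `null_mask` and `same_rows` are hypothetical names.

```python
import numpy as np
import pandas as pd

cols = ["Fresh Cycle", "Frozen Cycle", "Eggs Thawed"]  # subset, for brevity

toy = pd.DataFrame(
    {
        "Fresh Cycle": [1.0, np.nan, 0.0],
        "Frozen Cycle": [0.0, np.nan, 1.0],
        "Eggs Thawed": [0.0, np.nan, 3.0],
    },
    index=[0, 0, 1],  # duplicate labels, as produced by pd.concat
)

null_mask = toy[cols].isnull()
# A row is consistent if it is null in all of the columns or in none of them.
same_rows = bool((null_mask.any(axis=1) == null_mask.all(axis=1)).all())
print(same_rows)  # True
```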

In [25]:
# number of incomplete rows
len(prep_df[prep_df['Fresh Cycle'].isnull()])

32270

In [26]:
# percentage of incomplete rows
len(prep_df[prep_df['Fresh Cycle'].isnull()]) / len(prep_df) * 100

6.58857830036322

Because the share of incomplete rows is fairly small (about 6.6%), the incomplete rows are simply dropped to keep things easy :)

In [27]:
prep_df = prep_df[prep_df['Fresh Cycle'].notnull()]

In [28]:
prep_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457517 entries, 8214 to 148105
Data columns (total 26 columns):
 #   Column                                               Non-Null Count   Dtype  
---  ------                                               --------------   -----  
 0   Patient Age at Treatment                             457517 non-null  int64  
 1   Total Number of Previous IVF cycles                  457517 non-null  object 
 2   Total number of IVF pregnancies                      457517 non-null  object 
 3   Total number of live births - conceived through IVF  457517 non-null  int64  
 4   Type of Infertility - Female Primary                 457517 non-null  int64  
 5   Type of Infertility - Female Secondary               457517 non-null  int64  
 6   Type of Infertility - Male Primary                   457517 non-null  int64  
 7   Type of Infertility - Male Secondary                 457517 non-null  int64  
 8   Type of Infertility -Couple Primary                

### Remaining object-typed columns

In [29]:
prep_df['Total Number of Previous IVF cycles'].unique()

array(['3', '2', '0', '1', '4', '>=5', '5'], dtype=object)

In [30]:
prep_df['Total number of IVF pregnancies'].unique()

array([1, 0, 2, 5, 4, 3, '0', '1', '2', '4', '3', '5', '>=5'],
      dtype=object)

In [31]:
prep_df['Fresh Eggs Collected'].unique()

array(['0', '19', '13', '9', '2', '6', '10', '7', '1', '16', '15', '11',
       '8', '26', '14', '22', '3', '5', '21', '18', '4', '28', '20', '12',
       '17', '29', '27', '24', '23', '25', '30', '31', '38', '35', '32',
       '40', '36', '33', '34', '42', '39', '44', '37', '43', '45', '50',
       '47', '46', '> 50', '41', '49', '48'], dtype=object)

In [32]:
prep_df['Eggs Mixed With Partner Sperm'].unique()

array(['0', '18', '13', '9', '2', '6', '10', '16', '15', '14', '8', '4',
       '26', '12', '11', '22', '5', '21', '7', '19', '28', '17', '3',
       '20', '23', '29', '1', '27', '25', '24', '35', '39', '36', '34',
       '30', '42', '37', '33', '32', '38', '31', '50', '47', '46', '41',
       '> 50', '44', '40', '43', '49', '45', '48'], dtype=object)

In [33]:
len(prep_df[prep_df['Total Number of Previous IVF cycles'] == '>=5'])

14067

In [34]:
len(prep_df[prep_df['Total Number of Previous IVF cycles'] == '>=5']) / len(prep_df) * 100

3.074639849448217

In [35]:
len(prep_df[prep_df['Total number of IVF pregnancies'] == '>=5'])

1

In [36]:
len(prep_df[prep_df['Fresh Eggs Collected'] == '> 50'])

89

In [37]:
len(prep_df[prep_df['Eggs Mixed With Partner Sperm'] == '> 50'])

36

Rows where **Total Number of Previous IVF cycles** is **>=5** make up only about 3% of the data, and the other three features that contain a '>' value affect very few rows (1, 89, and 36 respectively). With no obvious way to convert these values, the rows are once again simply dropped :)

In [38]:
prep_df = prep_df[prep_df['Total Number of Previous IVF cycles'] != '>=5']
prep_df = prep_df[prep_df['Total number of IVF pregnancies'] != '>=5']
prep_df = prep_df[prep_df['Fresh Eggs Collected'] != '> 50']
prep_df = prep_df[prep_df['Eggs Mixed With Partner Sperm'] != '> 50']

In [39]:
# convert the object-typed columns to integers
prep_df['Total Number of Previous IVF cycles'] = pd.to_numeric(prep_df['Total Number of Previous IVF cycles'])
prep_df['Total number of IVF pregnancies'] = pd.to_numeric(prep_df['Total number of IVF pregnancies'])
prep_df['Fresh Eggs Collected'] = pd.to_numeric(prep_df['Fresh Eggs Collected'])
prep_df['Eggs Mixed With Partner Sperm'] = pd.to_numeric(prep_df['Eggs Mixed With Partner Sperm'])
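
As an alternative to dropping the censored values, they could be capped at their threshold so that `'>=5'` becomes 5 and `'> 50'` becomes 50. This is not what the notebook does and it changes the meaning of the top value ("at least 5"), so treat it as a sketch only. The helper `parse_capped` is hypothetical and assumes every value is an integer, an integer string, or a threshold string.

```python
import pandas as pd

def parse_capped(series: pd.Series) -> pd.Series:
    """Convert values such as 3, '3', '>=5' or '> 50' to integers by
    keeping the first run of digits, so a censored value becomes its threshold."""
    digits = series.astype(str).str.extract(r"(\d+)")[0]
    return pd.to_numeric(digits).astype(int)

# Mixed ints and strings, as seen in 'Total number of IVF pregnancies'.
raw = pd.Series([1, "0", "2", ">=5", "> 50"], dtype=object)
print(parse_capped(raw).tolist())  # [1, 0, 2, 5, 50]
```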

In [40]:
prep_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 443361 entries, 8214 to 148105
Data columns (total 26 columns):
 #   Column                                               Non-Null Count   Dtype  
---  ------                                               --------------   -----  
 0   Patient Age at Treatment                             443361 non-null  int64  
 1   Total Number of Previous IVF cycles                  443361 non-null  int64  
 2   Total number of IVF pregnancies                      443361 non-null  int64  
 3   Total number of live births - conceived through IVF  443361 non-null  int64  
 4   Type of Infertility - Female Primary                 443361 non-null  int64  
 5   Type of Infertility - Female Secondary               443361 non-null  int64  
 6   Type of Infertility - Male Primary                   443361 non-null  int64  
 7   Type of Infertility - Male Secondary                 443361 non-null  int64  
 8   Type of Infertility -Couple Primary                

## Splitting data

In [41]:
x = prep_df.drop(['Live Birth Occurrence'], axis = 1)
y = prep_df['Live Birth Occurrence'].astype(int)

In [42]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=6)
x_test, x_valid, y_test, y_valid = train_test_split(x_test, y_test, test_size=0.2, random_state=6)
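
The two calls above give roughly 80% train, 16% test, and 4% validation (354688, 70938, and 17735 of 443361 rows). Since only about a quarter of the rows are positive, a stratified split is one option for keeping the class ratio identical across the three parts. The sketch below uses synthetic data and adds `stratify`, which the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 25% positives (illustrative only).
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 250 + [0] * 750)

# First split: 80% train, 20% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=6, stratify=y
)
# Second split of the held-out part: 80% test, 20% validation,
# i.e. 16% and 4% of the full data.
X_test, X_valid, y_test, y_valid = train_test_split(
    X_hold, y_hold, test_size=0.2, random_state=6, stratify=y_hold
)

print(len(X_train), len(X_test), len(X_valid))  # 800 160 40
print(y_train.mean(), y_test.mean(), y_valid.mean())
```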

In [43]:
len(x_train)

354688

In [44]:
len(x_test)

70938

In [45]:
len(x_valid)

17735

In [46]:
len(prep_df[prep_df['Live Birth Occurrence'] == 0])

331131

In [47]:
len(prep_df[prep_df['Live Birth Occurrence'] == 1])

112230

In [48]:
len(prep_df[prep_df['Live Birth Occurrence'] == 1]) / len(prep_df)

0.2531345788195173
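
The class counts above also give a useful reference point: the accuracy of a trivial model that always predicts "no live birth". The arithmetic, using only the counts printed above:

```python
# Class counts reported in the two cells above.
n_negative = 331131
n_positive = 112230
n_total = n_negative + n_positive

positive_rate = n_positive / n_total
# Accuracy of a trivial model that always predicts class 0.
majority_baseline = n_negative / n_total

print(f"{positive_rate:.3f}")      # 0.253
print(f"{majority_baseline:.3f}")  # 0.747
```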

## Modelling
Reference: https://www.kaggle.com/code/funxexcel/p3-random-forest-tuning-randomizedsearchcv/notebook


In [49]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(10, 80, 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt')
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [2, 4]
# Minimum number of samples required to split a node
min_samples_split = [2, 5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the param grid
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}

rf_Model = RandomForestClassifier()

rf_RandomGrid = RandomizedSearchCV(estimator=rf_Model, param_distributions=param_grid,
                                   scoring='f1_micro', cv=10, verbose=2, n_jobs=4)

In [50]:
rf_RandomGrid.fit(x_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


RandomizedSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=4,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [2, 4],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2],
                                        'min_samples_split': [2, 5],
                                        'n_estimators': [10, 17, 25, 33, 41, 48,
                                                         56, 64, 72, 80]},
                   scoring='f1_micro', verbose=2)

In [51]:
rf_RandomGrid.best_params_

{'n_estimators': 25,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 4,
 'bootstrap': False}
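
The search above samples 10 candidates at random (the default `n_iter`) and sets no seed, so a rerun may select different parameters. Below is a minimal sketch of a repeatable variant on a small synthetic problem. The data, the reduced grid, and the `random_state` values are assumptions for illustration, and `'auto'` is left out because recent scikit-learn releases no longer accept it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": [10, 25, 40],
    "max_features": ["sqrt"],
    "max_depth": [2, 4],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # number of sampled candidates (default is 10)
    scoring="f1_micro",
    cv=3,
    random_state=0,      # makes the sampled candidates repeatable
)
search.fit(X, y)
print(search.best_params_)
```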

In [52]:
print (f'Train F1-score : {rf_RandomGrid.score(x_train, y_train):.3f}')
print (f'Test F1-score - : {rf_RandomGrid.score(x_test, y_test):.3f}')
print (f'Validate F1-score - : {rf_RandomGrid.score(x_valid, y_valid):.3f}')

Train F1-score : 0.746
Test F1-score - : 0.749
Validate F1-score - : 0.751
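
A caveat when reading these numbers: for a single-label problem, micro-averaged F1 (`scoring='f1_micro'`) is the same quantity as accuracy. The scores above (0.746 to 0.751) are close to the majority-class share of about 0.747 implied by the class counts, so it may be worth also checking the F1 of the positive class. The sketch below uses toy labels and an always-negative predictor to show how the two metrics can diverge.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with 25% positives and a model that always predicts 0.
y_true = np.array([0, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.zeros_like(y_true)

micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)
# F1 of the positive class only (the default, average="binary").
positive_f1 = f1_score(y_true, y_pred, zero_division=0)

print(micro_f1, accuracy)  # 0.75 0.75
print(positive_f1)         # 0.0
```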


In [78]:
# Dump model
joblib.dump(rf_RandomGrid, '../output/model/train/model.pkl')

['../output/model/train/model.pkl']
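
The dumped object can later be restored with `joblib.load` and used for prediction. The sketch below shows the round trip with a tiny stand-in model and a temporary file; the model, data, and path are illustrative and not the project's.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model (illustrative only).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y = np.array([0, 0, 1, 1] * 5)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical location
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored object predicts exactly like the original.
print(np.array_equal(model.predict(X), restored.predict(X)))  # True
```

A pickle written by joblib is generally only safe to load with the same scikit-learn version that produced it.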