# DREAM Preterm Birth Prediction challenge (Boosting)

This notebook analyzes data from the "DREAM Preterm Birth Prediction challenge" competition. The goal is to predict gestational age (GA), the time elapsed since the mother's last menstrual period, measured in weeks, from blood gene expression data. The main difficulty is extracting the important features from the data, which has more than 30,000 columns (`g_1` through `g_32830`).

# Package

In [15]:
import pandas as pd

from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# Data

In [3]:
alldata = pd.read_csv("alldata.csv")
alldata.head()

| Unnamed: 0.1 | Unnamed: 0 | g_1 | g_2 | g_3 | g_4 | g_5 | g_6 | g_7 | g_8 | g_9 | ... | g_32827 | g_32828 | g_32829 | g_32830 | SampleID | GA | Batch | Set | Train | Platform |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6.062215 | 3.796484 | 5.849338 | 3.567779 | 6.166815 | 4.443027 | 5.836522 | 6.330018 | 4.922339 | ... | 8.972873 | 10.440245 | 12.101476 | 13.695705 | Tarca_001_P1A01 | 11.0 | 1 | PRB_HTA | 1 | HTA20 |
| 1 | 2 | 6.125023 | 3.805305 | 6.191562 | 3.452524 | 5.678373 | 4.773199 | 6.143398 | 5.601745 | 4.711765 | ... | 9.376194 | 10.845176 | 12.370891 | 13.635522 | Tarca_003_P1A03 | NaN | 1 | PRB_HTA | 0 | HTA20 |
| 2 | 3 | 5.875502 | 3.450245 | 6.550525 | 3.316134 | 6.185059 | 4.393488 | 5.898364 | 6.137984 | 4.628124 | ... | 8.843612 | 10.493416 | 12.295786 | 13.616688 | Tarca_004_P1A04 | 32.6 | 1 | PRB_HTA | 1 | HTA20 |
| 3 | 4 | 6.126131 | 3.628411 | 6.421877 | 3.432451 | 5.633757 | 4.623783 | 6.019792 | 5.787502 | 4.796283 | ... | 9.191471 | 10.879879 | 12.249936 | 13.524328 | Tarca_005_P1A05 | 30.6 | 1 | PRB_HTA | 1 | HTA20 |
| 4 | 5 | 6.146466 | 3.446812 | 6.260962 | 3.477162 | 5.313198 | 4.422651 | 6.407699 | 5.830437 | 4.726488 | ... | 9.247768 | 10.754316 | 12.245458 | 13.509353 | Tarca_006_P1A06 | NaN | 1 | PRB_HTA | 0 | HTA20 |


In [5]:
X_train_df = alldata.loc[alldata['Train'] == 1, 'g_1':'g_32830']
y_train_df = alldata.loc[alldata['Train'] == 1, 'GA']
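
The selection above relies on two `.loc` behaviours: a boolean mask picks the rows, and a label slice such as `'g_1':'g_32830'` includes both endpoints. The sketch below illustrates this on a tiny made-up frame. The name `toy` and its values are hypothetical stand-ins, not the challenge data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the real table: three gene columns plus GA and Train.
toy = pd.DataFrame({
    "g_1": [6.1, 6.2, 5.9, 6.0],
    "g_2": [3.8, 3.7, 3.5, 3.6],
    "g_3": [5.8, 6.2, 6.6, 6.4],
    "GA": [11.0, np.nan, 32.6, 30.6],
    "Train": [1, 0, 1, 1],
})

# Boolean mask for the labelled (training) rows.
is_train = toy["Train"] == 1

# Label-based column slice: both "g_1" and "g_3" are included.
X_toy = toy.loc[is_train, "g_1":"g_3"]
y_toy = toy.loc[is_train, "GA"]

print(X_toy.shape)
print(y_toy.isna().sum())
```

In this toy frame the row with `Train == 0` is the one with a missing `GA`, which mirrors the pattern visible in the first rows of `alldata`; whether that holds for every row of the real data is worth checking with `y_train_df.isna().sum()`.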

# Training validation split

In [7]:
SEED = 1
SIZE = 0.3

X_train, X_vali, y_train, y_vali = train_test_split(X_train_df, y_train_df, test_size = SIZE, random_state = SEED)

# AdaBoost

In [12]:
adb = AdaBoostRegressor(n_estimators = 100,
                        random_state = SEED)

In [13]:
adb.fit(X_train, y_train)
y_pred_vali = adb.predict(X_vali)

In [14]:
rmse_vali = MSE(y_vali, y_pred_vali)**(1/2)
print("Validation set RMSE: {:.2f}".format(rmse_vali))

Validation set RMSE: 7.10
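
An RMSE is easier to interpret next to a trivial baseline. The sketch below shows one possible way to get one: `DummyRegressor` always predicts the training mean, so any useful model should beat its validation RMSE. The data here is synthetic (`X_demo`, `y_demo` are made-up names and values), so the printed number says nothing about the challenge data. To apply it to the notebook, fit on `X_train`, `y_train` and score on `X_vali`, `y_vali` instead.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): random features, GA-like targets in weeks.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 50))
y_demo = rng.uniform(8, 40, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Baseline that ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_tr, y_tr)

rmse_baseline = mean_squared_error(y_va, baseline.predict(X_va)) ** 0.5
print("Baseline validation RMSE: {:.2f}".format(rmse_baseline))
```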


# Gradient boosting

In [16]:
gbt = GradientBoostingRegressor(n_estimators = 300,
                                max_depth = 1,
                                random_state = SEED)

In [17]:
gbt.fit(X_train, y_train)
y_pred_vali = gbt.predict(X_vali)

In [18]:
rmse_vali = MSE(y_vali, y_pred_vali)**(1/2)
print("Validation set RMSE: {:.2f}".format(rmse_vali))

Validation set RMSE: 6.64
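
The choice of `n_estimators = 300` can be checked without refitting: `staged_predict` yields the predictions after each boosting stage, so the validation RMSE can be traced as trees are added. This is a minimal sketch on synthetic data (`X_demo`, `y_demo` and the linear signal are assumptions for illustration). On the notebook's data the same loop would use `gbt.staged_predict(X_vali)`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two informative features plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 20))
y_demo = 25 + 5 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Same hyperparameters as the notebook's gbt model.
gbt_demo = GradientBoostingRegressor(n_estimators=300, max_depth=1, random_state=1)
gbt_demo.fit(X_tr, y_tr)

# One validation RMSE per boosting stage (stage i uses the first i trees).
stage_rmse = [
    mean_squared_error(y_va, y_stage) ** 0.5
    for y_stage in gbt_demo.staged_predict(X_va)
]

# Number of trees with the lowest validation RMSE (1-based).
best_n = int(np.argmin(stage_rmse)) + 1
print("Best number of trees: {}".format(best_n))
print("RMSE at best stage: {:.2f}".format(stage_rmse[best_n - 1]))
print("RMSE at final stage: {:.2f}".format(stage_rmse[-1]))
```

If the best stage comes well before the last one, the model is probably overfitting in the later stages, and a smaller `n_estimators` or a lower `learning_rate` may be worth trying.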


# Stochastic gradient boosting

In [8]:
sgbt = GradientBoostingRegressor(max_depth = 1,
                                 subsample = 0.8,
                                 max_features = 0.2,
                                 n_estimators = 300,
                                 random_state = SEED)

In [9]:
sgbt.fit(X_train, y_train)
y_pred_vali = sgbt.predict(X_vali)

In [10]:
rmse_vali = MSE(y_vali, y_pred_vali)**(1/2)
print("Validation set RMSE: {:.2f}".format(rmse_vali))

Validation set RMSE: 6.88
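
Since the main difficulty is finding the important features, one possible next step is to rank the genes by the fitted model's `feature_importances_`. The sketch below does this on synthetic data in which only the hypothetical column `g_3` carries signal. The frame, the target, and the choice of the top 5 are all assumptions for illustration. On the notebook's model the equivalent would be `pd.Series(sgbt.feature_importances_, index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (assumption): 30 gene-like columns, only g_3 is informative.
rng = np.random.default_rng(1)
n_samples, n_genes = 200, 30
cols = ["g_{}".format(i) for i in range(1, n_genes + 1)]
X_demo = pd.DataFrame(rng.normal(size=(n_samples, n_genes)), columns=cols)
y_demo = 25 + 6 * X_demo["g_3"] + rng.normal(scale=0.5, size=n_samples)

# Same hyperparameters as the notebook's sgbt model.
sgbt_demo = GradientBoostingRegressor(max_depth=1,
                                      subsample=0.8,
                                      max_features=0.2,
                                      n_estimators=300,
                                      random_state=1)
sgbt_demo.fit(X_demo, y_demo)

# Impurity-based importances, labelled by column name and sorted descending.
importances = pd.Series(sgbt_demo.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)

top_genes = importances.head(5)
print(top_genes)
```

Impurity-based importances come from the training data only, so a ranking like this is a starting point for feature selection rather than proof that a gene is predictive.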
