# Kaggle - Porte Seguro Safe Driver Prediction - Perfomance Improvement 3
**Author: Chris Shin**

Ensemble methods are techniques in machine learning where multiple models are trained and combined to improve the overall performance and robustness of the prediction. The idea behind ensemble methods is to create a diverse set of models that are capable of capturing different aspects of the data, and then combine their predictions to produce a more accurate and reliable result.

Ensemble methods can be used in both supervised and unsupervised learning tasks. In supervised learning, ensemble methods are typically used for classification and regression tasks, while in unsupervised learning, they are used for clustering and dimensionality reduction tasks.

There are several reasons why ensemble methods can be effective:

1. Reduction of overfitting: Ensemble methods can help reduce overfitting, which is a common problem in machine learning where a model performs well on the training data but poorly on the test data. By combining multiple models, ensemble methods can reduce the risk of overfitting by averaging out errors and reducing variance.

2. Improved accuracy: Ensemble methods can improve the accuracy of predictions by combining the strengths of multiple models. This can result in more robust and accurate predictions, especially when dealing with complex and noisy datasets.

3. Increased stability: Ensemble methods can be more stable than single models because they are less sensitive to variations in the data. This can be especially important in real-world applications where the data is constantly changing or noisy.

4. Flexibility: Ensemble methods are flexible and can be used with a wide range of machine learning algorithms, including decision trees, neural networks, and support vector machines.

Overall, ensemble methods can be a powerful tool in machine learning and can help improve the accuracy, stability, and robustness of predictions.

Ensemble methods are not always better than single models. It depends on the specific problem and data. In some cases, a well-tuned single model may outperform an ensemble of models. However, ensembles can often provide better results by combining the strengths of multiple models and reducing the weaknesses of any single model. Ensembles can also be more robust and less prone to overfitting than single models. In general, it is a good idea to experiment with both single models and ensembles and choose the approach that gives the best performance for the specific task at hand.

In [1]:
import pandas as pd
pd.options.display.max_columns = None

train = pd.read_csv('./data/train.csv', index_col='id')
test = pd.read_csv('./data/test.csv', index_col='id')
submission = pd.read_csv('./data/sample_submission.csv', index_col='id')

In [2]:
all_data = pd.concat([train, test], ignore_index=True)
all_data = all_data.drop('target', axis=1) 

all_features = all_data.columns

In [3]:
from sklearn.preprocessing import OneHotEncoder

cat_features = [feature for feature in all_features if 'cat' in feature] 

onehot_encoder = OneHotEncoder()
encoded_cat_matrix = onehot_encoder.fit_transform(all_data[cat_features]) 

In [4]:
all_data['num_missing'] = (all_data==-1).sum(axis=1)

In [5]:
remaining_features = [feature for feature in all_features
                      if ('cat' not in feature and 'calc' not in feature)] 
remaining_features.append('num_missing')

In [6]:
ind_features = [feature for feature in all_features if 'ind' in feature]

is_first_feature = True
for ind_feature in ind_features:
    if is_first_feature:
        all_data['mix_ind'] = all_data[ind_feature].astype(str) + '_'
        is_first_feature = False
    else:
        all_data['mix_ind'] += all_data[ind_feature].astype(str) + '_'

In [7]:
cat_count_features = []
for feature in cat_features+['mix_ind']:
    val_counts_dict = all_data[feature].value_counts().to_dict()
    all_data[f'{feature}_count'] = all_data[feature].apply(lambda x: 
                                                           val_counts_dict[x])
    cat_count_features.append(f'{feature}_count')

In [8]:
from scipy import sparse
drop_features = ['ps_ind_14', 'ps_ind_10_bin', 'ps_ind_11_bin', 
                 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_car_14']

all_data_remaining = all_data[remaining_features+cat_count_features].drop(drop_features, axis=1)

all_data_sprs = sparse.hstack([sparse.csr_matrix(all_data_remaining),
                               encoded_cat_matrix],
                              format='csr')

In [9]:
num_train = len(train)

X = all_data_sprs[:num_train]
X_test = all_data_sprs[num_train:]

y = train['target'].values

In [10]:
import numpy as np

def eval_gini(y_true, y_pred):
    assert y_true.shape == y_pred.shape

    n_samples = y_true.shape[0]
    L_mid = np.linspace(1 / n_samples, 1, n_samples)

    pred_order = y_true[y_pred.argsort()]
    L_pred = np.cumsum(pred_order) / np.sum(pred_order) 
    G_pred = np.sum(L_mid - L_pred) 

    true_order = y_true[y_true.argsort()]
    L_true = np.cumsum(true_order) / np.sum(true_order) 
    G_true = np.sum(L_mid - L_true)

    return G_pred / G_true

In [11]:
# Gini for LGB
def gini_lgb(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', eval_gini(labels, preds), True


In [12]:
# Gini for XGB
def gini_xgb(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', eval_gini(labels, preds)

In [13]:
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1991)

In [14]:
max_params_lgb = {
    'bagging_fraction': 0.6213108174593661,
    'feature_fraction': 0.608712929970154,
    'lambda_l1': 0.7040436794880651,
    'lambda_l2': 0.9832619845547939,
    'min_child_samples': 9,
    'min_child_weight': 36.10036444740457,
    'num_leaves': 40,
    'objective': 'binary',
    'learning_rate': 0.005,
    'bagging_freq': 1,
    'force_row_wise': True,
    'random_state': 1991
}

In [15]:
import lightgbm as lgb

oof_val_preds_lgb = np.zeros(X.shape[0]) 
oof_test_preds_lgb = np.zeros(X_test.shape[0]) 

for idx, (train_idx, valid_idx) in enumerate(folds.split(X, y)):
    print('#'*40, f'Fold {idx+1} / fold {folds.n_splits}', '#'*40)
    
    X_train, y_train = X[train_idx], y[train_idx]
    X_valid, y_valid = X[valid_idx], y[valid_idx]

    dtrain = lgb.Dataset(X_train, y_train)
    dvalid = lgb.Dataset(X_valid, y_valid)
                          
    lgb_model = lgb.train(params=max_params_lgb,
                          train_set=dtrain,
                          num_boost_round=2500,
                          valid_sets=dvalid,
                          feval=gini_lgb,
                          early_stopping_rounds=300,
                          verbose_eval=100)
    
    oof_test_preds_lgb += lgb_model.predict(X_test)/folds.n_splits
    
    oof_val_preds_lgb[valid_idx] += lgb_model.predict(X_valid)
    
    gini_score = eval_gini(y_valid, oof_val_preds_lgb[valid_idx])
    print(f'Fold {idx+1} gini coefficient : {gini_score}\n')

######################################## Fold 1 / fold 5 ########################################




[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Info] Total Bins 1554
[LightGBM] [Info] Number of data points in the train set: 476169, number of used features: 216
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.036447 -> initscore=-3.274764
[LightGBM] [Info] Start training from score -3.274764
Training until validation scores don't improve for 300 rounds
[100]	valid_0's binary_logloss: 0.154239	valid_0's gini: 0.270944
[200]	valid_0's binary_logloss: 0.153176	valid_0's gini: 0.275764
[300]	valid_0's binary_logloss: 0.152584	valid_0's gini: 0.279501
[400]	valid_0's binary_logloss: 0.152222	valid_0's gini: 0.282893
[500]	valid_0's binary_logloss: 0.151986	valid_0's gini: 0.286058
[600]	valid_0's binary_logloss: 0.151824	valid_0's gini: 0.288805
[700]	valid_0's binary_logloss: 0.151712	valid_0's gini: 0.290719
[800]	valid_0's binary_logloss: 0.151622	valid_0's gini: 0.292581
[900]	valid_0's binary_logloss: 0.151552	valid_0's gini: 0.294212
[1000]	va

In [16]:
max_params_xgb = {
    'colsample_bytree': 0.8843124587484356,
    'gamma': 10.452246227672624,
    'max_depth': 7,
    'min_child_weight': 6.494091293383359,
    'reg_alpha': 8.551838810159788,
    'reg_lambda': 1.3814765995549108,
    'scale_pos_weight': 1.423280772455086,
    'subsample': 0.7001630536555632,
    'objective': 'binary:logistic',
    'learning_rate': 0.02,
    'random_state': 1991
}

In [17]:
import xgboost as xgb

oof_val_preds_xgb = np.zeros(X.shape[0]) 
oof_test_preds_xgb = np.zeros(X_test.shape[0]) 

for idx, (train_idx, valid_idx) in enumerate(folds.split(X, y)):
    print('#'*40, f'Fold {idx+1} / fold {folds.n_splits}', '#'*40)
    
    X_train, y_train = X[train_idx], y[train_idx]
    X_valid, y_valid = X[valid_idx], y[valid_idx]

    dtrain = xgb.DMatrix(X_train, y_train)
    dvalid = xgb.DMatrix(X_valid, y_valid)
    dtest = xgb.DMatrix(X_test)

    xgb_model = xgb.train(params=max_params_xgb, 
                          dtrain=dtrain,
                          num_boost_round=2000,
                          evals=[(dvalid, 'valid')],
                          maximize=True,
                          feval=gini_xgb,
                          early_stopping_rounds=200,
                          verbose_eval=100)

    best_iter = xgb_model.best_iteration

    oof_test_preds_xgb += xgb_model.predict(dtest,
                                            iteration_range=(0, best_iter))/folds.n_splits
    
    oof_val_preds_xgb[valid_idx] += xgb_model.predict(dvalid, 
                                                      iteration_range=(0, best_iter))
    
    gini_score = eval_gini(y_valid, oof_val_preds_xgb[valid_idx])
print(f'Fold {idx+1} gini coefficient : {gini_score}\n')

######################################## Fold 1 / fold 5 ########################################




[0]	valid-logloss:0.67665	valid-gini:0.15993
[100]	valid-logloss:0.19089	valid-gini:0.24884
[200]	valid-logloss:0.15780	valid-gini:0.27713
[300]	valid-logloss:0.15458	valid-gini:0.28754
[400]	valid-logloss:0.15405	valid-gini:0.29224
[500]	valid-logloss:0.15390	valid-gini:0.29505
[600]	valid-logloss:0.15385	valid-gini:0.29615
[700]	valid-logloss:0.15380	valid-gini:0.29718
[800]	valid-logloss:0.15379	valid-gini:0.29778
[900]	valid-logloss:0.15376	valid-gini:0.29806
[1000]	valid-logloss:0.15377	valid-gini:0.29791
[1100]	valid-logloss:0.15375	valid-gini:0.29822
[1200]	valid-logloss:0.15375	valid-gini:0.29811
[1300]	valid-logloss:0.15373	valid-gini:0.29846
[1400]	valid-logloss:0.15374	valid-gini:0.29845
[1500]	valid-logloss:0.15372	valid-gini:0.29873
[1600]	valid-logloss:0.15374	valid-gini:0.29847
[1693]	valid-logloss:0.15373	valid-gini:0.29848
######################################## Fold 2 / fold 5 ########################################
[0]	valid-logloss:0.67666	valid-gini:0.12533
[100]

In [18]:
print('LightGBM OOF  OOF validation prediction gini coefficient:', eval_gini(y, oof_val_preds_lgb))

LightGBM OOF  OOF validation prediction gini coefficient: 0.2889651000887542


In [19]:
print('XGBoost OOF  OOF validation prediction gini coefficient:', eval_gini(y, oof_val_preds_xgb))

XGBoost OOF  OOF validation prediction gini coefficient: 0.28863101798154267


In [20]:
oof_test_preds = oof_test_preds_lgb * 0.5 + oof_test_preds_xgb * 0.5

In [21]:
submission['target'] = oof_test_preds
submission.to_csv('submission.csv')