# Model Submission Notebook - Home Credit Default Risk

### Initialization - Load Packages and Data
The Data Generation notebook already created the test set of features and performed all the cleansing, feature engineering, and table merging. All we need to do here is load that set into the notebook, run it through the preprocessing pipeline, and apply our final model.

In [None]:
import pandas as pd
import joblib
import numpy as np
import gc

pd.options.display.max_columns = None
test = pd.read_pickle('../input/ensemblemodelstuff/test1205.pkl')
print(test.shape)
test.head(5)

### Load Model(s)
The preprocessor and model files (default_preprocessor_final and default_model_final_*) were dumped with joblib in the final step of the model notebook. Because we are building an ensemble, we need to load three different models: logistic regression, LightGBM, and random forest.

In [None]:
preprocessor = joblib.load('../input/ensemblemodelstuff/default_preprocessor_final.joblib')
model1 = joblib.load('../input/ensemblemodelstuff/default_model_final_LR.joblib')                   # this is the logistic regression
model2 = joblib.load('../input/ensemblemodelstuff/default_model_final_LGB.joblib')                  # this is the LightGBM
model3 = joblib.load('../input/ensemblemodelstuff/default_model_final_RF.joblib')                   # this is the random forest

print(type(model1))
print(type(model2))
print(type(model3))

### Preprocessing
We still need to run the test set through the fitted preprocessing pipeline, which performs imputation and scaling on the numeric features and one-hot encoding on the categorical features.

In [None]:
X_test = preprocessor.transform(test)
print(X_test.shape)
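
The saved preprocessor itself was built in the model notebook and is not shown here. As a rough illustration of what such a pipeline typically looks like, the sketch below assumes a scikit-learn ColumnTransformer; the column names, the toy data, and the choice of median and most-frequent imputation are all hypothetical and are not taken from the original notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy columns; the real feature set comes from the Data Generation notebook.
num_cols = ["income", "credit"]
cat_cols = ["contract_type"]

train_toy = pd.DataFrame({
    "income": [100.0, 200.0, np.nan, 400.0],
    "credit": [10.0, np.nan, 30.0, 40.0],
    "contract_type": ["cash", "revolving", "cash", "cash"],
})
test_toy = pd.DataFrame({
    "income": [150.0, np.nan],
    "credit": [np.nan, 25.0],
    "contract_type": ["revolving", "unseen"],
})

# Numeric features: impute missing values, then scale (imputation strategy is an assumption).
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical features: impute, then one-hot encode; unseen categories become all zeros.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
toy_preprocessor = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])

# Fit on training data only; the test set is transformed, never refit.
toy_preprocessor.fit(train_toy)
X_toy = toy_preprocessor.transform(test_toy)
print(X_toy.shape)
```

The point of the sketch is the fit/transform split: the pipeline learns its imputation values, scaling statistics, and category list from the training data, so the test set only ever passes through transform, as in the cell above.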

### Test Predictions
For each model, we generate predictions on the test data and then take a weighted average of them, using the weights (80% LGBM, 15% LR, 5% RF) that we identified as a candidate ensemble model. This combination did outscore the pure LGBM model on the validation data, although only in the fifth decimal place.

In [None]:
test_pred_1 = model1.predict_proba(X_test)           # predictions for logistic regression
test_pred_2 = model2.predict_proba(X_test)           # predictions for light GBM
test_pred_3 = model3.predict_proba(X_test)           # predictions for random forest

# In the previous notebook, we identified an ensemble of 80% LGBM, 15% LR, 5% RF as a good candidate,
# so we take the weighted average of the three sets of predictions:
ensemble = (0.15 * test_pred_1) + (0.8 * test_pred_2) + (0.05 * test_pred_3)

# The predictions that we need (the probability of the positive class) are in column 1, the right-hand column of this array.
print(ensemble.shape)
print(ensemble[:5])
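
To make the weighted average concrete, here is a small self-contained sketch using made-up probabilities for three hypothetical models. The toy arrays and variable names are illustrative only; the weights are the ones used in the cell above. Because the weights sum to 1, each row of the blended array still sums to 1, so column 1 remains a valid probability.

```python
import numpy as np

# Toy predict_proba-style outputs: columns are [P(class 0), P(class 1)].
pred_lr = np.array([[0.90, 0.10], [0.40, 0.60]])
pred_lgb = np.array([[0.80, 0.20], [0.30, 0.70]])
pred_rf = np.array([[0.70, 0.30], [0.50, 0.50]])

weights = np.array([0.15, 0.80, 0.05])  # LR, LGBM, RF
stacked = np.stack([pred_lr, pred_lgb, pred_rf])  # shape (3, n_rows, 2)

# Weighted average over the model axis; equivalent to writing out the weighted sum by hand.
toy_ensemble = np.average(stacked, axis=0, weights=weights)

# Sanity checks: weights sum to 1, so every blended row still sums to 1.
assert np.isclose(weights.sum(), 1.0)
assert np.allclose(toy_ensemble.sum(axis=1), 1.0)

print(toy_ensemble[:, 1])  # blended probability of the positive class
```

Stacking the predictions and calling np.average keeps the weights in one place, which is convenient if the blend is later re-tuned.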

### Submission
Prepare a CSV file of our predicted probabilities for the final Kaggle submission.

In [None]:
submission = pd.read_csv('../input/home-credit-default-risk/sample_submission.csv')
submission['TARGET'] = ensemble[:, 1]               # replace the default values with our predictions
submission.to_csv('default_submission_17.csv', index=False, header=True)
submission.head(10)                                 # last expression in the cell, so the preview is displayed
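
Before uploading, it can be worth running a quick sanity check on the submission frame. The helper below is a minimal sketch and not part of the original notebook: check_submission is a hypothetical name, and it assumes the sample submission has exactly the columns SK_ID_CURR and TARGET.

```python
import numpy as np
import pandas as pd

def check_submission(sub: pd.DataFrame, n_expected: int) -> None:
    """Raise AssertionError if the frame does not look like a valid probability submission."""
    # Assumed column layout of the sample submission file.
    assert list(sub.columns) == ["SK_ID_CURR", "TARGET"], "unexpected columns"
    assert len(sub) == n_expected, "row count does not match the test set"
    assert sub["SK_ID_CURR"].is_unique, "duplicate IDs"
    assert sub["TARGET"].notna().all(), "missing predictions"
    assert sub["TARGET"].between(0.0, 1.0).all(), "probabilities outside [0, 1]"

# Toy frame standing in for the real submission.
toy_sub = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.19, 0.675, 0.02],
})
check_submission(toy_sub, n_expected=3)
print("submission looks valid")
```

In the notebook itself, the equivalent call would be check_submission(submission, n_expected=len(test)), assuming the rows of test and the sample submission are in the same order.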