# Combining my submission.csv files for a better score
Over time I have experimented with various machine learning techniques on the [House Prices: Advanced Regression Techniques competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques), each time submitting a `submission.csv` file. Here is a table of my results:

| Technique | Score | Explained variance |
| :--- | --- | --- |
| [Neural network](https://www.kaggle.com/carlmcbrideellis/very-simple-neural-network-regression) | 0.23181 | 0.69091 |
| [Gaussian process](https://www.kaggle.com/carlmcbrideellis/gaussian-process-regression-sample-script) | 0.21004 | 0.76409 |
| [Random forest](https://www.kaggle.com/carlmcbrideellis/random-forest-regression-minimalist-script) | 0.17734 | 0.86514 |
| [XGBoost](https://www.kaggle.com/carlmcbrideellis/very-simple-xgboost-regression) | 0.15617 | 0.90148 |
| [CatBoost](https://www.kaggle.com/carlmcbrideellis/catboost-regression-minimalist-script) | 0.15270 | 0.90096 |

I thought it would be fun to find a [linear combination](https://en.wikipedia.org/wiki/Linear_combination) of my `submission.csv` files that gives a better leaderboard score than any of the individual submissions, as well as having a better [explained variance score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html). To do this I make use of the excellent notebook ["Finding Ensemble Weights"](https://www.kaggle.com/hsperr/finding-ensamble-weights) written by [Henning Sperr](https://www.kaggle.com/hsperr).
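To make the idea concrete, here is a minimal sketch of a linear combination of prediction vectors. The prices and weights are invented purely for illustration and are not taken from any of my submissions; `preds`, `weights` and `blend` are hypothetical names.

```python
import numpy as np

# Made-up predictions from three hypothetical models for the same four houses
# (one column per model, one row per house).
preds = np.column_stack([
    [100000.0, 150000.0, 200000.0, 250000.0],   # model A
    [110000.0, 140000.0, 210000.0, 240000.0],   # model B
    [ 90000.0, 160000.0, 190000.0, 260000.0],   # model C
])

# Illustrative weights, one per model (assumed values, not fitted ones).
weights = np.array([0.5, 0.3, 0.2])

# The blend is the weighted sum of the columns: 0.5*A + 0.3*B + 0.2*C.
blend = preds @ weights
print(blend)
```

Each blended price is simply the weighted sum of the individual predictions for that house, so the whole problem reduces to choosing one number per submission file.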

In [None]:
import pandas as pd
import numpy  as np
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import explained_variance_score

Read in my `submission.csv` files:

In [None]:
s1 = pd.read_csv("../input/very-simple-neural-network-regression/submission.csv")
s2 = pd.read_csv("../input/gaussian-process-regression-sample-script/submission.csv")
s3 = pd.read_csv("../input/random-forest-regression-minimalist-script/submission.csv")
s4 = pd.read_csv("../input/very-simple-xgboost-regression/submission.csv")
s5 = pd.read_csv("../input/catboost-regression-minimalist-script/submission.csv")


n_submission_files = 5
# also create a placeholder DataFrame (a copy of the XGBoost submission)
# that will later hold the blended predictions
s_final = s4.copy()

We shall also read in the [ground truth (correct) target values](https://www.kaggle.com/carlmcbrideellis/house-prices-advanced-regression-solution-file):

In [None]:
solution   = pd.read_csv('../input/house-prices-advanced-regression-solution-file/submission.csv')
y_true     = solution["SalePrice"]

We now use [scipy.optimize.minimize](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) to find the weights that give the lowest score under the evaluation metric of the House Prices competition, which is the square root of the [mean squared logarithmic error regression loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html). Since `L-BFGS-B` is a local optimization method, the search is repeated from 150 random starting points and the weights with the lowest score are kept.
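Before running the optimization, here is a hand-written sketch of the metric using only NumPy. The `rmsle` helper and the toy prices are hypothetical and for illustration only; the intent is that it agrees with `np.sqrt(mean_squared_log_error(y_true, y_pred))`, which likewise works with log(1 + y).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, computed on log(1 + y)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Invented sale prices, purely for illustration.
y_true_toy = [200000.0, 150000.0, 320000.0]
y_pred_toy = [210000.0, 140000.0, 300000.0]

print("RMSLE of the toy predictions: %.5f" % rmsle(y_true_toy, y_pred_toy))
```

Because the errors are measured on logarithms, the metric penalizes relative rather than absolute differences in price.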

In [None]:
from scipy.optimize import minimize

tmp_scores = []
tmp_weights = []
predictions = []
predictions.append( s1["SalePrice"] )
predictions.append( s2["SalePrice"] )
predictions.append( s3["SalePrice"] )
predictions.append( s4["SalePrice"] )
predictions.append( s5["SalePrice"] )

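def scoring_function(weights):
    # blend the submissions: weighted sum of the individual predictions
    final_prediction = 0
    for weight, prediction in zip(weights, predictions):
        final_prediction += weight*prediction
    # competition metric: root mean squared logarithmic error
    return np.sqrt(mean_squared_log_error(y_true, final_prediction))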

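# each weight is kept in [0, 1]; the weights are not forced to sum to 1
bounds = [(0,1)]*len(predictions)

# L-BFGS-B is a local optimizer, so restart from 150 random points
# and record the score and weights of every run
for i in range(150):
    starting_values = np.random.uniform(size=n_submission_files)
    result = minimize(scoring_function,
                      starting_values,
                      method='L-BFGS-B',
                      bounds=bounds,
                      options={'disp': False, 'maxiter': 10000})
    tmp_scores.append(result['fun'])
    tmp_weights.append(result['x'])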

bestWeight = tmp_weights[np.argmin(tmp_scores)]
print('Best weights', bestWeight)

We can see that the best combination is a mix of 2.73% Gaussian process, 7.3% random forest, 17.7% XGBoost and finally 72% CatBoost. Let us now take a look at the results:
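Before looking at the results, a small aside on reading the weight vector: pairing each weight with its model name makes the printout easier to interpret. In the sketch below, `model_names` and `example_weights` are hypothetical. The values are the rounded percentages quoted above, and the neural network entry is set to 0.0 as an assumption, since it does not appear in the mix.

```python
import numpy as np

# Hypothetical labels, in the same order as s1 ... s5.
model_names = ["neural network", "Gaussian process", "random forest", "XGBoost", "CatBoost"]

# Rounded weights taken from the text above; the 0.0 for the neural network is an assumption.
example_weights = np.array([0.0, 0.0273, 0.073, 0.177, 0.72])

for name, weight in zip(model_names, example_weights):
    print("%-17s %6.2f%%" % (name, 100 * weight))

# The bounds only keep each weight in [0, 1]; nothing forces the weights to sum to 1.
print("Sum of weights: %.4f" % example_weights.sum())
```

In the notebook itself the same loop could be run over `bestWeight` instead of the rounded example values.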

In [None]:
s_final["SalePrice"] = (s1["SalePrice"]*bestWeight[0]
                        + s2["SalePrice"]*bestWeight[1]
                        + s3["SalePrice"]*bestWeight[2]
                        + s4["SalePrice"]*bestWeight[3]
                        + s5["SalePrice"]*bestWeight[4])

print("The new score is %.5f" % np.sqrt( mean_squared_log_error(y_true, s_final["SalePrice"]) ) )
print("The new explained variance is %.5f" % explained_variance_score(y_true, s_final["SalePrice"]) )

# Success!
It looks like we were able to find a judicious combination of weights that does indeed result in a better `submission.csv` than any of the component `submission.csv` files. Let us now submit this new solution to the competition:

In [None]:
s_final.to_csv('submission.csv', index=False)