# Main results

In [1]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

In [2]:
final_results = joblib.load('../output/modelizationResults.pkl')
final_results['Fine_Tuning'] = final_results['Fine_Tuning'].astype('bool')

These are the results of the different algorithms applied to the dataset. For each algorithm we provide scores ($R^2$ and MAE) for a model with default settings and for one with tuned hyperparameters:

In [3]:
final_results

| | Algorithm | Fine_Tuning | R2_Train | R2_Test | MAE_Train | MAE_Test |
|---|---|---|---|---|---|---|
| 0 | Decision trees | False | 0.989 | 0.863 | 269.0 | 1278.0 |
| 0 | Decision trees | True | 0.943 | 0.883 | 926.0 | 1263.0 |
| 0 | Random Forest | False | 0.975 | 0.907 | 572.0 | 1132.0 |
| 0 | Random Forest | True | 0.897 | 0.881 | 1382.0 | 1445.0 |
| 0 | K-nearest neighbours | False | 0.9 | 0.839 | 1131.0 | 1445.0 |
| 0 | K-nearest neighbours | True | 0.989 | 0.86 | 273.0 | 1276.0 |
| 0 | Gradient Boosting | False | 0.847 | 0.836 | 1519.0 | 1556.0 |
| 0 | Gradient Boosting | True | 0.931 | 0.911 | 1108.0 | 1172.0 |
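
For reference, and assuming the scores follow the usual definitions (the notebook does not show how they were computed), with $y_i$ the actual price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean actual price and $n$ the number of samples:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

Under these definitions a higher $R^2$ is better (1 is a perfect fit), while MAE is expressed in the same unit as the target, so a lower value is better.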


As you can see, results differ between algorithms, so we have to examine carefully which one provides the best results.

Decision trees, despite getting good scores on train data, do not do as well on test data: with default settings $R^2$ drops from 0.989 to 0.863, and the tuned tree still drops from 0.943 to 0.883. The same happens with KNN (0.989 versus 0.860 once tuned). Such a gap is probably a sign of overfitting, so we will discard both as a solution for our problem.
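
The gap between train and test scores can be made explicit with two extra columns. The following is a minimal sketch that rebuilds the table from the values shown above so that it runs on its own; the column names `R2_gap` and `MAE_gap` are introduced here for illustration and are not part of the original results file.

```python
import pandas as pd

# Values copied from the results table above.
results = pd.DataFrame(
    [
        ("Decision trees", False, 0.989, 0.863, 269.0, 1278.0),
        ("Decision trees", True, 0.943, 0.883, 926.0, 1263.0),
        ("Random Forest", False, 0.975, 0.907, 572.0, 1132.0),
        ("Random Forest", True, 0.897, 0.881, 1382.0, 1445.0),
        ("K-nearest neighbours", False, 0.900, 0.839, 1131.0, 1445.0),
        ("K-nearest neighbours", True, 0.989, 0.860, 273.0, 1276.0),
        ("Gradient Boosting", False, 0.847, 0.836, 1519.0, 1556.0),
        ("Gradient Boosting", True, 0.931, 0.911, 1108.0, 1172.0),
    ],
    columns=["Algorithm", "Fine_Tuning", "R2_Train", "R2_Test", "MAE_Train", "MAE_Test"],
)

# A large positive gap means the model does much better on train than on test.
results["R2_gap"] = (results["R2_Train"] - results["R2_Test"]).round(3)
results["MAE_gap"] = results["MAE_Test"] - results["MAE_Train"]

# Smallest gaps first.
print(results.sort_values("R2_gap")[["Algorithm", "Fine_Tuning", "R2_Test", "R2_gap", "MAE_gap"]])
```

In the notebook itself the same two columns could be added to `final_results` directly. A small gap is only part of the picture, so the test score is printed alongside it.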

Two tuned models keep train and test scores close together: random forest (0.897 versus 0.881) and gradient boosting (0.931 versus 0.911). We would accept a slightly higher error on train data (higher than decision trees or KNN, for instance) in exchange for good scores on test data that stay close to the train ones. This is the best way to avoid the suspicion of overfitting. These are the best options to choose from:

In [9]:
final_results.iloc[[3,7]]

| | Algorithm | Fine_Tuning | R2_Train | R2_Test | MAE_Train | MAE_Test |
|---|---|---|---|---|---|---|
| 0 | Random Forest | True | 0.897 | 0.881 | 1382.0 | 1445.0 |
| 0 | Gradient Boosting | True | 0.931 | 0.911 | 1108.0 | 1172.0 |


It's hard to tell which one would work better, but another aspect to keep in mind when choosing an algorithm is the feature importances of the model. If they fit our knowledge of the structure of the data, we will be more confident in the choice. Let's see the results for each algorithm:

<img src="../output/featImpRF.png"
     alt="Random forest feature importances"
     style="float: left; margin-right: 10px;" />

<img src="../output/featImpGB.png"
     alt="Gradient boosting feature importances"
     style="float: left; margin-right: 10px;" />

With random forest, powerPS and year alone concentrate almost all the importance. Gradient boosting, however, also gives importance to other features such as kilometers and brand & model (actually, the most important feature for that algorithm). Brand and model are known to matter in the car market and have a remarkable effect on price. As a reminder, here is again the boxplot-by-brand chart from the data visualization section:

<img src="../output/brandsBoxplot.png"
     alt="Boxplots of price by brand"
     style="float: left; margin-right: 10px;" />
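
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how importances can be read from fitted scikit-learn models. It is not the code that produced the charts above: the data is synthetic, the feature names are hypothetical stand-ins, and the model settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in data: NOT the car dataset, only here to make the sketch runnable.
rng = np.random.default_rng(0)
feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names
X = rng.normal(size=(500, 3))
# The target depends strongly on feature_a, weakly on feature_b, not at all on feature_c.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Arbitrary settings, chosen only to keep the example fast.
models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X, y)
    # feature_importances_ holds one impurity-based weight per column; the weights sum to 1.
    importances[name] = dict(zip(feature_names, model.feature_importances_))

# Print the features of each model from most to least important.
for name, weights in importances.items():
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    print(name, [(feat, round(float(w), 3)) for feat, w in ranked])
```

Impurity-based importances are relative weights within one model, so comparing them across algorithms says how each model distributes its attention rather than how much each feature matters in absolute terms.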

So, in conclusion, gradient boosting is our final choice for deploying the predictive model, as it achieves the best test $R^2$ (0.911) without discarding the importance of some features.

**Some caveats**:  
Although the results are very acceptable, there is one aspect to consider in further improvements of the model.  
We saw that prices in the dataset cover a wide range (for the analysis we kept cars from 200€ to 100K€). With such a huge spread, we knew in advance that it would be hard to obtain optimal results.  
With all the algorithms used we observed the same phenomenon: the more expensive the car, the larger the prediction error. That is, the models predicted cars above 40K€ poorly and tended to underestimate them.  
Some examples:

<img src="../output/DTTest.png"
     alt="Decision tree predictions on test data"
     style="float: left; margin-right: 10px;" />
<img src="../output/RFTest.png"
     alt="Random forest predictions on test data"
     style="float: left; margin-right: 10px;" />
<img src="../output/KNNTest.png"
     alt="K-nearest neighbours predictions on test data"
     style="float: left; margin-right: 10px;" />
<img src="../output/GBTest.png"
     alt="Gradient boosting predictions on test data"
     style="float: left; margin-right: 10px;" />

It's true that these cases are few (only 0.5% of the samples are above 40K€), but it would be nice to improve the prediction for them in the future. Actually, I think I will postpone it until after the summer :-)
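
As a starting point for that future work, one possible way to quantify the effect is to split the test set into price bands and look at the error in each one. This is a sketch on made-up numbers: the prices, the band edges other than the 200€, 40K€ and 100K€ limits mentioned above, and the column names are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical actual and predicted prices in euros (illustrative values, not model output).
predictions = pd.DataFrame({
    "actual":    [1500, 8000, 15000, 32000, 45000, 60000, 90000],
    "predicted": [1700, 7600, 15800, 30500, 38000, 47000, 70000],
})

# Signed error: negative values mean the model underestimated the price.
predictions["error"] = predictions["predicted"] - predictions["actual"]

# Price bands; the 10K€ edge is an arbitrary choice for this example.
bands = pd.cut(
    predictions["actual"],
    bins=[200, 10_000, 40_000, 100_000],
    labels=["200-10K", "10K-40K", "40K-100K"],
    include_lowest=True,
)

# Per band: mean absolute error, mean signed error and number of samples.
summary = predictions.groupby(bands, observed=True)["error"].agg(
    mae=lambda e: e.abs().mean(),
    mean_error="mean",
    n="count",
)
print(summary)
```

A clearly negative `mean_error` in the top band, together with a much larger `mae`, would be the numeric counterpart of what the plots show for expensive cars.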