## Introduction

The goal is to combine several models' predictions to achieve the best score (the lowest SMAPE) without overfitting.
For this, hyperparameters should be tuned without information about the LB score. It is worth looking at which techniques can be applied to optimize the score function.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Reference
* sub0: https://www.kaggle.com/fergusfindley/tpsjan22-eda-baseline-train-submission   
* sub1: https://www.kaggle.com/ambrosm/tpsjan22-06-lightgbm-quickstart                  
* sub2: https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model                   

## LB score
* sub0: 4.61163 
* sub1: 4.38188  
* sub2: 4.33338

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-jan-2022/sample_submission.csv')

sub0 = pd.read_csv('../input/tpsjan22-eda-baseline-train-submission/submission.csv')
sub1 = pd.read_csv('../input/tpsjan22-06-lightgbm-quickstart/submission_lightgbm_quickstart.csv')
sub2 = pd.read_csv('../input/tpsjan22-03-linear-model/submission_linear_model_rounded.csv')

## Some visualization

In [None]:
# Stack the three submissions into one array of shape (n_models, n_rows)
pred = np.array([sub0['num_sold'].values, sub1['num_sold'].values, sub2['num_sold'].values])
pred.T

In [None]:
np.log(pred.T)

In [None]:
# Plot the distribution of the training target and of the three sets of test predictions
plt.figure(figsize=(16,3))
plt.hist(train_df['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, label='Training')
plt.hist(sub0['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, rwidth=0.5, label='sub0 predictions')
plt.hist(sub1['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, rwidth=0.5, label='sub1 predictions')
plt.hist(sub2['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, rwidth=0.5, label='sub2 predictions')
plt.xlabel('num_sold')
plt.ylabel('Density')
plt.legend()
plt.show()

In [None]:
test_df['num_sold0'] = sub0.num_sold
test_df.head()

In [None]:
country='Norway'
store='KaggleMart'
product='Kaggle Hat'
plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
plt.legend()
plt.title('True num_sold over the training period')
plt.show()

In [None]:
plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
plt.scatter(np.arange(len(train_subset)),np.log(train_subset.num_sold), label='true', alpha=0.5, color='red', s=3)
plt.legend()
plt.title('Log of true num_sold over the training period')
plt.show()

In [None]:
test_df['num_sold0'] = sub0.num_sold
test_df['num_sold1'] = sub1.num_sold
test_df['num_sold2'] = sub2.num_sold

plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
sub_subset = test_df[(test_df.country == country) & (test_df.store == store) & (test_df['product'] == product)]
#plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
plt.scatter(np.arange(len(sub_subset)), sub_subset.num_sold0, label='sub0', alpha=0.5, color='orange', s=3)
plt.scatter(np.arange(len(sub_subset)), sub_subset.num_sold1, label='sub1', alpha=0.5, color='green', s=3)
plt.scatter(np.arange(len(sub_subset)), sub_subset.num_sold2, label='sub2', alpha=0.5, color='blue', s=3)
plt.legend()
plt.title('Predicted num_sold over the test period')
plt.show()

In [None]:
plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
sub_subset = test_df[(test_df.country == country) & (test_df.store == store) & (test_df['product'] == product)]
#plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
plt.scatter(np.arange(len(sub_subset)), np.log(sub_subset.num_sold0), label='sub0', alpha=0.5, color='orange', s=3)
plt.scatter(np.arange(len(sub_subset)), np.log(sub_subset.num_sold1), label='sub1', alpha=0.5, color='green', s=3)
plt.scatter(np.arange(len(sub_subset)), np.log(sub_subset.num_sold2), label='sub2', alpha=0.5, color='blue', s=3)
plt.legend()
plt.title('Log of predicted num_sold over the test period')
plt.show()

## Score
Symmetric mean absolute percentage error (SMAPE or sMAPE) is an accuracy measure based on percentage (or relative) errors. It is usually defined as follows:

${\displaystyle {\text{SMAPE}}={\frac {100\%}{n}}\sum _{t=1}^{n}{\frac {\left|F_{t}-A_{t}\right|}{(|A_{t}|+|F_{t}|)/2}}}$

where $A_t$ is the actual value and $F_t$ is the forecast value.

Ref: https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error
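
As a minimal sketch, the formula can be written in NumPy as follows. The function name `smape` and the choice to count a term with a zero denominator as zero error are assumptions made for illustration; the competition's own implementation may handle that edge case differently.

```python
import numpy as np

def smape(actual, forecast):
    """SMAPE in percent, following the definition given in this section."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    # Assumption: when both values are 0 the term is counted as zero error
    ratio = np.divide(np.abs(forecast - actual), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100 * ratio.mean()
```

With such a helper, any blend could be scored against known targets (for example on a held-out part of the training data) instead of relying on the LB.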

## Discussion
So our goal is to minimize the difference between the actual and predicted values while keeping the denominator, their average, as large as possible (num_sold > 0, so we can ignore the absolute values in the denominator).
In other words, the predicted value should be close to the actual one and, for the same absolute error, as large as possible: an over-forecast is penalized less than an under-forecast of the same size. That is the reason rounding and multiplying by a factor close to 1 improves the LB score.
Additionally, ensembling the logarithms of the values may work better, because the logarithm compresses the range of the predictions and therefore brings the values of the different models closer to each other. Averaging them and taking the exponent should arrive at a better point than blending values that are far apart.
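
To see why larger predictions are favoured, here is a small numeric check on a single made-up data point (the values 100, 90 and 110 are illustrative, not taken from the competition data). The helper `smape_term` is a hypothetical name for one term of the SMAPE sum.

```python
def smape_term(actual, forecast):
    """One term of the SMAPE sum, in percent."""
    return 100 * abs(forecast - actual) / ((abs(actual) + abs(forecast)) / 2)

actual = 100.0
under = smape_term(actual, 90.0)   # forecast 10 below the actual value
over = smape_term(actual, 110.0)   # forecast 10 above the actual value

# The absolute error is the same, but the denominator is larger for the over-forecast
print(f'under-forecast: {under:.3f}, over-forecast: {over:.3f}')
```

The under-forecast is divided by 95 and the over-forecast by 105, so the over-forecast gets the smaller penalty.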

## Approach
Here I want to try ensembling public notebooks by:
1. Simply finding the average of all predictions and rounding the result
2. Finding the median and rounding, so the result will theoretically satisfy both requirements of the metric
3. Finding the maximum and rounding, so SMAPE will be decreased because of the larger denominator
4. Finding the mean of the logarithms, then taking the exponent and rounding

Checking the median and the maximum of the logarithms is unnecessary: the logarithm is a strictly increasing function, so with three models the median and the maximum of the logarithms, after taking the exponent, are the same as the median and the maximum of the raw values.
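
The monotonicity argument can be checked on synthetic data. This is only a sketch: `toy_pred` is a randomly generated stand-in for three models' predictions, not the real submissions. It also shows that the log mean (a geometric mean) never exceeds the plain mean, so option 4 does differ from option 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predictions: 3 models, 1000 rows, all strictly positive
toy_pred = rng.uniform(50, 3000, size=(3, 1000))

# Median and maximum are unchanged by going through the logarithm
med_direct = np.median(toy_pred, axis=0)
med_via_log = np.exp(np.median(np.log(toy_pred), axis=0))
max_direct = np.max(toy_pred, axis=0)
max_via_log = np.exp(np.max(np.log(toy_pred), axis=0))
print(np.allclose(med_direct, med_via_log), np.allclose(max_direct, max_via_log))

# The mean is different: exp(mean(log x)) is the geometric mean
arith_mean = np.mean(toy_pred, axis=0)
geo_mean = np.exp(np.mean(np.log(toy_pred), axis=0))
print(np.all(geo_mean <= arith_mean + 1e-9))
```

Note that the median argument relies on an odd number of models; with an even number, `np.median` averages the two middle values and the two routes would no longer agree exactly.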

In [None]:
mean = np.mean(pred, axis=0)
med = np.median(pred, axis=0)
maxi = np.max(pred, axis=0)

log_mean = np.mean(np.log(pred), axis=0)

In [None]:
sub['num_sold'] = np.round(mean)
sub.to_csv('submission_mean.csv', index=False)
sub.head(5)

In [None]:
sub['num_sold'] = np.round(med)
sub.to_csv('submission_median.csv', index=False)
sub.head(5)

In [None]:
sub['num_sold'] = np.round(maxi)
sub.to_csv('submission_maximum.csv', index=False)
sub.head(5)

In [None]:
# log_mean is on the log scale, so take the exponent before rounding
sub['num_sold'] = np.round(np.exp(log_mean))
sub.to_csv('submission_log_mean.csv', index=False)
sub.head(5)

LB scores are:
* mean: 4.25767
* median: 4.21332
* maximum: 4.73279
* logarithmic mean: 4.25514


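We can see that the median works better than the mean and the maximum. Also, the logarithmic mean scores slightly better than the plain mean (4.25514 vs 4.25767), while for the median and the maximum the logarithm gives the same result. So the earlier assumption was right.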