#### What are you trying to do in this notebook?
The goal is to combine several models' predictions to achieve the highest score without overfitting. 
For this, hyperparameters should be tuned without information.
It is worth seeing which techniques can be applied to maximize the score function.
#### Why are you trying it?
- To minimize the difference between the actual and predicted values while maximizing their average.
- The predicted value should be close to the actual value and, at the same time, as large as possible. 
- Ensembling the logarithms of the values may work better, because the logarithm narrows the range of possible predictions and therefore brings the values from different models closer to each other.
- Averaging in log space and then taking the exponent should arrive at a better point than blending values that are far apart from each other.
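
The log-space idea in the last two bullets can be sketched as follows. The prediction values are made up for illustration and are not taken from the submissions loaded in this notebook. Averaging the logarithms and taking the exponent gives the geometric mean, which is never larger than the arithmetic mean for positive values, so it pulls the blend toward the smaller predictions.

```python
import numpy as np

# Hypothetical predictions from three models for four rows (illustrative values only).
preds = np.array([
    [100.0, 250.0,  900.0, 40.0],
    [110.0, 240.0, 1200.0, 45.0],
    [ 95.0, 300.0,  700.0, 50.0],
])

# Arithmetic mean: blend on the original scale.
arith_blend = preds.mean(axis=0)

# Log-space blend: average the logarithms, then map back with the exponent.
# This equals the geometric mean of the three predictions in each column.
log_blend = np.exp(np.log(preds).mean(axis=0))

print(np.round(arith_blend))
print(np.round(log_blend))
```

The gap between the two blends grows as the models disagree more, which is where the choice of scale matters most.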

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-jan-2022/sample_submission.csv')

In [None]:
sub0 = pd.read_csv('../input/tpsjan22-eda-baseline-train-submission/submission.csv')
sub1 = pd.read_csv('../input/tpsjan22-06-lightgbm-quickstart/submission_lightgbm_quickstart.csv')
sub2 = pd.read_csv('../input/tpsjan22-03-linear-model/submission_linear_model_rounded.csv')

In [None]:
pred = np.array([np.array(sub0['num_sold'].values), np.array(sub1['num_sold'].values), np.array(sub2['num_sold'].values)])
pred.T

In [None]:
np.log(pred.T)

In [None]:
plt.figure(figsize=(16,3))
plt.hist(train_df['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, label='Training')
plt.hist(sub0['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, rwidth=0.5, label='sub0 predictions')
plt.hist(sub1['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, rwidth=0.5, label='sub1 predictions')
plt.hist(sub2['num_sold'], bins=np.linspace(0, 3000, 201),
         density=True, rwidth=0.5, label='sub2 predictions')
plt.xlabel('num_sold')
plt.ylabel('Density')
plt.legend()
plt.show()

In [None]:
test_df['num_sold0'] = sub0.num_sold
test_df.head()

In [None]:
country='Norway'
store='KaggleMart'
product='Kaggle Hat'

In [None]:
plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
plt.legend()
plt.title('True num_sold over the training period')
plt.show()

In [None]:
plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
plt.scatter(np.arange(len(train_subset)),np.log(train_subset.num_sold), label='true', alpha=0.5, color='red', s=3)
plt.legend()
plt.title('Log of true num_sold over the training period')
plt.show()

In [None]:
test_df['num_sold0'] = sub0.num_sold
test_df['num_sold1'] = sub1.num_sold
test_df['num_sold2'] = sub2.num_sold

In [None]:
plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
sub_subset = test_df[(test_df.country == country) & (test_df.store == store) & (test_df['product'] == product)]
#plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
plt.scatter(np.arange(len(sub_subset)), sub_subset.num_sold0, label='sub0', alpha=0.5, color='orange', s=3)
plt.scatter(np.arange(len(sub_subset)), sub_subset.num_sold1, label='sub1', alpha=0.5, color='green', s=3)
plt.scatter(np.arange(len(sub_subset)), sub_subset.num_sold2, label='sub2', alpha=0.5, color='blue', s=3)
plt.legend()
plt.title('Predicted num_sold over the test period')
plt.show()

In [None]:
plt.figure(figsize=(20, 6))
train_subset = train_df[(train_df.country == country) & (train_df.store == store) & (train_df['product'] == product)]
sub_subset = test_df[(test_df.country == country) & (test_df.store == store) & (test_df['product'] == product)]
#plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
plt.scatter(np.arange(len(sub_subset)), np.log(sub_subset.num_sold0), label='sub0', alpha=0.5, color='orange', s=3)
plt.scatter(np.arange(len(sub_subset)), np.log(sub_subset.num_sold1), label='sub1', alpha=0.5, color='green', s=3)
plt.scatter(np.arange(len(sub_subset)), np.log(sub_subset.num_sold2), label='sub2', alpha=0.5, color='blue', s=3)
plt.legend()
plt.title('Log of predicted num_sold over the test period')
plt.show()

In [None]:
mean = np.mean(pred, axis=0)
med = np.median(pred, axis=0)
maxi = np.max(pred, axis=0)

log_mean = np.mean(np.log(pred), axis=0)

In [None]:
sub['num_sold'] = np.round(med)
sub.to_csv('submission_median.csv', index=False)
sub.head(5)

In [None]:
sub['num_sold'] = np.round(mean)
sub.to_csv('submission_mean.csv', index=False)
sub.head(5)

In [None]:
sub['num_sold'] = np.round(maxi)
sub.to_csv('submission_maximum.csv', index=False)
sub.head(5)

In [None]:
# log_mean holds the mean of the log-predictions, so map it back with exp before rounding
sub['num_sold'] = np.round(np.exp(log_mean))
sub.to_csv('submission_log_mean.csv', index=False)
sub.head(5)

#### Did it work?
Four blends were written out as submissions:
- Simply averaging all predictions and rounding the result.
- Taking the median and rounding, so that the result theoretically satisfies both requirements of the metric.
- Taking the maximum and rounding, so that SMAPE decreases because a larger prediction increases the denominator.
- Taking the mean of the logarithms, then the exponent, and rounding.
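
The denominator argument for the maximum can be checked with a small sketch. It assumes the common SMAPE definition, `100 * mean(2 * |F - A| / (|A| + |F|))`. The `smape` helper and the numbers below are illustrative and are not taken from the competition code. For the same absolute error, an over-prediction scores better than an under-prediction, but the maximum can still lose if it lands far from the actual value.

```python
import numpy as np

def smape(actual, predicted):
    """SMAPE in percent, assuming the 2*|F-A| / (|A|+|F|) definition."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = np.abs(actual) + np.abs(predicted)
    ratio = np.zeros_like(denom)
    # Rows where both values are zero count as zero error.
    np.divide(2.0 * np.abs(predicted - actual), denom, out=ratio, where=denom != 0)
    return 100.0 * ratio.mean()

actual = [100.0]

# Same absolute error of 20, different direction.
over = smape(actual, [120.0])    # 100 * 40 / 220, about 18.18
under = smape(actual, [80.0])    # 100 * 40 / 180, about 22.22

# Hypothetical model predictions 90, 105 and 130 for the same row.
score_max = smape(actual, [130.0])     # maximum of the three
score_median = smape(actual, [105.0])  # median of the three

print(over, under)
print(score_max, score_median)
```

Here the median beats the maximum, because the larger numerator outweighs the gain from the larger denominator.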

#### What did you not understand about this process?
Everything needed is provided on the competition data page, and I had no problems while working on it. If anything in this notebook is unclear, please leave a comment.

#### What else do you think you can try as part of this approach?
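Finding the median and maximum of the logarithms is unnecessary, because the logarithm is a strictly increasing function: taking the exponent afterwards gives the same result as the median and maximum of the raw values.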
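
A quick check of this claim, using made-up predictions from three models. With an odd number of models the median is one of the original values, so it commutes with the logarithm. With an even number, `np.median` averages the two middle values and the two routes would generally differ.

```python
import numpy as np

# Hypothetical predictions from three models for three rows (illustrative values only).
preds = np.array([
    [100.0, 250.0,  900.0],
    [110.0, 240.0, 1200.0],
    [ 95.0, 300.0,  700.0],
])

# Median and maximum computed directly and via log space.
med_direct = np.median(preds, axis=0)
med_via_log = np.exp(np.median(np.log(preds), axis=0))
max_direct = np.max(preds, axis=0)
max_via_log = np.exp(np.max(np.log(preds), axis=0))

# The mean does not commute with the logarithm, so only it needs a separate log variant.
mean_direct = np.mean(preds, axis=0)
mean_via_log = np.exp(np.mean(np.log(preds), axis=0))

print(np.allclose(med_direct, med_via_log))
print(np.allclose(max_direct, max_via_log))
print(np.allclose(mean_direct, mean_via_log))
```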