In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Version history
* v.1: Started
* v.2: Replaced cosine similarity by MAD. Introduced a projective transformation to model the relationship between quarterly GDP and quarterly sales.
* v.3: Changed optimization algorithm from L-BFGS-B (default) to Nelder-Mead because the former sometimes terminates abnormally. Added assertion to make sure the optimization terminates successfully.

# Introduction

Many contestants in this competition use external GDP data to model and forecast the annual sales in 2019. Most have noted that nominal GDP in USD fits the training data well. Your modeling may, however, lead you to a more detailed breakdown, such as quarterly or even monthly sales in 2019. Quarterly nominal GDP in USD does not seem easy to find, especially if you stick with reputable sources. In the following, I combine annual GDP data in USD from the [World Bank](https://data.worldbank.org) and quarterly data in local currency from the [IMF](https://data.imf.org) to produce quarterly GDP data in USD.

# Loading data

I have already extracted the quarterly data from IMF into a CSV file.

In [None]:
gdp_quarterly_lcu = pd.read_csv('../input/gdp-fin-nor-swe-20152019-quarterly-imf/GDP_FIN_NOR_SWE_2015-2019_Quarterly_IMF.csv')
gdp_quarterly_lcu

Then I load the data from World Bank, specifically the nominal GDP data in USD.

In [None]:
df_gdp = pd.read_csv('../input/gdp-fin-nor-swe-20152019-multiple-sources/GDP_FIN_NOR_SWE_2015-2019_Multiple_Sources.csv')
gdp_annual_usd = df_gdp[(df_gdp['Measure']=='Current prices, current exchange rates')&(df_gdp['Data Source']=='World Bank')].copy()
gdp_annual_usd

The data in local currency and the data in USD differ only by the exchange rate, which varies from year to year. If we treat the exchange rate as constant within each year (an approximation), we can work out the quarterly breakdown from the data in local currency and apply it unchanged to the data in USD.

The first step is to calculate the relative contribution of each quarter to its year.

In [None]:
gdp_quarterly_lcu['annual']=gdp_quarterly_lcu[['Q1','Q2','Q3','Q4']].to_numpy().sum(axis=1)
for q in range(1,5):
    gdp_quarterly_lcu[F'Q{q}']=gdp_quarterly_lcu[F'Q{q}']/gdp_quarterly_lcu['annual']
gdp_quarterly_lcu[['Year','Country','Q1','Q2','Q3','Q4']]

Next, split the annual USD amount according to the quarterly contributions, and we are done!

In [None]:
gdp_quarterly_usd=gdp_quarterly_lcu[['Year','Country','Q1','Q2','Q3','Q4']].join(gdp_annual_usd.set_index(['Year','Country'])['Value'],on=['Year','Country'])
for q in range(1,5):
    gdp_quarterly_usd[F'Q{q}']=gdp_quarterly_usd[F'Q{q}']*gdp_quarterly_usd['Value']
gdp_quarterly_usd = gdp_quarterly_usd.rename(columns={'Value':'Annual'})
gdp_quarterly_usd
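
As a sanity check on the two steps above, here is a minimal sketch on made-up numbers. The country `Toyland` and all figures are hypothetical and are not taken from the IMF or World Bank files. It only illustrates that the quarterly USD values keep the local-currency proportions and add up to the annual USD value.

```python
import pandas as pd

# Hypothetical toy figures, not real GDP data.
quarterly_lcu = pd.DataFrame({
    "Year": [2015], "Country": ["Toyland"],
    "Q1": [240.0], "Q2": [250.0], "Q3": [250.0], "Q4": [260.0],
})
annual_usd = pd.DataFrame({"Year": [2015], "Country": ["Toyland"], "Value": [125.0]})

quarters = ["Q1", "Q2", "Q3", "Q4"]

# Share of each quarter in the local-currency annual total.
shares = quarterly_lcu[quarters].div(quarterly_lcu[quarters].sum(axis=1), axis=0)

# Attach the annual USD figure to each (Year, Country) row and scale the shares by it.
merged = quarterly_lcu[["Year", "Country"]].join(shares).merge(annual_usd, on=["Year", "Country"])
quarterly_usd = merged[quarters].mul(merged["Value"], axis=0)

print(quarterly_usd)
```

With these toy inputs the local-currency shares are 0.24, 0.25, 0.25 and 0.26, so the 125 units of annual USD GDP are split in the same proportions.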

# Validation

Ultimately, we are doing this so that we can forecast 2019 sales for the Kaggle shops using the GDP data. Does the quarterly GDP data we just derived work for that purpose? Let's investigate.

We first load the training data and add two helper columns `Year` and `Quarter`.

In [None]:
train_data = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv',parse_dates=['date'])
train_data['Year'] = train_data['date'].dt.year        # calendar year of each row
train_data['Quarter'] = train_data['date'].dt.quarter  # calendar quarter, 1 to 4
train_data

For each country-store-product-quarter combination, we calculate the correlation between the quarterly sales and the quarterly GDP from 2015 to 2018. We also calculate the annual GDP correlation for comparison.

### QoQ and YoY correlation

In [None]:
corr = []
for country in ['Finland','Norway','Sweden']:
    for store in ['KaggleMart','KaggleRama']:
        for product in ['Kaggle Mug','Kaggle Hat','Kaggle Sticker']:
                row = [country,store,product]
                df = train_data[(train_data['country']==country)&(train_data['store']==store)&(train_data['product']==product)]
                for q in range(1,5):
                    df_q = df[df['Quarter']==q]
                    sales = [df_q[df_q['Year']==yr]['num_sold'].sum() for yr in range(2015,2019)]
                    gdp = gdp_quarterly_usd[(gdp_quarterly_usd['Country']==country)&(gdp_quarterly_usd['Year']<2019)][F'Q{q}'].to_numpy()
                    row.append(pd.Series(sales).corr(pd.Series(gdp)))

                sales = [df[df['Year']==yr]['num_sold'].sum() for yr in range(2015,2019)]
                gdp = gdp_quarterly_usd[(gdp_quarterly_usd['Country']==country)&(gdp_quarterly_usd['Year']<2019)]['Annual'].to_numpy()
                row.append(pd.Series(sales).corr(pd.Series(gdp)))
                corr.append(row)
pd.DataFrame(corr,columns=['country','store','product','Q1','Q2','Q3','Q4','Annual'])

The year-on-year (linear) correlation is very high, as most contestants already know. The quarter-on-quarter correlation is not as high, especially for Norway. It seems that using the quarterly GDP data to project quarterly sales is not such a good idea.

What about predicting the quarterly breakdown within a given year? For a given year, the quarterly sales form a distribution, and the quarterly GDP data form another. For simplicity, we compare them using the mean absolute difference between corresponding quarters (referred to as MAD below), which is valid as long as we normalize the quarterly GDP data and the quarterly sales data so that each sums to 1 within the year.
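
To make the measure concrete, here is a minimal sketch on made-up numbers. The helper name `mad_between_shares` and the toy values are assumptions for illustration only. Both inputs are normalized to sum to 1 before they are compared, so the units of the raw figures do not matter.

```python
import numpy as np

def mad_between_shares(a, b):
    """Mean absolute difference between two vectors after normalizing each to sum to 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.abs(a / a.sum() - b / b.sum()).mean()

# Hypothetical quarterly figures for one year (not competition data).
toy_sales = [200, 250, 250, 300]   # shares: 0.20, 0.25, 0.25, 0.30
toy_gdp = [24, 25, 25, 26]         # shares: 0.24, 0.25, 0.25, 0.26

toy_mad = mad_between_shares(toy_sales, toy_gdp)
print(toy_mad)
```

Here the shares differ by 0.04 in Q1 and Q4 and agree in Q2 and Q3, so the result is about 0.02.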

### Predicting quarterly contributions

In [None]:
l_1 = []
for country in ['Finland','Norway','Sweden']:
    for store in ['KaggleMart','KaggleRama']:
        for product in ['Kaggle Mug','Kaggle Hat','Kaggle Sticker']:
            row = [country,store,product]
            for yr in range(2015,2019):
                df = train_data[(train_data['country']==country)&(train_data['store']==store)&
                                (train_data['product']==product)&(train_data['Year']==yr)]
                sales = np.array([df[df['Quarter']==q]['num_sold'].sum() for q in range(1,5)])
                gdp = gdp_quarterly_usd[(gdp_quarterly_usd['Country']==country)&(gdp_quarterly_usd['Year']==yr)][['Q1','Q2','Q3','Q4']].to_numpy().flatten()
                row.append(np.abs(sales/np.sum(sales)-gdp/np.sum(gdp)).mean())
            l_1.append(row)
                           
pd.DataFrame(l_1,columns=['country','store','product','2015','2016','2017','2018'])

So the MADs are fairly small, but not small enough. Can we do better? Technically, we have the quarterly GDPs expressed as proportions \\((g_0:g_1:g_2:g_3)\\) in projective space \\(\mathbb{P}^3\\), normalized so that \\(g_0+g_1+g_2+g_3=1\\), and we want to find a model, i.e., a projective transformation, that gives us the quarterly sales proportions \\((q_0:q_1:q_2:q_3)\\) where \\(q_0+q_1+q_2+q_3=1\\). We are going to try the simplest projective transformation, a diagonal scaling:
$$
(g_0:g_1:g_2:g_3)\mapsto(\alpha_0g_0:\alpha_1g_1:\alpha_2g_2:\alpha_3g_3)
$$
where, without loss of generality, we can assume \\(\alpha_0=1\\), because multiplying all \\(\alpha_i\\) by a common factor does not change the proportions. This innocent-looking transformation is more sophisticated than it looks, because it predicts the quarterly sales by a rational transformation:
$$
q_i=\frac{\alpha_ig_i}{\alpha_0g_0+\alpha_1g_1+\alpha_2g_2+\alpha_3g_3}
$$
Finding the parameters \\(\alpha_1\\), \\(\alpha_2\\) and \\(\alpha_3\\) requires nonlinear optimization with the \\(\ell^1\\) objective, and this is done for every country-store-product combination with 4 data points from 2015 to 2018.
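
Before fitting anything, a small sketch may help show what the transformation does. The function name `project` and the numbers below are hypothetical and chosen only for illustration. The sketch checks that the output always sums to 1 and that rescaling all \\(\alpha_i\\) by a common factor leaves the result unchanged, which is why \\(\alpha_0=1\\) costs no generality.

```python
import numpy as np

def project(alpha, g):
    """Apply q_i = alpha_i * g_i / sum_j(alpha_j * g_j) to one year of quarterly shares."""
    y = np.asarray(alpha, dtype=float) * np.asarray(g, dtype=float)
    return y / y.sum()

# Hypothetical GDP shares and weights (not fitted values).
g = np.array([0.24, 0.25, 0.25, 0.26])
alpha = np.array([1.0, 1.1, 0.9, 1.2])

q = project(alpha, g)
q_scaled = project(5.0 * alpha, g)   # common factor cancels in the ratio
q_identity = project(np.ones(4), g)  # all weights equal: GDP shares are returned as-is

print(q)
print(np.allclose(q, q_scaled))
```

Note that each \\(q_i\\) depends on all four \\(g_j\\) through the denominator, so the map is not linear in the GDP shares even though it has only three free parameters.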

In [None]:
def obj_fn(x,gdp_quarterly_array,quarterly_sales_array):
    alpha = np.array([1,x[0],x[1],x[2]]).reshape((1,4))
    y=alpha*gdp_quarterly_array
    y=y/np.sum(y,axis=1,keepdims=True)
    return np.abs(y-quarterly_sales_array).mean()

In [None]:
from scipy.optimize import minimize

l_1 = []
sales_est_cache = {}
for country in ['Finland','Norway','Sweden']:
    for store in ['KaggleMart','KaggleRama']:
        for product in ['Kaggle Mug','Kaggle Hat','Kaggle Sticker']:
            row = [country,store,product]
            df = train_data[(train_data['country']==country)&(train_data['store']==store)&
                                (train_data['product']==product)]
            quarterly_sales_array = np.array([df[(df['Year']==yr)&(df['Quarter']==q)]['num_sold'].sum() for yr in range(2015,2019) for q in range(1,5)])
            quarterly_sales_array = quarterly_sales_array.reshape((-1,4))
            quarterly_sales_array = quarterly_sales_array/np.sum(quarterly_sales_array,axis=1,keepdims=True)
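            # np.float was removed in NumPy 1.24; the builtin float is equivalent. [:-1,:] drops the last (2019) row.
            gdp_quarterly_array = gdp_quarterly_usd[gdp_quarterly_usd.Country==country][['Q1','Q2','Q3','Q4']].to_numpy().astype(float)[:-1,:]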
            gdp_quarterly_array = gdp_quarterly_array/np.sum(gdp_quarterly_array,axis=1,keepdims=True)
            result = minimize(lambda x: obj_fn(x,gdp_quarterly_array,quarterly_sales_array), (1,1,1), bounds=[(0,None)]*3,
                             method='Nelder-Mead')
            assert result.success
            x1,x2,x3 = result.x
            alpha = np.array([1,x1,x2,x3]).reshape((1,4))
            sales_est = alpha*gdp_quarterly_array
            sales_est=sales_est/np.sum(sales_est,axis=1,keepdims=True)
            for i in range(4):
                row.append(np.abs(sales_est[i,:]-quarterly_sales_array[i,:]).mean())
            l_1.append(row)
            sales_est_cache[(country,store,product)] = [quarterly_sales_array,gdp_quarterly_array,sales_est]
    
                           
pd.DataFrame(l_1,columns=['country','store','product','2015','2016','2017','2018'])

The MADs are reduced, in some cases by an order of magnitude. Let's check the extreme results, starting with the worst one, which is Norway-KaggleRama-Mug in 2016 with a MAD of 0.009285. Let's visualize the distributions.

In [None]:
quarterly_sales_array,gdp_quarterly_array,sales_est=sales_est_cache[('Norway','KaggleRama','Kaggle Mug')]
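# Row index 1 corresponds to 2016; print so the value is shown even though it is not the last expression
print(np.abs(sales_est[1,:]-quarterly_sales_array[1,:]).mean())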
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(8, 8))
plt.bar(np.arange(4),quarterly_sales_array[1,:],width=0.25)
plt.bar(np.arange(4)+0.25,sales_est[1,:],width=0.25)
plt.bar(np.arange(4)+0.5,gdp_quarterly_array[1,:],width=0.25)
plt.xticks(np.arange(4)+0.25,['Q1','Q2','Q3','Q4'])
plt.legend(['Actual Sales','Predicted Sales','GDP'])
plt.title('2016 Quarterly Distributions (Nor-Rama-Mug)')
plt.show()

Now visualize the best result, which is Sweden-KaggleRama-Hat in 2018 with a MAD of 0.000234.

In [None]:
quarterly_sales_array,gdp_quarterly_array,sales_est=sales_est_cache[('Sweden','KaggleRama','Kaggle Hat')]
# Row index 3 corresponds to 2018 (rows are ordered 2015, 2016, 2017, 2018)
print(np.abs(sales_est[3,:]-quarterly_sales_array[3,:]).mean())
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(8, 8))
plt.bar(np.arange(4),quarterly_sales_array[3,:],width=0.25)
plt.bar(np.arange(4)+0.25,sales_est[3,:],width=0.25)
plt.bar(np.arange(4)+0.5,gdp_quarterly_array[3,:],width=0.25)
plt.xticks(np.arange(4)+0.25,['Q1','Q2','Q3','Q4'])
plt.legend(['Actual Sales','Predicted Sales','GDP'])
plt.title('2018 Quarterly Distributions (Swe-Rama-Hat)')
plt.show()

If we had used the quarterly GDP shares directly for prediction, we would have predicted Q4 to be the strongest quarter, whereas the actual Q4 sales were mediocre.

# Conclusion

It seems that a plausible strategy is to forecast the 2019 annual sales using the annual GDP data, and then predict the quarterly breakdown using the quarterly data. A simple projective transformation can be used to model the relationship between quarterly GDP distributions and quarterly sales distributions. Since we are using only the relative contributions of the quarters, we could have just used the quarterly GDP data in local currency.
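
The strategy can be sketched as follows. Everything here is hypothetical: the helper `quarterly_forecast`, the annual forecast of 1000 units, the weights `alpha` and the exchange rate are placeholders, not values fitted in this notebook. The sketch also illustrates the last point: converting the quarterly GDP with a constant rate does not change the result, because only the proportions enter the formula.

```python
import numpy as np

def quarterly_forecast(annual_sales, gdp_quarterly, alpha):
    """Split an annual sales forecast across quarters using alpha-weighted GDP proportions."""
    weights = np.asarray(alpha, dtype=float) * np.asarray(gdp_quarterly, dtype=float)
    return annual_sales * weights / weights.sum()

annual_sales_2019 = 1000.0                         # hypothetical annual forecast
gdp_lcu = np.array([240.0, 250.0, 250.0, 260.0])   # hypothetical quarterly GDP in local currency
alpha = [1.0, 1.1, 0.9, 1.2]                       # hypothetical weights, alpha_0 fixed at 1
rate = 0.125                                       # hypothetical constant LCU-to-USD rate

forecast_lcu = quarterly_forecast(annual_sales_2019, gdp_lcu, alpha)
forecast_usd = quarterly_forecast(annual_sales_2019, gdp_lcu * rate, alpha)

print(forecast_lcu)
print(np.allclose(forecast_lcu, forecast_usd))
```

The quarterly forecasts add up to the annual forecast by construction, so any error in the annual figure carries over proportionally to every quarter.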