Hi, this notebook describes a simple **Hybrid Model** with two steps:

1) **Linear Regression** that uses the **GDP** feature as the input and **'num_sold'** as the target.

2) **XGB Regressor** that uses country, store, product and date features as the inputs and the **'Residual'** from the first model as the target.

(There is not much feature engineering beyond the basics, and no parameter tuning.)

[Here](https://www.kaggle.com/ryanholbrook/hybrid-models) is a good resource on 'Hybrid Models'.
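
The two-step idea can be sketched on synthetic data. This is only a minimal illustration, not this notebook's actual pipeline: the toy arrays `trend` and `season` are made up, and scikit-learn's `GradientBoostingRegressor` stands in for the XGB Regressor so that the example needs only scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy data (hypothetical): a linear trend plus a weekend effect plus noise
n = 400
trend = np.linspace(100, 200, n).reshape(-1, 1)   # plays the role of the GDP feature
season = (np.arange(n) % 7).reshape(-1, 1)        # plays the role of day-of-week
y = 2.0 * trend.ravel() + 10.0 * (season.ravel() >= 5) + rng.normal(0, 1, n)

# Step 1: linear model on the trend feature only
step_1 = LinearRegression().fit(trend, y)
residual = y - step_1.predict(trend)

# Step 2: tree-based model on the remaining feature, fitted to the residual
step_2 = GradientBoostingRegressor(random_state=0).fit(season, residual)

# The final prediction is the sum of both steps
hybrid_pred = step_1.predict(trend) + step_2.predict(season)

mse_linear = np.mean((y - step_1.predict(trend)) ** 2)
mse_hybrid = np.mean((y - hybrid_pred) ** 2)
print(mse_linear, mse_hybrid)
```

On this toy data the second step picks up the weekend effect that the straight line cannot represent, so the training error of the sum is lower than that of the linear model alone.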

#### **Table of Contents**

#### 1. Setup
* Importing Libraries
* Reading the Data

#### 2. GDP
* Reading the GDP data
* Comparison of GDP and Yearly Average for each country
* Creating the GDP Feature

#### 3. Linear Regression

#### 4. Basic Feature Engineering

#### 5. XGB Regressor

#### 6. Submission File


GDP data from: https://databank.worldbank.org/reports.aspx?source=2&series=NY.GDP.MKTP.CD&country=WLD#

## 1. Setup

#### Importing Libraries

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression    #Linear Regression for the first step of the Hybrid Model
from xgboost import XGBRegressor  #XGB Regressor will be used for the final(second) step of the Hybrid Model

#### Reading the Data

In [None]:
data_1 = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv') #train data
data_2 = pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv') #test data
sub = pd.read_csv('../input/tabular-playground-series-jan-2022/sample_submission.csv') #sample submission

for data in [data_1, data_2]:
    data['date'] = pd.to_datetime(data.date)  #dtype converted for date

for data in [data_1, data_2]:
    data['Year'] = data['date'].dt.year   #Year Feature
    
for data in [data_1, data_2]:
    display(data.head())  #displays the first five rows of each of the train and test data

## 2. GDP

#### Reading the GDP data


Notice that the GDP increases year by year for Finland and Sweden over 2015-2018,
but **Norway has a drop** in **2016**.

In [None]:
GDP = pd.read_csv('../input/gdp-data-3-countries/8a04c824-8256-4330-aa88-1fd167a692ee_Data.csv', index_col = 'Country Name')

GDP = GDP.iloc[1:4,-6:-1].T

GDP = GDP.astype(int)

GDP

#### Comparison of GDP and Yearly Average for each country

Take a look at the **average of num_sold** over the years (2015-2018) for **each country**.

In [None]:
sns.set_theme(style="darkgrid")
sns.catplot(x="Year", y="num_sold", kind="point", data=data_1, col = 'country', color = 'green')

Here we can see that the increasing / decreasing pattern of the Sales data mirrors the GDP
(**Norway's drop in 2016**).

We might go ahead and make the following regression line:

$$ Sales = a*GDP + b $$

However, if you sort the countries by GDP in descending order, it's Sweden, Norway, and Finland.

Sorted by Sales, it's Norway, Sweden, and Finland. So perhaps Sales follow the GDP **growth rate** rather than the absolute GDP level, and we could factor in the relative Sales of the countries separately.

Below is the GDP on a relative scale for each country, with its **2015 GDP set to 1**.

In [None]:
for Country in ['Finland','Norway','Sweden']:
    GDP[Country] = GDP[Country] / GDP[Country].iloc[0]  #divide by the first row (2015); .iloc makes the positional lookup explicit
    
GDP['Year'] = np.arange(2015,2020)
GDP = GDP.set_index('Year')

GDP

For each country, the relative GDP is multiplied by the **'base num_sold'** (the average num_sold in 2015) to engineer the feature below.

In [None]:
def GDP_value(row):
    mean_num_sold_2015 = data_1.loc[data_1['Year'] == 2015].groupby('country')['num_sold'].mean()[row['country']]
    GDP_relative = GDP.loc[row['Year'],row['country']]
    row['GDP_value'] = mean_num_sold_2015 * GDP_relative
    return row

data_1 = data_1.apply(GDP_value, axis='columns')
data_2 = data_2.apply(GDP_value, axis='columns')

display(data_1.head())
display(data_2.head())
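
The row-wise `apply` above recomputes the 2015 group means for every row, which can be slow on large frames. A vectorized alternative is sketched below on toy data. The frames `train` and `gdp_relative` and their values are hypothetical stand-ins for `data_1` and the relative `GDP` table, so treat this as a possible approach rather than a drop-in replacement.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the train data and the relative GDP table
train = pd.DataFrame({
    'country': ['Finland', 'Finland', 'Norway', 'Norway'],
    'Year': [2015, 2016, 2015, 2016],
    'num_sold': [100, 120, 200, 180],
})
gdp_relative = pd.DataFrame(
    {'Finland': [1.0, 1.02], 'Norway': [1.0, 0.95]},
    index=pd.Index([2015, 2016], name='Year'),
)

# Mean num_sold per country in 2015, computed once instead of once per row
base = train.loc[train['Year'] == 2015].groupby('country')['num_sold'].mean()

# Reshape the GDP table to long form: one row per (Year, country)
gdp_long = gdp_relative.reset_index().melt(
    id_vars='Year', var_name='country', value_name='GDP_relative'
)

# Attach the relative GDP to every row, then scale by the 2015 base
out = train.merge(gdp_long, on=['Year', 'country'], how='left')
out['GDP_value'] = out['country'].map(base) * out['GDP_relative']
out = out.drop(columns='GDP_relative')
print(out)
```

The same `base` and `gdp_long` could then be reused for the test frame, since the base is always taken from the training data.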

In [None]:
#sns.set_theme(style="darkgrid")

#sns.relplot(x = 'date', y = 'num_sold', ci=None, kind = 'line', data = data_1, col = 'country', hue = 'Year')

## 3. Linear Regression

In [None]:
X_1 = data_1[['GDP_value']]
Y_1 = data_1['num_sold']

m_1 = LinearRegression()

m_1.fit(X_1,Y_1)
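
To connect the fitted model back to the regression line $Sales = a*GDP + b$, the slope and intercept can be read from the fitted estimator. The sketch below uses made-up toy data (`gdp_toy`, `sales_toy`) generated from a known line, so the recovered values can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical) generated exactly from Sales = 2 * GDP + 1
gdp_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
sales_toy = 2.0 * gdp_toy.ravel() + 1.0

toy_model = LinearRegression().fit(gdp_toy, sales_toy)

a = toy_model.coef_[0]      # slope: change in Sales per unit of GDP
b = toy_model.intercept_    # intercept: predicted Sales when GDP is 0
print(a, b)
```

For the model fitted above, the corresponding values are `m_1.coef_[0]` and `m_1.intercept_`.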

## 4. Basic Feature Engineering

For the Date Features, you can type '.dt' in the [pandas](https://pandas.pydata.org/docs/search.html?q=.dt) documentation search to learn about them.

The idea behind these features is that information like store, country, weekend/weekday, month, etc. has to be encoded as **numbers** for the XGB Regressor to work with.

In [None]:
for data in [data_1, data_2]:
    for country in data.country.unique():
        data[country] = 1 * (data['country'] == country) #Country One-Hot Encoding

# A note on the One-Hot Encodings:
# This may not be the usual/most efficient way of One-Hot Encoding, but it is how I have usually done it.
# Each comparison gives a boolean column (True or False values); since those are not numbers, multiplying by 1 turns True into 1 and False into 0.
# So in the 'Finland' column, a value of 1 means the row is a sale from Finland, and a value of 0 means it is a sale from Norway or Sweden.
# If you have a suggestion for me, feel free to leave it in the comments!

    for product in data['product'].unique():
        data[product] = 1 * (data['product'] == product) #Product One-Hot Encoding
    
    for store in data.store.unique():
        data[store] = 1 * (data['store'] == store) #Store One-Hot Encoding
        
        
    data['Day'] = data['date'].dt.day #Day: 1,2,3,...26,27,28/29/30/31 (depending on each month)
    
    data['Month'] = data['date'].dt.month #Month: 1,2,3,...,11,12
    
    for i in range(1,13):
        data['Month_'+str(i)] = 1 * (data['Month'] == i) #One-Hot Encoding of Month
        
    data['Day_of_Week'] = data['date'].dt.day_of_week #This is the 0-Monday, 1-Tuesday, ..., 6-Sunday
    
    data['Weekend'] =  1 * (data['Day_of_Week'] >= 5) #5 and 6 are weekend days. This gives the weekend feature.
    
    for i in range(0,7):
        data['Day_'+str(i)] = 1 * (data['Day_of_Week'] == i) #One-Hot Encoding of Day_of_Week
    
    data['Day_of_Year'] = data['date'].dt.day_of_year #Day of Year: 1-365, 1-366(Leap Year)
    
data_1 = data_1.drop(['Month','Day_of_Week'], axis = 1) #Month and Day of Week are One-Hot Encoded, so these ones are dropped.
data_2 = data_2.drop(['Month','Day_of_Week'], axis = 1)

display(data_1.head())
display(data_2.tail())
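
As one possible alternative to the loop-based One-Hot Encoding above, `pd.get_dummies` can encode several categorical columns in a single call. The sketch below uses a hypothetical toy frame, and it keeps the default column prefixes (for example `country_Finland`), so the resulting column names differ from the ones created in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same categorical columns as the data above
toy = pd.DataFrame({
    'country': ['Finland', 'Norway', 'Sweden'],
    'store': ['KaggleMart', 'KaggleRama', 'KaggleMart'],
    'product': ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker'],
})

# One call encodes all three columns; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy[['country', 'store', 'product']], dtype=int)

# Keep the original columns alongside the new 0/1 columns
toy_encoded = pd.concat([toy, encoded], axis=1)
print(toy_encoded.columns.tolist())
```

The prefixes guard against name clashes if two columns ever share a category name. When train and test are encoded separately, it is worth checking that both end up with the same columns.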

## 5. XGB Regressor

In [None]:
data_1['Residual'] = data_1['num_sold'] - m_1.predict(X_1)

X_2 = data_1.drop(['row_id','date','country','store','product','num_sold','Year','Residual','GDP_value'], axis = 1)
Y_2 = data_1['Residual']

m_2 = XGBRegressor()

m_2.fit(X_2, Y_2)

## 6. Submission File

In [None]:
X_3 = data_2[['GDP_value']]

X_4 = data_2.drop(['row_id','date','country','store','product','Year','GDP_value'], axis = 1)

submission = m_1.predict(X_3) + m_2.predict(X_4)
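
The prediction above relies on `X_4` having the same feature columns, in the same order, as the `X_2` used for fitting. A small hedged sketch of one way to enforce this with `reindex` follows. The toy frames `train_features` and `test_features` are hypothetical.

```python
import pandas as pd

# Toy feature frames (hypothetical): same columns, different order
train_features = pd.DataFrame({'Finland': [1, 0], 'Norway': [0, 1], 'Weekend': [0, 1]})
test_features = pd.DataFrame({'Weekend': [1, 0], 'Norway': [1, 0], 'Finland': [0, 1]})

# Reorder the test columns to match training; a column missing from test
# would be created and filled with zeros via fill_value
test_aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(test_aligned)
```

In this notebook the equivalent step would be `X_4 = X_4.reindex(columns=X_2.columns, fill_value=0)` before calling `m_2.predict(X_4)`.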

In [None]:
sub['num_sold'] = submission

display(sub.head())

sub.to_csv('sub.csv',index = False)

Let me know if you have corrections/suggestions to give!