### 1. Task description:

Predict weekly product sales to help plan stock levels.

### Datasets:

**sales.csv** - Data on product-level weekly sales:

- week_starting_date - first day of the week, in YYYYMMDD format
- product_id - unique product id
- sales - weekly sales in pieces

**categories.csv** - Data on which categories products are assigned to:

- product_id - unique product id
- category_id - unique category id

**traffic.csv** - Data on product-level weekly website traffic:

- week_starting_date - first day of the week, in YYYYMMDD format
- product_id - unique product id
- traffic - weekly product displays on the website

### 2. Import Libraries

In [403]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import statsmodels.api as sm
# a preceding `from datetime import datetime` was overridden by this import, so it is dropped
import datetime
import itertools
import seaborn as sns
from sklearn.dummy import DummyRegressor
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf 
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
from matplotlib.pylab import rcParams
from os import listdir
from os.path import isfile, join
from prophet import Prophet
import warnings
warnings.simplefilter("ignore")


from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

rcParams['figure.figsize'] = 10, 6
pd.set_option('display.max_rows', 500)

### 3. Data Collection and preparing the dataset

In [404]:
# loading the data
mypath = 'data/'
df_categories = pd.read_csv(join(mypath, 'categories.csv'), sep = ';')
df_sales = pd.read_csv(join(mypath, 'sales.csv'), sep = ';', names=["week_when_sold", "product_id", "sales"], parse_dates=['week_when_sold'], header = 0) 
df_traffic = pd.read_csv(join(mypath, 'traffic.csv'), sep = ';', names=["week_when_displayed_on_website", "product_id", "traffic"], parse_dates=['week_when_displayed_on_website'], header = 0)
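
The dataset description gives the date format as YYYYMMDD. `parse_dates` infers the format, which worked here, but stating the format explicitly makes a malformed value fail loudly instead of being guessed. A minimal sketch, using a made-up three-row sample in place of the real `sales.csv`:

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as sales.csv (values are made up).
sample_csv = io.StringIO(
    "week_starting_date;product_id;sales\n"
    "20191209;1990;1\n"
    "20201123;1990;1\n"
    "20201207;2361;4\n"
)

df_sample = pd.read_csv(
    sample_csv,
    sep=";",
    names=["week_when_sold", "product_id", "sales"],
    header=0,
    dtype={"week_when_sold": str},  # keep the raw YYYYMMDD text
)
# Parse with an explicit format so an unexpected value raises an error.
df_sample["week_when_sold"] = pd.to_datetime(df_sample["week_when_sold"], format="%Y%m%d")
print(df_sample.dtypes)
```

Every parsed date should fall on the same weekday if the files really hold week-starting dates, which is a cheap sanity check to run on the full data.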

In [405]:
# getting some information about the data 'categories.csv'
df_categories.info()
print('\n')
df_categories.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3135 entries, 0 to 3134
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   product_id   3135 non-null   int64
 1   category_id  3135 non-null   int64
dtypes: int64(2)
memory usage: 49.1 KB




Unnamed: 0,product_id,category_id
0,1990,0
1,2361,1
2,1085,2
3,3091,3
4,955,4


In [406]:
# getting some information about the data 'sales.csv'
df_sales.info()
print('\n')
df_sales.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105781 entries, 0 to 105780
Data columns (total 3 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   week_when_sold  105781 non-null  datetime64[ns]
 1   product_id      105781 non-null  int64         
 2   sales           105781 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 2.4 MB




Unnamed: 0,week_when_sold,product_id,sales
0,2019-12-09,1990,1
1,2020-11-23,1990,1
2,2020-12-07,1990,1
3,2019-12-02,1990,1
4,2020-11-09,1990,2


In [407]:
# getting some information about the data 'traffic.csv'
df_traffic.info()
print('\n')
df_traffic.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176324 entries, 0 to 176323
Data columns (total 3 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   week_when_displayed_on_website  176324 non-null  datetime64[ns]
 1   product_id                      176324 non-null  int64         
 2   traffic                         176324 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.0 MB




Unnamed: 0,week_when_displayed_on_website,product_id,traffic
0,2019-01-07,1990,1
1,2019-01-07,2361,7
2,2019-01-07,1085,1
3,2019-01-07,3091,4
4,2019-01-07,955,12


In [408]:
# adding traffic data to the dataset (matched on the same week and product)
df_data = pd.merge(df_sales, df_traffic, how='outer',
                   left_on=['week_when_sold', 'product_id'],
                   right_on=['week_when_displayed_on_website', 'product_id'])
df_data = df_data.rename(columns={'traffic': 'traffic_when_sold_week'})

In [409]:
# rows that exist only in traffic.csv have no sale week: use the display week instead
df_data['week_when_sold'] = df_data['week_when_sold'].fillna(df_data['week_when_displayed_on_website'])
# a missing value means no sale (or no display) was recorded for that week: treat it as 0
df_data['sales'] = df_data['sales'].fillna(0)
df_data['traffic_when_sold_week'] = df_data['traffic_when_sold_week'].fillna(0)
df_data = df_data.drop(columns=['week_when_displayed_on_website'])
df_data.head()

Unnamed: 0,week_when_sold,product_id,sales,traffic_when_sold_week
0,2019-12-09,1990,1.0,1.0
1,2020-11-23,1990,1.0,0.0
2,2020-12-07,1990,1.0,0.0
3,2019-12-02,1990,1.0,1.0
4,2020-11-09,1990,2.0,1.0
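
To see how many rows come from only one of the two files, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two tiny made-up frames with a shared `week` column (unlike the notebook, where the key columns have different names), so the values and column names are illustrative only:

```python
import pandas as pd

# Made-up data: one week only in sales, one in both, one only in traffic.
sales = pd.DataFrame({
    "week": pd.to_datetime(["2019-01-07", "2019-01-14"]),
    "product_id": [1, 1],
    "sales": [3, 5],
})
traffic = pd.DataFrame({
    "week": pd.to_datetime(["2019-01-14", "2019-01-21"]),
    "product_id": [1, 1],
    "traffic": [10, 7],
})

merged = pd.merge(sales, traffic, on=["week", "product_id"], how="outer", indicator=True)
# Count rows by origin: left_only (sales only), right_only (traffic only), both.
counts = merged["_merge"].value_counts()
# As in the notebook, a missing value is treated as zero.
merged[["sales", "traffic"]] = merged[["sales", "traffic"]].fillna(0)
print(counts)
```

A large `left_only` count would mean many sales weeks with no recorded traffic, which matters because those rows get a traffic value of 0 after filling.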


## 4. 'A single model to forecast multiple time series at the same time' approach
- pooling all products gives the model more training data, which should improve predictions
- some products have too little history to fit a model of their own

### preparing the data

In [410]:
print(min(df_data.week_when_sold))
print(max(df_data.week_when_sold))

2019-01-07 00:00:00
2020-12-28 00:00:00


In [411]:
# adding week number to the dataset
times = pd.date_range('2019-01-07', periods=105, freq='7D')
_time_df = pd.DataFrame(list(zip(times, np.arange(len(times)))), columns = ['date','week_nbr'])
_time_df.head()

Unnamed: 0,date,week_nbr
0,2019-01-07,0
1,2019-01-14,1
2,2019-01-21,2
3,2019-01-28,3
4,2019-02-04,4
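
The same week number can also be computed directly from the date, without a lookup table and without hard-coding the number of periods. A minimal sketch, assuming the first week in the data is 2019-01-07 as printed above:

```python
import pandas as pd

# A few week-starting dates taken from the outputs above.
weeks = pd.Series(pd.to_datetime(["2019-01-07", "2019-01-14", "2019-12-09", "2020-12-28"]))

first_week = pd.Timestamp("2019-01-07")
# Whole weeks elapsed since the first week in the data.
week_nbr = (weeks - first_week).dt.days // 7
print(week_nbr.tolist())  # [0, 1, 48, 103]
```

The last date in the data, 2020-12-28, maps to week 103, which is why the validation loops later stop at `range(80, 104)`.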


In [412]:
df_data = pd.merge(df_data, _time_df, left_on='week_when_sold', right_on='date', how='inner')
df_data.drop(['date'], axis=1, inplace=True)
df_data.head()

Unnamed: 0,week_when_sold,product_id,sales,traffic_when_sold_week,week_nbr
0,2019-12-09,1990,1.0,1.0,48
1,2019-12-09,2361,4.0,11.0,48
2,2019-12-09,3091,4.0,5.0,48
3,2019-12-09,1603,16.0,6.0,48
4,2019-12-09,1824,1.0,11.0,48


### feature engineering
#### creating lag features

In [413]:
df_data = df_data.sort_values(by="week_when_sold")
df_data.head()

Unnamed: 0,week_when_sold,product_id,sales,traffic_when_sold_week,week_nbr
118006,2019-01-07,2127,0.0,9.0,0
117910,2019-01-07,333,0.0,4.0,0
117909,2019-01-07,2492,0.0,1.0,0
117908,2019-01-07,1622,0.0,1.0,0
117907,2019-01-07,1942,0.0,1.0,0


In [414]:
# lags are taken over each product's consecutive rows (df_data is sorted by week above)
df_data['traffic_previous_week'] = df_data.groupby(['product_id'])['traffic_when_sold_week'].shift(1)
df_data['traffic_previous_2x_week'] = df_data.groupby(['product_id'])['traffic_when_sold_week'].shift(2)
df_data['traffic_previous_3x_week'] = df_data.groupby(['product_id'])['traffic_when_sold_week'].shift(3)

# diff(k) = traffic in the current row minus traffic k rows earlier
df_data['diff_traffic_previous_week'] = df_data.groupby(['product_id'])['traffic_when_sold_week'].diff(1)
df_data['diff_traffic_previous_2x_week'] = df_data.groupby(['product_id'])['traffic_when_sold_week'].diff(2)
df_data['diff_traffic_previous_3x_week'] = df_data.groupby(['product_id'])['traffic_when_sold_week'].diff(3)

df_data.head()

Unnamed: 0,week_when_sold,product_id,sales,traffic_when_sold_week,week_nbr,traffic_previous_week,traffic_previous_2x_week,traffic_previous_3x_week,diff_traffic_previous_week,diff_traffic_previous_2x_week,diff_traffic_previous_3x_week
118006,2019-01-07,2127,0.0,9.0,0,,,,,,
117910,2019-01-07,333,0.0,4.0,0,,,,,,
117909,2019-01-07,2492,0.0,1.0,0,,,,,,
117908,2019-01-07,1622,0.0,1.0,0,,,,,,
117907,2019-01-07,1942,0.0,1.0,0,,,,,,
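
Note that `shift(1)` moves values by one row within each product, not by one calendar week. If a product has no row for some week (neither a sale nor a display), the "previous week" value actually comes from an earlier week. The sketch below shows the difference on made-up data and one possible way to get calendar-based lags by first building a complete product-by-week grid. Filling the gaps with 0 is an assumption here, not something the source data confirms:

```python
import pandas as pd

# Toy data: product 1 has no row for week 2.
toy = pd.DataFrame({
    "product_id": [1, 1, 1, 2, 2, 2, 2],
    "week_nbr":   [0, 1, 3, 0, 1, 2, 3],
    "traffic":    [5.0, 6.0, 8.0, 1.0, 2.0, 3.0, 4.0],
})

# Row-based lag: for product 1, week 3 receives the value from week 1.
toy = toy.sort_values(["product_id", "week_nbr"])
toy["lag_rows"] = toy.groupby("product_id")["traffic"].shift(1)

# Calendar-based lag: build the full product x week grid, fill gaps with 0.
full_index = pd.MultiIndex.from_product(
    [toy["product_id"].unique(), range(toy["week_nbr"].min(), toy["week_nbr"].max() + 1)],
    names=["product_id", "week_nbr"],
)
grid = (
    toy.set_index(["product_id", "week_nbr"])[["traffic"]]
    .reindex(full_index, fill_value=0.0)
    .reset_index()
)
grid["lag_weeks"] = grid.groupby("product_id")["traffic"].shift(1)
print(grid)
```

Whether the extra rows are worth it depends on how many gaps the real data has; counting rows per product against the 104 weeks in the data would show that quickly.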


In [415]:
df_data['sales_previous_week'] = df_data.groupby(['product_id'])['sales'].shift(1)
df_data['sales_previous_2x_week'] = df_data.groupby(['product_id'])['sales'].shift(2)
df_data['sales_previous_3x_week'] = df_data.groupby(['product_id'])['sales'].shift(3)

# diff(k) = sales in the current row minus sales k rows earlier,
# so these columns are computed from the current week's sales (the target)
df_data['diff_sales_previous_week'] = df_data.groupby(['product_id'])['sales'].diff(1)
df_data['diff_sales_previous_2x_week'] = df_data.groupby(['product_id'])['sales'].diff(2)
df_data['diff_sales_previous_3x_week'] = df_data.groupby(['product_id'])['sales'].diff(3)

# drop each product's first rows, where the lags are undefined
df_data = df_data.dropna()
df_data.head()

Unnamed: 0,week_when_sold,product_id,sales,traffic_when_sold_week,week_nbr,traffic_previous_week,traffic_previous_2x_week,traffic_previous_3x_week,diff_traffic_previous_week,diff_traffic_previous_2x_week,diff_traffic_previous_3x_week,sales_previous_week,sales_previous_2x_week,sales_previous_3x_week,diff_sales_previous_week,diff_sales_previous_2x_week,diff_sales_previous_3x_week
116604,2019-01-28,2019,0.0,1.0,3,2.0,1.0,1.0,-1.0,0.0,0.0,23.0,7.0,7.0,-23.0,-7.0,-7.0
116616,2019-01-28,545,0.0,1.0,3,6.0,1.0,16.0,-5.0,0.0,-15.0,0.0,0.0,9.0,0.0,0.0,-9.0
116615,2019-01-28,594,0.0,1.0,3,2.0,5.0,9.0,-1.0,-4.0,-8.0,15.0,0.0,0.0,-15.0,0.0,0.0
116613,2019-01-28,1169,0.0,1.0,3,2.0,2.0,2.0,-1.0,-1.0,-1.0,1.0,0.0,1.0,-1.0,0.0,-1.0
116612,2019-01-28,1417,0.0,1.0,3,16.0,1.0,1.0,-15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [416]:
df_data = df_data.drop(['week_when_sold'], axis=1)
df_data.head()

Unnamed: 0,product_id,sales,traffic_when_sold_week,week_nbr,traffic_previous_week,traffic_previous_2x_week,traffic_previous_3x_week,diff_traffic_previous_week,diff_traffic_previous_2x_week,diff_traffic_previous_3x_week,sales_previous_week,sales_previous_2x_week,sales_previous_3x_week,diff_sales_previous_week,diff_sales_previous_2x_week,diff_sales_previous_3x_week
116604,2019,0.0,1.0,3,2.0,1.0,1.0,-1.0,0.0,0.0,23.0,7.0,7.0,-23.0,-7.0,-7.0
116616,545,0.0,1.0,3,6.0,1.0,16.0,-5.0,0.0,-15.0,0.0,0.0,9.0,0.0,0.0,-9.0
116615,594,0.0,1.0,3,2.0,5.0,9.0,-1.0,-4.0,-8.0,15.0,0.0,0.0,-15.0,0.0,0.0
116613,1169,0.0,1.0,3,2.0,2.0,2.0,-1.0,-1.0,-1.0,1.0,0.0,1.0,-1.0,0.0,-1.0
116612,1417,0.0,1.0,3,16.0,1.0,1.0,-15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
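
One property of these features is worth checking before modelling: `diff(1)` on `sales` is the current week's sales minus the previous week's, so `sales_previous_week + diff_sales_previous_week` equals the target exactly. A model given both columns can recover `sales` almost perfectly, which would make validation errors look better than what is achievable at forecast time. The sketch below demonstrates this on made-up numbers and shows one possible leak-free alternative (the column name `diff_sales_lagged` is hypothetical), which only uses weeks that are already known:

```python
import pandas as pd

# Made-up sales history for a single product.
toy = pd.DataFrame({
    "product_id": [1, 1, 1, 1, 1],
    "week_nbr":   [0, 1, 2, 3, 4],
    "sales":      [4.0, 7.0, 2.0, 9.0, 5.0],
})
g = toy.groupby("product_id")["sales"]

toy["sales_previous_week"] = g.shift(1)
toy["diff_sales_previous_week"] = g.diff(1)  # sales[t] - sales[t-1], as in the notebook

# The two features add up to the target wherever both are defined.
reconstructed = toy["sales_previous_week"] + toy["diff_sales_previous_week"]

# Leak-free variant: change between the two previous weeks, sales[t-1] - sales[t-2].
toy["diff_sales_lagged"] = g.shift(1) - g.shift(2)
print(toy)
```

The same reasoning applies to `traffic_when_sold_week` and its diffs if the current week's traffic is not known when the forecast is made; that depends on how the forecast is used and is an assumption to confirm.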


### evaluation metric and the baseline
The task states: 'As of now planning is done on the basis of last week sales - weekly sales are assumed to stay on same level next week'. This naive last-week forecast is our baseline.

In [446]:
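# evaluation metrics
def rmse(y, y_hat):
    # root mean squared error: penalises large misses more heavily
    return np.sqrt(np.mean(np.square(y - y_hat)))

def wmape(y, y_hat):
    # weighted MAPE: total absolute error relative to total actual sales
    return np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y))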
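
A small worked example makes the two metrics concrete. The numbers are made up; the functions are repeated so the snippet runs on its own:

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean(np.square(y - y_hat)))

def wmape(y, y_hat):
    return np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y))

y = np.array([10.0, 0.0, 5.0, 5.0])      # actual sales
y_hat = np.array([8.0, 1.0, 5.0, 9.0])   # predicted sales

# Errors are 2, -1, 0, -4.
# RMSE:  sqrt((4 + 1 + 0 + 16) / 4) = sqrt(5.25), roughly 2.291
# WMAPE: (2 + 1 + 0 + 4) / (10 + 0 + 5 + 5) = 7 / 20 = 0.35
print(rmse(y, y_hat), wmape(y, y_hat))
```

Because WMAPE divides by total actual sales rather than by each row's sales, a week with zero sales for one product does not cause a division by zero as plain MAPE would. It is still undefined when all actual values in the evaluated week are zero.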

In [451]:
mean_error_rmse = []
mean_error_wmape = []

# walk-forward check over the last 24 weeks; the baseline needs no training data
for week in range(80, 104):
    val = df_data[df_data['week_nbr'] == week]

    # naive forecast: this week's sales equal last week's sales
    p = val['sales_previous_week'].values
    print()

    error1 = rmse(val['sales'].values, p)
    print('Week %d - Error RMSE %.5f' % (week, error1))
    error2 = wmape(val['sales'].values, p)
    print('Week %d - Error WMAPE %.5f' % (week, error2))
    mean_error_rmse.append(error1)
    mean_error_wmape.append(error2)
print()
print('Mean RMSE = %.5f' % np.mean(mean_error_rmse))
print('Mean WMAPE = %.5f' % np.mean(mean_error_wmape))


Week 80 - Error RMSE 38.36134
Week 80 - Error WMAPE 0.68853

Week 81 - Error RMSE 28.04150
Week 81 - Error WMAPE 0.68259

Week 82 - Error RMSE 27.98423
Week 82 - Error WMAPE 0.54217

Week 83 - Error RMSE 37.23291
Week 83 - Error WMAPE 0.77593

Week 84 - Error RMSE 49.91075
Week 84 - Error WMAPE 0.62346

Week 85 - Error RMSE 56.13004
Week 85 - Error WMAPE 0.70886

Week 86 - Error RMSE 104.18121
Week 86 - Error WMAPE 0.66075

Week 87 - Error RMSE 74.02139
Week 87 - Error WMAPE 0.77325

Week 88 - Error RMSE 39.62706
Week 88 - Error WMAPE 0.61742

Week 89 - Error RMSE 59.91589
Week 89 - Error WMAPE 0.82219

Week 90 - Error RMSE 27.04518
Week 90 - Error WMAPE 0.59653

Week 91 - Error RMSE 32.15868
Week 91 - Error WMAPE 0.59218

Week 92 - Error RMSE 27.78812
Week 92 - Error WMAPE 0.60547

Week 93 - Error RMSE 28.30867
Week 93 - Error WMAPE 0.50565

Week 94 - Error RMSE 55.27341
Week 94 - Error WMAPE 0.83204

Week 95 - Error RMSE 33.43598
Week 95 - Error WMAPE 0.57746

Week 96 - Error RMSE 6

### Model creation
Random Forest
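
The baseline and the model are both evaluated with the same expanding-window scheme: for each validation week, train on all earlier weeks and predict that week. A minimal sketch of the scheme as a reusable helper follows. The function name `walk_forward`, the `MeanModel` stand-in and the synthetic data are all hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

class MeanModel:
    """Hypothetical stand-in for any estimator with fit/predict."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

def walk_forward(df, make_model, weeks, target="sales", time_col="week_nbr"):
    """Train on all rows before each week, predict that week, return WMAPE per week."""
    scores = {}
    for week in weeks:
        train = df[df[time_col] < week]
        val = df[df[time_col] == week]
        model = make_model().fit(train.drop(columns=[target]), train[target])
        pred = model.predict(val.drop(columns=[target]))
        y = val[target].to_numpy()
        scores[week] = np.sum(np.abs(y - pred)) / np.sum(np.abs(y))
    return pd.Series(scores, name="wmape")

# Synthetic data: 12 weeks x 20 products, sales fluctuating around 10.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"week_nbr": np.repeat(np.arange(12), 20)})
toy["sales"] = 10 + rng.normal(0, 1, len(toy))

scores = walk_forward(toy, MeanModel, weeks=range(8, 12))
print(scores)
```

Passing a factory such as `lambda: RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0)` as `make_model` should mirror the Random Forest loop, since scikit-learn estimators return `self` from `fit`.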

In [452]:
mean_error_rmse = []
mean_error_wmape = []

# walk-forward validation: train on all weeks before `week`, validate on `week`
for week in range(80, 104):
    train = df_data[df_data['week_nbr'] < week]
    val = df_data[df_data['week_nbr'] == week]

    # features are every column except the target
    xtr, xts = train.drop(['sales'], axis=1), val.drop(['sales'], axis=1)
    ytr, yts = train['sales'].values, val['sales'].values

    mdl = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0)
    mdl.fit(xtr, ytr)

    p = mdl.predict(xts)
    print()

    error1 = rmse(yts, p)
    print('Week %d - Error RMSE %.5f' % (week, error1))
    error2 = wmape(yts, p)
    print('Week %d - Error WMAPE %.5f' % (week, error2))
    mean_error_rmse.append(error1)
    mean_error_wmape.append(error2)
print()
print('Mean RMSE = %.5f' % np.mean(mean_error_rmse))
print('Mean WMAPE = %.5f' % np.mean(mean_error_wmape))


Week 80 - Error RMSE 3.57073
Week 80 - Error WMAPE 0.02359

Week 81 - Error RMSE 3.05149
Week 81 - Error WMAPE 0.01854

Week 82 - Error RMSE 5.37059
Week 82 - Error WMAPE 0.02503

Week 83 - Error RMSE 4.12400
Week 83 - Error WMAPE 0.02970

Week 84 - Error RMSE 4.91512
Week 84 - Error WMAPE 0.02569

Week 85 - Error RMSE 9.99071
Week 85 - Error WMAPE 0.04051

Week 86 - Error RMSE 9.99255
Week 86 - Error WMAPE 0.03461

Week 87 - Error RMSE 8.17534
Week 87 - Error WMAPE 0.03984

Week 88 - Error RMSE 12.07857
Week 88 - Error WMAPE 0.04934

Week 89 - Error RMSE 12.48382
Week 89 - Error WMAPE 0.05980

Week 90 - Error RMSE 5.47322
Week 90 - Error WMAPE 0.02529

Week 91 - Error RMSE 3.47866
Week 91 - Error WMAPE 0.01866

Week 92 - Error RMSE 2.40844
Week 92 - Error WMAPE 0.01714

Week 93 - Error RMSE 3.10730
Week 93 - Error WMAPE 0.01889

Week 94 - Error RMSE 6.49332
Week 94 - Error WMAPE 0.03216

Week 95 - Error RMSE 3.46540
Week 95 - Error WMAPE 0.01758

Week 96 - Error RMSE 6.70444
Week 96 