# KPMG COVID-19 Short-Term Model Pipeline

Since the COVID-19 outbreak was first diagnosed, it has spread to over 190 countries and all U.S. states. The pandemic is having a noticeable impact on global economic growth. Business owners need to understand how to adjust their financial budgets in response to COVID-19.

Using predictive data analytics, KPMG’s Lighthouse has developed several methodologies to identify the economic impact of COVID-19.
Leveraging mobility data and COVID-19 data to spot patterns and perform sensitivity analysis enables us to predict what will happen in the following fall and winter.

The short-term model is a signal-based time series model that generates a one- or two-month forecast of new COVID-19 cases.


## Project layout

    Short-Term Model Pipeline - data_cleaning.ipynb
    Short-Term Model Pipeline - model_selection.ipynb
    data 
        df_train.csv - output from Short-Term Model Pipeline - data_cleaning.ipynb
        Global_Mobility_Report.csv # Google mobility data https://www.google.com/covid19/mobility/
        lockdown_aggregated_country.csv   # KPMG country-level lockdown index
        lockdown_aggregated_state.csv # KPMG state-level lockdown index
        time_series_covid19_confirmed_global.csv # COVID-19 new cases https://github.com/CSSEGISandData
        time_series_covid19_confirmed_US.csv  # COVID-19 US new cases https://github.com/CSSEGISandData
        time_series_covid19_deaths_global.csv # COVID-19 deaths https://github.com/CSSEGISandData
        time_series_covid19_deaths_US.csv # COVID-19 US deaths https://github.com/CSSEGISandData

## Prerequisites

Python packages:
```
pandas, numpy, datetime, fbprophet, pystan, pydlm, statsmodels, pmdarima, tbats, plotly
```

KPMG forecast enablers:
```
Pysigeval: http://usmdckdap10412.nix.us.kworld.kpmg.com:3003/
Pytselect: http://usmdckdap10412.nix.us.kworld.kpmg.com:3004/
Pytsplot: https://git.us.kworld.kpmg.com/projects/INTELLIGE1/repos/pytsplot/browse
```

In [33]:
target = 'US' # model target variable
index_col = 'date'
forecast_period = 28 # forecast length in days
window_start = '2020-04-24' # start date of rolling-window cross-validation
window_end = '2020-05-22' # end date of rolling-window cross-validation
prediction_date = '2020-05-27'
window_num = 2 # number of rolling windows

In [2]:
# Load packages
import math
from functools import partial, reduce

import numpy as np
import pandas as pd
from pysigeval import corfuncs, rankfuncs, selectfuncs, transfuncs
from pytselect import modelpredict, modelselect, signalselect, ts_helper
from pytsplot import general_plot, signalselect_plot

# Read input data
df = pd.read_csv('data/df_train.csv')

ModuleNotFoundError: No module named 'pysigeval'

In [58]:
df

Unnamed: 0,date,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua_and_Barbuda,Argentina,Armenia,Australia_Australian_Capital_Territory,...,US_Diamond_Princess,US_Grand_Princess,US_death,US_lockdown,US_retail_and_recreation,US_workplaces,US_transit_stations,US_residential,US_grocery,US_parks
0,2020-01-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,,,,,,
1,2020-01-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,,,,,,,
2,2020-01-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,,,,,,,
3,2020-01-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,,,,,,,
4,2020-01-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
122,2020-05-23,782.0,8.0,195.0,0.0,1.0,0.0,704.0,374.0,0.0,...,0.0,0.0,1108.0,,,,,,,
123,2020-05-24,584.0,9.0,193.0,0.0,8.0,0.0,723.0,359.0,0.0,...,0.0,0.0,633.0,,,,,,,
124,2020-05-25,591.0,6.0,197.0,1.0,1.0,0.0,552.0,452.0,0.0,...,0.0,0.0,500.0,,,,,,,
125,2020-05-26,658.0,25.0,194.0,0.0,0.0,0.0,600.0,289.0,0.0,...,0.0,0.0,693.0,,,,,,,


#### 1. Generate Signal Transformations
Run [`tr_pipeline()`](http://usmdckdap10412.nix.us.kworld.kpmg.com:3003/data_transformations/#tr_pipeline) to generate lagged copies of each signal (lags of 28, 35, 42 and 49 days) and to test different transformations.

In [4]:
df_trans = transfuncs.tr_pipeline(df=df,y = target, index_col='date',
                                  lags = [28, 35, 42, 49], fill_value=0, 
                                  transform=['log', 'normalise'], freq = 'd')
df_trans  = df_trans.iloc[35:]


divide by zero encountered in log10


invalid value encountered in log10



##### Tips
* The `tr_pipeline()` function can handle missing values. As shown in the example above, missing values are left untransformed.
* To keep non-lagged signals, add `0` to the `lags` argument, for example `lags = [0, 3, 6, 9]`.
* When `'ma'` is included in `transform`, the order of the moving average can be set with the `order` parameter. The default value of `order` is 3.
* When `'diff'` is included in `transform`, the type of difference can be set with the `change` parameter. The default value of `change` is `difference`.
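
As a rough illustration of what lagging and transforming a signal involves, the sketch below builds `none`, `log` and `norm` variants of each signal with plain pandas and shifts them by each lag. It is not the `pysigeval` implementation: `make_lagged_features` is a hypothetical helper, the min-max definition of `norm` is an assumption, and zeros are masked before taking `log10` so that the divide-by-zero warnings shown above do not occur. The demo values are toy data.

```python
import numpy as np
import pandas as pd


def make_lagged_features(df, y, index_col, lags):
    """Hypothetical stand-in for a lag/transform step (not the pysigeval API)."""
    frame = df.set_index(pd.to_datetime(df[index_col])).drop(columns=index_col)
    # Extend the daily index so lagged signals reach into the forecast horizon.
    full_index = pd.date_range(
        frame.index.min(),
        frame.index.max() + pd.Timedelta(days=max(lags)),
        freq="D",
    )
    frame = frame.reindex(full_index)

    out = {y: frame[y]}
    for col in frame.columns.drop(y):
        s = frame[col]
        span = s.max() - s.min()
        variants = {
            "none": s,
            # Mask non-positive values: log10 is undefined there.
            "log": np.log10(s.where(s > 0)),
            # Assumed min-max scaling to [0, 1]; constant signals map to 0.
            "norm": (s - s.min()) / span if span > 0 else s * 0.0,
        }
        for name, values in variants.items():
            for lag in lags:
                out[f"{col}_{name}_lag{lag}"] = values.shift(lag)

    result = pd.DataFrame(out)
    result.index.name = index_col
    return result.reset_index()


demo = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5).astype(str),
    "US": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Italy": [0.0, 10.0, 20.0, 30.0, 40.0],
})
feat = make_lagged_features(demo, y="US", index_col="date", lags=[2])
print(feat)
```

Because the index is extended by the largest lag, the lagged signals are available for dates on which the target is still unknown, which is what makes them usable as leading indicators.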


#### 2. Correlation Analysis
Run [`cor_analysis()`](http://usmdckdap10412.nix.us.kworld.kpmg.com:3003/correlation_analysis/#cor_analysis) to select the best transformation type and best lag of each signal based on their correlation value.

In [5]:
# run cor_analysis to get the best correlation for each signal 
cor_df = corfuncs.cor_analysis(df = df_trans, y = target, lag = True, 
                      transform = True, filter_ = True)
cor_df.head(30)

Unnamed: 0,signal,transform,lag,correlation,abs_correlation,method
0,China_Sichuan,none,35,-0.834913,0.834913,pearson
1,China_Jiangsu,norm,35,-0.816953,0.816953,pearson
2,China_Hunan,none,42,-0.798599,0.798599,pearson
3,China_Chongqing,none,42,-0.78799,0.78799,pearson
4,China_Shaanx,none,42,-0.78751,0.78751,pearson
5,China_Anhu,none,35,-0.786466,0.786466,pearson
6,China_Guangdong,none,42,-0.785086,0.785086,pearson
7,China_Fujian,norm,42,-0.778857,0.778857,pearson
8,China_Guangx,none,35,-0.774642,0.774642,pearson
9,China_Henan,none,35,-0.773657,0.773657,pearson


##### Tips
* The `cor_analysis()` function can handle missing values. 
* Set `lag`, `transform` and `filter_` to `True` after running [tr_pipeline()](../data_transformations/#tr_pipeline).
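
The idea of keeping one "best" column per signal can be sketched with pandas alone. The example below assumes columns are named `<signal>_<transform>_lag<n>`, as in the outputs in this notebook, computes the Pearson correlation of each column with the target, and keeps the variant with the largest absolute correlation per signal. `best_correlations` is a hypothetical helper, not the `cor_analysis()` implementation, and the demo frame is toy data.

```python
import re

import pandas as pd

# Assumed column naming scheme: <signal>_<transform>_lag<n>
NAME_PATTERN = re.compile(r"^(?P<signal>.+)_(?P<transform>none|log|norm)_lag(?P<lag>\d+)$")


def best_correlations(features, y):
    """Return, per signal, the transform/lag with the largest |Pearson r| against y."""
    rows = []
    for col in features.columns:
        match = NAME_PATTERN.match(col)
        if match is None:
            continue  # skips the target, the date column, and unrelated columns
        corr = features[col].corr(features[y])  # pairwise-complete Pearson
        rows.append({
            "signal": match["signal"],
            "transform": match["transform"],
            "lag": int(match["lag"]),
            "correlation": corr,
            "abs_correlation": abs(corr),
        })
    table = pd.DataFrame(rows).dropna(subset=["correlation"])
    table = table.sort_values("abs_correlation", ascending=False)
    return table.drop_duplicates("signal").reset_index(drop=True)


demo = pd.DataFrame({
    "US": [1.0, 2.0, 3.0, 4.0, 5.0],
    "A_none_lag28": [2.0, 4.0, 6.0, 8.0, 10.0],
    "A_none_lag35": [1.0, 3.0, 2.0, 5.0, 4.0],
    "B_norm_lag28": [5.0, 4.0, 3.0, 2.0, 1.0],
})
best = best_correlations(demo, "US")
print(best)
```

Sorting on the absolute value keeps strongly negative correlations, such as the leading rows in the `cor_df` output above, in the ranking.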

#### 3. Dimension Reduction Using cor_select()
Use the outputs of [`tr_pipeline()`](http://usmdckdap10412.nix.us.kworld.kpmg.com:3003/data_transformations/#tr_pipeline) and [`cor_analysis()`](http://usmdckdap10412.nix.us.kworld.kpmg.com:3003/correlation_analysis/) to keep, for each signal, only the column with the best transformation type and best lag (a dimension reduction step).


In [6]:
# run cor_select 
df_short = corfuncs.cor_select(df_trans, cor_df, index_col = index_col, y = target)
df_short.tail(30)

Unnamed: 0,date,US,China_Sichuan_none_lag35,China_Jiangsu_norm_lag35,China_Hunan_none_lag42,China_Chongqing_none_lag42,China_Shaanx_none_lag42,China_Anhu_none_lag35,China_Guangdong_none_lag42,China_Fujian_norm_lag42,...,Tanzania_none_lag35,France_Saint_Pierre_and_Miquelon_norm_lag28,Western_Sahara_none_lag28,Yemen_none_lag35,Sao_Tome_and_Principe_norm_lag49,Canada_Diamond_Princess_none_lag28,Comoros_none_lag28,Tajikistan_none_lag28,Lesotho_none_lag28,US_American_Samoa_none_lag28
146,2020-06-16,,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,9.0,0.02649,0.0,0.0,207.0,0.0,0.0
147,2020-06-17,,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,23.0,204.0,0.0,0.0
148,2020-06-18,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,15.0,0.039735,0.0,0.0,210.0,0.0,0.0
149,2020-06-19,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,21.0,0.013245,0.0,44.0,201.0,1.0,0.0
150,2020-06-20,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,16.0,0.0,0.0,0.0,187.0,0.0,0.0
151,2020-06-21,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,6.0,0.0,0.0,9.0,191.0,0.0,0.0
152,2020-06-22,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.046358,0.0,0.0,171.0,0.0,0.0
153,2020-06-23,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,37.0,1.0,0.0,0.0,166.0,0.0,0.0
154,2020-06-24,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,17.0,0.0,0.0,0.0,158.0,0.0,0.0
155,2020-06-25,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,13.0,0.086093,,,,,


#### 4. Dimension Reduction Using signal_rankings 
Run all signal selection methods and return the final ranking. The return value can be either the correlation values and scaled importance values from each signal selection method, or the rank of each signal under each method.

In [7]:
# start the H2O cluster (skip this step if an H2O cluster is already running)
selectfuncs.init()
final_rank = rankfuncs.signal_rankings(df = df_short, y = target ,
                                       method_ = ['pearson', 'spearman','elastic', 'rf', 'gbm'])

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_152-release"; OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12); OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from /Users/shuotian/Desktop/test_vm/test_vm/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/yt/dfq2zglx071d_njymtm7vr_jv9ntwb/T/tmpp5da0rs9
  JVM stdout: /var/folders/yt/dfq2zglx071d_njymtm7vr_jv9ntwb/T/tmpp5da0rs9/h2o_shuotian_started_from_python.out
  JVM stderr: /var/folders/yt/dfq2zglx071d_njymtm7vr_jv9ntwb/T/tmpp5da0rs9/h2o_shuotian_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O cluster uptime:,02 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.28.0.3
H2O cluster version age:,3 months and 22 days !!!
H2O cluster name:,H2O_from_python_shuotian_xwm7vc
H2O cluster total nodes:,1
H2O cluster free memory:,3.556 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [8]:
final_rank

Unnamed: 0,signal,pearson_corr,spearman_corr,elastic_net,random_forest,gbm,mean
0,China_Sichuan_none_lag35,0.834913,0.687310,0.0,0.017662,0.000714,0.308120
1,China_Jiangsu_norm_lag35,0.816953,0.705814,0.0,0.044486,0.000100,0.313471
2,China_Hunan_none_lag42,0.798599,0.641138,0.0,0.000086,0.000084,0.287981
3,China_Chongqing_none_lag42,0.787990,0.676005,0.0,0.000343,0.000273,0.292922
4,China_Shaanx_none_lag42,0.787510,0.701164,0.0,0.004055,0.000523,0.298650
...,...,...,...,...,...,...,...
260,Sao_Tome_and_Principe_norm_lag49,0.031898,0.088578,0.0,0.000004,0.000000,0.024096
261,Canada_Diamond_Princess_none_lag28,,,0.0,0.000000,0.000000,0.000000
262,Comoros_none_lag28,,,0.0,0.000000,0.000000,0.000000
263,Tajikistan_none_lag28,,,0.0,0.000000,0.000000,0.000000
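
In the table above, the `mean` column is the row average of the five method scores. The sketch below reproduces that aggregation for the two correlation-based methods only, since the elastic net, random forest and GBM importances come from H2O. `rank_signals` is a hypothetical helper, Spearman is computed as Pearson on ranks to avoid extra dependencies, and treating undefined correlations as 0 in the average is an assumption.

```python
import pandas as pd


def rank_signals(frame, y, signals):
    """Score each signal by |Pearson| and |Spearman| and average the two scores."""
    rows = []
    for col in signals:
        pair = frame[[col, y]].dropna()
        pearson = pair[col].corr(pair[y])
        # Spearman correlation equals the Pearson correlation of the ranks.
        spearman = pair[col].rank().corr(pair[y].rank())
        rows.append({
            "signal": col,
            "pearson_corr": abs(pearson),
            "spearman_corr": abs(spearman),
        })
    table = pd.DataFrame(rows).set_index("signal")
    # Assumption: an undefined correlation counts as 0 in the average.
    table["mean"] = table.fillna(0.0).mean(axis=1)
    return table.sort_values("mean", ascending=False)


demo = pd.DataFrame({
    "US": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mono": [1.0, 4.0, 9.0, 16.0, 25.0],   # monotone but non-linear in US
    "noisy": [2.0, 1.0, 4.0, 3.0, 5.0],
})
ranking = rank_signals(demo, y="US", signals=["mono", "noisy"])
print(ranking)
```

Averaging several scores rewards signals that look useful under more than one criterion, at the cost of mixing quantities that are not on exactly the same scale.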


#### Remove Chinese signals 

In [9]:
final_rank = final_rank[~final_rank["signal"].str.contains('China')].head(90)
final_rank = final_rank.reset_index(drop=True)


In [10]:
final_rank

Unnamed: 0,signal,pearson_corr,spearman_corr,elastic_net,random_forest,gbm,mean
0,Italy_norm_lag28,0.773459,0.798017,0.579869,0.266613,3.116431e-01,0.545920
1,Iran_norm_lag28,0.742008,0.726064,0.290166,0.031476,1.000000e+00,0.557943
2,Norway_none_lag28,0.697761,0.821169,0.000000,0.111567,4.975587e-03,0.327094
3,Slovenia_none_lag28,0.682040,0.744800,0.000000,0.129879,8.133541e-04,0.311506
4,Malaysia_none_lag28,0.620524,0.640850,0.000000,0.000945,0.000000e+00,0.252464
...,...,...,...,...,...,...,...
85,Ethiopia_none_lag28,0.350872,0.295317,0.245109,0.000241,9.880783e-07,0.178308
86,South_Africa_none_lag28,0.347295,0.476851,0.000000,0.000334,3.003113e-04,0.164956
87,Dominican_Republic_norm_lag28,0.346249,0.352599,0.000000,0.000282,0.000000e+00,0.139826
88,Barbados_none_lag28,0.344452,0.403227,0.000000,0.000036,1.833044e-04,0.149580


In [11]:
final_rank.head(28)

Unnamed: 0,signal,pearson_corr,spearman_corr,elastic_net,random_forest,gbm,mean
0,Italy_norm_lag28,0.773459,0.798017,0.579869,0.266613,0.311643,0.54592
1,Iran_norm_lag28,0.742008,0.726064,0.290166,0.031476,1.0,0.557943
2,Norway_none_lag28,0.697761,0.821169,0.0,0.111567,0.004976,0.327094
3,Slovenia_none_lag28,0.68204,0.7448,0.0,0.129879,0.000813,0.311506
4,Malaysia_none_lag28,0.620524,0.64085,0.0,0.000945,0.0,0.252464
5,Switzerlan_norm_lag28,0.614075,0.749517,0.0,0.079729,0.000317,0.288728
6,Germany_norm_lag28,0.612809,0.729044,0.0,0.038332,0.020649,0.280167
7,Denmark_norm_lag28,0.596609,0.632486,0.0,0.00065,0.002376,0.246424
8,Costa_Rica_norm_lag28,0.595165,0.682061,0.0,0.000155,0.003392,0.256155
9,Icelan_none_lag28,0.578187,0.820139,0.0,1.0,0.247029,0.529071


#### Shutdown H2O Cluster



In [12]:
selectfuncs.shutdown()

H2O session _sid_b2fc closed.


In [13]:
# get the final df output from pysigeval dimension reduction
col_list = [index_col, target] + list(final_rank.signal)
# use the keyword form of drop(); the positional axis argument was removed in pandas 2.0
row_index = max(df_short[col_list].drop(columns=index_col).apply(pd.Series.last_valid_index)) + 1
df_short = df_short[col_list].iloc[:row_index, :]

In [14]:
df_short.head(30)

Unnamed: 0,date,US,Italy_norm_lag28,Iran_norm_lag28,Norway_none_lag28,Slovenia_none_lag28,Malaysia_none_lag28,Switzerlan_norm_lag28,Germany_norm_lag28,Denmark_norm_lag28,...,Cuba_none_lag28,Canada_Quebec_none_lag28,France_Guadeloupe_none_lag28,Kenya_norm_lag28,Australia_Tasmania_norm_lag28,Ethiopia_none_lag28,South_Africa_none_lag28,Dominican_Republic_norm_lag28,Barbados_none_lag28,Canada_New_Brunswick_none_lag28
35,2020-02-26,6.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
36,2020-02-27,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37,2020-02-28,2.0,0.000305,0.0,0.0,0.0,0.0,0.0,0.000144,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38,2020-02-29,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000433,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,2020-03-01,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000288,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40,2020-03-02,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000288,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41,2020-03-03,20.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42,2020-03-04,31.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
43,2020-03-05,70.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44,2020-03-06,48.0,0.000153,0.0,0.0,0.0,0.0,0.0,0.000144,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# fill missing values
df_short = transfuncs.tr_treat_na(df_short, 
                       y = target,
                       index_col = 'date',
                       method='both')
df_short = df_short.reset_index(drop=True)
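
In the filled frame, the trailing rows of each signal repeat the last observed value while `US` stays empty, which is what a forward fill that skips the target would produce. A minimal pandas sketch of that behaviour follows. It assumes `method='both'` means forward fill followed by backward fill, which is not confirmed by the source; `fill_signal_gaps` is a hypothetical helper and the demo values are toy data.

```python
import numpy as np
import pandas as pd


def fill_signal_gaps(frame, y, index_col):
    """Forward-fill, then backward-fill every signal; leave the target untouched."""
    filled = frame.copy()
    signal_cols = [c for c in filled.columns if c not in (y, index_col)]
    filled[signal_cols] = filled[signal_cols].ffill().bfill()
    return filled


demo = pd.DataFrame({
    "date": ["2020-07-10", "2020-07-11", "2020-07-12", "2020-07-13"],
    "US": [100.0, np.nan, np.nan, np.nan],          # future target stays missing
    "signal_a": [np.nan, 0.5, np.nan, np.nan],      # gap at both ends
})
filled = fill_signal_gaps(demo, y="US", index_col="date")
print(filled)
```

Leaving the target untouched matters: the missing `US` values at the end of the frame are the days to be forecast, not gaps to be repaired.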

In [16]:
df_short

Unnamed: 0,date,US,Italy_norm_lag28,Iran_norm_lag28,Norway_none_lag28,Slovenia_none_lag28,Malaysia_none_lag28,Switzerlan_norm_lag28,Germany_norm_lag28,Denmark_norm_lag28,...,Cuba_none_lag28,Canada_Quebec_none_lag28,France_Guadeloupe_none_lag28,Kenya_norm_lag28,Australia_Tasmania_norm_lag28,Ethiopia_none_lag28,South_Africa_none_lag28,Dominican_Republic_norm_lag28,Barbados_none_lag28,Canada_New_Brunswick_none_lag28
0,2020-02-26,6.0,0.000000,0.000000,0.0,0.0,3.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
1,2020-02-27,1.0,0.000000,0.000000,0.0,0.0,1.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2,2020-02-28,2.0,0.000305,0.000000,0.0,0.0,0.0,0.000000,0.000144,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
3,2020-02-29,8.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000433,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
4,2020-03-01,6.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000288,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136,2020-07-11,,0.089065,0.652856,18.0,2.0,15.0,0.011355,0.046733,0.133333,...,11.0,541.0,0.0,1.0,0.0,30.0,1673.0,0.907115,0.0,1.0
137,2020-07-12,,0.089065,0.652856,18.0,2.0,15.0,0.011355,0.046733,0.133333,...,11.0,541.0,0.0,1.0,0.0,30.0,1673.0,0.907115,0.0,1.0
138,2020-07-13,,0.089065,0.652856,18.0,2.0,15.0,0.011355,0.046733,0.133333,...,11.0,541.0,0.0,1.0,0.0,30.0,1673.0,0.907115,0.0,1.0
139,2020-07-14,,0.089065,0.652856,18.0,2.0,15.0,0.011355,0.046733,0.133333,...,11.0,541.0,0.0,1.0,0.0,30.0,1673.0,0.907115,0.0,1.0


#### 5. Model Selection 

Run [`ts_model_selection()`](http://usmdckdap10412.nix.us.kworld.kpmg.com:3004/ModelSelection/#ts_model_selection) to select the best model type and tune model parameters. Given a rectangular data set containing the target variable, a date index and time series signals, it fits a range of forecasting models using cross-validation and selects the best model types.

In [17]:
%%time
# run ts_model_selection with model = ['bsts', 'ets', 'fbp'],
# and turn on parameter_tune
# freq = 'd', window = 5, skip = 6, seasonal_periods = 7
out_df, out_error = modelselect.ts_model_selection(df_short, start_date = window_start, 
                                                   end_date = window_end, y = target,
                                                   model = ['bsts', 'ets', 'fbp'],
                                                   window = 5, skip = 6, periods = forecast_period,
                                                   freq = 'd',
                                                   seasonal_periods = 7,
                                                   daily_cv_schema = 'normal',
                                                   parameter_tune = True)


CPU times: user 42 ms, sys: 53 ms, total: 95 ms
Wall time: 21.2 s


In [18]:
out_df

Unnamed: 0,Date,Prediction_Date,Window_Number,Actual,Fitted,Prediction,Model
0,2020-02-26,2020-04-23,1,6.0,5.968185,,bsts
1,2020-02-27,2020-04-23,1,1.0,1.000000,,bsts
2,2020-02-28,2020-04-23,1,2.0,2.000000,,bsts
3,2020-02-29,2020-04-23,1,8.0,8.000000,,bsts
4,2020-03-01,2020-04-23,1,6.0,5.984451,,bsts
...,...,...,...,...,...,...,...
1297,2020-05-18,2020-05-21,5,21551.0,22706.690746,,fbp
1298,2020-05-19,2020-05-21,5,20260.0,23485.628076,,fbp
1299,2020-05-20,2020-05-21,5,23285.0,24569.657142,,fbp
1300,2020-05-21,2020-05-21,5,25294.0,26771.352061,,fbp


In [19]:
out_error

Unnamed: 0,model,parameter,mape
0,ets,"{'trend_': 'multiplicative', 'seasonal': 'mult...",10.27417
1,bsts,"{'trend_': 'None', 'autoReg_': 3}",18.700122
2,fbp,{'seasonal': 'additive'},80.671375
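
Rolling-window cross-validation of this kind can be sketched as follows: for each forecast origin, fit on the history up to the origin, forecast the next `periods` days, and score the forecast with MAPE. The code is an illustration only. `rolling_origin_cv` and `seasonal_naive` are hypothetical helpers, the seasonal-naive forecaster stands in for the real models, and the 7-day spacing between origins is an assumption suggested by the `Prediction_Date` values above, not a documented meaning of `window` and `skip`.

```python
import numpy as np
import pandas as pd


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)


def seasonal_naive(history, periods, season=7):
    """Repeat the last full season of the history for `periods` steps."""
    last_season = np.asarray(history[-season:], dtype=float)
    reps = int(np.ceil(periods / season))
    return np.tile(last_season, reps)[:periods]


def rolling_origin_cv(series, first_origin, windows, step, periods, forecaster):
    """Score `forecaster` at `windows` forecast origins spaced `step` days apart."""
    scores = []
    for k in range(windows):
        origin = pd.Timestamp(first_origin) + pd.Timedelta(days=k * step)
        train = series.loc[:origin]
        test = series.loc[origin + pd.Timedelta(days=1):].iloc[:periods]
        forecast = forecaster(train.to_numpy(), len(test))
        scores.append({"window": k + 1, "origin": origin, "mape": mape(test, forecast)})
    return pd.DataFrame(scores)


# Toy series with an exact weekly pattern, so the seasonal-naive error is zero.
index = pd.date_range("2020-03-01", periods=70, freq="D")
series = pd.Series(np.tile([10.0, 12.0, 14.0, 16.0, 18.0, 9.0, 8.0], 10), index=index)
cv = rolling_origin_cv(series, "2020-04-01", windows=3, step=7,
                       periods=14, forecaster=seasonal_naive)
print(cv)
```

Averaging the per-window MAPE for each candidate model gives a table comparable in spirit to `out_error`.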


#### 6. Signal Selection 

Run forward signal selection on the chosen model, `bsts`, fitted below as `bsts_reg`. The signal list is the output of the `pysigeval` dimension reduction.

In [20]:
# get signal list
signal_list = list(df_short.columns[2:])

# fill missing values
df_short = transfuncs.tr_treat_na(df_short, 
                       y = target,
                       index_col = index_col,
                       method='both')


In [22]:
%%time
# run ts_signal_selection and set threshold to 0.1% (to select more signals)
# model = bsts_reg
# the target value is the second column in df
# freq = 'd', window = 5, skip = 6, parallel = True, measure = 'mape', forecast_signals = False
out_df, out_error = signalselect.ts_signal_selection(df_short, signal_list, start_date = window_start, 
                                                     end_date = window_end, threshold=0.1,
                                                     window = 5, skip = 6, periods = forecast_period,
                                                     freq = 'd',
                                                     seasonal_periods = 7,
                                                     daily_cv_schema = 'normal',
                                                     autoReg_ = 3,
                                                     forecast_signals=False, model = 'bsts_reg', parallel=True)

CPU times: user 3.81 s, sys: 819 ms, total: 4.63 s
Wall time: 3min 31s


In [23]:
# show the selected signals
out_error[out_error['selected']==1]

Unnamed: 0,signal,train_mape,test_mape,selected,round
0,,2.206758,29.157054,1,0
9,Taiwan*_none_lag28,1.885511,25.192861,1,1
91,Lithuania_norm_lag28,1.475989,19.857306,1,2
183,Spain_none_lag28,1.242274,19.668583,1,3
271,Finlan_none_lag28,1.021288,16.486474,1,4
393,Croatia_none_lag28,0.942615,15.263074,1,5
441,"Korea,_South_none_lag35",0.732113,15.122227,1,6
547,Romania_none_lag28,0.646147,11.683692,1,7
651,United_Kingdom_Gibraltar_norm_lag28,0.594237,10.88468,1,8
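
The table reads as a greedy forward selection: round 0 is the model without signals, and each later round adds the one signal that lowers the test MAPE the most. A generic version of that loop is sketched below with ordinary least squares standing in for `bsts_reg`. Everything here is an assumption made for illustration: `forward_select` and `holdout_mape` are hypothetical helpers, a single holdout replaces the rolling windows, and `threshold` is interpreted as the minimum MAPE improvement (in percentage points) required to keep going.

```python
import numpy as np


def holdout_mape(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept and return the test MAPE in percent."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    pred = A_test @ coef
    return float(np.mean(np.abs((y_test - pred) / y_test)) * 100)


def forward_select(X, y, names, n_test, threshold=0.1):
    """Greedily add the signal that most reduces holdout MAPE until gains stall."""
    X_train, X_test = X[:-n_test], X[-n_test:]
    y_train, y_test = y[:-n_test], y[-n_test:]
    selected = []
    # Round 0: intercept-only baseline (no signals).
    best = holdout_mape(np.empty((len(y_train), 0)), y_train,
                        np.empty((len(y_test), 0)), y_test)
    while len(selected) < len(names):
        trials = {}
        for j in range(len(names)):
            if j in selected:
                continue
            cols = selected + [j]
            trials[j] = holdout_mape(X_train[:, cols], y_train, X_test[:, cols], y_test)
        j_best = min(trials, key=trials.get)
        if best - trials[j_best] < threshold:
            break  # improvement too small: stop adding signals
        selected.append(j_best)
        best = trials[j_best]
    return [names[j] for j in selected], best


rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = 100.0 + 20.0 * X[:, 0] + 5.0 * X[:, 2]   # x1 carries no information
chosen, final_mape = forward_select(X, y, ["x0", "x1", "x2"], n_test=20)
print(chosen, final_mape)
```

Forward selection evaluates far fewer combinations than an exhaustive search, which matters when each trial means refitting a time series model across several windows.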


In [31]:
out_df[out_df['Window_Number']==2].tail(20)

Unnamed: 0,Date,Prediction_Date,Window_Number,Actual,Fitted,Prediction
153,2020-05-03,2020-04-30,2,25501.0,,17961.174746
154,2020-05-04,2020-04-30,2,22335.0,,20890.672234
155,2020-05-05,2020-04-30,2,23976.0,,17380.830803
156,2020-05-06,2020-04-30,2,24980.0,,19753.541632
157,2020-05-07,2020-04-30,2,27692.0,,22455.089238
158,2020-05-08,2020-04-30,2,26906.0,,25572.038125
159,2020-05-09,2020-04-30,2,25621.0,,26794.369623
160,2020-05-10,2020-04-30,2,19710.0,,19760.002457
161,2020-05-11,2020-04-30,2,18621.0,,22075.907338
162,2020-05-12,2020-04-30,2,21495.0,,22423.238413


In [32]:
# get selected signals, skipping the first row (round 0, where no signal is added)
selected_signal = list(out_error[out_error['selected']==1].signal)[1:]
selected_signal

['Taiwan*_none_lag28',
 'Lithuania_norm_lag28',
 'Spain_none_lag28',
 'Finlan_none_lag28',
 'Croatia_none_lag28',
 'Korea,_South_none_lag35',
 'Romania_none_lag28',
 'United_Kingdom_Gibraltar_norm_lag28']

In [53]:
# get out of sample error by period_out
error_report = modelselect.ts_cv_error(out_df, summary_type = 'period_out', error_type = 'test', 
                                       measure = ['mape', 'mae','mse','rmse'])
error_report['target'] = target
error_report

period_out,mape,mae,mse,rmse,target
1,13.195504,3938.963257,17563540.0,4190.888008,US
2,10.078027,2886.196844,12628170.0,3553.61405,US
3,16.205759,3835.964734,20682000.0,4547.746442,US
4,10.676167,2115.581869,8204069.0,2864.274685,US
5,12.557741,2900.715315,13641270.0,3693.409582,US
6,12.451417,2904.609088,11699350.0,3420.431638,US
7,8.739831,2421.988663,9185231.0,3030.714529,US
8,12.405034,3487.002956,15097800.0,3885.588234,US
9,8.061271,2286.322201,10894400.0,3300.66694,US
10,17.943331,4343.324719,41706630.0,6458.06709,US
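
A per-horizon error report like this one can be derived from a cross-validation frame with the columns shown earlier (`Date`, `Prediction_Date`, `Actual`, `Prediction`). The sketch below assumes `period_out` is the number of days between the prediction date and the forecast date, which the source does not state explicitly; `errors_by_horizon` is a hypothetical helper, not the `ts_cv_error()` implementation, and the demo rows are toy data.

```python
import numpy as np
import pandas as pd


def errors_by_horizon(cv):
    """Aggregate out-of-sample errors by forecast horizon (days ahead)."""
    # Fitted (in-sample) rows have no Prediction and are dropped here.
    test = cv.dropna(subset=["Actual", "Prediction"]).copy()
    horizon = pd.to_datetime(test["Date"]) - pd.to_datetime(test["Prediction_Date"])
    test["period_out"] = horizon.dt.days
    err = test["Actual"] - test["Prediction"]
    test["abs_err"] = err.abs()
    test["sq_err"] = err ** 2
    test["ape"] = err.abs() / test["Actual"].abs() * 100

    grouped = test.groupby("period_out")
    report = pd.DataFrame({
        "mape": grouped["ape"].mean(),
        "mae": grouped["abs_err"].mean(),
        "mse": grouped["sq_err"].mean(),
    })
    report["rmse"] = np.sqrt(report["mse"])
    return report


demo = pd.DataFrame({
    "Date": ["2020-04-29", "2020-05-01", "2020-05-02", "2020-05-08", "2020-05-09"],
    "Prediction_Date": ["2020-04-30", "2020-04-30", "2020-04-30", "2020-05-07", "2020-05-07"],
    "Actual": [50.0, 100.0, 200.0, 100.0, 200.0],
    "Prediction": [np.nan, 110.0, 180.0, 90.0, 220.0],
})
report = errors_by_horizon(demo)
print(report)
```

Grouping by horizon shows whether accuracy degrades as the forecast reaches further ahead, which a single pooled MAPE would hide.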


#### 7. Model Prediction
Fit the selected forecasting model and make new predictions. 

In [50]:
# forecast signals to 2020-06-22
df_forecast = ts_helper.get_signal_forecasts(df = df_short, date = '2020-06-22')

# run bsts_reg model with all default parameters
prediction_df = modelpredict.ts_predict(df_forecast, model = 'bsts_reg',
                                        date =  prediction_date, signal_list = selected_signal, freq = 'd',
                                         periods = forecast_period)
prediction_df

Unnamed: 0,Date,Actual,Fitted,Prediction
0,2020-02-26,6.0,5.985877,
1,2020-02-27,1.0,1.007370,
2,2020-02-28,2.0,2.000659,
3,2020-02-29,8.0,7.991456,
4,2020-03-01,6.0,6.002692,
...,...,...,...,...
136,2020-07-11,,,
137,2020-07-12,,,
138,2020-07-13,,,
139,2020-07-14,,,


In [51]:
prediction_df['Target']  = target

In [52]:
prediction_df.tail(40)

Unnamed: 0,Date,Actual,Fitted,Prediction,Target
101,2020-06-06,,,20576.82073,US
102,2020-06-07,,,21560.644898,US
103,2020-06-08,,,20206.775837,US
104,2020-06-09,,,18551.980253,US
105,2020-06-10,,,17019.409923,US
106,2020-06-11,,,20047.159661,US
107,2020-06-12,,,20051.776692,US
108,2020-06-13,,,19147.163886,US
109,2020-06-14,,,19835.051632,US
110,2020-06-15,,,18899.068975,US


##### Tips
* Please use `get_signal_forecasts()` to create a set of simple forecasts for signals before running `ts_predict()`

In [None]:
out_error

In [54]:
out_df['Target']= target

In [55]:
out_df.to_csv('cross_validation_result.csv')

In [56]:
error_report.to_csv('error_report.csv')

In [57]:
prediction_df.to_csv('prediction_df.csv')