# The goal: Decide what the data for the CNN Time Series Model should look like.

#### Details: success here means that we have a good understanding of what the dataset for our CNN model will look like, and why.
#### For most of the ideas in this test I follow the tutorial here: https://machinelearningmastery.com/how-to-develop-convolutional-neural-network-models-for-time-series-forecasting/

In [1]:
# import libraries
import pandas as pd
import os

In [2]:
# import dataset - use the datetime column as index
# (infer_datetime_format is deprecated since pandas 2.0; parse_dates alone is enough)
df = pd.read_csv("dlts_dummy.csv", header=0, parse_dates=['datetime'], index_col=['datetime'], encoding="ISO-8859-1")

In [3]:
df.head()

datetime,sales_d,beechams,bei.erkÃ.ltung,benylin,bronchitis,cold,cold.and.flu,colds,dizzy,dobendan,...,wind_b,sleet_b,region_key_DEU01,region_key_DEU02,region_key_DEU03,region_key_DEU04,region_key_DEU05,region_key_DEU06,region_key_DEU07,region_key_DEU08
2013-10-07,1511,0,0,0,167,71,0,0,0,20,...,0,0,1,0,0,0,0,0,0,0
2013-10-07,685,0,0,0,167,71,0,0,0,20,...,0,0,1,0,0,0,0,0,0,0
2013-10-07,1546,0,0,0,0,93,0,0,0,13,...,0,0,0,1,0,0,0,0,0,0
2013-10-07,1355,0,0,0,0,93,0,0,0,13,...,0,0,0,1,0,0,0,0,0,0
2013-10-07,3213,0,0,0,0,93,0,0,0,13,...,0,0,0,1,0,0,0,0,0,0


## Resample the data to a weekly frequency

In [4]:
# bin the rows into weeks that end on a Monday
weekly_groups = df.resample('W-MON')
# sum every column within each weekly bin
weekly_data = weekly_groups.sum()

In [5]:
weekly_data.head()

datetime,sales_d,beechams,bei.erkÃ.ltung,benylin,bronchitis,cold,cold.and.flu,colds,dizzy,dobendan,...,wind_b,sleet_b,region_key_DEU01,region_key_DEU02,region_key_DEU03,region_key_DEU04,region_key_DEU05,region_key_DEU06,region_key_DEU07,region_key_DEU08
2013-10-07,48642,0,0,0,334,1222,0,0,0,236,...,2,0,2,4,3,4,1,3,3,3
2013-10-14,43651,0,0,0,518,1220,0,0,184,70,...,1,0,2,2,2,4,2,2,2,3
2013-10-21,47238,0,0,0,707,957,0,0,300,113,...,8,0,2,3,3,4,2,3,3,3
2013-10-28,53223,0,0,0,472,1146,0,33,0,94,...,4,0,2,4,3,3,1,2,2,5
2013-11-04,55440,0,0,0,1058,1096,0,9,388,147,...,8,0,2,3,3,4,3,4,4,3


## Thoughts after resampling the data to a weekly level
#### @problem: when we resample to weekly sums, the one-hot-encoded region columns are summed too, so they turn into row counts per region instead of 0/1 flags, and the sales of all regions end up in a single total. This might create a problem later on.
#### @thinking_out_loud_1: should I run the model with the data this way and see if it makes sense?
#### @thinking_out_loud_2: should I resample the data with non-encoded regions to weekly instead?
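
A minimal sketch of the problem noted above, using a made-up toy frame (the values and the two-region layout are assumptions for illustration, not rows from `dlts_dummy.csv`). It shows how a weekly sum turns one-hot region flags into counts and merges sales across regions:

```python
import pandas as pd

# Hypothetical toy data: two regions, two days, one-hot-encoded region columns.
toy = pd.DataFrame(
    {
        "sales_d": [10, 20, 30, 40],
        "region_key_DEU01": [1, 0, 1, 0],
        "region_key_DEU02": [0, 1, 0, 1],
    },
    index=pd.to_datetime(
        ["2013-10-01", "2013-10-01", "2013-10-02", "2013-10-02"]
    ),
)

# Weekly bins ending on Monday; every column is summed within a bin.
weekly_toy = toy.resample("W-MON").sum()
print(weekly_toy)
```

After the sum, the region columns no longer say which region a row belongs to; they only say how many rows each region contributed to that week.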


In [6]:
# import dataset without one-hot-encoded regions
# (infer_datetime_format is deprecated since pandas 2.0; parse_dates alone is enough)
df2 = pd.read_csv("model_dummy.csv", header=0, parse_dates=['date'], index_col=['date'], encoding="ISO-8859-1")

In [7]:
df2.head()

date,region_key,sales_d,beechams,bei.erkÃ.ltung,benylin,bronchitis,cold,cold.and.flu,colds,dizzy,...,cloudy_b,partly_cloudy_day_b,partly_cloudy_night_b,snow_b,fog_b,rain_b,wind_b,sleet_b,mean_min,mean_max
2013-10-07,DEU01,1511,0,0,0,167,71,0,0,0,...,0,0,0,0,1,0,0,0,10.65,15.49
2013-10-07,DEU01,685,0,0,0,167,71,0,0,0,...,0,0,0,0,0,0,0,0,5.321667,13.805
2013-10-07,DEU02,1546,0,0,0,0,93,0,0,0,...,0,0,1,0,0,0,0,0,4.33,17.27
2013-10-07,DEU02,1355,0,0,0,0,93,0,0,0,...,0,1,0,0,0,0,0,0,8.1575,16.24
2013-10-07,DEU02,3213,0,0,0,0,93,0,0,0,...,0,0,0,0,1,0,0,0,10.05,15.22


## region_key appears as a variable in df2
#### Let's resample weekly and see whether the result differs from what we got when we resampled df

In [8]:
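# resample df2; numeric_only=True leaves out the string column region_key
# (newer pandas versions may otherwise try to concatenate the strings)
weekly_groups2 = df2.resample('W-MON')
weekly_data2 = weekly_groups2.sum(numeric_only=True)
weekly_data2.head()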

date,sales_d,beechams,bei.erkÃ.ltung,benylin,bronchitis,cold,cold.and.flu,colds,dizzy,dobendan,...,cloudy_b,partly_cloudy_day_b,partly_cloudy_night_b,snow_b,fog_b,rain_b,wind_b,sleet_b,mean_min,mean_max
2013-10-07,48642,0,0,0,334,1222,0,0,0,236,...,1,5,3,0,5,2,2,0,139.322167,324.453429
2013-10-14,43651,0,0,0,518,1220,0,0,184,70,...,1,5,1,0,5,3,1,0,133.9995,248.5475
2013-10-21,47238,0,0,0,707,957,0,0,300,113,...,1,5,1,0,3,2,8,0,161.6365,320.423
2013-10-28,53223,0,0,0,472,1146,0,33,0,94,...,2,5,2,0,3,2,4,0,232.497881,390.522762
2013-11-04,55440,0,0,0,1058,1096,0,9,388,147,...,3,5,3,0,1,2,8,0,163.936667,346.221667


## Resampling df2 is challenging because region_key disappears
#### Summing is not meaningful for a string column, so region_key is missing from the weekly result
#### It might work if we group by region_key first - let's play with the idea

In [9]:
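# weekly mean per region: group by region_key and by week (weeks ending on Monday)
mean_agg = df2.groupby(['region_key', pd.Grouper(freq='W-MON')]).mean()

# pivot region_key into the columns for a side-by-side view of the regions
mean_agg.unstack('region_key')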

,sales_d,sales_d,sales_d,sales_d,sales_d,sales_d,sales_d,sales_d,beechams,beechams,...,mean_min,mean_min,mean_max,mean_max,mean_max,mean_max,mean_max,mean_max,mean_max,mean_max
region_key,DEU01,DEU02,DEU03,DEU04,DEU05,DEU06,DEU07,DEU08,DEU01,DEU02,...,DEU07,DEU08,DEU01,DEU02,DEU03,DEU04,DEU05,DEU06,DEU07,DEU08
date
2013-10-07,1098.0,2364.750000,861.000000,3492.000000,1652.0,1428.333333,3261.000000,1572.000000,0.0,0.0,...,4.082667,3.634000,14.647500,16.112500,14.243333,14.311250,12.851429,12.853333,12.853333,13.587333
2013-10-14,2045.0,3618.000000,1590.000000,2253.000000,2698.0,2048.000000,2399.000000,1947.666667,0.0,0.0,...,8.138000,7.310556,14.195000,13.476500,11.915833,12.026875,11.590000,14.490500,14.490500,13.374444
2013-10-21,1458.0,1696.333333,1830.333333,1635.250000,3376.0,2191.333333,2550.666667,2074.333333,0.0,0.0,...,6.117333,4.748333,13.182500,14.045833,13.640000,14.470833,14.730833,13.521333,13.521333,14.175833
2013-10-28,2226.5,2661.000000,2026.000000,1945.666667,3497.0,1676.500000,3890.500000,2316.000000,0.0,0.0,...,7.339167,10.330000,16.833333,18.614167,17.365000,16.789333,18.371429,16.782500,16.782500,18.887000
2013-11-04,1430.0,3290.000000,2517.000000,2598.500000,1735.0,1775.500000,2069.250000,1393.666667,0.0,0.0,...,6.194375,5.905833,12.412917,12.648889,12.796111,13.602917,12.072222,13.951250,13.951250,14.274167
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-08-19,2675.0,2346.000000,2683.500000,2473.000000,2420.0,1736.333333,1390.666667,2783.000000,0.0,0.0,...,15.110000,14.537500,20.484167,21.894500,23.556250,24.614167,24.267500,25.464667,25.464667,24.355000
2019-08-26,2276.0,1148.000000,1405.000000,2253.666667,1999.5,2156.666667,1801.666667,1699.333333,0.0,0.0,...,13.751667,13.715556,24.003000,26.012857,27.478000,26.590000,21.648000,26.608333,26.608333,27.117778
2019-09-02,2718.0,3069.500000,2034.000000,1424.333333,2718.0,1661.666667,2979.333333,1681.666667,0.0,0.0,...,19.542667,17.578667,28.544583,28.832917,30.654167,30.182222,27.561111,31.590667,31.590667,32.263333
2019-09-09,1276.0,1914.500000,2592.333333,2227.000000,2622.0,1029.000000,1885.000000,2454.500000,0.0,0.0,...,10.636667,11.729000,17.965833,19.261250,20.673333,21.271667,19.075500,21.190833,21.190833,21.583500


In [10]:
mean_agg.head()

region_key,date,sales_d,beechams,bei.erkÃ.ltung,benylin,bronchitis,cold,cold.and.flu,colds,dizzy,dobendan,...,cloudy_b,partly_cloudy_day_b,partly_cloudy_night_b,snow_b,fog_b,rain_b,wind_b,sleet_b,mean_min,mean_max
DEU01,2013-10-07,1098.0,0.0,0.0,0.0,167.0,71.0,0.0,0.0,0.0,20.0,...,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,7.985833,14.6475
DEU01,2013-10-14,2045.0,0.0,0.0,0.0,218.0,253.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,7.54,14.195
DEU01,2013-10-21,1458.0,0.0,0.0,0.0,283.0,132.0,0.0,0.0,0.0,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,7.915,13.1825
DEU01,2013-10-28,2226.5,0.0,0.0,0.0,236.0,194.0,0.0,0.0,0.0,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,11.561667,16.833333
DEU01,2013-11-04,1430.0,0.0,0.0,0.0,83.0,176.0,0.0,0.0,0.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,7.86875,12.412917


## In short, grouping by region works in theory, but:
#### 1) It looks impractical for what we want to do
#### 2) We can always test a model on it to see if it makes any sense

## Why won't we use resampling for our problem?
### To answer this question we first need to understand what resampling does
#### Resampling means changing the frequency of your time series observations. There are two ways to do this:
##### 1) Upsampling: you increase the frequency of the samples, such as from hours to minutes.
##### 2) Downsampling: you decrease the frequency of the samples, such as from days to weeks.
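
The two directions can be sketched on a tiny made-up daily series (the values, the dates, and the 12-hour target frequency are assumptions chosen only to keep the example small):

```python
import pandas as pd

# Hypothetical daily series, Monday 2013-10-07 to Thursday 2013-10-10.
daily = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2013-10-07", periods=4, freq="D"),
)

# Downsampling: days -> weeks ending on Monday, aggregated with a sum.
weekly = daily.resample("W-MON").sum()

# Upsampling: days -> 12-hour steps. The new time slots have no observations,
# so they start as NaN and have to be filled, here by linear interpolation.
half_daily = daily.resample(pd.Timedelta(hours=12)).asfreq()
half_daily_filled = half_daily.interpolate(method="linear")

print(weekly)
print(half_daily)
print(half_daily_filled)
```

Downsampling needs an aggregation rule, while upsampling needs a fill rule. In both cases the resulting values are computed rather than observed.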

<strong> In both cases we need to be careful with our approach, since resampling our data means that, in essence, we are creating new data.
</strong>
<p>
At the time of writing, my current manager, one of the people who has taught me a lot about being a good analyst, has a very simple rule: "Try not to invent data. When you invent data you always risk an unreliable analysis, no matter how robust the technique you use."
</p>
<strong>
Examples of what we need to be careful with:
</strong>
<p>
When we use upsampling we need to be careful about how the fine-grained observations are calculated using interpolation, <i>that is, constructing new data points within the range of a discrete set of known data points</i>.
</p>
<p>
When we use downsampling we need to be careful about the summary statistic used to calculate the aggregated values. <i>In the examples above I used the sum for the plain weekly resample and the mean for the per-region grouping; a different summary statistic might suit our data better, but we won't explore that in this exercise.</i>
</p>
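
As a rough illustration of why the choice of statistic matters, the sketch below applies a different aggregation to each column. The toy values and the sum/max/mean pairing are assumptions for illustration, not a recommendation for this dataset:

```python
import pandas as pd

# Hypothetical daily rows that all fall in the week ending Monday 2013-10-14.
toy = pd.DataFrame(
    {
        "sales_d": [100, 200, 300],     # a quantity: adding it up is natural
        "rain_b": [0, 1, 1],            # a 0/1 flag: "did it rain this week?"
        "mean_max": [15.0, 17.0, 13.0]  # a temperature: summing has no meaning
    },
    index=pd.to_datetime(["2013-10-08", "2013-10-09", "2013-10-10"]),
)

# One aggregation rule per column instead of a blanket .sum()
weekly_mixed = toy.resample("W-MON").agg(
    {"sales_d": "sum", "rain_b": "max", "mean_max": "mean"}
)
print(weekly_mixed)
```

A blanket `.sum()` would instead add up the temperatures, which is why the weekly `mean_min` and `mean_max` values in the resampled df2 output are far larger than any real temperature.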

<strong>
Jason Brownlee, PhD, in his blog "Machine Learning Mastery", mentions that there are perhaps two main reasons why you may be interested in resampling your time series data:
</strong>
<ol>
<li>
<strong>Problem Framing:</strong> Resampling may be required if your data is not available at the same frequency at which you want to make predictions.
</li>
<li>
<strong>Feature Engineering:</strong> Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.
</li>
</ol>

### I've already made sure that the data is broken down on a weekly basis and that the weather variables and region_keys are one-hot encoded. Therefore, we should be able to use the data to run the model without resampling it.
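
For orientation, a 1D CNN for time series expects its input as a 3-D array of shape `[samples, timesteps, features]`. The sketch below shows one possible way to cut a table into such windows. The helper `make_windows`, the window length of 3, and the toy array are all hypothetical, and it assumes one row per time step (for example, a single region), which is not how the raw files are laid out:

```python
import numpy as np


def make_windows(values, n_steps, target_col=0):
    """Slice a 2-D array (time x features) into overlapping windows.

    Returns X with shape (samples, n_steps, features) and y holding the
    target column's value at the step right after each window.
    """
    X, y = [], []
    for start in range(len(values) - n_steps):
        end = start + n_steps
        X.append(values[start:end, :])     # n_steps rows of all features
        y.append(values[end, target_col])  # next-step value of the target
    return np.array(X), np.array(y)


# Hypothetical series: 6 time steps, 2 features (column 0 is the target).
values = np.arange(12, dtype=float).reshape(6, 2)
X, y = make_windows(values, n_steps=3)
print(X.shape, y)
```

With real data, `values` would be something like the numeric columns of one region's rows, ordered by date.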

#### In the next post we will run the CNN time series model without resampling, using the one-hot-encoded data.