# Time Series - Maker workshop

## Quick round table

Introductions & expectations?

## Definition

- **Time series data is data collected at successive points in time.** By contrast, cross-sectional data observes individuals, companies, etc. at a single point in time.


- If you previously followed the *Maker workshop dedicated to Machine Learning*, you've already worked with cross-sectional data, but not time series.


- Time series can be found in a wide variety of domains: economics, social sciences, medicine, and (obviously) physical sciences and engineering. As a result, **we deal with them a lot at Total!**
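
To make the distinction concrete, here is a minimal sketch with made-up values (the names and numbers are illustrative only): a time series is indexed by timestamps, while a cross-sectional table has one row per individual at a single date.

```python
import pandas as pd

# Time series: one variable observed at successive points in time.
time_series = pd.Series(
    [2.1, 1.8, 1.5, 1.9],
    index=pd.date_range("2009-01-01", periods=4, freq="D"),
    name="temperature_degC",
)

# Cross-sectional data: several individuals observed at a single point in time.
cross_section = pd.DataFrame(
    {"station": ["A", "B", "C"], "temperature_degC": [2.1, 3.4, 0.7]}
)

# The time ordering lives in the index, which enables time-aware operations.
print(time_series.index.is_monotonic_increasing)
print(time_series.diff())  # change from one day to the next
```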

## Outline

1. Today's challenge
2. Today's Data Science environment checklist
3. Exploring the data 
    - Types, indexes and unique values
    - Distributions
    - Correlations
4. Dealing with missing values
5. Resampling techniques
6. Time series visualization
7. Anomaly detection techniques
8. Forecasting
9. Open discussion / work session

## Today's Challenge

**Predict the air temperature in 2017 based on weather data from 2009 to 2016.**

- Features available:
    - Air temperature
    - Atmospheric pressure
    - Humidity
    - Wind direction
    - Etc.

## Today's Data Science environment checklist

- A Jupyter notebook
- The data folder (the one that we sent)
- The following libraries installed:

In [None]:
! make -f ../setup/Makefile

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

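try:
    from prophet import Prophet  # package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet  # older releases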

# Optional
%config InlineBackend.figure_format = 'retina'

## Loading the data

In [None]:
clean_data = pd.read_csv('../data/jena_climate_2009_2016_part_3.csv')
clean_data.set_index('Date Time', inplace=True)
clean_data.index = pd.to_datetime(clean_data.index)
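
If the timestamps are stored day-first (for example `01.02.2009 00:10:00`, as in some distributions of the Jena climate data), letting pandas infer the format can silently swap day and month. The sketch below assumes that format, which may not match the file used here; check a few raw values before relying on it.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (values are made up).
raw_csv = io.StringIO(
    "Date Time,T (degC)\n"
    "01.02.2009 00:10:00,-8.02\n"
    "01.02.2009 00:20:00,-8.41\n"
)

df = pd.read_csv(raw_csv)
# Assumed format: day.month.year hour:minute:second.
df["Date Time"] = pd.to_datetime(df["Date Time"], format="%d.%m.%Y %H:%M:%S")
df = df.set_index("Date Time").sort_index()

print(df.index[0])  # parsed as 1 February 2009, not 2 January
```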

## Anomaly detection

### Definition

- An anomaly is an outlier: a data point that does not follow the pattern shared by the majority of the data and can therefore be distinguished from the rest.

- In our case, we can try to identify abnormal temperatures over the period.

In [None]:
clean_data.head()

### Fixed threshold

In [None]:
TAG_NAME = 'T (degC)'

plt.figure(figsize=(20, 7))

clean_data[TAG_NAME].plot()
plt.title(TAG_NAME)

plt.show()

_What could be a relevant threshold to apply to this specific sensor?_

Now, consider that we apply the following thresholds (lower, upper) to the specified sensor.

In [None]:
SENSORS_THRESHOLDS = {TAG_NAME:[-15, 34]}

Let's backtest our **fixed threshold** strategy:

In [None]:
backtesting_df = clean_data.copy()

for col in SENSORS_THRESHOLDS.keys():
    upper_alert = (backtesting_df[col] > SENSORS_THRESHOLDS[col][1])
    lower_alert = (backtesting_df[col] < SENSORS_THRESHOLDS[col][0])
    
    backtesting_df[f'is_alert_{col}'] = (upper_alert | lower_alert).astype(int)

In [None]:
plt.figure(figsize=(20, 7))

backtesting_df[f'is_alert_{TAG_NAME}'].plot(color='red')
plt.title(f'Fixed threshold - Alerting state on sensor {TAG_NAME}')

plt.show()
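
A plot of the alert state is hard to read when alerts are rare. One possible complement, sketched below on synthetic data (the series, the thresholds and the `summarize_alerts` helper are illustrative, not part of the workshop material), is to count the alerts and the share of time spent in alert.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures with a yearly cycle (illustrative data only).
rng = np.random.default_rng(0)
index = pd.date_range("2016-01-01", "2016-12-31", freq="D")
seasonal = 10 - 12 * np.cos(2 * np.pi * index.dayofyear.to_numpy() / 365.25)
temperature = pd.Series(seasonal + rng.normal(0, 4, len(index)), index=index)


def summarize_alerts(series, lower, upper):
    """Return 0/1 alert flags plus simple backtest statistics."""
    is_alert = (series < lower) | (series > upper)
    summary = {
        "n_points": int(len(series)),
        "n_alerts": int(is_alert.sum()),
        "alert_rate": float(is_alert.mean()),
    }
    return is_alert.astype(int), summary


flags, summary = summarize_alerts(temperature, lower=-5, upper=25)
print(summary)
```

An alert rate close to zero suggests thresholds that are too loose to catch anything; a high rate suggests they will be ignored in practice.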

### Statistical profiling

- Creating a statistical profile of the data can be the fastest and the most useful approach, and it still offers a **clear and explainable outcome**.

- In the case of statistical profiling, **we use the mean, median, standard deviations and/or quantiles to come up with upper and lower bounds** to detect anomalies.
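
Two common ways of deriving such bounds can be sketched as follows, on synthetic data and with NumPy only. The 3-standard-deviation rule and the 1st/99th percentiles are conventional choices used here for illustration, and `count_outside` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=10.0, scale=8.0, size=10_000)  # made-up readings

# Option 1: mean +/- k standard deviations (sensitive to extreme values).
k = 3
mean, std = values.mean(), values.std()
std_bounds = (mean - k * std, mean + k * std)

# Option 2: empirical percentiles (no assumption on the distribution shape).
percentile_bounds = tuple(np.percentile(values, [1, 99]))


def count_outside(data, bounds):
    """Count the points strictly outside the (lower, upper) bounds."""
    lower, upper = bounds
    return int(((data < lower) | (data > upper)).sum())


print("std rule:   ", std_bounds, count_outside(values, std_bounds))
print("percentiles:", percentile_bounds, count_outside(values, percentile_bounds))
```

Note that percentile bounds flag a fixed share of the data by construction (about 2% here), whether or not those points are truly abnormal.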

In [None]:
plt.figure(figsize=(20, 5))

# Pass the series explicitly as x to get a horizontal box plot
sns.boxplot(x=clean_data[TAG_NAME])

plt.show()

Now, consider that we use the 1st and 99th percentiles (the 0.01 and 0.99 quantiles) as lower and upper bounds for the specified sensor.

In [None]:
QUANTILE_PARAM = 0.99

upper_quantile = clean_data[TAG_NAME]."CODE HERE"(QUANTILE_PARAM)
lower_quantile = clean_data[TAG_NAME]."CODE HERE"(1-QUANTILE_PARAM)

Let's backtest our **statistical profiling** strategy:

In [None]:
backtesting_df = clean_data.copy()

for col in SENSORS_THRESHOLDS.keys():
    upper_alert = (backtesting_df[col] > upper_quantile)
    lower_alert = (backtesting_df[col] < lower_quantile)
    
    backtesting_df[f'is_alert_{col}'] = (upper_alert | lower_alert).astype(int)

In [None]:
plt.figure(figsize=(20, 7))

backtesting_df[f'is_alert_{TAG_NAME}'].plot(color='red')
plt.title(f'Statistical profiling - Alerting state on sensor {TAG_NAME}')

plt.show()

## Forecasting

### A word of caution

One needs to be careful when predicting the future:

- _"Stocks have reached what looks like a permanently high plateau."_ - Irving Fisher, Professor of Economics, Yale University, 1929
    - True or False?

- _"Computers in the future may weigh no more than 1.5 tons."_ - Popular Mechanics, forecasting the relentless march of science, 1949
    - True or False?

### Introduction to Prophet

- Open-sourced by Facebook's Core Data Science team in 2017, Prophet is based on time series decomposition and can model multiple seasonalities as well as the effect of holidays and special events.

- On the [Prophet GitHub page](https://github.com/facebook/prophet), we find the following description:

_"Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well."_
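
In the notation of the Prophet paper (Taylor and Letham, "Forecasting at Scale", 2018), this additive model can be written as follows; the symbols are the paper's, not this workshop's:

$$
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
$$

where $g(t)$ is the trend, $s(t)$ the periodic (seasonal) component, $h(t)$ the effect of holidays and $\varepsilon_t$ the error term.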

- In this section, we'll assess how well Prophet predicts future values of the temperature (the `T (degC)` sensor).

The input to Prophet is always a DataFrame with 2 columns: `ds` and `y`:
- The `ds` (datestamp) column should be of a format expected by pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. 
- The `y` column must be numeric, and represents the measurement we wish to forecast.
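
These requirements can be checked before fitting. The `check_prophet_input` helper below is hypothetical (it is not part of Prophet) and deliberately stricter than the description above, since it expects `ds` to be already parsed as datetimes.

```python
import pandas as pd


def check_prophet_input(df):
    """Raise ValueError if df does not look like a ds/y input frame."""
    missing = {"ds", "y"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not pd.api.types.is_datetime64_any_dtype(df["ds"]):
        raise ValueError("'ds' should hold datetimes, e.g. via pd.to_datetime")
    if not pd.api.types.is_numeric_dtype(df["y"]):
        raise ValueError("'y' must be numeric")
    return True


# Made-up example frame with the expected two columns.
example = pd.DataFrame(
    {
        "ds": pd.to_datetime(["2016-12-29", "2016-12-30", "2016-12-31"]),
        "y": [1.2, 0.4, -0.3],
    }
)
print(check_prophet_input(example))
```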

In [None]:
TAG_NAME = 'T (degC)'

prophet_df = clean_data."CODE HERE"('1d')."CODE HERE"()
prophet_df = prophet_df[[TAG_NAME]].reset_index()
prophet_df = prophet_df.rename(columns={'Date Time':'ds', TAG_NAME:'y'})

prophet_df.head()

Prophet follows the sklearn model API. We create an instance of the `Prophet` class and then call its `fit` and `predict` methods.

In [None]:
model = Prophet()
model.fit(prophet_df)

Now that we have a model, we can make predictions on a DataFrame with a column `ds` containing the dates for which a prediction is to be made. 

You can get a suitable DataFrame that extends into the future a specified number of days using the helper method `Prophet.make_future_dataframe` (by default, it will also include the dates from the history).
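
Conceptually, the resulting frame is just the historical dates followed by the requested number of future dates. The pandas-only sketch below mimics that behaviour for daily data; it is an illustration, not Prophet's actual implementation, and `extend_daily_dates` is a hypothetical helper.

```python
import pandas as pd


def extend_daily_dates(history_ds, periods, include_history=True):
    """Return a one-column frame of daily dates extending past the history."""
    last_date = history_ds.max()
    future_dates = pd.Series(
        pd.date_range(start=last_date + pd.Timedelta(days=1), periods=periods, freq="D")
    )
    if include_history:
        dates = pd.concat([pd.Series(history_ds), future_dates], ignore_index=True)
    else:
        dates = future_dates
    return pd.DataFrame({"ds": dates})


# One week of made-up history ending on 31 December 2016.
history = pd.Series(pd.date_range("2016-12-25", "2016-12-31", freq="D"))
future_sketch = extend_daily_dates(history, periods=365)
print(future_sketch.tail())
```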

In [None]:
future = model.make_future_dataframe(periods=365)
future.tail()

Now, we can apply the `predict` method to this DataFrame: it will assign each row a predicted value which it names `yhat`. If you pass in historical dates, it will provide an in-sample fit.

In [None]:
forecast = model."CODE HERE"(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
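
To judge a forecast, compare it with observations that were not used for fitting. Here is a minimal sketch, assuming a chronological hold-out and using a naive "same day last year" forecast as a stand-in for `yhat` (all data below is synthetic, and the frame names are illustrative):

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily temperatures (illustrative only).
rng = np.random.default_rng(1)
ds = pd.date_range("2014-01-01", "2015-12-31", freq="D")
y = 10 - 12 * np.cos(2 * np.pi * ds.dayofyear.to_numpy() / 365) + rng.normal(0, 3, len(ds))
data = pd.DataFrame({"ds": ds, "y": y})

# Chronological split: never shuffle a time series before splitting.
train = data[data["ds"] < "2015-01-01"]
test = data[data["ds"] >= "2015-01-01"]

# Stand-in forecast: repeat the value observed 365 days earlier.
forecast_sketch = pd.DataFrame(
    {"ds": train["ds"] + pd.Timedelta(days=365), "yhat": train["y"].to_numpy()}
)

# Align forecast and observations on the timestamp before computing errors.
scored = test.merge(forecast_sketch, on="ds", how="inner")
errors = scored["y"] - scored["yhat"]
mae = errors.abs().mean()
rmse = np.sqrt((errors ** 2).mean())
print(f"MAE = {mae:.2f} degC, RMSE = {rmse:.2f} degC")
```

With a fitted Prophet model, `forecast[['ds', 'yhat']]` would take the place of `forecast_sketch`, and the naive forecast becomes a useful baseline to beat.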

We can plot the forecast by calling the `Prophet.plot` method and passing in our forecast DataFrame.

In [None]:
fig1 = model."CODE HERE"(forecast)

If you want to see the forecast components, you can use the `Prophet.plot_components` method. 

By default you’ll see the trend, yearly seasonality, and weekly seasonality of the time series. If you include holidays, you’ll see those here, too.

In [None]:
fig2 = model."CODE HERE"(forecast)

## Thank you!
### Any feedback? Return on time invested?