# Time Series - Maker workshop

## Quick round table

Introductions & expectations?

## Definition

- **Time series data is data that is collected at different points in time.** This is opposed to cross-sectional data which observes individuals, companies, etc. at a single point in time.


- If you previously followed the *Maker workshop dedicated to Machine Learning*, you've already worked with cross-sectional data, but not time series.


- Time series can be found in a wide variety of domains: economics, social sciences, medicine, and (obviously) physical sciences and engineering. As a result, **we deal with them a lot at Total!**

## Outline

1. Today's challenge
2. Today's Data Science environment checklist
3. Exploring the data 
    - Types, indexes and unique values
    - Distributions
    - Correlations
4. Dealing with missing values
5. Resampling techniques
6. Time series visualization
7. Anomaly detection techniques
8. Forecasting
9. Open discussion / work session

## Today's Challenge

**Predict the air temperature in 2017 based on weather data from 2009 to 2016.**

- Features available:
    - Air temperature
    - Atmospheric pressure
    - Humidity
    - Wind direction
    - Etc.

## Today's Data Science environment checklist

- A Jupyter notebook
- The data folder (the one that we sent)
- The following libraries installed:

In [None]:
! make -f ../setup/Makefile

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from fbprophet import Prophet

# Optional
%config InlineBackend.figure_format = 'retina'

## Loading the data

In [None]:
raw_data = pd.read_csv('../data/jena_climate_2009_2016_part_2.csv', sep=',')
raw_data.head()

## Dealing with missing values

### Identification of missing values

_Let's identify missing values in our dataset._

Tip: Since our variables are floats, missing values appear as `NaN` (= Not a Number).

In [None]:
raw_data."CODE HERE"()."CODE HERE"()

In [None]:
raw_data[raw_data.isna().any(axis=1)]
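As a minimal sketch of what such a check can look like, here is one possible approach on a small hypothetical frame (`toy` and its column names are made up for illustration; this is not the workshop dataset). It counts missing values per column and then isolates the rows that contain at least one `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only meant to illustrate the idea
toy = pd.DataFrame({
    "temp": [1.0, np.nan, 3.0, 4.0],
    "pressure": [990.0, 991.0, np.nan, 993.0],
    "humidity": [50.0, np.nan, np.nan, 53.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Counting per column tells you which sensors are affected, while the row filter shows whether the gaps occur at the same timestamps.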

## Resampling techniques

### Assess the time series delta

In [None]:
sorted_data = raw_data.copy()
sorted_data['Date Time'] = pd.to_datetime(sorted_data['Date Time'], format='%d.%m.%Y %H:%M:%S')
sorted_data = sorted_data.sort_values('Date Time')

dt_index_delta = sorted_data['Date Time']."CODE HERE"
dt_index_delta.value_counts()
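To make the idea of a "time series delta" concrete, here is a small sketch on hypothetical timestamps (the `stamps` series is invented for illustration and only mimics the date format used above). It computes the gap between consecutive timestamps and counts how often each gap occurs.

```python
import pandas as pd

# Hypothetical timestamps: three regular 10-minute steps, then a 30-minute gap
stamps = pd.Series(pd.to_datetime(
    [
        "01.01.2009 00:10:00",
        "01.01.2009 00:20:00",
        "01.01.2009 00:30:00",
        "01.01.2009 01:00:00",
    ],
    format="%d.%m.%Y %H:%M:%S",
))

# Difference between each timestamp and the previous one (the first is NaT)
deltas = stamps.diff()

# How often each gap size occurs (NaT is ignored by default)
counts = deltas.value_counts()
print(counts)
```

If the series were perfectly regular, `counts` would contain a single entry; any additional entry points to gaps or duplicated timestamps.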

Luckily for us, pandas provides a convenient `resample()` method for dataframes that have a `DatetimeIndex`.

### Create a DatetimeIndex

In [None]:
reindexed_data = sorted_data.copy()
reindexed_data."CODE HERE"('Date Time', inplace=True)

In order to better illustrate the concept of resampling, let's create a fake sinusoidal time series.

In [None]:
def create_fake_ts(length=10000):
    raw_values = np.sin(np.linspace(1, length, length)*2*np.pi/(24*60))
    date = pd.date_range(start='2017-01-01', periods=len(raw_values), freq='min')
    
    sampled = [i for i in range(len(raw_values)//2, len(raw_values), 8*60)]
    sampled_date = date[sampled]

    raw_less_values = raw_values[sampled]

    raw_df = pd.DataFrame(raw_values[:sampled[0]])
    raw_df.index = date[:sampled[0]]

    raw_less_values_df = pd.DataFrame(raw_less_values)
    raw_less_values_df.index = sampled_date

    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    raw_df = pd.concat([raw_df, raw_less_values_df])
    
    raw_df.index.name = 'date'
    raw_df.columns = ['values']

    return raw_df

In [None]:
ts_to_resample = create_fake_ts(length=10000)

We can then apply the same analysis of the time differences as in the previous section.

In [None]:
ts_to_resample.reset_index(inplace=True)

ts_to_resample['delta'] = ts_to_resample['date'].diff()

ts_to_resample.set_index('date', inplace=True)

ts_to_resample.head()

In [None]:
plt.figure(figsize=(15, 5))

plt.scatter(x=ts_to_resample.index, y=ts_to_resample['values'])

plt.title('Time series to resample')

plt.show()

In [None]:
plt.figure(figsize=(15, 5))

(ts_to_resample['delta'].dt.total_seconds() / (3600 * 24)).plot()

plt.title('Differences between timestamps')

plt.show()

In this example data, starting on 2017-01-04, we no longer receive 1 data point every minute but instead 1 data point every 8h.

In our case, there are two ways to obtain a time series with equally spaced data points:
1. Remove some data points from the first part of the time series in order to get 8h-spaced data points similar to the second part of the time series: this technique is known as **undersampling**.
2. Add some data points in the second part of the time series in order to get 1min-spaced data points similar to the first part of the time series: this technique is known as **oversampling**.
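The two options can be sketched on a small hypothetical series (`toy_series` is invented for illustration: 16 hours of 1-minute data followed by three 8h-spaced points). This sketch assumes that keeping the first value of each 8h bin is an acceptable way to undersample, and it uses a forward fill to oversample; other choices are possible.

```python
import numpy as np
import pandas as pd

# Hypothetical series: dense 1-minute data, then sparse 8h-spaced data
dense_idx = pd.date_range("2017-01-01 00:00", periods=16 * 60, freq="min")
sparse_idx = pd.date_range("2017-01-01 16:00", periods=3, freq="8h")
index = dense_idx.append(sparse_idx)
toy_series = pd.Series(np.arange(len(index), dtype=float), index=index)

# Undersampling: keep one point per 8h bin (here, the first one)
undersampled = toy_series.resample("8h").first()

# Oversampling: create one point per minute, filled with the last known value
oversampled = toy_series.resample("min").ffill()

print(len(toy_series), len(undersampled), len(oversampled))
```

Undersampling discards most of the dense part of the series, while oversampling creates many new points in the sparse part whose values have to be chosen somehow.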

### Oversampling

#### Forward-fill method

Although the right choice depends on the problem at hand, a reasonable default is to resample to the most represented frequency, so that you add or remove as few data points as possible.

When oversampling your time series, i.e. creating new data points, you'll need to make a decision regarding which values to assign to these new points.

When resampling time series, a common risk is to introduce **data leakage** by adding data from the future to the past, i.e. data that would not have been available at the time.

For example, if you decide to fill a missing data point at time t with the next available values (this is known as **backward filling**), how would you have been able to do that at time t, knowing that these future values were not available at the time?

You should always ask yourself this question when manipulating time series, especially when adding data points and creating new features.

![](../setup/images/ffill.png)

As presented on this diagram, you should always use **forward filling** as a resampling method when oversampling. Backward filling would bring in unavailable values from the future and introduce a data leak.
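The difference between the two filling directions can be seen on a tiny hypothetical example (`obs` is invented for illustration: one observation at 00:00 and the next one at 00:03).

```python
import pandas as pd

# Hypothetical observations: a value of 1.0 at 00:00, then 5.0 at 00:03
obs = pd.Series(
    [1.0, 5.0],
    index=pd.to_datetime(["2017-01-01 00:00", "2017-01-01 00:03"]),
)

# Forward fill: new points take the last value known at that time
ffilled = obs.resample("min").ffill()

# Backward fill: new points take the next value, which lies in the future
bfilled = obs.resample("min").bfill()

print(pd.DataFrame({"ffill": ffilled, "bfill": bfilled}))
```

With the backward fill, the points at 00:01 and 00:02 already carry the value 5.0, which was only observed at 00:03.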

In [None]:
ts_to_resample_min = ts_to_resample['values']."CODE HERE"('min')."CODE HERE"()

plt.figure(figsize=(15, 5))

plt.plot(ts_to_resample_min, color='orange')
plt.scatter(x=ts_to_resample.index, y=ts_to_resample['values'])

plt.title('Resampling with the forward fill method')

plt.show()

#### Other methods - Linear

In [None]:
ts_to_resample_min = ts_to_resample['values']."CODE HERE"('min').interpolate(method="CODE HERE")

plt.figure(figsize=(15, 5))

plt.plot(ts_to_resample_min, color='orange')
plt.scatter(x=ts_to_resample.index, y=ts_to_resample['values'])

plt.title('Resampling with the linear method')

plt.show()

#### Other methods - Nearest

In [None]:
# 'min' is the minute alias; 'T' is deprecated in recent pandas versions
ts_to_resample_min = ts_to_resample['values'].resample('min').interpolate(method='nearest')

plt.figure(figsize=(15, 5))

plt.plot(ts_to_resample_min, color='orange')
plt.scatter(x=ts_to_resample.index, y=ts_to_resample['values'])

plt.title('Resampling with the "nearest" method')

plt.show()

As explained earlier, if you look closely at the different graphs, you'll realize that **the forward fill method is the only method presented which doesn't introduce any data leak.**
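One way to check this claim is to change only the *future* observation and see whether a filled value in the gap changes. The sketch below does this on hypothetical data; `fill_gap` is a helper made up for this illustration, and it uses `Resampler.nearest()` as a stand-in for `interpolate(method='nearest')` so that the example does not depend on SciPy.

```python
import pandas as pd


def fill_gap(last_value, next_value, method):
    """Resample two hypothetical observations to 1-minute steps."""
    obs = pd.Series(
        [last_value, next_value],
        index=pd.to_datetime(["2017-01-01 00:00", "2017-01-01 00:04"]),
    )
    resampler = obs.resample("min")
    if method == "ffill":
        return resampler.ffill()
    if method == "nearest":
        return resampler.nearest()
    return resampler.interpolate(method=method)


# A point inside the gap, i.e. before the next observation is known
t = pd.Timestamp("2017-01-01 00:03")

leaks = {}
for method in ["ffill", "linear", "nearest"]:
    value_a = fill_gap(1.0, 5.0, method).loc[t]
    value_b = fill_gap(1.0, 50.0, method).loc[t]
    # If the filled value depends on the future observation, there is a leak
    leaks[method] = bool(value_a != value_b)

print(leaks)
```

Only the forward fill gives the same value at `t` regardless of what is observed later.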

### Mix of undersampling and oversampling

In [None]:
# '4h' is the hour alias; the uppercase 'H' is deprecated in recent pandas versions
ts_to_resample_min = ts_to_resample['values'].resample('4h').ffill()

plt.figure(figsize=(15, 5))

plt.plot(ts_to_resample_min, color='orange')
plt.scatter(x=ts_to_resample.index, y=ts_to_resample['values'])

plt.show()
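To see why a 4h frequency is a mix of both techniques, here is a small sketch on a hypothetical series (`toy` is invented for illustration: hourly points followed by one point 8 hours later). It assumes that `resample(...).ffill()` places the series on the new 4h grid and gives each grid point the last value observed at or before it.

```python
import pandas as pd

# Hypothetical series: hourly points from 00:00 to 04:00, then one point at 12:00
toy = pd.Series(
    [0.0, 1.0, 2.0, 3.0, 4.0, 12.0],
    index=pd.to_datetime([
        "2017-01-01 00:00",
        "2017-01-01 01:00",
        "2017-01-01 02:00",
        "2017-01-01 03:00",
        "2017-01-01 04:00",
        "2017-01-01 12:00",
    ]),
)

# Resample to a regular 4h grid with a forward fill
every_4h = toy.resample("4h").ffill()
print(every_4h)
```

The hourly points at 01:00, 02:00 and 03:00 are dropped (undersampling), while a new point is created at 08:00 and filled with the value from 04:00 (oversampling).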

Now, we know everything we need to resample our data on a 10-min basis.

In [None]:
clean_data = reindexed_data.resample('10min')."CODE HERE"()

## Time series visualization

### Global view & analysis

In [None]:
plt.figure(figsize=(20, 7))

for col in clean_data.columns:
    plt.plot(clean_data[col], label=col)

plt.legend(loc='upper right')
plt.title('Sensor values')

plt.show()

### Sensor-level view & analysis

In [None]:
for col in clean_data.columns:
    
    plt.figure(figsize=(20, 7))
    plt.plot(clean_data[col])
    plt.title(col)
    
    plt.show()

In [None]:
clean_data.to_csv('../data/jena_climate_2009_2016_part_3.csv')

## See you in Part 3 ;)