# Time Series - Maker workshop

## Quick round table

Presentation & expectations?

## Definition

- **Time series data is data collected at successive points in time.** In contrast, cross-sectional data observes individuals, companies, etc. at a single point in time.


- If you previously followed the *Maker workshop dedicated to Machine Learning*, you've already worked with cross-sectional data, but not time series.


- Time series can be found in a wide variety of domains: economics, social sciences, medicine, and (obviously) physical sciences and engineering. As a result, **we deal with them a lot at Total!**
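
To make the distinction concrete, here is a minimal sketch with made-up values (the station names and temperatures are purely illustrative and are not taken from the workshop data):

```python
import pandas as pd

# Cross-sectional: several entities observed at a single point in time.
cross_sectional = pd.DataFrame(
    {"station": ["A", "B", "C"], "temperature_degC": [4.1, 5.3, 3.8]}
)

# Time series: one entity observed at successive points in time.
time_series = pd.Series(
    [4.1, 4.0, 3.7],
    index=pd.date_range("2009-01-01 00:10", periods=3, freq="10min"),
    name="temperature_degC",
)

print(cross_sectional)
print(time_series)
```

In the time series, the order of the rows and the spacing between timestamps carry information; in the cross-sectional table, rows can be shuffled without losing anything.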

## Outline

1. Today's challenge
2. Today's Data Science environment checklist
3. Exploring the data 
    - Types, indexes and unique values
    - Distributions
    - Correlations
4. Dealing with missing values
5. Resampling techniques
6. Time series visualization
7. Anomaly detection techniques
8. Forecasting
9. Open discussion / work session

## Today's Challenge

**Predict the air temperature in 2017 based on weather data from 2009 to 2016.**

- Features available:
    - Air temperature
    - Atmospheric pressure
    - Humidity
    - Wind direction
    - Etc.

## Today's Data Science environment checklist

- A Jupyter notebook
- The data folder (the one that we sent)
- The following libraries installed:

In [None]:
! make -f ../setup/Makefile

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from fbprophet import Prophet

# Optional
%config InlineBackend.figure_format = 'retina'

In [2]:
# Uncomment this if you don't have the data
# !wget https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
# !unzip jena_climate_2009_2016.csv.zip

# Used for data preparation

#raw_data_bis = pd.read_csv('../data/jena_climate_2009_2016.csv')
#raw_data_bis['open_st'] = 1.0
#
#df = raw_data_bis[['VPmax (mbar)', 'VPact (mbar)']].copy()
#import random
#ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
#for row, col in random.sample(ix, int(round(.01*len(ix)))):
#    df.iat[row, col] = np.nan
#
#raw_data_bis[['VPmax (mbar)', 'VPact (mbar)']] = df.copy()
#
#raw_data_bis.to_csv('../data/jena_climate_2009_2016.csv', index=False)

## Exploring the data

### Reading the raw data

- `head -n 10` is a useful shell command to take a look at the beginning of a file (the first 10 lines in this case)
- In a Jupyter notebook, we can use the symbol `!` to run shell commands
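
If `head` is not available in your shell (for example on some Windows setups), a rough Python equivalent could look like the sketch below. The `head` helper and the temporary sample file are hypothetical and only serve the illustration:

```python
import itertools
import os
import tempfile


def head(path, n=10):
    """Return the first n lines of a text file without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Demonstration on a small throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("a,b\n1,2\n3,4\n5,6\n")
    first_lines = head(sample_path, n=2)

print(first_lines)
```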

_What are the useful details that you can see thanks to this command?_

In [3]:
!head -n 10 ../data/jena_climate_2009_2016.csv

Date Time,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg),open_st
01.01.2009 00:10:00,996.52,-8.02,265.4,-8.9,93.3,3.33,3.11,0.22,1.94,3.12,1307.75,1.03,1.75,152.3,1.0
01.01.2009 00:20:00,996.57,-8.41,265.01,-9.28,93.4,3.23,3.02,0.21,1.89,3.03,1309.8,0.72,1.5,136.1,1.0
01.01.2009 00:30:00,996.53,-8.51,264.91,-9.31,93.9,3.21,3.01,0.2,1.88,3.02,1310.24,0.19,0.63,171.6,1.0
01.01.2009 00:40:00,996.51,-8.31,265.12,-9.07,94.2,3.26,3.07,0.19,1.92,3.08,1309.19,0.34,0.5,198.0,1.0
01.01.2009 00:50:00,996.51,-8.27,265.15,-9.04,94.1,3.27,3.08,0.19,1.92,3.09,1309.0,0.32,0.63,214.3,1.0
01.01.2009 01:00:00,996.5,-8.05,265.38,-8.78,94.4,3.33,3.14,0.19,1.96,3.15,1307.86,0.21,0.63,192.7,1.0
01.01.2009 01:10:00,996.5,-7.62,265.81,-8.3,94.8,3.44,3.26,0.18,2.04,3.27,1305.68,0.18,0.63,166.5,1.0
01.01.2009 01:20:00,996.5,-7.62,265.81,-8.36,94.4,3.44,3.25,0.19,2.03,3.26,1305.69,0.19,0.5,118.6,1.

Now that we have a better idea of the file's format, we can implement our reading function:

In [None]:
raw_data = pd.read_csv('../data/jena_climate_2009_2016.csv', sep="CODE HERE")
raw_data.head()

### Data types

- Checking for data types is useful to make sure that types were properly inferred when reading the raw CSV file
- If you've already explored the data, you can specify the undetected types in the `pandas.read_csv` function
- Tip: Casting to smaller float types can help you tremendously reduce the size of a dataset
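
As a rough illustration of both points, the sketch below uses a synthetic `pressure` column (an assumption, not the workshop data) to compare memory usage before and after casting, and shows how a type can be passed to `pandas.read_csv` through its `dtype` parameter:

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for a sensor column (not the workshop data).
df = pd.DataFrame({"pressure": np.linspace(990.0, 1010.0, 1000)})

# Memory footprint of the values only (index excluded), in bytes.
bytes_float64 = df["pressure"].memory_usage(index=False)
bytes_float32 = df["pressure"].astype("float32").memory_usage(index=False)
print(bytes_float64, bytes_float32)

# The type can also be requested directly when reading a CSV.
csv_text = "pressure\n996.52\n996.57\n"
typed = pd.read_csv(io.StringIO(csv_text), dtype={"pressure": "float32"})
print(typed.dtypes)
```

Keep in mind that `float32` stores fewer significant digits than `float64`, so this trade-off is only worth it when the lost precision does not matter for your analysis.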

_Comment on the following dtypes. Do you think the proper types were inferred?_

In [None]:
raw_data.dtypes

### Indexing

- When dealing with time series, we'll see that it can be useful to make the most out of pandas' `DatetimeIndex`, i.e. to set a `Datetime` column as index of the dataframe.

_Let's verify if the Datetime type was correctly inferred from the CSV file._

In [None]:
type(raw_data['Date Time'][0])
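
One possible way to build a `DatetimeIndex` is sketched below on a tiny hand-written frame. It assumes the timestamps follow a day.month.year layout, which you should confirm on your own data before reusing the format string:

```python
import pandas as pd

# Tiny frame mimicking the layout of the raw file.
df = pd.DataFrame(
    {
        "Date Time": [
            "01.01.2009 00:10:00",
            "01.01.2009 00:20:00",
            "01.01.2009 00:30:00",
        ],
        "T (degC)": [-8.02, -8.41, -8.51],
    }
)

# Assumption: day.month.year followed by hours:minutes:seconds.
df["Date Time"] = pd.to_datetime(df["Date Time"], format="%d.%m.%Y %H:%M:%S")
df = df.set_index("Date Time")

# With a DatetimeIndex, rows can be selected with date-like labels.
from_twenty_past = df.loc["2009-01-01 00:20":]
print(from_twenty_past)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates and is usually faster than letting pandas guess.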

### Checking duplicated rows

- Before continuing with the data manipulation, we should check for duplicated rows in the data and get rid of them.

_What is the percentage of duplicated rows among the complete dataset?_

In [None]:
percentage = raw_data[raw_data.duplicated(subset=["CODE HERE"])].shape[0] / raw_data.shape[0] * 100
# `percentage` is already expressed in percent, so it must not be multiplied by 100 again
print(f'Among the complete data, {round(percentage, 2)}% are duplicated rows.')

In [None]:
raw_data.drop_duplicates(subset='Date Time', inplace=True)
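
The same logic can be checked on a toy frame where the answer is known in advance. The values below are made up: one timestamp out of four is repeated, so a 25% duplicate share is expected:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date Time": [
            "01.01.2009 00:10:00",
            "01.01.2009 00:20:00",
            "01.01.2009 00:20:00",  # repeated timestamp
            "01.01.2009 00:30:00",
        ],
        "T (degC)": [-8.02, -8.41, -8.41, -8.51],
    }
)

# Boolean mask: True for every repeat of an already-seen timestamp.
is_duplicate = df.duplicated(subset=["Date Time"])
share = is_duplicate.mean() * 100

# Keep the first occurrence of each timestamp.
deduplicated = df.drop_duplicates(subset="Date Time")

print(f"{share:.2f}% duplicated rows, {len(deduplicated)} rows kept")
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison, which can be handy while exploring.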

### Unique values

- Checking for unique values will give you information on your variables' granularity:
    - A small number of unique values can indicate the presence of a category
    - A single unique value may indicate that a variable is never changing, even out of your sample
   
_Do you notice any of these two cases in your dataset?_

In [None]:
for col in raw_data.columns: 
    print(col, ' '*(20-len(col))+'----->', len(raw_data[col].unique()))
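
As an alternative sketch, `DataFrame.nunique()` returns the same kind of counts in a single call (note that, unlike `unique()`, it ignores missing values by default). The column names below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "reading": [152.3, 136.1, 171.6, 198.0],              # continuous measurement
        "season": ["winter", "winter", "spring", "spring"],  # category-like
        "status": [1.0, 1.0, 1.0, 1.0],                       # never changes
    }
)

unique_counts = df.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()

print(unique_counts)
print("Constant columns:", constant_columns)
```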

### Distributions

- With the `seaborn` library, we can easily plot the distributions and the relationship between each pair of sensors

_Let's take a look at the following graph: from your functional knowledge of the sensors, can you identify normal or abnormal patterns?_

In [None]:
SELECTED_COLUMNS = ['p (mbar)', 'T (degC)', 'H2OC (mmol/mol)', 'sh (g/kg)', 'wd (deg)']

In [None]:
sns."CODE HERE"(raw_data[SELECTED_COLUMNS])

plt.show()

### Correlations

- Correlation analysis is a statistical method used to **evaluate the strength of the relationship between two quantitative variables**.


- A high correlation (close to 1 or -1) means that the two variables have a strong relationship with each other.
- A weak correlation means that the variables are hardly related.

_Let's continue our analysis by plotting the correlation matrix. Do you notice anything?_

In [None]:
def print_correlation(df):
    corr = df.corr()
    
    plt.figure(figsize=(8, 8))
    
    ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0,
                     cmap=sns.diverging_palette(20, 220, n=200),
                     square=True, annot=True)
    
    ax.set_xticklabels(ax.get_xticklabels(),
                       rotation=45,
                       horizontalalignment='right')
    
    plt.show()

In [None]:
print_correlation(raw_data[SELECTED_COLUMNS])

Here, we use a custom function to get a styled correlation matrix. However, you could simply call the pandas method `your_dataframe_name.corr()`.
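
For reference, here is a minimal sketch of the plain pandas call on made-up columns (`x`, `y` and `z` are illustrative names). By default, `corr()` computes the Pearson correlation, which measures linear relationships:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],  # increases linearly with x
        "z": [8.0, 6.0, 4.0, 2.0],  # decreases linearly with x
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr)
```

Values close to 1 or -1 indicate a strong positive or negative linear relationship, while values close to 0 indicate a weak one.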

In [None]:
raw_data.to_csv('../data/jena_climate_2009_2016_part_2.csv', index=False)

## See you in Part 2 ;)