In this kernel I wanted to practice time series analysis with classic statistical tools and check whether they could help with anomaly detection.
To start, I simply studied whether an ARIMA model variant could predict an irregular univariate time series, and I noticed a few limits. Please feel free to point out any irregularities in the methods I've employed.

First, let's load the libraries we need

In [None]:
import os
import pandas as pd
import numpy as np
from matplotlib import rcParams
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX, SARIMAXResults
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_squared_error
os.getcwd()

In [None]:
os.chdir('..')
os.listdir()

In [None]:
os.chdir("input")
os.listdir()

In [None]:
#path = r"C:\Users\[..]\Desktop\pump-sensor-data"
#df = pd.read_csv(os.path.abspath(path + r"/sensor.csv"),index_col = "timestamp",parse_dates=["timestamp"])
df = pd.read_csv('sensor.csv',index_col = "timestamp",parse_dates=["timestamp"])
df.drop("Unnamed: 0",axis=1,inplace=True)
df.head()

In [None]:
df['machine_status'].unique()

In [None]:
df.info()

Let's drop "sensor_15" as it won't bring anything to our study

In [None]:
df.drop('sensor_15',axis=1,inplace=True)
df["machine_status"]=df.machine_status.astype("category")

We'll check that our categorical variable now has the "category" dtype

In [None]:
df.machine_status.dtype

In [None]:
df.describe()

In [None]:
df["sensor_00"].plot(figsize=(10,6))
plt.xticks(color="white")
plt.yticks(color="white")
plt.title("Sensor 0 time series", color="white")
plt.xlabel("Timestamp", color="white")

In [None]:
df[(df.index.month>=4) & (df.index.month<=6)]["sensor_00"].plot(figsize=(10,7))
plt.xticks(color="white")
plt.yticks(color="white")
plt.xlabel("Timestamp", color="white")
plt.title("Extract of the time series between April and June for sensor 00",color = "white")

In [None]:
df["machine_status"].cat.categories

In [None]:
df["machine_status_code"]=df["machine_status"].cat.codes

In [None]:
df["machine_status"].cat.codes.unique()

In [None]:
df["machine_status"].cat.codes.plot(figsize=(10,7))

In [None]:
df.loc[df["machine_status"]=="BROKEN",["machine_status","machine_status_code"]]

In [None]:
df.loc[df["machine_status"]=="RECOVERING",["machine_status","machine_status_code"]].head()

In [None]:
df.loc[df["machine_status"]=="NORMAL",["machine_status","machine_status_code"]].head()
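To read the status codes in the plots, it helps to know how `cat.codes` assigns integers. The sketch below uses a toy Series (not the real dataset) with the same three labels; `status` and `code_to_label` are names chosen just for this illustration.

```python
import pandas as pd

# Toy status column; the labels mirror the three states seen in the dataset.
status = pd.Series(["NORMAL", "BROKEN", "RECOVERING", "NORMAL"]).astype("category")

# Categories inferred by astype("category") are sorted lexically,
# and each integer code is the position of the label in that sorted list.
code_to_label = dict(enumerate(status.cat.categories))

print(code_to_label)
print(status.cat.codes.tolist())
```

Under this ordering, a drop of the code to 0 in the plots corresponds to a BROKEN reading.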

We'll now get an overview of all sensors to check whether there are any interesting patterns to look at

In [None]:
df[(df.index.month>=4) & (df.index.month<=5)].plot(figsize=(15,120), subplots=True)

I've also included "machine_status_code" in the plot to see whether any obvious patterns stand out in a given period.
We can see that some sensors could be grouped for a later multivariate time series study; for this notebook we'll focus on a univariate study with a classical statistical tool, ARIMA.

Let's grab a subset of our sensor data so my computer doesn't cry when I train my model on it: I took 2 months of data, from the start of April to the end of May

In [None]:
# Select part of our dataset to train our supervised model.
# .copy() gives an independent frame, so later column assignments don't act on a view of df.
df_train = df[(df.index.month>=4) & (df.index.month<=5)].copy()
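The month filter picks April and May of every year in the index, which is fine as long as the data covers a single year. A label-based slice on the `DatetimeIndex` is an alternative that pins the year explicitly. This is a minimal sketch on a made-up hourly frame (`toy`, `by_month` and `by_label` are hypothetical names), not on the sensor data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series spanning two years (not the real sensor data).
idx = pd.date_range("2018-01-01", "2019-12-31 23:00", freq="60min")
toy = pd.DataFrame({"sensor_00": np.arange(len(idx), dtype=float)}, index=idx)

# Month-based mask: keeps April and May of BOTH years.
by_month = toy[(toy.index.month >= 4) & (toy.index.month <= 5)]

# Partial-string slice: keeps only April and May 2018, both ends inclusive.
by_label = toy.loc["2018-04":"2018-05"]

print(by_month.index.year.unique().tolist())
print(by_label.index.min(), by_label.index.max())
```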

Let's do some imputation for missing values on "sensor 00" time series

In [None]:
# Back-fill missing values and assign the result back (avoids in-place edits on a column selection)
df_train["sensor_00"] = df_train["sensor_00"].bfill()

In [None]:
df_train["sensor_00"].isna().sum()
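Back-filling is only one option, and it copies future values into the past, which may matter when the goal is forecasting. The sketch below compares it with two alternatives on a tiny made-up series; the values and the names `filled_back`, `filled_fwd` and `filled_time` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny minute-level series with gaps in the middle and at the end.
s = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2018-04-01", periods=5, freq="min"),
)

filled_back = s.bfill()                      # copy the next valid value backwards
filled_fwd = s.ffill()                       # copy the last valid value forwards
filled_time = s.interpolate(method="time")   # linear in time between valid neighbours

print(pd.DataFrame({"bfill": filled_back, "ffill": filled_fwd, "time": filled_time}))
```

Note that `bfill` cannot fill a gap at the very end of the series, since there is no later value to copy from.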

Let's check whether any pattern shows up alongside the machine status

In [None]:
df_train[["sensor_00","machine_status_code"]].plot(figsize=(10,6),subplots=True)
plt.title("Sensor 00 and machine operating status", color="white")
plt.xticks(color="white",rotation=0)
plt.yticks(color="white")
plt.xlabel("Timestamp", color="white")

This time series seems fairly well correlated with the broken state of the machine and may be a good indicator of it; we'll check that in another notebook. For now the only goal is to manipulate and forecast the series to check the robustness of classical methods

We'll check for stationarity with the Augmented Dickey-Fuller (ADF) test and see whether the p-value is <= 0.05

In [None]:
# Check the stationarity of the time series with the ADF test
adfuller(df_train["sensor_00"],maxlag=50)

Seems good at first! But let's check it visually: I'll difference our time series and then look further

In [None]:
df_train["sensor_00"].shift(1).head()

In [None]:
sensor00_acf_plot = plot_acf((df_train["sensor_00"].shift(1)-df_train["sensor_00"]).dropna(), lags=50, title="ACF Sensor 00")

Not really convincing. We'll also draw the partial autocorrelation plot for sensor 00 as an exercise

In [None]:
sensor00_pacf_plot = plot_pacf(df_train["sensor_00"], lags=50, title="PACF Sensor 00")

In [None]:
type(df_train.index)

In [None]:
rcParams['figure.figsize'] = 11, 9
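# Recent statsmodels versions take the cycle length via `period` (the older `freq` argument was deprecated)
decomposed_sensor00 = sm.tsa.seasonal_decompose(df_train["sensor_00"], period=360)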
figure = decomposed_sensor00.plot()

We'll need to remove seasonality; to do so, we'll test which values work best for our model

In [None]:
#resDiff = sm.tsa.arma_order_select_ic(df_train["sensor_00"], max_ar=10, max_ma=10, ic='aic', trend='c')
#print('ARMA(p,q) = ',resDiff["aic_min_order"],' is the best!')

It took forever to run on my computer, so I've left it commented out in this kernel. The best result was (9,9) for the AR and MA orders, and the autocorrelation plot had already given us some intuition for this result

We'll use a value of 1 for the differencing parameter d. I followed the guidelines [here](http://people.duke.edu/~rnau/arimrule.htm), which warn about overdifferencing when the lag-1 value in the ACF plot is -0.5 or more negative.
After one round of differencing, the lag-1 ACF was already about -0.2, and going further made it dive into negative values beyond the guideline's threshold, so we'll keep d=1.
This gives us an ARIMA(9,1,9)
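The overdifferencing rule can be checked numerically rather than by eye. Below is a minimal sketch on a simulated random walk (not the sensor series): one difference should leave a lag-1 autocorrelation near 0, while a second difference should push it towards -0.5. `lag1_acf` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
walk = pd.Series(np.cumsum(rng.normal(size=5000)))  # simulated random walk

def lag1_acf(series, d):
    """Difference the series d times and return its lag-1 autocorrelation."""
    out = series
    for _ in range(d):
        out = out.diff()
    return out.dropna().autocorr(lag=1)

r1 = lag1_acf(walk, 1)   # differenced once: close to white noise
r2 = lag1_acf(walk, 2)   # differenced twice: overdifferenced

print(f"lag-1 ACF after d=1: {r1:.3f}")
print(f"lag-1 ACF after d=2: {r2:.3f}")
```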

In [None]:
model = SARIMAX(df_train["sensor_00"], order=(9,1,9))
results = model.fit()
results.summary()

In [None]:
results_plot = results.plot_diagnostics(figsize=(15,12))

The diagnostic plots above let us check whether:
- the KDE of the residuals follows a N(0,1) distribution (not really the case here...)
- the Q-Q plot of our residuals follows the straight line expected for a sample drawn from a standard normal distribution
- our correlogram shows no significant autocorrelation left in the residuals (seems good!)

In [None]:
df_train.index[0], df_train.index[-1]

In [None]:
tr_start, tr_end = '2018-04-01 00:00:00','2018-05-31 23:59:00'
tr_pred = '2018-06-10 00:00:00'
steps_to_predict = 5

In [None]:
forecast = results.forecast(steps_to_predict)

In [None]:
df_train["prediction"] = results.predict(start=70640,end=87840, dynamic=True)
df_train[["sensor_00","prediction"]].plot(figsize=(12,8))

We see how limited ARIMA is for time series with large, irregular swings

In [None]:
df_train["prediction"] = results.predict(start=73640,end=87840, dynamic=True)
df_train[["sensor_00","prediction"]].plot(figsize=(12,8))

As we can see, the linear prediction depends on the chosen forecast start time => very little reliability if we want to do anomaly detection on a single time series.

In conclusion:
- ARIMA seems good for regular patterns that are noticeable in the time series itself, but not for irregular time series like the one used here
- Our prediction depends on the forecast start time. Even though we predicted on a portion of the data we trained on, where the model should overfit and the prediction should match our data, that's not the case: a mistake in the process, or a misfit of the model for our data?
- ARIMA is a pain to get working, and many hypotheses must hold to validate the model (normality condition on the data distribution => transformations)
- It would be interesting to compare with recent machine learning techniques, to set a linear approach against a non-linear one

In a future notebook I'd like to run the same study on another time series from the same dataset with a NN and an RNN to compare forecasting performance

As I'm a beginner, there may be many flaws in the process I've followed in this notebook. I'd be glad to get feedback from anyone who takes the time to point out unclear points. Thanks in advance!