# Time series forecast using ARIMA model

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List the input files available under the read-only /kaggle/input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/time-series-datasets/Electric_Production.csv')
df.head()

In [None]:
df.dtypes

In [None]:
df['DATE'] = pd.to_datetime(df['DATE'])
df.head()

In [None]:
# Use the DATE column as the index (this also removes it from the columns)
df = df.set_index('DATE')
df.head()

## Stationarity test using the Dickey-Fuller method

In [None]:
from statsmodels.tsa.stattools import adfuller
adf = adfuller(df.IPG2211A2N)
print('p-value:',adf[1])

The p-value is greater than 0.05, so the null hypothesis of a unit root cannot be rejected. The series is therefore treated as non-stationary, and differencing is needed.

## Integration - Differencing (d parameter)

In [None]:
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
# Plot the raw series and its autocorrelation side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].plot(df.IPG2211A2N)
plot_acf(df.IPG2211A2N, ax=axes[1]);

Without any differencing, the autocorrelation is high at the plotted lags, which indicates a considerable amount of non-stationarity.

## 1st order differencing

In [None]:
plt.plot(df.IPG2211A2N.diff().dropna())
plot_acf(df.IPG2211A2N.diff().dropna());

After 1st order differencing, the series looks closer to stationary: the autocorrelation graph alternates between positive and negative values at regular intervals instead of staying high.

## 2nd order differencing

In [None]:
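# dropna() removes the NaN values that diff() leaves at the start;
# without it the autocorrelations are undefined and the ACF plot is empty
plt.plot(df.IPG2211A2N.diff().diff().dropna())
plot_acf(df.IPG2211A2N.diff().diff().dropna());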

The first difference already looks close to stationary, and 2nd order differencing does not give a clearer picture and risks over-differencing. Hence, the d parameter is set to 1 for this prediction.
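
To make the effect of `diff()` concrete, here is a small worked example on toy series (an assumption for illustration, not the electricity data). It shows that one difference removes a linear trend, two differences remove a quadratic one, and each `diff()` leaves one leading NaN that should be dropped before plotting an ACF.

```python
import numpy as np
import pandas as pd

# Toy series (assumption: illustrative values only)
t = np.arange(6)
linear = pd.Series(3 * t + 2)   # 2, 5, 8, 11, 14, 17
quadratic = pd.Series(t ** 2)   # 0, 1, 4, 9, 16, 25

first_diff = linear.diff()             # one leading NaN
second_diff = quadratic.diff().diff()  # two leading NaNs

print(first_diff.tolist())             # [nan, 3.0, 3.0, 3.0, 3.0, 3.0]
print(second_diff.dropna().tolist())   # [2.0, 2.0, 2.0, 2.0]
print(second_diff.isna().sum())        # 2
```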

## Autoregression - p parameter

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf
plt.plot(df.IPG2211A2N)
plot_pacf(df.IPG2211A2N);

In [None]:
plt.plot(df.IPG2211A2N.diff().dropna())
plot_pacf(df.IPG2211A2N.diff().dropna());

About 14 lags stand out in the partial autocorrelation graph of the differenced series. If only the largest ones are considered, the 1st and 3rd lags are bigger than the others. Hence, p = 2 is taken as the initial choice.

## Moving average - q parameter

In [None]:
plt.plot(df.IPG2211A2N.diff().dropna())
plot_acf(df.IPG2211A2N.diff().dropna());

Around 24 lags are present in the autocorrelation graph of the differenced series. Among them, the 1st lag has the largest value. Hence, q = 1.
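
For reference, the bars in an autocorrelation graph are sample autocorrelations, which can be computed by hand. The sketch below assumes the usual biased estimator and the common white-noise band of ±1.96/√n; the function names are hypothetical, and the shaded band drawn by `plot_acf` may be computed differently.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation r_k for k = 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    acf = [1.0]
    for k in range(1, nlags + 1):
        # Covariance between the series and itself shifted by k steps
        acf.append(np.sum(centered[k:] * centered[:-k]) / denom)
    return np.array(acf)


def significance_bound(n):
    """Approximate 95% band for white noise: 1.96 / sqrt(n)."""
    return 1.96 / np.sqrt(n)


# Toy alternating series (assumption: illustrative only)
x = [1, -1, 1, -1, 1, -1, 1, -1]
r = sample_acf(x, nlags=2)
print(r)                           # [ 1.    -0.875  0.75 ]
print(significance_bound(len(x)))  # band half-width for n = 8
```

A lag is usually read as significant when its bar falls outside the band, which is the kind of comparison used above to pick q.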

## ARIMA Model

Fitting with p = 2 did not work with the ARIMA model here. Hence, p = 1 is used instead.

In [None]:
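# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13, so this
# notebook needs an older release. Newer releases provide
# statsmodels.tsa.arima.model.ARIMA, whose results API differs.
from statsmodels.tsa.arima_model import ARIMA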
arima = ARIMA(df.IPG2211A2N,order=(1,1,1))
model = arima.fit()
print(model.summary())

Based on the ARIMA summary, the p-values of the AR and MA coefficients are well below 0.05, so both terms are statistically significant. This supports using the model for prediction.

## Checking accuracy of the model

In [None]:
# Pass the axes explicitly so that the figure size is applied to this plot
fig, ax = plt.subplots(figsize=(10, 10))
model.plot_predict(ax=ax);

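From the plot, the model's predictions follow the actual values closely. These are in-sample predictions on the data the model was fitted to, so the match does not by itself measure forecast accuracy; the train/test split in the next section addresses that.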

# Model Development 

In [None]:
train_set = df[0:365]
test_set = df[365:]
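
The split above is chronological: the model is trained on the earliest rows and tested on the most recent ones. A minimal sketch of the same idea as a reusable helper follows. The function name `chronological_split` and the synthetic frame are assumptions; 397 rows are used so that holding out 32 rows leaves 365 for training, mirroring the split above.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, n_test):
    """Split a time-indexed frame so the last n_test rows form the test set."""
    if not 0 < n_test < len(frame):
        raise ValueError("n_test must be between 1 and len(frame) - 1")
    return frame.iloc[:-n_test], frame.iloc[-n_test:]


# Synthetic monthly frame (assumption: illustrative values only)
idx = pd.date_range("1985-01-01", periods=397, freq="MS")
demo = pd.DataFrame({"value": np.arange(397.0)}, index=idx)

train_demo, test_demo = chronological_split(demo, n_test=32)
print(len(train_demo), len(test_demo))  # 365 32
```

Shuffling rows before splitting, as is common for non-temporal data, would leak future values into the training set, so the order is kept.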

In [None]:
arima = ARIMA(train_set,order=(1,1,1))
model = arima.fit()
print(model.summary())

In [None]:
# Forecast one value per test observation; alpha=0.01 gives 99% confidence bands
fcast, se, confidencebands = model.forecast(len(test_set), alpha=0.01)

In [None]:
pred_set = pd.DataFrame(data=fcast,columns=['Value']) 
pred_set.index = test_set.index
pred_set.tail()

In [None]:
plt.figure(figsize=(20,10))
plt.plot(train_set,label='Training examples')
plt.plot(test_set,label='Original Values')
plt.plot(pred_set,label='Predicted forecast')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Electricity production')


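In the plot, the forecast stays close to the actual values of the test set. Hence, the model can be used to predict production for the upcoming years. This comparison is visual only; an error metric on the test set would quantify how close the forecast is.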
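
A visual comparison can be backed by numbers. The sketch below computes three common error metrics (MAE, RMSE, MAPE) with NumPy; the function name `forecast_errors` and the sample values are assumptions for illustration, not results from this notebook.

```python
import numpy as np


def forecast_errors(actual, predicted):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        # MAPE assumes no actual value is zero
        "mape": float(np.mean(np.abs(err / actual)) * 100),
    }


# Made-up values (assumption: illustrative only)
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 126.0]

metrics = forecast_errors(actual, predicted)
print(metrics["mae"])   # 2.75
print(metrics["rmse"])
print(metrics["mape"])
```

With the notebook's variables, the call would look like `forecast_errors(test_set['IPG2211A2N'], pred_set['Value'])`.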

In [None]:
model.plot_predict(1,500);

Based on the plot, the model forecasts a steady increase in electricity production in the upcoming years.