# Overview

In this lab you’ll use the SARIMA model that you built in the previous lab to predict and forecast values in the time series (we’ll explain the difference between predictions and forecasts shortly).

You’ll also use various diagnostic information available to verify the goodness-of-fit of the SARIMA model.

# Roadmap
There are 3 exercises in this lab, of which the last exercise is "if time permits". Here is a brief summary of the tasks you will perform in each exercise; more detailed instructions follow later:
1.	Plotting diagnostic information about how well the model fits the data
2.	Using the model to predict values
3.	(If time permits) Using the model to forecast future values


# Global Settings

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from statsmodels.tsa.statespace.sarimax import SARIMAX

input_file = './Data/CO2.csv'

# Exercise 1:  Plotting diagnostic info about how well the model fits the data

The MLEResults object returned by the fit() function has a very handy method named plot_diagnostics(), which plots various graphs that enable you to determine how well the model fits the data. Call the plot_diagnostics() method as follows:

    results.plot_diagnostics()
    plt.show()


In [None]:
# PLACE YOUR SOLUTION HERE



Here’s an explanation of the 4 graphs produced by plot_diagnostics():
-	The upper-left graph shows the standardized residual values. The term “residual” means the difference between the actual value (in the time series data) and the value predicted by the model. There is always a difference between actual and predicted values; this is unavoidable, even if your model is great! What you’re looking for is the absence of any systematic pattern (e.g. residual values that increase over time or show a cyclical pattern). In our case, there is no pattern to the residual data; the “errors” seem evenly distributed. Sometimes the residual value is positive, sometimes it’s negative. There isn’t any systematic discrepancy between actual and predicted values, which suggests the model is good!
-	The upper-right graph shows 3 plots:
    -	An N(0,1) plot, which is the standard normal distribution (i.e. a mean of 0 and a standard deviation of 1).
    -	A KDE (Kernel Density Estimation) plot, which shows the density of the standardized residuals. The KDE plot is very close to the N(0,1) plot, which means the standardized residuals are approximately normally distributed and centred on zero (i.e. no systematic bias), which suggests the model is good!
    -	A histogram that shows the same info as the KDE plot, but as a histogram rather than as a curve.
-	The lower-left graph shows a Normal Q-Q (Quantile-Quantile) plot. The red line shows what the residuals would look like if they were perfectly normally distributed. The blue dots show the residuals yielded by our model. As you can see, the blue dots are very close to the red line, which means the residuals are approximately normally distributed, which suggests the model is good!
-	The lower-right graph shows a Correlogram (autocorrelation) plot for the residuals. The plot indicates there is no significant correlation between residual values at different lags, i.e. our model doesn’t leave any systematic relationship between residuals, which suggests the model is good!

These graphs are incredibly important. It’s vital that you know whether your model is a good or bad fit for the data, so that you can have confidence in using the model to predict and forecast future values.
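
To make the quantities behind the four diagnostic graphs more concrete, here is a minimal sketch that computes them by hand using only the Python standard library. The data is hypothetical (a linear trend plus random noise, with a "model" that predicts the trend exactly), and the helper names `standardize`, `autocorrelation` and `qq_pairs` are invented for this illustration. The standardization shown here is a simplification; it is not necessarily the exact calculation that statsmodels performs internally.

```python
import math
import random
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def autocorrelation(values, lag):
    """Sample autocorrelation of values at the given lag (lag 0 gives 1.0)."""
    mean = statistics.fmean(values)
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum((values[i] - mean) * (values[i - lag] - mean)
                    for i in range(lag, len(values)))
    return numerator / denominator


def qq_pairs(values):
    """Pair each sorted value with the matching N(0,1) theoretical quantile."""
    n = len(values)
    normal = statistics.NormalDist(mu=0.0, sigma=1.0)
    theoretical = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))


# Hypothetical data (NOT the CO2 series): trend plus noise, and a model
# that happens to predict the trend exactly.
random.seed(0)
actual = [350 + 0.1 * t + random.gauss(0, 0.3) for t in range(200)]
predicted = [350 + 0.1 * t for t in range(200)]

# Residual = actual value minus predicted value (upper-left graph).
residuals = [a - p for a, p in zip(actual, predicted)]
std_resid = standardize(residuals)

# Correlogram (lower-right graph): for independent residuals, the
# autocorrelations should mostly stay within roughly +/- 1.96 / sqrt(n).
bound = 1.96 / math.sqrt(len(std_resid))
for lag in range(1, 6):
    r = autocorrelation(std_resid, lag)
    print(f"lag {lag}: autocorrelation = {r:+.3f} (approx. bound {bound:.3f})")

# Q-Q plot (lower-left graph): observed quantiles should track the
# theoretical N(0,1) quantiles if the residuals are close to normal.
for theoretical, observed in qq_pairs(std_resid)[::40]:
    print(f"theoretical {theoretical:+.2f}  observed {observed:+.2f}")
```

If the residuals really are independent and roughly normal, the autocorrelations should be small and the observed quantiles should stay close to the theoretical ones, which is what the correlogram and the Q-Q plot show graphically.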
 


# Exercise 2:  Using the model to predict values


In the previous exercise you verified that the model is a good fit for the data, so it’s reasonable to use the model to predict and forecast values. The MLEResults object has two methods for this purpose:
-	get_prediction() predicts values within the date/time range of the time series. You can then compare these predicted values with the actual values in the time series, to see if the model has done a good job of modelling reality. You’ll see how to do all this in this exercise.
-	get_forecast() forecasts values outside the date/time range of the time series. You’ll see how to do that in the next exercise.

Let’s see how to use get_prediction() to predict values within the date/time range of the time series. Full documentation for the get_prediction() method is available here:

https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.mlemodel.MLEResults.get_prediction.html

Let’s say you want to predict CO2 levels for 1997 onwards. You can achieve this as follows:

    prediction = results.get_prediction(start='1997-01-01')

The method returns a prediction object. The prediction object has a conf_int() method that gives you the lower and upper bounds of the confidence interval for the predicted CO2 value in each period (a 95% interval by default). This info is expressed as a DataFrame indexed by date/time, with 2 columns: the lower predicted value and the upper predicted value. Try out the following code to see what it looks like:

    prediction_ci = prediction.conf_int()
    print('\nRange of predicted values\n', prediction_ci)
    print('\nLower predicted values\n',    prediction_ci.iloc[:, 0])
    print('\nUpper predicted values\n',    prediction_ci.iloc[:, 1])
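
One quick sanity check on a confidence interval is to count how often the actual values fall inside it; for a 95% interval you would expect roughly 95% of them to do so over a long enough period. The following is a minimal sketch using plain Python lists with made-up numbers (they are not taken from the CO2 data), and the `coverage` helper is a name invented for this illustration.

```python
def coverage(actual, lower, upper):
    """Fraction of actual values that lie within [lower, upper]."""
    inside = [lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)]
    return sum(inside) / len(inside)


# Hypothetical values standing in for the actual series and the two
# columns returned by conf_int().
actual = [363.0, 364.0, 364.5, 366.0, 367.2]
lower = [362.4, 363.1, 364.0, 364.9, 365.8]
upper = [363.8, 364.5, 365.4, 366.3, 367.0]

print(f"Coverage: {coverage(actual, lower, upper):.0%}")  # Coverage: 80%
```

With only five points the figure is not meaningful in itself; the point is the mechanics of comparing actual values with the lower and upper bounds.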

The prediction object also has a predicted_mean property that gives you the mean predicted CO2 value for each period. Try out the following code:

    prediction_mean = prediction.predicted_mean         
    print('\nMean predicted values\n', prediction_mean)
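
To put a number on how close the mean predicted values are to the actual values, one common choice is the root mean squared error (RMSE). Here is a minimal sketch using plain Python lists with made-up numbers; the `rmse` helper is a name invented for this illustration, not part of statsmodels.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values standing in for the actual series and predicted_mean.
actual = [363.0, 364.0, 364.5, 366.0]
predicted = [363.5, 363.5, 365.0, 365.5]

error = rmse(actual, predicted)
print(f"RMSE: {error:.3f}")  # RMSE: 0.500
```

The RMSE is in the same units as the data, so a value that is small relative to the typical CO2 level indicates predictions that track the actual values closely. Assuming the actual values and predicted_mean are pandas Series that share the same index, the same quantity could be computed as `((prediction_mean - actual) ** 2).mean() ** 0.5`.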
 
As well as seeing predicted values printed on the console, it’s also very useful to visualize the predicted values on a plot. You’ll do that now…

First, add the following code to create a plot for the actual values, e.g. for 1990 onwards:

    ts1990onwards = ts['1990' : ]
    axes = ts1990onwards.plot(label='Actual values')
    axes.set_xlabel('Date')
    axes.set_ylabel('CO2')

Note that the plot() function returns a Matplotlib Axes object, which represents the axes upon which the current plot (i.e. the actual values) is drawn. You can use this Axes object to superimpose additional plots; for example, the following code superimposes the predicted mean values on the same axes (the predicted values will be drawn in red here, with an opacity of 75%):

    prediction_mean.plot(ax=axes,
                         label='Mean predicted values', 
                         color='red',
                         alpha=0.75)
                         
It’s also informative to plot the range of values between the lower and upper bounds of the 95% confidence interval. You can use the fill_between() function to do this, as shown below. The 1st parameter specifies the x values, and the 2nd and 3rd parameters specify the lower and upper y values between which to fill (in fairly transparent black, in this example):

    axes.fill_between(prediction_ci.index,
                      prediction_ci.iloc[:, 0],
                      prediction_ci.iloc[:, 1],
                      color='black', 
                      alpha=0.25)

All that remains is to show these plots, along with a legend:

    plt.legend()
    plt.show()


In [None]:
# PLACE YOUR SOLUTION HERE


# Exercise 3 (If time permits):  Using the model to forecast future values

In the previous exercise you used the get_prediction() function to predict values within the date/time range of the time series. 

In this exercise you’ll use the get_forecast() function to forecast values outside the date/time range of the time series, i.e. to forecast future values. get_forecast() is very similar to get_prediction(): it gives you lower and upper forecast values for a 95% confidence interval, and also a predicted mean value. Full details about the function are available here:

https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.mlemodel.MLEResults.get_forecast.html

Add code to your script to forecast CO2 values for 200 months beyond the end of the current time series. When you’re done, plot the actual values, the mean forecast values and the 95% confidence interval, using the same approach as in the previous exercise. Note that the confidence interval widens as time goes on, which is to be expected: the further into the future you forecast, the less certain you can be about the forecast.
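
To see why the interval widens, it can help to look at a much simpler model than SARIMA. The sketch below assumes a plain random walk (each value equals the previous value plus independent noise with standard deviation sigma), for which the forecast error variance after h steps is h × sigma², so the 95% interval grows in proportion to the square root of h. The numbers and the `random_walk_interval` helper are hypothetical and are not derived from the CO2 model.

```python
import math

Z_95 = 1.96  # approximate two-sided 95% quantile of the standard normal


def random_walk_interval(last_value, sigma, steps_ahead):
    """95% forecast interval for a random walk, steps_ahead periods out."""
    half_width = Z_95 * sigma * math.sqrt(steps_ahead)
    return last_value - half_width, last_value + half_width


last_value = 370.0  # hypothetical last observed value
sigma = 0.3         # hypothetical per-step noise standard deviation

for h in (1, 12, 50, 200):
    lower, upper = random_walk_interval(last_value, sigma, h)
    print(f"{h:>3} steps ahead: [{lower:.2f}, {upper:.2f}] "
          f"(width {upper - lower:.2f})")
```

A SARIMA model has a more involved forecast variance, but the same intuition applies: uncertainty from each future step accumulates, so the interval for month 200 is much wider than the interval for month 1.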


In [None]:
# PLACE YOUR SOLUTION HERE
