# Sunspots forecasting with SARIMA

## Source of the data 
* The data comes from http://www.sidc.be/
* A full description can be found here http://www.sidc.be/silso/infosndtot

**The dataset contains the daily total sunspot number since 1818.**

In [None]:
# Import libraries
%matplotlib inline 
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import statsmodels.api as sm
# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
plt.rcParams['figure.figsize'] = [16, 9]
from statsmodels.tsa import stattools
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from timeit import default_timer as timers

## Description of the data 
The columns are:
* Year
* Month
* Day
* Fraction of the year
* Total sunspot number
* Standard deviation
* Number of observations
* Indicator: whether the value is definitive (1) or still provisional (0).

**A value of -1 stands for a missing value (NaN).**

The daily total sunspot number is derived by the formula R = Ns + 10 * Ng, where Ns is the number of spots and Ng the number of groups counted over the entire solar disk.
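
As a quick illustration of the formula, the sketch below computes R for a made-up observation. The function name `total_sunspot_number` and the counts are hypothetical and are not taken from the dataset.

```python
def total_sunspot_number(n_spots: int, n_groups: int) -> int:
    """Daily total sunspot number R = Ns + 10 * Ng."""
    return n_spots + 10 * n_groups

# Hypothetical observation: 5 individual spots spread over 2 groups
r = total_sunspot_number(n_spots=5, n_groups=2)
print(r)  # 5 + 10 * 2 = 25
```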

In [None]:
# Load the data; -1 marks a missing value in the raw file
path = "/kaggle/input/daily-sun-spot-data-1818-to-2019/"
filename = os.path.join(path, "sunspot_data.csv")
df = pd.read_csv(filename, delimiter=',', na_values=['-1'])
df.dataframeName = 'sunspot_data.csv'
# Drop the unnamed index column stored in the CSV file
del df['Unnamed: 0']
df.columns = ['year', 'month', 'day', 'fraction', 'sunspots', 'sdt', 'obs', 'indicator']
df.head(-5)

# Add the time column, built from the year, month and day columns
# (pd.datetime was removed from recent pandas versions)
df['time'] = pd.to_datetime(df[['year', 'month', 'day']])
# Use the time column as the index of the dataframe
df.index = df['time']
# Replace the NaN values by linear interpolation
df['sunspots'] = df['sunspots'].interpolate(method='linear')
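
The missing-value handling can be checked on a toy series. This is a minimal sketch with made-up values, not rows from the sunspot file: -1 is turned into NaN, then the gaps are filled by linear interpolation.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values): -1 marks a missing day, as in the raw file
raw = pd.Series([10.0, -1.0, -1.0, 40.0, 50.0])

# Same two steps as in the loading code: -1 -> NaN, then linear interpolation
clean = raw.replace(-1.0, np.nan).interpolate(method='linear')
print(clean.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0]
```

Linear interpolation is reasonable for short gaps; for long runs of missing days it simply draws a straight line between the two nearest observations.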

## Time series creation
**The data are resampled to monthly and quarterly means**

In [None]:
ts = pd.Series(data=df.sunspots, index=df.index)
#ts = ts['1900-01-01':]
ts_month = ts.resample('MS').mean()
ts_quarter = ts.resample('Q').mean()
ts_quarter.plot()
plt.show()
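
To see what the resampling does, here is a small sketch on hypothetical daily values (not the sunspot data). With `'MS'`, each bin is labelled with the first day of its month and `.mean()` averages the days inside it.

```python
import pandas as pd

# Hypothetical daily values covering January and February 2020
idx = pd.date_range('2020-01-01', '2020-02-29', freq='D')
daily = pd.Series(range(len(idx)), index=idx, dtype=float)

# One value per month, labelled with the month start
monthly = daily.resample('MS').mean()
print(monthly)
# January holds the values 0..30 (mean 15.0), February 31..59 (mean 45.0)
```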

# SARIMA - Seasonal Autoregressive Integrated Moving Average
SARIMA extends ARIMA with seasonal autoregressive, differencing and moving-average terms.

To configure a SARIMA(p,d,q)(P,D,Q)S model, two kinds of hyperparameters have to be set:

#### Trend parameters
* p: The number of lag observations included in the model, also called the lag order.
* d: The number of times that the raw observations are differenced, also called the degree of differencing. (The purpose of differencing is to make the time series stationary.)
* q: The size of the moving average window, also called the order of the moving average.

#### Seasonal parameters
* P: Seasonal autoregressive order.
* D: Seasonal order of differencing.
* Q: Seasonal moving average order.
* S: The number of time steps for a single seasonal period.
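
The effect of d and D can be illustrated on a synthetic series. The sketch below assumes a made-up season length of 4 and a made-up seasonal pattern; it only shows what one ordinary difference (d=1) and one seasonal difference (D=1) compute.

```python
import numpy as np

S = 4                                      # hypothetical season length
t = np.arange(16)
season = np.array([0.0, 5.0, 1.0, -3.0])   # hypothetical seasonal pattern
y = 2.0 * t + season[t % S]                # linear trend + repeating pattern

first_diff = y[1:] - y[:-1]                # d=1: y_t - y_{t-1}
seasonal_diff = y[S:] - y[:-S]             # D=1: y_t - y_{t-S}

print(first_diff[:4])   # the seasonal pattern is still visible
print(seasonal_diff)    # constant 2 * S = 8.0: the pattern has cancelled out
```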


## Rules
These parameters are selected by inspecting the ACF and PACF plots.

*(Personal point of view: picking good parameters this way is not really straightforward.)*

##### How to find the order of differencing (d) ?
> Apply differencing until the stationarity test passes.
##### How to find the order of the AR term (p) ?
> With the Partial Autocorrelation (PACF) plot: p = last lag before the value first falls inside the significance band.
##### How to find the order of the MA term (q) ?
> With the Autocorrelation (ACF) plot: q = last lag before the value first falls inside the significance band.
##### How to find the seasonality (S) ?
> Directly from the observed data or domain knowledge, OR with the Autocorrelation (ACF) plot: S is the lag of the first pronounced peak after lag 0.
##### How to find the order of seasonal differencing (D) ?
> D=1 if the series has a stable seasonal pattern over time, else D=0, AND d+D≤2
##### How to find the order of the seasonal AR term (P) ?
> P≥1 if the ACF is positive at lag S, else P=0, AND P+Q≤2
##### How to find the order of the seasonal MA term (Q) ?
> Q≥1 if the ACF is negative at lag S, else Q=0, AND P+Q≤2
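
One possible way to turn the "significance level" rule into code is sketched below. It is an interpretation, not the method used by `plot_acf`: the helpers `sample_acf` and `cutoff_lag` are hypothetical, the ACF is computed directly with NumPy, and the band is the simple white-noise approximation ±1.96/√n.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(nlags + 1)])

def cutoff_lag(corr, n_obs):
    """Last lag before the correlation first falls inside +/- 1.96/sqrt(n_obs)."""
    band = 1.96 / np.sqrt(n_obs)
    for lag in range(1, len(corr)):
        if abs(corr[lag]) <= band:
            return lag - 1
    return len(corr) - 1

# Hand-made correlation values (hypothetical), for a series of 100 points
corr = np.array([1.0, 0.6, 0.3, 0.01, 0.2])
print(cutoff_lag(corr, n_obs=100))  # 2: lag 3 is the first one inside the band

# Simulated MA(1) series, a case where the ACF is expected to cut off early
rng = np.random.default_rng(0)
e = rng.normal(size=2001)
x = e[1:] + 0.8 * e[:-1]
print(cutoff_lag(sample_acf(x, nlags=20), n_obs=len(x)))
```

The value read this way is only a starting point; comparing a few neighbouring orders by AIC is usually still needed.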


## Apply these rules to the sunspot data
First, plot the PACF and the ACF.

In [None]:
plot_pacf(ts_quarter,lags=100,title='Sunspots')
plt.show()

In [None]:
plot_acf(ts_quarter,lags=100,title='Sunspots')
plt.show()

### Find the order of differencing (d) ?
The ADF stationarity test is already significant on the undifferenced series, so no differencing is needed: **d=0**

In [None]:
from statsmodels.tsa.stattools import adfuller
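def printADFTest(serie):
    # Augmented Dickey-Fuller test; the lag length is chosen by AIC
    result = adfuller(serie, autolag='AIC')
    print("ADF Statistic %F" % (result[0]))
    print(f'p-value: {result[1]}')
    # Print the header once, then one line per critical value
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'   {key}, {value}')
    print('\n')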

#d = 0
printADFTest(ts_quarter)
#d = 1 
#printADFTest(ts_quarter.diff(1).dropna())

### Find the seasonality (S) ?
https://en.wikipedia.org/wiki/Solar_cycle
> The solar cycle or solar magnetic activity cycle is a nearly periodic 11-year change in the Sun's activity. 

From the ACF plot, the first pronounced peak is around lag 43 (in quarters), so let's take **S=43**

*Remark: the global plot of sunspots (ts_quarter.plot()) shows that the cycles are not perfectly regular; their length varies between about 10.5 and 11.5 years.*
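
Because the series is quarterly, lags and years are related by a factor of 4. The arithmetic below is a small check of how S=43 compares with the nominal 11-year cycle; the constant name `QUARTERS_PER_YEAR` is introduced here for illustration.

```python
QUARTERS_PER_YEAR = 4

# Nominal 11-year solar cycle expressed in quarterly steps
nominal_S = 11 * QUARTERS_PER_YEAR
print(nominal_S)  # 44

# Lag read from the ACF plot, converted back to years
S = 43
print(S / QUARTERS_PER_YEAR)  # 10.75

# Seasonal periods implied by cycles of 10.5 to 11.5 years
print(10.5 * QUARTERS_PER_YEAR, 11.5 * QUARTERS_PER_YEAR)  # 42.0 46.0
```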


### Find the order of seasonal differencing (D) ?
**D=1**, the series has a stable seasonal pattern over time.

### Find the order of the AR term (p) ?
PACF plot: p = last lag before the value first falls inside the significance band. **p=3**

### Find the order of the MA term (q) ?
ACF plot: q = last lag before the value first falls inside the significance band. **q=10**

### Find the order of the seasonal AR term (P) ?
ACF plot: **P=1**, since the ACF is positive at lag 43, AND P+Q≤2

### Find the order of the seasonal MA term (Q) ?
ACF plot: **Q=0**, since the ACF is not negative at lag 43, AND P+Q≤2

## Build the model SARIMA(3,0,10)(1,1,0,43)

In [None]:
model = sm.tsa.statespace.SARIMAX(ts_quarter, trend='n', order=(3,0,10), seasonal_order=(1,1,0,43))
results = model.fit()
print(results.summary())


## Plot the forecast

In [None]:
forecast = results.predict(start = ts_quarter.index[-2], end= ts_quarter.index[-2] + pd.DateOffset(months=240), dynamic= True) 
ts_quarter.plot()
forecast.plot()
plt.show()

# Go further ... 
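## Determine the parameters (p,d,q)(P,D,Q,S) with a grid search
This requires the pmdarima package. Its `auto_arima` function searches over the model orders; with `stepwise=True` it runs a stepwise search rather than an exhaustive grid.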

In [None]:
!pip install pmdarima

In [None]:
import pmdarima as pm
grid_model = pm.auto_arima(ts_quarter, start_p=1, start_q=1,
                         test='adf',
                         max_p=4, max_q=4, m=43,
                         start_P=0, seasonal=True,
                         d=0, D=1, trace=True,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)
print(grid_model.summary())

### Plot the forecast

In [None]:
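period = 60
fitted, confint = grid_model.predict(n_periods=period, return_conf_int=True)
# The forecast starts one quarter after the last observation
index_of_fc = pd.date_range(ts_quarter.index[-1] + pd.offsets.QuarterEnd(1), periods=period, freq='Q')

# make series for plotting purpose
# (np.asarray drops any index carried by the forecast, so values are assigned by position)
confint = np.asarray(confint)
fitted_series = pd.Series(np.asarray(fitted), index=index_of_fc)
lower_series = pd.Series(confint[:, 0], index=index_of_fc)
upper_series = pd.Series(confint[:, 1], index=index_of_fc)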

# Plot
plt.plot(ts_quarter)
plt.plot(fitted_series, color='darkgreen')
plt.fill_between(lower_series.index, 
                 lower_series, 
                 upper_series, 
                 color='k', alpha=.15)

plt.title("SARIMA - Forecast Sunspots")
plt.show()