# Time Series Analysis of Stock Prices Using Auto ARIMA
This notebook analyses the closing stock prices of Yes Bank Limited from the time it got listed (July 2005) through November 2020. Using the monthly stock prices, time series analysis is applied to predict future prices based on the trend and seasonal components.
Auto ARIMA is demonstrated in Python, and how it works is explained in detail.

### Importing Necessary Libraries 

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

# Load specific forecasting tools
from statsmodels.tsa.statespace.sarimax import SARIMAX, SARIMAXResults
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults  # current ARIMA interface; replaces the legacy statsmodels.tsa.arima_model classes

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf  # for determining (p,q) orders
from statsmodels.tsa.seasonal import seasonal_decompose        # for ETS Plots

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")


### Installing the pmdarima package in order to use auto_arima

In [None]:
!pip install pmdarima

In [None]:
from pmdarima import auto_arima

### Reading Data
The data contains the opening, highest, lowest and closing price of the stock for each month. For this analysis, only the closing prices are considered.

In [None]:
data = pd.read_csv("../input/yes-bank-stock-prices/YesBank_StockPrices.csv", usecols=["Date", "Close"])

In [None]:
data.head()

In [None]:
data.info()

### Preparing the Time Series and Basic EDA
To apply ARIMA to the data, the date column needs to be converted into datetime objects and then made the index of the dataframe. This is achieved with strptime from the datetime library. The given date format MMM-YY is parsed into a full date of the form YYYY-MM-DD (the day defaults to the first of the month), that date is set as the index, and the frequency of the index is set to 'MS', which stands for month start.
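
The same conversion can also be done with pandas alone. The following is a minimal sketch on a few made-up rows (sample, sample_ts and the prices are placeholders, not the real Yes Bank data), assuming the dates use English month abbreviations in the MMM-YY layout.

```python
import pandas as pd

# Hypothetical rows in the same MMM-YY layout as the dataset (values are placeholders)
sample = pd.DataFrame({"Date": ["Jul-05", "Aug-05", "Sep-05"],
                       "Close": [10.0, 11.0, 12.0]})

# %b = abbreviated month name, %y = two-digit year; the day defaults to 1
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# Use the dates as the index and declare a month-start ('MS') frequency
sample_ts = sample.set_index("Date").asfreq("MS")
print(sample_ts.index)
```

Using asfreq here, rather than assigning to the freq attribute, also inserts a row of missing values for any month that is absent from the data, which makes gaps visible.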

In [None]:
from datetime import datetime
data['Date'] = data['Date'].apply(lambda x: datetime.strptime(x, '%b-%y'))

In [None]:
data.head()

In [None]:
data.dtypes

In [None]:
ts = data.set_index('Date')

In [None]:
ts.index.freq = 'MS'

In [None]:
ts.head()

Once the time series is prepared, the data is plotted to see whether there is a recurring pattern. No pattern is visible directly, but decomposing the series reveals a distinct seasonality and a trend hidden among the variations. This suggests that ARIMA might be a good choice for predictions.
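
To make the idea of an additive decomposition concrete, here is a rough sketch in plain NumPy on a synthetic, noise-free series. It is similar in spirit to what seasonal_decompose does (a centered moving average for the trend, then per-month averages for the seasonal part), but it is an illustration under simplified assumptions, not the statsmodels implementation. All variable names and numbers are made up.

```python
import numpy as np

period = 12
t = np.arange(10 * period)                      # ten years of monthly steps
true_trend = 0.5 * t + 20.0                     # synthetic linear trend
pattern = np.sin(2 * np.pi * np.arange(period) / period)
pattern -= pattern.mean()                       # seasonal pattern that sums to zero
series = true_trend + np.tile(pattern, 10)      # additive series, no noise

# Centered 2x12 moving average: 13 weights, half weight on the two end points
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend_est = np.convolve(series, weights, mode="valid")   # loses 6 points at each end

# Remove the trend, then average each calendar position to get the seasonal pattern
half = period // 2
detrended_mid = series[half:-half] - trend_est
positions = t[half:-half] % period
seasonal_est = np.array([detrended_mid[positions == k].mean() for k in range(period)])
```

On this noise-free example the moving average recovers the linear trend and the per-position averages recover the seasonal pattern. With real prices both are only estimates, and whatever is left over is the residual.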

Another thing to keep in mind is that the stock price rose more or less steadily until 2018 and then dropped suddenly. This can be attributed to the Yes Bank fraud case involving Rana Kapoor. Read more about that here: https://economictimes.indiatimes.com/topic/yes-bank-scam

In [None]:
#plotting the data

ax = ts['Close'].plot(figsize=(12,6))
ax.autoscale(axis='x',tight=True)


In [None]:
#TS Decomposition
result = seasonal_decompose(ts['Close'], model='add')
result.plot();

### Data Preparation for (S)ARIMA
Now that the dataframe is ready, we divide it into train and test sets for modeling and testing. For this example, roughly the last two years (Jan 2019 to Nov 2020) are taken as the test set and everything before that is the training set.
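
The split can also be written with date labels instead of row positions, which avoids counting rows by hand. This is a small sketch on a synthetic stand-in for the series (demo, demo_train and demo_test are hypothetical names), assuming a complete monthly index from July 2005 to November 2020.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monthly series: July 2005 to November 2020
idx = pd.date_range("2005-07-01", "2020-11-01", freq="MS")
demo = pd.DataFrame({"Close": np.arange(len(idx), dtype=float)}, index=idx)

# Label-based slicing on a DatetimeIndex includes both end points
demo_train = demo.loc[:"2018-12"]
demo_test = demo.loc["2019-01":]
print(len(demo_train), len(demo_test))
```

Under that assumption the training part has 162 rows, which matches the position used in the split cell.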

In [None]:
len(ts)

In [None]:
# Set two years for testing
train = ts.iloc[:162]
test = ts.iloc[162:]

In [None]:
test

### Applying Auto ARIMA
A brief note on how Auto ARIMA works: it is like a grid search for time series models. Depending on the parameters supplied to it, it tries ARIMA, SARIMA and SARIMAX, that is, all the ARIMA-related models. The auto_arima function seeks to identify the best parameters for an ARIMA model and returns a fitted ARIMA model. This function is based on the commonly used R function forecast::auto.arima.

The auto_arima function works by conducting differencing tests (i.e., Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or Phillips–Perron) to determine the order of differencing, d, and then fitting models within the defined start_p, max_p, start_q, max_q ranges. If the seasonal option is enabled, auto_arima also seeks to identify the optimal P and Q hyperparameters after conducting the Canova-Hansen test to determine the optimal order of seasonal differencing, D.
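
What the order of differencing d counts can be seen on two noise-free series. The sketch below is only an illustration with made-up numbers. The statistical tests named above make the same decision on noisy data, which this example does not attempt.

```python
import numpy as np

t = np.arange(24, dtype=float)
linear = 3.0 * t + 10.0          # trend removed by one difference (d = 1)
quadratic = 0.5 * t**2 + t       # needs two differences (d = 2)

d1_linear = np.diff(linear, n=1)        # constant: the trend is gone
d1_quadratic = np.diff(quadratic, n=1)  # still increasing: one difference is not enough
d2_quadratic = np.diff(quadratic, n=2)  # constant after the second difference
print(d1_linear[:3], d1_quadratic[:3], d2_quadratic[:3])
```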

Here's the link to its documentation and User Guide, if you want to know about it in detail: https://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html

The main idea is that you don't need to work out differencing orders, keep trying different orders by hand, or study ACF charts to arrive at suitable parameters; Auto ARIMA does that for you automatically.
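
The search that Auto ARIMA automates can be pictured with a much smaller toy version. The sketch below is not how pmdarima is implemented: it fits plain AR(p) models by least squares to a simulated series and keeps the order with the lowest AIC. The names ar_aic, aic_by_order and best_p, as well as the simulated data, are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, max_p = 500, 4

# Simulate a stationary AR(2) process: y[t] = 0.5*y[t-1] + 0.3*y[t-2] + noise
y = np.zeros(n)
noise = rng.normal(size=n)
for i in range(2, n):
    y[i] = 0.5 * y[i - 1] + 0.3 * y[i - 2] + noise[i]

def ar_aic(series, p, max_lag):
    """Fit AR(p) by least squares on a common sample and return a Gaussian AIC."""
    target = series[max_lag:]
    # Entry j holds the series lagged by j+1 steps, aligned with target
    lags = [series[max_lag - j - 1 : len(series) - j - 1] for j in range(p)]
    X = np.column_stack([np.ones(len(target))] + lags)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    k = p + 2                                  # intercept, p lags, noise variance
    return len(target) * np.log(rss / len(target)) + 2 * k

# Try every candidate order and keep the one with the lowest AIC
aic_by_order = {p: ar_aic(y, p, max_p) for p in range(max_p + 1)}
best_p = min(aic_by_order, key=aic_by_order.get)
print(aic_by_order, best_p)
```

Auto ARIMA does the same kind of comparison over (p, d, q) and, if enabled, the seasonal orders, using an information criterion to rank the candidates.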

Here, the parameters which are supplied are:
m = 12, the number of periods in one seasonal cycle (12 for monthly data)

seasonal = True, based on the seasonality seen in the decomposition chart

and maxiter is set to 200. This caps the number of optimizer iterations allowed when fitting each candidate model, so a higher value gives every fit more room to converge before it settles on a local minimum. Usually 200 works; higher can be better, though it may take longer.

Basic steps to use Auto ARIMA include:
1. Using the auto_arima function on the series to obtain the model parameters (p, d, q) (P, D, Q, m)
2. Using the parameters obtained, fitting a model through statsmodels ARIMA/SARIMAX on your training set
3. Obtaining predicted values on the test set based on the model fitted in Step 2
4. Comparing and plotting predictions against expected values
5. Evaluating the model through MSE, RMSE or MAE
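
For step 5, the error metrics are simple enough to compute by hand, which helps when reading the numbers later. A minimal sketch with made-up values (expected and predicted are placeholders, not results from this notebook):

```python
import numpy as np

# Hypothetical expected and predicted values (placeholders, not notebook results)
expected = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 12.0, 12.0, 19.0])

errors = predicted - expected
mse = np.mean(errors ** 2)       # mean squared error, in squared price units
rmse = np.sqrt(mse)              # root of the MSE, back in price units
mae = np.mean(np.abs(errors))    # mean absolute error, in price units
print(mse, rmse, mae)
```

Because the MSE is in squared units, it grows quickly with a few large misses; RMSE and MAE are easier to compare with the price level itself.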



In [None]:
#Applying the Auto ARIMA Function
auto_arima(ts['Close'],m=12,seasonal = True,maxiter=200).summary()

In [None]:
#statsmodel function implementation
model = SARIMAX(train['Close'],order=(1,1,1))
results = model.fit(maxiter=200)
results.summary()

In [None]:
# Obtain predicted values
start=len(train)
end=len(train)+len(test)-1
# SARIMAX already predicts on the original (levels) scale, so the typ and full_results arguments are not needed
predictions = results.predict(start=start, end=end, dynamic=False).rename('SARIMA Predictions')

In [None]:
# Compare predictions to expected values
for pred, actual in zip(predictions, test['Close']):  # iterate by position; integer lookups on a date index are deprecated
    print(f"predicted={pred:<11.10}, expected={actual}")

In [None]:
#Plotting Predictions and Original/Expected Values
ax = test['Close'].plot(legend=True,figsize=(6,6))
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)

In [None]:
#Calculating MSE
from sklearn.metrics import mean_squared_error

error = mean_squared_error(test['Close'], predictions)
print(f'SARIMA MSE Error: {error:11.10}')

In [None]:
#Calculating RMSE
from statsmodels.tools.eval_measures import rmse

error = rmse(test['Close'], predictions)
print(f'SARIMA RMSE Error: {error:11.10}')

In [None]:
#Making future predictions for the next 12 months (Dec 2020 to Nov 2021)
model = SARIMAX(ts['Close'],order=(1,1,1))
results = model.fit(maxiter=200)
fcast = results.predict(len(ts),len(ts)+11).rename('SARIMA Forecast')

In [None]:
#Plotting Future Predictions with Old values
ax = ts['Close'].plot(legend=True,figsize=(12,6))
fcast.plot(legend=True)
ax.autoscale(axis='x',tight=True)

### Model Insights and Discussion
As the expected and predicted values show, the first few predictions are close, but ARIMA is unable to predict the dip that happened because of the fraud. The MSE is extremely high and not acceptable for a prediction model. However, this is fair: it is quite unlikely that such a huge dip could be predicted without other inputs to the model, such as market reputation. Having seen only an upward trend, the model followed that trend and predicted prices at or above the last high that was observed. Seasonality was kept in place, but that alone cannot make any model anticipate a dip without additional information.

To check whether Auto ARIMA works well when the data isn't influenced by fraud or similar sudden dips, the data from 2018 to 2020 was left out for a trial. The year 2017 was used as the test set and all the data before it as the training set. The model and the results obtained are as follows:

### Model for 2017 as Test set

In [None]:
ts2017 = ts.iloc[:150]
train2017 = ts.iloc[:138]
test2017 = ts.iloc[138:150]

In [None]:
#plotting the data

ax = ts2017['Close'].plot(figsize=(12,6))
ax.autoscale(axis='x',tight=True)

In [None]:
test2017

In [None]:
#Applying the Auto ARIMA Function
auto_arima(ts2017['Close'],m=12,seasonal = True,maxiter=200).summary()

In [None]:
#statsmodel function implementation
model2017 = SARIMAX(train2017['Close'],order=(0,1,0))
results2017 = model2017.fit(maxiter=200)
results2017.summary()

In [None]:
# Obtain predicted values
start2017=len(train2017)
end2017=len(train2017)+len(test2017)-1
# SARIMAX already predicts on the original (levels) scale, so the typ and full_results arguments are not needed
predictions2017 = results2017.predict(start=start2017, end=end2017, dynamic=False).rename('SARIMA Predictions 2017')

In [None]:
# Compare predictions to expected values
for pred, actual in zip(predictions2017, test2017['Close']):  # iterate by position; integer lookups on a date index are deprecated
    print(f"predicted={pred:<11.10}, expected={actual}")

In [None]:
#Plotting Predictions and Original/Expected Values
ax = test2017['Close'].plot(legend=True,figsize=(6,6))
predictions2017.plot(legend=True)
ax.autoscale(axis='x',tight=True)

In [None]:
from sklearn.metrics import mean_squared_error

error = mean_squared_error(test2017['Close'], predictions2017)
print(f'SARIMA MSE Error: {error:11.10}')

As this model shows, the 2017 predictions are still very poor. Though the MSE is much better than the one with all the data, it is evident that something is amiss even with data that is unaffected by the fraud.

Upon searching online, I was introduced to the concept of **Stationarity**.
To apply models like ARIMA, the series being modeled (after any differencing) should be stationary.
To know what a stationary time series means, read here: https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322

The Dickey-Fuller test is used to check the stationarity of the data, which is done next.

### Checking Stationarity
If the test statistic is less than the critical value in the Dickey-Fuller test, the null hypothesis of a unit root is rejected and the series is treated as stationary; otherwise it is not.
In this case, the series came out to be non-stationary.
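
The decision rule can be written as a small helper. This is a sketch only: is_stationary is a hypothetical name, and the statistics and critical values below are illustrative numbers, not output from this notebook. The dictionary keys follow the ones returned by adfuller ('1%', '5%', '10%').

```python
def is_stationary(test_statistic, critical_values, level="5%"):
    """Reject the unit-root null (treat the series as stationary) when the
    ADF statistic lies below the critical value at the chosen level."""
    return test_statistic < critical_values[level]

# Illustrative numbers only (not output from this notebook)
crit = {"1%": -3.47, "5%": -2.88, "10%": -2.58}
print(is_stationary(-1.20, crit))   # statistic above the critical value
print(is_stationary(-4.10, crit))   # statistic below the critical value
```

Comparing the p-value against a threshold such as 0.05 leads to the same kind of decision.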

In [None]:
from statsmodels.tsa.stattools import adfuller
def adf_test(timeseries):
    # Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)


#apply adf test on the series
adf_test(ts2017['Close'])

#if test statistic < critical value, series is stationary
#series not stationary

### Detrending for Improved Model
To address the stationarity problem, further searching online
identified detrending as an appropriate method.
To know more about detrending, read here: https://machinelearningmastery.com/time-series-trends-in-python/

The concept is basically to remove the trend component from the series, forecast the trend component and the remainder (seasonality plus random component) separately, and then add the two forecasts together to get the final result.
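
The recombination step can be sketched on synthetic components. To keep the example short, it swaps ARIMA for two deliberately simple forecasters (a fitted straight line for the trend, and a repeat of the last seasonal cycle for the remainder), so it only illustrates how the two forecasts are added, not the models used in this notebook. All names and numbers are made up.

```python
import numpy as np

period, horizon = 12, 12
t = np.arange(5 * period)
trend_part = 2.0 * t + 5.0                                   # stand-in for the trend component
remainder_part = np.tile(np.linspace(-1.0, 1.0, period), 5)  # stand-in for seasonal + random part

# Trend: extrapolate a fitted straight line
slope, intercept = np.polyfit(t, trend_part, deg=1)
future_t = np.arange(len(t), len(t) + horizon)
trend_forecast = slope * future_t + intercept

# Remainder: repeat the last observed seasonal cycle (seasonal naive)
remainder_forecast = np.resize(remainder_part[-period:], horizon)

# Final forecast is the sum of the two component forecasts
final_forecast = trend_forecast + remainder_forecast
print(final_forecast)
```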

The trend component is removed in the following cell, and the remainder series is converted into a dataframe called detrended:

In [None]:
# Using statsmodels: Subtracting the Trend Component.
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
res = seasonal_decompose(ts2017['Close'], extrapolate_trend='freq')
# The result is a Series that inherits the name 'trend' from res.trend, but it holds Close minus trend
detrended = ts2017.Close.values - res.trend
plt.plot(detrended)
plt.title('Stock Prices detrended by subtracting the trend component', fontsize=16)

In [None]:
detrended

In [None]:
detrended =  pd.DataFrame(detrended)

Now the detrended component is checked with the Dickey-Fuller test.
The test statistic is below the critical value.
Hence, the series is stationary now, and ARIMA can be applied.

In [None]:
adf_test(detrended.trend)

Preparing the Data to apply ARIMA to the detrended portion

In [None]:
len(detrended)

In [None]:
# Set one year for testing
strain = detrended.iloc[:138]
stest = detrended.iloc[138:150]

In [None]:
stest

In [None]:
#SARIMA
auto_arima(detrended.trend,m=12,seasonal = True,maxiter=200).summary()

In [None]:
smodel = SARIMAX(strain['trend'],order=(4,0,2))  # fit on the training split only, so the 2017 predictions are out-of-sample
sresults = smodel.fit(maxiter=200)
sresults.summary()

In [None]:
# Obtain predicted values
sstart=len(strain)
send=len(strain)+len(stest)-1
# SARIMAX already predicts on the original (levels) scale, so the typ and full_results arguments are not needed
spredictions = sresults.predict(start=sstart, end=send, dynamic=False).rename('SARIMA Predictions Detrended')

In [None]:
# Compare predictions to expected values
for pred, actual in zip(spredictions, stest['trend']):  # iterate by position; integer lookups on a date index are deprecated
    print(f"predicted={pred:<11.10}, expected={actual}")

In [None]:
ax = stest['trend'].plot(legend=True,figsize=(6,6))
spredictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)

In [None]:
from sklearn.metrics import mean_squared_error

error = mean_squared_error(stest['trend'], spredictions)
print(f'SARIMA MSE Error: {error:11.10}')

Up to this point, the detrended component has been predicted with the ARIMA model, and the MSE is significantly lower. This suggests that detrending worked.

However, the process is only half done.
Now, the subtracted trend component is predicted separately.
For that, the trend component that was subtracted is converted to a dataframe called trendy, and ARIMA is applied with seasonal = False.
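
The trend model in this notebook is fitted with order (0, 2, 0). Assuming no constant term (the SARIMAX default), the point forecasts of such a model simply continue the straight line through the last two observations. The sketch below shows that rule on made-up numbers; forecast_020 and smooth_trend are hypothetical names.

```python
import numpy as np

def forecast_020(history, steps):
    """Point forecasts of an ARIMA(0,2,0) model without a constant:
    continue the line through the last two observations."""
    last, slope = history[-1], history[-1] - history[-2]
    return last + slope * np.arange(1, steps + 1)

smooth_trend = np.array([100.0, 103.0, 107.0, 112.0])   # hypothetical trend values
print(forecast_020(smooth_trend, 3))
```

This is why such a model suits a smooth trend component, and also why its forecasts depend heavily on the slope at the very end of the training data.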

In [None]:
res.trend

In [None]:
plt.plot(res.trend)

In [None]:
trendy = pd.DataFrame(res.trend)

In [None]:
#Non-seasonal ARIMA for the trend component (m is not needed when seasonal=False)
auto_arima(trendy.trend, seasonal=False, maxiter=200).summary()

In [None]:
# Set one year for testing
traint = trendy.iloc[:138]
testt = trendy.iloc[138:150]

In [None]:
modelt = SARIMAX(traint['trend'],order=(0,2,0))  # fit on the training split only, so the 2017 predictions are out-of-sample
resultst = modelt.fit(maxiter=200)
resultst.summary()

In [None]:
# Obtain predicted values
start=len(traint)
end=len(traint)+len(testt)-1
# SARIMAX already predicts on the original (levels) scale, so the typ and full_results arguments are not needed
predictionst = resultst.predict(start=start, end=end, dynamic=False).rename('SARIMA Predictions')

In [None]:
# Compare predictions to expected values
for pred, actual in zip(predictionst, testt['trend']):  # iterate by position; integer lookups on a date index are deprecated
    print(f"predicted={pred:<11.10}, expected={actual}")

In [None]:
ax = testt['trend'].plot(legend=True,figsize=(6,6))
predictionst.plot(legend=True)
ax.autoscale(axis='x',tight=True)

In [None]:
from sklearn.metrics import mean_squared_error

error = mean_squared_error(testt['trend'], predictionst)
print(f'SARIMA MSE Error: {error:11.10}')

In [None]:
finalpreds = predictionst + spredictions

In [None]:
test2017

In [None]:
ax = test2017['Close'].plot(legend=True,figsize=(6,6))
finalpreds.plot(legend=True)
ax.autoscale(axis='x',tight=True)

In [None]:
from sklearn.metrics import mean_squared_error

error = mean_squared_error(test2017['Close'], finalpreds)
print(f'SARIMA MSE Error: {error:11.10}')

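This model performs much better than the original 2017 model without detrending. Note that the trend was estimated by seasonal_decompose on the full series through 2017, test year included, so part of the gain may come from information that would not be available in a real forecast.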

To understand why this works better, you can have a look at this research paper: https://research.cs.aalto.fi/aml/Publications/Publication173.pdf
Much of the improvement, though, comes from the series being stationary.

### Questions I am still looking for answers to for the Next Version
1. How can these predictions be further improved?
2. Why did the Auto ARIMA function not give any seasonal components (P, D, Q, m) despite the stock prices having seasonality?
3. Are there other techniques, like detrending, that improve the predictions?
4. Does Auto ARIMA actually help in the process or not?
5. Which other algorithms apart from ARIMA can be used for time series modeling?
6. Can additional features such as market reputation be added and will multivariate time series forecasting be possible?

Suggestions and Collaborations are welcome!