# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Additional Notebook (ungraded): Stock Price prediction using ARIMA

## Learning Objectives

At the end of the experiment, you will be able to:

- Predict stock prices using ARIMA


## Dataset description

This dataset contains price and volume data for US stocks and ETFs (exchange-traded funds). The data was last updated on 11-10-2017 and is provided in txt format.
The columns of the data are:
1. Date
2. Open
3. High
4. Low
5. Close
6. Volume
7. OpenInt


[Dataset link](https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs)


The files are named according to each stock's ticker symbol.


A ticker symbol or stock symbol is an abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market. A stock symbol may consist of letters, numbers or a combination of both. "Ticker symbol" refers to the symbols that were printed on the ticker tape of a ticker tape machine.

To explore ticker symbols, see the Nasdaq stock screener: https://www.nasdaq.com/market-activity/stocks/screener

### Domain Information
Stock price forecasting is one of the most challenging tasks in financial decision-making because stock prices are inherently noisy and non-stationary, and have been observed to behave like a random walk. Accurate stock price predictions can yield significant profits, so econometric and statistical approaches have long been in use. These include linear and non-linear methods such as autoregressive (AR) models, moving average (MA) models, autoregressive integrated moving average (ARIMA) models and artificial neural networks.

## ARIMA

The Autoregressive Integrated Moving Average (ARIMA) model is a generalization of the Autoregressive Moving Average (ARMA) model. It combines an Autoregressive (AR) process and a Moving Average (MA) process into a composite model of the time series.
As the acronym indicates, ARIMA (p, d, q) captures the key elements of the model:
- AR: Autoregression. A regression model that uses the dependency between an observation and a number of lagged observations (p).
- I: Integrated. Differencing the observations (subtracting the observation at the previous time step from the current one) to make the time series stationary. The number of times differencing is applied is d.
- MA: Moving Average. An approach that takes into account the dependency between an observation and the residual errors of a moving average model applied to lagged observations (q).
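
The "Integrated" step can be illustrated on synthetic data. The following is a minimal sketch under stated assumptions: the series is a made-up random walk (not the Microsoft data), and the names `steps`, `walk` and `first_diff` are chosen only for this illustration. It shows that first-order differencing (d = 1) recovers the stationary increments from which the walk was built.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is repeatable
steps = rng.normal(size=500)        # stationary white-noise increments (synthetic)
walk = 100 + np.cumsum(steps)       # random walk: the increments integrated once

# First-order differencing (d = 1) undoes the integration.
first_diff = np.diff(walk)

# The differences match the original increments; the first increment is lost
# because the first observation has no predecessor to subtract.
print(np.allclose(first_diff, steps[1:]))
```

A series that needs one round of differencing to become stationary is the typical case for d = 1.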


In [None]:
# @title Download the Dataset
! wget https://cdn.iisc.talentsprint.com/CDS/Datasets/msft.us.txt
print("The dataset was downloaded")


### Importing required packages

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot
from statsmodels.tsa.arima.model import ARIMA
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

### Microsoft Stock Market Analysis using ARIMA


Read the Microsoft stock prices data file

In this notebook, we will just examine the “Close” stock prices feature. This same analysis can be repeated for most of the other features.

In [None]:
# You can change the file to predict stocks for different companies
# Open, High, Low, Close values are in dollars.
# Volume is the number of shares traded in a given period.
# Open interest (OpenInt) is the number of derivative contracts that are open or active.

df = pd.read_csv("/content/msft.us.txt").fillna(0)
df.head()

In [None]:
print(df.shape)
print(df.columns)

In [None]:
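# Plot the closing values for Microsoft
plt.figure(figsize=(17,8))
plt.plot(df['Close'], label='Close')  # a label is needed for plt.legend() to show an entry
plt.title('Microsoft Closing Values')
plt.xlabel('Dates')
plt.ylabel('Close')
# Show a date tick every 1300 rows; len(df) avoids hard-coding the row count
plt.xticks(np.arange(0, len(df), 1300), df['Date'][0:len(df):1300])
plt.legend()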

In [None]:
# Plot the cumulative sum of the numeric price columns
df = df.drop(['Volume'],axis=1)
dr = df.select_dtypes(include='number').cumsum()  # skip the non-numeric Date column
print(dr[:3])
dr.plot()
plt.title('Microsoft Cumulative Sum of Prices')

#### Test the stationarity using Dickey-Fuller test

In [None]:
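# Perform the augmented Dickey-Fuller test on the closing prices.
# Null hypothesis: the series has a unit root (it is non-stationary).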
print('Results of Dickey-Fuller Test:')
dftest = adfuller(df.Close, autolag='AIC', maxlag = 20 )
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
pvalue = dftest[1]
if pvalue < 0.01:
    print('p-value = %.4f. The series is likely stationary.' % pvalue)
else:
    print('p-value = %.4f. The series is likely non-stationary.' % pvalue)

print(dfoutput)

In [None]:
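# Apply first-order differencing.
# The first value has no predecessor, so drop it rather than fill it with an artificial 0.
diff = df.Close.diff(1).dropna()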

In [None]:
# Repeat the augmented Dickey-Fuller test on the differenced series
print('Results of Dickey-Fuller Test:')
dftest = adfuller(diff, autolag='AIC', maxlag = 20 )
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
pvalue = dftest[1]
if pvalue < 0.01:
    print('p-value = %.4f. The series is likely stationary.' % pvalue)
else:
    print('p-value = %.4f. The series is likely non-stationary.' % pvalue)

print(dfoutput)

Before working on time series prediction, we examine a lag plot of the "Close" feature with a fixed lag of 1 to check for autocorrelation.

In [None]:
# Find the autocorrelation using lag plot.
# A lag plot checks whether a data set or time series is random or not.
# Non-random data exhibits an identifiable structure in the lag plot (eg. linear structure).
plt.figure(figsize=(17,7))
lag_plot(df['Close'], lag=1)
plt.title('Microsoft Autocorrelation plot')

The above graph exhibits a linear pattern indicating that the data are non-random and suggests that an autoregressive model will be appropriate for this data.
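
The visual impression from a lag plot can also be summarised as a single number, the lag-1 autocorrelation. The sketch below is an illustration on synthetic data, not the Microsoft series: `noise` and `walk` are made-up series, and `Series.autocorr` is the pandas method used to compute the correlation between a series and its lagged copy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)            # fixed seed for repeatability
noise = pd.Series(rng.normal(size=2000))   # no memory between observations
walk = noise.cumsum()                      # each value builds on the previous one

# Correlation between each series and itself shifted by one step.
noise_r = noise.autocorr(lag=1)
walk_r = walk.autocorr(lag=1)

print(f"lag-1 autocorrelation, noise: {noise_r:.3f}")
print(f"lag-1 autocorrelation, walk : {walk_r:.3f}")
```

A value near 1 corresponds to the tight diagonal line seen in a lag plot, while a value near 0 corresponds to a shapeless cloud of points. On the notebook data, the equivalent call would be `df['Close'].autocorr(lag=1)`.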

In [None]:
# Find out the last date given in Microsoft data
df['Date'].iloc[-1]

### Split the data

Divide the data into train and test sets in an 80:20 ratio and plot the series.
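
For time series, the split must preserve time order: the earliest rows are used for training and the most recent rows for testing, with no shuffling. A minimal sketch of such a split follows; the helper name `chronological_split` and the toy values are hypothetical and serve only as an illustration.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split rows in order: the first part trains, the remainder tests."""
    cut = int(len(frame) * train_fraction)
    return frame.iloc[:cut], frame.iloc[cut:]

# Toy data standing in for the stock prices (values are made up).
toy = pd.DataFrame({"Close": [10.0, 10.5, 10.2, 10.8, 11.0,
                              11.3, 11.1, 11.6, 12.0, 12.4]})

train, test = chronological_split(toy)
print(len(train), len(test))
```

Every training row precedes every test row, so the model is never evaluated on data older than what it was fitted on.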

In [None]:
split = int(len(df)*0.8)  # index of the first test row
train_data, test_data = df[:split], df[split:] # Chronological train and test split
plt.figure(figsize=(17,7))
plt.title('Microsoft Prices')
plt.xlabel('Dates')
plt.ylabel('Prices')
plt.plot(train_data['Close'], 'blue', label='Training Data') # Plot train data in blue color
plt.plot(test_data['Close'], 'red', label='Testing Data')  # Plot test data in red color
plt.xticks(np.arange(0, len(df), 1300), df['Date'][0:len(df):1300])
plt.legend()
plt.show()

#### Define a function for calculating the loss (Mean absolute percentage error)
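
As a worked example of the arithmetic, the sketch below computes MAPE on three made-up values. The short name `mape` and the numbers are assumptions for illustration only; the calculation follows the same formula as the function defined in this section.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

actual = [100.0, 200.0, 400.0]      # made-up actual prices
forecast = [110.0, 190.0, 400.0]    # made-up forecasts

# Per-step absolute percentage errors are 10%, 5% and 0%,
# so the mean is about 5%.
score = mape(actual, forecast)
print(f"MAPE: {score:.2f}%")
```

Because each error is divided by the actual value, MAPE is undefined when an actual value is 0.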

In [None]:
# Function to calculate the mean absolute percentage error (MAPE).
# MAPE is a statistical measure of how accurate a forecast system is.
# It expresses accuracy as a percentage: the average over all time periods
# of |actual - predicted| / |actual|, multiplied by 100.
def Mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [None]:
df.plot()

#### ACF and PACF Plot

In [None]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(diff, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(diff, lags=40, ax=ax2)

### ARIMA model
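
An ARIMA(0,1,0) model without a drift term is a random walk, so its one-step forecast is simply the last observed value. Assuming that no drift term is fitted (which we understand to be the statsmodels default for differenced models), the rolling forecast in this section should behave like the naive baseline sketched below. The helper name `naive_one_step_forecasts` and the toy arrays are hypothetical.

```python
import numpy as np

def naive_one_step_forecasts(train, test):
    """Forecast each test value as the observation immediately before it."""
    # The first forecast uses the last training value; each later forecast
    # uses the previous test value, mirroring a rolling one-step scheme.
    return np.concatenate(([train[-1]], test[:-1]))

train = np.array([10.0, 11.0, 12.0])   # made-up training prices
test = np.array([13.0, 12.5, 14.0])    # made-up test prices

preds = naive_one_step_forecasts(train, test)
print(preds)
```

Comparing the ARIMA error against this baseline is a quick way to check whether a chosen (p, d, q) order adds anything beyond repeating yesterday's price.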

In [None]:
# ARIMA model
train_ar = train_data['Close'].values
test_ar = test_data['Close'].values

history = [x for x in train_ar]

predictions = list()
for t in range(len(test_ar)):
    model = ARIMA(history, order=(0,1,0))  # refit on all observations seen so far
    model_fit = model.fit()
    output = model_fit.forecast() # one-step forecast
    yhat = output[0]
    predictions.append(yhat)
    obs = test_ar[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

In [None]:
error = Mean_absolute_percentage_error(test_ar, predictions)
print('Mean absolute percentage error: %.3f' % error)

### Here we plot our predictions

In [None]:
# Plot the predictions
plt.figure(figsize=(17,7))
plt.plot(train_data['Close'], color='blue', label='Training Data')
plt.plot(test_data.index, predictions, color='green',marker='o', linestyle='dashed', label='Predicted Price')
plt.plot(test_data.index, test_data['Close'], color='red', label='Actual Price')
plt.title('Microsoft Prices Prediction')
plt.xlabel('Dates')
plt.ylabel('Prices')
plt.xticks(np.arange(0, len(df), 1300), df['Date'][0:len(df):1300])
plt.legend()

### Let's compare predicted and actual prices visually

In [None]:
plt.figure(figsize=(27,16))
plt.plot(test_data.index, predictions, color='green',marker='o', linestyle='dashed', label='Predicted Price')
plt.plot(test_data.index, test_data['Close'], color='red',marker='o', label='Actual Price')
plt.xlabel('Dates')
plt.ylabel('Prices')
# Label every 300th date across the test period
start, stop = test_data.index[0], test_data.index[-1] + 1
plt.xticks(ticks=np.arange(start, stop, step=300), labels=df['Date'][start:stop:300])
plt.legend()
plt.show()

Also read: [Smooth Exponential Smoothing method applied to Microsoft stocks data](http://rstudio-pubs-static.s3.amazonaws.com/399202_e78dfd98a7434405893996f2e7cf4b37.html)