[![General Assembly Logo](https://camo.githubusercontent.com/1a91b05b8f4d44b5bbfb83abac2b0996d8e26c92/687474703a2f2f692e696d6775722e636f6d2f6b6538555354712e706e67)](https://generalassemb.ly/education/web-development-immersive)
![Misk Logo](https://i.ibb.co/KmXhJbm/Webp-net-resizeimage-1.png)

# Project 4: Predicting the Microsoft Stock Price Using Time-Series Models, Namely ARIMA, Recurrent Neural Network (RNN) and Facebook Prophet



**Team Members:** Ibrahim Rizqallah Alzahrani - Abdulaziz Alsulami

---

# Problem Statement

#### We aim to identify the best model for predicting future stock prices (in our task, predicting the Microsoft stock price three months ahead).

In [None]:
pip install pmdarima

In [None]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
np.set_printoptions(precision=4)
sns.set(font_scale=1.5)
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# this will filter out a lot of future warnings from statsmodels
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Loading ACF/PACF functions and plots, seasonal decomposition, and auto-ARIMA

from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from pmdarima import auto_arima


In [None]:
# reading data as dataframe
df = pd.read_csv('../input/microsoft-stock-market-2001-2021/MSFT_Stock.csv', index_col=0)

In [None]:
# displaying 1st five rows
df.head()

In [None]:
# displaying number of rows and columns
df.shape

In [None]:
# displaying names, count of rows, number of null values and data types of features
df.info()

In [None]:
# displaying statistics information
df.describe()

We can see that the first four features have nearly the same mean, standard deviation, minimum and percentile values; only their maximum values differ.

In [None]:
# sorting the data frame by date in ascending order
# displaying 1st five rows
df.sort_index(inplace=True)
df.head()

In [None]:
# displaying last five rows
df.tail()

In [None]:
# plotting the feature correlation matrix, annotated with its values
plt.figure(figsize=(12,6))
sns.heatmap(
    df.corr(), 
    annot=True,
    )
plt.title('Features Correlation');

# The price features are almost perfectly positively correlated with one another, while volume has a weak negative correlation with each of them.

**Minimum and maximum value of each feature**

In [None]:
# loop over the features and print each one's minimum value together with its date
for col in df.columns:
    min_val = df[col].min()
    print(df[df[col] == min_val][col])
    print()

# All price features reached their minimum in March 2009, whereas volume reached its minimum in July 2020.

In [None]:
# loop over the features and print each one's maximum value together with its date
for col in df.columns:
    max_val = df[col].max()
    print(df[df[col] == max_val][col])
    print()

# All price features reached their maximum in January 2021, whereas volume reached its maximum in April 2006.

### EDA

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(12, 12), constrained_layout=True) 
ax = ax.ravel()

colors = ['red', 'blue', 'green', 'yellow']
cols = [col for col in df.columns if col != 'volume']

# select each column by name so that every plot matches its label
for i, col in enumerate(cols):
    df[col].plot(ax=ax[i], color=colors[i])
    ax[i].set(ylabel=col)
    ax[i].tick_params(axis='x', labelrotation=20)
    ax[i].set_title(f'{col} price from 2002 until 2021');

# After 2016, all price features rose sharply. Before 2016 most prices were below 50, and in 2009 prices were at their lowest level.

In [None]:
plt.figure(figsize=(12,6))
df.iloc[:,-1].plot()
plt.title('Stock volume from 2002 until 2021');

# Volume stayed below about 300 million until 2006, jumped to a peak of around 600 million, then dropped and has since remained below 300 million.

### Model
**Autoregressive Integrated Moving Average aka ARIMA**

ARIMA, or Autoregressive Integrated Moving Average, is a combination of three components:
* <strong>AR(p)</strong> Autoregression.
* <strong>I(d)</strong> Integration - uses differencing of observations (subtracting an observation from an observation at the previous time step) in order to make the time series stationary
* <strong>MA(q)</strong> Moving Average.
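
To make the I(d) component concrete, here is a minimal sketch of differencing on a made-up series (the values are hypothetical, not the Microsoft data). It shows a first difference turning a quadratic trend into a linear one, and a second difference removing the trend entirely.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values): a quadratic trend, clearly non-stationary
t = np.arange(10)
series = pd.Series(2.0 * t**2 + 3.0 * t + 1.0)

# d = 1: subtract the previous observation from each observation
first_diff = series.diff().dropna()

# d = 2: difference the already-differenced series once more
second_diff = first_diff.diff().dropna()

print(first_diff.tolist())   # still increasing (linear in t)
print(second_diff.tolist())  # constant, so the trend has been removed
```

Each round of differencing shortens the series by one observation, which is why the notebook calls `.dropna()` after `.diff()`.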


In [None]:
# converting index (date) to date time type
df.index = pd.to_datetime(df.index)
df.head()

In [None]:
# resampling to quarterly frequency (mean of each quarter)
df.resample('Q').mean().head()

In [None]:
new_df = df.resample('Q').mean()[['close']]

new_df.head()

### 1. Visually examine the closing price

In [None]:
new_df['close'].plot(figsize=(12, 5))
plt.show()

### 2. Do Time Series Decomposition to check for Seasonality

In [None]:
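# quarterly data has 4 observations per year; statsmodels >= 0.11 names this argument `period` instead of `freq`
result = seasonal_decompose(new_df['close'], period=4)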
result.plot();

#### Test for Stationarity

#### if not, determine the d value (differencing)

#### 1. Check if stationary

In [None]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series,title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string())          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")

In [None]:
def autocorr_plots(y, lags=None):
    fig, ax = plt.subplots(ncols=2, figsize=(12, 4), sharey=True)
    plot_acf(y, lags=lags, ax=ax[0])
    plot_pacf(y, lags=lags, ax=ax[1])
    return fig, ax

In [None]:
adf_test(new_df['close'])

#### Our data is not stationary

#### 2. Do differencing until we make our data stationary

[Why would we difference?](https://otexts.com/fpp2/stationarity.html) Well, there is one assumption that is **required** for nearly every time series model: **stationarity**.
- If our time series is stationary, then we do not need to difference and let $d=0$.
- If our time series is not stationary, then we difference either once ($d=1$) or twice ($d=2$). Differenced data often is stationary, so we difference our data, then model that!

In [None]:
# d = 1
adf_test(new_df['close'].diff().dropna())

In [None]:
# d = 2
adf_test(new_df['close'].diff().diff().dropna())

In [None]:
# d = 3
adf_test(new_df['close'].diff().diff().diff().dropna())

After differencing three times, the ADF test indicates that our data is stationary.

In [None]:
fig,ax=plt.subplots(ncols=2,figsize=(16,8))

new_df['close'].plot(lw=2.5, ax=ax[0])
new_df['close'].diff().diff().diff().dropna().plot(lw=2.5, ax=ax[1]);

### 3. Create ACF and PACF plots
#### Determine the p and q values (Manually)

#### 1. Create ACF and PACF plots on our differenced data

In [None]:
autocorr_plots(new_df['close'].diff().diff().diff().dropna(), lags=20);

### 4. Using Auto-ARIMA to determine (p,d,q)

In [None]:
auto_fit = auto_arima(new_df['close'], start_p=0, start_q=0,
                          max_p=2, max_q=2, 
                          m=1,                     # m is used for seasonality, m=1 means no seasonality (cover this later)
                          seasonal=False,          # We do not want seasonality here
                          d=None,  # The order of first-differencing. If None (by default), automatically be selected
                          trace=True,
                          error_action='ignore',   # we don't want to know if an order does not work
                          suppress_warnings=True,  # we don't want convergence warnings
                          stepwise=True)           # set to stepwise

auto_fit.summary()

**What were the values of (p,d,q) suggested by auto-ARIMA?**

As the summary output shows, auto-ARIMA suggests ARIMA(2,1,1) as the best model, based on the lowest AIC/BIC values.
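
As a rough illustration of how an information criterion ranks candidate orders, the sketch below computes AIC = 2k - 2 ln(L) for a few candidates and picks the smallest. The log-likelihoods and parameter counts are invented for illustration only; they are not the values from the auto-ARIMA trace, and the helper names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: order -> (log-likelihood, number of estimated parameters)
candidates = {
    (0, 1, 0): (-210.0, 2),
    (1, 1, 0): (-205.0, 3),
    (2, 1, 1): (-200.0, 5),
    (2, 1, 2): (-199.8, 6),
}

scores = {order: aic(ll, k) for order, (ll, k) in candidates.items()}
best_order = min(scores, key=scores.get)

for order, score in scores.items():
    print(order, round(score, 1))
print('lowest AIC:', best_order)
```

The penalty term 2k is what stops the largest model from always winning: in this toy example the extra parameter of (2,1,2) improves the likelihood too little to pay for itself.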

### 5. Fit ARIMA model
#### 1. Train/Test Split

In [None]:
# See what are the ranges of our data
new_df.index.max(), new_df.index.min()

In [None]:
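# label slicing with '2019' includes all of 2019, so end the training set at 2018 to avoid overlap with the test set
df_train = new_df.loc[:'2018']
df_test = new_df.loc['2019':]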

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
# Plot the train and test sets on the axis ax
fig, ax = plt.subplots(figsize=(12,6))
df_train.plot(ax=ax)
df_test.plot(ax=ax)
ax.legend(labels=['Train Data','Test Data']);


#### 2. Fitting ARIMA models

In [None]:
# statsmodels.tsa.arima_model was removed in statsmodels 0.13, so this cell needs an older release
from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(df_train,order=(2,1,1))
res = model.fit()
res.summary()



In [None]:
# plot our fitted values for train data
# the model uses d=1, so its fitted values are on the once-differenced scale

df_train.diff().dropna().plot(legend = True,figsize=(12,8))
res.fittedvalues.rename("Train Fitted Values").plot(legend = True)
plt.show()

#### 3. Predict values on the test dataset

In [None]:
# plot our prediction for test data


start = len(df_train) 
end = len(df_train) + len(df_test) -1
  
# Predictions for the test set 

# Note typ='levels' below: it returns predictions on the scale of the original (undifferenced) series
predictions = res.predict(start, end,typ ='levels',dynamic=False) # .rename("Test predicted")  

In [None]:
predictions

In [None]:
# Compare predictions to expected values (positional access via .iloc, since both series are indexed by date)
for i in range(len(predictions)):
    print(f"predicted={predictions.iloc[i]:<4.4}, expected={df_test['close'].iloc[i]}")

In [None]:
# plot predictions and actual test values 
title = 'Predicted vs. Actual Test Closing Price'
ax=predictions.plot(legend = True,figsize=(12,8),title=title) 
df_test.plot(legend = True,ax=ax);
ax.legend(labels=['Test predicted','Test Data']);

#### 4. Evaluate the Model

In [None]:
from statsmodels.tools.eval_measures import mse
from statsmodels.tools.eval_measures import rmse

error1 = mse(df_test['close'], predictions)
print(f'ARIMA(2,1,1) MSE Error: {error1:11.10}')

error2 = rmse(df_test['close'], predictions)
print(f'ARIMA(2,1,1) RMSE Error: {error2:11.10}')
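
For reference, the two metrics can be written out by hand. This is a minimal sketch with made-up numbers; the helper name `mse_rmse` is hypothetical and is not part of statsmodels.

```python
import numpy as np

def mse_rmse(actual, predicted):
    """Return (MSE, RMSE) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    mse_value = float(np.mean(errors ** 2))      # mean of squared errors
    return mse_value, float(np.sqrt(mse_value))  # RMSE is its square root

# Made-up values for illustration
actual_demo = [100.0, 110.0, 120.0]
predicted_demo = [98.0, 113.0, 120.0]
mse_value, rmse_value = mse_rmse(actual_demo, predicted_demo)
print(mse_value, rmse_value)
```

Because the RMSE is in the same units as the series, it can be read against the typical price level, which is why the next cell prints the mean closing price.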

In [None]:
new_df['close'].mean()

### 6. Forecast the Future

In [None]:
# Forecast beyond the last observed quarter; positions count from the start of the training data,
# so the first out-of-sample position is len(new_df)
fcast = res.predict(start=len(new_df),end=len(new_df)+1,typ='levels',dynamic=False).rename('ARIMA(2,1,1) Forecast')


In [None]:
# starting and end points for forecasting for our previous model

start_1 = 3
end_1 = len(df_train)+len(df_test)+1
fig, ax = plt.subplots(figsize=(12,8)) 
res.plot_predict(start=start_1, end=end_1, ax =ax,)
plt.legend(loc=2);

# Recurrent Neural Network (LSTM Model)

### Note that the LSTM model is sensitive to the scale of the data; therefore, we will apply a MinMax scaler

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from keras.models import Sequential
from keras.layers import Dense,  LSTM
from keras import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [None]:
lstm= df.copy()
lstm.head()

In [None]:
lstm.reset_index(inplace=True)
lstm.rename(columns={'index': 'date'}, inplace=True)
lstm.head()

In [None]:
# The value we want to predict is the closing price, at column position 4 after the index reset. Define Y as this column
L = len(lstm)
Y = lstm.iloc[:,4]
Y= np.array(Y)
Y= Y.reshape(-1,1)
plt.plot(Y)
plt.title('Closing Price of Microsoft Stock')
plt.show(block= False)


In [None]:
# Use three lagged (shifted) copies of Y as inputs: the three previous closing prices predict the next one.
X1= Y[0:L-3,:]
X2=Y[1:L-2,:]
X3=Y[2:L-1,:]
Y = Y[3:L,:]
X= np.concatenate([X1,X2,X3],axis=1)
print(f'X shape is {X.shape}')
print(f'Y shape is {Y.shape}')
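
The same lag construction can be written for any number of lags. The helper below is a hypothetical generalisation (`make_lagged` is not from the notebook), shown on a tiny made-up series so the alignment of inputs and targets is easy to check by eye.

```python
import numpy as np

def make_lagged(values, n_lags):
    """Return (X, y): each row of X holds the n_lags values that precede the matching y."""
    values = np.asarray(values, dtype=float).reshape(-1)
    n_rows = len(values) - n_lags
    # column i is the series shifted by i steps, truncated to n_rows entries
    X = np.column_stack([values[i:i + n_rows] for i in range(n_lags)])
    y = values[n_lags:].reshape(-1, 1)
    return X, y

X_demo, y_demo = make_lagged([10, 11, 12, 13, 14, 15], n_lags=3)
print(X_demo)
print(y_demo.ravel())
```

With `n_lags=3` this reproduces the slicing used above: the first row of `X_demo` is the first three values and its target is the fourth.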

In [None]:
from sklearn.preprocessing import MinMaxScaler

# scaling our data to the [0, 1] range
scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)
scaler1 = MinMaxScaler()
scaler1.fit(Y)
Y = scaler1.transform(Y)
X= np.reshape(X, (X.shape[0],1,X.shape[1]))
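
The cell above fits the scalers on all rows before the train/test split. A common alternative is to fit on the training rows only, so that no information from the test period enters the scaling. The sketch below shows the arithmetic with a small hand-written class; `SimpleMinMax` is a hypothetical stand-in written for illustration, not scikit-learn's implementation, and the prices are made up.

```python
import numpy as np

class SimpleMinMax:
    """Hypothetical min-max scaler: x_scaled = (x - min) / (max - min)."""

    def fit(self, data):
        data = np.asarray(data, dtype=float)
        self.data_min = data.min(axis=0)
        self.data_range = data.max(axis=0) - self.data_min
        return self

    def transform(self, data):
        return (np.asarray(data, dtype=float) - self.data_min) / self.data_range

    def inverse_transform(self, scaled):
        return np.asarray(scaled, dtype=float) * self.data_range + self.data_min

# Made-up prices; the last row plays the role of the test set
prices = np.array([[20.0], [30.0], [40.0], [60.0]])
train_demo, test_demo = prices[:3], prices[3:]

scaler_demo = SimpleMinMax().fit(train_demo)      # statistics come from the training rows only
train_scaled = scaler_demo.transform(train_demo)
test_scaled = scaler_demo.transform(test_demo)    # can fall outside [0, 1]
print(train_scaled.ravel(), test_scaled.ravel())
```

With a trending series such as a stock price, test values scaled this way can exceed 1, which is the price paid for keeping the test period out of the fit.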


#### Splitting the dataset into train and test sets


In [None]:
#Let’s now define training and test sets for our model
# 4907 = 4997 - 90: hold out the last 90 rows (about one quarter) for testing

X_train = X[:4907,:,:]
X_test = X[4907:,:,:]
Y_train = Y[:4907,:]
Y_test = Y[4907:,:]


In [None]:
len(X_train), len(X_test)

In [None]:
# An LSTM layer with 10 units and hyperbolic tangent (tanh) activation.
# The input shape is (1, 3): one time step with 3 lagged values (the X1, X2 and X3 defined before).
# Hard sigmoid is used for the recurrent activation.

model = Sequential()
model.add(LSTM(10,activation = 'tanh',input_shape = (1,3),recurrent_activation= 'hard_sigmoid'))

In [None]:
# Add the output layer: a single unit, because the model predicts one value.

model.add(Dense(1))

### Our neural network is defined; now we compile and train it.

In [None]:
model.compile(loss= 'mean_squared_error',optimizer = 'rmsprop', metrics=[metrics.mse])
model.fit(X_train,Y_train,epochs=10,verbose=2)
Predict = model.predict(X_test)

In [None]:
plt.figure(figsize=(15,10))
plt.plot(Y_test,label = 'Test')
plt.plot(Predict, label = 'Prediction')
plt.legend(loc='best')
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(Y_test, Predict)
print('MSE Performance Metrics scores of RNN_LSTM model for Closing Price of MS Stock Market is:    ', mse)
rmse = np.sqrt(mse)
print('RMSE Performance Metrics scores of RNN_LSTM model for Closing Price of MS Stock Market is:    ',rmse)
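
The MSE and RMSE above are computed on the MinMax-scaled values, so they are unitless. Since the inverse transform is linear, an RMSE in price units can be recovered by multiplying the scaled RMSE by (max - min). The sketch below checks this with made-up numbers; `price_min` and `price_max` are hypothetical stand-ins for the fitted scaler's range.

```python
import numpy as np

# Hypothetical scaler range and scaled values, for illustration only
price_min, price_max = 20.0, 220.0
actual_scaled = np.array([0.50, 0.60, 0.70])
predicted_scaled = np.array([0.52, 0.57, 0.70])

rmse_scaled = np.sqrt(np.mean((actual_scaled - predicted_scaled) ** 2))

# Undo x_scaled = (x - min) / (max - min), then recompute the error in price units
actual_price = actual_scaled * (price_max - price_min) + price_min
predicted_price = predicted_scaled * (price_max - price_min) + price_min
rmse_price = np.sqrt(np.mean((actual_price - predicted_price) ** 2))

# Shortcut: the offset cancels in the difference, leaving only the scale factor
rmse_shortcut = rmse_scaled * (price_max - price_min)
print(rmse_scaled, rmse_price, rmse_shortcut)
```

An RMSE converted this way is on the same scale as scores computed directly on prices.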

# Scale back to the original scale

In [None]:
Y_train = scaler1.inverse_transform(Y_train)
Y_train = pd.DataFrame(Y_train)
# take the dates from lstm (column 0 after reset_index); column 0 of df holds prices, not dates
Y_train.index = pd.to_datetime(lstm.iloc[3:4910,0])
Y_test = scaler1.inverse_transform(Y_test)
Y_test = pd.DataFrame(Y_test)
Y_test.index = pd.to_datetime(lstm.iloc[4910:,0])
Predict = model.predict(X_test)
Predict = scaler1.inverse_transform(Predict)
Predict = pd.DataFrame(Predict)
Predict.index=pd.to_datetime(lstm.iloc[4910:,0])
plt.figure(figsize=(15,10))
plt.plot(Y_test)
plt.plot(Predict)
plt.show()

# Evaluating using Facebook Prophet model

In [None]:
pip install fbprophet

In [None]:
pip install pystan

In [None]:
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
from fbprophet.plot import plot_cross_validation_metric

In [None]:
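df = pd.read_csv('../input/microsoft-stock-market-2001-2021/MSFT_Stock.csv', index_col=0)
# sort by date again so that the positional train/test split below follows time order
df.sort_index(inplace=True)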

In [None]:
df.reset_index(inplace=True)
df.rename(columns={'index': 'date'}, inplace=True)

df.head()

In [None]:
from pandas import DataFrame
prophet= pd.DataFrame()

prophet['date']= df.date
prophet['close']= df.close

prophet.head()

In [None]:
prophet.columns=['ds','y']

prophet.head()

In [None]:
prophet.dropna(inplace=True)


In [None]:
prophet['ds']=pd.to_datetime(prophet.ds)
prophet.plot(x='ds',y='y')


## Splitting the data into train and test to start running our model.


In [None]:
prophet_train=prophet[:4820]
prophet_test=prophet[4820:]

# Forecasting the Closing Price of Microsoft Stock with Prophet (Base model)
Generating a quarter-ahead forecast of Microsoft stock using Prophet.

In [None]:
m=Prophet(seasonality_mode='multiplicative')
m.fit(prophet_train)
future=m.make_future_dataframe(periods=3,freq='MS')
forecast=m.predict(prophet_test)
forecast.head()

**yhat:** the notation traditionally used for the predicted value of y

**yhat_lower:** the lower bound of our forecasts

**yhat_upper:** the upper bound of our forecasts

In [None]:
m.plot(forecast)


In [None]:
fig1 = m.plot_components(forecast)
plt.show()


In [None]:
from fbprophet.plot import plot_plotly, plot_components_plotly

plot_plotly(m, forecast)


In [None]:
plot_components_plotly(m, forecast)

In [None]:
m.plot(forecast)
ax=forecast.plot(x='ds',y='yhat',legend=True,label='predictions',figsize=(12,8))
prophet_test.plot(x='ds',y='y',legend=True,label='True Test Data',ax=ax,xlim=('2001-03-16', '2021-01-29'))

## Validating the Model

In [None]:
# Initial training period.
initial= 2*365
initial= str(initial)+' days'

#Period length that we perform the cross validation for.
period= 2*90
period=str(period)+' days'

#Horizon of prediction essentially for each fold.
horizon = 90
horizon=str(horizon)+' days'
prophet_cv=cross_validation(m,initial=initial,period=period,
horizon=horizon)

# Performance metrics of prophet_cv
performance_metrics(prophet_cv)
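
To see how `initial`, `period` and `horizon` interact, the sketch below generates rolling-origin cutoff dates with plain pandas. It is a simplified illustration under stated assumptions, not Prophet's own code: the helper `rolling_cutoffs` and the date range are hypothetical.

```python
import pandas as pd

def rolling_cutoffs(first_date, last_date, initial, period, horizon):
    """Walk back from the last possible cutoff in steps of `period`,
    keeping each cutoff that still leaves at least `initial` of history."""
    first_date = pd.Timestamp(first_date)
    last_date = pd.Timestamp(last_date)
    initial, period, horizon = (pd.Timedelta(x) for x in (initial, period, horizon))

    cutoffs = []
    cutoff = last_date - horizon          # the last fold must still have a full horizon
    while cutoff - initial >= first_date:
        cutoffs.append(cutoff)
        cutoff = cutoff - period
    return sorted(cutoffs)

# Hypothetical three-year range with the same settings as the cell above
cutoffs = rolling_cutoffs('2018-01-01', '2021-01-01',
                          initial='730 days', period='180 days', horizon='90 days')
print(len(cutoffs), cutoffs)
```

Each cutoff defines one fold: the model is trained on data up to the cutoff and evaluated on the following `horizon`. A longer `initial` or `period` therefore means fewer folds.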


In [None]:
plot_cross_validation_metric(prophet_cv,'rmse');


In [None]:
# trend changepoints
from fbprophet.plot import add_changepoints_to_plot

fig=m.plot(forecast)
a=add_changepoints_to_plot(fig.gca(),m,forecast)

In [None]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(prophet_cv.y, prophet_cv.yhat)
print('MSE Performance Metrics scores of FaceBook Prophet model for Closing Price of MS Stock Market is:    ', mse)
rmse = np.sqrt(mse)
print('RMSE Performance Metrics scores of FaceBook Prophet model for Closing Price of MS Stock Market is:    ',rmse)

# As demonstrated, the RNN model has the best performance among the models; thus, the forecasting step is based on the RNN as follows:

In [None]:
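df = pd.read_csv('../input/microsoft-stock-market-2001-2021/MSFT_Stock.csv', index_col=0)
# sort by date again so that the last rows are the most recent observations
df.sort_index(inplace=True)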

df.head()

In [None]:
from keras.preprocessing.sequence import TimeseriesGenerator
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

train_RNN = pd.DataFrame(df.iloc[:,3])


scaler.fit(train_RNN)
scaled_train_RNN = scaler.transform(train_RNN)

n_input = 4907
n_features = 1
generator = TimeseriesGenerator(scaled_train_RNN, scaled_train_RNN, length=n_input, batch_size=1)

# define model
model_RNN = Sequential()
model_RNN.add(LSTM(10, activation='relu', input_shape=(n_input, n_features)))
model_RNN.add(Dense(1))
model_RNN.compile(optimizer='adam', loss='mse')
# fit model
model_RNN.fit_generator(generator,epochs=10)

In [None]:
forecast_RNN = []

eval_RNN = scaled_train_RNN[-n_input:]
current_RNN = eval_RNN.reshape((1, n_input, n_features))

for i in range(90):
    current_pred_RNN = model_RNN.predict(current_RNN)[0]
    forecast_RNN.append(current_pred_RNN) 
    current_RNN = np.append(current_RNN[:,1:,:],[[current_pred_RNN]],axis=1)

forecast_RNN= scaler.inverse_transform(forecast_RNN)
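
The loop above is a recursive (iterated) multi-step forecast: each prediction is appended to the input window and the oldest value is dropped. The sketch below isolates that pattern with a trivial stand-in model; `recursive_forecast` and `mean_model` are hypothetical names, and the window-mean "model" is an assumption made purely so the example runs without Keras.

```python
import numpy as np

def recursive_forecast(predict_one, history, n_input, n_steps):
    """Forecast n_steps ahead by feeding each prediction back into the window."""
    window = list(history[-n_input:])
    forecasts = []
    for _ in range(n_steps):
        next_value = predict_one(np.array(window))
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # drop the oldest value, append the prediction
    return forecasts

def mean_model(window):
    """Stand-in for a trained model: predicts the mean of the window."""
    return float(window.mean())

demo_forecast = recursive_forecast(mean_model, [1.0, 2.0, 3.0], n_input=3, n_steps=2)
print(demo_forecast)
```

Because later steps depend on earlier predictions rather than on observed values, errors can accumulate as the horizon grows.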

In [None]:
forecast_RNN = pd.DataFrame({'Forecast':forecast_RNN.flatten()})
forecast_RNN.index = np.arange('2021-01-30',90,dtype='datetime64[D]')
forecast_RNN.head()

In [None]:
fig = plt.figure(dpi=120,figsize = (14,6))
ax = plt.axes()
ax.set(xlabel = 'Date',ylabel = 'Price',title = 'Forecast : (30-01-2021) to (29-04-2021)')
forecast_RNN.plot(label = 'Forecast',ax=ax,color='red',lw=2);

**The RMSE scores of each of the applied models are as follows:**

- ARIMA(2,1,1): RMSE Score 25.83997931
- FaceBook Prophet: RMSE Score 7.461
- Recurrent Neural Network (LSTM Model): RMSE Score 0.025 (computed on MinMax-scaled prices, whereas the two scores above are in price units)

#### In conclusion, the RNN model predicts better than the other models, and the closing price of Microsoft stock is expected to drop sharply over the coming three months.