# PHASE 4 PROJECT:
# DEVELOPMENT AND EVALUATION OF A MACHINE LEARNING-BASED DJIA PREDICTION SYSTEM.

| Collaborators        | 
| -------------------- |
| Ian Vaati            |
| Sylvia Murithi       |
| Bushra Mohammed      |

**Project Submission Date:** December 11th, 2023.


# BUSINESS UNDERSTANDING.

## Executive Summary

The stock market is a complex and dynamic system that plays a crucial role in the global economy. Predicting stock market movements can provide valuable insights for investors and financial institutions, enabling them to make informed decisions about investment strategies and risk management. This project aims to develop a machine learning model capable of predicting the short-term movement of the Dow Jones Industrial Average (DJIA), a prominent stock market index, using historical data.



## Business Problem

Accurately forecasting stock market movements is a challenging task due to the inherent volatility and unpredictability of financial markets. Existing forecasting methods often rely on simple technical indicators or subjective analysis, which may not capture the full complexity of market dynamics. A more sophisticated approach is needed to provide accurate and consistent predictions that can inform investment decisions.

## Business Objectives

* **Forecasting Accuracy:** Develop accurate and reliable models to predict future trends in the Dow Jones Industrial Average, allowing stakeholders to anticipate market movements with greater precision.

* **Risk Mitigation:** Provide insights into potential market risks and opportunities, enabling proactive risk management strategies for investors and financial institutions.

* **Decision Support:** Equip decision-makers with actionable information derived from predictive models, empowering them to make informed investment decisions and optimize portfolio management.

* **Market Intelligence:** Enhance market intelligence by identifying patterns, trends, and key indicators that contribute to a deeper understanding of market dynamics.

### Business Benefits

Accurate stock market predictions can provide several benefits to investors and financial institutions:

* **Improved Investment Decisions:** By understanding the direction of market movements, investors can make more informed decisions about buying, selling, or holding stocks.


* **Risk Management:** Accurate predictions can help investors identify potential risks and take appropriate measures to mitigate them.

* **Enhanced Financial Planning:** Financial institutions can use stock market predictions to develop more effective investment strategies and risk management plans.

### Business Stakeholders
The target stakeholders for this project include:

* **Individual Investors:** Individuals seeking to make informed investment decisions based on market predictions.

* **Financial Institutions:** Banks and financial organizations aiming to enhance their risk management strategies.

* **Financial Analysts:** Analysts who need to forecast market movements for research and reporting purposes.

* **Portfolio Managers:** Individuals responsible for managing investment portfolios, seeking tools to optimize performance.

## Key Deliverables
The project will focus on the following:
* **Predictive Models:** Develop and deploy machine learning models capable of forecasting future values of the Dow Jones Industrial Average.

* **Visualizations and Reports:** Provide visually appealing and informative representations of market trends, predictions, and relevant indicators to aid decision-makers.

* **Documentation:** Create comprehensive documentation detailing the project's methodologies, data sources, model selection, and performance evaluation.

* **Training and Support:** Offer training sessions and ongoing support to stakeholders on interpreting and utilizing the predictive models effectively.

## Success Criteria
The project will be considered successful if the following criteria are met:

* The developed machine learning model accurately predicts the direction of DJIA movement (up, down, or sideways) for the specified time horizon.

* The model consistently outperforms benchmark models, such as a simple moving average.

* The web application or API provides a user-friendly interface for obtaining DJIA predictions.

By achieving these success criteria, the project will demonstrate the potential of machine learning to provide valuable insights into stock market movements, empowering investors and financial institutions to make informed decisions.

# Methodology

## Research Questions

1. Can historical data from the Dow Jones Industrial Average (DJIA) be used to predict future market trends accurately?

2. What key features or indicators contribute significantly to predicting stock market movements?

3. Which machine learning algorithms are best suited for predicting DJIA movement?

## Hypothesis

1. Historical trends and patterns in Dow Jones data can be leveraged to forecast future market trends with reasonable accuracy.

2. Technical indicators like moving averages, relative strength index (RSI), and MACD will be crucial features for predicting stock market trends.

3. Time series forecasting models like ARIMA, LSTM, and Prophet will outperform basic regression models in predicting stock market movements.

## Research Design
* **Analytical Approach:** Utilize an analytical approach by applying various machine learning models to historical Dow Jones data to forecast future trends.

* **Time Series Analysis:** Focus on time series analysis methodologies to capture the sequential nature of stock market data.

## Data Description
### Data Source:
The data for this project was collected from the [investing.com website](https://www.investing.com/indices/us-30-historical-data), which provides historical data for various financial indices, including the Dow Jones Industrial Average (DJIA). The data covers the period from 01/03/2000 to 11/24/2023 and includes the following attributes:

### Data Variables:
1. **Date:** Represents the date for which the data is recorded.

2. **Price:** Represents the closing price of the DJIA for the specified date.

3. **Open:** Represents the opening price of the DJIA for the specified date.

4. **High:** Represents the highest price reached by the DJIA during the specified day.

5. **Low:** Represents the lowest price reached by the DJIA during the specified day.

6. **Vol.:** Represents the total trading volume of the DJIA for the specified day.

7. **Change %:** Represents the percentage change in the closing price relative to the previous trading day.

# Data Analysis

### Load the Data

In [None]:
# Import relevant libraries
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from pmdarima import auto_arima
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             mean_absolute_error, mean_squared_error)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from statsmodels.tsa.arima.model import ARIMA
from tensorflow.keras.layers import LSTM, Dense, Input
from tensorflow.keras.models import Model, Sequential
# from prophet import Prophet  # the package formerly published as fbprophet

# Ignore all warnings
warnings.filterwarnings("ignore")

In [None]:
# Load and display the dataset
df = pd.read_csv('Dow Jones Industrial Average Historical Data.csv')
df

In [None]:
# Creating a copy of the DataFrame 
df = df.copy() 

# Displaying the first few rows of the copied DataFrame 
df.head()

### Inspect the Data

In [None]:
# check the data types  
df.info()

In [None]:
# Convert the columns to the desired data types
# 1. Convert the Date column to datetime and sort oldest-first, so that the
#    rolling, lag and differencing calculations below follow time order
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)

# 2. Convert the Price, Open, High and Low columns to float
for col in ['Price', 'Open', 'High', 'Low']:
    df[col] = df[col].str.replace(',', '').astype(float)

# 3. Convert the Vol. column to float: strip the 'M' (millions) suffix and scale
df['Vol.'] = df['Vol.'].str.replace('M', '').astype(float) * 1_000_000

# 4. Convert the Change % column to float after removing the percentage symbol
df['Change %'] = df['Change %'].str.rstrip('%').astype(float)

In [None]:
# Check for missing values
print(df.isnull().sum())

In [None]:
# summary statistics
df.describe()

In [None]:
# checking for outliers
# Visualize box plots for all columns
plt.figure(figsize=(15, 8))
sns.boxplot(data=df)
plt.title('Box Plots for All Columns')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Identify and display potential outliers using Tukey's method
def identify_outliers_tukey(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (data < lower_bound) | (data > upper_bound)
    return outliers

# Create a DataFrame to store outliers
outliers_df = pd.DataFrame()

# Identify outliers for each numeric column (the Date column is skipped)
for column in df.select_dtypes(include='number').columns:
    outliers_df[column] = identify_outliers_tukey(df[column])

# Display rows with outliers
outliers_rows = df[outliers_df.any(axis=1)]
outliers_rows

The outliers in the dataset are left as they are since they may carry valuable information. Extreme stock price movements can be driven by significant events, news, or market conditions, so removing them might discard important information.

### Visualize Time Series Data

In [None]:
# Plotting the closing prices over time
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Price'])
plt.title('Dow Jones Industrial Average Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.show()

### Explore the Distributions and Trends

In [None]:
# Distribution of Closing Prices
plt.figure(figsize=(8, 6))
sns.histplot(df['Price'], bins=30, kde=True)
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.show()

# Trends in Daily Percentage Change
plt.figure(figsize=(10, 6))
sns.lineplot(x=df['Date'], y=df['Change %'])
plt.title('Daily Percentage Change in Dow Jones Industrial Average')
plt.xlabel('Date')
plt.ylabel('Percentage Change')
plt.xticks(rotation=45)
plt.show()


### Correlation Analysis
This analysis aids in understanding relationships, identifying multicollinearity, and guiding feature selection for machine learning models. The heatmap provides a visually intuitive representation, facilitating quick interpretation and communication of complex relationships among financial features.

In [None]:
# Correlation matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)

# Heatmap to visualize correlations
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()


# Time-Based Feature Creation and Technical Indicators

Additional time-based features are created, such as day of the week, month, and year. The moving averages (MA_10, MA_50, and MA_200) and daily returns are calculated.
Rows with missing values introduced by the rolling means are dropped.


In [None]:
# Create additional time-based features
df['day_of_week'] = df['Date'].dt.dayofweek
df['month'] = df['Date'].dt.month
df['year'] = df['Date'].dt.year

# Calculate moving averages
df['MA_10'] = df['Price'].rolling(window=10).mean()
df['MA_50'] = df['Price'].rolling(window=50).mean()
df['MA_200'] = df['Price'].rolling(window=200).mean()

# Calculate daily returns
df['daily_return'] = df['Price'].pct_change()

# Drop rows with missing values introduced by rolling means
df.dropna(inplace=True)

In [None]:
# Plotting the original price and moving averages
plt.figure(figsize=(14, 8))
plt.plot(df['Date'], df['Price'], label='Original Price')
plt.plot(df['Date'], df['MA_10'], label='MA_10')
plt.plot(df['Date'], df['MA_50'], label='MA_50')
plt.plot(df['Date'], df['MA_200'], label='MA_200')

plt.title('Dow Jones Industrial Average with Moving Averages')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.show()

From the graph, the 10-day and 50-day moving averages are the most useful windows because they smooth the series while still capturing the trends in the data.

# Relative Strength Index
The Relative Strength Index (RSI) is a tool in finance that helps traders understand if a market has moved too much in one direction. It shows if prices have gone up a lot (overbought) or gone down a lot (oversold). 

When RSI is high, prices may be too high, suggesting a possible drop. Conversely, when RSI is low, prices may be too low, suggesting a possible increase.

RSI gives a number between 0 and 100, with over 70 indicating overbought and under 30 indicating oversold. Traders use RSI to find potential points where prices could change direction. It's a helpful tool to understand market conditions and make smarter trading choices.
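
As a worked example of the arithmetic, the sketch below computes RSI from a short, made-up list of closing prices using a simple average of gains and losses. The function name `simple_rsi`, the toy prices, and the 3-day window are illustrative assumptions; the conventional window is 14 days, and some implementations use a smoothed average rather than a simple one.

```python
def simple_rsi(prices, window):
    """Return the RSI of the last `window` price changes (simple-average variant)."""
    # Day-over-day price changes
    changes = [b - a for a, b in zip(prices[:-1], prices[1:])]
    recent = changes[-window:]
    # Average gain and average loss (loss expressed as a positive number)
    avg_gain = sum(c for c in recent if c > 0) / window
    avg_loss = sum(-c for c in recent if c < 0) / window
    if avg_loss == 0:
        return 100.0  # no losses in the window: RSI saturates at 100
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

toy_prices = [100, 102, 101, 104]   # hypothetical closes; changes are +2, -1, +3
rsi = simple_rsi(toy_prices, window=3)
print(round(rsi, 2))
```

Here the average gain is 5/3 and the average loss is 1/3, so RS is 5 and the RSI is about 83.33, which falls in the overbought range.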

In [None]:
# Calculate daily price changes (difference from the previous day's price)
delta = df['Price'].diff()

# Identify gains (positive changes) and set losses to 0
gain = delta.where(delta > 0, 0)

# Identify losses (negative changes) and set gains to 0
loss = -delta.where(delta < 0, 0)

# Calculate the average gain over a 14-day window
avg_gain = gain.rolling(window=14).mean()

# Calculate the average loss over a 14-day window
avg_loss = loss.rolling(window=14).mean()

# Calculate the Relative Strength (RS)
rs = avg_gain / avg_loss

# Calculate the Relative Strength Index (RSI)
df['RSI'] = 100 - (100 / (1 + rs))


# Create lag features
df['Price_Lag_1'] = df['Price'].shift(1)
df['Price_Lag_5'] = df['Price'].shift(5)

# Drop rows with NaN values resulting from lag features
df = df.dropna()

# View the data
df

## Stationarity of the dataset
Checking for stationarity is important because it ensures that the statistical properties of the time series, such as mean and variance, remain constant over time, providing a stable foundation for modeling and forecasting.

Hypothesis:

Null Hypothesis (H0): The time series has a unit root, i.e. it is non-stationary.

Alternative Hypothesis (H1): The time series data is stationary.

In [None]:
from statsmodels.tsa.stattools import adfuller
#Test for stationarity
def test_stationarity(timeseries):
    # Determine rolling statistics
    rolmean = timeseries.rolling(12).mean()
    rolstd = timeseries.rolling(12).std()
    #Plot rolling statistics:
    plt.plot(timeseries, color='blue',label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Standard Deviation')
    plt.show(block=False)
    print("Results of dickey fuller test")
    adft = adfuller(timeseries,autolag='AIC')
    output = pd.Series(adft[0:4],index=['Test Statistics','p-value','No. of lags used','Number of observations used'])
    for key,values in adft[4].items():
        output['critical value (%s)'%key] =  values
    print(output)
test_stationarity(df['Price'])

From the result, the Augmented Dickey-Fuller (ADF) test statistic is -2.351880 and the corresponding p-value is 0.155781.
Because the p-value is greater than 0.05, we fail to reject the null hypothesis: the test does not rule out a unit root, so the series is treated as non-stationary.

Since the data is non-stationary, we will apply a transformation to make it stationary.
Common techniques include differencing or taking the logarithm. We will use the differencing method and drop the resulting null values.
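
To illustrate the two transformations mentioned above, the sketch below applies first differencing to a toy series and to its logarithm. The helper name `first_difference` and the prices are made up for illustration; the point is that plain differences grow with the price level, while differences of the logarithm (log returns) stay constant for a series that grows at a fixed percentage rate.

```python
import math

def first_difference(values):
    """Return x[t] - x[t-1] for t = 1..n-1 (one element shorter than the input)."""
    return [b - a for a, b in zip(values[:-1], values[1:])]

toy_prices = [100.0, 110.0, 121.0]   # hypothetical series growing 10% per step

# Plain differencing: absolute changes scale with the price level
diff = first_difference(toy_prices)

# Log transform followed by differencing: log returns
log_diff = first_difference([math.log(p) for p in toy_prices])

print(diff)
print([round(r, 4) for r in log_diff])
```

Differencing shortens the series by one observation, which is why the first row of a differenced column is null.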

In [None]:
# Compute the first-order difference
df['Price_diff'] = df['Price'].diff().dropna()

In [None]:
def adf_test(timeseries):
    print("Differenced Series:")
    print(timeseries.head())
    
    result = adfuller(timeseries.dropna(), autolag='AIC')  # Drop NaN values before the test
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:', result[4])

    if result[1] <= 0.05:
        print('Reject the null hypothesis and conclude that the time series is likely stationary.')
    else:
        print('Fail to reject the null hypothesis and conclude that the time series is likely non-stationary.')
# ADF test on differenced data
adf_test(df['Price_diff'])

From the result, the p-value is below the 0.05 significance level, which provides strong evidence against the null hypothesis.
We therefore conclude that the first-differenced time series is stationary.

In [None]:
# Visualize the differenced data
plt.figure(figsize=(16,8))
plt.plot(df['Date'],df['Price_diff'])
plt.title("DJIA First Difference")
plt.xlabel('Date', fontsize=18)
plt.ylabel('Price Change', fontsize=18)
plt.show()

# Splitting the data into train-test

In [None]:
#Split the data into train & test data
train_data = df[:int(len(df) * 0.8)]
test_data = df[int(len(df) * 0.8):]

# Plot the differenced series, coloured by split
plt.figure(figsize=(20, 10))
plt.title('DJIA First Difference: Train/Test Split')
plt.xlabel('Dates')
plt.ylabel('Price Change')
plt.plot(train_data['Date'], train_data['Price_diff'], label='Training Data')
plt.plot(test_data['Date'], test_data['Price_diff'], 'green', label='Testing Data')
plt.legend()
plt.show()

## The Autoregressive Integrated Moving Average (ARIMA)

The Autoregressive Integrated Moving Average (ARIMA) model is a time series forecasting method that combines autoregression (AR), differencing (I), and moving average (MA) components. The model is denoted ARIMA(p, d, q), where p is the lag order (the number of lagged observations included in the model), d is the degree of differencing (how many times the raw observations are differenced to achieve stationarity), and q is the order of the moving-average component (the number of lagged forecast errors included in the model).

In essence, ARIMA is designed to capture and model the temporal dependencies, trends, and fluctuations present in time series data. 
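
To make the AR and I components concrete, here is a minimal sketch of an ARIMA(1,1,0)-style forecast written by hand. It is an illustration under stated assumptions, not what `statsmodels` or `pmdarima` do internally: the coefficients are fitted by ordinary least squares rather than maximum likelihood, there is no MA term, and the function names and toy levels are made up.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var          # assumes the lagged values are not all identical
    c = mean_next - phi * mean_prev
    return c, phi

def forecast_arima_110(levels, steps):
    """Forecast `steps` future levels with an AR(1) model on first differences."""
    diffs = [b - a for a, b in zip(levels[:-1], levels[1:])]   # the "I" part (d = 1)
    c, phi = fit_ar1(diffs)                                     # the "AR" part (p = 1)
    last_level, last_diff = levels[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff       # predict the next change
        last_level = last_level + last_diff   # undo the differencing
        forecasts.append(last_level)
    return forecasts

# Hypothetical levels whose differences follow diff[t] = 1 + 0.5 * diff[t-1]
toy_levels = [0.0, 1.0, 2.5, 4.25, 6.125]
forecasts = forecast_arima_110(toy_levels, steps=2)
print(forecasts)
```

The model predicts changes, and the forecast level is recovered by adding each predicted change to the previous level, which is what the "integrated" part of the name refers to.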

In [None]:
from pmdarima import auto_arima

# Fit the auto_arima model
model_autoARIMA = auto_arima(train_data['Price'], start_p=0, start_q=0,
                              test='adf',
                              max_p=3, max_q=3,
                              m=1,
                              d=None,
                              seasonal=False,
                              start_P=0,
                              D=0,
                              trace=True,
                              error_action='ignore',
                              suppress_warnings=True,
                              stepwise=True)

print(model_autoARIMA.summary())
model_autoARIMA.plot_diagnostics(figsize=(15,8))
plt.show()

From the Auto ARIMA results, the best model is ARIMA(2,1,2): p = 2 (order of autoregression), d = 1 (order of differencing), and q = 2 (order of the moving average).

In [None]:
# Based on the diagnostic plots:
# Top left: The residual errors appear to fluctuate around a mean of zero.

# Top right: The density plot suggests the residuals are approximately normally distributed with a mean of zero.

# Bottom left: The normal Q-Q plot indicates that the residuals are approximately normally distributed.

# Bottom right: Based on the correlogram (ACF plot), the residual errors are not autocorrelated.

In [None]:
# ARIMA MODEL
# Fit on the training prices only; d=1 applies the first difference inside the model
model_arima = ARIMA(train_data['Price'].values, order=(2, 1, 2))
fitted_arima = model_arima.fit()
print(fitted_arima.summary())

## Model Forecast

In [None]:
# Forecast as many steps as there are test observations
fc = fitted_arima.forecast(steps=len(test_data))

# Evaluate the forecast against the held-out test prices
actual = test_data['Price'].values
mse_arima = mean_squared_error(actual, fc)
rmse_arima = np.sqrt(mse_arima)
mae_arima = np.mean(np.abs(actual - fc))

# Print the evaluation metrics
print(f'Mean Squared Error (MSE)(ARIMA): {mse_arima}')
print(f'Root Mean Squared Error (RMSE)(ARIMA): {rmse_arima}')
print(f'Mean Absolute Error (MAE)(ARIMA): {mae_arima}')


# Clustering Analysis using KMeans
Features (MA_10, MA_50, MA_200, daily_return) are standardized using StandardScaler.
KMeans clustering with three clusters is applied to the scaled features.
A new 'cluster' column is added to the dataframe to represent the assigned cluster labels.
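
The sketch below shows why standardization matters before distance-based clustering. The moving averages are in the tens of thousands while daily returns are around 0.01, so without scaling the moving averages would dominate every distance. The helper `standardize` and the toy values are illustrative assumptions; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev

def standardize(column):
    """Z-score a list of numbers: subtract the mean, divide by the population std."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

ma_50 = [25000.0, 26000.0, 27000.0]     # hypothetical moving-average values
daily_return = [-0.01, 0.00, 0.01]      # hypothetical daily returns

scaled_ma = standardize(ma_50)
scaled_ret = standardize(daily_return)
print(scaled_ma)
print(scaled_ret)
```

After scaling, both columns have mean 0 and unit variance, so they contribute comparably to the distances that KMeans minimizes.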


In [None]:
# Standardize features for clustering
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['MA_10', 'MA_50', 'MA_200', 'daily_return']])

# Apply KMeans clustering on scaled features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # fixed seed for reproducible cluster labels
df['cluster'] = kmeans.fit_predict(scaled_features)

# Machine Learning Model (Gradient Boosting) for Classification

Features (including cluster labels) and the target variable (binary classification) are defined.
Data is split into training and testing sets.
Grid search is performed to find the best hyperparameters for the Gradient Boosting Classifier.
The model is trained, and predictions are made on the test set.
Model performance metrics such as accuracy and classification report are displayed.
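
Because the observations are ordered in time, a shuffled split lets the model train on days that come after some of the test days. One alternative, sketched here as an assumption rather than as what this notebook's cell does, is to split chronologically so that every test observation is later than every training observation. The function name `time_ordered_split` and the toy data are hypothetical.

```python
def time_ordered_split(rows, train_fraction=0.8):
    """Split a chronologically sorted sequence so the test set is strictly later."""
    split_at = int(len(rows) * train_fraction)
    return rows[:split_at], rows[split_at:]

days = list(range(10))   # stand-in for 10 trading days, oldest first
train_days, test_days = time_ordered_split(days)
print(train_days, test_days)
```

With scikit-learn, passing `shuffle=False` to `train_test_split` gives the same kind of ordered split, and `TimeSeriesSplit` provides ordered folds for cross-validation.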

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Define features and target variable
X = df[['MA_50', 'MA_200', 'daily_return', 'day_of_week', 'month', 'year', 'cluster']]
y = np.where(df['Price'].shift(-1) > df['Price'], 1, 0)  # 1 if price increases, else 0

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Grid search for hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

gb_classifier = GradientBoostingClassifier(random_state=42)
grid_search = GridSearchCV(estimator=gb_classifier, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Evaluate the model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate accuracy and other metrics
accuracy_gb = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Display the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Display model evaluation metrics
print(f"Accuracy: {accuracy_gb}")
print("Classification Report:\n", classification_rep)

The model achieves an overall accuracy of 54.44%, which means it correctly predicts the direction of the Dow Jones Industrial Average (DJIA) movement (an increase or a decrease) about 54% of the time, only slightly better than chance.

Precision indicates how often the model is right when it predicts a given class. The precision for predicting a decrease (Class 0) is 55%, and for predicting an increase (Class 1) is 54%.

Recall measures how many of the actual increases or decreases the model captures. The recall for a decrease (Class 0) is 74%, and for an increase (Class 1) is 33%, so the model misses most actual increases.

The F1-Score is the harmonic mean of precision and recall. The F1-Score for a decrease (Class 0) is 63%, and for an increase (Class 1) is 41%.
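
These metrics can be derived by hand from the four confusion-matrix counts. The sketch below uses the counts reported for this classifier further down in the notebook (TP = 184, FP = 158, TN = 448, FN = 371), with Class 1 (an increase) treated as the positive class; the variable names are illustrative.

```python
# Confusion-matrix counts reported for the classifier in this notebook
tp, fp, tn, fn = 184, 158, 448, 371

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision_up = tp / (tp + fp)   # of the predicted increases, how many were right
recall_up = tp / (tp + fn)      # of the actual increases, how many were found
f1_up = 2 * precision_up * recall_up / (precision_up + recall_up)

print(f"accuracy={accuracy:.4f} precision={precision_up:.2f} "
      f"recall={recall_up:.2f} f1={f1_up:.2f}")
```

The same formulas applied with Class 0 as the positive class (swapping TP with TN and FP with FN) give the figures for predicting a decrease.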

In [None]:
# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print(conf_matrix)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", linewidths=.5)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

In [None]:
# True Positive (TP): 184 instances were correctly predicted as positive.
# False Positive (FP): 158 instances were incorrectly predicted as positive.
# True Negative (TN): 448 instances were correctly predicted as negative.
# False Negative (FN): 371 instances were incorrectly predicted as negative.

## Feature Importance Analysis and Visualization
Feature importance is extracted from the trained Gradient Boosting Classifier.
A bar plot is created to visualize the importance of each feature in the prediction.


In [None]:
# Feature importance analysis
feature_importance = best_model.feature_importances_
features = X.columns
feature_importance_dict = dict(zip(features, feature_importance))

# Plotting feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x=list(feature_importance_dict.values()), y=list(feature_importance_dict.keys()))
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()


## Time Series Forecasting using Facebook Prophet

The dataset is split into training and testing sets for the Facebook Prophet time series forecasting model.
The Prophet model is trained on historical data and used to forecast future prices.
Mean Squared Error (MSE) is calculated for evaluation.
The forecast is visualized using the Prophet plotting functions.

In [None]:
# from prophet import Prophet  # the package formerly published as fbprophet
# import matplotlib.pyplot as plt
# from sklearn.metrics import mean_squared_error

# # Assuming you already have the necessary libraries and DataFrame (df)

# # Train-test split
# train_size = int(len(df) * 0.8)
# prophet_data = df[['Date', 'Price']].rename(columns={'Date': 'ds', 'Price': 'y'})
# train_prophet = prophet_data.iloc[:train_size]
# test_prophet = prophet_data.iloc[train_size:]

# # Fit Prophet model with daily seasonality
# prophet_model = Prophet(daily_seasonality=True)
# prophet_model.fit(train_prophet)

# # Make future dataframe for predictions
# future = prophet_model.make_future_dataframe(periods=len(test_prophet), freq='D')

# # Forecast with Prophet
# prophet_forecast = prophet_model.predict(future)

# # Evaluate Prophet performance
# mse_prophet = mean_squared_error(test_prophet['y'], prophet_forecast['yhat'].tail(len(test_prophet)))
# print(f'Mean Squared Error (Prophet): {mse_prophet}')

# # Visualize Prophet predictions
# fig = prophet_model.plot(prophet_forecast)

# # Include labels for forecasted side
# plt.title('Prophet Predictions for DJIA')
# plt.xlabel('Date')
# plt.ylabel('Closing Price')

# # Highlight the forecasted side
# plt.axvline(x=test_prophet['ds'].iloc[0], color='red', linestyle='--', label='Train-Test Split')
# plt.fill_between(test_prophet['ds'].values, test_prophet['y'].values, color='gray', alpha=0.3, label='Test Data')
# plt.legend()

# plt.show()

In [None]:
# from sklearn.metrics import mean_absolute_error, mean_squared_error
# # Evaluate Prophet performance
# mse_prophet = mean_squared_error(test_prophet['y'], prophet_forecast['yhat'].tail(len(test_prophet)))
# mae_prophet = mean_absolute_error(test_prophet['y'], prophet_forecast['yhat'].tail(len(test_prophet)))
# rmse_prophet = np.sqrt(mse_prophet)

# print(f'Mean Squared Error (Prophet): {mse_prophet}')
# print(f'Mean Absolute Error (Prophet): {mae_prophet}')
# print(f'Root Mean Squared Error (Prophet): {rmse_prophet}')

## LSTM Time Series Prediction using Keras

The dataset is prepared and normalized for input to the LSTM (Long Short-Term Memory) neural network.
Sequences are created using a function (create_dataset).
The LSTM model is defined, compiled, and trained on the training data.
Predictions are made on the test set and then inverse transformed to the original scale.
Root Mean Squared Error (RMSE) is calculated, and predictions are visualized alongside actual prices.
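
The sequence-creation step can be pictured with a tiny example. This is a minimal sketch of the sliding-window idea using plain lists, with a hypothetical helper `make_windows` and a toy series: each input holds `time_step` consecutive values and the target is the value that immediately follows.

```python
def make_windows(series, time_step):
    """Return (X, y): each X row holds `time_step` consecutive values, y is the next one."""
    X, y = [], []
    for i in range(len(series) - time_step):
        X.append(series[i:i + time_step])
        y.append(series[i + time_step])
    return X, y

toy_series = [10, 20, 30, 40, 50]
X, y = make_windows(toy_series, time_step=3)
print(X)   # [[10, 20, 30], [20, 30, 40]]
print(y)   # [40, 50]
```

A series of length n yields n - time_step samples, so a long look-back window noticeably reduces the number of training examples.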

In [None]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, Dense

In [None]:
# Ensure the data is sorted by date
df.sort_values(by='Date', inplace=True)

# Extract the 'Price' column as a 2-D array of shape (samples, 1)
data = df.reset_index()['Price'].values.reshape(-1, 1)

# Split the data chronologically into training and testing sets
train_size = int(len(data) * 0.65)
train_data, test_data = data[:train_size], data[train_size:]

# Normalize the data; the scaler is fitted on the training set only
# so that no information from the test set leaks into training
scaler = MinMaxScaler(feature_range=(0, 1))
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)

# Function to create a dataset of sliding windows
def create_dataset(dataset, time_step=1):
    X, Y = [], []
    # Each sample is `time_step` consecutive values; the target is the value that follows
    for i in range(len(dataset) - time_step):
        a = dataset[i:(i+time_step), 0]
        X.append(a)
        Y.append(dataset[i + time_step, 0])
    return np.array(X), np.array(Y)

# Use the previous 200 trading days (X) to predict the next day's price (Y)
time_step = 200
X_train, y_train = create_dataset(train_data, time_step)
X_test, y_test = create_dataset(test_data, time_step)

# Reshape input to be [samples, time steps, features] for LSTM
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

# Input layer
input_layer = Input(shape=(X_train.shape[1], X_train.shape[2]))

In [None]:
# LSTM layer
lstm_layer = LSTM(units=50, return_sequences=True)(input_layer)
lstm_layer = LSTM(units=50, return_sequences=True)(lstm_layer)
lstm_layer = LSTM(units=50)(lstm_layer)

# Dense layer
dense_layer = Dense(units=1)(lstm_layer)

# Create the model
model = Model(inputs=input_layer, outputs=dense_layer)
model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
# Define hyperparameters
epochs = 20
batch_size = 16

# Train the model
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)

# Predictions
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

# Inverse transform the predictions
train_predict = scaler.inverse_transform(train_predict)
test_predict = scaler.inverse_transform(test_predict)

In [None]:
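# Calculate RMSE on the original price scale
# (the targets are still scaled, so inverse transform them before comparing)
y_train_actual = scaler.inverse_transform(y_train.reshape(-1, 1))
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))

train_rmse = np.sqrt(mean_squared_error(y_train_actual, train_predict))
print(f'Training RMSE: {train_rmse}')

test_rmse = np.sqrt(mean_squared_error(y_test_actual, test_predict))
print(f'Testing RMSE: {test_rmse}')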

In [None]:
import matplotlib.pyplot as plt

# Plotting training set
plt.figure(figsize=(15, 6))
plt.plot(scaler.inverse_transform(y_train.reshape(-1, 1)), label='Actual Train Data')
plt.plot(train_predict, label='Predicted Train Data', color='red')
plt.title('Stock Price Prediction - Training Set')
plt.xlabel('Time Steps')
plt.ylabel('Stock Price')
plt.legend()
plt.show()

# Plotting testing set
plt.figure(figsize=(15, 6))
plt.plot(scaler.inverse_transform(y_test.reshape(-1, 1)), label='Actual Test Data')
plt.plot(test_predict, label='Predicted Test Data', color='red')
plt.title('Stock Price Prediction - Testing Set')
plt.xlabel('Time Steps')
plt.ylabel('Stock Price')
plt.legend()
plt.show()

## Evaluation of Model Performance

In [None]:
# Comparison of Model Performance
from sklearn.metrics import mean_absolute_error

# LSTM metrics on the original price scale (y_test is still scaled)
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))
mse_lstm = mean_squared_error(y_test_actual, test_predict)

# Collect one row of metrics per model.
# The Prophet row needs mse_prophet, mae_prophet and rmse_prophet, which are only
# defined if the (commented-out) Prophet cells above have been run.
model_comparison = pd.DataFrame([
    {'Model': 'ARIMA', 'MSE': mse_arima, 'MAE': mae_arima, 'RMSE': rmse_arima},
    {'Model': 'Prophet', 'MSE': mse_prophet, 'MAE': mae_prophet, 'RMSE': rmse_prophet},
    {'Model': 'LSTM', 'MSE': mse_lstm,
     'MAE': mean_absolute_error(y_test_actual, test_predict),
     'RMSE': np.sqrt(mse_lstm)},
])


# Display the model comparison table
print(model_comparison)

# Visualize the comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='RMSE', data=model_comparison, palette='viridis')
plt.title('Model Comparison - RMSE')
plt.xlabel('Model')
plt.ylabel('Root Mean Squared Error (RMSE)')
plt.show()


The Prophet model has significantly higher errors (MSE, MAE, RMSE) than the other models, suggesting that it does not perform well on this data. The ARIMA and LSTM models have lower errors than the Prophet model.
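
For reference, the three error metrics can be computed by hand, and a naive "persistence" forecast (tomorrow's price equals today's) gives a simple benchmark to compare any model against. The sketch below uses made-up prices and hypothetical helper names; errors are only comparable across models when they are computed on the same scale and over the same test period.

```python
import math

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 102.0, 101.0, 105.0]   # hypothetical test-set prices
last_train_price = 99.0                 # hypothetical final training price

# Persistence baseline: each day's forecast is the previous day's actual price
naive = [last_train_price] + actual[:-1]

naive_mse = mse(actual, naive)
naive_rmse = math.sqrt(naive_mse)       # RMSE is in the same units as the prices
naive_mae = mae(actual, naive)
print(naive_mse, round(naive_rmse, 4), naive_mae)
```

A model whose RMSE is not clearly below that of the persistence baseline adds little over simply carrying the last price forward.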

## Summary of Model Training

### Gradient Boosting Classifier

- **Best Hyperparameters:** {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
- **Accuracy:** 0.5443583118001722

#### Classification Report:

                   precision    recall  f1-score   support

               0       0.55      0.74      0.63       606
               1       0.54      0.33      0.41       555

        accuracy                           0.54      1161
       macro avg       0.54      0.54      0.52      1161
    weighted avg       0.54      0.54      0.52      1161

### ARIMA Model

-  **Root Mean Squared Error (RMSE):** 14688.52


### LSTM Model

- **Epochs:** 20
- **Root Mean Squared Error (RMSE):** 26464.03644681247

### Prophet Model

- **Root Mean Squared Error (RMSE):** 29738.195598 

The Gradient Boosting Classifier achieved an accuracy of 0.5444. Among the forecasting models, ARIMA had an RMSE of 14688.52, the LSTM model, trained for 20 epochs, had a final RMSE of 26464.04, and the Prophet model had an RMSE of 29738.20. Further evaluation and comparison with other models can provide insights into their performance.


Lower RMSE values indicate better predictive performance. Of the RMSE values reported above, the ARIMA model's (14688.52) is the lowest, followed by the LSTM model (26464.04) and the Prophet model (29738.20). The LSTM model scores better than Prophet and is the model carried forward for predicting the short-term movement of the Dow Jones Industrial Average.

## Investor Guidance

Investors are encouraged to use the predictions from the LSTM model as one input to their decision-making. Despite the inherent uncertainty in financial markets, the Root Mean Squared Error (RMSE) of the LSTM model, which is lower than that of the Prophet model, indicates its relative accuracy in forecasting DJIA movements.

It is advisable to integrate these predictions with a comprehensive approach that includes both fundamental and technical analyses. The combination of machine learning predictions with traditional methods enhances the depth of insight, aiding investors in making well-informed decisions.

Given the dynamic nature of financial markets, investors should maintain a proactive stance. Continuous monitoring and adaptable strategies, responsive to real-time market conditions, are crucial for effective decision-making.

Incorporating machine learning predictions into investment strategies provides a catalyst for a more nuanced understanding of potential market trends. This empowers investors to optimize their overall portfolio management, potentially leading to more informed and strategic investment outcomes.