## Prediction of Stock Prices Using Deep Learning

Add indicators like sell and buy on chart

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import warnings
import numpy as np
from numpy import array
from importlib import reload # to reload modules if we made changes to them without restarting kernel
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier # for features importance

In [None]:
warnings.filterwarnings('ignore')
plt.rcParams['figure.dpi'] = 227

In [None]:
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from sklearn.metrics import mean_squared_error, confusion_matrix, f1_score, accuracy_score
from pandas.plotting import autocorrelation_plot

import functions
import plotting

In [None]:
import tensorflow.keras as keras
from tensorflow.python.keras.optimizer_v2 import rmsprop
from functools import partial
from tensorflow.keras import optimizers
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Flatten, TimeDistributed, LSTM, Dense, Bidirectional, Dropout, ConvLSTM2D, Conv1D, GlobalMaxPooling1D, MaxPooling1D, Convolution1D, BatchNormalization, LeakyReLU
from bayes_opt import BayesianOptimization

from tensorflow.keras.utils import plot_model

Tell the `numpy` library to use the number 66 as its random seed. This means that every time the program is run, it will generate the same sequence of random numbers.

In [None]:
np.random.seed(66)

Seeding is useful in machine learning because it makes results reproducible each time the notebook is run, which allows different models and algorithms to be compared fairly. Note that this call seeds only NumPy's global generator; other libraries, such as TensorFlow, keep their own random state.
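
As a small illustration of why a fixed seed gives repeatable results, the sketch below draws the same ten coin flips twice. It uses two local `RandomState` objects (an illustrative choice, so the notebook's global generator is left untouched) seeded with the same value.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.RandomState(66)
rng_b = np.random.RandomState(66)

first = rng_a.randint(0, 2, 10)
second = rng_b.randint(0, 2, 10)

# Identical seeds produce identical sequences
print(np.array_equal(first, second))
```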

### Loading Data
Reading stock data:

In [None]:
files = os.listdir('data/stocks')
stocks = {}
for file in files:
    # splitext copes with names that have no extension or several dots
    name, ext = os.path.splitext(file)
    if ext == '.csv':
        stocks[name] = pd.read_csv(os.path.join('data/stocks', file), index_col='Date')
        stocks[name].index = pd.to_datetime(stocks[name].index)

### Baseline Model

The baseline prediction model is simply a model that predicts that the stock price will go up or down with a 50% chance. 

The accuracy of the baseline model is calculated with the `accuracy_score()` function, which takes the true labels and the predicted labels and returns the fraction of predictions that were correct.

In [None]:
def baseline_model(stock):
    baseline_predictions = np.random.randint(0, 2, len(stock))
    accuracy = accuracy_score(functions.binary(stock), baseline_predictions)
    return accuracy

In [None]:
baseline_accuracy = baseline_model(stocks['tsla'].Return)
print('Baseline model accuracy: {:.1f}%'.format(baseline_accuracy * 100))
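
The same idea can be tried without the stock files. The sketch below is only an illustration: `to_binary` is a hypothetical stand-in for `functions.binary` (whose exact behaviour is not shown in this notebook), assumed here to map positive returns to 1 and everything else to 0, and the returns are randomly generated.

```python
import numpy as np

def to_binary(returns):
    """Hypothetical stand-in for functions.binary: 1 if the return is positive, else 0."""
    return (np.asarray(returns) > 0).astype(int)

def coin_flip_accuracy(returns, rng):
    """Fraction of days on which a random up/down guess matches the true direction."""
    labels = to_binary(returns)
    guesses = rng.randint(0, 2, len(labels))
    return float((guesses == labels).mean())

rng = np.random.RandomState(66)
synthetic_returns = rng.normal(0.0, 0.02, 2000)  # made-up daily returns
acc = coin_flip_accuracy(synthetic_returns, rng)
print('Coin-flip accuracy: {:.1f}%'.format(acc * 100))
```

Any model worth keeping should beat this figure by more than the run-to-run noise of the coin flips.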

### Accuracy Distribution

Visualize the accuracy of the baseline model over 1,000 repeated runs. The histogram shows how many runs achieved each accuracy level, and the dashed vertical line marks the mean accuracy across runs.

In [None]:
base_preds = []
for i in range(1000):
    base_preds.append(baseline_model(stocks['tsla'].Return))
    
plt.figure(figsize=(16,6))
plt.style.use('seaborn-whitegrid')
plt.hist(base_preds, bins=50, facecolor='#4ac2fb')
plt.title('Baseline Model Accuracy', fontsize=15)
plt.axvline(np.array(base_preds).mean(), c='k', ls='--', lw=2)
plt.show()
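
The width of this histogram can be anticipated from the binomial distribution: with `n` labelled days and a 50% guess, the accuracy has a standard deviation of roughly `0.5 / sqrt(n)`. The sketch below checks this on made-up labels; `n_days` is an assumed sample size, not the length of the Tesla data.

```python
import numpy as np

n_days = 2000   # assumed number of labelled days
n_runs = 1000   # number of repeated baseline runs

rng = np.random.RandomState(66)
labels = rng.randint(0, 2, n_days)              # made-up up/down labels
guesses = rng.randint(0, 2, (n_runs, n_days))   # one row of coin flips per run
accuracies = (guesses == labels).mean(axis=1)   # accuracy of each run

theoretical_std = 0.5 / np.sqrt(n_days)         # sqrt(p * (1 - p) / n) with p = 0.5
print('mean accuracy   : {:.4f}'.format(accuracies.mean()))
print('empirical std   : {:.4f}'.format(accuracies.std()))
print('theoretical std : {:.4f}'.format(theoretical_std))
```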

### ARIMA

In [None]:
print('Tesla historical data contains {} entries'.format(stocks['tsla'].shape[0]))
stocks['tsla'][['Return']].head()

### Autocorrelation

Plot the autocorrelation function (ACF) of the returns of `Tesla` stock.

The ACF measures how correlated a stock’s returns are with its own past returns at different time lags. The plot shows how this correlation changes with the lag, for lags from 0 to 299 days.

The ACF can be used to analyze historical stock returns to identify patterns in the stock’s price movements.

For example, a positive ACF at a lag of 1 day suggests that the stock is more likely to go up if it went up on the previous day. This information could be used to develop a trading strategy.

In [None]:
plt.rcParams['figure.figsize'] = (16, 3)
plot_acf(stocks['tsla'].Return, lags=range(300))
plt.show()
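
To make the quantity on the y-axis concrete, here is a minimal sketch of the sample autocorrelation at a single lag, computed by hand with NumPy. The series is a synthetic AR(1) process (an assumption chosen so the expected answer is known), not the Tesla returns.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.dot(centered[:-lag], centered[lag:]) / np.dot(centered, centered))

# Synthetic AR(1) series: each value is 0.6 times the previous one plus noise
rng = np.random.RandomState(66)
phi = 0.6
series = np.zeros(5000)
for t in range(1, len(series)):
    series[t] = phi * series[t - 1] + rng.normal()

print('lag 1: {:.2f}'.format(autocorr(series, 1)))  # expected to be close to phi
print('lag 2: {:.2f}'.format(autocorr(series, 2)))  # expected to be close to phi ** 2
```

For an AR(1) process the autocorrelation decays geometrically with the lag, which is the kind of pattern to look for in the plot above.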

In [None]:
# statsmodels.tsa.arima_model.ARIMA is no longer functional in recent
# statsmodels releases; its replacement is imported here and overrides
# the earlier import
from statsmodels.tsa.arima.model import ARIMA

orders = [(0,0,0),(1,0,0),(0,1,0),(0,0,1),(1,1,0)]

train = list(stocks['tsla']['Return'][1000:1900].values)
test = list(stocks['tsla']['Return'][1900:2300].values)

all_predictions = {}

for order in orders:
    try:
        history = train.copy()
        order_predictions = []
        
        for i in range(len(test)):
            
            model = ARIMA(history, order=order) # defining ARIMA model
            model_fit = model.fit() # fitting model
            y_hat = model_fit.forecast() # one-step-ahead forecast of 'return'
            order_predictions.append(y_hat[0]) # forecast() returns a 1-element array
            history.append(test[i]) # simply adding following day 'return' value to the model    
            print('Prediction: {} of {}'.format(i+1,len(test)), end='\r')
        
        accuracy = accuracy_score( 
            functions.binary(test), 
            functions.binary(order_predictions) 
        )        
        print('                             ', end='\r')
        print('{} - {:.1f}% accuracy'.format(order, round(accuracy, 3)*100), end='\n')
        all_predictions[order] = order_predictions
    
    except Exception as exc:
        # report why this order failed instead of silently swallowing the error
        print(order, '<== Wrong Order:', exc, end='\n')
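
The loop above follows a walk-forward pattern: forecast one step, then append the true value to the history before forecasting the next step. The sketch below isolates that pattern. To stay dependency-free it swaps ARIMA for a trailing-mean forecaster, which is only a placeholder, and it runs on made-up returns.

```python
import numpy as np

def walk_forward(train, test, forecast_fn):
    """Forecast each test point from all data seen so far, then reveal the true value."""
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(forecast_fn(history))
        history.append(actual)  # the true value becomes available the next day
    return np.array(predictions)

def trailing_mean(history, window=5):
    """Placeholder forecaster (not ARIMA): mean of the last `window` observations."""
    return float(np.mean(history[-window:]))

rng = np.random.RandomState(66)
returns = rng.normal(0.0, 0.02, 300)   # made-up daily returns
train, test = returns[:200], returns[200:]

preds = walk_forward(train, test, trailing_mean)
direction_hits = ((preds > 0) == (test > 0)).mean()
print('Directional accuracy: {:.1f}%'.format(direction_hits * 100))
```

Because every forecast uses only data available before the day being predicted, this evaluation does not leak future information into the model.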

### Review Predictions

Plot the actual and predicted daily returns over the test period.

In [None]:
fig = plt.figure(figsize=(16,4))
plt.plot(test, label='Test', color='#4ac2fb')
plt.plot(all_predictions[(0,1,0)], label='Predictions', color='#ff4e97')
plt.legend(frameon=True, loc=1, ncol=1, fontsize=10, borderpad=.6)
plt.title('Arima Predictions', fontsize=15)
plt.xlabel('Days', fontsize=13)
plt.ylabel('Returns', fontsize=13)

In [None]:
plt.annotate('',
             xy=(15, 0.05), 
             xytext=(150, .2), 
             fontsize=10, 
             arrowprops={'width':0.4,'headwidth':7,'color':'#333333'}
            )
ax = fig.add_subplot(1, 1, 1)
rect = patches.Rectangle((0,-.05), 30, .1, ls='--', lw=2, facecolor='y', edgecolor='k', alpha=.5)
ax.add_patch(rect)

In [None]:
plt.axes([.25, 1, .2, .5])
plt.plot(test[:30], color='#4ac2fb')
plt.plot(all_predictions[(0,1,0)][:30], color='#ff4e97')
plt.tick_params(axis='both', labelbottom=False, labelleft=False)
plt.title('Lag')
plt.show()

### Histogram

This code creates a histogram plot that compares the distribution of the actual and predicted stock returns for Tesla stock. The data used for the histogram is a subset of the Tesla stock returns from index 1900 to 2300.

The actual stock returns are plotted in blue, while the predicted stock returns are plotted in pink with some transparency. A vertical dashed line at 0 is also plotted.

In [None]:
plt.figure(figsize=(16,5))
plt.hist(stocks['tsla'][1900:2300].reset_index().Return, bins=20, label='True', facecolor='#4ac2fb')
plt.hist(all_predictions[(0,1,0)], bins=20, label='Predicted', facecolor='#ff4e97', alpha=.7)
plt.axvline(0, c='k', ls='--')
plt.title('ARIMA True vs Predicted Values Distribution', fontsize=15)
plt.legend(frameon=True, loc=1, ncol=1, fontsize=10, borderpad=.6)
plt.show()

- If the distribution of the predicted values is very similar to the distribution of the actual values, the model captures the overall spread of returns. Matching distributions alone do not guarantee accurate day-by-day predictions, so this check complements the accuracy score rather than replacing it.

- If the distribution of the predicted values is significantly different from the distribution of the actual values, for example much narrower, the model is not reproducing the behaviour of the actual returns.

### Sentiment Analysis

Sentiment analysis is a technique for extracting the sentiment of a piece of text, such as whether it is positive, negative, or neutral.

In [None]:
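# parse the index as dates so it matches the datetime index of the stock data
tesla_headlines = pd.read_csv('data/tesla_headlines.csv', index_col='Date', parse_dates=True)

# average the sentiment of all headlines published on the same day
tesla = stocks['tsla'].join(tesla_headlines.groupby('Date').Sentiment.mean())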

Combining the stock data with the sentiment data can provide valuable insights into how news headlines may be impacting Tesla’s stock price.

In [None]:
tesla.fillna(0, inplace=True)

Machine learning algorithms can be trained on the combined stock and sentiment data to learn to predict future stock prices.

In [None]:
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(16,6))
plt.plot(tesla.loc['2019-01-10':'2019-09-05'].Sentiment.shift(1), c='#3588cf', label='News Sentiment')
plt.plot(tesla.loc['2019-01-10':'2019-09-05'].Return, c='#ff4e97', label='Return')
plt.legend(frameon=True, fancybox=True, framealpha=.9, loc=1)
plt.title('Tesla News Sentiment and Daily Return', fontsize=15)
plt.show()

In [None]:
pd.DataFrame({
    'Sentiment': tesla.loc['2019-01-10':'2019-09-05'].Sentiment.shift(1), 
    'Return': tesla.loc['2019-01-10':'2019-09-05'].Return}).corr()

The correlation coefficient is a measure of the strength and direction of the relationship between two variables.

A correlation coefficient of 1 indicates a perfect positive correlation, while a correlation coefficient of -1 indicates a perfect negative correlation.

A correlation coefficient of 0 indicates no correlation.

The correlation coefficient between the Sentiment and Return columns will indicate the strength and direction of the relationship between news sentiment and stock returns for Tesla stock.

A positive correlation coefficient would suggest that positive news sentiment tends to be followed by positive stock returns, while a negative correlation coefficient would suggest the opposite: positive news sentiment tends to be followed by negative returns, and negative sentiment by positive returns.
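
The `shift(1)` in the cells above pairs each day's return with the previous day's sentiment. The sketch below shows the effect of that alignment on made-up data in which, by construction, today's return depends on yesterday's sentiment. All series and coefficients here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(66)
dates = pd.date_range('2019-01-10', periods=200, freq='D')

# Made-up data: today's return depends on yesterday's sentiment plus noise
sentiment = pd.Series(rng.normal(size=200), index=dates)
noise = pd.Series(rng.normal(size=200), index=dates)
returns = 0.5 * sentiment.shift(1) + 0.1 * noise

same_day = sentiment.corr(returns)          # sentiment and return on the same day
lagged = sentiment.shift(1).corr(returns)   # yesterday's sentiment vs today's return
print('same-day correlation: {:.2f}'.format(same_day))
print('lagged correlation  : {:.2f}'.format(lagged))
```

Shifting matters for prediction as well: only sentiment that is known before the trading day can be used as an input for that day's return.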

### Feature Selection With XGBoost

Scaling is a common technique used in machine learning to prepare data for training and prediction. Scaling helps to ensure that all of the features in the data have a similar scale, which can improve the performance of machine learning algorithms.

In [None]:
scaled_tsla = functions.scale(stocks['tsla'], scale=(0,1))
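
The internals of `functions.scale` are not shown in this notebook. Assuming it performs ordinary min-max scaling into the given range, the transformation can be sketched as follows; `min_max_scale` and the sample values are hypothetical.

```python
import numpy as np

def min_max_scale(values, feature_range=(0, 1)):
    """Rescale each column linearly so its minimum and maximum hit feature_range.
    Assumes no column is constant (that would divide by zero)."""
    values = np.asarray(values, dtype=float)
    low, high = feature_range
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    return low + (values - col_min) * (high - low) / (col_max - col_min)

# Two made-up features on very different scales, e.g. a price and a volume
raw = np.array([[250.0, 1.0e6],
                [300.0, 4.0e6],
                [275.0, 2.5e6]])
scaled = min_max_scale(raw)
print(scaled)
```

After scaling, both columns span the same range, so neither dominates purely because of its units.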

In [None]:
X = scaled_tsla[:-1]
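# a classifier needs class labels: 1 if the next day's return is positive, else 0
y = (stocks['tsla'].Return.shift(-1)[:-1] > 0).astype(int)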

In [None]:
xgb = XGBClassifier()
xgb.fit(X[1500:], y[1500:])

Create an instance of the `XGBClassifier` class, which is a machine learning algorithm used for classification.

The fit() method is then called on the `XGBClassifier` instance, passing in a subset of the input data X and corresponding labels y. This fit() method trains the classifier on the provided data, allowing it to learn patterns and make predictions.

The `XGBClassifier` algorithm is a type of gradient boosting algorithm, which is a machine learning technique that combines the predictions of multiple weak learners to produce a strong learner.

Gradient boosting algorithms are known for their ability to handle complex data and achieve high accuracy on a variety of machine learning tasks, including classification and regression.

The fit() method is a critical step in the machine learning process. It is during this step that the classifier learns the relationship between the input data X and the output labels y. Once the classifier is trained, it can be used to make predictions on new data.

The `XGBClassifier` algorithm is a powerful tool for machine learning. It can be used to solve a wide variety of classification problems, including spam filtering, fraud detection, and medical diagnosis.

### Deep Neural Networks

In [None]:
n_steps = 21
scaled_tsla = functions.scale(stocks['tsla'], scale=(0,1))


X_train, \
y_train, \
X_test, \
y_test = functions.split_sequences(
                        
    scaled_tsla.to_numpy()[:-1], 
    stocks['tsla'].Return.shift(-1).to_numpy()[:-1], 
    n_steps, 
    split=True, 
    ratio=0.8
)

Prepare the data for stock prediction using machine learning by scaling the input data and splitting it into training and testing sets.

The training set will be used to train the machine learning model, and the testing set will be used to evaluate the performance of the model.
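
The behaviour of `functions.split_sequences` is not shown here. A common way to build LSTM inputs, assumed in the sketch below, is a sliding window: each sample contains `n_steps` consecutive rows of features, and the data is then split chronologically. `make_windows` and all values are hypothetical.

```python
import numpy as np

def make_windows(features, targets, n_steps):
    """Hypothetical windowing helper: each sample holds n_steps consecutive rows,
    and its target is the value aligned with the last row of the window."""
    X, y = [], []
    for end in range(n_steps, len(features) + 1):
        X.append(features[end - n_steps:end])
        y.append(targets[end - 1])
    return np.array(X), np.array(y)

features = np.arange(20, dtype=float).reshape(10, 2)  # 10 days, 2 made-up features
targets = np.arange(10, dtype=float)                  # made-up next-day returns

X_all, y_all = make_windows(features, targets, n_steps=3)
split = int(len(X_all) * 0.8)                         # chronological 80/20 split
X_tr, X_te = X_all[:split], X_all[split:]
y_tr, y_te = y_all[:split], y_all[split:]
print(X_tr.shape, X_te.shape)  # (samples, n_steps, n_features)
```

Splitting by position rather than at random keeps every test window later in time than the training windows.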

### LSTM Network

Implement a stock prediction model with an LSTM network in `Keras`.

In [None]:
keras.backend.clear_session()

n_steps = X_train.shape[1]
n_features = X_train.shape[2]


model = Sequential()
model.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(50, activation='relu', return_sequences=False))
model.add(Dense(10))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

A Sequential model is a linear stack of layers. The model architecture consists of two LSTM (Long Short-Term Memory) layers.

LSTM layers are a type of recurrent neural network (RNN) that are well-suited for time series forecasting tasks. LSTM layers are able to learn long-term dependencies in the data, which is important for stock prediction.

The first LSTM layer in the model has 100 units and uses the ReLU activation function. It returns sequences of outputs. The input shape is determined by the number of time steps and features.

The second LSTM layer in the model has 50 units and also uses the ReLU activation function. It does not return sequences of outputs.

After the LSTM layers, there are two Dense layers. Dense layers are a type of fully connected neural network layer. The first Dense layer in the model has 10 units. The second Dense layer has 1 unit.

The second Dense layer outputs the model’s prediction of the next day’s return, which is the target the network is trained on.

The model is compiled using the `Adam Optimizer`, `mean squared error (mse)` as the loss function, and `mean absolute error (mae)` as the metric.

The Adam optimizer is a popular optimizer for training machine learning models. It is known for its ability to converge quickly to good solutions.

`Mean squared error (mse)` is a common loss function for regression tasks. It measures the average squared difference between the predicted values and the actual values.

`Mean absolute error (mae)` is another common error measure for regression tasks, used here as a monitoring metric rather than as the loss. It measures the average absolute difference between the predicted values and the actual values.
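
As a small worked example of the two error measures, the sketch below computes both by hand on three made-up return values.

```python
import numpy as np

y_true = np.array([0.01, -0.02, 0.03])  # made-up actual returns
y_pred = np.array([0.00, -0.01, 0.05])  # made-up predictions

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # squares the errors, so large misses dominate
mae = np.mean(np.abs(errors))   # averages absolute errors, in the target's units
print('MSE: {:.6f}'.format(mse))
print('MAE: {:.6f}'.format(mae))
```

Because MSE squares each error, the single 0.02 miss contributes twice as much to it as the two 0.01 misses combined, while in MAE it contributes exactly as much as those two together.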

In [None]:
## Generate a summary of a machine learning model
model.summary()
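
The parameter counts in the summary can be checked by hand. The sketch below applies the standard formulas for LSTM and Dense layers to the architecture defined above. The value of `n_features` is an assumption for illustration, since the real one comes from `X_train.shape[2]`.

```python
def lstm_params(units, input_dim):
    """Trainable weights in a standard LSTM layer: four gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """Trainable weights in a Dense layer: a weight matrix plus a bias vector."""
    return units * input_dim + units

n_features = 5  # assumed value; the real one is X_train.shape[2]
layers = [
    ('LSTM(100)', lstm_params(100, n_features)),
    ('LSTM(50)', lstm_params(50, 100)),
    ('Dense(10)', dense_params(10, 50)),
    ('Dense(1)', dense_params(1, 10)),
]
for name, count in layers:
    print('{:<10} {:>7,}'.format(name, count))
total = sum(count for _, count in layers)
print('{:<10} {:>7,}'.format('Total', total))
```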

In [None]:
model.fit(X_train, y_train, epochs=100, verbose=0, validation_data=(X_test, y_test), use_multiprocessing=True)

In [None]:
plt.figure(figsize=(16,4))
plt.plot(model.history.history['loss'], label='Loss')
plt.plot(model.history.history['val_loss'], label='Val Loss')
plt.legend(loc=1)
plt.title('LSTM - Training Process')
plt.show()

- Epochs: An epoch is a single iteration over the entire training dataset. During each epoch, the model is trained on all of the training data.

- Validation: Validation is the process of evaluating the performance of a machine learning model on a dataset that it has not been trained on. This helps to ensure that the model is not overfitting to the training data.

- Multiprocessing: Multiprocessing is a technique that allows multiple processes to run simultaneously. This can improve the performance of machine learning algorithms, especially when training on large datasets.

- Loss: The loss value is a measure of the difference between the predicted and actual values during training. The goal of training a machine learning model is to minimize the loss value.

- Validation loss: The validation loss value is a measure of the difference between the predicted and actual values during validation.
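
One way to read the training curves is to look for the epoch where the validation loss is lowest and to compare the final training and validation losses. The sketch below does this on made-up numbers standing in for `model.history.history`; the values are illustrative only.

```python
import numpy as np

# Made-up loss curves standing in for model.history.history
history = {
    'loss':     [0.050, 0.030, 0.020, 0.015, 0.012, 0.010],
    'val_loss': [0.055, 0.035, 0.028, 0.027, 0.030, 0.036],
}

val_loss = np.array(history['val_loss'])
best_epoch = int(val_loss.argmin())        # zero-based index of the best epoch
gap = val_loss[-1] - history['loss'][-1]   # final validation loss minus training loss

print('Lowest validation loss at epoch', best_epoch + 1)
print('Final train/validation gap: {:.3f}'.format(gap))
```

A validation loss that rises while the training loss keeps falling, as in these made-up curves, is the usual sign of overfitting.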

In [None]:
pred, y_true, y_pred = functions.evaluation(
                    X_test, y_test, model, random=False, n_preds=50, 
                    show_graph=True)