# Stock Price Prediction for Five Commodity-Based Companies  

The investment fund management team would like to gain the ability to analyze time series data for stock price forecasting. As a first use case, they would like to start predicting the value of stocks of commodity-producing companies, based on historical data on some specific stocks.

A script executed below will analyze the stock price history of the following commodity-based companies:

- SuperPower Batteries (SUBAT): a company that produces clean energy by harvesting the enthusiasm emitted from educational gameplay; 

- Jack & Jill (JAJIL): this company is among the largest suppliers of bulk hill and island building materials;  

- Voyager (VGER6): the largest Western manufacturer of the refined metals used in the construction of flying game drones;  

- Sabre Feeds (SABRE): this company is one of the largest producers of grain-based animal feedstocks in the Americas;  

- CloudAir (CLAIR): this company is considered the largest producer of rarified gasses in the world.  

The data contains 7 features for each day: the lowest, highest, open, close and adjusted close prices, as well as the volume and the ticker.

In [None]:
#ensure we have the latest pip
%pip install --upgrade pip

In [None]:
# ensure our application has all of the libraries and versions it requires to run
%pip install -U sagemaker
%pip install botocore
%pip install --upgrade awscli
%pip install tensorflow
%pip install s3fs
%pip install matplotlib
%pip install plotly
%pip install nbformat

In [None]:
# load needed packages and utilities
import numpy as np
import pandas as pd
import tensorflow as tf
import datetime
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import os
import json
import nbformat
import sys 
import _strptime
import _datetime

#import specific packages
from datetime import date
from plotly.subplots import make_subplots
from tensorflow import keras
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

%matplotlib inline


In [None]:
# save the data bucket name here
lab_data_bucket_name = "<define the data bucket name here>"

In [None]:
# listing the companies and gathering the data
stock_list = ["SUBAT.CQ", "JAJIL.CQ", "VGER6.CQ", "SABRE.CQ", "CLAIR.CQ"]
stock_data_url = f"s3://{lab_data_bucket_name}/finance/stock/stock.parquet"


In [None]:
#read stocks into data frame
df = pd.read_parquet(stock_data_url)
df.head()

For the sake of convenience in the later steps, let's scale the adjclose for JAJIL now. This will result in better efficiency in the models as well as allow us to compare the prices in a relative way, which makes the performance easier to visualize.

In [None]:
scaler = MinMaxScaler()
index = df[df.symbol == "JAJIL.CQ"].index
df.loc[index, "adjclose"] = scaler.fit_transform(df.loc[index].adjclose.values.reshape(-1, 1))
df.loc[index]
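
As a rough illustration of what the scaling step does, the sketch below reproduces min-max scaling with plain NumPy: each value `x` is mapped to `(x - min) / (max - min)`, which is what `MinMaxScaler` computes with its default `(0, 1)` range. The helper names and the sample prices are hypothetical and are not part of this notebook.

```python
import numpy as np

def minmax_fit(values):
    """Return the (min, max) pair needed to scale and un-scale a series."""
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()

def minmax_transform(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

def minmax_inverse(scaled, lo, hi):
    """Undo minmax_transform, returning values in the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

# Hypothetical adjusted close prices, for illustration only
demo_prices = np.array([10.0, 12.5, 15.0, 20.0])
lo, hi = minmax_fit(demo_prices)
demo_scaled = minmax_transform(demo_prices, lo, hi)
demo_restored = minmax_inverse(demo_scaled, lo, hi)
print(demo_scaled)
print(demo_restored)
```

The inverse step matters later in the notebook: predictions made on the scaled target have to be mapped back to price units before they are reported.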

In [None]:
df.shape

In [None]:
#import again here to make it available within the scope of this code block
from datetime import datetime

# Splitting the data into train and test
def split_data(df, company_list, prediction_length, startdate = '2018-01-02'):
    """
        Receive a dataframe with one or more companies, as well as a company list, and split the data
        for each company into train and test sets.
        
        Inputs:
        - df: a dataframe containing at least the date, symbol and target (adjclose) columns
        - company_list: a list of companies present in the df. Each one will be split and formatted
        - prediction_length: the number of timestamps that should be held out as test data
        - startdate: the start of our dataset as a 'YYYY-MM-DD' string. Only rows after this date are used
        
        Returns:
        2 dictionaries containing the train and test datasets for each company. The datasets contain just
        the date column as well as the adjclose (target) column.
    """
    startdate = datetime.strptime(startdate, '%Y-%m-%d').date()
    
    train = {}
    test = {} 

    for company in company_list:
        # Filter once per company, then hold out the last prediction_length rows as test data
        company_df = df[(df.symbol == company) & (df.date > startdate)][["date", "adjclose"]]
        train[company] = company_df[:-prediction_length]
        test[company] = company_df[-prediction_length:]

    return train, test
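
The core idea of the split is a chronological hold-out: the last rows of a time-ordered series become the test set, and everything before them becomes the training set. A minimal sketch of that idea on a made-up frame follows; `holdout_last_n` and the sample data are hypothetical and only illustrate the slicing.

```python
import pandas as pd

def holdout_last_n(frame, n_test):
    """Split a time-ordered frame into (train, test), keeping the last n_test rows as test."""
    if not 0 < n_test < len(frame):
        raise ValueError("n_test must be between 1 and len(frame) - 1")
    return frame.iloc[:-n_test], frame.iloc[-n_test:]

# Hypothetical daily series with 10 rows
demo = pd.DataFrame({
    "date": pd.date_range("2021-01-04", periods=10, freq="D"),
    "adjclose": [float(i) for i in range(10)],
})
demo_train, demo_test = holdout_last_n(demo, n_test=3)
print(len(demo_train), len(demo_test))
```

Unlike a random split, this keeps every test date after every training date, so the model is never evaluated on data older than what it was trained on.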

In [None]:
# Defining the forecast horizon once so it can be reused in later steps
timespan = 90
prediction_length = timespan

# Splitting the data
train, test = split_data(df, stock_list, prediction_length)

In [None]:
train

In [None]:
test

### Upload to S3

In order to train a model in SageMaker, we need to first upload the data to an S3 bucket.

In [None]:
# Saving the train and test data in the data folder (created if it does not exist yet)
os.makedirs("./data", exist_ok = True)
for stock in stock_list:
    train[stock].to_csv("./data/train_{}.csv".format(stock[:4].lower()), index = False)
    test[stock].to_csv("./data/test_{}.csv".format(stock[:4].lower()), index = False)

In [None]:
# Importing general AWS session configuration
import boto3
import sagemaker
from sagemaker import get_execution_role

session = sagemaker.Session(default_bucket=lab_data_bucket_name)
role = get_execution_role()

bucket = session.default_bucket()


In [None]:
# Creating specific configuration
prefix = "stock-price-forecast-project"
data_dir = "./data"
paths = {}

# Locating the data on disk
train_key = os.path.join(data_dir, "train_{}.csv".format("jaji"))
test_key = os.path.join(data_dir, "test_{}.csv".format("jaji"))

# Path where the files will be saved
train_prefix = "{}/{}".format(prefix, "train_{}".format("jaji"))
test_prefix = "{}/{}".format(prefix, "test_{}".format("jaji"))

# Uploading to S3
paths["train"] = session.upload_data(train_key, bucket = bucket, key_prefix = train_prefix)
paths["test"] = session.upload_data(test_key, bucket = bucket, key_prefix = test_prefix)

## Model Building

Now, we will build two models to predict stock prices and compare them.  

The timespan that we are interested in is 3 months, so for each model we are going to compare RMSE and MAE. We will also visualize the quality of the predictions by using a line graph with the prediction and real values for the last 90 days.
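
For reference, both metrics can be computed by hand. RMSE is the square root of the mean squared error, and MAE is the mean of the absolute errors. The sketch below uses made-up numbers and hypothetical helper names, not values from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y_true - y_pred|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical real and predicted prices
demo_actual = [10.0, 12.0, 14.0, 16.0]
demo_forecast = [11.0, 12.0, 13.0, 20.0]
demo_rmse = rmse(demo_actual, demo_forecast)
demo_mae = mae(demo_actual, demo_forecast)
print(demo_rmse, demo_mae)
```

Because the errors are squared before averaging, RMSE weighs a single large miss (the last point here) more heavily than MAE does, which is why the two are reported together.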

## Random Forest Regressor - Baseline

To start our model development, it is standard practice to build a baseline model first, so that later models can be compared against it and we can tell whether we are making progress.   

For this task we will create three types of basic features:
- Difference from the previous row
- Lags of the original target
- Moving average

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Let's concatenate the train and test dataframes so feature engineering is done just once
df_rf = pd.concat([train["JAJIL.CQ"], test["JAJIL.CQ"]])
df_rf.index = df_rf.date
df_rf = df_rf.drop("date", axis = 1)
df_rf.head()

Now let's develop the features of these models...

In [None]:
# Applying the diff to the data
df_rf["adj_close_diff"] = df_rf.adjclose.diff()

# Creating 5 lags to start with
for i in range(5, 0, -1):
    df_rf['t-' + str(i)] = df_rf.adjclose.shift(i)

# Moving average over the last 14 rows
df_rf["rolling"] = df_rf.adjclose.rolling(window = 14).mean()
    
df_rf.dropna(inplace = True)
df_rf.head()
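
To make the three feature types concrete, here is a small sketch on a six-row toy series, using a shorter window than the notebook does. The frame and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical target series
demo = pd.DataFrame({"adjclose": [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]})

# Difference between each row and the previous one
demo["diff"] = demo["adjclose"].diff()

# Lags: the target value 1 and 2 rows earlier
demo["t-1"] = demo["adjclose"].shift(1)
demo["t-2"] = demo["adjclose"].shift(2)

# Moving average over the current row and the 2 rows before it
demo["rolling"] = demo["adjclose"].rolling(window=3).mean()

# The first rows have no full history, so they contain NaN and are dropped
demo_clean = demo.dropna()
print(demo_clean)
```

Note that `diff()` and `rolling()` as written include the current row's value. Applying `.shift(1)` to those columns is one way to restrict the features to past information only.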

Now that we have a dataframe to work with, we can feed it into a basic random forest regressor. But first, let's split the data into train and test versions.

In [None]:
X_train_rf = df_rf.iloc[:-timespan].drop("adjclose", axis = 1)
y_train_rf = df_rf.iloc[:-timespan].adjclose

X_test_rf = df_rf.iloc[-timespan:].drop("adjclose", axis = 1)
y_test_rf = df_rf.iloc[-timespan:].adjclose

In [None]:
# Checking
X_train_rf.shape, X_test_rf.shape, y_train_rf.shape, y_test_rf.shape

In [None]:
# Instantiating a Regressor
regressor_rf = RandomForestRegressor(n_estimators = 1000)

# Training
regressor_rf.fit(X_train_rf, y_train_rf)

In [None]:
prediction_rf = regressor_rf.predict(X_test_rf)

Now that we have the predictions, let's transform them back to their un-scaled form.

In [None]:
prediction_rf = scaler.inverse_transform(prediction_rf.reshape(1, -1))
y_test_rf = scaler.inverse_transform(y_test_rf.values.reshape(1, -1))

In [None]:
# RMSE
rf_RMSE = np.sqrt(mean_squared_error(y_test_rf, prediction_rf))
rf_RMSE

In [None]:
# MAE
rf_MAE = mean_absolute_error(y_test_rf, prediction_rf)
rf_MAE

Now, let's visualize this behaviour in a line graph.

In [None]:
trace1 = go.Scatter(x = X_test_rf.index, y = y_test_rf[0],
                   mode = 'lines',
                   name = 'Real Price')

trace2 = go.Scatter(x = X_test_rf.index, y = prediction_rf[0],
                    mode = "lines",
                    name = "Predicted Price")

layout = go.Layout(title = "Real Price vs Predicted Price using Random Forest Regressor",
                   width = 1000, height = 600)

fig = go.Figure(data = [trace1, trace2], layout = layout)
fig.show()

## LSTM Model with TensorFlow


In [None]:
from sagemaker.tensorflow import TensorFlow

In [None]:
from sagemaker.predictor import Predictor

In [None]:
# Setting up the output path
output_path = "s3://{}/{}/output".format(bucket, prefix)

# Setting the instance type, batch size, and epoch size variables
TF_FRAMEWORK_VERSION = '2.11.0'
instancetype = "ml.m5.xlarge" 
batchsize = 32 
epochsize = 25 

regressor_tf = TensorFlow(
    entry_point='train.py',
    role=role,
    framework_version=TF_FRAMEWORK_VERSION,
    model_dir = False,
    py_version='py39',
    instance_type=instancetype,
    instance_count=1,
    output_path=output_path,
    hyperparameters={
        'batch-size':batchsize,
        'epochs':epochsize})
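
The `train.py` entry point is not shown in this notebook. As a hedged sketch of how such a script might read its configuration: in SageMaker script mode, hyperparameters are generally passed to the entry point as command-line arguments, and input channels are exposed through `SM_CHANNEL_<NAME>` environment variables. The argument names `--train` and `--sm-model-dir`, the fallback paths, and the function name are assumptions, not the actual contents of `train.py`.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse hyperparameters and data locations the way a training script might."""
    parser = argparse.ArgumentParser()

    # Hyperparameters, matching the keys given to the estimator above
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=25)

    # Data and model locations (assumed names, with local fallbacks for testing)
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--sm-model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))

    # Ignore any extra arguments the platform may pass
    args, _ = parser.parse_known_args(argv)
    return args

# Simulating the arguments that correspond to the hyperparameters above
demo_args = parse_args(["--batch-size", "32", "--epochs", "25"])
print(demo_args.batch_size, demo_args.epochs)
```

Keeping the parsing in one function with local defaults makes it possible to run the script outside SageMaker for a quick check before launching a training job.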

In [None]:
regressor_tf.fit(paths["train"])


In [None]:
predictor_tf = regressor_tf.deploy(initial_instance_count=1, instance_type=instancetype)

In [None]:
predictor_tf = Predictor(
    endpoint_name="<Enter your endpoint name here>",
    sagemaker_session=sagemaker.Session(),
    serializer=sagemaker.serializers.JSONSerializer()
)

In order to make predictions, we need to first prepare the data with its lags. 

In [None]:
train_input_tf = train["JAJIL.CQ"].adjclose.values.reshape(-1,1)
test_input_tf = test["JAJIL.CQ"].adjclose.values.reshape(-1,1)

In [None]:
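# Stack the train and test adjclose values into one continuous series
full_series = np.vstack((train_input_tf, test_input_tf))
window = 30 # number of lags

# Keep the test period plus the `window` values that precede it
inputs = full_series[full_series.shape[0] - test_input_tf.shape[0] - window:]
inputs = inputs.reshape(-1, 1)

# Length of `inputs`: one window is built for every test timestamp
n_inputs = full_series.shape[0] - train_input_tf.shape[0] + window

X_test = []

for i in range(window, n_inputs):
    # The `window` values that come right before position i
    X_test_reshaped = np.reshape(inputs[i-window:i], (window, 1))
    X_test.append(X_test_reshaped)

X_test = np.stack(X_test)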
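
The windowing logic can be seen more easily on a tiny series. The sketch below builds, for every position `i` from `window` onward, the block of `window` values that precedes it, and stacks the blocks into a `(samples, window, 1)` array. `make_windows` and the sample series are hypothetical.

```python
import numpy as np

def make_windows(series, window):
    """Stack every run of `window` consecutive values that precedes a target position."""
    series = np.asarray(series, dtype=float).reshape(-1)
    if len(series) <= window:
        raise ValueError("series must be longer than the window")
    # For target position i, the inputs are series[i - window : i]
    windows = [series[i - window:i] for i in range(window, len(series))]
    return np.stack(windows).reshape(-1, window, 1)

# Hypothetical series 0, 1, ..., 7 with a window of 3 lags
demo_series = np.arange(8.0)
demo_X = make_windows(demo_series, window=3)
print(demo_X.shape)
print(demo_X[0, :, 0], demo_X[-1, :, 0])
```

With 8 values and a window of 3 there are 5 target positions (indices 3 to 7), so 5 windows are produced. The same arithmetic gives one window per test timestamp in the cell above.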


In [None]:
predictor_tf.predict(X_test)

In [None]:
# Making predictions
response = predictor_tf.predict(X_test)

# A predictor returned by deploy() is expected to return a parsed dict, while the
# Predictor re-created above has no deserializer and returns raw bytes
if isinstance(response, (bytes, bytearray)):
    response = json.loads(response)

prediction_tf = np.array(response["predictions"])

In [None]:
# Scaling the predictions back
prediction_tf = scaler.inverse_transform(prediction_tf).flatten()

In [None]:
# RMSE
tf_RMSE = np.sqrt(mean_squared_error(y_test_rf[0], prediction_tf))
tf_RMSE

In [None]:
# MAE
tf_MAE = mean_absolute_error(y_test_rf[0], prediction_tf)
tf_MAE

In [None]:
trace1 = go.Scatter(x = X_test_rf.index, y = y_test_rf[0],
                   mode = 'lines',
                   name = 'Real Price')

trace2 = go.Scatter(x = X_test_rf.index, y = prediction_tf,
                   mode = "lines",
                   name = "Predicted Price")

layout = go.Layout(title = "Real Price vs Predicted Price using LSTM with TensorFlow",
                   width = 1000, height = 600)

fig = go.Figure(data = [trace1, trace2], layout = layout)
fig.show()

With the results for the two models, as well as their scores, we can move forward and explore them.

## Results

Now that we have our models trained and evaluated on a test set, we can compare the metrics and use visualization to get insights on how good they are and how close their forecasts were to the real prices. 

Let's begin by comparing the metrics RMSE and MAE for each model.

In [None]:
tf_MAE

In [None]:
metrics = {"RMSE": [rf_RMSE, tf_RMSE], "MAE": [rf_MAE, tf_MAE]}
pd.DataFrame(metrics, index = ["Random Forest", "LSTM"])

In [None]:
trace1 = go.Scatter(x = X_test_rf.index, y = y_test_rf[0],
                   mode = 'lines',
                   name = 'Real Price')

trace2 = go.Scatter(x = X_test_rf.index, y = prediction_rf[0],
                   mode = "lines",
                   name = "Predicted Price RF")


trace3 = go.Scatter(x = X_test_rf.index, y = prediction_tf,
                   mode = "lines",
                   name = "Predicted Price LSTM")

layout = go.Layout(title = "Comparing the Results of Random Forest and LSTM models with the Real Price",
                   width = 1000, height = 600)

fig = go.Figure(data = [trace1, trace2, trace3], layout = layout)
fig.show()

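Looking at the graph above, we can see that both predicted curves follow the real price closely over the 90-day test period. The RMSE and MAE values in the table above quantify how far each model deviates.  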

## Storing Results in S3, to be consumed by business users

In [None]:
# Copy the test features so that X_test_rf itself is not modified
result_set_df = X_test_rf.copy()

# Real prices and LSTM predictions, both already back in un-scaled units
result_set_df['real_price'] = y_test_rf[0]
result_set_df['predicted_price'] = prediction_tf
result_set_df['ticker'] = 'JAJIL.CQ'
result_set_df

In [None]:
result_set_df.to_parquet(f's3://{lab_data_bucket_name}/finance/predictions/predictions.parquet')

