# Predicting Stock Prices from Expected Earnings
## 1. Overview
This project uses a recurrent neural network (RNN), built with TensorFlow, to predict stock prices the day after an earnings report. Its inputs are the daily stock prices for the (approximately) 3-month period leading up to the earnings report, together with the expected earnings per share (EPS) published before the report. Its output is the predicted percentage difference between the next day's closing price and the previous day's high price.
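
To make the output quantity concrete, here is a minimal sketch of the percentage calculation. It assumes the difference is expressed relative to the previous day's high, which the description does not state explicitly. `pct_move` and the prices are hypothetical and used only for illustration.

```python
def pct_move(next_close: float, prev_high: float) -> float:
    """Percentage difference of next_close relative to prev_high (assumed base)."""
    return 100.0 * (next_close - prev_high) / prev_high


# Hypothetical prices: previous day's high of 50.00, next day's close of 52.50.
print(pct_move(52.50, 50.00))  # 5.0
```

A positive value means the next day's close ended above the previous day's high, and a negative value means it ended below.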
## 2. Collecting and Cleaning Data
We'll use the last 8 years of historical stock prices and earnings reports. The data is housed in stocks_latest, which you should download yourself from <a href="https://www.kaggle.com/tsaustin/us-historical-stock-prices-with-earnings-data">this link</a>. The code below expects the files stock_prices_latest.csv and earnings_latest.csv in a ./datasets/ directory.

First, let's import the packages we'll be using.

In [1]:
import os
import datetime

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

Next, we'll load our data into pandas DataFrames.

In [31]:
stocks_data_path = "./datasets/stock_prices_latest.csv"
earnings_data_path = "./datasets/earnings_latest.csv"

stocks_df = pd.read_csv(stocks_data_path)
earnings_df = pd.read_csv(earnings_data_path)

Let's see what data our stocks file contains:

In [34]:
stocks_df.info()
stocks_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23528435 entries, 0 to 23528434
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   symbol             object 
 1   date               object 
 2   open               float64
 3   high               float64
 4   low                float64
 5   close              float64
 6   close_adjusted     float64
 7   volume             int64  
 8   split_coefficient  float64
dtypes: float64(6), int64(1), object(2)
memory usage: 1.6+ GB


   symbol        date   open   high    low  close  close_adjusted     volume  split_coefficient
0    MSFT  2016-05-16   50.8  51.96  50.75  51.83         49.7013   20032017                1.0
1    MSFT  2002-01-16  68.85  69.84  67.85  67.87         22.5902   30977700                1.0
2    MSFT  2001-09-18  53.41   55.0  53.17  54.32         18.0802   41591300                1.0
3    MSFT  2007-10-26  36.01  36.03  34.56  35.03         27.2232  288121200                1.0
4    MSFT  2014-06-27  41.61  42.29  41.51  42.25         38.6773   74640000                1.0
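
The `info()` output above reports about 1.6 GB for this frame, so it can help to trim the data at load time. The sketch below is one option, not what this notebook does: it reads only the columns kept later, parses `date` while loading, and stores prices as `float32`. It assumes the reduced precision of `float32` is acceptable for your use. The two sample rows are copied from the preview above and stand in for the real file.

```python
import io

import pandas as pd

# Tiny stand-in for stock_prices_latest.csv (rows copied from the preview above).
sample_csv = io.StringIO(
    "symbol,date,open,high,low,close,close_adjusted,volume,split_coefficient\n"
    "MSFT,2016-05-16,50.8,51.96,50.75,51.83,49.7013,20032017,1.0\n"
    "MSFT,2002-01-16,68.85,69.84,67.85,67.87,22.5902,30977700,1.0\n"
)

sample_df = pd.read_csv(
    sample_csv,
    # Skip the columns that the cleanup step drops anyway.
    usecols=["symbol", "date", "open", "high", "low", "close_adjusted", "volume"],
    # Half-width floats: less memory, less precision.
    dtype={"open": "float32", "high": "float32",
           "low": "float32", "close_adjusted": "float32"},
    # Parse dates during the read instead of in a separate pass.
    parse_dates=["date"],
)

print(sample_df.dtypes)
print(sample_df.memory_usage(deep=True).sum(), "bytes")
```

For the real file, pass `stocks_data_path` in place of `sample_csv`.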


We'll do the following to clean up this dataset:
1. Remove split_coefficient and close, opting for close_adjusted instead
2. Rename close_adjusted to close
3. Convert our dates to datetime objects in pandas
4. Sort first by symbol, then by date

In [35]:
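# Drop the columns we don't need; errors='ignore' makes the cell safe to re-run
stocks_df.drop(['split_coefficient','close'], axis=1, errors='ignore', inplace=True)
# Use the adjusted close as our closing price
stocks_df.rename(columns={'close_adjusted':'close'}, inplace=True)
# Parse the date strings into pandas datetimes
stocks_df['date'] = pd.to_datetime(stocks_df['date'])
# Order chronologically within each symbol
stocks_df.sort_values(by=['symbol','date'], inplace=True)
stocks_df.head()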

          symbol        date   open   high    low    close    volume
19762470       A  1999-11-18   45.5   50.0   40.0  29.6303  44739900
19762410       A  1999-11-19  42.94   43.0  39.81  27.1926  10897100
19762440       A  1999-11-22  41.31   44.0  40.06  29.6303   4705200
19762399       A  1999-11-23   42.5  43.63  40.25   27.105   4274400
19762394       A  1999-11-24  40.13  41.94   40.0  27.6505   3464400
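
After this cleanup, close holds adjusted prices while open, high and low remain raw quotes. In the output above, A has a close of 29.6303 on a day whose low was 40.0. If close needs to be compared with high on a single scale, one common option is to rescale the other price columns by the ratio close_adjusted / close before the raw close is dropped. The sketch below is an assumption about how that could be done, not a step this notebook performs, and the toy values are illustrative.

```python
import pandas as pd

# Toy rows in the raw schema (before `close` is dropped); values are illustrative.
raw = pd.DataFrame({
    "symbol": ["X", "X"],
    "open": [40.0, 42.0],
    "high": [44.0, 43.0],
    "low": [38.0, 41.0],
    "close": [42.0, 42.0],
    "close_adjusted": [21.0, 21.0],
})

# Per-row adjustment factor implied by the two close columns.
factor = raw["close_adjusted"] / raw["close"]

# Put open, high and low on the same scale as the adjusted close.
adjusted = raw.copy()
for col in ["open", "high", "low"]:
    adjusted[col] = raw[col] * factor

# Same final schema as the cleanup above: a single `close` column.
adjusted = adjusted.drop(columns=["close"]).rename(columns={"close_adjusted": "close"})
print(adjusted)
```

With the rescaled columns, each day's low is again at or below its close, which is not guaranteed when raw and adjusted prices are mixed.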


We'll be feeding our network the stock prices for each 90-day period leading up to an earnings report. First, we'll group the stock data by symbol.

In [36]:
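# Map each ticker symbol to a DataFrame holding only that symbol's rows
stock_df_sep = dict(tuple(stocks_df.groupby('symbol')))
# Inspect one symbol as a sanity check
symbol1 = stock_df_sep['ZM']
print(symbol1)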

         symbol       date      open     high      low   close    volume
20646711     ZM 2019-04-18   65.0000   66.000   60.321   62.00  25764659
20646712     ZM 2019-04-22   61.0000   68.900   59.940   65.70   9949738
20646710     ZM 2019-04-23   66.8700   74.169   65.550   69.00   6786513
20646715     ZM 2019-04-24   71.4000   71.500   63.160   63.20   4973529
20646709     ZM 2019-04-25   64.7400   66.850   62.600   65.00   3863275
...         ...        ...       ...      ...      ...     ...       ...
23487984     ZM 2020-11-02  462.2900  477.000  440.000  453.00   8051893
23487983     ZM 2020-11-03  456.3284  461.880  445.010  451.51   6496995
23487980     ZM 2020-11-04  471.9200  484.940  455.400  483.70   8048450
23487985     ZM 2020-11-05  497.9900  499.350  480.761  496.73   6379939
23487982     ZM 2020-11-06  494.6600  505.880  482.730  500.11   4248801

[394 rows x 7 columns]
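
A 90-day period can be cut from a per-symbol frame like the one above with a boolean mask on `date`. The sketch below is one possible approach, not necessarily the one used later in this project. It assumes the window is measured in calendar days (not trading days) and excludes the report date itself. `window_before` and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def window_before(prices: pd.DataFrame, report_date, days: int = 90) -> pd.DataFrame:
    """Rows of `prices` dated in the `days` calendar days before `report_date`.

    The report date itself is excluded. `prices` needs a datetime `date` column.
    """
    end = pd.Timestamp(report_date)
    start = end - pd.Timedelta(days=days)
    mask = (prices["date"] >= start) & (prices["date"] < end)
    return prices.loc[mask]


# Toy single-symbol frame: one row per business day in 2020.
dates = pd.bdate_range("2020-01-01", "2020-12-31")
toy = pd.DataFrame({"symbol": "TOY", "date": dates, "close": range(len(dates))})

chunk = window_before(toy, "2020-06-01")
print(chunk["date"].min(), chunk["date"].max(), len(chunk))
```

Because weekends and holidays have no rows, a 90-calendar-day window holds fewer than 90 price rows, so the number of rows per chunk can vary from one report to the next.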


Next, we'll use our earnings data, in particular the earnings report dates for each company, to split the price data further into 90-day chunks. Let's explore our earnings data:

In [39]:
earnings_df.info()
earnings_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160660 entries, 0 to 160659
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   symbol        160660 non-null  object 
 1   date          160660 non-null  object 
 2   qtr           159656 non-null  object 
 3   eps_est       103413 non-null  float64
 4   eps           128273 non-null  float64
 5   release_time  105929 non-null  object 
dtypes: float64(2), object(4)
memory usage: 7.4+ MB


   symbol        date      qtr  eps_est  eps  release_time
0       A  2009-05-14  04/2009      NaN  NaN          post
1       A  2009-08-17  07/2009      NaN  NaN          post
2       A  2009-11-13  10/2009      NaN  NaN           pre
3       A  2010-02-12  01/2010      NaN  NaN           pre
4       A  2010-05-17  04/2010      NaN  NaN          post


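If an earnings record is missing any field, such as the EPS estimate or the reported EPS, we can remove that record. Note that dropna() with no arguments drops a row when any of its columns is missing, including qtr and release_time.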

In [42]:
earnings_df.dropna(inplace=True)
earnings_df.head()

    symbol        date      qtr  eps_est   eps  release_time
14       A  2012-11-19  10/2012      0.8  0.84          post
15       A  2013-02-14  01/2013     0.66  0.63          post
16       A  2013-05-14  04/2013     0.67  0.77          post
17       A  2013-08-14  07/2013     0.62  0.68          post
18       A  2013-11-14  10/2013     0.76  0.81          post
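
Since `dropna()` with no arguments discards a row when any column is missing, it may remove more rows than the EPS fields alone would require. The sketch below contrasts it with `dropna(subset=...)` on hypothetical toy rows. Which columns count as required is an assumption here, and the notebook itself uses the stricter form.

```python
import numpy as np
import pandas as pd

# Hypothetical earnings rows: the second lacks only release_time,
# the third lacks both EPS fields.
toy_earnings = pd.DataFrame({
    "symbol": ["TOY", "TOY", "TOY"],
    "date": ["2012-11-19", "2013-02-14", "2009-05-14"],
    "qtr": ["10/2012", "01/2013", "04/2009"],
    "eps_est": [0.80, 0.66, np.nan],
    "eps": [0.84, 0.63, np.nan],
    "release_time": ["post", None, "post"],
})

# Any missing field drops the row.
strict = toy_earnings.dropna()
# Only the EPS fields are treated as required.
lenient = toy_earnings.dropna(subset=["eps_est", "eps"])

print(len(strict), len(lenient))  # 1 2
```

The lenient form keeps more training examples but leaves gaps in release_time that later steps would need to handle.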
