# Predicting Stock Prices from Expected Earnings
## 1. Overview
This project uses a recurrent neural network (RNN), built with TensorFlow, to predict stock prices the day after an earnings report. The input is the daily stock prices for the roughly 3-month period leading up to the earnings report, plus the expected earnings per share (EPS) published before the report. The output is the expected percent difference between the next day's closing price and the previous day's high price.
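As a rough illustration of the output quantity, the sketch below computes a percent difference between two prices. The function name `pct_move` and the sign convention (positive when the next day's close is above the previous day's high) are assumptions made for this example; the project does not spell them out.

```python
def pct_move(prev_high: float, next_close: float) -> float:
    """Percent difference of the next day's close relative to the previous day's high.

    Assumed convention: positive means the close ended above the earlier high.
    """
    if prev_high <= 0:
        raise ValueError("prev_high must be positive")
    return (next_close - prev_high) / prev_high * 100.0


# Made-up prices, for illustration only.
example = pct_move(prev_high=50.0, next_close=52.0)
print(example)
```

A drop after the report gives a negative value under this convention, for example `pct_move(100.0, 90.0)`.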
## 2. Collecting and Cleaning Data
Data is taken from the last 8 years of historical stock prices and earnings reports. It is housed in stocks_latest, which is not bundled with this project; download it yourself from <a href="https://www.kaggle.com/tsaustin/us-historical-stock-prices-with-earnings-data">this link</a>. The code below expects the CSV files under ./datasets/.

First, let's import the packages we'll be using.

In [5]:
import os
import datetime

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

Next, we'll extract our data as pandas dataframes.

In [21]:
stocks_data_path = "./datasets/stock_prices_latest.csv"
earnings_data_path = "./datasets/earnings_latest.csv"

stocks_df = pd.read_csv(stocks_data_path)
earnings_df = pd.read_csv(earnings_data_path)

Let's see what data our stocks file contains:

In [7]:
stocks_df.info()
stocks_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23528435 entries, 0 to 23528434
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   symbol             object 
 1   date               object 
 2   open               float64
 3   high               float64
 4   low                float64
 5   close              float64
 6   close_adjusted     float64
 7   volume             int64  
 8   split_coefficient  float64
dtypes: float64(6), int64(1), object(2)
memory usage: 1.6+ GB


Unnamed: 0,symbol,date,open,high,low,close,close_adjusted,volume,split_coefficient
0,MSFT,2016-05-16,50.8,51.96,50.75,51.83,49.7013,20032017,1.0
1,MSFT,2002-01-16,68.85,69.84,67.85,67.87,22.5902,30977700,1.0
2,MSFT,2001-09-18,53.41,55.0,53.17,54.32,18.0802,41591300,1.0
3,MSFT,2007-10-26,36.01,36.03,34.56,35.03,27.2232,288121200,1.0
4,MSFT,2014-06-27,41.61,42.29,41.51,42.25,38.6773,74640000,1.0


We'll do the following to clean up this dataset:
1. Remove split_coefficient and close, opting for close_adjusted instead
2. Rename close_adjusted to close
3. Convert our dates to datetime objects in pandas
4. Sort first by symbol, then by date

In [8]:
stocks_df.drop(['split_coefficient','close'], axis=1, errors='ignore', inplace=True)
stocks_df.rename(columns={'close_adjusted':'close'}, inplace=True)
stocks_df['date'] = pd.to_datetime(stocks_df.date)
stocks_df.sort_values(by=['symbol','date'], inplace=True)
stocks_df.head()

Unnamed: 0,symbol,date,open,high,low,close,volume
19762470,A,1999-11-18,45.5,50.0,40.0,29.6303,44739900
19762410,A,1999-11-19,42.94,43.0,39.81,27.1926,10897100
19762440,A,1999-11-22,41.31,44.0,40.06,29.6303,4705200
19762399,A,1999-11-23,42.5,43.63,40.25,27.105,4274400
19762394,A,1999-11-24,40.13,41.94,40.0,27.6505,3464400


We'll be feeding our RNN the stock prices for each 90-day period leading up to an earnings report. First, we'll group the stock data by symbol and save each group as a NumPy array for faster processing.

In [52]:
stocks_sep = dict(tuple(stocks_df.groupby('symbol')))
for symbol in stocks_sep:
    stocks_sep[symbol] = stocks_sep[symbol].to_numpy()
symbol1 = stocks_sep['ZM']
print(symbol1)

[['ZM' '2019-04-25' 64.74 ... 65.0 3863275 1.0]
 ['ZM' '2019-04-23' 66.87 ... 69.0 6786513 1.0]
 ['ZM' '2019-04-18' 65.0 ... 62.0 25764659 1.0]
 ...
 ['ZM' '2020-10-26' 520.4885 ... 517.79 10080795 1.0]
 ['ZM' '2020-10-27' 525.22 ... 538.99 7227645 1.0]
 ['ZM' '2020-10-28' 549.5 ... 516.01 8810070 1.0]]


Now we'll need to use our earnings data, and particularly the dates of earnings periods for each company, to further separate our data into 90-day chunks. Let's clean and explore our earnings data:

In [23]:
earnings_df.dropna(inplace=True)
earnings_df['date'] = pd.to_datetime(earnings_df.date)
earnings_df.sort_values(by=['symbol', 'date'], inplace=True)
earnings_df.info()
earnings_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77282 entries, 14 to 160659
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   symbol        77282 non-null  object        
 1   date          77282 non-null  datetime64[ns]
 2   qtr           77282 non-null  object        
 3   eps_est       77282 non-null  float64       
 4   eps           77282 non-null  float64       
 5   release_time  77282 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 4.1+ MB


Unnamed: 0,symbol,date,qtr,eps_est,eps,release_time
14,A,2012-11-19,10/2012,0.8,0.84,post
15,A,2013-02-14,01/2013,0.66,0.63,post
16,A,2013-05-14,04/2013,0.67,0.77,post
17,A,2013-08-14,07/2013,0.62,0.68,post
18,A,2013-11-14,10/2013,0.76,0.81,post


In [83]:
earnings_sep = dict(tuple(earnings_df.groupby('symbol')))
# Collect symbols that have earnings data but no price data
delete_dict = []
for symbol in earnings_sep:
    earnings_sep[symbol] = earnings_sep[symbol].to_numpy()
    if symbol not in stocks_sep:
        delete_dict.append(symbol)
# Delete in a second pass, since a dict can't change size while being iterated
for symbol in delete_dict:
    del earnings_sep[symbol]
symbol2 = earnings_sep['ZM']
print(symbol2)
print(len(earnings_sep))

[['ZM' Timestamp('2019-09-05 00:00:00') '07/2019' 0.013 0.08 'post']
 ['ZM' Timestamp('2019-12-05 00:00:00') 'Q3' 0.028 0.09 'post']
 ['ZM' Timestamp('2020-03-04 00:00:00') 'Q4' 0.071 0.15 'post']]
4328
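With prices and earnings both keyed by symbol, one possible way to cut a 90-day chunk for a single earnings date is a binary search over that symbol's sorted dates. This is only a sketch under stated assumptions: `slice_window` is a hypothetical helper, the price values are synthetic, the dates are assumed to be sorted ascending, and the window is assumed to end the day before the earnings date. Only the earnings date (ZM, 2019-09-05) comes from the data above.

```python
import numpy as np
import pandas as pd


def slice_window(dates, values, earnings_date, window_days=90):
    """Return the rows of `values` whose date falls in
    [earnings_date - window_days, earnings_date).

    `dates` must be a datetime64 array sorted in ascending order and
    aligned row-for-row with `values`.
    """
    end = pd.Timestamp(earnings_date).to_datetime64()
    start = end - np.timedelta64(window_days, "D")
    lo = np.searchsorted(dates, start, side="left")  # first row on or after start
    hi = np.searchsorted(dates, end, side="left")    # first row on or after end (excluded)
    return values[lo:hi]


# Synthetic daily series: one value per calendar day, for illustration only.
dates = pd.date_range("2019-06-01", periods=120, freq="D").to_numpy()
values = np.arange(120, dtype=float)

window = slice_window(dates, values, "2019-09-05")
print(len(window))
```

Real price data only has rows for trading days, so a 90-calendar-day window would hold fewer than 90 rows. A fixed-length RNN input would need either a fixed row count or padding, which this sketch does not address.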


In [None]:
# TODO: add earnings period start and end date columns to earnings_df before separating
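One way the TODO above might be approached is to derive the window boundaries from each earnings date with `pd.Timedelta`. The column names `period_start` and `period_end`, the helper `add_window_columns`, and the choice to end the window one day before the report are all assumptions for this sketch, not part of the original notebook. The sample rows reuse the first two earnings dates and estimates shown for symbol A.

```python
import pandas as pd

WINDOW_DAYS = 90  # assumed window length, matching the 90-day period in the text


def add_window_columns(earnings, window_days=WINDOW_DAYS):
    """Return a copy of `earnings` with hypothetical period_start/period_end columns."""
    out = earnings.copy()
    # Assumption: the window ends the day before the report date.
    out["period_end"] = out["date"] - pd.Timedelta(days=1)
    out["period_start"] = out["date"] - pd.Timedelta(days=window_days)
    return out


sample = pd.DataFrame({
    "symbol": ["A", "A"],
    "date": pd.to_datetime(["2012-11-19", "2013-02-14"]),
    "eps_est": [0.80, 0.66],
})
sample_with_windows = add_window_columns(sample)
print(sample_with_windows[["symbol", "period_start", "period_end"]])
```

Since the `release_time` column distinguishes reports released before and after market hours, a fuller version might shift the window end depending on that value; that is left out here.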

In [78]:
stocks_cnt = 0
stocks_total_cnt = len(earnings_sep)
for symbol in earnings_sep:
    stocks_row_index = 0
    stocks_cnt += 1
    if stocks_cnt % 100 == 0:
        print("stocks processed: {:d}/{:d}".format(stocks_cnt, stocks_total_cnt))
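    for earnings_date_row in earnings_sep[symbol]:
        # TODO: slice stocks_sep from earnings date start to earnings date end and add to input
        # TODO: calculate percent drop and add to output
        pass  # placeholder so the loop body is valid Python until the steps above are implemented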

stocks processed: 100/4328
stocks processed: 200/4328
stocks processed: 300/4328
stocks processed: 400/4328
stocks processed: 500/4328
stocks processed: 600/4328
stocks processed: 700/4328
stocks processed: 800/4328
stocks processed: 900/4328
stocks processed: 1000/4328
stocks processed: 1100/4328
stocks processed: 1200/4328
stocks processed: 1300/4328
stocks processed: 1400/4328
stocks processed: 1500/4328
stocks processed: 1600/4328
stocks processed: 1700/4328
stocks processed: 1800/4328
stocks processed: 1900/4328
stocks processed: 2000/4328
stocks processed: 2100/4328
stocks processed: 2200/4328
stocks processed: 2300/4328
stocks processed: 2400/4328
stocks processed: 2500/4328
stocks processed: 2600/4328
stocks processed: 2700/4328
stocks processed: 2800/4328
stocks processed: 2900/4328
stocks processed: 3000/4328
stocks processed: 3100/4328
stocks processed: 3200/4328
stocks processed: 3300/4328
stocks processed: 3400/4328
stocks processed: 3500/4328
stocks processed: 3600/4328
s