## An Alternate Method for Data Acquisition

I noticed some issues with the previous dataset. Notably, the volume column contained many zero values. If enough values in the source data are incorrect, the errors can propagate to later pipeline steps. To correct this, I will instead use the Alpaca API, which provides historical market data.
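As a rough way to quantify the problem, the share of zero-volume rows can be measured directly. This is a minimal sketch: the helper name `zero_volume_share` and the toy values are assumptions for illustration, not part of the earlier dataset.

```python
import pandas as pd


def zero_volume_share(df: pd.DataFrame, column: str = "volume") -> float:
    """Return the fraction of rows whose volume is exactly zero."""
    if df.empty:
        return 0.0
    return float((df[column] == 0).mean())


# Toy data standing in for the earlier dataset (values are made up).
sample = pd.DataFrame({"volume": [0, 1200, 0, 560]})
share = zero_volume_share(sample)
print(f"{share:.0%} of rows have zero volume")
```

Running the same check on the newly acquired data would show whether the switch actually reduces the number of zero-volume rows.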

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import json

To acquire the data, an API key from [Alpaca](https://alpaca.markets/) is required. Alpaca provides an easy-to-use API for stock trading, market data acquisition, and backtesting. Some features require an authorized account, which is for now available only to U.S. citizens. To use the API, you must first generate two keys on the Alpaca website: a key ID and a secret key.

In [2]:
import alpaca_trade_api as tradeapi

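# The key file holds the key ID on the first line and the secret key on the second.
with open('../files/private/alpakey') as key_file:
    keys = key_file.readlines()
    key_id = keys[0].strip()
    secret_key = keys[1].strip()

# Paper-trading endpoint, so no real orders can be placed by accident.
api_url = "https://paper-api.alpaca.markets"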

alpaca = tradeapi.REST(key_id, secret_key, api_url, api_version='v2')
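The key file is read by position, so a missing or blank line would only surface later as an `IndexError` or a rejected request. A small sketch of a stricter reader follows. The function name `load_keys` is hypothetical, and skipping blank lines is an assumption about how the file should be treated, not something the original code does.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def load_keys(path: str) -> Tuple[str, str]:
    """Read the key ID (line 1) and secret key (line 2) from a text file."""
    lines = [line.strip() for line in Path(path).read_text().splitlines()]
    lines = [line for line in lines if line]  # ignore blank lines
    if len(lines) < 2:
        raise ValueError(
            f"expected two non-empty lines in {path}, found {len(lines)}"
        )
    return lines[0], lines[1]


# Demonstration with a throwaway file and placeholder values (not real keys).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "alpakey"
    demo_path.write_text("EXAMPLE_KEY_ID\nEXAMPLE_SECRET\n")
    key_id, secret_key = load_keys(str(demo_path))
```

In the notebook, the call would presumably point at `'../files/private/alpakey'` instead of the temporary file.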

`sp_500.json` contains a list of the S&P 500 constituents, the stocks that make up the index. Some ETFs (exchange-traded funds) track many stocks at once and can be used to approximate the market as a whole. The SPY ETF tracks the S&P 500, an index of large companies listed on US exchanges. By combining individual stock data with overall market data, predictions can take market ups and downs into account. Other ETFs may work just as well, such as DIA (which tracks the Dow Jones Industrial Average) or VUG (a Vanguard growth ETF). In my tests, they all performed similarly.

In [3]:
with open('../files/public/sp_500.json', 'r') as top_symbols:
    symbols = json.load(top_symbols)
    
symbols.append('SPY')

The API lets you specify the time period between data points, the number of data points, and the starting or ending point. The results can be returned as a DataFrame for convenience. To train the model, data from as far back as 2007 will be requested.

In [4]:
top_stocks = []
for symbol in tqdm(symbols):
    try:
        stock = alpaca.polygon.historic_agg_v2(symbol, 1, 'day',
                                               _from='2007-01-01', to='2020-07-01').df
        stock['symbol'] = symbol
        top_stocks.append(stock)
    except Exception as e:
        # Report which symbol failed and move on to the next one.
        print(f'{symbol}: {e}')

# After the stocks have been retrieved, I concatenate them into a single DataFrame.
top_stocks = pd.concat(top_stocks)
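The download loop can also be written so that failures are recorded rather than only printed, and so that an empty result does not make `pd.concat` raise. The sketch below takes the fetch step as a callable so it can be exercised without network access. `collect_frames` and `fake_fetch` are hypothetical names introduced here, and the toy prices are made up.

```python
from typing import Callable, Iterable, List, Tuple

import pandas as pd


def collect_frames(
    symbols: Iterable[str],
    fetch: Callable[[str], pd.DataFrame],
) -> Tuple[pd.DataFrame, List[Tuple[str, str]]]:
    """Fetch one frame per symbol, tag it, and keep a list of failures."""
    frames: List[pd.DataFrame] = []
    failed: List[Tuple[str, str]] = []
    for symbol in symbols:
        try:
            frame = fetch(symbol)
        except Exception as exc:  # broad on purpose, as in the loop above
            failed.append((symbol, str(exc)))
            continue
        if frame.empty:
            failed.append((symbol, "no rows returned"))
            continue
        frames.append(frame.assign(symbol=symbol))
    # pd.concat raises on an empty list, so fall back to an empty frame.
    combined = pd.concat(frames) if frames else pd.DataFrame()
    return combined, failed


def fake_fetch(symbol: str) -> pd.DataFrame:
    """Stand-in for the real API call; returns toy prices."""
    if symbol == "BAD":
        raise ValueError("unknown symbol")
    return pd.DataFrame({"close": [1.0, 2.0]})


combined, failed = collect_frames(["AAA", "BAD"], fake_fetch)
print(failed)
```

In the notebook, `fetch` would wrap the `alpaca.polygon.historic_agg_v2(...)` call shown above, and the `failed` list gives a quick record of which constituents are missing from the dataset.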

In [5]:
top_stocks.reset_index(inplace=True)
top_stocks.columns = ['date', 'open', 'high', 'low', 'close', 'volume', 'symbol']

For backtesting the model, unseen data is needed. However, market conditions vary from year to year, so a single year is unlikely to be representative. To account for this, I will backtest using three sets of data.

##### Backtest Set 1: 2008
    
The market crash of 2008 was one of the worst on record. I'll choose this year to test the model's performance in the face of a recession.
    
##### Backtest Set 2: 2011

2011 was neither a good nor a bad year for the market. There was little difference between stock prices in January and December. I'll choose this year to test the model's performance in flat years.

##### Backtest Set 3: 2013

By 2013, the market had rebounded from the 2008 recession, and stock prices rose steadily throughout the year. I'll choose this year to test the model's performance in a rising market.

These years will be removed from the training set when the time comes.
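One way the split could look, as a minimal sketch: it assumes a `date` column like the one named in cell 5, and the helper name `split_by_year` and the toy rows are hypothetical.

```python
import pandas as pd

BACKTEST_YEARS = (2008, 2011, 2013)


def split_by_year(df: pd.DataFrame, years=BACKTEST_YEARS, date_col: str = "date"):
    """Return (training rows, backtest rows) based on the calendar year."""
    in_backtest = pd.to_datetime(df[date_col]).dt.year.isin(years)
    return df[~in_backtest], df[in_backtest]


# Toy rows, one per year of interest (prices are made up).
toy = pd.DataFrame(
    {
        "date": ["2007-06-01", "2008-03-03", "2011-09-09", "2012-01-05", "2013-12-31"],
        "close": [10.0, 8.0, 9.0, 9.5, 12.0],
    }
)
train, backtest = split_by_year(toy)
print(len(train), len(backtest))
```

Filtering on the year keeps the three backtest sets cleanly separated from the training data, whatever the exact trading days in each year turn out to be.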

In [6]:
top_stocks.set_index('date', inplace=True)

The stock rows and the market (SPY) rows now share a single DataFrame indexed by date. Duplicate rows are dropped before saving.

In [7]:
top_stocks.drop_duplicates(inplace=True)
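One caveat worth checking: `DataFrame.drop_duplicates` compares column values only and ignores the index. Once `date` is the index, two different trading days with identical prices and volume for the same symbol would be treated as duplicates. The sketch below illustrates this on made-up rows and shows a possible alternative that deduplicates on the `(date, symbol)` pair. Whether that matches the intended behavior here is an assumption.

```python
import pandas as pd

# Three made-up rows: two distinct dates, the last date repeated once.
frame = pd.DataFrame(
    {
        "close": [10.0, 10.0, 10.0],
        "volume": [100, 100, 100],
        "symbol": ["AAA", "AAA", "AAA"],
    },
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-03"]).rename("date"),
)

# Column-only comparison: the index is ignored, so only one row survives.
by_columns = frame.drop_duplicates()

# Deduplicating on (date, symbol) keeps one row per trading day per symbol.
by_key = (
    frame.reset_index()
    .drop_duplicates(subset=["date", "symbol"])
    .set_index("date")
)

print(len(by_columns), len(by_key))
```

With real price data, exact repeats across days are probably rare, but keying on date and symbol avoids silently losing rows when they do occur.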

The finalized data is saved for future use. This step may be skipped in future iterations once you are satisfied with the size and quality of the dataset.

In [8]:
top_stocks.to_hdf('../data/raw/market_stocks.h5', key='top_stocks')