# Polygon data download
With the Alpaca notebooks I tried to convert tick quote data to quotebars and then merge them with bars. I have decided not to use quote data anymore. It is too expensive, too cumbersome, or simply impossible: I would have to download ALL tick data in order to get quotebars. Instead, I will just trade using 1-minute OHLC. This can be downloaded in a reasonable timeframe from Polygon, which also includes delisted stocks. When backtesting, I will assume a spread of 1 tick or a few cents. This is okay, because I simply will not trade systems with an average profit of less than 0.3% for liquid stocks or less than 0.1% for liquid futures/CFDs. I will also avoid illiquid stocks. I will not do HFT. My holding period should be at least 10 minutes. If I do want to venture into sub-1-minute data, I will just use QuantConnect instead.

My data pipeline will look like this:
1. Download all raw 1-minute unadjusted data for all stocks for all dates (including delisted tickers) and store it in <code>../data/polygon/raw/m1/\<stock\>.csv</code>. Raw price data contains <code>["t", "open", "high", "low", "close", "vwap", "trades", "volume", "otc"]</code>. The <code>t</code> column is the Unix timestamp. The <code>trades</code> column is the number of trades in the minute and <code>otc</code> is a flag indicating whether the asset is traded OTC. I will not trade OTCs yet. Higher timeframes are not downloaded, but aggregated later. Raw data is never deleted or modified.
2. Get adjustment factors and store to <code>../data/polygon/raw/adjustments/\<stock\>.csv</code>.
3. Download fundamental data to <code>../data/polygon/raw/fundamentals/\<stock\>.csv</code>. This contains P/E ratios etc.
4. Download descriptive data to <code>../data/polygon/raw/descriptive.csv</code>. This is a single file containing all stocks. The difference between descriptive and fundamental data is that descriptive data (almost) never changes. It contains e.g. the start/end date of listing, the sector and the country. (The start/end date is the only thing that changes.)

Steps 1 to 4 are repeated whenever I have to update the data. New data is then simply appended to the end of the CSV files. I will not update frequently.
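
A minimal sketch of how such an append could look, assuming the update arrives as a pandas DataFrame with the same columns as the stored file. The function name `append_new_bars` is hypothetical. Only rows with a timestamp later than the last stored `t` are written, so an update that overlaps the existing data does not create duplicates and existing rows are never rewritten.

```python
from pathlib import Path

import pandas as pd


def append_new_bars(csv_path, new_bars):
    """Append bars newer than the last stored timestamp; return rows written.

    Hypothetical helper: assumes `new_bars` has a "t" column and the same
    columns as the existing CSV (column order may differ).
    """
    csv_path = Path(csv_path)
    if not csv_path.exists():
        # First download: create the folder and write the file with a header.
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        new_bars.to_csv(csv_path, index=False)
        return len(new_bars)

    stored = pd.read_csv(csv_path, usecols=["t"])
    if not stored.empty:
        # Keep only rows that are strictly newer than what is already stored.
        new_bars = new_bars[new_bars["t"] > stored["t"].max()]

    # Write in the column order of the existing file, without a second header.
    columns = pd.read_csv(csv_path, nrows=0).columns
    new_bars[list(columns)].to_csv(csv_path, mode="a", header=False, index=False)
    return len(new_bars)
```

Reading the whole `t` column just to find the last timestamp is wasteful for large files, but it keeps the sketch short. Since I will not update frequently, that is probably acceptable.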

After downloading, the raw data has to be processed to the <code>../data/polygon/processed/</code> folder. For now, I will not process the fundamental/descriptive data. Processed data is always adjusted. When data is downloaded for the first time or updated, the *entire* data set needs to be processed (for the updated tickers). This is because if new adjustments come in due to dividend or split data, the entire history needs to be readjusted. Processed data will contain the columns <code>["open", "high", "low", "close", "close_original", "volume", "vwap", "tradeable"]</code>. All prices are adjusted, except <code>close_original</code>. The reason we need the original close price is for price filters. Processed data will not contain OTC stocks for now.
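
To illustrate why the entire history has to be readjusted, here is a rough sketch of back-adjustment. The format of the adjustment files is not fixed yet, so the `events` table below is an assumption: one row per split or dividend with an ex-date `date` and a multiplicative `factor` that applies to all bars before that date (e.g. 0.5 for a 2-for-1 split). The function name `adjust_prices` is hypothetical as well.

```python
import numpy as np
import pandas as pd

PRICE_COLS = ["open", "high", "low", "close", "vwap"]


def adjust_prices(bars, events):
    """Back-adjust prices; keep the unadjusted close in `close_original`.

    bars:   DataFrame indexed by (naive) timestamps with the PRICE_COLS.
    events: assumed format, columns "date" (ex-date) and "factor".
    """
    out = bars.copy()
    out["close_original"] = out["close"]

    events = events.sort_values("date")
    ex_dates = pd.to_datetime(events["date"]).to_numpy()
    factors = events["factor"].to_numpy(dtype=float)

    # cum_after[i] is the product of factors i, i+1, ...; the last entry is 1.0
    # for bars that come after every event.
    cum_after = np.append(np.cumprod(factors[::-1])[::-1], 1.0)

    # Number of events whose ex-date is at or before each bar. Those events no
    # longer apply to the bar; all later events do.
    passed = np.searchsorted(ex_dates, out.index.to_numpy(), side="right")
    multiplier = cum_after[passed]

    for col in PRICE_COLS:
        out[col] = (out[col] * multiplier).round(4)
    return out
```

Every new event multiplies into the factor of *all* earlier bars, which is why appending to the processed files is not enough. Volume is left untouched here for simplicity. For splits it would normally be divided by the split factor.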

The steps when processing data:
1. Remove OTC traded stocks.
2. Convert the timestamps to naive Eastern Time (ET).
3. For illiquid stocks and during extended hours, a trade does not take place every minute. I want my processed data to have no gaps, which makes pairs trading much easier: every stock that trades on a given day should have the same number of minute bars. To do this we need to reindex to the opening hours (4:00 to 20:00, where regular hours are from 9:30 to 16:00) and forward fill empty values. However, some days have early closes and there are holidays, so we also need a list of dates on which the stock market is open. These will be stored in <code>../data/other/market_hours.csv</code>. If data is forward filled, the backtester should not trade these 'stale' prices: either there were no trades in that minute or the stock was halted. That is what the <code>tradeable</code> flag is for. It can happen that a stock is halted for multiple days (example: TOP). In this case, it is treated just like any other run of empty prices and forward filled.
4. Adjust the data using dividend and split data. Round to 4 decimal places.
5. Aggregate 1-minute bars to higher timeframes. Daily data contains only regular trading hours.
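
Steps 2 and 3 can be sketched with pandas roughly as follows. Several things here are assumptions rather than decisions already made above: that `t` is in milliseconds (Polygon's aggregate bars use millisecond timestamps, but the raw files may store something else), that the market hours are available as a list of `(open, close)` timestamps per day with the close exclusive, and that volume is set to 0 on filled bars. The names `to_naive_et` and `fill_gaps` are hypothetical.

```python
import numpy as np
import pandas as pd

PRICE_COLS = ["open", "high", "low", "close", "vwap"]


def to_naive_et(t_ms):
    """Convert Unix timestamps (assumed milliseconds) to naive Eastern Time."""
    utc = pd.to_datetime(np.asarray(t_ms), unit="ms", utc=True)
    return utc.tz_convert("America/New_York").tz_localize(None)


def fill_gaps(bars, sessions):
    """Reindex to every minute the market is open and forward fill prices.

    bars:     DataFrame indexed by naive ET minute timestamps.
    sessions: assumed format, list of (open, close) Timestamps per trading
              day, e.g. 04:00 and 20:00; the close itself is excluded.
    """
    one_minute = pd.Timedelta(minutes=1)
    minutes = [pd.date_range(o, c - one_minute, freq="min") for o, c in sessions]
    full_index = pd.DatetimeIndex(np.concatenate([m.to_numpy() for m in minutes]))

    out = bars.reindex(full_index)
    # A bar is tradeable only if it existed in the raw data.
    out["tradeable"] = full_index.isin(bars.index)
    out[PRICE_COLS] = out[PRICE_COLS].ffill()
    # Assumption: no trades in a filled minute, so its volume is zero.
    out["volume"] = out["volume"].fillna(0)
    return out
```

Minutes before the first ever trade stay empty, because there is nothing to forward fill from. A multi-day halt needs no special handling: the days are still in `sessions`, so they are filled with the last known prices and flagged as not tradeable.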

*Note: a data point at 15:59 with OHLC means that the open was at 15:59:00 and the close at 16:00:00, so daily data does not contain the 16:00:00 minute bar. My Polygon key is in <code>../data/polygon/secret.txt</code>; this file has to be created manually.*
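
A small sketch of the daily aggregation under this timestamp convention, assuming processed 1-minute bars indexed by naive ET timestamps. The function name `to_daily` is hypothetical. Because bars are labelled by their open time, regular hours run from the 09:30 bar up to and including the 15:59 bar. This simplified version ignores early closes, which would need the market hours file, and leaves out `vwap`, which would have to be volume weighted.

```python
import pandas as pd


def to_daily(m1):
    """Aggregate 1-minute bars to daily bars using regular trading hours only."""
    # between_time includes both endpoints, so the 16:00 bar is left out.
    rth = m1.between_time("09:30", "15:59")
    return rth.groupby(rth.index.normalize()).agg(
        open=("open", "first"),
        high=("high", "max"),
        low=("low", "min"),
        close=("close", "last"),
        volume=("volume", "sum"),
    )
```

Grouping by the normalized date, rather than resampling to calendar days, avoids creating empty rows for weekends and holidays.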