# Fundamentals
The fundamentals that I want are:
* Easy to get:
    * Company headquarters location
    * Sector
    * Outstanding shares (for market cap, want weekly)
    * Earnings yield (inverse of P/E)
    * Cash & equivalents (for dilution prediction)
    * Cash burn (for dilution prediction, but must be estimated)

* Hard to get:
    * Free float (outstanding - insider - institutional)

* Very hard to get:
    * Short interest

A list of what I have found:
* [Polygon](https://polygon.io/docs/stocks/get_vx_reference_financials): only has EPS, sector and outstanding shares
* [Finnhub](https://finnhub.io/docs/api): has all the easy-to-get data, but not for delisted stocks
* [FinancialModelingPrep](https://site.financialmodelingprep.com/): has all the easy-to-get data. Free float can be calculated, as they have insider/institutional holdings and outstanding shares.
* [Valueinvesting.io](https://valueinvesting.io/short-interest-api): only has short interest and nothing else. Costs either 100 or 400/month.
* Compustat: I have access through my university, but not sure if short-interest is available
* [Sharadar](https://data.nasdaq.com/search?filters=%5B%22Equities%22%2C%22Fundamentals%22%5D): Listed on the Nasdaq Data Link data market
* [Tiingo](https://www.tiingo.com/documentation/)
* [IEXCloud](https://iexcloud.io/)
* [EODHD](https://eodhd.com/pricing?utm_source=google_search&utm_medium=cpc&utm_campaign=Brand_Europe&roistat_referrer&roistat_pos&roistat=google1_g_124657464434_526148784701_eodhd&gad=1)

Since Sharadar has almost everything I want, including data for ADRs, I am settling on them. QuantRocket also uses them. I thought they were extremely expensive, so at first I did not look further. It turns out I was wrong, although it is still stupid that you have to create an account just to see the pricing. The bundle for all data is 69/month, and they have a bulk data download endpoint. Although they do not provide float, it can be approximated by subtracting the institutional and insider ownership from the outstanding shares. This is not 100% correct (e.g. you also have to take into account options, Rule 144 etc.), but it is good enough for my purposes.
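The float approximation can be sketched as below. The share counts are made-up numbers and `approximate_free_float` is a hypothetical helper, not part of any vendor API.

```python
def approximate_free_float(shares_outstanding: int,
                           insider_shares: int,
                           institutional_shares: int) -> int:
    """Rough free float: outstanding shares minus insider and institutional holdings.

    Ignores options, Rule 144 restricted stock etc. The result is floored at 0
    in case the reported holdings overlap and exceed the shares outstanding.
    """
    return max(shares_outstanding - insider_shares - institutional_shares, 0)


# Made-up example numbers, purely for illustration.
free_float = approximate_free_float(
    shares_outstanding=35_000_000,
    insider_shares=20_000_000,
    institutional_shares=5_000_000,
)
print(free_float)
```

The floor at zero is a design choice for noisy ownership data; dropping such rows instead would be equally defensible.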

Almost no one has short interest. I will likely settle on Compustat and Valueinvesting.io for that. However, this is not high on my priority list.

Other alternative data that I want are the historical borrowing rates on IBKR. These can be bought on [iborrowdesk.com](https://www.patreon.com/iborrowdesk) for almost free.


In [2]:
from datetime import datetime, date, time, timedelta
from utils import first_trading_day_after, last_trading_day_before, get_tickers_v3
import pandas as pd
import numpy as np

START_DATE = first_trading_day_after(datetime(2021, 1, 1).date())
END_DATE = last_trading_day_before(datetime(2023, 8, 18).date())
print(START_DATE)
print(END_DATE)

2021-01-04
2023-08-18


Polygon does not actually have financials for all tickers, as the empty result for TOP below shows. They probably don't have ADRs because their filings are a little bit different.

In [2]:
from polygon.rest import RESTClient
DATA_PATH = "../../../data/polygon/"

with open(DATA_PATH + "secret.txt") as f:
    KEY = next(f).strip()

client = RESTClient(api_key=KEY)

data = pd.DataFrame(client.vx.list_stock_financials(ticker = "TOP", filing_date_gte=START_DATE, filing_date_lte=END_DATE) )
print(data)
data = client.get_ticker_details(ticker="TOP", date=END_DATE)
print(data)

Empty DataFrame
Columns: []
Index: []
TickerDetails(active=True, address=None, branding=Branding(icon_url=None, logo_url=None, accent_color=None, light_color=None, dark_color=None), cik='0001848275', composite_figi=None, currency_name='usd', currency_symbol=None, base_currency_name=None, base_currency_symbol=None, delisted_utc=None, description='TOP Financial Group Ltd is an online brokerage firm in Hong Kong specializing in the trading of local and overseas equities, futures, and options products. The company operates in only one segment which is the futures brokerage service.', ticker_root='TOP', ticker_suffix=None, homepage_url='https://www.zyfgl.com', list_date='2022-06-01', locale='us', market='stocks', market_cap=219845423.24, name='TOP Financial Group Limited Ordinary Shares', phone_number=None, primary_exchange='XNAS', share_class_figi=None, share_class_shares_outstanding=35010000, sic_code=None, sic_description=None, ticker='TOP', total_employees=11, type='CS', weighted_shares

# Sharadar Core US Fundamentals
They offer several subscriptions: fundamentals (59/month), insiders (25/month), institutional ownership (25/month) and EOD prices (29/month). The bundle that contains all of these costs only 69/month, so I might as well pay the extra 10/month.

Using this API is a little different from others. Sharadar sells its data through Nasdaq Data Link (NDL), so the data has to be accessed through the NDL API. That API must be usable for all vendors, so its endpoints are very generalized. NDL has 4 different APIs. The one we need is the 'REST API for tables data', with data product code SF1.

Every data set that uses the tables API has its own set of tables. E.g. for Sharadar the tables are SF1, DAILY, TICKERS, INDICATORS, ACTIONS, SP500 and EVENTS. All data is stored in one of these tables. In the API call, filters (e.g. <code>ticker='AAPL'</code>) are then used to narrow down the selection. The return value is XML, JSON or CSV. The filters available depend on the data set. For SF1 these are ticker, calendardate, lastupdated, dimension and datekey. You can also append [filter operators](https://docs.data.nasdaq.com/docs/parameters-1) to a filter, such as <code>.gt</code> to return only the rows with values greater than the specified value. We can also pass several parameters to each call, such as <code>qopts.columns</code> to select which columns of the table we want. There are also options to bulk download.
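To make the filter and parameter syntax concrete, here is a minimal sketch that only builds a tables API request URL and does not send it. The base URL is my understanding of the NDL tables endpoint and should be checked against their documentation. `build_tables_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

# Assumed base URL of the NDL 'REST API for tables data'.
BASE_URL = "https://data.nasdaq.com/api/v3/datatables"


def build_tables_url(vendor_code: str, table_code: str, params: dict, fmt: str = "json") -> str:
    """Build (but do not send) a tables API request URL.

    Row filters (ticker, dimension, ...), filter operators (.gte, .gt, ...)
    and options such as qopts.columns are all plain query parameters.
    """
    query = urlencode(params, safe=",")  # keep commas readable in column lists
    return f"{BASE_URL}/{vendor_code}/{table_code}.{fmt}?{query}"


url = build_tables_url(
    "SHARADAR",
    "SF1",
    {
        "ticker": "AAPL",                          # row filter
        "dimension": "ART",                        # row filter
        "calendardate.gte": "2021-01-01",          # filter with an operator
        "qopts.columns": "ticker,datekey,epsusd",  # column selection
        "api_key": "YOUR_API_KEY",                 # placeholder
    },
)
print(url)
```

In practice the SDK builds these requests for us; the sketch is only meant to show how the generalized endpoint maps onto vendor, table, filters and options.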

The 'dimension' must be a 3-letter combination of (AR, MR) and (Y, T, Q). AR is As-Reported and MR is Most-Recent; Y is for annual data, T for trailing-12-month and Q for quarterly. Q is not available for ADRs. AR is the one that excludes restatements. Sometimes when companies file annual reports, they do not file a quarterly one because that information is already contained in the annual report. If a restatement occurs in ART within the 12-month period, that datapoint is restated.
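The six possible dimension codes can be enumerated as follows. The descriptions are my own paraphrase of the paragraph above, not official Sharadar wording.

```python
from itertools import product

REPORT_TYPES = {
    "AR": "as reported (excludes restatements)",
    "MR": "most recent (includes restatements)",
}
PERIODS = {"Y": "annual", "T": "trailing 12 months", "Q": "quarterly"}

# Every dimension code is a report type followed by a period letter.
DIMENSIONS = {
    report + period: f"{REPORT_TYPES[report]}, {PERIODS[period]}"
    for report, period in product(REPORT_TYPES, PERIODS)
}

for code, meaning in DIMENSIONS.items():
    print(code, "->", meaning)
```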

Which one do we need? The best point-in-time information is ART. We definitely need As-Reported. Also, a trailing-12-month figure makes more sense. If we did not use the 12-month trailing figures, we would use either quarterly (which has no information about the previous 3 months) or annual (which has no information about the most recent months unless we query in January). ART *does* include restatements, but only if they were available within the 12-month period. We want restatements, but not those that we would not have had access to at that point in time. So ART is the best choice.

So the goal is: build a fundamental database for as many tickers as possible in our <code>tickers_v3</code> ticker list (excluding ETFs and ETNs). This database has a quarterly frequency and contains the information of the last 12 months, except for ADRs, because they do not file quarterly. We do not need all ratios, just a few. Some data does not change, such as the country or sector; this will be added to our ticker list to create <code>tickers_v4</code>. Some data updates more frequently than once per quarter. These are the ratios that depend on price, such as the P/B ratio. For these we only store the non-price (B) part. Then at backtesting the true value is calculated using the *forward adjusted* price at every data point. For ADRs some values are in USD, others are not. Yes, this is annoying.
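Storing only the non-price part and combining it with prices at backtest time could look roughly like this. All numbers are made up, the column names `close` and `price_to_book` are hypothetical, and price adjustments are ignored for brevity. The sketch also assumes that fundamentals dated on a given day are already usable with that day's price, which may be too optimistic.

```python
import pandas as pd

# Hypothetical stored fundamentals: only the non-price part (book value per share),
# indexed by the date the figure became available.
fundamentals = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-01", "2021-05-03"]),
    "book_ps": [4.0, 5.0],
})

# Hypothetical daily closing prices.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-29", "2021-02-01", "2021-04-30", "2021-05-03"]),
    "close": [10.0, 12.0, 16.0, 15.0],
})

# For every price date, take the latest fundamentals available on or before
# that date. Both frames must be sorted by the key.
merged = pd.merge_asof(prices, fundamentals, on="date", direction="backward")
merged["price_to_book"] = merged["close"] / merged["book_ps"]
print(merged)
```

The first row has no P/B value because no fundamentals were available yet on that date, which is exactly the point-in-time behaviour we want.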

We will add to <code>tickers_v4</code>:
* Sector (not the SIC sector)
* Industry
* Country

We will add to <code>raw/fundamentals/AAA.csv</code>:
* date: datekey: Date of availability (date)
* weighted_shares_outstanding: shareswa: Weighted average shares (int), used to get market cap

* cash: cashnequsd: Cash and equivalents (USD)
* change_in_cash: cashnequsd / previous cashnequsd (ratio)
* current_ratio: currentratio: Short-term assets / short-term liabilities (ratio). Higher is better.

* div_ps: dps: Dividends per share (USD/share)
* earnings_ps: epsusd: Earnings per share (USD), used to calculate P/E
* book_ps: bvps / fxusd: Book value per share (may not be in USD), converted to USD
* sales_ps: sps: Sales per share (USD)
* net_margin: netmargin: Net margin (ratio)
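A minimal sketch of how the column list above could be turned into our own table. `COLUMN_MAP` and `to_fundamentals` are hypothetical names, the sample rows are made up, and the change in cash is read here as the ratio to the previously available value.

```python
import pandas as pd

# Sharadar SF1 indicator -> our column name, following the list above.
COLUMN_MAP = {
    "datekey": "date",
    "shareswa": "weighted_shares_outstanding",
    "cashnequsd": "cash",
    "currentratio": "current_ratio",
    "dps": "div_ps",
    "epsusd": "earnings_ps",
    "sps": "sales_ps",
    "netmargin": "net_margin",
}


def to_fundamentals(sf1: pd.DataFrame) -> pd.DataFrame:
    """Select and rename the SF1 columns we keep and add the derived ones."""
    sf1 = sf1.sort_values("datekey")
    out = sf1[list(COLUMN_MAP)].rename(columns=COLUMN_MAP)
    # Book value per share may be in a foreign currency: convert with fxusd.
    out["book_ps"] = sf1["bvps"] / sf1["fxusd"]
    # Cash relative to the previously available figure (< 1 means cash went down).
    out["change_in_cash"] = sf1["cashnequsd"] / sf1["cashnequsd"].shift(1)
    return out.reset_index(drop=True)


# Made-up SF1 rows for one ticker, deliberately out of order.
sample = pd.DataFrame({
    "datekey": pd.to_datetime(["2021-05-03", "2021-02-01"]),
    "shareswa": [1_100_000, 1_000_000],
    "cashnequsd": [3_000_000.0, 4_000_000.0],
    "currentratio": [1.5, 2.0],
    "dps": [0.0, 0.0],
    "epsusd": [-0.5, -0.4],
    "sps": [1.2, 1.0],
    "netmargin": [-0.42, -0.40],
    "bvps": [20.0, 24.0],
    "fxusd": [8.0, 8.0],
})
fundamentals = to_fundamentals(sample)
print(fundamentals)
```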

Other data we want:
* S&P500 constituents in <code>raw/SP500.csv</code>. They do not have other indices, but we can construct those ourselves. Some of them are also mechanical, unlike the S&P500.

This list is quite short, but it is enough to work with. We can easily add more if we want. We also have to take splits into account: the earnings per share gets halved with a 2-for-1 split, but some ratios, such as the current ratio, stay the same.
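As a small worked example of the split effect, with made-up numbers and a hypothetical helper `adjust_for_split`:

```python
def adjust_for_split(per_share_value: float, new_shares: int, old_shares: int) -> float:
    """Restate a per-share figure after every old_shares shares become new_shares shares."""
    return per_share_value * old_shares / new_shares


# Per-share figures scale with the split: a 2-for-1 split halves EPS.
eps_before = 2.00
eps_after = adjust_for_split(eps_before, new_shares=2, old_shares=1)

# Ratios of balance sheet items do not involve the share count at all,
# so the current ratio is the same before and after the split.
current_assets, current_liabilities = 300.0, 150.0
current_ratio = current_assets / current_liabilities

print(eps_after, current_ratio)
```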

Also we will save the entire database just in case we need it later.

To avoid dealing with naked API requests, we use their official [SDK](https://github.com/Nasdaq/data-link-python/#local-api-key-environment-variable). Read their [quickstart guide](https://github.com/Nasdaq/data-link-python/blob/main/FOR_ANALYSTS.md) and [detailed guide](https://github.com/Nasdaq/data-link-python/blob/main/FOR_DEVELOPERS.md). It is really confusing what the difference is between <code>get</code>, <code>get_table</code>, <code>DataSet</code>, <code>DataTable</code> and <code>DataBase</code>. I will just follow the [examples](https://data.nasdaq.com/databases/SF1/documentation) and use <code>get_table</code>.

In [4]:
import nasdaqdatalink

DATA_PATH = "../../../data/sharadar/"

nasdaqdatalink.read_key(filename=DATA_PATH + "secret.txt")

## 2.1 Getting all tickers
First, get the list of tickers. This is obviously different from the Polygon one. The Polygon list is the most 'point-in-time' one: it includes recycled tickers, while Sharadar's does not. But even if it did, we would expect some small differences between data vendors.

In [28]:
ticker_metadata = nasdaqdatalink.get_table('SHARADAR/TICKERS', table='SF1', paginate=True)
# ticker_metadata.iloc[:, :10]
ticker_metadata["cik"] = ticker_metadata["secfilings"].str[-10:].astype(int)
ticker_metadata.to_csv(DATA_PATH + 'raw/tickers.csv')
ticker_metadata[ticker_metadata["ticker"] == "META"].head(3)

,table,permaticker,ticker,name,exchange,isdelisted,category,cusips,siccode,sicsector,...,location,lastupdated,firstadded,firstpricedate,lastpricedate,firstquarter,lastquarter,secfilings,companysite,cik
5452,SF1,194817,META,META PLATFORMS INC,NASDAQ,N,Domestic Common Stock,30303M102,7370,Services,...,California; U.S.A,2023-07-27,2014-09-26,2012-05-18,2023-09-08,2010-12-31,2023-06-30,https://www.sec.gov/cgi-bin/browse-edgar?actio...,http://investor.fb.com,1326801


Second, get the list of definitions used in the fundamentals.

In [30]:
columns = nasdaqdatalink.get_table('SHARADAR/INDICATORS', table='SF1')
columns.to_csv(DATA_PATH + "raw/SF1_definitions.csv")
columns.head(3)

,table,indicator,isfilter,isprimarykey,title,description,unittype
0,SF1,workingcapital,N,N,Working Capital,[Metrics] Working capital measures the differe...,currency
1,SF1,ticker,Y,Y,Ticker Symbol,[Entity] The ticker is a unique identifier for...,text
2,SF1,tbvps,N,N,Tangible Assets Book Value per Share,[Metrics] Measures the ratio between [Tangible...,currency/share


Exploring this table in Excel, we find that it does not have recycled tickers (NXU, GOLD). It also does not have renamed tickers (FB, ABX). The renamed tickers are not that big of a problem: using the Polygon ticker list I can get a list of renamings by comparing CIKs. But not having recycled tickers means that the further back in time I go, the less accurate the fundamental database gets, even though the database contains delisted stocks.

However, I will just accept this fact. The number of recycled tickers is low. The most important thing is that (non-recycled) delisted stocks are included. It is good enough. There is no need to get to 100% accuracy, as I am not primarily interested in fundamental strategies. My sample size is too small for that anyway, and I would just use CRSP for that. And these fundamental strategies are already researched into oblivion.

For any diversified strategy this <0.5% of missing fundamentals also does not matter too much. In almost 3 years of data, there were only around 20 such cases. Only the original holder of each ticker has data, so we miss 10. For price data these tickers are included, just not for fundamentals. If I really wanted to, I could get these fundamentals manually from the SEC filings themselves or from other APIs. I am not going to bother with that now, but it is on the to-do list.

By chance, while exploring the ticker list in Excel, I noticed that the CIK is actually available, hidden in the link to the SEC filings. This takes care of the renaming problem: now I can just look up the CIK in the Polygon table. For tickers that do not have a CIK, the ticker is used.
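A slightly more defensive way to pull the CIK out of the link, with the ticker as fallback, might look like this. It assumes the link carries a `CIK=` query parameter, which is how EDGAR company links usually look, but the notebook output above truncates the URL so this is not verified here. `extract_cik` and `join_key` are hypothetical helpers and the example URL is illustrative.

```python
import re
from typing import Optional

# Assumes the SEC link contains something like '...&CIK=0001326801'.
CIK_PATTERN = re.compile(r"CIK=(\d{1,10})", flags=re.IGNORECASE)


def extract_cik(secfilings_url) -> Optional[int]:
    """Return the CIK found in the link, or None if the link is missing or has no CIK."""
    if not isinstance(secfilings_url, str):  # e.g. NaN for a missing link
        return None
    match = CIK_PATTERN.search(secfilings_url)
    return int(match.group(1)) if match else None


def join_key(ticker: str, secfilings_url) -> str:
    """Key used to match against the Polygon list: the CIK if present, else the ticker."""
    cik = extract_cik(secfilings_url)
    return str(cik) if cik is not None else ticker


example_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001326801"
print(join_key("META", example_url))
print(join_key("XYZ", float("nan")))
```

Compared to slicing the last 10 characters, this does not break when the link is missing or has a different layout.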

## 2.2 Differences in 'dimension'
Let's see how the Y, Q and T dimensions work. How is EPS calculated for the three? Is the 12-month trailing figure a sum? If it is a sum, how is cashneq calculated? For that metric it does not make sense to take a sum. Depending on the answers we may need to change our approach and use Q/Y instead of T.

(The free subscription only has data for MRY (so annual, most-recent) for a small selection of stocks, so here I got a premium subscription.)
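One way to answer these questions is to compute what the trailing figures *would* be under the natural hypothesis (flow items such as EPS are summed over four quarters, balance sheet items such as cash are simply the latest value) and then compare that with the actual T rows. The sketch below only builds the expected values from made-up quarterly data; whether Sharadar actually works this way is exactly what still has to be checked.

```python
import pandas as pd

# Made-up quarterly (Q) values for one ticker.
quarterly = pd.DataFrame({
    "calendardate": pd.to_datetime(
        ["2021-03-31", "2021-06-30", "2021-09-30", "2021-12-31", "2022-03-31"]
    ),
    "epsusd": [0.5, 0.25, 0.75, 1.0, 0.5],
    "cashnequsd": [100.0, 90.0, 95.0, 120.0, 110.0],
})

# Hypothesis: flow items are a rolling 4-quarter sum, stock items are unchanged.
expected_ttm = pd.DataFrame({
    "calendardate": quarterly["calendardate"],
    "epsusd": quarterly["epsusd"].rolling(4).sum(),
    "cashnequsd": quarterly["cashnequsd"],
})
print(expected_ttm)
```

Comparing `expected_ttm` with the downloaded T rows on `calendardate` would show directly whether the hypothesis holds for a given column.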

In [21]:
#calendardate={'gte':'2013-01-01', 'lte':'2017-01-01'}
data = nasdaqdatalink.get_table('SHARADAR/SF1', ticker='AAPL', dimension='MRY')
print(data.shape)
data.to_csv(DATA_PATH + 'raw/AAPL.csv')
data.iloc[:3, :4] # Only show part, there are a lot of columns

(8, 111)


,ticker,dimension,calendardate,datekey
0,AAPL,MRY,2018-12-31,2018-09-29
1,AAPL,MRY,2017-12-31,2017-09-30
2,AAPL,MRY,2016-12-31,2016-09-24


In [15]:
ticker_metadata = pd.read_csv(DATA_PATH + "raw/tickers.csv")


## 2.3 Effect of splits

## 2.4 Difference between Sharadar and Polygon
This section looks at the differences in tickers between Polygon and Sharadar. For every ticker in <code>tickers_v3</code>, we check whether its CIK or ticker is in the Sharadar ticker list.

In [52]:
sharadar_tickers = pd.read_csv(DATA_PATH + "raw/tickers.csv")
polygon_tickers = get_tickers_v3()
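# Keep common stocks and ADRs; .copy() avoids a SettingWithCopyWarning when adding 'cik' below
polygon_tickers = polygon_tickers[polygon_tickers['type'].isin(["CS", "ADRC"])].copy()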
polygon_tickers['cik'] = pd.to_numeric(polygon_tickers['cik'], errors='coerce').astype('Int32')

In [56]:
is_contained_in_sharadar = (polygon_tickers['ticker'].isin(sharadar_tickers.ticker) |  polygon_tickers['cik'].isin(sharadar_tickers.cik))
print(f"Total amount of stocks in v3: {len(polygon_tickers)}")
print(f"Amount of missing stocks: {len(polygon_tickers[~is_contained_in_sharadar])}")
print(f"Percentage of stocks that have fundamentals in tickers_v3: {(is_contained_in_sharadar.sum()/len(polygon_tickers)):.3f}")

polygon_tickers[~is_contained_in_sharadar].tail(3)

Total amount of stocks in v3: 7549
Amount of missing stocks: 204
Percentage of stocks that have fundamentals in tickers_v3: 0.973


Unnamed: 0,ID,ticker,name,active,start_date,end_date,last_updated_utc,type,cik,composite_figi,delisted_utc
7439,XPOA-2021-01-04,XPOA,"DPCM Capital, Inc.",False,2021-01-04,2022-08-05,2022-08-07,CS,1821742,BBG00XRTC019,
7454,YAC-2021-01-04,YAC,Yucaipa Acquisition Corporation Class A Ordina...,False,2021-01-04,2021-12-14,2021-12-14,CS,1815302,,
7475,YRCW-2021-01-04,YRCW,"YRC Worldwide, Inc.",False,2021-01-04,2021-02-05,2021-02-07,CS,1683182,BBG00C0L8K58,


Just under 3% of tickers in <code>tickers_v3</code> are not contained in Sharadar. That is acceptable. This number will be higher the further back in time we go. This 3% should not make a significant difference in our trading systems, but we must still be aware of it just in case.