# Index constituents
(not everything is up-to-date, we use the Sharadar SF1 database)

Sometimes (especially for long strategies), I want to limit my focus to the largest X stocks. High-market-cap stocks are less likely to be manipulated, more liquid, and more likely to be actually profitable companies. However, I don't care exactly what is in the S&P500. I care only about market cap and liquidity. So I also don't need a day-by-day update. Quarterly is fine.

The goal is to create a list of the N stocks with the largest market caps. However, we will apply some additional rules:

* Must have market cap
* Must have a SIC code
* Must be headquartered in the US

In [1]:
from datetime import datetime, date, time, timedelta
from times import get_market_dates, get_market_calendar, last_trading_date_before
from data import get_data
from tickers import get_tickers
from polygon.rest import RESTClient
import json
import numpy as np
import ast
import pandas as pd

DATA_PATH = "../data/polygon/"

START_DATE = date(2003, 11, 1)  # MUST BE 1st of FEB, MAY, AUG or NOV
END_DATE = date(2023, 8, 1)

with open(DATA_PATH + "secret.txt") as f:
    KEY = next(f).strip()

client = RESTClient(api_key=KEY)

Using the fundamentals, create a list of the historical top N market cap stocks.

In [4]:
N = 500

market_cap = pd.read_csv('../data/other/sharadar_processed.csv')
market_cap.date = pd.to_datetime(market_cap.date).dt.date
market_cap = market_cap[market_cap['country'] == 'US'] # Filter out ADRs
market_cap = market_cap[~market_cap['sic'].isna()] # Filter out tickers with no SIC code

dates_and_IDs = {}

for day in pd.date_range(START_DATE, END_DATE, freq='3MS').date:
    market_cap_day = market_cap[(market_cap['date'] <= day) & (market_cap['date'] > day - timedelta(days=90))]
    market_cap_day = market_cap_day[~market_cap_day['ID'].duplicated(keep='last')]
    market_cap_day_sorted = market_cap_day.sort_values(by='marketcap_M', ascending=False)

    top_market_cap_tickers = market_cap_day_sorted['ID'].head(N).values.tolist()
    dates_and_IDs[day.isoformat()] = top_market_cap_tickers

with open(f'../output/universes/TOP_{N}.json', 'w') as f: 
    json.dump(dates_and_IDs, f)
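
The quarterly snapshot logic above can be illustrated on a toy table. This is only a sketch: the column names mirror the notebook, but the rows, the IDs and the helper name `top_n_snapshot` are made up for illustration. It sorts by date explicitly before dropping duplicates, so that "last" really means the most recent row per ID, which the loop above assumes is already true of the CSV.

```python
from datetime import date, timedelta

import pandas as pd

# Toy fundamentals table: values and IDs are invented for illustration only.
toy = pd.DataFrame({
    "ID": ["AAA-2000-01-03", "AAA-2000-01-03", "BBB-2001-05-01",
           "CCC-2002-02-01", "DDD-2002-03-01"],
    "date": [date(2023, 2, 10), date(2023, 4, 28), date(2023, 4, 20),
             date(2023, 1, 15), date(2023, 3, 30)],
    "marketcap_M": [900.0, 1000.0, 500.0, 2000.0, 750.0],
})


def top_n_snapshot(df, day, n, lookback_days=90):
    """Hypothetical helper: top-n IDs by market cap, using rows from the last `lookback_days`."""
    # Keep only rows reported in the window (day - lookback_days, day].
    window = df[(df["date"] <= day) & (df["date"] > day - timedelta(days=lookback_days))]
    # Sort by date so that keep="last" selects the most recent row per ID.
    latest = window.sort_values("date").drop_duplicates("ID", keep="last")
    return latest.sort_values("marketcap_M", ascending=False)["ID"].head(n).tolist()


snapshot = top_n_snapshot(toy, date(2023, 5, 1), n=2)
print(snapshot)
```

In this toy data, CCC has the largest market cap but its only row is older than 90 days, so it drops out of the snapshot. That is the same behaviour the 90-day window in the loop produces for tickers without a recent report.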

To load the most recent list, we need a helper function as the universe does not have tickers for every day.

In [5]:
with open('../output/universes/TOP_500.json', 'r') as f: 
    largest_stocks = json.load(f)

In [6]:
def get_latest_value(dictionary, day):
    """Get the value corresponding to the latest key strictly before <day> in a dictionary"""
    # Keys are ISO date strings; use a separate loop name so <day> is not shadowed
    dates = [date.fromisoformat(key) for key in dictionary.keys()]
    key = max(d for d in dates if d < day)
    return dictionary[key.isoformat()]

In [7]:
get_latest_value(largest_stocks, date(2022, 5, 2))[:3]

['AAPL-2003-09-10', 'MSFT-2003-09-10', 'GOOG-2004-08-19']
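
As a possible variant of the helper above, the lookup can be done with `bisect` on the sorted ISO keys, returning `None` instead of raising when no earlier date exists. This is a sketch, not the function used elsewhere in this notebook; the name `get_latest_value_or_none` and the toy universe are hypothetical.

```python
from bisect import bisect_left
from datetime import date


def get_latest_value_or_none(dictionary, day):
    """Return the value for the latest ISO-date key strictly before `day`, or None."""
    keys = sorted(dictionary)  # ISO date strings sort chronologically
    pos = bisect_left(keys, day.isoformat())  # number of keys strictly before `day`
    if pos == 0:
        return None  # no snapshot exists before `day`
    return dictionary[keys[pos - 1]]


# Toy universe with made-up tickers.
toy_universe = {
    "2022-02-01": ["AAA", "BBB"],
    "2022-05-01": ["AAA", "CCC"],
}

may_list = get_latest_value_or_none(toy_universe, date(2022, 5, 2))
same_day = get_latest_value_or_none(toy_universe, date(2022, 5, 1))
too_early = get_latest_value_or_none(toy_universe, date(2022, 1, 1))
print(may_list, same_day, too_early)
```

Because the comparison is strict, a query on the rebalance date itself still returns the previous quarter's list.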

# Current S&P500 vs Top 500
What is the difference between the real S&P500 and the list we created? To get the current S&P500 holdings, I use this [link](https://www.slickcharts.com/sp500) from slickcharts. We need to be cautious with ticker conventions: Polygon uses a . in the tickers of some share classes, such as BRK.B, while other sources omit the dot.
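
The ticker handling can be wrapped in a small helper. This is a sketch that assumes IDs have the layout `<TICKER>-YYYY-MM-DD`, as in `'AAPL-2003-09-10'`; the function name `id_to_ticker` and the BRK.B ID used below are hypothetical.

```python
import re

# Assumed ID layout: '<TICKER>-YYYY-MM-DD'.
_ID_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def id_to_ticker(identifier, keep_dot=True):
    """Strip the date suffix from an ID; optionally drop the share-class dot."""
    ticker = _ID_SUFFIX.sub("", identifier)
    return ticker if keep_dot else ticker.replace(".", "")


print(id_to_ticker("AAPL-2003-09-10"))
print(id_to_ticker("BRK.B-2000-01-01"))                  # made-up ID, dot kept
print(id_to_ticker("BRK.B-2000-01-01", keep_dot=False))  # dot removed
```

Unlike slicing with `[:-11]`, the regular expression leaves a string unchanged when it does not end in a date, which makes a malformed ID easier to spot.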

In [49]:
sp500 = list(pd.read_excel('../output/universes/S&P500.xlsx', index_col=0)['Symbol'])
sp500[:5]

['MSFT', 'AAPL', 'NVDA', 'AMZN', 'META']

In [50]:
with open('../output/universes/TOP_500.json', 'r') as f: 
    TOP_500 = json.load(f)
TOP_500 = get_latest_value(TOP_500, date(2024, 3, 1))
TOP_500 = [ticker[:-11] for ticker in TOP_500] # No need to remove the . in tickers like BRK.B as the S&P500 already contains these
TOP_500[:5]

['AAPL', 'MSFT', 'GOOG', 'GOOGL', 'AMZN']

Let's first look at whether we even have all tickers that are in the S&P500.

In [51]:
available = []
for index, row in get_tickers(4).iterrows():
    ticker = row['ID'][:-11]
    if ticker in sp500:
        available.append(ticker)

difference = set(sp500).difference(set(available))
print(f'Tickers in our list that match SP500: {len(set(available))}')
print(f'Tickers in SP500: {len(sp500)}')

Tickers in our list that match SP500: 498
Tickers in SP500: 503


In [52]:
difference

{'C', 'CAT', 'ETN', 'IVZ', 'JPM'}

Of these tickers:
* CPAY: Polygon does not have it for some unknown reason.
* GEV: Newly listed, so it is correct that we don't have it.
* SOLV: Also newly listed.

Do we have fundamental data for all S&P500 stocks?

In [53]:
fundamentals = pd.read_csv(DATA_PATH + 'processed/fundamentals.csv', index_col=0)
tickers_we_have_fundamentals_of = list(set(fundamentals['ID']))
tickers_we_have_fundamentals_of = [ticker[:-11] for ticker in tickers_we_have_fundamentals_of]

not_available = []
for ticker in sp500:
    if ticker not in tickers_we_have_fundamentals_of:
        not_available.append(ticker)
len(not_available)

8

In [54]:
not_available

['JPM', 'CAT', 'ETN', 'C', 'GEV', 'CPAY', 'SOLV', 'IVZ']

* DAY: our data only starts in 2024-02, so it's correct that we do not have fundamentals yet. Our data only starts in 2024-02 because we had no ticker change data for 2024 and DAY got renamed.

For the other tickers, see the Russell 3000 section.

Tickers that are in the S&P500 but not in the Top 500:

In [55]:
len(set(sp500).difference(set(TOP_500)) )

90

Other way around.

In [56]:
len(set(TOP_500).difference(set(sp500)))

87
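
The two set differences above can be computed in one go. The sketch below uses made-up tickers, and the helper name `compare_universes` and the extra coverage figure are additions for illustration, not part of the original analysis.

```python
def compare_universes(a, b):
    """Hypothetical helper: summarise the overlap between two ticker lists."""
    a, b = set(a), set(b)
    shared = a & b
    return {
        "only_in_a": len(a - b),   # e.g. in the index but not in our list
        "only_in_b": len(b - a),   # e.g. in our list but not in the index
        "shared": len(shared),
        # Share of universe a that also appears in b (0.0 if a is empty).
        "coverage_of_a": len(shared) / len(a) if a else 0.0,
    }


# Toy example with invented tickers.
index_like = ["AAA", "BBB", "CCC", "DDD"]
top_like = ["BBB", "CCC", "EEE"]

summary = compare_universes(index_like, top_like)
print(summary)
```

Converting to sets first also removes duplicate tickers, so the counts are not inflated if a list contains the same symbol twice.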

# Current Russell 3000 vs Top 3000
The Russell 3000 is essentially all stocks except the microcaps. I could not find the real holdings of the Russell 3000, so I will use the Russell 3000 ETF (IWV), which only has around 2750 holdings instead of 3000. The missing 250 or so won't make a difference.

First, not all tickers in IWV are in our ticker list:

In [60]:
Russell_3000 = list(pd.read_csv('../output/universes/IWV_holdings.csv')['Ticker'])
print(len(Russell_3000))

2686


In [61]:
available = []
for index, row in get_tickers(5).iterrows():
    ticker = row['ID'][:-11]
    ticker = ticker.replace('.', '')
    if ticker in Russell_3000:
        available.append(ticker)

difference = set(Russell_3000).difference(set(available))
print(f'Tickers in our list that match R3000: {len(set(available))}')
print(f'Tickers in R3000: {len(Russell_3000)}')

Tickers in our list that match R3000: 2665
Tickers in R3000: 2686


Manually finding the cause:
* ADRO = unlisted
* SOLV = newly listed
* WLLAW = warrants
* WLLBW = warrants
* PFC = probably because of a corrupted file
* METCV = ex-distribution
* GEV = newly listed
* CVR = contingent value right or something similar
* CPAY = Polygon does not have it, not even in the most recent ticker list
* DHC = incorrectly removed because 'trust' is in the name
* KKR = incorrectly removed
* NXDT = incorrectly removed because 'trust' is in the name
* CG = incorrectly classified

I could manually correct these, but again, I don't care about exact holdings. Having 99% accuracy is already very good.

Now we will look at the difference between IWV and Top 3000.

In [62]:
Russell_3000 = list(pd.read_csv('../output/universes/IWV_holdings.csv')['Ticker'])
len(Russell_3000)

2686

Because IWV only has 2686 holdings, I will compare it to the top 2686. I will take July 2023, because the Russell 3000 is rebalanced in June and July is the closest of the data I have.

In [63]:
with open('../output/universes/TOP_2686.json', 'r') as f: 
    TOP_3000_all = json.load(f)
TOP_3000 = get_latest_value(TOP_3000_all, date(2023, 7, 5))
TOP_3000 = [ticker[:-11].replace('.', '') for ticker in TOP_3000] # IWV_holdings has no . in BRK.B

Tickers that are in IWV but not in Top 3000.

In [64]:
len(set(Russell_3000).difference(set(TOP_3000)) )

379

Other way around.

In [65]:
diff = set(TOP_3000).difference(set(Russell_3000))
len(diff)

380

In [23]:
tickers = get_tickers()
top_3000 = get_latest_value(TOP_3000_all, date(2023, 7, 5))
tickers[tickers['ID'].isin(top_3000)].to_csv('../output/diff_R3000_T3000.csv')

A difference of about 14% (379 of 2686 tickers) is quite OK.

Do we have fundamental data for all Russell 3000 stocks?

In [66]:
fundamentals = pd.read_csv(DATA_PATH + 'processed/fundamentals.csv', index_col=0)
tickers_we_have_fundamentals_of = list(set(fundamentals['ID']))
tickers_we_have_fundamentals_of = [ticker[:-11].replace('.', '') for ticker in tickers_we_have_fundamentals_of]

not_available = []
for ticker in Russell_3000:
    if ticker not in tickers_we_have_fundamentals_of:
        not_available.append(ticker)
len(not_available)

26

Removing the ones that we have already explained:

In [67]:
diff = set(not_available).difference(difference)
len(diff)

5

In [68]:
diff

{'ARD', 'CPAY', 'GEV', 'SOLV', 'TBRG'}

This is better than expected. I don't have ticker changes for 2024, so part of the remaining difference may be explained by my failing to query on the old ticker.

# Conclusion

All in all, I am content with the indices that I have created because they resemble the original indices enough for it to be usable.