# Question 1: [Index] S&P 500 Stocks Added to the Index

Which year had the highest number of additions?

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page][wiki-page], download the data including the year each company was added to the index.

- Create a DataFrame with company tickers, names, and the year they were added.
- Extract the year from the addition date and calculate the number of stocks added each year.
- Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (if several years tie, take the most recent one).
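
The counting rule in the last bullet can be sketched in plain Python, independent of any DataFrame library. The sample dates and the helper name `top_addition_year` are hypothetical; they only illustrate excluding 1957 and breaking ties in favour of the most recent year.

```python
from collections import Counter
from datetime import date


def top_addition_year(dates_added, excluded_year=1957):
    """Return (year, count) for the year with the most additions.

    Ties are broken in favour of the most recent year.
    """
    # Extract the year from each ISO-formatted date string
    years = [date.fromisoformat(d).year for d in dates_added]
    counts = Counter(y for y in years if y != excluded_year)
    if not counts:
        raise ValueError("no additions outside the excluded year")
    # Compare (count, year) pairs: highest count wins, then the latest year
    year, count = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return year, count


# Hypothetical sample, not real index data
sample = [
    "1957-03-04", "1957-03-04", "1957-03-04",
    "2016-05-01", "2016-07-01",
    "2017-01-05", "2017-06-19",
    "2019-02-15",
]
print(top_addition_year(sample))  # (2017, 2)
```

In this sample 2016 and 2017 both have two additions, so the later year is returned, and the three 1957 entries are ignored.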

Context:

> "Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday." Recent examples of S&P 500 additions include DASH, WSM, EXE, and TKO in 2025 ([Nasdaq article][nasdaq-article]).

Additional: When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement. How many current S&P 500 stocks have been in the index for more than 20 years?

[wiki-page]: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
[nasdaq-article]: https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying

In [1]:
from bs4 import BeautifulSoup
from datetime import date
from polars import col as c
import polars as pl
import requests as r

resp = r.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
resp.raise_for_status()  # fail early on a non-2xx response
doc = BeautifulSoup(resp.content, 'html.parser')
table = doc.find_all(id='constituents')[0]
headers = [str(h.get_text().strip()).lower() for h in table.find_all(name='th')]
rows = [
    [str(cell.get_text().strip()) for cell in row.find_all(name='td')]
    for row in table.find_all(name='tr')[1:]   # skip the header row (tr > th), which is the first row inside <tbody>
]
df = pl.DataFrame([dict(zip(headers, row)) for row in rows]).with_columns(
    year=c("date added").str.strptime(pl.Date, "%Y-%m-%d").dt.year()
).select('symbol', 'security', 'year')
df.head()

symbol,security,year
str,str,i32
"""MMM""","""3M""",1957
"""AOS""","""A. O. Smith""",2017
"""ABT""","""Abbott Laboratories""",1957
"""ABBV""","""AbbVie""",2012
"""ACN""","""Accenture""",2011


In [2]:
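# drop the founding year first, then count additions per year;
# sorting by (count, year) descending puts the most recent year first among ties
df.filter(c('year') != 1957).group_by('year').agg(
    num_additions=pl.len()
).sort(['num_additions', 'year'], descending=True).head()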

year,num_additions
i32,u32
2017,23
2016,23
2019,22
2008,17
2024,16


For simplicity, only the extracted year component is used to determine whether a company has been in the S&P 500 for more than 20 years.

In [3]:
# keep companies added more than 20 calendar years before the current year, then count the rows
df.filter(c('year') + 20 < date.today().year).height

219
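
The year-only comparison in the cell above is an approximation. As a rough illustration of how it can differ from an exact date comparison, the sketch below uses two hypothetical helpers (`over_20_years_by_year`, `over_20_years_exact`) and a fixed reference date chosen purely for the example; neither the helpers nor the dates come from the notebook.

```python
from datetime import date


def over_20_years_by_year(added: date, today: date) -> bool:
    # Same rule as the notebook cell: compare calendar years only
    return added.year + 20 < today.year


def over_20_years_exact(added: date, today: date) -> bool:
    # Exact rule: the 20th anniversary of the addition has already passed
    try:
        anniversary = added.replace(year=added.year + 20)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 1 March
        anniversary = date(added.year + 20, 3, 1)
    return anniversary < today


# Hypothetical reference date and addition date, for illustration only
today = date(2025, 6, 1)
added = date(2005, 3, 1)

print(over_20_years_by_year(added, today))  # False
print(over_20_years_exact(added, today))    # True
```

Under the year-only rule, a company added in 2005 is not counted at any point during 2025, even after its 20th anniversary has passed, so the year-only count can be slightly lower than an exact-date count.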