# S&P500 tracker

The goal of this notebook is to:
1. Explore the `yfinance` [library](https://pypi.org/project/yfinance/)
2. Develop scripts for the Extraction and Load parts of the ELT job. Transformation will be handled separately in dbt.

In [1]:
import pandas as pd
import yfinance as yf
from datetime import datetime

## Extract S&P500 data 

**General plan:**
1. Find all the S&P500 ticker symbols and other relevant data - source [Wikipedia](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies).
2. Extract/download data from `yfinance` for the last X days (e.g. a month or more) and upload it to a GCS bucket.
3. Also load all the ticker data from step 1 to the GCS bucket.
4. Develop script that downloads daily data and uploads to bucket.
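
The upload part of steps 2 to 4 could look roughly like the sketch below. It assumes the `google-cloud-storage` client library and default credentials are available; the bucket name, the `sp500` prefix, the date-partitioned object layout, and both helper names are hypothetical choices, not something this notebook has settled on.

```python
from datetime import date

import pandas as pd


def build_blob_name(prefix: str, run_date: date) -> str:
    """Build a date-partitioned object name (hypothetical layout)."""
    return f"{prefix}/{run_date:%Y/%m/%d}/prices.csv"


def upload_dataframe(df: pd.DataFrame, bucket_name: str, blob_name: str) -> None:
    """Serialize a DataFrame to CSV and upload it to a GCS bucket."""
    # Imported lazily so the rest of the notebook runs without the GCS client installed
    from google.cloud import storage

    client = storage.Client()  # assumes application default credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")


# Example object name for one daily run
print(build_blob_name("sp500", date(2024, 4, 1)))
```

Partitioning object names by date keeps each daily load separate, which should make reruns and the later dbt transformations easier to reason about.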

### 1. Find all S&P500 ticker symbols

In [2]:
# Get S&P 500 info from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# Get all the data of S&P 500 tickers in a dataframe
sp500_tickers = pd.read_html(url)[0]

# Get the list of S&P 500 tickers
sp500_symbols = sp500_tickers.Symbol.to_list()

#### There are 503 tickers in S&P500

In [3]:
sp500_tickers

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989
...,...,...,...,...,...,...,...,...
498,XYL,Xylem Inc.,Industrials,Industrial Machinery & Supplies & Components,"White Plains, New York",2011-11-01,1524472,2011
499,YUM,Yum! Brands,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,1041061,1997
500,ZBRA,Zebra Technologies,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,877212,1969
501,ZBH,Zimmer Biomet,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,1136869,1927


In [4]:
# List of symbols
len(sp500_symbols)

503

### 2. Extract/download data for the last X days

*First develop a function that downloads the data for a specified date range*

In [None]:
# Set the date range
# start_date = datetime.today().strftime('%Y-%m-%d')
# end_date = datetime.today().strftime('%Y-%m-%d')

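# Download data for each ticker (note: `end` is exclusive)
sp500_data = {}
for symbol in sp500_symbols:
    try:
        # Yahoo Finance uses a dash where Wikipedia uses a dot (e.g. BRK.B -> BRK-B)
        data = yf.download(symbol.replace('.', '-'), start='2024-04-01', end='2024-04-05')
    except Exception as e:
        print(f"Error downloading data for {symbol}: {e}")
        continue
    # A failed download usually comes back as an empty DataFrame rather than an exception
    if data.empty:
        print(f"No data returned for {symbol}")
        continue
    sp500_data[symbol] = data
    print(f"Downloaded data for {symbol}")

# Concatenate data for all tickers into a single DataFrame
sp500_df = pd.concat(sp500_data.values(), keys=sp500_data.keys(), names=['Ticker'])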
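
The loop above can be wrapped into the function this step calls for. This is a minimal sketch: `download_range` and its `downloader` parameter are hypothetical names, and the injectable downloader is only there so the logic can be exercised without network access. The column layout returned by `yf.download` may differ between `yfinance` versions, so the result should be inspected before loading it anywhere.

```python
import pandas as pd


def download_range(symbols, start, end, downloader=None):
    """Download prices for each symbol and stack them into one long DataFrame."""
    if downloader is None:
        # Default to yfinance; imported lazily so a stub can be used instead
        import yfinance as yf
        downloader = yf.download

    frames = {}
    for symbol in symbols:
        # Yahoo Finance uses a dash where Wikipedia uses a dot (e.g. BRK.B -> BRK-B)
        data = downloader(symbol.replace(".", "-"), start=start, end=end)
        if data is None or data.empty:
            print(f"No data returned for {symbol}")
            continue
        frames[symbol] = data

    if not frames:
        return pd.DataFrame()

    # Dict keys become the outer index level, named 'Ticker'
    return pd.concat(frames, names=["Ticker"]).reset_index()
```

With the real downloader this would be called as `download_range(sp500_symbols, '2024-04-01', '2024-04-05')`.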

In [12]:
sp500_df.reset_index(inplace=True)

In [13]:
# drop() returns a new DataFrame; sp500_df itself keeps the 'index' column
sp500_df.drop(columns=['index'])

Unnamed: 0,Ticker,Date,Open,High,Low,Close,Adj Close,Volume
0,MMM,2024-04-01,91.050003,94.339996,88.230003,94.019997,94.019997,13004800.0
1,MMM,2024-04-02,93.099998,94.419998,91.900002,92.839996,92.839996,8912000.0
2,MMM,2024-04-03,93.339996,94.699997,92.500000,93.190002,93.190002,6060200.0
3,MMM,2024-04-04,94.489998,95.669998,90.230003,90.540001,90.540001,5864200.0
4,AOS,2024-04-01,89.330002,89.769997,88.680000,89.080002,89.080002,676200.0
...,...,...,...,...,...,...,...,...
1999,ZBH,2024-04-04,130.889999,130.889999,127.430000,127.559998,127.559998,1032900.0
2000,ZTS,2024-04-01,168.990005,169.490005,166.119995,167.020004,166.545135,1896500.0
2001,ZTS,2024-04-02,165.669998,166.169998,163.639999,165.009995,164.540833,2391500.0
2002,ZTS,2024-04-03,165.000000,166.259995,162.639999,162.970001,162.506638,2481200.0
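
The MMM rows above stop at 2024-04-04 even though `end='2024-04-05'` was requested, because `yfinance` treats `end` as exclusive. For the daily job, setting both `start` and `end` to today would therefore return no rows. A small sketch of a one-day window follows; `daily_window` is a hypothetical helper name.

```python
from datetime import date, timedelta


def daily_window(run_date: date) -> tuple[str, str]:
    """Return (start, end) strings covering exactly one day, with an exclusive end."""
    start = run_date
    end = run_date + timedelta(days=1)  # exclusive upper bound
    return start.isoformat(), end.isoformat()


start_date, end_date = daily_window(date(2024, 4, 4))
print(start_date, end_date)
```

The two strings can then be passed as `start` and `end` to `yf.download`. On weekends and market holidays the window will simply contain no trading data.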


### 3. Join the financial data from yfinance with the Wikipedia ticker data

In [18]:
sp500_df.merge(sp500_tickers, left_on='Ticker', right_on='Symbol', how='left')

Unnamed: 0,index,Ticker,Date,Open,High,Low,Close,Adj Close,Volume,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,0,MMM,2024-04-01,91.050003,94.339996,88.230003,94.019997,94.019997,13004800.0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,1,MMM,2024-04-02,93.099998,94.419998,91.900002,92.839996,92.839996,8912000.0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
2,2,MMM,2024-04-03,93.339996,94.699997,92.500000,93.190002,93.190002,6060200.0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
3,3,MMM,2024-04-04,94.489998,95.669998,90.230003,90.540001,90.540001,5864200.0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
4,4,AOS,2024-04-01,89.330002,89.769997,88.680000,89.080002,89.080002,676200.0,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1999,1999,ZBH,2024-04-04,130.889999,130.889999,127.430000,127.559998,127.559998,1032900.0,ZBH,Zimmer Biomet,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,1136869,1927
2000,2000,ZTS,2024-04-01,168.990005,169.490005,166.119995,167.020004,166.545135,1896500.0,ZTS,Zoetis,Health Care,Pharmaceuticals,"Parsippany, New Jersey",2013-06-21,1555280,1952
2001,2001,ZTS,2024-04-02,165.669998,166.169998,163.639999,165.009995,164.540833,2391500.0,ZTS,Zoetis,Health Care,Pharmaceuticals,"Parsippany, New Jersey",2013-06-21,1555280,1952
2002,2002,ZTS,2024-04-03,165.000000,166.259995,162.639999,162.970001,162.506638,2481200.0,ZTS,Zoetis,Health Care,Pharmaceuticals,"Parsippany, New Jersey",2013-06-21,1555280,1952
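
Before loading the joined table, it may be worth checking that every price row found a match in the ticker table. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on a tiny stand-in for the two frames; `XXX` is a made-up ticker included only to show an unmatched row.

```python
import pandas as pd

# Tiny stand-ins for sp500_df and sp500_tickers
prices = pd.DataFrame({
    "Ticker": ["MMM", "MMM", "XXX"],  # 'XXX' is a made-up ticker
    "Date": ["2024-04-01", "2024-04-02", "2024-04-01"],
    "Close": [94.019997, 92.839996, 1.0],
})
tickers = pd.DataFrame({
    "Symbol": ["MMM", "AOS"],
    "Security": ["3M", "A. O. Smith"],
})

merged = prices.merge(
    tickers,
    left_on="Ticker",
    right_on="Symbol",
    how="left",
    validate="many_to_one",  # raises if a Symbol appears more than once in tickers
    indicator=True,          # adds a '_merge' column describing each row's match
)

# Tickers that have prices but no metadata row
unmatched = merged.loc[merged["_merge"] == "left_only", "Ticker"].unique().tolist()
print(unmatched)
```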


In [None]:
sp500_df['Ticker'].nunique()


In [None]:
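# '^GSPC' is the S&P 500 index itself; its history has no constituent symbols,
# so keep it separate and reuse sp500_symbols from the Wikipedia table
sp500_index = yf.download('^GSPC', start='2020-01-01', end='2024-04-01')

# Create an empty dictionary to store stock data
stock_data = {}

# Loop through each stock symbol and download data
for symbol in sp500_symbols:
    # Yahoo Finance uses a dash where Wikipedia uses a dot (e.g. BRK.B -> BRK-B)
    data = yf.download(symbol.replace('.', '-'), start='2020-01-01', end='2024-04-01')
    if data.empty:
        print(f"Data not available for {symbol}")
    else:
        stock_data[symbol] = data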

In [None]:
stock_data