# S&P500 tracker

The goals of this notebook are to:
1. Explore the `yfinance` [library](https://pypi.org/project/yfinance/)
2. Develop scripts for the Extraction and Load parts of the ELT job. Transformation will be handled separately in dbt.

In [2]:
import pandas as pd
import yfinance as yf
from datetime import datetime

## Extract S&P500 data 

**General plan:**
1. Find all the S&P500 ticker symbols and other relevant data. Source: [Wikipedia](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies).
2. Extract/download data for the last month or so and upload it to a GCP bucket.
3. Develop a script that downloads daily data and uploads it to the bucket.

### 1. Find all S&P500 ticker symbols

In [6]:
# Get S&P 500 info from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# pd.read_html returns every table on the page;
# the first one lists the current S&P 500 constituents
sp500_tickers = pd.read_html(url)[0]

# Get the list of S&P 500 ticker symbols
sp500_symbols = sp500_tickers['Symbol'].to_list()

#### There are 503 tickers in the S&P500

In [7]:
sp500_tickers

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989
...,...,...,...,...,...,...,...,...
498,XYL,Xylem Inc.,Industrials,Industrial Machinery & Supplies & Components,"White Plains, New York",2011-11-01,1524472,2011
499,YUM,Yum! Brands,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,1041061,1997
500,ZBRA,Zebra Technologies,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,877212,1969
501,ZBH,Zimmer Biomet,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,1136869,1927


In [9]:
# List of symbols
len(sp500_symbols)

503
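One detail that may matter before these symbols are passed to `yfinance`: Wikipedia writes share classes with a dot (for example `BRK.B`), while Yahoo Finance uses a hyphen (`BRK-B`). The following is a minimal sketch of a normalization step. The helper names `to_yahoo_symbol` and `to_yahoo_symbols` are hypothetical and not part of either library.

```python
def to_yahoo_symbol(symbol: str) -> str:
    """Convert a Wikipedia-style ticker to the form Yahoo Finance expects."""
    # Strip stray whitespace and swap the share-class dot for a hyphen.
    return symbol.strip().upper().replace(".", "-")


def to_yahoo_symbols(symbols):
    """Normalize a list of tickers, preserving order and dropping duplicates."""
    seen = set()
    result = []
    for raw in symbols:
        symbol = to_yahoo_symbol(raw)
        if symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


print(to_yahoo_symbols(["MMM", "BRK.B", "BF.B"]))  # ['MMM', 'BRK-B', 'BF-B']
```

In this notebook it could be applied as `sp500_symbols = to_yahoo_symbols(sp500_tickers['Symbol'].to_list())`.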

### 2. Extract/download data for the last month and upload to GCP bucket.

*To do that, first develop a function that downloads the data for a specified date range.*
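A possible shape for such a function is sketched below. It is an illustration under assumptions, not the notebook's final implementation: `download_prices` and its `downloader` parameter are hypothetical names, and the injectable downloader exists only so the logic can be tried without network access. By default it falls back to `yf.download`, which may return an empty result for a failed ticker rather than raise, so empty results are recorded as failures too.

```python
def download_prices(symbols, start, end, downloader=None):
    """Download price history for each symbol between start and end.

    Returns (frames, failed): a dict mapping symbol -> downloaded data,
    and a list of symbols for which nothing could be downloaded.
    """
    if downloader is None:
        # Imported lazily so the function can be exercised offline.
        import yfinance as yf

        def downloader(symbol, start, end):
            return yf.download(symbol, start=start, end=end, progress=False)

    frames, failed = {}, []
    for symbol in symbols:
        try:
            data = downloader(symbol, start, end)
        except Exception as exc:
            print(f"Error downloading data for {symbol}: {exc}")
            failed.append(symbol)
            continue
        # Treat an empty result as a failed download.
        if data is None or len(data) == 0:
            failed.append(symbol)
            continue
        frames[symbol] = data
    return frames, failed
```

With the default downloader, the returned `frames` dict could be combined with `pd.concat(frames.values(), keys=frames.keys(), names=['Ticker'])`, and `failed` shows which symbols need a second look.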

In [None]:
# Set the date range
# start_date = datetime.today().strftime('%Y-%m-%d')
# end_date = datetime.today().strftime('%Y-%m-%d')

# Download data for each ticker.
# Iterate over the symbol list: looping over the sp500_tickers DataFrame
# itself would yield its column names, not the tickers.
sp500_data = {}
for ticker in sp500_symbols:
    try:
        # `end` is exclusive, so this covers 2024-04-01 through 2024-04-04
        data = yf.download(ticker, start='2024-04-01', end='2024-04-05')
        sp500_data[ticker] = data
        print(f"Downloaded data for {ticker}")
    except Exception as e:
        print(f"Error downloading data for {ticker}: {e}")

# Concatenate data for all tickers into a single DataFrame
sp500_df = pd.concat(sp500_data.values(), keys=sp500_data.keys(), names=['Ticker'])
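The commented-out date lines in the cell above hint at computing the range dynamically instead of hard-coding it. A small sketch follows; `last_n_days` is a hypothetical helper and the window length is an assumption. Because `yfinance` treats `end` as exclusive, the end date is pushed one day past the last day wanted.

```python
from datetime import date, timedelta


def last_n_days(today: date, days: int = 30):
    """Return (start, end) strings covering the `days` days up to and including `today`."""
    start = today - timedelta(days=days - 1)
    end = today + timedelta(days=1)  # exclusive upper bound for yf.download
    return start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")


# Reproduces the hard-coded range used in this notebook.
print(last_n_days(date(2024, 4, 4), days=4))  # ('2024-04-01', '2024-04-05')
```

For a daily job, `last_n_days(date.today(), days=1)` would give a range covering just the current day.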

In [45]:
sp500_df  # .reset_index(inplace=True)

Unnamed: 0,index,Ticker,Date,Open,High,Low,Close,Adj Close,Volume
0,0,AOS,2024-04-01,89.330002,89.769997,88.680000,89.080002,89.080002,676200.0
1,1,AOS,2024-04-02,88.709999,88.900002,87.889999,88.550003,88.550003,921100.0
2,2,AOS,2024-04-03,88.550003,89.419998,88.290001,88.650002,88.650002,899000.0
3,3,AOS,2024-04-04,89.389999,89.690002,86.959999,87.139999,87.139999,1036500.0
4,4,ABT,2024-04-01,113.660004,113.660004,111.820000,112.089996,112.089996,3964000.0
...,...,...,...,...,...,...,...,...,...
2003,2003,ZBH,2024-04-04,130.889999,130.889999,127.430000,127.559998,127.559998,1032900.0
2004,2004,ZTS,2024-04-01,168.990005,169.490005,166.119995,167.020004,167.020004,1896500.0
2005,2005,ZTS,2024-04-02,165.669998,166.169998,163.639999,165.009995,165.009995,2391500.0
2006,2006,ZTS,2024-04-03,165.000000,166.259995,162.639999,162.970001,162.970001,2481200.0


In [46]:
sp500_df.drop(columns=['index'])

Unnamed: 0,Ticker,Date,Open,High,Low,Close,Adj Close,Volume
0,AOS,2024-04-01,89.330002,89.769997,88.680000,89.080002,89.080002,676200.0
1,AOS,2024-04-02,88.709999,88.900002,87.889999,88.550003,88.550003,921100.0
2,AOS,2024-04-03,88.550003,89.419998,88.290001,88.650002,88.650002,899000.0
3,AOS,2024-04-04,89.389999,89.690002,86.959999,87.139999,87.139999,1036500.0
4,ABT,2024-04-01,113.660004,113.660004,111.820000,112.089996,112.089996,3964000.0
...,...,...,...,...,...,...,...,...
2003,ZBH,2024-04-04,130.889999,130.889999,127.430000,127.559998,127.559998,1032900.0
2004,ZTS,2024-04-01,168.990005,169.490005,166.119995,167.020004,167.020004,1896500.0
2005,ZTS,2024-04-02,165.669998,166.169998,163.639999,165.009995,165.009995,2391500.0
2006,ZTS,2024-04-03,165.000000,166.259995,162.639999,162.970001,162.970001,2481200.0


In [48]:
sp500_df['Ticker'].nunique()


502
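Only 502 distinct tickers come back even though 503 symbols were requested. One way to find the gap is to compare the requested symbols against those present in the result. The sketch below uses short stand-in lists in place of `sp500_symbols` and `sp500_df['Ticker']`, and `missing_symbols` is a hypothetical helper. The output above starts at AOS rather than MMM, which suggests MMM may be the missing ticker, but that is only an inference from the truncated display.

```python
def missing_symbols(requested, downloaded):
    """Return requested symbols absent from the downloaded data, in request order."""
    present = set(downloaded)
    return [symbol for symbol in requested if symbol not in present]


# Stand-in data: the real inputs would be sp500_symbols and sp500_df['Ticker'].
requested = ["MMM", "AOS", "ABT"]
downloaded = ["AOS", "AOS", "ABT"]
print(missing_symbols(requested, downloaded))  # ['MMM']
```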

In [13]:
# Get S&P 500 symbols from the Wikipedia table loaded earlier.
# An earlier version of this cell used yf.download('^GSPC', ...).index,
# which holds the trading dates of the index, not ticker symbols.
# Passing those timestamps to yf.download raised
# "ValueError: value must be an integer, received <class 'str'> for year",
# which `except KeyError` did not catch, so stock_data stayed empty.
sp500_symbols = sp500_tickers['Symbol'].to_list()

# Create an empty dictionary to store stock data
stock_data = {}

# Loop through each stock symbol and download data
for symbol in sp500_symbols:
    try:
        stock_data[symbol] = yf.download(symbol, start='2020-01-01', end='2024-04-01')
    except Exception as e:
        print(f"Data not available for {symbol}: {e}")

In [11]:
stock_data
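The upload half of step 2 does not appear in this notebook. A minimal sketch of what it might look like is given below, assuming the `google-cloud-storage` client library and application default credentials. The bucket name, the object path layout, and the helpers `blob_name_for` and `upload_csv` are all hypothetical. The `client` parameter is there so the function can be tried with a stand-in object.

```python
def blob_name_for(start: str, end: str, prefix: str = "sp500/raw") -> str:
    """Build an object path such as sp500/raw/2024-04-01_2024-04-05.csv (assumed layout)."""
    return f"{prefix}/{start}_{end}.csv"


def upload_csv(csv_text: str, bucket_name: str, blob_name: str, client=None) -> str:
    """Upload CSV text to a Cloud Storage bucket and return its gs:// URI."""
    if client is None:
        # Imported lazily; needs the google-cloud-storage package and credentials.
        from google.cloud import storage

        client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(csv_text, content_type="text/csv")
    return f"gs://{bucket_name}/{blob_name}"
```

A call might look like `upload_csv(sp500_df.to_csv(index=False), "my-sp500-bucket", blob_name_for("2024-04-01", "2024-04-05"))`, where `my-sp500-bucket` is a placeholder and the index is assumed to have been reset so that `Ticker` and `Date` are ordinary columns.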