# Homework 5 (Woosoo Kim)

# Web Scraping Yahoo! Finance - Project



#### Web Scraping Project

It's time to apply your newly-acquired web-scraping skills to obtain earnings and pricing data from *Yahoo! Finance*.

#### Data from Yahoo! Finance

*Yahoo! Finance* maintains data on firms' reported earnings per share (EPS), analysts' consensus estimate of EPS, and the date on which firms announce EPS.

# Instructions

#### Import Modules and Create Directories

1. Import the following modules:
    1. pandas as pd
    2. webdriver from selenium
    3. from selenium.webdriver.common.by import By
    4. os
    5. shutil
    6. datetime as dt
    7. time
    8. requests
2. Create a variable called `fromdirectory` with the directory location of your Downloads folder.
3. Create a variable called `todirectory` with the directory of a new folder called 'Yahoo Finance Data' contained within the folder containing this Jupyter Notebook. Use the **os.getcwd** function to get the current working directory (i.e., the directory of the folder containing this Jupyter Notebook).

#### Solution - Import Modules and Create Directories

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
import os
import shutil
import datetime as dt
import time
import requests
import warnings

# Ignore warning
warnings.filterwarnings("ignore", message="Passing literal html to 'read_html' is deprecated.*")

# Where you save files (create a new folder called "Yahoo Finance Data")
todirectory = os.getcwd()+'/Yahoo Finance Data/' 
if not os.path.exists(todirectory):
    os.mkdir(todirectory)

# My Downloads directory
fromdirectory = '/Users/ws/Downloads/'
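
The hard-coded Downloads path above only works on one machine. As a minimal sketch of a more portable setup, the hypothetical helper `make_directories` below (not part of the assignment) builds both paths with `os.path.join` and creates the output folder with `os.makedirs`. It assumes the Downloads folder sits directly inside the user's home directory, which is common but not guaranteed.

```python
import os
import tempfile

def make_directories(base_dir, home_dir):
    """Return (fromdirectory, todirectory) and create the output folder if needed."""
    # Assumption: Downloads lives directly under the home directory
    fromdirectory = os.path.join(home_dir, 'Downloads')
    todirectory = os.path.join(base_dir, 'Yahoo Finance Data')
    os.makedirs(todirectory, exist_ok=True)  # no error if the folder already exists
    return fromdirectory, todirectory

# Demonstrated in a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    fromdirectory, todirectory = make_directories(tmp, os.path.expanduser('~'))
    created = os.path.isdir(todirectory)

print(created)
```

In the notebook the call would be `make_directories(os.getcwd(), os.path.expanduser('~'))`. These paths have no trailing slash, so file names would be attached with `os.path.join(todirectory, filename)` rather than with `+`.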

#### Create a Function to Extract the Earnings Per Share Data

Create a function called `get_earnings` to obtain the EPS data found on *Yahoo! Finance* at https://finance.yahoo.com/calendar/earnings?&symbol=TICKER.

1. Include `ticker` as an input to the `get_earnings` function.
2. Use the **read_html** function from the **pandas** module to read in the Yahoo! Finance URL into a list called `dfs`. Use the option `na_values = '-'` in the **read_html** function because the missing values are coded as `'-'` in the earnings table on Yahoo Finance.
3. Create a variable called `earnings` equal to the DataFrame in the `dfs` list containing the earnings table.
4. Remove rows from the `earnings` DataFrame in which the `Surprise(%)` is missing using the **.notna()** function.
5. Convert the `Earnings Date` column in the `earnings` DataFrame to a *datetime* object called `earnings_date`. See HINT below.
6. Keep the following columns in the `earnings` DataFrame: 'Symbol','earnings_date','EPS Estimate','Reported EPS', and 'Surprise(%)'.
7. Use the **to_csv** function from the **pandas** module to export the `earnings` DataFrame to a file called 'TICKER_earnings.csv' (e.g., 'AAPL_earnings.csv') in a new folder called 'Yahoo Finance Data'. Use the `index=False` option.
8. Test the `get_earnings` function by passing it the ticker 'F'. Is the correct earnings table saved to a file called 'F_earnings.csv' in the 'Yahoo Finance Data' folder?

#### Solution - Create a Function to Extract the Earnings Per Share Data

In [2]:
# See HINT Below
def convert_date(date):
    fixed_date = date.split(',')
    fixed_date = fixed_date[0]+','+fixed_date[1]
    fixed_date = dt.datetime.strptime(fixed_date, '%b %d, %Y')
    return fixed_date

def get_earnings(ticker):
    
    # Set up headless Chrome options
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Optional: runs the browser in the background
    driver = webdriver.Chrome(options=options)

    # Load the Yahoo Finance earnings calendar page for this ticker
    url = 'https://finance.yahoo.com/calendar/earnings?&symbol='+ticker
    driver.get(url)

    # Pause so the page's JavaScript can render the earnings table
    # (implicitly_wait only affects element lookups, not page_source)
    time.sleep(2)

    # Retrieve HTML
    html = driver.page_source

    # Close the browser
    driver.quit()

    # 2. Read tables
    dfs = pd.read_html(html, na_values = '-')
    
    # 3. Create a variable called earnings
    earnings = dfs[0].copy()

    # 4. Remove rows where Surprise(%) is missing
    earnings = earnings[earnings['Surprise(%)'].notna()]

    # 5. Convert Earnings Date to a datetime object
    earnings['earnings_date'] = earnings['Earnings Date'].apply(convert_date)

    # 6. Keep only the Symbol, earnings_date, EPS Estimate, Reported EPS, and Surprise(%) columns
    earnings = earnings[['Symbol','earnings_date','EPS Estimate','Reported EPS','Surprise(%)']]
    
    # 7. Save the earnings Data to a CSV file
    earnings.to_csv(todirectory+ticker+'_earnings.csv', index=False)

# 8. Test get_earnings()
get_earnings('F')
os.listdir(todirectory)

['F_earnings.csv', '.DS_Store']
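
Because `get_earnings` needs a browser and a live page, it can help to check steps 4 to 6 on a small hand-made table first. The sketch below is one way to do that. `clean_earnings` is a hypothetical helper name, and the ticker 'XYZ' and all of its numbers are invented purely for illustration.

```python
import datetime as dt
import pandas as pd

def convert_date(date):
    # Keep 'Mon DD, YYYY' and drop the trailing time and time zone
    parts = date.split(',')
    return dt.datetime.strptime(parts[0] + ',' + parts[1], '%b %d, %Y')

def clean_earnings(raw):
    """Apply steps 4-6 to a raw earnings table (hypothetical helper)."""
    # 4. Drop rows with a missing surprise (e.g., future announcements)
    earnings = raw[raw['Surprise(%)'].notna()].copy()
    # 5. Parse the announcement date
    earnings['earnings_date'] = earnings['Earnings Date'].apply(convert_date)
    # 6. Keep only the required columns
    return earnings[['Symbol', 'earnings_date', 'EPS Estimate', 'Reported EPS', 'Surprise(%)']]

# Invented data: one future announcement (missing values) and one past announcement
raw = pd.DataFrame({
    'Symbol': ['XYZ', 'XYZ'],
    'Company': ['XYZ Corp', 'XYZ Corp'],
    'Earnings Date': ['Jan 28, 2025, 4 PMEST', 'Oct 23, 2019, 4 AMEST'],
    'EPS Estimate': [0.50, 0.26],
    'Reported EPS': [float('nan'), 0.34],
    'Surprise(%)': [float('nan'), 30.0],
})

cleaned = clean_earnings(raw)
print(cleaned)
```

Only the 2019 row should survive the filter, with `earnings_date` stored as a datetime value. Keeping the cleaning logic separate from the scraping makes parsing problems easier to find when a page fails to load.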

#### HINT for # 5:

The `Earnings Date` column is stored in the `earnings` DataFrame as a string object. You can use text splitting and the **datetime.strptime** function on that string to extract the date. To explain how to do this, let's start with a simple Python example.

Suppose you have a string variable called `date`:

In [3]:
date = 'Oct 23, 2019, 4 AMEST'

To extract the date and store the variable as a **datetime** object, we can use the following code:

In [4]:
fixed_date = date.split(',')
fixed_date = fixed_date[0]+','+fixed_date[1]
fixed_date = dt.datetime.strptime(fixed_date, '%b %d, %Y')
print(fixed_date)

2019-10-23 00:00:00


To do the same type of conversion for a column in a pandas DataFrame, you can create a function that takes the `date` as an input and returns the newly-created **datetime** `date`. You can then apply that function to the appropriate column in the **pandas** DataFrame.

In [5]:
def convert_date(date):
    fixed_date = date.split(',')
    fixed_date = fixed_date[0]+','+fixed_date[1]
    fixed_date = dt.datetime.strptime(fixed_date, '%b %d, %Y')
    return fixed_date

In [6]:
earnings = pd.DataFrame({'ticker':['F', 'AAPL', 'WMT'], 'Earnings Date':['Oct 23, 2019, 4 AMEST', 'Dec 5, 2017, 6 AMEST', 'Jan 20, 2019, 9 PMEST']})
earnings

Unnamed: 0,ticker,Earnings Date
0,F,"Oct 23, 2019, 4 AMEST"
1,AAPL,"Dec 5, 2017, 6 AMEST"
2,WMT,"Jan 20, 2019, 9 PMEST"


In [7]:
earnings['earnings_date'] = earnings['Earnings Date'].apply(convert_date)
earnings

Unnamed: 0,ticker,Earnings Date,earnings_date
0,F,"Oct 23, 2019, 4 AMEST",2019-10-23
1,AAPL,"Dec 5, 2017, 6 AMEST",2017-12-05
2,WMT,"Jan 20, 2019, 9 PMEST",2019-01-20
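
As an alternative to `.apply(convert_date)`, the same conversion can be sketched with vectorized string methods. This is not the approach the hint asks for. It assumes every value has the form 'Mon D, YYYY, time', so that everything after the last comma can be discarded.

```python
import pandas as pd

earnings = pd.DataFrame({
    'ticker': ['F', 'AAPL', 'WMT'],
    'Earnings Date': ['Oct 23, 2019, 4 AMEST', 'Dec 5, 2017, 6 AMEST', 'Jan 20, 2019, 9 PMEST'],
})

# Split once from the right and keep the left part, e.g. 'Oct 23, 2019'
date_text = earnings['Earnings Date'].str.rsplit(',', n=1).str[0]

# Parse the whole column at once using the same format string as the hint
earnings['earnings_date'] = pd.to_datetime(date_text, format='%b %d, %Y')
print(earnings)
```

The result matches the table above. The vectorized form avoids calling a Python function once per row, which matters little for a few dozen rows but can help on larger tables.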


#### Obtain S&P 500 Tickers from Wikipedia

1. Use the **pandas read_html** function to read in this list of S&P 500 companies from Wikipedia: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies.
2. Create a DataFrame called `sp500`, keeping only the `Symbol` and `Security` columns.
3. Keep only the first 10 rows in the `sp500` DataFrame.

#### Solution - Obtain S&P 500 Tickers from Wikipedia

In [8]:
#1. Read HTML
html = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies').text
dfs = pd.read_html(html)
df = dfs[0]

# 2. Create a DataFrame called sp500 with only the Symbol and Security columns
sp500 = df[['Symbol', 'Security']].copy()

# 3. Keep only the first 10 rows
sp500 = sp500.head(10)

# Print the first 10 rows
sp500

Unnamed: 0,Symbol,Security
0,MMM,3M
1,AOS,A. O. Smith
2,ABT,Abbott Laboratories
3,ABBV,AbbVie
4,ACN,Accenture
5,ADBE,Adobe Inc.
6,AMD,Advanced Micro Devices
7,AES,AES Corporation
8,AFL,Aflac
9,A,Agilent Technologies


#### Loop through the First Ten Firms in the S&P 500 DataFrame to Obtain the Earnings and Price Tables for Each Ticker Symbol

1. Use a **for** loop to loop through each row of the `sp500` DataFrame using the **iterrows** function. See the HINT below.
2. For each `Symbol` (i.e., ticker), download the 'TICKER_earnings.csv' file to the "Yahoo Finance Data" folder using the `get_earnings` function. 

#### Solution - Loop through the First Ten Firms in the S&P 500 DataFrame to Obtain the Earnings and Price Tables for Each Ticker Symbol

In [13]:
for index, row in sp500.iterrows():
    ticker = row['Symbol']
    if not os.path.exists(todirectory+ticker+'_earnings.csv'):
        get_earnings(ticker)

#### HINT for # 1

To loop through each row of a DataFrame called `sp500` and to extract the `Symbol` (i.e., ticker) from each row, you can use the following code:

    for index,row in sp500.iterrows():
        ticker = row['Symbol']
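
When many pages are scraped in a row, one failed page can stop the whole loop. The sketch below shows one possible way to keep going and record the failures. `download_all` and `fake_fetch` are hypothetical names introduced here. `fake_fetch` stands in for `get_earnings` so the example runs without a browser, and the pause length is an arbitrary choice.

```python
import time

def download_all(tickers, fetch, pause=0.0):
    """Call fetch(ticker) for each ticker and return the tickers that raised an error."""
    failed = []
    for ticker in tickers:
        try:
            fetch(ticker)
        except Exception as error:
            # Report the problem and move on to the next ticker
            print(f'{ticker}: {error}')
            failed.append(ticker)
        time.sleep(pause)  # optional pause between requests
    return failed

# Stand-in for get_earnings: fails for one made-up ticker
def fake_fetch(ticker):
    if ticker == 'BAD':
        raise ValueError('No tables found')

failed = download_all(['MMM', 'BAD', 'AOS'], fake_fetch)
print(failed)  # ['BAD']
```

In the notebook, the equivalent call would be `download_all(sp500['Symbol'], get_earnings, pause=1)`, after which the `failed` list could be retried or inspected by hand.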

#### Combine Earnings Data

1. Create an empty list called `all_data`. (i.e., `all_data = []`)
2. Create a list of file names called `filenames` from the Yahoo Finance Data folder using `os.listdir(DIRECTORY)`.
3. Use a **for** loop to loop through each file name in the `filenames` list and do the following:
    1. If '_earnings' is in the filename, execute the next steps.
    2. Extract the ticker from the filename using the **.split()** function.
    3. Read the file into a **pandas** DataFrame called `data` using the **read_csv** function.
    4. Add the ticker you extracted in sub-step 2 above to a new column called `ticker` in the `data` DataFrame.
    5. Append the `data` DataFrame to the `all_data` list you created in Step 1 above using the **.append()** function.
4. Create a DataFrame called `earnings` by concatenating the `all_data` list of DataFrames using the code: `earnings = pd.concat(all_data)`.
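
The filename filter and the ticker extraction in step 3 can be tried on a plain list of names before any files are read. The sketch below uses a hypothetical helper, `earnings_tickers`, and assumes every earnings file is named 'TICKER_earnings.csv' with no underscore inside the ticker itself.

```python
def earnings_tickers(filenames):
    """Map each '<TICKER>_earnings.csv' file name to its ticker; skip other files."""
    tickers = {}
    for filename in filenames:
        if '_earnings' in filename:
            # Everything before the first underscore is the ticker
            tickers[filename] = filename.split('_')[0]
    return tickers

# Names as os.listdir might return them; '.DS_Store' is a macOS metadata file
result = earnings_tickers(['F_earnings.csv', '.DS_Store', 'AAPL_earnings.csv'])
print(result)  # {'F_earnings.csv': 'F', 'AAPL_earnings.csv': 'AAPL'}
```

The `'_earnings' in filename` check is what keeps stray files such as '.DS_Store' out of the combined DataFrame.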

#### Solution - Combine Earnings Data

In [14]:
# 1. Create an empty list
all_data = []

# 2. Create a list
filenames = os.listdir(todirectory)

# 3. Loop
for filename in filenames:
    # A. Conditional Flow
    if '_earnings' in filename:
        
        # B. Extract Ticker
        ticker = filename.split('_')[0]
        
        # C. Read the file
        data = pd.read_csv(todirectory+filename)
        
        # D. Add the ticker
        data['ticker'] = ticker
        
        # E. Append the DataFrame to the all_data list
        all_data.append(data)

# Create a Dataframe and print its head
earnings = pd.concat(all_data)
earnings.head()

Unnamed: 0,Symbol,earnings_date,EPS Estimate,Reported EPS,Surprise(%),ticker
0,A,2024-08-21,1.26,1.32,5.01,A
1,A,2024-05-29,1.19,1.22,2.34,A
2,A,2024-02-27,1.22,1.29,5.45,A
3,A,2023-11-20,1.34,1.38,2.86,A
4,A,2023-08-15,1.36,1.43,4.8,A
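
Note that `pd.concat(all_data)` keeps each file's original row labels, so the index restarts at 0 for every ticker. The short sketch below illustrates this with two invented DataFrames (the tickers 'AAA' and 'BBB' and their values are made up) and shows the optional `ignore_index=True` argument, which the assignment does not require.

```python
import pandas as pd

# Two small made-up tables standing in for two earnings files
a = pd.DataFrame({'Symbol': ['AAA', 'AAA'], 'Reported EPS': [1.10, 1.05]})
b = pd.DataFrame({'Symbol': ['BBB', 'BBB'], 'Reported EPS': [0.40, 0.35]})

stacked = pd.concat([a, b])                        # row labels repeat
renumbered = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for most analyses, but a unique index avoids surprises when selecting rows by label with `.loc`.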
