# Web Scraping Yahoo! Finance with pandas read_html

In this tutorial, we will use the **pandas read_html** method in Python to scrape data from *Yahoo! Finance*.

We will illustrate using Ford's Statistics page on *Yahoo! Finance* ('https://finance.yahoo.com/quote/F/key-statistics?p=F').

#### Scraping Data from Tables

The **pandas read_html** function reads the contents of HTML and extracts all tables into a list of **pandas** DataFrames, making web scraping extremely easy!

However, the **read_html** function is limited to extracting ONLY data that is contained within *table* tags in the HTML code. If the data you want to scrape is NOT contained in a *table* tag, this method will not work.

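The **read_html** function requires one argument specifying the HTML source. This argument can be a web site address (URL), a file-like object, or HTML stored as text. In recent **pandas** versions, passing a literal HTML string directly is deprecated, so HTML text should be wrapped in **io.StringIO**. In this tutorial, we retrieve the page source with Selenium and pass the resulting HTML text to **read_html**.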
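Before turning to the live page, a minimal sketch on a hand-written HTML snippet may help show the table-only behavior. The snippet, its labels, and its values are hypothetical (they are not taken from Yahoo! Finance), and the example assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML: one <table> plus a <div> that read_html will ignore
sample_html = """
<html><body>
  <div>Price: 12.34</div>
  <table>
    <tr><th>Metric</th><th>Value</th></tr>
    <tr><td>Beta (5Y Monthly)</td><td>1.50</td></tr>
    <tr><td>Shares Outstanding</td><td>3.9B</td></tr>
  </table>
</body></html>
"""

# Wrap the HTML text in StringIO so pandas treats it as a file-like object
sample_dfs = pd.read_html(StringIO(sample_html))

print(len(sample_dfs))  # number of tables found
print(sample_dfs[0])    # the only table; the <div> text is not included
```

The price in the *div* tag does not appear in any DataFrame, which is the limitation described above.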

In [None]:
from io import StringIO

import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up headless Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Optional: runs the browser in the background
driver = webdriver.Chrome(options=options)

# Load the Yahoo Finance Statistics page
url = "https://finance.yahoo.com/quote/F/key-statistics?p=F"
driver.get(url)  # Returns once the browser reports the page as loaded

# Set the implicit wait: element lookups are retried for up to 5 seconds
driver.implicitly_wait(5)

# Retrieve HTML
html = driver.page_source

# Close the browser
driver.quit()

# Wrap the HTML text in StringIO; passing a raw HTML string is deprecated in recent pandas
dfs = pd.read_html(StringIO(html))

Let's print out the DataFrames in our **dfs** list to see what data in our web page is contained in tables.

In [None]:
for df in dfs:
    print(df)
    print('\n\n----------------------------------------------------------------------------------\n\n')

Let's say we want to extract only the 5-year monthly Beta, which is contained in the ninth DataFrame in the **dfs** list (index 8, because list indexing starts at 0).

In [None]:
dfs[8]

We can extract information from a DataFrame using the **.iloc[rownum,colnum]** indexer. The *rownum* and *colnum* are the row and column positions, respectively, and are indexed starting at 0.

Beta is contained in the first row and second column, so the *rownum* is equal to 0 (i.e., first row), and the *colnum* is equal to 1 (i.e., second column).
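To see the zero-based positions in isolation, here is a small sketch on a toy DataFrame. The labels and values are hypothetical placeholders, not figures from Yahoo! Finance.

```python
import pandas as pd

# Toy two-column table shaped like a label/value statistics table (hypothetical values)
toy = pd.DataFrame({
    0: ['Beta (5Y Monthly)', '52 Week High'],
    1: ['1.50', '14.00'],
})

# Row position 0 (first row), column position 1 (second column)
first_row_second_col = toy.iloc[0, 1]
print(first_row_second_col)
```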

In [None]:
beta = dfs[8].iloc[0,1]
print(beta)
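Hard-coded positions such as index 8 and row 0 depend on the current page layout, so they may stop working if the page changes. As one possible alternative, the sketch below searches every table for a row whose first column starts with a given label. The helper name `find_value` and the toy tables are hypothetical, and the approach assumes each statistics table has the label in its first column and the value in its second.

```python
import pandas as pd

def find_value(tables, label):
    """Return the second-column value of the first row whose first column starts with label."""
    for table in tables:
        if table.shape[1] < 2:
            continue  # skip tables without a value column
        first_col = table.iloc[:, 0].astype(str)
        matches = first_col.str.startswith(label)
        if matches.any():
            return table.loc[matches].iloc[0, 1]
    return None  # label not found in any table

# Toy stand-in for the dfs list (hypothetical values)
toy_tables = [
    pd.DataFrame({0: ['52 Week High'], 1: ['14.00']}),
    pd.DataFrame({0: ['Beta (5Y Monthly)'], 1: ['1.50']}),
]

print(find_value(toy_tables, 'Beta'))
```

With the real data, the call would look like `find_value(dfs, 'Beta')`, provided the label text on the page begins with that string.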

#### Exercise -- Practice Using pandas read_html

1. Obtain the 'Trailing Annual Dividend Rate' for Ford listed on Ford's Yahoo Finance Statistics page.
2. Obtain the 'Shares Outstanding' for Ford listed on Ford's Yahoo Finance Statistics page.
3. Create a function to obtain the 'Trailing Annual Dividend Rate' and the 'Shares Outstanding' for any ticker. Then extract these items for 'F', 'AAPL', 'AMZN', and 'WMT' and save the data to a new pandas DataFrame.

#### Solution for # 1

In [None]:
dfs[10]

In [None]:
divrate = dfs[10].iloc[3,1]
print(divrate)

#### Solution for # 2

In [None]:
dfs[9]

In [None]:
shrout = dfs[9].iloc[3,1]
print(shrout)

#### Solution for # 3

In [None]:
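from io import StringIO

import pandas as pd
from selenium import webdriver

def get_data(ticker):
    """Return (divrate, shrout) scraped from the ticker's Yahoo Finance Statistics page."""

    # Set up headless Chrome options
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Optional: runs the browser in the background
    driver = webdriver.Chrome(options=options)

    # Load the Yahoo Finance Statistics page
    url = 'https://finance.yahoo.com/quote/'+ticker+'/key-statistics?p='+ticker
    driver.get(url)  # Returns once the browser reports the page as loaded

    # Set the implicit wait: element lookups are retried for up to 1 second
    driver.implicitly_wait(1)

    # Retrieve HTML
    html = driver.page_source

    # Close the browser
    driver.quit()

    # Wrap the HTML text in StringIO; passing a raw HTML string is deprecated in recent pandas
    dfs = pd.read_html(StringIO(html))
    divrate = dfs[10].iloc[3,1]  # Trailing Annual Dividend Rate
    shrout = dfs[9].iloc[3,1]    # Shares Outstanding
    return divrate,shrout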

# List of tickers to obtain
tickers = ['F','AAPL','AMZN','WMT']

# Collect one row per ticker
rows = []

# Iterate through the list of tickers and save divrate and shrout for each one
for ticker in tickers:
    divrate,shrout = get_data(ticker)
    rows.append({'ticker':ticker, 'divrate':divrate, 'shrout':shrout})

# Build the df DataFrame from the collected rows
df = pd.DataFrame(rows, columns = ['ticker','divrate','shrout'])

# Print the df DataFrame
df