# Web Scraping Yahoo! Finance with Text Splitting

In this tutorial, we will use the text splitting method in Python to scrape data from *Yahoo! Finance*.

We will illustrate using Ford's profile page on *Yahoo! Finance* ('https://finance.yahoo.com/quote/F/profile?p=F').

#### Text Splitting

I've found another trick that you might find useful: text splitting.

When we read HTML into Python, whether with the **requests** module or with Selenium's `driver.page_source` as in this tutorial, the resulting HTML is stored as a text string. Therefore, we can simply split that string on unique identifiers to extract the data we want manually.

#### Simple Example

When we split a string, the result is a list containing the string before and after each instance of the text we split on. For example, let's say we have the string:

    "I love my job. This is the best job I have ever had."

Let's split this string on 'job' and view the result.

In [1]:
sentences = "I love my job. This is the best job I have ever had."
job_list = sentences.split('job')
job_list

['I love my ', '. This is the best ', ' I have ever had.']

Splitting on 'job' produces a list with three elements because 'job' appears twice in our string.

The first element contains all text before the first instance of 'job'.

The second element contains the text between the first and second instances of 'job'.

The third element contains the text after the second instance of 'job', up to the end of the string.
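More generally, a separator that occurs *n* times yields *n* + 1 pieces, and a separator that never occurs yields a one-element list holding the whole string. The short check below is only an illustration of that rule using the same sentence; the second string is a made-up example.

```python
sentences = "I love my job. This is the best job I have ever had."

# Number of times the separator occurs in the string
occurrences = sentences.count('job')

# Pieces produced by splitting on that separator
pieces = sentences.split('job')

print(occurrences, len(pieces))

# A separator that is absent leaves the string in one piece,
# so indexing with [1] would raise an IndexError
no_match = "no match here".split('job')
print(no_match)
```

This is why it helps to check `len(...)` of the split result before indexing into it.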

Now, let's say we want to extract the word "have" from this string. Here's how that would be done.

In [2]:
have = sentences.split('job')[2].split('I ')[1].split(' ever')[0]
print(have)

have
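The same pattern (keep what follows one marker, then keep what precedes another) can be wrapped in a small helper. The function below is a minimal sketch; `extract_between` is a hypothetical name introduced here for illustration and is not part of any library.

```python
def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`."""
    # Keep everything after the first occurrence of `start`
    after_start = text.split(start, 1)[1]
    # Keep everything before the first occurrence of `end` in that remainder
    return after_start.split(end, 1)[0]

sentences = "I love my job. This is the best job I have ever had."
have = extract_between(sentences, 'job I ', ' ever')
print(have)
```

Passing `1` as the second argument to `split` stops after the first match, so later occurrences of a marker do not affect the result.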


#### Text Splitting and Ford

Now, let's apply this logic to our Ford Yahoo! Finance example. Let's go find the Industry within the HTML source code of the Yahoo! Finance web page.

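We can split our HTML string to obtain the industry manually. We must choose a unique identifier to split on. The most obvious choice to me would be something like `Industry`. In the page source, the label appears as `Industry:&nbsp;</dt>`, so we use that longer text to make sure the match is unique. Let's split on this text and view the result: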

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up headless Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Optional: runs the browser in the background
driver = webdriver.Chrome(options=options)

# Load the Yahoo Finance Profile page
url = "https://finance.yahoo.com/quote/F/profile/?p=F"
driver.get(url)

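# driver.get() returns once the page has loaded. implicitly_wait() does not pause here;
# it sets how long later element lookups will wait (up to 5 seconds)
driver.implicitly_wait(5)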

# Retrieve HTML
html = driver.page_source

# Close the browser
driver.quit()


In [4]:
#html

In [5]:
industry = html.split('Industry:&nbsp;</dt>')
print(len(industry))

2


In [6]:
print(industry[1][0:300])

 <a class="subtle-link fin-size-large yf-1e4diqp" data-ylk="elm:itm;elmt:link;itc:0;sec:qsp-company-overview;subsec:profile;slk:Auto%20Manufacturers" href="/sectors/consumer-cyclical/auto-manufacturers/" data-rapid_p="22" data-v9y="1">Auto Manufacturers </a> </div><div><dt class="yf-wxp4ja">Full Tim


Splitting on `Industry:&nbsp;</dt>` returns a list with two elements. The first element contains everything in the HTML before `Industry:&nbsp;</dt>` and the second element contains everything in the HTML after it.

In [7]:
print(industry[0][0:300])

<html lang="en-US" theme="light" data-color-scheme="light" class="desktop neo-green dock-upscale"><head><script type="text/javascript" async="" src="https://static.criteo.net/js/ld/publishertag.prebid.144.js"></script><script charset="UTF-8" type="text/javascript" async="async" src="https://cdn.tabo


The industry is contained in the second item in the list. Let's restrict our search to that second item using `[1]` at the end of our code:

In [8]:
industry = html.split('Industry:&nbsp;</dt>')[1]
print(industry[0:300])

 <a class="subtle-link fin-size-large yf-1e4diqp" data-ylk="elm:itm;elmt:link;itc:0;sec:qsp-company-overview;subsec:profile;slk:Auto%20Manufacturers" href="/sectors/consumer-cyclical/auto-manufacturers/" data-rapid_p="22" data-v9y="1">Auto Manufacturers </a> </div><div><dt class="yf-wxp4ja">Full Tim


We can now extract the industry by making a few more splits: split on `</a>` and keep the first element, then split on `>` and keep the last element.

In [9]:
industry = html.split('Industry:&nbsp;</dt>')[1].split('</a>')[0]
print(industry[0:300])

 <a class="subtle-link fin-size-large yf-1e4diqp" data-ylk="elm:itm;elmt:link;itc:0;sec:qsp-company-overview;subsec:profile;slk:Auto%20Manufacturers" href="/sectors/consumer-cyclical/auto-manufacturers/" data-rapid_p="22" data-v9y="1">Auto Manufacturers 


In [10]:
industry = html.split('Industry:&nbsp;</dt>')[1].split('</a>')[0].split('>')[-1]
print(industry)

Auto Manufacturers 
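Note that the extracted value still carries a trailing space. A common cleanup step, sketched below, is to strip surrounding whitespace and decode any HTML entities. The `snippet` string is a shortened copy of the output shown above (most attributes removed), not the live page, and the standard-library `html` module is imported under the alias `html_lib` so it does not overwrite the tutorial's `html` variable.

```python
import html as html_lib  # alias avoids clobbering the tutorial's `html` string

# Shortened stand-in for the page source around the Industry label
snippet = ('Industry:&nbsp;</dt> <a class="subtle-link" '
           'href="/sectors/consumer-cyclical/auto-manufacturers/">'
           'Auto Manufacturers </a> </div>')

# Same chain of splits as in the tutorial
raw = snippet.split('Industry:&nbsp;</dt>')[1].split('</a>')[0].split('>')[-1]

# Decode entities such as &amp; and remove leading/trailing whitespace
industry_clean = html_lib.unescape(raw).strip()
print(repr(raw), repr(industry_clean))
```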


#### Exercise -- Practice Using Text Splitting

1. Obtain the 'Corporate Governance Score' for Ford listed on Ford's Yahoo Finance Profile page.
2. Create a function to obtain the Corporate Governance Score for any ticker. Then extract the Corporate Governance Score for 'F','AAPL','AMZN', and 'WMT' and save the data to a new pandas DataFrame.

#### Solution for # 1

In [11]:
cg = html.split('Governance QualityScore')[1].split(' is ')[1].split('.')[0]
print(cg)

10


#### Solution for # 2

In [12]:
import pandas as pd
from selenium import webdriver

def get_data(ticker):
    
    # Set up headless Chrome options
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Optional: runs the browser in the background
    driver = webdriver.Chrome(options=options)    
    
    url = 'https://finance.yahoo.com/quote/'+ticker+'/profile?p='+ticker
    driver.get(url)

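    # driver.get() returns once the page has loaded. implicitly_wait() does not pause here;
    # it sets how long later element lookups will wait (up to 5 seconds)
    driver.implicitly_wait(5)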

    # Retrieve HTML
    html = driver.page_source

    # Close the browser
    driver.quit()

    cg = html.split('Governance QualityScore')[1].split(' is ')[1].split('.')[0]
    
    return cg

# List of tickers to obtain
tickers = ['F','AAPL','AMZN','WMT']

# Initialize a new pandas DataFrame
df = pd.DataFrame(columns = ['ticker','cg'])

# Iterate through the list of tickers and save each governance score to our df DataFrame
for ticker in tickers:
    cg = get_data(ticker)
    df = pd.concat([df, pd.DataFrame({'ticker':[ticker], 'cg':[cg]})], ignore_index=True)
    
# Print the df DataFrame
df

| | ticker | cg |
|---|---|---|
| 0 | F | 10 |
| 1 | AAPL | 1 |
| 2 | AMZN | 9 |
| 3 | WMT | 3 |
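One caveat with chained splits: if a page does not contain the marker text, indexing with `[1]` raises an `IndexError` and the whole loop stops. The sketch below shows one way to guard against that. `extract_score` is a hypothetical helper, and the `sample` string is made up for illustration; it is not actual Yahoo! Finance HTML.

```python
def extract_score(page_html):
    """Return the governance score as an int, or None if it cannot be found."""
    marker = 'Governance QualityScore'
    if marker not in page_html:
        return None
    # Text following the first occurrence of the marker
    after_marker = page_html.split(marker, 1)[1]
    if ' is ' not in after_marker:
        return None
    # Same splits as the solution above, limited to the first match each time
    value = after_marker.split(' is ', 1)[1].split('.', 1)[0].strip()
    return int(value) if value.isdigit() else None

# Made-up text that mimics the wording the splits rely on
sample = "ISS Governance QualityScore as of some date is 10. Other text follows."
score = extract_score(sample)
missing = extract_score("<html>no score on this page</html>")
print(score, missing)
```

Returning `None` lets the loop record a missing value for that ticker and continue with the rest.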
