# Web Scraping Yahoo! Finance using Python

A detailed guide to web scraping https://finance.yahoo.com/ using **Selenium** and **HTML tags**

![](https://i.imgur.com/V1bzyMs.png)

## Introduction

**What is Web scraping?**<br>
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.


**Objective**<br>
The main objective of this tutorial is to showcase different web scraping methods which can be applied to any web page. 
This is for educational purposes only. Please read the Terms & Conditions carefully for any website to see whether you can legally use the data. 

In this project, we will perform web scraping using the following technique:
* use `Selenium` to scrape data from dynamically loading websites



**How to run the Code**<br>
You can execute the code using the "Run" button at the top of this page and selecting **"Run on Colab"** or **"Run Locally"**.
<br>
<br>
**Setup and Tools**<br>
<u>Run on Colab:</u> You will need to provide a Google login to run this notebook on Colab.<br>
<u>Run Locally:</u> Download and install [Anaconda](https://www.anaconda.com/). We will be using Jupyter Notebook for writing and executing code.

## 2. Web Scraping Earnings 



Here's an outline of the steps we'll follow<br>
**2.1 Introduction of selenium**<br>
**2.2 Downloads & Installation**<br>
**2.3 Install & Import libraries**<br>
**2.4 Create Web Driver**<br>
**2.5 Exploring and locating Elements**<br>
**2.6 Extract & Compile the information into a python list**<br>
**2.7 Save the extracted information to a CSV file**<br>

### 2.1 Introduction of selenium

**[Selenium](https://www.selenium.dev/)** is an open-source browser automation tool. It can be driven from Python and other languages, for testing as well as web scraping. Here we will use the Chrome browser, but you can try other supported browsers.<br>

**Why should you use Selenium?** It can automate browser actions such as:
- Clicking on buttons
- Filling forms
- Scrolling
- Taking a screen-shot
- Refreshing the page

You can find the documentation for Selenium with Python [here](https://selenium-python.readthedocs.io/)<br>

Elements are located with `driver.find_elements(By.<STRATEGY>, value)`, which returns a list of all matches (`driver.find_element` returns only the first match). The older `find_elements_by_*` helper methods were removed in Selenium 4.3. The available strategies are:
- `By.ID`
- `By.NAME`
- `By.XPATH`
- `By.LINK_TEXT`
- `By.PARTIAL_LINK_TEXT`
- `By.TAG_NAME`
- `By.CLASS_NAME`
- `By.CSS_SELECTOR`

In this tutorial we will use only `By.XPATH` and `By.TAG_NAME`. You can find complete documentation of these locators [here](https://selenium-python.readthedocs.io/locating-elements.html).

### 2.2 Downloads & Installation 

Before using Selenium we have to do some prep work: we need to install Selenium and a web browser driver that matches the browser.<br>

If you are using the **Google Colab** platform, execute the following code to perform the initial installation. The expression `'google.colab' in str(get_ipython())` is used to detect the Google Colab platform.

In [160]:
if 'google.colab' in str(get_ipython()):
    print('Google CoLab Installation')
    !apt update --quiet
    !apt install chromium-chromedriver --quiet

Google CoLab Installation
Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [83.3 kB]
Hit:10 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:11 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Fetched 261 kB in 3s (87.

To run it **locally** you will need the **WebDriver for Chrome** (ChromeDriver) on your machine. You can download it from https://chromedriver.chromium.org/downloads and copy the file into the folder where the Python file will be created (no installation needed). Make sure that the driver's version matches the Chrome browser version installed on the local machine.

![](https://i.imgur.com/FvQ586e.gif)
![](https://i.imgur.com/wQbjRIU.png)

### 2.3 Install & Import libraries

Installation of the required libraries.

In [161]:
!pip install selenium --quiet
!pip install pandas --quiet

Once the libraries are installed, the next step is to import all the required modules and libraries.

In [162]:
print('Library Import')
if 'google.colab' not in str(get_ipython()):
    print('Not running on CoLab')
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    import os
else:
    print('Running on CoLab')
    
print('Common Library Import')
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import pandas as pd 
import time

Library Import
Running on CoLab
Common Library Import


So all the necessary prep work is done. Let's move ahead to implement this method.

### 2.4 Create Web Driver

In this step we first create an instance of the Chrome WebDriver using `webdriver.Chrome()`; the `driver.get()` method then navigates to the page given by the URL. The setup varies slightly by platform. We also pass `options` arguments, e.g. the `--headless` option runs the browser in the background without a visible window.

In [163]:
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
    def get_driver(url):
        """Return web driver"""
        colab_options = webdriver.ChromeOptions()
        colab_options.add_argument('--no-sandbox')
        colab_options.add_argument('--disable-dev-shm-usage')
        colab_options.add_argument('--headless')
        colab_options.add_argument('--start-maximized') 
        colab_options.add_argument('--start-fullscreen')
        colab_options.add_argument('--single-process')
        driver = webdriver.Chrome(options=colab_options)
        driver.get(url)
        return driver
else:
    print('Not running on CoLab')
    def get_driver(url):
        """Return web driver"""
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--start-maximized') 
        chrome_options.add_argument('--start-fullscreen')
        chrome_options.add_argument('--single-process')
        serv = Service(os.getcwd()+'/chromedriver')
        driver = webdriver.Chrome(options=chrome_options, service=serv)
        driver.get(url)
        return driver

Running on CoLab


Test run of `get_driver`

In [184]:
EARNINGS_URL = 'https://finance.yahoo.com/calendar/earnings?from=2022-10-16&to=2022-10-22&day=2022-10-21' ## CHANGE THE DATE 
driver = get_driver(EARNINGS_URL)
print(driver.title)

Company Earnings Calendar - Yahoo Finance
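
The week range and day in `EARNINGS_URL` are hard-coded. As a minimal sketch, the query string could be built from a `datetime.date` instead. The helper name `build_earnings_url` is hypothetical, and the assumption that `from` and `to` span the Sunday-to-Saturday week containing `day` is inferred only from the example URL above, not from any Yahoo documentation.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = 'https://finance.yahoo.com/calendar/earnings'

def build_earnings_url(day):
    """Build the earnings calendar URL for the week that contains `day`.

    Assumption: `from`/`to` cover Sunday..Saturday, as in the example URL.
    """
    # date.weekday() returns Monday=0 ... Sunday=6, so shift to count days since Sunday
    days_since_sunday = (day.weekday() + 1) % 7
    week_start = day - timedelta(days=days_since_sunday)
    week_end = week_start + timedelta(days=6)
    query = urlencode({
        'from': week_start.isoformat(),
        'to': week_end.isoformat(),
        'day': day.isoformat(),
    })
    return '{}?{}'.format(BASE_URL, query)

print(build_earnings_url(date(2022, 10, 21)))
```

For 21 October 2022 this reproduces the URL used in the cell above, so the result can be passed straight to `get_driver`.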


In [185]:
# Click the arrow that expands the screener filter panel
filter_arrow = driver.find_element(By.XPATH, value='//*[@id="screener-criteria"]/div/header/button/span[1]/span')
filter_arrow.click()
time.sleep(5)

In [186]:
# Deselect the pre-selected United States region
unclick_us = driver.find_element(By.XPATH, value='//*[@id="screener-criteria"]/div/div[1]/div[1]/div[1]/div/div[2]/ul/li[1]/button/span')
unclick_us.click()
time.sleep(5)

In [187]:
# Open the drop-down used to add another region
add_region = driver.find_element(By.XPATH, value='//*[@id="screener-criteria"]/div/div[1]/div[1]/div[1]/div/div[2]/ul/li/div/div/span')
add_region.click()
time.sleep(5)

In [188]:
# Type the country name into the search box of the drop-down
find_filter = driver.find_element(By.XPATH, value='//*[@id="dropdown-menu"]/div/div[1]/div/input')
find_filter.send_keys("Brazil")  # you can replace Brazil with any other country
time.sleep(2)
# Press TAB twice to move focus to the matching entry, then SPACE to select it
find_filter.send_keys(Keys.TAB, Keys.TAB, Keys.SPACE)

In [189]:
# Click the button that applies the selected filters
find_button = driver.find_element(By.XPATH, value='//*[@id="screener-criteria"]/div/div[1]/div[3]/button/span')
find_button.click()
time.sleep(3)
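
The cells above pause with fixed `time.sleep()` calls, which can waste time on a fast page and still be too short on a slow one. Selenium's `WebDriverWait` and `expected_conditions` (both imported in section 2.3) provide explicit waits for this purpose. The sketch below shows only the underlying polling idea in plain Python so that it runs without a browser; `wait_until` and `FakePage` are hypothetical names and are not part of Selenium.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} seconds'.format(timeout))
        time.sleep(poll_interval)

class FakePage:
    """Stand-in for a page whose element appears only after a few lookups."""
    def __init__(self, ready_after):
        self.lookups = 0
        self.ready_after = ready_after

    def find_button(self):
        self.lookups += 1
        return 'button' if self.lookups >= self.ready_after else None

page = FakePage(ready_after=3)
button = wait_until(page.find_button, timeout=2.0, poll_interval=0.01)
print(button, page.lookups)
```

With a real driver, the condition could be a lambda that calls `driver.find_elements(By.XPATH, ...)` and returns the resulting list, since an empty list is falsy and the loop keeps polling until at least one element is found.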

### 2.5 Exploring and locating Elements

In this step we identify relevant information such as `<tags>`, `class`, `XPath` etc. on the web page. Right-click on the page and select "Inspect" to do further analysis.

The webpage shows earnings information in table form. We can grab the table headers through the `<th>` tag, using `find_elements` with `By.TAG_NAME`. These headers can be used as columns for a CSV file.

In [190]:
header = driver.find_elements(By.TAG_NAME, value= 'th')
print(header[0].text)
print(header[2].text)

Symbol
Earnings Call Time


Next we create a helper function that returns the column names from the header using a list comprehension. It iterates with `enumerate`, which also exposes each header's index in case you want to filter columns by position.

In [191]:
def get_table_header(driver):
    """Return Table columns in list form """
    header = driver.find_elements(By.TAG_NAME, value= 'th')
    header_list = [item.text for index, item in enumerate(header) ]
    return header_list

In [192]:
header_list = get_table_header(driver)
print(header_list)

['Symbol', 'Company', 'Earnings Call Time', 'EPS Estimate', 'Reported EPS', 'Surprise(%)']
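
If only the leading columns are needed, a condition on the index from `enumerate` can be added to the comprehension. Here is a minimal sketch on plain strings; the `first_columns` name and its `limit` parameter are hypothetical, and with Selenium the items would be the `.text` of each `<th>` element.

```python
def first_columns(header_texts, limit):
    """Return non-empty header names whose position is below `limit`."""
    return [text for index, text in enumerate(header_texts)
            if index < limit and text.strip()]

headers = ['Symbol', 'Company', 'Earnings Call Time',
           'EPS Estimate', 'Reported EPS', 'Surprise(%)']
print(first_columns(headers, 3))
```

This prints `['Symbol', 'Company', 'Earnings Call Time']`.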


Next we find the number of rows available on the page. Table rows are placed in `<tr>` tags; we can capture the `XPath` by selecting a `<tr>` tag, then Right Click $\rightarrow$ Copy $\rightarrow$ Copy XPath.

![](https://i.imgur.com/DVAYMzY.gif)

In [193]:
txt=driver.find_element(By.XPATH, value='//*[@id="cal-res-table"]/div[1]/table/tbody/tr[1]').text
txt

'WHRL4.SA\nWhirlpool SA TAS - - -'

The above `XPath` points to the first row. We can drop the row number from the XPath and use it with `find_elements` to get hold of all the available rows. Let's implement this in a function.

In [194]:
def get_table_rows(driver):
    """Get number of rows available on the page """
    tablerows = len(driver.find_elements(By.XPATH, value='//*[@id="cal-res-table"]/div[1]/table/tbody/tr'))
    return tablerows    

In [195]:
print(get_table_rows(driver))

1


Similarly, we can take the XPath for any column value.

![](https://i.imgur.com/aT3I3Ur.gif)

In [196]:
driver.find_element(By.XPATH, value='//*[@id="cal-res-table"]/div[1]/table/tbody/tr[1]/td[2]').text

'Whirlpool SA'

So we can change the `row_number` and `column_number` in the `XPath` and loop over the row count and column count to get all the available column values. Let's generalize this and put it in a function. We will get the data for one row at a time and return the column values in the form of a dictionary.

In [197]:
def parse_table_rows(rownum, driver, header_list):
    """get the data for one row at a time and return column value in the form of dictionary"""
    row_dictionary = {}
    #time.sleep(1/3)
    for index , item in enumerate(header_list):
        time.sleep(1/20)
        column_xpath = '//*[@id="cal-res-table"]/div[1]/table/tbody/tr[{}]/td[{}]'.format(rownum, index+1)
        row_dictionary[item] = driver.find_element(By.XPATH, value=column_xpath).text
    return row_dictionary
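
To see how the row and column indices in the XPath select cells without launching a browser, the sketch below applies the same `tr[{}]/td[{}]` pattern to a small hand-written table using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The second row of `SAMPLE_TABLE` is made-up sample data, and `parse_row` is an illustrative stand-in for `parse_table_rows`, not a replacement for it.

```python
import xml.etree.ElementTree as ET

# Hand-written table: the first body row mirrors the scraped row,
# the second body row is hypothetical sample data.
SAMPLE_TABLE = """
<table>
  <thead><tr><th>Symbol</th><th>Company</th><th>Earnings Call Time</th></tr></thead>
  <tbody>
    <tr><td>WHRL4.SA</td><td>Whirlpool SA</td><td>TAS</td></tr>
    <tr><td>XMPL3.SA</td><td>Example SA</td><td>TAS</td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(SAMPLE_TABLE)
header_list = [th.text for th in table.findall('.//th')]
# Dropping the row index from the path matches every body row
row_count = len(table.findall('./tbody/tr'))

def parse_row(rownum, table, header_list):
    """Return one row as a dictionary keyed by column name (XPath indices start at 1)."""
    row = {}
    for index, name in enumerate(header_list):
        cell = table.find('./tbody/tr[{}]/td[{}]'.format(rownum, index + 1))
        row[name] = cell.text
    return row

rows = [parse_row(i, table, header_list) for i in range(1, row_count + 1)]
print(row_count)
print(rows[0])
```

Note that XPath positions are 1-based, which is why the column index from `enumerate` is shifted by one, exactly as in `parse_table_rows`.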

In [198]:
#button_element = driver.find_element(By.XPATH, value = '//*[@id="cal-res-table"]/div[2]/button[3]')
#button_element.click()


In this section we have learned how to locate the required data points and perform events on a webpage.

In [200]:
## parsing each row
table_data = []
table_rows = get_table_rows(driver)
print('Found {} rows'.format(table_rows))
table_data += [parse_table_rows(i, driver, header_list) for i in range (1, table_rows + 1)]
total_count = len(table_data)
print('Total rows scraped : {}'.format(total_count))

Found 1 rows
Total rows scraped : 1


In [201]:
## save data
print('Save the data to a CSV')
table_df = pd.DataFrame(table_data)
table_df.to_csv('earnings_selenium.csv', index=False)

Save the data to a CSV
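
If pandas is not available, the list of row dictionaries could also be written with the standard library. This is a minimal sketch that writes to an in-memory buffer so it runs anywhere; the sample row mirrors the one scraped above, and `rows_to_csv` is a hypothetical helper name.

```python
import csv
import io

def rows_to_csv(rows, out):
    """Write a list of same-keyed dictionaries to a file-like object as CSV."""
    if not rows:
        return
    # Column order follows the key order of the first row
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

sample_rows = [{
    'Symbol': 'WHRL4.SA', 'Company': 'Whirlpool SA', 'Earnings Call Time': 'TAS',
    'EPS Estimate': '-', 'Reported EPS': '-', 'Surprise(%)': '-',
}]
buffer = io.StringIO()
rows_to_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open('earnings_selenium.csv', 'w', newline='')`, as the `csv` module documentation recommends.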


In [202]:
table_df

| | Symbol | Company | Earnings Call Time | EPS Estimate | Reported EPS | Surprise(%) |
|---|---|---|---|---|---|---|
| 0 | WHRL4.SA | Whirlpool SA | TAS | - | - | - |


In [203]:
#terminating driver 
driver.close()
driver.quit() 

**Summary**: Hope you've enjoyed this tutorial. Selenium enables us to perform multiple actions in the web browser, which is very handy for scraping different types of data from web pages.
