# Web Scraping Yahoo! Finance using Python

A detailed guide for web scraping https://finance.yahoo.com/ using **requests**, **BeautifulSoup**, **Selenium**, **HTML tags** & embedded **JSON** data.

![](https://imgur.com/7jMFOcE.png)

## Introduction

**What is Web scraping?**<br>
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.


**Objective**<br>
The main objective of this tutorial is to showcase different web scraping methods which can be applied to any web page. 
This is for educational purposes only. Please read the Terms & Conditions carefully for any website to see whether you can legally use the data. 

In this project, we will perform web scraping using the following 3 techniques, chosen according to the problem statement:
* use `requests`, `BeautifulSoup` and HTML tags to extract data from a static web page
* use `Selenium` to scrape data from dynamically loading web pages
* use embedded `JSON` data to scrape a web page

**The problem statement**<br>
1. Scrape **Stock Market News** (url : https://finance.yahoo.com/topic/stock-market-news/) :<br>
    This web page shows the latest **news** related to the **stock market**. We will extract data from this page and store it in a `CSV` (comma-separated values) file with the following layout.
    ```
    source,headline,url,content,image
    <source of the news>,<news head line>,<news url>,<news content>,<news thumbnail image>
    ```

2. Scrape **Cryptocurrencies** (url : https://finance.yahoo.com/cryptocurrencies) :<br>
    This Yahoo! Finance web page lists trending **Cryptocurrencies** in tabular format. We will scrape the first 10 columns for the top 100 **Cryptocurrencies** and save them in `CSV` format.
    ```
    Symbol,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply
    BTC-USD,Bitcoin USD,"43,312.13",-947.50,-2.14%,821.76B,27.727B,27.727B,27.727B,18.973M
    ```
        
3. Scrape **Market Events Calendar** (url : https://finance.yahoo.com/calendar) :<br> 
    This page shows **date-wise market events**. Users can select a date and choose any one of the following market events: **Earnings**, **Stock Splits**, **Economic Events** & **IPO**. Our aim is to create a script that can be run for any single date and market event, grabs the data and saves it in `CSV` format.<br>

**Prerequisites**
* Knowledge of Python
* Basic knowledge of HTML (helpful, but not required)


**How to run the Code**<br>
You can execute the code using the "Run" button at the top of this page and selecting **"Run on Colab"** or **"Run Locally"**.
<br>
<br>
**Setup and Tools**<br>
<u>Run on Colab :</u> You will need to sign in with a Google account to run this notebook on Colab.<br>
<u>Run Locally :</u> Download and install the [Anaconda](https://www.anaconda.com/) distribution. We will be using Jupyter Notebook for writing & executing code.

**Code Re-usability & Version control**

You can make changes and save your version of the notebook to [Jovian](https://jovian.ai/) by executing the following cells.

In [1]:
!pip install jovian --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(project="yahoo-finance-web-scraper")


[jovian] Updating notebook "vinodvidhole/yahoo-finance-web-scraper" on https://jovian.ai/
[jovian] Committed successfully! https://jovian.ai/vinodvidhole/yahoo-finance-web-scraper


'https://jovian.ai/vinodvidhole/yahoo-finance-web-scraper'

## 1. Scrape Stock Market News

In this section we will learn a basic Python web scraping technique using `requests`, `BeautifulSoup` and HTML tags. The objective here is to scrape [Yahoo! Finance Stock Market News](https://finance.yahoo.com/topic/stock-market-news/).

![](https://i.imgur.com/1I0Btau.jpg)

Let's kick-start with the first objective. Here's an outline of the steps we'll follow:<br>
**1.1 Download & Parse web page using `requests` and `BeautifulSoup`**<br>
**1.2 Exploring and locating Elements**<br>
**1.3 Extract & Compile the information into python list**<br>
**1.4 Save the extracted information to a CSV file**<br>


### 1.1 Download & Parse webpage using requests and BeautifulSoup

The first step is to install the [`requests`](https://docs.python-requests.org/en/latest/) & [`beautifulsoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries using `pip`.

In [4]:
!pip install requests --quiet
!pip install beautifulsoup4 --quiet

In [5]:
import requests
from bs4 import BeautifulSoup

The libraries are installed and imported.<br>

To download the page, we can use `requests.get`, which returns a response object. The HTML of the web page is available in `response.text`.<br>
`response.ok` & [`response.status_code`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) can be used for error trapping & tracking.<br>
Finally, we can use `BeautifulSoup` to parse the HTML data, which returns a `bs4.BeautifulSoup` object.

We can create a function to perform this step.

In [6]:
def get_page(url):
    """Download a webpage and return a beautiful soup doc"""
    response = requests.get(url)
    if not response.ok:
        print('Status code:', response.status_code)
        raise Exception('Failed to load page {}'.format(url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

Let's call the function `get_page` and analyze the output.

In [7]:
my_url = 'https://finance.yahoo.com/topic/stock-market-news/' 
doc = get_page(my_url)

In [8]:
print('Type of doc: ',type(doc))

Type of doc:  <class 'bs4.BeautifulSoup'>


You can access different properties of the HTML web page from `doc`. The following example displays the title of the web page.

In [9]:
doc.find('title')

<title>Latest Stock Market News</title>

We can use the function `get_page` to download any web page and parse it using beautiful soup.

### 1.2 Exploring and locating Elements
Now it's time to explore the elements and find the required data points on the web page. Web pages are written in a language called HTML (HyperText Markup Language). HTML is a fairly simple language made up of *tags* (also called *nodes* or *elements*), e.g. `<a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>`. An HTML tag has three parts:



1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
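
To make the three parts concrete, here is a minimal sketch that pulls them out of the example link above. It uses Python's built-in `html.parser` module instead of `BeautifulSoup` so that it runs without installing anything. The `TagInspector` class and its simplified text handling are illustrative assumptions, not part of this project's code.

```python
from html.parser import HTMLParser


class TagInspector(HTMLParser):
    """Record the name, attributes and text of each opening tag (simplified)."""

    def __init__(self):
        super().__init__()
        self.tags = []  # one dictionary per opening tag, in document order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append({'name': tag, 'attributes': dict(attrs), 'text': ''})

    def handle_data(self, data):
        # Simplification: attach text to the most recently opened tag
        if self.tags:
            self.tags[-1]['text'] += data


sample_html = '<a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>'
inspector = TagInspector()
inspector.feed(sample_html)
inspector.close()

link = inspector.tags[0]
print('Name      :', link['name'])
print('Attributes:', link['attributes'])
print('Children  :', link['text'])
```

With `BeautifulSoup`, the same three pieces are exposed on a tag object as `tag.name`, `tag.attrs` and `tag.text`.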

Let's inspect the web page source code by right-clicking and selecting the "Inspect" option. First we need to identify the tag which represents the news listing.

![](https://i.imgur.com/pGwXU1J.gif)

In this case we can see that the `<div>` tag with class name `"Ov(h) Pend(44px) Pstart(25px)"` represents a news listing. We can apply the `find_all` method to grab these tags.

In [10]:
div_tags = doc.find_all('div', {'class': "Ov(h) Pend(44px) Pstart(25px)"})

The total number of elements in the `<div>` tag list matches the number of news items displayed on the web page, so we are heading in the right direction.

In [11]:
len(div_tags)

10
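
The idea behind this lookup can be tried offline. The sketch below counts `<div>` tags whose `class` attribute equals the full class string, using the standard library's `html.parser` as a stand-in for `BeautifulSoup`. The `ClassMatcher` class and the three-tag sample page are made up for illustration, and comparing the whole attribute string is an assumption about how this particular lookup behaves.

```python
from html.parser import HTMLParser


class ClassMatcher(HTMLParser):
    """Count <div> tags whose class attribute equals a given string."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # Compare the complete class attribute, spaces included
        if tag == 'div' and dict(attrs).get('class') == self.target_class:
            self.matches += 1


# Hypothetical page fragment: two news items and one unrelated <div>
sample_page = (
    '<div class="Ov(h) Pend(44px) Pstart(25px)"><h3>First headline</h3></div>'
    '<div class="Ov(h) Pend(44px) Pstart(25px)"><h3>Second headline</h3></div>'
    '<div class="C(#959595) Fz(11px)">Not a news item</div>'
)

matcher = ClassMatcher('Ov(h) Pend(44px) Pstart(25px)')
matcher.feed(sample_page)
matcher.close()
print('Matching <div> tags:', matcher.matches)
```

Comparing this count with what you see in the browser is a quick sanity check that the chosen class name really identifies the news items.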

The next step is to inspect an individual `<div>` tag and try to find more information. I am using "Visual Studio Code", but you can use any tool, even one as simple as Notepad.

In [12]:
div_tags[1]

<div class="Ov(h) Pend(44px) Pstart(25px)"><div class="C(#959595) Fz(11px) D(ib) Mb(6px)">TheStreet.com</div><h3 class="Mb(5px)"><a class="js-content-viewer wafer-caas Fw(b) Fz(18px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 mega-item-header-link Td(n) C(#0078ff):h C(#000) LineClamp(2,46px) LineClamp(2,38px)--sm1024 not-isInStreamVideoEnabled" data-uuid="c0f061d5-dfab-3cf1-8bb7-1a2b8fc88694" data-wf-caas-prefetch="1" data-wf-caas-uuid="c0f061d5-dfab-3cf1-8bb7-1a2b8fc88694" href="/m/c0f061d5-dfab-3cf1-8bb7-1a2b8fc88694/the-stock-market-has-no-mercy.html"><u class="StretchedBox"></u>The Stock Market Has No Mercy For Tesla Rivals</a></h3><p class="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(2,38px) LineClamp(2,34px)--sm1024 M(0)">The love at first sight between investors and young manufacturers of electric vehicles seems to have died.  As for electric and hydrogen truck maker Nikola, its stock has lost at least 33.6% of its ma

![](https://i.imgur.com/ncnfg0z.png)

Luckily, most of the required data points are available in this `<div>`, so we can use the `find` method to grab each item.

In [13]:
print("Source: ", div_tags[1].find('div').text)
print("Head Line : {}".format(div_tags[1].find('a').text))

Source:  TheStreet.com
Head Line : The Stock Market Has No Mercy For Tesla Rivals


If a tag is not directly accessible, you can use methods like `findParent()` or `findChild()` to reach the required tag.

![](https://i.imgur.com/OnOAtT2.png)

In [14]:
print("Image URL: ",div_tags[1].findParent().find('img')['src'])

Image URL:  https://s.yimg.com/uu/api/res/1.2/SO2ugXDssbG6ktlpzC3mOg--~B/Zmk9c3RyaW07aD0xMjM7cT04MDt3PTIyMDthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/uu/api/res/1.2/cP.5sgYHwRi1sxoeGT3WDw--~B/aD0xMDgwO3c9MTkyMDthcHBpZD15dGFjaHlvbg--/https://media.zenfs.com/en/thestreet.com/e6d13390cbdb46512b13cca1155bfe67.cf.jpg


The key takeaway from this exercise is to identify the optimal tag which provides the required information. Mostly this is straightforward, but sometimes you will have to do a little more research.

### 1.3 Extract & Compile the information into python list

We've identified all the required tags and information. Let's put this together in functions.

In [15]:
def get_news_tags(doc):
    """Get the list of tags containing news information"""
    news_class = "Ov(h) Pend(44px) Pstart(25px)" ## class name of div tag 
    news_list  = doc.find_all('div', {'class': news_class})
    return news_list

Sample run of the function `get_news_tags`:

In [16]:
my_news_tags = get_news_tags(doc)

We will create one more function to parse an individual `<div>` tag and return the information as a dictionary.

In [17]:
BASE_URL = 'https://finance.yahoo.com' #Global Variable 

def parse_news(news_tag):
    """Get the news data point and return dictionary"""
    news_source = news_tag.find('div').text #source
    news_headline = news_tag.find('a').text #heading
    news_url = news_tag.find('a')['href'] #link
    news_content = news_tag.find('p').text #content
    news_image = news_tag.findParent().find('img')['src'] #thumb image
    return { 'source' : news_source,
            'headline' : news_headline,
            'url' : BASE_URL + news_url,
            'content' : news_content,
            'image' : news_image
           }

Testing the `parse_news` function for first `<div>` tag 

In [18]:
parse_news(my_news_tags[0])

{'source': 'TheStreet.com',
 'headline': "Etoro Wants to Make Amends After Liquidating Its clients' Russian Stocks",
 'url': 'https://finance.yahoo.com/m/bc18874c-fafb-3752-80eb-4730cc90aae5/etoro-wants-to-make-amends.html',
 'image': 'https://s.yimg.com/uu/api/res/1.2/kx.1v0quv6udevKA.nNY6w--~B/Zmk9c3RyaW07aD0xMjM7cT04MDt3PTIyMDthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/uu/api/res/1.2/fCDt2jRhF0BKJ59MX.AHNw--~B/aD0xMDgwO3c9MTkyMDthcHBpZD15dGFjaHlvbg--/https://media.zenfs.com/en/thestreet.com/1e9bd7b19084e252e958c5108068ef2e.cf.jpg'}

We can use the `get_news_tags` & `parse_news` functions to parse the news.
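
`parse_news` builds the full link with `BASE_URL + news_url`, which works as long as every `href` is relative. As a hedged alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which also leaves an already absolute link untouched. The `absolute_url` helper and the second example link are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://finance.yahoo.com'


def absolute_url(href):
    """Resolve a link against BASE_URL; absolute links are returned unchanged."""
    return urljoin(BASE_URL, href)


# Relative href, as seen in the news <div> inspected earlier
relative = absolute_url('/m/c0f061d5-dfab-3cf1-8bb7-1a2b8fc88694/the-stock-market-has-no-mercy.html')

# Hypothetical absolute href: plain concatenation would duplicate the domain here
already_absolute = absolute_url('https://finance.yahoo.com/news/example.html')

print(relative)
print(already_absolute)
```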

### 1.4 Save the extracted information to a CSV file

This is the last step of this section. We are going to use the Python library [`pandas`](https://pandas.pydata.org/docs/) to save the data in CSV format. Install and then import the pandas library.

In [19]:
!pip install pandas --upgrade --quiet

In [20]:
import pandas as pd

We will now create a wrapper function which calls the previously created helper functions.<br>

The `get_page` function downloads the HTML page, then we pass the result to `get_news_tags` to get the list of `<div>` tags for news.<br>
After that we use a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to parse each `<div>` tag using `parse_news`; the output is a list of dictionaries.<br>
Finally we use `pd.DataFrame` to create a pandas [dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and its `to_csv` method to store the data in CSV format.

In [21]:
def scrape_yahoo_news(url, path=None):
    """Get the yahoo finance market news and write them to CSV file """
    if path is None:
        path = 'stock-market-news.csv'
        
    print('Requesting html page')
    doc = get_page(url)

    print('Extracting news tags')
    news_list = get_news_tags(doc)

    print('Parsing news tags')
    news_data = [parse_news(news_tag) for news_tag in news_list]

    print('Save the data to a CSV')
    news_df = pd.DataFrame(news_data)
    news_df.to_csv(path, index=False)
    
    # This return statement is optional; it is only used to analyze the final output
    return news_df 

Scraping the news using `scrape_yahoo_news` function 

In [22]:
YAHOO_NEWS_URL = BASE_URL+'/topic/stock-market-news/'
news_df = scrape_yahoo_news(YAHOO_NEWS_URL)

Requesting html page
Extracting news tags
Parsing news tags
Save the data to a CSV


The "stock-market-news.csv" file should be available under File $\rightarrow$ Open. You can download the file or open it directly in the browser. Please verify the file content and compare it with the information available on the web page.
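
To see what the list-of-dictionaries to CSV step produces without running the scraper, here is a small sketch that uses the standard library's `csv` module as a stand-in for `pandas`. The two records are made-up placeholders with the same five keys that `parse_news` returns.

```python
import csv
import io

# Placeholder records shaped like the dictionaries returned by parse_news
news_data = [
    {'source': 'Example Source', 'headline': 'Example headline, with a comma',
     'url': 'https://finance.yahoo.com/news/example-1.html',
     'content': 'First example summary', 'image': 'https://example.com/1.jpg'},
    {'source': 'Another Source', 'headline': 'Second example headline',
     'url': 'https://finance.yahoo.com/news/example-2.html',
     'content': 'Second example summary', 'image': 'https://example.com/2.jpg'},
]

# Write to an in-memory buffer instead of a file so nothing is left on disk
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['source', 'headline', 'url', 'content', 'image'])
writer.writeheader()
writer.writerows(news_data)

csv_text = buffer.getvalue()
print(csv_text)
```

Note how the headline containing a comma is wrapped in double quotes, which keeps the column layout intact when the file is read back.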

You can also check the data by grabbing a few rows from the data frame returned by the `scrape_yahoo_news` function.

In [23]:
news_df[:5]

| | source | headline | url | content | image |
|---|---|---|---|---|---|
| 0 | TheStreet.com | Etoro Wants to Make Amends After Liquidating I... | https://finance.yahoo.com/m/bc18874c-fafb-3752... | The trading platform took an unprecedented act... | https://s.yimg.com/uu/api/res/1.2/kx.1v0quv6ud... |
| 1 | TheStreet.com | The Stock Market Has No Mercy For Tesla Rivals | https://finance.yahoo.com/m/c0f061d5-dfab-3cf1... | The love at first sight between investors and ... | https://s.yimg.com/uu/api/res/1.2/SO2ugXDssbG6... |
| 2 | Investor's Business Daily | Dow Jones Futures: Market Correction Heading F... | https://finance.yahoo.com/m/2df44452-c887-3bb3... | The major indexes are nearing February lows as... | https://s.yimg.com/uu/api/res/1.2/ZZL4kX1McSx_... |
| 3 | Bloomberg | China Is Hidden Risk for Emerging Markets That... | https://finance.yahoo.com/news/china-hidden-ri... | (Bloomberg) -- As traders grapple with the bre... | https://s.yimg.com/uu/api/res/1.2/h4T0Z237N9bc... |
| 4 | Bloomberg | After Whiplash Week in Markets, Traders Prep f... | https://finance.yahoo.com/news/whiplash-week-m... | (Bloomberg) -- Traders around the world are ge... | https://s.yimg.com/uu/api/res/1.2/LcnI7yBSK_1c... |


**Summary** : Hopefully I was able to explain this simple but very powerful Python technique for scraping the Yahoo! Finance market news. These steps can be used to scrape any web page; you just have to do a little research to identify the required tags and use the relevant Python methods to collect the data.

## 2. Scrape Cryptocurrencies

In phase one we were able to scrape the [Yahoo! market news](https://finance.yahoo.com/topic/stock-market-news/) web page. However, as you may have noticed, more news appears at the bottom of the page as we scroll down. This is called dynamic page loading. The previous technique is a basic Python method useful for scraping static data. To scrape dynamically loading data we will use a different method: web scraping using **Selenium**. Let's move ahead with this topic. The goal of this section is to extract the top-listed [cryptocurrencies](https://finance.yahoo.com/cryptocurrencies) from Yahoo! Finance.

![](https://i.imgur.com/sF6k0Pk.jpg)


Here's an outline of the steps we'll follow<br>
**2.1 Introduction of selenium**<br>
**2.2 Downloads & Installation**<br>
**2.3 Install & Import libraries**<br>
**2.4 Create Web Driver**<br>
**2.5 Exploring and locating Elements**<br>
**2.6 Extract & Compile the information into python list**<br>
**2.7 Save the extracted information to a CSV file**<br>

### 2.1 Introduction of selenium

**[Selenium](https://www.selenium.dev/)** is an open-source browser automation tool. It can be driven from Python and other languages, and is used for testing as well as web scraping. Here we will use the Chrome browser, but you can try any supported browser.<br>

**Why should you use Selenium?** Because it can automate browser actions such as:
- Clicking on buttons
- Filling forms
- Scrolling
- Taking a screen-shot
- Refreshing the page

You can find proper documentation on selenium [here](https://selenium-python.readthedocs.io/)<br>

The following locator strategies, passed to `find_elements()` together with a value, help to find elements in a web page (`find_elements()` returns a list):
- `By.NAME`
- `By.XPATH`
- `By.LINK_TEXT`
- `By.PARTIAL_LINK_TEXT`
- `By.TAG_NAME`
- `By.CLASS_NAME`
- `By.CSS_SELECTOR`

In this tutorial we will use only `By.XPATH` and `By.TAG_NAME`. The older `find_elements_by_*` helper methods (for example `find_elements_by_xpath`) are deprecated in Selenium 4 in favour of this form. You can find complete documentation on locating elements [here](https://selenium-python.readthedocs.io/locating-elements.html).

### 2.2 Downloads & Installation 

Unlike the previous section, here we'll have to do some prep work. We will need to install Selenium & the proper web browser driver.<br>

If you are using the **Google Colab** platform, execute the following code to perform the initial installation. The expression `'google.colab' in str(get_ipython())` is used to identify the Google Colab platform.

In [24]:
if 'google.colab' in str(get_ipython()):
    print('Google CoLab Installation')
    !apt update --quiet
    !apt install chromium-chromedriver --quiet

To run it **locally** you will need the **WebDriver for Chrome** on your machine. You can download it from https://chromedriver.chromium.org/downloads and copy the file into the folder where the Python file will be created (no installation needed). Make sure that the driver's version matches the version of the Chrome browser installed on the local machine.

![](https://i.imgur.com/FvQ586e.gif)
![](https://i.imgur.com/wQbjRIU.png)

### 2.3 Install & Import libraries

Install the required libraries. Please note that some libraries are platform-specific.

In [25]:
print('library Installation')
if 'google.colab' not in str(get_ipython()):
    print('Not running on CoLab')
    #!pip install webdriver-manager --upgrade --quiet
else:
    print('Running on CoLab')
    
!pip install selenium --quiet
!pip install pandas --quiet

library Installation
Not running on CoLab


Once the libraries are installed, the next step is to import all the required modules / libraries.

In [26]:
print('Library Import')
if 'google.colab' not in str(get_ipython()):
    print('Not running on CoLab')
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    #from webdriver_manager.chrome import ChromeDriverManager
    import os
else:
    print('Running on CoLab')
    
print('Common Library Import')
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd 
import time

Library Import
Not running on CoLab
Common Library Import


All the necessary prep work is done. Let's move ahead and implement this method.

### 2.4 Create Web Driver

In this step we first create an instance of the Chrome WebDriver using `webdriver.Chrome()`, and then `driver.get()` navigates to the page given by the URL. There is a slight variation based on the platform here as well. We also pass `options` arguments; for example, the `--headless` option runs the browser in the background without a visible window.

In [27]:
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
    def get_driver(url):
        """Return web driver"""
        colab_options = webdriver.ChromeOptions()
        colab_options.add_argument('--no-sandbox')
        colab_options.add_argument('--disable-dev-shm-usage')
        colab_options.add_argument('--headless')
        driver = webdriver.Chrome(options=colab_options)
        driver.get(url)
        return driver
else:
    print('Not running on CoLab')
    def get_driver(url):
        """Return web driver"""
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--headless')
        #serv = Service(ChromeDriverManager().install())
        serv = Service(os.getcwd()+'/chromedriver')
        driver = webdriver.Chrome(options=chrome_options, service=serv)
        driver.get(url)
        return driver

Not running on CoLab


Test run of `get_driver`

In [28]:
driver = get_driver('https://finance.yahoo.com/cryptocurrencies')

### 2.5 Exploring and locating Elements

This step is almost the same as the one in phase 1. We will try to identify relevant information like tags, `class`, `XPath` etc. from the web page. Right-click and select "Inspect" to do further analysis.

The web page shows the cryptocurrency information in table form. We can grab the table headers with the `<th>` tag, using `find_elements` with `By.TAG_NAME`. These headers will be used as the columns of the CSV file.

In [29]:
header = driver.find_elements(By.TAG_NAME, value= 'th')
print(header[0].text)
print(header[2].text)

Symbol
Price (Intraday)


Let's create a helper function to get the first 10 columns from the header. We use a list comprehension with a condition; note also the use of the built-in `enumerate` function.

In [30]:
def get_table_header(driver):
    """Return Table columns in list form """
    header = driver.find_elements(By.TAG_NAME, value= 'th')
    header_list = [item.text for index, item in enumerate(header) if index < 10]
    return header_list
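
The filtering step in `get_table_header` can be tried on a plain list of strings, with no browser involved. In the sketch below the first ten names come from the CSV layout in the problem statement, while the last two entries are hypothetical extra columns added only so that there is something to cut off.

```python
# Header texts as plain strings; in the scraper each one comes from item.text
all_headers = [
    'Symbol', 'Name', 'Price (Intraday)', 'Change', '% Change', 'Market Cap',
    'Volume in Currency (Since 0:00 UTC)', 'Volume in Currency (24Hr)',
    'Total Volume All Currencies (24Hr)', 'Circulating Supply',
    'Extra Column A', 'Extra Column B',  # hypothetical columns beyond the first 10
]

# Same pattern as get_table_header: keep an item only if its position is below 10
with_enumerate = [text for index, text in enumerate(all_headers) if index < 10]

# Equivalent result using list slicing
with_slice = all_headers[:10]

print(with_enumerate == with_slice)
print(with_enumerate[-1])
```

Slicing is the shorter option here. The `enumerate` form becomes more useful when the condition depends on both the position and the value.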

Next we find the number of rows available on a page. Table rows are placed in `<tr>` tags; we can capture the `XPath` by selecting a `<tr>` tag and then Right Click $\rightarrow$ Copy $\rightarrow$ Copy XPath.

![](https://i.imgur.com/DVAYMzY.gif)

So we get the XPath value `//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]`. Let's use this with `find_element()` & `By.XPATH`.

In [31]:
txt=driver.find_element(By.XPATH, value='//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]').text
txt

'BTC-USD\nBitcoin USD 39,122.67 -415.29 -1.05% 742.389B 18.031B 18.031B 18.031B 18.976M'

The above `XPath` points to the first row. We can drop the row number from the XPath and use it with `find_elements` to get hold of all the available rows. Let's implement this in a function.

In [32]:
def get_table_rows(driver):
    """Get number of rows available on the page """
    tablerows = len(driver.find_elements(By.XPATH, value='//*[@id="scr-res-table"]/div[1]/table/tbody/tr'))
    return tablerows    

In [33]:
print(get_table_rows(driver))

25


Similarly we can take the XPath for any column value.

![](https://i.imgur.com/aT3I3Ur.gif)

This is the XPath for a column value: `//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[2]`.<br>
Notice that the numbers after `tr` & `td` represent the `row_number` and `column_number`. We can check this with the `find_element()` method.

In [34]:
driver.find_element(By.XPATH, value='//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[2]').text

'Bitcoin USD'
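
The row and column indexing can also be tried offline. The following sketch uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath, on a tiny hand-written table. The markup is a simplified assumption about the page structure, and the paths are written relative to the outer `<div>` because of that XPath subset.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for the real table markup
table_markup = """
<div id="scr-res-table">
  <div>
    <table>
      <tbody>
        <tr><td>BTC-USD</td><td>Bitcoin USD</td></tr>
        <tr><td>ETH-USD</td><td>Ethereum USD</td></tr>
      </tbody>
    </table>
  </div>
</div>
""".strip()

root = ET.fromstring(table_markup)  # root is the outer <div id="scr-res-table">

ROW_XPATH = './div[1]/table/tbody/tr'
CELL_XPATH = './div[1]/table/tbody/tr[{}]/td[{}]'


def get_cell_text(root, row_number, column_number):
    """Return the text of one cell; both numbers are 1-based, as in XPath."""
    return root.find(CELL_XPATH.format(row_number, column_number)).text


row_count = len(root.findall(ROW_XPATH))
table = [[get_cell_text(root, row, col) for col in (1, 2)]
         for row in range(1, row_count + 1)]
print(table)
```

Keep in mind that XPath positions start at 1, not 0, which is why the loops run from 1 up to and including the row count.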

So we can vary `row_number` & `column_number` in the `XPath` and loop over the row count and column count to get all the available column values. Let's generalize this and put it in a function. We will get the data for one row at a time and return the column values in the form of a dictionary.

In [35]:
def parse_table_rows(rownum, driver, header_list):
    """get the data for one row at a time and return column value in the form of dictionary"""
    row_dictionary = {}
    time.sleep(1/3)
    for index , item in enumerate(header_list):
        column_xpath = '//*[@id="scr-res-table"]/div[1]/table/tbody/tr[{}]/td[{}]'.format(rownum, index+1)
        row_dictionary[item] = driver.find_element(By.XPATH, value=column_xpath).text
    return row_dictionary

The Yahoo! Finance web page shows only 25 cryptocurrencies per page, and the user has to click the `Next` button to load the next set of cryptocurrencies. This is called **pagination**. It is the main reason we are using Selenium: it lets us handle events like pagination. You can perform multiple events like clicking, scrolling, refreshing etc. on a web page using Selenium methods.

Now we will grab the `XPath` of the `Next` button, find the element using the `find_element` method, and then perform the click action using the `.click()` method.

![](https://i.imgur.com/tCxQKfR.gif)

In [36]:
button_element = driver.find_element(By.XPATH, value = '//*[@id="scr-res-table"]/div[2]/button[3]')
button_element.click()

In this section we have learned how to get the required data points and perform events on a web page.

In [37]:
driver.quit() #terminating driver from test runs 

### 2.6 Extract & Compile the information into python list

Let's put all the pieces of the puzzle together. We will pass the integer `total_crypto`, i.e. the number of rows to be scraped (in this case 100 rows), to the function. It parses each row on the page and appends the data to a list until the total parsed row count reaches `total_crypto`. In addition, it clicks the `Next` button once the last row of the current page has been parsed.

**Please Note** : Here, to identify the `Next` button element, we have used the [WebDriverWait](https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/support/ui/WebDriverWait.html) class instead of calling `find_element()` directly. With this technique we can allow some wait time (here up to 10 seconds) for the element to appear before grabbing it. This type of implementation is done to avoid the [`StaleElementReferenceException`](https://stackoverflow.com/questions/27003423/staleelementreferenceexception-on-python-selenium).

Code Sample:
```
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
```

In [38]:
def parse_multiple_pages(driver, total_crypto):
    """Loop through each row, perform Next button click at the end of page 
    return total_crypto numbers of rows 
    """
    table_data = []
    page_num = 1
    is_scraping = True
    header_list = get_table_header(driver)

    while is_scraping:
        table_rows = get_table_rows(driver)
        print('Found {} rows on Page : {}'.format(table_rows, page_num))
        print('Parsing Page : {}'.format(page_num))
        table_data += [parse_table_rows(i, driver, header_list) for i in range (1, table_rows + 1)]
        total_count = len(table_data)
        print('Total rows scraped : {}'.format(total_count))
        if total_count >= total_crypto:
            print('Done Parsing..')
            is_scraping = False
        else:    
            print('Clicking Next Button')
            element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
            element.click() 
            page_num += 1
    return table_data
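
The control flow of this loop can be exercised without a browser. Below is a minimal sketch in which a made-up `fake_fetch_page` function stands in for "parse the current page, then click Next". It also adds two safeguards that are assumptions on my part rather than behaviour of `parse_multiple_pages`: it stops when a page comes back empty and trims the result to the requested number of rows.

```python
def collect_rows(fetch_page, total_rows):
    """Collect rows page by page until total_rows are gathered or pages run out."""
    rows = []
    page_num = 1
    while len(rows) < total_rows:
        page = fetch_page(page_num)  # stands in for parsing the current page
        if not page:                 # no more data: stop instead of looping forever
            break
        rows.extend(page)
        page_num += 1                # stands in for clicking the Next button
    return rows[:total_rows]


def fake_fetch_page(page_num, page_size=25, available=60):
    """Hypothetical data source: 60 rows in total, served 25 per page."""
    start = (page_num - 1) * page_size
    end = min(start + page_size, available)
    return [{'row': number} for number in range(start + 1, end + 1)]


all_rows = collect_rows(fake_fetch_page, 100)   # asks for more than exists
some_rows = collect_rows(fake_fetch_page, 30)   # asks for part of the second page
print(len(all_rows), len(some_rows))
```

Separating the loop from the page access like this makes the stopping conditions easy to test before pointing the scraper at a live site.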

### 2.7 Save the extracted information to a CSV file

This is the last step of this section. We create a final function which ties together all the helper functions, and at the end we save the data in CSV format using the `DataFrame.to_csv` method.

In [39]:
def scrape_yahoo_crypto(url, total_crypto, path=None):
    """Get the list of yahoo finance crypto-currencies and write them to CSV file """
    if path is None:
        path = 'crypto-currencies.csv'
    print('Creating driver')
    driver = get_driver(url)    
    table_data = parse_multiple_pages(driver, total_crypto)
    driver.quit()  # end the browser session; close() would only close the current window
    print('Save the data to a CSV')
    table_df = pd.DataFrame(table_data)
    table_df.to_csv(path, index=False)
    # This return statement is optional; it is only used to analyze the final output
    return table_df 

Time to scrape some cryptos! We will scrape the top 100 cryptos on the Yahoo! Finance web page by calling `scrape_yahoo_crypto`.

In [40]:
YAHOO_FINANCE_URL = BASE_URL+'/cryptocurrencies'
TOTAL_CRYPTO = 100
crypto_df = scrape_yahoo_crypto(YAHOO_FINANCE_URL, TOTAL_CRYPTO,'crypto-currencies.csv')

Creating driver
Found 25 rows on Page : 1
Parsing Page : 1
Total rows scraped : 25
Clicking Next Button
Found 25 rows on Page : 2
Parsing Page : 2
Total rows scraped : 50
Clicking Next Button
Found 25 rows on Page : 3
Parsing Page : 3
Total rows scraped : 75
Clicking Next Button
Found 25 rows on Page : 4
Parsing Page : 4
Total rows scraped : 100
Done Parsing..
Save the data to a CSV


The "crypto-currencies.csv" file should be available under File $\rightarrow$ Open. You can download the file or open it directly in the browser. Please verify the file content and compare it with the information available on the web page.

You can also check the data by grabbing a few rows from the data frame returned by the `scrape_yahoo_crypto` function.

In [41]:
crypto_df[:5]

| | Symbol | Name | Price (Intraday) | Change | % Change | Market Cap | Volume in Currency (Since 0:00 UTC) | Volume in Currency (24Hr) | Total Volume All Currencies (24Hr) | Circulating Supply |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BTC-USD | Bitcoin USD | 39122.67 | -415.29 | -1.05% | 742.389B | 18.031B | 18.031B | 18.031B | 18.976M |
| 1 | ETH-USD | Ethereum USD | 2640.22 | -35.59 | -1.33% | 316.43B | 7.875B | 7.875B | 7.875B | 119.85M |
| 2 | USDT-USD | Tether USD | 1.0003 | -0.0001 | -0.01% | 79.734B | 40.582B | 40.582B | 40.582B | 79.713B |
| 3 | BNB-USD | Binance Coin USD | 382.86 | 0.68 | +0.18% | 63.217B | 1.258B | 1.258B | 1.258B | 165.117M |
| 4 | USDC-USD | USD Coin USD | 0.999583 | -0.000997 | -0.10% | 52.919B | 2.854B | 2.854B | 2.854B | 52.941B |


**Summary** : Hope you've enjoyed this tutorial. Selenium enables us to perform multiple actions in the web browser, which is very handy for scraping different types of data from web pages.


## 3. Scrape Market Events Calendar

This is the final segment of the tutorial. In this section we will learn how to extract embedded [JSON](https://www.w3schools.com/js/js_json_intro.asp) formatted data, which can be easily converted to a Python dictionary. The problem statement for this section is to scrape date-wise market events from [Yahoo! finance](https://finance.yahoo.com/calendar).

![](https://i.imgur.com/bKQoAjs.png)

Here's an outline of the steps we'll follow<br>
**3.1 Install & Import libraries**<br>
**3.2 Download & Parse web page**<br>
**3.3 Get Embedded Json data**<br>
**3.4 Locating Json Keys**<br>
**3.5 Pagination & Compiling the information into python list**<br>
**3.6 Save the extracted information to a CSV file**<br>

### 3.1 Install & Import libraries

The first step is to install and import the Python libraries.

In [42]:
!pip install requests --quiet
!pip install beautifulsoup4 --quiet
!pip install pandas --quiet

In [43]:
import re
import json
from io import StringIO
from bs4 import BeautifulSoup
import requests
import pandas as pd

### 3.2 Download & Parse web page

This is exactly the same step that we performed to download the web page in section 1.1, except that here we pass a [custom header](https://docs.python-requests.org/en/master/user/quickstart/#custom-headers) in `requests.get()`.

Most of the details are explained in section 1.1, so let's create the helper function.

In [44]:
def get_event_page(scraper_url):
    """Download a webpage and return a beautiful soup doc"""
    headers = {
    # Adjacent string literals are joined as-is, so the first part must end with a space
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(scraper_url, headers=headers)
    if not response.ok:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + scraper_url)
    # Construct a beautiful soup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [45]:
doc = get_event_page('https://finance.yahoo.com/calendar/earnings?from=2022-02-27&to=2022-03-05&day=2022-02-28')

### 3.3 Get Embedded JSON data


In this step we will locate the JSON-formatted data. Open the web page and do Right Click $\rightarrow$ View Page Source. If you scroll down the page source you will notice the [JSON](https://www.w3schools.com/whatis/whatis_json.asp)-formatted data. Luckily, this information sits inside a `<script>` tag that contains the text `/* -- Data -- */`.

![](https://i.imgur.com/2xlpNbw.gif)

We will use [regular expressions](https://docs.python.org/3/library/re.html) to get the text inside that `<script>` tag.

In [46]:
pattern = re.compile(r'\s--\sData\s--\s')
#script_data = doc.find('script', text=pattern).text
script_data = doc.find('script', text=pattern).contents[0]
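To see what the pattern selects, here is a minimal sketch that applies it to plain strings rather than to a parsed page. The two sample script texts are made up for illustration (the second one imitates the layout seen in the page source); only the pattern itself comes from the cell above.

```python
import re

# Same pattern as in the tutorial: whitespace, "--", whitespace, "Data", ...
pattern = re.compile(r'\s--\sData\s--\s')

# Hypothetical script contents, standing in for the <script> tags of a page
script_texts = [
    'window.performance && window.performance.mark("start");',
    '(function (root) {\n/* -- Data -- */\nroot.App || (root.App = {});\n}(this));',
]

# Keep only the script texts that contain the " -- Data -- " marker
matching = [text for text in script_texts if pattern.search(text)]
print(len(matching))
```

Only the script text carrying the `/* -- Data -- */` comment is kept, which is why the marker is a convenient way to pick the right tag.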

Further, the JSON string starts with the `context` key (its opening `{"` sits two characters before the word `context`) and ends 12 characters before the end of the script text.

In [47]:
print(script_data[:150])
print(script_data[-150:])


(function (root) {
/* -- Data -- */
root.App || (root.App = {});
root.App.now = 1646585550274;
root.App.main = {"context":{"dispatcher":{"stores":{"P
odal":{"strings":1},"tdv2-wafer-header":{"strings":1},"yahoodotcom-layout":{"strings":1}}},"options":{"defaultBundle":"td-app-finance"}}}};
}(this));



So we can grab the JSON string using Python slicing.

In [48]:
start  = script_data.find('context')-2
json_text  = script_data[start:-12]
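The slice above relies on the fixed 12-character tail. As a hedged alternative (not the approach used in this tutorial), the standard library's `json.JSONDecoder().raw_decode()` can parse the object starting at a given index and report where it ends. The sketch below assumes the script text contains the assignment `root.App.main = ` as seen in the printed output; the sample string and the helper name `extract_json` are illustrative only.

```python
import json

# Shortened, hypothetical copy of the script text printed above
sample_script = (
    '(function (root) {\n'
    '/* -- Data -- */\n'
    'root.App || (root.App = {});\n'
    'root.App.now = 1646585550274;\n'
    'root.App.main = {"context":{"dispatcher":{"stores":{}}}};\n'
    '}(this));\n'
)

def extract_json(script_text, marker='root.App.main = '):
    """Parse the JSON object that follows `marker`; return (object, end index)."""
    start = script_text.find(marker)
    if start == -1:
        raise ValueError('marker not found in script text')
    start += len(marker)  # index of the opening "{"
    # raw_decode stops at the end of the first complete JSON value
    return json.JSONDecoder().raw_decode(script_text, start)

parsed, end = extract_json(sample_script)
print(list(parsed.keys()), len(sample_script) - end)
```

The second printed value is the length of the trailing text after the JSON object, so it can be used to check the hard-coded `-12` against a real page.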

Use the `json.loads()` method to convert the JSON string into a Python dictionary.

In [49]:
parsed_dictionary = json.loads(json_text)
type(parsed_dictionary)

dict

Let's create a function from the steps above.

In [50]:
def get_json_dictionary(doc):
    """Get the JSON-formatted data in the form of a Python dictionary"""
    # Locate the <script> tag that contains the "-- Data --" marker
    pattern = re.compile(r'\s--\sData\s--\s')
    script_data = doc.find('script', text=pattern).contents[0]

    # The JSON starts at the '{"' just before 'context' and ends 12 characters before the end
    start = script_data.find('context') - 2
    json_text = script_data[start:-12]

    parsed_dictionary = json.loads(json_text)
    return parsed_dictionary

### 3.4 Locating JSON Keys

The JSON text is a set of multi-level nested dictionaries, and some of its keys store the data displayed on the web page. In this section we will identify the keys for the data we are trying to scrape.

We'll need a JSON formatter tool to navigate through the keys. I am using the online tool https://jsonblob.com/, but you can choose any tool.

We will write the JSON text to the `my_json_file.json` file, then copy the file content and paste it into the left panel of https://jsonblob.com/. JSON Blob formats it nicely, so we can easily navigate through the keys and search for any item.

In [51]:
with open('my_json_file.json', 'w') as file:
    file.write(json_text)

The next step is to find the location of the required key. Let's search for the company name `3D Systems Corporation`, which is displayed on the web page, in the [JSON Blob](https://jsonblob.com/) formatter.
![](https://i.imgur.com/Iv8b7vl.png)

![](https://i.imgur.com/jpJYOCy.png)

You can see that the table data is stored in the `rows` key, and we can track down its parent keys as shown in the screenshot above. Let's check out the content of the `rows` key.

In [52]:
parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['rows'][:3]

[{'ticker': 'DDD',
  'companyshortname': '3D Systems Corporation',
  'startdatetime': '2022-02-28T16:05:00.000Z',
  'startdatetimetype': 'TAS',
  'epsestimate': 0.03,
  'epsactual': 0.09,
  'epssurprisepct': 181.25,
  'timeZoneShortName': 'EST',
  'gmtOffsetMilliSeconds': -18000000,
  'quoteType': 'EQUITY'},
 {'ticker': 'AMBA',
  'companyshortname': 'Ambarella, Inc.',
  'startdatetime': '2022-02-28T16:31:00.000Z',
  'startdatetimetype': 'TAS',
  'epsestimate': 0.42,
  'epsactual': 0.45,
  'epssurprisepct': 6.13,
  'timeZoneShortName': 'EST',
  'gmtOffsetMilliSeconds': -18000000,
  'quoteType': 'EQUITY'},
 {'ticker': 'AAON',
  'companyshortname': 'AAON, Inc.',
  'startdatetime': '2022-02-28T16:01:00.000Z',
  'startdatetimetype': 'TAS',
  'epsestimate': 0.28,
  'epsactual': 0.18,
  'epssurprisepct': -35.02,
  'timeZoneShortName': 'EST',
  'gmtOffsetMilliSeconds': -18000000,
  'quoteType': 'EQUITY'}]

In [53]:
print('Total Rows on the Current page :',len(parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['rows']))

Total Rows on the Current page : 100


This sub-dictionary holds all the data displayed on the current page.<br>
You can explore the dictionary further to get other information from the web page.
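As an alternative to searching in a formatter tool, the key path can also be located programmatically. The following is a minimal sketch, assuming a nested structure of dictionaries and lists like the one above; the helper name `find_paths` and the small `nested` sample are hypothetical and only mimic the shape of the real data.

```python
def find_paths(node, target, path=()):
    """Yield the key path to every value in `node` that equals `target`."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, target, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, target, path + (index,))
    elif node == target:
        yield path

# Tiny stand-in for parsed_dictionary, shaped like the real structure
nested = {
    'context': {'dispatcher': {'stores': {'ScreenerResultsStore': {'results': {
        'rows': [{'ticker': 'DDD', 'companyshortname': '3D Systems Corporation'}],
        'total': 1,
    }}}}}
}

paths = list(find_paths(nested, '3D Systems Corporation'))
print(paths)
```

Each path is a tuple of dictionary keys and list indexes, so the parent keys of `rows` can be read directly from it.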

In [54]:
print('Total Rows for the search criteria :',parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['total'])

Total Rows for the search criteria : 205


In [55]:
print("Columns")
parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['columns']

Columns


[{'data': 'ticker', 'content': 'Symbol'},
 {'data': 'companyshortname', 'content': 'Company Name'},
 {'data': 'startdatetime', 'content': 'Event Start Date'},
 {'data': 'startdatetimetype', 'content': 'Event Start Time'},
 {'data': 'epsestimate', 'content': 'EPS Estimate'},
 {'data': 'epsactual', 'content': 'Reported EPS'},
 {'data': 'epssurprisepct', 'content': 'Surprise (%)'},
 {'data': 'timeZoneShortName', 'content': 'Timezone short name'},
 {'data': 'gmtOffsetMilliSeconds', 'content': 'GMT Offset'}]

Let's put these lookups into helper functions.

In [56]:
def get_total_rows(parsed_dictionary):
    '''Get the total number of rows for the search criteria'''
    total_rows = parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['total']
    return total_rows

In [57]:
def get_page_rows(parsed_dictionary):
    """Get the rows of the current page as a list of dictionaries"""
    data_dictionary = parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['rows']
    return data_dictionary
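The `columns` key shown earlier maps each internal field name (`data`) to the label displayed on the web page (`content`). If the display labels are preferred in the output, a sketch like the following could rename the keys of each row. The helper name `label_rows` is hypothetical, and the sample values are a shortened copy of the output above.

```python
# Shortened copies of the 'columns' and 'rows' values shown above
columns = [
    {'data': 'ticker', 'content': 'Symbol'},
    {'data': 'companyshortname', 'content': 'Company Name'},
    {'data': 'epsestimate', 'content': 'EPS Estimate'},
]
rows = [
    {'ticker': 'DDD', 'companyshortname': '3D Systems Corporation',
     'epsestimate': 0.03, 'quoteType': 'EQUITY'},
]

def label_rows(rows, columns):
    """Rename row keys to their display labels; keys without a label are kept as-is."""
    labels = {column['data']: column['content'] for column in columns}
    return [
        {labels.get(key, key): value for key, value in row.items()}
        for row in rows
    ]

labelled = label_rows(rows, columns)
print(labelled[0])
```

Keys such as `quoteType`, which appear in the rows but not in `columns`, keep their original names.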

### 3.5 Pagination & Compiling the information into a Python list

In the previous section we saw how to handle `Pagination` using Selenium methods; here we'll learn a new technique for accessing multiple pages.<br>

Most of the time the web page URL changes at runtime depending on the user's selection. For example, in the screenshot below I selected the **Earnings** for **1-March-2022**. Notice how that information is passed in the URL.
![](https://i.imgur.com/h5QU99h.png)

Similarly, when I click the Next button, the `offset` & `size` values change in the URL.
![](https://i.imgur.com/jYa1vq5.png)

So we can figure out the pattern and structure of the URL and how it affects page navigation.<br>

In this case the web page URL pattern is as follows:<br>
- The calendar event type takes one of the following values: `event_types = ['splits','economic','ipo','earnings']`
- The date is passed in `yyyy-mm-dd` format
- The page number is controlled by the `offset` value (for the first page, `offset=0`)
- The maximum number of rows per page is assigned to `size`

Based on this information we can build the URL at runtime, download the page and extract the information. This is how we handle pagination.<br>
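The URL pattern can be sketched with the standard library alone. The helper name `build_page_urls` is hypothetical, and the sketch assumes the total number of rows is already known, whereas in practice it is only known after the first page has been downloaded.

```python
import math
from urllib.parse import urlencode

BASE_URL = 'https://finance.yahoo.com'

def build_page_urls(event_type, day, total_rows, size=100):
    """Build one calendar URL per page for the given event type and day."""
    pages = math.ceil(total_rows / size)  # e.g. 205 rows -> 3 pages
    urls = []
    for page in range(pages):
        # offset is the index of the first row on the page: 0, 100, 200, ...
        query = urlencode({'day': day, 'offset': page * size, 'size': size})
        urls.append('{}/calendar/{}?{}'.format(BASE_URL, event_type, query))
    return urls

urls = build_page_urls('earnings', '2022-02-28', total_rows=205)
for url in urls:
    print(url)
```

With 205 matching rows and 100 rows per page, this produces three URLs with `offset` values 0, 100 and 200.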

Let's put everything together in a function. We pass `event_type` and `date` to it, and it reads the total number of rows matching the criteria with the `get_total_rows` function. The maximum number of rows per page is constant (i.e. 100), so the function advances `offset` by 100 for each page and keeps extracting pages in a loop until all the rows have been collected.


In [58]:
def scrape_all_pages(event_type, date):
    """Download every result page for an event type and date; return the rows as a list of dictionaries"""
    YAHOO_CAL_URL = BASE_URL + '/calendar/{}?day={}&offset={}&size={}'
    max_rows_per_page = 100  # maximum number of rows per page
    page_number = 1
    final_data_dictionary = []

    while True:
        print("Processing page # {}".format(page_number))
        # offset is the index of the first row on the page: 0, 100, 200, ...
        offset = (page_number - 1) * max_rows_per_page
        scrape_url = YAHOO_CAL_URL.format(event_type, date, offset, max_rows_per_page)
        print("Scrape url for page {} is {}".format(page_number, scrape_url))
        page_doc = get_event_page(scrape_url)
        parse_dict = get_json_dictionary(page_doc)
        if page_number == 1:
            total_rows = get_total_rows(parse_dict)
        page_rows = get_page_rows(parse_dict)
        final_data_dictionary += page_rows
        # Stop once every row is collected, or if a page comes back empty
        if len(final_data_dictionary) >= total_rows or not page_rows:
            return final_data_dictionary
        page_number += 1

### 3.6 Save the extracted information to a CSV file

In this last section, we will save the data in CSV format using `pd.DataFrame()` & `to_csv()`, and call everything from a single wrapper function.

In [59]:
def scrape_yahoo_calendar(event_types, date_param):
    """Scrape the Yahoo! finance calendar for each event type and write one CSV file per event type"""
    for event in event_types:
        print('Web Scraping for ', event)
        data_dict = scrape_all_pages(event, date_param)
        scraped_df = pd.DataFrame(data_dict)
        scraped_df.to_csv(event + '_' + date_param + '.csv', index=False)

Calling the final function `scrape_yahoo_calendar`:

In [60]:
BASE_URL = 'https://finance.yahoo.com'  # global variable used by scrape_all_pages
date_param = '2022-02-28'
date_param = '2022-03-18'  # overrides the line above: no data condition
event_types = ['splits','economic','ipo','earnings']
scrape_yahoo_calendar(event_types, date_param)

Web Scraping for  splits
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/splits?day=2022-03-18&offset=0&size=100
Web Scraping for  economic
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/economic?day=2022-03-18&offset=0&size=100
Web Scraping for  ipo
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/ipo?day=2022-03-18&offset=0&size=100
Web Scraping for  earnings
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/earnings?day=2022-03-18&offset=0&size=100


In total, 4 CSV files named "event_type_yyyy-mm-dd.csv" should be available in the File $\rightarrow$ Open menu. You can download the files or open them directly in the browser. Please verify the file content and compare it with the information available on the web page.
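For a quick check of the CSV layout without pandas, the same list of row dictionaries can be serialized with the standard library `csv` module. This is only a sketch of an alternative, not the method used above; the helper name `rows_to_csv_text` is hypothetical and the two rows are a shortened copy of the output in section 3.4.

```python
import csv
import io

# Shortened copies of two rows from section 3.4
rows = [
    {'ticker': 'DDD', 'companyshortname': '3D Systems Corporation', 'epsestimate': 0.03},
    {'ticker': 'AMBA', 'companyshortname': 'Ambarella, Inc.', 'epsestimate': 0.42},
]

def rows_to_csv_text(rows):
    """Serialize a list of row dictionaries to CSV text; empty input gives an empty string."""
    if not rows:
        return ''
    buffer = io.StringIO()
    # The keys of the first row are used as the header line
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv_text(rows)
print(csv_text)
```

Note that a value containing a comma, such as `Ambarella, Inc.`, is wrapped in double quotes so that it stays in a single column.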

**Summary**: This is a very useful technique that is easy to replicate. Without writing any page-specific code, we were able to extract data from multiple types of web pages just by changing one variable (in this case `event_type`).

## References

Useful links:

- https://htmldog.com/guides/html/
- https://selenium-python.readthedocs.io/index.html
- https://stackoverflow.com/questions/27003423/staleelementreferenceexception-on-python-selenium
- https://www.w3schools.com/js/js_json_intro.asp
- https://hhsm95.dev/blog/the-importance-of-using-user-agent-to-scraping-data/

## Future Work

Ideas for future work:<br>
- Automate this process using [AWS Lambda](https://aws.amazon.com/lambda/) to download the daily market calendar, cryptocurrencies & market news in CSV format.
- Move old files to an archive folder, append a date stamp to the file names if required, and delete archived files older than 2 weeks.
- Process the raw data extracted with the third technique using different pandas methods.

## Conclusion

In this tutorial we implemented the following web scraping techniques:
 - Using requests, BeautifulSoup and HTML tags to extract data from a web page.
 - Using Selenium to scrape data from dynamically loading websites.
 - Using embedded JSON data to scrape a website.

I hope I was able to teach you these web scraping methods and that you can use this knowledge to scrape other websites.

Thank you for reading. Happy coding!!!
