# Web Scraping Yahoo! Finance using multiple techniques in Python

A detailed guide for scraping https://finance.yahoo.com/ using ***requests***, ***BeautifulSoup***, ***selenium***, ***HTML tags*** & existing available data in ***json*** format.

![](https://imgur.com/7jMFOcE.png)

**What is Web scraping?**

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.

**Introduction**

The main objective of this tutorial is to showcase different web scraping methods that can be applied to any web page. 
This is for educational purposes only. Before scraping any website, read its Terms & Conditions carefully to check whether you can legally use the data. 

In this project we will perform web scraping using the following 3 techniques, depending on the problem statement:
* use `BeautifulSoup` and `HTML tags` to extract data from a web page
* use `selenium` to scrape data from dynamically loading websites
* scrape data that is already available in `json` format



**The problem statement**

1. Scrape **Stock Market News** (url : https://finance.yahoo.com/topic/stock-market-news/) :<br>
    This web page shows the latest **news** related to the **stock market**. We will extract data from this page and store it in a `CSV` (comma-separated values) file with the following layout.
    ```
    source,headline,url,content,image
    <source of the news>,<news head line>,<news url>,<news content>,<news thumbnail image>
    ```
    
2. Scrape **Trending Tickers** (url : https://finance.yahoo.com/trending-tickers) :<br>
    This Yahoo! Finance page lists the trending **Tickers** in tabular format. We will scrape the first 8 columns for all available **Tickers** and store them in `CSV` format.
    ```
    Symbol,Name,Last Price,Market Time,Change,% Change,Volume,Market Cap
    COKE,"Coca-Cola Consolidated, Inc.",446.66,4:00PM EST,-136.98,-23.47%,"100,345",4.187B
    ```
        
3. Scrape **Market Events Calendar** (url : https://finance.yahoo.com/calendar) :<br> 
    This page shows **date-wise market events**. The user can select a date and choose one of the following market events: **Earnings**, **Stock Splits**, **Economic Events** & **IPO**. Our aim is to create a script that can be run for any single date and market event, grabs the data and loads it in `CSV` format. If no data is found, the script just creates a file with the column headers.
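
The "file with column headers only" requirement of the third problem can be sketched with the standard `csv` module. This is only a minimal illustration: the helper name `write_events_csv`, the file name and the column names are assumptions made for the example, not values taken from the Yahoo! Finance calendar page.

```python
import csv

def write_events_csv(path, columns, rows):
    """Write rows (a list of dictionaries) to a CSV file.

    The header row is always written, so an empty `rows` list
    still produces a file that contains only the column headers.
    """
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical file and column names, used only for illustration
write_events_csv('events-sample.csv', ['Symbol', 'Company'], [])
```

Writing the header unconditionally keeps the "no data" case from needing a separate code path.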
    


**Future Work**

Automate this process to get the daily calendar, trending tickers & news in CSV files:
- create 6 files daily
- move the old files into an archive folder with a timestamp
- delete files older than 2 weeks
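
The archiving and clean-up steps listed above could look roughly like the sketch below. It uses only the standard library; the function name `archive_csv_files`, the folder layout and the timestamp format are assumptions for illustration, not part of the project.

```python
import shutil
import time
from datetime import datetime
from pathlib import Path

def archive_csv_files(data_dir, archive_dir, max_age_days=14):
    """Move CSV files into an archive folder and prune old archives."""
    data_dir, archive_dir = Path(data_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Move each current CSV into the archive, adding a timestamp to its name
    stamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    for csv_file in data_dir.glob('*.csv'):
        target = archive_dir / f'{csv_file.stem}_{stamp}{csv_file.suffix}'
        shutil.move(str(csv_file), str(target))

    # Delete archived files whose modification time is older than the cutoff
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    for old_file in archive_dir.glob('*.csv'):
        if old_file.stat().st_mtime < cutoff:
            old_file.unlink()
```

Note that the age check relies on the file modification time, which is preserved when a file is moved on the same disk.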

**Prerequisites**

* Knowledge of Python
* Basic knowledge of HTML (helpful, but not required)


**How to run the Code**

You can execute the code using the "Run" button at the top of this page and selecting **"Run on Colab"** or **"Run Locally"**.
<br>
<br>
**Setup and Tools**

<u>Run on Colab :</u> 
    You will need to sign in with a Google account to run this notebook on Colab.<br>

<u>Run Locally :</u> Download and install [Anaconda](https://www.anaconda.com/). We will be using Jupyter Notebook for writing & executing the code.



**Code Re-usability & Version control**

You can make changes and save your version of the notebook to [Jovian](https://jovian.ai/) by executing the following cells.

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="yahoo-finance-web-scraper")

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/vinodvidhole/yahoo-finance-web-scraper


'https://jovian.ai/vinodvidhole/yahoo-finance-web-scraper'

## 1. Scrape Stock Market News 

Let's kick off with the first objective. Here's an outline of the steps we'll follow:<br>

**1.1 Download & Parse webpage using `requests` and `BeautifulSoup`**<br>
**1.2 Exploring and locating Elements**<br>
**1.3 Extract & Compile the information into python list**<br>
**1.4 Save the extracted information to a CSV file**<br>


### 1.1 Download & Parse webpage using requests and BeautifulSoup

The first step is to install the [`requests`](https://docs.python-requests.org/en/latest/) & [`beautifulsoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries using `pip`.

In [None]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
import requests
from bs4 import BeautifulSoup

The libraries are now installed and imported.<br>

To download the page, we can use `requests.get`, which returns a response object. The HTML of the web page is available in `response.text`.<br>
`response.ok` & [`response.status_code`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) can be used for error trapping & tracking.<br> 
Finally, we can use `BeautifulSoup` to parse the HTML data, which returns a `bs4.BeautifulSoup` object.

Let's create a function to perform this step.

In [None]:
def get_page(url):
    """Download a webpage and return a beautiful soup doc"""
    response = requests.get(url)
    if not response.ok:
        print('Status code:', response.status_code)
        raise Exception('Failed to load page {}'.format(url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

Let's call the `get_page` function and analyze the output.

In [None]:
my_url = 'https://finance.yahoo.com/topic/stock-market-news/' #Global variable 
doc = get_page(my_url)

In [None]:
print('Type of doc: ',type(doc))

Type of doc:  <class 'bs4.BeautifulSoup'>


You can access different properties of the HTML page from `doc`. The following example displays the title of the web page.

In [None]:
doc.find('title')

<title>Latest Stock Market News</title>

**Summary** : We can now use the function `get_page` to download any web page and parse it using beautiful soup.

### 1.2 Exploring and locating Elements
Now it's time to explore the elements to find the required data points on the web page. Web pages are written in a language called HTML (Hyper Text Markup Language). HTML is a fairly simple language made up of *tags* (also called *nodes* or *elements*), e.g. `<a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>`. An HTML tag has three parts:



1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
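
The three parts can be seen by parsing the example `<a>` tag from above with `BeautifulSoup`. This is a small sketch on an inline HTML string, so it does not need a network connection; the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

html = '<a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>'
tag = BeautifulSoup(html, 'html.parser').find('a')

print(tag.name)    # name of the tag
print(tag.attrs)   # attributes as a dictionary
print(tag.text)    # text contained between the opening and closing segments
```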

Now let's inspect the source code of the web page: right-click on the page and select the "Inspect" option. First we need to identify the tag that represents the news listing.


![](https://media.giphy.com/media/RQpW64jdiQG8LNCGsZ/giphy.gif)


In this case we can see that the `<div>` tag with the class name `"Ov(h) Pend(44px) Pstart(25px)"` represents a news listing. We can apply the `find_all` method to grab this information.

In [None]:
div_tags = doc.find_all('div', {'class': "Ov(h) Pend(44px) Pstart(25px)"})

The number of elements in the `<div>` tag list matches the number of news items displayed on the web page, so we are heading in the right direction.

In [None]:
len(div_tags)

9
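
One detail worth knowing about the class lookup: when the class value contains several space-separated classes, `BeautifulSoup` compares it with the exact string of the `class` attribute, so a different ordering of the same classes would not match. The sketch below illustrates this on a made-up two-line HTML snippet; searching for a single class is a possible, more tolerant alternative.

```python
from bs4 import BeautifulSoup

html = '''
<div class="Ov(h) Pend(44px) Pstart(25px)">first</div>
<div class="Pstart(25px) Ov(h) Pend(44px)">second</div>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact string match: only the tag whose class attribute is written in this order
exact = sample_doc.find_all('div', {'class': 'Ov(h) Pend(44px) Pstart(25px)'})

# Single class match: every tag that carries this class, in any position
single = sample_doc.find_all('div', class_='Pend(44px)')

print(len(exact), len(single))
```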

The next step is to inspect an individual `<div>` tag and try to find more information. I am using "Visual Studio Code" to view the HTML, but you can use any tool, even one as simple as Notepad.

In [None]:
div_tags[1]

<div class="Ov(h) Pend(44px) Pstart(25px)"><div class="C(#959595) Fz(11px) D(ib) Mb(6px)">Investor's Business Daily</div><h3 class="Mb(5px)"><a class="js-content-viewer wafer-caas Fw(b) Fz(18px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 mega-item-header-link Td(n) C(#0078ff):h C(#000) LineClamp(2,46px) LineClamp(2,38px)--sm1024 not-isInStreamVideoEnabled" data-uuid="e074e541-a391-3d14-a74d-dcf58539eb5c" data-wf-caas-prefetch="1" data-wf-caas-uuid="e074e541-a391-3d14-a74d-dcf58539eb5c" href="/m/e074e541-a391-3d14-a74d-dcf58539eb5c/5-best-chinese-stocks-to-buy.html"><u class="StretchedBox"></u>5 Best Chinese Stocks To Buy And Watch</a></h3><p class="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(2,38px) LineClamp(2,34px)--sm1024 M(0)">Hundreds of Chinese companies are listed on U.S. markets. But which are the best Chinese stocks to buy or watch right now? JD.com , NetEase, Li Auto, Xpeng and BYD Co.. China is the world's most-po

![](https://i.imgur.com/ncnfg0z.png)

Luckily most of the required data points are available in this `<div>`, so we can use the `find` method to grab each item.

In [None]:
print("Source: ", div_tags[1].find('div').text)
print("Head Line : {}".format(div_tags[1].find('a').text))

Source:  Investor's Business Daily
Head Line : 5 Best Chinese Stocks To Buy And Watch


If a tag is not directly accessible, you can use methods like `findParent()` or `findChild()` to reach the required tag.

![](https://i.imgur.com/OnOAtT2.png)

In [None]:
print("Image URL: ",div_tags[1].findParent().find('img')['src'])

Image URL:  https://s.yimg.com/uu/api/res/1.2/LZOlRB.FeJ.k1zr8FrQ1Zg--~B/Zmk9c3RyaW07aD0xMjM7cT04MDt3PTIyMDthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/uu/api/res/1.2/WneqmBtsZgljDP2fI8nbuQ--~B/aD01NjM7dz0xMDAwO2FwcGlkPXl0YWNoeW9u/https://media.zenfs.com/en/ibd.com/921b5be5ac8c931a68a703b184467a49.cf.jpg
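
The navigation used for the image can be reproduced on a tiny, made-up HTML snippet. In this sketch the `<img>` is a sibling of the news `<div>`, so it is only reachable after stepping up to the parent. The snippet and its class name are assumptions for illustration; `find_parent` is the current spelling of `findParent` in `bs4`.

```python
from bs4 import BeautifulSoup

html = '''
<li>
  <img src="thumb.jpg">
  <div class="news"><h3><a href="/story.html">Headline</a></h3></div>
</li>
'''
sample_doc = BeautifulSoup(html, 'html.parser')
news_div = sample_doc.find('div', {'class': 'news'})

# The image is a sibling of the <div>, so it cannot be found from the <div> itself
print(news_div.find('img'))

# Step up to the parent <li>, then search downwards again
print(news_div.find_parent().find('img')['src'])
```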


**Summary** : The key takeaway from this exercise is to identify the optimal tag that provides the required information. Sometimes it's straightforward; sometimes you will have to do a little more research.

### 1.3 Extract & Compile the information into python list

Now that we've identified all the required tags and information, let's put this together in functions.

In [None]:
def get_news_tags(doc):
    """Get the list of tags containing news information"""
    news_class = "Ov(h) Pend(44px) Pstart(25px)" ## class name of div tag 
    news_list  = doc.find_all('div', {'class': news_class})
    return news_list

Sample run of the function `get_news_tags`:

In [None]:
my_news_tags = get_news_tags(doc)

We will create one more function to parse an individual `<div>` tag and return the information as a dictionary.

In [None]:
BASE_URL = 'https://finance.yahoo.com' #Global Variable 

def parse_news(news_tag):
    news_source = news_tag.find('div').text #source
    news_headline = news_tag.find('a').text #heading
    news_url = news_tag.find('a')['href'] #link
    news_content = news_tag.find('p').text #content
    news_image = news_tag.findParent().find('img')['src'] #thumb image
    return { 'source' : news_source,
            'headline' : news_headline,
            'url' : BASE_URL + news_url,
            'content' : news_content,
            'image' : news_image
           }

Let's test the `parse_news` function on the first `<div>` tag.

In [None]:
parse_news(my_news_tags[0])

{'content': 'Dow Jones futures were in focus Monday, as the stock market rally attempt continues. Lucid and Zoom plunged on earnings after the close.',
 'headline': 'Dow Jones Futures: Five Stocks Eyeing Buy Points In Market Rally Attempt; Lucid, Zoom Plunge On Earnings',
 'image': 'https://s.yimg.com/uu/api/res/1.2/61FUW4bOvpRgi5tqYWsCvA--~B/Zmk9c3RyaW07aD0xMjM7cT04MDt3PTIyMDthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/uu/api/res/1.2/JPX5jeyihDRi4N6ueuYJVQ--~B/aD01NjM7dz0xMDAwO2FwcGlkPXl0YWNoeW9u/https://media.zenfs.com/en/ibd.com/ed4e9880b0c3e9257977ca3554a6c383.cf.jpg',
 'source': "Investor's Business Daily",
 'url': 'https://finance.yahoo.com/m/811b94de-fece-3540-984f-e0097f4cd7c2/dow-jones-futures-five.html'}

**Summary** : We can use the `get_news_tags` & `parse_news` functions to parse news.
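
`parse_news` assumes that every tag it looks for exists. `find` returns `None` when nothing matches, so accessing `.text` on a missing tag raises an `AttributeError`. One possible way to guard against this is a small helper like the one sketched below; `safe_text` is a hypothetical name and the HTML snippet is made up for the example.

```python
from bs4 import BeautifulSoup

def safe_text(tag, name):
    """Return the text of the first `name` child, or None if it is missing."""
    child = tag.find(name)
    return child.text if child is not None else None

html = '<div><div>Some Source</div><h3><a href="/x.html">A headline</a></h3></div>'
sample_tag = BeautifulSoup(html, 'html.parser').find('div')

print(safe_text(sample_tag, 'a'))  # headline text
print(safe_text(sample_tag, 'p'))  # None, because there is no <p> tag
```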

### 1.4 Save the extracted information to a CSV file

This is the last step of this section. We are going to use the Python library [`pandas`](https://pandas.pydata.org/docs/) to save the data in CSV format. Install and then import the pandas library.

In [None]:
!pip install pandas --upgrade --quiet

In [None]:
import pandas as pd

Now we will create one final function that uses all the previously created helper functions.<br>
The `get_page` function downloads the HTML page, then we pass the result to `get_news_tags` to get the list of `<div>` tags for news.<br>
After that we use a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to parse each `<div>` tag with `parse_news`; the output is a `list` of `dictionaries`.<br>
Finally, we use the `DataFrame` constructor to create a pandas [dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and the `to_csv` method to store the data in CSV format.

In [None]:
def scrape_yahoo_news(url, path=None):
    """Get the yahoo finance market news and write them to CSV file """
    if path is None:
        path = 'stock-market-news.csv'
        
    print('Requesting html page')
    doc = get_page(url)

    print('Extracting news tags')
    news_list = get_news_tags(doc)

    print('Parsing news tags')
    news_data = [parse_news(news_tag) for news_tag in news_list]

    print('Save the data to a CSV')
    news_df = pd.DataFrame(news_data)
    news_df.to_csv(path, index=None)
    
    #This return statement is optional, we are doing this just to analyze the final output 
    return news_df 

It's time to test the `scrape_yahoo_news` function 

In [None]:
YAHOO_NEWS_URL = BASE_URL+'/topic/stock-market-news/'
news_df = scrape_yahoo_news(YAHOO_NEWS_URL)

Requesting html page
Extracting news tags
Parsing news tags
Save the data to a CSV


The "stock-market-news.csv" file should be available in the File --> Open menu. You can download the file or open it directly in the browser. Please verify the file content and compare it with the actual information available on the web page.

You can also check the data by grabbing a few rows from the data frame returned by the `scrape_yahoo_news` function.

In [None]:
news_df[:5]

Unnamed: 0,source,headline,url,content,image
0,Investor's Business Daily,Dow Jones Futures: Five Stocks Eyeing Buy Poin...,https://finance.yahoo.com/m/811b94de-fece-3540...,"Dow Jones futures were in focus Monday, as the...",https://s.yimg.com/uu/api/res/1.2/61FUW4bOvpRg...
1,Investor's Business Daily,5 Best Chinese Stocks To Buy And Watch,https://finance.yahoo.com/m/e074e541-a391-3d14...,Hundreds of Chinese companies are listed on U....,https://s.yimg.com/uu/api/res/1.2/LZOlRB.FeJ.k...
2,TipRanks,2 “Strong Buy” Dividend Stocks Yielding 8%,https://finance.yahoo.com/news/2-strong-buy-di...,"Oil is up, the Russian ruble is down, and fina...",https://s.yimg.com/uu/api/res/1.2/rSqVk4eGqXGf...
3,MarketWatch,Lucid stock falls 14% after luxury EV maker sl...,https://finance.yahoo.com/m/fecfb3fa-c65f-3073...,Lucid Group Inc. stock fell more than 14% late...,https://s.yimg.com/uu/api/res/1.2/WVUzLFFdiTdh...
4,Bloomberg,"Russian Stocks, Bonds Face Rising Risk of Ejec...",https://finance.yahoo.com/news/russian-stocks-...,(Bloomberg) -- The mounting sanctions against ...,https://s.yimg.com/uu/api/res/1.2/BvyKw_zpv0g7...
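
The saved file can also be checked programmatically rather than by eye. The sketch below reads a CSV back with the standard `csv` module, confirms the header row and counts the data rows. `check_csv` is a hypothetical helper, and the demonstration uses a throwaway sample file so that it runs without the scraped data.

```python
import csv

def check_csv(path, expected_columns):
    """Return the number of data rows after checking the header row."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != expected_columns:
            raise ValueError('Unexpected columns: {}'.format(reader.fieldnames))
        return sum(1 for _ in reader)

# Small self-contained demonstration with a throwaway file
columns = ['source', 'headline', 'url', 'content', 'image']
with open('sample-news.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([columns, ['s', 'h', 'u', 'c', 'i']])
print(check_csv('sample-news.csv', columns))
```

The same call with `'stock-market-news.csv'` as the path should work on the scraped file, assuming it was written with the five columns listed in the problem statement.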


**Summary** : Hopefully I was able to explain this simple but very powerful Python technique for scraping the Yahoo! Finance market news. These steps can be used to scrape any web page; you just have to do a little research to identify the required `<tags>` and use the relevant Python methods to collect the data.

## 2. Scrape **Trending Tickers**

In phase one we were able to scrape the [Yahoo! market news](https://finance.yahoo.com/topic/stock-market-news/) web page. However, you may have noticed that as we scroll down the page, more news appears at the bottom. This is called dynamic page loading. The previous technique is a basic Python method that is useful for scraping static data. Scraping dynamically loaded data requires a different method, which we are going to discuss in this phase.
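
To make the idea of dynamic loading concrete, here is a rough sketch of how a page that loads more content on scroll could be driven from Python. It assumes a Selenium WebDriver object is passed in as `driver` and relies only on its `execute_script` method. The function name `scroll_to_bottom`, the pause length and the round limit are assumptions for illustration; this is not the approach used for the tickers table in this section.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing or max_rounds is reached.

    `driver` is expected to be a Selenium WebDriver (anything that provides
    `execute_script` works). Returns the number of scrolls performed.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    for round_number in range(1, max_rounds + 1):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds
```

A fixed pause is the simplest option; on slow connections it may need to be longer.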


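**About Selenium**

[Selenium](https://www.selenium.dev/) automates a real web browser, so content that the page loads dynamically is available for scraping.

### Prerequisites

- install the `selenium` and `webdriver-manager` packages using `pip`
- download the required chromedriver and place it in the project path (`webdriver-manager`, used below, can download it automatically)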

In [None]:
!pip install selenium --upgrade --quiet

In [None]:
!pip install webdriver-manager --upgrade --quiet

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd 

In [None]:
def get_driver():
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--headless')
    serv = Service(ChromeDriverManager().install())
    #driver = webdriver.Chrome(service=serv)
    driver = webdriver.Chrome(options=chrome_options, service=serv)
    return driver

In [1]:
!pip install chromium-chromedriver --quiet

ERROR: Could not find a version that satisfies the requirement chromium-chromedriver (from versions: none)
ERROR: No matching distribution found for chromium-chromedriver


In [81]:
# install chromium, its driver, and selenium
!apt install chromium-chromedriver --quiet
!pip install selenium --quiet
# set options to be headless, ..
from selenium import webdriver
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
driver = webdriver.Chrome(options=options)
driver.get("https://finance.yahoo.com/trending-tickers")

# the first table header cell
header = driver.find_elements(By.TAG_NAME, value='th')
print(header[0].text)

# the second cell (Name) of the second table row
rownum = 2
print(driver.find_element(By.XPATH, value="//tr[{}]/td[2]".format(rownum)).text)

Reading package lists...
Building dependency tree...
Reading state information...
chromium-chromedriver is already the newest version (97.0.4692.71-0ubuntu0.18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-470
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 69 not upgraded.
Symbol
Zoom Video Communications, Inc.


In [None]:
def get_tickers(driver):
    """Load the trending tickers page and return the number of table rows"""
    TABLE_CLASS = "W(100%)"
    driver.get(YAHOO_FINANCE_URL)
    tablerows = len(driver.find_elements(By.XPATH, value="//table[@class= '{}']/tbody/tr".format(TABLE_CLASS)))
    return tablerows

In [None]:
def parse_ticker(rownum, table_driver):
    """Return the first 8 cells of table row `rownum` (1-based) as a dictionary"""
    columns = ['Symbol', 'Name', 'LastPrice', 'MarketTime',
               'Change', 'PercentChange', 'Volume', 'MarketCap']
    return {
        column: table_driver.find_element(
            By.XPATH, value="//tr[{}]/td[{}]".format(rownum, colnum)).text
        for colnum, column in enumerate(columns, start=1)
    }
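
The XPath expressions used here address cells by position, and XPath positions start at 1, not 0. The indexing can be tried without a browser using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The two-row table below is made up for illustration.

```python
import xml.etree.ElementTree as ET

table = ET.fromstring('''
<table>
  <tbody>
    <tr><td>COKE</td><td>Coca-Cola Consolidated, Inc.</td></tr>
    <tr><td>ZM</td><td>Zoom Video Communications, Inc.</td></tr>
  </tbody>
</table>
''')

rownum = 2
# XPath positions are 1-based: tr[2] is the second row, td[2] its second cell
cell = table.find('.//tr[{}]/td[2]'.format(rownum))
print(cell.text)
```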

In [None]:
YAHOO_FINANCE_URL = 'https://finance.yahoo.com/trending-tickers'

print('Creating driver')
driver = get_driver()

In [None]:
get_tickers(driver)

In [None]:
header = driver.find_elements(By.TAG_NAME, value= 'th')

In [None]:
header[0].text

In [None]:
rownum=2
txt=driver.find_element(By.XPATH, value="//tr[{}]/td[2]".format(rownum)).text
txt

In [None]:
YAHOO_FINANCE_URL = 'https://finance.yahoo.com/trending-tickers'

print('Creating driver')
driver = get_driver()

print('Fetching the page')
table_rows = get_tickers(driver)

print(f'Found {table_rows} Tickers')

print('Parsing Trending tickers')
ticker_data = [parse_ticker(i, driver) for i in range (1, table_rows + 1)]

print('Save the data to a CSV')
tickers_df = pd.DataFrame(ticker_data)
tickers_df.to_csv('trending-tickers.csv', index=None)






Creating driver


Current google-chrome version is 98.0.4758
Get LATEST chromedriver version for 98.0.4758 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/98.0.4758.102/chromedriver_mac64.zip
Driver has been saved in cache [/Users/vinoddhole/.wdm/drivers/chromedriver/mac64/98.0.4758.102]


Fetching the page
Found 30 Tickers
Parsing Trending tickers
Save the data to a CSV


**Installation**

* Anaconda: Download and install it from https://www.anaconda.com/. We will be using Jupyter Notebook for writing the code.
* Chromedriver (WebDriver for Chrome): Download it from https://chromedriver.chromium.org/downloads. There is no need to install it; just copy the file into the folder where the Python file will be created. Before downloading, confirm that the driver's version matches that of the installed Chrome browser.

In [None]:
jovian.commit(project="yahoo-finance-web-scraper")

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/vinodvidhole/yahoo-finance-web-scraper


'https://jovian.ai/vinodvidhole/yahoo-finance-web-scraper'

**To do**

- fix the timezone in market events
- check the notes above
- testing: done for the normal case and for zero rows
- comments
- print statements & function docstrings
- code clean-up, if applicable
- documentation

## References
* https://htmldog.com/guides/html/