# Web Scraping Yahoo! Finance using Python

A detailed guide for scraping https://finance.yahoo.com/ using **requests**, **BeautifulSoup**, **Selenium**, **HTML tags** & data that is already available in the page in **json** format.

![](https://imgur.com/7jMFOcE.png)

**What is Web scraping?**<br>
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.

**Introduction**<br>
The main objective of this tutorial is to showcase different web scraping methods that can be applied to any web page. 
This tutorial is for educational purposes only. Please read the Terms & Conditions of any website carefully to check whether you can legally use its data. 

In this project we will perform web scraping using the following three techniques, one for each problem statement.
* use `requests`, `BeautifulSoup` and `HTML tags` to extract data from a web page
* use `Selenium` to scrape data from dynamically loading websites
* scrape data that is already available in `json` format

**The problem statement**<br>
1. Scrape **Stock Market News** (url : https://finance.yahoo.com/topic/stock-market-news/) :<br>
    This web page shows the latest **news** related to the **stock market**. We will extract data from this web page and store it in a `CSV` (comma-separated values) file with the layout shown below.
    ```
    source,headline,url,content,image
    <source of the news>,<news head line>,<news url>,<news content>,<news thumbnail image>
    ```

2. Scrape **Cryptocurrencies** (url : https://finance.yahoo.com/cryptocurrencies) :<br>
    This Yahoo! Finance web page shows a list of trending **Cryptocurrencies** in tabular format. We will scrape the first 10 columns for the top 100 **Cryptocurrencies** and store them in `CSV` format.
    ```
    Symbol,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),
    Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply
    BTC-USD,Bitcoin USD,"43,312.13",-947.50,-2.14%,821.76B,27.727B,27.727B,27.727B,18.973M

    ```
        
3. Scrape **Market Events Calendar** (url : https://finance.yahoo.com/calendar) :<br> 
    This page shows **date-wise market events**. The user can select a date and choose one of the following market events: **Earnings**, **Stock Splits**, **Economic Events** & **IPO**. Our aim is to create a script that can be run for any single date and market event, grabs the data, and stores it in `CSV` format. If no data is found, it just creates a file with the column headers.<br>
    



**Prerequisites**
* Knowledge of Python
* Basic knowledge of HTML (helpful, but not required)


**How to run the Code**<br>
You can execute the code using the "Run" button at the top of this page and selecting **"Run on Colab"** or **"Run Locally"**.
<br>
<br>
**Setup and Tools**<br>
<u>Run on Colab :</u> 
    You will need to provide the Google login to run this notebook on Colab.<br>
<u>Run Locally :</u> Download and install the [Anaconda](https://www.anaconda.com/) distribution. We will be using Jupyter Notebook for writing & executing the code.



**Code Re-usability & Version control**

You can make changes and save your version of the notebook to [Jovian](https://jovian.ai/) by executing the following cells.

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="yahoo-finance-web-scraper")

## <u>1. Scrape Stock Market News</u>

In this section we will learn a basic Python web scraping technique using `requests`, `BeautifulSoup` and `HTML tags`. The objective here is to scrape the [Yahoo! Finance Stock Market News](https://finance.yahoo.com/topic/stock-market-news/) page.

![](https://i.imgur.com/1I0Btau.jpg)

Let's start with the first objective. Here's an outline of the steps we'll follow:<br>
**1.1 Download & Parse webpage using `requests` and `BeautifulSoup`**<br>
**1.2 Exploring and locating Elements**<br>
**1.3 Extract & Compile the information into python list**<br>
**1.4 Save the extracted information to a CSV file**<br>


### 1.1 Download & Parse webpage using requests and BeautifulSoup

The first step is to install the [`requests`](https://docs.python-requests.org/en/latest/) & [`beautifulsoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries using `pip`.

In [None]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
import requests
from bs4 import BeautifulSoup

The libraries are now installed and imported.<br>

To download the page, we can use `requests.get`, which returns a response object. The HTML of the web page is available in `response.text`.<br>
`response.ok` & [`response.status_code`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) can be used for error trapping & tracking.<br> 
Finally, we can use `BeautifulSoup` to parse the HTML data, which returns a `bs4.BeautifulSoup` object.

Let's create a function to perform this step.

In [None]:
def get_page(url):
    """Download a webpage and return a beautiful soup doc"""
    response = requests.get(url)
    if not response.ok:
        print('Status code:', response.status_code)
        raise Exception('Failed to load page {}'.format(url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

Let's call the function `get_page` and analyze the output.

In [None]:
my_url = 'https://finance.yahoo.com/topic/stock-market-news/' 
doc = get_page(my_url)

In [None]:
print('Type of doc: ',type(doc))

You can access different properties of the HTML web page from `doc`. The following example displays the title of the web page.

In [None]:
doc.find('title')

**Summary** : We can now use the function `get_page` to download any web page and parse it using `BeautifulSoup`.

### 1.2 Exploring and locating Elements
Now it's time to explore the elements and find the required data points on the web page. Web pages are written in a language called HTML (HyperText Markup Language). HTML is a fairly simple language made up of *tags* (also called *nodes* or *elements*), e.g. `<a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>`. An HTML tag has three parts:



1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
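
To make the three parts concrete, here is a small sketch that parses the example anchor tag from above with `BeautifulSoup` and reads its name, attributes and children. The variable names are illustrative only and are not used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

html = '<a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>'
tag = BeautifulSoup(html, 'html.parser').find('a')

tag_name = tag.name            # the tag's name
tag_attrs = dict(tag.attrs)    # its attributes, as a dictionary
tag_text = tag.text            # its children (here just plain text)

print(tag_name)
print(tag_attrs)
print(tag_text)
```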

Now let's inspect the web page source code by right-clicking on the page and selecting the "Inspect" option. First we need to identify the tag that represents the news listing.

![](https://media.giphy.com/media/RQpW64jdiQG8LNCGsZ/giphy.gif)


In this case we can see that the `<div>` tag with the class name `"Ov(h) Pend(44px) Pstart(25px)"` represents the news listing. We can apply the `find_all` method to grab this information.

In [None]:
div_tags = doc.find_all('div', {'class': "Ov(h) Pend(44px) Pstart(25px)"})

The total number of elements in the `<div>` tag list matches the number of news items displayed on the web page, so we are heading in the right direction.

In [None]:
len(div_tags)

The next step is to inspect an individual `<div>` tag and look for more information. I am using "Visual Studio Code" to view the output, but you can use any tool, even one as simple as Notepad.

In [None]:
div_tags[1]

![](https://i.imgur.com/ncnfg0z.png)

Luckily, most of the required data points are available in this `<div>`, so we can use the `find` method to grab each item.

In [None]:
print("Source: ", div_tags[1].find('div').text)
print("Head Line : {}".format(div_tags[1].find('a').text))

If a tag is not directly accessible, you can use methods like `findParent()` or `findChild()` to reach the required tag.

![](https://i.imgur.com/OnOAtT2.png)

In [None]:
print("Image URL: ",div_tags[1].findParent().find('img')['src'])

**Summary** : The key takeaway from this exercise is to identify the optimal tag that provides the required information. Sometimes it's straightforward; sometimes you will have to do a little more research.

### 1.3 Extract & Compile the information into python list

Now that we've identified all the required tags and information, let's put this together in functions.

In [None]:
def get_news_tags(doc):
    """Get the list of tags containing news information"""
    news_class = "Ov(h) Pend(44px) Pstart(25px)" ## class name of div tag 
    news_list  = doc.find_all('div', {'class': news_class})
    return news_list

Sample run of the function `get_news_tags`:

In [None]:
my_news_tags = get_news_tags(doc)

We will create one more function to parse an individual `<div>` tag and return the information as a dictionary.

In [None]:
BASE_URL = 'https://finance.yahoo.com' #Global Variable 

def parse_news(news_tag):
    news_source = news_tag.find('div').text #source
    news_headline = news_tag.find('a').text #heading
    news_url = news_tag.find('a')['href'] #link
    news_content = news_tag.find('p').text #content
    news_image = news_tag.findParent().find('img')['src'] #thumb image
    return { 'source' : news_source,
            'headline' : news_headline,
            'url' : BASE_URL + news_url,
            'content' : news_content,
            'image' : news_image
           }

Let's test the `parse_news` function on the first `<div>` tag.

In [None]:
parse_news(my_news_tags[0])

**Summary** : We can use the `get_news_tags` & `parse_news` functions to parse news.
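
One detail worth a closer look: `parse_news` builds the article link with `BASE_URL + news_url`, which assumes every `href` is a relative path. If some links on the page are already absolute (an assumption, not something verified here), plain concatenation would produce a broken URL. A minimal sketch of a more forgiving approach uses `urljoin` from the standard library; the helper name `absolute_url` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://finance.yahoo.com'

def absolute_url(href, base=BASE_URL):
    """Hypothetical helper: resolve href against base, leaving absolute URLs untouched."""
    return urljoin(base, href)

relative = absolute_url('/news/some-article.html')      # made-up relative link
absolute = absolute_url('https://example.com/story')    # made-up absolute link
print(relative)
print(absolute)
```

Inside `parse_news`, the `'url'` entry could then be written as `absolute_url(news_url)` instead of the string concatenation.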

### 1.4 Save the extracted information to a CSV file

This is the last step of this section. We are going to use the Python library [`pandas`](https://pandas.pydata.org/docs/) to save the data in CSV format. Install and then import the pandas library.

In [None]:
!pip install pandas --upgrade --quiet

In [None]:
import pandas as pd

Now we will create one final function that uses all the previously created helper functions.<br>
The `get_page` function downloads the HTML page; we then pass the result to `get_news_tags` to get the list of `<div>` tags for the news.<br>
After that we use a [List Comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to parse each `<div>` tag using `parse_news`. The output is a `list` of `dictionaries`.<br>
Finally, we use `DataFrame` to create a pandas [dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and the `to_csv` method to store the required data in CSV format.
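
Before the full function, here is a small sketch of the last two steps in isolation: turning a list of dictionaries into a data frame and writing it as CSV. The records are made up and only mimic the shape returned by `parse_news`; an in-memory `StringIO` buffer stands in for the file path so that nothing is written to disk.

```python
import io
import pandas as pd

# Made-up records in the same shape that parse_news returns
records = [
    {'source': 'Example Wire', 'headline': 'Headline one', 'url': 'https://example.com/1',
     'content': 'First summary', 'image': 'https://example.com/1.png'},
    {'source': 'Example Wire', 'headline': 'Headline two', 'url': 'https://example.com/2',
     'content': 'Second summary', 'image': 'https://example.com/2.png'},
]

df = pd.DataFrame(records)        # dictionary keys become the column names

buffer = io.StringIO()            # in-memory stand-in for a file path
df.to_csv(buffer, index=False)    # index=False leaves out the row-number column
csv_text = buffer.getvalue()
print(csv_text)
```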

In [None]:
def scrape_yahoo_news(url, path=None):
    """Get the yahoo finance market news and write them to CSV file """
    if path is None:
        path = 'stock-market-news.csv'
        
    print('Requesting html page')
    doc = get_page(url)

    print('Extracting news tags')
    news_list = get_news_tags(doc)

    print('Parsing news tags')
    news_data = [parse_news(news_tag) for news_tag in news_list]

    print('Save the data to a CSV')
    news_df = pd.DataFrame(news_data)
    news_df.to_csv(path, index=None)
    
    # This return statement is optional; we return the data frame just to analyze the final output
    return news_df 

It's time to test the `scrape_yahoo_news` function 

In [None]:
YAHOO_NEWS_URL = BASE_URL+'/topic/stock-market-news/'
news_df = scrape_yahoo_news(YAHOO_NEWS_URL)

The "stock-market-news.csv" file should be available in the File --> Open menu. You can download the file or open it directly in the browser. Please verify the file content and compare it with the actual information available on the web page.

You can also check the data by grabbing a few rows from the data frame returned by the `scrape_yahoo_news` function.

In [None]:
news_df[:5]

**Summary** : Hopefully I was able to explain this simple but very powerful Python technique for scraping the Yahoo! Finance market news. These steps can be used to scrape any web page; you just have to do a little research to identify the required `<tags>` and use the relevant Python methods to collect the data.

## <u>2. Scrape Cryptocurrencies</u>

In the first section we were able to scrape the [Yahoo! market news](https://finance.yahoo.com/topic/stock-market-news/) web page. However, you may have noticed that more news appears at the bottom of the page as we scroll down. This is called dynamic page loading. The previous technique is a basic Python method that is useful for scraping static data. To scrape dynamically loading data we will use a different method: web scraping using **Selenium**. The goal of this section is to extract the top-listed [cryptocurrencies](https://finance.yahoo.com/cryptocurrencies) from Yahoo! Finance.

![](https://i.imgur.com/sF6k0Pk.jpg)


Here's an outline of the steps we'll follow<br>
**2.1 Introduction of selenium**<br>
**2.2 Downloads & Installation**<br>
**2.3 Install & Import libraries**<br>
**2.4 Create Web Driver**<br>
**2.5 Exploring and locating Elements**<br>
**2.6 Extract & Compile the information into python list**<br>
**2.7 Save the extracted information to a CSV file**<br>

### 2.1 Introduction of selenium

**[Selenium](https://www.selenium.dev/)** is an open-source tool for automating web browsers. It can be driven from Python and several other languages, and is used for testing as well as web scraping. Here we will use the Chrome browser, but you can try any supported browser.<br>

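**Why should you use Selenium?**<br>
Selenium drives a real browser, so it can perform interactions that a plain HTTP request cannot, such as: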
- Clicking on buttons
- Filling forms
- Scrolling
- Taking a screenshot

You can find detailed documentation on using Selenium with Python [here](https://selenium-python.readthedocs.io/)<br>

The following locator strategies help to find elements in a web page. Each one is passed to `find_elements`, which returns a list:
- `find_elements(By.NAME, value)`
- `find_elements(By.XPATH, value)`
- `find_elements(By.LINK_TEXT, value)`
- `find_elements(By.PARTIAL_LINK_TEXT, value)`
- `find_elements(By.TAG_NAME, value)`
- `find_elements(By.CLASS_NAME, value)`
- `find_elements(By.CSS_SELECTOR, value)`

In this tutorial we will use only `By.XPATH` and `By.TAG_NAME`. The older `find_elements_by_*` helper methods are deprecated in Selenium 4 and have been removed from recent releases. You can find complete documentation of the locators [here](https://selenium-python.readthedocs.io/locating-elements.html)

### 2.2 Downloads & Installation 

Unlike the previous section, here we have to do some prep work before we can use this method. We need to install Selenium and the proper web browser driver.<br>

If you are using the **Google Colab** platform, execute the following code to perform the initial installation. The expression `'google.colab' in str(get_ipython())` is used to detect the Google Colab platform.

In [None]:
if 'google.colab' in str(get_ipython()):
    print('Google CoLab Installation')
    !apt update --quiet
    !apt install chromium-chromedriver --quiet

To run it **locally** you will need the WebDriver for Chrome. You can download it from https://chromedriver.chromium.org/downloads and copy the file into the folder where you will create the Python file (no installation needed). Make sure that the driver's version matches the version of the Chrome browser installed on the local machine. Alternatively, the `webdriver-manager` package installed in the next step can download a matching driver automatically.
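
A note on the platform check used in this section: `get_ipython` is only defined inside IPython and Jupyter, so the expression raises a `NameError` if the same code is later moved into a plain Python script. A minimal sketch of a guard, keeping the same test as above, could look like this; the helper name `running_on_colab` is hypothetical.

```python
def running_on_colab():
    """Hypothetical helper: True when the IPython shell looks like Google Colab."""
    try:
        shell = get_ipython()      # defined only inside IPython/Jupyter
    except NameError:
        return False               # plain Python interpreter, so not Colab
    return 'google.colab' in str(shell)

on_colab = running_on_colab()
print('Running on Colab:', on_colab)
```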

### 2.3 Install & Import libraries

Now let's install the required libraries. Please note that some libraries are platform-specific.

In [None]:
print('library Installation')
if 'google.colab' not in str(get_ipython()):
    print('Not running on CoLab')
    !pip install webdriver-manager --upgrade --quiet
else:
    print('Running on CoLab')
    
!pip install selenium --upgrade --quiet
!pip install pandas --upgrade --quiet

Once the libraries are installed, the next step is to import all the required modules / libraries.

In [None]:
print('Library Import')
if 'google.colab' not in str(get_ipython()):
    print('Not running on CoLab')
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
else:
    print('Running on CoLab')

print('Common Library Import')
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd 
import time

All the necessary prep work is done, so let's move ahead and implement this method.

### 2.4 Create Web Driver

In this step we will first create an instance of the Chrome WebDriver using `webdriver.Chrome()`; the `driver.get()` method then navigates to the page given by the URL. Here too there is a slight variation based on the platform. We also pass some `options` arguments; for example, the `--headless` option runs the browser in the background without opening a window.

In [None]:
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
    def get_driver(url):
        colab_options = webdriver.ChromeOptions()
        colab_options.add_argument('--no-sandbox')
        colab_options.add_argument('--disable-dev-shm-usage')
        colab_options.add_argument('--headless')
        driver = webdriver.Chrome(options=colab_options)
        driver.get(url)
        return driver
else:
    print('Not running on CoLab')
    def get_driver(url):
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--headless')
        serv = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(options=chrome_options, service=serv)
        driver.get(url)
        return driver

Let's run the function `get_driver`.

In [None]:
driver = get_driver('https://finance.yahoo.com/cryptocurrencies')

### 2.5 Exploring and locating Elements

This step is almost the same as the one we did in section 1. We will try to identify relevant information like `<tags>`, `class`, `xpath` etc. on the web page. So let's right-click and select "Inspect" to do further analysis.

The web page shows the cryptocurrency information in tabular form. We can grab the table header using the `<th>` tag: we will use `find_elements` with `By.TAG_NAME` to get the table headers, which will become the column names of our output.

In [None]:
# Grab all table header cells and display the text of the first one
header = driver.find_elements(By.TAG_NAME, value= 'th')
header[0].text

In [None]:
rownum=2
txt=driver.find_element(By.XPATH, value="//tr[{}]/td[2]".format(rownum)).text
txt

In [None]:
def get_table_rows(driver):
    TABLE_CLASS = "W(100%)"  
    tablerows = len(driver.find_elements(By.XPATH, value="//table[@class= '{}']/tbody/tr".format(TABLE_CLASS)))
    return tablerows

In [None]:
#//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]

In [None]:
def get_table_header(driver):
    header = driver.find_elements(By.TAG_NAME, value= 'th')
    header_list = [item.text for index, item in enumerate(header) if index < 10]
    return header_list

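The following snippet shows how to read a single cell: the XPath selects the cell in row `rownum` and column `colnum` (both indices start at 1).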
```
rownum = 1
colnum = 3
driver.find_element(By.XPATH, value="//tr[{}]/td[{}]".format(rownum,colnum)).text
```
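
Since the XPath is just a string, it can be built and checked without a browser. The sketch below assembles the same kind of path that `parse_table_rows` uses further down; the helper name `cell_xpath` and the constant `TABLE_ROWS_XPATH` are hypothetical and only serve to illustrate the 1-based indexing.

```python
TABLE_ROWS_XPATH = '//*[@id="scr-res-table"]/div[1]/table/tbody/tr'

def cell_xpath(rownum, colnum):
    """Hypothetical helper: XPath of one table cell; both indices start at 1."""
    return '{}[{}]/td[{}]'.format(TABLE_ROWS_XPATH, rownum, colnum)

# XPaths of the first three cells in the first row
first_row = [cell_xpath(1, col) for col in range(1, 4)]
for xpath in first_row:
    print(xpath)
```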

In [None]:
def parse_table_rows(rownum, driver, header_list):
    row_dictionary = {}
    time.sleep(1/3)
    for index , item in enumerate(header_list):
        column_xpath = '//*[@id="scr-res-table"]/div[1]/table/tbody/tr[{}]/td[{}]'.format(rownum, index+1)
        row_dictionary[item] = driver.find_element(By.XPATH, value=column_xpath).text
    return row_dictionary

In [None]:
def parse_multiple_pages(driver, total_crypto):
    table_data = []
    page_num = 1
    is_scraping = True
    header_list = get_table_header(driver)

    while is_scraping:
        table_rows = get_table_rows(driver)
        print('Found {} rows on Page : {}'.format(table_rows, page_num))
        print('Parsing Page : {}'.format(page_num))
        table_data += [parse_table_rows(i, driver, header_list) for i in range (1, table_rows + 1)]
        total_count = len(table_data)
        print('Total rows scraped : {}'.format(total_count))
        if total_count >= total_crypto:
            print('Done Parsing..')
            is_scraping = False
        else:    
            print('Clicking Next Button')
            element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
            element.click() 
            page_num += 1
    return table_data
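
The loop above stops once it has collected at least `total_crypto` rows, so it can return more rows than requested when the page size does not divide the total evenly. The following sketch shows the same control flow with two extra safeguards: trimming the result and stopping when a page comes back empty. It is not the author's implementation; `collect_rows` and `fetch_page` are hypothetical names, and `fetch_page` stands in for "parse the current page, then click Next".

```python
def collect_rows(fetch_page, total_wanted):
    """Hypothetical sketch: gather rows page by page, never returning more than total_wanted.

    fetch_page(page_num) must return a list of rows, empty when there are no more pages.
    """
    rows = []
    page_num = 1
    while len(rows) < total_wanted:
        page_rows = fetch_page(page_num)
        if not page_rows:          # no more data: stop instead of looping forever
            break
        rows.extend(page_rows)
        page_num += 1
    return rows[:total_wanted]     # trim any overshoot from the last page


# Fake data source: 3 pages of 25 made-up symbols each
fake_pages = {n: ['SYM{}'.format((n - 1) * 25 + i) for i in range(25)] for n in (1, 2, 3)}
result = collect_rows(lambda n: fake_pages.get(n, []), 60)
short = collect_rows(lambda n: fake_pages.get(n, []), 100)
print(len(result), len(short))
```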

In [None]:
def scrape_yahoo_crypto(url, total_crypto, path=None):
    """Get the list of yahoo finance crypto-currencies and write them to CSV file """
    if path is None:
        path = 'crypto-currencies.csv'
    
    print('Creating driver')
    driver = get_driver(url)    
    
    table_data = parse_multiple_pages(driver, total_crypto)
            
    driver.quit()  # end the browser session and the driver process
    
    print('Save the data to a CSV')
    table_df = pd.DataFrame(table_data)
    #print(table_df)
    table_df.to_csv(path, index=None)
    
    # This return statement is optional; we return the data frame just to analyze the final output
    return table_df 

In [None]:
YAHOO_FINANCE_URL = BASE_URL+'/cryptocurrencies'
TOTAL_CRYPTO = 100
crypto_df = scrape_yahoo_crypto(YAHOO_FINANCE_URL, TOTAL_CRYPTO,'crypto-currencies.csv')

In [None]:
crypto_df

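## <u>3. Scrape Market Events Calendar</u>

In this section we use the third technique: instead of parsing HTML tags, we read the data that the [Market Events Calendar](https://finance.yahoo.com/calendar) page already contains in `json` format inside a `<script>` tag. The script can be run for any single date and market event (**Earnings**, **Stock Splits**, **Economic Events** or **IPO**) and stores the result in a `CSV` file. If no data is found, the file contains only the column headers.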

In [None]:
import re
import json
from io import StringIO
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
def get_event_page(scraper_url):
    """Download a calendar page with a browser-like User-Agent and return a beautiful soup doc"""
    headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(scraper_url, headers=headers)
    if not response.ok:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + scraper_url)
    
    # Construct a beautiful soup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [None]:
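def get_json_dictionary(doc):
    """Return the JSON embedded in the page, both as a parsed dictionary and as raw text"""
    # Locate the <script> tag whose content contains the marker " -- Data -- "
    pattern = re.compile(r'\s--\sData\s--\s')
    script_data = doc.find('script', string=pattern).text
    
    # The JSON object starts two characters before the word 'context' (at '{"')
    # and is followed by 12 trailing characters that are not part of it
    start  = script_data.find('context')-2
    json_dictionary  = script_data[start:-12]
    
    # Convert the JSON text into a Python dictionary
    parsed_dictionary = json.loads(json_dictionary)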
    
    return parsed_dictionary, json_dictionary    
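
To see what the slicing in `get_json_dictionary` does, here is a sketch on a made-up string that only imitates the layout the function relies on: a marker comment, the JSON object, and a 12-character tail. The real page content is much larger and may differ. The second half shows an alternative that is not the author's method: `json.JSONDecoder().raw_decode` parses the object starting at a given index and reports where it ends, so the length of the tail does not need to be hard-coded.

```python
import json
import re

# Made-up script text that only imitates the layout the slicing relies on
script_data = (
    '(function (root) {\n/* -- Data -- */\n'
    'root.App.main = {"context": {"dispatcher": {"stores": {}}}};\n}(this));\n'
)

# The marker that is used to pick the right <script> tag
marker_found = re.search(r'\s--\sData\s--\s', script_data) is not None

start = script_data.find('context') - 2        # step back over the two characters {"
sliced = json.loads(script_data[start:-12])    # fixed-length tail, as in the function above

# Alternative: let the decoder find where the JSON object ends
decoded, end = json.JSONDecoder().raw_decode(script_data, start)
tail = script_data[end:]                       # whatever follows the JSON object
print(marker_found, sliced == decoded, len(tail))
```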

In [None]:
# this function is useful during analysis 
def create_json_file(file_name,json_dictionary):
    """Write the raw JSON text to <file_name>.json so that it can be inspected"""
    with open(file_name+'.json', 'w') as file:
        file.write(json_dictionary)

In [None]:
def get_columns_and_total_rows(parsed_dictionary):
    # 'total' is the number of rows across all pages; 'columns' describes the table columns
    total_rows = parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['total']
    column_dictionary = parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['columns']
    return total_rows, column_dictionary

In [None]:
def get_page_rows(parsed_dictionary):
    data_dictionary = parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['rows']
    return len(data_dictionary), data_dictionary
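
Both functions above walk the same path through the nested dictionary. As a sketch, that path could be factored into one helper so that it lives in a single place. The helper name `get_results`, the constant `RESULTS_PATH` and the sample data are hypothetical; the sample only mimics the keys used above.

```python
RESULTS_PATH = ('context', 'dispatcher', 'stores', 'ScreenerResultsStore', 'results')

def get_results(parsed_dictionary):
    """Hypothetical helper: follow RESULTS_PATH down the nested dictionary."""
    node = parsed_dictionary
    for key in RESULTS_PATH:
        node = node[key]       # raises KeyError if the page layout has changed
    return node

# Made-up data in the shape the functions above expect
sample = {'context': {'dispatcher': {'stores': {'ScreenerResultsStore': {'results': {
    'total': 2,
    'columns': [{'data': 'ticker', 'content': 'Symbol'}],
    'rows': [{'ticker': 'AAA'}, {'ticker': 'BBB'}],
}}}}}}

results = get_results(sample)
total_rows, page_rows = results['total'], len(results['rows'])
print(total_rows, page_rows)
```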

In [None]:
def scrape_all_pages(event_type, date):
    YAHOO_CAL_URL = BASE_URL+'/calendar/{}?day={}&offset={}&size={}'
    page_size = '100' # this indicates max rows per page 
    page_number = 1
    pagewise_rows = 0
    final_data_dictionary = []
    
    while page_number > 0:
        # The offset of the first page is 0; each following page starts at a multiple of page_size (100, 200, ...)
        print("Processing page # {}".format(page_number))
        page_url = str((page_number - 1 ) * int(page_size))
        scrape_url = YAHOO_CAL_URL.format(event_type, date, page_url, page_size)
        print("Scrape url for page {} is {}".format(page_number,scrape_url))
        page_doc = get_event_page(scrape_url)
        parse_dict, json_dict = get_json_dictionary(page_doc)
        total_row, column_dict = get_columns_and_total_rows(parse_dict)        
        page_rows , data_dict = get_page_rows(parse_dict)
        print("total rows for page {} : {}".format(page_number,page_rows))
        pagewise_rows += page_rows
        final_data_dictionary += data_dict
        if pagewise_rows >= total_row:
            page_number = 0
            return final_data_dictionary, column_dict
        page_number += 1
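
The paging arithmetic in `scrape_all_pages` boils down to this: page `n` starts at offset `(n - 1) * page_size`. The sketch below lists the offsets needed for a given total and plugs them into the same URL pattern; `page_offsets` is a hypothetical helper, and the event type and date are just sample values.

```python
def page_offsets(total_rows, page_size=100):
    """Hypothetical helper: offsets needed to cover total_rows (always at least the first page)."""
    return list(range(0, max(total_rows, 1), page_size))

offsets = page_offsets(250)
urls = ['/calendar/earnings?day=2022-02-28&offset={}&size=100'.format(offset)
        for offset in offsets]
for url in urls:
    print(url)
```

Note that in the function above the total only becomes known after the first page has been downloaded, which is why it uses a `while` loop rather than a precomputed list.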

In [None]:
def get_dataframe_layout(column_dictionary):
    """Return the data keys and the display headers of the table columns as two lists"""
    csv_columns = []
    csv_columns_header = []
    for i in range(len(column_dictionary)):
        csv_columns.append(column_dictionary[i]['data'])
        csv_columns_header.append(column_dictionary[i]['content'])
    return csv_columns, csv_columns_header  
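
Each column description carries the key used in the row dictionaries (`'data'`) and the header shown on the page (`'content'`). An equivalent sketch using list comprehensions is shown below; `split_columns` is a hypothetical name and the sample column descriptions are made up.

```python
def split_columns(column_dictionary):
    """Hypothetical equivalent of get_dataframe_layout using list comprehensions."""
    keys = [column['data'] for column in column_dictionary]
    headers = [column['content'] for column in column_dictionary]
    return keys, headers

# Made-up column descriptions in the shape the function above expects
columns = [
    {'data': 'ticker', 'content': 'Symbol'},
    {'data': 'companyshortname', 'content': 'Company'},
]
keys, headers = split_columns(columns)
rename_map = dict(zip(keys, headers))   # maps each data key to its display header
print(rename_map)
```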

In [None]:
def create_csv(data_dictionary, filter_columns, filter_columns_header, file_name):
    """Write the rows to a CSV file; if there are no rows, write only the column headers"""
    if len(data_dictionary) > 0:
        scraped_df = pd.DataFrame(data_dictionary)
        scraped_df.to_csv(file_name,columns=filter_columns,header=filter_columns_header,index=False)
    else:
        scraped_df=pd.DataFrame(columns=filter_columns_header)
        scraped_df.to_csv(file_name,index=False)

In [None]:
def scrape_yahoo_calendar(event_type, date_param, path=None):
    """Get the list of yahoo finance calendar and write them to CSV file """
    if path is None:
        path = event_type+'_'+date_param+'.csv'
    
    data_dict, column_dict = scrape_all_pages(event_type, date_param)
    csv_columns, csv_header = get_dataframe_layout(column_dict)
    create_csv(data_dict, csv_columns, csv_header, path)

In [None]:
BASE_URL = 'https://finance.yahoo.com' #Global Variable 

# Supported event types: 'earnings', 'splits', 'economic', 'ipo'
event_type = 'earnings'
# Date to scrape, in YYYY-MM-DD format
date_param = '2022-02-28'


scrape_yahoo_calendar(event_type, date_param)
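
The cell above hard-codes the event type and the date. As a possible refinement, the sketch below validates the event type and falls back to today's date when none is passed. `resolve_params` and `EVENT_TYPES` are hypothetical names; the four event types are the ones used in this notebook.

```python
from datetime import date

EVENT_TYPES = ('earnings', 'splits', 'economic', 'ipo')

def resolve_params(event_type, date_param=None):
    """Hypothetical helper: validate the event type and default the date to today."""
    if event_type not in EVENT_TYPES:
        raise ValueError('Unknown event type: {}'.format(event_type))
    if date_param is None:
        date_param = date.today().isoformat()   # YYYY-MM-DD, same format as '2022-02-28'
    return event_type, date_param

event, day = resolve_params('ipo')
print(event, day)
```

The result could then be passed on as `scrape_yahoo_calendar(event, day)`.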

In [None]:
jovian.commit(project="yahoo-finance-web-scraper")

## References and Future Work

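Summary of what we did

- Scraped the stock market news using `requests`, `BeautifulSoup` and `HTML tags`
- Scraped the cryptocurrencies table from a dynamically loading page using `Selenium`
- Scraped the market events calendar using data already available in `json` format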


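References

- https://htmldog.com/guides/html/
- https://docs.python-requests.org/en/latest/
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://selenium-python.readthedocs.io/
- https://pandas.pydata.org/docs/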
 
**Ideas for future work**<br>
- Automate this process using [AWS Lambda](https://aws.amazon.com/lambda/) to download the daily market calendar, crypto-currencies & market news in CSV format.
- Move the old files into an Archive folder and append a date-stamp to the archived files.
- Delete the files older than 2 weeks.
- Process the raw data extracted with the third technique using different pandas methods.