# Webscraping Yahoo Finance
### Introduction
In this notebook, we shall learn how to scrape websites for information using Python. 


The general steps for scraping a website are:
1. Identify the target website.
2. Learn how the website constructs its URLs so that we can retrieve the desired page.
3. Programmatically retrieve pages using the URL pattern we reverse-engineered.
4. Use libraries to traverse the page and extract the information we need.
5. Organise the information and return it in the desired format.

In [1]:
# Libraries used in this notebook
import requests
from bs4 import BeautifulSoup
from time import sleep
import json
import argparse
from collections import OrderedDict

## Step 1. Identify target website

Our target website is Yahoo Finance, and the information we want is the stock data it publishes.

Before we start scraping, have a look at what a typical stock information page looks like:

__[Apple - AAPL](https://finance.yahoo.com/quote/AAPL?p=AAPL)__ : https://finance.yahoo.com/quote/AAPL?p=AAPL


## Step 2. Deconstruct URL

We don't want to type each stock into the search box by hand. Instead, by studying the URL, we can work out its structure and tweak it to reach the page for any stock we want.


For example, look at the URL in **Step 1**. Apple's ticker symbol is **AAPL**, and **AAPL** appears twice in the URL: once in the path and once in the `p` query parameter.

Perhaps, if we replace **AAPL** in the URL with any other ticker symbol, we will get the stock information page for that stock.

For example, try __[Google - GOOG](https://finance.yahoo.com/quote/GOOG?p=GOOG)__ : https://finance.yahoo.com/quote/GOOG?p=GOOG
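
To make the structure concrete, here is a minimal sketch that splits the Apple URL into its parts with the standard library's `urllib.parse`. The variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://finance.yahoo.com/quote/AAPL?p=AAPL"
parts = urlsplit(url)

# The last segment of the path holds the ticker symbol...
path_ticker = parts.path.rsplit("/", 1)[-1]
# ...and so does the "p" query parameter.
query_ticker = parse_qs(parts.query)["p"][0]

print(parts.netloc)   # finance.yahoo.com
print(path_ticker)    # AAPL
print(query_ticker)   # AAPL
```

Both occurrences carry the same ticker symbol, which is why swapping it in both places is a reasonable first guess.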


In [2]:
def construct_url(ticker):
    # Both placeholders are filled with the same ticker symbol
    base_url = "http://finance.yahoo.com/quote/{0}?p={0}"
    url = base_url.format(ticker)
    return url
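
Some ticker symbols contain characters that are not safe to place in a URL as-is, such as the `^` in index symbols. The sketch below is a hypothetical variant of `construct_url` that percent-encodes the ticker first. The name `construct_url_quoted` and the use of `https` are assumptions for illustration, and whether the site accepts every encoded symbol is not verified here.

```python
from urllib.parse import quote

def construct_url_quoted(ticker):
    # Hypothetical helper: percent-encode the ticker before inserting it,
    # so that characters such as "^" do not end up raw in the URL.
    base_url = "https://finance.yahoo.com/quote/{0}?p={0}"
    return base_url.format(quote(ticker, safe=""))

print(construct_url_quoted("AAPL"))
# https://finance.yahoo.com/quote/AAPL?p=AAPL
print(construct_url_quoted("^GSPC"))
# https://finance.yahoo.com/quote/%5EGSPC?p=%5EGSPC
```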






## Step 3. Retrieve website

Now that we know which URL to go to, we can retrieve the webpage programmatically. 

To do so, we use a Python library called **requests**, which we imported above.

After retrieving the webpage, we pass it to another library, **BeautifulSoup**, which parses the HTML and greatly facilitates our web scraping.

In [3]:
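def retrieve_website(url):
    # Download the page and keep its HTML as text
    response = requests.get(url)
    html = response.text
    # Parse the HTML so that we can search it by tag and class
    parsed_html = BeautifulSoup(html, 'html.parser')
    return parsed_html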
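
The function above assumes the request always succeeds. As a rough sketch of a more defensive variant, the version below sets a timeout, raises an error on HTTP failure codes, and sends a `User-Agent` header. The function name, the header value and the default timeout are assumptions for illustration; whether Yahoo Finance needs a particular header is not something this notebook verifies.

```python
import requests
from bs4 import BeautifulSoup

def retrieve_website_checked(url, timeout=10):
    # Hypothetical header value; some sites reject requests that do not
    # look like they come from a browser.
    headers = {"User-Agent": "Mozilla/5.0 (tutorial scraper)"}
    # Give up if the server does not respond within `timeout` seconds.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

It can be called in place of `retrieve_website`, for example `retrieve_website_checked(construct_url("AAPL"))`.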

## Step 4. Retrieve desired information

This is probably the hardest step.

We need to look at how a website is constructed, and use its tags and attributes to reliably retrieve the information we need.


### Chrome Developer Tools
To do so, we can open the **Chrome Developer Tools**.

On the Yahoo Finance page:
1. Hover the mouse over any element you would like to know more about.
2. Right-click on it.
3. Select **Inspect**.

A side panel should pop up showing the raw HTML. As you hover your mouse over the HTML elements, the corresponding part of the page is highlighted.



### Example
In the code below, we use the `find` method to look for an HTML element with *tag* `span` and *class* `Fz(36px)`.

We then retrieve the textual information nested inside this element with `getText()`.
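
To see how `find` and `getText()` behave without touching the network, here is a small sketch on a hand-written HTML snippet. The snippet only imitates the kind of markup described above; its other class names and its text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet; the real page is far larger and may differ.
sample_html = """
<div>
  <span class="Trsdu(0.3s) Fw(b) Fz(36px)">156.875</span>
  <span class="Fz(12px)">At close</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches an element if any one of its classes equals the string.
price_element = soup.find("span", class_="Fz(36px)")
print(price_element.getText())  # 156.875

# find returns None when nothing matches, so calling getText() on the
# result would raise AttributeError. Check for None first.
missing = soup.find("span", class_="Fz(99px)")
print(missing)  # None
```

Because class names like these are generated by the site's styling, they can change without notice, so a `None` check is a sensible habit.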



### Your turn!
Now, try to identify and scrape more financial data.

Here's a list for you to try:
1. Beta
2. PE Ratio (TTM)
3. EPS (TTM)
4. Earnings Date
5. Dividend & Yield

In [4]:
def get_information(ticker, url, website):

    ticker_price = website.find("span", class_="Fz(36px)").getText()
    ######## Add your web scraping code here to scrape more information ############



    ################################################################################
    ######## Remember to add the information to the summary_data dict below ########
    summary_data = {
        'ticker': ticker,
        'ticker_price': ticker_price,
        'url': url
    }
    return summary_data
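
One possible approach to the exercise is to look up a value by the label shown next to it. The sketch below does this on a made-up table. The table layout, the values and the helper name `find_value_by_label` are all assumptions for illustration; inspect the real page to find the tags and classes it actually uses.

```python
from bs4 import BeautifulSoup

# Made-up markup and values, standing in for a summary table.
sample_html = """
<table>
  <tr><td>Beta</td><td>1.25</td></tr>
  <tr><td>PE Ratio (TTM)</td><td>17.80</td></tr>
</table>
"""

def find_value_by_label(soup, label):
    # Walk every table row and return the text of the second cell
    # when the first cell's text equals the label.
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 2 and cells[0].getText(strip=True) == label:
            return cells[1].getText(strip=True)
    # No row carried this label.
    return None

soup = BeautifulSoup(sample_html, "html.parser")
print(find_value_by_label(soup, "Beta"))            # 1.25
print(find_value_by_label(soup, "PE Ratio (TTM)"))  # 17.80
print(find_value_by_label(soup, "EPS (TTM)"))       # None
```

Looking up by visible label tends to survive styling changes better than matching on generated class names, at the cost of breaking if the label text changes.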

## Putting it all together

The function below puts everything together. Try running it!

In [8]:
def get_ticker_data(ticker):

    url = construct_url(ticker)
    print("getting from URL: " + url)
    website = retrieve_website(url)

    summary_data = get_information(ticker, url, website)

    return summary_data


print(get_ticker_data("AAPL"))

getting from URL: http://finance.yahoo.com/quote/AAPL?p=AAPL
{'ticker': 'AAPL', 'ticker_price': '156.875', 'url': 'http://finance.yahoo.com/quote/AAPL?p=AAPL'}
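
To scrape several tickers in one go, it is polite to pause between requests. The sketch below shows one way to do that with `sleep` and to print the results as JSON. The helper name `scrape_many`, its parameters and the stand-in `fake_fetch` are assumptions for illustration. In the notebook you would pass `get_ticker_data` as the `fetch` argument and use a non-zero delay.

```python
import json
from time import sleep

def scrape_many(tickers, fetch, delay_seconds=1.0):
    # Call fetch(ticker) for each ticker, pausing between calls so that
    # we do not send requests back to back.
    results = []
    for index, ticker in enumerate(tickers):
        if index > 0:
            sleep(delay_seconds)
        results.append(fetch(ticker))
    return results

def fake_fetch(ticker):
    # Stand-in for get_ticker_data so this sketch runs offline.
    return {"ticker": ticker, "ticker_price": None}

data = scrape_many(["AAPL", "GOOG"], fake_fetch, delay_seconds=0)
print(json.dumps(data, indent=2))
```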
