# Web Scraping Yahoo! Finance with scrapy

In this tutorial, we will use the **scrapy** module in Python to scrape data from *Yahoo! Finance*.

We will illustrate using Ford's summary page on *Yahoo! Finance* ('https://finance.yahoo.com/quote/F').

#### Load HTML using the requests Module

Before we can use **scrapy** to parse the page, we must first load its HTML into our Python program. We can do so using the **requests** module. Let's load the HTML of Ford's Yahoo! Finance page into a variable called **html**:

**UPDATE:** Yahoo! Finance appears to recognize when its site is accessed by a bot, as in this code. One workaround is to pass headers to **requests.get** so that the request resembles one from a real browser. The *headers* variable below is a dictionary with a single **User-Agent** entry, a string that identifies the client as a browser (here, Chrome on Windows). To use it, add **headers=headers** to the **requests.get** call, as done below.

In [None]:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' } 
html = requests.get('https://finance.yahoo.com/quote/F', headers=headers).text 

print(html[0:1000])
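
A blocked or failed request still returns a response object, so it can help to check the status before parsing the HTML. The following is a minimal sketch, not part of the original tutorial: `fetch_html` is a hypothetical helper, and its `get` parameter is an illustration-only hook that lets the function be tried with a stand-in instead of a live request. The 10-second timeout is an arbitrary assumption.

```python
import requests

# Same browser-like User-Agent string used in the tutorial.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

def fetch_html(url, get=requests.get, timeout=10):
    """Return the page HTML, or raise if the server reports an error."""
    resp = get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return resp.text
```

With a network connection, `html = fetch_html('https://finance.yahoo.com/quote/F')` would play the same role as the `requests.get(...).text` line above, but would fail loudly if the site rejected the request.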

Let's now use the **Selector** class from the **scrapy** module to parse the HTML of our website.

In [None]:
from scrapy.selector import Selector

response = Selector(text=html)

Just as in our simple examples in the previous tutorial, we can now use **xpath** functions on the **response** variable to extract data from our website. For example, let's extract the title of our website.

In [None]:
response.xpath('//title/text()').extract_first()

Now, let's say we want to extract the 'Market Cap' from the website. The simplest way to do so is to identify the xpath of the 'Market Cap' value within our HTML. Thankfully, our web browser (e.g., Chrome) has a simple built-in way to identify the xpath of the objects that you see on the web page: right-click the value, choose *Inspect*, then right-click the highlighted element in the Elements panel and choose *Copy* > *Copy XPath*.

In [None]:
mktcap = response.xpath('//*[@id="nimbus-app"]/section/section/section/article/div[3]/ul/li[9]/span[2]/fin-streamer').extract_first()
print(mktcap)

To extract only the text and not the tags surrounding the text, we can modify the xpath as follows:

In [None]:
mktcap = response.xpath('//*[@id="nimbus-app"]/section/section/section/article/div[3]/ul/li[9]/span[2]/fin-streamer/text()').extract_first()
print(mktcap)

#### Scrape Data for Multiple Companies

Let's now create a simple function to obtain the market cap for any given ticker. Then we can use that function to obtain data for any given list of tickers.

In [None]:
import pandas as pd

def get_mktcap(ticker):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' } 
    html = requests.get('https://finance.yahoo.com/quote/'+ticker, headers=headers).text
    response = Selector(text=html)
    mktcap = response.xpath('//*[@id="nimbus-app"]/section/section/section/article/div[3]/ul/li[9]/span[2]/fin-streamer/text()').extract_first()
    return mktcap

# List of tickers to obtain
tickers = ['F','AAPL','MSFT','AMZN']

# Initialize a new pandas DataFrame
df = pd.DataFrame(columns = ['ticker','mktcap'])

# Iterate through list of tickers and save mktcap to our df DataFrame
for ticker in tickers:
    mktcap = get_mktcap(ticker)
    df = pd.concat([df, pd.DataFrame({'ticker': [ticker], 'mktcap': [mktcap]})], ignore_index=True)
    
# Print the df DataFrame
df
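
The values stored in **mktcap** are strings, or `None` when the xpath matches nothing, so they cannot be sorted or summed as numbers. The sketch below shows one way to convert them, assuming the scraped text looks like `'48.5B'` or `'3.1T'`; that format and the helper name `parse_abbreviated` are assumptions for illustration.

```python
# Multipliers for the suffixes assumed to appear in the scraped strings.
SUFFIXES = {'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12}

def parse_abbreviated(value):
    """Convert a string such as '48.5B' to a float; return None if it cannot be parsed."""
    if value is None:
        return None
    text = value.strip().replace(',', '')
    if not text:
        return None
    multiplier = 1.0
    if text[-1].upper() in SUFFIXES:
        multiplier = SUFFIXES[text[-1].upper()]
        text = text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        # Placeholder text such as 'N/A' ends up here.
        return None
```

Applied to the DataFrame above, `df['mktcap'].apply(parse_abbreviated)` would give a numeric column, with missing entries wherever a value could not be parsed.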

#### Exercise -- Practice Using scrapy and xpath

1. Obtain the 'Previous Close' for Ford listed on Ford's Yahoo! Finance Summary page.
2. Obtain the '1y Target Est' for Ford listed on Ford's Yahoo! Finance Summary page.
3. Create a function to obtain the previous closing price and one-year target estimate for the following tickers: 'AMZN', 'FB', 'V', 'HD', and 'KO'. Create a new pandas DataFrame with the following columns: **ticker**, **close**, **target_est** and add data for these companies to the DataFrame.

#### Solution for # 1

In [None]:
html = requests.get('https://finance.yahoo.com/quote/F', headers=headers).text 
response = Selector(text=html)

In [None]:
close = response.xpath('//*[@id="nimbus-app"]/section/section/section/article/div[3]/ul/li[1]/span[2]/fin-streamer/text()').extract_first()
print(close)

#### Solution for # 2

In [None]:
target_est = response.xpath('//*[@id="nimbus-app"]/section/section/section/article/div[3]/ul/li[16]/span[2]/fin-streamer/text()').extract_first()
print(target_est)

#### Solution for # 3

In [None]:
def get_data(ticker):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' } 
    html = requests.get('https://finance.yahoo.com/quote/'+ticker, headers=headers).text
    response = Selector(text=html)
    close = response.xpath('//*[@id="nimbus-app"]/section/section/section/article/div[3]/ul/li[1]/span[2]/fin-streamer/text()').extract_first()
    target_est = response.xpath('//*[@id="nimbus-app"]/section/section/section/article/div[3]/ul/li[16]/span[2]/fin-streamer/text()').extract_first()
    return close,target_est

# List of tickers to obtain
tickers = ['AMZN','FB','V','HD','KO']

# Initialize a new pandas DataFrame
df = pd.DataFrame(columns = ['ticker','close','target_est'])

# Iterate through list of tickers and save close and target_est to our df DataFrame
for ticker in tickers:
    close,target_est = get_data(ticker)
    df = pd.concat([df, pd.DataFrame({'ticker':[ticker], 'close':[close], 'target_est':[target_est]})], ignore_index=True)

# Print the df DataFrame
df
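
The loop above grows the DataFrame with **pd.concat** on every iteration. An alternative, sketched here under stated assumptions, is to collect one dictionary per ticker and build the DataFrame once at the end. `build_table` and `fake_get_data` are hypothetical names; the stand-in function replaces **get_data** only so that the sketch runs without a network connection.

```python
import pandas as pd

def build_table(tickers, fetch):
    """Call fetch(ticker) for each ticker and assemble the results in one DataFrame."""
    rows = []
    for ticker in tickers:
        close, target_est = fetch(ticker)
        rows.append({'ticker': ticker, 'close': close, 'target_est': target_est})
    return pd.DataFrame(rows, columns=['ticker', 'close', 'target_est'])

# Stand-in for get_data so the sketch runs offline.
def fake_get_data(ticker):
    return '10.00', None  # None mimics an xpath that matched nothing

table = build_table(['F', 'KO'], fake_get_data)
print(table)
```

With the tutorial's own function, the call would be `build_table(tickers, get_data)`.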