Before starting, you'll need to install these packages. This can be done by running <em>pip install package_to_install</em> in a terminal. For example, to install pandas, run <em>pip install pandas</em>. Note that the bs4 package is installed with <em>pip install beautifulsoup4</em>.

In [1]:
import requests  # this package allows us to 'request' data from a site in the form of HTML
import bs4  # this package, known as BeautifulSoup, helps with parsing the HTML you get from a site
import pandas as pd  # this package works with tabular data

<b>Using pd.read_html</b>

Before we start webscraping, we should make sure that there isn't already an easier way to get the data we need. pandas has a function called read_html, which is very good at taking tabular data from a website and converting it into a dataframe.
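
As a small offline illustration of what read_html does, the sketch below parses a made-up HTML table from a string instead of a live page. The table contents and the names sample_html and sample_df are assumptions for illustration only, and the example assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
import io

import pandas as pd

# A made-up HTML snippet standing in for a web page (illustrative only).
sample_html = """
<table>
  <thead>
    <tr><th>Symbol</th><th>Name</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>AAA</td><td>Example Company A</td><td>10.5</td></tr>
    <tr><td>BBB</td><td>Example Company B</td><td>20.25</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one dataframe per <table> it finds,
# so we index into the list to get the table we want.
tables = pd.read_html(io.StringIO(sample_html))
sample_df = tables[0]
print(len(tables))
print(sample_df)
```

If a page contains no table tags in the HTML it sends back, read_html raises a ValueError instead of returning an empty list.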

In [4]:
url = 'https://finance.yahoo.com/most-active'

In [5]:
df1 = pd.read_html(url)
df1 = df1[0]  # because read_html returns a list of dataframes

In [None]:
df1.head()  # first 5 rows of the dataframe

Note that we lose the information in the 52 Week Range column (it comes back empty), since it isn't easily parsed.

In [5]:
df1.shape  # how big the dataframe is: (# rows, # columns)

(100, 10)

We don't want just the first 100 rows; we want all of the data. After some poking around on the website (mainly clicking the next button and watching how the URL changes), we can see that the count and offset parameters control which rows are shown, which lets us retrieve the rest of the data.

In [6]:
# if there were more than three pages, we would instead use a loop to go over count and offset
df2 = pd.read_html('https://finance.yahoo.com/most-active?count=100&offset=100')
df2 = df2[0]
df3 = pd.read_html('https://finance.yahoo.com/most-active?count=100&offset=200')
df3 = df3[0]

In [7]:
total_df = pd.concat([df1, df2, df3], axis = 0)  # putting all the dataframes together
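
The comment above mentions using a loop when there are more pages. A minimal sketch of that idea follows. The helper names build_page_urls, read_first_table, and read_all_pages are hypothetical, and the sketch assumes the first page also accepts offset=0 and that the number of pages is known in advance.

```python
import pandas as pd


def build_page_urls(base_url, count, n_pages):
    """Return one URL per page, stepping the offset by `count` rows each time."""
    return [f"{base_url}?count={count}&offset={count * page}" for page in range(n_pages)]


def read_first_table(url):
    """Read the first table on a page (this makes a network request)."""
    return pd.read_html(url)[0]


def read_all_pages(urls, read_table=read_first_table):
    """Read one dataframe per URL and stack them into a single dataframe."""
    frames = [read_table(url) for url in urls]
    # ignore_index=True renumbers the rows from 0 instead of restarting at 0 on each page
    return pd.concat(frames, axis=0, ignore_index=True)


urls = build_page_urls('https://finance.yahoo.com/most-active', count=100, n_pages=3)
print(urls)
# total_df = read_all_pages(urls)  # uncomment to actually fetch the pages
```

Passing the table reader in as an argument keeps the looping logic separate from the network call, which makes it easy to try out with a stand-in function.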

In [8]:
total_df.head()

Unnamed: 0,Symbol,Name,Price (Intraday),Change,% Change,Volume,Avg Vol (3 month),Market Cap,PE Ratio (TTM),52 Week Range
0,GE,General Electric Company,5.49,-0.21,-3.68%,128.486M,112.792M,48.022B,,
1,F,Ford Motor Company,4.9,0.01,+0.20%,80.545M,100.421M,19.487B,,
2,AMD,"Advanced Micro Devices, Inc.",54.2,-0.31,-0.57%,66.951M,83.163M,63.479B,127.23,
3,BAC,Bank of America Corporation,21.44,-0.27,-1.24%,64.484M,93.075M,186.005B,8.72,
4,WFC,Wells Fargo & Company,23.36,-0.7,-2.91%,51.617M,45.707M,95.776B,8.08,


In [9]:
total_df.shape

(265, 10)

Now total_df contains all 265 rows from Yahoo's most-active list.

<b>Webscraping with BeautifulSoup (bs4)</b>

Webscraping with BeautifulSoup generally follows these steps:
1. Request HTML from a website using requests. You will get a response object that contains HTML.
2. Create a soup object with the HTML you got.
3. Look through the HTML to find where your data is and see if there are patterns in the way your data is stored in the HTML.
4. Use BeautifulSoup to exploit those patterns and extract your data.
5. Store your data somewhere nice, like a csv file or a pandas dataframe.
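
The steps above can be sketched end to end on a tiny made-up page. Everything in the HTML string below (the tags, the class names, the company names) is an assumption for illustration, and step 1 is replaced by a hard-coded string so the example runs without a network connection.

```python
import bs4
import pandas as pd

# Step 1 (stand-in): normally this would be requests.get(url).text;
# here a made-up page is used instead.
html = """
<html><body>
  <ul>
    <li class="stock" data-symbol="AAA">Example Company A</li>
    <li class="stock" data-symbol="BBB">Example Company B</li>
    <li class="ad">Not a stock</li>
  </ul>
</body></html>
"""

# Step 2: create a soup object from the HTML.
soup_demo = bs4.BeautifulSoup(html, 'html.parser')

# Steps 3 and 4: the data sits in <li> tags with class "stock", so select only those.
stock_tags = soup_demo.find_all('li', attrs={'class': 'stock'})
rows = [{'Symbol': tag['data-symbol'], 'Name': tag.text} for tag in stock_tags]

# Step 5: store the result in a pandas dataframe.
demo_df = pd.DataFrame(rows)
print(demo_df)
```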

In [10]:
# getting a response from a site
url = 'https://finance.yahoo.com/most-active'
response = requests.get(url)

Notice that the response is raw HTML, which is hard to read. This is why we use BeautifulSoup to help us parse it.

In [11]:
response.text[:1000]  # displaying the first 1000 characters of the response we got

'<!DOCTYPE html><html id="atomic" class="NoJs featurephone" lang="en-US"><head prefix="og: http://ogp.me/ns#"><script>window.performance && window.performance.mark && window.performance.mark(\'PageStart\');</script><meta charset="utf-8"/><title>Most Active Stocks Today - Yahoo Finance</title><meta name="keywords" content="Stock Screener, industry, index membership, share data, stock price, market cap, beta, sales, profitability, valuation ratios, analyst estimates, large cap value, bargain growth, preset stock screens"/><meta http-equiv="x-dns-prefetch-control" content="on"/><meta property="twitter:dnt" content="on"/><meta property="fb:app_id" content="90376669494"/><meta name="theme-color" content="#400090"/><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="description" lang="en-US" content="See the list of the most active stocks today, including share price change and percentage, trading volume, intraday highs and lows, and day charts."/><meta name="oat

In [13]:
soup = bs4.BeautifulSoup(response.text, 'html.parser')  # we create a BeautifulSoup object from the response, naming the parser explicitly

In [14]:
print(soup.prettify()[:1000])  # displaying the first 1000 characters in a prettier way

<!DOCTYPE html>
<html class="NoJs featurephone" id="atomic" lang="en-US">
 <head prefix="og: http://ogp.me/ns#">
  <script>
   window.performance && window.performance.mark && window.performance.mark('PageStart');
  </script>
  <meta charset="utf-8"/>
  <title>
   Most Active Stocks Today - Yahoo Finance
  </title>
  <meta content="Stock Screener, industry, index membership, share data, stock price, market cap, beta, sales, profitability, valuation ratios, analyst estimates, large cap value, bargain growth, preset stock screens" name="keywords"/>
  <meta content="on" http-equiv="x-dns-prefetch-control"/>
  <meta content="on" property="twitter:dnt"/>
  <meta content="90376669494" property="fb:app_id"/>
  <meta content="#400090" name="theme-color"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="See the list of the most active stocks today, including share price change and percentage, trading volume, intraday highs and lows, and day charts." lang

In [16]:
len(response.text)  # the actual number of characters is over 900,000

906963

For webscraping, it is useful to know how HTML is written because you have to look for patterns within the HTML to find the data you need. When scrolling through the massive HTML, I noticed that the data we want is under the 'td' HTML tag (which is for table cells).

In [17]:
soup.find_all('td')[:10]  # using BeautifulSoup to find all the td tags and looking at the first 10

[<td aria-label="Symbol" class="Va(m) Ta(start) Pstart(6px) Pend(10px) Miw(90px) Start(0) Pend(10px) simpTblRow:h_Bgc($extraLightBlue) Bgc(white) Ta(start)! Fz(s)" colspan="" data-reactid="59"><a class="Fw(600)" data-reactid="60" href="/quote/GE?p=GE" title="General Electric Company">GE</a></td>,
 <td aria-label="Name" class="Va(m) Ta(start) Px(10px) Fz(s)" colspan="" data-reactid="61"><!-- react-text: 62 -->General Electric Company<!-- /react-text --></td>,
 <td aria-label="Price (Intraday)" class="Va(m) Ta(end) Pstart(20px) Fw(600) Fz(s)" colspan="" data-reactid="63"><span class="Trsdu(0.3s)" data-reactid="64">5.49</span></td>,
 <td aria-label="Change" class="Va(m) Ta(end) Pstart(20px) Fw(600) Fz(s)" colspan="" data-reactid="65"><span class="Trsdu(0.3s) Fw(600) C($negativeColor)" data-reactid="66">-0.21</span></td>,
 <td aria-label="% Change" class="Va(m) Ta(end) Pstart(20px) Fw(600) Fz(s)" colspan="" data-reactid="67"><span class="Trsdu(0.3s) Fw(600) C($negativeColor)" data-reactid=

Looking closely at the list above, we see the information for a single company: the first tag gives us the symbol for the company, the second tag gives us the name of the company, the third tag gives us the intraday price, and so on. Each tag also carries an aria-label attribute naming its column. Using this information, we can parse through the HTML to extract the information we want.
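
The same pattern can also be read one table row at a time. The sketch below is a minimal illustration, assuming each company's cells sit inside a single tr tag. The HTML string is a trimmed, made-up imitation of the output above, not the real page, and the names rows_html, rows_soup, and records are hypothetical.

```python
import bs4

# Trimmed imitation of two table rows (the real page has many more attributes).
rows_html = """
<table>
  <tr>
    <td aria-label="Symbol"><a href="/quote/GE?p=GE">GE</a></td>
    <td aria-label="Name"><!-- react-text: 62 -->General Electric Company<!-- /react-text --></td>
    <td aria-label="Price (Intraday)"><span>5.49</span></td>
  </tr>
  <tr>
    <td aria-label="Symbol"><a href="/quote/F?p=F">F</a></td>
    <td aria-label="Name"><!-- react-text -->Ford Motor Company<!-- /react-text --></td>
    <td aria-label="Price (Intraday)"><span>4.9</span></td>
  </tr>
</table>
"""

rows_soup = bs4.BeautifulSoup(rows_html, 'html.parser')

records = []
for tr in rows_soup.find_all('tr'):
    # map each cell's aria-label to its text; .text leaves out HTML comments
    record = {td['aria-label']: td.text.strip() for td in tr.find_all('td')}
    records.append(record)

print(records)
```

Reading row by row keeps each company's values together, which may be easier to check by eye than separate per-column lists.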

In [18]:
# find all td tags with attribute aria-label = "Symbol"
symbols_list = soup.find_all('td', attrs = {'aria-label': "Symbol"})
symbols_col = []
# loop through all the found tags and extract the text inside each tag
for i in symbols_list:
    symbols_col.append(i.text)

In [19]:
symbols_col[:5]

['GE', 'F', 'AMD', 'BAC', 'WFC']

In [20]:
# find all td tags with attribute aria-label = "Name"
names_list = soup.find_all('td', attrs = {'aria-label': "Name"})
names_col = []
# loop through all the found tags and extract the text inside each tag
for i in names_list:
    names_col.append(i.text)

In [21]:
names_col[:5]

['General Electric Company',
 'Ford Motor Company',
 'Advanced Micro Devices, Inc.',
 'Bank of America Corporation',
 'Wells Fargo & Company']

You get the gist...

I will loop through the columns we care about (symbol, name, price, etc.) and extract each one in the same way.

In [22]:
# these are all the columns we need (excluding 52 week range)
column_names = ['Symbol', 'Name', 'Price (Intraday)', 'Change', 
                '% Change', 'Volume', 'Avg Vol (3 month)', 'Market Cap', 'PE Ratio (TTM)']

In [23]:
all_columns = []
for col_name in column_names:
    soup_list = soup.find_all('td', attrs = {'aria-label': col_name})
    temp_col = []
    for i in soup_list:
        temp_col.append(i.text)
    all_columns.append(temp_col)

In [24]:
# we pass all the information we extract into a pandas dataframe
webscrape_df = pd.DataFrame(all_columns).T
webscrape_df.columns = column_names
webscrape_df.head()

Unnamed: 0,Symbol,Name,Price (Intraday),Change,% Change,Volume,Avg Vol (3 month),Market Cap,PE Ratio (TTM)
0,GE,General Electric Company,5.49,-0.21,-3.68%,128.486M,112.792M,48.022B,
1,F,Ford Motor Company,4.9,0.01,+0.20%,80.545M,100.421M,19.487B,
2,AMD,"Advanced Micro Devices, Inc.",54.2,-0.31,-0.57%,66.951M,83.163M,63.479B,127.23
3,BAC,Bank of America Corporation,21.44,-0.27,-1.24%,64.484M,93.075M,186.005B,8.72
4,WFC,Wells Fargo & Company,23.36,-0.7,-2.91%,51.617M,45.707M,95.776B,8.08


We get the same data as we did with pd.read_html (minus the 52 Week Range column, which we excluded)!

Similar logic can be used to extract the data on the other pages by changing the URL we request. This code should be cleaned up and put into a function, so that we can pass in a URL and get back a dataframe of the data we want.
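
One way such a function might look is sketched below. The names html_to_df and scrape_to_df are hypothetical, the timeout value is an arbitrary choice, and the sketch assumes the page keeps labelling its td tags with aria-label as seen above.

```python
import bs4
import pandas as pd
import requests

COLUMN_NAMES = ['Symbol', 'Name', 'Price (Intraday)', 'Change', '% Change',
                'Volume', 'Avg Vol (3 month)', 'Market Cap', 'PE Ratio (TTM)']


def html_to_df(html, column_names=COLUMN_NAMES):
    """Build a dataframe from the td tags whose aria-label matches each column name."""
    page_soup = bs4.BeautifulSoup(html, 'html.parser')
    all_columns = []
    for col_name in column_names:
        cells = page_soup.find_all('td', attrs={'aria-label': col_name})
        all_columns.append([cell.text for cell in cells])
    # each inner list is one column, so transpose to get one row per company
    df = pd.DataFrame(all_columns).T
    df.columns = column_names
    return df


def scrape_to_df(url, column_names=COLUMN_NAMES):
    """Request a page and return its data as a dataframe (this makes a network request)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early if the site returned an HTTP error status
    return html_to_df(response.text, column_names)


# Example usage (uncomment to fetch the live page):
# webscrape_df = scrape_to_df('https://finance.yahoo.com/most-active')
```

Splitting the parsing (html_to_df) from the request (scrape_to_df) means the parsing step can be tried on saved or hand-written HTML without touching the network.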