<h1>Web Scraping Stocks with Python</h1>

<p>In this notebook we'll use web scraping to pull revenue data for Worthington Industries from <a href="https://www.macrotrends.net/stocks/charts/WOR/worthington-industries/revenue">macrotrends.net</a> using the Requests package combined with either BeautifulSoup or Pandas (I'll demonstrate both methods). Then I'll show how you can use the yfinance package to retrieve historical stock data, and finally how you can visualize both historical revenue and stock prices with the Plotly package.</p>
<p>First, let's install the packages that we'll be working with.</p>

In [1]:
!pip install yfinance
!pip install requests
!pip install bs4
!pip install plotly

Collecting yfinance
  Downloading yfinance-0.1.63.tar.gz (26 kB)
Collecting multitasking>=0.0.7
  Downloading multitasking-0.0.9.tar.gz (8.1 kB)
Building wheels for collected packages: yfinance, multitasking
  Building wheel for yfinance (setup.py) ... done
  Created wheel for yfinance: filename=yfinance-0.1.63-py2.py3-none-any.whl size=23909 sha256=a8016eafb6671dc91c3cbaf76d8928975e2844f3f4f8b5b9c8773f514ea387a3
  Stored in directory: /root/.cache/pip/wheels/fe/87/8b/7ec24486e001d3926537f5f7801f57a74d181be25b11157983
  Building wheel for multitasking (setup.py) ... done
  Created wheel for multitasking: filename=multitasking-0.0.9-py3-none-any.whl size=8368 sha256=4c5af8e762445dd17a57645a61be2d8ede1e7a8ba62d0e0aa76e5959fd152dc4
  Stored in directory: /root/.cache/pip/wheels/ae/25/47/4d68431a7ec1b6c4b5233365934b74c1d4e665bf5f968d363a
Successfully built yfinance multitasking
Installing collected packages: multitasking, yfinance
Successfully installed multitasking

<p>Next, we import the packages that we'll be working with.</p>

In [2]:
import yfinance as yf
import pandas as pd
from bs4 import BeautifulSoup
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import requests

<p>This function defines the charts; I borrowed it from the IBM Data Science course on <a href="https://www.coursera.org/professional-certificates/ibm-data-science">Coursera.com</a>. It takes a dataframe with stock data, a dataframe with revenue data, and the name of a stock, and generates two charts with a slider to filter the date range.</p>

In [3]:
def make_graph(stock_data, revenue_data, stock):
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=("Historical Share Price", "Historical Revenue"), vertical_spacing = .3)
    stock_data_specific = stock_data[stock_data.Date <= '2021-06-14']
    revenue_data_specific = revenue_data[revenue_data.Date <= '2021-04-30']
    fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_format=True), y=stock_data_specific.Close.astype("float"), name="Share Price"), row=1, col=1)
    fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_format=True), y=revenue_data_specific.Revenue.astype("float"), name="Revenue"), row=2, col=1)
    fig.update_xaxes(title_text="Date", row=1, col=1)
    fig.update_xaxes(title_text="Date", row=2, col=1)
    fig.update_yaxes(title_text="Price ($US)", row=1, col=1)
    fig.update_yaxes(title_text="Revenue ($US Millions)", row=2, col=1)
    fig.update_layout(showlegend=False,
                      height=900,
                      title=stock,
                      xaxis_rangeslider_visible=True)
    fig.show()

<p>This next chunk of code gets all stock ticker data for NYSE:WOR and puts it into a data frame named wor_data.  We then output the first 5 rows to view a sample of the data frame.</p>

In [4]:
worthington = yf.Ticker("WOR")
wor_data = worthington.history(period = "max")
wor_data.reset_index(inplace = True)
wor_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,1980-03-17,0.0,0.781691,0.741258,0.741258,83025.0,0.0,0.0
1,1980-03-18,0.0,0.808645,0.768213,0.768213,45900.0,0.0,0.0
2,1980-03-19,0.0,0.808645,0.768213,0.768213,56700.0,0.0,0.0
3,1980-03-20,0.0,0.808645,0.768213,0.768213,35775.0,0.0,0.0
4,1980-03-21,0.0,0.795168,0.754736,0.754736,37125.0,0.0,0.0
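
<p>The data frame above holds one row per trading day. If you'd rather work with one value per year, a minimal sketch (assuming the same Date and Close columns as wor_data, and using a small made-up frame rather than real prices) could group the rows by calendar year and keep the last close of each year. The names daily and annual_close are only for illustration.</p>

```python
import pandas as pd

# Made-up rows with the same Date/Close columns as wor_data (not real prices).
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-30", "2020-12-31", "2021-03-31", "2021-06-14"]),
    "Close": [10.0, 12.0, 11.0, 15.0],
})

# Sort chronologically, then keep the last closing price seen in each calendar year.
daily = daily.sort_values("Date")
annual_close = daily.groupby(daily["Date"].dt.year)["Close"].last()

print(annual_close)
```

<p>Swapping daily for wor_data should give the year-end close for each year of the stock's history, assuming the Date column holds datetime values.</p>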


<p>At this point we can scrape the macrotrends data for NYSE:WOR using one of two packages: (a) Beautiful Soup or (b) Pandas. Note that for both methods we'll start out by making a GET request to the server for the HTML text. A GET request will not work for all websites; for example, <a href="https://www.amazon.com">Amazon.com</a> requires use of its own APIs and will not grant access to the HTML text through a standard GET request.</p>
<p>First we define the URL and then we pass the URL to the get method of the Requests package and assign the text to the html_data variable.</p>

In [10]:
url = "https://www.macrotrends.net/stocks/charts/WOR/worthington-industries/revenue"
html_data = requests.get(url).text
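
<p>Because some servers reject or throttle scripted requests, it can help to make the request a little more defensive. The sketch below is one possible approach, not part of the original notebook: the fetch_html helper, its session parameter, and the User-Agent value are all hypothetical names chosen for illustration. It sets a timeout and raises an error on a failed response instead of silently returning an error page.</p>

```python
import requests

# Hypothetical header value; some sites reject the default python-requests User-Agent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; revenue-notebook/0.1)"}

def fetch_html(url, session=None, timeout=10):
    """Return the HTML text at `url`, raising an error on a 4xx/5xx response."""
    # Use the supplied session if there is one, otherwise the module-level requests API.
    http = session if session is not None else requests
    response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # surface 403/404/500 instead of parsing an error page
    return response.text
```

<p>With this helper, html_data = fetch_html(url) would stand in for the call above; whether a given site accepts the request still depends on that site's own rules.</p>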

<h3>Web Scraping with Beautiful Soup</h3>
<p>First we'll use the Beautiful Soup package following these steps:</p>
<ol>
    <li>Create a Beautiful Soup object containing all of the HTML text from the web page.</li>
    <li>Find all table tags within the Beautiful Soup object and print them for identification. Once we know what table we want, we can comment out those lines and move to the next step.</li>
    <li>Create a custom data frame; define column names, loop through table rows and table data to populate the data frame.</li>
    <li>Print the last 5 rows of the data frame.</li>
</ol>

In [32]:
soup = BeautifulSoup(html_data, "html5lib")
soup_tables = soup.find_all("table")
# for i in soup_tables:
#    print(i.prettify())
# soup.prettify() # using this method will display the html code in indented format
# Collect the rows in a list first; DataFrame.append was removed in pandas 2.0
revenue_rows = []

for row in soup_tables[1].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:
        revenue_rows.append({"Date" : col[0].text, "Revenue" : col[1].text})

bs_revenue_df = pd.DataFrame(revenue_rows, columns = ["Date", "Revenue"])
# Strip commas and dollar signs; regex = True states explicitly that the pattern is a regex
bs_revenue_df["Revenue"] = bs_revenue_df["Revenue"].str.replace(r",|\$", "", regex = True)
bs_revenue_df.tail()

Unnamed: 0,Date,Revenue
61,2006-02-28,682
62,2005-11-30,700
63,2005-08-31,694
64,2005-05-31,817
65,2005-02-28,747
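
<p>To see the row loop in isolation, here is a minimal sketch that runs against a small hand-written HTML string instead of the live page. The table contents are made up for illustration, and it uses Python's built-in "html.parser" rather than "html5lib" so that nothing extra needs to be installed.</p>

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table that mimics the two-column layout of the revenue table.
sample_html = """
<table>
  <thead><tr><th>Date</th><th>Revenue</th></tr></thead>
  <tbody>
    <tr><td>2021-02-28</td><td>$1,000</td></tr>
    <tr><td>2020-11-30</td><td>$750</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Walk the body rows and keep the text of the first two cells of each row.
rows = []
for row in sample_soup.find("table").tbody.find_all("tr"):
    cells = row.find_all("td")
    if cells:
        rows.append({"Date": cells[0].text, "Revenue": cells[1].text})

sample_df = pd.DataFrame(rows, columns=["Date", "Revenue"])
# Remove commas and dollar signs so the values can later be cast to numbers.
sample_df["Revenue"] = sample_df["Revenue"].str.replace(r"[,$]", "", regex=True)

print(sample_df)
```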


In [33]:
make_graph(wor_data, bs_revenue_df, "Worthington Industries")

<h3>Web Scraping with Pandas</h3>
<p>Second we'll look at using the Pandas package following these steps:</p>
<p><i>Note: When using the read_html method of the Pandas package, any tables that exist on the web page are automatically read into a list of data frames.</i></p>
<ol>
    <li>Pass the URL that we defined earlier to the read_html method of the Pandas package.</li>
    <li>Since we know that the revenue table is the second table on the page, we can access it directly through index [1] of the read_html list.</li>
    <li>Rename the columns of the dataframe.</li>
    <li>Remove unwanted character symbols from the Revenue column.</li>
    <li>View the last 5 rows in the data frame.</li>
</ol>

In [31]:
read_html = pd.read_html(url)
wor_revenue = read_html[1]
wor_revenue = wor_revenue.rename(columns = {"Worthington Industries Quarterly Revenue(Millions of US $)" : "Date", "Worthington Industries Quarterly Revenue(Millions of US $).1" : "Revenue"})
# Strip commas and dollar signs; regex = True states explicitly that the pattern is a regex
wor_revenue["Revenue"] = wor_revenue["Revenue"].str.replace(r",|\$", "", regex = True)
wor_revenue.tail()



Unnamed: 0,Date,Revenue
61,2006-02-28,682
62,2005-11-30,700
63,2005-08-31,694
64,2005-05-31,817
65,2005-02-28,747
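
<p>After the symbols are removed, the Revenue column still holds strings, and make_graph casts it with astype("float"), which fails on an empty cell. One way to guard against that, sketched here on made-up rows (the revenue name and its values are illustrative, not scraped data), is to convert the types up front and drop any rows that don't parse.</p>

```python
import pandas as pd

# Made-up rows shaped like the scraped table, including one empty Revenue cell.
revenue = pd.DataFrame({
    "Date": ["2021-02-28", "2020-11-30", "2020-08-31"],
    "Revenue": ["1000", "", "750"],
})

# Parse the dates and turn anything that is not a number into NaN.
revenue["Date"] = pd.to_datetime(revenue["Date"])
revenue["Revenue"] = pd.to_numeric(revenue["Revenue"], errors="coerce")

# Drop the rows that failed to parse and put the rest in chronological order.
revenue = revenue.dropna(subset=["Revenue"]).sort_values("Date").reset_index(drop=True)

print(revenue)
```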


In [8]:
make_graph(wor_data, wor_revenue, "Worthington Industries")