Today we'll build on the earlier basic_web_plus_api_call_interactions lesson. In that lesson we learned about the webbrowser module, learned how to format a URL for webbrowser, and then used an API call to request some basic information from a web page.  However, not all content you may want to download from a script will have an API that you can interact with directly.  Sometimes you will have to scrape the page itself for the content you are looking for.

Reference:
https://automatetheboringstuff.com/chapter11/

In [1]:
'''
Fairly similar to the last lesson with requests, we need to install the module
'''
!pip install beautifulsoup4



In [2]:
# prep for below
import os
import sys

OK, now we'll have to import this module.  Note that up until now, everything we've used could be imported by the same name we installed it with.  Unfortunately, beautifulsoup4 is not this way.  Take a moment to read the below:

https://docs.python.org/3/tutorial/modules.html#the-module-search-path

Let's take a step back and look through this.  Everything you install with pip lands in a site-packages directory on Python's module search path.  So you *should* be able to do the following and see beautifulsoup (somewhere in the results), assuming an Anaconda install at this location; adjust the path for your own machine.

In [22]:

os.listdir('C:/Users/estyw/Anaconda3/Lib/site-packages')

['adodbapi',
 'alabaster',
 'alabaster-0.7.12.dist-info',
 'anaconda_client-1.7.2-py3.7.egg-info',
 'anaconda_navigator',
 'anaconda_navigator-1.9.6-py3.7.egg-info',
 'anaconda_project',
 'anaconda_project-0.8.2-py3.7.egg-info',
 'asn1crypto',
 'asn1crypto-0.24.0-py3.7.egg-info',
 'astroid',
 'astroid-2.1.0.dist-info',
 'astropy',
 'astropy-3.1-py3.7.egg-info',
 'atomicwrites',
 'atomicwrites-1.2.1-py3.7.egg-info',
 'attr',
 'attrs-18.2.0.dist-info',
 'babel',
 'Babel-2.6.0-py3.7.egg-info',
 'backcall',
 'backcall-0.1.0.dist-info',
 'backports',
 'backports.os-0.1.1-py3.7.egg-info',
 'backports.shutil_get_terminal_size-1.0.0.dist-info',
 'beautifulsoup4-4.6.3.dist-info',
 'binstar_client',
 'bitarray',
 'bitarray-0.8.3-py3.7.egg-info',
 'bkcharts',
 'bkcharts-0.2-py3.7.egg-info',
 'blaze',
 'blaze-0.11.3-py3.7.egg-info',
 'bleach',
 'bleach-3.0.2-py3.7.egg-info',
 'bokeh',
 'bokeh-1.0.2.dist-info',
 'boto',
 'boto-2.49.0-py3.7.egg-info',
 'bottleneck',
 'Bottleneck-1.2.1-py3.7.egg-info

That's a lot to look through. Let's see if we can get at this without hard-coding our Python version and without opening a file browser to look for it manually.  Let's inspect the module search path.  In your command prompt or bash shell, the PYTHONPATH environment variable adds directories to it.  In your Python code the full list is sys.path, so you can just get the list now.

In [23]:
sys.path

['C:\\Users\\estyw\\Desktop\\Python Instruction\\python-instruction-master',
 'C:\\Users\\estyw\\Anaconda3\\python37.zip',
 'C:\\Users\\estyw\\Anaconda3\\DLLs',
 'C:\\Users\\estyw\\Anaconda3\\lib',
 'C:\\Users\\estyw\\Anaconda3',
 '',
 'C:\\Users\\estyw\\AppData\\Roaming\\Python\\Python37\\site-packages',
 'C:\\Users\\estyw\\Anaconda3\\lib\\site-packages',
 'C:\\Users\\estyw\\Anaconda3\\lib\\site-packages\\win32',
 'C:\\Users\\estyw\\Anaconda3\\lib\\site-packages\\win32\\lib',
 'C:\\Users\\estyw\\Anaconda3\\lib\\site-packages\\Pythonwin',
 'C:\\Users\\estyw\\Anaconda3\\lib\\site-packages\\IPython\\extensions',
 'C:\\Users\\estyw\\.ipython']
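As an aside, the standard library can report the site-packages location directly, which avoids hard-coding a version or a user name. This is a minimal sketch, assuming a standard CPython or Anaconda layout; the variable names are only for illustration.

```python
import os
import sys
import sysconfig

# 'purelib' is the directory where pure-Python packages installed by pip normally land
site_packages = sysconfig.get_paths()['purelib']
print(site_packages)

# PYTHONPATH is an optional environment variable; it is often unset (None)
pythonpath = os.environ.get('PYTHONPATH')
print(pythonpath)

# sys.path is the full search list Python actually uses for imports
print(len(sys.path), 'entries on sys.path')
```

Listing the contents of `site_packages` with os.listdir should show the same kind of output as the hard-coded path earlier.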

Phew, that's still a bunch of directories to look through. Let's see if we can get through them in just a few lines of code and find beautifulsoup4.  Since we didn't see a module named "beautifulsoup4" above, let's look for entries that start with the letter "b".

In [24]:
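# Use a name other than "dir" so we don't shadow the built-in dir() function
for path_entry in sys.path:
    # Skip entries that are not real directories (for example '' or a missing .zip)
    if os.path.isdir(path_entry):
        print([x for x in os.listdir(path_entry) if x.startswith('b')])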

['basic_database_tech.ipynb', 'basic_web_beautifulsoup4.ipynb', 'basic_web_plus_api_call_interactions.ipynb']
[]
['base64.py', 'bdb.py', 'binhex.py', 'bisect.py', 'bz2.py']
[]
[]
['babel', 'backcall', 'backcall-0.1.0.dist-info', 'backports', 'backports.os-0.1.1-py3.7.egg-info', 'backports.shutil_get_terminal_size-1.0.0.dist-info', 'beautifulsoup4-4.6.3.dist-info', 'binstar_client', 'bitarray', 'bitarray-0.8.3-py3.7.egg-info', 'bkcharts', 'bkcharts-0.2-py3.7.egg-info', 'blaze', 'blaze-0.11.3-py3.7.egg-info', 'bleach', 'bleach-3.0.2-py3.7.egg-info', 'bokeh', 'bokeh-1.0.2.dist-info', 'boto', 'boto-2.49.0-py3.7.egg-info', 'bottleneck', 'bs4']
[]
[]
[]
[]
[]


That was a bit of a pain, but above I see the "bs4" directory...
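The search above can be wrapped in a small reusable helper. This is a sketch only: `find_on_path` and its parameters are hypothetical names, not part of any library, and it simply reports which search-path directories contain entries starting with a given prefix.

```python
import os
import sys


def find_on_path(prefix, search_path=None):
    """Map each existing search-path directory to its entries starting with prefix."""
    matches = {}
    for entry in (sys.path if search_path is None else search_path):
        # An empty string on sys.path stands for the current working directory
        directory = entry or os.getcwd()
        if not os.path.isdir(directory):
            continue
        found = sorted(
            name for name in os.listdir(directory)
            if name.lower().startswith(prefix.lower())
        )
        if found:
            matches[directory] = found
    return matches


# Only directories with at least one match are reported
for directory, names in find_on_path('bs4').items():
    print(directory, names)
```

Skipping directories with no matches avoids the long run of empty lists in the output above.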

Let's just look on the internet for the documentation on beautifulsoup4.

In [25]:
import webbrowser
webbrowser.open("https://pypi.org/search/?q=beautifulsoup4")

True

Moral of the story: modules don't always get installed under the same name as the package you asked pip for. Even if you ask pip what it knows about beautifulsoup4 with pip show -v, you won't see that the module is actually called 'bs4' (pip show -f lists the installed files, which does reveal the bs4 directory). This is a Python packaging nuance, and you'll find it's easiest to check PyPI.org or the package's documentation for the import name. It's usually quicker than guessing or looking through directories.

In [32]:
!pip show -v beautifulsoup4

Name: beautifulsoup4
Version: 4.6.3
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
Author-email: leonardr@segfault.org
License: MIT
Location: c:\users\estyw\anaconda3\lib\site-packages
Requires: 
Required-by: conda-build
Metadata-Version: 2.1
Installer: pip
Classifiers:
  Development Status :: 5 - Production/Stable
  Intended Audience :: Developers
  License :: OSI Approved :: MIT License
  Programming Language :: Python
  Programming Language :: Python :: 2.7
  Programming Language :: Python :: 3
  Topic :: Text Processing :: Markup :: HTML
  Topic :: Text Processing :: Markup :: XML
  Topic :: Text Processing :: Markup :: SGML
  Topic :: Software Development :: Libraries :: Python Modules
Entry-points:
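If you would rather ask Python than pip, the installed-file record of a package can be read with importlib.metadata. This is a hedged sketch: it assumes Python 3.8 or newer (this notebook's output shows 3.7, where the module is not in the standard library), and `top_level_names` is a hypothetical helper name.

```python
from importlib import metadata


def top_level_names(dist_name):
    """Return the top-level import names recorded for an installed distribution."""
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return []  # the distribution is not installed
    names = set()
    for path in files or []:  # files can be None when no file record exists
        first = path.parts[0]
        # Skip packaging metadata, caches, and scripts installed outside site-packages
        if first in ('..', '__pycache__') or first.endswith(('.dist-info', '.egg-info')):
            continue
        # A single-file module such as "six.py" is imported as "six"
        if len(path.parts) == 1 and path.suffix == '.py':
            first = path.stem
        names.add(first)
    return sorted(names)


print(top_level_names('beautifulsoup4'))
```

When beautifulsoup4 is installed, the printed list should include the name you actually import; an empty list means the distribution was not found.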


Let's import this along with requests from yesterday and move on...

In [33]:
!pip install requests
import requests, bs4



Let's start by learning a bit about HTML. HTML is the Hypertext Markup Language. Text in HTML is surrounded by tags which tell your browser how to render the content. Take a look here to learn a bit more about HTML.

Reference: https://www.dataquest.io/blog/web-scraping-tutorial-python/


In [34]:
# A small HTML document with html/head/title/paragraph tags
test = '''
<html>
    <head>
    <title>
    This is the title!
    </title>
    </head>
    <body>
        <p>
        This is in the first paragraph element
        </p>
        <p>
        This is in the second paragraph element
        </p>
        <p>
        This is in the third paragraph tag
        </p>
        <p>
        <b> This is the fourth paragraph and it's important so put bold tags on it too </b>
        </p>
    </body>
</html>
'''

In [35]:
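# Make some soup! Naming the parser explicitly avoids a warning
# and keeps the results consistent from one machine to the next
soup = bs4.BeautifulSoup(test, 'html.parser')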

In [37]:
# Find all paragraph tags
soup.find_all('p')

[<p>
         This is in the first paragraph element
         </p>, <p>
         This is in the second paragraph element
         </p>, <p>
         This is in the third paragraph tag
         </p>, <p>
 <b> This is the fourth paragraph and it's important so put bold tags on it too </b>
 </p>]
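find_all returns whole tags, markup included. To pull out just the text, each tag's get_text method can be used. This is a minimal sketch on a shortened copy of the document above; the `html` string and variable names are assumptions for illustration.

```python
import bs4

html = '''
<html>
    <head><title>
    This is the title!
    </title></head>
    <body>
        <p>
        This is in the first paragraph element
        </p>
        <p>
        <b> This paragraph is important so it is bold too </b>
        </p>
    </body>
</html>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')

# strip=True trims the surrounding whitespace and newlines from each piece of text
title = soup.title.get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]

print(title)
for text in paragraphs:
    print(text)
```

Note that get_text also reaches into nested tags, so the text inside the bold tag comes back without the markup.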

At this point you've got the basic idea of how to request all the paragraph elements from a page you wrote. So let's take this to the next step and parse the content of a page you didn't write, but whose layout is consistent enough that you know where to look for results. We'll be doing a bit of scraping of stock quotes.  Take a look here for more.

Reference:  http://altitudelabs.com/blog/web-scraping-with-python-and-beautiful-soup/

However, I want you to do a little research on Bloomberg and get your program to print out the "Dow Jones Industrial Average" quote rather than what is shown in the reference.

In [38]:
# Define the Dow Jones Industrial Average URL
dowurl = "https://www.bloomberg.com/quote/INDU:IND/"
# Get the HTML of the page using requests
dowhtml = requests.get(dowurl)
# Parse the response body with Python's built-in HTML parser
soup2 = bs4.BeautifulSoup(dowhtml.content, 'html.parser')
# Display the page title (the site may return a bot-check page instead of the quote)
soup2.find_all('title')

[<title>Bloomberg - Are you a robot?</title>]
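The title above shows that the site answered with a bot-detection page rather than the quote. A script can check for this before trying to parse prices. The sketch below is only an illustration: `is_bot_check` is a hypothetical helper, and matching on the word "robot" is an assumption based on the single title seen above.

```python
import bs4


def is_bot_check(html):
    """Return True when the page title suggests a bot-detection page."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # soup.title is None when the document has no <title> tag
    title = soup.title.get_text(strip=True) if soup.title else ''
    return 'robot' in title.lower()


blocked = '<html><head><title>Bloomberg - Are you a robot?</title></head></html>'
print(is_bot_check(blocked))
```

In the cell above, the same check could be run against dowhtml.content before searching for any quote fields.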

The reference for this exercise handed you the fields to search for. Take a moment to examine the developer tools built into your browser. On Chrome you can get to them by clicking settings -> more tools -> developer tools. Search for the companyName and priceText fields.
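Once you have found those fields in the developer tools, the lookup might look like the sketch below. The HTML snippet is made up for illustration: the class suffixes and the placeholder price are assumptions, not Bloomberg's real markup or data, so match whatever you actually see in your browser.

```python
import re

import bs4

# Hypothetical markup; real class names and values will differ
sample = '''
<section>
  <h1 class="companyName__abc123">Dow Jones Industrial Average</h1>
  <span class="priceText__xyz789">00,000.00</span>
</section>
'''

soup = bs4.BeautifulSoup(sample, 'html.parser')

# A compiled regex passed as class_ matches any class that starts with the field name
name_tag = soup.find(class_=re.compile(r'^companyName'))
price_tag = soup.find(class_=re.compile(r'^priceText'))

# find returns None when nothing matches, so guard before reading the text
name = name_tag.get_text(strip=True) if name_tag else None
price = price_tag.get_text(strip=True) if price_tag else None
print(name, price)
```

Matching on the start of the class name is a design choice here: generated suffixes tend to change between site builds, while the leading field name is more likely to stay stable.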