- Tutorial used: https://realpython.com/python-web-scraping-practical-introduction/#scrape-and-parse-text-from-websites

- urllib is a Python standard library package with tools for working with URLs.

- The urlopen function, from urllib.request, opens URLs.

- The re module provides regular expressions, used here to find data in the HTML.

- **Warning: the tutorial states that you can get banned for scraping web data.**

In [1]:
from urllib.request import urlopen
import re

- Set the URL.
- Use the urlopen function to open the URL.

In [2]:
url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)
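A slightly more defensive version of the cell above, as a sketch: it closes the response with a `with` block and reads the character set from the `Content-Type` header instead of assuming UTF-8. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the tutorial.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the decoded body of url (hypothetical helper, not from the tutorial)."""
    # The with-block closes the connection even if reading fails.
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

It could be called as `fetch_html("http://olympus.realpython.org/profiles/aphrodite")` in place of the separate urlopen, read, and decode steps.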

- read() returns the page as a sequence of bytes; decode() converts those bytes into a string of HTML text.
- decode() defaults to UTF-8. The errors argument can be 'strict', 'ignore', or 'replace'; the default, 'strict', raises a UnicodeDecodeError on invalid bytes.
- When printed, html_bytes appears as one long bytes literal with escaped line breaks (\n), while html prints with real line breaks and is easier to read.


In [5]:
html_bytes = page.read()
html = html_bytes.decode()
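To make the difference between bytes and text concrete, here is a small offline sketch. The sample bytes are made up for illustration and do not come from the profile page. The second half shows how the `errors` argument changes what `decode()` does with an invalid byte.

```python
# Made-up sample standing in for page.read(); not taken from the real page.
sample_bytes = "<p>Café</p>\n".encode("utf-8")

print(type(sample_bytes))        # <class 'bytes'>
text = sample_bytes.decode()     # UTF-8 is the default encoding
print(type(text))                # <class 'str'>

# The byte 0xff never appears in valid UTF-8, so it triggers error handling.
broken = b"caf\xff"
try:
    broken.decode()              # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("strict raised:", exc.reason)

ignored = broken.decode(errors="ignore")    # drops the bad byte
replaced = broken.decode(errors="replace")  # substitutes U+FFFD for it
```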

### **First way to extract content from a web page: string methods on html.**

#### This example extracts the word Aphrodite from the example web page loaded above.

In [28]:
# find the index where the target word starts in the HTML
start_index = html.find("Aphrodite")
# find the index of the closing title tag, which comes right after the word
end_index = html.find("</title>")
# slice the target word out of the HTML
target_word = html[start_index:end_index]
print(target_word)

Aphrodite
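The same idea can be wrapped in a small function that searches for the tags instead of the word itself, so the target text does not need to be known in advance. This is a sketch: `text_between` is a hypothetical helper, and `sample_html` is a hand-written stand-in for the profile page, not its real markup. `str.find` returns -1 when nothing matches, which the helper checks explicitly.

```python
def text_between(html, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag, or None."""
    start = html.find(start_tag)
    if start == -1:              # str.find signals "not found" with -1
        return None
    start += len(start_tag)      # skip past the opening tag itself
    end = html.find(end_tag, start)
    if end == -1:
        return None
    return html[start:end]


# Hand-written stand-in for the downloaded page (an assumption).
sample_html = "<html><head><title>Profile: Aphrodite</title></head></html>"
title = text_between(sample_html, "<title>", "</title>")
print(title)  # Profile: Aphrodite
```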


### **Second way to extract content from a web page: regular expressions.**

#### Count the number of times the letter b appears in the html using regular expressions.

In [9]:
# returns a list of every b found
re.findall("b", html)
# returns how many b's were found
len(re.findall("b", html))

11

#### Find all instances of the \<br> HTML tag.

In [10]:
# returns a list of every <br> found
re.findall("<br>", html)
# returns how many <br> tags were found
len(re.findall("<br>", html))

8
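An exact pattern such as `<br>` misses common variants like `<BR>` or `<br />`. A hedged sketch of a more tolerant pattern follows; the sample string is invented for the demonstration.

```python
import re

# Invented sample containing four spellings of the line-break tag.
sample = "one<br>two<BR>three<br/>four<br />five"

exact = re.findall("<br>", sample)

# \s* allows optional whitespace, /? allows an optional slash,
# and re.IGNORECASE matches both <br> and <BR>.
tolerant = re.findall(r"<br\s*/?>", sample, flags=re.IGNORECASE)

print(len(exact), len(tolerant))  # 1 4
```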

### **Practical Example: Get all products from an example online store.**
Previous regex pattern - title="[A-Z])\w+.+"

regex pattern - title=\"[A-Z].+\"

Every product name in the HTML starts with title=". The \\ in \\" escapes the opening " of the title. [A-Z] matches the first character of a capitalized product title. The . matches any character except a line break, and the + repeats it one or more times. The final \\" matches the closing quote of the title, with the \\ again escaping the ".
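The pieces of the pattern can be checked one at a time on a single line of markup. This is only a sketch: the `<a>` tag below is invented and not copied from the store page. In an ordinary single-quoted Python string the backslash before `"` is optional, so both spellings produce the same pattern.

```python
import re

# Invented line of markup for the demonstration.
line = '<a href="/product/1" title="Dell Vostro 15">Dell Vostro...</a>'

with_escape = re.findall('title=\"[A-Z].+\"', line)
without_escape = re.findall('title="[A-Z].+"', line)

print(with_escape)                    # ['title="Dell Vostro 15"']
print(with_escape == without_escape)  # True

# [A-Z] requires an uppercase first character, so this title is skipped.
lowercase = re.findall('title="[A-Z].+"', '<a title="dell vostro 15">')
print(lowercase)                      # []
```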

Regex cheat sheet used: https://www.rexegg.com/regex-quickstart.html

Regex live tester used: https://regexr.com

In [3]:
# load page and decode
url_practice = "https://webscraper.io/test-sites/e-commerce/static"
page_practice = urlopen(url_practice)
html_practice_bytes = page_practice.read()
html_practice = html_practice_bytes.decode()

# using regular expressions to find all product names
re.findall('(title=\"[A-Z].+\")',html_practice)

['title="Asus ROG Strix GL753VE-GC096T"',
 'title="Acer Aspire 3 A315-31 Black"',
 'title="Dell Vostro 15"']
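Because `.+` is greedy, the pattern above runs to the last `"` on a line, which can swallow later attributes when a tag has more than one. A possible refinement, sketched on invented markup: match up to the next quote with `[^"]+` and put the capturing group around the name only, so `re.findall` returns bare product names.

```python
import re

# Invented markup; the real store page may be structured differently.
sample = (
    '<a title="Asus ROG Strix" class="title">Asus...</a>\n'
    '<a title="Dell Vostro 15" class="title">Dell...</a>\n'
)

greedy = re.findall('title="[A-Z].+"', sample)
# [^"]+ stops at the first closing quote; the group keeps only the name.
names = re.findall(r'title="([A-Z][^"]+)"', sample)

print(greedy[0])  # title="Asus ROG Strix" class="title"
print(names)      # ['Asus ROG Strix', 'Dell Vostro 15']
```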