# Chapter 02. Advanced HTML Parsing

In this part we will look at how to search for tags by their attributes, work with lists of tags, and navigate parse trees.

As an exercise, we will build an example web scraper that scrapes the page located at http://www.pythonscraping.com/pages/warandpeace.html.

On this page, the lines spoken by characters in the story are written in red, whereas the names of characters are written in green.

In [42]:
# import required libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [43]:
# get the html page
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')

In [44]:
# create BeautifulSoup object and parse the html content
bs = BeautifulSoup(html.read(), 'html.parser')

In [45]:
# investigate the source page for used tags
# right click -> View Page Source

In [46]:
# looking at the span tags we can see that some of them have a CSS class (e.g. green, red)
# we can grab all the names from the spans whose class is green
namelist = bs.find_all('span', {'class': 'green'})
for name in namelist:
    # to get only the text from the spans we use the get_text() method
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


# find() and find_all() with BeautifulSoup

In [47]:
h_tags = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

In [48]:
h_tags

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

all_spans = bs.find_all('span', {'class': {'green', 'red'}})

In [49]:
# print the first 10 spans that have either color: green or red
all_spans[:10]

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span clas
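
The two calls above pass a collection instead of a single value: a list of tag names, and a collection of accepted class values. Here is a minimal sketch of the same idea on a small inline document. The HTML string is made up for illustration and only imitates the structure of the real page, and the class values are passed as a list rather than a set.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML that imitates the structure of the example page
html = (
    '<h1>War and Peace</h1><h2>Chapter 1</h2>'
    '<p><span class="red">Well, Prince</span> said '
    '<span class="green">Anna Pavlovna</span> to '
    '<span class="blue">nobody</span></p>'
)
bs = BeautifulSoup(html, 'html.parser')

# A list of tag names matches any tag whose name is in the list
headings = [h.get_text() for h in bs.find_all(['h1', 'h2', 'h3'])]

# A list of attribute values matches tags whose class is any one of them
colored = [s['class'] for s in bs.find_all('span', {'class': ['green', 'red']})]

print(headings)
print(colored)
```

Results come back in document order, not in the order of the values in the list, and `class` is returned as a list because a tag can carry several classes.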

In [50]:
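# to count how many times "the prince" appears as the complete text of a tag on the example page,
# we can search by string instead of by tag name
# (string= is the current name of this argument; older BeautifulSoup releases call it text=)
namelist = bs.find_all(string='the prince')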
print(len(namelist))

7
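
Searching by text compares the given value against each complete text node, so the match is exact and case sensitive. The sketch below shows this on a made-up inline snippet (not the real page), using the `string` argument, which is the name newer BeautifulSoup releases use for `text`.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML with several variations of the phrase
html = (
    '<p><span class="green">the prince</span> and '
    '<span class="green">The prince</span> met '
    '<b>the prince</b>; the prince left.</p>'
)
bs = BeautifulSoup(html, 'html.parser')

# Only text nodes that are exactly 'the prince' are returned
matches = bs.find_all(string='the prince')
print(len(matches))
```

"The prince" (capitalized) and the longer text node "; the prince left." are not counted, because neither is exactly equal to the search string.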


The limit argument is used only by find_all(); find() behaves like find_all() with a limit of 1, except that it returns the single match (or None) rather than a list. If we are interested only in the first x items from the page, we can pass that number as the limit argument. This gives us the first x matches in the order in which they occur on the page, which are not necessarily the ones we want.
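
A minimal sketch of the limit argument follows. It assumes a small made-up HTML string in place of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, standing in for the real page
html = (
    '<p><span class="green">Anna</span> spoke to '
    '<span class="green">the prince</span>, then to '
    '<span class="green">Prince Vasili</span>.</p>'
)
bs = BeautifulSoup(html, 'html.parser')

# limit=2 stops the search after the first two matches in document order
first_two = bs.find_all('span', {'class': 'green'}, limit=2)
print([span.get_text() for span in first_two])

# find() returns the first match itself, not a one-element list
first = bs.find('span', {'class': 'green'})
print(first.get_text())
```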

In [51]:
# the following two lines are equivalent:
# bs.find_all('', {'id': 'text'})
# bs.find_all(id='text')

In [52]:
# the following line raises a SyntaxError because class is a reserved keyword in Python and cannot be used as an argument name
# bs.find_all(class='green')

In [53]:
# to work around this we can pass the CSS class in the attributes dictionary instead:
bs.find_all('', {'class': 'green'})

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri
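
BeautifulSoup also accepts a `class_` keyword argument (with a trailing underscore), which avoids the clash with the Python keyword. The sketch below compares it with the dictionary form on a made-up inline snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, standing in for the real page
html = (
    '<p><span class="green">Anna Pavlovna</span> '
    '<span class="red">Well, Prince</span></p>'
)
bs = BeautifulSoup(html, 'html.parser')

# Dictionary form, as used above
by_dict = bs.find_all('span', {'class': 'green'})

# Keyword form: class_ with a trailing underscore
by_keyword = bs.find_all('span', class_='green')

print([span.get_text() for span in by_keyword])
```

Both calls select the same tags, so the choice between them is mostly a matter of taste.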

# Dealing with children and other descendants

In this part we will focus on the page http://www.pythonscraping.com/pages/page3.html, where we learn how to distinguish between the children and the other descendants of a tag.
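
Children are the nodes exactly one level below a tag, while descendants include every node at any depth below it. A minimal sketch of the difference follows. It assumes a tiny made-up table rather than the real page, and uses the `.children` and `.descendants` attributes.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical inline HTML that imitates one row of the gift table
html = (
    '<table id="giftList"><tr>'
    '<td>Vegetable Basket</td>'
    '<td><span>Now with peppers!</span></td>'
    '</tr></table>'
)
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# .children yields only the direct children of the table
child_tags = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants walks the whole subtree: rows, cells, spans and text nodes
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

The `isinstance(..., Tag)` filter skips text nodes, which are also part of the tree but are not tags.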

In [54]:
# import required libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [55]:
# get the html page
html = urlopen('http://www.pythonscraping.com/pages/page3.html')

In [56]:
# parse the html page using BeautifulSoup
bs = BeautifulSoup(html.read(), 'html.parser')

In [57]:
# get the siblings that follow the first row of the table
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

This code prints every product row in the table except the header row. The header is skipped because an object cannot be a sibling of itself, and next_siblings returns only the siblings that come after the starting tag.
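
The blank lines in the output above are whitespace text nodes between the rows, which count as siblings too. A minimal sketch that keeps only the tag siblings follows. It assumes a shortened, made-up copy of the table.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened copy of the gift table
html = """
<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""
bs = BeautifulSoup(html, 'html.parser')
header = bs.find('table', {'id': 'giftList'}).tr

# next_siblings starts after the header row, so the header is never included
following_ids = [row['id'] for row in header.next_siblings if isinstance(row, Tag)]

# nothing but whitespace comes before the header row inside the table
rows_before = [row for row in header.previous_siblings if isinstance(row, Tag)]

print(following_ids)
print(rows_before)
```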

In [58]:
# get the first row of the table (header)
bs.find('table', {'id': 'giftList'}).tr

<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

# Dealing with parents

When scraping pages, we will likely find that we need to look up the parents of tags less often than their children or siblings.

In [62]:
# import required libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [63]:
# get html page
html = urlopen('http://www.pythonscraping.com/pages/page3.html')

In [64]:
# parse the html page using BeautifulSoup
bs = BeautifulSoup(html, 'html.parser')

In [74]:
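# find the image, move up to its parent td, step back to the previous td (the price) and print its text
print(bs.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())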


$15.00



In [78]:
# get the previous siblings of the td tag, which is the parent of the image in this case
# we use the .get_text() method to retrieve the text of each sibling
for sibling in bs.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_siblings:
    print(sibling.get_text())


$15.00


This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!


Vegetable Basket
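
The navigation chain used above can be unpacked one step at a time. The sketch below does so on a made-up, single-row table with no whitespace between the cells. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table that imitates the structure of the real page
html = (
    '<table><tr>'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
bs = BeautifulSoup(html, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/img1.jpg'})  # 1. select the image
cell = img.parent                                       # 2. the td that wraps it
price_cell = cell.previous_sibling                      # 3. the td just before it
price = price_cell.get_text()                           # 4. the text of that td

# find_previous_sibling('td') skips text nodes, so it also works when
# whitespace separates the cells
same_cell = cell.find_previous_sibling('td')

print(price)
```

On pages where cells are separated by whitespace, `previous_sibling` may return a text node rather than a tag, so `find_previous_sibling` is often the safer choice.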



# Regular Expressions and BeautifulSoup

Let's assume that we want to extract image URLs from an HTML page. At first we could just grab all the images using .find_all("img"), but there is a problem.

In addition to the obvious extra images such as logos, modern websites often have hidden images and blank images used for spacing and aligning elements, which we may not be aware of. The solution is to look for something that identifies the tag itself, such as the file path of the product images.

In [86]:
# import required libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup 
import re

In [81]:
# get the html page
html = urlopen('http://www.pythonscraping.com/pages/page3.html')

In [82]:
# parse the html page using BeautifulSoup
bs = BeautifulSoup(html, 'html.parser')

In [83]:
# extract images
# use a raw string so that the backslashes reach the regex engine unchanged
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})

In [85]:
for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
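
To see what the regular expression filters out, here is a minimal sketch on a made-up list of images. The logo and spacer file names are hypothetical, and the pattern is a slightly stricter variation on the one above (`\d+` instead of `.*`), not the original expression.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML mixing product images with other images
html = """
<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/spacer.gif"/>
"""
bs = BeautifulSoup(html, 'html.parser')

# Match only paths of the form ../img/gifts/img<number>.jpg
pattern = re.compile(r'\.\./img/gifts/img\d+\.jpg')
sources = [img['src'] for img in bs.find_all('img', {'src': pattern})]

print(sources)
```

A compiled pattern can be passed wherever a string attribute value is accepted, and BeautifulSoup keeps the tags whose attribute matches it.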
