# Task

We will parse the example webpage provided in *Web Scraping with Python*, Chapter 2.

The URL is:

http://www.pythonscraping.com/pages/warandpeace.html

# Open the webpage in the browser

- View the page source
- Inspect individual elements with the browser's developer tools
- Use SelectorGadget to find a CSS selector for an element

# Load packages

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Get the page and parse it with BeautifulSoup

In [2]:
url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(url)

In [3]:
bs = BeautifulSoup(html, 'html.parser')
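
The two cells above assume the request succeeds. A minimal sketch of a more defensive version follows; `fetch_soup` and its `timeout` parameter are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Return a BeautifulSoup object for `url`, or None if the request fails.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return BeautifulSoup(response.read(), 'html.parser')
    except HTTPError as err:
        # HTTPError subclasses URLError, so it has to be caught first.
        print(f'The server returned an error: {err.code}')
    except URLError as err:
        print(f'The server could not be reached: {err.reason}')
    return None
```

Calling `fetch_soup('http://www.pythonscraping.com/pages/warandpeace.html')` should give the same kind of object as `bs`, while a failed request yields `None` instead of an uncaught exception.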

In [5]:
bs.h1.get_text()

'War and Peace'

In [6]:
bs.find_all('h1')

[<h1>War and Peace</h1>]
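
The two cells above return different kinds of objects: attribute access such as `bs.h1` gives the first matching tag, while `find_all()` gives a list of every match. The sketch below illustrates this on a small inline document, a made-up stand-in for the real page, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page.
html = '<html><body><h1>War and Peace</h1><h2>Chapter 1</h2></body></html>'
soup = BeautifulSoup(html, 'html.parser')

first_h1 = soup.h1                   # a single Tag: the first <h1>
all_h1 = soup.find_all('h1')         # a list-like ResultSet of every <h1>

missing_tag = soup.h3                # None, because there is no <h3>
missing_list = soup.find_all('h3')   # an empty list-like ResultSet

print(first_h1.get_text())
print(len(all_h1), missing_tag, len(missing_list))
```

Because attribute access returns `None` for a missing tag, chaining `.get_text()` onto it raises an `AttributeError`; iterating over an empty `find_all()` result is safe.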

# Extract green text

## With `find_all()`


In [7]:
green_tags = bs.find_all('span', {'class': 'green'})

In [None]:
# Equivalent explicit loop: collect the text of every green span
texts = []
for gr in green_tags:
    texts.append(gr.get_text())
texts

In [8]:
[gr.get_text() for gr in green_tags]

['Anna\nPavlovna Scherer',
 'Empress Marya\nFedorovna',
 'Prince Vasili Kuragin',
 'Anna Pavlovna',
 'St. Petersburg',
 'the prince',
 'Anna Pavlovna',
 'Anna Pavlovna',
 'the prince',
 'the prince',
 'the prince',
 'Prince Vasili',
 'Anna Pavlovna',
 'Anna Pavlovna',
 'the prince',
 'Wintzingerode',
 'King of Prussia',
 'le Vicomte de Mortemart',
 'Montmorencys',
 'Rohans',
 'Abbe Morio',
 'the Emperor',
 'the prince',
 'Prince Vasili',
 'Dowager Empress Marya Fedorovna',
 'the baron',
 'Anna Pavlovna',
 'the Empress',
 'the Empress',
 "Anna Pavlovna's",
 'Her Majesty',
 'Baron\nFunke',
 'The prince',
 'Anna\nPavlovna',
 'the Empress',
 'The prince',
 'Anatole',
 'the prince',
 'The prince',
 'Anna\nPavlovna',
 'Anna Pavlovna']
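
Several entries in the output above contain a line break carried over from the HTML source (for example `'Anna\nPavlovna'`), so the same name can appear in two different forms. One possible cleanup, sketched here on a few values copied from that output, collapses every run of whitespace into a single space before counting. The helper name `normalize` is an assumption made for this example.

```python
from collections import Counter

# A few values copied from the output above.
raw_names = [
    'Anna\nPavlovna Scherer',
    'Empress Marya\nFedorovna',
    'Anna Pavlovna',
    'Anna\nPavlovna',
    'the prince',
    'The prince',
]


def normalize(text):
    """Collapse any run of whitespace (including newlines) into one space."""
    return ' '.join(text.split())


clean_names = [normalize(name) for name in raw_names]
counts = Counter(clean_names)
print(counts.most_common())
```

Capitalization is left alone here, so `'the prince'` and `'The prince'` are still counted separately; whether to lowercase as well depends on what the counts are for.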

## With CSS selector

In [9]:
[gr.get_text() for gr in bs.select("span.green")]

['Anna\nPavlovna Scherer',
 'Empress Marya\nFedorovna',
 'Prince Vasili Kuragin',
 'Anna Pavlovna',
 'St. Petersburg',
 'the prince',
 'Anna Pavlovna',
 'Anna Pavlovna',
 'the prince',
 'the prince',
 'the prince',
 'Prince Vasili',
 'Anna Pavlovna',
 'Anna Pavlovna',
 'the prince',
 'Wintzingerode',
 'King of Prussia',
 'le Vicomte de Mortemart',
 'Montmorencys',
 'Rohans',
 'Abbe Morio',
 'the Emperor',
 'the prince',
 'Prince Vasili',
 'Dowager Empress Marya Fedorovna',
 'the baron',
 'Anna Pavlovna',
 'the Empress',
 'the Empress',
 "Anna Pavlovna's",
 'Her Majesty',
 'Baron\nFunke',
 'The prince',
 'Anna\nPavlovna',
 'the Empress',
 'The prince',
 'Anatole',
 'the prince',
 'The prince',
 'Anna\nPavlovna',
 'Anna Pavlovna']
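
Both approaches produce the same list. The sketch below checks this on a small made-up fragment (the markup is an assumption, not the real page) and also shows the `class_` keyword form of the attribute filter.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the page's coloured spans.
html = '''
<div>
  <span class="red">Well, Prince</span>
  <span class="green">Anna Pavlovna</span>
  <span class="green">the prince</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_dict = soup.find_all('span', {'class': 'green'})   # attribute dictionary
by_keyword = soup.find_all('span', class_='green')    # class_ keyword argument
by_css = soup.select('span.green')                    # CSS selector


def texts(tags):
    """Return the text of each tag in `tags`."""
    return [tag.get_text() for tag in tags]


print(texts(by_dict) == texts(by_keyword) == texts(by_css))
```

The keyword is spelled `class_` because `class` is a reserved word in Python.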

# Extract heading text

## With `find_all()`


In [10]:
# Passing a list of tag names matches any of them
h_tags = bs.find_all(['h1', 'h2', 'h3', 'h4'])
[h.get_text() for h in h_tags]

['War and Peace', 'Chapter 1']

## With CSS selector

In [11]:
# A comma-separated selector matches any of the listed tags
h_tags = bs.select('h1, h2, h3, h4')
[h.get_text() for h in h_tags]

['War and Peace', 'Chapter 1']
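
The text alone does not show which heading level each string came from. As a sketch on a made-up stand-in document, each tag's `name` can be kept next to its text, and a compiled regular expression can replace the explicit list of tag names.

```python
import re

from bs4 import BeautifulSoup

# Made-up stand-in for the real page.
html = '<html><body><h1>War and Peace</h1><h2>Chapter 1</h2><p>Text</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# A compiled pattern is matched against each tag name, so this covers h1 to h6.
heading_tags = soup.find_all(re.compile(r'^h[1-6]$'))
headings = [(h.name, h.get_text()) for h in heading_tags]
print(headings)
```

The pattern is anchored at both ends so that tags such as `<html>` or `<hr>` are not picked up by accident.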