<h3 style="text-align:center;color:cadetblue;">Web Scraping</h3>

**What Is Web Scraping**

Web scraping is the process of gathering information from the internet. Even copying and pasting the lyrics of your favorite song can be considered a form of web scraping! However, the term “web scraping” usually refers to a process that **involves automation**. While some websites don’t like it when automatic scrapers gather their data, which can lead to legal issues, others don’t mind it.

Suppose you are looking for a job. Instead of checking a job site by hand every day, you can use Python to automate the repetitive parts of your search. With **automated web scraping**, you write the code once, and it retrieves the information you need many times and from many pages. Whether you’re actually on the <u>job hunt</u> or just want to automatically download all the lyrics of your favorite artist, automated web scraping can help you accomplish your goals.

**Web scraping steps**:

1. Inspect your data source.
2. Scrape HTML content from a page.
3. Parse HTML code with **Beautiful Soup**.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Install Beautiful Soup with pip:
```bash
pip install beautifulsoup4
```

Let's say we have the following HTML content:

```html
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
```

In [3]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

type(html_doc)

str

Running the document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [5]:
type(soup)

bs4.BeautifulSoup
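
To make the "nested data structure" a little more concrete, here is a minimal sketch of walking the tree downward with `.children` and upward with `.parents`. The miniature document and the names `demo_html` and `demo_soup` are made up for illustration and are not part of the notebook above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature document, not the one parsed above.
demo_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='title'><b>Bold</b></p><p class='story'>Text</p></body></html>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Downward: direct children of <body>, keeping only tags (not bare strings).
body_children = [child.name for child in demo_soup.body.children if isinstance(child, Tag)]
print(body_children)  # ['p', 'p']

# Upward: every ancestor of <b>, from the nearest one toward the document root.
ancestors = [parent.name for parent in demo_soup.b.parents]
print(ancestors[:3])  # ['p', 'body', 'html']
```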

In [6]:
[attrib for attrib in dir(soup) if not attrib.startswith('_')]

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'DEFAULT_INTERESTING_STRING_TYPES',
 'EMPTY_ELEMENT_EVENT',
 'END_ELEMENT_EVENT',
 'ROOT_TAG_NAME',
 'START_ELEMENT_EVENT',
 'STRING_ELEMENT_EVENT',
 'append',
 'attrs',
 'builder',
 'can_be_empty_element',
 'cdata_list_attributes',
 'childGenerator',
 'children',
 'clear',
 'contains_replacement_characters',
 'contents',
 'css',
 'currentTag',
 'current_data',
 'declared_html_encoding',
 'decode',
 'decode_contents',
 'decompose',
 'decomposed',
 'default',
 'descendants',
 'element_classes',
 'encode',
 'encode_contents',
 'endData',
 'extend',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 'find',
 'findAll',
 'findAllNext',
 'findAllPrevious',
 'findChild',
 'findChildren',
 'findNext',
 'findNextSibling',
 'findNextSiblings',
 'findParent',
 'findParents',
 'findPrevious',
 'findPreviousSibling',
 'findPreviousSiblings',
 'find_all',
 'find_all_next',
 'find_all_previous',
 'find_next',

Here are some simple ways to navigate that data structure:

In [7]:
title = soup.title
title

<title>The Dormouse's story</title>

In [8]:
type(title)

bs4.element.Tag

In [9]:
name = soup.title.name
name

'title'

In [10]:
soup.title.text

"The Dormouse's story"

In [11]:
type(name)

str

In [12]:
soup.title.string

"The Dormouse's story"

In [13]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [14]:
soup.p['class']

['title']
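
For `soup.title`, `.text` and `.string` return the same value, but they are not interchangeable. A small sketch of the difference, using a made-up snippet (`demo_soup` is a hypothetical name): `.string` is only defined when a tag has exactly one child, while `.text` joins all the strings underneath the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> contains a bare string and a nested <b> tag.
demo_soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(demo_soup.b.string)  # world        (<b> has a single string child)
print(demo_soup.p.string)  # None         (<p> has two children, so .string is ambiguous)
print(demo_soup.p.text)    # Hello world  (all descendant strings joined)
```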

In [15]:
a = soup.a
a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [16]:
type(a)

bs4.element.Tag

In [17]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [18]:
soup.find(id='link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
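
`find_all` and `find` accept more filters than a tag name or an `id`. The sketch below, on a made-up snippet in which the third link is hypothetical, shows filtering by CSS class, capping the number of results with `limit`, and using CSS selectors through `select` and `select_one`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the document above.
demo_html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/other" class="other" id="link3">Other</a>
</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# `class_` has a trailing underscore because `class` is a Python keyword.
sisters = demo_soup.find_all("a", class_="sister")
sister_ids = [a["id"] for a in sisters]
print(sister_ids)  # ['link1', 'link2']

# `limit` stops the search after the given number of matches.
first_only = demo_soup.find_all("a", limit=1)
print(len(first_only))  # 1

# CSS selectors: select_one returns the first match, select returns a list.
print(demo_soup.select_one("#link2").text)  # Lacie
selected = [a.text for a in demo_soup.select("p.story a.sister")]
print(selected)  # ['Elsie', 'Lacie']
```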

One common task is extracting all the URLs found within a page’s `<a>` tags:

In [19]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
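
On real pages, some `<a>` tags have no `href` at all and many links are relative. A minimal sketch of handling both cases, assuming a hypothetical page address `base_url` and made-up markup: `href=True` skips tags without the attribute, and `urljoin` from the standard library resolves relative links against the page address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://example.com/stories/"  # hypothetical address of the scraped page
demo_html = """
<a href="http://example.com/elsie">Elsie</a>
<a href="/lacie">Lacie</a>
<a href="tillie.html">Tillie</a>
<a name="anchor-without-href">No link</a>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in demo_soup.find_all("a", href=True)]
for link in links:
    print(link)
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/stories/tillie.html
```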


Another common task is extracting all the text from a page:

In [20]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
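
The raw output of `get_text()` keeps the whitespace of the original markup. If that is not wanted, a separator and `strip=True` can be passed, or the individual strings can be read from `.stripped_strings`. A small sketch on a made-up snippet (`demo_soup` is a hypothetical name):

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup("<p>Hello <b>world</b>!</p>\n<p>Bye</p>", "html.parser")

# Strip each string, drop the empty ones, and join the rest with a single space.
clean_text = demo_soup.get_text(" ", strip=True)
print(clean_text)  # Hello world ! Bye

# The same strings as a list, for further processing.
pieces = list(demo_soup.stripped_strings)
print(pieces)  # ['Hello', 'world', '!', 'Bye']
```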



**Tag**

A Tag object corresponds to an XML or HTML tag in the original document:

In [42]:
soup = BeautifulSoup('<b id="bold text" class="boldest another">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)

bs4.element.Tag

Every tag has a **name**, accessible as `.name`:

In [43]:
tag.name

'b'

In [44]:
tag

<b class="boldest another" id="bold text">Extremely bold</b>

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [45]:
tag.name = "blockquote"
tag

<blockquote class="boldest another" id="bold text">Extremely bold</blockquote>

A tag may have any number of **attributes**. The tag `<b id="bold text">` has an attribute *id* whose value is *bold text*. You can access a tag’s attributes by treating the tag like a dictionary:

In [46]:
tag

<blockquote class="boldest another" id="bold text">Extremely bold</blockquote>

In [47]:
tag['id']

'bold text'

You can access that dictionary directly as `.attrs`:

In [41]:
tag.attrs['class']

['boldest', 'another']

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [36]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

<blockquote another-attribute="1" class="boldest another" id="verybold">Extremely bold</blockquote>

In [37]:
del tag['id']
del tag['another-attribute']
tag

<blockquote class="boldest another">Extremely bold</blockquote>

In [38]:
tag['id'] # KeyError

KeyError: 'id'

In [39]:
tag.get('id') # None
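
Because indexing raises `KeyError` for a missing attribute, scraping code usually checks first or supplies a fallback. A short sketch of the options, using a fresh hypothetical tag named `demo_tag`:

```python
from bs4 import BeautifulSoup

demo_tag = BeautifulSoup(
    '<blockquote class="boldest">Extremely bold</blockquote>', "html.parser"
).blockquote

print(demo_tag.get("id"))           # None   (missing attribute, no exception)
print(demo_tag.get("id", "no-id"))  # no-id  (explicit fallback value)
print(demo_tag.has_attr("class"))   # True
print("id" in demo_tag.attrs)       # False
```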

**Multi-valued attributes**

The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [48]:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']

['body']

In [49]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']

['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [50]:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']

'my id'

You can use `get_attribute_list` to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [51]:
id_soup.p.get_attribute_list('id')

['my id']

If you parse a document as XML, there are no multi-valued attributes (note that the `'xml'` parser requires the `lxml` package to be installed):

In [52]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

'body strikeout'
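
Two related behaviours are worth a quick sketch, assuming a reasonably recent Beautiful Soup 4 release. A list assigned to a multi-valued attribute is joined back into one space-separated value when the tag is rendered, and the list behaviour can be switched off for HTML by passing `multi_valued_attributes=None` to the constructor. The names `rel_soup` and `no_list_soup` are illustrative.

```python
from bs4 import BeautifulSoup

# `rel` is multi-valued, so it is read as a list...
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
print(rel_soup.a["rel"])  # ['index']

# ...and a list written back is consolidated when the markup is generated.
rel_soup.a["rel"] = ["index", "contents"]
print(rel_soup.p)  # <p>Back to the <a rel="index contents">homepage</a></p>

# Opting out: every attribute value stays a plain string.
no_list_soup = BeautifulSoup(
    '<p class="body strikeout"></p>', "html.parser", multi_valued_attributes=None
)
print(no_list_soup.p["class"])  # body strikeout
```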

**Example**

In [53]:
import requests
from bs4 import BeautifulSoup

url = 'https://data.worldbank.org/country'
res = requests.get(url)
res.status_code

200
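
The cell above checks the status code by eye. In a script it can help to fail loudly instead. The following is only a sketch of one defensive way to fetch a page: `fetch_html` is a hypothetical helper, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper; the default timeout is an arbitrary choice.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions instead of silently continuing.
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html('https://data.worldbank.org/country')` would then return the page text, or `None` if the request could not be completed.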

In [54]:
soup = BeautifulSoup(res.content, 'html.parser')
soup

<!DOCTYPE html>
<html data-react-checksum="1909013177" data-reactid="1" data-reactroot=""><head data-reactid="2"><meta charset="utf-8" data-reactid="3"/><title data-react-helmet="true" data-reactid="4">Countries | Data</title><meta content="width=device-width, initial-scale=1, minimal-ui" data-reactid="5" name="viewport"/><meta content="IE=Edge" data-reactid="6" http-equiv="X-UA-Compatible"/><meta content="Countries from The World Bank: Data" data-react-helmet="true" data-reactid="7" name="description"/><link data-reactid="8" href="/favicon.ico?v=1.1" rel="shortcut icon"/><meta content="ByFDZmo3VoJURCHrA3WHjth6IAISYQEbe20bfzTPCPo" data-reactid="9" name="google-site-verification"/><meta content="World Bank Open Data" data-reactid="10" property="og:title"/><meta content="Free and open access to global development data" data-reactid="11" property="og:description"/><meta content="https://data.worldbank.org/assets/images/logo-wb-header-en.svg" data-reactid="12" property="og:image"/><meta co

In [65]:
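countries = {}

# Each <section> groups countries under an <h3> heading (their first letter).
sections = soup.find_all('section')
for section in sections:
    title = section.find('h3')
    if title is None:
        # Skip sections without a heading instead of raising AttributeError.
        continue
    countries[title.text] = []
    # Collect the link text of every country listed in this section.
    names = section.find_all('a')
    for name in names:
        countries[title.text].append(name.text)

# Resulting structure:
# {
#     'A': ['Afghanistan', 'Albania', ...],
#     'B': ['Bahamas, The', 'Bahrain', ...],
# }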

In [68]:
countries['B']

['Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi']
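
Because a live page can change at any time, it can be useful to wrap the grouping logic in a function and try it on a small inline snippet first. This is a sketch only: `group_links_by_heading` and the sample markup are hypothetical and merely imitate the section/heading/link layout used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout of the page scraped above.
sample_html = """
<section><h3>A</h3><a href="/afg">Afghanistan</a><a href="/alb">Albania</a></section>
<section><h3>B</h3><a href="/bhs">Bahamas, The</a></section>
<section><a href="/x">Section without a heading</a></section>
"""


def group_links_by_heading(html):
    """Map each section's <h3> text to the text of the links in that section."""
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    for section in soup.find_all("section"):
        heading = section.find("h3")
        if heading is None:
            continue  # nothing to use as a key, so skip this section
        key = heading.get_text(strip=True)
        groups[key] = [a.get_text(strip=True) for a in section.find_all("a")]
    return groups


groups = group_links_by_heading(sample_html)
print(groups)  # {'A': ['Afghanistan', 'Albania'], 'B': ['Bahamas, The']}
```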