<h3 style="text-align:center;color:cadetblue;">Web Scraping</h3>

**What Is Web Scraping**

Web scraping is the process of gathering information from the internet. Even copying and pasting the lyrics of your favorite song can be considered a form of web scraping! However, the term “web scraping” usually refers to a process that **involves automation**. While some websites don’t like it when automatic scrapers gather their data, which can lead to legal issues, others don’t mind it.

Suppose you are looking for a job. Instead of checking a job site by hand every day, you can use Python to automate the repetitive parts of your search. With **automated web scraping**, you write the code once, and it retrieves the information you need many times and from many pages. Whether you’re actually on the <u>job hunt</u> or just want to automatically download all the lyrics of your favorite artist, automated web scraping can help you accomplish your goals.

**Web scraping steps**:

1. Inspect your data source.
2. Scrape HTML content from a page.
3. Parse HTML code with **Beautiful Soup**.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Run the command below to install Beautiful Soup:
```bash
pip install beautifulsoup4
```
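
The three steps listed above can be sketched roughly as follows. This is a minimal illustration, not a complete scraper: the helper names `fetch_html` and `extract_links`, the `User-Agent` string, and the sample markup are all hypothetical, and the standard library's `urllib` is assumed for the download step. The parsing step runs on a static string so the example works without network access.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 2: download the raw HTML of a page (hypothetical helper)."""
    # Some sites reject requests that do not send a User-Agent header.
    request = Request(url, headers={"User-Agent": "scraping-tutorial-example"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Step 3: parse the HTML and return (text, href) pairs for every link."""
    parsed = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in parsed.find_all("a")]


# Step 1 (inspecting the data source) happens in the browser's developer tools.
# A static snippet stands in for a downloaded page here:
sample_html = (
    '<p><a href="/jobs/1">Python Developer</a> '
    '<a href="/jobs/2">Data Engineer</a></p>'
)
links = extract_links(sample_html)
print(links)  # [('Python Developer', '/jobs/1'), ('Data Engineer', '/jobs/2')]
```

For a real page, the two helpers would be chained as `extract_links(fetch_html(url))`, with `url` pointing at a site that permits scraping.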

Let's say we have the following HTML content:

```html
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
```

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

type(html_doc)

Running the document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

In [None]:
type(soup)

In [None]:
[attrib for attrib in dir(soup) if not attrib.startswith('_')]

Here are some simple ways to navigate that data structure:

In [None]:
title = soup.title
title

In [None]:
type(title)

In [None]:
name = soup.title.name
name

In [None]:
type(name)

In [None]:
soup.title.string

In [None]:
soup.p

In [None]:
soup.p['class']

In [None]:
a = soup.a
a

In [None]:
type(a)

In [None]:
soup.find_all('a')

In [None]:
soup.find(id='link3')

One common task is extracting all the URLs found within a page’s `<a>` tags:

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))
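
In practice it is often useful to keep more than the bare URL. The sketch below collects one small dictionary per link and skips anchors that have no `href` at all. The sample markup and the names `link_soup` and `records` are made up for illustration.

```python
from bs4 import BeautifulSoup

sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""

link_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute,
# so indexing with a["href"] below cannot raise a KeyError.
records = [
    {"id": a.get("id"), "text": a.get_text(strip=True), "href": a["href"]}
    for a in link_soup.find_all("a", href=True)
]

for record in records:
    print(record)
```

The third anchor is left out because it has no `href`; with a plain `find_all('a')`, `link.get('href')` would return `None` for it instead.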

Another common task is extracting all the text from a page:

In [None]:
print(soup.get_text())
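
By default `get_text()` concatenates every string exactly as it appears, including line breaks from the source markup. As a small sketch on made-up markup, the `separator` and `strip` arguments can be used to get tidier output:

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup(
    "<p>Once upon a time there were <b>three</b>\n little sisters.</p><p>...</p>",
    "html.parser",
)

# Default: strings are joined as-is, so the newline from the markup survives.
raw_text = text_soup.get_text()

# strip=True trims each string and drops empty ones;
# separator=" " joins the remaining pieces with a single space.
clean_text = text_soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))  # 'Once upon a time there were three little sisters. ...'
```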

**Tag**

A Tag object corresponds to an XML or HTML tag in the original document:

In [None]:
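# Name the parser explicitly; otherwise Beautiful Soup guesses one and emits a warning
soup = BeautifulSoup('<b id="bold-text" class="boldest">Extremely bold</b>', 'html.parser')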
tag = soup.b
type(tag)

Every tag has a **name**, accessible as `.name`:

In [None]:
tag.name

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [None]:
tag.name = "blockquote"
tag

A tag may have any number of **attributes**. The tag `<b id="bold-text">` has an attribute *id* whose value is *bold-text*. You can access a tag’s attributes by treating the tag like a dictionary:

In [None]:
tag['id']

You can access that dictionary directly as `.attrs`:

In [None]:
tag.attrs

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [None]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

In [None]:
del tag['id']
del tag['another-attribute']
tag

In [None]:
tag['id'] # KeyError

In [None]:
tag.get('id') # None
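
The difference between the two lookups matters when an attribute may be missing. A short sketch on a separate, made-up tag (the names `attr_soup`, `bold`, and the fallback value `'no-id'` are illustrative only):

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
bold = attr_soup.b

# has_attr() checks for an attribute without reading its value.
has_id = bold.has_attr("id")

# get() accepts a fallback that is returned when the attribute is absent.
fallback = bold.get("id", "no-id")

# Dictionary-style access raises KeyError for a missing attribute.
try:
    bold["id"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(has_id, fallback, lookup_failed)  # False no-id True
```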

**Multi-valued attributes**

The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [None]:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']

In [None]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
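
The list representation also works in the other direction: when a list is assigned to a multi-valued attribute, Beautiful Soup joins the values with spaces when it generates markup. A minimal sketch using the `rel` attribute on made-up markup:

```python
from bs4 import BeautifulSoup

rel_soup = BeautifulSoup(
    '<p>Back to the <a rel="index">homepage</a></p>', "html.parser"
)

# rel is multi-valued on <a>, so a single value still comes back as a list.
before = rel_soup.a["rel"]

# Assign a list; the values are consolidated into one attribute on output.
rel_soup.a["rel"] = ["index", "contents"]
markup = str(rel_soup.p)

print(before)  # ['index']
print(markup)  # <p>Back to the <a rel="index contents">homepage</a></p>
```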

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [None]:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']

You can use `get_attribute_list` to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [None]:
id_soup.p.get_attribute_list('id')

If you parse a document as XML, there are no multi-valued attributes:

In [None]:
# The 'xml' parser requires the lxml package to be installed (pip install lxml)
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

In [None]:
type(tag.string)