# Inline-vs-Block Elements and Implications for Web Scraping

This notebook briefly explains the difference between inline and block elements and what that difference means for web scraping. We will use a test page whose two paragraphs (block elements) each contain an inline element.

## Setup

We will set up a simple example with the code below:

In [1]:
from bs4 import BeautifulSoup

In [2]:
html = """
<!DOCTYPE html>
<html>

<head>
    <title>Your Title Here</title>
</head>

<body>
    <!-- Content -->
    <main class="container-fluid" role="main">
        <h1 class="mt5">Hello world!</h1>
        <p>Hello I am <b>X</b></p>
        <p>and I am a worker at <b>Grab</b></p>
        <!-- Your content goes here -->
    </main>
</body>

</html>
"""

In [3]:
soup = BeautifulSoup(html, 'html.parser')

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Your Title Here
  </title>
 </head>
 <body>
  <!-- Content -->
  <main class="container-fluid" role="main">
   <h1 class="mt5">
    Hello world!
   </h1>
   <p>
    Hello I am
    <b>
     X
    </b>
   </p>
   <p>
    and I am a worker at
    <b>
     Grab
    </b>
   </p>
   <!-- Your content goes here -->
  </main>
 </body>
</html>



## Inline-vs-Block Elements

### Block elements

Block elements start on a new line and, by default, take up the full width available to them. Examples are paragraphs (`<p>`) and divs (`<div>`). They appear on the page as invisible blocks unless we add some form of styling to them.

```
<p>This is a paragraph</p>
<p>This is another paragraph</p>

<div id="" class=""></div>
```
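When scraping, each block element is its own node in the parsed tree. The following is a minimal sketch, assuming a cut-down copy of the test page (`demo_html`, `demo_soup` and `block_names` are illustrative names, not part of the notebook), that collects several kinds of block elements in one call by passing a list of tag names to `find_all()`:

```python
from bs4 import BeautifulSoup

# Cut-down copy of the test page, used only for this illustration
demo_html = """
<main>
    <h1>Hello world!</h1>
    <p>Hello I am <b>X</b></p>
    <p>and I am a worker at <b>Grab</b></p>
</main>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# A list of names matches any of them; results come back in document order
blocks = demo_soup.find_all(["h1", "p"])
block_names = [tag.name for tag in blocks]
print(block_names)
```

Each heading and paragraph comes back as a separate result, with any inline children still nested inside it.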

### Inline elements

Inline elements sit inside a block and flow with the surrounding text without starting a new line. Examples are **bold**, *italics* and <u>underline</u>.

```
<p>This is a paragraph. <br /><b>Inside here I am an inline element that is bold!</b></p>
<p>This is another paragraph.</p>
```

The above HTML renders as:

> This is a paragraph.<br />
> **Inside here I am an inline element that is bold!**
>
> This is another paragraph.
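In the parsed tree, the inline elements become child tags of the block, sitting next to its plain text nodes. A minimal sketch, assuming the first paragraph from the snippet above (`snippet`, `para`, `inline_children` and `text_children` are illustrative names), separates the two kinds of children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = (
    "<p>This is a paragraph. <br />"
    "<b>Inside here I am an inline element that is bold!</b></p>"
)
para = BeautifulSoup(snippet, "html.parser").p

# Tags nested directly inside the block-level <p>
inline_children = [child.name for child in para.children if isinstance(child, Tag)]
# Plain text nodes that sit directly inside the <p>
text_children = [str(child) for child in para.children if not isinstance(child, Tag)]

print(inline_children)
print(text_children)
```

The bold text does not appear in `text_children`, because it belongs to the `<b>` tag rather than to the paragraph directly.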

## Implications for Web Scraping

Inline tags are still present in the results when we use BeautifulSoup's `find_all()` method. We need to be careful here if we want to extract only the text without the tags.

In [5]:
find_p = soup.find_all('p')
find_p

[<p>Hello I am <b>X</b></p>, <p>and I am a worker at <b>Grab</b></p>]
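One place where the nested inline tag matters is the `.string` attribute, which only returns text when a tag has a single child. A minimal sketch, assuming a standalone copy of the first paragraph (`para_html` and `first_p` are illustrative names):

```python
from bs4 import BeautifulSoup

para_html = "<p>Hello I am <b>X</b></p>"
first_p = BeautifulSoup(para_html, "html.parser").p

# <p> has two children (a text node and a <b> tag), so .string is None
print(first_p.string)
# <b> has a single text child, so .string returns that text
print(first_p.b.string)
```

Code that relies on `.string` would therefore silently get `None` for any paragraph containing an inline element.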

The `get_text()` method returns only the text of an element and its descendants, with all HTML tags stripped.

In [6]:
find_p[0].get_text()

'Hello I am X'

In [7]:
all_text = [t.get_text() for t in find_p]
all_text

['Hello I am X', 'and I am a worker at Grab']
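The text reads naturally here only because the source HTML has a space before each `<b>` tag. When inline elements are adjacent with no whitespace between them, `get_text()` joins their text directly. A minimal sketch, assuming a made-up paragraph (`price_html`, `price_p`, `joined` and `spaced` are illustrative names), uses the `separator` and `strip` arguments of `get_text()` to keep the pieces apart:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with no whitespace between inline elements
price_html = "<p>Price:<b>10</b><i>USD</i></p>"
price_p = BeautifulSoup(price_html, "html.parser").p

# Default: text pieces are concatenated as-is
joined = price_p.get_text()
# Strip each piece and join the pieces with a single space
spaced = price_p.get_text(separator=" ", strip=True)

print(joined)  # Price:10USD
print(spaced)  # Price: 10 USD
```

Whether a separator is appropriate depends on the page: it also inserts a space wherever an inline tag splits a word or sits next to punctuation.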