# Unit 2 Scraping Data Within HTML Tables

### **Scraping Data from HTML Tables**

In this lesson, we will delve into the specifics of scraping data within HTML tables using Python and the Beautiful Soup library. By the end of this lesson, you will be able to effectively extract structured data from HTML tables and handle related challenges. The goals of this lesson are:

  * Understand the HTML table structure.
  * Learn to extract table data using BeautifulSoup.
  * Handle row data effectively.
  * Print and format the extracted data.

Let's get started!

#### **Understanding HTML Tables**

HTML tables are a widely used element in web development for displaying structured data. The basic structure of an HTML table is composed of the following tags:

  * `<table>`: Defines a table.
  * `<tr>` (Table Row): Defines a row in a table.
  * `<th>` (Table Header): Defines a header cell in a table.
  * `<td>` (Table Data): Defines a standard cell in a table.

Here is an example of a simple HTML table:

```html
<table>
    <tr>
        <th>Author</th>
        <th>Quote</th>
    </tr>
    <tr>
        <td>Albert Einstein</td>
        <td>"Life is like riding a bicycle. To keep your balance, you must keep moving."</td>
    </tr>
    <tr>
        <td>Isaac Newton</td>
        <td>"If I have seen further it is by standing on the shoulders of Giants."</td>
    </tr>
</table>
```
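To see how these tags map onto parsed data, here is a minimal sketch that reads the sample table above with Beautiful Soup. The HTML is inlined as a string so the snippet runs without a network request; the variable names `headers` and `records` are our own choices for illustration.

```python
from bs4 import BeautifulSoup

# The sample table from above, inlined so the snippet runs offline.
html = """
<table>
    <tr><th>Author</th><th>Quote</th></tr>
    <tr><td>Albert Einstein</td><td>"Life is like riding a bicycle. To keep your balance, you must keep moving."</td></tr>
    <tr><td>Isaac Newton</td><td>"If I have seen further it is by standing on the shoulders of Giants."</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Header cells live in <th> elements.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Data cells live in <td> elements; rows without any <td> (the header row) are skipped.
records = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        records.append(dict(zip(headers, cells)))

for record in records:
    print(record["Author"], "-", record["Quote"])
```

Pairing each row's cells with the header names gives one dictionary per row, which is often easier to work with than raw cell lists.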

#### **Extracting Table Element with Beautiful Soup**

Now, let's start by fetching the webpage content and parsing it with BeautifulSoup.

Here’s how you can make an HTTP GET request and parse the HTML content:

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
```

Once we have the HTML content, we can extract the table element using the `find` and `find_all` methods. Here is the code to extract the table element:

```python
quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]
```

Here, `find` returns the first `<table>` element on the page, and `find_all` returns every `<tr>` row inside it. The `[1:-1]` slice drops the first and last rows, which hold the header and the footer, respectively.
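If you want to check the effect of the slice without touching the live site, the following sketch applies the same calls to a small, made-up table. The row contents are hypothetical and chosen only to make the header and footer easy to spot.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row, two data rows, one footer row.
html = """
<table>
    <tr><th>Header</th></tr>
    <tr><td>first data row</td></tr>
    <tr><td>second data row</td></tr>
    <tr><td>Footer</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")        # first matching element, or None if there is none
all_rows = table.find_all("tr")   # list-like result containing every <tr>
data_rows = all_rows[1:-1]        # drop index 0 (header) and the last index (footer)

print(len(all_rows))
print([row.get_text(strip=True) for row in data_rows])
```

Note that the slice is purely positional: it assumes the page always has exactly one header row and one footer row, so it is worth re-checking if the page layout changes.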

#### **Extracting Individual Cell Data**

Next, we’ll loop through the rows and extract individual cell data. We also need to handle nested elements, such as the links inside a row that hold each quote's tags. Here is the code to handle this:

```python
for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()
    tags = tags_row.find_all("a") if tags_row else []
    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)
```

In this code, we loop over every quote in the table and print it along with its tags. The information for one quote is stored in two rows of the table: the `i`-th row contains the quote, and the `i+1`-th row contains the tags for that quote. That is why we iterate over the rows with a step of 2. We use the `find_next_sibling` method to get the next row, which holds the tags inside anchor elements (`<a>`). We then extract the text from each anchor and print it.
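The two-rows-per-quote pattern can be tried offline as well. The sketch below uses a simplified, assumed stand-in for the page's table (the real markup may differ in detail) and collects the results into a list of dictionaries instead of printing them.

```python
from bs4 import BeautifulSoup

# Assumed, simplified stand-in for the page: each quote occupies two rows.
html = """
<table>
    <tr><th>Header</th></tr>
    <tr><td>Quote one Author: A</td></tr>
    <tr><td>Tags: <a href="#">alpha</a> <a href="#">beta</a></td></tr>
    <tr><td>Quote two Author: B</td></tr>
    <tr><td>Tags: <a href="#">gamma</a></td></tr>
    <tr><td>Footer</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")[1:-1]

quotes = []
for i in range(0, len(rows), 2):
    quote_row = rows[i]                       # even index: the quote itself
    tags_row = quote_row.find_next_sibling()  # next element sibling: the tags row
    tags = tags_row.find_all("a") if tags_row else []
    quotes.append({
        "quote": quote_row.get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in tags],
    })

print(quotes)
```

Storing each quote as a dictionary keeps the quote and its tags together, which makes later steps such as filtering or saving to a file simpler.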

The output of the above code will be:

```
Quote:  “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Author: Albert Einstein
Tag:  change
Tag:  deep-thoughts
Tag:  thinking
Tag:  world
Quote:  “It is our choices, Harry, that show what we truly are, far more than our abilities.” Author: J.K. Rowling
Tag:  abilities
Tag:  choices
...
```

This output shows that each quote row has been paired with its tags row and printed in order. Because `quote.text` returns all the text in the row, the author appears on the same line as the quote.

#### **Lesson Summary**

In this lesson, you learned how to scrape and process data within HTML tables using Python and Beautiful Soup. We covered the structure of HTML tables, extracting table elements, handling row data, and printing the extracted data. By mastering these skills, you are now equipped to scrape structured data from web pages effectively.

It's time to put your skills to the test with a hands-on exercise. Let's get started!

## Scraping Quotes and Handling Errors

Great job so far! Now, let's run the code you saw in the lesson to better understand how it works.

First, we will fetch the webpage content and parse it using BeautifulSoup. Then, we will locate and extract rows from an HTML table. Finally, we will loop through the rows, extracting and printing quote texts and their associated tags.

Simply hit the Run button to see the magic unfold!

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]

for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()
    tags = tags_row.find_all("a") if tags_row else []

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)

```

### Running the Scraper

```python
import requests
from bs4 import BeautifulSoup

# Define the URL
url = 'http://quotes.toscrape.com/tableful'

# Fetch the webpage content
try:
    response = requests.get(url)
    response.raise_for_status()  # Check for HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')

    # Locate the table and its rows, excluding header/footer
    quotes = soup.find("table")
    if quotes:
        rows = quotes.find_all("tr")[1:-1]

        # Loop through the rows to extract quotes and tags
        # We iterate with a step of 2 because each quote and its tags are in separate rows
        for i in range(0, len(rows), 2):
            quote = rows[i]
            # Find the next sibling row which contains the tags
            tags_row = quote.find_next_sibling()
            # Find all anchor tags (<a>) for the tags
            tags = tags_row.find_all("a") if tags_row else []

            # Print the quote and its associated tags
            print("Quote: ", quote.text.strip())
            for tag in tags:
                print("Tag: ", tag.text)
            print("-" * 20)  # Separator for readability
    else:
        print("Could not find the table on the page.")

except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

```

## Modify the Number of Quotes

Great job learning how to extract quotes from HTML tables! Let's build on that.

Currently, the code extracts all 10 quotes from the table and prints them. However, we only want to extract the first 2 quotes. Can you modify the code to achieve that?

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]

# TODO: Modify the code to extract only the first 2 quotes - note that the quotes are in every other row, since their tags are placed as a separate row
count = len(rows)

for i in range(0, count, 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()
    tags = tags_row.find_all("a") if tags_row else []

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)

```

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]

# We want to process only the first 2 quotes, which correspond to the first 4 rows
# (2 rows per quote: one for the quote, one for the tags).
# So, we should loop through the first 4 rows.
# The range should be from 0 to 4, with a step of 2.
for i in range(0, 4, 2):
    if i < len(rows):
        quote = rows[i]
        tags_row = quote.find_next_sibling()
        tags = tags_row.find_all("a") if tags_row else []

        print("Quote: ", quote.text)
        for tag in tags:
            print("Tag: ", tag.text)
```
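As a variation, the limit can be kept in a variable rather than hard-coding the number 4. This is only a sketch on a made-up table; `max_quotes` and `stop` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table with three quote/tag row pairs between a header and a footer.
html = """
<table>
    <tr><th>Header</th></tr>
    <tr><td>Quote one</td></tr><tr><td>Tags: <a href="#">a</a></td></tr>
    <tr><td>Quote two</td></tr><tr><td>Tags: <a href="#">b</a></td></tr>
    <tr><td>Quote three</td></tr><tr><td>Tags: <a href="#">c</a></td></tr>
    <tr><td>Footer</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")[1:-1]

max_quotes = 2  # hypothetical setting: how many quotes to keep

# Each quote uses two rows, so the first `max_quotes` quotes span 2 * max_quotes rows.
# min() keeps the range valid when the table holds fewer quotes than requested.
stop = min(2 * max_quotes, len(rows))

first_quotes = [rows[i].get_text(strip=True) for i in range(0, stop, 2)]
print(first_quotes)
```

Computing the stop value once removes the need for a bounds check inside the loop.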

## Filter Quotes by Tag

Great job on learning how to extract quotes and tags from HTML tables! Let's practice further.

Change the given code to print only those quotes that have the tag `life`. This way, you'll learn how to filter data based on specific criteria.

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]

for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()
    tags = tags_row.find_all("a") if tags_row else []

    tags_text = [tag.text for tag in tags]

    # TODO: Print only those quotes that have the tag 'life'
    print("Quote: ", quote.text)
    print("Tags: ", tags_text)

```

To filter the quotes, you need to add a conditional statement inside the `for` loop. The condition should check if the string `'life'` is present in the `tags_text` list.

Here is the modified code:

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]

for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()
    tags = tags_row.find_all("a") if tags_row else []
    
    tags_text = [tag.text for tag in tags]

    # TODO: Print only those quotes that have the tag 'life'
    if 'life' in tags_text:
        print("Quote: ", quote.text)
        print("Tags: ", tags_text)

```

### Explanation of the change:

  - **`if 'life' in tags_text:`**: This line is the core of the filtering. It checks if the string `'life'` exists as an item within the `tags_text` list.
  - **Indentation**: The `print` statements are now indented inside the `if` block. This ensures that they will only be executed when the condition (the quote having the tag 'life') is `True`.
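It helps to be clear about what `in` checks here. The short sketch below uses made-up tag lists (not data from the page) to contrast membership in a list with membership in a string.

```python
# Hypothetical tag lists, standing in for the tags_text value built in the loop.
tags_text = ["life", "live"]
other_tags = ["lifestyle", "books"]

# Membership in a list compares whole items, so only an exact tag matches.
print("life" in tags_text)    # True
print("life" in other_tags)   # False: "lifestyle" is a different item

# Membership in a string is a substring test, which would match "lifestyle" too.
print("life" in ", ".join(other_tags))  # True

# Tags are compared case-sensitively; normalise first if the case may vary.
mixed_case = ["Life", "Humor"]
print("life" in [tag.lower() for tag in mixed_case])  # True
```

This is why the solution builds `tags_text` as a list of strings before testing it: checking against the row's full text would also match tags that merely contain the word.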

## Fill in the Scraping Blanks

You're doing great so far! Let's continue to improve your skills.

In this task, you need to fill in the missing parts of the code to extract quotes and tags from the given URL. The quotes and tags are displayed in a table on the website; feel free to open http://quotes.toscrape.com/tableful in your browser to see how the data is structured.

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]

for i in range(0, len(rows), 2):
    # TODO: Extract the quote row - the first row in the pair

    # TODO: Extract the tags row - the second row in the pair. Hint: Use the find_next_sibling() method

    # TODO: Extract all the tags from the tags row by finding all the 'a' tags inside the row. Note that there may be no tags in the row.

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)
```

Here is the completed code with the missing parts filled in:

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]

for i in range(0, len(rows), 2):
    # TODO: Extract the quote row - the first row in the pair
    quote = rows[i]

    # TODO: Extract the tags row - the second row in the pair. Hint: Use the find_next_sibling() method
    tags_row = quote.find_next_sibling()

    # TODO: Extract all the tags from the tags row by finding all the 'a' tags inside the row. Note that there may be no tags in the row.
    tags = tags_row.find_all("a") if tags_row else []

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)
```

### Explanation of the changes:

1.  **`quote = rows[i]`**: This line correctly extracts the current quote row from the `rows` list. Since the loop increments by 2, `rows[i]` will always be the first `<tr>` element in each pair.
2.  **`tags_row = quote.find_next_sibling()`**: This line uses the `find_next_sibling()` method on the `quote` element. As the quote and tags are in adjacent `<tr>` elements, this is the most direct way to get the tags row.
3.  **`tags = tags_row.find_all("a") if tags_row else []`**: This line is a safe and efficient way to extract the tags.
      - `tags_row.find_all("a")`: This part finds all `<a>` (link) tags within the `tags_row`.
      - `if tags_row else []`: This conditional expression first checks whether `tags_row` exists. `find_next_sibling()` returns `None` when there is no following sibling element, so in that case an empty list (`[]`) is assigned to `tags` instead of calling a method on `None`, which would raise an error. If the row exists but contains no links, `find_all("a")` already returns an empty list, so a quote with no tags is handled as well.
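The two situations described above can be reproduced with a small, made-up fragment. This is a sketch for illustration only; the row contents are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the middle row has no links,
# and the last row has no sibling element after it.
html = """
<table>
    <tr><td>Quote with an empty tags row</td></tr>
    <tr><td>Tags: </td></tr>
    <tr><td>Last row</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Case 1: the sibling row exists but contains no <a> elements.
tags_row = rows[0].find_next_sibling()
print(tags_row.find_all("a"))   # an empty result, no error

# Case 2: there is no next sibling element at all.
missing_row = rows[-1].find_next_sibling()
print(missing_row)              # None

# The conditional expression avoids calling find_all on None.
tags = missing_row.find_all("a") if missing_row else []
print(tags)
```

Without the conditional, the second case would fail with an `AttributeError`, because `None` has no `find_all` method.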

## Extracting Quotes and Tags

You're doing great so far! Let's continue to improve your skills.

In this task, you need to fill in the missing parts of the code to extract table data and rows from the HTML content. Follow the TODO comments in the code to complete the task.

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# TODO: Extract the table element from the HTML content

# TODO: Extract the rows from the table element, excluding the header and footer rows (first and last rows) and store the result in a variable called `rows`

for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()
    tags = tags_row.find_all("a") if tags_row else []

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)

```

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# TODO: Extract the table element from the HTML content
table = soup.find('table')

# TODO: Extract the rows from the table element, excluding the header and footer rows (first and last rows) and store the result in a variable called `rows`
rows = table.find_all('tr')[1:-1]

for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()
    tags = tags_row.find_all("a") if tags_row else []

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)
```

## Scraping Quotes and Tags

Great job on making it this far! Let's put your knowledge of scraping data within HTML tables to the test.

Your task is to scrape quotes and corresponding tags from an HTML table on a webpage. Remember that analyzing the HTML structure of the page is the only way to identify the correct elements to scrape.

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# TODO: Locate the HTML table and extract its rows (excluding header and footer)

# TODO: Loop through the rows to extract quote text and tags

# TODO: Print the extracted quote text and tags


```

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# TODO: Locate the HTML table and extract its rows (excluding header and footer)
table = soup.find('table')
rows = table.find_all('tr')[1:-1]

# TODO: Loop through the rows to extract quote text and tags
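for i in range(0, len(rows), 2):
    quote_row = rows[i]
    # Guard against an unpaired final row so rows[i + 1] cannot raise IndexError
    tags_row = rows[i + 1] if i + 1 < len(rows) else None

    quote_text = quote_row.find('td').text.strip()
    tags = [tag.text.strip() for tag in tags_row.find_all('a')] if tags_row else []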

    # TODO: Print the extracted quote text and tags
    print(f"Quote: {quote_text}")
    print(f"Tags: {', '.join(tags)}\n")
```
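Another way to pair quote rows with tag rows is to slice the even and odd positions and combine them. The sketch below is one possible approach, not the only one; it uses a made-up table with an unpaired final quote and `itertools.zip_longest` from the standard library so that the last quote is kept even though it has no tags row.

```python
from itertools import zip_longest

from bs4 import BeautifulSoup

# Hypothetical table with an odd number of data rows: the last quote has no tags row.
html = """
<table>
    <tr><th>Header</th></tr>
    <tr><td>Quote one</td></tr>
    <tr><td>Tags: <a href="#">alpha</a> <a href="#">beta</a></td></tr>
    <tr><td>Quote two</td></tr>
    <tr><td>Footer</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")[1:-1]

results = []
# rows[0::2] holds the quote rows, rows[1::2] the tag rows.
# zip_longest fills the missing partner with None instead of stopping early.
for quote_row, tags_row in zip_longest(rows[0::2], rows[1::2]):
    links = tags_row.find_all("a") if tags_row is not None else []
    results.append(
        (quote_row.get_text(strip=True), [a.get_text(strip=True) for a in links])
    )

for quote_text, tags in results:
    print(f"Quote: {quote_text}")
    print(f"Tags: {', '.join(tags)}\n")
```

Compared with indexing by position inside the loop, this version never reads past the end of the list, at the cost of building two sliced copies of `rows`.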