# **Q1. Get the element**

Using Beautiful Soup in Python, which one-line expression finds the first `<h2>` tag in the following HTML structure?

HTML Structure:

```html
<!DOCTYPE html>
<html>
<head>
    <title>My Web Page</title>
</head>
<body>
    <h1>Welcome to My Web Page</h1>
    <div class="header">
        <h2>News</h2>
        <p>Latest news will be displayed here</p>
    </div>
    <div class="content">
        <h2>Hello</h2>
        <p>This is some content on my webpage.</p>
        <ul>
            <li>Point 1</li>
            <li>Point 2</li>
            <li>Point 3</li>
        </ul>
    </div>
    <div class="footer">
        <p>Contact us at contact@example.com</p>
    </div>
</body>
</html>
```

**Options:**

1. `soup.find('h2')`
2. `soup.find_all('h2')[0]`
3. `soup.body.h2`
4. All of the above

#### **Ans:**

Option 4. All of the above
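
The equivalence of the three options can be checked directly. The following is a minimal sketch that uses a trimmed copy of the HTML above and assumes the built-in `html.parser` backend; the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the Q1 document; only the parts relevant to <h2> are kept.
html_doc = """
<html>
<body>
    <h1>Welcome to My Web Page</h1>
    <div class="header"><h2>News</h2></div>
    <div class="content"><h2>Hello</h2></div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_by_find = soup.find("h2")          # first match in document order
first_by_index = soup.find_all("h2")[0]  # all matches, then the first one
first_by_attr = soup.body.h2             # attribute access: first <h2> under <body>

# All three expressions resolve to the same Tag object in the parse tree.
same_tag = first_by_find is first_by_index and first_by_index is first_by_attr
print(first_by_find.get_text(), same_tag)
```

The options differ only when the document has no `<h2>` at all: `find` and attribute access return `None`, while indexing the empty result of `find_all` raises an `IndexError`.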

# **Q2. Finding all**

What is the output of the following Python code?

**Python Code:**

```python
from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <div class="info">
        <p>First Paragraph</p>
        <p class="highlight">Second Paragraph</p>
        <p>Third Paragraph</p>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
result = soup.find_all('p', class_='highlight')
print(result)
```

#### **Ans:**

`[<p class="highlight">Second Paragraph</p>]`
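
`find_all` returns a list-like result even when only one tag matches, which is why the output is wrapped in square brackets. As a small sketch on a trimmed copy of the document, the same filter can be written either with the `class_` keyword or with an `attrs` dictionary, and the text can be pulled out of each match:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the Q2 document.
html_doc = """
<div class="info">
    <p>First Paragraph</p>
    <p class="highlight">Second Paragraph</p>
    <p>Third Paragraph</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# `class` is a reserved word in Python, so Beautiful Soup accepts `class_`.
by_keyword = soup.find_all("p", class_="highlight")
# The same filter written as an attribute dictionary.
by_attrs = soup.find_all("p", attrs={"class": "highlight"})

# Extract only the text of each matching tag.
texts = [p.get_text() for p in by_keyword]
print(texts)
```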


# **Q3. Understanding robots.txt**

Which of the following statements best describes the importance of checking a website's `robots.txt` file before performing web scraping?

1. The `robots.txt` file contains the necessary passwords and credentials required for web scraping.
2. It is important to check the `robots.txt` file because it lists the data types that can be legally scraped from the website.
3. The `robots.txt` file indicates which parts of the website the owner prefers not to be accessed by web crawlers, making it crucial for ethical web scraping.
4. Checking the `robots.txt` file is not necessary for web scraping as it only pertains to search engine optimization.

#### **Ans:**

Option 3: The `robots.txt` file indicates which parts of the website the owner prefers not to be accessed by web crawlers, making it crucial for ethical web scraping.
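
One way to honor these preferences in a script is the standard library's `urllib.robotparser`. The sketch below parses a hypothetical `robots.txt` from a string, so no network request is made; the rules, the `example.com` URLs, and the user agent name `my-scraper` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string (no network call).
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "my-scraper" is a made-up user agent name used for illustration.
can_fetch_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
can_fetch_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(can_fetch_public, can_fetch_private)
```

Against a real site, `set_url(...)` followed by `read()` would fetch the live file instead of parsing a string.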

# **Q4. Finding all tags**

HTML Structure:

```html
<!DOCTYPE html>
<html>
<head>
    <title>News Portal</title>
</head>
<body>
    <div id="main-content">
        <p>Paragraph 1 in main content.</p>
        <p>Paragraph 2 in main content.</p>
    </div>
    <div id="sidebar">
        <p>Paragraph in sidebar.</p>
    </div>
    <div>
        <p>Paragraph in a div without an ID.</p>
    </div>
</body>
</html>
```

Using Beautiful Soup in Python, which one-line expression finds all the `<p>` tags within the `<div>` element that has the ID `main-content` in the above HTML structure?

#### **Ans:**

```python
soup.find('div', id='main-content').find_all('p')
```
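
The chained call first narrows the search to one container and then searches only inside it. The sketch below runs it on a trimmed copy of the HTML and compares it with a CSS selector; `select` is offered as an alternative, not as the expected answer, and it assumes the `soupsieve` package is available (recent Beautiful Soup releases install it as a dependency).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the Q4 document.
html_doc = """
<body>
    <div id="main-content">
        <p>Paragraph 1 in main content.</p>
        <p>Paragraph 2 in main content.</p>
    </div>
    <div id="sidebar">
        <p>Paragraph in sidebar.</p>
    </div>
    <div>
        <p>Paragraph in a div without an ID.</p>
    </div>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Chained search: locate the container first, then search only inside it.
chained = soup.find("div", id="main-content").find_all("p")
# CSS selector with the same meaning: <p> descendants of div#main-content.
selected = soup.select("div#main-content p")

texts = [p.get_text() for p in chained]
print(texts)
```

Note that the chained form raises an `AttributeError` if no `<div>` with that ID exists, because `find` returns `None` in that case.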

# **Q5. Scrape the output**

What is the output of the following Python code?

**Python Code:**

```python
from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Book Store</title>
</head>
<body>
    <div id="bestsellers">
        <h2>Best Selling Books</h2>
        <ul>
            <li><a href="/book1">The Great Gatsby</a></li>
            <li><a href="/book2">To Kill a Mockingbird</a></li>
            <li><a href="/book3">1984</a></li>
        </ul>
    </div>
    <div id="new-releases">
        <h2>New Releases</h2>
        <ul>
            <li><a href="/book4">The Testaments</a></li>
            <li><a href="/book5">Normal People</a></li>
        </ul>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
titles = [a.get_text() for a in soup.find_all('a')]
print(titles)
```

#### **Ans:**

`['The Great Gatsby', 'To Kill a Mockingbird', '1984', 'The Testaments', 'Normal People']`
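
Because `soup.find_all('a')` searches the whole document, titles from both sections appear in the result. As a sketch of how the search could be limited to one section, the example below works on a trimmed copy of the HTML and also reads each link's `href` attribute; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the Q5 document.
html_doc = """
<div id="bestsellers">
    <ul>
        <li><a href="/book1">The Great Gatsby</a></li>
        <li><a href="/book2">To Kill a Mockingbird</a></li>
        <li><a href="/book3">1984</a></li>
    </ul>
</div>
<div id="new-releases">
    <ul>
        <li><a href="/book4">The Testaments</a></li>
        <li><a href="/book5">Normal People</a></li>
    </ul>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Restrict the search to one section, then pair each title with its link.
section = soup.find("div", id="new-releases")
books = [(a.get_text(), a["href"]) for a in section.find_all("a")]
print(books)
```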


# **AQ1. Scraping the table**

Given the following HTML structure of a website, complete the Python function that takes this HTML as a string variable `html_content`, scrapes the contents of the table, and returns them as a Python 2D list, as shown in the output sample.

**HTML Structure**

```html
<!DOCTYPE html>
<html>
<head>
    <title>Product Data</title>
</head>
<body>
    <table>
        <thead>
            <tr>
                <th>Product Name</th>
                <th>Price</th>
                <th>Category</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Laptop</td>
                <td>999</td>
                <td>Electronics</td>
            </tr>
            <tr>
                <td>Smartwatch</td>
                <td>250</td>
                <td>Wearables</td>
            </tr>
            <tr>
                <td>Novel</td>
                <td>15.99</td>
                <td>Books</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
```

**Output Sample**

```python
[['Product Name', 'Price', 'Category'],
 ['Laptop', '999', 'Electronics'],
 ['Smartwatch', '250', 'Wearables'],
 ['Novel', '15.99', 'Books']]
```

#### **Ans:**

```python
from bs4 import BeautifulSoup

def scrape_table(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find('table')
    table_data = []

    # Header row: the <th> cells inside <thead>.
    header_row = table.find('thead').find('tr')
    header_data = [th.get_text(strip=True) for th in header_row.find_all('th')]
    table_data.append(header_data)

    # Body rows: one list of <td> cell strings per <tr> inside <tbody>.
    body_rows = table.find('tbody').find_all('tr')
    for row in body_rows:
        row_data = [td.get_text(strip=True) for td in row.find_all('td')]
        table_data.append(row_data)
    return table_data

html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>Product Data</title>
</head>
<body>
    <table>
        <thead>
            <tr>
                <th>Product Name</th>
                <th>Price</th>
                <th>Category</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Laptop</td>
                <td>999</td>
                <td>Electronics</td>
            </tr>
            <tr>
                <td>Smartwatch</td>
                <td>250</td>
                <td>Wearables</td>
            </tr>
            <tr>
                <td>Novel</td>
                <td>15.99</td>
                <td>Books</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
"""

table_data = scrape_table(html_content)
print(table_data)
```

**Output:**

`[['Product Name', 'Price', 'Category'], ['Laptop', '999', 'Electronics'], ['Smartwatch', '250', 'Wearables'], ['Novel', '15.99', 'Books']]`
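
The solution above depends on the table having both `<thead>` and `<tbody>`. As an alternative sketch, not part of the original exercise, the hypothetical helper `scrape_table_rows` below walks every `<tr>` and accepts both `<th>` and `<td>` cells, so it also handles tables that omit those wrappers. It assumes the document contains at most one table of interest and no nested tables.

```python
from bs4 import BeautifulSoup

def scrape_table_rows(html_content):
    """Return every row of the first table as a list of cell strings."""
    soup = BeautifulSoup(html_content, "html.parser")
    table = soup.find("table")
    if table is None:  # no table in the document
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Accept header (<th>) and data (<td>) cells in document order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

# Small illustrative table without <thead>/<tbody>.
sample = """
<table>
    <tr><th>Product Name</th><th>Price</th></tr>
    <tr><td>Laptop</td><td>999</td></tr>
</table>
"""
print(scrape_table_rows(sample))
```

Returning an empty list when no table is found is a design choice for this sketch; raising an exception would be equally reasonable if a missing table should be treated as an error.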
