# Unit 1 Handling Pagination in Web Scraping with Python and Beautiful Soup

# Introduction to Pagination with Beautiful Soup

Hello and welcome! In this lesson, we will focus on handling **pagination** in web scraping using Python and Beautiful Soup. Pagination is essential when scraping large datasets from websites that display their content over multiple pages. By the end of this lesson, you will be equipped with the skills to navigate multiple web pages, extract necessary data, and handle pagination effectively.

### What is Pagination?

Pagination is a web design technique used to divide extensive content into multiple pages, commonly seen in search results, blogs, and forums. Each page shows a subset of the total data, and navigation links (typically labeled "Next", "Previous", or page numbers) let users move through the data.

**Challenges:**

  * Identifying and following "Next" buttons programmatically.
  * Constructing URLs dynamically to request subsequent pages.
  * Ensuring consistent data extraction amidst varying page layouts.

Understanding pagination is essential for effective web scraping since it allows you to gather comprehensive datasets.
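To make the URL-construction challenge concrete, here is a minimal sketch using `urljoin` from Python's standard library. It assumes a site whose pages follow a `/page/<n>/` pattern, as the example site used later in this lesson does. The helper names `page_url` and `resolve_next` are hypothetical and exist only for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL; the same example site is used later in this lesson.
base_url = "http://quotes.toscrape.com"


def page_url(page_number):
    """Build the URL for a numbered page, assuming a /page/<n>/ pattern."""
    return urljoin(base_url, f"/page/{page_number}/")


def resolve_next(current_url, href):
    """Resolve a "Next" link's href (relative or absolute) against the current page."""
    return urljoin(current_url, href)


print(page_url(3))
print(resolve_next(page_url(1), "/page/2/"))
```

Using `urljoin` rather than plain string concatenation means the same code keeps working if a site emits absolute URLs in its "Next" links.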

-----

### Implementing Pagination in Web Scraping

Let's consider a scenario where we scrape quotes from a website that paginates its content. The website displays quotes on multiple pages, with a "Next" button to navigate to the next page. The following code demonstrates how to scrape quotes from multiple pages:

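```python
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com'
next_url = '/page/1/'  # relative path of the first page

while next_url:
    # Fetch the current page and stop with an error on a failed request
    response = requests.get(f"{base_url}{next_url}")
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Each quote lives in a <div class="quote">
    quotes = soup.find_all("div", class_="quote")

    for quote in quotes:
        print(quote.find("span", class_="text").text)

    # The "Next" link sits inside <li class="next">; it is absent on the last page
    next_button = soup.find("li", class_="next")
    next_url = next_button.find("a")["href"] if next_button else None
```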

The code iterates through the pages of the website and extracts quotes using Beautiful Soup. The `while` loop runs as long as `next_url` is set. On each pass, it reads the next URL from the "Next" button's link, and when the last page has no such button, `next_url` becomes `None` and the loop ends. This handles pagination iteratively, following "Next" links until no more pages are available.

We use `soup.find_all` to locate all `div` tags with class `quote`. Within each `quote` `div`, we find the `span` with the class `text` to extract the quote text.
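The same loop can be packaged as a reusable generator. The sketch below is one possible variation, not the lesson's reference solution. It assumes a hypothetical `fetch` callable that takes a URL and returns HTML text, and adds a `max_pages` cap and a set of visited URLs as safeguards against endless loops. The two fake pages under `example.test` are invented so the example runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_quotes(start_url, fetch, max_pages=50):
    """Yield quote texts page by page, following "Next" links.

    `fetch` is a hypothetical callable: it takes a URL and returns HTML text.
    """
    url = start_url
    visited = set()
    # Stop when there is no next link, a page repeats, or the cap is reached.
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        for quote in soup.find_all("div", class_="quote"):
            yield quote.find("span", class_="text").text
        next_button = soup.find("li", class_="next")
        url = urljoin(url, next_button.find("a")["href"]) if next_button else None


# Offline demo with two made-up pages, so no network access is needed.
fake_site = {
    "http://example.test/page/1/": (
        '<div class="quote"><span class="text">First</span></div>'
        '<li class="next"><a href="/page/2/">Next</a></li>'
    ),
    "http://example.test/page/2/": (
        '<div class="quote"><span class="text">Second</span></div>'
    ),
}

print(list(scrape_quotes("http://example.test/page/1/", fake_site.__getitem__)))
```

To run it against a real site, you could pass something like `lambda url: requests.get(url, timeout=10).text` as `fetch`.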

-----

### Lesson Summary

In this lesson, we explored how to handle pagination while scraping web data using Python and Beautiful Soup. We started with the concept of pagination, implemented a loop that follows "Next" links across multiple pages, and broke down how the example code works.

Let's practice and reinforce the concepts we learned. Happy scraping!

## Scraping Quotes with Pagination

Great job on understanding the lesson! Let's now run the example code to see pagination in action.

Run the code to observe how it scrapes quotes from multiple pages.

```python
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com'
next_url = '/page/1/'
while next_url:
    response = requests.get(f"{base_url}{next_url}")
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.find_all("div", class_="quote")
    for quote in quotes:
        print(quote.find("span", class_="text").text)

    next_button = soup.find("li", class_="next")
    next_url = next_button.find("a")["href"] if next_button else None
```

## Enhance Quote Scraping Script

Great job so far! Now, let's go one step further and modify our existing code.

Currently, the code prints only the text of the quotes. Change the code to also print the author of each quote.

Remember, analyzing the structure of the page is crucial in web scraping. So, if you are not sure where to find the author's name, you can open the URL in another tab and inspect the page.

```python
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com'
next_url = '/page/1/'
while next_url:
    response = requests.get(f"{base_url}{next_url}")
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.find_all("div", class_="quote")

    # TODO: Modify the code to print the author of each quote along with the text
    for quote in quotes:
        print(quote.find("span", class_="text").text)

    next_button = soup.find("li", class_="next")
    next_url = next_button.find("a")["href"] if next_button else None
```

```python
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com'
next_url = '/page/1/'
while next_url:
    response = requests.get(f"{base_url}{next_url}")
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.find_all("div", class_="quote")
    
    # TODO: Modify the code to print the author of each quote along with the text
    for quote in quotes:
        text = quote.find("span", class_="text").text
        author = quote.find("small", class_="author").text
        print(f'"{text}" - {author}')

    next_button = soup.find("li", class_="next")
    next_url = next_button.find("a")["href"] if next_button else None
```

**Explanation of the Modification:**

1.  **Finding the Author Element**: I analyzed the HTML structure of the `quotes.toscrape.com` page and found that the author's name sits inside a `<small>` tag with `class="author"`.
2.  **Storing the Text and Author**: Inside the `for` loop, I now extract the quote text and the author's name separately and store them in the variables `text` and `author`.
3.  **Printing the New Format**: I use an f-string (`f'"{text}" - {author}'`) to combine the quote text and the author's name into a single, easy-to-read line of output.
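If you would rather collect the results than print them, the extraction step can be moved into a small helper. The sketch below is an illustration only: `parse_quote` is a hypothetical function name, and the inline HTML is made up to mimic the structure described above. It also returns `None` for a missing element instead of raising an error.

```python
from bs4 import BeautifulSoup


def parse_quote(quote_div):
    """Return the text and author of one quote block as a dict."""
    text_tag = quote_div.find("span", class_="text")
    author_tag = quote_div.find("small", class_="author")
    return {
        "text": text_tag.get_text(strip=True) if text_tag else None,
        "author": author_tag.get_text(strip=True) if author_tag else None,
    }


# Made-up sample HTML; the second quote deliberately has no author element.
sample_html = """
<div class="quote">
  <span class="text">Sample quote one</span>
  <small class="author">Sample Author</small>
</div>
<div class="quote">
  <span class="text">Sample quote two</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
records = [parse_quote(div) for div in soup.find_all("div", class_="quote")]

for record in records:
    print(f'"{record["text"]}" - {record["author"]}')
```

Returning dictionaries keeps parsing separate from output, so the same records can later be written to a file or filtered by author.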
