# Unit 3 Structured Data Extraction and Storage with Python

## Topic Overview

Hello and welcome! In this lesson, we'll be focusing on **Structured Data Extraction and Storage**. Specifically, we'll use **Python**, along with **BeautifulSoup** and **Pandas**, to scrape data from web pages and store it in a CSV file. This process involves retrieving HTML content, parsing it to extract data, handling pagination, and finally saving the structured data.

### Introduction to CSV Files

When scraping data from web pages, it's essential to store the extracted data in a structured format for further analysis. One common way to store structured data is by using a CSV (Comma-Separated Values) file. CSV files are easy to create, read, and share, making them a popular choice for storing tabular data. Here is an example of a CSV file:

```csv
actor,character,movie
Tom Hanks,Forrest Gump,Forrest Gump
Leonardo DiCaprio,Dominick Cobb,Inception
```

### Pandas Library

**Pandas** is a powerful library in Python for data manipulation and analysis. It provides data structures like `DataFrame` and tools for reading and writing data in various formats, including CSV files. By using **Pandas**, we can easily store structured data in a CSV file.

Here is an example of how to create a `DataFrame` and save it to a CSV file using **Pandas**:

```python
import pandas as pd

data = {
    'actor': ['Tom Hanks', 'Leonardo DiCaprio'],
    'character': ['Forrest Gump', 'Dominick Cobb'],
    'movie': ['Forrest Gump', 'Inception']
}

df = pd.DataFrame(data)
df.to_csv('actors.csv', index=False)
```

After running this code, a CSV file named `actors.csv` will be created with the following content:

```csv
actor,character,movie
Tom Hanks,Forrest Gump,Forrest Gump
Leonardo DiCaprio,Dominick Cobb,Inception
```
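
Pandas can also read CSV data back into a `DataFrame`. The following is a minimal sketch of a round trip using the same sample data; it writes to an in-memory `io.StringIO` buffer instead of a file on disk, which is a choice made here for illustration rather than something the lesson requires.

```python
import io
import pandas as pd

# Same sample data as the example above.
df = pd.DataFrame({
    'actor': ['Tom Hanks', 'Leonardo DiCaprio'],
    'character': ['Forrest Gump', 'Dominick Cobb'],
    'movie': ['Forrest Gump', 'Inception'],
})

# Write the CSV text to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV text back into a new DataFrame.
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded.shape)
print(list(loaded.columns))
```

Passing a filename such as `'actors.csv'` to `pd.read_csv` works the same way for a file saved on disk.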

Now that we have an understanding of CSV files and the **Pandas** library, let's move on to web scraping and data extraction.

### Storing Scraped Data in a CSV File

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []

    current_page = start_page
    while current_page:
        response = requests.get(f"{base_url}{current_page}")
        soup = BeautifulSoup(response.text, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            all_quotes.append({"text": text, "author": author, "tags": tags})

        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    df = pd.DataFrame(all_quotes)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)
```

In this code:

  * We define the `extract_to_csv` function to handle the entire process.
  * `all_quotes` collects all the quotes from all the pages.
  * We loop through each page, extract quotes in the format: `{"text": text, "author": author, "tags": tags}`, and append them to `all_quotes`.
  * The loop is controlled by the `current_page` variable, which is updated to the next page URL until there are no more pages.
  * The next page URL is extracted from the `li` element with the class `next`.
  * `pd.DataFrame(all_quotes)` creates a DataFrame.
  * `df.to_csv(filename, index=False)` saves the DataFrame to a CSV file.
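
The extraction step can be tried without any network access. The sketch below runs the same `find` and `find_all` calls against a small inline HTML string; the markup and its contents are hypothetical and only mimic the class names used in the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics one quote block and the "next" link.
sample_html = """
<div class="quote">
  <span class="text">An example quote.</span>
  <span>by <small class="author">Example Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/example/">example</a>
    <a class="tag" href="/tag/sample/">sample</a>
  </div>
</div>
<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
quote = soup.find('div', class_='quote')

# Build one record in the same format the scraper appends to all_quotes.
record = {
    "text": quote.find('span', class_='text').text,
    "author": quote.find('small', class_='author').text,
    "tags": [tag.text for tag in quote.find_all('a', class_='tag')],
}

# The href of the "next" link becomes the next value of current_page.
next_link = soup.find('li', class_='next')
next_page = next_link.find('a')['href'] if next_link else None

print(record)
print(next_page)
```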

The output of the above code will be the `quotes.csv` file containing the extracted data in a structured format.
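
One detail worth knowing: the `tags` value is a Python list, and a CSV cell can only hold text, so each list is written as its string representation. The sketch below, which uses hypothetical rows rather than real scraped data, shows that round trip and one possible way to turn the strings back into lists with `ast.literal_eval`.

```python
import ast
import io
import pandas as pd

# Hypothetical rows shaped like the dictionaries built in the scraper.
rows = [
    {"text": "First quote", "author": "Author A", "tags": ["life", "books"]},
    {"text": "Second quote", "author": "Author B", "tags": []},
]

buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)

# After the round trip, each 'tags' cell is a string, not a list.
loaded = pd.read_csv(io.StringIO(buffer.getvalue()))
tags_type_after_load = type(loaded.loc[0, 'tags'])
print(tags_type_after_load)

# ast.literal_eval parses the string form back into a Python list.
loaded['tags'] = loaded['tags'].apply(ast.literal_eval)
print(loaded.loc[0, 'tags'])
```

An alternative is to join the tags into a single delimited string before saving, which may be easier to read in a spreadsheet.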

### Lesson Summary

In this lesson, we covered the process of extracting structured data from web pages and storing it in a CSV file. We used **Python**, **BeautifulSoup**, and **Pandas** to scrape quotes from a website and save them in a CSV file.

Make sure to practice this on your own and explore other web scraping projects to enhance your skills. Happy coding!

## Store Scraped Data in CSV

Great job on learning the basics of structured data extraction and storage! Now, let's run the code from the lesson to see it in action and understand it better.

This task will scrape quotes from a website and store them in a CSV file using Python, BeautifulSoup, and Pandas.

Here's a brief overview of the code:

  * **Import libraries**: Use `requests` to handle HTTP requests, `BeautifulSoup` to parse HTML, and `pandas` to manage and save data.
  * **Function `extract_to_csv`**:
      * Fetch HTML content.
      * Parse HTML to find quotes, authors, and tags.
      * Handle multiple pages.
      * Store data in a CSV file.

Make sure to review the code and its structure. Running it will save the scraped data into a CSV file, making it ready for further analysis.

Remember to check out the quotes.csv file to observe the saved data.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        response = requests.get(f"{base_url}{current_page}")
        soup = BeautifulSoup(response.text, 'html.parser')

        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            all_quotes.append({"text": text, "author": author, "tags": tags})

        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    df = pd.DataFrame(all_quotes)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)

```

You've provided a complete and correct Python script for scraping quotes and saving them to a CSV file. The code is well-structured and follows the best practices discussed in the lesson.

When you run this script, it will perform the following actions:

1.  **Initialize**: It starts with the `base_url` and `start_page` for the website and the desired `filename` for the output.
2.  **Loop Through Pages**: The `while current_page:` loop will continue as long as it finds a "next" button, automatically handling pagination.
3.  **Scrape Data**: On each page, it finds all the quote `div` elements, then extracts the quote text, author, and tags.
4.  **Append to List**: The extracted data for each quote is stored as a dictionary and added to the `all_quotes` list.
5.  **Create DataFrame**: Once all pages have been scraped, the `all_quotes` list is converted into a pandas `DataFrame`.
6.  **Save to CSV**: The `df.to_csv()` method saves the DataFrame to a file named `quotes.csv`. The `index=False` argument prevents pandas from writing the DataFrame's index as a column in the CSV.
7.  **Confirmation**: Finally, a confirmation message is printed, letting you know that the data has been successfully saved.

By running this code, you'll have a `quotes.csv` file in the current working directory, containing all the scraped quotes, authors, and tags from the website, ready for you to explore and analyze!
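
To see why the pagination loop stops, it can help to run it against a tiny fake site. In the sketch below, `fake_pages` and `walk_pages` are hypothetical names introduced for illustration; the dictionary stands in for `requests.get`, and only its last page lacks a "next" link.

```python
from bs4 import BeautifulSoup

# Hypothetical in-memory "site": path -> HTML. Only the last page has no "next" link.
fake_pages = {
    '/page/1/': '<div class="quote"></div><li class="next"><a href="/page/2/">Next</a></li>',
    '/page/2/': '<div class="quote"></div><li class="next"><a href="/page/3/">Next</a></li>',
    '/page/3/': '<div class="quote"></div>',
}

def walk_pages(start_page):
    """Follow "next" links and return the visited paths in order."""
    visited = []
    current_page = start_page
    while current_page:
        visited.append(current_page)
        soup = BeautifulSoup(fake_pages[current_page], 'html.parser')
        next_link = soup.find('li', class_='next')
        # When no "next" link exists, None ends the loop, as in the script above.
        current_page = next_link.find('a')['href'] if next_link else None
    return visited

visited_pages = walk_pages('/page/1/')
print(visited_pages)
```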

## Adding Author's Description

You have done well with the basics of extracting and storing structured data.

In this task, we will update the existing logic to save a description of each author instead of their birth date. To complete the task, follow the instructions in the starter code.

Remember to check out the quotes.csv file to observe the saved data after running the code.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        response = requests.get(f"{base_url}{current_page}")
        soup = BeautifulSoup(response.text, 'html.parser')
        # Keep only the first two quotes per page (this keeps the number of author-page requests small)
        quotes = soup.find_all('div', class_='quote')[:2]

        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]

            # Get author's detail page URL and fetch the birthdate
            author_url_tag = quote.select_one('span a')
            author_url = author_url_tag['href']

            author_response = requests.get(f"{base_url}{author_url}")
            author_soup = BeautifulSoup(author_response.text, 'html.parser')

            # TODO: Update the lines below to fetch the author description instead of the birth date
            # Hint: It can be found in the element with class 'author-description'
            birthdate_tag = author_soup.find('span', class_='author-born-date')
            birthdate = birthdate_tag.text

            # TODO: Update the append statement to include the description
            all_quotes.append({"text": text, "author": author, "tags": tags, "birthdate": birthdate})

        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    df = pd.DataFrame(all_quotes)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)
```
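
As a hint for the TODO items, here is a minimal sketch of the lookup on its own, run against a hypothetical inline author page. Only the class names `author-born-date` and `author-description` come from the starter code; the surrounding markup and text are made up for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical author-page markup; only the class names come from the starter code and its hint.
author_html = """
<div class="author-details">
  <span class="author-born-date">January 1, 1900</span>
  <div class="author-description">
    An example biography used only for illustration.
  </div>
</div>
"""

author_soup = BeautifulSoup(author_html, 'html.parser')

# Match on the class alone so the lookup does not depend on the tag name.
description_tag = author_soup.find(class_='author-description')
description = description_tag.text.strip() if description_tag else None

print(description)
```

The extracted value can then be stored under a `"description"` key in the dictionary appended to `all_quotes`.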

## Complete the Scraping Code

Great work so far!

In this task, you'll fill in the missing parts of the code needed to extract quotes from a website and save them in a CSV file.

The function `extract_to_csv` retrieves HTML content, parses it to find quotes, authors, and tags, and stores the data in a structured format.

Your job is to complete the TODO sections of the given code.

Remember to check out the quotes.csv file to observe the saved data after running the code.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        response = requests.get(f"{base_url}{current_page}")  # Get the page content
        soup = BeautifulSoup(response.text, 'html.parser')   # Parse the HTML content

        quotes = soup.find_all('div', class_='quote')  # Find all quote blocks

        for quote in quotes:
            # TODO: Extract the quote text - the span element with class 'text'

            # TODO: Extract the author name - the small element with class 'author'

            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            all_quotes.append({"text": text, "author": author, "tags": tags})  # Append the quote data

        next_link = soup.find('li', class_='next')  # Find the next page link
        current_page = next_link.find('a')['href'] if next_link else None  # Update the current page

    # TODO: Create Pandas DataFrame called 'df' with all_quotes that we constructed in the loop

    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)

```

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        response = requests.get(f"{base_url}{current_page}")  # Get the page content
        soup = BeautifulSoup(response.text, 'html.parser')   # Parse the HTML content

        quotes = soup.find_all('div', class_='quote')  # Find all quote blocks

        for quote in quotes:
            # TODO: Extract the quote text - the span element with class 'text'
            text = quote.find('span', class_='text').text.strip()

            # TODO: Extract the author name - the small element with class 'author'
            author = quote.find('small', class_='author').text.strip()

            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            all_quotes.append({"text": text, "author": author, "tags": tags})  # Append the quote data

        next_link = soup.find('li', class_='next')  # Find the next page link
        current_page = next_link.find('a')['href'] if next_link else None  # Update the current page

    # TODO: Create Pandas DataFrame called 'df' with all_quotes that we constructed in the loop
    df = pd.DataFrame(all_quotes)

    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)
```

## Add Extra Data in Scraping

Great job on learning the basics of structured data extraction and storage.

In this task, we will enhance our scraping function to not only extract quotes but also save the author's birthdate. Modify the existing code by adding the necessary parts to fetch the birthdate from the author's detail page. Ensure that the final output includes the quote text, author, tags, and birthdate in the CSV file.

Remember to check out the quotes.csv file to observe the saved data after running the code.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        response = requests.get(f"{base_url}{current_page}")
        soup = BeautifulSoup(response.text, 'html.parser')
        quotes = soup.find_all('div', class_='quote')

        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]

            # Get author's detail page URL and fetch the birthdate
            author_url_tag = quote.select_one('span a')
            author_url = author_url_tag['href']

            author_response = requests.get(f"{base_url}{author_url}")
            author_soup = BeautifulSoup(author_response.text, 'html.parser')

            # TODO: Find the span containing the birthdate with class 'author-born-date'
            
            # TODO: Extract the birthdate text

            # TODO: Extend the object below to include birthdate field with the extracted birthdate
            all_quotes.append({"text": text, "author": author, "tags": tags})

        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    # TODO: Construct Pandas DataFrame with all_quotes
    
    # TODO: Write the data to a CSV file
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)

```

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        response = requests.get(f"{base_url}{current_page}")
        soup = BeautifulSoup(response.text, 'html.parser')
        quotes = soup.find_all('div', class_='quote')

        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]

            # Get author's detail page URL and fetch the birthdate
            author_url_tag = quote.select_one('span a')
            author_url = author_url_tag['href']

            author_response = requests.get(f"{base_url}{author_url}")
            author_soup = BeautifulSoup(author_response.text, 'html.parser')

            # TODO: Find the span containing the birthdate with class 'author-born-date'
            born_date_tag = author_soup.find('span', class_='author-born-date')

            # TODO: Extract the birthdate text
            birthdate = born_date_tag.text if born_date_tag else None

            # TODO: Extend the object below to include birthdate field with the extracted birthdate
            all_quotes.append({"text": text, "author": author, "tags": tags, "birthdate": birthdate})

        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    # TODO: Construct Pandas DataFrame with all_quotes
    df = pd.DataFrame(all_quotes)

    # TODO: Write the data to a CSV file
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)
```
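
The solution above requests the author page once per quote, so an author with several quotes is fetched several times. One possible refinement, not part of the lesson, is to remember each author's birthdate in a dictionary. In this sketch, `make_birthdate_lookup`, `fetch_html`, and `fake_fetch` are hypothetical names, and `fake_fetch` replaces the real HTTP request so the example runs offline.

```python
from bs4 import BeautifulSoup

def make_birthdate_lookup(fetch_html):
    """Return a lookup function that fetches each author URL at most once."""
    cache = {}

    def get_birthdate(author_url):
        if author_url not in cache:
            author_soup = BeautifulSoup(fetch_html(author_url), 'html.parser')
            tag = author_soup.find('span', class_='author-born-date')
            cache[author_url] = tag.text if tag else None
        return cache[author_url]

    return get_birthdate

# Hypothetical stand-in for an HTTP request; it records every call it receives.
calls = []

def fake_fetch(author_url):
    calls.append(author_url)
    return '<span class="author-born-date">January 1, 1900</span>'

get_birthdate = make_birthdate_lookup(fake_fetch)
first = get_birthdate('/author/Example-Author')
second = get_birthdate('/author/Example-Author')  # served from the cache
print(first, len(calls))
```

In the scraper, `fetch_html` could be something like `lambda url: requests.get(f"{base_url}{url}").text`; that wiring is a suggestion rather than something the lesson prescribes.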

## Scrape and Save Quotes Data

Awesome progress so far! This final practice will let you put everything you've learned into action.

Your task is to write a Python function to scrape quotes from a website, handle pagination, and store the data in a CSV file using requests, BeautifulSoup, and pandas.

Follow the TODO comments to complete the task.

Remember to check out the quotes.csv file to observe the saved data after running the code.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    # TODO: Initialize an empty list to collect all quotes

    # TODO: Set the current page to the start page

    # TODO: Start a while loop that continues as long as there is a current page
        # TODO: Retrieve the page content using requests.get

        # TODO: Parse the page content using BeautifulSoup

        # TODO: Find all the quote elements on the page

        # TODO: Loop through each quote element and extract the text, author, and tags

        # TODO: Append the extracted data to the list

        # TODO: Find the link to the next page and update the current page

    # TODO: Create a DataFrame from the collected data

    # TODO: Save the DataFrame to a CSV file

    # TODO: Print a confirmation message with the filename

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)

```

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_to_csv(base_url, start_page, filename):
    # Initialize an empty list to collect all quotes
    all_quotes = []

    # Set the current page to the start page
    current_page = start_page

    # Start a while loop that continues as long as there is a current page
    while current_page:
        # Retrieve the page content using requests.get
        response = requests.get(f"{base_url}{current_page}")
        
        # Parse the page content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all the quote elements on the page
        quotes = soup.find_all('div', class_='quote')

        # Loop through each quote element and extract the text, author, and tags
        for quote in quotes:
            text = quote.find('span', class_='text').text.strip()
            author = quote.find('small', class_='author').text.strip()
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]

            # Append the extracted data to the list
            all_quotes.append({"text": text, "author": author, "tags": tags})

        # Find the link to the next page and update the current page
        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    # Create a DataFrame from the collected data
    df = pd.DataFrame(all_quotes)

    # Save the DataFrame to a CSV file
    df.to_csv(filename, index=False)

    # Print a confirmation message with the filename
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)
```
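
The scripts in this unit build each URL by concatenating `base_url` and the `href` with an f-string, which works because the `href` values start with `/` and `base_url` has no trailing slash. A slightly more forgiving alternative, offered here as a sketch rather than as the lesson's method, is `urljoin` from the standard library.

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'

# A root-relative href, like the one taken from the "next" link.
joined = urljoin(base_url, '/page/2/')

# A trailing slash on the base does not produce a double slash.
joined_with_slash = urljoin(base_url + '/', '/page/2/')

# An absolute href is returned unchanged.
joined_absolute = urljoin(base_url, 'http://quotes.toscrape.com/page/3/')

print(joined)
print(joined_with_slash)
print(joined_absolute)
```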