# Scraping example (fake data)

* You want to invite clients to a conference and need to retrieve their names from this [site](https://quotes.toscrape.com/), along with their quotes, so you can get to know them better.
* Write a program that retrieves every author together with their quotes.

In [10]:
# Libraries
import requests  # for making HTTP requests to web pages
from bs4 import BeautifulSoup  # for parsing HTML content
import pandas as pd  # for tabulating and sorting the scraped records

# The base URL of the website we're scraping
base_url = 'https://quotes.toscrape.com/'

def fetch_quotes(page_url):
    """
    Fetch quotes and authors from a given page URL.
    
    Parameters:
    - page_url: URL of the page to scrape
    
    Returns:
    - A list of dictionaries, each containing an 'Author' and their 'Citation'
    """
    quotes_data = []  # Initialize an empty list to store quotes and authors
    response = requests.get(page_url, timeout=10)  # Make a GET request; the timeout keeps a stalled connection from hanging the run
    
    if response.status_code == 200:  # Check if the request was successful (HTTP status code 200)
        soup = BeautifulSoup(response.content, 'html.parser')  # Parse the HTML content of the page
        quotes = soup.find_all('div', class_='quote')  # Find all quote blocks on the page
        
        for quote in quotes:  # Iterate over each quote block
            text_element = quote.find('span', class_='text')  # Find the element containing the quote text
            author_element = quote.find('small', class_='author')  # Find the element containing the author's name
            if text_element and author_element:  # Ensure both elements were found
                text = text_element.get_text(strip=True)  # Extract the text of the quote, stripping whitespace
                author = author_element.get_text(strip=True)  # Extract the author's name, stripping whitespace
                quotes_data.append({'Author': author, 'Citation': text})  # Add the quote and author to our list
            
        # Check for a link to the next page
        next_page = soup.find('li', class_='next')
        if next_page and next_page.find('a'):  # Ensure the next page link exists
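            # Resolve the link against the current page; plain concatenation would yield a double slash if the href starts with '/'
            next_page_url = requests.compat.urljoin(page_url, next_page.find('a')['href'])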
            quotes_data.extend(fetch_quotes(next_page_url))  # Recursively fetch quotes from the next page
    
    return quotes_data  # Return the list of quotes and authors

# Start the scraping process from the first page of the site
quotes_data = fetch_quotes(base_url)

# Convert the list of dictionaries into a pandas DataFrame
df = pd.DataFrame(quotes_data)

# Sort the DataFrame by the author names
df_sorted = df.sort_values(by='Author').reset_index(drop=True) 

df_sorted


|     | Author | Citation |
| --- | --- | --- |
| 0 | Albert Einstein | “The world as we have created it is a process ... |
| 1 | Albert Einstein | “If I were not a physicist, I would probably b... |
| 2 | Albert Einstein | “Any fool can know. The point is to understand.” |
| 3 | Albert Einstein | “Logic will get you from A to Z; imagination w... |
| 4 | Albert Einstein | “If you want your children to be intelligent, ... |
| ... | ... | ... |
| 95 | Suzanne Collins | “You don’t forget the face of the person who w... |
| 96 | Terry Pratchett | “The trouble with having an open mind, of cour... |
| 97 | Thomas A. Edison | “I have not failed. I've just found 10,000 way... |
| 98 | W.C. Fields | “I am free of all prejudice. I hate everyone e... |
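
`fetch_quotes` follows the "next" link by calling itself, so the call stack grows by one frame per page. A loop is one possible alternative. The sketch below is only an illustration under stated assumptions: `crawl_pages`, `fetch_page`, `max_pages`, and the `example.test` URLs are hypothetical names that do not appear in the notebook, and `fetch_page` stands in for the request-and-parse step so the loop can be tried without network access.

```python
from urllib.parse import urljoin


def crawl_pages(start_url, fetch_page, max_pages=50):
    """Follow 'next' links in a loop instead of recursing.

    fetch_page is a hypothetical callable: url -> (records, next_href),
    where next_href is None on the last page.
    """
    records = []
    visited = set()
    url = start_url
    # Stop on the last page, on a repeated URL, or after max_pages pages
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page_records, next_href = fetch_page(url)
        records.extend(page_records)
        # Resolve the relative link against the page it was found on
        url = urljoin(url, next_href) if next_href else None
    return records


# Offline stand-in for a two-page site (placeholder data, not real quotes)
fake_site = {
    'https://example.test/': (
        [{'Author': 'Author A', 'Citation': 'Quote 1'}], '/page/2/'),
    'https://example.test/page/2/': (
        [{'Author': 'Author B', 'Citation': 'Quote 2'}], None),
}

all_records = crawl_pages('https://example.test/', fake_site.__getitem__)
print(len(all_records))
```

In the notebook, the body of `fetch_quotes` minus its recursive call could play the role of `fetch_page`, returning the page's records together with the `href` of the next link.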


In [11]:
quotes

NameError: name 'quotes' is not defined
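
The `NameError` arises because `quotes` is a local variable inside `fetch_quotes`; the scraped records are available at the top level as `quotes_data` (and as `df_sorted`). Since one author can have several quotes, it may help to group the records per author before sending invitations. The following is a minimal sketch, assuming records shaped like those in `quotes_data`; `group_by_author` and the sample records are hypothetical placeholders, not data from the site.

```python
from collections import defaultdict


def group_by_author(records):
    """Map each author to the list of their quotes, sorted by author name."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record['Author']].append(record['Citation'])
    return dict(sorted(grouped.items()))


# Placeholder records with the same keys as quotes_data
sample_records = [
    {'Author': 'Author B', 'Citation': 'Quote 1'},
    {'Author': 'Author A', 'Citation': 'Quote 2'},
    {'Author': 'Author B', 'Citation': 'Quote 3'},
]

by_author = group_by_author(sample_records)
for author, citations in by_author.items():
    print(f"{author}: {len(citations)} quote(s)")
```

Calling `group_by_author(quotes_data)` in the notebook would give one entry per author instead of one row per quote.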