<div style="display: flex; align-items: center; justify-content: center;">
  <div style="flex: 1; text-align: left;">
    <h2>MySpider Class</h2>
    <p>The <code>MySpider</code> class is designed to handle web scraping operations for news articles from the main page of <em>La Repubblica</em>, an Italian newspaper. It initiates HTTP requests to retrieve article data from specified URLs and parses the HTML content to extract relevant information such as titles, links, and publication dates.</p>
    <p>The class reads its start URLs from a CSV file and requests them one at a time. URLs scraped in an earlier run are recorded in a separate file and skipped, so an interrupted run can be resumed without duplicate scraping. The scraped data is appended to an output CSV file.</p>
    <p>Key methods include:</p>
    <ul>
      <li><code>start_requests()</code>: Starts the scraping process by requesting each start URL, skipping URLs that were already visited. On a 403 response it waits four minutes before moving on to the next URL.</li>
      <li><code>parse(response)</code>: Parses the HTML content of each response, extracts the title, link, and publication date of every article, appends them to the output CSV file, and records the page URL as visited.</li>
    </ul>
    <p>The class keeps the request and bookkeeping logic in one place, so that part can be reused for other scraping tasks. The parsing step is specific to <em>La Repubblica</em>, as indicated by the provided start URLs: it relies on the <code>article</code>, <code>h1</code>, and <code>aside</code> elements of that site's pages.</p>
  </div>
</div>
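
<p>The request loop described above can be sketched in isolation. This is a minimal sketch, not the class's own code: the names <code>crawl</code>, <code>fetch</code>, <code>wait_seconds</code>, <code>max_retries</code>, and <code>sleep</code> are hypothetical, the HTTP call is passed in as a function so the example runs without network access, and, unlike <code>start_requests()</code>, it retries the same URL once after the wait.</p>

```python
import time
from typing import Callable, Iterable, List, Set


def crawl(
    urls: Iterable[str],
    visited: Set[str],
    fetch: Callable[[str], int],
    wait_seconds: float = 240.0,
    max_retries: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return the URLs fetched with status 200, skipping those in `visited`.

    `fetch` is a hypothetical stand-in for an HTTP request and returns
    only the status code.
    """
    scraped = []
    for url in urls:
        if url in visited:
            continue  # already handled in an earlier run
        for attempt in range(max_retries + 1):
            status = fetch(url)
            if status == 200:
                visited.add(url)  # mark as visited only after success
                scraped.append(url)
                break
            if status == 403 and attempt < max_retries:
                sleep(wait_seconds)  # back off, then try the same URL again
                continue
            break  # other status, or retries exhausted: move on
    return scraped


# Demo with canned status codes instead of real requests:
# "a" is skipped, "b" gets 403 then 200, "c" gets 200.
statuses = iter([403, 200, 200])
result = crawl(["a", "b", "c"], {"a"}, lambda url: next(statuses),
               sleep=lambda seconds: None)
print(result)  # ['b', 'c']
```

<p>Marking a URL as visited only after a successful response means a URL that keeps failing is tried again on the next run.</p>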


In [None]:
import csv  # For reading and writing CSV files
import os  # For file system operations like checking paths
import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For HTML parsing
import time  # For pauses between requests



# Spider class to handle news article crawling
class MySpider:
    def __init__(self, start_urls_csv):
        self.start_urls_csv = start_urls_csv
        self.visited_urls_file = self._build_visited_urls_file()

    def _build_visited_urls_file(self, use_output_suffix=False):

        # Get the base filename of the start URLs CSV file without the extension
        base_filename = os.path.splitext(os.path.basename(self.start_urls_csv))[0]

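        # Keep only the text before the first underscore, which drops the dates.
        # Note: a multi-word keyword such as "mafia_nigeriana" is shortened to "mafia".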
        base_filename_without_date = base_filename.split("_")[0]

        # Construct the filename suffix based on the use_output_suffix parameter
        if use_output_suffix:
            filename_suffix = "_output.csv"
        else:
            filename_suffix = "_visited_urls.csv"

        # Construct the filename for the visited URLs file
        visited_urls_filename = f"{base_filename_without_date}{filename_suffix}"

        return visited_urls_filename

    # Method to start the requests
    def start_requests(self):

        # Check if the visited URLs file exists to see if it's the first run or a continuation after a failure
        if os.path.exists(self.visited_urls_file):
            # Open the file and create a set of visited URLs
            with open(self.visited_urls_file, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                visited_urls = {row[0] for row in reader if row}
        else:
            # If the file doesn't exist, it's the first run
            visited_urls = set()

        # Open the start URL CSV and iterate through it
        with open(self.start_urls_csv, "r", newline="") as file:
            reader = csv.DictReader(file)

            iteration = 1

            for row in reader:
                url = row["URL"]

                print(f"iteration number {iteration}")

                # Check the URL against the already visited ones
                if url not in visited_urls:
                    visited_urls.add(url)

                    # Make the HTTP request
                    response = requests.get(url)
                    if response.status_code == 200:
                        # Call the parse method to handle the response
                        self.parse(response)
                    elif response.status_code == 403:

                        print('status is 403\n' * 3)

                        print('start the 4 minute wait')

                        # Wait for 4 minutes before moving on to the next URL.
                        # This URL is not written to the visited file, so a later run requests it again.
                        for i in range(1,5):
                            time.sleep(60)
                            print(f'{i} minutes have passed')
                iteration += 1
            print('************** End **************')
            print('************** End **************')


    # Method to handle parsing of the response
    def parse(self, response):
        # Parse the HTML content
        soup = BeautifulSoup(response.content, "html.parser")

        # Find all article elements on the page
        articles = soup.find_all("article")

        # List to store extracted article data
        article_data = []

        for article in articles:
            # Locate the anchor tag inside the article's <h1> heading
            anchor_html = article.find("h1").find("a")

            # Extract clean text from the title
            title = anchor_html.get_text(separator=" ", strip=True)

            # Extract the article link
            link = anchor_html["href"]

            # Extract the publication date (the text of the last link inside <aside>)
            aside_element = article.find("aside").find_all("a")
            date = aside_element[-1].get_text(separator=" ", strip=True)

            # Add article data to the list
            article_data.append(
                {"title": title, "link": link, "date": date, "page_url": response.url}
            )

        # Write the extracted article data to the output CSV file
        output_file = self._build_visited_urls_file(use_output_suffix=True)
        with open(output_file, "a", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(
                file, fieldnames=["title", "link", "date", "page_url"]
            )

            # Write the header if the file is empty
            if os.stat(output_file).st_size == 0:
                writer.writeheader()

            # Write a row for each article data
            for data in article_data:
                writer.writerow(data)

        # Add the URL to the visited file
        with open(self.visited_urls_file, "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow([response.url])


<div style="text-align: justify;">
  <h2>Running the MySpider Class</h2>
  <p>To start the scraping process, users need to provide a CSV file containing the URLs to be scraped, listed under a column named <code>URL</code>.</p>
  <p>If users have already collected URLs using the <code>scraper()</code> function, the CSV file generated by the <font color="blue">'url_collector.ipynb'</font> script is named after the search parameters. For example, if <code>scraper()</code> was called like this:</p>
  <pre><code>scraper("pizza", "1984-01-01", "1987-01-01", "any")</code></pre>
  <p>the generated CSV file should have a name like this:</p>
  <pre><code><font color="green">'pizza_1984-01-01_1987-01-01.csv'</font></code></pre>
  <p>When running the <code>MySpider</code> class, users should pass the name of this CSV file as a parameter, including the file extension. For example:</p>
  <pre><code>spider = MySpider(start_urls_csv=<font color="green">'pizza_1984-01-01_1987-01-01.csv'</font>)
spider.start_requests()</code></pre>
  <p>This code snippet creates an instance of the <code>MySpider</code> class and starts the scraping process using the specified CSV file.</p>
</div>
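
<p>The input filename also determines where the spider writes its results. The sketch below mirrors the naming rule used by <code>_build_visited_urls_file()</code>; the helper <code>derived_filenames</code> is a hypothetical name introduced only for illustration and is not part of the class.</p>

```python
import os
from typing import Tuple


def derived_filenames(start_urls_csv: str) -> Tuple[str, str]:
    """Return (visited-URLs file, output file) for a given start URLs CSV.

    Mirrors MySpider's rule: drop the directory and extension, then keep
    only the text before the first underscore.
    """
    base = os.path.splitext(os.path.basename(start_urls_csv))[0]
    prefix = base.split("_")[0]
    return f"{prefix}_visited_urls.csv", f"{prefix}_output.csv"


print(derived_filenames("pizza_1984-01-01_1987-01-01.csv"))
# ('pizza_visited_urls.csv', 'pizza_output.csv')

print(derived_filenames("mafia_nigeriana_2015-07-12_2024-01-01_d.csv"))
# ('mafia_visited_urls.csv', 'mafia_output.csv')
```

<p>Because only the first word of the keyword is kept, two searches whose keywords start with the same word share the same visited-URLs and output files.</p>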


In [None]:
if __name__ == "__main__":
    # Create an instance of MySpider and start the scraping process
    spider = MySpider(start_urls_csv="mafia_nigeriana_2015-07-12_2024-01-01_d.csv")
    spider.start_requests()
