# Description of the `FinanceNewsScraper` Class

The `FinanceNewsScraper` class is designed to collect finance-related news from the Google News RSS feed, filtering articles by two sets of buzzwords and a given date range.

- **Initialization (`__init__`)**:
  - The scraper accepts two sets of buzzwords:
    - **Must-have buzzwords**: Keywords of which at least one must appear in the article title or description.
    - **Percentage-based buzzwords**: Keywords of which at least a required percentage (`required_percentage`) must appear in the article title or description.
  - It also takes a `start_date` and `end_date` (both `YYYY-MM-DD`) and an `interval` in days that sets the size of each scraping chunk (e.g., 7 for weekly).

- **URL Construction (`construct_url`)**: 
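  - This function builds a Google News RSS URL that joins all buzzwords with `AND` and restricts results with the `after:` and `before:` date operators. The URL carries no business-section path, so the topical focus comes from the buzzwords themselves.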

- **Fetching Data (`fetch_rss_feed`)**: 
  - This function retrieves the RSS feed using the constructed URL, retrying up to three times if errors are encountered.

- **Keyword Matching**:
  - **Must-have buzzwords**: Ensures that at least one must-have buzzword appears in the article's title or description.
  - **Percentage-based buzzwords**: Verifies that a minimum percentage of the provided buzzwords are present in the article.

- **Article Parsing (`parse_articles`)**: 
  - This function parses the RSS feed and extracts relevant information such as the article title, URL, and publication date, but only for articles that match the buzzword criteria.

- **Scraping (`scrape`)**: 
  - This method iterates through the specified date range, fetching and parsing articles in chunks as defined by the provided interval.

- **Saving to CSV (`save_to_csv`)**: 
  - After scraping, the articles are saved to a CSV file using the `pandas` library for easy storage and further analysis.
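
The two keyword rules described above can be illustrated in isolation. The following is a minimal sketch, not the class's own code: `required_matches` and `passes_filters` are hypothetical helper names, and the sketch applies both rules to a single text, whereas the class checks the title and the description separately.

```python
import math

def required_matches(num_buzzwords, required_percentage):
    """Smallest number of buzzwords that satisfies the percentage threshold."""
    return math.ceil(num_buzzwords * required_percentage / 100)

def passes_filters(text, primary, secondary, required_percentage):
    """Hypothetical standalone version of both checks, applied to one text."""
    text = text.lower()
    # Rule 1: at least one must-have buzzword appears.
    has_primary = any(word.lower() in text for word in primary)
    # Rule 2: enough percentage-based buzzwords appear.
    found = sum(1 for word in secondary if word.lower() in text)
    return has_primary and found >= required_matches(len(secondary), required_percentage)

primary = ["israel", "gaza"]
secondary = ["military", "humanitarian", "escalation", "troops"]

print(required_matches(len(secondary), 25))  # 1
print(required_matches(len(secondary), 60))  # 3
print(passes_filters("Humanitarian aid reaches Gaza", primary, secondary, 25))  # True
print(passes_filters("Humanitarian aid reaches Gaza", primary, secondary, 60))  # False
```

Because the count is rounded up, a 60% threshold over four buzzwords requires three matches, not two.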

This class simplifies scraping Google News for articles that match a set of keywords and saves the results as a CSV file.
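
As a rough illustration of the URL construction step, the sketch below assembles the same kind of query string with standard percent-encoding. `build_feed_url` is a hypothetical helper, not part of the class. It reuses the base URL the class defines, and whether Google News honors the `q` parameter on that path is an assumption that is not verified here.

```python
from urllib.parse import quote

BASE_URL = "https://news.google.com/rss"  # same base URL the class uses (assumed to accept q=)

def build_feed_url(buzzwords, start_date, end_date, base_url=BASE_URL):
    """Hypothetical helper: join buzzwords with AND and append the date operators."""
    # Drop padding spaces and empty entries before building the query.
    terms = [word.strip() for word in buzzwords if word.strip()]
    query = " AND ".join(terms) + f" after:{start_date} before:{end_date}"
    # quote() percent-encodes spaces and unsafe characters; ':' is left readable.
    return f"{base_url}?q={quote(query, safe=':')}&hl=en-US&gl=US&ceid=US:en"

url = build_feed_url([" Israel ", " Gaza "], "2024-10-14", "2024-10-16")
print(url)
```

Encoding the whole query in one step avoids doubled `%20` sequences when buzzwords are padded with spaces.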


In [6]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import math
import time
import csv
import pandas as pd

In [7]:
class FinanceNewsScraper:
    def __init__(self, primary_buzzwords, secondary_buzzwords, start_date, end_date, required_percentage, interval):
        """
        Initialize the scraper with two sets of buzzwords, a date range, a required percentage, and an interval.
        :param primary_buzzwords: List of must-have buzzwords; at least one must be present.
        :param secondary_buzzwords: List of buzzwords to search for with percentage matching.
        :param start_date: The start date (YYYY-MM-DD) for the articles.
        :param end_date: The end date (YYYY-MM-DD) for the articles.
        :param required_percentage: The percentage (0-100) of percentage-based buzzwords that should be present.
        :param interval: Number of days covered by each request (e.g., 7 for weekly chunks).
        """
        self.primary_buzzwords = primary_buzzwords
        self.secondary_buzzwords = secondary_buzzwords
        self.start_date = datetime.strptime(start_date, '%Y-%m-%d')
        self.end_date = datetime.strptime(end_date, '%Y-%m-%d')
        self.required_percentage = required_percentage / 100  # Convert percentage to decimal for calculations
        self.base_url = "https://news.google.com/rss"
        self.interval = interval
        self.max_retries = 3  # Number of retries in case of failure

    def construct_url(self, start_date, end_date):
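        """
        Construct the Google News RSS URL with all buzzwords and date range.
        :param start_date: Start of the window (YYYY-MM-DD), used with the 'after:' operator.
        :param end_date: End of the window (YYYY-MM-DD), used with the 'before:' operator.
        :return: The constructed URL.
        """
        combined_buzzwords = self.primary_buzzwords + self.secondary_buzzwords
        # Strip padding spaces, then join with 'AND' so the search asks for every buzzword
        query = " AND ".join(buzzword.strip() for buzzword in combined_buzzwords)
        formatted_query = query.replace(" ", "%20")  # Format query for URL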
        
        url = f"{self.base_url}?q={formatted_query}+after:{start_date}+before:{end_date}&hl=en-US&gl=US&ceid=US:en"
        
        return url

    def fetch_rss_feed(self, start_date, end_date):
        """
        Fetch the RSS feed from Google News for a given date range.
        :return: BeautifulSoup object with the RSS feed content.
        """
        rss_url = self.construct_url(start_date, end_date)
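        # Try up to self.max_retries times before giving up
        for attempt in range(1, self.max_retries + 1):
            try:
                response = requests.get(rss_url, timeout=10)
                if response.status_code == 200:
                    return BeautifulSoup(response.content, 'xml')  # Parsing as XML
                error = f"Failed to retrieve RSS feed with status code {response.status_code}"
            except requests.RequestException as e:
                error = f"Error fetching the RSS feed: {e}"
            if attempt < self.max_retries:
                time.sleep(attempt)  # Wait a little longer before each retry
        print(error)  # Report only the last failure once all attempts are used
        return None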

    def contains_any_primary_buzzwords(self, text):
        """
        Check if any must-have buzzwords are present in the given text.
        :param text: The text to search for must-have buzzwords (case-insensitive).
        :return: True if at least one must-have buzzword is found, False otherwise.
        """
        text = text.lower()
        return any(buzzword.lower() in text for buzzword in self.primary_buzzwords)

    def contains_percentage_of_buzzwords(self, text):
        """
        Check if at least the required percentage of percentage-based buzzwords are present in the given text.
        :param text: The text to search for percentage-based buzzwords (case-insensitive).
        :return: True if the required percentage of percentage-based buzzwords are found, False otherwise.
        """
        text = text.lower()
        buzzwords_found = sum(1 for buzzword in self.secondary_buzzwords if buzzword.lower() in text)
        required_count = math.ceil(len(self.secondary_buzzwords) * self.required_percentage)
        
        # True when at least the required number of buzzwords is found
        return buzzwords_found >= required_count

    def parse_articles(self, soup):
        """
        Parse the RSS feed and extract article information.
        Only return articles where at least one must-have buzzword and the required percentage of percentage-based buzzwords are found.
        :param soup: BeautifulSoup object of the RSS feed.
        :return: List of dictionaries with article titles, URLs, and publication dates.
        """
        articles = []
        for item in soup.find_all('item'):
            title = item.title.text
            link = item.link.text
            description = item.description.text if item.description else ""
            pub_date = item.pubDate.text
            pub_date = datetime.strptime(pub_date, '%a, %d %b %Y %H:%M:%S %Z')  # Parse the RSS date string into a datetime
            
            # Keep only the first 100 words of the description, then check title and description for must-have buzzwords
            first_100_words = " ".join(description.split()[:100])
            if self.contains_any_primary_buzzwords(title) or self.contains_any_primary_buzzwords(first_100_words):
                # Check if the required percentage of percentage-based buzzwords are present
                if self.contains_percentage_of_buzzwords(title) or self.contains_percentage_of_buzzwords(first_100_words):
                    articles.append({'title': title, 'url': link, 'date': pub_date})
        return articles

    def scrape(self):
        """
        Scrape the RSS feed and extract articles that match both must-have and percentage-based buzzwords.
        :return: List of articles (titles, URLs, and dates).
        """
        all_articles = []
        delta = timedelta(days=self.interval)  # Fetch in intervals (e.g., weekly)
        current_start_date = self.start_date
        print(f"Fetching articles from {current_start_date.strftime('%Y-%m-%d')} to {self.end_date.strftime('%Y-%m-%d')}")

        # Loop through the date range with the specified interval
        while current_start_date < self.end_date:
            current_end_date = min(current_start_date + delta, self.end_date)

            soup = self.fetch_rss_feed(current_start_date.strftime('%Y-%m-%d'),
                                       current_end_date.strftime('%Y-%m-%d'))
            if soup:
                articles = self.parse_articles(soup)
                all_articles.extend(articles)

            current_start_date += delta  # Move to the next interval

        if all_articles:
            print(f"Found {len(all_articles)} articles matching the criteria.")
        else:
            print("No articles found matching the criteria.")
        return all_articles

    @staticmethod
    def save_to_csv(articles, filename="articles.csv"):
        """
        Save the scraped articles to a CSV file using pandas.
        :param articles: List of articles with title, URL, and date.
        :param filename: The name of the CSV file (default is "articles.csv").
        """
        # Convert the list of articles to a pandas DataFrame
        df = pd.DataFrame(articles)
        
        # Save DataFrame to CSV
        df.to_csv(filename, index=False, encoding='utf-8')
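
The chunking that `scrape` performs can be seen more clearly when the window generation is separated from the network calls. This is a minimal sketch under that assumption. `date_windows` is a hypothetical helper that mirrors the loop in `scrape` and is not a method of the class.

```python
from datetime import datetime, timedelta

def date_windows(start_date, end_date, interval_days):
    """Yield (window_start, window_end) pairs that cover start_date up to end_date."""
    delta = timedelta(days=interval_days)
    current = start_date
    while current < end_date:
        # The last window is clipped so it never runs past end_date.
        yield current, min(current + delta, end_date)
        current += delta

start = datetime(2024, 10, 14)
end = datetime(2024, 10, 16)
windows = [(a.strftime("%Y-%m-%d"), b.strftime("%Y-%m-%d"))
           for a, b in date_windows(start, end, 1)]
print(windows)  # [('2024-10-14', '2024-10-15'), ('2024-10-15', '2024-10-16')]
```

With a one-day interval over a two-day range, the loop issues two requests, one per window.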


In [5]:
# Surrounding spaces make the substring check match only when the word is set off by spaces
primary_buzzwords = [" Israel ", " Gaza "]  # At least one of these must appear
secondary_buzzwords = [" military ", " humanitarian ", " escalation ", " troops "]  # Matched by percentage
start_date = "2024-10-14"  # Start date
end_date = "2024-10-16"  # End date
required_percentage = 25  # At least 25% of the secondary buzzwords must appear in the title or description
interval = 1  # Chunk size in days

scraper = FinanceNewsScraper(primary_buzzwords, secondary_buzzwords, start_date, end_date, required_percentage, interval)
articles = scraper.scrape()

# Save the output as a CSV file
FinanceNewsScraper.save_to_csv(articles, "finance_news.csv")

Fetching articles from 2024-10-14 to 2024-10-16
Failed to retrieve RSS feed with status code 503
Failed to retrieve RSS feed with status code 503
No articles found matching the criteria.
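
Transient failures such as the 503 responses in this run are the kind of error a retry loop is meant to absorb. The sketch below shows that pattern in isolation. `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the stand-in fetch function replaces the real network call so the example runs offline.

```python
import time

def fetch_with_retries(fetch, max_retries=3, delay_seconds=1.0):
    """Call fetch() until it returns a non-None result or the attempts run out."""
    for attempt in range(1, max_retries + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_retries:
            time.sleep(delay_seconds * attempt)  # wait longer after each failure
    return None

# Stand-in for a network call: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    return "feed" if calls["count"] >= 3 else None

result = fetch_with_retries(flaky_fetch, max_retries=3, delay_seconds=0)
print(result, calls["count"])  # feed 3
```

Passing the fetch step in as a callable keeps the retry policy independent of how the request is made.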
