# Description of the `FinanceNewsScraper` Class

The `FinanceNewsScraper` class is designed to scrape financial news articles from the business section of Google News based on a set of specified buzzwords and a given date range.

- **Initialization (`__init__`)**: 
  - The scraper accepts two sets of buzzwords:
    - **Must-have buzzwords**: Keywords of which at least one must appear in the article title or description.
    - **Percentage-based buzzwords**: Keywords of which at least a required percentage must appear in the article title or description.
  - It also takes a `start_date`, an `end_date`, a `required_percentage`, and an `interval` in days for scraping in chunks (e.g., 7 for weekly).

- **URL Construction (`construct_url`)**: 
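  - This function builds a Google News RSS URL from the provided buzzwords (joined with `AND`) and the date range (passed through the `after:` and `before:` search operators). As written, the URL does not include a business-topic filter, so keeping results business-related relies on the buzzwords themselves.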

- **Fetching Data (`fetch_rss_feed`)**: 
  - This function retrieves the RSS feed using the constructed URL, retrying up to `max_retries` times (5 by default) if errors are encountered.
  - **Robust Retry Mechanism**:
    - To keep scraping stable during network issues or HTTP errors such as 503, the method waits between attempts with exponential backoff: the delay starts at 5 seconds and doubles after each failed attempt. If every attempt fails to retrieve the Google News RSS feed, the method returns `None` and that date chunk is skipped.


- **Keyword Matching**:
  - **Must-have buzzwords**: Ensures that at least one must-have buzzword appears in the article's title or description.
  - **Percentage-based buzzwords**: Verifies that a minimum percentage of the provided buzzwords are present in the article.

- **Article Parsing (`parse_articles`)**: 
  - This function parses the RSS feed and extracts relevant information such as the article title, URL, and publication date, but only for articles that match the buzzword criteria.

- **Scraping (`scrape`)**: 
  - This method iterates through the specified date range, fetching and parsing articles in chunks as defined by the provided interval.

- **Saving to CSV (`save_to_csv`)**: 
  - After scraping, the articles are saved to a CSV file using the `pandas` library for easy storage and further analysis.
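
The retry timing described under **Fetching Data** can be illustrated without touching the network. The sketch below only computes the wait schedule, assuming the defaults used by `fetch_rss_feed` (5 attempts, a 5-second initial delay, and a backoff factor of 2). The helper name `backoff_delays` is hypothetical and not part of the class.

```python
def backoff_delays(max_retries=5, initial_delay=5, backoff_factor=2):
    """Return the wait (in seconds) applied after each failed attempt."""
    delays = []
    delay = initial_delay
    for _ in range(max_retries):
        delays.append(delay)
        delay *= backoff_factor  # the wait grows geometrically
    return delays


print(backoff_delays())       # [5, 10, 20, 40, 80]
print(sum(backoff_delays()))  # 155
```

Under these assumptions, a date chunk that never succeeds costs about 155 seconds of waiting before it is skipped, which is worth keeping in mind when the interval is small and the date range is long.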

This class simplifies the process of scraping Google News for business-related articles based on keywords, while also offering functionality to save the results as a CSV file for later analysis.
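
As a rough illustration of the keyword rule, the sketch below applies both checks to a single piece of text: at least one must-have buzzword, plus at least `ceil(n * percentage / 100)` of the `n` percentage-based buzzwords. The function name `matches_buzzwords` and the sample headlines are made up for this example. The class itself checks the title and the description separately, so treat this as a simplified view.

```python
import math


def matches_buzzwords(text, primary, secondary, required_percentage):
    """Return True if text has any primary buzzword and enough secondary ones."""
    text = text.lower()
    has_primary = any(word.lower() in text for word in primary)
    found = sum(1 for word in secondary if word.lower() in text)
    # Round up, so 30% of 3 buzzwords means at least 1 must be present
    required = math.ceil(len(secondary) * required_percentage / 100)
    return has_primary and found >= required


secondary = ["stock", "china", "strike"]
print(matches_buzzwords("Apple stock slips as China sales slow", ["apple"], secondary, 30))  # True
print(matches_buzzwords("Apple unveils new headset", ["apple"], secondary, 30))              # False
```

Matching is by case-insensitive substring, so a buzzword such as "apple" would also match "pineapple". Word-boundary matching would be stricter, at the cost of missing plurals and compounds.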


In [6]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import math
import time
import csv
import pandas as pd

In [10]:
class FinanceNewsScraper:
    def __init__(self, primary_buzzwords, secondary_buzzwords, start_date, end_date, required_percentage, interval):
        """
        Initialize the scraper with two sets of buzzwords, a date range, a required percentage, and a chunk interval.
        :param primary_buzzwords: List of must-have buzzwords; at least one must be present.
        :param secondary_buzzwords: List of buzzwords to search for with percentage matching.
        :param start_date: The start date (YYYY-MM-DD) for the articles.
        :param end_date: The end date (YYYY-MM-DD) for the articles.
        :param required_percentage: The percentage (0-100) of secondary buzzwords that should be present.
        :param interval: Number of days covered by each fetch (e.g., 7 for weekly chunks).
        """
        self.primary_buzzwords = primary_buzzwords
        self.secondary_buzzwords = secondary_buzzwords
        self.start_date = datetime.strptime(start_date, '%Y-%m-%d')
        self.end_date = datetime.strptime(end_date, '%Y-%m-%d')
        self.required_percentage = required_percentage / 100  # Convert percentage to decimal for calculations
        self.base_url = "https://news.google.com/rss"
        self.interval = interval
        self.max_retries = 3  # Number of retries in case of failure

    def construct_url(self, start_date, end_date):
        """
        Construct the Google News RSS URL with all buzzwords and date range.
        :return: The constructed URL.
        """
        combined_buzzwords = self.primary_buzzwords + self.secondary_buzzwords
        query = " AND ".join(combined_buzzwords)  # Combine all buzzwords with 'AND' to ensure all words are present
        formatted_query = query.replace(" ", "%20")  # Format query for URL
        
        url = f"{self.base_url}?q={formatted_query}+after:{start_date}+before:{end_date}&hl=en-US&gl=US&ceid=US:en"
        
        return url

    def fetch_rss_feed(self, start_date, end_date, max_retries=5, backoff_factor=2):
        """
        Fetch the RSS feed from Google News for a given date range, with retries and exponential backoff to avoid 503 errors.
        :param start_date: The start date for fetching articles.
        :param end_date: The end date for fetching articles.
        :param max_retries: Maximum number of retries if the request fails.
        :param backoff_factor: Factor by which the wait time increases after each failure.
        :return: BeautifulSoup object with the RSS feed content.
        """
        rss_url = self.construct_url(start_date, end_date)
        attempt = 0
        delay = 5  # Start with an initial delay of 5 seconds

        while attempt < max_retries:
            try:
                # A timeout keeps a stalled connection from blocking the retry loop indefinitely
                response = requests.get(rss_url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
                
                if response.status_code == 200:
                    return BeautifulSoup(response.content, 'xml')  # Parsing as XML
                else:
                    print(f"Failed to retrieve RSS feed with status code {response.status_code}. Retrying...")

            except requests.RequestException as e:
                print(f"Error fetching the RSS feed: {e}. Retrying...")

            # Apply the exponential backoff
            attempt += 1
            time.sleep(delay)
            delay *= backoff_factor  # Increase the delay exponentially

        print("Max retries exceeded. Could not fetch the RSS feed.")
        return None


    def contains_any_primary_buzzwords(self, text):
        """
        Check if any must-have buzzwords are present in the given text.
        :param text: The text to search for must-have buzzwords (case-insensitive).
        :return: True if at least one must-have buzzword is found, False otherwise.
        """
        text = text.lower()
        return any(buzzword.lower() in text for buzzword in self.primary_buzzwords)

    def contains_percentage_of_buzzwords(self, text):
        """
        Check if at least the required percentage of percentage-based buzzwords are present in the given text.
        :param text: The text to search for percentage-based buzzwords (case-insensitive).
        :return: True if the required percentage of percentage-based buzzwords are found, False otherwise.
        """
        text = text.lower()
        buzzwords_found = sum(1 for buzzword in self.secondary_buzzwords if buzzword.lower() in text)
        required_count = math.ceil(len(self.secondary_buzzwords) * self.required_percentage)
        
        # The condition now checks if at least the required count of buzzwords is found
        return buzzwords_found >= required_count

    def parse_articles(self, soup):
        """
        Parse the RSS feed and extract article information.
        Only return articles where at least one must-have buzzword and the required percentage of percentage-based buzzwords are found.
        :param soup: BeautifulSoup object of the RSS feed.
        :return: List of dictionaries with article titles, URLs, and publication dates.
        """
        articles = []
        for item in soup.find_all('item'):
            title = item.title.text
            link = item.link.text
            description = item.description.text if item.description else ""
            pub_date = item.pubDate.text
            pub_date = datetime.strptime(pub_date, '%a, %d %b %Y %H:%M:%S %Z')  # Format the date
            
            # Check if any must-have buzzwords are present in title or description
            first_100_words = " ".join(description.split()[:100])
            if self.contains_any_primary_buzzwords(title) or self.contains_any_primary_buzzwords(first_100_words):
                # Check if the required percentage of percentage-based buzzwords are present
                if self.contains_percentage_of_buzzwords(title) or self.contains_percentage_of_buzzwords(first_100_words):
                    articles.append({'title': title, 'url': link, 'date': pub_date})
        return articles

    def scrape(self):
        """
        Scrape the RSS feed and extract articles that match both must-have and percentage-based buzzwords.
        :return: List of articles (titles, URLs, and dates).
        """
        all_articles = []
        delta = timedelta(days=self.interval)  # Fetch in intervals (e.g., weekly)
        current_start_date = self.start_date
        print(f"Fetching articles from {current_start_date.strftime('%Y-%m-%d')} to {self.end_date.strftime('%Y-%m-%d')}")

        # Loop through the date range with the specified interval
        while current_start_date < self.end_date:
            current_end_date = min(current_start_date + delta, self.end_date)

            soup = self.fetch_rss_feed(current_start_date.strftime('%Y-%m-%d'),
                                       current_end_date.strftime('%Y-%m-%d'))
            if soup:
                articles = self.parse_articles(soup)
                all_articles.extend(articles)

            current_start_date += delta  # Move to the next interval

        if all_articles:
            print(f"Found {len(all_articles)} articles matching the criteria.")
        else:
            print("No articles found matching the criteria.")
        return all_articles

    @staticmethod
    def save_to_csv(articles, filename="articles.csv"):
        """
        Save the scraped articles to a CSV file using pandas.
        Declared as a static method because it does not use any instance state.
        :param articles: List of articles with title, URL, and date.
        :param filename: The name of the CSV file (default is "articles.csv").
        """
        # Convert the list of articles to a pandas DataFrame and drop duplicates in case there are any
        df = pd.DataFrame(articles).drop_duplicates()
        
        # Save DataFrame to CSV
        df.to_csv(filename, index=False, encoding='utf-8')

In [11]:
primary_buzzwords = ["apple"]
secondary_buzzwords = ["stock", "china", "strike"]  # List of buzzwords to search for
start_date = "2024-09-01"  # Start date
end_date = "2024-10-20"  # End date
required_percentage = 30  # required_percentage% of the secondary buzzwords should be in title or description
interval = 1

scraper_news = FinanceNewsScraper(primary_buzzwords, secondary_buzzwords, start_date, end_date, required_percentage, interval)
articles_news = scraper_news.scrape()

# Save the output as a CSV file
FinanceNewsScraper.save_to_csv(articles_news, "finance_news.csv")

Fetching articles from 2024-09-01 to 2024-10-20
Found 13 articles matching the criteria.
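
The run above uses `interval = 1`, so the date range is split into one-day windows and one RSS request is issued per window. The sketch below reproduces only the windowing logic of `scrape` so the number of requests can be checked before running the scraper. The helper name `date_windows` is hypothetical and not part of the class.

```python
from datetime import datetime, timedelta


def date_windows(start_date, end_date, interval_days):
    """Split [start_date, end_date) into consecutive windows of interval_days."""
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    step = timedelta(days=interval_days)
    windows = []
    while start < end:
        window_end = min(start + step, end)  # clip the last window to end_date
        windows.append((start.strftime("%Y-%m-%d"), window_end.strftime("%Y-%m-%d")))
        start += step
    return windows


windows = date_windows("2024-09-01", "2024-10-20", 1)
print(len(windows))  # 49
print(windows[0])    # ('2024-09-01', '2024-09-02')
print(windows[-1])   # ('2024-10-19', '2024-10-20')
```

With a weekly interval (`interval_days=7`) the same range needs only 7 requests. The trade-off is that each request then covers more days, so it depends on how many items a single feed response returns.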


### `FinanceNewsAPIScraper` Class Description

The `FinanceNewsAPIScraper` class is designed to fetch, filter, and save news articles from NewsAPI based on specified buzzwords. Key functionalities:

- **Initialization**: Takes in an API key, primary and secondary buzzwords, date range, and retry settings.
- **Fetch News**: Sends API requests and retries if rate limits are hit.
- **Filter**: Filters articles to ensure primary buzzwords are present, with a required percentage of secondary buzzwords.
- **Display & Save**: Displays the filtered articles and provides an option to save them to a CSV file.

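### Key Methods:
- `fetch_news()`: sends the request to NewsAPI and retries on rate-limit (HTTP 429) responses.
- `filter_articles()`: keeps articles whose title and description satisfy both buzzword checks.
- `display_articles()`: prints the title and publication time of each article.
- `scrape_and_filter_news()`: runs fetch, filter, and display in sequence and returns the filtered list.
- `save_to_csv()`: writes the title, publication time, and URL of each article to a CSV file.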

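
The rate-limit handling in `fetch_news()` can be sketched in isolation: retry only on HTTP 429, wait `retry_after` seconds between attempts, and give up on any other error. In the sketch below, `send_request` is a hypothetical callable returning a `(status_code, payload)` pair, and `sleep` is injectable so the behaviour can be exercised without a real API call or a real wait.

```python
import time


def fetch_with_rate_limit_retry(send_request, retries=3, retry_after=60, sleep=time.sleep):
    """Call send_request() until it succeeds, waiting after each 429 response."""
    for _ in range(retries):
        status_code, payload = send_request()
        if status_code == 200:
            return payload
        if status_code != 429:
            return None  # other errors are not retried
        sleep(retry_after)  # rate limited: wait, then try again
    return None  # retries exhausted


# Simulated responses: one rate-limit hit, then success
responses = iter([(429, None), (200, {"articles": []})])
waits = []
result = fetch_with_rate_limit_retry(lambda: next(responses), retry_after=60, sleep=waits.append)
print(result)  # {'articles': []}
print(waits)   # [60]
```

A fixed wait is simple and fits a quota that resets on a schedule. An exponential backoff, as used in `FinanceNewsScraper`, would be an alternative if the service does not indicate when to retry.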

In [12]:
class FinanceNewsAPIScraper:
    def __init__(self, api_key, primary_buzzwords, secondary_buzzwords, start_date, end_date, required_percentage, retry_after):
        self.api_key = api_key
        self.primary_buzzwords = primary_buzzwords
        self.secondary_buzzwords = secondary_buzzwords
        self.start_date = start_date
        self.end_date = end_date
        self.required_percentage = required_percentage / 100
        self.base_url = 'https://newsapi.org/v2/everything'
        self.retry_after = retry_after

    def contains_any_primary_buzzwords(self, text):
        text = text.lower()
        return any(buzzword.lower() in text for buzzword in self.primary_buzzwords)

    def contains_required_percentage_of_secondary_buzzwords(self, text):
        text = text.lower()
        buzzwords_found = sum(1 for buzzword in self.secondary_buzzwords if buzzword.lower() in text)
        required_count = math.ceil(len(self.secondary_buzzwords) * self.required_percentage)
        return buzzwords_found >= required_count

    def fetch_news(self, retries=3):
        params = {
                'q': ' OR '.join(self.primary_buzzwords + self.secondary_buzzwords),
                'apiKey': self.api_key,
                'from': self.start_date,
                'to': self.end_date,
                'language': 'en',
                'sortBy': 'relevancy'
            }


        attempt = 0
        while attempt < retries:
            response = requests.get(self.base_url, params=params)
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                print(f"Rate limit exceeded. Retrying after {self.retry_after} seconds...")
                time.sleep(self.retry_after)
            else:
                print(f"Failed to fetch news articles. Status code: {response.status_code}")
                return None
            attempt += 1
        print("Max retries exceeded. Could not fetch the news.")
        return None

    def filter_articles(self, news_data):
        filtered_articles = []
        if news_data and 'articles' in news_data:
            for article in news_data['articles']:
                title = article['title'] or ""  # Guard against a missing title, as with the description
                description = article['description'] or ""
                content = title + " " + description
                if self.contains_any_primary_buzzwords(content) and self.contains_required_percentage_of_secondary_buzzwords(content):
                    filtered_articles.append(article)
        return filtered_articles

    def display_articles(self, articles):
        if articles:
            for i, article in enumerate(articles, start=1):
                print(f"{i}. {article['title']} ({article['publishedAt']})")
        else:
            print("No articles found.")

    def scrape_and_filter_news(self):
        news_data = self.fetch_news()
        if news_data:
            filtered_articles = self.filter_articles(news_data)
            self.display_articles(filtered_articles)
            return filtered_articles  # Ensure filtered_articles is returned
        else:
            print("Failed to retrieve or filter articles.")
            return []

    def save_to_csv(self, articles, filename):
        if articles:
            data = [{
                'title': article['title'],
                'publishedAt': article['publishedAt'],
                'url': article['url']
            } for article in articles]
            df = pd.DataFrame(data)
            df.to_csv(filename, index=False, encoding='utf-8')
            print(f"Articles saved to {filename}")
        else:
            print("No articles to save.")


In [13]:
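import os

# Load NewsAPI keys from the environment instead of hardcoding secrets in the notebook.
# NEWSAPI_KEYS is an assumed variable name holding a comma-separated list of keys.
api_keys = [key.strip() for key in os.environ.get("NEWSAPI_KEYS", "").split(",") if key.strip()]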

**News Focus:** `Geopolitical conflicts`


In [None]:
secondary_buzzwords = ["israel", "gaza", "palestine", "conflict", "war", "hamas",
                       "ukraine", "russia", "airstrike", "attack", "crisis", "oil", "prices", "nato", "invasion",
                       "iran", "afghanistan", "china", "taiwan", "military",
                       "indo-pacific", "south china sea", "market", "nuclear", "escalate", "zelensky", "putin"]  # List of buzzwords to search for
required_percentage = 6
retry_after = 60
start_date = '2024-09-23'
end_date = '2024-10-22'