# Sentiment Analysis

We can crawl headlines from several publisher collections, but I don't yet know how to list which publishers are available.

In [29]:
from fundus import PublisherCollection, Crawler, NewsMap
from flair.data import Sentence
from flair.models import TextClassifier
from tqdm import tqdm

In [30]:
# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

In [31]:
print(article.title)
print(article.lang) # detects language of article
article.plaintext

“Black Doves” Offers a Sentimental Spin on the Spy Genre
en


'The Keira Knightley- and Ben Whishaw-led Netflix series eventually snares its protagonists in a traditional espionage plot—but it’s most interested in their friendship.\n\nIn the new mystery thriller “Black Doves,” the purview of secret agents can include school plays, bedtime stories, and holiday decorations. By day, Helen Webb (Keira Knightley) is a model wife to the U.K.’s Defense Secretary (Andrew Buchan) and a doting mother to their two young kids; by night, she relays intelligence to her handler, Mrs. Reed (Sarah Lancashire), who sells information gathered by “black doves” like Helen to the highest bidder. That’s when things go right. In the second episode, things go wrong: an assassin breaks into Helen’s kitchen, where she and the intruder tussle in the dark, their knock-down-drag-out brawl lit by the glow of a Christmas tree. Upon gaining the upper hand, Helen threatens him in the manner of a wrathful domestic goddess. “I have a cheese grater in the dishwasher,” she says. “I h

In [32]:
# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

In [33]:
def crawl_headlines(crawlers, name_of_publishers, article_number=20):
    """Crawls headlines from a list of crawlers for specified publishers.

    This function takes three arguments:
        - crawlers: A list of web crawlers, each responsible for a specific publisher.
        - name_of_publishers: A list of publisher names corresponding to the crawlers.
        - article_number (optional): The maximum number of articles to crawl per publisher. Defaults to 20.

    It returns a dictionary where the keys are publisher names and the values are lists of headlines crawled from those publishers.
    """

    headlines = {}

    # Pair each crawler with its publisher name.
    for crawler, name_of_publisher in zip(crawlers, name_of_publishers):
        publisher = []

        # Collect the title of each crawled article, up to article_number.
        for article in tqdm(crawler.crawl(max_articles=article_number)):
            publisher.append(article.title)

        headlines[name_of_publisher] = publisher

    return headlines

In [34]:
crawler1 = Crawler(PublisherCollection.us.CNBC)
crawler2 = Crawler(PublisherCollection.us.TheNation)

crawlers = [crawler1, crawler2]
names_of_publishers = ["CNBC", "The Nation"]
headlines = crawl_headlines(crawlers, names_of_publishers, article_number=10)

0it [00:40, ?it/s]
0it [00:00, ?it/s]

In [7]:
headlines

{'CNBC': ["Silicon Valley's White House influence grows as Trump taps tech execs for key roles",
  'Biden administration withdraws student loan forgiveness plans. What borrowers should know',
  'U.S. sues Walmart, Branch Messenger over payment accounts for delivery drivers',
  "Savannah James' worst money mistake still gives her the 'heebie jeebies': It was 'a substantial amount, I'm still stressed out'",
  'I asked a 105-year-old when middle age starts: Her answer delighted me—and made me feel better about turning 40',
  "House Ethics panel finds Matt Gaetz had sex with 17-year-old, 'regularly' paid for sex",
  'Here are our top 10 things to watch in the stock market Monday',
  'Nordstrom to go private in $6.25 billion deal with founding family, Mexican retailer',
  "This career coach 'always' negotiates for more PTO—her top 3 tips for making the ask",
  "Charge card vs. credit card: What's the difference?"],
 'The Nation': ['A Far-Right Attacker Kills 5 in a Christmas Market. The Ger

In [8]:
def predict_labels(publisher_headlines):
    """Predicts sentiment labels for headlines from each publisher.

    This function takes a dictionary `publisher_headlines` as input. 
    The dictionary keys are publisher names and the values are lists of headlines.

    The function performs sentiment analysis on each headline and stores the predicted labels 
    in a new dictionary with the same publisher names as keys.

    It returns a dictionary where the keys are publisher names and the values are lists of predicted sentiment labels 
    for the corresponding headlines.
    """

    sentiments_per_publisher = {}

    # Load flair's pre-trained sentiment classifier
    tagger = TextClassifier.load('sentiment')

    # Iterate over each publisher and its headlines
    for key, values in publisher_headlines.items():
        temp = []
        for value in values:
            sentence = Sentence(value)  # Wrap the headline in a flair Sentence
            tagger.predict(sentence)    # Attach the predicted sentiment label to the sentence
            temp.append(sentence.get_label().value)  # Keep only the label value ("POSITIVE" or "NEGATIVE")

        sentiments_per_publisher[key] = temp  # Store this publisher's labels in the result dictionary

    return sentiments_per_publisher
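
The labels alone hide how confident the classifier was. A minimal sketch of keeping a score next to each label follows. `classify` is a hypothetical callable (not part of flair) that returns a `(label, score)` pair, and the stub below only stands in for a real model. With flair, such a callable could wrap `tagger.predict(sentence)` and read `sentence.get_label().value` and `sentence.get_label().score`.

```python
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[str, float]


def predict_with_scores(
    publisher_headlines: Dict[str, List[str]],
    classify: Callable[[str], Prediction],
) -> Dict[str, List[Prediction]]:
    """Return (label, score) pairs per publisher, in headline order."""
    return {
        publisher: [classify(headline) for headline in titles]
        for publisher, titles in publisher_headlines.items()
    }


def stub_classify(headline: str) -> Prediction:
    # Hypothetical stand-in for a real model: labels a headline NEGATIVE
    # when it contains the word "worst", with a made-up fixed score.
    label = "NEGATIVE" if "worst" in headline.casefold() else "POSITIVE"
    return label, 0.5


# Made-up headlines, only to exercise the function.
sample = {"Example Publisher": ["Worst week for retailers", "A quiet Monday"]}
scored = predict_with_scores(sample, stub_classify)
print(scored)
```

Passing the classifier in as an argument also means the model is loaded once outside the function rather than on every call.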

In [9]:
sentiments_per_publisher=predict_labels(headlines) # getting sentiments

In [10]:
len(sentiments_per_publisher['CNBC'])

10

In [11]:
def print_statistics(sentiments_per_publisher, number_of_articles=None):
    """Prints the sentiment distribution for each publisher.

    Args:
        sentiments_per_publisher (dict): Maps publisher names to lists of sentiment labels for their headlines.
        number_of_articles (int, optional): Total to report per publisher. Defaults to the number of labels for that publisher.
    """

    for publisher, labels in sentiments_per_publisher.items():
        positive = 0
        negative = 0
        something_else = 0
        for label in labels:
            if label == "POSITIVE":
                positive += 1
            elif label == "NEGATIVE":
                negative += 1
            else:
                something_else += 1

        # Fall back to this publisher's own count so publishers of different sizes are reported correctly
        total = len(labels) if number_of_articles is None else number_of_articles
        print(f"{publisher} has {positive} positive and {negative} negative headlines out of {total}.")
        if something_else >= 1:
            print(f"{something_else} headlines received an unexpected label.")
        print()


In [12]:
print_statistics(sentiments_per_publisher,number_of_articles = len(sentiments_per_publisher['CNBC']))

CNBC has 4 positive and 6 negative headlines out of 10.

The Nation has 6 positive and 4 negative headlines out of 10.
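
Raw counts are hard to compare once publishers have different numbers of headlines. A minimal sketch of turning the labels into per-publisher shares with `collections.Counter` follows. `sentiment_shares` is a name made up here, and the example labels are synthetic, chosen only to match the counts printed above.

```python
from collections import Counter
from typing import Dict, List


def sentiment_shares(
    sentiments_per_publisher: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """Return the fraction of each label per publisher."""
    shares = {}
    for publisher, labels in sentiments_per_publisher.items():
        total = len(labels)
        if total == 0:
            # An empty list gets an empty dict, which avoids dividing by zero.
            shares[publisher] = {}
            continue
        counts = Counter(labels)
        shares[publisher] = {label: count / total for label, count in counts.items()}
    return shares


# Synthetic labels matching the counts printed above.
example = {
    "CNBC": ["POSITIVE"] * 4 + ["NEGATIVE"] * 6,
    "The Nation": ["POSITIVE"] * 6 + ["NEGATIVE"] * 4,
}
shares = sentiment_shares(example)
print(shares)
```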



In [13]:
# Yep, many of these titles sound negative to me
headlines['The Nation']

['A Far-Right Attacker Kills 5 in a Christmas Market. The German Far Right Takes Advantage.',
 'Novelist on a Deadline: Barry Malzberg, 1939–2024',
 'These Progressive Will Guide Us Through the Darkness',
 'The Best Albums of 2024',
 'A Stunning Year for Student Journalism',
 'The Spending Fiasco Was a Preview of the Trump-Musk Administration',
 'When the Feds Are Still Watching',
 'My 2025 Project: Starting a New Column, “Hiding in Plain Sight”',
 'The Billionaire Who Stole Christmas',
 'The Downsides of the Wind Energy Boom']

Let's try to filter by topic.

In [14]:
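from typing import Dict, Any

keywords = ['market', 'stock']

def body_filter(extracted: Dict[str, Any]) -> bool:
    # fundus drops an article when the filter returns True,
    # so return False (keep) as soon as a keyword appears in the body
    if body := extracted.get("body"):
        for word in keywords:
            if word in str(body).casefold():
                return False
    return True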

In [28]:
crawler = Crawler(PublisherCollection.us)

# keep only articles whose body mentions one of the keywords
for market_article in crawler.crawl(max_articles=2, only_complete=body_filter):
    print(market_article)
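
As I understand the fundus documentation, a filter passed to `only_complete` describes what to drop: returning `True` discards the article. The sketch below mimics that rule on plain dictionaries, without fundus, to check which articles a keyword filter lets through. The `keep_articles` helper and the sample bodies are made up for illustration. Unlike a substring test, it matches whole words, so "stock" does not match "Stockholm".

```python
import re
from typing import Any, Callable, Dict, List

KEYWORDS = {"market", "stock"}


def keyword_filter(extracted: Dict[str, Any]) -> bool:
    """Return True (drop) unless the body contains one of KEYWORDS as a whole word."""
    body = extracted.get("body")
    if not body:
        return True
    words = set(re.findall(r"[a-z']+", str(body).casefold()))
    return not (words & KEYWORDS)


def keep_articles(
    articles: List[Dict[str, Any]],
    drop_if: Callable[[Dict[str, Any]], bool],
) -> List[Dict[str, Any]]:
    # Hypothetical helper: keeps an article only when the filter returns False.
    return [article for article in articles if not drop_if(article)]


# Made-up sample data.
samples = [
    {"title": "a", "body": "The stock rallied on Monday."},
    {"title": "b", "body": "A trip to Stockholm."},
    {"title": "c", "body": None},
]
kept = keep_articles(samples, keyword_filter)
print([article["title"] for article in kept])
```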

In [26]:
import datetime

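def date_filter(extracted: Dict[str, Any]) -> bool:
    start_date = datetime.date(2024, 11, 1)
    end_date = datetime.date(2024, 12, 1)
    # fundus drops an article when the filter returns True,
    # so return True for anything outside the date range or without a date
    if publishing_date := extracted.get("publishing_date"):
        return not (start_date <= publishing_date.date() <= end_date)
    return True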

In [27]:
# keep only articles published within the date range
for dated_article in crawler.crawl(max_articles=2, only_complete=date_filter):
    print(dated_article)
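
Assuming, as the fundus documentation describes, that a filter returns `True` for articles to drop, a reusable date-range filter could be built with a small factory. This is only a sketch: `make_date_filter` is a name made up here, not a fundus API, and the sample dictionaries stand in for fundus's extracted attributes.

```python
import datetime
from typing import Any, Callable, Dict


def make_date_filter(
    start: datetime.date, end: datetime.date
) -> Callable[[Dict[str, Any]], bool]:
    """Build a filter that returns True (drop) for articles outside [start, end]."""

    def date_filter(extracted: Dict[str, Any]) -> bool:
        publishing_date = extracted.get("publishing_date")
        if publishing_date is None:
            # No date extracted: drop rather than raise AttributeError.
            return True
        return not (start <= publishing_date.date() <= end)

    return date_filter


november = make_date_filter(datetime.date(2024, 11, 1), datetime.date(2024, 12, 1))

# Made-up sample data.
inside = {"publishing_date": datetime.datetime(2024, 11, 15, 9, 30)}
outside = {"publishing_date": datetime.datetime(2024, 12, 24, 9, 30)}
print(november(inside), november(outside), november({}))
```

The returned function could then be passed as `only_complete=november` in the crawl call.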