# Sentiment Analysis

We can crawl headlines from various publisher collections, but I don't yet know how to list which publishers are available.

In [20]:
from fundus import PublisherCollection, Crawler, NewsMap
from flair.data import Sentence
from flair.models import TextClassifier
from tqdm import tqdm

In [2]:
# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

Fundus-Article:
- Title: "Albanian prime minister says TikTok ban was not a 'rushed reaction to a [...]"
- Text:  "TIRANA, Albania (AP) — Albania’s prime minister said Sunday the ban on TikTok
          his government announced a day earlier was “not a rushed [...]"
- URL:    https://apnews.com/article/albania-tiktok-ban-children-violence-bullying-c6dd46c1de5cc2996b4004bcb2a76a66
- From:   Associated Press News (2024-12-22 19:08)
Fundus-Article:
- Title: "The Spending Fiasco Was a Preview of the Trump-Musk Administration"
- Text:  "Congress narrowly averted a shutdown, but the whole episode offers a sneak peak
          of the oligarchy in store.  With minutes to spare ahead of the [...]"
- URL:    https://www.thenation.com/article/politics/spending-shutdown-musk-oligarchy/
- From:   The Nation (2024-12-21 17:03)


In [3]:
print(article.title)
print(article.lang) # detects language of article
article.plaintext


The Spending Fiasco Was a Preview of the Trump-Musk Administration
en


'Congress narrowly averted a shutdown, but the whole episode offers a sneak peak of the oligarchy in store.\n\nWith minutes to spare ahead of the deadline for a shutdown, the Senate approved the third version of the House’s spending bill last night. This latest installment in the dysfunctional soap opera known as Mike Johnson’s speakership was also the most unhinged, thanks to the obstructionist role played by centibillionaire turned first buddy Elon Musk. As the House prepared to vote on the initial bill to fund the government earlier this week, Musk took to the account he keeps on X, the social media platform he owns and now maintains as a palace for workshopping right-wing agitprop, to declare that any House member supporting the deal should be voted out in the next midterm cycle. (His subsequent flurry of attacks on the deal were a trademark Musk fireworks display of ignorance and lies.) Chaos then ensued: Johnson’s spending package died before a vote could be scheduled, and under 

In [4]:
# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

Fundus-Article:
- Title: "Lee Chang-dong on South Korea in the Nineteen-Eighties and Today"
- Text:  "The author discusses his story “The Leper.”  In this week’s story, “The Leper,”
          the narrator discovers that his father has confessed to spying [...]"
- URL:    https://www.newyorker.com/books/this-week-in-fiction/lee-chang-dong-12-30-24
- From:   The New Yorker (2024-12-22 06:00)
Fundus-Article:
- Title: "“The Leper,” by Lee Chang-dong"
- Text:  "Before I knocked, I took a moment to calm my breathing. But even a couple of
          deep breaths did nothing to lessen my anxiety, and, to the sound [...]"
- URL:    https://www.newyorker.com/magazine/2024/12/30/the-leper-fiction-lee-chang-dong
- From:   The New Yorker (2024-12-22 06:00)


In [21]:
def crawl_headlines(crawlers, name_of_publishers, article_number=20):
    """Crawls headlines from a list of crawlers for specified publishers.

    This function takes three arguments:
        - crawlers: A list of web crawlers, each responsible for a specific publisher.
        - name_of_publishers: A list of publisher names corresponding to the crawlers.
        - article_number (optional): The maximum number of articles to crawl per publisher. Defaults to 20.

    It returns a dictionary where the keys are publisher names and the values are lists of headlines crawled from those publishers.
    """

    headlines = {}

    for crawler, name_of_publisher in zip(crawlers, name_of_publishers):
        # Collect the headlines for the current publisher.
        publisher = []

        # Crawl up to article_number articles and keep only their titles.
        for article in tqdm(crawler.crawl(max_articles=article_number)):
            publisher.append(article.title)

        headlines[name_of_publisher] = publisher

    return headlines

In [29]:
crawler1 = Crawler(PublisherCollection.us.CNBC)
crawler2 = Crawler(PublisherCollection.us.TheNation)

crawlers = [crawler1, crawler2]
names_of_publishers = ["CNBC", "The Nation"]
headlines = crawl_headlines(crawlers, names_of_publishers, article_number=10)

10it [00:14,  1.46s/it]
10it [00:14,  1.49s/it]
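
Because `crawl_headlines` pairs crawlers and names with `zip`, lists of different lengths would silently drop entries. One possible alternative is to key the crawlers by publisher name in a single dict. The sketch below assumes only what the cells above already rely on: each crawler has a `crawl(max_articles=...)` method that yields objects with a `title` attribute. `StubCrawler` and `crawl_headlines_by_name` are hypothetical names used for illustration so the block runs without network access; in the notebook, real `Crawler` instances would take the stub's place.

```python
from types import SimpleNamespace


class StubCrawler:
    """Hypothetical stand-in for a fundus Crawler (illustration only)."""

    def __init__(self, titles):
        self._titles = titles

    def crawl(self, max_articles=None):
        # Yield article-like objects with a .title attribute, up to max_articles.
        for title in self._titles[:max_articles]:
            yield SimpleNamespace(title=title)


def crawl_headlines_by_name(crawlers_by_name, article_number=20):
    """Return {publisher name: list of headlines} from a {name: crawler} dict."""
    return {
        name: [article.title for article in crawler.crawl(max_articles=article_number)]
        for name, crawler in crawlers_by_name.items()
    }


# Placeholder titles, not real crawl results.
stub_crawlers = {
    "Publisher A": StubCrawler(["a1", "a2", "a3"]),
    "Publisher B": StubCrawler(["b1"]),
}
stub_headlines = crawl_headlines_by_name(stub_crawlers, article_number=2)
print(stub_headlines)
```

Keeping name and crawler in one mapping means they cannot drift out of sync when a publisher is added or removed.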


In [42]:
headlines

{'CNBC': ['Macao is becoming a city of sports and entertainment, Sands China CEO says, as President Xi urges diversification',
  'CNBC Daily Open: With cooler-than-expected PCE, would the Fed’s dot plot have looked different?',
  'Asia-Pacific markets begin Christmas week higher; Nissan-Honda merger deal in focus',
  'How Gen X and millennials are changing the face of the traditional family office as they inherit over $80 trillion',
  "1 big question this week: Was Friday's good news double a market turning point?",
  'Amtrak temporarily suspends Northeast Corridor service days before holiday',
  'Donald Trump says Vladimir Putin wants a meeting as soon as possible about the war with Ukraine',
  'What Google’s quantum computing breakthrough Willow means for the future of bitcoin and other cryptos',
  '‘It feels like Elon Musk is our prime minister’: The fallout from the funding debacle',
  "This is the No. 1 company where workers are happy with their pay—it's not based in New York or S

In [24]:
def predict_labels(publisher_headlines):
    """Predicts sentiment labels for headlines from each publisher.

    This function takes a dictionary `publisher_headlines` as input. 
    The dictionary keys are publisher names and the values are lists of headlines.

    The function performs sentiment analysis on each headline and stores the predicted labels 
    in a new dictionary with the same publisher names as keys.

    It returns a dictionary where the keys are publisher names and the values are lists of predicted sentiment labels 
    for the corresponding headlines.
    """

    sentiments_per_publisher = {}

    # Load flair's pre-trained sentiment classifier
    tagger = TextClassifier.load('sentiment')

    for key, values in publisher_headlines.items():
        # Collect the predicted labels for the current publisher.
        temp = []
        for value in values:
            sentence = Sentence(value)  # Wrap the headline in a flair Sentence
            tagger.predict(sentence)    # Predict the sentiment label; the result is stored on the sentence
            temp.append(sentence.get_label().value)  # Keep only the label value, e.g. "POSITIVE" or "NEGATIVE"

        sentiments_per_publisher[key] = temp  # Add the list of predicted labels for the publisher to the result dictionary

    return sentiments_per_publisher

In [30]:
sentiments_per_publisher = predict_labels(headlines)  # predict a sentiment label for every headline

In [36]:
len(sentiments_per_publisher['CNBC'])

10

In [40]:
def print_statistics(sentiments_per_publisher, number_of_articles=None):
    """
    Prints the sentiment distribution of the headlines for each publisher.

    Args:
        sentiments_per_publisher (dict): A dictionary where keys are publishers and values are lists of sentiment labels for their headlines.
        number_of_articles (int, optional): The total to report per publisher. Defaults to the number of labels for that publisher.
    """

    for publisher, labels in sentiments_per_publisher.items():
        positive = 0
        negative = 0
        something_else = 0
        for label in labels:
            if label == "POSITIVE":
                positive += 1
            elif label == "NEGATIVE":
                negative += 1
            else:
                something_else += 1

        # Fall back to the actual number of labels when no total is given.
        total = len(labels) if number_of_articles is None else number_of_articles
        print(f"{publisher} has {positive} positive and {negative} negative headlines out of {total}.")
        if something_else >= 1:
            print(f"If something went wrong, it has {something_else} headlines with an unexpected label.")
        print()


In [41]:
print_statistics(sentiments_per_publisher, number_of_articles=len(sentiments_per_publisher['CNBC']))

CNBC has 5 positive and 5 negative headlines out of 10.

The Nation has 3 positive and 7 negative headlines out of 10.
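
Raw counts are hard to compare once publishers have different numbers of headlines, so shares can be more useful. The following is a minimal sketch using `collections.Counter`; `sentiment_shares` is a hypothetical helper, and the labels below are made-up placeholders rather than the predictions from this notebook.

```python
from collections import Counter


def sentiment_shares(sentiments_per_publisher):
    """Return {publisher: {label: share of headlines}} for every label seen."""
    shares = {}
    for publisher, labels in sentiments_per_publisher.items():
        counts = Counter(labels)
        total = sum(counts.values())
        if total == 0:
            # No headlines for this publisher: avoid dividing by zero.
            shares[publisher] = {}
            continue
        shares[publisher] = {label: count / total for label, count in counts.items()}
    return shares


# Hypothetical labels for illustration, not the notebook's actual predictions.
example_labels = {
    "Publisher A": ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"],
    "Publisher B": [],
}
example_shares = sentiment_shares(example_labels)
for publisher, share in example_shares.items():
    print(publisher, share)
```

Since `Counter` tallies whatever labels appear, an unexpected label shows up as its own key instead of needing a separate `something_else` branch.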



In [43]:
# Yep, many of these titles sound negative to me
headlines['The Nation']

['The Spending Fiasco Was a Preview of the Trump-Musk Administration',
 'When the Feds Are Still Watching',
 'My 2025 Project: Starting a New Column, “Hiding in Plain Sight”',
 'The Billionaire Who Stole Christmas',
 'The Downsides of the Wind Energy Boom',
 'Joe Biden’s Bodyguard of Liars Betrayed American Democracy',
 'Is America Killing Itself?',
 'Red Tape Saves Lives',
 'If Less Than 115,000 Votes Had Switched in Three Battleground States, Harris Would Have Beaten Trump',
 'Brain-Dead Bipartisanship Is Getting Us Nowhere']
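
To check that impression against the classifier rather than by eye, the headlines can be paired with their predicted labels. This is a small sketch; `headlines_with_label` is a hypothetical helper, and the inputs below are placeholders, not the notebook's data.

```python
def headlines_with_label(headlines, labels, wanted="NEGATIVE"):
    """Return the headlines whose predicted label equals `wanted`."""
    if len(headlines) != len(labels):
        raise ValueError("headlines and labels must have the same length")
    return [headline for headline, label in zip(headlines, labels) if label == wanted]


# Placeholder inputs for illustration only.
sample_headlines = ["headline one", "headline two", "headline three"]
sample_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE"]
negative_sample = headlines_with_label(sample_headlines, sample_labels)
print(negative_sample)
```

In this notebook, calling `headlines_with_label(headlines['The Nation'], sentiments_per_publisher['The Nation'])` would list the headlines the model labeled negative.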