# Sentiment-Based News Sorter Prototype

This notebook demonstrates a prototype for classifying news articles as "Good News ðŸŽ‰", "Bad News ðŸ‘Ž", or "Just News ðŸ¤·" using transformer-based NLP models.

## 1. Import Required Libraries
Import the necessary libraries, including pandas and transformers.

In [1]:
# Install required libraries if not already installed
!pip install pandas transformers






[notice] A new release of pip is available: 25.3 -> 26.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
import pandas as pd
from transformers import pipeline

  from .autonotebook import tqdm as notebook_tqdm


## 2. Load and Inspect the News Data
Load the news dataset and inspect its structure and sample rows.

In [3]:
# Load the news dataset
news_df = pd.read_csv('./data/news.csv')

# Display dataset structure and sample rows
print(f"Number of articles: {len(news_df)}")
display(news_df.head())
print(news_df.dtypes)

Number of articles: 510


Unnamed: 0,headline,text
0,UK economy facing 'major risks',The UK manufacturing sector will continue to f...
1,Aids and climate top Davos agenda,Climate change and the fight against Aids are ...
2,Asian quake hits European shares,Shares in Europe's leading reinsurers and trav...
3,India power shares jump on debut,"Shares in India's largest power producer, Nati..."
4,Lacroix label bought by US firm,Luxury goods group LVMH has sold its loss-maki...


headline    str
text        str
dtype: object


## 3. Set Up Sentiment Analysis Pipeline
Load a transformer-based sentiment analysis model from HuggingFace.

In [4]:
# Load a sentiment analysis pipeline (using a popular model from HuggingFace)
sentiment_analyzer = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Loading weights: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 104/104 [00:00<00:00, 398.65it/s, Materializing param=pre_classifier.weight]                                  


## 4. Classify Article Sentiment and Map to Categories
Apply the sentiment model to each article and map results to 'Good News ðŸŽ‰', 'Bad News ðŸ‘Ž', or 'Just News ðŸ¤·'.

In [5]:
# Define a function to map sentiment to news categories
def map_sentiment_to_category(sentiment_label, score, threshold=0.7):
    if sentiment_label == 'POSITIVE' and score >= threshold:
        return 'Good News ðŸŽ‰'
    elif sentiment_label == 'NEGATIVE' and score >= threshold:
        return 'Bad News ðŸ‘Ž'
    else:
        return 'Just News ðŸ¤·'

# Apply sentiment analysis to headlines (or use 'title' if that's the column name)
results = sentiment_analyzer(list(news_df['headline']))

# Add results to DataFrame
news_df['sentiment_label'] = [r['label'] for r in results]
news_df['sentiment_score'] = [r['score'] for r in results]
news_df['category'] = [map_sentiment_to_category(r['label'], r['score']) for r in results]

## 5. Review and Visualize Results
Display the enriched dataset and sample the categorized news.

In [6]:
# Display a few sample rows with categories
print(news_df[['headline', 'sentiment_label', 'sentiment_score', 'category']].head())

# Show category distribution
print(news_df['category'].value_counts())

                            headline sentiment_label  sentiment_score  \
0    UK economy facing 'major risks'        NEGATIVE         0.963595   
1  Aids and climate top Davos agenda        NEGATIVE         0.741379   
2   Asian quake hits European shares        POSITIVE         0.992087   
3   India power shares jump on debut        POSITIVE         0.999317   
4    Lacroix label bought by US firm        NEGATIVE         0.982360   

      category  
0   Bad News ðŸ‘Ž  
1   Bad News ðŸ‘Ž  
2  Good News ðŸŽ‰  
3  Good News ðŸŽ‰  
4   Bad News ðŸ‘Ž  
category
Bad News ðŸ‘Ž     299
Good News ðŸŽ‰    188
Just News ðŸ¤·     23
Name: count, dtype: int64


## 6. Improved Sentiment Labeling: Use Article Text and Alternative Model

To address limitations, we will:
- Use the full article text for sentiment analysis (falling back to headline if text is missing).
- Try an alternative transformer model (e.g., cardiffnlp/twitter-roberta-base-sentiment-latest) for comparison.
- Add a flag for low-confidence predictions for potential manual review.

In [7]:
# Install and import alternative model if needed
# !pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

# Load alternative sentiment model (CardiffNLP's Twitter RoBERTa)
model_name = 'cardiffnlp/twitter-roberta-base-sentiment-latest'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Helper function for sentiment prediction
def get_sentiment_label(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    scores = torch.softmax(logits, dim=1).numpy()[0]
    labels = ['Negative', 'Neutral', 'Positive']
    max_idx = np.argmax(scores)
    return labels[max_idx], float(scores[max_idx]), float(np.max(scores))

# Use article text if available, else headline
def get_text(row):
    if pd.notnull(row.get('content', None)) and str(row['content']).strip():
        return str(row['content'])
    return str(row['headline'])

# Apply improved sentiment analysis
sentiment_results = news_df.apply(lambda row: get_sentiment_label(get_text(row)), axis=1)
news_df['alt_sentiment_label'] = [r[0] for r in sentiment_results]
news_df['alt_sentiment_score'] = [r[1] for r in sentiment_results]
news_df['alt_sentiment_confidence'] = [r[2] for r in sentiment_results]

# Flag low-confidence predictions for manual review
confidence_threshold = 0.7
news_df['needs_review'] = news_df['alt_sentiment_confidence'] < confidence_threshold

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Loading weights: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 201/201 [00:00<00:00, 459.12it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
[1mRobertaForSequenceClassification LOAD REPORT[0m from: cardiffnlp/twitter-roberta-base-sentiment-latest
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.pooler.dense.bias       | UNEXPECTED |  | 
roberta.embeddings.position_ids | UNEXPECTED |  | 
roberta.pooler.dense.weight     | UNEXPECTED |  | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m


## 7. Review Improved Results
Display and compare the new sentiment labels, confidence, and review flags.

In [8]:
# Show a sample of improved results
print(news_df[['headline', 'alt_sentiment_label', 'alt_sentiment_score', 'alt_sentiment_confidence', 'needs_review']].head())

# Show distribution of new sentiment labels
print(news_df['alt_sentiment_label'].value_counts())

# Show how many articles need manual review
print(f"Articles flagged for review: {news_df['needs_review'].sum()} out of {len(news_df)}")

                            headline alt_sentiment_label  alt_sentiment_score  \
0    UK economy facing 'major risks'            Negative             0.605086   
1  Aids and climate top Davos agenda             Neutral             0.682842   
2   Asian quake hits European shares             Neutral             0.649496   
3   India power shares jump on debut            Positive             0.833764   
4    Lacroix label bought by US firm             Neutral             0.906590   

   alt_sentiment_confidence  needs_review  
0                  0.605086          True  
1                  0.682842          True  
2                  0.649496          True  
3                  0.833764         False  
4                  0.906590         False  
alt_sentiment_label
Neutral     353
Negative     98
Positive     59
Name: count, dtype: int64
Articles flagged for review: 214 out of 510
