<a href="https://colab.research.google.com/github/seleenabusi07/raise26-ai-headlines-analysis/blob/main/Copy_of_RAISE26.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

df = pd.read_csv("/content/dataset_B_news_subset_3500.csv")
df.head()


FileNotFoundError: [Errno 2] No such file or directory: '/content/dataset_B_news_subset_3500.csv'

In [None]:
df.columns

In [None]:
df["classes_str"].value_counts().head(10)

The most frequent category in AI-related headlines is Learning, Knowledge & Education, followed closely by Work, Jobs & Economy. This suggests that media coverage most often frames AI as a force shaping how people learn and how they work. Everyday life and lifestyle changes are also prominent, indicating that AI is increasingly portrayed as integrated into daily human routines rather than only as a technical tool.


In [None]:
import matplotlib.pyplot as plt

top_classes = df["classes_str"].value_counts().head(10)

top_classes.plot(kind="bar")
plt.title("Top 10 Human Behavior Categories in AI News Headlines")
plt.xlabel("Category")
plt.ylabel("Number of Headlines")
plt.xticks(rotation=75)
plt.show()


In [None]:
import numpy as np

In [None]:
# filter out headlines not classed as education or work

In [None]:
classes_list = ['Learning, Knowledge & Education', 'Work, Jobs & Economy']
df[df['classes_str'].isin(classes_list)]

In [None]:
# importing required libraries for sentiment analysis

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report

In [None]:
# DATA CLEANING

In [None]:
# Examining data types
data_types = df.dtypes

print(data_types)

In [None]:
# checking for duplicated values

In [None]:
# count fully duplicated rows instead of displaying a per-row boolean Series
df.duplicated().sum()

In [None]:
# splitting cols into categorical and numerical

In [None]:
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']

print('Categorical columns:', cat_col)
print('Numerical columns:', num_col)

In [None]:
# checking for unique values

In [None]:
df[cat_col].nunique()

In [None]:
# checking for null values

In [None]:
# count null values per column instead of displaying a per-cell boolean frame
df.isnull().sum()

In [None]:
df.head()

In [None]:
# text normalization: converting text to lowercase, removing punctuation, and removing unnecessary whitespace

In [None]:
df['normalized_text'] = df['title'].str.lower()

In [None]:
import string
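# strip punctuation from the already-lowercased column (reading from 'title' here would discard the lowercasing step)
df['normalized_text'] = df['normalized_text'].str.translate(str.maketrans('', '', string.punctuation))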

In [None]:
df['normalized_text'] = df['normalized_text'].str.strip()

In [None]:
df['normalized_text'] = df['normalized_text'].str.replace(r'\s+', ' ', regex=True)

In [None]:
df.head()

In [None]:
# function to remove special characters and any extra whitespace

In [None]:
import re
def remove_noise(text):
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['normalized_text'] = df['normalized_text'].apply(remove_noise)
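
The normalization cells above can also be read as a single function applied once per headline. The sketch below is a minimal, standard-library version of those steps; the name `normalize_headline` and the sample string are hypothetical and not part of the notebook.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_headline(text):
    """Lowercase, drop ASCII punctuation, then collapse and trim whitespace."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_headline("  AI's  Big   Leap!! "))  # ais big leap
```

Chaining the steps in one function keeps each step operating on the output of the previous one, so the lowercasing cannot be lost by accidentally reading from the raw `title` column again.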

In [None]:
df.head()

In [None]:
# TOKENIZATION

In [None]:
# importing nltk libraries

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
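nltk.download('stopwords')
nltk.download('punkt')
# recent NLTK releases also need 'punkt_tab' before word_tokenize can run
nltk.download('punkt_tab')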

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
# function to remove stopwords

In [None]:
def remove_stopwords(text):
  word_tokens = word_tokenize(text)
  filtered_text = [word for word in word_tokens \
                     if word.lower() not in stop_words]
  return ' '.join(filtered_text)

In [None]:
df['filtered_text'] = df['normalized_text'].apply(remove_stopwords)

In [None]:
nltk.download('punkt_tab')

In [None]:
df.head()

In [None]:
# stemming and lemmatization

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [None]:
# functions for stemming and lemmatizing each word

In [None]:
def perform_stemming(text):
  word_tokens = word_tokenize(text)
  stemmed_words = [stemmer.stem(word) for word in word_tokens]
  return ' '.join(stemmed_words)

In [None]:
def perform_lemmatization(text):
  word_tokens = word_tokenize(text)
  lemmatized_words = [lemmatizer.lemmatize(word) for word in word_tokens]
  return ' '.join(lemmatized_words)

In [None]:
df['stemmed_text'] = df['filtered_text'].apply(perform_stemming)
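# note: lemmatization runs on the already-stemmed text, so the lemmatizer sees stems rather than dictionary words
df['lemmatized_text'] = df['stemmed_text'].apply(perform_lemmatization)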

In [None]:
df.head()

In [None]:
# converting text to numbers using tf-idf technique

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

In [None]:
bow_matrix = count_vectorizer.fit_transform(df['lemmatized_text'])

In [None]:
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

In [None]:
print(bow_df)

In [None]:
tfidf_matrix = tfidf_vectorizer.fit_transform(df['lemmatized_text'])

In [None]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
print(tfidf_df)

In [None]:
# SENTIMENT ANALYSIS

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
nltk.download('vader_lexicon')

In [None]:
analyzer = SentimentIntensityAnalyzer()

In [None]:
# function to return binary value for sentiment

In [None]:
def get_sentiment(text):
  scores = analyzer.polarity_scores(text)
  sentiment = 1 if scores['pos'] > 0 else 0
  return sentiment

In [None]:
df['sentiment'] = df['lemmatized_text'].apply(get_sentiment)

In [None]:
df.head()

In [None]:
df

In [None]:
# disregarding binary sentiment col to instead class sentiment as positive, neutral, negative, or compound (more inclusive)

In [None]:
df.drop('sentiment', axis=1)

In [None]:
# analyzing first row using new technique

In [None]:
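# score the first headline in the column (the original expression scored the literal string 'lemmatized_text')
analyzer.polarity_scores(df['lemmatized_text'].iloc[0])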

In [None]:
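# score the first headline in the column (the original expression scored the literal string 'stemmed_text')
analyzer.polarity_scores(df['stemmed_text'].iloc[0])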

In [None]:
# function to apply technique to the rest of the cells

In [None]:
body = df.lemmatized_text
neg, neu, pos, compound = [], [], [], []
for headline in body:
  res = analyzer.polarity_scores(str(headline))
  neg.append(res['neg'])
  neu.append(res['neu'])
  pos.append(res['pos'])
  compound.append(res['compound'])

In [None]:
df["Negative"] = neg
df["Neutral"] = neu
df["Positive"] = pos
df["Compound"] = compound

In [None]:
df.head()

In [None]:
# assigning either positive or negative sentiment tag for classing headlines based on pos, neg metrics from earlier

In [None]:
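tag=[]
for i in range(len(df.lemmatized_text)):
  # compare the VADER negative and positive scores for headline i
  winning_val = max(neg[i], pos[i])
  # ties (including neg == pos == 0) satisfy this first test, so they are tagged Negative
  if (winning_val == neg[i]):
    tag.append("Negative")
  elif(pos[i]==winning_val):
    tag.append("Positive")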

df["Sentiment_Tag"]=tag
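
Because the loop above tests the negative score first, a headline whose positive and negative scores are equal (including both being 0) is tagged "Negative". An alternative that is not used in this notebook is to tag from the compound score with the cutoffs of ±0.05 suggested in the VADER documentation, which adds a "Neutral" class. The sketch below assumes that convention; `tag_from_compound` and its `threshold` parameter are hypothetical names.

```python
def tag_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a three-way sentiment tag."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    # Scores close to zero carry too little signal to call either way.
    return "Neutral"

print(tag_from_compound(0.0))    # Neutral
print(tag_from_compound(-0.42))  # Negative
```

With the `Compound` column built earlier, this could be applied as `df["Compound"].apply(tag_from_compound)`, and the resulting counts would likely differ from the two-way tally.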

In [None]:
df.head()

In [None]:
classes_list1 = ['Learning, Knowledge & Education', 'Work, Jobs & Economy']
df = df[df['classes_str'].isin(classes_list1)]

In [None]:
pp = df['Sentiment_Tag'][df['Sentiment_Tag']=="Positive"].count()
nn = df['Sentiment_Tag'][df['Sentiment_Tag']=="Negative"].count()

In [None]:
print("Number of Positive Headlines:", pp)
print("Number of Negative Headlines:", nn)

In [None]:
# CORRELATION ANALYSIS

In [None]:
import seaborn as sns

In [None]:
correlation = df['sentiment'].corr(df['quarter'])

NameError: name 'df' is not defined

In [None]:
# negligible correlation, checking for percentage distribution of positive and negative headlines instead

In [None]:
print(f"Correlation: {correlation}")

In [None]:
grouped = df.groupby('quarter')['sentiment'].value_counts(normalize=True).unstack()

In [None]:
print(grouped)

In [None]:
counts = df.groupby(['quarter', 'sentiment']).size().reset_index(name='count')

In [None]:
total_counts_per_quarter = counts.groupby('quarter')['count'].transform('sum')
counts['percent'] = (counts['count'] / total_counts_per_quarter) * 100

In [None]:
table = counts.pivot(index='quarter', columns='sentiment', values='percent')
print(table)

In [None]:
table.plot(kind='bar', stacked=True)


In [None]:
# all articles about work and education were written in Q3 only...

In [None]:
quarter_list = [1, 2, 4]
df[df['quarter'].isin(quarter_list)]
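
When every row has the same quarter, the Pearson correlation has a zero variance in its denominator and is undefined, which is why the correlation step cannot say anything here (pandas typically reports `NaN` in this case). The sketch below shows this with a hand-written Pearson function; the function name and the toy lists are illustrative assumptions, not values from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no variation to correlate with.
        return math.nan
    return cov / math.sqrt(var_x * var_y)

toy_sentiment = [1, 0, 0, 1, 0]  # hypothetical binary sentiment values
toy_quarter = [3, 3, 3, 3, 3]    # every headline published in Q3
print(pearson(toy_sentiment, toy_quarter))  # nan
```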

# Summary

After an initial plot of the data, it became apparent that the two most frequent categories, Work and Education, together accounted for 665 of the 3,500 headlines (about 19%), so those headlines were the focus of this project, which examines how AI affects sentiment in professional spaces like work and school. The dataset was first checked for null and duplicate values (none were found, indicating clean data). The headlines were then normalized by converting text to lowercase and removing punctuation, special characters, and unnecessary whitespace. They were tokenized by word and stopwords were removed, followed by stemming and lemmatization to reduce words to their base forms. Bag-of-words and TF-IDF matrices were built from the processed text, and VADER scored each lemmatized headline on four measures: positive, negative, and neutral (each from 0 to 1) and compound (from -1 to 1). Each headline was then assigned a sentiment tag of "Positive" or "Negative" according to whichever of its positive and negative scores was larger, with ties tagged "Negative", and the tags were tallied. There were 135 "Positive" headlines and 520 "Negative" headlines, about 3.85 times as many negative as positive. A correlation analysis was then performed between sentiment and the quarter in which each article was published, but it produced no meaningful result because all articles in these two classes were published in Q3.
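
The figures quoted in the summary can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative. Note that the two tallies sum to 655, ten fewer than the 665 headlines in the two classes, and the notebook does not say where the difference comes from.

```python
total_headlines = 3500  # size of the full dataset
focus_headlines = 665   # headlines classed under Work or Education
positive = 135          # headlines tagged "Positive"
negative = 520          # headlines tagged "Negative"

share = focus_headlines / total_headlines  # fraction of the dataset in focus
ratio = negative / positive                # negative headlines per positive one
tagged = positive + negative               # headlines covered by the tally

print(f"{share:.1%}")  # 19.0%
print(f"{ratio:.2f}")  # 3.85
print(tagged)          # 655
```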