# Multilingual Sentiment Analysis of News Sources

Idea: How does sentiment on the same topic differ between news sources, and does it differ across languages?

1. Rank the average happiness of different news sources in different languages
    - [Happiness of News](https://hedonometer.org/showcase/nyt/) has word lists in different languages
    - Build a dataset of articles related to the topic
    - Look up each word of the dataset in the word list and aggregate the happiness scores (averaging is the first candidate; the metric still needs to be decided)
    - The happiness scores are verified in each language because they come from fixed word lists rather than a model; DistilBERT is a predictive model, so its scores may be less accurate
    - Languages: German/Korean/Spanish/Russian/English/Chinese/Arabic/Portuguese/French/Ukrainian
2. Rank the positive/neutral/negative sentiment of different news sources in different languages
    - [HuggingFace DistilBERT multilingual model](https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student)
    - Languages: en/ar/de/es/fr/ja/zh/id/hi/it/ms/pt
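
The word-lookup idea in item 1 can be sketched as follows. This is a minimal illustration, not the final metric: the lexicon `TOY_HAPPINESS` and its scores are made up for the example, and the helper name `average_happiness` is hypothetical. A real run would load the per-language word list from the hedonometer project instead.

```python
import re

# Hypothetical toy lexicon with made-up scores; a real run would load the
# per-language word list published by the hedonometer project instead.
TOY_HAPPINESS = {"love": 8.4, "win": 7.5, "the": 5.0, "war": 1.8, "crisis": 2.6}

def average_happiness(text, lexicon):
    """Mean happiness of the words in `text` that appear in `lexicon`.

    Returns None when no word is covered, so uncovered articles can be
    dropped instead of being scored as neutral.
    """
    words = re.findall(r"\w+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return None
    return sum(scores) / len(scores)

print(average_happiness("Love will win the war", TOY_HAPPINESS))
```

Words missing from the lexicon are skipped rather than counted as neutral, which is one of the choices the "need to determine metric" note leaves open.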


## Dataset Collection and Cleaning

In [1]:
# setup notebook
import pandas as pd
from langdetect import detect
from googletrans import Translator

path = "./Latest_News/Latest_News.csv"
df = pd.read_csv(path)
df.head()


Unnamed: 0,title,link,keywords,creator,video_url,description,content,pubDate,full_description,image_url,source_id
0,Spalletti arrabbiato con i suoi calciatori nei...,https://www.areanapoli.it/rassegna-stampa/spal...,['Rassegna Stampa'],['AreaNapoli.it'],,La Gazzetta dello Sport racconta del nervosism...,,2021-10-26 07:06:10,,https://cdn.areanapoli.it/immagini/notizie/big...,areanapoli
1,Squalifica e multa per Spalletti: la reazione ...,https://www.areanapoli.it/rassegna-stampa/squa...,['Rassegna Stampa'],['AreaNapoli.it'],,Il tecnico del Napoli è stato sanzionato per l...,,2021-10-26 07:06:10,,https://cdn.areanapoli.it/immagini/notizie/big...,areanapoli
2,"Napoli-Bologna, gli ultras potrebbero entrare ...",https://www.areanapoli.it/rassegna-stampa/napo...,['Rassegna Stampa'],['AreaNapoli.it'],,Gli ultras del Napoli potrebbero rientrare all...,,2021-10-26 07:06:10,,https://cdn.areanapoli.it/immagini/notizie/big...,areanapoli
3,"Juve, Dybala vale doppio: ecco perch&eacute; p...",https://www.tuttosport.com/news/calcio/serie-a...,['Juventus'],,,Decisivo nella costruzione del gioco e nella f...,,2021-10-26 07:05:03,,https://cdn.tuttosport.com/img/600/400/2021/10...,tuttosport
4,Tubertini non è più l'allenatore di Siena,https://www.tuttosport.com/news/pallavolo/a2-m...,"['Tubertini', 'Emma Villas']",,,Dopo il negativo inizio di stagione il preside...,,2021-10-26 07:05:03,,https://cdn.tuttosport.com/img/600/400/2021/10...,tuttosport


In [2]:
# detect language of article
def detect_language_langdetect(text):
    try:
        detected_language = detect(text)
        return detected_language
    except Exception as e:
        print("An error occurred:", e)
        return None

df["language"] = df["title"].apply(detect_language_langdetect)

An error occurred: expected string or bytes-like object
An error occurred: No features in text.
An error occurred: No features in text.
An error occurred: No features in text.
An error occurred: No features in text.
An error occurred: expected string or bytes-like object
An error occurred: No features in text.
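
The two error messages above are consistent with missing titles (NaN is a float, not a string) and with titles that contain no letters. A possible guard is sketched below. The wrapper name `safe_detect` is hypothetical, and the detector is passed in as an argument so the sketch runs without `langdetect` installed; in the notebook it would be `langdetect.detect`.

```python
def safe_detect(text, detector):
    """Return a language code, or None when `text` cannot be classified.

    `detector` is any callable that maps a string to a language code,
    for example `langdetect.detect`.
    """
    # Missing titles arrive as NaN (a float), not as a string.
    if not isinstance(text, str):
        return None
    # Strings without any letters give the detector nothing to work with.
    if not any(ch.isalpha() for ch in text):
        return None
    try:
        return detector(text)
    except Exception:
        return None

# Stand-in detector so the sketch runs offline; it is not a real classifier.
def fake_detector(text):
    return "en"

print([safe_detect(t, fake_detector) for t in ["Hello world", float("nan"), "12345", ""]])
# → ['en', None, None, None]
```

Filtering before the call keeps the output free of repeated error lines while still returning `None` for rows that cannot be classified.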


In [3]:
# keep only languages with at least 100 articles
lang_counts = df["language"].value_counts()
geq100_lang = lang_counts[lang_counts >= 100].index.tolist()
lang_df = df[df["language"].isin(geq100_lang)]
print(geq100_lang)
print(lang_counts)
# german, english, french, korean, spanish, thai, croatian, greek, russian, romanian
# note: this saves df (all languages), not the filtered lang_df
df.to_csv("./Latest_News/LNews_language.csv")

['en', 'de', 'es', 'ar', 'fr', 'it', 'nl', 'ko', 'hu', 'pl', 'el', 'ru', 'ja', 'hr', 'id', 'th', 'pt', 'ro', 'tr', 'sv', 'no', 'af', 'ca', 'sl', 'he', 'lv', 'uk', 'cs', 'bg', 'da', 'lt', 'zh-tw']
language
en       25409
de        9234
es        7729
ar        6208
fr        5759
it        5646
nl        2701
ko        2526
hu        2446
pl        2325
el        2055
ru        1914
ja        1546
hr        1528
id        1485
th        1392
pt        1251
ro        1064
tr         876
sv         560
no         491
af         294
ca         280
sl         252
he         197
lv         192
uk         191
cs         150
bg         146
da         144
lt         138
zh-tw      126
sk          92
tl          55
et          48
mk          19
fi          17
so          16
vi          14
zh-cn       13
sw          10
cy           7
sq           4
fa           3
Name: count, dtype: int64


In [60]:
# downsample German and English to 350 articles each; keep the other languages as-is
df_de = lang_df[lang_df['language'] == 'de']
df_sampled_de = df_de.sample(n=350, random_state=42)
df_en = lang_df[lang_df['language'] == 'en']
df_sampled_en = df_en.sample(n=350, random_state=42)
df_remaining = lang_df[~lang_df['language'].isin(['de', 'en'])]
# the English sample is held out here because it does not need translating;
# it is added back after the translation step
df_resampled = pd.concat([df_sampled_de, df_remaining])
df_resampled["language"].value_counts()

language
de    350
fr    322
ko    317
es    224
th    116
Name: count, dtype: int64
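
The cell above caps only German and English. A more general version, capping every language at the same size, might look like the sketch below. It is written in plain Python on toy rows so it runs without the dataset; the function name `cap_per_language` and the example rows are hypothetical.

```python
import random
from collections import defaultdict

def cap_per_language(rows, cap, seed=42):
    """Keep at most `cap` rows per language, sampled without replacement."""
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for row in rows:
        by_lang[row["language"]].append(row)
    sampled = []
    for lang, group in by_lang.items():
        # Only languages above the cap are downsampled; smaller ones are kept whole.
        if len(group) > cap:
            group = rng.sample(group, cap)
        sampled.extend(group)
    return sampled

# Hypothetical toy rows standing in for the news DataFrame.
rows = [{"language": "de", "title": f"de-{i}"} for i in range(10)]
rows += [{"language": "th", "title": f"th-{i}"} for i in range(3)]
capped = cap_per_language(rows, cap=5)
print(len(capped))
```

A fixed seed plays the same role as `random_state=42` in the cell above: the sample stays the same across runs.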

In [47]:
# translate titles and descriptions
translator = Translator()
titles = df_resampled["title"].tolist()
descriptions = df_resampled["description"].tolist()

translated_titles = translator.translate(titles, dest='en')
translated_desc = translator.translate(descriptions, dest='en')
# grab text itself
english_titles = [tra.text for tra in translated_titles]
english_desc = [tra.text for tra in translated_desc]
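
Sending every title and description in one call can fail on missing values and on large inputs. The sketch below shows one way to skip non-string entries and translate in batches while keeping the results aligned with the input order. `translate_in_batches` and `fake_translate` are hypothetical names; in the notebook, `translate_batch` could be a small wrapper around `translator.translate(batch, dest='en')` that returns the `.text` of each result.

```python
def translate_in_batches(texts, translate_batch, batch_size=50):
    """Translate `texts`, skipping non-string entries and preserving order.

    `translate_batch` is a hypothetical callable taking a list of strings and
    returning a list of translated strings of the same length.
    """
    results = [None] * len(texts)
    # Remember where each usable string sits so results can be put back in place.
    positions = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    for start in range(0, len(positions), batch_size):
        chunk = positions[start:start + batch_size]
        translated = translate_batch([texts[i] for i in chunk])
        for i, text in zip(chunk, translated):
            results[i] = text
    return results

# Stand-in translator so the sketch runs offline; it only upper-cases the input.
def fake_translate(batch):
    return [s.upper() for s in batch]

out = translate_in_batches(["hallo", float("nan"), "bonjour", ""], fake_translate, batch_size=1)
print(out)
# → ['HALLO', None, 'BONJOUR', None]
```

Because the output list has the same length as the input, it can be assigned straight back to a DataFrame column.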

In [68]:
# add back into dataframe
df_resampled["en_title"] = english_titles
df_resampled["en_description"] = english_desc
df_sampled_en["en_title"] = df_sampled_en["title"]
df_sampled_en["en_description"] = df_sampled_en["description"]
df_final = pd.concat([df_resampled, df_sampled_en])
df_final.head(5)

Unnamed: 0,title,link,keywords,creator,video_url,description,content,pubDate,full_description,image_url,source_id,language,en_title,en_description
683,Die große Kimmich-Debatte - Impf-Ruck statt Im...,https://www.bild.de/politik/inland/politik-inl...,"['Politik-Inland', 'Corona-Impfung', 'Coronavi...",,,Bayern-Star Kimmich hat mit seinem Nicht-Geimp...,,2021-10-25 20:14:44,Bayern-Star Joshua Kimmich (26) hat mit seinem...,https://bilder.bild.de/fotos-skaliert/die-gros...,bild,de,The big Kimmich debate - vaccination rush inst...,Bayern star Kimmich has sparked a huge debate ...
194,Nach Putsch im Sudan: UN-Sicherheitsrat kommt ...,https://www.faz.net/aktuell/nach-putsch-im-sud...,,,,Nach dem Putsch im Sudan schießen Soldaten auf...,,2021-10-26 06:04:07,N ach dem Putsch im Sudan will der UN-Sicherhe...,,faz,de,After coup in Sudan: UN Security Council meets...,"After the coup in Sudan, soldiers shoot at dem..."
2747,An deutsch-polnischer Grenze - Rechtsextreme m...,https://www.bild.de/politik/inland/politik-inl...,"['Politik-Inland', 'Guben', 'Migrationspolitik...",,,Die Polizei sprach Platzverweise für die selbs...,,2021-10-24 12:41:35,Großer Einsatz gegen selbsternannte rechtsextr...,https://bilder.bild.de/fotos-skaliert/an-deuts...,bild,de,On the German-Polish border - right-wing extre...,The police expelled the self-proclaimed “borde...
1425,Polizeigewerkschafts-Chef - „Bundesregierung m...,https://www.bild.de/politik/2021/politik/poliz...,['Politik'],,,Die Lage an der polnischen Grenze ist dramatis...,,2021-10-25 10:55:28,Die Lage an der deutsch-polnischen Grenze ist ...,https://bilder.bild.de/fotos-skaliert/polizeig...,bild,de,Police union boss - “Federal government must a...,The situation at the Polish border is dramatic...
1841,Berlin: Nach tödlichem Tram-Unfall – Gaffer ma...,https://www.t-online.de/region/berlin/news/id_...,,,,Schreckliche Bilder aus Berlin: Eine Straßenba...,,2021-10-25 06:43:08,,https://bilder.t-online.de/b/91/02/33/46/id_91...,t-online,de,Berlin: After fatal tram accident – ​​gawkers ...,Horrible pictures from Berlin: A tram hits a T...


The final dataset covers six languages: English, German, French, Korean, Spanish, and Thai.

In [69]:
df_final.to_csv("./Latest_News/LNews.csv")

## Happiness of News Analysis

## Sentiment Analysis
### Calculate Sentiments

In [177]:
from transformers import pipeline
df_final = pd.read_csv("./Latest_News/LNews.csv")
distilled_student_sentiment_classifier = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student", 
    return_all_scores=True
)



In [123]:
# filter to languages in distilbert
distil_langs = ['de','fr', 'es']
df_subset = df_final[df_final["language"].isin(distil_langs)]

# drop missing descriptions and those longer than 512 characters
# (a rough proxy for DistilBERT's 512-token input limit)
mask = df_subset['description'].str.len() > 512
df_filtered = df_subset[~mask].dropna(subset=['description'])


In [124]:
orig_title = df_filtered["title"].to_list()
en_title = df_filtered["en_title"].to_list()
orig_desc = df_filtered["description"].to_list()
en_desc = df_filtered["en_description"].to_list()

sentiments_orig_title = [distilled_student_sentiment_classifier(d) for d in orig_title]
sentiments_en_title = [distilled_student_sentiment_classifier(d) for d in en_title]
sentiments_orig_desc = [distilled_student_sentiment_classifier(d) for d in orig_desc]
sentiments_en_desc = [distilled_student_sentiment_classifier(d) for d in en_desc]

In [125]:
# saving progress just in case
# def list2txt(list_arr, file_path):
#     with open(file_path, 'w') as file:
#         for item in list_arr:
#             file.write('%s\n' % item)
# list2txt(sentiments_orig_title, "sentiments_orig_title.txt")
# list2txt(sentiments_en_title, "sentiments_en_title.txt")
# list2txt(sentiments_orig_desc, "sentiments_orig_desc.txt")
# list2txt(sentiments_en_desc, "sentiments_en_desc.txt")

In [162]:
def convert_to_sentiment_df(sent_list, prefix):
    rows = []
    # Each element looks like [[{positive}, {neutral}, {negative}]]; scores are read
    # by position, which assumes the classifier always returns labels in that order
    for elements in sent_list:
        positive_score = elements[0][0]['score']
        neutral_score = elements[0][1]['score']
        negative_score = elements[0][2]['score']
        
        rows.append({f'{prefix}_positive_score': positive_score,
                    f'{prefix}_neutral_score': neutral_score,
                    f'{prefix}_negative_score': negative_score})
    return pd.DataFrame(rows)
temp_title = convert_to_sentiment_df(sentiments_orig_title, "title")
temp_en_title = convert_to_sentiment_df(sentiments_en_title, "en_title")
temp_desc = convert_to_sentiment_df(sentiments_orig_desc, "desc")
temp_en_desc = convert_to_sentiment_df(sentiments_en_desc, "en_desc")
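
`convert_to_sentiment_df` reads scores by position, so it depends on the classifier always returning positive, neutral, negative in that order. Reading by label avoids that dependency. The sketch below assumes the nested output shape used in this notebook; `scores_by_label` is a hypothetical helper and the example values are hand-written, not real model output.

```python
def scores_by_label(prediction):
    """Map one classifier output to {label: score}.

    Assumes the nested shape used in this notebook: a one-element list
    wrapping a list of {'label': ..., 'score': ...} dicts.
    """
    return {item["label"]: item["score"] for item in prediction[0]}

# Hand-written example in the assumed shape, deliberately out of order.
example = [[
    {"label": "negative", "score": 0.7},
    {"label": "positive", "score": 0.1},
    {"label": "neutral", "score": 0.2},
]]
scores = scores_by_label(example)
print(scores["positive"], scores["neutral"], scores["negative"])
# → 0.1 0.2 0.7
```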

In [164]:
lang_result_df = temp_title.merge(temp_en_title, how='outer', left_index=True, right_index=True) \
                      .merge(temp_desc, how='outer', left_index=True, right_index=True) \
                      .merge(temp_en_desc, how='outer', left_index=True, right_index=True)

In [165]:
# repeat the filtering and scoring for the English sample
mask = df_sampled_en['description'].str.len() > 512
df_filtered = df_sampled_en[~mask].dropna(subset=['description'])

desc = df_filtered["description"].to_list()
title = df_filtered["title"].to_list()

sentiments_title = [distilled_student_sentiment_classifier(d) for d in title]
sentiments_desc = [distilled_student_sentiment_classifier(d) for d in desc]


In [166]:
temp_title = convert_to_sentiment_df(sentiments_title, "title")
temp_desc = convert_to_sentiment_df(sentiments_desc, "desc")
temp_en_title = convert_to_sentiment_df(sentiments_title, "en_title")
temp_en_desc = convert_to_sentiment_df(sentiments_desc, "en_desc")

en_result_df = temp_title.merge(temp_en_title, how='outer', left_index=True, right_index=True) \
                      .merge(temp_desc, how='outer', left_index=True, right_index=True) \
                      .merge(temp_en_desc, how='outer', left_index=True, right_index=True)
result_df = pd.concat([lang_result_df, en_result_df])

In [204]:
# apply the same description filter used before scoring so the rows line up with result_df
mask = df_final['description'].str.len() > 512
df_final = df_final[~mask].dropna(subset=['description'])

temp_df = df_final[df_final["language"].isin(['en', 'fr', 'de', 'es'])]
temp_df = temp_df.reset_index(drop=True)
result_df = result_df.reset_index(drop=True)

sentiment_df = pd.concat([temp_df, result_df], axis=1)
sentiment_df.to_csv("./Latest_News/LNews_sentiments.csv")

### Investigate Sentiments

In [212]:
sentiment_df = pd.read_csv("./Latest_News/LNews_sentiments.csv")
sentiment_df = sentiment_df.iloc[:, 2:] # remove index columns 

["Bayern star Kimmich has sparked a huge debate with his confession that he hasn't been vaccinated! Photo: Getty Images",
 'After the coup in Sudan, soldiers shoot at demonstrators. America is temporarily stopping aid worth $700 million. The UN Security Council wants to discuss the situation this Tuesday.',
 'The police expelled the self-proclaimed “border guards”. Photo: MICHELE TANTUSSI/REUTERS',
 'The situation at the Polish border is dramatic - but the federal government is not doing enough. Photo: BILD LIVE',
 'Horrible pictures from Berlin: A tram hits a Toyota and drags it along the tracks. The people in the car are seriously injured and two die. And onlookers take photos. Two people died when a car collided with a tram in Berlin-Lichtenberg.',
 'China is taking strict measures to combat a new wave of corona infections.',
 "Armin Laschet's time as Prime Minister of North Rhine-Westphalia is coming to an end. The CDU leader is moving to the Bundestag as a member of parliament. Fo
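
To get from per-article scores to the ranking of news sources described in the introduction, the scores can be averaged per `source_id`. The sketch below is one possible approach in plain Python. `rank_sources` is a hypothetical helper, the column name `desc_positive_score` follows the prefix used in `convert_to_sentiment_df`, and the numbers are made up for illustration.

```python
from collections import defaultdict

def rank_sources(rows, score_key):
    """Return (source_id, mean score, article count) sorted by mean, highest first."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        entry = totals[row["source_id"]]
        entry[0] += row[score_key]
        entry[1] += 1
    ranking = [(src, total / n, n) for src, (total, n) in totals.items()]
    return sorted(ranking, key=lambda item: item[1], reverse=True)

# Source names appear in the dataset; the scores are invented for this example.
rows = [
    {"source_id": "bild", "desc_positive_score": 0.2},
    {"source_id": "bild", "desc_positive_score": 0.4},
    {"source_id": "faz", "desc_positive_score": 0.5},
]
ranking = rank_sources(rows, "desc_positive_score")
for source, mean_score, n in ranking:
    print(source, round(mean_score, 3), n)
```

Returning the article count next to the mean makes it easier to spot sources whose rank rests on only a handful of articles.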

In [222]:
from collections import Counter
import re

# Concatenate all non-null English descriptions into a single string
text = ' '.join(sentiment_df['en_description'].dropna())
# Tokenize into lowercase words
words = re.findall(r'\w+', text.lower())

# Count the frequency of each word
word_freq = Counter(words)

# Extract the top N most common words
top_n = 100  # Change this value as needed
most_common_words = word_freq.most_common(top_n)

for word, freq in most_common_words:
    print(f"{word}: {freq}")

the: 3069
of: 1218
to: 901
in: 805
a: 731
and: 599
on: 483
for: 452
is: 397
that: 282
s: 256
has: 215
will: 201
was: 200
with: 199
by: 182
this: 180
from: 177
he: 162
at: 162
his: 157
are: 154
as: 143
be: 141
not: 126
president: 126
government: 121
it: 121
have: 121
an: 120
said: 114
minister: 107
after: 102
monday: 100
been: 99
state: 97
who: 88
party: 87
election: 83
candidate: 78
new: 76
october: 75
their: 74
but: 71
first: 71
against: 70
its: 68
there: 67
presidential: 67
more: 66
people: 64
former: 64
national: 61
they: 60
sunday: 59
year: 58
were: 56
which: 55
now: 54
years: 54
her: 54
no: 52
tuesday: 51
federal: 51
corona: 51
over: 49
health: 49
also: 49
political: 47
old: 45
prime: 44
council: 43
group: 43
country: 43
week: 42
all: 41
one: 41
about: 40
head: 40
congress: 40
campaign: 39
between: 39
had: 39
up: 39
vote: 39
announced: 38
mayor: 38
time: 37
leader: 37
before: 37
climate: 37
she: 37
last: 36
during: 36
other: 35
than: 35
elected: 35
would: 35
police: 34
under: 34
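
The top of this list is dominated by function words such as "the" and "of", which say little about the topics covered. Filtering stop words before counting brings content words like "president" and "election" to the top. The sketch below uses a small hand-picked `STOP_WORDS` set for illustration only; the set and the helper `top_content_words` are hypothetical, and a fuller stop-word list would be a better choice for a real run.

```python
import re
from collections import Counter

# Small hand-picked stop-word list for illustration only.
STOP_WORDS = {"the", "of", "to", "in", "a", "and", "on", "for", "is", "that", "s"}

def top_content_words(text, n, stop_words=STOP_WORDS):
    """Most common words in `text`, ignoring stop words."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = "The president said the government is in talks on the election."
print(top_content_words(sample, 3))
```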
