# Multilingual Sentiment Analysis of News Sources

Idea: What sentiment do different news sources express on the same topic, and does it differ across languages?

1. Rank the average happiness of different news sources in different languages
    - [Happiness of News](https://hedonometer.org/showcase/nyt/) has word lists in different languages
    - Dataset of articles related to the topic
    - Look up each word in the dataset and average (?) the happiness scores; the exact metric still needs to be determined
    - The happiness word lists are verified in each language because they are not model output; DistilBERT is predictive, so it may be less accurate
    - Languages: German/Korean/Spanish/Russian/English/Chinese/Arabic/Portuguese/French/Ukrainian
2. Rank positive/neutral/negative sentiment of different news sources in different languages
    - [HuggingFace DistilBERT multilingual model](https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student)
    - Languages: en/ar/de/es/fr/ja/zh/id/hi/it/ms/pt
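
The word-lookup idea in item 1 can be sketched as follows. This is a minimal illustration, not the final metric: the scores in `TOY_SCORES` are made up for the example, and `average_happiness` is a hypothetical helper name. A real run would load the per-language word lists instead.

```python
import re

# Hypothetical happiness scores, invented for illustration only;
# a real run would load the per-language word lists instead.
TOY_SCORES = {"peace": 7.5, "war": 2.0, "talks": 5.5, "crisis": 2.5}

def average_happiness(text, scores):
    """Mean score of the words in `text` that appear in `scores`.

    Returns None when no word of the text is covered by the list.
    """
    # Lowercase and split on word characters (works for space-delimited text).
    words = re.findall(r"\w+", text.lower())
    matched = [scores[w] for w in words if w in scores]
    if not matched:
        return None
    return sum(matched) / len(matched)

headline = "Peace talks begin as crisis deepens"
print(average_happiness(headline, TOY_SCORES))
```

Words missing from the list are skipped rather than counted as neutral, which is one of the metric choices still to be decided. The regex tokenizer assumes space-delimited text, so a language such as Chinese would need a proper word segmenter first.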


## Dataset Collection and Cleaning

In [None]:
# setup notebook
import pandas as pd
from langdetect import detect
from googletrans import Translator

path = "./World_Politics_News/World_Politics_News.csv"
df = pd.read_csv(path)
df.head()

In [7]:
# detect language of article
def detect_language_langdetect(text):
    try:
        detected_language = detect(text)
        return detected_language
    except Exception as e:
        print("An error occurred:", e)
        return None

df["language"] = df["title"].apply(detect_language_langdetect)

In [59]:
# keep only languages with at least 100 articles
lang_counts = df["language"].value_counts()
geq100_lang = lang_counts[lang_counts >= 100].index.tolist()
lang_df = df[df["language"].isin(geq100_lang)]
print(geq100_lang)
print(lang_counts)
# kept: German, English, French, Korean, Spanish, Thai

['de', 'en', 'fr', 'ko', 'es', 'th']
language
de       1162
en        417
fr        322
ko        317
es        224
th        116
hr         96
el         75
ru         62
ro         51
pt         27
nl         18
zh-tw      12
ja          9
af          6
ca          4
no          4
pl          3
da          2
it          2
et          1
sl          1
he          1
bg          1
Name: count, dtype: int64


In [60]:
# downsample German and English to 350 articles each to balance the languages
df_de = lang_df[lang_df['language'] == 'de']
df_sampled_de = df_de.sample(n=350, random_state=42)
df_en = lang_df[lang_df['language'] == 'en']
# .copy() so that columns can be added to the English sample later without warnings
df_sampled_en = df_en.sample(n=350, random_state=42).copy()
df_remaining = lang_df[~lang_df['language'].isin(['de', 'en'])]
# English articles need no translation, so they are set aside and re-added later
df_resampled = pd.concat([df_sampled_de, df_remaining])
df_resampled["language"].value_counts()

language
de    350
fr    322
ko    317
es    224
th    116
Name: count, dtype: int64
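
The cell above caps German and English at 350 articles by hand. The same idea, generalized to any language, might look like the plain-Python sketch below. `cap_per_group` is a hypothetical helper written for this illustration, and the rows are toy data rather than the notebook's dataframe.

```python
import random

def cap_per_group(rows, key, cap, seed=42):
    """Keep at most `cap` randomly chosen rows for each value of `key`."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    sampled = []
    for members in groups.values():
        # Only groups larger than the cap are downsampled.
        if len(members) > cap:
            members = rng.sample(members, cap)
        sampled.extend(members)
    return sampled

# Toy rows: 10 German and 3 Thai articles.
rows = [{"language": "de", "id": i} for i in range(10)]
rows += [{"language": "th", "id": i} for i in range(3)]
balanced = cap_per_group(rows, "language", cap=5)
print(len(balanced))
```

Groups at or below the cap are kept whole, so small languages such as Thai are not reduced further.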

In [47]:
# translate titles and descriptions
translator = Translator()
titles = df_resampled["title"].tolist()
descriptions = df_resampled["description"].tolist()

translated_titles = translator.translate(titles, dest='en')
translated_desc = translator.translate(descriptions, dest='en')
# grab text itself
english_titles = [tra.text for tra in translated_titles]
english_desc = [tra.text for tra in translated_desc]
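
Some descriptions may be missing (a later cell drops rows without a description), and a batch translation call can fail on non-string values. One way to guard against that is sketched below. `translate_non_empty` and its `translate_fn` argument are hypothetical names for this example, and the demo uses a stand-in function rather than a real translator.

```python
def translate_non_empty(texts, translate_fn):
    """Translate only the usable strings, keeping positions aligned.

    `translate_fn` is a hypothetical callable that takes a list of strings
    and returns a list of translated strings in the same order.
    Missing or blank entries come back as None.
    """
    usable = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    translated = translate_fn([texts[i] for i in usable])
    result = [None] * len(texts)
    for i, out in zip(usable, translated):
        result[i] = out
    return result

# Stand-in "translator" for demonstration: upper-cases instead of translating.
def fake_translate(batch):
    return [s.upper() for s in batch]

demo = translate_non_empty(["hallo", float("nan"), "", "welt"], fake_translate)
print(demo)
```

With the translator from the cell above, `translate_fn` could wrap the batch call and return the `.text` of each result. That wrapper is an assumption about how the pieces would be wired together, not something tested here.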

In [68]:
# add back into dataframe
df_resampled["en_title"] = english_titles
df_resampled["en_description"] = english_desc
df_sampled_en["en_title"] = df_sampled_en["title"]
df_sampled_en["en_description"] = df_sampled_en["description"]
df_final = pd.concat([df_resampled, df_sampled_en])
df_final.head(5)

Unnamed: 0,title,link,keywords,creator,video_url,description,content,pubDate,full_description,image_url,source_id,language,en_title,en_description
683,Die große Kimmich-Debatte - Impf-Ruck statt Im...,https://www.bild.de/politik/inland/politik-inl...,"['Politik-Inland', 'Corona-Impfung', 'Coronavi...",,,Bayern-Star Kimmich hat mit seinem Nicht-Geimp...,,2021-10-25 20:14:44,Bayern-Star Joshua Kimmich (26) hat mit seinem...,https://bilder.bild.de/fotos-skaliert/die-gros...,bild,de,The big Kimmich debate - vaccination rush inst...,Bayern star Kimmich has sparked a huge debate ...
194,Nach Putsch im Sudan: UN-Sicherheitsrat kommt ...,https://www.faz.net/aktuell/nach-putsch-im-sud...,,,,Nach dem Putsch im Sudan schießen Soldaten auf...,,2021-10-26 06:04:07,N ach dem Putsch im Sudan will der UN-Sicherhe...,,faz,de,After coup in Sudan: UN Security Council meets...,"After the coup in Sudan, soldiers shoot at dem..."
2747,An deutsch-polnischer Grenze - Rechtsextreme m...,https://www.bild.de/politik/inland/politik-inl...,"['Politik-Inland', 'Guben', 'Migrationspolitik...",,,Die Polizei sprach Platzverweise für die selbs...,,2021-10-24 12:41:35,Großer Einsatz gegen selbsternannte rechtsextr...,https://bilder.bild.de/fotos-skaliert/an-deuts...,bild,de,On the German-Polish border - right-wing extre...,The police expelled the self-proclaimed “borde...
1425,Polizeigewerkschafts-Chef - „Bundesregierung m...,https://www.bild.de/politik/2021/politik/poliz...,['Politik'],,,Die Lage an der polnischen Grenze ist dramatis...,,2021-10-25 10:55:28,Die Lage an der deutsch-polnischen Grenze ist ...,https://bilder.bild.de/fotos-skaliert/polizeig...,bild,de,Police union boss - “Federal government must a...,The situation at the Polish border is dramatic...
1841,Berlin: Nach tödlichem Tram-Unfall – Gaffer ma...,https://www.t-online.de/region/berlin/news/id_...,,,,Schreckliche Bilder aus Berlin: Eine Straßenba...,,2021-10-25 06:43:08,,https://bilder.t-online.de/b/91/02/33/46/id_91...,t-online,de,Berlin: After fatal tram accident – ​​gawkers ...,Horrible pictures from Berlin: A tram hits a T...


The final dataset covers English, German, French, Korean, Spanish, and Thai.

In [69]:
df_final.to_csv("./World_Politics_News/WPNews.csv")

## Happiness of News Analysis

## Sentiment Analysis

In [70]:
from transformers import pipeline
df_final = pd.read_csv("")
distilled_student_sentiment_classifier = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student", 
    return_all_scores=True
)


In [94]:
# keep only the non-English languages that the DistilBERT model supports
distil_langs = ['de', 'fr', 'es']
df_subset = df_final[df_final["language"].isin(distil_langs)]

# drop rows with a missing description or one longer than 512 characters
# (a character-count proxy for DistilBERT's 512-token input limit)
mask = df_subset['description'].str.len() > 512
df_filtered = df_subset[~mask].dropna(subset=['description'])


Index(['title', 'link', 'keywords', 'creator', 'video_url', 'description',
       'content', 'pubDate', 'full_description', 'image_url', 'source_id',
       'language', 'en_title', 'en_description'],
      dtype='object')

In [None]:
orig_title = df_filtered["title"].to_list()
en_title = df_filtered["en_title"].to_list()
orig_desc = df_filtered["description"].to_list()
en_desc = df_filtered["en_description"].to_list()

sentiments_orig_title = [distilled_student_sentiment_classifier(d) for d in orig_title]
sentiments_en_title = [distilled_student_sentiment_classifier(d) for d in en_title]
sentiments_orig_desc = [distilled_student_sentiment_classifier(d) for d in orig_desc]
sentiments_en_desc = [distilled_student_sentiment_classifier(d) for d in en_desc]
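
To get from the raw classifier output to the ranking described in the introduction, the per-article scores need to be aggregated per source or language. The sketch below shows one possible aggregation. It assumes each result is a nested list of `{"label", "score"}` dicts as shown on the model card. The scores and the helper names `top_label` and `label_shares` are invented for this illustration.

```python
from collections import defaultdict

def top_label(result):
    """Return the highest-scoring label from one classifier result.

    Assumes `result` looks like [[{"label": ..., "score": ...}, ...]];
    a flat list of dicts is accepted as well.
    """
    scores = result[0] if isinstance(result[0], list) else result
    return max(scores, key=lambda s: s["score"])["label"]

def label_shares(results, groups):
    """Fraction of each top label per group (e.g. per source or language)."""
    counts = defaultdict(lambda: defaultdict(int))
    for result, group in zip(results, groups):
        counts[group][top_label(result)] += 1
    return {
        g: {label: n / sum(c.values()) for label, n in c.items()}
        for g, c in counts.items()
    }

def make_result(pos, neu, neg):
    """Build one made-up result in the assumed output shape."""
    return [[
        {"label": "positive", "score": pos},
        {"label": "neutral", "score": neu},
        {"label": "negative", "score": neg},
    ]]

# Made-up scores for three articles from two sources.
toy_results = [make_result(0.7, 0.2, 0.1),
               make_result(0.1, 0.1, 0.8),
               make_result(0.3, 0.1, 0.6)]
toy_sources = ["bild", "bild", "faz"]
shares = label_shares(toy_results, toy_sources)
print(shares)
```

Counting top labels discards the confidence of each prediction. Averaging the raw scores per label would be an alternative if that information matters for the ranking.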

In [96]:
def list2txt(list_arr, file_path):
    with open(file_path, 'w') as file:
        for item in list_arr:
            file.write('%s\n' % item)
list2txt(sentiments_orig_title, "sentiments_orig_title.txt")
list2txt(sentiments_en_title, "sentiments_en_title.txt")
list2txt(sentiments_orig_desc, "sentiments_orig_desc.txt")
list2txt(sentiments_en_desc, "sentiments_en_desc.txt")
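
Writing each result with `'%s\n'` stores the Python text form of the nested lists, which is awkward to parse back. If the results need to be reloaded later, JSON is one alternative, sketched below. `list2json` and `json2list` are hypothetical helper names, the sample data is made up, and the sketch assumes the scores are plain Python floats.

```python
import json
import os
import tempfile

def list2json(list_arr, file_path):
    """Write a list of classifier results to a JSON file."""
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(list_arr, f, ensure_ascii=False)

def json2list(file_path):
    """Read the results back as nested lists and dicts."""
    with open(file_path, encoding="utf-8") as f:
        return json.load(f)

# Round trip with made-up data in a temporary directory.
sample = [[[{"label": "positive", "score": 0.75},
            {"label": "negative", "score": 0.25}]]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentiments_demo.json")
    list2json(sample, path)
    restored = json2list(path)
print(restored == sample)
```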