<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:10px 5px'> 
Master Thesis Yannik Haller - Sentiment Analysis: Performance Evaluation of the Self-Developed Naïve Italian Sentiment Algorithm
</h1>
</div>

In order to evaluate the performance of the self-developed naïve Italian sentiment algorithm, we consider 50 professionally translated press releases from the portal of the Swiss government (released between December 18, 2020 and the January 1, 2021) that were published in Italian, German and French. We then compare the resulting polarity scores of the Italian articles, which are assigned by the self-developed naïve Italian sentiment algorithm, with the resulting polarity scores of the same articles written in German and French, which are assigned by the respective (language specific) version of the established Vader and TextBlob sentiment algorithms. To do so, we apply the same classifier-specific text preprocessing to the articles as for the respective Sentiment analysis conducted beforehand.

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
1. Load required packages and the data
</h2>
</div>

In [1]:
# Import required baseline packages
import re
import os
import sys
import glob
import time
import sys
import pandas as pd
import numpy as np
from pprint import pprint

# Change pandas' setting to print out long strings
pd.options.display.max_colwidth = 200

# Plotting tools
import matplotlib.pyplot as plt
%matplotlib inline
# Set global parameters for plotting
import matplotlib.pylab as pylab
params = {'legend.fontsize': 10,
          'figure.figsize': (8, 6),
          'axes.labelsize': 14,
          'axes.titlesize': 16,
          'xtick.labelsize': 10,
          'ytick.labelsize': 10}
pylab.rcParams.update(params)

# Spacy (for lemmatization)
import spacy

# Enable logging for gensim (optional)
import logging
logging.basicConfig(format = '%(asctime)s : %(levelname)s : %(message)s', level = logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

In [2]:
# Set the appropriate working directory
os.chdir('D:\\Dropbox\\MA_data')

In [3]:
# Read in the multilingual articles
multilingual_articles = pd.read_csv("Sentiment/Naive/Performance_evaluation/swiss_gov_multilingual_press_releases.csv", sep = ";", index_col = 0)

# Take a look at the head of the dataframe
multilingual_articles.head(3)

Unnamed: 0,date_retrieved,date_released,title_de,tx_de,tx_fr,tx_it
0,19.04.2021,01.01.2021,2021 - Neujahrsansprache von Bundespräsident Guy Parmelin,"Wir haben ein dunkles Jahr hinter uns. Die Gesundheitskrise hat uns schwer getroffen. Viele Familien haben einen nahen Menschen verloren. Viele konnten von ihm nicht Abschied nehmen, wie sie es si...","Notre pays, comme tant d’autres, a vécu une année sombre. La crise sanitaire nous a durement éprouvés. De nombreuses familles ont perdu des proches et n’ont pas pu faire leur deuil comme elles l’a...","il nostro Paese, come molti altri, ha vissuto un anno buio. La crisi sanitaria ci ha inferto un duro colpo. Molte famiglie hanno perso loro cari e non hanno potuto congedarsi da questi affetti com..."
1,19.04.2021,31.12.2020,Bundesratsfoto 2021: Neue Sicht auf Altbekanntes,Das Bundesratsfoto 2021 zeigt die sieben Bundesratsmitglieder und den Bundeskanzler als Einheit. Im Hintergrund ist das Parlamentsgebäude aus der Vogelperspektive zu sehen. Markus A. Jegerlehner h...,La photo du Conseil fédéral de 2021 montre les sept conseillers fédéraux et le chancelier de la Confédération formant un groupe. L’arrière-plan est occupé par le palais du Parlement vu du ciel. La...,"La fotografia del Consiglio federale per il 2021 ritrae i sette consiglieri federali e il cancelliere della Confederazione in gruppo. Alle loro spalle, una veduta a volo d’uccello del Palazzo del ..."
2,19.04.2021,30.12.2020,Medienberichte zu Todesfall nach Covid-19-Impfung in der Schweiz: Kein Zusammenhang mit der Impfung ersichtlich,"Einige Tage nach einer Covid-19-Impfung ist in einem Alters- und Pflegeheim im Kanton Luzern eine 91-jährige Person, die an mehreren schweren Vorerkrankungen litt, verstorben. Weder die Krankenges...","Quelques jours après s’être fait vacciner contre le COVID-19, une personne de 91 ans souffrant de plusieurs maladies préexistantes graves est décédée dans un établissement médico-social du canton ...","Una persona di 91 anni, affetta da gravi patologie pregresse, è morta in una casa di cura nel cantone di Lucerna, pochi giorni dopo essersi sottoposta alla vaccinazione contro il Covid-19. Né l’an..."


In [4]:
# Extract the German articles as a list of articles
articles_de = multilingual_articles.tx_de.values.tolist()
# Extract the French articles as a list of articles
articles_fr = multilingual_articles.tx_fr.values.tolist()
# Extract the Italian articles as a list of articles
articles_it = multilingual_articles.tx_it.values.tolist()

In [5]:
# Take a look at the first article in German
articles_de[0]

'Wir haben ein dunkles Jahr hinter uns. Die Gesundheitskrise hat uns schwer getroffen. Viele Familien haben einen nahen Menschen verloren. Viele konnten von ihm nicht Abschied nehmen, wie sie es sich gewünscht hätten. Für sie wird das vergangene Jahr für immer verbunden sein mit diesem schmerzlichen Verlust. Die Mitarbeitenden in Spitälern und Pflegeheimen kamen an den Rand ihrer Kräfte und sind es heute noch. Andere durften lange gar nicht mehr arbeiten, waren in Kurzarbeit oder haben sogar ihre Stelle verloren. Traditionsunternehmen sind verschwunden. Auch unser Bildungssystem wurde auf eine harte Probe gestellt. Kurz: Die Pandemie hat unser aller Leben auf den Kopf gestellt. Selten haben wir Vergleichbares erlebt: Unsere Tätigkeiten kamen zum Stillstand. Die ganze Gesellschaft befand sich in noch nie dagewesener Isolation. Wir mussten lernen, ohne Händeschütteln auszukommen. Dieses wichtige Begrüssungsritual gefährdete plötzlich unsere Gesundheit. All das war und ist für uns umso sc

In [6]:
# Take a look at the first article in French
articles_fr[0]

'Notre pays, comme tant d’autres, a vécu une année sombre. La crise sanitaire nous a durement éprouvés. De nombreuses familles ont perdu des proches et n’ont pas pu faire leur deuil comme elles l’auraient souhaité. Pour elles, l’année 2020 restera liée au souvenir de cette perte douloureuse. La pandémie a bouleversé nos existences aussi en envoyant des personnes aux chômage, en détruisant des entreprises de tradition ou en mettant nos systèmes d’éducation et de santé à l’épreuve.  Nous n’avons pratiquement jamais connu pareille situation: voir nos activités au point mort, la population à l’isolement, la poignée de main bannie de nos codes sociaux. C’est d’autant plus cruel que l’être humain, comme le soulignait Aristote, « est fait par nature pour vivre avec ses semblables ».  A l’aube de cette année 2021, le réalisme m’interdit de former des vœux trop enthousiastes. Il m’oblige plutôt à constater que les inconnues sont nombreuses et que la situation demeure précaire. Je tiens néanmoin

In [7]:
# Take a look at the first article in Italian
articles_it[0]

'il nostro Paese, come molti altri, ha vissuto un anno buio. La crisi sanitaria ci ha inferto un duro colpo. Molte famiglie hanno perso loro cari e non hanno potuto congedarsi da questi affetti come avrebbero voluto. Per loro il 2020 resterà legato al ricordo di questa dolorosa perdita. La pandemia ha stravolto le nostre vite causando disoccupazione, distruggendo aziende radicate da tempo nel territorio e mettendo a dura prova i nostri sistemi formativi e sanitari.  Mai in passato ci eravamo trovati confrontati a una situazione del genere: le nostre attività ferme, la popolazione in isolamento, la stretta di mano − parte del nostro vivere sociale − bandita. Questa situazione è tanto più crudele se si pensa che l’essere umano, come sosteneva Aristotele, «tende per natura ad aggregarsi con altri individui».  All’alba di questo 2021 il realismo mi impedisce di formulare auguri troppo entusiastici. Mi costringe piuttosto a constatare che le incognite sono molte e che la situazione resta pr

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2. Language- and algorithm-specific preprocessing
</h2>
</div>

In [8]:
## Define all function required for the subsequent preprocessing

# Define a function to prepare/pre-clean the text data for the sentiment analysis using the VADER algorithms
def pre_clean_vader(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()
    # Replace punctuations which are not followed by a blank with punctuations followed by a blank
    articles = [re.sub(r'[\.]', '. ', x) for x in articles]
    # Separate words in which a lowercase letter is followed by a capital letter, since they usually do not belong together
    articles = [re.sub('(^[a-z]*)+([A-Z])', r'\1 \2', x) for x in articles]
    # Correct manually for those cases where a name like 'McDonalds' was separated to Mc Donalds
    articles = [re.sub('Mc ', 'Mc', x) for x in articles]
    # Replace quotation marks with a blank
    articles = [re.sub('«', ' ', x) for x in articles]
    articles = [re.sub('»', ' ', x) for x in articles]
    # Remove percentage signs
    articles = [re.sub('%', ' ', x) for x in articles]
    # Remove distracting hyphens
    articles = [re.sub("-", " ", x) for x in articles]
    articles = [re.sub("–", " ", x) for x in articles]
    # Remove any blank that precedes a comma
    articles = [re.sub(" ,", ",", x) for x in articles]
    # Remove any blank that precedes a dot
    articles = [re.sub(" \.", ".", x) for x in articles]

    # Remove any links starting with http:// or https://
    articles = [re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    # Remove any links starting with www.
    articles = [re.compile('www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    
    # Replace new line characters (i.e. \n) and multiple blanks with a single blank
    articles = [re.sub('\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for pre-cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the pre-cleaned text data
    return articles


# Define a function to prepare/pre-clean the text data
def pre_clean(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()
    # Replace punctuations which are not followed by a blank with punctuations followed by a blank
    articles = [re.sub(r'[\.]', '. ', x) for x in articles]
    # Separate words in which a lowercase letter is followed by a capital letter, since they usually do not belong together
    articles = [re.sub('(^[a-z]*)+([A-Z])', r'\1 \2', x) for x in articles]
    # Correct manually for those cases where a name like 'McDonalds' was separated to Mc Donalds
    articles = [re.sub('Mc ', 'Mc', x) for x in articles]
    # Replace quotation marks with a blank
    articles = [re.sub('«', ' ', x) for x in articles]
    articles = [re.sub('»', ' ', x) for x in articles]
    # Remove percentage signs
    articles = [re.sub('%', ' ', x) for x in articles]
    # Remove distracting hyphens
    articles = [re.sub("-", " ", x) for x in articles]
    articles = [re.sub("–", " ", x) for x in articles]
    # Replace new line characters (i.e. \n) and multiple blanks with a single blank
    articles = [re.sub('\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for pre-cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the pre-cleaned text data
    return articles


# Define a function to perform the following tasks at once:
## 1. Tokenize: transform text into a list of words, digits and punctuations
## 2. Filter the tokens, such that only nouns, proper nouns, verbs, adjectives, adverbs and negations are kept, while digits and punctuations are removed
## 3. Lemmatize: transform each word back to its word stem
## 4. Lowercase the entire data
def tokenize_filter_and_lemmatize(articles, nlp, language):
    # Keep track of the processing time
    t = time.time()
    # Create a list to store the output
    articles_out = []
    # Define the list of allowed postags (pos = part of speech)
    allowed_postags = ['PROPN', 'NOUN', 'ADJ', 'VERB', 'ADV']
    # Define the list of allowed negation words, depending on the focal language
    if language == 'de':
        allowed_negations = ['nie', 'nichts', 'nicht', 'kein', 'wenig', 'ohne']
    if language == 'fr':
        allowed_negations = ['rien', 'jamais', 'aucun', 'aucune', 'pas', 'ni', 'encore', 'guère', 'personne', 'nullement']
    if language == 'it':
        allowed_negations = ['no', 'non', 'niente', 'nessuno']
    # Create a loop to go through all articles in the input list of articles
    for article in articles:
        # Define the current article as the focal document
        doc = nlp(article)
        # Tokenize, filter and lemmatize the document, while filtering out punctuations and unused word types (i.e. words which are not contained in the 'allowed_posttags' variable)
        articles_out.append([token.lemma_.lower() for token in doc if (token.pos_ in allowed_postags) or (token.lemma_.lower() in allowed_negations)])
    # Print out the processing time
    print("Processing time for tokenizing, filtering and lemmatization: ", str((time.time() - t)/60), "minutes")

    # Return the tokenized, filtered and lemmatized text data
    return articles_out


# Define a function to apply the supplemental/manual cleaning to the filtered and lemmatized data
def supp_clean_blob(articles, language = 'de'):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()
    # Remove any links starting with http:// or https://
    articles = [re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    # Remove any links starting with www.
    articles = [re.compile('www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+').sub('',x) for x in articles]
    # Remove any instances where 1 to 3 initiating letters are followed by a dot (either once at the end or after each letter), since such cases usually represent abbreviations with low semantic meaning
    articles = [re.compile(' [a-z][\.]?[a-z]?[\.]?[a-z]?\.+').sub(' ', x) for x in articles]
    # Remove any remaining digit
    articles = [re.sub(r'\b\d+\b', '', x) for x in articles]
    # Remove anything except words, spaces and the & sign, since this might appear in certain names
    articles = [re.sub(r'[^\w\s\&]','', x) for x in articles]
    # Remove a list of specific words which appear quite often but do not seem to add any semantic value (for Italian also the auxiliary verb essere, as this is not identified as such by the Italian Spacy module)
    if language == 'it':
        words_to_remove = ['awp','afp','essere']
    else:
        words_to_remove = ['awp','afp']
    for word in words_to_remove:
        articles = [re.sub(' '+word, '', x) for x in articles] 
    # Replace new line characters (i.e. \n) and multiple blanks with a single blank
    articles = [re.sub('\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for supplemental manual cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the manually cleaned text data
    return articles


# Define a function to prepare/pre-clean the text data for the sentiment analysis using the Blob algorithms
# Note: "articles" has to be a list containing the raw text of the articles as input
def process_blob(articles, language, tokenize = True):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a List.")
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Initialize the appropriate spacy model depending on the language of the text data, while keeping only the tagger component (for efficiency)
    if language == 'de':
        nlp = spacy.load('de_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'fr':
        nlp = spacy.load('fr_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    elif language == 'it':
        nlp = spacy.load('it_core_news_lg', disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    
    ## Apply the actual processing by means of the previously defined functions
    # Pre-clean the data
    articles = pre_clean(articles)
    # Tokenize, filter and lemmatize the data
    articles = tokenize_filter_and_lemmatize(articles, nlp, language)
    
    # Generate a list containing the preprocessed text data in form of strings in which all remaining lemmatized tokens are contained and separated by a blank (i.e. untokenization)
    articles_out = []
    for article in articles:
        articles_out.append(" ".join(article))

    # Apply the supplemental cleaning
    articles = supp_clean_blob(articles_out, language)

    # Retokenize the data if tokenize = true
    if tokenize:
        articles = retokenize(articles)
    
    # Return processed text data
    return articles


# Define a function to retokenize the preprocessed text data
def retokenize(articles):
    articles_out = []
    for article in articles:
        articles_out.append(article.split())
    return articles_out

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.1.1 German: Preprocessing for the Vader sentiment algorithm
</h2>
</div>

In [9]:
# Apply the above defined function to preprocess the German text data for the Vader sentiment algorithm
articles_vader_de = pre_clean_vader(articles_de)

Processing time for pre-cleaning:  0.00019950071970621744 minutes


In [10]:
# Take a look at the first article of the preprocessed data
articles_vader_de[0]

' Wir haben ein dunkles Jahr hinter uns. Die Gesundheitskrise hat uns schwer getroffen. Viele Familien haben einen nahen Menschen verloren. Viele konnten von ihm nicht Abschied nehmen, wie sie es sich gewünscht hätten. Für sie wird das vergangene Jahr für immer verbunden sein mit diesem schmerzlichen Verlust. Die Mitarbeitenden in Spitälern und Pflegeheimen kamen an den Rand ihrer Kräfte und sind es heute noch. Andere durften lange gar nicht mehr arbeiten, waren in Kurzarbeit oder haben sogar ihre Stelle verloren. Traditionsunternehmen sind verschwunden. Auch unser Bildungssystem wurde auf eine harte Probe gestellt. Kurz: Die Pandemie hat unser aller Leben auf den Kopf gestellt. Selten haben wir Vergleichbares erlebt: Unsere Tätigkeiten kamen zum Stillstand. Die ganze Gesellschaft befand sich in noch nie dagewesener Isolation. Wir mussten lernen, ohne Händeschütteln auszukommen. Dieses wichtige Begrüssungsritual gefährdete plötzlich unsere Gesundheit. All das war und ist für uns umso s

In [11]:
## Create a tsv file which suits the requirements to be fed into the GerVADER algorithm
# Create a dataframe containing the preprocessed fulltext data in a column
articles_vader_de = pd.DataFrame(articles_vader_de, columns = ['tx'])
# Add a column containing the sentiment (= unknown or un --> Required from GerVADER)
articles_vader_de['senti'] = np.repeat('un', articles_vader_de.shape[0]).tolist()
# Correct the order of the columns
articles_vader_de = articles_vader_de[['senti','tx']]
# Extract the Dataframe as a tsv file
articles_vader_de.to_csv("Sentiment/Naive/Performance_evaluation/de_multilingual_articles_vader.tsv", index = True, encoding = 'utf-8-sig', sep = '\t', header = False)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.1.1 German: Preprocessing for the TextBlob sentiment algorithm
</h2>
</div>

In [12]:
# Apply the above defined functions to preprocess the German text data for the TextBlob sentiment algorithm
articles_blob_de = process_blob(articles_de, 'de', tokenize = False)

Processing time for pre-cleaning:  0.00019945700963338217 minutes
Processing time for tokenizing, filtering and lemmatization:  0.02363736629486084 minutes
Processing time for supplemental manual cleaning:  0.0001329819361368815 minutes


In [13]:
# Take a look at the first article of the preprocessed data
articles_blob_de[0]

'dunkel jahr gesundheitskrise schwer triefen familie nah mensch verlieren können nicht abschied nehmen wie wünschen vergangen jahr immer verbinden schmerzlich verlust mitarbeitende spitälern pflegeheimen kommen rand kraft heute noch dürfen lang gar nicht mehr arbeiten kurzarbeit sogar stelle verlieren traditionsunternehmen verschwinden auch bildungssystem hart probe stellen kurz pandemie leben kopf stellen selten vergleichbares erleben tätigkeit kommen stillstand ganze gesellschaft befinden noch nie dagewesener isolation mussten lernen ohne händeschütteln auskommen wichtig begrüssungsritual gefährden plötzlich gesundheit umso schwierig mensch so schon aristoteles sagen natur gesellig wesen sichern verständnis beginn neu jahr mögen nicht enthusiastisch äussern ungewiss lage bleiben prekär trotzdem mögen herz gut wunsch überbringen denken insbesondere mensch einsam kranken denken verlust nah bekennen leiden denken ungewohnte alltag sorge machen mögen heute erneut versichern bundesrat so 

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.2.1 French: Preprocessing for the Vader sentiment algorithm
</h2>
</div>

In [14]:
# Apply the above defined function to preprocess the French text data for the Vader sentiment algorithm
articles_vader_fr = pre_clean_vader(articles_fr)

Processing time for pre-cleaning:  0.00024871826171875 minutes


In [15]:
# Take a look at the first article of the preprocessed data
articles_vader_fr[0]

' Notre pays, comme tant d’autres, a vécu une année sombre. La crise sanitaire nous a durement éprouvés. De nombreuses familles ont perdu des proches et n’ont pas pu faire leur deuil comme elles l’auraient souhaité. Pour elles, l’année 2020 restera liée au souvenir de cette perte douloureuse. La pandémie a bouleversé nos existences aussi en envoyant des personnes aux chômage, en détruisant des entreprises de tradition ou en mettant nos systèmes d’éducation et de santé à l’épreuve. Nous n’avons pratiquement jamais connu pareille situation: voir nos activités au point mort, la population à l’isolement, la poignée de main bannie de nos codes sociaux. C’est d’autant plus cruel que l’être humain, comme le soulignait Aristote, est fait par nature pour vivre avec ses semblables . A l’aube de cette année 2021, le réalisme m’interdit de former des vœux trop enthousiastes. Il m’oblige plutôt à constater que les inconnues sont nombreuses et que la situation demeure précaire. Je tiens néanmoins à 

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.2.1 French: Preprocessing for the TextBlob sentiment algorithm
</h2>
</div>

In [16]:
# Apply the above defined functions to preprocess the French text data for the TextBlob sentiment algorithm
articles_blob_fr = process_blob(articles_fr, 'fr', tokenize = False)

Processing time for pre-cleaning:  0.00018286705017089844 minutes
Processing time for tokenizing, filtering and lemmatization:  0.05114614963531494 minutes
Processing time for supplemental manual cleaning:  0.00011636018753051758 minutes


In [17]:
# Take a look at the first article of the preprocessed data
articles_blob_fr[0]

'pays tant autre vivre année sombre crise sanitaire durement éprouver nombreux famille perdre proche n pas pouvoir faire deuil souhaiter année rester lier souvenir perte douloureux pandémie bouleverser existence aussi envoyer personne chômage détruire entreprise tradition mettre système éducation santé épreuve n pratiquement jamais connaître pareil situation voir activité point mort population isolement poignée main bannir code social autant plus cruel être humain souligner aristote faire nature vivre semblable aube année réalisme interdire former vœu trop enthousiaste obliger plutôt constater inconnu nombreux situation demeure précaire tenir néanmoins adresser fond cœur chaleureux pensée avoir priorité souffrir solitude maladie perte être cher rigueur âge effet pandémie encore accentuer difficulté personnel grand nombre tenir assurer nouveau fois soutien conseil fédéral engagement constant trouver solution permettre pays repartir bon pied aimer tout même dire aussi optimisme pas optim

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.3 Italian: Preprocessing for the naïve sentiment algorithm
</h2>
</div>

In [18]:
# Apply the above defined functions to preprocess the Italian text data for the naïve sentiment algorithm
articles_naive_it = process_blob(articles_it, 'it', tokenize = True)

Processing time for pre-cleaning:  0.00016624132792154948 minutes
Processing time for tokenizing, filtering and lemmatization:  0.02636292775472005 minutes
Processing time for supplemental manual cleaning:  0.00013294617335001628 minutes


In [19]:
# Take a look at the first 10 tokens of the first article of the preprocessed data
articles_naive_it[0][:10]

['paese',
 'vivere',
 'anno',
 'buio',
 'crisi',
 'sanitario',
 'inferto',
 'durare',
 'colpo',
 'famiglia']

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3. Sentiment Assessment
</h2>
</div>

In [20]:
## Apply all preparation steps needed to set up the naïve Italian sentiment algorithm

# Read in the sentiment lexicon Sentix (i.e. an Italian lexicon for sentiment analysis) as a dataframe
senti_lex_df = pd.read_csv("Sentiment/Naive/Italian/sentix.txt", sep = '\t', header = None, names = ['lemma','POS','ID','pos_score','neg_score','polarity','intensity'])
# Lowercase the entries in the column 'lemma'
senti_lex_df['lemma'] = senti_lex_df['lemma'].str.lower()
# Relabel the POS Tags to a common standard
senti_lex_df['POS'].replace({'a': 'ADJ', 'n': 'NOUN', 'v': 'VERB', 'r': 'ADV'}, inplace = True)
# Remove exact duplicates
n_duplicates = sum(senti_lex_df.duplicated())
senti_lex_df.drop_duplicates(keep = 'first', inplace = True, ignore_index = True)
print(n_duplicates, "exactly duplicated entries have been removed.")

# Calculate the average polarity of all duplicated words that are assigned with the same POS
senti_lex_df = senti_lex_df.groupby(['POS','lemma'])['polarity'].mean().reset_index()
# Sort the dataframe according to the alphabetical order of the POS-tags, such that first all ADJs appear, which then are followed by ADVs, NOUNs and then VERBs
senti_lex_df.sort_values(['POS','lemma'], inplace = True)
# Remove duplicates, while the first appearing lemma is kept (hence, in case of duplicated lemmas that have different POS-tags, first adjectives are kept, then adverbs, then nouns and then verbs)
n_duplicates = sum(senti_lex_df.duplicated(subset = ['lemma'], keep = 'first'))
print(n_duplicates, "duplicated lemmas with differing POS-tags have been removed.")
senti_lex_df.drop_duplicates(subset = ['lemma'], keep = 'first', inplace = True, ignore_index = True)

# Create a dictionary out of the sentiment lexicon
senti_lex_dict = {}
for index, row in senti_lex_df.iterrows():
    senti_lex_dict[row['lemma']] = {'POS': str(row['POS']), 'polarity': float(row['polarity'])}

# Remove unnecessary variables to save RAM
del senti_lex_df

# Define the set of possible negations
negations = ['no', 'non', 'niente', 'nessuno']

# Empirically derived mean sentiment intensity rating increase for booster words (adapted from the VADER module)
# Note: The values have been devided by 4, because we are working with polarities directly (which range from -1 to 1) instead of the unscaled crowd ratings (which range from -4 to 4)
B_INCR = 0.293/4
B_DECR = -0.293/4

# Define the dictionary of booster words
booster_dic = \
    {"assolutamente": B_INCR, "assoluto": B_INCR, "assoluta": B_INCR, "totalmente": B_INCR, "totale": B_INCR, #"absolutely": B_INCR,
     "sorprendente": B_INCR, "mirabolante": B_INCR, "stupefacente": B_INCR, "straordinario": B_INCR, "straordinaria": B_INCR, "strabiliante": B_INCR, #"amazingly": B_INCR,
     "enorme": B_INCR, "esorbitante": B_INCR, "immenso": B_INCR, "immensa": B_INCR, "colossale": B_INCR, #"awfully": B_INCR,
     "completo": B_INCR, "completa": B_INCR, "intero": B_INCR, "intera": B_INCR,    #"completely": B_INCR,
     "considerevole": B_INCR, "ingente": B_INCR, "notevole": B_INCR, "ragguardevole": B_INCR, "rilevante": B_INCR, "apprezzabile": B_INCR, "cospicuo": B_INCR, "cospicua": B_INCR, #"considerably": B_INCR,
     "inequivocabile": B_INCR, "univoco": B_INCR, "univoca": B_INCR, "netto": B_INCR, "netta": B_INCR, "indubitato": B_INCR, "indubitata": B_INCR, #"decidedly": B_INCR,
     "fundamentale": B_INCR, #"deeply": B_INCR,
     "dannato": B_INCR, "dannata": B_INCR, #"effing": B_INCR,
     "oltremodo": B_INCR, "oltremisura": B_INCR, "sommamente": B_INCR, "squisitamente": B_INCR, "straordinariamente": B_INCR, #"enormously": B_INCR,
     #"entirely": B_INCR,
     "particolare": B_INCR, "particolarmente": B_INCR, "speciale": B_INCR, "specialmente": B_INCR, # "especially": B_INCR,
     "insolito": B_INCR, "insolita": B_INCR, "eccezionalmente": B_INCR,  #"exceptionally": B_INCR,
     "estremamente": B_INCR, "estremo": B_INCR, "estrema": B_INCR, #"extremely": B_INCR,
     "favoloso": B_INCR, "favolosa": B_INCR, "fantastico": B_INCR, #"fabulously": B_INCR,
     #"flipping": B_INCR,
     #"flippin": B_INCR,
     #"fricking": B_INCR,
     #"frickin": B_INCR,
     #"frigging": B_INCR,
     #"friggin": B_INCR,
     #"fully": B_INCR,
     #"fucking": B_INCR,
     "molto": B_INCR, "intensamente": B_INCR, "parecchio": B_INCR, "tanto": B_INCR, "massiccio": B_INCR, "massiccia": B_INCR,  #"greatly": B_INCR,
     #"hella": B_INCR,
     "supremo": B_INCR, "suprema": B_INCR, #"highly": B_INCR,
     "immensamente": B_INCR, "immenso": B_INCR, "immensa": B_INCR, #"hugely": B_INCR,
     "incredibile": B_INCR, #"incredibly": B_INCR,
     "intensamente": B_INCR, #"intensely": B_INCR,
     "principalmente": B_INCR, #"majorly": B_INCR,
     "più": B_INCR, #"more": B_INCR,
     "maggior": B_INCR, #"most": B_INCR,
     "particolarmente": B_INCR, "soprattutto": B_INCR, #"particularly": B_INCR,
     "puramente": B_INCR, "esclusivamente": B_INCR, #"purely": B_INCR,
     "abbastanza": B_INCR, "piuttosto": B_INCR, "alquanto": B_INCR, #"quite": B_INCR,
     "davvero": B_INCR, "veramente": B_INCR, #"really": B_INCR,
     "notevolmente": B_INCR, #"remarkably": B_INCR,
     "essenziale": B_INCR, "considerabilmente": B_INCR, #"substantially": B_INCR,
     "accuratamente": B_INCR, "completamente": B_INCR, #"thoroughly": B_INCR,
     #"totally": B_INCR,
     "tremendamente": B_INCR, "enormemente": B_INCR, #"tremendously": B_INCR,
     #"uber": B_INCR,
     "incredibilmente": B_INCR, #"unbelievably": B_INCR,
     "insolitamente": B_INCR, "inusualmente": B_INCR, #"unusually": B_INCR,
     #"utterly": B_INCR,
     #"very": B_INCR,
     #####
     "quasi": B_DECR, "pressoché": B_DECR, #"almost": B_DECR,
     "appena": B_INCR, "malapena": B_INCR, #"barely": B_DECR,
     "stento": B_DECR, #"hardly": B_DECR,
     "abbastanza": B_INCR, #"just enough": B_DECR,
     "alquanto": B_DECR, #"kind of": B_DECR,
     "tipo": B_INCR, #"kinda": B_DECR,
     #"kindof": B_DECR,
     #"kind-of": B_DECR,
     "meno": B_INCR, #"less": B_DECR,
     "piccolo": B_INCR, #"little": B_DECR,
     "esiguo": B_DECR, "esigua": B_DECR, "futile": B_DECR, "insignificante": B_DECR, "marginale": B_DECR,  #"marginally": B_DECR,
     "occasionale": B_DECR, "saltuario": B_DECR, "saltuaria": B_DECR, #"occasionally": B_DECR,
     "parziale": B_DECR, #"partly": B_DECR,
     "scarso": B_DECR, "scarsa": B_DECR, "rado": B_DECR, "rada": B_DECR, "scarsamente": B_DECR, "magro": B_DECR, "magra": B_DECR, #"scarcely": B_DECR,
     "poco": B_DECR, "briciolo": B_DECR, "pizzico": B_DECR, #"slightly": B_DECR,
     "piuttosto": B_DECR #"somewhat": B_DECR,
     #"sort of": B_DECR,
     #"sorta": B_DECR,
     #"sortof": B_DECR,
     #"sort-of": B_DECR
     }

17647 exactly duplicated entries have been removed.
675 duplicated lemmas with differing POS-tags have been removed.


In [21]:
# Set up the Sentiment Classifier class
class NaiveSentimentClassifierIT:
    def __init__(self, senti_lex_dict, negations, booster_dic):
        # Note:
        ## senti_lex_dict has to be a dictionary with entries of the following form: {token: {'POS': token_POStag, 'polarity': token_polarity_score}}
        ## negations has to be a list of negation words
        ## booster_dic has to be a dictionary with entries of the following form: {token: additive_polarity_impact_on_target_token}
        # Store the inputs within the corresponding attribute of self
        self.senti_lex_dict = senti_lex_dict
        self.negations      = negations
        self.booster_dic    = booster_dic
    
    # Set up a function that evaluates the polarity of the articles
    def evaluate(self, tx, idx):
        # Note: 
        ## tx has to be a list of tokenized articles (i.e. a list, whose elements are itself lists containing the tokenized and precleaned articles)
        ## --> (precleaned means lemmatized and filtered, such that only negations, nouns, verbs, adverbs and adjectives are contained)
        ## idx has to be a list containing the ordered article indexes corresponding to the articles in tx

        # Keep track of the processing time
        t = time.time()

        # Create an empty list to store the resulting document polarity scores
        article_polarity = []
        # Set up a loop to go through all articles
        for article in tx:
            # Apply the above defined functions to get the token's polarity scores, score adjustments (through booster words) and score multipliers (through negations)
            self.get_token_postag(article)
            self.get_token_polarity(article)
            self.get_score_adjustment(article)
            self.get_score_multiplier(article)
            self.get_article_polarity()
            # Apply the above defined function to calculate the final polarity of the article and append it to the variable article_polarity
            article_polarity.append(self.document_polarity)
        # Store the article polarities in slef
        self.article_polarity = article_polarity

        # Print out the processing time
        print("Processing time to evaluate the article sentiments:", str((time.time() - t)/60), "minutes")

        # Create a correctly indexed dataframe containing the article sentiments
        Naive_tx_polarity = pd.DataFrame(article_polarity, index = idx, columns = ['Naive_polarity'])
        # Return the results
        return Naive_tx_polarity

    ## Define all functions needed within the Sentiment Classifier class

    # Define a function to get the POS-tag for each token in an article (given that the token is contained in the sentiment lexicon)
    def get_token_postag(self, article):
        # Create a list to store the results
        token_pos = []
        # Set up a loop to go through all tokens of the article
        for token in article:
            if token in self.senti_lex_dict:
                token_pos.append(self.senti_lex_dict[token]['POS'])
            else:
                token_pos.append('UNKNOWN')
        # Store the resulting list of the token POS-tags in self
        self.token_pos = token_pos

    # Define a function to get the polarity score for each token in an article (given that the token is contained in the sentiment lexicon)
    def get_token_polarity(self, article):
        # Create a list to store the results
        token_polarity = []
        # Set up a loop to go through all tokens of the article
        for token in article:
            if token in self.senti_lex_dict:
                token_polarity.append(self.senti_lex_dict[token]['polarity'])
            else:
                token_polarity.append(0)
        # Store the resulting list of the token polarities in self
        self.token_polarity = token_polarity

    # Define a function to get the score multiplier (caused by negation words) for each token in an article
    def get_score_multiplier(self, article):
        # Define the negation scalar (adapted from VADER)
        neg_scalar = -0.74
        # Define a list of ones of the same length as the number of tokens in the article
        score_multiplier = np.repeat(1, len(article)).tolist()
        # Set up a loop to go through all tokens of the article
        for i in np.arange(1, len(article)-1):
            # Check whether the word is a negation word and assign a neg_scalar to the multiplier at the position of the subsequnet token and a 0.5*neg_scalar at the position of the second token after the negation
            if article[i] in self.negations:
                score_multiplier[i+1] = neg_scalar
                if i < (len(article)-2):
                   score_multiplier[i+2] = 0.5*neg_scalar
        # Store the score_multiplier variable in self
        self.score_multiplier = score_multiplier

    # Define a function to get the score adjustment (caused by intensifier words) for each token in an article
    def get_score_adjustment(self, article):
        # Define a list of zeros of the same length as the number of tokens in the article
        score_adjustment = np.repeat(0, len(article)).tolist()
        # Set up a loop to go through all tokens of the article
        for i in np.arange(1, len(article)-1):
            # If the previous word was a verb and the current word is a booster, then assign an adjustment to the verb
            if article[i] in self.booster_dic and self.token_pos[i-1] == 'VERB':
                score_adjustment[i-1] = score_adjustment[i-1] + self.booster_dic[article[i]]
            # Else, it is assumed that the booster affects the subsequent token
            elif article[i] in self.booster_dic and not self.token_pos[i-1] == 'VERB':
                score_adjustment[i+1] = score_adjustment[i+1] + self.booster_dic[article[i]]
        # Store the score_adjustment variable in self
        self.score_adjustment = score_adjustment

    # Define a function to calculate the final polarity of an article
    def get_article_polarity(self):
        # Create an empty list to store the final token polarity (note: from now on only tokens with a polarity != 0 are kept)
        final_token_polarity = []
        # Set up a loop to calculate each token's final polarity score
        for i in range(len(self.token_polarity)):
            if self.token_polarity[i] > 0:
                final_token_polarity.append((self.token_polarity[i] + self.score_adjustment[i])*self.score_multiplier[i])
            if self.token_polarity[i] < 0:
                final_token_polarity.append((self.token_polarity[i] - self.score_adjustment[i])*self.score_multiplier[i])
        # Calculate the article polarity, which is just the average final token polarity among all tokens that were kept (i.e. that exhibit a non-zero adjusted polarity)
        # If the list final_token_polarity is empty, then just assign a polarity of 0
        if len(final_token_polarity) == 0:
            document_polarity = 0
        else:
            document_polarity = np.mean(final_token_polarity)
        # Ensure that the resulting polarity is still in the range between -1 and 1
        if document_polarity > 1: document_polarity = 1
        if document_polarity < -1: document_polarity = -1
        # Store the resulting document polarity in self
        self.document_polarity = document_polarity

In [22]:
## Define all functions needed to apply the various sentiment algorithms

# Define a function that evaluates the polarity of the articles using the according TextBlob modules
def eval_blob_polarity(tx, idx):
    # Notes: 
    ## tx has to be a list containing the precleaned and NOT tokenized articles
    ## idx has to be a list containing the correctly ordered index

    # Initialize a Blobber class, which uses the language specific PatternAnalyzer we imported above to assess text polarity (sentiment from -1 to 1) and subjectivity
    tb = Blobber(pos_tagger = PatternTagger(), analyzer = PatternAnalyzer())
    # Keep track of the processing time
    t = time.time()
    # Set up a loop to go through all articles and evaluate their polarity with TextBlob
    pol = []
    for article in tx:
        pol.append(tb(article).sentiment[0])
    # Print out the processing time
    print("Processing time to evaluate polarity scores of the articles:", str(round((time.time() - t)/60,2)), "minutes")
    # Create a correctly indexed dataframe
    Blob_tx_polarity = pd.DataFrame(pol, index = idx, columns = ['Blob_polarity'])
    # Return the results
    return Blob_tx_polarity


# Define a function that evaluates the polarity of the articles and using the French Vader module
def eval_vader_polarity(tx, idx, fuzzywuzzy = False):
    # Notes: 
    ## tx has to be a list containing the (precleaned) fulltext articles
    ## idx has to be a list containing the correctly ordered index
    ## fuzzywuzzy indicates whether the sentiment analysis algorithm should also look for similar words whenever a word is not detected properly to overcome issues with typos (computationally more expensive!) 

    # Initialize a Vader class, which uses the language specific SentimentIntensityAnalyzer we imported above to assess text polarity (sentiment from -1 to 1) and subjectivity
    sia = SentimentIntensityAnalyzer()
    # Keep track of the processing time
    t = time.time()
    # Set up a loop to go through all articles and evaluate their polarity with Vader
    pol = []
    for article in tx:
        if fuzzywuzzy:
            pol.append(sia.polarity_scores_max(article)['compound'])
        else:
            pol.append(sia.polarity_scores(article)['compound'])
    # Print out the processing time
    print("Processing time to evaluate polarity scores of the articles:", str(round((time.time() - t)/60,2)), "minutes")
    # Create a correctly indexed dataframe
    Vader_tx_polarity = pd.DataFrame(pol, index = idx, columns = ['Vader_polarity'])
    # Return the results
    return Vader_tx_polarity

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3.1.1 German: Sentiment assessment with Vader
</h2>
</div>

Since the most convenient way to access GerVADER is from the computer's command line (i.e. cmd if you are using Windows as an operating system), we do so in the following by applying the following steps sequentially:

1. Download the German Vader sentiment module from the according Github repository (link: https://github.com/KarstenAMF/GerVADER). 

2. Unzip the file, rename the retrieved folder from "GerVADER" to "GerVADER_adjusted" and save it to some path (referred to as Path\to\GerVADER_adjusted in the following). For me this path is C:\Users\Hallk\Documents\Programming\Additional_stuff\GerVADER_adjusted.

3. Replace the original file "vaderSentimentGER.py" in the downloaded folder with the adjusted "vaderSentimentGER.py" file in the "GerVADER_adjusted" folder provided by this thesis.

4. Copy the tsv file created above (i.e. "Sentiment/Naive/Performance_evaluation/de_multilingual_articles_vader.tsv") into the aforementioned "GerVADER_adjusted" folder on your machine.

5. Open your machine's command window and navigate to the "GerVADER_adjusted" folder by entering the following command: cd Path\to\GerVADER_adjusted

6. Next, enter the command: python GERvaderModule.py

7. Choose: 2

8. Enter the name of the tsv file mentioned above

9. Choose N or insert the name you want to give to the output folder

If all these steps are executed the articles' sentiments will be evaluated by the algorithm and 4 distinct output tables in tsv format will be generated. One output file contains the text and sentiment assessment for the entirety of articles, while the remaining three solely contain articles classified as negative, neutral or positive, respectively. All resulting output files are then stored within the folder 'results' located within the "GerVADER_adjusted" folder.

To extract the resulting polarity scores, we now have to copy the resulting tsv file named "GERVADER\_\_all_docs.tsv" to an easily accessible spot on our machine, such that we can read it in conveniently (for me it is the folder d:\\Users\\Hallk\\Dropbox\\MA_data\\Sentiment\\Naive\\Performance_evaluation).

In [23]:
# Read in the resulting tsv file
senti_vader_de = pd.read_csv("Sentiment/Naive/Performance_evaluation/GERVADER__all_docs.tsv", sep = '\t', header = None, usecols = [0,2], names = ['tx','Vader_polarity'])
# Extract the list of polarity assignments
senti_vader_de = senti_vader_de.Vader_polarity.values.tolist()

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3.1.2 German: Sentiment assessment with TextBlob
</h2>
</div>

In [24]:
# Import the according TextBlob modules
from textblob import Blobber
from textblob_de import PatternTagger, PatternAnalyzer

In [25]:
# Apply the above defined function to assess the articles' polarity with TextBlob
senti_blob_de = eval_blob_polarity(articles_blob_de, multilingual_articles.index.tolist())
# Extract the list of polarity assignments
senti_blob_de = senti_blob_de.Blob_polarity.values.tolist()

Processing time to evaluate polarity scores of the articles: 0.03 minutes


In [26]:
# Remove the previously loaded TextBlob modules
del Blobber, PatternTagger, PatternAnalyzer

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3.2.1 French: Sentiment assessment with Vader
</h2>
</div>

In [27]:
# Import the according Vader module
from vaderSentiment_fr.vaderSentiment import SentimentIntensityAnalyzer

In [28]:
# Apply the above defined function to assess the articles' polarity with Vader
senti_vader_fr = eval_vader_polarity(articles_vader_fr, multilingual_articles.index.tolist(), fuzzywuzzy = False)
# Extract the list of polarity assignments
senti_vader_fr = senti_vader_fr.Vader_polarity.values.tolist()

Processing time to evaluate polarity scores of the articles: 0.01 minutes


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3.2.2 French: Sentiment assessment with TextBlob
</h2>
</div>

In [29]:
# TextBlob (for Sentiment Analysis)
from textblob import Blobber
from textblob_fr import PatternTagger, PatternAnalyzer

In [30]:
# Apply the above defined function to assess the articles' polarity with TextBlob
senti_blob_fr = eval_blob_polarity(articles_blob_fr, multilingual_articles.index.tolist())
# Extract the list of polarity assignments
senti_blob_fr = senti_blob_fr.Blob_polarity.values.tolist()

Processing time to evaluate polarity scores of the articles: 0.0 minutes


In [31]:
# Remove the previously loaded TextBlob modules
del Blobber, PatternTagger, PatternAnalyzer

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3.3 Italian: Sentiment assessment with the naïve Italian sentiment classifier
</h2>
</div>

In [32]:
# Set up a NaiveSentimentClassifierIT object
NSC_it = NaiveSentimentClassifierIT(senti_lex_dict, negations, booster_dic)
# Evaluate the sentiment of the Italian articles
senti_naive_it = NSC_it.evaluate(articles_naive_it, multilingual_articles.index.tolist())
# Extract the list of polarity assignments
senti_naive_it = senti_naive_it.Naive_polarity.values.tolist()

Processing time to evaluate the article sentiments: 0.00023292700449625652 minutes


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
4. Compare the resulting polarity scores between the German and French versions of the adjusted Vader & TextBlob sentiment algorithm and the self-developed naïve Italian sentiment algorithm
</h2>
</div>

In [33]:
# Create a dataframe containing all polarity scores
pd.set_option('mode.chained_assignment', None)
Polarities = multilingual_articles[['title_de']]
Polarities['Vader_de'] = senti_vader_de
Polarities['Blob_de']  = senti_blob_de
Polarities['Vader_fr'] = senti_vader_fr
Polarities['Blob_fr']  = senti_blob_fr
Polarities['Naive_it'] = senti_naive_it
# Create a column showing the difference in the resulting polarities from the naïve Italian sentiment algorithm to the established German Vader sentiment algorithm
Polarities['Diff_Vader_de_Naive_it'] = Polarities['Vader_de'] - Polarities['Naive_it']

In [34]:
# Take a look at some summary statistics
np.round(Polarities[['Vader_de','Blob_de','Vader_fr','Blob_fr','Naive_it','Diff_Vader_de_Naive_it']].describe(), 3)

Unnamed: 0,Vader_de,Blob_de,Vader_fr,Blob_fr,Naive_it,Diff_Vader_de_Naive_it
count,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.198,0.29,0.136,0.089,0.264,-0.067
std,0.137,0.45,0.121,0.067,0.164,0.175
min,-0.125,-0.744,-0.18,-0.166,-0.163,-0.427
25%,0.126,-0.056,0.067,0.066,0.197,-0.177
50%,0.207,0.419,0.133,0.099,0.274,-0.082
75%,0.282,0.669,0.213,0.124,0.365,0.055
max,0.45,1.0,0.441,0.214,0.562,0.399


In [35]:
# Take a look at the results
Polarities

Unnamed: 0,title_de,Vader_de,Blob_de,Vader_fr,Blob_fr,Naive_it,Diff_Vader_de_Naive_it
0,2021 - Neujahrsansprache von Bundespräsident Guy Parmelin,0.2054,0.307317,0.0552,0.122173,0.276621,-0.071221
1,Bundesratsfoto 2021: Neue Sicht auf Altbekanntes,0.2196,0.433333,0.2338,0.153333,0.322437,-0.102837
2,Medienberichte zu Todesfall nach Covid-19-Impfung in der Schweiz: Kein Zusammenhang mit der Impfung ersichtlich,-0.0562,-0.075,-0.18,0.213636,-0.163238,0.107038
3,Rasche Prüfung der Ausfuhrgesuche der Crypto International AG und TCG Legacy in Liquidation,0.0859,-0.075,0.0665,0.032222,0.374939,-0.289039
4,Coronavirus: Der Bundesrat verschärft die Massnahmen nicht,0.1269,0.258333,-0.0192,0.082703,0.271111,-0.144211
5,Neue Abkommen zwischen der Schweiz und dem Vereinigten Königreich treten in Kraft,0.2716,0.647368,0.215,0.099049,0.126529,0.145071
6,Luftpolizeidienst rund um die Uhr,0.1702,0.0,0.0662,0.035577,0.406549,-0.236349
7,Betrugsfall Stanford: Herausgabe von rund 200 Mio. US-Dollar an die USA,-0.0591,-0.666667,0.1333,0.114545,0.022822,-0.081922
8,"Statistik, Datenwissenschaft und nationale Datenbewirtschaftung: drei zukunftsgerichtete Aufgaben für das BFS",0.4125,0.0,0.2063,0.103333,0.4552,-0.0427
9,Stabiles Kulturverhalten trotz vermehrter Nutzung von digitalen Angeboten im Jahr 2019,0.3024,0.7,0.1234,0.066,0.452152,-0.149752


In [36]:
# Get the cases where the sign of resulting polarites from the German and French Vader algorithms do not coincide
cases_1_idx = []
for i in range(Polarities.shape[0]):
    if (np.sign(Polarities.loc[i,'Vader_de']) != np.sign(Polarities.loc[i,'Vader_fr'])):
        cases_1_idx.append(Polarities.index[i])
cases_1_idx

[4, 7, 10, 13, 16, 17, 20, 29, 47]

In [37]:
# Get cases where the sign of the resulting polarity from the naïve Italian algorithm differs from the signs of the results from the German and French Vader algorithms, 
# given that the signs of the language specific Vader algorithms coincide
cases_2_idx = []
for i in range(Polarities.shape[0]):
    if (np.sign(Polarities.loc[i,'Vader_de']) == np.sign(Polarities.loc[i,'Vader_fr'])) and not (np.sign(Polarities.loc[i,'Vader_de']) == np.sign(Polarities.loc[i,'Naive_it'])):
        cases_2_idx.append(Polarities.index[i])
cases_2_idx

[34]

In [38]:
# Calcualte the share of articles for which the sign of the resulting polarites from the German and French Vader algorithms coincide (i.e. articles with unambiguous sentiment according to the Vader algorithms)
share = 1 - len(cases_1_idx)/len(articles_de)
print('The share of articles with unambiguous sentiment (according to the Vader algorithms) is', str(np.round(100*share,2)), '%')

The share of articles with unambiguous sentiment (according to the Vader algorithms) is 82.0 %


In [39]:
# Calcualte the share among articles with unambiguous sentiment for which the sign of the resulting polarity from the naïve Italian algorithm does not coincide with the ones assigned by the Vader algorithms.
share = len(cases_2_idx) / (len(articles_de) - len(cases_1_idx))
print('The share among articles with unambiguous sentiment for which the sign of the resulting polarity from the naïve Italian algorithm does not coincide with the ones assigned by the Vader algorithms is', str(np.round(100*share,2)), '%')

The share among articles with unambiguous sentiment for which the sign of the resulting polarity from the naïve Italian algorithm does not coincide with the ones assigned by the Vader algorithms is 2.44 %


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
5. Visual inspction of articles exhibiting ambiguous sentiment ratings across languages
</h2>
</div>

In [41]:
## Take a look at the German version of the articles where the polarities from the German and French Vader algorithms coincide, but the polarity of the naïve Italian algorithm does not
# Naïve polarity is falsly assigned to be negative
articles_de[cases_2_idx[0]]

'Für die Realisierung und den Unterhalt von Bau- und Infrastrukturprojekten werden in der Schweiz jährlich rund 5 Millionen Tonnen Zement nachgefragt. Dieser Verbrauch wurde 2019 zu 86 % durch die sechs schweizerischen Zementwerke und zu 14 % durch Importe gedeckt.  Eine stabile Zementversorgung wird in erster Linie durch einen langfristig gesicherten Zugang zu den Primärrohstoffen Kalk und Mergel gewährleistet. Bei einigen Zementwerken ist dieser Zugang durch den fortschreitenden Abbau, wegen zunehmenden Zielkonflikten mit entgegenstehenden Schutz- und Raumnutzungsinteressen und aufgrund von Widerständen gegen die beantragten Rohstoffabbauerweiterungsprojekte teilweise eingeschränkt.  Der Rohstoffsicherungsbericht stellt eine Prognose für die nationale Zementversorgung mit inländischen Zementrohstoffen bis ins Jahr 2030 dar. Diese weist auf einen möglichen Rückgang der inländischen Zementproduktion ab 2024 hin, wenn die beantragten, in den kantonalen Richtplänen festgesetzten Abbauerw

In [42]:
## Inspect articles where the resulting polarity scores exhibit some discrepancies between the German and French Vader algorithms (scheme of notation: signs of [Vader_de, Vader_fr, Naive_it])
# [+, -, +]
articles_de[cases_1_idx[0]]

'Gemäss Entscheid vom 18. Dezember 2020 wurde der Bundesrat am 30. Dezember 2020 schriftlich über eine allenfalls erforderliche Verschärfung der Massnahmen zur Bekämpfung des Coronavirus informiert. Nach einer detaillierten Analyse der epidemiologischen Situation ist er zum Schluss gekommen, dass die für eine solche Verschärfung festgelegten Kriterien nicht erfüllt sind. Der Bundesrat hat daher beschlossen, die aktuellen Massnahmen beizubehalten. Er verfolgt die Situation weiterhin aufmerksam und wird am 6. Januar 2021 die Lage neu beurteilen.  Der Reproduktionswert des Virus ist aktuell unter 1 (0,86 am 18.12.2020). Dieser Rückgang sowie die geringe Zahl der neu gemeldeten Fälle in den letzten Tagen sind jedoch mit grosser Vorsicht zu betrachten. Sie lassen sich zu einem beträchtlichen Teil durch den Rückgang der durchgeführten Tests während der Feiertage sowie die Verzögerung bei den Meldungen der neuen Fälle, Hospitalisationen und Todesfälle erklären.  Nach dem Auftreten neuer Varia

In [45]:
# [-, +, +]
articles_de[cases_1_idx[1]]

'Mit einem Schneeballsystem hat der amerikanische Finanzier Allen Stanford zwischen 2001 und 2008 tausende Anleger um insgesamt mehr als sieben Milliarden US-Dollar betrogen. Dafür wurde er im Jahr 2012 in den USA zu einer Haftstrafe von 110 Jahren verurteilt. Die deliktischen Vermögenswerte wurden zu Gunsten der Geschädigten eingezogen.  Die Schweiz hat die USA in diesem Strafverfahren unterstützt. Gestützt auf den bilateralen Rechtshilfevertrag mit den USA und das Rechtshilfegesetz hat das Bundesamt für Justiz (BJ) den amerikanischen Behörden relevante Bankunterlagen zu verschieden Konten auf Schweizer Banken ausgehändigt sowie die Beschlagnahme von Vermögenswerten auf schweizerischen Konten verfügt.  Im Jahr 2019 hat das BJ im Nachgang zum rechtskräftigen Einziehungsurteil in den USA die Herausgabe der gesperrten Vermögenswerte angeordnet. Die dagegen erhobenen Beschwerden wurden vom Bundesstrafgericht am 16. Oktober 2020 abgewiesen. Das BJ wird den US-Behörden deshalb bis Ende Deze

In [46]:
# [-, +, +]
articles_de[cases_1_idx[2]]

'Welche Gefahr geht von den betroffenen Produkten aus?  Es kann nicht ausgeschlossen werden, dass bei Regen Wasser in das Akkufach gelangt. In einem solchen Fall resultiert eine Brand- und Unfallgefahr.  Welche Produkte sind betroffen?  Vom Produktrückruf betroffen sind die E-Scooter MOMO und MASERATI.  Was sollen betroffene Konsumentinnen und Konsumenten tun?  Kunden, die einen der betroffenen E-Scooter besitzen, sind aufgefordert, diesen nicht mehr zu verwenden und in eine Jumbo-Filiale zurückzubringen. Er wird kostenlos umgerüstet.  Disclaimer : Die Rückrufe und Sicherheitsinformationen bestehen aus teilweise oder ganz übernommenen Pressemitteilungen der entsprechenden Unternehmen oder Institutionen und werden mit deren Einverständnis publiziert. Adresse für Rückfragen  Bei Fragen können sich Konsumentinnen und Konsumenten an eine Jumbo-Filiale oder ans Quality Management von Jumbo wenden:'

In [47]:
# [-, +, +]
articles_de[cases_1_idx[3]]

'Welche Gefahr geht von den betroffenen Produkten aus?  Das betroffene Hängemattengestell kann während der Anwendung zusammenbrechen, so dass eine Sturzgefahr besteht.  Welche Produkte sind betroffen?  Vom Produktrückruf betroffen ist die Charge 6201920 des Hängemattengestells PARADA. Die Chargennummer ist am Gestell angebracht (vgl. beiliegendes Informationsschreiben von Jumbo).  Die betroffenen Hängemattengestelle wurden von Jumbo zwischen dem 1. Januar 2019 und dem 17. November 2020 verkauft.  Was sollen betroffene Konsumentinnen und Konsumenten tun?  Kunden, die eines der betroffenen Hängemattengestelle besitzen, sind aufgefordert, dieses nicht mehr zu verwenden und in eine Jumbo-Filiale zurückzubringen. Sie erhalten den Kaufpreis rückerstattet.  Disclaimer: Die Rückrufe und Sicherheitsinformationen bestehen aus teilweise oder ganz übernommenen Pressemitteilungen der entsprechenden Unternehmen oder Institutionen und werden mit deren Einverständnis publiziert.'

In [48]:
# [+, -, +]
articles_de[cases_1_idx[4]]

'Gestützt auf eine Analyse des SECO und die Kontrollergebnisse der paritätischen Kommission, welche für den Vollzug des AVE GAV zuständig ist, kam die TPK Bund zum Schluss, dass die erleichterte AVE des GAV erneut verlängert werden soll. Die Kommission hat dem Bundesrat einen entsprechenden Antrag unterbreitet. Der Bundesrat ist dem Antrag der TPK Bund gefolgt. Die erleichterte AVE wird per 1. Januar 2021 in Kraft treten und gilt bis Ende 2021. Es handelt sich dabei um die Einzige erleichterte AVE, die auf Bundesebene existiert.  Seit 2004 existiert eine ordentliche Allgemeinverbindlicherklärung (AVE) des Gesamtarbeitsvertrags (GAV) für die Reinigungsbranche der Deutschschweiz. Dieser AVE GAV gilt allerdings nur für Betriebe mit mindestens 6 Arbeitnehmenden. Da vermehrt missbräuchliche Lohnunterbietungen bei kleineren Unternehmen festgestellt wurden, hat der Bundesrat auf Antrag der tripartiten Kommission des Bundes (TPK Bund) im Jahr 2011 den GAV für alle Betriebe der Branche erleicht

In [49]:
# [-, +, +]
articles_de[cases_1_idx[5]]

'Eine Billion Schweizer Franken betrug 2019 die Bilanzsumme aller Pensionskassen der Schweiz. Die Wertschwankungsreserven nahmen um fast 43 Milliarden Franken zu. Die Vorsorgekapitalien der aktiven Versicherten und der Rentenbeziehenden wuchsen um 68 Milliarden Franken an. Dies geht aus den definitiven Ergebnissen der Pensionskassenstatistik 2019 des Bundesamtes für Statistik (BFS) hervor.'

In [50]:
# [+, -, +]
articles_de[cases_1_idx[6]]

'Gemäss der Analyse von Präsenz Schweiz im Eidgenössischen Departement für auswärtige Angelegenheiten (EDA), wurde die Schweiz im Jahr 2020 in den ausländischen Medien deutlich weniger als in den Vorjahren thematisiert. Der Fokus lag auf der Covid-19-Pandemie. Dabei war das von der Schweiz vermittelte Bild starken Schwankungen unterworfen. Während der ersten Welle beurteilten ausländische Medien den Umgang der Schweiz mit der Pandemie überwiegend positiv. Insbesondere die Massnahmen zur Unterstützung der Schweizer Wirtschaft und die effiziente Umsetzung bei der Vergabe der Liquiditätskredite an Schweizer Unternehmen wurden gelobt. Während der zweiten Welle berichteten vor allem die Nachbarländer über die Schweiz und äusserten sich kritisch über die aus ihrer Sicht als zu moderat eingeschätzten Massnahmen, unter anderem die Öffnung der Skigebiete.  Neben der Covid-19-Pandemie berichteten die ausländische Medien oft kritisch über Ereignisse rund um die Bundesanwaltschaft. Im Zentrum stan

In [51]:
# [+, -, +]
articles_de[cases_1_idx[7]]

'Das Bundesamt für Zivilluftfahrt BAZL hat aufgrund der neuen epidemiologischen Lage in Grossbritannien und Südafrika die Flugverkehrs-Verbindungen zwischen der Schweiz und diesen zwei Ländern per Sonntag Mitternacht bis auf weiteres eingestellt.  Damit reagiert die Schweiz auf das Auftauchen einer neuen Variante des Coronavirus, die nach ersten Erkenntnissen deutlich ansteckender ist als die bekannte Form. Mit dem Flugverbot soll eine weitere Ausbreitung der neuen Virus-Variante verhindert werden. Das BAZL hat am Sonntagabend die betroffenen Flughäfen und Airlines sowie die Geschäftsluftfahrt über die Sofortmassnahme informiert.'

In [52]:
# [-, +, +]
articles_de[cases_1_idx[8]]

'Gemäss heutigem Recht ist die Militärjustiz bei einigen Straftatbeständen sowohl für Militär- wie auch Zivilpersonen zuständig. Dazu zählt unter anderem die Verletzung militärischer Geheimnisse. Der Bundesrat will einige dieser Straftatbestände auch ins zivile Strafgesetzbuch übernehmen, damit Zivilpersonen für diese Straftaten den zivilen Strafverfolgungsbehörden unterstehen. Bei anderen Delikten will der Bundesrat die Zuständigkeit für Zivilpersonen von Fall zu Fall an die zivilen Behörden übertragen können. Die Neuerungen betreffen das Militärstrafgesetz, das Strafgesetzbuch und das Bundesgesetz über den Schutz militärischer Anlagen. Dabei verfolgt der Bundesrat zwei Ansätze. Zum einen umfassen die Neuerungen die Tatbestände der Spionage und der landesverräterischen Verletzung militärischer Geheimnisse, der Verletzung militärischer Geheimnisse sowie des Ungehorsams gegen militärische und behördliche Massnahmen. Diese sollen in Zukunft von den zivilen Gerichtsbehörden verfolgt werde