## MDS Thesis
#### 02 Sentiment Analysis

- dictionary-based approach (SentiWS)
- word embeddings (FastText)
- pre-trained LLM (german-sentiment-bert)

<br>
<hr style="opacity: 0.5">

### Setup

In [11]:
# install if needed
#!pip install fasttext

# load libraries
import pandas as pd
import numpy as np
import fasttext
import fasttext.util
from transformers import pipeline
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cosine

In [15]:
# load pre-processed data
df = pd.read_csv("../data/out/df_clean.csv")

<hr style="opacity: 0.5">

### Dictionary-based approach (SentiWS)

In [13]:
# define function to load sentiws
def load_sentiws_words(file_path):
    words = set()
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            base_word = line.split("|")[0].strip().lower()  # Extract base word before "|"
            words.add(base_word)
    return words

# load positive and negative words
positive_words = load_sentiws_words("../data/in/sentiws/SentiWS_v2.0_Positive.txt")
negative_words = load_sentiws_words("../data/in/sentiws/SentiWS_v2.0_Negative.txt")

# output first few words
print("Sample positive words:", list(positive_words)[:10])
print("Sample negative words:", list(negative_words)[:10])

Sample positive words: ['spektakel', 'formvollendet', 'löblich', 'ankurbeln', 'fesselnd', 'höhepunkt', 'nachhaltig', 'sehenswert', 'aufmunternd', 'empfehlung']
Sample negative words: ['besteuerung', 'feindlich', 'inkonsistent', 'folgewidrig', 'archaisch', 'jammern', 'unnötig', 'überschuß', 'auseinandersetzung', 'abweichen']


In [18]:
# define function to tokenize and calculate sentiment scores
def compute_dict_sentiment(text):
    words = text.lower().split()
    pos_count = sum(1 for word in words if word in positive_words)
    neg_count = sum(1 for word in words if word in negative_words)
    return pos_count - neg_count

# run function
df["sentiment_score_dict"] = df["text"].apply(compute_dict_sentiment)
scaler = StandardScaler()
df["sentiment_score_dict"] = scaler.fit_transform(df[["sentiment_score_dict"]])

# apply to processed_text, not text
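
The loader above keeps only the base form of each entry and discards the polarity weight and the inflected forms that a SentiWS line also carries. A minimal sketch of a weighted variant follows, assuming the `word|POS<TAB>weight<TAB>inflection1,inflection2,...` line layout of SentiWS v2.0. The helpers `parse_sentiws_line` and `weighted_dict_sentiment` are hypothetical names that are not part of this notebook, and the sample line is only illustrative.

```python
def parse_sentiws_line(line):
    """Return {word_form: weight} for one SentiWS line (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    base_word = fields[0].split("|")[0].strip().lower()
    weight = float(fields[1])
    forms = {base_word: weight}
    # the third field is optional and lists comma-separated inflections
    if len(fields) > 2 and fields[2].strip():
        for form in fields[2].split(","):
            forms[form.strip().lower()] = weight
    return forms


def weighted_dict_sentiment(text, lexicon):
    """Sum the weights of all whitespace tokens found in the lexicon."""
    return sum(lexicon.get(token, 0.0) for token in text.lower().split())


# toy line in the assumed format (illustrative, not read from the real file)
lexicon = parse_sentiws_line("Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue")
print(weighted_dict_sentiment("der abbau und die abbaue", lexicon))
```

Summing weights instead of counting hits lets strongly polar words contribute more than weakly polar ones, and matching inflected forms reduces the number of missed tokens when the text is not lemmatized.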

<hr style="opacity: 0.5">

### Word embeddings (FastText)

- Word2Vec learns word embeddings by predicting words from their local context window, which captures semantic relationships
- GloVe learns embeddings from a global word co-occurrence matrix
- FastText builds on Word2Vec but adds character n-gram (subword) embeddings, which is useful for German compounds and inflected forms
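
To make the subword idea concrete, the sketch below splits a word into character n-grams after wrapping it in boundary markers. The function `char_ngrams` is a hypothetical helper, and the n-gram lengths are illustrative; the range used by an actual FastText model is fixed when that model is trained.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# a German compound shares subword units with one of its parts
compound = char_ngrams("krankenhaus", 4, 4)
part = char_ngrams("haus", 4, 4)
shared = sorted(set(compound) & set(part))
print(shared)
```

Because a word vector is built from its n-gram vectors, a compound that is rare in the training corpus can still land near its more frequent components.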

In [19]:
# download FastText german model
fasttext.util.download_model("de", if_exists="ignore")
ft = fasttext.load_model("cc.de.300.bin")


In [None]:
# define function to calculate sentiment using fasttext
# build the vocabulary set once: membership tests on the ft.words list are linear-time
ft_vocab = set(ft.words)

def mean_vector(words):
    # average the vectors of all in-vocabulary words; None if there are none
    vectors = [ft.get_word_vector(word) for word in words if word in ft_vocab]
    return np.mean(vectors, axis=0) if vectors else None

# compute the sentiment centroids once instead of once per document
pos_vector = mean_vector(positive_words)
neg_vector = mean_vector(negative_words)

def compute_fasttext_sentiment(text):
    doc_vector = mean_vector(text.lower().split())
    if doc_vector is None:
        return 0.0

    # cosine() returns a distance, so 1 - distance is the similarity
    pos_sim = 1 - cosine(doc_vector, pos_vector)
    neg_sim = 1 - cosine(doc_vector, neg_vector)

    return pos_sim - neg_sim

In [None]:
# run function
df["sentiment_score_fasttext"] = df["text"].apply(compute_fasttext_sentiment)
df["sentiment_score_fasttext"] = scaler.fit_transform(df[["sentiment_score_fasttext"]])
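
The score above is the difference between two cosine similarities: document versus positive centroid, and document versus negative centroid. A small sketch with two-dimensional toy vectors shows how the sign behaves. The vectors and the helper names `cosine_similarity` and `centroid_score` are made up for illustration and do not come from the FastText model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def centroid_score(doc_vector, pos_vector, neg_vector):
    """Positive when the document is closer to the positive centroid."""
    return cosine_similarity(doc_vector, pos_vector) - cosine_similarity(doc_vector, neg_vector)


# toy centroids on orthogonal axes (illustrative values only)
pos_vector = np.array([1.0, 0.0])
neg_vector = np.array([0.0, 1.0])

doc_close_to_pos = np.array([0.9, 0.1])
doc_in_between = np.array([1.0, 1.0])

print(centroid_score(doc_close_to_pos, pos_vector, neg_vector))
print(centroid_score(doc_in_between, pos_vector, neg_vector))
```

A document that lies equally close to both centroids scores zero, so the measure only separates texts to the extent that the two centroids point in different directions.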

<hr style="opacity: 0.5">

### Pre-trained LLM (german-sentiment-bert)

In [None]:
# pre-trained LLM sentiment analysis (german-sentiment-bert)
sentiment_model = pipeline("text-classification", model="oliverguhr/german-sentiment-bert")

def compute_bert_sentiment(text):
    result = sentiment_model(text, truncation=True, max_length=512)[0]  # truncate to 512 tokens (slicing the string would cut characters, not tokens)
    score = result["score"]
    if result["label"] == "positive":
        return score
    elif result["label"] == "negative":
        return -score
    else:
        return 0

df["sentiment_score_bert"] = df["text"].apply(compute_bert_sentiment)
df["sentiment_score_bert"] = scaler.fit_transform(df[["sentiment_score_bert"]])
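
The mapping above uses only the top label, so every text classified as neutral collapses to 0. One possible alternative is a continuous score built from all class probabilities, P(positive) minus P(negative). The sketch below assumes the per-class scores are available as a list of `{"label", "score"}` dictionaries, which recent `transformers` versions should return when the pipeline is called with `top_k=None`. The helper `signed_score` is hypothetical, and the probabilities are invented for illustration.

```python
def signed_score(class_scores):
    """P(positive) - P(negative) from a list of {"label", "score"} dicts."""
    probs = {entry["label"]: entry["score"] for entry in class_scores}
    return probs.get("positive", 0.0) - probs.get("negative", 0.0)


# invented per-class probabilities in the assumed output shape
example = [
    {"label": "positive", "score": 0.70},
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.20},
]
print(signed_score(example))
```

This keeps the score in the range from -1 to 1 and preserves information for texts where the neutral class wins only narrowly.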

In [None]:
# save results
df.to_csv("../data/out/sentiment_results.csv", index=False)