<div style="text-align: center; font-size: 16px;">
    <strong>Course:</strong> Machine Learning Operations |
    <strong>Lecturer:</strong> Prof. Dr. Klotz |
    <strong>Date:</strong> 17.05.2025 |
    <strong>Name:</strong> Sofie Pischl
</div>

# <center> Data Exploration </center>

# Table of Contents
1. [Introduction](#Einleitung)
2. [Data Preparation](#Datenvorbereitung)
3. [Advanced Topic Modelling with BERTopic](#Fortgeschrittenes-Topic-Modelling-mit-BERTopic)
4. [Improved Sentiment Analysis](#Verbesserte-Sentiment-Analyse)
5. [Temporal Analysis of Topics](#Zeitliche-Analyse-der-Topics)
6. [UMAP Visualization of Topics](#UMAP-Visualisierung-der-Topics)
7. [Preparation for Classification](#Vorbereitung-auf-Klassifikation)


## <a id="Einleitung"></a>Introduction
This notebook examines combined social media data from Reddit, YouTube, and TikTok. The goal is to identify patterns in topics and sentiment and to prepare the data for later classification. We use modern methods such as **BERTopic**, **Transformer-based sentiment analysis**, and **visual embedding clusters**.

Building on the earlier data exploration, this notebook adds the following methods:

1. **Advanced topic modelling with BERTopic**
2. **Improved sentiment analysis with VADER and a Transformer model**
3. **Detailed exploration of text clusters and time series**
4. **Preparation for classification and prediction models**

Each method is accompanied by explanatory text and integrated into the existing data format.

## <a id="Datenvorbereitung"></a>Data Preparation
We load the preprocessed data, convert the timestamps, and prepare the texts for analysis.

In [65]:
import pandas as pd
pd.set_option('display.max_columns', 1000)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt
import nltk
from pathlib import Path

In [66]:
BASE_DIR = Path().resolve().parent
# Note: despite its name, RAW_DIR points to the processed data folder
RAW_DIR = (BASE_DIR / "data" / "processed").resolve()

# Load the preprocessed file
df = pd.read_csv(RAW_DIR / "social_media_data.csv")

display(df.head())
df.info()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,datetime,date,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate,hour,weekday,year,month,day_period,is_weekend
0,tiktok,7493469801654881542,,#vairalvideo_foryoupage #🇦🇫ازبک_تاجک_پشتون_تر...,afgcap.cut,0.000407,0.000566,9.6e-05,0.001126,2025-04-15 09:30:06+00:00,,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-04-15 09:30:06+00:00,2025-04-15,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.116029,,,,,,False
1,tiktok,7489427780397010198,,#imapoliceofficer #tensheet #foryou #viral #fy...,backwheelbandit69,0.110825,0.024776,0.064638,0.129241,2025-04-04 12:04:57+00:00,,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-04-04 12:04:57+00:00,2025-04-04,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.273845,,,,,,False
2,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,0.001505,0.000989,0.000335,0.00401,2025-04-11 10:28:10+00:00,,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-04-11 10:28:10+00:00,2025-04-11,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.119509,,,,,,False
3,tiktok,7472584144510373125,,i think it was a bad idea,maligoshik,0.463918,0.052621,0.276596,0.520049,2025-02-18 02:43:04+00:00,,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-02-18 02:43:04+00:00,2025-02-18,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,think bad idea,14.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,negative,-0.7,0.284582,,,,,,False
4,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,0.006737,0.028503,0.002027,0.038248,2025-01-20 09:27:52+00:00,,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,2025-01-20 09:27:52+00:00,2025-01-20,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8,0.057531,,,,,,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 47 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   source                   1728 non-null   object 
 1   id                       1728 non-null   object 
 2   title                    646 non-null    object 
 3   text                     1727 non-null   object 
 4   username                 1398 non-null   object 
 5   likes                    1728 non-null   float64
 6   comments                 1728 non-null   float64
 7   shares                   1728 non-null   float64
 8   plays                    1728 non-null   float64
 9   timestamp                1560 non-null   object 
 10  published_at             645 non-null    object 
 11  url                      1304 non-null   object 
 12  datetime                 1726 non-null   object 
 13  date                     1726 non-null   object 
 14  title_language          

In [67]:
# Number of posts per platform
post_counts = df["source"].value_counts().reset_index()
post_counts.columns = ["source", "num_posts"]

# Display the result
print(post_counts)

    source  num_posts
0   tiktok       1082
1   reddit        330
2  youtube        316


# Preparations

In [68]:
# Combine title and text (both already cleaned)
texts = (df["title_clean"].fillna("") + " " + df["text_clean"].fillna("")).astype(str)

# LDA
LDA is a probabilistic generative model. It assumes that every document is a mixture of several topics and that every topic can be described by a distribution over words. It works best on longer, more formal texts, but it can still give a first impression of the thematic structure of short texts.

Why LDA as a first approach?

- **Simplicity and transparency**: LDA is easy to understand and provides a good basis for interpreting topic models.
- **Baseline for comparison**: It serves as a reference model against which more modern methods such as BERTopic can be compared.
- **Quick to apply**: With scikit-learn, LDA can be trained efficiently on a prepared bag-of-words count matrix (the code below uses `CountVectorizer`, not TF-IDF).
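
To make the generative assumption concrete, the following minimal sketch samples a toy document the way LDA imagines documents to arise: pick a topic from the document's topic mixture, then pick a word from that topic's word distribution. The topics, words, and probabilities are hypothetical and chosen only for illustration; full LDA also draws the mixture itself from a Dirichlet prior, which is fixed here for simplicity. Fitting an LDA model means inverting this process to recover topics from observed word counts.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
TOPICS = {
    "football": {"ronaldo": 0.5, "goal": 0.3, "match": 0.2},
    "cooking": {"cheese": 0.4, "oven": 0.4, "recipe": 0.2},
}


def generate_document(topic_mixture, n_words, rng):
    """Sample a document following LDA's generative story.

    For every word position, first draw a topic from the document's
    topic mixture, then draw a word from that topic's word distribution.
    """
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(42)
# A document that is 80 % "football" and 20 % "cooking".
doc = generate_document({"football": 0.8, "cooking": 0.2}, n_words=12, rng=rng)
print(" ".join(doc))
```

A document with only a handful of words gives the model very little evidence about its mixture, which is one reason why LDA tends to struggle with short social media posts.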

In [88]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# CountVectorizer without a stop-word list (stop words were removed earlier)
vectorizer = CountVectorizer(
    max_df=0.95,  # drop terms that occur in more than 95 % of the documents
    min_df=2      # drop terms that occur in fewer than 2 documents
)

# Build the document-term count matrix
X = vectorizer.fit_transform(texts)  # texts = already cleaned texts

# Define the LDA model
lda_model = LatentDirichletAllocation(
    n_components=9,         # number of topics
    max_iter=3,             # only 3 passes: fast, but the model may not have converged
    learning_method='online',
    random_state=42
)

# Train the LDA model
lda_model.fit(X)


In [89]:
# Print the top words of each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"🟢 Topic {topic_idx + 1}:")
        # argsort is ascending, so the reversed slice yields the highest-weighted words
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        print()

display_topics(lda_model, vectorizer.get_feature_names_out(), 10)


🟢 Topic 1:
official cat que account khan real crazy bien asmr unfreez

🟢 Topic 2:
best would years every ever something kids back work else

🟢 Topic 3:
de brainrot song time tragic italian day prank months tralala

🟢 Topic 4:
tung one go sahur immer brainrot ronaldo mehr yamal speedrun

🟢 Topic 5:
na käse nie met się dc zutaten ofen mal creme

🟢 Topic 6:
viral tiktok baby barça kenza sound games go omg cette

🟢 Topic 7:
never part hair said even video know home chef bro

🟢 Topic 8:
new love got template stop see people one cool away

🟢 Topic 9:
like get follow real ib life party used videos better



## <a id="Fortgeschrittenes-Topic-Modelling-mit-BERTopic"></a>1. Advanced Topic Modelling with BERTopic

Classic LDA is often ill-suited to short, informal texts. BERTopic combines sentence embeddings, clustering of those embeddings, and a class-based TF-IDF (c-TF-IDF) that extracts the characteristic words of each cluster. The resulting topics are typically more robust and easier to interpret. BERTopic also supports analysing topics over time.
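
The TF-IDF step in BERTopic is class-based: all documents of a cluster are treated as one large document, and a term scores highly when it is frequent in that cluster but rare across the others. The sketch below illustrates the idea in plain Python with the weighting `tf * log(1 + A / f)`, where `A` is the average number of words per cluster and `f` the frequency of the term over all clusters. It is a simplified illustration on hypothetical toy clusters, not BERTopic's own implementation, which differs in details such as normalization and sparse-matrix handling.

```python
import math
from collections import Counter


def class_tfidf(docs_by_cluster):
    """Compute simplified class-based TF-IDF weights per cluster.

    Weight of term t in cluster c: tf(t, c) * log(1 + A / f(t)), where A is
    the average number of words per cluster and f(t) the frequency of t
    summed over all clusters.
    """
    # Concatenate each cluster's documents and count its terms.
    class_counts = {
        cluster: Counter(" ".join(docs).split())
        for cluster, docs in docs_by_cluster.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)
    return {
        cluster: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for cluster, counts in class_counts.items()
    }


# Hypothetical clusters, e.g. as produced by clustering document embeddings.
clusters = {
    0: ["ronaldo goal video", "ronaldo pass video"],
    1: ["cheese oven video", "cheese recipe"],
}
weights = class_tfidf(clusters)
for cluster_id, term_weights in weights.items():
    top_terms = sorted(term_weights, key=term_weights.get, reverse=True)[:2]
    print(cluster_id, top_terms)
```

Terms such as "video" that appear in every cluster are down-weighted by the logarithmic factor, while cluster-specific terms rise to the top of the topic representation.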

In [71]:
# Installation (only needed the first time)
# !pip install bertopic[visualization] sentence-transformers

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# 'all-MiniLM-L6-v2' is trained mainly on English text; for German or mixed-language
# posts a multilingual model such as 'paraphrase-multilingual-MiniLM-L12-v2' may help
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare the data: keep only rows that have cleaned text
texts = df['text_clean'].dropna().tolist()

# Initialize and train BERTopic
# (the language argument is not used when an embedding_model is passed explicitly)
topic_model = BERTopic(embedding_model=embedding_model, language='multilingual')
topics, probs = topic_model.fit_transform(texts)

# Apply the same row filter as above so that topics align with df_clean row by row
df_clean = df.loc[df['text_clean'].notna()].copy()
df_clean['topic'] = topics

topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,262,-1_template_new_baby_dc,"[template, new, baby, dc, official, account, k...",[einfachste apfelkuchen welt geht total einfac...
1,0,248,0_de_na_käse_immer,"[de, na, käse, immer, nie, się, que, kenza, bi...",[anzeige eigene gewürze flammkuchentoast zutat...
2,1,87,1_wait_get_better_end,"[wait, get, better, end, omg, done, one, math,...","[wait till end, wait end, tried person drinks ..."
3,2,63,2_brainrot_tragic_song_italian,"[brainrot, tragic, song, italian, cat, music, ...",[tralalero tralala x wakawaka brainrot movie t...
4,3,33,3_ronaldo_messi_games_barça,"[ronaldo, messi, games, barça, valverde, pass,...",[red devil buy high quality football jerseys p...
5,4,30,4_aura_morning_vibes_mood,"[aura, morning, vibes, mood, real, like, vibe,...",[possible learn power perfect sunday morning e...
6,5,25,5_على_في_ہم_afghan,"[على, في, ہم, afghan, للحصول, これめっちゃ頭痛くなる, အလ,...","[أنا شوري على الزلم يصبح قرار, مریم نواز سامنے..."
7,6,19,6_hair_makeup_gala_beauty,"[hair, makeup, gala, beauty, cosmetics, concea...",[hair long wash head like hair long got ta get...
8,7,18,7_viral_tiktok_go_video,"[viral, tiktok, go, video, things, videos, тва...","[go viral, go viral tiktok million, go viral t..."
9,8,17,8_fuck_never_time_like,"[fuck, never, time, like, years, would, life, ...",[preparing go uni mom confessed tuition money ...


## <a id="Verbesserte-Sentiment-Analyse"></a>2. Improved Sentiment Analysis

The method used so far (`TextBlob`) is better suited to more formal English text. Here we add two more modern approaches:

- **VADER**: designed specifically for social media text, including very short statements.
- **Transformer model**: e.g. RoBERTa-based, fine-tuned for sentiment classification on Twitter data.
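
The two methods return different kinds of output: VADER yields a continuous compound score between -1 and 1, while the Twitter RoBERTa model returns a class label. To compare them, both can be mapped to the same three classes. The sketch below uses the ±0.05 thresholds suggested in the VADER documentation and the label ids listed on the model card of `cardiffnlp/twitter-roberta-base-sentiment`; the helper names `vader_label` and `methods_agree` are hypothetical and introduced only for this illustration.

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment class."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Label ids as listed on the model card of cardiffnlp/twitter-roberta-base-sentiment.
ROBERTA_LABELS = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}


def methods_agree(compound, roberta_label):
    """Return True when both methods assign the same sentiment class."""
    return vader_label(compound) == ROBERTA_LABELS[roberta_label]


print(vader_label(0.64), vader_label(-0.02), vader_label(-0.7))
print(methods_agree(0.64, "LABEL_2"))
```

Rows on which the two methods disagree are good candidates for manual inspection, especially for non-English posts, since VADER relies on an English lexicon.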

In [None]:
# Installation (if not already installed)
# !pip install vaderSentiment transformers

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

# The cleaned text column is named 'text_clean'; missing values become empty strings
clean_texts = df['text_clean'].fillna('').astype(str)

# VADER setup: the compound score lies in [-1, 1]
vader_analyzer = SentimentIntensityAnalyzer()
df['vader_score'] = clean_texts.apply(lambda x: vader_analyzer.polarity_scores(x)['compound'])

# Transformer sentiment pipeline (Twitter RoBERTa)
# This model returns LABEL_0/1/2 (negative/neutral/positive according to its model card)
sentiment_pipeline = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment')

# Warning: can be slow for many texts
# truncation=True cuts at the model's token limit (x[:512] only limited characters)
df['roberta_sentiment'] = clean_texts.apply(lambda x: sentiment_pipeline(x, truncation=True)[0]['label'])

## <a id="Zeitliche-Analyse-der-Topics"></a>3. Temporal Analysis of Topics

A major advantage of BERTopic is that it can analyse topics over time. To do so, the creation date must be parsed correctly as `datetime` and grouped into suitable time bins.
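
The two preparation steps, parsing timestamps while discarding invalid values and assigning each post to a time bin, can be sketched with the standard library alone. The (timestamp, topic) pairs below are hypothetical, and the equal-width binning is an assumption made for illustration; how BERTopic bins internally for `nr_bins` may differ in detail.

```python
from collections import Counter
from datetime import datetime


def parse_timestamp(value):
    """Return a datetime, or None when the value cannot be parsed."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def bin_index(ts, start, end, nr_bins):
    """Assign a timestamp to one of nr_bins equal-width bins between start and end."""
    if end == start:
        return 0
    fraction = (ts - start) / (end - start)
    # The latest timestamp has fraction 1.0 and belongs to the last bin.
    return min(int(fraction * nr_bins), nr_bins - 1)


# Hypothetical (timestamp, topic) pairs; the second timestamp is invalid.
raw = [
    ("2025-01-20 09:27:52+00:00", -1),
    ("not a date", 0),
    ("2025-02-18 02:43:04+00:00", 1),
    ("2025-04-15 09:30:06+00:00", 1),
]
rows = [(parse_timestamp(ts), topic) for ts, topic in raw]
rows = [(ts, topic) for ts, topic in rows if ts is not None]

start = min(ts for ts, _ in rows)
end = max(ts for ts, _ in rows)
# Count how often each topic occurs in each time bin.
counts = Counter((bin_index(ts, start, end, nr_bins=4), topic) for ts, topic in rows)
print(sorted(counts.items()))
```

Dropping unparseable rows before binning mirrors the combination of `errors='coerce'` and `dropna` used in the next cell, and it keeps documents, topics, and timestamps aligned.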

In [76]:
# Make sure timestamp is a valid datetime column
df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'], errors='coerce')
df_clean = df_clean.dropna(subset=['timestamp'])

# Extract all lists from the same filtered frame so that they stay aligned
texts_synced = df_clean['text_clean'].tolist()
topics_synced = df_clean['topic'].tolist()
timestamps_synced = df_clean['timestamp'].tolist()  # real datetime values

# Sanity check: all three lengths must match
print(len(texts_synced), len(topics_synced), len(timestamps_synced))

# Compute topics over time (requires real datetime values)
topics_over_time = topic_model.topics_over_time(
    docs=texts_synced,
    topics=topics_synced,
    timestamps=timestamps_synced,
    nr_bins=20  # or another sensible number of time bins
)

# Visualization
topic_model.visualize_topics_over_time(topics_over_time)


887 887 887


## <a id="UMAP-Visualisierung-der-Topics"></a>4. Visual Exploration with UMAP

To make the embeddings and topics produced by BERTopic easier to interpret, we project the high-dimensional embeddings to 2D with **UMAP**. This makes thematic clusters visible.

In [74]:
# Optional: refresh the UMAP plot (BERTopic can generate it automatically)
# Visualization by topic; texts and topics are the full lists from the BERTopic cell
topic_model.visualize_documents(texts, topics=topics)

## <a id="Vorbereitung-auf-Klassifikation"></a>5. Preparation for Classification

In this section we prepare the data for a classification model. The target could be, for example, the platform (`source`), the topic, or the sentiment. To do this, the texts must be transformed into numerical features (e.g. embeddings or TF-IDF), and the target variables must be encoded correctly.
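
Two of the steps mentioned above, encoding a categorical target as integers and splitting the data so that every class keeps its share in both parts, can be sketched in plain Python. The toy labels and the helper names `encode_labels` and `stratified_split` are hypothetical; in practice scikit-learn's `LabelEncoder` and `train_test_split(..., stratify=y)` cover the same ground, although their internals differ from this sketch.

```python
import random
from collections import defaultdict


def encode_labels(labels):
    """Map each distinct class name to an integer id (sorted for reproducibility)."""
    classes = sorted(set(labels))
    class_to_id = {name: i for i, name in enumerate(classes)}
    return [class_to_id[name] for name in labels], class_to_id


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with roughly equal class shares in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share of that class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy target with an imbalanced class distribution.
sources = ["tiktok"] * 10 + ["reddit"] * 5 + ["youtube"] * 5
y_encoded, mapping = encode_labels(sources)
train_idx, test_idx = stratified_split(y_encoded, test_size=0.2, seed=42)
print(mapping)
print(len(train_idx), len(test_idx))
```

Stratification matters for this dataset because the platforms are unevenly represented; a purely random split could leave a small class underrepresented in the test set.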

In [85]:
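from sklearn.model_selection import train_test_split

y = df_clean['source']  # target variable
docs = df_clean['text_clean'].astype(str)

# Split first, then fit the vectorizer on the training texts only,
# so that the vocabulary is not learned from the test set
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, y, test_size=0.2, stratify=y, random_state=42
)
X_train = vectorizer.fit_transform(docs_train)  # features (word counts)
X_test = vectorizer.transform(docs_test)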

In [86]:
df_clean.to_csv(RAW_DIR / "social_media_data_with_topics.csv")


In [87]:
df_clean.head()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,datetime,date,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate,hour,weekday,year,month,day_period,is_weekend,time_bin,topic
3,tiktok,7472584144510373125,,i think it was a bad idea,maligoshik,0.463918,0.052621,0.276596,0.520049,2025-02-18 02:43:04+00:00,,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-02-18 02:43:04+00:00,2025-02-18,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,think bad idea,14.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,negative,-0.7,0.284582,,,,,,False,2025-02-18 AM,1
4,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,0.006737,0.028503,0.002027,0.038248,2025-01-20 09:27:52+00:00,,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,2025-01-20 09:27:52+00:00,2025-01-20,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8,0.057531,,,,,,False,2025-01-20 AM,-1
5,tiktok,7461757350492278048,,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,0.022387,0.012754,0.049234,0.033621,2025-01-19 22:29:34+00:00,,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-01-19 22:29:34+00:00,2025-01-19,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,great goal andrés,17.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,positive,0.8,0.21312,,,,,,False,2025-01-19 PM,3
6,tiktok,7462429961345879318,,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,0.017598,0.015076,0.035149,0.031462,2025-01-21 17:59:37+00:00,,,2025-01-21 17:59:37+00:00,2025-01-21,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,de,glaube bock,11.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,neutral,0.0,0.179356,,,,,,False,2025-01-21 PM,0
7,tiktok,7483193151004445974,,#CapCut Speedquiz 🤯🧠✅️ #fyp #viral_video #quiz...,brainyy.quiz,0.000719,0.016347,0.004191,0.003393,2025-03-18 16:51:23+00:00,,,2025-03-18 16:51:23+00:00,2025-03-18,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,nl,speedquiz,9.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,neutral,0.0,0.084348,,,,,,False,2025-03-18 PM,2
