<div style="text-align: center; font-size: 16px;">
    <strong>Course:</strong> Machine Learning Operations |
    <strong>Lecturer:</strong> Prof. Dr. Klotz |
    <strong>Date:</strong> 17.05.2025 |
    <strong>Name:</strong> Sofie Pischl
</div>

# <center> Data Exploration </center>

# Table of Contents
1. [Introduction](#Introduction)
2. [Data Preparation](#Data-Preparation)
3. [Advanced Topic Modelling with BERTopic](#Advanced-Topic-Modelling-with-BERTopic)
4. [Improved Sentiment Analysis](#Improved-Sentiment-Analysis)
5. [Temporal Analysis of the Topics](#Temporal-Analysis-of-the-Topics)
6. [Visual Exploration with UMAP](#Visual-Exploration-with-UMAP)
7. [Preparation for Classification](#Preparation-for-Classification)


## Introduction
This notebook explores combined social media data from Reddit, YouTube, and TikTok. The goal is to identify patterns in topics and sentiment and to prepare the data for later classification. We use modern methods such as **BERTopic**, **transformer-based sentiment analysis**, and **visual embedding clusters**.

The notebook builds on the earlier data exploration and adds the following methods:

1. **Advanced topic modelling with BERTopic**
2. **Improved sentiment analysis with VADER and a transformer model**
3. **Detailed exploration of text clusters and time series**
4. **Preparation for classification and prediction models**

Each method is accompanied by explanatory text and integrated into the existing data format.

## Data Preparation
We load the preprocessed data, convert timestamps, and prepare the texts for analysis.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 1000)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt
import nltk
from pathlib import Path

In [2]:
BASE_DIR = Path().resolve().parent
RAW_DIR = (BASE_DIR / "./data/processed").resolve()

# Load the file
df = pd.read_csv(RAW_DIR / "social_media_data.csv")

display(df.head())
df.info()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,datetime,date,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate,hour,weekday,year,month,day_period,is_weekend
0,tiktok,7493469801654881542,,#vairalvideo_foryoupage #🇦🇫ازبک_تاجک_پشتون_تر...,afgcap.cut,0.000405,0.000566,5.2e-05,0.001113,1970-01-01 00:00:01.744709406+00:00,2025-05-12 13:56:20+00:00,https://webapp-va.tiktok.com/88847b04ddf03213d...,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.117561,13.0,Monday,2025.0,5.0,afternoon,False
1,tiktok,7489427780397010198,,#imapoliceofficer #tensheet #foryou #viral #fy...,backwheelbandit69,0.112821,0.025872,0.035437,0.131667,1970-01-01 00:00:01.743768297+00:00,2025-05-12 13:56:20+00:00,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.273045,13.0,Monday,2025.0,5.0,afternoon,False
2,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,0.001497,0.000989,0.000181,0.003962,1970-01-01 00:00:01.744367290+00:00,2025-05-12 13:56:20+00:00,https://webapp-va.tiktok.com/6508d64d970e751a2...,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.121036,13.0,Monday,2025.0,5.0,afternoon,False
3,tiktok,7472584144510373125,,i think it was a bad idea,maligoshik,0.492308,0.058979,0.149425,0.572996,1970-01-01 00:00:01.739846584+00:00,2025-05-12 13:56:29+00:00,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-05-12 13:56:29+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,think bad idea,14.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,negative,-0.7,0.273435,13.0,Monday,2025.0,5.0,afternoon,False
4,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,0.006703,0.028503,0.001095,0.037793,1970-01-01 00:00:01.737365272+00:00,2025-05-12 13:56:20+00:00,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8,0.059165,13.0,Monday,2025.0,5.0,afternoon,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2205 entries, 0 to 2204
Data columns (total 47 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   source                   2205 non-null   object 
 1   id                       2205 non-null   object 
 2   title                    778 non-null    object 
 3   text                     2204 non-null   object 
 4   username                 1821 non-null   object 
 5   likes                    2205 non-null   float64
 6   comments                 2205 non-null   float64
 7   shares                   2205 non-null   float64
 8   plays                    2205 non-null   float64
 9   timestamp                2205 non-null   object 
 10  published_at             2188 non-null   object 
 11  url                      1257 non-null   object 
 12  datetime                 2205 non-null   object 
 13  date                     2205 non-null   object 
 14  title_language          

In [3]:
# Number of posts per platform
post_counts = df["source"].value_counts().reset_index()
post_counts.columns = ["source", "num_posts"]

# Display the result
print(post_counts)

    source  num_posts
0   tiktok       1427
1  youtube        394
2   reddit        384


# Preparations

In [4]:
# Combine title and text (both cleaned); strip the padding left when one of them is empty
texts = (df["title_clean"].fillna("") + " " + df["text_clean"].fillna("")).astype(str).str.strip()

# LDA
LDA (Latent Dirichlet Allocation) is a probabilistic generative model that treats each document as a mixture of several topics and each topic as a distribution over words. It is best suited to longer, more formal texts, but it still gives a first impression of the thematic structure of short texts.

Why LDA as a first approach?

- **Simplicity and transparency**: LDA is easy to understand and provides a good basis for interpreting topic models.
- **Baseline for comparison**: It serves as a reference model against which more modern methods such as BERTopic can be compared.
- **Quick to apply**: With scikit-learn, LDA can be trained efficiently on a bag-of-words count matrix (the `CountVectorizer` output used in the next cell), not on TF-IDF weights.
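The statement that each document is a mixture of topics can be made concrete with `fit_transform`, which returns one row of topic proportions per document. The following is a minimal sketch on an invented toy corpus; `toy_docs`, `toy_lda`, and the two-topic setting are assumptions for illustration and not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini corpus (assumption): two football posts, two music posts
toy_docs = [
    "ronaldo goal football highlights",
    "messi goal football league",
    "new song music video",
    "music edit song best",
]

# LDA works on raw term counts, so we use CountVectorizer here
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_counts)

# Each row is a probability distribution over the two topics
print(doc_topic.round(2))
print(doc_topic.sum(axis=1))
```

On the real data, the same call (`lda_model.transform(X)`) would give a 9-column matrix whose largest entry per row could serve as a hard topic assignment.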

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# CountVectorizer without a stop-word list (stop words were already removed)
vectorizer = CountVectorizer(
    max_df=0.95,
    min_df=2
)

# Build the document-term matrix
X = vectorizer.fit_transform(texts)  # texts = already cleaned texts

# Define the LDA model
lda_model = LatentDirichletAllocation(
    n_components=9,         # number of topics
    max_iter=3,             # only 3 passes (scikit-learn's default is 10); may not have converged
    learning_method='online',
    random_state=42
)

# Train the LDA model
lda_model.fit(X)


In [6]:
# Print each topic together with its top words
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"🟢 Topic {topic_idx + 1}:")
        # argsort is ascending, so the reversed slice yields the highest-weighted words first
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        print()

display_topics(lda_model, vectorizer.get_feature_names_out(), 10)


🟢 Topic 1:
video youtube instagram de kanal geht werbung ze gibt al

🟢 Topic 2:
league love dazn baby vs highlights uefa fans assistant champions

🟢 Topic 3:
people said really mom always pf pts telling dad team

🟢 Topic 4:
instagram tiktok mehr real link folge facebook fan dabei live

🟢 Topic 5:
nie music even na best edit jest fox bo ale

🟢 Topic 6:
would know never make asked year felt think take live

🟢 Topic 7:
game follow kids nba media top team corp knicks prime

🟢 Topic 8:
like said time want one get even could us day

🟢 Topic 9:
go new news ice tiktok song walk future by best



## 1. Advanced Topic Modelling with BERTopic

Classical LDA is often a poor fit for short, informal texts. BERTopic combines embeddings, clustering, and a class-based TF-IDF to produce considerably more robust and interpretable topics. It also supports analyzing topics over time.
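To illustrate the TF-IDF step, here is a simplified pure-Python sketch of class-based TF-IDF as the BERTopic documentation describes it: all documents of a topic are treated as one long document, and a term's weight is its normalized frequency in the topic times `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's total frequency. The helper names `class_tfidf` and `top_terms` and the toy clusters are assumptions for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

def class_tfidf(docs_per_topic):
    """docs_per_topic maps a topic id to a list of tokenized documents."""
    # Term counts per topic: the topic's documents are merged into one bag of words
    counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_per_topic.items()
    }
    # Total frequency of every term across all topics
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # Average number of words per topic
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in topic_counts.items()
        }
    return scores

def top_terms(scores, topic, n=3):
    """Return the n highest-scoring terms of one topic."""
    ranked = sorted(scores[topic].items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# Invented clusters (assumption): topic 0 is football, topic 1 is music
toy_clusters = {
    0: [["ronaldo", "goal", "football"], ["messi", "goal", "video"]],
    1: [["music", "song", "video"], ["song", "edit", "video"]],
}
scores = class_tfidf(toy_clusters)
print(top_terms(scores, 0), top_terms(scores, 1))
```

Terms that occur in both clusters (here `video`) are down-weighted, which is why topic names such as `5_ronaldo_shape_football_grwm` are built from words that are distinctive for one cluster.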

In [7]:
# Installation (only needed the first time)
# !pip install bertopic[visualization] sentence-transformers

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# For German-language texts a multilingual model can be helpful
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # or 'paraphrase-multilingual-MiniLM-L12-v2'

# Prepare the data
texts = df['text_clean'].dropna().tolist()

# Initialize and train BERTopic
# (per the BERTopic docs, `language` is not used when an embedding_model is passed)
topic_model = BERTopic(embedding_model=embedding_model, language='multilingual')
topics, probs = topic_model.fit_transform(texts)

# Keep the same rows as `texts` so that the topic labels line up
df_clean = df.loc[df['text_clean'].notna()].copy()
df_clean['topic'] = topics

topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,667,-1_instagram_music_youtube_news,"[instagram, music, youtube, news, love, like, ...",[since flooded thousands posts articles traile...
1,0,314,0_said_like_told_would,"[said, like, told, would, time, even, people, ...",[background f fiancé together years baby back ...
2,1,194,1_instagram_video_tiktok_kanal,"[instagram, video, tiktok, kanal, youtube, wer...",[deutsche memes folge quellen folgt insta dank...
3,2,59,2_mga_daw_naman_sa,"[mga, daw, naman, sa, na, bailalo, nakaka, nam...","[vệ sinh tai viêm khó đó nha, terlalu pagi muk..."
4,3,55,3_than_einfach_ist_naja,"[than, einfach, ist, naja, stronger, immer, po...",[gilts mindestens zwei freunde mitnehmen wahll...
5,4,51,4_wait_omg_end_ok,"[wait, omg, end, ok, aimed, hmmm, finallyyyy, ...","[wait, wait till end, wait end]"
6,5,49,5_ronaldo_shape_football_grwm,"[ronaldo, shape, football, grwm, messi, tuff, ...","[prime ronaldo, redick notes players phenomena..."
7,6,49,6_que_de_em_não,"[que, de, em, não, os, para, ou, uma, todos, q...",[que encontro maravilhoso provando que os quad...
8,7,46,7_pf_pts_nbsp_ast,"[pf, pts, nbsp, ast, blk, stl, min, reb, team,...",[gsw min box scores nba yahoo nbsp game summar...
9,8,44,8_league_dazn_highlights_sport,"[league, dazn, highlights, sport, uefa, sky, c...",[königlichen gast beim tabellenzwöflten getafe...


## 2. Improved Sentiment Analysis

The method used so far (`TextBlob`) is better suited to more formal English texts. Here we add two more modern approaches:

- **VADER**: designed specifically for social media texts, including short statements.
- **Transformer model**: e.g. a RoBERTa-based model fine-tuned for sentiment classification on Twitter data.
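For the VADER part, a minimal sketch could look as follows. It assumes NLTK's `SentimentIntensityAnalyzer` (which needs the `vader_lexicon` resource) and the commonly used compound-score thresholds of ±0.05; the helper names `label_from_compound` and `vader_sentiment` are hypothetical. Note that VADER's lexicon is English, so non-English posts would likely be scored as neutral.

```python
def label_from_compound(compound, threshold=0.05):
    """Map VADER's compound score (range -1 to 1) to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def vader_sentiment(text, analyzer):
    """Score one text with any object that offers VADER's polarity_scores()."""
    return label_from_compound(analyzer.polarity_scores(text)["compound"])

# The thresholds in isolation
print([label_from_compound(c) for c in (-0.7, 0.0, 0.8)])

# Possible use in this notebook (assumption, not run here):
# import nltk
# nltk.download("vader_lexicon")
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# analyzer = SentimentIntensityAnalyzer()
# df_clean["vader_sentiment"] = df_clean["text"].astype(str).apply(vader_sentiment, analyzer=analyzer)
```

Applying VADER to the raw `text` column rather than `text_clean` is a design choice worth testing, since VADER also uses punctuation, capitalization, and emoticons as signals.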

In [8]:
from transformers import pipeline

# Load the sentiment pipeline (RoBERTa Twitter model)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment",
    truncation=True
)

# Process rows one at a time so that a single failure does not abort a whole batch
def safe_sentiment(text):
    try:
        # Note: text[:512] limits characters, not tokens, so long token sequences can still fail
        return sentiment_pipeline(text[:512])[0]['label']
    except Exception as e:
        print(f"Error for text: {text[:100]}... -> {e}")
        return "error"

# Test on a subset (e.g. 500 rows)
# df_clean = df_clean.sample(n=500, random_state=42).copy()

df_clean['roberta_sentiment'] = df_clean['text_clean'].astype(str).apply(safe_sentiment)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Error for text: 𝗙𝗢𝗟𝗘 𝗣𝗨𝗕𝗟𝗜𝗦𝗛𝗜𝗡𝗚 ____________________________ 𝗫𝗵𝗲𝗻𝘀𝗶𝗹𝗮 𝘅 𝗟𝗲𝗱𝗿𝗶 𝗩𝘂𝗹𝗮 𝗠𝗮 𝗞𝘁𝗵𝗲 𝗢𝗳𝗳𝗶𝗰𝗶𝗮𝗹 𝗩𝗶𝗱𝗲𝗼 __________... -> The expanded size of the tensor (582) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 582].  Tensor sizes: [1, 514]
Error for text: اشترك في قناة حسين الجسمي الرسمية حسين الجسمي فستانك الأبيض استمعوا على جميع المنصات كلمات أمير طعيم... -> The expanded size of the tensor (516) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 516].  Tensor sizes: [1, 514]
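The two failures in this output are consistent with the truncation warning: no maximum length was set, so the 512-character slice could still expand to more than the 514 positions the model supports (stylized Unicode and Arabic text tend to split into several tokens per character). A possible remedy, sketched here under the assumption that the text-classification pipeline forwards `truncation` and `max_length` to its tokenizer at call time, is to truncate at the token level. The function name `safe_sentiment_tokens` and its `classifier` parameter are hypothetical.

```python
def safe_sentiment_tokens(text, classifier, max_tokens=512):
    """Classify one text, letting the tokenizer cut it to max_tokens tokens."""
    try:
        # Token-level truncation replaces the character slice text[:512]
        result = classifier(text, truncation=True, max_length=max_tokens)
        return result[0]["label"]
    except Exception as exc:
        print(f"Error for text: {text[:100]}... -> {exc}")
        return "error"

# Possible use in this notebook (assumption, not run here):
# df_clean["roberta_sentiment"] = (
#     df_clean["text_clean"].astype(str)
#     .apply(safe_sentiment_tokens, classifier=sentiment_pipeline)
# )
```

Passing the pipeline in as an argument also makes the function easy to check with a stand-in classifier before running the full model.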


## 3. Temporal Analysis of the Topics

A major advantage of BERTopic is that topics can be analyzed over time. To do so, the creation date must be parsed correctly as `datetime` and converted into suitable timestamps.
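In the data preview at the top of this notebook, every `timestamp` value lies on 1970-01-01 with a fractional second such as `01.744709406`. That pattern suggests (an assumption, the source does not confirm it) that Unix epoch seconds were interpreted as nanoseconds during preprocessing. If so, the original dates could be recovered as sketched below; `repair_epoch_seconds` is a hypothetical helper.

```python
import pandas as pd

def repair_epoch_seconds(broken):
    """Recover datetimes whose epoch seconds were read as nanoseconds."""
    parsed = pd.to_datetime(broken, utc=True, errors="coerce")
    epoch = pd.Timestamp("1970-01-01", tz="UTC")
    # The offset from the epoch, counted in nanoseconds, equals the original value in seconds
    seconds = (parsed - epoch) // pd.Timedelta(1, unit="ns")
    return pd.to_datetime(seconds, unit="s", utc=True)

# Two values copied from the data preview
sample = pd.Series([
    "1970-01-01 00:00:01.744709406+00:00",
    "1970-01-01 00:00:01.739846584+00:00",
])
repaired = repair_epoch_seconds(sample)
print(repaired)
```

With repaired values, the bins used for the topics-over-time analysis would span the actual posting period instead of a fraction of a second in 1970.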

In [9]:
# Make sure timestamp is a valid datetime column
# (note: the values in the data preview all fall on 1970-01-01, so the column is worth checking first)
df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'], errors='coerce')
df_clean = df_clean.dropna(subset=['timestamp'])

# Extract all lists from the same frame so they stay aligned
texts_synced = df_clean['text_clean'].tolist()
topics_synced = df_clean['topic'].tolist()
timestamps_synced = df_clean['timestamp'].tolist()  # real datetime values

# Sanity check: all three lengths must match
print(len(texts_synced), len(topics_synced), len(timestamps_synced))

# Compute topics over time (requires real datetime values)
topics_over_time = topic_model.topics_over_time(
    docs=texts_synced,
    topics=topics_synced,
    timestamps=timestamps_synced,
    nr_bins=20  # or another sensible number of time buckets
)

# Visualization
topic_model.visualize_topics_over_time(topics_over_time)


1154 1154 1154


## 4. Visual Exploration with UMAP

To make the embeddings and topics produced by BERTopic easier to interpret, we project the high-dimensional embeddings into 2D with **UMAP**. This makes thematic clusters visible.

In [10]:

# Visualization by topic
topic_model.visualize_documents(texts, topics=topics)

#  Save

In [11]:
df_clean.to_csv(RAW_DIR / "social_media_data_with_topics.csv")


In [12]:
df_clean.head()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,datetime,date,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate,hour,weekday,year,month,day_period,is_weekend,topic,roberta_sentiment
3,tiktok,7472584144510373125,,i think it was a bad idea,maligoshik,0.492308,0.058979,0.149425,0.572996,1970-01-01 00:00:01.739846584+00:00,2025-05-12 13:56:29+00:00,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-05-12 13:56:29+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,think bad idea,14.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,negative,-0.7,0.273435,13.0,Monday,2025.0,5.0,afternoon,False,-1,LABEL_0
4,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,0.006703,0.028503,0.001095,0.037793,1970-01-01 00:00:01.737365272+00:00,2025-05-12 13:56:20+00:00,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8,0.059165,13.0,Monday,2025.0,5.0,afternoon,False,-1,LABEL_2
5,tiktok,7461757350492278048,,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,0.022272,0.012754,0.026598,0.033222,1970-01-01 00:00:01.737325774+00:00,2025-05-12 13:56:20+00:00,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,great goal andrés,17.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,positive,0.8,0.214484,13.0,Monday,2025.0,5.0,afternoon,False,5,LABEL_2
6,tiktok,7462429961345879318,,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,0.017508,0.015076,0.018989,0.031088,1970-01-01 00:00:01.737482377+00:00,2025-05-12 13:56:20+00:00,,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,de,glaube bock,11.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,neutral,0.0,0.180779,13.0,Monday,2025.0,5.0,afternoon,False,-1,LABEL_1
7,tiktok,7483193151004445974,,#CapCut Speedquiz 🤯🧠✅️ #fyp #viral_video #quiz...,brainyy.quiz,0.000715,0.016347,0.002264,0.003353,1970-01-01 00:00:01.742316683+00:00,2025-05-12 13:56:20+00:00,,2025-05-12 13:56:20+00:00,2025-05-12,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,nl,speedquiz,9.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,neutral,0.0,0.085936,13.0,Monday,2025.0,5.0,afternoon,False,19,LABEL_1
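The `roberta_sentiment` column holds raw class ids. According to the model card of `cardiffnlp/twitter-roberta-base-sentiment`, `LABEL_0`, `LABEL_1`, and `LABEL_2` stand for negative, neutral, and positive. A small sketch for making the column readable before classification follows; the names `ROBERTA_LABELS` and `readable_labels` are hypothetical.

```python
import pandas as pd

# Label ids as documented on the model card
ROBERTA_LABELS = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def readable_labels(raw):
    """Translate label ids; unknown values such as 'error' are kept unchanged."""
    return raw.map(ROBERTA_LABELS).fillna(raw)

example = pd.Series(["LABEL_0", "LABEL_2", "LABEL_1", "error"])
print(readable_labels(example).tolist())

# Possible use in this notebook (assumption, not run here):
# df_clean["roberta_sentiment"] = readable_labels(df_clean["roberta_sentiment"])
```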
