<div style="text-align: center; font-size: 16px;">
    <strong>Course:</strong> Machine Learning Operations |
    <strong>Lecturer:</strong> Prof. Dr. Klotz |
    <strong>Date:</strong> 17.05.2025 |
    <strong>Name:</strong> Sofie Pischl
</div>

# <center>Preprocessing</center>

# Notebook Structure

## 1. Setup and Libraries

This section imports all required libraries and downloads the NLTK resources used later (tokenizer models, stopword lists, WordNet, and the POS tagger).

In [5]:
import os
import pandas as pd
#pd.set_option('display.max_colwidth', None)
pd.reset_option('display.max_colwidth')
pd.set_option('display.max_columns', None)
import numpy as np
import re
from pathlib import Path
from datetime import datetime
from sklearn.preprocessing import MinMaxScaler
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from textblob import TextBlob
import logging
import sqlite3

# Logging Setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 2. Loading the Data

The raw data from TikTok, YouTube, and Reddit is read from the SQLite database.


In [6]:
BASE_DIR = Path().resolve().parent
RAW_DIR = BASE_DIR / "data"

# Path to the SQLite database
DB_PATH = RAW_DIR / "social_media.db"

# Open the connection
conn = sqlite3.connect(DB_PATH)

# Read each table into its own DataFrame
df_reddit = pd.read_sql_query("SELECT * FROM reddit_data", conn)
df_tiktok = pd.read_sql_query("SELECT * FROM tiktok_data", conn)
df_youtube = pd.read_sql_query("SELECT * FROM youtube_data", conn)

conn.close()

dfs = {
    "Reddit": df_reddit,
    "TikTok": df_tiktok,
    "YouTube": df_youtube
}

for name, df in dfs.items():
    print(f"\n📊 === {name} Data ===")
    print("🔹 Kopf der Tabelle:")
    # display() renders the table itself and returns None, so it is not wrapped in print()
    display(df.head())

    print("🔹 Info:")
    df.info()
    print("\n" + "-"*50)


📊 === Reddit Data ===
🔹 Kopf der Tabelle:


Unnamed: 0,id,title,text,author,score,created_utc,num_comments,url,subreddit,scraped_at
0,1d05b82bbaf96bc8cbc0a8e3dde9aa30,What's one thing millennials did back in the d...,We used to have to call our friend’s house pho...,,3186,1746206463,2166,https://www.reddit.com/r/Millennials/comments/...,popular,2025-05-03 08:58:16
1,3415c920c6e4622e3b20f81e5f1cdc4d,AITAH for refusing to chip in for a coworker's...,"Alright AITAH ppl, imma need your hot takes on...",,11933,1746193771,1532,https://www.reddit.com/r/AITAH/comments/1kd4be...,popular,2025-05-03 08:58:16
2,05384fb79327ed7f3c6414c1ec3f098c,She erased us from her wedding. So I’m erasing...,"When my brother got married, his bride (now my...",,4547,1746242393,1045,https://www.reddit.com/r/pettyrevenge/comments...,popular,2025-05-03 08:58:16
3,12c55f46157aee9ac731b13a805796b5,Minecraft’s Long-Awaited Visual Upgrade: What ...,"After years of anticipation, Minecraft is fina...",,3,1745506950,2,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-05-03 08:58:16
4,ae721d8cf4c11647581cda11d06522cb,"Trending Subreddits for 2021-06-13: /r/place, ...","Over 6 years ago, when reddit was the equivale...",,90,1623635478,222,https://www.reddit.com/r/trendingsubreddits/co...,trendingsubreddits,2025-05-11 13:48:47



🔹 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            530 non-null    object
 1   title         530 non-null    object
 2   text          530 non-null    object
 3   author        530 non-null    object
 4   score         530 non-null    int64 
 5   created_utc   530 non-null    int64 
 6   num_comments  530 non-null    int64 
 7   url           530 non-null    object
 8   subreddit     530 non-null    object
 9   scraped_at    530 non-null    object
dtypes: int64(3), object(7)
memory usage: 41.5+ KB

--------------------------------------------------

📊 === TikTok Data ===
🔹 Kopf der Tabelle:


Unnamed: 0,id,description,author_username,author_id,likes,shares,comments,plays,video_url,created_time,scraped_at
0,7492000423641959685,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,7322835376556442629,58400,1576,451,1300000,https://webapp-va.tiktok.com/6508d64d970e751a2...,1744367290,2025-05-12 13:56:20
1,7461927005689302280,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,7455509281098515474,261400,9526,13000,12400000,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,1737365272,2025-05-12 13:56:20
2,7461757350492278048,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,6718984144989438982,868600,231400,5817,10900000,https://v16-webapp-prime.tiktok.com/video/tos/...,1737325774,2025-05-12 13:56:20
3,7462429961345879318,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,7316534594073625633,682800,165200,6876,10200000,,1737482377,2025-05-12 13:56:20
4,7477658166235286791,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,6847768940501599238,15100000,2200000,38900,101900000,https://v16-webapp-prime.tiktok.com/video/tos/...,1741027969,2025-05-12 13:56:20



🔹 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1598 entries, 0 to 1597
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               1598 non-null   object
 1   description      1594 non-null   object
 2   author_username  1598 non-null   object
 3   author_id        1598 non-null   object
 4   likes            1598 non-null   int64 
 5   shares           1598 non-null   int64 
 6   comments         1598 non-null   int64 
 7   plays            1598 non-null   int64 
 8   video_url        1587 non-null   object
 9   created_time     1598 non-null   int64 
 10  scraped_at       1598 non-null   object
dtypes: int64(5), object(6)
memory usage: 137.5+ KB

--------------------------------------------------

📊 === YouTube Data ===
🔹 Kopf der Tabelle:


Unnamed: 0,video_id,title,description,channel_title,view_count,like_count,comment_count,published_at,scraped_at
0,"#AborandTynna #OfficialMusicVideo #Baller""",Abor & Tynna,2025-05-08T18:00:06Z,234565,647,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
1,"#Asche #Kollegah #BisHierLiefAllesNochGut""",Asche,2025-05-08T22:00:07Z,127947,884,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
2,"#MileyCyrus #MoretoLose #SomethingBeautiful""",MileyCyrusVEVO,2025-05-09T04:00:07Z,1029912,4865,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
3,"#Oblivion #ArcRaiders #GameTwo #ZDFNeo #rbtv""",Game Two,2025-05-10T10:00:43Z,68615,519,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
4,#ROSÉ_Messy #F1TheAlbum #F1TheMovie #F1 #ROSÉ ...,ROSÉ,2025-05-08T16:00:07Z,7123605,44993,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29



🔹 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   video_id       400 non-null    object
 1   title          400 non-null    object
 2   description    400 non-null    object
 3   channel_title  400 non-null    object
 4   view_count     400 non-null    int64 
 5   like_count     400 non-null    int64 
 6   comment_count  400 non-null    int64 
 7   published_at   400 non-null    object
 8   scraped_at     400 non-null    object
dtypes: int64(3), object(6)
memory usage: 28.2+ KB

--------------------------------------------------
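
The loading step can be made a little more defensive by checking which tables actually exist before querying them, and by closing the connection even if a query fails. The following is a minimal sketch using only the standard library and an in-memory database. The helper name `list_tables` and the demo schema are assumptions for illustration and are not part of the notebook.

```python
import sqlite3
from contextlib import closing

def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the database (hypothetical helper)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# In-memory stand-in for data/social_media.db (demo schema, not the real one)
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE reddit_data (id TEXT, title TEXT)")
    conn.execute("CREATE TABLE tiktok_data (id TEXT, description TEXT)")
    conn.execute("INSERT INTO reddit_data VALUES ('abc', 'demo post')")

    tables = list_tables(conn)
    expected = ["reddit_data", "tiktok_data", "youtube_data"]
    missing = [t for t in expected if t not in tables]
    reddit_rows = conn.execute("SELECT * FROM reddit_data").fetchall()

print(tables)   # tables that exist in the demo database
print(missing)  # expected tables that were not found
```

`contextlib.closing` is used because a `sqlite3` connection used directly as a context manager only commits or rolls back the transaction; it does not close the connection.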


## 3. Text Cleaning and Feature Extraction

This section defines functions for cleaning, lemmatization, stopword filtering, and the extraction of text features for sentiment analysis and topic modeling.


In [7]:
def remove_emojis(text):
    if not isinstance(text, str):
        return ""
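    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols and pictographs
        "\U0001F680-\U0001F6FF"  # transport and map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        "\U00002700-\U000027BF"  # dingbats
        "\U000024C2-\U0001F251"  # NOTE: very broad range; it also spans CJK, kana, and Hangul, so text in those scripts is removed too
        "]", flags=re.UNICODE
    )
    return emoji_pattern.sub('', text)


def get_wordnet_pos(treebank_tag):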
    if treebank_tag.startswith('J'): return wordnet.ADJ
    elif treebank_tag.startswith('V'): return wordnet.VERB
    elif treebank_tag.startswith('N'): return wordnet.NOUN
    elif treebank_tag.startswith('R'): return wordnet.ADV
    return wordnet.NOUN

def lemmatize_tokens(tokens):
    tagged = pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in tagged]

def preprocess_text(text, remove_stopwords=True):
    if not isinstance(text, str):
        return ""

    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = remove_emojis(text)
    text = re.sub(r'#(\w+)', r'\1', text)
    text = re.sub(r'[^\w\s\']', ' ', text)
    text = re.sub(r'\s\'|\'\s', ' ', text)
    text = ' '.join(text.split())

    tokens = word_tokenize(text)

    if remove_stopwords:
        stop_words = set()
        for lang in ['english', 'german']:
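            try:
                stop_words.update(stopwords.words(lang))
            except (LookupError, OSError):
                # LookupError: corpus not downloaded; OSError: no word list for this language
                logger.warning(f"Stopwords for {lang} not available")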
        important_words = {"n't", "'s", "'m", "'re", "'ve", "'ll", "no", "not"}
        stop_words -= important_words
        tokens = [token for token in tokens if token.lower() not in stop_words]

    return ' '.join(lemmatize_tokens(tokens))

def extract_text_features(text):
    if not isinstance(text, str) or not text.strip():
        return {
            'word_count': 0,
            'char_count': 0,
            'avg_word_length': 0,
            'sentiment_polarity': 0,
            'sentiment_subjectivity': 0
        }
    words = text.split()
    blob = TextBlob(text)
    return {
        'word_count': len(words),
        'char_count': len(text),
        'avg_word_length': sum(len(w) for w in words) / len(words),  # mean characters per word, whitespace excluded
        'sentiment_polarity': blob.sentiment.polarity,
        'sentiment_subjectivity': blob.sentiment.subjectivity
    }

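def apply_text_processing(df, col):
    df = df.copy()
    # fillna first so missing values become "" instead of the literal string "nan" or "None"
    df[f"{col}_processed"] = df[col].fillna("").astype(str).apply(preprocess_text)
    features = df[f"{col}_processed"].apply(extract_text_features)
    # Reuse df's index: the cleaned frames have gaps in their index, and a fresh
    # RangeIndex would misalign the feature columns in concat(axis=1)
    return pd.concat([df, pd.DataFrame(features.tolist(), index=df.index)], axis=1)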
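
To make the regex steps in `preprocess_text` easier to follow, here is a minimal sketch that runs only the pattern-based part on one caption and then computes a mean word length over the individual words. It deliberately leaves out emoji removal, tokenization, stopword filtering, and lemmatization, so it needs no NLTK resources. The function name `basic_clean` and the sample sentence are assumptions for illustration.

```python
import re

def basic_clean(text: str) -> str:
    """Regex-only part of the cleaning: no tokenizer, stopwords, or lemmatizer."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'#(\w+)', r'\1', text)        # "#goal" -> "goal"
    text = re.sub(r'[^\w\s\']', ' ', text)       # punctuation -> space, keep apostrophes
    text = re.sub(r'\s\'|\'\s', ' ', text)       # drop apostrophes at word edges
    return ' '.join(text.split())                # collapse whitespace

sample = "This is a GREAT goal!! #golazo https://example.com/x"
cleaned = basic_clean(sample)
words = cleaned.split()
avg_len = sum(len(w) for w in words) / len(words)

print(cleaned)            # → this is a great goal golazo
print(len(words))         # → 6
print(round(avg_len, 2))  # → 3.67
```

Dividing the full string length (27 characters including spaces) by the word count would give 4.5 here, which is why the word lengths are summed individually.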


## 4. Cleaning the Platform Data

The content differs structurally between platforms, so each source is cleaned separately.


## Reddit

In [8]:
# Cleaning function for Reddit
def clean_reddit_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing values in essential fields
    df = df.dropna(subset=['text'])

    # Remove duplicates; copy so the column assignments below do not act on a view
    df = df.drop_duplicates(subset=['text']).copy()

    # Convert Unix timestamps to datetime values
    if 'created_utc' in df.columns:
        df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s', errors='coerce')

    # Fill scraped_at from the creation time if it is empty
    # (the Reddit table has no 'created' column, so 'created_utc' is used)
    if 'scraped_at' in df.columns and 'created_utc' in df.columns:
        df['scraped_at'] = df['scraped_at'].fillna(df['created_utc'])

    return df

# Apply the cleaning
df_reddit = clean_reddit_data(df_reddit)

print("\n🧹 Erste 5 Zeilen der bereinigten Reddit-Daten:")
display(df_reddit.head())

print("\nℹ️ Infos zu bereinigten Reddit-Daten:")
df_reddit.info()


🧹 Erste 5 Zeilen der bereinigten Reddit-Daten:


Unnamed: 0,id,title,text,author,score,created_utc,num_comments,url,subreddit,scraped_at
0,1d05b82bbaf96bc8cbc0a8e3dde9aa30,What's one thing millennials did back in the d...,We used to have to call our friend’s house pho...,,3186,2025-05-02 17:21:03,2166,https://www.reddit.com/r/Millennials/comments/...,popular,2025-05-03 08:58:16
1,3415c920c6e4622e3b20f81e5f1cdc4d,AITAH for refusing to chip in for a coworker's...,"Alright AITAH ppl, imma need your hot takes on...",,11933,2025-05-02 13:49:31,1532,https://www.reddit.com/r/AITAH/comments/1kd4be...,popular,2025-05-03 08:58:16
2,05384fb79327ed7f3c6414c1ec3f098c,She erased us from her wedding. So I’m erasing...,"When my brother got married, his bride (now my...",,4547,2025-05-03 03:19:53,1045,https://www.reddit.com/r/pettyrevenge/comments...,popular,2025-05-03 08:58:16
3,12c55f46157aee9ac731b13a805796b5,Minecraft’s Long-Awaited Visual Upgrade: What ...,"After years of anticipation, Minecraft is fina...",,3,2025-04-24 15:02:30,2,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-05-03 08:58:16
4,ae721d8cf4c11647581cda11d06522cb,"Trending Subreddits for 2021-06-13: /r/place, ...","Over 6 years ago, when reddit was the equivale...",,90,2021-06-14 01:51:18,222,https://www.reddit.com/r/trendingsubreddits/co...,trendingsubreddits,2025-05-11 13:48:47



ℹ️ Infos zu bereinigten Reddit-Daten:
<class 'pandas.core.frame.DataFrame'>
Index: 482 entries, 0 to 529
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            482 non-null    object        
 1   title         482 non-null    object        
 2   text          482 non-null    object        
 3   author        482 non-null    object        
 4   score         482 non-null    int64         
 5   created_utc   482 non-null    datetime64[ns]
 6   num_comments  482 non-null    int64         
 7   url           482 non-null    object        
 8   subreddit     482 non-null    object        
 9   scraped_at    482 non-null    object        
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 41.4+ KB
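
The `unit='s'` conversion can be cross-checked on a single value with the standard library. This sketch converts the `created_utc` value of the first raw Reddit row; the helper name `unix_to_utc` is an assumption for illustration.

```python
from datetime import datetime, timezone

def unix_to_utc(ts: int) -> datetime:
    """Interpret a Unix timestamp (seconds) as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# created_utc of the first Reddit row in the raw table
created = unix_to_utc(1746206463)
formatted = created.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # → 2025-05-02 17:21:03
```

The result matches the first row of the cleaned table. The only difference is that pandas stores a naive `datetime64[ns]` value, while the standard-library object carries an explicit UTC time zone.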


In [9]:
# Make sure 'created_utc' is a datetime column
df_reddit["created_utc"] = pd.to_datetime(df_reddit["created_utc"], errors="coerce")

# Extract the date only
df_reddit["date"] = df_reddit["created_utc"].dt.date

# Count posts per date
date_counts = df_reddit["date"].value_counts().sort_index()

# Print the counts
for d, count in date_counts.items():
    print(f"{d}: {count} Einträge")

2021-03-07: 1 Einträge
2021-03-08: 1 Einträge
2021-03-09: 1 Einträge
2021-03-11: 1 Einträge
2021-03-12: 1 Einträge
2021-03-13: 1 Einträge
2021-03-15: 1 Einträge
2021-03-16: 1 Einträge
2021-03-17: 1 Einträge
2021-03-18: 1 Einträge
2021-03-19: 1 Einträge
2021-03-20: 1 Einträge
2021-03-21: 1 Einträge
2021-03-22: 1 Einträge
2021-03-23: 1 Einträge
2021-03-24: 1 Einträge
2021-03-25: 1 Einträge
2021-03-26: 1 Einträge
2021-03-27: 1 Einträge
2021-03-28: 1 Einträge
2021-03-29: 1 Einträge
2021-03-30: 1 Einträge
2021-03-31: 1 Einträge
2021-04-01: 1 Einträge
2021-04-02: 1 Einträge
2021-04-03: 1 Einträge
2021-04-04: 1 Einträge
2021-04-05: 2 Einträge
2021-04-06: 1 Einträge
2021-04-07: 1 Einträge
2021-04-08: 1 Einträge
2021-04-09: 1 Einträge
2021-04-10: 1 Einträge
2021-04-11: 1 Einträge
2021-04-12: 1 Einträge
2021-04-13: 1 Einträge
2021-04-14: 1 Einträge
2021-04-15: 1 Einträge
2021-04-16: 1 Einträge
2021-04-17: 1 Einträge
2021-04-18: 1 Einträge
2021-04-19: 1 Einträge
2021-04-20: 1 Einträge
2021-04-21:

In [10]:
# Filter: keep only rows whose year is not 2021; copy to avoid writing to a view
df_reddit = df_reddit[df_reddit["created_utc"].dt.year != 2021].copy()

# Extract the date only
df_reddit["date"] = df_reddit["created_utc"].dt.date

# Count posts per date
date_counts = df_reddit["date"].value_counts().sort_index()

# Sort by date (oldest first)
df_sorted = df_reddit.sort_values("date", ascending=True).reset_index(drop=True)
df_sorted

Unnamed: 0,id,title,text,author,score,created_utc,num_comments,url,subreddit,scraped_at,date
0,1j68vz6,My New Website,ShopSphere is my new website I have just creat...,,0,2025-03-08 04:10:26,0,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-04-14 20:31:29,2025-03-08
1,1ja2kyz,Calling Out: Tabloid Writers,I am doing a story around Torontonians' perspe...,,2,2025-03-13 03:08:46,0,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-04-14 20:31:29,2025-03-13
2,1jbipra,Suicidal 1,"Dear Friend,\n\nMy name is Denzo, and I write ...",,3,2025-03-15 00:24:47,4,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-04-14 20:31:29,2025-03-15
3,1jhw2oa,Year by year,2017:swag\n\n2018:thug life\n\n2019:savage\n\n...,,3,2025-03-23 10:12:54,0,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-04-14 20:31:29,2025-03-23
4,1jikcwk,Binance,Únete a la competencia por ROI de Spot en Bina...,,1,2025-03-24 06:26:14,0,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-04-14 20:31:29,2025-03-24
...,...,...,...,...,...,...,...,...,...,...,...
379,1kjrsvy,I HAVE BEEN TAUGHT TO NEVER EVER GO TO A 2ND L...,"When I was in 1st grade, I was going to go on ...",,9140,2025-05-11 01:37:23,314,https://www.reddit.com/r/TwoXChromosomes/comme...,all,2025-05-11 14:38:10,2025-05-11
380,1kjrjau,[Post Game Thread] The Minnesota Timberwolves ...,||\n|:-:|\n|[](/MIN) **102 - 97** [](/GSW)|\n...,,4823,2025-05-11 01:21:53,1842,https://www.reddit.com/r/nba/comments/1kjrjau/...,popular,2025-05-11 17:05:53,2025-05-11
381,1kk29so,For Opportunity Hunters,"Hii,\n\nI am Velan, a BSc Computer Sscience + ...",,0,2025-05-11 12:27:28,0,https://www.reddit.com/r/TrendingReddits/comme...,trendingreddits,2025-05-11 20:41:46,2025-05-11
382,1kjvf32,AITA for refusing to wear a wig in my brother’...,I (23F) have decided not to wear a wig in my b...,,3489,2025-05-11 05:31:39,587,https://www.reddit.com/r/AmItheAsshole/comment...,popular,2025-05-11 18:29:31,2025-05-11


## TikTok

In [11]:
def clean_tiktok_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Cleans TikTok data:
    - renames the author columns
    - converts numeric columns to integers
    - converts timestamps to datetime
    - removes empty and duplicate descriptions
    """

    # Rename columns
    df = df.rename(columns={
        'author_username': 'username',
        'author_id': 'user_id',
    })

    # Convert numeric columns: NaN → 0 → int
    numeric_cols = ['likes', 'shares', 'comments', 'plays']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

    # Convert Unix timestamps to datetime
    if 'created_time' in df.columns:
        df['created'] = pd.to_datetime(df['created_time'], unit='s', errors='coerce')

    # Clean descriptions: drop empty ones, then drop duplicates
    if 'description' in df.columns:
        df['description'] = df['description'].fillna("").astype(str)
        df = df[df['description'].str.strip() != ""]
        df = df.drop_duplicates(subset=['description'])

    return df



df_tiktok = clean_tiktok_data(df_tiktok)

print("\nErste 5 Zeilen:")
display(df_tiktok.head())


print("Dataframe Info:")
df_tiktok.info()  # info() prints directly and returns None


Erste 5 Zeilen:


Unnamed: 0,id,description,username,user_id,likes,shares,comments,plays,video_url,created_time,scraped_at,created
0,7492000423641959685,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,7322835376556442629,58400,1576,451,1300000,https://webapp-va.tiktok.com/6508d64d970e751a2...,1744367290,2025-05-12 13:56:20,2025-04-11 10:28:10
1,7461927005689302280,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,7455509281098515474,261400,9526,13000,12400000,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,1737365272,2025-05-12 13:56:20,2025-01-20 09:27:52
2,7461757350492278048,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,6718984144989438982,868600,231400,5817,10900000,https://v16-webapp-prime.tiktok.com/video/tos/...,1737325774,2025-05-12 13:56:20,2025-01-19 22:29:34
3,7462429961345879318,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,7316534594073625633,682800,165200,6876,10200000,,1737482377,2025-05-12 13:56:20,2025-01-21 17:59:37
4,7477658166235286791,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,6847768940501599238,15100000,2200000,38900,101900000,https://v16-webapp-prime.tiktok.com/video/tos/...,1741027969,2025-05-12 13:56:20,2025-03-03 18:52:49


Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
Index: 1427 entries, 0 to 1597
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            1427 non-null   object        
 1   description   1427 non-null   object        
 2   username      1427 non-null   object        
 3   user_id       1427 non-null   object        
 4   likes         1427 non-null   int32         
 5   shares        1427 non-null   int32         
 6   comments      1427 non-null   int32         
 7   plays         1427 non-null   int32         
 8   video_url     1419 non-null   object        
 9   created_time  1427 non-null   int64         
 10  scraped_at    1427 non-null   object        
 11  created       1427 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int32(4), int64(1), object(6)
memory usage: 122.6+ KB


In [12]:

df_tiktok["date"] = df_tiktok["created"].dt.date

# Count posts per date
date_counts = df_tiktok["date"].value_counts().sort_index()

# Print the counts
for d, count in date_counts.items():
    print(f"{d}: {count} Einträge")

1970-01-01: 1 Einträge
2024-08-06: 1 Einträge
2024-08-11: 1 Einträge
2024-08-27: 2 Einträge
2024-09-18: 1 Einträge
2024-09-25: 1 Einträge
2024-10-05: 1 Einträge
2024-10-22: 1 Einträge
2025-01-07: 1 Einträge
2025-01-19: 1 Einträge
2025-01-20: 1 Einträge
2025-01-21: 3 Einträge
2025-01-27: 2 Einträge
2025-01-31: 1 Einträge
2025-02-03: 1 Einträge
2025-02-04: 2 Einträge
2025-02-05: 4 Einträge
2025-02-06: 3 Einträge
2025-02-08: 8 Einträge
2025-02-09: 5 Einträge
2025-02-10: 6 Einträge
2025-02-11: 8 Einträge
2025-02-12: 6 Einträge
2025-02-13: 10 Einträge
2025-02-14: 13 Einträge
2025-02-15: 9 Einträge
2025-02-16: 7 Einträge
2025-02-17: 8 Einträge
2025-02-18: 13 Einträge
2025-02-19: 6 Einträge
2025-02-20: 9 Einträge
2025-02-21: 10 Einträge
2025-02-22: 14 Einträge
2025-02-23: 8 Einträge
2025-02-24: 7 Einträge
2025-02-25: 11 Einträge
2025-02-26: 14 Einträge
2025-02-27: 15 Einträge
2025-02-28: 11 Einträge
2025-03-01: 11 Einträge
2025-03-02: 9 Einträge
2025-03-03: 16 Einträge
2025-03-04: 9 Einträge
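
The `1970-01-01` entry at the top of this list is the Unix epoch, which usually indicates a `created_time` of `0` (or a value close to it) rather than a real posting date. Below is a minimal sketch of how such rows could be flagged before any time-based analysis. The cutoff `MIN_PLAUSIBLE`, the helper `is_plausible`, and the placeholder value `0` are assumptions for illustration; the other two timestamps are taken from the TikTok table above.

```python
from datetime import datetime, timezone

# Assumed lower bound for plausible post dates; adjust to the data at hand
MIN_PLAUSIBLE = datetime(2016, 1, 1, tzinfo=timezone.utc)

def is_plausible(ts: int) -> bool:
    """True if the Unix timestamp (seconds) falls on or after MIN_PLAUSIBLE."""
    return datetime.fromtimestamp(ts, tz=timezone.utc) >= MIN_PLAUSIBLE

timestamps = [0, 1737365272, 1744367290]  # 0 is a made-up placeholder value
flags = [is_plausible(ts) for ts in timestamps]
print(flags)  # → [False, True, True]
```

In the notebook, the same idea could be expressed as a boolean mask on the `created` column, so that implausible rows are either dropped or set to missing.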


## YouTube

In [13]:
def clean_youtube_data(df):
    df = df.copy()

    # Fill missing text fields
    for col in ['title', 'description']:
        df[col] = df[col].fillna('')

    # Convert timestamps
    for col in ['scraped_at', 'published_at']:
        df[col] = pd.to_datetime(df[col], errors='coerce')

    # Convert counts to integers (NaN becomes 0)
    for col in ['view_count', 'like_count', 'comment_count']:
        df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

    # Remove empty or missing descriptions
    df['description'] = df['description'].astype(str).str.strip()
    df = df[df['description'] != ""]

    # Remove duplicate descriptions
    df = df.drop_duplicates(subset=['description'])

    return df

# Apply the cleaning
df_youtube = clean_youtube_data(df_youtube)

# Display
display(df_youtube.head())
df_youtube.info()


Unnamed: 0,video_id,title,description,channel_title,view_count,like_count,comment_count,published_at,scraped_at
0,"#AborandTynna #OfficialMusicVideo #Baller""",Abor & Tynna,2025-05-08T18:00:06Z,234565,647,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
1,"#Asche #Kollegah #BisHierLiefAllesNochGut""",Asche,2025-05-08T22:00:07Z,127947,884,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
2,"#MileyCyrus #MoretoLose #SomethingBeautiful""",MileyCyrusVEVO,2025-05-09T04:00:07Z,1029912,4865,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
3,"#Oblivion #ArcRaiders #GameTwo #ZDFNeo #rbtv""",Game Two,2025-05-10T10:00:43Z,68615,519,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29
4,#ROSÉ_Messy #F1TheAlbum #F1TheMovie #F1 #ROSÉ ...,ROSÉ,2025-05-08T16:00:07Z,7123605,44993,0,0,2025-05-12 13:56:29,2025-05-12 13:56:29


<class 'pandas.core.frame.DataFrame'>
Index: 394 entries, 0 to 399
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   video_id       394 non-null    object        
 1   title          394 non-null    object        
 2   description    394 non-null    object        
 3   channel_title  394 non-null    object        
 4   view_count     394 non-null    int32         
 5   like_count     394 non-null    int32         
 6   comment_count  394 non-null    int32         
 7   published_at   394 non-null    datetime64[ns]
 8   scraped_at     394 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int32(3), object(4)
memory usage: 26.2+ KB
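
In the `head()` output above, the `description` column holds values such as `2025-05-08T18:00:06Z`, which look like publication timestamps rather than text. At least for these first rows, this suggests the columns may have been shifted when the table was written. A quick sanity check is to count how many descriptions match an ISO 8601 timestamp pattern. The sketch below works on a plain list; `looks_like_iso_timestamp` is a hypothetical helper, and the third sample value is made up for contrast.

```python
import re

ISO_TS = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

def looks_like_iso_timestamp(value: str) -> bool:
    """True if the whole string is a timestamp such as 2025-05-08T18:00:06Z."""
    return bool(ISO_TS.match(value.strip()))

# Two values copied from the description column above, plus one made-up text
descriptions = ["2025-05-08T18:00:06Z", "2025-05-08T22:00:07Z", "Official music video"]
suspicious = [d for d in descriptions if looks_like_iso_timestamp(d)]
print(f"{len(suspicious)} of {len(descriptions)} descriptions look like timestamps")
```

A high share of matches would point to a problem in the scraping or export step, which is better fixed there than patched in this notebook.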


In [14]:
# Parse 'published_at' into a datetime column named 'created'
df_youtube["created"] = pd.to_datetime(df_youtube["published_at"], errors="coerce")

# Extract the date only
df_youtube["date"] = df_youtube["created"].dt.date

# Count posts per date
date_counts = df_youtube["date"].value_counts().sort_index()

# Print the counts
for d, count in date_counts.items():
    print(f"{d}: {count} Einträge")

2025-04-16: 1 Einträge
2025-04-17: 4 Einträge
2025-04-18: 7 Einträge
2025-04-19: 8 Einträge
2025-04-20: 10 Einträge
2025-04-21: 16 Einträge
2025-04-22: 26 Einträge
2025-04-23: 14 Einträge
2025-04-24: 14 Einträge
2025-04-25: 2 Einträge
2025-04-27: 2 Einträge
2025-04-28: 6 Einträge
2025-04-29: 16 Einträge
2025-04-30: 15 Einträge
2025-05-01: 27 Einträge
2025-05-02: 19 Einträge
2025-05-03: 23 Einträge
2025-05-04: 18 Einträge
2025-05-05: 11 Einträge
2025-05-06: 23 Einträge
2025-05-07: 13 Einträge
2025-05-08: 33 Einträge
2025-05-09: 18 Einträge
2025-05-10: 22 Einträge
2025-05-11: 11 Einträge
2025-05-12: 35 Einträge


## 5. Merging the Data

In [15]:
# Extracts the ID from a URL (the part after the last slash)
def extract_id_from_url(url):
    if isinstance(url, str):
        return url.rstrip('/').split('/')[-1]
    return None
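
A short usage example shows what the helper returns for different inputs. The URLs below are hypothetical. The Reddit URLs in the tables above are truncated, so it is not visible whether a title slug follows the post ID; if it does, the last path segment is the slug rather than the ID.

```python
def extract_id_from_url(url):
    if isinstance(url, str):
        return url.rstrip('/').split('/')[-1]
    return None

# Hypothetical URLs for illustration
short = extract_id_from_url("https://www.reddit.com/comments/1kjrjau/")
with_slug = extract_id_from_url("https://www.reddit.com/r/nba/comments/1kjrjau/post_game_thread/")
missing = extract_id_from_url(None)

print(short, with_slug, missing)  # → 1kjrjau post_game_thread None
```

If the ID is needed for URLs that carry a slug, taking the segment that follows `comments` would be a more reliable rule than taking the last one.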

In [16]:
df_reddit.columns

Index(['id', 'title', 'text', 'author', 'score', 'created_utc', 'num_comments',
       'url', 'subreddit', 'scraped_at', 'date'],
      dtype='object')

In [17]:
df_tiktok.columns

Index(['id', 'description', 'username', 'user_id', 'likes', 'shares',
       'comments', 'plays', 'video_url', 'created_time', 'scraped_at',
       'created', 'date'],
      dtype='object')

In [18]:
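# Maps the three platform DataFrames onto one shared schema
def unify_dataframes(df_tiktok, df_youtube, df_reddit):
    # TikTok
    # NOTE: here 'timestamp' holds the Unix creation time and 'published_at' the scrape time,
    # whereas for YouTube and Reddit 'timestamp' is the scrape time and 'published_at' the creation time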
    df_tiktok_clean = pd.DataFrame({
        'source': 'tiktok',
        'id': df_tiktok['id'],
        'title': None,
        'text': df_tiktok['description'],
        'username': df_tiktok['username'],
        'likes': df_tiktok['likes'],
        'comments': df_tiktok['comments'],
        'shares': df_tiktok['shares'],
        'plays': df_tiktok['plays'],
        'timestamp': df_tiktok['created_time'],
        'published_at': df_tiktok['scraped_at'],
        'url': df_tiktok['video_url']
    })

    # YouTube
    df_youtube_clean = pd.DataFrame({
        'source': 'youtube',
        'id': df_youtube['video_id'],
        'title': df_youtube['title'],
        'text': df_youtube['description'],
        'username': df_youtube['channel_title'],
        'likes': df_youtube['like_count'],
        'comments': df_youtube['comment_count'],
        'shares': None,
        'plays': df_youtube['view_count'],
        'timestamp': df_youtube['scraped_at'],
        'published_at': df_youtube['created'],
        'url': None
    })
 

    # Reddit
    df_reddit_clean = pd.DataFrame({
        'source': 'reddit',
        'id': df_reddit['id'],
        'title': df_reddit['title'],
        'text': df_reddit['text'],
        'username': df_reddit['author'],
        'likes': df_reddit['score'],
        'comments': df_reddit['num_comments'],
        'shares': None,
        'plays': None,
        'timestamp': df_reddit['scraped_at'],
        'published_at': df_reddit['created_utc'],
        'url': df_reddit['url']
    })

    # Combine all platforms into one DataFrame
    return pd.concat([df_tiktok_clean, df_youtube_clean, df_reddit_clean], ignore_index=True)

# Merge the platform data
df_combined = unify_dataframes(df_tiktok, df_youtube, df_reddit)
df_combined.head()


Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url
0,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,58400,451,1576,1300000,1744367290,2025-05-12 13:56:20,https://webapp-va.tiktok.com/6508d64d970e751a2...
1,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,261400,13000,9526,12400000,1737365272,2025-05-12 13:56:20,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...
2,tiktok,7461757350492278048,,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,868600,5817,231400,10900000,1737325774,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...
3,tiktok,7462429961345879318,,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,682800,6876,165200,10200000,1737482377,2025-05-12 13:56:20,
4,tiktok,7477658166235286791,,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,15100000,38900,2200000,101900000,1741027969,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...


In [19]:
# Number of posts per platform
post_counts = df_combined["source"].value_counts().reset_index()
post_counts.columns = ["source", "num_posts"]

# Display
print(post_counts)

    source  num_posts
0   tiktok       1427
1  youtube        394
2   reddit        384


## Detect language

In [20]:
from langdetect import detect, DetectorFactory, LangDetectException
import langid

DetectorFactory.seed = 0  # for reproducible langdetect results

# 🧼 Clean the text before language detection
def clean_for_langdetect(text: str) -> str:
    return re.sub(r"http\S+|@\S+|#\S+|[^a-zA-ZäöüÄÖÜß0-9\s]", " ", text).strip()

# 🔍 Robust language detection: langdetect first, langid as fallback
def detect_language_robust(text: str) -> str:
    text = clean_for_langdetect(text)
    if not text or len(text.split()) == 0:
        return "unknown"
    try:
        return detect(text)
    except LangDetectException:
        pass
    lang_fallback, _ = langid.classify(text)
    return lang_fallback or "unknown"

# 📄 Copy the original DataFrame
df_langs = df_combined.copy()

# 🧪 Detect the language of the 'text' column
df_langs["text"] = df_langs["text"].fillna("").astype(str)
df_langs["text_language"] = df_langs["text"].apply(detect_language_robust)

# 📊 Number of texts per language and platform
language_summary = pd.crosstab(df_langs["source"], df_langs["text_language"])
display(language_summary)


INFO:langid.langid:initializing identifier


source,af,ca,cs,cy,da,de,en,es,et,fi,fr,hr,hu,id,it,lt,lv,nl,no,pl,pt,ro,sk,sl,so,sq,sv,sw,tl,tr,unknown,vi
reddit,0,0,0,0,0,1,362,2,1,0,0,0,0,0,0,0,0,2,0,1,1,0,0,0,1,0,1,0,2,1,9,0
tiktok,13,17,4,12,5,153,536,26,16,10,50,4,3,21,46,1,1,13,17,32,25,19,2,9,12,3,12,12,27,5,315,6
youtube,0,0,0,0,0,245,141,1,0,0,2,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0
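The two-stage detection used above (try one detector, fall back to a second one if the first raises) can be sketched without any external library. The two detector functions below are hypothetical stand-ins, and `ValueError` takes the place of `LangDetectException`; only the control flow mirrors `detect_language_robust`.

```python
from typing import Callable


def detect_with_fallback(
    text: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Return the primary detector's answer, or the fallback's if the primary raises."""
    if not text.strip():
        return "unknown"
    try:
        return primary(text)
    except ValueError:
        # Stand-in for LangDetectException in the notebook
        pass
    return fallback(text) or "unknown"


# Hypothetical detectors, for illustration only
def strict_detector(text: str) -> str:
    if len(text.split()) < 3:
        raise ValueError("too little text")
    return "en"


def lenient_detector(text: str) -> str:
    return "de"


long_result = detect_with_fallback("hello there my friend", strict_detector, lenient_detector)
short_result = detect_with_fallback("hallo", strict_detector, lenient_detector)
empty_result = detect_with_fallback("   ", strict_detector, lenient_detector)
```

Short captions are the typical case in which the first detector gives up, which is why the fallback matters most for the TikTok rows.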


In [21]:
# 🔎 Show the first 5 texts whose language could not be detected
unknown_samples = df_langs[df_langs["text_language"] == "unknown"].head(5)
display(unknown_samples)

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,text_language
0,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,58400,451,1576,1300000,1744367290,2025-05-12 13:56:20,https://webapp-va.tiktok.com/6508d64d970e751a2...,unknown
4,tiktok,7477658166235286791,,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,15100000,38900,2200000,101900000,1741027969,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown
10,tiktok,7489932324423978262,,#fyp #videoviral #relatable #ukcomedy,amzszinotv,5400000,19600,988900,46600000,1743885767,2025-05-12 13:56:26,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown
12,tiktok,7477542845943975190,,#tiktokfood #asmr,zachchoicook6,3800000,11700,61900,68400000,1741001120,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown
19,tiktok,7489550724301475090,,راحه نفسيه ☺️🙂 #CapCut#اشهد_ان_لا_اله_الا_الله...,.__holy.quran,65600,621,3581,1200000,1743796924,2025-05-12 13:56:21,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown
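The `unknown` rows above are largely a consequence of the cleaning step rather than of the detectors themselves. The following minimal sketch reuses the regular expression from `clean_for_langdetect` and shows that hashtag-only captions and captions in non-Latin scripts are reduced to an empty string before any detector sees them. The sample strings are shortened, illustrative captions.

```python
import re


def clean_for_langdetect(text: str) -> str:
    # Same pattern as in the notebook cell above
    return re.sub(r"http\S+|@\S+|#\S+|[^a-zA-ZäöüÄÖÜß0-9\s]", " ", text).strip()


hashtags_only = clean_for_langdetect("#tiktokfood #asmr")
arabic_caption = clean_for_langdetect("راحه نفسيه ☺️🙂 #CapCut")
english_caption = clean_for_langdetect("welcome to the thanos world!! #squidgame")

print(repr(hashtags_only))    # empty string, reported as "unknown"
print(repr(arabic_caption))   # empty string, reported as "unknown"
print(repr(english_caption))  # plain words survive and can be classified
```

If non-Latin languages should be recognised, one option would be to strip only URLs, mentions and hashtags and to leave the remaining characters untouched.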


# 6. Feature Engineering

In [22]:
from textblob import TextBlob

In [23]:
from nltk.corpus import stopwords

SUPPORTED_LANGS = {
    "en": "english",
    "de": "german",
    "fr": "french",
    "es": "spanish",
    "it": "italian",
    # extend with further languages as needed
}

STOPWORDS_MAP = {
    lang: set(stopwords.words(nltk_lang))
    for lang, nltk_lang in SUPPORTED_LANGS.items()
}
DEFAULT_STOPWORDS = STOPWORDS_MAP["en"]

In [24]:
def deep_clean_text_with_lang(text: str, lang: str) -> str:
    text = str(text).lower()
    text = re.sub(r"http\S+|@\S+|#\S+|[^\w\s]", " ", text)
    text = re.sub(r"\d+", " ", text)

    tokens = word_tokenize(text)
    stop_words = STOPWORDS_MAP.get(lang, DEFAULT_STOPWORDS)
    tokens = [w for w in tokens if w not in stop_words]

    return " ".join(tokens).strip()
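To make the effect of the cleaning step concrete, here is a dependency-free sketch of the same idea. It is not the notebook's function: it assumes whitespace splitting instead of `word_tokenize` and uses tiny, hypothetical stopword sets (`TOY_STOPWORDS`) in place of the NLTK lists.

```python
import re

# Hypothetical miniature stopword sets; the notebook uses NLTK's full lists
TOY_STOPWORDS = {
    "en": {"to", "the", "is", "a", "this"},
    "de": {"er", "hatte", "nicht", "so", "auf", "mich"},
}


def toy_clean(text: str, lang: str) -> str:
    """Lowercase, strip URLs/mentions/hashtags/punctuation/digits, drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|@\S+|#\S+|[^\w\s]", " ", text)
    text = re.sub(r"\d+", " ", text)
    # Unsupported languages fall back to the English list, as in the notebook
    stop_words = TOY_STOPWORDS.get(lang, TOY_STOPWORDS["en"])
    return " ".join(w for w in text.split() if w not in stop_words)


english = toy_clean("welcome to the thanos world!! #squidgame", "en")
german = toy_clean("glaube er hatte nicht so bock auf mich..💀 #foryou", "de")
unsupported = toy_clean("this is a test", "xx")
```

Note that the fallback to English stopwords means text in an unsupported language is only lightly filtered.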

In [25]:
# 📊 Text statistics
def extract_text_features(text: str) -> dict:
    try:
        text = str(text).strip()
        words = text.split()
        return {
            'char_count': len(text),
            'word_count': len(words),
            'uppercase_count': sum(1 for c in text if c.isupper()),
            'exclamation_count': text.count('!'),
            'question_count': text.count('?'),
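            # Counts every character that is not a word character, whitespace or comma,
            # i.e. remaining punctuation as well as emoji
            'emoji_count': len(re.findall(r'[^\w\s,]', text)),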
            'mention_count': text.count('@'),
            'hashtag_count': text.count('#'),
            'avg_word_length': (sum(len(w) for w in words) / len(words)) if words else 0,
        }
    except Exception as e:
        print(f"⚠️ Error in extract_text_features: {e}")
        return {k: 0 for k in [
            'char_count', 'word_count', 'uppercase_count',
            'exclamation_count', 'question_count', 'emoji_count',
            'mention_count', 'hashtag_count', 'avg_word_length']}

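# 📈 Sentiment analysis
# NOTE: TextBlob's default analyzer targets English, so scores for other languages are unreliable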
def analyze_sentiment(text: str) -> tuple[str, float]:
    try:
        text = str(text).strip()
        if not text:
            return ("neutral", 0.0)
        blob = TextBlob(text)
        polarity = blob.sentiment.polarity
        if polarity > 0.1: return ("positive", polarity)
        elif polarity < -0.1: return ("negative", polarity)
        else: return ("neutral", polarity)
    except Exception as e:
        print(f"⚠️ Error in analyze_sentiment: {e}")
        return ("neutral", 0.0)

# 📦 Feature enrichment for text columns
def add_text_features(df: pd.DataFrame, text_cols: list[str]) -> pd.DataFrame:
    df = df.copy().reset_index(drop=True)

    for col in text_cols:
        df[col] = df[col].fillna("").astype(str)
        df[f"{col}_language"] = df[col].apply(detect_language_robust)
        df[f"{col}_clean"] = df.apply(lambda row: deep_clean_text_with_lang(row[col], row[f"{col}_language"]), axis=1)

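        # Text statistics
        # NOTE: these are computed on the cleaned text, which is already lowercased and
        # stripped of punctuation, mentions and hashtags, so the uppercase, exclamation,
        # question, emoji, mention and hashtag counts are always 0 (see the output below).
        # Passing df[col] instead would count them on the raw text.
        feature_df = df[f"{col}_clean"].apply(extract_text_features).apply(pd.Series)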
        feature_df.columns = [f"{col}_{c}" for c in feature_df.columns]
        df = pd.concat([df, feature_df], axis=1)

        # Sentiment
        sentiment_df = df[f"{col}_clean"].apply(lambda x: pd.Series(analyze_sentiment(x)))
        sentiment_df.columns = [f"{col}_sentiment", f"{col}_sentiment_score"]
        df = pd.concat([df, sentiment_df], axis=1)

    return df

text_cols = ['title', 'text']
df_featured = add_text_features(df_combined, text_cols)

df_featured.head()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score
0,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,58400,451,1576,1300000,1744367290,2025-05-12 13:56:20,https://webapp-va.tiktok.com/6508d64d970e751a2...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0
1,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,261400,13000,9526,12400000,1737365272,2025-05-12 13:56:20,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8
2,tiktok,7461757350492278048,,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,868600,5817,231400,10900000,1737325774,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,great goal andrés,17.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,positive,0.8
3,tiktok,7462429961345879318,,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,682800,6876,165200,10200000,1737482377,2025-05-12 13:56:20,,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,de,glaube bock,11.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,neutral,0.0
4,tiktok,7477658166235286791,,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,15100000,38900,2200000,101900000,1741027969,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0
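The zero-valued count columns in the table above follow from the order of operations: the statistics are taken after cleaning. A minimal sketch, using a hypothetical helper `count_markers` and an invented caption, shows the difference between counting on the raw and on the cleaned text.

```python
def count_markers(text: str) -> dict:
    """Count a few surface markers in a caption (hypothetical helper)."""
    return {
        "uppercase_count": sum(1 for c in text if c.isupper()),
        "exclamation_count": text.count("!"),
        "mention_count": text.count("@"),
        "hashtag_count": text.count("#"),
    }


# Invented example caption and the kind of string the cleaning step leaves behind
raw = "This is a GREAT goal!! @someone #golazo #fyp"
cleaned = "great goal"

raw_counts = count_markers(raw)
clean_counts = count_markers(cleaned)

print(raw_counts)
print(clean_counts)
```

If these markers are meant to carry signal for later modelling, it may be worth computing them on the raw column and keeping the cleaned column only for length statistics and sentiment.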


## Computing the engagement rate

In [26]:
# Compute the engagement rate (interactions relative to plays)
def enrich_data(df, engagement_numerator_cols=None, engagement_denominator_col=None):
    df = df.copy()

    # Compute the engagement rate only if the column does not exist yet
    if "engagement_rate" not in df.columns and engagement_numerator_cols and engagement_denominator_col in df.columns:
        try:
            numerator = df[engagement_numerator_cols].sum(axis=1)
            # Zero plays would cause a division by zero, so treat them as missing
            denominator = df[engagement_denominator_col].replace(0, np.nan)
            df['engagement_rate'] = (numerator / denominator).replace([np.inf, -np.inf], np.nan)
        except Exception as e:
            print(f"⚠️ Engagement rate could not be computed: {e}")
    
    return df

# Enrich with the engagement rate: (likes + comments) / plays
df_enriched = enrich_data(
    df_featured,
    engagement_numerator_cols=['likes', 'comments'],
    engagement_denominator_col='plays'
)

df_enriched.head()



Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate
0,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,58400,451,1576,1300000,1744367290,2025-05-12 13:56:20,https://webapp-va.tiktok.com/6508d64d970e751a2...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.04527
1,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,261400,13000,9526,12400000,1737365272,2025-05-12 13:56:20,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8,0.022129
2,tiktok,7461757350492278048,,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,868600,5817,231400,10900000,1737325774,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,great goal andrés,17.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,positive,0.8,0.080222
3,tiktok,7462429961345879318,,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,682800,6876,165200,10200000,1737482377,2025-05-12 13:56:20,,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,de,glaube bock,11.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,neutral,0.0,0.067615
4,tiktok,7477658166235286791,,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,15100000,38900,2200000,101900000,1741027969,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.148566
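The `engagement_rate` column can be checked by hand. The sketch below is a plain-Python version of the same formula, (likes + comments) / plays, applied to the first two rows shown above plus one invented row with zero plays. The function name is an assumption for illustration; the notebook does this vectorised in pandas.

```python
import math


def engagement_rate(likes: int, comments: int, plays: int) -> float:
    """(likes + comments) / plays, with NaN when there are no plays."""
    if plays == 0:
        return math.nan
    return (likes + comments) / plays


rows = [
    (58_400, 451, 1_300_000),       # row 0 of the table above
    (261_400, 13_000, 12_400_000),  # row 1 of the table above
    (100, 5, 0),                    # invented row: zero plays
]
rates = [engagement_rate(*row) for row in rows]

for rate in rates:
    print(rate)
```

The first two values agree with the 0.04527 and 0.022129 shown in the output. Rows without plays yield NaN here and are filled with the column mean in the normalization step that follows.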


## Normalization

In [27]:
# Scales selected numeric columns to the range 0 to 1
def normalize_metrics(df, columns):
    df = df.copy()
    valid_cols = []

    for col in columns:
        if col in df.columns:
            # Coerce to numeric and treat infinities as missing *before* taking the mean,
            # so that a stray inf cannot turn the fill value into inf as well
            series = pd.to_numeric(df[col], errors='coerce').replace([np.inf, -np.inf], np.nan)
            df[col] = series.fillna(series.mean())
            valid_cols.append(col)
        else:
            print(f"⚠️ Column '{col}' not found, skipping.")

    if not valid_cols:
        print("❌ No valid columns to normalize.")
        return df

    scaler = MinMaxScaler()
    df[valid_cols] = scaler.fit_transform(df[valid_cols])
    return df

# Normalize the engagement metrics
df_normalized = normalize_metrics(df_enriched, ['likes', 'comments', 'shares', 'plays', 'engagement_rate'])
df_normalized.head()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate
0,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,0.001497,0.000989,0.000181,0.003962,1744367290,2025-05-12 13:56:20,https://webapp-va.tiktok.com/6508d64d970e751a2...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.121036
1,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,0.006703,0.028503,0.001095,0.037793,1737365272,2025-05-12 13:56:20,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8,0.059165
2,tiktok,7461757350492278048,,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,0.022272,0.012754,0.026598,0.033222,1737325774,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,great goal andrés,17.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,positive,0.8,0.214484
3,tiktok,7462429961345879318,,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,0.017508,0.015076,0.018989,0.031088,1737482377,2025-05-12 13:56:20,,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,de,glaube bock,11.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,neutral,0.0,0.180779
4,tiktok,7477658166235286791,,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,0.387179,0.085289,0.252874,0.310576,1741027969,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.397213
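As a rough illustration of what this step does to a single column, the sketch below fills missing or infinite values with the mean of the finite values and then applies min-max scaling, x' = (x - min) / (max - min). Both helper names are hypothetical, and returning 0.0 for a constant column is a choice made for this sketch.

```python
import math


def fill_with_finite_mean(values):
    """Replace None/NaN/inf with the mean of the finite values."""
    finite = [v for v in values if v is not None and math.isfinite(v)]
    mean = sum(finite) / len(finite) if finite else 0.0
    return [v if (v is not None and math.isfinite(v)) else mean for v in values]


def min_max_scale(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


filled = fill_with_finite_mean([1.0, float("inf"), 3.0, None])
scaled = min_max_scale([0.0, 5.0, 10.0])
constant = min_max_scale([7.0, 7.0])
```

Because the scaling depends on the column's minimum and maximum, a few viral posts compress all other values towards 0, which is visible in the `likes` and `plays` columns above.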


In [28]:
df_normalized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2205 entries, 0 to 2204
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   source                   2205 non-null   object 
 1   id                       2205 non-null   object 
 2   title                    2205 non-null   object 
 3   text                     2205 non-null   object 
 4   username                 2205 non-null   object 
 5   likes                    2205 non-null   float64
 6   comments                 2205 non-null   float64
 7   shares                   2205 non-null   float64
 8   plays                    2205 non-null   float64
 9   timestamp                2205 non-null   object 
 10  published_at             2205 non-null   object 
 11  url                      1803 non-null   object 
 12  title_language           2205 non-null   object 
 13  title_clean              2205 non-null   object 
 14  title_char_count        

## Additional numeric features

In [29]:
def add_simple_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['published_at'] = pd.to_datetime(df['published_at'], errors='coerce')

    def get_day_period(hour):
        if pd.isna(hour): return None
        if 5 <= hour < 12: return 'morning'
        elif 12 <= hour < 17: return 'afternoon'
        elif 17 <= hour < 21: return 'evening'
        else: return 'night'

    df['hour'] = df['published_at'].apply(lambda x: x.hour if pd.notna(x) else None)
    df['weekday'] = df['published_at'].apply(lambda x: x.day_name() if pd.notna(x) else None)
    df['year'] = df['published_at'].apply(lambda x: x.year if pd.notna(x) else None)
    df['month'] = df['published_at'].apply(lambda x: x.month if pd.notna(x) else None)
    df['day_period'] = df['hour'].apply(get_day_period)
    df['is_weekend'] = df['weekday'].isin(['Saturday', 'Sunday'])

    return df


df_final = add_simple_features(df_normalized)
df_final.head()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_language,text_clean,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate,hour,weekday,year,month,day_period,is_weekend
0,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,0.001497,0.000989,0.000181,0.003962,1744367290,2025-05-12 13:56:20,https://webapp-va.tiktok.com/6508d64d970e751a2...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.121036,13.0,Monday,2025.0,5.0,afternoon,False
1,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,0.006703,0.028503,0.001095,0.037793,1737365272,2025-05-12 13:56:20,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,welcome thanos world,20.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,positive,0.8,0.059165,13.0,Monday,2025.0,5.0,afternoon,False
2,tiktok,7461757350492278048,,This is a great goal 😳😭 @Fabio Andrés #golazo ...,risingballers,0.022272,0.012754,0.026598,0.033222,1737325774,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,en,great goal andrés,17.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,positive,0.8,0.214484,13.0,Monday,2025.0,5.0,afternoon,False
3,tiktok,7462429961345879318,,glaube er hatte nicht so bock auf mich..💀 #for...,allroundii,0.017508,0.015076,0.018989,0.031088,1737482377,2025-05-12 13:56:20,,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,de,glaube bock,11.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,neutral,0.0,0.180779,13.0,Monday,2025.0,5.0,afternoon,False
4,tiktok,7477658166235286791,,#mascotas #humormascotas😂😂 #mascotastiktok #vi...,rokopitbull,0.387179,0.085289,0.252874,0.310576,1741027969,2025-05-12 13:56:20,https://v16-webapp-prime.tiktok.com/video/tos/...,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,unknown,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neutral,0.0,0.397213,13.0,Monday,2025.0,5.0,afternoon,False
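The time-based features can be reproduced for a single timestamp with the standard library. This sketch assumes the `YYYY-MM-DD HH:MM:SS` format seen in `published_at`; `time_features` and `WEEKDAYS` are hypothetical names, while `get_day_period` follows the thresholds from the cell above.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def get_day_period(hour: int) -> str:
    if 5 <= hour < 12:
        return "morning"
    elif 12 <= hour < 17:
        return "afternoon"
    elif 17 <= hour < 21:
        return "evening"
    return "night"


def time_features(ts: str) -> dict:
    """Derive hour, weekday, year, month, day period and weekend flag from a timestamp."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    weekday = WEEKDAYS[dt.weekday()]  # weekday() returns 0 for Monday
    return {
        "hour": dt.hour,
        "weekday": weekday,
        "year": dt.year,
        "month": dt.month,
        "day_period": get_day_period(dt.hour),
        "is_weekend": weekday in ("Saturday", "Sunday"),
    }


features = time_features("2025-05-12 13:56:20")   # value from the table above
weekend = time_features("2025-05-17 22:05:00")    # invented timestamp
```

In the rows shown above, `published_at` is identical down to the minute for all five posts, so the derived features carry no variation there.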


# 8. Saving the cleaned data

In [30]:
PROCESSED_DIR = BASE_DIR / "data" / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

df_final.to_csv(PROCESSED_DIR / "social_media_data.csv", index=False)
print("✅ Data saved successfully.")

✅ Data saved successfully.
