<div style="text-align: center; font-size: 16px;">
    <strong>Course:</strong> Machine Learning Operations |
    <strong>Lecturer:</strong> Prof. Dr. Klotz |
    <strong>Date:</strong> 17.05.2025 |
    <strong>Name:</strong> Sofie Pischl
</div>

# <center>Preprocessing</center>

# Notebook Structure

## 1. Setup and Libraries

This section imports all required libraries and downloads the NLP resources they need (e.g. NLTK models).

In [131]:
import os
import pandas as pd
import numpy as np
import re
from pathlib import Path
from datetime import datetime
from sklearn.preprocessing import MinMaxScaler
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from textblob import TextBlob
import logging

# Logging Setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Download the NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SofiePischl\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 2. Loading the Data

This section reads in the raw data from TikTok, YouTube, and Reddit.


In [132]:


BASE_DIR = Path().resolve().parent
RAW_DIR = (BASE_DIR / "./data/raw").resolve()

data_paths = {
    "tiktok": RAW_DIR / "tiktok_data.csv",
    "youtube": RAW_DIR / "youtube_data.csv",
    "reddit": RAW_DIR / "reddit_data.csv"
}

data = {key: pd.read_csv(path) for key, path in data_paths.items()}

# Show the first rows of each dataset for an overview
for key, df in data.items():
    print(f"📄 {key.upper()} - Vorschau:")
    display(df.head())
    df.info()
    print("\n" + "="*240 + "\n")


📄 TIKTOK - Vorschau:


Unnamed: 0,id,description,author_username,author_id,likes,shares,comments,plays,video_url,created_time,"{""detail"":""Datei nicht gefunden""}7499178369544621334",Unnamed: 1,zah1de_kyc,6836358130437211142,747600,8890,6626,6600000,Unnamed: 8,1746038530
0,7493469801654881542,#vairalvideo_foryoupage #🇦🇫ازبک_تاجک_پشتون_تر...,afgcap.cut,7461541069958153234,15800,451,258,365200,https://v16-webapp-prime.tiktok.com/video/tos/...,1744709406,,,,,,,,,,
1,7489427780397010198,#imapoliceofficer #tensheet #foryou #viral #fy...,backwheelbandit69,7416366442453632032,4300000,303800,11300,41900000,https://v16-webapp-prime.tiktok.com/video/tos/...,1743768297,,,,,,,,,,
2,7492000423641959685,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,7322835376556442629,58400,1576,451,1300000,https://v16-webapp-prime.tiktok.com/video/tos/...,1744367290,,,,,,,,,,
3,7472584144510373125,i think it was a bad idea,maligoshik,7014608336423617542,18000000,1300000,24000,168600000,https://v16-webapp-prime.tiktok.com/video/tos/...,1739846584,,,,,,,,,,
4,7461927005689302280,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,7455509281098515474,261400,9526,13000,12400000,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,1737365272,,,,,,,,,,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5793 entries, 0 to 5792
Data columns (total 20 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   id                                                    3294 non-null   object 
 1   description                                           2996 non-null   object 
 2   author_username                                       3294 non-null   object 
 3   author_id                                             3294 non-null   object 
 4   likes                                                 3294 non-null   object 
 5   shares                                                3294 non-null   object 
 6   comments                                              3294 non-null   object 
 7   plays                                                 3294 non-null   object 
 8   video_url                                             1898

Unnamed: 0,video_id,title,description,channel_title,published_at,view_count,like_count,comment_count,url,scraped_at,trending_date
0,-F33ACcPbhU,Monster Hunter Wilds – Festival of Accord: Blo...,Bask in the springtime aura and enjoy cherry b...,Monster Hunter,2025-04-22T01:00:25Z,195940,9298,619.0,https://www.youtube.com/watch?v=-F33ACcPbhU,2025-04-22 22:06:12.302112,
1,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,2025-05-04T21:00:09Z,1684539,31221,878.0,https://www.youtube.com/watch?v=-H8tvnWaYs4,2025-05-07T12:30:17.866760,2025-05-07
2,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,2025-05-04T21:00:09Z,1582148,30426,867.0,https://www.youtube.com/watch?v=-H8tvnWaYs4,2025-05-06T13:32:11.312387,2025-05-06
3,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,2025-05-04T21:00:09Z,1333458,28389,780.0,https://www.youtube.com/watch?v=-H8tvnWaYs4,2025-05-05T18:10:08.695398,2025-05-05
4,-JFW5V4U6bo,Picks 1-10: Jaguars TRADE UP For Travis Hunter...,"Watch live local and primetime games, NFL RedZ...",NFL,2025-04-25T01:55:00Z,443795,8052,1112.0,https://www.youtube.com/watch?v=-JFW5V4U6bo,2025-04-25 14:45:45.902741,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   video_id       428 non-null    object 
 1   title          428 non-null    object 
 2   description    428 non-null    object 
 3   channel_title  428 non-null    object 
 4   published_at   428 non-null    object 
 5   view_count     428 non-null    int64  
 6   like_count     428 non-null    int64  
 7   comment_count  427 non-null    float64
 8   url            428 non-null    object 
 9   scraped_at     428 non-null    object 
 10  trending_date  231 non-null    object 
dtypes: float64(1), int64(2), object(8)
memory usage: 36.9+ KB


📄 REDDIT - Vorschau:


Unnamed: 0,"{""detail"":""Datei nicht gefunden""}all",Do Republicans Realize It’s Not Just Democrats - The Whole World Looks at Them with Disgust,"Republicans keep acting like this is just a culture war, as if it's about DEI, immigrants, or whatever grievance of the week gets them riled up. But what they’re enabling under Trump isn’t a debate. It’s a global threat, and the rest of the world sees it clearly.\r\n\r\nTrump has insulted allies, threatened to abandon NATO, and pulled the U.S. out of the Paris Climate Accord again. He has imposed tariffs on Canada and Germany out of spite, joked about annexing Canada, and treated diplomacy like a reality show. These are not policies. They are provocations, and they are shaking the global order.\r\n\r\nNow he has pulled back support for Ukraine, unraveling years of unity and leaving Europe to question whether the U.S. can still be trusted.\r\n\r\nRepublicans have already made clear they don't care how this affects people here. But they seem equally indifferent to the fact that it's dragging the rest of the world down with them. Their loyalty to Trump is wrecking alliances, stalling climate action, emboldening dictators, and unraveling decades of shared progress.\r\n\r\nTo much of the world, it looks like insanity - a country sabotaging the very systems it built, while millions cheer it on like a sport. This isn’t just short-sighted. It's a betrayal of everything we once stood for, both at home and abroad. The world is not confused. They're disgusted. And they’re right to be.\r\n\r\n\r\nEdit:\r\n\r\nI just realized every so-called right-wing reply in this sub comes from a negative karma troll account. Seriously check accounts - negative 60, negative 100, every time. Are you guys bots, trolls, or just Republicans who can’t post from a real profile? You need a burner just to spread MAGA filth? 
This is crazy.",7264,1612,2025-05-02 06:11:52,https://www.reddit.com/r/AskUS/comments/1kcu1gh/do_republicans_realize_its_not_just_democrats_the/,2025-05-02 14:01:13.027831,subreddit,title,text,score,comments,created,url,scraped_at
0,all,AITAH for refusing to pay my friend for a cust...,My (28F) friend (30F) is a self-taught baker w...,9027.0,1661.0,2025-05-02 05:23:14,https://www.reddit.com/r/AITAH/comments/1kctb2...,2025-05-02 14:01:13.027831,,,,,,,,
1,,,,,,,,,all,She erased us from her wedding. So I’m erasing...,"When my brother got married, his bride (now my...",4545.0,1044.0,2025-05-03 05:19:53,https://www.reddit.com/r/pettyrevenge/comments...,2025-05-03 08:58:16.134905
2,,,,,,,,,all,UPDATE: AITAH for telling my MIL to stop calli...,I just want to give you an update about by sit...,4137.0,247.0,2025-05-03 01:38:25,https://www.reddit.com/r/AITAH/comments/1kdhk8...,2025-05-03 08:58:16.134905
3,,,,,,,,,all,"Conservatives, if you cared about Hunter Biden...",Republicans claim that foreign businesses and ...,3600.0,777.0,2025-05-03 01:34:29,https://www.reddit.com/r/AskUS/comments/1kdhhk...,2025-05-03 08:58:16.134905
4,,,,,,,,,all,What’s a subtle sign that someone has been thr...,,3355.0,1469.0,2025-05-03 01:57:50,https://www.reddit.com/r/AskReddit/comments/1k...,2025-05-03 08:58:16.134905


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 724 entries, 0 to 723
Data columns (total 16 columns):
 #   Column
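
The TikTok and Reddit previews above show header fields that contain an error payload (`{"detail":"Datei nicht gefunden"}`) instead of a column name. A minimal sketch of a header check that would flag this before any cleaning runs is given below. The helper name `unexpected_columns` and the sample string are illustrative assumptions; the expected column list is the one `clean_reddit_data` selects later in this notebook.

```python
import csv
import io

# Column names taken from relevant_columns in clean_reddit_data
EXPECTED_REDDIT_COLUMNS = [
    "subreddit", "title", "text", "score",
    "comments", "created", "url", "scraped_at",
]

def unexpected_columns(csv_text, expected):
    """Return the header fields of csv_text that are not in the expected list."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in header if name not in expected]

# Made-up sample that mimics the corrupted first header field
sample = '{"detail":"Datei nicht gefunden"}all,title,text\nall,Some title,Some text\n'
print(unexpected_columns(sample, EXPECTED_REDDIT_COLUMNS))
```

A non-empty result suggests the file should be inspected (or re-downloaded) rather than silently cleaned.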

## 3. Text Cleaning and Feature Extraction

Here we define functions for cleaning, lemmatization, stop-word filtering, and the extraction of text features for sentiment analysis and topic modeling.


In [133]:
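# Compiled once at import time instead of on every call
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U00002700-\U000027BF"  # dingbats
    # Enclosed characters, listed explicitly: a single 24C2-1F251 range
    # would also cover unrelated scripts such as CJK text
    "\U000024C2"
    "\U0001F200-\U0001F251"
    "]",
    flags=re.UNICODE,
)


def remove_emojis(text):
    """Strip common emoji code points; non-string input yields an empty string."""
    if not isinstance(text, str):
        return ""
    return EMOJI_PATTERN.sub("", text)
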
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'): return wordnet.ADJ
    elif treebank_tag.startswith('V'): return wordnet.VERB
    elif treebank_tag.startswith('N'): return wordnet.NOUN
    elif treebank_tag.startswith('R'): return wordnet.ADV
    return wordnet.NOUN

def lemmatize_tokens(tokens):
    tagged = pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in tagged]

def preprocess_text(text, remove_stopwords=True):
    """Lowercase, strip URLs, emojis, and punctuation, drop stop words, and lemmatize."""
    if not isinstance(text, str):
        return ""

    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # URLs (http\S+ also covers https)
    text = remove_emojis(text)
    text = re.sub(r'#(\w+)', r'\1', text)       # keep the hashtag word, drop the '#'
    text = re.sub(r'[^\w\s\']', ' ', text)      # punctuation except apostrophes
    text = re.sub(r'\s\'|\'\s', ' ', text)      # apostrophes next to whitespace
    text = ' '.join(text.split())               # collapse whitespace

    tokens = word_tokenize(text)

    if remove_stopwords:
        stop_words = set()
        for lang in ['english', 'german']:
            try:
                stop_words.update(stopwords.words(lang))
            except (LookupError, OSError):
                # Corpus not downloaded, or no word list for this language
                logger.warning(f"Stopwords for {lang} not available")
        # Keep negations and clitics, since they carry sentiment
        important_words = {"n't", "'s", "'m", "'re", "'ve", "'ll", "no", "not"}
        stop_words -= important_words
        tokens = [token for token in tokens if token.lower() not in stop_words]

    return ' '.join(lemmatize_tokens(tokens))

def extract_text_features(text):
    if not isinstance(text, str) or not text.strip():
        return {
            'word_count': 0,
            'char_count': 0,
            'avg_word_length': 0,
            'sentiment_polarity': 0,
            'sentiment_subjectivity': 0
        }
    words = text.split()
    blob = TextBlob(text)
    return {
        'word_count': len(words),
        'char_count': len(text),
        'avg_word_length': sum(len(w) for w in words) / len(words),
        'sentiment_polarity': blob.sentiment.polarity,
        'sentiment_subjectivity': blob.sentiment.subjectivity
    }

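def apply_text_processing(df, col):
    df = df.copy()
    # Fill NaN first: astype(str) alone would turn missing values into the string "nan"
    df[f"{col}_processed"] = df[col].fillna("").astype(str).apply(preprocess_text)
    features = df[f"{col}_processed"].apply(extract_text_features)
    # Reuse df's index so the feature rows line up even if the index is not 0..n-1
    return pd.concat([df, pd.DataFrame(features.tolist(), index=df.index)], axis=1)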
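
To make the regex steps of `preprocess_text` easier to follow, here is a minimal standard-library sketch that applies only those steps to a sample caption. The helper name `basic_clean` and the sample string are illustrative assumptions. Tokenization, stop-word removal, and lemmatization are deliberately left out, so the result is not identical to what `preprocess_text` returns.

```python
import re

def basic_clean(text):
    """Regex-only part of the cleaning: no tokenization, stop words, or lemmatization."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = re.sub(r"#(\w+)", r"\1", text)       # '#viral' -> 'viral'
    text = re.sub(r"[^\w\s']", " ", text)       # punctuation and emojis -> space
    text = re.sub(r"\s'|'\s", " ", text)        # apostrophes next to whitespace
    return " ".join(text.split())               # collapse whitespace

print(basic_clean("Check THIS out!! #Viral https://t.co/abc"))
# prints: check this out viral
```

Apostrophes inside a word (as in "don't") survive these steps, which is what lets the tokenizer produce the `n't` token that the stop-word filter keeps.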


## 4. Cleaning the Platform Data

The platforms structure their content differently, so each source is cleaned separately.


## Reddit

In [134]:
def clean_reddit_data(path: Path) -> pd.DataFrame:
    """
    Load and clean the Reddit data from a CSV file.
    """
    try:
        df = pd.read_csv(
            path,
            encoding='utf-8',
            parse_dates=['created', 'scraped_at'],
            on_bad_lines='skip'  # requires pandas >= 1.3
        )
    except Exception as e:
        logger.error(f"Failed to read the Reddit data: {e}")
        return pd.DataFrame()

    # Keep only the relevant columns. reindex adds any missing column as NaN,
    # so the steps below cannot fail on a missing column.
    relevant_columns = ['subreddit', 'title', 'text', 'score', 'comments', 'created', 'url', 'scraped_at']
    df = df.reindex(columns=relevant_columns)

    # Make sure the date columns are real datetimes
    for date_col in ['created', 'scraped_at']:
        df[date_col] = pd.to_datetime(df[date_col], errors='coerce')

    # Drop rows that lack the essential fields
    df = df.dropna(subset=['subreddit', 'title'])

    # Fall back to created where scraped_at is missing
    df['scraped_at'] = df['scraped_at'].fillna(df['created'])

    # Fill missing post bodies
    df['text'] = df['text'].fillna("")

    # Clean the numeric fields
    for col in ['score', 'comments']:
        df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

    return df

df_reddit = clean_reddit_data(data_paths["reddit"])

print("\nFirst 5 rows of the cleaned Reddit data:")
display(df_reddit.head())

print("\nInformation about the dataset:")
print(df_reddit.info())



First 5 rows of the cleaned Reddit data:


Unnamed: 0,subreddit,title,text,score,comments,created,url,scraped_at
1,all,She erased us from her wedding. So I’m erasing...,"When my brother got married, his bride (now my...",4545,1044,2025-05-03 05:19:53,https://www.reddit.com/r/pettyrevenge/comments...,2025-05-03 08:58:16.134905
2,all,UPDATE: AITAH for telling my MIL to stop calli...,I just want to give you an update about by sit...,4137,247,2025-05-03 01:38:25,https://www.reddit.com/r/AITAH/comments/1kdhk8...,2025-05-03 08:58:16.134905
3,all,"Conservatives, if you cared about Hunter Biden...",Republicans claim that foreign businesses and ...,3600,777,2025-05-03 01:34:29,https://www.reddit.com/r/AskUS/comments/1kdhhk...,2025-05-03 08:58:16.134905
4,all,What’s a subtle sign that someone has been thr...,,3355,1469,2025-05-03 01:57:50,https://www.reddit.com/r/AskReddit/comments/1k...,2025-05-03 08:58:16.134905
5,all,TIFU by trying to flirt with a guy at the gym ...,So this happened yesterday and I’m still cring...,2138,174,2025-05-03 04:05:08,https://www.reddit.com/r/tifu/comments/1kdk6o7...,2025-05-03 08:58:16.134905



Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
Index: 723 entries, 1 to 723
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   subreddit   723 non-null    object        
 1   title       723 non-null    object        
 2   text        723 non-null    object        
 3   score       723 non-null    int64         
 4   comments    723 non-null    int64         
 5   created     723 non-null    datetime64[ns]
 6   url         723 non-null    object        
 7   scraped_at  723 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(4)
memory usage: 50.8+ KB
None


## TikTok

In [135]:
def clean_tiktok_data(path: Path) -> pd.DataFrame:
    """
    Load and clean the TikTok data from a CSV file.
    Converts the numeric columns to integers and the Unix timestamp to datetime.
    """
    try:
        df = pd.read_csv(path)
    except Exception as e:
        logger.error(f"Failed to read the TikTok data: {e}")
        return pd.DataFrame()

    # Select the relevant columns
    relevant_columns = [
        'id', 'description', 'author_username', 'author_id',
        'likes', 'shares', 'comments', 'plays', 'video_url', 'created_time'
    ]
    existing_columns = [col for col in relevant_columns if col in df.columns]
    df = df[existing_columns].copy()

    # Rename columns
    df = df.rename(columns={
        'author_username': 'username',
        'author_id': 'user_id',
        'created_time': 'timestamp'
    })

    # Convert the numeric columns: NaN -> 0, then int
    numeric_cols = ['likes', 'shares', 'comments', 'plays']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

    # Convert the Unix timestamp (seconds) to datetime. The column may be read
    # as strings, so cast it to numeric first.
    if 'timestamp' in df.columns:
        seconds = pd.to_numeric(df['timestamp'], errors='coerce')
        df['timestamp'] = pd.to_datetime(seconds, unit='s', errors='coerce')

    return df


df_tiktok = clean_tiktok_data(data_paths["tiktok"])

print("\nFirst 5 rows:")
display(df_tiktok.head())


print("Dataframe Info:")
print(df_tiktok.info())


First 5 rows:



Unnamed: 0,id,description,username,user_id,likes,shares,comments,plays,video_url,timestamp
0,7493469801654881542,#vairalvideo_foryoupage #🇦🇫ازبک_تاجک_پشتون_تر...,afgcap.cut,7461541069958153234,15800,451,258,365200,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-04-15 09:30:06
1,7489427780397010198,#imapoliceofficer #tensheet #foryou #viral #fy...,backwheelbandit69,7416366442453632032,4300000,303800,11300,41900000,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-04-04 12:04:57
2,7492000423641959685,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,7322835376556442629,58400,1576,451,1300000,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-04-11 10:28:10
3,7472584144510373125,i think it was a bad idea,maligoshik,7014608336423617542,18000000,1300000,24000,168600000,https://v16-webapp-prime.tiktok.com/video/tos/...,2025-02-18 02:43:04
4,7461927005689302280,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,7455509281098515474,261400,9526,13000,12400000,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...,2025-01-20 09:27:52


Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5793 entries, 0 to 5792
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           3294 non-null   object        
 1   description  2996 non-null   object        
 2   username     3294 non-null   object        
 3   user_id      3294 non-null   object        
 4   likes        5793 non-null   int64         
 5   shares       5793 non-null   int64         
 6   comments     5793 non-null   int64         
 7   plays        5793 non-null   int64         
 8   video_url    1898 non-null   object        
 9   timestamp    3385 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(5)
memory usage: 452.7+ KB
None
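
The timestamp conversion above can be checked by hand with the standard library. The sketch below is a minimal illustration, not part of the pipeline: the helper name `epoch_to_utc` is an assumption, and the first input is the `created_time` value of the first TikTok row in the preview.

```python
from datetime import datetime, timezone

def epoch_to_utc(value):
    """Convert Unix seconds (int or integer string) to an aware UTC datetime, else None."""
    try:
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        # Non-numeric, missing, or out-of-range values
        return None

print(epoch_to_utc("1744709406"))  # created_time of the first preview row
print(epoch_to_utc("zah1de_kyc"))  # a non-numeric value, as found in the corrupted header
```

The first call gives 2025-04-15 09:30:06 UTC, which matches the `timestamp` column shown for that row. The second returns `None`, mirroring what `errors='coerce'` does in pandas.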


## YouTube

In [136]:
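def clean_youtube_data(df):
    """Fill missing text fields and coerce the count columns to numeric."""
    df = df.copy()
    for col in ['title', 'description']:
        df[col] = df[col].fillna('')

    # Unparseable counts become NaN, so a column with gaps (comment_count) stays float
    for col in ['view_count', 'like_count', 'comment_count']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df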

df_youtube = clean_youtube_data(data['youtube'])
display(df_youtube.head())
df_youtube.info()

Unnamed: 0,video_id,title,description,channel_title,published_at,view_count,like_count,comment_count,url,scraped_at,trending_date
0,-F33ACcPbhU,Monster Hunter Wilds – Festival of Accord: Blo...,Bask in the springtime aura and enjoy cherry b...,Monster Hunter,2025-04-22T01:00:25Z,195940,9298,619.0,https://www.youtube.com/watch?v=-F33ACcPbhU,2025-04-22 22:06:12.302112,
1,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,2025-05-04T21:00:09Z,1684539,31221,878.0,https://www.youtube.com/watch?v=-H8tvnWaYs4,2025-05-07T12:30:17.866760,2025-05-07
2,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,2025-05-04T21:00:09Z,1582148,30426,867.0,https://www.youtube.com/watch?v=-H8tvnWaYs4,2025-05-06T13:32:11.312387,2025-05-06
3,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,2025-05-04T21:00:09Z,1333458,28389,780.0,https://www.youtube.com/watch?v=-H8tvnWaYs4,2025-05-05T18:10:08.695398,2025-05-05
4,-JFW5V4U6bo,Picks 1-10: Jaguars TRADE UP For Travis Hunter...,"Watch live local and primetime games, NFL RedZ...",NFL,2025-04-25T01:55:00Z,443795,8052,1112.0,https://www.youtube.com/watch?v=-JFW5V4U6bo,2025-04-25 14:45:45.902741,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   video_id       428 non-null    object 
 1   title          428 non-null    object 
 2   description    428 non-null    object 
 3   channel_title  428 non-null    object 
 4   published_at   428 non-null    object 
 5   view_count     428 non-null    int64  
 6   like_count     428 non-null    int64  
 7   comment_count  427 non-null    float64
 8   url            428 non-null    object 
 9   scraped_at     428 non-null    object 
 10  trending_date  231 non-null    object 
dtypes: float64(1), int64(2), object(8)
memory usage: 36.9+ KB
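
Note that `published_at` and `scraped_at` are still plain strings here, and the preview shows three spellings: a space separator, a `T` separator, and a trailing `Z`. A minimal standard-library sketch for bringing them into one form is given below. The helper name `parse_iso_utc` is illustrative, and treating values without a time zone as UTC is an assumption that the data itself does not confirm.

```python
from datetime import datetime, timezone

def parse_iso_utc(value):
    """Parse an ISO 8601 string; values without a time zone are ASSUMED to be UTC."""
    # fromisoformat only accepts a trailing "Z" from Python 3.11 on
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# The three spellings that appear in the YouTube preview
for raw in ["2025-04-22 22:06:12.302112",
            "2025-05-07T12:30:17.866760",
            "2025-04-22T01:00:25Z"]:
    print(parse_iso_utc(raw).isoformat())
```

Parsing the columns before merging would also avoid mixing datetime and string values in the unified `timestamp` column.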


## 5. Merging the Data

In [137]:
# Extracts an identifier from a URL (the part after the last slash)
def extract_id_from_url(url):
    if isinstance(url, str):
        return url.rstrip('/').split('/')[-1]
    return None
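
For Reddit permalinks of the form `.../comments/<id>/<slug>/`, the part after the last slash is the title slug rather than the post ID. If the actual ID is wanted, a sketch along these lines could be used instead. The helper name `reddit_post_id` is an illustrative assumption, and the sample URL is the one visible in the Reddit preview.

```python
import re

# The post ID is the path segment that follows "/comments/"
REDDIT_ID = re.compile(r"/comments/([a-z0-9]+)")

def reddit_post_id(url):
    """Return the post ID from a Reddit permalink, or None if there is none."""
    if not isinstance(url, str):
        return None
    match = REDDIT_ID.search(url)
    return match.group(1) if match else None

url = ("https://www.reddit.com/r/AskUS/comments/1kcu1gh/"
       "do_republicans_realize_its_not_just_democrats_the/")
print(reddit_post_id(url))
# prints: 1kcu1gh
```

Either value works as a key as long as it is used consistently; the slug is simply longer and may be truncated by Reddit.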

In [138]:
# This function brings the three platforms into one common format
def unify_dataframes(df_tiktok, df_youtube, df_reddit):
    # TikTok
    df_tiktok_clean = pd.DataFrame({
        'source': 'tiktok',
        'id': df_tiktok['id'],
        'title': None,
        'text': df_tiktok['description'],
        'username': df_tiktok['username'],
        'likes': df_tiktok['likes'],
        'comments': df_tiktok['comments'],
        'shares': df_tiktok['shares'],
        'plays': df_tiktok['plays'],
        'timestamp': df_tiktok['timestamp'],
        'published_at': None,
        'url': df_tiktok['video_url']
    })

    # YouTube
    df_youtube_clean = pd.DataFrame({
        'source': 'youtube',
        'id': df_youtube['video_id'],
        'title': df_youtube['title'],
        'text': df_youtube['description'],
        'username': df_youtube['channel_title'],
        'likes': df_youtube['like_count'],
        'comments': df_youtube['comment_count'],
        'shares': None,
        'plays': df_youtube['view_count'],
        'timestamp': df_youtube['scraped_at'],
        'published_at': df_youtube['published_at'],
        'url': df_youtube['url']
    })

    # Reddit
    df_reddit_clean = pd.DataFrame({
        'source': 'reddit',
        'id': df_reddit['url'].apply(extract_id_from_url),
        'title': df_reddit['title'],
        'text': df_reddit['text'],
        'username': None,
        'likes': df_reddit['score'],
        'comments': df_reddit['comments'],
        'shares': None,
        'plays': None,
        'timestamp': df_reddit['scraped_at'],
        'published_at': df_reddit['created'],
        'url': df_reddit['url']
    })

    # Combine all platforms into one DataFrame
    return pd.concat([df_tiktok_clean, df_youtube_clean, df_reddit_clean], ignore_index=True)

# Merge the platform data
df_combined = unify_dataframes(df_tiktok, df_youtube, df_reddit)
df_combined.head()


Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url
0,tiktok,7493469801654881542,,#vairalvideo_foryoupage #🇦🇫ازبک_تاجک_پشتون_تر...,afgcap.cut,15800,258.0,451,365200,2025-04-15 09:30:06,,https://v16-webapp-prime.tiktok.com/video/tos/...
1,tiktok,7489427780397010198,,#imapoliceofficer #tensheet #foryou #viral #fy...,backwheelbandit69,4300000,11300.0,303800,41900000,2025-04-04 12:04:57,,https://v16-webapp-prime.tiktok.com/video/tos/...
2,tiktok,7492000423641959685,,#CapCut #قوالب_كاب_كات_جاهزه_للتصميم__🌴♥ #كاب_...,noordeen_cap_cat_0_1,58400,451.0,1576,1300000,2025-04-11 10:28:10,,https://v16-webapp-prime.tiktok.com/video/tos/...
3,tiktok,7472584144510373125,,i think it was a bad idea,maligoshik,18000000,24000.0,1300000,168600000,2025-02-18 02:43:04,,https://v16-webapp-prime.tiktok.com/video/tos/...
4,tiktok,7461927005689302280,,welcome to the thanos world!! #squidgame #squi...,team_thanos_player230,261400,13000.0,9526,12400000,2025-01-20 09:27:52,,https://webapp-sg.tiktok.com/bf63b8aa40b9ff0ca...


## Detect language

In [139]:
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# Fix the seed so that detection results are reproducible
DetectorFactory.seed = 0

def detect_language(text: str) -> str:
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

def preprocess_text_fields(df: pd.DataFrame, text_cols: list[str]) -> pd.DataFrame:
    df = df.copy()

    for col in text_cols:
        # 1. Fill missing values and convert to stripped strings
        df[col] = df[col].fillna("").astype(str).str.strip()

        # 2. Detect the language
        df[f"{col}_language"] = df[col].apply(detect_language)

        # 3. Keep only German or English entries.
        # Note: an empty string is detected as "unknown", so rows without a value
        # in this column are dropped too (e.g. every TikTok row, which has no title).
        # The copy avoids a SettingWithCopyWarning in step 4.
        df = df[df[f"{col}_language"].isin(["en", "de"])].copy()

        # 4. Clean the text (lowercasing, stop-word removal, special characters, etc.)
        df[f"{col}_clean"] = df[col].apply(lambda x: preprocess_text(x, remove_stopwords=True))

    return df

text_cols = ['title', 'text']
df_preprocessed = preprocess_text_fields(df_combined, text_cols)
df_preprocessed.head()


Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,text_language,text_clean
5793,youtube,-F33ACcPbhU,Monster Hunter Wilds – Festival of Accord: Blo...,Bask in the springtime aura and enjoy cherry b...,Monster Hunter,9298,619.0,,195940,2025-04-22 22:06:12.302112,2025-04-22T01:00:25Z,https://www.youtube.com/watch?v=-F33ACcPbhU,en,monster hunter wild festival accord blossomdan...,en,bask springtime aura enjoy cherry blossom seas...
5794,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,31221,878.0,,1684539,2025-05-07T12:30:17.866760,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...
5795,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,30426,867.0,,1582148,2025-05-06T13:32:11.312387,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...
5796,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,28389,780.0,,1333458,2025-05-05T18:10:08.695398,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...
5797,youtube,-JFW5V4U6bo,Picks 1-10: Jaguars TRADE UP For Travis Hunter...,"Watch live local and primetime games, NFL RedZ...",NFL,8052,1112.0,,443795,2025-04-25 14:45:45.902741,2025-04-25T01:55:00Z,https://www.youtube.com/watch?v=-JFW5V4U6bo,en,pick 1 10 jaguar trade travis hunter 2025 nfl ...,en,watch live local primetime game nfl redzone nf...
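
The preview above starts at index 5793, the first YouTube row: no TikTok row passes the filter, because TikTok posts have no title and an empty title is detected as "unknown". If that is not intended, one possible alternative is to ignore empty fields when deciding whether to keep a row. The sketch below illustrates that rule with plain Python. The helper name `keep_row` is an assumption, and the language codes are passed in directly instead of calling `detect_language`.

```python
ALLOWED_LANGUAGES = {"en", "de"}

def keep_row(texts, languages, allowed=ALLOWED_LANGUAGES):
    """Keep a row if it has a non-empty field and all non-empty fields are allowed."""
    checked = [lang for text, lang in zip(texts, languages) if text.strip()]
    return bool(checked) and all(lang in allowed for lang in checked)

# A TikTok-style row: no title, English description
print(keep_row(["", "i think it was a bad idea"], ["unknown", "en"]))
# A row whose text is in a language other than English or German
print(keep_row(["Titel", "texte"], ["de", "fr"]))
```

Under this rule the first row is kept and the second is dropped, whereas the column-by-column filter drops both.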


## 6. Feature Engineering

In [141]:
# Redefines extract_text_features from section 3 with a different set of features
def extract_text_features(text: str) -> dict:
    try:
        text = str(text).strip()
        words = text.split()

        return {
            'char_count': len(text),
            'word_count': len(words),
            'uppercase_count': sum(1 for c in text if c.isupper()),
            'exclamation_count': text.count('!'),
            'question_count': text.count('?'),
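            # Counts every character that is not a word character, whitespace, or comma
            # (so punctuation too), not only emojis
            'emoji_count': len(re.findall(r'[^\w\s,]', text)),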
            'mention_count': text.count('@'),
            'hashtag_count': text.count('#'),
            'avg_word_length': (sum(len(w) for w in words) / len(words)) if words else 0,
        }


    except Exception as e:
        print(f"⚠️ Error in extract_text_features: {e}")
        return {
            'char_count': 0,
            'word_count': 0,
            'uppercase_count': 0,
            'exclamation_count': 0,
            'question_count': 0,
            'emoji_count': 0,
            'mention_count': 0,
            'hashtag_count': 0,
            'avg_word_length': 0,
        }

def analyze_sentiment(text: str) -> tuple[str, float]:
    try:
        text = str(text).strip()
        if not text:
            return ("neutral", 0.0)

        blob = TextBlob(text)
        polarity = blob.sentiment.polarity

        if polarity > 0.1:
            return ("positive", polarity)
        elif polarity < -0.1:
            return ("negative", polarity)
        else:
            return ("neutral", polarity)

    except Exception as e:
        print(f"⚠️ Error in analyze_sentiment: {e}")
        return ("neutral", 0.0)

def add_text_features(df: pd.DataFrame, text_cols: list[str]) -> pd.DataFrame:
    df = df.copy().reset_index(drop=True)  # reset the index so the concats below align

    for col in text_cols:
        clean_col = f"{col}_clean"

        if clean_col not in df.columns:
            print(f"⚠️ Column '{clean_col}' is missing, skipping feature generation for '{col}'")
            continue

        df[clean_col] = df[clean_col].fillna("").astype(str)

        # 1. Extract text statistics.
        # Note: the input is the cleaned text (lowercased, punctuation and '#' removed),
        # so the uppercase, '!', '?', '@', and '#' counts come out as 0.
        features = df[clean_col].apply(extract_text_features)
        feature_df = pd.DataFrame(features.tolist(), index=df.index)
        feature_df.columns = [f"{col}_{c}" for c in feature_df.columns]
        df = pd.concat([df, feature_df], axis=1)

        # 2. Sentiment analysis
        sentiment_df = pd.DataFrame(
            df[clean_col].apply(analyze_sentiment).tolist(),
            index=df.index,
            columns=[f"{col}_sentiment", f"{col}_sentiment_score"],
        )
        df = pd.concat([df, sentiment_df], axis=1)

        # Optional: warn about NaN values
        if feature_df.isna().sum().sum() > 0:
            print(f"⚠️ Warning: the features for '{col}' contain NaN values")

    return df

text_cols = ['title', 'text']
df_feature_engineered_1 = add_text_features(df_preprocessed, text_cols)
df_feature_engineered_1.head()

Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,text_language,text_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score
0,youtube,-F33ACcPbhU,Monster Hunter Wilds – Festival of Accord: Blo...,Bask in the springtime aura and enjoy cherry b...,Monster Hunter,9298,619.0,,195940,2025-04-22 22:06:12.302112,2025-04-22T01:00:25Z,https://www.youtube.com/watch?v=-F33ACcPbhU,en,monster hunter wild festival accord blossomdan...,en,bask springtime aura enjoy cherry blossom seas...,70,9,0,0,0,0,0,0,6.888889,neutral,0.1,835,124,0,0,0,0,0,0,5.741935,positive,0.133117
1,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,31221,878.0,,1684539,2025-05-07T12:30:17.866760,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157
2,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,30426,867.0,,1582148,2025-05-06T13:32:11.312387,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157
3,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,28389,780.0,,1333458,2025-05-05T18:10:08.695398,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157
4,youtube,-JFW5V4U6bo,Picks 1-10: Jaguars TRADE UP For Travis Hunter...,"Watch live local and primetime games, NFL RedZ...",NFL,8052,1112.0,,443795,2025-04-25 14:45:45.902741,2025-04-25T01:55:00Z,https://www.youtube.com/watch?v=-JFW5V4U6bo,en,pick 1 10 jaguar trade travis hunter 2025 nfl ...,en,watch live local primetime game nfl redzone nf...,51,10,0,0,0,0,0,0,4.2,neutral,0.0,214,35,0,0,0,0,0,0,5.142857,neutral,-0.087879
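
The `emoji_count` feature above uses the pattern `[^\w\s,]`, which matches any character that is not a word character, whitespace, or comma. Apostrophes, hyphens, and exclamation marks are therefore counted too. The following is a minimal sketch of a stricter counter, assuming that a handful of Unicode blocks is a good enough approximation of "emoji". The helper names `count_emojis` and `count_non_word_chars` and the chosen ranges are illustrative and not part of the original notebook.

```python
import re

# Assumption: these Unicode blocks approximate "emoji"; the list is not exhaustive.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def count_emojis(text: str) -> int:
    """Count characters that fall into the emoji ranges defined above."""
    return len(EMOJI_PATTERN.findall(str(text)))


def count_non_word_chars(text: str) -> int:
    """Same pattern as the notebook's emoji_count, shown for comparison."""
    return len(re.findall(r"[^\w\s,]", str(text)))


sample = "chelsea 's 3-1 win! 🔥🔥"
print(count_non_word_chars(sample))  # counts ', -, ! and both emojis
print(count_emojis(sample))          # counts only the two emojis
```

Whether the stricter count is preferable depends on the downstream model; the broader pattern can still be useful as a general "symbol density" feature under a more accurate name.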


## Computing the engagement rate

In [142]:
# Compute the engagement rate
def enrich_data(df, engagement_numerator_cols=None, engagement_denominator_col=None):
    df = df.copy()

    # Compute the engagement rate (only if the column does not exist yet)
    if "engagement_rate" not in df.columns and engagement_numerator_cols and engagement_denominator_col in df.columns:
        try:
            numerator = df[engagement_numerator_cols].sum(axis=1)
            # A denominator of 0 becomes NaN to avoid division by zero
            denominator = df[engagement_denominator_col].replace(0, np.nan)
            df['engagement_rate'] = (numerator / denominator).replace([np.inf, -np.inf], np.nan)
        except Exception as e:
            print(f"⚠️ Engagement rate could not be computed: {e}")
    
    return df

# Enrich the data with the engagement rate
df_enriched = enrich_data(
    df_feature_engineered_1,
    engagement_numerator_cols=['likes', 'comments'],
    engagement_denominator_col='plays'
)

df_enriched.head()



Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,text_language,text_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate
0,youtube,-F33ACcPbhU,Monster Hunter Wilds – Festival of Accord: Blo...,Bask in the springtime aura and enjoy cherry b...,Monster Hunter,9298,619.0,,195940,2025-04-22 22:06:12.302112,2025-04-22T01:00:25Z,https://www.youtube.com/watch?v=-F33ACcPbhU,en,monster hunter wild festival accord blossomdan...,en,bask springtime aura enjoy cherry blossom seas...,70,9,0,0,0,0,0,0,6.888889,neutral,0.1,835,124,0,0,0,0,0,0,5.741935,positive,0.133117,0.050612
1,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,31221,878.0,,1684539,2025-05-07T12:30:17.866760,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.019055
2,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,30426,867.0,,1582148,2025-05-06T13:32:11.312387,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.019779
3,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,28389,780.0,,1333458,2025-05-05T18:10:08.695398,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.021875
4,youtube,-JFW5V4U6bo,Picks 1-10: Jaguars TRADE UP For Travis Hunter...,"Watch live local and primetime games, NFL RedZ...",NFL,8052,1112.0,,443795,2025-04-25 14:45:45.902741,2025-04-25T01:55:00Z,https://www.youtube.com/watch?v=-JFW5V4U6bo,en,pick 1 10 jaguar trade travis hunter 2025 nfl ...,en,watch live local primetime game nfl redzone nf...,51,10,0,0,0,0,0,0,4.2,neutral,0.0,214,35,0,0,0,0,0,0,5.142857,neutral,-0.087879,0.020649
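
As a worked example of the calculation in `enrich_data`, the sketch below recomputes the engagement rate, (likes + comments) / plays, for a tiny hand-made frame. The first row uses the values of row 0 from the output above; the other two rows are made up to show how a missing comment count and a play count of zero behave. The name `demo` is illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "likes": [9298, 10, 5],
    "comments": [619.0, np.nan, 1.0],  # second row: comment count missing
    "plays": [195940, 200, 0],         # third row: zero plays
})

# sum(axis=1) skips NaN by default, so a missing comment count is treated as 0
numerator = demo[["likes", "comments"]].sum(axis=1)

# Zero plays become NaN, so the division yields NaN instead of inf
denominator = demo["plays"].replace(0, np.nan)

demo["engagement_rate"] = numerator / denominator
print(demo)
```

The first row reproduces the value 0.050612 shown in the table, the second row evaluates to 0.05, and the third row stays NaN.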


## Normalization

In [143]:
# Normalizes selected numeric columns to the range [0, 1]
def normalize_metrics(df, columns):
    df = df.copy()
    valid_cols = []

    for col in columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
            # Replace infinities first so they cannot distort the mean used for imputation
            df[col] = df[col].replace([np.inf, -np.inf], np.nan)
            df[col] = df[col].fillna(df[col].mean())
            valid_cols.append(col)
        else:
            print(f"⚠️ Column '{col}' not found, skipping.")

    if not valid_cols:
        print("❌ No valid columns to normalize.")
        return df

    scaler = MinMaxScaler()
    df[valid_cols] = scaler.fit_transform(df[valid_cols])
    return df

# Normalize the metrics
df_normalized = normalize_metrics(df_enriched, ['likes', 'comments', 'shares', 'plays', 'engagement_rate'])
df_normalized.head()



Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,text_language,text_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate
0,youtube,-F33ACcPbhU,Monster Hunter Wilds – Festival of Accord: Blo...,Bask in the springtime aura and enjoy cherry b...,Monster Hunter,0.001926,0.001461,,0.002288,2025-04-22 22:06:12.302112,2025-04-22T01:00:25Z,https://www.youtube.com/watch?v=-F33ACcPbhU,en,monster hunter wild festival accord blossomdan...,en,bask springtime aura enjoy cherry blossom seas...,70,9,0,0,0,0,0,0,6.888889,neutral,0.1,835,124,0,0,0,0,0,0,5.741935,positive,0.133117,0.150794
1,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,0.006466,0.002073,,0.021392,2025-05-07T12:30:17.866760,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.055552
2,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,0.006302,0.002047,,0.020078,2025-05-06T13:32:11.312387,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.057737
3,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,0.00588,0.001841,,0.016887,2025-05-05T18:10:08.695398,2025-05-04T21:00:09Z,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.064062
4,youtube,-JFW5V4U6bo,Picks 1-10: Jaguars TRADE UP For Travis Hunter...,"Watch live local and primetime games, NFL RedZ...",NFL,0.001668,0.002625,,0.005469,2025-04-25 14:45:45.902741,2025-04-25T01:55:00Z,https://www.youtube.com/watch?v=-JFW5V4U6bo,en,pick 1 10 jaguar trade travis hunter 2025 nfl ...,en,watch live local primetime game nfl redzone nf...,51,10,0,0,0,0,0,0,4.2,neutral,0.0,214,35,0,0,0,0,0,0,5.142857,neutral,-0.087879,0.060363
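
Min-max scaling maps each value to (x - min) / (max - min). The sketch below applies that formula with plain pandas to show two edge cases that matter for this dataset: infinite values are cleaned before the mean is used for imputation, and a column without any valid value (such as `shares` here) is left untouched instead of being scaled. The function name `min_max_normalize` and the `demo` frame are illustrative and not part of the notebook, which uses scikit-learn's `MinMaxScaler`.

```python
import numpy as np
import pandas as pd


def min_max_normalize(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Scale the given columns to [0, 1]; hypothetical helper for illustration."""
    df = df.copy()
    for col in columns:
        if col not in df.columns:
            continue
        values = pd.to_numeric(df[col], errors="coerce")
        values = values.replace([np.inf, -np.inf], np.nan)
        if values.notna().sum() == 0:
            # Nothing to scale: keep the column as all-NaN
            df[col] = values
            continue
        # Impute with the mean of the cleaned values
        values = values.fillna(values.mean())
        span = values.max() - values.min()
        # A constant column has span 0; map it to 0.0 to avoid dividing by zero
        df[col] = (values - values.min()) / span if span > 0 else 0.0
    return df


demo = pd.DataFrame({
    "likes": [0, 50, np.inf, 100],
    "shares": [np.nan, np.nan, np.nan, np.nan],
})
out = min_max_normalize(demo, ["likes", "shares"])
print(out)
```

In this example the infinite `likes` value is imputed with 50 (the mean of 0, 50, and 100) and scaled to 0.5, while `shares` remains NaN.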


In [144]:
df_normalized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   source                   1007 non-null   object 
 1   id                       1007 non-null   object 
 2   title                    1007 non-null   object 
 3   text                     1007 non-null   object 
 4   username                 372 non-null    object 
 5   likes                    1007 non-null   float64
 6   comments                 1007 non-null   float64
 7   shares                   0 non-null      float64
 8   plays                    1007 non-null   float64
 9   timestamp                1007 non-null   object 
 10  published_at             1007 non-null   object 
 11  url                      1007 non-null   object 
 12  title_language           1007 non-null   object 
 13  title_clean              1007 non-null   object 
 14  text_language           

## Additional numeric features

In [145]:
def add_simple_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['published_at'] = pd.to_datetime(df['published_at'], errors='coerce')

    def get_day_period(hour):
        if pd.isna(hour): return None
        if 5 <= hour < 12: return 'morning'
        elif 12 <= hour < 17: return 'afternoon'
        elif 17 <= hour < 21: return 'evening'
        else: return 'night'

    df['hour'] = df['published_at'].apply(lambda x: x.hour if pd.notna(x) else None)
    df['weekday'] = df['published_at'].apply(lambda x: x.day_name() if pd.notna(x) else None)
    df['year'] = df['published_at'].apply(lambda x: x.year if pd.notna(x) else None)
    df['month'] = df['published_at'].apply(lambda x: x.month if pd.notna(x) else None)
    df['day_period'] = df['hour'].apply(get_day_period)
    df['is_weekend'] = df['weekday'].isin(['Saturday', 'Sunday'])

    return df


df_final = add_simple_features(df_normalized)
df_final.head()



Unnamed: 0,source,id,title,text,username,likes,comments,shares,plays,timestamp,published_at,url,title_language,title_clean,text_language,text_clean,title_char_count,title_word_count,title_uppercase_count,title_exclamation_count,title_question_count,title_emoji_count,title_mention_count,title_hashtag_count,title_avg_word_length,title_sentiment,title_sentiment_score,text_char_count,text_word_count,text_uppercase_count,text_exclamation_count,text_question_count,text_emoji_count,text_mention_count,text_hashtag_count,text_avg_word_length,text_sentiment,text_sentiment_score,engagement_rate,hour,weekday,year,month,day_period,is_weekend
0,youtube,-F33ACcPbhU,Monster Hunter Wilds – Festival of Accord: Blo...,Bask in the springtime aura and enjoy cherry b...,Monster Hunter,0.001926,0.001461,,0.002288,2025-04-22 22:06:12.302112,2025-04-22 01:00:25+00:00,https://www.youtube.com/watch?v=-F33ACcPbhU,en,monster hunter wild festival accord blossomdan...,en,bask springtime aura enjoy cherry blossom seas...,70,9,0,0,0,0,0,0,6.888889,neutral,0.1,835,124,0,0,0,0,0,0,5.741935,positive,0.133117,0.150794,1,Tuesday,2025,4,night,False
1,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,0.006466,0.002073,,0.021392,2025-05-07T12:30:17.866760,2025-05-04 21:00:09+00:00,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.055552,21,Sunday,2025,5,night,True
2,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,0.006302,0.002047,,0.020078,2025-05-06T13:32:11.312387,2025-05-04 21:00:09+00:00,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.057737,21,Sunday,2025,5,night,True
3,youtube,-H8tvnWaYs4,Chelsea 3-1 Liverpool | HIGHLIGHTS | Premier L...,PL Matchday 35 - Highlights of Chelsea's 3-1 P...,Chelsea Football Club,0.00588,0.001841,,0.016887,2025-05-05T18:10:08.695398,2025-05-04 21:00:09+00:00,https://www.youtube.com/watch?v=-H8tvnWaYs4,en,chelsea 3 1 liverpool highlight premier league...,en,pl matchday 35 highlight chelsea 's 3 1 premie...,54,9,0,0,0,0,0,0,5.111111,neutral,0.0,1551,237,0,0,0,3,0,0,5.548523,positive,0.218157,0.064062,21,Sunday,2025,5,night,True
4,youtube,-JFW5V4U6bo,Picks 1-10: Jaguars TRADE UP For Travis Hunter...,"Watch live local and primetime games, NFL RedZ...",NFL,0.001668,0.002625,,0.005469,2025-04-25 14:45:45.902741,2025-04-25 01:55:00+00:00,https://www.youtube.com/watch?v=-JFW5V4U6bo,en,pick 1 10 jaguar trade travis hunter 2025 nfl ...,en,watch live local primetime game nfl redzone nf...,51,10,0,0,0,0,0,0,4.2,neutral,0.0,214,35,0,0,0,0,0,0,5.142857,neutral,-0.087879,0.060363,1,Friday,2025,4,night,False
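
The time features above are derived row by row with `apply`. As an alternative, the following sketch derives the same kind of features with the vectorized `.dt` accessor. It assumes the timestamps are parsed as UTC; the `day_period` helper mirrors the hour boundaries of `get_day_period`, and the sample timestamps (apart from the first two, taken from the output above) are made up.

```python
import pandas as pd

raw = pd.Series([
    "2025-04-22T01:00:25Z",
    "2025-05-04T21:00:09Z",
    "2025-05-05T09:30:00Z",
    "not a date",  # becomes NaT because of errors="coerce"
])
published = pd.to_datetime(raw, errors="coerce", utc=True)


def day_period(hour):
    """Same boundaries as get_day_period in the notebook."""
    if pd.isna(hour):
        return None
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


features = pd.DataFrame({
    "hour": published.dt.hour,             # float column, NaN for unparseable dates
    "weekday": published.dt.day_name(),
    "is_weekend": published.dt.dayofweek >= 5,  # Saturday=5, Sunday=6
})
features["day_period"] = features["hour"].map(day_period)
print(features)
```

One difference to keep in mind: with missing timestamps the `.dt` accessor yields NaN in a float column, whereas the `apply` version yields `None` in an object column.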


# 8. Saving the cleaned data

In [146]:
PROCESSED_DIR = BASE_DIR / "./data/processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

df_final.to_csv(PROCESSED_DIR / "social_media_data.csv", index=False)
print("✅ Data saved successfully.")

✅ Data saved successfully.
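
CSV does not store column types, so a timestamp column such as `published_at` comes back as plain text unless it is parsed again on load. The sketch below illustrates this with a temporary file; the file name `demo.csv` and the `demo` frame are illustrative and unrelated to the notebook's actual output file.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({
    "published_at": pd.to_datetime(["2025-04-22T01:00:25Z"], utc=True),
    "is_weekend": [False],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.csv"
    demo.to_csv(path, index=False)

    # Without parse_dates the timestamp column is read back as text
    plain = pd.read_csv(path)

    # With parse_dates the column is converted back to datetimes
    parsed = pd.read_csv(path, parse_dates=["published_at"])

print(plain.dtypes)
print(parsed.dtypes)
```

If preserving dtypes matters for later pipeline steps, a typed format such as Parquet could be considered instead, at the cost of an extra dependency.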
