<div style="text-align: center; font-size: 16px;">
    <strong>Course:</strong> Machine Learning Operations |
    <strong>Lecturer:</strong> Prof. Dr. Klotz |
    <strong>Date:</strong> 17.05.2025 |
    <strong>Name:</strong> Sofie Pischl
</div>

# <center>Data Collection </center>

Konzept & Inhalt:

Daten von den gr√∂√üten Social media Apps sollen abgegriffen werden. besonderer Fokus auf Texten.

1. Setup & Imports
2. Reddit: Hot Posts aus Subreddits
3. Instagram: Top Posts per Scraping/API (light)
4. Twitter: Aktuelle Tweets via snscrape oder Tweepy
5. TikTok: Trending Videos
6. YouTube: Trending Videos (API/Scraping)
7. Fazit & Learnings

----
# 1. Setup & Imports

Zun√§chst werden alle ben√∂tigten Libraries importiert:
- `praw` f√ºr den Reddit-Zugriff
- `pandas` f√ºr Datenverarbeitung
- `datetime` f√ºr Timestamps
- `dotenv` f√ºr Umgebungsvariablen
- `pathlib` f√ºr saubere Pfadangaben
- `logging` f√ºr Fehlerprotokollierung

In [1]:
import os
import praw
import pandas as pd
from datetime import datetime
from dotenv import load_dotenv
from pathlib import Path
import logging

# Load from .env file
load_dotenv()

True

# 2. Reddit

### Funktionen:
- Authentifizierung √ºber OAuth2 via `praw`
- Abruf der `hot`-Beitr√§ge aus ausgew√§hlten Subreddits
- Speicherung als `.csv` unter `/data/raw/reddit_data.csv`
- Fehler-Handling und Logging integriert

**Authentifizierung**

Zur Authentifizierung an der Reddit-API wird ein Reddit-Objekt der Bibliothek praw initialisiert. Die ben√∂tigten Zugangsdaten ‚Äì client_id, client_secret und user_agent ‚Äì werden aus einer .env-Datei geladen, um die Trennung von Code und Konfiguration zu gew√§hrleisten und Sicherheitsrisiken zu minimieren.

Diese Parameter dienen der eindeutigen Identifikation der Anwendung gegen√ºber der API und sind notwendig, um Zugriff auf Reddit-Inhalte zu erhalten. Der user_agent erm√∂glicht zudem die R√ºckverfolgbarkeit von API-Anfragen seitens Reddit. Ohne diese Authentifizierung ist ein reguliertes, automatisiertes Scraping nicht zul√§ssig.

In [2]:
reddit = praw.Reddit(
    client_id=(os.getenv("REDDIT_ID")),
    client_secret=(os.getenv("REDDIT_SECRET")),
    user_agent=(os.getenv("USER_AGENT"))
)

**Logging-Konfiguration**

Bevor das Reddit-Scraping startet, wird ein Logging-System eingerichtet. Dazu wird zun√§chst ein Pfad zur Log-Datei definiert ‚Äì in diesem Fall `logs/reddit.log`. Falls das Verzeichnis `logs/` noch nicht existiert, wird es automatisch erstellt. Anschlie√üend wird das Python-Logging so konfiguriert, dass alle Log-Meldungen in diese Datei geschrieben werden.

Die Konfiguration legt fest, dass nur Meldungen ab dem Schweregrad `INFO` gespeichert werden. Au√üerdem wird das Format der Eintr√§ge so definiert, dass jeder Log-Eintrag einen Zeitstempel, den Log-Level (wie `INFO` oder `ERROR`) sowie die eigentliche Nachricht enth√§lt. So entsteht eine nachvollziehbare Chronik √ºber den Ablauf und m√∂gliche Fehler des Scripts.

Ein typischer Eintrag k√∂nnte zum Beispiel so aussehen:

```
2025-04-19 14:33:07,512 - INFO - Starte Reddit-Scraping...
```

In [4]:
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

# Logging einrichten
log_path = Path("../logs/reddit.log")
log_path.parent.mkdir(parents=True, exist_ok=True)
logging.basicConfig(
    filename=log_path,
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
print(f" Logging aktiv unter: {log_path.resolve()}")

 Logging aktiv unter: C:\Users\SofiePischl\Documents\01_HdM\10_ML_OPS\TrendAnalyse Social Media\logs\reddit.log


**Reddit-Datensammlung mittels Python und PRAW**

Die Funktion `scrape_reddit()` dient der systematischen Erhebung textbasierter Inhalte aus der Social-Media-Plattform **Reddit**. Ziel ist es, strukturierte Daten zur Analyse von Trendthemen zu generieren. Zur Umsetzung wird die Bibliothek `praw` (Python Reddit API Wrapper) verwendet, die eine komfortable Schnittstelle zur Reddit-API bereitstellt.

Nach der Initialisierung der Protokollierung via `logging` erfolgt die Authentifizierung an der Reddit-API. Hierf√ºr wird ein `Reddit`-Objekt instanziiert, wobei sensible Zugangsdaten wie `client_id`, `client_secret` und `user_agent` aus einer `.env`-Datei geladen werden. Dieses Vorgehen erm√∂glicht eine sichere Trennung von Konfiguration und Codebasis und sch√ºtzt vor dem unbeabsichtigten Leaken von API-Schl√ºsseln.

Im Anschluss wird eine Liste von Subreddits definiert, die sowohl popul√§re als auch als ‚Äûtrending‚Äú markierte Communities umfasst. Aus diesen Subreddits werden jeweils bis zu 100 Beitr√§ge aus dem Hot-Feed abgerufen. Dieses Verfahren stellt sicher, dass aktuelle, stark diskutierte Inhalte gesammelt werden, die ein hohes Relevanzpotenzial f√ºr Trendanalysen aufweisen.

Die Datenextraktion erfolgt √ºber eine doppelte Schleife: F√ºr jedes Subreddit werden Hot-Beitr√§ge iteriert, wobei ausschlie√ülich ‚Äûself-posts‚Äú ber√ºcksichtigt werden. Diese beinhalten keine externen Links und erm√∂glichen dadurch eine fokussierte Analyse des vom Nutzer selbst verfassten Textinhalts. Pro Beitrag werden zentrale Metriken wie Titel, Text, Anzahl der Kommentare, Upvotes, Erstellungszeitpunkt sowie die URL gespeichert. Zur besseren zeitlichen Einordnung wird au√üerdem ein einheitlicher Zeitstempel f√ºr alle Eintr√§ge vergeben.

Die gesammelten Beitr√§ge werden in einem `pandas.DataFrame` strukturiert und anschlie√üend unter `data/raw/reddit_data.csv` abgespeichert. Dabei wird sichergestellt, dass ben√∂tigte Verzeichnisse automatisch erstellt werden. Falls bereits Daten vorhanden sind, werden die neuen Eintr√§ge angeh√§ngt und anschlie√üend Duplikate basierend auf Titel, Textinhalt und Subreddit entfernt. Die finale Version wird ohne Index in die CSV-Datei geschrieben.

Abschlie√üend wird die Anzahl der gespeicherten Beitr√§ge im Logfile vermerkt. Etwaige Fehler werden w√§hrend der Ausf√ºhrung abgefangen und entsprechend protokolliert. Die Funktion kann sowohl als Modul importiert als auch direkt per Skriptausf√ºhrung genutzt werden.


In [5]:
def scrape_reddit():
    try:
        logging.info("Starte Reddit-Scraping...")

        subreddits = ["all", "popular", "trendingreddits", "trendingsubreddits"]
        post_limit = 100
        all_posts = []
        scrape_time = datetime.now()

        for sub in subreddits:
            subreddit = reddit.subreddit(sub)
            for post in subreddit.hot(limit=post_limit):
                if post.is_self:
                    all_posts.append({
                        "subreddit": sub,
                        "title": post.title,
                        "text": post.selftext,
                        "score": post.score,
                        "comments": post.num_comments,
                        "created": datetime.fromtimestamp(post.created),
                        "url": post.url,
                        "scraped_at": scrape_time
                    })

        df = pd.DataFrame(all_posts)
        csv_path = Path("../app/data/raw/reddit_data.csv")
        csv_path.parent.mkdir(parents=True, exist_ok=True)

        if csv_path.exists():
            df_existing = pd.read_csv(csv_path)
            df = pd.concat([df_existing, df], ignore_index=True)

        df.drop_duplicates(subset=["title", "text", "subreddit"], inplace=True)
        df.to_csv(csv_path, index=False)
        print(f" Gespeichert unter: {csv_path.resolve()}")

        logging.info(f"{len(df)} Eintr√§ge gespeichert unter {csv_path}")

    except Exception as e:
        logging.error(f"Fehler beim Reddit-Scraping: {e}")

# Ausf√ºhrung bei direktem Aufruf
if __name__ == "__main__":
    scrape_reddit()

 Gespeichert unter: C:\Users\SofiePischl\Documents\01_HdM\10_ML_OPS\TrendAnalyse Social Media\app\data\raw\reddit_data.csv


In [None]:
# Zeuge neuste Eintr√§ge
df = pd.read_csv("../app/data/raw/reddit_data.csv")
df.tail()

Unnamed: 0,subreddit,title,text,score,comments,created,url,scraped_at
162,popular,Is anyone else getting irritated with the new ...,"I get it, it‚Äôs something I can put in my prefe...",2474,643,2025-04-19 01:53:11,https://www.reddit.com/r/ChatGPT/comments/1k2j...,2025-04-19 14:20:21.514955
163,popular,What is the first thing you‚Äôd buy if you get f...,,5116,7975,2025-04-18 21:51:33,https://www.reddit.com/r/AskReddit/comments/1k...,2025-04-19 14:23:21.324902
164,popular,"Under current law, the Social Security payroll...",,9256,1062,2025-04-18 18:47:13,https://www.reddit.com/r/SocialSecurity/commen...,2025-04-19 14:23:21.324902
165,trendingreddits,We are open again!,,4,2,2024-09-27 20:14:11,https://www.reddit.com/r/TrendingReddits/comme...,2025-04-19 14:23:21.324902
166,trendingreddits,Hi,,1,4,2025-04-02 21:13:09,https://www.reddit.com/r/TrendingReddits/comme...,2025-04-19 14:23:21.324902


In [8]:
len(df)

167

---

# 2. Instagram

was nicht funktioniert hat: instaloader, playwright

step 1: log in & extract urls from for you page

step 2: Look up urls without beeing signed in and extract captions & content

setp 1: 

In [3]:
import time
import re
import os
import pandas as pd
from selenium import webdriver
from dotenv import load_dotenv
load_dotenv()
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

In [15]:
# log in data
USERNAME =  os.getenv("INSTA_USERNAME")
PASSWORD =  os.getenv("INSTA_PASSWORD")

In [24]:
# Start browser and login
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
wait = WebDriverWait(driver, 15)

driver.get("https://www.instagram.com/accounts/login/")
time.sleep(3)

# Accept cookies (if shown)
try:
    decline_button = wait.until(EC.element_to_be_clickable(
        (By.XPATH, '//button[contains(text(), "Nur essentielle Cookies erlauben") or contains(text(), "Decline optional cookies")]')))
    decline_button.click()
except:
    pass

# Fill in login form
wait.until(EC.presence_of_element_located((By.NAME, 'username')))
driver.find_element(By.NAME, 'username').send_keys(USERNAME)
driver.find_element(By.NAME, 'password').send_keys(PASSWORD)
driver.find_element(By.NAME, 'password').send_keys(Keys.RETURN)

time.sleep(5)

# Go to Explore page
driver.get("https://www.instagram.com/explore/")
time.sleep(5)

# Scroll and collect post URLs
post_urls = set()
scrolls = 0
while len(post_urls) < 20 and scrolls < 10:
    elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/p/')]")
    for elem in elements:
        href = elem.get_attribute("href")
        if href and href.startswith("https://www.instagram.com/p/"):
            post_urls.add(href)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    scrolls += 1

# Save URLs to CSV
df = pd.DataFrame({"Post URL": list(post_urls)})
df.to_csv("../data/raw/explore_results_3.csv", index=False)
print(f"‚úÖ Saved {len(post_urls)} URLs to 'explore_results.csv'.")

driver.quit()

‚úÖ Saved 5 URLs to 'explore_results.csv'.


In [5]:
import traceback

In [5]:
# === SETUP ===
csv_path = "../data/raw/explore_results_3.csv"
df = pd.read_csv(csv_path, dtype="string")

# Sicherstellen, dass alle Zielspalten existieren
for col in ["timestamp", "datum", "inhalt", "username", "caption", "likes"]:
    if col not in df.columns:
        df[col] = ""

# Browser starten
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
wait = WebDriverWait(driver, 15)


In [6]:
url = "https://www.instagram.com/p/DICWDtYNq7E/"
driver.get(url)

In [7]:
dismiss_popups()

In [8]:
dismiss_popups()

In [None]:
driver.get(url)
        print("Looked up url")
        dismiss_popups()
        time.sleep(1)
        dismiss_popups()
        print("Dismissed pop ups")

In [37]:
username_elem = driver.find_element(By.XPATH, "//a[contains(@href, '/') and contains(@class, 'notranslate')]//span[1]")
print(username_elem)

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[contains(@href, '/') and contains(@class, 'notranslate')]//span[1]"}
  (Session info: chrome=135.0.7049.85); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
	GetHandleVerifier [0x010280E3+60707]
	GetHandleVerifier [0x01028124+60772]
	(No symbol) [0x00E50683]
	(No symbol) [0x00E98660]
	(No symbol) [0x00E989FB]
	(No symbol) [0x00EE1022]
	(No symbol) [0x00EBD094]
	(No symbol) [0x00EDE824]
	(No symbol) [0x00EBCE46]
	(No symbol) [0x00E8C5D3]
	(No symbol) [0x00E8D424]
	GetHandleVerifier [0x0126BBC3+2435075]
	GetHandleVerifier [0x01267163+2416035]
	GetHandleVerifier [0x0128350C+2531660]
	GetHandleVerifier [0x0103F1B5+155125]
	GetHandleVerifier [0x01045B5D+182173]
	GetHandleVerifier [0x0102F9B8+91640]
	GetHandleVerifier [0x0102FB60+92064]
	GetHandleVerifier [0x0101A620+4704]
	BaseThreadInitThunk [0x761F5D49+25]
	RtlInitializeExceptionChain [0x77D1CE3B+107]
	RtlGetAppContainerNamedObjectPath [0x77D1CDC1+561]


In [11]:
username_elem.text.strip()

'_fida.n.zati_'

In [12]:
caption_elem = driver.find_element(By.XPATH, "(//div[contains(@class, '_a9zs')]/span)[2]")
caption_elem.text.strip()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"(//div[contains(@class, '_a9zs')]/span)[2]"}
  (Session info: chrome=135.0.7049.85); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
	GetHandleVerifier [0x010280E3+60707]
	GetHandleVerifier [0x01028124+60772]
	(No symbol) [0x00E50683]
	(No symbol) [0x00E98660]
	(No symbol) [0x00E989FB]
	(No symbol) [0x00EE1022]
	(No symbol) [0x00EBD094]
	(No symbol) [0x00EDE824]
	(No symbol) [0x00EBCE46]
	(No symbol) [0x00E8C5D3]
	(No symbol) [0x00E8D424]
	GetHandleVerifier [0x0126BBC3+2435075]
	GetHandleVerifier [0x01267163+2416035]
	GetHandleVerifier [0x0128350C+2531660]
	GetHandleVerifier [0x0103F1B5+155125]
	GetHandleVerifier [0x01045B5D+182173]
	GetHandleVerifier [0x0102F9B8+91640]
	GetHandleVerifier [0x0102FB60+92064]
	GetHandleVerifier [0x0101A620+4704]
	BaseThreadInitThunk [0x761F5D49+25]
	RtlInitializeExceptionChain [0x77D1CE3B+107]
	RtlGetAppContainerNamedObjectPath [0x77D1CDC1+561]


In [15]:
like_spans = driver.find_elements(By.XPATH, "//section//span[contains(text(), 'likes')]")

In [16]:
like_spans

[<selenium.webdriver.remote.webelement.WebElement (session="2794535543ca7a8f5811614e0ca17064", element="f.FA5CB9931F1124A41F7FC0F4B96F38C7.d.9EF78FA623897B6F8821F18CB253A2F5.e.1324")>]

In [17]:
like_spans[-1].text

'251,547 likes'

In [20]:
re.findall(r"\d[\d\.\,]*", like_spans[-1].text)

['251,547']

In [None]:

                    like_text = like_spans[-1].text
                    like_num = re.findall(r"\d[\d\.\,]*", like_text)

In [None]:
ike_spans = driver.find_elements(By.XPATH, "//section//span[contains(text(), 'Gef√§llt')]")
                if like_spans:
                    like_text = like_spans[-1].text
                    like_num = re.findall(r"\d[\d\.\,]*", like_text)

In [22]:
date_elem = driver.find_element(By.XPATH, "//time")

In [23]:
date_elem.get_attribute("datetime")[:10]

'2025-04-04'

In [32]:
# === SETUP ===
csv_path = "../data/raw/explore_results_3.csv"
df = pd.read_csv(csv_path, dtype="string")

# Sicherstellen, dass alle Zielspalten existieren
for col in ["timestamp", "datum", "inhalt", "username", "caption", "likes"]:
    if col not in df.columns:
        df[col] = ""

# Browser starten
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
wait = WebDriverWait(driver, 15)

In [25]:
# === Popups abfangen ===
def dismiss_popups():
    xpaths = [
        '//button[contains(text(), "Nur essentielle Cookies erlauben")]',
        '//button[contains(text(), "Decline optional cookies")]',
        '//div[@role="dialog"]//div[@aria-label="Schlie√üen"]',
        '//div[@role="dialog"]//div[@aria-label="Close"]',
         "//div[@role='dialog']//button",
        '//button[@aria-label="Schlie√üen"]',
    ]
    for xpath in xpaths:
        try:
            btn = WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.XPATH, xpath)))
            btn.click()
            time.sleep(1)
        except:
            continue

In [39]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Schritt 1: Seite aufrufen
url = 'https://www.instagram.com/p/DICWDtYNq7E/'
driver.get(url)

# Schritt 2: Warten, bis der Beitrag (Artikel) geladen ist
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "article"))
    )
    print("üìÑ Beitrag geladen.")
except:
    print("‚ùå Beitrag nicht gefunden.")

# Schritt 3: Versuche, das Login-Popup zu schlie√üen
try:
    close_btn = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, "//div[@role='dialog']//button"))
    )
    close_btn.click()
    print("‚ùé Login-Popup geschlossen.")
except:
    print("‚úÖ Kein Popup oder konnte nicht geklickt werden.")

# Schritt 4: Jetzt warte, bis Username sichtbar ist
try:
    username_elem = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located(By.XPATH, "//a[contains(@href, '/') and contains(@class, 'notranslate')]//span[1]")
    )
    username = username_elem.text.strip()
    print("üë§ Username:", username)
except:
    print("‚ùå Username nicht gefunden.")


üìÑ Beitrag geladen.
‚úÖ Kein Popup oder konnte nicht geklickt werden.
‚ùå Username nicht gefunden.


In [41]:
username_elem = driver.find_element(By.XPATH, "//a[contains(@href, '/') and contains(@class, 'notranslate')]//span[1]")
print(username_elem)

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[contains(@href, '/') and contains(@class, 'notranslate')]//span[1]"}
  (Session info: chrome=135.0.7049.85); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
	GetHandleVerifier [0x010280E3+60707]
	GetHandleVerifier [0x01028124+60772]
	(No symbol) [0x00E50683]
	(No symbol) [0x00E98660]
	(No symbol) [0x00E989FB]
	(No symbol) [0x00EE1022]
	(No symbol) [0x00EBD094]
	(No symbol) [0x00EDE824]
	(No symbol) [0x00EBCE46]
	(No symbol) [0x00E8C5D3]
	(No symbol) [0x00E8D424]
	GetHandleVerifier [0x0126BBC3+2435075]
	GetHandleVerifier [0x01267163+2416035]
	GetHandleVerifier [0x0128350C+2531660]
	GetHandleVerifier [0x0103F1B5+155125]
	GetHandleVerifier [0x01045B5D+182173]
	GetHandleVerifier [0x0102F9B8+91640]
	GetHandleVerifier [0x0102FB60+92064]
	GetHandleVerifier [0x0101A620+4704]
	BaseThreadInitThunk [0x761F5D49+25]
	RtlInitializeExceptionChain [0x77D1CE3B+107]
	RtlGetAppContainerNamedObjectPath [0x77D1CDC1+561]


In [27]:
# === Hauptloop ===
for i, row in df.iterrows():
    url = row['Post URL']
    if not isinstance(url, str) or not url.startswith("http"):
        continue

    try:
        print(f"\n Lade Beitrag: {url}")
        driver.get(url)
        print("Looked up url")
        dismiss_popups()
        time.sleep(1)
        dismiss_popups()
        print("Dismissed pop ups")


        # Zeitstempel immer setzen
        df.at[i, "timestamp"] = datetime.now().isoformat()

        # üë§ Username (erster Treffer)
        if not pd.notna(row.get("username")) or row.get("username") == "":
            print("Entering try loop username")
            try:
                print("Looking for username")
                time.sleep(2)
                username_elem = driver.find_element(By.XPATH, "//a[contains(@href, '/') and contains(@class, 'notranslate')]//span[1]")
                print(username_elem)
                #username_elem = driver.find_element(By.XPATH, "(//a[contains(@href, '/') and contains(@class, 'notranslate')]//span)[1]")
                df.at[i, "username"] = username_elem.text.strip()
                print("üë§ Username:", df.at[i, "username"])
                print(username_elem)
            except:
                print("‚ùå Username nicht gefunden.")

    except:
        print("m")


 Lade Beitrag: https://www.instagram.com/p/DICWDtYNq7E/
Looked up url
Dismissed pop ups
Entering try loop username
Looking for username
‚ùå Username nicht gefunden.

 Lade Beitrag: https://www.instagram.com/p/DHLad6VIRES/
Looked up url
Dismissed pop ups
Entering try loop username
Looking for username
‚ùå Username nicht gefunden.

 Lade Beitrag: https://www.instagram.com/p/DIOqCE7TKSu/
m

 Lade Beitrag: https://www.instagram.com/p/DHQFK3CMfRe/
m

 Lade Beitrag: https://www.instagram.com/p/DHWe5czOGta/
m


In [4]:
# === SETUP ===
csv_path = "../data/raw/explore_results_3.csv"
df = pd.read_csv(csv_path, dtype="string")

# Sicherstellen, dass alle Zielspalten existieren
for col in ["timestamp", "datum", "inhalt", "username", "caption", "likes"]:
    if col not in df.columns:
        df[col] = ""

# Browser starten
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
wait = WebDriverWait(driver, 15)

# === Popups abfangen ===
def dismiss_popups():
    xpaths = [
        '//button[contains(text(), "Nur essentielle Cookies erlauben")]',
        '//button[contains(text(), "Decline optional cookies")]',
        '//div[@role="dialog"]//div[@aria-label="Schlie√üen"]',
        '//div[@role="dialog"]//div[@aria-label="Close"]',
         "//div[@role='dialog']//button",
        '//button[@aria-label="Schlie√üen"]',
    ]
    for xpath in xpaths:
        try:
            btn = WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.XPATH, xpath)))
            btn.click()
            time.sleep(1)
        except:
            continue

# === Hauptloop ===
for i, row in df.iterrows():
    url = row['Post URL']
    if not isinstance(url, str) or not url.startswith("http"):
        continue

    try:
        print(f"\n Lade Beitrag: {url}")
        driver.get(url)
        print("Looked up url")
        dismiss_popups()
        time.sleep(1)
        dismiss_popups()
        print("Dismissed pop ups")


        # Zeitstempel immer setzen
        df.at[i, "timestamp"] = datetime.now().isoformat()

        # üë§ Username (erster Treffer)
        if not pd.notna(row.get("username")) or row.get("username") == "":
            print("Entering try loop username")
            try:
                time.sleep(2)
                print("Looking for username")
                username_elem = driver.find_element(By.XPATH, "//a[contains(@href, '/') and contains(@class, 'notranslate')]//span[1]")
                print(username_elem)
                #username_elem = driver.find_element(By.XPATH, "(//a[contains(@href, '/') and contains(@class, 'notranslate')]//span)[1]")
                df.at[i, "username"] = username_elem.text.strip()
                print("üë§ Username:", df.at[i, "username"])
                print(username_elem)
            except:
                print("‚ùå Username nicht gefunden.")

        # üìù Caption (zweiter Treffer)
        if not pd.notna(row.get("caption")) or row.get("caption") == "":
            try:
                caption_elem = driver.find_element(By.XPATH, "(//div[contains(@class, '_a9zs')]/span)[2]")
                df.at[i, "caption"] = caption_elem.text.strip()
                print("üìù Caption:", df.at[i, "caption"])
            except:
                print("Caption not found")

        # ‚ù§Ô∏è Likes (letzter Treffer)
        if not pd.notna(row.get("likes")) or row.get("likes") == "":
            try:
                like_spans = driver.find_elements(By.XPATH, "//section//span[contains(text(), 'Gef√§llt')]")
                if like_spans:
                    like_text = like_spans[-1].text
                    like_num = re.findall(r"\d[\d\.\,]*", like_text)
                    if like_num:
                        likes = int(like_num[0].replace(".", "").replace(",", ""))
                        df.at[i, "likes"] = str(likes)
                        print("‚ù§Ô∏è Likes:", likes)
            except:
                print("‚ùå Likes nicht gefunden.")

        # üóìÔ∏è Ver√∂ffentlichungsdatum
        if not pd.notna(row.get("datum")) or row.get("datum") == "":
            try:
                date_elem = driver.find_element(By.XPATH, "//time")
                df.at[i, "datum"] = date_elem.get_attribute("datetime")[:10]
                print("üìÖ Datum:", df.at[i, "datum"])
            except:
                print("‚ùå Datum nicht gefunden.")

        # üñºÔ∏è Bildbeschreibung
        if not pd.notna(row.get("inhalt")) or row.get("inhalt") == "":
            try:
                image = driver.find_element(By.XPATH, "//article//img")
                df.at[i, "inhalt"] = image.get_attribute("alt").strip()
                print("üñºÔ∏è Bildbeschreibung:", df.at[i, "inhalt"])
            except:
                print("‚ùå Bildbeschreibung nicht gefunden.")

    except Exception as e:
        print(f"‚ùå Fehler bei {url}: {e}")
        continue


# Speichern
df.to_csv(csv_path, index=False, encoding="utf-8")
driver.quit()
print("\n‚úÖ Alle Daten aktualisiert und gespeichert.")



 Lade Beitrag: https://www.instagram.com/p/DICWDtYNq7E/
Looked up url
Dismissed pop ups
Entering try loop username
Looking for username
‚ùå Username nicht gefunden.
Caption not found
üñºÔ∏è Bildbeschreibung: _fida.n.zati_'s profile picture

 Lade Beitrag: https://www.instagram.com/p/DHLad6VIRES/
Looked up url
Dismissed pop ups
Entering try loop username
Looking for username
‚ùå Username nicht gefunden.
Caption not found

 Lade Beitrag: https://www.instagram.com/p/DIOqCE7TKSu/
Looked up url
Dismissed pop ups
Entering try loop username
Looking for username
‚ùå Username nicht gefunden.
Caption not found
‚ùå Likes nicht gefunden.

 Lade Beitrag: https://www.instagram.com/p/DHQFK3CMfRe/
‚ùå Fehler bei https://www.instagram.com/p/DHQFK3CMfRe/: Message: invalid session id
Stacktrace:
	GetHandleVerifier [0x010280E3+60707]
	GetHandleVerifier [0x01028124+60772]
	(No symbol) [0x00E504FE]
	(No symbol) [0x00E8B898]
	(No symbol) [0x00EBCF06]
	(No symbol) [0x00EB89D5]
	(No symbol) [0x00EB7F66]
	(No

In [None]:
df.head()


Unnamed: 0,Post URL,timestamp,datum,inhalt,username,caption,likes
0,https://www.instagram.com/p/DIKaU11Rwj0/,,,,,,
1,https://www.instagram.com/p/DHW6TrpR9UE/,2025-04-14T12:47:05.156202,2025-03-18,"Photo by BernieGirl on March 18, 2025. Ist m√∂g...",,,
2,https://www.instagram.com/p/DHOLtvLtJKL/,2025-04-14T12:47:26.154501,2025-03-15,Photo by Die Welt hinter der Leinwand on March...,,,
3,https://www.instagram.com/p/DH-iv39t-RD/,2025-04-14T12:47:55.882011,2025-04-03,aiwonderlab.eu's profile picture,,,
4,https://www.instagram.com/p/DIJwgPKpxTN/,2025-04-14T12:48:16.988772,2025-04-07,sonya_styless's profile picture,,,


# 1. TWITTER

In [2]:
import os

In [9]:
!pip install python-dotenv



In [4]:
import requests
import pandas as pd
from datetime import datetime

# Your bearer token from Twitter Developer Portal
BEARER_TOKEN = os.getenv("X_BEARER_TOKEN")

# Twitter API endpoint for recent tweets
search_url = "https://api.twitter.com/2/tweets/search/recent"

# Search parameters ‚Äì open topic
query_params = {
    'query': 'Twitter lang:de -is:retweet',  # No keyword, just German tweets
    'max_results': 50,  # Max per request (10‚Äì100)
    'tweet.fields': 'created_at,public_metrics,text,author_id',
    'expansions': 'author_id',
    'user.fields': 'username,name'
}

# Set headers
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

# Send request
response = requests.get(search_url, headers=headers, params=query_params)

# Check response
if response.status_code != 200:
    raise Exception(f"Request failed: {response.status_code}\n{response.text}")

data = response.json()

# Extract data
tweets = data.get("data", [])
users = {u["id"]: u for u in data.get("includes", {}).get("users", [])}

# Prepare data rows
results = []
for tweet in tweets:
    user = users.get(tweet["author_id"], {})
    metrics = tweet.get("public_metrics", {})
    results.append({
        "url": f"https://twitter.com/{user.get('username')}/status/{tweet['id']}",
        "timestamp": datetime.now().isoformat(),
        "datum": tweet.get("created_at", ""),
        "username": user.get("username", ""),
        "name": user.get("name", ""),
        "caption": tweet.get("text", ""),
        "likes": metrics.get("like_count", 0),
        "retweets": metrics.get("retweet_count", 0),
        "replies": metrics.get("reply_count", 0)
    })

# Save to CSV
df = pd.DataFrame(results)
df.sort_values(by="likes", ascending=False, inplace=True)  # Sort by popularity
df.to_csv("../raw/twitter_api_top_tweets.csv", index=False, encoding="utf-8")

print(f"Saved {len(df)} tweets to twitter_api_top_tweets.csv")


Exception: Request failed: 429
{"title":"Too Many Requests","detail":"Too Many Requests","type":"about:blank","status":429}

In [5]:
print(response.headers.get("x-rate-limit-remaining"))
print(response.headers.get("x-rate-limit-reset"))

0
1744658724


In [6]:
pip install ntscraper

Collecting ntscraper
  Downloading ntscraper-0.3.18-py3-none-any.whl.metadata (7.4 kB)
Downloading ntscraper-0.3.18-py3-none-any.whl (12 kB)
Installing collected packages: ntscraper
Successfully installed ntscraper-0.3.18
Note: you may need to restart the kernel to use updated packages.


In [7]:
from ntscraper import Nitter

scraper = Nitter(log_level=1, skip_instance_check=False)

Testing instances: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [00:01<00:00,  2.86it/s]


In [12]:
pip install snscrape


Note: you may need to restart the kernel to use updated packages.


In [15]:
import snscrape.modules.twitter as sntwitter
import pandas as pd

query = 'Klima lang:de since:2024-01-01'
max_tweets = 50
tweets = []

try:
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= max_tweets:
            break
        tweets.append({
            'Datum': tweet.date,
            'User': tweet.user.username,
            'Name': tweet.user.displayname,
            'Text': tweet.content,
            'Likes': tweet.likeCount,
            'Retweets': tweet.retweetCount,
            'Replies': tweet.replyCount,
            'URL': tweet.url
        })
    print(f"{len(tweets)} Tweets erfolgreich gesammelt.")
except Exception as e:
    print(f"Fehler beim Scrapen: {e}")
finally:
    df = pd.DataFrame(tweets)
    output_path = "tweets_scrape_output.csv"
    df.to_csv(output_path, index=False)
    print(f"Tweets gespeichert in: {output_path}")


14-Apr-25 21:22:10 - Retrieving scroll page None
14-Apr-25 21:22:10 - Retrieving https://twitter.com/i/api/graphql/7jT5GT59P8IFjgxwqnEdQw/SearchTimeline?variables=%7B%22rawQuery%22%3A%22Klima%20lang%3Ade%20since%3A2024-01-01%22%2C%22count%22%3A20%2C%22product%22%3A%22Latest%22%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%7D&features=%7B%22rweb_lists_timeline_redesign_enabled%22%3Afalse%2C%22blue_business_profile_image_shape_enabled%22%3Afalse%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Afalse%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22vibe_api_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_t

In [16]:
from playwright.sync_api import sync_playwright
import pandas as pd
import time

def scrape_tweets(keyword="Klima", max_tweets=20):
    tweets_data = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        search_url = f"https://x.com/search?q={keyword}%20lang%3Ade&f=live"
        page.goto(search_url)
        time.sleep(5)

        last_height = 0
        while len(tweets_data) < max_tweets:
            tweet_elements = page.query_selector_all('article')

            for tweet in tweet_elements:
                try:
                    content = tweet.inner_text()
                    lines = content.split('\n')
                    username = lines[0] if lines else ""
                    text = '\n'.join(lines[2:-4]) if len(lines) > 6 else content
                    timestamp = tweet.query_selector('time').get_attribute('datetime') if tweet.query_selector('time') else ''
                    tweet_url = tweet.query_selector('a:has(time)').get_attribute('href') if tweet.query_selector('a:has(time)') else ''
                    full_url = f"https://x.com{tweet_url}" if tweet_url else ''

                    if any(d['url'] == full_url for d in tweets_data):
                        continue  # Already captured

                    tweets_data.append({
                        "username": username,
                        "text": text,
                        "timestamp": timestamp,
                        "url": full_url
                    })

                    if len(tweets_data) >= max_tweets:
                        break
                except Exception as e:
                    continue

            # Scroll down
            page.mouse.wheel(0, 2000)
            time.sleep(2)

        browser.close()

    return tweets_data

# üîÑ Ausf√ºhren & speichern
data = scrape_tweets("Klima", max_tweets=30)
df = pd.DataFrame(data)
df.to_excel("tweets_playwright_scrape.xlsx", index=False)
print(f"{len(df)} Tweets gespeichert in 'tweets_playwright_scrape.xlsx'")


Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.

# 4. TikTok

In [6]:
from TikTokApi import TikTokApi

api = TikTokApi()
trending = api.trending(count=10)

for video in trending:
    print(f"Author: {video['author']['uniqueId']}")
    print(f"Desc: {video['desc']}")
    print(f"Video URL: {video['video']['downloadAddr']}")
    print('-' * 30)


TypeError: Trending() takes no arguments