# 01 Datenbeschaffung der Social Media Posts von Donald Trump

Die n√∂tigen Daten zur Analyse von Donald Trumps Social Media Posts m√ºssen leider aus verschiedenen Quellen zusammen gest√ºckelt werden:
- Trump-Twitter-Archive (2011-2021)
- TTA, aber h√§ndischer Download von immer jeweils 2000 Posts (2021-2024)
- Scraping mit Playwright f√ºr die neuesten & aktuellsten Daten (2024-2025)

## Scraping mit Playwright

In [1]:
## zur Installation der verwendeten Pakete:
# !pip install playwright pandas aiohttp aiofiles os ssl certifi
# einfach die Zeile mit !pip auskommentieren und durchf√ºhren; danach Kernel neu laden
# playwright install ##in shell ausf√ºhren

### Wie sollen die Daten am Ende aussehen?
- Id: Nummer des Posts
- author: Donald Trump @realdonaldtrump
- platform: Truthsocial or X (Twitter)
- date: ganzes Datum (ohne Uhrzeit)
- day: Tag des Posts
- month: Monat des Posts
- year: Jahr des Posts
- time: Uhrzeit des Posts
- text: ganzer Text (ohne Datum, Uhrzeit, Autor und Plattform)
- image : image_path -> der Weg zu den Bildern wird lokal gespeichert.

### Wie sieht die Webseite aus:

- Quellcode auf Website anschauen: https://rollcall.com/factbase-twitter/?platform=all&sort=date&sort_order=desc
- Seite l√§dt Inhalte interaktiv mit Java-Script nach (nicht statisch)
- alle Posts sind in jeweils einzelnen Bl√∂cken gespeichert
- im jeweiligen Block ist einmal das Bild gespeichert und zudem DAutor, Plattform, Datum, Uhrzeit und Text in einem gemeinsamen Block
- Suchmaske auf der Website implementiert
- t√§gliche Erg√§nzung neuer Posts
- Blick auf die URL: Beim Scrollen √§ndert sich die Seitenzahl in der URL
- zum 1.August waren es 87.640 Posts (X und Truthsocial)
- wahrscheinlich circa 5.000 Seiten

### Wahl des Tools:
- Beautifulsoup: schon √§lter, nur f√ºr statische Websites geeignet, braucht l√§nger
- Selectolax: modern und deutlich schneller als Beautiful, wird au√üerdem seltener blockiert, allerdings ebenfalls nur f√ºr statische Seiten
- Selenium: dynamische Alternative, gute M√∂glichkeit
- Playwright: relativ modern, sehr schnell und effizient, f√ºr dynamische Seiten geeignet

### Wie viele Posts gibt es √ºberhaupt auf der Webseite?
- Auf der Webseite steht, dass es 87.656 gibt (Stand 6.8.2025)
- Heruntergeladen werden nur die neueren Posts (November 2024-August 2025), als Erg√§nzung zum TTA

#### Probleme und L√∂sungen:
- die Seite l√§dt dynamisch nach: Playwright verwenden f√ºr dynamische Webseiten
- dynamisches Nachladen: wie komme ich zum Ende der Posts?
- tausend doppelte und dreifache Posts: durch Seiten durchiterieren bringt leider nur doppelte Posts
- doppelte Posts: Key erstellen, mit dem abgeglichen werden kann
- dynamisches Nachladen der Seiten: statt durch Seiten iterieren lieber Scrollen!
- Seiten laden langsam nach: sleep einbauen
- durch das Herunterladen der Bilder: Programm wird sehr langsam :(, daher parallele Worker etablieren & asynchrone Methoden, statt synchron
- Programm st√ºrzt ab, bzw findet keine nicht immer neue Posts, weil die Seiten langsam nachladen: l√§nger warten (sleep(3)) und Posts direkt in csv speichern und nach einem Neustart bereits Gespeichertes aus dem File laden

#### Code f√ºr das Scraping mit Playwright

In [1]:
### Test f√ºr die einzelnen Posts ###
import re
from datetime import datetime

def extract_metadata_text(text: str):
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    author_name = ""
    handle = ""
    platform = ""
    date_str = ""
    time_str = ""
    content_lines = []

    if len(lines) >= 2:
        author_name = lines[0].strip()
        match = re.search(
            r"(@[\w]+)\s*[‚Ä¢\-]\s*(.*?)\s*[‚Ä¢\-]\s*([A-Za-z]+ \d{1,2}, \d{4})\s*@\s*(\d{1,2}:\d{2} [AP]M)",
            lines[1]
        )
        if match:
            handle = match.group(1).strip()
            platform = match.group(2).strip()
            date_str = match.group(3).strip()
            time_str = match.group(4).strip()

        start_idx = 2
        if len(lines) > 2 and lines[2].startswith("View"):  # z.B. "View on ..."
            start_idx = 3
        content_lines = lines[start_idx:]

    content_text = " ".join(content_lines).strip()

    content_text = re.sub(r"\s{2,}", " ", content_text)

    try:
        dt = datetime.strptime(f"{date_str} {time_str}", "%B %d, %Y %I:%M %p")
        return {
            "author": f"{author_name} {handle}".strip(),
            "platform": platform,
            "date": dt.strftime("%Y-%m-%d"),
            "time": dt.strftime("%H:%M"),
            "year": int(dt.year),
            "month": dt.strftime("%B"),
            "day": int(dt.day),
            "text": content_text
        }
    except Exception:
        return {
            "author": f"{author_name} {handle}".strip(),
            "platform": platform,
            "date": date_str,
            "time": time_str,
            "year": "",
            "month": "",
            "day": "",
            "text": content_text
        }

# --- Testfunktion ---
def test_extract_metadata():
    sample_post = """
    Donald Trump
    @realDonaldTrump ‚Ä¢ Twitter ‚Ä¢ January 6, 2021 @ 3:45 PM
    View on Twitter
    This is a test post
    with multiple lines
    and even more text.
    
    As obviously, this is also part of the text.
    
    And what about this? @you
    Look here!
    """

    result = extract_metadata_text(sample_post)
    print("Autor:", result["author"])
    print("Plattform:", result["platform"])
    print("Datum:", result["date"])
    print("Zeit:", result["time"])
    print("Text:", result["text"])

# Testlauf
if __name__ == "__main__":
    test_extract_metadata()


Autor: Donald Trump @realDonaldTrump
Plattform: Twitter
Datum: 2021-01-06
Zeit: 15:45
Text: This is a test post with multiple lines and even more text. As obviously, this is also part of the text. And what about this? @you Look here!


Folgender Code l√§dt 7.176 Posts herunter (vom 2.11.2024 bis zum 25.08.2025):

In [2]:
import asyncio
import nest_asyncio
import pandas as pd
import re
from datetime import datetime
from playwright.async_api import async_playwright
import aiohttp
import os
import certifi
import ssl
import hashlib
import aiofiles

nest_asyncio.apply()
os.makedirs("images", exist_ok=True)

sslcontext = ssl.create_default_context(cafile=certifi.where())
sslcontext.check_hostname = False
sslcontext.verify_mode = ssl.CERT_NONE

CSV_FILE = "factbase_posts_clean.csv"

# Anzahl paralleler Download-Worker
num_workers = 10

async def download_worker(queue, session):
    """Paralleles Herunterladen der Daten (vorallem der Bilder),
        damit das Programm schneller l√§uft"""
    os.makedirs("images", exist_ok=True)
    while True:
        item = await queue.get()
        if item is None: 
            queue.task_done()
            break

        image_url, post, filename = item
        try:
            async with session.get(image_url, ssl=sslcontext) as resp:
                if resp.status == 200:
                    fpath = os.path.join("images", filename)
                    async with aiofiles.open(fpath, "wb") as f:
                        await f.write(await resp.read())
                    post["image_path"] = fpath
        except Exception as e:
            print(f"Fehler beim Download {image_url}: {e}")
        finally:
            queue.task_done()
            

             
def extract_metadata_text(text: str):
    """Extrahieren der Metadaten, die auf der Webseite in einem Block angezeigt werden"""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    author_name = ""
    handle = ""
    platform = ""
    date_str = ""
    time_str = ""
    content_lines = []

    if len(lines) >= 2:
        author_name = lines[0].strip()
        match = re.search(
            r"(@[\w]+)\s*[‚Ä¢\-]\s*(.*?)\s*[‚Ä¢\-]\s*([A-Za-z]+ \d{1,2}, \d{4})\s*@\s*(\d{1,2}:\d{2} [AP]M)",
            lines[1]
        )
        if match:
            handle = match.group(1).strip()
            platform = match.group(2).strip()
            date_str = match.group(3).strip()
            time_str = match.group(4).strip()

        start_idx = 2
        if len(lines) > 2 and lines[2].startswith("View"):  # z.B. "View on ..."
            start_idx = 3
        content_lines = lines[start_idx:]

    # Absatzmarker durch Leerzeichen ersetzen
    content_text = " ".join(content_lines).strip()

    # Doppelte Leerzeichen normalisieren
    content_text = re.sub(r"\s{2,}", " ", content_text)

    try:
        dt = datetime.strptime(f"{date_str} {time_str}", "%B %d, %Y %I:%M %p")
        return {
            "author": f"{author_name} {handle}".strip(),
            "platform": platform,
            "date": dt.strftime("%Y-%m-%d"),
            "time": dt.strftime("%H:%M"),
            "year": int(dt.year),
            "month": dt.strftime("%B"),
            "day": int(dt.day),
            "text": content_text
        }
    except Exception:
        return {
            "author": f"{author_name} {handle}".strip(),
            "platform": platform,
            "date": date_str,
            "time": time_str,
            "year": "",
            "month": "",
            "day": "",
            "text": content_text
        }
    
def make_post_key(data, include_image=False, include_text=True):
    """
    Generiert einen eindeutigen Schl√ºssel f√ºr jeden Post, um Duplikate zu vermeiden.
    Text und Bild sind optional; nur Posts ohne jegliche Metadaten UND ohne Text UND ohne Bild werden verworfen
    """
    # Metadaten pr√ºfen
    #author = str(data.get("author", "")).strip() 
    #Autor ist eigentlich immer identisch, daher f√ºr Key sinnlos
    platform = str(data.get("platform", "")).strip()
    date = str(data.get("date", "")).strip()
    time = str(data.get("time", "")).strip()
    #text = str(data.get("text", "")).strip()
    img = data.get("image_path")
    
     # Wenn gar nichts da ist: kein valider Key
    if not (platform or date or time or img): #Text kann auch leer sein, daher weglassen
        return None
    
    parts = []
    parts.extend([platform, date, time]) #Basisdaten

    # Text (optional)
    if include_text:
        text_val = str(data.get("text", "")).strip().lower()
        text_norm = re.sub(r"\s+", " ", text_val)
        parts.append(text_norm)
    
    # Bild (optional)
    if include_image and isinstance(img, str) and img.strip():
        parts.append(os.path.basename(img.strip()))

    raw_key = "|".join(parts).strip()
    if not raw_key:
        return None

    return hashlib.md5(raw_key.encode("utf-8")).hexdigest()


async def scrape_all_dynamic(max_posts=90000, max_no_new=5):
    """Scraping der Daten inclusive Checker, ob schon ein File mit Daten vorhanden ist, 
        um an der Stelle weiter arbeiten zu k√∂nnen, wenn das Programm abbricht.
        Bilddownload, Scrollfunktion und Worker beenden."""
    posts_data = []
    seen_posts = set()

    # Fortsetzung, damit mit bereits vorhandenen Daten weiter gearbeitet werden kann
    if os.path.exists(CSV_FILE):
        print(f"Vorhandene Datei gefunden: {CSV_FILE} ‚Äì Lade gespeicherte Posts...")
        df_existing = pd.read_csv(CSV_FILE)
        posts_data = df_existing.to_dict("records")
        seen_posts = {make_post_key(row) for row in posts_data}
        print(f"{len(posts_data)} Posts bereits geladen ‚Äì setze fort...")

    no_new_rounds = 0

    async with async_playwright() as p, aiohttp.ClientSession() as session:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        page.set_default_timeout(60000)

        await page.goto("https://rollcall.com/factbase-twitter/?platform=all&sort=date&sort_order=desc")
        await asyncio.sleep(2)

        queue = asyncio.Queue()
        workers = [asyncio.create_task(download_worker(queue, session)) for _ in range(num_workers)]

        while True:
            try:
                await page.wait_for_selector("div.block", timeout=30000)
                blocks = await page.query_selector_all("div.block")
            except:
                print("Keine weiteren Posts, breche ab.")
                break

            print(f"Aktuell {len(posts_data)} Posts gespeichert ‚Äì {len(blocks)} Bl√∂cke auf der Seite sichtbar")

            new_count = 0
            for block in blocks:
                if len(posts_data) >= max_posts:
                    break

                try:
                    full_text = await block.inner_text()
                    data = extract_metadata_text(full_text)

                    # Basis-Post
                    post = {
                        "author": data["author"],
                        "platform": data["platform"],
                        "date": data["date"],
                        "time": data["time"],
                        "day": data["day"],
                        "month": data["month"],
                        "year": data["year"],
                        "text": data["text"],
                        "image_path": None,
                        "image_url": None
                    }

                    # Bild-Download vorbereiten
                    img_src = None
                    try:
                        img_el = await block.query_selector("img")
                        if img_el:
                            src = await img_el.get_attribute("src")
                            if src and re.search(r"\.jpe?g", src, re.IGNORECASE):
                                img_src = src
                                filename = f"{hashlib.md5(src.encode()).hexdigest()}.jpg"
                                await queue.put((src, post, filename))
                    except:
                        pass

                    post["image_url"] = img_src
                    
                    # Schl√ºssel generieren
                    key = make_post_key(post, include_image=False, include_text=True)
                    if not key or key in seen_posts:
                        continue
                    seen_posts.add(key)
                    
                    posts_data.append(post)
                    new_count += 1

                except Exception as e:
                    print(f"Fehler bei Post: {e}")

            print(f"Neu hinzugekommen: {new_count} Posts")

            # Scrollen und no_new_rounds pr√ºfen
            if new_count == 0:
                no_new_rounds += 1
                print(f"Keine neuen Posts ({no_new_rounds}/{max_no_new})")
                if no_new_rounds >= max_no_new:
                    break
            else:
                no_new_rounds = 0

            # Scrollen nur, wenn max_posts noch nicht erreicht wurde
            if len(posts_data) < max_posts:
                last_height = await page.evaluate("document.body.scrollHeight")
                await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
                await asyncio.sleep(3)
                new_height = await page.evaluate("document.body.scrollHeight")
                if new_height == last_height:
                    print(f"Scrollen brachte nichts Neues ({no_new_rounds}/{max_no_new})")
                    break
            else:
                print(f"Maximale Anzahl {max_posts} erreicht.")
                break

        # Queue abwarten
        await queue.join()
        for _ in range(num_workers):
            await queue.put(None)
        await asyncio.gather(*workers)

        await browser.close()

    # IDs vergeben und CSV speichern
    for idx, post in enumerate(posts_data, start=1):
        post["id"] = idx

    fff = pd.DataFrame(posts_data)
    cols = ["id", "author", "platform", "date", "time", "day", "month", "year", "text", "image_path"]
    fff = fff[cols]
    fff.to_csv(CSV_FILE, index=False, encoding="utf-8")
    print(f"Scraping abgeschlossen. Gesamt: {len(fff)} Posts.")


# --- Wie viele Posts? ---
    total = len(fff)
    print("===================================")
    print(f"Gesamt:     {total}")
    print("===================================")

# Starten
await scrape_all_dynamic(max_posts=90000)

Vorhandene Datei gefunden: factbase_posts_clean.csv ‚Äì Lade gespeicherte Posts...
7161 Posts bereits geladen ‚Äì setze fort...
Aktuell 7161 Posts gespeichert ‚Äì 53 Bl√∂cke auf der Seite sichtbar
Neu hinzugekommen: 11 Posts
Aktuell 7172 Posts gespeichert ‚Äì 51 Bl√∂cke auf der Seite sichtbar
Neu hinzugekommen: 0 Posts
Keine neuen Posts (1/5)
Aktuell 7172 Posts gespeichert ‚Äì 101 Bl√∂cke auf der Seite sichtbar
Neu hinzugekommen: 4 Posts
Scrollen brachte nichts Neues (0/5)
Scraping abgeschlossen. Gesamt: 7176 Posts.
Gesamt:     7176


In [1]:
# Wie sehen die Daten aus?
import pandas as pd
ppp = pd.read_csv("factbase_posts_clean.csv")
print(ppp.tail(10).T)

                                                         7166  \
id                                                       7167   
author                          Donald Trump @realDonaldTrump   
platform                                         Truth Social   
date                                               2025-08-25   
time                                                    09:14   
day                                                      25.0   
month                                                  August   
year                                                   2025.0   
text        I PAID ZERO FOR INTEL, IT IS WORTH APPROXIMATE...   
image_path        images/6bfc8ebd831abe9afa959fb2653ae0ef.jpg   

                                                         7167  \
id                                                       7168   
author                          Donald Trump @realDonaldTrump   
platform                                         Truth Social   
date                    

In [2]:
ppp.shape

(7176, 10)

In [3]:
oldest_30 = ppp.sort_values('date', ascending=True).head(30)
print(oldest_30)

        id                         author                 platform  \
5853  5854  Donald Trump @realDonaldTrump                      NaN   
5855  5856  Donald Trump @realDonaldTrump             Truth Social   
5854  5855  Donald Trump @realDonaldTrump                      NaN   
5852  5853  Donald Trump @realDonaldTrump             Truth Social   
5804  5805  Donald Trump @realDonaldTrump             Truth Social   
5801  5802  Donald Trump @realDonaldTrump             Truth Social   
5823  5824  Donald Trump @realDonaldTrump             Truth Social   
5822  5823  Donald Trump @realDonaldTrump             Truth Social   
5821  5822  Donald Trump @realDonaldTrump                      NaN   
5820  5821  Donald Trump @realDonaldTrump                      NaN   
5819  5820  Donald Trump @realDonaldTrump                      NaN   
5818  5819  Donald Trump @realDonaldTrump                      NaN   
5817  5818  Donald Trump @realDonaldTrump                      NaN   
5816  5817  Donald T

In [4]:
newest_30 = ppp.sort_values('date', ascending=False).head(30)
print(newest_30)

        id                         author                 platform  \
7167  7168  Donald Trump @realDonaldTrump             Truth Social   
7166  7167  Donald Trump @realDonaldTrump             Truth Social   
7165  7166  Donald Trump @realDonaldTrump             Truth Social   
7164  7165  Donald Trump @realDonaldTrump             Truth Social   
7163  7164  Donald Trump @realDonaldTrump             Truth Social   
7162  7163  Donald Trump @realDonaldTrump             Truth Social   
7161  7162  Donald Trump @realDonaldTrump             Truth Social   
0        1  Donald Trump @realDonaldTrump             Truth Social   
6381  6382  Donald Trump @realDonaldTrump             Truth Social   
5862  5863  Donald Trump @realDonaldTrump             Truth Social   
1        2  Donald Trump @realDonaldTrump             Truth Social   
5861  5862  Donald Trump @realDonaldTrump             Truth Social   
5860  5861  Donald Trump @realDonaldTrump  Deleted ‚Ä¢  Truth Social   
5859  5860  Donald

In [5]:
ppp.text[9]

'Dan Patrick is a terrific and powerful Lieutenant Governor for the Great State of Texas, a place I truly love. I WON BIG in 2016, 2020, and 2024 (Getting the Highest Number of Votes for any Office in the History of Texas ‚Äî Such an Honor!). As Texas Chair of our Historic Presidential Campaigns in 2016, 2020, and 2024, Dan has been an incredible friend to our Movement, helping me to WIN BIG in all Primaries and General Elections! Dan‚Äôs leadership was pivotal in the passage of the new, fair, and much improved, Congressional Map, that will give the wonderful people of Texas the tremendous opportunity to elect 5 new MAGA Republicans in the 2026 Midterm Elections ‚Äî A HUGE VICTORY for our America First Agenda. In his next Term, Dan will continue to fight tirelessly alongside of us to Secure our already Secure Border, Stop Migrant Crime and the Flow of Illegal Drugs into our Country, Grow the Economy, Cut Taxes and Regulations, Promote MADE IN THE U.S.A., Restore American Energy DOMINAN

In [6]:
# Wie viele Posts enthalten keine Angabe zur Zeit?
count_without_time = ppp["time"].isna().sum()
print(f"Anzahl der Posts ohne Zeitangabe: {count_without_time}")

Anzahl der Posts ohne Zeitangabe: 1


In [7]:
ppp_single = ppp.drop_duplicates(subset=['time','date', 'image_path'], keep="first")
print(f"Vorher: {len(ppp)} Zeilen, danach: {len(ppp_single)} Zeilen ohne Duplikate.")

Vorher: 7176 Zeilen, danach: 5192 Zeilen ohne Duplikate.


In [10]:
ppp = ppp.drop_duplicates(subset=['time','date', 'image_path'], keep="first")
print(len(ppp))

5192


In [11]:
print(ppp_single.sort_values('date').head(5))

        id                         author      platform        date   time  \
5854  5855  Donald Trump @realDonaldTrump           NaN  2024-11-02  23:29   
5853  5854  Donald Trump @realDonaldTrump           NaN  2024-11-02  23:37   
5855  5856  Donald Trump @realDonaldTrump  Truth Social  2024-11-02  23:13   
5852  5853  Donald Trump @realDonaldTrump  Truth Social  2024-11-02  23:57   
5850  5851  Donald Trump @realDonaldTrump  Truth Social  2024-11-03  01:13   

      day     month    year  \
5854  2.0  November  2024.0   
5853  2.0  November  2024.0   
5855  2.0  November  2024.0   
5852  2.0  November  2024.0   
5850  3.0  November  2024.0   

                                                   text  \
5854  THANK YOU‚ÄîGREENSBORO, NORTH CAROLINA! #MAGA ht...   
5853  Three beautiful MAGA RALLIES today in Gastonia...   
5855       RT @realDonaldTrump11/2/24 | SALEM, VIRGINIA   
5852                                                NaN   
5850  https://www.breitbart.com/clips/2009/10/0

In [12]:
# Reichen die gescrapten Daten weit genug? 
#Bis zum 04.11.2024 sind Daten des Trump-Twitter-Archivs vorhanden.
oldest_single_30 = ppp_single.sort_values('date').head(30)
print(oldest_single_30.T)

                                                         5854  \
id                                                       5855   
author                          Donald Trump @realDonaldTrump   
platform                                                  NaN   
date                                               2024-11-02   
time                                                    23:29   
day                                                       2.0   
month                                                November   
year                                                   2024.0   
text        THANK YOU‚ÄîGREENSBORO, NORTH CAROLINA! #MAGA ht...   
image_path        images/00a8daa2f3c80102ee965af26597442d.jpg   

                                                         5853  \
id                                                       5854   
author                          Donald Trump @realDonaldTrump   
platform                                                  NaN   
date                  

In [13]:
count_without_id = ppp["id"].isna().sum()
print(f"Anzahl der Posts ohne ID: {count_without_id}")

Anzahl der Posts ohne ID: 0


In [14]:
# Bereinigung der Nan-Werte
ppp['text'] = ppp['text'].fillna("").str.strip()

In [15]:
post_1452 = ppp[ppp['time'] == '14:52']
print(post_1452)

        id                         author      platform        date   time  \
1102  1103  Donald Trump @realDonaldTrump  Truth Social  2025-06-26  14:52   
2000  2001  Donald Trump @realDonaldTrump  Truth Social  2025-05-04  14:52   
2828  2829  Donald Trump @realDonaldTrump  Truth Social  2025-03-16  14:52   
3251  3252  Donald Trump @realDonaldTrump           NaN  2025-02-24  14:52   
4365  4366  Donald Trump @realDonaldTrump  Truth Social  2024-12-17  14:52   
5087  5088  Donald Trump @realDonaldTrump           NaN  2025-02-24  14:52   

       day     month    year  \
1102  26.0      June  2025.0   
2000   4.0       May  2025.0   
2828  16.0     March  2025.0   
3251  24.0  February  2025.0   
4365  17.0  December  2024.0   
5087  24.0  February  2025.0   

                                                   text  \
1102  The Democrats are the ones who leaked the info...   
2000  Lee Zeldin: ‚ÄúAt the Trump EPA, the status quo ...   
2828  I just won the Golf Club Championship, prob

###### Daten in json umwandeln:

In [16]:
import pandas as pd
ppp = pd.read_csv("factbase_posts_clean.csv")
ppp.to_json("factbase_posts_clean.json", orient="records", force_ascii=False, indent=2)
print(ppp.head())

   id                         author      platform        date   time   day  \
0   1  Donald Trump @realDonaldTrump  Truth Social  2025-08-24  12:24  24.0   
1   2  Donald Trump @realDonaldTrump  Truth Social  2025-08-24  12:23  24.0   
2   3  Donald Trump @realDonaldTrump  Truth Social  2025-08-24  12:23  24.0   
3   4  Donald Trump @realDonaldTrump  Truth Social  2025-08-24  12:22  24.0   
4   5  Donald Trump @realDonaldTrump  Truth Social  2025-08-24  10:16  24.0   

    month    year                                               text  \
0  August  2025.0  I played Golf yesterday with the Great Roger C...   
1  August  2025.0  https://humanevents.com/2025/08/21/shea-bradle...   
2  August  2025.0  https://www.foxnews.com/opinion/gregg-jarrett-...   
3  August  2025.0                                                NaN   
4  August  2025.0  Did Wes Moore, the Governor of Maryland, lie a...   

  image_path  
0        NaN  
1        NaN  
2        NaN  
3        NaN  
4        NaN  


In [17]:
print(ppp.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7176 entries, 0 to 7175
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          7176 non-null   int64  
 1   author      7176 non-null   object 
 2   platform    6739 non-null   object 
 3   date        7175 non-null   object 
 4   time        7175 non-null   object 
 5   day         7175 non-null   float64
 6   month       7175 non-null   object 
 7   year        7175 non-null   float64
 8   text        4363 non-null   object 
 9   image_path  2725 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 560.8+ KB
None


#### Die restlichen Daten wurden von der Webseite https://www.thetrumparchive.com heruntergeladen.

In [18]:
# zuerst der File mit Daten von 2009-2021
import pandas as pd
fff = pd.read_csv("tweets_01-08-2021.csv")
print(fff.tail(10).T)

                                                       56561  \
id                                       1212166009446162432   
text       RT @heatherjones333: MAGNIFICENT TRUMP- KEEPIN...   
isRetweet                                                  t   
isDeleted                                                  f   
device                                    Twitter for iPhone   
favorites                                                  0   
retweets                                                6452   
date                                     2020-01-01 00:17:52   
isFlagged                                                  f   

                                                       56562  \
id                                       1212165377477750786   
text       RT @heatherjones333: üî•üî•üî•üî•üî•Lindsey Graham: Trum...   
isRetweet                                                  t   
isDeleted                                                  f   
device                  

In [19]:
fff.shape

(56571, 9)

Nun sollen alle restlichen json-files von 2021-2024, die ich einzeln herunter laden musste (mit jeweils 2000 Posts), zusammen geh√§ngt werden:

In [20]:
import glob
files = glob.glob("TTA/*.json")
print(files)

['TTA/januarmarch24.json', 'TTA/marchmai24.json', 'TTA/juliseptember24.json', 'TTA/septembernovember23.json', 'TTA/apriljuli23.json', 'TTA/novemberjanuar24.json', 'TTA/spetemberdezember24.json', 'TTA/juliseptember23.json', 'TTA/oktoberjanuar22.json', 'TTA/januarapril23.json', 'TTA/junioktober22.json', 'TTA/maijuli24.json', 'TTA/tweets_01-08-2021.json', 'TTA/januar21juni22.json']


Die Dateien, die ich herunterladen musste, waren scheinbar keine sauberen json-files.
S√§mtliche Steuerzeichen wie \n, \t, \r, Nullbytes oder ungew√∂hnliche Kontrollcodes werden durch ein Leerzeichen " " ersetzt, da JSON nur bestimmte Zeichen enthalten darf (z.B. ", \n, \t nur in Strings mit Escape). Wenn in den Dateien unescapte Steuerzeichen enthalten sind, bricht json.loads ab.

In [31]:
import json, pandas as pd, glob, re

dfs = []
for file in glob.glob("TTA/*.json"):
    with open(file, "r", encoding="utf-8") as f:
        raw = f.read()
        cleaned = re.sub(r"[\x00-\x1f\x7f]", " ", raw)
    for loader in (lambda x: json.loads(x), lambda x: [json.loads(line) for line in x.splitlines() if line.strip()]):
        try:
            data = loader(cleaned)
            if isinstance(data, dict): data = [data]
            dfs.append(pd.DataFrame(data))
            print(f"Bereinigt geladen: {file}")
            break
        except: 
            continue

if dfs:
    tta_full = pd.concat(dfs, ignore_index=True)
    print("Gesamtdaten:", tta_full.shape)
    print(tta_full.head())
    tta_full.to_csv("tta_full.csv", index=False, encoding="utf-8")
    tta_full.to_json("tta_full.json", orient="records", force_ascii=False, indent=2)
    print("Gespeichert als tta_full.csv und tta_full.json")
else:
    print("Keine g√ºltigen JSON-Dateien geladen.")


Bereinigt geladen: TTA/januarmarch24.json
Bereinigt geladen: TTA/marchmai24.json
Bereinigt geladen: TTA/juliseptember24.json
Bereinigt geladen: TTA/septembernovember23.json
Bereinigt geladen: TTA/apriljuli23.json
Bereinigt geladen: TTA/novemberjanuar24.json
Bereinigt geladen: TTA/spetemberdezember24.json
Bereinigt geladen: TTA/juliseptember23.json
Bereinigt geladen: TTA/oktoberjanuar22.json
Bereinigt geladen: TTA/januarapril23.json
Bereinigt geladen: TTA/junioktober22.json
Bereinigt geladen: TTA/maijuli24.json
Bereinigt geladen: TTA/tweets_01-08-2021.json
Bereinigt geladen: TTA/januar21juni22.json
Gesamtdaten: (80358, 9)
            date favorites                  id isRetweet retweets  \
0  1709611886295     22410  112041124575579316     False     4357   
1  1709606786303     16477  112040790347547040     False     3420   
2  1709606689853     11978  112040784026627289     False     3468   
3  1709600790251     14970  112040397390384141     False     3429   
4  1709599014526     10254

In [32]:
tta_full.text[1]

'<p><span  class="quote-inline"><br/>RT:  https://truthsocial.com/users/realDonaldTrump/statuses/112037150280659739</span>THANK  YOU, NORTH DAKOTA! <a  href="https://links.truthsocial.com/link/110119864581902473"  rel="nofollow noopener noreferrer" target="_blank"><span  class="invisible">https://</span><span  class="">DonaldJTrump.com</span><span  class="invisible"></span></a></p>'

Das Datum soll getrennt werden, damit es besser zu den Daten von "factbase_posts_clean.json" passt.
Au√üerdem ist das Datum noch seltsam codiert:

In [33]:
### scheinbar gibt es eine spezifische Art f√ºr Twitter, das Datum darzustellen:
## als Twitter Snowflake (laut ChatGPT)
import pandas as pd
df = pd.read_json("tta_full.json", orient="records")
print(df['date'].head(5))
print(df['date'].dtype)

0   1970-01-01 00:28:29.611886295
1   1970-01-01 00:28:29.606786303
2   1970-01-01 00:28:29.606689853
3   1970-01-01 00:28:29.600790251
4   1970-01-01 00:28:29.599014526
Name: date, dtype: datetime64[ns]
datetime64[ns]


In [34]:
import pandas as pd
import json

# JSON einlesen
df = pd.read_json("tta_full.json", orient="records")

# ID als int64 speichern (f√ºr Bit-Shift)
df['id'] = df['id'].astype('int64')

# Twitter Snowflake in Timestamp umrechnen
TWITTER_EPOCH = 1288834974657
df['timestamp_ms'] = df['id'].apply(lambda x: (x >> 22) + TWITTER_EPOCH)

# In datetime konvertieren
df['datetime'] = pd.to_datetime(df['timestamp_ms'], unit='ms')
# Datum trennen, damit es einheitlicher wird (im Vergleich zu den gescrapten Daten)
df['date'] = df['datetime'].dt.date.astype(str)
df['time'] = df['datetime'].dt.time.astype(str)
df['day'] = df['datetime'].dt.day
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year
# Spalten entfernen
df.drop(columns=['datetime','timestamp_ms'], inplace=True)
# Reihenfolge festlegen
key_order = ['id','text','date','time','day','month','year'] + [col for col in df.columns if col not in ['id','text','date','time','day','month','year']]

data_ordered = df[key_order].to_dict(orient="records")

with open("tta_full.json", "w", encoding="utf-8") as f:
    json.dump(data_ordered, f, ensure_ascii=False, indent=2)


for record in data_ordered[:3]:
    print(json.dumps(record, ensure_ascii=False, indent=2))

{
  "id": 112041124575579312,
  "text": "<p>A  70 Point win in the Great State of North Dakota tonight. Thank you  Governor Doug, and First Lady Kathryn,  Burgum!!!</p>",
  "date": "2011-09-09",
  "time": "05:54:20.379000",
  "day": 9,
  "month": 9,
  "year": 2011,
  "favorites": 22410,
  "isRetweet": false,
  "retweets": 4357,
  "isDeleted": null,
  "device": null,
  "isFlagged": null
}
{
  "id": 112040790347547040,
  "text": "<p><span  class=\"quote-inline\"><br/>RT:  https://truthsocial.com/users/realDonaldTrump/statuses/112037150280659739</span>THANK  YOU, NORTH DAKOTA! <a  href=\"https://links.truthsocial.com/link/110119864581902473\"  rel=\"nofollow noopener noreferrer\" target=\"_blank\"><span  class=\"invisible\">https://</span><span  class=\"\">DonaldJTrump.com</span><span  class=\"invisible\"></span></a></p>",
  "date": "2011-09-09",
  "time": "05:53:00.693000",
  "day": 9,
  "month": 9,
  "year": 2011,
  "favorites": 16477,
  "isRetweet": false,
  "retweets": 3420,
  "isDele

In [35]:
## Ist in den Spalten "isFlagged", "isDeleted" und "device" Wert immer null?
df = pd.read_json("tta_full.json", orient="records")
cols_to_check = ["isFlagged", "isDeleted", "device"]
for col in cols_to_check:
    if col in df.columns:
        null_count = df[col].isna().sum()
        total = len(df)
        print(f"Spalte '{col}': {null_count}/{total} null Werte")
    else:
        print(f"Spalte '{col}' existiert nicht in der Datei.")

Spalte 'isFlagged': 23787/80358 null Werte
Spalte 'isDeleted': 23787/80358 null Werte
Spalte 'device': 23787/80358 null Werte


Folglich waren diese Angaben in der Datei "08-01-2021.json" alle enthalten (etwa 54.000 Eintr√§ge), aber bei den neueren Daten, die von Hand gescrapt wurden, nicht mehr.

#### Jetzt sollen "factbase_posts_clean.json" und "tta_full.json" zusammengef√ºhrt werden:

In [36]:
import pandas as pd
import json

# factbase_posts_clean.json einlesen, 'image_path', 'author' und 'platform' weglassen
cols_factbase = ["id", "date", "time", "day", "month", "year", "text"]
df_factbase = pd.read_json("factbase_posts_clean.json", orient="records")
df_factbase = df_factbase[cols_factbase]

# tta_full.json einlesen
df_tta = pd.read_json("tta_full.json", orient="records")

# Einheitliche Spaltenreihenfolge definieren
cols_order = ["id", "text", "date", "time", "day", "month", "year",
              "favorites", "retweets", "isRetweet", "isDeleted", "device", "isFlagged"]

# Fehlende Spalten erg√§nzen
for col in cols_order:
    if col not in df_factbase.columns:
        df_factbase[col] = pd.NA
    if col not in df_tta.columns:
        df_tta[col] = pd.NA

# Beide DataFrames zusammenf√ºhren
df_combined = pd.concat([df_factbase, df_tta], ignore_index=True)

# Spalten in der definierten Reihenfolge anordnen
df_combined = df_combined[cols_order]

df_combined['date'] = pd.to_datetime(df_combined['date'], errors='coerce').dt.date.astype(str)
#df_combined['time'] = pd.to_datetime(df_combined['time'], errors='coerce').dt.time.astype(str)
df_combined['time'].astype(str)
#df_combined['time'] = pd.to_datetime(df_combined['time'], format='%H:%M:%S', errors='coerce').dt.time.astype(str)


# Datum als String vorausgesetzt, z.B. "2025-08-24"
df_combined['month'] = pd.to_datetime(df_combined['date'], errors='coerce').dt.month
df_combined['day'] = pd.to_datetime(df_combined['date'], errors='coerce').dt.day
df_combined['year'] = pd.to_datetime(df_combined['date'], errors='coerce').dt.year

df_combined = df_combined.where(pd.notna(df_combined), None)
df_combined.to_json("t_combined_all.json", orient="records", force_ascii=False, indent=2)
df_combined.to_csv("t_combined_all.csv", index=False, encoding="utf-8")
print(df_combined[["id", "text", "date", "time", "month"]].head())

   id                                               text        date   time  \
0   1  I played Golf yesterday with the Great Roger C...  2025-08-24  12:24   
1   2  https://humanevents.com/2025/08/21/shea-bradle...  2025-08-24  12:23   
2   3  https://www.foxnews.com/opinion/gregg-jarrett-...  2025-08-24  12:23   
3   4                                               None  2025-08-24  12:22   
4   5  Did Wes Moore, the Governor of Maryland, lie a...  2025-08-24  10:16   

   month  
0    8.0  
1    8.0  
2    8.0  
3    8.0  
4    8.0  


In [37]:
df_combined.shape

(87534, 13)

In [38]:
df_combined.tail()

Unnamed: 0,id,text,date,time,day,month,year,favorites,retweets,isRetweet,isDeleted,device,isFlagged
87529,108217783188791696,<p>Thank you to all of the GREAT and BEAUTIFU...,2011-08-29,16:41:44.758000,29.0,8.0,2011.0,217929,47820,False,,,
87530,108211822140637680,"<p>I‚ÄôM BACK! <a href=""https://truthsocial.com...",2011-08-29,16:18:03.533000,29.0,8.0,2011.0,411400,123523,False,,,
87531,107797156496908384,<p>Get Ready! Your favorite President will s...,2011-08-28,12:50:19.540000,28.0,8.0,2011.0,264781,49190,False,,,
87532,1347569870578266112,"To all of those who have asked, I will not be...",2021-01-08,15:44:28.440000,8.0,1.0,2021.0,639463,79113,False,,,
87533,1347555316863553536,"The 75,000,000 great American Patriots who vo...",2021-01-08,14:46:38.564000,8.0,1.0,2021.0,535831,89895,False,,,


In [39]:
print(json.dumps(df_combined.iloc[0].to_dict(), ensure_ascii=False, indent=2))

{
  "id": 1,
  "text": "I played Golf yesterday with the Great Roger Clemens and his son, Kacy. Roger Clemens was easily one of the few Greatest Pitchers of All Time, winning 354 Games, the Cy Young Award seven times (A Record, by a lot!), and played in six World Series, winning two! He was second to Nolan Ryan in most strike-outs, and he should be in the Baseball Hall of Fame, NOW! People think he took drugs, but nothing was proven. He never tested positive, and Roger, from the very beginning, totally denies it. He was just as great before those erroneous charges were leveled at him. That rumor has gone on for years, and there has been no evidence whatsoever that he was a ‚Äúdruggie.‚Äù This is going to be like Pete Rose where, after over 4,000 Hits, they wouldn‚Äôt put him in the Hall of Fame until I spoke to the Commissioner, and he promised to do so, but it was essentially a promise not kept because he only ‚Äúopened it up‚Äù when Pete died and, even then, he said that Pete Rose on

#### t_combined_all.json ist das finale Dokument.