# Classify and Export Trump Tweets

> Before you start, make sure you've followed the setup instructions in the [README](README.md).

This notebook analyzes a set of Trump tweets that have been manually labeled as `text`, `multimedia`, or `both`.

It demonstrates how to use an LLM to classify these tweets and export the results to a CSV file, which can then be used for downstream [evaluation](eval_tweet_classification.ipynb).

## Ingest the data

We'll start with some basic imports and then load the CSV file containing Trump's tweets.


In [1]:
import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM

In [2]:
df = pd.read_csv("trump_tweets_sample_labeled.csv")

Review the data to see what columns are available. You can use the `head()` method to view the first few rows of the DataFrame.

In [4]:
# Display all columns
pd.set_option("display.max_columns", None)
df.head()

Unnamed: 0,PostId,PostUrl,PostEngagement,Platform,ChannelID,ChannelName,ChannelUid,ChannelUrl,ChannelEngagement,post_body_text,post_type_true_label,GoogleAudioText,VoskAudioText,EmbeddedContentText,published_at,post_data,post_media_urls,LikesCount,SharesCount,CommentsCount,ViewsCount,post_media_file,embedded_post_text,search_data
0,528452546,https://twitter.com/realDonaldTrump/status/192...,,Twitter,14001198,Donald J. Trump,blank_for_now,blank_for_now,"{""follower_count"":105433744,""following_count"":...",https://t.co/ttc8fsXNVF,multimedia,,no longer populated,,2025-05-31T12:54:08.000Z,post data removed,https://www.junkipedia.org/rails/active_storag...,374413,52340,18280,13598142,,,
1,524629281,https://twitter.com/realDonaldTrump/status/192...,,Twitter,14001198,Donald J. Trump,blank_for_now,blank_for_now,"{""follower_count"":105433386,""following_count"":...",https://t.co/b19JtrkCiS,multimedia,,no longer populated,,2025-05-26T19:50:36.000Z,post data removed,https://www.junkipedia.org/rails/active_storag...,328895,46401,21163,30490858,,,
2,522872445,https://twitter.com/realDonaldTrump/status/192...,,Twitter,14001198,Donald J. Trump,blank_for_now,blank_for_now,"{""follower_count"":105429848,""following_count"":...",https://t.co/ymnea3Wf69,multimedia,,no longer populated,,2025-05-24T14:18:32.000Z,post data removed,,120723,21960,12400,34815923,,,
3,520771899,https://twitter.com/realDonaldTrump/status/192...,,Twitter,14001198,Donald J. Trump,blank_for_now,blank_for_now,"{""follower_count"":105430891,""following_count"":...","“THE ONE, BIG, BEAUTIFUL BILL” has PASSED the ...",text,,no longer populated,,2025-05-22T13:44:04.000Z,post data removed,,360624,57659,39828,52727374,,,
4,519932850,https://twitter.com/realDonaldTrump/status/192...,,Twitter,14001198,Donald J. Trump,blank_for_now,blank_for_now,"{""follower_count"":105421518,""following_count"":...",https://t.co/ZuagVWL4KS,multimedia,,no longer populated,,2025-05-21T14:47:03.000Z,post data removed,https://www.junkipedia.org/rails/active_storag...,564321,71124,43222,81481354,,,


Above, the `post_type_true_label` column contains the manually labeled classification of the tweets. The `post_body_text` column contains the text of the tweet itself.

**Note that only a subset of the tweets are labeled, so we'll filter the DataFrame to only include those tweets that have a label.**

In [5]:
labeled = df.dropna(subset=["post_type_true_label"]).reset_index()

In [13]:
rows, columns = labeled.shape
print(f"Number of labeled examples: {rows}")
print(f"Number of columns: {columns}")

Number of labeled examples: 30
Number of columns: 25
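
The filtering step can be illustrated on a small stand-in DataFrame. This is a minimal sketch, assuming a hypothetical four-row `toy` frame rather than the real CSV; it shows that `dropna(subset=...)` keeps only labeled rows, and that `reset_index()` without `drop=True` keeps the old index as an extra `index` column, which affects the column count.

```python
import pandas as pd

# Hypothetical stand-in data; the real notebook loads trump_tweets_sample_labeled.csv
toy = pd.DataFrame({
    "post_body_text": ["a", "b", "c", "d"],
    "post_type_true_label": ["text", None, "both", None],
})

# Keep only rows that have a label; reset_index() stores the old index in a new column
kept = toy.dropna(subset=["post_type_true_label"]).reset_index()
print(list(kept.columns))      # ['index', 'post_body_text', 'post_type_true_label']
print(kept["index"].tolist())  # [0, 2]

# With drop=True the old index is discarded instead of being kept as a column
kept_drop = toy.dropna(subset=["post_type_true_label"]).reset_index(drop=True)
print(list(kept_drop.columns))  # ['post_body_text', 'post_type_true_label']

# Count how many examples carry each label
print(kept["post_type_true_label"].value_counts())
```

Checking the label distribution with `value_counts()` is a cheap way to see how balanced the labeled subset is before classifying it.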


## Classify tweets by type

Now we'll use Ollama to classify each tweet as text-only, multimedia-only, or both.

Prepare a [ChatPromptTemplate](https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html#chatprompttemplate) to generate prompts for the LLM, which allows you to inject the text of each tweet into the prompt.

- The `system` message below gives the LLM a "persona" -- a common strategy used to optimize the results of an LLM interaction -- and a set of instructions on how to classify the tweets. 
- The `user` message will contain the tweet text that should be classified. It uses curly braces to templatize the text (`{tweet_text}`), in a way that is reminiscent of Python ["f"-strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings).
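
The placeholder mechanism can be sketched with plain Python string formatting. This illustrates the general idea only and is not LangChain's own implementation; the `template` string below is a hypothetical example, not the notebook's prompt.

```python
# A template is defined once, with a named placeholder in curly braces
template = "Classify this tweet: {tweet_text}"  # hypothetical template

# The placeholder is filled in later, once per tweet
filled = template.format(tweet_text="https://t.co/ttc8fsXNVF")
print(filled)  # Classify this tweet: https://t.co/ttc8fsXNVF

# An f-string, by contrast, is evaluated immediately where it is written
tweet_text = "https://t.co/b19JtrkCiS"
immediate = f"Classify this tweet: {tweet_text}"
print(immediate)  # Classify this tweet: https://t.co/b19JtrkCiS
```

The practical difference is that a template can be stored and reused for every row, while an f-string needs its variables to exist at the point of definition.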

In [14]:
prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful assistant that classifies tweets. For each tweet, determine if it is:\n"
        " - Text-only - i.e. there is only text and no images or videos\n"
        " - Multimedia-only - i.e. there is only a link that starts with 'https://t.co' and no other text, "
        "although text representations of emojis should be treated as media\n"
        " - Text and Multimedia - i.e. there is both text and at least one image or video\n\n"
        "Return a single word classification for each tweet: 'text', 'multimedia', or 'both'. "
        "Make sure your 1-word classification is in lowercase and does not include any punctuation."
    )
    ),
    ("user", "{tweet_text}"),
])

chat = OllamaLLM(model="gemma3")

Classify the tweets, similar to the [classify_trump_tweets.ipynb](classify_trump_tweets.ipynb) notebook.

> Note that this may take 10 or more seconds to run, depending on the number of tweets and the speed of your Ollama server.

In [16]:
# Create a dictionary to store the results of the classification
data = {}
# Loop through the DataFrame and classify each tweet
for index, row in labeled.iterrows():
    post_id = row['PostId']
    post_text = row['post_body_text']
    # Generate the prompt by "injecting" the tweet text into the prompt
    compiled_prompt = prompt.format(tweet_text=post_text)
    # Call the model with the prompt
    # Note: The model may take a few seconds to respond, 
    # depending on the size of the model and the complexity of the prompt
    response = chat.invoke(compiled_prompt)
    # Store the response in the dictionary using the post_id as the key
    data[post_id] = {
        'post_id': post_id,
        'post_body_text': post_text,
        'post_type_true_label': row['post_type_true_label'],
        'post_type_pred': response.strip() # This is the predicted label
    }
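
The prompt asks for a single lowercase word without punctuation, but a model may not always comply. Below is a minimal sketch of a post-processing step, assuming a hypothetical helper `normalize_label` and a hypothetical `unknown` fallback value; neither is part of the original notebook.

```python
import string

# The three labels the prompt asks the model to return
VALID_LABELS = {"text", "multimedia", "both"}


def normalize_label(raw: str) -> str:
    """Trim whitespace and surrounding punctuation, lowercase, and validate.

    Returns 'unknown' (a hypothetical sentinel) when the cleaned response
    is not one of the expected labels.
    """
    cleaned = raw.strip(string.punctuation + string.whitespace).lower()
    return cleaned if cleaned in VALID_LABELS else "unknown"


print(normalize_label(" Both.\n"))                 # both
print(normalize_label("'multimedia'"))             # multimedia
print(normalize_label("This tweet is text-only"))  # unknown
```

Something like `normalize_label(response)` could stand in for `response.strip()` in the loop, so that unexpected responses are flagged instead of silently stored as labels.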

## Merge results with the DataFrame

Map the predictions onto the DataFrame, then review the results.

In [17]:
labeled['post_type_pred'] = labeled['PostId'].map(lambda x: data[x]['post_type_pred'])

In [18]:
labeled[['post_body_text', 'post_type_true_label', 'post_type_pred']]

Unnamed: 0,post_body_text,post_type_true_label,post_type_pred
0,https://t.co/ttc8fsXNVF,multimedia,multimedia
1,https://t.co/b19JtrkCiS,multimedia,multimedia
2,https://t.co/ymnea3Wf69,multimedia,multimedia
3,"“THE ONE, BIG, BEAUTIFUL BILL” has PASSED the ...",text,both
4,https://t.co/ZuagVWL4KS,multimedia,multimedia
5,https://t.co/q5CPyHZkLD,multimedia,multimedia
6,🇦🇪🇺🇸 https://t.co/7fRFm4zCL5,both,multimedia
7,🇶🇦🇺🇸 https://t.co/v1NwTQPWLO,both,multimedia
8,🇸🇦🇺🇸 https://t.co/i5cRnVmaFv,both,multimedia
9,THE SUPREME COURT IS BEING PLAYED BY THE RADIC...,text,text
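
As a quick sanity check ahead of the fuller downstream evaluation, the share of matching labels can be computed directly. The sketch below re-enters the ten rows shown above by hand in a hypothetical `preview` DataFrame; in the notebook itself the same comparison would presumably run on `labeled`.

```python
import pandas as pd

# True and predicted labels copied from the ten rows displayed above
preview = pd.DataFrame({
    "post_type_true_label": [
        "multimedia", "multimedia", "multimedia", "text", "multimedia",
        "multimedia", "both", "both", "both", "text",
    ],
    "post_type_pred": [
        "multimedia", "multimedia", "multimedia", "both", "multimedia",
        "multimedia", "multimedia", "multimedia", "multimedia", "text",
    ],
})

# Element-wise comparison gives a boolean Series; its mean is the accuracy
matches = preview["post_type_true_label"] == preview["post_type_pred"]
accuracy = matches.mean()
print(f"{matches.sum()} of {len(preview)} match ({accuracy:.0%})")  # 6 of 10 match (60%)

# A cross-tabulation shows which labels are confused with which
print(pd.crosstab(preview["post_type_true_label"], preview["post_type_pred"]))
```

In these ten rows, the mismatches are the tweet labeled `text` that was predicted as `both` and the three emoji-plus-link tweets labeled `both` that were predicted as `multimedia`.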


## Export the results

Finally, export the results to a CSV file. This file can be used for downstream evaluation or analysis.

In [19]:
labeled.to_csv("trump_tweets_sample_labeled_with_predictions.csv", index=False)
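
The effect of `index=False` can be seen with a small round trip. This is a minimal sketch, assuming a hypothetical two-row `toy` frame and an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the labeled DataFrame
toy = pd.DataFrame({"PostId": [1, 2], "post_type_pred": ["text", "both"]})

# With index=False, only the data columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))  # ['PostId', 'post_type_pred']

# Without it, the row index is written too and comes back as 'Unnamed: 0'
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)
print(list(restored_with_index.columns))  # ['Unnamed: 0', 'PostId', 'post_type_pred']
```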