# Evaluate Classification of Trump Tweets

> Before you start, make sure you've followed the setup instructions in the [README](README.md).

Evaluate the accuracy of LLM classification of a set of Trump tweets. The data was exported from Junkipedia, manually labeled, and then [run through one round of classification](classify_trump_tweets_for_export.ipynb) using Ollama.

The labels are `text`, `multimedia` or `both`, and signify whether the `post_body_text` field contains pure text, only an image or video, or both text and multimedia. The manually labeled data is in the `post_type_true_label` column, while the LLM classification is in `post_type_pred`.

## Ingest the data

We'll start with some basic imports and then load the CSV file containing Trump's tweets.


In [21]:
import pandas as pd

In [22]:
df = pd.read_csv("trump_tweets_sample_labeled_with_predictions.csv")

Review the data to see what columns are available.

In [23]:
# For better readability in Jupyter notebooks
pd.set_option("display.max_columns", None)
df.head()

|  | index | PostId | PostUrl | PostEngagement | Platform | ChannelID | ChannelName | ChannelUid | ChannelUrl | ChannelEngagement | post_body_text | post_type_true_label | GoogleAudioText | VoskAudioText | EmbeddedContentText | published_at | post_data | post_media_urls | LikesCount | SharesCount | CommentsCount | ViewsCount | post_media_file | embedded_post_text | search_data | post_type_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 528452546 | https://twitter.com/realDonaldTrump/status/192... |  | Twitter | 14001198 | Donald J. Trump | blank_for_now | blank_for_now | {"follower_count":105433744,"following_count":... | https://t.co/ttc8fsXNVF | multimedia |  | no longer populated |  | 2025-05-31T12:54:08.000Z | post data removed | https://www.junkipedia.org/rails/active_storag... | 374413 | 52340 | 18280 | 13598142 |  |  |  | multimedia |
| 1 | 1 | 524629281 | https://twitter.com/realDonaldTrump/status/192... |  | Twitter | 14001198 | Donald J. Trump | blank_for_now | blank_for_now | {"follower_count":105433386,"following_count":... | https://t.co/b19JtrkCiS | multimedia |  | no longer populated |  | 2025-05-26T19:50:36.000Z | post data removed | https://www.junkipedia.org/rails/active_storag... | 328895 | 46401 | 21163 | 30490858 |  |  |  | multimedia |
| 2 | 2 | 522872445 | https://twitter.com/realDonaldTrump/status/192... |  | Twitter | 14001198 | Donald J. Trump | blank_for_now | blank_for_now | {"follower_count":105429848,"following_count":... | https://t.co/ymnea3Wf69 | multimedia |  | no longer populated |  | 2025-05-24T14:18:32.000Z | post data removed |  | 120723 | 21960 | 12400 | 34815923 |  |  |  | multimedia |
| 3 | 3 | 520771899 | https://twitter.com/realDonaldTrump/status/192... |  | Twitter | 14001198 | Donald J. Trump | blank_for_now | blank_for_now | {"follower_count":105430891,"following_count":... | “THE ONE, BIG, BEAUTIFUL BILL” has PASSED the ... | text |  | no longer populated |  | 2025-05-22T13:44:04.000Z | post data removed |  | 360624 | 57659 | 39828 | 52727374 |  |  |  | both |
| 4 | 4 | 519932850 | https://twitter.com/realDonaldTrump/status/192... |  | Twitter | 14001198 | Donald J. Trump | blank_for_now | blank_for_now | {"follower_count":105421518,"following_count":... | https://t.co/ZuagVWL4KS | multimedia |  | no longer populated |  | 2025-05-21T14:47:03.000Z | post data removed | https://www.junkipedia.org/rails/active_storag... | 564321 | 71124 | 43222 | 81481354 |  |  |  | multimedia |


Verify the row count (there should be 30 rows).

In [24]:
rows, cols = df.shape
rows

30

And review the subset of columns that are relevant to the evaluation.

> NOTE: You should notice a number of predictions do not match the true labels. This is expected, and will help in providing a more realistic evaluation :) 

In [25]:
df[['post_body_text', 'post_type_true_label', 'post_type_pred']]

|  | post_body_text | post_type_true_label | post_type_pred |
|---|---|---|---|
| 0 | https://t.co/ttc8fsXNVF | multimedia | multimedia |
| 1 | https://t.co/b19JtrkCiS | multimedia | multimedia |
| 2 | https://t.co/ymnea3Wf69 | multimedia | multimedia |
| 3 | “THE ONE, BIG, BEAUTIFUL BILL” has PASSED the ... | text | both |
| 4 | https://t.co/ZuagVWL4KS | multimedia | multimedia |
| 5 | https://t.co/q5CPyHZkLD | multimedia | multimedia |
| 6 | 🇦🇪🇺🇸 https://t.co/7fRFm4zCL5 | both | multimedia |
| 7 | 🇶🇦🇺🇸 https://t.co/v1NwTQPWLO | both | multimedia |
| 8 | 🇸🇦🇺🇸 https://t.co/i5cRnVmaFv | both | multimedia |
| 9 | THE SUPREME COURT IS BEING PLAYED BY THE RADIC... | text | text |


## Evaluate the results

Now that we've imported the data, we can evaluate the LLM's predictions against our manually labeled data.

We'll start by importing some evaluation functions from [scikit-learn](https://scikit-learn.org/stable/), which will help us calculate accuracy, precision, recall, and F1 score.

For an overview of these metrics, see:

- [Precision, Recall, and F1 Score Explained](https://rumn.medium.com/precision-recall-and-f1-explained-with-10-ml-use-case-6ef2fbe458e5)
- [Wikipedia article on precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
- [Wikipedia article on F1 score](https://en.wikipedia.org/wiki/F1_score)

In [26]:
from sklearn.metrics import (
    accuracy_score, 
    classification_report, 
    precision_recall_fscore_support, 
    confusion_matrix
)

## Accuracy

Accuracy is a common metric for evaluating classification models, and is defined as the number of correct predictions divided by the total number of predictions.

In [27]:
accuracy = (df['post_type_true_label'] == df['post_type_pred']).mean()
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.73


This metric is useful when we have a similar number of examples for each class (`text`, `multimedia`, and `both`), but can be misleading if one class is much more common than the others.

For example, if 90% of the examples are `text`, a model that always predicts `text` will have an accuracy of 90%, but it won't be very useful for identifying the other classes.
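To make the imbalance problem concrete, here is a minimal sketch in plain Python. The 90/5/5 class split and the always-`text` "model" are hypothetical, chosen only to mirror the example above; they are not taken from our dataset.

```python
# Hypothetical, imbalanced label set: 90 `text`, 5 `multimedia`, 5 `both`.
y_true = ["text"] * 90 + ["multimedia"] * 5 + ["both"] * 5

# A "model" that ignores its input and always answers `text`.
y_pred = ["text"] * len(y_true)

# Accuracy: share of predictions that match the true label.
baseline_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall per class: share of each true class that the model recovered.
baseline_recall = {}
for label in ["text", "multimedia", "both"]:
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    baseline_recall[label] = sum(p == label for p in preds_for_label) / len(preds_for_label)

print(f"Accuracy: {baseline_accuracy:.2f}")  # Accuracy: 0.90
print(baseline_recall)  # {'text': 1.0, 'multimedia': 0.0, 'both': 0.0}
```

The accuracy looks strong, yet the model never finds a single `multimedia` or `both` tweet, which is exactly what the per-class metrics in the next section are designed to expose.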

Let's more closely examine the accuracy of the LLM's predictions using precision, recall, and F1 score.

## Precision, Recall and F1 Score

Precision, recall and the F1 score are additional metrics that are often used to evaluate classification models.

Precision is the number of true positives divided by the number of true positives plus false positives. It measures how many of the predicted positive cases were actually positive. For example, if the model predicts 10 tweets as text-based, but only 5 of them are actually text-based, then the precision would be 0.5 (or 50%).

Recall is the number of true positives divided by the number of true positives plus false negatives. It measures how many of the actual positive cases were predicted as positive. For example, if there are 10 actual text-based tweets, and the model predicts 8 of them as text-based, then the recall would be 0.8 (or 80%).

The F1 score is the harmonic mean of precision and recall. It provides a single score that balances both the precision and recall, making it a useful metric when you need to take both false positives and false negatives into account. The F1 score ranges from 0 to 1, with 1 being the best possible score.
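The definitions above can be sketched as three small functions. The function names are hypothetical helpers for illustration (scikit-learn computes all of this for us below), and returning `0.0` when a denominator is zero is an assumption of this sketch.

```python
def precision_from_counts(tp, fp):
    """True positives divided by everything predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall_from_counts(tp, fn):
    """True positives divided by everything that is actually positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def harmonic_f1(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    total = precision_value + recall_value
    return 2 * precision_value * recall_value / total if total else 0.0


# Precision example from the text: 10 tweets predicted as text, 5 of them correct.
example_precision = precision_from_counts(tp=5, fp=5)

# Recall example from the text: 10 actual text tweets, 8 of them found.
example_recall = recall_from_counts(tp=8, fn=2)

# If one classifier had both of those scores, its F1 would sit between them,
# pulled toward the lower value.
example_f1 = harmonic_f1(example_precision, example_recall)

print(example_precision)      # 0.5
print(example_recall)         # 0.8
print(f"{example_f1:.2f}")    # 0.62
```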

Let's calculate these metrics for our model's predictions.

In [28]:
label_names = ["text", "multimedia", "both"]
precision, recall, f1, _ = precision_recall_fscore_support(
    df['post_type_true_label'], 
    df['post_type_pred'],
    labels=label_names,
    average=None
)

Now let's create a DataFrame to hold the evaluation results. We'll include the precision, recall and F1 score for each label type.

In [29]:
metrics_df = pd.DataFrame({
    'label': label_names,
    'precision': precision,
    'recall': recall,
    'f1-score': f1
})
print(metrics_df)

        label  precision  recall  f1-score
0        text   1.000000     0.6  0.750000
1  multimedia   0.714286     1.0  0.833333
2        both   0.600000     0.6  0.600000


### Interpret the results

The above results show:

- The precision for the `text` label is 1.0, meaning that all of the tweets classified as text-based were actually text-based. However, the recall is only 0.6, meaning that only 60% of the actual text-based tweets were classified as such. The model is very precise in its classification of text-based tweets, but it misses some of them and assigns them a different label instead (the confusion matrix below shows they all end up as `both`).
- The precision for the `multimedia` label is 0.714286, meaning that 71.43% of the tweets classified as multimedia were actually multimedia. The recall is 1.0, meaning that all of the actual multimedia tweets were classified as such. In other words, the model found all of the multimedia tweets, but it also applied the `multimedia` label to tweets from another category (the confusion matrix below shows these are all `both` tweets).
- The precision for the `both` label is 0.6, meaning that only 60% of the tweets classified as `both` were actually both. The recall is also 0.6, meaning that 60% of the actual tweets in this category were classified as such. The model struggles with tweets that contain both text and multimedia: it misses 40% of them, and it also applies the `both` label to tweets that belong to a different category.

## Classification Report

Deriving precision, recall and F1 scores by class (i.e., our labels of `text`, `multimedia`, and `both`) is such a common task that scikit-learn provides a built-in function to generate a classification report. This report includes precision, recall, F1 score, and support (the number of true instances for each label) in a single table.

It also provides an overall accuracy score, which we derived earlier.

In [30]:
print("\nDetailed Classification Report:")
print(
    classification_report(
        df['post_type_true_label'],
        df['post_type_pred'],
        labels=label_names
    )
)


Detailed Classification Report:
              precision    recall  f1-score   support

        text       1.00      0.60      0.75        10
  multimedia       0.71      1.00      0.83        10
        both       0.60      0.60      0.60        10

    accuracy                           0.73        30
   macro avg       0.77      0.73      0.73        30
weighted avg       0.77      0.73      0.73        30
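The `macro avg` and `weighted avg` rows can be reproduced by hand. The sketch below copies the per-class scores and support printed earlier into a plain dictionary (the dictionary and helper names are hypothetical) and shows why the two averages coincide here: every label has the same support of 10.

```python
# Per-class scores and support, copied from the outputs above.
scores = {
    "text":       {"precision": 1.000000, "recall": 0.6, "f1": 0.750000, "support": 10},
    "multimedia": {"precision": 0.714286, "recall": 1.0, "f1": 0.833333, "support": 10},
    "both":       {"precision": 0.600000, "recall": 0.6, "f1": 0.600000, "support": 10},
}


def macro_avg(metric):
    """Unweighted mean across labels: every label counts equally."""
    return sum(row[metric] for row in scores.values()) / len(scores)


def weighted_avg(metric):
    """Mean across labels, weighted by how many true examples each label has."""
    total_support = sum(row["support"] for row in scores.values())
    return sum(row[metric] * row["support"] for row in scores.values()) / total_support


for metric in ("precision", "recall", "f1"):
    print(f"{metric}: macro={macro_avg(metric):.2f} weighted={weighted_avg(metric):.2f}")
# precision: macro=0.77 weighted=0.77
# recall: macro=0.73 weighted=0.73
# f1: macro=0.73 weighted=0.73
```

With unequal support the two rows would diverge, and the weighted average would lean toward the scores of the most common label.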



## Confusion Matrix

We can also construct a confusion matrix to visualize the results. The confusion matrix will provide a glimpse of how the model's predictions compare to the true labels, and can help identify the nature of the mistakes the model is making.

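For example, it might show that the model is classifying some text-based tweets as multimedia, or vice versa. Because we pass `normalize='index'` to `pd.crosstab`, each row is expressed as a proportion of the tweets with that actual label, so every row sums to 1.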

In [31]:
conf_matrix = pd.crosstab(
    df['post_type_true_label'], 
    df['post_type_pred'], 
    rownames=['Actual'], 
    colnames=['Predicted'], 
    normalize='index'
)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
Predicted   both  multimedia  text
Actual                            
both         0.6         0.4   0.0
multimedia   0.0         1.0   0.0
text         0.4         0.0   0.6


### Interpret the confusion matrix

Above, we can see that:

- Of the tweets labeled `both`, the model correctly classified 60% as `both` and misclassified 40% as `multimedia`.
- Of the tweets labeled `multimedia`, the model correctly classified 100% as `multimedia`.
- Of the tweets labeled `text`, the model correctly classified 60% as `text` and misclassified 40% as `both`.
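The diagonal of this row-normalized matrix is the recall for each label. Precision needs raw counts instead, because it is computed down each column. As a rough sketch, we can recover the counts by multiplying each row by the support of 10 reported in the classification report; the variable names here are hypothetical. In the notebook itself, dropping `normalize='index'` from the `pd.crosstab` call would give the counts directly.

```python
labels = ["both", "multimedia", "text"]

# Row-normalized matrix as printed above: outer key = actual, inner key = predicted.
row_normalized = {
    "both":       {"both": 0.6, "multimedia": 0.4, "text": 0.0},
    "multimedia": {"both": 0.0, "multimedia": 1.0, "text": 0.0},
    "text":       {"both": 0.4, "multimedia": 0.0, "text": 0.6},
}

# Support per label, taken from the classification report.
support = {"both": 10, "multimedia": 10, "text": 10}

# Convert proportions back into counts.
counts = {
    actual: {pred: round(row_normalized[actual][pred] * support[actual]) for pred in labels}
    for actual in labels
}

# Recall reads across a row; precision reads down a column.
recall_by_label = {label: counts[label][label] / support[label] for label in labels}
predicted_totals = {pred: sum(counts[actual][pred] for actual in labels) for pred in labels}
precision_by_label = {label: counts[label][label] / predicted_totals[label] for label in labels}

print(predicted_totals)    # {'both': 10, 'multimedia': 14, 'text': 6}
print(recall_by_label)     # {'both': 0.6, 'multimedia': 1.0, 'text': 0.6}
print({k: round(v, 2) for k, v in precision_by_label.items()})
# {'both': 0.6, 'multimedia': 0.71, 'text': 1.0}
```

The model predicted `multimedia` 14 times for only 10 actual `multimedia` tweets, which is where that label's lower precision comes from.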

## Summary

The evaluation of the LLM's classification of tweets reveals some new insights while confirming what can be observed by simply reviewing the data above.

Most notably, the model has perfect recall for the `multimedia` class, but this seems to be because it is overly aggressive in classifying tweets as `multimedia`. Specifically, it mistakenly assigns that label to 40% of the `both` tweets, leading to a lower precision and F1 score for that class.

Meanwhile, `text` tweets are classified with perfect precision, but the model misses 40% of them (labeling them `both` instead), leading to a lower recall score.

The `both` class is also problematic, with precision and recall of only 0.6, indicating that the model struggles to classify tweets that contain both text and multimedia.

While these results are not ideal, they do provide a useful baseline for further improvements.

For example, we could more carefully examine the tweets that are misclassified and try to identify patterns in the mistakes.
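As a starting point for that kind of review, here is a small sketch that tallies error patterns. To keep it runnable on its own, it uses only the ten rows displayed earlier (with their truncated text left as shown) rather than the full DataFrame; in the notebook, the equivalent filter would be `df[df['post_type_true_label'] != df['post_type_pred']]`.

```python
from collections import Counter

# (post_body_text, true label, predicted label) for the ten rows shown earlier.
rows = [
    ("https://t.co/ttc8fsXNVF", "multimedia", "multimedia"),
    ("https://t.co/b19JtrkCiS", "multimedia", "multimedia"),
    ("https://t.co/ymnea3Wf69", "multimedia", "multimedia"),
    ("“THE ONE, BIG, BEAUTIFUL BILL” has PASSED the ...", "text", "both"),
    ("https://t.co/ZuagVWL4KS", "multimedia", "multimedia"),
    ("https://t.co/q5CPyHZkLD", "multimedia", "multimedia"),
    ("🇦🇪🇺🇸 https://t.co/7fRFm4zCL5", "both", "multimedia"),
    ("🇶🇦🇺🇸 https://t.co/v1NwTQPWLO", "both", "multimedia"),
    ("🇸🇦🇺🇸 https://t.co/i5cRnVmaFv", "both", "multimedia"),
    ("THE SUPREME COURT IS BEING PLAYED BY THE RADIC...", "text", "text"),
]

# Keep only the rows where the prediction disagrees with the manual label.
misclassified = [row for row in rows if row[1] != row[2]]

# Count each (true label, predicted label) pair to surface recurring mistakes.
error_patterns = Counter((true, pred) for _, true, pred in misclassified)

for (true, pred), count in error_patterns.most_common():
    print(f"{true} -> {pred}: {count}")
# both -> multimedia: 3
# text -> both: 1
```

In this small slice, the three `both` tweets predicted as `multimedia` are all flag emoji followed by a link, which hints that very short text next to a URL may be getting overlooked.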

We could then attempt to improve the system prompt to better account for the nuances of each class, perform some preprocessing on the tweets to make them easier to classify, or apply both techniques.
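One possible preprocessing step, offered only as a hypothetical sketch, is to separate links from the remaining text before prompting the model. The `split_text_and_links` helper and its regular expression are assumptions for illustration and are not part of the original classification notebook. A link does not prove that a tweet contains an image or video, so the output would be an extra signal for the prompt rather than a label.

```python
import re

# Matches http(s) URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def split_text_and_links(post_body_text):
    """Return (text with links removed, list of links found)."""
    links = URL_PATTERN.findall(post_body_text)
    remaining_text = URL_PATTERN.sub("", post_body_text).strip()
    return remaining_text, links


# A link-only tweet leaves no text behind.
print(split_text_and_links("https://t.co/ttc8fsXNVF"))
# ('', ['https://t.co/ttc8fsXNVF'])

# A tweet with flag emoji plus a link keeps the emoji as its text.
print(split_text_and_links("🇦🇪🇺🇸 https://t.co/7fRFm4zCL5"))
# ('🇦🇪🇺🇸', ['https://t.co/7fRFm4zCL5'])
```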

It's worth noting that LLMs sometimes catch subtle nuances that may not be immediately obvious to a human reviewer. It's quite possible that the LLM's classification is more accurate than the manual labeling in some cases, and can provide a useful starting point for further refinement of the system prompt or other improvements to the classification process.

Lastly, the evaluation metrics we calculated are just a starting point. Depending on the specific use case, other metrics such as [Cohen's Kappa](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html#sklearn.metrics.cohen_kappa_score) or [Matthews correlation coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html#sklearn.metrics.matthews_corrcoef) could also be useful for evaluating the model's performance.
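To give a feel for what Cohen's Kappa measures, here is a hand-rolled sketch that corrects the observed agreement for the agreement expected by chance. The counts are reconstructed from the normalized confusion matrix and the support of 10 per label, so treat them as an assumption rather than a direct export. In the notebook, `cohen_kappa_score(df['post_type_true_label'], df['post_type_pred'])` from `sklearn.metrics` is the more direct route.

```python
labels = ["text", "multimedia", "both"]

# Reconstructed counts: outer key = actual label, inner key = predicted label.
counts = {
    "text":       {"text": 6, "multimedia": 0,  "both": 4},
    "multimedia": {"text": 0, "multimedia": 10, "both": 0},
    "both":       {"text": 0, "multimedia": 4,  "both": 6},
}

n = sum(sum(row.values()) for row in counts.values())

# Observed agreement: share of tweets where both labels match (the accuracy).
observed = sum(counts[label][label] for label in labels) / n

# Chance agreement: for each label, the probability that the manual label and
# the prediction would coincide if they were assigned independently.
actual_totals = {label: sum(counts[label].values()) for label in labels}
predicted_totals = {label: sum(counts[actual][label] for actual in labels) for label in labels}
expected = sum(actual_totals[label] * predicted_totals[label] for label in labels) / n**2

kappa = (observed - expected) / (1 - expected)

print(f"Observed agreement: {observed:.2f}")  # Observed agreement: 0.73
print(f"Chance agreement:   {expected:.2f}")  # Chance agreement:   0.33
print(f"Cohen's kappa:      {kappa:.2f}")     # Cohen's kappa:      0.60
```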