# Label Trump Tweets

> Before you start, make sure you've followed the setup instructions in the [README](README.md).

Ingest a CSV of Trump's tweets, downloaded from Junkipedia.org, and use Ollama to identify which posts are text-based vs image/media-based.

After completing this basic exercise, you can expand the analysis to perform other classification tasks, such as:

* Assign sentiment (positive, negative, neutral)
* Assign one or more topics (free-form)
* Assign one of a set of pre-set categories (e.g. politics, business, etc.)
* Perform entity extraction

## Ways to experiment

Experiment with different prompts to see how well a local model can classify the tweets.

- One big prompt that tries to do everything
- Multiple smaller prompts that do one thing each

As you perform the above work, store the results in a way that lets you link the LLM results with the original data. You can do this using a simple dictionary (demonstrated below), or by adding new columns (e.g. `text_based`, `sentiment`, `topics`, `category`). 

You may need multiple versions of each column type to capture the different results from different prompts. For example, you might have `sentiment_v1`, `sentiment_v2`, etc.
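
One way to keep LLM output linked to the original rows is to key the results by post ID and map them onto new columns. The sketch below assumes a `PostId` column like the one used later in this notebook; the sample rows, the result dictionaries, and the `text_based_v1` / `text_based_v2` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV; only the column names mirror this notebook.
df_demo = pd.DataFrame({
    "PostId": [101, 102, 103],
    "post_body_text": [
        "Great rally tonight!",
        "https://t.co/abc123",
        "Big news https://t.co/xyz789",
    ],
})

# Hypothetical results from two prompt versions, keyed by PostId.
results_v1 = {101: "text", 102: "multimedia", 103: "both"}
results_v2 = {101: "text", 102: "multimedia"}  # post 103 not processed yet

# Series.map looks up each PostId in the dictionary; missing IDs become NaN.
df_demo["text_based_v1"] = df_demo["PostId"].map(results_v1)
df_demo["text_based_v2"] = df_demo["PostId"].map(results_v2)

print(df_demo[["PostId", "text_based_v1", "text_based_v2"]])
```

Keeping one column per prompt version makes it easy to compare versions side by side, and rows that a given prompt has not processed yet simply show up as missing values.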

## Prompt Templates

Use chat [Prompt Templates](https://python.langchain.com/docs/concepts/prompt_templates/) as part of your workflow. They let you inject the text of each post into a prompt, for example inside a `for` loop.

*See LangChain's [Prompt Templates guide](https://python.langchain.com/docs/how_to/#prompt-templates) for more details.*

## Structured output

As your prompts get more sophisticated and the resulting output more complex, you may also want to use LangChain's [structured output](https://python.langchain.com/docs/how_to/structured_output/) feature to parse the results into a structured format.
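
To illustrate what "structured" means here, the sketch below parses a JSON reply by hand and validates it against a fixed set of labels. It is a plain-Python stand-in, not LangChain's structured output API; the `TweetLabels` class, the `parse_labels` helper, the allowed sentiment values, and the sample reply are all assumptions made for this example.

```python
import json
from dataclasses import dataclass

# Assumed label set for illustration; adjust it to match your own prompt.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass
class TweetLabels:
    sentiment: str
    topics: list[str]


def parse_labels(raw_reply: str) -> TweetLabels:
    """Parse a JSON reply from the model, raising ValueError if it is unusable."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {raw_reply!r}") from exc
    if not isinstance(payload, dict):
        raise ValueError(f"Expected a JSON object, got: {raw_reply!r}")

    # Normalize case and whitespace before checking against the allowed set.
    sentiment = str(payload.get("sentiment", "")).strip().lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"Unexpected sentiment: {sentiment!r}")

    topics_raw = payload.get("topics", [])
    if not isinstance(topics_raw, list):
        raise ValueError(f"Expected a list of topics, got: {topics_raw!r}")
    topics = [str(topic).strip() for topic in topics_raw]

    return TweetLabels(sentiment=sentiment, topics=topics)


# Hypothetical reply text, standing in for what a model might return.
example = parse_labels('{"sentiment": "Negative", "topics": ["trade", "tariffs"]}')
print(example)
```

Rejecting malformed replies up front means a bad response shows up as an error you can log and retry, rather than as a silently wrong value in your results.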


## Ingest the data

We'll start with some basic imports and then load the CSV file containing Trump's tweets.


In [None]:
import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM

In [None]:
df = pd.read_csv("trump_tweets_sample.csv")

Review the data to see what columns are available. You can use the `head()` method to view the first few rows of the DataFrame.

In [None]:
df.head()
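
Beyond eyeballing the first rows, it can help to confirm that the columns the classification loop reads (`PostId` and `post_body_text`) are present, and to decide how to treat posts with no text. The sketch below is one possible check; the `check_tweets` helper and the choice to replace missing text with an empty string are assumptions, and the sample frame is invented.

```python
import pandas as pd

# Columns read by the classification loop later in this notebook.
REQUIRED_COLUMNS = ["PostId", "post_body_text"]


def check_tweets(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the required columns verified and missing text filled in."""
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")

    cleaned = frame.copy()
    # Assumption: posts without text should reach the prompt as an empty
    # string rather than as a NaN value.
    cleaned["post_body_text"] = cleaned["post_body_text"].fillna("").astype(str)
    return cleaned


# Hypothetical two-row frame standing in for the real CSV.
demo = pd.DataFrame({"PostId": [1, 2], "post_body_text": ["Hello", None]})
checked = check_tweets(demo)
print(checked)
```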

## Classify tweets by type

Now we'll demonstrate how to classify tweets as text-based or image/media-based using a local LLM served by Ollama.

Prepare a [ChatPromptTemplate](https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html#chatprompttemplate) for classifying tweets as text-based or media-based. This template will be used to generate prompts for the LLM, and allows you to easily inject the text of each tweet into the prompt.

- The `system` message below gives the LLM a "persona" -- a common strategy used to optimize the results of an LLM interaction -- and a set of instructions on how to classify the tweets. 
- The `user` message will contain the tweet text that should be classified. It uses curly braces to templatize the text (`{tweet_text}`), in a way that is reminiscent of Python ["f"-strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings).

```python

In [None]:
prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful assistant that classifies tweets. For each tweet, determine if it is:\n"
        " - Text-only - i.e. there is only text and no images or videos\n"
        " - Multimedia-only - i.e. there is only a link that starts with 'https://t.co'\n"
        " - Text and Multimedia - i.e. there is both text and at least one image or video\n\n"
        "Return a single word classification for each tweet: 'text', 'multimedia', or 'both'. "
        "Make sure your 1-word classification is in lowercase and does not include any punctuation."
    )),
    ("user", "{tweet_text}"),
])

chat = OllamaLLM(model="gemma3")

Test a small subset of tweets to verify that the prompts are working as expected.

In [None]:
# Create a dictionary to store the results of the classification
data = {}
# We'll use a counter to limit the number of tweets processed, to speed up testing
counter = 0
# Loop through the DataFrame and classify each tweet
for index, row in df.iterrows():
    # Limit the number of tweets processed to 5 for testing. 
    if counter == 5:
        break
    post_id = row['PostId']
    post_text = row['post_body_text']
    # Generate the prompt by "injecting" the tweet text into the prompt
    compiled_prompt = prompt.format(tweet_text=post_text)
    # Call the model with the prompt
    # Note: The model may take a few seconds to respond, 
    # depending on the size of the model and the complexity of the prompt
    response = chat.invoke(compiled_prompt)
    # Store the response in the dictionary using the post_id as the key
    data[post_id] = {
        'post_id': post_id,
        'post_body_text': post_text,
        'classification': response.strip()
    }
    counter += 1

Now print and review the results. As you do this, consider the following questions:

- Did the model classify the tweets correctly?
- Did the `classification` field contain the expected values ("text", "multimedia", or "both")? Or did it contain other values (e.g. "Text" or "MULTIMEDIA" or "text-and-media")?

In [None]:
from pprint import pprint
pprint(data)

## Keep experimenting

Depending on the results of your initial tests, you may want to try different prompts or prompt templates. 

Specifically, you'll want to experiment with different ways of phrasing the classification task in the `system` message. 

You can also try different models, such as Llama 2 or Mistral, to see if they produce better results. To do this, use Ollama to install the models you want to test, then change the `model` argument passed to `OllamaLLM`. For example, to use DeepSeek, change the model name from `gemma3` to `deepseek-llm`.

You can search for available models on the [Ollama website](https://ollama.com/search).
```
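
When reviewing results across prompts or models, a small audit can show how often the replies stray from the three labels the system prompt asks for. The sketch below assumes results shaped like the `data` dictionary built earlier (a `classification` string per post ID); the `normalize_label` and `audit_labels` helpers and the sample values are hypothetical.

```python
import string
from collections import Counter

# The three labels the system prompt asks the model to return.
EXPECTED_LABELS = {"text", "multimedia", "both"}


def normalize_label(raw: str) -> str:
    """Lowercase a reply and trim surrounding whitespace, quotes, and punctuation."""
    return raw.strip(string.punctuation + string.whitespace).lower()


def audit_labels(results: dict) -> tuple[Counter, dict]:
    """Count normalized labels and collect raw replies that are still unexpected."""
    counts = Counter()
    unexpected = {}
    for post_id, record in results.items():
        label = normalize_label(record["classification"])
        counts[label] += 1
        if label not in EXPECTED_LABELS:
            unexpected[post_id] = record["classification"]
    return counts, unexpected


# Hypothetical results in the same shape as the `data` dictionary.
sample = {
    1: {"classification": "Text"},
    2: {"classification": "'multimedia'."},
    3: {"classification": "text-and-media"},
}
counts, unexpected = audit_labels(sample)
print(counts)
print(unexpected)
```

Running the same audit after each prompt or model change gives a rough, comparable measure of how well the model follows the output instructions.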