# Assignment 2: POTUS

---

## Task 1) President of the United States (Trump vs. Obama)

Surely, you're aware that the 45th President of the United States (@POTUS45) was an active user of Twitter until he was permanently banned on Jan 8, 2021.
You can still enjoy his greatness at the [Trump Twitter Archive](https://www.thetrumparchive.com/). We will be using original tweets only, so make sure to remove all retweets.
Another fan of Twitter was Barack Obama (@POTUS44), who used the platform in a rather professional way.
Please also consider the POTUS Tweets of Joe Biden; we will be using those for testing.

### Data

There are multiple ways to get the data, but the easiest way is to download the files from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group. 
Another way is to directly use the data from [Trump Twitter Archive](https://www.thetrumparchive.com/), [Obama Kaggle](https://www.kaggle.com/jayrav13/obama-white-house), and [Biden Kaggle](https://www.kaggle.com/rohanrao/joe-biden-tweets).
Before you get started, please download the files; you can put them into the data folder.

### N-gram Models

In this assignment, you will be doing some Twitter-related preprocessing and training n-gram models to be able to distinguish between Tweets of Trump, Obama, and Biden.
We will be using [NLTK](https://www.nltk.org), more specifically its [`lm`](https://www.nltk.org/api/nltk.lm.html) module.
Install the NLTK package within your working environment.
You can use some of the NLTK functions, but you have to implement the functions for likelihoods and perplexity from scratch.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [250]:
# Dependencies
import nltk
import pandas as pd
import re
from collections import Counter
import random

### Prepare the Data

1.1 Prepare all the Tweets. Since the `lm` module works on tokenized data, implement a tokenization method that strips unnecessary tokens but retains special words such as mentions (@...) and hashtags (#...).

1.2 Partition into training and test sets; hold out about 100 tweets per author, which we will be testing on later. As with any Machine Learning task, training and test data must not overlap.
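
One way to obtain a non-overlapping partition is sketched below. The helper name `split_train_test` and the `seed` parameter are assumptions for illustration and are not part of the provided template.

```python
import random


def split_train_test(tweets, num_test=100, seed=0):
    """Return (train, test) lists that share no list positions.

    Hypothetical helper: shuffles a copy of the input with a fixed seed,
    then cuts off the first `num_test` items as the test set.
    """
    if not 0 < num_test < len(tweets):
        raise ValueError("num_test must be between 1 and len(tweets) - 1")
    shuffled = list(tweets)                # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[num_test:], shuffled[:num_test]


# Toy data standing in for the loaded tweets
tweets = [f"tweet number {i}" for i in range(10)]
train, test = split_train_test(tweets, num_test=3)
```

Identical tweet texts that occur more than once in the data could still end up on both sides; deduplicating first (for example with `list(dict.fromkeys(tweets))`) avoids that.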

In [251]:
# Notice: ignore retweets 
def isRetweet(text: str):
    return text.startswith("RT @")


def load_trump_tweets(filepath):
    """Loads all Trump tweets and returns them as a list."""
    ### YOUR CODE HERE
    tweets = []
    df = pd.read_csv(filepath, sep=",")
    
    for _, data in df.iterrows():
        if data["isRetweet"] == "f":
            tweets.append(data["text"])

    return tweets
    ### END YOUR CODE


def load_obama_tweets(filepath):
    """Loads all Obama tweets and returns them as a list."""
    ### YOUR CODE HERE
    tweets = []
    df = pd.read_csv(filepath, sep=",")
    
    for _, data in df.iterrows():
        if not isRetweet(data["Tweet-text"]):
            tweets.append(data["Tweet-text"])

    return tweets
    ### END YOUR CODE
    

def load_biden_tweets(filepath):
    """Loads all Biden tweets and returns them as a list."""
    ### YOUR CODE HERE
    tweets = []
    df = pd.read_csv(filepath, sep=",")
    for _, data in df.iterrows():
        if not isRetweet(data["tweet"]):
            tweets.append(data["tweet"])

    return tweets
    ### END YOUR CODE

In [252]:
# Notice: think about start and end tokens

NUM_TEST = 100
TOKENS_PATTERN = r"""(
    (?:@[\w_]+)                # @mention
  | (?:\#[\w_]+)               # #hashtag
  | (?:https?://\S+)          # URL
  | (?:[A-Za-z0-9_]+(?:'[A-Za-z0-9_]+)?)  # words with optional apostrophes
  | (?:[.,!?;])               # punctuation
  | (?:\S)                    # catch-all for emojis/symbols/etc.
)"""

def tokenize(text, pattern=TOKENS_PATTERN):
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE
    return ["<s>"] + re.findall(pattern, text, re.VERBOSE) + ["</s>"]
    ### END YOUR CODE
    

def split_and_tokenize(data, num_test=NUM_TEST):
    """Tokenizes the first `num_test` tweets (all tweets if `num_test <= 0`)."""
    ### YOUR CODE HERE
    # Note: only this one subset is returned; no separate training partition
    # is produced here.
    num_tweets = num_test if num_test > 0 else len(data)
    # Slicing avoids an IndexError when fewer than `num_tweets` tweets exist.
    return [tokenize(tweet) for tweet in data[:num_tweets]]
    ### END YOUR CODE

In [253]:
trump_tweets = split_and_tokenize(load_trump_tweets("C:\\Users\\Felix\\PythonProjects\\seqlrn_assignments\\2-markov-chains\\data\\tweets\\tweets_trump.csv"))
obama_tweets = split_and_tokenize(load_obama_tweets("C:\\Users\\Felix\\PythonProjects\\seqlrn_assignments\\2-markov-chains\\data\\tweets\\tweets_obama.csv"))
biden_tweets = split_and_tokenize(load_biden_tweets("C:\\Users\\Felix\\PythonProjects\\seqlrn_assignments\\2-markov-chains\\data\\tweets\\tweets_biden.csv"))

print(trump_tweets[0])
print(obama_tweets[0])
print(biden_tweets[0])

['<s>', 'Republicans', 'and', 'Democrats', 'have', 'both', 'created', 'our', 'economic', 'problems', '.', '</s>']
['<s>', 'From', 'a', 'big', 'NBA', 'fan', 'congrats', 'to', 'future', 'Hall', 'of', 'Famers', 'Dwyane', 'Wade', 'and', 'Dirk', 'Nowitzki', '—', 'not', 'just', 'all', '-', 'time', 'greats', 'but', 'class', 'acts', 'too', '.', '</s>']
['<s>', 'Tune', 'in', '11', ':', '30', 'ET', 'tomorrow', 'for', 'a', 'live', 'webcast', 'of', 'Families', 'USA', 'Presidential', 'Forum', 'on', 'health', 'care', ':', 'http://presidentialforums.health08.org/', '</s>']


### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5] for Obama, Trump, and Biden.

2.2 Also train a joint model that will serve as a background model.
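
As a rough illustration of steps 2.1 and 2.2, the sketch below counts n-grams per author and builds the joint (background) counts from the concatenated corpora. The names `ngrams`, `count_ngrams`, and the toy corpora are assumptions for this example, and plain `Counter` objects stand in for whatever model class you choose.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokenized_tweets, max_n):
    """counts[k] maps each k-gram tuple to its frequency, for k = 1..max_n."""
    counts = {k: Counter() for k in range(1, max_n + 1)}
    for tokens in tokenized_tweets:
        for k in range(1, max_n + 1):
            counts[k].update(ngrams(tokens, k))
    return counts


# Toy corpora standing in for two authors' tokenized tweets
corpus_a = [["<s>", "make", "america", "great", "</s>"]]
corpus_b = [["<s>", "yes", "we", "can", "</s>"],
            ["<s>", "yes", "we", "did", "</s>"]]

counts_a = count_ngrams(corpus_a, max_n=2)
counts_b = count_ngrams(corpus_b, max_n=2)
# Joint background model: simply count over all tweets together
background = count_ngrams(corpus_a + corpus_b, max_n=2)
```

Keeping the counts for every order up to `max_n` in one structure makes it easy to fall back to shorter contexts later.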

In [254]:
def build_n_gram_models(n, data):
    """
    To predict the first few words of the Tweet, we need the smaller n-grams as
    well. This method computes the n-gram lists for every order up to the given n.
    """
    ### YOUR CODE HERE
    n_gram_models = []

    for i in range(1, n+1):
        i_gram = []

        for tweet in data:
            i_gram.extend(nltk.ngrams(sequence=tweet, n=i))
        
        n_gram_models.append(i_gram)

    return n_gram_models
    ### END YOUR CODE

def get_suggestion(prev, n_gram_model):
    """
    Suggests the next word for the given context `prev`: the most frequent
    continuation in the n-gram list, or a random token if none matches.
    `prev` must be a list with exactly one token fewer than the n-value of the
    n-grams, or no continuation can be matched.
    """
    ### YOUR CODE HERE
    n = len(n_gram_model[0])

    if len(prev) > n - 1:
        raise ValueError(f"{n}-gram needs {n - 1} context tokens for prediction.")

    ngram_counts = Counter(n_gram_model)
    all_sequences_count = sum(ngram_counts.values())
    candidates = [(sequence, count / all_sequences_count) for sequence, count in ngram_counts.items() 
                  if prev == list(sequence[:-1])]
    
    if len(candidates) < 1:
        random_suggestion = random.choice(list(ngram_counts.keys()))
        print("No candidate found, returning random token!")
        return random_suggestion[-1]

    candidates.sort(key=lambda c: c[1], reverse=True)

    return candidates[0][0][-1]  
    ### END YOUR CODE


def get_random_tweet(n, n_gram_models):
    """Generates a random tweet using the given n-gram model."""
    ### YOUR CODE HERE
    tweet_sequence = ["<s>", "Make", "America"]

    while True:
        # Context: the last (n-1) tokens
        context = tweet_sequence[-(n-1):] if n > 1 else []

        suggestion = get_suggestion(context, n_gram_models[n-1])
        tweet_sequence.append(suggestion)

        # Stop at the end token or once the sequence holds 30 tokens
        if suggestion == "</s>" or len(tweet_sequence) >= 30:
            break

    # Return the tweet without start/end tokens
    return " ".join([w for w in tweet_sequence if w not in ("<s>", "</s>")])
    ### END YOUR CODE

In [255]:
n_gram_models = build_n_gram_models(n=3, data=trump_tweets)
random_tweet_trump = get_random_tweet(n=3, n_gram_models=n_gram_models)
print(random_tweet_trump)

No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
No candidate found, returning random token!
Make America have for No peace turning 2 inaccurate @US_FDA the hotly values @SecretarySonny points the Left from than they ’ ll be back in the Great city of Charlo
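
The fallback message in the output above is printed whenever the current context never occurred in the training data. A possible alternative, sketched here under assumptions, is to back off to shorter contexts and to sample in proportion to the counts instead of always taking the most frequent word. The function `sample_next` and the `counts_by_order` layout (a dict from order k to a `Counter` of k-gram tuples) are hypothetical and differ from the list-based models used in the cell above.

```python
import random
from collections import Counter


def sample_next(context, counts_by_order, rng=random):
    """Sample the next token, backing off to shorter contexts when needed.

    counts_by_order[k] is a Counter over k-gram tuples (assumed layout).
    """
    max_n = max(counts_by_order)
    # Keep at most (max_n - 1) context tokens
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    while True:
        k = len(context) + 1
        # Continuations of this context with their counts
        candidates = {gram[-1]: count
                      for gram, count in counts_by_order[k].items()
                      if gram[:-1] == context}
        if candidates:
            tokens, weights = zip(*candidates.items())
            return rng.choices(tokens, weights=weights, k=1)[0]
        if not context:
            raise ValueError("no unigram counts available")
        context = context[1:]  # drop the oldest token and retry


# Toy counts: unigrams and a single bigram
counts = {
    1: Counter({("a",): 2, ("b",): 1}),
    2: Counter({("a", "b"): 1}),
}
seen_context = sample_next(["a"], counts)                      # only "b" follows "a"
unseen_context = sample_next(["zzz"], counts, random.Random(0))  # backs off to unigrams
```

This is plain backoff without discounting, so it is useful for generation but does not yield a properly normalized probability distribution.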

### Classify the Tweets

3.1 Use the log-ratio method to classify the Tweets for Trump vs. Biden. Trump should be easy to spot; but what about Obama vs. Biden?

3.2 Analyze: At what context length (n) does the system perform best?
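
A minimal sketch of the log-ratio idea follows, assuming add-one (Laplace) smoothing so that unseen n-grams do not produce a zero probability. The helper names `train`, `log_prob`, and `log_ratio` are illustrative and do not match the template signatures; the sketch also expects token lists without `<s>`/`</s>` markers and adds its own padding.

```python
import math
from collections import Counter


def train(tokenized, n):
    """Return (n-gram counts, context counts, vocabulary) for one corpus."""
    grams, contexts, vocab = Counter(), Counter(), set()
    for tokens in tokenized:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            grams[context + (padded[i],)] += 1
            contexts[context] += 1
    return grams, contexts, vocab


def log_prob(token, context, model, vocab_size):
    """Add-one smoothed log P(token | context)."""
    grams, contexts, _ = model
    return math.log((grams[context + (token,)] + 1)
                    / (contexts[context] + vocab_size))


def log_ratio(tokens, n, model1, model2):
    """Sum over tokens of log P1 - log P2; positive favors model1."""
    # Shared vocabulary size (+1 reserved for unknown words) for both models
    vocab_size = len(model1[2] | model2[2]) + 1
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    total = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        total += log_prob(padded[i], context, model1, vocab_size)
        total -= log_prob(padded[i], context, model2, vocab_size)
    return total


# Toy training data for two authors
model_a = train([["make", "america", "great"], ["make", "america", "safe"]], n=2)
model_b = train([["yes", "we", "can"], ["yes", "we", "did"]], n=2)

ratio_a = log_ratio(["make", "america", "proud"], 2, model_a, model_b)
ratio_b = log_ratio(["yes", "we", "will"], 2, model_a, model_b)
```

A tweet is then assigned to the first author when the summed ratio is positive and to the second otherwise. Summing logs rather than multiplying probabilities keeps the computation numerically stable for longer tweets.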

In [51]:
def calculate_single_token_log_ratio(prev, token, n_gram_model1, n_gram_model2):
    """Calculates the log ratio of a token for two different n-grams"""
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


def classify(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


In [52]:
def validate(n, data1, data2, classify_fn):
    """
    Trains the n-gram models on the train data and validates on the test data.
    Uses the implemented classification function to predict the Tweeter.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [53]:
# context_length = ...
# validate(context_length, trump_tweets, biden_tweets, classify_fn=classify)
# validate(context_length, obama_tweets, biden_tweets, classify_fn=classify)

### Compute Perplexities

4.1 Compute (and plot) the perplexities for each of the test tweets and models. Is picking the model with minimum perplexity a better classifier than in 3.1?
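
Perplexity is the exponential of the negative mean log-probability per token. The sketch below assumes you already have the per-token log-probabilities of a tweet under each model; the function names `perplexity` and `lower_perplexity_wins` are illustrative only.

```python
import math


def perplexity(log_probs):
    """exp(-(1/N) * sum of natural-log token probabilities)."""
    if not log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(log_probs) / len(log_probs))


def lower_perplexity_wins(log_probs1, log_probs2):
    """True if the first model has the lower perplexity on the tweet."""
    return perplexity(log_probs1) < perplexity(log_probs2)


# A model that spreads probability uniformly over 4 options per token
uniform = [math.log(0.25)] * 5
pp_uniform = perplexity(uniform)  # approximately 4.0

# Two hypothetical models scoring the same 3-token tweet
confident = [math.log(0.5)] * 3
unsure = [math.log(0.1)] * 3
first_wins = lower_perplexity_wins(confident, unsure)
```

When both models score the same tokens, the number of tokens N is identical, so the model with lower perplexity is exactly the one with higher total log-likelihood. Under that condition the decision should coincide with the sign of the summed log-ratio.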

In [54]:
def classify_with_perplexity(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [55]:
# context_length = ...
# validate(context_length, trump_tweets, biden_tweets, classify_fn=classify_with_perplexity)
# validate(context_length, obama_tweets, biden_tweets, classify_fn=classify_with_perplexity)