# Assignment 2: POTUS

---

## Task 1) President of the United States (Trump vs. Obama)

Surely, you're aware that the 45th President of the United States (@POTUS45) was an active user of Twitter, until (permanently) banned on Jan 8, 2021.
You can still enjoy his greatness at the [Trump Twitter Archive](https://www.thetrumparchive.com/). We will be using original tweets only, so make sure to remove all retweets.
Another fan of Twitter was Barack Obama (@POTUS44), who used the platform in a rather professional way.
Please also consider the POTUS Tweets of Joe Biden; we will be using those for testing.

### Data

There are multiple ways to get the data, but the easiest way is to download the files from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group. 
Another way is to directly use the data from [Trump Twitter Archive](https://www.thetrumparchive.com/), [Obama Kaggle](https://www.kaggle.com/jayrav13/obama-white-house), and [Biden Kaggle](https://www.kaggle.com/rohanrao/joe-biden-tweets).
Before you get started, please download the files and put them into the `data` folder, which is where the loading code in this notebook expects them.

### N-gram Models

In this assignment, you will be doing some Twitter-related preprocessing and training n-gram models to be able to distinguish between Tweets of Trump, Obama, and Biden.
We will be using [NLTK](https://www.nltk.org), more specifically its [`lm`](https://www.nltk.org/api/nltk.lm.html) module.
Install the NLTK package within your working environment.
You can use some of the NLTK functions, but you have to implement the functions for likelihoods and perplexity from scratch.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [252]:
# Dependencies
import nltk
import pandas as pd
import numpy as np
import re

### Prepare the Data

1.1 Prepare all the Tweets. Since the `lm` module works on tokenized data, implement a tokenization method that strips unnecessary tokens but retains special words such as mentions (@...) and hashtags (#...).

1.2 Partition the data into training and test sets; hold out about 100 tweets per author, which we will be testing on later. As with any Machine Learning task, training and test must not overlap.
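
To check a tokenizer against requirement 1.1, it can help to run the pattern on a single hand-written tweet first. The sketch below uses plain `re.findall` as a stand-in for NLTK's `RegexpTokenizer`, which should behave the same way for this pattern. The names `TOKEN_PATTERN` and `sketch_tokenize`, the sample tweet, and the choice to pad with `n - 1` start markers are assumptions made for illustration, not part of the assignment.

```python
import re

# Hypothetical pattern: mentions first, then hashtags, then plain words.
TOKEN_PATTERN = re.compile(r"@\w+|#\w+|\w+")

def sketch_tokenize(text, n=2):
    """Finds tokens and pads them with n-1 start markers and one end marker."""
    tokens = TOKEN_PATTERN.findall(text)
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

# Made-up sample tweet, used only to inspect what the pattern keeps.
sample = "Thank you @FoxNews! #MAGA & more"
tokens = sketch_tokenize(sample, n=3)
print(tokens)
```

Punctuation and the ampersand are dropped, while the mention and the hashtag survive as single tokens. Note that this pattern also splits contractions such as "don't" into two tokens, which may or may not be what you want.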

In [253]:
# Notice: ignore retweets 

def strip_empty_tweets(tweets: list) -> list:
    """Removes empty tweets from the list."""
    return [tweet for tweet in tweets if tweet]

def beautify_tweet(tweet: str) -> str:
    """Returns a beautified version of the tweet."""
   
    # Replace &amp; with &
    tweet = tweet.replace('&amp;', '&').replace('&amp', '&')

    # Remove links
    tweet = re.sub(r'http\S+|www\.\S+', '', tweet)
        
    # Remove special characters without @ and #
    #tweet = re.sub(r'[^a-zA-Z0-9\s@#&]', '', tweet)
        
    # Remove extra spaces
    tweet = re.sub(r'\s+', ' ', tweet).strip()
    
    return tweet

def csv_load(filepath: str, tweet_col: str) -> pd.DataFrame:
    """Loads the CSV file and returns a dataframe."""

    df = pd.read_csv(filepath, sep=',', encoding='utf-8')
        
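    # Remove missing tweets and retweets (copy to avoid SettingWithCopyWarning)
    df = df.dropna(subset=[tweet_col])
    df = df[~df[tweet_col].str.startswith('RT')].copy()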

    # Beautify the tweets
    df[tweet_col] = df[tweet_col].apply(beautify_tweet)

    return df

def json_load(filepath: str, tweet_col: str) -> pd.DataFrame:
    """Loads the JSON file and returns a dataframe."""
    
    # Load the JSON file
    df = pd.read_json(filepath)
    
    # Remove retweets (copy to avoid SettingWithCopyWarning)
    df = df[~df['isRetweet'].str.contains('t', na=False)].copy()
    
    # Beautify the tweets
    df[tweet_col] = df[tweet_col].apply(beautify_tweet)

    return df

def load_trump_tweets(filepath):
    """Loads all Trump tweets and returns them as a list."""
    ### YOUR CODE HERE
    df = json_load(filepath, 'text')
    return strip_empty_tweets(df['text'].tolist())
    ### END YOUR CODE


def load_obama_tweets(filepath, col_name = 'Tweet-text'):
    """Loads all Obama tweets and returns them as a list."""
    ### YOUR CODE HERE
    df = csv_load(filepath, col_name)
    return strip_empty_tweets(df[col_name].tolist())
    ### END YOUR CODE
    

def load_biden_tweets(filepath, col_name = 'tweet'):
    """Loads all Biden tweets and returns them as a list."""
    ### YOUR CODE HERE
    df = csv_load(filepath, col_name)
    return strip_empty_tweets(df[col_name].tolist())
    ### END YOUR CODE

In [254]:
import random
random.seed(42)
from nltk.tokenize import RegexpTokenizer

In [255]:
# Notice: think about start and end tokens

NUM_TEST = 100

def tokenize(text):
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE
    tokenizer = RegexpTokenizer(r'@\w+|#\w+|\w+')
    tokens = tokenizer.tokenize(text)
    return tokens
    ### END YOUR CODE
    

def split_and_tokenize(data, num_test=NUM_TEST):
    """Splits and tokenizes the given list of Twitter tweets."""
    ### YOUR CODE HERE
    # Shuffle a copy so the caller's list is left untouched
    data = list(data)
    random.shuffle(data)

    # Keep each tweet as its own token list, so num_test counts tweets
    # rather than individual tokens
    tokenized = [['<s>'] + tokenize(tweet) + ['</s>'] for tweet in data]

    # The first num_test tweets form the test set, the rest the training set;
    # every tweet is assigned to exactly one of the two
    test, train = tokenized[:num_test], tokenized[num_test:]

    return train, test
    ### END YOUR CODE

In [256]:
trump_tweets = split_and_tokenize(load_trump_tweets('data/tweets_01-08-2021.json'))
obama_tweets = split_and_tokenize(load_obama_tweets('data/Tweets-BarackObama.csv'))
biden_tweets = split_and_tokenize(load_biden_tweets('data/JoeBidenTweets.csv'))

### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5] for Obama, Trump, and Biden.

2.2 Also train a joint model that will serve as a background model.
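
As a starting point for 2.1 and 2.2, here is a minimal sketch of counting all n-gram orders up to a given `n` and of merging per-author counts into a joint background model. It assumes each tweet is already a list of tokens with start and end markers. The helper names `count_ngrams` and `joint_counts` and the toy data are hypothetical, and plain `collections.Counter` is used instead of the NLTK `lm` classes.

```python
from collections import Counter

def count_ngrams(tweets, max_n):
    """Returns {order: Counter of n-gram tuples} for orders 1..max_n."""
    counts = {order: Counter() for order in range(1, max_n + 1)}
    for tokens in tweets:
        for order in range(1, max_n + 1):
            # Slide a window of length `order` over the tweet.
            for i in range(len(tokens) - order + 1):
                counts[order][tuple(tokens[i:i + order])] += 1
    return counts

def joint_counts(*per_author_counts):
    """Sums per-author counts into one background model."""
    merged = {}
    for counts in per_author_counts:
        for order, counter in counts.items():
            merged.setdefault(order, Counter()).update(counter)
    return merged

# Toy data standing in for two authors' training tweets.
toy_a = [["<s>", "make", "america", "great", "</s>"]]
toy_b = [["<s>", "yes", "we", "can", "</s>"], ["<s>", "we", "can", "</s>"]]

counts_a = count_ngrams(toy_a, 2)
counts_b = count_ngrams(toy_b, 2)
background = joint_counts(counts_a, counts_b)
print(background[2][("we", "can")])
```

Summing the counts gives the same result as counting over the concatenated training sets, so the background model does not need a second pass over the data.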

In [257]:
def build_n_gram_models(n, data):
    """
    To predict the first few words of the Tweet, we need the smaller n-grams as
    well. This method calculates all n-grams up to the given n.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


def get_suggestion(prev, n_gram_model):
    """
    Samples a random next word given the previous tokens.
    The size of the previous tokens must be exactly one less than the n-value
    of the n-gram, or it will not be able to make a prediction.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


def get_random_tweet(n, n_gram_models):
    """Generates a random tweet using the given n-gram models."""
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [258]:
# n_gram_models = build_n_gram_models(...)
# random_tweet_trump = get_random_tweet(...)
# print(random_tweet_trump)

### Classify the Tweets

3.1 Use the log-ratio method to classify the Tweets for Trump vs. Biden. Trump should be easy to spot; but what about Obama vs. Biden?

3.2 Analyze: At what context length (n) does the system perform best?
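
The log-ratio method sums, over all tokens of a tweet, the difference between the log-probability under the first model and under the second; a positive total favours the first model. The sketch below is one possible from-scratch version, assuming add-alpha (Laplace) smoothing over a shared vocabulary so that unseen n-grams do not produce a log of zero. The function names, the smoothing choice, and the toy tweets are assumptions for illustration only.

```python
import math
from collections import Counter

def count_ngrams(tweets, n):
    """Counts n-grams and their (n-1)-token contexts in tokenized tweets."""
    ngrams, contexts = Counter(), Counter()
    for tokens in tweets:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def log_prob(gram, model, vocab_size, alpha=1.0):
    """Add-alpha smoothed log P(last token | preceding tokens)."""
    ngrams, contexts = model
    numerator = ngrams[gram] + alpha
    denominator = contexts[gram[:-1]] + alpha * vocab_size
    return math.log(numerator / denominator)

def log_ratio(tokens, model1, model2, n, vocab_size):
    """Sum of per-token log ratios; a positive value favours model1."""
    total = 0.0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        total += log_prob(gram, model1, vocab_size)
        total -= log_prob(gram, model2, vocab_size)
    return total

# Toy training data standing in for two authors.
train_a = [["<s>", "fake", "news", "</s>"], ["<s>", "fake", "media", "</s>"]]
train_b = [["<s>", "build", "back", "better", "</s>"]]
vocab_size = len({tok for tweet in train_a + train_b for tok in tweet})

model_a = count_ngrams(train_a, 2)
model_b = count_ngrams(train_b, 2)

score_a = log_ratio(["<s>", "fake", "news", "</s>"], model_a, model_b, 2, vocab_size)
score_b = log_ratio(["<s>", "build", "back", "</s>"], model_a, model_b, 2, vocab_size)
print(score_a > 0, score_b > 0)
```

Using the same vocabulary size for both models keeps the smoothed probabilities comparable; other smoothing or back-off schemes are equally valid choices here.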

In [259]:
def calculate_single_token_log_ratio(prev, token, n_gram_model1, n_gram_model2):
    """Calculates the log ratio of a token for two different n-gram models."""
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


def classify(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given models makes the given Tweet more likely.
    Returns True if the first one is more likely, otherwise False.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


In [260]:
def validate(n, data1, data2, classify_fn):
    """
    Trains the n-gram models on the train data and validates on the test data.
    Uses the implemented classification function to predict the Tweeter.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [261]:
# context_length = ...
# validate(context_length, trump_tweets, biden_tweets, classify_fn=classify)
# validate(context_length, obama_tweets, biden_tweets, classify_fn=classify)

### Compute Perplexities

4.1 Compute (and plot) the perplexities for each of the test tweets and models. Is picking the model with minimum perplexity a better classifier than in 3.1?
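
Perplexity is the exponential of the negative mean log-probability per token, so lower values mean the model finds the tweet less surprising. The sketch below assumes you already have a list of natural-log token probabilities for a tweet under each model. The names `perplexity` and `pick_min_perplexity` and the example probabilities are made up for illustration.

```python
import math

def perplexity(log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not log_probs:
        raise ValueError("need at least one token probability")
    return math.exp(-sum(log_probs) / len(log_probs))

def pick_min_perplexity(log_probs_by_model):
    """Returns the name of the model with the lowest perplexity."""
    return min(log_probs_by_model,
               key=lambda name: perplexity(log_probs_by_model[name]))

# A model that assigns 0.25 to every token has perplexity close to 4.
uniform = [math.log(0.25)] * 4
pp_uniform = perplexity(uniform)

# Hypothetical per-token log-probabilities of one tweet under two models.
scores = {
    "trump": [math.log(0.5), math.log(0.5)],
    "biden": [math.log(0.1), math.log(0.2)],
}
winner = pick_min_perplexity(scores)
print(pp_uniform, winner)
```

When both models score exactly the same tokens of a tweet, perplexity is a monotone function of the summed log-probability, which is worth keeping in mind when comparing this classifier with the one from 3.1.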

In [262]:
def classify_with_perplexity(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given models gives the given Tweet the lower perplexity.
    Returns True if the first one is more likely, otherwise False.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [263]:
# context_length = ...
# validate(context_length, trump_tweets, biden_tweets, classify_fn=classify_with_perplexity)
# validate(context_length, obama_tweets, biden_tweets, classify_fn=classify_with_perplexity)