# Style transfer of Donald Trump's tweets
### A project for the AI course Advanced Natural Language Processing
_Rik Dijkstra, Abel de Wit, Max Knappe_

Every piece of text is written in a specific time, place and scenario, conveys specific characteristics of its author, and has a specific intent. If we denote a piece of text as `x` and its style as `a`, Text Style Transfer (TST) aims to produce a text `x` with a desired attribute value `a`, given an existing text `x'`.
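
One common way to write this down, stated here as an assumption rather than as the project's own definition, is as a conditional generation problem. The symbol $\hat{x}$ for the generated text is introduced only for this sketch:

```latex
\hat{x} = \arg\max_{x} \; p(x \mid a, x')
```

Here $x'$ is the existing text, $a$ is the desired style attribute, and $\hat{x}$ is the generated text, which should keep the content of $x'$ while carrying style $a$.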

**Imports**

In [None]:
import pandas as pd
import numpy as np
import torch
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re

In [None]:
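# Download the NLTK resources into a local folder
nltk.download('punkt', download_dir='./nltk_data/')
nltk.download('stopwords', download_dir='./nltk_data/')
nltk.download('wordnet', download_dir='./nltk_data/')
nltk.download('averaged_perceptron_tagger', download_dir='./nltk_data/')
# NLTK does not search the working directory by default, so register the local folder
nltk.data.path.append('./nltk_data/')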

**Reading in the datasets**

In [None]:
df1 = pd.read_csv('./data/realdonaldtrump.csv')
df2 = pd.read_csv('./data/trumptweets.csv')

In [None]:
df1.head()

In [None]:
df2.head()

**Removing duplicates**

As we can already see in the first few entries of both datasets, there are some duplicate tweets. Let's combine the two datasets and remove the duplicates based on the 'content' column.

In [None]:
df = pd.concat([df1, df2])
len_before = len(df)
df = df.drop_duplicates(subset=['content'], ignore_index=True)
len_after = len(df)
print("The two datasets together were {} tweets long, of which {} were duplicates,\nthis leaves us with {} tweets".format(len_before, (len_before - len_after), len_after))

## Preprocessing
Now that we have a set of unique tweets from Trump, we need to pre-process the data so that hyperlinks, named entities and other attributes that are not part of The Donald's style of writing are removed.

In [None]:
df.isna().sum()

We can see that the column we want to work with ('content') has no empty fields, so we don't have to remove any of our entries.

Next up is the pre-processing itself: we remove text that is not useful for our model, such as hyperlinks, numbers and dates, and we convert the text to lowercase. After that, we tokenize the sentences so we have a list of words.

In [None]:
def clean_text(text):
    text = str(text)                                     # Guard against non-string values
    text = re.sub(r'http\S+', '', text)                  # Remove hyperlinks
    text = re.sub(r'[^a-zA-Z]', ' ', text)               # Replace every non-letter (digits, punctuation) with a space
    text = text.lower()                                  # Change all to lowercase
    text = re.sub(r'donald\s+(?:j\s+)?trump', '', text)  # Remove his name, with or without the middle initial
    text = word_tokenize(text)                           # Tokenize sentence
    return text


df['clean_content'] = df['content'].apply(clean_text)
df.head(5)[['content', 'clean_content']]
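
To make the effect of the cleaning steps concrete, here is a minimal sketch on a single made-up tweet. It mirrors only the link, non-letter and lowercase steps; `clean_text_sketch` is a hypothetical helper, and `str.split` stands in for `word_tokenize` so the example runs without NLTK data.

```python
import re

def clean_text_sketch(text):
    """Mirror the regex steps of clean_text; str.split stands in for word_tokenize."""
    text = str(text)
    text = re.sub(r'http\S+', '', text)     # drop hyperlinks first
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # replace every non-letter with a space
    text = text.lower()                     # lowercase everything
    return text.split()                     # whitespace split (assumption, not NLTK)

# Made-up example tweet, not taken from the dataset
sample = "MAKE AMERICA GREAT AGAIN! 2016 https://t.co/abc123"
tokens = clean_text_sketch(sample)
print(tokens)

# Contractions are split at the apostrophe because it is a non-letter
print(clean_text_sketch("Don't"))
```

The order of the steps matters: if non-letters were stripped before the links, `https://t.co/abc123` would already be broken into fragments such as `t`, `co` and `abc`, and the link pattern would only remove the leading `https`.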

### Stop word removal, and lemmatization
**I am not sure if this should be done, as stopwords might be part of Trump's style**

Stop words are so common in a language that they usually tell us little about the meaning or style of a text, so we remove them here. Some words have inflectional forms, such as `saw` and `see`. Since we want our model to treat these forms as the same word, we apply lemmatization, which converts each inflected form to its base form. The cell below also runs a Porter stemmer before lemmatizing.

In [None]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()

def clean_stop_lemma(token):
    text = [item for item in token if item not in stop_words]  # Remove stopwords
    text = [stemmer.stem(i) for i in text]                     # Stem words with inflections
    text = [lemma.lemmatize(word=w, pos='v') for w in text]    # Lemmatize words with inflections
    return text

df['clean_content'] = df['clean_content'].apply(clean_stop_lemma)
df.head(5)[['content', 'clean_content']]
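
Whether stop words carry style can be checked before deciding to drop them. The following is a minimal sketch, assuming a hypothetical helper `stop_word_profile`, a tiny hand-written stop-word list instead of NLTK's, and made-up token lists instead of the real corpora. It computes each stop word's share of all tokens in a corpus, so two corpora can be compared.

```python
from collections import Counter

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "a", "is", "and", "we", "will", "to", "of", "so", "very"}

def stop_word_profile(tokenized_tweets):
    """Return each occurring stop word's share of all tokens in the corpus."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    total = sum(counts.values())
    return {w: counts[w] / total for w in STOP_WORDS if counts[w] > 0}

# Made-up toy corpora, for illustration only
corpus_a = [["we", "will", "win", "so", "much"], ["so", "very", "sad"]]
corpus_b = [["the", "album", "is", "out"], ["thanks", "to", "the", "fans"]]

profile_a = stop_word_profile(corpus_a)
profile_b = stop_word_profile(corpus_b)
print(profile_a)
print(profile_b)
```

If the two profiles differ clearly on the real data, that would be an argument for keeping stop words in the input.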

### Dictionaries
Now we build the vocabulary dictionaries `word_to_idx`, `idx_to_word`, and `word_to_count`. They let us transform text to numbers and back, and give us an overview of the occurrences of words in our corpus.

In [None]:
class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.word_to_idx = {'<s>': 0, '</s>': 1}
        self.idx_to_word = {0: '<s>', 1: '</s>'}
        self.word_to_count = {}
        self.all_words = []
        
    def generate_dict(self, corpus):
        for sentence in corpus:
            for word in sentence:
                if word not in self.word_to_idx:
                    idx = len(self.word_to_idx)  # Next free index, shared by both mappings
                    self.word_to_idx[word] = idx
                    self.idx_to_word[idx] = word
                    self.word_to_count[word] = 1
                else:
                    self.word_to_count[word] += 1
        self.all_words = list(self.word_to_count.keys())

trump_vocab = Vocabulary('trump_tweets')
trump_vocab.generate_dict(list(df['clean_content']))
print("Our {} consists of {} unique words.".format(trump_vocab.name, len(trump_vocab.word_to_count)))
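
The round trip from text to numbers and back can be sketched as follows. This is a stand-alone illustration with hypothetical helper names (`build_vocab`, `encode`, `decode`) and made-up tokens; it is not part of the `Vocabulary` class above.

```python
def build_vocab(corpus):
    """Map each distinct token to an integer id, reserving 0 and 1 for sentence markers."""
    word_to_idx = {'<s>': 0, '</s>': 1}
    for sentence in corpus:
        for word in sentence:
            if word not in word_to_idx:
                word_to_idx[word] = len(word_to_idx)
    # Invert the mapping so both directions always agree
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return word_to_idx, idx_to_word

def encode(tokens, word_to_idx):
    """Turn a token list into ids, wrapped in start and end markers."""
    return [word_to_idx['<s>']] + [word_to_idx[t] for t in tokens] + [word_to_idx['</s>']]

def decode(ids, idx_to_word):
    """Turn ids back into tokens, dropping the sentence markers."""
    words = [idx_to_word[i] for i in ids]
    return [w for w in words if w not in ('<s>', '</s>')]

# Made-up toy corpus
corpus = [['make', 'america', 'great'], ['great', 'win']]
word_to_idx, idx_to_word = build_vocab(corpus)

ids = encode(['great', 'win'], word_to_idx)
print(ids)
print(decode(ids, idx_to_word))
```

Note that a word that was never seen raises a `KeyError` in `encode`, since the vocabulary has no unknown-word token.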

## 'Normal' tweets
The next dataset we use is a collection of tweets from the 20 most popular Twitter accounts. We apply the same preprocessing to this dataset, so we can train a classifier to recognize which tweets belong to Trump and which don't. With this classifier as our metric, we can then build a sequence-to-sequence model that is trained to fool the classifier by generating realistic Trump tweets.

In [None]:
normal_df = pd.read_csv('./data/tweets.csv')

In [None]:
normal_df.isna().sum()

In [None]:
normal_df['clean_content'] = normal_df['content'].apply(clean_text)
normal_df['clean_content'] = normal_df['clean_content'].apply(clean_stop_lemma)

In [None]:
normal_vocab = Vocabulary('normal_tweets')
normal_vocab.generate_dict(list(normal_df['clean_content']))
print("Our {} consists of {} unique words.".format(normal_vocab.name, len(normal_vocab.word_to_count)))

## Trump tweet classifier

In [None]:
df['trump'] = True
normal_df['trump'] = False

dfs = [df[['clean_content', 'trump']], normal_df[['clean_content', 'trump']]]

cdf = pd.concat(dfs, ignore_index=True)

In [None]:
full_vocab = Vocabulary('all_data')
full_vocab.generate_dict(list(cdf['clean_content']))
print("Our {} consists of {} unique words.".format(full_vocab.name, len(full_vocab.word_to_count)))

In [None]:
indexEmpty = cdf[cdf['clean_content'].map(lambda d: len(d)) == 0].index
print("After cleaning there are {} empty tweets, let's drop those".format(len(indexEmpty)))
cdf.drop(indexEmpty, inplace=True)

print("Now we have {} Trump tweets, and {} normal tweets".format(len(cdf[cdf['trump'] == True]), len(cdf[cdf['trump'] == False])))
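
As a rough idea of what a classifier over these token lists could look like, here is a minimal baseline sketch. It is a swapped-in technique (multinomial Naive Bayes with add-one smoothing in plain Python), not the classifier this project trains, and the function names and toy data are made up for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(samples):
    """samples: iterable of (tokens, is_trump). Collect per-class word and document counts."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def predict_is_trump(tokens, model):
    """Return True if the Trump class has the higher log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore words never seen during training
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]

# Made-up toy data, for illustration only
train = [
    (["great", "win", "fake", "news"], True),
    (["fake", "news", "sad"], True),
    (["new", "album", "love", "fans"], False),
    (["love", "tour", "tonight"], False),
]
model = train_naive_bayes(train)
print(predict_is_trump(["fake", "news"], model))
print(predict_is_trump(["love", "album"], model))
```

On the real data, the training samples could be taken from `zip(cdf['clean_content'], cdf['trump'])`, ideally after holding out part of the data for evaluation.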

## Variational Autoencoder (VAE)

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time