# Style transfer of Donald Trump's tweets
### A project for the AI course Advanced Natural Language Processing
_Rik Dijkstra, Abel de Wit, Max Knappe_

Every piece of text belongs to a specific time, place and scenario, conveys specific characteristics of its author, and has a specific intent. If we denote a piece of text as `x` and the style of this text as `a`, Text Style Transfer (TST) aims to produce a text `x` with a desired attribute value `a`, given an existing text `x'`.

**Imports**

In [17]:
import pandas as pd
import numpy as np
import torch
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re

In [20]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/abeldewit/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abeldewit/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/abeldewit/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/abeldewit/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

**Reading in the datasets**

In [3]:
df1 = pd.read_csv('./data/realdonaldtrump.csv')
df2 = pd.read_csv('./data/trumptweets.csv')

In [4]:
df1.head()

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,,


In [5]:
df2.head()

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags,geo
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 20:54:25,500,868,,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-05 03:00:10,33,273,,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 15:38:08,12,18,,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 22:40:15,11,24,,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 16:07:28,1399,1965,,,


**Removing duplicates**

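As the first five rows of each dataset already show, the two datasets contain duplicate tweets. The same tweet can carry a different timestamp and different retweet and favorite counts in each file, so we combine the two datasets and remove the duplicates based on the 'content' column only.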

In [6]:
df = pd.concat([df1, df2])
len_before = len(df)
df = df.drop_duplicates(subset=['content'], ignore_index=True)
len_after = len(df)
print("The two datasets together were {} tweets long, of which {} were duplicates,\nthis leaves us with {} tweets".format(len_before, (len_before - len_after), len_after))

The two datasets together were 84474 tweets long, of which 40947 were duplicates,
this leaves us with 43527 tweets
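
To illustrate why the duplicates are dropped on the 'content' column only, here is a minimal sketch on two hypothetical miniature frames (the rows and values are made up for illustration and are not taken from the datasets). The same tweet appears in both frames with a different timestamp, so comparing whole rows would keep both copies.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets (made-up values)
a = pd.DataFrame({
    'content': ['tweet one', 'tweet two'],
    'date': ['2009-05-04 13:54:25', '2009-05-04 20:00:10'],
})
b = pd.DataFrame({
    'content': ['tweet one', 'tweet three'],
    'date': ['2009-05-04 20:54:25', '2009-05-08 15:38:08'],
})

combined = pd.concat([a, b])

# Comparing whole rows keeps both copies of 'tweet one' (the dates differ)
full_row = combined.drop_duplicates(ignore_index=True)

# Comparing only 'content' keeps the first occurrence of each tweet text
by_content = combined.drop_duplicates(subset=['content'], ignore_index=True)

print(len(combined), len(full_row), len(by_content))
```

Because `drop_duplicates` keeps the first occurrence by default, the surviving copy of a duplicated tweet comes from whichever frame is concatenated first.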


## Preprocessing
Now that we have a set of unique tweets from Trump, we need to pre-process the data so that hyperlinks, named entities and other attributes that are not part of The Donald's style of writing are removed.

In [7]:
df.isna().sum()

id               0
link             0
content          0
date             0
retweets         0
favorites        0
mentions     22842
hashtags     37870
geo          43527
dtype: int64

We can see that the column that we want to work with ('content') has no empty fields, so we don't have to remove any of our entries.

Next up is the pre-processing itself: we remove text that is not useful for our model, such as hyperlinks, numbers and dates (together with punctuation and every other non-letter character), and lowercase the remaining text.

In [8]:
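def remove_hyper(text):
    # Drop URLs: 'http' followed by any run of non-whitespace characters
    return re.sub(r'http\S+', '', text)

def clean_non_alphanumeric(text):
    # Replace every character that is not an ASCII letter with a space
    # (despite the name, digits are removed as well)
    return re.sub(r'[^a-zA-Z]', ' ', text)

def clean_lowercase(text):
    return str(text).lower()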

df['clean_content'] = df['content'].apply(remove_hyper)
df['clean_content'] = df['clean_content'].apply(clean_non_alphanumeric)
df['clean_content'] = df['clean_content'].apply(clean_lowercase)
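
As a quick check of what these three steps do to a single string, the sketch below applies them to a made-up example tweet (not taken from the dataset). The final `split()` is only there to make the surviving words easy to read; it is not the tokenizer used in the notebook.

```python
import re

def remove_hyper(text):
    return re.sub(r'http\S+', '', text)

def clean_non_alphanumeric(text):
    return re.sub(r'[^a-zA-Z]', ' ', text)

def clean_lowercase(text):
    return str(text).lower()

# Made-up example tweet, not taken from the dataset
sample = "Tune in at 9pm! http://example.com/abc #MAGA @foxnews"

cleaned = clean_lowercase(clean_non_alphanumeric(remove_hyper(sample)))

# split() is used here only to make the surviving words easy to read
words = cleaned.split()
print(words)
```

The digit in `9pm` and the `#` and `@` markers are gone, so after this step hashtags and mentions can no longer be told apart from ordinary words.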


After that we tokenize the text so that it becomes a list of all the tokens in a sentence.

In [9]:
def clean_tokenization(text):
    return word_tokenize(text)

df['clean_content'] = df['clean_content'].apply(clean_tokenization)

As we can see below, the text has been tokenized and put in a new column in our dataset.

In [12]:
df.head(5)[['content', 'clean_content']]

Unnamed: 0,content,clean_content
0,Be sure to tune in and watch Donald Trump on L...,"[be, sure, to, tune, in, and, watch, donald, t..."
1,Donald Trump will be appearing on The View tom...,"[donald, trump, will, be, appearing, on, the, ..."
2,Donald Trump reads Top Ten Financial Tips on L...,"[donald, trump, reads, top, ten, financial, ti..."
3,New Blog Post: Celebrity Apprentice Finale and...,"[new, blog, post, celebrity, apprentice, final..."
4,"""My persona will never be that of a wallflower...","[my, persona, will, never, be, that, of, a, wa..."


### Stop word removal, stemming, and lemmatization
**I am not sure if this should be done, as stopwords might be part of Trump's style**

Stop words are so common in a language that they tell us little about the meaning or style of a text, hence we remove them. Some words have inflectional forms, such as `saw` and `see`. Since we want to teach our model that these forms are the same word, we apply stemming followed by lemmatization, which converts each inflectional form to its base form.

In [13]:
stop_words = set(stopwords.words('english'))
def clean_stopwords(token):
    return [item for item in token if item not in stop_words]

df['clean_content'] = df['clean_content'].apply(clean_stopwords)

In [18]:
stemmer = PorterStemmer()
def clean_stem(token):
    return [stemmer.stem(i) for i in token]

df['clean_content'] = df['clean_content'].apply(clean_stem)

In [21]:
lemma = WordNetLemmatizer()
def clean_lemma(token):
    return [lemma.lemmatize(word=w, pos='v') for w in token]

df['clean_content'] = df['clean_content'].apply(clean_lemma)
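
The concern raised in the note at the start of this section can be checked empirically. The sketch below measures what share of the tokens stop-word removal discards; the small `stop_words` set, the helper `stopword_share` and the example tokens are hypothetical stand-ins for NLTK's English stop-word list and the real tweets.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
stop_words = {'be', 'to', 'in', 'and', 'on', 'the', 'will', 'of', 'a', 'that'}

def clean_stopwords(token):
    return [item for item in token if item not in stop_words]

def stopword_share(token):
    """Fraction of tokens that stop-word removal would discard."""
    if not token:
        return 0.0
    return 1 - len(clean_stopwords(token)) / len(token)

# Made-up tokenized tweet
tokens = ['be', 'sure', 'to', 'tune', 'in', 'and', 'watch', 'the', 'show']

print(clean_stopwords(tokens))
print(round(stopword_share(tokens), 2))
```

If a large share of every tweet disappears at this step, that may be a sign that stop words carry part of the style and should be kept.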

### Dictionaries
Now we generate the word embedding dictionaries where we have `word2index`, `index2word`, and `word2count`. This allo
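
As a minimal sketch of what these three mappings could contain, assume the corpus is a list of token lists and that indices 0 and 1 are reserved for the `<s>` and `</s>` markers. The function name `build_vocab` and the toy corpus are hypothetical and are not the notebook's own code.

```python
def build_vocab(corpus):
    # Indices 0 and 1 are reserved for the sentence start/end markers
    word2index = {'<s>': 0, '</s>': 1}
    index2word = {0: '<s>', 1: '</s>'}
    word2count = {}

    for sentence in corpus:
        for word in sentence:
            # Count every occurrence of the word
            word2count[word] = word2count.get(word, 0) + 1
            # Assign the next free index the first time a word is seen
            if word not in word2index:
                idx = len(word2index)
                word2index[word] = idx
                index2word[idx] = word
    return word2index, index2word, word2count

# Toy corpus of already tokenized sentences (made up for illustration)
corpus = [['donald', 'trump', 'appear'], ['trump', 'read', 'tip']]
word2index, index2word, word2count = build_vocab(corpus)

print(word2index)
print(word2count)
```

Using `len(word2index)` as the next index keeps `word2index` and `index2word` exact inverses of each other without a separate counter.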

In [None]:
def generate_dict(corpus):
    word_to_idx = {'<s>': 0, '</s>': 1}
    idx_to_word = {0: '<s>', 1: '</s>'}
    word_count = {}
    
    for sentence in corpus:
        for word in sentence:
            if word not in word_count.keys():
                word_count[word] = 1
            else:
                word_count[word] += 1
            
            if word not in word_to_idx.keys():
                