In [1]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


In [7]:
import sys
import re
import numpy as np
import copy
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense,Dropout,Activation,LSTM,TimeDistributed
from keras.callbacks import ModelCheckpoint
import pickle

## Goal

The goal of this notebook is to create a model that can generate tweets. Tweets are short messages with a maximum length of 280 characters. We will build a character-level model to achieve this.

Tweets can contain smileys, hashtags (which start with #), URLs and user mentions (starting with @). First of all, we need to preprocess the tweets.

A few important preprocessing steps:

1. Transform the tweets to lowercase string

2. Transform "#something" to "&lt;hashtag&gt; something"

3. Transform "@USER" to "@user&lt;user&gt;"

4. Transform "DNC" to "dnc &lt;allcaps&gt;"

5. Transform urls: "http://google.com" to "&lt;url&gt;"

6. Add "&lt;end&gt;" to end of every tweet

7. etc.

The following functions preprocess a single tweet:

In [3]:
FLAGS = re.MULTILINE | re.DOTALL

def hashtag(text):
    text = text.group()
    hashtag_body = text[1:]
    if hashtag_body.isupper():
        result = "<hashtag> {} <allcaps>".format(hashtag_body)
    else:
        result = " ".join(["<hashtag>"] + [hashtag_body.lower()])
    return result

def allcaps(text):
    text = text.group()
    return text.lower() + " <allcaps>"

def user(text):
    text = text.group()
    return text.lower() + "<user>"

def tokenize(text):
    # Different regex parts for smiley faces
    eyes = r"[8:=;]"
    nose = r"['`\-]?"

    # function so code less repetitive
    def re_sub(pattern, repl):
        return re.sub(pattern, repl, text, flags=FLAGS)

    text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "<url>")
    text = re_sub(r"/"," / ")
    text = re_sub(r"@\w+", user)
    text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), "<smile>")
    text = re_sub(r"{}{}p+".format(eyes, nose), "<lolface>")
    text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), "<sadface>")
    text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")
    text = re_sub(r"<3","<heart>")
    text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "<number>")
    text = re_sub(r"#\S+", hashtag)
    text = re_sub(r"([!?.]){2,}", r"\1 <repeat>")
    text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>")
    text = re_sub(r"(\s|\uFEFF|\xA0)+"," ")

    text = re_sub(r"([A-Z]){2,}", allcaps)

    return text.lower()
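
To see what a few of these rules do to a tweet, here is a minimal, simplified sketch. `demo_tokenize` is a hypothetical stand-in that covers only URLs, hashtags and all-caps words; it is not the full `tokenize` function above, and the sample sentence is made up.

```python
import re

def demo_tokenize(text):
    """Simplified illustration: handles URLs, hashtags and all-caps words only."""
    # Replace URLs with a single marker
    text = re.sub(r"https?://\S+", "<url>", text)
    # "#something" -> "<hashtag> something"
    text = re.sub(r"#(\S+)", lambda m: "<hashtag> " + m.group(1), text)
    # Runs of two or more capitals -> lowercase run followed by "<allcaps>"
    text = re.sub(r"[A-Z]{2,}", lambda m: m.group().lower() + " <allcaps>", text)
    return text.lower()

example = demo_tokenize("The DNC is a mess #VoteTrump http://google.com")
print(example)
```

The markers keep information (capitalization, the presence of a link) that would otherwise be lost when the text is lowercased and the URL is dropped.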

## Data

To learn to generate tweets, we need a large number of example tweets. To make the example more interesting, let's say we want to generate tweets in the style of Donald Trump.

First, I downloaded almost all of Trump's tweets (around 33k) from (http://www.trumptwitterarchive.com/about) and saved them to the file "trump.txt", with one tweet per line.

Next, we need to find out which characters are used. We go through the tweets, tokenize each one and count the occurrences of every character. Each special token produced by preprocessing ("&lt;something&gt;") counts as a single character.

In [4]:
f = open("trump.txt","r")

characters = set()
counter = {}
lengths = []
for line in f:
    lengths.append(len(line))
    lineiter = iter(tokenize(line))
    for char in lineiter:
        if char=="<":
            spec = "<"
            while char!=">":
                char = next(lineiter)
                spec += char
            characters.add(spec)
            if spec in counter:
                counter[spec] += 1
            else:
                counter[spec] = 1
        else:
            characters.add(char)
            if char in counter:
                counter[char] +=1
            else:
                counter[char] = 1

The ```counter``` dictionary now contains every character and its number of occurrences in Donald Trump's tweets.
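
The same counting can be sketched more compactly with `collections.Counter`. The helper name `split_tokens` and the sample string are assumptions for illustration; the point is only that every `<...>` marker is counted as one symbol.

```python
from collections import Counter

def split_tokens(text):
    """Yield single characters, keeping each <...> marker as one token."""
    chars = iter(text)
    for ch in chars:
        if ch == "<":
            token = ch
            # Consume characters from the same iterator up to the closing ">"
            for ch in chars:
                token += ch
                if ch == ">":
                    break
            yield token
        else:
            yield ch

sample = "hi <url> <end>"
tokens = list(split_tokens(sample))
counts = Counter(tokens)
print(tokens)
```

Unlike a bare `next()` call, the inner `for` loop simply stops if a line ends before the closing `>` is found.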

We want to create a vocabulary: a list containing all the characters, where the position of a character in the list is its code. There are many special characters (e.g. Japanese symbols, smileys) that we don't really want and that are quite rare, so we keep only the characters that appear more than 30 times in the tweets. We also add two new tokens: "&lt;unknown&gt;" for unknown characters and "&lt;end&gt;" to signal the end of a tweet. Index 0 is left unused so that it can serve as padding.

In [5]:
char_to_idx = {}
i = 2
for c in counter:
    if counter[c]>30:
        char_to_idx[c] = i
        i += 1
char_to_idx["<unknown>"] = 1
char_to_idx["<end>"] = i

# One slot per index, plus index 0, which is reserved for padding
idx_to_char = ['' for i in range(len(char_to_idx) + 1)]
for c in char_to_idx:
    idx_to_char[char_to_idx[c]] = c

print(len(char_to_idx))

70
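
As a rough sketch, the vocabulary construction can be wrapped in a function that does not depend on a hard-coded size. `build_vocab`, `min_count` and the toy counter are assumptions made for this example; sorting the kept characters is an extra choice that makes the indices reproducible between runs.

```python
def build_vocab(counter, min_count=30):
    """Map frequent tokens to indices: 0 = padding, 1 = <unknown>, last = <end>."""
    kept = sorted(tok for tok, n in counter.items() if n > min_count)
    char_to_idx = {"<unknown>": 1}
    for idx, tok in enumerate(kept, start=2):
        char_to_idx[tok] = idx
    char_to_idx["<end>"] = len(kept) + 2
    # Index 0 stays an empty string because it is only used for padding
    idx_to_char = [""] * (len(char_to_idx) + 1)
    for tok, idx in char_to_idx.items():
        idx_to_char[idx] = tok
    return char_to_idx, idx_to_char

toy_counter = {"a": 100, "b": 45, "z": 3}
toy_char_to_idx, toy_idx_to_char = build_vocab(toy_counter)
print(toy_char_to_idx)
```

Here "z" occurs only 3 times, so it gets no index of its own and would later be encoded as `<unknown>`.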


Our "vocabulary":

In [6]:
print(char_to_idx)

{'a': 25, 'p': 44, 'q': 26, '<unknown>': 1, '.': 21, 'e': 45, 'y': 28, 'z': 27, '#': 49, '“': 2, '👍': 47, 'c': 58, '—': 3, '*': 4, "'": 29, '…': 60, '+': 5, '<user>': 54, '<end>': 70, '<url>': 63, '’': 61, 'v': 62, 'i': 6, 'f': 31, '️': 56, ')': 32, '‘': 7, '-': 8, '🇸': 48, 'l': 9, 'o': 10, '(': 64, '!': 12, 'u': 46, '&': 33, 'n': 24, 'r': 51, '@': 13, '~': 53, ' ': 34, 'g': 30, ';': 66, 'd': 14, 's': 15, 't': 35, 'w': 36, '➡': 65, 'x': 37, '🇺': 50, '–': 11, '<elong>': 55, '"': 16, '/': 52, '=': 17, '<smile>': 18, '<allcaps>': 38, 'h': 19, '$': 67, '”': 39, 'm': 68, '_': 40, 'j': 69, ':': 57, 'b': 41, '<hashtag>': 59, '?': 42, '<number>': 20, 'k': 22, '<repeat>': 23, '%': 43}


In [8]:
with open("character_map.pkl", "wb") as f:
    pickle.dump([char_to_idx, idx_to_char], f)

With the vocabulary, we transform every tokenized tweet into a list of indexes. We create two lists, X and Y (input and output). The output is simply the input shifted by one character:

<img src="imgs/tweet_shift.PNG" width="60%" />

We also transform the list of indexes to one-hot representation!
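
On a toy sequence, the padding, the shift and the one-hot encoding look roughly like this. The sketch uses plain NumPy (`np.eye`) instead of `to_categorical`, and `make_pair` with its small sizes is an assumption chosen only for illustration.

```python
import numpy as np

def make_pair(indices, max_len, num_classes):
    """Pad with 0, then build one-hot input and target shifted by one step."""
    padded = indices + [0] * (max_len - len(indices))
    x_idx = np.array(padded[:-1])   # everything except the last position
    y_idx = np.array(padded[1:])    # everything except the first position
    eye = np.eye(num_classes, dtype=np.float32)
    return eye[x_idx], eye[y_idx]

x, y = make_pair([2, 3, 4], max_len=5, num_classes=5)
print(x.argmax(axis=1), y.argmax(axis=1))
```

At every position, the target row is the one-hot code of the character that follows the input character at that position.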

In [9]:
f = open("trump.txt","r")

max_len = max(lengths)

tweets = []
X = []
Y = []
for line in f:
    lineiter = iter(tokenize(line))
    tweet = []
    
    for char in lineiter:
        if char=="<":
            spec = "<"
            while char!=">":
                char = next(lineiter)
                spec += char
            if spec in char_to_idx:
                tweet.append(char_to_idx[spec])
            else:
                tweet.append(char_to_idx["<unknown>"])
        else:
            if char in char_to_idx:
                tweet.append(char_to_idx[char])
            else:
                tweet.append(char_to_idx["<unknown>"])
    tweet.append(char_to_idx["<end>"])
    lt = len(tweet)
    tweet.extend([0]*(max_len-lt))
    tweet2 = copy.deepcopy(tweet)
    tweet2.pop(0)
    tweet.pop(-1)
    tweets.append(tweet)
    X.append(to_categorical(np.array(tweet),num_classes=71).reshape((1,-1,71)))
    Y.append(to_categorical(np.array(tweet2),num_classes=71).reshape((1,-1,71)))

tweets = np.array(tweets)
X = np.vstack(X)
Y = np.vstack(Y)

The shape of X:

In [10]:
print(X.shape)

(33402, 314, 71)


## The model

<img src="imgs/tweet_gen.png" width="50%" />

In [11]:
model = Sequential()
model.add(LSTM(300, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(300, return_sequences=True))
model.add(TimeDistributed(Dense(X.shape[-1])))
model.add(Activation("softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [12]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 314, 300)          446400    
_________________________________________________________________
lstm_2 (LSTM)                (None, 314, 300)          721200    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 314, 71)           21371     
_________________________________________________________________
activation_1 (Activation)    (None, 314, 71)           0         
Total params: 1,188,971
Trainable params: 1,188,971
Non-trainable params: 0
_________________________________________________________________
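
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias, and a dense layer has one weight matrix and one bias. The helper names below are illustrative only.

```python
def lstm_params(input_dim, units):
    """4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus bias."""
    return input_dim * units + units

lstm_1 = lstm_params(71, 300)     # first LSTM sees the 71-dim one-hot input
lstm_2 = lstm_params(300, 300)    # second LSTM sees the 300-dim hidden states
dense = dense_params(300, 71)     # projection back to the 71 classes
total = lstm_1 + lstm_2 + dense
print(lstm_1, lstm_2, dense, total)
```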


In [None]:
checkpoint = ModelCheckpoint(
        "./trump_weights/weights-{epoch:02d}-{loss:.4f}.hdf5",
        monitor='loss',
        save_best_only=True,
)

model.fit(X,Y, batch_size=512, epochs=200, callbacks=[checkpoint])

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200

In [29]:
#model.load_weights("tweetgen_weights.hdf5")

### Generating tweets

<img src="imgs/tweet_gn.PNG" width="70%" />

In [31]:
# Reshape to (1, 71) so that np.repeat along axis 0 yields 314 one-hot rows
start = to_categorical(char_to_idx["c"],num_classes=71).reshape((1,71))

In [32]:
seq = np.repeat(start, 314, axis=0).reshape((1,314,71))

for i in range(1, 314):
    y     = model.predict(seq)
    new_c = np.random.choice(71, 1, p=y[0,i-1,:])[0]
    seq[0,i,:] = to_categorical(new_c, num_classes=71)
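
A possible variant of this loop stops as soon as the end token is sampled, instead of always filling all positions. The sketch below is an assumption, not the code used to produce the results in this notebook: `sample_sequence` is a hypothetical helper, and `fake_predict` is a toy stand-in for `model.predict` so that the example runs without a trained model.

```python
import numpy as np

def sample_sequence(predict_fn, start_idx, end_idx, max_len, num_classes, rng=None):
    """Sample one index at a time and stop once end_idx has been drawn."""
    rng = rng or np.random.default_rng()
    seq = np.zeros((1, max_len, num_classes), dtype=np.float32)
    seq[0, 0, start_idx] = 1.0
    indices = [start_idx]
    for i in range(1, max_len):
        # Distribution over the next character, given positions 0..i-1
        probs = np.asarray(predict_fn(seq)[0, i - 1, :], dtype=np.float64)
        probs /= probs.sum()  # guard against rounding error in the softmax output
        new_idx = int(rng.choice(num_classes, p=probs))
        seq[0, i, new_idx] = 1.0
        indices.append(new_idx)
        if new_idx == end_idx:
            break
    return indices

def fake_predict(seq):
    """Toy stand-in for model.predict: always predicts (current index + 1)."""
    current = seq.argmax(axis=2)
    nxt = np.minimum(current + 1, seq.shape[2] - 1)
    out = np.zeros_like(seq)
    for t in range(seq.shape[1]):
        out[0, t, nxt[0, t]] = 1.0
    return out

generated = sample_sequence(fake_predict, start_idx=2, end_idx=5, max_len=10, num_classes=6)
print(generated)
```

With the real model, `predict_fn` would be `model.predict` and `end_idx` would be `char_to_idx["<end>"]`.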

In [None]:
tweet=""
idxs = np.argmax(seq,axis=2).reshape((314))
for idx in idxs:
    tweet += idx_to_char[idx]

print(tweet)
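
Since the loop above always fills every position, the decoded string may continue after the end token. One way to cut it off is sketched below; `decode` and the toy vocabulary are assumptions for illustration.

```python
def decode(indices, idx_to_char, end_token="<end>"):
    """Join tokens into a string, stopping after the first end token."""
    pieces = []
    for idx in indices:
        tok = idx_to_char[idx]
        pieces.append(tok)
        if tok == end_token:
            break
    return "".join(pieces)

toy_vocab = ["", "<unknown>", "h", "i", "<end>"]
decoded = decode([2, 3, 4, 2, 2], toy_vocab)
print(decoded)
```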

In [146]:
# Append to the list of generated tweets, creating it on first use
try:
    gtweets.append(tweet)
except NameError:
    gtweets = [tweet]

In [147]:
gtweets

['my unemployment rate ogothing "crazy cuts (institedey) <url> <end>',
 'my unemployment rate ogothing "crazy cuts (institedey) <url> <end>',
 'ashirt and obamacare will be attemeded to i told the evidence of dangers &amp; offers - atric\' trump: "a visit tomorrow morning… <url>" <end>',
 'a bet <hashtag> votetrump natoo <allcaps>!#statbartinotay <hashtag> iacaucus <hashtag> commancerohebyand<url> <url> <end>',
 "believe illegal immigration raised against obama's facts reso citizenship sunday! will be a loser! <end>",
 'crooked hillary we would be hosta chance to share a trale base with the election. just saw! <hashtag> alsocullardford <end>',
 'collusion is brought with territors memoral to our military were fast. where is our country! she will snore to represent me! <end>',
 'only endorsement clais scottish college poor presidential remember this show. <end>',
 'crooked hillary wants terrible cyber borders and albanna. failed crowd will becoming in not great again! <end>',
 'crooked 

-----------------

The tweet tokenizer is based on the <a href="https://gist.github.com/tokestermw/cb87a97113da12acb388">Python version</a> of the Ruby script for preprocessing tweets for use in <a href="http://nlp.stanford.edu/projects/glove/">GloVe</a> featurization.