## Facebook chat friend emulator

It's important when arriving in a new city to settle in and make new friends. Taking this last point to heart, I decided to stay in and create a Long Short-Term Memory (LSTM) recurrent neural network that produces sentences in the style of a given Facebook friend, based on our message history.

Apologies in advance to my guinea-pig, Dmitri. I censored the ramblings of your robot self to make sure you didn't say anything too outrageous!

TODO: create a chat-bot with a message-response sequence-to-sequence model

In [1]:
# Keras modules for building the LSTM
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.models import load_model
import keras.utils as ku 
import json
import glob
from random import sample

# set seeds for reproducibility
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

import pandas as pd
import numpy as np
import string, os 

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

Using TensorFlow backend.


In [2]:
#Name of friend to emulate and some hyperparameters
friend = 'Dmitri'
num_messages = 10000    #randomly sample this many messages from the friend's corpus
max_length = 20         #trim all messages to this many words. Most messages are short, and longer ones lengthen the training time significantly
lstm_size = 200         #size of the LSTM layer

In [3]:
def generate_corpus(friend, num_messages, max_length):

    PATH_TO_CONV = glob.glob(f'data/messages/inbox/{friend}*/message.json')[0]  #folder names are odd but start with the first name, so match on that
    
    with open(PATH_TO_CONV) as f:
        data = json.load(f)
                                
    data = pd.DataFrame(data['messages'])

    def rename(name):
        if name=='Simon Roberts':
            return 'Me'
        else:
            return friend
        
    def trim_message(message):
        trimmed = str(message).split(' ')[:max_length]
        return ' '.join(trimmed)
    
    data['sender_name'] = data['sender_name'].apply(rename)   #rename senders to 'Me' and 'First Name'
    data['content'] = data[data['content'].apply(type)==str]['content'] #Keep only string content; anything else (e.g. a message with no text) becomes NaN
    
    messages = data[data['sender_name']==friend]['content'].dropna().apply(trim_message)  #drop the NaN rows (otherwise they become the word 'nan'), then trim messages to N words

    def clean_text(txt):
        txt=str(txt)
        txt = "".join(v for v in txt if v not in string.punctuation).lower()
        txt = txt.encode("utf8").decode("ascii",'ignore')
        return txt 

    return [clean_text(message) for message in sample(list(messages), num_messages)]  #Gets N random messages

In [9]:
#Generate the corpus, and look at a few examples
corpus = generate_corpus(friend, num_messages, max_length)
corpus[:5]

['its the 3rd friday ',
 'all of which will have to be corrected before he can get remotely decent',
 'admittedly he did finish the game with 8 men ',
 'haha',
 'like actually insane']

In [10]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)

In [11]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
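
To make the two cells above easier to follow, here is a toy version of the same idea in plain Python. It is only a sketch: the token ids are made up, and `prefix_sequences` and `pad_pre` are hypothetical stand-ins for the n-gram loop and for Keras's `pad_sequences(..., padding='pre')`.

```python
# Toy illustration of the prefix expansion and pre-padding used above.
# The token ids are made up; in the notebook they come from the Keras Tokenizer.

def prefix_sequences(token_list):
    """Return every prefix of length >= 2, mirroring the n-gram loop above."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

def pad_pre(sequences, pad_value=0):
    """Left-pad each sequence with pad_value to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]

tokens = [5, 12, 7, 3]            # hypothetical ids for a four-word message
padded = pad_pre(prefix_sequences(tokens))

# The last column is the word to predict; everything before it is the input.
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for x, y in zip(predictors, labels):
    print(x, "->", y)
```

One consequence worth noting: a single-word message such as 'haha' yields no prefixes of length two or more, so it contributes no training examples.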

In [12]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 16, input_length=input_len))    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(lstm_size))            #A larger vocab probably requires a larger LSTM layer
    model.add(Dropout(0.1))    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 23, 16)            120672    
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               173600    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 7542)              1515942   
Total params: 1,810,214
Trainable params: 1,810,214
Non-trainable params: 0
_________________________________________________________________
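
As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers, with the sizes read off the summary; the variable names are my own.

```python
# Recompute the parameter counts reported by model.summary().
vocab_size = 7542      # total_words, read off the Dense output shape
embedding_dim = 16     # second argument to Embedding
lstm_units = 200       # lstm_size

embedding_params = vocab_size * embedding_dim
# An LSTM has four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
dense_params = lstm_units * vocab_size + vocab_size   # weights plus biases

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# prints: 120672 173600 1515942 1810214
```

More than four fifths of the parameters sit in the output layer, so the vocabulary size dominates the size of the model.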


In [13]:
history = model.fit(predictors, label, epochs=50, verbose=1, batch_size = 128)

Epoch 1/50
...
Epoch 50/50


In [14]:
model.save(f'{friend}_{num_messages}messages_{max_length}words_{lstm_size}lstm_model.h5')

In [15]:
# A function to generate styled sentences based on a seed phrase
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted = model.predict_classes(token_list, verbose=0)
        
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text
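
The loop over `tokenizer.word_index` scans the whole vocabulary for every generated word. A possible alternative, sketched here with a made-up toy dictionary in place of the real `word_index`, is to invert the mapping once and look words up directly. `index_to_word` and `lookup` are hypothetical names, not part of the notebook.

```python
# Toy stand-in for tokenizer.word_index (word -> integer id); values are made up.
word_index = {"haha": 1, "the": 2, "game": 3}

# Invert once so each lookup is a dictionary access rather than a scan.
index_to_word = {index: word for word, index in word_index.items()}

def lookup(predicted_index):
    """Return the word for an id, or an empty string for unknown ids such as padding (0)."""
    return index_to_word.get(int(predicted_index), "")

print(lookup(3))        # game
print(repr(lookup(0)))  # '' because 0 is the padding id, not a word
```

The empty-string fallback mirrors the `output_word = ""` default in `generate_text`.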

In [29]:
#Let's see what he sounds like for some different starting words/phrases!
texts = ['will', 'have you', 'I dont', 'when can', 'obviously', 'I was thinking']

for text in texts:
    print(f'Robo-{friend}: {generate_text(text, 20, model,max_sequence_len)}\n')

Robo-Dmitri: will be home in 30 mins so i havent just do it at all point at the time of the 10th

Robo-Dmitri: have you seen the razer blade pro to get it with a good 1180 key and non margin at offer it or

Robo-Dmitri: I dont know how much i was beginning to figure for the beach you at all p and far p than just

Robo-Dmitri: when can go to bed at a sensible time but i dont know what i saw it just processed a few weeks

Robo-Dmitri: obviously i was just saying i can do it to lose the prize in the way way to fund on a

Robo-Dmitri: I was thinking i have a sample of transactions who will be in a small attack to date him to it in a



This actually sounds a lot like my friend Dmitri!

To improve on this, smileys and standard texts such as 'You sent a photo' should either be removed or kept in their entirety. As it is, stripping punctuation leaves only fragments of them, such as 'p' and 'd' (presumably from ':P' and ':D'), in the results.
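
One way this could be done is sketched below. It is a rough idea rather than a tested fix: the emoticon map, the placeholder words and the system-message wording are all assumptions, and `clean_text_keep_emoticons` is a hypothetical variant of `clean_text`.

```python
import string

# Hypothetical emoticon map and system-message list; extend both for a real export.
EMOTICONS = {":p": "emotongue", ":d": "emogrin", ":)": "emosmile"}
SYSTEM_MESSAGES = {"you sent a photo."}  # assumed wording, check your own data

def clean_text_keep_emoticons(txt):
    """Lowercase, drop system messages, and protect emoticons from punctuation stripping."""
    txt = str(txt).lower().strip()
    if txt in SYSTEM_MESSAGES:
        return None                      # caller should skip this message
    # Swap whole-word emoticons for plain-letter placeholders before stripping.
    words = [EMOTICONS.get(word, word) for word in txt.split()]
    txt = " ".join(words)
    return "".join(ch for ch in txt if ch not in string.punctuation)

print(clean_text_keep_emoticons("Like actually insane :P"))
print(clean_text_keep_emoticons("You sent a photo."))
```

This only catches emoticons that stand alone as a word; one glued to the preceding word (for example 'insane:P') would still lose its punctuation.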