## Multilingual Code-switching Identification via LSTM Recurrent Neural Networks

This notebook implements the model described in the paper <a href="http://www.aclweb.org/anthology/W/W16/W16-58.pdf#page=62">Multilingual Code-switching Identification via LSTM Recurrent Neural Networks</a>.

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
import csv
from keras.preprocessing import sequence
from itertools import chain
from keras.utils import np_utils

Using Theano backend.


### Data Preparation

In [2]:
# read the training data in tsv format
train_df=pd.read_csv('data/train.tsv',sep='\t',quoting=csv.QUOTE_NONE,names=['tweet_id','user_id','start','end','token','label'])

In [3]:
# read the dev data in tsv format
dev_df=pd.read_csv('data/dev.tsv',sep='\t',quoting=csv.QUOTE_NONE,names=['tweet_id','user_id','start','end','token','label'])

In [None]:
train_df.head()

In [None]:
dev_df.head()

In [4]:
#group data by tweet_id

train_grouped=train_df.groupby('tweet_id',as_index=False, sort=False)
train_tweets=train_grouped['token'].apply(list).values
train_labels=train_grouped['label'].apply(list).values
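
The grouping step can be pictured without pandas. The sketch below is only an illustration on made-up rows (the ids, tokens and labels are hypothetical): it collects tokens and labels per tweet while keeping tweets in order of first appearance, which is the ordering that `sort=False` asks `groupby` to preserve.

```python
# Hypothetical rows in the order they appear in the file: (tweet_id, token, label).
rows = [
    ("t1", "hola", "lang2"),
    ("t1", "my", "lang1"),
    ("t2", "amigo", "lang2"),
    ("t1", "friend", "lang1"),   # a late row still joins its tweet
]

tokens_by_tweet, labels_by_tweet = {}, {}
for tweet_id, token, label in rows:
    # dicts keep insertion order, so tweets stay in order of first appearance
    tokens_by_tweet.setdefault(tweet_id, []).append(token)
    labels_by_tweet.setdefault(tweet_id, []).append(label)

toy_tweets = list(tokens_by_tweet.values())
toy_labels = list(labels_by_tweet.values())
print(toy_tweets)
print(toy_labels)
```

Each inner list of `toy_tweets` lines up position by position with the matching inner list of `toy_labels`, which is the property the rest of the notebook relies on.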

In [None]:
train_tweets 

In [None]:
train_labels

In [5]:
#repeat same for dev data
dev_grouped=dev_df.groupby('tweet_id',as_index=False, sort=False)
dev_tweets=dev_grouped['token'].apply(list).values
dev_labels=dev_grouped['label'].apply(list).values

In [None]:
dev_tweets

In [None]:
dev_labels

### Building vocabulary

Here, we scan the training data to build the vocabulary and three mappings: word to index, character to index and tag to index. We also create the reverse mappings (index to word, index to character and index to tag).
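
As a minimal sketch of the idea on a toy corpus (the tweets below are made up and stand in for `train_tweets`), a `defaultdict` whose factory returns its own current size hands out the next free id the first time a key is looked up:

```python
from collections import defaultdict

# Toy corpus standing in for train_tweets (hypothetical data, not from the dataset).
toy_tweets = [["hola", "my", "friend"], ["my", "amigo"]]

# Each new key receives the current size of the mapping as its id.
w2i = defaultdict(lambda: len(w2i))
c2i = defaultdict(lambda: len(c2i))

# Touch the reserved entries first so that padding is 0 and unknown is 1.
PAD, UNK = w2i["<padding>"], w2i["<unk>"]
CHAR_PAD, CHAR_UNK = c2i["<padding>"], c2i["<unk>"]

for tweet in toy_tweets:
    for token in tweet:
        w2i[token]          # the lookup alone assigns the next free id
        for ch in token:
            c2i[ch]

# Reverse mappings recover the original symbol from an id.
i2w = {idx: word for word, idx in w2i.items()}
i2c = {idx: ch for ch, idx in c2i.items()}

print(dict(w2i))
```

Because ids are assigned in order of first appearance, the reserved entries must be looked up before any real token, as done here.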

In [6]:
# Ref: https://github.com/neubig/yrsnlp-2016/blob/master/bow.ipynb
# w2i : word to index [assign unique ids to words in the training data]
# c2i : character to index [assign unique ids to characters in the training data]
# t2i : tag to index [assign unique ids to tags in the training data]
# reserve 0 for padding and 1 for unknown words and characters


w2i = defaultdict(lambda: len(w2i))
c2i = defaultdict(lambda: len(c2i))
t2i = defaultdict(lambda: len(t2i))
PAD=w2i['<padding>']
UNK = w2i["<unk>"]
CHAR_PAD=c2i['<padding>']
CHAR_UNK =c2i["<unk>"]

In [7]:
from pprint import pprint
pprint (w2i)

defaultdict(<function <lambda> at 0x11a9a9ae8>, {'<padding>': 0, '<unk>': 1})


In [8]:
#iterate over the tokens in training data to assign unique ids for tokens and characters

for tweet in train_tweets:
    for token in tweet:
        w2i[token]
        for c in token:
            c2i[c]
w2i=defaultdict(lambda : UNK,w2i) #prevents assigning new ids for new words, instead they are assigned the unk token id
c2i=defaultdict(lambda : CHAR_UNK,c2i)
i2w = {v: k for k, v in w2i.items()}
i2c= {v: k for k, v in c2i.items()}
n_words=len(w2i)
n_chars=len(c2i)

vocabulary=frozenset([k for k,v in w2i.items()])

MAX_LEN=max([len(word) for word in vocabulary]) #longest token length

In [9]:
for tweet_label in train_labels:
    for label in tweet_label:
        t2i[label]

i2t={v: k for k, v in t2i.items()}
tags=frozenset([k for k,v in t2i.items()])
n_tags=len(t2i)

In [10]:
print (MAX_LEN)

50


In [11]:
print (len(vocabulary))

17655


In [12]:
print (n_words)

17655


In [13]:
print (tags)

frozenset({'lang1', 'other', 'fw', 'ambiguous', 'unk', 'lang2', 'mixed', 'ne'})


In [14]:
print (len(tags))

8


In [15]:
print (n_tags)

8


In [16]:
t2i

defaultdict(<function __main__.<lambda>>,
            {'ambiguous': 6,
             'fw': 0,
             'lang1': 3,
             'lang2': 4,
             'mixed': 7,
             'ne': 5,
             'other': 1,
             'unk': 2})

### Converting tweet tokens and labels to their ids

Now, we will transform the tweet tokens and tags to their corresponding ids. 

In [17]:
# replace tokens and labels with their ids

transform_train_tweet= [[w2i[token] for token in tweet] for tweet in train_tweets]
transform_train_labels=[[t2i[label] for label in tweet_label] for tweet_label in train_labels]

In [None]:
#first five tweets
transform_train_tweet[:5]

In [None]:
#last five tweets
transform_train_tweet[-5:]

In [None]:
transform_train_labels[:5]

In [None]:
transform_train_labels[-5:]

In [18]:
#repeat same for dev dataset

transform_dev_tweet= [[w2i[token] for token in tweet] for tweet in dev_tweets]
transform_dev_labels=[[t2i[label] for label in tweet_label] for tweet_label in dev_labels]
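
One thing to keep in mind here: unlike `w2i` and `c2i`, the `t2i` mapping is never frozen, so a label that occurs only in the dev data would silently receive a fresh id. The sketch below, on made-up label lists, shows one possible way to check for such labels before converting; the variable names are illustrative only.

```python
from collections import defaultdict

# Hypothetical label sequences; "mixed" appears only in the dev data.
toy_train_labels = [["lang1", "lang2"], ["other", "lang1"]]
toy_dev_labels = [["lang1", "mixed"]]

t2i = defaultdict(lambda: len(t2i))
for tweet_label in toy_train_labels:
    for label in tweet_label:
        t2i[label]
n_tags = len(t2i)

# Compare label sets before converting, so that nothing gets a fresh id silently.
train_tags = set(t2i)
dev_tags = {label for tweet_label in toy_dev_labels for label in tweet_label}
unseen = sorted(dev_tags - train_tags)
print(unseen)
```

An empty `unseen` list means every dev label already has an id below `n_tags`.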


In [None]:
transform_dev_tweet[-5:]

In [None]:
transform_dev_labels[-5:]

In [None]:
#last dev tweet
dev_tweets[-1]

In [None]:
#ids of the last dev tweet
transform_dev_tweet[-1]

In [19]:
print('bicycle' in vocabulary)

False


In [20]:
print('motor' in vocabulary)

False


In [21]:
w2i['motor']

1

In [22]:
w2i['bicycle']

1

Since "motor" and "bicycle" are not in the vocabulary, both are assigned the id of the unk token (1).
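
A side effect worth knowing: looking up a missing key in a `defaultdict` also stores that key. In this notebook `n_words`, `i2w` and `vocabulary` are computed before any such lookup, so they are unaffected, but `len(w2i)` keeps growing afterwards. The sketch below uses a small made-up vocabulary to show the effect next to a plain `dict.get` fallback, which is one possible alternative rather than what the notebook does.

```python
from collections import defaultdict

UNK = 1
# Hypothetical frozen vocabulary with two real words.
w2i = defaultdict(lambda: UNK, {"<padding>": 0, "<unk>": UNK, "hola": 2, "my": 3})
size_before = len(w2i)

unk_id = w2i["bicycle"]            # returns UNK ...
size_after = len(w2i)              # ... but also stores "bicycle" as a key

# A plain dict with .get() returns the fallback without storing anything.
plain = {"<padding>": 0, "<unk>": UNK, "hola": 2, "my": 3}
safe_id = plain.get("bicycle", UNK)

print(unk_id, size_before, size_after, safe_id, len(plain))
```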

#### Character

We represent each training instance by the sequence of characters in a token.

In [23]:
tranform_train_tweet_chars=[[[c2i[c]  for c in token ] for token in tweet] for tweet in train_tweets]

In [24]:
tranform_train_tweet_chars[:5]

[[[2, 3, 4, 5, 6, 7, 8], [9, 10]],
 [[11, 12, 13, 8, 7, 14, 15, 6, 16, 17, 6, 3, 7], [18], [19, 19, 19]],
 [[11, 20, 21, 22, 15, 23, 24, 24, 25, 17, 8, 15, 15, 23, 16],
  [4, 13, 4, 4, 4],
  [26, 27, 26]],
 [[28, 4, 4, 4], [27, 27, 27]],
 [[28, 4, 4, 4]]]

In [25]:
tranform_dev_tweet_chars=[[[c2i[c]  for c in token ] for token in tweet] for tweet in dev_tweets]

In [26]:
tranform_dev_tweet_chars[:5]

[[[32, 32, 9, 32, 32],
  [27, 27, 27, 27, 27],
  [4, 3, 36, 23],
  [3],
  [38, 6, 16, 13],
  [27, 27, 27, 27, 27, 27, 27],
  [7, 6, 48, 13, 17],
  [7, 6, 48, 13, 17]],
 [[11, 41, 6, 16, 23, 2, 3, 5, 23, 23],
  [6],
  [38, 6, 16, 13],
  [6],
  [49, 8, 21, 31, 22],
  [5, 23],
  [6, 48, 7, 8, 15, 3, 7, 17]],
 [[2, 8, 21, 17], [17, 8], [48, 23, 17], [4, 24], [60, 3, 17], [61, 65]],
 [[29, 23],
  [23, 7, 49, 3, 7, 17, 3, 15, 6, 3],
  [81, 21, 23],
  [4, 23],
  [30, 6, 23, 15, 3, 16],
  [80],
  [22, 23],
  [31, 3],
  [4, 3, 7, 23, 15, 3],
  [81, 21, 23],
  [24, 8],
  [17, 23],
  [30, 23, 8],
  [64, 61, 7, 23, 10, 3, 24]],
 [[11, 14, 3, 31, 30, 3, 15, 24, 20, 15, 31, 14, 14],
  [10, 8],
  [24, 8, 21],
  [48, 21, 24, 16],
  [13, 3, 30, 23],
  [3],
  [16, 39, 3, 7, 6, 16, 13],
  [22, 23, 39, 3, 15, 17, 4, 23, 7, 17],
  [8, 15],
  [16, 23, 15, 30, 6, 49, 23, 16],
  [40, 40, 40]]]

### Building training and dev data

For the word model, we represent each tweet token by its context: the token itself together with a fixed number of tokens to its left and right, padded at the tweet boundaries.

In [27]:
#REF: http://deeplearning.net/tutorial/rnnslu.html#rnnslu
def contextwin(l, win):
    '''
    Build a context window around every word of a sentence.

    l   :: list or array containing the word indexes of the sentence
    win :: odd int >= 1 corresponding to the size of the window

    Returns a list with one window (a list of win indexes) per word,
    padded with PAD at the sentence boundaries.
    '''
    assert (win % 2) == 1
    assert win >= 1
    l = list(l)

    lpadded = win // 2 * [PAD] + l + win // 2 * [PAD]
    out = [lpadded[i:(i + win)] for i in range(len(l))]

    assert len(out) == len(l)
    return out

In [28]:
context_window=5 # each training instance is a sequence of 5 ids: the token in the middle, with two tokens on each side

#### Example transformed tweet

In [30]:
#0 indicates padding 
#pprint(" ".join([i2w[token] for token in transform_train_tweet[10]]))
pprint(transform_train_tweet[10])

[18, 13, 21, 11, 9]


In [31]:
#represent above tweet  tokens with context window of 5
pprint(contextwin(transform_train_tweet[10],context_window))

[[0, 0, 18, 13, 21],
 [0, 18, 13, 21, 11],
 [18, 13, 21, 11, 9],
 [13, 21, 11, 9, 0],
 [21, 11, 9, 0, 0]]


In [None]:
 [[i2w[token]  for token in tweet_tokens] for tweet_tokens in contextwin(transform_train_tweet[10],context_window)]

In [34]:
X_train= [ contextwin(tweet,context_window) for tweet in transform_train_tweet]

In [35]:
X_train[:5]

[[[0, 0, 2, 3, 0], [0, 2, 3, 0, 0]],
 [[0, 0, 4, 5, 6], [0, 4, 5, 6, 0], [4, 5, 6, 0, 0]],
 [[0, 0, 7, 8, 9], [0, 7, 8, 9, 0], [7, 8, 9, 0, 0]],
 [[0, 0, 10, 11, 0], [0, 10, 11, 0, 0]],
 [[0, 0, 10, 0, 0]]]

In [36]:
#flatten the list
# we do not need boundaries between tweets as each token now is represented by its context.
X_train=list(chain(*X_train))
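
To see why flattening is safe, the sketch below repeats the windowing on two made-up tweets with a window of 3 (the ids and labels are hypothetical, and `contextwin` is re-declared in compact form so the block runs on its own). Padding is applied per tweet, so no window ever spans two tweets, and the flattened windows stay aligned with the flattened labels.

```python
from itertools import chain

PAD = 0

def contextwin(ids, win):
    """Return one window of length win centred on each position of ids."""
    assert win % 2 == 1 and win >= 1
    half = win // 2
    padded = [PAD] * half + list(ids) + [PAD] * half
    return [padded[i:i + win] for i in range(len(ids))]

toy_tweets = [[2, 3], [4, 5, 6]]        # hypothetical word ids
toy_labels = [[0, 1], [1, 1, 2]]        # hypothetical tag ids

windows = list(chain.from_iterable(contextwin(t, 3) for t in toy_tweets))
labels = list(chain.from_iterable(toy_labels))

for window, label in zip(windows, labels):
    print(window, label)
```

The middle element of every window is the token that the label on the same row refers to.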

In [37]:
X_train=np.array(X_train)

In [38]:
X_train[:5]

array([[0, 0, 2, 3, 0],
       [0, 2, 3, 0, 0],
       [0, 0, 4, 5, 6],
       [0, 4, 5, 6, 0],
       [4, 5, 6, 0, 0]])

In [39]:
Y_train=list(chain(*transform_train_labels))

In [40]:
Y_train=np_utils.to_categorical(np.array(Y_train)) # one hot encoding 

In [41]:
Y_train[:5]

array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
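
One-hot encoding turns each tag id into a row with a single 1 in the column of that id. The pure-Python sketch below (the helper `one_hot` is illustrative and not part of Keras) also shows why the number of columns matters: when the width is inferred from the largest id in the data, a split that happens to lack the highest tag ids produces narrower rows than the model's 8-way output expects.

```python
def one_hot(label_ids, num_classes):
    """Return one row per label with a single 1.0 in the label's column."""
    rows = []
    for label_id in label_ids:
        row = [0.0] * num_classes
        row[label_id] = 1.0
        rows.append(row)
    return rows

n_tags = 8                      # number of tags in this notebook
toy_dev_ids = [1, 3, 4, 1]      # hypothetical split in which ids 5-7 never occur

inferred = one_hot(toy_dev_ids, max(toy_dev_ids) + 1)   # width inferred from the data
fixed = one_hot(toy_dev_ids, n_tags)                    # width fixed explicitly

print(len(inferred[0]), len(fixed[0]))
```

The training data always contains every tag id by construction, so the inferred width is only a concern for other splits such as the dev data.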

In [42]:
#same for dev data

X_dev=[ contextwin(tweet,context_window) for tweet in transform_dev_tweet]
X_dev=list(chain(*X_dev))

In [43]:
X_dev=np.array(X_dev)

In [44]:
Y_dev=list(chain(*transform_dev_labels))

In [45]:
Y_dev=np_utils.to_categorical(np.array(Y_dev), num_classes=n_tags) # keep n_tags columns even if some tag is absent from dev

In [46]:
#check that the number of instances matches the number of labels
assert len(X_train)==len(Y_train)
assert len(X_dev)==len(Y_dev)

### Building training data for character model

We represent each token by the sequence of its characters.

In [47]:
X_train_char=list(chain(*tranform_train_tweet_chars))

In [48]:
X_train_char[:5]

[[2, 3, 4, 5, 6, 7, 8],
 [9, 10],
 [11, 12, 13, 8, 7, 14, 15, 6, 16, 17, 6, 3, 7],
 [18],
 [19, 19, 19]]

In [49]:
# creating same length training instances 
# we define sequence length to be the max token length.

X_train_char= sequence.pad_sequences(X_train_char, maxlen=MAX_LEN)

In [50]:
X_train_char=np.array(X_train_char)

In [51]:
X_train_char[:5]

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  2,  3,  4,  5,  6,  7,  8],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0, 11, 12, 13,  8,  7, 14, 15,  6, 16, 17,  6,  3,  7],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 18],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0, 

In [52]:
X_dev_char=list(chain(*tranform_dev_tweet_chars))

In [53]:
X_dev_char=sequence.pad_sequences(X_dev_char, maxlen=MAX_LEN)
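
With its default settings, `pad_sequences` pads on the left and, for sequences longer than `maxlen`, drops items from the front. The helper below is a pure-Python sketch of that default behaviour (`pad_left` is a hypothetical name, not a Keras function). It matters for the dev data: a dev token longer than `MAX_LEN` keeps only its last `MAX_LEN` characters.

```python
def pad_left(seqs, maxlen, value=0):
    """Left-pad each sequence to maxlen; keep only the last maxlen items of longer ones."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy = [[9, 10], [2, 3, 4, 5, 6, 7, 8]]   # hypothetical character ids
print(pad_left(toy, maxlen=5))
```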

## Building Model

<img src="img/model.png" alt="Drawing" style="width: 300px; height: 400px"/>

In [54]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Merge  # Merge is a legacy layer, deprecated in Keras 2 and removed from later releases
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.initializers import RandomUniform
from keras.optimizers import Adamax

In [55]:
batch_size=128
nb_epoch=20

In [56]:
# not using pre-trained embeddings (randomly initialized)
# the paper used pre-trained embeddings
word_model=Sequential([
    Embedding(output_dim=300, input_dim=n_words,
                    input_length=context_window,mask_zero=True,embeddings_initializer=RandomUniform(seed=1234)),
    Dropout(0.5),
    LSTM(100, return_sequences=False),
    Dropout(0.5)
])
word_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 5, 300)            5296500   
_________________________________________________________________
dropout_1 (Dropout)          (None, 5, 300)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
Total params: 5,456,900
Trainable params: 5,456,900
Non-trainable params: 0
_________________________________________________________________


In [57]:
character_model=Sequential([
    Embedding(n_chars, 50, input_length=MAX_LEN, mask_zero=True,embeddings_initializer=RandomUniform(seed=1234)),
    Dropout(0.5),
    LSTM(100,return_sequences=False),
    Dropout(0.5)
    ])
character_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 50)            16700     
_________________________________________________________________
dropout_3 (Dropout)          (None, 50, 50)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               60400     
_________________________________________________________________
dropout_4 (Dropout)          (None, 100)               0         
Total params: 77,100
Trainable params: 77,100
Non-trainable params: 0
_________________________________________________________________


In [58]:
merge_word_character_model = Sequential()
merge_word_character_model.add(Merge([character_model,word_model], mode='concat',concat_axis=1))
merge_word_character_model.add(Dense(n_tags, activation='softmax'))
merge_word_character_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
merge_1 (Merge)              (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 1608      
Total params: 5,535,608
Trainable params: 5,535,608
Non-trainable params: 0
_________________________________________________________________
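
The parameter counts in the three summaries can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; `n_chars = 334` is not printed anywhere in the notebook and is inferred here from the character embedding size (16700 / 50).

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return (input_dim + 1) * units

n_words, n_chars, n_tags = 17655, 334, 8   # n_chars inferred from 16700 / 50

word_branch = embedding_params(n_words, 300) + lstm_params(300, 100)
char_branch = embedding_params(n_chars, 50) + lstm_params(50, 100)
output_layer = dense_params(100 + 100, n_tags)

print(word_branch, char_branch, output_layer,
      word_branch + char_branch + output_layer)
```

The word embedding accounts for most of the 5,535,608 parameters, which is worth keeping in mind given that it is trained from scratch here rather than pre-trained.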


In [59]:
adamax = Adamax(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

In [None]:
merge_word_character_model.compile(loss='categorical_crossentropy', optimizer=adamax, metrics=['accuracy'])

In [None]:
merge_word_character_model.fit([X_train_char,X_train], Y_train,
                batch_size=batch_size,
                epochs=nb_epoch,
                verbose=1,
                shuffle=True,                  
                validation_data=([X_dev_char,X_dev],Y_dev)
                )

Train on 139539 samples, validate on 33276 samples
Epoch 1/20
Epoch 2/20