#  Tweet2Vec:Character-Based Distributed Representations for Social Media

#### Kumara Prasanna Jayaraju / Rinaldo Sonia Joseph Santhana Raj

####  Emails: kjayaraju@ryerson.ca / rinaldo.joseph@ryerson.ca

# Introduction:

#### Problem Description:

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts. 

#### Context of the Problem:

This is leading to a prohibitively large vocabulary size for word-level approaches. In any natural language corpus a majority of the vocabulary word types will either be absent or occur in low frequency. Estimating the statistical properties of these rare word types is naturally a difficult task.This is analogous to the curse of dimensionality when we deal with sequences of tokens, most sequences will occur only once in the training data.

#### Limitation About other Approaches:

Traditional Neural Network Language Models (NNLMs) treat words as the basic units of language and assign independent vectors to each word type. This paper is motivated by bi-directional Long Short Term Memory (LSTM )for composing word vectors

#### Solution:

In this project, the authors explore a similar approach to learning distributed representations of social media posts by composing them from their constituent characters, with the goal of generalizing to out-of-vocabulary words as well as sequences at test time. The paper propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences.

# Background

| Reference |Explanation |  Dataset/Input |Weakness
| --- | --- | --- | --- |
| Bengio et al., 2003 [1] | Using neural networks to learn distributed representations of words dates back leased word2vec, a collection of word vectors trained using a recurrent neural network.| word2Vev, sentences, documents and paragraphs | Require storing an extremely large table of vectors for all word types and cannot be easily generalized to unseen words at test time.
| Ling et al., 2015 [2] | the authors present a compositional character model based on bidirectional LSTMs as a potential solution to these problems.| Large data sets, sentences, paragraphs| generate word embeddings from character-level representations only, less accuracy. 
| Luong et al., 2013 [3] | Dealt with the problem of estimating rare word representations by building them from their constituent morphemes | Large data sets, sentences, paragraphs| approach requires a morpheme parser for preprocessing which may not perform well on noisy text like Twitter.
| Dhingra et al., 2016 [4] | A character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. | Feeds, Twitter| The paper is limited only to english language but can be extended for other languages.


# Methodology

Bi-GRU Encoder: Figure below shows our model for encoding tweets. It uses a similar structure to the C2W model in (Ling et al., 2015), with LSTM units replaced with GRU units.

The input to the network is defined by an alphabet of characters C (this may include the entire unicode character set). The input tweet is broken into a stream of characters c1 , c2...cm each of which is represented by a 1-by-|C| encoding. These one-hot vectors are then projected to a character space by multiplying with the matrix PC ∈ R|C|×dc, where dc is the dimension of the character vector space.

The encoder consists of a forward- GRU and a backward-GRU. Both have the same architecture, except the backward-GRU processes the sequence in reverse order. Each of the GRU units process these vectors sequentially, and start- ing with the initial state h0 compute the sequence h1, h2, ...hm. 

Finally, the tweet embedding is passed through a linear layer whose output is the same size as the number of hashtags L in the data set. We use a softmax layer to compute the posterior hashtag probabilities and the objective function is to optimize the categorical cross-entropy loss between predicted and true hashtags.


<div align="center"> Bi-GRU Encoder </div>

![Tweet2Vec_Encoder](Tweet2vec_encoder.png "Bi-GRU Encoder")

### Data Preparation:

We didnt have any dataset associated with the paper due to confidentiality. So we planned to collect tweets on our own based on the following common life oriented keywords such as '#life', '#motivation', '#happy', '#emotions', '#friends', '#babies', '#dogs' to implement and test out the project. 

We have divided the paper into two parts: 

Part 1: Comparing the correctly predicted hashtags by Word Model baseline with Tweet2Vec

Part 2: Training and testing the dataset to calculate Precision, Recall and Mean Rank.

In this notebook, we are going show case the implemention of second part which is training and testing the dataset to compare the performace between word model baseline and tweet2Vec Encoder.


# Implementation


In [1]:
#importing all the neccessary libraries:
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger
from decimal import *
import re
import sys
import io
import preprocessor as p
import numpy as np
import lasagne
import theano
import theano.tensor as T
import sys
import batch_word as batch
import pickle as pkl
import os
from w2v import tweet2vec, load_params
from settings_word import N_BATCH, N_WORD, MAX_CLASSES

#setting up conditions for Theano:
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"

##### Word baseline classifier trainer

In [2]:
import numpy as np
import lasagne
import theano
import theano.tensor as T
import random
import sys
import batch_word as batch
import time
import pickle as pkl
import io
import shutil
import os

from collections import OrderedDict
from w2v import tweet2vec, init_params, load_params_shared
from settings_word import NUM_EPOCHS, N_BATCH, N_WORD, SCALE, WDIM, MAX_CLASSES, LEARNING_RATE, DISPF, SAVEF, REGULARIZATION, RELOAD_MODEL, MOMENTUM, SCHEDULE
from evaluate_w2v import precision

#setting up conditions for Theano:
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"

T1 = 0.01
T2 = 0.0001

train_path = "/NLP_Final_Project/Method 2/data/train_DS.txt"
val_path = "/NLP_Final_Project/Method 2/data/Val_DS.txt"
save_path = "/NLP_Final_Project/Method 2/data"

def schedule(lr, mu):
    print("Updating Schedule...")
    lr = max(1e-5,lr/2)
    return lr, mu

def tnorm(tens):
    '''
    Tensor Norm
    '''
    return T.sqrt(T.sum(T.sqr(tens),axis=1))

def classify(tweet, t_mask, params, n_classes, n_tokens):
    # tweet embedding
    emb_layer = tweet2vec(tweet, t_mask, params, n_tokens)

    # Dense layer for classes
    l_dense = lasagne.layers.DenseLayer(emb_layer, n_classes, W=params['W_cl'], b=params['b_cl'], nonlinearity=lasagne.nonlinearities.softmax)

    return lasagne.layers.get_output(l_dense), l_dense, lasagne.layers.get_output(emb_layer)

def main(train_path,val_path,save_path,num_epochs=NUM_EPOCHS):
    global T1

    # save settings
    shutil.copyfile('settings_word.py','%s/settings_word.txt'%save_path)

    print("Preparing Data...")
    # Training data
    Xt = []
    yt = []
    with io.open(train_path,'r') as f:
        for line in f:
            (yc, Xc) = line.rstrip('\n').split('\t')
            Xt.append(Xc)
            yt.append(yc)
    # Validation data
    Xv = []
    yv = []
    with io.open(val_path,'r') as f:
        for line in f:
            (yc, Xc) = line.rstrip('\n').split('\t')
            Xv.append(Xc)
            yv.append(yc.split(','))

    print("Preparing Model...")
    if not RELOAD_MODEL:
        # Build dictionaries from training data
        tokendict, tokencount = batch.build_dictionary(Xt)
        n_token = min(len(tokendict.keys()) + 1, N_WORD)
        batch.save_dictionary(tokendict,tokencount,'%s/dict-nlp-w2v-p2.pkl' % save_path)
        
        # params
        params = init_params(n_chars=n_token)
        
        labeldict, labelcount = batch.build_label_dictionary(yt)
        batch.save_dictionary(labeldict, labelcount, '%s/label_dict-nlp-w2v-p2.pkl' % save_path)

        n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)

        # classification params
        params['W_cl'] = theano.shared(np.random.normal(loc=0., scale=SCALE, size=(WDIM,n_classes)).astype('float32'), name='W_cl')
        params['b_cl'] = theano.shared(np.zeros((n_classes)).astype('float32'), name='b_cl')

    else:
        print("Loading model params...")
        params = load_params_shared('%s/best_model-nlp-w2v-p2.npz' % save_path)

        print("Loading dictionaries...")
        with open('%s/dict-nlp-w2v-p2.pkl' % save_path, 'rb') as f:
            tokendict = pkl.load(f)
        with open('%s/label_dict-nlp-w2v-p2.pkl' % save_path, 'rb') as f:
            labeldict = pkl.load(f)
        n_token = min(len(tokendict.keys()) + 1, N_WORD)
        n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)

    # iterators
    train_iter = batch.BatchTweets(Xt, yt, labeldict, batch_size=N_BATCH, max_classes=MAX_CLASSES)
    val_iter = batch.BatchTweets(Xv, yv, labeldict, batch_size=N_BATCH, max_classes=MAX_CLASSES, test=True)

    print("Building network...")
    # Tweet variables
    tweet = T.itensor3()
    targets = T.ivector()

    # masks
    t_mask = T.fmatrix()

    # network for prediction
    predictions, net, emb = classify(tweet, t_mask, params, n_classes, n_token)

    # batch loss
    loss = lasagne.objectives.categorical_crossentropy(predictions, targets)
    cost = T.mean(loss) + REGULARIZATION*lasagne.regularization.regularize_network_params(net, lasagne.regularization.l2)
    cost_only = T.mean(loss)
    reg_only = REGULARIZATION*lasagne.regularization.regularize_network_params(net, lasagne.regularization.l2)

    # params and updates
    print("Computing updates...")
    lr = LEARNING_RATE
    mu = MOMENTUM
    updates = lasagne.updates.nesterov_momentum(cost, lasagne.layers.get_all_params(net), lr, momentum=mu)

    # Theano function
    print("Compiling theano functions...")
    inps = [tweet,t_mask,targets]
    predict = theano.function([tweet,t_mask],predictions)
    encode = theano.function([tweet,t_mask],emb)
    cost_val = theano.function(inps,[cost_only,emb])
    train = theano.function(inps,cost,updates=updates)
    reg_val = theano.function([],reg_only)

    # Training
    print("Training...")
    uidx = 0
    maxp = 0.
    start = time.time()
    valcosts = []
    try:
        for epoch in range(num_epochs):
            n_samples = 0
            train_cost = 0.
            print("Epoch {}".format(epoch))

            # learning schedule
            if len(valcosts) > 1 and SCHEDULE:
                change = (valcosts[-1]-valcosts[-2])/abs(valcosts[-2])
                if change < T1:
                    lr, mu = schedule(lr, mu)
                    updates = lasagne.updates.nesterov_momentum(cost, lasagne.layers.get_all_params(net), lr, momentum=mu)
                    train = theano.function(inps,cost,updates=updates)
                    T1 = T1/2

            # stopping criterion
            if len(valcosts) > 6:
                deltas = []
                for i in range(5):
                    deltas.append((valcosts[-i-1]-valcosts[-i-2])/abs(valcosts[-i-2]))
                if sum(deltas)/len(deltas) < T2:
                    break

            ud_start = time.time()
        for xr,y in train_iter:
            n_samples +=len(xr)
            uidx += 1
            x, x_m = batch.prepare_data(xr, tokendict, n_tokens=n_token)
            if x.any()==None:
                print("Minibatch with zero samples under maxlength.")
                uidx -= 1
                continue

            curr_cost = train(x,x_m,y)
            train_cost += curr_cost*len(xr)
            ud = time.time() - ud_start

            if np.isnan(curr_cost) or np.isinf(curr_cost):
                print("Nan detected.")
                return

            if np.mod(uidx, DISPF) == 0:
                print("Epoch {} Update {} Cost {} Time {}".format(epoch,uidx,curr_cost,ud))

            if np.mod(uidx,SAVEF) == 0:
                print("Saving...")
                saveparams = OrderedDict()
                for kk,vv in params.items():
                    saveparams[kk] = vv.get_value()
                    np.savez('%s/model-nlp-w2v-p2.npz' % save_path,**saveparams)
                    print("Done.")

        print("Testing on Validation set...")
        preds = []
        targs = []
        for xr,y in val_iter:
            x, x_m = batch.prepare_data(xr, tokendict, n_tokens=n_token)
            if x.any()==None:
                print("Validation: Minibatch with zero samples under maxlength.")
                continue

            vp = predict(x,x_m)
            ranks = np.argsort(vp)[:,::-1]
            for idx,item in enumerate(xr):
                preds.append(ranks[idx,:])
                targs.append(y[idx])

        validation_cost = precision(np.asarray(preds),targs,1)
        regularization_cost = reg_val()

        if validation_cost > maxp:
            maxp = validation_cost
            saveparams = OrderedDict()
            for kk,vv in params.items():
                saveparams[kk] = vv.get_value()
            np.savez('%s/best_model-nlp-w2v-p2.npz' % (save_path),**saveparams)

        print("Epoch {} Training Cost {} Validation Precision {} Regularization Cost {} Max Precision {}".format(epoch, train_cost/n_samples, validation_cost, regularization_cost, maxp))
        print("Seen {} samples.".format(n_samples))
        valcosts.append(validation_cost)

        print("Saving...")
        saveparams = OrderedDict()
        for kk,vv in params.items():
            saveparams[kk] = vv.get_value()
        np.savez('%s/model-nlp-w2v-p2_%d.npz' % (save_path,epoch),**saveparams)
        print("Done.")

    except KeyboardInterrupt:
        pass
    print("Total training time = {}".format(time.time()-start))
    

if __name__ == '__main__':
    main(train_path,val_path,save_path)

Preparing Data...
Preparing Model...
Building network...
Computing updates...
Compiling theano functions...
Training...
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29
Testing on Validation set...
Epoch 29 Training Cost 5.025040453130549 Validation Precision 3.6 Regularization Cost 0.5672546625137329 Max Precision 3.6
Seen 110 samples.
Saving...
Done.
Total training time = 0.2481698989868164


##### Word baseline classifier tester

In [3]:
import numpy as np
import lasagne
import theano
import theano.tensor as T
import random
import sys
import batch_word as batch
import time
import pickle as pkl
import io
import os
import evaluate_w2v as evaluate

from collections import OrderedDict
from w2v import tweet2vec, load_params
from settings_word import N_BATCH, N_WORD, MAX_CLASSES


#setting up conditions for Theano:
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"


data_path = "/NLP_Final_Project/Method 2/data/test_DS.txt"
save_path = "/NLP_Final_Project/Method 2/data"


def classify(tweet, t_mask, params, n_classes, n_chars):
    # tweet embedding
    emb_layer = tweet2vec(tweet, t_mask, params, n_chars)
    # Dense layer for classes
    l_dense = lasagne.layers.DenseLayer(emb_layer, n_classes, W=params['W_cl'], b=params['b_cl'], nonlinearity=lasagne.nonlinearities.softmax)

    return lasagne.layers.get_output(l_dense), lasagne.layers.get_output(emb_layer)

def main(args):

    if len(args)>3:
        m_num = int(args[3])

    print("Preparing Data...")
    # Test data
    Xt = []
    yt = []
    with io.open(data_path,'r',encoding='utf-8') as f:
        for line in f:
            (yc, Xc) = line.rstrip('\n').split('\t')
            Xt.append(Xc)
            yt.append(yc.split(','))

    # Model
    print("Loading model params...")
    if len(args)>3:
        #print('Loading %s/model_%d.npz' % (save_path,m_num))
        params = load_params('%s/model-nlp-w2v-p2_%d.npz' % (save_path,m_num))
    else:
        #print('Loading %s/best_model.npz' % save_path)
        params = load_params('%s/best_model-nlp-w2v-p2.npz' % save_path)

    print("Loading dictionaries...")
    with open('%s/dict-nlp-w2v-p2.pkl' % save_path, 'rb') as f:
        chardict = pkl.load(f)
    with open('%s/label_dict-nlp-w2v-p2.pkl' % save_path, 'rb') as f:
        labeldict = pkl.load(f)
    n_char = min(len(chardict.keys()) + 1, N_WORD)
    n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)

    # iterators
    test_iter = batch.BatchTweets(Xt, yt, labeldict, batch_size=N_BATCH, max_classes=MAX_CLASSES, test=True)

    print("Building network...")
    # Tweet variables
    tweet = T.itensor3()
    targets = T.imatrix()
    # masks
    t_mask = T.fmatrix()

    # network for prediction
    predictions, embeddings = classify(tweet, t_mask, params, n_classes, n_char)

    # Theano function
    print("Compiling theano functions...")
    predict = theano.function([tweet,t_mask],predictions)
    encode = theano.function([tweet,t_mask],embeddings)

    # Test
    print("Testing...")
    out_data = []
    out_pred = []
    out_emb = []
    out_target = []
    for xr,y in test_iter:
        x, x_m = batch.prepare_data(xr, chardict, n_tokens=n_char)
        p = predict(x,x_m)
        e = encode(x,x_m)
        ranks = np.argsort(p)[:,::-1]

        for idx, item in enumerate(xr):
            out_data.append(item)
            out_pred.append(ranks[idx,:])
            out_emb.append(e[idx,:])
            out_target.append(y[idx])

    # Save
    print("Saving...")
    with open('%s/data-nlp-w2v-p2.pkl'%save_path,'wb') as f:
        pkl.dump(out_data,f)
    with open('%s/predictions-nlp-w2v-p2.npy'%save_path,'wb') as f:
        np.save(f,np.asarray(out_pred))
    with open('%s/embeddings-nlp-w2v-p2.npy'%save_path,'wb') as f:
        np.save(f,np.asarray(out_emb))
    with open('%s/targets-nlp-w2v-p2.pkl'%save_path,'wb') as f:
        pkl.dump(out_target,f)
        
if __name__ == '__main__':
    main(sys.argv[1:])
    evaluate.main(save_path)

Preparing Data...
Loading model params...
Loading dictionaries...
Building network...
Compiling theano functions...
Testing...
Saving...
Precision @ 1 = 3.4347826086956523
Recall @ 10 = 0.42748917748917753
Mean rank = 43


##### Tweet2Vec classifier trainer

In [4]:
import numpy as np
import lasagne
import theano
import theano.tensor as T
import random
import sys
import batch_char as batch
import time
import pickle as pkl
import io
import os
import shutil

from collections import OrderedDict
from t2v import tweet2vec, init_params, load_params_shared
from settings_char import NUM_EPOCHS, N_BATCH, MAX_LENGTH, SCALE, WDIM, MAX_CLASSES, LEARNING_RATE, DISPF, SAVEF, REGULARIZATION, RELOAD_MODEL, MOMENTUM, SCHEDULE
from evaluate_t2v import precision

#setting up conditions for Theano:
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"


T1 = 0.01
T2 = 0.0001

train_path = "/NLP_Final_Project/Method 2/data/train_DS.txt"
val_path = "/NLP_Final_Project/Method 2/data/Val_DS.txt"
save_path = "/NLP_Final_Project/Method 2/data"

def schedule(lr, mu):
    print("Updating Schedule...")
    lr = max(1e-5,lr/2)
    return lr, mu

def tnorm(tens):
    '''
    Tensor Norm
    '''
    return T.sqrt(T.sum(T.sqr(tens),axis=1))

def classify(tweet, t_mask, params, n_classes, n_chars):
    # tweet embedding
    emb_layer = tweet2vec(tweet, t_mask, params, n_chars)
    # Dense layer for classes
    l_dense = lasagne.layers.DenseLayer(emb_layer, n_classes, W=params['W_cl'], b=params['b_cl'], nonlinearity=lasagne.nonlinearities.softmax)

    return lasagne.layers.get_output(l_dense), l_dense, lasagne.layers.get_output(emb_layer)

def main(train_path,val_path,save_path,num_epochs=NUM_EPOCHS):
    global T1

    # save settings
    shutil.copyfile('settings_char.py','%s/settings_char.txt'%save_path)

    print("Preparing Data...")
    # Training data
    Xt = []
    yt = []
    with io.open(train_path,'r',encoding='utf-8') as f:
        for line in f:
            (yc, Xc) = line.rstrip('\n').split('\t')
            Xt.append(Xc[:MAX_LENGTH])
            yt.append(yc)
    # Validation data
    Xv = []
    yv = []
    with io.open(val_path,'r',encoding='utf-8') as f:
        for line in f:
            (yc, Xc) = line.rstrip('\n').split('\t')
            Xv.append(Xc[:MAX_LENGTH])
            yv.append(yc.split(','))

    print("Building Model...")
    if not RELOAD_MODEL:
        # Build dictionaries from training data
        chardict, charcount = batch.build_dictionary(Xt)
        n_char = len(chardict.keys()) + 1
        batch.save_dictionary(chardict,charcount,'%s/dict-nlp-t2v-p2.pkl' % save_path)
        
        # params
        params = init_params(n_chars=n_char)
        
        labeldict, labelcount = batch.build_label_dictionary(yt)
        batch.save_dictionary(labeldict, labelcount, '%s/label_dict-nlp-t2v-p2.pkl' % save_path)

        n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)

        # classification params
        params['W_cl'] = theano.shared(np.random.normal(loc=0., scale=SCALE, size=(WDIM,n_classes)).astype('float32'), name='W_cl')
        params['b_cl'] = theano.shared(np.zeros((n_classes)).astype('float32'), name='b_cl')

    else:
        print("Loading model params...")
        params = load_params_shared('%s/model-nlp-t2v-p2.npz' % save_path)

        print("Loading dictionaries...")
        with open('%s/dict-nlp-t2v-p2.pkl' % save_path, 'rb') as f:
            chardict = pkl.load(f)
        with open('%s/label_dict-nlp-t2v-p2.pkl' % save_path, 'rb') as f:
            labeldict = pkl.load(f)
        n_char = len(chardict.keys()) + 1
        n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)

    # iterators
    train_iter = batch.BatchTweets(Xt, yt, labeldict, batch_size=N_BATCH, max_classes=MAX_CLASSES)
    val_iter = batch.BatchTweets(Xv, yv, labeldict, batch_size=N_BATCH, max_classes=MAX_CLASSES, test=True)

    print("Building network...")
    # Tweet variables
    tweet = T.itensor3()
    targets = T.ivector()
    
    # masks
    t_mask = T.fmatrix()

    # network for prediction
    predictions, net, emb = classify(tweet, t_mask, params, n_classes, n_char)

    # batch loss
    loss = lasagne.objectives.categorical_crossentropy(predictions, targets)
    cost = T.mean(loss) + REGULARIZATION*lasagne.regularization.regularize_network_params(net, lasagne.regularization.l2)
    cost_only = T.mean(loss)
    reg_only = REGULARIZATION*lasagne.regularization.regularize_network_params(net, lasagne.regularization.l2)

    # params and updates
    print("Computing updates...")
    lr = LEARNING_RATE
    mu = MOMENTUM
    updates = lasagne.updates.nesterov_momentum(cost, lasagne.layers.get_all_params(net), lr, momentum=mu)

    # Theano function
    print("Compiling theano functions...")
    inps = [tweet,t_mask,targets]
    predict = theano.function([tweet,t_mask],predictions)
    cost_val = theano.function(inps,[cost_only,emb])
    train = theano.function(inps,cost,updates=updates)
    reg_val = theano.function([],reg_only)

    # Training
    print("Training...")
    uidx = 0
    maxp = 0.
    start = time.time()
    valcosts = []
    try:
        for epoch in range(num_epochs):
            n_samples = 0
            train_cost = 0.
            print("Epoch {}".format(epoch))

            # learning schedule
            if len(valcosts) > 1 and SCHEDULE:
                change = (valcosts[-1]-valcosts[-2])/abs(valcosts[-2])
                if change < T1:
                    lr, mu = schedule(lr, mu)
                    updates = lasagne.updates.nesterov_momentum(cost, lasagne.layers.get_all_params(net), lr, momentum=mu)
                    train = theano.function(inps,cost,updates=updates)
                    T1 = T1/2

            # stopping criterion
            if len(valcosts) > 6:
                deltas = []
                for i in range(5):
                    deltas.append((valcosts[-i-1]-valcosts[-i-2])/abs(valcosts[-i-2]))
                if sum(deltas)/len(deltas) < T2:
                    break

            ud_start = time.time()
        for xr,y in train_iter:
            n_samples +=len(xr)
            uidx += 1
            x, x_m = batch.prepare_data(xr, chardict, n_chars=n_char)
            if x is None:
                print("Minibatch with zero samples under maxlength.")
                uidx -= 1
                continue

                curr_cost = train(x,x_m,y)
                train_cost += curr_cost*len(xr)
                ud = time.time() - ud_start

                if np.isnan(curr_cost) or np.isinf(curr_cost):
                    print("Nan detected.")
                    return

                if np.mod(uidx, DISPF) == 0:
                    print("Epoch {} Update {} Cost {} Time {}".format(epoch,uidx,curr_cost,ud))

                if np.mod(uidx,SAVEF) == 0:
                    print("Saving...")
                    saveparams = OrderedDict()
                    for kk,vv in params.items():
                        saveparams[kk] = vv.get_value()
                    np.savez('%s/model-nlp-t2v-p2.npz' % save_path,**saveparams)
                    print("Done.")

        print("Testing on Validation set...")
        preds = []
        targs = []
        for xr,y in val_iter:
            x, x_m = batch.prepare_data(xr, chardict, n_chars=n_char)
            if x is None:
                print("Validation: Minibatch with zero samples under maxlength.")
                continue

            vp = predict(x,x_m)
            ranks = np.argsort(vp)[:,::-1]
            for idx,item in enumerate(xr):
                preds.append(ranks[idx,:])
                targs.append(y[idx])

        validation_cost = precision(np.asarray(preds),targs,1)
        regularization_cost = reg_val()

        if validation_cost > maxp:
            maxp = validation_cost
            saveparams = OrderedDict()
            for kk,vv in params.items():
                saveparams[kk] = vv.get_value()
            np.savez('%s/best_model-nlp-t2v-p2.npz' % (save_path),**saveparams)
            print("Done")

        print("Epoch {} Training Cost {} Validation Precision {} Regularization Cost {} Max Precision {}".format(epoch, train_cost/n_samples, validation_cost, regularization_cost, maxp))
        print("Seen {} samples.".format(n_samples))
        valcosts.append(validation_cost)

        print("Saving...")
        saveparams = OrderedDict()
        for kk,vv in params.items():
            saveparams[kk] = vv.get_value()
        np.savez('%s/model-nlp-t2v-p2_%d.npz' % (save_path,epoch),**saveparams)
        print("Done")

    except KeyboardInterrupt:
        pass
    print("Total training time = {}".format(time.time()-start))
    
if __name__ == '__main__':
    main(train_path,val_path,save_path)

Preparing Data...
Building Model...
OrderedDict()
Building network...
Computing updates...
Compiling theano functions...
Training...
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29
Testing on Validation set...
Done
Epoch 29 Training Cost 0.0 Validation Precision 4.0 Regularization Cost 2.498483419418335 Max Precision 4.0
Seen 110 samples.
Saving...
Done
Total training time = 0.23064899444580078


##### Tweet2Vec classifier trainer

In [5]:
import numpy as np
import lasagne
import theano
import theano.tensor as T
import random
import sys
import batch_char as batch
import time
import pickle as pkl
import io
import os
import evaluate_t2v as evaluate

from collections import OrderedDict
from t2v import tweet2vec, init_params, load_params
from settings_char import N_BATCH, MAX_LENGTH, MAX_CLASSES

#setting up conditions for Theano:
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"

data_path = "/NLP_Final_Project/Method 2/data/test_DS.txt"
save_path = "/NLP_Final_Project/Method 2/data"

def classify(tweet, t_mask, params, n_classes, n_chars):
    # tweet embedding
    emb_layer = tweet2vec(tweet, t_mask, params, n_chars)
    # Dense layer for classes
    l_dense = lasagne.layers.DenseLayer(emb_layer, n_classes, W=params['W_cl'], b=params['b_cl'], nonlinearity=lasagne.nonlinearities.softmax)

    return lasagne.layers.get_output(l_dense), lasagne.layers.get_output(emb_layer)

def main(args):

    if len(args)>3:
        m_num = int(args[3])

    print("Preparing Data...")
    # Test data
    Xt = []
    yt = []
    with io.open(data_path,'r',encoding='utf-8') as f:
        for line in f:
            (yc, Xc) = line.rstrip('\n').split('\t')
            Xt.append(Xc[:MAX_LENGTH])
            yt.append(yc.split(','))

    # Model
    print("Loading model params...")
    if len(args)>3:
        print("model")
        params = load_params('%s/model-nlp-t2v-p2_%d.npz' % (save_path,m_num))
    else:
        print("Loading best_model")
        params = load_params('%s/best_model-nlp-t2v-p2.npz' % save_path)

    print("Loading dictionaries...")
    with open('%s/dict-nlp-t2v-p2.pkl' % save_path, 'rb') as f:
        chardict = pkl.load(f)
    with open('%s/label_dict-nlp-t2v-p2.pkl' % save_path, 'rb') as f:
        labeldict = pkl.load(f)
    n_char = len(chardict.keys()) + 1
    n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)

    # iterators
    test_iter = batch.BatchTweets(Xt, yt, labeldict, batch_size=N_BATCH, max_classes=MAX_CLASSES, test=True)

    print("Building network...")
    # Tweet variables
    tweet = T.itensor3()
    targets = T.imatrix()

    # masks
    t_mask = T.fmatrix()

    predictions, embeddings = classify(tweet, t_mask, params, n_classes, n_char)

    # Theano function
    print("Compiling theano functions...")
    predict = theano.function([tweet,t_mask],predictions)
    encode = theano.function([tweet,t_mask],embeddings)

    # Test
    print("Testing...")
    out_data = []
    out_pred = []
    out_emb = []
    out_target = []
    for xr,y in test_iter:
        x, x_m = batch.prepare_data(xr, chardict, n_chars=n_char)
        p = predict(x,x_m)
        e = encode(x,x_m)
        ranks = np.argsort(p)[:,::-1]

        for idx, item in enumerate(xr):
            out_data.append(item)
            out_pred.append(ranks[idx,:])
            out_emb.append(e[idx,:])
            out_target.append(y[idx])

    # Save
    print("Saving...")
    with open('%s/data-nlp-t2v-p2.pkl'%save_path,'wb') as f:
        pkl.dump(out_data,f)
    with open('%s/predictions-nlp-t2v-p2.npy'%save_path,'wb') as f:
        np.save(f,np.asarray(out_pred))
    with open('%s/embeddings-nlp-t2v-p2.npy'%save_path,'wb') as f:
        np.save(f,np.asarray(out_emb))
    with open('%s/targets-nlp-t2v-p2.pkl'%save_path,'wb') as f:
        pkl.dump(out_target,f)
        

if __name__ == '__main__':
    main(sys.argv[1:])
    evaluate.main(save_path)

Preparing Data...
Loading model params...
Loading best_model
Loading dictionaries...
Building network...
Compiling theano functions...
Testing...
Saving...
Precision @ 1 = 3.590909090909091
Recall @ 10 = 0.42748917748917753
Mean rank = 26


### Performace output:


| Model |Precision @1 |  Recall @10 |MeanRank
| --- | --- | --- | --- |
| Word Level Baseline | 34.34% | 42.74% | 43 |
| Tweet2Vec | 35.90% | 42.74% | 26 |


# Conclusion and Future Direction

Our learning from this project is that, tweet2vec encoder performs better than word baseline for social media posts trained using supervision from associated hashtags. However, based on our observation there were few tweets were words baseline had better prediction of hastags as well. With respect to performance, without doubt tweet2vec outperforms the word baseline. This paper was limited to English language however the model can be extended to other languages as well. Future direction of the project will focus on how the model can be used for domains specific classification such as news feeds, social media and any content based platforms. 

# References:

[1]: [Dhingra1 et al.2016] Bhuwan Dhingra1, Zhong Zhou2, Dylan Fitzpatrick1,2
Michael Muehl1 and William W. Cohen1, Tweet2Vec: Character-Based Distributed Representations for Social Media, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),2016


[2]: [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neu- ral probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

[3]: [Godin et al.2013] Fréderic Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle. 2013. Using topic models for twit- ter hashtag recommendation. In Proceedings of the 22nd international conference on World Wide Web companion, pages 593–596. International World Wide Web Conferences Steering Committee.

[4]: [Zhangetal.2015] XiangZhang,JunboZhao,andYann LeCun. 2015. Character-level convolutional net- works for text classification. In Advances in Neural Information Processing Systems, pages 649–657.