# Tweet2Vec: Character-Based Distributed Representations for Social Media

#### Kumara Prasanna Jayaraju / Rinaldo Sonia Joseph Santhana Raj

####  Emails: kjayaraju@ryerson.ca / rinaldo.joseph@ryerson.ca

# Introduction:

#### Problem Description:

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts. 

#### Context of the Problem:

This leads to a prohibitively large vocabulary size for word-level approaches. In any natural language corpus, the majority of word types in the vocabulary will either be absent or occur with low frequency, and estimating the statistical properties of these rare word types is naturally a difficult task. This is analogous to the curse of dimensionality when dealing with sequences of tokens: most sequences will occur only once in the training data.

#### Limitation About other Approaches:

Traditional Neural Network Language Models (NNLMs) treat words as the basic units of language and assign an independent vector to each word type. This paper is motivated by work that uses bi-directional Long Short-Term Memory (LSTM) networks to compose word vectors.

#### Solution:

In this project, the authors explore a similar approach to learning distributed representations of social media posts by composing them from their constituent characters, with the goal of generalizing to out-of-vocabulary words as well as unseen sequences at test time. The paper proposes a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences.

# Background

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Bengio et al., 2003 [2] | Using neural networks to learn distributed representations of words dates back to this work; word2vec, a collection of word vectors trained using a recurrent neural network, was released more recently. | word2vec; sentences, documents and paragraphs | Requires storing an extremely large table of vectors for all word types and cannot easily be generalized to unseen words at test time. |
| Ling et al., 2015 | The authors present a compositional character model based on bidirectional LSTMs as a potential solution to these problems. | Large data sets, sentences, paragraphs | Generates only word embeddings from character-level representations; lower accuracy. |
| Luong et al., 2013 | Deals with the problem of estimating rare word representations by building them from their constituent morphemes. | Large data sets, sentences, paragraphs | Requires a morpheme parser for preprocessing, which may not perform well on noisy text such as Twitter. |
| Dhingra et al., 2016 [1] | A character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. | Feeds, Twitter | The paper is limited to the English language but can be extended to other languages. |


# Methodology

Bi-GRU Encoder: The figure below shows the model for encoding tweets. It has a similar structure to the C2W model of Ling et al. (2015), with the LSTM units replaced by GRU units.

The input to the network is defined by an alphabet of characters $C$ (this may include the entire Unicode character set). The input tweet is broken into a stream of characters $c_1, c_2, \ldots, c_m$, each of which is represented by a 1-by-$|C|$ one-hot encoding. These one-hot vectors are then projected to a character space by multiplying with the matrix $P_C \in \mathbb{R}^{|C| \times d_c}$, where $d_c$ is the dimension of the character vector space.
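The one-hot encoding and projection step can be illustrated with a minimal NumPy sketch. The toy alphabet, the value of `d_c`, the random matrix `P_C` and the helper `one_hot_encode` are assumptions made for illustration only; the real model uses a character dictionary and a projection matrix learned during training.

```python
import numpy as np

def one_hot_encode(text, char_to_idx):
    """Return an (m, |C|) matrix with one one-hot row per character."""
    m, n_chars = len(text), len(char_to_idx)
    one_hot = np.zeros((m, n_chars), dtype=np.float32)
    for pos, ch in enumerate(text):
        one_hot[pos, char_to_idx[ch]] = 1.0
    return one_hot

# Toy alphabet built from the example tweet itself (assumption for this sketch).
tweet = "so happy #life"
alphabet = sorted(set(tweet))
char_to_idx = {ch: i for i, ch in enumerate(alphabet)}

d_c = 4  # illustrative size of the character vector space
rng = np.random.default_rng(0)
P_C = rng.normal(size=(len(alphabet), d_c)).astype(np.float32)

X = one_hot_encode(tweet, char_to_idx)  # shape (m, |C|)
char_vectors = X @ P_C                  # shape (m, d_c)

# Multiplying a one-hot row by P_C simply selects one row of P_C.
assert np.allclose(char_vectors[0], P_C[char_to_idx[tweet[0]]])
```

Because each row of `X` contains a single 1, the matrix product is equivalent to a table lookup, which is how embedding layers are usually implemented in practice.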

The encoder consists of a forward GRU and a backward GRU. Both have the same architecture, except that the backward GRU processes the sequence in reverse order. Each GRU processes these vectors sequentially and, starting from the initial state $h_0$, computes the sequence $h_1, h_2, \ldots, h_m$.
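The forward and backward passes can be sketched in plain NumPy as follows. This is a minimal illustration using the standard GRU update equations, not the project's Lasagne implementation: the parameter names, the random initialisation, the dimensions and the helpers `init_gru` and `gru_final_state` are all assumptions. Combining the two final states by concatenation is also only one simple choice; this notebook does not show how the actual model combines them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(d_in, d_h, rng):
    """Random parameters for one GRU (illustrative initialisation)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        "W_r": mat(d_in, d_h), "U_r": mat(d_h, d_h), "b_r": np.zeros(d_h),
        "W_z": mat(d_in, d_h), "U_z": mat(d_h, d_h), "b_z": np.zeros(d_h),
        "W_h": mat(d_in, d_h), "U_h": mat(d_h, d_h), "b_h": np.zeros(d_h),
    }

def gru_final_state(xs, p):
    """Run a GRU over the rows of xs and return the last hidden state."""
    h = np.zeros(p["b_r"].shape[0])  # initial state h_0
    for x in xs:
        r = sigmoid(x @ p["W_r"] + h @ p["U_r"] + p["b_r"])  # reset gate
        z = sigmoid(x @ p["W_z"] + h @ p["U_z"] + p["b_z"])  # update gate
        h_cand = np.tanh(x @ p["W_h"] + (r * h) @ p["U_h"] + p["b_h"])
        h = (1.0 - z) * h + z * h_cand
    return h

rng = np.random.default_rng(0)
d_c, d_h, m = 4, 6, 10
char_vectors = rng.normal(size=(m, d_c))  # stand-in for projected characters

fwd, bwd = init_gru(d_c, d_h, rng), init_gru(d_c, d_h, rng)
h_forward = gru_final_state(char_vectors, fwd)         # reads c_1 .. c_m
h_backward = gru_final_state(char_vectors[::-1], bwd)  # reads c_m .. c_1

# Concatenation is an assumption of this sketch, chosen for simplicity.
tweet_embedding = np.concatenate([h_forward, h_backward])
```

The only difference between the two directions is the order in which the character vectors are fed in, which matches the description above.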

Finally, the tweet embedding is passed through a linear layer whose output size equals the number of hashtags L in the data set. A softmax layer computes the posterior hashtag probabilities, and the objective is to minimize the categorical cross-entropy loss between the predicted and true hashtags.
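As a rough illustration of this output layer, the sketch below applies a linear layer and a softmax to a random embedding and computes the cross-entropy for one assumed true hashtag. The names `W_cl` and `b_cl` mirror the parameter keys used in the code cells of this notebook; the dimensions, random values and the true label index are made up for the example.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, true_idx):
    """Categorical cross-entropy for a single true class index."""
    return -np.log(probs[true_idx])

rng = np.random.default_rng(0)
d_t, L = 12, 5  # embedding size and number of hashtags (toy values)
embedding = rng.normal(size=d_t)
W_cl = rng.normal(scale=0.1, size=(d_t, L))
b_cl = np.zeros(L)

probs = softmax(embedding @ W_cl + b_cl)  # posterior over the L hashtags
loss = cross_entropy(probs, true_idx=2)   # index 2 is an arbitrary "true" hashtag
top3 = np.argsort(probs)[::-1][:3]        # indices of the three most likely hashtags
```

Sorting the probabilities in descending order is the same ranking step that the encoding cells use to pick the top five predicted hashtags per tweet.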


<div align="center"> Bi-GRU Encoder </div>

![Tweet2Vec_Encoder](Tweet2vec_encoder.png "Bi-GRU Encoder")

### Data Preparation:

We did not have access to the dataset used in the paper due to confidentiality, so we collected tweets ourselves using common life-oriented hashtags such as '#life', '#motivation', '#happy', '#emotions', '#friends', '#babies' and '#dogs' in order to implement and test the project.

We have divided the project into two parts:

Part 1: Comparing the correctly predicted hashtags by Word Model baseline with Tweet2Vec

Part 2: Training and testing the dataset to calculate Precision, Recall and Mean Rank.

In this notebook, we showcase the implementation of the first part: comparing the hashtags predicted by the word model baseline with those predicted by the tweet2vec encoder.
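Although Part 2 is outside the scope of this notebook, the three metrics it mentions can be sketched in a few lines. The definitions below (precision at rank 1, recall within the top k, and the mean rank of the best-placed true hashtag) are assumptions about how these metrics are typically computed for hashtag ranking, and the function name `evaluate` and the toy data are hypothetical.

```python
def evaluate(ranked_predictions, true_tags, k=10):
    """ranked_predictions: list of ranked hashtag lists (best first).
    true_tags: list of sets of gold hashtags, one set per tweet."""
    n = len(ranked_predictions)
    precision_at_1 = 0.0
    recall_at_k = 0.0
    mean_rank = 0.0
    for ranked, gold in zip(ranked_predictions, true_tags):
        precision_at_1 += 1.0 if ranked[0] in gold else 0.0
        recall_at_k += len(gold & set(ranked[:k])) / len(gold)
        # 1-based rank of each gold hashtag; tags missing from the
        # ranking are charged len(ranked) + 1.
        ranks = [ranked.index(t) + 1 if t in ranked else len(ranked) + 1
                 for t in gold]
        mean_rank += min(ranks)
    return precision_at_1 / n, recall_at_k / n, mean_rank / n

# Toy predictions and gold labels, invented for this example.
preds = [["#happy", "#life", "#dogs"], ["#friends", "#dogs", "#life"]]
gold = [{"#life"}, {"#friends", "#babies"}]
p1, r2, mr = evaluate(preds, gold, k=2)
print(p1, r2, mr)  # 0.5 0.75 1.5
```

In the toy data, the first tweet's true hashtag is ranked second and the second tweet's best true hashtag is ranked first, which gives a mean rank of 1.5.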


# Implementation


In [1]:
# Import all the necessary libraries:
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger
from decimal import *
import re
import sys
import io
import preprocessor as p
import numpy as np
import lasagne
import theano
import theano.tensor as T
import sys
import batch_word as batch
import pickle as pkl
import os
from w2v import tweet2vec, load_params
from settings_word import N_BATCH, N_WORD, MAX_CLASSES

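# Theano configuration. Note: THEANO_FLAGS is read once, when theano is first
# imported, so export it in the shell (or set it before the import) for it to apply.
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"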

#### Collecting Tweets and Preprocessing:

In [None]:
# Collecting Tweets:
# Provide your Twitter developer API credentials before running this cell.
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''


class StdOutListener(StreamListener):
    
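    def on_data(self, data):
        dict_data = json.loads(data)

        if "text" in dict_data:
            # Collapse internal whitespace and write one tweet per line,
            # because the preprocessing step reads the file line by line.
            text = ' '.join(dict_data["text"].split())
            with io.open('Tweets2Vec_DA.rtf', 'a', encoding='utf-8') as save_file:
                save_file.write(text + '\n')
        return True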

    def on_error(self, status):
        print(status)
        
        
if __name__ == '__main__':

    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # Filter the Twitter stream to capture English tweets containing the keywords below:
    stream.filter(languages=["en"], track=['#life', '#motivation', '#happy', '#emotions', '#friends', '#babies', '#dogs'])

# Preprocessing: 

# input and output files
infile = "/NLP_Final_Project/Part 1/data/Tweets2Vec_DA.rtf"
outfile = "/NLP_Final_Project/Part 1/data/life_t2v_ds_en_op.txt"

regex_str = [
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)+' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

# Compile the token-type patterns once, as raw strings to avoid invalid escape sequences.
html_regex = re.compile(r'<[^>]+>')
mention_regex = re.compile(r'(?:@[\w_]+)')
url_regex = re.compile(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+')
hashtag_regex = re.compile(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)")

def preprocess(s, lowercase=True):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token.lower() for token in tokens]

    # Drop HTML tags, @-mentions, URLs, hashtags and the retweet marker "rt".
    # Filtering whole tokens avoids stripping "rt" from inside words such as "heart".
    def keep(token):
        if token.lower() == 'rt':
            return False
        return not any(rx.match(token)
                       for rx in (html_regex, mention_regex, url_regex, hashtag_regex))

    text = ' '.join(token for token in tokens if token and keep(token))
    return p.clean(text.replace(':', '').replace('...', ''))

with io.open(outfile, 'w', encoding='utf-8') as tweet_processed_text, io.open(infile, 'r', encoding='utf-8') as fin:
    for line in fin:
        tweet_processed_text.write(preprocess(line)+'\n')

#### Encoding_Tweets_W2V and saving predicted tags

In [1]:
# Encoding_Tweets_W2V

import numpy as np
import lasagne
import theano
import theano.tensor as T
import sys
import batch_word as batch
import pickle as pkl
import io
import os

from w2v import tweet2vec, load_params
from settings_word import N_BATCH, N_WORD, MAX_CLASSES

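# Theano configuration. Note: THEANO_FLAGS is read once, when theano is first
# imported, so export it in the shell (or set it before the import) for it to apply.
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"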


def invert(d):
    out = {}
    for k,v in d.items():
        out[v] = k
    return out

def classify(tweet, t_mask, params, n_classes, n_chars):
    # tweet embedding
    emb_layer = tweet2vec(tweet, t_mask, params, n_chars)
    # Dense layer for classes
    l_dense = lasagne.layers.DenseLayer(emb_layer, n_classes, W=params['W_cl'], b=params['b_cl'], nonlinearity=lasagne.nonlinearities.softmax)

    return lasagne.layers.get_output(l_dense), lasagne.layers.get_output(emb_layer)

def main(args):

    data_path = "/NLP_Final_Project/Part 1/data/life_t2v_ds_en_op.txt"
    model_path = "/NLP_Final_Project/Part 1/src"
    save_path = "/NLP_Final_Project/Part 1/data"
    if len(args)>3:
        m_num = int(args[3])

    print("Preparing Data...")
    # Test data
    Xt = []
    with io.open(data_path,'r',encoding='utf-8') as f:
        for line in f:
            Xc = line.rstrip('\n')
            Xt.append(Xc)

    # Model
    print("Loading model params...")
    if len(args)>3:
        params = load_params('%s/model-w2v_%d.npz' % (model_path,m_num))
    else:
        params = load_params('%s/best_model-nlp-w2v.npz' % model_path)

    print("Loading dictionaries...")
    with open('%s/dict-nlp-w2v.pkl' % model_path, 'rb') as f:
        chardict = pkl.load(f)
    with open('%s/label_dict-nlp-w2v.pkl' % model_path, 'rb') as f:
        labeldict = pkl.load(f)
    n_char = min(len(chardict.keys()) + 1, N_WORD)
    n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)
    inverse_labeldict = invert(labeldict)

    print("Building network...")
    
    # Tweet variables
    tweet = T.itensor3()
    t_mask = T.fmatrix()

    # network for prediction
    predictions, embeddings = classify(tweet, t_mask, params, n_classes, n_char)
    
    # Theano function
    print("Compiling theano functions...")
    predict = theano.function([tweet,t_mask],predictions)
    encode = theano.function([tweet,t_mask],embeddings)

    # Test
    print("Encoding...")
    out_pred = []
    out_emb = []
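    # Ceiling division avoids an empty trailing batch when len(Xt) is a multiple of N_BATCH.
    numbatches = (len(Xt) + N_BATCH - 1) // N_BATCH
    for i in range(numbatches):
        xr = Xt[N_BATCH*i:N_BATCH*(i+1)]
        x, x_m = batch.prepare_data(xr, chardict, n_tokens=n_char)
        p = predict(x,x_m)
        e = encode(x,x_m)
        ranks = np.argsort(p)[:,::-1]

        for idx, item in enumerate(xr):
            # Fall back to 'UNK' for class indices with no entry in the label dictionary.
            out_pred.append(' '.join([inverse_labeldict.get(r, 'UNK') for r in ranks[idx,:5]]))
            out_emb.append(e[idx,:])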

    # Save
    print("Saving...")
    with io.open('%s/predicted_tags-nlp-w2v.txt'%save_path,'w',encoding='utf-8') as f:
        for item in out_pred:
            f.write(item + '\n')
    with open('%s/embeddings-nlp-w2v.npy'%save_path,'wb') as f:
        np.save(f,np.asarray(out_emb))

if __name__ == '__main__':
    main(sys.argv[1:])

Preparing Data...
Loading model params...
Loading dictionaries...
Building network...
Compiling theano functions...
Encoding...
Saving...


#### Encoding_Tweets_T2V and saving predicted tags

In [3]:
import numpy as np
import lasagne
import theano
import theano.tensor as T
import sys
import batch_char as batch
import pickle as pkl
import io
import os

from t2v import tweet2vec, init_params, load_params
from settings_char import N_BATCH, MAX_LENGTH, MAX_CLASSES


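# Theano configuration. Note: THEANO_FLAGS is read once, when theano is first
# imported, so export it in the shell (or set it before the import) for it to apply.
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"
os.environ['THEANO_FLAGS'] = "device=cpu,floatX=float32"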


def invert(d):
    out = {}
    for k,v in d.items():
        out[v] = k
    return out

def classify(tweet, t_mask, params, n_classes, n_chars):
    # tweet embedding
    emb_layer = tweet2vec(tweet, t_mask, params, n_chars)
    # Dense layer for classes
    l_dense = lasagne.layers.DenseLayer(emb_layer, n_classes, W=params['W_cl'], b=params['b_cl'], nonlinearity=lasagne.nonlinearities.softmax)

    return lasagne.layers.get_output(l_dense), lasagne.layers.get_output(emb_layer)

def main(args):

    data_path = "/NLP_Final_Project/Part 1/data/life_t2v_ds_en_op.txt"
    model_path = "/NLP_Final_Project/Part 1/src"
    save_path = "/NLP_Final_Project/Part 1/data"
    if len(args)>3:
        m_num = int(args[3])

    print("Preparing Data...")
    # Test data
    Xt = []
    with io.open(data_path,'r',encoding='utf-8') as f:
        for line in f:
            Xc = line.rstrip('\n')
            Xt.append(Xc[:MAX_LENGTH])

    # Model
    print("Loading model params...")
    if len(args)>3:
        params = load_params('%s/model-nlp-t2v_%d.npz' % (model_path,m_num))
    else:
        params = load_params('%s/best_model-nlp-t2v.npz' % model_path)

    print("Loading dictionaries...")
    with open('%s/dict-nlp-t2v.pkl' % model_path, 'rb') as f:
        chardict = pkl.load(f)
    with open('%s/label_dict-nlp-t2v.pkl' % model_path, 'rb') as f:
        labeldict = pkl.load(f)
    n_char = len(chardict.keys()) + 1
    n_classes = min(len(labeldict.keys()) + 1, MAX_CLASSES)
    inverse_labeldict = invert(labeldict)

    print("Building network...")
    # Tweet variables
    tweet = T.itensor3()
    t_mask = T.fmatrix()

    # network for prediction
    predictions, embeddings = classify(tweet, t_mask, params, n_classes, n_char)

    # Theano function
    print("Compiling theano functions...")
    predict = theano.function([tweet,t_mask],predictions)
    encode = theano.function([tweet,t_mask],embeddings)

    # Test
    print("Encoding...")
    out_pred = []
    out_emb = []
    # Ceiling division avoids an empty trailing batch when len(Xt) is a multiple of N_BATCH.
    numbatches = (len(Xt) + N_BATCH - 1) // N_BATCH
    for i in range(numbatches):
        xr = Xt[N_BATCH*i:N_BATCH*(i+1)]
        x, x_m = batch.prepare_data(xr, chardict, n_chars=n_char)
        p = predict(x,x_m)
        e = encode(x,x_m)
        ranks = np.argsort(p)[:,::-1]

        for idx, item in enumerate(xr):
            out_pred.append(' '.join([inverse_labeldict[r] if r in inverse_labeldict else 'UNK' for r in ranks[idx,:5]]))
            out_emb.append(e[idx,:])

    # Save
    print("Saving...")
    with io.open('%s/predicted_tags-nlp-t2v.txt'%save_path,'w',encoding='utf-8') as f:
        for item in out_pred:
            f.write(item + '\n')
    with open('%s/embeddings-nlp-t2v.npy'%save_path,'wb') as f:
        np.save(f,np.asarray(out_emb))
        
if __name__ == '__main__':
    main(sys.argv[1:])

Preparing Data...
Loading model params...
Loading dictionaries...
Building network...
Compiling theano functions...
Encoding...
Saving...


The output files are stored under the data folder. We have combined the top 10 tweets and their predicted hashtags from both models for comparison, and a screenshot is provided below.


<div align="center"> Examples of top predictions from the models </div>

![Output_Comparison](output.png "Examples of top predictions from the models")
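A side-by-side comparison like the one in the screenshot could be assembled along the following lines. This is only a sketch: the helper `side_by_side` is hypothetical, and the three lists are invented stand-ins for lines that would be read from the preprocessed tweet file and from `predicted_tags-nlp-w2v.txt` and `predicted_tags-nlp-t2v.txt`; they are not real model output.

```python
def side_by_side(tweets, word_tags, char_tags, limit=10):
    """Pair each tweet with the hashtags predicted by both models."""
    rows = []
    for tweet, w_line, c_line in zip(tweets[:limit], word_tags, char_tags):
        rows.append({
            "tweet": tweet.strip(),
            "word_model": w_line.split(),  # tags are space-separated per line
            "tweet2vec": c_line.split(),
        })
    return rows

# Made-up stand-ins for lines read from the three files.
tweets = ["good morning everyone\n", "walking the puppy\n"]
word_tags = ["#life #happy\n", "#dogs #friends\n"]
char_tags = ["#happy #motivation\n", "#dogs #life\n"]

rows = side_by_side(tweets, word_tags, char_tags)
for row in rows:
    print(row["tweet"], "|", " ".join(row["word_model"]),
          "|", " ".join(row["tweet2vec"]))
```

Since both encoding cells write one line of predictions per input tweet, pairing the files line by line keeps the predictions aligned with their tweets.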

# Conclusion and Future Direction

Our main takeaway from this project is that the tweet2vec encoder performs better than the word baseline on social media posts when trained with supervision from the associated hashtags. However, we observed a few tweets where the word baseline predicted better hashtags. Overall, tweet2vec outperformed the word baseline in our comparison. The paper was limited to the English language, but the model can be extended to other languages as well. Future work will focus on how the model can be used for domain-specific classification, such as news feeds, social media and other content-based platforms.

# References:

[1]: [Dhingra et al. 2016] Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2Vec: Character-Based Distributed Representations for Social Media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

[2]: [Bengio et al. 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

[3]: [Godin et al. 2013] Fréderic Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle. 2013. Using topic models for twitter hashtag recommendation. In Proceedings of the 22nd international conference on World Wide Web companion, pages 593–596. International World Wide Web Conferences Steering Committee.

[4]: [Zhang et al. 2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.