# Introduction
Most existing algorithms for learning continuous word representations model only the syntactic context of words and ignore the sentiment of the text. This is problematic for sentiment analysis, because words with opposite polarity, such as *good* and *bad*, often appear in similar contexts and end up with similar vectors. In this notebook, I present one of the approaches to sentiment-specific learning from the paper [<b>"Learning Sentiment Specific Word Embedding for Twitter Sentiment Classification"</b>](https://www.aclweb.org/anthology/P14-1146.pdf) on the Twitter Airline data set.

In this kernel we will:
* Use the Embedding layer of Keras to train word embeddings on the training data
* Use pre-trained word embeddings (GloVe)

# Word Embeddings
In simple terms, a word embedding is a way of converting text into numbers so that a machine can process it. Applying one-hot encoding to the words in the tweets yields sparse, high-dimensional vectors, which causes performance issues on large datasets. Additionally, one-hot encoding does not take the semantics of words into account. For example, *tea* and *coffee* are different words but have a similar meaning.
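
To make the sparsity concrete, here is a minimal sketch of one-hot encoding over a tiny, made-up vocabulary. The words and the helper `one_hot` are assumptions for illustration and are not part of this notebook's pipeline. Every pair of distinct words has a dot product of zero, so *tea* is no closer to *coffee* than to *delay*.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only)
vocab = ["tea", "coffee", "flight", "delay"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length len(vocab) with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

tea, coffee, delay = one_hot("tea"), one_hot("coffee"), one_hot("delay")

# Distinct one-hot vectors are orthogonal, so related and unrelated
# words are equally far apart.
print(tea @ coffee, tea @ delay)
```

With a real vocabulary of 10,000 words, each vector would have 10,000 entries and only one of them would be non-zero.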

Word embeddings are dense vectors with a much lower dimensionality, and the distance and direction between vectors in the embedding space reflect the semantic relationships between words.
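
As a rough illustration of how distance and direction can encode similarity, the sketch below compares hand-picked 3-dimensional vectors using cosine similarity. The vector values and the helper `cosine_similarity` are made up for this example; real embeddings are learned from data.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked 3-dimensional vectors, purely illustrative
emb = {
    "tea":    np.array([0.9, 0.8, 0.1]),
    "coffee": np.array([0.8, 0.9, 0.2]),
    "delay":  np.array([-0.7, 0.1, 0.9]),
}

sim_close = cosine_similarity(emb["tea"], emb["coffee"])
sim_far = cosine_similarity(emb["tea"], emb["delay"])
print(sim_close, sim_far)
```

Here *tea* and *coffee* point in nearly the same direction, while *tea* and *delay* do not.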

# Analysis

In [None]:
# Basic packages required for analysis
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import re
import collections
import tensorflow as tf
import nltk
import itertools

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Packages required for data preparation
from sklearn.model_selection import train_test_split
## Packages for cleaning the text
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

# for reproducibility
# np.random.seed() returns None, so keep the seed in its own variable
rand = 78
np.random.seed(rand)

# Packages required for modeling the data 
import keras
from keras import models
from keras import layers
from keras import regularizers

# libraries for visualization
#pd.options.mode.chained_assignment = None 
from gensim.models import word2vec
from sklearn.manifold import TSNE
%matplotlib inline

In [None]:
# Packages required to visualize the sentiment polarity
import seaborn as sns

In [None]:
nb_words = 10000  # number of words in the dictionary as per our choice
batch_size = 512  # size of the batches for gradient descent
max_len = 24  # maximum number of words in a sequence
size_valid = 1000  # size of validation set
epochs = 20  # epochs to start train with
dim_glove = 50  # dimensions of the GLOVE word embeddings

# Helper functions for preprocessing the text

In [None]:
def remove_stopwords(input_text):
    '''
    Function to remove English stopwords from a single tweet.
    
    Parameters:
        input_text : text to clean (a string)
    Output:
        cleaned string
    '''
    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"," "]
    words = input_text.split() 
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
    return " ".join(clean_words) 
    
def remove_mentions(input_text):
    '''
    Function to strip punctuation and underscores from a single tweet.
    This removes the @ sign of a mention but keeps the handle text,
    e.g. "@united" becomes "united".
    
    Parameters:
        input_text : text to clean (a string)
    Output:
        cleaned string
    '''
    return re.sub(r'([^\s\w]|_@?)+', '', input_text)
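
The pattern used in `remove_mentions` can be checked on a made-up tweet. The sketch below applies the same regular expression and, for comparison, shows one possible way to drop the whole handle first with `@\w+`. That second variant is an assumption about what may be wanted and is not what the notebook runs.

```python
import re

# Same pattern as in remove_mentions
pattern = r'([^\s\w]|_@?)+'

sample = "@united thanks for the_help!!"

# Punctuation and underscores disappear, the handle text stays
cleaned = re.sub(pattern, '', sample)

# Alternative (hypothetical): remove the whole mention before stripping punctuation
without_handles = re.sub(pattern, '', re.sub(r'@\w+', '', sample)).strip()

print(cleaned)
print(without_handles)
```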

# Data Preparation
### Reading and cleaning data

In [None]:
df = pd.read_csv('../input/twitter-airline-sentiment/Tweets.csv')
df = df.reindex(np.random.permutation(df.index))  

In [None]:
df = df[['text', 'airline_sentiment']]
df.text = df.text.apply(remove_stopwords).apply(remove_mentions)

In [None]:
df.head(5)

In [None]:
# catplot creates its own figure, so no separate plt.figure() call is needed
sns.catplot(x="airline_sentiment", data=df, kind="count", height=6, aspect=1.5, palette="husl")
plt.show()

In [None]:
def build_corpus(df):
    "Creates a list of lists containing words from each sentence"
    corpus = []
    for col in ['text']:
        # Series.items() replaces iteritems(), which was removed in pandas 2.0
        for _, sentence in df[col].items():
            word_list = sentence.split(" ")
            corpus.append(word_list)
            
    return corpus

corpus = build_corpus(df)        
corpus[0:2]

In [None]:
# List of all words across tweets
list_of_corpus = list(itertools.chain(*corpus))

# Create counter
counts_of_words = collections.Counter(list_of_corpus)

counts_of_words.most_common(10)

clean_tweets = pd.DataFrame(counts_of_words.most_common(10),
                             columns=['words', 'count'])

clean_tweets.head()

In [None]:
fig1, ax = plt.subplots(figsize=(8, 8))

# Plot horizontal bar graph
clean_tweets.sort_values(by='count').plot.barh(x='words',
                      y='count',
                      ax=ax,
                      color="purple")

ax.set_title("Common Words Found in Tweets (Including All Words)")

plt.show()

### Train-Test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.text, df.airline_sentiment, test_size=0.1, random_state=rand)
print('# Train data samples:', X_train.shape[0])
print('# Test data samples:', X_test.shape[0])
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]
#print(X_train)

### Converting words to numbers

In [None]:
tk = Tokenizer(num_words=nb_words,
               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True,
               split=" ")
tk.fit_on_texts(X_train)

X_train_seq = tk.texts_to_sequences(X_train)
X_test_seq = tk.texts_to_sequences(X_test)
#print(X_train_seq)
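
The idea behind this step can be sketched in plain Python: words are ranked by frequency, the most frequent word gets index 1, and words outside the `num_words` limit are dropped from the sequences. This is a simplified illustration on made-up tweets, not the Keras implementation; the names `texts`, `word_index`, and `to_sequence` are hypothetical.

```python
from collections import Counter

# Made-up, already-cleaned tweets (illustrative only)
texts = ["flight delayed", "flight cancelled flight", "great crew"]
num_words = 3  # keep only indices 1 and 2, mirroring the num_words cut-off

counts = Counter(word for text in texts for word in text.split())
# Index 1 goes to the most frequent word; index 0 is left free for padding
ranked = [w for w, _ in counts.most_common()]
word_index = {w: i + 1 for i, w in enumerate(ranked)}

def to_sequence(text):
    """Map words to indices, dropping unknown or too-rare words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

sequences = [to_sequence(t) for t in texts]
print(word_index["flight"], sequences)
```

Note that the third tweet becomes an empty sequence because none of its words fall inside the limit.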

### Creating word sequences of equal length
First, we look at the length of the (cleaned) tweets, since the Embedding layer needs sequences of equal length. To achieve this, we either truncate sequences to max_len or pad them with zeros.

In [None]:
# Calculate the length of each sequence and display summary statistics of the lengths
seq_lengths = X_train.apply(lambda x: len(x.split(' ')))
seq_lengths.describe()

The maximum length is 24 and the minimum length is 1. Since tweets are short, we set max_len to 24 and pad shorter sequences with zeros, so no information is lost through truncation.

In [None]:
X_train_seq_trunc = pad_sequences(X_train_seq, maxlen=max_len)
X_test_seq_trunc = pad_sequences(X_test_seq, maxlen=max_len)

In [None]:
X_train_seq_trunc[10]  # Example of padded sequence
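
By default, `pad_sequences` pads and truncates at the start of a sequence (`padding='pre'`, `truncating='pre'`). The pure-Python helper below, named `pad_pre` for this example only, is a minimal sketch of that behaviour on made-up index lists.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]          # drop items from the front if too long
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 12, 7], maxlen=6)
long = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
print(short)  # [0, 0, 0, 5, 12, 7]
print(long)   # [3, 4, 5, 6, 7, 8]
```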

### Converting the target classes to numbers

In [None]:
le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_test_le = le.transform(y_test)
y_train_oh = to_categorical(y_train_le)
y_test_oh = to_categorical(y_test_le)
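
The two-step conversion (class name to integer, then integer to one-hot row) can be reproduced with NumPy alone. This is a sketch on made-up labels; `np.unique` sorts the class names, which matches how `LabelEncoder` orders them.

```python
import numpy as np

# Made-up labels using the three sentiment classes of the data set
labels = np.array(["neutral", "negative", "positive", "negative"])

# Sorted unique classes and the integer code of each label
classes, codes = np.unique(labels, return_inverse=True)

# One row per label, with a single 1 in the column of its class
one_hot_labels = np.eye(len(classes))[codes]

print(classes)  # ['negative' 'neutral' 'positive']
print(codes)    # [1 0 2 0]
print(one_hot_labels)
```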

### Splitting train and validation data

In [None]:
X_train_emb, X_valid_emb, y_train_emb, y_valid_emb = train_test_split(X_train_seq_trunc, y_train_oh, test_size=0.1, random_state=rand)

assert X_valid_emb.shape[0] == y_valid_emb.shape[0]
assert X_train_emb.shape[0] == y_train_emb.shape[0]

print('Shape of validation set:',X_valid_emb.shape)

# Custom functions to help the analysis

In [None]:
from keras import backend as K

# Custom loss function for SSWE_h
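def custom_loss_u(y_true, y_pred):
    """Custom loss function for SSWE_h: cross-entropy summed over the batch.

    Parameters
    ----------
    y_true : one-hot encoded true sentiment classes
    y_pred : predicted class probabilities (softmax output)

    Returns
    -------
    loss : scalar loss value
    """
    # Clip predictions away from 0 so that log() stays finite
    y_pred = K.clip(y_pred, K.epsilon(), 1.0)
    loss = (-1) * (K.sum(y_true * K.log(y_pred)))
    return loss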

# Custom Activation function Hard hyperbolic tangent
__all__ = ['htanh']

def hard_tanh(x, name='htanh'):
    """Hard tanh activation function.

    A ramp function with a lower bound of -1 and an upper bound of 1.

    Parameters
    ----------
    x : Input Tensor.
    name : str
        The function name (optional).

    Returns
    -------
    Tensor of the same shape as x, with values clipped to [-1, 1].

    """
    return tf.clip_by_value(x, -1, 1, name=name)

# Alias
htanh = hard_tanh

def deep_model(model, X_train, y_train, X_valid, y_valid):
    '''
    Function to train a multi-class model.
    
    Parameters:
        model : model with the chosen architecture
        X_train : training features
        y_train : training target
        X_valid : validation features
        y_valid : validation target
    Output:
        training history
    '''
    # setting up the optimizer as per the specification
    opt = keras.optimizers.Adagrad(learning_rate=0.1)
    model.compile(optimizer=opt
                  , loss=custom_loss_u
                  , metrics=['accuracy'])
    
    training = model.fit(X_train
                       , y_train
                       , epochs=epochs
                       , batch_size=batch_size
                       , validation_data=(X_valid, y_valid)
                       , verbose=1
                       ,shuffle=False)
    return training


def eval_metric(training, metric_name):
    '''
    Function to evaluate a trained model. 
    Plots are shown as a line chart corresponding 
    to each epoch for training and validation set
    
    Parameters:
        training : model training
        metric_name : loss or accuracy
    Output:
        line chart with the metric on the
        y-axis and epochs on the x-axis
    '''
    metric = training.history[metric_name]
    val_metric = training.history['val_' + metric_name]

    e = range(1, epochs + 1)

    plt.plot(e, metric, 'bo', label='Train ' + metric_name)
    plt.plot(e, val_metric, 'b', label='Validation ' + metric_name)
    plt.legend()
    plt.show()

def test_model(model, X_train, y_train, X_test, y_test, epoch_stop):
    '''
    Function to test the model on new data
    with the optimal number of epochs.
    
    Parameters:
        model : trained model
        X_train : training features
        y_train : training target
        X_test : test features
        y_test : test target
        epoch_stop : optimal number of epochs
    Output:
        test accuracy and test loss
    '''
    model.fit(X_train
              , y_train
              , epochs=epoch_stop
              , batch_size=batch_size
              , verbose=0
              ,shuffle=False)
    results = model.evaluate(X_test, y_test)
    
    return results
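
To see what the custom loss and activation compute, the sketch below mirrors them in NumPy on a tiny made-up batch. The helper names `cross_entropy_sum` and `hard_tanh_np` are hypothetical stand-ins for `custom_loss_u` and `hard_tanh`, written only for this illustration.

```python
import numpy as np

def cross_entropy_sum(y_true, y_pred):
    """Cross-entropy summed over all samples and classes."""
    return float(-np.sum(y_true * np.log(y_pred)))

def hard_tanh_np(x):
    """Clip values to the interval [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

# Two made-up samples with three classes each
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
y_pred = np.array([[0.5, 0.25, 0.25],
                   [0.25, 0.25, 0.5]])

loss = cross_entropy_sum(y_true, y_pred)   # equals -2 * log(0.5)
clipped = hard_tanh_np(np.array([-3.0, -0.5, 0.0, 0.7, 2.0]))
print(loss, clipped)
```

Because the loss is a sum rather than a mean, its value grows with the number of samples in a batch, which is worth keeping in mind when comparing loss curves across batch sizes.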

# Modeling

### Training word embeddings
Keras provides an **Embedding layer** that learns task-specific word embeddings from our training data, mapping each word index to a dense multi-dimensional vector.

In [None]:
emb_model = models.Sequential()
emb_model.add(layers.Embedding(nb_words, 50, input_length=max_len))
emb_model.add(layers.Flatten())
emb_model.add(layers.Dense(20, activation=htanh))
emb_model.add(layers.Dense(3, activation='softmax'))
emb_model.summary()
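
Conceptually, an embedding layer is a lookup table: each word index selects one row of a weight matrix. The NumPy sketch below uses random numbers as a stand-in for learned weights, with the same sizes as in this notebook, and shows why the flattened output has 24 * 50 = 1200 features per tweet. The variable names are chosen for this example only.

```python
import numpy as np

vocab_size, emb_dim, seq_len = 10000, 50, 24   # same sizes as in this notebook

rng = np.random.default_rng(0)
# Stand-in for the layer's weight matrix: one row per word index
weights = rng.normal(size=(vocab_size, emb_dim))

# A batch of two padded sequences of word indices (made-up values)
batch = rng.integers(0, vocab_size, size=(2, seq_len))

embedded = weights[batch]                       # row lookup -> (2, 24, 50)
flattened = embedded.reshape(len(batch), -1)    # like Flatten -> (2, 1200)
print(embedded.shape, flattened.shape)
```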

In [None]:
emb_history = deep_model(emb_model, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)

In [None]:
eval_metric(emb_history, 'accuracy')

In [None]:
eval_metric(emb_history, 'loss')

In [None]:
emb_results = test_model(emb_model, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 6)
print('\n')
print('Test accuracy of word embeddings model: {0:.2f}%'.format(emb_results[1]*100))

* This test result is satisfactory. However, the model overfits quickly, after 2 epochs.
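
One way to read the point of overfitting from a training history is to pick the epoch with the lowest validation loss. The numbers below are hypothetical and only shaped like `training.history['val_loss']`; they are not results from this notebook.

```python
# Hypothetical per-epoch validation losses (not actual results)
val_loss = [0.92, 0.71, 0.74, 0.81, 0.90]

# Epochs are 1-based in the plots, list positions are 0-based
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)
```

A value found this way could be passed as `epoch_stop` to `test_model`.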

### Using pre-trained word embeddings
Since the training data is not very large, the model might not be able to learn good embeddings for sentiment analysis. To overcome this, we can load pre-trained word embeddings built on much larger training data.

The [GloVe database](https://nlp.stanford.edu/projects/glove/) contains multiple pre-trained word embeddings, including embeddings trained specifically on tweets.

In [None]:
glove_file = 'glove.twitter.27B.' + str(dim_glove) + 'd.txt'

glove_dir = '../input/glove-global-vectors-for-word-representation'
emb_dict = {}
# Each line holds a word followed by the components of its vector
with open(glove_dir + '/' + glove_file, encoding='utf-8') as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        emb_dict[word] = vector
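
The parsing logic can be tried without downloading the GloVe file. The sketch below feeds two made-up lines in the same text layout (a word followed by its vector components) through the same steps; the vectors and the name `toy_emb_dict` are illustrative only.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a word followed by its vector
fake_glove = io.StringIO(
    "flight 0.1 -0.2 0.3\n"
    "luggage 0.0 0.5 -0.4\n"
)

toy_emb_dict = {}
for line in fake_glove:
    values = line.split()
    toy_emb_dict[values[0]] = np.asarray(values[1:], dtype="float32")

print(toy_emb_dict["flight"])
```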

The first task is to check that some airline-related words are in the dictionary.

In [None]:
airline_words = ['airplane', 'airline', 'flight', 'luggage']
for w in airline_words:
    if w in emb_dict.keys():
        print('Found the word {} in the dictionary'.format(w))

Now we need to build a matrix of shape (nb_words, dim_glove) that holds, for each word index in the tweets, its GloVe vector, so that the Embedding layer can use it.

In [None]:
emb_matrix = np.zeros((nb_words, dim_glove))

for w, i in tk.word_index.items():
    # The word_index contains a token for all words of the training data so we need to limit that
    if i < nb_words:
        vect = emb_dict.get(w)
        # Check if the word from the training data occurs in the GloVe word embeddings
        # Otherwise the vector is kept with only zeros
        if vect is not None:
            emb_matrix[i] = vect
    else:
        break
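
The same construction on a toy scale makes the resulting layout easier to inspect. All names and values below (`toy_word_index`, `toy_emb_dict`, the 3-dimensional vectors) are hypothetical. The sketch also computes the share of vocabulary rows that received a pre-trained vector, which can be a useful sanity check.

```python
import numpy as np

toy_dim = 3
toy_nb_words = 4   # indices 0..3; index 0 is the padding row

# Hypothetical tokenizer index (1 = most frequent) and pre-trained vectors
toy_word_index = {"flight": 1, "thx": 2, "luggage": 3, "rare": 4}
toy_emb_dict = {
    "flight": np.array([0.1, -0.2, 0.3], dtype="float32"),
    "luggage": np.array([0.0, 0.5, -0.4], dtype="float32"),
}

toy_matrix = np.zeros((toy_nb_words, toy_dim))
for word, idx in toy_word_index.items():
    if idx >= toy_nb_words:
        continue                      # outside the vocabulary limit
    vector = toy_emb_dict.get(word)
    if vector is not None:
        toy_matrix[idx] = vector      # words without a vector stay all-zero

# Share of non-padding rows that were filled with a pre-trained vector
coverage = np.mean(toy_matrix[1:].any(axis=1))
print(toy_matrix, coverage)
```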

In [None]:
glove_model = models.Sequential()
glove_model.add(layers.Embedding(nb_words, dim_glove, input_length=max_len))
glove_model.add(layers.Flatten())
glove_model.add(layers.Dense(20, activation=htanh))
glove_model.add(layers.Dense(3, activation='softmax'))
glove_model.summary()

We load the pre-trained embeddings into the Embedding layer (here layer 0) using the *set_weights* method and set its *trainable* attribute to False so that the pre-trained vectors are not updated during training.

In [None]:
glove_model.layers[0].set_weights([emb_matrix])
glove_model.layers[0].trainable = False

In [None]:
glove_history = deep_model(glove_model, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)

In [None]:
eval_metric(glove_history, 'loss')

In [None]:
eval_metric(glove_history, 'accuracy')

In [None]:
glove_results = test_model(glove_model, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 3)
print('\n')
print('Test accuracy of GloVe model: {0:.2f}%'.format(glove_results[1]*100))

The model overfits quickly, after 3 epochs. However, the validation accuracy is lower than with embeddings trained on the training data.

Now we will analyse the results of training the embeddings from scratch with the same number of dimensions as the GloVe data (dim_glove = 50).

### Training word embeddings with the same dimensions as GloVe

In [None]:
emb_model2 = models.Sequential()
emb_model2.add(layers.Embedding(nb_words, dim_glove, input_length=max_len))
emb_model2.add(layers.Flatten())
emb_model2.add(layers.Dense(20, activation=htanh))
emb_model2.add(layers.Dense(3, activation='softmax'))
emb_model2.summary()

In [None]:
emb_history2 = deep_model(emb_model2, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)

In [None]:
eval_metric(emb_history2, 'loss')

In [None]:
eval_metric(emb_history2, 'accuracy')

In [None]:
emb_results2 = test_model(emb_model2, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 3)
print('\n')
print('Test accuracy of word embedding model 2: {0:.2f}%'.format(emb_results2[1]*100))

This result is very close to that of the first embedding model, which also uses 50-dimensional word embeddings, so there is no strong improvement.

# Conclusion for model
The best result is achieved with 50-dimensional word embeddings that are trained on the available data. This even outperforms the use of word embeddings that were trained on a much larger Twitter corpus.

# Visualizing Word Vectors with t-SNE

# Word2Vec

The Word2Vec model produces a vocabulary in which each word is represented by an n-dimensional NumPy array (100 values in this example).

In [None]:
model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=200, workers=4)
model.wv['flights']

In [None]:
def tsne_plot(model):
    "Creates a t-SNE model and plots it"
    labels = []
    tokens = []

    for word in model.wv.vocab:
        # Look up vectors through model.wv rather than indexing the model directly
        tokens.append(model.wv[word])
        labels.append(word)
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
tsne_plot(model)

In [None]:
# A more selective model
model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=500, workers=4)
tsne_plot(model)

In [None]:
# A less selective model
model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=100, workers=4)
tsne_plot(model)

In [None]:
model.wv.most_similar('flights')

In [None]:
model.wv.most_similar('right')

# Conclusion for visualization using t-SNE

It is hard to interpret these words from the t-SNE plots. A better way to inspect the embeddings is to look at the most similar words.