# Sentiment Analysis of Tweets about US Airlines Using GloVe Text Embeddings and an LSTM Network

Author: Sheik Mohamed Anees S/O S A F

This project trains a machine learning model to predict the sentiment of tweets about the various US airlines. Such predictions could help airline management make better-informed decisions and, in turn, improve the reputation of the US airline industry.

Once tweets have been collected through an API, the model classifies each one into one of three categories: positive, neutral, or negative.

This dataset was obtained from https://www.kaggle.com/crowdflower/twitter-airline-sentiment

### Importing all necessary libraries

In [8]:
import pandas as pd
import numpy as np

import os
import sys

import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.utils import to_categorical

from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D, LSTM
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import Constant

### Taking a look at the tweets dataset

In [9]:
airline_sentiments = pd.read_csv('Tweets.csv')
airline_sentiments = airline_sentiments.loc[:,["airline_sentiment", "text"]]
airline_sentiments = airline_sentiments.dropna()
print(airline_sentiments.describe())
print(airline_sentiments.head())

       airline_sentiment            text
count              14640           14640
unique                 3           14427
top             negative  @united thanks
freq                9178               6
  airline_sentiment                                               text
0           neutral                @VirginAmerica What @dhepburn said.
1          positive  @VirginAmerica plus you've added commercials t...
2           neutral  @VirginAmerica I didn't today... Must mean I n...
3          negative  @VirginAmerica it's really aggressive to blast...
4          negative  @VirginAmerica and it's a really big bad thing...


### Adding a new column that encodes the airline sentiment as a number (0 = positive, 1 = neutral, 2 = negative)

In [10]:
airline_sentiments.loc[airline_sentiments['airline_sentiment']=='positive', 'sentiment_value'] = 0
airline_sentiments.loc[airline_sentiments['airline_sentiment']=='neutral', 'sentiment_value'] = 1
airline_sentiments.loc[airline_sentiments['airline_sentiment']=='negative', 'sentiment_value'] = 2
airline_sentiments

| | airline_sentiment | text | sentiment_value |
|---|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. | 1.0 |
| 1 | positive | @VirginAmerica plus you've added commercials t... | 0.0 |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... | 1.0 |
| 3 | negative | @VirginAmerica it's really aggressive to blast... | 2.0 |
| 4 | negative | @VirginAmerica and it's a really big bad thing... | 2.0 |
| 5 | negative | @VirginAmerica seriously would pay $30 a fligh... | 2.0 |
| 6 | positive | @VirginAmerica yes, nearly every time I fly VX... | 0.0 |
| 7 | neutral | @VirginAmerica Really missed a prime opportuni... | 1.0 |
| 8 | positive | @virginamerica Well, I didn't…but NOW I DO! :-D | 0.0 |
| 9 | positive | @VirginAmerica it was amazing, and arrived an ... | 0.0 |
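
The same encoding can also be kept in a plain dictionary, which gives an inverse lookup for turning a predicted class index back into a label. This is only a sketch: the names `SENTIMENT_TO_ID`, `ID_TO_SENTIMENT`, and `encode_sentiments` are illustrative and not part of the notebook.

```python
# Hypothetical helper names; the 0/1/2 encoding matches the notebook.
SENTIMENT_TO_ID = {"positive": 0, "neutral": 1, "negative": 2}
ID_TO_SENTIMENT = {idx: name for name, idx in SENTIMENT_TO_ID.items()}


def encode_sentiments(sentiments):
    """Map sentiment strings to integer class ids."""
    return [SENTIMENT_TO_ID[s] for s in sentiments]


sample = ["neutral", "positive", "neutral", "negative", "negative"]
encoded = encode_sentiments(sample)
decoded = [ID_TO_SENTIMENT[i] for i in encoded]
print(encoded)
print(decoded)
```

With pandas, `airline_sentiments['airline_sentiment'].map(SENTIMENT_TO_ID)` would apply the same dictionary in one step and yield integer rather than float values.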


We want to predict whether a tweet about a US airline expresses a positive, neutral, or negative sentiment. To do this, we use the pre-trained GloVe 6B word vectors from http://nlp.stanford.edu/data/glove.6B.zip as text embeddings and apply transfer learning: the pre-trained vectors are reused as the input representation for the tweets dataset.

### Building a dictionary that maps each word in the GloVe 6B set to its embedding vector

In [11]:
print('Indexing word vectors')

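embeddings_index = {}

# Each line holds a word followed by its 100 vector components.
# The encoding is given explicitly because the file contains non-ASCII tokens.
with open('glove.6B.100d.txt', encoding='utf-8') as glove_file:
    for line in glove_file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
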
print('Found %s word vectors.' % len(embeddings_index))
print('Example of Glove Embedding Vector for the word "hello" is : \n{}'.format(embeddings_index['hello']))

Indexing word vectors
Found 400000 word vectors.
Example of Glove Embedding Vector for the word "hello" is : 
[ 0.26688    0.39632    0.6169    -0.77451   -0.1039     0.26697
  0.2788     0.30992    0.0054685 -0.085256   0.73602   -0.098432
  0.5479    -0.030305   0.33479    0.14094   -0.0070003  0.32569
  0.22902    0.46557   -0.19531    0.37491   -0.7139    -0.51775
  0.77039    1.0881    -0.66011   -0.16234    0.9119     0.21046
  0.047494   1.0019     1.1133     0.70094   -0.08696    0.47571
  0.1636    -0.44469    0.4469    -0.93817    0.013101   0.085964
 -0.67456    0.49662   -0.037827  -0.11038   -0.28612    0.074606
 -0.31527   -0.093774  -0.57069    0.66865    0.45307   -0.34154
 -0.7166    -0.75273    0.075212   0.57903   -0.1191    -0.11379
 -0.10026    0.71341   -1.1574    -0.74026    0.40452    0.18023
  0.21449    0.37638    0.11239   -0.53639   -0.025092   0.31886
 -0.25013   -0.63283   -0.011843   1.377      0.86013    0.20476
 -0.36815   -0.68874    0.53512   -0.46556
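
Each line of a GloVe text file holds a token followed by its space-separated vector components. The sketch below parses a tiny sample in that layout and compares words by cosine similarity, a common way to inspect such embeddings. The three 4-dimensional vectors are invented for illustration and are not real GloVe values; `toy_index` and `cosine_similarity` are hypothetical names.

```python
import io

import numpy as np

# Invented 4-dimensional vectors in the GloVe text layout: "<word> <v1> <v2> ...".
fake_glove = io.StringIO(
    "good 0.9 0.1 0.0 0.2\n"
    "great 0.8 0.2 0.1 0.3\n"
    "delayed -0.7 0.5 0.6 -0.1\n"
)

toy_index = {}
for line in fake_glove:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype="float32")


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_close = cosine_similarity(toy_index["good"], toy_index["great"])
sim_far = cosine_similarity(toy_index["good"], toy_index["delayed"])
print(sim_close > sim_far)
```

The same `cosine_similarity` function could be applied to entries of `embeddings_index` once the real file has been loaded.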

### Loading the text examples and their respective labels

In [12]:
labels = airline_sentiments['sentiment_value'].tolist()
texts = airline_sentiments['text'].tolist()

MAX_SEQUENCE_LENGTH = 50
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

### Tokenizing the words into a 2D integer tensor & Padding so each example is the same length

Tokenizing relies on a dictionary that maps each word to an integer index. Each example is then padded with zeros (or truncated) so that every sequence has exactly 50 tokens. By default, `pad_sequences` adds the zeros at the front of the sequence.

In [13]:
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
#First 5 examples:
data[:5]

Found 15768 unique tokens.


array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,   81,   62, 6686,  226],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,   81,  558,  590,
        1159, 2536,    1,    2,  201, 6687],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,   81,    3,  207,  102,  805,  591,
           3,   76,   
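
As a rough illustration of what the tokenizing and padding step produces, the sketch below re-creates the idea in plain NumPy: words are ranked by frequency, index 0 is reserved for padding, and short sequences are left-padded with zeros (the default `padding='pre'` behaviour of `pad_sequences`). It is a simplification and not the Keras implementation: it splits on whitespace only, whereas `Tokenizer` also strips punctuation by default. The helper names `build_word_index` and `pad_pre` are hypothetical.

```python
from collections import Counter

import numpy as np


def build_word_index(texts):
    """Rank lowercase whitespace tokens by frequency; indices start at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def pad_pre(sequences, maxlen):
    """Left-pad with zeros, keeping the last `maxlen` tokens of long sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:
            padded[row, -len(trimmed):] = trimmed
    return padded


toy_texts = ["late flight again", "great flight", "flight"]
toy_word_index = build_word_index(toy_texts)
toy_sequences = [[toy_word_index[w] for w in t.lower().split()] for t in toy_texts]
toy_data = pad_pre(toy_sequences, maxlen=4)
print(toy_data)
```

The most frequent word ("flight") receives index 1, and every row ends with the tweet's own tokens, which mirrors the layout of the array printed by the notebook.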

### Converting the labels to one-hot representation

In [14]:
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (14640, 50)
Shape of label tensor: (14640, 3)
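
One-hot encoding turns each integer class label into a vector with a single 1 at that class position, which is the target format that `categorical_crossentropy` expects. A minimal NumPy sketch of the same transformation, with `argmax` as its inverse (the sample labels are made up):

```python
import numpy as np

toy_labels = np.array([1, 0, 2, 2])      # 0 = positive, 1 = neutral, 2 = negative
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[toy_labels]
print(one_hot)

# argmax along the class axis recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The same `argmax` step can be applied to the softmax output of the model to obtain a predicted class index per tweet.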


### Splitting into training and validation sets

In [15]:
#Doing random shuffling of dataset:
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

#Splitting into training and validation sets (the last 20% of the shuffled rows are held out):
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
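
The shuffle above uses NumPy's global random state, so the split changes from run to run. A sketch of a reproducible variant, assuming a fixed seed is acceptable; the seed value 42 and the helper name `shuffled_split` are arbitrary choices, not taken from the notebook:

```python
import numpy as np


def shuffled_split(features, targets, validation_split, seed=42):
    """Shuffle rows with a seeded generator, then hold out the tail for validation.

    Assumes the split leaves at least one validation row.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(features.shape[0])
    features, targets = features[order], targets[order]
    n_val = int(validation_split * features.shape[0])
    return (features[:-n_val], targets[:-n_val]), (features[-n_val:], targets[-n_val:])


toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)
(train_x, train_y), (val_x, val_y) = shuffled_split(toy_x, toy_y, validation_split=0.2)
print(train_x.shape, val_x.shape)
```

Because the same permutation is applied to features and targets, each row keeps its own label after shuffling.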

### Preparing Embedding Matrix

In [16]:
print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index keep their all-zero row
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Preparing embedding matrix.
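
Words missing from GloVe keep an all-zero row, so it can be useful to check how much of the vocabulary is actually covered. The sketch below shows the idea on toy data: `toy_word_index` and `toy_embeddings` are invented stand-ins for `word_index` and `embeddings_index`, and `embedding_coverage` is a hypothetical helper.

```python
import numpy as np


def embedding_coverage(word_index, embeddings_index, dim):
    """Build the embedding matrix and report the share of words with a pre-trained vector."""
    matrix = np.zeros((len(word_index) + 1, dim))   # row 0 stays zero for padding
    found = 0
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
            found += 1
    return matrix, found / len(word_index)


toy_word_index = {"flight": 1, "late": 2, "virginamerica": 3, "thx": 4}
toy_embeddings = {
    "flight": np.array([0.1, 0.2]),
    "late": np.array([0.3, -0.4]),
}
toy_matrix, coverage = embedding_coverage(toy_word_index, toy_embeddings, dim=2)
print(coverage)
```

A low coverage figure on real tweets would suggest that handles, hashtags, or misspellings dominate the vocabulary and that extra text cleaning might help.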


### Constructing the Model and Compiling

In [20]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = LSTM(128)(embedded_sequences)
x = Dropout(0.3)(x)
preds = Dense(3, activation='softmax')(x)

model = Model(sequence_input, preds)

model.summary()


model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 50)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 50, 100)           1576900   
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               117248    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 387       
Total params: 1,694,535
Trainable params: 117,635
Non-trainable params: 1,576,900
_________________________________________________________________
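
The parameter counts in the summary can be reproduced by hand. An LSTM layer holds four weight sets (input, forget, and output gates plus the cell candidate), each with an input kernel, a recurrent kernel, and a bias, giving 4 * (units * (input_dim + units) + units) weights. A short arithmetic check using the sizes from the notebook:

```python
vocab_rows = 15769      # num_words: 15768 unique tokens + 1 padding row
embedding_dim = 100
lstm_units = 128
num_classes = 3

embedding_params = vocab_rows * embedding_dim
# Four weight sets, each with an input kernel, a recurrent kernel and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
dense_params = lstm_units * num_classes + num_classes

print(embedding_params, lstm_params, dense_params)
print("trainable:", lstm_params + dense_params)
print("total:", embedding_params + lstm_params + dense_params)
```

Only the LSTM and dense weights are trainable, because the embedding layer was created with `trainable=False`.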


### Training the Model

In [21]:
model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          validation_data=(x_val, y_val))

Train on 11712 samples, validate on 2928 samples
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0xb30d66518>

### Evaluating the Model

In [22]:
loss, acc = model.evaluate(x_val, y_val)
print()
print("Test accuracy = ", acc)


Test accuracy =  0.7800546


An accuracy of about 78 percent was achieved with this model! Note that this figure is measured on the held-out validation split, although the output labels it "Test accuracy". In future, the model could also take in other factors, such as which airline the tweet refers to (e.g. #UnitedAirlines) and the confidence level of the label for each training example, to become even more accurate.
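
To put the accuracy in context, it helps to compare it with a baseline that always predicts the most frequent class. From the `describe()` output, 9178 of the 14640 tweets are negative, so the comparison can be sketched as:

```python
total_tweets = 14640
negative_tweets = 9178        # most frequent class, from the describe() output
model_accuracy = 0.7800546    # accuracy reported by model.evaluate on the validation split

majority_baseline = negative_tweets / total_tweets
print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"improvement over baseline: {model_accuracy - majority_baseline:.3f}")
```

The baseline here is computed over the whole dataset; the class balance of the randomly drawn validation split may differ slightly.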