# **1. Load data**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
DATASET_COLUMNS=['sentiment','ids','date','flag','user','tweet']
df = pd.read_csv('/content/drive/MyDrive/training.1600000.processed.noemoticon.csv', encoding="ISO-8859-1", names=DATASET_COLUMNS)
df.head(8)

Unnamed: 0,sentiment,ids,date,flag,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
5,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
6,0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
7,0,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,coZZ,@LOLTrish hey long time no see! Yes.. Rains a...


# **2. Drop columns that are not useful for our modeling**

In [3]:
df = df.drop(['ids', 'date', 'flag', 'user'], axis=1)
df.head()

Unnamed: 0,sentiment,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


# **3. Data Preprocessing**

In [4]:
# map the sentiment values to positive, negative. 0 is mapped to negative, and four is mapped to positive
label_to_sentiment = {0:"Negative", 4:"Positive"}
def label_decoder(label):
     return label_to_sentiment[label]
df.sentiment = df.sentiment.apply(lambda x: label_decoder(x))

In [5]:
# Import nltk package and download the stopwords
import nltk 
nltk.download('stopwords')
# We filter out the english language stopwrds
from nltk.corpus import stopwords
stop_words = stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Stemming/lemmatization refers to the process of extracting the root word. For example, can write ‘play’ as ‘playing,’ ‘played,’ ‘plays’ in different tenses. But the actual meaning is the same. We need to convert these into the root word for easier modelling. We can use the Snowball stemmer from the NLTK package to implement this.

In [6]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

For removing the non-alphabetic characters, we can use regex expressions.

In [7]:
import re
text_cleaning_regex = "@S+|https?:S+|http?:S|[^A-Za-z0-9]+"

Now, let us define a function that will perform regex filtering, stop word removal, and stemming on all the tweets. Note that in NLP,  we describe the processed words as ‘tokens.’ Each tweet will be passed on to the function shown below.

In [8]:
def clean_tweets(text, stem=False):
  # Text passed to the regex equatio
  text = re.sub(text_cleaning_regex, ' ', str(text).lower()).strip()
  # Empty list created to store final tokens
  tokens = []
  for token in text.split():
    # check if the token is a stop word or not
    if token not in stop_words:
      if stem:
        # Paased to the snowball stemmer
        tokens.append(stemmer.stem(token))
      else:
        # A
        tokens.append(token)
  return " ".join(tokens)

In the above funtion, the text is converted into all lower case; white spaces are stripped and passed to the equation. The hyperlinks will remove non-alphanumeric characters. An empty list can be created to store the final tokens. The sentence is split into words, and each word is checked if it belongs to the list of stop words or not. After that, stemming is performed, and the word is stored in the list.  In the end, the tokens in the list are joined and returned.

In [9]:
df.tweet = df.tweet.apply(lambda x: clean_tweets(x))


# **4. Modeling**

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Splitting the data into training and testing sets
train_data, test_data = train_test_split(df, test_size=0.2,random_state=16)
print("Train Data size:", len(train_data))
print("Test Data size", len(test_data))

Train Data size: 1280000
Test Data size 320000


**4.1 Tokenization & Label Encoding**


Tokenization refers to splitting the given sentence into a list of tokens, indexed or vectorized.

In [11]:
# Tokenization
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data.tweet)
word_index = tokenizer.word_index
keys = list(word_index.keys())
print(keys[:10])
values = list(word_index.values())
print(values[:10])
# This is a dictionary where each word is mapped with a particular index, starting from 1.
vocab_size = len(word_index) + 1
print("Vocabulary Size :", vocab_size)

['good', 'day', 'get', 'like', 'go', 'quot', 'http', 'today', 'work', 'love']
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Vocabulary Size : 565660


We will be applying a sequence model to this data. For this, we need to pass inputs of the same size. To achieve this, we will use the `pad_sequences()` function. This will return us sequences of a constant size, which can be passed as a parameter. 

In [12]:
# Padding
from keras_preprocessing.sequence import pad_sequences
# The tokens are converted into sequences and then passed to the pad_sequences() function
x_train = pad_sequences(tokenizer.texts_to_sequences(train_data.tweet),maxlen = 30)
x_test = pad_sequences(tokenizer.texts_to_sequences(test_data.tweet),maxlen = 30)

Initialize encoder class and fit it upon the training dataset’s labels (sentiment column). After this, we extract the sentiment from train data to make y_test, y_train by encoding and reshaping

In [13]:
# Label encoding
labels = ['Negative', 'Positive']
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(train_data.sentiment.to_list())
y_train = encoder.transform(train_data.sentiment.to_list())
y_test = encoder.transform(test_data.sentiment.to_list())
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

**4.2 Representing words with vectors**

The ultimate aim is that the talks with similar meanings are closer to each other than the irrelevant words in the vector representation. The distance between the words could be measured by cosine similarity. For example, the words’ travelling’ and ‘vacation’ will be represented by vectors closer to each other.

In [14]:
# The gloVe is a pretrained word embedding model, and we can download it.

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2023-02-03 20:03:19--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-02-03 20:03:19--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-02-03 20:03:19--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [15]:
import numpy as np

# create a dictionary mapping the words with GloVe vector representations.
embeddings_index = {}
# opening the downloaded glove embeddings file
f = open('glove.6B.300d.txt')
for line in f:
    # For each line file, the words are split and stored in a list
    values = line.split()
    word = value = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' %len(embeddings_index))

Found 400000 word vectors.


Recall that in the tokenizing section, we had gotten a dictionary ‘word_index’, where each word is mapped to an index in the vocabulary. Now, we will map those vocab indices with the glove representations.

In [16]:
# creating an matrix with zeroes of shape vocab x embedding dimension
embedding_matrix = np.zeros((vocab_size, 300))
# Iterate through word, index in the dictionary
for word, i in word_index.items():
    # extract the corresponding vector for the vocab indice of same word
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Storing it in a matrix
        embedding_matrix[i] = embedding_vector

**4.3 Intitialize weights**

Now, we have a matrix that can initialize the weights. We will be using the embedding layer of Keras.

In [17]:
import tensorflow as tf
embedding_layer = tf.keras.layers.Embedding(vocab_size,300,weights=[embedding_matrix],
                                          input_length=30,trainable=False)

# **5.0 Model Architecture (LSTM) & Training**

We start with the embedding layer defined previously, and it inputs the sequences and gives word embeddings. These embeddings are then passed on to the convolution layer, which will convert them into small feature vectors. Next, we have the bidirectional LSTM layer. After the LSTM layers, we have a couple of Dense (fully connected layers) for classification purposes. We use a sigmoid activation function before the final output

In [18]:
from tensorflow.keras.layers import Conv1D, Bidirectional, LSTM, Dense, Input, Dropout
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.callbacks import ModelCheckpoint

# The Input layer 
sequence_input = Input(shape=(30,), dtype='int32')
# Inputs passed to the embedding layer
embedding_sequences = embedding_layer(sequence_input)
# dropout and conv layer 
x = SpatialDropout1D(0.2)(embedding_sequences)
x = Conv1D(64, 5, activation='relu')(x)
# Passed on to the LSTM layer
x = Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2))(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)
# Passed on to activation layer to get final output
outputs = Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(sequence_input, outputs)



In [26]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau
import plot_loss

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy',metrics=['accuracy'])
ReduceLROnPlateau = ReduceLROnPlateau(factor=0.1, min_lr = 0.0001, monitor = 'val_loss',verbose = 1)
model_save_path = "/content/drive/MyDrive/sentimentAnalysis/model"
checkpoint = ModelCheckpoint(model_save_path, monitor="val_loss", verbose=1, save_best_only=True, mode="auto", period=1)
plot = plot_loss.TrainingPlot()

training = model.fit(x_train, y_train, batch_size=1024, epochs=10,
                    validation_data=(x_test, y_test), callbacks=[ReduceLROnPlateau, checkpoint, plot])



Epoch 1/10
Epoch 1: val_loss improved from inf to 0.45736, saving model to /content/drive/MyDrive/sentimentAnalysis/model




Epoch 2/10
Epoch 2: val_loss did not improve from 0.45736
Epoch 3/10
Epoch 3: val_loss did not improve from 0.45736
Epoch 4/10
Epoch 4: val_loss did not improve from 0.45736
Epoch 5/10
Epoch 5: val_loss improved from 0.45736 to 0.45588, saving model to /content/drive/MyDrive/sentimentAnalysis/model




Epoch 6/10
Epoch 6: val_loss did not improve from 0.45588
Epoch 7/10
Epoch 7: val_loss did not improve from 0.45588
Epoch 8/10
Epoch 8: val_loss did not improve from 0.45588
Epoch 9/10
Epoch 9: val_loss did not improve from 0.45588
Epoch 10/10
Epoch 10: val_loss did not improve from 0.45588


In [27]:
def predict_tweet_sentiment(score):
    return "Positive" if score>0.5 else "Negative"
scores = model.predict(x_test[:1], verbose=1, batch_size=10000)
model_predictions = [predict_tweet_sentiment(score) for score in scores]
print(x_test[:1], model_predictions[:1])

[[     0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0 112939   3089     27    285
    1453    201    209  24860  10673 310448   1284   3089   3927    419]] ['Positive']
