# Contextual Model for Real-Time Tweet Sentiment Prediction

Sentiment analysis plays an important role today, especially for companies that hold large
amounts of text data. The massive volume of data generated on Twitter presents a unique
opportunity for sentiment analysis. However, it is challenging to build an accurate predictive
model for tweets, which often lack sufficient context because of the length limit. In
addition, sentiment-bearing and neutral tweets can be hard to separate because of word ambiguity.
This notebook walks through text pre-processing, visual analysis, and modeling.

***I tried to keep the code as simple as possible so that it remains understandable.***

The proposed **BERT-CNN-BiLSTM-FC** learning pipeline consists of **four sequential modules**.<br />
BERT produces competitive results and is widely used for natural language processing tasks
such as sentiment analysis, named entity recognition (NER), and topic modeling. Combining
CNN and BiLSTM models requires a particular design, since each model has a specific
architecture and its own strengths:<br />
• BERT transforms word tokens from the raw tweet messages into contextual word
embeddings.<br />
• CNN extracts local features (n-gram-like patterns) from the embedded text.<br />
• BiLSTM preserves the order of words in a document in both directions, and its forget gate
lets it discard unnecessary information.<br />
• Fully connected layers add robustness and reduce the instability of results in hard cases.



**References:**<br />
1) [A Sentiment-Aware Contextual Model for Real-Time Disaster Prediction Using Twitter Data](https://www.mdpi.com/1999-5903/13/7/163/htm) -> The idea comes from this paper, which is well worth reading; however, I am not using the same model.<br />
2) [Automatic identification of eyewitness messages on twitter during disasters](https://reader.elsevier.com/reader/sd/pii/S0306457319303590?token=985D740724AEDB812611486EBAD3B68FA4393520D4DCD96FDADE4A642A9805D728945987C1BBBE0FDAA8EC3684E372C7&originRegion=eu-west-1&originCreation=20210920022341)<br />
3) [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882)<br />
4) [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](http://arxiv.org/abs/1810.04805)<br />
5) [LMAES' Notebook](https://www.kaggle.com/lmasca/disaster-tweets-using-bert-embeddings-and-lstm)<br />
6) [PAOLO RIPAMONTI's Notebook](https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis)

In [None]:
# A few of the imports were only used for checks while coding and are not used in the rest of the notebook.

# Most basic stuff for EDA.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Core packages for text processing.
import string
import re

# Libraries for text preprocessing.
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords

# Loading some sklearn packages for modelling.
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # not actively using
from sklearn.decomposition import LatentDirichletAllocation, NMF # not actively using
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Utility
import logging
import itertools

# Core packages for general use throughout the notebook.
import random
import warnings
import time
import datetime

# For customizing our plots.
from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches

# For building our model.
import tensorflow as tf

from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from transformers import BertTokenizer, TFBertModel

# Setting some options for general use.
import os
stop = set(stopwords.words('english'))
plt.style.use('fivethirtyeight')
sns.set(font_scale=1.5)
pd.options.display.max_columns = 250
pd.options.display.max_rows = 250
warnings.filterwarnings('ignore')

# Read Dataset

[Data source](https://www.kaggle.com/c/tweet-sentiment-extraction/data)

The training data file has 4 fields:
* textID: the id of the tweet
* text: the text of the tweet
* selected_text: the phrase that supports the sentiment label
* sentiment: the label (negative, neutral, positive)

The test data file has 3 fields:
* textID: the id of the tweet
* text: the text of the tweet
* sentiment: the label (negative, neutral, positive)

In [None]:
# Read the data
df_train = pd.read_csv('../input/tweet-sentiment-extraction/train.csv')
df_test = pd.read_csv('../input/tweet-sentiment-extraction/test.csv')

In [None]:
# Raw data
df_train.head()

In [None]:
print("null text = ", pd.isna(df_train['text']).sum())
print("null sentiment = ", pd.isna(df_train['sentiment']).sum())

In [None]:
df_train = df_train.dropna(subset=['text'], axis=0)
print("null text = ", pd.isna(df_train['text']).sum())

In [None]:
df_train.describe()

# Visualizing the Data

In [None]:
temp = df_train.groupby('sentiment').count()['text'].reset_index().sort_values(by='text',ascending=False)
temp.style.background_gradient(cmap='Purples')

# Cleaning Text

In [None]:
# Basic helper functions to clean text by removing URLs, HTML tags/entities and punctuation.

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub('', text)


def remove_html(text):
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html.sub('', text)


def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)


# Apply the helpers in order: URLs must be removed before punctuation is stripped,
# otherwise the URL pattern no longer matches.
for df in (df_train, df_test):
    df['text_clean'] = df['text'].apply(remove_URL).apply(remove_html).apply(remove_punct)

In [None]:
df_train.head()
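
To see what the cleaning helpers above do to a single tweet, here is a minimal standalone sketch. It reuses the same regular expressions; the wrapper name `clean_tweet` and the sample text are made up for illustration and are not part of the notebook.

```python
import re
import string

# Same patterns as the helper functions above, compiled once.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_tweet(text):
    """Hypothetical wrapper: strip URLs, then HTML tags/entities, then punctuation."""
    text = URL_RE.sub('', text)
    text = HTML_RE.sub('', text)
    return text.translate(PUNCT_TABLE)


# Made-up sample tweet.
sample = "Loving <b>this</b> &amp; that!!! https://t.co/abc123 #happy"
print(repr(clean_tweet(sample)))  # 'Loving this  that  happy'
```

Two side effects are worth knowing: removed pieces leave double spaces behind, and apostrophes are stripped as punctuation (`don't` becomes `dont`), which changes how BERT's tokenizer splits such words.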

In [None]:
print("TRAIN size:", len(df_train))
print("TEST size:", len(df_test))

________________

# Setup environment to build model

In [None]:
os.environ["WANDB_API_KEY"] = "0" ## to silence warning

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU

# Report the replica count whichever strategy was selected.
print('Number of replicas:', strategy.num_replicas_in_sync)

In [None]:
# hyperparameters
max_length = 128
batch_size = 128

In [None]:
# Bert Tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

# Splitting the dev data

In [None]:
train, dev = train_test_split(df_train, test_size=0.1, random_state=42)

# Label Encoder

In [None]:
labels = df_train.sentiment.unique().tolist()
labels

In [None]:
encoder = LabelEncoder()
encoder.fit(df_train.sentiment.tolist())

# Use the `train` split (not the full df_train), so the dev tweets are not also trained on.
y_train = encoder.transform(train.sentiment.tolist())
y_test = encoder.transform(df_test.sentiment.tolist())
y_dev = encoder.transform(dev.sentiment.tolist())

y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
y_dev = y_dev.reshape(-1,1)

print("y_train",y_train.shape)
print("y_dev",y_dev.shape)
print("y_test",y_test.shape)

In [None]:
def bert_encode(data):
    # Convert to a plain list of strings so a pandas Series can be passed in as well.
    tokens = tokenizer.batch_encode_plus(list(data), max_length=max_length, padding='max_length', truncation=True)

    return tf.constant(tokens['input_ids'])

In [None]:
# Encode only the training split so that it lines up with y_train.
train_encoded = bert_encode(train.text_clean)
dev_encoded = bert_encode(dev.text_clean)


train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((train_encoded, y_train))
    .shuffle(128)
    .batch(batch_size)
)

dev_dataset = (
    tf.data.Dataset
    .from_tensor_slices((dev_encoded, y_dev))
    .shuffle(128)
    .batch(batch_size)
)
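
The `padding='max_length'` and `truncation=True` options used in `bert_encode` make every tweet exactly `max_length` ids long. The sketch below mimics that behaviour for one sequence in plain Python. It is only an illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper and the word-piece ids are arbitrary. The special-token ids are the ones in the `bert-base-uncased` vocabulary.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # [PAD], [CLS], [SEP] in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Mimic padding='max_length' with truncation=True for one sequence.

    token_ids are word-piece ids without special tokens (hypothetical input).
    Returns (input_ids, attention_mask), each of length max_length.
    """
    body = token_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


ids, mask = pad_or_truncate([7592, 2088], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Note that `bert_encode` returns only `input_ids`, so the model in this notebook does not receive the attention mask and treats padding positions like any other token.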

# Proposed Model

In [None]:
def bert_tweets_model():
    # 1) BERT: token ids -> contextual embeddings, one vector per token
    bert_encoder = TFBertModel.from_pretrained(model_name)
    input_word_ids = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
    last_hidden_states = bert_encoder(input_word_ids)[0]
    x = tf.keras.layers.SpatialDropout1D(0.2)(last_hidden_states)
    # 2) CNN: local features over windows of 3 tokens
    x = tf.keras.layers.Conv1D(64, 3, activation='relu')(x)
    # 3) BiLSTM: reads the feature sequence in both directions
    x = tf.keras.layers.Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2))(x)
    # 4) Fully connected layers with dropout, then a 3-way softmax
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(64, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(3, activation='softmax')(x)
    model = tf.keras.Model(input_word_ids, outputs)

    return model

In [None]:
with strategy.scope():
    model = bert_tweets_model()
    adam_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
    model.compile(loss='sparse_categorical_crossentropy',optimizer=adam_optimizer,metrics=['accuracy'])

    model.summary()

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True)
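
As a cross-check on the summary and the plot, the output shape of each stage can be worked out by hand. The sketch below does this in plain Python, assuming the notebook's settings (`max_length = 128`, 64 convolution filters with kernel size 3, 64 LSTM units per direction), the 768-dimensional hidden size of `bert-base-uncased`, and the Keras defaults of `'valid'` padding for `Conv1D` and concatenation for `Bidirectional`. The function names are made up for illustration.

```python
def conv1d_valid_length(seq_len, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (seq_len - kernel_size) // stride + 1


def trace_shapes(max_length=128, hidden_size=768, conv_filters=64,
                 kernel_size=3, lstm_units=64, n_classes=3):
    """Per-example output shape after each stage (batch dimension omitted)."""
    conv_len = conv1d_valid_length(max_length, kernel_size)
    return [
        ("BERT last hidden state", (max_length, hidden_size)),
        ("Conv1D", (conv_len, conv_filters)),
        ("BiLSTM (final states, both directions)", (2 * lstm_units,)),
        ("Dense 128", (128,)),
        ("Dense 64", (64,)),
        ("Softmax", (n_classes,)),
    ]


for name, shape in trace_shapes():
    print(f"{name:40s} {shape}")
```

The convolution shortens the sequence from 128 to 126 positions, and the BiLSTM collapses it into a single 128-dimensional vector because it returns only its final states.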

In [None]:
# Start training. The datasets are already batched, so batch_size is not passed to fit().
history = model.fit(
    train_dataset,
    epochs=12,
    validation_data=dev_dataset,
    verbose=1)
    # Optional: move this inside fit() as an argument to enable early stopping.
    #callbacks=[tf.keras.callbacks.EarlyStopping(
    #            patience=6,
    #            min_delta=0.05,
    #            baseline=0.7,
    #            mode='max',
    #            monitor='val_accuracy',
    #            restore_best_weights=True,
    #            verbose=1)
    #          ])

In [None]:
def plot_graphs(history, metric):
    # `metric` instead of `string`, which would shadow the imported string module
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_' + metric])
    plt.show()
  
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
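
If the validation curve flattens or drops while the training curve keeps improving, stopping early is a common remedy, and the training cell leaves an `EarlyStopping` callback commented out for that purpose. The sketch below shows the patience idea on a plain list of scores. It is a simplified illustration, not Keras's implementation, and the validation accuracies are made up.

```python
def early_stopping_epoch(val_scores, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a higher-is-better metric.

    Simplified sketch: training stops once `patience` consecutive epochs
    fail to beat the best score by more than `min_delta`. Epochs are 0-based.
    """
    best_score = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score + min_delta:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Made-up validation accuracies, for illustration only.
val_accuracy = [0.70, 0.74, 0.76, 0.755, 0.75, 0.74]
print(early_stopping_epoch(val_accuracy, patience=2))  # (4, 2)
```

Here training would stop after epoch 4, and the weights from epoch 2 would be the ones to keep.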

# Predict Manually Before Using Test Data

A decoder that maps the predicted class index back to its label: negative, neutral, or positive.

In [None]:
def decode_sentiment(score):
        if score == 0:
            return "negative"
        elif score == 1:
            return "neutral"
        else:
            return "positive"
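
The hard-coded mapping above matches the encoder because `LabelEncoder` assigns ids in sorted order of the class names. A minimal sketch of deriving the same mapping from the labels themselves follows; `build_decoder` is a hypothetical helper, not part of the notebook.

```python
def build_decoder(label_names):
    """Map integer ids to label names using sorted order, as LabelEncoder does."""
    return dict(enumerate(sorted(set(label_names))))


decoder = build_decoder(["neutral", "negative", "positive", "neutral"])
print(decoder)  # {0: 'negative', 1: 'neutral', 2: 'positive'}
```

With the fitted encoder at hand, `encoder.inverse_transform([class_id])[0]` gives the same result without duplicating the mapping.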

In [None]:
def predict(text, include_neutral=True):
    start_at = time.time()
    # Tokenize text
    x_encoded = bert_encode([text])
    # Predict
    score = model.predict([x_encoded])[0]
    # Decode sentiment
    label = decode_sentiment(np.argmax(score))

    return {"label": label, "score": score,
       "elapsed_time": time.time()-start_at}  

In [None]:
predict("I hate the economy")

In [None]:
predict("I would prefer writing a crawler to create this dataset but i couldn't")

In [None]:
predict("I LOVE NLP")

In [None]:
predict("life is really strange isn't it? just the combination of laugh and sadness")

In [None]:
predict("ESL is the world's largest esports company, leading the industry across the most popular video games.\
        We're proud they've chosen us to help them deliver their launchers to gamers all over the world. Read the full review")

In [None]:
predict("Excited to present a tutorial on 'Modular and Parameter-Efficient Fine-Tuning for NLP Models' \
        at #EMNLP2022 with @PfeiffJo & @licwu.")

In [None]:
predict("Had a song stuck in my head. Thirty seconds later I'm listening to it, thanks to the internet,\
        and Apple/YouTube Music. In the bad old days I'd browse record stores for hours in the hope that the title might jog my memory.\
        It really is a wonderful time to be alive!")

In [None]:
predict("i don't say this lightly - hemingway's life ended by suicide. His life was actually a loss")

In [None]:
predict("these r not ur problems dear!!! these r ur x bf's commitng suicide")

In [None]:
predict("i hve no idea about that i love the uni or not")

In [None]:
predict("I found some old Reddit post in which one guy from english-speaking country complains that\
the names in The Witcher books are 'too difficult' and non- intuitive for english speaker.\
Man, let me introduce you to 'The books werent written only/for english speakers.'' #witcher")

In [None]:
predict("I forgot how cringy all the Slavic names sound read it English \
YOU'RE PRONOUNCING IT ALL WRONG MY EARS ARE HURTING AND I DON'T EVEN HAVE HEARING AIDS IN")

In [None]:
predict("fun fact: ai cannot predict everything right")

In [None]:
predict("our brain is just a machine as well")

# Evaluate

In [None]:
test_encoded = bert_encode(df_test.text_clean)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(test_encoded)
    .batch(batch_size)
)

# The dataset is already batched; each row of the output holds 3 class probabilities.
predicted_tweets = model.predict(test_dataset)

In [None]:
# Pick the most probable class for each tweet.
y_pred = np.argmax(predicted_tweets, axis=1)

In [None]:
print(classification_report(y_test, y_pred))
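
To make the columns of the report concrete, the sketch below computes per-class precision, recall and F1 from raw counts in plain Python. It is an illustration only: `per_class_report` is a hypothetical helper and the labels are toy values, not results from this model.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true class t was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1)
    return report


# Toy labels (0 = negative, 1 = neutral, 2 = positive), not real results.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
for label, scores in per_class_report(y_true_toy, y_pred_toy).items():
    print(label, [round(s, 2) for s in scores])
```

In the toy data the neutral class has perfect recall but lower precision, because one negative tweet was predicted as neutral.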

_____
### In the next step, we will use the trained model to predict the sentiment of tweets fetched from Twitter with Twint.

# Fetching data from Twitter
To get started, install and import the twint package as follows.

In [None]:
# Install first, then import.
!pip install twint
!pip install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint

import twint
import nest_asyncio
nest_asyncio.apply()  # allows twint's event loop to run inside the notebook's own loop

In [None]:
c = twint.Config()
c.Search = "GRAMMYs" #keyword for search
c.Limit = 100 #limit of the number of tweets which will be extracted
c.Store_csv = True 
c.Output = '#GRAMMYs_tweet_data.csv'
twint.run.Search(c)

## The matching tweets are now stored in a .csv (or .json) file, which is fast and convenient

So how do we read the csv/json file back for our purpose?

In [None]:
crawled_data = pd.read_csv("#GRAMMYs_tweet_data.csv")
#crawled_data = pd.read_json("tweet_data.json", lines=True)
pd.options.display.max_columns=36
crawled_data.head()

_____
As you can see above, twint extracts many features. For our purpose we only need the "tweet" feature, which holds the text of each tweet.

In [None]:
# prediction of the first 15 extracted tweets (fewer if fewer were fetched)
for tweet in crawled_data["tweet"].head(15):
    print(tweet)
    print(predict(tweet))
    print("\n")

### If you want to know more about twint, check out its GitHub repository:
https://github.com/twintproject/twint