# Sentiment-aware Contextual Model for Tweets

Sentiment analysis plays an important role today, especially for private companies
that hold large amounts of data. The massive volume of data generated on Twitter presents a unique
opportunity for sentiment analysis. However, it is challenging to build an accurate predictive
model for tweets, which may lack sufficient context because of the length limit. In
addition, sentimental and regular tweets can be hard to separate because of word ambiguity. This
notebook walks through the phases of text pre-processing, visual analysis and modeling.

***I tried to keep code as simple as possible to remain understandable.***

The proposed **BERT-CNN-BiLSTM** learning pipeline consists of **three sequential modules**.<br />
BERT produces competitive results and has become a standard building block for natural
language processing tasks such as sentiment analysis, named entity recognition (NER), and topic
modeling. Combining CNN and BiLSTM models requires a particular design, since each
model has a specific architecture and its own strengths:<br />
• BERT transforms word tokens from the raw Tweet messages into contextual word
embeddings.<br />
• CNN is known for its ability to extract local features from the text.<br />
• BiLSTM keeps the chronological order between words in a document, and its forget gate
lets it discard unnecessary information.<br />
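
To make the flow through the three modules concrete, here is a minimal shape walk-through. It assumes the settings used in the model cell of this notebook (sequences padded to 140 tokens, a `Conv1D` with 32 filters and kernel size 3, a bidirectional LSTM with 100 units per direction) and the 768-dimensional hidden size of `bert-base`. The helper name `pipeline_shapes` exists only for this illustration.

```python
# Illustrative shape arithmetic only; no model is built here.

def pipeline_shapes(max_length=140, hidden_size=768, filters=32,
                    kernel_size=3, lstm_units=100):
    """Return the per-sample output shape after each module."""
    # BERT emits one contextual vector per token position.
    bert_out = (max_length, hidden_size)
    # A 1D convolution without padding shortens the sequence by kernel_size - 1.
    conv_out = (max_length - kernel_size + 1, filters)
    # A bidirectional LSTM returning only its final state concatenates both directions.
    bilstm_out = (2 * lstm_units,)
    # A single sigmoid unit produces the sentiment score.
    dense_out = (1,)
    return {"bert": bert_out, "conv1d": conv_out,
            "bilstm": bilstm_out, "dense": dense_out}


shapes = pipeline_shapes()
for name, shape in shapes.items():
    print(f"{name:>7}: {shape}")
```

The convolution sees three neighbouring token vectors at a time, while the BiLSTM reads the resulting feature sequence in both directions before the final score is computed.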

___
### Important
This notebook is built around binary classification (negative/positive); the neutral label is added manually with a threshold. If you need multi-class classification that includes neutral from the beginning, feel free to check my other notebook on [Multi-class Classification](https://www.kaggle.com/toygarr/contextual-model-and-crawling-for-real-tweets).
___
#### _In the latest update, an Error Analysis based on the test set was added._

**References:**<br />
1) [A Sentiment-Aware Contextual Model for Real-Time Disaster Prediction Using Twitter Data](https://www.mdpi.com/1999-5903/13/7/163/htm) -> The idea comes from this paper, which is really worth checking out; however, I made some changes.<br />
2) [Automatic identification of eyewitness messages on twitter during disasters](https://reader.elsevier.com/reader/sd/pii/S0306457319303590?token=985D740724AEDB812611486EBAD3B68FA4393520D4DCD96FDADE4A642A9805D728945987C1BBBE0FDAA8EC3684E372C7&originRegion=eu-west-1&originCreation=20210920022341)<br />
3) [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](http://arxiv.org/abs/1810.04805)<br />
4) [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882)<br />
5) [Long Short-Term Memory](https://www.researchgate.net/publication/13853244_Long_Short-term_Memory#fullTextFileContent)<br />
6) [LMAES' Notebook](https://www.kaggle.com/lmasca/disaster-tweets-using-bert-embeddings-and-lstm)<br />
7) [PAOLO RIPAMONTI's Notebook](https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis)

## > If you find my work useful please don't forget to **Upvote!**  so it can reach more people.

___
# Personal observations that I recommend reading before starting
#### After trying many hyperparameters and different models on a very different test distribution, I found that additions such as extra dense layers make the models appreciably more robust. The current model has no extra dense layer, but I mention these experiments so that you can try them for further improvement.

#### Finding good hyperparameters is the real issue once preprocessing has been done properly. We should not apply every preprocessing step. I applied some of them to show how it is done; however, traditional preprocessing generally harms the text that BERT or any other contextual model learns from. We need to check how the embedding model we are going to use was trained, and think about why any specific property needs to be cleaned.
#### For example: `version 11` has better accuracy and works far better as a sentiment detector on custom real-world sentences drawn from a different distribution. In `version 13`, I added a stopword-removal function and the validation score dropped from 84.49% to 80.76%. I ran this experiment to demonstrate the issue: we lose high-level linguistic features if we strip everything from real texts. In `version 13`, the model lost most of its ability to find the actual sentiment of arbitrary real-world sentences passed to the prediction function.
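
A tiny sketch of why stopword removal can hurt sentiment: NLTK's English stopword list contains negations such as "not" and "no", so removing stopwords can erase the difference between a negative and a positive sentence. The small `STOPWORDS` set below is a hand-picked subset used only for illustration, not the full NLTK list, and `drop_stopwords` is a stand-in name.

```python
# Hand-picked subset of English stopwords (an assumption for this sketch;
# the notebook itself uses nltk.corpus.stopwords).
STOPWORDS = {"i", "am", "not", "no", "the", "is"}


def drop_stopwords(text):
    """Remove stopwords using a plain whitespace split."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


negative = drop_stopwords("i am not happy")
positive = drop_stopwords("i am happy")

print(repr(negative))
print(repr(positive))
print("identical after cleaning:", negative == positive)
```

Once both sentences collapse to the same string, no embedding model can tell them apart, whatever its capacity.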

#### The sentiment effect of words changes a lot depending on the training phase. Even though the accuracy looks close across experiments, the results on real-world sentences from a totally different distribution vary considerably.

#### v12 is the same as v13, with only some typos fixed.
___

In [None]:
# A few of the imports were only used for checking while coding and are not used in the rest of the notebook.

# Most basic stuff for EDA.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Core packages for text processing.
import string
import re

# Libraries for text preprocessing.
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords


# Loading some sklearn packages for modelling.
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import LatentDirichletAllocation, NMF # not actively using
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Utility
import logging
import itertools


# Core packages for general use throughout the notebook.
import random
import warnings
import time
import datetime

# For customizing our plots.
from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches

# For building our model.
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from transformers import BertTokenizer, TFBertModel

# Setting some options for general use.
import os
stop = set(stopwords.words('english'))
plt.style.use('fivethirtyeight')
sns.set(font_scale=1.5)
pd.options.display.max_columns = 250
pd.options.display.max_rows = 250
warnings.filterwarnings('ignore')

In [None]:
# DATASET
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"

# SENTIMENT
POSITIVE = "POSITIVE"
NEGATIVE = "NEGATIVE"
NEUTRAL = "NEUTRAL"
SENTIMENT_THRESHOLDS = (0.4, 0.7)

# EXPORT
KERAS_MODEL = "model.h5"

# Read Dataset

### Dataset details:
http://help.sentiment140.com is pretty old and most of its links are broken; however, check it out for more detail.

Latest availability of dataset: https://www.kaggle.com/kazanova/sentiment140

@ONLINE {Sentiment140, <br />
    author = "Go, Alec and Bhayani, Richa and Huang, Lei", <br />
    title  = "Twitter Sentiment Classification using Distant Supervision", <br />
    year   = "2009", <br />
    url    = "http://help.sentiment140.com/home"
}

*According to the creators of the dataset:* \
"Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search".


The data is a CSV file with emoticons removed. Each row has 6 fields:
* target: the polarity of the tweet (0 = negative, 4 = positive)<br/> -> We will insert (2 = neutral) manually using a threshold.
* ids: the id of the tweet
* date: the date of the tweet
* flag: the query (lyx). If there is no query, then this value is NO_QUERY.
* user: the user that tweeted
* text: the text of the tweet
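
As a small illustration of this layout, the sketch below parses one row with Python's `csv` module. The sample row is made up for the example and is not taken from the dataset; the field names mirror the list above.

```python
import csv
import io

FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# A made-up row in the same six-field layout (not real data).
sample = ('"0","1","Mon Apr 06 22:19:45 PDT 2009",'
          '"NO_QUERY","some_user","this is an example tweet"\n')

# The file has no header line, so the field names are supplied explicitly.
reader = csv.DictReader(io.StringIO(sample), fieldnames=FIELDS)
row = next(reader)

polarity = {0: "negative", 4: "positive"}[int(row["target"])]
print(polarity, "|", row["flag"], "|", row["text"])
```

Only `target` and `text` are needed for training; the remaining fields are metadata.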

In [None]:
# Read the data
df = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv', 
                 encoding = DATASET_ENCODING, names=DATASET_COLUMNS)

In [None]:
# Raw data
df.head()

# Map target label to String
* 0 -> NEGATIVE
* 2 -> NEUTRAL
* 4 -> POSITIVE

We reserve "2" for the "neutral" label in advance.

In [None]:
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}
def decode_sentiment(label):
    return decode_map[int(label)]

In [None]:
%%time
df.target = df.target.apply(lambda x: decode_sentiment(x))

# Cleaning Text

So basically what we will do here:

* Remove stopwords, URLs, HTML tags and punctuation <br/>

According to this [paper](https://aclanthology.org/2020.pam-1.15.pdf), punctuation removal matters for BERT, both statistically and for sentiment. It significantly affects counting.

On the other hand, Twitter data is really messy, so it may not have an obvious positive impact on the results. In our case we removed the punctuation, but keep that paper in mind.
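
The order of the cleaning steps matters. The sketch below uses simplified stand-ins for the helper functions (the names `strip_urls` and `strip_punct` are illustrative) to show that stripping punctuation before URLs leaves URL debris behind, because the URL pattern no longer matches.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_urls(text):
    """Delete anything that looks like a URL."""
    return URL_RE.sub('', text)


def strip_punct(text):
    """Delete every ASCII punctuation character."""
    return text.translate(PUNCT_TABLE)


tweet = "great read: https://example.com/post !!"

urls_first = strip_punct(strip_urls(tweet))    # URL removed while it is still intact
punct_first = strip_urls(strip_punct(tweet))   # "://" and "." are gone before the URL check

print(repr(urls_first))
print(repr(punct_first))
```

Running the URL step first, as the cleaning cell in this notebook does, avoids feeding glued-together URL fragments to the tokenizer.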

In [None]:
# Some basic helper functions to clean text by removing stopwords, URLs, HTML tags and punctuation.
def remove_stopwords(text):
    tokens = []
    for token in text.split():
        if token not in stop:
            tokens.append(token)
    return " ".join(tokens)


def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)


def remove_html(text):
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return re.sub(html, '', text)


def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

# Applying helper functions
df['text_clean'] = df['text'].apply(lambda x: remove_stopwords(x))
df['text_clean'] = df['text_clean'].apply(lambda x: remove_URL(x))
df['text_clean'] = df['text_clean'].apply(lambda x: remove_html(x))
df['text_clean'] = df['text_clean'].apply(lambda x: remove_punct(x))

In [None]:
df.head()

# Visualizing the Data

In [None]:
# Displaying target distribution.

fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(12, 4), dpi=70)
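# Pass the column by keyword; recent seaborn versions no longer accept it positionally.
sns.countplot(x=df['target'], ax=axes[0])
# Take the labels from value_counts() so they always match the slice order.
target_counts = df['target'].value_counts()
axes[1].pie(target_counts,
            labels=target_counts.index,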
            autopct='%1.2f%%',
            shadow=True,
            explode=(0.05, 0),
            startangle=60)
fig.suptitle('Distribution of the Tweets', fontsize=24)
plt.show()

________________

As a quick observation, the dataset has no label imbalance: the Negative and Positive labels are equally frequent. This balance helps the model learn more accurately. However, keep in mind that the dataset may contain many mislabelled texts because of the way it was collected, with **":)" : positive** or **":(" : negative** as the only criterion. The problem is that many users send ":)" or ":(" ironically.  

> An example: "u really like that thing damn :))))"

This problem may decrease the accuracy; however, at the end of the day we are creating a sentiment-aware model that depends on the words.
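
The labelling idea described above can be sketched as follows. This is only an illustration of distant supervision with emoticons, not the dataset creators' actual code; the function name and the rule for ambiguous tweets are assumptions.

```python
POSITIVE_LABEL = 4
NEGATIVE_LABEL = 0


def label_by_emoticon(text):
    """Assign a polarity from emoticons alone; return None when the signal is missing or mixed."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos == has_neg:  # neither emoticon, or both at once
        return None
    return POSITIVE_LABEL if has_pos else NEGATIVE_LABEL


# The ironic example is labelled positive purely because it contains ":)".
ironic = label_by_emoticon("u really like that thing damn :))))")
sad = label_by_emoticon("missed my bus again :(")
plain = label_by_emoticon("just a plain sentence")

print(ironic, sad, plain)
```

A rule like this never looks at the words themselves, which is why ironic tweets end up with the wrong label.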

_____

In [None]:
# Creating a new feature for the visualization.

df['Character Count'] = df['text_clean'].apply(lambda x: len(str(x)))


def plot_dist3(df_x, feature, title):
    # Creating a customized chart. and giving in figsize and everything.
    fig = plt.figure(constrained_layout=True, figsize=(18, 8))
    # Creating a grid of 3 cols and 3 rows.
    grid = gridspec.GridSpec(ncols=3, nrows=3, figure=fig)

    # Customizing the histogram grid.
    ax1 = fig.add_subplot(grid[0, :2])
    # Set the title.
    ax1.set_title('Histogram')
    # plot the histogram.
    sns.distplot(df_x.loc[:, feature],
                 hist=True,
                 kde=True,
                 ax=ax1,
                 color='#e74c3c')
    ax1.set(ylabel='Frequency')
    ax1.xaxis.set_major_locator(MaxNLocator(nbins=20))

    # Customizing the ecdf_plot.
    ax2 = fig.add_subplot(grid[1, :2])
    # Set the title.
    ax2.set_title('Empirical CDF')
    # Plotting the ecdf_Plot.
    sns.distplot(df_x.loc[:, feature],  # use the subset passed in, not the global df
                 ax=ax2,
                 kde_kws={'cumulative': True},
                 hist_kws={'cumulative': True},
                 color='#e74c3c')
    ax2.xaxis.set_major_locator(MaxNLocator(nbins=20))
    ax2.set(ylabel='Cumulative Probability')

    # Customizing the Box Plot.
    ax3 = fig.add_subplot(grid[:, 2])
    # Set title.
    ax3.set_title('Box Plot')
    # Plotting the box plot.
    # Plot the subset passed in (df_x) on the y axis so the box is vertical.
    sns.boxplot(y=feature, data=df_x, orient='v', ax=ax3, color='#e74c3c')
    ax3.yaxis.set_major_locator(MaxNLocator(nbins=25))

    plt.suptitle(f'{title}', fontsize=24)

In [None]:
plot_dist3(df[df['target'] == 'NEGATIVE'], 'Character Count',
           'Characters Per "NEGATIVE" Tweet')

In [None]:
plot_dist3(df[df['target'] == "POSITIVE"], 'Character Count',
           'Characters Per "POSITIVE" Tweet')

# Setup environment to build model

In [None]:
os.environ["WANDB_API_KEY"] = "0" ## to silence warning

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU

# Report the replica count for whichever strategy was selected.
print('Number of replicas:', strategy.num_replicas_in_sync)

In [None]:
# hyperparameters
max_length = 140 # maximum tweet length at the time the dataset was collected
batch_size = 512 # a large batch size is used because it shortens training time significantly on this very large dataset

____
Again, we will use bert-base-uncased because our texts are not properly written; they are mostly chaotic. <br/>
We do not need to consider character case for now. A cased model would make more sense if we were classifying news articles or papers, so we go with uncased BERT.
____
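
What "uncased" means in practice can be sketched in a few lines: the text is lowercased and accent marks are stripped before WordPiece tokenization. The function below is a simplified stand-in written for illustration, not the tokenizer's actual implementation.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase the text and strip combining accent marks."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category "Mn"), which carry the accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("I LOVE NLP"))
print(uncased_normalize("Caf\u00e9 in S\u00e3o Paulo"))
```

For tweets, where capitalization is inconsistent, this normalization maps "LOVE", "Love" and "love" to the same input.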

In [None]:
# Bert Tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

# Splitting the data
We have a huge dataset, which allows us to take only 1% of the whole set for each of the development and test sets.

To be clearer: in the code below, we first take 1% of the whole set as the test set. Afterwards, we take 1% of the remaining 99% to create the development set.
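
The resulting set sizes can be worked out by hand. The sketch below assumes the 1,600,000 rows implied by the dataset file name and uses plain integer arithmetic, so the exact counts printed by the notebook may differ by one because of rounding inside `train_test_split`. The helper name `split_sizes` is only for illustration.

```python
def split_sizes(n_total, test_percent=1, dev_percent=1):
    """Return (train, dev, test) sizes for two successive percentage splits."""
    n_test = n_total * test_percent // 100
    remaining = n_total - n_test
    n_dev = remaining * dev_percent // 100
    n_train = remaining - n_dev
    return n_train, n_dev, n_test


train_n, dev_n, test_n = split_sizes(1_600_000)
print(train_n, dev_n, test_n)   # 1568160 15840 16000
```

Note that the development set is 1% of the remaining 99%, so it is slightly smaller than the test set.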

In [None]:
train_df, test = train_test_split(df, test_size=0.01, random_state=42)
x_train, dev = train_test_split(train_df, test_size=0.01, random_state=42)

In [None]:
print(x_train.shape)
print(test.shape)
print(dev.shape)

___

### Data Decrease
I will take 500k of the training samples because training on all ~1.5M texts takes too much time. However, you can try the full set to see how data size affects the results.
___

In [None]:
train = x_train[:500000]

# Label Encoder

In [None]:
labels = train.target.unique().tolist()
labels.append(NEUTRAL)
labels

In [None]:
encoder = LabelEncoder()
encoder.fit(train.target.tolist())

y_train = encoder.transform(train.target.tolist())
y_test = encoder.transform(test.target.tolist())
y_dev = encoder.transform(dev.target.tolist())

y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
y_dev = y_dev.reshape(-1,1)

print("y_train",y_train.shape)
print("y_test",y_test.shape)

In [None]:
def bert_encode(data):
    # Convert to a plain list so that pandas Series and Python lists are handled alike.
    tokens = tokenizer.batch_encode_plus(list(data), max_length=max_length, padding='max_length', truncation=True)
    
    return tf.constant(tokens['input_ids'])

In [None]:
train_encoded = bert_encode(train.text_clean)
dev_encoded = bert_encode(dev.text_clean)


train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((train_encoded, y_train))
    .shuffle(128)
    .batch(batch_size)
)

dev_dataset = (
    tf.data.Dataset
    .from_tensor_slices((dev_encoded, y_dev))
    .shuffle(128)
    .batch(batch_size)
)

# Proposed Model

In [None]:
def bert_model():

    bert_encoder = TFBertModel.from_pretrained(model_name)
    input_word_ids = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
    last_hidden_states = bert_encoder(input_word_ids)[0]
    x = tf.keras.layers.SpatialDropout1D(0.2)(last_hidden_states)
    x = tf.keras.layers.Conv1D(32, 3, activation='relu')(x)
    x = tf.keras.layers.Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2))(x)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(input_word_ids, outputs)
    
    return model

In [None]:
with strategy.scope():
    model = bert_model()
    adam_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
    model.compile(loss='binary_crossentropy',optimizer=adam_optimizer,metrics=['accuracy'])

    model.summary()

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True)

In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# The metric is registered as 'accuracy', so the validation key is 'val_accuracy'.
callbacks = [ ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
              EarlyStopping(monitor='val_accuracy', min_delta=1e-5, patience=5)]

In [None]:
# Start training
history = model.fit(
    train_dataset,  # already batched by tf.data, so batch_size is not passed again
    epochs=3,
    validation_data=dev_dataset,
    verbose=1,
    callbacks=callbacks)

In [None]:
# SAVE MODEL WEIGHTS
model.save_weights('sentiment_weights_v1.h5')

In [None]:
# LOAD MODEL WEIGHTS
#model.load_weights('../input/-THE PATH THAT YOU UPLOADED WEIGHTS ON KAGGLE-/sentiment_weights_v1.h5')

# To use saved weights, build the same model again (without fitting) and then load the weights into it.

In [None]:
def plot_graphs(history, metric):
    # Plot the training and validation curves of one metric across epochs.
    plt.plot(history.history[metric])
    plt.plot(history.history['val_'+metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])
    plt.show()
   
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

# Predict Manually Before Using Test Data

The decoder below lets us see results labelled as negative, positive or neutral. <br/>
**Recap:** The thresholds were set to (0.4, 0.7). <br/>
<br/>
To include the neutral label, pass include_neutral = **True** after the given text.

In [None]:
def decode_sentiment(score, include_neutral=False):
    if include_neutral:        
        label = NEUTRAL
        if score <= SENTIMENT_THRESHOLDS[0]:
            label = NEGATIVE
        elif score >= SENTIMENT_THRESHOLDS[1]:
            label = POSITIVE

        return label
    else:
        return NEGATIVE if score < 0.5 else POSITIVE

In [None]:
def predict(text, include_neutral=False):
    start_at = time.time()
    # Tokenize text
    x_encoded = bert_encode([text])
    # Predict
    # The model returns an array of shape (1, 1); extract the scalar probability.
    score = float(model.predict(x_encoded)[0][0])
    # Decode sentiment
    label = decode_sentiment(score, include_neutral=include_neutral)

    return {"label": label, "score": float(score),
       "elapsed_time": time.time()-start_at}  

In [None]:
predict("I hate the economy")

In [None]:
predict("I would prefer writing a crawler to create this dataset but i couldn't", True)

In [None]:
predict("I LOVE NLP")

In [None]:
predict("life is really strange isn't it? just the combination of laugh and cry", True)

In [None]:
predict("ESL is the world's largest esports company, leading the industry across the most popular video games.\
        We're proud they've chosen us to help them deliver their launchers to gamers all over the world. Read the full review")

In [None]:
predict("Excited to present a tutorial on 'Modular and Parameter-Efficient Fine-Tuning for NLP Models' \
        at #EMNLP2022 with @PfeiffJo & @licwu.")

In [None]:
predict("Had a song stuck in my head. Thirty seconds later I'm listening to it, thanks to the internet,\
        and Apple/YouTube Music. In the bad old days I'd browse record stores for hours in the hope that the title might jog my memory.\
        It really is a wonderful time to be alive!")

In [None]:
predict("i don't say this lightly - hemingway's life ended by suicide. His life was actually a loss")

In [None]:
predict("these r not ur problems dear!!! these r ur x bf's commitng suicide")

In [None]:
predict("i hve no idea about i love the uni or not", True)

In [None]:
predict("For the third time in four years, the Warriors are champions once again.\
This time, they wasted no time in the NBA Finals, dispatching LeBron James and the Cavs in four straight games.\
Here’s how they sealed the championship in Game 4. https://twitter.com/i/moments/1005197277663641600")

In [None]:
predict("I found some old Reddit post in which one guy from english-speaking country complains that\
the names in The Witcher books are 'too difficult' and non- intuitive for english speaker.\
Man, let me introduce you to 'The books werent written only/for english speakers.'' #witcher")

In [None]:
predict("I forgot how cringy all the Slavic names sound read it English \
YOU'RE PRONOUNCING IT ALL WRONG MY EARS ARE HURTING AND I DON'T EVEN HAVE HEARING AIDS IN")

In [None]:
predict("fun fact: ai cannot predict everything right")

In [None]:
predict("brain is just machine", True)

# Test Results

In [None]:
test_encoded = bert_encode(test.text_clean)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(test_encoded)
    .batch(batch_size)
)

y_pred = []
predicted_tweets = model.predict(test_dataset)  # test_dataset is already batched
predicted_tweets_binary = tf.cast(tf.round(predicted_tweets), tf.int32).numpy().flatten()

In [None]:
%%time
scores = model.evaluate(test_encoded, y_test, batch_size=batch_size)
print()
print("ACCURACY:",scores[1])
print("LOSS:",scores[0])

# Reducing the really bad guesses of the old predict function

#### Create a function that applies the same cleaning to arbitrary input text, and retry some wrongly predicted examples with it

In [None]:
def improved_prediction(text, include_neutral=False):
    start_at = time.time()
    # Applying helper functions
    text = remove_stopwords(text)
    text = remove_URL(text)
    text = remove_html(text)
    text = remove_punct(text)
    # Tokenize text
    x_encoded = bert_encode([text])
    # Predict
    # The model returns an array of shape (1, 1); extract the scalar probability.
    score = float(model.predict(x_encoded)[0][0])
    # Decode sentiment
    label = decode_sentiment(score, include_neutral=include_neutral)

    return {"label": label, "score": float(score),
       "elapsed_time": time.time()-start_at}

In [None]:
improved_prediction("life is really strange isn't it? just the combination of laugh and cry", True)

In [None]:
improved_prediction("For the third time in four years, the Warriors are champions once again.\
This time, they wasted no time in the NBA Finals, dispatching LeBron James and the Cavs in four straight games.\
Here’s how they sealed the championship in Game 4. https://twitter.com/i/moments/1005197277663641600")

In [None]:
improved_prediction("brain is just machine", True)

#### Still problematic, but less so. Moreover, after these experiments we can say that keeping stopwords and punctuation gives BERT embeddings more contextual information and robustness. Most of this has been shown in papers, and we can see it for ourselves as well.

In [None]:
# Decode the raw test-set probabilities (not the evaluate() metrics) into labels.
y_pred = [decode_sentiment(score) for score in predicted_tweets.flatten()]
y_pred[:10]

In [None]:
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Each row (true label) is normalized so that it sums to 1.
    """

    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=20)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90, fontsize=16)
    plt.yticks(tick_marks, classes, fontsize=16)

    fmt = '.2f'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label', fontsize=12)
    plt.xlabel('Predicted label', fontsize=12)

In [None]:
cnf_matrix = confusion_matrix(y_test, predicted_tweets_binary)
plt.figure(figsize=(6,6))
# encoder.classes_ lists the labels in the same order as the encoded integers.
plot_confusion_matrix(cnf_matrix, classes=encoder.classes_, title="Confusion matrix")
plt.show()

# Comprehensive Report

In [None]:
print('Precision: %.4f' % precision_score(y_test, predicted_tweets_binary))
print('Recall: %.4f' % recall_score(y_test, predicted_tweets_binary))
print('Accuracy: %.4f' % accuracy_score(y_test, predicted_tweets_binary))
print('F1 Score: %.4f' % f1_score(y_test, predicted_tweets_binary))
print(classification_report(y_test, predicted_tweets_binary))

# Error Analysis

When you start looking inside the data alongside the predicted labels, you will realise that many texts were mislabelled to begin with. Therefore, very high accuracy is not to be expected. However, we can say that our model works pretty well compared with its competitors.

In [None]:
# Use a separate name so the score-based decode_sentiment defined earlier stays available.
decode_map = {0: "NEGATIVE", 1: "POSITIVE"}
def decode_label(label):
    return decode_map[int(label)]

In [None]:
df = pd.DataFrame(test.text, columns=["text"])
df['ids'] = test.ids
df["actual"] = test.target
df["predicted"] = predicted_tweets_binary
df.predicted = df.predicted.apply(lambda x: decode_label(x))
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)
incorrect = df[df["actual"] != df["predicted"]]
incorrect[10:20]

In [None]:
correct = df[df['actual'] == df['predicted']]
correct.head(10)

# Fetching data from Twitter
To get started,

* Install and import the twint package as follows.

In [None]:
!pip install twint
!pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
import twint
import nest_asyncio

In [None]:
c = twint.Config()

c.Search = "elonmusk" #keyword for search
c.Limit = 20 #limit of the number of tweets which will be extracted
c.Store_csv = True 
c.Output = 'elonmusk_tweet_data.csv'

nest_asyncio.apply()
twint.run.Search(c)

## We stored the related tweets in a .csv or .json file, which is really fast and cool

So how do we read from the csv/json file for our purpose?

In [None]:
crawled_data = pd.read_csv("elonmusk_tweet_data.csv")
#crawled_data = pd.read_json("tweet_data.json", lines=True)
pd.options.display.max_columns=36
crawled_data.head()

_____
As you can see above, twint extracts lots of features. However, for our purpose we only need the "tweet" feature, which contains the text of the tweets.

In [None]:
# Prediction for the first 15 extracted tweets (fewer if the crawl returned less).
for tweet in crawled_data["tweet"].head(15):
    print(tweet)
    print(improved_prediction(tweet))
    print("\n")

### If you want to know more about twint, you can check out this GitHub link:
https://github.com/twintproject/twint