**Competition Description**

Twitter has become an important *communication* channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g. disaster relief organizations and news agencies).

But it’s not always clear whether a person’s words are actually announcing a disaster.

**Dataset**

The dataset contains three files: `train.csv`, `test.csv`, and `sample_submission.csv`.

**Columns**

* `id` - a unique identifier for each tweet
* `text` - the text of the tweet
* `location` - the location the tweet was sent from (may be blank)
* `keyword` - a particular keyword from the tweet (may be blank)
* `target` - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

**Evaluation Metrics**

Submissions are evaluated using **F1** between the predicted and expected answers.
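
For reference, the usual definition of F1 is written out below. It assumes the competition uses the standard positive-class form, which the description above does not spell out. Here $TP$, $FP$ and $FN$ are the counts of true positives, false positives and false negatives for the "real disaster" class.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```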

## Import Libraries

In [None]:
import tensorflow as tf
print(tf.__version__)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# split training and testing
from sklearn.model_selection import train_test_split

# Import metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


# check GPU 
!nvidia-smi


## Import Dataset

In [None]:
train_df = pd.read_csv("../input/nlp-getting-started/train.csv")
train_df.head()

In [None]:
test_df = pd.read_csv("../input/nlp-getting-started/test.csv")
test_df.head()

In [None]:
# sample_submission.csv
sample_submission =pd.read_csv("../input/nlp-getting-started/sample_submission.csv")
sample_submission.head()

In [None]:
# Check any null values
train_df.isna().sum()

In [None]:
# Value counts 
train_df['target'].value_counts()

## Shuffle the dataset

In [None]:
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled

## Split the dataset into training and validation

In [None]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled['text'].to_numpy(),
                                                                           train_df_shuffled['target'].to_numpy(),
                                                                           test_size=0.3,
                                                                           random_state=42)
len(train_sentences), len(val_sentences), len(train_labels), len(val_labels)

In [None]:
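# Average number of tokens per tweet
# (despite its name, max_vocab_length is used below as the output sequence length, not a vocabulary size)
max_vocab_length = round(sum(len(i.split()) for i in train_sentences) / len(train_sentences))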
max_vocab_length

## Convert Text Into Numbers
In NLP, there are two main concepts for turning text into numbers:

**Tokenization** - A straight mapping from word or character or sub-word to a numerical value. 

There are three main levels of tokenization:

* Using **word-level tokenization** with the sentence "I love TensorFlow" might result in "I" being 0, "love" being 1 and "TensorFlow" being 2. In this case, every word in a sequence is considered a single token.
* **Character-level tokenization**, such as converting the letters A-Z to values 1-26. In this case, every character in a sequence is considered a single token.
* **Sub-word tokenization** is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple tokens.
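
The word-level and character-level mappings described above can be sketched in plain Python. This is only an illustration of the idea, not what `TextVectorization` does internally. The helper `char_to_id` and the choice to index words by order of first appearance are assumptions made for the example.

```python
sentence = "I love TensorFlow"

# Word-level: each distinct word gets an integer index (order of first appearance)
word_index = {}
for word in sentence.split():
    word_index.setdefault(word, len(word_index))
word_tokens = [word_index[word] for word in sentence.split()]


def char_to_id(ch):
    """Map letters a-z (case-insensitive) to 1-26; anything else maps to 0."""
    ch = ch.lower()
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0


# Character-level: every character becomes its own token
char_tokens = [char_to_id(c) for c in "love"]

print(word_index)
print(word_tokens)
print(char_tokens)
```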

**Embeddings** - An embedding is a representation of natural language which can be learned. Representation comes in the form of a feature vector. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. It's important to note here, the size of the feature vector is tuneable. There are two ways to use embeddings:

* **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as tf.keras.layers.Embedding) and an embedding representation will be learned during model training.

* **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.
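
Conceptually, an embedding layer is a lookup table from token ids to feature vectors. The sketch below shows that lookup in plain Python. The tiny vocabulary and the random values are made up for illustration; in a real layer the values are learned weights.

```python
import random

random.seed(42)

# Hypothetical vocabulary; index 0 is reserved for padding in this sketch
vocab = ["[PAD]", "[UNK]", "dance", "love"]
embedding_dim = 5

# One row of `embedding_dim` values per token id (random stand-ins for learned weights)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab
]


def embed(token_ids):
    """Replace every token id with its row in the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([2, 3, 0])  # "dance", "love", padding
print(len(vectors), len(vectors[0]))
```

The size of each vector (`embedding_dim` here, `output_dim` in a Keras `Embedding` layer) is the tuneable part mentioned above.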

In [None]:
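# Create Tokenization Layers
# (TextVectorization is available directly under tensorflow.keras.layers in TensorFlow 2.6+;
# older versions need tensorflow.keras.layers.experimental.preprocessing)
from tensorflow.keras.layers import TextVectorization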

max_tokens = 10000

text_vectorization = TextVectorization(max_tokens=max_tokens,
                                      output_mode='int',
                                      output_sequence_length=max_vocab_length)

# fit the text vectorizer to the training set
text_vectorization.adapt(train_sentences)

In [None]:
# Check with sample_sentences
sample_sentences = "I'm in love with the shape of you We push and pull like a magnet do"
text_vectorization([sample_sentences])
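
Because `output_sequence_length` is set, every vectorized sentence comes out at the same length: shorter inputs are padded with zeros and longer ones are cut off. A minimal pure-Python sketch of that behaviour, where the helper name `pad_or_truncate` is hypothetical:

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut off extra tokens or pad the end with `pad_id`."""
    trimmed = token_ids[:length]
    return trimmed + [pad_id] * (length - len(trimmed))


short_example = pad_or_truncate([5, 3], length=4)
long_example = pad_or_truncate([1, 2, 3, 4, 5], length=3)

print(short_example)
print(long_example)
```

Since this notebook uses the average token count as the sequence length, tweets longer than average lose their trailing tokens. A larger length may be worth trying.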

In [None]:
# Check with random train sentences
import random
random_sentences  = random.choice(train_sentences)

print(f"Original Sentence : \n {random_sentences}\
      \n\nText_Vectorization : ")
text_vectorization([random_sentences])


In [None]:
# Create an Embedding Layer
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_tokens,
                             output_dim=128,
                             embeddings_initializer='uniform',
                             input_length=max_vocab_length)

In [None]:
#check random
random_sentences = random.choice(train_sentences)
print(f"Original Sentences : \n{random_sentences}\
     \n\nEmbeddings : ")
embedding(text_vectorization([random_sentences]))

In [None]:
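# Early Stopping
from tensorflow.keras.callbacks import EarlyStopping

# Monitor 'val_loss': the models below are compiled with metrics=['accuracy'],
# so no 'val_binary_crossentropy' metric is logged.
# Note: the callback only takes effect when passed to fit via callbacks=[callbacks].
callbacks = EarlyStopping(monitor='val_loss',
                          patience=3)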
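
The patience logic behind early stopping can be sketched in plain Python. This is a simplified illustration that assumes a lower monitored value is better and no minimum improvement threshold. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch where training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, for illustration only
stop_epoch = early_stopping_epoch([0.60, 0.50, 0.52, 0.51, 0.53, 0.49], patience=3)
print(stop_epoch)
```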

## Build a model

In [None]:
# Using Long Short-Term Memory (LSTM)

# Define the input layer (one string per example)
inputs = layers.Input(shape=(1,), dtype='string',name='input_shape')

# Pass the inputs to text vectorization layer
x = text_vectorization(inputs)

# Pass the text vectorization layer to embeddings
x = embedding(x)


# Build a model

# LSTM: with the default return_sequences=False only the final output vector is returned
# (set return_sequences=True to get one vector per token, e.g. when stacking recurrent layers)
x = layers.LSTM(units=32)(x)

# Output Layer
outputs = layers.Dense(1, activation='sigmoid', name='output_layer')(x)

# Pass the inputs and outputs to model
model = tf.keras.Model(inputs, outputs, name='model')

# Compile the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
             optimizer=tf.keras.optimizers.Adam(),
             metrics=['accuracy']
             )

# model summary
model.summary()

In [None]:
# fit the model
history = model.fit(train_sentences, 
                   train_labels,
                   epochs=5,
                   validation_data=(val_sentences, val_labels))

In [None]:
# plot loss curves
def plot_loss_curves(history):
    
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    
    epochs = range(len(history.history['loss']))
    
    accuracy = history.history['accuracy']
    val_accuracy = history.history['val_accuracy']
    
    plt.title("Loss")
    plt.plot(epochs, loss, label='loss')
    plt.plot(epochs, val_loss, label='val_loss')
    plt.xlabel("Epochs")
    plt.legend()
    
    plt.figure()
    plt.title("Accuracy")
    plt.plot(epochs, accuracy, label='accuracy')
    plt.plot(epochs, val_accuracy, label='val_accuracy')
    plt.xlabel("Epochs")
    plt.legend()

In [None]:
plot_loss_curves(history=history)

In [None]:
# Evaluate model
model.evaluate(val_sentences, val_labels)

In [None]:
# preds_probs
model_pred_probs = model.predict(val_sentences)
model_pred_probs

In [None]:
# predictions
model_preds = tf.squeeze(tf.round(model_pred_probs))
model_preds
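
Rounding the sigmoid outputs amounts to applying a 0.5 decision threshold. A minimal sketch with an explicit threshold, where the helper name `to_labels` and the probabilities are made up for illustration:

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities into 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical prediction probabilities
probs = [0.91, 0.07, 0.50, 0.49]
labels = to_labels(probs)
stricter_labels = to_labels(probs, threshold=0.9)

print(labels)
print(stricter_labels)
```

Note that `tf.round` rounds halves to the nearest even value, so a probability of exactly 0.5 becomes 0 there, while this sketch maps it to 1. An explicit threshold also makes it easy to trade precision against recall.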

In [None]:
# evaluation metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_preds):
    
    # model_accuracy
    model_accuracy = accuracy_score(y_true, y_preds)* 100
    # calculate model precision, recall and f1 score using "weighted" average
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_preds, average="weighted")
    model_results = {"accuracy": model_accuracy,
                    "precision": model_precision,
                    "recall": model_recall,
                    "f1": model_f1}
    return model_results

In [None]:
model_1_results = calculate_results(val_labels, model_preds)
model_1_results
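
`calculate_results` reports a support-weighted average of the per-class F1 scores (`average="weighted"`). If the competition metric is the positive-class F1, which is an assumption here, the two numbers can differ. The pure-Python sketch below uses made-up labels to show the gap; the helper `f1_for_class` is hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up labels, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# F1 of the "disaster" class only
positive_f1 = f1_for_class(y_true, y_pred, cls=1)

# Support-weighted average over both classes
weighted_f1 = sum(
    f1_for_class(y_true, y_pred, cls) * y_true.count(cls) for cls in (0, 1)
) / len(y_true)

print(f"positive-class F1: {positive_f1:.4f}")
print(f"weighted F1:       {weighted_f1:.4f}")
```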

In [None]:
# Dense model
inputs = tf.keras.Input(shape=(1,), dtype="string", name="inputs") # string inputs
x = text_vectorization(inputs) # turn the input text into token ids
x = embedding(x) # look up a vector per token id (same layer object as the LSTM model, so its weights are shared)
x = layers.GlobalAveragePooling1D()(x) # average the token vectors into one vector per tweet
outputs = tf.keras.layers.Dense(1, activation="sigmoid", name="outputs")(x) # outputs
model_2 = tf.keras.Model(inputs, outputs, name="model2") # Build dense model
model_2.compile(loss="binary_crossentropy", # Compile the model
               optimizer="adam",
               metrics=['accuracy'])
model_2.summary()
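
The pooling step collapses the sequence of token vectors into a single vector by averaging each feature over the sequence. A plain-Python sketch of that idea for one example, with made-up vectors and a hypothetical function name:

```python
def global_average_pooling_1d(sequence):
    """Average a list of equal-length token vectors, feature by feature."""
    steps = len(sequence)
    return [sum(feature_values) / steps for feature_values in zip(*sequence)]


# Three tokens, each represented by a 2-dimensional vector
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pooling_1d(token_vectors)

print(pooled)
```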

In [None]:
history = model_2.fit(train_sentences, train_labels,
                     epochs=5,validation_data=(val_sentences, val_labels))

In [None]:
# plot loss curves
plot_loss_curves(history)

In [None]:
# Evaluate model
model_2.evaluate(val_sentences, val_labels)

In [None]:
# Get prediction_probabilities
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs

In [None]:
# Get the predictions
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds

In [None]:
# Calculate metrics
model_2_results = calculate_results(val_labels, model_2_preds)
model_2_results

In [None]:
# Let's use transfer learning
# we can use this encoding layer in place of our text_vectorizer and embedding_layer
# we will be using the Universal Sentence Encoder (USE) from TensorFlow Hub

# import tensorflow hub
import tensorflow_hub as hub

sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                       input_shape=[], # shape of inputs coming to our model
                                       dtype=tf.string,
                                       trainable=False)

In [None]:
# Create model using the Sequential API
model_3 = tf.keras.Sequential([
    sentence_encoder_layer,
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
],name='USE')

# Compile model
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

model_3.summary()

In [None]:
history_3 = model_3.fit(train_sentences,
                       train_labels,
                       epochs=5,
                       validation_data=(val_sentences, val_labels))


In [None]:
# plot_loss_curves
plot_loss_curves(history_3)

In [None]:
# make prediction probabilities
model_3_prediction_probs = model_3.predict(val_sentences)
model_3_prediction_probs[:10]

In [None]:
# prediction 
model_3_preds = tf.squeeze(tf.round(model_3_prediction_probs))
model_3_preds[:10]

In [None]:
# Calculate results
model_3_results = calculate_results(val_labels, model_3_preds)
model_3_results

## Comparing the performance of each of our models

In [None]:
# Combine model results into a DataFrame
model_results = pd.DataFrame({"model_1": model_1_results,
                             "model_2": model_2_results,
                             "model_3":model_3_results})
model_results = model_results.transpose()
model_results

In [None]:
# change the accuracy to same scale as other metrics
model_results['accuracy'] = model_results['accuracy']/100


In [None]:
# plot and compare
model_results.plot(kind='bar', figsize=(10,5)).legend(bbox_to_anchor=(1.0, 1.0))

In [None]:
# Check which model got more accuracy
model_results.sort_values("accuracy", ascending=False)['accuracy'].plot(kind='bar', figsize=(10,5))

In [None]:


# precision
model_results.sort_values('precision', ascending=False)['precision'].plot(kind='bar', figsize=(10,5))

In [None]:
# recall
model_results.sort_values("recall", ascending=False)['recall'].plot(kind='bar', figsize=(10,5))

In [None]:
# f1
model_results.sort_values('f1',ascending=False)['f1'].plot(kind='bar', figsize=(10,5))

We can see that `model_3` performed better than the other models on `accuracy`, `recall`, `precision`, and `f1`.

## Make Predictions on the Test Dataset

In [None]:
test_sentences = test_df['text'].to_list()
test_sentences[:10]

In [None]:
# Making predictions on the test dataset

# Keep all the sentences in a list
test_sentences = test_df["text"].to_list()
# take a random sample of 10 sentences from the list
test_samples = random.sample(test_sentences, 10)
# loop through test_samples
for test_sample in test_samples:
    pred_prob = tf.squeeze(model_3.predict([test_sample])) # has to be list
    pred = tf.round(pred_prob)
    print(f"Pred: {int(pred)}, Prob: {pred_prob}")
    print(f"Text:\n{test_sample}\n")
    print("----\n")



In [None]:
# pass the text column as an array of strings (the same format used for training)
prediction_probs = model_3.predict(test_df['text'].to_numpy())
prediction_probs

In [None]:
predictions = tf.squeeze(tf.round(prediction_probs))
predictions

In [None]:
test_df

In [None]:
submission = pd.DataFrame()
submission['id'] = test_df['id']
# store the labels as integers (0/1) rather than floats
submission['target'] = predictions.numpy().astype(int)

In [None]:
submission

In [None]:
submission.to_csv("submission.csv", index=False)
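
Before uploading, it can help to sanity-check the file. The sketch below uses only the standard library and assumes the expected layout is an `id,target` header with integer 0/1 targets, as in `sample_submission.csv`. The function name and the in-memory CSV strings are hypothetical.

```python
import csv
import io


def validate_submission(csv_text, expected_ids):
    """Check the header, the id order and that every target is the string '0' or '1'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        return False
    rows = list(reader)
    ids_match = [int(row["id"]) for row in rows] == list(expected_ids)
    targets_ok = all(row["target"] in {"0", "1"} for row in rows)
    return ids_match and targets_ok


# In-memory stand-ins for a submission file
good_csv = "id,target\n0,1\n2,0\n3,1\n"
float_csv = "id,target\n0,1.0\n2,0.0\n3,1.0\n"

good_ok = validate_submission(good_csv, expected_ids=[0, 2, 3])
float_ok = validate_submission(float_csv, expected_ids=[0, 2, 3])

print(good_ok, float_ok)
```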