## Disaster Tweets Classification with TFHub

## Table of Contents
- Overview
- Import Packages and Datasets
- Data Wrangling
- Data Preprocessing
- Model Development
- Model Evaluation
- Submission
- Conclusion

## Overview
In this notebook, I build a text classifier that reads a dataset of tweets and predicts whether each tweet is about a disaster.

## Import Packages and Datasets 

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
import time
from sklearn.metrics import confusion_matrix, classification_report
import tensorflow_hub as hub

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
train.head()

In [None]:
train.shape

In [None]:
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
test.head()

In [None]:
train.location.value_counts()

## Data Wrangling
Let's see null values for each column.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
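# Replace missing keywords and locations with empty strings.
# Assigning the result back avoids chained in-place edits on a column.
train["keyword"] = train["keyword"].fillna("")
train["location"] = train["location"].fillna("")
test["keyword"] = test["keyword"].fillna("")
test["location"] = test["location"].fillna("")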

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()
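
To make the fill step concrete, here is a small sketch on a toy frame. The rows and the `toy` / `text_cols` names are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (values are made up).
toy = pd.DataFrame({
    "keyword": ["fire", np.nan],
    "location": [np.nan, "Tokyo"],
    "text": ["Forest fire near the town", "What a lovely day"],
})

# Fill only the optional text columns and assign the result back.
text_cols = ["keyword", "location"]
toy[text_cols] = toy[text_cols].fillna("")

# Every column should now report zero missing values.
print(toy.isnull().sum().to_dict())
```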

## Data Preprocessing

In [None]:
contents = []
for data in [train, test]:
    for i in range(data.shape[0]):
        item = data.iloc[i]
        sentence = item["keyword"] + " " + item["text"] + " " + item["location"]
        sentence = sentence.strip().lower()
        contents.append(sentence)
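
The loop above builds one sentence per row. As a possible alternative, the same keyword + text + location concatenation can be written column-wise with pandas string methods. This is only a sketch on made-up rows, assuming missing values were already replaced by empty strings.

```python
import pandas as pd

# Made-up rows; empty strings stand in for filled missing values.
toy = pd.DataFrame({
    "keyword": ["Fire", ""],
    "text": ["Forest fire near the town", "What a lovely day"],
    "location": ["", "Tokyo"],
})

# Same concatenation as the loop, done on whole columns at once.
sentences = (
    (toy["keyword"] + " " + toy["text"] + " " + toy["location"])
    .str.strip()
    .str.lower()
    .tolist()
)
print(sentences)
```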

In [None]:
x_train = contents[:len(train)]
x_test = contents[len(train):]
y_train = train["target"]
print(len(x_train), len(x_test), y_train.shape)

## Train Validation Split

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=44)

In [None]:
print(len(x_train), len(y_train), len(x_val), len(y_val))

In [None]:
train["target"].value_counts()
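
If the class counts turn out to be uneven, one option is to stratify the split so that both parts keep the same class ratio. The sketch below uses made-up labels and hypothetical variable names (`x_tr`, `x_va`, ...); it is not the split used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data: 60 negatives and 40 positives (made up, not the competition data).
texts = [f"tweet {i}" for i in range(100)]
labels = [0] * 60 + [1] * 40

# stratify=labels asks for the same class ratio on both sides of the split.
x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=44, stratify=labels
)

# Share of positive labels in each part.
print(sum(y_tr) / len(y_tr), sum(y_va) / len(y_va))
```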

## Model Development

In [None]:
tf.keras.backend.clear_session()
keras_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2")
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
    tf.keras.layers.Input((), dtype=tf.string),
    keras_layer,
    tf.keras.layers.Reshape((1, -1)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32, return_sequences=False),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="swish"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation="swish"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
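
To see what `Reshape((1, -1))` does to the sentence embeddings, here is a NumPy-only sketch. It assumes the hub layer returns one 128-dimensional vector per tweet (as the `dim128` in the module name suggests); the random array is a stand-in, not real embeddings.

```python
import numpy as np

batch_size, embed_dim = 4, 128
# Stand-in for the hub layer output: one vector per tweet.
embeddings = np.random.rand(batch_size, embed_dim)

# Keras Reshape((1, -1)) keeps the batch axis and reshapes the rest,
# which corresponds to this NumPy reshape.
sequence_input = embeddings.reshape(batch_size, 1, -1)

print(sequence_input.shape)
```

Under that assumption, each tweet reaches the LSTM layers as a sequence with a single time step.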

In [None]:
batch_size = 64
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(len(x_train)).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size)

In [None]:
loss_object = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()
history = {
    "train_loss": [],
    "valid_loss": [],
    "train_accuracy": [],
    "valid_accuracy": []
}
num_epochs = 20
for epoch in range(num_epochs):
    begin_time = time.time()
    train_losses = []
    valid_losses = []
    correct_count = 0
    total_count = 0
    total_train_count = 0
    for (x_batch, y_true) in train_dataset:
        with tf.GradientTape() as tape:
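            y_pred = model(x_batch, training=True)  # keep dropout active while training
            # Flatten the (batch, 1) output so it compares element-wise with y_true
            predict_labels = tf.cast(tf.reshape(y_pred, [-1]) > 0.5, dtype=y_true.dtype)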
            loss_value = loss_object(y_true, y_pred)
        gradients = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
        train_losses.append(loss_value)
        correct_count += tf.reduce_sum(tf.cast(y_true == predict_labels, tf.int32))
        total_train_count += y_true.shape[0]
    train_loss = tf.reduce_mean(train_losses)
    train_accuracy = correct_count / total_train_count
    history["train_loss"].append(train_loss)
    history["train_accuracy"].append(train_accuracy)
    correct_count = 0
    total_count = 0
    total_valid_count = 0
    for (x_batch, y_true) in val_dataset:
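        y_pred = model(x_batch, training=False)  # dropout off for validation
        # Flatten the (batch, 1) output so it compares element-wise with y_true
        predict_labels = tf.cast(tf.reshape(y_pred, [-1]) > 0.5, dtype=y_true.dtype)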
        loss_value = loss_object(y_true, y_pred)
        valid_losses.append(loss_value)
        correct_count += tf.reduce_sum(tf.cast(y_true == predict_labels, tf.int32))
        total_valid_count += y_true.shape[0]
    valid_loss = tf.reduce_mean(valid_losses)
    valid_accuracy = correct_count / total_valid_count
    history["valid_loss"].append(valid_loss)
    history["valid_accuracy"].append(valid_accuracy)
    elapsed_time = time.time() -  begin_time
    print("Epoch: %d / %d"%(epoch + 1, num_epochs))
    print("%.2fs Loss: %.2f Accuracy: %.2f Validation Loss: %.2f Validation Accuracy: %.2f"%(elapsed_time, train_loss, train_accuracy, valid_loss, valid_accuracy))
for key in history:
    history[key] = list(np.array(history[key]))
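
In a custom loop like this, the model output has shape `(batch, 1)` while the labels have shape `(batch,)`, so it is worth checking how the two are compared when counting correct predictions. The NumPy sketch below uses toy values (not from this notebook) to show how an element-wise comparison broadcasts when the shapes differ, and how flattening the predictions avoids it. TensorFlow's `==` follows the same broadcasting rules.

```python
import numpy as np

y_true = np.array([1, 0, 1])               # labels, shape (3,)
y_prob = np.array([[0.9], [0.2], [0.4]])   # model-style output, shape (3, 1)

# Comparing (3,) with (3, 1) broadcasts to a (3, 3) grid of comparisons.
grid = y_true == (y_prob > 0.5).astype(int)
print(grid.shape)

# Flattening the predictions first gives one comparison per example.
labels = (y_prob.reshape(-1) > 0.5).astype(int)
matches = y_true == labels
print(matches.sum(), "of", len(y_true), "correct")
```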

## Model Evaluation

**Loss and Accuracy over time**

In [None]:
pd.DataFrame(history).plot(kind="line")


In [None]:
# Use the same 0.5 threshold as in the training loop and the submission step
y_pred = np.array(model.predict(x_val) > 0.5, dtype=int).reshape(-1)

**Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_val, y_pred)
# fmt="d" prints the counts as plain integers instead of scientific notation
sns.heatmap(cm, annot=True, fmt="d")
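
The four cells of the matrix are enough to derive precision, recall and F1 by hand. The counts below are hypothetical, chosen only to illustrate the arithmetic; they use the same layout as `confusion_matrix`, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Hypothetical counts: rows are true labels, columns are predicted labels.
toy_cm = np.array([[80, 20],
                   [10, 40]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = toy_cm.ravel()

precision = tp / (tp + fp)   # how many predicted disasters were real
recall = tp / (tp + fn)      # how many real disasters were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```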

**Accuracy**

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy Score", accuracy_score(y_val, y_pred))

**Classification Report**

In [None]:
from sklearn.metrics import classification_report
# Print the title on its own line so the report columns stay aligned
print("Classification Report", classification_report(y_val, y_pred), sep="\n")

## Submission

In [None]:
# np.int was removed from recent NumPy releases; the built-in int works the same here
y_test = np.array(model.predict(x_test) > 0.5, dtype=int).reshape(-1)

In [None]:
submission = pd.DataFrame({"id": test["id"], "target": y_test})

In [None]:
submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)

## Conclusion
The model now reaches about 78% accuracy on the validation set, and about 78% on the test set as reported on the Kaggle leaderboard. There is still a lot of room for improvement.
