NLP using BERT for Disaster Tweets

In [None]:
import pandas as pd
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
import os
from google.colab import drive

drive.mount('/content/drive')

os.chdir('/content/drive/My Drive/Colab Notebooks/DeepLearning/nlp')
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Replace missing values with empty strings. Assigning the result back avoids
# the chained-assignment warning raised by column-level fillna(inplace=True).
for df in (train_df, test_df):
    for col in ('keyword', 'location', 'text'):
        df[col] = df[col].fillna('')

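# Keep only the columns the model needs; .copy() makes these independent frames
# so that adding a 'target' column to test_df later does not warn.
train_df = train_df[['text', 'target']].copy()
test_df = test_df[['id', 'text']].copy()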

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_data(texts, labels=None):
    """Tokenize a Series of texts and wrap the encodings in a tf.data.Dataset."""
    # padding=True pads to the longest sequence in this call; truncation caps it at 128 tokens.
    encodings = tokenizer(texts.tolist(), truncation=True, padding=True, max_length=128)
    if labels is not None:
        return tf.data.Dataset.from_tensor_slices((dict(encodings), labels))
    return tf.data.Dataset.from_tensor_slices(dict(encodings))

train_dataset = tokenize_data(train_df['text'], train_df['target'])
test_dataset = tokenize_data(test_df['text'])

train_dataset = train_dataset.shuffle(len(train_df)).batch(32)
test_dataset = test_dataset.batch(32)

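# The classification head defaults to two labels, which matches the binary target.
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    # The model returns raw logits, so the loss applies softmax internally.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)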

history = model.fit(train_dataset, epochs=3)

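# Save the fine-tuned weights, then reload them to confirm the checkpoint round-trips.
model.save_pretrained('./saved_model')
model = TFBertForSequenceClassification.from_pretrained('./saved_model')

# predict() returns logits of shape (num_examples, 2); argmax picks the predicted class.
predictions = model.predict(test_dataset)
predicted_labels = tf.argmax(predictions.logits, axis=-1)
test_df['target'] = predicted_labels.numpy()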
print(test_df.head())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3


Cause: for/else statement not yet supported


Epoch 2/3
Epoch 3/3


Some layers from the model checkpoint at ./saved_model were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./saved_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


   id                                               text  target
0   0                 Just happened a terrible car crash       1
1   2  Heard about #earthquake is different cities, s...       1
2   3  there is a forest fire at spot pond, geese are...       1
3   9           Apocalypse lighting. #Spokane #wildfires       1
4  11      Typhoon Soudelor kills 28 in China and Taiwan       1
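
The `target` column above comes from taking the argmax of the model's logits. As a minimal sketch of that step, the snippet below uses NumPy on made-up logits (the values are hypothetical, not output from this notebook) and also shows the softmax probabilities, which rank the classes in the same order as the logits.

```python
import numpy as np

# Hypothetical logits for three tweets; columns are class 0 and class 1.
logits = np.array([
    [-1.2, 2.3],
    [0.8, -0.5],
    [0.1, 0.4],
])

# Numerically stable softmax: subtract the row maximum before exponentiating.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# argmax over the last axis picks the higher-scoring class for each row.
labels = logits.argmax(axis=-1)
print(labels)
```

Because softmax is monotonic, taking the argmax of the logits directly gives the same labels as taking it over the probabilities, which is why the notebook can skip the softmax at prediction time.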


In [None]:
sample_submission = pd.read_csv('/content/drive/My Drive/Colab Notebooks/DeepLearning/nlp/sample_submission.csv')
submission_df = sample_submission[['id']].merge(test_df[['id', 'target']], on='id')
submission_df.to_csv('submission.csv', index=False)
print(submission_df.head())

   id  target
0   0       1
1   2       1
2   3       1
3   9       1
4  11       1
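
`merge` defaults to an inner join, so any `id` in the sample file without a matching prediction would be dropped silently. One way to check for that, sketched here on toy frames with hypothetical values standing in for `sample_submission` and `test_df`, is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy stand-ins for sample_submission and test_df (hypothetical values).
sample = pd.DataFrame({'id': [0, 2, 3, 9]})
preds = pd.DataFrame({'id': [0, 2, 3], 'target': [1, 0, 1]})

# A left merge keeps every id from the sample file. validate='one_to_one'
# raises if an id is duplicated on either side, and indicator=True adds a
# '_merge' column showing which rows found a match.
checked = sample.merge(preds, on='id', how='left',
                       validate='one_to_one', indicator=True)
missing_ids = checked.loc[checked['_merge'] == 'left_only', 'id'].tolist()
print(missing_ids)
```

An empty `missing_ids` list would indicate that every row of the sample file received a prediction.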


In [None]:
print(len(submission_df))

3263
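
The row count confirms the size of the submission, but not what it contains. Since the first five rows shown earlier are all labelled 1, a complementary check is the share of each predicted class. The sketch below uses a hypothetical Series; in the notebook the input would be `submission_df['target']`.

```python
import pandas as pd

# Hypothetical predictions standing in for submission_df['target'].
targets = pd.Series([1, 1, 0, 1, 0, 1], name='target')

# normalize=True returns the share of each class rather than raw counts.
shares = targets.value_counts(normalize=True).sort_index()
print(shares)
```

A share close to 1.0 for a single class would suggest the model has collapsed to one label and is worth investigating before submitting.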
