In [1]:
# This project was developed by Samuel Hart (samuel.hart@gatech.edu) 
# as the final project for CS 4641, a course in Machine Learning 
# at the Georgia Institute of Technology.

# Final Project: Natural Language Processing with Disaster Tweets
The following project is designed to train and test models that will be used to predict whether a given tweet is about an occuring disaster or not. The project idea and data comes from a [Kaggle competition](https://www.kaggle.com/c/nlp-getting-started/overview). For example, a tweet containing a picture of a sunset and reading "The sky was ABLAZE" is obviously not about a disaster to a human reader. However, this may be more ambiguous to a machine.

This project is split into three models: a Naive Bayes Classifier for baseline, a Pre-trained BERT model for comparison, and a Convolutional Neural Network (CNN) that uses Google's word2vec embeddings. The CNN is based on a structure in [Convolution Neural Networks for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf) by Yoon Kim at NYU.

# Downloading the data
Here we download the data from Kaggle that I uploaded to my GitHub repository.

In [2]:
!git clone https://github.com/samueljhart0/d-tweet-classification.git

Cloning into 'd-tweet-classification'...
remote: Enumerating objects: 44, done.[K
remote: Counting objects: 100% (16/16), done.[K
remote: Compressing objects: 100% (14/14), done.[K
remote: Total 44 (delta 4), reused 0 (delta 0), pack-reused 28[K
Unpacking objects: 100% (44/44), done.


In [3]:
import pandas as pd

orig_train = pd.read_csv('/content/d-tweet-classification/data/train.csv', keep_default_na=False)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 25)

print(f"Here is the first 40 tweets in the dataset and their associated features:\n {orig_train[0:40]}")
print("Note that the keyword and location are mostly vacant.")

Here is the first 40 tweets in the dataset and their associated features:
     id keyword                  location                      text  target
0    1                                    Our Deeds are the Rea...       1
1    4                                    Forest fire near La R...       1
2    5                                    All residents asked t...       1
3    6                                    13,000 people receive...       1
4    7                                    Just got sent this ph...       1
5    8                                    #RockyFire Update => ...       1
6   10                                    #flood #disaster Heav...       1
7   13                                    I'm on top of the hil...       1
8   14                                    There's an emergency ...       1
9   15                                    I'm afraid that the t...       1
10  16                                    Three people died fro...       1
11  17                   

# Preprocessing Data for Naive Bayes Classifier
I do just a little data cleaning here getting rid of the keyword and location columns as we won't make use for any of our models. Note that while the model does not make use of the id column these are necessary to input to the model it is used for the submission file to Kaggle. We then do a 80/20 split of our training data into a training and validation set for our Naive Bayes classifier model.

In [4]:
orig_train = pd.read_csv('/content/d-tweet-classification/data/train.csv')

# Drop keyword and location columns
orig_train.drop('keyword', inplace=True, axis=1)
orig_train.drop('location', inplace=True, axis=1)
print(f"Here is the first 40 tweets again without the keyword and location columns:\n {orig_train[0:40]}")

# Randomly select 80% of training data for training set
nb_train = orig_train.sample(frac=0.7,random_state=200)

# Take what is leftover from previous selection (30% of training set)
leftover = orig_train.drop(nb_train.index)

# Take other 1/2 of remaining 30% of training data for dev set
nb_dev = leftover.sample(frac=0.5, random_state=200)

# Take other 15% of training data for test set
nb_test = leftover.drop(nb_dev.index)

print("\n\nMaking sure dimensions add up:")
print(f"Before split: {orig_train.shape}")
print(f"Split dimensions:")
print(f"nb_train shape: {nb_train.shape}")
print(f"nb_dev shape: {nb_dev.shape}")
print(f"nb_test shape: {nb_test.shape}")

# Send to new csv files
nb_train.to_csv('/content/nb_train.csv')
nb_dev.to_csv('/content/nb_dev.csv')
nb_test.to_csv('/content/nb_test.csv')

Here is the first 40 tweets again without the keyword and location columns:
     id                      text  target
0    1  Our Deeds are the Rea...       1
1    4  Forest fire near La R...       1
2    5  All residents asked t...       1
3    6  13,000 people receive...       1
4    7  Just got sent this ph...       1
5    8  #RockyFire Update => ...       1
6   10  #flood #disaster Heav...       1
7   13  I'm on top of the hil...       1
8   14  There's an emergency ...       1
9   15  I'm afraid that the t...       1
10  16  Three people died fro...       1
11  17  Haha South Tampa is g...       1
12  18  #raining #flooding #F...       1
13  19  #Flood in Bago Myanma...       1
14  20  Damage to school bus ...       1
15  23            What's up man?       0
16  24             I love fruits       0
17  25          Summer is lovely       0
18  26         My car is so fast       0
19  28  What a goooooooaaaaaa...       0
20  31    this is ridiculous....       0
21  32         London

# Tweet Tokenization
Now we use the bag-of-words model to build a sparse matrix where each row is a tweet and the entries in that row are counts of how many times a unique word appears in that tweet. To count these unique words we first split each tweet using the Tweet Tokenizer from the Natural Language ToolKit (NLTK) for Python. This tokenizer is especially adept for splitting tweets into sub-units like individual words, URLs, emoticons, etc. Once the tweet is split into 'tokens', what we call the sub-units, then we give each unique token a unique index in our vocabulary set. Now, with a tweet as a list of tokens, we convert each token into it's index in the vocab set. We count the number of times an index appears and input this into the corresponding row for the tweet in our sparse matrix.

In [5]:
import torch, nltk
import numpy as np

from scipy.sparse import csr_matrix
from nltk import TweetTokenizer
from collections import Counter

nltk.download('punkt')

np.random.seed(1)

class Vocab:
    def __init__(self):
        self.locked = False
        self.nextIndex = 0
        self.tokenToIndex = {}
        self.indexToToken = {}

    def GetIndices(self, tweet):
        tt = TweetTokenizer()
        tokens = tt.tokenize(tweet)
        indices = list()
        for token in tokens:
            i = self.GetIndex(token.lower())
            if i >= 0:
                indices.append(i)
        return indices

    def GetIndex(self, token):
        if self.tokenToIndex.get(token, None) == None:
            if self.locked:
                return -1
            else:
                self.tokenToIndex[token] = self.nextIndex
                self.indexToToken[self.tokenToIndex[token]] = token
                self.nextIndex += 1
        return self.tokenToIndex[token]
    
    def HasToken(self, index):
        return self.indexToToken.get(token, None) == None
    
    def HasIndex(self, token):
        return self.indexToToken.get(index, None) == None

    def GetToken(self, index):
        return self.indexToToken[int(index)]

    def GetVocabSize(self):
        return self.nextIndex

    def GetWords(self):
        return self.wordToToken.keys()

    def Lock(self):
        self.locked = True

class TweetData:
    def __init__(self, data, vocab=None):
        data = pd.read_csv(f"/content/{data}")

        if not vocab:
            self.vocab = Vocab()
        else:
            self.vocab = vocab

        X_values = []
        X_row_indices = []
        X_col_indices = []
        XindexList = []

        tweets = data["text"].to_numpy()

        for i in range(len(tweets)):
            tweet = tweets[i]
            indexList = self.vocab.GetIndices(tweet)
            XindexList.append(indexList)
            indexCounts = Counter(indexList)
            for (index, count) in indexCounts.items():
                if index >= 0:
                    X_row_indices.append(i)
                    X_col_indices.append(index)
                    X_values.append(count)
        
        self.vocab.Lock()

        self.X = csr_matrix((X_values, (X_row_indices, X_col_indices)), shape=(max(X_row_indices) + 1, self.vocab.GetVocabSize()))
        self.XindexList = XindexList
        self.XidList = data["id"].to_numpy()
        self.Y = data["target"].to_numpy()

        index = np.arange(self.X.shape[0])
        np.random.shuffle(index)
        self.X = self.X[index,:]
        self.XindexList = [torch.LongTensor(XindexList[i]) for i in index]
        self.XidList[index]
        self.Y = self.Y[index]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
train = TweetData('nb_train.csv')
train.vocab.Lock()
test = TweetData('nb_test.csv', vocab=train.vocab)

# Data Exploration

This section simply gives an idea of the format of our data.

In [7]:
print(f"train.X has {train.X.shape[0]} rows and {train.X.shape[1]} columns.")
print(f"train.Y has {train.Y.shape[0]} rows.")

train.X has 5329 rows and 18120 columns.
train.Y has 5329 rows.


In [8]:
# Let's count the frequency of every word appearing in the true disaster tweets:
word_counts = np.array(train.X[train.Y == 1,:].sum(axis=0)).flatten()
word_counts

array([0, 0, 4, ..., 0, 0, 0], dtype=int64)

In [9]:
# Now, let's sort the words by frequency:
sorted_words = list(reversed(np.argsort(word_counts)))
sorted_words[-1]

0

In [10]:
# What is the index of the most frequent word?
sorted_words[-1]

0

In [11]:
# Let's see what word that is:
train.vocab.GetToken(sorted_words[-1])

'@danryckert'

In [12]:
# What are the 10 most frequent words?
[train.vocab.GetToken(sorted_words[x]) for x in range(10)]

['.', 'the', ':', 'in', 'a', 'of', 'to', '...', '?', '-']

# Naive Bayes Classifier

This is a rather simple Naive Bayes Classifier making use of Laplace smoothing on the parameters and we take the log of our parameters so that

$$P(Y)\prod_{i=1}^{|X|}P(x_i|Y)^{count(x_i)}$$

becomes

$$\log[P(Y)] + count(x_i)\sum_{i=1}^{|X|}\log[P(x_i|Y)]$$

Where $Y$ is the target or label of a tweet, $X$ is a tweet, $x_i$ is a token in the tweet, and $count(x_i)$ is the number of times the word appears in the tweet $X$. Note that the second term is simply a dot product of the row corresponding to the tweet and a vector containing the log-probabilities of seeing a word in the vocab set given each label. Our prediction then is the larger of the two probabilities.

In [13]:
import math
class Eval:
    def __init__(self, pred, gold):
        self.pred = pred
        self.gold = gold
        tp = (self.gold * self.pred).sum()
        tn = ((1 - self.gold) * (1 - self.pred)).sum()
        fp = ((1 - self.gold) * self.pred).sum()
        fn = (self.gold * (1 - self.pred)).sum()

        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
    
        f1 = 2* (precision*recall) / (precision + recall)
        acc = np.sum(np.equal(self.pred, self.gold)) / float(len(self.gold))

        print(f"\nF1 Accuracy: {f1}")
        print(f"\nAccuracy: {acc}")

class NaiveBayes:
    def __init__(self, X, Y, ALPHA=1.0):
        self.alpha = ALPHA
        self.prior_pos = np.log(float((np.count_nonzero(Y == 1) + self.alpha) / (Y.shape[0] + (len(set(Y)) * self.alpha))))
        self.prior_neg = np.log(1 - self.prior_pos)
        
        self.likeli_pos = np.log((np.array(X[Y == 1,:].sum(axis=0)).flatten() + self.alpha) / (np.array(X[Y == 1,:].sum(axis=0)).flatten().sum() + (X.shape[1] * self.alpha)))
        self.likeli_neg = np.log((np.array(X[Y == 0,:].sum(axis=0)).flatten() + self.alpha) / (np.array(X[Y == 0,:].sum(axis=0)).flatten().sum() + (X.shape[1] * self.alpha)))

    def Predict(self, X, Y):
        prob_pos = self.prior_pos + X.dot(self.likeli_pos)
        prob_neg = self.prior_neg + X.dot(self.likeli_neg)
        
        Y_pred = prob_pos - prob_neg
        Y_pred[Y_pred >= 0] = 1
        Y_pred[Y_pred < 0] = 0
        Eval(Y_pred, Y)

In [14]:
alpha = 0.6
nb = NaiveBayes(train.X, train.Y, alpha)

In [15]:
nb.Predict(test.X, test.Y)


F1 Accuracy: 0.7367231638418079

Accuracy: 0.7959719789842382


# Preparing for the Pre-Trained BERT Model

We first install the BERT related library.

In [16]:
! pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
[K     |████████████████████████████████| 3.4 MB 4.0 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 63.8 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 70.9 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
[K     |████████████████████████████████| 61 kB 591 kB/s 
[?25hCollecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 54.9 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transforme

In [17]:
# Randomly select 70% of training data for training set
bert_train = orig_train.sample(frac=0.7,random_state=200)

# Take what is leftover from previous selection (30% of training set)
leftover = orig_train.drop(bert_train.index)

# Take other 1/2 of remaining 30% of training data for dev set
bert_dev = leftover.sample(frac=0.5, random_state=200)

# Take other 15% of training data for test set
bert_test = leftover.drop(bert_dev.index)

print("\n\nMaking sure dimensions add up:")
print(f"Before split: {orig_train.shape}")
print(f"Split dimensions:")
print(f"bert_train shape: {bert_train.shape}")
print(f"bert_dev shape: {bert_dev.shape}")
print(f"bert_test shape: {bert_test.shape}")

# Send to new csv files
bert_train.to_csv('/content/bert_train.csv')
bert_dev.to_csv('/content/bert_dev.csv')
bert_test.to_csv('/content/bert_test.csv')



Making sure dimensions add up:
Before split: (7613, 3)
Split dimensions:
bert_train shape: (5329, 3)
bert_dev shape: (1142, 3)
bert_test shape: (1142, 3)


In [18]:
# import tqdm
import tqdm.notebook as tq

def load_data(data):
    data = pd.read_csv(f"/content/{data}")
    id_list = data['id'].tolist()
    input_list = data['text'].tolist()
    target_list = data['target'].tolist()

    return id_list, input_list, target_list

print("Load training data.")
train_id, train_input, train_target =  load_data('bert_train.csv')
print(f"Training data has {len(train_input)} examples.\n")

print("Load dev data.")
dev_id, dev_input, dev_target =  load_data('bert_dev.csv')
print(f"Dev data has {len(dev_input)} examples.\n")

print("Load test data.")
test_id, test_input, test_target =  load_data('bert_test.csv')
print(f"Test data has {len(test_input)} examples.\n")

Load training data.
Training data has 5329 examples.

Load dev data.
Dev data has 1142 examples.

Load test data.
Test data has 1142 examples.



In [19]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW
from keras.preprocessing.sequence import pad_sequences

model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

max_len = 128
batch_size = 8

def preproces(ids, input, targets, tokenizer, max_len, batch_size, data_class="train"):

    encoded_input = tokenizer(input, padding='max_length', max_length = max_len, truncation=True, return_tensors="pt")
    
    input_ids = torch.tensor(ids)
    input_indices = encoded_input['input_ids']
    attention_mask = encoded_input['attention_mask']
    targets = torch.tensor(targets)

    print(input_ids.size(), input_indices.size(), attention_mask.size(), targets.size())

    dataset_tensor = TensorDataset(input_ids.cuda(), input_indices.cuda(), attention_mask.cuda(), targets.cuda())

    if data_class == "train":
        sampler = RandomSampler(dataset_tensor)
    else:
        sampler = SequentialSampler(dataset_tensor)
    dataloader = DataLoader(dataset_tensor, sampler=sampler, batch_size=batch_size)

    return dataloader

train_dataloader = preproces(train_id, train_input, train_target, tokenizer, max_len, batch_size, data_class="train")
dev_dataloader = preproces(dev_id, dev_input, dev_target, tokenizer, max_len, batch_size, data_class="dev")
test_dataloader = preproces(test_id, test_input, test_target, tokenizer, max_len, batch_size, data_class="test")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

torch.Size([5329]) torch.Size([5329, 128]) torch.Size([5329, 128]) torch.Size([5329])
torch.Size([1142]) torch.Size([1142, 128]) torch.Size([1142, 128]) torch.Size([1142])
torch.Size([1142]) torch.Size([1142, 128]) torch.Size([1142, 128]) torch.Size([1142])


In [20]:
def Eval(bert_model, dataloader):
    model.eval()

    torch.cuda.empty_cache()

    tp = 0
    tn = 0
    fp = 0
    fn = 0
    num_correct = 0
    num_examples = 0

    for step, batch in enumerate(tq.tqdm(dataloader)):
        batch_id = batch[0]
        batch_input = batch[1]
        batch_atten = batch[2]
        batch_label = batch[3]

        bert_output = bert_model.forward(input_ids=batch_input, attention_mask=batch_atten, labels=batch_label)
        pred_label = torch.argmax(bert_output[1], dim=1)

        batch_tp = (batch_label * pred_label).sum().to(torch.float32)
        batch_tn = ((1 - batch_label) * (1 - pred_label)).sum().to(torch.float32)
        batch_fp = ((1 - batch_label) * pred_label).sum().to(torch.float32)
        batch_fn = (batch_label * (1 - pred_label)).sum().to(torch.float32)

        tp += batch_tp
        tn += batch_tn
        fp += batch_fp
        fn += batch_fn

        num_correct += batch_tp + batch_tn
        num_examples += len(batch_label)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    
    f1 = 2* (precision*recall) / (precision + recall)
        
    print(f"\nF1 Accuracy: {f1.item()}")
    print(f"\nAccuracy: {(float(num_correct) / float(num_examples))}")
    


def Train(bert_model, train_data, lr, n_epoch, dev_data):
    print("Start Training!")
    optimizer = AdamW(bert_model.parameters(), lr=lr)
    torch.cuda.empty_cache()

    for epoch in range(n_epoch):

        print(f"\nEpoch {epoch}")
      
        bert_model.train()
        tr_loss = 0
        nb_tr_examples, nb_tr_steps = 0, 0

        for step, batch in enumerate(tq.tqdm(train_data)):

            batch_id = batch[0]
            batch_input = batch[1]
            batch_atten = batch[2]
            batch_label = batch[3]

            bert_model.zero_grad()

            bert_output = bert_model.forward(input_ids=batch_input, attention_mask=batch_atten, labels=batch_label)
            batch_loss = bert_output[0]
            tr_loss += float(batch_loss)
            nb_tr_steps += 1

            batch_loss.backward()
            optimizer.step()

        print("Train loss on epoch {}: {}\n".format(epoch, tr_loss / nb_tr_steps))

        print("Evaluate on the dev set:")
        Eval(bert_model, dev_data)
        


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

learning_rate = 2e-5
num_epoch = 3

model = AutoModelForSequenceClassification.from_pretrained(model_name)
if n_gpu > 1:
    model.to(device)
    model = torch.nn.DataParallel(model)
else:
    model.cuda()
Train(model, train_dataloader, learning_rate, num_epoch, dev_dataloader)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Start Training!

Epoch 0


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 0: 0.45738713298065553

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.8116592168807983

Accuracy: 0.8528896672504378

Epoch 1


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 1: 0.3131595963816057

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.8207847476005554

Accuracy: 0.852014010507881

Epoch 2


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 2: 0.19061835372200545

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.8041666746139526

Accuracy: 0.8353765323992994


In [21]:
print("Evaluate on the test set:")
Eval(model, test_dataloader)

Evaluate on the test set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.7897436022758484

Accuracy: 0.8204903677758318


# Download Google's Word2Vec Embeddings

In [22]:
# Randomly select 70% of training data for training set
cnn_train = orig_train.sample(frac=0.7,random_state=200)

# Take what is leftover from previous selection (30% of training set)
leftover = orig_train.drop(cnn_train.index)

# Take 1/2 of remaining 30% of training data for dev set
cnn_dev = leftover.sample(frac=0.5, random_state=200)

# Take other 15% of training data for test set
cnn_test = leftover.drop(cnn_dev.index)

print("\n\nMaking sure dimensions add up:")
print(f"Before split: {orig_train.shape}")
print(f"Split dimensions:")
print(f"cnn_train shape: {cnn_train.shape}")
print(f"cnn_dev shape: {cnn_dev.shape}")
print(f"cnn_test shape: {cnn_test.shape}")

# Send to new csv files
cnn_train.to_csv('/content/cnn_train.csv')
cnn_dev.to_csv('/content/cnn_dev.csv')
cnn_test.to_csv('/content/cnn_test.csv')



Making sure dimensions add up:
Before split: (7613, 3)
Split dimensions:
cnn_train shape: (5329, 3)
cnn_dev shape: (1142, 3)
cnn_test shape: (1142, 3)


# Tokenizing Data

In [23]:
!wget -P /content/d-tweet-classification/tweet2vec_model/ https://github.com/eyaler/word2vec-slim/raw/master/GoogleNews-vectors-negative300-SLIM.bin.gz
!wget -P /content/d-tweet-classification/tweet2vec_model/ https://github.com/uclnlp/emoji2vec/raw/master/pre-trained/emoji2vec.bin

--2022-01-10 22:44:55--  https://github.com/eyaler/word2vec-slim/raw/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving github.com (github.com)... 13.114.40.48
Connecting to github.com (github.com)|13.114.40.48|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz [following]
--2022-01-10 22:44:55--  https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 276467217 (264M) [application/octet-stream]
Saving to: ‘/content/d-tweet-classification/tweet2vec_model/GoogleNews-vectors-negative300-SLIM.bin.gz’




In [24]:
!gunzip /content/d-tweet-classification/tweet2vec_model/GoogleNews-vectors-negative300-SLIM.bin.gz

In [25]:
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('/content/d-tweet-classification/tweet2vec_model/GoogleNews-vectors-negative300-SLIM.bin', binary=True)
e2v = KeyedVectors.load_word2vec_format('/content/d-tweet-classification/tweet2vec_model/emoji2vec.bin', binary=True)

for key in e2v.vocab.keys():
    w2v.add(key, e2v[key])
w2v.add('<url>', torch.randn(300))
w2v.add('<tag>', torch.randn(300))
w2v.add('<trend>', torch.randn(300))
w2v.add('<unk>', torch.randn(300))

In [26]:
from gensim.models import KeyedVectors
import torch, nltk
import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from scipy.sparse import csr_matrix
from nltk import TweetTokenizer
from collections import Counter

nltk.download('punkt')

np.random.seed(1)

max_len = 128
batch_size = 8

def loadData(data, max_len, batch_size, data_class="train"):
    input_ids = torch.tensor(data['id'].tolist()).cuda()
    input_tweets = data['text'].tolist()
    input_targets = torch.tensor(data['target'].tolist()).cuda()

    tt = TweetTokenizer()
    tweetTokens = [tt.tokenize(tweet) for tweet in input_tweets]

    invocab = 0
    outvocab = 0
    tokenized_tweets = []
    for tokens in tweetTokens:
        indices = []
        for token in tokens:
            punc = '''()-[]{}'"\<>/$%^&*_~'''
            if token in punc:
                continue
            try:
                index = w2v.vocab[token].index
                invocab += 1
            except:
                if token.startswith('http'):
                    index = w2v.vocab['<url>'].index
                    invocab += 1
                elif token.startswith('@'):
                    index = w2v.vocab['<tag>'].index
                    invocab += 1
                elif token.startswith('#'):
                    try:
                        index = w2v.vocab[token[1:]].index
                        invocab += 1
                    except:
                        index = w2v.vocab['<trend>'].index
                        invocab += 1
                else:
                    index = index = w2v.vocab['<unk>'].index
                    outvocab += 1
            indices.append(index)
        tokenized_tweets.append(indices)
    print(invocab, outvocab)

    padded_tweets = np.zeros((len(tokenized_tweets), max_len), dtype=int)
    for i, row in enumerate(tokenized_tweets):
        padded_tweets[i, -len(row):] = np.array(row)[:max_len]

    padded_tweets = torch.tensor(padded_tweets).cuda()

    dataset_tensor = TensorDataset(input_ids, padded_tweets, input_targets)

    if data_class == "train":
        sampler = RandomSampler(dataset_tensor)
    else:
        sampler = SequentialSampler(dataset_tensor)
    dataloader = DataLoader(dataset_tensor, sampler=sampler, batch_size=batch_size)

    return dataloader

train_dataloader = loadData(cnn_train, max_len, batch_size, data_class="train")
dev_dataloader = loadData(cnn_dev, max_len, batch_size, data_class="dev")
test_dataloader = loadData(cnn_test, max_len, batch_size, data_class="test")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
69296 17369
14987 3833
14976 3647


In [27]:
import tqdm
import torch
import gc
import torch.nn as nn
from torch import optim
import random
import numpy as np
import tqdm.notebook as tq

class CNN(nn.Module):
    def __init__(self, embed_model, vocab_size, output_size, embedding_dim,
                 num_filters=100, kernel_sizes=[3, 4, 5], drop_prob=0.5):
        super(CNN, self).__init__()
        self.num_filters = num_filters
        self.embedding_dim = embedding_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight = nn.Parameter(torch.from_numpy(embed_model.vectors))
        # self.embedding.requires_grad = False

        self.convs_1d = nn.ModuleList([
            nn.Conv2d(1, num_filters, (k, embedding_dim), padding=(k-2,0)) 
            for k in kernel_sizes])
        
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, output_size) 
        self.dropout = nn.Dropout(drop_prob)
        self.logSoftmax = nn.LogSoftmax()
        
    
    def conv_and_pool(self, x, conv):
        x = nn.functional.relu(conv(x)).squeeze(3)
        x_max = nn.functional.max_pool1d(x, x.size(2)).squeeze(2)
        return x_max

    def forward(self, x):
        
        embeds = self.embedding(x)
        embeds = embeds.unsqueeze(1)
        conv_results = [self.conv_and_pool(embeds, conv) for conv in self.convs_1d]
        
        x = torch.cat(conv_results, 1)
        x = self.dropout(x)

        logit = self.fc(x) 

        return self.logSoftmax(logit)

def Eval(net, dataloader):
    net.eval()

    torch.cuda.empty_cache()

    tp = 0
    tn = 0
    fp = 0
    fn = 0
    num_correct = 0
    num_examples = 0

    for step, batch in enumerate(tq.tqdm(dataloader)):
        batch_id = batch[0]
        batch_input = batch[1]
        batch_label = batch[2]

        output = net.forward(batch_input).squeeze()

        pred_label = torch.argmax(output, dim=1)

        batch_tp = (batch_label * pred_label).sum().to(torch.float32)
        batch_tn = ((1 - batch_label) * (1 - pred_label)).sum().to(torch.float32)
        batch_fp = ((1 - batch_label) * pred_label).sum().to(torch.float32)
        batch_fn = (batch_label * (1 - pred_label)).sum().to(torch.float32)

        tp += batch_tp
        tn += batch_tn
        fp += batch_fp
        fn += batch_fn

        num_correct += batch_tp + batch_tn
        num_examples += len(batch_label)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    
    f1 = 2* (precision*recall) / (precision + recall)
        
    print(f"\nF1 Accuracy: {f1.item()}")
    print(f"\nAccuracy: {(float(num_correct) / float(num_examples))}")
    


def Train(net, train_data, lr, n_epoch, dev_data):
    print("Start Training!")
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    torch.cuda.empty_cache()

    for epoch in range(n_epoch):
        torch.cuda.empty_cache()
        print(f"\nEpoch {epoch}")
      
        net.train()
        tr_loss = 0
        nb_tr_examples, nb_tr_steps = 0, 0

        for step, batch in enumerate(tq.tqdm(train_data)):
            batch_id = batch[0]
            batch_input = batch[1]
            batch_label = batch[2]

            net.zero_grad()

            output = net.forward(batch_input)
            batch_label = torch.nn.functional.one_hot(batch_label, num_classes=2)
            batch_loss = torch.tensordot(output, torch.neg(batch_label.float()))
            tr_loss += batch_loss
            nb_tr_steps += 1

            batch_loss.backward()
            optimizer.step()

        print("Train loss on epoch {}: {}\n".format(epoch, tr_loss / nb_tr_steps))

        print("Evaluate on the dev set:")
        Eval(net, dev_data)

learning_rate = 1e-4
num_epoch = 7
torch.cuda.empty_cache()
cnn = CNN(w2v, len(w2v.vocab), 2, 300).cuda()
Train(cnn, train_dataloader, learning_rate, num_epoch, dev_dataloader)

Start Training!

Epoch 0


  0%|          | 0/667 [00:00<?, ?it/s]



Train loss on epoch 0: 5.465912818908691

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.5687500238418579

Accuracy: 0.637478108581436

Epoch 1


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 1: 5.223815441131592

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.6090373396873474

Accuracy: 0.6514886164623468

Epoch 2


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 2: 4.9882493019104

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.6107382774353027

Accuracy: 0.6952714535901926

Epoch 3


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 3: 4.473320484161377

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.7144508361816406

Accuracy: 0.7837127845884413

Epoch 4


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 4: 3.642183303833008

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.7706611752510071

Accuracy: 0.8056042031523643

Epoch 5


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 5: 2.960808515548706

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.7721660137176514

Accuracy: 0.8222416812609457

Epoch 6


  0%|          | 0/667 [00:00<?, ?it/s]

Train loss on epoch 6: 2.5085809230804443

Evaluate on the dev set:


  0%|          | 0/143 [00:00<?, ?it/s]


F1 Accuracy: 0.7779005765914917

Accuracy: 0.8239929947460596


In [28]:
print("Evaluate on the test set:")
Eval(cnn, test_dataloader)

Evaluate on the test set:


  0%|          | 0/143 [00:00<?, ?it/s]




F1 Accuracy: 0.7516629695892334

Accuracy: 0.8038528896672504
