# Using DistilBert for Sentiment Analysis on the Sentiment140 Twitter Dataset

Import statements, plus a random seed for reproducibility and some plot settings for seaborn.

In [1]:
#general
import numpy as np
import pandas as pd
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm

#DistilBert + Tokenizer
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

#train/test/dev split and metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix



In [2]:
# Random seed + styling

SEED=0

sns.set_style("whitegrid")
sns.despine()


plt.style.use("classic")
plt.rc("figure", autolayout=True)
plt.rc("axes", labelweight="bold", labelsize="large", titleweight="bold", titlepad=10)

#tqdm progress bar for pandas methods
tqdm.pandas()

<Figure size 640x480 with 0 Axes>
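Note that `SEED` above is only a constant: later in this notebook it is passed to `train_test_split`, but no other random number generator is seeded with it. A minimal sketch of seeding Python's and NumPy's generators as well, assuming that is the intent; `seed_everything` is a hypothetical helper name, and TensorFlow has its own `tf.random.set_seed`, mentioned only in a comment so the block has no TensorFlow dependency.

```python
import random

import numpy as np

SEED = 0

def seed_everything(seed):
    """Seed the Python and NumPy global generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    # With TensorFlow imported as tf, tf.random.set_seed(seed) would seed its ops too.

# Seeding twice with the same value should reproduce the same draws.
seed_everything(SEED)
first = (random.random(), float(np.random.rand()))
seed_everything(SEED)
second = (random.random(), float(np.random.rand()))

print(first == second)  # True
```

Even with every generator seeded, GPU training is not guaranteed to be bit-for-bit repeatable, so this narrows run-to-run variation rather than removing it.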

In [3]:
#optional
pd.set_option('max_colwidth', 800)

In [4]:
#check if using GPU
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


### Final Preprocessing

Because we are training this model ourselves (rather than using an out-of-the-box solution as in sent_flair.ipynb, sent_nltk.ipynb, etc.), preprocessing needs a few extra steps:

Loading the dataset, splitting it into train, val and test sets, and tokenizing with the DistilBert tokenizer.

In [5]:
df = pd.read_csv('data_clean.csv', sep='\t', usecols=['sent', 'text', 'data_len', 'token_lens'])

Splitting the dataframe into train, val and test sets. Train is 0.7 of the data; val and test are 0.15 each.

In [6]:
df_train, temp = train_test_split(df, test_size=0.3, random_state=SEED)
df_val, df_test = train_test_split(temp, test_size=0.5, random_state=SEED )

Confirming that the data was split properly by checking the shapes of the new datasets.

In [7]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

(1078253, 4)
(231054, 4)
(231055, 4)
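The shapes above are what two chained splits should give: 30% is held out first, then that held-out part is halved. A minimal sketch of the arithmetic, assuming the held-out portion is rounded up and the remainder goes to the other side (an assumption about the rounding rule, though it does reproduce the printed shapes); `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 1078253 + 231054 + 231055          # rows across the three printed shapes

n_train, n_temp = split_sizes(n_total, 0.3)  # first split: 70% train, 30% temp
n_val, n_test = split_sizes(n_temp, 0.5)     # second split: temp halved into val/test

print(n_train, n_val, n_test)                # 1078253 231054 231055
print([round(n / n_total, 2) for n in (n_train, n_val, n_test)])  # [0.7, 0.15, 0.15]
```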


Creating y_train, y_val and y_test from the ['sent'] column of their respective dataframes

In [8]:
y_train = df_train['sent']
y_val = df_val['sent']
y_test = df_test['sent']

In [9]:
print(type(y_test))

<class 'pandas.core.series.Series'>


### Model training

With the earlier preprocessing steps done, the code below:
1) Initializes the DistilBert tokenizer and defines the tokenize() function
2) Tokenizes the train, val, and test data in turn
3) Trains the DistilBert model on the train data, validating on the val data

In [10]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")



In [11]:
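MAX_LEN = 128

def tokenize(data, max_len=MAX_LEN):
    """Encode each text into fixed-length input ids and an attention mask."""
    input_ids = []
    attention_masks = []

    for i in tqdm(range(len(data))):
        encoded = tokenizer.encode_plus(
            data[i],
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,  # cap longer texts at max_len so every row has the same length
            return_attention_mask=True
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])

    # Equal-length rows stack into 2-D arrays of shape (n_texts, max_len)
    return np.array(input_ids), np.array(attention_masks)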
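To make the two returned arrays concrete: the input ids are padded out to a fixed length, and the attention mask marks which positions hold real tokens (1) and which hold padding (0). A pure-Python sketch of that idea, not the tokenizer's actual implementation. `pad_and_mask` is a hypothetical helper, the token ids are made up, the pad id of 0 is an assumption, and the truncation here is a plain slice that ignores special-token handling.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Cut token_ids to max_len, right-pad with pad_id, and build the mask."""
    ids = list(token_ids[:max_len])      # simplified truncation (plain slice)
    mask = [1] * len(ids)                # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

# Made-up ids for a short text; 101 and 102 stand in for the special tokens.
ids, mask = pad_and_mask([101, 2204, 2851, 102], max_len=8)
print(ids)   # [101, 2204, 2851, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]

# A text longer than max_len is cut down so every row has the same length.
long_ids, long_mask = pad_and_mask(list(range(1, 11)), max_len=8)
print(len(long_ids), sum(long_mask))  # 8 8
```

Rows of equal length are what allow the lists to be stacked into a rectangular NumPy array, which is why both padding and a cap on length matter.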

Creating arrays containing input_ids and attention masks as returned by the tokenizer.

In [12]:
train_input_ids, train_attention_masks = tokenize(df_train.text.values)
val_input_ids, val_attention_masks = tokenize(df_val.text.values)
test_input_ids, test_attention_masks = tokenize(df_test.text.values)

 23%|██▎       | 245859/1078253 [01:07<03:47, 3654.22it/s]


KeyboardInterrupt: 

Model creation: setting the optimizer, loss function, and accuracy metric. The model takes two inputs, the input_ids and the corresponding attention mask. These are fed into the BERT-like model, which in this case is DistilBert. The hidden state of the [CLS] token is taken as a representation of the sentence's sentiment and passed on to the classification layers.
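The idea of using the first token's hidden state can be sketched in NumPy. This is an illustration with random stand-in values, not the library's code: the encoder output and all weights are made up, the batch and sequence sizes are arbitrary, and dropout is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, seq_len, hidden, n_classes = 2, 5, 768, 2

# Stand-in for the encoder output: one hidden vector per token position.
hidden_states = rng.normal(size=(batch, seq_len, hidden))

# Randomly initialised head weights (hypothetical values).
w_pre, b_pre = rng.normal(scale=0.02, size=(hidden, hidden)), np.zeros(hidden)
w_cls, b_cls = rng.normal(scale=0.02, size=(hidden, n_classes)), np.zeros(n_classes)

cls_vector = hidden_states[:, 0]                      # first token of each sequence
pooled = np.maximum(cls_vector @ w_pre + b_pre, 0.0)  # dense layer followed by ReLU
logits = pooled @ w_cls + b_cls                       # one raw score per class

print(cls_vector.shape, logits.shape)  # (2, 768) (2, 2)
```

Whatever the sequence length, each text is reduced to a single 768-dimensional vector before classification, so the head's size does not depend on `MAX_LEN`.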

In [None]:
LEARNING_RATE = 3e-5 #1e-4, 3e-4, 5e-5, 3e-5
N_EPOCHS = 3
BATCH_SIZE = 64 #8, 16, 32, 64, 128


model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 
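The loss configured above, `SparseCategoricalCrossentropy(from_logits=True)`, takes integer class labels and the model's raw, un-normalised scores. A minimal pure-Python sketch of the per-example quantity it is based on, assuming labels are class indices (0 or 1 for the two output units here); `sparse_ce_from_logits` is a hypothetical helper and the logits are made up.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example from raw logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]  # equals -log(softmax(logits)[label])

uncertain = sparse_ce_from_logits([0.0, 0.0], 1)   # equal scores: loss is ln(2)
confident = sparse_ce_from_logits([-2.0, 3.0], 1)  # correct and confident: small loss
wrong = sparse_ce_from_logits([-2.0, 3.0], 0)      # confident but wrong: large loss

print(round(uncertain, 4))  # 0.6931
print(confident < uncertain < wrong)  # True
```

Because the label is used as an index into the logits, labels outside the range of output units would not be valid for this loss.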

After creating the model above, we display the model's summary. Everything appears to be connected correctly.

In [None]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________
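The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A short sketch, assuming a hidden size of 768 and two output classes (both consistent with the counts shown); `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

hidden_size = 768       # assumed width of the encoder's hidden states
n_classes = 2           # assumed number of output units

encoder = 66_362_880    # copied from the summary above
pre_classifier = dense_params(hidden_size, hidden_size)
classifier = dense_params(hidden_size, n_classes)
total = encoder + pre_classifier + classifier

print(pre_classifier, classifier, total)  # 590592 1538 66955010
```

The dropout layer contributes no parameters, which is why it shows 0 in the summary.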


Here is the portion that takes by far the longest - model training.

With 10 GB of VRAM on my home machine, each epoch (the training data is ~1 million tweets) takes around 90 minutes. Training for 2 to 4 epochs therefore takes around 3-6 hours on a similar machine.

This should be less of a problem if running on better hardware or through a cloud platform.

In [None]:
model.fit(
    [train_input_ids, train_attention_masks],
    y_train,
    validation_data=([val_input_ids, val_attention_masks], y_val),
    batch_size=BATCH_SIZE,
    epochs=N_EPOCHS,
)

Epoch 1/3
 1751/16848 [==>...........................] - ETA: 1:21:01 - loss: 0.4185 - accuracy: 0.8084
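The 16848 steps in the progress bar follow from the training set size and the batch size, and the time estimate quoted earlier follows from the per-epoch figure. A small sketch of both calculations; the 90 minutes per epoch is the estimate given above, not a measurement made here.

```python
import math

n_train = 1_078_253       # rows in df_train
batch_size = 64           # BATCH_SIZE
minutes_per_epoch = 90    # estimate quoted above for a 10 GB GPU

# The last batch is partial, so the step count is rounded up.
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)    # 16848

hours = {epochs: epochs * minutes_per_epoch / 60 for epochs in (2, 3, 4)}
print(hours)              # {2: 3.0, 3: 4.5, 4: 6.0}
```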

In [None]:
# Optional code to save model

model.save_pretrained('trained_distilbert_model')

Below is the final evaluation of the model on the held-out test set.

In [None]:
results = model.evaluate([test_input_ids,test_attention_masks], y_test, batch_size=128)

ValueError: Failed to find data adapter that can handle input: (<class 'list'> containing values of types {'(<class \'list\'> containing values of types {\'(<class \\\'list\\\'> containing values of types {"<class \\\'int\\\'>"})\'})'}), <class 'pandas.core.series.Series'>

In [None]:
print(results)

[0.36584365367889404, 0.8491830825805664]
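`model.evaluate` returns the loss followed by the compiled metrics, so the two numbers above are the test loss and the test accuracy. Accuracy alone does not show which class the errors fall on, which is presumably why `confusion_matrix` and `classification_report` are imported at the top. A pure-Python sketch of going from logits to predictions and a 2x2 confusion matrix; the logits and labels are made up, and `argmax` is a hypothetical helper.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up logits for four tweets and their true labels (0 = negative, 1 = positive).
logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
y_true = [0, 1, 1, 1]

y_pred = [argmax(row) for row in logits]

# Rows are true classes, columns are predicted classes.
matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(y_pred)    # [0, 1, 1, 0]
print(matrix)    # [[1, 0], [1, 2]]
print(accuracy)  # 0.75
```

With real predictions, the same `y_true` and `y_pred` could be passed to scikit-learn's `confusion_matrix` and `classification_report` instead of counting by hand.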


In [None]:
# Optional code to save df_test

df_test.to_csv('data_test.csv', index=False, sep='\t')