This links your Google Drive account to Google Colab. You will be given a link to click, and from there you grant permission to access Google Drive. A verification code is then provided to paste into the cell below.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install transformers



In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from transformers import TFRobertaForSequenceClassification, TFBertForSequenceClassification
from transformers import RobertaTokenizerFast, BertTokenizerFast
import tensorflow.keras.backend as K
from sklearn.model_selection import train_test_split

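sample() returns a random sample of rows from the DataFrame it is called on (here, train).

frac: a float giving the fraction of rows to return (frac * number of rows). With frac=1, every row is returned, in random order.

random_state is the seed for the random number generator, so the same seed gives the same order on every run.

reset_index(drop=True): renumbers the rows from 0; the drop parameter keeps the old index from being added as a column.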

What do you mean by "#shuffle"? How is the random number generator used?
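As a minimal sketch of what the shuffle does, assuming a toy DataFrame (the column name and values are made up for illustration): sample(frac=1) returns every row in a random order, random_state fixes that order so reruns match, and reset_index(drop=True) renumbers the rows from 0.

```python
import pandas as pd

# Toy frame standing in for train; the values are illustrative only.
toy = pd.DataFrame({"response": ["a", "b", "c", "d", "e"]})

# frac=1 keeps all rows; random_state seeds the generator so the order is repeatable.
shuffled = toy.sample(frac=1, random_state=5).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=5).reset_index(drop=True)

print(shuffled)
print(shuffled.equals(shuffled_again))  # same seed, same order
```

No rows are added or dropped; only their order changes.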

In [4]:
#read data
train = pd.read_json("drive/MyDrive/data/train.jsonl", lines=True)
test =  pd.read_json("drive/MyDrive/data/test.jsonl", lines=True)
train = train.sample(frac=1, random_state = 5).reset_index(drop=True) #shuffle

We add a " # CONTEXT i # " marker in front of each context tweet, where i is the tweet's position in the list, and join the results into one string.

What is text_lst[i]?

In [5]:
def prepare_context(text_lst): #prepare context for input
  res = ""
  for i in range(len(text_lst)):
    res += " # CONTEXT " + str(i)+ " # " + text_lst[i]
  return res
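
To make text_lst[i] concrete: text_lst is the list of context tweets for one row, and text_lst[i] is the i-th tweet in that list. The sketch below repeats the function so it runs on its own and calls it on a hypothetical two-tweet context (the tweet text is made up).

```python
def prepare_context(text_lst):
    """Join context tweets, tagging each with its position in the list."""
    res = ""
    for i in range(len(text_lst)):
        res += " # CONTEXT " + str(i) + " # " + text_lst[i]
    return res

# Hypothetical two-tweet context, made up for illustration.
example_context = ["first tweet", "reply to it"]
result = prepare_context(example_context)
print(repr(result))
# ' # CONTEXT 0 # first tweet # CONTEXT 1 # reply to it'
```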

First line: we add the " # RESPONSE # " string to the front of the response tweet, then append the context tweets (labeled by prepare_context) right after the response.

Second line: prepends the " ** TWEET ** " marker to the input (not needed in the end).

Third line: adds the label_numeric column by mapping the labels NOT_SARCASM/SARCASM to 0/1.

Fourth line: prints the first five rows.

In [6]:
#prepare input
train["input"] = train["response"].apply(lambda x: " # RESPONSE # "+x) + train["context"].apply(prepare_context) #add response and context together
train["input"] = " ** TWEET ** " + train["input"] #(not really important, used this to differentiate amongst external data but not needed in the end)
train["label_numeric"] = train["label"].map({"SARCASM": 1, "NOT_SARCASM": 0})
train.head()

Unnamed: 0,label,response,context,input,label_numeric
0,SARCASM,"@USER @USER @USER please , do read your own da...",[My latest @USER - Obama's weakness invited th...,** TWEET ** # RESPONSE # @USER @USER @USER p...,1
1,SARCASM,@USER The right seems to have a smug assumptio...,[The Left ’ s All-Out War on Trump <URL> THE L...,** TWEET ** # RESPONSE # @USER The right see...,1
2,NOT_SARCASM,@USER @USER #Confessing & repenting are demand...,"[Oh well , guess I'm going to hell . See y'all...",** TWEET ** # RESPONSE # @USER @USER #Confes...,0
3,NOT_SARCASM,🍗 The word of the Lord came to me : “ I cannot...,"[🍗 If I ’ m not right about The Narrow Door , ...",** TWEET ** # RESPONSE # 🍗 The word of the L...,0
4,SARCASM,@USER what a fascinating conclusion . Definite...,[Does anyone care to explain this to me ? <URL...,** TWEET ** # RESPONSE # @USER what a fascin...,1
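
One thing worth checking after the label mapping: Series.map with a dict returns NaN for any value that is not a key, so a misspelled or unexpected label becomes a missing value rather than raising an error. A minimal sketch on a toy frame (the values are made up, and the third label is deliberately wrong):

```python
import pandas as pd

toy = pd.DataFrame({"label": ["SARCASM", "NOT_SARCASM", "sarcasm"]})
toy["label_numeric"] = toy["label"].map({"SARCASM": 1, "NOT_SARCASM": 0})

# Rows whose label was not in the mapping show up as NaN.
unmapped = toy[toy["label_numeric"].isna()]
print(unmapped)
```

An empty unmapped frame on the real data would confirm that every label was converted.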


This prints the full input string for the first example (response plus context tweets) as a quality check on the data preparation.

In [7]:
train.loc[0]["input"]

" ** TWEET **  # RESPONSE # @USER @USER @USER please , do read your own damn comment . Lmao . # CONTEXT 0 # My latest @USER - Obama's weakness invited the Russians to meddle in US elections ( but Dems will never admit it ) . <URL> # CONTEXT 1 # @USER @USER @USER why would Putin want the strong Trump over the weak Hillary ? Makes zero sense ."

In [8]:
#train-validation split (20% of the training data is held out for validation)
train_texts, val_texts, train_labels, val_labels = train_test_split(list(train["input"].values), list(train["label_numeric"].values), test_size=.2, random_state = 5)

In [9]:
#tokenize data (prepare inputs, attention masks, and special tokens)
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-large')
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128)
train_dataset = tf.data.Dataset.from_tensor_slices(( #creates a tensorflow dataset object that can be used to train
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

#train model
K.clear_session() #resets Keras global state left over from any previous model
model = TFRobertaForSequenceClassification.from_pretrained('roberta-large')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy']) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, validation_data=val_dataset.shuffle(100).batch(16)) # datasets are already batched, so no batch_size argument

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaForSequenceClassification: ['lm_head']
- This IS expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f0b0eb5f3c8>
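
For the tokenization step in the cell above, truncation=True with max_length cuts long sequences, and padding=True pads shorter ones to the longest sequence in the batch, with an attention mask marking real tokens (1) versus padding (0). The sketch below is a toy illustration in plain Python, not the tokenizer's implementation: the function name, the token ids, and pad_id=0 are all assumptions, and a real tokenizer also handles special tokens.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of truncation=True, padding=True on lists of token ids."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

enc = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(enc["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(enc["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```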

In [10]:
#predict on validation set
val_pred = []
for text in tqdm(val_texts): #predict each validation row with progress bar
    val_encodings = tokenizer.encode(text,
                             truncation=True,
                             padding=True,
                             max_length=128, #add in a little more context 
                             return_tensors="tf")
    logits = model.predict(val_encodings)[0] #outputs logits
    val_pred.append(tf.nn.softmax(logits, axis=1).numpy()[0]) #converts logits to probabilities
val_pred_labels = np.argmax(val_pred, axis=-1) #outputs higher probable class
f1_score(val_labels, val_pred_labels)

100%|██████████| 1000/1000 [01:51<00:00,  8.95it/s]


0.8489341983317886
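
The path from logits to the F1 score can be sketched with NumPy alone. The logits and true labels below are made up for illustration; the F1 formula is the standard binary one, with class 1 (SARCASM) as the positive class.

```python
import numpy as np

# Made-up logits for four examples; columns are [NOT_SARCASM, SARCASM].
logits = np.array([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.2], [1.0, 0.9]])

# Numerically stable softmax over the class axis.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

pred = probs.argmax(axis=1)    # same result as logits.argmax(axis=1)
true = np.array([0, 1, 0, 1])  # made-up labels

tp = int(((pred == 1) & (true == 1)).sum())  # true positives
fp = int(((pred == 1) & (true == 0)).sum())  # false positives
fn = int(((pred == 0) & (true == 1)).sum())  # false negatives
f1 = 2 * tp / (2 * tp + fp + fn)
print(pred, f1)
```

Because softmax preserves the ordering of the logits within each row, the predicted class is the same whether argmax is taken before or after it.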

In [11]:
#save model
model.save_pretrained("drive/MyDrive/data/roberta_model")

In [11]:
#load model and tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-large')
model = TFRobertaForSequenceClassification.from_pretrained("drive/MyDrive/data/roberta_model")

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at drive/MyDrive/data/roberta_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [12]:
#prepare test input
test["input"] = test["response"].apply(lambda x: " # RESPONSE # "+x) + test["context"].apply(prepare_context)
test["input"] = " ** TWEET ** " + test["input"]
test_texts = list(test["input"].values)
test_pred = []

#tokenize and predict on test set
for text in tqdm(test_texts):
    test_encodings = tokenizer.encode(text,
                             truncation=True,
                             padding=True,
                             max_length=128, #add in a little more context for predict
                             return_tensors="tf")
    logits = model.predict(test_encodings)[0]
    test_pred.append(tf.nn.softmax(logits, axis=1).numpy()[0])
test_pred_labels = np.argmax(test_pred, axis=-1)

100%|██████████| 1800/1800 [03:14<00:00,  9.23it/s]


In [13]:
#write test predictions to the answer file, one "id,LABEL" line per row
with open('drive/MyDrive/data/answer128-128v4.txt', 'w') as the_file:
    for i in range(len(test)):
        if test_pred_labels[i] == 1:
            the_file.write(test.loc[i, "id"]+",SARCASM\n")
        else:
            the_file.write(test.loc[i, "id"]+",NOT_SARCASM\n")
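
The same "id,LABEL" output could also be written with a lookup list instead of the if/else. This is only a sketch under assumptions: LABEL_NAMES, the ids, and the predictions are made up, and the file goes to a temporary directory rather than Drive.

```python
import os
import tempfile

LABEL_NAMES = ["NOT_SARCASM", "SARCASM"]  # index matches the numeric label

# Made-up ids and predictions standing in for test["id"] and test_pred_labels.
ids = ["twitter_1", "twitter_2", "twitter_3"]
pred_labels = [1, 0, 1]

out_path = os.path.join(tempfile.mkdtemp(), "answer.txt")
with open(out_path, "w") as f:
    for row_id, label in zip(ids, pred_labels):
        f.write(f"{row_id},{LABEL_NAMES[label]}\n")

# Read the file back to confirm the format.
with open(out_path) as f:
    lines = f.read().splitlines()
print(lines)
```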