## Data Collection steps

In [19]:
import pandas as pd

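# Load the GoEmotions training split from GitHub (URL pinned to a specific commit)
train_data_url = 'https://raw.githubusercontent.com/google-research/google-research/2adf640a14f11025ae5a9d0ec493b78530d276d3/goemotions/data/train.tsv'

# Read the tab-separated file into a DataFrame
# Note: read_csv uses the first line of the file as the header by default;
# if train.tsv has no header row, pass header=None so the first comment is not lost
train_data = pd.read_csv(train_data_url, sep='\t')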
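
The header behaviour of `read_csv` can be checked on a tiny in-memory file. This is a minimal sketch, assuming a headerless three-column TSV like the one loaded above; the two sample rows and the variable names are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Two-line stand-in for a headerless TSV: comment, emotion id, comment id (made-up rows)
sample_tsv = "first comment\t27\taaa\nsecond comment\t2\tbbb\n"
columns = ["comment", "emotion", "id"]

# Default behaviour: the first line is consumed as the header
with_default = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# header=None keeps every line as data; names= supplies the column labels
with_names = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None, names=columns)

print(len(with_default), len(with_names))
```

With `header=None` and `names=`, the later step of assigning `train_data.columns` by hand would no longer be needed.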

# 1. Feature Engineering

In [20]:
# name the columns (no split happens yet)
# comment will be the only feature
# emotion will be the target
header = ["comment", "emotion", "id"]
train_data.columns = header
train_data.head(2)

Unnamed: 0,comment,emotion,id
0,"Now if he does off himself, everyone will thin...",27,ed00q6i
1,WHY THE FUCK IS BAYLESS ISOING,2,eezlygj


### Loading the BERT tokenizer

In [21]:
# build model inputs from the comment column by tokenizing each comment
from transformers import BertTokenizerFast

# Load the bert-base-uncased tokenizer
# it splits each comment into WordPiece tokens
# and maps each token to an integer token ID
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

### Converting Comments into Tokens

In [22]:
# Tokenize the comments
train_data['tokenized_comments'] = train_data['comment'].apply(
    lambda x: tokenizer.encode(x)
)

In [23]:
train_data

Unnamed: 0,comment,emotion,id,tokenized_comments
0,"Now if he does off himself, everyone will thin...",27,ed00q6i,"[101, 2085, 2065, 2002, 2515, 2125, 2370, 1010..."
1,WHY THE FUCK IS BAYLESS ISOING,2,eezlygj,"[101, 2339, 1996, 6616, 2003, 3016, 3238, 1116..."
2,To make her feel threatened,14,ed7ypvh,"[101, 2000, 2191, 2014, 2514, 5561, 102]"
3,Dirty Southern Wankers,3,ed0bdzj,"[101, 6530, 2670, 14071, 11451, 102]"
4,OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe...,26,edvnz26,"[101, 18168, 2290, 17931, 3475, 1005, 1056, 22..."
...,...,...,...,...
43404,Added you mate well I’ve just got the bow and ...,18,edsb738,"[101, 2794, 2017, 6775, 2092, 1045, 1521, 2310..."
43405,Always thought that was funny but is it a refe...,6,ee7fdou,"[101, 2467, 2245, 2008, 2001, 6057, 2021, 2003..."
43406,What are you talking about? Anything bad that ...,3,efgbhks,"[101, 2054, 2024, 2017, 3331, 2055, 1029, 2505..."
43407,"More like a baptism, with sexy results!",13,ed1naf8,"[101, 2062, 2066, 1037, 18336, 1010, 2007, 791..."
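
In the output above, every encoded comment starts with 101 and ends with 102. These are the IDs of the special `[CLS]` and `[SEP]` tokens that `tokenizer.encode` adds around the comment. The sketch below imitates this for one row only. `toy_encode` and `TOY_VOCAB` are hypothetical names; the five word IDs are copied from the row "To make her feel threatened", and the whitespace split is a stand-in for the real WordPiece algorithm, not a replacement for it.

```python
# Toy vocabulary: IDs copied from the row "To make her feel threatened" in the output above
TOY_VOCAB = {"to": 2000, "make": 2191, "her": 2014, "feel": 2514, "threatened": 5561}
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in bert-base-uncased


def toy_encode(text):
    """Hypothetical helper: lowercase, split on whitespace, look up IDs, add special tokens."""
    ids = [TOY_VOCAB[word] for word in text.lower().split()]
    return [CLS_ID] + ids + [SEP_ID]


print(toy_encode("To make her feel threatened"))
```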


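Alternatively, call the tokenizer directly on the full list of comments. This tokenizes the whole batch at once and also handles padding and truncation.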

In [24]:
# Calling the tokenizer directly on the whole list of comments

#### PADDING appends a special padding token to shorter sequences
# for example
# if we have the tokens [1, 2, 3, 4, 5]
# and we want to pad them to a length of 10
# we add 5 padding tokens (ID 0 for this tokenizer)
# so the tokens look like this: [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
# padding=True pads every sequence to the length of the longest one in the batch

#### TRUNCATION removes tokens from sequences that are too long
# for example
# if we have the tokens [1, 2, 3, 4, 5]
# and we want to truncate them to a length of 3
# we remove the last 2 tokens
# so the tokens look like this: [1, 2, 3]
# truncation=True cuts sequences to the model's maximum input length
# (512 tokens for bert-base-uncased)

# Why do we need PADDING and TRUNCATION?
# All sequences in a batch must have the same length to be stacked into one tensor,
# and BERT cannot process sequences longer than its maximum input length
# padding lengthens short sequences and truncation shortens long ones
# here we use both:
# sequences longer than the maximum length are truncated,
# and all sequences are then padded to the longest one in the batch

#### RETURN_TENSORS='pt' returns the output as PyTorch tensors

tokenized_comments = tokenizer(train_data['comment'].to_list(), padding=True, truncation=True, return_tensors='pt')
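
The padding and truncation examples from the comments above can be reproduced in plain Python. This is a simplified sketch, not the tokenizer's own implementation: `pad_or_truncate` is a hypothetical helper, and unlike the real tokenizer it does not keep the closing `[SEP]` token when it truncates.

```python
PAD_ID = 0  # ID of the [PAD] token in bert-base-uncased


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Hypothetical helper: cut ids to max_length, then right-pad with pad_id."""
    clipped = ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


# Pad a small batch to its longest sequence, as padding=True does
batch = [[1, 2, 3, 4, 5], [1, 2, 3]]
longest = max(len(ids) for ids in batch)
padded = [pad_or_truncate(ids, longest) for ids in batch]

print(pad_or_truncate([1, 2, 3, 4, 5], 10))
print(pad_or_truncate([1, 2, 3, 4, 5], 3))
print(padded)
```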

In [25]:
#### TOKENIZED_COMMENTS is a dictionary-like object
# it contains the INPUT_IDS, TOKEN_TYPE_IDS, and ATTENTION_MASK
# input_ids are the token IDs of the comments, padded to a common length

#### ATTENTION_MASK is a tensor with the same shape as input_ids
# it contains 1s at real tokens and 0s at padding tokens

#### TOKEN_TYPE_IDS is a tensor with the same shape as input_ids
# it contains 0s for the first sequence and 1s for the second sequence
# when two texts are passed to the tokenizer together as a pair

# So does it split the text into sentences?
# No, it only marks which of the two input sequences each token belongs to
# since each comment is passed as a single sequence, all the values are 0s

tokenized_comments

{'input_ids': tensor([[ 101, 2085, 2065,  ...,    0,    0,    0],
        [ 101, 2339, 1996,  ...,    0,    0,    0],
        [ 101, 2000, 2191,  ...,    0,    0,    0],
        ...,
        [ 101, 2054, 2024,  ...,    0,    0,    0],
        [ 101, 2062, 2066,  ...,    0,    0,    0],
        [ 101, 5959, 1996,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
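
The relationship between the three tensors can be illustrated without the library. The sketch below rebuilds an attention mask from padded input IDs and shows what token type IDs would look like for a pair of sequences. Both functions are hypothetical helpers; the two example rows reuse IDs from the earlier output, padded by hand to a length of 9.

```python
PAD_ID = 0  # ID of the [PAD] token in bert-base-uncased


def attention_mask(input_ids, pad_id=PAD_ID):
    """Hypothetical helper: 1 for every real token, 0 for every padding token."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in input_ids]


def token_type_ids_for_pair(len_a, len_b):
    """Hypothetical helper: segment IDs for the layout [CLS] A [SEP] B [SEP]."""
    return [0] * (len_a + 2) + [1] * (len_b + 1)


# Two comments from the earlier output, padded by hand to length 9
input_ids = [
    [101, 2000, 2191, 2014, 2514, 5561, 102, 0, 0],
    [101, 6530, 2670, 14071, 11451, 102, 0, 0, 0],
]

print(attention_mask(input_ids))
print(token_type_ids_for_pair(3, 2))
```

For single-sequence input such as these comments, there is no second segment, which is why `token_type_ids` is all zeros in the output above.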

# 2. Model Building

Next, we try BertForSequenceClassification.

In [26]:
from transformers import BertForSequenceClassification
# Load the pretrained bert-base-uncased model
# with a classification head that has one output per emotion label (28 in total)
# the head is newly initialized, so the model must be fine-tuned before it can classify

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=28)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
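
The warning above refers to the classification head, which produces one score (logit) per label. As a rough illustration of how 28 logits become a single predicted label, the sketch below applies a softmax and takes the highest-probability index. The logit values are made up, and this plain-Python version only mirrors the idea; it is not how the library computes predictions internally. Until the head is fine-tuned, real logits from the model would be close to arbitrary.

```python
import math

NUM_LABELS = 28  # matches num_labels passed to the model above


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one comment: label 14 gets the highest score
logits = [0.0] * NUM_LABELS
logits[14] = 2.0

probs = softmax(logits)
predicted = max(range(NUM_LABELS), key=probs.__getitem__)

print(predicted, round(probs[predicted], 3))
```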


Returning to the research question: a first formulation could be "Can we predict the sentiment of a text?"

We refine this into our first research question:

**How can different transformer-based models classify textual emotions effectively?**

We start with the BERT Classifier.