
# 04 SENTIMENT ANALYSIS WITH A FINE-TUNED BERT MODEL

This is Part 04 of my NER-based Sentiment Analysis Project: 
https://github.com/umbertoselva/NER-based-Sentiment-Analysis

Our goal here is to preprocess the Kaggle movie review dataset that we acquired and adapted in Part 03, and to turn it into a training set and a validation set with which to fine-tune a BERT model for Sentiment Analysis. Finally, we want to perform Sentiment Analysis on the movie reviews in the "I Just Watched" subreddit dataset that we worked with in Parts 01 and 02, and store the results in a dedicated "sentiment" column.

## TABLE OF CONTENTS

A) Preprocessing

B) Input pipeline

C) Fine-tuning BERT

D) Sentiment Analysis

## A) PREPROCESSING

#### LOADING THE KAGGLE MOVIE REVIEW DATASET

Let's retrieve from Google Drive the dataset that we created earlier in Part 03.

In [1]:
import pandas as pd

In [2]:
url = "https://drive.google.com/file/d/1PJsd2xDNDzRPgJnsgpmf5yAs8SWh5U2l/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
df = pd.read_csv(dwn_url, sep='|', encoding='utf-8')
df

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,0
1,22,1,good for the goose,1
2,23,1,good,1
3,34,1,"the gander , some of which occasionally amuses...",0
4,47,1,amuses,1
...,...,...,...,...
76473,156048,8544,quietly suggesting the sadness and obsession b...,0
76474,156052,8544,sadness and obsession,0
76475,156053,8544,sadness and,0
76476,156057,8544,forced avuncular chortles,0
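The URL handling in the loading cell above can be wrapped in a small helper. This is only a sketch: the function name `drive_download_url` is hypothetical, and it assumes a share link of the form `https://drive.google.com/file/d/<file_id>/view?...`, which is the form used in this notebook.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link.

    Assumes the '/file/d/<file_id>/view?...' link format used in this notebook.
    """
    # the file id is the second-to-last segment when splitting on '/'
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?export=download&id=' + file_id


url = "https://drive.google.com/file/d/1PJsd2xDNDzRPgJnsgpmf5yAs8SWh5U2l/view?usp=sharing"
print(drive_download_url(url))
```

The returned string can then be passed to `pd.read_csv()` exactly as in the cell above.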


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76478 entries, 0 to 76477
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   PhraseId    76478 non-null  int64 
 1   SentenceId  76478 non-null  int64 
 2   Phrase      76478 non-null  object
 3   Sentiment   76478 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 2.3+ MB


#### TOKENIZATION

In [4]:
!pip install transformers


We will be fine-tuning the `bert-base-cased` model that we will leverage via the Huggingface Transformers library.

This model accepts input sequences of at most 512 tokens. We will pad or truncate every sample to that length, so let's set that variable here.

In [5]:
seq_len = 512 # this is the encoding size expected by the BERT model we'll be using
num_samples = len(df) # 76478

num_samples, seq_len

(76478, 512)

In [6]:
from transformers import BertTokenizer

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


Let's turn all our texts from the 'Phrase' column of our dataframe into tokens

In [8]:
# the texts are found in the 'Phrase' col in our df
# N.B. the input arg must be a list (not a pd Series)
tokens = tokenizer(df['Phrase'].to_list(),
                   max_length=seq_len,
                   truncation=True, # truncate if longer than max length
                   padding='max_length', # pad if shorter than max length
                   add_special_tokens=True,
                   return_tensors='np') # returning NumPy arrays

In [9]:
tokens.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

We will be using the "input_ids" and the "attention_mask" arrays
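To make the relationship between the two arrays concrete, here is a minimal pure-Python sketch of what padding and truncating to a fixed length amounts to. The helper `pad_or_truncate` is hypothetical (it is not part of the Transformers library) and skips the actual word-piece tokenization; the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of the `bert-base-cased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token ids in bert-base-cased

def pad_or_truncate(word_piece_ids, max_length):
    """Hypothetical helper: build fixed-length input_ids and attention_mask."""
    # leave room for [CLS] and [SEP], truncating the text if it is too long
    body = word_piece_ids[:max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [PAD_ID] * n_pad            # pad with zeros on the right
    attention_mask = [1] * n_real + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# the phrase 'good' is encoded as the single id 1363 in the tokenizer output
ids, mask = pad_or_truncate([1363], max_length=8)
print(ids)   # [101, 1363, 102, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```

The attention mask simply tells the model which positions hold real tokens and which hold padding.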

In [10]:
tokens['input_ids']

array([[  101,   138,  1326, ...,     0,     0,     0],
       [  101,  1363,  1111, ...,     0,     0,     0],
       [  101,  1363,   102, ...,     0,     0,     0],
       ...,
       [  101, 12928,  1105, ...,     0,     0,     0],
       [  101,  2257,   170, ...,     0,     0,     0],
       [  101,   170, 25247, ...,     0,     0,     0]])

In [11]:
tokens['attention_mask']

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

#### SAVING THE INPUT ARRAYS

Let's save these arrays as NumPy binary files

In [12]:
import numpy as np

In [13]:
with open('movie-xids.npy', 'wb') as f:
  np.save(f, tokens['input_ids'])

In [14]:
with open('movie-xmask.npy', 'wb') as f:
  np.save(f, tokens['attention_mask'])

In [15]:
!ls

movie-xids.npy	movie-xmask.npy  sample_data


Now we need to extract the labels from the "Sentiment" column of our df, convert them to a NumPy array, and save that array as a NumPy binary file too.

In [16]:
arr = df['Sentiment'].values

In [17]:
arr

array([0, 1, 1, ..., 0, 0, 1])

We need to transform this array into another array whose two dimensions will be
- the size of our dataframe (i.e. num_samples)
- the number of classes (i.e. 2) (arr.max()+1)

So whenever we have a 0 in our array, we want to have [1, 0] in our new array

And whenever we have a 1 in our array, we want to have [0, 1] in our new array.

- 0 = [1, 0]
- 1 = [0, 1]
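This transformation is usually called one-hot encoding. A minimal sketch on a toy array (the four class ids below are made up for illustration) shows the indexing trick at work:

```python
import numpy as np

arr = np.array([0, 1, 1, 0])            # toy class ids
num_classes = arr.max() + 1             # 2

# start from all zeros, then set exactly one column per row to 1
labels = np.zeros((arr.size, num_classes))
labels[np.arange(arr.size), arr] = 1

print(labels)
```

Indexing an identity matrix with the class ids, `np.eye(num_classes)[arr]`, gives the same result in one step.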

In [18]:
labels = np.zeros((num_samples, arr.max()+1)) # (76478, 2)

In [19]:
labels.shape

(76478, 2)

In [20]:
labels

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       ...,
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [21]:
labels[np.arange(num_samples), arr] = 1

In [22]:
labels

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       ...,
       [1., 0.],
       [1., 0.],
       [0., 1.]])

In [23]:
with open('movie-labels.npy', 'wb') as f:
  np.save(f, labels)

In [24]:
!ls

movie-labels.npy  movie-xids.npy  movie-xmask.npy  sample_data


## B) INPUT PIPELINE

Our goal here will be to create
- a training set
- a test/validation set

Each one will be ready to be fed to our custom model in batches of 16 samples, where every sample is a ({xids, xmask}, label) tuple of shape ({512, 512}, 2) corresponding to (inputs, outputs).

That is: ({(16, 512), (16, 512)}, (16, 2))


In [25]:
# import numpy as np

with open('movie-xids.npy', 'rb') as f:
    Xids = np.load(f, allow_pickle=True)
with open('movie-xmask.npy', 'rb') as f:
    Xmask = np.load(f, allow_pickle=True)
with open('movie-labels.npy', 'rb') as f:
    labels = np.load(f, allow_pickle=True)

First we need to create a TensorFlow Dataset object with our three numpy files containing the `input_ids`, the `attention_mask` and the `labels` arrays

In [26]:
import tensorflow as tf

In [27]:
from tensorflow.data import Dataset

The `.from_tensor_slices()` method slices the arrays along their first axis and groups together the slices found at the same index.

It expects a tuple as input.

An input such as:

([1, 2], [3, 4], [5, 6])

will be mapped as:

[(1, 3, 5), (2, 4, 6)]
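The same pairing can be reproduced without TensorFlow. The sketch below is only a NumPy analogue of the slicing behaviour, not the `tf.data` implementation: zipping the arrays yields one tuple per index along the first axis.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])
c = np.array([5, 6])

# one tuple per position along the first axis, like the elements
# that Dataset.from_tensor_slices((a, b, c)) would yield
elements = [tuple(int(v) for v in row) for row in zip(a, b, c)]
print(elements)  # [(1, 3, 5), (2, 4, 6)]
```

In our case the three arrays are `Xids`, `Xmask` and `labels`, so each element is one (input ids, attention mask, label) triple.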

In [28]:
dataset = Dataset.from_tensor_slices((Xids, Xmask, labels))

In [29]:
type(dataset)

tensorflow.python.data.ops.dataset_ops.TensorSliceDataset

In [30]:
dataset

<TensorSliceDataset element_spec=(TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(2,), dtype=tf.float64, name=None))>

In [31]:
dataset.take(1)

<TakeDataset element_spec=(TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(2,), dtype=tf.float64, name=None))>

So each element of our dataset is a tuple of three tensors with shapes (512,), (512,) and (2,).

To feed our dataset into our model we need a tuple with two items:

`(inputs, outputs)`

However, each sample in our dataset is currently a flat tuple containing a single Xid, Xmask and label, and our model takes two inputs (Xids, Xmask). So the first item, "inputs", will have to be a dict like the following, and the second item will be the outputs, i.e. the labels:

```
(
  {
   'input_ids': *input_id_tensor*,
   'attention_mask': *attention_mask_tensor*
  },
 labels
)
```

So let's rearrange our dataset like that and convert a three-item tuple into a two-item tuple. Let's create a custom function for that.

In [32]:
def map_inputs(input_ids, masks, labels):
  return {'input_ids': input_ids, 'attention_mask': masks},  labels

Then we use the `Dataset.map()` method to actually map/apply our `map_inputs()` function to each sample set

In [33]:
dataset = dataset.map(map_inputs)

In [34]:
dataset

<MapDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int64, name=None)}, TensorSpec(shape=(2,), dtype=tf.float64, name=None))>

In [35]:
dataset.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int64, name=None)}, TensorSpec(shape=(2,), dtype=tf.float64, name=None))>

Now you can see the ({512, 512}, 2) shaped tuple that we wanted.

Now we shall shuffle and batch our data.

It's useful to shuffle before batching, so that the data within each batch will already be more mixed.

In [36]:
batch_size = 16

In [37]:
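# reshuffle_each_iteration=False keeps the shuffled order fixed across epochs,
# so that the take()/skip() split below always yields the same train and test batches
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False).batch(batch_size, drop_remainder=True)

# drop_remainder=True will drop remaining items that don't fit into a batch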

In [38]:
dataset

<BatchDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 2), dtype=tf.float64, name=None))>

In [39]:
dataset.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 2), dtype=tf.float64, name=None))>

So now our dataset is structured into input batches of shape ({(16, 512), (16, 512)}, (16, 2))

Finally we want to split our dataset into
- training set
- test/validation set

In [40]:
# our total data size is
size = Xids.shape[0]
size

76478

In [41]:
# so the total number of batches is
size / batch_size

4779.875

In [42]:
# let's take 90% of our batches as training data
size / batch_size * 0.9

4301.8875

In [43]:
# let's approximate
int(size / batch_size * 0.9)

4301

In [44]:
train_size = int(size / batch_size * 0.9)
train_size

4301

In [45]:
train_ds = dataset.take(train_size)
train_ds

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 2), dtype=tf.float64, name=None))>

In [46]:
len(train_ds)

4301

In [47]:
test_ds = dataset.skip(train_size)
test_ds

<SkipDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 2), dtype=tf.float64, name=None))>

In [48]:
len(test_ds)

478

So our training set contains 4301 batches

And our test set contains 478 batches

In [49]:
4301*16, 478*16, 4301*16+478*16

(68816, 7648, 76464)
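The split arithmetic can be double-checked in plain Python. This sketch only restates the numbers from the cells above and assumes that incomplete batches are dropped, as with `drop_remainder=True`:

```python
size = 76478
batch_size = 16

n_batches = size // batch_size             # full batches only
dropped = size - n_batches * batch_size    # samples that do not fill a last batch
train_batches = int(size / batch_size * 0.9)
test_batches = n_batches - train_batches

print(n_batches, dropped)                  # 4779 14
print(train_batches, test_batches)         # 4301 478
print(train_batches * batch_size, test_batches * batch_size)  # 68816 7648
```

So 76464 of the 76478 samples end up in a batch, and the remaining 14 are dropped.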

We can save our train and test sets

In [50]:
tf.data.experimental.save(train_ds, 'train_ds')
tf.data.experimental.save(test_ds, 'test_ds')

In [51]:
!ls

movie-labels.npy  movie-xmask.npy  test_ds
movie-xids.npy	  sample_data	   train_ds


Note that in order to load these saved datasets later on, we would need to specify the datasets' `element_spec`, which describes the structure, shape and dtype of the tensors in each element. That's a requirement of the `.load()` method.

The two datasets' element specs should be equal.

In [52]:
train_ds.element_spec

({'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None),
  'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)},
 TensorSpec(shape=(16, 2), dtype=tf.float64, name=None))

In [53]:
train_ds.element_spec == test_ds.element_spec

True

In [54]:
element_spec = train_ds.element_spec
element_spec

({'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None),
  'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)},
 TensorSpec(shape=(16, 2), dtype=tf.float64, name=None))

In [55]:
# so if you load the datasets you need to do
# train_ds = tf.data.experimental.load('train_ds', element_spec=element_spec)
# test_ds = tf.data.experimental.load('test_ds', element_spec=element_spec)

# but we don't need to re-load them here

## C) FINE-TUNING BERT

#### MODEL STRUCTURE

We will use the Huggingface Transformers library to leverage a BERT model, which we will fine-tune for Sentiment Analysis by
- specifying how the input is fed into the model
- building a custom classifier head

In [56]:
from transformers import TFAutoModel

In [57]:
bert = TFAutoModel.from_pretrained('bert-base-cased')


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [58]:
bert.summary()

Model: "tf_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
Total params: 108,310,272
Trainable params: 108,310,272
Non-trainable params: 0
_________________________________________________________________


Let's define two input layers
- one for the input ids
- one for the attention mask

In [59]:
# import tensorflow
# from tensorflow.keras import layers

In [60]:
# shape= this is the max sequence length expected by BERT, its encoding size
# name= this must match the dict key that we set in the input tuple
# ({'input_ids': ..., 'attention_mask': ...}, labels)
input_ids = tf.keras.layers.Input(shape=(512,),
                                  name='input_ids',
                                  dtype='int32')
mask = tf.keras.layers.Input(shape=(512,),
                             name='attention_mask',
                             dtype='int32')

Now we will set the output of these layers to be the input of our BERT model, which shall return the embeddings.

Note that our BERT model returns both
- the non-pooled output / last hidden state (3D) at index [0]
- the pooled output (2D) at index [1]

We shall use the 2D pooled output and feed it into a couple of Dense layers which will perform the Sentiment Analysis Classification task.
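To see how the two outputs relate: in the standard BERT architecture the pooled output is computed from the hidden state of the first (`[CLS]`) token, passed through a dense layer with a tanh activation. The NumPy sketch below only illustrates the shapes. The sizes are toy values and the weights `W` and `b` are random stand-ins, not the pretrained pooler weights.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 8, 4        # toy sizes; bert-base uses hidden=768

# 3D output, what the model returns at index [0]
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))

W = rng.normal(size=(hidden, hidden))   # stand-in for the pooler's dense weights
b = np.zeros(hidden)                    # stand-in for the pooler's bias

cls_vector = last_hidden_state[:, 0, :]       # hidden state of the first token
pooled_output = np.tanh(cls_vector @ W + b)   # 2D output, index [1]

print(last_hidden_state.shape, pooled_output.shape)  # (2, 8, 4) (2, 4)
```

The pooled output therefore gives one fixed-size vector per sample, which is a convenient input for Dense layers.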

In [61]:
embeddings = bert.bert(input_ids,
                       attention_mask=mask)[1] # index [1] = pooled output

Finally we define our classification head with two Dense layers, passing the embeddings as input

In [62]:
x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)
y = tf.keras.layers.Dense(2, activation='softmax',
                          name='outputs')(x)

Initialize the model

In [63]:
model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)

In [64]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 512)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 512)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 512,                                           

We won't update the weights of the BERT layer itself, so we will freeze it and train only the classification head.

In [65]:
# the BERT layer is at index [2]
model.layers[2].trainable = False

In [66]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 512)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 512)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 512,                                           

The number of "Trainable params" reported in the summary is now lower, since the BERT weights are frozen.

#### TRAINING PARAMETERS

In [67]:
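optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, decay=1e-6)
# `learning_rate` replaces the deprecated `lr` argument
# note: `decay` is accepted by the TF 2.x optimizer used here; newer Keras optimizers may reject it
# these are commonly recommended values for training BERT models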


In [68]:
loss = tf.keras.losses.CategoricalCrossentropy()

In [69]:
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')
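As a rough illustration of what this loss and metric compute on one-hot labels, here is a NumPy sketch. The predictions are made-up values, and the sketch ignores the small clipping that Keras applies for numerical stability.

```python
import numpy as np

y_true = np.array([[1., 0.], [0., 1.], [0., 1.]])        # one-hot labels
y_pred = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])  # softmax outputs (made up)

# categorical cross-entropy: -sum(y_true * log(y_pred)) per sample, then averaged
per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
loss_value = per_sample.mean()

# categorical accuracy: share of samples whose predicted class matches the label
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print(float(loss_value), float(accuracy))
```

Here the third sample is misclassified, so the accuracy is 2/3 and that sample contributes the largest share of the loss.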

#### COMPILE

In [70]:
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=[acc])

#### TRAIN

In [72]:
history = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [73]:
model.save('bert-sentiment-model')



INFO:tensorflow:Assets written to: bert-sentiment-model/assets




## D) SENTIMENT ANALYSIS

Our goal in this section will be to analyze the sentiment of all the reviews in the "I Just Watched" subreddit dataset that we created in Part 01.

For this purpose we need to create a function that we will apply to each row of the `selftext` column of the database (i.e. the column that contains the review texts). This function will use our trained model to predict the sentiment of the review. Let's call this function `get_sentiment()`.

Before doing that, however, we need another function that will preprocess the review text by tokenizing it (with the BERT tokenizer that we initialized above) and return the input tensors in the shape that our model expects. Let's define this `prep_data()` function here:

In [74]:
# remember that we initialized a BertTokenizer earlier

def prep_data(text):

  # get tokens
  tokens = tokenizer.encode_plus(text,
                                 max_length=512,
                                 truncation=True, 
                                 padding='max_length',
                                 add_special_tokens=True, 
                                 return_token_type_ids=False,
                                 return_tensors='tf')
  
  # the model's Input layers are declared as int32, which is also what the
  # tokenizer returns with return_tensors='tf'; the cast makes that explicit
  return {'input_ids': tf.cast(tokens['input_ids'], tf.int32),
          'attention_mask': tf.cast(tokens['attention_mask'], tf.int32)}

Now let's define the `get_sentiment()` function, which will call the above function and then feed its output into our model to predict the sentiment.

Let's set it up so that it returns a tuple containing a sentiment label ('POSITIVE' vs 'NEGATIVE') and the probability predicted by our model for that label.

In [82]:
def get_sentiment(text):

  # tokenize the text and prepare the input for the model
  inputs = prep_data(text)

  # predict the sentiment of the text with the model
  # i.e. predict the probs of the two classes
  probs = model.predict(inputs)[0]

  # take the highest prob
  sent_class = np.argmax(probs) # this returns the class 0 neg or 1 pos
  sent_score = probs[sent_class] # this captures the score

  if sent_class == 0:
    sent_label = 'NEGATIVE'
  else:
    sent_label = 'POSITIVE'

  # return
  return (sent_label, sent_score)
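The label-picking step can be tried in isolation, without the model. In this sketch `label_from_probs` is a hypothetical helper and the probability vectors are made up; it assumes the same class order as our labels (index 0 = negative, index 1 = positive).

```python
import numpy as np

LABELS = ('NEGATIVE', 'POSITIVE')  # index 0 = negative, index 1 = positive

def label_from_probs(probs):
    """Hypothetical helper: map [p_negative, p_positive] to (label, score)."""
    sent_class = int(np.argmax(probs))   # index of the most probable class
    return LABELS[sent_class], float(probs[sent_class])

print(label_from_probs(np.array([0.17, 0.83])))  # ('POSITIVE', 0.83)
print(label_from_probs(np.array([0.73, 0.27])))  # ('NEGATIVE', 0.73)
```

Since there are only two classes, the returned score is always at least 0.5; values close to 0.5 indicate that the model is unsure.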

Now let's test it

In [83]:
test_review = "I liked this movie very much"
get_sentiment(test_review)

('POSITIVE', 0.8326956)

In [84]:
test_review = "This one sucked"
get_sentiment(test_review)

('NEGATIVE', 0.7317641)

Now we want to apply this to the "I Just Watched" dataset to find out the sentiment of each review

Let's load the dataset

In [87]:
url = "https://drive.google.com/file/d/1rGO4DABtChIogEC8mn7EHpQiZotbapM1/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
df_ijw = pd.read_csv(dwn_url, sep='|', encoding='utf-8')
df_ijw

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,people
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0,"['Albina', 'Ayanna Misola', 'Adrian Alandy']"
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0,"['Marx', 'Mel Brooks', 'Mel']"
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0,"['Hana', 'Rose van Ginkel', 'Kitty K7', 'Joy A..."
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0,"[""Kevin Hart's""]"
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0,"['Korg', 'Thor', 'Thors', 'Chris Hemsworth', '..."
...,...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0,[]
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0,"['Buddha', 'Kim Yoo Jung']"
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0,[]
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0,[]


Let's create a dedicated column and populate it

In [88]:
df_ijw['sentiment'] = df_ijw['selftext'].apply(get_sentiment)

In [89]:
df_ijw

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,people,sentiment
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0,"['Albina', 'Ayanna Misola', 'Adrian Alandy']","(POSITIVE, 0.5522495)"
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0,"['Marx', 'Mel Brooks', 'Mel']","(POSITIVE, 0.5305168)"
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0,"['Hana', 'Rose van Ginkel', 'Kitty K7', 'Joy A...","(POSITIVE, 0.84092736)"
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0,"[""Kevin Hart's""]","(NEGATIVE, 0.5498567)"
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0,"['Korg', 'Thor', 'Thors', 'Chris Hemsworth', '...","(NEGATIVE, 0.5038758)"
...,...,...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0,[],"(NEGATIVE, 0.5597052)"
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0,"['Buddha', 'Kim Yoo Jung']","(POSITIVE, 0.76196575)"
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0,[],"(POSITIVE, 0.5740766)"
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0,[],"(NEGATIVE, 0.5029748)"


Let us save the dataset (which now includes the sentiment column) into a CSV file for later use

In [96]:
df_ijw.to_csv('ijw_subreddit_ner_sent_bert.csv', sep='|', encoding='utf-8', index=False)

In [98]:
!ls

bert-sentiment-model		 movie-labels.npy  movie-xmask.npy  test_ds
ijw_subreddit_ner_sent_bert.csv  movie-xids.npy    sample_data	    train_ds
