<a href="https://colab.research.google.com/github/Confirmation-Bias-Analyser/Confirmation-Bias-Model/blob/main/Subjectivity_Detection_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install transformers

In [2]:
from transformers import BertTokenizer, TFBertForSequenceClassification, InputExample, InputFeatures
import tensorflow as tf
import pandas as pd
from google.colab import files, drive
drive.mount('/content/drive')

from sklearn.model_selection import train_test_split

# The shutil module offers a number of high-level 
# operations on files and collections of files.
import os
import shutil

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Obtain BERT Model

In [3]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [4]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
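
The `classifier (Dense)` row can be sanity-checked by hand. This is a minimal sketch, assuming the hidden size of 768 used by `bert-base-uncased` and the default of two output labels; neither number is printed in the summary itself.

```python
# Assumed values, not shown in the summary: BERT-base hidden size and default label count.
hidden_size = 768
num_labels = 2

# A Dense layer has one weight per (input, output) pair plus one bias per output.
classifier_params = hidden_size * num_labels + num_labels

# Parameter count of the `bert` layer, copied from the summary.
bert_params = 109_482_240
total_params = bert_params + classifier_params

print(classifier_params)  # 1538
print(total_params)       # 109483778
```

The match with the reported 1,538 parameters suggests the classification head emits two logits, one per class.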


## Obtain Dataset
Based on this dataset: https://www.cs.cornell.edu/people/pabo/movie-review-data/

In [5]:
URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz"

dataset = tf.keras.utils.get_file(fname="rotten_imdb.tar.gz", 
                                  origin=URL,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')

Downloading data from http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz


## Reorganise Datasets
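Read the objective (plot) and subjective (quote) sentence files from the extracted archive, split each into a list of sentences, and combine them into a single labelled DataFrame (0 = objective, 1 = subjective).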

In [6]:
# Objective sentences (plot summaries); the `with` block closes the file automatically
with open('/content/plot.tok.gt9.5000', 'r') as f_o:
  objective_data = f_o.read()

objective_data = objective_data.split(' \n')

# Subjective sentences (review quotes); this file is not valid UTF-8, hence ISO-8859-1
with open('/content/quote.tok.gt9.5000', 'r', encoding="ISO-8859-1") as f_s:
  subjective_data = f_s.read()

subjective_data = subjective_data.split(' \n')
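
Splitting on `' \n'` relies on every line ending with a space before the newline, and it can leave an empty string after the final sentence. As a hedged alternative, the hypothetical helper `split_sentences` below (not part of the original notebook) splits on newlines, trims whitespace, and drops empty entries. It may yield a slightly different row count than the cells above.

```python
def split_sentences(raw_text):
    """Return one stripped sentence per non-empty line of raw_text."""
    return [line.strip() for line in raw_text.split('\n') if line.strip()]

# Toy input that mimics the file layout: one tokenized sentence per line.
sample = "the movie begins in the past . \nit is a quiet film . \n"

print(sample.split(' \n'))      # ['the movie begins in the past .', 'it is a quiet film .', '']
print(split_sentences(sample))  # ['the movie begins in the past .', 'it is a quiet film .']
```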

In [7]:
objective_from_dict = {'DATA_COLUMN': objective_data}
objective_df = pd.DataFrame.from_dict(objective_from_dict)
objective_df['LABEL_COLUMN'] = 0

subjective_from_dict = {'DATA_COLUMN': subjective_data}
subjective_df = pd.DataFrame.from_dict(subjective_from_dict)
subjective_df['LABEL_COLUMN'] = 1

df = pd.concat([objective_df, subjective_df])
df.head()

| | DATA_COLUMN | LABEL_COLUMN |
|---|---|---|
| 0 | the movie begins in the past where a young boy... | 0 |
| 1 | emerging from the human psyche and showing cha... | 0 |
| 2 | spurning her mother's insistence that she get ... | 0 |
| 3 | amitabh can't believe the board of directors a... | 0 |
| 4 | she , among others excentricities , talks to a... | 0 |


### Train-Test Split

In [8]:
# Split the combined DataFrame into a training set and a
# test set (used later for validation) with an 80/20 split.

train, test, y_train, y_test = train_test_split(df, df['LABEL_COLUMN'], test_size=0.2, random_state=42)
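
The call above shuffles rows without regard to label, so the class ratio in `train` and `test` can drift slightly from the ratio in the full data. Below is a minimal pure-Python sketch of a stratified split. `stratified_split` is a hypothetical helper written for illustration, and the toy labels are made up. In scikit-learn, comparable behavior is requested through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) that keep each label's share roughly equal in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)

    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)                       # shuffle within each class
        n_test = round(len(idxs) * test_size)   # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 50 objective (0) and 50 subjective (1) examples.
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))        # 80 20
print(sum(labels[i] for i in test_idx))     # 10
```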

## View Train and Test Datasets
Using the Pandas library to view the datasets

In [9]:
print(len(train))
train.head()

7883


| | DATA_COLUMN | LABEL_COLUMN |
|---|---|---|
| 4923 | in 1975 , as the vietnam war was ending , thou... | 0 |
| 1745 | at the last minute a cyber-friend arrives to j... | 0 |
| 430 | it's endearing to hear madame d . refer to her... | 1 |
| 1452 | 70 years later , these appliances were condemn... | 0 |
| 101 | arjun ( akshay kumar ) is a high profile vigil... | 0 |


## Data Processing
Use the `InputExample` class from Transformers to process the data.

Create two main functions:

1. `convert_data_to_examples`: This function accepts our train and test datasets and converts each row into an InputExample object.

2. `convert_examples_to_tf_dataset`: This function tokenizes the InputExample objects, builds the input format the model requires from the tokenized objects, and finally creates an input dataset that we can feed to the model.

In [10]:
InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)

InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

In [11]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples
  
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", # pads to the right up to max_length (replaces the deprecated pad_to_max_length=True)
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )

DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'
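
To make the padding and masking step concrete, here is a minimal pure-Python sketch of what fixed-length encoding produces. `pad_and_mask` is a hypothetical helper, the token IDs are illustrative stand-ins, and a pad ID of 0 is assumed. A real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching masks."""
    ids = token_ids[:max_length]                     # truncate if too long
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad               # pad on the right
    attention_mask = [1] * len(ids) + [0] * n_pad    # 1 = real token, 0 = padding
    token_type_ids = [0] * max_length                # single-sentence input: all segment 0
    return input_ids, attention_mask, token_type_ids

# Illustrative IDs standing in for "[CLS] hello world [SEP]".
input_ids, attention_mask, token_type_ids = pad_and_mask([101, 7592, 2088, 102], max_length=8)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Every example ends up with the same length, which is what lets the dataset be batched without further padding.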

In [12]:
# train_InputExamples and validation_InputExamples refer to the train and test data respectively
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)
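
The batching pipeline determines how many optimizer steps one Keras epoch contains. This is a rough calculation, assuming the 7883 training rows printed earlier and the `epochs=3` used in the training cell.

```python
import math

num_train_examples = 7883  # len(train) printed earlier
batch_size = 32            # from .batch(32)
repeats = 2                # from .repeat(2)
epochs = 3                 # value passed to model.fit

# The final batch of each pass is partial (7883 is not a multiple of 32).
batches_per_pass = math.ceil(num_train_examples / batch_size)
steps_per_epoch = batches_per_pass * repeats
total_steps = steps_per_epoch * epochs

print(batches_per_pass, steps_per_epoch, total_steps)  # 247 494 1482
```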



## Model Training
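Train for 3 Keras epochs. Because the training dataset was built with `.repeat(2)`, each epoch makes two passes over the training data.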

In [13]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=3, validation_data=validation_data, verbose=1)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f8969c9ec50>
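
The loss used above takes raw logits and an integer label. As a minimal sketch of the per-example quantity being minimised (not the Keras implementation itself), the hypothetical function below computes the negative log of the softmax probability assigned to the true class.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Made-up logits for a two-class problem (0 = objective, 1 = subjective).
uniform_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0], 1)     # undecided model
confident_loss = sparse_categorical_crossentropy_from_logits([-4.0, 4.0], 1)  # confident and right
wrong_loss = sparse_categorical_crossentropy_from_logits([-4.0, 4.0], 0)      # confident and wrong

print(round(uniform_loss, 4))  # 0.6931
print(confident_loss < uniform_loss < wrong_loss)  # True
```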

In [14]:
saved_path = '/content/drive/MyDrive/Final Year Project/Key Notebooks/Confirmation Bias Analyser/'

model.save_pretrained(saved_path + 'saved_subjectivity_model')
tokenizer.save_pretrained(saved_path + 'subjectivity_tokenizer')

('/content/drive/MyDrive/Final Year Project/Key Notebooks/Confirmation Bias Analyser/subjectivity_tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Final Year Project/Key Notebooks/Confirmation Bias Analyser/subjectivity_tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Final Year Project/Key Notebooks/Confirmation Bias Analyser/subjectivity_tokenizer/vocab.txt',
 '/content/drive/MyDrive/Final Year Project/Key Notebooks/Confirmation Bias Analyser/subjectivity_tokenizer/added_tokens.json')

In [15]:
new_model = TFBertForSequenceClassification.from_pretrained(saved_path + 'saved_subjectivity_model')
new_model.summary()

Some layers from the model checkpoint at /content/drive/MyDrive/Final Year Project/Key Notebooks/Confirmation Bias Analyser/saved_subjectivity_model were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at /content/drive/MyDrive/Final Year Project/Key Notebooks/Confirmation Bias Analyser/saved_subjectivity_model.
If your task is similar to the task the model of the checkpoint was t

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_75 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
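
Once the model is reloaded, its two output logits still have to be mapped back to a class. The sketch below shows that last step in pure Python, assuming the label convention from the DataFrame cells (0 = objective, 1 = subjective). `logits_to_prediction` is a hypothetical helper and the logits are made up.

```python
import math

# Label convention from the DataFrame cells: 0 = objective, 1 = subjective.
LABEL_NAMES = {0: "objective", 1: "subjective"}

def logits_to_prediction(logits):
    """Return (label_name, probability) for the most likely class."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]

label, confidence = logits_to_prediction([-1.2, 2.3])  # made-up logits
print(label, round(confidence, 3))  # subjective 0.971
```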
