<p style="background-color:firebrick;font-family:newtimeroman;font-size:200%;color:white;text-align:center;border-radius:60px 20px;"><b>NLP - Bert Binary Classification</b></p>

**If your machine does not have a compatible GPU, I recommend running this notebook on Kaggle.**

# <font color='firebrick'> <b>Loading The IMDB Dataset</b><font color='black'>  

[Hugging Face - IMDB Dataset Link](https://huggingface.co/datasets/stanfordnlp/imdb)

In [1]:
from datasets import load_dataset

imdb = load_dataset("imdb")


In [2]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
imdb["train"][0]    # inspect the first training example

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# <font color='firebrick'> <b>Data Preprocessing</b><font color='black'>  

* Each model has its own unique tokenization (conversion to numerical representations) method.
* In this study, we will use the DistilBERT model.
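* DistilBERT is a smaller, faster, and lighter version of BERT obtained through knowledge distillation. It has half as many Transformer layers (6 instead of 12).
* BERT-base has about 110 million parameters, while DistilBERT has about 66 million (roughly 40% fewer). Its authors report that it retains about 97% of BERT’s language-understanding performance.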
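As a quick sanity check on those figures, the arithmetic below compares the two parameter counts quoted above. The variable names are illustrative only.

```python
# Parameter counts quoted above, in millions.
bert_params_m = 110
distilbert_params_m = 66

ratio = distilbert_params_m / bert_params_m   # fraction of BERT's size that remains
reduction = 1 - ratio                         # relative reduction in size

print(f"DistilBERT keeps {ratio:.0%} of BERT's parameters ({reduction:.0%} fewer).")
```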

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


* A model must be used with the tokenizer it was trained with. `AutoTokenizer` selects the right tokenizer class from the checkpoint name and downloads its files automatically.

In [5]:
print("Vocab size: ", tokenizer.vocab_size)              # number of tokens in the tokenizer's vocabulary
print("Model max size: ", tokenizer.model_max_length)    # maximum number of tokens per input

Vocab size:  30522
Model max size:  512


In [6]:
# An example tokenizer operation

text = "I love this movie. This is an awesome film."
encoded_text = tokenizer(text)
print(encoded_text)

{'input_ids': [101, 1045, 2293, 2023, 3185, 1012, 2023, 2003, 2019, 12476, 2143, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


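* 101 is the id of the special `[CLS]` token that marks the start of the input, and 102 is the id of the `[SEP]` token that marks its end.
* The **attention_mask** tells the model which tokens to attend to (1) and which to ignore (0). Ignored positions are padding, so a single unpadded sentence has a mask of all ones.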
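The mask only becomes interesting once sequences of different lengths are batched together. The sketch below is a pure-Python illustration, not the tokenizer's own code: `pad_batch` is a hypothetical helper, and `pad_id=0` assumes the `[PAD]` id of the uncased BERT vocabulary. The token ids are taken from the output printed above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1045, 2293, 2023, 3185, 1012, 102],   # [CLS] i love this movie . [SEP]
    [101, 12476, 2143, 102],                    # [CLS] awesome film [SEP]
])
print(batch["input_ids"][1])        # shorter sequence, padded with zeros
print(batch["attention_mask"][1])   # zeros mark the padded positions
```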

In [7]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)     # tokenize the text, cutting it off at the model's maximum length

* With **truncation=True**, any input longer than the model’s maximum length (512 tokens here) is cut off, so only the first 512 tokens, special tokens included, are sent to the model.
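The effect of truncation can be sketched in plain Python. This is an illustration under the assumption of a single input sequence truncated on the right. `truncate_with_special_tokens` is a hypothetical helper and the token ids are made up.

```python
def truncate_with_special_tokens(token_ids, max_length, cls_id=101, sep_id=102):
    """Keep the first tokens that fit, then wrap them in [CLS] ... [SEP]."""
    budget = max_length - 2            # two positions are reserved for [CLS] and [SEP]
    return [cls_id] + token_ids[:budget] + [sep_id]


review_ids = list(range(1000, 1600))   # 600 made-up token ids standing in for a long review
model_input = truncate_with_special_tokens(review_ids, max_length=512)

print(len(review_ids), "->", len(model_input))
```

Everything past the first 510 content tokens is dropped, so very long reviews are classified from their opening part only.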

In [8]:
preprocess_function(imdb["train"][0])     # sample data output with function

{'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 383

In [9]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)      # Apply the function to every split of the dataset, in batches


In [10]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, 
                                        return_tensors="tf")

* `DataCollatorWithPadding` pads the examples in each batch to a common length when the batch is assembled. By default this is the length of the longest example in that batch, not the 512-token maximum, which avoids wasting computation on padding.
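To see what padding per batch saves compared with padding every input to 512 tokens, the sketch below counts the tokens the model would receive in each case. It assumes `DataCollatorWithPadding` is left at its default of padding to the longest example in the batch. `padded_tokens` is a hypothetical helper and the lengths are made up.

```python
def padded_tokens(lengths, batch_size, pad_to=None):
    """Count the tokens fed to the model after padding each batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Pad to a fixed length if given, otherwise to the longest example in the batch.
        target = pad_to if pad_to is not None else max(batch)
        total += target * len(batch)
    return total


lengths = [120, 95, 87, 300, 45, 210, 512, 64]   # made-up token counts for 8 reviews

dynamic = padded_tokens(lengths, batch_size=4)             # pad to the longest per batch
fixed = padded_tokens(lengths, batch_size=4, pad_to=512)   # always pad to 512

print(dynamic, fixed)
```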

# <font color='firebrick'> <b>Setting The Model Metric</b><font color='black'>  

* To evaluate our model, we load the necessary library and set up our metrics.

In [11]:
!pip install -q evaluate 



In [12]:
import evaluate

accuracy = evaluate.load("accuracy")


In [13]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
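The function above takes the model's raw scores, picks the highest-scoring class for each example, and compares the result with the true labels. The same computation can be checked by hand with NumPy. The logits and labels below are made up for illustration.

```python
import numpy as np

# One row per example, one column per class (0 = NEGATIVE, 1 = POSITIVE).
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 1.5,  1.7],
                   [ 0.2, -0.2]])
labels = np.array([0, 1, 0, 0])

predictions = np.argmax(logits, axis=1)                  # index of the largest score per row
manual_accuracy = float((predictions == labels).mean())  # share of correct predictions

print(predictions, manual_accuracy)
```

Three of the four made-up examples are classified correctly, so the accuracy is 0.75.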

# <font color='firebrick'> <b>Model Training</b><font color='black'>  

* The two dictionaries below map class ids to readable label names and back, so the model’s predictions can be reported as `NEGATIVE` or `POSITIVE` instead of 0 or 1.

In [14]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [15]:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16   # Number of examples per minibatch.
num_epochs = 5    # Used only to size the learning-rate schedule; note that model.fit below is called with epochs=3.
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size    # Minibatches processed in one epoch.
total_train_steps = int(batches_per_epoch * num_epochs)           # Total optimizer steps: minibatches per epoch times number of epochs.
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

* These hyperparameters are starting points and should be tuned for the dataset at hand. `total_train_steps` matters because the learning-rate schedule returned by `create_optimizer` is defined over that many steps.
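The step arithmetic and the shape of the schedule can be reproduced without TensorFlow. The sketch assumes that `create_optimizer` is used with its default settings, which I understand to be a linear decay from `init_lr` to zero, and that there is no warmup, as in the call above. `linear_decay_lr` is a hypothetical helper, not part of the library.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates under a linear decay to zero (no warmup)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


train_size, batch_size, num_epochs = 25_000, 16, 5
batches_per_epoch = train_size // batch_size          # full minibatches per epoch
total_train_steps = batches_per_epoch * num_epochs    # steps the schedule is spread over

for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step, 2e-5, total_train_steps))
```

If training stops before `total_train_steps` is reached, the learning rate never decays all the way to zero.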

In [16]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

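* `num_labels=2` replaces the pretraining head with a new two-class classification head. Its weights (`pre_classifier` and `classifier` in the message above) are randomly initialized, so the model has to be fine-tuned before its predictions are meaningful.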
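A rough NumPy sketch of what such a classification head computes is shown below. It assumes a hidden size of 768 and a head made of one dense layer with ReLU followed by a dense output layer, with dropout omitted. All weights and inputs are random placeholders, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 2

# Randomly initialized head weights (placeholder values, for shape illustration only).
w_pre = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
b_pre = np.zeros(hidden_size)
w_cls = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b_cls = np.zeros(num_labels)

# Stand-in for the encoder's output at the first ([CLS]) position of one input.
first_token_state = rng.normal(size=(1, hidden_size))

pre = np.maximum(first_token_state @ w_pre + b_pre, 0.0)   # dense layer followed by ReLU
logits = pre @ w_cls + b_cls                               # one raw score per label

print(logits.shape)
```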

In [17]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

* The calls above convert the tokenized splits into `tf.data.Dataset` objects that Keras can train on. The data collator pads each batch as it is assembled, and only the training set is shuffled.

In [18]:
model.compile(optimizer=optimizer)  # No loss is passed: Transformers TF models fall back to their built-in task loss

In [19]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, 
                                      eval_dataset=tf_validation_set)

In [20]:
model.fit(x=tf_train_set, 
          validation_data=tf_validation_set, 
          epochs=3, 
          callbacks=[metric_callback])   # Keras expects a list of callbacks

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7aa60a1f0250>

# <font color='firebrick'> <b>Prediction</b><font color='black'>  

In [21]:
text = "I was never bored watching the movie. The quality of the actors was very good."

In [22]:
inputs = tokenizer(text, return_tensors="tf")   # Convert the text to numerical representations
inputs

{'input_ids': <tf.Tensor: shape=(1, 19), dtype=int32, numpy=
array([[  101,  1045,  2001,  2196, 11471,  3666,  1996,  3185,  1012,
         1996,  3737,  1997,  1996,  5889,  2001,  2200,  2204,  1012,
          102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 19), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
      dtype=int32)>}

In [23]:
logits = model(**inputs).logits       # Forward pass; logits are the raw, unnormalized scores for each class
logits

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-3.0632768,  2.9883244]], dtype=float32)>

In [24]:
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]

'POSITIVE'
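`argmax` returns only the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below uses NumPy and the logit values printed earlier in this section.

```python
import numpy as np

logits = np.array([-3.0632768, 2.9883244])   # logit values printed earlier in this section

shifted = logits - logits.max()              # subtract the max for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()

labels = ["NEGATIVE", "POSITIVE"]
for name, p in zip(labels, probs):
    print(f"{name}: {p:.4f}")
```

The probabilities sum to 1, and almost all of the mass falls on `POSITIVE`, which matches the predicted label.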

In [25]:
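def predict_class(text):
    """Return the predicted label ("NEGATIVE" or "POSITIVE") for a single text."""
    inputs = tokenizer(text, return_tensors="tf")                  # text -> token ids
    logits = model(**inputs).logits                                # raw class scores
    predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])   # highest-scoring class
    return model.config.id2label[predicted_class_id]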

In [26]:
result = predict_class("I didn't get anything out of the movie. The script wasn't good enough.")
print(result)

NEGATIVE


<p style="background-color:firebrick;font-family:newtimeroman;font-size:200%;color:white;text-align:center;border-radius:60px 20px;"><b>THANK YOU!</b></p>