Loading the Dataset


Pre-processing the raw data


Getting BERT Pre-trained model and its tokenizer


Training and evaluation


Prediction Pipeline


# Loading the Dataset

In [40]:
import pandas as pd
df = pd.read_csv("bbc-text.csv")
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [41]:
df.isnull().sum()

category    0
text        0
dtype: int64

# Balacing Classes

In [42]:
# from imblearn.over_sampling import RandomOverSampler
# # Step 2: Initialize RandomOverSampler
# oversampler = RandomOverSampler(random_state=42)

# # Step 3: Fit RandomOverSampler to the data
# X = df['text'].values.reshape(-1, 1)  # Reshape to 2D array
# y = df['category']
# X_resampled, y_resampled = oversampler.fit_resample(X, y)

# # Step 4: Resample the data to balance the classes

# # Step 5: Split the resampled data into features and target columns
# df = pd.DataFrame({'text': X_resampled.flatten(), 'category': y_resampled})

In [43]:
df.shape

(2225, 2)

# Train and Test Set

In [44]:
from sklearn.model_selection import train_test_split


# Split the DataFrame into train and test sets
df_train, df_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)

# Display the first few rows of each DataFrame
print("Train Set:")
print(df_train.shape)

print("\nTest Set:")
print(df_test.shape)


Train Set:
(1780, 2)

Test Set:
(445, 2)


In [45]:
df_train['category'].value_counts()

sport            413
business         409
politics         334
tech             319
entertainment    305
Name: category, dtype: int64

# Cleaning Text

In [46]:
# import re
# import string
# from nltk.corpus import stopwords
# from nltk.stem import PorterStemmer
# from nltk.tokenize import word_tokenize

# def preprocess_text(text):
#     # Lowercase the text
#     text = text.lower()

#     # Remove numbers
#     text = re.sub(r'\d+', '', text)

#     # Remove punctuation
#     text = text.translate(str.maketrans('', '', string.punctuation))

#     # Remove URLs
#     text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

#     # Remove emails
#     text = re.sub(r'\S*@\S*\s?', '', text)

#     # Remove emojis
#     emoji_pattern = re.compile("["
#                                u"\U0001F600-\U0001F64F"  # emoticons
#                                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
#                                u"\U0001F680-\U0001F6FF"  # transport & map symbols
#                                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
#                                u"\U00002702-\U000027B0"
#                                u"\U000024C2-\U0001F251"
#                                "]+", flags=re.UNICODE)
#     text = emoji_pattern.sub(r'', text)

#     # Tokenize the text
#     tokens = word_tokenize(text)

#     # Remove stop words
#     stop_words = set(stopwords.words('english'))
#     tokens = [word for word in tokens if word not in stop_words]

#     # Stemming
#     stemmer = PorterStemmer()
#     tokens = [stemmer.stem(word) for word in tokens]

#     # Join tokens back into a single string
#     text = ' '.join(tokens)

#     return text


# df['clean_text'] = df['text'].apply(lambda x: preprocess_text(x))

# Converting our Target column into Categorical data

In [47]:
df_test['category'].value_counts()

business         101
sport             98
politics          83
tech              82
entertainment     81
Name: category, dtype: int64

In [48]:
df_test.shape

(445, 2)

In [49]:
encoded_dict = {"sport":0,"business":1, 'politics':2, "entertainment":3,'tech':4}
df_train['category'] = df_train['category'].map(encoded_dict)

In [50]:
df_test['category'] = df_test['category'].map(encoded_dict)

In [51]:
from keras.utils import to_categorical
# Convert the 'category' column to categorical values
y_train = to_categorical(df_train['category'])

In [52]:
y_test = to_categorical(df_test['category'])

In [53]:
y_train.shape, y_test.shape

((1780, 5), (445, 5))

# Loading Model and Tokenizer from the transformers package

In [54]:
from transformers import AutoTokenizer,TFBertModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert = TFBertModel.from_pretrained('bert-base-cased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [55]:
max_len_train = max(len(text) for text in df_train['text'])
max_len_test= max(len(text) for text in df_test['text'])

In [56]:
max_len_train, max_len_test

(19136, 25483)

# Input Data Modeling


Before training, we need to convert the input textual data into BERT’s input data format using a tokenizer.

Since we have loaded bert-base-cased, so tokenizer will also be Bert-base-cased.

In [73]:
 # Tokenize the input
max_len = 70
x_train = tokenizer(
    text=df_train.text.tolist(),
    add_special_tokens=True,
    max_length=max_len,
    truncation=True,
    padding=True,
    return_tensors='tf',
    return_token_type_ids=False,
    return_attention_mask=True,
    verbose=True
)
x_test = tokenizer(
    text=df_test.text.tolist(),
    add_special_tokens=True,
    max_length=max_len,
    truncation=True,
    padding=True,
    return_tensors='tf',
    return_token_type_ids=False,
    return_attention_mask=True,
    verbose=True
)

Tokenizer takes all the necessary parameters and returns tensor in the same format Bert accepts.

return_token_type_ids = False: token_type_ids is not necessary for our training in this case.


return_attention_mask = True we want to include attention_mask in our input.


return_tensors=’tf’: we want our input tensor for the TensorFlow model.


max_length=70:
we want the maximum length of each sentence to be 70; if a sentence is
bigger than this, it will be trimmed if a sentence is smaller than
70 then it will be padded.

add_special_tokens=True, CLS, SEP token will be added in the tokenization.


Hereafter data modelling, the tokenizer will return a dictionary (x_train) containing ‘Input_ids’, ‘attention_mask’ as key for their respective

In [74]:
input_ids = x_train['input_ids']
attention_mask = x_train['attention_mask']

In [75]:
print("input ids:",input_ids.shape)
print("attention_mask:", attention_mask.shape)

input ids: (1780, 70)
attention_mask: (1780, 70)


# Model Building

## Importing necessary libraries.



In [76]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense

In [77]:
# Define input layers
input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

input_ids.shape,input_mask.shape

(TensorShape([None, 70]), TensorShape([None, 70]))

In [78]:
# Define model architecture (assuming 'bert' is already defined)
embeddings = bert(input_ids, attention_mask=input_mask)[0]
out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(128, activation='relu')(out)
out = tf.keras.layers.Dropout(0.1)(out)
out = Dense(32, activation='relu')(out)
y = Dense(5, activation='sigmoid')(out)

# Create the model
model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=y)
model.layers[2].trainable = True

Bert layers accept three input arrays, input_ids, attention_mask, token_type_ids


input_ids means our input words encoding, then attention mask,


token_type_ids is necessary for the question-answering model; in this case, we will not pass token_type_ids.


For the Bert layer, we need two input layers, in this case, input_ids, attention_mask.


Embeddings contain hidden states of the Bert layer.
using


GlobalMaxPooling1D then dense layer to build CNN layers using hidden
states of Bert. These CNN layers will yield our output.
bert[0] is the last hidden state, bert[1] is the
pooler_output, for building CNN layers on top of the BERT layer, we have
used Bert’s hidden forms.

# Model Compilation
Defining learning parameters and compiling the model.

learning_rate = 5e-05 the learning rate for the model will be significantly lower.


Loss = CategoricalCrossentropy since we are passing the categorical data as the target.


Balanced accuracy will take care of our average accuracy for all the classes.

In [79]:
from keras.optimizers import Adam
from keras.optimizers.schedules import ExponentialDecay
from keras.losses import CategoricalCrossentropy
from keras.metrics import CategoricalAccuracy

# Define learning rate schedule
initial_learning_rate = 5e-05
lr_schedule = ExponentialDecay(
    initial_learning_rate,
    decay_steps=10000,  # Adjust this value according to your needs
    decay_rate=0.01,    # Adjust this value according to your needs
    staircase=True)
# Define optimizer, loss, and metrics
optimizer = Adam(
    learning_rate=lr_schedule,
    epsilon=1e-08,
    clipnorm=1.0
)
loss = CategoricalCrossentropy(from_logits=True)
metric = CategoricalAccuracy(name='balanced_accuracy')

# Compile the model
model.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metric
)

# Model Training

You have the model ready with x_train, y_train. You can now train the model.


Training and fine-tuning of the BERT model takes a bit longer time. so be Patience.

model.fit returns a history object which keeps all the training history.
x_test became a dictionary containing ‘input_ids’, ‘attention_mask‘ after pre-processing. We are passing input_ids and attention_mask for the training.
In the validation data, we are passing the test data.



In [85]:
train_history = model.fit(
    x={'input_ids': x_train['input_ids'], 'attention_mask': x_train['attention_mask']},
    y=y_train,
    validation_data=(
        {'input_ids': x_test['input_ids'], 'attention_mask': x_test['attention_mask']},
        y_test
    ),
    epochs=2,
    batch_size=36
)


Epoch 1/2
Epoch 2/2


# Model Evaluation

Testing our model on the test data.


In [86]:
predicted_raw = model.predict({'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']})
predicted_raw[0]



array([0.1674958 , 0.36198717, 0.9961494 , 0.23011458, 0.02928738],
      dtype=float32)

In [89]:
# Taking the index of value having maximum probability.
import numpy as np
y_predicted = np.argmax(predicted_raw, axis = 1)
y_true = df_test.category
y_true

414     2
420     1
1644    3
416     4
1232    0
       ..
741     1
205     1
1102    1
668     1
479     1
Name: category, Length: 445, dtype: int64

In [90]:
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_true, y_predicted))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99        98
           1       0.99      0.95      0.97       101
           2       0.94      0.99      0.96        83
           3       1.00      1.00      1.00        81
           4       1.00      0.99      0.99        82

    accuracy                           0.98       445
   macro avg       0.98      0.99      0.98       445
weighted avg       0.98      0.98      0.98       445



# Prediction Pipeline
Converting indexes back to the Sentiment label:

In [100]:
def predict(text):
    x_val = tokenizer(
        text=texts,
        add_special_tokens=True,
        max_length=70,
        truncation=True,
        padding='max_length',
        return_tensors='tf',
        return_token_type_ids=False,
        return_attention_mask=True,
        verbose=True
    )
    validation = model.predict({'input_ids': x_val['input_ids'], 'attention_mask': x_val['attention_mask']}) * 100

    # Create lists to store the results
    labels = []
    scores = []

    # Iterate over the predicted values and store them
    for key, value in zip(encoded_dict.keys(), validation[0]):
        labels.append(key)
        scores.append(value)

    # Get the predicted label with the highest score
    predicted_label = labels[scores.index(max(scores))]

    # Return the labels, scores, and the predicted label with its score
    return labels, scores, predicted_label, max(scores)


In [106]:
texts = 'input the textbrown and blair face new rift claims for the umpteenth time  tony blair and gordon brown are said to have declared all out war on each other.  this time the alleged rift is over who should take the credit for the government s global aid and debt initiatives  particularly in the wake of the tsunami disaster - an issue many hoped and believed was above such things. it dominated the prime minister s monthly news conference  which saw mr blair start in full irritation mode as he was forced to bat away question after question about his relationship with his neighbour. as he told journalists:  i am not interested in what goes in and out of newspapers. there is a complete unity of purpose.  and he again heaped praise on mr brown saying he was doing a great job  and would continue doing it - although he would not commit to any job for mr brown after the election.  so why did he arrange his press conference at the last moment so it coincided with mr brown s long-arranged keynote speech on aid and debt  he was asked  by now mr blair had moved from irritation mode to his barely disguised fury setting. he snapped back that the hacks knew very well what the operational reasons were for the timing of his press conference. well  not really  as it happens.  and he repeated what a great man gordon was and how united they were  before again sneering that he took absolutely no notice of what went in and out of the newspapers  preferring to get on with the job of doing the best for the country and the world. although in the next breath he declared:  i get increasingly alarmed by what i read in the newspapers  before catching himself on and quickly adding:  in so far as i read them of course.  he probably had good reason to be alarmed because the newspapers had been full of stories about the claimed open warfare between the two men.  as far as the timing of the prime minister s press conference is concerned  there are two options. the first is that it was a calculated attempt to upstage the chancellor and seize back the initiative on the big issue of the moment. if that is the case it suggests that even the fear of seriously negative newspaper headlines is not enough to stop the squabbling. the second option is that it was an unavoidable coincidence  which would suggest the government has lost its once-famed ability to strictly co-ordinate announcements - through the infamous downing street grid - to avert just such allegations.  either way  the effect was the same - to overshadow the big announcements of government policy on a hugely pertinent issue. and there had been previous suggestions that the new year had started with a fresh outbreak of the warfare between the two men. firstly  the prime minister insisted on wednesday that he had been intimately involved in the development of the proposals to get g8 countries to freeze debt repayments from the tsunami-hit countries. it was claimed he had been embarrassed by the fact that gordon brown appeared to have taken the initiative over the government s response to the disaster while mr blair was still on holiday in egypt.  then  as if to pour fuel on the flames  both men separately spoke about working on tsunami or wider aid and development policy with their cabinet colleagues foreign secretary jack straw  aid minister hilary benn and deputy prime minister john prescott - without mentioning the other. all this came amid fresh claims that mr brown was still seething that he had been excluded from a prominent role in general election planning and had  as a result  started to set out his own platform. the fact that he used an article in the guardian newspaper to set out what he believed  should  be in the manifesto  has embarked on a mini tour of britain to set out his aid plans and will next week visit africa on the same mission - often seen as the prime minister s  turf  - has only added to the impression of rival camps operating entirely independently of each other. the prime minister denied all that as well  repeating his insistence that it was inconceivable the economy and the chancellor would not be at the centre of the election campaign. but the big fear with many on the labour benches now is that  unless a lid can be put on the speculation over the rivalry  it may even threaten to undermine the election campaign itself.'

labels, scores, predicted_label, pred_prob = predict(texts)

for name, score in zip(labels,scores):
  print("Labels : ", name, " Scores : ", score)

print("====================================================")
print("predicted_label : ", predicted_label, " Scores : ", pred_prob)

Labels :  sport  Scores :  19.24559
Labels :  business  Scores :  29.926249
Labels :  politics  Scores :  99.592834
Labels :  entertainment  Scores :  24.017403
Labels :  tech  Scores :  2.6347737
predicted_label :  politics  Scores :  99.592834


In [108]:
texts="housewives lift channel 4 ratings the debut of us television hit desperate housewives has helped lift channel 4 s january audience share by 12% compared to last year.  other successes such as celebrity big brother and the simpsons have enabled the broadcaster to surpass bbc two for the first month since last july. bbc two s share of the audience fell from 11.2% to 9.6% last month in comparison with january 2004. celebrity big brother attracted fewer viewers than its 2002 series.  comedy drama desperate housewives managed to pull in five million viewers at one point during its run to date  attracting a quarter of the television audience. the two main television channels  bbc1 and itv1  have both seen their monthly audience share decline in a year on year comparison for january  while five s proportion remained the same at a slender 6.3%. digital multi-channel tv is continuing to be the strongest area of growth  with the bbc reporting freeview box ownership of five million  including one million sales in the last portion of 2004. its share of the audience soared by 20% in january 2005 compared with last year  and currently stands at an average of 28.6%."
labels, scores, predicted_label, pred_prob = predict(texts)

for name, score in zip(labels,scores):
  print("Labels : ", name, " Scores : ", score)

print("====================================================")
print("predicted_label : ", predicted_label, " Scores : ", pred_prob)

Labels :  sport  Scores :  7.9653106
Labels :  business  Scores :  5.940966
Labels :  politics  Scores :  1.3667592
Labels :  entertainment  Scores :  99.69849
Labels :  tech  Scores :  7.325931
predicted_label :  entertainment  Scores :  99.69849
