# **AG-News Multi-class Text Classification with Deep Learning using BERT**

In this project, we look in depth at how the BERT base model is applied to text classification. We will see how this Transformer model can reach high classification performance on a sizable corpus of more than 100,000 labelled training instances. Building, training, and fine-tuning BERT for classification on this dataset is done with the Hugging Face `transformers` and `datasets` libraries, together with ktrain (a high-level Python wrapper with a TensorFlow backend).

**Installing Libraries**

In [None]:
!pip install ktrain
!pip install transformers  # Hugging Face library of pre-trained transformer models (BERT, GPT, RoBERTa, and more)
!pip install datasets  # Hugging Face library giving easy access to a large collection of NLP datasets
!pip install tensorflow

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ktrain
from ktrain import text  # utilities for preprocessing text data and building text classification models
from sklearn.model_selection import train_test_split
from datasets import list_datasets
from datasets import load_dataset
import timeit
import tensorflow as tf

In [None]:
print("TensorFlow version : ", tf.__version__)
# list_physical_devices returns a list that is empty when no GPU is visible
print("GPU available : ", bool(tf.config.list_physical_devices('GPU')))
print("GPU name : ", tf.test.gpu_device_name())

**Checking the datasets available on Hugging Face:**

In [None]:
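# Note: newer releases of `datasets` may no longer export list_datasets;
# huggingface_hub.list_datasets is the documented alternative in that case.
available_datasets = list_datasets()
print("Count of available datasets : ", len(available_datasets))
print()
print("<====== Dataset List ======> :\n")
print('\n  |__ '.join(available_datasets))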

In [None]:
ag_news_dataset = load_dataset('ag_news') #import ag_news dataset
print("\n", ag_news_dataset)

In [None]:
print("Dataset Items: \n", ag_news_dataset.items())
print("\nDataset type: \n", type(ag_news_dataset))
print("\nShape of dataset: \n", ag_news_dataset.shape)
print("\nNo of rows: \n", ag_news_dataset.num_rows)
print("\nNo of columns: \n", ag_news_dataset.num_columns)

The AG News dataset is a widely used benchmark for text classification. It is drawn from AG's corpus, a collection of news articles gathered from the web. The dataset is designed for multi-class classification: each news article is assigned to exactly one of four classes representing different news categories.

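Number of classes: 4

Class labels: the dataset has four class labels representing news categories (in the Hugging Face version, the `label` column stores them as the integers 0 to 3, in the same order):

• Class 1: World news

• Class 2: Sports news

• Class 3: Business news

• Class 4: Science and technology news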
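
Because the `label` column stores integers, a small lookup table is enough to turn them back into category names. The following is a minimal sketch, assuming the zero-based ordering of the Hugging Face copy of the dataset; the helper `label_to_name` and the sample labels are illustrative and not part of any library.

```python
# Assumed zero-based label order of the Hugging Face `ag_news` dataset.
CLASS_LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def label_to_name(label_id: int) -> str:
    """Return the category name for an integer label (illustrative helper)."""
    if not 0 <= label_id < len(CLASS_LABEL_NAMES):
        raise ValueError(
            f"label must be in 0..{len(CLASS_LABEL_NAMES) - 1}, got {label_id}"
        )
    return CLASS_LABEL_NAMES[label_id]


# Hypothetical labels, as they might appear in the `label` column.
sample_labels = [2, 0, 3, 1]
sample_names = [label_to_name(i) for i in sample_labels]
print(sample_names)
```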

In [None]:
print("\nColumn Names: \n", ag_news_dataset.column_names)
print("\n", ag_news_dataset.data)

In [None]:
print(ag_news_dataset['train'][0])
print(ag_news_dataset['train'][1])

In [None]:
train_split = ag_news_dataset['train']
for index in (0, 35000, 60000, 100000):
    example = train_split[index]  # fetch a single row instead of a whole column
    print(example['text'])
    print(example['label'])
    print()

**Loading train and test datasets**

In [None]:
ag_news_train = load_dataset('ag_news', split='train')
ag_news_test = load_dataset('ag_news', split='test')
print("Train Dataset : ", ag_news_train.shape)
print("Test Dataset : ", ag_news_test.shape)

In [None]:
print(ag_news_train[0])
print(ag_news_test[0])

In [None]:
print("\nTrain Dataset Features: \n", ag_news_train.features)
print("\nTest Dataset Features: \n", ag_news_test.features)

In [None]:
pd.set_option('display.max_columns', None)
ag_news_train_df = pd.DataFrame(data=ag_news_train)
ag_news_train_df.head(10)

In [None]:
ag_news_train_df.tail(10)

In [None]:
ag_news_test_df = pd.DataFrame(data=ag_news_test)
ag_news_test_df.head(10)

In [None]:
ag_news_test_df.tail(10)
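
Before training, it can help to check how the examples are spread across the four classes. On the real DataFrame this is typically a one-liner such as `ag_news_train_df['label'].value_counts()`. The sketch below shows the same idea in plain Python on a made-up list of labels; `toy_labels` and `class_distribution` are illustrative names, and the numbers are not real AG News statistics.

```python
from collections import Counter

CLASS_LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

# Toy stand-in for the `label` column; not real AG News data.
toy_labels = [0, 1, 2, 3, 0, 1, 2, 3, 2, 2]


def class_distribution(labels):
    """Map each class name to (count, share of all rows)."""
    counts = Counter(labels)
    total = len(labels)
    return {
        CLASS_LABEL_NAMES[label]: (counts[label], counts[label] / total)
        for label in sorted(counts)
    }


distribution = class_distribution(toy_labels)
for name, (count, share) in distribution.items():
    print(f"{name:<10} {count:>3} {share:.0%}")
```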

**Preprocess Data:**

In [None]:
class_label_names = ['World', 'Sports', 'Business', 'Sci/Tech']

`(X_train, y_train)`: the preprocessed training data, where `X_train` holds the encoded input text sequences and `y_train` holds the corresponding labels.

`(X_test, y_test)`: the preprocessed test data, where `X_test` holds the encoded input text sequences and `y_test` holds the corresponding labels.

`preprocessing_var`: the preprocessing object, which keeps information such as the tokenizer used on the text and the mapping between tokens and IDs. It is needed again later to build the model and the predictor.
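
To make these outputs more concrete, here is a toy sketch of what BERT-style preprocessing produces: fixed-length lists of token IDs and one-hot label vectors. The whitespace tokenizer, the tiny vocabulary `toy_vocab`, and the helper names are assumptions made for illustration only; ktrain relies on BERT's WordPiece tokenizer and its own internals, and its exact output format may differ.

```python
# Hypothetical vocabulary; the real BERT vocabulary is far larger.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "stocks": 4, "rise": 5, "team": 6, "wins": 7}


def encode(text, maxlen):
    """Whitespace-tokenize, add [CLS]/[SEP], then truncate or pad to maxlen."""
    tokens = ["[CLS]"] + text.lower().split()[: maxlen - 2] + ["[SEP]"]
    ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    return ids + [toy_vocab["[PAD]"]] * (maxlen - len(ids))


def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows."""
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in labels]


texts = ["Stocks rise sharply", "Team wins"]
labels = [2, 1]  # Business, Sports

X_toy = [encode(t, maxlen=6) for t in texts]
y_toy = one_hot(labels, num_classes=4)
print(X_toy)
print(y_toy)
```

Every row of `X_toy` has the same length regardless of the input text, which is what `maxlen` controls in the real preprocessing call.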

In [None]:
(X_train, y_train), (X_test, y_test), preprocessing_var = text.texts_from_df(train_df=ag_news_train_df,
                                                                             text_column='text',
                                                                             label_columns='label',
                                                                             val_df=ag_news_test_df,
                                                                             maxlen=512,
                                                                             preprocess_mode='bert')

**Creating the BERT model:**

In [None]:
transformer_bert_model = text.text_classifier(name='bert',
                                              train_data=(X_train, y_train),
                                              preproc=preprocessing_var)

In [None]:
transformer_bert_model.layers

**Wrap the BERT model and data in a Learner object:**

In [None]:
bert_learner = ktrain.get_learner(model=transformer_bert_model,
                                  train_data=(X_train, y_train),
                                  val_data=(X_test, y_test),
                                  batch_size=6)

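Hyperparameter values recommended by the BERT authors for fine-tuning:

• Batch size: 16, 32

• Learning rate: 5e-5, 3e-5, 2e-5

• Number of epochs: 2, 3, 4

Note that the learner above uses a batch size of 6, smaller than the recommended values, presumably so that sequences of up to 512 tokens fit in GPU memory.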
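
The cost of each setting is easier to compare as a count of optimizer steps. The sketch below enumerates the grid and computes the steps per epoch for each batch size. It assumes the AG News training split has 120,000 examples; the constant `NUM_TRAIN_EXAMPLES` and the helper `steps_per_epoch` are illustrative names.

```python
import math
from itertools import product

NUM_TRAIN_EXAMPLES = 120_000  # assumed size of the AG News training split

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


grid = list(product(batch_sizes, learning_rates, epoch_counts))
print(f"{len(grid)} combinations in the grid")

# 6 is the batch size passed to get_learner in this notebook.
for batch_size in [6] + batch_sizes:
    steps = steps_per_epoch(NUM_TRAIN_EXAMPLES, batch_size)
    print(f"batch_size={batch_size:>2}: {steps:,} steps per epoch")
```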

**Train BERT on AG-News dataset:**

In [None]:
training_start_time = timeit.default_timer()
bert_learner.fit_onecycle(lr=2e-5, epochs=3)
training_stop_time = timeit.default_timer()
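
`fit_onecycle` trains with a version of the 1cycle policy, in which the learning rate rises to the given `lr` and then falls again over the course of training. The following is a simplified triangular sketch of that idea, not ktrain's actual implementation; the function `one_cycle_lr`, the starting point of `max_lr / 10`, and the step count are assumptions chosen for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: ramp from max_lr/div_factor up to max_lr, then back."""
    base_lr = max_lr / div_factor
    half = total_steps / 2
    # Fraction of the way towards the peak: 0 at both ends, 1 at the midpoint.
    position = step / half if step <= half else (total_steps - step) / half
    return base_lr + (max_lr - base_lr) * position


total_steps = 60_000  # illustrative figure, e.g. 3 epochs of 20,000 steps
for step in (0, 15_000, 30_000, 45_000, 60_000):
    lr = one_cycle_lr(step, total_steps, max_lr=2e-5)
    print(f"step {step:>6}: lr = {lr:.2e}")
```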

In [None]:
print("Total training time in minutes: \n", (training_stop_time - training_start_time)/60)
print("Total training time in hours: \n", (training_stop_time - training_start_time)/3600)

**Metrics for evaluating BERT performance:**

In [None]:
bert_learner.validate()

In [None]:
bert_learner.validate(class_names=class_label_names)
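
As described in the ktrain documentation, `validate` prints a classification report and returns a confusion matrix. As a sketch of what those numbers mean, the code below computes a confusion matrix and per-class precision, recall, and F1 from scratch on a handful of made-up predictions. The data and function names are illustrative, and this is not ktrain's implementation.

```python
CLASS_LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts examples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


def per_class_metrics(matrix):
    """Return {class index: (precision, recall, f1)} from a confusion matrix."""
    metrics = {}
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = (precision, recall, f1)
    return metrics


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 2, 1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true, y_pred, num_classes=4)
scores = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(4)) / len(y_true)

for k, (p, r, f1) in scores.items():
    print(f"{CLASS_LABEL_NAMES[k]:<10} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```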

**Saving the model:**

In [None]:
bert_predictor = ktrain.get_predictor(bert_learner.model, preproc=preprocessing_var)
bert_predictor.get_classes()

In [None]:
bert_predictor.save('/content/bert-ag-news-predictor')

In [None]:
!zip -r /content/bert-ag-news-predictor.zip /content/bert-ag-news-predictor

**Reloading the model:**

In [None]:
bert_predictor_2 = ktrain.load_predictor('/content/bert-ag-news-predictor')
bert_predictor_2.get_classes()
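
Once reloaded, the predictor can classify new text with its `predict` method, optionally with `return_proba=True` to obtain class probabilities. Running that requires the trained model, so the sketch below only illustrates the final step on a made-up probability vector: picking the most likely class and its confidence. The function `top_class`, the class name order, and the numbers are hypothetical.

```python
# Assumed class order; the names returned by get_classes() may differ.
CLASS_LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def top_class(probabilities, class_names):
    """Return (class name, probability) for the highest-scoring class."""
    if len(probabilities) != len(class_names):
        raise ValueError("expected one probability per class")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]


# Made-up probabilities for a single headline; not produced by the model.
fake_probabilities = [0.03, 0.01, 0.91, 0.05]
name, confidence = top_class(fake_probabilities, CLASS_LABEL_NAMES)
print(f"Predicted: {name} ({confidence:.0%})")
```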