We will fine-tune a pre-trained BERT transformer model for topic analysis using the Hugging Face Transformers `Trainer` with PyTorch.

# Step 1: Install And Import Python Libraries

In step 1, we will install and import the Python libraries.

Firstly, let's install `transformers`, `datasets`, and `evaluate`.

In [None]:
# Install libraries (transformers is pinned; a single command avoids a redundant reinstall)
!pip install transformers==4.28.0 datasets evaluate

In [None]:
# We need to import the dataset; we will try to load this dataset from a URL
# Data processing
import pandas as pd
import numpy as np

# Modeling
import tensorflow as tf
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback, TextClassificationPipeline

# Hugging Face Dataset
from datasets import Dataset

# Model performance evaluation
import evaluate

# Step 2: Download And Read Data

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

In [None]:
# Read in data
amz_review = pd.read_csv('political_social_media/amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Take a look at the data
amz_review.head()

In [None]:
# Get the dataset information
amz_review.info()

A label value of 0 represents a negative review and a label value of 1 represents a positive review. The dataset has 500 positive reviews and 500 negative reviews. Because it is balanced, accuracy is a suitable metric for evaluating model performance.
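To illustrate why class balance matters for the choice of metric, here is a minimal sketch in plain Python. The toy label lists and the `accuracy` helper are assumptions made up for illustration and are not part of the review dataset. It compares a trivial classifier that always predicts the negative class on balanced and on imbalanced labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical balanced labels: 5 negative (0) and 5 positive (1)
balanced_true = [0] * 5 + [1] * 5

# Hypothetical imbalanced labels: 9 negative (0) and 1 positive (1)
imbalanced_true = [0] * 9 + [1]

# A trivial "model" that always predicts the negative class
always_negative = [0] * 10

balanced_acc = accuracy(balanced_true, always_negative)
imbalanced_acc = accuracy(imbalanced_true, always_negative)

print(f'Accuracy on balanced labels: {balanced_acc}')
print(f'Accuracy on imbalanced labels: {imbalanced_acc}')
```

On the balanced labels the trivial classifier scores no better than chance, while on the imbalanced labels it looks deceptively strong. That is why accuracy is easier to trust when the classes are evenly represented.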

In [None]:
# Check the label distribution
amz_review['label'].value_counts()

# Step 3: Train Test Split

In [None]:
# Training dataset
train_data = amz_review.sample(frac=0.8, random_state=42)

# Testing dataset
test_data = amz_review.drop(train_data.index)

# Check the number of records in the training and testing datasets
print(f'The training dataset has {len(train_data)} records.')
print(f'The testing dataset has {len(test_data)} records.')
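The split above relies on `sample` to pick 80% of the rows and on `drop` to keep the remainder. The sketch below repeats the same pattern on a small made-up DataFrame (the `toy` data is hypothetical, not the review data) to confirm that the two parts do not overlap and together cover every row. Note that this kind of split is random rather than stratified, so the label proportions in each part can differ slightly from those of the full dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the review DataFrame
toy = pd.DataFrame({
    'review': [f'review {i}' for i in range(10)],
    'label': [0, 1] * 5,
})

# Same pattern as above: sample 80% for training, drop those rows for testing
toy_train = toy.sample(frac=0.8, random_state=42)
toy_test = toy.drop(toy_train.index)

# The two index sets should be disjoint and together cover all rows
overlap = set(toy_train.index) & set(toy_test.index)
covered = set(toy_train.index) | set(toy_test.index)

print(f'Train rows: {len(toy_train)}, test rows: {len(toy_test)}')
print(f'Overlapping rows: {len(overlap)}')
print(f'All rows covered: {covered == set(toy.index)}')
```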

# Step 4: Convert Pandas Dataframe to Hugging Face Dataset

In [None]:
# Convert the pandas DataFrames to Hugging Face Arrow datasets
hg_train_data = Dataset.from_pandas(train_data)
hg_test_data = Dataset.from_pandas(test_data)

In [None]:
# Length of the Dataset
print(f'The length of hg_train_data is {len(hg_train_data)}.\n')

# Check one review
hg_train_data[0]

In [None]:
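# Validate the record against the pandas DataFrame, using the original row index
# that Dataset.from_pandas keeps in the `__index_level_0__` field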
amz_review.iloc[[521]]

# Step 5: Tokenize Text

In [None]:
# Tokenizer from a pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Take a look at the tokenizer
tokenizer
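BERT tokenizers split text into subword units with the WordPiece scheme, which greedily matches the longest vocabulary entry from left to right and marks word-internal pieces with a `##` prefix. The following is a simplified sketch of that matching step only. The `wordpiece_tokenize` function and `toy_vocab` are hypothetical names invented for illustration, the vocabulary is not the real `bert-base-cased` vocabulary, and the real tokenizer also handles whitespace, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first subword split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix
                piece = '##' + piece
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            # No vocabulary entry matches: the whole word is unknown
            return [unk_token]
        tokens.append(current)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {'play', '##ing', '##ed', 'un', '##play', '##able'}

for word in ['playing', 'unplayable', 'xyz']:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

With the toy vocabulary, `playing` is split into `play` and `##ing`, while a word with no matching pieces falls back to the unknown token.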