# Using PyTorch Lightning for Inference

In [1]:
import os
from module import SequenceClassificationModule

## Loading a model checkpoint with PyTorch Lightning

LightningModules can load checkpoints directly, either to continue training or to run inference. Here, we load the model so that we can use it to classify sequences of text.

In [2]:
checkpoints = os.listdir("checkpoints")
print(f"We will use the {checkpoints[0]} checkpoint for inference")

We will use the textattack-bert-base-uncased-yelp-polarity_1xNVIDIA-A10G_LR5e-05_BS16_2024-01-30T19:41:33.978407.ckpt checkpoint for inference
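`os.listdir` returns directory entries in arbitrary order, so `checkpoints[0]` is not guaranteed to be the same file across runs, or the most recent one, once the folder holds several checkpoints. A minimal sketch of a deterministic choice, assuming every filename ends with `_<ISO timestamp>.ckpt` like the one printed above; `checkpoint_timestamp` and `latest_checkpoint` are hypothetical helpers, not part of the notebook's `module`:

```python
import os
from datetime import datetime


def checkpoint_timestamp(filename: str) -> datetime:
    """Parse the ISO timestamp at the end of a checkpoint filename.

    Assumes names like "<run>_..._2024-01-30T19:41:33.978407.ckpt",
    i.e. the timestamp is the text after the last underscore.
    """
    stem = filename.removesuffix(".ckpt")
    return datetime.fromisoformat(stem.rsplit("_", 1)[-1])


def latest_checkpoint(directory: str = "checkpoints") -> str:
    """Return the path of the newest .ckpt file in `directory` (hypothetical helper)."""
    ckpts = [f for f in os.listdir(directory) if f.endswith(".ckpt")]
    if not ckpts:
        raise FileNotFoundError(f"no .ckpt files in {directory!r}")
    # Pick the file whose embedded timestamp is latest.
    return os.path.join(directory, max(ckpts, key=checkpoint_timestamp))
```

It could then be used as `SequenceClassificationModule.load_from_checkpoint(latest_checkpoint())`. Sorting on the embedded timestamp rather than file modification time keeps the choice stable if checkpoints are copied between machines.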


In [3]:
model = SequenceClassificationModule.load_from_checkpoint(f"checkpoints/{checkpoints[0]}")



Let's inspect the model's hyperparameters and other settings through the `config` attribute of the inner module (the underlying BERT model):

In [4]:
model.model.config

BertConfig {
  "_name_or_path": "textattack/bert-base-uncased-yelp-polarity",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "finetuning_task": "yelp_polarity",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.37.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
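The config values above largely determine the model's size. As a worked example, the parameter count can be derived from the printed fields. This is a sketch assuming the standard BERT layout (embeddings, encoder layers, pooler, with a bias on every dense layer, as in the Hugging Face implementation); the two-label classification head is an assumption, since `num_labels` is not printed in the config:

```python
def bert_parameter_count(vocab_size: int, hidden_size: int, num_hidden_layers: int,
                         intermediate_size: int, max_position_embeddings: int,
                         type_vocab_size: int) -> int:
    """Count trainable parameters in a BERT encoder plus pooler (no task head)."""
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h  # LayerNorm weight + bias

    # Word, position, and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm

    # Each encoder layer: Q, K, V and output projections (weights + biases) plus a
    # LayerNorm, then the feed-forward up/down projections plus a second LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    encoder = num_hidden_layers * (attention + feed_forward)

    pooler = h * h + h  # dense layer applied to the [CLS] token
    return embeddings + encoder + pooler


backbone = bert_parameter_count(vocab_size=30522, hidden_size=768, num_hidden_layers=12,
                                intermediate_size=3072, max_position_embeddings=512,
                                type_vocab_size=2)
num_labels = 2  # assumption: binary negative/positive head
classifier = 768 * num_labels + num_labels
print(backbone, backbone + classifier)  # 109482240 109483778
```

If the LightningModule wraps only this BERT classifier, `sum(p.numel() for p in model.parameters())` is a quick way to cross-check the figure.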

## Using LightningModule's .predict_step to classify a sequence

We know from our `visualizing_logs_metrics_cost.ipynb` notebook that the models should produce reasonably accurate results, as each model had a final validation accuracy of around 90%.

Below, we take known positive sequences from the test dataset and pass each one to our LightningModule's `predict_step` to see how it classifies them:
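`module.py` is not shown here, so we don't know exactly how `predict_step` is implemented; the loops below only rely on it returning a label string that the `labels` dictionary maps back to an integer. A hypothetical sketch of the final step such a method typically performs, turning raw classifier logits into a label via softmax and argmax. The `id2label` mapping, the `logits_to_label` name, and the example logits are illustrative assumptions:

```python
import math

# Assumption: output index 0 is "negative" and 1 is "positive",
# mirroring the `labels` dictionary defined in this notebook.
id2label = {0: "negative", 1: "positive"}


def logits_to_label(logits: list[float]) -> tuple[str, float]:
    """Turn raw classifier logits into (label, probability) via softmax + argmax."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


label, prob = logits_to_label([-1.2, 2.3])
print(label, round(prob, 3))  # positive 0.971
```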

In [5]:
from datasets import load_dataset

In [6]:
labels = {"negative": 0, "positive": 1}
test_dataset = load_dataset("imdb", cache_dir="data", split="test")

Let's create a list to record whether each prediction is correct:

In [7]:
results = []

And now, let's collect a sample of five positively labeled reviews:

In [8]:
pos = []

for i in range(len(test_dataset)):
    if test_dataset[i]["label"] == 1:
        pos.append(test_dataset[i])
    if len(pos) == 5:
        break

assert all([i["label"] == labels["positive"] for i in pos])

pos

[{'text': "Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is that it was just so beautiful, in every sense - emotionally, visually, editorially - just gorgeous.<br /><br />If you like movies that are wonderful to look at, and also have emotional content to which that beauty is relevant, I think you will be glad to have seen this extraordinary and unusual work of art.<br /><br />On a scale of 1 to 10, I'd give it about an 8.75. The only reason I shy away from 9 is that it is a mood piece. If you are in the mood for a really artistic, very romantic film, then it's a 10. I definitely think it's a must-see, but none of us can be in that mood all the time, so, overall, 8.75.",
  'label': 1},
 {'text': 'CONTAINS "SPOILER" INFORMATION. Watch this director\'s other film, "Earth", at some point. It\'s a better film, but this one isn\'t bad just different.<br /><br />A rare feminist point of view from an Indian filmmaker.
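The index-based loop above can also be written as a generator that stops as soon as enough matches are found. A minimal sketch; `first_n_with_label` is a hypothetical helper, and the toy list stands in for the test split (iterating a Hugging Face `Dataset` likewise yields one dict per row):

```python
from itertools import islice
from typing import Iterable


def first_n_with_label(examples: Iterable[dict], label: int, n: int = 5) -> list[dict]:
    """Return the first `n` examples whose "label" field equals `label`.

    Stops reading as soon as `n` matches are found, like the loop above.
    """
    matches = (ex for ex in examples if ex["label"] == label)
    return list(islice(matches, n))


# Tiny stand-in for the test split.
toy = [{"text": "meh", "label": 0}, {"text": "great", "label": 1},
       {"text": "awful", "label": 0}, {"text": "loved it", "label": 1}]
print([ex["text"] for ex in first_n_with_label(toy, label=1, n=2)])  # ['great', 'loved it']
```

With the real data this would read `pos = first_n_with_label(test_dataset, labels["positive"])`, and the same call with `labels["negative"]` covers the negative sample.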

In [9]:
for sequence in pos:
    pred = labels[model.predict_step(sequence=sequence['text'])]
    truth = sequence['label']
    result = pred == truth
    results.append(result)
    print(f"Our finetuned model classifies this sequence as: {pred}. Actual label is {truth}. The classification is {result}")

Our finetuned model classifies this sequence as: 1. Actual label is 1. The classification is True
Our finetuned model classifies this sequence as: 1. Actual label is 1. The classification is True
Our finetuned model classifies this sequence as: 1. Actual label is 1. The classification is True
Our finetuned model classifies this sequence as: 1. Actual label is 1. The classification is True
Our finetuned model classifies this sequence as: 1. Actual label is 1. The classification is True


Next, we do the same for five known negative sequences:

In [10]:
neg = []

for i in range(len(test_dataset)):
    if test_dataset[i]["label"] == 0:
        neg.append(test_dataset[i])
    if len(neg) == 5:
        break

assert all([i["label"] == labels["negative"] for i in neg])

neg

[{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as

In [11]:
for sequence in neg:
    pred = labels[model.predict_step(sequence=sequence['text'])]
    truth = sequence['label']
    result = pred == truth
    results.append(result)
    print(f"Our finetuned model classifies this sequence as: {pred}. Actual label is {truth}. The classification is {result}")

Our finetuned model classifies this sequence as: 0. Actual label is 0. The classification is True
Our finetuned model classifies this sequence as: 0. Actual label is 0. The classification is True
Our finetuned model classifies this sequence as: 0. Actual label is 0. The classification is True
Our finetuned model classifies this sequence as: 0. Actual label is 0. The classification is True
Our finetuned model classifies this sequence as: 1. Actual label is 0. The classification is False
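Because `results` stores only booleans, it cannot tell us which class an error came from. Recording `(pred, truth)` pairs instead (a hypothetical change, e.g. `results.append((pred, truth))` in the loops above) allows accuracy to be broken down per class. A minimal sketch, with the pairs copied from the printed outputs of the two cells above:

```python
from collections import defaultdict

# (prediction, truth) pairs taken from the outputs printed above.
pairs = [(1, 1)] * 5 + [(0, 0)] * 4 + [(1, 0)]


def per_class_accuracy(pairs):
    """Accuracy computed separately for each true label."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, truth in pairs:
        total[truth] += 1
        correct[truth] += pred == truth  # True counts as 1
    return {label: correct[label] / total[label] for label in sorted(total)}


print(per_class_accuracy(pairs))  # {0: 0.8, 1: 1.0}
```

On this sample, all five positive reviews were classified correctly and four of the five negative reviews were.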


## Conclusion

In [12]:
incorrect = [r for r in results if not r]

if incorrect:
    print(f"{len(incorrect)} sequences were incorrectly labeled out of {len(results)} sequences")
else:
    print("no false predictions!")

1 sequences were incorrectly labeled out of 10 sequences


Our sample accuracy of 9 out of 10 (90%) happens to match the roughly 90% validation accuracy reported for these models.

Keep in mind that this result isn't necessarily indicative of a well-performing model, since it comes from only a small sample of unseen sequences. We should also consider whether 90% accuracy is good enough for our needs; that is, is mislabeling roughly 1 in 10 reviews acceptable?
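To put a number on how little 10 predictions tell us, one standard option (a technique swapped in here, not used elsewhere in this notebook) is a Wilson score interval for the observed accuracy. A minimal sketch using the 9-of-10 result above and the usual z = 1.96 for roughly 95% coverage:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(successes=9, n=10)
print(f"{low:.2f} to {high:.2f}")  # 0.60 to 0.98
```

In other words, 9 correct out of 10 is consistent with a true accuracy anywhere from about 60% to 98%, so a much larger evaluation set is needed before drawing conclusions.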

In the interim, since the task is classifying customer reviews, this error rate may be acceptable as long as the model maintains this level of accuracy in production. That gives us a working solution while we continue developing an improved version of the model to A/B test against it in production.