# Introduction

In this notebook we introduce a method for using an open-source language model as a classifier that sorts strings into an arbitrary set of categories.

The language model we will use is pretrained on a large dataset (just like the models underpinning OpenAI's ChatGPT interface). This particular model was trained and released by Facebook and is optimized for classification rather than text generation, but the underlying techniques are similar. It can therefore perform the tasks we want out of the box, without any further fine-tuning!

Classification is the act of assigning a probability between zero and one to each label in a set, given some string. The label with the highest probability wins.
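
To make the "highest probability wins" rule concrete, here is a minimal sketch in plain Python. The labels and raw scores are made up for illustration and do not come from the model; the softmax function is one common way to turn raw scores into probabilities that sum to one, and is used here as an assumption about how such scores are typically normalized.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    highest = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical labels and raw scores, for illustration only.
labels = ["happy", "sad", "neutral"]
raw_scores = [2.0, 0.5, 1.0]

probs = softmax(raw_scores)

# The label with the highest probability wins.
best_label, best_prob = max(zip(labels, probs), key=lambda pair: pair[1])
print(best_label)
```

Each probability lies between zero and one, and `best_label` is simply the label paired with the largest one.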

In [1]:
# Import the pipeline helper from the transformers package, which we will use to load our model
from transformers import pipeline

We will now instantiate the pretrained model and the `pipeline` object that we will use to classify strings. The model is `facebook/bart-large-mnli`, a BART model fine-tuned on the Multi-Genre Natural Language Inference (MNLI) dataset.

In [2]:
# Download and instantiate the facebook/bart-large-mnli model and pipeline. The first time this cell runs, it downloads a large file containing the model weights (about 1.6GB, see the download log below).
# Let it finish; from then on the model is cached on disk.

# Be aware that instantiating this model uses a lot of memory (RAM), since it has to load all of the model parameters. If you load the model as below in several notebooks at the same time, your machine may run out of memory.
# To avoid this, when you're done with a notebook, shut down its kernel from the left sidebar of the JupyterLab interface (the second icon, a circle with a square inside, lists the running kernels). This unloads the model from memory.

c = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

Downloading (…)lve/main/config.json: 100%|██████████| 1.18k/1.18k [00:00<00:00, 2.02MB/s]
Downloading (…)/main/config.json: 100%|██████████| 3.26k/3.26k [00:00<00:00, 6.50MB/s]
Downloading pytorch_model.bin: 100%|██████████| 1.63G/1.63G [01:05<00:00, 24.7MB/s]
Downloading (…)neration_config.json: 100%|██████████| 190/190 [00:00<00:00, 372kB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 3.61MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 2.45MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.26M/1.26M [00:00<00:00, 5.07MB/s]
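
Some context on why a model trained on MNLI can classify into arbitrary labels: a natural language inference model judges whether a premise entails a hypothesis, and the zero-shot pipeline turns each candidate label into a hypothesis sentence that is scored against the input string. The sketch below only illustrates that pairing step and does not call the model. The helper name `build_nli_pairs` and the label `Sad` are hypothetical, and the template string is, to our knowledge, the default value of the pipeline's `hypothesis_template` parameter, so treat it as an assumption and check the documentation for your installed version.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the input string (premise) with one hypothesis sentence per label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

# 'Sad' is a made-up second label, added only to show multiple pairs.
pairs = build_nli_pairs("The sun is out today.", ["Happy", "Sad"])

for premise, hypothesis in pairs:
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")
```

The model then scores how strongly each premise entails its hypothesis, and those entailment scores become the label scores returned by the pipeline.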


We will now test the model by giving it a string and asking it to classify the string into a set of labels.

In [3]:
c('The sun is out today.', ['Happy'])

{'sequence': 'The sun is out today.', 'labels': ['Happy'], 'scores': [0.9892874369621277]}
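
The result is a plain dictionary with parallel `labels` and `scores` lists, so picking the winning label takes only a few lines. The following is a small sketch: the helper name `top_label` is hypothetical, and the dictionary is copied from the output above so that the example runs without loading the model.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score."""
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])

# Result copied from the cell output above, so no model is needed here.
result = {
    "sequence": "The sun is out today.",
    "labels": ["Happy"],
    "scores": [0.9892874369621277],
}

label, score = top_label(result)
print(f"{label}: {score:.3f}")  # prints "Happy: 0.989"
```

The same helper works unchanged when more than one candidate label is passed to `c`, because it only relies on the two lists having the same order.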
