# Introduction

In this notebook we show how to use an open-source language model (a classifier) to sort strings of text into an arbitrary set of categories.

The language model we will use is pretrained on a large dataset, just like the models underpinning OpenAI's ChatGPT interface. This particular model was trained and released by Facebook and is optimized for classification rather than text generation, but the underlying techniques are similar. It can therefore perform the tasks we want out of the box, without any further tweaking!

Classification is the act of assigning a probability between zero and one to each of one or more labels, indicating how well that label applies to a string of text.

To give an example, the sentence "The sun is out today." seems pretty happy. If a person were asked to give it a score on "happiness", they might assign it a score of 90%. But what about the sentence "It is raining outside.."? That perhaps should score only 10%, if not lower. The language model we use will be able to draw similar conclusions automatically, based on its understanding of the English language.

Let's get started!

# Setup

In [9]:
# Import the pipeline helper from the transformers package, which we will use to load our model
from transformers import pipeline

In [10]:
# Download and instantiate the facebook/bart-large-mnli model and pipeline. The first time this cell runs, it downloads a large file containing the model weights (1.5GB of parameters all in all!).
# Let it finish; from then on the model is cached on disk.

# Be aware that instantiating this model uses a lot of memory/RAM, as it has to load the full 1.5GB of model parameters. If you load the model as below in several notebooks at the same time, you might overload your box.
# To avoid this, when you're done with a notebook, shut down its "kernel" on the left side of the JupyterLab interface (the second icon, a circle with a square inside, lists the active kernels). This unloads the model from memory.

c = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

# Testing whether a label is applicable

Let's see if we can replicate the ideas above. Is the sentence "The sun is out today!" indeed classifiably happy?

To do so, we call our classifier, which takes at least two arguments: a sentence and a list of candidate labels.

Perhaps it's easier to just look at an example!

In [11]:
c('The sun is out today :)', ['Happy'])

{'sequence': 'The sun is out today :)',
 'labels': ['Happy'],
 'scores': [0.9907214641571045]}

Great result! It looks like our sentence is indeed very happy. With a score of about 99%, it doesn't get much happier than that.
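The classifier returns a plain Python dictionary with the keys `sequence`, `labels` and `scores`, so it is easy to post-process. As a minimal sketch, the snippet below copies the result shown above (so it runs without loading the model) and pulls out the best-scoring label. The helper name `top_label` is our own invention for illustration and is not part of the `transformers` package.

```python
# Result copied from the cell above; no model is needed to run this snippet.
result = {
    'sequence': 'The sun is out today :)',
    'labels': ['Happy'],
    'scores': [0.9907214641571045],
}


def top_label(result):
    """Return the (label, score) pair with the highest score (hypothetical helper)."""
    return max(zip(result['labels'], result['scores']), key=lambda pair: pair[1])


label, score = top_label(result)
print(f'{label}: {score:.1%}')  # prints "Happy: 99.1%"
```

The helper does not rely on the order of the labels, so it keeps working once we pass more than one label.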

Challenge: Can you find a way to modify the sentence such that it is even happier?

In [12]:
# Modify the sentence to something even happier
c('The sun is out today.', ['Happy'])

{'sequence': 'The sun is out today.',
 'labels': ['Happy'],
 'scores': [0.958601176738739]}

And how about our "unhappy" sentence? Will the model agree that it is not a happy one?

In [13]:
c('It is raining outside..', ['Happy'])

{'sequence': 'It is raining outside..',
 'labels': ['Happy'],
 'scores': [0.00019142446399200708]}

Nobody likes rain..
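How does a model that was never trained on the label "Happy" produce such a score? As we understand the zero-shot pipeline, it turns each label into a hypothesis sentence and asks the model whether the input text entails or contradicts that hypothesis. The sketch below mimics that idea in plain Python. The template string is what we believe to be the pipeline's default, and the logits are made-up numbers chosen for illustration, not real model output.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: this is the default hypothesis template of the pipeline.
hypothesis_template = 'This example is {}.'
hypothesis = hypothesis_template.format('Happy')

# Hypothetical logits for the pair
# (premise='It is raining outside..', hypothesis) -- NOT real model output.
logits = {'contradiction': 4.0, 'neutral': 1.0, 'entailment': -4.5}

# With a single label, only contradiction and entailment are compared;
# the "neutral" logit is ignored.
p_contradiction, p_entailment = softmax(
    [logits['contradiction'], logits['entailment']]
)

print(hypothesis)
print(f'P(Happy) = {p_entailment:.5f}')
```

A strongly negative entailment logit next to a strongly positive contradiction logit yields a score very close to zero, which is the kind of result we saw for the rainy sentence.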

# Multiple labels

If that's not a happy sentence, then perhaps we can conclude it's a sad one? Does our model agree?

We can ask it to make a choice between several available labels.

In [14]:
c('It is raining outside..', ['Happy', 'Sad'])

{'sequence': 'It is raining outside..',
 'labels': ['Sad', 'Happy'],
 'scores': [0.9559348821640015, 0.04406508430838585]}

The mathematics of the model work out slightly differently for multiple labels than for a single label, but the conclusion remains the same: the model certainly considers the sentence to be much more sad than happy.
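One way to picture the difference: with several labels, we understand the pipeline to compare the labels against each other by applying a softmax over one "entailment" score per label, so the resulting probabilities always sum to one. The sketch below illustrates this with made-up logits (not real model output).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


labels = ['Happy', 'Sad']

# Hypothetical entailment logits, one per label -- NOT real model output.
entailment_logits = [-1.0, 2.0]

# The labels compete with each other: raising one score lowers the others.
scores = softmax(entailment_logits)

# Sort from most to least applicable, like the classifier output above.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f'{label}: {score:.3f}')
print('total:', round(sum(scores), 6))
```

Because the labels compete, a high score here only means "more applicable than the other labels offered", not "applicable in absolute terms".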

What about other labels, like wet, dry, high, low and colorful or gray?

c('It is raining outside..', ['Happy', 'Sad', 'Wet', 'Dry', 'High', 'Low', 'Colorful', 'Gray'])

A hard question, but the answer is clear and agreeable: the sentence is more Wet than any of the other labels. But isn't rain both wet and sad?

To answer that question, we can ask the model to assign a probability to each label independently, rather than forcing it to make a choice. Note that in the results above, all label probabilities add up to 100%.

In [15]:
c('It is raining outside..', ['Happy', 'Sad', 'Wet', 'Dry', 'High', 'Low', 'Colorful', 'Gray'], multi_label=True)

{'sequence': 'It is raining outside..',
 'labels': ['Wet', 'Sad', 'Low', 'Gray', 'High', 'Colorful', 'Dry', 'Happy'],
 'scores': [0.9990817308425903,
  0.8136840462684631,
  0.7355218529701233,
  0.41334474086761475,
  0.03862270340323448,
  0.001768096350133419,
  0.0003273721958976239,
  0.00019142446399200708]}

This makes sense!
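With `multi_label=True` each label gets its own independent score, so the scores no longer need to add up to 100% and several labels can apply at once. As a minimal sketch, the snippet below copies the result above and keeps every label that scores at least 50%. The threshold of 0.5 is an arbitrary choice for illustration; pick whatever cut-off suits your use case.

```python
# Result copied from the cell above; no model is needed to run this snippet.
result = {
    'sequence': 'It is raining outside..',
    'labels': ['Wet', 'Sad', 'Low', 'Gray', 'High', 'Colorful', 'Dry', 'Happy'],
    'scores': [0.9990817308425903,
               0.8136840462684631,
               0.7355218529701233,
               0.41334474086761475,
               0.03862270340323448,
               0.001768096350133419,
               0.0003273721958976239,
               0.00019142446399200708],
}

threshold = 0.5  # arbitrary cut-off, chosen for illustration only

# Keep every label whose independent score reaches the threshold.
applicable = [
    label
    for label, score in zip(result['labels'], result['scores'])
    if score >= threshold
]

print(applicable)                  # ['Wet', 'Sad', 'Low']
print(sum(result['scores']) > 1)   # True: the scores are independent
```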

# Edge cases

Negation. Surely, if rain is bad, then the opposite should be good?

In [16]:
c('It is not raining outside..', ['Happy', 'Sad'])

{'sequence': 'It is not raining outside..',
 'labels': ['Happy', 'Sad'],
 'scores': [0.8534077405929565, 0.14659228920936584]}

It depends on the context though, and the model can handle that pretty well too.

In [9]:
c('After the drought it is finally raining outside..', ['Happy', 'Sad'])

{'sequence': 'After the drought it is finally raining outside..',
 'labels': ['Happy', 'Sad'],
 'scores': [0.7387931942939758, 0.2612067759037018]}

# Multiple sentences/strings

We can also ask the model to classify multiple sentences in one go. That might be useful when you use this model for your trading strategy later.

In [10]:
c(['The sun is out today.', 'It is raining outside..'], ['Happy', 'Sad'])

[{'sequence': 'The sun is out today.',
  'labels': ['Happy', 'Sad'],
  'scores': [0.9836673140525818, 0.016332656145095825]},
 {'sequence': 'It is raining outside..',
  'labels': ['Sad', 'Happy'],
  'scores': [0.9559348821640015, 0.04406508430838585]}]
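When a list of sentences goes in, a list of result dictionaries comes out, one per sentence and in the same order. Note that the order of `labels` can differ between results, so it is safer to pair each label with its score than to rely on position. The sketch below copies the output above and builds a lookup table from it; the variable names `summary` and `best` are ours, for illustration only.

```python
# Results copied from the cell above; no model is needed to run this snippet.
results = [
    {'sequence': 'The sun is out today.',
     'labels': ['Happy', 'Sad'],
     'scores': [0.9836673140525818, 0.016332656145095825]},
    {'sequence': 'It is raining outside..',
     'labels': ['Sad', 'Happy'],
     'scores': [0.9559348821640015, 0.04406508430838585]},
]

# Map each sentence to a {label: score} dictionary, so lookups no longer
# depend on the order in which the labels were returned.
summary = {r['sequence']: dict(zip(r['labels'], r['scores'])) for r in results}

for sentence, scores in summary.items():
    best = max(scores, key=scores.get)
    print(f'{sentence!r} -> {best} ({scores[best]:.1%})')
```

With this structure, `summary['It is raining outside..']['Sad']` gives the score for a specific sentence and label directly.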

# Free-form exploration

Below, try a few different examples, labels, sentence structures, and so on, to get a feel for what the model can and can't handle very well.

In [None]:
c('Your sentence here', ['Your', 'labels', 'here'])