# Using LLMs as text classifiers with an sklearn interface

TODO:
- filter warnings
- google colab installs

## imports

In [1]:
import datasets
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import GridSearchCV

## load data

Load sentiment dataset

In [2]:
imdb = datasets.load_dataset('imdb').shuffle(seed=0)



Limit the data to 100 samples. Zero and few shot learning mostly make sense when there are very few labeled samples.

In [3]:
X = imdb['train'][:100]['text']
y = imdb['train'][:100]['label']

In [4]:
labels = np.array(['negative', 'positive'])[y]

## zero shot classification

In [5]:
from skorch.llm import ZeroShotClassifier

### "train" zero shot classifier

For this notebook, we use a small LLM, `flan-t5-small`.

In [6]:
clf = ZeroShotClassifier('google/flan-t5-small', generate_kwargs={'max_length': 512})

In [7]:
%time clf.fit(X=None, y=['positive', 'negative']);

CPU times: user 2.23 s, sys: 272 ms, total: 2.5 s
Wall time: 2.25 s


In general, fitting is fast because basically nothing happens: no parameters are trained. If the LLM is not cached locally, however, it is downloaded from Hugging Face first, which may take some time.

### evaluate

In [8]:
%time y_proba = clf.predict_proba(X)

Token indices sequence length is longer than the specified maximum sequence length for this model (843 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 1min 51s, sys: 1.72 s, total: 1min 52s
Wall time: 28.3 s


In [9]:
log_loss(y, y_proba)

0.3767031233176697

In [10]:
y_pred = y_proba.argmax(1)

In [11]:
accuracy_score(y, y_pred)

0.83
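
For readers unfamiliar with the two metrics, the following is a minimal pure-Python sketch of what `log_loss` and `accuracy_score` compute for a probability matrix like `y_proba`. The helper names and the toy numbers are made up for illustration; scikit-learn's implementations additionally handle details such as label encoding and sample weights.

```python
import math


def toy_log_loss(y_true, y_proba):
    """Mean negative log probability assigned to the true class."""
    return -sum(math.log(row[label]) for label, row in zip(y_true, y_proba)) / len(y_true)


def toy_accuracy(y_true, y_proba):
    """Share of rows whose most probable class equals the true label."""
    y_pred = [max(range(len(row)), key=row.__getitem__) for row in y_proba]
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


# Hypothetical values: column 0 = negative, column 1 = positive
y_true = [0, 1, 1]
y_proba = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

print(toy_log_loss(y_true, y_proba))
print(toy_accuracy(y_true, y_proba))
```

Log loss penalizes confident wrong answers heavily, which is why it can differ noticeably between models with similar accuracy.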

In [12]:
clf.predict(["A masterpiece, instant classic, 5 stars out of 5"])

array(['positive'], dtype='<U8')

### grid search the prompt

In [13]:
prompt0 = """You are a text classification assistant.

The text to classify:

```
{text}
```

Choose the label among the following possibilities with the highest probability.
Only return the label, nothing more:

{labels}

Your response:
"""

In [14]:
prompt1 = """Your task is to classify text.

Choose the label among the following possibilities with the highest probability.
Only return the label, nothing more:

{labels}

The text to classify:

```
{text}
```

Your response:
"""

In [15]:
params = {'prompt': [prompt0, prompt1]}
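
Both templates rely on the `{text}` and `{labels}` placeholders. As a rough sketch of how such a template turns into the final prompt, the snippet below fills a shortened template with Python's `str.format`. That the classifier uses exactly this mechanism internally is an assumption here; the shortened template and the helper `fill_prompt` are hypothetical.

```python
# Shortened, hypothetical template using the same two placeholders as above
template = """Your task is to classify text.

Only return the label, nothing more:

{labels}

The text to classify:

{text}

Your response:
"""


def fill_prompt(template, text, labels):
    """Substitute the placeholders; assumes plain str.format semantics."""
    return template.format(text=text, labels=labels)


prompt = fill_prompt(template, "A masterpiece, instant classic", ["negative", "positive"])
print(prompt)
```

Apart from the opening sentence, `prompt0` and `prompt1` differ in whether the text comes before or after the instructions and the label list.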

In [16]:
search = GridSearchCV(clf, param_grid=params, cv=2, scoring=['accuracy', 'neg_log_loss'], refit=False)

In [17]:
%time search.fit(X, labels)

Token indices sequence length is longer than the specified maximum sequence length for this model (843 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (884 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 6min 59s, sys: 6.13 s, total: 7min 6s
Wall time: 1min 50s


Grid search results:

In [18]:
pd.DataFrame(search.cv_results_)[['mean_test_accuracy', 'mean_test_neg_log_loss', 'param_prompt', 'mean_score_time']]

| | mean_test_accuracy | mean_test_neg_log_loss | param_prompt | mean_score_time |
|---|---|---|---|---|
| 0 | 0.87 | -0.296063 | You are a text classification assistant.\n\nTh... | 25.581524 |
| 1 | 0.93 | -0.246425 | Your task is to classify text.\n\nChoose the l... | 26.542 |


**Conclusion**: `prompt1` performs better. A mean test accuracy of 93% and a log loss of 0.25 are pretty good, given that we use zero shot classification and don't perform any fine-tuning.
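
Because the search uses several metrics with `refit=False`, scikit-learn does not populate `best_params_`, so the winner has to be read off `cv_results_`. A minimal sketch, using a hand-copied stand-in for `search.cv_results_` with the numbers from the results above (the prompts are abbreviated):

```python
# Stand-in for search.cv_results_, values copied from the results above
cv_results = {
    "param_prompt": [
        "You are a text classification assistant. ...",
        "Your task is to classify text. ...",
    ],
    "mean_test_accuracy": [0.87, 0.93],
    "mean_test_neg_log_loss": [-0.296063, -0.246425],
}

# neg_log_loss is "higher is better", so the maximum marks the best candidate
scores = cv_results["mean_test_neg_log_loss"]
best_idx = max(range(len(scores)), key=scores.__getitem__)

print(best_idx)  # 1, i.e. prompt1
print(cv_results["mean_test_accuracy"][best_idx])
```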

## few shot classification

In [19]:
from skorch.llm import FewShotClassifier
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

### train few shot classifier

Instead of passing the model name to initialize the classifier, as in `clf = FewShotClassifier('google/flan-t5-small')`, it is also possible to pass the model and tokenizer explicitly. This is a good option if you need more control over them. In our case, it amounts to the same result.

In [20]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small').to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')

Use `max_samples` samples from the training data for few shot prompting.

In [21]:
clf = FewShotClassifier(model=model, tokenizer=tokenizer, max_samples=5, generate_kwargs={'max_length': 512})

In [22]:
%time clf.fit(X[:5], labels[:5]);

CPU times: user 966 µs, sys: 15 µs, total: 981 µs
Wall time: 457 µs


Show what the prompt looks like:

In [23]:
print(clf.get_prompt(X[5]))

You are a text classification assistant.

Choose the label among the following possibilities with the highest probability.
Only return the label, nothing more:

['negative' 'positive']

Here are a few examples:

```
i went to see this movie with a bunch of friends one night. I didn't really hear much about it. So I wasn't expecting anything. But after I saw it, I really liked it. Nicolas Cage and the rest of the cast were very good. But I do have to say Giovanni Ribisi's acting performace did need a little perking up. But such a small flaw, it could be overrided. <br /><br />Gone In 60 Seconds is about a retired car thief who must boost 60 rare and exotic cars in one night to save his brother's life. The movie is in no way predictable. So the ending should be a suprise. Think it's just another, fast car driving movie? Well you are partially right. There is much more to it. Everyone should take a look at this movie.
```

Your response:
positive

```
We always watch American movies with 

### evaluate

In [24]:
%time y_proba = clf.predict_proba(X)

Token indices sequence length is longer than the specified maximum sequence length for this model (1367 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 13.4 s, sys: 132 ms, total: 13.5 s
Wall time: 13.6 s


In [25]:
log_loss(y, y_proba)

0.23116233983792145

In [26]:
y_pred = y_proba.argmax(1)

In [27]:
accuracy_score(y, y_pred)

0.91

In [28]:
clf.predict(["Even if paid $1000, I would not watch this movie again"])

array(['negative'], dtype='<U8')

### grid search best number of few shot samples

Note that grid search splits `X` and `y` for each run. Since the few shot samples are taken from `X` and `y`, they differ from split to split, which can have a big influence on the performance of the model. If you want the same few shot samples in every split, craft your own prompt containing those examples and use it with `ZeroShotClassifier`. Just ensure that those examples are not part of the validation/test data!
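
As an illustration of crafting such a prompt by hand, the sketch below builds a template with hard-coded examples while leaving `{labels}` and `{text}` open. The helper `build_fixed_few_shot_prompt`, the example reviews, and the assumption that the template is later filled `str.format`-style are all hypothetical; only the `prompt` parameter itself appears earlier in this notebook.

```python
def build_fixed_few_shot_prompt(examples):
    """Return a prompt template with hard-coded examples.

    The {labels} and {text} placeholders are left unfilled so that the
    classifier can substitute them later. Braces inside the example texts are
    doubled so that they survive a later str.format call (an assumption about
    how the template is filled).
    """
    parts = [
        "You are a text classification assistant.",
        "",
        "Choose the label among the following possibilities with the highest probability.",
        "Only return the label, nothing more:",
        "",
        "{labels}",
        "",
        "Here are a few examples:",
        "",
    ]
    for text, label in examples:
        safe_text = text.replace("{", "{{").replace("}", "}}")
        parts += [safe_text, "", "Your response:", label, ""]
    parts += ["The text to classify:", "", "{text}", "", "Your response:", ""]
    return "\n".join(parts)


# Hypothetical hand-picked examples; keep them out of the evaluation data
examples = [
    ("Great cast and a gripping story.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
]
fixed_prompt = build_fixed_few_shot_prompt(examples)

# Assumed usage, based on the `prompt` parameter searched over earlier:
# clf = ZeroShotClassifier('google/flan-t5-small', prompt=fixed_prompt)
print(fixed_prompt.format(text="Some review", labels=["negative", "positive"]))
```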

In [29]:
params = {'max_samples': [3, 5, 7]}

In [30]:
search = GridSearchCV(clf, param_grid=params, cv=2, scoring=['accuracy', 'neg_log_loss'], refit=False)

In [31]:
%time search.fit(X, labels)

CPU times: user 1min 37s, sys: 112 ms, total: 1min 37s
Wall time: 1min 37s


In [32]:
pd.DataFrame(search.cv_results_)[['mean_test_accuracy', 'mean_test_neg_log_loss', 'param_max_samples', 'mean_score_time']]

| | mean_test_accuracy | mean_test_neg_log_loss | param_max_samples | mean_score_time |
|---|---|---|---|---|
| 0 | 0.92 | -0.231521 | 3 | 11.007958 |
| 1 | 0.91 | -0.232335 | 5 | 17.057216 |
| 2 | 0.91 | -0.232301 | 7 | 20.24657 |


**Conclusion**: No significant change in accuracy, but a modest improvement in log loss (about 0.23 versus 0.25) compared to zero shot. Using more than 3 samples doesn't seem to help.

## Testing MNLI

An established alternative for zero shot classification is natural language inference (NLI). We compare our results to https://huggingface.co/facebook/bart-large-mnli, which is the most used zero shot classifier on Hugging Face.

In [33]:
from transformers import pipeline

In [34]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device='cuda:0')

In [35]:
%%time
y_probas = []
for x in X:
    output = classifier(x, ['negative', 'positive'])
    if output['labels'] == ['negative', 'positive']:
        y_probas.append(output['scores'])
    else:
        y_probas.append(output['scores'][::-1])



CPU times: user 13.4 s, sys: 17 µs, total: 13.4 s
Wall time: 13.4 s
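
The `if`/`else` in the loop above is needed because the pipeline returns its labels ordered by score rather than in the order they were passed in. A sketch of a more general alignment that also works for more than two labels; the helper `align_scores` and the sample output dict are hypothetical and mirror only the `labels` and `scores` keys used above:

```python
def align_scores(output, label_order):
    """Reorder the scores so that they follow label_order."""
    score_by_label = dict(zip(output["labels"], output["scores"]))
    return [score_by_label[label] for label in label_order]


# Made-up pipeline output where the highest-scoring label comes first
output = {"labels": ["positive", "negative"], "scores": [0.9, 0.1]}
print(align_scores(output, ["negative", "positive"]))  # [0.1, 0.9]
```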


In [36]:
y_proba = np.vstack(y_probas)

In [37]:
accuracy_score(y, y_proba.argmax(1))

0.84

In [38]:
log_loss(y, y_proba)

0.3443705626436628

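**Conclusion**: In this run, the model is no faster than the few shot classifier (13.4 s versus 13.6 s wall time), it is less flexible (we cannot adjust the prompt or other parameters), and it performs worse than the few shot classifier (accuracy of 0.84 versus 0.91, log loss of 0.34 versus 0.23).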

## Testing vanilla ML

Use a standard TF-IDF + logistic regression baseline.

In [39]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

In [40]:
tfidf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
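
To give an idea of what `TfidfVectorizer` feeds into the logistic regression, here is a simplified pure-Python sketch of TF-IDF weighting. It assumes scikit-learn's default smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, but uses naive whitespace tokenization and ignores `max_features` and n-grams, so it will not reproduce the vectorizer's output exactly. The function name and the toy corpus are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """TF-IDF with smoothed idf and L2 normalization on whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors


docs = ["good movie", "bad movie", "good good plot"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[1])  # "bad" (rarer) gets a higher weight than "movie"
```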

In [41]:
params = {'tfidf__max_features': [500, 1000, 2000], 'tfidf__ngram_range': [(1, 1), (1, 2), (2, 2), (1, 3)]}

In [42]:
search = GridSearchCV(
    tfidf, param_grid=params, cv=2, scoring=['accuracy', 'neg_log_loss'], refit=False
)

In [43]:
%time search.fit(X, y)

CPU times: user 1.22 s, sys: 4.02 ms, total: 1.22 s
Wall time: 1.22 s


The table is quite big, so let's look at the 5 rows with the best log loss:

In [44]:
cols = ['mean_test_accuracy', 'mean_test_neg_log_loss', 'param_tfidf__max_features', 'param_tfidf__ngram_range']
pd.DataFrame(search.cv_results_)[cols].sort_values('mean_test_neg_log_loss', ascending=False).head()

| | mean_test_accuracy | mean_test_neg_log_loss | param_tfidf__max_features | param_tfidf__ngram_range |
|---|---|---|---|---|
| 3 | 0.69 | -0.662397 | 500 | (1, 3) |
| 7 | 0.71 | -0.663959 | 1000 | (1, 3) |
| 1 | 0.68 | -0.664004 | 500 | (1, 2) |
| 5 | 0.7 | -0.664215 | 1000 | (1, 2) |
| 0 | 0.65 | -0.664609 | 500 | (1, 1) |


**Conclusion**: This classical model is much faster, even if we include the training time, because it is much smaller than an LLM. However, its scores are also much worse, given the small dataset. If speed is no concern, using an LLM classifier would thus be a good option for this task.