# Using LLMs as text classifiers with an sklearn interface

## imports

In [1]:
import datasets
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import GridSearchCV
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModelForCausalLM

## load data

Load the IMDB sentiment dataset.

In [2]:
imdb = datasets.load_dataset('imdb').shuffle(seed=0)


Limit the data to 500 samples. Zero-shot and few-shot learning mostly make sense when very few labeled samples are available.

In [3]:
X = imdb['train'][:500]['text']
y = imdb['train'][:500]['label']

In [4]:
labels = np.array(['negative', 'positive'])[y]

## zero shot classification

In [5]:
from skorch.llm import ZeroShotClassifier

### "train" zero shot classifier

In [6]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small').to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')

In [7]:
clf = ZeroShotClassifier(model=model, tokenizer=tokenizer, generate_kwargs={'max_length': 512})
# this would also work:
# clf = ZeroShotClassifier.from_auto_model_for_seq2seq_lm('google/flan-t5-small', device='cuda')

In [8]:
%time clf.fit(X=None, y=['positive', 'negative']);

CPU times: user 2.54 ms, sys: 383 µs, total: 2.92 ms
Wall time: 1.61 ms


Fitting is fast because no training takes place: the model weights stay untouched, and `fit` essentially only registers the candidate labels.
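To illustrate why such a `fit` call costs almost nothing, here is a minimal sketch of a zero-shot style estimator. `TinyZeroShot` is a hypothetical stand-in written for this example, not skorch's actual implementation; it only assumes that fitting boils down to storing the sorted set of labels.

```python
class TinyZeroShot:
    """Hypothetical stand-in for illustration, not skorch's ZeroShotClassifier."""

    def fit(self, X, y):
        # No weights are updated; only the candidate labels are recorded.
        # X is ignored, which is why passing X=None is fine here.
        self.classes_ = sorted(set(y))
        return self


sketch = TinyZeroShot().fit(X=None, y=['positive', 'negative'])
print(sketch.classes_)  # ['negative', 'positive']
```

Sorting the labels means that column 0 of a probability matrix would correspond to `'negative'` and column 1 to `'positive'`, which matches the integer labels used in this notebook.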

### evaluate

In [9]:
%time y_proba = clf.predict_proba(X)

Token indices sequence length is longer than the specified maximum sequence length for this model (843 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 29.5 s, sys: 188 ms, total: 29.7 s
Wall time: 29.7 s


In [10]:
accuracy_score(y, y_proba.argmax(1))

0.834

In [11]:
log_loss(y, y_proba)

0.35777184664107725
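For readers unfamiliar with the two metrics, the following sketch recomputes them by hand on a tiny made-up example. The helper names `accuracy` and `mean_log_loss` and the toy numbers are assumptions for illustration only; the notebook itself relies on scikit-learn's `accuracy_score` and `log_loss`.

```python
import math


def accuracy(y_true, y_proba):
    """Fraction of rows whose highest-probability column equals the true label."""
    hits = sum(1 for t, row in zip(y_true, y_proba) if row.index(max(row)) == t)
    return hits / len(y_true)


def mean_log_loss(y_true, y_proba, eps=1e-15):
    """Average negative log probability assigned to the true label."""
    total = 0.0
    for t, row in zip(y_true, y_proba):
        # clip to avoid log(0) for overconfident predictions
        p = min(max(row[t], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


# toy data: 4 samples, columns are [P(negative), P(positive)]
toy_y = [0, 1, 1, 0]
toy_proba = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]

print(accuracy(toy_y, toy_proba))       # 0.75
print(mean_log_loss(toy_y, toy_proba))  # roughly 0.40
```

Accuracy only looks at the argmax, while log loss also penalizes confident wrong predictions, so the two metrics can move independently.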

In [12]:
clf.predict(["A masterpiece, instant classic, 5 stars out of 5"])

array(['negative'], dtype='<U8')

### grid search the prompt

In [13]:
from skorch.llm.classifier import DEFAULT_PROMPT_ZERO_SHOT

In [14]:
prompt0 = """You are a text classification assistant.

The text to classify:

```
{text}
```

Choose the label among the following possibilities with the highest probability.
Only return the label, nothing more:

{labels}

Your response:
"""

In [15]:
prompt1 = """Your task is to classify text.

Choose the label among the following possibilities with the highest probability.
Only return the label, nothing more:

{labels}

The text to classify:

```
{text}
```

Your response:
"""
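Both prompts rely on the `{text}` and `{labels}` placeholders. The sketch below shows how such a template could be checked and filled, under the assumption that the placeholders are substituted with Python's `str.format`; the helpers `placeholders` and `fill_prompt` and the shortened template are hypothetical and not part of skorch.

```python
from string import Formatter


def placeholders(template):
    """Return the set of named fields that appear in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def fill_prompt(template, text, labels):
    """Substitute the sample text and the candidate labels into the template."""
    return template.format(text=text, labels=labels)


# shortened, made-up template using the same two placeholders
demo_template = (
    "Your task is to classify text.\n\n"
    "Possible labels: {labels}\n\n"
    "The text to classify: {text}\n\n"
    "Your response:\n"
)

found = placeholders(demo_template)
filled = fill_prompt(demo_template, "Great movie!", ['negative', 'positive'])
print(found)
print(filled)
```

Checking the placeholders before running a grid search is a cheap way to catch a prompt that accidentally drops `{text}` or `{labels}`.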

In [16]:
params = {'prompt': [prompt0, prompt1]}

In [17]:
search = GridSearchCV(clf, param_grid=params, cv=2, scoring=['accuracy', 'neg_log_loss'], refit=False)

In [18]:
%time search.fit(X, labels)

CPU times: user 1min 46s, sys: 120 ms, total: 1min 46s
Wall time: 1min 46s


Grid search results:

In [19]:
pd.DataFrame(search.cv_results_)[['mean_test_accuracy', 'mean_test_neg_log_loss', 'param_prompt', 'mean_score_time']]

| | 0 | 1 |
|---|---|---|
| mean_fit_time | 0.000835 | 0.000904 |
| std_fit_time | 0.000057 | 0.000036 |
| mean_score_time | 26.394518 | 26.492218 |
| std_score_time | 0.37708 | 0.541442 |
| param_prompt | You are a text classification assistant.\n\nTh... | Your task is to classify text.\n\nChoose the l... |
| params | {'prompt': 'You are a text classification assi... | {'prompt': 'Your task is to classify text. Ch... |
| split0_test_accuracy | 0.9 | 0.924 |
| split1_test_accuracy | 0.844 | 0.908 |
| mean_test_accuracy | 0.872 | 0.916 |
| std_test_accuracy | 0.028 | 0.008 |


**Conclusion**: A mean test accuracy of 91.6% and a log loss of 0.25 are good results for a zero-shot model without any fine-tuning, especially given the small dataset.
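As a sanity check on how the summary rows relate to the per-split rows, this small sketch recomputes the mean and standard deviation of the accuracy from the two split scores reported above and picks the better prompt. The variable names are made up for this example, and the use of the population standard deviation is an assumption that happens to reproduce the reported values.

```python
from statistics import mean, pstdev

# per-split accuracies copied from the results above (index 0: prompt0, 1: prompt1)
split0_acc = [0.9, 0.924]
split1_acc = [0.844, 0.908]

mean_acc = [round(mean(pair), 3) for pair in zip(split0_acc, split1_acc)]
std_acc = [round(pstdev(pair), 3) for pair in zip(split0_acc, split1_acc)]

# index of the prompt with the highest mean accuracy
best_index = max(range(len(mean_acc)), key=mean_acc.__getitem__)

print(mean_acc)    # [0.872, 0.916]
print(std_acc)     # [0.028, 0.008]
print(best_index)  # 1
```

With only two folds, the standard deviation is simply half the absolute difference between the split scores, so it should be read as a rough indication of stability rather than a reliable estimate.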

## few shot classification

In [20]:
from skorch.llm import FewShotClassifier

### train few shot classifier

Use `max_samples` samples from the training data for few shot prompting.

In [21]:
clf = FewShotClassifier(model=model, tokenizer=tokenizer, max_samples=5, generate_kwargs={'max_length': 512})

In [22]:
%time clf.fit(X[:5], labels[:5]);

CPU times: user 1.9 ms, sys: 0 ns, total: 1.9 ms
Wall time: 793 µs
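To make the role of `max_samples` more concrete, here is a rough sketch of how a few-shot prompt could be assembled from labeled examples. `build_few_shot_prompt` is a hypothetical helper, and taking the first `max_samples` examples is an assumption for illustration; skorch may select and format the examples differently.

```python
FENCE = "`" * 3  # built programmatically so that no literal fence appears in this block


def build_few_shot_prompt(examples, labels, text, max_samples=5):
    """Hypothetical helper: put up to max_samples labeled examples before the query."""
    parts = [
        "You are a text classification assistant.",
        f"Possible labels: {labels}",
        "Here are a few examples:",
    ]
    for example_text, example_label in examples[:max_samples]:
        parts.append(
            f"{FENCE}\n{example_text}\n{FENCE}\n\nYour response:\n{example_label}"
        )
    # the text to classify comes last, with the response left open for the model
    parts.append(f"The text to classify:\n\n{FENCE}\n{text}\n{FENCE}\n\nYour response:")
    return "\n\n".join(parts)


demo_examples = [
    ("Loved it.", "positive"),
    ("Terrible.", "negative"),
    ("Not bad.", "positive"),
]
demo_prompt = build_few_shot_prompt(
    demo_examples, ['negative', 'positive'], "A fine film.", max_samples=2
)
print(demo_prompt)
```

Every additional example lengthens the prompt, which is worth keeping in mind given the model's limited input length and the longer inference times for larger `max_samples`.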


Show what the prompt looks like:

In [23]:
print(clf.get_prompt(X[5]))

You are a text classification assistant.

Choose the label among the following possibilities with the highest probability.
Only return the label, nothing more:

['negative' 'positive']

Here are a few examples:

```
Reese Witherspooon's first movie. Loved it. The plot and the acting was top notch. You are emotionally involved with the characters. In my opinion, a must see.<br /><br />After watching this movie you will see why Reese Witherspoon's acting career has been so successful. <br /><br />The other cast members do a great job also. <br /><br />The movie flows extremely well. There is not a boring moment in the whole picture. The Man in the Moon's length is just right. <br /><br />As I said earlier, I think this movie was excellent. I have seen it numerous times, and have enjoyed every one of the viewings.
```

Your response:
positive

```
I watch lots of scary movies (or at least they try to be) and this has to be the worst if not 2nd worst movie I have ever had to make myself tr

### evaluate

In [24]:
%time y_proba = clf.predict_proba(X)

CPU times: user 1min 5s, sys: 60.1 ms, total: 1min 5s
Wall time: 1min 5s


In [25]:
accuracy_score(y, y_proba.argmax(1))

0.9

In [26]:
log_loss(y, y_proba)

0.22777235728358436

In [27]:
clf.predict(["Even if paid $1000, I would not watch this movie again"])

array(['negative'], dtype='<U8')

### grid search best number of few shot samples

Note that grid search splits `X` and `y` for each run. Since the few-shot samples are taken from `X` and `y`, they differ from split to split, which can have a big influence on the performance of the model. If you want the same few-shot samples in every split, craft your own prompt containing those examples and use it with `ZeroShotClassifier`. Just ensure that those examples are not part of the validation/test data!

In [28]:
params = {'max_samples': [3, 5, 7]}

In [29]:
search = GridSearchCV(clf, param_grid=params, cv=2, scoring=['accuracy', 'neg_log_loss'], refit=False)

In [30]:
%time search.fit(X, labels)

CPU times: user 12min 22s, sys: 372 ms, total: 12min 22s
Wall time: 12min 22s


In [31]:
pd.DataFrame(search.cv_results_)[['mean_test_accuracy', 'mean_test_neg_log_loss', 'param_max_samples', 'mean_score_time']]

| | 0 | 1 | 2 |
|---|---|---|---|
| mean_fit_time | 0.000938 | 0.001502 | 0.001431 |
| std_fit_time | 0.000124 | 0.000133 | 0.000236 |
| mean_score_time | 63.511516 | 57.153045 | 250.413086 |
| std_score_time | 25.193397 | 2.309564 | 19.802263 |
| param_max_samples | 3 | 5 | 7 |
| params | {'max_samples': 3} | {'max_samples': 5} | {'max_samples': 7} |
| split0_test_accuracy | 0.932 | 0.932 | 0.924 |
| split1_test_accuracy | 0.884 | 0.884 | 0.884 |
| mean_test_accuracy | 0.908 | 0.908 | 0.904 |
| std_test_accuracy | 0.024 | 0.024 | 0.02 |


**Conclusion**: Compared to zero shot, accuracy does not change significantly, but log loss improves moderately. Using more few-shot samples does not seem to help.

## Testing MNLI

An established alternative for zero-shot classification is natural language inference (NLI). We compare our results to https://huggingface.co/facebook/bart-large-mnli, which is the most used zero-shot classifier on the Hugging Face Hub.

In [32]:
from transformers import pipeline

In [33]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device='cuda:0')

In [34]:
%%time
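y_probas = []
for x in X:
    # the pipeline returns the candidate labels sorted by descending score
    output = classifier(x, ['negative', 'positive'])
    if output['labels'] == ['negative', 'positive']:
        y_probas.append(output['scores'])
    else:
        # reverse so that column 0 is always 'negative' and column 1 'positive'
        y_probas.append(output['scores'][::-1])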



CPU times: user 4min 33s, sys: 12.8 s, total: 4min 46s
Wall time: 1min 18s


In [35]:
y_proba = np.vstack(y_probas)

In [36]:
accuracy_score(y, y_proba.argmax(1))

0.864

In [37]:
log_loss(y, y_proba)

0.3327768453356901

**Conclusion**: This model is bigger (thus slower) than our initial LLM, less flexible (cannot adjust prompt or other parameters), and performs slightly worse.

## Testing vanilla ML

Use a standard TF-IDF + logistic regression baseline.

In [38]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

In [39]:
tfidf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])

In [40]:
params = {'tfidf__max_features': [500, 1000, 2000], 'tfidf__ngram_range': [(1, 1), (1, 2), (2, 2), (1, 3)]}

In [41]:
search = GridSearchCV(
    tfidf, param_grid=params, cv=2, scoring=['accuracy', 'neg_log_loss'], refit=False
)

In [42]:
%time search.fit(X, y)

CPU times: user 6.62 s, sys: 12 ms, total: 6.63 s
Wall time: 6.63 s


In [43]:
pd.DataFrame(search.cv_results_).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean_fit_time | 0.054696 | 0.149789 | 0.114786 | 0.302876 | 0.048288 | 0.154148 | 0.113395 | 0.309491 | 0.050205 | 0.154167 | 0.116742 | 0.307342 |
| std_fit_time | 0.005406 | 0.009729 | 0.007894 | 0.015642 | 0.001786 | 0.004715 | 0.005757 | 0.009912 | 0.003285 | 0.010032 | 0.004611 | 0.014247 |
| mean_score_time | 0.067094 | 0.121613 | 0.093494 | 0.177113 | 0.067512 | 0.126158 | 0.091746 | 0.182482 | 0.07084 | 0.126879 | 0.093414 | 0.181389 |
| std_score_time | 0.002076 | 0.005762 | 0.009078 | 0.007628 | 0.002693 | 0.004112 | 0.001664 | 0.009184 | 0.004774 | 0.005889 | 0.005267 | 0.007528 |
| param_tfidf__max_features | 500 | 500 | 500 | 500 | 1000 | 1000 | 1000 | 1000 | 2000 | 2000 | 2000 | 2000 |
| param_tfidf__ngram_range | (1, 1) | (1, 2) | (2, 2) | (1, 3) | (1, 1) | (1, 2) | (2, 2) | (1, 3) | (1, 1) | (1, 2) | (2, 2) | (1, 3) |
| params | {'tfidf__max_features': 500, 'tfidf__ngram_ran... | {'tfidf__max_features': 500, 'tfidf__ngram_ran... | {'tfidf__max_features': 500, 'tfidf__ngram_ran... | {'tfidf__max_features': 500, 'tfidf__ngram_ran... | {'tfidf__max_features': 1000, 'tfidf__ngram_ra... | {'tfidf__max_features': 1000, 'tfidf__ngram_ra... | {'tfidf__max_features': 1000, 'tfidf__ngram_ra... | {'tfidf__max_features': 1000, 'tfidf__ngram_ra... | {'tfidf__max_features': 2000, 'tfidf__ngram_ra... | {'tfidf__max_features': 2000, 'tfidf__ngram_ra... | {'tfidf__max_features': 2000, 'tfidf__ngram_ra... | {'tfidf__max_features': 2000, 'tfidf__ngram_ra... |
| split0_test_accuracy | 0.68 | 0.668 | 0.616 | 0.664 | 0.728 | 0.7 | 0.664 | 0.7 | 0.732 | 0.728 | 0.66 | 0.736 |
| split1_test_accuracy | 0.748 | 0.744 | 0.656 | 0.736 | 0.768 | 0.748 | 0.652 | 0.76 | 0.768 | 0.764 | 0.652 | 0.764 |
| mean_test_accuracy | 0.714 | 0.706 | 0.636 | 0.7 | 0.748 | 0.724 | 0.658 | 0.73 | 0.75 | 0.746 | 0.656 | 0.75 |


The table is quite big, so let's look at the five best log losses:

In [50]:
cols = ['mean_test_accuracy', 'mean_test_neg_log_loss', 'param_tfidf__max_features', 'param_tfidf__ngram_range']
pd.DataFrame(search.cv_results_)[cols].sort_values('mean_test_neg_log_loss', ascending=False).head()

| | mean_test_accuracy | mean_test_neg_log_loss | param_tfidf__max_features | param_tfidf__ngram_range |
|---|---|---|---|---|
| 4 | 0.748 | -0.61463 | 1000 | (1, 1) |
| 0 | 0.714 | -0.617935 | 500 | (1, 1) |
| 8 | 0.75 | -0.619155 | 2000 | (1, 1) |
| 5 | 0.724 | -0.619321 | 1000 | (1, 2) |
| 7 | 0.73 | -0.619762 | 1000 | (1, 3) |


**Conclusion**: This classical model is much faster, even if we include the training time, because it is much smaller than an LLM. However, its scores are also much worse, given the small dataset. If speed is not a concern, an LLM classifier would thus be a good option for this task.