# Sentiment analysis with transformers

In this notebook we implement a classic NLP use-case, sentiment analysis, using Hugging Face's `transformers` library.
We show that this use-case may be implemented directly in the SuperDuperDB `Datalayer`, using MongoDB as the
data-backend.

In [None]:
!pip install datasets

In [1]:
from datasets import load_dataset
import numpy
import pymongo
from transformers import AutoTokenizer, AutoModelForSequenceClassification

import superduperdb
from superduperdb.misc.superduper import superduper
from superduperdb.container.document import Document as D
from superduperdb.db.mongodb.query import Collection
from superduperdb.ext.transformers.model import TransformersTrainerConfiguration, Pipeline
from superduperdb.container.dataset import Dataset



SuperDuperDB supports MongoDB as a data-backend.
Accordingly, we use the Python MongoDB client `pymongo` and "wrap" our database to convert it
to a SuperDuperDB `Datalayer`:

In [3]:
db = pymongo.MongoClient().documents
db = superduper(db)
collection = Collection('imdb')

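We use the IMDB dataset for training the model. Here we insert only a small random sample: 4 reviews from the
train split, tagged with `_fold: 'train'`, and 4 reviews from the test split, tagged with `_fold: 'valid'`: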

In [4]:
data = load_dataset("imdb")

db.execute(collection.insert_many([
    D({'_fold': 'train', **data['train'][int(i)]}) for i in numpy.random.permutation(len(data['train']))[:4]
]))

db.execute(collection.insert_many([
    D({'_fold': 'valid', **data['test'][int(i)]}) for i in numpy.random.permutation(len(data['test']))[:4]
]))



  0%|          | 0/3 [00:00<?, ?it/s]

INFO:root:found 0 uris
INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x19eca70a0>,
 TaskWorkflow(database=<superduperdb.db.base.datalayer.Datalayer object at 0x197363b10>, G=<networkx.classes.digraph.DiGraph object at 0x19ecdf990>))
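
The sampling-and-tagging pattern used in the cell above can be sketched in isolation. This is a minimal illustration, not the notebook's code: it swaps `numpy.random.permutation` for the standard library's `random.sample`, and the helper `sample_with_fold` and the stand-in `reviews` list are hypothetical names introduced only for this example.

```python
import random


def sample_with_fold(records, fold, k, seed=None):
    """Pick k distinct records at random and tag each with a `_fold` key.

    Hypothetical helper for illustration; it mirrors the dict-unpacking
    pattern `{'_fold': ..., **record}` used in the notebook cell.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(len(records)), k)  # k distinct positions
    return [{'_fold': fold, **records[i]} for i in indices]


# Stand-in data with the same fields as IMDB rows ('text' and 'label').
reviews = [{'text': f'review {i}', 'label': i % 2} for i in range(10)]

train_docs = sample_with_fold(reviews, 'train', k=4, seed=0)
valid_docs = sample_with_fold(reviews, 'valid', k=4, seed=1)
```

Each resulting dictionary carries the original `text` and `label` fields plus the `_fold` tag, which is the shape of the documents wrapped in `D(...)` above.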

Check a sample from the database:

In [5]:
r = db.execute(collection.find_one())
r

Document({'_id': ObjectId('64c8f2969a5a99bcba483995'), '_fold': 'train', 'text': '"My Left Foot" is a pretty impressive film that tells the story of Christy Brown, an artist who was crippled with cerebral palsy and learned to paint with his left foot, the only limb in his body he had control over. Daniel Day-Lewis won his first Oscar as Best Actor for this film, which I\'m not absolutely certain was deserved, but is still noteworthy. Day-Lewis give Brown a realistic and occasionally almost humorous touch. Brenda Fricker, as Brown\'s devoted mother, also won an Oscar for a believable and touching role. My problem with this film is that it is a bit too real at times. When Brown is in desperation and must help someone and do it all with his left foot, the film can be difficult to listen. This gives it an often depressing feel that may turn off some viewers for a time. However, if you look beyond that, you will see a sense of hope and inspiration for those who have handicaps and other diff

Create a tokenizer and a sequence-classification model, then wrap both in a SuperDuperDB `Pipeline`. The tokenizer serves as the preprocessing step, with truncation enabled:

In [6]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model = Pipeline(
    identifier='my-sentiment-analysis',
    task='text-classification',
    preprocess=tokenizer,
    object=model,
    preprocess_kwargs={'truncation': True},
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.we

We'll evaluate the model using a simple accuracy metric, which gets logged in the
model's metadata during training. First, we configure the training run:

In [7]:
training_args = TransformersTrainerConfiguration(
    identifier='sentiment-analysis',
    output_dir='sentiment-analysis',
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    use_mps_device=False,
    evaluation_strategy='epoch',
    do_eval=True,
)
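
To get a feel for how small this training run is, the sketch below estimates the number of optimizer steps implied by the settings above. It rests on assumptions not confirmed by the notebook: that only the 4 documents in the `'train'` fold are used for training, that training runs on a single device, and that there is no gradient accumulation. The function `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate total optimizer steps for a run.

    Assumes one device and no gradient accumulation (illustrative only).
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Assumed: 4 training documents, per_device_train_batch_size=2, num_train_epochs=2.
total = optimizer_steps(num_examples=4, batch_size=2, num_epochs=2)
```

Under these assumptions the run takes only a handful of steps. If the configuration behaves like Hugging Face's `TrainingArguments`, whose default `logging_steps` is 500, this would explain why the training-loss column in the results reads "No log".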

Now we're ready to train the model:

In [8]:
from superduperdb.container.metric import Metric

model.fit(
    X='text',
    y='label',
    db=db,
    select=collection.find(),
    configuration=training_args,
    validation_sets=[
        Dataset(
            identifier='my-eval',
            select=collection.find({'_fold': 'valid'}),
        )
    ],
    data_prefetch=False,
    metrics=[Metric(
        identifier='acc',
        object=lambda x, y: sum([xx == yy for xx, yy in zip(x, y)]) / len(x)
    )]
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,My-eval/acc
1,No log,0.692925,0.5
2,No log,0.694557,0.5


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
INFO:root:Saving model...
INFO:root:Saving model...
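
The `acc` metric passed to `model.fit` is a plain fraction of matching predictions. Written as a named function, a minimal equivalent might look like the sketch below; the function name `accuracy`, the input checks, and the example values are illustrative additions and not part of the notebook.

```python
def accuracy(predictions, targets):
    """Fraction of positions where the prediction equals the target."""
    if len(predictions) != len(targets):
        raise ValueError('predictions and targets must have the same length')
    if len(predictions) == 0:
        raise ValueError('cannot compute accuracy on empty inputs')
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(predictions)


# Made-up values: two of the four predictions match their targets.
acc = accuracy([1, 0, 1, 1], [1, 1, 0, 1])
```

With a validation set of only 4 documents, this metric can take just five distinct values (0, 0.25, 0.5, 0.75, 1), so the 0.5 reported above carries little information about model quality.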


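We can now query the model for a prediction on a new input. With so little training data (validation accuracy stayed at 0.5), the output should not be expected to be reliable: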

In [9]:
model.predict("This movie sucks!", one=True)

1
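
The integer returned is a class index. In the IMDB dataset as distributed through Hugging Face `datasets`, `0` denotes a negative review and `1` a positive one. A small lookup makes the output easier to read; the names `IMDB_LABEL_NAMES` and `label_name` are hypothetical, and the sketch assumes `predict` returns the class index as shown above.

```python
# Label convention of the Hugging Face "imdb" dataset: 0 = negative, 1 = positive.
IMDB_LABEL_NAMES = {0: 'negative', 1: 'positive'}


def label_name(label_id):
    """Translate an integer class index into a readable sentiment name."""
    return IMDB_LABEL_NAMES[int(label_id)]


readable = label_name(1)  # the value returned in the cell above
```

Read this way, the model labels "This movie sucks!" as positive, which is consistent with a model that has barely been trained.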