In [1]:
import warnings
warnings.filterwarnings('ignore')

# 1. Introduction

We will be using https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english for sentiment analysis.

This model is a fine-tuned checkpoint of DistilBERT-base-uncased, trained on SST-2, a dataset for binary sentiment classification. SST-2 consists of sentences extracted from movie reviews, each annotated with a sentiment label, and the task is to predict the sentiment of a given sentence. The model reaches an accuracy of 91.3% on the dev set (for comparison, BERT's bert-base-uncased version reaches 92.7%).

In [2]:
from datasets import load_dataset, load_dataset_builder

ds_builder = load_dataset_builder('stanfordnlp/sst2')
ds_builder.info.features

{'idx': Value(dtype='int32', id=None),
 'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None)}

In [3]:
initial_dataset = load_dataset('stanfordnlp/sst2')
initial_dataset

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

In [4]:
initial_dataset['train'][0]

{'idx': 0,
 'sentence': 'hide new secretions from the parental units ',
 'label': 0}

In [5]:
sentence = initial_dataset['train'][0]['sentence']
sentence

'hide new secretions from the parental units '

In [6]:
from transformers import DistilBertTokenizer

tokenizer_ckpt = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_ckpt)

In [7]:
tokenizer.model_input_names

['input_ids', 'attention_mask']

In [8]:
tokenized_sentence = tokenizer.tokenize(sentence)
tokenized_sentence

['hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units']
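The split of `secretions` into `secret` and `##ions` comes from WordPiece's greedy longest-match-first lookup. The sketch below illustrates that idea only and is not the tokenizer's actual implementation. `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary holds just the pieces needed for this sentence, whereas the real vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the real vocabulary).
toy_vocab = {"hide", "new", "secret", "##ions", "from", "the", "parental", "units"}

toy_sentence = "hide new secretions from the parental units "
toy_tokens = [
    piece
    for word in toy_sentence.lower().split()
    for piece in wordpiece_tokenize(word, toy_vocab)
]
print(toy_tokens)
```

With this toy vocabulary, the sketch yields the same pieces as `tokenizer.tokenize(sentence)` above. A word with no matching pieces falls back to `[UNK]`.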

In [9]:
inputs = tokenizer(sentence, return_tensors='pt')
inputs['input_ids']

tensor([[  101,  5342,  2047,  3595,  8496,  2013,  1996, 18643,  3197,   102]])

In [10]:
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tokens

['[CLS]',
 'hide',
 'new',
 'secret',
 '##ions',
 'from',
 'the',
 'parental',
 'units',
 '[SEP]']
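`tokenizer.model_input_names` also lists `attention_mask`, which only matters once sequences of different lengths are batched together. The sketch below shows how padding and the mask relate to each other. `pad_batch` is a hypothetical helper, not part of `transformers`, and `pad_token_id=0` is an assumption that matches the `[PAD]` entry of the BERT uncased vocabulary.

```python
def pad_batch(batch_ids, pad_token_id=0):
    """Right-pad each id list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Ids taken from the tokenizer output above: 101 is [CLS], 102 is [SEP].
first = [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102]
second = [101, 5342, 2047, 102]  # "[CLS] hide new [SEP]"

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

In practice, calling the tokenizer on a list of sentences with `padding=True` produces both fields directly. The helper only makes the relationship explicit.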

# 2. Using the model

Below is a simple example of how to use the model to predict the sentiment of a sentence.

Eventually, we will run the same prediction with a homomorphically encrypted version of this model. At that point, the input pipeline should look similar to the one below.

- Encode step: tokens -> hash | input_ids

- Decode step: tags | hash -> tags

In [11]:
from transformers import pipeline

model_ckpt = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'

pipe = pipeline('text-classification', model=model_ckpt, tokenizer=tokenizer)

result = pipe(sentence)
result

Device set to use mps:0


[{'label': 'NEGATIVE', 'score': 0.9979427456855774}]
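The `score` in the result is a probability, not a raw model output. For a single-label classifier like this one, the pipeline applies a softmax to the two logits and reports the most likely class. The sketch below reproduces that post-processing in plain Python. The values in `example_logits` are hypothetical, not the model's actual logits for this sentence, and the `id2label` mapping is assumed to match the checkpoint's configuration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label mapping for this checkpoint (index 0 = negative, 1 = positive).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits, chosen only to illustrate the computation.
example_logits = [3.5, -2.7]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best], "score": probs[best]}
print([prediction])
```

The printed structure mirrors the pipeline output: a list with one dictionary per input, holding the winning label and its probability.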

In [None]:
from transformers import DistilBertForSequenceClassification