
https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb#scrollTo=htO7JShhI4sa

In [1]:
!pip install transformers


In [2]:
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

This notebook is broken up into 5 sections:

1.   Preprocessing the data
2.   Fine-tuning the model
3.   Testing the model
4.   Using the fine-tuned model to predict new samples
5.   Saving and loading the model for future use

Let's start off by taking a look at our dataset. In this example we consider a small corpus of 10 Yelp reviews: 5 positive (class 1) and 5 negative (class 0).

In [3]:
x = [
     'Great customer service! The food was delicious! Definitely a come again.',
     'The VEGAN options are super fire!!! And the plates come in big portions. Very pleased with this spot, I\'ll definitely be ordering again',
     'Come on, this place is family owned and operated, they are super friendly, the tacos are bomb.',
     'This is such a great restaurant. Multiple times during days that we don\'t want to cook, we\'ve done takeout here and it\'s been amazing. It\'s fast and delicious.',
     'Staff is really nice. Food is way better than average. Good cost benefit.',
     'pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top.',
     'At such a *fine* institution, I find the lack of knowledge and respect for the art appalling',
     'If I could give one star I would...I walked out before my food arrived the customer service was horrible!',
     'Wow the slowest drive thru I\'ve ever been at WOWWWW. Horrible I won\'t be coming back here ever again',
     'Service: 1 out of 5 stars. They will mess up your order, not have it ready after 30 mins calling them before. Worst ran family business Ive ever seen.'
]

y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

## 1. Preprocessing the data

In [34]:
MODEL_NAME = 'distilbert-base-uncased'
MAX_LEN = 20

review = x[0]

# Truncation and padding are applied when the tokenizer is called, not when it is loaded
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
encoded_x = tokenizer(review)

print(review)
print(encoded_x['input_ids'])
print(encoded_x['attention_mask'])
print(tokenizer.decode(encoded_x['input_ids']))

Great customer service! The food was delicious! Definitely a come again.
[101, 2307, 8013, 2326, 999, 1996, 2833, 2001, 12090, 999, 5791, 1037, 2272, 2153, 1012, 102]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[CLS] great customer service! the food was delicious! definitely a come again. [SEP]


In [35]:
def construct_encodings(x, tokenizer, max_len, truncation=True, padding=True):
  # Truncate each text to at most max_len tokens and pad shorter ones to the longest in the batch
  return tokenizer(x, truncation=truncation, max_length=max_len, padding=padding)

In [36]:
train_encoding = construct_encodings(x, tokenizer, MAX_LEN)

In [37]:
print(train_encoding)
print(train_encoding['input_ids'][0])

{'input_ids': [[101, 2307, 8013, 2326, 999, 1996, 2833, 2001, 12090, 999, 5791, 1037, 2272, 2153, 1012, 102, 0, 0, 0, 0], [101, 1996, 15942, 2078, 7047, 2024, 3565, 2543, 999, 999, 999, 1998, 1996, 7766, 2272, 1999, 2502, 8810, 1012, 102], [101, 2272, 2006, 1010, 2023, 2173, 2003, 2155, 3079, 1998, 3498, 1010, 2027, 2024, 3565, 5379, 1010, 1996, 11937, 102], [101, 2023, 2003, 2107, 1037, 2307, 4825, 1012, 3674, 2335, 2076, 2420, 2008, 2057, 2123, 1005, 1056, 2215, 2000, 102], [101, 3095, 2003, 2428, 3835, 1012, 2833, 2003, 2126, 2488, 2084, 2779, 1012, 2204, 3465, 5770, 1012, 102, 0, 0], [101, 20874, 2005, 2023, 1010, 2096, 4659, 23766, 2005, 1037, 5869, 7136, 8432, 1010, 2003, 3294, 2058, 1996, 2327, 102], [101, 2012, 2107, 1037, 1008, 2986, 1008, 5145, 1010, 1045, 2424, 1996, 3768, 1997, 3716, 1998, 4847, 2005, 1996, 102], [101, 2065, 1045, 2071, 2507, 2028, 2732, 1045, 2052, 1012, 1012, 1012, 1045, 2939, 2041, 2077, 2026, 2833, 3369, 102], [101, 10166, 1996, 4030, 4355, 3298, 27046,
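To see where the trailing zeros in the first row come from, here is a minimal pure-Python sketch of padding and truncation to a fixed length. The helper `pad_or_truncate` is hypothetical and only mimics the tokenizer's behavior for this simple case, using the pad id 0 and `[SEP]` id 102 visible in the output above. It is not the library's implementation, and it assumes every sequence is padded to `max_len`.

```python
PAD_ID = 0    # pad token id seen in the output above
SEP_ID = 102  # [SEP] token id seen in the output above

def pad_or_truncate(input_ids, max_len):
    """Hypothetical helper: force a list of token ids to exactly max_len entries."""
    if len(input_ids) > max_len:
        # Keep the first max_len - 1 ids and close the sequence with [SEP]
        ids = input_ids[:max_len - 1] + [SEP_ID]
    else:
        ids = list(input_ids)
    mask = [1] * len(ids)           # 1 marks a real token
    n_pad = max_len - len(ids)      # number of pad positions to append
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

first_review_ids = [101, 2307, 8013, 2326, 999, 1996, 2833, 2001, 12090, 999,
                    5791, 1037, 2272, 2153, 1012, 102]
ids, mask = pad_or_truncate(first_review_ids, 20)
print(ids)
print(mask)
```

The attention mask tells the model which positions are real tokens (1) and which are padding (0), so the padded positions do not influence the prediction.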

In [38]:
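def construct_tfdataset(encoding, y=None):
  # Compare against None explicitly: a plain truthiness test fails on NumPy arrays
  # and would treat an empty label list as "no labels"
  if y is not None:
    return tf.data.Dataset.from_tensor_slices((dict(encoding), y))
  else:
    # this case is used when making predictions on unseen samples after training
    return tf.data.Dataset.from_tensor_slices((dict(encoding)))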

In [39]:
tfdataset = construct_tfdataset(train_encoding, y)
print(tfdataset)

<TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(20,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(20,), dtype=tf.int32, name=None)}, TensorSpec(shape=(), dtype=tf.int32, name=None))>
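The element spec above shows that each dataset element is one review: a dict of per-example features plus a scalar label. As a rough illustration of that slicing along the first axis, here is a pure-Python stand-in. The generator `slice_examples` and the toy encoding are hypothetical and are not part of TensorFlow.

```python
def slice_examples(features, labels=None):
    """Hypothetical stand-in: yield one example per row of a dict of equal-length lists."""
    n = len(next(iter(features.values())))
    for i in range(n):
        row = {name: values[i] for name, values in features.items()}
        # With labels, each element is a (features, label) pair; without, just the features
        yield (row, labels[i]) if labels is not None else row

# Toy encoding with two very short "reviews" (illustrative values only)
toy_encoding = {
    'input_ids': [[101, 2307, 102], [101, 2833, 102]],
    'attention_mask': [[1, 1, 1], [1, 1, 1]],
}
examples = list(slice_examples(toy_encoding, [1, 0]))
print(examples[0])
```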


In [40]:
TEST_SPLIT = 0.2
BATCH_SIZE = 2

train_size = int(len(x) * (1 - TEST_SPLIT))
print(train_size)

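# Shuffle once only: by default the order changes on every pass over the data,
# so take() and skip() could leak training samples into the test split
tfdataset = tfdataset.shuffle(len(x), reshuffle_each_iteration=False)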
tfdataset_train = tfdataset.take(train_size)
tfdataset_test = tfdataset.skip(train_size)

tfdataset_train = tfdataset_train.batch(BATCH_SIZE)
tfdataset_test = tfdataset_test.batch(BATCH_SIZE)

8
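The split arithmetic can be checked without TensorFlow. The sketch below shuffles a list of indices once, splits it the way `take` and `skip` do, and counts the resulting batches. The fixed seed is an assumption made here for reproducibility and is not part of the notebook.

```python
import math
import random

n_samples, test_split, batch_size = 10, 0.2, 2

train_size = int(n_samples * (1 - test_split))
indices = list(range(n_samples))
random.Random(0).shuffle(indices)  # shuffle once, with a fixed seed (assumption)

train_idx = indices[:train_size]   # analogous to take(train_size)
test_idx = indices[train_size:]    # analogous to skip(train_size)

n_train_batches = math.ceil(len(train_idx) / batch_size)
n_test_batches = math.ceil(len(test_idx) / batch_size)
print(train_size, n_train_batches, n_test_batches)
```

Because the shuffle happens exactly once, the two index lists never overlap. With only two held-out reviews, the test metrics reported later are necessarily coarse.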


## 2. Fine-tuning the model

In [41]:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
optimizer = tf.keras.optimizers.Adam(3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'dropout_79', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i
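The newly initialized classifier head outputs one raw score (logit) per class, and the loss is configured with `from_logits=True`, so the softmax is applied inside the loss. As a minimal sketch of what that loss computes for a single example, assuming the standard definition of softmax cross-entropy (the function name `sparse_ce_from_logits` is hypothetical):

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

print(sparse_ce_from_logits([0.0, 0.0], 1))   # equal logits: loss equals log(2)
print(sparse_ce_from_logits([-2.0, 3.0], 1))  # confident and correct: small loss
print(sparse_ce_from_logits([-2.0, 3.0], 0))  # confident and wrong: large loss
```

The label is passed as an integer class index rather than a one-hot vector, which is what "sparse" refers to.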

In [42]:
N_EPOCHS = 3

# The dataset is already batched, so batch_size is not passed to fit()
model.fit(tfdataset_train, epochs=N_EPOCHS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f48035ac150>

## 3. Testing the model

In [43]:
# The test dataset is already batched, so batch_size is not passed to evaluate()
benchmarks = model.evaluate(tfdataset_test, return_dict=True)
print(benchmarks)

{'loss': 0.49327486753463745, 'accuracy': 1.0}
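With only two reviews in the test split, accuracy can only take the values 0, 0.5, or 1, so a perfect score here says little. The loss of about 0.49 is also not far below log(2) ≈ 0.69, the value for a 50/50 guess. The following sketch shows how accuracy follows from the logits, using made-up values and a hypothetical helper `accuracy_from_logits`.

```python
def accuracy_from_logits(logits_batch, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits_batch]
    correct = sum(int(p == t) for p, t in zip(preds, labels))
    return correct / len(labels)

# Made-up logits for a two-example test set (illustrative values only)
toy_logits = [[0.1, 0.4], [0.3, -0.2]]
print(accuracy_from_logits(toy_logits, [1, 0]))  # both predictions match
print(accuracy_from_logits(toy_logits, [1, 1]))  # only the first matches
```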


## 4. Using the fine-tuned model to predict new samples

In [51]:
def create_predictor(model, model_name, max_len):
  tkzr = DistilBertTokenizer.from_pretrained(model_name)
  def predict_prob(text):
    encoding = construct_encodings(text, tkzr, max_len)
    tfdataset = construct_tfdataset(encoding)
    tfdataset = tfdataset.batch(1)

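    # Sigmoid scores each logit independently, so the two values need not sum to 1;
    # tf.nn.softmax(pred, axis=-1) would instead give a distribution over both classes.
    pred = model.predict(tfdataset).logits
    pred = tf.keras.activations.sigmoid(tf.convert_to_tensor(pred)).numpy()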
    return pred
  return predict_prob

In [55]:
clf = create_predictor(model, MODEL_NAME, MAX_LEN)
print(clf('this restaurant food')[0])

[0.49112502 0.499815  ]
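The two printed values do not sum to 1 because the sigmoid is applied to each logit independently. A softmax over the two logits would give class probabilities that do sum to 1, and it matches the softmax cross-entropy loss used during training. The sketch below contrasts the two on made-up logits of roughly the same scale. The values are illustrative and are not the model's actual outputs.

```python
import math

def sigmoid(z):
    """Squash a single logit into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Normalize a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-0.035, -0.001]  # made-up values for class 0 and class 1
print([sigmoid(z) for z in toy_logits])  # independent scores, need not sum to 1
print(softmax(toy_logits))               # probabilities over the two classes
```

Either way, the predicted class is the index of the larger value, since both functions preserve the ordering of the logits.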
