<a href="https://colab.research.google.com/github/zy4kamu/Coda/blob/master/sentiment_fine_tuning_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
# Mount Google Drive to access the training and validation datasets

from google.colab import drive
drive.mount("/content/drive")

train_file = "/content/drive/My Drive/classification_train"
validate_file = "/content/drive/My Drive/classification_validate"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [25]:
# Install HuggingFace transformers

!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-umwph6u1
  Running command git clone -q https://github.com/huggingface/transformers.git /tmp/pip-req-build-umwph6u1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-3.3.1-cp36-none-any.whl size=1082350 sha256=27e5636339e7b0f9a83baf53095503d83155cc71205909b221114894817657c9
  Stored in directory: /tmp/pip-ephem-wheel-cache-n7kvv05d/wheels/33/eb/3b/4bf5dd835e865e472d4fc0754f35ac0edb08fe852e8f21655f
Successfully built transformers


In [26]:
# Import required packages

from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification

import tensorflow as tf
import json

In [30]:
# Read training and validation datasets

def read_dataset(input_file):
  """Load a JSON list of {'text': ..., 'label': ...} items into two parallel lists."""
  sentences = []
  labels = []
  with open(input_file, encoding='utf-8') as reader:
    items = json.load(reader)
    for item in items:
      sentences.append(item['text'])
      labels.append(item['label'])
  return sentences, labels

# Reuse the paths defined in the first cell instead of repeating them
training_sentences, training_labels = read_dataset(train_file)
validation_sentences, validation_labels = read_dataset(validate_file)

print('Training sentences:', training_sentences[0:3], '...')
print('Training labels:', training_labels[0:3], '...')
print('Validation sentences:', validation_sentences[0:3], '...')
print('Validation labels:', validation_labels[0:3], '...')

Training sentences: ['play a jack lawrence concerto', 'play some sam moore.', 'add album to princesas indie'] ...
Training labels: [6, 6, 5] ...
Validation sentences: ['where is belgium located', 'let me hear the good songs from james iha', 'play something by duke ellington from the seventies'] ...
Validation labels: [0, 6, 6] ...
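
For reference, `read_dataset` implies that each file is a JSON list of objects with a `text` string and an integer `label`. The sketch below writes a small hypothetical sample in that layout (the file name `classification_sample` and the three rows are made up for illustration, reusing sentences from the output above), reads it back, and counts the labels, which can be a quick sanity check before training.

```python
import json
import os
import tempfile
from collections import Counter

def read_dataset(input_file):
  # Same logic as the notebook cell: parallel lists of texts and integer labels.
  sentences, labels = [], []
  with open(input_file) as reader:
    for item in json.load(reader):
      sentences.append(item['text'])
      labels.append(item['label'])
  return sentences, labels

# Hypothetical sample in the layout read_dataset() expects.
sample = [
  {'text': 'play a jack lawrence concerto', 'label': 6},
  {'text': 'play some sam moore.', 'label': 6},
  {'text': 'add album to princesas indie', 'label': 5},
]

with tempfile.TemporaryDirectory() as tmp_dir:
  sample_path = os.path.join(tmp_dir, 'classification_sample')
  with open(sample_path, 'w') as writer:
    json.dump(sample, writer)
  sentences, labels = read_dataset(sample_path)

# Count how many examples each label id has.
label_counts = Counter(labels)
print(label_counts)
```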


In [31]:
# Create training and validation datasets for TensorFlow backend

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(training_sentences,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(validation_sentences,
                          truncation=True,
                          padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    training_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    validation_labels
))
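
To make the effect of `padding=True` concrete, here is a minimal pure-Python sketch of padding a batch to its longest sequence and building the matching attention mask. It is an illustration of the idea, not the tokenizer's actual implementation. The helper `pad_batch` is hypothetical, and the word-piece ids 11, 12, 13 and 21 are made up; only 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) correspond to the uncased BERT vocabulary.

```python
def pad_batch(token_id_lists, pad_id=0):
  """Pad every sequence to the longest one and mark real tokens with 1."""
  longest = max(len(ids) for ids in token_id_lists)
  input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in token_id_lists]
  attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                    for ids in token_id_lists]
  # Same two keys that dict(train_encodings) carries for DistilBERT.
  return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Two hypothetical tokenized sentences of different lengths.
batch = pad_batch([[101, 11, 12, 13, 102], [101, 21, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])
```

Because the training and validation sentences are tokenized in separate calls, each set is padded to its own longest sentence, so the two datasets may have different sequence lengths.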

In [33]:
# Fine-tune the model on the 7 intent classes and save it

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                              num_labels=7)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])
# The datasets are already batched, so batch_size is not passed to fit().
# Validation data is not shuffled because order does not affect the metrics.
model.fit(train_dataset.shuffle(100).batch(16),
          epochs=3,
          validation_data=val_dataset.batch(16))
model.save_pretrained('/tmp/classification_model')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_projector', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'dropout_99', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1/3
Epoch 2/3
Epoch 3/3
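
The model outputs 7 raw logits per sentence and the labels are plain integers, so the loss is presumably a sparse categorical cross-entropy computed on logits. That is an assumption about what `model.compute_loss` does in this transformers version, not something the notebook confirms. A minimal pure-Python sketch of that loss for a single example, with a hypothetical helper name and made-up logits:

```python
import math

def sparse_categorical_crossentropy(logits, label):
  """Cross-entropy between raw logits and one integer class label."""
  # Subtract the max logit before exponentiating for numerical stability.
  shift = max(logits)
  log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
  # Negative log-probability assigned to the true class.
  return log_sum_exp - logits[label]

# An untrained classifier that scores all 7 classes equally: loss is ln(7).
uniform_loss = sparse_categorical_crossentropy([0.0] * 7, 6)
# A confident, correct prediction for class 6: loss is close to zero.
confident_loss = sparse_categorical_crossentropy([0.0] * 6 + [10.0], 6)
print(uniform_loss, confident_loss)
```

Under that assumption, a first-epoch loss far above ln(7) ≈ 1.95 would suggest a problem with the labels or the learning rate rather than normal training noise.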


In [47]:
# Load the saved model and predict the intent of a single sentence

loaded_model = TFDistilBertForSequenceClassification.from_pretrained('/tmp/classification_model')
test_sentence = 'play a jack lawrence concerto'

predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors='tf')

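# predict() returns a tuple whose first element holds the logits;
# softmax turns them into probabilities, and [0] selects the only sentence.
tf_output = tf.nn.softmax(loaded_model.predict(predict_input)[0]).numpy()[0]

# Intent names ordered so that the list index equals the integer label id
labels = ['GetWeather', 'SearchCreativeWork', 'RateBook', 'SearchScreeningEvent',
          'BookRestaurant', 'AddToPlaylist', 'PlayMusic']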
print()
for label, probability in zip(labels, tf_output):
  print('{}: {:0.5f}'.format(label, probability))

Some weights of the model checkpoint at /tmp/classification_model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_99']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at /tmp/classification_model and are newly initialized: ['dropout_359']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



GetWeather: 0.00004
SearchCreativeWork: 0.00052
RateBook: 0.00004
SearchScreeningEvent: 0.00008
BookRestaurant: 0.00008
AddToPlaylist: 0.00015
PlayMusic: 0.99909
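
To turn these probabilities into a single predicted intent, one option is to take the index of the largest value and look it up in `labels`. The sketch below uses the rounded probabilities printed above as a stand-in for `tf_output`, so it runs without the model.

```python
labels = ['GetWeather', 'SearchCreativeWork', 'RateBook', 'SearchScreeningEvent',
          'BookRestaurant', 'AddToPlaylist', 'PlayMusic']

# Rounded probabilities copied from the output above (stand-in for tf_output).
probabilities = [0.00004, 0.00052, 0.00004, 0.00008, 0.00008, 0.00015, 0.99909]

# The index of the largest probability is the predicted label id.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_label = labels[best_index]
print('{} ({:0.5f})'.format(predicted_label, probabilities[best_index]))
```

With the real array, `tf_output.argmax()` would give the same index.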
