<a href="https://colab.research.google.com/github/valerielimyh/Intent_Recognition_using_BERT/blob/master/02_1_DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import train_test_split, RandomizedSearchCV,KFold
seed = 123

#settings
pd.set_option('display.max_colwidth', -1)
np.set_printoptions(threshold=np.inf)
pd.options.display.max_columns = None
pd.options.display.max_rows = None
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

In [4]:
!gdown --id 1OlcvGWReJMuyYQuOZm149vHWwPtlboR6 --output train.csv
!gdown --id 1Oi5cRlTybuIF2Fl5Bfsr-KkqrXrdt77w --output valid.csv
!gdown --id 1ep9H6-HvhB4utJRLVcLzieWNUSG3P_uF --output test.csv

Downloading...
From: https://drive.google.com/uc?id=1OlcvGWReJMuyYQuOZm149vHWwPtlboR6
To: /content/train.csv
100% 799k/799k [00:00<00:00, 50.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Oi5cRlTybuIF2Fl5Bfsr-KkqrXrdt77w
To: /content/valid.csv
100% 43.3k/43.3k [00:00<00:00, 65.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ep9H6-HvhB4utJRLVcLzieWNUSG3P_uF
To: /content/test.csv
100% 43.1k/43.1k [00:00<00:00, 67.7MB/s]


In [5]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

HBox(children=(IntProgress(value=0, description='Downloading', max=231508, style=ProgressStyle(description_wid…




HBox(children=(IntProgress(value=0, description='Downloading', max=442, style=ProgressStyle(description_width=…




HBox(children=(IntProgress(value=0, description='Downloading', max=267967963, style=ProgressStyle(description_…




In [0]:
train = pd.read_csv("train.csv")
valid = pd.read_csv("valid.csv")
test = pd.read_csv("test.csv")

In [9]:
df = pd.concat([train,valid,test])
df.shape

(14484, 2)

# Let's preprocess our data so that it matches the data BERT was trained on. 

For this, we'll 

1. Lowercase our text (if we're using a BERT lowercase model)
2. Tokenize it (i.e. "sally says hi" -> ["sally", "says", "hi"])
3. Break words into WordPieces (i.e. "calling" -> ["call", "##ing"])
4. Map our words to indexes using a vocab file that BERT provides
5. Add special "CLS" and "SEP" tokens 

In [0]:
tokenized = df['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

## Padding 
After tokenization, tokenized is a list of sentences -- each sentences is represented as a list of tokens. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

In [11]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
np.array(padded).shape

(14484, 38)

## Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [12]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(14484, 38)

The model() function runs our sentences through BERT. The results of the processing will be returned into last_hidden_states.

In [0]:
#create an input tensor out of the padded token matrix, and send that to DistilBERT
input_ids = torch.tensor(np.array(padded))


with torch.no_grad():
  # last_hidden_states holds the outputs of DistilBERT. 
  # It is a tuple with the shape (number of examples, max number of tokens in the sequence, number of hidden units in the DistilBERT model). 
    last_hidden_states = model(input_ids)

In [0]:
 # Slice the output for the first position for all the sequences, take all hidden unit outputs
features = last_hidden_states[0][:,0,:].numpy()

In [0]:
# Split our datset into a training set and testing set
X_train, X_test, y_train, y_test = train_test_split(features, df['intent'], test_size=0.2, random_state=42)

In [21]:
# search for the best value of the C parameter, which determines regularization strength.
parameters = {'C': np.linspace(0.0001, 100, 20)}

lr_finder = RandomizedSearchCV(estimator = LogisticRegression() , scoring = 'accuracy', 
                               param_distributions = parameters,
                               cv = KFold(n_splits=5, shuffle=True, random_state = seed), 
                               verbose=50, random_state=seed, n_jobs = -1)                         
lr_finder.fit(X_train, y_train)

print('best parameters: ', lr_finder.best_params_)
print('best scores: ', lr_finder.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   10.1s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   10.1s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:   10.2s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   10.6s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   17.0s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:   17.3s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:   17.3s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   24.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   24.1s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:   24.4s
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:   25.0s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:   31.3s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   3

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [26]:
lr_cv = lr_finder.best_estimator_.fit(X_train, y_train)
y_lr_pred = lr_cv.predict(X_test)
print(metrics.classification_report(y_test, y_lr_pred))

                      precision    recall  f1-score   support

       AddToPlaylist       0.98      0.99      0.98       392
      BookRestaurant       0.99      0.99      0.99       379
          GetWeather       0.99      1.00      1.00       435
           PlayMusic       0.95      0.97      0.96       441
            RateBook       0.99      0.99      0.99       399
  SearchCreativeWork       0.92      0.91      0.91       409
SearchScreeningEvent       0.96      0.94      0.95       442

            accuracy                           0.97      2897
           macro avg       0.97      0.97      0.97      2897
        weighted avg       0.97      0.97      0.97      2897



In [0]:
# Using DistilBERT to embed text and modeling using Log reg gives us higher accuracy than 
# using count vectoriser to generate a document-term matrix and modeling using Log reg. 

### References: 
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines