# 0. Outline

This notebook is a short demonstration, for learning purposes, of applying Bidirectional Encoder Representations from Transformers (BERT) loaded from TensorFlow Hub.
We will build a natural language processing (NLP) model to predict whether a given tweet is about a real emergency or not. The data comes from a Kaggle competition at the "getting started" level; see https://www.kaggle.com/c/nlp-getting-started.

The following sections start with very limited data exploration and light preprocessing, so we can quickly dive into the application of BERT. The first solution finetunes BERT from pretrained weights in a few epochs of training. The second solution attaches additional dense layers after BERT while freezing BERT's weights. This form of transfer learning extends the model's structure for other tasks without heavy retraining. Finally, the third solution attaches further classifiers after BERT to improve the predictions.


Download datasets/models and install modules

In [0]:
# pretrained BERT is downloaded from tensorflow hub by Google
!wget https://storage.googleapis.com/tfhub-modules/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1.tar.gz
!mkdir bert_model
!tar -xvf '/content/1.tar.gz'  -C '/content/bert_model'
module_path = "/content/bert_model"
# direct url if one doesn't want to save the model
#module_path = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"

# download dataset provided by kaggle
!wget --quiet https://raw.githubusercontent.com/whitejetyeh/NLP-with-Disaster-Tweets/master/train.csv
!wget --quiet https://raw.githubusercontent.com/whitejetyeh/NLP-with-Disaster-Tweets/master/test.csv
!wget --quiet https://raw.githubusercontent.com/whitejetyeh/NLP-with-Disaster-Tweets/master/sample_submission.csv

# Download a text cleaning function for tweets
#ref https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-full-cleaning
!wget --quiet https://raw.githubusercontent.com/whitejetyeh/NLP-with-Disaster-Tweets/master/CleanTweets.py

# the official tokenization script created by Google
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

# for importing sentencepiece in tokenization.py
!pip install sentencepiece

--2020-02-10 16:34:40--  https://storage.googleapis.com/tfhub-modules/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.143.128, 2a00:1450:4013:c01::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.143.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1244531387 (1.2G) [application/x-tar]
Saving to: ‘1.tar.gz’


2020-02-10 16:35:03 (66.1 MB/s) - ‘1.tar.gz’ saved [1244531387/1244531387]

./
./variables/
./variables/variables.data-00000-of-00001
./variables/variables.index
./assets/
./assets/vocab.txt
./saved_model.pb
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/74/f4/2d5214cbf13d06e7cb2c20d84115ca25b53ea76fa1f0ade0e3c9749de214/sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
     |████████████████████████████████| 1.0MB 9.2MB/s 
Installing collected packages: sentencepiece
Successfully installed sentencepi

In [0]:
%tensorflow_version 2.x
import re
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import tokenization

print(tf.__version__)

# check GPU connection with Google
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

TensorFlow 2.x selected.
2.1.0
Found GPU at: /device:GPU:0


# 1. minimum data exploration and preprocessing

Here, we briefly explore the dataset. A more detailed analysis can be found on Kaggle, for example https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert.

## data exploration

Note: instead of reading test.csv, whose targets are unknown, we hold out a portion of train.csv as df_test for validation.
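The sizes of the resulting subsets can be checked with a little arithmetic. The sketch below assumes that the held-out fraction is rounded up by `train_test_split` and that the `validation_split=0.1` used in the later training cells truncates the fitting portion; under those assumptions it reproduces the row counts printed further down in this notebook (6851 and 762 rows, then 6165 and 686 samples). The variable names are chosen for this example only.

```python
import math

# Total rows in train.csv, taken as 6851 + 762 from the df.info() outputs below
n_total = 7613

# Hold-out split with test_size=0.1 (assumed to round the test size up)
n_test = math.ceil(0.1 * n_total)
n_train = n_total - n_test

# Keras validation_split=0.1 on the remaining rows (assumed to truncate)
n_fit = int(n_train * (1 - 0.1))
n_val = n_train - n_fit

print(n_train, n_test)  # 6851 762
print(n_fit, n_val)     # 6165 686
```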

In [0]:
# read data prepared by kaggle into pandas' data frames
df_train = pd.read_csv("/content/train.csv")
#df_test = pd.read_csv("/content/test.csv")
#submission = pd.read_csv("/content/sample_submission.csv") #no need
df_train, df_test = train_test_split(df_train,
                                     test_size=0.1,
                                     random_state=39)

display(df_train.head())

print('an example of a tweet')
display(df_train.text.iloc[0])

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 4933 | 7027 | mayhem | PG County, MD | Tonight It's Going To Be Mayhem @ #4PlayThursd... | 0 |
| 1673 | 2416 | collide | Kansas, The Free State! ~ KC | That sounds about right. Our building will hav... | 0 |
| 5654 | 8066 | rescue | Big NorthEast Litter Box | I'm on 2 blood pressure meds and it's still pr... | 0 |
| 3532 | 5049 | eyewitness | Pennsylvania | A true #TBT Eyewitness News WBRE WYOU http://... | 0 |
| 5212 | 7444 | obliterated | | I think I'll get obliterated tonight | 0 |


an example of a tweet


"Tonight It's Going To Be Mayhem @ #4PlayThursdays. Everybody Free w/ Text. 1716 I ST NW (18+) http://t.co/cQ7jJ6Yjfz"




BERT makes predictions based on the 'text' column only; the 'keyword' and 'location' columns are used later, in the booster solution, to improve BERT's predictions.

In [0]:
# basic information
print('basic infomation of training data')
display(df_train.info())
print('basic infomation of test data')
display(df_test.info())

# stats of data
print('stats of training data')
display(df_train.describe(include=['object']))
print('stats of test data')
display(df_test.describe(include=['object']))

# missing values
print('missing values in the training data')
display(df_train.isnull().sum())
print('missing values in the test data')
display(df_test.isnull().sum())

basic infomation of training data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6851 entries, 4933 to 3465
Data columns (total 5 columns):
id          6851 non-null int64
keyword     6797 non-null object
location    4577 non-null object
text        6851 non-null object
target      6851 non-null int64
dtypes: int64(2), object(3)
memory usage: 321.1+ KB


None

basic infomation of test data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 762 entries, 7313 to 3305
Data columns (total 5 columns):
id          762 non-null int64
keyword     755 non-null object
location    503 non-null object
text        762 non-null object
target      762 non-null int64
dtypes: int64(2), object(3)
memory usage: 35.7+ KB


None

stats of training data


| | keyword | location | text |
|---|---|---|---|
| count | 6797 | 4577 | 6851 |
| unique | 221 | 3066 | 6764 |
| top | fatalities | USA | 11-Year-Old Boy Charged With Manslaughter of T... |
| freq | 43 | 94 | 9 |


stats of test data


| | keyword | location | text |
|---|---|---|---|
| count | 755 | 503 | 762 |
| unique | 212 | 440 | 760 |
| top | dust%20storm | USA | To fight bioterrorism sir. |
| freq | 12 | 10 | 3 |


missing values in the training data


id             0
keyword       54
location    2274
text           0
target         0
dtype: int64

missing values in the test data


id            0
keyword       7
location    259
text          0
target        0
dtype: int64

## Text cleaning
* The most common out-of-vocabulary (OOV) words have punctuation at the start or end. Those words don't have embeddings because of the attached punctuation. The punctuation marks #, @, !, ?, (, ), [, ], *, %, ..., ', ., :, ; are separated from words
* Special characters that are attached to words are removed completely
* Contractions are expanded
* Urls are removed
* Character entity references are replaced with their actual symbols
* Typos and slang are corrected, and informal abbreviations are written in their long forms
* Hashtags and usernames are expanded
* Some words are replaced with their acronyms

See https://raw.githubusercontent.com/whitejetyeh/NLP-with-Disaster-Tweets/master/CleanTweets.py for details.
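As a rough illustration of a few of the rules above (URL removal, contraction expansion, punctuation separation), here is a minimal sketch. It is not the code in `CleanTweets.py`; the function name `toy_clean`, its two-entry contraction table, and the sample tweet are assumptions made for this example.

```python
import re

def toy_clean(tweet):
    """Apply three simplified cleaning rules to one tweet."""
    # Remove URLs first so their characters are not treated as punctuation
    tweet = re.sub(r"https?://\S+", "", tweet)
    # Expand contractions and abbreviations (hypothetical two-entry table)
    for short, expanded in {"It's": "It is", "w/": "with"}.items():
        tweet = tweet.replace(short, expanded)
    # Put spaces around selected punctuation so it detaches from words
    tweet = re.sub(r"([#@!?()\[\]*%.:;+])", r" \1 ", tweet)
    # Collapse the extra whitespace introduced above
    return " ".join(tweet.split())

cleaned = toy_clean("It's Mayhem @ #Thursdays. Free w/ Text (18+) http://t.co/abc")
print(cleaned)  # It is Mayhem @ # Thursdays . Free with Text ( 18 + )
```

The real cleaning function handles many more cases (typos, slang, hashtag expansion), so its output differs from this sketch.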

In [0]:
# clean is a handmade text cleaning function for tweets
#Reference: https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-full-cleaning
from CleanTweets import clean
df_train['text_cleaned'] = df_train['text'].apply(lambda s : clean(s))
df_test['text_cleaned'] = df_test['text'].apply(lambda s : clean(s))

print('a tweet before cleaning')
display(df_train.text.iloc[0])
print('a tweet after cleaning')
display(df_train.text_cleaned.iloc[0])

a tweet before cleaning


"Tonight It's Going To Be Mayhem @ #4PlayThursdays. Everybody Free w/ Text. 1716 I ST NW (18+) http://t.co/cQ7jJ6Yjfz"

a tweet after cleaning


'Tonight It is Going To Be Mayhem  @   # Foreplay Thursdays .  Everybody Free with Text .  1716 I ST NW  ( 18 +  )  '

# 2. finetune a pretrained BERT model
Here, we apply the plain BERT model. The first step is to convert the text data into the BERT input format [token IDs, mask, segment IDs] with `bert_encode` and `tokenizer`. Then, the model defined in `build_model` is the pretrained BERT (the 24-layer BERT-large module downloaded above) followed by a one-node output layer, which predicts 1 for a real disaster or 0 for a non-disaster.

This part is forked from another fine kernel on kaggle, 
https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub.
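To make the input format concrete, the sketch below mirrors the steps of `bert_encode` on a single pre-tokenized sentence. The tiny vocabulary `toy_vocab` and the helper `toy_encode` are hypothetical stand-ins for the real WordPiece tokenizer and vocabulary file, which assign different IDs.

```python
def toy_encode(tokens, vocab, max_len=8):
    """Build token IDs, padding mask, and segment IDs for one sentence."""
    # Leave room for the two special tokens
    tokens = tokens[:max_len - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)

    ids = [vocab[t] for t in sequence] + [0] * pad_len  # 0 is the padding ID
    mask = [1] * len(sequence) + [0] * pad_len          # 1 = real token, 0 = padding
    segments = [0] * max_len                            # single sentence: all segment 0
    return ids, mask, segments

# Hypothetical vocabulary for illustration only
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "forest": 3, "fire": 4, "near": 5, "town": 6}

ids, mask, segments = toy_encode(["forest", "fire", "near", "town"], toy_vocab)
print(ids)       # [1, 3, 4, 5, 6, 2, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists always have length `max_len`, which is what lets a batch of tweets be stacked into fixed-shape arrays.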

## define model

In [0]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return [np.array(all_tokens), np.array(all_masks), np.array(all_segments)]

def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

## initialize BERT and encode text data

In [0]:
# initialize BERT with trainable weights
bert_layer = hub.KerasLayer(module_path, trainable=True)

# establish tokenizer with bert_layer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

# encode cleaned text data for BERT to read
# BERT accepts at most 512 tokens per sequence (max_len); 160 is used here
train_input = bert_encode(df_train.text_cleaned.values, tokenizer, max_len=160)
test_input = bert_encode(df_test.text_cleaned.values, tokenizer, max_len=160)
train_labels = df_train.target.values
test_labels = df_test.target.values

## train model and predict

In [0]:
%%time

train_model = False
if train_model:
    # initialize model
    n_epoch = 3
    model = build_model(bert_layer, max_len=160)
    checkpoint_path = "/content/bert_model.ckpt"
    display(model.summary())

    # start training (about 30 mins)
    # Create a callback that saves the model's weights
    cp_callback = ModelCheckpoint(filepath=checkpoint_path,
                                  save_weights_only=True,
                                  save_best_only=True,
                                  verbose=1)
    model.fit(train_input, train_labels,
              validation_split=0.1,
              epochs=n_epoch,
              batch_size=16,
              callbacks=[cp_callback])

    # predict df_test (validation data from train.csv)
    predictions = model.predict(test_input).round().astype(int)
    print(classification_report(test_labels, predictions, labels=[0, 1]))

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

None

Train on 6165 samples, validate on 686 samples
Epoch 1/3
Epoch 00001: val_loss improved from inf to 0.43643, saving model to /content/bert_model.ckpt
Epoch 2/3
Epoch 00002: val_loss improved from 0.43643 to 0.42187, saving model to /content/bert_model.ckpt
Epoch 3/3
Epoch 00003: val_loss did not improve from 0.42187
              precision    recall  f1-score   support

           0       0.85      0.87      0.86       451
           1       0.81      0.77      0.79       311

    accuracy                           0.83       762
   macro avg       0.83      0.82      0.83       762
weighted avg       0.83      0.83      0.83       762

CPU times: user 16min 53s, sys: 11min 42s, total: 28min 35s
Wall time: 35min 16s


The finetuned plain BERT reaches an accuracy of 0.83 on the held-out data, with the following precision and recall per class.
$$\begin{array}{c|c|c}&\text{precision}&\text{recall}\\\hline 0&0.85&0.87\\\hline 1&0.81&0.77\end{array}$$
This is a fine result, but the training process is computationally heavy. After all, the BERT-large module used here (L-24_H-1024_A-16) has over 300M trainable parameters. The next model, with BERT frozen, achieves similar accuracy while the wall time drops from about 35 to 22 minutes, even with five epochs instead of three.

# 3. transfer learning with BERT

Here, we extend the pretrained BERT with a simple neural network. We fix the weights in BERT and train only the attached network, so training is much cheaper than finetuning the complete BERT.

## define model

In [0]:
'''Transfer learning of bert'''
def build_ext_model(module_path, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    # load BERT with frozen weights
    bert_layer = hub.KerasLayer(module_path, trainable=False)

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

    x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
    x = tf.keras.layers.Dropout(0.2)(x)
    # dense layers stacked after bert
    x = tf.keras.layers.Dense(400, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    x = tf.keras.layers.Dense(200, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    x = tf.keras.layers.Dense(100, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    out = Dense(1, activation='sigmoid', name="dense_output")(x)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=2e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

## train model and predict

In [0]:
%%time

train_model = True
if train_model:
    # initialize model
    n_epoch = 5
    model = build_ext_model(module_path, max_len=160)    
    checkpoint_path = "/content/bert_ext_model.ckpt"
    display(model.summary())

    # start training (about 30 mins)
    # Create a callback that saves the model's weights
    cp_callback = ModelCheckpoint(filepath=checkpoint_path,
                                  save_weights_only=True,
                                  save_best_only=True,
                                  verbose=1)
    model.fit(train_input, train_labels,
              validation_split=0.1,
              epochs=n_epoch,
              batch_size=16,
              callbacks=[cp_callback])

    # predict df_test (validation data from train.csv)
    predictions = model.predict(test_input).round().astype(int)
    print(classification_report(test_labels, predictions, labels=[0, 1]))

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer_2 (KerasLayer)      [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]           

None

Train on 6165 samples, validate on 686 samples
Epoch 1/5
Epoch 00001: val_loss improved from inf to 0.50573, saving model to /content/bert_ext_model.ckpt
Epoch 2/5
Epoch 00002: val_loss improved from 0.50573 to 0.46735, saving model to /content/bert_ext_model.ckpt
Epoch 3/5
Epoch 00003: val_loss improved from 0.46735 to 0.45765, saving model to /content/bert_ext_model.ckpt
Epoch 4/5
Epoch 00004: val_loss did not improve from 0.45765
Epoch 5/5
Epoch 00005: val_loss improved from 0.45765 to 0.44334, saving model to /content/bert_ext_model.ckpt
              precision    recall  f1-score   support

           0       0.83      0.88      0.86       451
           1       0.81      0.74      0.78       311

    accuracy                           0.83       762
   macro avg       0.82      0.81      0.82       762
weighted avg       0.82      0.83      0.82       762

CPU times: user 5min 45s, sys: 4min 41s, total: 10min 27s
Wall time: 22min


A simple network of three hidden dense layers attached after BERT reaches an accuracy of 0.83, with the following precision and recall per class.
$$\begin{array}{c|c|c}&\text{precision}&\text{recall}\\\hline 0&0.83&0.88\\\hline 1&0.81&0.74\end{array}$$
This transfer learning model achieves accuracy similar to the finetuned BERT with far fewer trainable weights (about 0.5M instead of over 300M).
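The figure of roughly 0.5M trainable weights can be checked by counting the parameters of the dense layers in `build_ext_model`. This is a minimal sketch; the helper `dense_params` is introduced here for illustration, and the input width of 1024 is taken from the hidden size of the BERT module used in this notebook.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Pooled BERT output (1024) -> 400 -> 200 -> 100 -> 1 output node
layer_sizes = [1024, 400, 200, 100, 1]
head_params = sum(dense_params(n_in, n_out)
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(head_params)  # 510401
```

Pooling and dropout layers add no weights, so these 510,401 parameters are all that is trained while the 335,141,889 parameters reported for the BERT layer stay frozen.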


# 4. subsequent booster after BERT

Here, we build a pipeline of the finetuned BERT followed by an ensemble of several classifiers.
The ensemble takes as input the preliminary predictions from BERT together with the 'keyword' and 'location' features, which BERT does not see, so the aggregated predictions can improve further.

The ensemble combines the predictions of a histogram-based gradient boosting (HGB) classifier, a linear classifier trained with stochastic gradient descent (SGD), and a gradient boosting (GB) classifier wrapped in AdaBoost. `VotingClassifier` is used with its default hard voting, so the final label is the majority vote of the three. Combining different classifiers this way tends to lower the variance of the predictions.
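With its default settings, scikit-learn's `VotingClassifier` (used later in this section) combines the predicted class labels by majority vote. The plain-Python sketch below shows that rule for three equally weighted models; the function `majority_vote` and the example label lists are made up for illustration and are not outputs of this notebook.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Return, for each sample, the label predicted by most models.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    voted = []
    for labels in zip(*predictions_per_model):
        # most_common(1) gives the (label, count) pair with the highest count
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical predictions of three classifiers on four tweets
hgb_pred = [1, 0, 1, 0]
sgd_pred = [1, 1, 0, 0]
abgb_pred = [0, 1, 1, 0]

final_pred = majority_vote([hgb_pred, sgd_pred, abgb_pred])
print(final_pred)  # [1, 1, 1, 0]
```

With an odd number of binary classifiers there are no ties, so a single model's mistake is outvoted whenever the other two agree.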

## encode categorical features

We need to encode 'keyword' and 'location' as numbers before concatenating them with BERT's predictions. `LabelEncoder` encodes n labels into the integers 0 to n-1, and `MinMaxScaler` scales these integers into the range [0,1].
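The two steps can be sketched in plain Python to show what the numbers look like. The helpers `label_encode` and `min_max_scale` are simplified stand-ins written for this example (they are not the scikit-learn classes), and the keyword list is made up.

```python
def label_encode(values):
    """Map each distinct label to its index in the sorted list of labels."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes

def min_max_scale(codes):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(codes), max(codes)
    if hi == lo:
        return [0.0 for _ in codes]
    return [(c - lo) / (hi - lo) for c in codes]

# Hypothetical keyword column; 'NaN' marks a missing value
keywords = ["wildfire", "NaN", "flood", "wildfire", "crash"]
codes, classes = label_encode(keywords)
scaled = min_max_scale(codes)

print(classes)  # ['NaN', 'crash', 'flood', 'wildfire']
print(codes)    # [3, 0, 2, 3, 1]
```

Note that the resulting order is purely alphabetical, so the numeric distance between two encoded labels carries no meaning by itself.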

In [0]:
'''encode categorical features'''
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from collections import defaultdict

cat_df_train = df_train[['keyword','location']].copy()
cat_df_test = df_test[['keyword','location']].copy()
# handle missing values (fillna avoids pandas chained assignment)
for cat in ['keyword','location']:
    cat_df_train[cat] = cat_df_train[cat].fillna('NaN')
    cat_df_test[cat] = cat_df_test[cat].fillna('NaN')

# initialize the encoder
le = defaultdict(LabelEncoder)
# fit the encoder and transform the training set
fit_cat_df_train = cat_df_train.apply(lambda x: le[x.name].fit_transform(x))
# normalize the encoding
scaler = MinMaxScaler(feature_range=(0, 1))
fit_cat_df_train = scaler.fit_transform(fit_cat_df_train)
fit_cat_df_train = pd.DataFrame(fit_cat_df_train,
                                index=cat_df_train.index,
                                columns=['keyword','location'])


# Replace test set labels unseen in the training set with the most frequent training label
for cat in ['keyword','location']:
    labels_train = cat_df_train[cat].unique().tolist()
    replacement_label = cat_df_train[cat].mode()[0]
    cat_df_test.loc[~cat_df_test[cat].isin(labels_train), cat] = replacement_label

# Using the dictionary to label future data
fit_cat_df_test = cat_df_test.apply(lambda x: le[x.name].transform(x))
# normalize the encoding
fit_cat_df_test = scaler.transform(fit_cat_df_test)
fit_cat_df_test = pd.DataFrame(fit_cat_df_test,
                               index=cat_df_test.index,
                               columns=['keyword','location'])


## aggregate bert predictions with extra features in subsequent models

First, we concatenate the preliminary predictions with the encoded 'keyword' and 'location'.

In [0]:
%%time
'''aggregate previous predictions with extra features(processed)'''

# preliminary prediction from BERT
bert_train_pred = model.predict(train_input)
bert_predict_df = pd.DataFrame(bert_train_pred, 
                               index=df_train.index,
                               columns=['target'])
# more features to be considered
boosting_input = pd.concat([bert_predict_df,fit_cat_df_train],axis=1)
    
bert_test_pred = model.predict(test_input)
bert_test_predict_df = pd.DataFrame(bert_test_pred, 
                                    index=df_test.index,
                                    columns=['target'])
# more features to be considered
test_boosting_input = pd.concat([bert_test_predict_df,fit_cat_df_test],axis=1)

CPU times: user 1min 22s, sys: 57.2 s, total: 2min 20s
Wall time: 2min 19s


Second, we examine the improvement from each classifier individually.

In [0]:
%%time
'''HistGradientBoosting Classifier (lightGBM inspired)'''
# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
HGB_classifier = HistGradientBoostingClassifier(l2_regularization=0.5, 
                                                learning_rate=0.1,
                                                max_depth=None,
                                                max_iter=100, 
                                                max_leaf_nodes=31,
                                                min_samples_leaf=20)
HGB_classifier.fit(boosting_input, train_labels)

predictions = HGB_classifier.predict(test_boosting_input).round().astype(int)
print(classification_report(test_labels, predictions, labels=[0, 1]))
print(HGB_classifier)

              precision    recall  f1-score   support

           0       0.85      0.89      0.87       451
           1       0.82      0.77      0.80       311

    accuracy                           0.84       762
   macro avg       0.84      0.83      0.83       762
weighted avg       0.84      0.84      0.84       762

HistGradientBoostingClassifier(l2_regularization=0.5, learning_rate=0.1,
                               loss='auto', max_bins=255, max_depth=None,
                               max_iter=100, max_leaf_nodes=31,
                               min_samples_leaf=20, n_iter_no_change=None,
                               random_state=None, scoring=None, tol=1e-07,
                               validation_fraction=0.1, verbose=0,
                               warm_start=False)
CPU times: user 703 ms, sys: 31.5 ms, total: 735 ms
Wall time: 377 ms


In [0]:
%%time
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(boosting_input, train_labels)

predictions = sgd.predict(test_boosting_input).round().astype(int)
print(classification_report(test_labels, predictions, labels=[0, 1]))

              precision    recall  f1-score   support

           0       0.85      0.89      0.87       451
           1       0.83      0.77      0.80       311

    accuracy                           0.84       762
   macro avg       0.84      0.83      0.83       762
weighted avg       0.84      0.84      0.84       762

CPU times: user 21.4 ms, sys: 761 µs, total: 22.2 ms
Wall time: 24.5 ms


Unlike HGB and SGD, GB needs some extra effort to reach the same level of accuracy. `RandomizedSearchCV` helps tune the parameters for the training data by randomly sampling settings from the parameter grid.
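The idea of a randomized search over a grid can be sketched without scikit-learn: enumerate every combination, then draw a fixed number of them at random without repetition. The grid below copies the one used in the next cell; the seed and the variable names are assumptions for this example, and no model is fitted here.

```python
import itertools
import random

param_grid = {'n_estimators': [100, 150],
              'learning_rate': [0.05, 0.1, 0.2],
              'max_depth': [2, 4, 6],
              'min_samples_leaf': [3, 5, 7],
              'max_features': ['sqrt', 'auto']}

# Every combination of one value per parameter
keys = list(param_grid)
all_settings = [dict(zip(keys, combo))
                for combo in itertools.product(*param_grid.values())]

# Draw 40 distinct settings; each would then be scored by cross-validation
rng = random.Random(39)  # arbitrary seed for reproducibility
sampled_settings = rng.sample(all_settings, 40)

print(len(all_settings))      # 108
print(len(sampled_settings))  # 40
```

Sampling 40 of the 108 combinations with 3-fold cross-validation costs 120 fits instead of the 324 a full grid search would need.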

In [0]:
%%time
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier

# parameters set up
estimator = GradientBoostingClassifier()
param_grid={'n_estimators':[100,150], 
            'learning_rate': [0.05,0.1,0.2],
            'max_depth':[2,4,6], 
            'min_samples_leaf':[3,5,7], 
            'max_features':['sqrt','auto']
           }
n_iter=40 # number of settings to sample from the 2*3*3*3*2 = 108 grid combinations

# create best estimator
rand_classifier = RandomizedSearchCV(estimator, param_distributions=param_grid, 
                                    cv=3, n_iter=n_iter, n_jobs=-1, verbose=2)
rand_classifier.fit(boosting_input, train_labels)
display(rand_classifier.best_params_)
gb = rand_classifier.best_estimator_

predictions = gb.predict(test_boosting_input).round().astype(int)
print(classification_report(test_labels, predictions, labels=[0, 1]))

Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   16.7s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   52.1s finished


{'learning_rate': 0.1,
 'max_depth': 2,
 'max_features': 'sqrt',
 'min_samples_leaf': 5,
 'n_estimators': 150}

              precision    recall  f1-score   support

           0       0.85      0.88      0.86       451
           1       0.82      0.77      0.79       311

    accuracy                           0.84       762
   macro avg       0.83      0.83      0.83       762
weighted avg       0.84      0.84      0.84       762

CPU times: user 908 ms, sys: 81.1 ms, total: 989 ms
Wall time: 52.5 s


Wrapping the GB classifier with the best parameters in AdaBoost slightly improves it.

In [0]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

estimator = GradientBoostingClassifier(n_estimators = 150,
                                       min_samples_leaf = 5,
                                       max_features = 'sqrt',
                                       max_depth = 2,
                                       learning_rate = 0.1)

abgb = AdaBoostClassifier(base_estimator=estimator,
                          learning_rate=1,
                          n_estimators=10,
                          random_state=39)
abgb.fit(boosting_input, train_labels)

predictions = abgb.predict(test_boosting_input).round().astype(int)
print(classification_report(test_labels, predictions, labels=[0, 1]))

              precision    recall  f1-score   support

           0       0.85      0.88      0.87       451
           1       0.82      0.78      0.80       311

    accuracy                           0.84       762
   macro avg       0.84      0.83      0.83       762
weighted avg       0.84      0.84      0.84       762



## ensemble of effective classifiers

In [0]:
%%time
'''voting for optimal prediction in the ensemble of effective classifiers'''
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report
# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

#enter selected classifiers
estimator = GradientBoostingClassifier(n_estimators = 150,
                                       min_samples_leaf = 5,
                                       max_features = 'sqrt',
                                       max_depth = 2,
                                       learning_rate = 0.1)
ABGB = AdaBoostClassifier(base_estimator=estimator,
                          learning_rate=1,
                          n_estimators=10,
                          random_state=39)

SGD = SGDClassifier()

HGB = HistGradientBoostingClassifier(l2_regularization=0.5, 
                                     learning_rate=0.1,
                                     max_depth=None,
                                     max_iter=100, 
                                     max_leaf_nodes=31,
                                     min_samples_leaf=20)

#ensemble of classifiers
ensemble = [('abgb', ABGB),                                     
            ('sgd', SGD),
            ('hgb', HGB)]
weights = [1, 1, 1]
voting_classifier = VotingClassifier(ensemble,
                                     weights=weights,
                                     n_jobs=-1)


#fit all classifiers
voting_classifier.fit(boosting_input, train_labels)

predictions = voting_classifier.predict(test_boosting_input).round().astype(int)
print(classification_report(test_labels, predictions, labels=[0, 1]))

              precision    recall  f1-score   support

           0       0.85      0.90      0.87       451
           1       0.84      0.76      0.80       311

    accuracy                           0.84       762
   macro avg       0.84      0.83      0.83       762
weighted avg       0.84      0.84      0.84       762

CPU times: user 116 ms, sys: 2.16 ms, total: 118 ms
Wall time: 4.3 s


The ensemble classifier attached after BERT reaches an accuracy of 0.84 (improved from 0.83), with the following precision and recall per class.
$$\begin{array}{c|c|c}&\text{precision}&\text{recall}\\\hline 0&0.85&0.90\\\hline 1&0.84&0.76\end{array}$$
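As a sanity check on the reports, overall accuracy equals the support-weighted average of the per-class recall. The sketch below applies this to the rounded recall values printed in this notebook for the finetuned BERT and for the ensemble, so the results are only approximate; the helper name `accuracy_from_recall` is introduced for this example.

```python
def accuracy_from_recall(recalls, supports):
    """Overall accuracy from per-class recall and per-class sample counts."""
    correct = sum(r * n for r, n in zip(recalls, supports))
    return correct / sum(supports)

# Class supports in the held-out data: 451 non-disaster, 311 disaster tweets
supports = [451, 311]

finetuned_acc = accuracy_from_recall([0.87, 0.77], supports)
ensemble_acc = accuracy_from_recall([0.90, 0.76], supports)

print(round(finetuned_acc, 2), round(ensemble_acc, 2))  # 0.83 0.84
```

The gain of the ensemble comes mostly from higher recall on class 0, which is the larger class and therefore weighs more in the average.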
