# Ensemble - RoBERTa + Logistic Regression

This notebook evaluates logistic regression, RoBERTa, and two ensembles on the test set and times each model's predictions.

>**Note:** This notebook was run in Google Colab, so it does not reference the data files directly. The data is the same as in the repository.

## Imports

In [None]:
from google.colab import drive
import glob

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install simpletransformers -q

In [None]:
import pandas as pd
import numpy as np
import torch 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

## Load Data

In [None]:
# CHANGE TO YOUR PATH
colab_resources_path = "/content/drive/My Drive/Machine Learning/Project/colab_resources"

In [None]:
data_files = glob.glob(colab_resources_path + "/*.csv")
data_files += glob.glob(colab_resources_path + "/*.py")
for data_file in data_files:
    print('Copying file {} to colab root.'.format(data_file))
    !cp "$data_file" .

Copying file /content/drive/My Drive/Machine Learning/Project/colab_resources/test.csv to colab root.
Copying file /content/drive/My Drive/Machine Learning/Project/colab_resources/am_additional.csv to colab root.
Copying file /content/drive/My Drive/Machine Learning/Project/colab_resources/random.csv to colab root.
Copying file /content/drive/My Drive/Machine Learning/Project/colab_resources/am.csv to colab root.
Copying file /content/drive/My Drive/Machine Learning/Project/colab_resources/nam.csv to colab root.
Copying file /content/drive/My Drive/Machine Learning/Project/colab_resources/data_preprocess.py to colab root.
Copying file /content/drive/My Drive/Machine Learning/Project/colab_resources/data_preprocess_old.py to colab root.


In [None]:
from data_preprocess import getTrainData, getTestData

In [None]:
train_data_all = getTrainData(include_random=True, shuffle=True) # article title + body
train_data_title = getTrainData(include_random=True, n_sentences=0, shuffle=True) # article title
train_data_body = getTrainData(include_random=True, no_title=True, shuffle=True) # article body

test_data_all = getTestData() # article title + body
test_data_title = getTestData(n_sentences=0) # article title
test_data_body = getTestData(no_title=True) # article body

## Test

In [None]:
def getResults(model, labels, predictions, time):
    """Return a one-row DataFrame with test metrics (rounded to 4 decimals) and the timing string."""
    acc = np.round(accuracy_score(labels, predictions), 4)
    precision = np.round(precision_score(labels, predictions), 4)
    recall = np.round(recall_score(labels, predictions), 4)
    f1 = np.round(f1_score(labels, predictions), 4)
    mcc = np.round(matthews_corrcoef(labels, predictions), 4)

    # Note: np.array on mixed types stores every value, including the metrics, as a string
    return pd.DataFrame(np.array([[model, acc, precision, recall, f1, mcc, time]]), columns=['model', 'accuracy', 'precision', 'recall', 'f1', 'mcc', 'time'])
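
The five metrics above can also be written out from confusion-matrix counts, which makes it easier to see how they relate. The following is a minimal sketch using the standard definitions; `binary_metrics` and the counts passed to it are hypothetical and are not taken from the notebook output.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC uses all four cells, so it stays informative when classes are imbalanced
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
        "mcc": round(mcc, 4),
    }


# Hypothetical counts for a 159-article test set (an assumption for illustration)
example_metrics = binary_metrics(tp=90, fp=0, fn=7, tn=62)
print(example_metrics)
```

With no false positives, precision is perfect while recall and MCC still reflect the seven missed positives.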

### Logistic Regression - Title + Body

In [None]:
def logreg_predict(vectorizer, logreg, X_test):
    X_test_v = vectorizer.transform(X_test)
    return logreg.predict(X_test_v)

In [None]:
vectorizer = TfidfVectorizer(strip_accents='ascii', lowercase=True, stop_words='english')
logreg = LogisticRegression(random_state=0, C=17, penalty='l2', max_iter=1000)

# Train
X_train_v = vectorizer.fit_transform(train_data_all['text'].array)
y_train = train_data_all['label'].array

logreg.fit(X_train_v, y_train)

# Predict
predictions = logreg_predict(vectorizer, logreg, test_data_all['text'].array)

In [None]:
%%timeit
logreg_predict(vectorizer, logreg, test_data_all['text'].array)

10 loops, best of 3: 67.9 ms per loop


In [None]:
labels = test_data_all['label'].array
result_logreg = getResults("logreg", labels, predictions, "67.9 ms")
result_logreg

| | model | accuracy | precision | recall | f1 | mcc | time |
|---|---|---|---|---|---|---|---|
| 0 | logreg | 0.9308 | 0.9778 | 0.9072 | 0.9412 | 0.861 | 67.9 ms |
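
The timing string passed to `getResults` is copied by hand from the `%%timeit` output. As an alternative, the best time could be captured programmatically. The sketch below uses the standard-library `timeit` module; `best_time` and `format_time` are hypothetical helpers, and the workload is a stand-in for a call such as `logreg_predict(...)`.

```python
import timeit


def best_time(func, repeat=3, number=10):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def format_time(seconds):
    """Format a duration as milliseconds below one second, otherwise as seconds."""
    if seconds < 1:
        return f"{seconds * 1e3:.3g} ms"
    return f"{seconds:.3g} s"


# Stand-in workload; in the notebook this would wrap the prediction call in a lambda
elapsed = best_time(lambda: sum(range(10_000)))
print(format_time(elapsed))
```

The formatted string can then be passed as the `time` argument, which avoids transcription errors when the notebook is re-run.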


### RoBERTa

In [None]:
model_args = ClassificationArgs(sliding_window=True)
model_args.num_train_epochs = 4
model_args.save_best_model = True
model_args.tie_value = 1
model_args.batch_size = 16
model_args.learning_rate = 2e-5
model_args.overwrite_output_dir = True
model_args.max_seq_length = 512
model_args.no_cache = True
model_args.max_grad_norm = 1
model_args.use_multiprocessing = True
model_args.manual_seed = 4
model_args.reprocess_input_data = True
model_args.evaluate_during_training = False
model_args.labels_list = [0, 1]

In [None]:
# Train
train_data_all_r = train_data_all.rename(columns={"label": "labels"})
roberta_all = ClassificationModel('roberta', 'roberta-base', args=model_args)
roberta_all.train_model(train_data_all_r, acc=matthews_corrcoef)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out

(2352, 0.13761654173871105)

In [None]:
def roberta_predict(roberta, X_test):
    result, model_outputs = roberta.predict(X_test)
    return np.array([np.rint(np.mean(np.argmax(j, axis=1))) for j in model_outputs]).astype(int)
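
`roberta_predict` performs hard voting: each sliding window votes for its argmax class, and the document label is the rounded mean of those votes. The toy example below sketches this on made-up logits (the arrays and the `hard_vote` name are assumptions for illustration).

```python
import numpy as np


def hard_vote(window_logits):
    """Majority vote over per-window argmax labels for one document."""
    window_labels = np.argmax(window_logits, axis=1)
    # np.rint rounds halves to the nearest even number, so a 50/50 split gives 0
    return int(np.rint(np.mean(window_labels)))


# One row per window, columns are [class 0, class 1] logits (hypothetical values)
doc_three_windows = np.array([[0.2, 1.5], [2.0, -1.0], [0.1, 0.9]])
doc_two_windows = np.array([[3.0, -3.0], [-0.1, 0.1]])

label_majority = hard_vote(doc_three_windows)  # two of three windows vote for class 1
label_tie = hard_vote(doc_two_windows)         # one vote each, mean is 0.5
print(label_majority, label_tie)
```

Because the function works from the raw model outputs, an exact tie resolves to class 0 here, which may differ from the `tie_value = 1` set in the model arguments.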

In [None]:
predictions_roberta = roberta_predict(roberta_all, test_data_all['text'].array)

In [None]:
%%timeit
roberta_predict(roberta_all, test_data_all['text'].array)

1 loop, best of 3: 9.21 s per loop


In [None]:
labels = test_data_all['label'].array
result_roberta_all = getResults("roberta_all", labels, predictions_roberta, "9.21 s")
result_roberta_all

| | model | accuracy | precision | recall | f1 | mcc | time |
|---|---|---|---|---|---|---|---|
| 0 | roberta_all | 0.956 | 0.9688 | 0.9588 | 0.9637 | 0.9078 | 9.21 s |


#### RoBERTa Soft Voting

In [None]:
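# NOTE: roberta_predic_proba is defined in the "Ensemble: RoBERTa + LR" section below; run that cell first.
roberta_all_prob = roberta_predic_proba(roberta_all, test_data_all['text'].array)
roberta_all_prob = roberta_all_prob[:, 0]  # averaged probability of class 0
predictions_roberta_soft = np.where(roberta_all_prob > 0.5, 0, 1)
# The timing is reused from the hard-voting run above; no separate %%timeit run is shown for soft voting.
result_roberta_all_soft = getResults("roberta_all_soft", labels, predictions_roberta_soft, "9.21 s")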

### Ensemble: RoBERTa + LR

In [None]:
def logreg_predict_proba(vectorizer, logreg, X_test):
    X_test_v = vectorizer.transform(X_test)
    return logreg.predict_proba(X_test_v)

In [None]:
from scipy.special import softmax

def getProbabilitiesRoberta(pred):
    # For each document: softmax every window's logits, then average over windows (soft voting)
    return np.array([np.mean(softmax(j, axis=1), axis=0) for j in pred])

In [None]:
def roberta_predic_proba(roberta, X_test):
    result, model_outputs = roberta.predict(X_test)
    return getProbabilitiesRoberta(model_outputs)
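
Soft voting averages the per-window class probabilities instead of counting per-window votes, so one very confident window can outweigh several uncertain ones. The sketch below shows a made-up document where the two schemes disagree; the logits and the helper names are assumptions, and the softmax is written with NumPy only to keep the example self-contained.

```python
import numpy as np


def softmax_rows(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def soft_vote(window_logits):
    """Average the per-window class probabilities for one document."""
    return softmax_rows(window_logits).mean(axis=0)


# Hypothetical logits: one confident class-0 window, two weak class-1 windows
doc = np.array([[3.0, -3.0], [-0.1, 0.1], [-0.2, 0.2]])

probs = soft_vote(doc)
hard_label = int(np.rint(np.mean(np.argmax(doc, axis=1))))  # majority of window votes
soft_label = 0 if probs[0] > 0.5 else 1                     # same threshold as the notebook
print(probs, hard_label, soft_label)
```

Here two of three windows vote for class 1, yet the averaged probability of class 0 stays above 0.5, so hard voting returns 1 and soft voting returns 0.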

In [None]:
def ensemble_roberta_lr_predict(roberta, vectorizer, logreg, X_test):

    prob_rb = roberta_predic_proba(roberta, X_test)
    prob_lr = logreg_predict_proba(vectorizer, logreg, X_test)

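    w_lr = 0.877 # LR MCC cv6 score
    w_rb = 0.901 # RoBERTa MCC cv6 score

    # Keep only the probability of class 0 from each model
    prob_lr = prob_lr[:, 0]
    prob_rb = prob_rb[:, 0]

    # Weighted average of the two class-0 probabilities
    prob = (prob_lr*w_lr + prob_rb*w_rb)/(w_lr+w_rb)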

    return np.where(prob > 0.5, 0, 1)
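
To make the weighting concrete, the following sketch applies the same formula to three made-up documents. The probability arrays are hypothetical; only the two weights come from the function above.

```python
import numpy as np

w_lr, w_rb = 0.877, 0.901  # weights used in the ensemble function above

# Hypothetical class-0 probabilities for three documents
prob_lr = np.array([0.90, 0.30, 0.55])
prob_rb = np.array([0.80, 0.60, 0.40])

# Weighted average, normalised by the sum of the weights
prob = (prob_lr * w_lr + prob_rb * w_rb) / (w_lr + w_rb)
ensemble_labels = np.where(prob > 0.5, 0, 1)

# Share of the final probability contributed by each model
share_lr = w_lr / (w_lr + w_rb)
share_rb = w_rb / (w_lr + w_rb)
print(prob.round(4), ensemble_labels, round(share_lr, 3), round(share_rb, 3))
```

Since the two weights are close, they normalise to roughly 0.49 and 0.51, so the result is very near a plain average of the two probabilities.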

In [None]:
predictions_ensemble_roberta_and_lr = ensemble_roberta_lr_predict(roberta_all, vectorizer, logreg, test_data_all['text'].array)

In [None]:
%%timeit
ensemble_roberta_lr_predict(roberta_all, vectorizer, logreg, test_data_all['text'].array)

1 loop, best of 3: 9.48 s per loop


In [None]:
labels = test_data_all['label'].array
result_ensemble_roberta_and_lr = getResults("ensemble_roberta_and_lr", labels, predictions_ensemble_roberta_and_lr, "9.48 s")
result_ensemble_roberta_and_lr

| | model | accuracy | precision | recall | f1 | mcc | time |
|---|---|---|---|---|---|---|---|
| 0 | ensemble_roberta_and_lr | 0.956 | 0.9688 | 0.9588 | 0.9637 | 0.9078 | 9.48 s |


### Ensemble: RoBERTa-Title + RoBERTa-Body + LR

In [None]:
# Train RoBERTa-Title
train_data_title_r = train_data_title.rename(columns={"label": "labels"})
roberta_title = ClassificationModel('roberta', 'roberta-base', args=model_args)
roberta_title.train_model(train_data_title_r, acc=matthews_corrcoef)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out

(800, 0.22088229659468198)

In [None]:
# Train RoBERTa-Body
train_data_body_r = train_data_body.rename(columns={"label": "labels"})
roberta_body = ClassificationModel('roberta', 'roberta-base', args=model_args)
roberta_body.train_model(train_data_body_r, acc=matthews_corrcoef)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out

(2324, 0.12983416162467107)

In [None]:
def ensemble_roberta_title_and_body_and_lr_predict(roberta_title, roberta_body, vectorizer, logreg, X_test_all, X_test_title, X_test_body):

    prob_rb_title = roberta_predic_proba(roberta_title, X_test_title)
    prob_rb_body = roberta_predic_proba(roberta_body, X_test_body)
    prob_lr = logreg_predict_proba(vectorizer, logreg, X_test_all)

    w_lr = 0.877
    w_rb_title = 0.863
    w_rb_body = 0.901

    prob_lr = prob_lr[:, 0]
    prob_rb_title = prob_rb_title[:, 0]
    prob_rb_body = prob_rb_body[:, 0]

    prob = (prob_lr*w_lr + prob_rb_body*w_rb_body+prob_rb_title*w_rb_title)/(w_lr+w_rb_title+w_rb_body)

    return np.where(prob > 0.5, 0, 1)
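
The two ensemble functions share the same pattern and differ only in the number of models. One possible generalisation is sketched below; `weighted_soft_vote` is a hypothetical helper, and the probabilities are made up, while the weights are the ones used above.

```python
import numpy as np


def weighted_soft_vote(class0_probs, weights, threshold=0.5):
    """Combine class-0 probabilities from several models by weighted average.

    class0_probs: array-like of shape (n_models, n_docs)
    weights: one weight per model; np.average normalises by their sum
    """
    probs = np.asarray(class0_probs, dtype=float)
    combined = np.average(probs, axis=0, weights=weights)
    return np.where(combined > threshold, 0, 1)


# Hypothetical class-0 probabilities for two documents, one row per model
probs = [
    [0.90, 0.30],  # logistic regression
    [0.70, 0.45],  # RoBERTa-Title
    [0.60, 0.60],  # RoBERTa-Body
]
weights = [0.877, 0.863, 0.901]

combined_labels = weighted_soft_vote(probs, weights)
print(combined_labels)
```

With this shape, adding or removing a model only changes the rows of `probs` and the entries of `weights`.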

In [None]:
X_test_all = test_data_all['text'].array
X_test_title = test_data_title['text'].array
X_test_body = test_data_body['text'].array
predictions_ensemble_roberta_title_and_body_and_lr = ensemble_roberta_title_and_body_and_lr_predict(roberta_title, roberta_body, vectorizer, logreg, X_test_all, X_test_title, X_test_body)

In [None]:
%%timeit
ensemble_roberta_title_and_body_and_lr_predict(roberta_title, roberta_body, vectorizer, logreg, X_test_all, X_test_title, X_test_body)

1 loop, best of 3: 12.1 s per loop


In [None]:
result_ensemble_roberta_title_and_body_and_lr = getResults("ensemble_roberta_title_and_body_and_lr", labels, predictions_ensemble_roberta_title_and_body_and_lr, "12.1 s")

## Results

In [None]:
results = pd.concat([result_logreg, result_roberta_all, result_roberta_all_soft, result_ensemble_roberta_and_lr, result_ensemble_roberta_title_and_body_and_lr], ignore_index=True)
results

| | model | accuracy | precision | recall | f1 | mcc | time |
|---|---|---|---|---|---|---|---|
| 0 | logreg | 0.9308 | 0.9778 | 0.9072 | 0.9412 | 0.861 | 67.9 ms |
| 1 | roberta_all | 0.956 | 0.9688 | 0.9588 | 0.9637 | 0.9078 | 9.21 s |
| 2 | roberta_all_soft | 0.9623 | 0.9691 | 0.9691 | 0.9691 | 0.9207 | 9.21 s |
| 3 | ensemble_roberta_and_lr | 0.956 | 0.9688 | 0.9588 | 0.9637 | 0.9078 | 9.48 s |
| 4 | ensemble_roberta_title_and_body_and_lr | 0.956 | 1.0 | 0.9278 | 0.9626 | 0.9131 | 12.1 s |


In [None]:
!lscpu |grep 'Model name'

Model name:          Intel(R) Xeon(R) CPU @ 2.20GHz


In [None]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-d79d8ced-daed-ec0c-41d4-b24514fd8ea3)


## Conclusion

### Prediction

On the test set, RoBERTa with soft voting over windows surprisingly showed the best results. The test set is small, so this could be down to chance, as the ensemble of RoBERTa-Title, RoBERTa-Body and logistic regression had shown the best results, with the ensemble of RoBERTa-All and logistic regression in second place.

Logistic regression performed poorly here; it showed much better results in cross-validation. This could be why plain RoBERTa beat the ensembles.
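
To give a sense of how coarse the test set is, the sketch below converts counts of correct predictions into accuracies. It assumes that accuracy is computed over all 159 test articles; the counts themselves are inferred for illustration and are not printed anywhere in the notebook.

```python
n_articles = 159  # test set size stated in the Prediction Time section

# Candidate numbers of correctly classified articles (assumed, not notebook output)
accuracy_by_correct = {
    correct: round(correct / n_articles, 4) for correct in (148, 152, 153)
}

for correct, accuracy in accuracy_by_correct.items():
    print(f"{correct}/{n_articles} correct -> accuracy {accuracy}")

# Accuracy change caused by a single article
one_article_step = round(1 / n_articles, 4)
print(one_article_step)
```

Under that assumption, the accuracies in the results table differ by whole articles, and the gap between 0.956 and 0.9623 corresponds to a single article, which supports reading small differences with caution.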

### Prediction Time

We timed prediction on the test set (159 articles).

The runs used Google Colab with the following hardware:
* CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
* GPU: Tesla T4

Unsurprisingly, logistic regression is by far the fastest model here (**67.9 ms**). Given its results (**F1: 0.9412**), it could be a good option if you are willing to accept slightly worse predictions. It can run on practically any PC.

Using RoBERTa (base) greatly increases prediction time (**9.21 s** for RoBERTa-All, more for the ensembles) and also requires around 10 GB of GPU memory, so running it requires a high-performance machine.
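
The totals above can be put on a per-article basis. The sketch below takes the timings from the results table and divides them by the 159 test articles; it is plain arithmetic and assumes that prediction time scales roughly linearly with the number of articles.

```python
n_articles = 159

# Total prediction times in seconds, taken from the results table
timings_s = {
    "logreg": 0.0679,
    "roberta_all": 9.21,
    "ensemble_roberta_and_lr": 9.48,
    "ensemble_roberta_title_and_body_and_lr": 12.1,
}

# Average time per article in milliseconds
per_article_ms = {name: 1000 * t / n_articles for name, t in timings_s.items()}

# Slowdown of each model relative to logistic regression
slowdown = {name: t / timings_s["logreg"] for name, t in timings_s.items()}

for name in timings_s:
    print(f"{name}: {per_article_ms[name]:.3g} ms/article, {slowdown[name]:.0f}x logreg")
```

On these numbers, logistic regression needs well under a millisecond per article, while RoBERTa-All needs tens of milliseconds on the GPU listed above.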