IMPORTANT (for Google Colab): `Runtime` 🡒 `Change runtime type` 🡒 `Hardware accelerator` 🡒 `(GPU)`

In [None]:
%%capture
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import nltk
import re

# enable progress bars for pandas apply (progress_apply)
from tqdm.auto import tqdm
tqdm.pandas()

# here we're going to use SBERT https://github.com/UKPLab/sentence-transformers
# sentence_transformers provides pretrained BERT models that have been trained on specific tasks such as QA tasks
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

# install a tool that expands contractions: "I'd" -> "I would", "they've" -> "they have", and so on
!pip install contractions
import contractions

import warnings
warnings.filterwarnings('ignore')

In [None]:
# google drive mounting
from google.colab import drive
drive.mount('/content/drive')

train_df = pd.read_json('./drive/MyDrive/nlp/train.jsonl', lines=True)
val_df = pd.read_json('./drive/MyDrive/nlp/val.jsonl', lines=True)

Mounted at /content/drive


In [None]:
# combine the training and validation parts of the data so that both get the same preprocessing
train_df['purpose'] = 'train'
val_df['purpose'] = 'val'
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([train_df, val_df], ignore_index=True)

In [None]:
# the BERT vectorizer is designed to work with ordinary texts,
# so we only delete what is specific to Wikipedia (links, rare transcription symbols, etc.)
def light_clean(text):
    text = contractions.fix(text) # expand contractions ("I'd" -> "I would")
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # remove urls
    text = re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', text) # remove html tags and entities
    text = re.sub(r'[^\x00-\x7f]', '', text) # remove non-ascii characters
    text = re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', '', text) # remove punctuation
    return text

In [None]:
# preprocessing
df['question_clean'] = df['question'].apply(light_clean)
df['passage_clean'] = df['passage'].apply(light_clean)

In [None]:
# split back to 2 parts
drop_cols = ['purpose', 'dist', 'question_length', 'passage_length', 'question_embeddings', 'passage_embeddings']
train_df = df[df['purpose'] == 'train'].drop(columns=drop_cols, errors='ignore')
val_df = df[df['purpose'] == 'val'].drop(columns=drop_cols, errors='ignore')

In [None]:
%%capture
# other pretrained models https://www.sbert.net/docs/pretrained_models.html
bert = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

In [None]:
%%time
# this cell takes about seven minutes to run
train_X_clean = pd.DataFrame(np.vstack(train_df.progress_apply(lambda x: bert.encode([[x['question_clean'],x['passage_clean']]]), axis=1)))
val_X_clean = pd.DataFrame(np.vstack(val_df.progress_apply(lambda x: bert.encode([[x['question_clean'],x['passage_clean']]]), axis=1)))

# since the BERT model was trained on texts from Wikipedia, among others, we also try feeding the vectorizer the raw, uncleaned data
train_X_no_clean = pd.DataFrame(np.vstack(train_df.progress_apply(lambda x: bert.encode([[x['question'],x['passage']]]), axis=1)))
val_X_no_clean = pd.DataFrame(np.vstack(val_df.progress_apply(lambda x: bert.encode([[x['question'],x['passage']]]), axis=1)))

  0%|          | 0/9427 [00:00<?, ?it/s]

  0%|          | 0/3270 [00:00<?, ?it/s]

  0%|          | 0/9427 [00:00<?, ?it/s]

  0%|          | 0/3270 [00:00<?, ?it/s]

CPU times: user 6min 45s, sys: 5.16 s, total: 6min 50s
Wall time: 7min


In [None]:
# this cell takes several minutes to run
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import metrics

# compared models
models = {'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(), 
          'LogisticRegression': LogisticRegression(),
          'DecisionTreeClassifier': DecisionTreeClassifier(max_depth=10),
          'GaussianNB': GaussianNB(),
          'KNeighborsClassifier': KNeighborsClassifier(), 
          'C-SVM': SVC(random_state=42)}

# embedding variants compared as model input
data_types = {'cleaned    ': (train_X_clean, val_X_clean), 'not cleaned': (train_X_no_clean, val_X_no_clean)}

# true labels
train_y = train_df['label'].tolist()
val_y = val_df['label'].tolist()

# baseline: share of positive labels, i.e. the accuracy of always predicting True
base_train = train_df[train_df['label']].shape[0]/train_df.shape[0]
base_val = val_df[val_df['label']].shape[0]/val_df.shape[0]
print(f' train_acc \t val_acc \tmodel\n{base_train: .6}\t{base_val: .6}\tbaseline')

for model_name, cur_model in models.items():
    for data_type, data in data_types.items():
        # unpack the train and validation feature matrices
        train_X, val_X = data
        model = cur_model
        
        # training 
        model.fit(train_X, train_y)

        # predicting
        train_predicted = model.predict(train_X)
        val_predicted = model.predict(val_X)
        
        # evaluating
        train_accuracy = metrics.accuracy_score(train_y, train_predicted)
        val_accuracy = metrics.accuracy_score(val_y, val_predicted)

        print(f'{train_accuracy: .6}\t{val_accuracy: .6}\t{data_type}\t{model_name}')

 train_acc 	 val_acc 	model
 0.623104	 0.621713	baseline
 0.72112	 0.657798	cleaned    	LinearDiscriminantAnalysis
 0.721014	 0.662997	not cleaned	LinearDiscriminantAnalysis
 0.715498	 0.660245	cleaned    	LogisticRegression
 0.716559	 0.670031	not cleaned	LogisticRegression
 0.767476	 0.600917	cleaned    	DecisionTreeClassifier
 0.782752	 0.602446	not cleaned	DecisionTreeClassifier
 0.641243	 0.592661	cleaned    	GaussianNB
 0.643789	 0.606422	not cleaned	GaussianNB
 0.769598	 0.630275	cleaned    	KNeighborsClassifier
 0.767264	 0.630581	not cleaned	KNeighborsClassifier
 0.849369	 0.68685	cleaned    	C-SVM
 0.849157	 0.692049	not cleaned	C-SVM


Several algorithms beat the baseline by whole percentage points on validation even without parameter tuning. C-SVM exceeded the baseline by 7 percentage points right away, compared with the 3 points obtained with the previous vectorization only after lengthy parameter tuning.
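The margins over the baseline can be read off the results table directly. A minimal sketch of that arithmetic, using the validation accuracies printed above for the not-cleaned embeddings (the variable names are illustrative and not part of the notebook):

```python
# Validation accuracies copied from the results table above (not-cleaned embeddings).
baseline_val = 0.621713
val_acc = {
    "LinearDiscriminantAnalysis": 0.662997,
    "LogisticRegression": 0.670031,
    "DecisionTreeClassifier": 0.602446,
    "GaussianNB": 0.606422,
    "KNeighborsClassifier": 0.630581,
    "C-SVM": 0.692049,
}

# Gain over the baseline in percentage points, rounded to two decimals.
gains = {name: round((acc - baseline_val) * 100, 2) for name, acc in val_acc.items()}

# Print the models from the largest gain to the smallest.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{gain:+6.2f} pp  {name}")

best_model = max(gains, key=gains.get)
```

With these numbers C-SVM leads with a gain of about 7 points, while DecisionTreeClassifier and GaussianNB fall below the baseline on validation.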

BERT is a deep learning model with a transformer architecture. Deep learning algorithms usually do not need fully prepared features as input and often work on "raw" data, because the formation of features is itself optimized during training. So it is not surprising that BERT vectorization turned out to be more effective than the pipeline: symbol cleaning, lemmatization, stop-word removal, concatenation of questions and passages, and vectorization of sentences as an average of word vectors.
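To make the last step of that pipeline concrete, here is a toy sketch of sentence vectorization as an average of word vectors. The three-dimensional vectors and the helper `average_embedding` are made up for illustration and are not the embeddings or code used earlier. The sketch also shows one limitation of averaging: word order is lost.

```python
import math

# Toy 3-dimensional word vectors; the values are invented for illustration only.
word_vectors = {
    "is": [0.1, 0.0, 0.2],
    "water": [0.9, 0.3, 0.0],
    "wet": [0.4, 0.8, 0.1],
}

def average_embedding(tokens, vectors):
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        raise ValueError("none of the tokens has a vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

question_vec = average_embedding(["is", "water", "wet"], word_vectors)
shuffled_vec = average_embedding(["wet", "is", "water"], word_vectors)

# Averaging ignores word order, so both sentences get the same vector
# (up to floating-point rounding).
same = all(math.isclose(a, b) for a, b in zip(question_vec, shuffled_vec))
print(same)
```

A transformer encoder such as BERT takes token order and context into account, which an averaged bag of word vectors cannot do.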

Since every algorithm reaches higher validation accuracy on completely unprepared data than on data passed through light_clean(), we can say that the BERT vectorizer is not just resistant to the "noise" we cleaned from the data, but appears to make use of it. It is worth mentioning, though, that the preparation process was far from ideal, and we probably removed important tokens while cleaning.
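To gauge how large the difference between the two inputs is, the accuracy gaps can be converted into counts of validation examples. This is a rough sketch based only on the numbers printed above: the validation size of 3270 comes from the progress bar output, and the variable names are illustrative.

```python
VAL_SIZE = 3270  # validation set size, taken from the progress bar output above

# (cleaned, not cleaned) validation accuracies copied from the results table.
val_acc = {
    "LinearDiscriminantAnalysis": (0.657798, 0.662997),
    "LogisticRegression": (0.660245, 0.670031),
    "DecisionTreeClassifier": (0.600917, 0.602446),
    "GaussianNB": (0.592661, 0.606422),
    "KNeighborsClassifier": (0.630275, 0.630581),
    "C-SVM": (0.68685, 0.692049),
}

# Number of additional validation examples answered correctly on raw text.
extra_correct = {
    name: round((raw - cleaned) * VAL_SIZE)
    for name, (cleaned, raw) in val_acc.items()
}

for name, count in extra_correct.items():
    print(f"{count:3d} more correct answers on raw text  {name}")
```

With these numbers the gap ranges from 1 extra correct answer (KNeighborsClassifier) to 45 (GaussianNB) out of 3270, so the advantage of raw text is consistent across models but small.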
