#Predicting the Quality of StackOverflow Questions with DistilBert

##What is Natural Language Processing

Natural Language Processing, or NLP, sits at the intersection of linguistics, computing, and AI. Its goal is to understand or perform computations on natural language such as English.

##History of Natural Language Processing

Natural language processing has been around since the early days of computing, when Alan Turing and his colleagues used code-breaking machines to decipher German messages during WW2. Ever since then we have used grammatical, mathematical, and statistical methods to make computers understand language.

<img src="https://miro.medium.com/max/1200/1*tMmPG-QQQze0egScdM44nA.png">

##Natural Language Processing in Recent Years

In recent years there have been many popular methods of representing and computing natural language.

###Word Embeddings

Word embeddings describe each word with a number of features and project it into a latent space. This is useful because each word becomes a vector of numbers that we can perform computations on, and similarity between words shows up as distance: similar words are mapped closer together and dissimilar words farther apart.
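To make the idea of "closer together" concrete, here is a minimal sketch that compares words with cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these toy vectors, "king" scores higher against "queen" than against "apple", which is the behaviour we hope a trained embedding shows.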

<img src="https://miro.medium.com/max/1400/1*sAJdxEsDjsPMioHyzlN3_A.png">

###Neural Networks

As discussed in our LSTM notebook, LSTMs work great with language. This is because some words in a sentence modify others, and some refer to other words in the same sentence. Essentially, language is sequential, so sequence-based models can be used to make classifications or generate other sentences.

###Seq2Seq

Seq2Seq uses an encoder and a decoder, usually RNN or LSTM based. The encoder creates a numerical representation of a sequence of words, and this representation is fed into the decoder to give us a final output. This works especially well for language translation and sentence generation. For example, you can feed a question into the encoder and get an answer from the decoder. As you can see, this model introduces a bottleneck: it is hard to accurately represent language as a single vector.

<img src="https://www.guru99.com/images/1/111318_0848_seq2seqSequ1.png">

###Attention

Ignoring our previous model for now, attention is simply a mechanism that maps the relations of the words in a sentence onto the other words in the same sentence. In this example you can see that the word "The" shares weights with words such as "animal" and "street". This is because, in this sentence, the word "The" refers to those other words.

<img src="https://miro.medium.com/max/748/1*9XxSNAGInd3rbwTE_AwrQA.png">

We can relate this back to our previous Seq2Seq model and solve the bottleneck issue. By using attention we can provide information about the context of words in a sentence to our decoder and achieve better results.
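The attention mechanism described above can be sketched in a few lines. This is a simplified version that uses the word vectors themselves as queries, keys, and values, whereas a real attention layer first applies learned projections. The two-dimensional vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries, keys and values are the inputs themselves."""
    dim = len(vectors[0])
    weights = []
    outputs = []
    for query in vectors:
        # Scaled dot product of this word against every word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all word vectors.
        outputs.append([sum(w * v[i] for w, v in zip(row, vectors))
                        for i in range(dim)])
    return weights, outputs

# Hypothetical 2-dimensional vectors for a three-word sentence.
sentence = ["the", "animal", "street"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
weights, outputs = self_attention(vectors)
for word, row in zip(sentence, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row is one word's attention weights over the whole sentence, which is the kind of table the attention figures visualize.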

<img src="https://lena-voita.github.io/resources/lectures/seq2seq/attention/attn_for_steps/6-min.png">

##Transformers

Transformers are the most popular architecture in NLP today (2023). A transformer is similar to Seq2Seq in that it is an encoder/decoder based model, but it no longer uses RNNs or LSTMs. There are two reasons for this. Firstly, RNNs and LSTMs only have a limited amount of effective memory; an LSTM's is bigger, but it still suffers from this issue. Secondly, because of their sequential nature, training through timesteps is slow and cannot be parallelized for faster training. Transformers ditch recurrence and rely on the attention between words instead. You can read about this in far more detail in the famous paper "Attention Is All You Need": https://arxiv.org/abs/1706.03762. The downside is that transformers are very large models and require lots of data.

<img src="https://miro.medium.com/max/1400/1*BHzGVskWGS_3jEcYYi6miQ.png">

##BERT

To meet the needs of the data-hungry transformer model, "Bidirectional Encoder Representations from Transformers", or BERT, was made. BERT is trained with a method called self-supervised learning. This method takes sentences and masks parts of them; the model then tries to predict the masked words and corrects itself. This means that we don't need labeled data, just language 😀. As a result we can create HUGE models and train them on HUGE corpora, such as the English dictionary, Wikipedia, Twitter, etc.
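Here is a minimal sketch of the masking step, to show how the training targets come from the text itself. It is simplified: the function name `mask_tokens` and the seed are my own, and BERT's actual procedure also sometimes swaps in a random word or leaves the chosen word unchanged.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and remember the originals."""
    rng = random.Random(seed)
    masked = []
    targets = {}  # position -> original token the model should predict
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

tokens = "why are java optionals designed to be immutable".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The `targets` dictionary plays the role of the labels, and nobody had to annotate anything by hand.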

<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MLM.png">

###Transfer Learning

As mentioned before, we can train these large models on large datasets. However, this is only practical for large businesses like Google that have the compute resources necessary to train them. Because of this, companies pretrain models on some corpus and let us fine-tune the model by training the head of the network on our very own dataset.

#The Code

##Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

import bs4
from bs4 import BeautifulSoup

import re

In [None]:
train = pd.read_csv("https://raw.githubusercontent.com/utkML/Stack-Overflow-Bert-Classification/main/stackoverflow_data/train.csv")
valid = pd.read_csv("https://raw.githubusercontent.com/utkML/Stack-Overflow-Bert-Classification/main/stackoverflow_data/valid.csv")

##Shape our data

Firstly we will need to shape our data in a way that we can feed into the network. Like with most machine learning tasks, let's start by defining our data and our labels.

In [None]:
frames = [train, valid]
dataset = pd.concat(frames)

In [None]:
dataset = dataset.reset_index()

In [None]:
key = {'LQ_CLOSE':0, 'LQ_EDIT': 1, 'HQ':2}
dataset["Y"] = dataset["Y"].map(key)

In [None]:
dataset["Text"] = dataset["Title"] + " " + dataset["Body"]

In [None]:
dataset = dataset[["Text", "Y"]]
dataset.columns = ['text', 'labels']

Now we have a dataset consisting of our text and our encoded label that we will use for predictions.

In [None]:
dataset["text"][0]

Unnamed: 0,text,labels
0,Java: Repeat Task Every Random Seconds <p>I'm ...,0
1,Why are Java Optionals immutable? <p>I'd like ...,2
2,Text Overlay Image with Darkened Opacity React...,2
3,Why ternary operator in swift is so picky? <p>...,2
4,hide/show fab with scale animation <p>I'm usin...,2
...,...,...
59995,How can I align two flex boxes to follow each ...,0
59996,C++ The correct way to multiply an integer and...,0
59997,WHY DJANGO IS SHOWING ME THIS ERROR WHEN I TRY...,1
59998,PHP - getting the content of php page <p>I hav...,0


##Clean the Data

When using language models you want to break everything down into language that will work well with the model. For example, the transformer was not trained on HTML tags, and they don't tell us anything about the label, so let's remove them. The same can be said for newline characters. To clean our data even further we can expand all contractions into their respective words.

In [None]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize

def CleanQuestions(data):
  # Strip HTML tags and flatten line breaks
  data = BeautifulSoup(data, "html.parser").text
  data = data.replace("\n", " ")
  data = data.replace("\r", "")

  # Expand contractions first, while the apostrophes are still present
  data = re.sub(r"won\'t", "will not", data)
  data = re.sub(r"can\'t", "can not", data)

  data = re.sub(r"n\'t", " not", data)
  data = re.sub(r"\'re", " are", data)
  data = re.sub(r"\'s", " is", data)
  data = re.sub(r"\'d", " would", data)
  data = re.sub(r"\'ll", " will", data)
  data = re.sub(r"\'t", " not", data)
  data = re.sub(r"\'ve", " have", data)
  data = re.sub(r"\'m", " am", data)
  # Match the apostrophe-free forms only as whole words,
  # so that "Image" is not rewritten to "I amage"
  data = re.sub(r"\bIm\b", "I am", data)
  data = re.sub(r"\bId\b", "I would", data)

  # Drop everything that is not a letter or whitespace
  data = re.sub(r'[^a-zA-Z\s]', '', data)
  data = word_tokenize(data)
  return data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
dataset['text'] = dataset["text"].map(CleanQuestions)

In [None]:
dataset["labels"][419]

0

Now our dataset is cleaned of the impurities that were present before. Now onto feeding the data into our model.
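One detail worth checking in any cleaning function like this is the order of the steps. The sketch below uses two small helper functions of my own (not part of the notebook) to show that contractions can only be expanded while the apostrophes are still in the text.

```python
import re

def strip_non_letters(text):
    """Keep only letters and whitespace."""
    return re.sub(r"[^a-zA-Z\s]", "", text)

def expand_contractions(text):
    """Expand a few common English contractions (not an exhaustive list)."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'m", " am", text)
    return text

sample = "I'm sure it won't work"

# Stripping first removes the apostrophes, so nothing is left to expand.
strip_first = expand_contractions(strip_non_letters(sample))
# Expanding first handles the contractions before punctuation is dropped.
expand_first = strip_non_letters(expand_contractions(sample))

print(strip_first)
print(expand_first)
```

Stripping first leaves words like "Im" and "wont" in the text, while expanding first gives the full words.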

The reason I turn this into a Hugging Face dataset is simply to make the process more general and easier. All examples in the Hugging Face documentation use this dataset type, so it will most likely be easier to find help.

In [None]:
dataset.to_csv('dataset.csv', index=None)

In [None]:
!pip install transformers datasets


The `to_csv` call above simply writes our dataset to a CSV file in the current Colab session's storage. We can then reload it as a Hugging Face dataset.

In [None]:
!head dataset.csv

text,labels
"['Java', 'Repeat', 'Task', 'Every', 'Random', 'Seconds', 'I', 'am', 'already', 'familiar', 'with', 'repeating', 'tasks', 'every', 'n', 'seconds', 'by', 'using', 'JavautilTimer', 'and', 'JavautilTimerTask', 'But', 'lets', 'say', 'I', 'want', 'to', 'print', 'Hello', 'World', 'to', 'the', 'console', 'every', 'random', 'seconds', 'from', 'Unfortunately', 'I', 'am', 'in', 'a', 'bit', 'of', 'a', 'rush', 'and', 'dont', 'have', 'any', 'code', 'to', 'show', 'so', 'far', 'Any', 'help', 'would', 'be', 'apriciated']",0
"['Why', 'are', 'Java', 'Optionals', 'immutable', 'I', 'would', 'like', 'to', 'understand', 'why', 'Java', 'Optionals', 'were', 'designed', 'to', 'be', 'immutable', 'Is', 'it', 'just', 'for', 'threadsafety']",2
"['Text', 'Overlay', 'I', 'amage', 'with', 'Darkened', 'Opacity', 'React', 'Native', 'I', 'am', 'attempting', 'to', 'overlay', 'a', 'title', 'over', 'an', 'image', 'with', 'the', 'image', 'darkened', 'with', 'a', 'lower', 'opacity', 'However', 'the', 'opacity', '

In [None]:
from datasets import load_dataset
data = load_dataset('csv', data_files='dataset.csv')




As you can see, our dataset is essentially a dictionary. We currently only have train data, so let's fix that by splitting it.

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 60000
    })
})

In [None]:
# train_test_split here is a method of the Hugging Face Dataset, so no scikit-learn import is needed
data = data['train'].train_test_split(test_size=0.3, seed=11)

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 42000
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 18000
    })
})
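The row counts above follow directly from `test_size=0.3`. Conceptually, a seeded split just shuffles the row indices and cuts them in two. The sketch below is my own simplified stand-in using only the standard library, not the algorithm the `datasets` library actually uses, so the rows it picks will differ.

```python
import random

def split_indices(num_rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off a test_size fraction."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    cut = num_rows - num_test
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(num_rows=60000, test_size=0.3, seed=11)
print(len(train_idx), len(test_idx))
```

Fixing the seed means the same rows land in the same split every time the notebook is rerun, which keeps results comparable between runs.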

##Tokenization/Vectorization

Tokenization is the process of splitting our text into a list of tokens, which can be whole words or smaller pieces of words. We do this so we can easily give each token its own vector representation. Vectorization is the process of giving each token that vector representation. This can be done with multiple methods such as one-hot encoding. However, since we are using a pre-trained model, we have to use its method, because it already has an existing "vector dictionary".
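As a toy illustration of the two steps, the sketch below builds a tiny vocabulary, turns a sentence into ids, and one-hot encodes one of them. The helper names are my own, and splitting on whitespace is a simplification: the DistilBERT tokenizer works on subword pieces and ships with its own much larger vocabulary.

```python
def build_vocab(sentences):
    """Assign an integer id to every distinct lowercase token; id 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Tokenize on whitespace and look each token up, falling back to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in sentence.lower().split()]

def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1 at the token's position."""
    vector = [0] * vocab_size
    vector[token_id] = 1
    return vector

vocab = build_vocab(["why are java optionals immutable"])
ids = encode("why are python optionals immutable", vocab)
print(ids)
print(one_hot(ids[1], len(vocab)))
```

The word "python" is not in this toy vocabulary, so it maps to the `[UNK]` id. This is why we must use the tokenizer that matches the pre-trained model: the ids only mean something relative to the vocabulary the model was trained with.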

<img src="https://miro.medium.com/max/674/1*YEJf9BQQh0ma1ECs6x_7yQ.png">

Since we need a model-specific tokenizer, let's choose one. We are going to use a DistilBERT model. DistilBERT is a BERT model that has undergone a process called distillation, in which a small "student" model is trained to reproduce the behaviour of a large "teacher" model. This makes it much smaller and easier for us to train.
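To give a feel for distillation, here is a minimal sketch of a soft-target loss: the student is penalized for disagreeing with the teacher's softened output distribution. The logits and the temperature of 2.0 are made-up values, and as far as I know the full DistilBERT training recipe combines this kind of term with other losses, so treat this as one ingredient rather than the whole method.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the student's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(f"close student: {loss_close:.3f}")
print(f"far student:   {loss_far:.3f}")
```

The student that agrees with the teacher gets the lower loss, so minimizing this loss pulls the small model toward the large one's predictions.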

We can then import the tokenizer for it using the AutoTokenizer class.

In [None]:
checkpoint = 'distilbert-base-cased'

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


We can then make our tokenization function. In it we are going to set truncation to true and max_length=100. This will truncate each question if it exceeds 100 tokens, which makes the model train faster.

In [None]:
def tokenize(data):
  return tokenizer(data["text"], truncation=True, max_length=100)

In [None]:
tokenized_data = data.map(tokenize, batched=True)


You can see that we get `input_ids` and an `attention_mask` back. `input_ids` holds our vectorized tokens, and `attention_mask` holds a 1 or a 0 for each token. A 0 marks a padding token that was added to fill out a sequence. Since we did not ask the tokenizer to pad, we don't have any of these yet and all the values will be 1.

In [None]:
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 42000
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 18000
    })
})
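To show where 0s in an `attention_mask` would come from, here is a minimal sketch of truncating and padding a batch by hand. The function `prepare_batch` and the token ids are hypothetical, `pad_id=0` is an assumption, and real tokenizers also keep the closing special token when they truncate, which this sketch does not.

```python
def prepare_batch(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = []
    attention_mask = []
    for seq in truncated:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask

# Hypothetical token ids for two questions of different lengths.
batch = [[101, 7, 8, 9, 10, 11, 102], [101, 5, 102]]
input_ids, attention_mask = prepare_batch(batch, max_length=5)
print(input_ids)
print(attention_mask)
```

The longer question is cut to five ids, the shorter one is padded up to the same length, and the mask records which positions are padding.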

##Configure DistilBERT

We can finally configure our BERT model. This is very different from a regular neural network, but I will go through it step by step.

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install torchinfo

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchinfo
  Downloading torchinfo-1.7.2-py3-none-any.whl (22 kB)
Installing collected packages: torchinfo
Successfully installed torchinfo-1.7.2


First we want to import our model using AutoModelForSequenceClassification. We also need a Trainer and the arguments for that Trainer. We can then select how many labels there are in the model; in our case we have 3. This simply configures the head of the network to match our data.

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.wei

In [None]:
from torchinfo import summary
summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              22,268,928
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           2,307
├─Dropout: 1-4                                          --
Total params: 65,783,811
Trainable params: 65,783,811
Non-trainable params: 0
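The numbers in this summary can be checked with a little arithmetic, which also shows what `num_labels=3` changed. The hidden size of 768 and the maximum of 512 positions are my assumptions, chosen because they reproduce the rows of the summary. The transformer block total is copied from the summary rather than derived.

```python
hidden_size = 768      # assumed; consistent with the LayerNorm row (2 * 768 = 1,536)
max_positions = 512    # assumed maximum sequence length
num_labels = 3         # the value we passed to from_pretrained

def linear_params(inputs, outputs):
    """A fully connected layer has one weight per input/output pair plus one bias per output."""
    return inputs * outputs + outputs

word_embeddings = 22_268_928       # copied from the summary
transformer_blocks = 42_527_232    # copied from the summary

position_embeddings = max_positions * hidden_size
layer_norm = 2 * hidden_size                               # one scale and one shift per feature
pre_classifier = linear_params(hidden_size, hidden_size)   # first Linear layer of the head
classifier = linear_params(hidden_size, num_labels)        # final Linear layer, one output per label

# Implied vocabulary size, given the assumed hidden size.
vocab_size = word_embeddings // hidden_size

total = (word_embeddings + position_embeddings + layer_norm
         + transformer_blocks + pre_classifier + classifier)
print(vocab_size, position_embeddings, pre_classifier, classifier, total)
```

Only the last layer depends on the number of labels, which is why the head is so cheap to reconfigure compared with the rest of the network.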

Now we need to initialize our arguments. In our arguments we have to select an evaluation strategy. I will use epoch since it is the most common and we are familiar with it. We then need to set our batch sizes. Watch out for memory: larger batches need more GPU memory!!!

We also need to make a compute function. We will use sklearn's `f1_score`, but we still need to configure it. In the function we take our logits (prediction values) and our labels, turn the logits into predicted classes, and feed them into the `f1_score` function along with an accuracy calculation.

In [None]:
from sklearn.metrics import f1_score

We can then feed our arguments into the trainer along with our dataset.

In [None]:
train_arguments = TrainingArguments(
    output_dir='training_dir',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs = 5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32
)

def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    acc = np.mean(predictions == labels)
    f1 = f1_score(labels, predictions, average='macro')
    return {'accuracy': acc, 'f1': f1}

trainer = Trainer(
    model,
    train_arguments,
    train_dataset = tokenized_data['train'],
    eval_dataset = tokenized_data['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
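To see what `compute_metrics` reports, here is a hand-rolled sketch of the same two numbers on four made-up predictions. It is written in plain Python instead of NumPy and scikit-learn, and it averages over a fixed number of classes, so treat it as an illustration of macro F1 and not a replacement for `f1_score`.

```python
def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])

def macro_f1(labels, predictions, num_classes):
    """Average of the per-class F1 scores, with every class weighted equally."""
    f1_scores = []
    for cls in range(num_classes):
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, predictions) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Hypothetical logits for four questions and their true labels.
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [1.2, 0.3, 0.9]]
labels = [0, 1, 2, 2]

predictions = [argmax(row) for row in logits]
accuracy = sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)
f1 = macro_f1(labels, predictions, num_classes=3)
print(predictions, accuracy, round(f1, 4))
```

Macro averaging treats the three question classes equally, so a model cannot score well by doing well on only one of them.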

Now we simply have to train it.

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 42000
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 6565
  Number of trainable parameters = 65783811
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.775,0.743563,0.658222,0.657318
2,0.6427,0.743235,0.670833,0.663942


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 18000
  Batch size = 32
Saving model checkpoint to training_dir/checkpoint-1313
Configuration saved in training_dir/checkpoint-1313/config.json
Model weights saved in training_dir/checkpoint-1313/pytorch_model.bin
tokenizer config file saved in training_dir/checkpoint-1313/tokenizer_config.json
Special tokens file saved in training_dir/checkpoint-1313/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  N

KeyboardInterrupt: ignored
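The step counts in the log above follow from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation, which matches the log.

```python
import math

num_examples = 42000   # training rows reported in the log
batch_size = 32        # per_device_train_batch_size
num_epochs = 5         # num_train_epochs

# The last batch of each epoch is smaller than 32 but still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This gives 1313 steps per epoch and 6565 steps in total, which explains both the "Total optimization steps" line and the name of the first saved checkpoint, `checkpoint-1313`.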