## Step 0: Importing the Dataset and the BERT Model
First, import the selected requirements dataset so it is ready for processing.
Then, import the BERT model.



In [None]:
!pip install transformers

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
# Load the PROMISE NFR dataset (a semicolon-delimited CSV file)
data = 'https://github.com/waadalhoshan/datasets/raw/main/Promise_NFR_dataset_orginal.csv'
dataset = pd.read_csv(data, delimiter=';')

# Short class codes used in the dataset and their readable names
labels_short = ['US', 'SE']
labels_long = ['Usability', 'Security']

Requirement_Statements = []
selected_class = 2
original_classes = []
# Keep only the requirements that belong to one of the two selected classes
for index, row in dataset.iterrows():
  if row['class'] == labels_short[0]:
    original_classes.append(labels_long[0])
    Requirement_Statements.append(row['RequirementText'])
  elif row['class'] == labels_short[1]:
    original_classes.append(labels_long[1])
    Requirement_Statements.append(row['RequirementText'])

# Build a two-column DataFrame: readable class name and requirement text
df = pd.DataFrame({'class': original_classes,
                   'RequirementText': Requirement_Statements},
                  columns=['class', 'RequirementText'])
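
The same filtering and renaming can also be written without an explicit loop. The following is a minimal sketch on a toy DataFrame, not the real dataset: the rows, the `PE` class code, and the names `toy`, `label_map`, and `toy_df` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (rows are invented for illustration).
toy = pd.DataFrame({
    'class': ['US', 'SE', 'PE', 'US'],
    'RequirementText': ['Easy to learn.', 'Encrypt all data.',
                        'Respond within 2 seconds.', 'Readable fonts.'],
})

# Short code -> readable name for the two classes we keep.
label_map = {'US': 'Usability', 'SE': 'Security'}

# Keep only rows whose class is in the map, then rename the labels.
toy_df = toy[toy['class'].isin(list(label_map))].copy()
toy_df['class'] = toy_df['class'].map(label_map)
toy_df = toy_df.reset_index(drop=True)
print(toy_df)
```

Adding another class would then only require one more entry in `label_map`.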

We can ask pandas how many requirements are labeled "Usability" and how many are labeled "Security":

In [None]:
df['class'].value_counts().plot(kind='bar')

To use a BERT model, we consider three options provided by the Transformers package:
* distilbert-base-uncased or cased model
* bert-base-uncased or cased model
* bert-large-uncased or cased model

More details about these models can be found at https://huggingface.co/transformers/model_doc/bert.html
However, there are also other BERT models specific to a language or domain of use.

In [None]:
# For DistilBERT:
#model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# For BERTbase
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

## Step 1: Preparing the Dataset for BERT
Before we can hand our sentences to BERT, we need to tokenize them into a format compatible with the BERT model.

In [None]:
# Encode each requirement as a list of token ids, including the special [CLS] and [SEP] tokens
tokenized = df['RequirementText'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

### Padding
After tokenization, `tokenized` is a list of sentences, where each sentence is represented as a list of tokens. We want BERT to process all of our examples at once (as one batch), which is faster. For that reason, we need to pad all lists to the same size so that we can represent the input as one 2-D array rather than a list of lists of different lengths.

In [None]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
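
To make the effect of padding concrete, here is a small sketch of the same logic on hand-written token lists. In the `bert-base-uncased` vocabulary, 101 and 102 are the ids of `[CLS]` and `[SEP]`, and 0 is the id of `[PAD]`; the remaining ids and the `toy_` names are arbitrary values chosen for illustration.

```python
import numpy as np

# Invented token-id lists of different lengths (101 = [CLS], 102 = [SEP]).
toy_tokenized = [[101, 7592, 102], [101, 2023, 2003, 1037, 102], [101, 102]]

# Length of the longest sentence decides the width of the array.
toy_max_len = max(len(ids) for ids in toy_tokenized)

# Append zeros (the [PAD] id) until every row has the same length.
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])
print(toy_padded.shape)
print(toy_padded)
```

Every row now has the length of the longest sentence, so the result is a regular 2-D array.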

Our dataset is now in the `padded` variable. We can view its contents and dimensions below:

In [None]:
print(padded)
# The last expression in a cell is displayed, so the shape goes last
padded.shape

### Masking
If we send `padded` directly to BERT, the padding would slightly confuse it. We need to create another variable that tells BERT to ignore (mask) the padding we've added when it processes its input. That's what `attention_mask` is:

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask)
attention_mask.shape
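
As a quick sanity check of what the mask contains, the sketch below applies the same `np.where` call to a tiny hand-written array. The token ids and the `toy_` names are invented for illustration.

```python
import numpy as np

# Two invented padded sentences; zeros are padding.
toy_padded = np.array([[101, 7592, 102, 0, 0],
                       [101, 2023, 2003, 1037, 102]])

# 1 where there is a real token, 0 where there is padding.
toy_mask = np.where(toy_padded != 0, 1, 0)
print(toy_mask)

# Summing each row gives the number of real tokens per sentence.
print(toy_mask.sum(axis=1))
```

The mask has the same shape as the padded input, which is what the model expects.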

## Step 2: Extract Features using BERT embeddings
Now that we have our model and inputs ready, let's run our model!



In [None]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)
print(input_ids)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
print(last_hidden_states)

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. BERT does sentence classification by adding a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [None]:
features = last_hidden_states[0][:,0,:].numpy()
print(features)
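
The slice `[:, 0, :]` may be easier to follow on a small array. The sketch below assumes the hidden states have the shape (number of sentences, sequence length, hidden size) and uses a tiny made-up array in place of real model output; for `bert-base-uncased` the hidden size is 768, whereas the sizes and `toy_` names here are chosen only for illustration.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 4, 3  # tiny sizes chosen for illustration

# Fake "hidden states": consecutive integers so each position is easy to recognize.
toy_hidden = np.arange(batch_size * seq_len * hidden_size).reshape(
    batch_size, seq_len, hidden_size)

# Keep every sentence, only token position 0 ([CLS]), and every hidden unit.
toy_features = toy_hidden[:, 0, :]
print(toy_features.shape)
print(toy_features)
```

The result has one row per sentence and one column per hidden unit, which is the 2-D layout scikit-learn classifiers expect.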

The labels indicating the class of each requirement now go into the `labels` variable:

In [None]:
labels = df['class']
print(labels)

## Step 3: Train The Classification Model with the BERT embeddings
Let's now split our dataset into a training set and a testing set.

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
print(test_labels)
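
Called without extra arguments, `train_test_split` shuffles randomly and holds out 25% of the rows, so results change from run to run. If a repeatable split that preserves the class ratio is wanted, one option is sketched below on invented data; the `toy_` names, the 15/5 label split, and the chosen `test_size` and `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 20 feature vectors with an imbalanced 15/5 label split.
toy_features = np.arange(40, dtype=float).reshape(20, 2)
toy_labels = np.array(['Usability'] * 15 + ['Security'] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels,
    test_size=0.2,        # hold out 20% of the rows for testing
    stratify=toy_labels,  # keep the class ratio in both splits
    random_state=42,      # fixed seed so the split is repeatable
)
print(len(y_train), len(y_test))
print((y_test == 'Security').sum())
```

Stratifying matters most when one class is much rarer than the other, since a purely random split could leave very few of its examples in the test set.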

Then, we can dive into logistic regression directly with the scikit-learn default parameters, but sometimes it's worth searching for the best value of the `C` parameter, which is the inverse of the regularization strength (smaller values mean stronger regularization).

In [None]:
# Try 20 evenly spaced values of C with cross-validation
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best scores: ', grid_search.best_score_)
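
Instead of copying the best `C` by hand, the fitted search object can be used directly. The sketch below runs on synthetic data from `make_classification` rather than the real BERT features; the log-spaced grid, `max_iter=1000`, `cv=5`, and the names `toy_search` and `best_clf` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the BERT features (not the real embeddings).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': np.logspace(-3, 2, 6)},  # 0.001, 0.01, ..., 100
    cv=5,
)
toy_search.fit(X, y)

# With the default refit=True, best_estimator_ has been retrained on all of X, y.
best_clf = toy_search.best_estimator_
print(toy_search.best_params_, round(toy_search.best_score_, 3))
```

A log-spaced grid is a common alternative to an evenly spaced one here, because it covers several orders of magnitude of `C` with only a few candidates.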

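We now train the classification model. The cell below fits a Gaussian Naive Bayes classifier and keeps several alternatives, including `LogisticRegression`, in a disabled block. If you've chosen to do the grid search, you can plug the value of C into the logistic regression declaration (e.g. `LogisticRegression(C=5.2)`).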

In [None]:
# Alternative classifiers: uncomment one block to use it instead of GaussianNB.
# lr_clf = LogisticRegression()
# lr_clf.fit(train_features, train_labels)
#
# lr_clf = svm.SVC()
# lr_clf.fit(train_features, train_labels)
#
# from sklearn.ensemble import RandomForestClassifier
# lr_clf = RandomForestClassifier(max_depth=2, random_state=0)
# lr_clf.fit(train_features, train_labels)
#
# from sklearn.tree import DecisionTreeClassifier
# lr_clf = DecisionTreeClassifier(random_state=0)
# lr_clf.fit(train_features, train_labels)
#
# Note: LinearRegression is a regressor and expects numeric targets,
# so it cannot be fit on the string class labels used here.
# from sklearn.linear_model import LinearRegression
# lr_clf = LinearRegression()
# lr_clf.fit(train_features, train_labels)
#
# from sklearn.neighbors import KNeighborsClassifier
# lr_clf = KNeighborsClassifier(n_neighbors=3)
# lr_clf.fit(train_features, train_labels)
#
# from sklearn.ensemble import AdaBoostClassifier
# lr_clf = AdaBoostClassifier(n_estimators=100, random_state=0)
# lr_clf.fit(train_features, train_labels)

# Active classifier: Gaussian Naive Bayes
# (the variable keeps the name lr_clf because later cells use it)
from sklearn.naive_bayes import GaussianNB
lr_clf = GaussianNB()
lr_clf.fit(train_features, train_labels)

## Step 4: Evaluating Model
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [None]:
lr_clf.score(test_features, test_labels)
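
Accuracy alone does not show which class the errors fall on. A confusion matrix and a per-class report are one way to look closer. The sketch below uses invented true and predicted labels rather than the notebook's actual predictions; `y_true`, `y_pred`, and `class_names` are hypothetical names.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Invented true and predicted labels, for illustration only.
y_true = ['Usability', 'Usability', 'Security', 'Security', 'Usability', 'Security']
y_pred = ['Usability', 'Security', 'Security', 'Security', 'Usability', 'Usability']

class_names = ['Usability', 'Security']

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_true, y_pred, labels=class_names))
```

With the notebook's own variables, the equivalent call would pass `test_labels` and `lr_clf.predict(test_features)` in place of the toy lists.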

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
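
In recent scikit-learn versions, the default `DummyClassifier` strategy always predicts the most frequent class, so its accuracy is roughly the share of that class in the data. The sketch below computes that baseline by hand on an invented 6/4 label distribution; the `toy_labels` list and the other variable names are assumptions for illustration.

```python
from collections import Counter

# Invented label distribution, for illustration only.
toy_labels = ['Usability'] * 6 + ['Security'] * 4

# A majority-class baseline always predicts the most common label.
majority_label, majority_count = Counter(toy_labels).most_common(1)[0]

# Its accuracy is the fraction of examples that carry that label.
baseline_accuracy = majority_count / len(toy_labels)
print(majority_label, baseline_accuracy)
```

A trained classifier is only informative to the extent that it beats this number, which is why the baseline depends on how balanced the two classes are.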

Reference: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/