# Public Requirement Dataset
In this research, the authors used the PURE dataset, which contains 79 requirements documents in different forms. It is publicly available on the internet for research use. The requirements documents in this dataset are written in natural English. The dataset can be used for NLP tasks such as ambiguity detection, identification, and requirements categorisation.

In [None]:
import pandas as pd

from google.colab import files
uploaded = files.upload()

Saving Pure_Annotate_Dataset.csv to Pure_Annotate_Dataset.csv


In [None]:
import pandas as pd

dataset_path = 'Pure_Annotate_Dataset.csv'
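try:
    # Try UTF-8 first; it raises UnicodeDecodeError on bytes it cannot decode
    df = pd.read_csv(dataset_path, encoding='utf-8')
except UnicodeDecodeError:
    # Fall back to ISO-8859-1 ('latin1' is an alias of the same codec), which accepts any byte sequence
    df = pd.read_csv(dataset_path, encoding='ISO-8859-1')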

# Display a success message and preview the dataset
print("Dataset Loaded Successfully")
print(df.head())
print("\nDataset Information:")
df.info()

Dataset Loaded Successfully
                                                  id  \
0  CCHIT Certified 2011 Ambulatory EHR Criteria 2...   
1  CCHIT Certified 2011 Ambulatory EHR Criteria 2...   
2  CCHIT Certified 2011 Ambulatory EHR Criteria 2...   
3  CCHIT Certified 2011 Ambulatory EHR Criteria 2...   
4  CCHIT Certified 2011 Ambulatory EHR Criteria 2...   

                                            sentence  security  reliability  \
0  The system shall create a single patient recor...         0            0   
1  The system shall associate (store and link) ke...         0            0   
2  The system shall provide the ability to store ...         0            0   
3  The system shall provide a field which will id...         0            0   
4  The system shall provide the ability to merge ...         0            0   

   NFR_boolean  
0            0  
1            0  
2            0  
3            0  
4            0  

Dataset Information:
<class 'pandas.core.frame.DataFrame'

# Data Pre-processing
After uploading the dataset successfully, we have to preprocess it, because some sentences or paragraphs in these documents are irrelevant to the requirements and need to be excluded from the requirement sentences. We performed this task in three steps: **tokenization**, **data cleaning**, and **normalization**.

In [None]:
import pandas as pd
import nltk
import re

# Download necessary resources for nltk (only needed once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


During tokenization, a requirements document is broken up into smaller segments. This process is also called data preparation. The document is broken into paragraphs, the paragraphs into sentences, and the sentences into words. In our experiments, we used the **tokenization functions of NLTK, a Python library**: sentence tokenization extracts English sentences from a document, and, because the dataset is already split into sentences, the cell below applies `word_tokenize` to split each sentence into word tokens.

In [None]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt') # Download the required tokenizer

# Tokenize each sentence in the 'sentence' column and add new column 'tokens'
df['tokens'] = df['sentence'].apply(word_tokenize)

# Display the original text and its tokenized version
print(df[['sentence', 'tokens']].head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


                                            sentence  \
0  The system shall create a single patient recor...   
1  The system shall associate (store and link) ke...   
2  The system shall provide the ability to store ...   
3  The system shall provide a field which will id...   
4  The system shall provide the ability to merge ...   

                                              tokens  
0  [The, system, shall, create, a, single, patien...  
1  [The, system, shall, associate, (, store, and,...  
2  [The, system, shall, provide, the, ability, to...  
3  [The, system, shall, provide, a, field, which,...  
4  [The, system, shall, provide, the, ability, to...  


The objective of the data cleaning process is to remove all irrelevant tokens from the requirement sentences. This task consists of three steps.
1.   Punctuation removal: punctuation marks such as full stops, question marks, commas, and colons are removed from the requirement sentences.
2.   Stop-word removal: high-frequency words (such as 'they', 'them', 'their', 'you', 'should', and 'from') that don't add any essential information to the requirement sentence are removed.
3.   Non-alphabetic token removal: tokens that contain no alphabetic characters, and therefore no useful information, are removed.

In our model, **we used the Python library Natural Language Toolkit (NLTK)**. This library provides a list of the most common stop words in the English language.

In [None]:
import re

# Define a function to clean each token in the tokenized list
def clean_tokens(tokens):
    # Strip every non-alphabetic character (punctuation, digits, symbols) and lowercase the token
    stripped = (re.sub(r'[^A-Za-z]', '', token).lower() for token in tokens)
    # Drop tokens that end up empty, e.g. pure punctuation or numbers
    return [token for token in stripped if token]

# Apply data cleaning to each list of tokens in the 'tokens' column
df['cleaned_tokens'] = df['tokens'].apply(clean_tokens)

print(df[['tokens', 'cleaned_tokens']].head()) # Display the original tokens and cleaned tokens

                                              tokens  \
0  [The, system, shall, create, a, single, patien...   
1  [The, system, shall, associate, (, store, and,...   
2  [The, system, shall, provide, the, ability, to...   
3  [The, system, shall, provide, a, field, which,...   
4  [The, system, shall, provide, the, ability, to...   

                                      cleaned_tokens  
0  [the, system, shall, create, a, single, patien...  
1  [the, system, shall, associate, store, and, li...  
2  [the, system, shall, provide, the, ability, to...  
3  [the, system, shall, provide, a, field, which,...  
4  [the, system, shall, provide, the, ability, to...  
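
The cleaning cell above strips non-alphabetic characters and lowercases each token, which covers steps 1 and 3. Step 2, stop-word removal, could be sketched as follows. This is a minimal sketch under stated assumptions: `SAMPLE_STOP_WORDS` is a small hand-picked set used only for illustration (not the NLTK list), and `remove_stop_words` is a hypothetical helper name. In the notebook the set would more likely be built from `stopwords.words('english')`.

```python
# Illustrative stop-word set; an assumption for this sketch, not the full NLTK list
SAMPLE_STOP_WORDS = {"the", "a", "an", "to", "and", "of",
                     "they", "them", "their", "you", "should", "from"}

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return the tokens that are not in the stop-word set (hypothetical helper)."""
    return [token for token in tokens if token not in stop_words]

cleaned = ["the", "system", "shall", "create", "a", "single", "patient", "record"]
print(remove_stop_words(cleaned))
# → ['system', 'shall', 'create', 'single', 'patient', 'record']
```

Applied to the DataFrame, the same helper could be passed to `df['cleaned_tokens'].apply(...)` before lemmatization.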


In the normalization process, **we aimed to convert all words to a more uniform form** by transforming each one to a common base form. For normalization, we use lemmatization to convert each word to its base (dictionary) form. **Lemmatization** is preferred over stemming because it produces actual words (e.g., "studies" becomes "study", whereas a Porter stemmer produces "studi"). Note that NLTK's `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as "running" are left unchanged by the cell below. This step improves text modelling and matching.

In [None]:
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4') # Download WordNet data for lemmatization

lemmatizer = WordNetLemmatizer() # Initialize the lemmatizer

# Lemmatize each token in a list of cleaned tokens
def lemmatize_tokens(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

# Create the new column 'lemmatized_tokens' by applying lemmatization to each list of cleaned tokens
df['lemmatized_tokens'] = df['cleaned_tokens'].apply(lemmatize_tokens)

# Display the original, cleaned, and lemmatized tokens
print(df[['tokens', 'cleaned_tokens', 'lemmatized_tokens']].head())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


                                              tokens  \
0  [The, system, shall, create, a, single, patien...   
1  [The, system, shall, associate, (, store, and,...   
2  [The, system, shall, provide, the, ability, to...   
3  [The, system, shall, provide, a, field, which,...   
4  [The, system, shall, provide, the, ability, to...   

                                      cleaned_tokens  \
0  [the, system, shall, create, a, single, patien...   
1  [the, system, shall, associate, store, and, li...   
2  [the, system, shall, provide, the, ability, to...   
3  [the, system, shall, provide, a, field, which,...   
4  [the, system, shall, provide, the, ability, to...   

                                   lemmatized_tokens  
0  [the, system, shall, create, a, single, patien...  
1  [the, system, shall, associate, store, and, li...  
2  [the, system, shall, provide, the, ability, to...  
3  [the, system, shall, provide, a, field, which,...  
4  [the, system, shall, provide, the, ability, to..

# Feature Extraction (Vectorization)
After data pre-processing, the second step in our methodology is to extract representative features from the requirement sentences using several feature extraction techniques from NLP.

In the research, the authors used four vectorization techniques. Two of them are **syntactic methods: TF and TF-IDF**. The other two vectorization methods are **semantic methods: Word2Vec and BERT.**

# Term Frequency
**Term Frequency** is one of the basic vectorization methods used in information retrieval and NLP. In our approach, we use this method to count how many times each word of the vocabulary built from all requirement documents appears in a requirement sentence, and represent the counts as a vector.

We created a word dictionary containing all normalized words in the requirement documents. This process is also called **bag of words (BOW)**. Each row corresponds to a requirement sentence and each column represents a unique word. Each time a word occurs in a sentence, the count in the corresponding cell is increased by one.
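
The row-and-column layout described above can be illustrated with a tiny hand-built example. This is a minimal sketch in plain Python: the two sentences are made-up toy data (not taken from PURE) and `term_frequency_row` is a hypothetical helper name.

```python
from collections import Counter

# Two toy, already-normalized requirement sentences (illustrative data only)
sentences = [
    ["system", "shall", "store", "patient", "record"],
    ["system", "shall", "merge", "patient", "record", "record"],
]

# Vocabulary: one column per unique word, in sorted order
vocabulary = sorted({word for sentence in sentences for word in sentence})

def term_frequency_row(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tf_rows = [term_frequency_row(sentence, vocabulary) for sentence in sentences]

print(vocabulary)  # → ['merge', 'patient', 'record', 'shall', 'store', 'system']
for row in tf_rows:
    print(row)
# → [0, 1, 1, 1, 1, 1]
# → [1, 1, 2, 1, 0, 1]
```

The word "record" appears twice in the second sentence, so its cell holds 2; words absent from a sentence stay at 0, which is why the real matrix is mostly zeros.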

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Convert preprocessed text into a format suitable for vectorization
preprocessed_text = df['lemmatized_tokens'].apply(lambda x: ' '.join(x))

# Initialize CountVectorizer to compute TF
tf_vectorizer = CountVectorizer()
tf_matrix = tf_vectorizer.fit_transform(preprocessed_text)

# Display TF feature matrix
print("TF Matrix Shape:", tf_matrix.shape)
print("Sample TF Matrix:", tf_matrix.toarray())

TF Matrix Shape: (11440, 6153)
Sample TF Matrix: [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
import numpy as np

# Checking the shape of tf_matrix to determine the number of rows
num_rows = tf_matrix.shape[0]

row_index = 0  # Use 0 to access the first row
if row_index < num_rows:
    row_array = tf_matrix[row_index].toarray()[0]

    limited_array = row_array[:15] # Limit the array to the first 15 entries

    print(f"Row {row_index} (limited to 15 elements):", limited_array)
else:
    print(f"Error: Row index {row_index} is out of range. tf_matrix has {num_rows} rows.")

Row 0 (limited to 15 elements): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In **TF-IDF (Term Frequency-Inverse Document Frequency)**, we quantify the importance of a word in the requirement documents. A weight is computed for each word that signifies its importance across all requirement documents: words that occur in many sentences receive lower weights than words concentrated in a few. This method is widely used in information retrieval and NLP. It improves on the basic features that can be extracted from the requirement sentences, so that the classifier can better differentiate between NFR categories.
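
To make the weighting concrete, the sketch below computes TF-IDF by hand on toy sentences. It assumes the smoothed variant `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is, to my understanding, the default behaviour of scikit-learn's `TfidfVectorizer`; the data and the helper name `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

# Toy normalized sentences (illustrative data, not from PURE)
sentences = [
    ["system", "shall", "encrypt", "password"],
    ["system", "shall", "store", "record"],
    ["system", "shall", "respond", "within", "second"],
]
n_docs = len(sentences)
vocabulary = sorted({w for s in sentences for w in s})

# Document frequency: number of sentences that contain each word
doc_freq = Counter(w for s in sentences for w in set(s))

# Smoothed inverse document frequency (assumed variant, see lead-in)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_row(tokens):
    """TF-IDF weights for one sentence, L2-normalized."""
    counts = Counter(tokens)
    weights = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm if norm else 0.0 for x in weights]

row = tfidf_row(sentences[0])
for word, weight in zip(vocabulary, row):
    if weight > 0:
        print(f"{word}: {weight:.3f}")
```

In this toy corpus "system" and "shall" occur in every sentence, so they end up with a lower weight than "encrypt" or "password", which occur in only one.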

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer to compute TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_text)

# Display TF-IDF feature matrix
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)
print("Sample TF-IDF Matrix:", tfidf_matrix.toarray())

TF-IDF Matrix Shape: (11440, 6153)
Sample TF-IDF Matrix: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


**Word2Vec** is a common word embedding model developed at Google to improve word representation. In our research, Word2Vec is used to enhance the numeric representation of words by capturing word context in a document more accurately, in terms of both semantic and syntactic relationships between words. Each feature in the word representation is a real number; as the sample output below shows, values can be negative or greater than one.

The objective of using this model in this study is to investigate the effect of semantic representation of requirement sentences, using a big-data model, on achieving high accuracy in NFR classification.
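
A sentence-level feature vector has to be built from the individual word vectors. One common choice is to average them (mean pooling), which keeps the result independent of sentence length. The sketch below uses made-up 3-dimensional vectors rather than trained embeddings, and `mean_pool` is a hypothetical helper name.

```python
# Toy 3-dimensional word vectors; the values are made up for illustration
toy_vectors = {
    "system": [0.2, -0.4, 0.6],
    "shall": [0.0, 0.2, -0.2],
    "encrypt": [1.0, 0.8, -0.4],
}

def mean_pool(tokens, vectors, size=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(component) / len(known) for component in zip(*known)]

# Unknown words are skipped; the result is the mean of the three known vectors
print(mean_pool(["system", "shall", "encrypt", "unknownword"], toy_vectors))
print(mean_pool([], toy_vectors))  # → [0.0, 0.0, 0.0]
```

Returning a zero vector for an empty or fully out-of-vocabulary sentence keeps every row the same length, which downstream classifiers require.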

In [None]:
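import numpy as np
from gensim.models import Word2Vec

# Train Word2Vec model on the tokenized data
word2vec_model = Word2Vec(sentences=df['lemmatized_tokens'],
                          vector_size=100, window=5, min_count=1, workers=4)

# To get the vector for a word, e.g., 'requirement'
print("Vector for 'requirement':", word2vec_model.wv['requirement'])

# Sum the word vectors of a sentence into one vector; return a zero vector
# when no token is in the vocabulary (e.g. an empty token list)
def sentence_vector(tokens):
    vectors = [word2vec_model.wv[token] for token in tokens if token in word2vec_model.wv]
    if not vectors:
        return np.zeros(word2vec_model.wv.vector_size, dtype=np.float32)
    return np.sum(vectors, axis=0)

# Aggregate vectors to represent each sentence
df['word2vec_vectors'] = df['lemmatized_tokens'].apply(sentence_vector)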

Vector for 'requirement': [-0.2656624   0.24085735  0.17579813  1.1480407   0.07920082 -0.9877641
  0.8354378   0.98058563 -0.59377414 -0.8023467  -0.4324077  -1.4503976
  0.5959428   0.52127653  0.00417846  0.4226356   0.14965805 -0.3720479
 -0.37842363 -0.7833258   0.7568334   0.5718663   0.29422912  0.1621622
  0.12471768 -0.08909896 -0.30672985 -0.37583214 -0.3022516  -0.10667772
  0.4749707  -0.11869216 -0.03090043 -0.8112786  -0.374059    0.98669297
  0.6877831  -0.6082959   0.1219599  -0.7228267   0.42638066 -0.36590284
 -0.52137285  0.02665865  0.5976789   0.33254704 -0.23943973 -0.31975004
  0.46090502  0.23530722  0.43688154 -0.24952984 -0.44665536 -0.20535974
 -0.8534624   0.38300374  0.0474469  -0.48625425 -0.82194155 -0.40102872
 -0.05814507  0.3400637  -0.17363532  0.24904467 -0.49357018 -0.06357519
 -0.537737    0.6812649  -0.36957672  0.43888995 -0.43545935  0.15301293
  0.7603106  -0.0136522   0.60375935  0.43822363 -0.19659933 -0.07213023
  0.02409289  0.45327142 -0.5

**BERT**, which stands for **Bidirectional Encoder Representations from Transformers**, is a text representation technique. BERT was an inflection point in the application of machine learning to NLP and has been shown to be state-of-the-art for a wide range of NLP tasks such as semantic analysis and text classification. In the research paper, the authors used a BERT model with the Masked-LM strategy to represent requirement sentences as semantic numerical vectors. They then trained the classifiers on top of the transformer output of the BERT model.

In [None]:
from transformers import BertTokenizer, BertModel
import torch
# import pandas as pd

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Move model to GPU if available, else stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Function to extract BERT embeddings for each sentence
def get_bert_embeddings(text):
    # Tokenize the sentence with padding and truncation (handles varying sentence lengths)
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move to the correct device (GPU or CPU)

    # Get the model outputs (outputs[0] is the last hidden state)
    with torch.no_grad():  # Disable gradient calculation as we're not training
        outputs = model(**inputs)

    # We use the [CLS] token (first token in BERT) as the sentence embedding
    sentence_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()  # Move back to CPU and convert to numpy
    return sentence_embedding

# Apply BERT embeddings to each sentence, using the space-joined 'lemmatized_tokens'
df['bert_embeddings'] = df['lemmatized_tokens'].apply(lambda x: get_bert_embeddings(' '.join(x)))

# Check the shape of the resulting embeddings
print(f"BERT Embeddings Shape: {df['bert_embeddings'].iloc[0].shape}")

BERT Embeddings Shape: (1, 768)


# Split the Data into Training and Testing Sets
Before applying machine learning models, we need to split the data into a training set and a test set. We will also convert the BERT embeddings into a format that can be used by these models.
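
When one class is much rarer than the other, a plain random split can leave the two sets with different class proportions. A stratified split avoids this by splitting each class separately. The sketch below shows the idea in plain Python on toy labels; `stratified_split` is a hypothetical helper written for illustration. In scikit-learn, `train_test_split` accepts a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 80 samples of class 0 and 20 samples of class 1
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))     # → 80 20
print(sum(labels[i] for i in test_idx))  # → 4
```

Both sets keep the original 80/20 class ratio: the test set holds 16 samples of class 0 and 4 of class 1.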

In [None]:
from sklearn.model_selection import train_test_split

# Assuming 'bert_embeddings' is the column containing your BERT embeddings and 'NFR_boolean' contains the target labels
X = df['bert_embeddings'].apply(lambda x: x.flatten())  # Flatten the BERT embeddings
y = df['NFR_boolean']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("BERT Embeddings and labels extracted successfully.")

# Get the lengths of the training and testing sets
train_size = len(X_train)
test_size = len(X_test)

# Calculate the ratio
total_size = train_size + test_size
train_ratio = train_size / total_size
test_ratio = test_size / total_size

# Print the results
print(f"Training set size: {train_size}")
print(f"Testing set size: {test_size}")
print(f"Total dataset size: {total_size}")
print(f"Training set ratio: {train_ratio:.2f}")
print(f"Testing set ratio: {test_ratio:.2f}")

BERT Embeddings and labels extracted successfully.
Training set size: 9152
Testing set size: 2288
Total dataset size: 11440
Training set ratio: 0.80
Testing set ratio: 0.20


In [None]:
print(X_train)

8746     [-0.29028362, -0.0069290344, -0.58969885, 0.00...
11395    [-0.39160356, -0.089262, 0.41657862, 0.1582935...
36       [-0.5227131, -0.03067437, -0.020269044, -0.062...
11257    [-0.2579271, -0.051638927, 0.09467344, 0.05525...
3334     [-0.4160419, -0.5469826, 0.44621208, -0.131847...
                               ...                        
11284    [-0.1615796, -0.07516419, 0.15448351, 0.119395...
5191     [-0.4032112, -0.04167419, -0.15463793, -0.2913...
5390     [0.05034938, 0.28090125, -0.5540273, 0.226953,...
860      [-0.55429906, 0.07363611, 0.11128484, -0.26137...
7270     [-0.20526685, -0.073073134, -0.1182015, 0.1541...
Name: bert_embeddings, Length: 9152, dtype: object


In [None]:
print(X_test)

11013    [-0.3247295, 0.1726643, 0.22044359, 0.10864598...
6123     [-0.23559381, 0.35876867, -0.10723287, -0.5350...
10633    [-0.22872213, 0.17400908, -0.03794982, 0.21764...
2270     [-0.16080384, 0.05171148, -0.04882351, 0.04860...
5271     [-0.09217936, -0.5087056, -0.36632246, -0.0474...
                               ...                        
5773     [-0.13863471, 0.208345, -0.71376586, -0.362303...
6657     [-0.011860309, 0.15282507, -0.061276775, 0.064...
4220     [-0.20356126, 0.3949012, -0.18463196, -0.24909...
1224     [-0.12972067, 0.11177402, 0.0034937877, -0.071...
7698     [-0.122205585, -0.16259454, 0.3718352, -0.1792...
Name: bert_embeddings, Length: 2288, dtype: object


# Machine Learning Classifier
In the previous stages of the proposed system, the authors segmented requirements documents into sentences; each sentence was then converted into a numerical vector so that it could be used by ML models. In this phase, the authors built ML models to classify the vectors that represent requirement sentences into the target **NFR categories (classes)**: **usability**, **availability**, **reliability**, **security**, **performance**, or **others**. We chose the three supervised ML algorithms most commonly applied to similar tasks: **Naive Bayes**, **Support Vector Machine**, and **Logistic Regression**.

A **Support Vector Machine (SVM)** is a powerful machine learning algorithm widely used for both linear and nonlinear classification, as well as regression and outlier detection tasks. SVMs are highly adaptable, making them suitable for various applications such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection.

SVM is a discriminative classification method that is commonly recognized as one of the more accurate methods in NLP. In this research, the SVM classifier was used to solve a non-linear classification problem using the "kernel trick", a method for using a linear classification model to solve a non-linear problem by projecting the feature vectors of the target classes into a higher-dimensional space in which the classes are linearly separable.

In the research, the authors have five NFR classes plus the "others" class. The traditional SVM classifier is a binary classifier that can be applied to two classes only. Here there are multiple classes (six in total). To handle this, the one-against-one and one-against-all strategies are used. To maximize the margin of the hyperplane, the feature weights are minimized using a gradient descent algorithm on the cost function.
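
The two multi-class strategies differ in how many binary classifiers they train. The short sketch below enumerates them for the six classes named above; it only counts the binary sub-problems and does not train any model.

```python
from itertools import combinations

classes = ["usability", "availability", "reliability",
           "security", "performance", "others"]

# One-against-one: one binary classifier per unordered pair of classes
one_vs_one = list(combinations(classes, 2))

# One-against-all: one binary classifier per class (that class vs. the rest)
one_vs_all = [(c, "rest") for c in classes]

print(len(one_vs_one))  # → 15
print(len(one_vs_all))  # → 6
```

With k classes, one-against-one needs k(k-1)/2 classifiers, each trained on only two classes, while one-against-all needs k classifiers, each trained on the full data.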

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Initialize the SVM classifier
svm_classifier = SVC(kernel='linear', C=1.0)  # You can experiment with different kernels and C values

# Train the SVM classifier and convert X-train to list
svm_classifier.fit(X_train.to_list(), y_train)

# Make predictions on the test set and convert X-test to a list
y_pred = svm_classifier.predict(X_test.to_list())

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of SVM classifier: {accuracy}")

# Generate and print a classification report with precision, recall, and F1 score
svm_report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1'])
print("SVM Classification Report:\n", svm_report)

Accuracy of SVM classifier: 0.9042832167832168
SVM Classification Report:
               precision    recall  f1-score   support

     Class 0       0.93      0.95      0.94      1931
     Class 1       0.72      0.63      0.67       357

    accuracy                           0.90      2288
   macro avg       0.83      0.79      0.81      2288
weighted avg       0.90      0.90      0.90      2288



The **Naive Bayes classifier** is another classifier we adopted in our methodology. It is a probabilistic model based on Bayes' theorem. Several of its properties prompted us to use it in our NFR classification model. Naive Bayes is one of the most commonly used supervised ML classifiers.

NB has been shown to be accurate and reliable in natural language classification tasks. The NB classifier does not require a lot of training data, which is one of the reasons we chose it in our research.
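
To show how Bayes' theorem turns word counts into a class decision, here is a minimal multinomial Naive Bayes sketch in plain Python with Laplace smoothing. The training sentences and the labels "FR" and "NFR" are toy data invented for illustration, and the helper names are hypothetical. The model works on counts, which is why a multinomial Naive Bayes classifier expects non-negative features.

```python
import math
from collections import Counter, defaultdict

# Toy training data: (tokens, label); illustrative only, not from PURE
train = [
    (["system", "shall", "encrypt", "password"], "NFR"),
    (["system", "shall", "respond", "within", "second"], "NFR"),
    (["system", "shall", "store", "patient", "record"], "FR"),
    (["system", "shall", "merge", "patient", "record"], "FR"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocabulary = {w for tokens, _ in train for w in tokens}

def log_posterior(tokens, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with Laplace smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for word in tokens:
        if word not in vocabulary:
            continue  # ignore words never seen in training
        numerator = word_counts[label][word] + alpha
        denominator = total_words + alpha * len(vocabulary)
        score += math.log(numerator / denominator)
    return score

def predict(tokens):
    """Pick the label with the highest (unnormalized) log posterior."""
    return max(class_counts, key=lambda label: log_posterior(tokens, label))

print(predict(["encrypt", "password"]))  # → NFR
print(predict(["patient", "record"]))    # → FR
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.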

In [None]:
from sklearn.naive_bayes import MultinomialNB
# from sklearn.preprocessing import MinMaxScaler
import numpy as np  # Needed for np.clip below
from sklearn.metrics import accuracy_score

# Train the Naive Bayes classifier.  MultinomialNB expects non-negative values.
nb_classifier = MultinomialNB() # Initialize the Naive Bayes classifier

# Option 1: Since BERT embeddings can be negative, we'll need to handle this:
X_train_clipped = X_train.apply(lambda x: np.clip(x, 0, None))
nb_classifier.fit(X_train_clipped.to_list(), y_train)

# Option 2:  Scale data to be positive.
# scaler = MinMaxScaler()
# X_train_scaled = scaler.fit_transform(np.vstack(X_train.values))
# X_test_scaled = scaler.transform(np.vstack(X_test.values))
# nb_classifier.fit(X_train_scaled, y_train)

# Make predictions on the test set (remember to handle negative values)
X_test_clipped = X_test.apply(lambda x: np.clip(x, 0, None))
y_pred = nb_classifier.predict(X_test_clipped.to_list())

accuracy = accuracy_score(y_test, y_pred) # Evaluate the model
print(f"Accuracy of Naive Bayes classifier: {accuracy}")

Accuracy of Naive Bayes classifier: 0.7827797202797203


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Option 1: Clip the data to ensure non-negative values (since BERT embeddings can be negative)
X_train_clipped = X_train.apply(lambda x: np.clip(x, 0, None))
X_test_clipped = X_test.apply(lambda x: np.clip(x, 0, None))

# Train the Naive Bayes classifier using clipped data
nb_classifier.fit(X_train_clipped.to_list(), y_train)

# Make predictions on the test set using clipped data
y_pred_clipped = nb_classifier.predict(X_test_clipped.to_list())

# Evaluate the model with clipped data
accuracy_clipped = accuracy_score(y_test, y_pred_clipped)
print(f"Accuracy with Clipped Data: {accuracy_clipped}")

# Generate classification report for clipped data
nb_report_clipped = classification_report(y_test, y_pred_clipped, target_names=['Class 0', 'Class 1'])
print("Naive Bayes Classification Report with Clipped Data:\n", nb_report_clipped)

# Option 2: Scale the data to be positive (using MinMaxScaler)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(np.vstack(X_train.values))
X_test_scaled = scaler.transform(np.vstack(X_test.values))

# Train the Naive Bayes classifier using scaled data
nb_classifier.fit(X_train_scaled, y_train)

# Make predictions on the test set using scaled data
y_pred_scaled = nb_classifier.predict(X_test_scaled)

# Evaluate the model with scaled data
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with Scaled Data: {accuracy_scaled}")

# Generate classification report for scaled data
nb_report_scaled = classification_report(y_test, y_pred_scaled, target_names=['Class 0', 'Class 1'])
print("Naive Bayes Classification Report with Scaled Data:\n", nb_report_scaled)

Accuracy with Clipped Data: 0.7827797202797203
Naive Bayes Classification Report with Clipped Data:
               precision    recall  f1-score   support

     Class 0       0.93      0.80      0.86      1931
     Class 1       0.39      0.69      0.50       357

    accuracy                           0.78      2288
   macro avg       0.66      0.74      0.68      2288
weighted avg       0.85      0.78      0.80      2288

Accuracy with Scaled Data: 0.861451048951049
Naive Bayes Classification Report with Scaled Data:
               precision    recall  f1-score   support

     Class 0       0.87      0.98      0.92      1931
     Class 1       0.65      0.24      0.35       357

    accuracy                           0.86      2288
   macro avg       0.76      0.61      0.64      2288
weighted avg       0.84      0.86      0.83      2288



**Logistic regression** models the log-odds (the natural logarithm of the odds) of a discrete class as a linear function of the features. It transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes. The authors used NLP techniques to represent requirement sentences in a suitable form.
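
The mapping from a linear score to a probability, and then to a class, can be sketched in a few lines. The weights, bias, and feature values below are hypothetical numbers chosen only to illustrate the calculation; `predict_class` is a helper name introduced for this sketch.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical weights and features, chosen only to illustrate the mapping
weights, bias = [1.5, -2.0], 0.25
print(sigmoid(0.0))  # → 0.5
print(predict_class([1.0, 0.5], weights, bias))
```

Here the linear score is 1.5 - 1.0 + 0.25 = 0.75, which the sigmoid maps to a probability above 0.5, so the sample is assigned to class 1.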

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Logistic Regression classifier
logreg_classifier = LogisticRegression(max_iter=1000)  # Increased max_iter to ensure convergence

# Train the Logistic Regression classifier
logreg_classifier.fit(X_train.to_list(), y_train)  # Convert X_train to a list of lists

# Make predictions on the test set
y_pred = logreg_classifier.predict(X_test.to_list())  # Convert X_test to a list of lists

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Logistic Regression classifier: {accuracy}")

# Generate and print a classification report with precision, recall, and F1 score
logreg_report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1'])  # Adjust class names if needed
print("Logistic Regression Classification Report:\n", logreg_report)

Accuracy of Logistic Regression classifier: 0.9117132867132867
Logistic Regression Classification Report:
               precision    recall  f1-score   support

     Class 0       0.94      0.96      0.95      1931
     Class 1       0.76      0.64      0.69       357

    accuracy                           0.91      2288
   macro avg       0.85      0.80      0.82      2288
weighted avg       0.91      0.91      0.91      2288



**TensorFlow** is a software library for machine learning and artificial intelligence. It can be used for a range of tasks, but mainly for training and inference of neural networks. It is one of the most popular deep learning frameworks, alongside others such as PyTorch and PaddlePaddle. It is free and open-source software released under the Apache License 2.0.

In [None]:
!pip install tensorflow scikit-learn



**CNN stands for Convolutional Neural Network**, a model commonly applied to image analysis and classification. A CNN takes an input image as a 3-dimensional array based on the image resolution. The height and the width of the image are two dimensions of the array, while the third dimension holds the color channels of each pixel (RGB).
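
In this notebook the input is not an image but a 768-dimensional BERT embedding, so the convolutions are one-dimensional. The sketch below traces how the sequence length shrinks through two convolution and pooling stages like the ones used in the Keras model in this notebook. It assumes no padding and stride 1 for the convolutions and a pooling stride equal to the pool size, which I believe are the Keras defaults; `conv1d_output_length` is a hypothetical helper.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution or pooling window."""
    return (length + 2 * padding - kernel_size) // stride + 1

length = 768                                         # size of one BERT [CLS] embedding
length = conv1d_output_length(length, 3)             # Conv1D(32, 3)   -> 766
length = conv1d_output_length(length, 2, stride=2)   # MaxPooling1D(2) -> 383
length = conv1d_output_length(length, 3)             # Conv1D(64, 3)   -> 381
length = conv1d_output_length(length, 2, stride=2)   # MaxPooling1D(2) -> 190

print(length)       # → 190
print(length * 64)  # → 12160 values after Flatten (190 positions x 64 filters)
```

The flattened size is what the first dense layer receives, so it determines how many weights that layer has to learn.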



In [None]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim = X_train.iloc[0].shape[0]  # Length of one flattened BERT embedding

# Define the CNN model (named keras_cnn so it does not overwrite the BERT `model` defined earlier)
keras_cnn = keras.Sequential(
    [
        keras.Input(shape=(embedding_dim, 1)),  # Input shape adjusted for 1D CNN
        layers.Conv1D(32, 3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, 3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(10, activation="relu"),  # Adjust the number of units as needed
        layers.Dense(1, activation="sigmoid"),  # Output layer for binary classification
    ]
)

# Compile the model
keras_cnn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Reshape the input data for the CNN: (samples, embedding_dim, 1)
X_train_reshaped = np.array(X_train.to_list()).reshape(-1, embedding_dim, 1)
X_test_reshaped = np.array(X_test.to_list()).reshape(-1, embedding_dim, 1)

# Train the model
keras_cnn.fit(X_train_reshaped, y_train, epochs=30, batch_size=32, validation_split=0.1) # Adjust epochs and batch_size

# Evaluate the model
loss, accuracy = keras_cnn.evaluate(X_test_reshaped, y_test)
print(f"CNN Test Loss: {loss:.4f}")
print(f"CNN Test Accuracy: {accuracy:.4f}")

Epoch 1/30
258/258 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.8272 - loss: 0.4309 - val_accuracy: 0.8504 - val_loss: 0.3258
Epoch 2/30
258/258 ━━━━━━━━━━━━━━━━━━━━ 6s 6ms/step - accuracy: 0.8601 - loss: 0.3161 - val_accuracy: 0.8810 - val_loss: 0.2969
Epoch 3/30
258/258 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.8846 - loss: 0.2832 - val_accuracy: 0.8799 - val_loss: 0.2675
Epoch 4/30
258/258 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.8916 - loss: 0.2711 - val_accuracy: 0.8734 - val_loss: 0.2590
Epoch 5/30
258/258 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.8984 - loss: 0.2506 - val_accuracy: 0.8886 - val_loss: 0.2487
Epoch 6/30
258/258 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9053 - loss: 0.2397 - val_accuracy: 0.8843 - val_loss: 0.2419
Epoch 7/30
258/258

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.metrics import classification_report

# Reshape BERT embeddings to (samples, sequence_length, features)
X_train_reshaped = np.array(X_train.to_list()).reshape(-1, X_train.iloc[0].shape[0], 1)
X_test_reshaped = np.array(X_test.to_list()).reshape(-1, X_train.iloc[0].shape[0], 1)

# Convert reshaped data to PyTorch tensors
X_train_tensor = torch.tensor(X_train_reshaped, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_reshaped, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Define CNN model
class CNNModel(nn.Module):
    def __init__(self, sequence_length):
        super(CNNModel, self).__init__()
        self.conv1 = nn.Conv1d(1, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)

        # Dummy forward pass to calculate the output size after pooling
        with torch.no_grad():
            dummy_input = torch.zeros(1, 1, sequence_length)
            conv_out = self.pool(self.conv1(dummy_input))
            self.flatten_size = conv_out.numel()

        self.fc1 = nn.Linear(self.flatten_size, 2)  # Adjust for binary classification

    def forward(self, x):
        x = x.permute(0, 2, 1)  # Permute to (batch, channels, sequence_length)
        x = self.conv1(x)
        x = self.pool(x)
        x = x.reshape(x.size(0), -1)  # Flatten for fully connected layer
        x = self.fc1(x)
        return x

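# Instantiate the CNN and move it to the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cnn_model = CNNModel(sequence_length=X_train_reshaped.shape[1]).to(device)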

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)

# Training loop: one pass (a single epoch) over the training data
def train_cnn(model, train_loader, criterion, optimizer):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

train_cnn(cnn_model, train_loader, criterion, optimizer)

# Evaluate CNN
def evaluate_cnn(model, test_loader):
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    return classification_report(all_labels, all_preds)

cnn_report = evaluate_cnn(cnn_model, test_loader)
print("CNN Model Classification Report:\n", cnn_report)

CNN Model Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.58      0.73      1931
           1       0.30      0.96      0.45       357

    accuracy                           0.64      2288
   macro avg       0.64      0.77      0.59      2288
weighted avg       0.88      0.64      0.69      2288



# Fusion Model
A fusion model in machine learning combines multiple sources, techniques, or models to improve the overall performance, robustness, and accuracy of the predictions. The idea of this approach is to combine the four NLP techniques (TF-IDF, Word2Vec, BERT embeddings, and Bag of Words) in one fused model. The objective is to exploit the strengths of all of these methods in a single combined module, since each NLP technique contributes distinctive features.
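
As a minimal sketch of feature-level fusion, assuming two toy sentences and a random stand-in for the dense embeddings (neither comes from the PURE dataset), each technique produces one block of columns per sentence and the blocks are placed side by side:

```python
import numpy as np

# Toy sentences (hypothetical, for illustration only)
sentences = [
    "the system shall encrypt data",
    "the system shall export reports",
]

# Block 1: bag-of-words counts over the toy vocabulary
vocab = sorted({word for s in sentences for word in s.split()})
bow = np.array(
    [[s.split().count(word) for word in vocab] for s in sentences],
    dtype=float,
)

# Block 2: stand-in dense sentence embeddings
# (random numbers here; Word2Vec or BERT vectors would take this place)
rng = np.random.default_rng(0)
dense = rng.normal(size=(len(sentences), 4))

# Feature-level fusion: one row per sentence, feature blocks concatenated
fused = np.hstack([bow, dense])
print(fused.shape)
```

Because the blocks live on different scales (integer counts next to real-valued embeddings), it may be worth standardizing each block before training a classifier on the fused matrix.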

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# The text is in `df['sentence']` and the binary label in `df['NFR_boolean']`
dataset_path = 'Pure_Annotate_Dataset.csv'
try:
    df = pd.read_csv(dataset_path, encoding='ISO-8859-1')
except UnicodeDecodeError:
    # If ISO-8859-1 doesn't work, you can try 'latin1' or 'utf-16'
    df = pd.read_csv(dataset_path, encoding='latin1')

# Prepare data
X = df['sentence']
y = df['NFR_boolean']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# 2. Word2Vec
# Train Word2Vec model on the training data
sentences_train = [sentence.split() for sentence in X_train]
sentences_test = [sentence.split() for sentence in X_test]
word2vec_model = Word2Vec(sentences_train, vector_size=100, window=5, min_count=1, workers=4)

# Create Word2Vec features for training and testing sets
def get_word2vec_features(sentences, model):
    features = []
    for sentence in sentences:
        vec = np.zeros(100)  # Initialize with zeros
        count = 0
        for word in sentence:
            if word in model.wv:
                vec += model.wv[word]
                count += 1
        if count > 0:
            vec /= count
        features.append(vec)
    return np.array(features)

X_train_word2vec = get_word2vec_features(sentences_train, word2vec_model)
X_test_word2vec = get_word2vec_features(sentences_test, word2vec_model)

# 3. BERT Embeddings
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

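# Convert sentences to BERT embeddings in mini-batches; encoding a whole
# split in a single forward pass can exhaust memory on a large corpus
def get_bert_embeddings(sentences, tokenizer, model, batch_size=32):
    sentences = list(sentences)
    chunks = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt", max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean over token positions of the last hidden layer (padding positions included)
        chunks.append(outputs.last_hidden_state.mean(dim=1).numpy())
    return np.vstack(chunks)

X_train_bert = get_bert_embeddings(X_train, tokenizer, bert_model)
X_test_bert = get_bert_embeddings(X_test, tokenizer, bert_model)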

# 4. Bag of Words (BoW)
bow_vectorizer = CountVectorizer(max_features=5000)
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

# 5. Combine all features into one fusion model
# Concatenate all features (TF-IDF + Word2Vec + BERT + BoW)
X_train_combined = np.concatenate([X_train_tfidf.toarray(), X_train_word2vec, X_train_bert, X_train_bow.toarray()], axis=1)
X_test_combined = np.concatenate([X_test_tfidf.toarray(), X_test_word2vec, X_test_bert, X_test_bow.toarray()], axis=1)

# Train a classifier on the fused features
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_combined, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test_combined)

# Evaluate the model using classification metrics
print(classification_report(y_test, y_pred))



After implementing the fusion model, we evaluate its performance in terms of precision, recall, and F1 score.
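
For reference, the three metrics can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels, and the helper name `precision_recall_f1` is ours, not part of any library:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

These are the per-class values that `classification_report` lists in each row; accuracy alone can look high on an imbalanced dataset even when recall on the minority class is poor.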

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.ensemble import VotingClassifier
import numpy as np
import torch

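# 1. Train individual models: SVM, Logistic Regression, Naive Bayes
# Note: X_train / X_test must hold one embedding vector per row here
# (as in the CNN cells), not the raw sentences used in the previous cell.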

# Train SVM model
svm_classifier = SVC(kernel='linear', C=1.0)
svm_classifier.fit(X_train.to_list(), y_train)
y_pred_svm = svm_classifier.predict(X_test.to_list())

# Train Logistic Regression model
logreg_classifier = LogisticRegression(max_iter=1000)
logreg_classifier.fit(X_train.to_list(), y_train)
y_pred_logreg = logreg_classifier.predict(X_test.to_list())

# Train Naive Bayes model
nb_classifier = MultinomialNB()
X_train_clipped = X_train.apply(lambda x: np.clip(x, 0, None))
nb_classifier.fit(X_train_clipped.to_list(), y_train)
X_test_clipped = X_test.apply(lambda x: np.clip(x, 0, None))
y_pred_nb = nb_classifier.predict(X_test_clipped.to_list())

# 2. Train CNN model (assuming you already have the CNN model and the training function)

# Assuming CNN training and evaluation functions are implemented
# Use the previously defined function to get CNN predictions
def get_cnn_predictions(model, test_loader):
    model.eval()
    all_preds = []
    with torch.no_grad():
        for inputs, _ in test_loader:
            inputs = inputs.to(device)
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
    return np.array(all_preds)

# Get CNN predictions (assuming test_loader is already prepared)
y_pred_cnn = get_cnn_predictions(cnn_model, test_loader)

# 3. Fuse the predictions using majority voting
# Stack the predictions of the four models
predictions_stack = np.stack([y_pred_svm, y_pred_logreg, y_pred_nb, y_pred_cnn], axis=1)

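# Majority voting (mode of the predictions). With four voters a 2-2 tie is
# possible; scipy's mode then returns the smaller label, i.e. class 0.
from scipy.stats import mode
# np.ravel gives a 1-D result whether mode returns shape (n,) or (n, 1)
y_pred_fusion = np.ravel(mode(predictions_stack, axis=1).mode)

# 4. Generate the classification report for the Fusion Model
report_fusion = classification_report(y_test, y_pred_fusion)
print("Fusion Model Classification Report:\n", report_fusion)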

NameError: name 'y_test' is not defined

# Model Evaluation
The system is evaluated by randomly splitting the dataset into two subsets: train and test. The training set is used to train the ML classifiers, while the test set is used only to measure their performance.
The test set is never used to train the front-end classifiers; these are trained on the training set only. The back-end classifier is both trained and tested on the test set, which is divided into two equal, non-overlapping subsets. One subset is used to train the back-end classifier and the other to evaluate it. The two subsets are then exchanged and the same experiment is repeated.
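
The two-subset exchange described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual back-end: the rows are made-up front-end votes, and `fit_threshold` / `predict_threshold` are a hypothetical stand-in for the back-end classifier, which the text does not specify.

```python
import random


def two_fold_swap(rows, labels, fit, predict, seed=0):
    """Split the held-out set into two disjoint halves, train on one and
    score on the other, then exchange the halves and repeat."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    fold_a, fold_b = idx[:half], idx[half:]

    scores = []
    for train_idx, eval_idx in ((fold_a, fold_b), (fold_b, fold_a)):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        preds = predict(model, [rows[i] for i in eval_idx])
        correct = sum(p == labels[i] for p, i in zip(preds, eval_idx))
        scores.append(correct / len(eval_idx))
    return scores


def fit_threshold(rows, labels):
    """Hypothetical back-end: pick the vote-count threshold with the best training accuracy."""
    best_t, best_acc = 0, -1.0
    for t in range(len(rows[0]) + 2):
        acc = sum((sum(r) >= t) == bool(l) for r, l in zip(rows, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def predict_threshold(threshold, rows):
    return [int(sum(r) >= threshold) for r in rows]


# Made-up held-out set: each row holds the 0/1 votes of three front-end classifiers
demo_labels = [i % 2 for i in range(40)]
demo_rows = [[l, l, 1 - l if i % 5 == 0 else l] for i, l in enumerate(demo_labels)]

scores = two_fold_swap(demo_rows, demo_labels, fit_threshold, predict_threshold)
print("accuracy per fold:", scores)
```

Reporting the mean of the two scores uses every held-out instance for evaluation exactly once, while the back-end never sees the instances it is scored on.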

The dataset contains 1247 requirement instances in total. The split assigns 872 instances to training and 375 to testing. The fusion model uses the same 70:30 split as all other experiments.
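
The 872/375 figures follow from a 70:30 split of 1247 instances when the test share is rounded up. The sketch below reproduces that arithmetic with the standard library only; the seed and the index-based split are illustrative assumptions, not the notebook's actual code:

```python
import math
import random

n_total = 1247          # requirement instances reported above
test_fraction = 0.30    # 70:30 split

# 0.30 * 1247 = 374.1, so rounding the test share up gives 375
n_test = math.ceil(test_fraction * n_total)
n_train = n_total - n_test

# Shuffle the instance indices once, then cut them into the two subsets
indices = list(range(n_total))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```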

In [None]:
from google.colab import files
import pandas as pd

# Upload your dataset (CSV)
uploaded = files.upload()

# Load the dataset into pandas dataframe (assuming CSV format)
#X_test = pd.read_csv('testData.csv')  # Update with your actual filename


Saving testData.csv to testData (2).csv


In [None]:
import pandas as pd
import re
import nltk
import torch
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import BertTokenizer, BertModel
from sklearn.preprocessing import LabelEncoder
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# The new data is read from the uploaded CSV file (change 'testData.csv' to your actual filename)
new_df = pd.read_csv('testData.csv')

# Preprocessing (tokenization, cleaning, lemmatization)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Tokenization
new_df['tokens'] = new_df['requirement_sentence'].apply(word_tokenize)

# Clean tokens (remove special characters and convert to lowercase)
def clean_tokens(tokens):
    cleaned_tokens = [re.sub(r'[^A-Za-z]', '', token).lower() for token in tokens if re.sub(r'[^A-Za-z]', '', token)]
    return cleaned_tokens

new_df['cleaned_tokens'] = new_df['tokens'].apply(clean_tokens)

# Lemmatization
lemmatizer = WordNetLemmatizer()
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

new_df['lemmatized_tokens'] = new_df['cleaned_tokens'].apply(lemmatize_tokens)

# Extract BERT embeddings. The BERT objects get their own names so they do
# not overwrite `model`, which holds the Keras CNN trained earlier.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model.to(device)

# Function to get the [CLS] embedding for each sentence
def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state[:, 0, :].cpu().numpy()

new_df['bert_embeddings'] = new_df['lemmatized_tokens'].apply(lambda x: get_bert_embeddings(' '.join(x)))

# Flatten the BERT embeddings for the model input
X_new = new_df['bert_embeddings'].apply(lambda x: x.flatten())

# The new data is unlabeled, so no target column is encoded here

# Reshape the input data for the CNN: (samples, embedding_dim, 1)
X_new_reshaped = np.array(X_new.to_list()).reshape(-1, X_new.iloc[0].shape[0], 1)

# Predict with the Keras CNN. This assumes `model` still refers to the Keras
# network trained above; a BertModel or a PyTorch nn.Module has no `.predict`.
# If the CNN was saved to disk, load it instead:
# model = keras.models.load_model('path_to_your_trained_model')
predictions = model.predict(X_new_reshaped)

# The sigmoid output is a probability; threshold at 0.5 to get 0/1 labels
predicted_labels = (predictions > 0.5).astype(int).flatten()

# Add the predictions to the dataframe (same 0/1 encoding as `NFR_boolean`)
new_df['predicted_labels'] = predicted_labels

# Display the results
print(new_df[['requirement_sentence', 'predicted_labels']])



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


AttributeError: 'BertModel' object has no attribute 'predict'