# Finetunes an LLM to improve topic modeling results.

2024-06-17

Author: Zachary Kilhoffer

Requirements:
- '../data/data-clean.csv'
- '../data/train-data-redacted.xlsx'

Outputs:
- fine_tuned_model
- domain_adapted_model


See https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html#umap


### Important Note
- `data-clean.csv` contains no IP-protected control texts, so it includes only the controls from FedRAMP and C5.
- For the paper, we used 9 documents.


In [31]:
! pip freeze > requirements-3-finetune.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [1]:
import pandas as pd
import transformers
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
)
from datasets import Dataset
import accelerate
from sklearn.preprocessing import LabelEncoder
import joblib  # save the label encoder


In [2]:
# display tweaks
pd.set_option("display.max_colwidth", 200)  # how much text is showing within a cell
pd.set_option("display.max_columns", None)  # None = no limit on displayed columns
pd.set_option("display.max_rows", None)  # None = no limit on displayed rows
# warnings.filterwarnings("ignore")

In [3]:
# load data
data = "../data/data-clean.csv"  # the documents we want to embed must be in their own rows
df = pd.read_csv(data, index_col=0)

In [4]:
# inspect df
print(df.shape)
df.head(3)

(531, 6)


Unnamed: 0,control_category,control_code,control_name,document,control_text_corrected,full_control_text
0,organisation of information security (ois),OIS-01,information security management system (isms),c5,Basic criterion: The cloud service provider operates an Information Security Management System (ISMS) in accordance with ISO/IEC 27001. The scope of the ISMS covers the cloud service provider's or...,Organisation of information security (ois). Information security management system (isms). Basic criterion: The cloud service provider operates an Information Security Management System (ISMS) in ...
1,organisation of information security (ois),OIS-02,information security policy,c5,"Basic criterion: The top management of the cloud service provider has adopted an information security policy and communicated it to internal and external employees, as well as cloud customers. The...",Organisation of information security (ois). Information security policy. Basic criterion: The top management of the cloud service provider has adopted an information security policy and communicat...
2,organisation of information security (ois),OIS-03,interfaces and dependencies,c5,Basic criterion: Interfaces and dependencies between cloud service delivery activities performed by the cloud service provider and activities performed by third parties are documented and communic...,Organisation of information security (ois). Interfaces and dependencies. Basic criterion: Interfaces and dependencies between cloud service delivery activities performed by the cloud service provi...


# Finetuning with domain-specific lexicon

In [5]:
# 1. Prepare dataset for domain-specific lexicon
texts = list(df['full_control_text'].values)

In [6]:
# 2. Tokenize Your Dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Create a Dataset object from your texts
dataset = Dataset.from_dict({"text": texts})

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)



Map:   0%|          | 0/531 [00:00<?, ? examples/s]
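
The `padding="max_length"`, `truncation=True`, and `max_length=512` settings in the cell above give every example the same length. The following is a minimal plain-Python sketch of that idea, not the tokenizer's actual implementation. The function name `pad_or_truncate` is hypothetical, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumed to match the `bert-base-cased` vocabulary.

```python
def pad_or_truncate(token_ids, max_length=512, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    # Real tokens get attention 1, padding positions get attention 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded, a long one is truncated (max_length=8 for readability).
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Padding every text to 512 tokens keeps batches uniform, at the cost of extra computation on short control texts.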

In [7]:
# 3. Create a Data Collator for MLM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
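
To make `mlm_probability=0.15` concrete, here is a rough plain-Python sketch of BERT-style masking: about 15% of positions are selected for prediction, and of those roughly 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. This is an illustration under stated assumptions, not the collator's code. The name `mask_tokens`, the `[MASK]` id 103, and the vocabulary size 28996 are assumptions, and unlike the real collator this sketch does not exclude special tokens from masking.

```python
import random

MASK_ID = 103        # assumed id of [MASK] in the BERT vocabulary
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(token_ids, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, original in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected; no loss is computed here
        labels[i] = original  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID  # about 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # about 10%: random token
        # remaining ~10%: keep the original token unchanged
    return inputs, labels


tokens = list(range(1000, 1200))  # stand-in token ids for illustration
masked, labels = mask_tokens(tokens, vocab_size=28996)
n_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_selected} of {len(tokens)} positions selected for prediction")
```

Because the masking is random, each epoch sees a different set of masked positions for the same text.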

In [8]:
# 4a. Load the pretrained model with a masked-language-modeling head
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


> Note: the warning is expected because we only use the parts of the model relevant for masked language modeling (MLM).

In [9]:
# 4b. Define the training arguments
training_args = TrainingArguments(
    output_dir="./outputs/test_pretrained_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,  # Reduced batch size, used 16 for actual model described in paper
    save_steps=10_000,
    save_total_limit=2,
)

In [10]:
# 4c. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,  # Directly use the tokenized dataset here
)

In [11]:
# 5. Train the model
trainer.train()

  0%|          | 0/399 [00:00<?, ?it/s]

{'train_runtime': 323.2761, 'train_samples_per_second': 4.928, 'train_steps_per_second': 1.234, 'train_loss': 1.8546662677200814, 'epoch': 3.0}


TrainOutput(global_step=399, training_loss=1.8546662677200814, metrics={'train_runtime': 323.2761, 'train_samples_per_second': 4.928, 'train_steps_per_second': 1.234, 'total_flos': 419277799010304.0, 'train_loss': 1.8546662677200814, 'epoch': 3.0})
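
The `global_step=399` reported above can be checked by hand from the dataset size (531 rows), the batch size (4), and the number of epochs (3). The sketch below assumes a single device and no gradient accumulation, which are the defaults when nothing else is configured; the helper name `total_training_steps` is hypothetical.

```python
import math


def total_training_steps(n_examples, batch_size, epochs):
    """Steps for a run that keeps the last, partially filled batch of each epoch."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


mlm_steps = total_training_steps(n_examples=531, batch_size=4, epochs=3)
print(mlm_steps)  # 399
```

With 531 examples and a batch size of 4, each epoch takes 133 steps (132 full batches plus one partial batch), so three epochs give 399 steps.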

In [12]:
# 6. Save the domain-adapted (further pretrained) model
model.save_pretrained("../outputs/domain_adapted_model")
tokenizer.save_pretrained("../outputs/domain_adapted_model")

('../outputs/domain_adapted_model/tokenizer_config.json',
 '../outputs/domain_adapted_model/special_tokens_map.json',
 '../outputs/domain_adapted_model/vocab.txt',
 '../outputs/domain_adapted_model/added_tokens.json',
 '../outputs/domain_adapted_model/tokenizer.json')

# Finetuning with human-labeled data

In [13]:
# Importing training data. We still need to format it a bit before finetuning
df_training = pd.read_excel('../data/train-data-redacted.xlsx', sheet_name='Data')
temp_labels = pd.read_excel('../data/train-data-redacted.xlsx', sheet_name='Label Choices')

In [14]:
# inspect df_training
print(df_training.shape)
df_training.head()

(30, 9)


Unnamed: 0,researcher1,researcher2,researcher3,control_category,page,document,control_text,Labels Considered,researcher_notes
0,CLD,CLD,CLD,organisation of information security (ois),38.0,c5,Basic criterion: Conflicting tasks and responsibilities are separated based on an OIS-06 risk assessment to reduce the risk of unauthorized or unintended changes or misuse of cloud customer data p...,"CLD, GOV, MON",
1,CLD,IAC,CLD,product safety and security (pss),118.0,c5,"Basic criterion: Access to the functions provided by the cloud service is restricted by access controls (authorization mechanisms) that verify whether users, IT components, or applications are aut...","DCH, IAC",
2,GOV,CPL,SEA,application & interface security,,ccm,REDACTED,"CPL, MON,",
3,GOV,GOV,GOV,,,eu_coc,REDACTED,"DCH, PRI",which country == big picture governance/compliance
4,PRI,PRI,PRI,,,eu_coc,REDACTED,"CFG, DCH",sharing personal data


In [15]:
# Remove unneeded columns
to_drop = ['control_category', 'page', 'document', 'Labels Considered', 'researcher_notes']
df_training = df_training.drop(columns=to_drop)

In [16]:
# Strip leading/trailing whitespace from the researchers' label strings
cols = ['researcher1', 'researcher2', 'researcher3']

for col in cols:
    df_training[col] = df_training[col].astype(str).str.strip()

In [17]:
# Inspect df_training
print(df_training.shape)
df_training.head()

(30, 4)


Unnamed: 0,researcher1,researcher2,researcher3,control_text
0,CLD,CLD,CLD,Basic criterion: Conflicting tasks and responsibilities are separated based on an OIS-06 risk assessment to reduce the risk of unauthorized or unintended changes or misuse of cloud customer data p...
1,CLD,IAC,CLD,"Basic criterion: Access to the functions provided by the cloud service is restricted by access controls (authorization mechanisms) that verify whether users, IT components, or applications are aut..."
2,GOV,CPL,SEA,REDACTED
3,GOV,GOV,GOV,REDACTED
4,PRI,PRI,PRI,REDACTED


In [18]:
# Strip leading/trailing whitespace from the label abbreviations as well
temp_labels["Abbreviation"] = temp_labels["Abbreviation"].astype(str).str.strip()

In [19]:
# Inspect temp_labels
print(temp_labels.shape)
temp_labels.head()

(33, 3)


Unnamed: 0,Category,Abbreviation,Definition
0,Cybersecurity & Data Privacy Governance,GOV,"Execute a documented, risk-based program that supports business objectives while encompassing appropriate cybersecurity and data privacy principles that addresses applicable statutory, regulatory ..."
1,Artificial and Autonomous Technology,AAT,"Ensure trustworthy and resilient Artificial Intelligence (AI) and autonomous technologies to achieve a beneficial impact by informing, advising or simplifying tasks, while minimizing emergent prop..."
2,Asset Management,AST,"Manage all technology assets from purchase through disposition, both physical and virtual, to ensure secured use, regardless of the asset’s location."
3,Business Continuity & Disaster Recovery,BCD,Maintain a resilient capability to sustain business-critical functions while successfully responding to and recovering from incidents through well-documented and exercised processes.
4,Capacity & Performance Planning,CAP,Govern the current and future capacities and performance of technology assets.


> We keep only the majority opinion: controls where at least 2 of the 3 researchers agreed on a label after having a chance to reconsider.

> For more info, see "2-intercoder_reliability.ipynb"
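
The majority rule can be sketched in plain Python as follows. This is an illustration of the idea rather than the notebook's own code; the helper name `majority_label` is hypothetical, and the example votes are taken from rows of the training table shown earlier.

```python
from collections import Counter


def majority_label(votes, min_agree=2):
    """Return the label chosen by at least `min_agree` coders, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


rows = [
    ("CLD", "CLD", "CLD"),  # 3/3 agree
    ("CLD", "IAC", "CLD"),  # 2/3 agree
    ("GOV", "CPL", "SEA"),  # no majority, row is dropped
    ("HRS", "SAT", "SAT"),  # 2/3 agree
]
majorities = [majority_label(row) for row in rows]
print(majorities)  # ['CLD', 'CLD', None, 'SAT']
```

Rows that return `None` have no majority and correspond to the rows removed by the filter in the next cell.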

In [20]:
# Filter to only where 2/3 or 3/3 agree
mask = (df_training[['researcher1', 'researcher2', 'researcher3']].apply(pd.Series.value_counts, axis=1).max(axis=1) >= 2)
filtered_df = df_training[mask]

In [21]:
# inspect filtering results
filtered_df.head()

Unnamed: 0,researcher1,researcher2,researcher3,control_text
0,CLD,CLD,CLD,Basic criterion: Conflicting tasks and responsibilities are separated based on an OIS-06 risk assessment to reduce the risk of unauthorized or unintended changes or misuse of cloud customer data p...
1,CLD,IAC,CLD,"Basic criterion: Access to the functions provided by the cloud service is restricted by access controls (authorization mechanisms) that verify whether users, IT components, or applications are aut..."
3,GOV,GOV,GOV,REDACTED
4,PRI,PRI,PRI,REDACTED
5,PRI,PRI,PRI,REDACTED


> Unfortunately, most of the control texts we used had to be redacted due to copyright.

> In this code we must therefore also drop the rows marked REDACTED.

In [22]:
# Drop rows whose control text was redacted
# (.copy() avoids a SettingWithCopyWarning when a column is added later)
filtered_df = filtered_df[filtered_df['control_text'] != "REDACTED"].copy()

# inspect results
filtered_df

Unnamed: 0,researcher1,researcher2,researcher3,control_text
0,CLD,CLD,CLD,Basic criterion: Conflicting tasks and responsibilities are separated based on an OIS-06 risk assessment to reduce the risk of unauthorized or unintended changes or misuse of cloud customer data p...
1,CLD,IAC,CLD,"Basic criterion: Access to the functions provided by the cloud service is restricted by access controls (authorization mechanisms) that verify whether users, IT components, or applications are aut..."
7,TDA,TDA,TDA,Prevent the installation of [assignment: organization-defined software and firmware components] without verification that the component has been digitally signed using a certificate that is recogn...
9,CRY,CRY,CRY,Make provisions so that [assignment: organization-defined encrypted communications traffic] is visible to [assignment: organization-defined system monitoring tools and mechanisms]. Organizations b...
10,TDA,TDA,TDA,"Assess and review the supply chain-related risks associated with suppliers or contractors and the system, system component, or system service they provide [assignment: organization-defined frequen..."
11,HRS,SAT,SAT,"Train organization-defined personnel or roles to detect counterfeit system components (including hardware, software, and firmware). None."
12,IAC,NET,NET,Authorize network access to [assignment: organization-defined privileged commands] only for [assignment: organization-defined compelling operational needs] and document the rationale for such acce...


In [23]:
# Replace the researcher labels with the majority opinion
# (hard-coded here: one label per remaining row, in row order)
filtered_df['label'] = ['CLD', 'CLD', 'TDA', 'CRY', 'TDA', 'SAT', 'NET']

# remove unneeded columns
to_drop = ['researcher1', 'researcher2', 'researcher3']
filtered_df = filtered_df.drop(columns=to_drop)

# inspect results
filtered_df

Unnamed: 0,control_text,label
0,Basic criterion: Conflicting tasks and responsibilities are separated based on an OIS-06 risk assessment to reduce the risk of unauthorized or unintended changes or misuse of cloud customer data p...,CLD
1,"Basic criterion: Access to the functions provided by the cloud service is restricted by access controls (authorization mechanisms) that verify whether users, IT components, or applications are aut...",CLD
7,Prevent the installation of [assignment: organization-defined software and firmware components] without verification that the component has been digitally signed using a certificate that is recogn...,TDA
9,Make provisions so that [assignment: organization-defined encrypted communications traffic] is visible to [assignment: organization-defined system monitoring tools and mechanisms]. Organizations b...,CRY
10,"Assess and review the supply chain-related risks associated with suppliers or contractors and the system, system component, or system service they provide [assignment: organization-defined frequen...",TDA
11,"Train organization-defined personnel or roles to detect counterfeit system components (including hardware, software, and firmware). None.",SAT
12,Authorize network access to [assignment: organization-defined privileged commands] only for [assignment: organization-defined compelling operational needs] and document the rationale for such acce...,NET


In [24]:
# Merge filtered_df with temp_labels (label -> Abbreviation) to attach each label's definition
to_drop = ['Abbreviation', 'Category']
df_training = pd.merge(filtered_df, temp_labels, left_on='label', right_on='Abbreviation', how='left').drop(columns=to_drop)

# inspect results
df_training.head()

Unnamed: 0,control_text,label,Definition
0,Basic criterion: Conflicting tasks and responsibilities are separated based on an OIS-06 risk assessment to reduce the risk of unauthorized or unintended changes or misuse of cloud customer data p...,CLD,Govern cloud instances as an extension of on-premise technologies with equal or greater security protections than the organization’s own internal cybersecurity and privacy controls.
1,"Basic criterion: Access to the functions provided by the cloud service is restricted by access controls (authorization mechanisms) that verify whether users, IT components, or applications are aut...",CLD,Govern cloud instances as an extension of on-premise technologies with equal or greater security protections than the organization’s own internal cybersecurity and privacy controls.
2,Prevent the installation of [assignment: organization-defined software and firmware components] without verification that the component has been digitally signed using a certificate that is recogn...,TDA,"Develop and test systems, applications or services according to a Secure Software Development Framework (SSDF) to reduce the potential impact of undetected or unaddressed vulnerabilities and desig..."
3,Make provisions so that [assignment: organization-defined encrypted communications traffic] is visible to [assignment: organization-defined system monitoring tools and mechanisms]. Organizations b...,CRY,Utilize appropriate cryptographic solutions and industry-recognized key management practices to protect the confidentiality and integrity of sensitive/regulated data both at rest and in transit.
4,"Assess and review the supply chain-related risks associated with suppliers or contractors and the system, system component, or system service they provide [assignment: organization-defined frequen...",TDA,"Develop and test systems, applications or services according to a Secure Software Development Framework (SSDF) to reduce the potential impact of undetected or unaddressed vulnerabilities and desig..."


In [25]:
# Concatenate the label abbreviation and its definition, as we need a single column of labels
df_training['final_label'] = df_training['label'] + ': ' + df_training['Definition']

to_drop = ['label', 'Definition']
df_training = df_training.drop(columns=to_drop)

# check results
df_training.head()

Unnamed: 0,control_text,final_label
0,Basic criterion: Conflicting tasks and responsibilities are separated based on an OIS-06 risk assessment to reduce the risk of unauthorized or unintended changes or misuse of cloud customer data p...,CLD: Govern cloud instances as an extension of on-premise technologies with equal or greater security protections than the organization’s own internal cybersecurity and privacy controls.
1,"Basic criterion: Access to the functions provided by the cloud service is restricted by access controls (authorization mechanisms) that verify whether users, IT components, or applications are aut...",CLD: Govern cloud instances as an extension of on-premise technologies with equal or greater security protections than the organization’s own internal cybersecurity and privacy controls.
2,Prevent the installation of [assignment: organization-defined software and firmware components] without verification that the component has been digitally signed using a certificate that is recogn...,"TDA: Develop and test systems, applications or services according to a Secure Software Development Framework (SSDF) to reduce the potential impact of undetected or unaddressed vulnerabilities and ..."
3,Make provisions so that [assignment: organization-defined encrypted communications traffic] is visible to [assignment: organization-defined system monitoring tools and mechanisms]. Organizations b...,CRY: Utilize appropriate cryptographic solutions and industry-recognized key management practices to protect the confidentiality and integrity of sensitive/regulated data both at rest and in transit.
4,"Assess and review the supply chain-related risks associated with suppliers or contractors and the system, system component, or system service they provide [assignment: organization-defined frequen...","TDA: Develop and test systems, applications or services according to a Secure Software Development Framework (SSDF) to reduce the potential impact of undetected or unaddressed vulnerabilities and ..."


## Tokenizing training data

In [26]:
# df_training is DataFrame with labeled data
texts = df_training['control_text'].values
labels = df_training['final_label'].values

# Convert text labels to integers
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Create a Dataset object from your texts and encoded labels
labeled_dataset = Dataset.from_dict({"text": texts, "label": encoded_labels})

# Tokenize the labeled dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # Same tokenization settings as for MLM; the "label" column passes through unchanged
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_labeled_dataset = labeled_dataset.map(tokenize_function, batched=True)

# Load the domain-adapted model with a new sequence-classification head
domain_adapted_model = "../outputs/domain_adapted_model"  # saved in step 6 above
model = AutoModelForSequenceClassification.from_pretrained(domain_adapted_model, num_labels=len(label_encoder.classes_))



Map:   0%|          | 0/7 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ../outputs/domain_adapted_model and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="../outputs/fine_tuned_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_labeled_dataset,
)

In [28]:
# Train the model
trainer.train()

  0%|          | 0/3 [00:00<?, ?it/s]

{'train_runtime': 5.5518, 'train_samples_per_second': 3.783, 'train_steps_per_second': 0.54, 'train_loss': 1.4827165603637695, 'epoch': 3.0}


TrainOutput(global_step=3, training_loss=1.4827165603637695, metrics={'train_runtime': 5.5518, 'train_samples_per_second': 3.783, 'train_steps_per_second': 0.54, 'total_flos': 5525480991744.0, 'train_loss': 1.4827165603637695, 'epoch': 3.0})

In [29]:
# Optionally, save the model and the tokenizer
model.save_pretrained("../outputs/fine_tuned_model")
tokenizer.save_pretrained("../outputs/fine_tuned_model")

('outputs/fine_tuned_model/tokenizer_config.json',
 'outputs/fine_tuned_model/special_tokens_map.json',
 'outputs/fine_tuned_model/vocab.txt',
 'outputs/fine_tuned_model/added_tokens.json',
 'outputs/fine_tuned_model/tokenizer.json')

In [30]:
# Also, save the label encoder for later use in inference to decode the predicted labels
joblib.dump(label_encoder, "../outputs/fine_tuned_model/label_encoder.joblib")

['../outputs/fine_tuned_model/label_encoder.joblib']
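
The saved label encoder is what maps the classifier's integer predictions back to label strings at inference time. The following plain-Python sketch mimics that round trip, assuming the encoder assigns integer codes to the sorted unique labels. The helper names are hypothetical, and the bare abbreviations stand in for the longer "abbreviation: definition" strings used as labels in this notebook.

```python
def fit_classes(labels):
    """Sorted unique labels; a label's position in the list is its integer code."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map label strings to integer codes."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def decode(codes, classes):
    """Map integer codes (e.g. predicted class ids) back to label strings."""
    return [classes[code] for code in codes]


labels = ["CLD", "CLD", "TDA", "CRY", "TDA", "SAT", "NET"]
classes = fit_classes(labels)
codes = encode(labels, classes)
print(classes)  # ['CLD', 'CRY', 'NET', 'SAT', 'TDA']
print(codes)    # [0, 0, 4, 1, 4, 3, 2]
print(decode(codes, classes) == labels)  # True
```

Without the stored class order, a predicted id such as 4 could not be reliably traced back to its label, which is why the encoder is saved next to the model.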