***

# Topic Modeling

- By [Zachary Kilhoffer](https://zkilhoffer.github.io/)
- Updated 2024-06-17

***

## Description
- TODO
- This is the main topic modeling script.
- This is part of the BERTopic topic modeling pipeline.

***

### Input files:
Data: 
  
  > `/src/data/data-clean-embeddings.csv`

Fine-tuned LLM: 

  > `/src/outputs/fine_tuned_model`

***

# TODO:
- Start by adding sections for what needs to be done. 
- Ie, topic modeling, show visual of hierarchy and have explanation, merge categories, pass the representative texts to OpenAI 

In [1]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel, pipeline
from bertopic import BERTopic
import matplotlib.pyplot as plt
import json
import openai
from bertopic.backend import OpenAIBackend
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI

# Setup

In [2]:
# display tweaks
pd.set_option("display.max_colwidth", 200)  # how much text is showing within a cell
pd.set_option("display.max_columns", False)
pd.set_option("display.max_rows", False)
# warnings.filterwarnings("ignore")

In [3]:
# load data
df = pd.read_csv(
    "../data/data-clean-embeddings.csv",
    converters={"BERTembeddings": json.loads, "finetuned_embeddings": json.loads},
    index_col=0,
)

In [4]:
# inspect data
df.head(2)

Unnamed: 0_level_0,control_code,control_name,document,control_text_corrected,full_control_text,BERTembeddings,finetuned_embeddings
control_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
organisation of information security (ois),OIS-01,information security management system (isms),c5,Basic criterion: The cloud service provider operates an Information Security Management System (ISMS) in accordance with ISO/IEC 27001. The scope of the ISMS covers the cloud service provider's or...,Organisation of information security (ois). Information security management system (isms). Basic criterion: The cloud service provider operates an Information Security Management System (ISMS) in ...,"[-0.23476624488830566, -0.22193537652492523, -0.9937419891357422, 0.36702337861061096, 0.8497382998466492, -0.16401638090610504, -0.6381358504295349, -0.11291695386171341, -0.966692328453064, -0.9...","[0.07736359536647797, 0.3031435012817383, 0.2493477761745453, 0.2686106860637665, -0.08708430826663971, 0.18207691609859467, -0.3521919846534729, 0.13333041965961456, -0.4935278594493866, -0.26271..."
organisation of information security (ois),OIS-02,information security policy,c5,"Basic criterion: The top management of the cloud service provider has adopted an information security policy and communicated it to internal and external employees, as well as cloud customers. The...",Organisation of information security (ois). Information security policy. Basic criterion: The top management of the cloud service provider has adopted an information security policy and communicat...,"[-0.6862502098083496, -0.4821060001850128, -0.9979186654090881, 0.6634559631347656, 0.8985686898231506, -0.33475595712661743, 0.11295092105865479, 0.24699804186820984, -0.975180983543396, -0.99998...","[-0.016582056879997253, -0.0244668610394001, 0.07456215471029282, 0.16462843120098114, -0.07647715508937836, 0.17928528785705566, -0.32165899872779846, 0.036916326731443405, -0.4814055860042572, -..."


In [5]:
# main df
docs = list(df["full_control_text"].values)

# Out of the box topic modeling
- For illustration, you can get pretty good results with BERTopic with very little effort.

In [6]:
# topic modeling with BERTopic defaults
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


In [7]:
# show summary of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,48,-1_media_and_to_or,"[media, and, to, or, systems, the, information, of, external, access]",[Media protection. Media use. a. [selection: Restrict; Prohibit] the use of [assignment: organization-defined types of system media] on [assignment: organization-defined systems or system componen...
1,0,120,0_the_cloud_of_criterion,"[the, cloud, of, criterion, service, and, in, to, for, provider]",[Control and monitoring of service providers and suppliers (sso). Policies and instructions for controlling and monitoring third parties. Basic criterion: Policies and instructions for controlling...
2,1,36,1_configuration_system_software_and,"[configuration, system, software, and, components, the, changes, to, management, of]",[Configuration management. Configuration change control. A. Determine and document the types of changes to the system that are configuration-controlled.\nB. Review proposed configuration-controlle...
3,2,34,2_contingency_alternate_site_planning,"[contingency, alternate, site, planning, processing, recovery, sites, plans, plan, telecommunications]","[Contingency planning. Alternate storage site. a. Establish an alternate storage site, including necessary agreements to permit the storage and retrieval of system backup information; and b. Ensur..."
4,3,32,3_physical_access_power_environmental,"[physical, access, power, environmental, to, wireless, devices, facility, and, or]","[Physical and environmental protection. Monitoring physical access | monitoring physical access to systems. Monitor physical access to the system, in addition to the physical access monitoring of ..."
5,4,32,4_system_integrity_monitoring_information,"[system, integrity, monitoring, information, and, malicious, or, code, organizationdefined, to]",[System and information integrity. System monitoring | system-generated alerts. Alert [assignment: organization-defined personnel or roles] when the following system-generated indications of compr...
6,5,30,5_authentication_authenticators_identification_identity,"[authentication, authenticators, identification, identity, users, of, passwords, to, authenticator, and]",[Identification and authentication. Identification and authentication (organizational users). Uniquely identify and authenticate organizational users and associate that unique identification with ...
7,6,28,6_audit_records_event_record,"[audit, records, event, record, logging, of, information, time, and, accountability]",[Audit and accountability. Response to audit logging process failures. a. Alert (assignment: organization-defined personnel or roles) within (assignment: organization-defined time period) in the e...
8,7,27,7_accounts_account_access_privileged,"[accounts, account, access, privileged, or, system, users, of, control, for]","[Access control. Least privilege | privileged accounts. Restrict privileged accounts on the system to [assignment: organization-defined personnel or roles]. Privileged accounts, including super us..."
9,8,24,8_incident_response_and_incidents,"[incident, response, and, incidents, to, the, information, handling, training, or]",[Incident response. Incident response testing. Test the effectiveness of the incident response capability for the system [assignment: organization-defined frequency] using the following tests: [as...


In [8]:
# Visualize Topics
fig = topic_model.visualize_documents(docs=docs)
fig.update_layout(
    title="Topic clusters (out of the box)",
    legend=dict(bordercolor="Black", borderwidth=1),
)
fig.show()

# BERT topic modeling

In [9]:
# Train topic model using embeddings from BERT (pre-calculated)
embeddings = np.array(list(df['BERTembeddings'].values))
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)

In [10]:
# show summary of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,55,-1_the_and_of_to,"[the, and, of, to, for, system, or, information, that, security]",[Product safety and security (pss). Guidelines and recommendations for cloud customers. Basic criterion: The cloud service provider provides cloud customers with guidelines and recommendations for...
1,0,169,0_and_the_of_to,"[and, the, of, to, or, system, information, for, access, organizationdefined]",[Control and monitoring of service providers and suppliers (sso). Monitoring of compliance with requirements. Basic criterion: The cloud service provider monitors compliance with information secur...
2,1,99,1_the_of_and_to,"[the, of, and, to, criterion, cloud, in, for, service, are]","[Security policies and instructions (sp). Exceptions from existing policies and instructions. Basic Criterion: Exceptions to the policies and instructions for information security, as well as resp..."
3,2,70,2_and_the_or_procedures,"[and, the, or, procedures, to, of, policy, organizationdefined, system, security]","[Risk assessment. Policy and procedures. A. Develop, document, and disseminate to [assignment: organization-defined personnel or roles]: 1. [Selection (one or more): organization level; mission/bu..."
4,3,50,3_and_of_the_system,"[and, of, the, system, to, or, information, systems, organizations, organizationdefined]",[System and communications protection. Protection of information at rest. Protect the confidentiality and integrity of the following information at rest: Organization-defined information at rest. ...
5,4,27,4_and_the_system_of,"[and, the, system, of, or, for, to, components, information, configuration]","[System and services acquisition. Acquisition process | design and implementation information for controls. Require the developer of the system, system component, or system service to provide desi..."
6,5,27,5_and_the_of_to,"[and, the, of, to, system, or, testing, for, assessment, organizations]",[Security assessment and authorization. Authorization. a. Assign a senior official as the authorizing official for the system. \nb. Assign a senior official as the authorizing official for common ...
7,6,18,6_the_cloud_of_criterion,"[the, cloud, of, criterion, service, is, to, in, for, and]",[Identity and access management (idm). Access to cloud customer data. Basic criterion: The cloud customer is informed by the cloud service provider whenever internal or external employees of the c...
8,7,16,7_the_of_service_cloud,"[the, of, service, cloud, criterion, and, in, are, to, provider]",[Control and monitoring of service providers and suppliers (sso). Risk assessment of service providers and suppliers. Basic criterion: Service providers and suppliers of the cloud service provider...


In [11]:
# Visualize Topics
fig = topic_model.visualize_documents(docs, embeddings=embeddings)
fig.update_layout(
    title="Topic clusters (BERT)",
    legend=dict(bordercolor="Black", borderwidth=1),
)
fig.show()

# Finetuned BERT topic modeling

In [12]:
# Train topic model using embeddings from BERT (pre-calculated)
embeddings = np.array(list(df['finetuned_embeddings'].values))
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)

In [13]:
# show summary of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,296,0_and_the_of_to,"[and, the, of, to, system, or, information, for, access, that]","[System and services acquisition. External system services | processing, storage, and service location. Restrict the location of information processing, information or data, or system services to ..."
1,1,126,1_the_of_cloud_and,"[the, of, cloud, and, criterion, service, to, in, for, are]",[Control and monitoring of service providers and suppliers (sso). Monitoring of compliance with requirements. Basic criterion: The cloud service provider monitors compliance with information secur...
2,2,109,2_and_the_or_of,"[and, the, or, of, to, system, security, for, that, privacy]","[Access control. Policy and procedures. A. Develop, document, and disseminate to [assignment: organization-defined personnel or roles]: 1. [Selection (one or more): organization-level; mission/bus..."


In [14]:
# Visualize Topics
fig = topic_model.visualize_documents(docs, embeddings=embeddings)
fig.update_layout(
    title="Topic clusters (BERT finetuned)",
    legend=dict(bordercolor="Black", borderwidth=1),
)
fig.show()

# Preferred topic modeling

We can do most of the rest of the BERTopic algorithm in one function
<!-- - Dimensionality reduction
- Clustering
- Tokenizer
- Weighting Scheme -->


- Step 1 - Extract embeddings (though we pre-calculated ours)

` embedding_model = SentenceTransformer("all-MiniLM-L6-v2")`

- Step 2 - Reduce dimensionality

`umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')`

- Step 3 - Cluster reduced embeddings

`hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)`

- Step 4 - Tokenize topics

`vectorizer_model = CountVectorizer(stop_words="english")`

- Step 5 - Create topic representation

`ctfidf_model = ClassTfidfTransformer()`

- Step 6 - (Optional but strongly recommended) Fine-tune topic representations 

`representation_model=representation_model` 

See [documentation](https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html#min_topic_size "More info on minimum topic size and other parameters") for more on min_topic_size and other parameter choices.

In [15]:
from sentence_transformers import SentenceTransformer, models
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

In [16]:
# Load fine-tuned sentence-transformers model
# model_path = "outputs/sentence_transformers_compatible_model"
model_path = "../outputs/fine_tuned_model"
finetuned_model = SentenceTransformer(model_path)

# Load pre-generated embeddings
pre_generated_embeddings = list(df['finetuned_embeddings'].values)
pre_generated_embeddings = np.array(pre_generated_embeddings)

# specifying dimensionality reduction
umap_model = UMAP(n_neighbors=15, n_components=5, metric='cosine', low_memory=False, random_state=42)  # may need to tweak

# specifying cluster model
hdbscan_model = HDBSCAN(min_cluster_size=2, metric='euclidean', prediction_data=True)  # To Do: check with new min cluster size

# better stop words handling
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 3))

# Create a representation model, 3 parts
keybert_model = KeyBERTInspired(random_state=42)
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {
    "KeyBERT": keybert_model,
    # "OpenAI": openai_model,  # Uncomment if you will use OpenAI
    "MMR": mmr_model
}

# Instantiate BERTopic with fine-tuned model's embeddings and the representation model
topic_model = BERTopic(embedding_model=finetuned_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model,
                       verbose=True,
                       n_gram_range=(1, 3),
                       min_topic_size=5,
                       calculate_probabilities=True,
                       representation_model=representation_model).fit(docs, embeddings)

topics, probs = topic_model.fit_transform(docs, embeddings)

# note that embedding_model=finetuned_model doesn't remake embeddings. see https://github.com/MaartenGr/BERTopic/issues/1601                   

No sentence-transformers model found with name ../outputs/fine_tuned_model. Creating a new one with mean pooling.
2024-08-29 15:02:45,837 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-08-29 15:02:46,942 - BERTopic - Dimensionality - Completed ✓
2024-08-29 15:02:46,945 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-08-29 15:02:47,012 - BERTopic - Cluster - Completed ✓
2024-08-29 15:02:47,013 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-08-29 15:02:58,442 - BERTopic - Representation - Completed ✓
2024-08-29 15:02:58,568 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-08-29 15:02:59,608 - BERTopic - Dimensionality - Completed ✓
2024-08-29 15:02:59,608 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-08-29 15:02:59,733 - BERTopic - Cluster - Completed ✓
2024-08-29 15:02:59,734 - BERTopic - Representation - Extracting topics from cl

In [17]:
# check results
topic_model.get_topic_info().head()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,Representative_Docs
0,-1,57,-1_information_time_organization_organization defined,"[information, time, organization, organization defined, defined, training, communications, integrity, assignment organization, assignment organization defined]","[information integrity monitoring, unauthorized, software firmware information, software firmware, firmware information, organizational systems, communications traffic, criticality analysis, inbou...","[information, time, organization, organization defined, defined, training, communications, integrity, assignment organization, assignment organization defined]","[System and information integrity. Software, firmware, and information integrity. a. Employ integrity verification tools to detect unauthorized changes to the following software, firmware, and inf..."
1,0,112,0_cloud_criterion_cloud service_service,"[cloud, criterion, cloud service, service, service provider, provider, cloud service provider, continuous, customer, data]","[continuous auditing feasibility, auditing feasibility, auditing feasibility partially, cloud service provider, notes continuous auditing, supplementary information criterion, continuous auditing,...","[cloud, criterion, cloud service, service, service provider, provider, cloud service provider, continuous, customer, data]",[Dealing with investigation requests from government agencies (inq). Informing cloud customers about investigation requests. Basic criterion: The cloud service provider informs the affected cloud ...
2,1,110,1_privacy_organization_procedures_security privacy,"[privacy, organization, procedures, security privacy, security, policy, systems, organization defined, defined, organizations]","[executive orders directives, policies standards guidelines, orders directives regulations, directives regulations policies, orders directives, directives regulations, risk management, organizatio...","[privacy, organization, procedures, security privacy, security, policy, systems, organization defined, defined, organizations]","[Maintenance. Policy and procedures. A. Develop, document, and disseminate to organization-defined personnel or roles:\n\n1. Organization-level maintenance policy that addresses purpose, scope, ro..."
3,2,23,2_audit_audit records_records_audit record,"[audit, audit records, records, audit record, record, audit accountability, event, accountability, audit information, audit accountability audit]","[audit accountability audit, accountability audit, audit accountability, accountability audit record, repositories, audit log storage, audit record review, accountability, correlate, review analys...","[audit, audit records, records, audit record, record, audit accountability, event, accountability, audit information, audit accountability audit]",[Audit and accountability. Content of audit records | additional audit information. Generate audit records containing the following additional information: [assignment: organization-defined additi...
4,3,21,3_authentication_identity_authenticators_identification authentication,"[authentication, identity, authenticators, identification authentication, identification, credentials, organizational users, non, users, piv]","[identification authentication authenticator, authentication authenticator management, authentication identification authentication, personal identity verification, authentication authenticator, a...","[authentication, identity, authenticators, identification authentication, identification, credentials, organizational users, non, users, piv]",[Identification and authentication. Identification and authentication (non-organizational users) | acceptance of external authenticators. (a) Accept only external authenticators that are NIST-comp...


In [18]:
# visualize hierarchy
topic_model.visualize_hierarchy()

## Merge Topics

- In BERTopic, you can use .merge_topics to manually select and merge those topics. 
- Doing so will update their topic representation which in turn updates the entire model
- You can also track the merges and other changes in topics and their mappings with the [BERTopic.topic_mapper_](https://maartengr.github.io/BERTopic/api/bertopic.html) class.


In [19]:
# merge topics
topics_to_merge = [[11, 13],  # perhaps we want to merge physical access and maintenance topics
                   [6, 36]]  # perhaps we want to merge topics that seem to deal with "access""
topic_model.merge_topics(docs, topics_to_merge)
            
# merging and updating topic model:
topic_model.merge_topics(docs, topics_to_merge)

In [20]:
# see updated topics - these are just topic numbers for each of our documents
# topic_model.topics_

In [21]:
# check updates
topic_model.visualize_hierarchy()

In [22]:
# show updates differently
for k, v in topic_model.get_topics().items():
    if k != -1:
        print(f'Cluster {k : >2}:  {v[0][0]: >35} {v[1][0]: >35} {v[2][0]: >35}')

Cluster  0:                                cloud                           criterion                       cloud service
Cluster  1:                              privacy                        organization                          procedures
Cluster  2:                                audit                       audit records                             records
Cluster  3:                       authentication                            identity                      authenticators
Cluster  4:                             incident                   incident response                            response
Cluster  5:                             accounts                             account                               usage
Cluster  6:                           privileged                              access                           privilege
Cluster  7:                        cryptographic                       configuration                        cryptography
Cluster  8:                     

In [23]:
# printing tree - shows higher order, less granular topics too
hierarchical_topics = topic_model.hierarchical_topics(docs)
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)

100%|██████████| 39/39 [00:00<00:00, 846.26it/s]

.
├─contingency_alternate_site_telecommunications_planning
│    ├─■──recovery_reconstitution_critical_recovery point_recovery time ── Topic: 30
│    └─alternate_contingency_site_telecommunications_planning
│         ├─site_alternate_contingency_alternate storage_planning
│         │    ├─■──contingency_plan_contingency plan_planning_plans ── Topic: 14
│         │    └─■──alternate_site_alternate storage_storage_sites ── Topic: 10
│         └─■──telecommunications_telecommunications services_services_telecommunications service_alternate ── Topic: 15
└─security_information_cloud_service_organization
     ├─cloud_criterion_cloud service_service_service provider
     │    ├─discoverable_vulnerabilities_corrective actions_corrective_vulnerability
     │    │    ├─■──contaminated_corrective actions_corrective_disclosure_reporting ── Topic: 36
     │    │    └─■──discoverable_vulnerabilities_vulnerabilities scanned_scanned_vulnerability ── Topic: 27
     │    └─cloud_criterion_cloud service_s




## Redo topic representations w/ OpenAI API
> See [documentation](https://maartengr.github.io/BERTopic/api/representation/openai.html#bertopic.representation._openai.OpenAI).

In [24]:
# function to get your key
def read_key_from_file(filename="../keys/key.txt"):
    with open(filename, "r") as file:
        return file.read().strip()

In [25]:
# generate topic names using OpenAI
import openai
from bertopic.representation import OpenAI
from bertopic import BERTopic

# must manually add openai representation because the bertopic utility doesn't work
client = OpenAI(api_key=read_key_from_file())

# main loop - get the topic names from OpenAI
import time
rows_to_append = []

# client = openai.OpenAI(api_key=read_key_from_file())  # NOTE: this is the old and now defunct way of calling OPenAI

for x, i in enumerate(topic_model.get_topics()):
  # skip -1, outliers
  if x == 0 and i == -1:
        continue
  completion = client.chat.completions.create(
    model="gpt-4",
    messages = [
{"role": "system", "content": "You are a helpful assistant, knowledgeable on data privacy and security standards and controls, that helps in topic modeling tasks."},
{"role": "user", "content": f"I have a topic that contains the following documents: {topic_model.get_representative_docs()[i]}."},
{"role": "user", "content": f"The topic is described by the following keyword-probability pairs: {topic_model.get_topics()[i]}."},
{"role": "user", "content": """Based on the information above, extract a short but highly descriptive topic label of at most 5 words. 
 
    Make sure it is in the following format:
    
    topic: <topic label>."""},
]
  )

  # Extract the topic label from the completion response
  topic_label = completion.choices[0].message.content.split("topic: ")[1].strip()

  rows_to_append.append({
    "topic_num": i,
    "representative_docs": topic_model.get_representative_docs()[i],  
    "top_words": topic_model.get_topics()[i], 
    "topic_label": topic_label
    })

  print(f"Appended topic {i}: {topic_label}")

  time.sleep(.1)

TypeError: OpenAI.__init__() got an unexpected keyword argument 'api_key'