In [1]:
import numpy as np
from umap import UMAP
from sklearn.decomposition import PCA
import pandas as pd
import re
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from bertopic.representation import KeyBERTInspired
import string
from bertopic.representation import OpenAI
from bertopic.vectorizers import ClassTfidfTransformer


## Load split_emails.csv

In [2]:
emails = pd.read_csv("split_emails.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'split_emails.csv'

In [None]:
emails.head()

## Function to parse the emails

In [None]:
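def parse_email(raw_message):
    """Extract the 'from' and 'to' headers and the body text from a raw email."""
    lines = raw_message.split('\n')
    email = {}
    body_lines = []
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            # Lines without a colon are treated as body text.
            body_lines.append(line.strip())
        else:
            # Split on the first colon only, so values containing ':' stay intact.
            key, val = line.split(':', 1)
            key = key.lower()
            if key in keys_to_extract:
                email[key] = val.strip()
    # Join with spaces so words on adjacent lines do not run together.
    email['body'] = ' '.join(part for part in body_lines if part)
    return email


def make_list(emails, key):
    """Collect one field from every parsed email, using '' when it is missing."""
    return [email.get(key, '') for email in emails]


def parse_final(messages):
    """Parse all raw messages into column lists suitable for a DataFrame."""
    emails = [parse_email(message) for message in messages]
    return {
        'body': make_list(emails, 'body'),
        'to': make_list(emails, 'to'),
        'from_': make_list(emails, 'from'),
    }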
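
The parser above treats every line that contains a colon as a header, so body lines such as "time: 10:30" are dropped. As a possible alternative, here is a minimal sketch that uses the standard library `email` package. It assumes each raw message starts with RFC 822-style headers followed by a blank line; the sample message and the helper name `parse_email_stdlib` are hypothetical and not part of the original notebook.

```python
from email import message_from_string

# Hypothetical raw message, used only for illustration.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting at 10:30\n"
    "\n"
    "Please confirm the time: 10:30 works for me.\n"
    "Thanks"
)


def parse_email_stdlib(raw):
    """Parse headers and body with the standard library instead of splitting on ':'."""
    msg = message_from_string(raw)
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        # For a non-multipart message the payload is the body as a string;
        # multipart messages would return a list of parts instead.
        "body": msg.get_payload(),
    }


parsed = parse_email_stdlib(raw_message)
print(parsed["from"])
print(parsed["body"])
```

With this approach the body line containing a colon is preserved, at the cost of assuming well-formed headers.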

In [None]:
email_df = pd.DataFrame(parse_final(emails.message))

In [None]:
email_df

### Clean the unwanted characters from the body column

In [None]:
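def clean_email(test_cs_emails):
    """Return the unique, non-empty cleaned email bodies as a list of strings."""
    # Remove URLs and lowercase the text.
    test_cs_emails.body = test_cs_emails.apply(lambda row: re.sub(r"http\S+", "", row.body).lower(), 1)
    # Drop tokens that start with "@".
    test_cs_emails.body = test_cs_emails.apply(lambda row: " ".join(filter(lambda x: x[0] != "@", row.body.split())), 1)
    # Replace every run of non-letter characters with a single space.
    test_cs_emails.body = test_cs_emails.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.body).split()), 1)
    # Deduplicate while keeping first-seen order (a plain set gives an arbitrary order).
    emails = list(dict.fromkeys(test_cs_emails.body))
    # Remove empty strings.
    emails = [text for text in emails if text]
    return emails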
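
To see what the three cleaning steps do, the sketch below applies the same regular expressions to a single string. The helper `clean_text` and the sample sentence are hypothetical and only illustrate the effect; note that digits are removed along with punctuation.

```python
import re


def clean_text(text):
    """Apply the three cleaning steps from clean_email to a single string."""
    # 1. Remove URLs and lowercase.
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop tokens that start with "@".
    text = " ".join(tok for tok in text.split() if not tok.startswith("@"))
    # 3. Replace runs of non-letter characters with a single space.
    text = " ".join(re.sub("[^a-zA-Z]+", " ", text).split())
    return text


sample = "Hi @John, see http://example.com/report for Q3 numbers!"
cleaned = clean_text(sample)
print(cleaned)  # hi see for q numbers
```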

In [None]:
body = clean_email(email_df)

In [None]:
body

## Get the number of cleaned email bodies

In [None]:
len(body)

## Use BERTopic

BERTopic can be viewed as a sequence of steps to create its topic representations. There are five steps to this process, plus an optional sixth:

- Step 1 - Embedding Extraction: Each document is converted into a dense numerical vector (an embedding) that captures its meaning, using a pre-trained language model.

- Step 2 - Dimensionality Reduction: Dimension reduction involves transforming data from a high-dimensional space to a low-dimensional space while preserving meaningful properties of the original data.

- Step 3 - Clustering Reduced Embeddings: Once the embeddings have been reduced, the data can be clustered. This is achieved using a density-based clustering technique called HDBSCAN, which can identify clusters of various shapes and detect outliers, ensuring that documents are not forced into inappropriate clusters. This process improves the quality of the resulting topic representation by reducing noise.

- Step 4 - Topic Tokenization: In this step, all documents within a cluster are combined into a single document, creating a bag of words representation.

- Step 5 - Extraction of Topic Words: By examining the generated bag-of-words representation, we can identify words that are distinctive to a specific cluster. These words are characteristic of one cluster and less prevalent in other clusters. To accomplish this, the TF-IDF (Term Frequency-Inverse Document Frequency) method is modified to consider topics (clusters) rather than individual documents.

- Step 6 - (Optional) Fine-tuning Topic Representations: The c-TF-IDF (class-based TF-IDF) topics can optionally be refined using techniques such as GPT, T5, KeyBERT, spaCy, and others. BERTopic offers several implementations to choose from. In this notebook, the OpenAI implementation raised an error that could not be resolved, so the KeyBERTInspired implementation is used instead.

Reference: [BERTopic](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)

# Step 1 - Extract embeddings
- Embeddings are computed with SentenceTransformer and the all-MiniLM-L6-v2 model.
- The all-* models are trained on all available training data (more than 1 billion training pairs) and are designed as general-purpose models.
- all-MiniLM-L6-v2 is a pre-trained sentence-transformers model. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
- According to the sentence-transformers documentation, all-MiniLM-L6-v2 is about 5 times faster than all-mpnet-base-v2, the highest-quality model in that list, and still offers good quality.
- Other pre-trained models can be found [here](https://www.sbert.net/docs/pretrained_models.html)


In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
- This step reduces the dimensionality of the input embeddings.
- Embeddings are often high-dimensional, which makes clustering difficult.
- A solution is to reduce the embeddings to a workable number of dimensions (e.g., 5) for the clustering algorithm to work with.
- UMAP is used as the default in BERTopic since it can capture both the local and global structure of the high-dimensional space in lower dimensions.
- `n_neighbors`: the size of the local neighborhood.
- `n_components`: the dimension of the space to embed into. UMAP defaults to 2 to provide easy visualization; here it is set to 5.
- `metric`: the metric used to compute distances in the high-dimensional space.
- `min_dist`: the effective minimum distance between embedded points. Smaller values result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values result in a more even dispersal of points.


In [None]:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')


# Step 3 - Cluster reduced embeddings
- After reducing the dimensionality of the input embeddings, we cluster them into groups of similar embeddings to extract our topics.
- Clustering quality matters: the better the clustering, the more accurate the topic representations.
- HDBSCAN stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise.
- It performs DBSCAN over varying epsilon values and integrates the results to find the clustering that is most stable over epsilon.
- This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN) and makes it more robust to parameter selection.
- HDBSCAN needs little or no parameter tuning, which makes it well suited to exploratory data analysis.


In [18]:
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics

- This step tokenizes the documents in each topic, which is the basis for the topic representations.
- CountVectorizer converts text to numerical data (token counts).
- By default, CountVectorizer lowercases the text and uses word-level tokenization.
- English stop words (e.g., a, an, the, is, has, of, are) are removed via `stop_words="english"`.

In [19]:
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
- A class-based TF-IDF procedure that uses scikit-learn's TfidfTransformer as a base.
- c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Thus, each class is converted to a single document instead of a set of documents.
- The frequency of each word x is extracted for each class c and is L1-normalized. This constitutes the term frequency.


In [None]:
ctfidf_model = ClassTfidfTransformer()
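
Steps 4 and 5 can be illustrated with a small pure-Python sketch. It assumes the c-TF-IDF weighting described in the BERTopic documentation, where the score of word x in class c is the L1-normalized term frequency multiplied by log(1 + A / f_x), with A the average number of words per class and f_x the frequency of x across all classes. The toy clusters and the helper `ctfidf` are hypothetical; this is not the library's implementation, and stop words are not removed here.

```python
import math
from collections import Counter

# Hypothetical clusters of already-cleaned documents (topic id -> documents).
clusters = {
    0: ["gas prices rise", "gas demand falls"],
    1: ["meeting on friday", "friday meeting moved"],
}

# Step 4: join each cluster into one document and count its words.
counts = {topic: Counter(" ".join(docs).split()) for topic, docs in clusters.items()}

# Frequency of each word across all clusters, and average number of words per cluster.
total = Counter()
for cluster_counts in counts.values():
    total.update(cluster_counts)
avg_words = sum(total.values()) / len(counts)


def ctfidf(topic):
    """Step 5: score each word as L1-normalized tf * log(1 + A / f_x)."""
    n_words = sum(counts[topic].values())
    return {
        word: (freq / n_words) * math.log(1 + avg_words / total[word])
        for word, freq in counts[topic].items()
    }


scores = ctfidf(0)
top_word = max(scores, key=scores.get)
print(top_word)  # gas
```

In this toy data "gas" scores highest for cluster 0 because it is frequent in that cluster and absent from the other one.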


# Step 6 - Fine-tune topic representations
KeyBERTInspired refines the c-TF-IDF topic words as follows:
- First, select a few representative documents for each topic: candidate documents are sampled at random from each cluster, and the top ones are chosen by their similarity to the topic.
- Next, identify candidate words for each topic.
- Then compute the embeddings of the candidate words and the representative documents.
- Topic embeddings are created by averaging the embeddings of the representative documents.
- Finally, find the words most similar to each topic embedding using cosine similarity.

The representation models available in BERTopic are:
  - MaximalMarginalRelevance
  - PartOfSpeech
  - KeyBERTInspired
  - ZeroShotClassification
  - TextGeneration
  - Cohere
  - OpenAI
  - LangChain


In [10]:
representation_model = KeyBERTInspired()
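
The last three KeyBERTInspired steps listed above (embed, average, rank by cosine similarity) can be sketched with toy vectors. The 3-dimensional embeddings and word names below are hypothetical placeholders; in the notebook the real vectors would come from the embedding model, and this sketch is not the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical embeddings of the representative documents of one topic.
representative_docs = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]

# Hypothetical embeddings of the candidate words for that topic.
candidate_words = {
    "escrow": [0.8, 0.2, 0.0],
    "weekend": [0.0, 0.1, 0.9],
    "financing": [0.6, 0.4, 0.1],
}

# Topic embedding = element-wise mean of the representative document embeddings.
topic_embedding = [sum(dim) / len(representative_docs) for dim in zip(*representative_docs)]

# Rank candidate words by similarity to the topic embedding, most similar first.
ranked = sorted(
    candidate_words,
    key=lambda word: cosine(candidate_words[word], topic_embedding),
    reverse=True,
)
print(ranked)  # ['escrow', 'financing', 'weekend']
```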

# All steps together


In [40]:
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - Fine-tune topic representations
)
topics, probabilities = topic_model.fit_transform(body)

## Data Visualization

In [42]:
topic_model.visualize_barchart(n_words=10)

#### Create a 2D representation of the topics

In [20]:
topic_model.visualize_topics()


#### Visualize the documents and topics to get insight into their relationships
- By default, the document-level visualization recalculates the document embeddings and reduces them to 2-dimensional space for easier visualization.
- This step is computationally quite expensive, so the embeddings are computed and reduced once below and then reused.

In [22]:
embedding_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [23]:
# Prepare embeddings
embeddings = embedding_model.encode(body, show_progress_bar=False)

# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

## Get the terms or topics and their scores

In [49]:
topic_model.get_topic_info().head(10)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 2053 | -1_enron_corp_market_trading |
| 1 | 0 | 241 | 0_building_escrow_builders_financing |
| 2 | 1 | 156 | 1_resume_resumes_interviewing_interviews |
| 3 | 2 | 154 | 2_demand_market_prices_buying |
| 4 | 3 | 129 | 3_enron_enronoptions_corp_ermis |
| 5 | 4 | 117 | 4_phillip_forwarded_john_keith |
| 6 | 5 | 116 | 5_iso_stakeholders_ufe_inter |
| 7 | 6 | 95 | 6_attend_scheduled_meeting_attending |
| 8 | 7 | 85 | 7_enron_cn_allen_miller |
| 9 | 8 | 84 | 8_weekend_lilly_saturdaymorning_friday |


Topic -1 has a count of 2053. BERTopic assigns -1 to documents that HDBSCAN treats as outliers, so let's look at the terms in that topic.

In [50]:
topic_model.get_topic(-1)

[('enron', 0.5135958),
 ('corp', 0.36308533),
 ('market', 0.33070564),
 ('trading', 0.2967947),
 ('business', 0.29182035),
 ('company', 0.29075503),
 ('energy', 0.25308573),
 ('houston', 0.24630976),
 ('sell', 0.24201877),
 ('financial', 0.23718387)]
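
To judge how much of the corpus ends up in the outlier topic, the topic assignments can simply be counted. The sketch below uses a short hypothetical list in place of the `topics` list returned by `fit_transform`; the values are made up for illustration.

```python
from collections import Counter

# Hypothetical topic assignments standing in for the `topics` list from fit_transform.
example_topics = [-1, 0, 0, -1, 1, 2, -1, 0]

# Count documents per topic; -1 is the outlier label.
topic_counts = Counter(example_topics)
outlier_share = topic_counts[-1] / len(example_topics)
print(f"{outlier_share:.1%} of documents are outliers")  # 37.5% of documents are outliers
```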

### The topics that were created can be hierarchically reduced
- In order to understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another.

In [43]:
topic_model.visualize_hierarchy()

In [44]:
# Extract hierarchical topics

hierarchical_topics = topic_model.hierarchical_topics(body)
topic_model.visualize_hierarchical_documents(body, hierarchical_topics, reduced_embeddings=reduced_embeddings)

100%|██████████| 52/52 [01:31<00:00,  1.77s/it]


### Visualize Topic Similarity


In [45]:
topic_model.visualize_heatmap()

### Visualize Probabilities or Distribution


In [46]:
topic_distr, _ = topic_model.approximate_distribution(body, min_similarity=0)


In [51]:
# Visualize the topic distribution of the last document
topic_model.visualize_distribution(topic_distr[-1])