### Overview
This script was originally a python script turned Jupyter notebook - it gives various configurations & choices for modelling cluster visualizations with or without the generated cluster data (it's recommended you just use the already existing cluster data in `~/Data/pkl_clusters`).

The generated visualizations run on PyVis and often can have poor performance due to the size of data alongside PyVis simulating physics. I haven't really found a good way to get a good arrangement of the PyVis graph without enabling physics, so often what I do is upon opening the project I enable physics for a bit, then disable it after a few minutes.

This is definitely not the best program so please feel free to rectify issues you may find in it

#### Setup/Requirements


In [None]:
%pip install pandas
%pip install nltk
%pip install pyspellchecker
%pip install pyvis
%pip install tqdm

In [2]:
import pandas as pd
from pyvis.network import Network
import networkx as nx
from spellchecker import SpellChecker
import time
from tqdm import tqdm
import nltk
nltk.download('words')
from nltk.corpus import words as nltk_words

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\brain\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


#### Source Code

In [13]:
# Vars
lg_doc_lt = pd.read_pickle('../../Data/lookup_tables/doc_lookup_table')
sm_doc_lt = pd.read_pickle('../../Data/lookup_tables/sm_doc_lookup_table')
vs_doc_lt = pd.read_pickle('../../Data/lookup_tables/vs_doc_lookup_table')

cluster_distances = [
    0.04, 0.059, 0.095, 0.136, 0.188, 0.231, 0.043, 0.062, 
    0.1, 0.143, 0.19, 0.235, 0.045, 0.067, 0.105, 0.154, 
    0.2, 0.238, 0.048, 0.071, 0.111, 0.158, 0.211, 0.241, 
    0.05, 0.077, 0.118, 0.167, 0.214, 0.25, 0.053, 0.083, 
    0.125, 0.176, 0.222, 0.056, 0.091, 0.133, 0.182, 0.227
]

# Pyvis Graph Config
net = Network(select_menu=True, cdn_resources='remote')
net.show_buttons(filter_=['physics'])

def pyvisSaveGraph(name: str):
    # Disable Physics on Nodes
    net.toggle_physics(False)

    # Generate Graph with UTF-8 Encoding
    html = net.generate_html()
    with open(f"cluster_visualizations/{name}.html", mode='w', encoding='utf-8') as fp:
        fp.write(html)
    print(f"Done! Look for cluster_visualizations/{name}.html")        

##### Word Frequency Visualization
Given a threshold size and a size set for the document list to analyze, generates associations based off levensthein distances between words available in the corpus

In [14]:
def visualizeWordFreqData(threshold: int, doc_lt: dict, distance: int):
    timer = 0

    # Count Word Frequencies, then filter words not meeting a threshold
    all_words_freqs = {}

    timer = time.time()

    for doc, doc_val in tqdm(doc_lt.items(), desc="Counting Frequencies of all Words across documents".ljust(65)):
        for word in doc_val:
            # Prevent Nulls from being added
            if word == None:
                continue       

            # Count up times used across documents
            if word not in all_words_freqs:
                all_words_freqs[word] = 0
            all_words_freqs[word] += 1

    print("\tExecution time: ", time.time() - timer, " seconds")

    # Filter by Threshold Value
    timer = time.time()

    word_freqs = {}
    for word, word_freq in tqdm(all_words_freqs.items(), desc="Filtering words with frequencies below threshold".ljust(65)):
        if word_freq >= threshold:
            word_freqs[word] = word_freq
    print("\tExecution time: ", time.time() - timer, " seconds")

    # Match Similar Words based on Levenshtein Distance
    root_groups = {}
    corrected_groups = {}

    spell = None
    if distance == 1:
        spell = SpellChecker(distance=1)
    else:
        spell = SpellChecker()

    timer = time.time()

    for word, word_freq in tqdm(word_freqs.items(), desc="Spell Checking words via Levensthein Algorithm & Grouping them".ljust(65)):
        # Create source groups for grouping nodes later
        if word not in corrected_groups:
            corrected_word = spell.correction(word)
            source_word = corrected_word if corrected_word != None else word
            corrected_groups[word] = source_word
        else:
            source_word = corrected_groups[word]  

        # Associate groups
        if source_word not in root_groups:
            root_groups[source_word] = []
        
        if word not in root_groups[source_word]:
            root_groups[source_word].append(word)
    
    print("\tExecution time: ", time.time() - timer, " seconds")

    # Create Visual Graph
    timer = time.time()

    for word, word_freq in tqdm(word_freqs.items(), desc="Generating Graph...".ljust(65)):
        _group = corrected_groups[word]  

        if len(root_groups[_group]) > 1:
            # Add Node to diagram
            net.add_node(
                word, 
                size=max(min(word_freq / 2, 30), 3),  
                label=f"{word}\n({word_freq})",
                group=_group
                )

            # Add edges to common node to group them together
            if root_groups[_group][0] != word:
                net.add_edge(
                    word,
                    root_groups[_group][0],
                    width=0
                )
    
    print("\tExecution time: ", time.time() - timer, " seconds")

    pyvisSaveGraph(f"word_freq_diagram_{threshold}_{distance}")

Modify Variables below to play with function

In [15]:
# Minimum threshold of occurences for a word to appear on graph
threshold = 5 
# Word sets to look from (choices: vs_doc_lt, sm_doc_lt, lg_doc_lt)
doc_choice = sm_doc_lt
# Spellchecking Distance (choices: 1 (levensthein distance of 1), 2 (levensthein distance of 2))
distance = 2

visualizeWordFreqData(threshold, doc_choice, distance)

Counting Frequencies of all Words across documents               : 42it [00:00, 4570.97it/s]


	Execution time:  0.011195182800292969  seconds


Filtering words with frequencies below threshold                 : 100%|██████████| 4482/4482 [00:00<00:00, 720290.84it/s]




	Execution time:  0.007234096527099609  seconds


Spell Checking words via Levensthein Algorithm & Grouping them   : 100%|██████████| 298/298 [00:00<00:00, 406.14it/s]


	Execution time:  0.744495153427124  seconds


Generating Graph...                                              : 100%|██████████| 298/298 [00:00<?, ?it/s]

	Execution time:  0.002257823944091797  seconds
Done! Look for cluster_visualizations/word_freq_diagram_5_2.html





##### Word Cluster Visualization
Given a distance value to reference from the already ran cluster component algorithm, generates a basic cluster component diagram showing all clusters and the words in them.

In [20]:
def visualizeClusterCompData(distance: float, min_threshold: int, max_threshold: int):
    timer = 0
    initial_ls = pd.read_pickle(f'../../Data/pkl_clusters/connected_comps_{distance}')
    cluster_ls = []

    # Filter Cluster Comp List by thresholds
    if min_threshold != -1 or max_threshold != -1:
        timer = time.time()
        for ls in tqdm(initial_ls, desc="Compiling cluster list from data: ".ljust(65)):
            if len(ls) >= min_threshold and (max_threshold == -1 or len(ls) <= max_threshold):
                cluster_ls.append(ls)
        print("\tExecution time: ", time.time() - timer, " seconds")
    else:
        cluster_ls = initial_ls

    # Generate Graph
    timer = time.time()
    word_set = nltk_words.words()

    for ls in tqdm(cluster_ls, desc="Generating Graph & Pairing words: ".ljust(65)):
        # Edge case in case there's only one word
        if len(ls) <= 1:
            net.add_node(
                ls[0], 
                size=2,  
                label=f"{ls[0]}",
                group=ls[0]
                )
            continue

        # Identify Correct Word to link the words to, or at least the closest
        correct_word = ls[0]

        for word in ls: 
            if word in word_set:
                correct_word = word
                break

        net.add_node(
            correct_word, 
            size=9, 
            label=f"{correct_word}",
            group=correct_word
        )
        
        for word in ls: 
            if word != correct_word:
                net.add_node(
                    word, 
                    size=6, 
                    label=f"{word}",
                    group=correct_word
                )
    
                net.add_edge(
                    word,
                    correct_word,
                    width=0
                )

    print("\tExecution time: ", time.time() - timer, " seconds")

    pyvisSaveGraph(f"cluster_comp_diagram_{distance}_{min_threshold}_{max_threshold}")

Modify Variables below to play with function

In [22]:
# Distance value for clustering comp algorithm to reference

# Choices:      0.04, 0.059, 0.095, 0.136, 0.188, 0.231, 0.043, 0.062, 
#               0.1, 0.143, 0.19, 0.235, 0.045, 0.067, 0.105, 0.154, 
#               0.2, 0.238, 0.048, 0.071, 0.111, 0.158, 0.211, 0.241, 
#               0.05, 0.077, 0.118, 0.167, 0.214, 0.25, 0.053, 0.083, 
#               0.125, 0.176, 0.222, 0.056, 0.091, 0.133, 0.182, 0.227
distance = 0.211

# Clusters with less words than this threshold are not included in visualization
min_threshold = 5
# Clusters with more words than this threshold are not included in visualization
max_threshold = 20

visualizeClusterCompData(distance, min_threshold, max_threshold)

Compiling cluster list from data:                                : 100%|██████████| 100380/100380 [00:00<00:00, 4488914.15it/s]


	Execution time:  0.023319244384765625  seconds


Generating Graph & Pairing words:                                : 100%|██████████| 2688/2688 [01:20<00:00, 33.23it/s]


	Execution time:  80.96397066116333  seconds
Done! Look for cluster_visualizations/cluster_comp_diagram_0.211_5_20.html


##### Word Cluster with Frequency Visualizer
Given a distance value to reference from the already ran cluster component algorithm, generates a basic cluster component diagram showing all clusters and the words in them. Includes also their frequency based off a document lookup table set.

Personally this one isn't that useful... It's just a blend of the previous two algorithms without many benefits. Stick with the Cluster Component Visualization if this may not be needed.

In [23]:
def visualizeClusterCompFreqData(distance: float, freq_threshold: int, doc_lt: dict, min_cluster_threshold: int, max_cluster_threshold: int):
    timer = 0

    # Count Word Frequencies, then filter words not meeting a threshold
    all_words_freqs = {}
    timer = time.time()
    for doc, doc_val in tqdm(doc_lt.items(), desc="Counting Frequencies of all Words across documents".ljust(65)):
        for word in doc_val:
            # Prevent Nulls from being added
            if word == None:
                continue       

            # Count up times used across documents
            if word not in all_words_freqs:
                all_words_freqs[word] = 0
            all_words_freqs[word] += 1
    print("\tExecution time: ", time.time() - timer, " seconds")

    # Filter by Threshold Value
    timer = time.time()
    word_freqs = {}
    for word, word_freq in tqdm(all_words_freqs.items(), desc="Filtering words with frequencies below threshold".ljust(65)):
        if word_freq >= freq_threshold:
            word_freqs[word] = word_freq
    print("\tExecution time: ", time.time() - timer, " seconds")

    # Merge pickled information into one list
    initial_ls = pd.read_pickle(f'../../Data/pkl_clusters/connected_comps_{distance}')
    full_cluster_ls = []

    # Filter Cluster Comp List by thresholds
    if min_cluster_threshold != -1 or max_cluster_threshold != -1:
        timer = time.time()
        for ls in tqdm(initial_ls, desc="Compiling cluster list from data: ".ljust(65)):
            if len(ls) >= min_cluster_threshold and (max_cluster_threshold == -1 or len(ls) <= max_cluster_threshold):
                full_cluster_ls.append(ls)
        print("\tExecution time: ", time.time() - timer, " seconds")
    else:
        full_cluster_ls = initial_ls

    # Filter by Words with recorded word frequencies
    timer = time.time()
    final_cluster_ls = []
    for ls in tqdm(full_cluster_ls, desc="Culling clusters with low frequencies: ".ljust(65)):
        # Remove words from each list that did not meet the word frequency threshold
        for word in ls[:]: # Iterate through copy of list
            if word not in word_freqs:
                ls.remove(word)
        # If every word met requirements, then yay!
        if len(ls) > 1:
            final_cluster_ls.append(ls)
    print("\tExecution time: ", time.time() - timer, " seconds")

    # Now Perform Clustering Algorithm & Generate Graph
    timer = time.time()
    word_set = nltk_words.words()

    for ls in tqdm(final_cluster_ls, desc="Generating Graph & Pairing words: ".ljust(65)):
        # print("List I'm looking at !! ", ls)
        # Edge case in case there's only one word
        if len(ls) <= 1:
            net.add_node(
                word, 
                size=2,  
                label=f"{word}\n({word_freqs[word] if word in word_freqs else 'BROKEN'})",
                group=word
                )
            continue

        # Identify Correct Word to link the words to, or at least the closest
        correct_word = ls[0]

        for word in ls: 
            if word in word_set:
                correct_word = word
                break

        net.add_node(
            correct_word, 
            size=5, 
            label=f"{correct_word}\n({word_freqs[correct_word] if correct_word in word_freqs else 'BROKEN'})",
            group=correct_word
            )
        
        for word in ls: 
            if word != correct_word:
                net.add_node(
                    word, 
                    size=2, 
                    label=f"{word}\n({word_freqs[word] if word in word_freqs else 'BROKEN'})",
                    group=correct_word
                )
    
                net.add_edge(
                    word,
                    correct_word,
                    width=0
                )
    print("\tExecution time: ", time.time() - timer, " seconds")

    pyvisSaveGraph(f"cluster_comp_freq_diagram_{distance}_{freq_threshold}_{min_cluster_threshold}_{max_cluster_threshold}")

Modify Variables below to play with function

In [24]:
# Distance value for clustering comp algorithm to reference

# Choices:      0.04, 0.059, 0.095, 0.136, 0.188, 0.231, 0.043, 0.062, 
#               0.1, 0.143, 0.19, 0.235, 0.045, 0.067, 0.105, 0.154, 
#               0.2, 0.238, 0.048, 0.071, 0.111, 0.158, 0.211, 0.241, 
#               0.05, 0.077, 0.118, 0.167, 0.214, 0.25, 0.053, 0.083, 
#               0.125, 0.176, 0.222, 0.056, 0.091, 0.133, 0.182, 0.227
distance = 0.211

# Clusters with less words than this threshold are not included in visualization
min_cluster_threshold = 5
# Clusters with more words than this threshold are not included in visualization
max_cluster_threshold = 20

# Minimum threshold of occurences for a word to appear on graph
freq_threshold = 10 
# Word sets to look from (choices: vs_doc_lt, sm_doc_lt, lg_doc_lt)
doc_choice = lg_doc_lt

visualizeClusterCompFreqData(distance, freq_threshold, doc_choice, min_cluster_threshold, max_cluster_threshold)

Counting Frequencies of all Words across documents               : 100%|██████████| 1466/1466 [00:00<00:00, 9737.61it/s]


	Execution time:  0.1524655818939209  seconds


Filtering words with frequencies below threshold                 : 100%|██████████| 146393/146393 [00:00<00:00, 4338605.08it/s]


	Execution time:  0.03374195098876953  seconds


Compiling cluster list from data:                                : 100%|██████████| 100380/100380 [00:00<00:00, 3028297.75it/s]


	Execution time:  0.033147335052490234  seconds


Culling clusters with low frequencies:                           : 100%|██████████| 2688/2688 [00:00<?, ?it/s]


	Execution time:  0.0  seconds


Generating Graph & Pairing words:                                : 100%|██████████| 336/336 [00:02<00:00, 116.57it/s]


	Execution time:  2.9471027851104736  seconds
Done! Look for cluster_visualizations/cluster_comp_freq_diagram_0.211_10_5_20.html
