# Comparing Hypergraph Embeddings

The goal of this notebook is to use the tools in the [Tutte Institute ``vectorizers`` library](https://github.com/TutteInstitute/vectorizers) to construct hypergraph embeddings in which vertices and hyperedges are embedded jointly.

### Setup

In [1]:
# from src import paths
# from src.data import Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.feature_extraction.text
import sklearn.preprocessing
import scipy.sparse
import vectorizers
import vectorizers.transformers
import umap
import pynndescent
import seaborn as sns
from scipy.sparse import vstack
import warnings
      
warnings.simplefilter("ignore")
sns.set()

We will need some data to work with. For the purposes of this demo we use a recipes dataset, loaded by `read_format_recipes()` from `00-data-setup.py`. Each recipe is a list of ingredients, so we treat recipes as hyperedges and ingredients as vertices. The loader also returns the recipe labels and a custom color palette (`color_key`).

In [72]:
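# execfile() is a Python 2 builtin; exec() on the file contents is the Python 3 equivalent
exec(open('./00-data-setup.py').read())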
recipes, recipes_label_id, ingredients_id, label_name, color_key = read_format_recipes()
recipes_label = [label_name.loc[i]['new_label'] for i in recipes_label_id.label]
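
To make the hypergraph reading concrete, here is a minimal sketch on made-up data. It assumes `recipes` is a list of ingredient lists (which is what the metadata table further down suggests); `toy_recipes` and the other names are hypothetical and not part of the notebook.

```python
from collections import Counter

# Hypothetical toy data: each recipe (hyperedge) is a list of ingredients (vertices).
toy_recipes = [
    ["lime", "sugar", "cachaca"],
    ["lime", "cilantro", "chicken"],
    ["sugar", "flour", "eggs"],
]

# Hyperedge sizes: number of ingredients in each recipe.
edge_sizes = [len(recipe) for recipe in toy_recipes]

# Vertex degrees: number of recipes each ingredient appears in.
vertex_degree = Counter(ing for recipe in toy_recipes for ing in set(recipe))

print(edge_sizes)
print(dict(vertex_degree))
```

On the real data the same two quantities give a quick sanity check of how large recipes are and how often each ingredient is reused.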

# Build vertex vectors

In [73]:
%%time
long_list = max(len(x) for x in recipes)
ingredient_vectorizer = vectorizers.TokenCooccurrenceVectorizer(
    min_document_occurrences=5,
    window_radii=long_list,
    window_functions='fixed',
    kernel_functions='flat',
    n_iter=0,
    normalize_windows=True,
).fit(recipes)
ingredient_vectors = ingredient_vectorizer.reduce_dimension(dimension=60, algorithm="randomized")

CPU times: user 4.77 s, sys: 164 ms, total: 4.94 s
Wall time: 4.01 s
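
Because the window radius is set to the length of the longest recipe, our reading is that any two ingredients in the same recipe count as co-occurring. The sketch below illustrates that idea with plain NumPy on hypothetical toy data: count co-occurrences, row-normalize, and reduce the dimension with an SVD. It is a simplified stand-in, not the `TokenCooccurrenceVectorizer` implementation, and all names are made up for illustration.

```python
import numpy as np

# Hypothetical toy recipes; not the notebook's data.
toy_recipes = [
    ["lime", "sugar", "cachaca"],
    ["lime", "cilantro", "chicken"],
    ["sugar", "flour", "eggs"],
]

vocab = sorted({ing for recipe in toy_recipes for ing in recipe})
index = {ing: i for i, ing in enumerate(vocab)}

# Count every ordered pair of distinct ingredients that share a recipe.
cooc = np.zeros((len(vocab), len(vocab)))
for recipe in toy_recipes:
    ids = [index[ing] for ing in recipe]
    for i in ids:
        for j in ids:
            if i != j:
                cooc[i, j] += 1.0

# Row-normalize so each ingredient is described by a distribution over its neighbours.
row_sums = cooc.sum(axis=1, keepdims=True)
cooc_norm = cooc / np.maximum(row_sums, 1.0)

# Reduce to a low-dimensional vector per ingredient with a truncated SVD.
u, s, _ = np.linalg.svd(cooc_norm, full_matrices=False)
dim = 2
toy_vectors = u[:, :dim] * s[:dim]
print(toy_vectors.shape)
```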


# Build hyperedge vectors

In [74]:
n_recipes = len(recipes)
n_ingredients = len(ingredient_vectorizer.token_index_dictionary_)

These operations are relatively quick because it is a matter of counting things and applying standard linear algebra techniques. This produces serviceable ingredient vectors that are trained specifically on our corpus and capture how ingredients are used together in this recipe dataset.

In [75]:
%%time
doc_matrix_vectorizer = vectorizers.NgramVectorizer(
    token_dictionary=ingredient_vectorizer.token_label_dictionary_
).fit(recipes)

doc_matrix = doc_matrix_vectorizer.transform(recipes)

CPU times: user 7.99 s, sys: 224 ms, total: 8.22 s
Wall time: 2.48 s
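
The matrix built here can be read as the incidence matrix of the hypergraph: one row per recipe, one column per ingredient. A minimal dense sketch of that structure follows, assuming a token-to-column dictionary plays the role of `token_label_dictionary_`. The toy data and names are hypothetical, and the notebook's matrix is presumably sparse since it is later stacked with `scipy.sparse`.

```python
import numpy as np

# Hypothetical toy recipes and vocabulary.
toy_recipes = [
    ["lime", "sugar", "cachaca"],
    ["lime", "cilantro", "saffron"],  # "saffron" is deliberately out of vocabulary
]
token_to_column = {"cachaca": 0, "cilantro": 1, "lime": 2, "sugar": 3}

# One row per recipe (hyperedge), one column per ingredient (vertex).
toy_doc_matrix = np.zeros((len(toy_recipes), len(token_to_column)), dtype=int)
for row, recipe in enumerate(toy_recipes):
    for ing in recipe:
        col = token_to_column.get(ing)
        if col is not None:  # ingredients outside the vocabulary are skipped
            toy_doc_matrix[row, col] += 1

print(toy_doc_matrix)
```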


In [76]:
%%time
unsupervised_info_doc_matrix = vectorizers.transformers.InformationWeightTransformer(
    prior_strength=1e-1,
    approx_prior=False,
).fit_transform(doc_matrix)

CPU times: user 440 ms, sys: 0 ns, total: 440 ms
Wall time: 439 ms
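
The intent of this step, as we understand it, is to down-weight ingredients that appear almost everywhere and up-weight the more distinctive ones. The sketch below uses a smoothed IDF weighting as a simpler stand-in to show the effect. It is not the formula used by `InformationWeightTransformer`, and the toy counts are hypothetical.

```python
import numpy as np

# Hypothetical recipe-by-ingredient counts: column 0 is in every recipe,
# column 2 appears in only one.
counts = np.array(
    [
        [1, 1, 0],
        [1, 0, 1],
        [1, 0, 0],
        [1, 1, 0],
    ],
    dtype=float,
)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (stand-in for an information weight).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Scale each column by its weight; zero entries stay zero.
weighted = counts * idf
print(idf)
```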


# Map both into the same space

### Stack recipes and ingredients

In [77]:
info_doc_with_identity = vstack([unsupervised_info_doc_matrix, scipy.sparse.identity(n_ingredients)])
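
Stacking an identity matrix under the recipe matrix adds one extra row per ingredient, each containing only that ingredient. Every ingredient then passes through the same embedding step as the recipes. A small sketch with hypothetical toy values shows the resulting shape and layout:

```python
import numpy as np
import scipy.sparse

# Hypothetical weighted recipe-by-ingredient matrix: 2 recipes, 3 ingredients.
toy_doc = scipy.sparse.csr_matrix(
    np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
)
n_ing = toy_doc.shape[1]

# Recipes on top, one single-ingredient row per ingredient underneath.
stacked = scipy.sparse.vstack([toy_doc, scipy.sparse.identity(n_ing)]).tocsr()

print(stacked.shape)  # (5, 3)
print(stacked.toarray())
```

The first `n_recipes` rows of the stacked matrix are recipes and the last `n_ingredients` rows are ingredients, which is the ordering the boolean masks later in the notebook rely on.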

In [78]:
%%time
joint_vectors_unsupervised = vectorizers.ApproximateWassersteinVectorizer(
    normalization_power=0.66,
    random_state=42,
).fit_transform(info_doc_with_identity, vectors=ingredient_vectors)

CPU times: user 8.85 s, sys: 1.07 s, total: 9.92 s
Wall time: 348 ms
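
To build intuition for why recipes and ingredients end up in one space, the sketch below replaces the Wasserstein-based embedding with a plain weighted average of ingredient vectors. This is a deliberately simpler stand-in, not what `ApproximateWassersteinVectorizer` computes, and the toy values are hypothetical. It does show the key property of the stacked input: a row containing a single ingredient is placed according to that ingredient's own vector.

```python
import numpy as np

# Hypothetical 2-d ingredient vectors (one row per ingredient).
ingredient_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Stacked matrix: two recipe rows followed by three identity (ingredient) rows.
stacked = np.array(
    [
        [1.0, 1.0, 0.0],  # recipe using ingredients 0 and 1
        [0.0, 1.0, 1.0],  # recipe using ingredients 1 and 2
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
)

# Treat each row as a distribution over ingredients and average their vectors.
weights = stacked / stacked.sum(axis=1, keepdims=True)
joint = weights @ ingredient_vecs

print(joint)
```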


In [79]:
%%time
joint_vectors_mapper = umap.UMAP(metric="cosine", random_state=42).fit(joint_vectors_unsupervised)

CPU times: user 1min 12s, sys: 1min 2s, total: 2min 15s
Wall time: 26.6 s


# ThisNotThat: explore the hypergraph

In [80]:
import thisnotthat as tnt
import panel as pn

In [81]:
pn.extension()

### Build a dataframe that contains information about vertices and hyperedges

In [82]:
recipes_bool = np.array([True for i in range(n_recipes)] + [False for i in range(n_ingredients)])
ingredients_bool = ~recipes_bool
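
These two masks split any array that follows the stacked ordering (recipes first, ingredients second). A minimal sketch with hypothetical sizes shows an equivalent way to build the masks with NumPy and how they slice an embedding:

```python
import numpy as np

# Hypothetical sizes; in the notebook these are n_recipes and n_ingredients.
n_recipes_toy, n_ingredients_toy = 3, 2

# True for the first n_recipes_toy rows, False for the ingredient rows.
is_recipe = np.arange(n_recipes_toy + n_ingredients_toy) < n_recipes_toy
is_ingredient = ~is_recipe

# Toy 2-d embedding with one row per stacked item.
embedding = np.arange(10.0).reshape(5, 2)
recipe_rows = embedding[is_recipe]
ingredient_rows = embedding[is_ingredient]

print(recipe_rows.shape, ingredient_rows.shape)
```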

In [83]:
recipe_metadata_all = pd.DataFrame()
recipe_metadata_all['Type'] = (
                    ['Recipes' for i in range(n_recipes)] 
                       + 
                    ['Ingredients' for x in range(n_ingredients)]
                  )
recipe_metadata_all['Description'] = (
                    [label_name.loc[i]['country'] for i in recipes_label_id.label] 
                       + 
                    [ingredient_vectorizer.token_index_dictionary_[x] for x in range(n_ingredients)]
                  )
recipe_metadata_all['Label'] = (
                    [label_name.loc[i]['new_label'] for i in recipes_label_id.label] 
                       + 
                    ['ingredient' for x in range(n_ingredients)]
                  )
recipe_metadata_all['Ingredients'] = (
                    recipes 
                       + 
                    [ingredient_vectorizer.token_index_dictionary_[x] for x in range(n_ingredients)]
                  )
recipe_metadata_all['Recipe_size'] = (
                    [len(x) for x in recipes] 
                       + 
                    [1 for x in range(n_ingredients)]
                  )

recipe_metadata_all['Ingredients_for_markdown'] = (
                    [[f'* {x}\n' for x in one_recipe] for one_recipe in recipes] 
                        +
                    [ingredient_vectorizer.token_index_dictionary_[x] for x in range(n_ingredients)]
                  )

In [84]:
recipe_metadata_all

| | Type | Description | Label | Ingredients | Recipe_size | Ingredients_for_markdown |
|---|---|---|---|---|---|---|
| 0 | Recipes | brazilian | american.brazilian | [crushed ice, cachaca, lime, superfine sugar] | 4 | [* crushed ice\n, * cachaca\n, * lime\n, * sup... |
| 1 | Recipes | mexican | american.mexican | [cooked chicken, cilantro leaves, fresh lime j... | 15 | [* cooked chicken\n, * cilantro leaves\n, * fr... |
| 2 | Recipes | southern_us | american.southern_us | [buttermilk, okra, large eggs, all-purpose flo... | 7 | [* buttermilk\n, * okra\n, * large eggs\n, * a... |
| 3 | Recipes | chinese | asian.chinese | [fish sauce, steamed white rice, scallions, mi... | 12 | [* fish sauce\n, * steamed white rice\n, * sca... |
| 4 | Recipes | filipino | asian.filipino | [bell pepper, ginger, shrimp, fish sauce, vege... | 12 | [* bell pepper\n, * ginger\n, * shrimp\n, * fi... |
| ... | ... | ... | ... | ... | ... | ... |
| 42885 | Ingredients | zest | ingredient | zest | 1 | zest |
| 42886 | Ingredients | zesty italian dressing | ingredient | zesty italian dressing | 1 | zesty italian dressing |
| 42887 | Ingredients | zinfandel | ingredient | zinfandel | 1 | zinfandel |
| 42888 | Ingredients | ziti | ingredient | ziti | 1 | ziti |


In [85]:
# w_plot = recipes_bool | ingredients_bool
w_plot = recipes_bool

In [86]:
recipe_metadata = recipe_metadata_all[w_plot]
recipe_umap = joint_vectors_mapper.embedding_[w_plot]
# recipe_vocab_use_vectors = joint_vectors_unsupervised[w_plot]

In [87]:
layer_metadata = recipe_metadata[['Type', 'Label', 'Recipe_size']].copy()

In [88]:
color_mapping = color_key.copy()
del color_mapping['ingredient']

sizes = [np.sqrt(len(x)) / 100 for x in recipes]

In [89]:
markdown_template = """## Recipe from {Label}
---
#### Ingredients

{Ingredients_for_markdown}

---
"""
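
The template is an ordinary Python format string whose fields match the dataframe column names. How ThisNotThat would fill it is not shown in this notebook, so the sketch below only uses `str.format` on a hypothetical row. It also shows why joining the `Ingredients_for_markdown` list first may be preferable: formatting the list directly inserts its Python repr rather than a bullet list.

```python
# Copy of the template above, under a hypothetical name.
toy_template = """## Recipe from {Label}
---
#### Ingredients

{Ingredients_for_markdown}

---
"""

# Hypothetical row shaped like one recipe row of the metadata dataframe.
row = {
    "Label": "american.brazilian",
    "Ingredients_for_markdown": ["* crushed ice\n", "* cachaca\n"],
}

# Formatting the list directly inserts its repr, brackets and quotes included.
as_list = toy_template.format(**row)

# Joining first gives a proper markdown bullet list.
as_text = toy_template.format(
    Label=row["Label"],
    Ingredients_for_markdown="".join(row["Ingredients_for_markdown"]),
)
print(as_text)
```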

In [149]:
%%time
bokeh_plot = tnt.BokehPlotPane(
    recipe_umap,
    labels=recipe_metadata.Label,
    hover_text=recipe_metadata.Description,
    legend_location='outside',
    marker_size=sizes,
    label_color_mapping=color_mapping,
    show_legend=False,
    min_point_size=0.001,
    max_point_size=0.05,
    title="What is cooking? Data Map",
)

CPU times: user 6.8 s, sys: 180 ms, total: 6.98 s
Wall time: 322 ms


In [171]:
legend = tnt.LegendWidget(
    recipe_metadata.Label,
    factors=list(color_mapping.keys()), 
    palette=list(color_mapping.values()), 
    palette_length=len(color_mapping),
    color_picker_height=16,
    color_picker_margin=[0,0],
    label_height=30,
    label_width=150,
    name="Legend",
    selectable=True,
)

In [176]:
legend.link(bokeh_plot, selected="selected")

Watcher(inst=LegendWidget(label_color_factors=['asian.chinese', ...], label_color_palette=['#dbe9f6', '#bad6eb', ...], labels=0          american.brazil..., name='Legend', selected=[4, 37, 173, 186, ...]), cls=<class 'thisnotthat.label_editor.LegendWidget'>, fn=<function Reactive.link.<locals>.link_cb at 0x7f994850a050>, mode='args', onlychanged=True, parameter_names=('selected',), what='value', queued=False, precedence=0)

In [177]:
pn.Row(bokeh_plot, legend)

### Search capability
* search_pane

In [150]:
search_pane = tnt.SearchWidget(recipe_metadata, width=400, title="Advanced Search")
search_pane.link_to_plot(bokeh_plot)

Watcher(inst=SearchWidget(data=          Type  Descriptio..., name='Search'), cls=<class 'thisnotthat.search.SearchWidget'>, fn=<function Reactive.link.<locals>.link_cb at 0x7f9ad05051b0>, mode='args', onlychanged=True, parameter_names=('selected',), what='value', queued=False, precedence=0)

In [152]:
# pn.Row(bokeh_plot, search_pane)

### Summarize selection
* count_summary : count how many things we select
* vertex_summary_pane : list the vertices embedded close by

In [153]:
from thisnotthat.summary.dataframe import JointLabelSummarizer, CountSelectedSummarizer
count_summary = tnt.DataSummaryPane(CountSelectedSummarizer(),sizing_mode = "stretch_width")
count_summary.link_to_plot(bokeh_plot)

Watcher(inst=DataSummaryPane(name='Summary'), cls=<class 'thisnotthat.summary.dataframe.DataSummaryPane'>, fn=<function Reactive.link.<locals>.link_cb at 0x7f9ad0506200>, mode='args', onlychanged=True, parameter_names=('selected',), what='value', queued=False, precedence=0)

In [156]:
vocab = list(recipe_metadata_all.Description[ingredients_bool])
word_summary = JointLabelSummarizer(joint_vectors_unsupervised[recipes_bool],
                                    vocab, 
                                    joint_vectors_unsupervised[ingredients_bool])
vertex_summary_pane = tnt.DataSummaryPane(word_summary)
vertex_summary_pane.link_to_plot(bokeh_plot)

Watcher(inst=DataSummaryPane(name='Summary'), cls=<class 'thisnotthat.summary.dataframe.DataSummaryPane'>, fn=<function Reactive.link.<locals>.link_cb at 0x7f9ad05070a0>, mode='args', onlychanged=True, parameter_names=('selected',), what='value', queued=False, precedence=0)

In [158]:
#pn.Row(bokeh_plot, count_summary)
#pn.Row(bokeh_plot, pn.Column(count_summary, vertex_summary_pane))

### Add automatic labels

In [166]:
label_layers = tnt.JointVectorLabelLayers(
    joint_vectors_unsupervised[recipes_bool],        # high dim edge embedding
    recipe_umap,                                     # 2-d edge embedding
    joint_vectors_unsupervised[ingredients_bool],    # high dim vertex embedding
    doc_matrix_vectorizer.column_index_dictionary_,  # vertex name
    cluster_map_representation=True,
    min_clusters_in_layer=5,
    random_state=0,
)

In [168]:
annotated_plot = tnt.BokehPlotPane(
    recipe_umap,
    labels=recipe_metadata.Label,
    hover_text=recipe_metadata.Description,
    marker_size=sizes,
    label_color_mapping=color_mapping,
    width=700,
    height=600,
    show_legend=False,
    min_point_size=0.001,
    max_point_size=0.05,
    tools="pan,wheel_zoom,tap,lasso_select,box_zoom,save,reset",
    title="What is cooking? Data Map",
)
annotated_plot.add_cluster_labels(label_layers, max_text_size=24)

In [180]:
count_summary.link_to_plot(annotated_plot)
vertex_summary_pane.link_to_plot(annotated_plot)
search_pane.link_to_plot(annotated_plot)
legend.link_to_plot(annotated_plot)

Watcher(inst=LegendWidget(label_color_factors=['asian.chinese', ...], label_color_palette=['#dbe9f6', '#bad6eb', ...], labels=0          american.brazil..., name='Legend'), cls=<class 'thisnotthat.label_editor.LegendWidget'>, fn=<function Reactive.link.<locals>.link_cb at 0x7f9908aa3760>, mode='args', onlychanged=True, parameter_names=('labels', 'label_color_factors', 'label_color_palette'), what='value', queued=False, precedence=0)

In [181]:
pn.Row(
    annotated_plot,
    legend,
    pn.Tabs(
        pn.Column(count_summary, vertex_summary_pane, name='Selection'),
        search_pane,
    ),
)


In [139]:
import networkx as nx
import gravis as gv

In [163]:
selected_edges = [recipes[x] for x in annotated_plot.selected]

In [165]:
#selected_edges
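
The notebook stops after collecting the selected hyperedges. One possible next step, sketched here on hypothetical data, is to turn them into a bipartite edge list with recipe nodes on one side and ingredient nodes on the other. Pairs of this form can be passed to `networkx.Graph.add_edges_from`. The node naming scheme and the toy selection are assumptions for illustration.

```python
# Hypothetical stand-in for selected_edges: a list of ingredient lists.
toy_selected_edges = [
    ["lime", "sugar", "cachaca"],
    ["lime", "cilantro"],
]

# One (recipe node, ingredient node) pair per membership.
bipartite_edges = [
    (f"recipe_{k}", ing)
    for k, recipe in enumerate(toy_selected_edges)
    for ing in recipe
]

# Ingredients shared by more than one selected recipe.
adjacency = {}
for recipe_node, ing in bipartite_edges:
    adjacency.setdefault(ing, set()).add(recipe_node)
shared = sorted(ing for ing, recipe_nodes in adjacency.items() if len(recipe_nodes) > 1)

print(shared)  # ['lime']
```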