# Thisnotthat for hypergraph exploration

The goal of this notebook is to use the tools in the [Tutte Institute ``thisnotthat`` library](https://github.com/TutteInstitute/thisnotthat) to construct hypergraph embeddings in which vertices and hyperedges are embedded jointly.

We will make use of a recipe dataset. After filtering out recipes having two or fewer ingredients (see the data preparation section), the data consists of 39,559 recipes (hyperedges) and 6,714 ingredients (vertices). The largest recipe has 65 ingredients (must be good!). Each recipe is assigned to a country (edge label), with 20 countries in total. The data and some work done with it can be found here:

* https://arxiv.org/pdf/1910.09943.pdf
* https://www.cs.cornell.edu/~arb/data/cat-edge-Cooking/

### Setup

To create the environment:

* conda create -n tnt_simple numba datashader jupyter ipykernel
* pip install thisnotthat seaborn

In [1]:
import thisnotthat as tnt
import panel as pn

import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import csv

In [22]:
import scipy.sparse
import vectorizers
import vectorizers.transformers
import umap
from scipy.sparse import vstack
import warnings
      
warnings.simplefilter("ignore")
sns.set()

# Data preparation

Before filtering, the recipe dataset consists of 39,774 recipes (hyperedges), each of which is a set of vertices (6,714 ingredients in total). The largest recipe has 65 ingredients. Each recipe is assigned to a country (edge label), with 20 countries in total.

The function below:
* reads the data
* keeps only the recipes containing at least 3 ingredients (after this pruning we are left with 6,714 ingredients and 39,559 recipes)
* chooses a country color mapping that respects geographic (or continental) proximity: nearby countries are assigned similar colors. This helps when evaluating the visualization by eye and makes it more pleasant to look at.

In [50]:
def read_format_recipes(recipe_min_size=3):
    ingredients_id = pd.read_csv('../data/cat-edge-Cooking/node-labels.txt', sep='\t', header=None)
    ingredients_id.index = [x+1 for x in ingredients_id.index]
    ingredients_id.columns = ['Ingredient']
    
    recipes_with_id = []
    with open('../data/cat-edge-Cooking/hyperedges.txt', newline = '') as hyperedges:
        hyperedge_reader = csv.reader(hyperedges, delimiter='\t')
        for hyperedge in hyperedge_reader:
            recipes_with_id.append(hyperedge)
            
    recipes_all = [[ingredients_id.loc[int(i)]['Ingredient'] for i in x] for x in recipes_with_id]
    
    # Keep recipes with at least recipe_min_size ingredients (3 by default)
    keep_recipes = np.where(np.array([len(x) for x in recipes_all])>=recipe_min_size)[0]
    recipes = [recipes_all[i] for i in keep_recipes]
    
    recipes_label_id_all = pd.read_csv('../data/cat-edge-Cooking/hyperedge-labels.txt', sep='\t', header=None)
    recipes_label_id_all.columns = ['label']
    recipes_label_id = recipes_label_id_all.iloc[keep_recipes].reset_index()

    label_name = pd.read_csv('../data/cat-edge-Cooking/hyperedge-label-identities.txt', sep='\t', header=None)
    label_name.columns = ['country']
    label_name.index = [x+1 for x in label_name.index]
    
    grps_tmp = {
        'asian' : ('chinese', 'filipino', 'japanese','korean', 'thai', 'vietnamese'),
        'american' : ('brazilian', 'mexican', 'southern_us'),
        'english' : ('british', 'irish'),
        'islands' : ('cajun_creole', 'jamaican'),
        'europe' : ('french', 'italian', 'spanish'),
        'others' : ('greek', 'indian', 'moroccan', 'russian')
    }

    grps = {key:[key+'.'+x for x in value] for key, value in grps_tmp.items()}


    color_key = {}
    for l, c in zip(grps['asian'], sns.color_palette("Blues", 6)[0:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['american'], sns.color_palette("Purples", 4)[1:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['others'], sns.color_palette("YlOrRd", 4)):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['europe'], sns.color_palette("light:teal", 4)[1:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['islands'], sns.color_palette("light:#660033", 4)[1:3]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['english'], sns.color_palette("YlGn", 4)[1:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    color_key["ingredient"] = "#777777bb"
    
    new_names = []
    for key, value in grps.items():
        new_names = new_names + value

    label_name['new_label'] = [new_name for x in label_name.country for new_name in new_names if x in new_name]
    
    return(recipes, recipes_label_id, ingredients_id, label_name, color_key)
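
The filtering step inside `read_format_recipes` can be sketched on toy data. This is only an illustration of the idea (keep hyperedges with at least `min_size` vertices and keep their labels aligned); the helper name `filter_small_hyperedges` and the toy recipes are hypothetical and not part of the notebook.

```python
# Toy hyperedges (recipes) and their labels; all values are made up.
toy_recipes = [
    ["salt", "water"],                          # too small, dropped
    ["flour", "eggs", "milk"],
    ["rice", "soy sauce", "ginger", "garlic"],
    ["butter"],                                 # too small, dropped
]
toy_labels = ["british", "french", "chinese", "irish"]


def filter_small_hyperedges(hyperedges, labels, min_size=3):
    """Keep hyperedges with at least `min_size` vertices, and their labels."""
    keep = [i for i, edge in enumerate(hyperedges) if len(edge) >= min_size]
    return [hyperedges[i] for i in keep], [labels[i] for i in keep]


kept_recipes, kept_labels = filter_small_hyperedges(toy_recipes, toy_labels)
print(kept_recipes)
print(kept_labels)
```

Filtering recipes and labels with the same index list is what keeps the edge labels aligned with the surviving hyperedges.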

In [51]:
# execfile('./00-recipes-setup.py')
recipes, recipes_label_id, ingredients_id, label_name, color_key = read_format_recipes()
recipes_label = [label_name.loc[i]['new_label'] for i in recipes_label_id.label]
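
The function above builds the `new_label` column by checking whether each country name is a substring of a prefixed name. As a hedged alternative sketch, an explicit dictionary from country to `"<group>.<country>"` gives the same kind of mapping with an exact lookup. The groups below are a shortened, hypothetical subset used only for illustration.

```python
# Hypothetical, shortened version of the grouping used in the notebook.
toy_groups = {
    "asian": ("chinese", "thai"),
    "europe": ("french", "italian"),
    "others": ("greek", "indian"),
}

# Exact lookup table: country name -> "<group>.<country>".
prefixed = {
    country: f"{group}.{country}"
    for group, countries in toy_groups.items()
    for country in countries
}

toy_countries = ["greek", "thai", "italian"]
toy_new_labels = [prefixed[c] for c in toy_countries]
print(toy_new_labels)
```

An exact lookup raises a `KeyError` for an unknown country instead of silently producing a column of the wrong length, which may be easier to debug.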

# Build vertex (ingredient) vectors

We first build a vector representation of ingredients based on co-occurrences of vertices in the same hyperedges. A small hack here: we compute the co-occurrence matrix ourselves, because the co-occurrence vectorizer in the current `vectorizers` library assumes ordered hyperedges and so relies on concepts such as "appears before" or "appears after" that we wish to avoid.

The vector representation used for

In [172]:
def vertexCooccurrenceVectorizer(hyperedges):
    # Fit on the hyperedges passed in (not the global `recipes`) to obtain a
    # vectorizer object whose co-occurrence matrix is overwritten below.
    vertexCooccurrence_vectorizer = vectorizers.TokenCooccurrenceVectorizer().fit(hyperedges)
    
    # Hyperedge-by-vertex incidence matrix
    incidence_vectorizer = vectorizers.NgramVectorizer(
    ).fit(hyperedges)

    H = incidence_vectorizer.transform(hyperedges)
    
    # Unordered vertex co-occurrence counts, with self co-occurrences removed
    M_cooccurrence = (H.T@H)
    M_cooccurrence.setdiag(0)
    M_cooccurrence.eliminate_zeros()
    
    vertexCooccurrence_vectorizer.cooccurrences_ = M_cooccurrence
    
    vertexCooccurrence_vectorizer.column_index_dictionary_ = incidence_vectorizer.column_index_dictionary_
    vertexCooccurrence_vectorizer.column_label_dictionary_ = incidence_vectorizer.column_label_dictionary_
    return(vertexCooccurrence_vectorizer)
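
To make the order-free co-occurrence count concrete, here is a minimal pure-Python sketch on toy hyperedges. For a binary incidence matrix `H` (rows are hyperedges, columns are vertices), the off-diagonal entry `(u, v)` of `H.T @ H` is the number of hyperedges containing both `u` and `v`, which is what the counter below computes. The function name and toy data are hypothetical, and the sketch assumes each vertex is counted at most once per hyperedge.

```python
from collections import Counter
from itertools import combinations

# Toy hyperedges; the ingredient names are made up.
toy_hyperedges = [
    ["salt", "pepper", "eggs"],
    ["salt", "eggs", "flour"],
    ["pepper", "flour"],
]


def unordered_cooccurrence(hyperedges):
    """Count, for each unordered vertex pair, the hyperedges containing both."""
    counts = Counter()
    for edge in hyperedges:
        # set() counts a repeated vertex once; sorted() gives a canonical pair key.
        for u, v in combinations(sorted(set(edge)), 2):
            counts[(u, v)] += 1
    return counts


cooc = unordered_cooccurrence(toy_hyperedges)
for pair, count in sorted(cooc.items()):
    print(pair, count)
```

Because pairs are generated with `combinations`, no vertex is paired with itself, which mirrors the `setdiag(0)` step in the function above.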


In [142]:
ingredient_vectorizer = vectorizers.TokenCooccurrenceVectorizer().fit(recipes)
incidence_vectorizer = vectorizers.NgramVectorizer(
).fit(recipes)

H = incidence_vectorizer.transform(recipes)

ingredient_vectorizer.cooccurrences_ = (H.T@H)

In [173]:
%%time
# long_list = max(len(x) for x in recipes)
# ingredient_vectorizer = vectorizers.TokenCooccurrenceVectorizer(
#     min_document_occurrences=1,
#     window_radii=long_list,          
#     window_functions='fixed',
#     kernel_functions='flat',            
#     n_iter = 0,
#     normalize_windows=False,
# ).fit(recipes)
ingredient_vectorizer = vertexCooccurrenceVectorizer(recipes)
ingredient_vectors = ingredient_vectorizer.reduce_dimension(dimension=60, algorithm="randomized")

CPU times: user 8.49 s, sys: 742 ms, total: 9.23 s
Wall time: 8.3 s


# Build hyperedge vectors

In [174]:
n_recipes = len(recipes)
n_ingredients = len(ingredient_vectorizer.token_index_dictionary_)

These operations are relatively quick because they amount to counting things and applying standard linear algebra techniques. This produces serviceable ingredient vectors that are trained specifically on our recipe data.

In [175]:
%%time
doc_matrix_vectorizer = vectorizers.NgramVectorizer(
    token_dictionary=ingredient_vectorizer.token_label_dictionary_
).fit(recipes)

doc_matrix = doc_matrix_vectorizer.transform(recipes)

CPU times: user 2.56 s, sys: 28.4 ms, total: 2.59 s
Wall time: 2.59 s


In [176]:
%%time
unsupervised_info_doc_matrix = vectorizers.transformers.InformationWeightTransformer(
    prior_strength=1e-1,
    approx_prior=False,
).fit_transform(doc_matrix)

CPU times: user 731 ms, sys: 943 µs, total: 732 ms
Wall time: 727 ms


# Map both in same space

### Stack recipes and ingredients

In [177]:
info_doc_with_identity = vstack([unsupervised_info_doc_matrix, scipy.sparse.identity(n_ingredients)])
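
Stacking an identity block under the recipe matrix treats every ingredient as a one-ingredient "recipe", so vertices and hyperedges pass through the same vectorizer and land in the same space. The sketch below illustrates this with a plain weighted average of ingredient vectors. The weighted average is a deliberately simple stand-in, not what `ApproximateWassersteinVectorizer` computes, and all names and numbers are hypothetical.

```python
# Hypothetical 2-d ingredient vectors.
ingredient_vecs = {"salt": (1.0, 0.0), "sugar": (0.0, 1.0), "flour": (1.0, 1.0)}
vocab = list(ingredient_vecs)

# Recipe rows: one weight per vocabulary entry (toy values).
recipe_rows = [
    [1.0, 0.0, 1.0],  # salt + flour
    [0.0, 2.0, 1.0],  # sugar (weight 2) + flour
]

# Identity rows: each ingredient as a one-ingredient "recipe".
identity_rows = [
    [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i in range(len(vocab))
]

# Same row order as in the notebook: recipes first, then ingredients.
stacked = recipe_rows + identity_rows


def weighted_mean(row):
    """Stand-in embedding: weighted average of the ingredient vectors."""
    total = sum(row)
    return tuple(
        sum(w * ingredient_vecs[name][d] for w, name in zip(row, vocab)) / total
        for d in range(2)
    )


joint = [weighted_mean(row) for row in stacked]
print(joint)
```

In this simplified setting each identity row maps back to its own ingredient vector, while each recipe row lands somewhere between the vectors of its ingredients.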

In [178]:
%%time
joint_vectors_unsupervised = vectorizers.ApproximateWassersteinVectorizer(
    normalization_power=0.66,
    random_state=42,
).fit_transform(info_doc_with_identity, vectors=ingredient_vectors)

CPU times: user 8.6 s, sys: 383 ms, total: 8.98 s
Wall time: 322 ms


In [179]:
%%time
joint_vectors_mapper = umap.UMAP(metric="cosine", random_state=42).fit(joint_vectors_unsupervised)

CPU times: user 1min 7s, sys: 1min 10s, total: 2min 18s
Wall time: 22.8 s


# This not that : explore hypergraph

In [180]:
pn.extension()

### Build a dataframe that contains information about vertices and hyperedges

In [181]:
recipes_bool = np.array([True for i in range(n_recipes)] + [False for i in range(n_ingredients)])
ingredients_bool = ~recipes_bool
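
The two masks only work because their order matches the order in which recipes and ingredients were stacked (recipes first, then ingredients). A small sketch with plain lists, using made-up row names, shows how such masks split the joint rows back into the two kinds of objects.

```python
# Toy sizes and row names; all values are hypothetical.
n_toy_recipes, n_toy_ingredients = 3, 2
toy_rows = ["recipe A", "recipe B", "recipe C", "salt", "eggs"]

# Masks in the same order as the stacked rows: recipes first, then ingredients.
is_recipe = [True] * n_toy_recipes + [False] * n_toy_ingredients
is_ingredient = [not flag for flag in is_recipe]

recipe_rows_only = [row for row, flag in zip(toy_rows, is_recipe) if flag]
ingredient_rows_only = [row for row, flag in zip(toy_rows, is_ingredient) if flag]

print(recipe_rows_only)
print(ingredient_rows_only)
```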

In [182]:
recipe_metadata_all = pd.DataFrame()
recipe_metadata_all['Type'] = (
                    ['Recipes' for i in range(n_recipes)] 
                       + 
                    ['Ingredients' for x in range(n_ingredients)]
                  )
recipe_metadata_all['Description'] = (
                    [label_name.loc[i]['country'] for i in recipes_label_id.label] 
                       + 
                    [ingredient_vectorizer.token_index_dictionary_[x] for x in range(n_ingredients)]
                  )
recipe_metadata_all['Label'] = (
                    [label_name.loc[i]['new_label'] for i in recipes_label_id.label] 
                       + 
                    ['ingredient' for x in range(n_ingredients)]
                  )
recipe_metadata_all['Ingredients'] = (
                    recipes 
                       + 
                    [ingredient_vectorizer.token_index_dictionary_[x] for x in range(n_ingredients)]
                  )
recipe_metadata_all['Recipe_size'] = (
                    [len(x) for x in recipes] 
                       + 
                    [1 for x in range(n_ingredients)]
                  )

In [183]:
recipe_metadata_all

| | Type | Description | Label | Ingredients | Recipe_size |
|---|---|---|---|---|---|
| 0 | Recipes | greek | others.greek | [romaine lettuce, black olives, grape tomatoes... | 9 |
| 1 | Recipes | southern_us | american.southern_us | [plain flour, ground pepper, salt, tomatoes, g... | 11 |
| 2 | Recipes | filipino | asian.filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... | 12 |
| 3 | Recipes | indian | others.indian | [water, vegetable oil, wheat, salt] | 4 |
| 4 | Recipes | indian | others.indian | [black pepper, shallots, cornflour, cayenne pe... | 20 |
| ... | ... | ... | ... | ... | ... |
| 46263 | Ingredients | zesty italian dressing | ingredient | zesty italian dressing | 1 |
| 46264 | Ingredients | zinfandel | ingredient | zinfandel | 1 |
| 46265 | Ingredients | ziti | ingredient | ziti | 1 |
| 46266 | Ingredients | zucchini | ingredient | zucchini | 1 |


In [184]:
# w_plot = recipes_bool | ingredients_bool
w_plot = recipes_bool

In [185]:
recipe_metadata = recipe_metadata_all[w_plot]
recipe_umap = joint_vectors_mapper.embedding_[w_plot]
# recipe_vocab_use_vectors = joint_vectors_unsupervised[w_plot]

In [186]:
layer_metadata = recipe_metadata[['Type', 'Label', 'Recipe_size']].copy()

In [187]:
color_mapping = color_key.copy()
del color_mapping['ingredient']

sizes = [np.sqrt(len(x)) / 100 for x in recipes]
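
One plausible reading of the square root in `sizes` (an assumption, not something stated in the notebook) is that if the marker size behaves like a radius, then the marker area grows linearly with the number of ingredients. The toy computation below checks that ratio; the recipe lengths are made up, and how the plot maps these values into `min_point_size` and `max_point_size` is not covered here.

```python
import math

# Hypothetical recipe sizes (number of ingredients).
toy_recipe_lengths = [3, 12, 65]

# Same formula as in the notebook, using the standard library.
toy_sizes = [math.sqrt(n) / 100 for n in toy_recipe_lengths]

# If size acts like a radius, area scales with size squared:
# a 12-ingredient recipe gets 12 / 3 = 4 times the area of a 3-ingredient one.
area_ratio = (toy_sizes[1] / toy_sizes[0]) ** 2

print(toy_sizes)
print(area_ratio)
```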

In [188]:
markdown_template = """## Recipe from {Label}
---
#### Ingredients

{Ingredients_for_markdown}

---
"""

### Add a legend

Note: to control the ordering of the elements in the legend, the `factors` and `palette` arguments are passed explicitly, in the order of `color_mapping`.

In [189]:
legend = tnt.LegendWidget(
    recipe_metadata.Label,
    factors=list(color_mapping.keys()), 
    palette=list(color_mapping.values()), 
    palette_length=len(color_mapping),
    color_picker_height=16,
    color_picker_margin=[0,0],
    label_height=30,
    label_width=150,
    name="Legend",
    selectable=True,
)

### Search capability
* search_pane

In [190]:
search_pane = tnt.SearchWidget(recipe_metadata, width=400, title="Advanced Search")
# `bokeh_plot` is not defined in this notebook; the search pane is linked
# to `annotated_plot` once that plot has been created.
# search_pane.link_to_plot(bokeh_plot)

In [191]:
# pn.Row(bokeh_plot, search_pane)

### Summarize selection
* count_summary : count how many things we select
* vertex_summary_pane : list vertices embedded close by

In [192]:
from thisnotthat.summary.dataframe import JointLabelSummarizer, CountSelectedSummarizer
count_summary = tnt.DataSummaryPane(CountSelectedSummarizer(),sizing_mode = "stretch_width")
# count_summary.link_to_plot(bokeh_plot)

In [193]:
vocab = list(recipe_metadata_all.Description[ingredients_bool])
word_summary = JointLabelSummarizer(joint_vectors_unsupervised[recipes_bool],
                                    vocab, 
                                    joint_vectors_unsupervised[ingredients_bool])
vertex_summary_pane = tnt.DataSummaryPane(word_summary)
# vertex_summary_pane.link_to_plot(bokeh_plot)
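
As a rough illustration of "listing vertices embedded close by", the sketch below ranks ingredient vectors by cosine similarity to the centroid of the selected recipe vectors. This is only a guess at the general idea, not the implementation of `JointLabelSummarizer`; the vectors, names, and the helper `closest_vertices` are hypothetical.

```python
import math

# Hypothetical 2-d vectors in the joint space.
toy_ingredient_vecs = {
    "soy sauce": (0.9, 0.1),
    "ginger": (0.8, 0.3),
    "butter": (0.1, 0.9),
}
selected_recipe_vecs = [(1.0, 0.2), (0.7, 0.1)]


def cosine(u, v):
    """Cosine similarity between two non-zero 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def closest_vertices(selected, vertex_vecs, k=2):
    """Return the k vertex names most similar to the centroid of `selected`."""
    centroid = tuple(sum(col) / len(selected) for col in zip(*selected))
    ranked = sorted(
        vertex_vecs,
        key=lambda name: cosine(centroid, vertex_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


top = closest_vertices(selected_recipe_vecs, toy_ingredient_vecs)
print(top)
```

Cosine similarity is used here because the UMAP step in this notebook also uses the cosine metric on the joint vectors.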

In [194]:
#pn.Row(bokeh_plot, count_summary)
#pn.Row(bokeh_plot, pn.Column(count_summary, vertex_summary_pane))

### Add automatic labels

In [195]:
label_layers =  tnt.JointVectorLabelLayers(
    joint_vectors_unsupervised[recipes_bool],      # high dim edge embedding
    recipe_umap,                                   # 2-d edge embedding
    joint_vectors_unsupervised[ingredients_bool],  # high dim vertex embedding
    doc_matrix_vectorizer.column_index_dictionary_,# vertex name
    cluster_map_representation=True,
    min_clusters_in_layer=5,
    random_state=0,
)

In [196]:
annotated_plot = tnt.BokehPlotPane(
    recipe_umap,
    labels=recipe_metadata.Label,
    hover_text=recipe_metadata.Description,
    marker_size=sizes,
    label_color_mapping=color_mapping,
    width=700,
    height=600,
    show_legend=False,
    min_point_size=0.001,
    max_point_size=0.05,
    tools="pan,wheel_zoom,tap,lasso_select,box_zoom,save,reset",
    title="What is cooking? Data Map",
)
annotated_plot.add_cluster_labels(label_layers, max_text_size=24)

In [197]:
count_summary.link_to_plot(annotated_plot)
vertex_summary_pane.link_to_plot(annotated_plot)
search_pane.link_to_plot(annotated_plot)
legend.link_to_plot(annotated_plot)

Watcher(inst=LegendWidget(label_color_factors=['asian.chinese', ...], label_color_palette=['#dbe9f6', '#bad6eb', ...], labels=0                others.gr..., name='Legend'), cls=<class 'thisnotthat.label_editor.LegendWidget'>, fn=<function Reactive.link.<locals>.link_cb at 0x7fd63b99cf70>, mode='args', onlychanged=True, parameter_names=('labels', 'label_color_factors', 'label_color_palette'), what='value', queued=False, precedence=0)

In [198]:
pn.Row(annotated_plot, 
       legend,
pn.Tabs(
        pn.Column(count_summary, vertex_summary_pane, name='Selection'),
        search_pane
    )
)

In [139]:
import networkx as nx
import gravis as gv

In [163]:
selected_edges = [recipes[x] for x in annotated_plot.selected]

In [165]:
#selected_edges
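
The `networkx` and `gravis` imports suggest that the selected hyperedges are meant to be drawn as a graph. One common way to do that (an assumption about the intent, not shown in the notebook) is to convert the hyperedges into a bipartite edge list linking each recipe node to its ingredient nodes. The sketch uses toy data and a hypothetical `recipe_<index>` naming scheme.

```python
# Toy stand-in for `selected_edges`; the values are made up.
toy_selected_edges = [
    ["salt", "eggs", "flour"],
    ["salt", "butter"],
]


def to_bipartite_edges(hyperedges):
    """Turn hyperedges into (hyperedge_node, vertex) pairs."""
    pairs = []
    for idx, edge in enumerate(hyperedges):
        edge_node = f"recipe_{idx}"  # hypothetical node name for the hyperedge
        for vertex in edge:
            pairs.append((edge_node, vertex))
    return pairs


bipartite_pairs = to_bipartite_edges(toy_selected_edges)
print(bipartite_pairs)
```

A list of pairs like this can be passed to `networkx.Graph.add_edges_from` to build a graph; shared ingredients such as "salt" then appear as a single node connected to several recipes.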