# Hypergraph exploration - recipes example

The goal of this notebook is to use the tools in the Tutte Institute [``vectorizers``](https://github.com/TutteInstitute/vectorizers) and [``thisnotthat``](https://github.com/TutteInstitute/thisnotthat) libraries to construct hypergraph embeddings. We will jointly embed vertices and hyperedges in the same space and use this common space to guide hypergraph exploration.

We will make use of a recipe dataset. After filtering out recipes with two or fewer ingredients (see data-setup), the data consists of 39,559 recipes (hyperedges) and 6,714 ingredients (vertices). The largest recipe has 65 ingredients (must be good!). Each recipe is assigned to a country (edge label), with 20 countries in total. The data and some work done with it can be found here:

* https://arxiv.org/pdf/1910.09943.pdf
* https://www.cs.cornell.edu/~arb/data/cat-edge-Cooking/

### Setup

To create the environment:

* mamba env create -f hypergraphs-simple.yml

or

* conda create -n hypergraphs-simple numba datashader jupyter ipykernel
* conda activate hypergraphs-simple
* pip install thisnotthat seaborn

In [1]:
import thisnotthat as tnt
import panel as pn

import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import csv

In [2]:
import scipy.sparse
import vectorizers
import vectorizers.transformers
import umap
from scipy.sparse import vstack
import warnings
      
warnings.simplefilter("ignore")
sns.set()

In [3]:
from IPython.display import display, HTML 
display(HTML("<style>.container { width:100% !important; }</style>"))

In [4]:
pn.extension()

# Data preparation

Before filtering, the recipe dataset consists of 39,774 recipes (hyperedges), each of which is a set of vertices (6,714 ingredients in total). Each recipe is assigned to a country (edge label), with 20 countries in total. The data sources are listed at the top of this notebook.

The function below:
* reads the data
* keeps only the recipes containing at least 3 ingredients (after this pruning we are left with 6,714 ingredients and 39,559 recipes)
* chooses a country color mapping that respects geographic proximity: countries that are close to each other, or on the same continent, are assigned similar colors. This makes the visualization easier to evaluate by eye and more pleasant to look at.

In [5]:
data_folder = '../data/cat-edge-Cooking/'

In [6]:
def read_format_recipes(recipe_min_size=3, data_folder=data_folder):
    ingredients_id = pd.read_csv(f'{data_folder}node-labels.txt', sep='\t', header=None)
    ingredients_id.index = [x+1 for x in ingredients_id.index]
    ingredients_id.columns = ['Ingredient']
    
    recipes_with_id = []
    with open(f'{data_folder}hyperedges.txt', newline = '') as hyperedges:
        hyperedge_reader = csv.reader(hyperedges, delimiter='\t')
        for hyperedge in hyperedge_reader:
            recipes_with_id.append(hyperedge)
            
    recipes_all = [[ingredients_id.loc[int(i)]['Ingredient'] for i in x] for x in recipes_with_id]
    
    # Keep only the recipes with at least recipe_min_size ingredients
    keep_recipes = np.where(np.array([len(x) for x in recipes_all])>=recipe_min_size)[0]
    recipes = [recipes_all[i] for i in keep_recipes]
    
    recipes_label_id_all = pd.read_csv(f'{data_folder}hyperedge-labels.txt', sep='\t', header=None)
    recipes_label_id_all.columns = ['label']
    recipes_label_id = recipes_label_id_all.iloc[keep_recipes].reset_index()

    label_name = pd.read_csv(f'{data_folder}hyperedge-label-identities.txt', sep='\t', header=None)
    label_name.columns = ['country']
    label_name.index = [x+1 for x in label_name.index]
    
    grps_tmp = {
        'asian' : ('chinese', 'filipino', 'japanese','korean', 'thai', 'vietnamese'),
        'american' : ('brazilian', 'mexican', 'southern_us'),
        'english' : ('british', 'irish'),
        'islands' : ('cajun_creole', 'jamaican'),
        'europe' : ('french', 'italian', 'spanish'),
        'others' : ('greek', 'indian', 'moroccan', 'russian')
    }

    grps = {key:[key+'.'+x for x in value] for key, value in grps_tmp.items()}


    color_key = {}
    for l, c in zip(grps['asian'], sns.color_palette("Blues", 6)[0:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['american'], sns.color_palette("Purples", 4)[1:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['others'], sns.color_palette("YlOrRd", 4)):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['europe'], sns.color_palette("light:teal", 4)[1:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['islands'], sns.color_palette("light:#660033", 4)[1:3]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    for l, c in zip(grps['english'], sns.color_palette("YlGn", 4)[1:]):
        color_key[l] = matplotlib.colors.rgb2hex(c)
    color_key["ingredient"] = "#777777bb"
    
    new_names = []
    for key, value in grps.items():
        new_names = new_names + value

    label_name['new_label'] = [new_name for x in label_name.country for new_name in new_names if x in new_name]
    
    return(recipes, recipes_label_id, ingredients_id, label_name, color_key)

In [7]:
# execfile('./00-recipes-setup.py')
recipes, recipes_label_id, ingredients_id, label_name, color_key = read_format_recipes()
recipes_label = [label_name.loc[i]['new_label'] for i in recipes_label_id.label]
recipes_country = [label_name.loc[i]['country'] for i in recipes_label_id.label] 

# Build vertex (ingredient) vectors

We first build a vector representation of ingredients based on co-occurrences of vertices in the same hyperedges. There is a small hack here: we build the co-occurrence matrix ourselves. The co-occurrence vectorizer currently in the vectorizers library assumes ordered hyperedges, and so relies on concepts such as "appears before" or "appears after" that we wish to avoid.

The vector representations of the vertices are the rows of the weighted adjacency matrix of the hypergraph's 2-section.
From the incidence matrix $H$ (number of hyperedges x number of vertices, the orientation used in the code), this can be obtained via $V_{vectors} = H^TH - D_v$ where $D_v$ is the diagonal matrix that contains the vertex degrees (the number of hyperedges containing each vertex). In our experiments, we obtain better results by row and column normalizing this vertex representation and reducing its dimensionality using SVD. Experiments on documents also show improvements from taking the entry-wise $4^{th}$ root of the normalized matrices before reducing dimensionality. We do the same here because it works well, but it has not been tested on hypergraphs.
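
As a minimal sketch of the 2-section construction, the toy example below builds a small incidence matrix by hand (one row per hyperedge, one column per vertex) and removes the diagonal of $H^TH$. The vertex names and hyperedges are made up for illustration, and the normalization and SVD steps are left out.

```python
import numpy as np
import scipy.sparse as sp

# Toy hypergraph: 3 hyperedges over 4 vertices (illustrative data only).
vertices = ["salt", "sugar", "flour", "eggs"]
hyperedges = [
    ["salt", "flour", "eggs"],
    ["sugar", "flour", "eggs"],
    ["salt", "sugar", "flour"],
]

index = {v: i for i, v in enumerate(vertices)}
rows = [e for e, edge in enumerate(hyperedges) for _ in edge]
cols = [index[v] for edge in hyperedges for v in edge]

# Incidence matrix with one row per hyperedge and one column per vertex.
H = sp.csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(hyperedges), len(vertices)),
)

# (H^T H)[u, v] counts the hyperedges containing both u and v.
# The diagonal holds the vertex degrees, which we zero out.
cooccurrence = (H.T @ H).tolil()
cooccurrence.setdiag(0)
cooccurrence = cooccurrence.tocsr()
cooccurrence.eliminate_zeros()

dense = cooccurrence.toarray()
print(dense)
```

Each row of `dense` is the raw co-occurrence vector of one vertex; for instance, the row for `flour` records that it shares two hyperedges with each of the other three ingredients.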

In [8]:
def vertexCooccurrenceVectorizer(hyperedges):
    vertexCooccurrence_vectorizer = vectorizers.TokenCooccurrenceVectorizer().fit(hyperedges)
    
    incidence_vectorizer = vectorizers.NgramVectorizer(
        token_dictionary=vertexCooccurrence_vectorizer.token_label_dictionary_
    ).fit(hyperedges)

    H = incidence_vectorizer.transform(hyperedges)
    
    M_cooccurrence = (H.T@H)
    M_cooccurrence.setdiag(0)
    M_cooccurrence.eliminate_zeros()
    
    vertexCooccurrence_vectorizer.cooccurrences_ = M_cooccurrence
    
    return(vertexCooccurrence_vectorizer)

In [9]:
%%time

ingredient_vectorizer = vertexCooccurrenceVectorizer(recipes)
ingredient_vectors = ingredient_vectorizer.reduce_dimension(dimension=60, algorithm="randomized")
n_ingredients = len(ingredient_vectorizer.token_index_dictionary_)
ingredients = [ingredient_vectorizer.token_index_dictionary_[i] for i in range(n_ingredients)]

CPU times: user 17.3 s, sys: 3.92 s, total: 21.2 s
Wall time: 13.8 s


# Plot ingredients and explore (with This not that)

In [10]:
ingredient_mapper = umap.UMAP(metric="cosine", random_state=42).fit(ingredient_vectors)

In [11]:
ingredient_label_layers =  tnt.JointVectorLabelLayers(
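    ingredient_vectors,            # high dim vectors of the plotted points (here: ingredients)
    ingredient_mapper.embedding_,  # 2-d map of the plotted points
    ingredient_vectors,            # high dim vectors used for labelling (here: ingredients)
    ingredients,                   # label names (here: ingredient names)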
    cluster_map_representation=True,
    min_clusters_in_layer=5,
    random_state=0,
)

In [12]:
annotated_ingredient_plot = tnt.BokehPlotPane(
    ingredient_mapper.embedding_,
    hover_text=ingredients,
    marker_size=0.03,
    width=700,
    height=600,
    show_legend=False,
    min_point_size=0.001,
    max_point_size=0.05,
    tools="pan,wheel_zoom,tap,lasso_select,box_zoom,save,reset",
    title="What is cooking? Ingredient Map",
)
# annotated_ingredient_plot.add_cluster_labels(ingredient_label_layers, max_text_size=24)

In [13]:
pn.Row(annotated_ingredient_plot)

# Build hyperedge (recipe) vectors

In order to build hyperedge vectors, we treat each hyperedge as a distribution over the vertex space. Instead of considering flat distributions over the vertices contained in each hyperedge, we use a distribution given by the information-weighted incidence matrix. In this incidence matrix, the entries of each vertex are multiplied by a weight: the information gain of the vertex's observed distribution over hyperedges
$$P_v(e) = \begin{cases}
1/deg(v) &\text{if $v \in e$}\\
0 &\text{otherwise}
\end{cases}
$$  compared to a baseline distribution based on hyperedge sizes $Q(e) = \frac{|e|}{\sum_{e'} |e'|}.$ The weights are computed as follows:

$$\mbox{Info}(v) = \sum_e P_v(e) \log\Big(\frac{P_v(e)}{Q(e)}\Big)= C - \frac{1}{deg(v)}\sum_{e: v \in e} \log\big(|e|\cdot deg(v)\big),$$ where $C = \log\big(\sum_{e} |e|\big)$ does not depend on $v$.
In the case of hypergraphs, this tends to give more weight to vertices that appear in smaller hyperedges and less weight to vertices that appear in many hyperedges.
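
To make the weight concrete, here is a small numerical sketch on a made-up incidence matrix. It computes the information gain of each vertex directly as a KL divergence, and again through the closed form $\log\big(\sum_e |e|\big) - \frac{1}{deg(v)}\sum_{e: v \in e}\log(|e|\cdot deg(v))$, and the two agree. This is only an illustration of the formula; it is not the implementation of ``InformationWeightTransformer``, which also applies a prior (see ``prior_strength``).

```python
import numpy as np

# Toy incidence matrix: rows are hyperedges, columns are vertices (made-up data).
H = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=float)

edge_sizes = H.sum(axis=1)   # |e| for each hyperedge
degrees = H.sum(axis=0)      # deg(v) for each vertex

Q = edge_sizes / edge_sizes.sum()   # baseline distribution over hyperedges
P = H / degrees                     # column v holds P_v(e)

# Direct KL divergence, summing only over the hyperedges that contain v.
info_direct = np.array([
    sum(P[e, v] * np.log(P[e, v] / Q[e]) for e in range(H.shape[0]) if H[e, v] > 0)
    for v in range(H.shape[1])
])

# Closed form: C minus the mean of log(|e| * deg(v)) over hyperedges containing v.
C = np.log(edge_sizes.sum())
info_closed = np.array([
    C - np.mean(np.log(edge_sizes[H[:, v] > 0] * degrees[v]))
    for v in range(H.shape[1])
])

print(info_direct)
print(info_closed)
```

In this toy example the first vertex, which belongs to every hyperedge, receives the smallest weight, while the last vertex, which belongs to a single hyperedge, receives the largest.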

This is done in two steps: (1) construct the incidence matrix, (2) transform it into its information weighted version.

In [14]:
%%time
incidence_vectorizer = vectorizers.NgramVectorizer(
    token_dictionary=ingredient_vectorizer.token_label_dictionary_
).fit(recipes)

incidence_matrix = incidence_vectorizer.transform(recipes)

CPU times: user 2.55 s, sys: 67.2 ms, total: 2.62 s
Wall time: 2.62 s


In [15]:
%%time
info_weighted_incidence = vectorizers.transformers.InformationWeightTransformer(
    prior_strength=1e-1,
    approx_prior=False,
).fit_transform(incidence_matrix)

CPU times: user 1.49 s, sys: 15.8 ms, total: 1.5 s
Wall time: 1.49 s


# Joint embedding of vertices and hyperedges

We now have a vector representation for each vertex, and we treat hyperedges as distributions over that space. We can also treat vertices themselves as distributions: a vertex is the Dirac delta distribution having all of its mass on itself (its associated vector).

Now that both vertices and hyperedges are seen as distributions over a vector space, we can use the Approximate Wasserstein vectorizer. This vectorizer transforms finite distributions over a metric space into vectors in a linear space such that the Euclidean or cosine distance approximates the Wasserstein distance between the distributions.

This is done by representing the vertex distributions with the identity matrix on the vertices, and stacking the weighted incidence matrix with this identity matrix. We then pass this matrix of both hyperedge and vertex distributions to the vectorizer, along with the vertex vectors.
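
The stacking step can be sketched on toy data as follows. All values are made up, and the weighted mean used here is only a crude stand-in chosen for illustration; it is not what ``ApproximateWassersteinVectorizer`` computes. It does show why the identity rows put vertices in the same space as hyperedges: a Dirac delta row maps back to the vertex's own vector.

```python
import numpy as np
import scipy.sparse as sp

# Made-up data: 2 hyperedges over 3 vertices, with arbitrary positive weights.
weighted_incidence = sp.csr_matrix(np.array([
    [0.5, 1.5, 0.0],
    [0.0, 1.0, 1.0],
]))
vertex_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
n_vertices = vertex_vectors.shape[0]

# One row per distribution: hyperedges first, then one Dirac delta per vertex.
stacked = sp.vstack(
    [weighted_incidence, sp.identity(n_vertices, format="csr")]
).tocsr()

# Crude stand-in for the vectorizer: the weighted mean of the vertex vectors.
row_mass = np.asarray(stacked.sum(axis=1)).ravel()
mean_vectors = (stacked @ vertex_vectors) / row_mass[:, None]

print(stacked.shape)   # hyperedge rows followed by vertex rows
print(mean_vectors)
```

The first two rows of `mean_vectors` describe the hyperedges and the last three reproduce `vertex_vectors`, which is the same row ordering that the boolean masks later in the notebook rely on.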

In [16]:
info_doc_with_identity = vstack([info_weighted_incidence, scipy.sparse.identity(n_ingredients)])

In [17]:
%%time
joint_vectors_unsupervised = vectorizers.ApproximateWassersteinVectorizer(
    normalization_power=0.25,
    random_state=42,
).fit_transform(info_doc_with_identity, vectors=ingredient_vectors)

CPU times: user 10.4 s, sys: 1.05 s, total: 11.4 s
Wall time: 417 ms


In [18]:
%%time
joint_vectors_mapper = umap.UMAP(metric="cosine", random_state=42).fit(joint_vectors_unsupervised)

CPU times: user 1min 9s, sys: 44.9 s, total: 1min 54s
Wall time: 23.5 s


# This not that : explore hypergraph

### Build dataframe that contains information about vertex and hyperedges

In [19]:
recipe_metadata = pd.DataFrame()
recipe_metadata['Country'] = recipes_country
recipe_metadata['Label'] = recipes_label
recipe_metadata['Ingredients'] = recipes 
recipe_metadata['Recipe_size'] = [len(x) for x in recipes] 

In [20]:
recipe_metadata

| | Country | Label | Ingredients | Recipe_size |
|---|---|---|---|---|
| 0 | greek | others.greek | [romaine lettuce, black olives, grape tomatoes... | 9 |
| 1 | southern_us | american.southern_us | [plain flour, ground pepper, salt, tomatoes, g... | 11 |
| 2 | filipino | asian.filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... | 12 |
| 3 | indian | others.indian | [water, vegetable oil, wheat, salt] | 4 |
| 4 | indian | others.indian | [black pepper, shallots, cornflour, cayenne pe... | 20 |
| ... | ... | ... | ... | ... |
| 39554 | irish | english.irish | [light brown sugar, granulated sugar, butter, ... | 12 |
| 39555 | italian | europe.italian | [KRAFT Zesty Italian Dressing, purple onion, b... | 7 |
| 39556 | irish | english.irish | [eggs, citrus fruit, raisins, sourdough starte... | 12 |
| 39557 | chinese | asian.chinese | [boneless chicken skinless thigh, minced garli... | 21 |


### Select the proper vectors to plot

In [21]:
# Just plot the recipes
n_recipes = len(recipes)
recipes_bool = np.array([True for i in range(n_recipes)] + [False for i in range(n_ingredients)])
ingredients_bool = ~recipes_bool
recipe_umap = joint_vectors_mapper.embedding_[recipes_bool]

In [22]:
# Remove the ingredient from the color map as we are not plotting ingredient vectors
color_mapping = color_key.copy()
del color_mapping['ingredient']

In [23]:
# Resize the hyperedge points in terms of the recipe size (this is not very useful in this case as all recipes are of similar sizes)
sizes = [np.sqrt(len(x)) / 100 for x in recipes]

### Add a legend : legend

In [24]:
legend = tnt.LegendWidget(
    recipe_metadata.Label,
    factors=list(color_mapping.keys()), 
    palette=list(color_mapping.values()), 
    palette_length=len(color_mapping),
    color_picker_height=16,
    color_picker_margin=[0,0],
    label_height=30,
    label_width=150,
    name="Legend",
    selectable=True,
)

### Search capability : search_pane
This allows us to search the dataframe rows and have the matching rows selected and displayed on the plot.

In [25]:
search_pane = tnt.SearchWidget(recipe_metadata, width=400, title="Advanced Search")

### Summarize selection : count_summary
Counts the number of points currently selected on the plot.

In [26]:
from thisnotthat.summary.dataframe import JointLabelSummarizer, CountSelectedSummarizer
count_summary = tnt.DataSummaryPane(CountSelectedSummarizer(),sizing_mode = "stretch_width")

### Summarize selection : vertex_summary_pane
This is the first time we use the vertex vectors in the exploration. This summarizer gives us the names of the vertices whose vectors are closest to the centroid of a selection of hyperedges on the plot. This is only possible because the vertices and the hyperedges live in a common space. It lists the names along with their distances to the centroid.
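
The idea behind this summary can be sketched in a few lines of NumPy. The helper `closest_vertices` below is hypothetical and uses made-up 2-d vectors; it is not the implementation of `JointLabelSummarizer`, and the choice of cosine distance is an assumption made to match the metric used for the UMAP maps above.

```python
import numpy as np

def closest_vertices(selected_edge_vectors, vertex_vectors, vertex_names, k=3):
    """Return the k vertex names nearest (cosine distance) to the selection centroid."""
    centroid = selected_edge_vectors.mean(axis=0)
    unit_centroid = centroid / np.linalg.norm(centroid)
    unit_vertices = vertex_vectors / np.linalg.norm(vertex_vectors, axis=1, keepdims=True)
    distances = 1.0 - unit_vertices @ unit_centroid
    order = np.argsort(distances)[:k]
    return [(vertex_names[i], float(distances[i])) for i in order]

# Made-up 2-d vectors standing in for the joint embedding.
names = ["soy sauce", "butter", "ginger"]
vertex_vecs = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.3]])
selected = np.array([[1.0, 0.0], [0.8, 0.2]])   # vectors of the selected hyperedges

summary = closest_vertices(selected, vertex_vecs, names, k=2)
print(summary)
```

With these toy vectors, the selection centroid points mostly along the first axis, so `soy sauce` and `ginger` are returned ahead of `butter`.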

In [27]:
word_summary = JointLabelSummarizer(joint_vectors_unsupervised[recipes_bool],
                                    ingredients, 
                                    joint_vectors_unsupervised[ingredients_bool])
vertex_summary_pane = tnt.DataSummaryPane(word_summary)

### Information on click : info_pane
We will display the ingredient list on click.

In [28]:
markdown_template = """## Recipe from {Label}
---
#### Ingredients

{Ingredients}

---
"""

In [29]:
info_pane = tnt.InformationPane(recipe_metadata, markdown_template, width=400, height=750, sizing_mode="stretch_height")

### Link everything to the plot

In [36]:
%%time
bokeh_plot = tnt.BokehPlotPane(
    recipe_umap,
    labels=recipe_metadata.Label,
    hover_text=recipe_metadata.Country,
    legend_location='outside',
    marker_size=sizes,
    label_color_mapping=color_mapping,
    show_legend=False,
    min_point_size=0.001,
    max_point_size=0.05,
    tools="pan,wheel_zoom,tap,lasso_select,box_zoom,save,reset",
    title="What is cooking? Data Map",
)

CPU times: user 7.18 s, sys: 294 ms, total: 7.48 s
Wall time: 345 ms


In [37]:
count_summary.link_to_plot(bokeh_plot)
vertex_summary_pane.link_to_plot(bokeh_plot)
search_pane.link_to_plot(bokeh_plot)
info_pane.link_to_plot(bokeh_plot)
legend.link_to_plot(bokeh_plot)

In [38]:
pn.Row(
    bokeh_plot,
    legend,
    pn.Tabs(
        pn.Column(count_summary, vertex_summary_pane, name='Selection'),
        search_pane,
        info_pane,
    ),
)