# Big picture
We are trying to think of what it means to "probabilistically" assign cluster labels. That is, if we cluster a network, some nodes might be "partially" in two clusters. So, we want to think about clustering in a way that assigns a probability distribution to each node, rather than a cluster label.

One of the difficulties with this is that the actual "labels" can be assigned arbitrarily (for instance, if there are two clusters, we could call them "A" and "B", but nothing changes if we swap every "A" for a "B" and vice versa), so the very concept of "cluster membership" can be a little slippery. To address this, we use the coclustering matrix: given an assignment of labels $\mathbf{k} = (k_1,\dots,k_n)$ to the $n$ nodes, the coclustering matrix $C = C(\mathbf{k})$ is defined by $C_{ij} = 1$ if $k_i = k_j$ and $C_{ij} = 0$ otherwise. Unlike the labels themselves, $C$ does not change when the labels are renamed. This won't solve all our problems, but it might give us a way to identify nodes that are "between" clusters.
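
As a minimal sketch of the definition above (the function names `coclustering_matrix` and `average_coclustering` are ours, and the label vectors are made up for illustration), the matrix can be built with NumPy broadcasting. Averaging it over several labelings gives entries between 0 and 1, which is one way a node could show up as "between" clusters:

```python
import numpy as np


def coclustering_matrix(labels):
    """Return C with C[i, j] = 1 if labels[i] == labels[j], else 0."""
    k = np.asarray(labels)
    return (k[:, None] == k[None, :]).astype(int)


def average_coclustering(labelings):
    """Entrywise mean of the coclustering matrices of several labelings."""
    return np.mean([coclustering_matrix(k) for k in labelings], axis=0)


# Swapping the names "A" and "B" leaves the matrix unchanged.
k1 = ["A", "A", "B", "B"]
k2 = ["B", "B", "A", "A"]
assert np.array_equal(coclustering_matrix(k1), coclustering_matrix(k2))

# Hypothetical example: node 2 changes cluster across three runs,
# while nodes 0 and 1 always stay together.
runs = [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
C_bar = average_coclustering(runs)
print(np.round(C_bar[2], 2))  # row 2 is [1/3, 1/3, 1, 2/3, 2/3]
```

Note that the third run uses the opposite label names from the first two; the average is still meaningful because each matrix only records which nodes share a label.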



# Next Steps
1. Define (at least one) quality score for the histogram of average coclustering values of a single node (e.g. the distance between the means of the Gaussian components in a 2-component 1D GMM)
2. For each node, compute its "best friends", i.e. who is it always clustered with?
3. **For a variety of graphs,** compute this quality score for each node, and also the betweenness centrality of those nodes. Compare them (e.g. a scatter plot with betweenness on the x-axis and quality on the y-axis)
    1. For example: Stochastic block models with more blocks, different combinations of parameters, etc.
    2. Try out some of the built-in "graph generators" in networkx, e.g. small world, etc
    3. Try building networks with intentionally "confusing" vertices, similar to what's in the notebook `most recent progress[...]`. For example, take a two-block SBM with 0.9 within-block connectivity and 0 cross-block connectivity, then add a node connected to half the vertices in each block (or to equal numbers of vertices in each block)
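
Some of the ingredients in these steps can be sketched independently. The block below is only an illustration under assumptions: `gmm_mean_gap` is a small hand-rolled EM fit (values assumed to lie in $[0, 1]$, with a variance floor we chose arbitrarily), `best_friends` uses a threshold of 1 for "always clustered with", and `bridged_two_block_sbm` is one possible reading of the "confusing vertex" example. All three names are hypothetical, and the coclustering rows at the bottom are made-up placeholders, not results. How the repeated clusterings are generated is left open here.

```python
import networkx as nx
import numpy as np


def gmm_mean_gap(values, n_iter=200, var_floor=1e-3):
    """Fit a 2-component 1D Gaussian mixture by EM; return |mu_1 - mu_0|."""
    x = np.asarray(values, dtype=float)
    mu = np.array([x.min(), x.max()])  # start the means at the extremes
    var = np.full(2, x.var() + var_floor)
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        dens = weight * np.exp(-((x[:, None] - mu) ** 2) / (2 * var))
        dens = dens / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and (floored) variances
        n_k = resp.sum(axis=0)
        weight = n_k / x.size
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + var_floor
    return float(abs(mu[1] - mu[0]))


def best_friends(C_bar, i, threshold=1.0):
    """Nodes j != i whose average coclustering with node i is >= threshold."""
    row = np.asarray(C_bar)[i]
    return [int(j) for j in np.flatnonzero(row >= threshold) if j != i]


def bridged_two_block_sbm(block_size=20, p_within=0.9, seed=0):
    """Two-block SBM with no cross-block edges, plus one bridging node."""
    probs = [[p_within, 0.0], [0.0, p_within]]
    G = nx.stochastic_block_model([block_size, block_size], probs, seed=seed)
    rng = np.random.default_rng(seed)
    bridge = 2 * block_size  # label of the extra node
    for start in (0, block_size):
        block = np.arange(start, start + block_size)
        # connect the bridge to a random half of this block
        for v in rng.choice(block, size=block_size // 2, replace=False):
            G.add_edge(bridge, int(v))
    return G, bridge


G, bridge = bridged_two_block_sbm()
betweenness = nx.betweenness_centrality(G)
# The bridge is the only route between the blocks, so it should rank first.
print(max(betweenness, key=betweenness.get) == bridge)

# Made-up average coclustering rows (placeholders, not results).
clear_row = np.array([0.0] * 10 + [1.0] * 10)  # node firmly in one cluster
mixed_row = np.linspace(0.4, 0.6, 20)          # node split between clusters
print(gmm_mean_gap(clear_row), gmm_mean_gap(mixed_row))

# Tiny made-up average coclustering matrix: nodes 0 and 1 are always together.
C_demo = np.array([[1, 1, 1 / 3], [1, 1, 1 / 3], [1 / 3, 1 / 3, 1]])
print(best_friends(C_demo, 0), best_friends(C_demo, 2))
```

With real average coclustering rows in place of the placeholders, the per-node `gmm_mean_gap` values could be plotted against `betweenness` to produce the scatter plot described above.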

## Side goals
1. Implement a way of choosing the "best" embedding dimension, instead of always using d = 2. We can still visualize just the first two dimensions if we want (which we often will).
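
One simple candidate, sketched here under the assumption that the embedding is a spectral embedding of the adjacency matrix, is the largest-gap ("elbow") heuristic on the singular values. This is only one of several possible criteria, not necessarily the one we will end up using; the function name `choose_dimension_by_gap` and the `max_dim` cap are ours.

```python
import numpy as np


def choose_dimension_by_gap(A, max_dim=10):
    """Return the d that maximizes s_d - s_{d+1} among the top singular values."""
    s = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    s = s[: max_dim + 1]   # singular values come back sorted, largest first
    gaps = s[:-1] - s[1:]  # gaps[d - 1] = s_d - s_{d+1}
    return int(np.argmax(gaps)) + 1


# Noiseless three-block example: edge probability 0.9 within, 0.1 across.
P = 0.1 * np.ones((30, 30)) + 0.8 * np.kron(np.eye(3), np.ones((10, 10)))
d = choose_dimension_by_gap(P)
print(d)  # the largest gap follows the third singular value, so d = 3
```

On a sampled (noisy) adjacency matrix the gaps are less clean, so the chosen d should be treated as a suggestion and checked against the plots of the first two dimensions.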

## Related reading
This paper might be related: https://arxiv.org/abs/1509.00556

The Wikipedia article on betweenness centrality: https://en.wikipedia.org/wiki/Betweenness_centrality