# Clustering

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [1]:
import bipartitepandas as bpd

## Get your data ready

For this notebook, we simulate data.

In [2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t']).clean()
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids categorical
making 'j' ids categorical
computing largest connected set (how=None)
sorting columns
resetting index


,i,j,y,t,m
0,0,187,3.365202,0,0
1,0,187,1.434979,1,0
2,0,187,2.899365,2,1
3,0,124,2.740293,3,2
4,0,144,0.773866,4,1
...,...,...,...,...,...
49995,9999,76,-2.720836,0,0
49996,9999,76,-0.483792,1,0
49997,9999,76,-1.305111,2,0
49998,9999,76,-1.392870,3,1


## Basic clustering

Clustering in BipartitePandas estimates firm groups: each firm `j` is assigned a group label.

Clustering is simple: just run `.cluster()`. Notice the new `g` column, which stores these group labels!

In [3]:
bdf = bdf.cluster()
display(bdf)

,i,j,y,t,g,m
0,0,187,3.365202,0,3,0
1,0,187,1.434979,1,3,0
2,0,187,2.899365,2,3,1
3,0,124,2.740293,3,1,2
4,0,144,0.773866,4,1,1
...,...,...,...,...,...,...
49995,9999,76,-2.720836,0,0,0
49996,9999,76,-0.483792,1,0,0
49997,9999,76,-1.305111,2,0,0
49998,9999,76,-1.392870,3,0,1
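
Because `g` is a firm-level label, every row belonging to the same firm `j` should carry the same `g`. The following is a minimal sketch of that sanity check on the `(j, g)` pairs visible in the output above. It uses plain Python rather than the BipartitePandas API, and the variable names are illustrative only.

```python
from collections import defaultdict

# (j, g) pairs copied from the rows displayed above
rows = [(187, 3), (187, 3), (187, 3), (124, 1), (144, 1),
        (76, 0), (76, 0), (76, 0), (76, 0)]

# Collect the set of cluster labels observed for each firm
labels_by_firm = defaultdict(set)
for j, g in rows:
    labels_by_firm[j].add(g)

# A firm-level clustering assigns exactly one label per firm
one_label_per_firm = all(len(gs) == 1 for gs in labels_by_firm.values())

# Invert the mapping to list the firms that belong to each cluster
firms_by_cluster = defaultdict(set)
for j, g in rows:
    firms_by_cluster[g].add(j)

print(one_label_per_firm)
print({g: sorted(js) for g, js in sorted(firms_by_cluster.items())})
```

On the full frame, a similar check could presumably be written with the usual pandas `groupby` (for example `bdf.groupby('j')['g'].nunique()`), assuming `bdf` supports standard pandas operations.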


## Advanced clustering

You can investigate all clustering parameters by running `bpd.cluster_params().describe_all()`. Below, we go through some of the most important options.

#### Computing measures and selecting how to group on them

We compute measures using the `bpd.measures` module, and group on the computed measures using the `bpd.grouping` module.

Let's use firm-level income CDFs as our measure, and group using KMeans.

In [4]:
measures = bpd.measures.CDFs()
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(bpd.cluster_params({'measures': measures, 'grouping': grouping}))
display(bdf)

,i,j,y,t,g,m
0,0,187,3.365202,0,4,0
1,0,187,1.434979,1,4,0
2,0,187,2.899365,2,4,1
3,0,124,2.740293,3,0,2
4,0,144,0.773866,4,7,1
...,...,...,...,...,...,...
49995,9999,76,-2.720836,0,3,0
49996,9999,76,-0.483792,1,3,0
49997,9999,76,-1.305111,2,3,0
49998,9999,76,-1.392870,3,3,1
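
To make the idea of "firm-level income CDFs grouped with KMeans" concrete, here is a minimal, library-free sketch. Each firm's empirical income CDF is evaluated on a shared grid of cutoffs, and the resulting vectors are clustered with a plain k-means loop. The toy incomes, the fixed `cutoffs`, the deterministic farthest-first initialisation, and the helper names `firm_cdf`, `sq_dist`, and `kmeans` are all assumptions made for illustration. This is not the BipartitePandas implementation.

```python
# Hypothetical toy data: firm id -> observed incomes
incomes = {
    'A': [1.0, 1.2, 0.8, 1.1],
    'B': [0.9, 1.0, 1.3, 0.7],
    'C': [3.0, 3.2, 2.9, 3.1],
    'D': [2.8, 3.3, 3.0, 3.4],
}
cutoffs = [1.0, 2.0, 3.0]  # assumed shared grid at which every CDF is evaluated


def firm_cdf(values, cutoffs):
    """Share of a firm's incomes at or below each cutoff."""
    return [sum(v <= c for v in values) / len(values) for c in cutoffs]


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


firms = sorted(incomes)
cdf_vectors = [firm_cdf(incomes[j], cutoffs) for j in firms]
groups = dict(zip(firms, kmeans(cdf_vectors, k=2)))
print(groups)
```

In this toy example the low-income firms `A` and `B` end up in one group and the high-income firms `C` and `D` in the other. The group numbers themselves are arbitrary labels, which is also why the `g` values in the notebook outputs change from one clustering run to the next.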


We can even group on multiple measures by passing a list of measures!

In [5]:
measures = [bpd.measures.CDFs(), bpd.measures.Moments(measures=['mean', 'var'])]
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(bpd.cluster_params({'measures': measures, 'grouping': grouping}))
display(bdf)

,i,j,y,t,g,m
0,0,187,3.365202,0,0,0
1,0,187,1.434979,1,0,0
2,0,187,2.899365,2,0,1
3,0,124,2.740293,3,2,2
4,0,144,0.773866,4,6,1
...,...,...,...,...,...,...
49995,9999,76,-2.720836,0,7,0
49996,9999,76,-0.483792,1,7,0
49997,9999,76,-1.305111,2,7,0
49998,9999,76,-1.392870,3,7,1
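
Conceptually, grouping on several measures means each firm is described by one longer feature vector. The sketch below assumes the measures are simply concatenated per firm, and it uses the population variance. Both choices, along with the toy data and the helper names `cdf_features` and `moment_features`, are illustrative assumptions rather than a description of how BipartitePandas combines measures internally.

```python
from statistics import mean, pvariance

# Hypothetical toy data: firm id -> observed incomes
incomes = {'A': [1.0, 2.0, 3.0], 'B': [2.0, 2.0, 2.0]}
cutoffs = [1.5, 2.5]  # assumed shared grid for the CDF measure


def cdf_features(values):
    """Empirical CDF evaluated at each cutoff."""
    return [sum(v <= c for v in values) / len(values) for c in cutoffs]


def moment_features(values):
    """Mean and (population) variance of a firm's incomes."""
    return [mean(values), pvariance(values)]


# One feature vector per firm: CDF values followed by the moments
features = {j: cdf_features(v) + moment_features(v) for j, v in incomes.items()}
print(features)
```

Both firms have the same mean here, so only the CDF values and the variance tell them apart. When measures live on very different scales, the larger-scale one can dominate a distance-based method such as k-means, so it is worth checking how the measures compare before combining them.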


#### Clustering on subsets of the data - stayers/movers/stays/moves

What if we want our measures to be computed using only movers or only stayers? We can specify `stayers_movers`. Note that some firms may not be clustered; these firms will have `g=pd.NA` (set `'dropna': True` if you want to drop firms that don't get clustered).

In [6]:
bdf = bdf.cluster(bpd.cluster_params({'stayers_movers': 'movers'}))
display(bdf)

,i,j,y,t,g,m
0,0,187,3.365202,0,5,0
1,0,187,1.434979,1,5,0
2,0,187,2.899365,2,5,1
3,0,124,2.740293,3,0,2
4,0,144,0.773866,4,4,1
...,...,...,...,...,...,...
49995,9999,76,-2.720836,0,6,0
49996,9999,76,-0.483792,1,6,0
49997,9999,76,-1.305111,2,6,0
49998,9999,76,-1.392870,3,6,1
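
The reason some firms can end up without a group is easiest to see on a tiny example. In the sketch below, a mover is taken to be a worker observed at more than one firm, measures are computed from movers' observations only, and `None` stands in for `pd.NA`. The toy observations and the placeholder label `0` are assumptions for illustration. No actual clustering is performed.

```python
# Hypothetical (worker i, firm j) observations
obs = [(0, 'A'), (0, 'B'), (1, 'B'), (1, 'B'), (2, 'C'), (2, 'C')]

# A worker is a mover if they are observed at more than one firm
firms_by_worker = {}
for i, j in obs:
    firms_by_worker.setdefault(i, set()).add(j)
movers = {i for i, js in firms_by_worker.items() if len(js) > 1}

# Only firms with at least one mover observation contribute measures
firms_with_movers = {j for i, j in obs if i in movers}
all_firms = {j for _, j in obs}

# Stand-in for clustering: firms without mover data get no label (None ~ pd.NA)
labels = {j: (0 if j in firms_with_movers else None) for j in sorted(all_firms)}
print(labels)
```

Firm `C` employs only a stayer, so it has nothing to be clustered on and is left unlabeled.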


#### Clustering on subsets of the data - time

Alternatively, what if we want to cluster using only particular periods of data? We can specify `t` as a list of periods. Again, note that some firms may not be clustered; these firms will have `g=pd.NA` (set `'dropna': True` if you want to drop firms that don't get clustered).

In [7]:
bdf = bdf.cluster(bpd.cluster_params({'t': [0, 1, 2]}))
display(bdf)

,i,j,y,t,g,m
0,0,187,3.365202,0,2,0
1,0,187,1.434979,1,2,0
2,0,187,2.899365,2,2,1
3,0,124,2.740293,3,3,2
4,0,144,0.773866,4,3,1
...,...,...,...,...,...,...
49995,9999,76,-2.720836,0,6,0
49996,9999,76,-0.483792,1,6,0
49997,9999,76,-1.305111,2,6,0
49998,9999,76,-1.392870,3,6,1
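
As the output above shows, rows from periods outside `[0, 1, 2]` still receive a `g` value: the period filter restricts which observations feed the measures, not which rows get labeled. The sketch below illustrates this on hypothetical data, along with what dropping unclustered firms would look like. The toy observations and variable names are assumptions, and the last step only mimics the effect described for `'dropna': True`.

```python
# Hypothetical (firm j, period t) observations
obs = [('A', 0), ('A', 1), ('B', 2), ('B', 3), ('C', 3), ('C', 4)]
periods = {0, 1, 2}  # periods used to compute the measures

# Firms with at least one observation inside the selected periods
firms_in_window = {j for j, t in obs if t in periods}

# Firms never observed in the window cannot be clustered
unclustered = sorted({j for j, _ in obs} - firms_in_window)

# Mimic dropping unclustered firms: keep every row of the clustered firms,
# including rows that fall outside the selected periods
kept = [(j, t) for j, t in obs if j in firms_in_window]

print(unclustered)
print(kept)
```

Firm `B` is observed in period 2, so it is clustered and its period-3 row is kept, while firm `C` is only observed in periods 3 and 4 and is left without a group.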
