# Clustering

In [1]:
# Add BipartitePandas to system path, do not run this
# import sys
# sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Get your data ready

For this notebook, we simulate data.

In [3]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t']).clean()
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m
0,0,100,-0.333267,0,0
1,0,100,-2.046465,1,0
2,0,100,0.757547,2,0
3,0,100,2.007658,3,1
4,0,101,0.295336,4,1
...,...,...,...,...,...
49995,9999,93,2.052232,0,1
49996,9999,0,-0.893098,1,2
49997,9999,103,1.249681,2,2
49998,9999,142,1.531079,3,1


## Basic clustering

Clustering in BipartitePandas assigns each firm to a group (cluster).

Clustering is simple: just run `.cluster()`. Notice the new `g` column, which stores each firm's group!

In [4]:
bdf = bdf.cluster()
display(bdf)

INFO:bipartitelong:firm groups computed
INFO:bipartitelong:dictionary linking firms to clusters generated
INFO:bipartitelong:sorting columns
INFO:bipartitelong:clusters merged into data


Unnamed: 0,i,j,y,t,g,m
0,0,100,-0.333267,0,2,0
1,0,100,-2.046465,1,2,0
2,0,100,0.757547,2,2,0
3,0,100,2.007658,3,2,1
4,0,101,0.295336,4,4,1
...,...,...,...,...,...,...
49995,9999,93,2.052232,0,2,1
49996,9999,0,-0.893098,1,7,2
49997,9999,103,1.249681,2,4,2
49998,9999,142,1.531079,3,1,1
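
Because groups are assigned to firms, every observation of a given firm should carry the same `g`. The following is a minimal sanity check in plain Python on the `(j, g)` pairs displayed above. The helper name `groups_per_firm` is illustrative and not part of BipartitePandas.

```python
from collections import defaultdict

# (j, g) pairs copied from the rows displayed above
rows = [(100, 2), (100, 2), (100, 2), (100, 2), (101, 4),
        (93, 2), (0, 7), (103, 4), (142, 1)]

def groups_per_firm(pairs):
    """Map each firm id to the set of distinct groups it appears with."""
    seen = defaultdict(set)
    for firm, group in pairs:
        seen[firm].add(group)
    return dict(seen)

firm_groups = groups_per_firm(rows)
# Firms that appear with more than one group (none are expected)
inconsistent = {j: gs for j, gs in firm_groups.items() if len(gs) > 1}
print(inconsistent)
```

On the full data, and assuming `bdf` supports the usual pandas interface, `bdf.groupby('j')['g'].nunique().max()` should give the same reassurance by returning 1.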


## Advanced clustering

You can inspect all clustering parameters by running `bpd.cluster_params().describe_all()`. Below, we go through some of the most important options.

#### Computing measures and selecting how to group on them

We compute measures using the `bpd.measures` module, and group on the computed measures using the `bpd.grouping` module.

Let's use firm-level income CDFs (cumulative distribution functions) as our measure, and group using k-means (`bpd.grouping.KMeans`).

In [5]:
measures = bpd.measures.CDFs()
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(bpd.cluster_params({'measures': measures, 'grouping': grouping}))
display(bdf)

INFO:bipartitelong:beginning clustering
INFO:bipartitelong:beginning copy
INFO:bipartitelong:firm moments computed
INFO:bipartitelong:computing firm groups
INFO:bipartitelong:firm groups computed
INFO:bipartitelong:dictionary linking firms to clusters generated
INFO:bipartitelong:sorting columns
INFO:bipartitelong:clusters merged into data


Unnamed: 0,i,j,y,t,g,m
0,0,100,-0.333267,0,6,0
1,0,100,-2.046465,1,6,0
2,0,100,0.757547,2,6,0
3,0,100,2.007658,3,6,1
4,0,101,0.295336,4,6,1
...,...,...,...,...,...,...
49995,9999,93,2.052232,0,6,1
49996,9999,0,-0.893098,1,5,2
49997,9999,103,1.249681,2,0,2
49998,9999,142,1.531079,3,4,1
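
To build intuition for what "CDFs as the measure, KMeans as the grouping" means, here is a plain-Python sketch of the general idea. It is not BipartitePandas' implementation: the function names are hypothetical, the cutoffs are placed at evenly spaced quantiles of the pooled incomes as an assumption, and the k-means routine is a bare-bones version of Lloyd's algorithm.

```python
import random
from bisect import bisect_right

def firm_cdfs(rows, n_points=3):
    """Evaluate each firm's empirical income CDF at shared cutoffs.

    rows: iterable of (firm_id, income) pairs.
    Cutoffs sit at evenly spaced quantiles of the pooled incomes
    (an assumption made for this sketch).
    """
    rows = list(rows)
    pooled = sorted(y for _, y in rows)
    cutoffs = [pooled[(k + 1) * len(pooled) // (n_points + 1)]
               for k in range(n_points)]
    by_firm = {}
    for j, y in rows:
        by_firm.setdefault(j, []).append(y)
    # Share of the firm's observations at or below each cutoff
    return {j: [bisect_right(sorted(ys), c) / len(ys) for c in cutoffs]
            for j, ys in by_firm.items()}

def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=20, seed=0):
    """Bare-bones Lloyd's algorithm; returns one label per point."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: firms 0 and 1 pay low incomes, firms 2 and 3 pay high incomes
rows = [(0, 1), (0, 2), (0, 3), (1, 1), (1, 3), (1, 4),
        (2, 10), (2, 11), (2, 12), (3, 10), (3, 12), (3, 13)]
cdfs = firm_cdfs(rows)
firms = list(cdfs)
labels = kmeans([cdfs[j] for j in firms], k=2)
groups = dict(zip(firms, labels))
print(groups)
```

In this toy example the two low-income firms end up in one group and the two high-income firms in the other. The numeric labels themselves are arbitrary, which is also why the `g` values in the tables change from one clustering run to the next.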


We can even group on multiple measures at once by passing a list of measures!

In [6]:
measures = [bpd.measures.CDFs(), bpd.measures.Moments(measures=['mean', 'var'])]
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(bpd.cluster_params({'measures': measures, 'grouping': grouping}))
display(bdf)

INFO:bipartitelong:beginning clustering
INFO:bipartitelong:beginning copy
INFO:bipartitelong:firm moments computed
INFO:bipartitelong:computing firm groups
INFO:bipartitelong:firm groups computed
INFO:bipartitelong:dictionary linking firms to clusters generated
INFO:bipartitelong:sorting columns
INFO:bipartitelong:clusters merged into data


Unnamed: 0,i,j,y,t,g,m
0,0,100,-0.333267,0,7,0
1,0,100,-2.046465,1,7,0
2,0,100,0.757547,2,7,0
3,0,100,2.007658,3,7,1
4,0,101,0.295336,4,2,1
...,...,...,...,...,...,...
49995,9999,93,2.052232,0,7,1
49996,9999,0,-0.893098,1,5,2
49997,9999,103,1.249681,2,2,2
49998,9999,142,1.531079,3,4,1
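
How the library combines several measures internally is not shown in this notebook. A common approach, assumed in the sketch below, is to concatenate each firm's measure vectors into a single feature vector before grouping. The helper names `firm_moments` and `stack_measures` are hypothetical, and the use of the population variance is likewise an assumption.

```python
from statistics import fmean, pvariance

def firm_moments(incomes_by_firm):
    """Mean and population variance of income for each firm."""
    return {j: [fmean(ys), pvariance(ys)] for j, ys in incomes_by_firm.items()}

def stack_measures(*measures):
    """Concatenate several per-firm feature vectors into one vector per firm."""
    firms = measures[0].keys()
    return {j: [x for m in measures for x in m[j]] for j in firms}

incomes = {100: [1.0, 3.0], 101: [2.0, 2.0]}
# Each firm's empirical CDF evaluated at income cutoffs 1 and 3
cdf_features = {100: [0.5, 1.0], 101: [0.0, 1.0]}

features = stack_measures(cdf_features, firm_moments(incomes))
print(features[100])  # [0.5, 1.0, 2.0, 1.0]
```

When measures live on different scales (CDF values lie in [0, 1], whereas means and variances are unbounded), distance-based methods such as k-means give more weight to the larger-scale features, so rescaling may be worth considering.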


#### Clustering on subsets of the data - stayers/movers/stays/moves

What if we want our measures to be computed using only movers or only stayers? We can specify `stayers_movers`. Note that firms with no observations in the selected subset are not clustered; these firms will have `g=pd.NA` (set `'dropna': True` if you want to drop firms that don't get clustered).

In [7]:
bdf = bdf.cluster(bpd.cluster_params({'stayers_movers': 'movers'}))
display(bdf)

INFO:bipartitelong:beginning clustering
INFO:bipartitelong:beginning copy
INFO:bipartitelong:firm moments computed
INFO:bipartitelong:computing firm groups
INFO:bipartitelong:firm groups computed
INFO:bipartitelong:dictionary linking firms to clusters generated
INFO:bipartitelong:sorting columns
INFO:bipartitelong:clusters merged into data


Unnamed: 0,i,j,y,t,g,m
0,0,100,-0.333267,0,0,0
1,0,100,-2.046465,1,0,0
2,0,100,0.757547,2,0,0
3,0,100,2.007658,3,0,1
4,0,101,0.295336,4,7,1
...,...,...,...,...,...,...
49995,9999,93,2.052232,0,0,1
49996,9999,0,-0.893098,1,5,2
49997,9999,103,1.249681,2,7,2
49998,9999,142,1.531079,3,3,1
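
To see why restricting to movers can leave some firms without a group, consider the sketch below. It assumes that a mover is a worker observed at more than one firm, which is our reading of the term rather than a definition taken from the library, and the helper names are hypothetical.

```python
def movers(rows):
    """Workers observed at more than one firm. rows: (worker, firm) pairs."""
    firms_by_worker = {}
    for i, j in rows:
        firms_by_worker.setdefault(i, set()).add(j)
    return {i for i, js in firms_by_worker.items() if len(js) > 1}

def firms_without_movers(rows):
    """Firms with no observations from movers; no measure can be computed for them."""
    rows = list(rows)
    moving = movers(rows)
    all_firms = {j for _, j in rows}
    firms_with_movers = {j for i, j in rows if i in moving}
    return all_firms - firms_with_movers

rows = [(0, 100), (0, 101),   # worker 0 moves from firm 100 to firm 101
        (1, 100), (1, 100),   # worker 1 stays at firm 100
        (2, 102), (2, 102)]   # worker 2 stays at firm 102
print(firms_without_movers(rows))  # {102}
```

Firm 102 employs only a stayer, so a movers-only measure has nothing to work with for that firm.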


#### Clustering on subsets of the data - time

What if we instead want to cluster using only particular periods of the data? We can specify `t`. Again, firms with no observations in the selected periods are not clustered; these firms will have `g=pd.NA` (set `'dropna': True` if you want to drop firms that don't get clustered).

In [8]:
bdf = bdf.cluster(bpd.cluster_params({'t': [0, 1, 2]}))
display(bdf)

INFO:bipartitelong:beginning clustering
INFO:bipartitelong:beginning copy
INFO:bipartitelong:firm moments computed
INFO:bipartitelong:computing firm groups
INFO:bipartitelong:firm groups computed
INFO:bipartitelong:dictionary linking firms to clusters generated
INFO:bipartitelong:sorting columns
INFO:bipartitelong:clusters merged into data


Unnamed: 0,i,j,y,t,g,m
0,0,100,-0.333267,0,5,0
1,0,100,-2.046465,1,5,0
2,0,100,0.757547,2,5,0
3,0,100,2.007658,3,5,1
4,0,101,0.295336,4,0,1
...,...,...,...,...,...,...
49995,9999,93,2.052232,0,5,1
49996,9999,0,-0.893098,1,3,2
49997,9999,103,1.249681,2,0,2
49998,9999,142,1.531079,3,4,1
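
The effect of restricting periods, and of `'dropna'`, can be illustrated with a small plain-Python sketch. It is only an illustration under assumptions: `None` stands in for `pd.NA`, the group labels are placeholders rather than real clustering output, and `merge_groups` is a hypothetical helper.

```python
def merge_groups(rows, firm_to_group, dropna=False):
    """Attach a group to every (firm, period) row; None marks unclustered firms."""
    merged = [(j, t, firm_to_group.get(j)) for j, t in rows]
    if dropna:
        # Drop rows belonging to firms that were not clustered
        merged = [row for row in merged if row[2] is not None]
    return merged

rows = [(100, 0), (100, 3), (101, 4), (93, 0)]  # (firm, period) pairs
periods = {0, 1, 2}

# Only firms observed in the selected periods can be clustered
clustered_firms = {j for j, t in rows if t in periods}
# Placeholder labels; a real run would take these from the clustering step
firm_to_group = {j: g for g, j in enumerate(sorted(clustered_firms))}

print(merge_groups(rows, firm_to_group))
print(merge_groups(rows, firm_to_group, dropna=True))
```

Firm 101 is observed only in period 4, so it receives no group. Note that firm 100 keeps its group in period 3 as well: the period restriction affects which observations feed the measures, while the resulting group is merged back onto all of the firm's rows.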
