# Custom columns

In [1]:
# Add BipartitePandas to system path, do not run this
import sys
sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd



## Get your data ready

In this example, we simulate data.

In [3]:
df = bpd.SimBipartite().simulate()
display(df)

Unnamed: 0,i,j,y,t,l,k,alpha,psi
0,0,144,-0.540522,0,4,7,0.967422,0.604585
1,0,144,1.920034,1,4,7,0.967422,0.604585
2,0,165,1.797380,2,4,8,0.967422,0.908458
3,0,133,0.914947,3,4,6,0.967422,0.348756
4,0,133,1.597989,4,4,6,0.967422,0.348756
...,...,...,...,...,...,...,...,...
49995,9999,79,-0.131457,0,3,3,0.430727,-0.348756
49996,9999,79,0.378927,1,3,3,0.430727,-0.348756
49997,9999,79,0.841094,2,3,3,0.430727,-0.348756
49998,9999,72,-0.643424,3,3,3,0.430727,-0.348756


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`
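The naming pattern in the list above can be summarised in a few lines of plain Python. This is only an illustrative sketch: `subcolumns` and the format labels (`'long'`, `'collapsed_long'`, `'es'`, `'collapsed_es'`) are hypothetical names chosen for this example, not part of the BipartitePandas API.

```python
# Hypothetical format labels used only in this sketch
FORMATS = ('long', 'collapsed_long', 'es', 'collapsed_es')


def subcolumns(col, fmt):
    """Return the column names a general column takes in a given format."""
    if fmt not in FORMATS:
        raise ValueError(f'unknown format: {fmt!r}')
    event_study = fmt in ('es', 'collapsed_es')
    collapsed = fmt in ('collapsed_long', 'collapsed_es')
    if col in ('i', 'm'):
        # Worker id and move indicator are never split
        return [col]
    if col == 't':
        if event_study and collapsed:
            # Two observations, each with a spell start and a spell end
            return ['t11', 't12', 't21', 't22']
        if event_study or collapsed:
            return ['t1', 't2']
        return ['t']
    if col in ('j', 'y', 'g', 'w'):
        # These split only when a row holds two observations
        return [col + '1', col + '2'] if event_study else [col]
    raise ValueError(f'not a pre-defined general column: {col!r}')


for fmt in FORMATS:
    print(fmt, [name for col in 'ijytgwm' for name in subcolumns(col, fmt)])
```

Note that the printed order follows the general column order (`i`, `j`, `y`, `t`, ...), so for the event study formats it differs slightly from the order in the list above.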

## Constructing DataFrames

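Our simulated data is in *Long* format, but it includes columns that aren't pre-defined (`l`, `k`, `alpha`, and `psi`). To construct a *Long* dataframe that includes these columns, pass each one to `BipartiteDataFrame` as an additional keyword argument: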

In [4]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'])
display(bdf_long)

Unnamed: 0,i,j,y,t,alpha,k,l,psi
0,0,144,-0.540522,0,0.967422,7,4,0.604585
1,0,144,1.920034,1,0.967422,7,4,0.604585
2,0,165,1.797380,2,0.967422,8,4,0.908458
3,0,133,0.914947,3,0.967422,6,4,0.348756
4,0,133,1.597989,4,0.967422,6,4,0.348756
...,...,...,...,...,...,...,...,...
49995,9999,79,-0.131457,0,0.430727,3,3,-0.348756
49996,9999,79,0.378927,1,0.430727,3,3,-0.348756
49997,9999,79,0.841094,2,0.430727,3,3,-0.348756
49998,9999,72,-0.643424,3,0.430727,3,3,-0.348756


Are we sure this is long? Let's check the datatype:

In [5]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

## Contiguous columns

What if we want to specify that a column should be contiguous? Then we should specify `custom_contiguous_dict`!

*Note:* `alpha` is a float column, and BipartiteDataFrame automatically sets float columns to collapse by `mean`. Contiguous columns cannot be collapsed by `mean`, so if we mark `alpha` as contiguous, we must also specify that it should collapse by `first` (`last` or `None` also work).

In [6]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_contiguous_dict={'alpha': True}, custom_how_collapse_dict={'alpha': 'first'}).clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m,alpha,k,l,psi
0,0,144,-0.540522,0,0,0,7,4,0.604585
1,0,144,1.920034,1,1,0,7,4,0.604585
2,0,165,1.797380,2,2,0,8,4,0.908458
3,0,133,0.914947,3,1,0,6,4,0.348756
4,0,133,1.597989,4,0,0,6,4,0.348756
...,...,...,...,...,...,...,...,...,...
49995,9999,79,-0.131457,0,0,4,3,3,-0.348756
49996,9999,79,0.378927,1,0,4,3,3,-0.348756
49997,9999,79,0.841094,2,1,4,3,3,-0.348756
49998,9999,72,-0.643424,3,1,4,3,3,-0.348756
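To make the idea of contiguous ids concrete, here is a minimal sketch in plain Python. It assumes ids are assigned in order of first appearance, which may not be how BipartitePandas assigns them internally; `make_contiguous` is a hypothetical helper written for this example.

```python
def make_contiguous(values):
    """Map each distinct value to an integer id in 0..n-1.

    Assumption: ids are assigned in order of first appearance.
    """
    ids = {}
    # setdefault returns the existing id, or stores and returns the next free one
    contiguous = [ids.setdefault(value, len(ids)) for value in values]
    return contiguous, ids


alpha = [0.967422, 0.967422, -0.967422, 0.430727, 0.967422]
alpha_ids, mapping = make_contiguous(alpha)
print(alpha_ids)
print(mapping)
```

After the mapping, the column holds integer labels rather than the original float values, which is why collapsing such a column by `mean` is not meaningful.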


## Collapsing data

What if, instead of collapsing by `mean`, we want a column to collapse by `first`, or even to be dropped when we collapse? Then we should specify `custom_how_collapse_dict`! Setting a column's entry to `None` drops that column during the collapse.

In [7]:
bdf_collapsed = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_how_collapse_dict={'alpha': None, 'psi': 'first'}).clean().collapse()
display(bdf_collapsed)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t1,t2,w,m,k,l,psi
0,0,144,0.689756,0,1,2,1,7.0,4.0,0.604585
1,0,165,1.797380,2,2,1,2,8.0,4.0,0.908458
2,0,133,1.256468,3,4,2,1,6.0,4.0,0.348756
3,1,17,-2.188944,0,3,4,1,0.0,0.0,-1.335178
4,1,32,-1.942657,4,4,1,1,1.0,0.0,-0.908458
...,...,...,...,...,...,...,...,...,...,...
29682,9998,103,-0.679716,0,1,2,1,5.0,2.0,0.114185
29683,9998,40,-1.335878,2,2,1,2,1.0,2.0,-0.908458
29684,9998,152,-0.242433,3,4,2,1,7.0,2.0,0.604585
29685,9999,79,0.362855,0,2,3,1,3.0,3.0,-0.348756
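The collapse rules can be sketched in plain Python as follows. This is an illustration of the idea, not the library's implementation: `collapse_spells` and `AGGREGATORS` are hypothetical names, and the sketch assumes that `y` collapses by mean, that `w` counts the observations in a spell, and that rows are already sorted by worker and time. The input rows are the first five rows of the simulated data shown earlier.

```python
from itertools import groupby
from statistics import mean

# Hypothetical lookup from a collapse rule to a function over a spell's values
AGGREGATORS = {
    'mean': mean,
    'first': lambda values: values[0],
    'last': lambda values: values[-1],
}


def collapse_spells(rows, how):
    """Collapse consecutive rows with the same worker and firm into one row.

    `how` maps each custom column to 'mean', 'first', 'last', or None (drop).
    """
    collapsed = []
    # groupby only merges adjacent rows, so each group is one employment spell
    for (i, j), spell in groupby(rows, key=lambda r: (r['i'], r['j'])):
        spell = list(spell)
        record = {
            'i': i,
            'j': j,
            'y': mean([r['y'] for r in spell]),
            't1': spell[0]['t'],   # first period of the spell
            't2': spell[-1]['t'],  # last period of the spell
            'w': len(spell),       # number of observations collapsed
        }
        for col, rule in how.items():
            if rule is None:
                continue  # None means the column is dropped
            record[col] = AGGREGATORS[rule]([r[col] for r in spell])
        collapsed.append(record)
    return collapsed


rows = [
    {'i': 0, 'j': 144, 'y': -0.540522, 't': 0, 'alpha': 0.967422, 'psi': 0.604585},
    {'i': 0, 'j': 144, 'y': 1.920034, 't': 1, 'alpha': 0.967422, 'psi': 0.604585},
    {'i': 0, 'j': 165, 'y': 1.797380, 't': 2, 'alpha': 0.967422, 'psi': 0.908458},
    {'i': 0, 'j': 133, 'y': 0.914947, 't': 3, 'alpha': 0.967422, 'psi': 0.348756},
    {'i': 0, 'j': 133, 'y': 1.597989, 't': 4, 'alpha': 0.967422, 'psi': 0.348756},
]
collapsed = collapse_spells(rows, how={'alpha': None, 'psi': 'first'})
for record in collapsed:
    print(record)
```

The three collapsed spells correspond to the first three rows of the output table above: `alpha` is gone, and `psi` keeps the first value of each spell.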


## Converting between (collapsed) long and (collapsed) event study formats

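What if we don't want a column to split when converting to event study, or if we want it to be dropped during the conversion? Then we should specify `custom_long_es_split_dict`! Setting a column's entry to `False` keeps it as a single column, and setting it to `None` drops it.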

In [8]:
bdf_es = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_long_es_split_dict={'alpha': False, 'psi': None}).clean().to_eventstudy()
display(bdf_es)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j1,j2,y1,y2,t1,t2,m,alpha,k1,k2,l1,l2
0,0,144,144,-0.540522,1.920034,0,1,0,0.967422,7,7,4,4
1,0,144,165,1.920034,1.797380,1,2,1,0.967422,7,8,4,4
2,0,165,133,1.797380,0.914947,2,3,1,0.967422,8,6,4,4
3,0,133,133,0.914947,1.597989,3,4,0,0.967422,6,6,4,4
4,1,17,17,-2.042282,-2.604465,0,1,0,-0.967422,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40656,9998,152,152,0.013852,-0.498718,3,4,0,0.000000,7,7,2,2
40657,9999,79,79,-0.131457,0.378927,0,1,0,0.430727,3,3,3,3
40658,9999,79,79,0.378927,0.841094,1,2,0,0.430727,3,3,3,3
40659,9999,79,72,0.841094,-0.643424,2,3,1,0.430727,3,3,3,3
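The split rules can be illustrated with a small plain-Python sketch. It is not the library's implementation: `to_event_study` is a hypothetical helper, rows are assumed to be sorted by worker and time, an unsplit column is assumed to take its value from the first observation of each pair, and workers with a single observation simply produce no rows here, which may differ from what BipartitePandas does.

```python
def to_event_study(rows, split):
    """Pair each observation with the next observation of the same worker.

    `split` maps each custom column to True (split into col1 and col2),
    False (keep a single column), or None (drop).
    """
    records = []
    for first, second in zip(rows, rows[1:]):
        if first['i'] != second['i']:
            continue  # never pair observations from different workers
        record = {'i': first['i']}
        for col in ('j', 'y', 't'):
            record[col + '1'] = first[col]
            record[col + '2'] = second[col]
        # Flag a move when the firm changes between the two observations
        record['m'] = int(first['j'] != second['j'])
        for col, rule in split.items():
            if rule is None:
                continue  # None means the column is dropped
            if rule:
                record[col + '1'] = first[col]
                record[col + '2'] = second[col]
            else:
                # Assumption: an unsplit column takes the first observation's value
                record[col] = first[col]
        records.append(record)
    return records


rows = [
    {'i': 0, 'j': 144, 'y': -0.540522, 't': 0, 'k': 7, 'alpha': 0.967422, 'psi': 0.604585},
    {'i': 0, 'j': 144, 'y': 1.920034, 't': 1, 'k': 7, 'alpha': 0.967422, 'psi': 0.604585},
    {'i': 0, 'j': 165, 'y': 1.797380, 't': 2, 'k': 8, 'alpha': 0.967422, 'psi': 0.908458},
    {'i': 0, 'j': 133, 'y': 0.914947, 't': 3, 'k': 6, 'alpha': 0.967422, 'psi': 0.348756},
    {'i': 0, 'j': 133, 'y': 1.597989, 't': 4, 'k': 6, 'alpha': 0.967422, 'psi': 0.348756},
    {'i': 1, 'j': 17, 'y': -2.042282, 't': 0, 'k': 0, 'alpha': -0.967422, 'psi': -1.335178},
]
event_study = to_event_study(rows, split={'k': True, 'alpha': False, 'psi': None})
for record in event_study:
    print(record)
```

For worker 0 this gives four rows whose `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `m`, `alpha`, `k1`, and `k2` values match the first four rows of the output table above.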
