# Custom columns

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [1]:
import bipartitepandas as bpd

## Get your data ready

For this notebook, we simulate data.

In [2]:
df = bpd.SimBipartite().simulate()
display(df)

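|  | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 126 | -0.855581 | 0 | 2 | 6 | 0.0 | 0.348756 |
| 1 | 0 | 53 | -2.067781 | 1 | 2 | 2 | 0.0 | -0.604585 |
| 2 | 0 | 53 | -0.943629 | 2 | 2 | 2 | 0.0 | -0.604585 |
| 3 | 0 | 179 | 1.418097 | 3 | 2 | 8 | 0.0 | 0.908458 |
| 4 | 0 | 129 | 0.356113 | 4 | 2 | 6 | 0.0 | 0.348756 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 168 | 1.126927 | 0 | 2 | 8 | 0.0 | 0.908458 |
| 49996 | 9999 | 170 | 0.432647 | 1 | 2 | 8 | 0.0 | 0.908458 |
| 49997 | 9999 | 170 | 0.105176 | 2 | 2 | 8 | 0.0 | 0.908458 |
| 49998 | 9999 | 170 | 1.853938 | 3 | 2 | 8 | 0.0 | 0.908458 |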


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)
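
To illustrate the required/optional split above, here is a minimal sketch that sorts a list of column names into missing required columns and custom (not pre-defined) columns. It is plain Python and does not call BipartitePandas; the helper name `classify_columns` is hypothetical.

```python
# General column names listed above (Long format).
REQUIRED = ("i", "j", "y")
OPTIONAL = ("t", "g", "w", "m")


def classify_columns(columns):
    """Return (missing_required, custom) for a list of column names."""
    # Required columns that are absent from the data.
    missing_required = [c for c in REQUIRED if c not in columns]
    # Columns that are neither required nor optional are custom columns.
    custom = [c for c in columns if c not in REQUIRED + OPTIONAL]
    return missing_required, custom


# Column names of the simulated data shown earlier.
sim_columns = ["i", "j", "y", "t", "l", "k", "alpha", "psi"]
missing, custom = classify_columns(sim_columns)
print(missing)  # []
print(custom)   # ['l', 'k', 'alpha', 'psi']
```

The simulated data has every required column, plus four custom ones (`l`, `k`, `alpha`, `psi`).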

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`
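
The mapping above can be written down as a small lookup table. The sketch below is plain Python and only restates the lists above; the format labels (`"long"`, `"collapsed_long"`, and so on) and the helper names are hypothetical, not part of the BipartitePandas API.

```python
# How each general column is divided in each format (taken from the lists above).
FORMAT_COLUMNS = {
    "long": {
        "i": ["i"], "j": ["j"], "y": ["y"], "t": ["t"],
        "g": ["g"], "w": ["w"], "m": ["m"],
    },
    "collapsed_long": {
        "i": ["i"], "j": ["j"], "y": ["y"], "t": ["t1", "t2"],
        "g": ["g"], "w": ["w"], "m": ["m"],
    },
    "event_study": {
        "i": ["i"], "j": ["j1", "j2"], "y": ["y1", "y2"], "t": ["t1", "t2"],
        "g": ["g1", "g2"], "w": ["w1", "w2"], "m": ["m"],
    },
    "collapsed_event_study": {
        "i": ["i"], "j": ["j1", "j2"], "y": ["y1", "y2"],
        "t": ["t11", "t12", "t21", "t22"],
        "g": ["g1", "g2"], "w": ["w1", "w2"], "m": ["m"],
    },
}


def subcolumns(general, fmt):
    """Return the column names that a general column takes in a format."""
    return FORMAT_COLUMNS[fmt][general]


def all_columns(fmt):
    """Return every column name of a format, in the order listed above."""
    return [name for names in FORMAT_COLUMNS[fmt].values() for name in names]


print(subcolumns("t", "collapsed_event_study"))  # ['t11', 't12', 't21', 't22']
print(all_columns("collapsed_long"))
```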

## Constructing DataFrames

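Our simulated data is in *Long* format, but includes columns that aren't pre-defined. How do we construct a *Long* dataframe that includes these columns? We pass each custom column to the constructor as a keyword argument, just like the pre-defined columns.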

In [3]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'])
display(bdf_long)

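|  | i | j | y | t | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 126 | -0.855581 | 0 | 0.0 | 6 | 2 | 0.348756 |
| 1 | 0 | 53 | -2.067781 | 1 | 0.0 | 2 | 2 | -0.604585 |
| 2 | 0 | 53 | -0.943629 | 2 | 0.0 | 2 | 2 | -0.604585 |
| 3 | 0 | 179 | 1.418097 | 3 | 0.0 | 8 | 2 | 0.908458 |
| 4 | 0 | 129 | 0.356113 | 4 | 0.0 | 6 | 2 | 0.348756 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 168 | 1.126927 | 0 | 0.0 | 8 | 2 | 0.908458 |
| 49996 | 9999 | 170 | 0.432647 | 1 | 0.0 | 8 | 2 | 0.908458 |
| 49997 | 9999 | 170 | 0.105176 | 2 | 0.0 | 8 | 2 | 0.908458 |
| 49998 | 9999 | 170 | 1.853938 | 3 | 0.0 | 8 | 2 | 0.908458 |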


Are we sure this is long? Let's check the datatype:

In [4]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

## Contiguous columns

What if we want to specify that a column should be contiguous? Then we should specify `custom_contiguous_dict`!

<span class="label label-info">Note</span> `alpha` is a float, and BipartiteDataFrame automatically sets floats to collapse by `mean`. Contiguous columns cannot be collapsed by `mean`, so if we mark `alpha` as contiguous, we must also specify that it should collapse by `first` (`last` or `None` also work). In addition, contiguous columns must use the datatype `contig`.

In [5]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_contiguous_dict={'alpha': True}, custom_dtype_dict={'alpha': 'contig'}, custom_how_collapse_dict={'alpha': 'first'}).clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


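|  | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 126 | -0.855581 | 0 | 1 | 0 | 6 | 2 | 0.348756 |
| 1 | 0 | 53 | -2.067781 | 1 | 1 | 0 | 2 | 2 | -0.604585 |
| 2 | 0 | 53 | -0.943629 | 2 | 1 | 0 | 2 | 2 | -0.604585 |
| 3 | 0 | 179 | 1.418097 | 3 | 2 | 0 | 8 | 2 | 0.908458 |
| 4 | 0 | 129 | 0.356113 | 4 | 1 | 0 | 6 | 2 | 0.348756 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 168 | 1.126927 | 0 | 1 | 0 | 8 | 2 | 0.908458 |
| 49996 | 9999 | 170 | 0.432647 | 1 | 1 | 0 | 8 | 2 | 0.908458 |
| 49997 | 9999 | 170 | 0.105176 | 2 | 0 | 0 | 8 | 2 | 0.908458 |
| 49998 | 9999 | 170 | 1.853938 | 3 | 1 | 0 | 8 | 2 | 0.908458 |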
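
To make the idea of a contiguous column concrete: its values are relabeled as integer ids 0, 1, 2, and so on, with no gaps. The following is a minimal sketch in plain Python, not the library's implementation. It assumes ids are assigned in order of first appearance, and `make_contiguous` is a hypothetical helper.

```python
def make_contiguous(values):
    """Relabel arbitrary hashable values as integer ids 0, 1, 2, ...

    Assumption: ids are assigned in order of first appearance.
    Returns the relabeled list and the value-to-id mapping.
    """
    ids = {}
    relabeled = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)  # next unused integer id
        relabeled.append(ids[value])
    return relabeled, ids


# A few alpha values of the kind that appear in the simulated data.
alpha = [0.0, 0.0, -0.967422, 0.967422, 0.0]
alpha_ids, mapping = make_contiguous(alpha)
print(alpha_ids)  # [0, 0, 1, 2, 0]
print(mapping)    # {0.0: 0, -0.967422: 1, 0.967422: 2}
```

After relabeling, the column holds category ids rather than numeric values, which is why averaging it no longer makes sense.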


## Collapsing data

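What if, instead of collapsing by `mean`, we want a column to collapse by `first`, or to be dropped when we collapse? Then we should specify `custom_how_collapse_dict`! In the example below, `psi` collapses by `first`, and `alpha` is dropped by setting its entry to `None`.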

In [6]:
bdf_collapsed = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_how_collapse_dict={'alpha': None, 'psi': 'first'}).clean().collapse()
display(bdf_collapsed)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


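|  | i | j | y | t1 | t2 | w | m | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 126 | -0.855581 | 0 | 0 | 1 | 1 | 6.0 | 2.0 | 0.348756 |
| 1 | 0 | 53 | -1.505705 | 1 | 2 | 2 | 2 | 2.0 | 2.0 | -0.604585 |
| 2 | 0 | 179 | 1.418097 | 3 | 3 | 1 | 2 | 8.0 | 2.0 | 0.908458 |
| 3 | 0 | 129 | 0.356113 | 4 | 4 | 1 | 1 | 6.0 | 2.0 | 0.348756 |
| 4 | 1 | 11 | -2.076062 | 0 | 4 | 5 | 0 | 0.0 | 0.0 | -1.335178 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29741 | 9998 | 164 | 1.476241 | 1 | 3 | 3 | 2 | 8.0 | 4.0 | 0.908458 |
| 29742 | 9998 | 165 | 3.702658 | 4 | 4 | 1 | 1 | 8.0 | 4.0 | 0.908458 |
| 29743 | 9999 | 168 | 1.126927 | 0 | 0 | 1 | 1 | 8.0 | 2.0 | 0.908458 |
| 29744 | 9999 | 170 | 0.797253 | 1 | 3 | 3 | 2 | 8.0 | 2.0 | 0.908458 |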


<span class="label label-info">Note</span> Collapsing by `first`, `last`, `mean`, and `sum` will uncollapse correctly (although information may be lost); any other option (e.g. `var` or `std`) is not guaranteed to uncollapse correctly.
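
To show what collapsing does to each column, here is a minimal sketch in plain Python, using worker 0's rows from the simulated data. It is not the library's implementation: `collapse_spells` and `AGGREGATORS` are hypothetical names, and treating `w` as the number of observations in a spell is an assumption that happens to match the output displayed above.

```python
from itertools import groupby
from statistics import mean

# Supported ways to collapse a column within a spell.
AGGREGATORS = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "mean": mean,
    "sum": sum,
}


def collapse_spells(rows, how):
    """Collapse consecutive rows with the same (i, j) into a single row.

    rows: list of dicts sorted by worker i and time t.
    how: maps each value column to an aggregator name, or None to drop it.
    """
    collapsed = []
    # groupby only merges *consecutive* rows, so each group is one spell.
    for (i, j), spell in groupby(rows, key=lambda r: (r["i"], r["j"])):
        spell = list(spell)
        out = {"i": i, "j": j, "t1": spell[0]["t"], "t2": spell[-1]["t"],
               "w": len(spell)}  # assumption: w counts observations
        for col, agg in how.items():
            if agg is None:
                continue  # None drops the column
            out[col] = AGGREGATORS[agg]([r[col] for r in spell])
        collapsed.append(out)
    return collapsed


# Worker 0 from the simulated data above.
rows = [
    {"i": 0, "j": 126, "y": -0.855581, "t": 0, "alpha": 0.0, "psi": 0.348756},
    {"i": 0, "j": 53, "y": -2.067781, "t": 1, "alpha": 0.0, "psi": -0.604585},
    {"i": 0, "j": 53, "y": -0.943629, "t": 2, "alpha": 0.0, "psi": -0.604585},
    {"i": 0, "j": 179, "y": 1.418097, "t": 3, "alpha": 0.0, "psi": 0.908458},
    {"i": 0, "j": 129, "y": 0.356113, "t": 4, "alpha": 0.0, "psi": 0.348756},
]
collapsed = collapse_spells(rows, {"y": "mean", "alpha": None, "psi": "first"})
for row in collapsed:
    print(row)
```

The two observations at firm 53 merge into one row spanning `t1=1` to `t2=2`, with `y` averaged, `psi` taken from the first observation, and `alpha` dropped.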

## Converting between (collapsed) long and (collapsed) event study formats

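What if we don't want a column to split when converting to event study, or we want it to be dropped during the conversion? Then we should specify `custom_long_es_split_dict`! In the example below, `alpha` is kept as a single unsplit column (`False`) and `psi` is dropped (`None`).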

In [7]:
bdf_es = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_long_es_split_dict={'alpha': False, 'psi': None}).clean().to_eventstudy()
display(bdf_es)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


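|  | i | j1 | j2 | y1 | y2 | t1 | t2 | m | alpha | k1 | k2 | l1 | l2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 126 | 53 | -0.855581 | -2.067781 | 0 | 1 | 1 | 0.000000 | 6 | 2 | 2 | 2 |
| 1 | 0 | 53 | 53 | -2.067781 | -0.943629 | 1 | 2 | 0 | 0.000000 | 2 | 2 | 2 | 2 |
| 2 | 0 | 53 | 179 | -0.943629 | 1.418097 | 2 | 3 | 1 | 0.000000 | 2 | 8 | 2 | 2 |
| 3 | 0 | 179 | 129 | 1.418097 | 0.356113 | 3 | 4 | 1 | 0.000000 | 8 | 6 | 2 | 2 |
| 4 | 1 | 11 | 11 | -0.957193 | -0.957193 | 0 | 0 | 0 | -0.967422 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40635 | 9998 | 164 | 165 | 0.541696 | 3.702658 | 3 | 4 | 1 | 0.967422 | 8 | 8 | 4 | 4 |
| 40636 | 9999 | 168 | 170 | 1.126927 | 0.432647 | 0 | 1 | 1 | 0.000000 | 8 | 8 | 2 | 2 |
| 40637 | 9999 | 170 | 170 | 0.432647 | 0.105176 | 1 | 2 | 0 | 0.000000 | 8 | 8 | 2 | 2 |
| 40638 | 9999 | 170 | 170 | 0.105176 | 1.853938 | 2 | 3 | 0 | 0.000000 | 8 | 8 | 2 | 2 |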
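
The three split options can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: `to_event_study` is a hypothetical helper, an unsplit column is assumed to take the first observation's value, and the sketch does not try to reproduce how the library handles workers who never change firms.

```python
def to_event_study(rows, split):
    """Pair consecutive observations of the same worker.

    rows: list of dicts sorted by worker i and time t.
    split: maps each column to True (split into col1/col2),
           False (keep one unsplit column), or None (drop the column).
    """
    events = []
    for first, second in zip(rows, rows[1:]):
        if first["i"] != second["i"]:
            continue  # never pair observations from different workers
        event = {"i": first["i"]}
        for col, rule in split.items():
            if rule is None:
                continue  # None drops the column
            if rule:
                event[col + "1"] = first[col]
                event[col + "2"] = second[col]
            else:
                # Assumption: an unsplit column keeps the first observation's value.
                event[col] = first[col]
        events.append(event)
    return events


# Worker 0 and the first two observations of worker 9999 from the data above.
rows = [
    {"i": 0, "j": 126, "y": -0.855581, "t": 0, "alpha": 0.0, "psi": 0.348756},
    {"i": 0, "j": 53, "y": -2.067781, "t": 1, "alpha": 0.0, "psi": -0.604585},
    {"i": 0, "j": 53, "y": -0.943629, "t": 2, "alpha": 0.0, "psi": -0.604585},
    {"i": 0, "j": 179, "y": 1.418097, "t": 3, "alpha": 0.0, "psi": 0.908458},
    {"i": 0, "j": 129, "y": 0.356113, "t": 4, "alpha": 0.0, "psi": 0.348756},
    {"i": 9999, "j": 168, "y": 1.126927, "t": 0, "alpha": 0.0, "psi": 0.908458},
    {"i": 9999, "j": 170, "y": 0.432647, "t": 1, "alpha": 0.0, "psi": 0.908458},
]
split = {"j": True, "y": True, "t": True, "alpha": False, "psi": None}
events = to_event_study(rows, split)
for event in events:
    print(event)
```

Each resulting row has `j1`/`j2`, `y1`/`y2`, and `t1`/`t2`, a single `alpha` column, and no `psi` column.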
