# Custom columns

In [1]:
# Add BipartitePandas to system path, do not run this
# import sys
# sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd



## Get your data ready

For this notebook, we simulate data.

In [3]:
df = bpd.SimBipartite().simulate()
display(df)

Unnamed: 0,i,j,y,t,l,k,alpha,psi
0,0,35,-0.361932,0,0,1,-0.967422,-0.908458
1,0,35,-0.977417,1,0,1,-0.967422,-0.908458
2,0,35,-0.539126,2,0,1,-0.967422,-0.908458
3,0,35,-1.382090,3,0,1,-0.967422,-0.908458
4,0,35,-0.958114,4,0,1,-0.967422,-0.908458
...,...,...,...,...,...,...,...,...
49995,9999,16,-2.955536,0,0,0,-0.967422,-1.335178
49996,9999,13,-0.999088,1,0,0,-0.967422,-1.335178
49997,9999,88,-3.198988,2,0,4,-0.967422,-0.114185
49998,9999,88,-2.267574,3,0,4,-0.967422,-0.114185


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)
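
The column rules above can be sketched as a small validity check. This is only an illustration in plain Python on a list of row dictionaries; `check_columns` is a hypothetical helper, not part of BipartitePandas, and the library's own checks (run by `.clean()`) may differ.

```python
from numbers import Real

REQUIRED = ("i", "j", "y")   # worker id, firm id, income
NUMERIC = ("y", "w")         # float or int
INTEGER = ("t", "m")         # int


def check_columns(rows):
    """Return a list of problems found in a list of row dicts (hypothetical helper)."""
    problems = []
    for n, row in enumerate(rows):
        for col in REQUIRED:
            if col not in row:
                problems.append(f"row {n}: missing required column '{col}'")
        # Numeric columns accept float or int; bool is rejected on purpose
        for col in NUMERIC:
            if col in row and (isinstance(row[col], bool) or not isinstance(row[col], Real)):
                problems.append(f"row {n}: '{col}' must be float or int")
        # Integer columns accept int only
        for col in INTEGER:
            if col in row and (isinstance(row[col], bool) or not isinstance(row[col], int)):
                problems.append(f"row {n}: '{col}' must be int")
    return problems


rows = [
    {"i": 0, "j": 35, "y": -0.36, "t": 0},
    {"i": 0, "j": 35, "t": 1},                   # missing y
    {"i": 1, "j": "firm_a", "y": 2, "t": 2.5},   # t is not an int
]
for problem in check_columns(rows):
    print(problem)
```

Ids (`i`, `j`, `g`) are left unchecked because any type is allowed for them.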

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`
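
The naming pattern in this list can be written down as a small lookup. The sketch below is only a transcription of the list above into plain Python; `subcolumns` is a hypothetical helper and is not how BipartitePandas stores this information internally.

```python
def subcolumns(col, fmt):
    """Names under which general column `col` appears in format `fmt` (hypothetical helper)."""
    event_study = fmt in ("event study", "collapsed event study")
    collapsed = fmt in ("collapsed long", "collapsed event study")
    if col in ("i", "m"):
        return [col]  # never split
    if col == "t":
        if event_study and collapsed:
            return ["t11", "t12", "t21", "t22"]
        if event_study or collapsed:
            return ["t1", "t2"]
        return ["t"]
    # j, y, g, w: split only in the event study formats
    return [col + "1", col + "2"] if event_study else [col]


for fmt in ("long", "collapsed long", "event study", "collapsed event study"):
    names = [name for col in "ijytgwm" for name in subcolumns(col, fmt)]
    print(fmt, "->", ", ".join(names))
```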

## Constructing DataFrames

Our simulated data is in *Long* format, but includes columns that aren't pre-defined (`l`, `k`, `alpha`, and `psi`). How do we construct a *Long* dataframe that includes these columns? We pass each custom column as an additional keyword argument:

In [4]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'])
display(bdf_long)

Unnamed: 0,i,j,y,t,alpha,k,l,psi
0,0,35,-0.361932,0,-0.967422,1,0,-0.908458
1,0,35,-0.977417,1,-0.967422,1,0,-0.908458
2,0,35,-0.539126,2,-0.967422,1,0,-0.908458
3,0,35,-1.382090,3,-0.967422,1,0,-0.908458
4,0,35,-0.958114,4,-0.967422,1,0,-0.908458
...,...,...,...,...,...,...,...,...
49995,9999,16,-2.955536,0,-0.967422,0,0,-1.335178
49996,9999,13,-0.999088,1,-0.967422,0,0,-1.335178
49997,9999,88,-3.198988,2,-0.967422,4,0,-0.114185
49998,9999,88,-2.267574,3,-0.967422,4,0,-0.114185


Are we sure this is in *Long* format? Let's check the datatype:

In [5]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

## Contiguous columns

What if we want to mark a column as contiguous, so that its values are treated as ids and relabeled during cleaning? Then we should specify `custom_contiguous_dict`!

*Note:* `alpha` is a float column, and BipartiteDataFrame collapses float columns by `mean` by default. Contiguous columns cannot be collapsed by `mean`, so if we mark `alpha` as contiguous, we must also specify a different collapse rule, such as `first` (`last` or `None` also work). In addition, contiguous columns must use the datatype `contig`.

In [6]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_contiguous_dict={'alpha': True}, custom_dtype_dict={'alpha': 'contig'}, custom_how_collapse_dict={'alpha': 'first'}).clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m,alpha,k,l,psi
0,0,35,-0.361932,0,0,0,1,0,-0.908458
1,0,35,-0.977417,1,0,0,1,0,-0.908458
2,0,35,-0.539126,2,0,0,1,0,-0.908458
3,0,35,-1.382090,3,0,0,1,0,-0.908458
4,0,35,-0.958114,4,0,0,1,0,-0.908458
...,...,...,...,...,...,...,...,...,...
49995,9999,16,-2.955536,0,1,0,0,0,-1.335178
49996,9999,13,-0.999088,1,2,0,0,0,-1.335178
49997,9999,88,-3.198988,2,1,0,4,0,-0.114185
49998,9999,88,-2.267574,3,1,0,4,0,-0.114185
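
To see what "making ids contiguous" means, here is a minimal sketch in plain Python. It assumes ids are assigned in order of first appearance; `make_contiguous` is a hypothetical helper, and the order in which BipartitePandas assigns ids may differ.

```python
def make_contiguous(values):
    """Map each distinct value to an integer id 0, 1, 2, ... in order of first appearance."""
    ids = {}
    # setdefault stores len(ids) the first time a value is seen, then reuses it
    return [ids.setdefault(value, len(ids)) for value in values]


alpha = [-0.967422, -0.967422, 0.430727, -0.967422, 0.0]
print(make_contiguous(alpha))
```

Each distinct `alpha` value becomes one integer label, which is why the `alpha` column in the output above holds integers rather than the original floats.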


## Collapsing data

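What if instead of collapsing by the `mean`, we want a column to collapse by `first`, or even to drop when we collapse? Then we should specify `custom_how_collapse_dict`! In the example below, `alpha` is set to `None`, so it is dropped when collapsing, while `psi` keeps the first value in each spell.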

In [7]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_how_collapse_dict={'alpha': None, 'psi': 'first'}).clean().collapse()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t1,t2,w,m,k,l,psi
0,0,35,-0.843736,0,4,5,0,1.0,0.0,-0.908458
1,1,9,0.230103,0,0,1,1,0.0,0.0,-1.335178
2,1,25,-2.173012,1,2,2,2,1.0,0.0,-0.908458
3,1,19,-1.923531,3,4,2,1,0.0,0.0,-1.335178
4,2,198,0.840433,0,0,1,1,9.0,2.0,1.335178
...,...,...,...,...,...,...,...,...,...,...
29781,9998,104,1.016412,4,4,1,1,5.0,3.0,0.114185
29782,9999,16,-2.955536,0,0,1,1,0.0,0.0,-1.335178
29783,9999,13,-0.999088,1,1,1,2,0.0,0.0,-1.335178
29784,9999,88,-2.733281,2,3,2,2,4.0,0.0,-0.114185
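
The collapse rules can be illustrated on worker 9999's four rows from the simulated data. This is a simplified sketch in plain Python, not the library's implementation: `collapse` is a hypothetical helper that assumes rows are already sorted by worker and time, averages `y`, and records the spell length as `w`, as the output above suggests.

```python
from itertools import groupby
from statistics import mean


def collapse(rows, how):
    """Collapse consecutive rows with the same (i, j) into one row per spell.

    `how` maps each custom column to 'mean', 'first', 'last', or None (drop).
    """
    aggregate = {"mean": mean, "first": lambda v: v[0], "last": lambda v: v[-1]}
    out = []
    for (i, j), spell in groupby(rows, key=lambda r: (r["i"], r["j"])):
        spell = list(spell)
        row = {
            "i": i,
            "j": j,
            "y": mean(r["y"] for r in spell),   # income is averaged over the spell
            "t1": spell[0]["t"],                # first period of the spell
            "t2": spell[-1]["t"],               # last period of the spell
            "w": len(spell),                    # number of observations collapsed
        }
        for col, rule in how.items():
            if rule is not None:                # None means the column is dropped
                row[col] = aggregate[rule]([r[col] for r in spell])
        out.append(row)
    return out


rows = [
    {"i": 9999, "j": 16, "y": -2.955536, "t": 0, "alpha": -0.967422, "psi": -1.335178},
    {"i": 9999, "j": 13, "y": -0.999088, "t": 1, "alpha": -0.967422, "psi": -1.335178},
    {"i": 9999, "j": 88, "y": -3.198988, "t": 2, "alpha": -0.967422, "psi": -0.114185},
    {"i": 9999, "j": 88, "y": -2.267574, "t": 3, "alpha": -0.967422, "psi": -0.114185},
]
for row in collapse(rows, {"alpha": None, "psi": "first"}):
    print(row)
```

The two rows at firm 88 merge into a single spell with `t1=2`, `t2=3`, and `w=2`, and `alpha` is absent from every collapsed row.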


## Converting between (collapsed) long and (collapsed) event study formats

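What if we don't want a column to split when converting to event study, or if we want it to drop during the conversion? Then we should specify `custom_long_es_split_dict`! In the example below, `alpha` is set to `False`, so it stays a single column, while `psi` is set to `None`, so it is dropped.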

In [8]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'], custom_long_es_split_dict={'alpha': False, 'psi': None}).clean().to_eventstudy()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j1,j2,y1,y2,t1,t2,m,alpha,k1,k2,l1,l2
0,0,35,35,-0.361932,-0.361932,0,0,0,-0.967422,1,1,0,0
1,0,35,35,-0.977417,-0.977417,1,1,0,-0.967422,1,1,0,0
2,0,35,35,-0.539126,-0.539126,2,2,0,-0.967422,1,1,0,0
3,0,35,35,-1.382090,-1.382090,3,3,0,-0.967422,1,1,0,0
4,0,35,35,-0.958114,-0.958114,4,4,0,-0.967422,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40650,9998,56,104,0.734139,1.016412,3,4,1,0.430727,2,5,3,3
40651,9999,16,13,-2.955536,-0.999088,0,1,1,-0.967422,0,0,0,0
40652,9999,13,88,-0.999088,-3.198988,1,2,1,-0.967422,0,4,0,0
40653,9999,88,88,-3.198988,-2.267574,2,3,0,-0.967422,4,4,0,0
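
The split rules can be sketched on worker 9999's rows. This is a deliberately simplified illustration in plain Python: `to_event_study` is a hypothetical helper that pairs each observation with the same worker's next one, keeps the first observation's value for unsplit columns (an assumption), and omits the `m` column. The library handles workers who never move differently, as the rows for worker 0 above show.

```python
def to_event_study(rows, split):
    """Pair each observation with the same worker's next observation.

    `split` maps each custom column to True (col1, col2), False (one column),
    or None (drop). Rows are assumed sorted by worker, then time.
    """
    out = []
    for cur, nxt in zip(rows, rows[1:]):
        if cur["i"] != nxt["i"]:
            continue  # never pair observations across workers
        row = {"i": cur["i"]}
        for col in ("j", "y", "t"):
            row[col + "1"], row[col + "2"] = cur[col], nxt[col]
        for col, rule in split.items():
            if rule is True:
                row[col + "1"], row[col + "2"] = cur[col], nxt[col]
            elif rule is False:
                row[col] = cur[col]  # assumption: keep the first observation's value
            # rule is None: the column is dropped
        out.append(row)
    return out


rows = [
    {"i": 9999, "j": 16, "y": -2.955536, "t": 0, "k": 0, "alpha": -0.967422, "psi": -1.335178},
    {"i": 9999, "j": 13, "y": -0.999088, "t": 1, "k": 0, "alpha": -0.967422, "psi": -1.335178},
    {"i": 9999, "j": 88, "y": -3.198988, "t": 2, "k": 4, "alpha": -0.967422, "psi": -0.114185},
    {"i": 9999, "j": 88, "y": -2.267574, "t": 3, "k": 4, "alpha": -0.967422, "psi": -0.114185},
]
for row in to_event_study(rows, {"k": True, "alpha": False, "psi": None}):
    print(row)
```

Here `k` splits into `k1` and `k2`, `alpha` stays a single column, and `psi` does not appear at all.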
