# Format examples

In [1]:
# Add BipartitePandas to system path, do not run this
# import sys
# sys.path.append('../../..')

## Import the BipartitePandas Package

Install it first with `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd


## Get your data ready

In this example, we simulate data, with parameters chosen to make data cleaning interesting.

In [3]:
df = bpd.SimBipartite(bpd.sim_params({'firm_size': 10, 'p_move': 0.05})).simulate()
display(df)

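|  | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 272 | -1.366183 | 0 | 1 | 2 | -0.430727 | -0.604585 |
| 1 | 0 | 272 | -1.424854 | 1 | 1 | 2 | -0.430727 | -0.604585 |
| 2 | 0 | 272 | -0.924350 | 2 | 1 | 2 | -0.430727 | -0.604585 |
| 3 | 0 | 272 | -1.269021 | 3 | 1 | 2 | -0.430727 | -0.604585 |
| 4 | 0 | 272 | -2.009270 | 4 | 1 | 2 | -0.430727 | -0.604585 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 804 | 1.285949 | 0 | 3 | 8 | 0.430727 | 0.908458 |
| 49996 | 9999 | 804 | 3.005898 | 1 | 3 | 8 | 0.430727 | 0.908458 |
| 49997 | 9999 | 804 | 1.722347 | 2 | 3 | 8 | 0.430727 | 0.908458 |
| 49998 | 9999 | 804 | -0.127220 | 3 | 3 | 8 | 0.430727 | 0.908458 |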


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)
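
As a rough illustration of the column rules above, the following sketch checks a single observation, given as a plain dictionary, against the required columns and the listed types. The function name `check_observation` is hypothetical and this is not how BipartitePandas validates data internally; it only restates the list in executable form.

```python
from numbers import Real

# Required general columns, per the list above
REQUIRED = ("i", "j", "y")
# Columns with a stated type; `i`, `j`, and `g` accept any type
EXPECTED_TYPES = {"y": Real, "t": int, "w": Real, "m": int}

def check_observation(row):
    """Return a list of problems found in one observation (hypothetical helper)."""
    problems = [f"missing required column '{c}'" for c in REQUIRED if c not in row]
    for col, expected in EXPECTED_TYPES.items():
        if col not in row:
            continue
        value = row[col]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"column '{col}' has unexpected type {type(value).__name__}")
    return problems

print(check_observation({"i": 0, "j": 272, "y": -1.366183, "t": 0}))  # []
print(check_observation({"i": 0, "y": "high", "t": 0.5}))
```

The second call reports three problems: `j` is missing, `y` is a string, and `t` is a float.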

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

Each format splits the general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`
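
The naming pattern in this list can be written down as a small lookup. The sketch below is only a restatement of the list: `SUFFIXES` and `expand_columns` are hypothetical names, not part of the BipartitePandas API.

```python
# Suffixes each general column takes in each format, following the list above.
# `i` and `m` are never split, so they fall through to the default (no suffix).
SUFFIXES = {
    "long": {},
    "collapsed long": {"t": ["1", "2"]},
    "event study": {c: ["1", "2"] for c in "jytgw"},
    "collapsed event study": {
        **{c: ["1", "2"] for c in "jygw"},
        "t": ["11", "12", "21", "22"],
    },
}

def expand_columns(fmt, general_columns):
    """Expand general column names into the names used by format `fmt`."""
    expanded = []
    for col in general_columns:
        suffixes = SUFFIXES[fmt].get(col, [""])
        expanded.extend(col + s for s in suffixes)
    return expanded

print(expand_columns("event study", ["i", "j", "y", "t", "m"]))
# ['i', 'j1', 'j2', 'y1', 'y2', 't1', 't2', 'm']
print(expand_columns("collapsed event study", ["i", "t"]))
# ['i', 't11', 't12', 't21', 't22']
```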

## Constructing DataFrames

Our simulated data is in *Long* format. How do we construct a *Long* dataframe?

In [4]:
i = df['i']
j = df['j']
y = df['y']
t = df['t']
bdf_long = bpd.BipartiteDataFrame(i=i, j=j, y=y, t=t)
display(bdf_long)

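|  | i | j | y | t |
|---|---|---|---|---|
| 0 | 0 | 272 | -1.366183 | 0 |
| 1 | 0 | 272 | -1.424854 | 1 |
| 2 | 0 | 272 | -0.924350 | 2 |
| 3 | 0 | 272 | -1.269021 | 3 |
| 4 | 0 | 272 | -2.009270 | 4 |
| ... | ... | ... | ... | ... |
| 49995 | 9999 | 804 | 1.285949 | 0 |
| 49996 | 9999 | 804 | 3.005898 | 1 |
| 49997 | 9999 | 804 | 1.722347 | 2 |
| 49998 | 9999 | 804 | -0.127220 | 3 |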


Are we sure this is long? Let's check the datatype:

In [5]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

This method works to construct any format! Just make sure not to mix up columns between formats.
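
To see why mixing up columns matters, it can help to think of the column names as the only signal of which format is meant. The sketch below guesses the format from the names alone. It is an assumption-laden illustration (it assumes time columns are supplied, and `infer_format` is a hypothetical name), not the logic `BipartiteDataFrame` actually uses.

```python
def infer_format(column_names):
    """Guess the format from column names (illustrative only)."""
    names = set(column_names)
    # Event study formats carry two firm ids per row
    event_study = "j1" in names
    # Collapsed formats carry a start and an end period for each spell
    collapsed = ("t11" in names) if event_study else ("t1" in names)
    if event_study:
        return "collapsed event study" if collapsed else "event study"
    return "collapsed long" if collapsed else "long"

print(infer_format(["i", "j", "y", "t"]))                    # long
print(infer_format(["i", "j", "y", "t1", "t2"]))             # collapsed long
print(infer_format(["i", "j1", "j2", "y1", "y2", "t1", "t2"]))  # event study
```

Note that `t1` and `t2` appear in both *Collapsed Long* and *Event Study*, so it is the firm columns (`j` versus `j1`, `j2`) that tell the two apart in this sketch.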

## Converting between formats

Converting between formats is meant to be easy. Methods exist to go from:

- *Long* to *Collapsed Long* (`.collapse()`)
- *Long* to *Event Study* (`.to_eventstudy()`)
- *Collapsed Long* to *Long* (`.uncollapse()`)
- *Collapsed Long* to *Collapsed Event Study* (`.to_eventstudy()`)
- *Event Study* to *Long* (`.to_long()`)
- *Collapsed Event Study* to *Collapsed Long* (`.to_long()`)
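
The six conversions above form a small graph, so a conversion that has no direct method can be reached by chaining. The following sketch encodes the list as a dictionary and finds the shortest chain with a breadth-first search; `CONVERSIONS` and `conversion_path` are hypothetical helpers for illustration, not library functions.

```python
from collections import deque

# source format -> {method name: resulting format}, taken from the list above
CONVERSIONS = {
    "long": {"collapse": "collapsed long", "to_eventstudy": "event study"},
    "collapsed long": {"uncollapse": "long", "to_eventstudy": "collapsed event study"},
    "event study": {"to_long": "long"},
    "collapsed event study": {"to_long": "collapsed long"},
}

def conversion_path(source, target):
    """Return the shortest list of method names leading from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for method, result in CONVERSIONS[fmt].items():
            if result not in seen:
                seen.add(result)
                queue.append((result, path + [method]))
    return None  # no chain of conversions reaches the target

print(conversion_path("long", "collapsed event study"))
# ['collapse', 'to_eventstudy']
print(conversion_path("event study", "collapsed event study"))
# ['to_long', 'collapse', 'to_eventstudy']
```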

Let's experiment with these and see what happens. Before we start, we just need to clean our data to make sure the conversions work properly (notice the new `m` column).

In [6]:
bdf_long = bdf_long.clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


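|  | i | j | y | t | m |
|---|---|---|---|---|---|
| 0 | 0 | 272 | -1.366183 | 0 | 0 |
| 1 | 0 | 272 | -1.424854 | 1 | 0 |
| 2 | 0 | 272 | -0.924350 | 2 | 0 |
| 3 | 0 | 272 | -1.269021 | 3 | 0 |
| 4 | 0 | 272 | -2.009270 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 804 | 1.285949 | 0 | 0 |
| 49996 | 9999 | 804 | 3.005898 | 1 | 0 |
| 49997 | 9999 | 804 | 1.722347 | 2 | 0 |
| 49998 | 9999 | 804 | -0.127220 | 3 | 0 |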


#### *Long* to *Collapsed Long*

Notice that:

- `t` splits into `t1` and `t2`, which indicate the start and end of the spell, respectively
- `w` is new - it gives the number of observations in the spell

In [7]:
bdf_collapsedlong = bdf_long.collapse()
display(bdf_collapsedlong)

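|  | i | j | y | t1 | t2 | w | m |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 272 | -1.398736 | 0 | 4 | 5 | 0 |
| 1 | 1 | 743 | 0.873573 | 0 | 4 | 5 | 0 |
| 2 | 2 | 519 | 0.532657 | 0 | 4 | 5 | 0 |
| 3 | 3 | 526 | -0.356654 | 0 | 4 | 5 | 0 |
| 4 | 4 | 3 | -1.906020 | 0 | 4 | 5 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11941 | 9995 | 3 | -2.174325 | 0 | 4 | 5 | 0 |
| 11942 | 9996 | 739 | 0.659037 | 0 | 4 | 5 | 0 |
| 11943 | 9997 | 817 | 1.626558 | 0 | 4 | 5 | 0 |
| 11944 | 9998 | 38 | -2.212273 | 0 | 4 | 5 | 0 |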
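
To make the collapsing step concrete, here is a minimal pure-Python sketch of what the output above suggests: consecutive observations of a worker at the same firm are merged into one spell, `y` becomes the spell mean, `t1` and `t2` mark the first and last period, and `w` counts the observations. This is an illustration under those assumptions, not the library's implementation, which may treat weights and the `m` column differently. Worker 0's values come from the data above; worker 1 is made up.

```python
from itertools import groupby

def collapse_spells(rows):
    """Collapse long-format rows (dicts with i, j, y, t) into one row per spell."""
    rows = sorted(rows, key=lambda r: (r["i"], r["t"]))
    collapsed = []
    # After sorting, consecutive rows with the same (i, j) belong to one spell
    for (i, j), spell in groupby(rows, key=lambda r: (r["i"], r["j"])):
        spell = list(spell)
        collapsed.append({
            "i": i,
            "j": j,
            "y": sum(r["y"] for r in spell) / len(spell),  # mean income over the spell
            "t1": spell[0]["t"],   # first period of the spell
            "t2": spell[-1]["t"],  # last period of the spell
            "w": len(spell),       # number of observations in the spell
        })
    return collapsed

# Worker 0: the five observations shown earlier (a stayer at firm 272)
ys = [-1.366183, -1.424854, -0.924350, -1.269021, -2.009270]
rows = [{"i": 0, "j": 272, "y": y, "t": t} for t, y in enumerate(ys)]
# Worker 1: hypothetical mover, two periods at firm 10 and then one at firm 20
rows += [
    {"i": 1, "j": 10, "y": 1.0, "t": 0},
    {"i": 1, "j": 10, "y": 3.0, "t": 1},
    {"i": 1, "j": 20, "y": 5.0, "t": 2},
]

for spell in collapse_spells(rows):
    print(spell["i"], spell["j"], round(spell["y"], 6), spell["t1"], spell["t2"], spell["w"])
```

Worker 0's single spell averages to about -1.398736 with `t1 = 0`, `t2 = 4`, and `w = 5`, which matches the first row of the collapsed output.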


#### *Long* to *Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t` splits into `t1` and `t2`, which indicate the first and second period in the event study, respectively

*Hint:* for stayers (individuals who stay at the same firm for all their observations), each row in the event study represents a single observation, since they never move firms.

In [8]:
bdf_eventstudy = bdf_long.to_eventstudy()
display(bdf_eventstudy)

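|  | i | j1 | j2 | y1 | y2 | t1 | t2 | m |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 272 | 272 | -1.366183 | -1.366183 | 0 | 0 | 0 |
| 1 | 0 | 272 | 272 | -1.424854 | -1.424854 | 1 | 1 | 0 |
| 2 | 0 | 272 | 272 | -0.924350 | -0.924350 | 2 | 2 | 0 |
| 3 | 0 | 272 | 272 | -1.269021 | -1.269021 | 3 | 3 | 0 |
| 4 | 0 | 272 | 272 | -2.009270 | -2.009270 | 4 | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48189 | 9999 | 804 | 804 | 1.285949 | 1.285949 | 0 | 0 | 0 |
| 48190 | 9999 | 804 | 804 | 3.005898 | 3.005898 | 1 | 1 | 0 |
| 48191 | 9999 | 804 | 804 | 1.722347 | 1.722347 | 2 | 2 | 0 |
| 48192 | 9999 | 804 | 804 | -0.127220 | -0.127220 | 3 | 3 | 0 |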
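
The pairing logic can be sketched in a few lines of plain Python. For stayers, the output above shows each observation paired with itself. For movers, this sketch assumes that each row pairs two consecutive observations, following the format description; that is an assumption about the library's behavior rather than something shown in the output, and the `m` column is left out. The function name and the sample data are hypothetical.

```python
from itertools import groupby

def to_event_study(rows):
    """Turn long-format rows (dicts with i, j, y, t) into event-study rows."""
    rows = sorted(rows, key=lambda r: (r["i"], r["t"]))
    out = []
    for i, obs in groupby(rows, key=lambda r: r["i"]):
        obs = list(obs)
        stayer = len({r["j"] for r in obs}) == 1
        # Stayers: pair each observation with itself.
        # Movers (assumed): pair each observation with the next one.
        pairs = [(r, r) for r in obs] if stayer else list(zip(obs, obs[1:]))
        for first, second in pairs:
            out.append({
                "i": i,
                "j1": first["j"], "j2": second["j"],
                "y1": first["y"], "y2": second["y"],
                "t1": first["t"], "t2": second["t"],
            })
    return out

rows = [
    # Stayer: first two observations of worker 0 from the data above
    {"i": 0, "j": 272, "y": -1.366183, "t": 0},
    {"i": 0, "j": 272, "y": -1.424854, "t": 1},
    # Hypothetical mover: firm 743 for two periods, then firm 519
    {"i": 1, "j": 743, "y": 0.5, "t": 0},
    {"i": 1, "j": 743, "y": 0.7, "t": 1},
    {"i": 1, "j": 519, "y": 1.2, "t": 2},
]

for row in to_event_study(rows):
    print(row)
```

Under this pairing, a stayer with n observations contributes n rows and a mover with n observations contributes n - 1 rows.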


#### *Collapsed Long* to *Collapsed Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t1` splits into `t11` and `t12`, which indicate the start and end of the spell for the first observation in the event study, respectively
- `t2` splits into `t21` and `t22`, which indicate the start and end of the spell for the second observation in the event study, respectively
- `w` splits into `w1` and `w2`, which indicate the number of observations in the first and second spell in the event study, respectively

In [9]:
bdf_collapsedeventstudy = bdf_collapsedlong.to_eventstudy()
display(bdf_collapsedeventstudy)

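|  | i | j1 | j2 | y1 | y2 | t11 | t12 | t21 | t22 | w1 | w2 | m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 272 | 272 | -1.398736 | -1.398736 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| 1 | 1 | 743 | 743 | 0.873573 | 0.873573 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| 2 | 2 | 519 | 519 | 0.532657 | 0.532657 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| 3 | 3 | 526 | 526 | -0.356654 | -0.356654 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| 4 | 4 | 3 | 3 | -1.906020 | -1.906020 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10135 | 9995 | 3 | 3 | -2.174325 | -2.174325 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| 10136 | 9996 | 739 | 739 | 0.659037 | 0.659037 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| 10137 | 9997 | 817 | 817 | 1.626558 | 1.626558 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |
| 10138 | 9998 | 38 | 38 | -2.212273 | -2.212273 | 0 | 4 | 0 | 4 | 5 | 5 | 0 |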


We showed how to get from *Long* to any other format, but feel free to experiment and see what happens when you convert in other directions!
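
As one example of going in the other direction, the sketch below turns event-study rows back into long rows by emitting both halves of each row and dropping repeats. It assumes at most one observation per worker and period, which the cleaning step's handling of i-t duplicates suggests. The function `event_study_to_long` is a hypothetical illustration, not the library's `.to_long()` implementation.

```python
def event_study_to_long(rows):
    """Recover long-format rows (i, j, y, t) from event-study rows."""
    seen = set()
    out = []
    for r in rows:
        for half in ("1", "2"):
            key = (r["i"], r["t" + half])
            # Each observation can appear in up to two event-study rows; keep it once
            if key not in seen:
                seen.add(key)
                out.append({
                    "i": r["i"],
                    "j": r["j" + half],
                    "y": r["y" + half],
                    "t": r["t" + half],
                })
    return sorted(out, key=lambda o: (o["i"], o["t"]))

event_study = [
    # Stayer rows for worker 0, as in the event-study output above
    {"i": 0, "j1": 272, "j2": 272, "y1": -1.366183, "y2": -1.366183, "t1": 0, "t2": 0},
    {"i": 0, "j1": 272, "j2": 272, "y1": -1.424854, "y2": -1.424854, "t1": 1, "t2": 1},
    # Hypothetical mover rows for worker 1
    {"i": 1, "j1": 743, "j2": 743, "y1": 0.5, "y2": 0.7, "t1": 0, "t2": 1},
    {"i": 1, "j1": 743, "j2": 519, "y1": 0.7, "y2": 1.2, "t1": 1, "t2": 2},
]

for row in event_study_to_long(event_study):
    print(row)
```

The four event-study rows yield five long rows: two for worker 0 and three for worker 1.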

## Initializing from different formats

If your data is saved in a format other than *Long*, it's simple to construct a BipartiteDataFrame.

#### Initializing from *Collapsed Long* format

In [10]:
i = bdf_collapsedlong['i']
j = bdf_collapsedlong['j']
y = bdf_collapsedlong['y']
t1 = bdf_collapsedlong['t1']
t2 = bdf_collapsedlong['t2']
bdf_collapsedlong = bpd.BipartiteDataFrame(i=i, j=j, y=y, t1=t1, t2=t2)
display(bdf_collapsedlong)

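|  | i | j | y | t1 | t2 |
|---|---|---|---|---|---|
| 0 | 0 | 272 | -1.398736 | 0 | 4 |
| 1 | 1 | 743 | 0.873573 | 0 | 4 |
| 2 | 2 | 519 | 0.532657 | 0 | 4 |
| 3 | 3 | 526 | -0.356654 | 0 | 4 |
| 4 | 4 | 3 | -1.906020 | 0 | 4 |
| ... | ... | ... | ... | ... | ... |
| 11941 | 9995 | 3 | -2.174325 | 0 | 4 |
| 11942 | 9996 | 739 | 0.659037 | 0 | 4 |
| 11943 | 9997 | 817 | 1.626558 | 0 | 4 |
| 11944 | 9998 | 38 | -2.212273 | 0 | 4 |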


Let's check the datatype:

In [11]:
type(bdf_collapsedlong)

bipartitepandas.bipartitelongcollapsed.BipartiteLongCollapsed

#### Initializing from *Event Study* format

In [12]:
i = bdf_eventstudy['i']
j1 = bdf_eventstudy['j1']
j2 = bdf_eventstudy['j2']
y1 = bdf_eventstudy['y1']
y2 = bdf_eventstudy['y2']
t1 = bdf_eventstudy['t1']
t2 = bdf_eventstudy['t2']
bdf_eventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t1=t1, t2=t2)
display(bdf_eventstudy)

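|  | i | j1 | j2 | y1 | y2 | t1 | t2 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 272 | 272 | -1.366183 | -1.366183 | 0 | 0 |
| 1 | 0 | 272 | 272 | -1.424854 | -1.424854 | 1 | 1 |
| 2 | 0 | 272 | 272 | -0.924350 | -0.924350 | 2 | 2 |
| 3 | 0 | 272 | 272 | -1.269021 | -1.269021 | 3 | 3 |
| 4 | 0 | 272 | 272 | -2.009270 | -2.009270 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 48189 | 9999 | 804 | 804 | 1.285949 | 1.285949 | 0 | 0 |
| 48190 | 9999 | 804 | 804 | 3.005898 | 3.005898 | 1 | 1 |
| 48191 | 9999 | 804 | 804 | 1.722347 | 1.722347 | 2 | 2 |
| 48192 | 9999 | 804 | 804 | -0.127220 | -0.127220 | 3 | 3 |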


Let's check the datatype:

In [13]:
type(bdf_eventstudy)

bipartitepandas.bipartiteeventstudy.BipartiteEventStudy

#### Initializing from *Collapsed Event Study* format

In [14]:
i = bdf_collapsedeventstudy['i']
j1 = bdf_collapsedeventstudy['j1']
j2 = bdf_collapsedeventstudy['j2']
y1 = bdf_collapsedeventstudy['y1']
y2 = bdf_collapsedeventstudy['y2']
t11 = bdf_collapsedeventstudy['t11']
t12 = bdf_collapsedeventstudy['t12']
t21 = bdf_collapsedeventstudy['t21']
t22 = bdf_collapsedeventstudy['t22']
bdf_collapsedeventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t11=t11, t12=t12, t21=t21, t22=t22)
display(bdf_collapsedeventstudy)

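|  | i | j1 | j2 | y1 | y2 | t11 | t12 | t21 | t22 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 272 | 272 | -1.398736 | -1.398736 | 0 | 4 | 0 | 4 |
| 1 | 1 | 743 | 743 | 0.873573 | 0.873573 | 0 | 4 | 0 | 4 |
| 2 | 2 | 519 | 519 | 0.532657 | 0.532657 | 0 | 4 | 0 | 4 |
| 3 | 3 | 526 | 526 | -0.356654 | -0.356654 | 0 | 4 | 0 | 4 |
| 4 | 4 | 3 | 3 | -1.906020 | -1.906020 | 0 | 4 | 0 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10135 | 9995 | 3 | 3 | -2.174325 | -2.174325 | 0 | 4 | 0 | 4 |
| 10136 | 9996 | 739 | 739 | 0.659037 | 0.659037 | 0 | 4 | 0 | 4 |
| 10137 | 9997 | 817 | 817 | 1.626558 | 1.626558 | 0 | 4 | 0 | 4 |
| 10138 | 9998 | 38 | 38 | -2.212273 | -2.212273 | 0 | 4 | 0 | 4 |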


Let's check the datatype:

In [15]:
type(bdf_collapsedeventstudy)

bipartitepandas.bipartiteeventstudycollapsed.BipartiteEventStudyCollapsed