# Formats

In [1]:
# Add BipartitePandas to system path, do not run this
# import sys
# sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd


## Get your data ready

In this example, we simulate data.

In [3]:
df = bpd.SimBipartite().simulate()
display(df)

Unnamed: 0,i,j,y,t,l,k,alpha,psi
0,0,144,-1.101778,0,2,7,0.000000,0.604585
1,0,144,-0.592045,1,2,7,0.000000,0.604585
2,0,144,-0.469552,2,2,7,0.000000,0.604585
3,0,192,1.612450,3,2,9,0.000000,1.335178
4,0,192,1.279063,4,2,9,0.000000,1.335178
...,...,...,...,...,...,...,...,...
49995,9999,142,0.062949,0,4,7,0.967422,0.604585
49996,9999,142,1.171816,1,4,7,0.967422,0.604585
49997,9999,170,2.137745,2,4,8,0.967422,0.908458
49998,9999,170,3.491560,3,4,8,0.967422,0.908458


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

Each format splits the general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`

## Constructing DataFrames

Our simulated data is in *Long* format. How do we construct a *Long* dataframe?

In [4]:
i = df['i']
j = df['j']
y = df['y']
t = df['t']
bdf_long = bpd.BipartiteDataFrame(i=i, j=j, y=y, t=t)
display(bdf_long)

Unnamed: 0,i,j,y,t
0,0,144,-1.101778,0
1,0,144,-0.592045,1
2,0,144,-0.469552,2
3,0,192,1.612450,3
4,0,192,1.279063,4
...,...,...,...,...
49995,9999,142,0.062949,0
49996,9999,142,1.171816,1
49997,9999,170,2.137745,2
49998,9999,170,3.491560,3


Is this really a *Long* dataframe? Let's check the datatype:

In [5]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

The same constructor works for any format! The column names you pass in determine which format you get, so make sure not to mix up columns between formats.
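
To see why mixing columns is a problem, here is a minimal sketch in plain Python that checks which format a set of column names fits. The column sets are copied from the lists in the Formats section; `FORMAT_COLUMNS` and `matching_formats` are hypothetical names used for illustration, and this is not how BipartitePandas itself decides on a format.

```python
# Column names per format, copied from the lists in the "Formats" section.
FORMAT_COLUMNS = {
    "Long": {"i", "j", "y", "t", "g", "w", "m"},
    "Collapsed Long": {"i", "j", "y", "t1", "t2", "g", "w", "m"},
    "Event Study": {
        "i", "j1", "j2", "y1", "y2", "t1", "t2", "g1", "g2", "w1", "w2", "m",
    },
    "Collapsed Event Study": {
        "i", "j1", "j2", "y1", "y2", "t11", "t12", "t21", "t22",
        "g1", "g2", "w1", "w2", "m",
    },
}


def matching_formats(columns):
    """Return the formats whose column list contains every given name."""
    columns = set(columns)
    return [name for name, allowed in FORMAT_COLUMNS.items() if columns <= allowed]


print(matching_formats(["i", "j", "y", "t"]))         # ['Long']
print(matching_formats(["i", "j", "y", "t1", "t2"]))  # ['Collapsed Long']
print(matching_formats(["i", "j1", "j2", "y", "t"]))  # [] (columns are mixed)
```

An empty result means the columns come from more than one format, which is exactly the mix-up to avoid.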

## Converting between formats

Converting between formats is meant to be easy. Methods exist to go from:

- *Long* to *Collapsed Long* (`.collapse()`)
- *Long* to *Event Study* (`.to_eventstudy()`)
- *Collapsed Long* to *Long* (`.uncollapse()`)
- *Collapsed Long* to *Collapsed Event Study* (`.to_eventstudy()`)
- *Event Study* to *Long* (`.to_long()`)
- *Collapsed Event Study* to *Collapsed Long* (`.to_long()`)
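
Some pairs of formats have no direct method, but the conversions listed above can be chained. The sketch below treats that list as a small graph and uses a breadth-first search to find the shortest chain of method names. The method names come from the list above; `CONVERSIONS` and `conversion_path` are hypothetical helpers, not part of BipartitePandas.

```python
from collections import deque

# Each format maps to (method name, resulting format), as listed above.
CONVERSIONS = {
    "Long": [("collapse", "Collapsed Long"), ("to_eventstudy", "Event Study")],
    "Collapsed Long": [
        ("uncollapse", "Long"),
        ("to_eventstudy", "Collapsed Event Study"),
    ],
    "Event Study": [("to_long", "Long")],
    "Collapsed Event Study": [("to_long", "Collapsed Long")],
}


def conversion_path(source, target):
    """Return the shortest list of method names leading from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, methods = queue.popleft()
        if fmt == target:
            return methods
        for method, next_fmt in CONVERSIONS[fmt]:
            if next_fmt not in seen:
                seen.add(next_fmt)
                queue.append((next_fmt, methods + [method]))
    return None  # no chain of conversions reaches the target


print(conversion_path("Long", "Collapsed Event Study"))
# ['collapse', 'to_eventstudy']
print(conversion_path("Event Study", "Collapsed Event Study"))
# ['to_long', 'collapse', 'to_eventstudy']
```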

Let's experiment with these and see what happens. Before we start, we just need to clean our data to make sure the conversions work properly (notice the new `m` column).

In [6]:
bdf_long = bdf_long.clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m
0,0,144,-1.101778,0,0
1,0,144,-0.592045,1,0
2,0,144,-0.469552,2,1
3,0,192,1.612450,3,1
4,0,192,1.279063,4,0
...,...,...,...,...,...
49995,9999,142,0.062949,0,0
49996,9999,142,1.171816,1,1
49997,9999,170,2.137745,2,1
49998,9999,170,3.491560,3,0
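
One way to read the new `m` column: in the rows shown above, an observation has `m = 1` when the worker's firm differs from the firm in the neighboring period. The sketch below reproduces that pattern for worker 0 in plain Python. This rule is inferred from the displayed output only, and `move_counts` is a hypothetical helper, not the library's implementation.

```python
def move_counts(firms):
    """For one worker's time-ordered firm ids, count how many of the
    neighboring observations (previous and next) are at a different firm."""
    counts = []
    for pos, firm in enumerate(firms):
        changed_from_prev = pos > 0 and firms[pos - 1] != firm
        changed_to_next = pos < len(firms) - 1 and firms[pos + 1] != firm
        counts.append(int(changed_from_prev) + int(changed_to_next))
    return counts


# Worker 0's firm ids for t = 0..4, taken from the cleaned output above.
print(move_counts([144, 144, 144, 192, 192]))  # [0, 0, 1, 1, 0]
```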


#### *Long* to *Collapsed Long*

Notice that:

- `t` splits into `t1` and `t2`, which indicate the start and end of the spell, respectively
- `w` is new - it gives the number of observations in the spell

In [7]:
bdf_collapsedlong = bdf_long.collapse()
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2,w,m
0,0,144,-0.721125,0,2,3,1
1,0,192,1.445756,3,4,2,1
2,1,77,-0.500109,0,1,2,1
3,1,68,-2.108956,2,2,1,2
4,1,169,2.093587,3,4,2,1
...,...,...,...,...,...,...,...
29853,9997,54,0.350283,3,4,2,1
29854,9998,66,-1.213496,0,0,1,1
29855,9998,0,-1.830369,1,4,4,1
29856,9999,142,0.617383,0,1,2,1
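
To make the collapsing step concrete, here is a minimal plain-Python sketch applied to worker 0's five rows from the cleaned *Long* output. It assumes rows are already sorted by worker and time, and that `y` for a spell is the mean of its observations, which is consistent with the values displayed above. `collapse_spells` is a hypothetical helper, not the code behind `.collapse()`.

```python
from itertools import groupby

# Worker 0's Long-format rows from the cleaned output: (i, j, y, t).
rows = [
    (0, 144, -1.101778, 0),
    (0, 144, -0.592045, 1),
    (0, 144, -0.469552, 2),
    (0, 192, 1.612450, 3),
    (0, 192, 1.279063, 4),
]


def collapse_spells(rows):
    """Merge consecutive rows that share the same (i, j) into one spell."""
    spells = []
    for (i, j), group in groupby(rows, key=lambda r: (r[0], r[1])):
        group = list(group)
        spells.append({
            "i": i,
            "j": j,
            "y": sum(r[2] for r in group) / len(group),  # assumed: spell mean
            "t1": group[0][3],   # first period of the spell
            "t2": group[-1][3],  # last period of the spell
            "w": len(group),     # number of observations in the spell
        })
    return spells


spells = collapse_spells(rows)
for spell in spells:
    print(spell)
```

The two resulting spells have `t1`, `t2`, and `w` equal to `0, 2, 3` and `3, 4, 2`, matching the first two rows of the output above.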


#### *Long* to *Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t` splits into `t1` and `t2`, which indicate the first and second period in the event study, respectively

*Hint:* for stayers (individuals who stay at the same firm for all their observations), each row in the event study represents a single observation, since they never move firms.

In [8]:
bdf_eventstudy = bdf_long.to_eventstudy()
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2,m
0,0,144,144,-1.101778,-0.592045,0,1,0
1,0,144,144,-0.592045,-0.469552,1,2,0
2,0,144,192,-0.469552,1.612450,2,3,1
3,0,192,192,1.612450,1.279063,3,4,0
4,1,77,77,-0.491114,-0.509105,0,1,0
...,...,...,...,...,...,...,...,...
40620,9998,0,0,-1.419349,-2.380551,3,4,0
40621,9999,142,142,0.062949,1.171816,0,1,0
40622,9999,142,170,1.171816,2.137745,1,2,1
40623,9999,170,170,2.137745,3.491560,2,3,0
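
The pairing logic can be sketched in plain Python for a single worker. The sketch assumes that movers get one row per pair of consecutive observations, that stayers get one row per observation with both halves identical (following the hint above), and that `m` is 1 when `j1` and `j2` differ, which matches the rows displayed. `to_event_study_rows` is a hypothetical helper, not the code behind `.to_eventstudy()`.

```python
def to_event_study_rows(observations):
    """observations: one worker's time-ordered (j, y, t) tuples."""
    firms = {j for j, _, _ in observations}
    if len(firms) == 1:
        # Stayer: each row holds a single observation in both halves.
        pairs = [(obs, obs) for obs in observations]
    else:
        # Mover: each row holds two consecutive observations.
        pairs = list(zip(observations, observations[1:]))
    return [
        {
            "j1": first[0], "j2": second[0],
            "y1": first[1], "y2": second[1],
            "t1": first[2], "t2": second[2],
            "m": int(first[0] != second[0]),
        }
        for first, second in pairs
    ]


# Worker 0's observations from the cleaned Long output: (j, y, t).
worker_0 = [
    (144, -1.101778, 0),
    (144, -0.592045, 1),
    (144, -0.469552, 2),
    (192, 1.612450, 3),
    (192, 1.279063, 4),
]
event_rows = to_event_study_rows(worker_0)
print([row["m"] for row in event_rows])  # [0, 0, 1, 0]
```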


#### *Collapsed Long* to *Collapsed Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t1` and `t2` split into `t11`, `t12`, `t21`, and `t22`: `t11` and `t12` indicate the start and end of the spell for the first observation in the event study, while `t21` and `t22` indicate the start and end of the spell for the second observation
- `w` splits into `w1` and `w2`, which indicate the number of observations in the first and second spell in the event study, respectively

In [9]:
bdf_collapsedeventstudy = bdf_collapsedlong.to_eventstudy()
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22,w1,w2,m
0,0,144,192,-0.721125,1.445756,0,2,3,4,3,2,1
1,1,77,68,-0.500109,-2.108956,0,1,2,2,2,1,1
2,1,68,169,-2.108956,2.093587,2,2,3,4,1,2,1
3,2,44,41,-1.907822,-1.794051,0,3,4,4,4,1,1
4,3,124,105,0.030321,0.015422,0,0,1,4,1,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...
20478,9996,123,32,0.477752,0.362093,2,3,4,4,2,1,1
20479,9997,118,86,1.928970,1.605210,0,0,1,2,1,2,1
20480,9997,86,54,1.605210,0.350283,1,2,3,4,2,2,1
20481,9998,66,0,-1.213496,-1.830369,0,0,1,4,1,4,1
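
The column renaming is easier to follow on a concrete pair of spells. The sketch below combines worker 1's first two *Collapsed Long* rows into one *Collapsed Event Study* row in plain Python. `pair_spells` is a hypothetical helper for illustration, and the rule for `m` (1 when the two firms differ) is inferred from the displayed output.

```python
def pair_spells(first, second):
    """Combine two consecutive spells of one worker into a single row."""
    row = {"i": first["i"]}
    for name in ("j", "y", "w"):
        row[name + "1"] = first[name]   # value from the first spell
        row[name + "2"] = second[name]  # value from the second spell
    row["t11"], row["t12"] = first["t1"], first["t2"]    # first spell start, end
    row["t21"], row["t22"] = second["t1"], second["t2"]  # second spell start, end
    row["m"] = int(first["j"] != second["j"])
    return row


# Worker 1's first two spells from the Collapsed Long output above.
spell_a = {"i": 1, "j": 77, "y": -0.500109, "t1": 0, "t2": 1, "w": 2}
spell_b = {"i": 1, "j": 68, "y": -2.108956, "t1": 2, "t2": 2, "w": 1}
paired = pair_spells(spell_a, spell_b)
print(paired)
```

The result lines up with the second row of the output above: `t11, t12, t21, t22` are `0, 1, 2, 2` and `w1, w2` are `2, 1`.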


We showed how to get from *Long* to any other format, but feel free to experiment and see what happens when you convert in other directions!

## Initializing from different formats

If your data is saved in a format other than *Long*, it's simple to construct a BipartiteDataFrame.

#### Initializing from *Collapsed Long* format

In [10]:
i = bdf_collapsedlong['i']
j = bdf_collapsedlong['j']
y = bdf_collapsedlong['y']
t1 = bdf_collapsedlong['t1']
t2 = bdf_collapsedlong['t2']
bdf_collapsedlong = bpd.BipartiteDataFrame(i=i, j=j, y=y, t1=t1, t2=t2)
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2
0,0,144,-0.721125,0,2
1,0,192,1.445756,3,4
2,1,77,-0.500109,0,1
3,1,68,-2.108956,2,2
4,1,169,2.093587,3,4
...,...,...,...,...,...
29853,9997,54,0.350283,3,4
29854,9998,66,-1.213496,0,0
29855,9998,0,-1.830369,1,4
29856,9999,142,0.617383,0,1


Let's check the datatype:

In [11]:
type(bdf_collapsedlong)

bipartitepandas.bipartitelongcollapsed.BipartiteLongCollapsed

#### Initializing from *Event Study* format

In [12]:
i = bdf_eventstudy['i']
j1 = bdf_eventstudy['j1']
j2 = bdf_eventstudy['j2']
y1 = bdf_eventstudy['y1']
y2 = bdf_eventstudy['y2']
t1 = bdf_eventstudy['t1']
t2 = bdf_eventstudy['t2']
bdf_eventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t1=t1, t2=t2)
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2
0,0,144,144,-1.101778,-0.592045,0,1
1,0,144,144,-0.592045,-0.469552,1,2
2,0,144,192,-0.469552,1.612450,2,3
3,0,192,192,1.612450,1.279063,3,4
4,1,77,77,-0.491114,-0.509105,0,1
...,...,...,...,...,...,...,...
40620,9998,0,0,-1.419349,-2.380551,3,4
40621,9999,142,142,0.062949,1.171816,0,1
40622,9999,142,170,1.171816,2.137745,1,2
40623,9999,170,170,2.137745,3.491560,2,3


Let's check the datatype:

In [13]:
type(bdf_eventstudy)

bipartitepandas.bipartiteeventstudy.BipartiteEventStudy

#### Initializing from *Collapsed Event Study* format

In [14]:
i = bdf_collapsedeventstudy['i']
j1 = bdf_collapsedeventstudy['j1']
j2 = bdf_collapsedeventstudy['j2']
y1 = bdf_collapsedeventstudy['y1']
y2 = bdf_collapsedeventstudy['y2']
t11 = bdf_collapsedeventstudy['t11']
t12 = bdf_collapsedeventstudy['t12']
t21 = bdf_collapsedeventstudy['t21']
t22 = bdf_collapsedeventstudy['t22']
bdf_collapsedeventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t11=t11, t12=t12, t21=t21, t22=t22)
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22
0,0,144,192,-0.721125,1.445756,0,2,3,4
1,1,77,68,-0.500109,-2.108956,0,1,2,2
2,1,68,169,-2.108956,2.093587,2,2,3,4
3,2,44,41,-1.907822,-1.794051,0,3,4,4
4,3,124,105,0.030321,0.015422,0,0,1,4
...,...,...,...,...,...,...,...,...,...
20478,9996,123,32,0.477752,0.362093,2,3,4,4
20479,9997,118,86,1.928970,1.605210,0,0,1,2
20480,9997,86,54,1.605210,0.350283,1,2,3,4
20481,9998,66,0,-1.213496,-1.830369,0,0,1,4


Let's check the datatype:

In [15]:
type(bdf_collapsedeventstudy)

bipartitepandas.bipartiteeventstudycollapsed.BipartiteEventStudyCollapsed