# Formats

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [1]:
import bipartitepandas as bpd

## Get your data ready

For this notebook, we simulate data.

In [2]:
df = bpd.SimBipartite().simulate()
display(df)

Unnamed: 0,i,j,y,t,l,k,alpha,psi
0,0,52,-1.635676,0,2,2,0.000000,-0.604585
1,0,52,-0.355736,1,2,2,0.000000,-0.604585
2,0,50,-1.278299,2,2,2,0.000000,-0.604585
3,0,50,-1.729138,3,2,2,0.000000,-0.604585
4,0,50,-1.697820,4,2,2,0.000000,-0.604585
...,...,...,...,...,...,...,...,...
49995,9999,70,0.724830,0,1,3,-0.430727,-0.348756
49996,9999,70,1.574465,1,1,3,-0.430727,-0.348756
49997,9999,70,-0.199578,2,1,3,-0.430727,-0.348756
49998,9999,70,-1.346981,3,1,3,-0.430727,-0.348756


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`
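
Because each format uses its own column names, a set of column names is usually enough to tell the formats apart. The sketch below is plain Python and illustrates that idea; `FORMAT_COLUMNS`, `REQUIRED_COLUMNS`, and `matching_formats` are hypothetical names rather than part of BipartitePandas, and the required columns for the event study formats are an assumption (the required general columns with their `1`/`2` suffixes).

```python
# Allowed general columns per format, copied from the list above.
# All names in this block are hypothetical helpers, not BipartitePandas API.
FORMAT_COLUMNS = {
    "long": {"i", "j", "y", "t", "g", "w", "m"},
    "collapsed_long": {"i", "j", "y", "t1", "t2", "g", "w", "m"},
    "event_study": {"i", "j1", "j2", "y1", "y2", "t1", "t2",
                    "g1", "g2", "w1", "w2", "m"},
    "collapsed_event_study": {"i", "j1", "j2", "y1", "y2",
                              "t11", "t12", "t21", "t22",
                              "g1", "g2", "w1", "w2", "m"},
}

# Assumption: `i`, `j`, `y` stay required in every format, split where needed.
REQUIRED_COLUMNS = {
    "long": {"i", "j", "y"},
    "collapsed_long": {"i", "j", "y"},
    "event_study": {"i", "j1", "j2", "y1", "y2"},
    "collapsed_event_study": {"i", "j1", "j2", "y1", "y2"},
}


def matching_formats(columns):
    """Return every format whose required columns are present and
    whose allowed columns cover all of the given columns."""
    columns = set(columns)
    return [
        fmt
        for fmt, allowed in FORMAT_COLUMNS.items()
        if REQUIRED_COLUMNS[fmt] <= columns <= allowed
    ]


print(matching_formats(["i", "j", "y", "t"]))        # ['long']
print(matching_formats(["i", "j", "y", "t", "t1"]))  # [] (columns from two formats mixed)
```

An empty result signals that columns from different formats were mixed; more than one result (for example, only `i`, `j`, `y`) means the time columns are needed to disambiguate.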

## Constructing DataFrames

Our simulated data is in *Long* format. How do we construct a *Long* dataframe?

In [3]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'])
display(bdf_long)

Unnamed: 0,i,j,y,t
0,0,52,-1.635676,0
1,0,52,-0.355736,1
2,0,50,-1.278299,2
3,0,50,-1.729138,3
4,0,50,-1.697820,4
...,...,...,...,...
49995,9999,70,0.724830,0
49996,9999,70,1.574465,1
49997,9999,70,-0.199578,2
49998,9999,70,-1.346981,3


Are we sure this is long? Let's check the datatype:

In [4]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

The same constructor builds any of the four formats: the column names you pass determine which one you get. Just make sure not to mix up columns between formats.

## Converting between formats

Converting between formats is meant to be easy. Methods exist to go from:

- *Long* to *Collapsed Long* (`.collapse()`)
- *Long* to *Event Study* (`.to_eventstudy()`)
- *Collapsed Long* to *Long* (`.uncollapse()`)
- *Collapsed Long* to *Collapsed Event Study* (`.to_eventstudy()`)
- *Event Study* to *Long* (`.to_long()`)
- *Collapsed Event Study* to *Collapsed Long* (`.to_long()`)
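
Not every pair of formats has a direct method, so some conversions need to be chained. As a rough illustration, the plain-Python sketch below treats the list above as a graph and searches for the shortest chain of methods; `CONVERSIONS` and `conversion_path` are hypothetical names used only for this example.

```python
from collections import deque

# Direct conversions listed above: (source format, target format) -> method name.
CONVERSIONS = {
    ("Long", "Collapsed Long"): "collapse",
    ("Long", "Event Study"): "to_eventstudy",
    ("Collapsed Long", "Long"): "uncollapse",
    ("Collapsed Long", "Collapsed Event Study"): "to_eventstudy",
    ("Event Study", "Long"): "to_long",
    ("Collapsed Event Study", "Collapsed Long"): "to_long",
}


def conversion_path(source, target):
    """Breadth-first search for the shortest chain of methods from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, methods = queue.popleft()
        if fmt == target:
            return methods
        for (src, dst), method in CONVERSIONS.items():
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, methods + [method]))
    return None  # no chain of listed methods connects the two formats


print(conversion_path("Long", "Collapsed Event Study"))
# ['collapse', 'to_eventstudy']
print(conversion_path("Event Study", "Collapsed Event Study"))
# ['to_long', 'collapse', 'to_eventstudy']
```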

Let's experiment with these and see what happens. Before we start, we just need to clean our data to make sure the conversions work properly (notice the new `m` column).

In [5]:
bdf_long = bdf_long.clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m
0,0,52,-1.635676,0,0
1,0,52,-0.355736,1,1
2,0,50,-1.278299,2,1
3,0,50,-1.729138,3,0
4,0,50,-1.697820,4,0
...,...,...,...,...,...
49995,9999,70,0.724830,0,0
49996,9999,70,1.574465,1,0
49997,9999,70,-0.199578,2,0
49998,9999,70,-1.346981,3,1
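
The values in the new `m` column are consistent with a simple rule: for each observation, count how many of the worker's adjacent observations (previous and next) are at a different firm. The sketch below is plain Python and reproduces worker 0's rows under that rule. The rule is inferred from the rows displayed here, not taken from the library's documentation, and `move_indicator` is a hypothetical helper.

```python
def move_indicator(rows):
    """rows: list of (worker id, firm id) tuples sorted by worker and time.

    Returns, for each row, how many neighboring rows of the same worker
    are at a different firm (an assumed rule, inferred from the output)."""
    m = []
    for idx, (i, j) in enumerate(rows):
        count = 0
        if idx > 0:
            prev_i, prev_j = rows[idx - 1]
            if prev_i == i and prev_j != j:
                count += 1  # the worker just arrived from another firm
        if idx + 1 < len(rows):
            next_i, next_j = rows[idx + 1]
            if next_i == i and next_j != j:
                count += 1  # the worker is about to leave for another firm
        m.append(count)
    return m


# Worker 0 from the output above: firm 52 for two periods, then firm 50.
rows = [(0, 52), (0, 52), (0, 50), (0, 50), (0, 50)]
print(move_indicator(rows))  # [0, 1, 1, 0, 0]
```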


#### *Long* to *Collapsed Long*

Notice that:

- `t` splits into `t1` and `t2`, which indicate the start and the end of the spell, respectively
- `w` is new - it gives the number of observations in the spell

In [6]:
bdf_collapsedlong = bdf_long.collapse()
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2,w,m
0,0,52,-0.995706,0,1,2,1
1,0,50,-1.568419,2,4,3,1
2,1,173,2.166473,0,4,5,0
3,2,92,-0.340171,0,0,1,1
4,2,87,-0.326012,1,3,3,2
...,...,...,...,...,...,...,...
29716,9998,158,1.273030,0,1,2,1
29717,9998,174,0.201508,2,2,1,2
29718,9998,163,1.287281,3,4,2,1
29719,9999,70,0.188184,0,3,4,1
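
To make the collapsing step concrete, here is a plain-Python sketch that reproduces worker 0's two collapsed rows from the *Long* rows shown earlier. It assumes that `y` in a collapsed spell is the mean income over the spell, which matches the displayed values but is not confirmed here as the library's general behavior; `collapse_spells` is a hypothetical helper, not the implementation of `.collapse()`.

```python
from itertools import groupby


def collapse_spells(rows):
    """rows: list of dicts with keys i, j, y, t, sorted by worker and time."""
    collapsed = []
    # groupby merges only adjacent rows with the same key, i.e. one spell at a time.
    for (i, j), spell in groupby(rows, key=lambda r: (r["i"], r["j"])):
        spell = list(spell)
        collapsed.append({
            "i": i,
            "j": j,
            "y": sum(r["y"] for r in spell) / len(spell),  # assumed: spell mean
            "t1": spell[0]["t"],   # first period of the spell
            "t2": spell[-1]["t"],  # last period of the spell
            "w": len(spell),       # number of observations in the spell
        })
    return collapsed


# Worker 0's rows from the Long dataframe above.
rows = [
    {"i": 0, "j": 52, "y": -1.635676, "t": 0},
    {"i": 0, "j": 52, "y": -0.355736, "t": 1},
    {"i": 0, "j": 50, "y": -1.278299, "t": 2},
    {"i": 0, "j": 50, "y": -1.729138, "t": 3},
    {"i": 0, "j": 50, "y": -1.697820, "t": 4},
]
for s in collapse_spells(rows):
    print((s["j"], round(s["y"], 6), s["t1"], s["t2"], s["w"]))
# (52, -0.995706, 0, 1, 2)
# (50, -1.568419, 2, 4, 3)
```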


#### *Long* to *Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t` splits into `t1` and `t2`, which indicate the first and second period in the event study, respectively

**Note:** For stayers (individuals who stay at the same firm for all their observations), each row in the event study repeats a single observation in both periods (so `y1 == y2` and `t1 == t2`), since they never move firms.

In [7]:
bdf_eventstudy = bdf_long.to_eventstudy()
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2,m
0,0,52,52,-1.635676,-0.355736,0,1,0
1,0,52,50,-0.355736,-1.278299,1,2,1
2,0,50,50,-1.278299,-1.729138,2,3,0
3,0,50,50,-1.729138,-1.697820,3,4,0
4,1,173,173,2.199857,2.199857,0,0,0
...,...,...,...,...,...,...,...,...
40641,9998,163,163,0.958440,1.616123,3,4,0
40642,9999,70,70,0.724830,1.574465,0,1,0
40643,9999,70,70,1.574465,-0.199578,1,2,0
40644,9999,70,70,-0.199578,-1.346981,2,3,0
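
The pairing logic can be sketched in plain Python. Based on the output above, this sketch assumes that movers get one row per pair of consecutive observations, while stayers get one row per observation with that observation repeated. `to_event_study_rows` is a hypothetical helper for illustration, not the implementation of `.to_eventstudy()`.

```python
def to_event_study_rows(rows):
    """rows: list of dicts with keys i, j, y, t, sorted by worker and time."""
    by_worker = {}
    for r in rows:
        by_worker.setdefault(r["i"], []).append(r)

    out = []
    for i, obs in by_worker.items():
        is_mover = len({r["j"] for r in obs}) > 1
        if is_mover:
            # Movers: pair each observation with the next one.
            pairs = zip(obs[:-1], obs[1:])
        else:
            # Stayers: repeat each observation on both sides of the row.
            pairs = ((r, r) for r in obs)
        for first, second in pairs:
            out.append({
                "i": i,
                "j1": first["j"], "j2": second["j"],
                "y1": first["y"], "y2": second["y"],
                "t1": first["t"], "t2": second["t"],
            })
    return out


# Worker 0 (a mover) and the first observation of worker 1 (a stayer),
# taken from the outputs above.
rows = [
    {"i": 0, "j": 52, "y": -1.635676, "t": 0},
    {"i": 0, "j": 52, "y": -0.355736, "t": 1},
    {"i": 0, "j": 50, "y": -1.278299, "t": 2},
    {"i": 0, "j": 50, "y": -1.729138, "t": 3},
    {"i": 0, "j": 50, "y": -1.697820, "t": 4},
    {"i": 1, "j": 173, "y": 2.199857, "t": 0},
]
for row in to_event_study_rows(rows):
    print(row)
```

With this sample, worker 0 yields four rows (periods 0-1, 1-2, 2-3, 3-4) and worker 1 yields a single row with `t1 == t2 == 0`, matching the first five rows of the displayed event study.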


#### *Collapsed Long* to *Collapsed Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t1` and `t2` split into `t11`, `t12`, `t21`, and `t22`: `t11` and `t12` indicate the start and the end of the first spell in the event study, while `t21` and `t22` indicate the start and the end of the second spell
- `w` splits into `w1` and `w2`, which indicate the number of observations in the first and second spell in the event study, respectively

In [8]:
bdf_collapsedeventstudy = bdf_collapsedlong.to_eventstudy()
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22,w1,w2,m
0,0,52,50,-0.995706,-1.568419,0,1,2,4,2,3,1
1,1,173,173,2.166473,2.166473,0,4,0,4,5,5,0
2,2,92,87,-0.340171,-0.326012,0,0,1,3,1,3,1
3,2,87,42,-0.326012,-1.392960,1,3,4,4,3,1,1
4,3,185,186,3.250695,1.966088,0,1,2,4,2,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...
20362,9997,177,173,1.768343,4.042330,0,2,3,3,3,1,1
20363,9997,173,185,4.042330,4.225602,3,3,4,4,1,1,1
20364,9998,158,174,1.273030,0.201508,0,1,2,2,2,1,1
20365,9998,174,163,0.201508,1.287281,2,2,3,4,1,2,1


We showed how to get from *Long* to any other format, but feel free to experiment and see what happens when you convert in other directions!

## Initializing from different formats

If your data is saved in a format other than *Long*, it's simple to construct a BipartiteDataFrame.

#### Initializing from *Collapsed Long* format

In [9]:
i = bdf_collapsedlong['i']
j = bdf_collapsedlong['j']
y = bdf_collapsedlong['y']
t1 = bdf_collapsedlong['t1']
t2 = bdf_collapsedlong['t2']
bdf_collapsedlong = bpd.BipartiteDataFrame(i=i, j=j, y=y, t1=t1, t2=t2)
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2
0,0,52,-0.995706,0,1
1,0,50,-1.568419,2,4
2,1,173,2.166473,0,4
3,2,92,-0.340171,0,0
4,2,87,-0.326012,1,3
...,...,...,...,...,...
29716,9998,158,1.273030,0,1
29717,9998,174,0.201508,2,2
29718,9998,163,1.287281,3,4
29719,9999,70,0.188184,0,3


Let's check the datatype:

In [10]:
type(bdf_collapsedlong)

bipartitepandas.bipartitelongcollapsed.BipartiteLongCollapsed

#### Initializing from *Event Study* format

In [11]:
i = bdf_eventstudy['i']
j1 = bdf_eventstudy['j1']
j2 = bdf_eventstudy['j2']
y1 = bdf_eventstudy['y1']
y2 = bdf_eventstudy['y2']
t1 = bdf_eventstudy['t1']
t2 = bdf_eventstudy['t2']
bdf_eventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t1=t1, t2=t2)
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2
0,0,52,52,-1.635676,-0.355736,0,1
1,0,52,50,-0.355736,-1.278299,1,2
2,0,50,50,-1.278299,-1.729138,2,3
3,0,50,50,-1.729138,-1.697820,3,4
4,1,173,173,2.199857,2.199857,0,0
...,...,...,...,...,...,...,...
40641,9998,163,163,0.958440,1.616123,3,4
40642,9999,70,70,0.724830,1.574465,0,1
40643,9999,70,70,1.574465,-0.199578,1,2
40644,9999,70,70,-0.199578,-1.346981,2,3


Let's check the datatype:

In [12]:
type(bdf_eventstudy)

bipartitepandas.bipartiteeventstudy.BipartiteEventStudy

#### Initializing from *Collapsed Event Study* format

In [13]:
i = bdf_collapsedeventstudy['i']
j1 = bdf_collapsedeventstudy['j1']
j2 = bdf_collapsedeventstudy['j2']
y1 = bdf_collapsedeventstudy['y1']
y2 = bdf_collapsedeventstudy['y2']
t11 = bdf_collapsedeventstudy['t11']
t12 = bdf_collapsedeventstudy['t12']
t21 = bdf_collapsedeventstudy['t21']
t22 = bdf_collapsedeventstudy['t22']
bdf_collapsedeventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t11=t11, t12=t12, t21=t21, t22=t22)
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22
0,0,52,50,-0.995706,-1.568419,0,1,2,4
1,1,173,173,2.166473,2.166473,0,4,0,4
2,2,92,87,-0.340171,-0.326012,0,0,1,3
3,2,87,42,-0.326012,-1.392960,1,3,4,4
4,3,185,186,3.250695,1.966088,0,1,2,4
...,...,...,...,...,...,...,...,...,...
20362,9997,177,173,1.768343,4.042330,0,2,3,3
20363,9997,173,185,4.042330,4.225602,3,3,4,4
20364,9998,158,174,1.273030,0.201508,0,1,2,2
20365,9998,174,163,0.201508,1.287281,2,2,3,4


Let's check the datatype:

In [14]:
type(bdf_collapsedeventstudy)

bipartitepandas.bipartiteeventstudycollapsed.BipartiteEventStudyCollapsed