# Formats

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [1]:
import bipartitepandas as bpd

## Get your data ready

For this notebook, we simulate data.

In [2]:
df = bpd.SimBipartite().simulate()
display(df)

Unnamed: 0,i,j,y,t,l,k,alpha,psi
0,0,3,-2.970479,0,1,0,-0.430727,-1.335178
1,0,67,-1.271152,1,1,3,-0.430727,-0.348756
2,0,67,0.873944,2,1,3,-0.430727,-0.348756
3,0,121,-1.881740,3,1,6,-0.430727,0.348756
4,0,41,-1.016592,4,1,2,-0.430727,-0.604585
...,...,...,...,...,...,...,...,...
49995,9999,188,2.952023,0,3,9,0.430727,1.335178
49996,9999,188,1.938828,1,3,9,0.430727,1.335178
49997,9999,188,1.586594,2,3,9,0.430727,1.335178
49998,9999,188,0.989413,3,3,9,0.430727,1.335178


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`
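
Because each format uses a distinct set of column names, the names alone usually identify the format. The sketch below uses a hypothetical helper, `infer_format`, which is not part of BipartitePandas and may not match how the library decides internally. It assumes the firm id and time columns are what distinguish the formats.

```python
def infer_format(columns):
    """Guess the format name from column names alone.

    Illustrative only: `infer_format` is a hypothetical helper, not a
    BipartitePandas function.
    """
    cols = set(columns)
    if {"j1", "j2"} <= cols:
        # Event study formats carry two firm ids per row.
        if {"t11", "t12", "t21", "t22"} & cols:
            return "Collapsed Event Study"
        return "Event Study"
    if "j" in cols:
        # Long formats carry one firm id per row.
        if {"t1", "t2"} & cols:
            return "Collapsed Long"
        return "Long"
    raise ValueError("no firm id column found (expected 'j' or 'j1'/'j2')")


print(infer_format(["i", "j", "y", "t"]))
print(infer_format(["i", "j1", "j2", "y1", "y2", "t11", "t12", "t21", "t22"]))
```

Since time columns are optional, this sketch cannot tell collapsed from uncollapsed data when they are absent; it then falls back to the uncollapsed name.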

## Constructing DataFrames

Our simulated data is in *Long* format. To construct a *Long* dataframe, pass each column to `bpd.BipartiteDataFrame` by name:

In [3]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'])
display(bdf_long)

Unnamed: 0,i,j,y,t
0,0,3,-2.970479,0
1,0,67,-1.271152,1
2,0,67,0.873944,2
3,0,121,-1.881740,3
4,0,41,-1.016592,4
...,...,...,...,...
49995,9999,188,2.952023,0
49996,9999,188,1.938828,1
49997,9999,188,1.586594,2
49998,9999,188,0.989413,3


Is this really a *Long* dataframe? Let's check the datatype:

In [4]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

The same constructor works for any format! The column names you pass determine which format you get, so make sure not to mix up columns between formats.

## Converting between formats

Converting between formats is meant to be easy. Methods exist to go from:

- *Long* to *Collapsed Long* (`.collapse()`)
- *Long* to *Event Study* (`.to_eventstudy()`)
- *Collapsed Long* to *Long* (`.uncollapse()`)
- *Collapsed Long* to *Collapsed Event Study* (`.to_eventstudy()`)
- *Event Study* to *Long* (`.to_long()`)
- *Collapsed Event Study* to *Collapsed Long* (`.to_long()`)
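
Not every pair of formats has a direct method, so some conversions require chaining. As a rough illustration, the sketch below encodes the six conversions listed above as a graph and searches for the shortest chain of method names. The helper `conversion_chain` and the `CONVERSIONS` table are hypothetical names introduced here, not part of BipartitePandas.

```python
from collections import deque

# (source format, target format) -> method name, taken from the list above.
CONVERSIONS = {
    ("Long", "Collapsed Long"): "collapse",
    ("Long", "Event Study"): "to_eventstudy",
    ("Collapsed Long", "Long"): "uncollapse",
    ("Collapsed Long", "Collapsed Event Study"): "to_eventstudy",
    ("Event Study", "Long"): "to_long",
    ("Collapsed Event Study", "Collapsed Long"): "to_long",
}


def conversion_chain(source, target):
    """Breadth-first search for the shortest list of methods to call."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, methods = queue.popleft()
        if fmt == target:
            return methods
        for (src, dst), method in CONVERSIONS.items():
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, methods + [method]))
    return None  # no chain of conversions reaches the target


print(conversion_chain("Event Study", "Collapsed Event Study"))
```

Under these assumptions, going from *Event Study* to *Collapsed Event Study* means calling `.to_long()`, then `.collapse()`, then `.to_eventstudy()`.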

Let's experiment with these and see what happens. Before we start, we just need to clean our data to make sure the conversions work properly (notice the new `m` column).

In [5]:
bdf_long = bdf_long.clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m
0,0,3,-2.970479,0,1
1,0,67,-1.271152,1,1
2,0,67,0.873944,2,1
3,0,121,-1.881740,3,2
4,0,41,-1.016592,4,1
...,...,...,...,...,...
49995,9999,188,2.952023,0,0
49996,9999,188,1.938828,1,0
49997,9999,188,1.586594,2,0
49998,9999,188,0.989413,3,0


#### *Long* to *Collapsed Long*

Notice that:

- `t` splits into `t1` and `t2`, which indicate the start and the end of the spell, respectively
- `w` is new - it gives the number of observations in the spell

In [6]:
bdf_collapsedlong = bdf_long.collapse()
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2,w,m
0,0,3,-2.970479,0,0,1,1
1,0,67,-0.198604,1,2,2,2
2,0,121,-1.881740,3,3,1,2
3,0,41,-1.016592,4,4,1,1
4,1,2,-2.155878,0,0,1,1
...,...,...,...,...,...,...,...
29890,9997,189,2.318654,3,3,1,2
29891,9997,152,-0.007326,4,4,1,1
29892,9998,80,-0.644006,0,3,4,1
29893,9998,115,-1.508793,4,4,1,1
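
To make the spell logic concrete, here is a minimal plain-Python sketch of collapsing, using worker 0's rows from the cleaned data. It is not the library's implementation. The function `collapse_spells` is a hypothetical name, and averaging `y` within a spell is an assumption inferred from the output above (the two observations at firm 67 average to -0.198604).

```python
from itertools import groupby


def collapse_spells(rows):
    """Collapse consecutive observations at the same firm into one spell.

    `rows` is a list of (i, j, y, t) tuples sorted by worker, then time.
    Hypothetical helper for illustration; income is averaged by assumption.
    """
    collapsed = []
    # groupby only merges adjacent rows, so a return to an earlier firm
    # would start a new spell rather than extend the old one.
    for (i, j), spell in groupby(rows, key=lambda r: (r[0], r[1])):
        spell = list(spell)
        incomes = [r[2] for r in spell]
        collapsed.append({
            "i": i,
            "j": j,
            "y": sum(incomes) / len(incomes),  # mean income over the spell
            "t1": spell[0][3],                 # first period of the spell
            "t2": spell[-1][3],                # last period of the spell
            "w": len(spell),                   # observations in the spell
        })
    return collapsed


worker_0 = [
    (0, 3, -2.970479, 0),
    (0, 67, -1.271152, 1),
    (0, 67, 0.873944, 2),
    (0, 121, -1.881740, 3),
    (0, 41, -1.016592, 4),
]
result = collapse_spells(worker_0)
for spell in result:
    print(spell)
```

The sketch leaves out the `m` column, which the library computes during cleaning and conversion.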


#### *Long* to *Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t` splits into `t1` and `t2`, which indicate the first and second period in the event study, respectively

<span class="label label-info">Note</span> For stayers (individuals who stay at the same firm for all their observations), each row in the event study represents a single observation, since they never move firms.

In [7]:
bdf_eventstudy = bdf_long.to_eventstudy()
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2,m
0,0,3,67,-2.970479,-1.271152,0,1,1
1,0,67,67,-1.271152,0.873944,1,2,0
2,0,67,121,0.873944,-1.881740,2,3,1
3,0,121,41,-1.881740,-1.016592,3,4,1
4,1,2,67,-2.155878,-1.933500,0,1,1
...,...,...,...,...,...,...,...,...
40597,9999,188,188,2.952023,2.952023,0,0,0
40598,9999,188,188,1.938828,1.938828,1,1,0
40599,9999,188,188,1.586594,1.586594,2,2,0
40600,9999,188,188,0.989413,0.989413,3,3,0
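
The pairing pattern in this output can be sketched in plain Python. The function `to_event_study_rows` is a hypothetical name, and the rule it applies (consecutive pairs for movers, each observation paired with itself for stayers) is inferred from the rows displayed above rather than taken from the library's source.

```python
def to_event_study_rows(rows):
    """Pair observations into event-study rows.

    `rows` is a list of (i, j, y, t) tuples sorted by worker, then time.
    Hypothetical helper for illustration only.
    """
    by_worker = {}
    for row in rows:
        by_worker.setdefault(row[0], []).append(row)

    out = []
    for i, obs in by_worker.items():
        is_stayer = len({j for _, j, _, _ in obs}) == 1
        if is_stayer:
            # Stayers never move, so each observation is paired with itself.
            pairs = [(o, o) for o in obs]
        else:
            # Movers contribute one row per pair of consecutive observations.
            pairs = list(zip(obs, obs[1:]))
        for (_, j1, y1, t1), (_, j2, y2, t2) in pairs:
            out.append({"i": i, "j1": j1, "j2": j2,
                        "y1": y1, "y2": y2, "t1": t1, "t2": t2})
    return out


rows = [
    (0, 3, -2.970479, 0),
    (0, 67, -1.271152, 1),
    (0, 67, 0.873944, 2),
    (0, 121, -1.881740, 3),
    (0, 41, -1.016592, 4),
    (9999, 188, 2.952023, 0),
    (9999, 188, 1.938828, 1),
]
event_rows = to_event_study_rows(rows)
for r in event_rows:
    print(r)
```

With five observations, the mover yields four rows, while the stayer yields one row per observation.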


#### *Collapsed Long* to *Collapsed Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t1` splits into `t11` and `t21`, which indicate the start of the first and second spell in the event study, respectively
- `t2` splits into `t12` and `t22`, which indicate the end of the first and second spell in the event study, respectively
- `w` splits into `w1` and `w2`, which indicate the number of observations in the first and second spell in the event study, respectively

In [8]:
bdf_collapsedeventstudy = bdf_collapsedlong.to_eventstudy()
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22,w1,w2,m
0,0,3,67,-2.970479,-0.198604,0,0,1,2,1,2,1
1,0,67,121,-0.198604,-1.881740,1,2,3,3,2,1,1
2,0,121,41,-1.881740,-1.016592,3,3,4,4,1,1,1
3,1,2,67,-2.155878,-1.933500,0,0,1,1,1,1,1
4,1,67,143,-1.933500,0.288426,1,1,2,3,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...
20492,9996,94,71,-0.529813,-0.766677,0,0,1,4,1,4,1
20493,9997,182,189,0.973182,2.318654,0,2,3,3,3,1,1
20494,9997,189,152,2.318654,-0.007326,3,3,4,4,1,1,1
20495,9998,80,115,-0.644006,-1.508793,0,3,4,4,4,1,1


We showed how to get from *Long* to any other format, but feel free to experiment and see what happens when you convert in other directions!

## Initializing from different formats

If your data is saved in a format other than *Long*, it's simple to construct a BipartiteDataFrame.

#### Initializing from *Collapsed Long* format

In [9]:
i = bdf_collapsedlong['i']
j = bdf_collapsedlong['j']
y = bdf_collapsedlong['y']
t1 = bdf_collapsedlong['t1']
t2 = bdf_collapsedlong['t2']
bdf_collapsedlong = bpd.BipartiteDataFrame(i=i, j=j, y=y, t1=t1, t2=t2)
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2
0,0,3,-2.970479,0,0
1,0,67,-0.198604,1,2
2,0,121,-1.881740,3,3
3,0,41,-1.016592,4,4
4,1,2,-2.155878,0,0
...,...,...,...,...,...
29890,9997,189,2.318654,3,3
29891,9997,152,-0.007326,4,4
29892,9998,80,-0.644006,0,3
29893,9998,115,-1.508793,4,4


Let's check the datatype:

In [10]:
type(bdf_collapsedlong)

bipartitepandas.bipartitelongcollapsed.BipartiteLongCollapsed

#### Initializing from *Event Study* format

In [11]:
i = bdf_eventstudy['i']
j1 = bdf_eventstudy['j1']
j2 = bdf_eventstudy['j2']
y1 = bdf_eventstudy['y1']
y2 = bdf_eventstudy['y2']
t1 = bdf_eventstudy['t1']
t2 = bdf_eventstudy['t2']
bdf_eventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t1=t1, t2=t2)
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2
0,0,3,67,-2.970479,-1.271152,0,1
1,0,67,67,-1.271152,0.873944,1,2
2,0,67,121,0.873944,-1.881740,2,3
3,0,121,41,-1.881740,-1.016592,3,4
4,1,2,67,-2.155878,-1.933500,0,1
...,...,...,...,...,...,...,...
40597,9999,188,188,2.952023,2.952023,0,0
40598,9999,188,188,1.938828,1.938828,1,1
40599,9999,188,188,1.586594,1.586594,2,2
40600,9999,188,188,0.989413,0.989413,3,3


Let's check the datatype:

In [12]:
type(bdf_eventstudy)

bipartitepandas.bipartiteeventstudy.BipartiteEventStudy

#### Initializing from *Collapsed Event Study* format

In [13]:
i = bdf_collapsedeventstudy['i']
j1 = bdf_collapsedeventstudy['j1']
j2 = bdf_collapsedeventstudy['j2']
y1 = bdf_collapsedeventstudy['y1']
y2 = bdf_collapsedeventstudy['y2']
t11 = bdf_collapsedeventstudy['t11']
t12 = bdf_collapsedeventstudy['t12']
t21 = bdf_collapsedeventstudy['t21']
t22 = bdf_collapsedeventstudy['t22']
bdf_collapsedeventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t11=t11, t12=t12, t21=t21, t22=t22)
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22
0,0,3,67,-2.970479,-0.198604,0,0,1,2
1,0,67,121,-0.198604,-1.881740,1,2,3,3
2,0,121,41,-1.881740,-1.016592,3,3,4,4
3,1,2,67,-2.155878,-1.933500,0,0,1,1
4,1,67,143,-1.933500,0.288426,1,1,2,3
...,...,...,...,...,...,...,...,...,...
20492,9996,94,71,-0.529813,-0.766677,0,0,1,4
20493,9997,182,189,0.973182,2.318654,0,2,3,3
20494,9997,189,152,2.318654,-0.007326,3,3,4,4
20495,9998,80,115,-0.644006,-1.508793,0,3,4,4


Let's check the datatype:

In [14]:
type(bdf_collapsedeventstudy)

bipartitepandas.bipartiteeventstudycollapsed.BipartiteEventStudyCollapsed