# Formats

In [1]:
# Add BipartitePandas to system path, do not run this
# import sys
# sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd



## Get your data ready

In this example, we simulate data.

In [3]:
df = bpd.SimBipartite().simulate()
display(df)

Unnamed: 0,i,j,y,t,l,k,alpha,psi
0,0,184,1.235525,0,4,9,0.967422,1.335178
1,0,184,3.736296,1,4,9,0.967422,1.335178
2,0,171,1.120824,2,4,8,0.967422,0.908458
3,0,171,-0.581654,3,4,8,0.967422,0.908458
4,0,137,0.963569,4,4,6,0.967422,0.348756
...,...,...,...,...,...,...,...,...
49995,9999,121,-0.414404,0,3,6,0.430727,0.348756
49996,9999,87,-2.129833,1,3,4,0.430727,-0.114185
49997,9999,87,-0.644136,2,3,4,0.430727,-0.114185
49998,9999,87,0.717443,3,3,4,0.430727,-0.114185


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)
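
To make the required/optional split concrete, here is a minimal plain-Python sketch that sorts a table's column names into those groups. The helper `check_columns` and its return format are hypothetical and not part of BipartitePandas; it only looks at names, not datatypes.

```python
# General columns as listed above (names only; datatypes are not checked here)
REQUIRED = ("i", "j", "y")
OPTIONAL = ("t", "g", "w", "m")


def check_columns(columns):
    """Hypothetical helper: sort column names into missing/optional/other."""
    columns = list(columns)
    return {
        "missing_required": [c for c in REQUIRED if c not in columns],
        "optional_present": [c for c in OPTIONAL if c in columns],
        "other": [c for c in columns if c not in REQUIRED + OPTIONAL],
    }


# Column names of the simulated data shown earlier
report = check_columns(["i", "j", "y", "t", "l", "k", "alpha", "psi"])
print(report)
```

For the simulated data, nothing required is missing, `t` is the only optional general column present, and `l`, `k`, `alpha`, `psi` fall outside the general columns.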

## Formats

BipartitePandas includes four formats:

- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

Each format splits the general columns differently:

- *Long* - `i`, `j`, `y`, `t`, `g`, `w`, `m`
- *Collapsed Long* - `i`, `j`, `y`, `t1`, `t2`, `g`, `w`, `m`
- *Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t1`, `t2`, `g1`, `g2`, `w1`, `w2`, `m`
- *Collapsed Event Study* - `i`, `j1`, `j2`, `y1`, `y2`, `t11`, `t12`, `t21`, `t22`, `g1`, `g2`, `w1`, `w2`, `m`
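
The column layouts above are enough to tell the formats apart by name. The sketch below stores them in a dictionary and guesses a format from the firm-id and time columns. `guess_format` is a hypothetical helper for illustration only, not the library's own detection logic, and it cannot distinguish collapsed from uncollapsed data when no time columns are present.

```python
# Column layout of each format, copied from the lists above
FORMAT_COLUMNS = {
    "Long": ["i", "j", "y", "t", "g", "w", "m"],
    "Collapsed Long": ["i", "j", "y", "t1", "t2", "g", "w", "m"],
    "Event Study": [
        "i", "j1", "j2", "y1", "y2", "t1", "t2", "g1", "g2", "w1", "w2", "m",
    ],
    "Collapsed Event Study": [
        "i", "j1", "j2", "y1", "y2", "t11", "t12", "t21", "t22",
        "g1", "g2", "w1", "w2", "m",
    ],
}


def guess_format(columns):
    """Hypothetical helper: guess the format from column names alone."""
    cols = set(columns)
    if "j1" in cols:  # firm id is split, so this is an event study
        return "Collapsed Event Study" if "t11" in cols else "Event Study"
    # Without any time column this falls back to "Long"
    return "Collapsed Long" if "t1" in cols else "Long"


for name, columns in FORMAT_COLUMNS.items():
    print(name, "->", guess_format(columns))
```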

## Constructing DataFrames

Our simulated data is in *Long* format. How do we construct a *Long* dataframe?

In [4]:
bdf_long = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'])
display(bdf_long)

Unnamed: 0,i,j,y,t
0,0,184,1.235525,0
1,0,184,3.736296,1
2,0,171,1.120824,2
3,0,171,-0.581654,3
4,0,137,0.963569,4
...,...,...,...,...
49995,9999,121,-0.414404,0
49996,9999,87,-2.129833,1
49997,9999,87,-0.644136,2
49998,9999,87,0.717443,3


Are we sure this is long? Let's check the datatype:

In [5]:
type(bdf_long)

bipartitepandas.bipartitelong.BipartiteLong

The same constructor builds any format: the class it returns depends on which columns you pass. Just make sure not to mix up columns between formats.

## Converting between formats

Converting between formats is meant to be easy. Methods exist to go from:

- *Long* to *Collapsed Long* (`.collapse()`)
- *Long* to *Event Study* (`.to_eventstudy()`)
- *Collapsed Long* to *Long* (`.uncollapse()`)
- *Collapsed Long* to *Collapsed Event Study* (`.to_eventstudy()`)
- *Event Study* to *Long* (`.to_long()`)
- *Collapsed Event Study* to *Collapsed Long* (`.to_long()`)
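
Not every pair of formats has a direct method, but any format can be reached from any other by chaining the ones listed. As a rough sketch, the list can be treated as a graph and searched for the shortest chain. `CONVERSIONS` and `conversion_path` are hypothetical names used only for this illustration.

```python
from collections import deque

# Direct conversions and their method names, copied from the list above
CONVERSIONS = {
    "Long": {"Collapsed Long": "collapse", "Event Study": "to_eventstudy"},
    "Collapsed Long": {
        "Long": "uncollapse",
        "Collapsed Event Study": "to_eventstudy",
    },
    "Event Study": {"Long": "to_long"},
    "Collapsed Event Study": {"Collapsed Long": "to_long"},
}


def conversion_path(source, target):
    """Hypothetical helper: shortest chain of method names (breadth-first)."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, methods = queue.popleft()
        if fmt == target:
            return methods
        for nxt, method in CONVERSIONS[fmt].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, methods + [method]))
    return None  # no chain of conversions exists


print(conversion_path("Event Study", "Collapsed Event Study"))
# ['to_long', 'collapse', 'to_eventstudy']
```

On a real dataframe this chain would correspond to calling the three methods in sequence, assuming the data has been cleaned first.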

Let's experiment with these and see what happens. Before we start, we just need to clean our data to make sure the conversions work properly (notice the new `m` column).

In [6]:
bdf_long = bdf_long.clean()
display(bdf_long)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m
0,0,184,1.235525,0,0
1,0,184,3.736296,1,1
2,0,171,1.120824,2,1
3,0,171,-0.581654,3,1
4,0,137,0.963569,4,1
...,...,...,...,...,...
49995,9999,121,-0.414404,0,1
49996,9999,87,-2.129833,1,1
49997,9999,87,-0.644136,2,0
49998,9999,87,0.717443,3,0


#### *Long* to *Collapsed Long*

Notice that:

- `t` splits into `t1` and `t2`, which indicate the start and the end of the spell, respectively
- `w` is new - it gives the number of observations in the spell

In [7]:
bdf_collapsedlong = bdf_long.collapse()
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2,w,m
0,0,184,2.485910,0,1,2,1
1,0,171,0.269585,2,3,2,2
2,0,137,0.963569,4,4,1,1
3,1,1,0.613989,0,0,1,1
4,1,54,-0.150576,1,1,1,2
...,...,...,...,...,...,...,...
29943,9998,4,-2.001026,0,2,3,1
29944,9998,103,-0.164205,3,3,1,2
29945,9998,62,-2.367536,4,4,1,1
29946,9999,121,-0.414404,0,0,1,1
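
To see where these numbers come from, the sketch below collapses worker 0's rows by hand in plain Python. It is not the library's implementation. Taking the mean of `y` within a spell is inferred from the output above (2.485910 is the average of 1.235525 and 3.736296), and the `m` column is not reproduced.

```python
from itertools import groupby

# Worker 0's cleaned Long rows from the output above: (i, j, y, t)
long_rows = [
    (0, 184, 1.235525, 0),
    (0, 184, 3.736296, 1),
    (0, 171, 1.120824, 2),
    (0, 171, -0.581654, 3),
    (0, 137, 0.963569, 4),
]


def collapse_spells(rows):
    """Collapse consecutive rows sharing (i, j) into one spell.

    Assumes rows are sorted by worker and then by time.
    """
    collapsed = []
    for (i, j), spell in groupby(rows, key=lambda r: (r[0], r[1])):
        spell = list(spell)
        ys = [r[2] for r in spell]
        ts = [r[3] for r in spell]
        collapsed.append({
            "i": i,
            "j": j,
            "y": sum(ys) / len(ys),  # mean income over the spell (assumed)
            "t1": min(ts),           # first period of the spell
            "t2": max(ts),           # last period of the spell
            "w": len(spell),         # number of observations in the spell
        })
    return collapsed


spells = collapse_spells(long_rows)
for spell in spells:
    print(spell)
```

The three spells it produces line up with the first three rows of the collapsed output.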


#### *Long* to *Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t` splits into `t1` and `t2`, which indicate the first and second period in the event study, respectively

*Hint:* for stayers (individuals who stay at the same firm for all their observations), each row in the event study represents a single observation, since they never move firms.

In [8]:
bdf_eventstudy = bdf_long.to_eventstudy()
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2,m
0,0,184,184,1.235525,3.736296,0,1,0
1,0,184,171,3.736296,1.120824,1,2,1
2,0,171,171,1.120824,-0.581654,2,3,0
3,0,171,137,-0.581654,0.963569,3,4,1
4,1,1,54,0.613989,-0.150576,0,1,1
...,...,...,...,...,...,...,...,...
40637,9998,103,62,-0.164205,-2.367536,3,4,1
40638,9999,121,87,-0.414404,-2.129833,0,1,1
40639,9999,87,87,-2.129833,-0.644136,1,2,0
40640,9999,87,87,-0.644136,0.717443,2,3,0
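
The pairing logic can be sketched in plain Python as follows. This is an illustration rather than the library's code. Worker 0's rows are copied from the output above, worker 7 is a made-up stayer, and the stayer rule (each observation repeated in both halves of the row) follows the hint above. Setting `m` to 1 when `j1` and `j2` differ matches the rows shown but is otherwise an assumption.

```python
from itertools import groupby

# Cleaned Long rows: (i, j, y, t).
# Worker 0 is copied from the output above; worker 7 is a made-up stayer.
long_rows = [
    (0, 184, 1.235525, 0),
    (0, 184, 3.736296, 1),
    (0, 171, 1.120824, 2),
    (7, 50, 0.5, 0),
    (7, 50, 0.7, 1),
]


def to_event_study(rows):
    """Pair observations per worker (rows sorted by worker, then time)."""
    out = []
    for i, obs in groupby(rows, key=lambda r: r[0]):
        obs = list(obs)
        if len({r[1] for r in obs}) == 1:
            # Stayer: each observation fills both halves of its own row
            pairs = [(r, r) for r in obs]
        else:
            # Mover: consecutive observations are paired
            pairs = list(zip(obs, obs[1:]))
        for first, second in pairs:
            out.append({
                "i": i,
                "j1": first[1], "j2": second[1],
                "y1": first[2], "y2": second[2],
                "t1": first[3], "t2": second[3],
                "m": int(first[1] != second[1]),  # 1 if the firm changes
            })
    return out


event_study = to_event_study(long_rows)
for row in event_study:
    print(row)
```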


#### *Collapsed Long* to *Collapsed Event Study*

Notice that:

- `j` splits into `j1` and `j2`, which indicate the first and second firm id in the event study, respectively
- `y` splits into `y1` and `y2`, which indicate the first and second income in the event study, respectively
- `t1` splits into `t11` and `t12`, which indicate the start and the end of the spell for the first observation in the event study, respectively
- `t2` splits into `t21` and `t22`, which indicate the start and the end of the spell for the second observation in the event study, respectively
- `w` splits into `w1` and `w2`, which indicate the number of observations in the first and second spell in the event study, respectively

In [9]:
bdf_collapsedeventstudy = bdf_collapsedlong.to_eventstudy()
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22,w1,w2,m
0,0,184,171,2.485910,0.269585,0,1,2,3,2,2,1
1,0,171,137,0.269585,0.963569,2,3,4,4,2,1,1
2,1,1,54,0.613989,-0.150576,0,0,1,1,1,1,1
3,1,54,196,-0.150576,1.419270,1,1,2,3,1,2,1
4,1,196,92,1.419270,0.481084,2,3,4,4,2,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
20585,9996,40,22,-3.077534,-2.409247,1,1,2,4,1,3,1
20586,9997,49,49,-1.564512,-1.564512,0,4,0,4,5,5,0
20587,9998,4,103,-2.001026,-0.164205,0,2,3,3,3,1,1
20588,9998,103,62,-0.164205,-2.367536,3,3,4,4,1,1,1
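
The column renaming is easier to follow on a concrete case. This short sketch pairs worker 0's consecutive spells by hand, using values copied from the *Collapsed Long* output. It covers a single mover only and leaves out `m` and the stayer case, so treat it as an illustration of the naming scheme rather than of the library's method.

```python
# Worker 0's spells from the Collapsed Long output: (i, j, y, t1, t2, w)
spells = [
    (0, 184, 2.485910, 0, 1, 2),
    (0, 171, 0.269585, 2, 3, 2),
    (0, 137, 0.963569, 4, 4, 1),
]

# Pair each spell with the next one. The first spell's bounds (t1, t2)
# become t11 and t12; the second spell's bounds become t21 and t22.
collapsed_es = [
    {
        "i": a[0],
        "j1": a[1], "j2": b[1],
        "y1": a[2], "y2": b[2],
        "t11": a[3], "t12": a[4],
        "t21": b[3], "t22": b[4],
        "w1": a[5], "w2": b[5],
    }
    for a, b in zip(spells, spells[1:])
]

for row in collapsed_es:
    print(row)
```

The two rows match rows 0 and 1 of the output above, apart from the omitted `m` column.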


We showed how to get from *Long* to any other format, but feel free to experiment and see what happens when you convert in other directions!

## Initializing from different formats

If your data is saved in a format other than *Long*, it's simple to construct a BipartiteDataFrame.

#### Initializing from *Collapsed Long* format

In [10]:
i = bdf_collapsedlong['i']
j = bdf_collapsedlong['j']
y = bdf_collapsedlong['y']
t1 = bdf_collapsedlong['t1']
t2 = bdf_collapsedlong['t2']
bdf_collapsedlong = bpd.BipartiteDataFrame(i=i, j=j, y=y, t1=t1, t2=t2)
display(bdf_collapsedlong)

Unnamed: 0,i,j,y,t1,t2
0,0,184,2.485910,0,1
1,0,171,0.269585,2,3
2,0,137,0.963569,4,4
3,1,1,0.613989,0,0
4,1,54,-0.150576,1,1
...,...,...,...,...,...
29943,9998,4,-2.001026,0,2
29944,9998,103,-0.164205,3,3
29945,9998,62,-2.367536,4,4
29946,9999,121,-0.414404,0,0


Let's check the datatype:

In [11]:
type(bdf_collapsedlong)

bipartitepandas.bipartitelongcollapsed.BipartiteLongCollapsed

#### Initializing from *Event Study* format

In [12]:
i = bdf_eventstudy['i']
j1 = bdf_eventstudy['j1']
j2 = bdf_eventstudy['j2']
y1 = bdf_eventstudy['y1']
y2 = bdf_eventstudy['y2']
t1 = bdf_eventstudy['t1']
t2 = bdf_eventstudy['t2']
bdf_eventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t1=t1, t2=t2)
display(bdf_eventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t1,t2
0,0,184,184,1.235525,3.736296,0,1
1,0,184,171,3.736296,1.120824,1,2
2,0,171,171,1.120824,-0.581654,2,3
3,0,171,137,-0.581654,0.963569,3,4
4,1,1,54,0.613989,-0.150576,0,1
...,...,...,...,...,...,...,...
40637,9998,103,62,-0.164205,-2.367536,3,4
40638,9999,121,87,-0.414404,-2.129833,0,1
40639,9999,87,87,-2.129833,-0.644136,1,2
40640,9999,87,87,-0.644136,0.717443,2,3


Let's check the datatype:

In [13]:
type(bdf_eventstudy)

bipartitepandas.bipartiteeventstudy.BipartiteEventStudy

#### Initializing from *Collapsed Event Study* format

In [14]:
i = bdf_collapsedeventstudy['i']
j1 = bdf_collapsedeventstudy['j1']
j2 = bdf_collapsedeventstudy['j2']
y1 = bdf_collapsedeventstudy['y1']
y2 = bdf_collapsedeventstudy['y2']
t11 = bdf_collapsedeventstudy['t11']
t12 = bdf_collapsedeventstudy['t12']
t21 = bdf_collapsedeventstudy['t21']
t22 = bdf_collapsedeventstudy['t22']
bdf_collapsedeventstudy = bpd.BipartiteDataFrame(i=i, j1=j1, j2=j2, y1=y1, y2=y2, t11=t11, t12=t12, t21=t21, t22=t22)
display(bdf_collapsedeventstudy)

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22
0,0,184,171,2.485910,0.269585,0,1,2,3
1,0,171,137,0.269585,0.963569,2,3,4,4
2,1,1,54,0.613989,-0.150576,0,0,1,1
3,1,54,196,-0.150576,1.419270,1,1,2,3
4,1,196,92,1.419270,0.481084,2,3,4,4
...,...,...,...,...,...,...,...,...,...
20585,9996,40,22,-3.077534,-2.409247,1,1,2,4
20586,9997,49,49,-1.564512,-1.564512,0,4,0,4
20587,9998,4,103,-2.001026,-0.164205,0,2,3,3
20588,9998,103,62,-0.164205,-2.367536,3,3,4,4


Let's check the datatype:

In [15]:
type(bdf_collapsedeventstudy)

bipartitepandas.bipartiteeventstudycollapsed.BipartiteEventStudyCollapsed