# Simple example

In [1]:
# Add BipartitePandas to system path, do not run this
import sys
sys.path.append('../../..')

## Import the BipartitePandas Package

Make sure to install it first using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Get your data ready

In this example, we simulate data, choosing the `firm_size` and `p_move` parameters so that data cleaning is interesting.

In [18]:
df = bpd.SimBipartite(bpd.sim_params({'firm_size': 10, 'p_move': 0.05})).simulate()
display(df)

Unnamed: 0,i,j,y,t,l,k,alpha,psi
0,0,732,1.403440,0,3,7,0.430727,0.604585
1,0,732,0.396067,1,3,7,0.430727,0.604585
2,0,732,2.111937,2,3,7,0.430727,0.604585
3,0,732,0.364341,3,3,7,0.430727,0.604585
4,0,732,1.286483,4,3,7,0.430727,0.604585
...,...,...,...,...,...,...,...,...
49995,9999,446,2.202550,0,4,4,0.967422,-0.114185
49996,9999,446,1.770438,1,4,4,0.967422,-0.114185
49997,9999,446,-0.338953,2,4,4,0.967422,-0.114185
49998,9999,446,0.834811,3,4,4,0.967422,-0.114185


## Columns

BipartitePandas includes 7 pre-defined general columns:

#### Required
- $i$: worker id (any type)
- $j$: firm id (any type)
- $y$: income (float or int)

#### Optional
- $t$: time (int)
- $g$: firm type (any type)
- $w$: weight (float or int)
- $m$: move indicator (int)
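
As a minimal sketch of the required/optional split above, the following plain-pandas check reports which general columns a dataframe carries. The helper `check_general_columns` and the toy data are hypothetical and are not part of BipartitePandas.

```python
import pandas as pd

REQUIRED = ["i", "j", "y"]        # worker id, firm id, income
OPTIONAL = ["t", "g", "w", "m"]   # time, firm type, weight, move indicator


def check_general_columns(frame):
    """Hypothetical helper: list missing required and present optional columns."""
    missing = [c for c in REQUIRED if c not in frame.columns]
    present_optional = [c for c in OPTIONAL if c in frame.columns]
    return missing, present_optional


# Toy data with all required columns and one optional column (t)
toy = pd.DataFrame({
    "i": [0, 0, 1],
    "j": [10, 11, 10],
    "y": [1.2, 0.7, 2.1],
    "t": [0, 1, 0],
})

missing, present_optional = check_general_columns(toy)
print("missing required:", missing)
print("optional present:", present_optional)
```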

## Formats

BipartitePandas includes 4 formats:
- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

These formats split the general columns into different sub-columns:
- *Long* - $i$, $j$, $y$, $t$, $g$, $w$, $m$
- *Collapsed Long* - $i$, $j$, $y$, $t1$, $t2$, $g$, $w$, $m$
- *Event Study* - $i$, $j1$, $j2$, $y1$, $y2$, $t1$, $t2$, $g1$, $g2$, $w1$, $w2$, $m$
- *Collapsed Event Study* - $i$, $j1$, $j2$, $y1$, $y2$, $t11$, $t12$, $t21$, $t22$, $g1$, $g2$, $w1$, $w2$, $m$

## Constructing DataFrames

Our simulated data is in *Long* format. How do we construct a *Long* dataframe?

In [19]:
i = df['i']
j = df['j']
y = df['y']
t = df['t']
bdf = bpd.BipartiteDataFrame(i=i, j=j, y=y, t=t)
display(bdf)

Unnamed: 0,i,j,y,t
0,0,732,1.403440,0
1,0,732,0.396067,1
2,0,732,2.111937,2
3,0,732,0.364341,3
4,0,732,1.286483,4
...,...,...,...,...
49995,9999,446,2.202550,0
49996,9999,446,1.770438,1
49997,9999,446,-0.338953,2
49998,9999,446,0.834811,3


Are we sure this is *Long*? Let's check the datatype:

In [20]:
type(bdf)

bipartitepandas.bipartitelong.BipartiteLong

## Before we clean our data, let's check out some statistics

In [21]:
bdf.summary()

format: 'BipartiteLong'
number of workers: 10000
number of firms: 1000
number of observations: 50000
mean wage: 0.0013107443134556176
median wage: -0.004787415170543902
min wage: -6.18872620478865
max wage: 5.92125674618363
var(wage): 2.6666863328574104
no NaN values: False
no duplicates: False
i-t (worker-year) observations unique (None if t column(s) not included): False
no returns (None if not yet computed): None
contiguous 'i' ids (None if not included): False
contiguous 'j' ids (None if not included): False
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): None
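
To make a few of these entries concrete, here is a rough plain-pandas sketch that computes similar quantities on a toy dataframe. It is an illustration under the assumption that the checks mean what their labels say. It is not how `summary()` is implemented.

```python
import pandas as pd

# Toy Long-format data: one NaN wage and one duplicated worker-period (i=0, t=1)
toy = pd.DataFrame({
    "i": [0, 0, 0, 1, 1],
    "j": [5, 5, 7, 7, 7],
    "y": [1.0, 2.0, 3.0, 4.0, None],
    "t": [0, 1, 1, 0, 1],
})

stats = {
    "number of workers": toy["i"].nunique(),
    "number of firms": toy["j"].nunique(),
    "number of observations": len(toy),
    "mean wage": toy["y"].mean(),  # pandas skips NaN by default
    "no NaN values": not toy.isna().any().any(),
    "no duplicates": not toy.duplicated().any(),
    "i-t observations unique": not toy.duplicated(subset=["i", "t"]).any(),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```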


## Now let's clean our data - and make sure the result is leave-one-observation-out connected

Hint: want details on all cleaning parameters? Type `bpd.clean_params().describe_all()`, or search through `bpd.clean_params().keys()` for a particular key, and then type `bpd.clean_params().describe(key)`.

In [22]:
bdf = bdf.clean(bpd.clean_params({'connectedness': 'leave_out_observation'}))
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
making 'j' ids contiguous
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m
0,0,0,1.403440,0,0
1,0,0,0.396067,1,0
2,0,0,2.111937,2,0
3,0,0,0.364341,3,0
4,0,0,1.286483,4,0
...,...,...,...,...,...
46086,9254,441,2.202550,0,0
46087,9254,441,1.770438,1,0
46088,9254,441,-0.338953,2,0
46089,9254,441,0.834811,3,0
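
Some of the logged steps can be approximated in plain pandas. The sketch below covers only dropping NaN observations, keeping the highest-paying job for worker-period duplicates, and making ids contiguous. It skips the returns and connectedness steps, and it is an assumption about what those log lines mean rather than the library's code.

```python
import pandas as pd

raw = pd.DataFrame({
    "i": [3, 3, 3, 8, 8, 8],
    "j": [40, 40, 90, 90, 90, 40],
    "y": [1.0, 2.5, 2.0, None, 0.5, 0.7],
    "t": [0, 1, 1, 0, 1, 2],
})

clean = (
    raw.dropna(subset=["y"])                           # drop NaN observations
       .sort_values(["i", "t", "y"])                   # highest y last within each (i, t)
       .drop_duplicates(subset=["i", "t"], keep="last")  # keep the highest-paying job
       .reset_index(drop=True)
)

# Make ids contiguous: relabel them 0, 1, 2, ... in order of appearance
clean["i"] = pd.factorize(clean["i"])[0]
clean["j"] = pd.factorize(clean["j"])[0]

print(clean)
```

Worker 3 has two observations at `t=1`, and only the one with the higher `y` survives. Afterwards the worker ids 3 and 8 become 0 and 1, and the firm ids 40 and 90 become 0 and 1.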


We can check how the summary statistics changed:

In [23]:
bdf.summary()

format: 'BipartiteLong'
number of workers: 9255
number of firms: 905
number of observations: 46091
mean wage: 0.0072915696615777625
median wage: 0.0006768608246390806
min wage: -6.18872620478865
max wage: 5.92125674618363
var(wage): 2.6727804963539126
no NaN values: True
no duplicates: True
i-t (worker-year) observations unique (None if t column(s) not included): True
no returns (None if not yet computed): True
contiguous 'i' ids (None if not included): True
contiguous 'j' ids (None if not included): True
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): 'leave_out_observation'
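
For intuition about the connectedness entry, the sketch below finds the largest plain connected set of a worker-firm graph using union-find. It deliberately implements the simpler notion of connectedness, not the leave-one-observation-out variant used above. The function name and the choice to measure component size by observation count are assumptions made for illustration.

```python
from collections import Counter


def largest_connected_set(observations):
    """Keep the (worker, firm) pairs in the largest connected component.

    Workers and firms are nodes and each observation is an edge.
    Component size is measured by number of observations (an assumption).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tag nodes so worker id 0 and firm id 0 never collide
    for i, j in observations:
        union(("worker", i), ("firm", j))

    sizes = Counter(find(("worker", i)) for i, _ in observations)
    biggest_root = sizes.most_common(1)[0][0]
    return [(i, j) for i, j in observations if find(("worker", i)) == biggest_root]


obs = [(0, "A"), (0, "B"), (1, "B"), (2, "C")]
print(largest_connected_set(obs))
```

Here workers 0 and 1 are linked through firm B, while worker 2 and firm C form a separate component that is dropped.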


## Converting formats

### *Collapsed Long* format

In [27]:
display(bdf.collapse())

Unnamed: 0,i,j,y,t1,t2,w,m
0,0,0,1.112454,0,4,5,0
1,1,1,-0.134522,0,4,5,0
2,2,2,0.188888,0,4,5,0
3,3,3,0.201713,0,2,3,1
4,3,4,0.376277,3,4,2,1
...,...,...,...,...,...,...,...
11224,9251,602,2.174079,0,4,5,0
11225,9252,669,0.720166,0,4,5,0
11226,9253,869,0.193721,0,2,3,1
11227,9253,814,0.473813,3,4,2,1
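
In the output above, worker 0's collapsed `y` (1.112454) equals the mean of that worker's five *Long* observations, and `w` counts them. A minimal pandas sketch of that idea follows. It assumes a spell is a run of consecutive rows with the same worker and firm, and it is not the library's `collapse()` implementation.

```python
import pandas as pd

long_df = pd.DataFrame({
    "i": [3, 3, 3, 3, 3],
    "j": [3, 3, 3, 4, 4],
    "y": [1.0, 2.0, 3.0, 4.0, 6.0],
    "t": [0, 1, 2, 3, 4],
}).sort_values(["i", "t"])

# A new spell starts whenever the worker or the firm differs from the previous row
new_spell = (long_df["i"] != long_df["i"].shift()) | (long_df["j"] != long_df["j"].shift())
long_df["spell"] = new_spell.cumsum()

collapsed = (
    long_df.groupby("spell")
    .agg(
        i=("i", "first"),
        j=("j", "first"),
        y=("y", "mean"),    # spell-average income
        t1=("t", "min"),    # first period of the spell
        t2=("t", "max"),    # last period of the spell
        w=("t", "count"),   # number of observations collapsed
    )
    .reset_index(drop=True)
)

print(collapsed)
```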


### *Event Study* format

In [28]:
display(bdf.to_eventstudy())

Unnamed: 0,i,j1,j2,y1,y2,t1,t2,m
0,0,0,0,1.403440,1.403440,0,0,0
1,0,0,0,0.396067,0.396067,1,1,0
2,0,0,0,2.111937,2.111937,2,2,0
3,0,0,0,0.364341,0.364341,3,3,0
4,0,0,0,1.286483,1.286483,4,4,0
...,...,...,...,...,...,...,...,...
44255,9254,441,441,2.202550,2.202550,0,0,0
44256,9254,441,441,1.770438,1.770438,1,1,0
44257,9254,441,441,-0.338953,-0.338953,2,2,0
44258,9254,441,441,0.834811,0.834811,3,3,0
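
The idea of pairing consecutive observations can be sketched in plain pandas with a grouped shift. This is only an illustration: in the library's output above, rows where the worker does not change firm have `t1` equal to `t2` and `y1` equal to `y2`, so `to_eventstudy()` clearly handles those rows differently. Defining `m` as "the firm changes between the two observations" is also an assumption.

```python
import pandas as pd

long_df = pd.DataFrame({
    "i": [0, 0, 0, 1, 1],
    "j": [5, 5, 6, 7, 8],
    "y": [1.0, 1.5, 2.0, 0.5, 0.8],
    "t": [0, 1, 2, 0, 1],
}).sort_values(["i", "t"])

# Each worker's next observation, aligned with the current row
nxt = long_df.groupby("i")[["j", "y", "t"]].shift(-1)

pairs = (
    pd.DataFrame({
        "i": long_df["i"],
        "j1": long_df["j"], "j2": nxt["j"],
        "y1": long_df["y"], "y2": nxt["y"],
        "t1": long_df["t"], "t2": nxt["t"],
    })
    .dropna()                           # a worker's last row has no successor
    .astype({"j2": int, "t2": int})     # shift introduced NaN, so restore ints
    .reset_index(drop=True)
)

# Assumed move indicator: 1 if the firm changes between the two observations
pairs["m"] = (pairs["j1"] != pairs["j2"]).astype(int)

print(pairs)
```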


### *Collapsed Event Study* format

In [29]:
display(bdf.collapse().to_eventstudy())

Unnamed: 0,i,j1,j2,y1,y2,t11,t12,t21,t22,w1,w2,m
0,0,0,0,1.112454,1.112454,0,4,0,4,5,5,0
1,1,1,1,-0.134522,-0.134522,0,4,0,4,5,5,0
2,2,2,2,0.188888,0.188888,0,4,0,4,5,5,0
3,3,3,4,0.201713,0.376277,0,2,3,4,3,2,1
4,4,5,6,-0.703317,0.120172,0,0,1,4,1,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...
9393,9250,655,655,-0.420098,-0.420098,0,4,0,4,5,5,0
9394,9251,602,602,2.174079,2.174079,0,4,0,4,5,5,0
9395,9252,669,669,0.720166,0.720166,0,4,0,4,5,5,0
9396,9253,869,814,0.193721,0.473813,0,2,3,4,3,2,1


## Generating firm clusters

Clustering adds a new $g$ column (firm type) to the data.

In [31]:
display(bdf.cluster())

INFO:bipartitelong:beginning clustering
INFO:bipartitelong:beginning copy
INFO:bipartitelong:firm moments computed
INFO:bipartitelong:computing firm groups
INFO:bipartitelong:firm groups computed
INFO:bipartitelong:dictionary linking firms to clusters generated
INFO:bipartitelong:sorting columns
INFO:bipartitelong:clusters merged into data


Unnamed: 0,i,j,y,t,g,m
0,0,0,1.403440,0,3,0
1,0,0,0.396067,1,3,0
2,0,0,2.111937,2,3,0
3,0,0,0.364341,3,3,0
4,0,0,1.286483,4,3,0
...,...,...,...,...,...,...
46086,9254,441,2.202550,0,1,0
46087,9254,441,1.770438,1,1,0
46088,9254,441,-0.338953,2,1,0
46089,9254,441,0.834811,3,1,0
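
To show how a firm-level grouping ends up as a row-level $g$ column, here is a deliberately simple stand-in that splits firms into two groups by mean wage using quantiles. This swapped-in technique is not the algorithm `cluster()` uses. It only mirrors the overall flow suggested by the log: compute firm moments, assign firm groups, merge the groups back into the data.

```python
import pandas as pd

long_df = pd.DataFrame({
    "j": [0, 0, 1, 1, 2, 2, 3, 3],
    "y": [0.1, 0.3, 1.0, 1.2, 2.0, 2.2, 3.1, 2.9],
})

# Firm moment: mean wage per firm
firm_mean = long_df.groupby("j")["y"].mean()

# Stand-in grouping: below-median firms get 0, above-median firms get 1
firm_group = pd.qcut(firm_mean, q=2, labels=False)

# Merge the firm-level groups back into the observation-level data
long_df["g"] = long_df["j"].map(firm_group)

print(long_df)
```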
