# Simple example

In [1]:
# Add BipartitePandas to system path, do not run this
import sys
sys.path.append('../../..')

## Import the BipartitePandas Package

Make sure to install it first using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Get your data ready

In this example, we simulate data, choosing parameters that make data cleaning interesting.

In [3]:
df = bpd.SimBipartite(bpd.sim_params({'firm_size': 10, 'p_move': 0.05})).simulate()
display(df)

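| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 63 | -0.618350 | 0 | 2 | 0 | 0.000000 | -1.335178 |
| 1 | 0 | 63 | -1.811602 | 1 | 2 | 0 | 0.000000 | -1.335178 |
| 2 | 0 | 63 | -1.750508 | 2 | 2 | 0 | 0.000000 | -1.335178 |
| 3 | 0 | 63 | -0.021734 | 3 | 2 | 0 | 0.000000 | -1.335178 |
| 4 | 0 | 63 | -1.753123 | 4 | 2 | 0 | 0.000000 | -1.335178 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 361 | -0.263988 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 49996 | 9999 | 361 | -0.472704 | 1 | 1 | 3 | -0.430727 | -0.348756 |
| 49997 | 9999 | 361 | 0.518493 | 2 | 1 | 3 | -0.430727 | -0.348756 |
| 49998 | 9999 | 361 | -1.102591 | 3 | 1 | 3 | -0.430727 | -0.348756 |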


## Columns

BipartitePandas includes 7 pre-defined general columns:

#### Required
- $i$: worker id (any type)
- $j$: firm id (any type)
- $y$: income (float or int)

#### Optional
- $t$: time (int)
- $g$: firm type (any type)
- $w$: weight (float or int)
- $m$: move indicator (int)
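
The column contract above can be illustrated with a small check written in plain pandas. This is only a sketch: `check_columns`, `REQUIRED`, and `OPTIONAL` are hypothetical names introduced here, not part of BipartitePandas, and the library's own validation may differ.

```python
import pandas as pd

# Hypothetical constants mirroring the lists above (not library names)
REQUIRED = ('i', 'j', 'y')
OPTIONAL = ('t', 'g', 'w', 'm')

def check_columns(df):
    """Raise if a required column is missing or y is not numeric;
    return the optional general columns that are present."""
    missing = [col for col in REQUIRED if col not in df.columns]
    if missing:
        raise ValueError(f'missing required columns: {missing}')
    if not pd.api.types.is_numeric_dtype(df['y']):
        raise TypeError('y must be float or int')
    return [col for col in OPTIONAL if col in df.columns]

toy = pd.DataFrame({'i': [0, 0], 'j': [5, 7], 'y': [1.0, 2.0], 't': [0, 1]})
present_optional = check_columns(toy)
print(present_optional)
```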

## Formats

BipartitePandas includes 4 formats:
- *Long* - each row gives a single observation
- *Collapsed Long* - like *Long*, but employment spells at the same firm are collapsed into a single observation
- *Event Study* - each row gives two consecutive observations
- *Collapsed Event Study* - like *Event Study*, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently:
- *Long* - $i$, $j$, $y$, $t$, $g$, $w$, $m$
- *Collapsed Long* - $i$, $j$, $y$, $t1$, $t2$, $g$, $w$, $m$
- *Event Study* - $i$, $j1$, $j2$, $y1$, $y2$, $t1$, $t2$, $g1$, $g2$, $w1$, $w2$, $m$
- *Collapsed Event Study* - $i$, $j1$, $j2$, $y1$, $y2$, $t11$, $t12$, $t21$, $t22$, $g1$, $g2$, $w1$, $w2$, $m$
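
To make the relationship between formats concrete, the following sketch rebuilds *Collapsed Long* and *Event Study* layouts from a toy *Long* table using plain pandas. It does not use BipartitePandas. Averaging $y$ within a spell and dropping each worker's final unpaired observation are assumptions made for this illustration, and the $m$ column and the library's treatment of stayers are not reproduced.

```python
import pandas as pd

# Toy Long data: worker 0 spends two periods at firm 5, then moves to firm 7
long_df = pd.DataFrame({
    'i': [0, 0, 0, 1, 1],
    'j': [5, 5, 7, 7, 7],
    'y': [1.0, 3.0, 2.0, 4.0, 6.0],
    't': [0, 1, 2, 0, 1],
}).sort_values(['i', 't']).reset_index(drop=True)

# A new spell starts whenever the worker or the firm changes between rows
new_spell = (long_df['i'] != long_df['i'].shift()) | (long_df['j'] != long_df['j'].shift())
spell_id = new_spell.cumsum()

# Collapsed Long: one row per spell, with t1/t2 the first and last period
# (taking the mean of y within a spell is an assumption of this sketch)
collapsed = (
    long_df.groupby(spell_id)
    .agg(i=('i', 'first'), j=('j', 'first'), y=('y', 'mean'),
         t1=('t', 'min'), t2=('t', 'max'))
    .reset_index(drop=True)
)

# Event Study: pair each observation with the same worker's next observation
nxt = long_df.groupby('i')[['j', 'y', 't']].shift(-1)
event_study = pd.DataFrame({
    'i': long_df['i'],
    'j1': long_df['j'], 'j2': nxt['j'],
    'y1': long_df['y'], 'y2': nxt['y'],
    't1': long_df['t'], 't2': nxt['t'],
}).dropna().reset_index(drop=True)
# The shift introduced NaNs, so restore integer ids and periods
event_study[['j2', 't2']] = event_study[['j2', 't2']].astype(int)

print(collapsed)
print(event_study)
```

Applying the same pairing step to `collapsed` instead of `long_df` would give the *Collapsed Event Study* layout, with two time columns per spell.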

## Constructing DataFrames

Our simulated data is in *Long* format. How do we construct a *Long* DataFrame?

In [4]:
i = df['i']
j = df['j']
y = df['y']
t = df['t']
bdf = bpd.BipartiteDataFrame(i=i, j=j, y=y, t=t)
display(bdf)

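| | i | j | y | t |
|---|---|---|---|---|
| 0 | 0 | 63 | -0.618350 | 0 |
| 1 | 0 | 63 | -1.811602 | 1 |
| 2 | 0 | 63 | -1.750508 | 2 |
| 3 | 0 | 63 | -0.021734 | 3 |
| 4 | 0 | 63 | -1.753123 | 4 |
| ... | ... | ... | ... | ... |
| 49995 | 9999 | 361 | -0.263988 | 0 |
| 49996 | 9999 | 361 | -0.472704 | 1 |
| 49997 | 9999 | 361 | 0.518493 | 2 |
| 49998 | 9999 | 361 | -1.102591 | 3 |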


Are we sure this is in *Long* format? Let's check the type:

In [5]:
type(bdf)

bipartitepandas.bipartitelong.BipartiteLong

## Before we clean our data, let's check out some statistics

In [6]:
bdf.summary()

format: 'BipartiteLong'
number of workers: 10000
number of firms: 1000
number of observations: 50000
mean wage: -0.009399737891587488
median wage: -0.006301350921179505
min wage: -5.523671469790966
max wage: 5.922274960525396
var(wage): 2.6572896778352675
no NaN values: False
no duplicates: False
i-t (worker-year) observations unique (None if t column(s) not included): False
no returns (None if not yet computed): None
contiguous 'i' ids (None if not included): False
contiguous 'j' ids (None if not included): False
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): None
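
Several of the lines in this summary can be approximated with plain pandas, which may help when interpreting them. The sketch below is an interpretation of what the labels mean, not the library's implementation, and the toy data is invented for illustration.

```python
import pandas as pd

# Toy data with one missing wage and one fully duplicated row
toy = pd.DataFrame({
    'i': [0, 0, 1, 1, 1],
    'j': [5, 7, 7, 7, 9],
    'y': [1.0, 2.0, 4.0, 4.0, None],
    't': [0, 1, 0, 0, 1],
})

stats = {
    'number of workers': toy['i'].nunique(),
    'number of firms': toy['j'].nunique(),
    'number of observations': len(toy),
    'mean wage': toy['y'].mean(),  # NaN values are skipped by default
    'no NaN values': not toy.isna().any().any(),
    'no duplicates': not toy.duplicated().any(),
    # True only if each worker appears at most once per period
    'i-t observations unique': not toy.duplicated(subset=['i', 't']).any(),
}

for name, value in stats.items():
    print(f'{name}: {value}')
```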


## Now let's clean our data - and make sure the result is leave-one-observation-out connected

Hint: want details on all cleaning parameters? Type `bpd.clean_params().describe_all()`, or search through `bpd.clean_params().keys()` for a particular key, and then type `bpd.clean_params().describe(key)`.

In [7]:
bpd.clean_params().describe_all()

KEY: 'connectedness'
CURRENT VALUE: None
VALID VALUES: one of ['connected', 'leave_out_observation', 'leave_out_spell', 'leave_out_match', 'leave_out_worker', 'leave_out_firm', None]
DESCRIPTION: 
            (default=None) When computing largest connected set of firms: if 'connected', keep observations in the largest connected set of firms; if 'leave_out_observation', keep observations in the largest leave-one-observation-out connected set; if 'leave_out_spell', keep observations in the largest leave-one-spell-out connected set; if 'leave_out_match', keep observations in the largest leave-one-match-out connected set; if 'leave_out_worker', keep observations in the largest leave-one-worker-out connected set; if 'leave_out_firm', keep observations in the largest leave-one-firm-out connected set; if None, keep all observations.
        
KEY: 'component_size_variable'
CURRENT VALUE: 'firms'
VALID VALUES: one of ['len', 'length', 'firms', 'workers', 'stayers', 'movers', 'firms_plus_workers

In [None]:
bdf = bdf.clean(bpd.clean_params({'connectedness': 'leave_out_observation'}))
display(bdf)
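
For intuition about what connectedness means, the sketch below implements only the simplest option, `'connected'`, in pure Python: it finds the largest connected set of firms, measured by number of firms as in the default `component_size_variable`. It assumes two firms are linked when some worker is observed at both. The function name `largest_connected_firms` is hypothetical, and this is not how the library computes it. The leave-one-out variants are stricter and are not shown.

```python
from collections import Counter

def largest_connected_firms(observations):
    """Return the firm ids in the largest connected set of firms.

    observations: iterable of (worker_id, firm_id) pairs.
    """
    parent = {}

    def find(firm):
        # Union-find lookup with path halving
        parent.setdefault(firm, firm)
        while parent[firm] != firm:
            parent[firm] = parent[parent[firm]]
            firm = parent[firm]
        return firm

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    first_firm = {}
    for worker, firm in observations:
        find(firm)  # register the firm even if no worker ever moves
        if worker in first_firm:
            # The worker links this firm to the first firm they were seen at
            union(first_firm[worker], firm)
        else:
            first_firm[worker] = firm

    # Component size is the number of firms sharing a root
    sizes = Counter(find(firm) for firm in list(parent))
    largest_root, _ = sizes.most_common(1)[0]
    return {firm for firm in list(parent) if find(firm) == largest_root}

obs = [(0, 'A'), (0, 'B'), (1, 'B'), (1, 'C'), (2, 'D'), (3, 'D'), (3, 'E')]
firms = largest_connected_firms(obs)
kept = [(worker, firm) for worker, firm in obs if firm in firms]
print(sorted(firms), len(kept))
```

Here firms A, B, and C are linked through workers 0 and 1, while D and E form a smaller separate set, so only observations at A, B, and C are kept.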

We can check how the summary statistics changed:

In [None]:
bdf.summary()

## Converting formats

### *Collapsed* format

In [None]:
display(bdf.collapse())

### *Event Study* format

In [None]:
display(bdf.to_eventstudy())

### *Collapsed Event Study* format

In [None]:
display(bdf.collapse().to_eventstudy())

## Generating firm clusters

Notice the new $g$ column, which holds each firm's type (its cluster).

In [None]:
display(bdf.cluster())