# Advanced features

In [3]:
# Add BipartitePandas to system path, do not run this
import sys
sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [4]:
import bipartitepandas as bpd

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Get your data ready

For this notebook, we simulate data.

In [5]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'])
display(bdf)

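| | i | j | y | t |
|---|---|---|---|---|
| 0 | 0 | 177 | 1.845578 | 0 |
| 1 | 0 | 189 | 1.956269 | 1 |
| 2 | 0 | 189 | 3.617850 | 2 |
| 3 | 0 | 189 | 1.615126 | 3 |
| 4 | 0 | 182 | 2.171518 | 4 |
| ... | ... | ... | ... | ... |
| 49995 | 9999 | 36 | -2.409573 | 0 |
| 49996 | 9999 | 113 | -1.828584 | 1 |
| 49997 | 9999 | 113 | 1.035640 | 2 |
| 49998 | 9999 | 99 | -0.864686 | 3 |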


In [7]:
bpd.clean_params().keys()

dict_keys(['connectedness', 'component_size_variable', 'i_t_how', 'drop_returns', 'drop_returns_to_stays', 'is_sorted', 'force', 'copy', 'verbose'])

## Advanced data cleaning

*Hint:* Want details on all cleaning parameters? Run `bpd.clean_params().describe_all()`. For a single parameter, look up its key in `bpd.clean_params().keys()`, then run `bpd.clean_params().describe(key)`.

#### Compute the largest connected component

Use the parameter `connectedness` to set the connectedness measure to use, and `component_size_variable` to choose how to measure component size (e.g. `firms` chooses the connected component with the greatest number of unique firm ids, while `workers` chooses the connected component with the greatest number of unique worker ids).
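To illustrate the difference between the two size measures, here is a minimal plain-Python sketch that finds connected components of a worker-firm graph and picks the largest one by firms or by workers. It is an illustration only, not BipartitePandas's implementation: the function name and the `size_variable` argument are hypothetical, and it uses the simplest notion of connectedness (two firms are linked if some worker is observed at both), which may be weaker than the measures the package offers.

```python
from collections import defaultdict


def largest_connected_component(pairs, size_variable="firms"):
    """Return the firm ids of the largest connected component.

    pairs: list of (worker_id, firm_id) observations.
    size_variable: "firms" or "workers" (hypothetical argument for this sketch).
    """
    pairs = list(pairs)
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every firm a worker visits to the first firm seen for that worker
    first_firm = {}
    for i, j in pairs:
        parent.setdefault(j, j)
        if i in first_firm:
            root_a, root_b = find(first_firm[i]), find(j)
            if root_a != root_b:
                parent[root_b] = root_a
        else:
            first_firm[i] = j

    # Collect the firms and workers belonging to each component
    firms, workers = defaultdict(set), defaultdict(set)
    for i, j in pairs:
        root = find(j)
        firms[root].add(j)
        workers[root].add(i)

    sizes = firms if size_variable == "firms" else workers
    best = max(sizes, key=lambda root: len(sizes[root]))
    return firms[best]


pairs = [
    (0, "A"), (0, "B"), (1, "B"), (1, "C"),            # 3 firms, 2 workers
    (2, "D"), (3, "D"), (4, "D"), (5, "D"), (5, "E"),  # 2 firms, 4 workers
]
print(largest_connected_component(pairs, "firms"))
print(largest_connected_component(pairs, "workers"))
```

In this toy data the two measures disagree: the first component has more firms, while the second has more workers.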

#### Set how to handle worker-year duplicates

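Use the parameter `i_t_how` to customize how worker-year duplicates (multiple observations for the same worker `i` in the same period `t`) are handled. For example, `'max'` keeps the highest-paying job for each worker-year.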
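As a rough picture of what keeping the highest-paying job means, the following plain-Python sketch deduplicates observations by `(i, t)` and keeps the row with the largest `y`. The helper name is hypothetical and this is not the package's code; it only mirrors the behavior described by the cleaning log message for `how='max'`.

```python
def keep_max_per_worker_period(rows):
    """Keep, for each (i, t) pair, the row with the highest y.

    rows: list of dicts with keys 'i', 'j', 'y', 't'.
    """
    best = {}
    for row in rows:
        key = (row["i"], row["t"])
        if key not in best or row["y"] > best[key]["y"]:
            best[key] = row
    # Return rows ordered by worker, then period
    return sorted(best.values(), key=lambda r: (r["i"], r["t"]))


rows = [
    {"i": 0, "j": 5, "y": 1.0, "t": 1},
    {"i": 0, "j": 7, "y": 2.5, "t": 1},  # duplicate worker-year, higher pay
    {"i": 0, "j": 7, "y": 2.0, "t": 2},
]
deduplicated = keep_max_per_worker_period(rows)
print(deduplicated)
```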

#### Collapse at the match-level

If you drop the `t` column, collapsing will automatically collapse at the match level. However, this prevents conversion to event study format (this can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable).
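The sketch below shows, in plain Python, what a match-level collapse amounts to: one row per worker-firm pair. It assumes the outcome is aggregated by its mean, which is an assumption of this example rather than a statement about the package's defaults, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_match_level(rows):
    """Collapse (i, j, y) observations to one row per worker-firm match."""
    outcomes = defaultdict(list)
    for i, j, y in rows:
        outcomes[(i, j)].append(y)
    # Assumption: the outcome of a match is the mean of its observations
    return [(i, j, mean(ys)) for (i, j), ys in sorted(outcomes.items())]


# Worker 0 works at firm 1, moves to firm 2, then returns to firm 1
rows = [(0, 1, 1.0), (0, 2, 3.0), (0, 1, 2.0)]
collapsed = collapse_match_level(rows)
print(collapsed)
```

Both spells at firm 1 end up in a single row, so the order in which the worker visited the firms can no longer be recovered. This is why an event study built on such data has no meaningful timing.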

#### Avoid unnecessary copies

If you are working with a large dataset, avoid copies whenever possible by setting `copy=False`.

#### Avoid unnecessary sorts

If you know your data is sorted by `i` and `t` (or, if you aren't including a `t` column, just by `i`), then set `is_sorted=True`.
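Before setting `is_sorted=True`, it can be worth confirming the ordering yourself. Here is a minimal plain-Python check, independent of BipartitePandas (the helper name is hypothetical):

```python
def is_sorted_by(rows, keys=("i", "t")):
    """Return True if rows (a list of dicts) are in non-decreasing order of keys."""
    sort_keys = [tuple(row[k] for k in keys) for row in rows]
    return all(a <= b for a, b in zip(sort_keys, sort_keys[1:]))


sorted_rows = [{"i": 0, "t": 0}, {"i": 0, "t": 1}, {"i": 1, "t": 0}]
unsorted_rows = [{"i": 0, "t": 1}, {"i": 0, "t": 0}, {"i": 1, "t": 0}]

print(is_sorted_by(sorted_rows))                 # sorted by i, then t
print(is_sorted_by(unsorted_rows))               # t is out of order for worker 0
print(is_sorted_by(unsorted_rows, keys=("i",)))  # still sorted by i alone
```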

#### Avoid complicated loops

Sometimes workers leave a firm, then return to it (we call these workers *returners*). Returners can cause computational difficulties because sometimes intermediate firms get dropped (e.g. a worker goes from firm $A \to B \to A$, but firm $B$ gets dropped). This turns returners into stayers. This can change the largest connected set of firms, and if data is in collapsed format, requires the data to be recollapsed.

Because of these potential complications, if there are returners, many methods require loops that run until convergence.

These difficulties can be avoided by setting the parameter `drop_returns`. There are multiple ways to handle returners; to see them, run `bpd.clean_params().describe('drop_returns')`.

*Alternative:* another way to handle returners is to drop the `t` column. Then, sorting will automatically sort by `i` and `j`, which eliminates the possibility of returners. However, this prevents conversion to event study format (this can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable).
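To make the returner problem concrete, the following plain-Python sketch detects returners and shows how dropping an intermediate firm turns a returner into a stayer. The helper names are hypothetical and this is only an illustration of the definition given above, not the package's detection logic.

```python
from itertools import groupby


def firm_sequence(rows, worker):
    """Firms visited by a worker in time order, with consecutive repeats merged.

    rows: list of (i, j, t) observations.
    """
    firms = [j for i, j, t in sorted(rows, key=lambda r: r[2]) if i == worker]
    return [j for j, _ in groupby(firms)]


def is_returner(rows, worker):
    """A returner visits some firm in two separate spells."""
    sequence = firm_sequence(rows, worker)
    return len(sequence) != len(set(sequence))


# Worker 0 moves A -> B -> A
rows = [(0, "A", 0), (0, "B", 1), (0, "A", 2)]
print(firm_sequence(rows, 0), is_returner(rows, 0))

# If firm B is dropped (e.g. it is outside the connected set), the two
# spells at A become adjacent and the worker looks like a stayer
rows_without_b = [row for row in rows if row[1] != "B"]
print(firm_sequence(rows_without_b, 0), is_returner(rows_without_b, 0))
```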

#### Disable logging

Logging can slow down data cleaning. Set the parameter `log=False` when constructing your dataframe to turn off logging.

## Filling in missing years as unemployed

The method `.fill_periods` (for *Long* format) will fill in rows for missing intermediate periods. Note that this method will not work with custom columns.

In this example, we drop periods 1-3, then fill them in:

In [13]:
bdf_missing = bdf[(bdf['t'] == 0) | (bdf['t'] == 4)].clean()
display(bdf_missing.fill_periods())

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


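| | i | j | y | t | m |
|---|---|---|---|---|---|
| 0 | 0 | 177 | 1.845578 | 0 | 1 |
| 1 | 0 | -1 | | 1 | |
| 2 | 0 | -1 | | 2 | |
| 3 | 0 | -1 | | 3 | |
| 4 | 0 | 182 | 2.171518 | 4 | 1 |
| ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 36 | -2.409573 | 0 | 1 |
| 49996 | 9999 | -1 | | 1 | |
| 49997 | 9999 | -1 | | 2 | |
| 49998 | 9999 | -1 | | 3 | |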
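The output above suggests that filled rows get firm id `-1` and a missing outcome. Here is a minimal plain-Python sketch of that behavior, assuming integer periods and rows already sorted by worker and period (the function name and the use of `None` for missing values are choices made for this illustration):

```python
def fill_missing_periods(rows, fill_j=-1):
    """Insert placeholder rows for periods missing between a worker's observations.

    rows: list of (i, j, y, t) tuples sorted by i, then t, with integer t.
    """
    filled = []
    previous = None
    for current in rows:
        if previous is not None and previous[0] == current[0]:
            # Same worker: add one placeholder row per skipped period
            for t in range(previous[3] + 1, current[3]):
                filled.append((current[0], fill_j, None, t))
        filled.append(current)
        previous = current
    return filled


rows = [(0, 177, 1.85, 0), (0, 182, 2.17, 4), (1, 36, -2.41, 0)]
filled = fill_missing_periods(rows)
for row in filled:
    print(row)
```

Only gaps between two observations of the same worker are filled, so nothing is added before a worker's first period or after their last one.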


## Getting extended event study dataframes

BipartitePandas allows you to use *Long* format data to generate event studies with more than 2 periods.

You can specify:

- what column signals a transition (e.g. if `j` is used, a transition is when a worker moves firms)
- which column(s) should be treated as the event study outcome
- how many periods before and after the transition should be considered
- whether the pre- and/or post-trends must be stable, and for which column(s)

We consider an example where `j` is the transition column, `y` is the outcome column, and the pre- and post-trends each span 2 periods during which the worker must stay at the same firm.

In [14]:
es_extended = bdf.get_extended_eventstudy(transition_col='j', outcomes='y', periods_pre=2, periods_post=2, stable_pre='j', stable_post='j')
display(es_extended)

| | i | t | y_l2 | y_l1 | y_f1 | y_f2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | -0.323825 | -1.355848 | 0.755453 | 0.391587 |
| 1 | 3 | 3 | -0.319881 | -1.431893 | -2.443917 | -1.929856 |
| 2 | 8 | 2 | 0.820075 | 1.578752 | 0.262440 | 0.935381 |
| 3 | 10 | 2 | 1.943129 | 0.539556 | 1.633443 | 3.170492 |
| 4 | 18 | 2 | 1.183891 | -0.100619 | 0.841120 | 0.557501 |
| ... | ... | ... | ... | ... | ... | ... |
| 2528 | 9982 | 3 | 0.638243 | 0.653912 | -1.710992 | -2.246682 |
| 2529 | 9985 | 3 | -1.784652 | -1.686304 | -2.085295 | -0.225994 |
| 2530 | 9988 | 3 | -1.618343 | -0.237859 | -0.503339 | -0.645826 |
| 2531 | 9996 | 3 | -2.611572 | -0.775764 | -3.030953 | -0.849486 |
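To clarify what the columns in this output mean, here is a plain-Python sketch of the selection logic for the case where `j` is both the transition column and the stability column. It rests on assumptions inferred from the output above rather than from the package's source: `t` is taken to be the first period at the new firm, `y_l1` the last period before the move, and `y_f1` the first period after it. The function name is hypothetical, and each worker is assumed to have consecutive periods with no gaps.

```python
from collections import defaultdict


def extended_event_study(rows, periods_pre=2, periods_post=2):
    """Build event-study rows around firm moves.

    rows: list of (i, j, y, t) tuples; periods are assumed consecutive per worker.
    """
    by_worker = defaultdict(list)
    for i, j, y, t in sorted(rows, key=lambda r: (r[0], r[3])):
        by_worker[i].append((j, y, t))

    events = []
    for i, obs in by_worker.items():
        # k indexes the first observation after a candidate transition
        for k in range(periods_pre, len(obs) - periods_post + 1):
            pre = obs[k - periods_pre:k]
            post = obs[k:k + periods_post]
            moved = pre[-1][0] != post[0][0]
            stable_pre = len({j for j, _, _ in pre}) == 1
            stable_post = len({j for j, _, _ in post}) == 1
            if moved and stable_pre and stable_post:
                event = {"i": i, "t": post[0][2]}
                for n, (_, y, _) in enumerate(pre):
                    event[f"y_l{periods_pre - n}"] = y  # lags: l2, then l1
                for n, (_, y, _) in enumerate(post):
                    event[f"y_f{n + 1}"] = y            # leads: f1, then f2
                events.append(event)
    return events


rows = [
    # Worker 1: two periods at firm A, then firm B (qualifies at t=2)
    (1, "A", 0.1, 0), (1, "A", 0.2, 1), (1, "B", 0.3, 2), (1, "B", 0.4, 3), (1, "B", 0.5, 4),
    # Worker 2: moves A -> B -> C, so the pre-trend is never stable
    (2, "A", 1.0, 0), (2, "B", 1.1, 1), (2, "C", 1.2, 2), (2, "C", 1.3, 3), (2, "C", 1.4, 4),
]
events = extended_event_study(rows)
print(events)
```

With five periods per worker and two periods required on each side, a qualifying move can only occur at `t=2` or `t=3`, which matches the values of `t` in the output above.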


In [None]:
# You can specify which columns to include as outcomes (by default they are g and y)
es_extended = bdf.get_extended_eventstudy(periods_pre=3, periods_post=2, outcomes=['j', 'y'])
display(es_extended)

In [None]:
# You can specify column(s) for stable_pre or stable_post to keep only workers for whom those columns are constant before/after the transition
es_extended = bdf.get_extended_eventstudy(periods_pre=3, periods_post=2, stable_pre='j', stable_post='j', outcomes=['j', 'y'])
display(es_extended)

In [None]:
# You can specify column(s) for stable_pre or stable_post that aren't among the outcome columns
# (this requires the dataframe to contain a g column)
es_extended = bdf.get_extended_eventstudy(periods_pre=3, periods_post=2, stable_pre='g', stable_post='g', outcomes=['j', 'y'])
display(es_extended)

In [None]:
# You can also change which column defines a transition
# (this requires the dataframe to contain a g column)
es_extended = bdf.get_extended_eventstudy(periods_pre=3, periods_post=2, stable_pre='g', stable_post='j', outcomes=['j', 'g', 'y'], transition_col='g')
display(es_extended)
display(es_extended[es_extended['j_l3'] != es_extended['j_l2']])