# Advanced features

In [1]:
# Add BipartitePandas to system path, do not run this
# import sys
# sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Get your data ready

For this notebook, we simulate data.

In [3]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'])
display(bdf)

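|  | i | j | y | t |
|---|---|---|---|---|
| 0 | 0 | 123 | 1.836624 | 0 |
| 1 | 0 | 92 | -0.504826 | 1 |
| 2 | 0 | 92 | 0.119876 | 2 |
| 3 | 0 | 92 | 1.446130 | 3 |
| 4 | 0 | 92 | 0.879996 | 4 |
| ... | ... | ... | ... | ... |
| 49995 | 9999 | 60 | -0.667988 | 0 |
| 49996 | 9999 | 80 | -0.243995 | 1 |
| 49997 | 9999 | 80 | -1.355750 | 2 |
| 49998 | 9999 | 46 | -1.213313 | 3 |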


## Advanced data cleaning

*Hint:* for details on all cleaning parameters, run `bpd.clean_params().describe_all()`. To look up a particular parameter, search `bpd.clean_params().keys()` for its key, then run `bpd.clean_params().describe(key)`.

#### Compute the largest connected component

Use the parameter `connectedness` to choose the connectedness measure, and `component_size_variable` to choose how component size is measured. For example, `firms` selects the connected component with the greatest number of unique firm ids, while `workers` selects the connected component with the greatest number of unique worker ids.

*Note:* connectedness is NOT NECESSARILY maintained between non-collapsed and collapsed formats. Therefore, if you plan to use connected, collapsed data, it is recommended to first clean the non-collapsed data with `connectedness=None`, next collapse the data, and finally clean the collapsed data with the connectedness measure you plan to use.
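To make the role of the component size measure concrete, here is a minimal plain-Python sketch of finding the largest connected component of a worker-firm graph with union-find. It is not BipartitePandas' implementation: the function name and the `size_variable` argument are hypothetical stand-ins, and it only covers plain connectedness, not the stricter leave-one-out measures.

```python
from collections import defaultdict


def largest_connected_component(observations, size_variable="firms"):
    """Return (firm ids, worker ids) of the largest connected component.

    observations: iterable of (worker_id, firm_id) pairs.
    size_variable: 'firms' or 'workers' (hypothetical analogue of
    `component_size_variable`).
    """
    parent = {}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each observation links a worker node to a firm node
    for i, j in observations:
        w, f = ("w", i), ("f", j)
        parent.setdefault(w, w)
        parent.setdefault(f, f)
        root_w, root_f = find(w), find(f)
        if root_w != root_f:
            parent[root_f] = root_w

    # Group firm ids and worker ids by component root
    firms, workers = defaultdict(set), defaultdict(set)
    for node in parent:
        kind, ident = node
        (workers if kind == "w" else firms)[find(node)].add(ident)

    sizes = firms if size_variable == "firms" else workers
    best = max(sizes, key=lambda root: len(sizes[root]))
    return firms[best], workers[best]


# Toy data: four workers at firm 10, and one worker who visits firms 20, 21, 22
obs = [(0, 10), (1, 10), (2, 10), (3, 10), (4, 20), (4, 21), (4, 22)]
by_firms = largest_connected_component(obs, "firms")
by_workers = largest_connected_component(obs, "workers")
```

In this toy data the two size measures pick different components, which is why the choice matters.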

#### Set how to handle worker-year duplicates

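Use the parameter `i_t_how` to customize how worker-year duplicates (multiple observations for the same worker `i` in the same period `t`) are handled. The cleaning logs later in this notebook show `how='max'`, which keeps the highest paying job.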
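As an illustration of what the `'max'` behavior described in the cleaning log amounts to, the following plain-Python sketch keeps the highest paying job for each worker-year. The function name and the row layout are assumptions made for this example, not the library's internals.

```python
def keep_highest_paying(rows):
    """Keep, for each (i, t) pair, the row with the largest y.

    rows: list of dicts with keys 'i', 'j', 'y', 't'.
    """
    best = {}
    for row in rows:
        key = (row["i"], row["t"])
        if key not in best or row["y"] > best[key]["y"]:
            best[key] = row
    # Return rows ordered by worker, then period
    return sorted(best.values(), key=lambda r: (r["i"], r["t"]))


# Toy data: worker 0 holds two jobs in period 0
rows = [
    {"i": 0, "j": 1, "y": 1.0, "t": 0},
    {"i": 0, "j": 2, "y": 3.0, "t": 0},
    {"i": 0, "j": 2, "y": 2.0, "t": 1},
]
deduplicated = keep_highest_paying(rows)
```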

#### Collapse at the match-level

If you drop the `t` column, the data will automatically be collapsed at the match level. However, this prevents conversion to event study format. This can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable.
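The idea of a match-level collapse can be sketched in plain Python as grouping observations by worker-firm pair. This is only an illustration: averaging `y` and recording an observation count are assumptions made here, and the function name is hypothetical.

```python
from collections import defaultdict


def collapse_at_match_level(rows):
    """Collapse (i, j, y) rows to one row per (i, j) match.

    Returns (i, j, mean y, number of observations) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for i, j, y in rows:
        acc = totals[(i, j)]
        acc[0] += y
        acc[1] += 1
    return [(i, j, s / n, n) for (i, j), (s, n) in sorted(totals.items())]


# Toy data: worker 0 is observed at firm 5, then firm 7, then firm 5 again
rows = [(0, 5, 1.0), (0, 7, 2.0), (0, 5, 3.0)]
collapsed = collapse_at_match_level(rows)
```

Without a time column, the two separate stints at firm 5 cannot be told apart, so they merge into a single match.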

#### Avoid unnecessary copies

If you are working with a large dataset, avoid copies whenever possible by setting `copy=False`.

#### Avoid unnecessary sorts

If you know your data is sorted by `i` and `t` (or, if you aren't including a `t` column, just by `i`), then set `is_sorted=True`.

#### Avoid complicated loops

Sometimes workers leave a firm, then return to it (we call these workers *returners*). Returners can cause computational difficulties because sometimes intermediate firms get dropped (e.g. a worker goes from firm $A \to B \to A$, but firm $B$ gets dropped). This turns returners into stayers. This can change the largest connected set of firms, and if data is in collapsed format, requires the data to be recollapsed.

Because of these potential complications, if there are returners, many methods require loops that run until convergence.

These difficulties can be avoided by setting the parameter `drop_returns`. There are multiple ways to handle returners; to see them, run `bpd.clean_params().describe('drop_returns')`.

*Alternative:* another way to handle returners is to drop the `t` column. Then, sorting will automatically sort by `i` and `j`, which eliminates the possibility of returners. However, this prevents conversion to event study format (this can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable).
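The returner problem described above can be illustrated with a short plain-Python sketch. The helper names are hypothetical and the logic is a simplification for intuition, not the library's implementation.

```python
from itertools import groupby


def spells(firm_sequence):
    """Collapse consecutive repeats: [A, A, B, A] -> [A, B, A]."""
    return [firm for firm, _ in groupby(firm_sequence)]


def is_returner(firm_sequence):
    """True if the worker leaves a firm and later comes back to it."""
    s = spells(firm_sequence)
    return len(s) != len(set(s))


def drop_firm(firm_sequence, firm):
    """Remove all observations at the given firm."""
    return [f for f in firm_sequence if f != firm]


# A worker goes from firm A to firm B and back to firm A
history = ["A", "B", "A"]
before = is_returner(history)

# If firm B is dropped during cleaning, only firm A remains
remaining = drop_firm(history, "B")
after = is_returner(remaining)
became_stayer = len(spells(remaining)) == 1
```

Once firm B is removed, the worker has a single spell at firm A, which is why the connected set may change and collapsed data must be recollapsed.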

#### Disable logging

Logging can slow down data cleaning. Set the parameter `log=False` when constructing your dataframe to turn off logging.

#### Install Intel(R) Extension for Scikit-learn

Intel(R) Extension for Scikit-learn ([GitHub](https://github.com/intel/scikit-learn-intelex)) can speed up KMeans clustering.

## Simpler constructor

If the columns in your Pandas dataframe are already named correctly, you can pass the dataframe directly to the `BipartiteDataFrame` constructor.

In [4]:
bdf = bpd.BipartiteDataFrame(df).clean()
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


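|  | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 123 | 1.836624 | 0 | 1 | 0.430727 | 6 | 3 | 0.348756 |
| 1 | 0 | 92 | -0.504826 | 1 | 1 | 0.430727 | 4 | 3 | -0.114185 |
| 2 | 0 | 92 | 0.119876 | 2 | 0 | 0.430727 | 4 | 3 | -0.114185 |
| 3 | 0 | 92 | 1.446130 | 3 | 0 | 0.430727 | 4 | 3 | -0.114185 |
| 4 | 0 | 92 | 0.879996 | 4 | 0 | 0.430727 | 4 | 3 | -0.114185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 60 | -0.667988 | 0 | 1 | -0.967422 | 3 | 0 | -0.348756 |
| 49996 | 9999 | 80 | -0.243995 | 1 | 1 | -0.967422 | 4 | 0 | -0.114185 |
| 49997 | 9999 | 80 | -1.355750 | 2 | 1 | -0.967422 | 4 | 0 | -0.114185 |
| 49998 | 9999 | 46 | -1.213313 | 3 | 1 | -0.967422 | 2 | 0 | -0.604585 |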


## Restoring original ids

To restore original ids, the dataframe must track ids as they change. Enable this by setting `include_id_reference_dict=True` in the constructor.

Notice that in this example we use `j / 2`, which produces firm ids that are no longer contiguous integers, so `j` is guaranteed to be modified when data cleaning makes the ids contiguous.

In [5]:
bdf_original_ids = bpd.BipartiteDataFrame(i=df['i'], j=(df['j'] / 2), y=df['y'], t=df['t'], include_id_reference_dict=True).clean()
display(bdf_original_ids)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


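|  | i | j | y | t | m |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 1.836624 | 0 | 1 |
| 1 | 0 | 1 | -0.504826 | 1 | 1 |
| 2 | 0 | 1 | 0.119876 | 2 | 0 |
| 3 | 0 | 1 | 1.446130 | 3 | 0 |
| 4 | 0 | 1 | 0.879996 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 197 | -0.667988 | 0 | 1 |
| 49996 | 9999 | 72 | -0.243995 | 1 | 1 |
| 49997 | 9999 | 72 | -1.355750 | 2 | 1 |
| 49998 | 9999 | 176 | -1.213313 | 3 | 1 |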


#### Merging in original ids

The method `.original_ids()` will return a dataframe that merges in the original ids.

In [6]:
bdf_original_ids.original_ids()

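|  | i | j | y | t | m | original_j |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1.836624 | 0 | 1 | 61.5 |
| 1 | 0 | 1 | -0.504826 | 1 | 1 | 46.0 |
| 2 | 0 | 1 | 0.119876 | 2 | 0 | 46.0 |
| 3 | 0 | 1 | 1.446130 | 3 | 0 | 46.0 |
| 4 | 0 | 1 | 0.879996 | 4 | 0 | 46.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 197 | -0.667988 | 0 | 1 | 30.0 |
| 49996 | 9999 | 72 | -0.243995 | 1 | 1 | 40.0 |
| 49997 | 9999 | 72 | -1.355750 | 2 | 1 | 40.0 |
| 49998 | 9999 | 176 | -1.213313 | 3 | 1 | 23.0 |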
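The bookkeeping behind restoring ids can be sketched in plain Python as a mapping from original to contiguous ids that is later inverted. This is a conceptual illustration only: the function names are hypothetical, and assigning new ids in order of first appearance is an assumption that happens to match the first rows shown above (61.5 becomes 0, 46.0 becomes 1).

```python
def make_contiguous(ids):
    """Map ids to 0, 1, 2, ... in order of first appearance.

    Returns the new ids and the original -> new reference mapping.
    """
    mapping = {}
    new_ids = []
    for original in ids:
        if original not in mapping:
            mapping[original] = len(mapping)
        new_ids.append(mapping[original])
    return new_ids, mapping


def restore(new_ids, mapping):
    """Invert the reference mapping to recover the original ids."""
    inverse = {new: original for original, new in mapping.items()}
    return [inverse[n] for n in new_ids]


# Firm ids of worker 0 after dividing by 2, as in the example above
original_j = [61.5, 46.0, 46.0, 46.0, 46.0]
new_j, reference = make_contiguous(original_j)
restored_j = restore(new_j, reference)
```

If the mapping is not stored at cleaning time, the original ids cannot be recovered afterwards, which is why tracking has to be enabled in the constructor.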


## Comparing dataframes

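Dataframes can be compared using the utility function `bpd.util.compare_frames()`. In this example, we check that `bdf` has at least as many rows (`size_variable='len'`, `operator='geq'`) as its first half.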

In [7]:
bpd.util.compare_frames(bdf, bdf.iloc[:len(bdf) // 2], size_variable='len', operator='geq')

True

## Filling in missing years as unemployed

The method `.fill_periods()` (for *Long* format) will fill in rows for missing intermediate periods. Note that this method will not work with custom columns.

In this example, we drop periods 1-3, then fill them in:

In [8]:
bdf_missing = bdf[(bdf['t'] == 0) | (bdf['t'] == 4)].clean()
display(bdf_missing.fill_periods())

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


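|  | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 123 | 1.836624 | 0 | 1 | 0.430727 | 6.0 | 3.0 | 0.348756 |
| 1 | 0 | -1 | NaN | 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | 0 | -1 | NaN | 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | 0 | -1 | NaN | 3 | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 92 | 0.879996 | 4 | 1 | 0.430727 | 4.0 | 3.0 | -0.114185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 60 | -0.667988 | 0 | 1 | -0.967422 | 3.0 | 0.0 | -0.348756 |
| 49996 | 9999 | -1 | NaN | 1 | NaN | NaN | NaN | NaN | NaN |
| 49997 | 9999 | -1 | NaN | 2 | NaN | NaN | NaN | NaN | NaN |
| 49998 | 9999 | -1 | NaN | 3 | NaN | NaN | NaN | NaN | NaN |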
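The output above suggests that filled-in rows get firm id `-1` and missing values elsewhere. A minimal plain-Python sketch of that behavior follows; the function name, the tuple layout, and the use of `None` for missing values are assumptions made for illustration, not the library's implementation.

```python
def fill_missing_periods(rows, fill_j=-1):
    """Insert placeholder rows for missing intermediate periods.

    rows: list of (i, j, y, t) tuples sorted by worker i, then period t.
    Placeholder rows use fill_j as the firm id and None as the outcome.
    """
    filled = []
    previous = None
    for i, j, y, t in rows:
        if previous is not None and previous[0] == i:
            # Same worker: fill every period strictly between the two rows
            for missing_t in range(previous[3] + 1, t):
                filled.append((i, fill_j, None, missing_t))
        filled.append((i, j, y, t))
        previous = (i, j, y, t)
    return filled


# Worker 0 from the output above, plus a made-up worker 1 with a single row
rows = [(0, 123, 1.836624, 0), (0, 92, 0.879996, 4), (1, 5, 0.5, 0)]
filled_rows = fill_missing_periods(rows)
```

Only gaps between two observations of the same worker are filled; nothing is added before a worker's first period or after their last.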


## Getting extended event study dataframes

BipartitePandas allows you to use *Long* format data to generate event studies with more than 2 periods.

You can specify:

- what column signals a transition (e.g. if `j` is used, a transition is when a worker moves firms)
- which column(s) should be treated as the event study outcome
- how many periods before and after the transition should be considered
- whether the pre- and/or post-trends must be stable, and for which column(s)

We consider an example where `j` is the transition column, `y` is the outcome column, and the pre- and post-trends each span 2 periods during which the worker must remain at the same firm. Note that `y_f1` is the first observation after the individual moves firms.

In [9]:
es_extended = bdf.get_extended_eventstudy(transition_col='j', outcomes='y', periods_pre=2, periods_post=2, stable_pre='j', stable_post='j')
display(es_extended)

|  | i | t | y_l2 | y_l1 | y_f1 | y_f2 |
|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 0.950670 | 1.108203 | 0.269408 | 0.974423 |
| 1 | 9 | 3 | 2.067053 | 1.915992 | 2.117024 | 2.467688 |
| 2 | 10 | 3 | 1.128810 | 1.414773 | 0.874945 | 0.730173 |
| 3 | 14 | 3 | -0.119791 | 0.082104 | 0.274195 | 1.291794 |
| 4 | 21 | 2 | 1.562660 | 0.569162 | 0.658669 | 1.418546 |
| ... | ... | ... | ... | ... | ... | ... |
| 2492 | 9984 | 3 | -0.635233 | -0.718112 | -0.676545 | -0.317920 |
| 2493 | 9989 | 2 | 1.838070 | 3.038493 | -0.259361 | 1.004997 |
| 2494 | 9993 | 2 | -0.427859 | 2.305903 | -0.549674 | 0.442083 |
| 2495 | 9994 | 2 | -1.450266 | -1.982413 | 1.764035 | -1.707863 |
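To clarify how the columns above relate to a worker's history, here is a plain-Python sketch of selecting event windows for a single worker. It is a simplified illustration under stated assumptions, not the library's algorithm: the function name is hypothetical, observations are assumed to be sorted with no gaps in time, and `t` is assumed to be the period of the first observation after the move.

```python
def extended_event_study(obs, periods_pre=2, periods_post=2):
    """Find firm transitions with stable pre- and post-periods.

    obs: list of (t, j, y) tuples for one worker, sorted by t with no gaps.
    Returns one dict per qualifying transition.
    """
    events = []
    for k in range(periods_pre, len(obs) - periods_post + 1):
        pre = obs[k - periods_pre:k]
        post = obs[k:k + periods_post]
        moved = pre[-1][1] != post[0][1]
        # A window is stable if the worker stays at a single firm
        stable_pre = len({j for _, j, _ in pre}) == 1
        stable_post = len({j for _, j, _ in post}) == 1
        if moved and stable_pre and stable_post:
            event = {"t": post[0][0]}
            for n, (_, _, y) in enumerate(pre):
                event[f"y_l{periods_pre - n}"] = y
            for n, (_, _, y) in enumerate(post):
                event[f"y_f{n + 1}"] = y
            events.append(event)
    return events


# Toy worker: two periods at firm A, then a move to firm B
stable_history = [(0, "A", 1.0), (1, "A", 2.0), (2, "B", 3.0), (3, "B", 4.0), (4, "B", 5.0)]
stable_events = extended_event_study(stable_history)

# Toy worker whose pre-period spans two firms, so no window qualifies
unstable_history = [(0, "A", 1.0), (1, "B", 2.0), (2, "C", 3.0), (3, "C", 4.0), (4, "C", 5.0)]
unstable_events = extended_event_study(unstable_history)
```

With five periods and windows of length 2 on each side, a transition can only occur at period 2 or 3, which is consistent with the `t` values in the output above.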
