# Advanced features

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [1]:
import bipartitepandas as bpd

## Get your data ready

For this notebook, we simulate data.

In [2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'])
display(bdf)

Unnamed: 0,i,j,y,t
0,0,12,-0.603359,0
1,0,41,-1.550461,1
2,0,18,-0.022030,2
3,0,18,-0.584272,3
4,0,64,-1.017753,4
...,...,...,...,...
49995,9999,77,-0.480020,0
49996,9999,77,-1.488714,1
49997,9999,165,1.128553,2
49998,9999,141,-0.177351,3


## Advanced data cleaning

<span class="label label-success">Hint</span> Want details on all cleaning parameters? Run `bpd.clean_params().describe_all()`, or search through `bpd.clean_params().keys()` for a particular key, and then run `bpd.clean_params().describe(key)`.

#### Compute the largest connected component

Use the parameter `connectedness` to set the connectedness measure to use, and `component_size_variable` to choose how to measure component size (e.g. `firms` chooses the connected component with the greatest number of unique firm ids, while `workers` chooses the connected component with the greatest number of unique worker ids).
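
To make the role of `component_size_variable` concrete, here is a minimal plain-Python sketch of picking the largest connected component of a worker-firm graph. It is not BipartitePandas's implementation: the function name `largest_component` and the `(worker, firm)` tuple representation are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def largest_component(pairs, size_variable="firms"):
    """Return the (worker, firm) pairs that lie in the largest component.

    pairs: iterable of (worker_id, firm_id) observations.
    size_variable: 'firms' or 'workers', i.e. what is counted to rank components.
    """
    pairs = list(pairs)
    # Build the bipartite graph; tag nodes so worker 1 and firm 1 stay distinct.
    adj = defaultdict(set)
    for i, j in pairs:
        adj[("w", i)].add(("f", j))
        adj[("f", j)].add(("w", i))

    tag = "f" if size_variable == "firms" else "w"
    seen, best, best_size = set(), set(), -1
    for start in adj:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        # Rank the component by its number of unique firms or workers.
        size = sum(1 for kind, _ in component if kind == tag)
        if size > best_size:
            best, best_size = component, size
    return [(i, j) for i, j in pairs if ("w", i) in best]

# Component 1: workers 0, 1, 2 at firm 'A'. Component 2: worker 3 at firms 'B', 'C'.
pairs = [(0, "A"), (1, "A"), (2, "A"), (3, "B"), (3, "C")]
by_firms = largest_component(pairs, "firms")      # component 2 has more firms
by_workers = largest_component(pairs, "workers")  # component 1 has more workers
```

The two calls select different components from the same data, which is why the choice of size variable matters.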

<span class="label label-info">Note</span> Connectedness is NOT NECESSARILY maintained between non-collapsed and collapsed formats. Therefore, if you plan to use connected, collapsed data, it is recommended to first clean the non-collapsed data with `connectedness=None`, next collapse the data, and finally clean the collapsed data with the connectedness measure you plan to use.

#### Set how to handle worker-year duplicates

Use the parameter `i_t_how` to customize how worker-year duplicates are handled.
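
As an illustration of what handling worker-year duplicates means, the sketch below keeps the highest-paying observation for each `(i, t)` pair, which mirrors the `how='max'` behavior reported in the cleaning logs later in this notebook. The function name and the list-of-dicts layout are hypothetical; this is not the library's code.

```python
def keep_max_per_worker_period(rows):
    """Keep, for each (i, t), the row with the largest y.

    rows: list of dicts with keys 'i', 'j', 'y', 't'.
    """
    best = {}
    for row in rows:
        key = (row["i"], row["t"])
        # Replace the stored row only if this one pays strictly more.
        if key not in best or row["y"] > best[key]["y"]:
            best[key] = row
    return sorted(best.values(), key=lambda r: (r["i"], r["t"]))

rows = [
    {"i": 0, "j": 12, "y": 1.0, "t": 0},
    {"i": 0, "j": 41, "y": 2.5, "t": 0},  # duplicate worker-year, higher pay
    {"i": 0, "j": 18, "y": 0.3, "t": 1},
]
cleaned = keep_max_per_worker_period(rows)
```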

#### Collapse at the match-level

If you drop the `t` column, collapsing will automatically collapse at the match level. However, this prevents conversion to event study format (this can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable).

#### Avoid unnecessary copies

If you are working with a large dataset, avoid copies wherever possible by setting `copy=False`.

#### Avoid unnecessary sorts

If you know your data is sorted by `i` and `t` (or, if you aren't including a `t` column, just by `i`), then set `is_sorted=True`.
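
The sketch below shows one way these parameters might be passed during cleaning. It assumes that `bpd.clean_params()` accepts a dictionary of overrides and that the result can be handed to `.clean()`; check `bpd.clean_params().describe_all()` for the options available in your installed version. The import guard is only there so the snippet can be read and run without the package installed.

```python
try:
    import bipartitepandas as bpd
except ImportError:  # allow the sketch to run without the package installed
    bpd = None

if bpd is not None:
    # Assumption: overrides are passed as a dictionary to bpd.clean_params().
    params = bpd.clean_params({
        'copy': False,      # avoid copying the data during cleaning
        'is_sorted': True,  # only safe if rows are already sorted by i and t
    })
    df = bpd.SimBipartite().simulate()
    bdf_fast = bpd.BipartiteDataFrame(
        i=df['i'], j=df['j'], y=df['y'], t=df['t']
    ).clean(params)
```

Setting `is_sorted=True` on data that is not actually sorted may produce incorrect results, so only use it when the sort order is guaranteed.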

#### Avoid complicated loops

Sometimes workers leave a firm, then return to it (we call these workers *returners*). Returners can cause computational difficulties because intermediate firms sometimes get dropped (e.g. a worker goes from firm $A \to B \to A$, but firm $B$ gets dropped), which turns returners into stayers. This can change the largest connected set of firms and, if the data is in collapsed format, requires the data to be recollapsed.

Because of these potential complications, if there are returners, many methods require loops that run until convergence.

These difficulties can be avoided by setting the parameter `drop_returns`. There are multiple ways to handle returners; to see them, run `bpd.clean_params().describe('drop_returns')`.
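
To see why returners are awkward, the following plain-Python sketch detects a returner from a worker's time-ordered firm history and shows how dropping the intermediate firm turns that worker into a stayer. The helper names are hypothetical and the logic is a simplified illustration, not the library's algorithm.

```python
from itertools import groupby

def firm_spells(firms):
    """Collapse a time-ordered firm sequence into spells: A, A, B, A -> A, B, A."""
    return [firm for firm, _ in groupby(firms)]

def is_returner(firms):
    """True if the worker leaves some firm and later comes back to it."""
    spells = firm_spells(firms)
    return len(spells) != len(set(spells))

history = ["A", "A", "B", "A"]
returner_before = is_returner(history)

# Suppose firm B is dropped during cleaning (e.g. it falls outside the connected set).
after_drop = [firm for firm in history if firm != "B"]
spells_after = firm_spells(after_drop)   # a single spell at firm A
returner_after = is_returner(after_drop)
```

After firm `B` is removed, the two spells at `A` merge into one, so any earlier collapsing or connectedness computation is no longer valid for this worker.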

<span class="label label-default">Alternative</span> Another way to handle returners is to drop the `t` column. Then, sorting will automatically sort by `i` and `j`, which eliminates the possibility of returners. However, this prevents conversion to event study format (this can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable).

## Advanced clustering

#### Install Intel(R) Extension for Scikit-learn

Intel(R) Extension for Scikit-learn ([GitHub](https://github.com/intel/scikit-learn-intelex)) can speed up KMeans clustering.
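
A minimal sketch of enabling the extension, assuming it has been installed with `pip install scikit-learn-intelex`. The patch is applied through `patch_sklearn()` from the `sklearnex` package and should run before scikit-learn estimators are imported, so placing it at the very top of the notebook (before importing `bipartitepandas`) is the cautious choice. The fallback branch and the `intel_patched` flag are illustrative additions.

```python
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # swaps in accelerated implementations where available
    intel_patched = True
except ImportError:
    # Extension not installed: stock scikit-learn is used instead.
    intel_patched = False
```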

## Advanced dataframe handling

#### Disable logging

Logging can slow down basic operations on BipartitePandas dataframes (e.g. data cleaning). Set the parameter `log=False` when constructing your dataframe to turn off logging.

#### Use method chaining with in-place operations

Unlike standard Pandas, BipartitePandas allows method chaining with in-place operations.

#### Understand the distinction between general columns and subcolumns

Users interact with general columns, while BipartitePandas dataframes display subcolumns. As an example, for event study format, the columns for firm clusters are labeled `g1` and `g2`. These are the subcolumns for general column `g`. If you want to drop firm clusters from the dataframe, rather than dropping `g1` and `g2` separately, you must drop the general column `g`. This paradigm applies throughout BipartitePandas and the documentation will make clear when you should specify general columns.
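
The relationship between general columns and subcolumns can be pictured as a simple mapping. The sketch below is purely illustrative: the `SUBCOLUMNS` dictionary and `drop_general` helper are hypothetical, and the subcolumn names other than `g1` and `g2` are assumed for the example.

```python
# Hypothetical mapping from general columns to event-study subcolumns.
SUBCOLUMNS = {
    "i": ["i"],
    "j": ["j1", "j2"],
    "y": ["y1", "y2"],
    "g": ["g1", "g2"],
}

def drop_general(columns, general):
    """Remove every subcolumn that belongs to one general column."""
    to_drop = set(SUBCOLUMNS[general])
    return [c for c in columns if c not in to_drop]

columns = ["i", "j1", "j2", "y1", "y2", "g1", "g2"]
remaining = drop_general(columns, "g")  # both g1 and g2 disappear together
```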

#### Simpler constructor

If the columns in your Pandas dataframe are already named correctly, you can pass the dataframe directly to the BipartitePandas dataframe constructor. Here is an example:

In [3]:
bdf = bpd.BipartiteDataFrame(df).clean()
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m,alpha,k,l,psi
0,0,12,-0.603359,0,1,-0.430727,0,1,-1.335178
1,0,41,-1.550461,1,2,-0.430727,1,1,-0.908458
2,0,18,-0.022030,2,1,-0.430727,0,1,-1.335178
3,0,18,-0.584272,3,1,-0.430727,0,1,-1.335178
4,0,64,-1.017753,4,1,-0.430727,3,1,-0.348756
...,...,...,...,...,...,...,...,...,...
49995,9999,77,-0.480020,0,0,0.000000,3,2,-0.348756
49996,9999,77,-1.488714,1,1,0.000000,3,2,-0.348756
49997,9999,165,1.128553,2,2,0.000000,8,2,0.908458
49998,9999,141,-0.177351,3,2,0.000000,7,2,0.604585


## Restoring original ids

To restore original ids, the dataframe must track ids as they change. Enable this by setting `include_id_reference_dict=True`.

Notice that in this example we pass `j / 2`, which produces non-integer firm ids, so that `j` will be modified when cleaning makes the ids contiguous.

The method `.original_ids()` will then return a dataframe that merges in the original ids.

In [4]:
bdf_original_ids = bpd.BipartiteDataFrame(i=df['i'], j=(df['j'] / 2), y=df['y'], t=df['t'], include_id_reference_dict=True).clean()
display(bdf_original_ids.original_ids())

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m,original_j
0,0,0,-0.603359,0,1,6.0
1,0,1,-1.550461,1,2,20.5
2,0,2,-0.022030,2,1,9.0
3,0,2,-0.584272,3,1,9.0
4,0,3,-1.017753,4,1,32.0
...,...,...,...,...,...,...
49995,9999,136,-0.480020,0,0,38.5
49996,9999,136,-1.488714,1,1,38.5
49997,9999,182,1.128553,2,2,82.5
49998,9999,91,-0.177351,3,2,70.5
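
Conceptually, tracking ids amounts to keeping a lookup from each new contiguous id back to the value it replaced. The sketch below illustrates this with the first five `original_j` values from the output above. It assumes ids are numbered in order of first appearance, and `make_contiguous` is a hypothetical helper rather than a BipartitePandas function.

```python
def make_contiguous(ids):
    """Map arbitrary ids to 0..n-1 and return the new ids plus a reverse lookup."""
    forward = {}
    for value in ids:
        # Assign the next integer the first time a value is seen.
        forward.setdefault(value, len(forward))
    new_ids = [forward[value] for value in ids]
    reference = {new: old for old, new in forward.items()}
    return new_ids, reference

original_j = [6.0, 20.5, 9.0, 9.0, 32.0]
new_j, reference = make_contiguous(original_j)

# Restoring original ids is a lookup in the reference dictionary.
restored = [reference[j] for j in new_j]
```

Without the reference dictionary, the mapping from `new_j` back to `original_j` would be lost once the ids are overwritten.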


## Comparing dataframes

Dataframes can be compared using the utility method `bpd.util.compare_frames()`.

In [5]:
bpd.util.compare_frames(bdf, bdf.iloc[:len(bdf) // 2], size_variable='len', operator='geq')

True

## Filling in missing periods as unemployed

The method `.fill_missing_periods()` (for *Long* format) will fill in rows for missing intermediate periods. Note that this method will not work with custom columns.

<span class="label label-success">Hint</span> Filling in missing periods is a useful way to make sure that `.collapse()` only collapses over worker-firm spells if they are for contiguous periods.

In this example, we drop periods 1-3, then fill them in, setting `alpha` to -1 in the filled rows:

In [6]:
bdf_missing = bdf[(bdf['t'] == 0) | (bdf['t'] == 4)].clean()
display(bdf_missing.fill_missing_periods({'alpha': -1}))

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


Unnamed: 0,i,j,y,t,m,alpha,k,l,psi
0,0,12,-0.603359,0,1,-0.430727,0,1,-1.335178
1,0,-1,,1,,-1.000000,,,
2,0,-1,,2,,-1.000000,,,
3,0,-1,,3,,-1.000000,,,
4,0,64,-1.017753,4,1,-0.430727,3,1,-0.348756
...,...,...,...,...,...,...,...,...,...
49995,9999,77,-0.48002,0,1,0.000000,3,2,-0.348756
49996,9999,-1,,1,,-1.000000,,,
49997,9999,-1,,2,,-1.000000,,,
49998,9999,-1,,3,,-1.000000,,,
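
The filling logic shown above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions (rows are `(i, j, y, t)` tuples already sorted by `i` then `t`, and periods are integers); `fill_gaps` is a hypothetical name, not the library method.

```python
def fill_gaps(rows, fill_j=-1):
    """Insert a placeholder row for every period skipped within a worker's history.

    rows: list of (i, j, y, t) tuples sorted by i, then t.
    fill_j: firm id used to mark the filled (unemployed) periods.
    """
    filled = []
    for idx, (i, j, y, t) in enumerate(rows):
        if idx > 0:
            prev_i, _, _, prev_t = rows[idx - 1]
            if prev_i == i:
                # One placeholder per period strictly between prev_t and t.
                for gap_t in range(prev_t + 1, t):
                    filled.append((i, fill_j, None, gap_t))
        filled.append((i, j, y, t))
    return filled

rows = [(0, 12, -0.6, 0), (0, 64, -1.0, 4), (1, 5, 0.3, 2)]
filled = fill_gaps(rows)
```

Only intermediate periods are filled: worker 1, observed once at `t=2`, gains no extra rows.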


## Getting extended event study dataframes

BipartitePandas allows you to use *Long* format data to generate event studies with more than 2 periods.

You can specify:

- which column signals a transition (e.g. if `j` is used, a transition is when a worker moves firms)
- which column(s) should be treated as the event study outcome
- how many periods before and after the transition should be considered
- whether the pre- and/or post-trends must be stable, and for which column(s)

We consider an example where `j` is the transition column, `y` is the outcome column, and the pre- and post-trends each span 2 periods during which the worker must stay at the same firm. Note that `y_f1` is the first observation after the individual moves firms.

In [7]:
es_extended = bdf.get_extended_eventstudy(transition_col='j', outcomes='y', periods_pre=2, periods_post=2, stable_pre='j', stable_post='j')
display(es_extended)

Unnamed: 0,i,t,y_l2,y_l1,y_f1,y_f2
0,4,2,0.495119,-0.399197,1.299567,1.101117
1,7,3,0.992943,0.111506,2.728447,1.094903
2,14,2,-2.988141,-2.434932,-1.283972,-2.091461
3,25,2,1.794433,1.790265,-1.088385,-0.070309
4,43,2,-1.670816,-1.210209,-1.615393,0.888983
...,...,...,...,...,...,...
2433,9989,3,-0.636009,-1.011849,-0.431337,0.723096
2434,9994,3,-3.058060,-2.365576,-1.482422,-2.007100
2435,9996,2,-2.187348,-1.758026,-0.020407,0.467370
2436,9997,2,1.507929,0.719959,0.769608,1.802798
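
The selection rule behind this output can be sketched for a single worker as below. This is an illustrative reimplementation under stated assumptions, not the library's code: observations are `(t, j, y)` tuples in time order, the reported `t` is taken to be the first period after the move, and consecutive observations are treated as adjacent periods without checking for gaps.

```python
def extended_event_study(obs, periods_pre=2, periods_post=2):
    """Return one dict per move that has stable pre- and post-windows.

    obs: one worker's time-ordered list of (t, j, y) tuples.
    """
    events = []
    for k in range(periods_pre, len(obs) - periods_post + 1):
        pre = obs[k - periods_pre:k]
        post = obs[k:k + periods_post]
        # A transition occurs when the firm changes between the two windows.
        moved = pre[-1][1] != post[0][1]
        stable_pre = len({j for _, j, _ in pre}) == 1
        stable_post = len({j for _, j, _ in post}) == 1
        if moved and stable_pre and stable_post:
            event = {"t": post[0][0]}
            # y_l1 is the last observation before the move, y_l2 the one before that.
            for lag, (_, _, y) in enumerate(reversed(pre), start=1):
                event[f"y_l{lag}"] = y
            # y_f1 is the first observation after the move.
            for lead, (_, _, y) in enumerate(post, start=1):
                event[f"y_f{lead}"] = y
            events.append(event)
    return events

obs = [(0, "A", 1.0), (1, "A", 1.1), (2, "B", 2.0), (3, "B", 2.1), (4, "C", 3.0)]
events = extended_event_study(obs)
```

Here the move from `A` to `B` qualifies, while the move from `B` to `C` does not, because only one observation follows it.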


## Advanced data simulation

For details on all simulation parameters, run `bpd.sim_params().describe_all()`, or search through `bpd.sim_params().keys()` for a particular key, and then run `bpd.sim_params().describe(key)`.