# Connected sets

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [1]:
import bipartitepandas as bpd

## Get your data ready

For this notebook, we simulate data, setting a small firm size (`'firm_size'`) and a low probability of moving (`'p_move'`) so that the connected sets are interesting.

In [2]:
df = bpd.SimBipartite(
    bpd.sim_params(
        {
            'firm_size': 10,
            'p_move': 0.05
        }
    )
).simulate()
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)

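| | i | j | y | t |
| --- | --- | --- | --- | --- |
| 0 | 0 | 78 | -0.039918 | 0 |
| 1 | 0 | 78 | -1.395162 | 1 |
| 2 | 0 | 78 | -2.167545 | 2 |
| 3 | 0 | 78 | -0.471150 | 3 |
| 4 | 0 | 78 | -0.009972 | 4 |
| ... | ... | ... | ... | ... |
| 49995 | 9999 | 768 | 1.769249 | 0 |
| 49996 | 9999 | 768 | 1.923128 | 1 |
| 49997 | 9999 | 768 | 1.865583 | 2 |
| 49998 | 9999 | 768 | 1.838138 | 3 |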


## Computing connected sets

There are seven connectedness options:

- None
- Connected
- Leave-out-observation
- Leave-out-spell
- Leave-out-match
- Leave-out-worker
- Leave-out-firm

These are specified in the cleaning parameters dictionary under the key `'connectedness'`. We will demonstrate `'connectedness' = None` and `'connectedness' = 'leave_out_observation'`.
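To make the idea of a connected set concrete, here is a minimal standalone sketch. It is not BipartitePandas' implementation: it assumes two firms are linked whenever some worker is observed at both, and that "largest" means the component with the most firms. The helper name `largest_connected_firms` and the toy data are hypothetical.

```python
from collections import defaultdict


def largest_connected_firms(observations):
    """Return the set of firms in the largest connected component.

    observations: list of (worker, firm) pairs. Firms are linked
    whenever the same worker appears at both (an assumption of this sketch).
    """
    parent = {}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_firm = {}  # first firm seen for each worker
    for worker, firm in observations:
        parent.setdefault(firm, firm)
        if worker in first_firm:
            union(first_firm[worker], firm)
        else:
            first_firm[worker] = firm

    components = defaultdict(set)
    for firm in parent:
        components[find(firm)].add(firm)
    return max(components.values(), key=len)


# Toy data: workers 0 and 1 link firms A-B-C; worker 3 links D-E.
toy = [(0, 'A'), (0, 'B'), (1, 'B'), (1, 'C'), (2, 'D'), (3, 'D'), (3, 'E')]
largest = largest_connected_firms(toy)
kept = [(w, f) for w, f in toy if f in largest]
print(sorted(largest), len(kept))
```

The leave-out variants are stricter: roughly, the set must stay connected even after removing any single observation, spell, match, worker, or firm.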

<div class="alert alert-info">

Note

Leave-out-spell and leave-out-match differ only for workers who leave a firm and later return to it: such a worker has a single match with that firm, but more than one spell there.

</div>
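The distinction can be illustrated with a small standalone sketch. It assumes a spell is an uninterrupted run of periods at one firm and a match is a distinct worker-firm pair; the toy employment history is hypothetical.

```python
from itertools import groupby

# One worker's firm in each period: at A, then B, then back at A.
history = ['A', 'A', 'B', 'A', 'A']

# groupby merges consecutive equal values, so each group is one spell.
spells = [firm for firm, _ in groupby(history)]

# A match is a distinct worker-firm pair, regardless of timing.
matches = set(history)

print(len(spells), len(matches))
```

Here the worker has three spells but only two matches, so leaving out a spell and leaving out a match remove different sets of observations.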

<div class="alert alert-info">

Note

Stayers who have only a single observation after computing the largest connected set can be dropped by specifying `'drop_single_stayers' = True` in your cleaning parameters dictionary.

</div>
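As a rough standalone illustration of which rows such an option would remove (this is not the library's code, and the toy data are hypothetical), the sketch below drops every worker who has exactly one observation. A worker with a single observation is necessarily a stayer.

```python
from collections import Counter

# (worker, firm) observations: worker 1 appears only once.
observations = [(0, 'A'), (0, 'A'), (1, 'A'), (2, 'A'), (2, 'B')]

# Count observations per worker, then keep workers with more than one.
n_obs = Counter(worker for worker, _ in observations)
kept = [(w, f) for w, f in observations if n_obs[w] > 1]

print(kept)
```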

<div class="alert alert-warning">

Warning

Connectedness is not necessarily maintained between non-collapsed and collapsed formats. Therefore, if you plan to use connected, collapsed data, it is recommended to set the connectedness measure to the level at which you would like to collapse your data, and to set `'collapse_at_connectedness_measure' = True` in your cleaning parameters dictionary. An example is given below.

</div>
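To see why collapsing can break connectedness, consider the brute-force sketch below. It is a simplified stand-in for a leave-out-observation check, not BipartitePandas' algorithm: it assumes a data set passes if, after dropping any single row, every original firm is still linked to every other through shared workers. The function names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def firms_connected(observations, firms):
    """True if all `firms` are mutually reachable via shared workers."""
    adj = defaultdict(set)  # bipartite worker-firm graph
    for worker, firm in observations:
        adj[('w', worker)].add(('f', firm))
        adj[('f', firm)].add(('w', worker))
    firms = list(firms)
    if not firms:
        return True
    start = ('f', firms[0])
    seen, stack = {start}, [start]
    while stack:  # depth-first search from the first firm
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return all(('f', firm) in seen for firm in firms)


def survives_leave_one_out(observations):
    """True if dropping any single row keeps all firms connected."""
    firms = {firm for _, firm in observations}
    return all(
        firms_connected(observations[:k] + observations[k + 1:], firms)
        for k in range(len(observations))
    )


# Worker 0 moves from A to B; workers 1 and 2 are stayers.
# Rows are (worker, firm), sorted by worker and then period.
uncollapsed = [(0, 'A'), (0, 'A'), (0, 'B'), (0, 'B'),
               (1, 'A'), (1, 'A'), (2, 'B'), (2, 'B')]
# Collapsing merges consecutive rows with the same worker and firm.
collapsed = [key for key, _ in groupby(uncollapsed)]

print(survives_leave_one_out(uncollapsed), survives_leave_one_out(collapsed))
```

In the uncollapsed data, worker 0 has two rows at each firm, so no single row is essential to the link between A and B. Once each spell is a single row, dropping worker 0's spell at A leaves A and B unlinked.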

### 'connectedness' = None

In [3]:
conn_none = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': None,
            'verbose': False
        }
    )
)
display(conn_none)

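| | i | j | y | t | m |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 78 | -0.039918 | 0 | 0 |
| 1 | 0 | 78 | -1.395162 | 1 | 0 |
| 2 | 0 | 78 | -2.167545 | 2 | 0 |
| 3 | 0 | 78 | -0.471150 | 3 | 0 |
| 4 | 0 | 78 | -0.009972 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 768 | 1.769249 | 0 | 0 |
| 49996 | 9999 | 768 | 1.923128 | 1 | 0 |
| 49997 | 9999 | 768 | 1.865583 | 2 | 0 |
| 49998 | 9999 | 768 | 1.838138 | 3 | 0 |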


### 'connectedness' = 'leave_out_observation'

In [4]:
conn_loo = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation',
            'verbose': False
        }
    )
)
display(conn_loo)

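| | i | j | y | t | m |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | -0.039918 | 0 | 0 |
| 1 | 0 | 0 | -1.395162 | 1 | 0 |
| 2 | 0 | 0 | -2.167545 | 2 | 0 |
| 3 | 0 | 0 | -0.471150 | 3 | 0 |
| 4 | 0 | 0 | -0.009972 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 46332 | 9299 | 538 | 1.769249 | 0 | 0 |
| 46333 | 9299 | 538 | 1.923128 | 1 | 0 |
| 46334 | 9299 | 538 | 1.865583 | 2 | 0 |
| 46335 | 9299 | 538 | 1.838138 | 3 | 0 |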


## Connected sets for collapsed data

As mentioned above, connectedness is not necessarily maintained between non-collapsed and collapsed formats.

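Here we show an example that demonstrates this. We first collapse data that was cleaned before collapsing, which does not guarantee that the collapsed data is connected at the desired level. We then clean, collapse, and clean again, which does. Finally, we show how setting `'collapse_at_connectedness_measure' = True` in your cleaning parameters dictionary gives the correct results in a single call.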

In [5]:
coll_conn_loo_wrong = conn_loo.collapse(level='spell')
display(coll_conn_loo_wrong)

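| | i | j | y | t1 | t2 | w | m |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | -0.816749 | 0 | 4 | 5 | 0 |
| 1 | 1 | 1 | 0.948204 | 0 | 4 | 5 | 0 |
| 2 | 2 | 2 | -1.229131 | 0 | 4 | 5 | 0 |
| 3 | 3 | 3 | 0.275640 | 0 | 4 | 5 | 0 |
| 4 | 4 | 4 | -0.531474 | 0 | 4 | 5 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11282 | 9295 | 323 | -2.016925 | 0 | 4 | 5 | 0 |
| 11283 | 9296 | 14 | 0.831897 | 0 | 4 | 5 | 0 |
| 11284 | 9297 | 224 | -1.794814 | 0 | 4 | 5 | 0 |
| 11285 | 9298 | 345 | 1.308279 | 0 | 4 | 5 | 0 |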


In [6]:
coll_conn_loo_right_1 = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': None,
            'verbose': False
        }
    )
).collapse(level='spell').clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation',
            'verbose': False
        }
    )
)
display(coll_conn_loo_right_1)

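| | i | j | y | t1 | t2 | w | m |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | -0.816749 | 0 | 4 | 5.0 | 0 |
| 1 | 1 | 1 | 0.948204 | 0 | 4 | 5.0 | 0 |
| 2 | 2 | 2 | -1.229131 | 0 | 4 | 5.0 | 0 |
| 3 | 3 | 3 | 0.275640 | 0 | 4 | 5.0 | 0 |
| 4 | 4 | 4 | -0.531474 | 0 | 4 | 5.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11269 | 9283 | 323 | -2.016925 | 0 | 4 | 5.0 | 0 |
| 11270 | 9284 | 14 | 0.831897 | 0 | 4 | 5.0 | 0 |
| 11271 | 9285 | 224 | -1.794814 | 0 | 4 | 5.0 | 0 |
| 11272 | 9286 | 345 | 1.308279 | 0 | 4 | 5.0 | 0 |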


### Simpler code

Instead of cleaning, collapsing, then cleaning again, we can do it all at once by specifying `'connectedness' = 'leave_out_spell'` (or `'leave_out_match'`) and `'collapse_at_connectedness_measure' = True`.

In [7]:
coll_conn_loo_right_2 = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_spell',
            'collapse_at_connectedness_measure': True,
            'verbose': False
        }
    )
)
display(coll_conn_loo_right_2)

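| | i | j | y | t1 | t2 | w | m |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | -0.816749 | 0 | 4 | 5 | 0 |
| 1 | 1 | 1 | 0.948204 | 0 | 4 | 5 | 0 |
| 2 | 2 | 2 | -1.229131 | 0 | 4 | 5 | 0 |
| 3 | 3 | 3 | 0.275640 | 0 | 4 | 5 | 0 |
| 4 | 4 | 4 | -0.531474 | 0 | 4 | 5 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11269 | 9283 | 323 | -2.016925 | 0 | 4 | 5 | 0 |
| 11270 | 9284 | 14 | 0.831897 | 0 | 4 | 5 | 0 |
| 11271 | 9285 | 224 | -1.794814 | 0 | 4 | 5 | 0 |
| 11272 | 9286 | 345 | 1.308279 | 0 | 4 | 5 | 0 |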
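
For intuition about the collapsed columns, the sketch below collapses worker 0's five rows from the simulated data displayed at the top of this notebook into a single spell. It assumes, consistent with the displayed output, that `y` is averaged over the spell, `t1` and `t2` are the first and last periods, and `w` counts the observations. The function `collapse_spells` is hypothetical and not part of BipartitePandas.

```python
from itertools import groupby


def collapse_spells(rows):
    """Collapse consecutive rows with the same (i, j) into one spell.

    rows: list of (i, j, y, t) tuples sorted by worker i, then period t.
    Returns a list of (i, j, mean_y, t1, t2, w) tuples.
    """
    spells = []
    for (i, j), group in groupby(rows, key=lambda r: (r[0], r[1])):
        group = list(group)
        ys = [r[2] for r in group]
        ts = [r[3] for r in group]
        spells.append((i, j, sum(ys) / len(ys), ts[0], ts[-1], len(group)))
    return spells


# Worker 0's rows as displayed in the simulated data (firm id 78).
rows = [
    (0, 78, -0.039918, 0),
    (0, 78, -1.395162, 1),
    (0, 78, -2.167545, 2),
    (0, 78, -0.471150, 3),
    (0, 78, -0.009972, 4),
]
spells = collapse_spells(rows)
i, j, y, t1, t2, w = spells[0]
print(i, j, round(y, 6), t1, t2, w)
```

The averaged `y` rounds to -0.816749, matching the first row of the collapsed tables above. The firm id there is 0 rather than 78, presumably because ids are relabeled during cleaning.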
