# Simple example

In [1]:
# Add BipartitePandas to system path, do not run this
# import sys
# sys.path.append('../../..')

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [2]:
import bipartitepandas as bpd

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Get your data ready

For this notebook, we simulate data, setting `firm_size` to 10 and `p_move` to 0.05 so that data cleaning has a visible effect.

In [3]:
df = bpd.SimBipartite(bpd.sim_params({'firm_size': 10, 'p_move': 0.05})).simulate()
display(df)

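|  | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 74 | -0.900179 | 0 | 1 | 0 | -0.430727 | -1.335178 |
| 1 | 0 | 74 | -3.151944 | 1 | 1 | 0 | -0.430727 | -1.335178 |
| 2 | 0 | 74 | -1.870019 | 2 | 1 | 0 | -0.430727 | -1.335178 |
| 3 | 0 | 74 | -1.842928 | 3 | 1 | 0 | -0.430727 | -1.335178 |
| 4 | 0 | 74 | -1.945111 | 4 | 1 | 0 | -0.430727 | -1.335178 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 279 | -0.683951 | 0 | 0 | 2 | -0.967422 | -0.604585 |
| 49996 | 9999 | 279 | -1.313664 | 1 | 0 | 2 | -0.967422 | -0.604585 |
| 49997 | 9999 | 279 | -2.558599 | 2 | 0 | 2 | -0.967422 | -0.604585 |
| 49998 | 9999 | 279 | -1.577978 | 3 | 0 | 2 | -0.967422 | -0.604585 |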


## Columns

BipartitePandas includes seven pre-defined general columns:

#### Required
- `i`: worker id (any type)
- `j`: firm id (any type)
- `y`: income (float or int)

#### Optional
- `t`: time (int)
- `g`: firm type (any type)
- `w`: weight (float or int)
- `m`: move indicator (int)
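
As a rough illustration of the required/optional split above, the sketch below sorts a list of column names into missing required columns, recognized optional columns, and everything else. The helper `check_columns` is hypothetical and is not part of BipartitePandas; it only mirrors the lists in this section.

```python
# Hypothetical helper (not a BipartitePandas function): classify column
# names against the general columns listed in this section.
REQUIRED = ("i", "j", "y")
OPTIONAL = ("t", "g", "w", "m")


def check_columns(columns):
    """Return (missing required, recognized optional, other) column names."""
    columns = list(columns)
    missing = [c for c in REQUIRED if c not in columns]
    optional = [c for c in OPTIONAL if c in columns]
    other = [c for c in columns if c not in REQUIRED and c not in OPTIONAL]
    return missing, optional, other


# Column names of the simulated dataframe shown earlier in this notebook
sim_columns = ["i", "j", "y", "t", "l", "k", "alpha", "psi"]
missing, optional, other = check_columns(sim_columns)
print("missing required:", missing)
print("optional present:", optional)
print("not general columns:", other)
```

For the simulated data, nothing required is missing and `t` is the only optional column present, which matches the four columns passed to the constructor in the next section.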

## Constructing DataFrames

To construct a `BipartiteDataFrame`, pass the required columns (plus any optional columns you want to include) as keyword arguments.

In [4]:
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'])
display(bdf)

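|  | i | j | y | t |
|---|---|---|---|---|
| 0 | 0 | 74 | -0.900179 | 0 |
| 1 | 0 | 74 | -3.151944 | 1 |
| 2 | 0 | 74 | -1.870019 | 2 |
| 3 | 0 | 74 | -1.842928 | 3 |
| 4 | 0 | 74 | -1.945111 | 4 |
| ... | ... | ... | ... | ... |
| 49995 | 9999 | 279 | -0.683951 | 0 |
| 49996 | 9999 | 279 | -1.313664 | 1 |
| 49997 | 9999 | 279 | -2.558599 | 2 |
| 49998 | 9999 | 279 | -1.577978 | 3 |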


## Now that we have our dataframe, let's check out some summary statistics

In [5]:
bdf.summary()

format: 'BipartiteLong'
number of workers: 10000
number of firms: 1001
number of observations: 50000
mean wage: 0.009019668394443387
median wage: 0.009736416514212864
min wage: -5.685990000651978
max wage: 5.855224237854599
var(wage): 2.672583211030842
no NaN values: False
no duplicates: False
i-t (worker-year) observations unique (None if t column(s) not included): False
no returns (None if not yet computed): None
contiguous 'i' ids (None if not included): False
contiguous 'j' ids (None if not included): False
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): None
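
To make two of the data-quality lines above more concrete, here is a minimal pure-Python sketch of what "i-t observations unique" and "contiguous ids" could mean on a toy dataset. The functions `it_unique` and `ids_contiguous` are hypothetical, and the definition of "contiguous" used here (the distinct ids are exactly 0, 1, ..., n - 1) is an assumption rather than something this notebook states.

```python
from collections import Counter


def it_unique(rows):
    """True if no (worker, period) pair appears more than once."""
    counts = Counter((r["i"], r["t"]) for r in rows)
    return all(n == 1 for n in counts.values())


def ids_contiguous(ids):
    """True if the distinct ids are exactly 0, 1, ..., n - 1 (assumed definition)."""
    distinct = set(ids)
    return distinct == set(range(len(distinct)))


# Toy data: worker 0 has two observations in period 1
rows = [
    {"i": 0, "j": 74, "y": -0.9, "t": 0},
    {"i": 0, "j": 74, "y": -3.2, "t": 1},
    {"i": 0, "j": 12, "y": 0.4, "t": 1},
    {"i": 1, "j": 12, "y": 1.1, "t": 0},
]

print("i-t unique:", it_unique(rows))
print("contiguous 'i' ids:", ids_contiguous(r["i"] for r in rows))
print("contiguous 'j' ids:", ids_contiguous(r["j"] for r in rows))
```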


## Let's clean our data and make sure the result is leave-one-observation-out connected

*Hint:* want details on all cleaning parameters? Run `bpd.clean_params().describe_all()`, or search through `bpd.clean_params().keys()` for a particular key, and then run `bpd.clean_params().describe(key)`.

In [6]:
bdf = bdf.clean(bpd.clean_params({'connectedness': 'leave_out_observation'}))
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
making 'j' ids contiguous
sorting columns
resetting index


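|  | i | j | y | t | m |
|---|---|---|---|---|---|
| 0 | 0 | 0 | -0.900179 | 0 | 0 |
| 1 | 0 | 0 | -3.151944 | 1 | 0 |
| 2 | 0 | 0 | -1.870019 | 2 | 0 |
| 3 | 0 | 0 | -1.842928 | 3 | 0 |
| 4 | 0 | 0 | -1.945111 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 45257 | 9092 | 400 | -0.683951 | 0 | 0 |
| 45258 | 9092 | 400 | -1.313664 | 1 | 0 |
| 45259 | 9092 | 400 | -2.558599 | 2 | 0 |
| 45260 | 9092 | 400 | -1.577978 | 3 | 0 |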
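
Two of the steps in the cleaning log, "keeping highest paying job for i-t (worker-year) duplicates (how='max')" and "making 'j' ids contiguous", can be sketched in plain Python on a toy dataset. This is only an illustration of the idea, not the library's implementation: `keep_max_per_it` and `make_contiguous` are hypothetical names, and numbering new ids in order of first appearance is an assumption.

```python
def keep_max_per_it(rows):
    """For each (worker, period) pair, keep only the row with the highest y."""
    best = {}
    for r in rows:
        key = (r["i"], r["t"])
        if key not in best or r["y"] > best[key]["y"]:
            best[key] = r
    # Return rows sorted by worker, then period
    return sorted(best.values(), key=lambda r: (r["i"], r["t"]))


def make_contiguous(rows, col):
    """Relabel ids in `col` as 0, 1, 2, ... in order of first appearance (assumed)."""
    mapping = {}
    out = []
    for r in rows:
        new_id = mapping.setdefault(r[col], len(mapping))
        out.append({**r, col: new_id})
    return out


# Toy data: worker 0 holds two jobs in period 1
rows = [
    {"i": 0, "j": 74, "y": -0.9, "t": 0},
    {"i": 0, "j": 74, "y": -3.2, "t": 1},
    {"i": 0, "j": 12, "y": 0.4, "t": 1},
    {"i": 1, "j": 12, "y": 1.1, "t": 0},
]

cleaned = make_contiguous(keep_max_per_it(rows), "j")
for r in cleaned:
    print(r)
```

In the toy data, the lower-paid period-1 job at firm 74 is dropped, and the remaining firm ids 74 and 12 are relabelled 0 and 1.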


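We can check how the summary statistics changed. In this run, cleaning leaves 9093 of the original 10000 workers, 885 of 1001 firms, and 45262 of 50000 observations, and every data-quality check that previously reported `False` now reports `True`: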

In [7]:
bdf.summary()

format: 'BipartiteLong'
number of workers: 9093
number of firms: 885
number of observations: 45262
mean wage: -0.006254032007778557
median wage: -0.005315020187764016
min wage: -5.685990000651978
max wage: 5.855224237854599
var(wage): 2.6782471069652165
no NaN values: True
no duplicates: True
i-t (worker-year) observations unique (None if t column(s) not included): True
no returns (None if not yet computed): True
contiguous 'i' ids (None if not included): True
contiguous 'j' ids (None if not included): True
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): 'leave_out_observation'
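
The `connectedness` line refers to the graph that links workers to the firms they are observed at. As a hedged illustration, the sketch below finds the largest *plainly* connected set with a union-find structure. This is a simpler notion than the leave-one-observation-out connectedness requested above, which, as the name suggests, additionally requires the set to stay connected when any single observation is removed. The function `largest_connected_set` is hypothetical, and measuring component size by number of observations is an assumption.

```python
from collections import Counter


def largest_connected_set(rows):
    """Keep rows in the largest connected component of the worker-firm graph.

    Plain connectedness only; this does NOT implement the
    leave-one-observation-out variant.
    """
    parent = {}

    def find(x):
        # Find the component root, shortening paths along the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Each observation links one worker node to one firm node
    for r in rows:
        union(("worker", r["i"]), ("firm", r["j"]))

    # Component size measured in observations (an assumption)
    sizes = Counter(find(("firm", r["j"])) for r in rows)
    biggest = max(sizes, key=sizes.get)
    return [r for r in rows if find(("firm", r["j"])) == biggest]


# Toy data: workers 0 and 1 link firms A and B; worker 2 is isolated at firm C
rows = [
    {"i": 0, "j": "A"},
    {"i": 0, "j": "B"},
    {"i": 1, "j": "B"},
    {"i": 1, "j": "B"},
    {"i": 2, "j": "C"},
]

kept = largest_connected_set(rows)
print(len(kept), sorted({r["j"] for r in kept}))
```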
