In [1]:
import random
import polars as pl

import netcbs

In [2]:
# Print contexts and codebook
print(netcbs.context2types)
print(netcbs.codebook)

{'Family': {301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322}, 'Colleagues': {201}, 'Neighbors': {101, 102}, 'Schoolmates': {501, 502, 503, 504, 505, 506}, 'Housemates': {401, 402}}
{101: 'Neighbor - 10 closest addresses', 102: 'Neighborhood acquaintance - 20 random neighbors within 200 meters', 201: 'Colleague', 301: 'Parent', 302: 'Co-parent', 303: 'Grandparent', 304: 'Child', 305: 'Grandchild', 306: 'Full sibling', 307: 'Half sibling', 308: 'Unknown sibling', 309: 'Full cousin', 310: 'Cousin', 311: 'Aunt/Uncle', 312: 'Partner - married', 313: 'Partner - not married', 314: 'Parent-in-law', 315: 'Child-in-law', 316: 'Sibling-in-law', 317: 'Stepparent', 318: 'Stepchild', 319: 'Stepsibling', 320: 'Married full cousin', 321: 'Married cousin', 322: 'Married aunt/uncle', 401: 'Housemate', 402: 'Housemate - institution', 501: 'Classmate primary education', 502: 'Classmate special education', 503: 'Classmate secondary education', 50
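
Both objects print as plain dictionaries, so the relationship codes of a context can be translated into labels with a simple lookup. The sketch below is a minimal illustration that uses a hand-copied excerpt of the output above instead of importing netcbs; the names `context2types_excerpt`, `codebook_excerpt` and `labels_for` are made up for this example and are not part of the package.

```python
# Excerpt copied from the printed output above. In the notebook the full
# objects are netcbs.context2types and netcbs.codebook.
context2types_excerpt = {
    "Neighbors": {101, 102},
    "Colleagues": {201},
    "Housemates": {401, 402},
}
codebook_excerpt = {
    101: "Neighbor - 10 closest addresses",
    102: "Neighborhood acquaintance - 20 random neighbors within 200 meters",
    201: "Colleague",
    401: "Housemate",
    402: "Housemate - institution",
}


def labels_for(context, context2types, codebook):
    """Return {code: label} for every relationship type in a context (hypothetical helper)."""
    return {code: codebook[code] for code in sorted(context2types[context])}


housemate_labels = labels_for("Housemates", context2types_excerpt, codebook_excerpt)
print(housemate_labels)
```

With the real objects, the same lookup should work by passing `netcbs.context2types` and `netcbs.codebook`, assuming they keep the structure shown in the output.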

### Select the sample dataframe and the dataframe with the variable to aggregate

For this example we use synthetic data. For each context (Family, Colleagues, Neighbors, Schoolmates, Housemates), we generated a "network file" containing 1,000,000 relationships (see the section "Create synthetic data" at the end of this notebook). Each relationship is taken at random from any of the context's types (see netcbs.context2types).

We then create two dataframes: df_sample, with the IDs (RINPERSOON) of the people in the sample, and df_agg, with the IDs (RINPERSOON) and the variables to aggregate. In the CBS RA you will use real data.

In [5]:
# Create df_sample example
df_sample = pl.DataFrame(
    {
        "RINPERSOON": range(100_000_000, 100_010_000),
        "RINPERSOONS": ["R"]*10_000
    }
)

df_agg = pl.DataFrame(
    {
        "RINPERSOON":   range(100_000_000, 100_500_000),
        "RINPERSOONS":  ["R"]*500_000,
        "Income":       [random.normalvariate(30000, 5000) for _ in range(500_000)],
        "Age":          [random.normalvariate(30, 10) for _ in range(500_000)]
    }
)



### Run query

This is the most important part of the code. Here we aggregate the variables of interest (in this case Income and Age) over the people reached through the chain of relationships specified in the query.

In [7]:
## Query: The income and age of the parents, co-parents and grandparents of the schoolmates of the children in the sample

## How to construct the query
# 1. Start with the variables that you want to aggregate, e.g. "[Income, Age] ->"
# 2. Then add the relationships between the tables, e.g., "[Income, Age] -> Family[301]".
# In square brackets you can specify the type of the relationships: 
# write [all] for all, or [301,302] for parents and co-parents
# 3. You can add several tables: "[Income, Age] -> Family[301] -> Schoolmates[all]"
# 4. Finally, you must write "-> Sample" 

## Other parameters
# df_sample: the sample dataframe (with the people you want to have information on)
# df_agg: the dataframe with the information you want to aggregate. For example, the income of all people in the country
# year: the year of the data you want to use
# agg_func: the aggregation function you want to use. For example, pl.mean or pl.sum
# return_pandas: if True, the function returns a pandas dataframe. If False, it returns a polars dataframe
# lazy: if True, the operations are concatenated lazily and computed at the end. If False, the operations are computed immediately
# cbsdata_path: the path to the CBS data. Usually this is "G:/Bevolking". In this example, we use synthetic data saved in "cbsdata/Bevolking".


## The transform function validates the query before running it

# Example
query =  "[Income, Age] -> Family[301,302,303] -> Schoolmates[all] -> Sample"

df = netcbs.transform(
    query,
    df_sample=df_sample,
    df_agg=df_agg,
    year=2021,
    cbsdata_path="cbsdata/Bevolking",  # Path to the CBS data ("G:/Bevolking" at CBS); here, local synthetic data
    agg_func=[pl.mean, pl.sum, pl.max],
    return_pandas=False,
    lazy=True,
)

df    

RINPERSOON,RINPERSOONS,mean_Income,mean_Age,sum_Income,sum_Age,max_Income,max_Age
i64,str,f64,f64,f64,f64,f64,f64
100000000,"""R""",,,,,,
100000001,"""R""",23563.020593,28.113413,47126.041185,56.226826,28302.789867,36.629473
100000002,"""R""",,,,,,
100000003,"""R""",32550.805986,32.185977,65101.611971,64.371954,33606.460227,37.592311
100000004,"""R""",,,,,,
…,…,…,…,…,…,…,…
100009995,"""R""",33743.89044,17.139401,67487.78088,34.278801,36074.331943,26.296811
100009996,"""R""",,,,,,
100009997,"""R""",,,,,,
100009998,"""R""",35975.875574,43.344039,71951.751147,86.688079,37016.237929,51.345133
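
To make the result easier to read, here is a conceptual sketch in plain Python of what a chained query computes: start from each person in the sample, follow the relationship tables from right to left, and aggregate the values of the people reached. This is only an illustration on toy data. It is not how netcbs is implemented, and details such as edge direction or the handling of duplicate paths are assumptions. The names `schoolmates`, `family`, `income` and `aggregate` are hypothetical.

```python
from statistics import mean

# Toy edge lists: person -> related persons, already filtered by relationship type.
# (Hypothetical data; the real network files are CBS tables.)
schoolmates = {1: [10, 11], 2: []}
family = {10: [100], 11: [101]}  # e.g. a parent of person 10 and of person 11
income = {100: 20_000.0, 101: 40_000.0}


def aggregate(sample, hops, values, funcs):
    """Follow each hop in order from the sample, then aggregate the values reached."""
    rows = {}
    for person in sample:
        frontier = [person]
        for edges in hops:
            frontier = [alter for ego in frontier for alter in edges.get(ego, [])]
        reached = [values[p] for p in frontier if p in values]
        # Nobody reached: store None, similar to the empty rows in the output above
        rows[person] = {f.__name__: (f(reached) if reached else None) for f in funcs}
    return rows


# Equivalent in spirit to "Income -> Family[...] -> Schoolmates[all] -> Sample"
result = aggregate([1, 2], [schoolmates, family], income, [mean, sum, max])
print(result)
```

Person 1 reaches two relatives through two schoolmates and gets a mean, sum and max. Person 2 has no schoolmates in the toy data and gets empty values, which mirrors the rows without values in the table above.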


In [6]:
# You can also validate the query on its own, without running the aggregation
query =  "Income -> Family[301,302,303] -> Schoolmates[all] -> Sample"
netcbs.validate_query(
    query,
    df_sample=df_sample,
    df_agg=df_agg,
    year=2021,
    cbsdata_path="cbsdata/Bevolking",  # Path to the CBS data; in this example, local synthetic data
)

['Sample', 'Schoolmates[all]', 'Family[301,302,303]', 'Income']
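
The returned list shows the steps of the query in the order they are applied, starting from the sample. A minimal sketch of that ordering, assuming nothing more than splitting the string on "->" and reversing it (the helper `split_query` is hypothetical; the real `validate_query` also receives the dataframes, year and path, and presumably checks more than this):

```python
def split_query(query):
    """Split a query on '->' and return its steps starting from the sample."""
    steps = [step.strip() for step in query.split("->")]
    if steps[-1] != "Sample":
        raise ValueError('the query must end with "-> Sample"')
    return steps[::-1]


steps = split_query("Income -> Family[301,302,303] -> Schoolmates[all] -> Sample")
print(steps)
```

Reading the list from left to right gives the traversal: from the sample to their schoolmates, then to the schoolmates' family members, and finally to the Income variable.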

In [7]:
# Create path to latest version of CBS data
netcbs.format_path(context='Family[301]', year=2010, cbsdata_path='cbsdata/Bevolking')

('cbsdata/Bevolking/FAMILIENETWERKTAB/FAMILIENETWERKTAB2010V1.csv', {301})
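
The file name in the output ends in a version suffix (V1). How netcbs picks the latest version is not shown here, so the following is only a sketch of one way to do it, assuming file names of the form table + year + "V" + number + ".csv". The helper `latest_version` and the V2 and 2009 file names are hypothetical.

```python
import re


def latest_version(filenames, table, year):
    """Return the file name with the highest version number, or None if none match."""
    pattern = re.compile(rf"^{re.escape(table)}{year}V(\d+)\.csv$")
    versions = {}
    for name in filenames:
        match = pattern.match(name)
        if match:
            # Compare versions as integers so that V10 ranks above V9
            versions[int(match.group(1))] = name
    return versions[max(versions)] if versions else None


# Only the first file name appears in the output above; the others are made up
files = [
    "FAMILIENETWERKTAB2010V1.csv",
    "FAMILIENETWERKTAB2010V2.csv",
    "FAMILIENETWERKTAB2009V3.csv",
]
latest = latest_version(files, "FAMILIENETWERKTAB", 2010)
print(latest)
```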

## Create synthetic data (not needed at CBS!)

Let's create some synthetic data to explain how the code works.

For each context (Family, Colleagues, Neighbors, Schoolmates, Housemates), we will generate a "network file" containing 1,000,000 relationships. Each relationship is taken at random from any of the context's types (see netcbs.context2types).


In [3]:
netcbs.create_synthetic_data("Family", 2021, 1_000_000, outpath="cbsdata/Bevolking")
netcbs.create_synthetic_data("Colleagues", 2021, 1_000_000, outpath="cbsdata/Bevolking")
netcbs.create_synthetic_data("Neighbors", 2021, 1_000_000, outpath="cbsdata/Bevolking")
netcbs.create_synthetic_data("Schoolmates", 2021, 1_000_000, outpath="cbsdata/Bevolking")
netcbs.create_synthetic_data("Housemates", 2021, 1_000_000, outpath="cbsdata/Bevolking")
