## 🧩 User and Note Sampling Utility

This script randomly samples a specified number of users and, for each selected user, retrieves up to a set number of notes.  
It is designed to generate a manageable subset of the full dataset for exploratory analysis or preliminary modeling.  
The number of users (`users`) and maximum notes per user (`notes`) are configurable variables.  
Each sampled user's notes are joined with user metadata from the enrollment table, and the final sample is saved as a CSV file for further analysis. Because the join is an inner join, notes whose author has no enrollment record are dropped.  
Sampling users first and then capping the notes kept per user spreads the sample across many authors and keeps prolific authors from dominating it, which also keeps downstream processing fast.
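
The two-stage idea described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the function name `sample_users_and_notes`, the `seed` parameter, and the toy data are all hypothetical, and the notebook itself does the same job with Polars.

```python
import random

def sample_users_and_notes(notes_by_user, n_users, max_notes, seed=None):
    """Pick up to n_users authors, then up to max_notes notes per author."""
    rng = random.Random(seed)  # hypothetical seed, added here for repeatability
    # Stage 1: sample authors without replacement, capped at the population size.
    user_ids = sorted(notes_by_user)
    chosen = rng.sample(user_ids, min(n_users, len(user_ids)))
    # Stage 2: cap each chosen author's notes at max_notes.
    return {
        uid: rng.sample(notes_by_user[uid], min(max_notes, len(notes_by_user[uid])))
        for uid in chosen
    }

# Hypothetical toy data: author id -> list of note ids.
toy = {"a": [1, 2, 3, 4, 5], "b": [6], "c": [7, 8], "d": [9, 10, 11]}
sample = sample_users_and_notes(toy, n_users=3, max_notes=2, seed=0)
print({uid: len(ns) for uid, ns in sample.items()})
```

Authors with fewer notes than the cap contribute all of their notes, so the total row count is usually well below `users * notes`.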


In [2]:
%pip install polars

Note: you may need to restart the kernel to use updated packages.


In [None]:
users = 500  # number of distinct note authors to sample
notes = 7    # maximum number of notes to keep per sampled author

In [1]:
import polars as pl

# Load user enrollment
df_user_enrollment = pl.read_parquet(
    "/home/jovyan/Shared/2025-09-27-input/userEnrollment-00000.parquet"
)

# Load all notes
df_notes = pl.read_parquet(
    "/home/jovyan/Shared/2025-09-27-input/notes-00000.parquet",
    glob=True,
)

# Get unique users who wrote notes
user_ids_with_notes = df_notes.select("noteAuthorParticipantId").unique()

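# Sample `users` distinct authors. Cap n at the number of available authors,
# because sampling without replacement fails when n exceeds the population.
sampled_ids = (
    user_ids_with_notes
    .sample(n=min(users, user_ids_with_notes.height), shuffle=True)
    .to_series()
    .to_list()
)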

# Keep only the notes written by the sampled users
sampled_notes = df_notes.filter(
    pl.col("noteAuthorParticipantId").is_in(sampled_ids)
)

# For each sampled user, keep a random selection of at most `notes` notes
def sample_up_to(df: pl.DataFrame) -> pl.DataFrame:
    n = min(notes, df.height)
    return df.sample(n=n, shuffle=True)

sampled_notes_limited = (
    sampled_notes
    .group_by("noteAuthorParticipantId", maintain_order=True)
    .map_groups(sample_up_to)
)

# Join sampled notes to user enrollment to get metadata
result = sampled_notes_limited.join(
    df_user_enrollment,
    left_on="noteAuthorParticipantId",
    right_on="participantId",
    how="inner"
)

# Save result
result.write_csv(f"/home/jovyan/Shared/project1-group1/info-470-project-1/langdata/sampled_notes_{users}users.csv")

print(f"Exported {result.height} rows: up to {notes} notes for each of up to {users} sampled users.")

Exported 1790 rows: up to 7 notes for each of up to 500 sampled users.
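
The reported count is consistent with the limits: 500 users at up to 7 notes each gives an upper bound of 3,500 rows, and 1,790 rows works out to about 3.6 notes per requested user. A quick way to confirm that an exported sample respects both limits is sketched below. The helper name `check_sample` and the toy rows are hypothetical, and `noteId` is an assumed column name; only `noteAuthorParticipantId` comes from the notebook.

```python
from collections import Counter

def check_sample(rows, max_users, max_notes, id_col="noteAuthorParticipantId"):
    """Return per-author row counts after checking both sampling limits."""
    counts = Counter(row[id_col] for row in rows)
    assert len(counts) <= max_users, "more authors than requested"
    assert max(counts.values(), default=0) <= max_notes, "an author exceeds the note cap"
    return counts

# Hypothetical rows standing in for the exported CSV.
rows = [
    {"noteAuthorParticipantId": "u1", "noteId": 1},
    {"noteAuthorParticipantId": "u1", "noteId": 2},
    {"noteAuthorParticipantId": "u2", "noteId": 3},
]
counts = check_sample(rows, max_users=500, max_notes=7)
print(len(rows), "rows from", len(counts), "authors")
```

For the real file, the rows could be read with `csv.DictReader` and passed to the same function. The author count may come out below `users`, since the inner join drops authors without an enrollment record.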
