# User Metadata Construction from Browsing Logs

This notebook constructs a **clean, user-level metadata table** by combining:
- A large browsing masterlist (`masterlist_df`)
- Raw device-level logs (Term 1 vs Term 3)
- Official school metadata (`school_category_3.csv`)

Each user is uniquely identified by a **(device_name_actual, school)** pair.
The final output is a consolidated dataset suitable for downstream analysis.


---
## 1. Load Masterlist Dataset

We begin by loading the master browsing dataset, which contains:
- Device identifiers
- School identifiers
- Timestamps, URLs, and derived linguistic features

This dataset represents **event-level browsing records**, not users.

In [9]:
import pandas as pd 
masterlist_df = pd.read_csv("/Users/tdf/Downloads/filtered_df_for_specific_broad_final.csv")
masterlist_df.head()



Unnamed: 0,school_name,created_at,uri,device_name_actual,was_allowed,profile_custom,search_q,has_safe_active,is_navigational,q_episode,...,topic,hypernym_distance,hypernym_specificity,episode_hypernym_specificity_ratio,avg_text_concreteness,brysbaert_specificity,brysbaert_concreteness_for_episode,episode_brysbaert_specificity_ratio,Category_nouns,Category_noun_words
0,orchid_park_secondary,2023-07-09 21:06:21+00:00,https://www.google.com/search?q=wy&rlz=1CAGWKK...,36719,True,0.0,wy,True,False,1.0,...,1,9.0,specific,1.0,0.0,abstract,0.0,0.0,False,[]
1,orchid_park_secondary,2023-07-09 21:21:00+00:00,https://www.google.com/search?rlz=1CAGWKK_enSG...,36719,True,0.0,hospital between 1959 and 1970,True,False,2.0,...,73,7.5,specific,1.0,4.64,concrete,4.64,1.0,False,[]
2,orchid_park_secondary,2023-07-09 21:21:54+00:00,https://www.google.com/search?q=hospital+betwe...,36719,True,0.0,hospital between 1960s and 1970s,False,False,2.0,...,0,6.5,specific,1.0,4.64,concrete,4.64,1.0,False,[]
3,orchid_park_secondary,2023-07-09 21:24:07+00:00,https://www.google.com/search?q=hospital+after...,36719,True,0.0,hospital after 1965,False,False,2.0,...,73,7.5,specific,1.0,4.64,concrete,4.64,1.0,False,[]
4,orchid_park_secondary,2023-07-09 21:24:34+00:00,https://www.google.com/search?q=hospital+after...,36719,True,0.0,hospital after 1965 in singapore,False,False,2.0,...,46,7.75,specific,1.0,4.64,concrete,4.64,1.0,False,[]


---
## 2. Inspect School Naming Conventions (Masterlist)

The `school_name` field in the masterlist uses **machine-friendly identifiers**
(e.g. `orchid_park_secondary`, `st_margaret's_secondary`).

We enumerate all unique school identifiers to:
- Understand coverage
- Prepare a mapping to official school names


In [10]:
for name in masterlist_df['school_name'].unique():
    print(name)

orchid_park_secondary
methodist_girls'_school
chung_cheng_high_yishun
mayflower_secondary
xinmin_secondary
choa_chu_kang_secondary
bukit_view_secondary
bendemeer_secondary
raffles_girls'
fuhua_secondary
catholic_high
chij_st_joseph's_convent
kent_ridge_secondary
chung_cheng_high_main
christ_church_secondary
sports_school
whitley_secondary
chij_st_nicholas
hai_sing_catholic_school
dunman_high
woodlands_secondary
nan_chiau_high_school
northland_secondary
kuo_chuan_presbyterian_school
st_margaret's_secondary
woodgrove_secondary
broadrick_secondary
yishun_secondary
north_vista_secondary
yishun_town_secondary
hillgrove_secondary
guangyang_secondary
swiss_cottage_secondary
manjusri_secondary
boonlay_secondary
assumption_english_secondary
yio_chu_kang_secondary
bowen_secondary
evergreen_secondary
pasir_ris_secondary
outram_secondary
maris_stella
ngee_ann_secondary
montfort_secondary
greendale_secondary
edgefield_secondary
seng_kang_secondary


---
## 3. Load Official School Metadata

The `school_category_3.csv` file contains **official MOE school metadata**, including:
- School name (human-readable)
- Gender type
- PSLE AL cutoff group

This dataset will later be merged using **standardised school names**.

In [15]:
SCHOOL_CATEGORY_FILE = pd.read_csv('/Users/tdf/Downloads/school_category_3.csv')
SCHOOL_CATEGORY_FILE.head()

Unnamed: 0,category,type,zone,cluster,name,principal,email,baseline,name2,os,touchscreen?,keyboard,stylus,gender,psle_al_cutoff_pg3,psle_al_cutoff_pg3_group,ses
0,GA,Sec,West,W6,Assumption English School,Mr Kwok Chin Poh Benjamin,Benjamin_KWOK@schools.gov.sg,Survey (alt),Assumption English Sch,Chrome,Y,Y,Y,co-ed,22,4,
1,G,Sec,South,S5,Bendemeer Secondary School,Ms Foo Sheue Feng,Foo Sheue Feng (SCHOOLS),Survey,Bendemeer Secondary - 539,Chrome,Y,Y,Y,co-ed,22,4,5.0
2,G,Sec,West,W3,Boon Lay Secondary School,Mr Inderjit Singh,Inderjit_SINGH@schools.gov.sg,Survey,Boon Lay Secondary - 644,iOS,Y,Y,Y,co-ed,22,4,4.0
3,G,Sec,North,N1,Bowen Secondary School,Mr Loh Chih Hui\n,LOH_CHIH_HUI@SCHOOLS.GOV.SG,Survey (alt),Bowen Secondary - 855,Chrome,Y,Y,Y,co-ed,16,3,3.0
4,G,Sec,East,E5,Broadrick Secondary School,Mr Ng Tiong Nam,Tiong Nam NG (SCHOOLS),Survey (alt),Broadrick Secondary - 688,Chrome,Y,Y,Y,co-ed,22,4,


In [16]:
for name in SCHOOL_CATEGORY_FILE['name'].unique():
    print(name)

Assumption English School
Bendemeer Secondary School
Boon Lay Secondary School
Bowen Secondary School
Broadrick Secondary School
Bukit View Secondary School
Canberra Secondary School
Catholic High School (Secondary)
CHIJ St. Joseph's Convent
CHIJ St. Nicholas Girls' School (Secondary)
Christ Church Secondary School
Chua Chu Kang Secondary School
Chung Cheng High School (Main)
Chung Cheng High School (Yishun)
Dunman High School
Edgefield Secondary School
Evergreen Secondary School
Fuhua Secondary School
Greendale Secondary School
Guangyang Secondary School
Hai Sing Catholic School
Hillgrove Secondary School
Kent Ridge Secondary School
Kuo Chuan Presbyterian Secondary School
Manjusri Secondary School
Maris Stella High School (Secondary)
Mayflower Secondary School
Methodist Girls' School (Secondary) 
Montfort Secondary School
Nan Chiau High School
Ngee Ann Secondary School
North Vista Secondary School
Northland Secondary School
Orchid Park Secondary School
Outram Secondary School
Pasir Ri

---
## 4. Identify and Handle Encoding Issues in School Names

During inspection of the official school metadata (`school_category_3.csv`), we identify
a character encoding issue affecting **St. Margaret’s Secondary School**:

- The apostrophe (`’`) appears as the mis-encoded sequence **`â€™`** in the CSV file.
- This is a known UTF-8 / Windows-1252 encoding mismatch.

To preserve data integrity, this issue is handled **surgically**:
- Only rows containing both **“Margaret”** and **`â€™`** are corrected.
- All other school names remain unchanged.

This ensures that encoding fixes do **not affect unrelated schools**.
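The mismatch and the targeted repair can be reproduced on toy data. The sketch below is illustrative only: the second school name is hypothetical, and the `names` Series stands in for the `name` column of `school_category_3.csv`.

```python
import pandas as pd

# How the mojibake arises: the UTF-8 bytes of U+2019 (’) decoded as Windows-1252.
garbled = "’".encode("utf-8").decode("cp1252")

# Toy stand-in for the `name` column; the second entry is hypothetical sample data.
names = pd.Series([
    f"St. Margaret{garbled}s Secondary School",
    f"Example School{garbled}s Annex",
])

# Restrict the repair to rows that contain both "Margaret" and the garbled sequence.
mask = (
    names.str.contains("Margaret", regex=False)
    & names.str.contains(garbled, regex=False)
)
names.loc[mask] = names.loc[mask].str.replace(garbled, "’", regex=False)

print(names.tolist())
```

Only the first name is rewritten; the hypothetical second name keeps its garbled sequence, which is what restricting the fix to the Margaret rows means in practice.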

---
## 5. Construct the User Universe

We define a user as a unique:

> **(device_name_actual, school_name)** pair

This design choice ensures that:
- Devices reused across different schools are treated as distinct users
- All downstream joins and classifications remain school-specific

The same `device_name_actual` can appear across multiple schools, but each occurrence is treated as a distinct user.

The resulting dataset (`users_df`) represents the **complete universe of users**
derived from the masterlist.
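A minimal sketch of this definition on toy data follows. The device IDs and the `events` frame are made up for illustration; only the column names come from the masterlist.

```python
import pandas as pd

# Toy event-level records; device 101 appears under two schools (hypothetical data).
events = pd.DataFrame({
    "device_name_actual": [101, 101, 101, 202],
    "school_name": [
        "orchid_park_secondary", "orchid_park_secondary",
        "xinmin_secondary", "xinmin_secondary",
    ],
})

# One row per (device, school) pair, however many events the pair has.
users = (
    events[["device_name_actual", "school_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Device IDs that occur under more than one school.
schools_per_device = users.groupby("device_name_actual")["school_name"].nunique()
shared_devices = schools_per_device[schools_per_device > 1].index.tolist()

print(len(users), shared_devices)
```

Listing `shared_devices` is a quick way to see how often device IDs are actually reused across schools before relying on the pair as the user key.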

---
## 6. Map School Identifiers to Official School Names

The masterlist uses machine-readable school identifiers
(e.g. `st_margaret's_secondary`, `orchid_park_secondary`).

To enable a clean merge with official school metadata, we map these identifiers to
**conventional, human-readable school names** that match those in
`school_category_3.csv`.

Example:
- `st_margaret's_secondary` → `St. Margaret’s Secondary School`

This standardised field is stored as:
- **`school_name_official`**
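The mapping step can be sketched as below, with a check for identifiers that have no entry in the dictionary. The two mapping entries are taken from this notebook; `unknown_school` is a hypothetical identifier added to show what an unmapped value looks like.

```python
import pandas as pd

# Two entries from the notebook's mapping; the full dictionary has one per school.
school_mapping = {
    "orchid_park_secondary": "Orchid Park Secondary School",
    "st_margaret's_secondary": "St. Margaret’s Secondary School",
}

users = pd.DataFrame({
    "school_name": [
        "orchid_park_secondary",
        "st_margaret's_secondary",
        "unknown_school",  # hypothetical identifier with no mapping entry
    ],
})

# Identifiers missing from the dictionary map to NaN rather than raising an error.
users["school_name_official"] = users["school_name"].map(school_mapping)

unmapped = sorted(
    users.loc[users["school_name_official"].isna(), "school_name"].unique()
)
print(unmapped)
```

Because `Series.map` fails silently on missing keys, inspecting `unmapped` right after the mapping surfaces gaps before they turn into missing metadata downstream.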

---
## 7. Determine Device Presence by Academic Term

To infer students’ academic levels, we determine whether each device appears in:

- **Term 1 data**
  - Data files with filenames starting with **`MG_`**
- **Term 3 data**
  - Data files with filenames starting with **`Browsing history`**

For each school folder:
- Device identifiers are extracted in chunks for memory efficiency
- Two global sets are built:
  - `term1_devices`
  - `term3_devices`

This allows efficient membership checks for each user.
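The two steps above (classifying a file by its name, then collecting device IDs chunk by chunk) can be sketched as follows. The helper `classify_term`, the example filenames, and the in-memory CSV are assumptions made for illustration; the real run reads files from the raw data folders.

```python
import io
from typing import Optional

import pandas as pd


def classify_term(filename: str) -> Optional[str]:
    """Return 'term1', 'term3', or None, based on the filename prefix."""
    name = filename.lower()
    if name.startswith("mg_"):
        return "term1"
    if name.startswith("browsing history"):
        return "term3"
    return None


# Hypothetical filenames; the last one contains "mg" but is not a Term 1 file.
labels = [classify_term(n) for n in ["MG_a.csv", "Browsing history (1).csv", "img_export.csv"]]

# Chunked extraction of device IDs, shown on a small in-memory CSV.
raw = io.StringIO("device_name_actual,uri\n1,a\n2,b\n1,c\n")
devices = set()
for chunk in pd.read_csv(raw, usecols=["device_name_actual"], chunksize=2):
    devices.update(chunk["device_name_actual"].dropna().unique())

print(labels, devices)
```

Matching on the prefix rather than on a substring keeps unrelated files whose names merely contain `mg` out of the Term 1 set.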

---
## 8. Define and Assign Student Level (`level`)

We define a new variable **`level`** to represent students’ academic level:

### Definition
- **`level = 1`**  
  Device appears in **Term 3 data but NOT Term 1 data**  
  → These are **Secondary 1 students**, who only received their devices in Term 3

- **`level = 2`**  
  All other users  
  → These represent **Secondary 2–5 students**

### Implementation
- All users are initialized with `level = 2`
- Users whose devices appear only in Term 3 are reassigned to `level = 1`


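Each (device, school) row in `users_df` receives its own `level`. Note, however, that `term1_devices` and `term3_devices` are global sets, so the membership test matches on `device_name_actual` alone: a device ID that is reused across schools gets the same Term 1 / Term 3 status in each of them.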
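If a strictly per-school inference is wanted, one possible variant keys the presence sets on (school, device) pairs instead of bare device IDs. This is a sketch on toy data, not the notebook's implementation: the names `term1_pairs` and `term3_pairs` are hypothetical, and it assumes the raw-data folder names match the masterlist's `school_name` values.

```python
import pandas as pd

users = pd.DataFrame({
    "device_name_actual": [101, 101, 202],
    "school_name": ["school_a", "school_b", "school_b"],
})

# Hypothetical (school, device) pairs observed in each term's raw files.
term1_pairs = {("school_a", 101)}
term3_pairs = {("school_a", 101), ("school_b", 101), ("school_b", 202)}

# Build one key per user row and test membership against each term's set.
pairs = list(zip(users["school_name"], users["device_name_actual"]))
in_term1 = pd.Series([p in term1_pairs for p in pairs], index=users.index)
in_term3 = pd.Series([p in term3_pairs for p in pairs], index=users.index)

# Default to level 2, then reassign rows seen in Term 3 but not in Term 1.
users["level"] = 2
users.loc[in_term3 & ~in_term1, "level"] = 1

print(users)
```

In this toy example device 101 is level 2 at `school_a` but level 1 at `school_b`; with global device sets it would be level 2 in both rows.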

---
## 9. Load School Metadata and Define New Variables

We now merge official school metadata from `school_category_3.csv`
to define additional user-level variables.

The following variables are constructed:

### Gender (`gender`)

The `gender` variable is derived from the **`gender`** column in
`school_category_3.csv`, based on school type:

- **Girls’ school** → `gender = 1`
- **Boys’ school** → `gender = 2`
- **Co-educational school** → `gender = 99`

The value `99` is used for co-ed schools because the biological sex
of individual students is unknown.

### PSLE Cutoff Group (`sch_psle`)

The `sch_psle` variable is derived from the
**`psle_al_cutoff_pg3_group`** column in `school_category_3.csv`.

This represents the school’s PSLE AL cutoff group.

Example:
- Devices from **Assumption English School** are assigned  
  `sch_psle = 4`, as the school’s cutoff group is 4 in the metadata file.
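The gender coding can be sketched as below. The school names and the padded `"Girls "` label are hypothetical; normalising whitespace and case before mapping is an added precaution, not something the notebook's pipeline does.

```python
import pandas as pd

# Codes as defined in this section.
gender_map = {"girls": 1, "boys": 2, "co-ed": 99}

# Toy metadata; "Girls " (capitalised, trailing space) is a hypothetical messy label.
school_cat = pd.DataFrame({
    "name": ["School A", "School B", "School C"],
    "gender": ["co-ed", "Girls ", "boys"],
    "psle_al_cutoff_pg3_group": [4, 1, 3],
})

# Normalise labels so that small formatting differences do not map to NaN.
normalised = school_cat["gender"].str.strip().str.lower()
school_cat["gender_code"] = normalised.map(gender_map)

# Any label that still fails to map is listed for manual review.
unrecognised = school_cat.loc[school_cat["gender_code"].isna(), "gender"].tolist()

print(school_cat[["name", "gender_code"]], unrecognised)
```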

---
## 10. Merge User Data with School Metadata

The user universe is merged with official school metadata using:

- `school_name_official` (from the masterlist mapping)
- `name` (from `school_category_3.csv`)

This merge appends:
- `gender`
- `sch_psle`

to each **(device_name_actual, school)** pair.

After merging:
- Redundant columns are removed
- Variable names are standardized for clarity
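A hedged sketch of the merge on toy data follows. Stripping whitespace from the key and passing `validate` and `indicator` to `merge` are optional safeguards that this notebook's pipeline does not use; the school names and codes here are made up.

```python
import pandas as pd

users = pd.DataFrame({
    "device_name_actual": [1, 2, 3],
    "school_name_official": ["School A", "School B", "School B"],
})

# Toy metadata; the trailing space in "School B " would break an exact-match join.
school_cat = pd.DataFrame({
    "name": ["School A", "School B "],
    "gender": [99, 1],
    "psle_al_cutoff_pg3_group": [4, 1],
})
school_cat["name"] = school_cat["name"].str.strip()

merged = users.merge(
    school_cat,
    left_on="school_name_official",
    right_on="name",
    how="left",
    validate="many_to_one",  # raise if a school appears twice in the metadata
    indicator=True,          # adds a `_merge` column marking unmatched rows
)

# Schools in the user table that found no metadata row.
unmatched = (
    merged.loc[merged["_merge"] == "left_only", "school_name_official"]
    .unique()
    .tolist()
)

# Drop helper columns and apply the final variable name.
merged = merged.drop(columns=["name", "_merge"]).rename(
    columns={"psle_al_cutoff_pg3_group": "sch_psle"}
)
print(merged, unmatched)
```

`validate="many_to_one"` guards against duplicated school rows silently multiplying users, and the `_merge` column identifies exactly which schools failed to match.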

---
## 11. Final Data Quality Checks

We perform final validation checks to ensure data completeness:

- Missing values in:
  - `school_name_official`
  - `gender`
  - `sch_psle`
  - `level`
- Verification that encoding corrections succeeded
- Confirmation that all users were successfully classified

Any remaining missing values indicate schools that require
manual inspection or additional mapping.
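The checks listed above could be gathered into one small report, sketched here on toy data. The function name `summarise_quality` and the exact set of checks are assumptions made for illustration.

```python
import pandas as pd

REQUIRED = ["school_name_official", "gender", "sch_psle", "level"]
MOJIBAKE = "\u00e2\u20ac"  # first two characters of the garbled apostrophe sequence


def summarise_quality(df: pd.DataFrame) -> dict:
    """Count missing values, leftover garbled names, bad levels, and duplicate users."""
    names = df["school_name_official"]
    return {
        "missing": df[REQUIRED].isna().sum().to_dict(),
        "garbled_names": int(names.str.contains(MOJIBAKE, regex=False, na=False).sum()),
        "invalid_level": int((~df["level"].isin([1, 2])).sum()),
        "duplicate_users": int(
            df.duplicated(["device_name_actual", "school_name_official"]).sum()
        ),
    }


# Toy user table with one row lacking school metadata (hypothetical data).
users = pd.DataFrame({
    "device_name_actual": [1, 2],
    "school_name_official": ["School A", "School B"],
    "gender": [99, None],
    "sch_psle": [4, None],
    "level": [2, 1],
})

report = summarise_quality(users)
print(report)
```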

---
## 12. Export Final User Metadata

The final dataset includes:
- `device_name_actual`
- Official school name
- Gender code
- PSLE cutoff group
- Inferred academic level

Outputs are saved as both:
- Excel (`.xlsx`) for inspection
- CSV (`.csv`) for analysis pipelines
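One detail worth considering before export: a left merge that leaves any missing value turns integer code columns into floats, so codes may be written as `1.0` rather than `1`. The sketch below casts them to pandas' nullable `Int64` type first. This is an optional step under that assumption, not part of the notebook's pipeline, and it writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Toy user table whose code columns became floats after a merge (hypothetical data).
users = pd.DataFrame({
    "device_name_actual": [1, 2],
    "school_name_official": ["School A", "School B"],
    "gender": [99.0, 1.0],
    "sch_psle": [4.0, 1.0],
    "level": [2, 1],
})

# Nullable integers keep the codes integral and still allow missing values.
users = users.astype({"gender": "Int64", "sch_psle": "Int64", "level": "Int64"})

buffer = io.StringIO()
users.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```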


In [52]:
import pandas as pd
from pathlib import Path
from tqdm import tqdm

# =====================================================
# 0. INPUTS
# =====================================================
MASTERLIST_FILE = "/Users/tdf/Downloads/filtered_df_for_specific_broad_final.csv"
RAW_DATA_ROOT = Path("/Users/tdf/Downloads/raw_data")
SCHOOL_CATEGORY_FILE = "/Users/tdf/Downloads/school_category_3.csv"
OUTPUT_FILE_XLSX = "/Users/tdf/Downloads/user_metadata.xlsx"
OUTPUT_FILE_CSV = "/Users/tdf/Downloads/user_metadata.csv"

CHUNKSIZE = 200_000

# =====================================================
# 1. LOAD MASTERLIST AND BUILD USER UNIVERSE
# =====================================================
masterlist_df = pd.read_csv(MASTERLIST_FILE, low_memory=False)

# Each (device_name_actual, school_name) is unique
users_df = masterlist_df[['device_name_actual', 'school_name']].drop_duplicates().reset_index(drop=True)
print(f"Unique user-school pairs in masterlist: {len(users_df)}")

# =====================================================
# 2. SCHOOL MAPPING: masterlist -> conventional official name
# =====================================================
school_mapping = {
    'orchid_park_secondary': 'Orchid Park Secondary School',
    'methodist_girls\'_school': 'Methodist Girls\' School (Secondary)',
    'chung_cheng_high_yishun': 'Chung Cheng High School (Yishun)',
    'mayflower_secondary': 'Mayflower Secondary School',
    'xinmin_secondary': 'Xinmin Secondary School',
    'choa_chu_kang_secondary': 'Chua Chu Kang Secondary School',
    'bukit_view_secondary': 'Bukit View Secondary School',
    'bendemeer_secondary': 'Bendemeer Secondary School',
    'raffles_girls\'': 'Raffles Girls\' School',
    'fuhua_secondary': 'Fuhua Secondary School',
    'catholic_high': 'Catholic High School (Secondary)',
    "chij_st_joseph's_convent": "CHIJ St. Joseph's Convent",
    'kent_ridge_secondary': 'Kent Ridge Secondary School',
    'chung_cheng_high_main': 'Chung Cheng High School (Main)',
    'christ_church_secondary': 'Christ Church Secondary School',
    'sports_school': 'Singapore Sports School',
    'whitley_secondary': 'Whitley Secondary School',
    'chij_st_nicholas': "CHIJ St. Nicholas Girls' School (Secondary)",
    'hai_sing_catholic_school': 'Hai Sing Catholic School',
    'dunman_high': 'Dunman High School',
    'woodlands_secondary': 'Woodlands Secondary School',
    'nan_chiau_high_school': 'Nan Chiau High School',
    'northland_secondary': 'Northland Secondary School',
    'kuo_chuan_presbyterian_school': 'Kuo Chuan Presbyterian Secondary School',
    'st_margaret\'s_secondary': 'St. Margaret’s Secondary School',  # target
    'woodgrove_secondary': 'Woodgrove Secondary School',
    'broadrick_secondary': 'Broadrick Secondary School',
    'yishun_secondary': 'Yishun Secondary School',
    'north_vista_secondary': 'North Vista Secondary School',
    'yishun_town_secondary': 'Yishun Town Secondary School',
    'hillgrove_secondary': 'Hillgrove Secondary School',
    'guangyang_secondary': 'Guangyang Secondary School',
    'swiss_cottage_secondary': 'Swiss Cottage Secondary School',
    'manjusri_secondary': 'Manjusri Secondary School',
    'boonlay_secondary': 'Boon Lay Secondary School',
    'assumption_english_secondary': 'Assumption English School',
    'yio_chu_kang_secondary': 'Yio Chu Kang Secondary School',
    'bowen_secondary': 'Bowen Secondary School',
    'evergreen_secondary': 'Evergreen Secondary School',
    'pasir_ris_secondary': 'Pasir Ris Secondary School',
    'outram_secondary': 'Outram Secondary School',
    'maris_stella': 'Maris Stella High School (Secondary)',
    'ngee_ann_secondary': 'Ngee Ann Secondary School',
    'montfort_secondary': 'Montfort Secondary School',
    'greendale_secondary': 'Greendale Secondary School',
    'edgefield_secondary': 'Edgefield Secondary School',
    'seng_kang_secondary': 'Seng Kang Secondary School',
}

# Map masterlist_df school_name -> conventional official name
users_df['school_name_official'] = users_df['school_name'].map(school_mapping)

# =====================================================
# 3. COLLECT TERM 1 & TERM 3 DEVICE PRESENCE
# =====================================================
term1_devices = set()
term3_devices = set()

school_folders = [f for f in RAW_DATA_ROOT.iterdir() if f.is_dir()]

for school_folder in tqdm(school_folders, desc="Schools"):
    csv_files = list(school_folder.glob("*.csv"))
    for csv_file in tqdm(csv_files, desc=f"{school_folder.name}", leave=False):
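        filename_lower = csv_file.name.lower()
        # Term 1 files start with "MG_"; Term 3 files start with "Browsing history".
        # Prefix matching avoids catching unrelated names that merely contain "mg".
        if filename_lower.startswith("mg_"):
            file_type = "term1"
        elif filename_lower.startswith("browsing history"):
            file_type = "term3"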
        else:
            continue

        try:
            for chunk in pd.read_csv(csv_file, usecols=['device_name_actual'], chunksize=CHUNKSIZE):
                devices = chunk['device_name_actual'].dropna().unique()
                if file_type == "term1":
                    term1_devices.update(devices)
                else:
                    term3_devices.update(devices)
        except Exception as e:
            print(f"Skipping {csv_file.name} due to error: {e}")

print("\nFinished scanning raw data.")
print(f"Devices in Term 1: {len(term1_devices)}, Term 3: {len(term3_devices)}")

# =====================================================
# 4. ASSIGN LEVEL
# =====================================================
users_df['level'] = 2
users_df.loc[
    users_df['device_name_actual'].isin(term3_devices) &
    ~users_df['device_name_actual'].isin(term1_devices),
    'level'
] = 1

# =====================================================
# 5. LOAD SCHOOL METADATA AND MERGE
# =====================================================
school_cat = pd.read_csv(SCHOOL_CATEGORY_FILE, low_memory=False)

# Only fix mis-encoding for St. Margaret’s
mask_margaret = school_cat['name'].str.contains('â€™', regex=False) & \
                school_cat['name'].str.contains('Margaret', regex=False)
school_cat.loc[mask_margaret, 'name'] = school_cat.loc[mask_margaret, 'name'].str.replace('â€™', '’', regex=False)

# Map gender to numbers
gender_map = {'girls': 1, 'boys': 2, 'co-ed': 99}
school_cat['gender'] = school_cat['gender'].map(gender_map)

# Merge on official names
users_df = users_df.merge(
    school_cat[['name', 'gender', 'psle_al_cutoff_pg3_group']],
    left_on='school_name_official',
    right_on='name',
    how='left'
)

# Clean final columns
users_df.rename(columns={'psle_al_cutoff_pg3_group': 'sch_psle'}, inplace=True)
users_df.drop(columns=['name'], inplace=True)

# =====================================================
# 6. FINAL CHECKS
# =====================================================
print("\nMissing values summary:")
print(users_df[['school_name_official','gender','sch_psle','level']].isna().sum())

# =====================================================
# 7. EXPORT
# =====================================================
final_user_metadata = users_df[['device_name_actual','school_name_official','gender','sch_psle','level']]
final_user_metadata.to_excel(OUTPUT_FILE_XLSX, index=False)
final_user_metadata.to_csv(OUTPUT_FILE_CSV, index=False)

print(f"\nUser metadata files created: {OUTPUT_FILE_XLSX}, {OUTPUT_FILE_CSV}")

Unique user-school pairs in masterlist: 4989


Schools: 100%|██████████████████████████████████| 47/47 [03:02<00:00,  3.89s/it]



Finished scanning raw data.
Devices in Term 1: 34674, Term 3: 46294

Missing values summary:
school_name_official    0
gender                  1
sch_psle                1
level                   0
dtype: int64

User metadata files created: /Users/tdf/Downloads/user_metadata.xlsx, /Users/tdf/Downloads/user_metadata.csv


One row has missing `gender` and `sch_psle` values, so it needs manual inspection.

## 13. Inspect Rows with Missing Values

We extract the specific rows with missing values for review:

In [53]:
missing_rows = users_df[
    users_df['gender'].isna() | users_df['sch_psle'].isna()
]

missing_rows[
    ['device_name_actual', 'school_name', 'school_name_official', 'gender', 'sch_psle']
]

Unnamed: 0,device_name_actual,school_name,school_name_official,gender,sch_psle
78,1822,methodist_girls'_school,Methodist Girls' School (Secondary),,
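Before filling the row by hand, it can help to see why the join failed. In the name list printed in Section 3, `Methodist Girls' School (Secondary)` appears to carry a trailing space; if that is what the CSV holds, an exact-match merge would miss it. The sketch below is a diagnostic under that assumption, using Python's standard `difflib`. The `official_names` list is a two-item stand-in for the real column.

```python
import difflib

# Stand-in for school_cat['name']; the trailing space mirrors the printed name list.
official_names = ["Methodist Girls' School (Secondary) ", "Nan Chiau High School"]
target = "Methodist Girls' School (Secondary)"

# Exact matching fails when the stored name has stray whitespace.
exact_match = target in official_names

# The closest official name, if any is similar enough.
close = difflib.get_close_matches(target, official_names, n=1, cutoff=0.8)

# Candidates that differ from the target only by surrounding whitespace.
whitespace_only = [name for name in close if name.strip() == target]

print(exact_match, [repr(name) for name in close], whitespace_only)
```

Printing names with `repr` makes invisible differences such as trailing spaces visible; if whitespace is indeed the cause, stripping the `name` column before the merge would remove the need for a manual fill.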


## 14. Manually Fill Missing Values

Based on metadata inspection:

- Methodist Girls’ School is a girls’ school, so `gender` = 1
- PSLE cutoff group for this school is 1, so `sch_psle` = 1

We update the row manually.

After manual updates, confirm that there are no remaining missing values:

In [54]:
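# Fill missing gender and PSLE for device 1822 at Methodist Girls' School only.
# Device IDs can recur across schools, so the filter includes the school as well.
mask_mgs_1822 = (
    (users_df['device_name_actual'] == 1822) &
    (users_df['school_name'] == "methodist_girls'_school")
)
users_df.loc[mask_mgs_1822, ['gender', 'sch_psle']] = [1, 1]  # gender = 1 (girls), sch_psle = 1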

# Check that missing values are gone
users_df[['school_name_official','gender','sch_psle','level']].isna().sum()


school_name_official    0
gender                  0
sch_psle                0
level                   0
dtype: int64

## 15. Export Updated User Metadata

Finally, we export the fully corrected and completed user metadata:

In [55]:
# Export updated user metadata
final_user_metadata = users_df[['device_name_actual','school_name_official','gender','sch_psle','level']]
final_user_metadata.to_excel(OUTPUT_FILE_XLSX, index=False)
final_user_metadata.to_csv(OUTPUT_FILE_CSV, index=False)

print(f"Updated user metadata files created: {OUTPUT_FILE_XLSX}, {OUTPUT_FILE_CSV}")


Updated user metadata files created: /Users/tdf/Downloads/user_metadata.xlsx, /Users/tdf/Downloads/user_metadata.csv


In [57]:
# Reload the CSV exported above to confirm it contains no missing values
df = pd.read_csv(OUTPUT_FILE_CSV)

# Check for missing values in all columns
missing_summary = df.isna().sum()
print(missing_summary)

device_name_actual      0
school_name_official    0
gender                  0
sch_psle                0
level                   0
dtype: int64
