# Reading PISA data from SAS and SPSS files and importing it into BigQuery

## Background
[PISA (Programme for International Student Assessment)](https://www.oecd.org/en/about/programmes/pisa.html) conducts school-level analysis to assess student performance in reading, mathematics, and science. While PISA aggregates data at a country level, it has school-level datasets that list a variety of indicators or factors potentially impacting the outcomes.

PISA does not provide data on individual students' performance. However, it calculates [*plausible values*](https://www.oecd.org/en/about/programmes/pisa/how-to-prepare-and-analyse-the-pisa-database.html#methodology) that are created through multiple imputations of values drawn from a distribution that reflects the uncertainty in estimating a student's true proficiency based on their test responses.

Based on the plausible values, we can calculate the average performance per school. This enables analysis of the impact of various school-related factors (e.g., student/teacher ratio or having a band) on the educational outcomes in mathematics, reading, and science as assessed by PISA.
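
As a rough illustration of the averaging step, the sketch below computes a weighted school mean for each plausible value and then averages across the plausible values. The toy numbers, the helper name `school_means`, and the column names `PV1MATH`, `PV2MATH`, and `W_FSTUWT` are assumptions modeled on PISA's naming conventions; they are not taken from this notebook.

```python
import pandas as pd

# Toy student-level rows; all values are made up for illustration only.
students = pd.DataFrame({
    "CNTSCHID": [1, 1, 2, 2],            # school identifier
    "W_FSTUWT": [1.0, 3.0, 2.0, 2.0],    # assumed student weight column
    "PV1MATH": [400.0, 500.0, 450.0, 550.0],
    "PV2MATH": [420.0, 480.0, 470.0, 530.0],
})
pv_cols = ["PV1MATH", "PV2MATH"]


def school_means(df, pv_cols, weight_col="W_FSTUWT", school_col="CNTSCHID"):
    """Weighted mean of each plausible value per school, averaged across PVs."""
    # Multiply every plausible value by the student's weight.
    weighted = df[pv_cols].mul(df[weight_col], axis=0)
    # Sum weighted values and weights within each school.
    sums = weighted.groupby(df[school_col]).sum()
    weights = df.groupby(school_col)[weight_col].sum()
    # Weighted mean per plausible value, then the mean over plausible values.
    per_pv = sums.div(weights, axis=0)
    return per_pv.mean(axis=1).rename("math_mean")


result = school_means(students, pv_cols)
print(result)
```

Averaging the per-value estimates, rather than averaging the plausible values within each student first, keeps each plausible value as a separate estimate until the last step.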

We will then upload the data to BigQuery, creating two tables:
- pisa_data
- pisa_codebooks

These tables have been downloaded and posted on Kaggle as
- [pisa_data.csv](https://www.kaggle.com/datasets/yummykaggle/pisa-school-level-indicators-and-outcomes/data?select=pisa_data.csv)
- [pisa_codebooks.csv](https://www.kaggle.com/datasets/yummykaggle/pisa-school-level-indicators-and-outcomes/data?select=pisa_codebooks.csv)
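
For the BigQuery upload mentioned above, it can help to derive a column schema from the DataFrame first. The following is a minimal sketch with a simplified dtype mapping. The helper name `bigquery_schema` is hypothetical, the upload call itself is not shown, and nothing here is confirmed to be the method used for the published tables.

```python
import pandas as pd


def bigquery_schema(df):
    """Map pandas dtypes to BigQuery type names (a simplified mapping)."""
    schema = []
    for name, dtype in df.dtypes.items():
        if pd.api.types.is_bool_dtype(dtype):
            bq_type = "BOOL"
        elif pd.api.types.is_integer_dtype(dtype):
            bq_type = "INT64"
        elif pd.api.types.is_float_dtype(dtype):
            bq_type = "FLOAT64"
        elif pd.api.types.is_datetime64_any_dtype(dtype):
            bq_type = "TIMESTAMP"
        else:
            # Everything else (text, mixed objects) is treated as a string.
            bq_type = "STRING"
        schema.append({"name": name, "type": bq_type})
    return schema


# Toy frame using a few column names that appear in the school dataset.
demo = pd.DataFrame({"CNT": ["ALB"], "CNTSCHID": [800001], "STRATIO": [16.9351]})
print(bigquery_schema(demo))
```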

## Data sources

For both the impacting factors and the indicators of educational performance, we will use PISA's datasets, obtained from https://webfs.oecd.org/pisa2022/index.html.

To explain the meaning of the indicators' labels and their data types, PISA provides a *codebook*, which we will analyze and import as well.

In [None]:
import pandas as pd # for data
import openpyxl # for codebooks
import re
from collections import defaultdict

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
path = 'your-folder'


## Impacting factors dataset

To construct the dataset of factors that can potentially impact educational performance, we used three PISA assessment datasets:
- for 2022, https://webfs.oecd.org/pisa2022/SCH_QQQ_SAS.zip
- for 2018, https://webfs.oecd.org/pisa2018/SAS_SCH_QQQ.zip
- for 2015, https://webfs.oecd.org/pisa/PUF_SAS_COMBINED_CMB_SCH_QQQ.zip
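
The loading code in this notebook expects one file per year named `<year>.sas7bdat`. One possible way to get there from a downloaded archive is sketched here. The helper name `extract_sas` and the assumption that each archive contains a single `.sas7bdat` member are illustrative and not confirmed by the source.

```python
import shutil
import zipfile
from pathlib import Path


def extract_sas(zip_path, dest_dir, year):
    """Extract the first .sas7bdat member of a zip and save it as '<year>.sas7bdat'."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # Look for SAS data files anywhere inside the archive.
        members = [m for m in zf.namelist() if m.lower().endswith(".sas7bdat")]
        if not members:
            raise FileNotFoundError(f"No .sas7bdat file found in {zip_path}")
        target = dest_dir / f"{year}.sas7bdat"
        # Stream the member to disk under the year-based name.
        with zf.open(members[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

A call such as `extract_sas("SCH_QQQ_SAS.zip", path, 2022)` would then produce the file name used for 2022, assuming the archive has already been downloaded to the working directory.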

In [None]:
years = [2015, 2018, 2022]

Create a unified DataFrame for all years.

In [None]:
import pandas as pd

def create_unified_df(years, path):
    """
    Create unified DataFrame from multiple SAS files and track new indicators.

    Parameters:
    years: List of years [2022, 2018, 2015]
    path: Path to SAS files

    Returns:
    df: Unified DataFrame with 'year' column
    new_indicators_df: DataFrame with new columns by year
    """

    df = pd.DataFrame()
    new_indicators_list = []
    existing_columns = set()

    for i, year in enumerate(years):
        print(f"Processing year {year}...")

        # Read SAS file (expects '<path>/<year>.sas7bdat'; tolerates a missing trailing slash)
        file = f'{year}.sas7bdat'
        file_path = f"{path.rstrip('/')}/{file}"
        year_df = pd.read_sas(file_path)

        # Add year column
        year_df['year'] = year

        # Check for new columns (skip first file)
        if i > 0:
            new_columns = set(year_df.columns) - existing_columns
            new_columns.discard('year')  # Don't count year as new

            if new_columns:
                print(f"  Found {len(new_columns)} new columns: {list(new_columns)}")
                for col in new_columns:
                    new_indicators_list.append({'year': year, 'column': col})

        # Update existing columns set
        existing_columns.update(year_df.columns)

        # Concatenate DataFrames (pandas will align columns automatically)
        if df.empty:
            df = year_df.copy()
        else:
            df = pd.concat([df, year_df], ignore_index=True, sort=False)

        print(f"  DataFrame now has {len(df)} rows, {len(df.columns)} columns")

    # Create new indicators DataFrame
    new_indicators_df = pd.DataFrame(new_indicators_list) if new_indicators_list else pd.DataFrame(columns=['year', 'column'])

    print(f"\nFinal unified DataFrame: {len(df)} rows, {len(df.columns)} columns")
    print(f"New indicators found: {len(new_indicators_df)}")

    return df, new_indicators_df
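
Concatenating inside the loop copies the accumulated frame on every iteration. A possible alternative, sketched here with toy frames rather than the PISA files, collects the per-year frames in a list and concatenates once. Columns missing in a given year are filled with `NaN`, which is the alignment behavior the function above relies on.

```python
import pandas as pd

# Toy stand-ins for the per-year frames; real data would come from pd.read_sas.
frames_by_year = {
    2022: pd.DataFrame({"CNT": ["ALB"], "SCHSIZE": [652.0]}),
    2018: pd.DataFrame({"CNT": ["ALB"], "BOOKID": [1.0]}),
}

frames = []
for year, year_df in frames_by_year.items():
    # assign returns a new frame with the extra 'year' column.
    frames.append(year_df.assign(year=year))

# A single concat aligns columns by name across all frames.
unified = pd.concat(frames, ignore_index=True, sort=False)
print(unified)
```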

In [None]:
df, new_indicators_df = create_unified_df([2022, 2018, 2015], path)
df.head()

Processing year 2022...
  DataFrame now has 21629 rows, 432 columns
Processing year 2018...
  Found 97 new columns: ['SC154Q05WA', 'SC154Q04WA', 'PROAT5AB', 'SC150Q04IA', 'SC166Q03HA', 'SC004Q04NA', 'SC159Q01HA', 'SC156Q04HA', 'Region', 'SC052Q02NA', 'SC154Q07WA', 'SC155Q05HA', 'SC156Q01HA', 'SC167Q01HA', 'SC154Q02WA', 'SC155Q04HA', 'SC012Q07TA', 'BOOKID', 'SC167Q03HA', 'SC018Q05NA02', 'SC155Q03HA', 'SC161Q01SA', 'SC156Q06HA', 'SC150Q03IA', 'SC165Q03HA', 'SC154Q06WA', 'SC160Q01WA', 'SC052Q01NA', 'SC167Q04HA', 'SC158Q07HA', 'SCMCEG', 'SC048Q03NA', 'PROAT6', 'SC165Q07HA', 'SC165Q05HA', 'SC150Q01IA', 'SC155Q02HA', 'SC166Q02HA', 'SC036Q01TA', 'SC165Q09HA', 'SC158Q01HA', 'PROAT5AM', 'SC165Q04HA', 'SC166Q06HA', 'M', 'SC164Q01HA', 'SC018Q05NA01', 'SC156Q02HA', 'SC158Q09HA', 'SC158Q04HA', 'SC161Q04SA', 'SC053Q13IA', 'SC165Q06HA', 'SC158Q08HA', 'SC048Q01NA', 'SC150Q02IA', 'SC167Q06HA', 'SC053Q14IA', 'SC154Q10WA', 'SC156Q07HA', 'SC165Q02HA', 'SC154Q11HA', 'SC167Q05HA', 'SC036Q02TA', 'SC154Q03WA', 'SC154Q08WA', 'SC158Q12HA', 'SC158Q02HA', 'SC053Q15IA', 'SC036Q03NA', 'SC053Q12IA', 'SC156

  Found 122 new columns: ['PROSTCE', 'SC010Q07TE', 'SC041Q01NA', 'SC009Q02TA', 'SC010Q05TD', 'SC059Q01NA', 'SC010Q03TE', 'SC040Q17NA', 'SC010Q11TA', 'SC010Q02TB', 'SC010Q09TE', 'SC010Q07TC', 'SC014Q01NA', 'SC019Q01NA01', 'SC041Q06NA', 'SC010Q12TA', 'SC010Q02TA', 'SC041Q03NA', 'SC009Q01TA', 'SC009Q08TA', 'SC010Q12TB', 'SC019Q02NA01', 'SC040Q15NA', 'LEADPD', 'SC010Q05TC', 'SC010Q04TB', 'SC010Q05TA', 'SCHAUT', 'SC059Q03NA', 'LEADINST', 'SC009Q11TA', 'SC010Q06TC', 'SC040Q03NA', 'SC063Q09NA', 'SC019Q01NA02', 'SC009Q13TA', 'PROSTMAS', 'SC010Q01TB', 'SC010Q10TD', 'SC010Q02TC', 'SC010Q02TE', 'SC009Q04TA', 'SC009Q07TA', 'SC059Q08NA', 'SC010Q06TB', 'SC010Q10TE', 'SC010Q11TE', 'SC019Q03NA01', 'SC010Q11TD', 'SC059Q02NA', 'SC010Q05TB', 'SC059Q06NA', 'SC010Q03TB', 'SC009Q06TA', 'SC040Q12NA', 'SC009Q05TA', 'SC010Q03TC', 'SC063Q06NA', 'SC010Q12TE', 'SC063Q04NA', 'SC010Q09TA', 'SC059Q07NA', 'SC010Q07TD', 'SC010Q09TB', 'SC010Q07TB', 'SC041Q05NA', 'SC019Q03NA02', 'SC010Q04TA', 'RESPCUR', 'TOTST', 'SC059Q

Unnamed: 0,CNT,CNTRYID,CNTSCHID,CYC,NatCen,STRATUM,SUBNATIO,REGION,OECD,ADMINMODE,LANGTEST_QQQ,SC001Q01TA,SC013Q01TA,SC014Q01TA,SC016Q01TA,SC016Q02TA,SC016Q03TA,SC016Q04TA,SC011Q01TA,SC002Q01TA,SC002Q02TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC018Q01TA01,SC018Q01TA02,SC018Q02TA01,SC018Q02TA02,SC018Q08JA01,SC018Q08JA02,SC018Q09JA01,SC018Q09JA02,SC018Q10JA01,SC018Q10JA02,SC182Q01WA01,SC182Q01WA02,SC182Q06WA01,SC182Q06WA02,SC182Q07JA01,SC182Q07JA02,SC182Q08JA01,SC182Q08JA02,SC182Q09JA01,SC182Q09JA02,SC182Q10JA01,SC182Q10JA02,SC168Q01JA,SC168Q02JA,SC168Q03JA,SC168Q04JA,SC012Q01TA,SC012Q02TA,SC012Q03TA,SC012Q04TA,SC012Q05TA,SC012Q06TA,SC012Q08JA,SC012Q10JA,SC012Q11JA,SC012Q12JA,SC185Q01WA,SC185Q02WA,SC185Q03WA,SC185Q04WA,SC185Q05WA,SC202Q01JA,SC202Q02JA,SC202Q03JA,SC202Q04JA,SC202Q05JA,SC202Q06JA,SC202Q07JA,SC202Q08JA,SC202Q09JA,SC202Q10JA,SC202Q11JA,SC202Q12JA,SC201Q01JA,SC201Q03JA,SC201Q04JA,SC201Q05JA,SC201Q06JA,SC201Q07JA,SC201Q11JA,SC004Q01TA,SC004Q02TA,SC004Q03TA,SC004Q08JA,SC004Q05NA,SC004Q06NA,SC004Q07NA,SC190Q01JA,SC190Q02JA,SC190Q05JA,SC190Q06JA,SC190Q07JA,SC190Q08JA,SC190Q09JA,SC190Q10JA,SC190Q11JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q08TA,SC037Q09TA,SC037Q11JA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,SC193Q01WA,SC193Q02WA,SC193Q03WA,SC193Q04WA,SC193Q05WA,SC193Q06WA,SC193Q07WA,SC025Q01NA,SC025Q02NA,SC027Q02NA,SC027Q03NA,SC027Q04NA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC184Q01JA,SC184Q02JA,SC184Q03JA,SC184Q04JA,SC184Q05JA,SC184Q06JA,SC184Q07JA,SC017Q01NA,SC017Q02NA,SC017Q03NA,SC017Q04NA,SC017Q05NA,SC017Q06NA,SC017Q07NA,SC017Q08NA,SC017Q09JA,SC017Q10JA,SC061Q01TA,SC061Q02TA,SC061Q03TA,SC061Q04TA,SC061Q05TA,SC061Q11HA,SC061Q06TA,SC061Q07TA,SC061Q08TA,SC061Q09TA,SC061Q10TA,SC172Q02JA,SC172Q03JA,SC172Q04JA,SC172Q05JA,SC172Q06JA,SC172Q07JA,SC173Q01JA,SC173Q02JA,SC173Q03JA,SC173Q04JA,SC173Q05JA,SC173Q06JA,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC
064Q04NA,SC064Q03TA,SC064Q07WA,SC192Q01JA,SC192Q02JA,SC192Q03JA,SC192Q04JA,SC192Q05JA,SC192Q06JA,SC175Q01JA,SC175Q02JA,SC176Q01JA,SC003Q01TA,SC174Q01JA,SC053Q01TA,SC053Q02TA,SC053Q03TA,SC053Q04TA,SC053Q05NA,SC053Q06NA,SC053Q07TA,SC053Q08TA,SC053Q09TA,SC053Q10TA,SC053D11TA,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC035Q01NA,SC035Q01NB,SC035Q02TA,SC035Q02TB,SC035Q03TA,SC035Q03TB,SC035Q04TA,SC035Q04TB,SC035Q05TA,SC035Q05TB,SC035Q06TA,SC035Q06TB,SC035Q07TA,SC035Q07TB,SC035Q08TA,SC035Q08TB,SC035Q09NA,SC035Q09NB,SC035Q10TA,SC035Q10TB,SC035Q11NA,SC035Q11NB,SC042Q01TA,SC042Q02TA,SC187Q01WA,SC187Q02WA,SC187Q03WA,SC187Q04WA,SC177Q01JA,SC177Q02JA,SC177Q03JA,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC181Q01JA,SC181Q02JA,SC181Q03JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SC189Q01JA,SC189Q05JA,SC189Q06JA,SC169Q01JA,SC210Q01JA,SC170Q01JA,SC171Q01JA,SC171Q02JA,SC171Q03JA,SC171Q04JA,SC204Q01JA,SC204Q02JA,SC204Q05JA,SC204Q06JA,SC205Q01JA,SC205Q02JA,SC205Q03JA,SC205Q05JA,SC205Q06JA,SC205Q07JA,SC207Q01JA,SC207Q02JA,SC207Q03JA,SC207Q04JA,SC207Q05JA,SC207Q06JA,SC207Q07JA,SC207Q08JA,SC208Q01JA,SC208Q02JA,SC208Q03JA,SC208Q04JA,SC208Q05JA,SC208Q06JA,SC208Q07JA,SC208Q08JA,SC208Q09JA,SC213Q01JA,SC213Q02JA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,SC215Q02JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q01JA,SC216Q02JA,SC216Q03JA,SC216Q04JA,SC216Q05JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SC222Q01JA,SC222Q02JA,SC222Q03JA,SC222Q04JA,SC222Q05JA,SC223Q01JA,SC223Q02JA,SC223Q03JA,SC223Q04JA,SC223Q05JA,SC223Q06JA,SC223Q07JA,SC223Q08JA,SC223Q09JA,SC223Q10JA,SC155Q06HA,SC155Q07HA,
SC155Q08HA,SC155Q09HA,SC155Q10HA,SC155Q11HA,SC224Q01JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,PRIVATESCH,SCHLTYPE,SCHSIZE,TOTAT,PROATCE,PROPAT6,PROPAT7,PROPAT8,STRATIO,TOTMATH,PROPMATH,SMRATIO,TOTSTAFF,PROPSUPP,PROADMIN,PROMGMT,PROOSTAF,SCHSEL,SCHAUTO,TCHPART,SRESPCUR,SRESPRES,EDULEAD,INSTLEAD,ENCOURPG,RATCMP1,RATCMP2,RATTAB,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,MCLSIZE,CLSIZE,STDTEST,TDTEST,CREACTIV,ALLACTIV,MACTIV,MATHEXC,ABGMATH,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,SCSUPRTED,SCSUPRT,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,W_SCHGRNRABWT,W_FSTUWT_SCH_SUM,W_FSTUWT_SCH_N,SENWT,VER_DAT,year,Region,LANGTEST,SC161Q01SA,SC161Q02SA,SC161Q03SA,SC161Q04SA,SC161Q05SA,SC162Q01SA,SC155Q01HA,SC155Q02HA,SC155Q03HA,SC155Q04HA,SC155Q05HA,SC156Q01HA,SC156Q02HA,SC156Q03HA,SC156Q04HA,SC156Q05HA,SC156Q06HA,SC156Q07HA,SC156Q08HA,SC012Q07TA,SC154Q01HA,SC154Q02WA,SC154Q03WA,SC154Q04WA,SC154Q05WA,SC154Q06WA,SC154Q07WA,SC154Q08WA,SC154Q09HA,SC154Q10WA,SC154Q11HA,SC036Q01TA,SC036Q02TA,SC036Q03NA,SC037Q10NA,SC165Q01HA,SC165Q02HA,SC165Q03HA,SC165Q04HA,SC165Q05HA,SC165Q06HA,SC165Q07HA,SC165Q08HA,SC165Q09HA,SC165Q10HA,SC166Q02HA,SC166Q03HA,SC166Q05HA,SC166Q06HA,SC167Q01HA,SC167Q02HA,SC167Q03HA,SC167Q04HA,SC167Q05HA,SC167Q06HA,SC158Q01HA,SC158Q02HA,SC158Q04HA,SC158Q07HA,SC158Q08HA,SC158Q09HA,SC158Q12HA,SC048Q01NA,SC048Q02NA,SC048Q03NA,SC004Q04NA,SC018Q05NA01,SC018Q05NA02,SC018Q06NA01,SC018Q06NA02,SC018Q07NA01,SC018Q07NA02,SC159Q01HA,SC053Q12IA,SC053Q13IA,SC053Q14IA,SC053Q15IA,SC053Q16IA,SC150Q01IA,SC150Q02IA,SC150Q03IA,SC150Q04IA,SC150Q05IA,SC164Q01HA,SC152Q01HA,SC160Q01WA,SC052Q01NA,SC052Q02NA,SC052Q03HA,PROAT5AB,PROAT5AM,PROAT6,SCMCEG,M,BOOKID,SC059Q01NA,SC059Q02NA,SC059Q03NA,SC059Q04NA,SC059Q05NA,SC059Q06NA,SC059Q07NA,SC059Q08NA,SC009Q01TA,SC009Q02TA,SC009Q03TA,SC009Q04TA,SC009Q05TA,SC009Q06TA,SC009Q07TA,SC009Q08TA,SC009Q09TA,SC009Q10TA,SC009Q11TA,SC009Q12TA,SC009Q13TA,SC010Q01TA,SC010Q01TB,SC010Q01TC,SC010Q01TD,SC010Q01TE,SC010Q02TA,SC010Q02TB,SC010Q02TC,SC010Q0
2TD,SC010Q02TE,SC010Q03TA,SC010Q03TB,SC010Q03TC,SC010Q03TD,SC010Q03TE,SC010Q04TA,SC010Q04TB,SC010Q04TC,SC010Q04TD,SC010Q04TE,SC010Q05TA,SC010Q05TB,SC010Q05TC,SC010Q05TD,SC010Q05TE,SC010Q06TA,SC010Q06TB,SC010Q06TC,SC010Q06TD,SC010Q06TE,SC010Q07TA,SC010Q07TB,SC010Q07TC,SC010Q07TD,SC010Q07TE,SC010Q08TA,SC010Q08TB,SC010Q08TC,SC010Q08TD,SC010Q08TE,SC010Q09TA,SC010Q09TB,SC010Q09TC,SC010Q09TD,SC010Q09TE,SC010Q10TA,SC010Q10TB,SC010Q10TC,SC010Q10TD,SC010Q10TE,SC010Q11TA,SC010Q11TB,SC010Q11TC,SC010Q11TD,SC010Q11TE,SC010Q12TA,SC010Q12TB,SC010Q12TC,SC010Q12TD,SC010Q12TE,SC014Q01NA,SC019Q01NA01,SC019Q01NA02,SC019Q02NA01,SC019Q02NA02,SC019Q03NA01,SC019Q03NA02,SC027Q01NA,SC040Q02NA,SC040Q03NA,SC040Q05NA,SC040Q11NA,SC040Q12NA,SC040Q15NA,SC040Q16NA,SC040Q17NA,SC041Q01NA,SC041Q03NA,SC041Q04NA,SC041Q05NA,SC041Q06NA,SC063Q02NA,SC063Q03NA,SC063Q04NA,SC063Q06NA,SC063Q07NA,SC063Q09NA,LEAD,LEADCOM,LEADINST,LEADPD,LEADTCH,RESPCUR,RESPRES,SCHAUT,TEACHPART,PROSTAT,PROSTCE,PROSTMAS,TOTST,SCIERES
0,b'ALB',8.0,800001.0,b'08MS',b'000800',b'ALB05',b'0080000',800.0,0.0,2.0,140.0,3.0,1.0,4.0,100.0,0.0,0.0,0.0,3.0,303.0,349.0,,,14.0,,32.0,,38.0,1.0,38.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,5.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,4.0,1.0,3.0,5.0,2.0,3.0,1.0,2.0,1.0,2.0,3.0,3.0,3.0,1.0,2.0,1.0,2.0,1.0,2.0,4.0,1.0,6.0,6.0,6.0,6.0,6.0,6.0,1.0,2.0,6.0,6.0,3.0,4.0,4.0,4.0,3.0,4.0,3.0,179.0,28.0,28.0,21.0,3.0,4.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,3.0,3.0,4.0,3.0,100.0,100.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,3.0,2.0,4.0,4.0,4.0,4.0,3.0,3.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,4.0,5.0,5.0,27.0,64.0,36.0,72.0,21.0,36.0,13.0,2.0,3.0,3.0,2.0,3.0,3.0,45.0,45.0,5.0,5.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,b'0080002',2.0,2.0,1.0,3.0,1.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,3.0,3.0,1.0,3.0,3.0,1.0,3.0,4.0,3.0,2.0,2.0,4.0,4.0,3.0,3.0,4.0,2.0,3.0,2.0,1.0,3.0,3.0,3.0,4.0,2.0,2.0,1.0,95.0,5.0,2.0,,,,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,1.0,4.0,1.0,4.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,4.0,3.0,3.0,2.0,3.0,3.0,90.0,0.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,2.0,3.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0,3.0,1.0,1.0,1.0,1.0,4.0,3.0,3.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,b'public',3.0,652.0,38.5,1.0,0.0779,0.0779,0.0519,16.9351,5.0,0.1299,100.0,13.0,0.3077,0.0769,0.2308,0.3846,3.0,-0.4178,,0.5,0.3333,0.9403,0.9572,0.9212,0.1564,1.0,0.1173,-0.0598,0.8961,-0.0313,0.4944,-1.6916,-0.2968,1.2048,0.6058,-0.5488,33.0,33.0,0.9589,0.1717,0.0,-0.3384,1.0,,3.0,-0.3682,0.3518,-1.0019,0.5652,2.0,2.0,0.7965,-0.8314,0.8462,0.5908,1.43376,160.5808,41.0,5.76182,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,b'ALB',8.0,800002.0,b'08MS',b'000800',b'ALB07',b'0080000',800.0,0.0,2.0,140.0,1.0,1.0,4.0,90.0,5.0,5.0,0.0,2.0,88.0,95.0,,,,,20.0,,16.0,0.0,16.0,0.0,3.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,3.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,2.0,2.0,2.0,5.0,5.0,6.0,6.0,3.0,3.0,1.0,6.0,1.0,2.0,6.0,2.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,59.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,3.0,3.0,3.0,100.0,100.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,2.0,80.0,100.0,100.0,70.0,80.0,100.0,80.0,2.0,2.0,2.0,2.0,2.0,2.0,45.0,45.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,b'0080001',2.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,3.0,2.0,1.0,2.0,3.0,1.0,1.0,3.0,1.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,1.0,90.0,10.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,2.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,2.0,1.0,1.0,1.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,70.0,10.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0,3.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,2.0,2.0,2.0,2.0,3.0,1.0,1.0,1.0,b'public',3.0,183.0,16.0,1.0,0.4375,0.25,0.0,11.4375,3.0,0.1875,61.0,4.0,0.25,0.0,0.25,0.5,3.0,-0.528,,1.0,0.6,-0.1642,-0.166,-0.3835,0.0,0.0,0.0,1.3483,0.4869,1.0982,1.2535,-1.6916,-1.4551,2.9595,-0.0956,-2.0409,13.0,13.0,0.7237,1.5226,1.0,0.4074,4.0,3.0,3.0,-0.9046,2.1631,-0.6079,0.0911,2.0,1.0,-0.5687,-0.8314,0.8462,-0.3475,2.85278,133.7114,36.0,11.46442,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,b'ALB',8.0,800003.0,b'08MS',b'000800',b'ALB06',b'0080000',800.0,0.0,2.0,140.0,2.0,2.0,2.0,0.0,0.0,10.0,90.0,1.0,74.0,47.0,77.0,,1.0,,64.0,,17.0,0.0,17.0,0.0,4.0,0.0,13.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,2.0,2.0,2.0,4.0,3.0,3.0,2.0,3.0,3.0,3.0,11.0,22.0,22.0,0.0,9.0,9.0,18.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,3.0,3.0,3.0,2.0,2.0,2.0,100.0,100.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,3.0,1.0,3.0,2.0,3.0,2.0,3.0,2.0,3.0,2.0,3.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,3.0,3.0,4.0,4.0,2.0,3.0,3.0,3.0,29.0,37.0,16.0,38.0,13.0,43.0,55.0,2.0,2.0,3.0,2.0,2.0,2.0,45.0,45.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,b'0080002',2.0,1.0,1.0,3.0,4.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,3.0,3.0,1.0,3.0,3.0,1.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,2.0,1.0,3.0,4.0,4.0,5.0,2.0,1.0,1.0,90.0,10.0,2.0,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,4.0,2.0,3.0,3.0,3.0,3.0,3.0,4.0,2.0,2.0,3.0,2.0,2.0,6.0,6.0,2.0,2.0,2.0,3.0,3.0,3.0,2.0,2.0,2.0,3.0,115.0,6.0,5.0,5.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0,8.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,3.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,1.0,2.0,1.0,b'private',1.0,121.0,17.0,1.0,1.0,0.7647,0.0,7.1176,2.0,0.1176,60.5,4.0,0.0,0.25,0.75,0.0,3.0,1.4001,,5.0,7.0,-0.1686,-0.2686,-0.0248,2.0,1.0,0.0,0.0889,0.4569,0.0794,0.0836,2.0223,0.2833,0.6299,1.3399,0.2266,13.0,13.0,-0.2921,-0.9681,2.0,-0.8108,0.0,,3.0,0.1019,0.4118,-0.2771,-0.8952,1.0,1.0,0.1896,-0.8314,-0.8711,0.4409,7.40007,25.61079,3.0,29.73857,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,b'ALB',8.0,800004.0,b'08MS',b'000800',b'ALB05',b'0080000',800.0,0.0,2.0,140.0,3.0,1.0,4.0,100.0,0.0,0.0,0.0,1.0,583.0,491.0,11.0,2.0,16.0,,4.0,,63.0,1.0,60.0,0.0,2.0,0.0,13.0,0.0,0.0,0.0,8.0,0.0,8.0,0.0,8.0,0.0,8.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,4.0,7.0,3.0,3.0,3.0,1.0,3.0,2.0,3.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,6.0,6.0,3.0,3.0,2.0,6.0,1.0,2.0,6.0,6.0,5.0,4.0,3.0,3.0,2.0,2.0,2.0,136.0,25.0,16.0,0.0,0.0,2.0,16.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,2.0,2.0,2.0,2.0,3.0,3.0,100.0,100.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,4.0,4.0,3.0,3.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,80.0,85.0,85.0,85.0,30.0,100.0,,2.0,2.0,2.0,2.0,3.0,3.0,45.0,45.0,4.0,4.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,b'0080001',2.0,2.0,1.0,3.0,1.0,3.0,1.0,1.0,,,1.0,,1.0,,1.0,2.0,,1.0,,,1.0,,1.0,,1.0,2.0,,,1.0,3.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,4.0,1.0,1.0,4.0,4.0,1.0,1.0,3.0,2.0,2.0,3.0,2.0,2.0,1.0,3.0,1.0,2.0,1.0,1.0,84.0,16.0,1.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,3.0,3.0,2.0,4.0,4.0,4.0,3.0,3.0,3.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,2.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,b'public',3.0,1074.0,63.5,0.9449,0.2362,0.2047,0.0,16.9134,8.0,0.126,100.0,11.0,0.0,0.0,0.3636,0.6364,3.0,0.2405,,0.5,1.6667,0.0243,0.1402,0.4919,0.1838,0.64,0.0,1.3483,0.4795,1.0982,2.1585,-0.8905,-1.4551,1.6863,-0.7912,-0.9138,28.0,28.0,-0.1123,1.4914,0.0,-0.1653,3.0,1.0,3.0,-0.9046,1.118,-1.1844,-0.4642,,,,,,-0.7414,4.34319,192.2413,39.0,17.45392,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,b'ALB',8.0,800005.0,b'08MS',b'000800',b'ALB03',b'0080000',800.0,0.0,2.0,140.0,3.0,1.0,4.0,75.0,10.0,10.0,5.0,1.0,166.0,151.0,,1.0,1.0,1.0,3.0,,26.0,2.0,26.0,0.0,12.0,2.0,14.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,4.0,2.0,0.0,3.0,3.0,1.0,2.0,2.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,,2.0,3.0,6.0,6.0,1.0,2.0,6.0,6.0,3.0,4.0,3.0,2.0,2.0,3.0,2.0,38.0,2.0,2.0,9.0,2.0,2.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,2.0,4.0,4.0,1.0,,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,45.0,5.0,20.0,100.0,30.0,70.0,75.0,2.0,2.0,3.0,2.0,3.0,2.0,45.0,45.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,b'0080001',1.0,1.0,1.0,2.0,2.0,3.0,5.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,3.0,1.0,3.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,1.0,1.0,1.0,3.0,3.0,3.0,1.0,2.0,3.0,4.0,5.0,2.0,1.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,3.0,3.0,4.0,2.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,5.0,3.0,6.0,4.0,6.0,6.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,5.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,2.0,3.0,2.0,1.0,1.0,3.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0,2.0,2.0,1.0,1.0,1.0,3.0,3.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,b'public',3.0,317.0,27.0,0.963,,0.5556,0.037,11.7407,2.0,0.0741,100.0,8.0,0.25,0.5,0.25,0.0,3.0,-0.0472,,0.5,,-0.4247,-0.4562,0.3802,0.0526,1.0,0.2368,1.3483,1.2115,0.1756,2.1585,-0.8905,-0.2129,1.2478,-2.0719,-0.5079,18.0,18.0,0.858,-0.2225,1.0,-0.5672,3.0,3.0,1.0,0.1019,0.4118,1.0543,0.4217,2.0,2.0,0.6649,-0.8314,0.8462,-0.0102,7.5206,45.69535,5.0,30.22293,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
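
The preview shows that text columns such as `CNT`, `CYC`, and `STRATUM` arrive as byte strings (for example `b'ALB'`). A minimal sketch for decoding every such column at once follows. The helper name `decode_bytes_columns` is hypothetical and UTF-8 is an assumed encoding. `pd.read_sas` also accepts an `encoding` argument, which can decode text while reading.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df with bytes values in object columns decoded to str."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        # Decode only bytes values; leave missing values and other types alone.
        out[col] = out[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return out


# Toy frame mimicking the byte-string columns seen in the preview.
demo = pd.DataFrame({
    "CNT": [b"ALB", b"ALB"],
    "CNTRYID": [8.0, 8.0],
    "Region": [None, b"x"],
})
clean = decode_bytes_columns(demo)
print(clean)
```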


In [None]:
df['CNT'] = df['CNT'].str.decode('utf-8')  # decode the country code from bytes to str
df.head()

Unnamed: 0,CNT,CNTRYID,CNTSCHID,CYC,NatCen,STRATUM,SUBNATIO,REGION,OECD,ADMINMODE,LANGTEST_QQQ,SC001Q01TA,SC013Q01TA,SC014Q01TA,SC016Q01TA,SC016Q02TA,SC016Q03TA,SC016Q04TA,SC011Q01TA,SC002Q01TA,SC002Q02TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC018Q01TA01,SC018Q01TA02,SC018Q02TA01,SC018Q02TA02,SC018Q08JA01,SC018Q08JA02,SC018Q09JA01,SC018Q09JA02,SC018Q10JA01,SC018Q10JA02,SC182Q01WA01,SC182Q01WA02,SC182Q06WA01,SC182Q06WA02,SC182Q07JA01,SC182Q07JA02,SC182Q08JA01,SC182Q08JA02,SC182Q09JA01,SC182Q09JA02,SC182Q10JA01,SC182Q10JA02,SC168Q01JA,SC168Q02JA,SC168Q03JA,SC168Q04JA,SC012Q01TA,SC012Q02TA,SC012Q03TA,SC012Q04TA,SC012Q05TA,SC012Q06TA,SC012Q08JA,SC012Q10JA,SC012Q11JA,SC012Q12JA,SC185Q01WA,SC185Q02WA,SC185Q03WA,SC185Q04WA,SC185Q05WA,SC202Q01JA,SC202Q02JA,SC202Q03JA,SC202Q04JA,SC202Q05JA,SC202Q06JA,SC202Q07JA,SC202Q08JA,SC202Q09JA,SC202Q10JA,SC202Q11JA,SC202Q12JA,SC201Q01JA,SC201Q03JA,SC201Q04JA,SC201Q05JA,SC201Q06JA,SC201Q07JA,SC201Q11JA,SC004Q01TA,SC004Q02TA,SC004Q03TA,SC004Q08JA,SC004Q05NA,SC004Q06NA,SC004Q07NA,SC190Q01JA,SC190Q02JA,SC190Q05JA,SC190Q06JA,SC190Q07JA,SC190Q08JA,SC190Q09JA,SC190Q10JA,SC190Q11JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q08TA,SC037Q09TA,SC037Q11JA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,SC193Q01WA,SC193Q02WA,SC193Q03WA,SC193Q04WA,SC193Q05WA,SC193Q06WA,SC193Q07WA,SC025Q01NA,SC025Q02NA,SC027Q02NA,SC027Q03NA,SC027Q04NA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC184Q01JA,SC184Q02JA,SC184Q03JA,SC184Q04JA,SC184Q05JA,SC184Q06JA,SC184Q07JA,SC017Q01NA,SC017Q02NA,SC017Q03NA,SC017Q04NA,SC017Q05NA,SC017Q06NA,SC017Q07NA,SC017Q08NA,SC017Q09JA,SC017Q10JA,SC061Q01TA,SC061Q02TA,SC061Q03TA,SC061Q04TA,SC061Q05TA,SC061Q11HA,SC061Q06TA,SC061Q07TA,SC061Q08TA,SC061Q09TA,SC061Q10TA,SC172Q02JA,SC172Q03JA,SC172Q04JA,SC172Q05JA,SC172Q06JA,SC172Q07JA,SC173Q01JA,SC173Q02JA,SC173Q03JA,SC173Q04JA,SC173Q05JA,SC173Q06JA,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC
064Q04NA,SC064Q03TA,SC064Q07WA,SC192Q01JA,SC192Q02JA,SC192Q03JA,SC192Q04JA,SC192Q05JA,SC192Q06JA,SC175Q01JA,SC175Q02JA,SC176Q01JA,SC003Q01TA,SC174Q01JA,SC053Q01TA,SC053Q02TA,SC053Q03TA,SC053Q04TA,SC053Q05NA,SC053Q06NA,SC053Q07TA,SC053Q08TA,SC053Q09TA,SC053Q10TA,SC053D11TA,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC035Q01NA,SC035Q01NB,SC035Q02TA,SC035Q02TB,SC035Q03TA,SC035Q03TB,SC035Q04TA,SC035Q04TB,SC035Q05TA,SC035Q05TB,SC035Q06TA,SC035Q06TB,SC035Q07TA,SC035Q07TB,SC035Q08TA,SC035Q08TB,SC035Q09NA,SC035Q09NB,SC035Q10TA,SC035Q10TB,SC035Q11NA,SC035Q11NB,SC042Q01TA,SC042Q02TA,SC187Q01WA,SC187Q02WA,SC187Q03WA,SC187Q04WA,SC177Q01JA,SC177Q02JA,SC177Q03JA,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC181Q01JA,SC181Q02JA,SC181Q03JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SC189Q01JA,SC189Q05JA,SC189Q06JA,SC169Q01JA,SC210Q01JA,SC170Q01JA,SC171Q01JA,SC171Q02JA,SC171Q03JA,SC171Q04JA,SC204Q01JA,SC204Q02JA,SC204Q05JA,SC204Q06JA,SC205Q01JA,SC205Q02JA,SC205Q03JA,SC205Q05JA,SC205Q06JA,SC205Q07JA,SC207Q01JA,SC207Q02JA,SC207Q03JA,SC207Q04JA,SC207Q05JA,SC207Q06JA,SC207Q07JA,SC207Q08JA,SC208Q01JA,SC208Q02JA,SC208Q03JA,SC208Q04JA,SC208Q05JA,SC208Q06JA,SC208Q07JA,SC208Q08JA,SC208Q09JA,SC213Q01JA,SC213Q02JA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,SC215Q02JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q01JA,SC216Q02JA,SC216Q03JA,SC216Q04JA,SC216Q05JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SC222Q01JA,SC222Q02JA,SC222Q03JA,SC222Q04JA,SC222Q05JA,SC223Q01JA,SC223Q02JA,SC223Q03JA,SC223Q04JA,SC223Q05JA,SC223Q06JA,SC223Q07JA,SC223Q08JA,SC223Q09JA,SC223Q10JA,SC155Q06HA,SC155Q07HA,
SC155Q08HA,SC155Q09HA,SC155Q10HA,SC155Q11HA,SC224Q01JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,PRIVATESCH,SCHLTYPE,SCHSIZE,TOTAT,PROATCE,PROPAT6,PROPAT7,PROPAT8,STRATIO,TOTMATH,PROPMATH,SMRATIO,TOTSTAFF,PROPSUPP,PROADMIN,PROMGMT,PROOSTAF,SCHSEL,SCHAUTO,TCHPART,SRESPCUR,SRESPRES,EDULEAD,INSTLEAD,ENCOURPG,RATCMP1,RATCMP2,RATTAB,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,MCLSIZE,CLSIZE,STDTEST,TDTEST,CREACTIV,ALLACTIV,MACTIV,MATHEXC,ABGMATH,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,SCSUPRTED,SCSUPRT,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,W_SCHGRNRABWT,W_FSTUWT_SCH_SUM,W_FSTUWT_SCH_N,SENWT,VER_DAT,year,Region,LANGTEST,SC161Q01SA,SC161Q02SA,SC161Q03SA,SC161Q04SA,SC161Q05SA,SC162Q01SA,SC155Q01HA,SC155Q02HA,SC155Q03HA,SC155Q04HA,SC155Q05HA,SC156Q01HA,SC156Q02HA,SC156Q03HA,SC156Q04HA,SC156Q05HA,SC156Q06HA,SC156Q07HA,SC156Q08HA,SC012Q07TA,SC154Q01HA,SC154Q02WA,SC154Q03WA,SC154Q04WA,SC154Q05WA,SC154Q06WA,SC154Q07WA,SC154Q08WA,SC154Q09HA,SC154Q10WA,SC154Q11HA,SC036Q01TA,SC036Q02TA,SC036Q03NA,SC037Q10NA,SC165Q01HA,SC165Q02HA,SC165Q03HA,SC165Q04HA,SC165Q05HA,SC165Q06HA,SC165Q07HA,SC165Q08HA,SC165Q09HA,SC165Q10HA,SC166Q02HA,SC166Q03HA,SC166Q05HA,SC166Q06HA,SC167Q01HA,SC167Q02HA,SC167Q03HA,SC167Q04HA,SC167Q05HA,SC167Q06HA,SC158Q01HA,SC158Q02HA,SC158Q04HA,SC158Q07HA,SC158Q08HA,SC158Q09HA,SC158Q12HA,SC048Q01NA,SC048Q02NA,SC048Q03NA,SC004Q04NA,SC018Q05NA01,SC018Q05NA02,SC018Q06NA01,SC018Q06NA02,SC018Q07NA01,SC018Q07NA02,SC159Q01HA,SC053Q12IA,SC053Q13IA,SC053Q14IA,SC053Q15IA,SC053Q16IA,SC150Q01IA,SC150Q02IA,SC150Q03IA,SC150Q04IA,SC150Q05IA,SC164Q01HA,SC152Q01HA,SC160Q01WA,SC052Q01NA,SC052Q02NA,SC052Q03HA,PROAT5AB,PROAT5AM,PROAT6,SCMCEG,M,BOOKID,SC059Q01NA,SC059Q02NA,SC059Q03NA,SC059Q04NA,SC059Q05NA,SC059Q06NA,SC059Q07NA,SC059Q08NA,SC009Q01TA,SC009Q02TA,SC009Q03TA,SC009Q04TA,SC009Q05TA,SC009Q06TA,SC009Q07TA,SC009Q08TA,SC009Q09TA,SC009Q10TA,SC009Q11TA,SC009Q12TA,SC009Q13TA,SC010Q01TA,SC010Q01TB,SC010Q01TC,SC010Q01TD,SC010Q01TE,SC010Q02TA,SC010Q02TB,SC010Q02TC,SC010Q0
2TD,SC010Q02TE,SC010Q03TA,SC010Q03TB,SC010Q03TC,SC010Q03TD,SC010Q03TE,SC010Q04TA,SC010Q04TB,SC010Q04TC,SC010Q04TD,SC010Q04TE,SC010Q05TA,SC010Q05TB,SC010Q05TC,SC010Q05TD,SC010Q05TE,SC010Q06TA,SC010Q06TB,SC010Q06TC,SC010Q06TD,SC010Q06TE,SC010Q07TA,SC010Q07TB,SC010Q07TC,SC010Q07TD,SC010Q07TE,SC010Q08TA,SC010Q08TB,SC010Q08TC,SC010Q08TD,SC010Q08TE,SC010Q09TA,SC010Q09TB,SC010Q09TC,SC010Q09TD,SC010Q09TE,SC010Q10TA,SC010Q10TB,SC010Q10TC,SC010Q10TD,SC010Q10TE,SC010Q11TA,SC010Q11TB,SC010Q11TC,SC010Q11TD,SC010Q11TE,SC010Q12TA,SC010Q12TB,SC010Q12TC,SC010Q12TD,SC010Q12TE,SC014Q01NA,SC019Q01NA01,SC019Q01NA02,SC019Q02NA01,SC019Q02NA02,SC019Q03NA01,SC019Q03NA02,SC027Q01NA,SC040Q02NA,SC040Q03NA,SC040Q05NA,SC040Q11NA,SC040Q12NA,SC040Q15NA,SC040Q16NA,SC040Q17NA,SC041Q01NA,SC041Q03NA,SC041Q04NA,SC041Q05NA,SC041Q06NA,SC063Q02NA,SC063Q03NA,SC063Q04NA,SC063Q06NA,SC063Q07NA,SC063Q09NA,LEAD,LEADCOM,LEADINST,LEADPD,LEADTCH,RESPCUR,RESPRES,SCHAUT,TEACHPART,PROSTAT,PROSTCE,PROSTMAS,TOTST,SCIERES
0,ALB,8.0,800001.0,b'08MS',b'000800',b'ALB05',b'0080000',800.0,0.0,2.0,140.0,3.0,1.0,4.0,100.0,0.0,0.0,0.0,3.0,303.0,349.0,,,14.0,,32.0,,38.0,1.0,38.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,5.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,4.0,1.0,3.0,5.0,2.0,3.0,1.0,2.0,1.0,2.0,3.0,3.0,3.0,1.0,2.0,1.0,2.0,1.0,2.0,4.0,1.0,6.0,6.0,6.0,6.0,6.0,6.0,1.0,2.0,6.0,6.0,3.0,4.0,4.0,4.0,3.0,4.0,3.0,179.0,28.0,28.0,21.0,3.0,4.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,3.0,3.0,4.0,3.0,100.0,100.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,3.0,2.0,4.0,4.0,4.0,4.0,3.0,3.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,4.0,5.0,5.0,27.0,64.0,36.0,72.0,21.0,36.0,13.0,2.0,3.0,3.0,2.0,3.0,3.0,45.0,45.0,5.0,5.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,b'0080002',2.0,2.0,1.0,3.0,1.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,3.0,3.0,1.0,3.0,3.0,1.0,3.0,4.0,3.0,2.0,2.0,4.0,4.0,3.0,3.0,4.0,2.0,3.0,2.0,1.0,3.0,3.0,3.0,4.0,2.0,2.0,1.0,95.0,5.0,2.0,,,,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,1.0,4.0,1.0,4.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,4.0,3.0,3.0,2.0,3.0,3.0,90.0,0.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,2.0,3.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0,3.0,1.0,1.0,1.0,1.0,4.0,3.0,3.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,b'public',3.0,652.0,38.5,1.0,0.0779,0.0779,0.0519,16.9351,5.0,0.1299,100.0,13.0,0.3077,0.0769,0.2308,0.3846,3.0,-0.4178,,0.5,0.3333,0.9403,0.9572,0.9212,0.1564,1.0,0.1173,-0.0598,0.8961,-0.0313,0.4944,-1.6916,-0.2968,1.2048,0.6058,-0.5488,33.0,33.0,0.9589,0.1717,0.0,-0.3384,1.0,,3.0,-0.3682,0.3518,-1.0019,0.5652,2.0,2.0,0.7965,-0.8314,0.8462,0.5908,1.43376,160.5808,41.0,5.76182,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,ALB,8.0,800002.0,b'08MS',b'000800',b'ALB07',b'0080000',800.0,0.0,2.0,140.0,1.0,1.0,4.0,90.0,5.0,5.0,0.0,2.0,88.0,95.0,,,,,20.0,,16.0,0.0,16.0,0.0,3.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,3.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,2.0,2.0,2.0,5.0,5.0,6.0,6.0,3.0,3.0,1.0,6.0,1.0,2.0,6.0,2.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,59.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,3.0,3.0,3.0,100.0,100.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,2.0,80.0,100.0,100.0,70.0,80.0,100.0,80.0,2.0,2.0,2.0,2.0,2.0,2.0,45.0,45.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,b'0080001',2.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,3.0,2.0,1.0,2.0,3.0,1.0,1.0,3.0,1.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,1.0,90.0,10.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,2.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,2.0,1.0,1.0,1.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,70.0,10.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0,3.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,2.0,2.0,2.0,2.0,3.0,1.0,1.0,1.0,b'public',3.0,183.0,16.0,1.0,0.4375,0.25,0.0,11.4375,3.0,0.1875,61.0,4.0,0.25,0.0,0.25,0.5,3.0,-0.528,,1.0,0.6,-0.1642,-0.166,-0.3835,0.0,0.0,0.0,1.3483,0.4869,1.0982,1.2535,-1.6916,-1.4551,2.9595,-0.0956,-2.0409,13.0,13.0,0.7237,1.5226,1.0,0.4074,4.0,3.0,3.0,-0.9046,2.1631,-0.6079,0.0911,2.0,1.0,-0.5687,-0.8314,0.8462,-0.3475,2.85278,133.7114,36.0,11.46442,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,ALB,8.0,800003.0,b'08MS',b'000800',b'ALB06',b'0080000',800.0,0.0,2.0,140.0,2.0,2.0,2.0,0.0,0.0,10.0,90.0,1.0,74.0,47.0,77.0,,1.0,,64.0,,17.0,0.0,17.0,0.0,4.0,0.0,13.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,2.0,2.0,2.0,4.0,3.0,3.0,2.0,3.0,3.0,3.0,11.0,22.0,22.0,0.0,9.0,9.0,18.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,3.0,3.0,3.0,2.0,2.0,2.0,100.0,100.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,3.0,1.0,3.0,2.0,3.0,2.0,3.0,2.0,3.0,2.0,3.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,3.0,3.0,4.0,4.0,2.0,3.0,3.0,3.0,29.0,37.0,16.0,38.0,13.0,43.0,55.0,2.0,2.0,3.0,2.0,2.0,2.0,45.0,45.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,b'0080002',2.0,1.0,1.0,3.0,4.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,3.0,3.0,1.0,3.0,3.0,1.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,2.0,1.0,3.0,4.0,4.0,5.0,2.0,1.0,1.0,90.0,10.0,2.0,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,4.0,2.0,3.0,3.0,3.0,3.0,3.0,4.0,2.0,2.0,3.0,2.0,2.0,6.0,6.0,2.0,2.0,2.0,3.0,3.0,3.0,2.0,2.0,2.0,3.0,115.0,6.0,5.0,5.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0,8.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,3.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,1.0,2.0,1.0,b'private',1.0,121.0,17.0,1.0,1.0,0.7647,0.0,7.1176,2.0,0.1176,60.5,4.0,0.0,0.25,0.75,0.0,3.0,1.4001,,5.0,7.0,-0.1686,-0.2686,-0.0248,2.0,1.0,0.0,0.0889,0.4569,0.0794,0.0836,2.0223,0.2833,0.6299,1.3399,0.2266,13.0,13.0,-0.2921,-0.9681,2.0,-0.8108,0.0,,3.0,0.1019,0.4118,-0.2771,-0.8952,1.0,1.0,0.1896,-0.8314,-0.8711,0.4409,7.40007,25.61079,3.0,29.73857,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,ALB,8.0,800004.0,b'08MS',b'000800',b'ALB05',b'0080000',800.0,0.0,2.0,140.0,3.0,1.0,4.0,100.0,0.0,0.0,0.0,1.0,583.0,491.0,11.0,2.0,16.0,,4.0,,63.0,1.0,60.0,0.0,2.0,0.0,13.0,0.0,0.0,0.0,8.0,0.0,8.0,0.0,8.0,0.0,8.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,4.0,7.0,3.0,3.0,3.0,1.0,3.0,2.0,3.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,6.0,6.0,3.0,3.0,2.0,6.0,1.0,2.0,6.0,6.0,5.0,4.0,3.0,3.0,2.0,2.0,2.0,136.0,25.0,16.0,0.0,0.0,2.0,16.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,2.0,2.0,2.0,2.0,3.0,3.0,100.0,100.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,4.0,4.0,3.0,3.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,80.0,85.0,85.0,85.0,30.0,100.0,,2.0,2.0,2.0,2.0,3.0,3.0,45.0,45.0,4.0,4.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,b'0080001',2.0,2.0,1.0,3.0,1.0,3.0,1.0,1.0,,,1.0,,1.0,,1.0,2.0,,1.0,,,1.0,,1.0,,1.0,2.0,,,1.0,3.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,4.0,1.0,1.0,4.0,4.0,1.0,1.0,3.0,2.0,2.0,3.0,2.0,2.0,1.0,3.0,1.0,2.0,1.0,1.0,84.0,16.0,1.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,3.0,3.0,2.0,4.0,4.0,4.0,3.0,3.0,3.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,2.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,b'public',3.0,1074.0,63.5,0.9449,0.2362,0.2047,0.0,16.9134,8.0,0.126,100.0,11.0,0.0,0.0,0.3636,0.6364,3.0,0.2405,,0.5,1.6667,0.0243,0.1402,0.4919,0.1838,0.64,0.0,1.3483,0.4795,1.0982,2.1585,-0.8905,-1.4551,1.6863,-0.7912,-0.9138,28.0,28.0,-0.1123,1.4914,0.0,-0.1653,3.0,1.0,3.0,-0.9046,1.118,-1.1844,-0.4642,,,,,,-0.7414,4.34319,192.2413,39.0,17.45392,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,ALB,8.0,800005.0,b'08MS',b'000800',b'ALB03',b'0080000',800.0,0.0,2.0,140.0,3.0,1.0,4.0,75.0,10.0,10.0,5.0,1.0,166.0,151.0,,1.0,1.0,1.0,3.0,,26.0,2.0,26.0,0.0,12.0,2.0,14.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,4.0,2.0,0.0,3.0,3.0,1.0,2.0,2.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,,2.0,3.0,6.0,6.0,1.0,2.0,6.0,6.0,3.0,4.0,3.0,2.0,2.0,3.0,2.0,38.0,2.0,2.0,9.0,2.0,2.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,2.0,4.0,4.0,1.0,,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,45.0,5.0,20.0,100.0,30.0,70.0,75.0,2.0,2.0,3.0,2.0,3.0,2.0,45.0,45.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,b'0080001',1.0,1.0,1.0,2.0,2.0,3.0,5.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,3.0,1.0,3.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,1.0,1.0,1.0,3.0,3.0,3.0,1.0,2.0,3.0,4.0,5.0,2.0,1.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,3.0,3.0,4.0,2.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,5.0,3.0,6.0,4.0,6.0,6.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,5.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,2.0,3.0,2.0,1.0,1.0,3.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0,2.0,2.0,1.0,1.0,1.0,3.0,3.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,1.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,b'public',3.0,317.0,27.0,0.963,,0.5556,0.037,11.7407,2.0,0.0741,100.0,8.0,0.25,0.5,0.25,0.0,3.0,-0.0472,,0.5,,-0.4247,-0.4562,0.3802,0.0526,1.0,0.2368,1.3483,1.2115,0.1756,2.1585,-0.8905,-0.2129,1.2478,-2.0719,-0.5079,18.0,18.0,0.858,-0.2225,1.0,-0.5672,3.0,3.0,1.0,0.1019,0.4118,1.0543,0.4217,2.0,2.0,0.6649,-0.8314,0.8462,-0.0102,7.5206,45.69535,5.0,30.22293,b'03MAY23:10:11:34',2022,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
new_indicators_df = new_indicators_df.sort_values(['column', 'year'])
new_indicators_df

Unnamed: 0,year,column
17,2018,BOOKID
88,2018,LANGTEST
194,2015,LEAD
196,2015,LEADCOM
126,2015,LEADINST
120,2015,LEADPD
168,2015,LEADTCH
44,2018,M
2,2018,PROAT5AB
41,2018,PROAT5AM


## Adding educational outcome indicators

We will calculate three indicators:
* Math, based on PV[1-10]MATH in the Student questionnaire.
* Reading, based on PV[1-10]READ in the Student questionnaire.
* Science, based on PV[1-10]SCIE in the Student questionnaire.

We will use student questionnaires for:
* 2022 - https://webfs.oecd.org/pisa2022/STU_QQQ_SAS.zip
* 2018 - https://webfs.oecd.org/pisa2018/SAS_STU_QQQ.zip
* 2015 - https://webfs.oecd.org/pisa/PUF_SAS_COMBINED_CMB_STU_QQQ.zip

In [None]:
# Define plausible value columns
pv_math = [f'PV{i}MATH' for i in range(1, 11)]
pv_read = [f'PV{i}READ' for i in range(1, 11)]
pv_scie = [f'PV{i}SCIE' for i in range(1, 11)]

In [None]:
def aggregate_to_school_level(year_stu_df, n, pv_math, pv_read, pv_scie):
    """
    Aggregate student-level PISA data to school level for schools with at least n students.

    Parameters:
    year_stu_df: DataFrame with student-level data (must include a 'year' column)
    n: minimum number of students a school must have to be included
    pv_math, pv_read, pv_scie: lists of plausible value column names

    Returns:
    DataFrame with school-level data: CNT, CNTRYID, CNTSCHID, year, math, read, sci
    """

    # Calculate average across plausible values for each student
    year_stu_df['math'] = year_stu_df[pv_math].mean(axis=1)
    year_stu_df['read'] = year_stu_df[pv_read].mean(axis=1)
    year_stu_df['sci'] = year_stu_df[pv_scie].mean(axis=1)

    # Filter schools with at least n students
    school_counts = year_stu_df.groupby(['CNTRYID', 'CNTSCHID', 'year']).size()
    valid_schools = school_counts[school_counts >= n].index
    filtered_df = year_stu_df.set_index(['CNTRYID', 'CNTSCHID', 'year']).loc[valid_schools].reset_index()

    # Aggregate to school level
    school_df = filtered_df.groupby(['CNTRYID', 'CNTSCHID', 'year']).agg({
        'CNT': 'first',
        'math': 'mean',
        'read': 'mean',
        'sci': 'mean'
    }).round(2)

    # Reset index and reorder columns
    school_df.reset_index(inplace=True)
    school_df = school_df[['CNT', 'CNTRYID', 'CNTSCHID', 'year', 'math', 'read', 'sci']]

    print(f"School-year combinations with ≥{n} students: {len(school_df)}")

    return school_df
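
To see what this kind of aggregation does, here is a minimal, self-contained sketch on synthetic data. It is not the notebook's function: `toy_school_means`, the column list `toy_pv_math`, and all score values are made up for illustration, and only two plausible values are used instead of ten.

```python
import pandas as pd

# Synthetic student-level data: school 1 has 3 students, school 2 has 1.
students = pd.DataFrame({
    'CNTSCHID': [1, 1, 1, 2],
    'PV1MATH': [400.0, 420.0, 440.0, 500.0],
    'PV2MATH': [410.0, 430.0, 450.0, 520.0],
})
toy_pv_math = ['PV1MATH', 'PV2MATH']  # hypothetical: only two PVs for brevity


def toy_school_means(df, pv_cols, min_students):
    """Average PVs per student, then average students per school,
    keeping only schools with at least `min_students` students."""
    per_student = df[['CNTSCHID']].copy()
    per_student['math'] = df[pv_cols].mean(axis=1)
    grouped = per_student.groupby('CNTSCHID')['math'].agg(['size', 'mean'])
    kept = grouped[grouped['size'] >= min_students]
    return kept['mean'].round(2)


result = toy_school_means(students, toy_pv_math, min_students=3)
print(result)
```

School 2 is dropped because it has fewer students than the threshold, which is the same effect the minimum-students parameter is meant to have on the real data.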


In [None]:
columns_wanted = ['CNTRYID', 'CNT', 'CNTSCHID'] + pv_math + pv_read + pv_scie
print(columns_wanted)

['CNTRYID', 'CNT', 'CNTSCHID', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV6MATH', 'PV7MATH', 'PV8MATH', 'PV9MATH', 'PV10MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ', 'PV6READ', 'PV7READ', 'PV8READ', 'PV9READ', 'PV10READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'PV6SCIE', 'PV7SCIE', 'PV8SCIE', 'PV9SCIE', 'PV10SCIE']


For 2015, we need to use the SPSS dataset, as the outcomes data is not present in the SAS dataset:

In [None]:
import pyreadstat

In [None]:
y = 2015
file = f'{y}_stu.sav'

file_path = path + file
year_stu_df, meta = pyreadstat.read_sav(file_path, usecols=columns_wanted)

# Add year column
year_stu_df['year'] = y

print(f'\n\n{y}:')
print(list(year_stu_df.columns))

# Aggregate by school
outcomes_df_2015 = aggregate_to_school_level(year_stu_df, 10, pv_math, pv_read, pv_scie)
outcomes_df_2015.head()



2015:
['CNTRYID', 'CNT', 'CNTSCHID', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV6MATH', 'PV7MATH', 'PV8MATH', 'PV9MATH', 'PV10MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ', 'PV6READ', 'PV7READ', 'PV8READ', 'PV9READ', 'PV10READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'PV6SCIE', 'PV7SCIE', 'PV8SCIE', 'PV9SCIE', 'PV10SCIE', 'year']
School-year combinations with ≥10 students: 16121


Unnamed: 0,CNT,CNTRYID,CNTSCHID,year,math,read,sci
0,ALB,8.0,800001.0,2015,390.56,437.48,413.71
1,ALB,8.0,800004.0,2015,397.9,451.21,454.23
2,ALB,8.0,800005.0,2015,441.65,409.67,422.42
3,ALB,8.0,800007.0,2015,341.52,340.09,405.56
4,ALB,8.0,800008.0,2015,372.21,358.56,386.26


For 2018, we need to use the SPSS dataset, as the outcomes data is not present in the SAS dataset:

In [None]:
y = 2018
file = f'{y}_stu.sav'

file_path = path + file
year_stu_df, meta = pyreadstat.read_sav(file_path, usecols=columns_wanted)

# Add year column
year_stu_df['year'] = y

print(f'\n\n{y}:')
print(list(year_stu_df.columns))

# Aggregate by school
outcomes_df_2018 = aggregate_to_school_level(year_stu_df, 10, pv_math, pv_read, pv_scie)
outcomes_df_2018.head()



2018:
['CNTRYID', 'CNT', 'CNTSCHID', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV6MATH', 'PV7MATH', 'PV8MATH', 'PV9MATH', 'PV10MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ', 'PV6READ', 'PV7READ', 'PV8READ', 'PV9READ', 'PV10READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'PV6SCIE', 'PV7SCIE', 'PV8SCIE', 'PV9SCIE', 'PV10SCIE', 'year']
School-year combinations with ≥10 students: 19028


Unnamed: 0,CNT,CNTRYID,CNTSCHID,year,math,read,sci
0,ALB,8.0,800002.0,2018,435.18,370.64,405.34
1,ALB,8.0,800004.0,2018,377.19,332.91,364.65
2,ALB,8.0,800005.0,2018,413.99,366.7,402.34
3,ALB,8.0,800006.0,2018,424.12,374.66,396.51
4,ALB,8.0,800009.0,2018,493.98,459.99,452.44


For 2022, we can use a SAS dataset:

In [None]:
y = 2022
file = f'{y}_stu.sas7bdat'

file_path = path + file
year_stu_df = pd.read_sas(file_path)

# Add year column
year_stu_df['year'] = y

print(f'\n\n{y}:')
print(list(year_stu_df.columns))

# Aggregate by school
outcomes_df_2022 = aggregate_to_school_level(year_stu_df, 10, pv_math, pv_read, pv_scie)
outcomes_df_2022.head()






2022:
['CNT', 'CNTRYID', 'CNTSCHID', 'CNTSTUID', 'CYC', 'NatCen', 'STRATUM', 'SUBNATIO', 'REGION', 'OECD', 'ADMINMODE', 'LANGTEST_QQQ', 'LANGTEST_COG', 'LANGTEST_PAQ', 'Option_CT', 'Option_FL', 'Option_ICTQ', 'Option_WBQ', 'Option_PQ', 'Option_TQ', 'Option_UH', 'BOOKID', 'ST001D01T', 'ST003D02T', 'ST003D03T', 'ST004D01T', 'ST250Q01JA', 'ST250Q02JA', 'ST250Q03JA', 'ST250Q04JA', 'ST250Q05JA', 'ST250D06JA', 'ST250D07JA', 'ST251Q01JA', 'ST251Q02JA', 'ST251Q03JA', 'ST251Q04JA', 'ST251Q06JA', 'ST251Q07JA', 'ST251D08JA', 'ST251D09JA', 'ST253Q01JA', 'ST254Q01JA', 'ST254Q02JA', 'ST254Q03JA', 'ST254Q04JA', 'ST254Q05JA', 'ST254Q06JA', 'ST255Q01JA', 'ST256Q01JA', 'ST256Q02JA', 'ST256Q03JA', 'ST256Q06JA', 'ST256Q07JA', 'ST256Q08JA', 'ST256Q09JA', 'ST256Q10JA', 'ST230Q01JA', 'ST005Q01JA', 'ST006Q01JA', 'ST006Q02JA', 'ST006Q03JA', 'ST006Q04JA', 'ST006Q05JA', 'ST007Q01JA', 'ST008Q01JA', 'ST008Q02JA', 'ST008Q03JA', 'ST008Q04JA', 'ST008Q05JA', 'ST258Q01JA', 'ST259Q01JA', 'ST259Q02JA', 'ST019AQ01T', 'S



School-year combinations with ≥10 students: 19066


Unnamed: 0,CNT,CNTRYID,CNTSCHID,year,math,read,sci
0,b'ALB',8.0,800001.0,2022,392.04,373.48,413.34
1,b'ALB',8.0,800002.0,2022,362.88,352.07,381.14
2,b'ALB',8.0,800004.0,2022,326.72,336.37,334.51
3,b'ALB',8.0,800006.0,2022,421.6,404.21,418.3
4,b'ALB',8.0,800007.0,2022,358.63,373.32,374.15


In [None]:
outcomes_df_2022['CNT'] = outcomes_df_2022['CNT'].str.decode('utf-8')
outcomes_df_2022.head()

Unnamed: 0,CNT,CNTRYID,CNTSCHID,year,math,read,sci
0,ALB,8.0,800001.0,2022,392.04,373.48,413.34
1,ALB,8.0,800002.0,2022,362.88,352.07,381.14
2,ALB,8.0,800004.0,2022,326.72,336.37,334.51
3,ALB,8.0,800006.0,2022,421.6,404.21,418.3
4,ALB,8.0,800007.0,2022,358.63,373.32,374.15
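
`pd.read_sas` stores text as raw bytes when no `encoding` is given, which is why `CNT` appears as `b'ALB'` above. The sketch below decodes every bytes-valued column in one pass. `decode_bytes_columns` is a hypothetical helper, not part of the notebook, and it assumes the text is UTF-8 encoded.

```python
import pandas as pd


def decode_bytes_columns(df, encoding='utf-8'):
    """Return a copy of df with bytes values in object columns decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Synthetic frame mimicking what read_sas returns without an encoding
raw = pd.DataFrame({'CNT': [b'ALB', b'ALB'], 'CNTSCHID': [800001.0, 800002.0]})
clean = decode_bytes_columns(raw)
print(clean['CNT'].tolist())
```

Passing `encoding='utf-8'` directly to `pd.read_sas` is another option, again assuming the file really is UTF-8; in that case the separate decoding step is no longer needed.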


In [None]:
# add year
outcomes_df_2015['year'] = 2015
print(outcomes_df_2015.head())

outcomes_df_2018['year'] = 2018
print(outcomes_df_2018.head())

outcomes_df_2022['year'] = 2022
print(outcomes_df_2022.head())

   CNT  CNTRYID  CNTSCHID  year    math    read     sci
0  ALB      8.0  800001.0  2015  390.56  437.48  413.71
1  ALB      8.0  800004.0  2015  397.90  451.21  454.23
2  ALB      8.0  800005.0  2015  441.65  409.67  422.42
3  ALB      8.0  800007.0  2015  341.52  340.09  405.56
4  ALB      8.0  800008.0  2015  372.21  358.56  386.26
   CNT  CNTRYID  CNTSCHID  year    math    read     sci
0  ALB      8.0  800002.0  2018  435.18  370.64  405.34
1  ALB      8.0  800004.0  2018  377.19  332.91  364.65
2  ALB      8.0  800005.0  2018  413.99  366.70  402.34
3  ALB      8.0  800006.0  2018  424.12  374.66  396.51
4  ALB      8.0  800009.0  2018  493.98  459.99  452.44
   CNT  CNTRYID  CNTSCHID  year    math    read     sci
0  ALB      8.0  800001.0  2022  392.04  373.48  413.34
1  ALB      8.0  800002.0  2022  362.88  352.07  381.14
2  ALB      8.0  800004.0  2022  326.72  336.37  334.51
3  ALB      8.0  800006.0  2022  421.60  404.21  418.30
4  ALB      8.0  800007.0  2022  358.63  373.32 

Unifying the datasets into one:

In [None]:
outcomes_df = pd.concat([outcomes_df_2015, outcomes_df_2018, outcomes_df_2022],
                       ignore_index=True)

print(f"Combined dataset shape: {outcomes_df.shape}")
print(f"2015 data: {len(outcomes_df_2015):,} rows")
print(f"2018 data: {len(outcomes_df_2018):,} rows")
print(f"2022 data: {len(outcomes_df_2022):,} rows")
print(f"Total combined: {len(outcomes_df):,} rows")

outcomes_df.head()

Combined dataset shape: (54215, 7)
2015 data: 16,121 rows
2018 data: 19,028 rows
2022 data: 19,066 rows
Total combined: 54,215 rows


Unnamed: 0,CNT,CNTRYID,CNTSCHID,year,math,read,sci
0,ALB,8.0,800001.0,2015,390.56,437.48,413.71
1,ALB,8.0,800004.0,2015,397.9,451.21,454.23
2,ALB,8.0,800005.0,2015,441.65,409.67,422.42
3,ALB,8.0,800007.0,2015,341.52,340.09,405.56
4,ALB,8.0,800008.0,2015,372.21,358.56,386.26


## Creating a unified codebook

We will need to understand the indicators to choose a set of impacting factors for analysis. The indicators are explained in PISA codebooks:
- for 2022: https://webfs.oecd.org/pisa2022/CY08MSP_CODEBOOK_27thJune24.xlsx
- for 2018: https://www.oecd.org/content/dam/oecd/en/data/datasets/pisa/pisa-2018-datasets/codebook-and-compendia/PISA2018_CODEBOOK.xlsx
- for 2015: https://webfs.oecd.org/pisa/Codebook_CMB.xlsx

We will create a unified codebook for all yearly datasets.

In [None]:
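def create_field_mapping(df, codebook_path, target_sheet, year):
    """
    Match the columns of df against the codebook for the given year.

    Reads sheet target_sheet from '<codebook_path><year>_codebook.xlsx' and
    returns a DataFrame with one row per column of df (year, field_id,
    field_name, if_found_in_codebook), or None if the codebook cannot be read.
    """
    # Get field names from the DataFrame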
    df_fields = list(df.columns)
    print(f"Found {len(df_fields)} fields in DataFrame")

    # Read the codebook
    codebook_path = codebook_path + f'{year}_codebook.xlsx'
    try:
        # Read the specific sheet from codebook
        codebook_df = pd.read_excel(codebook_path, sheet_name=target_sheet)
        print(f"Successfully read codebook sheet: {target_sheet}")

        # Create mapping dictionary from codebook
        field_mapping = {}

        for _, row in codebook_df.iterrows():
            if pd.notna(row['NAME']) and isinstance(row['NAME'], str):
                field_id = row['NAME']
                field_description = row['VARIABLE'] if pd.notna(row['VARIABLE']) else 'No description available'
                field_mapping[field_id] = field_description

        print(f"Found {len(field_mapping)} field definitions in codebook")

        # Build one row per DataFrame field, flagging whether the codebook describes it
        df_codes = []
        matched_count = 0

        for field in df_fields:
            if field in field_mapping:
                df_codes.append({
                    'year': year,
                    'field_id': field,
                    'field_name': field_mapping[field],
                    'if_found_in_codebook': True
                })
                matched_count += 1
            else:
                df_codes.append({
                    'year': year,
                    'field_id': field,
                    'field_name': f'NOT FOUND: {field}',
                    'if_found_in_codebook': False
                })

        result_df = pd.DataFrame(df_codes)

        print(f"\nMatching Summary:")
        print(f"- Total fields in DataFrame: {len(df_fields)}")
        print(f"- Fields matched in codebook: {matched_count}")
        print(f"- Fields not found: {len(df_fields) - matched_count}")

        return result_df

    except Exception as e:
        print(f"Error reading codebook: {e}")
        return None
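
The same matching can be written without explicit loops. The sketch below is an alternative on synthetic data, not the notebook's implementation: the `codebook` frame, the `columns` list, and their values are made up, and it assumes each `NAME` appears only once in the codebook sheet.

```python
import pandas as pd

# Synthetic codebook and column list; the NAME/VARIABLE layout mirrors the one used above.
codebook = pd.DataFrame({
    'NAME': ['CNT', 'CLSIZE', None],
    'VARIABLE': ['Country code 3-character', 'Class Size', 'stray row'],
})
columns = ['CNT', 'CLSIZE', 'ABGMATH']

# Lookup from field id to description, skipping rows without a NAME
lookup = codebook.dropna(subset=['NAME']).set_index('NAME')['VARIABLE']

codes = pd.DataFrame({'field_id': columns})
codes['if_found_in_codebook'] = codes['field_id'].isin(lookup.index)
codes['field_name'] = codes['field_id'].map(lookup)
# Fields missing from the codebook get a placeholder description
codes['field_name'] = codes['field_name'].fillna('NOT FOUND: ' + codes['field_id'])
print(codes)
```

If a codebook lists the same `NAME` more than once, `Series.map` with a non-unique index raises an error, so duplicates would have to be dropped first.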

In [None]:
def process_multi_year_codes(years, df, codebook_path, target_sheet):
    """Run create_field_mapping for each year and combine the results into one sorted DataFrame."""
    all_codes = []
    for year in years:
        codes = create_field_mapping(df, codebook_path, target_sheet, year)
        if codes is not None:
            all_codes.append(codes)

    combined_df = pd.concat(all_codes, ignore_index=True) if all_codes else pd.DataFrame()
    return combined_df.sort_values(['field_id', 'field_name', 'year']) if not combined_df.empty else combined_df

In [None]:
codes_df = process_multi_year_codes(years, df, path, 'CY08MSP_SCH_QQQ')
codes_df.head(150)

Found 651 fields in DataFrame
Successfully read codebook sheet: CY08MSP_SCH_QQQ
Found 273 field definitions in codebook

Matching Summary:
- Total fields in DataFrame: 651
- Fields matched in codebook: 273
- Fields not found: 378
Found 651 fields in DataFrame




Successfully read codebook sheet: CY08MSP_SCH_QQQ
Found 196 field definitions in codebook

Matching Summary:
- Total fields in DataFrame: 651
- Fields matched in codebook: 196
- Fields not found: 455
Found 651 fields in DataFrame
Successfully read codebook sheet: CY08MSP_SCH_QQQ
Found 431 field definitions in codebook

Matching Summary:
- Total fields in DataFrame: 651
- Fields matched in codebook: 431
- Fields not found: 220


Unnamed: 0,year,field_id,field_name,if_found_in_codebook
1717,2022,ABGMATH,Ability grouping for mathematics classes,True
415,2015,ABGMATH,NOT FOUND: ABGMATH,False
1066,2018,ABGMATH,NOT FOUND: ABGMATH,False
1720,2022,ACTCRESC,Creative school activities offered (WLE),True
418,2015,ACTCRESC,NOT FOUND: ACTCRESC,False
1069,2018,ACTCRESC,NOT FOUND: ACTCRESC,False
9,2015,ADMINMODE,Mode of Respondent,True
660,2018,ADMINMODE,Mode of Respondent,True
1311,2022,ADMINMODE,Mode of Respondent,True
1714,2022,ALLACTIV,Extra-curricular activities offered (all) (WLE),True


Now, let's find the indicators common to all the years we are interested in, i.e., 2015, 2018, and 2022:

In [None]:
found_fields = codes_df[codes_df['if_found_in_codebook']]['field_id'].value_counts()
all_years_fields = found_fields[found_fields == len(years)].index

# Filter
matched_all_years_df = codes_df[codes_df['field_id'].isin(all_years_fields)]
print(len(matched_all_years_df))
matched_all_years_df

273


Unnamed: 0,year,field_id,field_name,if_found_in_codebook
9,2015,ADMINMODE,Mode of Respondent,True
660,2018,ADMINMODE,Mode of Respondent,True
1311,2022,ADMINMODE,Mode of Respondent,True
408,2015,CLSIZE,Class Size,True
1059,2018,CLSIZE,Class Size,True
1710,2022,CLSIZE,Class size (test language class),True
0,2015,CNT,Country code 3-character,True
651,2018,CNT,Country code 3-character,True
1302,2022,CNT,Country code 3-character,True
1,2015,CNTRYID,Country Identifier,True


After manual inspection of the indicators common to all years, a number of indicators were selected to represent dimensions that can be affected by investment in school technology, new teachers, teacher training, etc.:

In [None]:
selected_indicators = """CNT
CNTSCHID
CYC
SC001Q01TA
CLSIZE
CREACTIV
EDUSHORT
PROATCE
RATCMP1
RATCMP2
SC002Q01TA
SC002Q02TA
SC003Q01TA
SC004Q05NA
SC004Q06NA
SC004Q07NA
SC012Q04TA
SC012Q05TA
SC012Q06TA
SC013Q01TA
SC017Q01NA
SC017Q02NA
SC017Q03NA
SC017Q04NA
SC017Q05NA
SC017Q06NA
SC017Q07NA
SC017Q08NA
SC018Q01TA01
SC018Q01TA02
SC018Q02TA01
SC018Q02TA02
SC025Q01NA
SC037Q01TA
SC037Q02TA
SC037Q03TA
SC037Q04TA
SC037Q05NA
SC037Q06NA
SC037Q07TA
SC037Q08TA
SC037Q09TA
SC042Q01TA
SC042Q02TA
SC053Q01TA
SC053Q02TA
SC053Q03TA
SC053Q04TA
SC053Q09TA
SC053Q10TA
SC061Q01TA
SC061Q02TA
SC061Q03TA
SC061Q04TA
SC061Q05TA
SC061Q06TA
SC061Q07TA
SC061Q08TA
SC061Q09TA
SC061Q10TA
SC064Q01TA
SC064Q02TA
SC064Q03TA
SC064Q04NA
SCHSIZE
STAFFSHORT
STRATIO
STUBEHA
TEACHBEHA
TOTAT"""

indicators_list = [indicator.strip() for indicator in selected_indicators.strip().split('\n') if indicator.strip()]

print("List of indicators:")
print(indicators_list)

List of indicators:
['CNT', 'CNTSCHID', 'CYC', 'SC001Q01TA', 'CLSIZE', 'CREACTIV', 'EDUSHORT', 'PROATCE', 'RATCMP1', 'RATCMP2', 'SC002Q01TA', 'SC002Q02TA', 'SC003Q01TA', 'SC004Q05NA', 'SC004Q06NA', 'SC004Q07NA', 'SC012Q04TA', 'SC012Q05TA', 'SC012Q06TA', 'SC013Q01TA', 'SC017Q01NA', 'SC017Q02NA', 'SC017Q03NA', 'SC017Q04NA', 'SC017Q05NA', 'SC017Q06NA', 'SC017Q07NA', 'SC017Q08NA', 'SC018Q01TA01', 'SC018Q01TA02', 'SC018Q02TA01', 'SC018Q02TA02', 'SC025Q01NA', 'SC037Q01TA', 'SC037Q02TA', 'SC037Q03TA', 'SC037Q04TA', 'SC037Q05NA', 'SC037Q06NA', 'SC037Q07TA', 'SC037Q08TA', 'SC037Q09TA', 'SC042Q01TA', 'SC042Q02TA', 'SC053Q01TA', 'SC053Q02TA', 'SC053Q03TA', 'SC053Q04TA', 'SC053Q09TA', 'SC053Q10TA', 'SC061Q01TA', 'SC061Q02TA', 'SC061Q03TA', 'SC061Q04TA', 'SC061Q05TA', 'SC061Q06TA', 'SC061Q07TA', 'SC061Q08TA', 'SC061Q09TA', 'SC061Q10TA', 'SC064Q01TA', 'SC064Q02TA', 'SC064Q03TA', 'SC064Q04NA', 'SCHSIZE', 'STAFFSHORT', 'STRATIO', 'STUBEHA', 'TEACHBEHA', 'TOTAT']


In [None]:
indicators_df = matched_all_years_df[
    matched_all_years_df['field_id'].isin(indicators_list)
][['field_id', 'field_name']].copy()

indicators_df = indicators_df.drop_duplicates(subset=['field_id']).reset_index(drop=True)

indicators_df = indicators_df.sort_values('field_id').reset_index(drop=True)

print(f"\nTotal indicators in list: {len(indicators_list)}")
print(f"Matched indicators found: {len(indicators_df)}")

# Check for any missing indicators
missing_indicators = set(indicators_list) - set(indicators_df['field_id'])
if missing_indicators:
    print(f"\nMissing indicators (not found in matched_all_years_df): {len(missing_indicators)}")
    for missing in sorted(missing_indicators):
        print(f"  - {missing}")
else:
    print("\nAll indicators were successfully matched!")

indicators_df


Total indicators in list: 70
Matched indicators found: 70

All indicators were successfully matched!


Unnamed: 0,field_id,field_name
0,CLSIZE,Class Size
1,CNT,Country code 3-character
2,CNTSCHID,Intl. School ID
3,CREACTIV,Creative extra-curricular activities (3 activi...
4,CYC,PISA Assessment Cycle (2 digits + 2 character ...
5,EDUSHORT,Shortage of educational material (WLE)
6,PROATCE,Index proportion of all teachers fully certified
7,RATCMP1,Availability of computers
8,RATCMP2,Computers connected to the Internet
9,SC001Q01TA,Which of the following definitions best descri...


## Removing unnecessary indicators

In [None]:
indicators_list.append('year')
indicators_list.append('CNTRYID') # append country ID

In [None]:
print(f'Original shape: {df.shape}')
df_all_indicators = df
df = df[df.columns.intersection(indicators_list)]
print(f'Shape after removing unnecessary indicators: {df.shape}')

Original shape: (61440, 651)
Shape after removing unnecessary indicators: (61440, 72)


In [None]:
df.head()

Unnamed: 0,CNT,CNTRYID,CNTSCHID,CYC,SC001Q01TA,SC013Q01TA,SC002Q01TA,SC002Q02TA,SC018Q01TA01,SC018Q01TA02,SC018Q02TA01,SC018Q02TA02,SC012Q04TA,SC012Q05TA,SC012Q06TA,SC004Q05NA,SC004Q06NA,SC004Q07NA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q08TA,SC037Q09TA,SC025Q01NA,SC017Q01NA,SC017Q02NA,SC017Q03NA,SC017Q04NA,SC017Q05NA,SC017Q06NA,SC017Q07NA,SC017Q08NA,SC061Q01TA,SC061Q02TA,SC061Q03TA,SC061Q04TA,SC061Q05TA,SC061Q06TA,SC061Q07TA,SC061Q08TA,SC061Q09TA,SC061Q10TA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC003Q01TA,SC053Q01TA,SC053Q02TA,SC053Q03TA,SC053Q04TA,SC053Q09TA,SC053Q10TA,SC042Q01TA,SC042Q02TA,SCHSIZE,TOTAT,PROATCE,STRATIO,RATCMP1,RATCMP2,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,CLSIZE,CREACTIV,year
0,ALB,8.0,800001.0,b'08MS',3.0,1.0,303.0,349.0,38.0,1.0,38.0,1.0,2.0,1.0,2.0,3.0,4.0,0.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,3.0,100.0,2.0,1.0,2.0,1.0,3.0,2.0,4.0,4.0,3.0,3.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,36.0,72.0,21.0,36.0,5.0,2.0,2.0,2.0,1.0,2.0,1.0,3.0,3.0,652.0,38.5,1.0,16.9351,0.1564,1.0,-0.2968,1.2048,0.6058,-0.5488,33.0,0.0,2022
1,ALB,8.0,800002.0,b'08MS',1.0,1.0,88.0,95.0,16.0,0.0,16.0,0.0,3.0,3.0,3.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,100.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,100.0,70.0,80.0,100.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,2.0,183.0,16.0,1.0,11.4375,0.0,0.0,-1.4551,2.9595,-0.0956,-2.0409,13.0,1.0,2022
2,ALB,8.0,800003.0,b'08MS',2.0,2.0,74.0,47.0,17.0,0.0,17.0,0.0,3.0,3.0,3.0,9.0,9.0,18.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,100.0,2.0,2.0,3.0,1.0,3.0,2.0,3.0,2.0,3.0,2.0,3.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,16.0,38.0,13.0,43.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0,3.0,3.0,121.0,17.0,1.0,7.1176,2.0,1.0,0.2833,0.6299,1.3399,0.2266,13.0,2.0,2022
3,ALB,8.0,800004.0,b'08MS',3.0,1.0,583.0,491.0,63.0,1.0,60.0,0.0,1.0,3.0,2.0,0.0,2.0,16.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,100.0,1.0,1.0,1.0,1.0,3.0,3.0,4.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,85.0,85.0,30.0,100.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,3.0,3.0,1074.0,63.5,0.9449,16.9134,0.1838,0.64,-1.4551,1.6863,-0.7912,-0.9138,28.0,0.0,2022
4,ALB,8.0,800005.0,b'08MS',3.0,1.0,166.0,151.0,26.0,2.0,26.0,0.0,2.0,2.0,3.0,2.0,2.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,3.0,1.0,1.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,20.0,100.0,30.0,70.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,3.0,1.0,317.0,27.0,0.963,11.7407,0.0526,1.0,-0.2129,1.2478,-2.0719,-0.5079,18.0,1.0,2022


The country identifiers need checking: each CNT code should map to exactly one CNTRYID, and vice versa:

In [None]:
print(len(unique_countries['CNT'].unique()))
print(len(unique_countries['CNTRYID'].unique()))

95
96


In [None]:
# Check 1: CNT values with multiple CNTRYID
print("=== CNT with multiple CNTRYID ===")
cnt_multiple_cntryid = unique_countries.groupby('CNT')['CNTRYID'].nunique()
problematic_cnt = cnt_multiple_cntryid[cnt_multiple_cntryid > 1]

if len(problematic_cnt) > 0:
    for cnt in problematic_cnt.index:
        rows = unique_countries[unique_countries['CNT'] == cnt][['CNT', 'CNTRYID']]
        print(f"CNT '{cnt}' has multiple CNTRYID:")
        print(rows)
        print()
else:
    print("No CNT has multiple CNTRYID")

print("=" * 40)

# Check 2: CNTRYID values with multiple CNT
print("=== CNTRYID with multiple CNT ===")
cntryid_multiple_cnt = unique_countries.groupby('CNTRYID')['CNT'].nunique()
problematic_cntryid = cntryid_multiple_cnt[cntryid_multiple_cnt > 1]

if len(problematic_cntryid) > 0:
    for cntryid in problematic_cntryid.index:
        rows = unique_countries[unique_countries['CNTRYID'] == cntryid][['CNT', 'CNTRYID']]
        print(f"CNTRYID '{cntryid}' has multiple CNT:")
        print(rows)
        print()
else:
    print("No CNTRYID has multiple CNT")

=== CNT with multiple CNTRYID ===
CNT 'KSV' has multiple CNTRYID:
    CNT  CNTRYID
43  KSV    383.0
44  KSV    411.0

=== CNTRYID with multiple CNT ===
No CNTRYID has multiple CNT
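
One way to resolve such a mismatch is to pick a single numeric identifier per country code and apply it everywhere. The sketch below uses synthetic data with made-up codes and IDs. Keeping the identifier from the most recent cycle is an assumption made here for illustration; the right choice for the real data depends on how the codebooks define each ID.

```python
import pandas as pd

# Synthetic data: country 'BBB' appears under two numeric IDs across cycles.
toy = pd.DataFrame({
    'CNT': ['AAA', 'AAA', 'BBB', 'BBB'],
    'CNTRYID': [1.0, 1.0, 20.0, 21.0],
    'year': [2018, 2022, 2018, 2022],
})

# Assumption: keep the ID used in the most recent cycle for each CNT.
latest_id = (
    toy.sort_values('year')
       .drop_duplicates(subset='CNT', keep='last')
       .set_index('CNT')['CNTRYID']
)
toy['CNTRYID'] = toy['CNT'].map(latest_id)

# After harmonizing, every CNT maps to exactly one CNTRYID.
ids_per_cnt = toy.groupby('CNT')['CNTRYID'].nunique()
print(ids_per_cnt)
```

Because CNTRYID is later used as a join key, any harmonization of this kind would need to be applied to both the indicators and the outcomes data.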


## Joining impacting factors and educational outcomes

In [None]:
outcomes_df.head()

Unnamed: 0,CNT,CNTRYID,CNTSCHID,year,math,read,sci
0,ALB,8.0,800001.0,2015,390.56,437.48,413.71
1,ALB,8.0,800004.0,2015,397.9,451.21,454.23
2,ALB,8.0,800005.0,2015,441.65,409.67,422.42
3,ALB,8.0,800007.0,2015,341.52,340.09,405.56
4,ALB,8.0,800008.0,2015,372.21,358.56,386.26


In [None]:
join_keys = ['CNT', 'CNTRYID', 'CNTSCHID', 'year']

In [None]:
# Check
print("Join keys present in df:")
for key in join_keys:
    present = key in df.columns
    print(f"  {key}: {'✅' if present else '❌'}")

Join keys present in df:
  CNT: ✅
  CNTRYID: ✅
  CNTSCHID: ✅
  year: ✅


In [None]:
print("\nJoin keys present in outcomes_df:")
for key in join_keys:
    present = key in outcomes_df.columns
    print(f"  {key}: {'✅' if present else '❌'}")


Join keys present in outcomes_df:
  CNT: ✅
  CNTRYID: ✅
  CNTSCHID: ✅
  year: ✅


In [None]:
# Check sample values and data types
print(f"\n=== Sample Values ===")
for key in join_keys:
    if key in df.columns and key in outcomes_df.columns:
        df_sample = df[key].dropna().head(3).tolist()
        outcomes_sample = outcomes_df[key].dropna().head(3).tolist()
        print(f"{key}:")
        print(f"  df: {df_sample} (dtype: {df[key].dtype})")
        print(f"  outcomes_df: {outcomes_sample} (dtype: {outcomes_df[key].dtype})")



=== Sample Values ===
CNT:
  df: ['ALB', 'ALB', 'ALB'] (dtype: object)
  outcomes_df: ['ALB', 'ALB', 'ALB'] (dtype: object)
CNTRYID:
  df: [8.0, 8.0, 8.0] (dtype: float64)
  outcomes_df: [8.0, 8.0, 8.0] (dtype: float64)
CNTSCHID:
  df: [800001.0, 800002.0, 800003.0] (dtype: float64)
  outcomes_df: [800001.0, 800004.0, 800005.0] (dtype: float64)
year:
  df: [2022, 2022, 2022] (dtype: int64)
  outcomes_df: [2015, 2015, 2015] (dtype: int64)
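
The identifier columns are stored as `float64` (for example `800001.0`). The merge works as long as both frames use the same dtype, but integer identifiers are arguably easier to read once uploaded. The sketch below, on toy data, casts them to the nullable `Int64` dtype, which keeps missing values as `<NA>`. It assumes the identifiers are whole numbers and checks that assumption first.

```python
import pandas as pd

toy = pd.DataFrame({
    'CNTRYID': [8.0, 8.0, None],
    'CNTSCHID': [800001.0, 800002.0, 800003.0],
})

id_columns = ['CNTRYID', 'CNTSCHID']

# Guard against silently altering non-integer values
for col in id_columns:
    values = toy[col].dropna()
    assert (values == values.round()).all(), f"{col} has non-integer values"

# Nullable integer dtype: NaN becomes <NA> instead of raising
toy = toy.astype({col: 'Int64' for col in id_columns})
print(toy.dtypes)
```

If this conversion is used, it would need to be applied to both frames before the merge so that the key dtypes still agree.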


In [None]:
# Check for duplicates in join keys
print(f"\n=== Duplicate Check ===")
df_duplicates = df.duplicated(subset=join_keys).sum()
outcomes_duplicates = outcomes_df.duplicated(subset=join_keys).sum()

print(f"Duplicate key combinations in df: {df_duplicates}")
print(f"Duplicate key combinations in outcomes_df: {outcomes_duplicates}")


=== Duplicate Check ===
Duplicate key combinations in df: 0
Duplicate key combinations in outcomes_df: 0


In [None]:
# Perform the join
print(f"\n=== Performing Join ===")
print(f"df shape before join: {df.shape}")
print(f"outcomes_df shape: {outcomes_df.shape}")

# Join the datasets - using 'inner' to keep only matching records
merged_df = pd.merge(df, outcomes_df,
                    on=join_keys,
                    how='inner',  # Keep only records that exist in both datasets
                    suffixes=('', '_outcomes'))

print(f"merged_df shape after join: {merged_df.shape}")
print(f"Rows removed (no match in outcomes_df): {len(df) - len(merged_df):,}")


=== Performing Join ===
df shape before join: (61440, 72)
outcomes_df shape: (54215, 7)
merged_df shape after join: (54212, 75)
Rows removed (no match in outcomes_df): 7,228
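
The inner join silently drops rows that lack a partner. To see which side the unmatched rows come from, one option is an outer merge with `indicator=True`, sketched here on toy data (the frames and values are made up for illustration):

```python
import pandas as pd

join_keys = ['CNT', 'CNTRYID', 'CNTSCHID', 'year']

# Toy stand-ins for df (factors) and outcomes_df (outcomes)
toy_factors = pd.DataFrame({
    'CNT': ['ALB', 'ALB', 'ALB'],
    'CNTRYID': [8.0, 8.0, 8.0],
    'CNTSCHID': [800001.0, 800002.0, 800003.0],
    'year': [2022, 2022, 2022],
    'SCHSIZE': [652.0, 183.0, 317.0],
})
toy_outcomes = pd.DataFrame({
    'CNT': ['ALB', 'ALB', 'ALB'],
    'CNTRYID': [8.0, 8.0, 8.0],
    'CNTSCHID': [800001.0, 800002.0, 800009.0],
    'year': [2022, 2022, 2022],
    'math': [392.04, 362.88, 400.0],
})

# The _merge column records where each key combination was found
diagnostic = pd.merge(toy_factors, toy_outcomes, on=join_keys,
                      how='outer', indicator=True)
match_counts = diagnostic['_merge'].value_counts()
print(match_counts)
```

Rows marked `left_only` have factors but no outcomes; `right_only` is the reverse. In the run above, the key combinations are unique in both frames, so 54,215 - 54,212 = 3 rows of `outcomes_df` also found no partner in `df`.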


In [None]:
merged_df.head()

Unnamed: 0,CNT,CNTRYID,CNTSCHID,CYC,SC001Q01TA,SC013Q01TA,SC002Q01TA,SC002Q02TA,SC018Q01TA01,SC018Q01TA02,SC018Q02TA01,SC018Q02TA02,SC012Q04TA,SC012Q05TA,SC012Q06TA,SC004Q05NA,SC004Q06NA,SC004Q07NA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q08TA,SC037Q09TA,SC025Q01NA,SC017Q01NA,SC017Q02NA,SC017Q03NA,SC017Q04NA,SC017Q05NA,SC017Q06NA,SC017Q07NA,SC017Q08NA,SC061Q01TA,SC061Q02TA,SC061Q03TA,SC061Q04TA,SC061Q05TA,SC061Q06TA,SC061Q07TA,SC061Q08TA,SC061Q09TA,SC061Q10TA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC003Q01TA,SC053Q01TA,SC053Q02TA,SC053Q03TA,SC053Q04TA,SC053Q09TA,SC053Q10TA,SC042Q01TA,SC042Q02TA,SCHSIZE,TOTAT,PROATCE,STRATIO,RATCMP1,RATCMP2,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,CLSIZE,CREACTIV,year,math,read,sci
0,ALB,8.0,800001.0,b'08MS',3.0,1.0,303.0,349.0,38.0,1.0,38.0,1.0,2.0,1.0,2.0,3.0,4.0,0.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,3.0,100.0,2.0,1.0,2.0,1.0,3.0,2.0,4.0,4.0,3.0,3.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,36.0,72.0,21.0,36.0,5.0,2.0,2.0,2.0,1.0,2.0,1.0,3.0,3.0,652.0,38.5,1.0,16.9351,0.1564,1.0,-0.2968,1.2048,0.6058,-0.5488,33.0,0.0,2022,392.04,373.48,413.34
1,ALB,8.0,800002.0,b'08MS',1.0,1.0,88.0,95.0,16.0,0.0,16.0,0.0,3.0,3.0,3.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,100.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,100.0,70.0,80.0,100.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,2.0,183.0,16.0,1.0,11.4375,0.0,0.0,-1.4551,2.9595,-0.0956,-2.0409,13.0,1.0,2022,362.88,352.07,381.14
2,ALB,8.0,800004.0,b'08MS',3.0,1.0,583.0,491.0,63.0,1.0,60.0,0.0,1.0,3.0,2.0,0.0,2.0,16.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,100.0,1.0,1.0,1.0,1.0,3.0,3.0,4.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,85.0,85.0,30.0,100.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,3.0,3.0,1074.0,63.5,0.9449,16.9134,0.1838,0.64,-1.4551,1.6863,-0.7912,-0.9138,28.0,0.0,2022,326.72,336.37,334.51
3,ALB,8.0,800006.0,b'08MS',3.0,1.0,249.0,328.0,37.0,3.0,37.0,3.0,3.0,2.0,3.0,4.0,4.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,3.0,6.0,1.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,3.0,78.0,11.0,80.0,5.0,1.0,1.0,2.0,1.0,1.0,1.0,3.0,2.0,577.0,38.5,1.0,14.987,0.068,0.0714,-0.2129,-0.2805,-1.0594,-0.5205,33.0,3.0,2022,421.6,404.21,418.3
4,ALB,8.0,800007.0,b'08MS',1.0,1.0,153.0,129.0,27.0,1.0,27.0,1.0,3.0,3.0,3.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,100.0,1.0,1.0,3.0,1.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,16.0,12.0,5.0,5.0,9.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,3.0,282.0,27.5,1.0,10.2545,0.0,0.0,-0.1776,2.9595,-0.335,-2.0409,53.0,1.0,2022,358.63,373.32,374.15


In [None]:
# Fix CYC: decode bytes values (b'08MS') to plain strings
merged_df['CYC'] = merged_df['CYC'].str.decode('utf-8')
merged_df.head()

Unnamed: 0,CNT,CNTRYID,CNTSCHID,CYC,SC001Q01TA,SC013Q01TA,SC002Q01TA,SC002Q02TA,SC018Q01TA01,SC018Q01TA02,SC018Q02TA01,SC018Q02TA02,SC012Q04TA,SC012Q05TA,SC012Q06TA,SC004Q05NA,SC004Q06NA,SC004Q07NA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q08TA,SC037Q09TA,SC025Q01NA,SC017Q01NA,SC017Q02NA,SC017Q03NA,SC017Q04NA,SC017Q05NA,SC017Q06NA,SC017Q07NA,SC017Q08NA,SC061Q01TA,SC061Q02TA,SC061Q03TA,SC061Q04TA,SC061Q05TA,SC061Q06TA,SC061Q07TA,SC061Q08TA,SC061Q09TA,SC061Q10TA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC003Q01TA,SC053Q01TA,SC053Q02TA,SC053Q03TA,SC053Q04TA,SC053Q09TA,SC053Q10TA,SC042Q01TA,SC042Q02TA,SCHSIZE,TOTAT,PROATCE,STRATIO,RATCMP1,RATCMP2,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,CLSIZE,CREACTIV,year,math,read,sci
0,ALB,8.0,800001.0,08MS,3.0,1.0,303.0,349.0,38.0,1.0,38.0,1.0,2.0,1.0,2.0,3.0,4.0,0.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,3.0,100.0,2.0,1.0,2.0,1.0,3.0,2.0,4.0,4.0,3.0,3.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,36.0,72.0,21.0,36.0,5.0,2.0,2.0,2.0,1.0,2.0,1.0,3.0,3.0,652.0,38.5,1.0,16.9351,0.1564,1.0,-0.2968,1.2048,0.6058,-0.5488,33.0,0.0,2022,392.04,373.48,413.34
1,ALB,8.0,800002.0,08MS,1.0,1.0,88.0,95.0,16.0,0.0,16.0,0.0,3.0,3.0,3.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,100.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,100.0,70.0,80.0,100.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,2.0,183.0,16.0,1.0,11.4375,0.0,0.0,-1.4551,2.9595,-0.0956,-2.0409,13.0,1.0,2022,362.88,352.07,381.14
2,ALB,8.0,800004.0,08MS,3.0,1.0,583.0,491.0,63.0,1.0,60.0,0.0,1.0,3.0,2.0,0.0,2.0,16.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,100.0,1.0,1.0,1.0,1.0,3.0,3.0,4.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,85.0,85.0,30.0,100.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,3.0,3.0,1074.0,63.5,0.9449,16.9134,0.1838,0.64,-1.4551,1.6863,-0.7912,-0.9138,28.0,0.0,2022,326.72,336.37,334.51
3,ALB,8.0,800006.0,08MS,3.0,1.0,249.0,328.0,37.0,3.0,37.0,3.0,3.0,2.0,3.0,4.0,4.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,3.0,6.0,1.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,3.0,78.0,11.0,80.0,5.0,1.0,1.0,2.0,1.0,1.0,1.0,3.0,2.0,577.0,38.5,1.0,14.987,0.068,0.0714,-0.2129,-0.2805,-1.0594,-0.5205,33.0,3.0,2022,421.6,404.21,418.3
4,ALB,8.0,800007.0,08MS,1.0,1.0,153.0,129.0,27.0,1.0,27.0,1.0,3.0,3.0,3.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,100.0,1.0,1.0,3.0,1.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,16.0,12.0,5.0,5.0,9.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,3.0,282.0,27.5,1.0,10.2545,0.0,0.0,-0.1776,2.9595,-0.335,-2.0409,53.0,1.0,2022,358.63,373.32,374.15
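
`CYC` arrives as Python `bytes` (`b'08MS'`). This is what `pandas.read_sas` returns for character columns when no `encoding` is passed, which is presumably the origin here. If other columns turn out to be affected, a generic pass can decode them all. A sketch on toy data; the helper name `decode_bytes_columns` is hypothetical and UTF-8 is assumed:

```python
import pandas as pd

def decode_bytes_columns(frame: pd.DataFrame, encoding: str = 'utf-8') -> pd.DataFrame:
    """Return a copy with every bytes value in object columns decoded to str."""
    out = frame.copy()
    for col in out.select_dtypes(include='object').columns:
        # Leave non-bytes values (str, NaN) untouched
        out[col] = out[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return out

toy = pd.DataFrame({
    'CNT': ['ALB', 'ALB'],
    'CYC': [b'08MS', b'08MS'],
    'SCHSIZE': [652.0, 183.0],
})

decoded = decode_bytes_columns(toy)
print(decoded['CYC'].tolist())
```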


## Adding tables to a BigQuery Dataset

In [1]:
from google.colab import auth
from google.cloud import bigquery

# Authenticate your Google account
auth.authenticate_user()

# --- CONFIGURATION ---
PROJECT_ID = "your-project"
REGION = "us-central1"

BQ_DATASET = "edu"

client = bigquery.Client(project=PROJECT_ID)

### Data table

In [None]:
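import pandas_gbq

def create_table_upload_data(dataset_id, table_name, df):
    """Create (or replace) a BigQuery table from a DataFrame."""
    # DataFrame.to_gbq is deprecated since pandas 2.2; pandas_gbq.to_gbq replaces it
    pandas_gbq.to_gbq(
        df,
        destination_table=f'{dataset_id}.{table_name}',
        project_id=PROJECT_ID,
        if_exists='replace'  # overwrite the table if it already exists
    )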

In [None]:
create_table_upload_data(BQ_DATASET, 'pisa_data', merged_df)

100%|██████████| 1/1 [00:00<00:00, 7898.88it/s]


### Codebooks table

We will create a BigQuery table that explains the meaning of each field and records the years in which PISA collected data for it. The table is built from PISA's codebooks for 2015, 2018, and 2022.

In [None]:
# Read the field names from the BQ data table
table = client.get_table(f'{PROJECT_ID}.{BQ_DATASET}.pisa_data')
field_names = [field.name for field in table.schema]
df_fields = pd.DataFrame(columns=field_names)
df_fields

Unnamed: 0,CNT,CNTRYID,CNTSCHID,CYC,SC001Q01TA,SC013Q01TA,SC002Q01TA,SC002Q02TA,SC018Q01TA01,SC018Q01TA02,SC018Q02TA01,SC018Q02TA02,SC012Q04TA,SC012Q05TA,SC012Q06TA,SC004Q05NA,SC004Q06NA,SC004Q07NA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q08TA,SC037Q09TA,SC025Q01NA,SC017Q01NA,SC017Q02NA,SC017Q03NA,SC017Q04NA,SC017Q05NA,SC017Q06NA,SC017Q07NA,SC017Q08NA,SC061Q01TA,SC061Q02TA,SC061Q03TA,SC061Q04TA,SC061Q05TA,SC061Q06TA,SC061Q07TA,SC061Q08TA,SC061Q09TA,SC061Q10TA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC003Q01TA,SC053Q01TA,SC053Q02TA,SC053Q03TA,SC053Q04TA,SC053Q09TA,SC053Q10TA,SC042Q01TA,SC042Q02TA,SCHSIZE,TOTAT,PROATCE,STRATIO,RATCMP1,RATCMP2,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,CLSIZE,CREACTIV,year,math,read,sci


In [None]:
codebook_df = process_multi_year_codes(years, df_fields, path, 'CY08MSP_SCH_QQQ')
codebook_df.head()

Found 75 fields in DataFrame
Successfully read codebook sheet: CY08MSP_SCH_QQQ
Found 273 field definitions in codebook

Matching Summary:
- Total fields in DataFrame: 75
- Fields matched in codebook: 71
- Fields not found: 4
Found 75 fields in DataFrame


  warn("""Cannot parse header or footer so it will be ignored""")


Successfully read codebook sheet: CY08MSP_SCH_QQQ
Found 196 field definitions in codebook

Matching Summary:
- Total fields in DataFrame: 75
- Fields matched in codebook: 71
- Fields not found: 4
Found 75 fields in DataFrame
Successfully read codebook sheet: CY08MSP_SCH_QQQ
Found 431 field definitions in codebook

Matching Summary:
- Total fields in DataFrame: 75
- Fields matched in codebook: 71
- Fields not found: 4


Unnamed: 0,year,field_id,field_name,if_found_in_codebook
69,2015,CLSIZE,Class Size,True
144,2018,CLSIZE,Class Size,True
219,2022,CLSIZE,Class size (test language class),True
0,2015,CNT,Country code 3-character,True
75,2018,CNT,Country code 3-character,True


In [None]:
codebook_df.columns

Index(['year', 'field_id', 'field_name', 'if_found_in_codebook'], dtype='object')

In [None]:
def generate_source_and_field_name(group):
    """Summarize all rows of one field_id.

    Returns the field's source ('codebooks <years>' or 'engineered') and a
    field_name that lists the per-year names when they differ across years.
    """
    codebook_years = group[group['if_found_in_codebook']]['year'].sort_values().tolist()

    # Generate source
    if codebook_years:
        years_str = ', '.join(map(str, codebook_years))
        source = f'codebooks {years_str}'
    else:
        source = 'engineered'

    # Check if field_name varies within the group
    unique_field_names = group['field_name'].unique()
    if len(unique_field_names) > 1:
        # Create year: field_name mapping
        year_field_mapping = []
        for _, row in group.iterrows():
            year_field_mapping.append(f"{row['year']}: {row['field_name']}")
        field_name = '; '.join(sorted(year_field_mapping))
    else:
        # Use the single field_name
        field_name = unique_field_names[0]

    return pd.Series({'source': source, 'field_name': field_name})

In [None]:
# Apply the function and update both columns
result = codebook_df.groupby('field_id').apply(generate_source_and_field_name)
result.head()



Unnamed: 0_level_0,source,field_name
field_id,Unnamed: 1_level_1,Unnamed: 2_level_1
CLSIZE,"codebooks 2015, 2018, 2022",2015: Class Size; 2018: Class Size; 2022: Class size (test language class)
CNT,"codebooks 2015, 2018, 2022",Country code 3-character
CNTRYID,"codebooks 2015, 2018, 2022",Country Identifier
CNTSCHID,"codebooks 2015, 2018, 2022",Intl. School ID
CREACTIV,"codebooks 2015, 2018, 2022",2015: Creative extra-curricular activities (Sum); 2018: Creative extra-curricular activities (Sum); 2022: Creative extra-curricular activities (3 activities)
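
The `source` label can also be derived without `groupby(...).apply(...)`. The sketch below is an alternative formulation on toy data, not the method used in this notebook: it aggregates the years of the rows found in a codebook and labels every remaining field as `engineered`.

```python
import pandas as pd

# Toy long-format codebook: one row per (year, field_id)
toy = pd.DataFrame({
    'year': [2015, 2018, 2022, 2015, 2018, 2022, 2022],
    'field_id': ['CLSIZE', 'CLSIZE', 'CLSIZE', 'CNT', 'CNT', 'CNT', 'math'],
    'if_found_in_codebook': [True, True, True, True, True, True, False],
})

# Years in which each field was found, joined into one label
found = toy[toy['if_found_in_codebook']].sort_values('year')
source = found.groupby('field_id')['year'].agg(
    lambda years: 'codebooks ' + ', '.join(map(str, years))
)

# Fields never found in any codebook get the 'engineered' label
source = source.reindex(toy['field_id'].unique(), fill_value='engineered')
print(source.to_dict())
```

The resulting Series is indexed by `field_id`, so it can be attached to the long table with `toy['field_id'].map(source)`.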


In [None]:
codebook_df['source'] = result['source'].reindex(codebook_df['field_id']).values
codebook_df['field_name'] = result['field_name'].reindex(codebook_df['field_id']).values

codebook_df.head(10)

Unnamed: 0,year,field_id,field_name,if_found_in_codebook,source
69,2015,CLSIZE,2015: Class Size; 2018: Class Size; 2022: Class size (test language class),True,"codebooks 2015, 2018, 2022"
144,2018,CLSIZE,2015: Class Size; 2018: Class Size; 2022: Class size (test language class),True,"codebooks 2015, 2018, 2022"
219,2022,CLSIZE,2015: Class Size; 2018: Class Size; 2022: Class size (test language class),True,"codebooks 2015, 2018, 2022"
0,2015,CNT,Country code 3-character,True,"codebooks 2015, 2018, 2022"
75,2018,CNT,Country code 3-character,True,"codebooks 2015, 2018, 2022"
150,2022,CNT,Country code 3-character,True,"codebooks 2015, 2018, 2022"
1,2015,CNTRYID,Country Identifier,True,"codebooks 2015, 2018, 2022"
76,2018,CNTRYID,Country Identifier,True,"codebooks 2015, 2018, 2022"
151,2022,CNTRYID,Country Identifier,True,"codebooks 2015, 2018, 2022"
2,2015,CNTSCHID,Intl. School ID,True,"codebooks 2015, 2018, 2022"


In [None]:
codebook_df = codebook_df.drop_duplicates(subset=['field_id'], keep='first')
pd.set_option('display.max_colwidth', None)
codebook_df.head(10)

Unnamed: 0,year,field_id,field_name,if_found_in_codebook,source
69,2015,CLSIZE,2015: Class Size; 2018: Class Size; 2022: Class size (test language class),True,"codebooks 2015, 2018, 2022"
0,2015,CNT,Country code 3-character,True,"codebooks 2015, 2018, 2022"
1,2015,CNTRYID,Country Identifier,True,"codebooks 2015, 2018, 2022"
2,2015,CNTSCHID,Intl. School ID,True,"codebooks 2015, 2018, 2022"
220,2022,CREACTIV,2015: Creative extra-curricular activities (Sum); 2018: Creative extra-curricular activities (Sum); 2022: Creative extra-curricular activities (3 activities),True,"codebooks 2015, 2018, 2022"
78,2018,CYC,2015: PISA Assessment Cycle (2 digits + 2 character Assessment type - MS\FT); 2018: PISA Assessment Cycle (2 digits + 2 character Assessment type - MS/FT); 2022: PISA Assessment Cycle (2 digits + 2 character Assessment type - MS/FT),True,"codebooks 2015, 2018, 2022"
66,2015,EDUSHORT,Shortage of educational material (WLE),True,"codebooks 2015, 2018, 2022"
61,2015,PROATCE,2015: Index proportion of all teachers fully certified; 2018: Index proportion of all teachers fully certified; 2022: Proportion of all teachers fully certified,True,"codebooks 2015, 2018, 2022"
213,2022,RATCMP1,2015: Number of available computers per student at modal grade; 2018: Number of available computers per student at modal grade; 2022: Availability of computers,True,"codebooks 2015, 2018, 2022"
214,2022,RATCMP2,2015: Proportion of available computers that are connected to the Internet; 2018: Proportion of available computers that are connected to the Internet; 2022: Computers connected to the Internet,True,"codebooks 2015, 2018, 2022"


In [None]:
codebook_df = codebook_df.drop(columns=['year'])
codebook_df.head()

Unnamed: 0,field_id,field_name,if_found_in_codebook,source
69,CLSIZE,2015: Class Size; 2018: Class Size; 2022: Class size (test language class),True,"codebooks 2015, 2018, 2022"
0,CNT,Country code 3-character,True,"codebooks 2015, 2018, 2022"
1,CNTRYID,Country Identifier,True,"codebooks 2015, 2018, 2022"
2,CNTSCHID,Intl. School ID,True,"codebooks 2015, 2018, 2022"
220,CREACTIV,2015: Creative extra-curricular activities (Sum); 2018: Creative extra-curricular activities (Sum); 2022: Creative extra-curricular activities (3 activities),True,"codebooks 2015, 2018, 2022"
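
Before uploading, it can be worth confirming that every column of the data table has exactly one row in the codebook table. A minimal sketch on toy data; the column list and codebook rows here are illustrative stand-ins for `merged_df.columns` and `codebook_df`:

```python
import pandas as pd

# Stand-in for list(merged_df.columns)
data_columns = ['CNT', 'CNTRYID', 'CLSIZE', 'math']

# Stand-in for codebook_df after deduplication
toy_codebook = pd.DataFrame({
    'field_id': ['CLSIZE', 'CNT', 'CNTRYID', 'math'],
    'source': ['codebooks 2015, 2018, 2022'] * 3 + ['engineered'],
})

# Columns with no codebook row, and field_ids that appear more than once
missing = sorted(set(data_columns) - set(toy_codebook['field_id']))
duplicated = toy_codebook.loc[toy_codebook['field_id'].duplicated(), 'field_id'].tolist()
print(missing, duplicated)
```

Both lists should be empty if the codebook table describes the data table completely.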


In [None]:
create_table_upload_data(BQ_DATASET, 'pisa_codebooks', codebook_df)

100%|██████████| 1/1 [00:00<00:00, 4185.93it/s]
