# Extract, Transform and Load (ETL) Process for the Programme for International Student Assessment (PISA) Data

## Introduction
Here I document the steps followed to take the PISA data and structure it properly for later analysis. 
## Data Extraction
The data was extracted from the [official PISA data website](https://www.oecd.org/pisa/data/). I first downloaded and processed the data for [2018](https://www.oecd.org/pisa/data/2018database/). After defining the necessary functions and processes, I will apply them to the [2015](https://www.oecd.org/pisa/data/2015database/) data. Ideally, the process will be robust enough to extend to the other years for which data is available (i.e. 2000, 2003, 2006, 2009 and 2012). 

The data files provided are the following: Student Questionnaire, School Questionnaire, Teacher Questionnaire, Cognitive item data, and Questionnaire Timing. The three most important ones are those containing the responses of the students, schools and teachers. 


* Student Questionnaire: 
* School Questionnaire: 
* Teacher Questionnaire: 
* Cognitive item data: 
* Questionnaire Timing: 


## Data Processing
### Codebooks


Most of the data in this dataset is coded. The codebook file describes how to decode the different variables. It was not very well formatted, so I had to fix it before proceeding. 

In [18]:
def read_codebook(path = r"D:\Data Science Folder\PISA Analysis\Data\2018\PISA2018_CODEBOOK.xlsx"):
    """
    Description: Reads the codebook and fills the variable name and variable label 
    down to every value row, so that each row can be read on its own. 
    
    Inputs: 
    - path: the location of the codebook excel file to read

    Returns: a DataFrame with the columns NAME, VARLABEL, VAL and LABEL.
    """
    import pandas as pd

    DataFrame = pd.read_excel(path)
    DataFrame = DataFrame[["NAME", "VARLABEL","VAL", "LABEL"]]

    # Rows where a new variable starts, and the number of rows each variable spans.
    temp_df = DataFrame[["NAME", "VARLABEL"]].dropna().reset_index()
    temp_df["repeat"] = temp_df["index"].shift(-1) - temp_df["index"]
    # The last variable runs to the end of the file. repeat() needs integer counts.
    temp_df["repeat"] = temp_df["repeat"].fillna(len(DataFrame) - temp_df["index"].max()).astype(int)

    # Repeat each name and label once per row of its block.
    temp_df2 = pd.DataFrame(temp_df["NAME"].repeat(temp_df["repeat"].tolist())).reset_index()[["NAME"]]
    temp_df2["VARLABEL"] = pd.DataFrame(temp_df["VARLABEL"].repeat(temp_df["repeat"].tolist())).reset_index()[["VARLABEL"]]

    DataFrame[["NAME","VARLABEL"]] = temp_df2[["NAME","VARLABEL"]]
    DataFrame = DataFrame[["NAME", "VARLABEL","VAL", "LABEL"]]

    return DataFrame

In [19]:
Codebooks = read_codebook()
Codebooks.head()
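
The fill-down step in `read_codebook` could also be expressed with pandas' forward fill. The following is a minimal sketch on a made-up codebook fragment (the variable names and labels are invented for illustration, not taken from the real file). It assumes that every row starting a new variable has both `NAME` and `VARLABEL` filled in.

```python
import pandas as pd

# Toy codebook fragment: NAME and VARLABEL appear only on the first row of each block.
raw = pd.DataFrame({
    "NAME": ["VAR_A", None, None, "VAR_B", None],
    "VARLABEL": ["First toy variable", None, None, "Second toy variable", None],
    "VAL": [1, 2, 3, 1, 2],
    "LABEL": ["Low", "Medium", "High", "No", "Yes"],
})

filled = raw.copy()
# Propagate the last seen name and label down to the rows below it.
filled[["NAME", "VARLABEL"]] = filled[["NAME", "VARLABEL"]].ffill()

print(filled)
```

If a variable in the real file had a name but an empty label, forward fill would carry the previous variable's label down, so that case would need checking first.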

### Questionnaire Compendium

This is a set of files, each of which refers to a section of a specific questionnaire. Each file contains a "Table of Contents" sheet, which for our purposes is a table listing the variable names and their respective labels. In addition, summary statistics are given for each variable: weighted count, missing percentage, and the relative frequency of each response value. 

Unlike the codebook, there is no file specifically meant to decode the variable names and variable values in the survey. However, these files allow one to be created. 

Let's start by looking at a simple questionnaire: the School Questionnaire. 

In [35]:
def read_questionnaires(path = r'D:\Data Science Folder\PISA Analysis\Data\2018\2018_Compendia_Questionnaire\bkg'):
    """
    Description: This function reads all the excel files of a questionnaire compendium 
    and collects the response values listed for each variable. 
    
    Inputs: 
    - path: location where all the excel files can be found.  
    """
    # Importing packages 
    import pandas as pd
    import os
    
    # This section reads the "Table of Contents" of every excel file to list its variables. 
    
    files = os.listdir(path)
    tables = []
    for file in files: 
        data = pd.read_excel(os.path.join(path, file), sheet_name = "Table of Contents")
        cols = data.columns
        data = data.rename(columns = {cols[0]:"varname", cols[1]: "varlabels"})
        data["filename"] = file
        tables.append(data)
    # DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat. 
    CompendiaQuestionnaire = pd.concat(tables, ignore_index = True)
    
    # This reads the sheet of every variable and combines them to get the full dataset. 
    # Each variable is read from the file whose table of contents lists it. 
    
    values = []
    for filename, varname in zip(CompendiaQuestionnaire["filename"], CompendiaQuestionnaire["varname"]):
        data = pd.read_excel(os.path.join(path, filename), 
                             sheet_name = varname, 
                             skiprows = 0, 
                             nrows = 1)
        # Keep the non-empty cells of the first row, from the fifth column onwards, numbered from 1. 
        data = pd.DataFrame(data.iloc[0][4:].dropna().reset_index().iloc[:,1]).reset_index()
        data["index"] = data["index"]+1
        cols = data.columns
        data = data.rename(columns = {cols[0]:"labels", cols[1]:"value"})
        data["varname"] = varname
        values.append(data)

    df = pd.concat(values, ignore_index = True)
    return df

    

In [None]:
QuestionnaireData = read_questionnaires()
QuestionnaireData.head()
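
Opening the same workbook once per variable is slow when a file has many sheets. As a possible alternative, `pd.read_excel(..., sheet_name=None)` returns a dictionary with one DataFrame per sheet, so each file only has to be opened once. The sketch below works on such a dictionary. `collect_response_values` is a hypothetical helper, and the toy sheet merely assumes the layout relied on above (response values in the first data row, from the fifth column onwards).

```python
import pandas as pd

def collect_response_values(sheets, toc_name="Table of Contents"):
    """Build a long table (labels, value, varname) from a dict of sheets."""
    frames = []
    for varname, sheet in sheets.items():
        if varname == toc_name or sheet.empty:
            continue
        # Response values: first data row, fifth column onwards, blanks dropped.
        values = sheet.iloc[0, 4:].dropna().tolist()
        if not values:
            continue
        frames.append(pd.DataFrame({
            "labels": range(1, len(values) + 1),
            "value": values,
            "varname": varname,
        }))
    if not frames:
        return pd.DataFrame(columns=["labels", "value", "varname"])
    return pd.concat(frames, ignore_index=True)

# Toy stand-in for the result of pd.read_excel(some_file, sheet_name=None).
toy_sheets = {
    "Table of Contents": pd.DataFrame({"a": ["VAR_A"], "b": ["Toy variable"]}),
    "VAR_A": pd.DataFrame(
        [["x", "x", "x", "x", "Yes", "No", None]],
        columns=["c1", "c2", "c3", "c4", "c5", "c6", "c7"],
    ),
}
result = collect_response_values(toy_sheets)
print(result)
```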

#### Compendia Questionnaire

#### Compendia Cognitive

First, get the Table of Contents from each file in the folder. 

In [36]:
CognitiveData = read_questionnaires(path = r'D:\Data Science Folder\PISA Analysis\Data\2018\2018_Compendia_Cognitive\cog')
CognitiveData.head()

In [14]:
results = df

## Data Files
All the work above has been aimed at recovering the metadata of the SAS files. 

The [pyreadstat](https://pypi.org/project/pyreadstat/) package might be able to do that directly.
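
A rough sketch of how pyreadstat could be used for this: its `read_sas7bdat` function accepts `metadataonly=True` and returns a metadata object alongside the data frame, exposing (among others) `column_names_to_labels` and `variable_value_labels` as dictionaries. For SAS files the value labels are generally only populated when a format catalog is passed as well, so whether they are available for the PISA files is an open question. The hypothetical helper `metadata_to_codebook` below turns two such dictionaries into the same long format as the codebook. It is exercised on invented toy values rather than a real file.

```python
import pandas as pd

def metadata_to_codebook(names_to_labels, value_labels):
    """Combine variable labels and value labels into one long table."""
    rows = []
    for name, varlabel in names_to_labels.items():
        codes = value_labels.get(name, {})
        if not codes:
            # Variable without value labels: keep one row with empty VAL and LABEL.
            rows.append({"NAME": name, "VARLABEL": varlabel, "VAL": None, "LABEL": None})
        for val, label in codes.items():
            rows.append({"NAME": name, "VARLABEL": varlabel, "VAL": val, "LABEL": label})
    return pd.DataFrame(rows, columns=["NAME", "VARLABEL", "VAL", "LABEL"])

# With a real file the two dictionaries would come from something like:
#   import pyreadstat
#   _, meta = pyreadstat.read_sas7bdat(path, metadataonly=True)
#   metadata_to_codebook(meta.column_names_to_labels, meta.variable_value_labels)
toy_names = {"VAR_A": "Toy categorical variable", "VAR_B": "Toy numeric variable"}
toy_values = {"VAR_A": {1: "Yes", 2: "No"}}
codebook = metadata_to_codebook(toy_names, toy_values)
print(codebook)
```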


In [None]:
import pandas as pd

DataFrame_Questionnaire = pd.read_sas(r"D:\Data Science Folder\PISA Analysis\Data\2018\2018_Questionnaire_Timing_Data_Files\cy07_msu_stu_tim.sas7bdat")

In [None]:
DataFrame_Cognitive = pd.read_sas(r"D:\Data Science Folder\PISA Analysis\Data\2018\2018_Cognitive_Item_Data_Files\cy07_msu_stu_cog.sas7bdat")

In [None]:
SQ_2018 = pd.read_sas(r"D:\Data Science Folder\PISA Analysis\Data\2018\2018_School_Questionnaire_Data_Files\cy07_msu_sch_qqq.sas7bdat", encoding = "iso-8859-1")
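
When `pd.read_sas` is called without an `encoding`, text columns come back as `bytes` objects rather than strings. If a file has already been loaded that way, the text columns can be decoded afterwards instead of re-reading the file. The sketch below uses a hypothetical helper, `decode_bytes_columns`, on a toy frame with invented column names and values.

```python
import pandas as pd

def decode_bytes_columns(df, encoding="iso-8859-1"):
    """Return a copy of df with bytes values in object columns decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out

# Toy frame mimicking a SAS file read without an encoding.
toy = pd.DataFrame({"CODE": [b"ESP", b"M\xe9xico", None], "SCORE": [1.0, 2.0, 3.0]})
decoded = decode_bytes_columns(toy)
print(decoded)
```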

In [None]:
print(SQ_2018.columns.to_series().to_string())

In [None]:
SQ_2018["SC013Q01TA"]
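
Once a long-format codebook is available, a coded column such as the one above could be translated by looking up its `VAL` to `LABEL` pairs. This is a minimal sketch, assuming the codebook has the columns produced by `read_codebook` and that the codes in the data match the `VAL` entries exactly (in practice, type differences between the two may need handling). `decode_column` is a hypothetical helper, and the codes and labels shown are placeholders, not the real definition of `SC013Q01TA`.

```python
import pandas as pd

def decode_column(series, codebook, name):
    """Map the codes in series to labels using the codebook rows for name."""
    rows = codebook[codebook["NAME"] == name]
    mapping = dict(zip(rows["VAL"], rows["LABEL"]))
    # Codes that are not in the codebook are left unchanged.
    return series.map(lambda v: mapping.get(v, v))

toy_codebook = pd.DataFrame({
    "NAME": ["VAR_A", "VAR_A", "VAR_B"],
    "VARLABEL": ["Toy variable A", "Toy variable A", "Toy variable B"],
    "VAL": [1, 2, 1],
    "LABEL": ["Option one", "Option two", "Other"],
})
toy_column = pd.Series([1, 2, 2, 9], name="VAR_A")
decoded_column = decode_column(toy_column, toy_codebook, "VAR_A")
print(decoded_column.tolist())
```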