# Extract, Transform and Load (ETL) Process for the Programme for International Student Assessment (PISA) Data

## Introduction
Here I document the series of steps followed to take the data from PISA and structure it properly for later analysis. 
## Data Extraction
The data was extracted from the [official PISA data website](https://www.oecd.org/pisa/data/). I first downloaded and processed the data for [2018](https://www.oecd.org/pisa/data/2018database/). After defining the necessary functions and processes, I will apply them to the [2015](https://www.oecd.org/pisa/data/2015database/) data. Ideally, the process will be robust enough to extend to the other years for which data is available (i.e. 2000, 2003, 2006, 2009 and 2012).

The data files provided are the following: Student Questionnaire, School Questionnaire, Teacher Questionnaire, Cognitive item data, and Questionnaire Timing. The three most important are those containing the responses of students, schools and teachers.


* Student Questionnaire: 
* School Questionnaire: 
* Teacher Questionnaire: 
* Cognitive item data: 
* Questionnaire Timing: 


## Data Processing
### Codebooks


Most of the data in this dataset is coded. The codebook file contains the mapping needed to decode the different variables. It was not well formatted, so I had to fix it before proceeding.

In [7]:
import pandas as pd
import numpy as np

The codebook data can be seen below. First of all, we only need the following columns: 

* NAME: the column containing the variable name that shows up in the data file.
* VARLABEL: the label of the variable contained in the NAME column. We want to keep this especially for those variables with names that aren't so intuitive. 
* VAL: the codes that appear in the data files under the variable named in the NAME column.
* LABEL: these are the labels of the codes in the VAL column. 



In [267]:
codebook_df = pd.read_excel(r"D:\Data Science Folder\PISA Analysis\Data\2018\PISA2018_CODEBOOK.xlsx")
codebook_df.head(10)

Unnamed: 0,NAME,VARLABEL,TYPE,FORMAT,VARNUM,MINMAX,VAL,LABEL,COUNT,PERCENT
0,CNTRYID,Country Identifier,NUM,3.0,1.0,8-840,,,,
1,,,,,,,8.0,Albania,3375.0,3.143424
2,,,,,,,31.0,Baku (Azerbaijan),4077.0,3.797256
3,,,,,,,32.0,Argentina,0.0,0.0
4,,,,,,,36.0,Australia,0.0,0.0
5,,,,,,,40.0,Austria,0.0,0.0
6,,,,,,,56.0,Belgium,0.0,0.0
7,,,,,,,70.0,Bosnia and Herzegovina,0.0,0.0
8,,,,,,,76.0,Brazil,8969.0,8.353591
9,,,,,,,96.0,Brunei Darussalam,0.0,0.0


In a properly structured database, the data shown above would be separated into several tables. For example, one table would contain only the rows for NAME = "CNTRYID", so that one could join on VAL and get the country labels.

Instead of having several tables, an alternative is to keep this table as is and simply filter it before joining. This is how it will be done. To do so, every row with a value in the LABEL column needs to carry the corresponding NAME (and VARLABEL). Right now, these cells all show "NaN" instead.

In [268]:
df = codebook_df[["NAME", "VARLABEL","VAL", "LABEL"]]
df.head(10)

Unnamed: 0,NAME,VARLABEL,VAL,LABEL
0,CNTRYID,Country Identifier,,
1,,,8.0,Albania
2,,,31.0,Baku (Azerbaijan)
3,,,32.0,Argentina
4,,,36.0,Australia
5,,,40.0,Austria
6,,,56.0,Belgium
7,,,70.0,Bosnia and Herzegovina
8,,,76.0,Brazil
9,,,96.0,Brunei Darussalam


To fix this, I remove the missing values to get a table that contains one row per unique NAME and VARLABEL pair. Resetting the index then gives me an "index" column that records the row at which each NAME starts. I shift this column back by one row and subtract the original from it, which gives the number of times each value in NAME needs to be repeated. The only missing value is on the very last row, because the shift has run out of rows. For this case, I subtract the last start position from the number of rows in the original dataframe.

In [269]:
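# Keep only the header rows; the old index records where each variable starts
df2 = df[["NAME", "VARLABEL"]].dropna().reset_index()
# Rows per variable = start of the next variable minus start of this one
df2["repeat"] = df2["index"].shift(-1) - df2["index"]
# The last variable has no successor, so count up to the end of the table
df2["repeat"] = df2["repeat"].fillna(len(df) - df2["index"].max())
df2.head()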

Unnamed: 0,index,NAME,VARLABEL,repeat
0,0,CNTRYID,Country Identifier,84.0
1,84,CNT,Country code 3-character,83.0
2,167,CNTSCHID,Intl. School ID,2.0
3,169,CNTTCHID,Intl. Teacher ID,2.0
4,171,TEACHERID,Teacher identification code,4.0


With this, I can now repeat the rows ... 

In [270]:
# Series.repeat needs integer counts, so cast the float "repeat" column first
repeats = df2["repeat"].astype(int).tolist()
df3 = df2["NAME"].repeat(repeats).reset_index(drop=True).to_frame()
df3["VARLABEL"] = df2["VARLABEL"].repeat(repeats).reset_index(drop=True)
df3.head()

Unnamed: 0,NAME,VARLABEL
0,CNTRYID,Country Identifier
1,CNTRYID,Country Identifier
2,CNTRYID,Country Identifier
3,CNTRYID,Country Identifier
4,CNTRYID,Country Identifier


... and paste them into the original dataframe.

In [271]:
codebook_df[["NAME","VARLABEL"]] = df3[["NAME","VARLABEL"]]
codebook_df = codebook_df[["NAME", "VARLABEL","VAL", "LABEL"]]
codebook_df

Unnamed: 0,NAME,VARLABEL,VAL,LABEL
0,CNTRYID,Country Identifier,,
1,CNTRYID,Country Identifier,8,Albania
2,CNTRYID,Country Identifier,31,Baku (Azerbaijan)
3,CNTRYID,Country Identifier,32,Argentina
4,CNTRYID,Country Identifier,36,Australia
5,CNTRYID,Country Identifier,40,Austria
6,CNTRYID,Country Identifier,56,Belgium
7,CNTRYID,Country Identifier,70,Bosnia and Herzegovina
8,CNTRYID,Country Identifier,76,Brazil
9,CNTRYID,Country Identifier,96,Brunei Darussalam
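
As a cross-check on the steps above, the same result could probably be reached with a forward fill, since every blank NAME/VARLABEL cell belongs to the closest header row above it. The following is only a sketch on a toy table, not the real codebook: `toy`, `filled` and the variable `VAR_B` are made up for illustration, and it assumes that only header rows carry a NAME and that each header row also has a VARLABEL.

```python
import numpy as np
import pandas as pd

# Toy codebook with the same layout as the PISA file: one header row per
# variable, followed by rows that only carry VAL and LABEL.
# VAR_B is a hypothetical variable, not part of the PISA data.
toy = pd.DataFrame({
    "NAME": ["CNTRYID", np.nan, np.nan, "VAR_B", np.nan, np.nan],
    "VARLABEL": ["Country Identifier", np.nan, np.nan,
                 "Hypothetical variable", np.nan, np.nan],
    "VAL": [np.nan, 8, 31, np.nan, 1, 2],
    "LABEL": [np.nan, "Albania", "Baku (Azerbaijan)", np.nan, "Yes", "No"],
})

# Propagate each header's NAME and VARLABEL down to the rows below it.
filled = toy.copy()
filled[["NAME", "VARLABEL"]] = filled[["NAME", "VARLABEL"]].ffill()
print(filled)
```

If a header row in the real file had a NAME but no VARLABEL, the two approaches would differ, so it would be worth comparing their outputs before switching.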


In [272]:
codebook_df.to_csv(r"D:\Data Science Folder\PISA Analysis\Data\2018\codebook_df.csv")
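
With NAME filled in on every row, the "filter before joining" idea described earlier can be sketched as follows. This is an illustration under assumptions rather than the code used later in the project: the `decode` helper, the `data` frame and the `VAR_B` rows are hypothetical, and it assumes the codes of the chosen variable are numeric.

```python
import pandas as pd

# Small stand-in for the filled codebook; the CNTRYID rows come from the
# output above, the VAR_B rows are made up.
codebook = pd.DataFrame({
    "NAME": ["CNTRYID", "CNTRYID", "CNTRYID", "VAR_B", "VAR_B"],
    "VAL": [8, 31, 76, 1, 2],
    "LABEL": ["Albania", "Baku (Azerbaijan)", "Brazil", "Yes", "No"],
})

# Stand-in for a data file that stores the coded values as floats.
data = pd.DataFrame({"CNTRYID": [76.0, 8.0, 31.0, 8.0]})

def decode(data, codebook, name):
    """Return `data` with an extra `<name>_LABEL` column looked up in the codebook."""
    # Filter the codebook down to one variable, then rename for the join.
    lookup = (codebook.loc[codebook["NAME"] == name, ["VAL", "LABEL"]]
              .rename(columns={"VAL": name, "LABEL": f"{name}_LABEL"}))
    lookup[name] = lookup[name].astype(float)  # match the float codes in the data
    # A left join keeps every data row, even if its code has no label.
    return data.merge(lookup, on=name, how="left")

decoded = decode(data, codebook, "CNTRYID")
print(decoded)
```

Keeping a single long codebook and filtering per variable avoids maintaining one lookup table per variable, at the cost of repeating the filter for each join.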

### School Questionnaire
The variable names in this dataset are codes tied to the question asked (for example, SC013Q01TA), but no official codebook was provided for them. I therefore built one from a summary (compendium) file provided by PISA.



In [279]:
df = pd.read_excel(r"D:\Data Science Folder\PISA Analysis\Data\2018\2018_Compendia_Questionnaire\bkg\pisa_ms_bkg_overall_sch_compendium.xlsx", 
                  sheet_name = "Table of Contents").reset_index().rename(columns = {"index":"varname", "Table of Contents": "varlabels"})
df.head()

Unnamed: 0,varname,varlabels
0,SC001Q01TA,Which of the following definitions best descri...
1,SC013Q01TA,Is your school a public or a private school?
2,SC017Q01NA,School's instruction hindered by: A lack of te...
3,SC017Q02NA,School's instruction hindered by: Inadequate o...
4,SC017Q03NA,School's instruction hindered by: A lack of as...


In [283]:
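compendium_path = r"D:\Data Science Folder\PISA Analysis\Data\2018\2018_Compendia_Questionnaire\bkg\pisa_ms_bkg_overall_sch_compendium.xlsx"
frames = []
# Note: the range stops at len(df) - 1, so the last row of the table of contents is skipped
for i in range(0, len(df) - 1):
    # Each variable has its own sheet; only the first data row is needed
    df2 = pd.read_excel(compendium_path,
                        sheet_name = df["varname"].iloc[i],
                        skiprows = 0,
                        nrows = 1)
    # The response labels start at the fifth column of that row
    df2 = pd.DataFrame(df2.iloc[0][4:].dropna().reset_index().iloc[:,1]).reset_index()
    # Number the labels 1, 2, 3, ... in the order they appear
    df2["index"] = df2["index"]+1
    df2 = df2.rename(columns = {0:"labels", "index":"value"})
    df2["varname"] = df["varname"].iloc[i]
    frames.append(df2)
# DataFrame.append was removed in pandas 2.0, so concatenate all pieces at once
df3 = pd.concat(frames, ignore_index = True)
df3.head()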

Unnamed: 0,value,labels,varname
0,1,"A village, hamlet or rural area (fewer than 3 ...",SC001Q01TA
1,2,A small town (3 000 to about 15 000 people),SC001Q01TA
2,3,A town (15 000 to about 100 000 people),SC001Q01TA
3,4,A city (100 000 to about 1 000 000 people),SC001Q01TA
4,5,A large city (with over 1 000 000 people),SC001Q01TA
5,6,Valid Skip,SC001Q01TA
6,7,Not Applicable,SC001Q01TA
7,8,Invalid,SC001Q01TA
8,9,No Response,SC001Q01TA
9,1,A public school (Managed by a public education...,SC013Q01TA


In [300]:
SchoolQuestionnaireCodebook = pd.merge(df,df3,on = "varname")

In [8]:
SQ_2018 = pd.read_sas(r"D:\Data Science Folder\PISA Analysis\Data\2018\2018_School_Questionnaire_Data_Files\cy07_msu_sch_qqq.sas7bdat", encoding = "iso-8859-1")

In [40]:
print(SQ_2018.columns.to_series().to_string())

CNTRYID                      CNTRYID
CNT                              CNT
CNTSCHID                    CNTSCHID
CYC                              CYC
NatCen                        NatCen
Region                        Region
STRATUM                      STRATUM
SUBNATIO                    SUBNATIO
OECD                            OECD
ADMINMODE                  ADMINMODE
LANGTEST                    LANGTEST
SC001Q01TA                SC001Q01TA
SC013Q01TA                SC013Q01TA
SC016Q01TA                SC016Q01TA
SC016Q02TA                SC016Q02TA
SC016Q03TA                SC016Q03TA
SC016Q04TA                SC016Q04TA
SC017Q01NA                SC017Q01NA
SC017Q02NA                SC017Q02NA
SC017Q03NA                SC017Q03NA
SC017Q04NA                SC017Q04NA
SC017Q05NA                SC017Q05NA
SC017Q06NA                SC017Q06NA
SC017Q07NA                SC017Q07NA
SC017Q08NA                SC017Q08NA
SC161Q01SA                SC161Q01SA
SC16

In [266]:
SQ_2018["SC013Q01TA"]

0        1.0
1        1.0
2        1.0
3        1.0
4        1.0
5        1.0
6        2.0
7        1.0
8        2.0
9        2.0
10       1.0
11       1.0
12       1.0
13       1.0
14       1.0
15       2.0
16       1.0
17       1.0
18       2.0
19       1.0
20       1.0
21       1.0
22       2.0
23       1.0
24       1.0
25       1.0
26       1.0
27       1.0
28       1.0
29       1.0
        ... 
21873    1.0
21874    1.0
21875    1.0
21876    1.0
21877    1.0
21878    1.0
21879    1.0
21880    1.0
21881    1.0
21882    1.0
21883    1.0
21884    1.0
21885    1.0
21886    1.0
21887    1.0
21888    1.0
21889    1.0
21890    1.0
21891    1.0
21892    1.0
21893    1.0
21894    1.0
21895    1.0
21896    1.0
21897    1.0
21898    1.0
21899    1.0
21900    1.0
21901    1.0
21902    1.0
Name: SC013Q01TA, Length: 21903, dtype: float64
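
The coded values above could be turned into labels with the codebook built earlier. A minimal sketch, assuming the `value` column of the codebook lines up with the codes stored in the data file (this has not been verified here): `school_codebook` and `coded` are small stand-ins, and the labels are shortened placeholders rather than the official wording.

```python
import numpy as np
import pandas as pd

# Stand-in for SchoolQuestionnaireCodebook; labels are shortened placeholders.
school_codebook = pd.DataFrame({
    "varname": ["SC013Q01TA", "SC013Q01TA", "SC001Q01TA"],
    "value": [1, 2, 1],
    "labels": ["Public", "Private", "Village"],
})

# Stand-in for SQ_2018["SC013Q01TA"], which read_sas returns as floats.
coded = pd.Series([1.0, 1.0, 2.0, np.nan], name="SC013Q01TA")

# Build a value -> label lookup for this one variable.
subset = school_codebook[school_codebook["varname"] == coded.name]
lookup = dict(zip(subset["value"].astype(float), subset["labels"]))

# Codes without an entry in the lookup (including missing values) stay NaN.
decoded = coded.map(lookup)
print(decoded.tolist())
```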