worldbank · jpazvd · Oct 18, 2019 · Oct 17, 2019 · Oct 17, 2019
diff --git a/00_documentation/Contribution_and_Replication.md b/00_documentation/Contribution_and_Replication.md
@@ -3,7 +3,7 @@
 
 ### Table of Contents
 1. [Replicating this Repository](#replicating-this-repository)
-1. [Contributing to this Repository](#contributing-to-this-repository)
+1. [Contributing to this Repository](#contributing-to-this-repository)   
   2.1. [Bug reports and feature requests](#bug-reports-and-feature-requests)   
   2.2. [Contribution conventions](#contribution-conventions)  
 1. [Workflow in this Repository](#workflow-in-this-repository)   

diff --git a/00_documentation/Harmonized_Variables_in_GLAD.md b/00_documentation/Harmonized_Variables_in_GLAD.md
@@ -0,0 +1,50 @@
+# Harmonized variable names
+<sup>back to the [README](https://github.com/worldbank/GLAD/blob/master/README.md) :leftwards_arrow_with_hook:</sup>
+
+This page lists the harmonized variable names used in all GLADs, regardless of which assessment, year or country the data comes from. The first table has the variables that are included in all data sets. The second table lists variable names we have harmonized but that exist only in some data sets.
+
+varname | varclass | varlabel | vartype | note
+-- | -- | -- | -- | -- |
+surveyid | key | SurveyID (Region_Year_Assessment) |  String | |
+countrycode | key | WB country code (3 letters) |  String | (a) |
+national_level | key | Idcntry_raw is a national level | Indicator (1=National) | (a) |
+idcntry_raw | id | Country ID, as coded in rawdata | Numerical or String | (b) |  
+idschool | id | School ID | Numerical | | 
+idgrade | id | Grade ID | Numerical | | 
+idclass | id | Class ID | Numerical | (c) |
+idlearner | id | Learner ID | Numerical | | 
+score_[*assessment*]\_[*subject*]\_[*pv*] | value | [Plausible value *pv*:] *assessment* score for *subject* | Numerical ||  
+level_[*assessment*]\_[*subject*]\_[*pv*] | value | [Plausible value *pv*:] *assessment* level for *subject* | Categorical | | 
+age | trait | Learner age at time of assessment | Numerical |  |
+urban | trait | School is located in urban/rural area | Indicator (1=Urban) | |  
+male | trait | Learner gender is male/female | Indicator (1=Male) |  |
+escs | trait | Learner socio-economic status _(Purposefully not labeled yet)_ | Numerical | |
+learner_weight | sample | Total learner weight | Numerical |  |
+
+**Notes:**
+
+For all assessment-years, the id variables (*idcntry_raw, idschool, idgrade, idclass, idlearner*) compose a unique id.
+
+(a) The full correspondence of *countrycode*, *national_level* and *idcntry_raw* is found in the [master countrycode list](https://github.com/worldbank/GLAD/blob/master/01_harmonization/011_rawdata/master_countrycode_list.csv). Some examples:
+* in LLECE 1997 the *countrycode* MEX is linked to both the sample from the country Mexico (idcntry_raw = 21) and the sample from the subnational unit of Nuevo León (idcntry_raw = 11). However, the first has *national_level* of 1, while the latter has *national_level* of 0. That means that both samples are found in the GLAD module ALL, but the module CLO for Mexico is calculated using only the first sample, discarding the latter.
+* in PIRLS 2001 the *countrycode* GBR is linked to both the samples from England (idcntry_raw = 926) and Scotland (idcntry_raw = 927), and both have *national_level* of 1. That means that both samples are found in the GLAD module ALL, and the module CLO for the United Kingdom is calculated by pooling both samples without distinction.
+
+(b) The variable *idcntry_raw* is preserved as found in the raw data. Most assessment-years have it as a numerical variable. The only exception so far is PASEC 1996, for which this variable is a string.
+
+(c) Some assessment-years may not have the variable _idclass_.
+
+---
+
+## Variables specific to a single assessment or year
+
+Though the variable _learner_weight_ exists in all assessments, other sample-related variables vary across assessments.
+
+varname | varclass | varlabel | vartype | note
+-- | -- | -- | -- | --
+year | key | Year of assessment | Date | PASEC, EGRA only (in multi-year bundles)
+urban_o* | trait | Original variable of urban | Categorical | PIRLS, TIMSS, SACMEQ only (whenever available)
+learner_weight_subject* | sample | Total learner weight for specific subject | Numerical | LLECE only
+strata* | sample | Strata | Numerical | LLECE, PASEC only
+jkzone | sample | Jackknife zone | Numerical | PIRLS, TIMSS, PASEC 2014 only
+jkrep | sample | Jackknife replicate code | Numerical | PIRLS, TIMSS, PASEC 2014 only
+weight_replicate* | sample | Replicate weight # | Numerical | PASEC 2014 only
diff --git a/README.md b/README.md
@@ -19,7 +19,7 @@ For an example of analysis enabled by this collection, please check the [**Learn
 Starts from the original datasets of each assessment (pulled from _eduraw_ collection in _datalibweb_ or from a local copy, directly downloaded from the data publishers)
 and ends with the creation of the dataset GLAD_ALL and GLAD_ALL-BASE. Files receive a master vintage that reflects any possible updates of a surveyid (_region_year_assessment_).
 
-Those two modules of GLAD (ALL and ALL-BASE) are at the learner level, that is, one observation corresponds to one learner or student or pupil. Both modules contain the [harmonized variables](https://github.com/worldbank/GLAD/wiki/Agreed-variables-to-include-in-GLAD-datasets), but the module ALL-BASE additionally includes all the original variables from the raw data. Since the ALL-BASE file may be very large, we recommend using the module ALL whenever possible.
+Those two modules of GLAD (ALL and ALL-BASE) are at the learner level, that is, one observation corresponds to one learner or student or pupil. Both modules contain the [harmonized variables](https://github.com/worldbank/GLAD/blob/master/00_documentation/Harmonized_Variables_in_GLAD.md), but the module ALL-BASE additionally includes all the original variables from the raw data. Since the ALL-BASE file may be very large, we recommend using the module ALL whenever possible.
 
 The output files are saved in the clone with adaptation vintage _wrk_A_, and corresponding markdown documents are generated with the same name. The assessments currently in the loop are (click on the links for each file's markdown documentation):
 
@@ -43,7 +43,7 @@ The output files are saved in the clone with adaptation vintage _wrk_A_, and cor
 
 ### Technical notes
 
-The GLAD programs by default use data from _datalibweb_. Please see [guidelines to retrieve data from datalibweb](#--Guidelines-to-Retrieve-Data-from-datalibweb). Note that _datalibweb_ requires access and authentication to the WorldBank network.
+The GLAD programs by default use data from _datalibweb_. Please see [guidelines to retrieve data from datalibweb](https://github.com/worldbank/GLAD/blob/master/00_documentation/Datalibweb_Guidelines.md). Note that _datalibweb_ requires access and authentication to the WorldBank network.
 
 The GLAD programs also make use of the _edukit_ package. The latest version of _edukit_ and installation instructions can be found in the [EduAnalyticsToolkit repo](https://github.com/worldbank/EduAnalyticsToolkit).
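
The notes in the new `Harmonized_Variables_in_GLAD.md` state two rules that are easy to check mechanically: the id variables jointly identify one learner, and module CLO keeps only samples with *national_level* of 1. The following is a minimal Python sketch of both checks, not part of the GLAD code base. The function names `duplicated_ids` and `national_only` and the sample rows are hypothetical; only the variable names and the `idcntry_raw` codes 21 and 11 come from the documentation above.

```python
from collections import Counter

# Id variables that, per the notes above, jointly identify one learner.
ID_VARS = ["idcntry_raw", "idschool", "idgrade", "idclass", "idlearner"]


def duplicated_ids(rows, id_vars=ID_VARS):
    """Return id tuples that appear more than once in rows (a list of dicts)."""
    # Note (c): idclass may be absent, so use only variables found in every row.
    present = [v for v in id_vars if all(v in row for row in rows)]
    counts = Counter(tuple(row[v] for v in present) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def national_only(rows):
    """Keep rows flagged national_level == 1, as described for module CLO."""
    return [row for row in rows if row["national_level"] == 1]


# Hypothetical rows; school, grade and learner ids are made up for illustration.
rows = [
    {"countrycode": "MEX", "national_level": 1, "idcntry_raw": 21,
     "idschool": 1, "idgrade": 3, "idlearner": 1},
    {"countrycode": "MEX", "national_level": 1, "idcntry_raw": 21,
     "idschool": 1, "idgrade": 3, "idlearner": 2},
    {"countrycode": "MEX", "national_level": 0, "idcntry_raw": 11,
     "idschool": 1, "idgrade": 3, "idlearner": 1},
]

print(duplicated_ids(rows))      # an empty list means the ids are unique
print(len(national_only(rows)))  # learners remaining in the national sample
```

Building the key from only the id variables present in every row keeps the check usable for assessment-years that lack _idclass_.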