# UN Sustainable Development Goals

This notebook implements the pre-processing needed for importing the UN SDG dataset into OWID's grapher database.
A rough outline of the process:

  1. Read the dataset exported from the UN SDG Indicators database website [[1]](#Data-loading-and-preprocessing)
  2. Export the referenced _entities_ (geographic areas) [[2]](#Export-entities-(dimension-members))
  3. _Reconcile_ those entities with OpenRefine and [OWID's geographic entities reconciliation service](https://github.com/owid/lc-reconcile/)
  4. Generate a separate table for every combination of distinct values of geographic entities and other nominal variables [[3]](#Export-datasets-and-variables)
  5. Export a `variables.csv` file, a `datasets.csv` file, and one `*_datapoints.csv` file for each generated table [[4]](#Export-data)

In [1]:
import pandas as pd
import numpy as np
import collections
import itertools
import functools
import math
import qgrid


pd.options.display.max_columns = None

## Data loading and preprocessing

The data was obtained from the [UN SDG Indicators database](https://unstats.un.org/sdgs/indicators/database). We selected all _Goals_ (topmost category in the classification of indicators) and requested the entire dataset. 

In [2]:
data = pd.read_csv(
    "data/20190903150325064_drifter4e@gmail.com_data.csv", low_memory=False
)

In [3]:
data[['SeriesDescription']]

Unnamed: 0,SeriesDescription
0,Proportion of population below international p...
1,Proportion of population below international p...
2,Proportion of population below international p...
3,Proportion of population below international p...
4,Proportion of population below international p...
5,Proportion of population below international p...
6,Proportion of population below international p...
7,Proportion of population below international p...
8,Proportion of population below international p...
9,Proportion of population below international p...


Keep the indicators that we care about (list taken from [the old importer](https://github.com/owid/owid-importer/blob/master/importer_django/un_sdg_importer.py)).

In [4]:
INDICATORS = [
'1.1.1','1.2.1','1.3.1','1.5.1','1.5.2','1.5.3','2.1.1','2.1.2','2.2.1','2.2.2','2.5.1','2.5.2','2.a.1','2.a.2','2.c.1','3.1.1','3.1.2','3.2.1','3.2.2','3.3.1','3.3.2','3.3.3','3.3.5','3.4.1','3.4.2','3.5.2','3.6.1','3.7.1','3.7.2','3.9.1','3.9.2','3.9.3','3.a.1','3.b.2','3.c.1','3.d.1','4.1.1','4.2.1','4.2.2','4.3.1','4.4.1','4.5.1','4.6.1','4.a.1','4.b.1','4.c.1','5.2.1','5.3.1','5.3.2','5.4.1','5.5.1','5.5.2','5.6.1','5.b.1','6.1.1','6.2.1','6.4.2','6.5.1','6.a.1','6.b.1','7.1.1','7.1.2','7.2.1','7.3.1','8.1.1','8.2.1','8.3.1','8.4.1','8.4.2','8.5.1','8.5.2','8.6.1','8.7.1','8.8.1','8.10.1','8.10.2','8.a.1','9.1.2','9.2.1','9.2.2','9.4.1','9.5.1','9.5.2','9.a.1','9.b.1','9.c.1','10.1.1','10.4.1','10.6.1','10.a.1','10.b.1','10.c.1','11.1.1','11.5.1','11.5.2','11.6.1','11.6.2','11.b.1','12.2.1','12.2.2','12.4.1','13.1.1','13.1.2','14.4.1','14.5.1','15.1.1','15.1.2','15.2.1','15.4.1','15.4.2','15.5.1','15.6.1','15.a.1','15.b.1','16.1.1','16.2.1','16.2.2','16.2.3','16.3.2','16.5.2','16.8.1','16.9.1','16.10.1','16.10.2','16.a.1','17.2.1','17.3.2','17.4.1','17.6.2','17.8.1','17.9.1','17.10.1','17.11.1','17.12.1','17.15.1','17.16.1','17.18.2','17.18.3','17.19.1','17.19.2'
]

data = data[data.Indicator.isin(INDICATORS)]
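Since `isin` silently ignores codes that do not occur in the data, it may be worth checking which requested indicators are absent from the export. The following is a minimal sketch on a toy frame; `toy`, `wanted`, `kept` and `missing` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real `data` frame; only the `Indicator` column matters here.
toy = pd.DataFrame(
    {
        "Indicator": ["1.1.1", "1.1.1", "2.1.1", "99.9.9"],
        "Value": [10.0, 11.0, 3.5, 0.0],
    }
)
wanted = ["1.1.1", "2.1.1", "3.1.1"]  # hypothetical subset of INDICATORS

# Same filter as in the notebook: keep rows whose indicator is in the list.
kept = toy[toy["Indicator"].isin(wanted)]

# Requested indicators that never appear in the data.
missing = sorted(set(wanted) - set(toy["Indicator"]))
print(len(kept), missing)
```

An empty `missing` list would suggest that every requested indicator is present in the downloaded file.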


## Export entities (dimension members)

We only deal with the _geographic_ and _temporal_ dimensions. First, produce a list of the geographic areas (countries and regions) included in the SDG data file.

This list of names, as they appear in the SDG dataset, will be reconciled through OWID's reconciler.

In [5]:
dim_geo_areas = data[['GeoAreaCode', 'GeoAreaName']] \
    .drop_duplicates() \
    .rename(columns={'GeoAreaCode': 'id', 'GeoAreaName': 'name'})

In [6]:
dim_geo_areas

Unnamed: 0,id,name
0,1,World
12,5,South America
24,8,Albania
29,9,Oceania
41,11,Western Africa
51,12,Algeria
53,13,Central America
65,14,Eastern Africa
77,15,Northern Africa
89,17,Middle Africa


In [20]:
dim_geo_areas.to_csv('./sdg_geo_areas.csv', index=False)

We take the output of the reconciliation process (saved as `sdg_owid_countries.csv`) and `merge` it with our data file.

In [None]:
sdg_owid_countries = pd.read_csv('./sdg_owid_countries.csv')
data = data.merge(sdg_owid_countries, left_on='GeoAreaCode', right_on='id')
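`merge` performs an inner join by default, so rows whose `GeoAreaCode` has no match in the reconciled file are dropped without warning. A possible way to see which areas would be lost is a left join with `indicator=True`. The sketch below uses toy frames; the `owid_entity_id` values are made up, and the assumption that the reconciled file carries an `owid_entity_id` column is inferred from its later use in the notebook.

```python
import pandas as pd

# Toy stand-ins for `data` and `sdg_owid_countries`.
toy_data = pd.DataFrame({"GeoAreaCode": [1, 8, 12], "Value": [0.5, 0.7, 0.9]})
toy_reconciled = pd.DataFrame({"id": [1, 8], "owid_entity_id": [355, 16]})  # invented ids

# A left join keeps every data row; `_merge` records whether a match was found.
checked = toy_data.merge(
    toy_reconciled,
    left_on="GeoAreaCode",
    right_on="id",
    how="left",
    indicator=True,
)

# Geographic areas with no reconciled counterpart.
unmatched = checked.loc[checked["_merge"] == "left_only", "GeoAreaCode"].tolist()
print(unmatched)
```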

## Export datasets and variables

Algorithm outline:

  - For each indicator and series:
    - Obtain dimensions (columns named `[between brackets]`) that contain non-null values
      - For each combination of unique values in those dimensions
        - Generate a table of values.
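
The dimension-detection step of the outline above can be sketched on a toy frame. The column names `[Sex]`, `[Age]` and `[Location]` are illustrative, and `bracketed`, `non_null` and `relevant` are hypothetical names for the intermediate results.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "SeriesCode": ["S1"] * 4,
        "Value": [1.0, 2.0, 3.0, 4.0],
        "[Sex]": ["FEMALE", "MALE", "FEMALE", "MALE"],
        "[Age]": ["ALLAGE"] * 4,       # constant: cannot split the series
        "[Location]": [np.nan] * 4,    # entirely missing: not used by this series
    }
)

# Dimension columns are the ones whose name is wrapped in square brackets.
bracketed = [c for c in toy.columns if c.startswith("[") and c.endswith("]")]

# Keep dimensions with at least one non-null value.
non_null = [c for c in bracketed if toy[c].notna().any()]

# Only dimensions with more than one distinct value (NaN included) split the data.
relevant = [c for c in non_null if toy[c].nunique(dropna=False) > 1]
print(relevant)
```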


In [None]:
DIMENSIONS = [c for c in data.columns if c[0] == '[' and c[-1] == ']']
NON_DIMENSIONS = [c for c in data.columns if c not in set(DIMENSIONS)]

@functools.lru_cache(maxsize=256)
def get_series_with_relevant_dimensions(indicator, series):
    """ For a given indicator and series, return a tuple:
    
      - data filtered to that indicator and series
      - names of relevant dimensions
      - unique values for each relevant dimension
    """
    data_filtered = data[(data.Indicator == indicator) & (data.SeriesCode == series)]
    non_null_dimensions_columns = [col for col in DIMENSIONS if data_filtered.loc[:, col].notna().any()]
    dimension_names = []
    dimension_unique_values = []
    
    for c in non_null_dimensions_columns:
        uniques = data_filtered[c].unique()
        if len(uniques) > 1:
            dimension_names.append(c)
            dimension_unique_values.append(list(uniques))

    return (data_filtered[NON_DIMENSIONS + dimension_names], dimension_names, dimension_unique_values)

Generate tables for:

  - Series with no relevant dimensions: a single table containing all rows
  - Otherwise, one table for each combination of unique values of the relevant dimensions (a missing value counts as a value of its own)
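
As a rough illustration of the second case, the sketch below splits a toy frame by every combination of dimension values. It assumes missing members are matched with `isna()`, because NaN never compares equal to itself. All names and values are illustrative.

```python
import itertools

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Value": [1.0, 2.0, 3.0, 4.0],
        "[Sex]": ["FEMALE", "MALE", "FEMALE", np.nan],
        "[Location]": ["URBAN", "URBAN", "RURAL", "RURAL"],
    }
)
dims = ["[Sex]", "[Location]"]
values = [list(toy[d].unique()) for d in dims]

tables = {}
for combo in itertools.product(*values):
    # Start from an all-True mask and narrow it one dimension at a time.
    mask = pd.Series(True, index=toy.index)
    for dim, value in zip(dims, combo):
        # Missing members need isna(); an equality test would never match them.
        mask &= toy[dim].isna() if pd.isna(value) else (toy[dim] == value)
    tables[combo] = toy[mask].drop(columns=dims)

print({combo: len(table) for combo, table in tables.items()})
```

Because the Cartesian product includes combinations that never occur in the data, some of the resulting tables can be empty.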

In [None]:
@functools.lru_cache(maxsize=256)
def generate_tables_for_indicator_and_series(indicator, series):
    tables_by_combination = {}
    data_filtered, dimensions, dimension_values = get_series_with_relevant_dimensions(indicator, series)
    if len(dimensions) == 0:
        # no additional dimensions
        export = data_filtered
        return export
    else:
        for dimension_value_combination in itertools.product(*dimension_values):
            # build the row filter incrementally, starting from an all-True boolean Series
            filt = pd.Series(True, index=data_filtered.index)
            for dim_idx, dim_value in enumerate(dimension_value_combination):
                dimension_name = dimensions[dim_idx]
                # NaN never equals itself, so missing members are matched with isnull()
                value_is_nan = isinstance(dim_value, float) and math.isnan(dim_value)
                filt = filt \
                       & (data_filtered[dimension_name].isnull() if value_is_nan else data_filtered[dimension_name] == dim_value)

            tables_by_combination[dimension_value_combination] = data_filtered[filt].drop(dimensions, axis=1)
            
        return tables_by_combination
    

In [None]:
all_series = data[['Indicator', 'SeriesCode', 'SeriesDescription', 'Units']] \
  .groupby(by=['Indicator', 'SeriesCode', 'SeriesDescription', 'Units']) \
  .count() \
  .reset_index()

### Export data

For each series and combination of additional dimensions' members, generate an entry in the `variables` table.

In [None]:
DF_COLS_VARIABLES = ['Indicator', 'SeriesCode', 'VariableDescription', 'Units', 'variable_idx']
DF_COLS_DATASETS = ['Indicator', 'SeriesCode', 'SeriesDescription']
DF_COLS_DATAPOINTS = ['Value', 'TimePeriod', 'Time_Detail', 'Source', 'FootNote', 'Nature', 'owid_entity_id']
# DataFrame.append was removed in pandas 2.0, so rows are collected in lists
# and the frames are built once after the loop.
variable_rows = []
dataset_rows = []

variable_idx = 0

for i, row in all_series.iterrows():
    dataset_rows.append(
        {
            'Indicator': row['Indicator'], 
            'SeriesCode': row['SeriesCode'], 
            'SeriesDescription': row['SeriesDescription']
        })
    _, dimensions, dimension_members = get_series_with_relevant_dimensions(row['Indicator'], row['SeriesCode'])
    
    if len(dimensions) == 0:
        # no additional dimensions: the series maps to a single variable
        table = generate_tables_for_indicator_and_series(row['Indicator'], row['SeriesCode'])
        variable = { 
            'Indicator': row['Indicator'], 'SeriesCode': row['SeriesCode'], 
            'VariableDescription': row['SeriesDescription'], 'Units': row['Units'],
            'variable_idx': variable_idx
        }
        variable_rows.append(variable)
        table[DF_COLS_DATAPOINTS].to_csv('./exported_data/%04d_datapoints.csv' % variable_idx, index=False)
        variable_idx += 1

    else:
        # has additional dimensions: one variable per combination of members
        for member_combination, table in generate_tables_for_indicator_and_series(row['Indicator'], row['SeriesCode']).items():
            variable = { 
                'Indicator': row['Indicator'], 'SeriesCode': row['SeriesCode'], 
                'Units': row['Units'],
                'VariableDescription': row['SeriesDescription'] + " %s" % ( ' - '.join(map(str, member_combination))),
                'variable_idx': variable_idx
            }
            variable_rows.append(variable)
            table[DF_COLS_DATAPOINTS].to_csv('./exported_data/%04d_datapoints.csv' % variable_idx, index=False)
            variable_idx += 1


variables = pd.DataFrame(variable_rows, columns=DF_COLS_VARIABLES)
datasets = pd.DataFrame(dataset_rows, columns=DF_COLS_DATASETS)

variables.to_csv('./exported_data/variables.csv', index=False)
datasets.to_csv('./exported_data/datasets.csv', index=False)
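
To make the naming scheme in the loop above concrete, here is a small sketch of how a variable description and its datapoints file name are derived. `describe_variable` and `datapoints_filename` are hypothetical helpers that mirror the string formatting used in the loop, and `"Example series"` is a placeholder description.

```python
def describe_variable(series_description, member_combination):
    """Append the dimension members, joined by ' - ', to the series description."""
    return series_description + " %s" % " - ".join(map(str, member_combination))


def datapoints_filename(variable_idx):
    """Zero-pad the variable index to four digits, as in the export loop."""
    return "%04d_datapoints.csv" % variable_idx


print(describe_variable("Example series", ("FEMALE", "URBAN")))
print(describe_variable("Example series", (float("nan"), "RURAL")))
print(datapoints_filename(7))
```

Note that a missing dimension member is rendered as the literal text `nan` in the description, which may be worth replacing with a more readable label before import.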