# Tutorial: Downloading and preprocessing data


In this notebook, we look at how to download and preprocess data, using the SDG API as an example.

## Getting data

1. Define a Downloader
2. Define the request parameters
3. Get the data
4. Save the data

IMPORTANT: The key part of this step is defining the parameters correctly. Each API has its own classification, which can be checked in its documentation; see params/APIs for some examples.
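
Because the `dimensions` parameter in the request below is a hand-written string, a small helper can reduce quoting mistakes. The following is a minimal sketch, assuming the format is exactly the one used in this tutorial; `build_dimensions` is a hypothetical name, not part of the project or of the SDG API, and other APIs will likely expect a different format.

```python
def build_dimensions(dimensions):
    """Format a {name: [values]} mapping like the 'dimensions' string in this tutorial.

    Hypothetical helper: it only reproduces the string format shown in the
    tutorial's request and is not taken from the project or the SDG API.
    """
    parts = []
    for name, values in dimensions.items():
        # Wrap every value in single quotes and join them with commas
        quoted = ",".join(f"'{value}'" for value in values)
        parts.append(f"{{name:'{name}',values:[{quoted}]}}")
    return "[" + ",".join(parts) + "]"


parameters = {
    'seriesCode': 'SL_TLF_NEET',
    'dimensions': build_dimensions({'Sex': ['BOTHSEX'], 'Age': ['15-24']}),
}
print(parameters['dimensions'])
```

Keeping the dimensions in a dictionary also makes it easier to reuse the same request with a different sex or age group.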

In [15]:
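from src.downloaders.downloader import SDG_Downloader

# Point the downloader at the SDG API endpoint for series data
Downloader = SDG_Downloader('https://unstats.un.org/SDGAPI/v1/sdg/Series/Data')

# Request the NEET series, restricted to both sexes and ages 15-24
parameters = {'seriesCode': 'SL_TLF_NEET',
              'dimensions': "[{name:'Sex',values:['BOTHSEX']},{name:'Age',values:['15-24']}]"}

# Send the request and keep the returned dictionary
data = Downloader.get_data(parameters)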

The data is a dictionary with two keys: 'data' and 'metadata'. 'data' contains the actual response from the API, while 'metadata' contains information about the URL and the download date.

In [16]:
print(data['metadata'])

{'URL': 'https://unstats.un.org/SDGAPI/v1/sdg/Series/Data', 'DownloadDate': '2020-10-09'}
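
To make the two-key structure concrete, here is a minimal sketch of how such a dictionary could be assembled. It is an illustration only: `wrap_response` is a hypothetical function and not necessarily how `SDG_Downloader` builds its result. The key names follow the output shown above.

```python
from datetime import date


def wrap_response(records, url, download_date=None):
    """Bundle API records with the request URL and the download date.

    Hypothetical helper mirroring the 'data'/'metadata' layout shown above.
    """
    if download_date is None:
        download_date = date.today()
    return {
        'data': records,
        'metadata': {'URL': url, 'DownloadDate': download_date.isoformat()},
    }


# A fixed date is passed here so that the example is reproducible
wrapped = wrap_response(
    [{'series': 'SL_TLF_NEET', 'value': '42.0'}],
    'https://unstats.un.org/SDGAPI/v1/sdg/Series/Data',
    download_date=date(2020, 10, 9),
)
print(wrapped['metadata'])
```

Storing the URL and date next to the records means a saved file still documents where and when it was obtained.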


In [17]:
print(data['data'][0])

{'goal': ['8'], 'target': ['8.6'], 'indicator': ['8.6.1'], 'series': 'SL_TLF_NEET', 'seriesDescription': 'Proportion of youth not in education, employment or training, by sex and age (%)', 'seriesCount': '4585', 'geoAreaCode': '4', 'geoAreaName': 'Afghanistan', 'timePeriodStart': 2017.0, 'value': '42.0', 'valueType': 'Float', 'time_detail': None, 'timeCoverage': None, 'upperBound': None, 'lowerBound': None, 'basePeriod': None, 'source': 'HIES - Living Condition Survey', 'geoInfoUrl': None, 'footnotes': ['Repository: ILO-STATISTICS - Micro data processing'], 'attributes': {'Nature': 'C', 'Units': 'PERCENT'}, 'dimensions': {'Age': '15-24', 'Sex': 'BOTHSEX', 'Reporting Type': 'G'}}


To save the data directly, you can use the following code:

In [18]:
data = Downloader.download_data('example.json', parameters) # Save the data as a JSON file

Then, to open the JSON file, use the following code:

In [14]:
import json
with open('example.json') as f:
    data = json.load(f)
print(data['data'][0])

{'goal': ['8'], 'target': ['8.6'], 'indicator': ['8.6.1'], 'series': 'SL_TLF_NEET', 'seriesDescription': 'Proportion of youth not in education, employment or training, by sex and age (%)', 'seriesCount': '4585', 'geoAreaCode': '4', 'geoAreaName': 'Afghanistan', 'timePeriodStart': 2017.0, 'value': '42.0', 'valueType': 'Float', 'time_detail': None, 'timeCoverage': None, 'upperBound': None, 'lowerBound': None, 'basePeriod': None, 'source': 'HIES - Living Condition Survey', 'geoInfoUrl': None, 'footnotes': ['Repository: ILO-STATISTICS - Micro data processing'], 'attributes': {'Nature': 'C', 'Units': 'PERCENT'}, 'dimensions': {'Age': '15-24', 'Sex': 'BOTHSEX', 'Reporting Type': 'G'}}


The first record matches the one returned by the API above, so the data was saved properly.
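
A save can also be checked programmatically by comparing the reloaded object with the one held in memory. The sketch below uses only the standard library, a small stand-in dictionary and a temporary file, so it does not depend on the downloader or on network access; the sample record is abbreviated from the output above.

```python
import json
import os
import tempfile

# Stand-in for the dictionary returned by the downloader (abbreviated)
original = {
    'data': [{'series': 'SL_TLF_NEET', 'geoAreaName': 'Afghanistan',
              'value': '42.0', 'time_detail': None}],
    'metadata': {'URL': 'https://unstats.un.org/SDGAPI/v1/sdg/Series/Data',
                 'DownloadDate': '2020-10-09'},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.json')
    # Write the dictionary to disk, then read it back
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(original, f)
    with open(path, encoding='utf-8') as f:
        reloaded = json.load(f)

# True when every key and value survived the round trip
round_trip_ok = reloaded == original
print(round_trip_ok)
```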

## Processing the data

1. Define a preprocessor
2. Process the data
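
The two steps above can be sketched as follows. The project's own preprocessor classes are not shown in this section, so `RecordFlattener` and its `process` method are hypothetical stand-ins; the sketch only assumes the record layout printed earlier in this tutorial.

```python
class RecordFlattener:
    """Hypothetical preprocessor that turns nested API records into flat rows."""

    def __init__(self, fields):
        # Names of the top-level record fields to copy into each row
        self.fields = fields

    def process(self, data):
        rows = []
        for record in data['data']:
            row = {field: record.get(field) for field in self.fields}
            # Convert the value to a float when the API labels it as one
            raw_value = record.get('value')
            is_float = record.get('valueType') == 'Float' and raw_value is not None
            row['value'] = float(raw_value) if is_float else raw_value
            # Lift the nested dimensions (e.g. Age, Sex) to top-level keys
            row.update(record.get('dimensions') or {})
            rows.append(row)
        return rows


# Stand-in input, abbreviated from the record shown earlier
sample = {
    'data': [{'geoAreaName': 'Afghanistan', 'timePeriodStart': 2017.0,
              'value': '42.0', 'valueType': 'Float',
              'dimensions': {'Age': '15-24', 'Sex': 'BOTHSEX', 'Reporting Type': 'G'}}],
    'metadata': {},
}

# 1. Define a preprocessor
preprocessor = RecordFlattener(fields=['geoAreaName', 'timePeriodStart'])
# 2. Process the data
rows = preprocessor.process(sample)
print(rows[0])
```

Flat rows of this kind are easy to load into a table, which is usually the goal of preprocessing nested API responses.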
