# Tutorial: Downloading and preprocessing data
---

In this notebook, we look at how to download and preprocess data.

In [13]:
from ggdata.scripts import download

API_name = 'WB'

config = {
    'GGI_code': 'EE2',
    'params': {'indicator': 'EG.FEC.RNEW.ZS'}
}

data = download(API_name, config, raw=False)

Downloading {'GGI_code': 'EE2', 'params': {'indicator': 'EG.FEC.RNEW.ZS'}} from WB: DONE
PreProcessing: DONE


## Quick Start

1. Choose an API
2. Define a config
3. Get the data 

#### 1. Choose the API

There are 3 options: 'WB' for the World Bank, 'SDG' for the UNSTATS SDG API, and 'CW' for Climate Watch.

#### 2. Define a config
This is the key step: different APIs use different standards and codes, so explore their documentation to find the codes you need. Then define a config, which is a dictionary with 2 keys:
- GGI_code: The name you want to give your variable
- params: A dictionary formatted according to the API documentation
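
As a rough illustration of the config shape described above, the sketch below checks an API name and a config before any download is attempted. `validate_config` and `VALID_APIS` are hypothetical names introduced here, not part of ggdata, and the assumption that `GGI_code` must be a string is ours.

```python
# Hypothetical helper, not part of ggdata: checks the shape of a config dict.
VALID_APIS = {'WB', 'SDG', 'CW'}  # the three API names listed in step 1


def validate_config(api_name, config):
    """Return a list of problems found in (api_name, config); empty if none."""
    problems = []
    if api_name not in VALID_APIS:
        problems.append(f"unknown API name: {api_name!r}")
    if not isinstance(config, dict):
        return problems + ["config must be a dictionary"]
    # Assumption: the variable name is given as a string.
    if not isinstance(config.get('GGI_code'), str):
        problems.append("'GGI_code' must be a string")
    if not isinstance(config.get('params'), dict):
        problems.append("'params' must be a dictionary")
    return problems


config = {'GGI_code': 'EE2', 'params': {'indicator': 'EG.FEC.RNEW.ZS'}}
print(validate_config('WB', config))  # an empty list means no problem was found
```

The check only looks at the outer structure; whether the content of `params` is valid still depends on the documentation of each API.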

Here are a few examples:

In [1]:
WB_config_sample = {
    'GGI_code': 'EE2',
    'params': {'indicator': 'EG.FEC.RNEW.ZS'}
}

SDG_config_sample = {
    'GGI_code': 'SE2.3',
    'params': {
        'seriesCode': 'EG_ELC_ACCS',
        'dimensions': "[{name:'Location',values:['URBAN']}]"}  # Notice how 'dimensions' is a STRING 
}

CW_config_sample = {
    'GGI_code': 'GE1.0',
    'params': {
        'source_ids[]': 81,  # The parameters of this API are numeric IDs; refer to the docs to find the ones you need
        'sector_ids[]': 957,
        'gas_ids[]': 269}
}

# For more examples, look at the ones in params/APIs
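
Since the SDG 'dimensions' parameter is a string rather than a nested Python object, writing it by hand is error-prone. The sketch below builds that string from a regular dictionary. `format_dimensions` is a hypothetical helper, not part of ggdata, and it only reproduces the format visible in the samples of this notebook.

```python
def format_dimensions(dimensions):
    """Build the SDG 'dimensions' string from a {name: [values]} mapping."""
    parts = []
    for name, values in dimensions.items():
        quoted = ','.join(f"'{value}'" for value in values)
        parts.append(f"{{name:'{name}',values:[{quoted}]}}")
    return '[' + ','.join(parts) + ']'


sdg_config = {
    'GGI_code': 'SE2.3',
    'params': {
        'seriesCode': 'EG_ELC_ACCS',
        'dimensions': format_dimensions({'Location': ['URBAN']}),
    },
}
print(sdg_config['params']['dimensions'])
```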

#### 3. Get the data

The download function has 2 additional parameters:
- raw: set it to True to get the raw API response, or to False to get the preprocessed data
- path: saves the file directly at the given path

Progress messages tell you how the download is going.

Using the SDG API

In [2]:
from ggdata.scripts import download

data = download('SDG', SDG_config_sample, raw=False) 

Downloading {'GGI_code': 'SE2.3', 'params': {'seriesCode': 'EG_ELC_ACCS', 'dimensions': "[{name:'Location',values:['URBAN']}]"}} from SDG: DONE
PreProcessing: DONE


In [3]:
data.head(3)

Unnamed: 0,Country,ISO,Description,Source,Year,Value,Variable,From,URL,DownloadDate
15,Afghanistan,AFG,Proportion of population with access to electr...,World Bank,2005,74.0,SE2.3,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11
16,Afghanistan,AFG,Proportion of population with access to electr...,World Bank,2006,79.88927,SE2.3,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11
17,Afghanistan,AFG,Proportion of population with access to electr...,World Bank,2007,81.68705,SE2.3,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11


Using the World Bank API to get the raw data

In [4]:
data = download('WB', WB_config_sample, raw=True) 

Downloading {'GGI_code': 'EE2', 'params': {'indicator': 'EG.FEC.RNEW.ZS'}} from WB: DONE


In [5]:
data

{'data': [{'page': 1,
   'pages': 1,
   'per_page': 16104,
   'total': 16104,
   'sourceid': '2',
   'lastupdated': '2020-10-07',
   'Source': 'World Bank, Sustainable Energy for All (SE4ALL) database from the SE4ALL Global Tracking Framework led jointly by the World Bank, International Energy Agency, and the Energy Sector Management Assistance Program.'},
  [{'indicator': {'id': 'EG.FEC.RNEW.ZS',
     'value': 'Renewable energy consumption (% of total final energy consumption)'},
    'country': {'id': '1A', 'value': 'Arab World'},
    'countryiso3code': 'ARB',
    'date': '2020',
    'value': None,
    'unit': '',
    'obs_status': '',
    'decimal': 2},
   {'indicator': {'id': 'EG.FEC.RNEW.ZS',
     'value': 'Renewable energy consumption (% of total final energy consumption)'},
    'country': {'id': '1A', 'value': 'Arab World'},
    'countryiso3code': 'ARB',
    'date': '2019',
    'value': None,
    'unit': '',
    'obs_status': '',
    'decimal': 2},
   {'indicator': {'id': 'EG.FEC
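
The raw response shown above is nested: 'data' holds a paging header followed by the list of observations. As a minimal sketch, assuming only the structure visible in that output, the records can be flattened by hand as follows. `flatten_wb` is a hypothetical helper and not the preprocessing that ggdata applies; the last sample record uses a made-up value for illustration only.

```python
def flatten_wb(raw):
    """Turn a raw World Bank response into flat rows, skipping empty values."""
    header, records = raw['data']  # first item: paging info, second: observations
    rows = []
    for record in records or []:
        if record['value'] is None:  # no observation for that year
            continue
        rows.append({
            'Country': record['country']['value'],
            'ISO': record['countryiso3code'],
            'Year': int(record['date']),
            'Value': record['value'],
        })
    return rows


sample = {'data': [
    {'page': 1, 'pages': 1},
    [
        {'country': {'id': '1A', 'value': 'Arab World'},
         'countryiso3code': 'ARB', 'date': '2020', 'value': None},
        # Made-up value, for illustration only (not real data).
        {'country': {'id': '1A', 'value': 'Arab World'},
         'countryiso3code': 'ARB', 'date': '2015', 'value': 1.5},
    ],
]}
print(flatten_wb(sample))
```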

Using the CW API to **download** a file

In [6]:
data = download('CW', CW_config_sample, raw=False, path='./')  # ./ to save in the current directory

Downloading {'GGI_code': 'GE1.0', 'params': {'source_ids[]': 81, 'sector_ids[]': 957, 'gas_ids[]': 269}} from CW: DONE
PreProcessing: DONE
Saving at ./GE1.0_CW.csv: DONE
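
The log above suggests that the file is named after the GGI code and the API name. The sketch below rebuilds the expected location from that pattern. This is an assumption inferred from a single log line, not a documented ggdata behavior, and `expected_csv_path` is a hypothetical helper.

```python
import os


def expected_csv_path(path, api_name, config):
    """Guess where the CSV was saved, based on the pattern seen in the log."""
    filename = f"{config['GGI_code']}_{api_name}.csv"  # e.g. GE1.0_CW.csv
    return os.path.join(path, filename)


cw_config = {
    'GGI_code': 'GE1.0',
    'params': {'source_ids[]': 81, 'sector_ids[]': 957, 'gas_ids[]': 269},
}
saved_at = expected_csv_path('./', 'CW', cw_config)
print(saved_at, os.path.exists(saved_at))  # the second value tells you if the file is there
```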


## Using Submodules

If need be, you can use the modules behind the download function directly for more customized processing.
If a new API is to be added, reusing the same architecture will simplify the work a lot.

### Getting data

1. Define a Downloader
2. Define the request parameters
3. Get the data
4. Save the data

**IMPORTANT**: The key part of this step is to define the parameters correctly. Each API has its own classification, which can be checked in its documentation. You can look at the ones in params/APIs for some examples.
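
To see what a parameter dictionary looks like once it reaches an API, it can help to encode it as a URL query string with the standard library. Whether the ggdata downloaders send the parameters exactly this way is an assumption; the sketch only shows the generic encoding, which is handy for testing a query in a browser.

```python
from urllib.parse import urlencode

base_url = 'https://unstats.un.org/SDGAPI/v1/sdg/Series/Data'
sdg_parameters = {
    'seriesCode': 'SL_TLF_NEET',
    'dimensions': "[{name:'Sex',values:['BOTHSEX']},{name:'Age',values:['15-24']}]",
}

# Keys and values are percent-encoded, so brackets and quotes are escaped.
full_url = f"{base_url}?{urlencode(sdg_parameters)}"
print(full_url)

# Keys such as 'source_ids[]' (Climate Watch) are escaped as well.
cw_query = urlencode({'source_ids[]': 81, 'sector_ids[]': 957, 'gas_ids[]': 269})
print(cw_query)
```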

In [7]:
from ggdata.downloaders.downloader import SDG_Downloader

Downloader = SDG_Downloader('https://unstats.un.org/SDGAPI/v1/sdg/Series/Data')

parameters = {'seriesCode': 'SL_TLF_NEET',
              'dimensions': "[{name:'Sex',values:['BOTHSEX']},{name:'Age',values:['15-24']}]"}

data = Downloader.get_data(parameters)

The data is a dictionary with 2 keys: 'data' and 'metadata'. 'data' contains the actual response from the API, while 'metadata' contains the URL and the download date.

In [8]:
print(data['metadata'])

{'URL': 'https://unstats.un.org/SDGAPI/v1/sdg/Series/Data', 'DownloadDate': '2020-10-11'}
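
If you add a new API, returning the same 'data' and 'metadata' structure keeps it compatible with the rest of the workflow. The class below is a minimal sketch of that idea, not the actual ggdata implementation: `SketchDownloader`, its `fetch` argument, `fake_fetch` and the example.org URL are all hypothetical, and the fetch function is injected so that the example runs without network access.

```python
from datetime import date


class SketchDownloader:
    """Hypothetical downloader that mirrors the return structure shown above."""

    def __init__(self, url, fetch):
        self.url = url
        self.fetch = fetch  # callable(url, parameters) -> parsed API response

    def get_data(self, parameters):
        return {
            'data': self.fetch(self.url, parameters),
            'metadata': {
                'URL': self.url,
                'DownloadDate': date.today().isoformat(),  # YYYY-MM-DD
            },
        }


def fake_fetch(url, parameters):
    # Stand-in for a real HTTP call, so that the sketch runs offline.
    return [{'series': parameters['seriesCode']}]


downloader = SketchDownloader('https://example.org/api', fake_fetch)
result = downloader.get_data({'seriesCode': 'SL_TLF_NEET'})
print(sorted(result))  # ['data', 'metadata']
```

In a real downloader, `fetch` would wrap the HTTP request and the JSON decoding for the new API.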


In [9]:
print(data['data'][0])

{'goal': ['8'], 'target': ['8.6'], 'indicator': ['8.6.1'], 'series': 'SL_TLF_NEET', 'seriesDescription': 'Proportion of youth not in education, employment or training, by sex and age (%)', 'seriesCount': '4585', 'geoAreaCode': '4', 'geoAreaName': 'Afghanistan', 'timePeriodStart': 2017.0, 'value': '42.0', 'valueType': 'Float', 'time_detail': None, 'timeCoverage': None, 'upperBound': None, 'lowerBound': None, 'basePeriod': None, 'source': 'HIES - Living Condition Survey', 'geoInfoUrl': None, 'footnotes': ['Repository: ILO-STATISTICS - Micro data processing'], 'attributes': {'Nature': 'C', 'Units': 'PERCENT'}, 'dimensions': {'Age': '15-24', 'Sex': 'BOTHSEX', 'Reporting Type': 'G'}}


To save the data directly, you can use the following code: 

In [10]:
data = Downloader.download_data('example.json', parameters) # Save the data as a JSON file

Then, to open the JSON file, use the following code:

In [11]:
import json
with open('example.json') as f:
    data = json.load(f)
print(data['data'][0])

{'goal': ['8'], 'target': ['8.6'], 'indicator': ['8.6.1'], 'series': 'SL_TLF_NEET', 'seriesDescription': 'Proportion of youth not in education, employment or training, by sex and age (%)', 'seriesCount': '4585', 'geoAreaCode': '4', 'geoAreaName': 'Afghanistan', 'timePeriodStart': 2017.0, 'value': '42.0', 'valueType': 'Float', 'time_detail': None, 'timeCoverage': None, 'upperBound': None, 'lowerBound': None, 'basePeriod': None, 'source': 'HIES - Living Condition Survey', 'geoInfoUrl': None, 'footnotes': ['Repository: ILO-STATISTICS - Micro data processing'], 'attributes': {'Nature': 'C', 'Units': 'PERCENT'}, 'dimensions': {'Age': '15-24', 'Sex': 'BOTHSEX', 'Reporting Type': 'G'}}


The data was saved properly!

### Processing the data

1. Define a preprocessor
2. Define additional information
3. Process the data


In [12]:
from ggdata.preprocessors.SDG import SDG_Preprocessor

Preprocessor = SDG_Preprocessor('test')  # 'test' is the file argument, change it as needed (used to preprocess special cases)

information = {'Variable': 'Test', 'From': 'SDG API'}  # Lets you add more information to the dataframe

df = Preprocessor.preprocess(data, information)

df.head()

Unnamed: 0,Country,ISO,Description,Source,Year,Value,Variable,From,URL,DownloadDate
0,Afghanistan,AFG,"Proportion of youth not in education, employme...",HIES - Living Condition Survey,2017,42.0,Test,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11
1,Albania,ALB,"Proportion of youth not in education, employme...",LFS - Labour Force Survey,2007,33.8,Test,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11
2,Albania,ALB,"Proportion of youth not in education, employme...",LFS - Labour Force Survey,2008,28.1,Test,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11
3,Albania,ALB,"Proportion of youth not in education, employme...",LFS - Labour Force Survey,2009,30.7,Test,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11
4,Albania,ALB,"Proportion of youth not in education, employme...",LFS - Labour Force Survey,2010,29.5,Test,SDG API,https://unstats.un.org/SDGAPI/v1/sdg/Series/Data,2020-10-11


The data is now preprocessed into a clean and standardized dataframe.
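
To give an idea of what this standardization involves, the sketch below flattens a single raw SDG record into a row with the column names of the dataframe above. It is not the code of `SDG_Preprocessor`: `record_to_row` is a hypothetical helper, and the ISO column is left out because the raw record only carries a numeric `geoAreaCode`, which would need a lookup table.

```python
def record_to_row(record, information, metadata):
    """Flatten one raw SDG record into a row of the standardized table."""
    row = {
        'Country': record['geoAreaName'],
        'Description': record['seriesDescription'],
        'Source': record['source'],
        'Year': int(record['timePeriodStart']),  # 2017.0 -> 2017
        'Value': float(record['value']),         # '42.0' -> 42.0
    }
    row.update(information)  # e.g. 'Variable' and 'From'
    row.update(metadata)     # 'URL' and 'DownloadDate'
    return row


# Subset of the first record printed earlier in this notebook.
record = {
    'geoAreaName': 'Afghanistan',
    'seriesDescription': 'Proportion of youth not in education, employment '
                         'or training, by sex and age (%)',
    'source': 'HIES - Living Condition Survey',
    'timePeriodStart': 2017.0,
    'value': '42.0',
}
information = {'Variable': 'Test', 'From': 'SDG API'}
metadata = {'URL': 'https://unstats.un.org/SDGAPI/v1/sdg/Series/Data',
            'DownloadDate': '2020-10-11'}

row = record_to_row(record, information, metadata)
print(row['Country'], row['Year'], row['Value'], row['Variable'])
```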