# COMP47670 | Assignment 1 | 19342403 | Task 1 | Data Identification and Collection

## API Name: WHO Global Health Observatory 

In this assignment, data is collected from the [World Health Organization Global Health Observatory (GHO) API](https://www.who.int/data/gho/info/gho-odata-api). Python is used to prepare and analyse the collected data and to derive insights from it. The GHO is an extensive database that provides access to a wide range of health-related indicators for the 194 WHO member states. This assignment pulls data for the following health-related indicators:
* Alcohol, total per capita (15+) consumption (in litres of pure alcohol) (SDG Indicator 3.5.2)
* Mean Non-HDL cholesterol, age-standardized
* Domestic general government health expenditure (GGHE-D) as percentage of gross domestic product (GDP) (%)
* Prevalence of hypertension among adults aged 30-79 years, age-standardized
* Life expectancy at birth (years)
* Age-standardized NCD mortality rate (per 100 000 population)
* Prevalence of obesity among adults, BMI ≥ 30 (age-standardized estimate) (%)
* Estimate of current tobacco use prevalence (%) (age-standardized rate)

The alcohol consumption indicator measures total per capita consumption, in litres of pure alcohol, over a calendar year among people aged 15 and above.

The following libraries will be used in Task 1:

In [319]:
import urllib, json, requests
import pandas as pd
from json import JSONDecodeError
from pathlib import Path

The code below sets the API prefix and creates the directories that store the indicator data and the background information. The background information maps country codes to country names and regions, and indicator codes to indicator names.

In [320]:
# Prefix for API URLs
api_prefix = 'https://ghoapi.azureedge.net/api/'

indicator_dir = Path('assignment_1_indicator_data')
indicator_dir.mkdir(parents=True, exist_ok=True)

info_dir = Path('assignment_1_information_data')
info_dir.mkdir(parents=True, exist_ok=True)

## Data Collection

A function is defined to retrieve data from the API. The JSON response is then parsed into a Python object.

In [321]:
# Construct the URL for an endpoint and return the records from its 'value' field
def fetch(endpoint, filter_str=""):
    try:
        # Construct the URL with or without an OData filter
        if filter_str:
            url = api_prefix + endpoint + '?$filter=' + filter_str
        else:
            url = api_prefix + endpoint
    except TypeError:
        # Raised when endpoint is not a string, e.g. no indicator code was found
        print("Invalid indicator name")
        return None
    print(f'Fetching {url}')

    # Request the API; the timeout stops the call from hanging indefinitely
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as err:
        print(f'Request failed: {err}')
        return None

    # Parse the JSON response and return the list of records
    try:
        data = json.loads(response.text)
        value_list = data['value']
    except JSONDecodeError:
        print('Response is not valid JSON')
        return None
    except KeyError:
        print('Invalid url extension')
        return None
    return value_list
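
The filter string contains spaces and quotes, so the URL that is printed is not quite the URL that goes over the wire. As a minimal sketch of what an explicitly encoded URL could look like, the hypothetical helper `build_url` below (not part of the assignment code) percent-encodes the filter with `urllib.parse.quote` from the standard library. The constant `API_PREFIX` simply repeats the prefix assigned earlier.

```python
from urllib.parse import quote

# Same prefix as assigned earlier in the notebook
API_PREFIX = 'https://ghoapi.azureedge.net/api/'

def build_url(endpoint, filter_str=""):
    """Hypothetical helper: endpoint URL with an optional percent-encoded OData filter."""
    url = API_PREFIX + endpoint
    if filter_str:
        # quote() encodes spaces as %20 and single quotes as %27
        url += '?$filter=' + quote(filter_str)
    return url

print(build_url('WHOSIS_000001', "Dim1 eq 'SEX_BTSX'"))
# → https://ghoapi.azureedge.net/api/WHOSIS_000001?$filter=Dim1%20eq%20%27SEX_BTSX%27
```

Building the URL in a separate step like this also makes it possible to check the request without contacting the API.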


The API assigns a unique code to every indicator. The get_indicator_code function retrieves this code from the indicator name using a case-insensitive match. The code is needed to request the data for that indicator.

In [322]:
# Create function to get the indicator code from the indicator name
def get_indicator_code(indicator_name):
    # Fetch the indicator information data
    indicator_list = fetch('Indicator')
    if not indicator_list:
        return None

    # Search the indicator information for a case-insensitive match on the name
    for indicator in indicator_list:
        if indicator['IndicatorName'].lower() == indicator_name.lower():
            code = indicator['IndicatorCode']
            print(f'Found match for {indicator_name}: Code = {code}')
            return code
    return None
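
Each call to the lookup above downloads the full indicator list again. One possible alternative, sketched here under the assumption that the list has already been fetched once, is to build a dictionary from lower-cased names to codes and reuse it. The helper `build_code_lookup` is hypothetical, and the two sample records only imitate the `IndicatorName` and `IndicatorCode` fields used above.

```python
def build_code_lookup(indicator_list):
    """Hypothetical helper: map lower-cased indicator names to indicator codes."""
    return {
        item['IndicatorName'].lower(): item['IndicatorCode']
        for item in indicator_list
    }

# Illustrative records in the shape returned by the Indicator endpoint
sample = [
    {'IndicatorCode': 'WHOSIS_000001',
     'IndicatorName': 'Life expectancy at birth (years)'},
    {'IndicatorCode': 'WHS2_131',
     'IndicatorName': 'Age-standardized NCD mortality rate  (per 100 000 population)'},
]

lookup = build_code_lookup(sample)
# .get() returns None when no indicator matches the name
print(lookup.get('life expectancy at birth (years)'))
print(lookup.get('no such indicator'))
```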
    

The function below fetches data from the API and writes it to a raw JSON file. The file is saved in the directory that is passed as a parameter to the function call.

In [323]:
# Create function to get data for an indicator and save it in a JSON file
def create_file(directory, file_name, endpoint, filter_str=""):
    # Retrieve the data; fetch returns None if the request or parsing fails
    data = fetch(endpoint, filter_str)
    if not data:
        # Return before opening the file so that no empty file is left behind
        print("Data is null")
        return

    # Create the file name and output path
    out_path = directory / f'{file_name}.json'

    # Write the data to the output path as indented JSON
    print(f'Writing data to {out_path}')
    with open(out_path, 'w') as fout:
        json.dump(data, fout, indent=4)
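
To show how a file saved this way might be read back later, the following is a minimal sketch rather than part of the assignment code. The helper `load_file` is hypothetical, and the sample record uses made-up values with assumed field names. The round trip runs in a temporary directory so that no assignment data is touched.

```python
import json
import tempfile
from pathlib import Path

def load_file(directory, file_name):
    """Hypothetical helper: read back a JSON file saved as <file_name>.json."""
    in_path = Path(directory) / f'{file_name}.json'
    with open(in_path) as fin:
        return json.load(fin)

# Round trip in a temporary directory, using a made-up record
with tempfile.TemporaryDirectory() as tmp:
    records = [{'SpatialDim': 'IRL', 'TimeDim': 2019, 'NumericValue': 1.0}]
    with open(Path(tmp) / 'Example.json', 'w') as fout:
        json.dump(records, fout, indent=4)
    loaded = load_file(tmp, 'Example')

print(loaded == records)
```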


As stated earlier, the information files are needed to map country codes to country names and regions, and indicator codes to indicator names. Here two files are created, one with the country information and one with the indicator information. 

In [324]:
create_file(info_dir, "Country Information", "DIMENSION/COUNTRY/DimensionValues")
create_file(info_dir, "Indicator Information", "Indicator")

Fetching https://ghoapi.azureedge.net/api/DIMENSION/COUNTRY/DimensionValues
Writing data to assignment_1_information_data\Country Information.json
Fetching https://ghoapi.azureedge.net/api/Indicator
Writing data to assignment_1_information_data\Indicator Information.json
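
As a sketch of how the country information could later be used for mapping, the hypothetical helper below builds a dictionary from country code to country name and region. The field names `Code`, `Title` and `ParentTitle` are assumptions about the shape of the country records and should be checked against the saved file. The two sample records are illustrative only.

```python
def build_country_lookup(country_records):
    """Hypothetical helper: map country code -> (country name, region name)."""
    return {
        rec['Code']: (rec.get('Title'), rec.get('ParentTitle'))
        for rec in country_records
    }

# Illustrative records; the field names are assumed, not confirmed by the notebook
sample_countries = [
    {'Code': 'IRL', 'Title': 'Ireland', 'ParentTitle': 'Europe'},
    {'Code': 'BRA', 'Title': 'Brazil', 'ParentTitle': 'Americas'},
]

country_lookup = build_country_lookup(sample_countries)
name, region = country_lookup['IRL']
print(name, region)
```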


Then the create_file function is called once per indicator to create a file containing the data for each indicator to be studied. Where a filter is passed, it restricts the results to the SEX_BTSX (both sexes) dimension value.

In [325]:
create_file(indicator_dir, 'Health Expenditure', get_indicator_code('Domestic general government health expenditure (GGHE-D) as percentage of gross domestic product (GDP) (%)'))
create_file(indicator_dir, 'Life Expectancy', get_indicator_code('Life expectancy at birth (years)'), "Dim1 eq 'SEX_BTSX'")
create_file(indicator_dir, 'Non-communicable Disease Mortality Rate', get_indicator_code('Age-standardized NCD mortality rate  (per 100 000 population)'), "DIM1 eq 'SEX_BTSX'")
create_file(indicator_dir, 'Obesity', get_indicator_code('Prevalence of obesity among adults, BMI &GreaterEqual; 30 (age-standardized estimate) (%)'), "DIM1 eq 'SEX_BTSX'")
create_file(indicator_dir, 'Alcohol Consumption', get_indicator_code('Alcohol, total per capita (15+) consumption (in litres of pure alcohol) (SDG Indicator 3.5.2)'), "DIM1 eq 'SEX_BTSX'")
create_file(indicator_dir, 'Hypertension', get_indicator_code('Prevalence of hypertension among adults aged 30-79 years, age-standardized'), "DIM1 eq 'SEX_BTSX'")
create_file(indicator_dir, 'Cholesterol', get_indicator_code('Mean Non-HDL cholesterol, age-standardized'), "DIM1 eq 'SEX_BTSX'")
create_file(indicator_dir, 'Tobacco Use', get_indicator_code('Estimate of current tobacco use prevalence (%) (age-standardized rate)'), "DIM1 eq 'SEX_BTSX'")

Fetching https://ghoapi.azureedge.net/api/Indicator
Found match for Domestic general government health expenditure (GGHE-D) as percentage of gross domestic product (GDP) (%): Code = GHED_GGHE-DGDP_SHA2011
Fetching https://ghoapi.azureedge.net/api/GHED_GGHE-DGDP_SHA2011
Writing data to assignment_1_indicator_data\Health Expenditure.json
Fetching https://ghoapi.azureedge.net/api/Indicator
Found match for Life expectancy at birth (years): Code = WHOSIS_000001
Fetching https://ghoapi.azureedge.net/api/WHOSIS_000001?$filter=Dim1 eq 'SEX_BTSX'
Writing data to assignment_1_indicator_data\Life Expectancy.json
Fetching https://ghoapi.azureedge.net/api/Indicator
Found match for Age-standardized NCD mortality rate  (per 100 000 population): Code = WHS2_131
Fetching https://ghoapi.azureedge.net/api/WHS2_131?$filter=DIM1 eq 'SEX_BTSX'
Writing data to assignment_1_indicator_data\Non-communicable Disease Mortality Rate.json
Fetching https://ghoapi.azureedge.net/api/Indicator
Found match for Prevalenc