# SEDAC Introduction - How to Work with this Dataset
---
This dataset collects several data sources in tabular and geospatial formats. Some of them need preprocessing, or functions you may not have used before, to access. This notebook provides a short getting-started guide for each dataset, in the following order:

1. 1x1 Gridded Emissions
2. Trends in Global Freshwater Availability
3. Development Threat Index
4. Environmental Performance Index
5. Climate Effects on Food Supply
6. Human Modification of Terrestrial Systems (large TIFF file)
7. Global Annual PM2.5 Grids (large TIFF file)
8. Natural Resource Protection & Child Health Indicators

The large TIFF files can cause RAM crashes if you repeatedly load them in, so make sure to only load them once in your code before processing.
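One way to guard against accidental reloads is to keep each raster in a small cache keyed by its path. The sketch below is an illustration only: `load_once` and its `reader` argument are hypothetical names that do not appear elsewhere in this notebook, and `reader` stands for whatever function actually reads the file (for example, a small wrapper around `rasterio.open(path).read(1)`).

```python
# Hypothetical helper: cache each large array so it is read from disk only once.
_raster_cache = {}

def load_once(path, reader):
    """Return the array for `path`, calling `reader(path)` only on the first request."""
    if path not in _raster_cache:
        _raster_cache[path] = reader(path)
    return _raster_cache[path]

# Demonstration with a stand-in reader that records how often it is called.
calls = []

def fake_reader(path):
    calls.append(path)
    return [[0.0, 1.0], [2.0, 3.0]]  # placeholder for a real raster band

first = load_once('example.tif', fake_reader)
second = load_once('example.tif', fake_reader)  # served from the cache
```

The second call returns the same object without touching the reader again, which is the behaviour you want when a notebook cell is re-run.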

In [None]:
import os
import math
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import rasterio #for working with TIFF files
from scipy import interpolate #in case TIFF grid is missing local values, we can impute them

# 1x1 Gridded Emissions
---

This dataset consists of ASCII files that create a grid of the world map, where the values in each location correspond to emission levels in that area. This is also accompanied by two CSV files that give some overall summary values for different regions of the world, and their total predicted emissions for various pollutants over sequential decades for different SRES scenarios. For more information on these projected scenarios, please consult the [IPCC guide (PDF)](https://www.ipcc.ch/site/assets/uploads/2018/03/emissions_scenarios-1.pdf).

The CSV files must first be processed as they are in a non-standard format. The desired values can then be selected using ordinary indexing:

In [None]:
def process_so2(df):
    cols = ['region', 'scenario', '1990', '2000', '2010', '2020', '2030', '2040', '2050', '2060', '2070', '2080','2090', '2100']
    df.columns = cols
    df[['region', 'scenario']] = df[['region', 'scenario']].replace(math.nan, 'nan')
    for i in range(len(df)):
        scenario = df.iloc[i, :].scenario
        if scenario!='nan':
            current_scenario = scenario
        else:
            df.loc[i, 'scenario'] = current_scenario
    return df.loc[df.region!='nan', :]


def process_rg(df):
    years = np.arange(1990, 2110, 10).astype(str).tolist()
    df[['Scenario']] = df[['Scenario']].replace(math.nan, 'nan')
    row_list = []
    for i in range(len(df)):
        scenario = df.iloc[i, :]['Scenario']
        gas_region = df.iloc[i, :]['Gas/Region']
        readings = df.iloc[i, :][years].values
        if scenario!='nan':
            current_scenario = scenario
            current_gas = gas_region
            row = pd.DataFrame(index=np.arange(1), columns=['scenario', 'region', 'gas'] + years)
            row['scenario'] = current_scenario
            row['gas'] = current_gas
            row['region'] = np.nan
        else:
            row = pd.DataFrame(index=np.arange(1), columns=['scenario', 'region', 'gas'] + years)
            row['scenario'] = current_scenario
            row['gas'] = current_gas
            row['region'] = gas_region
            row[years] = readings
            
        row_list.append(row)
                        
    out_df = pd.concat(row_list) 
    out_df[years] = out_df[years].astype('float32')
    return out_df.loc[~out_df.region.isna(), :]

so2_df = pd.read_csv('/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/1x1 Gridded Emissions/so2-values.csv')
reactive_gas_df = pd.read_csv('/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/1x1 Gridded Emissions/reactive-gases.csv')

so2 = process_so2(so2_df)
reactive_gas_df = process_rg(reactive_gas_df)
so2.head()

In [None]:
reactive_gas_df.head()
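The scenario-filling loop in `process_so2` can also be expressed with pandas' forward fill. The sketch below runs on a small made-up frame (the numbers are placeholders, not values from the SEDAC files) and assumes the same layout the loop relies on: a row carrying only the scenario label, followed by region rows whose scenario cell is empty. `fill_scenarios` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame with placeholder values, mimicking the assumed CSV layout.
toy = pd.DataFrame({
    'region':   [np.nan, 'OECD90', 'ASIA', np.nan, 'OECD90'],
    'scenario': ['A1',   np.nan,   np.nan, 'B2',   np.nan],
    '1990':     [np.nan, 1.0,      2.0,    np.nan, 3.0],
})

def fill_scenarios(df):
    """Copy each scenario label down to the region rows below it, then drop label-only rows."""
    out = df.copy()
    out['scenario'] = out['scenario'].ffill()
    return out.dropna(subset=['region']).reset_index(drop=True)

tidy = fill_scenarios(toy)
print(tidy)
```

This avoids the row-by-row loop and does not depend on the frame having a default integer index.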

### Working with ASCII data
---
First, for convenience we can make a dataframe from the available files:

In [None]:
scenarios = ['A1AIM', 'A2ASF', 'B2MESSAGE', 'A1GMINICAM', 'B1IMAGE', 'A1TMESSAGE']
dict_list = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        for scenario in scenarios:
            if filename[:len(scenario)]==scenario:
                temp_dict = {}
                temp_dict['scenario'] = scenario
                temp_dict['year'] = filename[len(scenario):len(scenario)+4]
                temp_dict['gas'] = filename[len(scenario)+4:].split('.')[0]
                temp_dict['path'] = os.path.join(dirname, filename)
                dict_list.append(temp_dict)
                
grid_df = pd.DataFrame(dict_list)
grid_df                
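The slicing above splits each filename into a scenario prefix, a four-digit year and a gas name. As a rough alternative, the same split can be written as a regular expression. In this sketch `parse_name` is a hypothetical helper, and the example filename (including its `.txt` extension) is made up for illustration.

```python
import re

scenarios = ['A1AIM', 'A2ASF', 'B2MESSAGE', 'A1GMINICAM', 'B1IMAGE', 'A1TMESSAGE']

# Scenario prefix, then a four-digit year, then the gas name up to the first dot.
pattern = re.compile(
    r'^(?P<scenario>' + '|'.join(map(re.escape, scenarios)) + r')'
    r'(?P<year>\d{4})'
    r'(?P<gas>[^.]+)'
)

def parse_name(filename):
    """Return a dict with scenario, year and gas, or None if the name does not fit."""
    match = pattern.match(filename)
    return match.groupdict() if match else None

parsed = parse_name('A2ASF2080NVMOC.txt')  # made-up filename
print(parsed)
```

Note that the year comes back as a string, just as it does with the slicing approach, so any later comparison against it should use a string as well.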

Now we can get the data for a particular year, scenario and pollutant:

In [None]:
def get_emissions(path):
    # load the grid stored at `path`, skipping the 12 header lines
    out_df = pd.DataFrame(np.loadtxt(path, delimiter=',', skiprows=12), columns=['long', 'lat', 'value'])
    out_df['long'] = out_df['long'].astype(int)
    out_df['lat'] = out_df['lat'].astype(int)
    return out_df
    
def get_path(grid_df, year, scenario, gas):
    # `year` is sliced out of the filename, so it is stored as a string
    matches = grid_df.loc[(grid_df.year==str(year)) & (grid_df.scenario==scenario) & (grid_df.gas==gas), 'path']
    # return a single path string rather than a Series
    return matches.iloc[0]
    
test_path = get_path(grid_df, 2080, 'A2ASF', 'NVMOC')
a2asf_2080_nvmoc_df = get_emissions(test_path)
a2asf_2080_nvmoc_df.head()

These can now be matched with any relevant coordinates for cities and corporate facilities.
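As a minimal sketch of that matching step, the example below floors a city's coordinates to whole degrees and joins on the grid cell. It assumes each grid row is labelled by the south-west corner of its 1x1 degree cell, which should be checked against the dataset documentation. The grid values are placeholders, and `match_cells` and the column names `city_lon`/`city_lat` are hypothetical.

```python
import numpy as np
import pandas as pd

# Placeholder grid in the same long/lat/value layout as the emissions frames.
grid = pd.DataFrame({'long': [-88, -87, -88], 'lat': [41, 41, 42], 'value': [1.5, 2.5, 3.5]})
cities = pd.DataFrame({'city': ['Chicago'], 'city_lon': [-87.6298], 'city_lat': [41.8781]})

def match_cells(cities, grid):
    """Attach the value of the enclosing 1x1 degree cell to each city."""
    keyed = cities.assign(
        cell_long=np.floor(cities['city_lon']).astype(int),
        cell_lat=np.floor(cities['city_lat']).astype(int),
    )
    return keyed.merge(grid, how='left',
                       left_on=['cell_long', 'cell_lat'],
                       right_on=['long', 'lat'])

matched = match_cells(cities, grid)
print(matched[['city', 'long', 'lat', 'value']])
```

A left join keeps cities that fall outside the grid, with a missing value in the `value` column.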


We can also visualize the data to ensure it looks correct:

In [None]:
def show_grid(df):
    df = df.copy() #work on a copy so the caller's coordinates are not shifted
    df['long'] = df['long'] - df['long'].min()
    df['lat'] = df['lat'] - df['lat'].min()
    x_range = np.arange(0, df.long.max()+1)
    y_range = np.arange(0, df.lat.max()+1)
    grid = np.zeros([len(y_range), len(x_range)])
    print(grid.shape)
    for x in np.unique(df['long'].values):
        temp_df = df.loc[df.long==x, :]
        lats = temp_df['lat']
        values = temp_df['value']
        grid[lats, x] = values
    
    grid = np.flipud(grid)
    plt.figure(figsize=(15, 6))
    plt.imshow(grid)
    plt.show();
    
show_grid(a2asf_2080_nvmoc_df)

This particular pollutant has high predicted emissions across large regions of Asia, with spikes in North America, Europe and Sub-Saharan Africa.
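The grid used for plotting can also be filled in one step with NumPy integer indexing instead of looping over longitudes. This is a sketch on a tiny made-up frame; `to_grid` is a hypothetical name, and it assumes integer `long`/`lat` columns as produced by `get_emissions`.

```python
import numpy as np
import pandas as pd

def to_grid(df):
    """Place each value at its (lat, long) offset and flip so that north is at the top."""
    cols = (df['long'] - df['long'].min()).to_numpy()
    rows = (df['lat'] - df['lat'].min()).to_numpy()
    grid = np.zeros((rows.max() + 1, cols.max() + 1))
    grid[rows, cols] = df['value'].to_numpy()
    return np.flipud(grid)

# Two placeholder cells on opposite corners of a 2x2 grid.
toy = pd.DataFrame({'long': [-1, 0], 'lat': [10, 11], 'value': [1.0, 2.0]})
toy_grid = to_grid(toy)
print(toy_grid)
```

The result can be passed straight to `plt.imshow`, and the input frame is left unchanged.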

# Trends in Global Freshwater Availability
---
A GeoTIFF is a TIFF image file that has been augmented with some additional tags which denote properties such as coordinate system and geographical extent. The rasterio package can be used to load in these files. Since the actual values are stored in a NumPy array, accessing and manipulating them is straightforward:

In [None]:
fp = '/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Trends in Global Freshwater Availability/freshwater_availability.tif'
img = rasterio.open(fp)
img_array = img.read(1)
print(f'Maximum change: {img_array.max()}')
print(f'Minimum change: {img_array.min()}')
print(f'Raster shape: {img_array.shape}')
fig = plt.figure(figsize=(15, 6))
ax = fig.add_subplot()
im = ax.imshow(img_array)
fig.colorbar(im)
plt.show()

Visually, this figure is somewhat dominated by the large projected decline in freshwater reserves at the polar regions. The researchers recommend using a mask to remove all ocean and sea coverage, as the instruments used for these measurements detect alterations in the local gravity caused by water flows, which is naturally inaccurate over moving bodies of water.

However, this data can still yield useful insights into the rate of freshwater decline in a particular region. The raster is a simple array of shape `(360, 720)`, which allows values to be mapped to the coordinates of cities and other locations. For this function, western longitudes (and southern latitudes) must be given as negative numbers. For example, for Chicago at 41.8781° N, 87.6298° W:

In [None]:
def get_value(raster, array, lon, lat):
    # get numpy index of these coordinates
    py, px = raster.index(lon, lat)
    return array[py, px]

get_value(img, img_array, -87.6298, 41.8781)
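For intuition, the row/column lookup can be approximated by hand. The sketch below assumes the `(360, 720)` raster covers the whole globe in 0.5 degree cells, starting at 180° W and 90° N. That extent is an assumption made for illustration; `raster.index` uses the file's actual transform and should be preferred. `approx_index` is a hypothetical helper.

```python
import math

def approx_index(lon, lat, cell=0.5, west=-180.0, north=90.0):
    """Approximate (row, col) of a coordinate on a regular global grid."""
    col = math.floor((lon - west) / cell)   # columns count eastwards from the western edge
    row = math.floor((north - lat) / cell)  # rows count southwards from the northern edge
    return row, col

row, col = approx_index(-87.6298, 41.8781)
print(row, col)  # 96 184
```

Under these assumptions Chicago falls in row 96, column 184, well inside the 360 by 720 array.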

# Development Threat Index

---
This data is also stored in GeoTIFF format. For this demonstration, we will only consider the cumulative DTI score. More specific ones relating to fossil fuels, mining, agricultural expansion and more can also be found in this directory. They are accompanied by PDF files for the methodology and documentation for each considered variable.

The values for ocean/sea coverage can be replaced with `np.nan` for better visibility:

In [None]:
fp = '/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Development Threat Index/geotiff/lulc-development-threat-index_geogrpahic.tif'
img = rasterio.open(fp)
img_array = img.read(1).astype('float32')
img_array[img_array==128] = np.nan
print(f'Raster shape: {img_array.shape}')
fig = plt.figure(figsize=(15, 6))
ax = fig.add_subplot()
im = ax.imshow(img_array)
fig.colorbar(im)
plt.show()

There are still some artifacts present in the Pacific Ocean, probably a consequence of the transformation from the original Mollweide projection. Despite the raster's unusual shape, it is simple to map coordinates to their Development Threat Index score. Again, for Chicago:

In [None]:
get_value(img, img_array, -87.6298, 41.8781)

# Environmental Performance Index
---

The EPI data is stored in .xlsx format, combining multiple spreadsheets into a single file. This can best be accessed using `pd.read_excel`, making sure to use the parameter `sheet_name=None`. This will construct a dictionary of dataframes that can be accessed via the dictionary key:

In [None]:
epi_d = pd.read_excel('/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Environmental Performance Index/2018-epi.xlsx', sheet_name=None)
epi_d.keys()
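With many sheets in one workbook, a quick overview of what each one holds can help. The sketch below uses a small stand-in dictionary in place of the real result of `pd.read_excel(..., sheet_name=None)`; the sheet names and contents are placeholders, and `summarize_sheets` is a hypothetical helper.

```python
import pandas as pd

# Stand-in for the dict of dataframes returned with sheet_name=None.
sheets = {
    'SheetA': pd.DataFrame({'iso': ['USA', 'FRA'], 'score': [1.0, 2.0]}),
    'SheetB': pd.DataFrame({'code': ['X'], 'description': ['placeholder']}),
}

def summarize_sheets(sheets):
    """Return one row per sheet with its number of rows and columns."""
    return pd.DataFrame(
        [{'sheet': name, 'rows': df.shape[0], 'columns': df.shape[1]}
         for name, df in sheets.items()]
    )

summary = summarize_sheets(sheets)
print(summary)
```

The same function can be applied to `epi_d` or any of the other workbook dictionaries in this notebook.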

Now if we want to examine the `'2018EPI_ScoresCurrent'` table:

In [None]:
epi_d['2018EPI_ScoresCurrent']

Details on the different fields can be found in the accompanying PDF, or examined directly in the time-series data sheet. The latter also contains a wealth of data:

In [None]:
epi_ts_d = pd.read_excel('/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Environmental Performance Index/2018-epi-raw-data-time-series.xlsx', sheet_name=None)
print(epi_ts_d.keys())
epi_ts_d['DataDictionary']

So if we want to examine the Wastewater Treatment/Water Resources score for the US:

In [None]:
epi_df = epi_d['2018EPI_ScoresCurrent']
epi_df.loc[epi_df.iso=='USA']

We can see that the Wastewater Treatment score is 92.57. The country's overall Environmental Performance Index score, meanwhile, is 71.19. There are many features that can be extracted from this dataset, and as it is dated from 2018 it can be considered an up-to-date source.

# Climate Effects on Food Supply
---
This dataset examines the projected changes in the supply of essential crops worldwide and by country. This data is a bit older than the others and is based on the SRES scenarios created for IPCC projections. You can however create useful features by examining the difference between these predictions and known crop yields in recent years - [Earthstat](http://www.earthstat.org/) would be a good place to start.

It is also stored as an Excel workbook, this time in the older .xls format. Please note the leading spaces in the sheet names `' Data Dictionary'` and `' Data sheet'`:

In [None]:
fs_d = pd.read_excel('/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Climate Effects on Food Supply/crops-yield-changes-hadcm3-sres.xls', sheet_name=None)
print(fs_d.keys())
fs_d[' Data Dictionary']

In [None]:
fs_d[' Data sheet']

So, if we want to look up the projected weighted average yield change for grain crops in the US by 2020:

In [None]:
fs_df = fs_d[' Data sheet']
fs_df.loc[fs_df['BLS_2_Countries_(SRES)_ABBREVNAME']=='United States']['B2B2020']

We can see that it was expected to be around 18% higher than the known values for 2000.
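To turn such a percentage into an absolute figure, it can be applied to a baseline yield. This is a minimal sketch that assumes the column holds percentage changes relative to the baseline year, as the reading above suggests. The baseline of 100 is a placeholder rather than a real yield, and `apply_change` is a hypothetical helper.

```python
def apply_change(baseline, pct_change):
    """Scale a baseline value by a percentage change (18 means +18%)."""
    return baseline * (1 + pct_change / 100)

projected = apply_change(100.0, 18.0)   # placeholder baseline, +18%
unchanged = apply_change(100.0, 0.0)
print(round(projected, 2), unchanged)
```

Comparing a projection computed this way with observed yields gives the difference feature suggested at the start of this section.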

# Human Modification of Terrestrial Systems
---
This dataset scores areas of the Earth's land surface between 0 and 1, a combined metric that includes the effects of 13 identified 'environmental stressors'. It is a useful score for determining the local effect on the environment caused by city or corporate activity. The dataset presents some challenges since not everywhere is scored, so we have to interpolate some results.

In [None]:
fp = '/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Human Modification of Terrestrial Sytems/lulc-human-modification-terrestrial-systems_geographic.tif'
img = rasterio.open(fp)
img_array = img.read(1).astype('float32')
img_array[img_array<0] = np.nan #oceanic areas have value -3.4028235e+38
print(img_array.shape)
fig = plt.figure(figsize=(15, 6))
ax = fig.add_subplot()
im = ax.imshow(img_array)
fig.colorbar(im)
plt.show()

In [None]:
def get_hmts(raster, array, lon, lat):
    
    #if value is present at these coords, return value
    out = get_value(raster, array, lon, lat)
    if not np.isnan(out):
        return out
    
    #otherwise perform 2d linear interpolation on local area and impute the value
    #(assumes the point lies at least 100 cells from every edge of the raster)
    py, px = raster.index(lon, lat) 
    array = array[py-100:py+100, px-100:px+100]
    x = np.arange(0, array.shape[1])
    y = np.arange(0, array.shape[0])
    array = np.ma.masked_invalid(array)
    xx, yy = np.meshgrid(x, y)
    #get only the valid values
    x1 = xx[~array.mask]
    y1 = yy[~array.mask]
    newarr = array[~array.mask]

    interpolated_arr = interpolate.griddata((x1, y1),
                                            newarr.ravel(),
                                            (xx, yy),
                                            method='linear')            
    #return central value from the imputed local area
    return interpolated_arr[100, 100]

get_hmts(img, img_array, -87.6298, 41.8781)
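The fixed 200 by 200 window assumes the point is at least 100 cells from every edge of the raster. For locations near a border, the slice bounds can be clipped and the position of the point inside the window tracked explicitly. The sketch below shows one way to do that on a toy array; `local_window` is a hypothetical helper, not part of rasterio or SciPy.

```python
import numpy as np

def local_window(array, py, px, half=100):
    """Return the window around (py, px), clipped to the array, and the point's index inside it."""
    top = max(py - half, 0)
    left = max(px - half, 0)
    bottom = min(py + half, array.shape[0])
    right = min(px + half, array.shape[1])
    window = array[top:bottom, left:right]
    return window, (py - top, px - left)

# Toy example: a point close to the top-left corner of a 10x10 array.
toy = np.arange(100).reshape(10, 10)
window, (cy, cx) = local_window(toy, 1, 2, half=3)
print(window.shape, window[cy, cx])
```

After interpolating over `window`, the imputed value would be read at `[cy, cx]` rather than at a hard-coded `[100, 100]`.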

# Global Annual PM2.5 Grids
---
PM2.5 refers to fine particulate matter, natural or man-made, which can cause health problems at high concentrations or over long exposures. Specifically, it is the mass per cubic metre of air of particles with a diameter of less than 2.5 micrometres (µm). This dataset supplies a global grid of these readings, a useful indicator of general air quality.

In [None]:
fp = '/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Global Annual PM2.5 Grids/sdei-global-annual-gwr-pm2-5-modis-misr-seawifs-aod-2016-geotiff/gwr_pm25_2016.tif'
img = rasterio.open(fp)
img_array = img.read(1).astype('float32')
img_array[img_array<-1e6] = np.nan #oceanic areas have value -3.4028235e+38
print(img_array.shape)
fig = plt.figure(figsize=(15, 6))
ax = fig.add_subplot()
im = ax.imshow(img_array)
fig.colorbar(im)
plt.show()

As with the Human Modification of Terrestrial Systems dataset, some land areas have missing values, so we interpolate whenever the requested cell is empty in the original raster:

In [None]:
def get_pm25(raster, array, lon, lat):
    
    #if value is present at these coords, return value
    out = get_value(raster, array, lon, lat)
    if not np.isnan(out):
        return out
    
    #otherwise perform 2d linear interpolation on local area and impute the value
    #(assumes the point lies at least 100 cells from every edge of the raster)
    py, px = raster.index(lon, lat) 
    array = array[py-100:py+100, px-100:px+100]
    x = np.arange(0, array.shape[1])
    y = np.arange(0, array.shape[0])
    array = np.ma.masked_invalid(array)
    xx, yy = np.meshgrid(x, y)
    #get only the valid values
    x1 = xx[~array.mask]
    y1 = yy[~array.mask]
    newarr = array[~array.mask]

    interpolated_arr = interpolate.griddata((x1, y1),
                                            newarr.ravel(),
                                            (xx, yy),
                                            method='linear')            
    #return central value from the imputed local area
    return interpolated_arr[100, 100]

get_pm25(img, img_array, -87.6298, 41.8781)
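Linear interpolation returns `nan` when the point lies outside the region spanned by the valid cells. A simpler fallback, shown here as a different technique from the interpolation above and not a replacement for it, is to take the nearest valid cell in the local window. `nearest_valid` is a hypothetical helper and the window values are placeholders.

```python
import numpy as np

def nearest_valid(window, cy, cx):
    """Return the value of the non-NaN cell closest to (cy, cx), or NaN if there is none."""
    rows, cols = np.nonzero(~np.isnan(window))
    if rows.size == 0:
        return np.nan
    dist2 = (rows - cy) ** 2 + (cols - cx) ** 2  # squared distance in cells
    nearest = np.argmin(dist2)
    return window[rows[nearest], cols[nearest]]

# Toy window: the centre is missing, with valid cells at different distances.
toy = np.array([[np.nan, np.nan, 5.0],
                [7.0,    np.nan, np.nan],
                [np.nan, np.nan, np.nan]])
value = nearest_valid(toy, 1, 1)
print(value)
```

Distances here are measured in grid cells, which ignores the fact that cells narrow towards the poles. That is usually acceptable for a small local window.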

# Natural Resource Protection & Child Health Indicators
---
The final dataset in this collection provides scoring metrics for natural resource protection and child health indicators. The dataset was uploaded this year with the latest complete values being for 2019, making it an up-to-date resource. 

It is stored in .xlsx format and can be accessed like the others:

In [None]:
nrp_chi_d = pd.read_excel('/kaggle/input/socioeconomic-data-and-applications-center/sedac-data/Natural Resource Protection and Child Health Indicators/nrpi-chi-2019-xlsx.xlsx', sheet_name=None)
print(nrp_chi_d.keys())
nrp_chi_d['data dictionary']

The 2019 scores for NRPI and CHI can be trivially accessed by country:

In [None]:
nrp_chi_d['NRPI 2019'].head()

In [None]:
nrp_chi_d['CHI 2019'].head()

In these columns, the 'v2019' refers to the dataset release, while the last two digits are for the year in question. So the latest values for NRPI and CHI for the United States can be extracted as follows:

In [None]:
nrp_df = nrp_chi_d['NRPI 2019']
chi_df = nrp_chi_d['CHI 2019']
print(f"United States latest Natural Resource Protection Indicator: {nrp_df.loc[nrp_df.ISO3=='USA', :].NRPI_v2019_19.values[0]}")
print(f"United States latest Child Health Indicator: {chi_df.loc[chi_df.ISO3=='USA', :].CHI_v2019_18.values[0]}")
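Because the year is encoded in the column name, the most recent column can be picked programmatically instead of being hard-coded. The sketch below assumes names of the form `<prefix>_v<release>_<two-digit year>`, as described above, and that all years fall in the same century. `latest_column` is a hypothetical helper, and the example column list is illustrative rather than copied from the workbook.

```python
import re

def latest_column(columns, prefix):
    """Return the column with the highest two-digit year suffix, or None if none match."""
    pattern = re.compile(re.escape(prefix) + r'_v\d{4}_(\d{2})$')
    dated = []
    for name in columns:
        match = pattern.match(name)
        if match:
            dated.append((int(match.group(1)), name))
    return max(dated)[1] if dated else None

# Illustrative column names only.
columns = ['ISO3', 'NRPI_v2019_17', 'NRPI_v2019_18', 'NRPI_v2019_19']
newest = latest_column(columns, 'NRPI')
print(newest)
```

Calling it with `nrp_df.columns` and `chi_df.columns` would pick the latest available year for each indicator separately, which matters when the two do not end in the same year.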

A full guide to the columns can be found in the accompanying PDF, and in the `'data dictionary'` dataframe.

# Conclusion
---

This collection of SEDAC resources contains a wealth of information about socioeconomic and environmental data worldwide, including current metrics and projected future values. The geospatial data can be used to enrich both visualizations and tabular data, when it is matched to coordinates using the functions provided in this notebook. Each dataset contains an explanatory PDF in case you wish to read further about the methods used for the data collection and generation. 

I hope this dataset is of use to you in your work. If you have any comments or questions, please leave them in the section below.