# Dataset description and relation to competition
This dataset is available here: http://citycarbonfootprints.info/ and derives from a recent publication on carbon footprints of world cities: https://iopscience.iop.org/article/10.1088/1748-9326/aac72a "Carbon footprints of 13 000 cities", published in 2018. The carbon footprint data refer to the year 2013. To create the footprints for the 13,000 cities around the world, the authors created a 250m grid spanning the globe:

>Units are Gg (1 Gg=1Kt) of CO2 emissions from fossil fuel combustion. The data year is 2013. The map is full-world extent (-90° to +90° and -180° to 180°) in the equal-area World Mollewiede (EPSG:54009) projection, with 250m cells. The GeoTIFF files are 200mb uncompressed but require a minimum of 5gb RAM to view or analyse.

The study appears to be one of the most extensive efforts yet to characterize carbon footprints with as much spatial coverage as possible. This opens up numerous possibilities for analysis and incorporation into KPIs for the CDP: Unlocking Climate Solutions competition. The global extent should enable actual carbon footprint estimates to be spatially joined to the competition data, which include:
- City-level information about commitments to improving carbon budgets, as well as
- Fine-grained spatial analysis within cities at the zip code and census tract level

The data presented here could support both large multi-city analysis and within-city analysis, thanks to the 250m spatial resolution.

In this notebook I load the data and visualize for Los Angeles, California, USA, then join with cities in the competition dataset (TBD).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import rasterio
from rasterio.enums import Resampling
from rasterio.plot import show
import matplotlib as mpl
import matplotlib.pyplot as plt
import geopandas as gpd
import re

# Carbon Footprint Data
These data come in the form of a GeoTIFF, which is essentially a 2D array of numbers that can be mapped to spatial locations on the earth's surface.

In [None]:
dataset = rasterio.open('../input/global-gridded-model-of-carbon-footprints-ggmcf/GGMCF_v1.0.tif')

In [None]:
dataset.count

Image data often have multiple "bands", e.g. red/green/blue, but there is just one band here, which is the array of numbers representing carbon footprints. Let's visualize the data. For visibility, downsample the data by a factor of 100 and plot it on a log scale.

In [None]:
%%time
upscale_factor = 0.01

# resample data to target shape
data = dataset.read(
    out_shape=(
        dataset.count,
        int(dataset.height * upscale_factor),
        int(dataset.width * upscale_factor)
    ),
    resampling=Resampling.bilinear
)

# scale image transform
transform = dataset.transform * dataset.transform.scale(
    (dataset.width / data.shape[-1]),
    (dataset.height / data.shape[-2])
)
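
The transform above is rescaled by the actual ratio of old to new dimensions rather than by `1 / upscale_factor`, because `int()` truncates the output shape. A small sketch of that arithmetic, using made-up raster dimensions (the helper name `downsampled_pixel_size` is hypothetical and not part of rasterio):

```python
def downsampled_pixel_size(width, height, pixel_size, factor):
    """Return ((out_width, out_height), (x_size, y_size)) after downsampling."""
    out_width = int(width * factor)    # truncates, like the out_shape above
    out_height = int(height * factor)
    # Effective cell size uses the real ratio of old to new dimensions.
    x_size = pixel_size * width / out_width
    y_size = pixel_size * height / out_height
    return (out_width, out_height), (x_size, y_size)

# Toy raster: 1050 x 525 cells of 250 m, downsampled by a factor of 0.01.
shape, sizes = downsampled_pixel_size(1050, 525, 250.0, 0.01)

# Naive guess that ignores truncation.
naive_size = 250.0 / 0.01
```

In this toy case the effective cell size is larger than the naive guess, which is why the ratio-based scaling is the safer choice.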

In [None]:
data.shape

In [None]:
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['font.size'] = 8
image_hidden = plt.imshow(np.log10(data[0,:,:]))
plt.colorbar(image_hidden)
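
If the band contains zeros or a negative fill value (plausible over oceans, though not checked here), `np.log10` will emit warnings and produce `-inf` or `nan`. A minimal sketch of one way around that, using only numpy; `safe_log10` is a hypothetical helper name and the toy array stands in for `data[0]`:

```python
import numpy as np

def safe_log10(band):
    """log10 of strictly positive cells; all other cells become NaN."""
    band = np.asarray(band, dtype="float64")
    out = np.full(band.shape, np.nan)
    # Only evaluate log10 where the value is positive; elsewhere keep NaN.
    np.log10(band, out=out, where=band > 0)
    return out

# Toy stand-in for one raster band.
toy = np.array([[0.0, 1.0], [100.0, -5.0]])
logged = safe_log10(toy)
```

Passing the result to `plt.imshow` should leave the NaN cells blank rather than stretching the color scale.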

It's apparent that more populated areas around the world have larger carbon footprints, which makes sense.

What is the coordinate reference system of these data?

In [None]:
dataset.crs

The coordinate reference system of these data needs to match that of any other dataset before the two can be spatially joined.

# World urban areas shapefile
In order to join the global gridded data, we need to identify which regions of the gridded data correspond to cities in the competition data set. This public data set has polygons describing major cities around the world (https://geo.nyu.edu/catalog/stanford-yk247bg4748).

In [None]:
gdf = gpd.read_file('../input/world-urban-areas-landscan-110-million-2012/ne_10m_urban_areas_landscan.shp')

In [None]:
gdf.info()

In [None]:
gdf.head()

In [None]:
%time gdf.plot(figsize=(12,8))

The city polygons across the world form a faint impression of global land masses. The largest concentrations of cities appear to be in India and China, the two most populous countries.

Let's try to find Los Angeles in here to plot one city.

In [None]:
LA_mask = gdf['name_conve'].str.contains('Angeles')
sum(LA_mask)

In [None]:
gdf[LA_mask]

It looks like there are two cities in the data called "Los Angeles". What do they look like?

In [None]:
gdf[gdf['name_conve'] == 'Los Angeles1'].plot()

In [None]:
gdf[gdf['name_conve'] == 'Los Angeles2'].plot()

The first one appears to be the Southern California metropolis. For other cities with multiple entries, I'll assume that the first is the largest and use it for matching with other data.

What's the coordinate reference system here?

In [None]:
gdf.crs

This is different from the CRS of the carbon footprint data, so reprojection is required.

# Reproject carbon footprints data and save

In [None]:
from rasterio.warp import calculate_default_transform, reproject, Resampling

This EPSG code comes from the `gdf.crs` output shown above.

In [None]:
dst_crs = 'EPSG:4326'

Reprojection code adapted from https://rasterio.readthedocs.io/en/latest/topics/reproject.html

In [None]:
%%time
with rasterio.open('../input/global-gridded-model-of-carbon-footprints-ggmcf/GGMCF_v1.0.tif') as src:
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds)
    kwargs = src.meta.copy()
    kwargs.update({
        'crs': dst_crs,
        'transform': transform,
        'width': width,
        'height': height
    })

    with rasterio.open('/kaggle/working/reprojected.tif', 'w', **kwargs) as dst:
        for i in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, i),
                destination=rasterio.band(dst, i),
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=transform,
                dst_crs=dst_crs,
                resampling=Resampling.nearest)
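
One caveat with this step, assuming the cell values are per-cell totals (Gg per cell) as the quoted units suggest: nearest-neighbour resampling onto a grid with a different number of cells does not conserve the total. The toy below illustrates the effect with numpy only; it is not the rasterio call itself, and `nearest_resample` is a hypothetical helper.

```python
import numpy as np

def nearest_resample(arr, out_rows, out_cols):
    """Nearest-neighbour resampling of a 2D array onto a new grid size."""
    # Map each output cell centre back to a source row/column index.
    rows = ((np.arange(out_rows) + 0.5) * arr.shape[0] / out_rows).astype(int)
    cols = ((np.arange(out_cols) + 0.5) * arr.shape[1] / out_cols).astype(int)
    return arr[np.ix_(rows, cols)]

coarse = np.ones((4, 4))                 # 16 cells, each holding 1 unit
fine = nearest_resample(coarse, 6, 6)    # same area, now 36 cells

total_before = coarse.sum()
total_after = fine.sum()

# Rescaling by the cell-count ratio recovers the original total in this toy.
rescaled = total_after * coarse.size / fine.size
```

For the real data, the ratio of cell areas varies with latitude after reprojection to EPSG:4326, so city totals computed from the reprojected raster are best treated as approximate.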

# Plot carbon footprint for one city

Now I show how to mask the carbon footprint data, which span the globe, to an individual city, LA.

In [None]:
shapes = gdf[gdf['name_conve'] == 'Los Angeles1']['geometry']
shapes

In [None]:
from rasterio.mask import mask

In [None]:
with rasterio.open('/kaggle/working/reprojected.tif') as src:
    out_image, out_transform = mask(src, shapes, crop=True)
    out_meta = src.meta

In [None]:
fig, axs = plt.subplots(1,2)
gdf[gdf['name_conve'] == 'Los Angeles1'].plot(ax=axs[0])
show(out_image, ax=axs[1])

We can also gather statistics on the carbon footprint data, such as adding up the carbon footprint for the LA metro area. I will just use the sum to add up the footprint over the whole city, but may come back later to add in other statistics.

In [None]:
out_image.shape

In [None]:
out_image.sum()
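
A plain `.sum()` also includes any cells that `mask` filled in outside the polygon. As far as I know, rasterio fills those with the raster's nodata value, or 0 when none is set, so the total is only safe if that fill value is 0. A minimal sketch of a nodata-aware sum, using a toy array in place of `out_image` and an assumed fill value of -9999 (`footprint_total` is a hypothetical helper):

```python
import numpy as np

def footprint_total(out_image, nodata=None):
    """Sum a clipped raster while skipping NaN and nodata cells."""
    data = np.asarray(out_image, dtype="float64")
    invalid = np.isnan(data)
    if nodata is not None:
        invalid |= data == nodata
    return float(data[~invalid].sum())

# Toy clipped band: two valid cells and two filled with an assumed nodata value.
toy = np.array([[[-9999.0, 2.0], [3.0, -9999.0]]])
total = footprint_total(toy, nodata=-9999.0)
naive_total = float(toy.sum())
```

With the real data, `src.nodata` would be the value to pass in.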

# Cities in the CDP data

Now that we've done it for one city, the next step is to match up city names from the CDP competition data with the world urban areas data, in order to capture the carbon footprints. Here are the cities in the CDP data:

In [None]:
df = pd.read_csv('/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2019_Cities_Disclosing_to_CDP.csv')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['City'].isnull().mean()

Looks like `City` is the field I want, but sometimes it's null. I'll fill the missing data with `Organization`.

In [None]:
df['City_filled'] = df['City'].fillna(df['Organization'])

In [None]:
df[['City', 'Organization', 'City_filled']].head(10)

Now search the map data for names that match each city name in the CDP data. `re.match` anchors at the start of the string, so `Los Angeles` matches `Los Angeles1`. The loop `break`s the first time a match is found, in order to take the city name ending in 1 in case of multiples.

In [None]:
%%time
matched = []
match_count = []
for city_1 in df['City_filled'].tolist():
    for city_2 in gdf['name_conve'].tolist():
        if re.match(city_1, city_2):
            matched.append(city_2)
            match_count.append(1)
            break
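
Note that the loop above passes each CDP name to `re.match` as a pattern, so any regex metacharacters in a name are interpreted rather than matched literally. A short sketch of the difference, using made-up names that are not taken from either dataset:

```python
import re

# Made-up names for illustration only.
cdp_name = "Springfield (Example County)"
map_name = "Springfield (Example County)1"

# Unescaped: the parentheses become a regex group, so the literal "(" is missed.
raw = re.match(cdp_name, map_name)

# Escaped: every character in the CDP name is matched literally.
escaped = re.match(re.escape(cdp_name), map_name)

# Unescaped "." matches any character, which can give false positives.
dotted = re.match("St. Example", "StX Example1")
```

Swapping in `re.escape(city_1)` would likely change the match count reported below, so the numbers in this notebook refer to the unescaped version.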

In [None]:
len(match_count)

So we've found polygons for...

In [None]:
345/861

about 40% of the cities. That isn't perfect, but it seems like enough to do some analysis. A closer look at the unmatched names may yield more matches.
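
One possible direction for that closer look, sketched under the assumption that some misses come from differences in case, accents, or the numeric suffix on the map names. The lists are made up and the helpers `normalize` and `build_lookup` are hypothetical. Unlike the prefix match above, this requires the whole normalized name to be equal.

```python
import re
import unicodedata

def normalize(name):
    """Casefold, strip accents, and drop a trailing numeric suffix."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\d+$", "", no_accents).strip().casefold()

def build_lookup(map_names):
    """Map normalized name -> first original name, mirroring the `break` above."""
    lookup = {}
    for original in map_names:
        lookup.setdefault(normalize(original), original)
    return lookup

# Made-up lists standing in for gdf['name_conve'] and df['City_filled'].
map_names = ["Los Angeles1", "Los Angeles2", "Bogotá", "Sao Paulo"]
cdp_names = ["los angeles", "Bogota", "São Paulo", "Atlantis"]

lookup = build_lookup(map_names)
matches = {city: lookup.get(normalize(city)) for city in cdp_names}
```

A dictionary lookup also avoids the nested loop, at the cost of missing names that only share a prefix.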

# Carbon footprints for cities in CDP data
TBD