# Data ingestion and formatting

This notebook explains how to convert the Climate TRACE dataset to a format that is more appropriate for data science. 

```{note}
This section is relevant for data engineers, or data scientists who want to understand how the data 
has been prepared. Skip if you just want to access the final, prepared data.
```

```{warning}
This notebook requires significant resources (300GB disk and 10GB memory for all gases).
```


The original data from Climate TRACE is offered as a series of CSV files bundled in ZIP archives. That format is universally understood, but it is not the most effective for analysis with data science tools. In particular, it is large: the source data, uncompressed, is about 100GB for each gas! This is the size at which most people would consider this project to be "big data" or at least "medium data". With the proper choice of data storage, we will bring it down to a breezy "small data" without losing information along the way.

Instead, we are going to use the Parquet format. This format has a number of advantages:
- it is _column-based_: data systems can process big chunks of data at once, rather than line by line. Also, depending on the information requested, systems read only the relevant columns and skip the rest very efficiently.
- it is _universal_: most modern data systems are able to read it.
- it is _structured_: basic type information (numbers, categories, ...) is preserved.


Looking at the code, we are performing a few tricks:

_Compacting the data_ We minimize the size of the files by taking advantage of their structure. In particular, we know in many cases that values are part of known enumerations (sectors, ...). We replace all of these with `polars.Enum` types. Not only does this make the files smaller, it also allows data systems to apply clever optimizations to complex operations such as joins.

_Lazy reading_ If we were to read all the source data using a traditional system such as Excel or Pandas, we would require a serious amount of memory. The files themselves are more than 5GB. Polars is capable of reading straight from the zip file in a streaming fashion. This is what Polars calls a Lazy dataframe, or LazyFrame. Even when doing complicated operations such as joining the source files with the confidence information, Polars only uses 3GB of memory on my machine. In fact, this way of working is so fast that the `ctrace` package directly reads all the country emissions data from the zip files in less than a second.

_Using known enumerations_ You will see in the source code that nearly all the variables such as column names, names of gases and sectors, etc. are replaced by constants with upper-case names such as `CH4`. You can use them for autocompletion in your editor.



In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
logging.basicConfig(level=logging.DEBUG)

In [3]:
import os
import polars as pl
from ctrace.constants import *
import ctrace as ct
import pyarrow.parquet
import pyarrow.dataset
from dds import data_function
from pathlib import Path
import duckdb
import tempfile
import shutil
import dds
import huggingface_hub
import huggingface_hub.utils
logging.getLogger("dds").setLevel(logging.WARNING)
logging.getLogger("fiona").setLevel(logging.WARNING)
dds.accept_module(ct)

In [4]:
os.environ["POLARS_TEMP_DIR"] = os.path.join(tempfile.gettempdir(), "polars")
duckdb.sql("SET temp_directory = '{d}'".format(d=os.path.join(tempfile.gettempdir(), "duckdb")))
duckdb.sql("SET memory_limit = '10GB'")
duckdb.sql("SELECT current_setting('temp_directory')")

┌───────────────────────────────────┐
│ current_setting('temp_directory') │
│              varchar              │
├───────────────────────────────────┤
│ /media/tjhunter/DATA/temp/duckdb  │
└───────────────────────────────────┘

## Creating optimized parquet files for source data

This first section creates files that are the most effective for reading and querying. The general approach is as follows:

1. Join the source and source confidence CSV files and write them as parquet files for each subsector
2. Aggregate by year into a yearly parquet file
3. Optimize this parquet file for reading

This first command creates parquet files that join the source and source confidences for each subsector, and returns a list of all the created files.

In this notebook, another trick is to define the transformations as _data functions_. In short, the code of a data function only runs again if its source code has changed; otherwise the stored result is reused. This makes rerunning the notebook very fast, since only what has changed is recomputed.
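
To give an intuition for this caching behaviour, here is a toy sketch in plain Python. It is not how the `dds` package is implemented: the decorator name `cached_by_code`, the in-memory `_CACHE` dictionary and the crude fingerprint are all assumptions made for the illustration.

```python
import functools
import hashlib

# Hypothetical in-memory store, used here only to illustrate the idea.
_CACHE = {}

def cached_by_code(path):
    """Re-run the function only when its compiled code changes (toy version)."""
    def decorator(func):
        code = func.__code__
        # Crude fingerprint of the function body: bytecode plus referenced names.
        raw = code.co_code + repr(code.co_names).encode()
        fingerprint = hashlib.sha256(raw).hexdigest()

        @functools.wraps(func)
        def wrapper():
            key = (path, fingerprint)
            if key not in _CACHE:
                _CACHE[key] = func()
            return _CACHE[key]
        return wrapper
    return decorator

calls = []

@cached_by_code("/demo")
def expensive():
    calls.append(1)  # records how many times the body really ran
    return sum(range(1000))

first = expensive()
second = expensive()
print(first, second, len(calls))
```

The second call returns the stored result without executing the body again, which is the behaviour that makes rerunning a long notebook cheap.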

In [20]:
@data_function("/poly_paths")
def get_polys_path():
    gases = GAS_LIST
    polys_path = ct.data.extract_polygons(p=True, gases=gases)
    points_path = ct.data.extract_points(p=True, gases=gases, polys=polys_path)
    return (polys_path, points_path)
gases = GAS_LIST[:1]
polys_path = ct.data.extract_polygons(p=True, gases=gases)
# (polys_path, points_path) = get_polys_path()

DEBUG:ctrace.data:Opening path agriculture.zip co2
DEBUG:ctrace.data:extracting DATA/agriculture_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/agriculture_polygons.parquet



DEBUG:ctrace.data:Opening path buildings.zip co2
DEBUG:ctrace.data:extracting DATA/buildings_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/buildings_polygons.parquet



DEBUG:ctrace.data:Opening path fluorinated-gases.zip co2
INFO:ctrace.data:skipping co2:fluorinated-gases: no geometries
DEBUG:ctrace.data:Opening path forestry-and-land-use.zip co2
DEBUG:ctrace.data:extracting DATA/forestry-and-land-use_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/forestry-and-land-use_polygons.parquet



DEBUG:ctrace.data:Opening path fossil-fuel-operations.zip co2
DEBUG:ctrace.data:extracting DATA/fossil-fuel-operations_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path manufacturing.zip co2
DEBUG:ctrace.data:extracting DATA/manufacturing_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path mineral-extraction.zip co2
DEBUG:ctrace.data:extracting DATA/mineral-extraction_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path power.zip co2
DEBUG:ctrace.data:extracting DATA/power_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path transportation.zip co2
DEBUG:ctrace.data:extracting DATA/transportation_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/transportation_polygons.parquet



DEBUG:ctrace.data:Opening path waste.zip co2
DEBUG:ctrace.data:extracting DATA/waste_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/climate_trace-polygons_v3-2024-ct5.parquet


In [21]:
points_path = ct.data.extract_points(p=True, gases=gases, polys=polys_path)

DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/agriculture_points.parquet



INFO:ctrace.data:skipping co2:buildings: no points. The current layers are: ['buildings_polygons']
INFO:ctrace.data:skipping co2:fluorinated-gases: no geometries
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/forestry-and-land-use_points.parquet



DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/fossil-fuel-operations_points.parquet



DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/manufacturing_points.parquet



DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/mineral-extraction_points.parquet



DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/power_points.parquet



DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/transportation_points.parquet



DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/waste_points.parquet



DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/climate_trace-points_v3-2024-ct5.parquet


In [22]:
pl.read_parquet('/media/tjhunter/DATA/temp/climate_trace-points_v3-2024-ct5.parquet')

geometry_ref,gadm,geom_wkb,lat,lng,gadm_0,gadm_1,gadm_2,gadm_level,iso3_country
str,str,binary,f64,f64,str,str,str,u32,enum
"""trace_1613800""","""gadm_USA.39.36_1""","b""\x01\x01\x00\x00\x00o\x967,\xb2\x06S\xc0\xda\x0aP\x09>$D@""",40.283143,-76.104625,"""gadm_USA""","""gadm_USA.39_1""","""gadm_USA.39.36_1""",2,"""USA"""
"""trace_3680603""","""gadm_GBR.1.18_1""","b""\x01\x01\x00\x00\x00m\xde\xd96\x91O\x03\xc0\xbaxu\xb9\xa9\xa9J@""",53.325492,-2.413851,"""gadm_GBR""","""gadm_GBR.1_1""","""gadm_GBR.1.18_1""",2,"""GBR"""
"""trace_31739138""","""gadm_USA.5.9_1""","b""\x01\x01\x00\x00\x00\x08\x94M\xb9\xc2*^\xc0\x02aL\x9f\xc2]C@""",38.732502,-120.668135,"""gadm_USA""","""gadm_USA.5_1""","""gadm_USA.5.9_1""",2,"""USA"""
"""trace_3784618""","""gadm_FRA.3.4_1""","b""\x01\x01\x00\x00\x00W\xb0t<\xa6q\x09\xc0I&\xedH\xc6\x0bH@""",48.091989,-3.180493,"""gadm_FRA""","""gadm_FRA.3_1""","""gadm_FRA.3.4_1""",2,"""FRA"""
"""trace_1753741""","""gadm_RUS.6.19_1""","b""\x01\x01\x00\x00\x00\x8a\xcd\xc7\xb5\xa1\x08L@7QKs+vK@""",54.923201,56.067435,"""gadm_RUS""","""gadm_RUS.6_1""","""gadm_RUS.6.19_1""",2,"""RUS"""
…,…,…,…,…,…,…,…,…,…
"""trace_708872""","""gadm_USA.18.97_1""","b""\x01\x01\x00\x00\x00~\xe1\xb62\xb8\xd7T\xc0!,[\xd2n\xa6B@""",37.300257,-83.370618,"""gadm_USA""","""gadm_USA.18_1""","""gadm_USA.18.97_1""",2,"""USA"""
"""trace_3166369""","""gadm_USA.2.7_1""","b""\x01\x01\x00\x00\x00;:\x09\x00)\xd0c\xc0f\x81\xfa\xbf\xb8\x85M@""",59.044701,-158.505005,"""gadm_USA""","""gadm_USA.2_1""","""gadm_USA.2.7_1""",2,"""USA"""
"""trace_31737036""","""gadm_USA.36.56_1""","b""\x01\x01\x00\x00\x00\x84\xb5\xb5\xae&\x06U\xc0f\x11/\xf4\xad\xf4C@""",39.911559,-84.096111,"""gadm_USA""","""gadm_USA.36_1""","""gadm_USA.36.56_1""",2,"""USA"""
"""trace_3866705""","""gadm_ZAF.4.5_1""","b""\x01\x01\x00\x00\x00\x89A`\xe5\xd0\xe2=@\xaa\xf1\xd2Mb\x90>\xc0""",-30.564,29.886,"""gadm_ZAF""","""gadm_ZAF.4_1""","""gadm_ZAF.4.5_1""",2,"""ZAF"""


In [18]:
duckdb.sql("""
SELECT ST_GeomFromWKB(geom_wkb) FROM '/media/tjhunter/DATA/temp/climate_trace-points_v3-2024-ct5.parquet'
""")

┌───────────────────────────────────────────────┐
│           st_geomfromwkb(geom_wkb)            │
│                   geometry                    │
├───────────────────────────────────────────────┤
│ POINT (7.779 49.14)                           │
│ POINT (-77.6667 39.9109)                      │
│ POINT (24.55 64.0833)                         │
│ POINT (126.8044509 37.3003108)                │
│ POINT (-87.81336426874637 13.328092647974083) │
│ POINT (-79.398 45.63)                         │
│ POINT (127.8481 1.8207)                       │
│ POINT (4.705 44.225)                          │
│ POINT (17.179 47.922)                         │
│ POINT (-76.191365 -10.577017)                 │
│          ·                                    │
│          ·                                    │
│          ·                                    │
│ POINT (-1.32 53.794)                          │
│ POINT (104.831985 36.890058)                  │
│ POINT (119.115069547203 25.2916814477704)     │


In [7]:
os.environ["MALLOC_CONF"] = (
    f"narenas:{os.cpu_count()},lg_chunk:21,background_thread:true,dirty_decay_ms:10000,muzzy_decay_ms:10000"
)

In [8]:
duckdb.sql("""
INSTALL spatial;
LOAD spatial;
SET preserve_insertion_order = false;
""")

In [9]:
polys_path

PosixPath('/media/tjhunter/DATA/temp/climate_trace-polygons_v3-2024-ct5.parquet')

In [10]:
parquet_file = pyarrow.parquet.ParquetFile(polys_path)
parquet_file.metadata

<pyarrow._parquet.FileMetaData object at 0x799fa3fbe700>
  created_by: Polars
  num_columns: 4
  num_rows: 51126
  num_row_groups: 8
  format_version: 1.0
  serialized_size: 759954

In [11]:
ds = pyarrow.dataset.dataset(polys_path)
tmp_dir = tempfile.gettempdir()
arrow_path = os.path.join(tmp_dir, "arrow_polys")
pyarrow.dataset.write_dataset(
    ds,
    base_dir=arrow_path,
    basename_template="ds_{i}.parquet",
    format="parquet",
    partitioning=None,
    min_rows_per_group=0,
    max_rows_per_group=1_000,
    existing_data_behavior='overwrite_or_ignore'
)


In [12]:
polys2_path = "/media/tjhunter/DATA/temp/arrow_polys/ds_0.parquet"
parquet_file = pyarrow.parquet.ParquetFile(polys2_path)
parquet_file.metadata

<pyarrow._parquet.FileMetaData object at 0x799fa3fbf600>
  created_by: parquet-cpp-arrow version 18.1.0
  num_columns: 4
  num_rows: 51126
  num_row_groups: 56
  format_version: 2.6
  serialized_size: 22429

In [13]:
gpkg_fname = '/media/tjhunter/DATA/temp/co2/DATA/forestry-and-land-use_geometries.gpkg'
poly_f = "/media/tjhunter/DATA/temp/test.parquet"
duckdb.sql("""
COPY
    (SELECT geometry_ref, subsectors, ST_AsWKB(geom) as geom_wkb FROM ST_read('{gpkg_fname}', layer='{sector_ct}_polygons') LIMIT 10)
TO '{poly_f}'
""".format(gpkg_fname=gpkg_fname, sector_ct=FORESTRY_AND_LAND_USE, poly_f=poly_f))

In [14]:
duckdb.sql("""
    (SELECT *, ST_GeomFromWKB(geom_wkb) as geom FROM '{polys}')
""".format(polys=polys_path))

┌───────────────────┬──────────────────────┬────────────┬──────────────┬───────────────────────────────────────────────┐
│   geometry_ref    │       geom_wkb       │ gadm_level │ iso3_country │                     geom                      │
│      varchar      │         blob         │   uint32   │   varchar    │                   geometry                    │
├───────────────────┼──────────────────────┼────────────┼──────────────┼───────────────────────────────────────────────┤
│ gadm_BRA.13.157_2 │ \x01\x06\x00\x00\x…  │          2 │ BRA          │ MULTIPOLYGON (((-45.97341599999991 -21.1400…  │
│ gadm_HKG.12_1     │ \x01\x06\x00\x00\x…  │          1 │ HKG          │ MULTIPOLYGON (((114.36531800000024 22.43747…  │
│ gadm_COL.2.125_2  │ \x01\x06\x00\x00\x…  │          2 │ COL          │ MULTIPOLYGON (((-74.91311645599991 7.287219…  │
│ gadm_DEU.9.18_1   │ \x01\x06\x00\x00\x…  │          2 │ DEU          │ MULTIPOLYGON (((9.715490001000079 52.634541…  │
│ gadm_ITA.15.7_1   │ \x01\x06\x

In [15]:
duckdb.sql("""
    (SELECT *, ST_GeomFromWKB(geom) AS geom2 FROM '{polys}' LIMIT 1)
""".format(polys='/media/tjhunter/DATA/temp/co2/transportation_polygons.parquet'))

BinderException: Binder Error: Referenced column "geom" not found in FROM clause!
Candidate bindings: "transportation_polygons.geom_wkb"

In [None]:
duckdb.sql("""
    (SELECT *, ST_GeomFromWKB(geom_wkb) AS geom FROM '{polys}' WHERE geom_wkb IS NOT NULL LIMIT 1)
""".format(polys='/media/tjhunter/DATA/temp/co2/transportation_polygons.parquet'))

In [None]:
duckdb.sql("""
CREATE OR REPLACE TABLE polys AS SELECT *, ST_GeomFromWKB(geom_wkb) AS geom FROM '{polys}';
""".format(polys="/media/tjhunter/DATA/temp/climate_trace-polygons_v3-2024-ct5.parquet"))

In [None]:
@data_function("/data_sources")
def load_sources():
    (_, files) = ct.data.load_source_compact()
    return files

load_sources()

To help with the loading, the data is partitioned by year. This is the most relevant split for most users: most people are expected to look at specific years and sectors (especially the latest year). This reduces the amount of data to load.

Let us have a quick peek at the data in one of these files. It already looks pretty good: a lot of the redundant data such as the enumerations has been deduplicated. All the enumeration data is now converted to integers; this is what `dictionary<values=string, indices=int32, ordered=0>` means. It is not quite ready for high performance, however.

In [None]:
from pyarrow.parquet import read_table
fname = load_sources()[0]
print(fname)
read_table(fname)

## Aggregating by year and optimizing the output

The following block takes all the sector files and aggregates them by year. This is based on the expectation that most users will work on the latest year, and that some users will want to look into the trends across the years.

Since these files will be read many times (every time we want to draw a graph), it pays off to optimize them. The Parquet format is designed for fast reads of the relevant data. We will apply three main optimizations: choosing the compression, optimizing the row groups, and adding statistics.



_Compression_ Parquet allows data to be compressed column by column. The first intuition is that, looking at each column of data separately, there will be more patterns and thus more opportunities to compress the data. The second intuition is that, in data-intensive applications, reading the data is the bottleneck. It is then faster to read smaller compressed data into memory and then decompress it (spending a bit of compute time), rather than reading larger, uncompressed data. Modern compression algorithms such as ZStandard or LZ4 are designed to make very efficient use of the processor. Using them is essentially a pure gain in terms of processing speed.


```{admonition} TODO
The year of a data record is defined by its start time. This may be different from the convention used by Climate TRACE. To check.
```


In [None]:
@data_function("/ct_pre")
def ct_pre():
    write_directory = os.path.join(tempfile.gettempdir(), "ct_pre")
    data_files = [str(p) for p in load_sources()]
    duckdb.sql("""
    COPY
          (SELECT *,date_part('year', start_time) AS year FROM read_parquet({data_files}))
    TO '{tmp_dir}' (FORMAT PARQUET, PARTITION_BY (gas,year), CODEC 'zstd', OVERWRITE_OR_IGNORE)
    """.format(data_files=str(data_files), tmp_dir=str(write_directory))
    )
    return write_directory

ct_pre()

In [None]:
def _write_source_file(gas, year, ct_pre_fname):
    logger = logging.getLogger(__name__)
    tmp_dir = tempfile.gettempdir()
    ct_pre_pq = os.path.join(ct_pre_fname, f"gas={gas}", f"year={year}")
    local_pq = os.path.join(tmp_dir, "temp.parquet")
    logger.debug("writing source file for year=%s gas=%s %s", year, gas, local_pq)
    (pl.scan_parquet(ct_pre_pq)
     .pipe(ct.data.recast_parquet, conf=True)
     .sort(by=[SUBSECTOR])
     .sink_parquet(local_pq,
        compression="zstd",
        maintain_order=True,
        statistics=True,
        compression_level=2,
        row_group_size=300_000,
        data_page_size=10_000_000
        )
    )
    version = ct.data.version
    fname = os.path.join(tmp_dir,
                         "climate_trace_sources",
                         f"climate_trace-sources_{version}_{year}_{gas}.parquet") 
    logger.debug("final source file: %s", fname)
    ds = pyarrow.dataset.dataset(local_pq)
    arrow_path = os.path.join(tmp_dir, "arrow_tmp")
    pyarrow.dataset.write_dataset(
        ds,
        base_dir=arrow_path,
        basename_template="ds_{i}.parquet",
        format="parquet",
        partitioning=None,
        min_rows_per_group=300_000,
        max_rows_per_group=1_000_000,
        existing_data_behavior='overwrite_or_ignore'
    )
    os.makedirs(os.path.dirname(fname), exist_ok=True)
    shutil.copyfile(os.path.join(arrow_path, "ds_0.parquet"), fname)
    return fname



In [None]:
years = ct.data.years
gases = ct.constants.GAS_LIST

@data_function("/write_sources")
def write_sources():
    ct_pre_fname = ct_pre()
    fnames = []
    for gas in gases:
        for year in years:
            fname = _write_source_file(gas,year, ct_pre_fname)
            fnames.append(fname)
    return fnames

write_sources()

_Optimizing row groups_ A parquet file is a collection of groups of rows, and within each group the rows are organized column-wise along with some statistics. We can choose how many groups to create: the minimum is one group (all the data in a single group), which is the simplest layout. This is not optimal however: a single group can only be read by one processor core at a time. If we have more cores, they will sit idle. This is why it is better to choose the number of groups to be close to the expected number of processor cores (10-100). When reading, each core will process a different chunk of the file in parallel.

Polars offers less control over the row group layout as of December 2024, so the `_write_source_file` function above directly calls the `pyarrow` package to restructure the final file, using the function `pyarrow.dataset.write_dataset`.

Here is the metadata of the parquet file produced directly by Polars. It is the result of joining datasets which are themselves the result of reading many files (one per subsector). It is very fragmented (see the `num_row_groups` statistic below).

In [None]:
fname_pre = os.path.join(tempfile.gettempdir(), "temp.parquet")
fname_post = write_sources()[-1]
parquet_file = pyarrow.parquet.ParquetFile(fname_pre)
parquet_file.metadata

The final file is more compact: only 58 row groups. It will be much faster to read (up to 50 times faster on my computer) because the readers do not need to gather information from each of the row groups.

In [None]:
parquet_file = pyarrow.parquet.ParquetFile(fname_post)
parquet_file.metadata

_Statistics_ Each row group in a parquet file has statistics. For each column, these statistics contain basic information such as the minimum, the maximum, etc., as you can see below. During a query, a data system first reads these statistics to decide which blocks of data it should read.

For example, the first row group only contains agriculture data (which you can infer from `min: agriculture` and `max: agriculture`). As a result, if a query is looking for waste data, it can safely skip this full block.

Grouping the rows and creating statistics can dramatically reduce the amount of data being read and processed. Finding the right number of groups is a tradeoff between using more cores to read the data in parallel, and not having to read too many statistics descriptions. In the extreme case of the file created by Polars (5000 row groups), the statistics make up 40% of the file and can take up to 90% of the processing time! If your parquet file reads slowly, it is probably due to its internal layout.

In [None]:
parquet_file = pyarrow.parquet.ParquetFile(fname_post)
parquet_file.metadata.row_group(0).column(2).statistics

## Initial checks

We now check that it works correctly. Let's load the newly created data instead of the default version stored on the internet, for the year 2023.

In [None]:
source_path = tempfile.gettempdir()
sdf = ct.read_source_emissions(gas=CO2, year=2023, p=source_path)
sdf

There are about 15M records for this year. They are spread across multiple gases and also multiple trips in the case of boats or airplanes.

In [None]:
sdf.select(pl.len()).collect()

Check the number of distinct source IDs:

In [None]:
by_sec = (sdf
.group_by(SOURCE_ID, SECTOR)
.agg(pl.len())
.collect())

The number of sources outside forestry and land use:

In [None]:
by_sec.filter(c_sector != FORESTRY_AND_LAND_USE).select(pl.len())

Check: no source is associated with multiple sectors.

In [None]:
by_sec.group_by(SOURCE_ID).agg(c_sector.n_unique()).filter(pl.col(SECTOR) > 1)

Check: no annual source should be duplicated by gas. This used to be the case with the V2 release.

In [None]:
(sdf
.filter(c_temporal_granularity =="annual")
.group_by(SOURCE_ID, GAS)
.agg(pl.len())
.filter(pl.col("len") > 1)
.sort(by="len")
.collect())

Check: emissions should always be defined. V2 used to have empty values.

In [None]:
sdf = ct.read_source_emissions(CO2E_100YR, 2023, source_path)
(sdf
 .select(c_emissions_quantity.is_null().alias("null_emissions"), c_subsector, c_iso3_country)
 .group_by(c_subsector, "null_emissions")
 .agg(pl.len())
 .collect()
 .pivot(index=SUBSECTOR, on="null_emissions", values="len")
)

## Integrity checks

Before uploading and publishing data, it is a good idea to run a number of checks. Frameworks such as [pandera](https://pandera.readthedocs.io/en/latest/polars.html) are very helpful to implement these checks. Here we just check that Akrotiri and Dhekelia (country code XAD) is not included, as mentioned in the documentation. It used to be included in older data releases.

In [None]:
(ct.read_source_emissions(gas=GAS_LIST, year=years, p=source_path)
 .filter(c_iso3_country == "XAD")
 .select(pl.len())
.collect())

### CO2e subsector data should be a superset of all sectors

Here is a sanity check that is worth running for any data release: one would expect the total CO2e_100yr (total emissions normalized to their CO2 equivalent) to be present for each sector in which emissions are reported. This was not the case until 2024-12-01 and has been fixed since then.

In [None]:
with pl.Config(tbl_rows=20):
    print(ct.read_source_emissions(gas=GAS_LIST, year=years, p=source_path)
     .group_by(c_sector, c_subsector, c_gas)
     .agg(c_emissions_quantity.sum())
     .collect(streaming=True)
     .pivot(GAS, index=[SECTOR, SUBSECTOR])
     .filter(pl.col(CO2E_100YR).is_null())
     .filter((pl.col(N2O) != 0) | (pl.col(CH4) != 0) | (pl.col(CO2) != 0))
    )

## Create parquet files for country emissions

As of V3, country emission data is also large enough that it should be compacted in parquet files. Note the dramatic difference:

- uncompressed CSV file: 106MB
- compressed CSV file: 6MB
- parquet: 2MB!

As highlighted, the parquet file also has the advantage of being very efficient at extracting only the relevant information.

In [None]:
# Starting from the official archives, read all the gases.

@data_function("/read_country")
def read_country():
    path = Path(tempfile.gettempdir()) / f"climate-trace-countries-{ct.data.version}.parquet"
    print(path)
    cdf = ct.read_country_emissions(ct.constants.GAS_LIST, archive_path=True)
    # Optimizing to read by time and then gas.
    # The logic being that country-specific files are already available from CT.
    (cdf
     .sort(by=[c_start_time,c_gas,c_iso3_country])
      .write_parquet(path) # Not taking precautions, the file is so small.
    )
    return path

p = read_country()

## Country emissions: integrity checks

In a production pipeline, before uploading the final data, we would run a number of checks again on the country emissions. Here are a few checks that we can run (and which are currently failing).

In [None]:
cdf = ct.read_country_emissions(parquet_path=p)
cdf.head(2)

### Country emissions: CO2e data should be a superset of all country emissions

This was an issue as of 2024-12 and has been fixed since then.

In [None]:
with pl.Config(tbl_rows=20):
    print(cdf
     .group_by(c_sector, c_subsector, c_gas)
     .agg(c_emissions_quantity.sum())
     .sort(by=[c_sector, c_subsector, c_gas])
     .pivot(GAS, index=[SECTOR, SUBSECTOR])
     .filter(pl.col(CO2E_100YR).is_null())
     .filter(pl.col(CO2) != 0)
    )

### Country emissions: some countries are excluded from the dataset

The Climate TRACE documentation excludes certain countries from the final release. They used to be present as of 2024-12.

In [None]:
excluded_isos = ["XAD", "XCL", "XPI", "XSP"]
(cdf
 .filter(c_iso3_country.is_in(excluded_isos))
 .group_by([ISO3_COUNTRY, c_start_time.dt.year(), GAS, SECTOR, SUBSECTOR])
 .agg(pl.len()))

## Preparing the geographical information

The Climate TRACE dataset also includes geographical information about the location of emissions:
- point locations for _point sources_ (factories, power plants, ...)
- polygons for _area sources_ (forests, transportation, ...)

This comes with a few remarks:
- all the area sources are split and aggregated at the level of the county or city. This follows the convention of the Database of Global Administrative Areas (GADM). You will not be able to get sources at the city block or road level. You will not see each individual fire in Canada either: it is all aggregated at the county level.
- the ports (seaports, airports) gather the emissions from ships and airplanes emitting at sea. It is then normal to see airports having an enormous impact on a city.

We are going to prepare two files: one with all the points, and one with all the polygons. We do a little bit of preprocessing work:

- we deduplicate the points and the polygons, since they are shared between emission sources
- for the points, we also add administrative information: which country, region, county/city are they located in? 

Again, all the geographical data will be converted to the Parquet format. A new specification for geographical features called GeoParquet provides a universal way to represent simple geographical shapes in a very compact representation. We will eventually leverage it.
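
In the files produced in this notebook, the geometries are stored in a binary column named `geom_wkb`, using the well-known binary (WKB) encoding. For a 2D point this encoding is simple enough to be written by hand, as in the sketch below, which relies only on the standard library. The two helper functions are hypothetical and handle little-endian 2D points only.

```python
import struct

def encode_wkb_point(lng: float, lat: float) -> bytes:
    # Byte order flag (1 = little endian), geometry type (1 = Point), x, y.
    return struct.pack("<BIdd", 1, 1, lng, lat)

def decode_wkb_point(wkb: bytes) -> tuple[float, float]:
    byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return (x, y)

# Longitude first, then latitude (x, y order).
wkb = encode_wkb_point(29.886, -30.564)
print(len(wkb), wkb[:5], decode_wkb_point(wkb))
```

A point takes 21 bytes: a 5-byte header followed by two 8-byte floating point numbers. Polygons follow the same principle with more elaborate headers, which is why a dedicated library is preferable for anything beyond points.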

The following function does all the processing and returns the path to newly created Parquet files. We will later upload them to HuggingFace Hub.


In [24]:
@data_function("/poly_paths")
def get_polys_path():
    gases = GAS_LIST
    polys_path = ct.data.extract_polygons(p=True, gases=gases)
    points_path = ct.data.extract_points(p=True, gases=gases, polys=polys_path)
    return (polys_path, points_path)

(polys_path, points_path) = get_polys_path()

DEBUG:ctrace.data:Opening path agriculture.zip co2
DEBUG:ctrace.data:extracting DATA/agriculture_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/agriculture_polygons.parquet



DEBUG:ctrace.data:Opening path buildings.zip co2
DEBUG:ctrace.data:extracting DATA/buildings_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/buildings_polygons.parquet



DEBUG:ctrace.data:Opening path fluorinated-gases.zip co2
INFO:ctrace.data:skipping co2:fluorinated-gases: no geometries
DEBUG:ctrace.data:Opening path forestry-and-land-use.zip co2
DEBUG:ctrace.data:extracting DATA/forestry-and-land-use_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/forestry-and-land-use_polygons.parquet



DEBUG:ctrace.data:Opening path fossil-fuel-operations.zip co2
DEBUG:ctrace.data:extracting DATA/fossil-fuel-operations_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path manufacturing.zip co2
DEBUG:ctrace.data:extracting DATA/manufacturing_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path mineral-extraction.zip co2
DEBUG:ctrace.data:extracting DATA/mineral-extraction_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path power.zip co2
DEBUG:ctrace.data:extracting DATA/power_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path transportation.zip co2
DEBUG:ctrace.data:extracting DATA/transportation_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/transportation_polygons.parquet



DEBUG:ctrace.data:Opening path waste.zip co2
DEBUG:ctrace.data:extracting DATA/waste_geometries.gpkg to /media/tjhunter/DATA/temp/co2
DEBUG:ctrace.data:Opening path agriculture.zip ch4
DEBUG:ctrace.data:extracting DATA/agriculture_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/agriculture_polygons.parquet


DEBUG:ctrace.data:Opening path buildings.zip ch4
DEBUG:ctrace.data:extracting DATA/buildings_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/buildings_polygons.parquet


DEBUG:ctrace.data:Opening path fluorinated_gases.zip ch4
INFO:ctrace.data:skipping ch4:fluorinated_gases: no geometries
DEBUG:ctrace.data:Opening path forestry-and-land-use.zip ch4
DEBUG:ctrace.data:extracting DATA/forestry-and-land-use_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/forestry-and-land-use_polygons.parquet


DEBUG:ctrace.data:Opening path fossil-fuel-operations.zip ch4
DEBUG:ctrace.data:extracting DATA/fossil-fuel-operations_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:Opening path manufacturing.zip ch4
DEBUG:ctrace.data:extracting DATA/manufacturing_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:Opening path mineral-extraction.zip ch4
DEBUG:ctrace.data:extracting DATA/mineral-extraction_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:Opening path power.zip ch4
DEBUG:ctrace.data:extracting DATA/power_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:Opening path transportation.zip ch4
DEBUG:ctrace.data:extracting DATA/transportation_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/transportation_polygons.parquet


DEBUG:ctrace.data:Opening path waste.zip ch4
DEBUG:ctrace.data:extracting DATA/waste_geometries.gpkg to /media/tjhunter/DATA/temp/ch4
DEBUG:ctrace.data:Opening path agriculture.zip n2o
DEBUG:ctrace.data:extracting DATA/agriculture_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/agriculture_polygons.parquet


DEBUG:ctrace.data:Opening path buildings.zip n2o
DEBUG:ctrace.data:extracting DATA/buildings_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/buildings_polygons.parquet


DEBUG:ctrace.data:Opening path fluorinated-gases.zip n2o
INFO:ctrace.data:skipping n2o:fluorinated-gases: no geometries
DEBUG:ctrace.data:Opening path forestry-and-land-use.zip n2o
DEBUG:ctrace.data:extracting DATA/forestry-and-land-use_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/forestry-and-land-use_polygons.parquet


DEBUG:ctrace.data:Opening path fossil-fuel-operations.zip n2o
DEBUG:ctrace.data:extracting DATA/fossil-fuel-operations_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:Opening path manufacturing.zip n2o
DEBUG:ctrace.data:extracting DATA/manufacturing_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:Opening path mineral-extraction.zip n2o
DEBUG:ctrace.data:extracting DATA/mineral-extraction_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:Opening path power.zip n2o
DEBUG:ctrace.data:extracting DATA/power_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:Opening path transportation.zip n2o
DEBUG:ctrace.data:extracting DATA/transportation_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/transportation_polygons.parquet


DEBUG:ctrace.data:Opening path waste.zip n2o
DEBUG:ctrace.data:extracting DATA/waste_geometries.gpkg to /media/tjhunter/DATA/temp/n2o
DEBUG:ctrace.data:Opening path agriculture.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/agriculture_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/agriculture_polygons.parquet


DEBUG:ctrace.data:Opening path buildings.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/buildings_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/buildings_polygons.parquet


DEBUG:ctrace.data:Opening path fluorinated-gases.zip co2e_100yr
INFO:ctrace.data:skipping co2e_100yr:fluorinated-gases: no geometries
DEBUG:ctrace.data:Opening path forestry-and-land-use.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/forestry-and-land-use_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/forestry-and-land-use_polygons.parquet


DEBUG:ctrace.data:Opening path fossil-fuel-operations.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/fossil-fuel-operations_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:Opening path manufacturing.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/manufacturing_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:Opening path mineral-extraction.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/mineral-extraction_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:Opening path power.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/power_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:Opening path transportation.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/transportation_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/transportation_polygons.parquet


DEBUG:ctrace.data:Opening path waste.zip co2e_100yr
DEBUG:ctrace.data:extracting DATA/waste_geometries.gpkg to /media/tjhunter/DATA/temp/co2e_100yr
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/climate_trace-polygons_v3-2024-ct5.parquet
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/agriculture_points.parquet


INFO:ctrace.data:skipping co2:buildings: no points. The current layers are: ['buildings_polygons']
INFO:ctrace.data:skipping co2:fluorinated-gases: no geometries
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/forestry-and-land-use_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/fossil-fuel-operations_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/manufacturing_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/mineral-extraction_points.parquet
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/power_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/transportation_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2/waste_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/agriculture_points.parquet


INFO:ctrace.data:skipping ch4:buildings: no points. The current layers are: ['buildings_polygons']
INFO:ctrace.data:skipping ch4:fluorinated_gases: no geometries
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/forestry-and-land-use_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/fossil-fuel-operations_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/manufacturing_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/mineral-extraction_points.parquet
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/power_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/transportation_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/ch4/waste_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/agriculture_points.parquet


INFO:ctrace.data:skipping n2o:buildings: no points. The current layers are: ['buildings_polygons']
INFO:ctrace.data:skipping n2o:fluorinated-gases: no geometries
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/forestry-and-land-use_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/fossil-fuel-operations_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/manufacturing_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/mineral-extraction_points.parquet
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/power_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/transportation_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/n2o/waste_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/agriculture_points.parquet


INFO:ctrace.data:skipping co2e_100yr:buildings: no points. The current layers are: ['buildings_polygons']
INFO:ctrace.data:skipping co2e_100yr:fluorinated-gases: no geometries
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/forestry-and-land-use_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/fossil-fuel-operations_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/manufacturing_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/mineral-extraction_points.parquet
DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/power_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/transportation_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/co2e_100yr/waste_points.parquet


DEBUG:ctrace.data:writing /media/tjhunter/DATA/temp/climate_trace-points_v3-2024-ct5.parquet


## Upload the data to the Hugging Face Hub

As a final step, we publish the Parquet files to the Hugging Face Hub, where they can be downloaded as a dataset.

This step only works if you have credentials with write access to the `tjhunter/climate-trace` dataset repository.

In [26]:
upload = True
if upload:
    try:
        api = huggingface_hub.HfApi()
        # Sources
        for fpath in write_sources():
            fname = os.path.join(ct.data.version, os.path.basename(fpath))
            print(fname, fpath)
            api.upload_file(
                path_or_fileobj=fpath,
                path_in_repo=fname,
                repo_id="tjhunter/climate-trace",
                repo_type="dataset",
            )
        # Country information
        fpath = read_country()
        fname = os.path.basename(fpath)
        print(fname, fpath)
        api.upload_file(
            path_or_fileobj=fpath,
            path_in_repo=fname,
            repo_id="tjhunter/climate-trace",
            repo_type="dataset",
        )
        # Geography information
        for fpath in get_polys_path():
            fname = os.path.join(ct.data.version, os.path.basename(fpath))
            print(fname, fpath)
            api.upload_file(
                path_or_fileobj=fpath,
                path_in_repo=fname,
                repo_id="tjhunter/climate-trace",
                repo_type="dataset",
            )
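    except huggingface_hub.utils.HfHubHTTPError as e:
        # Raised on any HTTP error returned by the Hub, for example missing write credentials.
        print(f"Upload to the Hugging Face Hub failed: {e}")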

v3-2024-ct5/climate_trace-polygons_v3-2024-ct5.parquet /media/tjhunter/DATA/temp/climate_trace-polygons_v3-2024-ct5.parquet


DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /api/datasets/tjhunter/climate-trace/preupload/main HTTP/11" 200 172
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /datasets/tjhunter/climate-trace.git/info/lfs/objects/batch HTTP/11" 200 51193


climate_trace-polygons_v3-2024-ct5.parquet:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com:443 "PUT /repos/3b/19/3b19db55b462837ac3fa6806178ef6c97dc4081acabed346e24b3a5892b1221f/897fb3182363d76a306ab247ff5a372d9003d4f36049cf0c2285e873e5c22648?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20241219%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241219T152857Z&X-Amz-Expires=86400&X-Amz-Signature=d8e5162ebba3683cbfbb0c23580275fe795bf23dff994f92d919d51a56afc5ad&X-Amz-SignedHeaders=host&partNumber=1&uploadId=frfh2kRYj.Abl3iLB77QeJRCSmzS8URsRKLwVqnbCjPkRDnXLrkEShwkdOXNg71OOQLITib2bUe_4LnHz3Ot7oYWZenR1v_tGRBKRFw._YDp3DBTLN7qFo8E0gvOu2bF&x-id=UploadPart HTTP/11" 200 0
DEBUG:urllib3.connectionpool:https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com:443 "PUT /repos/3b/19/3b19db55b462837ac3fa6806178ef6c97dc4081acabed346e

v3-2024-ct5/climate_trace-points_v3-2024-ct5.parquet /media/tjhunter/DATA/temp/climate_trace-points_v3-2024-ct5.parquet


DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /api/datasets/tjhunter/climate-trace/preupload/main HTTP/11" 200 170
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /datasets/tjhunter/climate-trace.git/info/lfs/objects/batch HTTP/11" 200 911


climate_trace-points_v3-2024-ct5.parquet:   0%|          | 0.00/10.1M [00:00<?, ?B/s]

DEBUG:urllib3.connectionpool:https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com:443 "PUT /repos/3b/19/3b19db55b462837ac3fa6806178ef6c97dc4081acabed346e24b3a5892b1221f/2d55aeca5974554d21bc2b132057b8a6d8437904fdddd935dd3539ec7396b500?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20241219%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241219T153208Z&X-Amz-Expires=900&X-Amz-Signature=478f8f3ea6962d550c0d0982536242fa3fa4e9b6ae5459800e0fb1fce5514f9b&X-Amz-SignedHeaders=host&x-amz-storage-class=INTELLIGENT_TIERING&x-id=PutObject HTTP/11" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /datasets/tjhunter/climate-trace.git/info/lfs/objects/verify HTTP/11" 200 2
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /api/datasets/tjhunter/climate-trace/commit/main HTTP/11" 200 204
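
Once the upload has completed, the files can be fetched back from the Hub. The snippet below is a minimal sketch of that round trip, not part of the `ctrace` package: `repo_path` and `download_parquet` are hypothetical helper names introduced here for illustration. It assumes the `<version>/<file name>` layout used by the upload loop above and relies on `huggingface_hub.hf_hub_download`.

```python
import posixpath


def repo_path(version: str, local_path: str) -> str:
    """Build the '<version>/<file name>' path used inside the dataset repository.

    Hypothetical helper: repository paths use forward slashes,
    whatever separator the local operating system uses.
    """
    file_name = posixpath.basename(local_path.replace("\\", "/"))
    return posixpath.join(version, file_name)


def download_parquet(
    version: str,
    name: str,
    repo_id: str = "tjhunter/climate-trace",
) -> str:
    """Download one file of the dataset and return its path in the local cache.

    Hypothetical helper, assuming the file was uploaded under '<version>/'.
    """
    # Imported lazily so that repo_path stays usable without the dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=repo_path(version, name),
        repo_type="dataset",
    )
```

For example, `download_parquet("v3-2024-ct5", "climate_trace-points_v3-2024-ct5.parquet")` should return a local path that can be passed to `polars.scan_parquet`. Public datasets can usually be downloaded without credentials; only the upload requires write access.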
