In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
logging.basicConfig(level=logging.DEBUG)

In [3]:
import os
import polars as pl
from ctrace.constants import *
import ctrace as ct
import pyarrow
import pyarrow.dataset  # submodules used below must be imported explicitly
import pyarrow.parquet
from dds import data_function
from pathlib import Path
import tempfile
import shutil
import dds
import huggingface_hub
logging.getLogger("dds").setLevel(logging.WARNING)
dds.accept_module(ct)

## Creating optimized parquet files for source data

This first section creates files that are the most effective for reading and querying. The general approach is as follows:

1. Join the source and source confidence CSV files and write them as parquet files, one per subsector
2. Aggregate by year into a yearly parquet file
3. Optimize this parquet file for reading

This first command creates parquet files that join the source and source confidences for each subsector, and returns a list of all the created files.

This notebook also defines the transformations as _data functions_. In short, the body of a data function only runs again if its source code has changed. This makes rerunning the notebook very fast, since results are recomputed only when something has changed in the source code.

In [4]:
@data_function("/data_sources")
def load_sources():
    (_, files) = ct.data.load_source_compact()
    return files

load_sources()

[PosixPath('/tmp/co2/cropland-fires_emissions_sources.parquet'),
 PosixPath('/tmp/co2/enteric-fermentation-cattle-operation_emissions_sources.parquet'),
 PosixPath('/tmp/co2/enteric-fermentation-cattle-pasture_emissions_sources.parquet'),
 PosixPath('/tmp/co2/manure-left-on-pasture-cattle_emissions_sources.parquet'),
 PosixPath('/tmp/co2/manure-management-cattle-operation_emissions_sources.parquet'),
 PosixPath('/tmp/co2/rice-cultivation_emissions_sources.parquet'),
 PosixPath('/tmp/co2/synthetic-fertilizer-application_emissions_sources.parquet'),
 PosixPath('/tmp/co2/non-residential-onsite-fuel-usage_emissions_sources.parquet'),
 PosixPath('/tmp/co2/residential-onsite-fuel-usage_emissions_sources.parquet'),
 PosixPath('/tmp/co2/forest-land-clearing_emissions_sources.parquet'),
 PosixPath('/tmp/co2/forest-land-degradation_emissions_sources.parquet'),
 PosixPath('/tmp/co2/forest-land-fires_emissions_sources.parquet'),
 PosixPath('/tmp/co2/net-forest-land_emissions_sources.parquet'),
 Po

To speed up loading, the data is partitioned by year. Most users are expected to look at specific years and sectors (especially the latest year), so partitioning by year reduces the amount of data they need to load.

Let us have a quick peek at the data in one of these files. It already looks pretty good: a lot of the redundant data, such as the enumerations, has been deduplicated. All the enumeration data is now stored as integer indices into a table of distinct values; this is what `dictionary<values=string, indices=int32, ordered=0>` means. It is not quite ready for high performance, however.

In [5]:
from pyarrow.parquet import read_table
fname = load_sources()[0]
print(fname)
read_table(fname)

/tmp/co2/cropland-fires_emissions_sources.parquet


pyarrow.Table
source_id: uint64
iso3_country: dictionary<values=string, indices=int32, ordered=0>
sector: dictionary<values=string, indices=int32, ordered=0>
subsector: dictionary<values=string, indices=int32, ordered=0>
original_inventory_sector: dictionary<values=string, indices=int32, ordered=0>
start_time: timestamp[ms, tz=UTC]
end_time: timestamp[ms, tz=UTC]
temporal_granularity: dictionary<values=string, indices=int32, ordered=0>
gas: dictionary<values=string, indices=int32, ordered=0>
emissions_quantity: double
emissions_factor: double
emissions_factor_units: large_string
capacity: double
capacity_units: large_string
capacity_factor: double
activity: double
activity_units: large_string
created_date: timestamp[ms, tz=UTC]
modified_date: timestamp[ms, tz=UTC]
source_name: large_string
source_type: large_string
lat: double
lon: double
other1: large_string
other2: large_string
other3: large_string
other4: large_string
other5: large_string
other6: large_string
other7: large_string
ot

## Aggregating by year and optimizing the output

The following block takes all the sector files and aggregates them by year. This is based on the expectation that most users will work on the latest year, and that some users will want to look into the trends across the years.
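The benefit of splitting by year (and by gas) is that a reader only opens the files it needs. The sketch below builds the list of file names for a selection, following the naming pattern used by `write_data` in this notebook. The helper `source_file_names` is hypothetical and is not part of `ctrace`.

```python
from typing import Iterable, List


def source_file_names(version: str, years: Iterable[int], gases: Iterable[str]) -> List[str]:
    """List the per-year, per-gas file names needed for a given selection."""
    gas_list = list(gases)  # materialized once, since it is iterated for every year
    return [
        f"climate_trace-sources_{version}_{year}_{gas}.parquet"
        for year in years
        for gas in gas_list
    ]


# Someone interested in recent CO2 trends only needs two files.
needed = source_file_names("v3-2024-ct4", [2023, 2024], ["co2"])
```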

Since these files will be read many times (every time we want to draw a graph), it pays off to optimize them. The Parquet format is designed for fast reads of the relevant data. We will do three main optimizations: compressing the data, tuning the row groups and adding statistics.



_Compression_ Parquet allows data to be compressed column by column. The first intuition is that, looking at each column of data separately, there will be more patterns and thus more opportunities to compress the data. The second intuition is that, in data-intensive applications, reading the data is the bottleneck. It is then faster to read smaller compressed data into memory and decompress it (losing a bit of compute time) than to read larger, uncompressed data. Modern compression algorithms such as ZStandard or LZ4 are designed to make very effective use of a processor. Using them is essentially a pure gain in terms of processing speed.
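A quick way to see why a single column compresses well is to compress a column of repeated category values. The sketch below uses `zlib` from the standard library as a stand-in for ZStandard (an assumption made to keep the example dependency-free); the data is synthetic.

```python
import zlib

# A synthetic column: the same few category values repeated many times.
column = "\n".join(["agriculture"] * 50_000 + ["waste"] * 50_000).encode()

compressed = zlib.compress(column, level=6)
ratio = len(column) / len(compressed)

# Decompression restores the exact original bytes.
restored = zlib.decompress(compressed)

print(f"raw: {len(column)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.0f}x")
```

Real columns are less regular than this one, so the ratio on actual data will be lower.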


```{admonition} CTODO
The year of a data record is defined by its start time. This may differ from the convention used by Climate TRACE. To check.
```


In [6]:
write_directory = "/tmp"
years = ct.data.years
version = ct.data.version
gases = ct.constants.GAS_LIST

@data_function("/write_data")
def write_data():
    sort_keys = [GAS, SECTOR, SUBSECTOR, ISO3_COUNTRY, SOURCE_ID]
    data_files = load_sources()
    dfs = []
    for tmp_name in data_files:
        print(tmp_name)
        df = pl.scan_parquet(tmp_name)
        df = df.pipe(ct.data.recast_parquet, conf=True)
        dfs.append(df)
    ldf = pl.concat(dfs)
    fnames = []
    sort_keys2 = [c+"2" for c in sort_keys]
    for gas in gases:
        for year in years:
            fname1 = f"{write_directory}/pre_climate_trace-sources_{version}_{year}_{gas}.parquet"
            (
                ldf.filter(c_start_time.dt.year() == int(year))
                   .filter(c_gas == gas)
                   .with_columns(*[pl.col(c).cast(pl.UInt32()).alias(c +"2") for c in sort_keys])
                   .sort(by=sort_keys2)
                   .drop(sort_keys2)
                   .sink_parquet(
                    fname1,
                    compression="zstd",
                    maintain_order=True,
                    statistics=True,
                )
            )
            fname = f"{write_directory}/climate_trace-sources_{version}_{year}_{gas}.parquet"
            print(fname)
            # Polars does not yet allow fine-tuning the size of the row groups,
            # so the file is rewritten with pyarrow for the time being.
            ds = pyarrow.dataset.dataset(fname1)
            pyarrow.dataset.write_dataset(
                ds,
                base_dir=write_directory,
                basename_template="ds_{i}.parquet",
                format="parquet",
                partitioning=None,
                min_rows_per_group=300_000,
                max_rows_per_group=1_000_000,
            )
            # A single output file is expected: copy it to its final name.
            shutil.copyfile(f"{write_directory}/ds_0.parquet", fname)
            fnames.append((fname1, fname))
    return fnames

write_data()

/tmp/co2/cropland-fires_emissions_sources.parquet
/tmp/co2/enteric-fermentation-cattle-operation_emissions_sources.parquet
/tmp/co2/enteric-fermentation-cattle-pasture_emissions_sources.parquet
/tmp/co2/manure-left-on-pasture-cattle_emissions_sources.parquet
/tmp/co2/manure-management-cattle-operation_emissions_sources.parquet
/tmp/co2/rice-cultivation_emissions_sources.parquet
/tmp/co2/synthetic-fertilizer-application_emissions_sources.parquet
/tmp/co2/non-residential-onsite-fuel-usage_emissions_sources.parquet
/tmp/co2/residential-onsite-fuel-usage_emissions_sources.parquet
/tmp/co2/forest-land-clearing_emissions_sources.parquet
/tmp/co2/forest-land-degradation_emissions_sources.parquet
/tmp/co2/forest-land-fires_emissions_sources.parquet
/tmp/co2/net-forest-land_emissions_sources.parquet
/tmp/co2/net-shrubgrass_emissions_sources.parquet
/tmp/co2/net-wetland_emissions_sources.parquet
/tmp/co2/removals_emissions_sources.parquet
/tmp/co2/shrubgrass-fires_emissions_sources.parquet
/tmp/

[('/tmp/pre_climate_trace-sources_v3-2024-ct4_2021_co2.parquet',
  '/tmp/climate_trace-sources_v3-2024-ct4_2021_co2.parquet'),
 ('/tmp/pre_climate_trace-sources_v3-2024-ct4_2022_co2.parquet',
  '/tmp/climate_trace-sources_v3-2024-ct4_2022_co2.parquet'),
 ('/tmp/pre_climate_trace-sources_v3-2024-ct4_2023_co2.parquet',
  '/tmp/climate_trace-sources_v3-2024-ct4_2023_co2.parquet'),
 ('/tmp/pre_climate_trace-sources_v3-2024-ct4_2024_co2.parquet',
  '/tmp/climate_trace-sources_v3-2024-ct4_2024_co2.parquet'),
 ('/tmp/pre_climate_trace-sources_v3-2024-ct4_2021_co2e_100yr.parquet',
  '/tmp/climate_trace-sources_v3-2024-ct4_2021_co2e_100yr.parquet'),
 ('/tmp/pre_climate_trace-sources_v3-2024-ct4_2022_co2e_100yr.parquet',
  '/tmp/climate_trace-sources_v3-2024-ct4_2022_co2e_100yr.parquet'),
 ('/tmp/pre_climate_trace-sources_v3-2024-ct4_2023_co2e_100yr.parquet',
  '/tmp/climate_trace-sources_v3-2024-ct4_2023_co2e_100yr.parquet'),
 ('/tmp/pre_climate_trace-sources_v3-2024-ct4_2024_co2e_100yr.parquet

_Optimizing row groups_ A parquet file is a collection of groups of rows; within each group, the rows are stored column-wise along with some statistics. We can choose how many groups to create. The minimum is one group holding all the data, which is the most common choice. This is not optimal, however: a single group can only be read by one processor core at a time, and any other cores sit idle. It is therefore better to choose a number of groups close to the expected number of processor cores (10-100). When reading, each core processes a different chunk of the file in parallel.

Polars cannot do this yet, so the `write_data` function above calls the `pyarrow` package directly to restructure the final file, using the function `pyarrow.dataset.write_dataset`.

Here is the parquet file produced directly by Polars. It is the result of concatenating datasets which are themselves read from many files (one per subsector). Its row groups are not sized deliberately, and it can be very fragmented (see the `num_row_groups` statistic below).


In [7]:
(fname_pre, fname_post) = write_data()[0]
print(fname_pre)
print(fname_post)
parquet_file = pyarrow.parquet.ParquetFile(fname_pre)
# print(parquet_file.metadata.row_group(0).column(2).statistics)
parquet_file.metadata

/tmp/pre_climate_trace-sources_v3-2024-ct4_2021_co2.parquet
/tmp/climate_trace-sources_v3-2024-ct4_2021_co2.parquet


<pyarrow._parquet.FileMetaData object at 0x770a35b2aed0>
  created_by: Polars
  num_columns: 54
  num_rows: 15184030
  num_row_groups: 32
  format_version: 1.0
  serialized_size: 174771

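The final file has a controlled layout: 49 row groups (see `num_row_groups` below), with a target of 300,000 to 1,000,000 rows each. Compared with a heavily fragmented file, it is much faster to read (up to 50 times faster on my computer) because the readers do not need to gather information from a large number of small row groups.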

In [8]:
parquet_file = pyarrow.parquet.ParquetFile(fname_post)
parquet_file.metadata

<pyarrow._parquet.FileMetaData object at 0x770a351560c0>
  created_by: parquet-cpp-arrow version 18.1.0
  num_columns: 54
  num_rows: 15184030
  num_row_groups: 49
  format_version: 2.6
  serialized_size: 269904

_Statistics_ Each row group in a parquet file has statistics. For each column, these statistics contain basic information such as the minimum, the maximum and the number of null values, as you can see below. During a query, a data system first reads these statistics to decide which blocks of data it should read.

For example, if the statistics of the sector column in a row group read `min: agriculture` and `max: agriculture`, that row group only contains agriculture data. As a result, a query looking for waste data can safely skip the whole block. The output below shows the statistics of a numeric column in the first row group.

Grouping the rows and creating statistics can dramatically reduce the amount of data being read and processed. Finding the right number of groups is a tradeoff between using more cores to read the data in parallel and not having to read too many statistics descriptions. In the extreme case of a file created by Polars with 5000 row groups, the statistics make up 40% of the file and can take up to 90% of the processing time! If your parquet file reads slowly, it is probably due to its internal layout.
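The pruning logic itself is simple. The sketch below models it in plain Python: each row group carries the minimum and maximum of one column, and a reader keeps only the groups whose range can contain the requested value. The class `RowGroupStats` and the function `groups_to_read` are hypothetical names for this illustration, and the statistics are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    index: int
    min_value: str
    max_value: str


def groups_to_read(stats: List[RowGroupStats], wanted: str) -> List[int]:
    """Keep only the row groups whose [min, max] range can contain `wanted`."""
    return [s.index for s in stats if s.min_value <= wanted <= s.max_value]


# Made-up statistics for a sector column in a file sorted by sector.
stats = [
    RowGroupStats(0, "agriculture", "agriculture"),
    RowGroupStats(1, "agriculture", "buildings"),
    RowGroupStats(2, "manufacturing", "waste"),
]

waste_groups = groups_to_read(stats, "waste")
agriculture_groups = groups_to_read(stats, "agriculture")
```

Sorting the data before writing, as `write_data` does, keeps these ranges narrow, so more groups can be skipped.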

In [9]:
parquet_file = pyarrow.parquet.ParquetFile(fname_post)
parquet_file.metadata.row_group(0).column(12).statistics

<pyarrow._parquet.Statistics object at 0x770a35155620>
  has_min_max: True
  min: 0.0905689899739461
  max: 9955196.16948596
  null_count: 0
  distinct_count: None
  num_values: 327680
  physical_type: DOUBLE
  logical_type: None
  converted_type (legacy): NONE

## Initial checks

We now check that it works correctly. Let's load the newly created data for the year 2023, instead of the default version stored on the internet.

In [10]:
sdf = ct.read_source_emissions(gas=CO2, year=2023, p="/tmp")
sdf

About 15M records for this year. They are spread across multiple gases and, in the case of boats or airplanes, multiple trips.

In [11]:
sdf.select(pl.len()).collect()

len
u32
15184014


Check the number of distinct source IDs

In [12]:
by_sec = (sdf
.group_by(SOURCE_ID, SECTOR)
.agg(pl.len())
.collect())

The number of sources outside forestry and land use:

In [13]:
by_sec.filter(c_sector != FORESTRY_AND_LAND_USE).select(pl.len())

len
u32
748732


Check: no source is associated with multiple sectors.

In [14]:
by_sec.group_by(SOURCE_ID).agg(c_sector.n_unique()).filter(pl.col(SECTOR) > 1)

source_id,sector
u64,u32


Check: no annual source should be duplicated by gas. This used to be the case in the V2 release.

In [15]:
(sdf
.filter(c_temporal_granularity =="annual")
.group_by(SOURCE_ID, GAS)
.agg(pl.len())
.filter(pl.col("len") > 1)
.sort(by="len")
.collect())

source_id,gas,len
u64,enum,u32


Check: emissions should always be defined. V2 used to have empty values.

In [16]:
sdf = ct.read_source_emissions(CO2E_100YR, 2023, "/tmp")
(sdf
 .select(c_emissions_quantity.is_null().alias("null_emissions"), c_subsector, c_iso3_country)
 .group_by(c_subsector, "null_emissions")
 .agg(pl.len())
 .collect()
 .pivot(index=SUBSECTOR, on="null_emissions", values="len")
)

subsector,false
enum,u32
"""aluminum""",2616
"""water-reservoirs""",84408
"""chemicals""",4452
"""enteric-fermentation-cattle-pa…",608328
"""electricity-generation""",106464
…,…
"""solid-waste-disposal""",115500
"""road-transportation""",684168
"""coal-mining""",45456
"""oil-and-gas-refining""",8484


## Integrity checks

Before uploading and publishing data, it is a good idea to run a number of checks. Frameworks such as [pandera](https://pandera.readthedocs.io/en/latest/polars.html) are very helpful for implementing these checks. Here we just check that Akrotiri and Dhekelia (country code XAD) are not included, as mentioned in the documentation.

In [17]:
(ct.read_source_emissions(gas=GAS_LIST, year=2022, p="/tmp")
 .filter(c_iso3_country == "XAD")
 .select(pl.len())
.collect())

len
u32
0


### CO2e subsector data should be a superset of all sectors

Here is an example of an issue to investigate: one would expect the total CO2e_100yr (total emissions normalized to their CO2 equivalent) to be present for every subsector in which emissions are reported. This is not the case for forestry and land use (FLU), for instance for the `removals` subsector.

```{admonition} CTODO
:name: missing-co2e-subsectors
Confirm with CT.
```
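The expectation can be stated as a set inclusion: every (sector, subsector) pair that reports `co2` should also report `co2e_100yr`. The sketch below expresses this with plain Python sets on made-up records. The function `missing_co2e` is a hypothetical helper, and unlike the query in this notebook it does not filter out subsectors whose CO2 total is zero.

```python
from typing import Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (sector, subsector, gas)


def missing_co2e(records: Iterable[Record]) -> Set[Tuple[str, str]]:
    """Return the (sector, subsector) pairs that report co2 but not co2e_100yr."""
    with_co2: Set[Tuple[str, str]] = set()
    with_co2e: Set[Tuple[str, str]] = set()
    for sector, subsector, gas in records:
        if gas == "co2":
            with_co2.add((sector, subsector))
        elif gas == "co2e_100yr":
            with_co2e.add((sector, subsector))
    return with_co2 - with_co2e


# Made-up records for illustration only.
records = [
    ("agriculture", "rice-cultivation", "co2"),
    ("agriculture", "rice-cultivation", "co2e_100yr"),
    ("forestry-and-land-use", "removals", "co2"),
]
gaps = missing_co2e(records)
```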

In [18]:
with pl.Config(tbl_rows=20):
    print(ct.read_source_emissions(gas=GAS_LIST, year=2022, p="/tmp")
     .group_by(c_sector, c_subsector, c_gas)
     .agg(c_emissions_quantity.sum())
     #.filter(c_emissions_quantity < 0)
     .sort(by=[c_sector, c_subsector, c_gas])
     .collect()
     .pivot(GAS, index=[SECTOR, SUBSECTOR])
     .filter(pl.col(CO2E_100YR).is_null())
     .filter(pl.col(CO2) != 0)
    )

shape: (18, 4)
┌────────────────────────┬─────────────────────────────────┬────────────┬────────────┐
│ sector                 ┆ subsector                       ┆ co2        ┆ co2e_100yr │
│ ---                    ┆ ---                             ┆ ---        ┆ ---        │
│ cat                    ┆ enum                            ┆ f64        ┆ f64        │
╞════════════════════════╪═════════════════════════════════╪════════════╪════════════╡
│ buildings              ┆ non-residential-onsite-fuel-us… ┆ 1.6238e9   ┆ null       │
│ buildings              ┆ residential-onsite-fuel-usage   ┆ 1.0540e10  ┆ null       │
│ forestry-and-land-use  ┆ forest-land-clearing            ┆ 1.5943e10  ┆ null       │
│ forestry-and-land-use  ┆ forest-land-degradation         ┆ 1.1023e9   ┆ null       │
│ forestry-and-land-use  ┆ forest-land-fires               ┆ 1.6953e10  ┆ null       │
│ forestry-and-land-use  ┆ removals                        ┆ -5.8078e10 ┆ null       │
│ forestry-and-land-use  ┆ s

## Create parquet files for country emissions

As of V3, country emission data is also large enough that it should be compacted in parquet files. Note the dramatic difference:

- uncompressed CSV file: 106MB
- compressed CSV file: 6MB
- parquet: 1MB !!

As highlighted, the parquet file also has the advantage of being very efficient at extracting only the relevant information.

In [19]:
# Starting from the official archives, read all the gases.

@data_function("/read_country")
def read_country():
    path = Path(tempfile.gettempdir()) / f"climate-trace-countries-{ct.data.version}.parquet"
    print(path)
    cdf = ct.read_country_emissions(ct.constants.GAS_LIST, archive_path=True)
    # Optimizing to read by time and then gas.
    # The logic being that country-specific files are already available from CT.
    (cdf
     .sort(by=[c_start_time,c_gas,c_iso3_country])
      .write_parquet(path) # Not taking precautions, the file is so small.
    )
    return path

p = read_country()

/tmp/climate-trace-countries-v3-2024-ct4.parquet


DEBUG:ctrace.data:Opening path agriculture.zip from /home/tjhunter/.cache/climate_trace_co2/v3-2024/agriculture.zip
DEBUG:ctrace.data:sources: ['DATA/crop-residues_country_emissions.csv', 'DATA/cropland-fires_country_emissions.csv', 'DATA/enteric-fermentation-cattle-operation_country_emissions.csv', 'DATA/enteric-fermentation-cattle-pasture_country_emissions.csv', 'DATA/enteric-fermentation-other_country_emissions.csv', 'DATA/manure-applied-to-soils_country_emissions.csv', 'DATA/manure-left-on-pasture-cattle_country_emissions.csv', 'DATA/manure-management-cattle-operation_country_emissions.csv', 'DATA/manure-management-other_country_emissions.csv', 'DATA/other-agricultural-soil-emissions_country_emissions.csv', 'DATA/rice-cultivation_country_emissions.csv', 'DATA/synthetic-fertilizer-application_country_emissions.csv']
DEBUG:ctrace.data:opening agriculture.zip / DATA/crop-residues_country_emissions.csv
DEBUG:ctrace.data:opening agriculture.zip / DATA/cropland-fires_country_emissions.cs

## Country emissions: integrity checks

In a production pipeline, before uploading the final data, we would run a number of checks again on the country emissions. Here are a few checks that we can run (and which are currently failing).

In [20]:
cdf = ct.read_country_emissions(parquet_path=p)
cdf.head(2)

iso3_country,start_time,end_time,gas,sector,subsector,emissions_quantity,emissions_quantity_units,temporal_granularity,created_date,modified_date
enum,"datetime[ms, UTC]","datetime[ms, UTC]",enum,str,enum,f64,cat,enum,"datetime[ms, UTC]","datetime[ms, UTC]"
"""ABW""",2015-01-01 00:00:00 UTC,2015-12-31 00:00:00 UTC,"""co2""","""agriculture""","""enteric-fermentation-cattle-pa…",0.0,,"""month""",,
"""ABW""",2015-01-01 00:00:00 UTC,2015-12-31 00:00:00 UTC,"""co2""","""manufacturing""","""wood-and-wood-products""",0.0,,"""month""",,


### Country emissions: CO2e data should be a superset of all country emissions

We see that some subsectors are present in the CO2 emissions but are missing from the aggregated CO2e emissions.

In [21]:
with pl.Config(tbl_rows=20):
    print(cdf
     .group_by(c_sector, c_subsector, c_gas)
     .agg(c_emissions_quantity.sum())
     .sort(by=[c_sector, c_subsector, c_gas])
     .pivot(GAS, index=[SECTOR, SUBSECTOR])
     .filter(pl.col(CO2E_100YR).is_null())
     .filter(pl.col(CO2) != 0)
    )

shape: (7, 4)
┌───────────────────────┬───────────────────────────────┬────────────┬────────────┐
│ sector                ┆ subsector                     ┆ co2        ┆ co2e_100yr │
│ ---                   ┆ ---                           ┆ ---        ┆ ---        │
│ str                   ┆ enum                          ┆ f64        ┆ f64        │
╞═══════════════════════╪═══════════════════════════════╪════════════╪════════════╡
│ buildings             ┆ residential-onsite-fuel-usage ┆ 2.9619e10  ┆ null       │
│ forestry-and-land-use ┆ forest-land-clearing          ┆ 6.4861e10  ┆ null       │
│ forestry-and-land-use ┆ forest-land-degradation       ┆ 3.7835e9   ┆ null       │
│ forestry-and-land-use ┆ forest-land-fires             ┆ 5.7839e10  ┆ null       │
│ forestry-and-land-use ┆ removals                      ┆ -1.5964e11 ┆ null       │
│ forestry-and-land-use ┆ shrubgrass-fires              ┆ 2.8614e10  ┆ null       │
│ forestry-and-land-use ┆ wetland-fires                 ┆ 7.66

### Country emissions: some countries should be excluded from the dataset

The Climate TRACE documentation excludes certain countries from the final release, but they are still present in the dataset:

In [22]:
excluded_isos = ["XAD", "XCL", "XPI", "XSP"]
(cdf
 .filter(c_iso3_country.is_in(excluded_isos))
 .group_by([ISO3_COUNTRY, c_start_time.dt.year(), GAS, SECTOR, SUBSECTOR])
 .agg(pl.len()))

iso3_country,start_time,gas,sector,subsector,len
enum,i32,enum,str,enum,u32
"""XAD""",2021,"""co2e_100yr""","""agriculture""","""synthetic-fertilizer-applicati…",12
"""XAD""",2023,"""co2e_100yr""","""transportation""","""railways""",12
"""XAD""",2024,"""co2e_100yr""","""waste""","""industrial-wastewater-treatmen…",12
"""XAD""",2023,"""co2e_100yr""","""mineral-extraction""","""iron-mining""",12
"""XAD""",2021,"""co2e_100yr""","""manufacturing""","""other-metals""",12
…,…,…,…,…,…
"""XAD""",2023,"""co2e_100yr""","""manufacturing""","""other-metals""",12
"""XAD""",2024,"""co2e_100yr""","""waste""","""incineration-and-open-burning-…",12
"""XAD""",2024,"""co2e_100yr""","""fossil-fuel-operations""","""coal-mining""",12
"""XAD""",2021,"""co2e_100yr""","""manufacturing""","""cement""",12


## Preparing the geographical information

The Climate TRACE dataset also includes geographical information about the location of emissions:
- point locations for _point sources_ (factories, power plants, ...)
- polygons for _area sources_ (forests, transportation, ...)

This comes with a few remarks:
- all the area sources are split and aggregated at the level of the county or city. You will not be able to get sources at the city block or road level. You will not see each individual fire in Canada either: everything is aggregated at the county level.
- the ports (seaports, airports) gather the emissions from ships and airplanes emitting at sea. It is therefore normal to see airports having an enormous impact on a city.

We are going to prepare two files: one with all the points, and one with all the polygons. We do a little bit of preprocessing work:

- we deduplicate the points and the polygons, since they are shared between emission sources
- for the points, we also add administrative information: which country, region, county/city are they located in?
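The deduplication step can be sketched as follows: each distinct geometry is stored once under an identifier derived from its content, and every source keeps only that identifier. This is an illustration on made-up data, not the implementation in `ctrace`; the function `deduplicate_geometries` and the use of WKT strings are assumptions.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate_geometries(
    sources: List[Tuple[int, str]],
) -> Tuple[Dict[str, str], Dict[int, str]]:
    """Return (geometry_id -> WKT, source_id -> geometry_id)."""
    geometries: Dict[str, str] = {}
    source_to_geometry: Dict[int, str] = {}
    for source_id, wkt in sources:
        # Identical geometries hash to the same identifier.
        geometry_id = hashlib.sha256(wkt.encode()).hexdigest()
        geometries.setdefault(geometry_id, wkt)
        source_to_geometry[source_id] = geometry_id
    return geometries, source_to_geometry


# Made-up sources: the first two share the same point.
sources = [
    (101, "POINT (2.35 48.85)"),
    (102, "POINT (2.35 48.85)"),
    (103, "POINT (-74.0 40.71)"),
]
geometries, source_to_geometry = deduplicate_geometries(sources)
```

Three sources end up referencing two stored geometries, which is where the space saving comes from when many sources share a shape.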

Again, all the geographical data will be converted to the Parquet format. GeoParquet, a new specification for geographical features, provides a universal way to represent simple geographical shapes very compactly. We will eventually leverage it.

The following function does all the processing and returns the path to newly created Parquet files. We will later upload them to HuggingFace Hub.


In [26]:
@data_function("/poly_paths")
def get_polys_path():
    gases = GAS_LIST
    polys_path = ct.data.extract_polygons(p=True, gases=gases)
    points_path = ct.data.extract_points(p=True, gases=gases, polys=polys_path)
    return (polys_path, points_path)

(polys_path, points_path) = get_polys_path()
(polys_path, points_path)

(PosixPath('/tmp/climate_trace-polygons_v3-2024-ct4.parquet'),
 PosixPath('/tmp/climate_trace-points_v3-2024-ct4.parquet'))

## Upload the data to the Hugging Face Hub

As a final step, we make the datasets available on Hugging Face as a downloadable dataset.

This step will only work if you have the credentials to upload the dataset.

In [30]:
import huggingface_hub.utils
upload = True
if upload:
    try:
        api = huggingface_hub.HfApi()
        for (_, fpath) in write_data():
            fname = os.path.basename(fpath)
            print(fname, fpath)
            # api.upload_file(
            #     path_or_fileobj=fpath,
            #     path_in_repo=fname,
            #     repo_id="tjhunter/climate-trace",
            #     repo_type="dataset",
            # )
        (polys_path, points_path) = get_polys_path()
        country_path = read_country()
        for fpath in [polys_path, points_path, country_path]:
            fname = os.path.basename(fpath)
            print(fname, fpath)
            api.upload_file(
                path_or_fileobj=fpath,
                path_in_repo=fname,
                repo_id="tjhunter/climate-trace",
                repo_type="dataset",
            )
    except huggingface_hub.utils.HfHubHTTPError as e:
        print("error")
        print(e)

climate_trace-sources_v3-2024-ct4_2021_co2.parquet /tmp/climate_trace-sources_v3-2024-ct4_2021_co2.parquet
climate_trace-sources_v3-2024-ct4_2022_co2.parquet /tmp/climate_trace-sources_v3-2024-ct4_2022_co2.parquet
climate_trace-sources_v3-2024-ct4_2023_co2.parquet /tmp/climate_trace-sources_v3-2024-ct4_2023_co2.parquet
climate_trace-sources_v3-2024-ct4_2024_co2.parquet /tmp/climate_trace-sources_v3-2024-ct4_2024_co2.parquet
climate_trace-sources_v3-2024-ct4_2021_co2e_100yr.parquet /tmp/climate_trace-sources_v3-2024-ct4_2021_co2e_100yr.parquet
climate_trace-sources_v3-2024-ct4_2022_co2e_100yr.parquet /tmp/climate_trace-sources_v3-2024-ct4_2022_co2e_100yr.parquet
climate_trace-sources_v3-2024-ct4_2023_co2e_100yr.parquet /tmp/climate_trace-sources_v3-2024-ct4_2023_co2e_100yr.parquet
climate_trace-sources_v3-2024-ct4_2024_co2e_100yr.parquet /tmp/climate_trace-sources_v3-2024-ct4_2024_co2e_100yr.parquet
climate_trace-polygons_v3-2024-ct4.parquet /tmp/climate_trace-polygons_v3-2024-ct4.parqu


climate_trace-points_v3-2024-ct4.parquet /tmp/climate_trace-points_v3-2024-ct4.parquet



climate-trace-countries-v3-2024-ct4.parquet /tmp/climate-trace-countries-v3-2024-ct4.parquet


