# Data ingestion and formatting

This notebook explains how to convert the Climate TRACE dataset to a format that is more appropriate for data science. 

```{note}
This section is relevant for data engineers, or data scientists who want to understand how the data 
has been prepared. Skip if you just want to access the final, prepared data.
```

The original data from Climate TRACE is offered as a series of CSV files bundled in ZIP archives. That format is universally understood, but it is not the most effective for analysis with data science tools. In particular, it is large: the source data, uncompressed, is about 20GB. This is the size at which most people would consider this project to be "big data" or at least "medium data". With the proper choice of data storage, we will bring it down to a breezy "small data" without losing information along the way.

Instead, we are going to use the Parquet format. This format has a number of advantages:
- it is _column-based_ : data systems can process big chunks of data at once, rather than line by line. Also, depending on the information requested, systems will read only the relevant columns and skip the rest very effectively
- it is _universal_ : most modern data systems will be able to read it
- it is _structured_ : basic information about numbers, categories, etc. is preserved


Looking at the code, we are performing a few tricks:

_Compacting the data_ We minimize the size of the files by taking advantage of the structure of the data. In particular, we know in many cases that values are part of known enumerations (sectors, ...). We replace all of these with `polars.Enum` types. Not only does this make the files smaller, but it also allows data systems to make clever optimizations for complex operations such as joining.

_Lazy reading_ If we were to read all the source data using a traditional system such as Excel or Pandas, we would require a serious amount of memory. The files themselves are more than 5GB. Polars is capable of reading straight from the zip file in a streaming fashion. This is what Polars calls a Lazy dataframe, or LazyFrame. Even when doing complicated operations such as joining the source files with the confidence information, Polars only uses 3GB of memory on my machine. In fact, this way of working is so fast that the `ctrace` package directly reads all the country emissions data from the zip files in less than a second.

_Using known enumerations_ You will see in the source code that nearly all the variables, such as column names and the names of gases and sectors, are replaced by constants with `CONSTANT_NAMES` such as `CH4`. You can rely on them for autocompletion in your editor.



In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
logging.basicConfig(level=logging.DEBUG)

In [3]:
import os
import polars as pl
from ctrace.constants import *
import ctrace as ct
import pyarrow
from dds import data_function
import shutil
import dds
import huggingface_hub
logging.getLogger("dds").setLevel(logging.WARNING)
dds.accept_module(ct)

## Creating optimized parquet files

This first section creates files that are the most effective for reading and querying. The general approach is as follows:

1. Join the source and source confidence CSV files and write them as parquet files for each subsector
2. Aggregate by year into a yearly parquet file
3. Optimize this parquet file for reading

This first command creates parquet files that join the source and source confidences for each subsector, and returns a list of all the created files.

In this notebook, another trick is to define the transformations as _data functions_. In short, this code only runs again if its source code changes. This makes rerunning the notebook very fast, because results are only recomputed when something has changed in the source code.
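
The idea behind data functions can be sketched in plain Python. The decorator below is a hypothetical toy, not how the `dds` package is implemented: it fingerprints the function's compiled code and reuses a stored result as long as the fingerprint is unchanged.

```python
import hashlib

_cache = {}  # maps (path, code fingerprint) -> stored result
runs = []    # records each real execution, for demonstration only


def cached_by_code(path):
    """Hypothetical decorator: rerun only when the function's code changes."""

    def decorator(fn):
        code = fn.__code__
        fingerprint = hashlib.sha256(
            code.co_code + repr(code.co_consts).encode()
        ).hexdigest()

        def wrapper():
            key = (path, fingerprint)
            if key not in _cache:
                _cache[key] = fn()  # executed only on a cache miss
            return _cache[key]

        return wrapper

    return decorator


@cached_by_code("/toy_sources")
def load_toy_sources():
    runs.append("executed")
    return ["a.parquet", "b.parquet"]


first = load_toy_sources()
second = load_toy_sources()  # served from the cache, the body does not run again
print(first == second, len(runs))
```

A real tool also has to track the code of the functions being called and persist results across sessions, which this toy does not attempt.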

In [4]:
@data_function("/data_sources")
def load_sources():
    (_, files) = ct.data.load_source_compact()
    return files

load_sources()

[PosixPath('/tmp/enteric-fermentation-cattle-feedlot_emissions-sources.parquet'),
 PosixPath('/tmp/manure-management-cattle-feedlot_emissions-sources.parquet'),
 PosixPath('/tmp/rice-cultivation_emissions-sources.parquet'),
 PosixPath('/tmp/synthetic-fertilizer-application_emissions-sources.parquet'),
 PosixPath('/tmp/enteric-fermentation-cattle-pasture_emissions-sources.parquet'),
 PosixPath('/tmp/cropland-fires_emissions-sources.parquet'),
 PosixPath('/tmp/manure-left-on-pasture-cattle_emissions-sources.parquet'),
 PosixPath('/tmp/water-reservoirs_emissions-sources.parquet'),
 PosixPath('/tmp/removals_emissions-sources.parquet'),
 PosixPath('/tmp/forest-land-fires_emissions-sources.parquet'),
 PosixPath('/tmp/forest-land-degradation_emissions-sources.parquet'),
 PosixPath('/tmp/forest-land-clearing_emissions-sources.parquet'),
 PosixPath('/tmp/net-wetland_emissions-sources.parquet'),
 PosixPath('/tmp/net-forest-land_emissions-sources.parquet'),
 PosixPath('/tmp/net-shrubgrass_emissio

Because the data is loaded lazily, this step takes only 300MB of memory on my machine. Not bad for producing 2GB of data!

To help with the loading, the data is partitioned by year. This is the most relevant for most users: most people are expected to look at specific years and sectors (especially the latest year). This reduces the amount of data to load.

Let us have a quick peek at the data in one of these files. It already looks pretty good: a lot of the redundant data such as the enumerations has been deduplicated. All the enumeration data is now stored as integers: this is what `dictionary<values=string, indices=int32, ordered=0>` means. It is not quite ready for high performance, however.

In [5]:
from pyarrow.parquet import read_table
fname = load_sources()[0]
print(fname)
read_table(fname)

/tmp/enteric-fermentation-cattle-feedlot_emissions-sources.parquet


pyarrow.Table
source_id: uint64
iso3_country: dictionary<values=string, indices=int32, ordered=0>
original_inventory_sector: dictionary<values=string, indices=int32, ordered=0>
start_time: timestamp[us]
end_time: timestamp[us]
temporal_granularity: dictionary<values=string, indices=int32, ordered=0>
gas: dictionary<values=string, indices=int32, ordered=0>
emissions_quantity: double
emissions_factor: double
emissions_factor_units: large_string
capacity: double
capacity_units: large_string
capacity_factor: double
activity: double
activity_units: large_string
created_date: timestamp[us]
modified_date: timestamp[us]
source_name: large_string
source_type: large_string
lat: double
lon: double
other1: large_string
other2: large_string
other3: large_string
other4: large_string
other5: large_string
other6: large_string
other7: large_string
other8: large_string
other9: large_string
other10: large_string
other11: large_string
other12: large_string
other1_def: large_string
other2_def: large_string

## Aggregating by year and optimizing the output

The following block takes all the sector files and aggregates them by year. This is based on the expectation that most users will work on the latest year, and that some users will want to look into the trends across the years.

Since these files will be read many times (every time we want to do a graph), it pays off to optimize them. The Parquet format is designed for fast reads of the relevant data. We will do three main optimizations: compressing the data, optimizing the row groups, and adding statistics.



_Compression_ Parquet allows data to be compressed column by column. The first intuition is that, looking at each column of data separately, there will be more patterns and thus more opportunities to compress the data. The second intuition is that, in data-intensive applications, reading the data is the bottleneck. It is then faster to read smaller compressed data into memory and then decompress it (losing a bit of time in compute), rather than reading larger, uncompressed data. Modern compression algorithms such as ZStandard or LZ4 are designed to be very effective at using a processor. Using them is essentially a pure gain in terms of processing speed.


```{admonition} CTODO
The year of a data record is defined by its start time. This may be different than the convention used by Climate Trace. To check.
```


In [6]:
import pyarrow.dataset  # `import pyarrow` alone may not load the dataset submodule
write_directory = "/tmp"
years = ct.data.years
version = ct.data.version

@data_function("/write_data")
def write_data():
    data_files = load_sources()
    dfs = []
    for tmp_name in data_files:
        df = pl.scan_parquet(tmp_name)
        df = df.pipe(ct.data.recast_parquet, conf=True)
        dfs.append(df)
    ldf = pl.concat(dfs)
    fnames = []
    for year in years:
        fname1 = f"{write_directory}/pre_climate_trace-sources_{version}_{year}.parquet"
        (
            ldf.filter(c_start_time.dt.year() == int(year))
               # Currently a bug in polars: ComputeError: buffer not aligned for mmap
               # .sort(by=[GAS, SECTOR, SUBSECTOR, ISO3_COUNTRY, SOURCE_ID])
               .sink_parquet(
                fname1,
                compression="zstd",
                maintain_order=True,
                statistics=True,
            )
        )
        fname = f"{write_directory}/climate_trace-sources_{version}_{year}.parquet"
        ds = pyarrow.dataset.dataset(fname1)
        pyarrow.dataset.write_dataset(
            ds,
            base_dir="/tmp",
            basename_template="ds_{i}.parquet",
            format="parquet",
            partitioning=None,
            min_rows_per_group=100_000,
            max_rows_per_group=300_000,
        )
        shutil.copyfile("/tmp/ds_0.parquet", fname)
        fnames.append((fname1, fname))
    return fnames

write_data()

[('/tmp/pre_climate_trace-sources_v2-2023-ct2_2015.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2015.parquet'),
 ('/tmp/pre_climate_trace-sources_v2-2023-ct2_2016.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2016.parquet'),
 ('/tmp/pre_climate_trace-sources_v2-2023-ct2_2017.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2017.parquet'),
 ('/tmp/pre_climate_trace-sources_v2-2023-ct2_2018.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2018.parquet'),
 ('/tmp/pre_climate_trace-sources_v2-2023-ct2_2019.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2019.parquet'),
 ('/tmp/pre_climate_trace-sources_v2-2023-ct2_2020.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2020.parquet'),
 ('/tmp/pre_climate_trace-sources_v2-2023-ct2_2021.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2021.parquet'),
 ('/tmp/pre_climate_trace-sources_v2-2023-ct2_2022.parquet',
  '/tmp/climate_trace-sources_v2-2023-ct2_2022.parquet')]

_Optimizing row groups_ A parquet file is a collection of groups of rows, and these rows are organized column-wise along with some statistics. We can choose how many groups to create: the minimum is one group (all the data into a single group), which is the most standard. This is not optimal however: reading can only be done by one processor core at a time. If we have more, they will sit idle. This is why it is better to choose the number of groups to be close to the expected number of processor cores (10-100). When reading, each core will process a different chunk of the file in parallel.

Polars cannot do this yet, so the code below directly calls the `pyarrow` package to restructure the final file, calling the function `pyarrow.dataset.write_dataset`. 

Here is the Parquet file produced directly by Polars. It is the result of combining datasets which are themselves the result of reading many files (one per subsector). It is very fragmented (see the `num_row_groups` statistic below).


In [7]:
(fname_pre, fname_post) = write_data()[0]
print(fname_pre)
print(fname_post)
parquet_file = pyarrow.parquet.ParquetFile(fname_pre)
# print(parquet_file.metadata.row_group(0).column(2).statistics)
parquet_file.metadata

/tmp/pre_climate_trace-sources_v2-2023-ct2_2015.parquet
/tmp/climate_trace-sources_v2-2023-ct2_2015.parquet


<pyarrow._parquet.FileMetaData object at 0x75dc6c064900>
  created_by: Polars
  num_columns: 60
  num_rows: 4408150
  num_row_groups: 4392
  format_version: 2.6
  serialized_size: 21600097

The final file is more compact: only 44 row groups. It will be much faster to read (up to 50 times faster on my computer) because the readers do not need to gather information from thousands of row groups.

In [8]:
parquet_file = pyarrow.parquet.ParquetFile(fname_post)
parquet_file.metadata

<pyarrow._parquet.FileMetaData object at 0x75dc55651b20>
  created_by: parquet-cpp-arrow version 15.0.2
  num_columns: 60
  num_rows: 4408150
  num_row_groups: 44
  format_version: 2.6
  serialized_size: 246740

_Statistics_ Each row group in a parquet file has statistics. For each column, these statistics contain basic information such as the minimum and the maximum, as you can see below. During a query, a data system first reads these statistics to decide which blocks of data it should read.

For example, the first row group only contains agriculture data (which you can infer from `min: agriculture` and `max: agriculture`). As a result, if a query is looking for waste data, it can safely skip this full block.

Grouping the rows and creating statistics can dramatically reduce the amount of data being read and processed. Finding the right number of groups is a tradeoff between using more cores to read the data in parallel, and not having to read too many statistics descriptions. In the extreme case of the file created by Polars (about 4,400 row groups), the statistics make up 40% of the file and can take up to 90% of the processing time! If your parquet file reads slowly, it is probably due to its internal layout.

In [9]:
parquet_file = pyarrow.parquet.ParquetFile(fname_post)
parquet_file.metadata.row_group(0).column(58).statistics


<pyarrow._parquet.Statistics object at 0x75dc55653510>
  has_min_max: True
  min: agriculture
  max: agriculture
  null_count: 0
  distinct_count: None
  num_values: 100390
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

## Initial checks

We now check that it works correctly. Let's load the newly created data for the year 2022, instead of the default version stored on the internet.

In [10]:
sdf = ct.read_source_emissions(year=2022, p="/tmp")
sdf

About 6M records for this year. This is spread across multiple gases, and also across multiple trips in the case of boats or airplanes.

In [11]:
sdf.select(pl.len()).collect()

len
u32
5798595


Check the number of distinct source IDs

In [12]:
by_sec = (sdf
.group_by(SOURCE_ID, SECTOR)
.agg(pl.len())
.collect())

The number of sources outside forestry and land use (FLU):

```{admonition} CTODO
This number does not match the official number on the Climate Trace website (395075 for 2022). Investigate.
```

In [13]:
by_sec.filter(c_sector != FORESTRY_AND_LAND_USE).select(pl.len())

len
u32
324649


Check: no source is associated with multiple sectors.

In [14]:
by_sec.group_by(SOURCE_ID).agg(c_sector.n_unique()).filter(pl.col(SECTOR) > 1)

source_id,sector
u64,u32


Check: no annual source should be duplicated by gas.

```{admonition} CTODO
Some sources seem to be duplicated?
```

In [15]:
(sdf
.filter(c_temporal_granularity =="annual")
.group_by(SOURCE_ID, GAS)
.agg(pl.len())
.filter(pl.col("len") > 1)
.sort(by="len")
.collect())

source_id,gas,len
u64,enum,u32
13167024,"""co2e_20yr""",2
13174845,"""co2e_100yr""",2
13167024,"""co2""",2
13167068,"""co2""",2
13174845,"""n2o""",2
…,…,…
13168862,"""co2e_100yr""",6
13168862,"""n2o""",6
13168862,"""co2e_20yr""",6
13168862,"""ch4""",6


In [16]:
# Drilling into one duplicated record. It seems to be mixing multiple temporal granularities.
# Unsure how to handle this for now.
(sdf
.filter(c_source_id == 13168862)
.filter(c_temporal_granularity =="annual")
.filter(c_gas == CO2)
.head(20)
.collect()
)

source_id,iso3_country,original_inventory_sector,start_time,end_time,temporal_granularity,gas,emissions_quantity,emissions_factor,emissions_factor_units,capacity,capacity_units,capacity_factor,activity,activity_units,created_date,modified_date,source_name,source_type,lat,lon,other1,other2,other3,other4,other5,other6,other7,other8,other9,other10,other11,other12,other1_def,other2_def,other3_def,other4_def,other5_def,other6_def,other7_def,other8_def,other9_def,other10_def,other11_def,other12_def,geometry_ref,conf_source_type,conf_capacity,conf_capacity_factor,conf_activity,conf_co2_emissions_factor,conf_ch4_emissions_factor,conf_n2o_emissions_factor,conf_co2_emissions,conf_ch4_emissions,conf_n2o_emissions,conf_total_co2e_20yrgwp,conf_total_co2e_100yrgwp,sector,subsector
u64,enum,enum,datetime[μs],datetime[μs],enum,enum,f64,f64,str,f64,str,f64,f64,str,datetime[μs],datetime[μs],str,str,f64,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum
13168862,"""SGP""","""international-…",2022-01-24 00:00:05,2023-01-28 04:05:20,"""annual""","""co2""",887242.035664,384.71971,"""average kg of …",2270.0,"""voyages""",,2360400.0,"""nautical miles…",2022-01-24 00:00:00,2023-10-27 16:00:00,"""Singapore""",,1.3221,103.8281,"""Singaporean Ex…",,,,,,,,,,,,"""economic_zone""",,,,,,,,,,,,"""trace_103.8281…",,,,,,,,,,,,,"""transportation…","""international-…"
13168862,"""SGP""","""international-…",2022-03-07 00:00:12,2023-08-17 05:28:10,"""annual""","""co2""",777996.059535,382.425599,"""average kg of …",2143.0,"""voyages""",,2180400.0,"""nautical miles…",2022-03-07 00:00:00,2023-10-27 16:00:00,"""Singapore""",,1.3221,103.8281,"""Singaporean Ex…",,,,,,,,,,,,"""economic_zone""",,,,,,,,,,,,"""trace_103.8281…",,,,,,,,,,,,,"""transportation…","""international-…"
13168862,"""SGP""","""international-…",2022-02-21 00:07:40,2023-06-09 17:30:53,"""annual""","""co2""",705932.371351,378.080732,"""average kg of …",2083.0,"""voyages""",,2045600.0,"""nautical miles…",2022-02-21 00:00:00,2023-10-27 16:00:00,"""Singapore""",,1.3221,103.8281,"""Singaporean Ex…",,,,,,,,,,,,"""economic_zone""",,,,,,,,,,,,"""trace_103.8281…",,,,,,,,,,,,,"""transportation…","""international-…"
13168862,"""SGP""","""international-…",2022-05-23 00:00:07,2023-08-21 08:19:06,"""annual""","""co2""",866005.903,385.374134,"""average kg of …",2338.0,"""voyages""",,2454700.0,"""nautical miles…",2022-05-23 00:00:00,2023-10-27 16:00:00,"""Singapore""",,1.3221,103.8281,"""Singaporean Ex…",,,,,,,,,,,,"""economic_zone""",,,,,,,,,,,,"""trace_103.8281…",,,,,,,,,,,,,"""transportation…","""international-…"
13168862,"""SGP""","""international-…",2022-05-09 00:00:10,2023-06-13 04:06:34,"""annual""","""co2""",835881.630293,382.002172,"""average kg of …",2255.0,"""voyages""",,2376900.0,"""nautical miles…",2022-05-09 00:00:00,2023-10-27 16:00:00,"""Singapore""",,1.3221,103.8281,"""Singaporean Ex…",,,,,,,,,,,,"""economic_zone""",,,,,,,,,,,,"""trace_103.8281…",,,,,,,,,,,,,"""transportation…","""international-…"
13168862,"""SGP""","""international-…",2022-03-14 00:03:45,2023-07-05 16:50:59,"""annual""","""co2""",724527.140616,381.217951,"""average kg of …",2095.0,"""voyages""",,2076600.0,"""nautical miles…",2022-03-14 00:00:00,2023-10-27 16:00:00,"""Singapore""",,1.3221,103.8281,"""Singaporean Ex…",,,,,,,,,,,,"""economic_zone""",,,,,,,,,,,,"""trace_103.8281…",,,,,,,,,,,,,"""transportation…","""international-…"


In [17]:
(sdf
.filter(c_temporal_granularity =="month")
.group_by(SOURCE_ID, GAS)
.agg(pl.len())
.filter(pl.col("len") > 12)
.collect())

source_id,gas,len
u64,enum,u32
13168627,"""n2o""",15
13169229,"""ch4""",13
13169158,"""ch4""",16
13169229,"""co2e_100yr""",13
13167101,"""ch4""",19
…,…,…
13169112,"""n2o""",14
13168359,"""ch4""",13
13168272,"""co2""",13
13168272,"""co2e_100yr""",13


Check: emissions should always be defined

```{admonition} CTODO

Some source emissions have null values, it should be zero or excluded.
```

In [18]:
sdf = ct.read_source_emissions(2022, "/tmp")
(sdf
 .select(c_emissions_quantity.is_null().alias("null_emissions"), c_subsector, c_iso3_country)
 .group_by(c_subsector, "null_emissions")
 .agg(pl.len())
 .collect()
 .pivot(index=SUBSECTOR, columns="null_emissions", values="len")
)

subsector,false,true
enum,u32,u32
"""solid-waste-di…",31130,20440
"""international-…",14280,34780
"""forest-land-fi…",149055,99370
"""aluminum""",8280,5520
"""synthetic-fert…",129626,86419
…,…,…
"""enteric-fermen…",138503,92717
"""net-wetland""",128256,85504
"""international-…",419871,279914
"""manure-managem…",23164,31


### Investigation: for 2022, some source emissions are a multiple (2x or 3x) of the country emissions for the same country

Here is an example for domestic aviation in ARE, for CO2/CO2e.

This investigation looks directly at the CSV files, so the discrepancy is not a bug in the preprocessing.

In [19]:
sdf = ct.data._load_csv(
    lambda fname, sname: fname == "transportation.zip" and sname == "domestic-aviation_emissions-sources.csv",
    [SOURCE_ID, SOURCE_NAME, EMISSIONS_QUANTITY, ISO3_COUNTRY, START_TIME, END_TIME, GAS])

DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: ['domestic-aviation_emissions-sources.csv']
DEBUG:ctrace.data:opening transportation.zip / domestic-aviation_emissions-sources.csv
  df = pl.read_csv(zf.open(sname), infer_schema_length=0)
DEBUG:ctrace.data:sources: []


In [20]:
cedf = ct.data._load_csv(
    lambda fname, sname: fname == "transportation.zip" and sname == "domestic-aviation_country_emissions.csv",
    [EMISSIONS_QUANTITY, ISO3_COUNTRY, START_TIME, END_TIME, GAS])

DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: ['domestic-aviation_country_emissions.csv']
DEBUG:ctrace.data:opening transportation.zip / domestic-aviation_country_emissions.csv
DEBUG:ctrace.data:sources: []


In [21]:
(cedf.filter(c_iso3_country == "ARE")
 .filter(c_start_time.str.starts_with("2022"))
 .filter(c_gas == CO2E_100YR))

emissions_quantity,iso3_country,start_time,end_time,gas,zip_name,file_name
str,str,str,str,str,str,str
"""44.62328775845…","""ARE""","""2022-01-01 00:…","""2022-12-31 00:…","""co2e_100yr""","""transportation…","""domestic-aviat…"


In [22]:
(sdf
 .filter(c_iso3_country == "ARE")
 .filter(c_start_time.str.starts_with("2022"))
 .filter(c_gas == CO2E_100YR)
 .select(c_emissions_quantity.cast(pl.Float32).sum()))

emissions_quantity
f32
89.246567


In [23]:
89.246567 / 44.6232877584

1.999999809140016

### Investigation: some source confidences are present multiple times

There seems to be a data quality issue with the source confidence: the same source ID appears with multiple names.
As a result, doing a left join between sources and source confidences duplicates each source row once per matching confidence row.

Current workaround: take only the first confidence row for each source and each year.

```{admonition} CTODO
Investigate data issue
```

In [24]:
sdf = ct.data._load_csv(
    lambda _, sname: sname.startswith("manure") and sname.endswith("sources.csv"),
    [SOURCE_ID, SOURCE_NAME, EMISSIONS_QUANTITY, ISO3_COUNTRY, START_TIME, END_TIME, GAS])

DEBUG:ctrace.data:sources: ['manure-management-cattle-feedlot_emissions-sources.csv', 'manure-left-on-pasture-cattle_emissions-sources.csv']
DEBUG:ctrace.data:opening agriculture.zip / manure-management-cattle-feedlot_emissions-sources.csv
DEBUG:ctrace.data:opening agriculture.zip / manure-left-on-pasture-cattle_emissions-sources.csv
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []


In [25]:
# 25743740
(sdf
 .filter(c_source_id == "25743740")
 .filter(c_start_time.str.starts_with("2022"))
 .filter(c_gas == CO2E_100YR) 
)

source_id,source_name,emissions_quantity,iso3_country,start_time,end_time,gas,zip_name,file_name
str,str,str,str,str,str,str,str,str
"""25743740""","""CHN_dairy_298""","""1225.6""","""CHN""","""2022-01-01 00:…","""2022-12-31 00:…","""co2e_100yr""","""agriculture""","""manure-managem…"


In [26]:
conf_df = ct.data._load_csv(
    lambda _, sname: sname.startswith("manure") and sname.endswith("sources_confidence.csv"))

DEBUG:ctrace.data:sources: ['manure-management-cattle-feedlot_emissions-sources_confidence.csv', 'manure-left-on-pasture-cattle_emissions-sources_confidence.csv']
DEBUG:ctrace.data:opening agriculture.zip / manure-management-cattle-feedlot_emissions-sources_confidence.csv
DEBUG:ctrace.data:opening agriculture.zip / manure-left-on-pasture-cattle_emissions-sources_confidence.csv
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []


In [27]:
(conf_df
 .filter(c_source_id == "25743740")
 .filter(c_start_time.str.starts_with("2022"))
)

source_id,source_name,iso3_country,original_inventory_sector,start_time,end_time,source_type,capacity,capacity_factor,activity,co2_emissions_factor,ch4_emissions_factor,n2o_emissions_factor,co2_emissions,ch4_emissions,n2o_emissions,total_co2e_20yrgwp,created_date,modified_date,total_co2e_100yrgwp,zip_name,file_name
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""25743740""","""CHN_dairy_298""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""high""","""very low""","""medium""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
"""25743740""","""ZAF_beef_1""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""high""","""very low""","""very low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
"""25743740""","""ZAF_beef_10""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""medium""","""very low""","""very low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
"""25743740""","""ZAF_beef_100""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""medium""","""very low""","""very low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
"""25743740""","""ZAF_beef_1000""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""medium""","""very low""","""very low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""25743740""","""ZAF_dairy_95""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""medium""","""very low""","""low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
"""25743740""","""ZAF_dairy_96""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""medium""","""very low""","""low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
"""25743740""","""ZAF_dairy_97""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""medium""","""very low""","""low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"
"""25743740""","""ZAF_dairy_98""","""CHN""","""manure-managem…","""2022-01-01 00:…","""2022-12-31 00:…","""medium""","""very low""","""low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""2023-09-06 00:…",,"""medium""","""agriculture""","""manure-managem…"


### Investigation: some emissions quantities are null

The `true` column (the number of records with null emissions) should be empty for every subsector.

In [28]:
sdf = ct.read_source_emissions(2022, "/tmp")
(sdf
 .select(c_emissions_quantity.is_null().alias("null_emissions"), c_subsector, c_iso3_country)
 .group_by(c_subsector, "null_emissions")
 .agg(pl.len())
 .collect()
 .pivot(index=SUBSECTOR, columns="null_emissions", values="len", aggregate_function="sum")
 .sort(by="false")
)

subsector,true,false
enum,u32,u32
"""bauxite-mining…",644,231
"""petrochemicals…",116,464
"""iron-mining""",2008,672
"""copper-mining""",1527,1248
"""pulp-and-paper…",387,1548
…,…,…
"""manure-left-on…",46702,184518
"""cropland-fires…",,239700
"""wastewater-tre…",,267330
"""international-…",279914,419871


In [29]:
sdf = ct.data._load_csv(
    lambda _, sname: sname.startswith("electricity") and sname.endswith("sources.csv"))

DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: ['electricity-generation_emissions-sources.csv']
DEBUG:ctrace.data:opening power.zip / electricity-generation_emissions-sources.csv
DEBUG:ctrace.data:sources: []
DEBUG:ctrace.data:sources: []


In [30]:
(sdf
.filter(c_emissions_quantity.is_null())
.filter(c_source_id == "25448848")
)

source_id,iso3_country,original_inventory_sector,start_time,end_time,temporal_granularity,gas,emissions_quantity,emissions_factor,emissions_factor_units,capacity,capacity_units,capacity_factor,activity,activity_units,created_date,modified_date,source_name,source_type,lat,lon,other1,other2,other3,other4,other1_def,other2_def,other3_def,other4_def,geometry_ref,zip_name,file_name
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""25448848""","""USA""","""electricity-ge…","""2022-01-01 00:…","""2022-12-31 00:…","""annual""","""ch4""",,,"""field_not_mode…","""138""","""MW""","""0.345""","""417000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"
"""25448848""","""USA""","""electricity-ge…","""2022-01-01 00:…","""2022-12-31 00:…","""annual""","""n2o""",,,"""field_not_mode…","""138""","""MW""","""0.345""","""417000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"
"""25448848""","""USA""","""electricity-ge…","""2021-01-01 00:…","""2021-12-31 00:…","""annual""","""n2o""",,,"""field_not_mode…","""138""","""MW""","""0.324""","""392000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"
"""25448848""","""USA""","""electricity-ge…","""2021-01-01 00:…","""2021-12-31 00:…","""annual""","""ch4""",,,"""field_not_mode…","""138""","""MW""","""0.324""","""392000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"
"""25448848""","""USA""","""electricity-ge…","""2020-01-01 00:…","""2020-12-31 00:…","""annual""","""n2o""",,,"""field_not_mode…","""138""","""MW""","""0.335""","""406000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"
"""25448848""","""USA""","""electricity-ge…","""2020-01-01 00:…","""2020-12-31 00:…","""annual""","""ch4""",,,"""field_not_mode…","""138""","""MW""","""0.335""","""406000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"
"""25448848""","""USA""","""electricity-ge…","""2019-01-01 00:…","""2019-12-31 00:…","""annual""","""n2o""",,,"""field_not_mode…","""138""","""MW""","""0.329""","""398000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"
"""25448848""","""USA""","""electricity-ge…","""2019-01-01 00:…","""2019-12-31 00:…","""annual""","""ch4""",,,"""field_not_mode…","""138""","""MW""","""0.329""","""398000""","""MWh""","""2023-10-31 00:…","""2023-11-01 10:…","""Gadsden""","""gas""","""34.0128""","""-85.9708""","""a""",,,,"""field_not_incl…","""biomass_emissi…","""biomass_capaci…","""biomass_genera…","""trace_-85.9708…","""power""","""electricity-ge…"


## Upload the data to the Hugging Face Hub

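This step only works if you have credentials with write access to the target dataset repository (`tjhunter/climate-trace` in the cell below). Without them, the upload request fails and the error is caught and printed.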
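Before running the upload, it can help to check whether a token is available at all. The sketch below is not part of the original notebook: `find_hf_token` is a hypothetical helper, and it assumes the usual conventions of `huggingface_hub`, namely a token passed through the `HF_TOKEN` environment variable or stored by `huggingface-cli login` in `~/.cache/huggingface/token`. The file location is an assumption and can differ if `HF_HOME` is set or with other library versions.

```python
import os
from pathlib import Path


def find_hf_token(env=os.environ, token_file=None):
    """Return a Hugging Face token if one can be found locally, else None.

    Hypothetical helper: it only looks in the two assumed locations
    described above and never contacts the Hub.
    """
    # 1. Environment variable, assumed to take precedence.
    token = env.get("HF_TOKEN")
    if token and token.strip():
        return token.strip()

    # 2. Token file written by `huggingface-cli login` (assumed default path).
    if token_file is not None:
        path = Path(token_file)
    else:
        path = Path.home() / ".cache" / "huggingface" / "token"
    if path.is_file():
        content = path.read_text().strip()
        return content or None

    return None
```

If `find_hf_token()` returns `None`, the upload cell is likely to fail with an HTTP error. Finding a token does not guarantee write access to the repository: that is only verified when the upload is attempted.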

In [38]:
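import os

import huggingface_hub
import huggingface_hub.utils

try:
    api = huggingface_hub.HfApi()
    # write_data() yields pairs whose second element is the path of a local Parquet file.
    for (_, fpath) in write_data():
        fname = os.path.basename(fpath)
        print(fname, fpath)
        # Upload each file to the root of the dataset repository under its base name.
        api.upload_file(
            path_or_fileobj=fpath,
            path_in_repo=fname,
            repo_id="tjhunter/climate-trace",
            repo_type="dataset",
        )
except huggingface_hub.utils.HfHubHTTPError as e:
    # Raised on HTTP errors from the Hub, e.g. missing or insufficient credentials.
    print("Upload failed:")
    print(e)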

climate_trace-sources_v2-2023-ct2_2015.parquet /tmp/climate_trace-sources_v2-2023-ct2_2015.parquet


climate_trace-sources_v2-2023-ct2_2015.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

climate_trace-sources_v2-2023-ct2_2016.parquet /tmp/climate_trace-sources_v2-2023-ct2_2016.parquet


climate_trace-sources_v2-2023-ct2_2016.parquet:   0%|          | 0.00/132M [00:00<?, ?B/s]

climate_trace-sources_v2-2023-ct2_2017.parquet /tmp/climate_trace-sources_v2-2023-ct2_2017.parquet


climate_trace-sources_v2-2023-ct2_2017.parquet:   0%|          | 0.00/134M [00:00<?, ?B/s]

climate_trace-sources_v2-2023-ct2_2018.parquet /tmp/climate_trace-sources_v2-2023-ct2_2018.parquet


climate_trace-sources_v2-2023-ct2_2018.parquet:   0%|          | 0.00/134M [00:00<?, ?B/s]

climate_trace-sources_v2-2023-ct2_2019.parquet /tmp/climate_trace-sources_v2-2023-ct2_2019.parquet


climate_trace-sources_v2-2023-ct2_2019.parquet:   0%|          | 0.00/136M [00:00<?, ?B/s]

climate_trace-sources_v2-2023-ct2_2020.parquet /tmp/climate_trace-sources_v2-2023-ct2_2020.parquet


climate_trace-sources_v2-2023-ct2_2020.parquet:   0%|          | 0.00/136M [00:00<?, ?B/s]

climate_trace-sources_v2-2023-ct2_2021.parquet /tmp/climate_trace-sources_v2-2023-ct2_2021.parquet


climate_trace-sources_v2-2023-ct2_2021.parquet:   0%|          | 0.00/150M [00:00<?, ?B/s]

climate_trace-sources_v2-2023-ct2_2022.parquet /tmp/climate_trace-sources_v2-2023-ct2_2022.parquet


climate_trace-sources_v2-2023-ct2_2022.parquet:   0%|          | 0.00/147M [00:00<?, ?B/s]