# Data ingestion and formatting

This notebook explains how to convert the Climate TRACE dataset to a format that is more appropriate for data science. 

```{note}
This section is relevant for data engineers, or data scientists who want to understand how the data 
has been prepared. Skip if you just want to access the final, prepared data.
```

The original data from Climate TRACE is offered as a series of CSV files bundled in ZIP archives. That format is universally understood, but it is not the most efficient for analysis with data science tools.

Instead, we are going to use the Parquet format. This format has a number of advantages:
- it is _column-based_ : data systems can process big chunks of data at once, rather than line by line. Also, depending on the information requested, systems read only the relevant columns and skip the rest
- it is _structured_ : basic type information (numbers, categories, ...) is preserved, so readers do not have to parse and guess types from text. This provides a large speed boost
- it is _universal_ : most modern data systems will be able to read it

```{admonition} TODO
complete this notebook and publish the source code.
```

Looking at the code, we are performing a few tricks:

_Compacting the data_ We minimize the size of the files by taking advantage of the structure of the data. In particular, we know that in many cases values belong to known enumerations (sectors, ...). We replace all of these with `polars.Enum` types. Not only does this make the files smaller, it also allows data systems to make clever optimizations for complex operations such as joins.

_Lazy reading_ If we were to read all the source data using a traditional system such as Excel or Pandas, we would need a large amount of memory: the files themselves are more than 5GB. Polars is capable of reading straight from the zip file in a streaming fashion. This is what Polars calls a lazy dataframe, or `LazyFrame`. Even when doing complicated operations such as joining the source files with the confidence information, Polars only uses 3GB of memory on my machine. In fact, this way of working is so fast that the `ctrace` package directly reads all the country emissions data from the zip files in less than a second.

_Using known enumerations_ You will see in the source code that nearly all the names (column names, names of gases and sectors, etc.) are replaced by constants such as `CH4`. You can use them for autocompletion in your editor.



In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
logging.basicConfig(level=logging.DEBUG)

In [3]:
import polars as pl

from ctrace.constants import *
import ctrace as ct
from dds import data_function
import dds
dds.accept_module(ct)

In this notebook, another trick is to define the transformations as _data functions_. In short, a data function only reruns when its source code changes. This makes rerunning the notebook very fast, since the outputs are only updated when something has changed in the code.
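
To give an intuition for this idea, here is a toy sketch of a cache keyed on a fingerprint of a function's code. This is not how `dds` is implemented: the decorator `cached_by_code`, the in-memory dictionary and the fingerprint are all hypothetical and only illustrate the principle.

```python
import hashlib

_cache = {}  # maps (path, code fingerprint) to a stored result
calls = []   # records each real execution of the function body


def cached_by_code(path):
    """Hypothetical decorator: rerun the function only if its code changed."""

    def decorator(fn):
        code = fn.__code__
        # Fingerprint of the compiled code and its constants.
        fingerprint = hashlib.sha256(
            code.co_code + repr(code.co_consts).encode()
        ).hexdigest()

        def wrapper():
            key = (path, fingerprint)
            if key not in _cache:
                _cache[key] = fn()
            return _cache[key]

        return wrapper

    return decorator


@cached_by_code("/toy_data")
def build():
    calls.append("ran")
    return [1, 2, 3]


first = build()
second = build()  # served from the cache, the body does not run again
print(first, second, len(calls))
```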

In [4]:
@data_function("/data_sources")
def load_sources():
    (_, files) = ct.data.load_source_compact()
    return files

load_sources()

DEBUG:dds._api:_eval_new_ctx: local_vars: ['ACTIVITY', 'ACTIVITY_UNITS', 'AGRICULTURE', 'ALUMINUM', 'BAUXITE_MINING', 'BIOLOGICAL_TREATMENT_OF_SOLID_WASTE_AND_BIOGENIC', 'BUILDINGS', 'C', 'CAPACITY', 'CAPACITY_FACTOR', 'CAPACITY_UNITS', 'CEMENT', 'CH4', 'CH4_EMISSIONS', 'CH4_EMISSIONS_FACTOR', 'CHEMICALS', 'CO2', 'CO2E_100YR', 'CO2E_20YR', 'CO2_EMISSIONS', 'CO2_EMISSIONS_FACTOR', 'COAL_MINING', 'CONFIDENCES', 'CONF_CH4_EMISSIONS', 'CONF_CH4_EMISSIONS_FACTOR', 'CONF_CO2_EMISSIONS', 'CONF_CO2_EMISSIONS_FACTOR', 'CONF_N2O_EMISSIONS', 'CONF_N2O_EMISSIONS_FACTOR', 'CONF_TOTAL_CO2E_100YRGWP', 'CONF_TOTAL_CO2E_20YRGWP', 'COPPER_MINING', 'CREATED_DATE', 'CROPLAND_FIRES', 'Confidence', 'DOMESTIC_AVIATION', 'DOMESTIC_SHIPPING', 'ELECTRICITY_GENERATION', 'EMISSIONS_FACTOR', 'EMISSIONS_FACTOR_UNITS', 'EMISSIONS_QUANTITY', 'EMISSIONS_QUANTITY_UNITS', 'END_TIME', 'ENTERIC_FERMENTATION_CATTLE_FEEDLOT', 'ENTERIC_FERMENTATION_CATTLE_PASTURE', 'ENTERIC_FERMENTATION_OTHER', 'FLUORINATED_GASES', 'FORESTRY

[PosixPath('/tmp/enteric-fermentation-cattle-feedlot_emissions-sources.parquet'),
 PosixPath('/tmp/manure-management-cattle-feedlot_emissions-sources.parquet'),
 PosixPath('/tmp/rice-cultivation_emissions-sources.parquet'),
 PosixPath('/tmp/synthetic-fertilizer-application_emissions-sources.parquet'),
 PosixPath('/tmp/enteric-fermentation-cattle-pasture_emissions-sources.parquet'),
 PosixPath('/tmp/cropland-fires_emissions-sources.parquet'),
 PosixPath('/tmp/manure-left-on-pasture-cattle_emissions-sources.parquet'),
 PosixPath('/tmp/water-reservoirs_emissions-sources.parquet'),
 PosixPath('/tmp/removals_emissions-sources.parquet'),
 PosixPath('/tmp/forest-land-fires_emissions-sources.parquet'),
 PosixPath('/tmp/forest-land-degradation_emissions-sources.parquet'),
 PosixPath('/tmp/forest-land-clearing_emissions-sources.parquet'),
 PosixPath('/tmp/net-wetland_emissions-sources.parquet'),
 PosixPath('/tmp/net-forest-land_emissions-sources.parquet'),
 PosixPath('/tmp/net-shrubgrass_emissio

Because the data is loaded lazily, this step takes only 300MB of memory on my machine. Not bad for producing 2GB of data!

To help with the loading, the data is partitioned by year. This is the most relevant split for most users, who are expected to look at specific years and sectors (especially the latest year). It reduces the amount of data to load.

```{admonition} CTODO
The year of a data record is defined by its start time. This may differ from the convention used by Climate TRACE. To check.
```

In [5]:
write_directory = "/tmp"
years = list(range(2015, 2023))
version = ct.data.version

@data_function("/write_data")
def write_data():
    data_files = load_sources()
    dfs = []
    for tmp_name in data_files:
        df = pl.scan_parquet(tmp_name)
        df = df.pipe(ct.data.recast_parquet, conf=True)
        dfs.append(df)
    ldf = pl.concat(dfs)
    # Using snappy because it is more broadly compatible with parquet readers
    # Polars 0.20 does not support statistics
    for year in years:
        (ldf
         .filter(c_start_time.dt.year()==year)
         .sink_parquet(
            f"{write_directory}/climatetrace-sources_{version}_{year}.parquet",
            compression="snappy",
            statistics=False))

write_data()

DEBUG:dds._api:_eval_new_ctx: local_vars: ['ACTIVITY', 'ACTIVITY_UNITS', 'AGRICULTURE', 'ALUMINUM', 'BAUXITE_MINING', 'BIOLOGICAL_TREATMENT_OF_SOLID_WASTE_AND_BIOGENIC', 'BUILDINGS', 'C', 'CAPACITY', 'CAPACITY_FACTOR', 'CAPACITY_UNITS', 'CEMENT', 'CH4', 'CH4_EMISSIONS', 'CH4_EMISSIONS_FACTOR', 'CHEMICALS', 'CO2', 'CO2E_100YR', 'CO2E_20YR', 'CO2_EMISSIONS', 'CO2_EMISSIONS_FACTOR', 'COAL_MINING', 'CONFIDENCES', 'CONF_CH4_EMISSIONS', 'CONF_CH4_EMISSIONS_FACTOR', 'CONF_CO2_EMISSIONS', 'CONF_CO2_EMISSIONS_FACTOR', 'CONF_N2O_EMISSIONS', 'CONF_N2O_EMISSIONS_FACTOR', 'CONF_TOTAL_CO2E_100YRGWP', 'CONF_TOTAL_CO2E_20YRGWP', 'COPPER_MINING', 'CREATED_DATE', 'CROPLAND_FIRES', 'Confidence', 'DOMESTIC_AVIATION', 'DOMESTIC_SHIPPING', 'ELECTRICITY_GENERATION', 'EMISSIONS_FACTOR', 'EMISSIONS_FACTOR_UNITS', 'EMISSIONS_QUANTITY', 'EMISSIONS_QUANTITY_UNITS', 'END_TIME', 'ENTERIC_FERMENTATION_CATTLE_FEEDLOT', 'ENTERIC_FERMENTATION_CATTLE_PASTURE', 'ENTERIC_FERMENTATION_OTHER', 'FLUORINATED_GASES', 'FORESTRY

We now check that it works correctly. Let's load the newly created data for the year 2022, instead of the default version stored on the internet.

In [6]:
sdf = ct.read_source_emissions(2022, "/tmp")
sdf

About 6M records for this year. These are spread across multiple gases and, in the case of boats or airplanes, multiple trips.

In [7]:
sdf.select(pl.len()).collect()

len
u32
5812705


Check the number of distinct source IDs:

In [8]:
by_sec = (sdf
.group_by(SOURCE_ID, SECTOR)
.agg(pl.len())
.collect())

The number of sources outside forestry and land use (FLU):

```{admonition} CTODO
This number does not match the official number on the Climate TRACE website (395075 for 2022). Investigate.
```

In [9]:
by_sec.filter(c_sector != FORESTRY_AND_LAND_USE).select(pl.len())

len
u32
324649


Check: no source is associated with multiple sectors.

In [10]:
by_sec.group_by(SOURCE_ID).agg(c_sector.n_unique()).filter(pl.col(SECTOR) > 1)

source_id,sector
u64,u32


Check: no annual source should be duplicated by gas.

```{admonition} CTODO
Some sources seem to be duplicated?
```

In [14]:
(sdf
.filter(c_temporal_granularity =="annual")
.group_by(SOURCE_ID, GAS)
.agg(pl.len())
.filter(pl.col("len") > 1)
.sort(by="len")
.collect())

source_id,gas,len
u64,enum,u32
21096128,"""co2e_100yr""",2
21096128,"""ch4""",2
5012789,"""ch4""",2
5012802,"""co2e_100yr""",2
5012802,"""ch4""",2
…,…,…
25743740,"""co2e_20yr""",2751
25743740,"""co2""",2751
25743740,"""n2o""",2751
25743740,"""co2e_100yr""",2751


In [17]:
# Drilling into a source that appears to be duplicated
(sdf
.filter(c_source_id == 25743740)
.head(2)
.collect()
)

source_id,iso3_country,original_inventory_sector,start_time,end_time,temporal_granularity,gas,emissions_quantity,emissions_factor,emissions_factor_units,capacity,capacity_units,capacity_factor,activity,activity_units,created_date,modified_date,source_name,source_type,lat,lon,other1,other2,other3,other4,other5,other6,other7,other8,other9,other10,other11,other12,other1_def,other2_def,other3_def,other4_def,other5_def,other6_def,other7_def,other8_def,other9_def,other10_def,other11_def,other12_def,geometry_ref,conf_source_type,conf_capacity,conf_capacity_factor,conf_activity,conf_co2_emissions_factor,conf_ch4_emissions_factor,conf_n2o_emissions_factor,conf_co2_emissions,conf_ch4_emissions,conf_n2o_emissions,conf_total_co2e_20yrgwp,conf_total_co2e_100yrgwp,sector,subsector
u64,enum,enum,datetime[μs],datetime[μs],enum,enum,f64,f64,str,f64,str,f64,f64,str,datetime[μs],datetime[μs],str,str,f64,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum
25743740,"""CHN""","""manure-managem…",2022-01-01 00:00:00,2022-12-31 00:00:00,"""annual""","""co2e_100yr""",1225.6,6e-06,,1.0,"""hectares """,4000.0,4000.0,"""animals""",2023-11-07 00:00:00,,"""CHN_dairy_298""","""manure_managem…",27.1,106.5,"""2015""","""reported/permi…","""2400.0""","""0.00047""","""1600.0""","""0.00034""","""18.0""","""1.0""","""[{'unknown': 1…",,,,"""year active""","""cattle_populat…","""potential_tota…","""dairy_or_beef_…","""dairy_total_ot…","""other_cattle_n…","""dairy_CH4_emis…","""beef_other_CH4…","""manure_managem…",,,,"""trace_106.5_27…","""medium""","""very low""","""very low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""medium""","""agriculture""","""manure-managem…"
25743740,"""CHN""","""manure-managem…",2022-01-01 00:00:00,2022-12-31 00:00:00,"""annual""","""co2e_100yr""",1225.6,6e-06,,1.0,"""hectares """,4000.0,4000.0,"""animals""",2023-11-07 00:00:00,,"""CHN_dairy_298""","""manure_managem…",27.1,106.5,"""2015""","""reported/permi…","""2400.0""","""0.00047""","""1600.0""","""0.00034""","""18.0""","""1.0""","""[{'unknown': 1…",,,,"""year active""","""cattle_populat…","""potential_tota…","""dairy_or_beef_…","""dairy_total_ot…","""other_cattle_n…","""dairy_CH4_emis…","""beef_other_CH4…","""manure_managem…",,,,"""trace_106.5_27…","""medium""","""very low""","""very low""","""high""",,"""medium""","""medium""",,"""medium""","""medium""","""medium""","""medium""","""agriculture""","""manure-managem…"


In [15]:
(sdf
.filter(c_temporal_granularity =="month")
.group_by(SOURCE_ID, GAS)
.agg(pl.len())
.filter(pl.col("len") > 12)
.collect())

source_id,gas,len
u64,enum,u32
13169229,"""co2""",13
13168776,"""ch4""",20
13168148,"""n2o""",15
13168359,"""co2e_100yr""",13
13169158,"""co2e_100yr""",16
…,…,…
13166715,"""co2e_100yr""",15
13166715,"""co2""",15
13168182,"""ch4""",16
13169222,"""co2e_20yr""",16
