# Problem Statement

To take meaningful action on climate, governments, NGOs, and the private sector need access to independent, granular, and recent emissions data. At COP29, Al Gore and Gavin McCormick (co-founders of the nonprofit [Climate TRACE coalition](https://climatetrace.org/)) revealed the newest data for tracking emissions from hundreds of millions of sources around the world. Reporting of this data to Climate TRACE is voluntary, and it may at times include anomalous values caused by human error or by governments greenwashing their figures.

Anomalous data can have global effects, because greenwashing creates a false sense of sustainability. It misleads consumers, investors, and regulators about the true environmental impact of a company's or government's activities. This in turn prevents meaningful progress toward climate goals, misdirects resources, and undermines trust in genuine efforts to combat climate change. It also distorts any modelling built on the data, which can lead to incorrect predictions.

As part of an effort to catch such erroneous data, this project aims to create an anomaly detection model for the oil and gas emissions reported by governments around the world.

Let's start by loading the country-level emissions data for oil and gas production.

In [1]:
import numpy as np
import pandas as pd

ong_country = pd.read_csv("data/oil-and-gas-production_country_emissions.csv")
ong_country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12096 entries, 0 to 12095
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   iso3_country              12096 non-null  object 
 1   sector                    12096 non-null  object 
 2   subsector                 12096 non-null  object 
 3   start_time                12096 non-null  object 
 4   end_time                  12096 non-null  object 
 5   gas                       12096 non-null  object 
 6   emissions_quantity        12096 non-null  float64
 7   emissions_quantity_units  0 non-null      float64
 8   temporal_granularity      12096 non-null  object 
 9   created_date              0 non-null      float64
 10  modified_date             0 non-null      float64
dtypes: float64(4), object(7)
memory usage: 1.0+ MB


In [2]:
# Count distinct ISO3 codes (nunique() ignores NaN; this column has no nulls)
print(f"There are a total of {ong_country['iso3_country'].nunique()} countries/territories in the dataset")

There are a total of 252 countries/territories in the dataset
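
The totals divide evenly: 12096 rows / 252 codes = 48 rows per code, which is what one row per month over four years would give. That is an inference from the counts, not something the output above confirms. A minimal sketch of how it could be checked, using a toy frame with hypothetical codes `AAA` and `BBB` standing in for `ong_country`:

```python
import pandas as pd

# Toy stand-in for ong_country (hypothetical codes and dates):
# "AAA" has three monthly rows, "BBB" is missing February.
toy = pd.DataFrame({
    "iso3_country": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "start_time": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-01-01", "2021-03-01"]
    ),
})

# Number of rows per country, and how many countries share each row count
rows_per_country = toy.groupby("iso3_country").size()
print(rows_per_country.value_counts())

# Countries with fewer rows than the longest series have gaps
incomplete = rows_per_country[rows_per_country < rows_per_country.max()]
print(incomplete)
```

On the real data, an empty `incomplete` would suggest that every country has a complete monthly series, which matters later because gaps can look like anomalies to a time-series model.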


The `modified_date`, `created_date`, and `emissions_quantity_units` columns contain no non-null values, so we can remove all three.

In [3]:
# Drop the three columns that are entirely null
ong_country = ong_country.drop(columns=['emissions_quantity_units', 'created_date', 'modified_date'])
ong_country

Unnamed: 0,iso3_country,sector,subsector,start_time,end_time,gas,emissions_quantity,temporal_granularity
0,ABW,fossil-fuel-operations,oil-and-gas-production,2021-01-01 00:00:00,2021-01-31 00:00:00,co2e_100yr,0.0,month
1,ABW,fossil-fuel-operations,oil-and-gas-production,2021-02-01 00:00:00,2021-02-28 00:00:00,co2e_100yr,0.0,month
2,ABW,fossil-fuel-operations,oil-and-gas-production,2021-03-01 00:00:00,2021-03-31 00:00:00,co2e_100yr,0.0,month
3,ABW,fossil-fuel-operations,oil-and-gas-production,2021-04-01 00:00:00,2021-04-30 00:00:00,co2e_100yr,0.0,month
4,ABW,fossil-fuel-operations,oil-and-gas-production,2021-05-01 00:00:00,2021-05-31 00:00:00,co2e_100yr,0.0,month
...,...,...,...,...,...,...,...,...
12091,ZWE,fossil-fuel-operations,oil-and-gas-production,2024-08-01 00:00:00,2024-08-31 00:00:00,co2e_100yr,0.0,month
12092,ZWE,fossil-fuel-operations,oil-and-gas-production,2024-09-01 00:00:00,2024-09-30 00:00:00,co2e_100yr,0.0,month
12093,ZWE,fossil-fuel-operations,oil-and-gas-production,2024-10-01 00:00:00,2024-10-31 00:00:00,co2e_100yr,0.0,month
12094,ZWE,fossil-fuel-operations,oil-and-gas-production,2024-11-01 00:00:00,2024-11-30 00:00:00,co2e_100yr,0.0,month
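
The empty columns above were picked out by reading the `info()` output. As a hedged alternative, they could be detected programmatically, which avoids hard-coding names if the file layout changes. The sketch below uses a small toy frame rather than the real file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for ong_country with two entirely empty columns
toy = pd.DataFrame({
    "iso3_country": ["ABW", "ZWE"],
    "emissions_quantity": [0.0, 0.0],
    "emissions_quantity_units": [np.nan, np.nan],
    "created_date": [np.nan, np.nan],
})

# Names of the columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# dropna(axis=1, how="all") removes exactly those columns
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Printing `empty_cols` before dropping keeps a record of what was removed.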


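The `start_time` and `end_time` columns are stored as strings (`object` dtype), so we convert them to datetimes. Every timestamp shown above falls at midnight (`00:00:00`), so there is no time-of-day information to handle and `pd.to_datetime()` is sufficient.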

In [4]:
# Convert the string columns to datetime64
ong_country['start_time'] = pd.to_datetime(ong_country['start_time'])
ong_country['end_time'] = pd.to_datetime(ong_country['end_time'])

# Verify the resulting dtypes
ong_country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12096 entries, 0 to 12095
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   iso3_country          12096 non-null  object        
 1   sector                12096 non-null  object        
 2   subsector             12096 non-null  object        
 3   start_time            12096 non-null  datetime64[ns]
 4   end_time              12096 non-null  datetime64[ns]
 5   gas                   12096 non-null  object        
 6   emissions_quantity    12096 non-null  float64       
 7   temporal_granularity  12096 non-null  object        
dtypes: datetime64[ns](2), float64(1), object(5)
memory usage: 756.1+ KB
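
The conversion relies on every timestamp being at midnight, which was only observed in the rows displayed earlier. A minimal sketch of how that could be verified across a whole column is given below. It uses a toy frame in place of `ong_country`, and the explicit `format` string is an assumption based on the values shown in the table:

```python
import pandas as pd

# Toy stand-in for ong_country using the same timestamp layout
toy = pd.DataFrame({
    "start_time": ["2021-01-01 00:00:00", "2021-02-01 00:00:00"],
    "end_time": ["2021-01-31 00:00:00", "2021-02-28 00:00:00"],
})

# An explicit format makes parsing strict: a malformed value raises an error
fmt = "%Y-%m-%d %H:%M:%S"
date_cols = ["start_time", "end_time"]
for col in date_cols:
    toy[col] = pd.to_datetime(toy[col], format=fmt)

# normalize() resets the time to midnight; equality means no time-of-day info
all_midnight = all(
    (toy[col] == toy[col].dt.normalize()).all() for col in date_cols
)
print(all_midnight)
```

If the check returned `False` on the real data, the time component would carry information and should not be discarded.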
