<h1>DATA PREPROCESSING </h1>
This tutorial uses a simple univariate time-series forecasting model, so only the CO2 emission history is used to build it. Additional features such as GDP, population, or the number of vehicles sold are not incorporated into the model. We apply the following preprocessing steps to the CO2 emission data:

* Combine the year and month columns into a single monthly timestamp
* Extend the time period of all CSVs to a common monthly range starting at 1987-01
* Merge source and fuel type into one column

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

import warnings
warnings.filterwarnings('ignore')

In [3]:
def inter_extra_polate(df,
                column_name = "time",
                by          = "MS",
                start       = "1987-01-01 00:00:00",
                end         = "2021-12-01 00:00:00"):
    
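    """
    Standardize the time range of df so that it runs from `start` to `end` at frequency `by`.
    Gaps inside the observed range are filled by linear interpolation; with
    limit_direction="both", gaps before the first or after the last observation
    are filled with the nearest valid value.
    """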
    
    # set the index of the dataframe to the timestamps
    # (reassign instead of inplace=True so the caller's dataframe is not modified)
    df = df.set_index(column_name)

    # generate timestamps at monthly intervals ("MS" = month start) between start and end
    extrap_index = pd.date_range(start = pd.to_datetime(start),
                                 end   = pd.to_datetime(end),
                                 freq  = by)

    # reindex the dataframe with the new timestamps, filling missing values with NaN
    df = df.reindex(extrap_index)
    
    # fill missing values: linear interpolation between neighboring values,
    # and the nearest valid value at both ends of the range
    df = df.interpolate(method='linear', limit_direction="both").reset_index()
    
    # reset_index() stored the timestamps in a column named "index"; move them to column_name
    df[column_name] = df["index"]
    del df["index"]
    
    return df
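
The effect of the reindex-then-interpolate step can be checked on a toy series. The following is a minimal sketch, not part of the original notebook: the frame `toy`, its values, and the shortened date range are made up for illustration. It suggests that interior gaps are interpolated linearly, while months before the first and after the last observation simply repeat the nearest observed value.

```python
import pandas as pd

# Hypothetical toy data: observations exist only for March and May 1987.
toy = pd.DataFrame(
    {"time": pd.to_datetime(["1987-03-01", "1987-05-01"]), "value": [10.0, 20.0]}
).set_index("time")

# Monthly grid from January to July 1987 ("MS" = month start).
grid = pd.date_range(start="1987-01-01", end="1987-07-01", freq="MS")

# Reindexing inserts NaN for every month without an observation;
# interpolate() then fills them in both directions.
filled = toy.reindex(grid).interpolate(method="linear", limit_direction="both")

print(filled["value"].tolist())
# [10.0, 10.0, 10.0, 15.0, 20.0, 20.0, 20.0]
```

April is the midpoint of its neighbors, whereas January, February, June, and July are held flat rather than extended along a trend.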

In [4]:
# load the data again
df = pd.read_csv("../datasets/train.csv")

In [5]:
# combine the year and month columns into one datetime column (first day of each month)
df["time"] = df.apply(lambda row : pd.to_datetime(str(row["year"]) + " " + str(row["month"])), axis = 1)
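
Row-wise `apply` can be slow on large frames. As a possible alternative, `pd.to_datetime` can assemble timestamps directly from `year`, `month`, and `day` columns. This is a sketch on a made-up frame `toy`, and it assumes the `year` and `month` columns hold numeric values, which the original data may or may not satisfy.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tutorial's CSV.
toy = pd.DataFrame({"year": [1987, 1987, 1988], "month": [1, 12, 2]})

# pd.to_datetime needs year, month and day columns, so a constant day=1 is added.
toy["time"] = pd.to_datetime(toy[["year", "month"]].assign(day=1))

print(toy["time"].tolist())
```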

In [6]:
# delete unnecessary columns
del df["year"]
del df["month"]

In [7]:
# merge source and fuel_type into a single source_fueltype column
df["source_fueltype"] = df[["source", "fuel_type"]].apply("-".join, axis=1)
del df["source"]
del df["fuel_type"]
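
The same merge can also be written with the vectorized string accessor. The sketch below uses a made-up frame `toy`. One difference worth knowing: `"-".join` raises a `TypeError` if either column contains a missing value, whereas `str.cat` returns a missing value for that row by default.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tutorial's CSV.
toy = pd.DataFrame({"source": ["transport", "industry"], "fuel_type": ["oil", "coal"]})

# Concatenate the two string columns element-wise with "-" as separator.
toy["source_fueltype"] = toy["source"].str.cat(toy["fuel_type"], sep="-")

print(toy["source_fueltype"].tolist())
# ['transport-oil', 'industry-coal']
```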

In [9]:
df.head()

,emissions_tons,time,source_fueltype
0,1588.61,1987-01-01,transport-oil
1,1428.29,1987-02-01,transport-oil
2,1581.16,1987-03-01,transport-oil
3,1557.4,1987-04-01,transport-oil
4,1513.35,1987-05-01,transport-oil


In [12]:
# unmelt: pivot from long format to one column per source_fueltype
df = df.pivot(index='time', columns='source_fueltype', values='emissions_tons') 
df = df.reset_index()

df = inter_extra_polate(df)

df = df.set_index("time")
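
To see why the interpolation step follows the pivot, consider a small long-format frame in which one series starts later than the other. This is a minimal sketch with made-up rows (`long_df` and `wide` are illustrative names): the pivot yields one column per `source_fueltype` and leaves `NaN` wherever a series has no observation for a given month.

```python
import pandas as pd

# Hypothetical long-format data: industry-coal has no value for 1987-02.
long_df = pd.DataFrame({
    "time": pd.to_datetime(["1987-01-01", "1987-02-01", "1987-01-01"]),
    "source_fueltype": ["transport-oil", "transport-oil", "industry-coal"],
    "emissions_tons": [1588.61, 1428.29, 190.5],
})

# One row per timestamp, one column per source_fueltype.
wide = long_df.pivot(index="time", columns="source_fueltype", values="emissions_tons")

print(wide)
```

Note that `pivot` raises an error if the same `(time, source_fueltype)` pair occurs more than once, so duplicates would need to be aggregated or removed first.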

In [13]:
df.head()

time,industry-coal,industry-natural_gas,industry-oil,other-oil,transport-natural_gas,transport-oil
1987-01-01,190.5,11.37,531.26,587.72,0.43,1588.61
1987-02-01,280.42,7.11,408.65,535.58,0.43,1428.29
1987-03-01,143.21,8.36,444.82,584.31,0.43,1581.16
1987-04-01,186.45,6.72,433.29,565.79,0.43,1557.4
1987-05-01,273.05,9.45,495.61,563.24,0.43,1513.35


In [18]:
# save the data (with index=False, the "time" index is not written to the CSV)
df.to_csv("../datasets/train-processed.csv", index=False)
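
Because `time` is the index at this point, the choice of `index` decides whether the timestamps end up in the file. The sketch below illustrates the difference on a made-up frame `wide`, writing to in-memory strings instead of the tutorial's path. Whether a later step needs the timestamps is not stated in this section, so this is only an illustration of the two options.

```python
import io
import pandas as pd

# Hypothetical wide-format data indexed by "time", as after the pivot step.
wide = pd.DataFrame(
    {"transport-oil": [1588.61, 1428.29]},
    index=pd.DatetimeIndex(["1987-01-01", "1987-02-01"], name="time"),
)

# to_csv() returns the CSV text when no path is given.
without_index = wide.to_csv(index=False)  # header contains only the value columns
with_index = wide.to_csv()                # header starts with "time"

# Reading the indexed version back restores the datetime index.
restored = pd.read_csv(io.StringIO(with_index), index_col="time", parse_dates=True)

print(without_index.splitlines()[0])
print(with_index.splitlines()[0])
```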