# Preprocessing Practical Example


## Low dimensional

In this notebook, we preprocess data from the National Institute of Statistics and Geography (INEGI) on the components associated with inflation in Mexico.


In [1]:
# packages
remove(list = ls())
options(warn = -1)
suppressMessages(library(data.table))
suppressMessages(library(magrittr))
suppressMessages(library(imputeTS))
suppressMessages(library(seasonal))


source("../source/simulations.R")
source("../source/vectorial_methods.R")
source("../source/auxiliar_methods.R")

We read the databases.

In [2]:
# Raw data
df_inflation <- read.csv("../databases/data_inflation.csv", header = TRUE, row.names = 1)
catalogue_inflation <- read.csv("../databases/catalogue_inflation.csv")
df_inflation <- na.omit(df_inflation)

The database is:

- `data_inflation.csv`: variables associated with inflation.

We use the period in which all variables are available, which starts in January 2005.


Detailed information on each variable is in `catalogue_inflation.csv`.


In [3]:
#############
# TS format #
#############
start <- "2005/01"
d <- 12

# dates
dates_inflation <- rownames(df_inflation)[which(rownames(df_inflation) == start):nrow(df_inflation)]
df_inflation <- df_inflation[dates_inflation, ]
variables_inflation <- colnames(df_inflation)

# missing values (NA)
for (i in 1:length(variables_inflation)) {
  ts_aux <- ts(df_inflation[, variables_inflation[i]][!is.na(df_inflation[, variables_inflation[i]])],
    start = as.numeric(substring(start, 1, 4)), frequency = d
  )
  df_inflation[1:length(ts_aux), variables_inflation[i]] <- ts_aux
}
df_inflation <- na_kalman(df_inflation)

We use the logarithm of `BYM`.


In [4]:
df_inflation[, "BYM"] <- log(df_inflation[, "BYM"])

The paper's basic assumption is that the data have neither a *seasonal* nor a *deterministic* component.

First, following the paper's suggestion, we remove a linear trend.


In [5]:
#################
# linear trends #
#################
data_trend <- matrix(0, nrow = nrow(df_inflation), ncol = ncol(df_inflation))
coefs_trend <- matrix(0, nrow = ncol(df_inflation), ncol = 2)
colnames(data_trend) <- colnames(df_inflation)
rownames(coefs_trend) <- colnames(df_inflation)
colnames(coefs_trend) <- c("const", "slope")

alpha <- 0.05 # significance level for the coefficients

time <- 1:nrow(df_inflation)
log_time <- log(time)

for (i in 1:length(variables_inflation)) {
    aux_reg <- lm(df_inflation[, variables_inflation[i]] ~ time)
    summary_aux_reg <- summary(aux_reg)$coefficients
    # coefs
    if (summary_aux_reg[, 4][[1]] < alpha) coefs_trend[variables_inflation[i], "const"] <- aux_reg$coef[[1]]
    if (summary_aux_reg[, 4][[2]] < alpha) coefs_trend[variables_inflation[i], "slope"] <- aux_reg$coef[[2]]
    # fitted trend (only the significant coefficients are non-zero)
    data_trend[, variables_inflation[i]] <- coefs_trend[i, "const"] + coefs_trend[i, "slope"] * time
    # remove the trend from the series
    df_inflation[, variables_inflation[i]] <- df_inflation[, variables_inflation[i]] - data_trend[, variables_inflation[i]]
}
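
The detrending step above can be illustrated on a single simulated series. This is only a sketch under stated assumptions: the series `y`, the index `time_idx`, and the simulation parameters are hypothetical and do not come from the INEGI data. It fits a linear regression on time, keeps each coefficient only if it is significant at level `alpha`, and subtracts the resulting trend.

```r
# Minimal sketch on simulated data (not the INEGI series)
set.seed(1)
n <- 120                                        # ten years of monthly observations
time_idx <- seq_len(n)                          # hypothetical time index
y <- 2 + 0.05 * time_idx + rnorm(n, sd = 0.5)   # linear trend plus noise

alpha <- 0.05                                   # significance level for the coefficients
fit <- lm(y ~ time_idx)
p_values <- summary(fit)$coefficients[, 4]      # column 4 holds Pr(>|t|)

# Keep a coefficient only when it is significant; otherwise set it to 0
const <- if (p_values[[1]] < alpha) coef(fit)[[1]] else 0
slope <- if (p_values[[2]] < alpha) coef(fit)[[2]] else 0

# Fitted trend and detrended series
trend <- const + slope * time_idx
y_detrended <- y - trend

# Refit on the detrended series to check that no linear trend remains
slope_after <- coef(lm(y_detrended ~ time_idx))[[2]]
```

Setting a non-significant coefficient to zero means that a series without a detectable trend is left unchanged, which mirrors the behaviour of the loop in the cell above.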

Now we remove the seasonal component from the variables marked with the seasonal flag `Seas` in `catalogue_inflation.csv`.


In [6]:
###############
# seasonality #
###############
data_seasonality <- matrix(0, nrow = nrow(df_inflation), ncol = ncol(df_inflation))
colnames(data_seasonality) <- colnames(df_inflation)

# seasonality indicator
ind_seas <- as.character(catalogue_inflation[catalogue_inflation[, "Seas"] == 1, "Variables"])

# decomposition
for (i in 1:length(ind_seas)) {
    ts_aux <- ts(df_inflation[, ind_seas[i]], start = as.numeric(substring(start, 1, 4)), frequency = d)
    descomp_aux <- decompose(ts_aux)

    # seasonal component
    data_seasonality[, ind_seas[i]] <- descomp_aux$seasonal

    # stochastic part
    df_inflation[, ind_seas[i]] <- df_inflation[, ind_seas[i]] - descomp_aux$seasonal
}
df_inflation <- apply(df_inflation, 2, function(x) x - mean(x, na.rm = TRUE))
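
As a rough illustration of the seasonal adjustment above, the sketch below applies `decompose` to one simulated monthly series. The series `x`, the sinusoidal pattern `seasonal_true`, and the noise level are assumptions made for the example only.

```r
# Minimal sketch on simulated data (not the INEGI series)
set.seed(2)
d <- 12                                          # monthly frequency
n_years <- 10
month <- rep(seq_len(d), times = n_years)
seasonal_true <- 2 * sin(2 * pi * month / d)     # hypothetical seasonal pattern
x <- ts(seasonal_true + rnorm(d * n_years, sd = 0.3),
        start = 2005, frequency = d)

# Additive decomposition (the default of decompose)
dec <- decompose(x)

# Remove the estimated seasonal component, then center the series
x_deseason <- x - dec$seasonal
x_centered <- x_deseason - mean(x_deseason, na.rm = TRUE)
```

The `seasonal` component returned by `decompose` is defined for every observation, unlike the `trend` and `random` components, which contain `NA` at both ends of the sample. Subtracting it therefore introduces no missing values.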

We test each series for stationarity with the KPSS test, whose null hypothesis is level stationarity.

In [7]:
df_inflation <- na.omit(df_inflation)
as.table(round(apply(df_inflation, 2, function(x) tseries::kpss.test(x)$p.value), digits = 3))

    P     U   BYM     E     R     W  PUSA 
0.010 0.021 0.100 0.085 0.018 0.010 0.060 
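
To help read the p-values above, the following sketch runs the same test on two simulated series. It assumes the `tseries` package is installed; the series `noise` and `trending` are hypothetical. Because the null hypothesis of the KPSS test is stationarity, small p-values are evidence against it. `tseries::kpss.test` interpolates its p-value from a table of critical values, so the reported value is bounded between 0.01 and 0.1, which is why values such as 0.010 and 0.100 appear in the output.

```r
# Minimal sketch on simulated data; assumes the tseries package is installed
set.seed(3)
n <- 200
noise <- rnorm(n)                         # stationary around a constant level
trending <- 0.1 * seq_len(n) + rnorm(n)   # strong linear trend, not level-stationary

# kpss.test warns when the statistic falls outside its table of critical values
p_noise <- suppressWarnings(tseries::kpss.test(noise)$p.value)
p_trending <- suppressWarnings(tseries::kpss.test(trending)$p.value)

round(c(noise = p_noise, trending = p_trending), 3)
```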

We save the preprocessed dataset as `variables_inflation.csv` in the `databases` directory.


In [8]:
dt_inflation <- as.data.table(df_inflation)
dt_inflation[, Date := lubridate::ym(dates_inflation)]
setcolorder(dt_inflation, c("Date", setdiff(names(dt_inflation), "Date")))
fwrite(dt_inflation, file = "../databases/variables_inflation.csv")