# Preprocessing Practical Example

## High dimensional

In this notebook, we preprocess data from the National Institute of Statistics and Geography (INEGI) on several economic indicators for Mexico.


In [1]:
# packages
remove(list = ls())
options(warn = -1)
suppressMessages(library(data.table))
suppressMessages(library(magrittr))
suppressMessages(library(ggplot2))
suppressMessages(library(GGally))
suppressMessages(library(imputeTS))
suppressMessages(library(seasonal))


source("../source/simulations.R")
source("../source/vectorial_methods.R")
source("../source/auxiliar_methods.R")

We read the raw data and its catalogue.

In [2]:
# Raw data
df_BIE <- read.csv("../databases/data_BIE.csv", header = TRUE, row.names = 1)
catalogue_BIE <- read.csv("../databases/catalogue_BIE.csv")
df_BIE <- na.omit(df_BIE)

In [3]:
dim(df_BIE)

The database is:

- `data_BIE.csv`: A set of $m=205$ economic indicators.

We restrict the sample to the period in which all variables are available, which starts in January 2017.

The series are monthly, so each one has length $T=96$. Hence $m>T$, which is characteristic of the high-dimensional setting.
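
One practical consequence of $m>T$ can be checked directly: the sample covariance matrix of $m$ series observed over $T$ periods has rank at most $T-1$, so it is singular. The sketch below uses simulated Gaussian noise with the same dimensions as the dataset. The names `T_len`, `m_vars`, and `X_sim` are illustrative assumptions and are not part of the notebook.

```r
# Simulated stand-in for the data: T = 96 observations of m = 205 series
set.seed(1)
T_len <- 96
m_vars <- 205
X_sim <- matrix(rnorm(T_len * m_vars), nrow = T_len, ncol = m_vars)

# Center each column; the rank of the centered data equals the rank
# of the sample covariance matrix cov(X_sim)
X_centered <- scale(X_sim, center = TRUE, scale = FALSE)

# Numerical rank: count singular values that are not negligibly small
sv <- svd(X_centered)$d
rank_X <- sum(sv > 1e-8 * max(sv))

# Centering removes one degree of freedom, so the rank is at most T - 1 < m
c(rank = rank_X, T_minus_1 = T_len - 1, m = m_vars)
```

Because the covariance matrix cannot be inverted in this regime, methods that rely on its inverse need some form of regularization or dimension reduction.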


In [4]:
#############
# TS format #
#############
start <- "2017/01"
d <- 12

# dates
dates_BIE <- rownames(df_BIE)[which(rownames(df_BIE) == start):nrow(df_BIE)]
df_BIE <- df_BIE[dates_BIE, ]
variables_BIE <- colnames(df_BIE)

# missing values: keep the observed values of each series, then impute any remaining gaps
for (i in 1:length(variables_BIE)) {
  ts_aux <- ts(df_BIE[, variables_BIE[i]][!is.na(df_BIE[, variables_BIE[i]])],
    start = as.numeric(substring(start, 1, 4)), frequency = d
  )
  df_BIE[1:length(ts_aux), variables_BIE[i]] <- ts_aux
}
df_BIE <- na_kalman(df_BIE)

The paper assumes that the data have neither a *seasonal* nor a *deterministic* component.

First, following the suggestion in the paper, we remove a linear trend from each series.


In [5]:
#################
# linear trends #
#################
data_trend <- matrix(0, nrow = nrow(df_BIE), ncol = ncol(df_BIE))
coefs_trend <- matrix(0, nrow = ncol(df_BIE), ncol = 2)
colnames(data_trend) <- colnames(df_BIE)
rownames(coefs_trend) <- colnames(df_BIE)
colnames(coefs_trend) <- c("const", "slope")

alpha <- 0.05 # significance level for the coefficients

time <- 1:nrow(df_BIE)
log_time <- log(time)

for (i in 1:length(variables_BIE)) {
    aux_reg <- lm(df_BIE[, variables_BIE[i]] ~ time)
    summary_aux_reg <- summary(aux_reg)$coefficients
    # coefs
    if (summary_aux_reg[, 4][[1]] < alpha) coefs_trend[variables_BIE[i], "const"] <- aux_reg$coef[[1]]
    if (summary_aux_reg[, 4][[2]] < alpha) coefs_trend[variables_BIE[i], "slope"] <- aux_reg$coef[[2]]
    # fitted trend (coefficients that are not significant stay at 0)
    data_trend[, variables_BIE[i]] <- coefs_trend[i, "const"] + coefs_trend[i, "slope"] * time
    # subtract the trend from the series
    df_BIE[, variables_BIE[i]] <- df_BIE[, variables_BIE[i]] - data_trend[, variables_BIE[i]]
}
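
As a sanity check, the same detrending rule can be tried on a synthetic series with a known trend. This is a minimal sketch, not the notebook's data: the series `y_sim`, its intercept and slope, and the names `time_idx` and `alpha_sim` are made up for illustration.

```r
# Synthetic monthly series with a known linear trend (illustrative values)
set.seed(123)
n_obs <- 96
time_idx <- 1:n_obs
y_sim <- 5 + 0.3 * time_idx + rnorm(n_obs)

alpha_sim <- 0.05
fit <- lm(y_sim ~ time_idx)
p_vals <- summary(fit)$coefficients[, 4]

# Keep a coefficient only when it is significant at level alpha_sim
const_hat <- if (p_vals[[1]] < alpha_sim) coef(fit)[[1]] else 0
slope_hat <- if (p_vals[[2]] < alpha_sim) coef(fit)[[2]] else 0

# Subtract the fitted trend; the remainder should carry no linear trend
y_detrended <- y_sim - (const_hat + slope_hat * time_idx)
slope_after <- coef(lm(y_detrended ~ time_idx))[[2]]

c(slope_hat = slope_hat, slope_after = slope_after)
```

When only one of the two coefficients is significant, the subtracted line is no longer the least-squares fit, so the detrended series can keep a non-zero mean.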

Next, we remove the seasonal component from the series flagged as seasonal by the `SA` column in `catalogue_BIE.csv`.

In [6]:
###############
# seasonality #
###############
data_seasonality <- matrix(0, nrow = nrow(df_BIE), ncol = ncol(df_BIE))
colnames(data_seasonality) <- colnames(df_BIE)

# seasonality indicator: series flagged with SA == 1 in the catalogue
ind_seas <- as.character(catalogue_BIE[catalogue_BIE[, "SA"] == 1, "Short"])

# decomposition
for (i in 1:length(ind_seas)) {
    ts_aux <- ts(df_BIE[, ind_seas[i]], start = as.numeric(substring(start, 1, 4)), frequency = d)
    descomp_aux <- decompose(ts_aux)

    # seasonal component
    data_seasonality[, ind_seas[i]] <- descomp_aux$seasonal

    # stochastic part
    df_BIE[, ind_seas[i]] <- df_BIE[, ind_seas[i]] - descomp_aux$seasonal
}
df_BIE <- apply(df_BIE, 2, function(x) x - mean(x, na.rm = TRUE))
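
The seasonal adjustment above can be illustrated on a synthetic monthly series. This is a sketch under assumed values: the sinusoidal pattern `seas_pattern` and the noise level are hypothetical. By default, `decompose()` fits an additive model: it estimates the trend with a centered moving average and then averages the detrended values by month.

```r
# Synthetic monthly series: fixed 12-month pattern plus small noise
set.seed(42)
n_years <- 8
seas_pattern <- sin(2 * pi * (1:12) / 12)
y_seas <- ts(rep(seas_pattern, n_years) + rnorm(12 * n_years, sd = 0.1),
  start = 2017, frequency = 12
)

# Additive decomposition (the default)
dec <- decompose(y_seas)

# Seasonally adjusted series: original minus the estimated seasonal component
y_adj <- y_seas - dec$seasonal

# The estimated monthly effects are centered, so they sum to zero over a year
seasonal_year_sum <- sum(dec$seasonal[1:12])

# Largest gap between the estimated and the true monthly pattern
max_err <- max(abs(dec$figure - seas_pattern))
```

Note that `dec$seasonal` covers the full sample, while `dec$trend` is `NA` at both ends because of the moving-average window. That is why only the seasonal component is subtracted here.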

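We test each series for stationarity with the KPSS test and count how many series are stationary at the 5% level, that is, how many do not reject the null hypothesis of stationarity.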

In [7]:
df_BIE <- na.omit(df_BIE)
sum(apply(df_BIE, 2, function(x) tseries::kpss.test(x)$p.value > 0.05))
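
To read the count above correctly, recall that the null hypothesis of the KPSS test is (level) stationarity, so a p-value above 0.05 means stationarity is not rejected. The sketch below compares a white-noise series with a strongly trending one. Both series are simulated and their names are illustrative. It assumes the `tseries` package is installed, as in the cell above.

```r
# Two simulated series: white noise and a strong deterministic trend
set.seed(7)
n_sim <- 200
x_stationary <- rnorm(n_sim)
x_trending <- 0.5 * seq_len(n_sim) + rnorm(n_sim)

# kpss.test() warns when the p-value falls outside its table of
# critical values; the warning is silenced here
p_stationary <- suppressWarnings(tseries::kpss.test(x_stationary)$p.value)
p_trending <- suppressWarnings(tseries::kpss.test(x_trending)$p.value)

c(stationary = p_stationary, trending = p_trending)
```

The reported p-values are interpolated from a table and bounded between 0.01 and 0.1, so they are best used for the reject or do-not-reject decision rather than as exact probabilities.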

We save the preprocessed dataset as `variables_BIE.csv` in the `databases` directory.

In [8]:
dt_BIE <- as.data.table(df_BIE)
# add the dates as a column and move it to the front
dt_BIE[, Date := lubridate::ym(dates_BIE)]
setcolorder(dt_BIE, c("Date", setdiff(names(dt_BIE), "Date")))
fwrite(dt_BIE, file = "../databases/variables_BIE.csv")