cat(paste("(C) (cc by-sa) Wouter van Atteveldt & Jan Kleinnijenhuis, file generated", format(Sys.Date(), format="%B %d %Y")))
Note on the data used in this howto: This data can be downloaded from http://piketty.pse.ens.fr/files/capital21c/en/xls/, but the excel format is a bit difficult to parse at it is meant to be human readable, with multiple header rows etc. For that reason, I've extracted csv files for some interesting tables that I've uploaded to http://vanatteveldt.com/uploads/rcourse/data
Time Series Analysis with R
*Caveat*: I am not an expert on time series modelling. Please take this document as a source of inspiration rather than as a definitive set of answers.
Time series analysis consists of statistical procedures that take into account that observations are sequential and often serially autocorrelated. Typical questions for time series analysis are whether change in one variable affects another variable at a later date; and in reciprocal series which effect is stronger (granger causality).
In R, there is an object type
ts that can be used to represent time series.
income = read.csv("data/income_toppercentile.csv") income = ts(income[-1], start=income[1,1], frequency=1) class(income)
As you can see from the
class output, a time series is really a specialised type of matrix. One of the nice things is that the default plot for
ts objects gives an overview of the different time series over time:
plot(income, main="Income Inequality (percentage of income earned by top 1%)")
Univariate time series analysis
Let's first consider the univariate case. We focus on the US series since it is a long series without missing values.
us = na.omit(income[, "US"]) plot(us)
We could easily do a "naive" time series by just running a linear model with a lagged variable, e.g. to model the autocorrelation:
summary(lm(us ~ lag(us)))
However, we generally want to use the more specialized methods such as VAR and ARIMA.
Transforming time series
Before going on with the analysis, we need to know whether the time series is /stationary/, i.e. whether its properties such as mean and variance are constant over time.
Nonstationarity means that a time series shows an upward or downward trend for many time points; the series does not revert to it mean (almost) immediately. Nonstationary time series easily show either a positive or a negative spurious correlation, since it's unlikely that two series that show an upward or downward trend for a long sequence of time points will not show such a correlation.
Since a non-stationary time series gives estimation problems it should be transformed first, generally with a combination of differencing, log transformation or Box-Cox transformation.
The first test for stationarity is just looking at the graph. It looks like the series is much more variable initially and at the end, which could be a problem. Moreover, there is an obvious U-shape which means that the mean is not constant.
Formal tests for stationarity include the KPSS tests and the Augmented Dickey-Fuller (ADF) test.
library(tseries) kpss.test(us) adf.test(us)
Take care with interpreting these: the null hypothesis for KPSS is stationarity (rejected in this case), while the null hypothesis for ADF is a unit root, i.e. non-stationarity (not rejected).
A final diagnostic tools is looking at the auto-correlation grapf and partial auto-correlation graph. For a stationary series, both ACF and PACF should drop to zero relatively quickly.
par(mfrow=c(3,1)) plot(us) acf(us) pacf(us)
Thus, the ACF plot also points to a non-stationary series. As a first start, let's difference the series, i.e. instead of analysing income inequality we will analyse change in income inequality.
The reasoning for differencing is as follows: Since it would be highly coincidental that two series that show a spurious correlations if we compare their levels also show a correlation if we compare their changes, it's often a good idea to difference the time series. A famous example is the spurious correlation that indeed gradually less babies were born in Western countries in the twentieth century as the number of storks gradually decreased. If this is not a spurious correlation, then one would expect also a correlation of the differences. Thus, one would expect that the birth cohort immediately after the second world war was preceded by an increase of storks a year before. One would also expect that the decrease of birth immediately after the introduction of the contraception pill was preceded by a mass extinction of storks.
We can difference is series using the
par(mfrow=c(3,1)) plot(diff(us)) acf(diff(us)) pacf(diff(us))
This looks a lot more stationary, but variance still seems bigger at the extremes of the series.
So that is also looking better. However, we should probably still do a log transformation to normalize the variance:
par(mfrow=c(3,1)) plot(diff(log(us))) acf(diff(log(us))) pacf(diff(log(us)))
Now, variance looks about right for the whole series, and there are not ACF's or PACF's.
One should realize that these transformations may either facilitate or complicate the interpretation of research results. A linear relationship between log transformed variables, for example, can be understood often in economics and in finance as a relationship between percentual changes, thus in terms of elasticities. Theory comes first, which means that without a proper interpretation transformations a regression equation based on non-transformed variables may be preferable, although his increases the risk of spurious correlations (stork-baby-correlation). We will come back to the "error correction model" as another way to get rid of trends in the data at the end of this document.
ARIMA (AutoRegressive Integrated Moving Average) models model a time series with a mixture of autoregressive, integrative, and moving average components. One can base the components needed on the (P)ACF plots as presented above. Very briefly put: decaying ACF's such as in the plot of the untransformed series point to an integrative component. After eliminating integrative components, we check the remaining (p)acf plots, and ACF spikes suggest an AR component, while if there are more PACF spikes then ACF spikes we should use MA components.
In our case, since the transformed plots had no significant spikes in either the ACF or the PACF plots,
we don't need an AR or MA component.
Thus, we can fit the ARIMA model, and inspect the residuals with the
library(forecast) m = arima(diff(log(us)), c(0,0,0)) summary(m) tsdiag(m)
What this in fact means is that the best estimate of next years inequality is slightly higher than this years, but other than the very slight trend there are no regularities in the time series.
We can also use the
ets functions that automatically fit a time series,
and we can use
forecast to project the regularities into the future.
In this case, of course, that is not very interesting as there were no regularities in the time series,
m = auto.arima(us) summary(m) forecast(m, 4)
Note that the auto.arima even ignores the slight trend, presumably because the gain in fit is too small.
VAR models for multivariate time series
VAR (Vector Auto Regression) models can be used to model effects between different time series. Let's take the time series of France, US, and Canada and see whether change in income inequality in one country predicts change in the other countries.
income = na.omit(income[, c("US", "France", "Canada")]) plot(income, main="Income Inequality")
library(vars) m = VAR(income) summary(m) plot(m)
All countries have a strong autocorrelation, and only for Canada an second effect is found: a high level of income inequality in the US predicts a high level of inequality in Canada the next year, controlled for the autocorrelation. We can do a VAR analysis on the differenced time series to get rid of the strong autocorrelations (and essentially predict how changes in one country respond to changes in another country)
library(vars) m = VAR(diff(income)) summary(m)
In the case of reciprocal relations it can be useful to use a Granger causality test. Let's use this test to see whether ht
m = VAR(income) causality(m, cause="US")
So, the null hypothesis that there is no granger or instantaneous causality between the US is (presumably erronously) rejected. Add a small piece on Cointegration and vector error correction models?
Error correction models
Another way to look at the data fro the US, France and Canada is that they show in spite of their differences a common trend called "cointegration": up until somewhere in the roaring twenties, downward until the eighties, upwards thereafter, with also some common ripples, e.eg. the short recovery in the late thirties, and the common fall roughly in 2007. The idea of "vector error correction models" (VECM) is that the nonstationarity in each country is caused by the long-term cointegration of economic trends in these countries - due to a common world market. The procedure to test a VECM model is to (1) test whether covariation is present indeed (e.g. with the Johannsen test) (2) test a model in which current changes do not simply depend on previous changes, but also on previous levels. A nickname of the VECM model is the drunk lady and her dog model. They may run apart unexpectedly long, but after a while they will walk together once more.
In statistical tests for small n one often has to assume a constant variance. Both in economics and in the social sciences changes in volatility (variance) are often more important. GARCH models deal not only with changes in means, but also with changes in volatility (AutoRegressive Conditional Heteroscedasticity). Let's give an example in which changes in volatility are extremely important. Understanding the gambler's paradox essentially comes down to understanding that precisely with a fair toss without memory (equal chances on +1p or -1p at each successive toss, and a starting capital C, the chances that one will be bankrupt if the number of tosses is extended forever, however large C, and however small p, since the variance (of a random walk) increases. Whether the risk of going bankrupt is still acceptable depends critically on the volatility, thus (simplified) on the question whether p increases or decreases in a subsequence of tosses.