## [Forecasting Economic Aggregates Using Dynamic Component Grouping](https://mpra.ub.uni-muenchen.de/81585/1/MPRA_paper_81585.pdf)

__Abstract__

In terms of aggregate accuracy, whether it is worth the effort of modelling a disaggregate process, instead of forecasting the aggregate directly, depends on the
properties of the data. Forecasting the aggregate directly and forecasting each of
the components separately, however, are not the only options. This paper develops
a framework to forecast an aggregate that dynamically chooses groupings of components based on the properties of the data to benefit from both the advantages
of aggregation and disaggregation. With this objective in mind, the dimension of
the problem is reduced by selecting a subset of possible groupings through the use
of agglomerative hierarchical clustering. The definitive forecast is then produced
based on this subset. The results from an empirical application using CPI data for
France, Germany and the UK suggest that the grouping methods can improve both
aggregate and disaggregate accuracy.

#### Intro

The usual argument behind using the components is that
allowing for different specifications across disaggregate variables may capture more
precisely the dynamics of a process that becomes too complex through aggregation.
The argument in favour of forecasting the aggregate directly is that it would be less affected by disaggregate misspecification, data measurement error and structural breaks. Ultimately, whether it is
better to forecast components together or separately depends on the particular forecasting models and data.



The options available for forecasting are many, even when only the level of disaggregation is considered. 
The usual argument behind using the components to forecast an aggregate is that allowing for different specifications across disaggregate variables may capture more precisely the dynamics of a process that becomes too complex through aggregation. In support of this view, Granger (1990) shows that summing
many simple stationary processes can produce a fractionally integrated aggregate, while
Bermingham and D’Agostino (2014) show that the dispersion of the persistence of individual series has an accelerating effect on the increase of complexity in the aggregate.

The argument in favour of forecasting the aggregate directly is that, in practical applications, it is likely
that the disaggregate processes suffer from misspecification. For example, if the
disaggregate models neglect that a number of components share common factors, the
forecasting errors will tend to cluster, with a negative effect on the aggregate forecast. The direct aggregate forecast would be less affected by these features
in the data and by other problems, like those resulting from data measurement error and
structural breaks.

The theoretical literature supports using the disaggregate forecasts, or bottom-up approach, but the results in the empirical literature are mixed. Ultimately, whether the
magnitude of the aggregation error compensates for the specification errors in the disaggregate model depends on the particular forecasting models and data.

One option to improve forecasting performance in this setting is to work on the modelling, for example by including disaggregate information in a direct aggregate approach or by including common factors
in a bottom-up approach. Another, less obvious, way is to look for data transformations
that allow existing models to perform better.

> Examples: Marcellino et al. (2003), Hahn and Skudelny (2008), Burriel (2012) and Esteves
(2013) for European GDP growth; and Zellner and Tobias (2000), Perevalov and Maier (2010) and Drechsel
and Scheufele (2013) for GDP growth in specific industrialized countries.

__As mentioned before, adding components together results in new series with characteristics that may differ quite significantly from those of the originating ones. In this
context, it may be possible to purposefully find specific groupings that show more desirable properties than those of the individual components and the aggregate.__

Some authors have proposed using purpose-built groupings to increase overall forecasting accuracy, but, at least in economic forecasting, the idea seems to have had
little impact. A reason for this may be that the number of possible
groupings grows exponentially with the number of components, meaning that traditional methods, which usually rely on evaluating all possible outcomes, are only usable for problems with relatively few components. For larger problems, a different
approach becomes necessary.

One approach that has been relatively successful recently, particularly given the increase in popularity of methods for Big Data, performs the grouping conditional on some feature of the original data. The success of these methods, however, depends on the chosen feature being useful in obtaining the desired outcome. 
- The assumption on which many of these models are built is that by grouping series that behave in a similar way, the idiosyncratic errors within groups will tend to offset each other while the more relevant individual dynamics will be retained to be modelled.

Although these problems are set in a different context, the purpose of the methods is
very similar to that of grouping components to increase the forecasting accuracy of an
economic aggregate. 
They belong, however, to an area of statistical learning research
that has focused almost exclusively on extracting information from very large datasets.
__Many relevant economic aggregates, like GDP and CPI, do not fall in this category and
it is unclear whether these methods will work with relatively small samples.__

In this context, it might be
possible to find specific groupings that avoid the problems associated with disaggregate
forecasting while still allowing for distinct disaggregate dynamics to be picked up in the
process. The two-stage
method consists of trying to find the grouping of components at each point in time that
produces the best aggregate forecast. 


- In the first stage, we use agglomerative hierarchical clustering to reduce the dimension of the problem by choosing a subset of feasible groupings based on the commonality among the different components. 
- In the second stage, we try different selection procedures on the resulting hierarchy to produce the final aggregate forecast. These selection procedures include choosing a single grouping based on some criterion and combining the whole subset of groups.

The results from an empirical application using CPI data for France, Germany and the
UK show that the grouping method can improve overall accuracy. 
The results show
that some of the methods that selected a unique grouping performed better than the
best performing non-grouping method, both in terms of aggregate and disaggregate
accuracy.  




### A purpose driven grouping framework for aggregate forecasting

Although the implementations and techniques differ, the assumption on which many of
the models intended to forecast time series are built is that forecasting series that
behave similarly as a group will __tend to produce more accurate aggregate forecasts than
if they are modelled separately.__ This assumption would also seem reasonable within the
context of forecasting economic aggregates, given that the relevant literature shows
that accounting for commonality among components is key to forecasting accuracy and,
in particular, that ignoring it would be detrimental for the bottom-up approach.

Regarding the method that performs the grouping, there are many within the area of unsupervised
learning (for example, Support Vector Machines, Support Vector Regression, Self Organizing Maps). One that seems well suited for this particular setting is Hierarchical Clustering, which is concerned with discovering unknown subgroups in data. The most commonly used variant is __the agglomerative alternative__, which starts with
a set of groups, or clusters, that contain a single element each and proceeds by grouping the data into fewer units with more elements each (the less popular divisive approach starts from one large group that contains all the elements and
divides it up accordingly). 


At first sight, it could seem that hierarchical clustering might be the solution to the
grouping problem. However, the method provides no guidance on whether the groupings in the structure are meaningful or whether one grouping is better than another in any
particular sense. 
The problem with identifying an appropriate grouping right away is that, even if there is
one, the particular dissimilarity threshold below which components should be grouped
so as to obtain the most accurate aggregate forecast is unknown. 


##### Guided selection of a subset of groupings

Dissimilarity measures and linkage methods have a defining impact on the results and
the relevant literature provides many alternatives to choose from. In the statistical learning literature it is not unusual to use __simple correlation__ as the dissimilarity measure for time-series. 

The implementations of deterministic agglomerative hierarchical clustering are relatively simple.
In the context of an aggregate with $n$ components, the algorithm proceeds
by calculating the pairwise commonality between the $n$ series and aggregating the two
with the highest commonality. This leaves $n − 1$ series. The process is repeated until only the aggregate is left.
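
The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: components are fused by a plain sum (an actual CPI aggregate would use index weights), the dissimilarity is one minus the absolute Pearson correlation, and the function and component names are hypothetical.

```python
import numpy as np

def pearson_dissimilarity(x, y):
    """1 - |corr(x, y)|: close to 0 for strongly (anti-)correlated series."""
    return 1.0 - abs(np.corrcoef(x, y)[0, 1])

def agglomerate(series, dissimilarity=pearson_dissimilarity):
    """Return every grouping visited, from n singletons down to the aggregate.

    series: dict mapping a component name to a 1-D sequence (hypothetical layout).
    Each grouping is a sorted list of tuples of component names.
    """
    groups = {(name,): np.asarray(x, dtype=float) for name, x in series.items()}
    hierarchy = [sorted(groups)]
    while len(groups) > 1:
        keys = sorted(groups)
        pairs = [(p, q) for i, p in enumerate(keys) for q in keys[i + 1:]]
        # Pick the pair of current groups with the lowest dissimilarity.
        a, b = min(pairs, key=lambda pq: dissimilarity(groups[pq[0]], groups[pq[1]]))
        merged = groups.pop(a) + groups.pop(b)  # assumption: fuse by simple sum
        groups[tuple(sorted(a + b))] = merged
        hierarchy.append(sorted(groups))
    return hierarchy

# Toy data, invented for illustration only.
components = {
    "food": [1, 2, 3, 4, 5],
    "energy": [2, 4, 6, 8, 11],
    "services": [5, 1, 4, 2, 3],
}
for grouping in agglomerate(components):
    print(grouping)
```

With $n$ components the function returns $n$ groupings, which is the reduced subset of candidates that the second stage chooses from.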

__For the dissimilarity measures, five are evaluated for deterministic grouping and one for probabilistic grouping:__

1. Pearson’s Correlation

In the machine learning literature there are many alternatives, but in the context of
time-series the most obvious are measures for correlation. Probably the best known is
Pearson’s correlation coefficient that measures the strength of the linear relationship
between two variables. Although its limitations are many, its widespread use makes it an
obvious benchmark for the rest of the measures.
As a higher correlation, in absolute terms, is associated with similarity,
the corresponding dissimilarity measure is defined as:
$$PC_{x_i,x_j} = 1 - \operatorname{abs} ( \rho_{x_ix_j}) = 1 - \operatorname{abs} \left( \frac{\operatorname{cov}(x_i, x_j)}{\sigma_{x_i}\sigma_{x_j}} \right) $$

2. Spearman’s Correlation

Spearman’s rank correlation coefficient is a non-parametric rank statistic that assesses
how well an arbitrary monotonic function can describe the relationship between two
variables. Therefore, unlike PC, it is not affected by non-linearity. In practice, however, it is just
Pearson’s correlation coefficient calculated after the data have been converted to ranks.
$$ SC_{x_i,x_j} = 1 - \operatorname{abs}(r_{x_ix_j}) = 1 - \operatorname{abs} \left( \frac{\operatorname{cov}(x^{rank}_i, x^{rank}_j) }{\sigma_{x^{rank}_i}\sigma_{x^{rank}_j}} \right)$$
where $r_{x_ix_j}$ is the rank correlation coefficient between $x_i$ and $x_j$.
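
A small sketch may help to see how the two correlation-based measures differ. It assumes series without ties (ties are not averaged in the hypothetical `ranks` helper), and the function names are illustrative only.

```python
import numpy as np

def ranks(x):
    """Ranks 1..T of the observations; ties are not averaged (simplification)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r

def pc_dissimilarity(x, y):
    """One minus the absolute Pearson correlation."""
    return 1.0 - abs(np.corrcoef(x, y)[0, 1])

def sc_dissimilarity(x, y):
    """One minus the absolute Spearman correlation (Pearson on ranks)."""
    return pc_dissimilarity(ranks(x), ranks(y))

# A monotonic but non-linear relationship: SC treats the series as
# identical, while PC reports some dissimilarity.
x = np.arange(1.0, 6.0)
y = np.exp(x)
print(pc_dissimilarity(x, y), sc_dissimilarity(x, y))
```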

3. Latent factor

In applications with financial data, the variance explained by the first principal component has been used to measure the commonality among a set of variables. The
decomposition transforms the original variables into a new set that are orthogonal and
in which they are ordered so that the first retains most of the variation present in all
of the original variables while the last has the least. This is in line with the approaches
in the Dynamic Factor Models literature that try to capture the common factors using
Principal Component Analysis.

For $n$ series of length $T$ collected in the (centred) $T \times n$ matrix $X$, the sample covariance
matrix is $\frac{1}{T} X^TX$, and $X^TX$ can be rewritten using the eigendecomposition as $VD^2V^T$. The columns
of $V$, the eigenvectors, are the principal component directions of $X$ and $z_1 = Xv_1$, with
$v_1$ being the first column of $V$, is the first principal component. The values on the
diagonal of $D^2$ are the eigenvalues associated with each eigenvector, that is $d^2_1$ for $v_1$. It can be shown that $Var(z_1) = Var(Xv_1) = d^2_1 / T $. 

As a higher total explained variance is associated
with similarity, the corresponding dissimilarity measure is defined as:
$$VE_{x_i,x_j} = 1 - \left( \frac{d^2_1}{\sum_{l=1}^n d^2_l} \right)$$
where the total variance explained by the first principal component is $\frac{d^2_1}{\sum_{l=1}^n d^2_l}$.
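
As a rough illustration of the latent-factor measure, the sketch below computes one minus the share of variance explained by the first principal component. It assumes the series are only centred, not standardised (the notes do not say which), and `ve_dissimilarity` is a hypothetical name.

```python
import numpy as np

def ve_dissimilarity(X):
    """1 - share of total variance explained by the first principal component.

    X: array of shape (T, k) with one series per column.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                  # centre each series
    eigvals = np.linalg.eigvalsh(X.T @ X)   # the d_l^2, in ascending order
    return 1.0 - eigvals[-1] / eigvals.sum()

# Pairwise use: stack the two series as columns.
x_i = np.array([1.0, -1.0, 1.0, -1.0])
x_j = np.array([1.0, 1.0, -1.0, -1.0])
print(ve_dissimilarity(np.column_stack([x_i, x_j])))
```

Two uncorrelated series with equal variance give the largest possible pairwise value, 0.5, while collinear series give a value of (numerically) zero.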

4. Persistence

Roughly speaking, persistence in a time-series context is related to the memory properties of the series. 
Series that have very different persistence will tend to suffer more from omitted variable bias if they are forecast together
than series with a similar persistence, so it is preferable to forecast series with different
persistence separately.
For example, fitting an AR(1) model to each component, we use the difference in the estimated persistence parameters as a measure of
dissimilarity:
$$PE_{x_i,x_j} = \operatorname{abs} ( \operatorname{abs} (\hat{\rho}_i) - \operatorname{abs} (\hat{\rho}_j ))$$
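
A minimal sketch of the persistence measure, assuming the AR(1) parameters are estimated by ordinary least squares (the notes do not name the estimator); `ar1_fit` and `pe_dissimilarity` are hypothetical helper names.

```python
import numpy as np

def ar1_fit(x):
    """OLS estimates (a, rho) of x_t = a + rho * x_{t-1} + e_t."""
    x = np.asarray(x, dtype=float)
    y, lag = x[1:], x[:-1]
    rho = np.cov(lag, y, bias=True)[0, 1] / np.var(lag)
    a = y.mean() - rho * lag.mean()
    return a, rho

def pe_dissimilarity(x_i, x_j):
    """Absolute difference between the absolute AR(1) persistence estimates."""
    return abs(abs(ar1_fit(x_i)[1]) - abs(ar1_fit(x_j)[1]))
```

Because the measure compares absolute values, a series with $\hat{\rho} = 0.5$ and one with $\hat{\rho} = -0.5$ are treated as equally persistent.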

5. Forecast-error clustering

Ignoring common factors and
interdependencies will tend to make forecasting errors cluster instead of cancelling out.
The dissimilarity measure is based on the correlations of the out-of-sample forecasting errors for the most recent periods.
Specifically, for each component $i$ we fit $x_{i,t-p+1} = a_i + \rho_i x_{i,t-p} + e_{i,t-p+1}$, where $p$ is the number
of periods that are evaluated for the measure. With the model, we generate forecasts
from $t - p + 1$ to $t$ and calculate the corresponding forecasting errors as $\hat{x}_{i,s|s-1} - x_{i,s}$ for
$s = t - p + 1$ to $t$ and collect them in $\hat{e}^t_i$. With this, the dissimilarity measure is defined
as:
$$FC_{x_i,x_j} = 1 - \operatorname{abs} \left( \frac{\operatorname{cov}(\hat{e}^t_i, \hat{e}^t_j)}{\sigma_{\hat{e}^t_i} \sigma_{\hat{e}^t_j}} \right)$$
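
The sketch below shows one possible reading of this measure: the AR(1) parameters are estimated once on the data up to $t - p$ and then held fixed while the last $p$ one-step-ahead errors are computed. Re-estimating at every step would be an equally plausible reading. The OLS estimator and all function names are assumptions.

```python
import numpy as np

def ar1_fit(x):
    """OLS estimates (a, rho) of x_t = a + rho * x_{t-1} + e_t."""
    x = np.asarray(x, dtype=float)
    y, lag = x[1:], x[:-1]
    rho = np.cov(lag, y, bias=True)[0, 1] / np.var(lag)
    return y.mean() - rho * lag.mean(), rho

def recent_forecast_errors(x, p):
    """One-step-ahead errors (forecast minus actual) for the last p periods."""
    x = np.asarray(x, dtype=float)
    a, rho = ar1_fit(x[:-p])                 # parameters from data up to t - p
    return (a + rho * x[-p - 1:-1]) - x[-p:]

def fc_dissimilarity(x_i, x_j, p):
    """One minus the absolute correlation of the recent forecast errors."""
    e_i = recent_forecast_errors(x_i, p)
    e_j = recent_forecast_errors(x_j, p)
    return 1.0 - abs(np.corrcoef(e_i, e_j)[0, 1])
```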


6. Probabilistic grouping algorithm

It would be desirable for a clustering method to
provide some insight into the quality of the groupings. However, as traditional clustering methods are deterministic, this is not possible. 
Probabilistic algorithms have
been proposed, but until recently their increased complexity has hindered their implementation.
One that does compare favourably to the traditional methods is the __Bayesian Hierarchical Clustering__ method by Heller and Ghahramani (2005). __The main idea is that,
through _empirical Bayesian methods_, it performs the grouping based on the probability
of two observations being generated from the same underlying function.__

Let
$D = \lbrace x_1, \dots , x_n \rbrace$ represent all the data and $D_i$ the data at subtree $T_i$. Then, at each
step, subtrees $T_i$ and $T_j$ are compared to see if they should be merged together. The
hypothesis to be evaluated is that $x_i$ and $x_j$ come from the same probabilistic model $p(x | \theta)$ with unknown parameters $\theta$. Then define $D_{ij}$ as the merged data, and let $M_{ij}$
equal one if they should be merged and zero if they should not. The probability of a merge is given by
$$r_{ij} = \frac{p(D_{ij} | M_{ij} = 1)p(M_{ij} = 1)}{p(D_{ij} | M_{ij} = 1)p(M_{ij} = 1) + p(D_{ij} | M_{ij} = 0)p(M_{ij} = 0)}$$
$p(M_{ij} = 1)$ is the prior probability of a merge and can be computed from the data. If $M_{ij}$ equals one, the data are assumed to come from the same model, meaning
$$p(D_{ij} | M_{ij} = 1) = \int \left[\prod_{x_n\in D_{ij}} p(x_n | \theta) \right] p(\theta | \lambda)\,d\theta$$
with $\lambda$ being a hyperparameter that can be provided or estimated from the data. If $M_{ij}$
equals zero, the data are assumed to be generated independently and
$$p(D_{ij} | M_{ij} = 0) = p(D_i| T_i)p(D_j | T_j )$$

The algorithm starts with each observation in its own cluster. It calculates all the pairwise merge probabilities and proceeds to merge the clusters with the highest posterior
merge probability. It then recalculates the pairwise merge probabilities. It continues in
this way, merging the pairs with the highest merge probability until only the aggregate
is left.
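
To make the merge probability concrete, the sketch below evaluates $r_{ij}$ from quantities that are assumed to be already available: the log marginal likelihood of the merged data, the log likelihoods of the two subtrees and the prior merge probability. How those inputs are obtained depends on the chosen model $p(x | \theta)$ and is not shown. The function name is hypothetical.

```python
import math

def merge_probability(log_ml_merged, log_ml_i, log_ml_j, prior_merge):
    """Posterior probability r_ij that subtrees i and j should be merged.

    log_ml_merged: log p(D_ij | M_ij = 1)
    log_ml_i, log_ml_j: log p(D_i | T_i) and log p(D_j | T_j)
    prior_merge: p(M_ij = 1), strictly between 0 and 1
    """
    log_num = math.log(prior_merge) + log_ml_merged
    log_alt = math.log(1.0 - prior_merge) + log_ml_i + log_ml_j
    diff = log_alt - log_num
    if diff > 700.0:      # exp() would overflow; the merge is essentially ruled out
        return 0.0
    # num / (num + alt), evaluated in log space for numerical stability.
    return 1.0 / (1.0 + math.exp(diff))

r = merge_probability(math.log(0.2), math.log(0.1), math.log(0.5), prior_merge=0.5)
print(r, r > 0.5)
```

Working in logs is a design choice rather than part of the method: marginal likelihoods of long series are tiny numbers and would underflow if multiplied directly.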


##### Producing a unique aggregate forecast
As the hierarchical clustering proceeds by fusing two
observations or series at a time, it produces an intuitive tree-based representation of
the final structure. This representation is called a dendrogram. 
As mentioned before, the algorithm by itself does not provide any advice with regards
to what grouping to use. On the dendrogram, however, the vertical axis presents the
level of dissimilarity and therefore visual inspection can provide some guidance. On the dendrogram, the height of the first fusion of any two observations indicates how different the
two observations are. Observations that fuse at the very bottom are quite similar to each other, whereas
observations that fuse close to the top will tend to be quite different.  
It
is not uncommon, however, that no obvious cutting points are revealed. In these cases it is necessary to turn to an
exogenous criterion. 

For this purpose, we present six different alternatives, separating the methods into those
that seek to select a single level of disaggregation and those that use a combination of
the different groupings.

1. In-sample fit 

Probably the most commonly used approach to judge a model is in-sample fit. For our particular case we use the in-sample forecasting error. To choose the level of aggregation
for forecasting period $t + 1$, for each level of aggregation within the proposed hierarchy
at time $t$, we use the forecasting models and parameters calculated using data up to
period $t$ to calculate the one-step-ahead root mean squared forecasting error (RMSFE)
for the sample up to period $t$. The level of aggregation with the lowest in-sample forecasting error is then used to
forecast period $t + 1$.
With this, the in-sample fit for disaggregation level $i$, at time $t$ is:
$$ ISF_{i,t,v} = \sqrt{\frac{1}{v} \sum^{t-1}_{s=t-v} (\hat{x}_{i,s+1|t} - x_{i,s+1})^2}$$
where $v$ determines how much data is included in the measure.


2. Past out-of-sample forecasting performance

One of the drawbacks of the in-sample criterion is that it will tend to over-fit the data.
Therefore, it is very common to also use out-of-sample evaluation. For our case, the out-of-sample criterion, for forecasting period $t + 1$, is calculated using a recursive out-of-sample forecasting exercise. That is, for each level of aggregation within the proposed
hierarchy at time $t$, we estimate the parameters with data up to period $t - v$ and forecast
$t - v + 1$, then estimate the parameters with data up to period $t - v + 1$ and forecast $t - v + 2$, and continue in the same way, stopping with the forecast for period $t$. Then, we
calculate the RMSFE using these $v$ forecasts.
With this, the out-of-sample performance for disaggregation level $i$ at time $t$ is: 
$$OOS_{i,t,v} = \sqrt{\frac{1}{v} \sum^{t-1}_{s=t-v}  (\hat{x}_{i,s+1|s} - x_{i,s+1})^2 }$$
where $v$ determines how much data is included in the measure.
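
The difference between the in-sample and the recursive out-of-sample criteria can be sketched for a single series forecast with an AR(1). The OLS estimator, the function names and the toy data are assumptions made for illustration. In the method itself the criterion would be computed for the aggregate forecast implied by each level of the hierarchy.

```python
import numpy as np

def ar1_fit(x):
    """OLS estimates (a, rho) of x_t = a + rho * x_{t-1} + e_t."""
    x = np.asarray(x, dtype=float)
    y, lag = x[1:], x[:-1]
    rho = np.cov(lag, y, bias=True)[0, 1] / np.var(lag)
    return y.mean() - rho * lag.mean(), rho

def isf(x, v):
    """In-sample RMSFE over the last v periods, parameters from the full sample."""
    x = np.asarray(x, dtype=float)
    a, rho = ar1_fit(x)
    errors = (a + rho * x[-v - 1:-1]) - x[-v:]
    return np.sqrt(np.mean(errors ** 2))

def oos(x, v):
    """Recursive out-of-sample RMSFE over the last v periods."""
    x = np.asarray(x, dtype=float)
    errors = []
    for end in range(len(x) - v, len(x)):    # x[:end] is the estimation sample
        a, rho = ar1_fit(x[:end])
        errors.append(a + rho * x[end - 1] - x[end])
    return np.sqrt(np.mean(np.square(errors)))

# Toy series with a jump in the last period.
x = [1, 2, 3, 4, 5, 6, 10]
print(isf(x, 1), oos(x, 1))
```

In this toy series the in-sample error is smaller because the parameters have already seen the final jump, which is the over-fitting concern raised above.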


3. Lowest average error dissimilarity threshold

Unsupervised learning, of which the clustering method used to produce the subset of
groups is a part, is often challenging because there is no response variable. To find the level of aggregation at which the
resulting aggregate forecast error is lowest, we can use a supervised
method.
We do this by calculating, for the training sample, the average forecasting error conditional on the level of dissimilarity. This corresponds to calculating
the forecasting error associated with the values on the vertical axis of all the dendrograms for the sample up to period $t$ and averaging the results. To make the averaging
over different periods possible, we use a smoothing spline to interpolate the forecasting errors for each period. To forecast period $t + 1$ we choose the level of aggregation
associated with the dissimilarity that is closest to the minimum average error.

4. Probabilistic criterion

The Bayesian Hierarchical Clustering method proceeds by building the hierarchy based
on the estimated probability of two observations coming from the same underlying function. A natural decision rule for groupings in
this context is to only perform fusions that have a posterior merge probability greater
than 50%. This criterion, however, can only be applied to hierarchies produced by the
probabilistic algorithm.


5. Equal-weights among aggregate forecasts

A very attractive feature of forecast combination is that simple combination schemes
are surprisingly effective. __In fact, the equal-weighted forecast
combination performs so well that researchers have tried to explain why this is the case (Smith and Wallis, 2009).__ In view of this, given that each level of the hierarchy produces
an aggregate forecast, the most straightforward thing is to average the aggregate forecasts for all levels.

6. Equal-weights among distinct forecasts

However, averaging the aggregates is not the same as assigning equal weights to each distinct forecast. If the forecasts are generated independently of each
other, two components that fuse late in the hierarchy enter every grouping below their fusion through their individual forecasts. Then, when all aggregate
forecasts are averaged, the forecasts for both components are implicitly given a weight
that is larger (ten times larger in the example considered) than that of the forecasts of the components that are fused in the first 
step.
An alternative approach is to give equal weights to each unique forecast. That means
including each individual component forecast, each intermediate aggregate forecast and the aggregate forecast only once.

To do this it is necessary to combine forecasts from multiple levels of aggregation and we do so by
extending the method for combining two different aggregation levels proposed in Cobb (2017).
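
The implicit weighting can be made visible by counting how often each distinct forecast enters the average of aggregate forecasts. The sketch below only illustrates that counting and the simple average across levels. It does not implement the combination method of Cobb (2017). It assumes forecasts are expressed as contributions, so that the forecasts of the groups in a grouping sum to an aggregate forecast. The hierarchy and numbers are invented.

```python
from collections import Counter

def implicit_weights(hierarchy):
    """Number of groupings in which each distinct group (forecast) appears."""
    return Counter(group for grouping in hierarchy for group in grouping)

def average_of_aggregates(hierarchy, forecasts):
    """Equal-weight average of the aggregate forecasts of all levels."""
    level_aggregates = [sum(forecasts[g] for g in grouping) for grouping in hierarchy]
    return sum(level_aggregates) / len(level_aggregates)

# Hypothetical hierarchy for four components a, b, c, d.
hierarchy = [
    [("a",), ("b",), ("c",), ("d",)],
    [("a", "b"), ("c",), ("d",)],
    [("a", "b"), ("c", "d")],
    [("a", "b", "c", "d")],
]
forecasts = {("a",): 1.0, ("b",): 2.0, ("c",): 3.0, ("d",): 4.0,
             ("a", "b"): 3.5, ("c", "d"): 6.5, ("a", "b", "c", "d"): 10.5}

print(implicit_weights(hierarchy))
print(average_of_aggregates(hierarchy, forecasts))
```

Here the individual forecasts for `c` and `d`, which fuse late, are used twice, while those for `a` and `b`, fused in the first step, are used once.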

###  Empirical Application

As an empirical application of the method we perform a forecasting exercise using CPI
data from France, Germany and the United Kingdom. We use univariate autoregressive
and Bayesian multivariate methods to forecast the data and evaluate the aggregate and
overall forecasting accuracy of the grouping procedure by comparing the results with
that of the direct forecast and that of the corresponding bottom-up approach.

That is, we compare the improvement from the grouping against the corresponding direct and bottom-up
approaches, as opposed to finding the best aggregation from the pool of alternatives for both AR(1)s and
BVARs.

##### Autoregressive model of order one (AR1)

In particular, we use an autoregressive model of order one, $x_{i,t} = a_i + \rho_i x_{i,t-1} + e_{i,t}$, for the variables made stationary through differencing according to unit root tests. The forecasts are then produced using:
$$\hat{x}_{i,t+1|t} = \hat{a}_i + \hat{ρ}_ix_{i,t}$$

##### Bayesian VAR (BVAR)
We do acknowledge, however, that interdependencies among components could play an
important role, so we also use Bayesian Vector Autoregressive models (BVARs). In practice, we forecast the whole
multivariate process using five lags, with the overall tightness chosen so as to produce the same in-sample fit as that of the direct aggregate forecast.
The estimated model is
$$X_t = c + A_1X_{t-1} + \dots + A_5X_{t-5} + e_t$$
and the forecasts are produced using
$$\hat{X}_{t+1|t} = \hat{c} + \hat{A}_1X_t + \dots + \hat{A}_5X_{t-4}$$

#### Forecasting Accuracy Comparison

The evaluation exercise is performed over the 2001-2015 period, leaving the first ten
years of data to estimate the models. It is set up in a quarterly rolling scheme using a
ten-year window, where in each period the models are re-estimated and a one-step-ahead
forecast is generated.

The forecasting accuracy is presented by means of the model’s mean square forecasting
error (MSFE) relative to that of a benchmark model. That is, for variable $i$ and using
model $m$, the relative MSFE is
$$RelMSFE^{(i,m)} = \frac{MSFE^{(i,m)}_{T_0,T_1}}{MSFE^{(i,0)}_{T_0,T_1}}$$
with
$$MSFE^{(i,m)}_{T_0,T_1} = \frac{1}{T_1 - T_0 + 1} \sum^{T_1}_{t=T_0} (\hat{y}^{(m)}_{i,t+1|t} - y_{i,t+1})^2$$
where $T_0$ is the last period of actual
data in the first sample used for the evaluation and $T_1$ is the last period of actual data in
the last sample. A RelMSFE lower than one reflects an improvement over the
benchmark model, for which $m = 0$. To evaluate the significance of these differences, we
compare the forecasts using the modified __Diebold-Mariano test for equality of prediction__
mean squared errors proposed by Harvey et al. (1997).

We measure the overall forecasting accuracy of the components by
comparing the cumulative absolute errors in their contributions to the aggregate.
For this purpose we define the cumulative absolute root mean square forecasting error
for an aggregate with $N$ components $q_n$, weights $w_{n,t+1}$ and using model $m$ as
$$CumRMSFE^{(m)}_{T_0,T_1} = \sqrt{\frac{1}{T_1 − T_0 + 1}\sum^{T_1}_{t=T_0} \left( \sum^{N}_{n=1} w_{n,t+1} abs (\hat{q}^{(m)}_{n,t+1|t} − q_{n,t+1}) \right)^2}$$
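
Both accuracy measures are simple to compute once the forecasts are stored. The sketch below assumes forecasts and actual values are aligned by target period, with one row per period and one column per component for the cumulative measure. The function names and the array layout are illustrative choices.

```python
import numpy as np

def msfe(forecasts, actuals):
    """Mean square forecasting error over the evaluation periods."""
    errors = np.asarray(forecasts, dtype=float) - np.asarray(actuals, dtype=float)
    return np.mean(errors ** 2)

def rel_msfe(model_forecasts, benchmark_forecasts, actuals):
    """MSFE of a model relative to the benchmark; below one is an improvement."""
    return msfe(model_forecasts, actuals) / msfe(benchmark_forecasts, actuals)

def cum_rmsfe(q_hat, q, w):
    """Cumulative absolute RMSFE of the weighted component errors.

    q_hat, q, w: arrays of shape (periods, N) with component forecasts,
    actual values and aggregation weights.
    """
    q_hat, q, w = (np.asarray(a, dtype=float) for a in (q_hat, q, w))
    per_period = (w * np.abs(q_hat - q)).sum(axis=1)   # errors cannot offset each other
    return np.sqrt(np.mean(per_period ** 2))
```

Taking absolute values before summing across components is what distinguishes this measure from the aggregate error, where component errors of opposite sign cancel.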

We also acknowledge that Bermingham and D’Agostino (2014) find that the performance
of the bottom-up approach could improve if the common features among components
are accounted for. To see how our application measures up to an alternative approach,
we also compare it to a __factor-augmented autoregressive model.__ Following their implementation, we extend each univariate autoregressive model from the bottom-up approach to include one factor
$$x_{i,t} = a_i + \rho_i x_{i,t-1} + \gamma_i F_{t-1} + e_{i,t}$$
The factor, $F$, is estimated with the first principal component following Stock and Watson (2002) and computed over all components. The corresponding forecast for each component is generated using
$$\hat{x}^{FAAR}_{i,t+1|t} = \hat{a}_i + \hat{\rho}_i x_{i,t} + \hat{\gamma}_i\hat{F}_t$$
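
A compact sketch of the factor-augmented forecast follows. It assumes the components are standardised before extracting the first principal component and that each equation is estimated by OLS. Both are common choices but are not stated in these notes, and `faar_forecasts` is a hypothetical name.

```python
import numpy as np

def faar_forecasts(X):
    """One-step-ahead factor-augmented AR(1) forecasts for every column of X.

    X: array of shape (T, n), one (stationary) component per column.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # assumption: standardise first
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    F = Z @ vt[0]                                    # first principal component
    forecasts = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        # Regress x_{i,t} on a constant, x_{i,t-1} and F_{t-1}.
        regressors = np.column_stack([np.ones(len(X) - 1), X[:-1, i], F[:-1]])
        coef, *_ = np.linalg.lstsq(regressors, X[1:, i], rcond=None)
        forecasts[i] = coef @ np.array([1.0, X[-1, i], F[-1]])
    return forecasts
```

The sign of a principal component is arbitrary, but the forecasts are unaffected because $\hat{\gamma}_i$ flips sign together with the factor.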

###### Results:

We see that in five out of six of the cases the respective bottom-up approach performs
better than the direct approach. In particular, the univariate approach tends to do better
than the BVARs, with improvements going from 5 to 12%, while the BVARs improve for
France and Germany, by about 5%, but do quite a bit worse than the direct method for
the UK. The factor-augmented AR does not seem to give any advantage
over the simple AR. Although some of the differences could seem quite large, it is worth
noting that they are not statistically significant.

From the results that are common among the different cases we can draw some overall
conclusions. One is that the forecast combination choice methods performed well with
most dissimilarity measure choices and, in particular, in most cases the improvements
were statistically significant. The other is that the persistence dissimilarity measure
combined with the dissimilarity threshold choice method performed best overall.

As is the case in most empirical applications, the impact of the grouping methods
depends on the specific dataset. In particular, improvements in disaggregate accuracy
were obtained only in the case where the direct approach was better than the bottom-up approach. It was also in this case that relatively more non-combination grouping
methods improved aggregate accuracy.  Such a result
would not be entirely surprising, given the motivation for using dynamic grouping in the
first place; that is, to capture disaggregate dynamics in cases where full disaggregation
could introduce too much noise.

Having said that, the use of the grouping methods could increase aggregate accuracy
even in cases where full disaggregation is better than the direct approach. The overall
good performance of the forecast combination choice methods suggests that the grouping methods can provide a way of introducing the robustness of forecasting combination
into the procedure without having to introduce different forecasting models. Although there were hardly any gains in terms of disaggregate accuracy, in many cases the accuracy was similar to that of the best non-grouping method.

### Other literature:

- [Forecast Combinations](https://ideas.repec.org/h/eee/ecofch/1-04.html): [slides](http://www.oxford-man.ox.ac.uk/sites/default/files/events/combination_Sofie.pdf)

- [Direct vs bottom–up approach when forecasting GDP](https://ac.els-cdn.com/S026499931300151X/1-s2.0-S026499931300151X-main.pdf?_tid=93484b90-eefe-487d-8fd9-4cc2bda7764e&acdnat=1550331512_ff6e1f111c3c98bf1a35def83aa72288)

- [A real-time disaggregated forecasting model for the euro area gdp.](https://pdfs.semanticscholar.org/6d78/2d41cd8aa359d7038dc4705a56caacb9f47f.pdf)

- [A linear benchmark for forecasting GDP growth and inflation?](https://onlinelibrary.wiley.com/doi/abs/10.1002/for.1059)

- [Early estimates of euro area real gdp growth: a bottom up approach from the production side.](https://www.ecb.europa.eu/pub/pdf/scpwps/ecbwp975.pdf)

- [On the Advantages of Disaggregated Data: Insights from Forecasting the U.S. Economy in a Data-Rich Environment](https://www.bankofcanada.ca/wp-content/uploads/2010/05/wp10-10.pdf)

- [Macroeconomic forecasting in the Euro area: Country specific versus area-wide information](https://www.sciencedirect.com/science/article/pii/S0014292102002064)

- [Bottom-up or direct? forecasting german gdp in a data-rich environment](https://www.snb.ch/n/mmr/reference/working_paper_2012_16/source/working_paper_2012_16.n.pdf)



- [Forecasting aggregates and disaggregates with common features](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.890.5889&rep=rep1&type=pdf)

- [Estimation and Prediction from Aggregate Data when Aggregates are Measured More Accurately than Their Components](https://www.jstor.org/stable/1913689?seq=1#page_scan_tab_contents)

- [Forecasting Aggregates by Disaggregates](https://www.researchgate.net/publication/24128624_Forecasting_Aggregates_by_Disaggregates)

- [Aggregation in large dynamic panels](https://pdfs.semanticscholar.org/03ad/175699af9c659bef3aba1644876c46cb31cb.pdf)

- [The Importance of Disaggregation in Economic Modelling](https://www.ssb.no/a/histstat/doc/doc_199912.pdf)

- [Implications of Aggregation with Common Factors](https://www.jstor.org/stable/3532462?seq=1#page_scan_tab_contents)
