### [Forecast Combinations](http://www.oxford-man.ox.ac.uk/sites/default/files/events/combination_Sofie.pdf)

Forecast combinations have been successfully applied in several areas of
forecasting, including forecasting GNP and estimating GDP.

- Why combine?
Many models or forecasts with similar predictive accuracy; diversification gains
- When to combine?
Individual forecasts are misspecified; 
unstable forecasting environment (past track record unreliable);
short track record
- What to combine? 
Forecasts using different information sets;
forecasts based on different modeling approaches (linear/nonlinear)

- Combination of forecasts is motivated by
misspecified forecasting models (due to, e.g., structural breaks) and
diversification across forecasts


__Essentials of forecast combination__:
- Dimensionality reduction: Combination reduces the information in a vector of forecasts to a single summary measure using a set of combination weights
- Optimal combination chooses weights to minimize the expected loss of the combined forecast 
    - more accurate forecasts tend to get larger weights, 
    - combination weights also reflect correlations across forecasts
    - estimation error is important to combination weights
- Irrelevance Proposition: In a world with no model misspecification, infinite data samples (no estimation error) and complete access to the information sets underlying the individual forecasts, there is no need for forecast combination.

A combined forecast $f(\hat{f}_1, \hat{f}_2)$ dominates the individual forecasts if
$$ E[L(\hat{f}_i, y_{T+h})] > \min_{f(\cdot)} E[L(f(\hat{f}_1, \hat{f}_2), y_{T+h})], \quad \text{for } i = 1, 2 $$
where $L$ is the loss function (e.g., MSE loss)


Forecast combination is essentially a model selection and parameter
estimation problem with special constraints on the estimation problem

Using a middle step of first constructing forecasts limits the flexibility of the final forecasting model. Why not directly map the underlying data to the forecasts?
- Estimation error plays a key role in the risk of any given method. Model combination uses the data parsimoniously, which can result in an attractive risk function
- Combined forecast can be viewed simply as a different estimator of the final model

Theory:

Specialized concepts in optimal forecast combination arise from additional restrictions placed on the search for combination models
Because the underlying data are forecasts, they can be expected to obtain non-negative weights that sum to unity. 
Such constraints can be used to reduce the relevant parameter space for the combination weights and offer a more attractive risk function

Solving for the MSE-optimal combination weights can lead to
negative combination weights.
A negative weight on a forecast does not mean that it has no value; it means
the forecast can be used to offset the prediction errors of other models

Equal weights (EW) play a special role in forecast combination. 
EW are optimal in population when the individual forecast errors have
identical variance and identical pair-wise correlations.  
This situation holds to a close approximation when all models are based on similar data and perform roughly the same.
More generally, EW are the optimal combination weights when the vector of ones $ι$
is an eigenvector of $Σ_e$, the covariance matrix of the forecast errors.

Simple combination schemes such as EW satisfy these constraints and do not require estimation of any parameters.
EW can be viewed as a reasonable prior when no data has been observed
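
The two claims above (equal weights are optimal under identical variances and correlations, and MSE-optimal weights can turn negative) can be checked numerically. The sketch below assumes the standard sum-to-one solution $ω = Σ_e^{-1}ι / (ι'Σ_e^{-1}ι)$; the covariance matrices and the helper name `optimal_weights` are made up for illustration.

```python
import numpy as np

def optimal_weights(sigma_e):
    """MSE-optimal weights under the sum-to-one constraint:
    Sigma^{-1} i / (i' Sigma^{-1} i), with i a vector of ones."""
    ones = np.ones(sigma_e.shape[0])
    x = np.linalg.solve(sigma_e, ones)
    return x / (ones @ x)

# Illustrative case 1: identical variances and identical pairwise correlations.
m, var, rho = 3, 2.0, 0.5
sigma_equal = var * ((1 - rho) * np.eye(m) + rho * np.ones((m, m)))
w_equal = optimal_weights(sigma_equal)   # every weight equals 1/m

# Illustrative case 2: two highly correlated forecasts (correlation 0.9)
# with different error variances. The noisier forecast gets a negative weight.
sigma_neg = np.array([[1.0, 1.8],
                      [1.8, 4.0]])
w_neg = optimal_weights(sigma_neg)

print(w_equal, w_neg)
```

In the second case the weights still sum to one, but the second forecast is used to offset the errors of the first rather than being averaged in.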

Simple combination methods:
- Equal-weighted forecast
- Median forecast
- Trimmed mean. Order forecasts. Trim top/bottom $λ$%
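
A minimal sketch of the three simple schemes listed above. The forecast values and the helper name `trimmed_mean` are hypothetical, and the trimming rule (drop `floor(n * λ / 100)` forecasts from each end) is one possible convention.

```python
import numpy as np

def trimmed_mean(forecasts, trim_pct):
    """Order the forecasts, drop the lowest and highest trim_pct percent, average the rest."""
    f = np.sort(np.asarray(forecasts, dtype=float))
    k = int(np.floor(len(f) * trim_pct / 100.0))  # forecasts dropped at each end
    if 2 * k >= len(f):
        raise ValueError("trim_pct removes every forecast")
    return float(f[k:len(f) - k].mean())

# Five hypothetical forecasts of the same target, one of them an outlier.
forecasts = [1.8, 2.0, 2.1, 2.3, 9.0]

ew = float(np.mean(forecasts))     # equal-weighted forecast
med = float(np.median(forecasts))  # median forecast
tm = trimmed_mean(forecasts, 20)   # trimmed mean with lambda = 20

print(ew, med, tm)
```

The median and the trimmed mean are much less affected by the outlying forecast than the equal-weighted average.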



The existence of many estimation methods boils down to a number of standard issues
in constructing forecasts:
- role of estimation error
- lack of a single optimal estimation scheme
- simple methods are difficult to beat in practice
- common baseline is to use a simple EW average of forecasts:
    - no estimation error here since the combination weights are imposed rather than estimated

Empirical studies often find that simple equal-weighted forecast combinations perform very well compared with more sophisticated combination schemes that rely on estimated combination weights
[Smith and Wallis (2009)] 

Errors introduced by estimation of the combination weights could overwhelm any gains from setting the weights to their optimal values over using equal weights

Explanations of the puzzle based on estimation error must show that
estimation error is large and/or 
gains from setting the combination weights to their optimal values are small
relative to using equal weights

#### Bates-Granger restricted least squares

Bates and Granger (1969): use plug-in weights in the optimal solution based on the estimated variance-covariance matrix

This is numerically identical to restricted least squares estimator of the
weights from a regression of the outcome on the vector of forecasts 
and no intercept subject to the restriction that the coefficients sum to one
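
The numerical identity can be illustrated on simulated data. This is a sketch under stated assumptions: the data are synthetic, and the "estimated variance-covariance matrix" is taken to be the uncentered second-moment matrix of the forecast errors, which is what matches a regression without an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 200, 3
y = rng.normal(size=T)
# Synthetic forecasts: the outcome plus noise of differing size.
F = y[:, None] + rng.normal(scale=[0.5, 1.0, 1.5], size=(T, m))

# Plug-in weights: S^{-1} i / (i' S^{-1} i), with S the (uncentered)
# second-moment matrix of the individual forecast errors.
E = y[:, None] - F
S = E.T @ E / T
ones = np.ones(m)
x = np.linalg.solve(S, ones)
w_plugin = x / (ones @ x)

# Restricted least squares: regress y on F with no intercept, subject to
# the coefficients summing to one. Substituting w_m = 1 - sum(w_1..w_{m-1})
# turns this into an unrestricted regression of (y - f_m) on (f_i - f_m).
X = F[:, :-1] - F[:, [-1]]
z = y - F[:, -1]
w_head, *_ = np.linalg.lstsq(X, z, rcond=None)
w_rls = np.append(w_head, 1.0 - w_head.sum())

print(w_plugin, w_rls)
```

The two weight vectors agree up to floating-point error because, once the weights sum to one, the combined forecast error is the same weighted sum of the individual forecast errors in both formulations.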



#### Diebold and Pauly (1987) shrinkage estimator

Forecast combination weights formed as a weighted average of the prior
$ω_p = ι_m/m$ and the least squares estimates $\hat{ω}_{OLS}$:
$$ \hat{ω}_B = \tilde{A} \hat{ω}_{OLS} + (I-\tilde{A})ι_m/m$$

__Empirical Bayes approach__ sets $\tilde{A} = I(1 - \hat{σ}^2/\hat{τ}^2)$
- $\hat{τ}^2 = (\hat{ω}_{OLS} - ω_p)' (\hat{ω}_{OLS} - ω_p) / tr[(Z'_f Z_f)^{-1}]$
- $\hat{\sigma}^2$ - MLE for variance of the residuals from the OLS combination regression
- $Z_f$ - matrix of regressors (ignoring the constant)
- $Z'_f Z_f$ is an unscaled estimate of the variance-covariance matrix of the forecasts
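
A rough sketch of the empirical Bayes shrinkage described above, on synthetic data. The function name `shrinkage_weights` is hypothetical, the regression omits the constant, and clipping the shrinkage factor to $[0, 1]$ is an added assumption so that the result always lies between the OLS weights and the equal-weight prior.

```python
import numpy as np

def shrinkage_weights(y, Z):
    """Shrink OLS combination weights toward the equal-weight prior i/m."""
    T, m = Z.shape
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    w_ols = ZtZ_inv @ Z.T @ y
    resid = y - Z @ w_ols
    sigma2 = resid @ resid / T                  # MLE of the residual variance
    w_prior = np.full(m, 1.0 / m)
    diff = w_ols - w_prior
    tau2 = diff @ diff / np.trace(ZtZ_inv)
    a = 0.0 if tau2 == 0 else 1.0 - sigma2 / tau2
    a = min(max(a, 0.0), 1.0)                   # assumption: clip to [0, 1]
    return a * w_ols + (1.0 - a) * w_prior, w_ols, a

rng = np.random.default_rng(1)
T = 60
y = rng.normal(size=T)
Z = y[:, None] + rng.normal(scale=[0.5, 1.0, 1.5], size=(T, 3))
w_shrunk, w_ols, a = shrinkage_weights(y, Z)

print(a, w_ols, w_shrunk)
```

When the OLS weights are close to equal weights relative to their sampling noise, `a` is small and the combination falls back toward the simple average.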


#### Aiolfi and Timmermann (2006) robust weighting scheme

Weights forecast models inversely to their rank, $Rank_{it+h|t}$
$$\hat{ω}_{it+h|t} = \frac{Rank^{-1}_{it+h|t}}{\sum^m_{i=1} Rank^{-1}_{it+h|t}}$$
The best model gets a rank of 1, the second-best model a rank of 2, etc.
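
A small sketch of the inverse-rank formula. Ranking the models by historical MSE is an assumption made here for illustration, and the MSE values are made up.

```python
import numpy as np

def rank_weights(mse):
    """Weights proportional to 1/rank, where rank 1 is the lowest historical MSE."""
    mse = np.asarray(mse, dtype=float)
    ranks = np.empty(len(mse))
    ranks[np.argsort(mse)] = np.arange(1, len(mse) + 1)
    inv = 1.0 / ranks
    return inv / inv.sum()

# Hypothetical historical MSEs for three models: ranks are 2, 1, 3.
w_rank = rank_weights([0.8, 0.5, 1.3])
print(w_rank)
```

Only the ordering of the models matters, which is what makes the scheme robust to outliers in the performance measure.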


#### Bates and Granger (1969) Adaptive combination weights

Adaptive estimation schemes: 
rolling window of the forecast models' relative performance over the most
recent observations

An adaptive updating scheme discounts older performance using a parameter $λ \in (0, 1)$.
The closer $λ$ is to unity, the smoother the combination weights
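
The notes do not spell out the updating rule, so the following is only one possible form, assumed here for illustration: the new weights are a convex combination of the previous weights (weight $λ$) and inverse-MSE weights computed over a rolling window (weight $1 - λ$). The function name and the error series are hypothetical.

```python
import numpy as np

def adaptive_weights(errors, window, lam):
    """Assumed rule: w_t = lam * w_{t-1} + (1 - lam) * rolling inverse-MSE weights."""
    errors = np.asarray(errors, dtype=float)
    T, m = errors.shape
    w = np.full(m, 1.0 / m)          # start from equal weights
    path = []
    for t in range(window, T + 1):
        # Sum of squared errors over the most recent `window` observations
        # (assumed to be strictly positive).
        sse = (errors[t - window:t] ** 2).sum(axis=0)
        target = (1.0 / sse) / (1.0 / sse).sum()
        w = lam * w + (1.0 - lam) * target
        path.append(w)
    return np.array(path)

# Two hypothetical models with constant absolute errors of 1 and 2.
errors = np.column_stack([np.ones(20), 2.0 * np.ones(20)])
path_smooth = adaptive_weights(errors, window=5, lam=0.9)
path_fast = adaptive_weights(errors, window=5, lam=0.5)

print(path_smooth[:3, 0], path_fast[:3, 0])
```

Both paths move toward a weight of 0.8 on the better model, but the path with $λ = 0.9$ takes smaller steps.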


#### Time-varying combination weights

- Time-varying parameter (Kalman filter)
- Discrete (observed) state switching (Deutsch et al., 1994)
- Regime switching weights (Elliott and Timmermann, 2005)

Forecast combinations can work well empirically because they provide
insurance against model instability
- Empirically, Elliott and Timmermann (2005) allow for regime switching in combinations of forecasts  and find strong evidence that the relative performance of the underlying forecasts changes over time
- Performance of combined forecasts tends to be more stable than that of individual forecasts used in the empirical combination study of Stock and Watson (2004)
- Combination methods that attempt to explicitly model time-variations in the combination weights often fail to perform well, suggesting that regime switching or model breakdown can be difficult to predict or even to track through time

#### Bayesian Model Averaging

When the data underlying the individual forecasts is observed, we can
construct forecasts from many different models and average over the
resulting forecasts

Same issues as when only the forecasts are observed - but new possibilities
like BMA (Bayesian Model Averaging) arise

However, we do not directly observe the outcome density; we only observe a
draw from it, and so cannot directly choose the weights to minimize the loss between this density and the combined density

The Kullback-Leibler (KL) loss for a linear combination of densities $\sum^m_{i=1} ω_i p_i(y)$
relative to some unknown true density $p(y)$ is given by
$$KL = \int p(y)\ln{(p(y))}dy - \int p(y)\ln{(\sum^m_{i=1} ω_i p_i(y))} dy = C - E \ln{(\sum^m_{i=1} ω_i p_i(y))}$$
where $C$ is constant for all choices of the weights $ω_i$

__Minimizing the KL distance is the same as maximizing the log score in
expectation__
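
The equivalence can be illustrated numerically in a toy setting where the true density is known, which is never the case in practice. The sketch assumes two Gaussian component densities and a true density that is a 0.3/0.7 mixture of them; the integrals are approximated by a Riemann sum on a grid.

```python
import numpy as np

def normal_pdf(y, mean, sd=1.0):
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Grid for numerical integration.
y = np.linspace(-10.0, 10.0, 4001)
dy = y[1] - y[0]

# Two candidate predictive densities and an assumed "true" density.
p1, p2 = normal_pdf(y, -2.0), normal_pdf(y, 2.0)
p_true = 0.3 * p1 + 0.7 * p2

weights = np.linspace(0.0, 1.0, 101)   # weight on p1; p2 gets 1 - w
kl = np.empty_like(weights)
log_score = np.empty_like(weights)
for j, w in enumerate(weights):
    mix = w * p1 + (1.0 - w) * p2
    log_score[j] = np.sum(p_true * np.log(mix)) * dy                   # E ln(mixture)
    kl[j] = np.sum(p_true * (np.log(p_true) - np.log(mix))) * dy       # KL = C - E ln(mixture)

w_min_kl = weights[np.argmin(kl)]
w_max_score = weights[np.argmax(log_score)]
print(w_min_kl, w_max_score)
```

Both criteria select the same weight on the grid because they differ only by the constant $C$. With real data the expectation would be replaced by the average log score over observed outcomes.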


_Bayesian Model Averaging (BMA):_

$$p^c(y) = \sum^m_{i=1} ω_i p(y |M_i)$$
where $M_1, ...., M_m$ - models

- BMA weights predictive densities by the posteriors of the models, $M_i$
- BMA is a model averaging procedure rather than a predictive density combination procedure per se
- BMA assumes the availability of both the data underlying each of the densities, $p_i(y ) = p(y |M_i)$, and knowledge of how that data is employed to obtain a predictive density

Computation of $p(M_i|Z)$ requires computation of the marginal likelihood $p(Z|M_i)$, which can be time consuming

If the models' marginal likelihoods are difficult to compute, one can use a simple approximation based on BIC
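
One common form of the BIC approximation sets the posterior model probability proportional to $\exp(-BIC_i/2)$. The sketch below assumes equal prior model probabilities and BIC defined so that smaller is better; the BIC values are made up.

```python
import numpy as np

def bic_weights(bic):
    """Approximate posterior model probabilities from BIC values
    (assumes equal prior probabilities across models)."""
    bic = np.asarray(bic, dtype=float)
    # Subtract the minimum before exponentiating for numerical stability.
    rel = np.exp(-0.5 * (bic - bic.min()))
    return rel / rel.sum()

# Hypothetical BIC values for three models.
w_bma = bic_weights([100.0, 102.0, 110.0])
print(w_bma)
```

A BIC gap of 10 already pushes a model's weight below one percent, which is in the spirit of discarding models whose posterior probability is far below that of the best model.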


Madigan and Raftery (1994) suggest removing models for which $p(M_i|Z) $ is much smaller than the posterior probability of the best model


BMA forecasts are more robust than individual forecasts, with unbiased and serially uncorrelated forecast errors. Model uncertainty reduces the strength of the evidence on return predictability


### [Averaging and the Optimal Combination of Forecasts](https://econweb.ucsd.edu/~gelliott/AveragingOptimal.pdf)

The optimal combination of forecasts, detailed in Bates and Granger (1969), has often been overshadowed in practice by the simple average.

Explanations of why averaging might in practice work better than constructing the optimal combination have centered on estimation error and the effects variations of the data generating process have on this error. 

The flip side of this explanation is that
the size of the gains must be small enough to be outweighed by the estimation error.
This paper examines the sizes of the theoretical gains to optimal combination, providing
bounds for the gains for restricted parameter spaces and also conditions under which
averaging and optimal combination are equivalent. The paper also suggests a new
method for selecting between models that appears to work well with SPF data.

###  Forecast Combination Puzzle

#### [A Machine Learning Approach to the Forecast Combination Puzzle](https://halshs.archives-ouvertes.fr/halshs-01317974/document)


This paper introduces the only algorithm that automatically manages the
forecast combination puzzle. The proposed algorithm adapts the structure of
the AB-Prod algorithm to solve the novel problem of automatically managing the forecast combination puzzle within macroeconomic
forecasting. The result is the first set of distribution-free performance guarantees for
both the mean combination and any alternative combination.




Forecast combination algorithms provide a robust solution to noisy data and shifting process dynamics. However in practice, sophisticated combination methods often fail to consistently outperform the simple mean combination.
This “forecast combination puzzle” limits the adoption of alternative combination approaches and forecasting algorithms by policy-makers. 

Through
an adaptive machine learning algorithm designed for streaming data, this paper proposes a novel time-varying forecast combination approach that retains distribution-free guarantees in performance while automatically adapting combinations according to the performance of any selected combination approach
or forecaster. 

In particular, the proposed algorithm offers policy-makers the
ability to compute the worst-case loss with respect to the mean combination
ex-ante, while also guaranteeing that the combination performance is never worse than this explicit guarantee. Theoretical bounds are reported with respect to the relative mean squared forecast error. Out-of-sample empirical performance is evaluated on the seven-country dataset



Forecast combination methods often outperform forecasting approaches
that estimate parameters on noisy data, structural breaks, inconsistent predictors and changing environmental dynamics. Unfortunately forecast
combination methods often fail to consistently outperform the mean combination over varying pools of forecasters and varying horizons. This paper offers
the first automatic procedure to manage this so-called “forecast combination
puzzle”. Accordingly, a large body of research has focused on the theoretical and empirical development of complex forecast combination procedures
that aim to fully exploit the information content within a pool of forecasters.
__However, empirical results in the literature demonstrate that existing forecast
combination approaches fail to consistently outperform the mean__. This negative
result is often referred to as the mean forecast combination puzzle.

Building on recent advances in the machine learning literature, this paper introduces the only automatic procedure to manage this puzzle.
First, we recast the forecast combination setting as a game of “prediction
with expert advice”. Next, we adapt the general structure of the AB-Prod algorithm to automatically hedge performance against the mean
combination, inheriting its distribution-free performance guarantees 

##### AB-Prod is computed as follows:

Input:
- Combination algorithms A and B
- A history of observations for the target variable, in our case output or inflation, up to the current time t, of length T.
- Preference weight $λ_B ∈ (0, 1)$ for the algorithm $B$.

Initialization:
- $λ_{A,0} = 1 − λ_B$
- Learning rate $η = \min (\sqrt{\frac{-\log(1-λ_B)}{T}},1/2)$
- Set $s_0$ 

Repeat the following for each observation from time $t = 0, . . . , T$:
- Compute combination weight $s_t = \frac{λ_{A,t}}{λ_{A,t} + λ_B}$
- Observe the target variable $y_t$ and compute the loss $l_{A,t}$ and $l_{B,t}$.
- Compute the combination loss $l_{AB−Prod,t} = s_tl_{A,t} + (1 − s_t)l_{B,t}$.
- Compute the deviation $δ_t = l_{B,t} − l_{A,t}$
- Update the Score $λ_{A,t+1} = λ_{A,t}(1 + ηδ_t)$
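
The steps above can be sketched as follows. This follows the listed updates literally and is not the paper's reference implementation. It assumes that the per-period losses of A and B are already computed and scaled to $[0, 1]$ (so that the multiplicative update stays positive), and it loops over the $T$ available observations; the loss series are made up.

```python
import numpy as np

def ab_prod(loss_a, loss_b, lambda_b):
    """Combine algorithms A and B following the update steps listed above."""
    loss_a = np.asarray(loss_a, dtype=float)
    loss_b = np.asarray(loss_b, dtype=float)
    T = len(loss_a)
    eta = min(np.sqrt(-np.log(1.0 - lambda_b) / T), 0.5)   # learning rate
    lam_a = 1.0 - lambda_b                                  # initial score of A
    s = np.empty(T)
    combined = np.empty(T)
    for t in range(T):
        s[t] = lam_a / (lam_a + lambda_b)                   # weight placed on A
        combined[t] = s[t] * loss_a[t] + (1.0 - s[t]) * loss_b[t]
        delta = loss_b[t] - loss_a[t]                       # positive when A beats B
        lam_a *= 1.0 + eta * delta                          # only A's score is updated
    return s, combined

# Hypothetical bounded losses: A is consistently better than B.
T = 50
s, combined = ab_prod(np.full(T, 0.1), np.full(T, 0.5), lambda_b=0.5)
print(s[0], s[-1], combined.sum())
```

Because the score of B stays fixed at $λ_B$, the weight on A only grows while A keeps outperforming B, which is what hedges performance against the benchmark B.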




#### [Learning Time-Varying Forecast Combinations](https://simpolproject.eu/download/dolfins_research/mandel2016learning.pdf)

Combining forecasts has been demonstrated as a robust solution to noisy
data, structural breaks, unstable forecasters and shifting environmental dynamics. In practice, sophisticated combination methods have failed to consistently outperform the mean over multiple horizons, pools of varying forecasters
and different endogenous variables. This paper addresses the challenge to “develop methods better geared to the intermittent and evolving nature of predictive relations”, noted in Stock and Watson (2001), by proposing an adaptive
nonparametric “meta” approach that provides a time-varying hedge against
the performance of the mean for any selected forecast combination approach.
This approach arguably solves the so-called “Forecast Combination Puzzle”
using a meta-algorithm that adaptively hedges weights between the mean and
a specific forecast combination algorithm or pool of forecasters augmented
with one or more forecast combination algorithms. Theoretical performance
bounds are reported and empirical performance is evaluated on the seven-country macroeconomic output and inflation dataset introduced in Stock and
Watson (2001) as well as the Euro-area Survey of Professional Forecasters

#### [A Simple Explanation of the Forecast Combination Puzzle](https://warwick.ac.uk/fac/soc/economics/staff/academic/wallis/publications/smithwallis_obes_09.pdf)

__This article presents a formal explanation of the forecast combination puzzle, that simple combinations of point forecasts are repeatedly found to outperform sophisticated weighted combinations in empirical applications.__ The explanation lies in the
effect of finite-sample error in estimating the combining weights. A small Monte Carlo study and a reappraisal of an empirical study by Stock and Watson support this explanation. The Monte Carlo evidence, together with a large-sample
approximation to the variance of the combining weight, also supports the popular
recommendation to ignore forecast error covariances in estimating the weight.


Three main conclusions emerge from the foregoing analysis.
- If the optimal combining weights are equal or close to equality, a simple average of competing forecasts is expected to be more accurate, in terms of MSFE, than a combination based on estimated weights. The parameter estimation effect is not large, nevertheless it explains the forecast combination puzzle.
- However, if estimated weights are to be used, then it is better to neglect any covariances between forecast errors and base the estimates on inverse MSFEs alone, than to use the optimal formula originally given by Bates and Granger for two forecasts, or its regression generalization for many forecasts. 
- When the number of competing forecasts is large, so that under equal weighting each has a very small weight, the simple average can gain in efficiency by trading off a small bias against a larger estimation variance. 
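
The second conclusion refers to weights based on inverse MSFEs alone, ignoring covariances between forecast errors. A minimal sketch with made-up forecast errors and a hypothetical helper name:

```python
import numpy as np

def inverse_msfe_weights(errors):
    """Weights proportional to 1/MSFE of each forecast, ignoring error covariances."""
    errors = np.asarray(errors, dtype=float)
    msfe = (errors ** 2).mean(axis=0)
    inv = 1.0 / msfe
    return inv / inv.sum()

# Hypothetical forecast errors: rows are periods, columns are forecasters.
errors = np.array([[1.0, -2.0],
                   [-1.0, 2.0],
                   [1.0, 2.0]])
w_inv = inverse_msfe_weights(errors)   # MSFEs are 1 and 4
print(w_inv)
```

Unlike the covariance-based formula, these weights are always positive and require estimating only one parameter per forecast.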

#### [Some Theoretical Results on Forecast Combinations](http://paneldataconference2015.ceu.hu/Program/Felix-Chan.pdf)


By setting up the forecast combination problem as a panel data model, the paper was
able to provide the necessary and sufficient condition for optimal weight as well as the
necessary and sufficient condition for simple average to be the optimal weight under
Mean Squared Forecast Errors (MSFE). It also provided theoretical justifications on
the superior forecast performance of simple average or individual models in the MSFE
sense. The paper also provided a theoretical exposition on the relative performance of
simple average and estimated optimal weight.

The results show that the simple average can often outperform the estimated optimal weight in the presence of
estimation error. This theoretical justification is consistent with the empirical observation that the simple average often has superior performance over the estimated optimal
weight.

The paper also investigated the forecast combination problem under Mean Absolute
Deviation (MAD). By applying the __Fundamental Theorem of Linear Programming,__ the
paper was able to establish the necessary and sufficient condition for the simple average
to outperform a single model in the MAD sense. This result is new and the method
adopted in the paper might suggest a feasible way to analyse the forecast combination
problems for non-differentiable forecast criteria further.

#### [On the Forecast Combination Puzzle](https://arxiv.org/pdf/1505.00475.pdf)

It is often reported in forecast combination literature that a simple average
of candidate forecasts is more robust than sophisticated combining methods. This phenomenon is usually referred to as the “forecast combination puzzle”. Motivated by this
puzzle, we explore its possible explanations including estimation error, invalid weighting
formulas and model screening. We show that existing understanding of the puzzle should
be complemented by the distinction of different forecast combination scenarios known as
combining for adaptation and combining for improvement. Applying combining methods without consideration of the underlying scenario can itself cause the puzzle. Based
on our new understandings, both simulations and real data evaluations are conducted
to illustrate the causes of the puzzle. We further propose __a multi-level AFTER strategy__
that can integrate the strengths of different combining methods and adapt intelligently
to the underlying scenario. In particular, by treating the simple average as a candidate
forecast, the proposed strategy is shown to avoid the heavy cost of estimation error and,
to a large extent, solve the forecast combination puzzle.

#### [Is there an optimal forecast combination?](https://dornsife.usc.edu/assets/sites/462/docs/papers/forthcoming/Hsiao_and_Wan_FC08312011.pdf)

We consider several geometric approaches for combining forecasts in large samples: a simple
eigenvector approach, a mean-corrected eigenvector approach, and a trimmed eigenvector approach. We give
conditions under which the geometric approach yields the same result as the regression approach. We also
consider a mean-corrected and a mean-and-scale-corrected simple average of all predictive models for finite
samples and give conditions under which the simple average is an optimal combination. Monte Carlo experiments
are conducted to compare the finite sample performance of these and some popular forecast
combination and information combination methods and to shed light on the issues of "forecast
combination" vs "information combination". We also try to shed light on whether there exists
an optimal forecast combination method by comparing various forecast combination methods
to predict __US real output growth rate__
and excess equity premium.

- We note first that there are periods where the predictions based on fixed and continuously updating forecasts are way off from the actual, but the rolling window approach appears able to narrow the gap. In general, the rolling framework performs better than the other 2 frameworks (except the three GR approaches). 
- Second, because the information for generating predictive model is readily available, the __rolling window model selection approach__ of selecting the best predictive model appears to yield most accurate predictions. 
- Third, if information is not readily available, then the ranking of forecast combination methods in the rolling window framework appears to be consistent with the simulation results, in which the eigenvector approach of obtaining relative weights for forecasting models yields more accurate predictions than the regression approach or Bayesian averaging. However, the mean-corrected eigenvector approach appears to dominate the simple eigenvector approach, perhaps because some predictive models are biased.
- Fourth, perhaps because of frequent "breaks" between the actual and predictive models, trimming does not lead to an improvement, and the mean-corrected simple average yields forecasts as good as (or slightly better than) those of the mean-corrected eigenvector approach.