# Chapter 1 : Basics of Time Series Analysis



A time series is data obtained from observations collected sequentially over time.

The purpose of time series analysis is generally twofold:

    1) To understand or model the stochastic mechanism that gives rise to an observed series.

    2) To predict or forecast the future values of a series based on the history of that series and, possibly, other related series or factors.

A somewhat unique feature of time series and their models is that we usually cannot assume that the observations arise independently from a common population (or from populations with different means, for example). Studying models that incorporate dependence is the key concept in time series analysis.



Examples of Time Series Data:


*   Annual Rainfall in LA: 
    The plot shows annual rainfall in LA over a 100-year interval. Rainfall varies considerably over the years: some years are low, some high, and many are in between.

    For analysis and modeling purposes we are interested in whether or not consecutive years are related in some way. If so, we might be able to use one year’s rainfall value to help forecast next year’s rainfall amount. One graphical way to investigate that question is to pair up consecutive rainfall values and plot the resulting scatterplot of pairs.

    If the scatterplot shows a pattern, this year's and next year's rainfall are correlated; if it shows none, there is little evidence that one year's value helps predict the next.

      
*   Industrial Chemical Process: 
    The variable measured here is a color property from consecutive batches in the process.

    Here values that are neighbors in time tend to be similar in size, which suggests that neighbors are related to one another. We can check this with a scatterplot of neighboring pairs.

    The scatterplot shows an upward trend: low values tend to be followed by low values in the next batch, and high values by high values.


*   Monthly Average Temperatures in Dubuque, Iowa: 
    The average monthly temperatures (in degrees Fahrenheit) over a number of years recorded in Dubuque, Iowa.

    This time series displays a very regular pattern called seasonality. Seasonality for monthly values occurs when observations twelve months apart are related in some manner or another. All Januarys and Februarys are quite cold but they are similar in value and different from the temperatures of the warmer months of June, July, and August, for example. There is still variation among the January values and variation among the June values.


*   Monthly Oil Filter Sales:

    This series concerns the monthly sales to dealers of a specialty oil filter for construction equipment manufactured by John Deere. When these data were first presented to one of the authors, the manager said, “There is no reason to believe that these sales are seasonal.” Seasonality would be present if January values tended to be related to other January values, February values tended to be related to other February values, and so forth. The time series plot shown in Exhibit 1.8 is not designed to display seasonality especially well. Exhibit 1.9 gives the same plot but amended to use meaningful plotting symbols. In this plot, all January values are plotted with the character J, all Februarys with F, all Marches with M, and so forth.

    With these plotting symbols, it is much easier to see that sales for the winter months of January and February all tend to be high, while sales in September, October, November, and December are generally quite low. The seasonality in the data is much clearer in this modified time series plot.
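The pairing of consecutive values used in the rainfall and chemical-process examples can be sketched in a few lines of Python. This is only an illustration on synthetic data (the series `independent` and `smooth` and the helper names `lag_pairs` and `pearson` are assumptions made for this sketch, not the data or code behind the exhibits).

```python
import random
from statistics import mean

def lag_pairs(series, lag=1):
    """Return (earlier, later) pairs of values that are `lag` steps apart."""
    return list(zip(series[:-lag], series[lag:]))

def pearson(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Synthetic stand-ins (NOT the LA rainfall or chemical-process data).
rng = random.Random(0)
independent = [rng.gauss(0, 1) for _ in range(500)]   # consecutive values unrelated
smooth = [0.0]
for _ in range(499):                                   # each value leans on the previous one
    smooth.append(0.8 * smooth[-1] + rng.gauss(0, 1))

r_independent = pearson(lag_pairs(independent))   # expected to be near zero
r_smooth = pearson(lag_pairs(smooth))             # expected to be clearly positive
```

A scatterplot of `lag_pairs(series)` is the graphical version of the same check; the correlation coefficient summarizes it in one number.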




Model building Strategy


 

*   Model Specification (or Identification)
    In model specification (or identification), the classes of time series models are selected that may be appropriate for a given observed series. In this step we look at the time plot of the series, compute many different statistics from the data, and also apply any knowledge of the subject matter in which the data arise, such as biology, business, or ecology. It should be emphasized that the model chosen at this point is tentative and subject to revision later on in the analysis.

    In choosing a model, we follow the principle of parsimony; that is, the model used should require the smallest number of parameters that will adequately represent the time series.
*   Model Fitting 
    The model will inevitably involve one or more parameters whose values must be estimated from the observed series. Model fitting consists of finding the best possible estimates of those unknown parameters within a given model. We shall consider criteria such as least squares and maximum likelihood for estimation.

*   Model Diagnostics
    Model diagnostics is concerned with assessing the quality of the model that we have specified and estimated. 

    If no inadequacies are found, the modeling may be assumed to be complete, and the model may be used, for example, to forecast future values. Otherwise, we choose another model in the light of the inadequacies found; that is, we return to the model specification step. In this way, we cycle through the three steps until, ideally, an acceptable model is found.

    











# Chapter 2 : Concepts in Time Series Analysis


Time Series and Stochastic Process:

    The sequence of random variables {Yt: t = 0, ±1, ±2, ±3,...} is called a stochastic process and serves as a model for an observed time series. It is known that the complete probabilistic structure of such a process is determined by the set of distributions of all finite collections of the Y’s. 

    In practice, however, we usually do not deal explicitly with these multivariate distributions, because much of the information in the joint distributions can be described in terms of means, variances, and covariances. If the joint distributions of the Y’s are multivariate normal distributions, then the first and second moments completely determine all the joint distributions.

Means, Variances and Covariances:

    For a stochastic process {Yt: t = 0, ±1, ±2, ±3,...}, the mean function is defined by

    μt = E(Yt) for t=0,±1,±2,...

    That is, μt is just the expected value of the process at time t. In general, μt can be different at each time point t.
    The autocovariance function, γt,s, is defined as

    γt,s = Cov(Yt, Ys) for t, s = 0, ±1, ±2,...

    where Cov(Yt, Ys) = E[(Yt − μt)(Ys − μs)] = E(YtYs) − μt μs. In words, it is the expected value of the product of the two variables' deviations from their respective means.
    
    The autocorrelation function, ρt,s, is given by

    ρt,s = Corr(Yt, Ys) for t, s = 0, ±1, ±2,...
    where

    Corr(Yt, Ys) = Cov(Yt, Ys) / sqrt(Var(Yt) Var(Ys)) = γt,s / sqrt(γt,t γs,s)


    Recall that both covariance and correlation are measures of the (linear) dependence between random variables but that the unitless correlation is somewhat easier to interpret. The following important properties follow from known results and our definitions:
    γt,t = Var(Yt)              ρt,t = 1
    γt,s = γs,t                 ρt,s = ρs,t
    |γt,s| ≤ sqrt(γt,t γs,s)    |ρt,s| ≤ 1

    A correlation close to 1 or −1 indicates strong (linear) dependence between the two variables, whereas a correlation close to zero indicates weak (linear) dependence.


    To investigate the covariance properties of various time series models, the following result will be used repeatedly: If c1, c2,..., cm and d1, d2,..., dn are constants and t1, t2,..., tm and s1, s2,..., sn are time points, then

    $$\mathrm{Cov}\left[\sum_{i=1}^{m} c_i Y_{t_i},\ \sum_{j=1}^{n} d_j Y_{s_j}\right] = \sum_{i=1}^{m}\sum_{j=1}^{n} c_i d_j\,\mathrm{Cov}(Y_{t_i}, Y_{s_j})$$

    The proof of this equation, though tedious, is a straightforward application of the linear properties of expectation. As a special case, we obtain the well-known result:

    $$\mathrm{Var}\left[\sum_{i=1}^{n} c_i Y_{t_i}\right] = \sum_{i=1}^{n} c_i^2\,\mathrm{Var}(Y_{t_i}) + 2\sum_{i=2}^{n}\sum_{j=1}^{i-1} c_i c_j\,\mathrm{Cov}(Y_{t_i}, Y_{t_j})$$


The Random Walk:

    Let e1, e2,... be a sequence of independent, identically distributed random variables each with zero mean and variance σe². The observed time series, {Yt : t = 1, 2,...}, is constructed as follows:

    Y1 = e1
    Y2 = e1 + e2
    ...
    Yt = e1 + e2 + ... + et

    Alternatively, we can write

    Y(t) = Y(t-1) + e(t)
    
    with “initial condition” Y1 = e1. If the e’s are interpreted as the sizes of the “steps” taken (forward or backward) along a number line, then Yt is the position of the “random walker” at time t.

    Each step moves the walker forward or backward depending on the sign of et, so the current position accumulates all of the earlier steps.

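    The mean function is obtained as follows:

    μt = E(Yt) = E(e1 + e2 + ... + et) = E(e1) + E(e2) + ... + E(et) = 0 + 0 + ... + 0

    Therefore: μt = 0 for all t.

    Because the e's are independent, the variance is

    Var(Yt) = Var(e1 + e2 + ... + et)
    = Var(e1) + Var(e2) + ... + Var(et)
    = σe² + σe² + ... + σe²

    Therefore: Var(Yt) = tσe²

    so the process variance increases linearly with time.
    A similar calculation gives the autocovariance function. For 1 ≤ t ≤ s,

    γt,s = Cov(Yt, Ys) = tσe²

    and hence the autocorrelation function is ρt,s = γt,s / sqrt(γt,t γs,s) = sqrt(t/s).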

    The values of Y at neighboring time points are more and more strongly and positively correlated as time goes by. On the other hand, the values of Y at distant time points are less and less correlated.

    The random walk provides a good model for phenomena as diverse as the movement of common stock prices and the position of small particles suspended in a fluid (so-called Brownian motion).
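The random walk moments can be checked by simulation. The sketch below is illustrative only: it assumes normally distributed steps with σe = 1, and the names `random_walk`, `var_at`, and `corr_between` are made up for this example. For a random walk, Var(Yt) should be close to t and Corr(Yt, Ys) close to sqrt(t/s) when t ≤ s.

```python
import random
from statistics import mean, pvariance

def random_walk(n, rng, sigma=1.0):
    """Return [Y1, ..., Yn] with Yt = Y(t-1) + et and Y1 = e1."""
    walk, position = [], 0.0
    for _ in range(n):
        position += rng.gauss(0.0, sigma)   # take one random step
        walk.append(position)
    return walk

rng = random.Random(42)
walks = [random_walk(50, rng) for _ in range(4000)]   # 4000 independent realizations

def var_at(t):
    """Monte Carlo estimate of Var(Yt); t is 1-based."""
    return pvariance([w[t - 1] for w in walks])

def corr_between(t, s):
    """Monte Carlo estimate of Corr(Yt, Ys); t and s are 1-based."""
    a = [w[t - 1] for w in walks]
    b = [w[s - 1] for w in walks]
    ma, mb = mean(a), mean(b)
    cov = mean((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (pvariance(a) * pvariance(b)) ** 0.5
```

For example, `var_at(10)` should come out near 10, `corr_between(10, 40)` near sqrt(10/40) = 0.5, and `corr_between(49, 50)` near 0.99, which matches the statement that neighboring values become more strongly correlated as time goes by.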


A Moving Average:
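    Suppose that {Yt} is constructed as

    Yt = (e(t) + e(t–1)) / 2

    where (as always throughout this book) the e’s are assumed to be independent and identically distributed with zero mean and variance σe².

    μt = E(Yt) = E{(e(t) + e(t-1)) / 2} = (E(e(t)) + E(e(t-1))) / 2

    which also equals 0.

    The variance is half the white noise variance: Var(Yt) = (Var(e(t)) + Var(e(t-1))) / 4 = σe²/2.

    The covariance at lag 1 is a quarter of the white noise variance: Cov(Yt, Y(t-1)) = Var(e(t-1)) / 4 = σe²/4, so the lag 1 autocorrelation is 0.5. At lags of 2 or more the two averages share no common e, so the covariance is zero.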
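A quick simulation can check these moving average values. The sketch assumes normal white noise with σe² = 1, so the variance of Yt should be near 0.5 and the lag 1 autocovariance near 0.25; the helper `autocov` is a name chosen for this example.

```python
import random
from statistics import mean

def autocov(series, lag):
    """Sample autocovariance at the given lag (divisor n)."""
    n, m = len(series), mean(series)
    return sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag)) / n

rng = random.Random(1)
e = [rng.gauss(0.0, 1.0) for _ in range(100_001)]        # white noise, variance 1
y = [(e[t] + e[t - 1]) / 2 for t in range(1, len(e))]    # Yt = (et + e(t-1)) / 2

g0 = autocov(y, 0)   # about 0.5  (half the white noise variance)
g1 = autocov(y, 1)   # about 0.25 (a quarter of the white noise variance)
g2 = autocov(y, 2)   # about 0    (no shared e at lag 2)
```

The ratio `g1 / g0` estimates the lag 1 autocorrelation, which should be close to 0.5.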

Stationarity:
    
    To make statistical inferences about the structure of a stochastic process on the basis of an observed record of that process, we must usually make some simplifying (and presumably reasonable) assumptions about that structure. 

    The most important such assumption is that of stationarity. The basic idea of stationarity is that the probability laws that govern the behavior of the process do not change over time. In a sense, the process is in statistical equilibrium. 

    Specifically, a process {Yt} is said to be strictly stationary if the joint distribution of Yt1, Yt2,..., Ytn is the same as the joint distribution of Y(t1 – k), Y(t2 – k),..., Y(tn – k) for all choices of time points t1, t2,..., tn and all choices of time lag k.

    1) Thus, when n = 1 the (univariate) distribution of Yt is the same as that of Yt − k for all t and k; in other words, the Y’s are (marginally) identically distributed.

    It then follows that E(Yt) = E(Yt − k) for all t and k so that the mean function is constant for all time. Additionally, Var(Yt) = Var(Yt − k) for all t and k so that the variance is also constant over time.

    Therefore, if a process is strictly stationary (and its variance is finite),
    a) the mean and variance are constant over time;
    b) the Y's are (marginally) identically distributed.

    2) Setting n = 2 in the stationarity definition we see that the bivariate distribution of Yt and Ys must be the same as that of Yt − k and Ys − k, from which it follows that Cov(Yt, Ys) = Cov(Yt − k, Ys − k) for all t, s, and k.

    That is, the covariance between Yt and Ys depends on time only through the time difference |t − s| and not otherwise on the actual times t and s.

    If a process is strictly stationary and has finite variance, then the covariance function must depend only on the time lag.

    A stochastic process {Yt} is said to be weakly (or second-order) stationary if
    1. The mean function is constant over time,
    2. γt,t–k = γ0,k for all time t and lag k

    In this book the term stationary when used alone will always refer to this weaker form of stationarity. However, if the joint distributions for the process are all multivariate normal distributions, it can be shown that the two definitions coincide. For stationary processes, we usually only consider k ≥ 0.


    White Noise:

    An important example of a stationary process is white noise,

    which is defined as a sequence of independent, identically distributed random variables {et}. Its importance stems not from the fact that it is an interesting model itself but from the fact that many useful processes can be constructed from white noise. That {et} is strictly stationary follows from

    Pr(et1 ≤x1,et2 ≤x2,...,etn ≤xn)
    = Pr(et1 ≤ x1)Pr(et2 ≤ x2)...Pr(etn ≤ xn)   (by independence) 
    
    = Pr(et1–k≤x1)Pr(et2–k≤x2)...Pr(etn–k≤xn)   (by identical distributions) 
    
    = Pr(et1–k≤x1,et2–k≤x2,...,etn–k≤xn)        (by independence)


    The term white noise arises from the fact that a frequency analysis of the model shows that, in analogy with white light, all frequencies enter equally. We usually assume that the white noise process has mean zero and denote Var(et) by σe².

    The moving average Yt = (e(t) + e(t-1)) / 2 is an example of a stationary process constructed from white noise.

    Stationarity cannot always be judged from a time series plot. The random cosine wave is an example: each realization looks like a deterministic curve, yet the process is stationary.

    In the other direction, the random walk is not stationary, but its first difference Y(t) − Y(t-1) = e(t) is stationary.

    Clearly, many real time series cannot be reasonably modeled by stationary processes since they are not in statistical equilibrium but are evolving over time. However, we can frequently transform nonstationary series into stationary series by simple techniques such as differencing. 
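Differencing can be illustrated with a random walk, whose first difference is just the white noise it was built from. The following is a minimal sketch on simulated data; `difference` is a helper name chosen for this example.

```python
import random
from itertools import accumulate

def difference(series):
    """First difference: returns [Y2 - Y1, Y3 - Y2, ...]."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(7)
steps = [rng.gauss(0.0, 1.0) for _ in range(200)]   # white noise e1..e200
walk = list(accumulate(steps))                      # random walk Yt = e1 + ... + et
recovered = difference(walk)                        # equals e2..e200 up to rounding
```

The differenced series has one observation fewer than the original, and here it reproduces the stationary white noise steps.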






# Chapter 3 : Trends

In a general time series, the mean function is a totally arbitrary function of time. In a stationary time series, the mean function must be constant in time.

    Deterministic vs. Stochastic Trends
    a) Stochastic Trend:

    Trends can be elusive: different analysts often read the same random walk differently. Consider a simulated random walk that appears to show an upward trend. We know that the random walk process has zero mean for all time, so the perceived trend is just an artifact of the strong positive correlation between the series values at nearby time points and the increasing variance in the process as time goes by. A second and third simulation of exactly the same process might well show completely different “trends.” 

    b) Deterministic Trend:

    Consider the average monthly temperature series for a location. Such a series has a cyclical or seasonal trend.
    
    In this case, a possible model might be Yt = μt + Xt, where μt is a deterministic function that is periodic with period 12; that is μt, should satisfy
    μ(t) = μ(t – 12)    for all t

    We might assume that Xt, the unobserved variation around μt, has zero mean for all t so that indeed μt is the mean function for the observed series Yt. We could describe this model as having a deterministic trend.

    In other situations we might hypothesize a deterministic trend that is linear in time (that is, μt = β0 + β1t) or perhaps a quadratic time trend, μt = β0 + β1t + β2t². Note that an implication of the model Yt = μt + Xt with E(Xt) = 0 for all t is that the deterministic trend μt applies for all time.

    Thus, if μt = β0 + β1t, we are assuming that the same linear time trend applies forever.

    Therefore, we should have good reason for assuming such a model, not just a plot in which the data look linear over the observed period.
    

Modelling of Deterministic Trends

1) Estimation of a Constant Mean

Let us first assume a constant mean function.

    Y(t) = μ + Xt     (μ does not change over time)
    where E(Xt) = 0 for all t. We wish to estimate μ from the observed series Y1, Y2,..., Yn.
    The most common estimate of μ is the sample mean:

    Ȳ = (∑ Y(t)) / n  (for t = 1 to n)

    Since E(Ȳ) = μ, the sample mean Ȳ is an unbiased estimate of μ. Its precision depends on the dependence in {Xt}. If {Xt} is white noise, Var(Ȳ) = γ0/n. If neighboring values are negatively correlated, the series tends to oscillate back and forth across the mean, and Ȳ estimates μ more precisely than in the white noise case; positive correlation has the opposite effect.
    For a nonstationary series the situation can be very different. If {Xt} is a random walk, the variance of our estimate of the mean actually increases as the sample size n increases. Clearly this is unacceptable, and we need to consider other estimation techniques for nonstationary series.

2) Regression Methods.
The classical statistical method of regression analysis may be readily used to estimate the parameters of common nonconstant mean trend models.
We consider: 1) Linear and Quadratic Trends
             2) Seasonal Means
             3) Cosine Trends
    1) Linear and Quadratic Trends:

    Consider the deterministic time trend expressed as:
    μt = β0 + β1t

    where the slope and intercept, β1 and β0 respectively, are unknown parameters.

    The classical least squares method chooses as estimates the values of β0 and β1 that minimize

    Q(β0, β1) = ∑ [Yt − (β0 + β1t)]²   (for t = 1 to n)

    The solution may be obtained in several ways, for example, by computing the partial derivatives with respect to both β’s, setting the results equal to zero, and solving the resulting linear equations for the β’s. 
    If we fit a linear trend to a random walk, the straight line describes the data poorly: it cannot capture the variation of the series, whose apparent trend is stochastic rather than deterministic.
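Solving the two normal equations gives the familiar closed form for the slope and intercept. A minimal sketch, assuming time is indexed t = 1, 2,..., n (`fit_linear_trend` is a name chosen for this example):

```python
from statistics import mean

def fit_linear_trend(y):
    """Least squares fit of Yt = b0 + b1*t for t = 1..n; returns (b0, b1)."""
    n = len(y)
    t = range(1, n + 1)
    t_bar, y_bar = mean(t), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared time deviations.
    b1 = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / sum(
        (ti - t_bar) ** 2 for ti in t
    )
    # Intercept: the fitted line passes through (t_bar, y_bar).
    b0 = y_bar - b1 * t_bar
    return b0, b1

b0, b1 = fit_linear_trend([5.0, 7.0, 9.0, 11.0])   # data lying exactly on 3 + 2t
```

For data that lie exactly on a line, as in the usage line, the fit recovers that line's intercept and slope.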

    2) Cyclical and Seasonal Trends
 a) Regression with Independent Parameters (Seasonal Means)
    Consider now modeling and estimating seasonal trends, such as for the average monthly temperature data.

    For a cyclical trend we assume that the mean μt is not constant over time:

    Yt = μt + Xt      where E(Xt) = 0 for all t.

    The most general assumption for μt with monthly seasonal data is that there are 12 constants (parameters), β1, β2,..., and β12, one for each month, to represent the seasonality. This is called the seasonal means model.

    However, the seasonal means model does not take the shape of the seasonal trend into account. For example, the fact that the March and April means are close to each other, and different from the June and July means, is not reflected in the model.

  b) Cosine Trends 
    Seasonal trends can be modeled economically with cosine curves that incorporate the smooth change expected from one time period to the next while still preserving the seasonality.

    Consider the cosine curve with equation
    
    μt = βcos(2πft + Φ)

    We call β (> 0) the amplitude, f the frequency, and Φ the phase of the curve. As t varies, the curve oscillates between a maximum of β and a minimum of −β. Since the curve repeats itself exactly every 1/f time units, 1/f is called the period of the cosine wave. 

    Φ serves to set the arbitrary origin on the time axis.

    For monthly data with time indexed as 1, 2,..., the most important frequency is f = 1/12, because such a cosine wave will repeat itself every 12 months. We say that the period is 12.

    The simplest such model for the trend would be expressed as
    
    μt = β0 + β1cos(2πft) + β2sin(2πft).

    Here the constant term, β0, can be meaningfully thought of as a cosine with frequency zero.
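The step from the single cosine with phase Φ to the form that is linear in β1 and β2 is the angle-addition identity for the cosine. Written out, with β1 and β2 defined in terms of β and Φ:

```latex
\begin{aligned}
\beta\cos(2\pi f t + \Phi)
  &= \beta\cos\Phi\,\cos(2\pi f t) - \beta\sin\Phi\,\sin(2\pi f t) \\
  &= \beta_1\cos(2\pi f t) + \beta_2\sin(2\pi f t),
  \qquad \beta_1 = \beta\cos\Phi,\quad \beta_2 = -\beta\sin\Phi .
\end{aligned}
```

Conversely, β = sqrt(β1² + β2²) and tan Φ = −β2/β1. The advantage of the second form is that, once f is fixed, the model is linear in the unknown parameters and can be fitted by ordinary least squares.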

    



Reliability and Efficiency of Regression Estimates.

    We assume that the series is represented as Yt = μt + Xt, where μt is a deterministic trend of the kind considered above and {Xt} is a zero-mean stationary process with autocovariance and autocorrelation functions γk and ρk, respectively. Ordinary regression estimates parameters in a linear model according to the criterion of least squares regardless of whether we are fitting linear time trends, seasonal means, cosine curves, or whatever.

    In the case of seasonal means, the least squares estimates are just seasonal averages; thus, if we have N (complete) years of monthly data, we can write the estimate of the mean for the jth season as

    β̂(j) = (∑ Y(j+12i)) / N    (for i = 0 to N−1)
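The seasonal averages can be computed directly by slicing the series with a step equal to the period. A minimal sketch, in which `seasonal_means` is a name chosen for this example and the data are made-up quarterly values (period 4) rather than monthly ones:

```python
def seasonal_means(y, period=12):
    """Average the observations of each season; len(y) must be N * period."""
    if len(y) % period != 0:
        raise ValueError("need complete years of data")
    n_years = len(y) // period
    # y[j::period] picks out season j from every year.
    return [sum(y[j::period]) / n_years for j in range(period)]

# Two hypothetical "years" of quarterly data (period = 4).
means = seasonal_means([10, 20, 30, 40, 12, 22, 32, 42], period=4)
# means == [11.0, 21.0, 31.0, 41.0]
```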

In some circumstances, seasonal means and cosine trends could be considered as competing models for a cyclical trend. 

    The parameters themselves are not directly comparable, but we can compare the estimates of the trend at comparable time points.
    
    Consider the two estimates for the trend in January; that is, μ1. With seasonal means, this estimate is just the January average, which has variance γ0/N when {Xt} is white noise. With the cosine trend model, the corresponding estimate is

    μ̂1 = β̂0 + β̂1cos(2π/12) + β̂2sin(2π/12)

    whose variance under white noise is approximately 3γ0/n, with n = 12N the total number of observations. The ratio of the two standard deviations is therefore sqrt((3γ0/n) / (γ0/N)) = sqrt(3N/n) = sqrt(3/12) = 0.5.

    Thus, in the cosine model, we estimate the January effect with a standard deviation that is only half as large as it would be if we estimated with a seasonal means model, a substantial gain. (Of course, this assumes that the cosine trend plus white noise model is the correct model.)

    We turn now to comparing the least squares estimates with the so-called best linear unbiased estimates (BLUE) or the generalized least squares (GLS) estimates. If the stochastic component {Xt} is not white noise, estimates of the unknown parameters in the trend function may be made; they are linear functions of the data, are unbiased, and have the smallest variances among all such estimates—the so-called BLUE or GLS estimates. These estimates and their variances can be expressed fairly explicitly by using certain matrices and their inverses.

    However, constructing these estimates requires complete knowledge of the covariance function of the stochastic component, a function that is unknown in virtually all real applications. It is possible to iteratively estimate the covariance function for {Xt} based on a preliminary estimate of the trend. The trend is then estimated again using the estimated covariance function for {Xt} and thus iterated to an approximate BLUE for the trend.

    Fortunately, there are some results based on large sample sizes that support the use of the simpler least squares estimates for the types of trends that we have considered.

    We assume that the trend is either a polynomial in time, a trigonometric polynomial, seasonal means, or a linear combination of these. Then, for a very general stationary stochastic component {Xt}, the least squares estimates for the trend have the same variance as the best linear unbiased estimates for large sample sizes.

    Although the simple least squares estimates may be asymptotically efficient, it does not follow that the estimated standard deviations of the coefficients as printed out by all regression routines are correct. 

    For example, Fuller (1996) shows that if Y(t) = β*Z(t) + X(t), where {Xt} has a simple stochastic structure but {Zt} is also a stationary series, then the least squares estimate of β can be very inefficient and biased even for large samples.


Interpreting Regression Output

    




Standard regression output depends on the assumption that the stochastic component {X(t)} is white noise, and some of it further assumes that {X(t)} is approximately normally distributed.

Consider the linear trend μt = β0 + β1t. For each t, the unobserved stochastic component X(t) can be estimated (predicted) by Yt − μ̂t. If the {Xt} process has constant variance, then we can estimate the standard deviation of Xt, namely sqrt(γ0), by the residual standard deviation

    s = sqrt( ∑ (Y(t) − μ̂(t))² / (n − p) )   (for t = 1 to n)
    where p is the number of parameters estimated in μt and n − p is the so-called degrees of freedom for s. The value of s gives an absolute measure of the goodness of fit of the estimated trend: the smaller the value of s, the better the fit. However, a value of s of, say, 60.74 is somewhat difficult to interpret.
  
    A unitless measure of the goodness of fit of the trend is the value of R², also called the coefficient of determination or multiple R-squared. One interpretation of R² is that it is the square of the sample correlation coefficient between the observed series and the estimated trend.

It is also the fraction of the variation in the series that is explained by the estimated trend.
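Both summary measures are simple functions of the residuals. The sketch below uses a small made-up series and its least squares line 2.2 + 0.6t; the function name `fit_summary` is an assumption for this example, and R² is computed as one minus the ratio of residual to total sum of squares, which equals the "fraction of variation explained" for a least squares fit that includes a constant term.

```python
from statistics import mean

def fit_summary(y, fitted, n_params):
    """Return (s, r_squared) for observed values and fitted trend values."""
    n = len(y)
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    sse = sum(r ** 2 for r in residuals)          # residual sum of squares
    y_bar = mean(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)      # total sum of squares
    s = (sse / (n - n_params)) ** 0.5             # residual standard deviation
    return s, 1.0 - sse / sst

# Hypothetical data for t = 1..5 and the fitted values of its least squares line.
observed = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
s, r_squared = fit_summary(observed, fitted, n_params=2)
```

Here the residual sum of squares is 2.4 on 3 degrees of freedom, so s = sqrt(0.8), and the trend explains 60% of the variation.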

Residual Analysis:

The unobserved stochastic component {Xt} can be estimated, or predicted, by the residual
    X̂t = Yt − μ̂t

    "Predicted" is really the better term: we reserve the term estimate for the guess of an unknown parameter and the term predictor for an estimate of an unobserved random variable.
  We call X̂t the residual corresponding to the tth observation. If the trend model is reasonably correct, then the residuals should behave roughly like the true stochastic component, and various assumptions about the stochastic component can be assessed by looking at the residuals. If the stochastic component is white noise, then the residuals should behave roughly like independent (normal) random variables with zero mean and standard deviation s.

Since a least squares fit of any trend containing a constant term automatically produces residuals with a zero mean, we might consider standardizing the residuals as X̂t / s. However, most statistics software will produce standardized residuals using a more complicated standard error in the denominator that takes into account the specific regression model being fit.

These assumptions can be assessed by examining residual plots. For the model considered here, the histogram of the residuals is roughly similar in shape to a normal distribution.

An excellent test of normality is known as the Shapiro-Wilk test. It essentially calculates the correlation between the residuals and the corresponding normal quantiles. The lower this correlation, the more evidence we have against normality. Applying that test to these residuals gives a test statistic of W = 0.9929 with a p-value of 0.6954. We cannot reject the null hypothesis that the stochastic component of this model is normally distributed.




# Time Series Forecasting

Forecasting situations vary widely in their time horizons, factors determining actual outcomes, types of data patterns, and many other aspects. Forecasting methods can be simple, such as using the most recent observation as a forecast (which is called the naïve method), or highly complex, such as neural nets and econometric systems of simultaneous equations. 

Forecasting is a common statistical task in business, where it helps to inform decisions about the scheduling of production, transportation and personnel, and provides a guide to long-term strategic planning.

Quantitative forecasting can be applied when two conditions are satisfied:

    1) Numerical information about the past is available;
    2) It is reasonable to assume that some aspects of the past patterns will continue into the future.

Time series forecasting applies when we have observations of the variable of interest recorded sequentially over time, whether over a long history or within a single continuous window.

Time series models used for forecasting extrapolate trend and seasonal patterns from the history of the series, but they ignore all other information such as marketing initiatives, competitor activity, changes in economic conditions, and so on.

Time series models include decomposition models, exponential smoothing, and ARIMA models.

Predictor variables are often useful in time series forecasting. For example, suppose we wish to forecast the hourly electricity demand (ED) of a hot region during the summer period. A model with predictor variables might be of the form

ED = f(current temperature, strength of economy, population, time of day, day of week, error)

The error term on the right allows for random variation and the effects of relevant variables that are not included in the model. Because it helps explain what causes the variation in electricity demand, this is called an explanatory model.

Models that combine such predictor variables with past values of the series are known as dynamic regression models, panel data models, longitudinal models, transfer function models, and linear system models (assuming that f is linear).

    However, a pure time series model is often preferred to an explanatory or mixed model for forecasting. First, the relationships between the variables can be complex and hard to model. Second, forecasting with predictors requires data on those predictors, so sparse or missing data on them can affect the results. Third, the time series model may give more accurate forecasts than an explanatory or mixed model.
    


# Chapter 4: Time Series Visualization

    Python Time Series Visualization
    In Python, the datetime object is a data structure for storing time stamps. It makes time series data easier to clean, manipulate, and visualize.

    Time Series Plot:
    We can use the plot() function (for example, in pandas) to plot a univariate series against time, once the time attribute has been converted to datetime values and set as the index. Many interesting observations can be read off this plot, including signs of trend and seasonality in the series.

    Time Series Patterns
    
    Trend
    A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes we will refer to a trend as “changing direction”, when it might go from an increasing trend to a decreasing trend. 

    Seasonal
    A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency.

    Cyclic
    A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions, and are often related to the “business cycle”. The duration of these fluctuations is usually at least 2 years.

    If the fluctuations are not of a fixed frequency then they are cyclic; if the frequency is unchanging and associated with some aspect of the calendar, then the pattern is seasonal. In general, the average length of cycles is longer than the length of a seasonal pattern, and the magnitudes of cycles tend to be more variable than the magnitudes of seasonal patterns.

    1) We can use plot() and seasonal_decompose (from statsmodels) in Python to understand how the variable changes with seasonality (daily, weekly, monthly, yearly, etc.). We can use box plots grouped by period (for example, one box per month) to see outliers within each recurring period.

    2) We can use a lag scatter plot (for example, lag_plot() in pandas) to see how the data depend on previous lagged values. If there is a good correlation, a model that uses past values, such as ARIMA, may be appropriate.
   
    3) We can also use the correlation coefficient to check the strength of the relationship between variables.
    
    4) In multivariate analysis, we can use scatter plots to check the relationship between variables.
    5) Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.

    6) Trend and seasonality in ACF plots

    When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size. So the ACF of trended time series tend to have positive values that slowly decrease as the lags increase.

    When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags.


    White Noise

    Time series that show no autocorrelation are called white noise.
    For white noise series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect

    95% of the spikes in the ACF to lie within ±2/√T, where T is the length of the time series. If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise.
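The ±2/√T rule can be tried on simulated data. The sketch below computes sample autocorrelations by hand and reports the share of spikes outside the bounds; the helper names `acf` and `share_outside_bounds`, the choice of 20 lags, and the two synthetic series are assumptions made for this illustration.

```python
import random
from statistics import mean

def acf(series, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    n, m = len(series), mean(series)
    c0 = sum((x - m) ** 2 for x in series)
    return [
        sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / c0
        for k in range(1, max_lag + 1)
    ]

def share_outside_bounds(series, max_lag=20):
    """Fraction of ACF spikes lying outside +/- 2/sqrt(T)."""
    bound = 2 / len(series) ** 0.5
    spikes = acf(series, max_lag)
    return sum(abs(r) > bound for r in spikes) / max_lag

rng = random.Random(3)
noise = [rng.gauss(0, 1) for _ in range(400)]             # simulated white noise
trended = [0.05 * t + x for t, x in enumerate(noise)]     # same noise plus a linear trend
```

For the simulated white noise, `share_outside_bounds(noise)` should be small (around 5% on average), whereas for the trended series essentially every spike lies outside the bounds, in line with the slowly decaying ACF described for trended data.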

    

# Chapter 5: Forecasting Tool-Box


    Simple Forecasting Techniques

    1) Average Method
    2) Naive Method
    3) Seasonal Naive Method
    4) Drift Method

    a) Average method, under the assumption that the data point are average of previous observations

    b) Naive method,For naïve forecasts, we simply set all forecasts to be the value of the last observation. This sort of modelling works well for economic and financial data. Naive method works very well for random walk.

    c) Seasonal Naive method, A similar method is useful for highly seasonal data. In this case, we set each forecast to be equal to the last observed value from the same season of the year (e.g., the same month of the previous year). 

    For example, with monthly data, the forecast for all future February values is equal to the last observed February value. With quarterly data, the forecast of all future Q2 values is equal to the last observed Q2 value (where Q2 means the second quarter). Similar rules apply for other months and quarters, and for other seasonal periods.

    d) Drift method: a variation on the naïve method is to allow the forecasts to increase or decrease over time, where the amount of change over time (called the drift) is set to be the average change seen in the historical data.
    This is equivalent to drawing a line between the first and last observations, and extrapolating it into the future.
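
    The four methods above can be sketched in a few lines of plain Python. The function names and the quarterly example data are hypothetical; h is the forecast horizon and m the seasonal period.

```python
def average_forecast(y, h):
    """All forecasts equal the mean of the historical data."""
    mean = sum(y) / len(y)
    return [mean] * h


def naive_forecast(y, h):
    """All forecasts equal the last observation."""
    return [y[-1]] * h


def seasonal_naive_forecast(y, h, m):
    """Each forecast repeats the last observed value from the same season."""
    n = len(y)
    return [y[n - m + (i % m)] for i in range(h)]


def drift_forecast(y, h):
    """Extrapolate the line through the first and last observations."""
    slope = (y[-1] - y[0]) / (len(y) - 1)  # average change per period
    return [y[-1] + (i + 1) * slope for i in range(h)]


# Hypothetical quarterly series covering two years
quarterly = [10, 20, 30, 40, 12, 22, 32, 42]
print(average_forecast(quarterly, 2))
print(naive_forecast(quarterly, 2))
print(seasonal_naive_forecast(quarterly, 6, m=4))
print(drift_forecast(quarterly, 2))
```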
    

    Modifications and Transformations
    Adjusting the historical data can often lead to a simpler forecasting task. 
    We deal with four sorts of adjustments:
    1) Calendar adjustments
    2) Population adjustments
    3) Inflation adjustments 
    4) Mathematical transformations.

    The purpose of these adjustments and transformations is to simplify the patterns in the historical data by removing known sources of variation or by making the pattern more consistent across the whole data set. Simpler patterns usually lead to more accurate forecasts.

    a) Calendar Adjustments: Some of the variation seen in seasonal data may be due to simple calendar effects. In such cases, it is usually much easier to remove the variation before fitting a forecasting model.
    Example: the monthly production of a pencil factory will vary slightly between months simply because the months have different numbers of days. So instead of total monthly production, we can use the average production per day in each month.

    b) Population Adjustments: Any data that are affected by population changes can be adjusted to give per-capita data. That is, consider the data per person (or per thousand people, or per million people) rather than the total. This also makes comparisons between countries meaningful.
    Example: if you consider the number of hospital beds in a community over time, the results are much easier to interpret if you remove the effects of population changes by considering the number of beds per thousand people.

    c) Inflation Adjustments: Data which are affected by the value of money are best adjusted before modelling.
    For example, the average cost of a new house will have increased over the last few decades due to inflation. A $200,000 house this year is not the same as a $200,000 house twenty years ago. For this reason, financial time series are usually adjusted so that all values are stated in dollar values from a particular year. For example, the house price data may be stated in year 2000 dollars.

    To make these adjustments, a price index is used. If z(t) denotes the price index and y(t) the original value in year t, then the adjusted value in base-year dollars is y(t)/z(t) × z(base). For consumer goods, a common price index is the Consumer Price Index (CPI).

    d) Mathematical Transformations: If the data show variation that increases or decreases with the level of the series, then a transformation can be useful. For example, a logarithmic transformation is often useful: it helps to stabilise the variance of the series, and changes in a log value correspond to relative (percentage) changes on the original scale. Forecasts made on the log scale also stay positive when transformed back to the original scale.

    Sometimes other transformations are also used (although they are not so interpretable). For example, square roots and cube roots can be used. These are called power transformations.

    A useful family of transformations, that includes both logarithms and power transformations, is the family of Box-Cox transformations, which depend on a parameter λ: w(t) = log(y(t)) if λ = 0, and w(t) = (y(t)^λ − 1)/λ otherwise. A Box-Cox transformation can also make the distribution of the data closer to normal.
    A good value of  λ is one which makes the size of the seasonal variation about the same across the whole series, as that makes the forecasting model simpler. 
    Having chosen a transformation, we need to forecast the transformed data. Then, we need to reverse the transformation (or back-transform) to obtain forecasts on the original scale.

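    When forecasts are back-transformed, the point forecast is the median of the forecast distribution on the original scale rather than the mean. Setting the bias adjustment option to true (biasadj = TRUE in the R forecast package) gives a bias-adjusted back-transformation, so that the point forecast is the mean instead of the median.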
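
    Each of the adjustments and transformations above amounts to a simple rescaling of the data. The sketch below illustrates them in plain Python; all function names and example numbers are hypothetical, and the Box-Cox functions use the classical form, which assumes strictly positive data.

```python
import calendar
import math


def per_day(monthly_totals, year, first_month=1):
    """Calendar adjustment: monthly total divided by the days in that month."""
    out = []
    for i, total in enumerate(monthly_totals):
        month = (first_month - 1 + i) % 12 + 1
        yr = year + (first_month - 1 + i) // 12
        out.append(total / calendar.monthrange(yr, month)[1])
    return out


def per_capita(values, population, per=1000):
    """Population adjustment: value per `per` people."""
    return [v / p * per for v, p in zip(values, population)]


def deflate(nominal, price_index, base_index):
    """Inflation adjustment: restate values at the base period's price level."""
    return [y / z * base_index for y, z in zip(nominal, price_index)]


def box_cox(y, lam):
    """Box-Cox transformation (log when lam == 0); requires y > 0."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]


def inv_box_cox(w, lam):
    """Back-transform from the Box-Cox scale to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in w]
    return [(lam * v + 1) ** (1 / lam) for v in w]


print(per_day([3100, 2800, 3100], year=2021))            # Jan, Feb, Mar totals
print(per_capita([50, 60], [25000, 40000]))              # beds per thousand people
print(deflate([200000, 200000], [100.0, 160.0], 100.0))  # prices in base-year dollars
print(inv_box_cox(box_cox([4.0, 9.0, 16.0], 0.5), 0.5))  # round trip
```

    Back-transforming with inv_box_cox returns forecasts to the original scale; the bias adjustment mentioned above is not included in this sketch.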
    


    Residual Analysis

    The “residuals” in a time series model are what is left over after fitting a model. For many (but not all) time series models, the residuals are equal to the difference between the observations and the corresponding fitted values.

    Residuals are useful in checking whether a model has adequately captured the information in the data. A good forecasting method will yield residuals with the following properties:

    1) The residuals are uncorrelated. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.

    2) The residuals have zero mean. If the residuals have a mean other than zero, then the forecasts are biased.


    Any forecasting method that does not satisfy these properties can be improved. However, that does not mean that forecasting methods that satisfy these properties cannot be improved. It is possible to have several different forecasting methods for the same data set, all of which satisfy these properties. Checking these properties is important in order to see whether a method is using all of the available information, but it is not a good way to select a forecasting method.

    Adjusting for bias is easy: if the residuals have mean  m, then simply add  m to all forecasts and the bias problem is solved. Fixing the correlation problem is harder.

    It is also useful (but not necessary) for the residuals to have the following properties:
    3) The residuals have constant variance.
    4) The residuals are normally distributed with zero mean.

    These properties make the calculation of prediction intervals easier.

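    The Box-Pierce test and the Ljung-Box test can be used to check whether a group of residual autocorrelations, taken together, is consistent with white noise. A large test statistic (small p-value) suggests that the autocorrelations do not come from a white noise series; otherwise the residuals are not distinguishable from white noise.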
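
    As a rough illustration, both test statistics can be computed from the first h sample autocorrelations r(k): the Box-Pierce statistic is Q = T Σ r(k)², and the Ljung-Box statistic is Q* = T(T+2) Σ r(k)²/(T−k). The sketch below computes only the statistics; obtaining a p-value needs a chi-squared distribution function, which is outside the Python standard library. The function names are assumptions.

```python
def autocorrelations(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def box_pierce(x, h):
    """Box-Pierce statistic: T * sum of squared autocorrelations up to lag h."""
    n = len(x)
    return n * sum(r ** 2 for r in autocorrelations(x, h))


def ljung_box(x, h):
    """Ljung-Box statistic: T(T+2) * sum of r_k^2 / (T - k) up to lag h."""
    n = len(x)
    r = autocorrelations(x, h)
    return n * (n + 2) * sum(r[k - 1] ** 2 / (n - k) for k in range(1, h + 1))


residuals = [0.3, -0.1, 0.4, -0.5, 0.2, -0.2, 0.1, -0.3, 0.5, -0.4]  # hypothetical
print(box_pierce(residuals, 3), ljung_box(residuals, 3))
```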

    

    Evaluating Forecast Accuracy
    The residuals are calculated on the same data that were used to fit the model. Consequently, the size of the residuals is not a reliable indication of how large true forecast errors are likely to be. The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.

    When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.

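    a) Scale-dependent errors: the forecast errors e(t) = y(t) − ŷ(t) are on the same scale as the data. Accuracy measures based only on e(t), such as MAE, MSE and RMSE, therefore cannot be used to compare series that involve different units.
    b) Percentage errors: the percentage error is p(t) = 100 e(t)/y(t). Percentage errors are unit-free, so they are often used to compare forecast performance between data sets. The most commonly used measure is the mean absolute percentage error, MAPE = mean(|p(t)|). Percentage errors are infinite or undefined if y(t) = 0, take extreme values when y(t) is close to zero, and assume that the unit of measurement has a meaningful zero.
    For example, a percentage error makes no sense when measuring the accuracy of temperature forecasts on either the Fahrenheit or Celsius scales, because temperature has an arbitrary zero point.

    c) Scaled errors: an alternative to percentage errors when comparing forecast accuracy across series with different units. The errors are scaled by the training MAE from a simple forecast method.
    For a non-seasonal time series, a useful way to define a scaled error uses naïve forecasts. Because the numerator and denominator both involve values on the scale of the original data, the scaled error is independent of the scale of the data. The mean absolute scaled error (MASE) is the mean of the absolute scaled errors.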
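
    The three families of accuracy measures can be sketched as follows. The function names and the small training/test split are hypothetical; the MASE shown is the non-seasonal version, scaled by the in-sample MAE of naïve forecasts.

```python
import math


def mae(actual, forecast):
    """Mean absolute error (scale-dependent)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error (scale-dependent)."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    return sum(abs(100 * (a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, train):
    """Mean absolute scaled error, scaled by the training MAE of naive forecasts."""
    scale = sum(abs(train[t] - train[t - 1]) for t in range(1, len(train))) / (len(train) - 1)
    return mae(actual, forecast) / scale


train = [10, 12, 14, 16]   # data used to fit the method
test = [18, 20]            # held-out observations
forecast = [17, 22]        # forecasts for the test period

print(mae(test, forecast), rmse(test, forecast))
print(mape(test, forecast), mase(test, forecast, train))
```

    A MASE below 1 indicates that the forecasts are, on average, better than the in-sample one-step naïve forecasts.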

    
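    Prediction Intervals

    A prediction interval gives an interval within which we expect the future value y(t) to lie with a specified probability. For example, assuming that the forecast errors are normally distributed, a 95% prediction interval for the h-step forecast is ŷ ± 1.96 σ(h), where σ(h) is an estimate of the standard deviation of the h-step forecast distribution.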
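
    A minimal sketch of a one-step prediction interval follows. It assumes normally distributed forecast errors and uses the sample standard deviation of the residuals as the estimate of the one-step forecast standard deviation; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist, stdev


def prediction_interval(point_forecast, residuals, level=0.95):
    """One-step interval: forecast +/- z * residual standard deviation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    sigma = stdev(residuals)                   # assumed estimate of sigma(1)
    return point_forecast - z * sigma, point_forecast + z * sigma


residuals = [-1.0, 1.0, -1.0, 1.0]  # hypothetical residuals from a fitted model
low, high = prediction_interval(50.0, residuals)
print(low, high)
```

    For longer horizons the forecast standard deviation generally grows, so the interval widens; that step is not covered by this sketch.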

# Judgement Forecasting

    We use judgmental forecasting when there is a lack of historical data, or when the historical data are incomplete. Examples include deciding whether to launch a new product, forecasting market growth for a recently launched product, or assessing the effect of changing the colour of cigarette packets.

    Accuracy of judgmental forecasting improves when the forecaster has (i) important domain knowledge, and (ii) more timely, up-to-date information. A judgmental approach can be quick to adjust to such changes, information or events.

    Although it can be modeled and quantified, it is important to recognise that judgmental forecasting is subjective and comes with limitations.
    There are three general settings in which judgmental forecasting is used: 
    (i) There are no available data, so that statistical methods are not applicable and judgmental forecasting is the only feasible approach; 
    (ii) Data are available, statistical forecasts are generated, and these are then adjusted using judgement; 
    (iii) Data are available and statistical and judgmental forecasts are generated independently and then combined.




    Limitations in Judgement forecasting

    Judgmental forecasts can be inconsistent. Unlike statistical forecasts, which can be generated by the same mathematical formulas every time, judgmental forecasts depend heavily on human cognition, and are vulnerable to its limitations. 

    For example, a limited memory may render recent events more important than they actually are and may ignore momentous events from the more distant past; or a limited attention span may result in important information being missed; or a misunderstanding of causal relationships may lead to erroneous inferences. Furthermore, human judgement can vary due to the effect of psychological factors.

    Using a systematic and well structured approach in judgmental forecasting helps to reduce the adverse effects of the limitations of judgmental forecasting.

    Key Principles in Judgement Forecasting

    a) Set the forecasting task clearly and concisely
    It is important that everyone be clear about what the task is. All definitions should be clear and comprehensive, avoiding ambiguous and vague expressions. Also, it is important to avoid incorporating emotive terms and irrelevant information that may distract the forecaster. 

    b) Implement a systematic approach
    Forecast accuracy and consistency can be improved by using a systematic approach to judgmental forecasting involving checklists of categories of information which are relevant to the forecasting task. 

    For example, it is helpful to identify what information is important and how this information is to be weighted. When forecasting the demand for a new product, what factors should we account for and how should we account for them? Should it be the price, the quality and/or quantity of the competition, the economic environment at the time, the target population of the product? It is worthwhile to devote significant effort and resources to put together decision rules that will lead to the best possible systematic approach.

    c) Document and justify
    Formalising and documenting the decision rules and assumptions implemented in the systematic approach can promote consistency, as the same rules can be implemented repeatedly. Also, requesting a forecaster to document and justify their forecasts leads to accountability, which can lead to reduced bias. 

    d) Systematically evaluate forecasts
    Systematically monitoring the forecasting process can identify unforeseen irregularities. In particular, keep records of forecasts and use them to obtain feedback when the corresponding observations become available. Although you may do your best as a forecaster, the environment you operate in is dynamic. Changes occur, and you need to monitor these in order to evaluate the decision rules and assumptions. Feedback and evaluation help forecasters learn and improve their forecast accuracy.


    e) Segregate forecasters and users
    Forecast accuracy may be impeded if the forecasting task is carried out by users of the forecasts, such as those responsible for implementing plans of action about which the forecast is concerned.
    A classic case is that of a new product being launched. The forecast should be a reasonable estimate of the sales volume of a new product, which may differ considerably from what management expects or hopes the sales will be in order to meet company financial objectives. In this case, a forecaster may be delivering a reality check to the user.

    It is important that forecasters communicate forecasts to potential users thoroughly. 
    Explaining and clarifying the process and justifying the basic assumptions that led to the forecasts will provide some assurance to users.
    

    The Delphi Method.

    The method relies on the key assumption that forecasts from a group are generally more accurate than those from individuals. 

    The aim of the Delphi method is to construct consensus forecasts from a group of experts in a structured iterative manner. A facilitator is appointed in order to implement and manage the process. 
    
    The Delphi method generally involves the following stages:

    a) A panel of experts is assembled.
    b) Forecasting tasks/challenges are set and distributed to the experts.
    c) Experts return initial forecasts and justifications. These are compiled and summarised in order to provide feedback.
    d) Feedback is provided to the experts, who now review their forecasts in light of the feedback. This step may be iterated until a satisfactory level of consensus is reached.
    e) Final forecasts are constructed by aggregating the experts’ forecasts.


    Forecasting by analogy. 
    A useful judgmental approach which is often implemented in practice is forecasting by analogy. 
    A common example is the pricing of a house through an appraisal process. An appraiser estimates the market value of a house by comparing it to similar properties that have sold in the area. The degree of similarity depends on the attributes considered. With house appraisals, attributes such as land size, dwelling size, numbers of bedrooms and bathrooms, and garage space are usually considered.
    Similarly, when designing a high school curriculum, the time needed to complete it can be forecast by looking at how long comparable curriculum projects took.

    Structured Analogy: 
    We should aspire to base forecasts on multiple analogies rather than a single analogy, which may create biases. However, these may be challenging to identify. Similarly, we should aspire to consider multiple attributes. Identifying or even comparing these may not always be straightforward. As always, we suggest performing these comparisons and the forecasting process using a systematic approach. Developing a detailed scoring mechanism to rank attributes and record the process of ranking will always be useful.

    Scenario Forecasting:
    A fundamentally different approach to judgmental forecasting is scenario-based forecasting. The aim of this approach is to generate forecasts based on plausible scenarios. In contrast to the two previous approaches (Delphi and forecasting by analogy) where the resulting forecast is intended to be a likely outcome, each scenario-based forecast may have a low probability of occurrence. The scenarios are generated by considering all possible factors or drivers, their relative impacts, the interactions between them, and the targets to be forecast.

    Building forecasts based on scenarios allows a wide range of possible forecasts to be generated and some extremes to be identified. For example it is usual for “best”, “middle” and “worst” case scenarios to be presented, although many other scenarios will be generated. Thinking about and documenting these contrasting extremes can lead to early contingency planning.

    New Product Forecasting:
    The definition of a new product can vary. It may be an entirely new product which has been launched, a variation of an existing product (“new and improved”), a change in the pricing scheme of an existing product, or even an existing product entering a new market.
    Judgmental forecasting is usually the only available method for new product forecasting, as historical data are unavailable. The approaches we have already outlined (Delphi, forecasting by analogy and scenario forecasting) are all applicable when forecasting the demand for a new product.
    Three methods used in new product forecasting are:

    1) Sales force composite: in this approach, forecasts for each outlet/branch/store of a company are generated by salespeople, and are then aggregated.
    2) Executive opinion: in contrast to the sales force composite, this approach involves staff at the top of the managerial structure generating aggregate forecasts. Such forecasts are usually generated in a group meeting, where executives contribute information from their own area of the company.
    3) Customer intentions: customer intentions can be used to forecast the demand for a new product or for a variation on an existing product. Questionnaires are filled in by customers on their intentions to buy the product. A structured questionnaire is used, asking customers to rate the likelihood of them purchasing the product on a scale.
    


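    Judgement Adjustments
    When historical data are available, statistical forecasts are often generated first and then adjusted using judgement.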
    These adjustments can potentially provide all of the advantages of judgmental forecasting which have been discussed earlier in this chapter. For example, they provide an avenue for incorporating factors that may not be accounted for in the statistical model, such as promotions, large sporting events, holidays, or recent events that are not yet reflected in the data. However, these advantages come to fruition only when the right conditions are present. Judgmental adjustments, like judgmental forecasts, come with biases and limitations, and we must implement methodical strategies in order to minimise them.

    Judgmental adjustments should not aim to correct for a systematic pattern in the data that is thought to have been missed by the statistical model. This has been proven to be ineffective, as forecasters tend to read non-existent patterns in noisy series. Statistical models are much better at taking account of data patterns, and judgmental adjustments only hinder accuracy.

    Judgmental adjustments are most effective when there is significant additional information at hand or strong evidence of the need for an adjustment.
    

# Chapter 6: Time Series Regression Models



    The basic concept is that we forecast the time series of interest  y assuming that it has a linear relationship with other time series  x.

    For example, we can forecast monthly sales using total advertising spending.

    The forecast variable  y is sometimes also called the regressand, dependent or explained variable. The predictor variables  x are sometimes also called the regressors, independent or explanatory variables. 

    a) Simple linear Regression
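    In the simplest case, the regression model allows for a linear relationship between the forecast variable y and a single predictor variable x: y(t) = β(0) + β(1)x(t) + ε(t).

    b) Multiple linear regression

    When there are two or more predictor variables, the model is called a multiple regression model: y(t) = β(0) + β(1)x(1,t) + … + β(k)x(k,t) + ε(t). Each coefficient measures the effect of its predictor after taking into account the effects of all the other predictors in the model. Thus, the coefficients measure the marginal effects of the predictor variables.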

    Assumptions in Regression:
    1) First, we assume that the model is a reasonable approximation to reality; that is, the relationship between the forecast variable and the predictor variables satisfies this linear equation.
    2) Assumptions about errors:
      a) they have mean zero; otherwise the forecasts will be systematically biased.
      b) they are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.
      c) they are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.

    It is also useful to have the errors being normally distributed with a constant variance σ² in order to easily produce prediction intervals.

    The coefficients are estimated by least squares: we choose the values of β that minimise the sum of squared errors.
    The standard error gives a measure of the uncertainty in the estimated β coefficient.

    Predictions of y(t) within the training sample are referred to as fitted values. Note that these are predictions of the data used to estimate the model, not genuine forecasts of future values of y.
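
    For the simple linear regression case, the least squares estimates have a closed form: the slope is the ratio of the sample covariance of x and y to the sample variance of x, and the intercept makes the line pass through the means. The sketch below uses hypothetical advertising and sales figures; the function names are assumptions.

```python
def fit_simple_regression(x, y):
    """Least squares estimates (intercept, slope) for y = b0 + b1 * x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def fitted_values(x, intercept, slope):
    """Predictions for the observations used to estimate the model."""
    return [intercept + slope * a for a in x]


advertising = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor
sales = [3.1, 4.9, 7.2, 8.8, 11.0]        # hypothetical forecast variable

b0, b1 = fit_simple_regression(advertising, sales)
residuals = [obs - fit for obs, fit in zip(sales, fitted_values(advertising, b0, b1))]
print(b0, b1)
print(residuals)
```

    With an intercept in the model, the least squares residuals sum to zero (up to rounding), which matches the zero-mean assumption on the errors.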

    Model Metrics

    1) Goodness of Fit (R²).
    A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, or R². This can be calculated as the square of the correlation between the observed y values and the predicted ŷ values.

    In simple linear regression, the value of R² is also equal to the square of the correlation between y and x (provided an intercept has been included).

    If the predictions are close to the actual values, we would expect R² to be close to 1. On the other hand, if the predictions are unrelated to the actual values, then R² will be close to 0.

    2) Standard Error of the Regression.

    Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the “residual standard error”.

    The standard error is related to the size of the average error that the model produces. We can compare this error to the sample mean of y or with the standard deviation of y to gain some perspective on the accuracy of the model. The standard error will be used when generating prediction intervals.
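
    Both metrics can be computed directly from the observed and fitted values. In this sketch R² is calculated as the squared correlation between y and ŷ, as described above, and the residual standard error divides the sum of squared residuals by T − k − 1, where k is the number of predictors. The function names and data are hypothetical.

```python
import math


def r_squared(y, fitted):
    """Squared correlation between observed and fitted values."""
    n = len(y)
    y_bar = sum(y) / n
    f_bar = sum(fitted) / n
    cov = sum((a - y_bar) * (b - f_bar) for a, b in zip(y, fitted))
    var_y = sum((a - y_bar) ** 2 for a in y)
    var_f = sum((b - f_bar) ** 2 for b in fitted)
    return cov ** 2 / (var_y * var_f)


def residual_standard_error(y, fitted, k):
    """Square root of SSE / (T - k - 1) for a model with k predictors."""
    sse = sum((a - b) ** 2 for a, b in zip(y, fitted))
    return math.sqrt(sse / (len(y) - k - 1))


observed = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations
fitted = [1.5, 1.5, 3.5, 3.5]     # hypothetical fitted values

print(r_squared(observed, fitted))
print(residual_standard_error(observed, fitted, k=1))
```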


Evaluating the regression model:

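    We use the residuals, which are the training-set errors: the differences between the observed y values and the corresponding fitted ŷ values. Each residual is the unpredictable component of the associated observation.

    It is necessary to plot the residuals to check that the model assumptions hold.

    ACF plot of the residuals: used to check that the residuals are not autocorrelated.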

  

Useful Predictors:

    A) Trend:

    It is common for time series data to be trending. A linear trend can be modelled by simply using  x(1,t) = t as a predictor, y(t) = β(0)+ β(1)t + ε(t), where  t = 1,…,T.

    A trend variable can be specified in the tslm() function using the trend predictor. In Section 5.8 we discuss how we can also model nonlinear trends.

    B) Dummy Variables:

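    A dummy variable takes the value 1 for “yes” and 0 for “no”, which allows a categorical predictor to be represented numerically (one-hot encoding). An example is the case where a special event has occurred. For example, when forecasting tourist arrivals to Brazil, we will need to account for the effect of the Rio de Janeiro summer Olympics in 2016.

    We can also have seasonal dummy variables. With a seasonal period of m and an intercept in the model, only m − 1 dummy variables are needed; the remaining season is captured by the intercept.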
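
    As an illustration, seasonal dummy variables for a series with seasonal period m can be generated as below. The sketch builds m − 1 columns and treats the last season as the baseline captured by the intercept; the function name and the choice of baseline are assumptions.

```python
def seasonal_dummies(n_obs, m):
    """Return n_obs rows of m - 1 dummies; the last season is the baseline."""
    rows = []
    for t in range(n_obs):
        season = t % m  # 0-based season of observation t
        rows.append([1 if season == j else 0 for j in range(m - 1)])
    return rows


# Quarterly data: three dummy columns (Q1, Q2, Q3); Q4 rows are all zeros
for row in seasonal_dummies(6, m=4):
    print(row)
```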

    C) Intervention Variables:

    It is often necessary to model interventions that may have affected the variable to be forecast. For example, competitor activity, advertising expenditure, industrial action, and so on, can all have an effect.

    When the effect lasts only for one period, we use a “spike” variable. This is a dummy variable that takes value one in the period of the intervention and zero elsewhere. A spike variable is equivalent to a dummy variable for handling an outlier.

    Other interventions have an immediate and permanent effect. If an intervention causes a level shift (i.e., the value of the series changes suddenly and permanently from the time of intervention), then we use a “step” variable. A step variable takes value zero before the intervention and one from the time of intervention onward.

    Another form of permanent effect is a change of slope. Here the intervention is handled using a piecewise linear trend; a trend that bends at the time of intervention and hence is nonlinear. 
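
    The three kinds of intervention variable described above can be written as simple functions of time. In this sketch, time is indexed from 0 and the intervention happens at index `when`; the function names are hypothetical.

```python
def spike(n_obs, when):
    """One in the period of the intervention, zero elsewhere."""
    return [1 if t == when else 0 for t in range(n_obs)]


def step(n_obs, when):
    """Zero before the intervention, one from the intervention onward (level shift)."""
    return [1 if t >= when else 0 for t in range(n_obs)]


def slope_change(n_obs, when):
    """Zero up to the intervention, then increasing by one each period (bend in trend)."""
    return [max(0, t - when) for t in range(n_obs)]


print(spike(6, 2))
print(step(6, 2))
print(slope_change(6, 2))
```

    Including slope_change alongside a linear trend predictor gives a piecewise linear trend whose slope changes at the time of the intervention.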
 

    D) Trading days

    The number of trading days in a month can vary considerably and can have a substantial effect on sales data. To allow for this, the number of trading days in each month can be included as a predictor.

    E) Distributed lags

    It is often useful to include advertising expenditure as a predictor. However, since the effect of advertising can last beyond the actual campaign, we need to include lagged values of advertising expenditure. 

    F) Easter

    Easter differs from most holidays because it is not held on the same date each year, and its effect can last for several days. In this case, a dummy variable can be used with value one where the holiday falls in the particular time period and zero otherwise.

    With monthly data, if Easter falls in March then the dummy variable takes value 1 in March, and if it falls in April the dummy variable takes value 1 in April. When Easter starts in March and finishes in April, the dummy variable is split proportionally between months.

    G) Fourier Series

    An alternative to using seasonal dummy variables, especially for long seasonal periods, is to use Fourier terms. Jean-Baptiste Fourier was a French mathematician, born in the 1700s, who showed that a series of sine and cosine terms of the right frequencies can approximate any periodic function. We can use them for seasonal patterns.

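    With seasonal period m, the first few Fourier terms are x(1,t) = sin(2πt/m), x(2,t) = cos(2πt/m), x(3,t) = sin(4πt/m), x(4,t) = cos(4πt/m), and so on. If we have monthly seasonality, and we use the first 11 of these predictor variables, then we will get exactly the same forecasts as using 11 dummy variables.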

    With Fourier terms, we often need fewer predictors than with dummy variables, especially when  m is large. This makes them useful for weekly data, for example, where  m ≈ 52 . For short seasonal periods (e.g., quarterly data), there is little advantage in using Fourier terms over seasonal dummy variables.
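
    A minimal sketch of generating Fourier terms is given below: for each t it produces K pairs sin(2πkt/m), cos(2πkt/m), k = 1,…,K. The function name and the choice K = 2 in the example are assumptions.

```python
import math


def fourier_terms(n_obs, m, K):
    """Rows of [sin(2*pi*k*t/m), cos(2*pi*k*t/m)] for k = 1..K and t = 1..n_obs."""
    rows = []
    for t in range(1, n_obs + 1):
        row = []
        for k in range(1, K + 1):
            angle = 2 * math.pi * k * t / m
            row.extend([math.sin(angle), math.cos(angle)])
        rows.append(row)
    return rows


# Weekly data with m = 52: four predictors instead of 51 dummy variables
weekly = fourier_terms(104, m=52, K=2)
print(len(weekly), len(weekly[0]))
```

    Each row repeats after m observations, so the terms describe a seasonal pattern of period m using only 2K predictors.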



Selecting Predictors:

    When there are many possible predictors, we need some strategy for selecting the best predictors to use in a regression model.
    
    A common approach that is not recommended is to plot the forecast variable against a particular predictor and if there is no noticeable relationship, drop that predictor from the model. This is invalid because it is not always possible to see the relationship from a scatterplot, especially when the effects of other predictors have not been accounted for.

    Another common approach which is also invalid is to do a multiple linear regression on all the predictors and disregard all variables whose  p-values are greater than 0.05. To start with, statistical significance does not always indicate predictive value. Even if forecasting is not the goal, this is not a good strategy because the  p-values can be misleading when two or more predictors are correlated with each other.

    Instead, we use a measure of predictive accuracy. Five such measures are introduced in this section: cross-validation (CV), AIC, AICc, BIC and adjusted R². They can be calculated using the CV() function.

    We compare these values against the corresponding values from other models. For the CV, AIC, AICc and BIC measures, we want to find the model with the lowest value; for adjusted R², we seek the model with the highest value.

    Many statisticians like to use the BIC because it has the feature that if there is a true underlying model, the BIC will select that model given enough data. However, in reality, there is rarely, if ever, a true underlying model, and even if there was a true underlying model, selecting that model will not necessarily give the best forecasts (because the parameter estimates may not be accurate).

    Consequently, we recommend that one of the AICc, AIC, or CV statistics be used, each of which has forecasting as their objective. If the value of  T is large enough, they will all lead to the same model. In most of the examples, we use the AICc value to select the forecasting model.
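
    Four of these measures can be computed from the sum of squared errors (SSE) of a fitted regression. The sketch below uses one common least squares form, with T observations and k predictors; other texts define the information criteria up to an additive constant, so only differences between models on the same data are meaningful. The function name and the numbers are hypothetical, and the CV statistic is not included.

```python
import math


def selection_measures(sse, sst, n_obs, k):
    """AIC, AICc, BIC and adjusted R^2 for a regression with k predictors."""
    aic = n_obs * math.log(sse / n_obs) + 2 * (k + 2)
    aicc = aic + 2 * (k + 2) * (k + 3) / (n_obs - k - 3)
    bic = n_obs * math.log(sse / n_obs) + (k + 2) * math.log(n_obs)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - k - 1)
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "AdjR2": adj_r2}


# Two hypothetical models fitted to the same 20 observations (SST = 100)
small = selection_measures(sse=12.0, sst=100.0, n_obs=20, k=2)
large = selection_measures(sse=11.5, sst=100.0, n_obs=20, k=5)
print(small)
print(large)
```

    In this made-up comparison the larger model reduces the SSE only slightly, so the penalty for extra predictors gives it a higher AICc than the smaller model.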

Forecasting with Regression

    Types of Forecasting:

    Ex-ante vs ex-post forecasting:

    a) Ex-ante forecasts are those that are made using only the information that is available in advance. For example, ex-ante forecasts for the percentage change in US consumption for quarters following the end of the sample, should only use information that was available up to and including 2016 Q3. These are genuine forecasts, made in advance using whatever information is available at the time. Therefore in order to generate ex-ante forecasts, the model requires forecasts of the predictors. 

    b) Ex-post forecasts are those that are made using later information on the predictors. For example, ex-post forecasts of consumption may use the actual observations of the predictors, once these have been observed. These are not genuine forecasts, but are useful for studying the behaviour of forecasting models. 
    The model from which ex-post forecasts are produced should not be estimated using data from the forecast period. That is, ex-post forecasts can assume knowledge of the predictor variables (the  x variables), but should not assume knowledge of the data that are to be forecast (the y variable).

    A comparative evaluation of ex-ante forecasts and ex-post forecasts can help to separate out the sources of forecast uncertainty. This will show whether forecast errors have arisen due to poor forecasts of the predictor or due to a poor forecasting model.

    c) Scenario-based forecasting is used in strategy and decision making: the forecaster assumes possible scenarios for the predictor variables that are of interest.
    For example, a US policy maker may be interested in comparing the predicted change in consumption when there is a constant growth of 1% and 0.5% respectively for income and savings with no change in the employment rate, versus a respective decline of 1% and 0.5%, for each of the four quarters following the end of the sample.

Building a Predictive Regression model.

    The great advantage of regression models is that they can be used to capture important relationships between the forecast variable of interest and the predictor variables. A major challenge however, is that in order to generate ex-ante forecasts, the model requires future values of each predictor. If scenario based forecasting is of interest then these models are extremely useful. However, if ex-ante forecasting is the main focus, obtaining forecasts of the predictors can be challenging (in many cases generating forecasts for the predictor variables can be more challenging than forecasting directly the forecast variable without using predictors).


Matrix Formulation:

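    Linear regression is usually implemented in matrix form, which makes the computations compact and efficient. With y the vector of T observations, X the T × (k+1) matrix of predictors (including a column of ones for the intercept), β the vector of coefficients and ε the vector of errors, the model is y = Xβ + ε, and the least squares estimate is β̂ = (X'X)⁻¹X'y.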
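
    The least squares estimate in matrix form solves the normal equations (X'X)β = X'y. The sketch below does this in plain Python with Gaussian elimination so that it has no dependencies; in practice a linear algebra library would normally be used. The function names and the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares(X, y):
    """Coefficients solving the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * v for row, v in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# First column of ones is the intercept; y was built as 1 + 2*x1 + 3*x2
X = [[1, 1, 0], [1, 2, 1], [1, 3, 0], [1, 4, 1]]
y = [3, 8, 7, 12]
print(least_squares(X, y))
```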






Non Linear Regression:

    The simplest way of modelling a nonlinear relationship is to transform the forecast variable y and/or the predictor variable x before estimating a regression model. While this provides a non-linear functional form, the model is still linear in the parameters. The most commonly used transformation is the (natural) logarithm.

    The log-log form transforms both variables, the log-linear form is specified by only transforming the forecast variable, and the linear-log form is obtained by transforming the predictor.

    More generally, we can write y = f(x) + ε, where f is a nonlinear function. One of the simplest specifications is to make f piecewise linear. That is, we introduce points where the slope of f can change. These points are called knots. A knot at x = c can be handled by adding the predictor (x − c)+ = max(0, x − c) to the regression.

    

Correlation, Causation & forecasting.

    a) Correlation is not causation:

    It is important not to confuse correlation with causation, or causation with forecasting. A variable  x may be useful for forecasting a variable  y, but that does not mean  x is causing  y. It is possible that  x is causing  y, but it may be that  y is causing  x, or that the relationship between them is more complicated than simple causality.

    b) Forecasting with multi correlated variables:

    When two or more predictors are highly correlated it is always challenging to accurately separate their individual effects. Suppose we are forecasting monthly sales of a company for 2012, using data from 2000–2011. In January 2008, a new competitor came into the market and started taking some market share. At the same time, the economy began to decline. In your forecasting model, you include both competitor activity (measured using advertising time on a local television station) and the health of the economy (measured using GDP). It will not be possible to separate the effects of these two predictors because they are highly correlated.

    

# Chapter 7: Time Series Decomposition:

    Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category.

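    There are three types of time series patterns: trend, seasonality and cycles. When we decompose a time series, we usually combine the trend and cycle into a single trend-cycle component.

    Thus we think of a time series as comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series).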

    

    The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series. When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.

    An alternative to using a multiplicative decomposition is to first transform the data until the variation in the series appears to be stable over time, then use an additive decomposition. When a log transformation has been used, this is equivalent to using a multiplicative decomposition, because $y_t = S_t \times T_t \times R_t$ is equivalent to $\log y_t = \log S_t + \log T_t + \log R_t$.

Moving Averages

    The classical method of time series decomposition originated in the 1920s and was widely used until the 1950s. It still forms the basis of many time series decomposition methods, so it is important to understand how it works. The first step in a classical decomposition is to use a moving average method to estimate the trend-cycle.
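    A moving average of order  m estimates the trend-cycle at time  t by averaging the observations within  k periods of  t: $\hat{T}_t = \frac{1}{m}\sum_{j=-k}^{k} y_{t+j}$, where $m = 2k + 1$. Observations that are nearby in time are also likely to be close in value. Therefore, the average eliminates some of the randomness in the data, leaving a smooth trend-cycle component. We call this an m-MA, meaning a moving average of order  m.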
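
    The m-MA described above can be sketched in a few lines. This is a minimal illustration for odd  m only; the function name and the data are made up for the example.

```python
def moving_average(y, m):
    """Centred m-MA for odd m: the average of y[t-k..t+k], with m = 2k + 1.

    Returns a list the same length as y. The first and last k entries are None
    because the averaging window does not fit there.
    """
    if m < 1 or m % 2 == 0:
        raise ValueError("this sketch only handles odd orders m")
    k = m // 2
    trend = [None] * len(y)
    for t in range(k, len(y) - k):
        window = y[t - k : t + k + 1]
        trend[t] = sum(window) / m
    return trend


series = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0]  # made-up data
trend_3 = moving_average(series, 3)
```

    A larger  m gives a smoother estimate, at the cost of losing more points at each end of the series.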

    Types of Decomposition:
    a) Classical Decomposition
    b) X11 Decomposition
    c) SEATS Decomposition
    d) STL Decomposition
    





    

# Chapter 8: Exponential Smoothing

 
    Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation the higher the associated weight. This framework generates reliable forecasts quickly and for a wide range of time series, which is a great advantage and of major importance to applications in industry.

    The selection of the method is generally based on recognising key components of the time series (trend and seasonal) and the way in which these enter the smoothing method (e.g., in an additive, damped or multiplicative manner).

    

Simple Exponential Smoothing:


    The simplest of the exponential smoothing methods is naturally called simple exponential smoothing (SES).

    a) Using the naïve method, all forecasts for the future are equal to the last observed value of the series:

    $$\hat{y}_{T+h|T} = y_T, \quad h = 1, 2, \dots$$

    Hence, the naïve method assumes that the most recent observation is the only important one, and all previous observations provide no information for the future. This can be thought of as a weighted average where all of the weight is given to the last observation.

    b) Using the average method, all future forecasts are equal to a simple average of the observed data:

    $$\hat{y}_{T+h|T} = \frac{1}{T}\sum_{t=1}^{T} y_t, \quad h = 1, 2, \dots$$

    Hence, the average method assumes that all observations are of equal importance and gives them equal weights.

    c) We often want something between these two extremes. For example, it may be sensible to attach larger weights to more recent observations than to observations from the distant past. This is exactly the concept behind simple exponential smoothing. Forecasts are calculated using weighted averages, where the weights decrease exponentially as observations come from further in the past, so the smallest weights are associated with the oldest observations:

    $$\hat{y}_{T+1|T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2} + \cdots$$

    where  0≤α≤1 is the smoothing parameter. The one-step-ahead forecast for time  T+1 is a weighted average of all of the observations in the series  y1,…,yT.
    
    The rate at which the weights decrease is controlled by the parameter  α.

    For any  α between 0 and 1, the weights attached to the observations decrease exponentially as we go back in time, hence the name “exponential smoothing”. If  α is small (i.e., close to 0), more weight is given to observations from the more distant past. If  α is large (i.e., close to 1), more weight is given to the more recent observations. For the extreme case where  α = 1, $\hat{y}_{T+1|T} = y_T$, and the forecasts are equal to the naïve forecasts.
    
Weighted average form:
    
    The forecast at time  T+1 is equal to a weighted average between the most recent observation $y_T$ and the previous forecast $\hat{y}_{T|T-1}$: $\hat{y}_{T+1|T} = \alpha y_T + (1-\alpha)\hat{y}_{T|T-1}$.

Component form:    

    An alternative representation is the component form. For simple exponential smoothing, the only component included is the level, ℓ(t). (Other methods which are considered later in this chapter may also include a trend  b(t) and a seasonal component  s(t).)
    
    Component form representations of exponential smoothing methods comprise a forecast equation and a smoothing equation for each of the components included in the method. For simple exponential smoothing these are:

    Forecast equation: $\hat{y}_{t+h|t} = \ell_t$
    Smoothing equation: $\ell_t = \alpha y_t + (1-\alpha)\ell_{t-1}$

Flat forecasts:

    Simple exponential smoothing has a “flat” forecast function, because it includes no trend or seasonal component: $\hat{y}_{T+h|T} = \hat{y}_{T+1|T} = \ell_T$ for $h = 2, 3, \dots$

    That is, all forecasts take the same value, equal to the last level component. Remember that these forecasts will only be suitable if the time series has no trend or seasonal component.
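
    The component form and the flat forecast function can be sketched as follows. This is a minimal illustration in which the function names, the data, and the choices of  α and the initial level are assumptions made for the example.

```python
def ses(y, alpha, level0):
    """Simple exponential smoothing in component form.

    Smoothing equation: level_t = alpha * y_t + (1 - alpha) * level_{t-1}.
    Returns (fitted, level): fitted[t] is the one-step forecast of y[t] made
    before seeing it, and level is the final level l_T.
    """
    level = level0
    fitted = []
    for obs in y:
        fitted.append(level)                        # forecast of obs is the previous level
        level = alpha * obs + (1 - alpha) * level   # smoothing equation
    return fitted, level


def ses_forecast(y, alpha, level0, h):
    """Flat forecast function: every horizon 1..h gets the last level."""
    _, level = ses(y, alpha, level0)
    return [level] * h


data = [10.0, 12.0, 11.0, 13.0]  # made-up data
forecasts = ses_forecast(data, alpha=0.5, level0=10.0, h=3)
```

    With  α = 1 the level simply tracks the last observation, so the same function reproduces the naïve forecasts.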


Optimization:

    The application of every exponential smoothing method requires the smoothing parameters and the initial values to be chosen. In particular, for simple exponential smoothing, we need to select the values of  α and  ℓ0.
    
    All forecasts can be computed from the data once we know those values. For the methods that follow there is usually more than one smoothing parameter and more than one initial component to be chosen.

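    The parameters could be chosen subjectively, based on previous experience. However, a more reliable and objective way to obtain values for the unknown parameters is to estimate them from the observed data.

    In regression, the coefficients are estimated by minimising the sum of squared errors (SSE). Similarly, the unknown parameters and the initial values for any exponential smoothing method can be estimated by minimising the SSE, where the errors are the one-step-ahead residuals $e_t = y_t - \hat{y}_{t|t-1}$ for $t = 1, \dots, T$, so that $\text{SSE} = \sum_{t=1}^{T} (y_t - \hat{y}_{t|t-1})^2$.

    Unlike the regression case, where formulas give the coefficients that minimise the SSE, this is a non-linear minimisation problem, and it has to be solved with a numerical optimisation tool.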
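
    As a rough sketch of this estimation step, the code below computes the SSE of the one-step-ahead errors for simple exponential smoothing and picks  α by a crude grid search. The grid search and the choice to fix the initial level at the first observation are simplifying assumptions for illustration; in practice a numerical optimiser would usually estimate  α and  ℓ0 jointly.

```python
def sse_ses(y, alpha, level0):
    """Sum of squared one-step-ahead errors e_t = y_t - yhat_{t|t-1} for SES."""
    level = level0
    total = 0.0
    for obs in y:
        error = obs - level            # the one-step forecast of obs is the previous level
        total += error ** 2
        level = level + alpha * error  # equivalent to alpha*obs + (1 - alpha)*level
    return total


def fit_alpha(y, steps=100):
    """Grid search over alpha in [0, 1]; level0 is fixed to y[0] (an assumption)."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda a: sse_ses(y, a, y[0]))


data = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]  # made-up data
best_alpha = fit_alpha(data)
```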

    

Trend Methods

a) Holt's linear trend method:

    Holt (1957) extended simple exponential smoothing to allow the forecasting of data with a trend. This method involves a forecast equation and two smoothing equations (one for the level and one for the trend):

    Forecast equation: $\hat{y}_{t+h|t} = \ell_t + h b_t$
    Level equation: $\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})$
    Trend equation: $b_t = \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*) b_{t-1}$

    where  ℓ(t) denotes an estimate of the level of the series at time  t,  b(t) denotes an estimate of the trend (slope) of the series at time  t,  α is the smoothing parameter for the level,  0≤α≤1, and  β∗ is the smoothing parameter for the trend,  0≤β∗≤1.

    The forecast function is no longer flat but trending. The  h-step-ahead forecast is equal to the last estimated level plus  h times the last estimated trend value. Hence the forecasts are a linear function of  h.
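
    A minimal sketch of Holt's method follows, using the standard recursions: the level is updated as α·y(t) + (1−α)·(ℓ(t−1) + b(t−1)), the trend as β∗·(ℓ(t) − ℓ(t−1)) + (1−β∗)·b(t−1), and the h-step forecast is ℓ(T) + h·b(T). The function name, data, parameter values and initial states are assumptions chosen for the example.

```python
def holt(y, alpha, beta, level0, trend0, h):
    """Holt's linear trend method; returns forecasts for horizons 1..h."""
    level, trend = level0, trend0
    for obs in y:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + trend)   # level equation
        trend = beta * (level - prev_level) + (1 - beta) * trend   # trend equation
    return [level + k * trend for k in range(1, h + 1)]            # l_T + k * b_T


# On an exactly linear series with matching initial states, Holt extrapolates the line.
line = [2.0, 4.0, 6.0, 8.0]
forecasts = holt(line, alpha=0.5, beta=0.5, level0=0.0, trend0=2.0, h=3)
```

    The forecasts increase by the same amount at every step, which is the linear-in-h behaviour described above.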

b) Damped trend methods:

    The forecasts generated by Holt’s linear method display a constant trend (increasing or decreasing) indefinitely into the future. Empirical evidence indicates that these methods tend to over-forecast, especially for longer forecast horizons. Motivated by this observation, Gardner & McKenzie (1985) introduced a parameter that “dampens” the trend to a flat line some time in the future. Methods that include a damped trend have proven to be very successful, and are arguably the most popular individual methods when forecasts are required automatically for many series.

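    In addition to the smoothing parameters  α and  β∗ (with values between 0 and 1, as in Holt's method), this method includes a damping parameter  0<ϕ<1:

    Forecast equation: $\hat{y}_{t+h|t} = \ell_t + (\phi + \phi^2 + \dots + \phi^h) b_t$
    Level equation: $\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + \phi b_{t-1})$
    Trend equation: $b_t = \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*)\phi b_{t-1}$

    If  ϕ = 1, the method is identical to Holt's linear method. For values between 0 and 1,  ϕ dampens the trend so that the forecasts approach the constant $\ell_T + \phi b_T/(1-\phi)$ as $h \to \infty$.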
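
    To see the damping at work, the sketch below evaluates the damped forecast function ℓ(T) + (ϕ + ϕ² + … + ϕ^h)·b(T) for a range of horizons. The final level, final trend and  ϕ used here are made-up values for illustration, not estimates from any data.

```python
def damped_forecasts(level, trend, phi, h):
    """Damped-trend forecasts l_T + (phi + phi^2 + ... + phi^k) * b_T for k = 1..h."""
    forecasts = []
    damp_sum = 0.0
    for k in range(1, h + 1):
        damp_sum += phi ** k
        forecasts.append(level + damp_sum * trend)
    return forecasts


# Assumed final states l_T = 100 and b_T = 2, with damping phi = 0.9.
path = damped_forecasts(level=100.0, trend=2.0, phi=0.9, h=50)
limit = 100.0 + 0.9 / (1 - 0.9) * 2.0   # long-run value l_T + phi / (1 - phi) * b_T
```

    The forecasts keep rising but by less at each step, levelling off just below the long-run value, whereas undamped Holt forecasts would keep growing by 2 per period.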

    In the example this section draws on (not reproduced here), the process of selecting a method was relatively easy, as both MSE and MAE comparisons suggested the same method (damped Holt’s). However, sometimes different accuracy measures will suggest different forecasting methods, and then a decision is required as to which forecasting method we prefer to use. As forecasting tasks can vary by many dimensions (length of forecast horizon, size of test set, forecast error measures, frequency of data, etc.), it is unlikely that one method will be better than all others for all forecasting scenarios. What we require from a forecasting method are consistently sensible forecasts, and these should be frequently evaluated against the task at hand.
 





Holt-Winters Seasonal Method:

    The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level  ℓ(t), one for the trend  b(t), and one for the seasonal component  s(t), with corresponding smoothing parameters  α,  β∗ and  γ. We use  m to denote the frequency of the seasonality, i.e., the number of seasons in a year. For example, for quarterly data  m = 4, and for monthly data  m = 12.

    There are two variations to this method that differ in the nature of the seasonal component. The additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations are changing proportional to the level of the series. With the additive method, the seasonal component is expressed in absolute terms in the scale of the observed series, and in the level equation the series is seasonally adjusted by subtracting the seasonal component. Within each year, the seasonal component will add up to approximately zero. With the multiplicative method, the seasonal component is expressed in relative terms (percentages), and the series is seasonally adjusted by dividing through by the seasonal component. Within each year, the seasonal component will sum up to approximately  m.
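
    The additive variant can be sketched as below, using the standard additive Holt-Winters recursions: the level smooths the seasonally adjusted observation y(t) − s(t−m), the trend is updated as in Holt's method, and the seasonal component smooths y(t) − ℓ(t−1) − b(t−1). The function name, the data, the parameter values and the way the initial states are supplied are all assumptions made for this illustration.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, level0, trend0, seasonal0, h):
    """Additive Holt-Winters; returns forecasts for horizons 1..h.

    seasonal0 holds the m initial seasonal states s_{1-m}, ..., s_0 (oldest first).
    """
    level, trend = level0, trend0
    seasonal = list(seasonal0)            # seasonal[-m] is always s_{t-m}
    for obs in y:
        prev_level, prev_trend = level, trend
        s_lag = seasonal[-m]
        # Level: smooth the seasonally adjusted observation.
        level = alpha * (obs - s_lag) + (1 - alpha) * (prev_level + prev_trend)
        # Trend: same update as in Holt's method.
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Seasonal: smooth the detrended observation.
        seasonal.append(gamma * (obs - prev_level - prev_trend) + (1 - gamma) * s_lag)
    # Forecast: l_T + k*b_T plus the latest estimate of the matching season.
    return [level + k * trend + seasonal[-m + (k - 1) % m] for k in range(1, h + 1)]


# Made-up series: level 10, slope 1, and a two-season pattern of +1, -1 (so m = 2).
data = [12.0, 11.0, 14.0, 13.0]
forecasts = holt_winters_additive(
    data, m=2, alpha=0.5, beta=0.5, gamma=0.5,
    level0=10.0, trend0=1.0, seasonal0=[1.0, -1.0], h=4,
)
```

    Because the initial states match the pattern in the data exactly, the forecasts simply continue it. The multiplicative variant would divide by the seasonal component instead of subtracting it.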

Holt-Winters Damped Method:

    Damping is possible with both additive and multiplicative Holt-Winters’ methods. A method that often provides accurate and robust forecasts for seasonal data is the Holt-Winters method with a damped trend and multiplicative seasonality.

    

    Exponential smoothing methods are not restricted to those we have presented so far. By considering variations in the combinations of the trend and seasonal components, nine exponential smoothing methods are possible.

    The statistical models that underlie these methods generate the same point forecasts, but can also generate prediction (or forecast) intervals. A statistical model is a stochastic (or random) data generating process that can produce an entire forecast distribution.

    Each model consists of a measurement equation that describes the observed data, and some state equations that describe how the unobserved components or states (level, trend, seasonal) change over time. Hence, these are referred to as state space models.

    For each method there exist two models: one with additive errors and one with multiplicative errors. The point forecasts produced by the models are identical if they use the same smoothing parameter values. They will, however, generate different prediction intervals.

    To distinguish between a model with additive errors and one with multiplicative errors (and also to distinguish the models from the methods), we add a third letter to the classification of the methods. We label each state space model as ETS(⋅,⋅,⋅) for (Error, Trend, Seasonal).

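    For example, simple exponential smoothing with additive errors, ETS(A,N,N), has the measurement equation $y_t = \ell_{t-1} + \varepsilon_t$ and the state equation $\ell_t = \ell_{t-1} + \alpha\varepsilon_t$. These two equations, together with the statistical distribution of the errors, form a fully specified statistical model. Specifically, these constitute an innovations state space model underlying simple exponential smoothing.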


    

Estimation Using ETS Models:

    An alternative to estimating the parameters by minimising the sum of squared errors is to maximise the “likelihood”. The likelihood is the probability of the data arising from the specified model. Thus, a large likelihood is associated with a good model. For an additive error model, maximising the likelihood (assuming normally distributed errors) gives the same results as minimising the sum of squared errors. However, different results will be obtained for multiplicative error models. 

    Model selection

    A great advantage of the ETS statistical framework is that information criteria can be used for model selection. The AIC, AICc and BIC can be used to determine which of the ETS models is most appropriate for a given time series.

    Three of the combinations of (Error, Trend, Seasonal) can lead to numerical difficulties. Specifically, the models that can cause such instabilities are ETS(A,N,M), ETS(A,A,M), and ETS(A,Ad,M), due to division by values potentially close to zero in the state equations. We normally do not consider these particular combinations when selecting a model.

    Models with multiplicative errors are useful when the data are strictly positive, but are not numerically stable when the data contain zeros or negative values. Therefore, multiplicative error models will not be considered if the time series is not strictly positive. In that case, only the six fully additive models will be applied.

    

ETS Model Forecasting:

    ETS point forecasts are equal to the medians of the forecast distributions. For models with only additive components, the forecast distributions are normal, so the medians and means are equal. For ETS models with multiplicative errors, or with multiplicative seasonality, the point forecasts will not be equal to the means of the forecast distributions.

    Prediction intervals

    A big advantage of the models is that prediction intervals can also be generated — something that cannot be done using the methods. The prediction intervals will differ between models with additive and multiplicative methods.

    

# Chapter 9: ARIMA
ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

Stationarity and Differencing:

    A stationary time series is one whose properties do not depend on the time at which the series is observed.
    Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.

    A time series with cyclic behaviour (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be.

    One way to make a non-stationary time series stationary is to compute the differences between consecutive observations, $y'_t = y_t - y_{t-1}$. This is known as differencing.


    Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.

    The ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Also, for non-stationary data, the value of  r(1) is often large and positive.


Random walk model

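    The differenced series is the change between consecutive observations in the original series, $y'_t = y_t - y_{t-1}$. When the differenced series is white noise, the model for the original series can be written as $y_t - y_{t-1} = \varepsilon_t$, where $\varepsilon_t$ denotes white noise. Rearranging gives the random walk model $y_t = y_{t-1} + \varepsilon_t$.

    Random walk models are widely used for non-stationary data, particularly financial and economic data. Random walks typically have:
    
    a) long periods of apparent trends up or down
    b) sudden and unpredictable changes in direction.

    The forecasts from a random walk model are equal to the last observation, as future movements are unpredictable, and are equally likely to be up or down. Thus, the random walk model underpins naïve forecasts.

    A closely related model allows the differences to have a non-zero mean: $y_t = c + y_{t-1} + \varepsilon_t$. The value of  c is the average of the changes between consecutive observations. If  c is positive, then the average change is an increase in the value of  y(t). Thus,  y(t) will tend to drift upwards. However, if  c is negative,  y(t) will tend to drift downwards.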

Second-order differencing

    Occasionally the differenced data will not appear to be stationary and it may be necessary to difference the data a second time to obtain a stationary series.

    In that case, we would model the “change in the changes” of the original data: $y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}$. In practice, it is almost never necessary to go beyond second-order differences.

Seasonal differencing

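    A seasonal difference is the difference between an observation and the previous observation from the same season: $y'_t = y_t - y_{t-m}$, where  m is the number of seasons.

    These are also called “lag-m differences”, as we subtract the observation after a lag of  m periods. If the seasonally differenced data appear to be white noise, an appropriate model for the original data is $y_t = y_{t-m} + \varepsilon_t$. Forecasts from this model are equal to the last observation from the relevant season.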

    To distinguish seasonal differences from ordinary differences, we sometimes refer to ordinary differences as “first differences”.
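
    First, second-order and seasonal differences can all be computed with one small helper. This is a minimal sketch; the function name and the data are made up for the example.

```python
def difference(y, lag=1):
    """Return y_t - y_{t-lag}; the result is `lag` observations shorter than y."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


series = [1, 4, 9, 16, 25, 36]          # made-up data: the squares of 1..6
first = difference(series)              # first differences
second = difference(first)              # second-order differences
seasonal = difference(series, lag=4)    # lag-4 (seasonal, m = 4) differences
```

    Here the first differences still trend upwards, while the second-order differences are constant, which is the situation where differencing twice is needed.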


Unit root tests
    
    One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required.

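    Examples: the KPSS test and the augmented Dickey-Fuller (ADF) test.

    KPSS test:
    Null hypothesis: the data are stationary (around a constant level or a deterministic trend).
    Alternative hypothesis: the data are not stationary, so differencing is required.

    ADF test:
    Null hypothesis: the data have a unit root (non-stationary).
    Alternative hypothesis: the data are stationary.

    Note that the two tests have opposite null hypotheses: a small p-value suggests that differencing is required in the KPSS test, but that it is not required in the ADF test.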

Backshift Notation:

    The backward shift operator  B is a useful notational device when working with time series lags.
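
    For reference, the standard definition of the operator and the way it expresses the differences used in this chapter can be written as follows (here  m is the seasonal period):

```latex
\begin{align*}
B y_t &= y_{t-1} && \text{(shift back one period)} \\
B^2 y_t = B(B y_t) &= y_{t-2} && \text{(shift back two periods)} \\
y'_t = y_t - y_{t-1} &= (1 - B)\, y_t && \text{(first difference)} \\
y''_t = y_t - 2y_{t-1} + y_{t-2} &= (1 - B)^2 y_t && \text{(second-order difference)} \\
y_t - y_{t-m} &= (1 - B^m)\, y_t && \text{(seasonal difference)}
\end{align*}
```

    The notation is convenient because combined differences multiply: a seasonal difference followed by a first difference is $(1 - B)(1 - B^m) y_t$.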
    

Autoregressive Models:

    In a multiple regression model, we forecast the variable of interest using a linear combination of predictors. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.
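    Thus, an autoregressive model of order  p can be written as $y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t$, where $\varepsilon_t$ is white noise. We refer to this as an AR(p) model, an autoregressive model of order  p.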
    Autoregressive models are remarkably flexible at handling a wide range of different time series patterns. 
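
    As a minimal sketch of how an AR(p) model produces forecasts, the code below iterates the equation y(t) = c + ϕ1·y(t−1) + … + ϕp·y(t−p) forward, setting future errors to their expected value of zero. The function name, the coefficients and the data are assumptions made for the example; in practice c and the ϕ values would be estimated from the series.

```python
def ar_forecast(history, c, phis, h):
    """Iterated point forecasts from y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p} + e_t.

    Future errors are replaced by zero. history must contain at least
    len(phis) observations.
    """
    values = list(history)
    p = len(phis)
    for _ in range(h):
        lags = values[-1 : -p - 1 : -1]   # y_{t-1}, ..., y_{t-p}, most recent first
        values.append(c + sum(phi * lag for phi, lag in zip(phis, lags)))
    return values[len(history):]


# AR(1) with assumed c = 2 and phi_1 = 0.5, starting from a last observation of 10.
forecasts = ar_forecast([10.0], c=2.0, phis=[0.5], h=3)
```

    For this stationary AR(1), the forecasts decay towards the long-run mean c / (1 − ϕ1) = 4.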

Moving Average Models:

    Rather than using past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model.

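    A moving average model of order  q, written MA(q), is $y_t = c + \varepsilon_t + \theta_1\varepsilon_{t-1} + \dots + \theta_q\varepsilon_{t-q}$, where $\varepsilon_t$ is white noise. Notice that each value of  y(t) can be thought of as a weighted moving average of the past few forecast errors. However, moving average models should not be confused with moving average smoothing.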
    A moving average model is used for forecasting future values, while moving average smoothing is used for estimating the trend-cycle of past values.

ARIMA:

    If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average (in this context, “integration” is the reverse of differencing).
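    The model is written ARIMA(p,d,q), where  p is the order of the autoregressive part,  d is the degree of first differencing involved, and  q is the order of the moving average part. The full model is $y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1\varepsilon_{t-1} + \dots + \theta_q\varepsilon_{t-q} + \varepsilon_t$, where $y'_t$ is the series after differencing  d times.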

    The same stationarity and invertibility conditions that are used for autoregressive and moving average models also apply to an ARIMA model.

ACF and PACF plots:

    It is usually not possible to tell, simply from a time plot, what values of  p and  q are appropriate for the data. However, it is sometimes possible to use the ACF plot, and the closely related PACF plot, to determine appropriate values for  p and  q.

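    Recall that an ACF plot shows the autocorrelations, which measure the relationship between $y_t$ and $y_{t-k}$ for different values of  k. If $y_t$ and $y_{t-1}$ are correlated, then $y_{t-1}$ and $y_{t-2}$ must also be correlated, so $y_t$ and $y_{t-2}$ may be correlated simply because both are connected to $y_{t-1}$, rather than because $y_{t-2}$ contains any new information.
    To overcome this problem, we can use partial autocorrelations. These measure the relationship between $y_t$ and $y_{t-k}$ after removing the effects of lags  1,2,3,…,k−1. So the first partial autocorrelation is identical to the first autocorrelation, because there is nothing between them to remove. Each partial autocorrelation can be estimated as the last coefficient in an autoregressive model: the k-th partial autocorrelation is the estimate of $\phi_k$ in an AR(k) model.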
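
    Both quantities can be computed directly, as in the sketch below. The sample ACF uses the usual formula, and the PACF is obtained with the Durbin-Levinson recursion, which is one common estimator; library implementations may use other estimators and give slightly different numbers. The function names and the data are made up for the example.

```python
def acf(y, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag.

    r_k = sum_{t=k+1..T} (y_t - ybar)(y_{t-k} - ybar) / sum_{t=1..T} (y_t - ybar)^2
    """
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def pacf(y, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion on the sample ACF."""
    r = acf(y, max_lag)
    phi_prev = []   # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1} of the AR(k-1) fit
    out = [1.0]
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den   # last coefficient of the AR(k) fit
        phi_new = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_new.append(phi_kk)
        phi_prev = phi_new
        out.append(phi_kk)
    return out


data = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0]  # made-up data
r = acf(data, 3)
partial = pacf(data, 3)
```

    As the text says, the first partial autocorrelation equals the first autocorrelation; from lag 2 onwards the two generally differ.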

Maximum Likelihood Estimation:

    When R estimates the ARIMA model, it uses maximum likelihood estimation (MLE). This technique finds the values of the parameters which maximise the probability of obtaining the data that we have observed. For ARIMA models, MLE is similar to the least squares estimates that would be obtained by minimising the sum of squared errors $\sum_{t=1}^{T}\varepsilon_t^2$.

Information Criteria:

    Akaike’s Information Criterion (AIC), which was useful in selecting predictors for regression, is also useful for determining the order of an ARIMA model.
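    AIC = −2log(L) + 2(p+q+k+1), where  L is the likelihood of the data, and  k = 1 if the model includes a constant (c ≠ 0) and  k = 0 otherwise. The last term in parentheses is the number of parameters in the model (including the variance of the residuals).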
    Good models are obtained by minimising the AIC, AICc or BIC. Our preference is to use the AICc.
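
    The three criteria can be computed from a fitted model's log-likelihood as sketched below. The AIC follows the formula above; the AICc and BIC use their standard definitions in terms of the same parameter count, AICc = AIC + 2n(n+1)/(T−n−1) and BIC = AIC + (log(T) − 2)n with n = p+q+k+1. The function name and the input values are made up for the example.

```python
import math


def arima_information_criteria(log_likelihood, p, q, k, T):
    """AIC, AICc and BIC for an ARIMA model.

    k is 1 if the model includes a constant and 0 otherwise; T is the number of
    observations used for estimation. n_params also counts the error variance.
    """
    n_params = p + q + k + 1
    aic = -2.0 * log_likelihood + 2.0 * n_params
    aicc = aic + 2.0 * n_params * (n_params + 1) / (T - n_params - 1)
    bic = aic + (math.log(T) - 2.0) * n_params
    return aic, aicc, bic


# Assumed values: log-likelihood of -100 for an ARIMA(1,d,1) with constant, T = 50.
aic, aicc, bic = arima_information_criteria(-100.0, p=1, q=1, k=1, T=50)
```

    These criteria should only be compared between models with the same order of differencing  d, because differencing changes the data on which the likelihood is computed.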

SARIMA:

    So far, we have restricted our attention to non-seasonal data and non-seasonal ARIMA models. However, ARIMA models are also capable of modelling a wide range of seasonal data.

    A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA models we have seen so far.

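    Seasonality can be identified using the ACF plot: for seasonal data, the autocorrelations are larger at the seasonal lags (multiples of the seasonal period  m) than at other lags.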

    

 




Further topics: dynamic regression, forecasting hierarchical time series, and advanced forecasting methods (boosting, bagging, neural networks, and modelling data with complex seasonality).