### [Fitting models to short time series](https://robjhyndman.com/hyndsight/short-time-series/)

Using least squares estimation, or some other non-regularized estimation method, it is possible to estimate a model only if you have more observations than parameters.  (If you use the LASSO, or some other regularization technique, it is possible to estimate a model with fewer observations than parameters.) However, there is no guarantee that a fitted model will be any good for forecasting, especially when the data are noisy.



The only reasonable approach is to first check that there are enough observations to estimate the model, and then to test whether the model performs well out-of-sample. With short series, there are not enough data to allow some observations to be withheld for testing purposes. However, the AIC can be used as a [proxy for the one-step forecast out-of-sample MSE](https://robjhyndman.com/hyndsight/aic/). The AIC allows both the number of parameters and the amount of noise to be taken into account.
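
To make the trade-off concrete, here is a minimal sketch of an AIC comparison between two least-squares models on a short series. It assumes Gaussian errors and uses one common convention, $n \log(\mathrm{SSE}/n) + 2k$, which drops additive constants. The function names and the toy series are illustrative and do not come from the post.

```python
import math

def gaussian_aic(residuals, n_coef):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    n = len(residuals)
    sse = sum(r * r for r in residuals)  # must be > 0 for the log to exist
    k = n_coef + 1  # regression coefficients plus the error variance
    return n * math.log(sse / n) + 2 * k

def mean_residuals(y):
    """Residuals of the constant (mean-only) model."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]

def trend_residuals(y):
    """Residuals of a straight-line trend fitted by ordinary least squares."""
    n = len(y)
    t = list(range(n))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]

# Toy series with six observations (made up for illustration).
y = [3, 5, 4, 6, 5, 7]
aic_mean = gaussian_aic(mean_residuals(y), n_coef=1)
aic_trend = gaussian_aic(trend_residuals(y), n_coef=2)
print("mean-only AIC:", round(aic_mean, 3))
print("trend AIC:", round(aic_trend, 3))
```

The model with the lower value is preferred. Each extra parameter adds 2 to the score, so it has to reduce the residual sum of squares enough to pay for itself.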


What tends to happen with short series is that the AIC suggests very simple models, because anything with more than one or two parameters will produce poor forecasts due to the estimation error. After applying the `auto.arima()` function from the forecast package in R to all the series from the M-competition, 32 of 144 series had models with zero parameters (random walks), and 95 had models with one parameter.

Seasonal models bring their own difficulties because the seasonality usually takes up $m-1$ degrees of freedom, where $m$ is the seasonal period. Fourier terms are one way to reduce the problem, and they are useful whenever the ratio of $m$ to the sample size is large.
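
As a rough illustration of why Fourier terms save degrees of freedom, the sketch below builds $K$ sine/cosine pairs for period $m$. This gives $2K$ regressors, compared with the $m-1$ columns that seasonal dummies would need. The function name `fourier_terms` and the arguments `n`, `m` and `K` are assumptions made for this example, not code from the post.

```python
import math

def fourier_terms(n, m, K):
    """Return n rows of 2K regressors: sin(2*pi*k*t/m), cos(2*pi*k*t/m), k = 1..K."""
    if not 1 <= K <= m // 2:
        raise ValueError("K must be between 1 and m // 2")
    rows = []
    for t in range(1, n + 1):
        row = []
        for k in range(1, K + 1):
            angle = 2 * math.pi * k * t / m
            row.extend([math.sin(angle), math.cos(angle)])
        rows.append(row)
    return rows

# Weekly data (m = 52) with only 60 observations: dummies would need 51 columns.
X = fourier_terms(n=60, m=52, K=2)
print(len(X), "rows,", len(X[0]), "seasonal regressors instead of", 52 - 1)
```

These columns can then be used as regressors in whatever model is being fitted. Choosing $K$ is a model-selection question in its own right, and one that the AIC can help with.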
 
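A seasonal ARIMA$(p,d,q)(P,D,Q)_m$ model has $p+q+P+Q$ coefficients and one variance parameter to estimate, and differencing uses up $d+mD$ observations. Consequently, at least $p+q+P+Q+d+mD+1$ observations are required to estimate a seasonal ARIMA model.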
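
The count is simple arithmetic and can be written as a small helper. The function name is hypothetical, and the argument names mirror the symbols in the formula.

```python
def min_obs_seasonal_arima(p, d, q, P, D, Q, m):
    """Lower bound on the series length: p + q + P + Q + d + m*D + 1."""
    return p + q + P + Q + d + m * D + 1

# Monthly ARIMA(1,1,1)(0,1,1) with m = 12.
print(min_obs_seasonal_arima(p=1, d=1, q=1, P=0, D=1, Q=1, m=12))

# Non-seasonal ARIMA(1,1,1): set the seasonal orders to zero.
print(min_obs_seasonal_arima(p=1, d=1, q=1, P=0, D=0, Q=0, m=1))
```

This is only a necessary condition. A series that just meets the bound can be fitted, but nothing guarantees that the fitted model forecasts well.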

### [Classification Of Short Time Series](https://www.researchgate.net/publication/46447515_Classification_Of_Short_Time_Series)

__Abstract:__ In this paper, we consider several ways of assigning a dissimilarity between univariate time series in short term
behavior. In particular, we have defined a measure that works irrespective of different baselines and scaling factors and its effectiveness has been evaluated on real
and synthetic data sets.

Classic problems in handling short time series involve the clustering of such series into similar categories and the classification of new observed series into two or more known categories. 

These two problems, of course, are very common and there exists a
vast literature on methods of discriminant and cluster analysis as applied to time
independent observations. 

The basic idea is to extract distinctive features from
the data, compare them and perform the grouping of the units into distinct categories. The clustering is satisfactory if the distance between units within clusters
is relatively small compared with distances between clusters. Once the structure
and the required number of clusters have been established, the cluster representatives can be employed to classify the old and new units using, for example, the
nearest-centroid method.
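
The nearest-centroid step can be sketched as follows, assuming each unit has already been reduced to a numeric feature vector and that Euclidean distance is an acceptable measure of closeness. The cluster labels and the data are made up for illustration.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_centroid(x, centroids):
    """Return the label whose centroid is closest to x in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

# Hypothetical clusters of short series, three observations each.
clusters = {
    "flat": [[1.0, 1.0, 1.0], [1.2, 0.9, 1.1]],
    "rising": [[1.0, 2.0, 3.0], [0.8, 2.1, 3.2]],
}
centroids = {label: centroid(rows) for label, rows in clusters.items()}

new_unit = [1.1, 1.9, 2.9]
print(nearest_centroid(new_unit, centroids))
```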

Clustering methods can identify meaningful patterns even in time-dependent
observations; however, they have limitations when standard algorithms are applied blindly, measuring the closeness of the observed values but ignoring the temporal dimension. In this case, one wants to assign a value to the distance between individual time series rather than quantify the strength of the relationship between the stochastic processes that generate the observations.
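
To illustrate what a dissimilarity that ignores baseline and scale can look like, the sketch below z-normalizes each series before taking the Euclidean distance. This is a common stand-in chosen for the example and is not the measure defined in the paper. It handles shifts and positive rescaling only and assumes series of equal length.

```python
import math

def z_normalize(series):
    """Subtract the mean and divide by the standard deviation."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if sd == 0:
        return [0.0] * n  # a constant series has no shape to compare
    return [(v - mean) / sd for v in series]

def shape_distance(a, b):
    """Euclidean distance between the z-normalized versions of a and b."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.dist(z_normalize(a), z_normalize(b))

a = [1, 3, 2, 5, 4]
b = [10 + 3 * v for v in a]  # same shape, different baseline and scale
c = [5, 4, 3, 2, 1]          # different shape

print(shape_distance(a, b))  # essentially zero
print(shape_distance(a, c))  # clearly positive
```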


####  Cluster analysis

We chose the partitioning around medoids method (PAM)
for several reasons. 
- First, the typical representative of each group (the cluster medoid) is the most centrally located item in a cluster, that is, the item in the cluster whose average dissimilarity to all other items in the same cluster is minimal. 
- Second, it can operate directly on a distance matrix. Computing the cluster medoids does not require feature vectors; the pairwise dissimilarities are enough.
- Third, it is a partitional algorithm that does not impose a hierarchical structure, which is not necessarily present in the underlying hypothetical population. 
- Fourth, rather than selecting starting centers at random, PAM evaluates all possible starting centers and chooses the best centers to start cluster building. This gives consistent results when clustering is repeated. 
- Finally, PAM has been shown to be more robust to the inclusion of outliers than the popular k-means method because it uses the most centrally located object in a cluster
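
The properties listed above can be seen in a compact sketch of the PAM idea that works only from a distance matrix. It is a simplified variant written for illustration: a greedy build phase, followed by swaps that are accepted as soon as they lower the total cost. It is not the implementation used in the paper, and the function names are assumptions.

```python
def total_cost(dist, medoids):
    """Sum over all items of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Cluster items into k groups given a symmetric distance matrix."""
    n = len(dist)

    # Build phase: repeatedly add the item that lowers the total cost most.
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)

    # Swap phase: replace a medoid by a non-medoid whenever that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True

    # Assign every item to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels

# Toy example: six points on a line, with absolute difference as dissimilarity.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam(dist, k=2)
print(medoids, labels)
```

Because no step is random, repeated runs on the same matrix return the same medoids. The medoids are always observed items, so a single extreme value cannot drag a cluster representative away from the bulk of the cluster the way it can shift a mean.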