# Streaming Data Science
## Exam preparation

### Breadth questions: TSA

#### 1) What is the i.i.d. assumption? How do SML and TSA allow the loosening of part of it?

**Answer:**

**i.i.d.** stands for independent and identically distributed. It is a strong assumption that traditional statistical analysis makes about data sequences, but real-world data often does not satisfy these properties. Streaming Machine Learning (SML) loosens the identically distributed assumption by detecting data and concept drifts in the distribution of incoming data and adapting to them. Time Series Analysis (TSA), on the other hand, loosens the independence assumption by providing methods to fit and predict data with strong temporal dependence.

#### 2) Which characteristics can a non-stationary time series have? Illustrate it with an example.

**Answer:**

A non-stationary time series, as the name says, is a time series that does not follow a stationary behavior, i.e., its mean and variance are not constant over time. Such series often have a trend (a continuous change of the mean over time) and/or a seasonality (a periodic change in the mean and variance over time) component.

For example, if we analyse the number of passengers of an airline every month, this time series probably will show a periodic behavior every 12 months (corresponding to the travel seasons), and an increasing trend over the years (corresponding to an increase of popularity of the airline, for example). This time series is clearly non-stationary.

#### 3) Why is the white noise a predictable time series? What is its forecast? What is its error? Explain it w.r.t. the possible components of a time series.

**Answer:**

White noise (i.i.d. draws, for example from a normal distribution with mean 0 and variance $\sigma^2$) is the simplest stationary time series: it has no trend and no seasonality, only the residual component. It is predictable in the sense that its best forecast is known: if we predict 0 (the mean) at every step, we minimize the expected squared error, which is then equal to the known variance $\sigma^2$ of the distribution.
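As a quick check of this claim, the sketch below simulates Gaussian white noise and compares the mean squared error of two constant forecasts. The function names, the seed and the sample size are illustrative assumptions, not part of the course material.

```python
import random
import statistics

def white_noise(n, sigma, seed=0):
    """Draw n i.i.d. Gaussian samples with mean 0 and standard deviation sigma."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def mse_of_constant_forecast(series, forecast):
    """Mean squared error when the same value is predicted at every step."""
    return statistics.fmean([(x - forecast) ** 2 for x in series])

sigma = 2.0
noise = white_noise(20_000, sigma)
mse_zero = mse_of_constant_forecast(noise, 0.0)     # should be close to sigma**2
mse_shifted = mse_of_constant_forecast(noise, 1.0)  # any other constant does worse
```

Predicting the mean gives an error close to $\sigma^2 = 4$, while any other constant adds the squared bias on top of it.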

#### 4) Which are the methods to test for stationarity? Explain one of them in detail.

**Answer:**

We have various methods to test for stationarity. We can do it by hand, or we can rely on statistical tests like ADF or KPSS, which test the null hypothesis that the series is non-stationary (ADF) or stationary (KPSS).

For the sake of simplicity, let us explain the first method. We can detect if a time series is stationary or not by, for example, plotting the time series and looking for obvious trends or seasonalities. Often, these are not very clear, so the next step is to compare the summary statistics of two or more partitions of the series.

In other words, to check that a time series is stationary, we can split it into two or more parts (depending on whether we are only testing for a trend or we suspect a seasonality with a certain period). Then, for each part, we calculate its summary statistics, specifically its mean and variance, and compare them with those of the other parts. A significant difference strongly suggests that the whole series is non-stationary.

Finally, if we are still not sure about the stationarity of the series, we can perform an ADF test or a KPSS test. These are more rigorous and help us decide by providing indicators like the p-value of the hypothesis test (the null hypothesis is non-stationarity for ADF and stationarity for KPSS).
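The manual check can be sketched in a few lines. This is a minimal illustration on two toy series chosen for this example; deciding what counts as a "significant" difference is left to the reader.

```python
import statistics

def split_statistics(series, parts=2):
    """Return (mean, variance) for each of `parts` consecutive chunks.

    Trailing points that do not fill a whole chunk are ignored.
    """
    size = len(series) // parts
    chunks = [series[i * size:(i + 1) * size] for i in range(parts)]
    return [(statistics.fmean(c), statistics.pvariance(c)) for c in chunks]

trending = [0.5 * t for t in range(100)]      # the mean grows over time
flat = [float((-1) ** t) for t in range(100)]  # alternates around 0

trending_stats = split_statistics(trending)  # clearly different means
flat_stats = split_statistics(flat)          # same mean and variance
```

The two halves of `trending` have very different means, which suggests a trend, while the halves of `flat` share the same mean and variance.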

#### 5) Which are the typical time series components? Illustrate your explanation with an example

**Answer:**

Usually, a time series is composed of the following three components:
* Trend component: represents an overall, persistent, long-term change of the mean.
* Seasonality: represents regular and periodic fluctuations.
* Residual component: the component that gives randomness to the series. It is what remains after the previous components are removed, and it is ideally stationary.

For example, we can have a time series representing the passengers of an airline. It will present a periodic change every 12 months, reflecting the months with higher and lower demand. It could also present an increasing trend, reflecting the growing popularity of air travel. Finally, it has a random component: even though we expect a higher passenger mean in December, the true value cannot be exactly determined (it could be 20045 in 2023 and 20103 in 2024).

#### 6) Which are the methods to detrend a time series? Explain one of them in detail.

**Answer:**

We have three main methods to detrend a time series:
* Detrend by differencing
* Detrend by model fitting and removal
* Combination of 1 and 2

Let us explain the first one. To remove a trend from a time series, we can difference it.
The process of differencing consists of obtaining a new time series made of the differences between each observation and its predecessor, i.e.:
$$y_t = x_t - x_{t-1}$$

where $\{x_t\}_t$ is the original time series. This has the effect of removing any linear trend present in the time series. If the trend is quadratic, we can difference twice to remove it. In general, if the trend is polynomial of degree $d$, differencing $d$ times will remove it. Be aware that each differencing pass removes one data point, as the first point has no predecessor.
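A minimal sketch of differencing on two toy series (the series and the function name are assumptions made for illustration):

```python
def difference(series, times=1):
    """Apply first-order differencing `times` times; each pass drops one point."""
    for _ in range(times):
        series = [current - previous for previous, current in zip(series, series[1:])]
    return series

linear = [3 + 2 * t for t in range(8)]  # linear trend with slope 2
quadratic = [t * t for t in range(8)]   # quadratic trend

detrended_linear = difference(linear)                 # constant series of 2s
detrended_quadratic = difference(quadratic, times=2)  # constant series of 2s
```

One pass turns the linear series into a constant, and two passes do the same for the quadratic one. The results have 7 and 6 points respectively, instead of the original 8.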

#### 7) Which are the methods to identify seasonality in a time series? Explain one of them in detail.

**Answer:**

To identify seasonality in a time series, we can plot it and look for periodic changes over time. If we detect a cycle of period $d$, we can then use techniques like seasonal differencing to remove it. If the result looks stationary, it is a strong sign that the original series had a seasonality of period $d$.

If this test doesn't give clear results, we can use the Auto Correlation Function to detect more easily the presence of seasonality. We first calculate the correlation between the series with a lagged version of itself, for various lag values. Then, we plot these values, and look for clear cycles in this function. For example, if we detect peaks in lags 12, 24, 36, it is a suggestion of a seasonality of period 12.
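The ACF check can be sketched as follows, using a synthetic series with a period of 12 as a stand-in for monthly data. The estimator below is the usual sample autocorrelation; the helper names are assumptions for this example.

```python
import math
import statistics

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` with itself shifted by `lag` steps."""
    mean = statistics.fmean(series)
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(len(series) - lag)
    )
    return numerator / denominator

# Synthetic "monthly" series: a pure sine wave with period 12.
monthly = [math.sin(2 * math.pi * t / 12) for t in range(240)]
acf = [autocorrelation(monthly, lag) for lag in range(38)]  # acf[lag]

# A lag is a peak when its ACF value is larger than both neighbours.
peaks = [lag for lag in range(1, 37) if acf[lag] > acf[lag - 1] and acf[lag] > acf[lag + 1]]
```

On this series the peaks fall at lags 12, 24 and 36, which points to a seasonality of period 12.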


#### 8) Which are the methods to forecast a time series? Compare two methods of your choice (excluding the basic ones).

**Answer:**

To forecast a time series, we have basic methods like predicting the last value, predicting the average value or predicting the last-k average value. However, there are more precise methods like the following:
* Exponential smoothing (simple, double and triple)
* ARIMA and SARIMA Models
* Prophet

Let us explain the simple exponential smoothing method. This method allows us to fit and predict the next value of a stationary time series. It is based on the idea of weighting the most recent observation against the previous forecast to compute the next prediction. It is calculated as follows:
$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha) \hat{y}_t$$

with $\alpha$ the smoothing factor. Note that, as this formula is recursive, the forecast is a weighted average of all previous observations, with the weights decaying exponentially as the observations get older:
$$\hat{y}_{t+1} = \alpha y_t + \alpha (1 - \alpha) y_{t-1} + \alpha (1 - \alpha)^2 y_{t-2} + ... + \alpha (1-\alpha)^{t-1} y_1 + (1 - \alpha)^t \ell_0$$

with $\ell_0$ being the first forecast made.

Note that this is a streaming algorithm: each forecast needs the previous observation, so we need a constant flow of information. Forecasts beyond one step ahead are flat (they simply repeat the one-step forecast), so in practice the method is only informative for a horizon of 1.
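The recursion can be sketched as below. The function name, the toy observations and the choice of `initial_forecast` (playing the role of $\ell_0$) are assumptions for illustration.

```python
def simple_exponential_smoothing(observations, alpha, initial_forecast):
    """Return the one-step-ahead forecasts.

    forecasts[0] is the initial forecast (l_0); forecasts[t] is the forecast
    made after seeing the first t observations, so the last entry is the
    forecast for the next, still unseen, step.
    """
    forecasts = [initial_forecast]
    for y in observations:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

observations = [10, 12, 11, 13]
forecasts = simple_exponential_smoothing(observations, alpha=0.5, initial_forecast=10)
next_forecast = forecasts[-1]
```

With $\alpha = 0.5$, the last forecast is $0.5 \cdot 13 + 0.25 \cdot 11 + 0.125 \cdot 12 + 0.0625 \cdot 10 + 0.0625 \cdot 10 = 12$, which matches the expanded formula.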

#### 9) What are the components of a SARIMAX model?

**Answer:**

A SARIMAX model has the following components:
* A non-seasonal AutoRegressive model
* A non-seasonal MovingAverage model
* A non-seasonal differencing
* A seasonal AR model
* A seasonal MA model
* A seasonal differencing
* A regression term for external (exogenous) variables


#### 10) Which are the benefits and limitations of [MAE / MAPE / MSE / RMSE]? Provide a practical example where its use is beneficial and one where it should be avoided.

**Answer:**

Benefits:
* MAE: easy to calculate, conserves original units, making it easy to interpret.
* MAPE: as it provides a percentage error, it can be used to make comparisons between datasets.
* MSE: penalizes larger errors more significantly.
* RMSE: penalizes larger errors like MSE, and provides the original units, making it easier to interpret.

Limitations: 
* MAE: weights all errors linearly, so it may not penalize large errors sufficiently.
* MAPE: susceptible to division by zero, so it is not recommended when actual values are zero or close to zero.
* MSE: its value is in squared units of the original data, which can be less intuitive.
* RMSE: sensitive to outliers and not robust against extreme values.
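To make the comparison concrete, here is a small sketch of the four metrics on made-up numbers (the data are hypothetical). It shows how a single large error affects MAE and RMSE differently.

```python
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))

def mape(actual, predicted):
    """Percentage error; raises ZeroDivisionError if any actual value is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100, 200, 300, 400]
clean = [110, 190, 310, 390]         # every forecast is off by 10
with_outlier = [110, 190, 310, 500]  # the last forecast is off by 100

clean_scores = (mae(actual, clean), rmse(actual, clean))                   # (10, 10)
outlier_scores = (mae(actual, with_outlier), rmse(actual, with_outlier))   # (32.5, about 50.7)
```

When large errors are especially costly (for example, under-forecasting demand), RMSE is useful because it reacts strongly to the outlier. When a few extreme values should not dominate the score, MAE is the safer choice. MAPE should be avoided for series such as intermittent demand, where actual values can be 0.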


#### 11) Describe how Prophet works and what each of its key components does. Highlight at least three advantages of using Prophet in time series forecasting.

**Answer:**

Prophet is an additive regression model with three main components plus an error term:
* A growth function, which can be piecewise linear or logistic, and captures trend changes by automatically (or manually) detecting and storing changepoints.
* A seasonality function, which captures periodic changes in the time series by relying on Fourier series. Usually, daily, weekly and yearly periods are considered.
* A holiday function, which captures the peaks and drops usually caused by special dates.
* An error term, which models the random noise.

Advantages of Prophet:
* Ease of Use: requires minimal data preprocessing and domain expertise. Automatically handles missing data, outliers, and irregular sampling.

* Interpretability: clearly separates trend, seasonality, and holiday effects. Makes it easy to understand and explain the factors driving forecasts.

* Customizability: allows users to specify holidays, adjust seasonal effects, and set changepoints manually. Offers flexibility for domain-specific needs.

* Handles Non-Stationarity: effectively manages series with changing trends and patterns over time.

* Scalability: fast enough for large datasets and supports parallelization for efficient computation.


### Depth questions: TSA

#### 1) What’s the difference between an additive and a multiplicative model for time series decomposition? Illustrate your explanation with an example.

**Answer:**

The additive and the multiplicative models are two ways of decomposing a time series. Both extract a trend, a seasonal and a residual component; the main difference between them is, as their names say, the operation each of them uses to recompose the time series from those components.

An additive model decomposes a time series in the following way:
$$X_t = m_t + s_t + Y_t$$

while a multiplicative model does:
$$X_t = m_t \cdot s_t \cdot Y_t$$

where $m_t$, $s_t$ and $Y_t$ are the trend, seasonality and residuals, respectively, and $X_t$ is the original time series.
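As an illustration, the sketch below builds one series of each kind from the same trend. The numbers are made up, and the residual is left out (0 in the additive case, 1 in the multiplicative case) to keep the effect visible.

```python
trend = [100 + 10 * t for t in range(8)]  # m_t
additive_season = [20, -20] * 4           # s_t for the additive model
multiplicative_season = [1.2, 0.8] * 4    # s_t for the multiplicative model

additive = [m + s for m, s in zip(trend, additive_season)]
multiplicative = [m * s for m, s in zip(trend, multiplicative_season)]

# Size of the seasonal swing around the trend at each step.
additive_swing = [abs(x - m) for x, m in zip(additive, trend)]
multiplicative_swing = [abs(x - m) for x, m in zip(multiplicative, trend)]
```

In the additive series the seasonal swing stays at 20 regardless of the level, while in the multiplicative series it is 20% of the trend and therefore grows with it. This is why a series like airline passengers, whose seasonal peaks get larger as traffic grows, is usually better described by a multiplicative model.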

#### 2) Why is exponential smoothing named in this way? Discuss the formula of at least one of the three methods presented in the course.

**Answer:**

The exponential smoothing method is named this way because it weights each past observation to obtain the next forecast, with the weights decaying exponentially as the observations get older.

The formula is as follows:
$$\hat{y}_{t+1} = \alpha y_t + (1 -\alpha) \hat{y}_{t}$$

Notice that it is a recursive formula, so if we replace backwards each term, we obtain:
$$\hat{y}_{t+1} = \alpha y_t + \alpha (1 - \alpha) y_{t-1} + \alpha (1 - \alpha)^2 y_{t-2} + ... + \alpha (1 - \alpha)^{t - 1} y_1 + (1 - \alpha)^t \ell_0$$

where $\ell_0$ is the initial forecast. Notice that, because $0 < \alpha < 1$ in practice, the coefficient $\alpha (1 - \alpha)^k$ decays exponentially as $k$ increases, hence the name.

#### 3) What’s the difference between simple, double, and triple exponential smoothing? Illustrate the explanation by highlighting the respective components in the triple exponential smoothing formula.

**Answer:**

The main difference between simple, double and triple exponential smoothing is the number of components each of them considers for fitting and forecasting. Simple ES assumes the time series is stationary, as it has no component for a trend or a seasonality. Double ES can forecast a time series with a trend component, by adding to its forecast equation a variable that captures the trend change. Finally, triple ES also captures the possible seasonality of the time series, by adding to the forecast equation, alongside the trend variable, an extra variable that captures periodic changes.

The TES equations are as follows:

* Forecast eq: $\hat{y}_{t+h} = \ell_t + h b_t + s_{t+h-d}$
* Level eq: $\ell_t = \alpha (y_t - s_{t-d}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})$
* Trend eq: $b_t = \beta(\ell_t - \ell_{t-1}) + (1 - \beta)b_{t-1}$
* Season eq: $s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma)s_{t-d}$

In the level eq, we deseasonalize by subtracting $s_{t-d}$ from the current observation $y_t$. The trend eq is the same as in DES, and in the season eq, we remove the level and trend by subtracting $\ell_{t-1}$ and $b_{t-1}$ from the current observation.
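The four TES equations can be sketched as one update loop. This is a minimal illustration that assumes the initial level, trend and seasonal values are given; the function name, the toy series and the parameter values are assumptions made for this example.

```python
def holt_winters_additive(series, period, alpha, beta, gamma,
                          level, trend, seasonals, horizon):
    """Run the additive TES updates over `series`, then forecast `horizon` steps.

    `seasonals[i]` holds the latest seasonal estimate for phase i (time mod period),
    which plays the role of s_{t-d} in the equations.
    """
    seasonals = list(seasonals)
    for t, y in enumerate(series):
        phase = t % period
        prev_level, prev_trend, prev_season = level, trend, seasonals[phase]
        # Level eq: deseasonalized observation vs. previous level + trend.
        level = alpha * (y - prev_season) + (1 - alpha) * (prev_level + prev_trend)
        # Trend eq: same as in double exponential smoothing.
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Season eq: observation with level and trend removed vs. last season.
        seasonals[phase] = gamma * (y - prev_level - prev_trend) + (1 - gamma) * prev_season
    n = len(series)
    # Forecast eq: level + h * trend + latest seasonal estimate for that phase.
    return [level + h * trend + seasonals[(n + h - 1) % period]
            for h in range(1, horizon + 1)]

pattern = [5.0, -5.0, 0.0, 0.0]  # seasonal pattern with period 4
series = [10 + 2 * t + pattern[t % 4] for t in range(12)]
forecasts = holt_winters_additive(series, period=4, alpha=0.5, beta=0.3, gamma=0.2,
                                  level=8.0, trend=2.0, seasonals=pattern, horizon=2)
```

Because the toy series is exactly a linear trend plus a periodic pattern, and the initial states match it, the forecasts continue the series without error. On real data the three smoothing factors and the initial states would have to be estimated.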

#### 4) What’s the difference between the meaning of moving average in time series decomposition and the MA component in ARMA models? Illustrate your explanation using the corresponding formulas.

**Answer:**

When we talk about moving average in time series decomposition, we refer to the method that averages the data points inside a sliding window (usually the size of the expected seasonality period) to estimate the trend, which can then be removed. With the MA component of an ARMA model, we refer to the Moving Average component, which is a regression on the past estimation errors of the time series and is used to forecast it.

In other words, in the first case we use the actual points of the series to compute a sliding average and detrend it, while in the second we use the estimation errors of the series to fit a model and predict the following steps.

Moving average in time series decomposition (assume the window size $d$ is odd, $d = 2k + 1$):
$$y_t = \frac{1}{d} \sum_{i = -k}^{k} x_{t+i}$$

MA model:
$$y_t = c + \epsilon_t + \sum_{i=1}^q \theta_i \epsilon_{t-i}$$
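The two notions can be put side by side in code. Both functions below are illustrative sketches with assumed names: the first smooths observed values, the second builds a series from error terms following the MA model equation.

```python
def centered_moving_average(series, window):
    """Average over a centered window of odd length.

    Points near the edges that lack a full window are dropped.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    return [sum(series[t - half:t + half + 1]) / window
            for t in range(half, len(series) - half)]

def ma_process(errors, thetas, c=0.0):
    """Build y_t = c + e_t + sum_i theta_i * e_{t-i}; errors before the start count as 0."""
    values = []
    for t, error in enumerate(errors):
        past = sum(theta * errors[t - 1 - i]
                   for i, theta in enumerate(thetas) if t - 1 - i >= 0)
        values.append(c + error + past)
    return values

smoothed = centered_moving_average([1, 2, 3, 4, 5], window=3)
shock_response = ma_process([1.0, 0.0, 0.0], thetas=[0.5])
```

In `shock_response`, a single error of 1 influences the series for exactly one more step (with weight 0.5) and then disappears, which is the typical behaviour of an MA(1) model.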


#### 5) What’s the definition of Autocorrelation? How does it differ from the definition of correlation? Illustrate your explanation with an example.

**Answer:**

Autocorrelation measures the degree of similarity between a time series and a lagged version of itself. It differs from ordinary correlation precisely in this respect: correlation is calculated between two different series, while autocorrelation is calculated on a single series and a copy of it shifted by a certain lag. For example, in a monthly series with yearly seasonality, the autocorrelation at lag 12 is high, because each month resembles the same month of the previous year.

#### 6) What’s the difference between the AR and the MA part of an ARMA model? What is their relation to Autocorrelation and Partial Autocorrelation? Illustrate your explanation using ACF and PACF plots.

**Answer:**

An AR model is a regression model that uses the last $p$ observations of a time series as regressors. The MA model is also a regression model, but it uses the past $q$ estimation errors of that time series. For an AR model of order $p$, the ACF plot decays (tails off) and the PACF plot shows a sharp cutoff after lag $p$. For an MA model of order $q$, on the other hand, the PACF plot tails off and the ACF plot shows a sharp cutoff after lag $q$.
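For the simplest cases, the ACF behaviour can be computed in closed form. The sketch below uses the textbook autocorrelations of a stationary AR(1) process, $\rho_k = \phi^k$, and of an MA(1) process, $\rho_1 = \theta / (1 + \theta^2)$ and $\rho_k = 0$ for $k > 1$. The function names and parameter values are assumptions for illustration.

```python
def ar1_acf(phi, max_lag):
    """Theoretical ACF of a stationary AR(1) process (|phi| < 1), lags 1..max_lag."""
    return [phi ** k for k in range(1, max_lag + 1)]

def ma1_acf(theta, max_lag):
    """Theoretical ACF of an MA(1) process, lags 1..max_lag."""
    return [theta / (1 + theta ** 2) if k == 1 else 0.0
            for k in range(1, max_lag + 1)]

ar_values = ar1_acf(0.5, 4)  # tails off geometrically
ma_values = ma1_acf(0.5, 4)  # non-zero at lag 1 only, then cuts off
```

The AR(1) values shrink gradually without ever reaching zero (tail off), while the MA(1) values drop to zero right after lag 1 (cutoff), which is the pattern we look for in the ACF plot.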

#### 7) How does the Box-Jenkins Methodology for ARMA models allow us to estimate the orders of the model? Illustrate your explanation with an example.

**Answer:**

Refer to the previous answer. By plotting the ACF and PACF of the time series, if we notice any of the above behaviors, we can determine the degrees of AR and MA. Also, we can perform a grid search and evaluate each pair of values $(p, q)$ inside the grid, and assess each resulting model with an indicator like AIC or BIC.

#### 8) What are the exogenous variables in TSA? How do they contribute to the forecasting? What’s their impact on the confidence interval? Illustrate your explanation with an example.

**Answer:**

Exogenous variables are variables coming from a different time series (generally, a different event) that may influence the outcome of the time series we are studying, but are not influenced by it. By providing extra information, exogenous variables may increase the accuracy of the forecast. When they are informative, the confidence interval typically shrinks compared to a model without them, making the forecast more reliable.

### Breadth questions: SML

#### 1) What are the differences between batch-oriented Machine Learning and Streaming Machine Learning?

**Answer:**

Batch-oriented ML and SML differ mainly in how they process and handle data. Batch-oriented ML processes data in fixed batches, analyzing a complete dataset at once and using it to train a model that is then used to predict on the next batch of data. In contrast, SML processes data in real time, using each sample to update the model as it arrives.

This makes the first approach slower to adapt to changes and less robust to data and concept drifts than the second one, which updates the model constantly and contains mechanisms to detect and adapt to drifts in the data distribution.

SML is therefore more suitable for real-time applications like fraud detection or live event monitoring, while batch-only ML is more suitable for offline analytics and some recommendation systems, in which data changes are slow or practically nonexistent.


#### 2) What are the benefits and the challenges of Streaming Machine Learning?

**Answer:**

The benefits of SML are:
* We can receive and incorporate data one sample at a time
* We build incremental models, robust to drifts on incoming data.
* These models are memory efficient, as they store the model and not each data point, only using each data point to adapt the model as it arrives. Also, because they are continuously updating one sample at a time, the models have low-latency for prediction.

The challenges of SML are:
* Designing a way to detect data drifts, as we assume that the data won't always be identically distributed.
* Finding a way to handle possible class imbalance in the incoming stream.
* As in every other branch of ML, every model has hyperparameters that need to be tuned to achieve maximum performance.



#### 3) What is a concept drift? Which are the types of concept drift? Why is it so important to detect it? Illustrate the difference using the Bayes Theorem and with an example.

**Answer:**

A concept drift refers to a change in the statistical properties of the target variable or the data distribution over time. This change affects the predictive performance of machine learning models. It happens when the relationship between the input data $X$ and the labels $y$ changes, causing previously learned patterns to become outdated.

We can illustrate the types of concept drifts using the Bayes theorem:
$$p(y | X) = \frac{p(y) \cdot p(X | y)}{p(X)}$$

* Change in the prior $p(y)$: the frequency of each class changes, which may or may not change the decision boundary. For example, the frequency of fraudulent transactions increases.
* Change in the likelihood $p(X|y)$: the features associated with each class change, for example, if fraudsters use different patterns or methods.
* Change in the data distribution $p(X)$: the overall feature distribution changes, for example when general customer behavior changes due to external factors. This term is the same for every class, so it is usually not considered.
* Change in the posterior $p(y|X)$: this is when a real concept drift occurs, i.e., the probability of belonging to a class given a set of features changes. This moves the decision boundary, and may cause the model to become outdated.
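A small numeric sketch of the Bayes theorem makes the first case tangible. All probabilities below are hypothetical, as is the single discrete feature (time of the transaction).

```python
def posterior(prior, likelihoods, x):
    """p(y | x) via the Bayes theorem for discrete classes and one discrete feature."""
    joint = {label: prior[label] * likelihoods[label][x] for label in prior}
    evidence = sum(joint.values())  # p(x)
    return {label: value / evidence for label, value in joint.items()}

# Hypothetical likelihoods p(x | y): fraud happens mostly at night.
likelihoods = {
    "fraud": {"night": 0.7, "day": 0.3},
    "legit": {"night": 0.1, "day": 0.9},
}

before = posterior({"fraud": 0.01, "legit": 0.99}, likelihoods, "night")
after = posterior({"fraud": 0.20, "legit": 0.80}, likelihoods, "night")
```

With a 1% fraud rate, a night transaction is still most likely legitimate. If the prior rises to 20% while the likelihoods stay the same, the posterior for "fraud" goes above 0.5 and the decision for the same features flips. In this example the change in the prior alone is enough to move the decision boundary.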


#### 4) How can you classify a concept drift with respect to the speed of change? Illustrate the difference with an example for each type.

**Answer:**

Based on the speed of change, a concept drift can be classified into the following categories:
* Abrupt drift: the relationship between the features and the labels changes suddenly. We start to misclassify almost all data points we receive, as they previously belonged to a different class.

* Gradual drift: the relationship changes slowly, blending old and new patterns until only the new patterns remain.

* Incremental drift: similar to the gradual drift, but the change is more subtle and continuous.

* Recurring drift: new patterns appear and disappear periodically.

#### 5) Which are the typical Streaming Machine Learning algorithms for classification? Illustrate one of them in detail.

**Answer:**

The typical SML algorithms for classification are:
* Naive Bayes
* Online KNN
* Hoeffding Trees (VFDT)
* Hoeffding Adaptive Tree (HAT)

For the sake of simplicity, let us explain Online K Nearest Neighbours. This algorithm assigns to a new observation the majority class among its $k$ closest previous observations. The distance can vary depending on the context, but usually the Euclidean distance is used. The algorithm stores only a limited number of previous observations, and the size of this history window may change in response to possible concept drifts. This size can be controlled by methods such as ADWIN.
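A minimal sketch of the idea is shown below. To keep it short it uses a fixed-size window instead of ADWIN, which is a simplifying assumption; the class name and the toy data are also made up for illustration.

```python
import math
from collections import Counter, deque

class SlidingWindowKNN:
    """k-NN classifier over the most recent `window_size` labelled observations."""

    def __init__(self, k=3, window_size=100):
        self.k = k
        self.window = deque(maxlen=window_size)  # oldest samples fall out automatically

    def predict(self, x):
        if not self.window:
            return None  # nothing learned yet
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    def learn(self, x, y):
        self.window.append((tuple(x), y))

model = SlidingWindowKNN(k=3, window_size=4)
for point in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]:
    model.learn(point, "old concept")
prediction_before = model.predict((0.05, 0.05))

# The concept changes: the same region now carries a different label.
for point in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]:
    model.learn(point, "new concept")
prediction_after = model.predict((0.05, 0.05))
```

Because the window only holds the four most recent samples, the old labels are forgotten once the new ones arrive, and the prediction follows the new concept.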


#### 6) What are the key factors to consider when building an SML Ensemble Classification model?

**Answer:**

There are some key factors to consider when creating an ensemble model:

* Diversity: we need to choose the mechanism that induces diversity in our ensemble, either horizontal partitioning (training learners on different subsets of samples) or vertical partitioning (training learners on different subsets of features).

* Combination of learners: we need to decide how we are going to combine the learners, including the architecture of the ensemble and the voting system used to decide the prediction.

* Adaptation: we also need to decide how our ensemble is going to adapt over time, for example by changing its cardinality or by changing how incoming data is distributed across the ensemble for training, among other options.


#### 7) How does SML regression compare to time series forecasting w.r.t. types of input features, training, forecasting horizon, and adaptability?

**Answer:**

### Depth questions: SML

#### 1) Which are the concept drift detectors that monitor the error rate? Illustrate how one of them works.

**Answer:**

We have the following concept drift detectors:

* Drift Detection Method (DDM)
* Early Drift Detection Method (EDDM)
* Adaptive Sliding Window (ADWIN)

DDM works by monitoring the classification error. After each prediction, we update the running error rate $p_t$ and its standard deviation $\sigma_t = \sqrt{p_t (1 - p_t) / t}$. We also store the values $p_{min}$ and $\sigma_{min}$ observed when $p_t + \sigma_t$ reached its minimum. Then:

* If $p_t + \sigma_t > p_{min} + 2\sigma_{min}$, raise a warning.
* If $p_t + \sigma_t > p_{min} + 3\sigma_{min}$, signal a drift.
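The two rules can be sketched as follows. The function name, the 30-sample warm-up and the synthetic error stream are assumptions made for this example, and the detector is not reset after a drift, as a full implementation would do.

```python
import math

def ddm_scan(errors, min_samples=30):
    """Return (first_warning, first_drift) as 1-based positions, or None if not reached.

    `errors` is a stream of 0/1 values: 1 means the prediction was wrong.
    """
    n_errors = 0
    p_min = s_min = math.inf
    first_warning = None
    for n, error in enumerate(errors, start=1):
        n_errors += error
        p = n_errors / n                 # running error rate
        s = math.sqrt(p * (1 - p) / n)   # its standard deviation
        if n < min_samples:
            continue                     # wait for a minimum amount of evidence
        if p + s < p_min + s_min:
            p_min, s_min = p, s          # remember the best state seen so far
        if p + s > p_min + 3 * s_min:
            return first_warning, n      # drift level reached
        if first_warning is None and p + s > p_min + 2 * s_min:
            first_warning = n            # warning level reached
    return first_warning, None

# 10% error rate for 200 predictions, then every prediction is wrong.
stream = [1 if (i + 1) % 10 == 0 else 0 for i in range(200)] + [1] * 50
warning_at, drift_at = ddm_scan(stream)
```

On this stream no alarm is raised during the stable phase; the warning and then the drift are signalled shortly after position 200, where the error rate starts to climb.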





#### 2) Which are the data drift detectors? Illustrate how one of them works.

**Answer:**

The data drift detectors are the following:

* Cumulative sum test (CUSUM)
* Page-Hinkley test

The CUSUM test works by continuously monitoring the deviations of the data from a reference mean $\hat{x}$, and raising an alarm once enough significant deviations have accumulated. We have two parameters, $v$ and $h$:

* $v$ is the threshold above which a deviation from the mean is considered significant.
* $h$ is the threshold above which the cumulative deviations imply a change.

Then, the algorithm is as follows:

$$g_0 = 0$$
$$g_t = \max(0, g_{t-1} + (x_t - \hat{x}) - v)$$
$$\text{if } g_t > h: \text{ alarm}$$
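The recursion translates almost directly into code. This sketch covers the one-sided (upward) test exactly as written in the equations above; the function name and the toy stream are assumptions.

```python
def cusum_first_alarm(stream, reference_mean, v, h):
    """Return the 0-based index of the first alarm, or None if no alarm is raised."""
    g = 0.0
    for t, x in enumerate(stream):
        # Accumulate only deviations larger than v; never go below 0.
        g = max(0.0, g + (x - reference_mean) - v)
        if g > h:
            return t
    return None

# The mean jumps from 0 to 1 at index 10.
stream = [0.0] * 10 + [1.0] * 10
alarm_at = cusum_first_alarm(stream, reference_mean=0.0, v=0.5, h=2.0)
```

After the jump, each sample adds $1 - 0.5 = 0.5$ to $g_t$, so the sum exceeds $h = 2$ on the fifth shifted sample, at index 14. A larger $h$ gives fewer false alarms but a longer detection delay.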

#### 3) How does ADWIN detect a concept drift? Illustrate it with an example.

**Answer:**

ADWIN is an algorithm designed to detect concept drifts in streaming data. It dynamically maintains a window of recent data points, adjusting its size based on changes in the statistical properties of the data. The core idea is to detect when the distribution within the window has changed significantly.

Let us show an example with clicks on movie recommendations, between the Action and Drama genres. Let Action be represented as a 1 and Drama as a 0. ADWIN will maintain a window $W$ of past observations as follows:
$$W = [1, 0, 1, 1, 0, 1, 1]$$

Suppose now that new data arrives, and the window increases to: 
$$W = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0]$$

ADWIN will divide the window in two:
$$W_1 = [1, 0, 1, 1, 0, 1], \quad W_2 = [1, 0, 0, 0, 1, 0]$$

Then, it computes the mean of each sub-window:
$$\mu_{W_1} = 4/6 \approx 0.67, \quad \mu_{W_2} = 2/6 \approx 0.33$$

Assume that we have a threshold $\epsilon = 0.2$. We then compute the difference:
$$|\mu_{W_1} - \mu_{W_2}| \approx 0.33 > 0.2$$

As the difference is greater than the threshold, we discard the old window ($W_1$) and keep only the more recent one, $W = W_2$.
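The check for this single split can be reproduced in code. This is a simplified sketch: it tests one cut point against a fixed $\epsilon$, whereas the actual ADWIN algorithm examines many possible cuts and derives its threshold from a Hoeffding-type bound. The function names are assumptions.

```python
import statistics

def split_means(window, cut):
    """Means of the older part (before `cut`) and the recent part (from `cut` on)."""
    return statistics.fmean(window[:cut]), statistics.fmean(window[cut:])

def shrink_if_drift(window, cut, epsilon):
    """Drop the older part when the two means differ by more than epsilon."""
    mean_old, mean_new = split_means(window, cut)
    if abs(mean_old - mean_new) > epsilon:
        return window[cut:]
    return window

window = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
mean_old, mean_new = split_means(window, cut=6)
new_window = shrink_if_drift(window, cut=6, epsilon=0.2)
```

Here the older half has 4 clicks on Action out of 6 and the recent half has 2 out of 6, so the means differ by about 0.33 and only the recent half is kept.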

#### 4) How does the SML version of KNN work? Illustrate it with an example.

**Answer:** 

The SML version of KNN works by maintaining a window of previous observations, and assigning to a new observation the majority class among its $k$ closest previous observations. The size of the memory window is selected using the ADWIN algorithm.

#### 5) How does the SML version of Naïve Bayes work? Illustrate it with an example.

**Answer:**

This model relies on the Bayes theorem to obtain the prediction for a new data point. It also assumes that the features of the data are independent of each other given the class. So we have:

$$p(y = c | X) \propto p(y = c) \cdot \prod_{i} p(x_i | y = c)$$

The SML version of Naive Bayes stores the occurrence of each class for the previous observations (on a window determined by ADWIN) to calculate $p(y = c)$ and stores incrementally the mean and variance of the features given that they belong to a specific class, for then estimating the value of $p(x_i | y = c)$.
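A minimal sketch of the incremental bookkeeping, assuming Gaussian features. It keeps all past observations in its statistics (no ADWIN window), updates mean and variance with Welford's method, and adds a tiny constant to the variance to avoid division by zero; these choices and all names are assumptions made for illustration.

```python
import math
from collections import defaultdict

class IncrementalGaussianNB:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.stats = {}  # label -> one [mean, sum of squared deviations] pair per feature

    def learn(self, x, y):
        self.class_counts[y] += 1
        n = self.class_counts[y]
        features = self.stats.setdefault(y, [[0.0, 0.0] for _ in x])
        for feature, value in zip(features, x):
            delta = value - feature[0]
            feature[0] += delta / n                      # running mean
            feature[1] += delta * (value - feature[0])   # running squared deviations

    def predict(self, x):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)                  # log p(y = c)
            for (mean, m2), value in zip(self.stats[label], x):
                variance = m2 / n + 1e-9                 # smoothing term (assumption)
                score += (-0.5 * math.log(2 * math.pi * variance)
                          - (value - mean) ** 2 / (2 * variance))  # log p(x_i | y = c)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = IncrementalGaussianNB()
for value in (0.0, 0.2, -0.2):
    model.learn([value], "a")
for value in (5.0, 5.2, 4.8):
    model.learn([value], "b")
```

After these six updates, a point near 0 is assigned to class "a" and a point near 5 to class "b". Only the counts, means and squared deviations are stored, never the samples themselves.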

#### 6) How does the Hoeffding Tree work? Illustrate it with an example.

**Answer:**

A Hoeffding Tree works by incrementally building a decision tree, splitting a leaf only once enough statistical evidence has been gathered to choose the attribute to split on. A split is made when the difference in the split criterion (e.g., information gain) between the best and the second-best attribute exceeds the Hoeffding bound.

Following this rule gives us theoretical guarantees that the decision tree will approximate the ideal tree built with the whole dataset.



#### 7) How does the Hoeffding Adaptive Tree work? Illustrate it with an example.

**Answer:**

A HAT works just like a normal Hoeffding tree, but it maintains a window of previous observations and constructs alternate subtrees that can potentially replace the main branch. When a concept drift is detected, it discards the previous subtree in favor of the new one, built from the more recent observations (the new window from ADWIN).


#### 8) What is Hoeffding Bound? How and why is it used in SML to build a decision tree?

**Answer:**

Refer to the previous answers. The Hoeffding bound is a theoretical bound used to determine when a node of a Hoeffding tree should be split. It is used because it gives theoretical guarantees that the tree will approximate the ideal tree that would be built with the whole dataset, if we had it.
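For reference, the bound states that, after $n$ observations of a variable with range $R$, the true mean is within $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ of the observed mean with probability $1 - \delta$. A sketch of how it drives the split decision follows; the function names and the gain values are hypothetical.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the sample mean
    with probability 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return best_gain - second_best_gain > hoeffding_bound(value_range, delta, n)

# Same observed gains, different amounts of evidence.
early = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=100)
late = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=5000)
```

With only 100 samples the bound (about 0.28) is larger than the gap of 0.05 between the two attributes, so the tree waits. With 5000 samples the bound drops to about 0.04 and the split is made.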

#### 9) Why is the Poisson distribution used in SML ensemble methods? What’s the impact of the value of lambda? Illustrate how lambda is used in the SML methods

**Answer:**

In Streaming Machine Learning, the Poisson distribution is used in ensemble methods like Online Bagging to determine how many times a data instance is used to train each base model. The parameter $\lambda$ (mean and variance of the distribution) controls the sampling frequency:

* $\lambda = 1$: Simulates traditional bagging, where each instance is, on average, included once.

* $\lambda < 1$: Instances are included less often, reducing their influence.

* $\lambda > 1$: Instances are included more frequently, increasing their influence.

This sampling promotes diversity among base models without storing large datasets. It’s efficient for streaming contexts and crucial for ensemble performance.
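The sampling step can be sketched with the standard library alone. The Poisson sampler below uses Knuth's multiplication method, which is adequate for small $\lambda$; the function names and the ensemble size are assumptions for this example.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one value from Poisson(lam) using Knuth's method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def online_bagging_weights(n_learners, lam, rng):
    """How many times each base learner should train on the incoming instance."""
    return [poisson_sample(lam, rng) for _ in range(n_learners)]

rng = random.Random(42)
weights = online_bagging_weights(n_learners=5, lam=1.0, rng=rng)
# Each learner i would now call its own learn(x, y) method weights[i] times.
```

With $\lambda = 1$ a given learner skips an instance roughly 37% of the time ($e^{-1}$), which is what makes the learners differ from each other. Larger values of $\lambda$ make skipping rarer and repeated training on the same instance more common.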

### Breadth and Depth questions: Continual Learning

#### 1) What’s the stability-plasticity dilemma? Illustrate your explanation with an example and discuss the different learning abilities.

**Answer:**

This dilemma refers to the trade-off between a model having plasticity, i.e., a high capacity for acquiring new knowledge (learning over time), and stability, i.e., the ability to remember previously acquired knowledge.

If a model is too plastic, it will forget past knowledge in favor of new experiences, a problem known as catastrophic forgetting. On the other hand, if a model is too stable, it will have great difficulty learning from new data and adapting over time.


#### 2) What are the main differences between Streaming Machine Learning and Continual Learning paradigms? Discuss it w.r.t to the objective of the methods and the type of data they process.

**Answer:**

Both paradigms focus on learning data distributions that are assumed to change over time. The main differences between each paradigm are:

* CL focuses on avoiding catastrophic forgetting, while SML focuses on quick adaptation to new concepts. Since CL assumes the data drifts to be virtual, each task can be seen as a subproblem on a feature subspace, so avoiding forgetting past experiences is crucial. On the other hand, SML ignores the forgetting problem, as it assumes that there could be real concept drifts that change the class boundary and contradict previous concepts, so it just focuses on learning the current concept.

* CL distinguishes between train and test sets, usually separating data into batches called experiences (each with a train and test). On the other hand, SML usually applies prequential evaluation, meaning that it predicts on every data point (or mini-batch) as it arrives, and uses the result to update the model and detect possible concept drifts.
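Prequential (test-then-train) evaluation can be sketched as a short loop. The majority-class learner below is only a placeholder to make the example runnable; any model with `predict` and `learn` methods (names assumed here) would fit.

```python
from collections import Counter

class MajorityClassLearner:
    """Placeholder model: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(stream, model):
    """For each sample: predict first (test), then update the model (train)."""
    correct = total = 0
    for x, y in stream:
        correct += model.predict(x) == y  # test on the sample before learning from it
        model.learn(x, y)                 # then use it for training
        total += 1
    return correct / total if total else 0.0

stream = [(None, "a"), (None, "a"), (None, "a"), (None, "a")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

The first prediction is made before anything has been learned, so it is wrong; the remaining three are correct, giving an accuracy of 0.75. Every sample is used both for testing and for training, so no separate test set is needed.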


#### 3) What is the difference between the definitions of concept drift in SML and CL? When is avoiding forgetting meaningful? Illustrate it with an example.

**Answer:**

Refer to part of the previous answer. In SML, concept drifts are assumed to be real, i.e., drifts that change the decision boundary, so it mainly focuses on learning the current concept, ignoring the problem of forgetting. On the other hand, in CL, concept drifts are assumed to be virtual, meaning that only the feature distribution changes, but not the decision boundary. In this setting, avoiding forgetting is important, as each task is seen as a subproblem on a different subset of features, but with the same classification problem (same boundary), so previously learned patterns may be crucial to ease the learning of the new task.

#### 4) What are the three main scenarios in Continual Learning?

**Answer:**

The 3 main scenarios in CL are the following:

* Incremental task learning: each experience is labeled with a specific task.
  
* Incremental domain learning: no task labelling, but we have the same target domain for every experience. The goal is to incrementally expand the known feature domain for each class.
  
* Incremental class learning: no task labelling and different target domains for each experience. The goal is to incrementally expand the known classes for the observed features.

#### 5) How do CL replay strategies work? Explain one specific strategy in detail. What are the other two categories of CL strategies? Briefly explain them.

**Answer:**

Replay strategies in CL work by including past data points (observations) in current experiences, to avoid forgetting previously learned patterns. One of these strategies is called Random Replay. It consists of the following steps:

* Have a Random Memory for storing random data points from each training batch.
* Train the model on each batch joined and mixed with the Random Memory set. 
* After training on a batch, select random points from this batch and replace random points of the Random Memory set
* Go on to train the next batch.
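The steps above can be sketched with a small memory class. The class name, the capacity and the number of samples stored per batch are assumptions, and the actual model training is left out.

```python
import random

class RandomReplayMemory:
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def training_set(self, batch):
        """Current batch joined and shuffled with the stored samples."""
        mixed = list(batch) + self.items
        self.rng.shuffle(mixed)
        return mixed

    def update(self, batch, n_new):
        """Store random samples from `batch`, replacing random old ones when full."""
        for sample in self.rng.sample(list(batch), min(n_new, len(batch))):
            if len(self.items) < self.capacity:
                self.items.append(sample)
            else:
                self.items[self.rng.randrange(self.capacity)] = sample

memory = RandomReplayMemory(capacity=4)
batch_1 = [("experience 1", i) for i in range(10)]
batch_2 = [("experience 2", i) for i in range(10)]

train_1 = memory.training_set(batch_1)  # memory is empty: only batch 1
memory.update(batch_1, n_new=4)
train_2 = memory.training_set(batch_2)  # batch 2 plus 4 replayed samples
memory.update(batch_2, n_new=2)
```

When training on the second experience, the model also sees four samples from the first one, which is what counteracts forgetting. The memory never grows beyond its capacity.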

The other two categories of CL strategies are Regularization and Architectural. The first modifies the objective function with regularization terms that are specific to each task. The second modifies the architecture of the model, for example by adding new shared layers between tasks, freezing weights, or using transfer learning.

#### 6) Which are the primary evaluation metrics used in Continual Learning? Illustrate what they measure and why they are all useful to assess the properties of a method. 

**Answer:**

We have the following metrics in CL:

* Average accuracy: the average of the final model's accuracy over the test sets of all experiences. It is the most standard metric, similar to others in ML.

* A metric: it measures the progressive performance of the model by averaging, after training on each experience, the accuracy on the test sets of that experience and of the previous ones.

* Forward transfer metric: it measures how the current training improves the performance on the next experiences.

* Backward transfer metric: it measures how much the final model has improved or degraded the performance on previous experiences. A negative value indicates forgetting.
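These metrics are usually computed from an accuracy matrix $R$, where $R_{i,j}$ is the accuracy on the test set of experience $j$ after training on experience $i$. The sketch below follows one common formulation of the four metrics, which is an assumption here (the course may define them slightly differently). The matrix and the baseline accuracies of an untrained model are made-up numbers.

```python
def average_accuracy(R):
    """Mean accuracy of the final model over all test sets."""
    return sum(R[-1]) / len(R)

def a_metric(R):
    """Mean of R[i][j] over all pairs with j <= i (current and previous tests)."""
    values = [R[i][j] for i in range(len(R)) for j in range(i + 1)]
    return sum(values) / len(values)

def backward_transfer(R):
    """Final accuracy on each earlier experience minus the accuracy right after learning it."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

def forward_transfer(R, baseline):
    """Accuracy on experience j just before training on it, minus an untrained baseline."""
    T = len(R)
    return sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)

# Hypothetical results for three experiences.
R = [
    [0.9, 0.5, 0.4],
    [0.8, 0.9, 0.5],
    [0.7, 0.8, 0.9],
]
baseline = [0.5, 0.4, 0.4]

acc = average_accuracy(R)
a = a_metric(R)
bwt = backward_transfer(R)
fwt = forward_transfer(R, baseline)
```

In this example the final model reaches an average accuracy of 0.8, but the backward transfer is negative (-0.15), which reveals forgetting on the first two experiences that the average alone would hide. The forward transfer is positive (0.1), so earlier training helps slightly on experiences not yet seen.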