# Sidekick - Formal Definition of the Problems

## Notations
 - Let $\mathbf{v}_{i:j}$, with $i \leq j$, denote the subvector $\mathbf{u} = [v_i, ..., v_j]^T$ consisting of the elements $v_i$ to $v_j$ of the vector $\mathbf{v}$.

 - Let $(v_{i:j}, i \leq j)$ denote the subsequence $(u_n, n = i, ..., j) = (v_i, ..., v_j)$ consisting of the elements $v_i$ to $v_j$ of the sequence $(v_n, n = 1, ..., N)$.
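
As a quick illustration of the notation, the sketch below maps $\mathbf{v}_{i:j}$ (1-indexed, both ends included) onto NumPy slicing (0-indexed, end excluded). The helper name `subvector` is hypothetical and only serves this example.

```python
import numpy as np

def subvector(v, i, j):
    """Return v_{i:j}: the elements v_i to v_j, 1-indexed and inclusive."""
    if not 1 <= i <= j <= len(v):
        raise ValueError("need 1 <= i <= j <= len(v)")
    # Shift the start by one; the exclusive end of the slice already matches j.
    return np.asarray(v)[i - 1:j]

v = np.array([10, 20, 30, 40, 50])
print(subvector(v, 2, 4))  # [20 30 40]
```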

## Problems

### Single-Project Regression
#### Model
We approach the problem as a time series regression over a single project. Our dataset $\mathcal{D} = \left\{ (x_i, y_i) \mid i = 1, ..., T \right\}$ consists of $T$ observations, where $x_i$ is the time index at which the amount of pledged money $y_i$ is observed. Hence, we have $X = [1, ..., T]^T$, a $(T \times 1)$ matrix of time indices, and $\mathbf{y} = [y_1, ..., y_T]^T$, a vector of observed values. We model the pledged money $f(\mathbf{x})$ at time indices $\mathbf{x}$ as a Gaussian process (GP):

$$f(\mathbf{x}) \sim GP \left( m(\mathbf{x}), k(\mathbf{x}, \mathbf{x'}) \right). $$

Our goal is to predict the future values of the pledged money $\mathbf{f}_* = \mathbf{f}_{t:T} = f(X_{t:T})$ at future time indices $X_* = X_{t:T} = [t, ..., T]^T$ after observing the values $\mathbf{y} = \mathbf{y}_{1:t} = [y_1, ..., y_t]^T$ at time indices $X = X_{1:t} = [1, ..., t]^T$. In the GP framework, we can compute this prediction using

$$\begin{aligned}
\mathbf{f}_* \mid X, \mathbf{y}, X_* &\sim \mathcal{N}\left(\overline{\mathbf{f}}_*, \text{cov}(\mathbf{f}_*) \right), \\
\overline{\mathbf{f}}_* &= K(X_*, X) \left[ K(X, X) + \sigma_n^2I \right]^{-1}\mathbf{y}, \\
\text{cov}(\mathbf{f}_*) &= K(X_*, X_*) - K(X_*, X)\left[ K(X, X) + \sigma_n^2I \right]^{-1}K(X, X_*).
\end{aligned}$$
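
The predictive equations can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a zero prior mean (as in the formulas above), a squared-exponential kernel, and arbitrary hyperparameter values. The function names `se_kernel` and `gp_predict` and the toy data are hypothetical, not taken from the project code.

```python
import numpy as np

def se_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between 1-D inputs a and b."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return signal_var * np.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)

def gp_predict(X, y, X_star, length_scale=1.0, signal_var=1.0, noise_var=1e-2):
    """Posterior mean and covariance of f_* for a zero-mean GP."""
    K = se_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(len(X))
    K_s = se_kernel(X_star, X, length_scale, signal_var)        # K(X_*, X)
    K_ss = se_kernel(X_star, X_star, length_scale, signal_var)  # K(X_*, X_*)
    L = np.linalg.cholesky(K)                                   # K + s^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))         # [K + s^2 I]^{-1} y
    V = np.linalg.solve(L, K_s.T)                               # L^{-1} K(X, X_*)
    return K_s @ alpha, K_ss - V.T @ V

# Toy project with T = 10 time steps, observed up to t = 6.
T, t = 10, 6
X = np.arange(1, t + 1)          # X_{1:t}
X_star = np.arange(t, T + 1)     # X_{t:T}
y = np.log1p(X)                  # synthetic curve, not real pledge data
mean, cov = gp_predict(X, y, X_star, length_scale=3.0)
print(mean.round(3))
print(np.sqrt(np.clip(np.diag(cov), 0.0, None)).round(3))  # predictive std
```

The Cholesky factorization replaces the explicit inverse in the formulas; it gives the same result and is numerically more stable.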

Finally, the kernel's (hyper)parameters $\theta^*$ are learned by maximizing the *log marginal likelihood*

$$\theta^* = \underset{\theta} {\arg\max} \log p(\mathbf{y} \mid X, \theta) = \underset{\theta} {\arg\max} \left\{ -\frac{1}{2}\mathbf{y}^T \left[ K+ \sigma_n^2I \right]^{-1}\mathbf{y} -\frac{1}{2}\log \det\left[K+ \sigma_n^2I\right] -\frac{T}{2}\log 2\pi \right\},$$

with $K = K(X, X)$.
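
A minimal sketch of this objective, assuming a squared-exponential kernel and a zero prior mean. The coarse grid search over the length-scale only stands in for a proper gradient-based optimizer, and all names and values are hypothetical.

```python
import numpy as np

def se_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel matrix between 1-D inputs a and b."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return signal_var * np.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """log p(y | X, theta) for a zero-mean GP with Gaussian noise."""
    y = np.asarray(y, dtype=float)
    K = se_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # equals -1/2 log det[K + s^2 I]
            - 0.5 * len(X) * np.log(2 * np.pi))

X = np.arange(1, 21)
y = np.sin(X / 4.0)  # synthetic series, not real pledge data
grid = [0.5, 1.0, 2.0, 4.0, 8.0]
best = max(grid, key=lambda ls: log_marginal_likelihood(X, y, ls, 1.0, 0.01))
print("best length-scale on the grid:", best)
```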

#### Results
The major problem in this setting is that the predictive mean $\overline{\mathbf{f}}_*$ falls back to the prior mean $m(\mathbf{x})$ very quickly beyond the last observation. One remedy is to combine two squared-exponential kernels and initialize one of them with a large length-scale in order to capture the global trend. This yields reasonable results. However, applying the same model to other projects ($\theta$ learned on one project and used to predict another) gives very poor performance.
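
The mean-reversion effect can be reproduced on a toy series. The sketch below assumes that "combining" the kernels means summing them, uses a zero prior mean, and picks arbitrary hyperparameters (length-scales 2 and 50); none of these values come from the actual experiments.

```python
import numpy as np

def se_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel matrix between 1-D inputs a and b."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return signal_var * np.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)

def predictive_mean(kernel, X, y, X_star, noise_var=1e-2):
    """Zero-mean GP predictive mean K(X_*, X) [K(X, X) + s^2 I]^{-1} y."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    return kernel(X_star, X) @ np.linalg.solve(K, y)

def short_kernel(a, b):
    # A single SE kernel with a short length-scale.
    return se_kernel(a, b, length_scale=2.0, signal_var=1.0)

def sum_kernel(a, b):
    # Short SE kernel plus a long length-scale SE kernel for the global trend.
    return short_kernel(a, b) + se_kernel(a, b, length_scale=50.0, signal_var=1.0)

X = np.arange(1, 21)
y = 0.05 * X                 # synthetic, steadily growing series
X_star = np.array([40])      # 20 steps past the last observation

m_short = predictive_mean(short_kernel, X, y, X_star)[0]
m_sum = predictive_mean(sum_kernel, X, y, X_star)[0]
print(m_short)  # essentially zero: the prediction has reverted to the prior mean
print(m_sum)    # clearly above zero: the long length-scale keeps part of the trend
```

With the short length-scale alone, the covariance between the test point and every observation is numerically negligible, so the observations no longer influence the prediction.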


### Multi-Project Regression
#### Model
Hence, our next idea is to consider $P$ projects at the same time and learn the hyperparameters $\theta$ over several time series. For a given project $p$, we have the dataset $\mathcal{D}^{(p)} = \left\{ (x_i^{(p)}, y_i^{(p)}) \mid i = 1, ..., T \right\}$. Note that $x_i^{(p)}= x_i = i$, since all projects share the same time indices. We combine the projects to obtain a new dataset $\mathcal{D} = \left\{ \mathcal{D}^{(p)} \mid p = 1, ..., P \right\}$ (*multi-task learning*). We then have $X = [1, ..., T]^T$, a $(T \times 1)$ matrix of time indices, and $Y = \left[\mathbf{y}^{(p)} \right]_{p=1}^P$, a $(T \times P)$ matrix of observed values per project.

For a given project $p$, we try to predict the pledged money $\mathbf{f}_* \equiv \mathbf{f}_*^{(p)} = \mathbf{f}_{t:T}^{(p)} = f(X_{t:T}^{(p)})$ at future time indices $X_* \equiv X_*^{(p)} = X_{t:T}^{(p)} = X_{t:T} = [t,...,T]^T$ after observing the values $\mathbf{y} \equiv \mathbf{y}^{(p)} = \mathbf{y}_{1:t}^{(p)}$ at time indices $X \equiv X_{1:t} = [1, ..., t]^T$. In the GP framework, we have

$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \equiv \mathbf{f}_{t:T}^{(p)} \mid X_{1:t}, \mathbf{y}_{1:t}^{(p)}, X_{t:T} \sim \mathcal{N} \left( \overline{\mathbf{f}}_{t:T}^{(p)}, \text{ cov}(\mathbf{f}_{t:T}^{(p)}) \right), $$

with $\overline{\mathbf{f}}_{t:T}^{(p)} \equiv \overline{\mathbf{f}}_*$ as before and $\text{ cov}(\mathbf{f}_{t:T}^{(p)}) \equiv \text{ cov}(\mathbf{f}_*)$. The hyperparameters $\mathbf{\theta}$ of the GP are learned by maximizing the log marginal likelihood over all the projects, that is

$$\theta^* = \underset{\theta} {\arg\max} \sum_{p=1}^P \log p(\mathbf{y}^{(p)} \mid X, \theta).$$
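
Since all projects share the same time indices, the kernel matrix is identical for every project, so the summed objective needs only one Cholesky factorization. The sketch below illustrates this, assuming a squared-exponential kernel, a zero prior mean, and synthetic data. The function name `multi_project_lml` is hypothetical.

```python
import numpy as np

def se_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel matrix between 1-D inputs a and b."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return signal_var * np.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)

def multi_project_lml(X, Y, length_scale, signal_var, noise_var):
    """Sum over projects of log p(y^(p) | X, theta); Y has one column per project."""
    Y = np.asarray(Y, dtype=float)
    n_steps, n_projects = Y.shape
    K = se_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(n_steps)
    L = np.linalg.cholesky(K)                          # shared by all projects
    Alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    data_fit = -0.5 * np.sum(Y * Alpha)                # sum_p y^T [K + s^2 I]^{-1} y
    complexity = -n_projects * np.sum(np.log(np.diag(L)))
    constant = -0.5 * n_projects * n_steps * np.log(2 * np.pi)
    return data_fit + complexity + constant

rng = np.random.default_rng(0)
n_steps, n_projects = 15, 4
X = np.arange(1, n_steps + 1)
# Synthetic, monotonically increasing curves standing in for pledged money.
Y = np.cumsum(rng.random((n_steps, n_projects)), axis=0) / n_steps
total = multi_project_lml(X, Y, length_scale=3.0, signal_var=1.0, noise_var=0.01)
print(total)
```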

#### Results
Again, we couldn't obtain good results with this approach, as the predictions always fall back very quickly to the mean of the GP. **[MORE DETAILS]**

### Project Classification
#### Model
We then decide to try a simpler task. Instead of predicting a number of (future) points after some observations, we now try to classify whether a project will be successful or not. Indeed, when separating the dataset into two classes (*successful* and *failed*), we notice that they have very different profiles (look at the mean of both classes in [sidekick-classification]). To exploit this, we train one GP on the successful projects and one on the failed projects: we learn the hyperparameters $\theta_s$ of a GP over the *successful* projects only and the hyperparameters $\theta_f$ of a GP over the *failed* projects only. Formally, we maximize the log marginal likelihoods

$$\theta_s^* = \underset{\theta} {\arg\max} \log p(Y_s \mid X_s, \theta)$$
$$\theta_f^* = \underset{\theta} {\arg\max} \log p(Y_f \mid X_f, \theta)$$

where the subscripts $s$ and $f$ denote the data of the successful and failed projects respectively. We then determine the class of a new, partially observed project as

$$c_{*} = \begin{cases} s & \text{if } \log p(\mathbf{y}_{1:t}^{(n)} \mid \mathbf{x}_{1:t}, \theta_s^*) > \log p(\mathbf{y}_{1:t}^{(n)} \mid \mathbf{x}_{1:t}, \theta_f^*), \\ f & \text{otherwise.} \end{cases}$$
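
The decision rule can be sketched as follows, assuming zero-mean GPs with a squared-exponential kernel. The two hyperparameter sets are made-up placeholders for values that would be learned on the successful and failed projects, and the input series are synthetic.

```python
import numpy as np

def se_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel matrix between 1-D inputs a and b."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return signal_var * np.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """log p(y | X, theta) for a zero-mean GP with Gaussian noise."""
    K = se_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

def classify(y_partial, theta_s, theta_f):
    """Return 's' if the partial series is more likely under theta_s, else 'f'."""
    y_partial = np.asarray(y_partial, dtype=float)
    x = np.arange(1, len(y_partial) + 1)   # time indices 1..t
    ll_s = log_marginal_likelihood(x, y_partial, **theta_s)
    ll_f = log_marginal_likelihood(x, y_partial, **theta_f)
    return "s" if ll_s > ll_f else "f"

# Placeholder hyperparameters (assumed, not learned from real projects).
theta_s = {"length_scale": 3.0, "signal_var": 4.0, "noise_var": 0.25}
theta_f = {"length_scale": 3.0, "signal_var": 0.04, "noise_var": 0.01}

print(classify([2.0, 3.0, 4.0, 5.0, 6.0], theta_s, theta_f))    # s
print(classify([0.05, 0.0, 0.1, 0.05, 0.0], theta_s, theta_f))  # f
```

With these placeholder values, a series that grows quickly is far better explained by the high-variance model, while a series that stays near zero is better explained by the low-variance one.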

#### Results

### Mixture of Gaussian Processes
#### Model
Instead of using time as the input and predicting the output at new time indices, we now consider the value at each time step as an input (`y` becomes `x`) and the value at the last time index as the output.
#### Results