# Sidekick - Output as input

## Outputs as Inputs
### Model
We change the approach completely. Instead of using time as the input and predicting the output at new time indices, we now use the pledged money at each time step as the input ($\mathbf{y}$ becomes $\mathbf{x}$) and the pledged money at the last time index as the output. That is, we now have a dataset $\mathcal{D} = \left\{ (\mathbf{x}^{(p)}, y^{(p)}) \mid p = 1, ..., P \right\}$ with $\mathbf{x}^{(p)} = \mathbf{y}_{1:t}^{(p)}$ and $y^{(p)} = y_T^{(p)}$. We then have $X = \left[\mathbf{x}^{(p)}\right]_{p=1}^P$, a $(P \times t)$ matrix, and $\mathbf{y} = \left[y_T^{(p)}\right]_{p=1}^P$, a $(P \times 1)$ target vector. The difference from the second approach is that the features of each project are now the amounts of pledged money at different time steps rather than the same (shared) input values ($1,...,T$).
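
As a minimal sketch of this construction, the following builds $X$ and $\mathbf{y}$ from a list of pledged-money series. The helper name `build_dataset` and the synthetic series are assumptions made for illustration; they are not part of the Sidekick code. Indices are 0-based here, so `T` is the index of the last sample.

```python
import numpy as np

def build_dataset(series, t, T):
    """Stack the first t samples of each series into X and the sample at index T into y."""
    X = np.array([s[:t] for s in series], dtype=float)           # shape (P, t)
    y = np.array([s[T] for s in series], dtype=float)[:, None]   # shape (P, 1)
    return X, y

# Synthetic stand-in for the pledged-money curves (hypothetical data).
rng = np.random.default_rng(0)
series = [np.cumsum(rng.random(10)) for _ in range(4)]

X, y = build_dataset(series, t=5, T=9)
print(X.shape, y.shape)
```

Each row of `X` is one project, so two projects generally have different feature values even though they share the same time grid.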

Our goal is now to predict, for a new project $p$, the final amount of pledged money $f_*^{(p)} = y_T^{(p)} = f(\mathbf{x}^{(p)}) = f(\mathbf{y}_{1:t}^{(p)})$ from the money it received up to time $t$, $X_*^{(p)} = \mathbf{y}_{1:t}^{(p)}$. We do so after observing, for all training projects, the total pledged money $\mathbf{y} = \left[y_T^{(p)}\right]_{p=1}^P$ and the money they received up to time $t$, $X = \left[ \mathbf{y}_{1:t}^{(p)} \right]_{p=1}^P$. In the GP framework, we can compute this prediction using

$$
\begin{aligned}
f_* \mid X, \mathbf{y}, X_* &\sim \mathcal{N}\left(\overline{f}_*, \text{cov}(f_*)\right) \\
\overline{f}_* &= K(X_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1}\mathbf{y} \\
\text{cov}(f_*) &= K(X_*, X_*) - K(X_*, X)\left[ K(X, X) + \sigma_n^2 I \right]^{-1}K(X, X_*).
\end{aligned}
$$
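
The predictive equations above can be sketched directly in NumPy. This is only an illustration under stated assumptions: it uses a linear kernel $K(A, B) = \sigma^2 A B^\top$ with fixed hyperparameters, and the names `linear_kernel` and `gp_predict` and the synthetic data are hypothetical. It is not how GPy computes or optimizes the model.

```python
import numpy as np

def linear_kernel(A, B, variance=1.0):
    """Linear kernel: variance * A B^T."""
    return variance * A.dot(B.T)

def gp_predict(X, y, X_star, noise_var, variance=1.0):
    """Posterior mean and covariance of f_* for a zero-mean GP with a linear kernel."""
    K = linear_kernel(X, X, variance) + noise_var * np.eye(X.shape[0])
    K_s = linear_kernel(X_star, X, variance)        # K(X_*, X)
    K_ss = linear_kernel(X_star, X_star, variance)  # K(X_*, X_*)
    # Solve with a Cholesky factor instead of forming the inverse explicitly.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.dot(alpha)
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T.dot(v)
    return mean, cov

# Synthetic, noise-free linear data (hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
w_true = np.array([[0.5], [-1.0], [2.0]])
y = X.dot(w_true)
X_star = rng.standard_normal((2, 3))

mean, cov = gp_predict(X, y, X_star, noise_var=1e-4)
```

With a small noise variance and targets that are exactly linear in the inputs, `mean` should be close to `X_star.dot(w_true)`.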

As seen in [this plot](http://localhost:8888/notebooks/notebooks/sidekick-1-eda.ipynb#Output-as-input---Global), a single regression over all projects together seems impossible. However, the two modes corresponding to the successful and failed projects are clearly distinguishable. We therefore perform the regression on each of the two classes separately.

### Results
The clear separation between the two modes (successful and failed projects) also suggests that a mixture of GP models could probably be used to handle all projects jointly. Below, we train one GP on each class and evaluate it on the test projects of that class.
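
To illustrate what such a mixture could look like, the sketch below combines two Gaussian class-conditional predictions into a single mean and variance using the standard moments of a two-component Gaussian mixture. The function name `mixture_moments` and the weight `p_success` (a probability of success that would have to come from a separate classifier) are assumptions; nothing in this notebook estimates that weight.

```python
def mixture_moments(p_success, mean_s, var_s, mean_f, var_f):
    """Mean and variance of a two-component Gaussian mixture.

    p_success is the (hypothetical) weight of the "successful" component.
    """
    mean = p_success * mean_s + (1.0 - p_success) * mean_f
    # E[y^2] of the mixture is the weighted sum of each component's E[y^2].
    second_moment = (p_success * (var_s + mean_s ** 2)
                     + (1.0 - p_success) * (var_f + mean_f ** 2))
    return mean, second_moment - mean ** 2

# Example with made-up numbers: equal weights, component means 0 and 2.
mix_mean, mix_var = mixture_moments(0.5, 2.0, 0.1, 0.0, 0.1)
```

The mixture variance is larger than either component variance whenever the two means differ, which reflects the uncertainty about the class itself.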

In [3]:
%matplotlib inline
import os
import sys
sys.path.insert(0, os.path.abspath('../utils/')) # Add sibling to Python path
sys.path.insert(0, os.path.abspath('../src/')) # Add sibling to Python path
sys.stdout.flush() # Print output on the fly in Notebook
import matplotlib
matplotlib.rcParams['figure.figsize'] = (18,8)
matplotlib.rcParams['font.size'] = 16
matplotlib.rcParams['legend.fontsize'] = 16
from IPython.display import display
import numpy as np
import GPy
import cPickle as cp
import matplotlib.pyplot as plt
from math import floor
from dataset import Sidekick
from misc_utils import progress

DATA_DIR = "../data/sidekick"

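def predict_total_pledged(project, m_s_test, m_f_test):
    """Predict the final pledged money of a project with both class-specific GPs.

    Uses the first t observations (t is a notebook-level global).
    """
    # Single test input of shape (1, t)
    X_observed = np.asarray(project.money[:t], dtype=float).reshape(1, -1)
    yT_f = m_f_test.predict(X_observed)
    yT_s = m_s_test.predict(X_observed)
    # GPy's predict returns (mean, variance), so the *_std values below
    # hold predictive variances rather than standard deviations.
    yT_f_mean = yT_f[0][0][0]
    yT_f_std = yT_f[1][0][0]
    yT_s_mean = yT_s[0][0][0]
    yT_s_std = yT_s[1][0][0]

    return yT_s_mean, yT_s_std, yT_f_mean, yT_f_std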



## Load data and separate successful from failed projects

In [4]:
sk = Sidekick()
sk.load()
projects_train, projects_test = sk.split()
successful = [project.money for project in projects_train if project.successful and project.money[-1] >= 1]
failed = [project.money for project in projects_train if not project.successful]

Loading projects...
Loading statuses...
Converting to project instances...
Data loaded.


In [22]:
N = 2000
t = 500
T = 999
ard = False

## Train GP with a linear kernel on successful projects

In [30]:
successful_light = successful[:N]
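# Keep only projects whose final pledged value is below 2 (drops extreme outliers)
successful_light = [s for s in successful_light if s[T] < 2]
# Features: pledged money at the first t time steps, shape (P, t)
X_train_s = np.array([money[0:t] for money in successful_light], dtype=float)
# Targets: pledged money at the last time index T, shape (P, 1)
Y_train_s = np.expand_dims(np.array([money[T] for money in successful_light], dtype=float), axis=1)
print(X_train_s.shape)
print(Y_train_s.shape)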

kernel = GPy.kern.Linear(input_dim=t, ARD=ard)
m_s = GPy.models.GPRegression(X_train_s, Y_train_s, kernel)
m_s.optimize()
display(m_s)

(1683, 500)
(1683, 1)


| GP_regression. | Value | Constraint | Prior | Tied to |
|---|---|---|---|---|
| linear.variances | 0.028364809962 | +ve | | |
| Gaussian_noise.variance | 0.234272559833 | +ve | | |


## Train GP with a linear kernel on failed projects

In [31]:
failed_light = failed[:N]
# Features: pledged money at the first t time steps, shape (P, t)
X_train_f = np.array([money[0:t] for money in failed_light], dtype=float)
# Targets: pledged money at the last time index T, shape (P, 1)
Y_train_f = np.expand_dims(np.array([money[T] for money in failed_light], dtype=float), axis=1)
print(X_train_f.shape)
print(Y_train_f.shape)

kernel = GPy.kern.Linear(input_dim=t, ARD=ard)
m_f = GPy.models.GPRegression(X_train_f, Y_train_f, kernel)
m_f.optimize()
display(m_f)

(2000, 500)
(2000, 1)


| GP_regression. | Value | Constraint | Prior | Tied to |
|---|---|---|---|---|
| linear.variances | 0.134220935202 | +ve | | |
| Gaussian_noise.variance | 0.00567944646715 | +ve | | |


## Set up experiment

In [27]:
m_s_test = m_s.copy()
m_f_test = m_f.copy()
project_test = projects_test[7]
yT_s_mean, yT_s_std, yT_f_mean, yT_f_std = predict_total_pledged(project_test, m_s_test, m_f_test)
# The value in parentheses is the second output of GPy's predict (a variance)
print("If successful, predicted as %0.4f (%0.4f)" % (yT_s_mean, yT_s_std))
print("If failed, predicted as %0.4f (%0.4f)" % (yT_f_mean, yT_f_std))
print("Actual total pledged money: %0.4f" % project_test.money[T])

If successful, predicted as 1.2641 (0.0263)
If failed, predicted as 0.1742 (0.1627)
Actual total pledged money: 1.0413


## Run experiment

In [32]:
m_s_test = m_s.copy()
m_f_test = m_f.copy()
N_test = 200
se_successful = 0
std_successful = 0
se_failed = 0
std_failed = 0
#np.random.shuffle(projects_test_small)
projects_test_small = projects_test[:N_test]
for project in projects_test_small:
    yT_s_mean, yT_s_std, yT_f_mean, yT_f_std = predict_total_pledged(project, m_s_test, m_f_test)
    if project.successful:
        se_successful += (project.money[T] - yT_s_mean)**2
        std_successful += yT_s_std 
    else:
        se_failed += (project.money[T] - yT_f_mean)**2
        std_failed += yT_f_std
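# se_* are sums over each class, so np.mean is a no-op here and the printed
# value is the root of the summed squared error; the spread is averaged over
# N_test rather than over the number of projects in each class.
print("Successful: RMSE = %0.4f (±%0.4f)" % (np.sqrt(np.mean(se_successful)), std_successful / float(N_test)))
print("Failed: RMSE = %0.4f (±%0.4f)" % (np.sqrt(np.mean(se_failed)), std_failed / float(N_test)))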

Successful: RMSE = 5.2954 (±0.1135)
Failed: RMSE = 0.5725 (±0.0034)
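
A per-project RMSE divides the summed squared error by the number of projects in the class before taking the square root. A minimal sketch of that computation follows; the helper name `rmse` and the toy numbers are assumptions for illustration and do not reproduce the figures above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over paired actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (hypothetical): two projects with errors of 3 and 4.
toy_rmse = rmse([0.0, 0.0], [3.0, 4.0])
```

In the experiment loop, this would amount to collecting `project.money[T]` and the predicted mean in one pair of lists per class and calling `rmse` on each pair.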
