## Training data

A crucial property of CNPs is their flexibility at test time: they can model
a whole range of functions and narrow down their prediction as we condition on
an increasing number of context observations. This behaviour is a result of the
training regime of CNPs, which is reflected in our datasets.

![](https://bit.ly/2O2Lq8c)

Rather than training on observations from a single function, as is often
the case in machine learning (for example value functions in reinforcement
learning), we will use a dataset that <h6>consists of many different functions that
share some underlying characteristics.</h6> This is visualized in the figure above.
The example on the left corresponds to a classic training regime: we have a
single underlying ground truth function (e.g. the value function of an agent) in
grey, and at each learning iteration we are provided with a handful of examples
from this function, shown in a different colour for each iteration's batch. On
the right we show an example of a dataset that could be used for
training neural processes. <h6>Instead of a single function, it consists of a
large number of functions from a function class</h6> that we are interested in
modeling. At each iteration we randomly choose one function from the dataset and
provide some observations from it for training. For the next iteration we put
that function back, pick a new one from our dataset and use this new function to
select the training data. <h6>This type of dataset ensures that our model can't
overfit to a single function but rather learns a distribution over
functions.</h6> This idea of a hierarchical dataset also lies at the core of
current meta-learning methods.
Examples of such datasets could be:

*   Functions describing the evolution of temperature over time in different
    cities of the world.
*   A dataset of functions generated by a motion capture sensor of different
    humans walking.
*   As in this particular example, different functions generated by a Gaussian
    process (GP) with a specific kernel.

<h6>We have chosen GPs for the data generation of this example because they
constitute an easy way of sampling smooth curves that share some underlying
characteristic (in this case the kernel).</h6> Apart from the data generation
for this particular example, neural processes do not make use of kernels or GPs,
as they are implemented as neural networks.
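
As a rough illustration of such a dataset, the sketch below draws a few smooth curves from a zero-mean GP prior using NumPy. It is not the notebook's data generator: the squared-exponential kernel, the length scale, the jitter term, the input range and the helper names `rbf_kernel` and `sample_gp_curves` are all assumptions made for this example.

```python
import numpy as np

def rbf_kernel(x, length_scale=0.4, sigma_f=1.0):
    """Squared-exponential kernel matrix for 1-D inputs of shape (n,)."""
    sq_dist = (x[:, None] - x[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)

def sample_gp_curves(num_curves, num_points, rng, jitter=1e-4):
    """Draws `num_curves` functions from a zero-mean GP prior on [-2, 2]."""
    x = np.linspace(-2.0, 2.0, num_points)
    # The jitter on the diagonal keeps the Cholesky factorisation stable.
    cov = rbf_kernel(x) + jitter * np.eye(num_points)
    chol = np.linalg.cholesky(cov)
    # Each column of `noise` is turned into one correlated curve.
    noise = rng.standard_normal((num_points, num_curves))
    y = (chol @ noise).T  # shape: (num_curves, num_points)
    return x, y

rng = np.random.default_rng(0)
x, curves = sample_gp_curves(num_curves=5, num_points=50, rng=rng)
print(curves.shape)
```

Every row of `curves` is a different function, yet all of them share the smoothness imposed by the kernel, which is the "underlying characteristic" the model is meant to pick up.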


In [2]:
# importing modules
import tensorflow as tf
import tensorflow_probability as tfp  # provides the decoder's output distribution
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
import matplotlib.pyplot as plt
import pandas as pd

     
"""Inputs and outputs of the CNP model call.

    Args:
      query: Array containing ((context_x, context_y), target_x) where:
          context_x: Array of shape batch_size x num_context x 1.
          context_y: Array of shape batch_size x num_context x 1.
          target_x: Array of shape batch_size x num_target x 1.
      target_y: Array of shape batch_size x num_target x 1. The ground truth
          y values of the targets.
      num_total_points: Number of target points.

    Returns:
      log_p: The log_probability of the target_y given the predicted
      distribution.
      mu: The mean of the predicted distribution.
      sigma: The standard deviation of the predicted distribution.
    """

## Data generator

In the following section we provide the code for generating our training and
testing sets using a GP to generate a dataset of functions. As we will explain
later, CNPs use two subsets of points at every iteration: one to serve as the
context, and the other as targets. In practice we found that including the
context points as targets together with some additional new points helped during
training. Our data generator divides the generated data into these two groups
and provides them in the correct format.<br>

<h6>
CNPRegressionDescription :: input of the CNP<br>
GPCurveReader :: data sampled from a GP at each iteration
</h6>
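
The context/target split described above can be sketched as follows. This is a simplified, unbatched illustration rather than the notebook's generator: the helper name `split_context_target`, the sine stand-in curve and the point counts are assumptions, and the real data would carry a leading batch dimension.

```python
import numpy as np

def split_context_target(x, y, num_context, num_extra_targets, rng):
    """Splits one curve into context points and targets that include the context.

    x, y: arrays of shape (num_points,).
    Returns ((context_x, context_y), target_x, target_y), each of shape (n, 1).
    """
    num_target = num_context + num_extra_targets
    # Pick distinct random locations on the curve.
    idx = rng.permutation(x.shape[0])[:num_target]
    target_x = x[idx, None]
    target_y = y[idx, None]
    # The first `num_context` targets double as the context set.
    context_x = target_x[:num_context]
    context_y = target_y[:num_context]
    return (context_x, context_y), target_x, target_y

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 100)
y = np.sin(3.0 * x)  # stand-in curve; the notebook samples from a GP instead
(context_x, context_y), target_x, target_y = split_context_target(
    x, y, num_context=5, num_extra_targets=10, rng=rng)
print(context_x.shape, target_x.shape)
```

Because the context points are a prefix of the targets, the loss also rewards the model for reproducing the observations it was conditioned on.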

In [10]:
df = pd.DataFrame(columns=["name", "age", "text"])
for i in range(10):
    df.loc[i] = [f"user{i}", i**2, "there is sth in the middle"]
df

Unnamed: 0,name,age,text
0,user0,0,there is sth in the middle
1,user1,1,there is sth in the middle
2,user2,4,there is sth in the middle
3,user3,9,there is sth in the middle
4,user4,16,there is sth in the middle
5,user5,25,there is sth in the middle
6,user6,36,there is sth in the middle
7,user7,49,there is sth in the middle
8,user8,64,there is sth in the middle
9,user9,81,there is sth in the middle


In [13]:
df.shape

(10, 3)

### concat note
Concatenating two tensors of shape [3,10,5] and [3,10,5]: <br>
axis=0 gives [6,10,5] <br>
axis=1 gives [3,20,5] <br>
axis=2 gives [3,10,10]
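
The shapes in this note can be checked quickly. The sketch below uses `np.concatenate` as a stand-in, on the assumption that its `axis` argument behaves like the one in `tf.concat` for this case.

```python
import numpy as np

a = np.zeros((3, 10, 5))
b = np.zeros((3, 10, 5))

# Only the concatenation axis grows; the other two dimensions must match.
shapes = {axis: np.concatenate([a, b], axis=axis).shape for axis in (0, 1, 2)}
print(shapes)
```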

In [25]:
bs = 10 
tensor = tf.zeros([bs,2,5])
context_x = tf.zeros([bs , 100 , 1])
context_y = tf.zeros([bs , 100 , 1])
encoder_input = tf.concat([context_x, context_y], axis=-1)
encoder_input.shape.as_list()

[10, 100, 2]

## Encoder

The encoder **e** is shared between all the context points and consists of an
MLP with a handful of layers. For this experiment four layers of size 128
([128, 128, 128, 128]) are enough, but we can still change the number and size
of the layers when we build the graph later on via the variable
**`encoder_output_sizes`**. Each of the context pairs **(x,
y)<sub>i</sub>** results in an individual representation **r<sub>i</sub>** after
encoding. These representations are then combined across context points to form
a single representation **r** using the aggregator **a**.

In this implementation we have included the aggregator **a** in the encoder as
we are only taking the mean across all points. The representation **r** produced
by the aggregator contains the information about the underlying unknown function
**f** that is provided by all the context points.
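
One consequence of averaging is that **r** does not depend on the order of the context points. The NumPy sketch below illustrates this with a toy MLP. The random weights, the small layer sizes and the helper names `mlp` and `encode` are assumptions made for the example and are not part of the TensorFlow implementation in this notebook.

```python
import numpy as np

def mlp(x, weights, biases):
    """Applies an MLP with ReLU on every layer except the last."""
    hidden = x
    for w, b in zip(weights[:-1], biases[:-1]):
        hidden = np.maximum(hidden @ w + b, 0.0)
    return hidden @ weights[-1] + biases[-1]

def encode(context_x, context_y, weights, biases):
    """Encodes each (x, y) pair and averages over the context axis."""
    pairs = np.concatenate([context_x, context_y], axis=-1)  # (batch, n, 2)
    r_i = mlp(pairs, weights, biases)                        # (batch, n, d)
    return r_i.mean(axis=1)                                  # (batch, d)

rng = np.random.default_rng(0)
sizes = [2, 8, 8, 4]  # toy layer sizes, far smaller than the notebook's 128
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

context_x = rng.standard_normal((3, 6, 1))
context_y = rng.standard_normal((3, 6, 1))
r = encode(context_x, context_y, weights, biases)

# Shuffling the context points leaves the representation unchanged.
perm = rng.permutation(6)
r_shuffled = encode(context_x[:, perm], context_y[:, perm], weights, biases)
print(r.shape, np.allclose(r, r_shuffled))
```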


In [3]:
class DeterministicEncoder(object):
  """The Encoder."""

  def __init__(self, output_sizes):
    """CNP encoder."""
    self._output_sizes = output_sizes

    
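  def __call__(self, context_x, context_y, num_context_points):
    """Encodes the context pairs and averages them into one representation.

    Args:
      context_x: Tensor of shape batch_size x num_context_points x 1.
      context_y: Tensor of shape batch_size x num_context_points x 1.
      num_context_points: The number of context points.

    Returns:
      representation: Tensor of shape batch_size x output_sizes[-1].
    """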

    # Concatenate x and y along the filter axes
    encoder_input = tf.concat([context_x, context_y], axis=-1)

    
    
    # Get the shapes of the input and reshape to parallelise across observations
    batch_size, _, filter_size = encoder_input.shape.as_list()
    hidden = tf.reshape(encoder_input, (batch_size * num_context_points, -1))
    hidden.set_shape((None, filter_size))

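    # Pass through MLP; AUTO_REUSE shares the weights if the encoder is
    # called more than once (e.g. for the training and the test data)
    with tf.compat.v1.variable_scope("encoder", reuse=tf.compat.v1.AUTO_REUSE):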
      for i, size in enumerate(self._output_sizes[:-1]):
        hidden = tf.nn.relu(
            tf.compat.v1.layers.dense(hidden, size, name="Encoder_layer_{}".format(i)))

      # Last layer without a ReLu
      hidden = tf.compat.v1.layers.dense(
          hidden, self._output_sizes[-1], name="Encoder_layer_{}".format(i + 1))

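    # Bring back into original shape; the last axis has the size of the final
    # layer, not of the last hidden layer left over in `size`
    hidden = tf.reshape(
        hidden, (batch_size, num_context_points, self._output_sizes[-1]))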

    # Aggregator: take the mean over all points
    representation = tf.reduce_mean(hidden, axis=1)

    return representation

## Decoder

Once we have obtained our representation **r** we concatenate it with each of
the targets **x<sub>t</sub>** and pass it through the decoder **d**. As with the
encoder **e**, the decoder **d** is shared between all the target points and
consists of a small MLP with layer sizes defined in **`decoder_output_sizes`**
(here [128, 128, 2]). The decoder outputs a mean **&mu;<sub>t</sub>** and a
standard deviation **&sigma;<sub>t</sub>** for each of the targets
**x<sub>t</sub>**. To train our CNP we use the log likelihood of the ground
truth value **y<sub>t</sub>** under a Gaussian parametrized by these predicted
**&mu;<sub>t</sub>** and **&sigma;<sub>t</sub>**.

In this implementation we bound **&sigma;<sub>t</sub>** from below at 0.1 to
prevent the predicted distribution from collapsing.
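
To make the bound and the loss concrete, here is a small NumPy sketch that applies the same `0.1 + 0.9 * softplus` transform as the decoder code in this section and evaluates a diagonal Gaussian log-likelihood by hand. The helper names `bounded_sigma` and `gaussian_log_prob` and the sample values are illustrative assumptions. The notebook itself delegates the log-probability to the distribution object.

```python
import numpy as np

def bounded_sigma(raw):
    """Maps an unconstrained decoder output to a scale of at least 0.1."""
    softplus = np.logaddexp(0.0, raw)  # numerically stable log(1 + exp(raw))
    return 0.1 + 0.9 * softplus

def gaussian_log_prob(y, mu, sigma):
    """Log density of y under independent Gaussians, summed over the last axis."""
    log_p = (-0.5 * np.log(2.0 * np.pi) - np.log(sigma)
             - 0.5 * ((y - mu) / sigma) ** 2)
    return log_p.sum(axis=-1)

raw = np.array([[-50.0], [0.0], [3.0]])  # hypothetical raw decoder outputs
sigma = bounded_sigma(raw)
mu = np.zeros_like(raw)
y = np.array([[0.0], [0.5], [-1.0]])

# The training loss is the negative mean log-likelihood of the targets.
loss = -gaussian_log_prob(y, mu, sigma).mean()
print(sigma.ravel(), loss)
```

Even for a very negative raw output the scale stays at or above 0.1, so the log-likelihood cannot be driven up without limit by shrinking the predicted spread around a target.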

In [4]:
class DeterministicDecoder(object):
  """The Decoder."""

  def __init__(self, output_sizes):
    """CNP decoder.

    Args:
      output_sizes: An iterable containing the output sizes of the decoder MLP.
          The final layer's output is split in two halves: the mean and the
          unbounded scale.
    """
    self._output_sizes = output_sizes

  def __call__(self, representation, target_x, num_total_points):
    """Decodes the individual targets.

    Args:
      representation: The encoded representation of the context
      target_x: The x locations for the target query
      num_total_points: The number of target points.

    Returns:
      dist: A multivariate Gaussian over the target points.
      mu: The mean of the multivariate Gaussian.
      sigma: The standard deviation of the multivariate Gaussian.
    """

    # Concatenate the representation and the target_x
    representation = tf.tile(
        tf.expand_dims(representation, axis=1), [1, num_total_points, 1])
    decoder_input = tf.concat([representation, target_x], axis=-1)

    # Get the shapes of the input and reshape to parallelise across observations
    batch_size, _, filter_size = decoder_input.shape.as_list()
    hidden = tf.reshape(decoder_input, (batch_size * num_total_points, -1))
    hidden.set_shape((None, filter_size))

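    # Pass through MLP; AUTO_REUSE shares the weights if the decoder is
    # called more than once (e.g. for the training and the test data)
    with tf.compat.v1.variable_scope("decoder", reuse=tf.compat.v1.AUTO_REUSE):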
      for i, size in enumerate(self._output_sizes[:-1]):
        hidden = tf.nn.relu(
            tf.compat.v1.layers.dense(hidden, size, name="Decoder_layer_{}".format(i)))

      # Last layer without a ReLu
      hidden = tf.compat.v1.layers.dense(
          hidden, self._output_sizes[-1], name="Decoder_layer_{}".format(i + 1))

    # Bring back into original shape
    hidden = tf.reshape(hidden, (batch_size, num_total_points, -1))

    # Get the mean and the unbounded scale
    mu, log_sigma = tf.split(hidden, 2, axis=-1)

    # Bound the standard deviation from below at 0.1
    sigma = 0.1 + 0.9 * tf.nn.softplus(log_sigma)

    # Get the distribution. tf.contrib was removed in TF 2.x, so this relies
    # on TensorFlow Probability (`import tensorflow_probability as tfp`).
    dist = tfp.distributions.MultivariateNormalDiag(
        loc=mu, scale_diag=sigma)
    

    return dist, mu, sigma

## Model

Now that the main building blocks (encoder, aggregator and decoder) of the CNP
are defined we can put everything together into one model. Fundamentally this
model only needs to include two main methods:

1. A method that returns the log likelihood of the targets' ground truth values
   under the predicted distribution. This method will be called during training
   as our loss function.
2. Another method that returns the predicted mean and standard deviation at the
   target locations in order to evaluate or query the CNP at test time. This
   second method needs to be defined separately as, unlike the method above, it
   should not depend on the ground truth target values.

In [5]:
class DeterministicModel(object):
  """The CNP model."""

  def __init__(self, encoder_output_sizes, decoder_output_sizes):
   
    self._encoder = DeterministicEncoder(encoder_output_sizes)
    self._decoder = DeterministicDecoder(decoder_output_sizes)

  def __call__(self, query, num_total_points, num_contexts, target_y=None):
    """Returns log_p (None when target_y is not given), mu and sigma."""

    (context_x, context_y), target_x = query

    # Pass query through the encoder and the decoder
    representation = self._encoder(context_x, context_y, num_contexts)
    dist, mu, sigma = self._decoder(representation, target_x, num_total_points)

    # If we want to calculate the log_prob for training we will make use of the
    # target_y. At test time the target_y is not available so we return None
    if target_y is not None:
      log_p = dist.log_prob(target_y)
    else:
      log_p = None

    return log_p, mu, sigma

In [9]:
dataset = pd.read_csv("AAPL_price.csv")
dataset.tail(10)
# data_train = 0 
# data_test =  0 


Unnamed: 0.1,Unnamed: 0,Date,Close
4129,4129,5/30/2023,177.300003
4130,4130,5/31/2023,177.25
4131,4131,6/1/2023,180.089996
4132,4132,6/2/2023,180.949997
4133,4133,6/5/2023,179.580002
4134,4134,6/6/2023,179.210007
4135,4135,6/7/2023,177.820007
4136,4136,6/8/2023,180.570007
4137,4137,6/9/2023,180.960007
4138,4138,6/12/2023,183.789993


### RUN

In [12]:
encoder_output_sizes = [128, 128, 128, 128]
decoder_output_sizes = [128, 128, 2]
model = DeterministicModel(encoder_output_sizes, decoder_output_sizes)



log_prob, _, _ = model(data_train.query, data_train.num_total_points,
                       data_train.num_context_points, data_train.target_y)
loss = -tf.reduce_mean(log_prob)



# _, mu, sigma = model(data_test.query, data_test.num_total_points,
#                      data_test.num_context_points)





# # Set up the optimizer and train step
# optimizer = tf.compat.v1.train.AdamOptimizer(1e-4)
# train_step = optimizer.minimize(loss)
# init = tf.compat.v1.global_variables_initializer()

AttributeError: 'int' object has no attribute 'query'

In [None]:
with tf.compat.v1.Session() as sess:
  sess.run(init)

  for it in range(TRAINING_ITERATIONS):
    sess.run([train_step])

    # Plot the predictions in `PLOT_AFTER` intervals
    if it % PLOT_AFTER == 0:
      loss_value, pred_y, var, target_y, whole_query = sess.run(
          [loss, mu, sigma, data_test.target_y, data_test.query])

      (context_x, context_y), target_x = whole_query
      print('Iteration: {}, loss: {}'.format(it, loss_value))

      # Plot the prediction and the context
      plot_functions(target_x, target_y, context_x, context_y, pred_y, var)