# Conditionally Risk-Averse Contextual Bandits Python tutorial

In this notebook we present the Python implementation of the SquareCB algorithm and the expectile loss.

# SquareCB algorithm with expectile loss

The SquareCB algorithm tackles contextual bandit problems via a reduction to regression. After observing the context at time *t*, an online regression oracle predicts a loss for each action. The algorithm then assigns higher probabilities to actions with lower predicted losses and lower probabilities to actions with higher predicted losses. The exact weighting is described at the link below and implemented in the following cell.

https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Contextual-Bandit-Exploration-with-SquareCB
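
For reference, the weighting used by the implementation in this notebook can be written as follows. The symbols are notation introduced here for illustration: $K$ is the number of actions, $\hat{y}_a$ is the predicted loss of action $a$, $a^\star$ is the action with the smallest predicted loss, and $\gamma$ is the exploration parameter.

$$
p_a = \frac{1}{K + \gamma\,(\hat{y}_a - \hat{y}_{a^\star})} \quad \text{for } a \neq a^\star,
\qquad
p_{a^\star} = 1 - \sum_{a \neq a^\star} p_a
$$

With $\gamma = 0$ every action receives probability $1/K$; as $\gamma$ grows, the probability mass concentrates on $a^\star$.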

In [None]:
# Return action distribution based on the predicted losses for playing each action
def get_distribution(loss_predictions, gamma):
  # Number of loss predictions = number of actions
  K = len(loss_predictions) 

  # Set the first one as default value
  minimum_predicted_loss = loss_predictions[0] 

  # Get the best action with the minimum predicted loss
  for i in range(K):
    if loss_predictions[i] <= minimum_predicted_loss:
      #best_action = actions[i]
      best_action_idx = i
      minimum_predicted_loss = loss_predictions[i]

  # Calculate probabilities over the actions
  p_sum = 0
  p = [0.0] * K
  for i in range(K):
    if i == best_action_idx:
      continue
    # A larger gap to the best predicted loss gives a smaller probability
    p[i] = 1/(K+gamma*(loss_predictions[i]-minimum_predicted_loss))
    p_sum += p[i]

  # The remaining probability is assigned to the best action
  p[best_action_idx] = 1 - p_sum

  return p
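
To see how `gamma` shifts the balance between exploration and exploitation, here is a minimal sketch of the same weighting, written with NumPy. The function name `squarecb_distribution` and the example predictions are hypothetical and chosen only for illustration. Unlike the loop above, `np.argmin` breaks ties in favor of the first minimum.

```python
import numpy as np

def squarecb_distribution(loss_predictions, gamma):
    """Hypothetical vectorized variant of the weighting described above."""
    preds = np.asarray(loss_predictions, dtype=float)
    K = preds.size
    best = int(np.argmin(preds))  # ties go to the first minimum here

    # Non-best actions: probability shrinks as the predicted-loss gap grows
    p = 1.0 / (K + gamma * (preds - preds[best]))

    # The best action takes whatever probability mass remains
    p[best] = 0.0
    p[best] = 1.0 - p.sum()
    return p

example_predictions = [0.2, 0.5, 0.9]  # made-up predicted losses

# gamma = 0: no preference between actions, the distribution is uniform
p_explore = squarecb_distribution(example_predictions, gamma=0.0)

# Large gamma: almost all mass goes to the action with the lowest prediction
p_greedy = squarecb_distribution(example_predictions, gamma=100.0)

print(p_explore)
print(p_greedy)
```

Because every non-best action gets at most `1/K`, the best action always keeps at least `1/K` of the probability mass, so the result is a valid distribution for any `gamma >= 0`.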

In [None]:
# Expectile loss of the regression oracle
def expectile_loss(q, prediction, true_value):
  error = true_value - prediction
  loss = 1/2 * error**2 # Squared loss
  # Over-predictions (error < 0) are weighted by q, under-predictions by 1 - q
  if error < 0:
    return q * loss
  return (1 - q) * loss
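
The asymmetry is easiest to see on two errors of equal size and opposite sign. The sketch below repeats the loss under the hypothetical name `expectile_loss_sketch` so that it runs on its own; the numbers are illustrative only.

```python
def expectile_loss_sketch(q, prediction, true_value):
    """Hypothetical standalone copy of the expectile loss, for illustration."""
    error = true_value - prediction
    squared = 0.5 * error ** 2
    # Over-predictions (error < 0) are weighted by q, under-predictions by 1 - q
    return q * squared if error < 0 else (1 - q) * squared

q = 0.2

# Prediction too high by 1 (error = -1): weighted by q
over = expectile_loss_sketch(q, prediction=1.0, true_value=0.0)

# Prediction too low by 1 (error = +1): weighted by 1 - q
under = expectile_loss_sketch(q, prediction=0.0, true_value=1.0)

print(over, under)
```

With `q = 0.2`, under-predicting a loss costs four times as much as over-predicting it by the same amount, which pushes the regressor toward pessimistic loss estimates.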

In [None]:
# gamma_scale and gamma_exponent: two hyperparameters of the SquareCB algorithm
gamma_scale = 1000
gamma_exponent = 0.5

# Expectile parameter
q = 0.2

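# T, get_context, alg, sample_action and getloss are placeholders for the
# number of rounds, the environment, and the online regression oracle
for t in range(T):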
  # Observe context 
  context = get_context() 

  # Algorithm predicts losses for the available actions
  loss_predictions = alg.predict(context) 

  # Larger gamma leads to a greedier algorithm
  gamma = gamma_scale * t**gamma_exponent 

  # Calculate action distribution from predicted losses
  distr = get_distribution(loss_predictions, gamma) 

  # Sample an action from the action distribution and also return the predicted loss for that action 
  action, predicted_loss = sample_action(distr)

  # Observe true loss from the played action
  observed_loss = getloss(context, action) 

  # The default SquareCB algorithm uses squared loss
  # To be risk-averse, we use the expectile loss instead of squared loss
  # Calculate expectile loss, q is the expectile parameter
  # Note: with q=0.5, it reduces to the default squared loss (scaled by a constant factor of 1/2)
  loss = expectile_loss(q, predicted_loss, observed_loss) 

  # Update the algorithm with the context-action-loss combination
  alg.update(context, action, loss) 
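
The loop above relies on an environment and a regression oracle that are not defined in this notebook. As a rough, runnable illustration only, the sketch below replaces them with toy stand-ins: there is no context, the "oracle" is one running loss estimate per action, and it is updated with a gradient step on the expectile loss. All names, the toy loss distributions, and the step size are assumptions made for this example, not part of the original algorithm or of Vowpal Wabbit.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 500                      # toy problem size (assumption)
q = 0.2                            # expectile parameter
gamma_scale, gamma_exponent = 10.0, 0.5
learning_rate = 0.05               # hypothetical step size

# Toy environment (assumption): action 1 has the lowest mean loss but is noisy
mean_loss = np.array([0.50, 0.40, 0.45])
noise_std = np.array([0.01, 0.50, 0.01])

# Stand-in for the regression oracle: one loss estimate per action, no context
estimates = np.zeros(K)

def squarecb_probs(preds, gamma):
    """SquareCB-style weighting, as described earlier in this notebook."""
    best = int(np.argmin(preds))
    p = 1.0 / (len(preds) + gamma * (preds - preds[best]))
    p[best] = 0.0
    p[best] = 1.0 - p.sum()
    return p

counts = np.zeros(K, dtype=int)
for t in range(T):
    # t + 1 keeps gamma nonzero in the first round (a choice made here)
    gamma = gamma_scale * (t + 1) ** gamma_exponent
    probs = squarecb_probs(estimates, gamma)
    action = rng.choice(K, p=probs)
    observed_loss = mean_loss[action] + noise_std[action] * rng.standard_normal()

    # Gradient step on the expectile loss with respect to the prediction:
    # over-predictions are weighted by q, under-predictions by 1 - q
    error = observed_loss - estimates[action]
    weight = q if error < 0 else 1 - q
    estimates[action] += learning_rate * weight * error
    counts[action] += 1

print(estimates, counts)
```

Printing `estimates` and `counts` shows what the learner believes about each action and how often it played it; changing `q` to 0.5 gives the risk-neutral behavior for comparison.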

# Expectile loss

Here we show the expectile loss and its connection to the risk measure of EVaR.

In [None]:
import numpy as np
from scipy.optimize import minimize_scalar

In [None]:
# Equation of the expectile loss
# It's an asymmetric function with an expectile parameter q
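def f(m, data):
    # q is the expectile parameter defined earlier in the notebook
    above = np.clip(data - m, a_min=0, a_max=None)  # amounts by which data exceed m
    below = np.clip(m - data, a_min=0, a_max=None)  # amounts by which data fall short of m
    return q * np.sum(np.square(above)) + (1 - q) * np.sum(np.square(below))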

# The minimizer of the above function over m is the expectile (EVaR)
def evar(data):
  res = minimize_scalar(lambda m: f(m, data), bounds=(np.min(data), np.max(data)), method='bounded')
  return res.x
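
As a quick sanity check of the minimizer, the sketch below computes expectiles of a small made-up sample. It is self-contained and takes `q` as an explicit argument instead of a global; the names `expectile_objective` and `expectile` are hypothetical and used only here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expectile_objective(m, data, q):
    """Asymmetric squared loss of the candidate value m on the data."""
    above = np.maximum(data - m, 0)  # amounts by which data exceed m
    below = np.maximum(m - data, 0)  # amounts by which data fall short of m
    return q * np.sum(above ** 2) + (1 - q) * np.sum(below ** 2)

def expectile(data, q):
    """Numerically minimize the objective over m within the data range."""
    res = minimize_scalar(
        expectile_objective,
        args=(data, q),
        bounds=(np.min(data), np.max(data)),
        method='bounded',
    )
    return res.x

sample = np.array([0.0, 1.0, 2.0, 3.0, 10.0])  # made-up data with one large value

# At q = 0.5 the objective is symmetric, so the minimizer is the sample mean
print(expectile(sample, q=0.5), sample.mean())

# At q = 0.2 values above m count less, so the minimizer moves below the mean
print(expectile(sample, q=0.2))
```

Under this weighting, lowering `q` below 0.5 pulls the expectile toward the lower end of the data, and raising it above 0.5 pulls it toward the upper end.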