In [None]:
import numpy as np
import pandas as pd
import plotnine as gg

# Part 1: Human Behavior

In this Colab, we will put into practice what we've learned in today's lecture:
* **We will first train an RL agent to perform a task, using policy evaluation and policy improvement.**
* **After that, we will fit an RL model to a real human dataset, and analyze (fake) fMRI data!**

To get started, let's first load our dataset to get it out of the way. Execute the following cell to load the dataset from GitHub into this Colab.

In [42]:
human_data = pd.read_csv("./bahrami_100.csv")

First, let's understand our task. We are working with a "4-armed bandit task". The figure below depicts what is happening on each trial of this task from participants' perspective:

<center><img src="https://github.com/trendinafrica/Comp_Neuro-ML_course/blob/main/notebooks/23-Friday/TaskOverview.png?raw=1" width=1000></center>

## Exercise 1 (*5 minutes*)

* Find a partner (turn to your neighbor)
* Together, understand the task design:
  * First, let the person sitting left explain the first two stages ("Participant choice" and "Chosen stimulus") to the person sitting right. (*2 minutes*)
  * Then, let the person sitting right explain the last two stages ("Reward" and "Inter-trial interval") to the person sitting left. (*1 minute*)
  * Lastly, talk about any questions you still have about this task. (*2 minutes*)

## Solution 1

Now, expand the cell below to see the solution.

Participants perform the task on a computer. On each trial of the task, participants see four items on the screen (which we sometimes call "bandits"). In the "participant choice" stage, participants have 4 seconds to pick one of the bandits, using four keys on their keyboard ("d", "f", "j", and "k"). Once the participant has made a choice, all bandits except the selected one disappear, and only the selected one stays on the screen for 400 milliseconds (0.4 seconds). Then, the reward is displayed: Participants can win between 1 and 100 points on each trial, depending on which bandit they choose. The reward stays on the screen for 800 milliseconds, then a fixation cross appears in the center of the screen for half a second. After the fixation cross, participants enter the next trial, which is structured in the same way.

In this task, participants do a total of 150 trials, and thereby learn which bandits tend to give more points than others, so they can maximize the points they win. Let's take a look at the dataset!

In [40]:
human_data

Unnamed: 0,id,choice,reward,rt,payoff_group,reward_c1,reward_c2,reward_c3,reward_c4
0,1,1.0,84.0,1104.0,2,84,87,42,23
1,1,2.0,90.0,1076.0,2,90,90,46,18
2,1,3.0,53.0,612.0,2,80,84,53,28
3,1,4.0,24.0,742.0,2,87,81,50,24
4,1,2.0,92.0,927.0,2,86,92,61,28
...,...,...,...,...,...,...,...,...,...
14995,100,3.0,62.0,679.0,2,47,35,62,48
14996,100,3.0,61.0,686.0,2,46,47,61,57
14997,100,3.0,70.0,600.0,2,46,35,70,43
14998,100,3.0,60.0,641.0,2,46,44,60,59


Let's understand this dataset.


## Exercise 2 (10 minutes)

* With your partner or by yourself, visualize how many points each bandit gives on each trial of the task. (You will find the columns `reward_c1` to `reward_c4` helpful for this exercise: Each column `reward_cX` indicates how much reward bandit X would have given on each trial had it been chosen.)
* To do this, plot trials (from 0-150) on the x-axis
* And plot the reward each arm would have given (from 1-100) on the y-axis
* Select a different color for each arm to distinguish them

In [None]:

In [37]:
#@title Click to show solution
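One possible way to prepare the data for this plot is sketched below. It is not necessarily the intended solution. The small `toy` table is a stand-in with the same column names as `human_data` (its second participant is made up), and the names `toy`, `one_subject`, and `long_df` are illustrative. The sketch assumes that rows appear in trial order within each participant.

```python
import pandas as pd

# Stand-in for `human_data` with the same column names (participant 2 is made up).
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2, 2],
    "reward_c1": [84, 90, 80, 84, 90, 80],
    "reward_c2": [87, 90, 84, 87, 90, 84],
    "reward_c3": [42, 46, 53, 42, 46, 53],
    "reward_c4": [23, 18, 28, 23, 18, 28],
})

# Number the trials within each participant (0, 1, 2, ...), assuming trial order.
toy["trial"] = toy.groupby("id").cumcount()

# The payoffs are assumed to be shared across participants, so one participant suffices.
one_subject = toy[toy["id"] == toy["id"].iloc[0]]

# Reshape to long format: one row per (trial, arm) pair, ready for plotting.
long_df = one_subject.melt(
    id_vars="trial",
    value_vars=["reward_c1", "reward_c2", "reward_c3", "reward_c4"],
    var_name="arm",
    value_name="reward",
)
```

With plotnine imported as `gg`, as at the top of this notebook, `gg.ggplot(long_df, gg.aes(x="trial", y="reward", color="arm")) + gg.geom_line()` should then draw one colored line per arm.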

In [None]:
# Load data (100 subjects of 1 reward schedule)
# PROVIDE task explanation figure
# PROVIDE: time on x-axis, points for each arm on the y-axis -> insight: reward payoffs change over time
# TASK 1a: Describe this figure in words. Can you explain the task based on this figure? What would an optimal strategy look like?
# TASK 1b: Plot average choices over x-axis
# TASK 1c: What do you see? How do you interpret this finding? -> insight: looks similar to plot above! -> people tend to pick some actions over others
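For TASK 1b, one option is to compute, for every trial, the share of participants who chose each arm. The sketch below uses a made-up stand-in table (`toy`) with the `id` and `choice` columns of `human_data`, and again assumes rows are in trial order within each participant. The names `toy` and `choice_share` are illustrative.

```python
import pandas as pd

# Made-up stand-in with the `id` and `choice` columns of `human_data`.
toy = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 2],
    "choice": [1.0, 2.0, 3.0, 2.0, 2.0, 3.0],
})

# Number the trials within each participant (0, 1, 2, ...).
toy["trial"] = toy.groupby("id").cumcount()

# Rows: trials; columns: arms; cells: share of participants who chose that arm.
choice_share = pd.crosstab(toy["trial"], toy["choice"], normalize="index")
```

Each row of `choice_share` sums to 1. Rows with a missing choice are dropped by `pd.crosstab`, so trials without a response do not count toward any arm.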

# Part 2: Train an RL agent to solve the same task

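In this part, we train a Q-learning agent on the same 4-armed bandit task. The agent first behaves randomly while it estimates the value of each bandit (policy evaluation). We then let it choose actions according to those values (policy improvement) and compare its behavior to that of the human participants.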
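As a rough illustration of the policy evaluation step, the sketch below lets an agent pick arms at random while updating its value estimates with a simple Q-update. It is a minimal sketch and not the agent class provided in this notebook. The payoffs (`arm_means`, fixed means plus noise), the initial values of 50, and the learning rate `alpha = 0.3` are assumptions made for illustration; the payoffs in the real task change over time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_trials, alpha = 4, 150, 0.3

# Hypothetical payoffs: each arm pays a fixed mean plus noise, clipped to 1-100 points.
arm_means = np.array([80.0, 85.0, 45.0, 25.0])
rewards = np.clip(arm_means + rng.normal(0, 4, size=(n_trials, n_arms)), 1, 100)

q = np.full(n_arms, 50.0)                  # initial value estimates (arbitrary choice)
for t in range(n_trials):
    action = rng.integers(n_arms)          # random behavioral policy
    rpe = rewards[t, action] - q[action]   # reward prediction error
    q[action] += alpha * rpe               # move the estimate toward the observed reward
```

After the loop, `q` should sit close to `arm_means` even though the agent never used its values to choose: it has evaluated the arms without yet improving its policy.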

In [None]:
# PROVIDE class for Q-learning agent: random behavioral policy, but calculates Q-values according to Bellman/Q-Update
# TASK 2a: Let the agent perform the task (fill in some pieces of code here and there to complete the loop)
# TASK 2b: Inspect the behavior (should be random)
# TASK 2c: Inspect the value function (should approximate plots above)
# TASK 2d: Describe what we have done. (-> Policy Evaluation). What is missing? -> Policy Improvement.
# TASK 2e: Implement the policy improvement step (Choose actions according to values)
# TASK 2f: Replot behavior (should now look like humans)
# TASK 2g: Describe your results in words.
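One common way to implement the policy improvement step is a softmax rule, which picks higher-valued arms more often. The sketch below is an illustration under stated assumptions and not the notebook's own agent: the helper `softmax_policy`, the payoffs `arm_means`, and the rescaling of rewards to the range 0-1 (so that $\beta = 3$ gives a moderately soft policy) are all hypothetical choices.

```python
import numpy as np

def softmax_policy(q, beta):
    """Choice probabilities that favor higher-valued arms; beta sets how strongly."""
    z = beta * (q - np.max(q))   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(1)
n_arms, n_trials, alpha, beta = 4, 150, 0.3, 3.0
arm_means = np.array([0.80, 0.85, 0.45, 0.25])   # hypothetical payoffs, scaled to 0-1
q = np.full(n_arms, 0.5)
choices = np.empty(n_trials, dtype=int)

for t in range(n_trials):
    action = rng.choice(n_arms, p=softmax_policy(q, beta))   # choose according to values
    reward = np.clip(arm_means[action] + rng.normal(0, 0.04), 0.01, 1.0)
    q[action] += alpha * (reward - q[action])                # keep evaluating while acting
    choices[t] = action

counts = np.bincount(choices, minlength=n_arms)   # how often each arm was chosen
```

In this setup, `counts` should be dominated by the two high-paying arms, in contrast to the roughly uniform counts a random policy would produce.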

# Part 3: Use RL as a model for human behavior

We have now trained an RL agent to perform the task. We next want to test if humans might be using RL in a similar way to learn the task. How can we do this?

To see if humans use RL to solve the task, we "fit" the RL model to human behavior. This means that we "squeeze" and "stretch" the RL agent until it produces behavior that corresponds to the human behavior.

In this case, the "squeezing" and "stretching" consists of increasing or decreasing the values of the *free parameters* of the model, $\alpha$ and $\beta$.
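For reference, a common form of such a model is written out below. This is an assumption, since the text above does not spell out the equations, and the agent used in this Colab may differ in details. Here $Q_t(a)$ is the estimated value of bandit $a$ on trial $t$, $a_t$ is the chosen bandit, and $r_t$ is the reward received:

$$Q_{t+1}(a_t) = Q_t(a_t) + \alpha \, \bigl(r_t - Q_t(a_t)\bigr), \qquad P(a_t = a) = \frac{\exp\bigl(\beta \, Q_t(a)\bigr)}{\sum_{a'} \exp\bigl(\beta \, Q_t(a')\bigr)}$$

In this form, $\alpha$ (often called the learning rate) controls how strongly each reward changes the value estimate, and $\beta$ (often called the inverse temperature) controls how strongly choices favor the bandit with the highest value.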

How do we know if we need to increase or decrease the values? By checking how close the behavior of the model is to human behavior. The more closely the model's behavior matches human behavior, the better the model "fits" the human dataset. We want the best possible fit, so we are looking for the values of $\alpha$ and $\beta$ that *maximize the probability* that the RL model chooses the same actions that humans have chosen.

**In other (more fancy) words, our goal is to find the values for our model parameters ($\alpha$ and $\beta$) that maximize the likelihood of the observed (human) behavior under the model.**

To do this, we first need to know how likely the observed behavior is under the model for a given set of parameter values. Once we know that, all we have to do is maximize this likelihood.
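To make this concrete, the sketch below computes the negative log-likelihood of a short sequence of choices under a Q-learning model with softmax choice. It is an illustration, not the notebook's reference solution. The function name `negative_log_likelihood`, the initial values `q_init`, the rescaling of rewards to 0-1, and the example data are all assumptions. Choices are 0-based arm indices here, so the `choice` column of the dataset (1-4) would need to be shifted by one and cast to integers.

```python
import numpy as np

def negative_log_likelihood(alpha, beta, choices, rewards, n_arms=4, q_init=0.5):
    """Negative log-likelihood of one participant's choices under Q-learning + softmax."""
    q = np.full(n_arms, q_init)
    nll = 0.0
    for action, reward in zip(choices, rewards):
        z = beta * (q - q.max())
        log_p = z - np.log(np.exp(z).sum())        # log of the softmax probabilities
        nll -= log_p[action]                       # unlikely observed choices add more
        q[action] += alpha * (reward - q[action])  # learn from the observed reward
    return nll

# Made-up choices (0-based arm indices) and rewards scaled to 0-1.
choices = np.array([0, 1, 1, 2, 1])
rewards = np.array([0.84, 0.90, 0.92, 0.50, 0.88])
nll = negative_log_likelihood(0.3, 3.0, choices, rewards)
```

A useful sanity check: with $\beta = 0$ the model chooses uniformly at random, so the result should equal the number of trials times $\log 4$.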

**TASK 1**: Calculate the likelihood of the human dataset under model parameters $\alpha=0.3$ and $\beta=3$, by filling in the blanks below.

**TASK 2**: Maximize the likelihood of the human dataset by finding the optimal parameters. Fill in the blanks below.
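One simple way to find good parameter values is a grid search: evaluate the negative log-likelihood on a grid of $\alpha$ and $\beta$ values and keep the smallest. This is a swapped-in alternative to a gradient-based loop, shown because it needs only NumPy. The sketch fits a simulated stand-in "participant" with known parameters rather than the human data; the helpers `softmax` and `nll`, the payoffs `arm_means`, and the grid ranges are assumptions.

```python
import numpy as np

def softmax(q, beta):
    z = beta * (q - q.max())
    p = np.exp(z)
    return p / p.sum()

def nll(alpha, beta, choices, rewards, n_arms=4):
    """Negative log-likelihood of the choices under Q-learning + softmax."""
    q, total = np.full(n_arms, 0.5), 0.0
    for a, r in zip(choices, rewards):
        total -= np.log(softmax(q, beta)[a])
        q[a] += alpha * (r - q[a])
    return total

# Simulate a stand-in participant with alpha = 0.3 and beta = 3 (rewards scaled to 0-1).
rng = np.random.default_rng(0)
arm_means = np.array([0.80, 0.85, 0.45, 0.25])   # hypothetical payoffs
q, choices, rewards = np.full(4, 0.5), [], []
for _ in range(150):
    a = rng.choice(4, p=softmax(q, 3.0))
    r = float(np.clip(arm_means[a] + rng.normal(0, 0.04), 0.01, 1.0))
    q[a] += 0.3 * (r - q[a])
    choices.append(a)
    rewards.append(r)

# Evaluate the loss on a grid and keep the parameter pair with the smallest value.
alphas = np.linspace(0.05, 0.95, 19)
betas = np.linspace(0.5, 10.0, 20)
grid = np.array([[nll(al, be, choices, rewards) for be in betas] for al in alphas])
i, j = np.unravel_index(grid.argmin(), grid.shape)
best_alpha, best_beta = alphas[i], betas[j]
```

With only 150 trials, `best_alpha` and `best_beta` will generally not match the simulated values exactly. A finer grid or a gradient-based optimizer can refine the estimate.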

In [None]:
# Use the same agent, set alpha and beta to the values above. calculate likelihood for one subject.
# Write the loss function: negative log likelihood.
# Set up a loop that performs SGD on the loss function.
# TASK 3: Simulate behavior from the agent with the fitted parameters.
# TASK 3b: Plot the behavior like before. Is it closer to humans?

# Part 4: Use RL as a neural model

As we saw in the lecture, there is a lot of evidence that the brain might implement an RL algorithm. Most notably, the dopamine system has been argued to calculate reward prediction errors (RPEs), such that dopamine neurons *increase* their firing rates when there is a *positive* RPE (reward is *larger* than expected), and *decrease* their firing rates when there is a *negative* RPE (reward is *smaller* than expected).

In this section, we will see if this is the case in our dataset.
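As a small illustration of what an RPE looks like in a model, the sketch below replays a sequence of choices and rewards through a Q-update and stores the RPE of every trial. The function name `trialwise_rpes`, the initial values `q_init`, and the three example trials are made up for illustration; choices are 0-based arm indices.

```python
import numpy as np

def trialwise_rpes(alpha, choices, rewards, n_arms=4, q_init=0.5):
    """Replay the observed choices and rewards and store the RPE of every trial."""
    q = np.full(n_arms, q_init)
    rpes = np.empty(len(choices))
    for t, (a, r) in enumerate(zip(choices, rewards)):
        rpes[t] = r - q[a]        # reward minus expected reward
        q[a] += alpha * rpes[t]   # update the expectation for the chosen arm
    return rpes

choices = np.array([0, 0, 1])
rewards = np.array([1.0, 1.0, 0.0])
rpes = trialwise_rpes(0.5, choices, rewards)
# rpes is [0.5, 0.25, -0.5]
```

The second RPE is smaller than the first because the agent has already raised its expectation for arm 0. The third is negative because arm 1 paid less than expected.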

TASK 1: Calculate RPEs. For each trial in the task, calculate the RPE the model is encountering. (Make sure you save the RPEs for each trial so we can later compare them to human striatal activity.)

TASK 2: Plot the model-based RPEs against the human fMRI signal (RPEs on the y-axis and BOLD signal on the x-axis). What do you see?

TASK 3: Calculate the correlation between model-based RPEs and striatal BOLD signal. What do you conclude about the hypothesis that the striatal dopamine system encodes RPEs?
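The correlation in TASK 3 can be computed with `np.corrcoef`, as in the sketch below. Both arrays are synthetic stand-ins: `rpes` is random noise in place of the model's RPEs, and `bold` is deliberately constructed to track them, so the high correlation here says nothing about real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (not real data): model RPEs and a BOLD signal built to track them.
rpes = rng.normal(0, 0.2, size=150)
bold = 0.8 * rpes + rng.normal(0, 0.1, size=150)

# Pearson correlation: the off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(rpes, bold)[0, 1]
```

A value of `r` near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates none.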

In [None]:
# TASK 4: Calculate RPEs in the model (already done; just need to save)
# TASK 4b: Plot
# TASK 4c: Calculate correlation

# Improve the model: Add forgetting

# Model comparison

# Bonus: Fit a neural network to human behavior