# Preprocessing
**Copyright 2023 (c) Naomi Chaix-Echel & Nicolas P Rougier**  
Released under a BSD 2-clause license

This notebook reads and processes the original dataset to ensure that tasks are named properly.
The original dataset is untouched and the processed dataset is saved using an alternative filename.

| Name           | Type     | Description                              |
| :------------- | :------- | :--------------------------------------- |
| **subject_id** | string   | Identification of the subject            |
| **date**       | datetime | Date when the trial was made             |
| **task_id**    | integer  | Identification of the task               |
| **P_left**     | float    | Reward probability of the left stimulus  |
| **V_left**     | float    | Reward amount of the left stimulus       |
| **P_right**    | float    | Reward probability of the right stimulus |
| **V_right**    | float    | Reward amount of the right stimulus      |
| **response**   | int      | Response (0: left, 1: right)             |
| **reward**     | int      | Reward delivered (1) or not (0)          |
| **RT**         | int      | Response time (ms)                       |


## Lottery description

For all the following types of lottery, we consider a choice between (x1, p1) and (x2, p2), xi being the value and pi being the probability:
* xi can be positive, zero or negative: -3, -2, -1, 0, +1, +2, +3
* pi can be: 0.25, 0.50, 0.75 or 1.00


### Type 1 : x1 > 0 and x2 < 0, p1 = p2

* Lottery pairs containing one lottery with potential losses (LPL) and one lottery with potential gains (LPG)
* assess the discrimination of losses from the gains
* 36 different lottery pairs.
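As a quick sanity check on the count above, the Type 1 pairs can be enumerated from the value and probability sets given in the lottery description. This is only a minimal sketch: the names `values`, `probabilities` and `type1_pairs` are illustrative and are not used elsewhere in the notebook.

```python
from itertools import product

# Value and probability sets taken from the lottery description
values = [-3, -2, -1, 0, 1, 2, 3]
probabilities = [0.25, 0.50, 0.75, 1.00]

# Type 1: one gain lottery (x1 > 0), one loss lottery (x2 < 0), same probability
type1_pairs = [
    ((x1, p), (x2, p))
    for x1, x2, p in product(values, values, probabilities)
    if x1 > 0 and x2 < 0
]

# 3 gains x 3 losses x 4 probabilities
print(len(type1_pairs))
```

The same pattern, with a different condition in the comprehension, could be used to list the pairs of the other lottery types.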

### Type 2 : p1 = p2 and x1 > x2 > 0

* LPG with a stochastic dominant option differentiating only by the x values
* assess the discrimination of positive x-values
* 12 different lottery pairs

### Type 3 : p1 = p2 and x1 < x2 < 0

* LPL with a stochastic dominant option differentiating only by the x values
* assess the discrimination of negative x-values
* 12 different lottery pairs

### Type 4 : p1 > p2 and x1 = x2 > 0

* LPG with a stochastic dominant option differentiating only by the p values
* assess the discrimination of p-values associated with positive x-values
* 12 different lottery pairs 

### Type 5 : p1 < p2 and x1 = x2 < 0

* LPL with a stochastic dominant option differentiating only by the p values
* assess the discrimination of p-values associated with negative x-values
* 18 different lottery pairs

### Type 6 : p1 < p2 and x1 > x2 > 0

* LPG with no stochastic dominant option
* 18 different lottery pairs.

### Type 7 : p1 < p2 and x1 < x2 < 0

* LPL with no stochastic dominant option
* 18 different lottery pairs.

## Import packages

In [1]:
import datetime                 # Time operations
import numpy as np              # Array operations
import pandas as pd             # Database operations
import matplotlib.pyplot as plt # Figures

## Import the shared functions notebook

In [2]:
%run "00-common.ipynb"

## Load and merge data

In [3]:
print("Loading data... ", end="")

# List of the files to load and merge
files_to_concat = ["./data/ECO/Abricot.csv", "./data/ECO/Alaryc.csv", "./data/ECO/Alvin.csv", "./data/ECO/Anubis.csv", "./data/ECO/Barnabé.csv", "./data/ECO/César.csv", 
                  "./data/ECO/Dory.csv", "./data/ECO/Eric.csv", "./data/ECO/Ficelle.csv", "./data/ECO/Gaia.csv", "./data/ECO/Havanna.csv", "./data/ECO/Gandhi.csv", 
                  "./data/ECO/Hercules.csv", "./data/ECO/Horus.csv", "./data/ECO/Iron.csv", "./data/ECO/Jeanne.csv", "./data/ECO/Joy.csv", "./data/ECO/Lassa.csv", 
                  "./data/ECO/Néma.csv", "./data/ECO/Néréis.csv", "./data/ECO/Olaf.csv", "./data/ECO/Olga.csv", "./data/ECO/Olli.csv", "./data/ECO/Patchouli.csv", 
                  "./data/ECO/Patsy.csv", "./data/ECO/Yin.csv", "./data/ECO/Yoh.csv", "./data/ECO/Bérénice.csv"]

# Load each CSV file into a DataFrame
dataframe = [pd.read_csv(filename, sep=",", decimal=',') for filename in files_to_concat]

# Concatenate the rows of the DataFrames
original_data = pd.concat(dataframe, ignore_index=True)

# Save the result to a new CSV file
original_data.to_csv("data.csv", index=False)

# Remove rows for the ECO_T (training phase), Lot11 (more complex tasks) and ECO2 (different display of the options) tasks
original_data = original_data[~original_data['tache'].str.contains("ECO_T|Lot11|ECO2")]

print("done!")


Loading data... done!
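The merge-then-filter step above can be tried on a toy example. In this minimal sketch the two small tables stand in for the per-subject CSV files; the rows are made up, and the task names `ECO_T1` and `ECO2_Lot3` are hypothetical values chosen only to trigger the exclusion pattern.

```python
import pandas as pd

# Two toy per-subject tables standing in for the CSV files (made-up rows)
first = pd.DataFrame({"nom_singe": ["Abricot", "Abricot"],
                      "tache": ["ECO_T1", "ECO_Lot1"]})
second = pd.DataFrame({"nom_singe": ["Alvin", "Alvin", "Alvin"],
                       "tache": ["ECO_Lot11", "ECO2_Lot3", "ECO_Lot10"]})

# Stack the rows and renumber the index, as in the cell above
merged = pd.concat([first, second], ignore_index=True)

# The pattern is a regular expression: a row is excluded if its task name
# contains any of the three alternatives
excluded = merged["tache"].str.contains("ECO_T|Lot11|ECO2")
kept = merged[~excluded]

print(kept["tache"].tolist())
```

Note that `ECO_Lot1` and `ECO_Lot10` survive the filter because neither contains the substring `Lot11`.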


In [4]:
print(original_data["nom_singe"].unique())

['Abricot' 'Alaryc' 'Alvin' 'Anubis' 'Barnabé' 'César' 'Dory' 'Eric'
 'Ficelle' 'Gandhi' 'Hercules' 'Horus' 'Jeanne' 'Lassa' 'Néma' 'Néréis'
 'Olaf' 'Olga' 'Olli' 'Patchouli' 'Patsy' 'Yin' 'Yoh' 'Bérénice']


In [5]:
data_filename = "./data/data.csv"

## Filter, rename & retype fields

In [6]:
# Keep only relevant fields
data = original_data[["date_debut",
                      "heure_debut",
                      "nom_singe",
                      "tache",
                      "palier",
                      "r1_stim_a",
                      "r2_stim_a",
                      "p2_stim_a",
                      "r1_stim_b",
                      "r2_stim_b",
                      "p2_stim_b",
                      "id_stim_a",
                      "id_stim_b",
                      "stim_a_x",
                      "stim_a_y",
                      "stim_b_x",
                      "stim_b_y",
                      "stim_choisi",
                      "resultat",
                      "recompense",
                      "temps_demarrage",
                      "temps_reponse",
                      "id_module"]].copy()

# Rename fields
data = data.rename(columns={"nom_singe"      :    "subject_id",
                            "date_debut"     :    "date",
                            "tache"          :    "task_id",
                            "r1_stim_a" :    "V1_left",
                            "r2_stim_a" :    "V2_left",
                            "p2_stim_a" :    "P2_left",
                            "r1_stim_b" :    "V1_right",
                            "r2_stim_b" :    "V2_right",
                            "p2_stim_b" :    "P2_right",
                            "id_stim_a":    "id_stim_gauche",
                            "id_stim_b":    "id_stim_droite",
                            "stim_a_x"  :    "stim_gauche_x",
                            "stim_a_y"  :    "stim_gauche_y",
                            "stim_b_x"  :    "stim_droite_x",
                            "stim_b_y"  :    "stim_droite_y",
                            "temps_reponse"  :    "RT",
                            "temps_demarrage":    "ST"})

# Accented subject names and their ASCII replacements
noms_a_remplacer = ["Barnabé", "Néréis", "Néma", "César", "Bérénice"]
noms_remplaces = ["Barnabe", "Nereis", "Nema", "Cesar", "Berenice"]

# Exclude test entries and the subjects listed in the pattern
data = data[~data['subject_id'].str.contains("Test_Id3|Test_Saumon|Baal|Anyanka|Yelena|Eowyn|Samael|Natasha")]

# Replace the names in the "subject_id" column
data['subject_id'] = data['subject_id'].replace(noms_a_remplacer, noms_remplaces, regex=True)


# Convert the P2_left and P2_right columns to numeric values
data['P2_left'] = pd.to_numeric(data['P2_left'], errors='coerce')
data['P2_right'] = pd.to_numeric(data['P2_right'], errors='coerce')


# Add the P1_left and P1_right columns
data['P1_left'] = 1 - data['P2_left']
data['P1_right'] = 1 - data['P2_right']

data["EV_left"] = data["V1_left"] * data["P1_left"] + data["V2_left"] * data["P2_left"]
data["EV_right"] = data["V1_right"] * data["P1_right"] + data["V2_right"] * data["P2_right"]

# Convert date type (from string to datetime64)
data["date"] = pd.to_datetime(data["date"], dayfirst=True)
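To make the probability and expected-value computations above concrete, here is a minimal sketch on a two-row toy table. The values are made up, and the `"n/a"` entry is an assumed example of an unparsable cell, included only to show what `errors="coerce"` does.

```python
import pandas as pd

toy = pd.DataFrame({
    "V1_left": [3, -2],
    "V2_left": [0, 0],
    "P2_left": ["0.25", "n/a"],   # second entry is deliberately unparsable
})

# Unparsable strings become NaN instead of raising an error
toy["P2_left"] = pd.to_numeric(toy["P2_left"], errors="coerce")

# The two outcomes of a lottery are complementary
toy["P1_left"] = 1 - toy["P2_left"]

# Expected value: each outcome weighted by its probability
toy["EV_left"] = toy["V1_left"] * toy["P1_left"] + toy["V2_left"] * toy["P2_left"]

print(toy)
```

In the first row the expected value is 3 × 0.75 + 0 × 0.25 = 2.25; in the second row the missing probability propagates, so the expected value is also missing.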

In [7]:
print(data["subject_id"].unique())

['Abricot' 'Alaryc' 'Alvin' 'Anubis' 'Barnabe' 'Cesar' 'Dory' 'Eric'
 'Ficelle' 'Gandhi' 'Hercules' 'Horus' 'Jeanne' 'Lassa' 'Nema' 'Nereis'
 'Olaf' 'Olga' 'Olli' 'Patchouli' 'Patsy' 'Yin' 'Yoh' 'Berenice']


In [8]:
# Replacement dictionary
remplacement = {
    'Abricot': 'abr',
    'Alaryc': 'ala',
    'Alvin': 'alv',
    'Anubis': 'anu',
    'Barnabe': 'bar',
    'Berenice': 'ber',
    'Cesar': 'ces',
    'Dory': 'dor',
    'Eric': 'eri',
    'Ficelle': 'fic',
    'Hercules': 'her',
    'Horus': 'hor',
    'Lassa': 'las',
    'Nema': 'nem',
    'Nereis': 'ner',
    'Olli': 'oll',
    'Patchouli': 'pac',
    'Patsy': 'pat',
    'Yoh': 'yoh',
    'Olaf' : 'ola',
    'Yin' : 'yin',
    'Jeanne' : 'jea',
    'Olga' : 'olg',
    'Gandhi' : 'gan'
}

# Replace the values in the 'subject_id' column
data['subject_id'] = data['subject_id'].replace(remplacement)

## Enrich data with actual gain or loss and subject's answer

In [9]:
# Add the "gain" column
data['gain'] = data['recompense'].apply(lambda x: x if x > 0 else 0)

# Add the "loss" column
data['loss'] = data['recompense'].apply(lambda x: x if x < 0 else 0)

# Add the "response" column, which encodes the chosen option (0: left, 1: right)
data['response'] = data.apply(lambda row: 0 if row['stim_choisi'] == row['id_stim_gauche'] else (1 if row['stim_choisi'] == row['id_stim_droite'] else None), axis=1)
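The row-wise `apply` calls above can also be written with vectorized operations, which tends to be faster on large tables. The following is a sketch of one possible alternative, not the notebook's method; the toy identifiers are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "recompense": [2, -1, 0],
    "stim_choisi": [10, 21, 99],
    "id_stim_gauche": [10, 20, 30],
    "id_stim_droite": [11, 21, 31],
})

# Keep only the positive part (gain) or the negative part (loss) of the reward
toy["gain"] = toy["recompense"].clip(lower=0)
toy["loss"] = toy["recompense"].clip(upper=0)

# 0 if the left stimulus was chosen, 1 if the right one, NaN otherwise
toy["response"] = np.select(
    [toy["stim_choisi"] == toy["id_stim_gauche"],
     toy["stim_choisi"] == toy["id_stim_droite"]],
    [0.0, 1.0],
    default=np.nan,
)

print(toy[["gain", "loss", "response"]])
```

One difference to keep in mind: with `clip`, a missing reward stays missing, whereas the lambda version maps it to 0.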


## Assign task id

In [10]:
# We assign task ids based on probabilities and values
p1, x1 = data["P1_left"], data["V1_left"]
p2, x2 = data["P1_right"], data["V1_right"]

data.loc[(p1 == p2) & (x1 == x2), "task_id"] = 0

data.loc[(p1 == p2) & (x1 <  0) & (x2 >   0), "task_id"] = 1
data.loc[(p1 == p2) & (x2 <  0) & (x1 >   0), "task_id"] = 1

data.loc[(p1 == p2) & (x2 >  0) & (x1 >   x2), "task_id"] = 2
data.loc[(p1 == p2) & (x1 >  0) & (x2 >   x1), "task_id"] = 2

data.loc[(p1 == p2) & (x1 < x2) & (x2 < 0), "task_id"] = 3
data.loc[(p1 == p2) & (x2 < x1) & (x1 < 0), "task_id"] = 3

data.loc[(p1 >  p2) & (x1 >  0) & (x1 == x2), "task_id"] = 4
data.loc[(p2 >  p1) & (x1 >  0) & (x1 == x2), "task_id"] = 4

data.loc[(p1 >  p2) & (x1 <  0) & (x1 == x2), "task_id"] = 5
data.loc[(p2 >  p1) & (x1 <  0) & (x1 == x2), "task_id"] = 5

data.loc[(p1 <  p2) & (x1 > x2) & (x2 >   0), "task_id"] = 6
data.loc[(p2 <  p1) & (x2 > x1) & (x1 >   0), "task_id"] = 6

data.loc[(p1 <  p2) & (x1 < x2) & (x2 <   0), "task_id"] = 7
data.loc[(p2 <  p1) & (x2 < x1) & (x1 <   0), "task_id"] = 7

# Replace the remaining task_id names
data['task_id'] = data['task_id'].replace({'ECO_Lot1' : 11, 'ECO_Lot2' : 12, 'ECO_Lot3' : 13, 'ECO_Lot4' : 14, 'ECO_Lot5' : 15, 'ECO_Lot6' : 16, 'ECO_Lot7' : 17, 'ECO_Lot10' : 110, 'ECO_Lot11' : 111})
# Convert the "task_id" column to integers
data['task_id'] = data['task_id'].astype(int)


# Some tasks remain unassigned -> to be defined
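The boolean-mask assignment used above can be checked on a small hand-built table. This sketch covers only types 1, 2, 4 and 6 and writes each rule in a condensed, side-symmetric form, so it is an illustration of the idea rather than a copy of the cell; the toy rows and the `-1` placeholder are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "P1_left":  [0.50, 0.75, 0.25, 0.25],
    "V1_left":  [2,    3,    1,    3],
    "P1_right": [0.50, 0.75, 0.75, 0.50],
    "V1_right": [-1,   1,    1,    2],
})
toy["task_id"] = -1  # placeholder meaning "not assigned yet"

p1, x1 = toy["P1_left"], toy["V1_left"]
p2, x2 = toy["P1_right"], toy["V1_right"]

# Type 1: same probability, one gain and one loss (on either side)
toy.loc[(p1 == p2) & (x1 * x2 < 0), "task_id"] = 1
# Type 2: same probability, two different positive values
toy.loc[(p1 == p2) & (x1 > 0) & (x2 > 0) & (x1 != x2), "task_id"] = 2
# Type 4: same positive value, different probabilities
toy.loc[(p1 != p2) & (x1 == x2) & (x1 > 0), "task_id"] = 4
# Type 6: positive values, the larger value comes with the smaller probability
toy.loc[(x1 > 0) & (x2 > 0) & ((p1 - p2) * (x1 - x2) < 0), "task_id"] = 6

print(toy["task_id"].tolist())
```

Rows that match none of the masks keep the placeholder, which makes the remaining unassigned tasks easy to spot.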

## Filter data

In [11]:
tasks = [1, 2, 3, 4, 5, 6, 7]
data = data[data["task_id"].isin(tasks)]

## Save new dataset

In [12]:
import os
filename, extension = os.path.splitext(data_filename)
filename = f"{filename}-processed{extension}"

print("Saving new dataset... ", end="")
data.to_csv(filename)
print("done!")
print("New dataset:", filename)

Saving new dataset... done!
New dataset: ./data/data-processed.csv
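The output filename above is derived from `data_filename` by inserting a suffix before the extension. The sketch below reproduces that derivation and, on a made-up two-row table written to an in-memory buffer instead of a file, shows that a CSV saved with its index can be read back with `index_col=0`.

```python
import io
import os
import pandas as pd

# Same derivation as in the cell above
data_filename = "./data/data.csv"
stem, extension = os.path.splitext(data_filename)
processed_filename = f"{stem}-processed{extension}"
print(processed_filename)

# Toy table with a non-contiguous index, as left behind by row filtering
toy = pd.DataFrame({"subject_id": ["abr", "yoh"], "task_id": [1, 7]}, index=[5, 9])

buffer = io.StringIO()
toy.to_csv(buffer)   # the index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the saved index would come back as an extra column named `Unnamed: 0`.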
