RAIN - Real & Artificial Intelligence for Neuroscience

## Create models
- This notebook will create and train Artificial Neural Networks to identify exploration using rodent and target position along with manually labeled data.

#### Requirements:

- A set of position files
- Labeled data for those position files (to train the model)

or

- Access to the example file **colabels.csv**, where we can find:
    - Position and labels for representative exploration events
    - Manual labels from 5 viewers (so far)

---
#### Load the necessary modules

In [1]:
from pathlib import Path
import rainstorm.modeling as rst

import datetime
time = datetime.datetime.now() # Get the current date and time

rainstorm.modeling successfully imported. GPU devices detected: []


---
#### 1. State your project path
You need to define the path to the same folder used in [2a-Prepare_positions](2a-Prepare_positions.ipynb), and the path to the parameters file (which contains parameters for training models).

`base` : The path to the downloaded repository.

`folder_path` : The path to the experiment folder.

`params` : The path to the parameters file.

In [2]:
base = Path.cwd()
folder_path = base / 'examples' / 'NOR'
params = folder_path / 'params.yaml'

---
#### 2. Open the params.yaml file and modify the following parameters:

The params.yaml file can be edited directly in any text editor, or you can use the following function to open the **new `params editor GUI`**.

In [3]:
rst.open_params_editor(params) # Open the parameters editor

Parameters file edited successfully at c:\Users\dhers\Desktop\RAINSTORM\examples\NOR\params.yaml


Automatic Analysis

- `models_path` : Path to the models folder containing trained neural networks  
- `analyze_with` : Model file to use for analysis (.keras format)  
- `model_bodyparts` : Bodyparts used to train the model  

Colabels Configuration

- `colabels_path` : Path to the colabels folder  
- `labelers` : List of labelers on the colabels file (as found in the columns)  
- `target` : Name of the target on the colabels file  

<div style="
    border: 2px solid #333;
    padding: 15px;
    border-radius: 8px;
    background-color: #2f2f2f;
    color: #f1f1f1;
    font-size: 16px;
">

#### 🔹 Creating your own colabels file

The colabels file is a CSV file that contains the positions of both the mice and the exploration targets, plus one or more sets of labels for the behavior you want to analyze.

If you want to train the models using your own colabels file, you can create it using the `create_colabels` function below.

All you need is:
- A folder containing the positions of the mice
- A folder for each labeler, containing the labels for the behavior you want to analyze
- A list of the targets (stationary points) present on your videos

```python
path = r'path/to/colabels_folder' # The path to the directory containing the positions folder and labelers folders
labelers = ['labeler_A', 'labeler_B', 'etc']
targets = ['tgt_1', 'tgt_2', 'etc']

rst.create_colabels(path, labelers, targets)
```

#### 💡 Note: A ready-to-use Colabels file is already available in the `models` folder. It contains positions and labels for mice on a novel object recognition task.

</div>

Data Split Settings

- `focus_distance` : Window of frames to consider around an exploration event  
- `validation` : Fraction of the data to use for validation (e.g., 0.15 = 15%)  
- `test` : Fraction of the data to use for testing (e.g., 0.15 = 15%)  

RNN Configuration

- `rescaling` : Whether to rescale the data  
- `reshaping` : Whether to reshape the data (set to True for RNN)  
- `RNN_width` : Defines the shape of the RNN  
  - `past` : Number of past frames to include  
  - `future` : Number of future frames to include  
  - `broad` : Broaden the window by skipping some frames as we stray further from the present  
- `units` : Number of neurons on each layer (list of integers, e.g., [32, 16, 8])  

Training Parameters

- `batch_size` : Number of training samples processed before updating weights  
- `dropout` : Randomly turn off a fraction of neurons (0.0-1.0, e.g., 0.2 = 20%)  
- `total_epochs` : Each epoch is a complete pass through the training dataset  
- `warmup_epochs` : Epochs with increasing learning rate at the start  
- `initial_lr` : Initial learning rate (e.g., 1e-05 = 0.00001)  
- `peak_lr` : Peak learning rate reached after warmup (e.g., 0.0001)  
- `patience` : Number of epochs to wait before early stopping if no improvement  
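
As a rough illustration of how `warmup_epochs`, `initial_lr` and `peak_lr` interact, here is a minimal sketch of a warmup schedule. It assumes a linear ramp and a constant rate after warmup; the scheduler that RAINSTORM actually uses may behave differently (for example, by decaying after the peak), and `warmup_lr` is a hypothetical helper, not part of the library.

```python
def warmup_lr(epoch, initial_lr=1e-5, peak_lr=1e-4, warmup_epochs=10):
    """Learning rate for a given epoch (assumption: linear ramp, then constant)."""
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return peak_lr
    # Fraction of the warmup completed so far, from 0.0 up to (but not including) 1.0
    progress = epoch / warmup_epochs
    return initial_lr + (peak_lr - initial_lr) * progress

# Learning rate for the first 12 epochs
schedule = [warmup_lr(epoch) for epoch in range(12)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

The rate starts at `initial_lr`, grows by the same amount each epoch, and reaches `peak_lr` once `warmup_epochs` epochs have passed.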


---
#### 3. Before training a model, we need to prepare our training data
- First, we load the dataset from the colabels file and create one 'labels' column out of all the labelers.
- Next (optional, but recommended) we can erase the rows of the dataset that are too far away from exploration events.
- Finally, we split the dataset into training, testing and validation subsets.
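
The first two steps can be sketched with NumPy. This is only an illustration under stated assumptions: the labelers are combined by averaging, an "event" is any frame whose averaged label exceeds 0.5, and `focus_mask` is a hypothetical helper. The actual logic inside `rst.prepare_data` and `rst.focus` may differ.

```python
import numpy as np

def focus_mask(labels, distance, threshold=0.5):
    """Mark the rows that lie within `distance` frames of any labeled event."""
    labels = np.asarray(labels, dtype=float)
    events = np.flatnonzero(labels > threshold)  # frames counted as exploration
    mask = np.zeros(len(labels), dtype=bool)
    for i in events:
        mask[max(0, i - distance): i + distance + 1] = True
    return mask

# Hypothetical labels from three labelers over ten frames
labelers = np.array([
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
])

labels = labelers.mean(axis=0)         # a single 'labels' column (assumed: the mean)
mask = focus_mask(labels, distance=2)  # keep 2 frames on each side of every event
print(mask.sum(), "of", len(mask), "rows kept")
```

Rows far from any event are dropped, which reduces the dominance of the "not exploring" class in the training data.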

In [4]:
# Prepare the data
dataset = rst.prepare_data(params)

# Focus on the rows near exploratory behaviour
dataset = rst.focus(params, dataset, filter_by='labels')

# Split the data
model_dict = rst.split_tr_ts_val(params, dataset)

# Save the split
rst.save_split(params, model_dict)

Focused around 'labels' events (10231 found)
Rows reduced: 167012 -> 38161
📊 Splitting data into training, validation, and test sets...
Training set:    27021 samples
Validation set:  5792 samples
Testing set:     5348 samples
Total samples:   38161
💾 Split data saved to: C:\Users\dhers\Desktop\RAINSTORM\examples\models\splits\split_2025-09-06.h5


WindowsPath('C:/Users/dhers/Desktop/RAINSTORM/examples/models/splits/split_2025-09-06.h5')

<div style="
    border: 2px solid #333;
    padding: 15px;
    border-radius: 8px;
    background-color: #2f2f2f;
    color: #f1f1f1;
    font-size: 16px;
">

If you later want to load a previous split, run:

```python
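models_folder = base / 'examples' / 'models' # The folder where the split was saved
saved_split = models_folder / 'splits' / 'split_{example_date}.h5' # Replace {example_date} with the date of the split you want to load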
model_dict = rst.load_split(saved_split)
```
</div>

---
##### We can see on the testing data that the exploratory events happen when the nose gets close to the object

In [5]:
rst.plot_example_data(model_dict['X_ts'], model_dict['y_ts'])

---
#### 4. With the training data ready, we can use TensorFlow to design our very first model
- It will look at the positions of one frame at a time, and try to decide if the mouse is exploring.
- After each batch, the prediction error (the loss) is used to adjust the network's weights, in steps scaled by the learning rate.
- We will train it for some epochs (cycles through the whole dataset) and plot how the accuracy and loss evolve.
- Also, we will be validating the training using the validation split, which contains frames that were not used for training.
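
The model in the next cell is compiled with `binary_crossentropy` as its loss. To give a feel for what that number means, here is a pure-Python sketch of the same formula. The helper and the example predictions are illustrative; Keras computes the loss internally.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between true labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Two frames: one exploring (1), one not exploring (0)
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(f"confident and right: {confident_right:.3f}")
print(f"confident and wrong: {confident_wrong:.3f}")
```

Confident mistakes are penalized far more than confident correct answers, and training pushes the weights in the direction that lowers this value.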

In [6]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense

# Build a simple neural network
model_simple = tf.keras.Sequential([
    
    # Input layer
    Input(shape=(model_dict['X_tr'].shape[1],)), 

    # Hidden layers
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(8, activation='relu'),
    
    # Output layer
    Dense(1, activation='sigmoid')
])

# Compile the model
model_simple.compile(optimizer=tf.keras.optimizers.Adam(learning_rate = 0.0001), 
                   loss='binary_crossentropy', metrics=['accuracy'])

model_simple.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 32)                416       
                                                                 
 dense_1 (Dense)             (None, 16)                528       
                                                                 
 dense_2 (Dense)             (None, 8)                 136       
                                                                 
 dense_3 (Dense)             (None, 1)                 9         
                                                                 
Total params: 1,089
Trainable params: 1,089
Non-trainable params: 0
_________________________________________________________________


#### Train the model
Each ``epoch`` is a complete pass through the entire training dataset, while the ``batch_size`` is the number of training samples the model processes before updating its weights
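
As a quick arithmetic check, the number of weight updates follows directly from these two values. The sample count below is the training set size printed by the split in this run; yours will differ if your data or split differ.

```python
import math

n_train = 27021   # training samples reported by the split above
batch_size = 128
epochs = 10

# The last batch of each epoch may be smaller, so round up
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 212 2120
```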

In [7]:
history_simple = model_simple.fit(model_dict['X_tr'], model_dict['y_tr'],
                                  epochs=10, batch_size=128,
                                  validation_data=(model_dict['X_val'], model_dict['y_val']))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Plot the training and validation loss

In [8]:
rst.plot_history(history_simple, "Simple")

#### Calculate accuracy and precision of the model

In [9]:
y_pred_simple = model_simple.predict(model_dict['X_ts'])
metrics_simple = rst.evaluate(y_pred_simple, model_dict['y_ts'], show_report = True)


--- Classification Report ---
              precision    recall  f1-score   support

           0       0.92      0.95      0.93      4024
           1       0.84      0.74      0.78      1324

    accuracy                           0.90      5348
   macro avg       0.88      0.84      0.86      5348
weighted avg       0.90      0.90      0.90      5348

-----------------------------
Accuracy: 0.8992
Precision: 0.8375
Recall: 0.7356
F1_Score: 0.7833
MSE: 0.0587
MAE: 0.1508
R2: 0.6480
-----------------------------
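
To help read the report above, the short check below recomputes two quantities from the printed numbers: the F1 score as the harmonic mean of precision and recall, and the accuracy that a trivial model would reach by always predicting "not exploring". The inputs are copied from this run's report, so they are rounded.

```python
precision, recall = 0.8375, 0.7356  # values printed in the report above
support_0, support_1 = 4024, 1324   # frames per class in the testing set

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting the majority class (0)
baseline_accuracy = support_0 / (support_0 + support_1)

print(f"F1 = {f1:.3f}")
print(f"majority-class baseline accuracy = {baseline_accuracy:.3f}")
```

Because about three quarters of the frames belong to class 0, accuracy alone looks optimistic; precision, recall and F1 for class 1 say more about how well exploration is detected.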



#### And finally, save the model

In [10]:
model_name = f'{time.date()}_simple'
rst.save_model(params, model_simple, model_name)

Model '2025-09-06_simple' saved to: C:\Users\dhers\Desktop\RAINSTORM\examples\models\trained_models\2025-09-06_simple.keras


---
#### 5. Now that we have a simple model trained, we can start building more complex models

To make our artificial networks as real as possible, we can let them see a sequence of frames to decide if the mouse is exploring
- Our `build_RNN` function uses Bidirectional LSTM layers, which let the model take the temporal sequence of frames into account
- It also implements early stopping and a learning rate scheduler, which help keep the model from overfitting
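
The early stopping idea can be sketched in a few lines of plain Python. This assumes the monitored quantity is the validation loss and that training stops after `patience` consecutive epochs without improvement; `early_stopping_epoch` and the loss values are hypothetical, and RAINSTORM's training routine may monitor something else.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: training ran to the end

# Hypothetical validation losses: improvement stalls after epoch 2
val_losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
stop = early_stopping_epoch(val_losses, patience=3)
print(f"training stops at epoch {stop}")
```

Here training stops at epoch 5, so the later value of 0.39 is never reached. A larger `patience` tolerates longer plateaus at the cost of more training time.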

We can control the RNN model by changing the following parameters in the params.yaml file:

`units` : The number of neurons on each layer of the LSTM model

`batch_size` : The number of training samples the model processes before updating its weights

`initial_lr` & `peak_lr` : The learning rate at the start of training and the peak it reaches after warmup

`total_epochs` : The maximum number of times the model will be trained on the entire training dataset

`past` & `future` : For an LSTM model, set the window size by choosing how many frames into the past and how many into the future the model sees.

`broad` : Once the window size is set, you can broaden the window by skipping some frames as it strays further from the present.
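
One way to picture the effect of these window parameters is the sketch below. It assumes `broad` acts as an exponent on the frame distance and that frames beyond the edges of a video are filled by repeating the first or last frame. Both are assumptions made for illustration, and `window_offsets` and `make_windows` are hypothetical helpers, not RAINSTORM functions.

```python
import numpy as np

def window_offsets(past, future, broad=1.0):
    """Frame offsets relative to the present; broad > 1 spreads distant frames apart."""
    back = [-int(round(i ** broad)) for i in range(past, 0, -1)]
    ahead = [int(round(i ** broad)) for i in range(1, future + 1)]
    return back + [0] + ahead

def make_windows(X, offsets):
    """Build an array of shape (frames, len(offsets), features) from a 2D array."""
    n = len(X)
    # Row index of every offset for every frame, clipped to stay inside the video
    idx = np.clip(np.arange(n)[:, None] + np.array(offsets)[None, :], 0, n - 1)
    return X[idx]

narrow = window_offsets(past=3, future=3, broad=1.0)
wide = window_offsets(past=3, future=3, broad=1.7)
print(narrow)  # [-3, -2, -1, 0, 1, 2, 3]
print(wide)    # [-6, -3, -1, 0, 1, 3, 6]

# Hypothetical data: 20 frames with 12 position features each
X = np.arange(20 * 12, dtype=float).reshape(20, 12)
windows = make_windows(X, wide)
print(windows.shape)  # (20, 7, 12)
```

Both windows contain seven frames, but the broadened one covers twice the time span without increasing the size of the model input.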

In [11]:
model_wide = rst.build_RNN(params, model_dict)

Model: "CleanedBidirectionalRNN"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_sequence (InputLayer)  [(None, 7, 12)]          0         
                                                                 
 bidirectional (Bidirectiona  (None, 7, 64)            11520     
 l)                                                              
                                                                 
 bn_0 (BatchNormalization)   (None, 7, 64)             256       
                                                                 
 dropout_0 (Dropout)         (None, 7, 64)             0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 7, 32)            10368     
 nal)                                                            
                                                                 
 bn_1 (BatchNormalization)   (None, 7, 32) 

#### Train the model

In [14]:
model_wide_name = f'{time.date()}_wide'
history_wide = rst.train_RNN(params, model_wide, model_dict, model_wide_name)

ValueError: Unknown initializer: 0.00001. Please ensure this object is passed to the `custom_objects` argument. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.

#### Plot the training and validation loss

In [None]:
rst.plot_history(history_wide, model_wide_name)
rst.plot_lr_schedule(history_wide)

#### Calculate accuracy and precision of the model

In [None]:
y_pred_wide = model_wide.predict(model_dict['X_ts_wide'])
metrics_wide = rst.evaluate(y_pred_wide, model_dict['y_ts'], show_report=True)

#### Save the wide model

In [None]:
rst.save_model(params, model_wide, model_wide_name)

---
#### 6. Finally, compare the trained models

- Since we trained on the training dataset and validated on the validation dataset, we test each model on the testing dataset, which it has never seen.

In [None]:
# Print the model results
print("Evaluate model vs testing data")
print(f"{metrics_simple} -> simple")
print(f"{metrics_wide} -> wide")

---
---
#### Our trained models are stored safely in our repository, with today's date.
We can:
- Continue on this notebook and evaluate the trained models.
- Skip the evaluation and go use the models on our data in [3b-Automatic_analysis](3b-Automatic_analysis.ipynb).

---


## Evaluate models

I see you've decided to continue on this notebook! You won't regret it.

One may think that the evaluation we did on the testing set is enough, and in many cases it is. However, since our goal is a model that resembles the labeling of an expert, it's better to compare the model's performance against all the manually labeled data we have.

In [None]:
evaluation_dict = rst.build_evaluation_dict(params)

---
#### 7. Calculate a good reference labeler
Since we want to compare the models and the labelers, we need to create a reference labeler.

This reference could be the mean of all the labelers, but then the labelers would have an unfair advantage over the models, since each labeler's own labels contribute to that mean.

To avoid this, we choose to simultaneously create a chimera labeler and a leave-one-out-mean:
- The chimera is created by randomly selecting a labeler on each row of the data.
- The leave-one-out-mean is created by averaging the remaining labelers on that same row.

This way, we can compare the chimera to the leave-one-out-mean knowing that they are independent.
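
The idea can be sketched with NumPy as follows. `chimera_and_loo_mean` is a hypothetical stand-in written for illustration, and its internals are an assumption; it is not the code behind `rst.create_chimera_and_loo_mean`.

```python
import numpy as np

def chimera_and_loo_mean(labels, seed=42):
    """labels: array of shape (rows, labelers). Returns (chimera, loo_mean)."""
    labels = np.asarray(labels, dtype=float)
    n_rows, n_labelers = labels.shape
    rng = np.random.default_rng(seed)

    # Pick one labeler at random for every row
    chosen = rng.integers(0, n_labelers, size=n_rows)
    chimera = labels[np.arange(n_rows), chosen]

    # Average the labelers that were NOT picked on that row
    loo_mean = (labels.sum(axis=1) - chimera) / (n_labelers - 1)
    return chimera, loo_mean

# Hypothetical labels: four frames scored by three labelers
manual_labels = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])
chimera, loo_mean = chimera_and_loo_mean(manual_labels)
print(chimera)
print(loo_mean)
```

On every row, the chimera's label never enters the average it is compared against, which is what keeps the two references independent.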

In [None]:
chimera_dict = rst.create_chimera_and_loo_mean(evaluation_dict['manual_labels'], seed=42)

---
#### 8. Load the models & use them to label exploration on all the available data

In [None]:
# Folder that holds the trained models (matches the save paths printed above)
models_folder = base / 'examples' / 'models'

# List the models you want to evaluate
model_paths = {
    'example_simple': models_folder / 'trained_models/example_simple.keras',
    'example_wide': models_folder / 'trained_models/example_wide.keras',
    f'{time.date()}_simple': models_folder / f'trained_models/{time.date()}_simple.keras',
    f'{time.date()}_wide': models_folder / f'trained_models/{time.date()}_wide.keras',
    # Add more models as needed...
    }

In [None]:
models_dict = rst.build_and_run_models(params, model_paths, evaluation_dict['position'])

In [None]:
evaluation_dict.update(chimera_dict)
evaluation_dict.update(models_dict)
evaluation_dict = {k: v for k, v in evaluation_dict.items() if k not in {'position', 'manual_labels'}}
print(evaluation_dict.keys())  # Check the keys to confirm the additions

---
#### 9. With all the labels organized, we can evaluate the performance of each one

In [None]:
for name, pred in evaluation_dict.items():
    metrics = rst.evaluate(pred, evaluation_dict['loo_mean'])
    print(f"{metrics} -> {name}")

We can visualize the similarity between labelers using a cosine similarity plot

In [None]:
rst.plot_cosine_sim(evaluation_dict)
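
For reference, cosine similarity treats each set of labels as a vector with one entry per frame and measures the angle between two such vectors. A minimal sketch with made-up labels:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two label vectors (1.0 means identical direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical labels for five frames
labeler_a = [0, 1, 1, 0, 1]
labeler_b = [0, 1, 0, 0, 1]

sim = cosine_similarity(labeler_a, labeler_b)
print(f"similarity = {sim:.3f}")  # similarity = 0.816
```

Note that the measure is undefined for a vector of all zeros, that is, for a labeler who never marked exploration.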

And finally, run a PCA (Principal Component Analysis) to see how much the labelers resemble each other and the mean

In [None]:
rst.plot_PCA(evaluation_dict, make_discrete=False)
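
As an illustration of what such a plot shows, the sketch below projects a few made-up label vectors onto their first two principal components using NumPy's SVD. `pca_2d` is a hypothetical helper; `rst.plot_PCA` may preprocess the data differently.

```python
import numpy as np

def pca_2d(vectors):
    """Project each row (one labeler or model) onto the first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)  # center each frame across labelers
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

# Hypothetical labels: rows are labelers, columns are frames
vectors = np.array([
    [0, 1, 1, 0, 1, 0],  # labeler A
    [0, 1, 1, 0, 1, 0],  # labeler B, identical to A
    [0, 1, 0, 0, 1, 0],  # labeler C, differs from A on one frame
    [1, 0, 0, 1, 0, 1],  # labeler D, the opposite of A
])
coords = pca_2d(vectors)

dist_ac = np.linalg.norm(coords[0] - coords[2])
dist_ad = np.linalg.norm(coords[0] - coords[3])
print(f"A to C: {dist_ac:.2f}, A to D: {dist_ad:.2f}")
```

Labelers that agree end up close together in the projection, while a labeler with a very different criterion lands far from the rest.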

Also, we can see both the models' and the labelers' performance on an example video

In [None]:
example_folder_path = base / 'examples' / 'colabeled_video'

labelers_example = {
    "labeler_A": "Example_labeler_A.csv",
    "labeler_B": "Example_labeler_B.csv",
    "labeler_C": "Example_labeler_C.csv",
    "labeler_D": "Example_labeler_D.csv",
    "labeler_E": "Example_labeler_E.csv"
}

rst.plot_performance_on_video(example_folder_path, model_paths, labelers_example, fps = 25, 
                              bodyparts = ['nose', 'left_ear', 'right_ear', 'head', 'neck', 'body'], 
                              targets = ['obj_1', 'obj_2'], plot_tgt = "obj_2")

---
---
#### Once we get to this point, we should have selected our favorite model.
We can move on to the next notebook, [3b-Automatic_analysis](3b-Automatic_analysis.ipynb), and use the chosen model to label our position files.

---
RAINSTORM - Created on Dec 12, 2023 - @author: Santiago D'hers