This repo implements a family of neural components for the hierarchical dialogue models described in "Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes" by Cao et al., ACL 2019.

      @inproceedings{cao-etal-2019-observing,
        author    = {Cao, Jie and Tanana, Michael and Imel, Zac E. and
                     Poitras, Eric and Atkins, David C. and Srikumar, Vivek},
        title     = {Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes},
        booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
        year      = {2019}
      }
Besides replicating the results on the psychotherapy dataset used in our paper, we also offer a guideline for building models with SOTA neural components for conversational analysis in other domains.


Part I. Usage

Required Software

  • Install pyenv or another Python environment manager

In our case, we use pyenv and its plugin pyenv-virtualenv to set up the Python environment; please follow their installation instructions for details. Alternative environment managers such as conda will also work.

  • Install required packages
pyenv install 2.7.12
pyenv virtualenv 2.7.12 py2.7_tf1.4

# in our default setting, we use `pyenv activate py2.7_tf1.4` to
# activate the environment; please change this according to your preference.
pyenv activate py2.7_tf1.4
pip install tensorflow-gpu==1.4.0 spacy pandas ujson h5py sklearn matplotlib
  • Check out this project.
    git clone therapist-observer

The tensorflow folder is the source code directory for the neural models.

The Expt folder is for experiment management: it contains the commands (Expt/psyc_scripts/commands) and config files (Expt/psyc_scripts/configs) used to launch experiments, and it stores all experiment outputs. Apart from the global variables defined under Expt/psyc_scripts/commands/, all model hyperparameters and related configurations are assigned in the config files in Expt/psyc_scripts/configs, each of which corresponds to a model. For a detailed description of the folders under Expt, please refer to the Expt README file.

Data Preprocessing

The preprocessing pipeline consists of the following sub-steps:

  0. Put the original data into Expt/data/psyc_ro/download/data_filename
  1. Data Transformation
  2. Dataset Split and Placement
  3. Tokenization
  4. Extra Preprocessing

The following command runs them in sequence to complete the preprocessing pipeline.
# the whole pipeline finishes in about 30 minutes.
cd Expt/psyc-scripts/commands/

When re-executing the pipeline, finished sub-steps are skipped because their output folders already exist. To force a step to re-run, manually delete its output folder.

For more details on preprocessing, please refer to the README of the commands folder.
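The skip-on-existing-output convention described above can be sketched in Python (a minimal illustration; `run_step` and its arguments are hypothetical helpers, not the repo's actual scripts):

```python
import os

def run_step(name, out_dir, action):
    """Run one preprocessing sub-step unless its output folder already exists.

    Mirrors the pipeline convention described above: a finished sub-step is
    detected by the presence of its output directory, so deleting that
    directory forces the step to re-run.
    """
    if os.path.isdir(out_dir):
        print("skipping %s: %s already exists" % (name, out_dir))
        return False
    os.makedirs(out_dir)
    action(out_dir)  # e.g. transform, split, or tokenize into out_dir
    return True
```

Calling `run_step` a second time with the same output directory is a no-op, matching the re-execution behavior above.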

Preparing Embedding

# download glove.840B.300d into $RO_DATA_DIR,
# WORD_EMB_FILE in each config file will point to the path of this downloaded file

# download the elmo weights and options file into $DATA_DIR/psyc_elmo
# ELMO_OPTION_FILE and ELMO_WEIGHT_FILE will point to the downloaded elmo weights and options file

# prepare vocabulary and elmo for training
# generates vocabulary embeddings in the $VOCAB_DIR set in the corresponding config file,
# which can be used by any task with $CONTEXT_WINDOW = 8; here we take our selected model for categorizing client codes as an example.
./ ../configs/categorizing/selected/

# Commands ending with "gpuid" mean CUDA_VISIBLE_DEVICES will be specified by a second GPUID argument.
# ./ ../configs/categorizing/selected/ 1

The above commands mainly prepare the vocabulary and build ELMo embeddings for every sentence and every token. With ELMo enabled, this step may take about 25 minutes and around 12 GB of GPU memory.

You only need to redo the preparation when you update the embeddings, re-tokenize the data, or want to build a vocabulary for a larger context window. Once $VOCAB_DIR is generated, this vocabulary can be reused by other recipes by pointing $VOCAB_DIR to this vocab folder.

All of the following embedding-related configurations in the config file affect vocabulary preparation.


By default, we use glove.840B.300d, which is the default value of WORD_EMB_FILE in our config files. To use other word embeddings, please change this configuration and redo the preparation.


By default, these two variables point to the default download location of the ELMo files. If using a domain-specific ELMo or another pretrained ELMo, make sure to change the above two variables in the config file and redo the preparation.


It is recommended to re-prepare the vocabulary when changing the window size (e.g., by simply setting $CONTEXT_WINDOW=16), because when generating sliding-window dialogue segments, the words in the last $CONTEXT_WINDOW utterances of a dialogue have a slight impact on word frequency.
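To see why the window size affects word frequency, consider this small sketch of sliding-window segment generation and vocabulary counting (function names are illustrative, not the repo's actual code):

```python
from collections import Counter

def sliding_segments(utterances, context_window):
    """Yield (context, response) pairs, with at most `context_window` context utterances."""
    for i in range(1, len(utterances)):
        context = utterances[max(0, i - context_window):i]
        yield context, utterances[i]

def vocab_counts(dialogues, context_window):
    """Count word frequency over all generated segments.

    A larger window lets earlier utterances appear in more segments, so the
    counts shift slightly with `context_window` -- hence the advice above to
    re-prepare the vocabulary after changing the window size.
    """
    counts = Counter()
    for utterances in dialogues:
        for context, response in sliding_segments(utterances, context_window):
            for utterance in context:
                counts.update(utterance.split())
            counts.update(response.split())
    return counts
```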

For more details about the configuration, please refer to the README on configs.


Training from scratch

# all training commands take a single argument
./ <config_file>

# training from scratch; see `tensorflow/classes/` for details of each argument in the config file
# Again, we use the selected model for categorizing client codes as an example, ../configs/categorizing/selected/
# $CONFIG_DIR will be created; train.log shows the training progress
# $CONFIG_DIR/models/ saves the models and checkpoints every $STEPS_PER_CHECKPINTS batches
./ ../configs/categorizing/selected/

# Commands ending with "gpuid" mean CUDA_VISIBLE_DEVICES will be specified by a second GPUID argument.
./ ../configs/categorizing/selected/ 1

Worth mentioning: during training, the best model with respect to each metric is saved in $CONFIG_DIR/models/. $CONFIG_DIR must be set in the model config file.

model prefix = $ALGO + sub_model_prefix.

$ALGO is just a name to identify your model; see tensorflow/classes/ for more details. sub_model_prefix is related to the metrics used for evaluation and follows the pattern "_A_B":

# A can be in {P, R, F1, R@K}
# B can be in {macro, weighted_macro, micro} or any individual MISC label.

Hence, sub_model_prefix can be _F1_macro, which is what we used for our performance evaluation.
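The prefix composition can be sketched as follows (a hypothetical helper for illustration, not part of the repo):

```python
VALID_METRICS = {"P", "R", "F1"}                       # plus R@K patterns such as "R@3"
VALID_AVERAGES = {"macro", "weighted_macro", "micro"}  # plus any individual MISC label

def sub_model_prefix(metric, average):
    """Build the '_A_B' suffix appended to $ALGO, e.g. '_F1_macro'."""
    if metric not in VALID_METRICS and not metric.startswith("R@"):
        raise ValueError("unknown metric: %r" % metric)
    return "_%s_%s" % (metric, average)
```

For example, `"GMGRU_H" + sub_model_prefix("F1", "macro")` yields the saved-model prefix `GMGRU_H_F1_macro`.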

Analysis for training

# for analyzing the training log for Patient (client) models
python $ROOT_DIR/Expt/stats_scripts/ train.log

# for analyzing the training log for Therapist models
python $ROOT_DIR/Expt/stats_scripts/ train.log

The whole training lasts around 20 hours on a V100 GPU. The commands above analyze train.log and print the current best performance.

Resume Training from a checkpoint model

# resume training from a saved checkpoint, matched by model file name with prefix $MODEL_PREFIX_TO_RESTORE
./ <config_file> sub_model_prefix

# The sub_model_prefix argument is optional; when it is not given, the saved model with the best loss is loaded.
# However, the model with the smallest loss may not give the best performance. You can resume from the model with respect to the best metric.
./ ../configs/categorizing/ _F1_macro


# For evaluating a trained model, sub_model_prefix follows the same convention as above
./ <config_file> sub_model_prefix

# evaluate the saved model on the dev set with respect to macro F1.
./ ../configs/categorizing/selected/ _F1_macro

# dev on test means running the same evaluation on the test set.
./ ../configs/categorizing/ _F1_macro

This script can be manually invoked once the model to be restored has been saved. After evaluation, a dev_{model_name}.log is generated in the $CONFIG_DIR/training folder; results on the dev set appear in $CONFIG_DIR/results, and results on the test set in $CONFIG_DIR/results_on_test.

Part II. Experiment Designing

The two tasks in our paper are distinguished by the following configurations in the config file:

All selected recipes are in Expt/psyc-scripts/configs/categorizing/selected/ and Expt/psyc-scripts/configs/forecasting/selected/.

You can follow the steps above to cook each of them. Worth mentioning: if $VOCAB_DIR is already built, please skip the preprocessing and preparation steps; only training and evaluation are required. If you would like to try a different tokenization or embedding, redo from the corresponding step.

# the categorization task will use the last utterance (response) to be labeled
# the forecasting task will not use the last utterance (response) to be labeled
# `x` just means switch on; leave it empty to switch off

# We always use the speaker information for both context and response

# decode_goal in ['SPEAKER','ALL_LABEL','P_LABEL','T_LABEL','SEQ_TAG']
# use T_LABEL for therapist code only
# use P_LABEL for patient code only

We report the performance of the selected models in our paper below. For a more detailed description of each configuration, please refer to the README for the config files.

In the names of the selected models, the last character 'C' or 'T' means client or therapist, and the second-to-last character 'C' or 'F' means the categorizing or forecasting task. The remaining part of the name is an id distinguishing different neural architectures. See the paper for more details.
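This naming convention can be decoded mechanically (a hypothetical helper for illustration; it assumes a name ending with both the task and speaker characters, e.g. `GMGRU_H_C_T`):

```python
def decode_model_name(name):
    """Split a selected-model name like 'GMGRU_H_C_T' into its parts.

    The last character marks the speaker ('C' = client, 'T' = therapist),
    the second-to-last marks the task ('C' = categorizing, 'F' = forecasting),
    and the remainder identifies the neural architecture.
    """
    arch, task, speaker = name.rsplit("_", 2)
    return {
        "architecture": arch,
        "task": {"C": "categorizing", "F": "forecasting"}[task],
        "speaker": {"C": "client", "T": "therapist"}[speaker],
    }
```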


For the client, the best model does not need any word or utterance attention.

| Method | macro | FN | CT | ST |
| --- | --- | --- | --- | --- |
| Majority | 30.6 | 91.7 | 0.0 | 0.0 |
| Xiao et al. (2016) | 50.0 | 87.9 | 32.8 | 29.3 |
| BiGRU_generic_C | 50.2 | 87.0 | 35.2 | 28.4 |
| BiGRU_ELMo_C | 52.9 | 87.6 | 39.2 | 32.0 |
| Can et al. (2015) | 44.0 | 91.0 | 20.0 | 21.0 |
| Tanana et al. (2016) | 48.3 | 89.0 | 29.0 | 27.0 |
| CONCAT_C_C | 51.8 | 86.5 | 38.8 | 30.2 |
| GMGRU_H_C_C | 52.6 | 89.5 | 37.1 | 31.1 |
| BiDAF_H_C_C | 50.4 | 87.6 | 36.5 | 27.1 |
| Our Best | 53.9 | 89.6 | 39.1 | 33.1 |
| Change | +3.5 | -2.1 | +3.9 | +3.8 |

For the therapist, the best model uses GMGRU_H for word attention and ANCHOR42 for utterance attention.

| Method | macro | FA | RES | REC | GI | QUC | QUO | MIA | MIN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Majority | 5.87 | 47.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Xiao et al. (2016) | 59.3 | 94.7 | 50.2 | 48.3 | 71.9 | 68.7 | 80.1 | 54.0 | 6.5 |
| BiGRU_generic_T | 60.2 | 94.5 | 50.5 | 49.3 | 72.0 | 70.7 | 80.1 | 54.0 | 10.8 |
| BiGRU_ELMo_T | 62.6 | 94.5 | 51.6 | 49.4 | 70.7 | 72.1 | 80.8 | 57.2 | 24.2 |
| Can et al. (2015) | - | 94.0 | 49.0 | 45.0 | 74.0 | 72.0 | 81.0 | - | - |
| Tanana et al. (2016) | - | 94.0 | 48.0 | 39.0 | 69.0 | 68.0 | 77.0 | - | - |
| CONCAT_C_T | 61.0 | 94.5 | 54.6 | 34.3 | 73.3 | 73.6 | 81.4 | 54.6 | 22.0 |
| GMGRU_H_C_T | 64.9 | 94.9 | 56.0 | 54.4 | 75.5 | 75.7 | 83.0 | 58.2 | 21.8 |
| BiDAF_H_C_T | 63.8 | 94.7 | 55.9 | 49.7 | 75.4 | 73.8 | 80.0 | 56.2 | 24.0 |
| Our Best | 65.4 | 95.0 | 55.7 | 54.9 | 74.2 | 74.8 | 82.6 | 56.6 | 29.7 |
| Change | +5.2 | +0.3 | +3.9 | +3.8 | +0.2 | +2.8 | +1.6 | +2.6 | +18.9 |


For both client and therapist, the best forecasting model uses no word attention and SELF42 utterance attention.

| Method | Dev | Dev | Test (macro) | Test (FN) | Test (CT) | Test (ST) |
| --- | --- | --- | --- | --- | --- | --- |
| CONCAT_F_C | 20.4 | 30.2 | 43.6 | 84.4 | 23.0 | 23.5 |
| HGRU_F_C | 19.9 | 31.2 | 44.4 | 85.7 | 24.9 | 22.5 |
| GMGRU_H_F_C | 19.4 | 30.5 | 44.3 | 87.1 | 23.3 | 22.4 |
| Forecast_C | 21.1 | 31.3 | 44.3 | 85.2 | 24.7 | 22.7 |

Except for R@3, all other columns are F1 scores.

| Method | R@3 | macro | FA | RES | REC | GI | QUC | QUO | MIA | MIN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CONCAT_F_T | 72.5 | 23.5 | 63.5 | 0.6 | 0.0 | 53.7 | 27.0 | 15.0 | 18.2 | 9.0 |
| HGRU_generic_F_T | 76.8 | 24.0 | 71.0 | 2.7 | 20.5 | 58.8 | 27.5 | 12.9 | 15.2 | 1.6 |
| HGRU_F_T | 76.0 | 28.6 | 71.4 | 12.7 | 24.9 | 58.3 | 28.8 | 5.9 | 17.4 | 9.7 |
| GMGRU_H_F_T | 76.6 | 26.6 | 72.6 | 10.2 | 20.6 | 58.8 | 27.4 | 6.0 | 8.9 | 7.9 |
| Forecast_T | 77.0 | 31.1 | 71.9 | 19.5 | 24.7 | 59.2 | 29.1 | 16.4 | 15.2 | 12.8 |

Part III. Usage for Other Datasets or Tasks

Building Data Input

Preprocessing your own dataset into a DSTC-like conversational JSON format is the main job to do before modeling.

```json
{
  "correct_seq_labels": [],
  "options-for-correct-answers": [
    {
      "tokenized_utterance": "it 's just",
      "codes": [
        {
          "origin_code": "GI",
          "translated_code": "giving_info",
          "coder_order": [
            {
              "order_id": 1,
              "coder_id": "ms",
              "cid": 72427
            }
          ]
        }
      ],
      "uid": "(BAER_936)_31_5_T_49_51",
      "agg_label": "giving_info",
      "speaker": "T",
      "snt_id": 9878
    }
  ],
  "example-id": "(BAER_936)_(T, 27, 3)-(T, 31, 51)",
  "messages-so-far": [
    {
      "tokenized_utterance": "mm - hmm",
      "codes": [
        {
          "origin_code": "FA",
          "translated_code": "facilitate",
          "coder_order": [
            {
              "order_id": 1,
              "coder_id": "ms",
              "cid": 72411
            }
          ]
        }
      ],
      "uid": "(BAER_936)_27_9_T_3_4",
      "agg_label": "facilitate",
      "speaker": "T",
      "snt_id": 5
    }
  ],
  "correct_labels": [ ... ],
  "pred_probs": [
    {
      "label_index": 2,
      "label_name": "reflection_complex",
      "prob": 0.2700542211532593
    },
    {
      "label_index": 3,
      "label_name": "reflection_simple",
      "prob": 0.100542211532593
    }
  ]
}
```
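The `pred_probs` field is what R@K metrics such as R@3 are computed from; a minimal sketch of recall-at-K over one example (illustrative only, not the repo's actual evaluation code):

```python
def recall_at_k(pred_probs, correct_labels, k):
    """Return 1.0 if any correct label is among the top-k predictions, else 0.0.

    `pred_probs` is a list of {'label_name': ..., 'prob': ...} dicts as in the
    JSON example above; `correct_labels` is a list of gold label names.
    Averaging this over all examples gives the corpus-level R@K.
    """
    top_k = sorted(pred_probs, key=lambda d: d["prob"], reverse=True)[:k]
    top_names = {d["label_name"] for d in top_k}
    return 1.0 if top_names & set(correct_labels) else 0.0
```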

Our current code base uses feed_dict-based TensorFlow inputs. In the future, we will upgrade it with newer TensorFlow features, such as Estimators and TensorFlow Serving.

Model Designing

Our code base allows users to build conversational baseline models without writing much TensorFlow code. For all supported model components, creating a customized config file is the only thing needed to build a model for your dataset.
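As an illustration of this config-driven style, a shell-style `KEY=value` config file can be loaded into a dict that switches components on and off. The sketch below is hypothetical (`load_config`, `is_on`, and keys such as `USE_ELMO` are illustrative, not the repo's actual code), following the `x` = on / empty = off convention used in the config files:

```python
def load_config(path):
    """Parse a shell-style KEY=value config file into a dict.

    Lines starting with '#' are comments. An empty value means a switch is
    off; the value 'x' means it is on.
    """
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

def is_on(config, key):
    """Check a switch under the `x` = on / empty = off convention."""
    return config.get(key, "") == "x"
```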

Hierarchical Encoder

Various Attention Mechansims

Various Embeddings

  • Domain Specific Glove

  • Domain Specific ELMo

Known Issues (To be moved to issues)

  1. Known issues about spaCy with Python 2.7.5

Please use Python 2.7.12. Since Python 2 will be dropped in January 2020, we will try to test our code on Python 3 and publish a new repo for Python 3.

