
Analyze Dialogue Act Recognition (DAR) from audio and text features, compare how different features affect predictions of DAR tags.


Dialogue Act Recognition

This project analyzes Dialogue Act Recognition (DAR) from audio and text features and compares how different features affect the prediction of DAR tags.


This project uses Jupyter Notebook and Python 3.10.1 for analysis; please make sure both are installed. Main packages used in this project: parselmouth, pandas, matplotlib, scikit-learn, tensorflow. To run the project, install the required packages with pip:

pip install <project_name>

Alternatively, install all required packages automatically from environment.yml:

conda env update --file environment.yml --prune


The analysis lives in two main Jupyter notebooks: feature_extraction_and_analysis.ipynb and classification.ipynb.

Feature Extraction and Analysis

feature_extraction_and_analysis.ipynb has 3 main components:

  1. Select audio features
  2. Preprocessing text and audio features
  3. Analysis & Hypothesis

Select audio features

For each .wav file, the following features are extracted with parselmouth: maximum pitch, mean pitch, standard deviation of pitch, maximum intensity, mean intensity, standard deviation of intensity, speaking rate, jitter, shimmer, and HNR. Speaking rate is computed from the total duration of the audio and the aggregated length of all corresponding transcripts. All computed sound features are saved in sound_features.csv for later preprocessing.
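The speaking-rate computation described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name `speaking_rate` is hypothetical, and it assumes transcript "length" is counted in whitespace-separated words (the notebook may count characters or tokens instead).

```python
def speaking_rate(transcripts, duration_seconds):
    """Return words per second over all transcripts of one recording.

    Assumption: length is measured in whitespace-separated words.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Aggregate the length of every transcript that belongs to the recording.
    total_words = sum(len(text.split()) for text in transcripts)
    return total_words / duration_seconds


# Two transcripts (1 + 4 words) spoken over 2.5 seconds of audio.
rate = speaking_rate(["uh-huh", "i think so too"], duration_seconds=2.5)
```

Guarding against a non-positive duration avoids a division by zero on empty or corrupt recordings.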

Preprocessing text and audio features

For text features, all LIWC features are selected except the identifier features (start_time, end_time). LIWC features whose total of zero values is >= 70 are then removed to make the input vectors less sparse and to shorten the time needed to converge. The preprocessed text features are stored in processed_train_text_features.csv. This preprocessing step is applied only to the training dataset (train.csv), not to the testing dataset (test.csv).
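A sparse-column filter of this kind could look like the sketch below. It assumes the threshold of 70 is a percentage of rows that are zero; if the notebook uses an absolute count, the comparison would change accordingly. The function name `drop_sparse_columns` and the toy column names are hypothetical.

```python
import pandas as pd


def drop_sparse_columns(features, max_zero_percent=70.0):
    """Keep only columns whose share of zero values is below the threshold.

    Assumption: the threshold is a percentage of rows, not an absolute count.
    """
    zero_percent = (features == 0).mean() * 100
    keep = zero_percent[zero_percent < max_zero_percent].index
    return features[keep]


# Toy LIWC-style features (column names are made up for illustration).
train = pd.DataFrame({
    "certain": [0.0, 1.2, 0.0, 3.1],  # 50% zeros, kept
    "swear": [0.0, 0.0, 0.0, 0.4],    # 75% zeros, dropped
    "pronoun": [5.0, 2.0, 4.0, 1.0],  # 0% zeros, kept
})
filtered = drop_sparse_columns(train)
```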

For audio features, recordings with a total duration < 1 second are removed, using the speaking_rate variable computed earlier. The preprocessed sound features are stored in processed_sound_features.csv.

For convenience, the text and audio features are combined with an inner join, which drops any utterance that lacks either a transcript or an audio file. The 10 most common tags are then computed and kept, and all other tags are removed from the joined dataset. The combined features are stored in combined_processed_train_filtered.csv and combined_processed_test_filtered.csv.
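The join-and-filter step might be sketched like this with pandas. The key column `utterance_id`, the feature columns, and the function name `combine_and_filter` are assumptions for illustration; only `da_tag` appears in this README. The sketch also assumes the most common tags are counted on the joined data.

```python
import pandas as pd


def combine_and_filter(text_df, audio_df, key="utterance_id",
                       tag_col="da_tag", top_n=10):
    """Inner-join text and audio features, then keep the top_n most common tags."""
    # The inner join drops rows that are missing on either side.
    combined = text_df.merge(audio_df, on=key, how="inner")
    top_tags = combined[tag_col].value_counts().nlargest(top_n).index
    return combined[combined[tag_col].isin(top_tags)].reset_index(drop=True)


text_df = pd.DataFrame({
    "utterance_id": [1, 2, 3, 4],
    "da_tag": ["sd", "sd", "b", "x"],
    "certain": [0.1, 0.0, 0.3, 0.2],
})
audio_df = pd.DataFrame({
    "utterance_id": [1, 2, 3, 5],
    "mean_pitch": [180.0, 175.5, 210.2, 190.0],
})
# top_n=1 only to keep the toy example small; the project keeps 10 tags.
combined = combine_and_filter(text_df, audio_df, top_n=1)
```

Utterances 4 and 5 disappear because each exists in only one table, and tag b is removed because it is not among the most common tags in this toy data.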

Analysis & Hypothesis

LIWC and audio feature analysis: the level of certainty and the speaking intensity are compared across tags. The detailed visual analysis and conclusions are in the notebook.

Classification & Results

classification.ipynb has 2 main components:

  1. Model Training
  2. Model Testing

Model Training

Extra preprocessing steps: numerical features are normalized, and the target variable da_tag is one-hot encoded.
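These two steps can be sketched with NumPy as below. The README only says the features are "normalized", so z-score standardization is an assumption here (min-max scaling is equally plausible), and the helper names `standardize` and `one_hot` are hypothetical.

```python
import numpy as np


def standardize(train_x, other_x):
    """Z-score both arrays using statistics from the training data only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by 0
    return (train_x - mean) / std, (other_x - mean) / std


def one_hot(tags):
    """Return the sorted class names and a one-hot matrix for the tags."""
    classes, index = np.unique(np.asarray(tags), return_inverse=True)
    return classes, np.eye(len(classes))[index.ravel()]


train_x = np.array([[1.0, 10.0], [3.0, 10.0]])
test_x = np.array([[2.0, 12.0]])
train_z, test_z = standardize(train_x, test_x)

classes, y = one_hot(["sd", "b", "sd", "%"])
```

Reusing the training mean and standard deviation for the test set keeps test statistics from leaking into preprocessing.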

Three models are trained: Speech Features Only, Text Features Only, and Speech + Text Features. Each is a small neural network with 15210 parameters in total. The models are trained for 50 epochs each with the Adam optimizer and categorical cross-entropy loss. To prevent overfitting, a Dropout layer follows each Dense layer.
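To see where a parameter total for a stack of Dense layers comes from, the count can be worked out by hand: each Dense layer contributes `inputs * units` weights plus `units` biases, and Dropout layers add no parameters. The sketch below uses hypothetical layer sizes, not the notebook's actual architecture; only the 10-unit output (one unit per kept tag) follows from this README.

```python
def dense_param_count(input_size, layer_sizes):
    """Count the weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = input_size
    for units in layer_sizes:
        total += fan_in * units + units  # weight matrix + bias vector
        fan_in = units
    return total


# Hypothetical sizes: 20 input features, two hidden layers, 10 output tags.
total = dense_param_count(input_size=20, layer_sizes=[64, 32, 10])
```

Because only the first layer depends on the input size, the three models differ in parameter count only through that first weight matrix.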

General structure of the model (the input size varies between models):

[Figure: model architecture diagram]

Model Testing

For each model trained in the previous component, accuracy and macro F1 score are computed. The Speech + Text Features model (the best-performing model) was then selected for a confusion matrix and a per-class analysis (precision, recall, F1 score). Findings and the answer to 3c can be found at the end of this section.
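For reference, accuracy and macro F1 can be computed from scratch as in the sketch below. It is a dependency-free illustration with hypothetical function names and toy labels; the notebook may well use a library implementation instead.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every tag seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true tag t was missed
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["sd", "sd", "b", "b", "sv"]
y_pred = ["sd", "b", "b", "b", "sd"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging weights every tag equally, so a rare tag that is never predicted correctly (sv in this toy data) pulls the score down much more than it affects accuracy.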

Notebook Structure


[Figure: notebook structure]


Confusion Matrix

[Figure: confusion matrix]

Per class Accuracy

[Figure: per-class precision, recall, and F1 score]

According to the diagram above, tag b (Acknowledge) has the highest precision, recall, and F1 score. Classes such as sd also have a high F1 score. These classes may be easier to predict than others because they carry more audio and text cues. For example, acknowledgements might have a significantly higher intensity than other tags.

From the classification report and confusion matrix above, the model performed worst on sd^e (Statement expanding y/n answer): every sd^e utterance was classified as sd (Statement-non-opinion). The model probably finds the two tags too similar to tell apart, since both are statements. One possible fix is to add the tag of the previous sentence to the feature set, so the model can distinguish whether the current sentence is an answer to a question or a standalone statement.
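The previous-tag feature suggested above could be built along these lines. This is a sketch of the proposed fix, not something the notebooks implement; the function name, the `<start>` placeholder, and the example tag qy (assumed here to be a yes/no question tag in the tag set) are illustrative.

```python
def add_previous_tag(tags, start_token="<start>"):
    """Pair each utterance's tag with the tag of the utterance before it.

    `tags` must be the tags of one conversation in spoken order.
    """
    previous = [start_token] + list(tags[:-1])
    return list(zip(previous, tags))


# A yes/no question followed by an expanded answer and a plain statement.
pairs = add_previous_tag(["qy", "sd^e", "sd"])
```

With this context, an utterance that follows a question carries a different previous-tag value than a statement that follows another statement, which is exactly the signal the sd^e versus sd distinction is missing.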

Another common error is that many sd (Statement-non-opinion) utterances are classified as % (Abandoned or Turn-Exit) or + (Continuation). This might be because the sentence fits multiple labels while the transcript carries only one.


All answers to the questions can be found in the corresponding notebook sections: the hypothesis and result analysis are in section 4, Analysis & Hypothesis, of feature_extraction_and_analysis.ipynb; the answers to the model analysis questions are in section 2.2, Model Analysis, of classification.ipynb.


Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.


Public Domain


