Dialogue Act Recognition

Analyze Dialogue Act Recognition (DAR) using audio and text features, and compare how different feature sets affect the prediction of DAR tags.

Installation

This project uses Jupyter Notebook and Python 3.10.1 for analysis; please make sure both are installed. Main packages used in this project: parselmouth, pandas, matplotlib, scikit-learn, tensorflow. To run the project, install all required packages with pip:

pip install <project_name>

Alternatively, install all required packages automatically from environment.yml:

conda env update --file environment.yml --prune

Components

There are two main Jupyter notebooks for analysis: feature_extraction_and_analysis.ipynb and classification.ipynb.

Feature Extraction and Analysis

feature_extraction_and_analysis.ipynb has 3 main components:

Select audio features
Preprocessing text and audio features
Analysis & Hypothesis

Select audio features

For each .wav file, I extracted the following features using parselmouth: maximum pitch, mean pitch, standard deviation of pitch, maximum intensity, mean intensity, standard deviation of intensity, speaking rate, jitter, shimmer, and HNR (harmonics-to-noise ratio). All computed sound features are saved in sound_features.csv for later preprocessing. Speaking rate is computed from the total duration of the audio and the aggregated length of all corresponding transcripts.
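
As a rough illustration of the speaking rate computation, here is a minimal sketch. It assumes that transcript "length" is measured in words and that the rate is words per second; the notebook may use a different unit (for example characters). The function name `speaking_rate` and the sample values are hypothetical.

```python
def speaking_rate(transcripts, duration_seconds):
    """Return words per second over all transcripts of one recording.

    Assumption: transcript length is counted in whitespace-separated words.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Aggregate the length of every transcript that belongs to the recording.
    total_words = sum(len(text.split()) for text in transcripts)
    return total_words / duration_seconds


# Made-up example: two transcripts (1 + 5 words) over a 2.5 second recording.
rate = speaking_rate(["uh-huh", "i think that is right"], 2.5)
print(rate)
```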

Preprocessing text and audio features

For text features, all LIWC features are selected except the identifier features (start_time, end_time). LIWC features with a total of 0 values >= 70 are then eliminated to reduce input vector sparseness and the time required to converge. Preprocessed text features are stored in processed_train_text_features.csv. This preprocessing step is applied only to the training dataset (train.csv), not the testing dataset (test.csv).
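
The sparse-feature filter can be sketched as follows. This is a standard-library illustration only, and the notebook itself may well do this with pandas. It reads the threshold as a count of zero entries per column; the column names `certain` and `pronoun`, the `keep` argument, and the toy threshold of 2 are hypothetical.

```python
def drop_sparse_columns(rows, max_zero_count, keep=()):
    """Drop every column whose number of zero entries is >= max_zero_count.

    rows: list of dicts sharing the same keys.
    keep: column names that are never dropped (e.g. the label column).
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Count the zero entries in each column.
    zero_counts = {c: sum(1 for r in rows if r[c] == 0) for c in columns}
    kept = [c for c in columns if c in keep or zero_counts[c] < max_zero_count]
    return [{c: r[c] for c in kept} for r in rows]


# Made-up rows; a small threshold is used so the effect is visible.
rows = [
    {"da_tag": "sd", "certain": 0.0, "pronoun": 12.5},
    {"da_tag": "b", "certain": 0.0, "pronoun": 0.0},
    {"da_tag": "sd", "certain": 3.1, "pronoun": 8.0},
]
filtered = drop_sparse_columns(rows, max_zero_count=2, keep=("da_tag",))
print(list(filtered[0].keys()))
```

Here `certain` has two zero entries and is dropped, while `pronoun` has only one and is kept.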

For audio features, recordings with a total duration < 1 second are eliminated using the speaking_rate variable computed earlier. Preprocessed sound features are stored in processed_sound_features.csv.

For convenience, I combined the text and audio features using an inner join, which drops any utterances that appear in only one of the transcript and audio sets. The 10 most common tags are then computed and kept, and rows with other tags are removed from the joined dataset. The combined features are stored in combined_processed_train_filtered.csv and combined_processed_test_filtered.csv.
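
The join and tag filtering might look roughly like the sketch below. It uses only the standard library for portability, assumes each audio row has a unique key, and uses a hypothetical key column `utt_id`; the actual join key in the notebook is not stated here.

```python
from collections import Counter


def inner_join(text_rows, audio_rows, key):
    """Keep only rows whose key appears in both inputs, merging their columns."""
    audio_by_key = {row[key]: row for row in audio_rows}
    return [
        {**text_row, **audio_by_key[text_row[key]]}
        for text_row in text_rows
        if text_row[key] in audio_by_key
    ]


def keep_top_tags(rows, tag_column, n):
    """Keep only rows whose tag is among the n most frequent tags."""
    counts = Counter(row[tag_column] for row in rows)
    top = {tag for tag, _ in counts.most_common(n)}
    return [row for row in rows if row[tag_column] in top]


# Made-up data: utterance 2 has no audio, utterance 5 has no transcript.
text = [
    {"utt_id": 1, "da_tag": "sd"},
    {"utt_id": 2, "da_tag": "b"},
    {"utt_id": 3, "da_tag": "sd"},
    {"utt_id": 4, "da_tag": "qy"},
]
audio = [
    {"utt_id": 1, "mean_pitch": 180.0},
    {"utt_id": 3, "mean_pitch": 210.0},
    {"utt_id": 4, "mean_pitch": 195.0},
    {"utt_id": 5, "mean_pitch": 170.0},
]
joined = inner_join(text, audio, key="utt_id")
top_only = keep_top_tags(joined, tag_column="da_tag", n=1)
print([row["utt_id"] for row in joined], [row["utt_id"] for row in top_only])
```

In the project the same idea would be applied with n=10 rather than the n=1 used for this toy data.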

Analysis & Hypothesis

LIWC and audio feature analysis: I compared the level of certainty across different tags and the speaking intensity level across different tags. The detailed visual analysis and conclusions can be found in the notebook.

Classification & Results

classification.ipynb has 2 main components:

Model Training
Model Testing

Model Training

Extra preprocessing steps: numerical features are normalized and the target variable da_tag is one-hot encoded.
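
A minimal sketch of these two steps is shown below. It assumes z-score standardization, which is only one possible reading of "normalized" (min-max scaling would be another), and the helper names `standardize` and `one_hot` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of numbers: subtract the mean, divide by the std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(tags):
    """Return the sorted class list and one indicator vector per tag."""
    classes = sorted(set(tags))
    index = {tag: i for i, tag in enumerate(classes)}
    vectors = [
        [1 if index[tag] == i else 0 for i in range(len(classes))]
        for tag in tags
    ]
    return classes, vectors


z = standardize([1.0, 2.0, 3.0])
classes, vectors = one_hot(["sd", "b", "sd"])
print(z, classes, vectors)
```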

Three models are trained (Speech Features Only, Text Features Only, and Speech + Text Features), using a small neural network of 15210 total parameters. The Adam optimizer and categorical cross-entropy loss are used to train each model for 50 epochs. To prevent overfitting, a Dropout layer is used after each Dense layer.
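
To show how a parameter total like this arises, the sketch below counts the trainable parameters of a stack of Dense layers: each layer contributes (inputs + 1) * outputs weights, the +1 being the bias, and Dropout layers add none. The layer sizes used here are hypothetical, chosen only because they happen to sum to 15210; the notebook's actual layer sizes are not stated in this README.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of consecutive fully connected layers.

    layer_sizes: [input_size, hidden_1, ..., output_size].
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical layout: 199 inputs, hidden layers of 64 and 32, 10 output tags.
total = dense_param_count([199, 64, 32, 10])
print(total)
```

Because the input size differs between the three models, the total changes with the feature set even when the hidden layers stay the same.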

General structure of the model (the input size varies between models):

Model Test

For each model trained in the previous component, accuracy and macro F1 score are computed. The Speech + Text Features model (the best-performing model) was then selected for a confusion matrix and per-class analysis (precision, recall, F1 score). Findings and the answer to 3c can be found at the end of this section.
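
For reference, the metrics named above can be computed as in the following sketch. It is a standard-library illustration, not the notebook's code (which may rely on scikit-learn), and the labels in the example are made-up toy data rather than project results.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy, macro F1, per-class scores, and a confusion Counter."""
    labels = sorted(set(y_true) | set(y_pred))
    # confusion[(true_tag, predicted_tag)] = number of such utterances
    confusion = Counter(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(confusion[(label, label)] for label in labels) / len(y_true)
    # Macro F1 is the unweighted mean of the per-class F1 scores.
    macro_f1 = sum(s["f1"] for s in per_class.values()) / len(labels)
    return accuracy, macro_f1, per_class, confusion


# Toy labels only: the single sd^e utterance is predicted as sd.
y_true = ["sd", "sd", "b", "b", "sd^e"]
y_pred = ["sd", "sd", "b", "b", "sd"]
accuracy, macro_f1, per_class, confusion = evaluate(y_true, y_pred)
print(accuracy, macro_f1, per_class["sd"], confusion[("sd^e", "sd")])
```

Macro F1 weights every class equally, so a rare class that is never predicted correctly pulls the score down much more than it affects accuracy.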

Notebook Structure

feature_extraction_and_analysis

classification

Results

Confusion Matrix

Per class Accuracy

According to the diagram above, tag b (Acknowledge) has the highest precision, recall, and F1. Classes such as sd also have a high F1 score. These classes may be easier to predict than others because they have more audio and text cues. For example, acknowledgements might have significantly higher intensity than other tags.

From the classification report and confusion matrix above, the model performed worst on sd^e (Statement expanding y/n answer): every sd^e utterance was classified as sd (Statement-non-opinion). This might be because the model finds these two tags very similar and cannot tell them apart, since both are statements. One possible fix is to add the tag of the previous sentence to the feature set, so the model can distinguish whether the current sentence is an answer to a question or a standalone statement.
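
The previous-tag idea could be prototyped along these lines. This is only a sketch of the suggestion, not something implemented in the notebooks; the `conversation_id` column, the `prev_da_tag` feature name, and the `<start>` token are hypothetical, and rows are assumed to be in conversation order.

```python
def add_previous_tag(utterances, start_token="<start>"):
    """Add a prev_da_tag field holding the tag of the preceding utterance.

    utterances: list of dicts in time order, each with 'conversation_id'
    and 'da_tag'. The first utterance of a conversation gets start_token.
    """
    last_tag = {}
    result = []
    for utt in utterances:
        conv = utt["conversation_id"]
        result.append({**utt, "prev_da_tag": last_tag.get(conv, start_token)})
        last_tag[conv] = utt["da_tag"]
    return result


# Made-up example: a yes/no question followed by an expanding answer.
utterances = [
    {"conversation_id": "c1", "da_tag": "qy"},
    {"conversation_id": "c1", "da_tag": "sd^e"},
    {"conversation_id": "c2", "da_tag": "sd"},
]
with_context = add_previous_tag(utterances)
print([u["prev_da_tag"] for u in with_context])
```

One caveat: at prediction time the true previous tag is not available, so the model's own previous prediction would have to be used in its place.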

Another common error is that many sd (Statement-non-opinion) utterances are classified as % (Abandoned or Turn-Exit) or + (continuation). This might be because the sentence fits multiple labels, but the transcript carries only one.

Important

All answers to the questions can be found in the corresponding notebook sections: the hypothesis and result analysis are in section 4: Analysis & Hypothesis in feature_extraction_and_analysis.ipynb; the answers to the model analysis questions are in section 2.2: Model Analysis in classification.ipynb.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

Public Domain
