
Smart turn detection

This is an open source, community-driven, native audio turn detection model.

HuggingFace page: pipecat-ai/smart-turn

Turn detection is one of the most important functions of a conversational voice AI technology stack. Turn detection means deciding when a voice agent should respond to human speech.

Most voice agents today use voice activity detection (VAD) as the basis for turn detection. VAD segments audio into "speech" and "non-speech" segments. VAD can't take into account the actual linguistic or acoustic content of the speech. Humans do turn detection based on grammar, tone and pace of speech, and various other complex audio and semantic cues. We want to build a model that matches human expectations more closely than the VAD-based approach can.

This is a truly open model (BSD 2-clause license). Anyone can use, fork, and contribute to this project. This model started its life as a work-in-progress component of the Pipecat ecosystem. Pipecat is an open source, vendor-neutral framework for building voice and multimodal realtime AI agents.

Current state of the model

This is an initial proof-of-concept model. It handles a small number of common non-completion scenarios. It supports only English. The training data set is relatively small.

We have experimented with a number of different architectures and approaches to training data, and are releasing this version of the model now because we are confident that performance can be rapidly improved.

We invite you to try it, and to contribute to model development and experimentation.

Run the proof-of-concept model checkpoint

Set up the environment.

python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Installing audio dependencies

You may need to install the PortAudio development libraries if they are not already installed, as they are required for PyAudio:

Ubuntu/Debian

sudo apt-get update
sudo apt-get install portaudio19-dev python3-dev

macOS (using Homebrew)

brew install portaudio

Run a command-line utility that streams audio from the system microphone, detects segment start/stop using VAD, and sends each segment to the model for a phrase endpoint prediction.

# 
# It will take about 30 seconds to start up the first time.
#

# "Vocabulary" is limited. Try:
#
#   - "I can't seem to, um ..."
#   - "I can't seem to, um, find the return label."

python record_and_predict.py
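record_and_predict.py handles the VAD and buffering details. Purely as an illustration of the flow it implements, here is a simplified sketch that captures microphone audio with PyAudio, uses a naive energy threshold in place of a real VAD, and sends each finished segment to the predict_endpoint helper described in the Inference section below. The chunk size and silence thresholds are arbitrary assumptions, not the values used in the actual script.

# Simplified illustration only; the real record_and_predict.py uses a proper VAD.
import numpy as np
import pyaudio

from inference import predict_endpoint

RATE, CHUNK = 16000, 512            # 16 kHz mono, ~32 ms per chunk
SILENCE_RMS = 0.01                  # naive energy threshold (arbitrary assumption)
TRAILING_SILENCE_CHUNKS = 25        # ~0.8 s of quiet ends a segment

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

segment, quiet_chunks = [], 0
while True:
    chunk = np.frombuffer(stream.read(CHUNK), dtype=np.float32)
    segment.append(chunk)
    is_quiet = np.sqrt(np.mean(chunk ** 2)) < SILENCE_RMS
    quiet_chunks = quiet_chunks + 1 if is_quiet else 0
    if quiet_chunks >= TRAILING_SILENCE_CHUNKS and len(segment) > quiet_chunks:
        result = predict_endpoint(np.concatenate(segment))
        label = "complete" if result["prediction"] == 1 else "incomplete"
        print(f"{label} (p={result['probability']:.2f})")
        segment, quiet_chunks = [], 0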

Project goals

The current version of this model is based on Meta AI's Wav2Vec2-BERT backbone. More on model architecture below.

The high-level goal of this project is to build a state-of-the-art turn detection model that is:

  • open for anyone to use,
  • easy to deploy in production,
  • easy to fine-tune for specific application needs.

Current limitations:

  • English only
  • Relatively slow inference
    • ~150ms on GPU
    • ~1500ms on CPU
  • Training data focused primarily on pause filler words at the end of a segment.

Medium-term goals:

  • Support for a wide range of languages
  • Inference time <50ms on GPU and <500ms on CPU
  • Much wider range of speech nuances captured in training data
  • A completely synthetic training data pipeline
  • Text conditioning of the model, to support "modes" like credit card, telephone number, and address entry.

Model architecture

Wav2Vec2-BERT is a speech encoder model developed as part of Meta AI's Seamless-M4T project. It is a 580M parameter base model that can leverage both acoustic and linguistic information. The base model is trained on 4.5M hours of unlabeled audio data covering more than 143 languages.

To use Wav2Vec2-BERT, you generally add additional layers to the base model and then train/fine-tune on an application-specific dataset.

We are currently using a simple, two-layer classification head, conveniently packaged in the Hugging Face Transformers library as Wav2Vec2BertForSequenceClassification.
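For orientation, here is a minimal sketch of instantiating that head on top of the public base checkpoint. This is illustrative only, and the exact loading code in train.py may differ.

# Sketch: attach a two-class sequence-classification head to the base encoder.
from transformers import AutoFeatureExtractor, Wav2Vec2BertForSequenceClassification

BASE = "facebook/w2v-bert-2.0"  # public Wav2Vec2-BERT base checkpoint
processor = AutoFeatureExtractor.from_pretrained(BASE)
model = Wav2Vec2BertForSequenceClassification.from_pretrained(
    BASE,
    num_labels=2,  # class 0 = incomplete, class 1 = complete
)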

We have experimented with a variety of architectures, including the widely-used predecessor of Wav2Vec2-BERT, Wav2Vec2, and more complex approaches to classification. Some of us who are working on the model think that the simple classification head will work well as we scale the data set to include more complexity. Some of us have the opposite intuition. Time will tell! Experimenting with additions to the model architecture is an excellent learning project if you are just getting into ML engineering. See "Things to do" below.


Inference

predict.py shows how to pass an audio sample through the model for classification. A small convenience function in inference.py wraps the audio preprocessing and PyTorch inference code.

# defined in inference.py
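# Note: `processor`, `model`, and `device` are module-level objects created when
# inference.py loads the checkpoint; they are assumed to already be in scope here.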
def predict_endpoint(audio_array):
    """
    Predict whether an audio segment is complete (turn ended) or incomplete.

    Args:
        audio_array: Numpy array containing audio samples at 16kHz

    Returns:
        Dictionary containing prediction results:
        - prediction: 1 for complete, 0 for incomplete
        - probability: Probability of completion class
    """

    # Process audio
    inputs = processor(
        audio_array,
        sampling_rate=16000,
        padding="max_length",
        truncation=True,
        max_length=800,  # Maximum length as specified in training
        return_attention_mask=True,
        return_tensors="pt",
    )

    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

        # Get probabilities using softmax
        probabilities = torch.nn.functional.softmax(logits, dim=1)
        completion_prob = probabilities[0, 1].item()  # Probability of class 1 (Complete)

        # Make prediction (1 for Complete, 0 for Incomplete)
        prediction = 1 if completion_prob > 0.5 else 0

    return {
        "prediction": prediction,
        "probability": completion_prob,
    }
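As a usage example, predict_endpoint can be called on a prerecorded clip. The file name below is hypothetical, and the audio is assumed to be mono 16 kHz.

import soundfile as sf

from inference import predict_endpoint

audio, sample_rate = sf.read("sample_16khz.wav", dtype="float32")  # hypothetical file
assert sample_rate == 16000, "the model expects 16 kHz audio"

result = predict_endpoint(audio)
print(result)  # {"prediction": 0 or 1, "probability": float}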

Training

All training code is defined in train.py.

The training code will download datasets from the pipecat-ai HuggingFace repository. (But of course you can modify it to use your own datasets.)

You can run training locally or using Modal. Training runs are logged to Weights & Biases unless you disable logging.

# To run a training job on Modal, run:
modal run --detach train.py

Collecting and contributing data

Currently, there are two datasets used for training and evaluation:

  • human_5_all -- segmented speech recorded from human interactions
  • rime_2 -- synthetic speech generated using Rime
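If you want to inspect the data yourself, it should be loadable with the Hugging Face datasets library. The repository ids below are assumptions based on the dataset names above; check the pipecat-ai organization on HuggingFace for the exact ids.

from datasets import load_dataset

# Assumed repository ids; verify them against the pipecat-ai HuggingFace organization.
human = load_dataset("pipecat-ai/human_5_all")
synthetic = load_dataset("pipecat-ai/rime_2")

print(human)
print(synthetic)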

Four splits are created when these two datasets are loaded.

  • The train, validate, and test sets are a mix of synthetic and human data
  • The human eval set contains only human data
-- TRAIN --
  Total samples: 5,694
  Positive samples (Complete): 2,733 (48.00%)
  Negative samples (Incomplete): 2,961 (52.00%)

-- VALIDATION --
  Total samples: 712
  Positive samples (Complete): 352 (49.44%)
  Negative samples (Incomplete): 360 (50.56%)

-- TEST --
  Total samples: 712
  Positive samples (Complete): 339 (47.61%)
  Negative samples (Incomplete): 373 (52.39%)

-- HUMAN_EVAL --
  Total samples: 773
  Positive samples (Complete): 372 (48.12%)
  Negative samples (Incomplete): 401 (51.88%)
Our goal for this initial version of the model was to overfit on a non-trivial amount of data, and to exceed a non-quantitative "vibes" threshold when experimenting with the model interactively. The next step is to broaden the dataset and move away from overfitting towards better generalization.

[ Confusion matrix for the test set ]

[ more notes on data coming soon ]

Things to do

More languages

The base Wav2Vec2-BERT model is trained on a large amount of multi-lingual data. Supporting additional languages will require either collecting and cleaning or synthetically generating data for each language.

More data

The current checkpoint was trained on a dataset of approximately 8,000 samples. These samples mostly focus on filler words that are typical indications of a pause without utterance completion in English-language speech.

Two data sets are used in training: around 4,000 samples collected from human speakers, and around 4,000 synthetic data samples generated using Rime.

The biggest short-term data need is to collect, categorize, and clean human data samples that represent a broader range of speech patterns:

  • inflection and pacing that indicates a "thinking" pause rather than a completed speech segment
  • grammatical structures that typically occur in unfinished speech segments (but not in finished segments)
  • more individual speakers represented
  • more regions and accents represented

The synthetic data samples in the datasets/rime_2 dataset currently improve model performance only by a small margin. But one possible goal for this project is to work towards a completely synthetic data generation pipeline. The potential advantages of such a pipeline include the ability to support more languages more easily, a better flywheel for building more accurate versions of the model, and the ability to rapidly customize the model for specific use cases.

If you have expertise in steering speech models so that they output specific patterns (or if you want to experiment and learn), please consider contributing synthetic data.

Architecture experiments

The current model architecture is relatively simple, because the base Wav2Vec2-BERT model is already quite powerful.

However, it would be interesting to experiment with other approaches to classification added on top of the Wav2Vec2-BERT model. This might be particularly useful if we want to move away from binary classification towards an approach that is more customized for this turn detection task.

For example, it would be great to provide the model with additional context to condition the inference. A use case for this would be for the model to "know" that the user is currently reciting a credit card number, or a phone number, or an email address.

Adding additional context to the model is an open-ended research challenge. Some simpler todo list items include:

  • Experimenting with freezing different numbers of layers during training (see the sketch after this list).
  • Hyperparameter tuning.
  • Trying different sizes for the classification head or moderately different classification head designs and loss functions.
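As a starting point for the layer-freezing experiment mentioned above, here is a hedged sketch. The attribute names are assumptions based on the Transformers implementation of Wav2Vec2BertForSequenceClassification, and the number of frozen layers is an arbitrary choice.

from transformers import Wav2Vec2BertForSequenceClassification

model = Wav2Vec2BertForSequenceClassification.from_pretrained(
    "facebook/w2v-bert-2.0", num_labels=2
)

NUM_FROZEN = 16  # arbitrary choice; the w2v-bert-2.0 encoder has 24 layers

# Freeze the lowest transformer layers so only the upper layers and the
# classification head are updated during fine-tuning.
for layer in model.wav2vec2_bert.encoder.layers[:NUM_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False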

Supporting training on more platforms

We trained early versions of this model on Google Colab. We should support Colab as a training platform again! It would be great to have quickstarts for training on a wide variety of platforms.

We should also port the training code to Apple's MLX platform. A lot of us have MacBooks!

Optimization

This model will likely perform well when quantized; quantized versions should run significantly faster than the current float32 weights.
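As one possible starting point, here is a minimal sketch of PyTorch dynamic quantization of the linear layers. Whether this preserves accuracy for this model, and how much speedup it actually yields, is untested here; the checkpoint id is the one from the HuggingFace page above.

import torch
from transformers import Wav2Vec2BertForSequenceClassification

model = Wav2Vec2BertForSequenceClassification.from_pretrained("pipecat-ai/smart-turn")
model.eval()

# Quantize the nn.Linear weights to int8 for faster CPU inference.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)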

The PyTorch inference code is not particularly optimized. We should be able to hand-write inference code that runs substantially faster on both GPU and CPU, for this model architecture.

It would be nice to port the inference code to Apple's MLX platform. This would be particularly useful for local development and debugging, and could also open up the possibility of running this model locally on iOS devices (in combination with quantization).
