This project is designed to provide an automated pipeline for processing medical audio files. It leverages Google Cloud Platform (GCP) services, such as Cloud Storage, Pub/Sub, and Cloud Functions, to handle audio file uploads, transcribe them, and summarize their contents.
- It is very challenging to find relevant data and code for SOAP summarization. The dataset I obtained for training is very small, so the model is not very accurate.
- In real business scenarios, the dataset could be much larger than the one used here. Whenever I run a test like this, I try to find a dataset that is as similar as possible to the business scenario, but in a medical context that is nearly impossible.
- I found two papers that address the SOAP summarization problem: https://arxiv.org/pdf/2005.01795v3.pdf and https://proceedings.mlr.press/v126/schloss20a/schloss20a.pdf. However, the datasets they use are not available, so their results are hard to reproduce.
- Recordings: [https://oxiago.com/curavoice/dataset/10.5281.4279041/]
- SOAP Notes: [https://huggingface.co/datasets/biomegix/soap-notes]
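As a quick sanity check, the SOAP notes can be pulled straight from the Hugging Face Hub; a minimal sketch using the `datasets` library (assuming the dataset is publicly downloadable, and noting that its column names and splits should be inspected before use):

```python
from datasets import load_dataset

# Column names and splits depend on the dataset card; inspect before use.
soap = load_dataset("biomegix/soap-notes")
print(soap)
```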
TBD
The pipeline operates as follows:
- An audio file is uploaded to a Cloud Storage bucket.
- A Cloud Function is triggered to handle the upload and publish the audio file's metadata to a Pub/Sub topic (see the sketch after this list).
- Another Cloud Function is triggered by the Pub/Sub topic to transcribe the audio file using the Google Speech-to-Text API.
- The transcription is then processed by a summarization function and stored in Firestore.
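For illustration, here is a minimal sketch of what the preprocess step could look like as a Python background Cloud Function triggered by Cloud Storage. The project ID and topic name are assumptions and must match the values in the Makefile:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project/topic names; align these with your Makefile variables.
TOPIC_PATH = publisher.topic_path("your-project-id", "audio-uploads")

def preprocess(event, context):
    """Triggered when an object is finalized in the Cloud Storage bucket.

    Publishes the audio file's metadata to the Pub/Sub topic so the
    transcribe function can pick it up.
    """
    metadata = {
        "bucket": event["bucket"],
        "name": event["name"],
        "content_type": event.get("contentType", ""),
    }
    future = publisher.publish(TOPIC_PATH, json.dumps(metadata).encode("utf-8"))
    future.result()  # block until the message is accepted
```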
Before using the Makefile to deploy the pipeline, make sure you have the following:
- Google Cloud SDK (gcloud) installed and authenticated with your GCP account.
- Appropriate permissions to create and manage Cloud Storage buckets, Pub/Sub topics, and Cloud Functions.
- The `make` utility installed on your local machine.
The Makefile contains predefined variables for the GCP project ID, region, bucket name, Firestore database ID, and Cloud Function names. Ensure these values are set correctly to match your GCP environment.
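For example, the variable block at the top of the Makefile might look like the following (all values here are placeholders, not the repository's actual defaults):

```makefile
# Placeholder values; replace with your own GCP environment.
PROJECT_ID   := your-project-id
REGION       := us-central1
BUCKET_NAME  := $(PROJECT_ID)-medical-audio
TOPIC_NAME   := audio-uploads
DATABASE_ID  := (default)
```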
First, adjust the project ID in the Makefile to your own project ID. Then, to create the Cloud Storage bucket, Pub/Sub topic, and Firestore database, run:
```sh
make setup
```
To deploy the entire pipeline, simply run:
```sh
make deploy-all
```
This command will sequentially deploy the preprocess, transcribe, and summarization functions to your GCP project.
You can also deploy individual components by running:
```sh
make deploy-preprocess
make deploy-transcribe
make deploy-summarization
```
To remove all deployed resources, run:
```sh
make cleanup
```
This command will delete the Cloud Functions, Pub/Sub topic, and Cloud Storage bucket.
The proposed process for doctor-patient conversation speech summarization: fine-tune a pretrained BART model for summarization, then fine-tune a pretrained BERT model for classification. The SOAP summarization algorithm contains two parts:
- The classifier, which leverages the Bio_ClinicalBERT [emilyalsentzer/Bio_ClinicalBERT] model. The model is trained on the SOAP notes obtained from https://huggingface.co/datasets/biomegix/soap-notes using the ktrain [https://github.com/amaiya/ktrain.git] library. The model outputs the probability of each sentence belonging to each of the SOAP categories; after predicting the probabilities, each sentence is assigned to the category with the highest probability. Because the dataset contains only around 800 sentences, the model is not very accurate. I could not find any other dataset that contains SOAP notes; the MIMIC-III dataset on PhysioNet [https://physionet.org/content/mimiciii/1.4/] requires a license, so I could not use it.
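A minimal sketch of this classifier using ktrain's Transformer wrapper. The column names (`sentence`, `label`), class names, train/validation split, and hyperparameters below are assumptions for illustration, not the repository's actual settings:

```python
import ktrain
from ktrain import text
from datasets import load_dataset

CLASS_NAMES = ["subjective", "objective", "assessment", "plan"]

ds = load_dataset("biomegix/soap-notes", split="train")
sentences = ds["sentence"]  # assumed column name
labels = ds["label"]        # assumed column name

# Wrap Bio_ClinicalBERT with ktrain's Transformer preprocessor/classifier.
t = text.Transformer("emilyalsentzer/Bio_ClinicalBERT",
                     maxlen=128, class_names=CLASS_NAMES)
trn = t.preprocess_train(sentences[:700], labels[:700])
val = t.preprocess_test(sentences[700:], labels[700:])

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
learner.fit_onecycle(5e-5, 4)  # learning rate and epoch count are illustrative

# Predict per-category probabilities and take the argmax as the SOAP label.
predictor = ktrain.get_predictor(learner.model, preproc=t)
probs = predictor.predict_proba(["Patient reports worsening chest pain."])
print(CLASS_NAMES[probs[0].argmax()])
```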