This project is designed to provide an automated pipeline for processing medical audio files. It leverages Google Cloud Platform (GCP) services, such as Cloud Storage, Pub/Sub, and Cloud Functions, to handle audio file uploads, transcribe them, and summarize their contents.
- It is very challenging to find relevant data and code for SOAP summarization. The dataset I obtained for training is very small, so the model is not very accurate.
- In real business scenarios, the dataset could be much larger than the one used here. Whenever I run a test like this, I try to find a dataset that is as similar as possible to the business scenario, but in a medical context that is nearly impossible.
- I found two papers that address the SOAP summarization problem: https://arxiv.org/pdf/2005.01795v3.pdf and https://proceedings.mlr.press/v126/schloss20a/schloss20a.pdf. However, the datasets they use are not available, so their results are hard to reproduce.
- Recordings: [https://oxiago.com/curavoice/dataset/10.5281.4279041/]
- SOAP Notes: [https://huggingface.co/datasets/biomegix/soap-notes]
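As a quick sanity check, the SOAP notes can be pulled straight from the Hugging Face Hub; a minimal sketch using the `datasets` library (assuming the dataset is publicly downloadable, and noting that its column names and splits should be inspected before use):

```python
from datasets import load_dataset

# Column names and splits depend on the dataset card; inspect before use.
soap = load_dataset("biomegix/soap-notes")
print(soap)
```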
TBD
The pipeline operates as follows:
- An audio file is uploaded to a Cloud Storage bucket.
- A Cloud Function is triggered to handle the upload and publish the audio file's metadata to a Pub/Sub topic (see the sketch after this list).
- Another Cloud Function is triggered by the Pub/Sub topic to transcribe the audio file using the Google Speech-to-Text API.
- The transcription is then processed by a summarization function and stored in Firestore.
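For illustration, here is a minimal sketch of what the preprocess step could look like as a Python background Cloud Function triggered by Cloud Storage. The project ID and topic name are assumptions and must match the values in the Makefile:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project/topic names; align these with your Makefile variables.
TOPIC_PATH = publisher.topic_path("your-project-id", "audio-uploads")

def preprocess(event, context):
    """Triggered when an object is finalized in the Cloud Storage bucket.

    Publishes the audio file's metadata to the Pub/Sub topic so the
    transcribe function can pick it up.
    """
    metadata = {
        "bucket": event["bucket"],
        "name": event["name"],
        "content_type": event.get("contentType", ""),
    }
    future = publisher.publish(TOPIC_PATH, json.dumps(metadata).encode("utf-8"))
    future.result()  # block until the message is accepted
```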
Before using the Makefile to deploy the pipeline, make sure you have the following:
- Google Cloud SDK (gcloud) installed and authenticated with your GCP account.
- Appropriate permissions to create and manage Cloud Storage buckets, Pub/Sub topics, and Cloud Functions.
- The `make` utility installed on your local machine.
The Makefile contains predefined variables for the GCP project ID, region, bucket name, Firestore database ID, and Cloud Function names. Ensure these values are set correctly to match your GCP environment.
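For example, the variable block at the top of the Makefile might look like the following (all values here are placeholders, not the repository's actual defaults):

```makefile
# Placeholder values; replace with your own GCP environment.
PROJECT_ID   := your-project-id
REGION       := us-central1
BUCKET_NAME  := $(PROJECT_ID)-medical-audio
TOPIC_NAME   := audio-uploads
DATABASE_ID  := (default)
```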
First, adjust the project ID in the Makefile to your own project ID. Then, to create the Cloud Storage bucket, Pub/Sub topic, and Firestore database, run:
```sh
make setup
```
To deploy the entire pipeline, simply run:
```sh
make deploy-all
```
This command will sequentially deploy the preprocess, transcribe, and summarization functions to your GCP project.
You can also deploy individual components by running:
```sh
make deploy-preprocess
make deploy-transcribe
make deploy-summarization
```
To remove all deployed resources, run:
```sh
make cleanup
```
This command will delete the Cloud Functions, Pub/Sub topic, and Cloud Storage bucket.
The proposed process for doctor-patient conversation speech summarization: fine-tune a pretrained BART model for summarization, then fine-tune a pretrained BERT model for classification. The SOAP summarization algorithm contains two parts:
- The classifier, which leverages the Bio_ClinicalBERT [emilyalsentzer/Bio_ClinicalBERT] model. The model is trained on the SOAP notes obtained from https://huggingface.co/datasets/biomegix/soap-notes using the ktrain [https://github.com/amaiya/ktrain.git] library. The model outputs the probability of each sentence belonging to each of the SOAP categories; after predicting the probabilities, each sentence is assigned to the category with the highest probability. Because the dataset contains only around 800 sentences, the model is not very accurate. I could not find any other dataset that contains SOAP notes; the MIMIC-III dataset on PhysioNet [https://physionet.org/content/mimiciii/1.4/] requires a license, so I could not use it.
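A minimal sketch of this classifier using ktrain's Transformer wrapper. The column names (`sentence`, `label`), class names, train/validation split, and hyperparameters below are assumptions for illustration, not the repository's actual settings:

```python
import ktrain
from ktrain import text
from datasets import load_dataset

CLASS_NAMES = ["subjective", "objective", "assessment", "plan"]

ds = load_dataset("biomegix/soap-notes", split="train")
sentences = ds["sentence"]  # assumed column name
labels = ds["label"]        # assumed column name

# Wrap Bio_ClinicalBERT with ktrain's Transformer preprocessor/classifier.
t = text.Transformer("emilyalsentzer/Bio_ClinicalBERT",
                     maxlen=128, class_names=CLASS_NAMES)
trn = t.preprocess_train(sentences[:700], labels[:700])
val = t.preprocess_test(sentences[700:], labels[700:])

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
learner.fit_onecycle(5e-5, 4)  # learning rate and epoch count are illustrative

# Predict per-category probabilities and take the argmax as the SOAP label.
predictor = ktrain.get_predictor(learner.model, preproc=t)
probs = predictor.predict_proba(["Patient reports worsening chest pain."])
print(CLASS_NAMES[probs[0].argmax()])
```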