transcribe.py

A python script to create a .txt transcript from an .m4a audio file

How it works

Uses ffmpeg to convert an .m4a file into a .flac file (single channel, 16000 Hz)
Uploads the .flac file to a Google Cloud Storage bucket (required by Speech-to-Text for audio files over 60 seconds)
Deletes the .flac file from the local machine
Transcribes the uploaded .flac file, saving the transcript as a .txt file, while also building a mysql INSERT statement and appending it to the vsearch.sql file.
Deletes the .flac file from the Google Cloud Storage bucket
Simply log into the database and run the updated vsearch.sql file to add additional transcripts into the database.

Set up your Google Cloud project

Sign in to Google Cloud Console

https://console.cloud.google.com/

Go to the project selector page and create a project

https://console.cloud.google.com/projectselector2/home

Enable the Speech-to-Text API

https://cloud.google.com/speech-to-text

Create a service account for your project

https://console.cloud.google.com/projectselector/iam-admin/serviceaccounts/create

note: the account needs to have at least Storage Admin access in order to upload to the bucket on Google Cloud.

Create a JSON key for your service account

Select your service account, navigate to the KEYS tab, and select ADD KEY (Create new key)

This will download a JSON file to your machine, you'll need to know where this is later.

Create a Google Cloud Storage bucket

note: do not use a nested bucket.

https://console.cloud.google.com/storage/browser

How to use

Pull the transcribe repository to the same machine as your audio files.
Edit transcribe.py and set BUCKET_NAME to the name of your Google Cloud Storage bucket.
Using a linux command line, set your authentication environment variable:

export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

setting KEY_PATH with the path to your JSON key file, for example:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"

Call transcribe from the command line with the audio file as an argument:

python3 transcribe.py path/to/file_name.m4a

There may be a few installs required:

sudo apt install python3-pip
pip install --upgrade google
pip install --upgrade google-cloud-storage
pip install google-cloud-speech
sudo apt install ffmpeg

Transcribe in batches

If you wish to loop over multiple .m4a audio files for processing, a shell script vloop.sh was created to facilitate using pattern matching to transcribe batches with a single command.

Edit vloop.sh to declare the following:

Set the starting pattern

starting_pattern="abc" (this will match all files starting with the string “abc”)

Set the KEY_PATH with the path to your JSON file, just as before

cred_path="/home/user/Downloads/service-account-file.json"

Set the FILE_PATH with the path to your audio files to process

FILES="audio/m4a"

Call vloop from the command line:

./vloop.sh

vloop.sh does 2 passes, first transcribing all the scoped exhibits matching your specified starting pattern, then matching the remaining exhibits. Duplicates are avoided by checking the txt folder for an existing transcript, scoped or unscoped.

Troubleshooting

Google Cloud's Speech-to-Text API can be tricky to get set up correctly.

If the process fails during audio conversion, make sure the original audio files are in .m4a format. Conversion from other formats is not currently supported.
If the process fails during upload to your Google Cloud Storage bucket, make sure your service account was granted at least Storage Admin access.
If the process fails during transcription, make sure the file path to your JSON key is correctly set and that Google Cloud's Speech-to-Text API is activated. Wait a few minutes before trying again.

Additional Resources

ffmpeg:

https://ffmpeg.org/

Upload objects to a Google Cloud Storage bucket:

https://cloud.google.com/storage/docs/uploading-objects

Transcribe short audio files:

https://cloud.google.com/speech-to-text/docs/sync-recognize

Transcribe long audio files:

https://cloud.google.com/speech-to-text/docs/async-recognize#speech_transcribe_async_gcs-python

Delete objects from a Google Cloud Storage bucket:

https://cloud.google.com/storage/docs/deleting-objects

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

txt

txt

README.md

README.md

transcribe.py

transcribe.py

vloop.sh

vloop.sh

vsearch.sql

vsearch.sql

Repository files navigation

transcribe.py

How it works

Set up your Google Cloud project

How to use

Transcribe in batches

Troubleshooting

Additional Resources

About

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
txt		txt
README.md		README.md
transcribe.py		transcribe.py
vloop.sh		vloop.sh
vsearch.sql		vsearch.sql

reelradio/transcribe

Folders and files

Latest commit

History

Repository files navigation

transcribe.py

How it works

Set up your Google Cloud project

How to use

Transcribe in batches

Troubleshooting

Additional Resources

About

Resources

Stars

Watchers

Forks

Languages