<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Modules

In [None]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook transcribes audio recordings and processes the resulting transcripts for the Bank of England project.
    The workflow downloads an audio file from a specified URL, applies a machine-learning speech-to-text model
    (e.g., OpenAI’s Whisper) to convert the audio into text, and splits the resulting transcript into two sections:
    the Manager Presentation and the Question & Answer (Q&A) session. Each section is then exported to its
    own CSV file using Python libraries such as requests, regex, and csv (or pandas). The pipeline builds on our existing
    data engineering infrastructure to support extraction, segmentation, and analysis of key project content.

===================================================
"""


In [None]:
import os
import sys
from google.colab import drive

In [None]:


# Mount Google Drive to the root location with force_remount
drive.mount('/content/drive', force_remount=True)

# Assuming 'BOE' folder is in 'MyDrive' and already shared
BOE_path = '/content/drive/MyDrive/BOE/bank_of_england/data'

# Now you (and others with access) can work with files in this directory
# For example, you can list the contents:
print(os.listdir(BOE_path))
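
The description above mentions downloading the audio file from a URL before transcription. A minimal sketch of that step follows. It uses the standard-library `urllib` rather than `requests` so that it runs without extra installs; the function name `download_audio` and the `timeout` default are assumptions for illustration, not part of the project code.

```python
import shutil
import urllib.request
from pathlib import Path


def download_audio(url, dest_path, timeout=60):
    """Stream the file at `url` to `dest_path` and return the local path.

    Hypothetical helper: the name, signature, and timeout are assumptions.
    """
    dest = Path(dest_path)
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so large recordings are never held fully in memory.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return str(dest)
```

The returned path can then be passed to the transcription step, for example by saving the recording under a subfolder of `BOE_path` (the subfolder and file name are up to you).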

Transcribing the Audio

In [None]:
# Install Whisper if it is not already installed (in Colab, prefix the command with "!"):
# !pip install git+https://github.com/openai/whisper.git

import whisper

# Load a pre-trained model (choose a model size appropriate for your needs: tiny, base, small, medium, large)
model = whisper.load_model("base")

# Download or specify the local path of your audio file.
audio_file = "path/to/your_audio_file.mp3"  # replace with the path after downloading the file from your URL

# Transcribe the audio file to text
result = model.transcribe(audio_file)
transcript_text = result["text"]

# Optionally, save the full transcript to a text file for review.
with open("full_transcript.txt", "w", encoding="utf-8") as f:
    f.write(transcript_text)
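
The notebook description says the transcript is split into a Manager Presentation section and a Q&A section, each exported to its own CSV file. The sketch below shows one way this could be done with `re` and `csv`. The marker phrases in `QA_MARKER`, the sentence splitter, the function names, and the CSV column layout are all assumptions for illustration; real calls may announce the Q&A differently, so the pattern would likely need tuning.

```python
import csv
import re

# Hypothetical phrases that might introduce the Q&A session (assumption).
QA_MARKER = re.compile(
    r"question[- ]and[- ]answer|Q\s*&\s*A\b|open (?:the )?(?:line|call|floor) (?:for|to) questions",
    re.IGNORECASE,
)

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into stripped, non-empty sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def split_transcript(text, marker=QA_MARKER):
    """Return (presentation_sentences, qa_sentences).

    The Q&A section starts at the first sentence matching `marker`.
    If no sentence matches, everything is treated as presentation.
    """
    sentences = split_sentences(text)
    for i, sentence in enumerate(sentences):
        if marker.search(sentence):
            return sentences[:i], sentences[i:]
    return sentences, []


def write_section_csv(path, section, sentences):
    """Write one row per sentence with a header row (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["section", "sentence_index", "text"])
        for i, sentence in enumerate(sentences):
            writer.writerow([section, i, sentence])
```

With the transcript from the cell above, usage might look like `presentation, qa = split_transcript(transcript_text)` followed by one `write_section_csv` call per section. Splitting on sentence boundaries keeps the cut from landing mid-sentence, at the cost of relying on Whisper's punctuation being reasonably accurate.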
