
## Modules

In [4]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook transcribes video recordings and collects the transcripts in a single CSV file
    for the Bank of England project. It reads the MP4 files stored in the raw data directory on Google Drive
    and uses a machine learning-based speech-to-text model (OpenAI's Whisper) to convert the audio content into text.
    Each transcript is appended as a record in the CSV file along with metadata: the financial quarter (year and quarter),
    inferred from the video file name, and the call date, looked up from Santander's website. Files already present
    in the CSV are reported as duplicates. This pipeline supports the ongoing integration of transcripts across multiple
    quarters and years, facilitating further analysis and reporting within our data engineering infrastructure.

===================================================
"""




In [5]:
# Install whisper (if not already installed)
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-t6nt6q0m
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-t6nt6q0m
  Resolved https://github.com/openai/whisper.git to commit 517a43ecd132a2089d85f4ebc044728a71d49f6e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20240930)
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-
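
Whisper relies on the `ffmpeg` command-line tool to decode audio from media files, so it can help to confirm that it is on the PATH before transcribing. A minimal sketch; the helper name `ffmpeg_available` is illustrative and not part of the original notebook.

```python
import shutil

def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_available():
    print("ffmpeg found.")
else:
    print("ffmpeg not found; Whisper will not be able to decode the MP4 files.")
```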

In [6]:
import os
import glob
import subprocess
import re
import csv

import requests
from bs4 import BeautifulSoup
import whisper

In [8]:
import os
from google.colab import drive

# Mount Google Drive to the root location with force_remount
drive.mount('/content/drive', force_remount=True)

# Assuming 'BOE' folder is in 'MyDrive' and already shared
BOE_path = '/content/drive/MyDrive/BOE/bank_of_england/data'

# List the contents of the BOE directory
print("BOE Directory Contents:", os.listdir(BOE_path))

# Define the raw data path (assuming the MP4 video files are under raw/santander)
raw_data_path = os.path.join(BOE_path, 'raw', 'santander')
print("Raw Data Directory Contents:", os.listdir(raw_data_path))


Mounted at /content/drive
BOE Directory Contents: ['model', 'raw', 'processed']
Raw Data Directory Contents: ['video_2024_Q1_1.mp4', 'video_2024_Q2_2.mp4', 'video_2024_Q3_3.mp4', 'video_2024_Q4_4.mp4', 'video_2023_Q1_5.mp4', 'video_2023_Q2_6.mp4', 'video_2023_Q3_7.mp4', 'video_2023_Q4_8.mp4']
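
The listing above suggests a naming convention of `video_<year>_<quarter>_<index>.mp4`. The sketch below, which assumes that convention holds for every file, separates conforming names from ones the later quarter-parsing step could not interpret. The pattern and the helper name `split_by_naming_convention` are illustrative assumptions, not part of the original notebook.

```python
import re

# Assumed convention, inferred from the directory listing:
# video_<YYYY>_<Q1..Q4>_<index>.mp4
FILENAME_PATTERN = re.compile(r"^video_(\d{4})_(Q[1-4])_(\d+)\.mp4$")

def split_by_naming_convention(filenames):
    """Return (matching, non_matching) lists of file names."""
    matching, non_matching = [], []
    for name in filenames:
        if FILENAME_PATTERN.match(name):
            matching.append(name)
        else:
            non_matching.append(name)
    return matching, non_matching

# Illustrative input; in the notebook this could be os.listdir(raw_data_path).
ok, bad = split_by_naming_convention(
    ["video_2024_Q1_1.mp4", "notes.txt", "video_2023_Q5_9.mp4"]
)
print("Conforming:", ok)
print("Needs attention:", bad)
```

Running such a check before transcription surfaces misnamed files early, before they end up in the CSV with a financial quarter of "Unknown".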


## Process All Downloaded MP4 Files

In [11]:
import os
import glob
import csv
import whisper
import re
import requests
from bs4 import BeautifulSoup

def get_call_dates():
    """
    Scrapes the Santander Financial and Economic Information page to build a mapping
    of financial quarter to call date. This function assumes that the page contains quarterly
    result sections within <div class="documents-wrapper"> elements. Within each wrapper:
      - A <div class="title-document"> contains a <span class="text-title"> with text like "Q4 2024".
      - The first <div class="documents-block__date"> element within the wrapper holds the call date (e.g., "05-02-2025").
    Returns a dictionary mapping keys like "2024 Q4" to the call date.
    """
    url = "https://www.santander.com/en/shareholders-and-investors/financial-and-economic-information"
    call_date_mapping = {}
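    try:
        # A timeout keeps the notebook from hanging if the site does not respond.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        # On any network or HTTP error, report it and return the empty mapping.
        print("Error fetching call dates:", e)
        return call_date_mapping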

    soup = BeautifulSoup(response.text, 'html.parser')
    wrappers = soup.find_all("div", class_="documents-wrapper")
    for wrapper in wrappers:
        title_document = wrapper.find("div", class_="title-document")
        if title_document:
            span_title = title_document.find("span", class_="text-title")
            if span_title:
                title_text = span_title.get_text(strip=True)
                # Expect title text like "Q4 2024"; extract quarter and year.
                match = re.search(r'(Q[1-4])\s+(\d{4})', title_text)
                if match:
                    quarter = match.group(1)
                    year = match.group(2)
                    key = f"{year} {quarter}"
                    # Look for the call date in the first <div class="documents-block__date">.
                    # A missing element or empty text is recorded as "Unknown".
                    date_elem = wrapper.find("div", class_="documents-block__date")
                    call_date = date_elem.get_text(strip=True) if date_elem else ""
                    call_date_mapping[key] = call_date or "Unknown"
    return call_date_mapping

def parse_financial_quarter(filename):
    """
    Given a filename (e.g., "video_2023_Q3_1"), extract and return a string like "2023 Q3".
    If the pattern is not found, return "Unknown".
    """
    match = re.search(r'(\d{4})_(Q[1-4])', filename)
    if match:
        year = match.group(1)
        quarter = match.group(2)
        return f"{year} {quarter}"
    return "Unknown"

# Define directories – adjust these paths as needed.
raw_dir = '/content/drive/MyDrive/BOE/bank_of_england/data/raw/santander'
processed_dir = '/content/drive/MyDrive/BOE/bank_of_england/data/processed'
os.makedirs(processed_dir, exist_ok=True)

# Load the Whisper transcription model.
model = whisper.load_model("base")

# Define the CSV file where all transcripts will be appended.
all_transcripts_csv = os.path.join(processed_dir, "santander_management_discussion.csv")

# Prepare a set to store already processed file names for duplicate checking.
existing_files = set()
if os.path.exists(all_transcripts_csv):
    with open(all_transcripts_csv, "r", newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if "filename" in row:
                existing_files.add(row["filename"])

# If the CSV file doesn't exist, create it with the desired header.
if not os.path.exists(all_transcripts_csv):
    with open(all_transcripts_csv, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["filename", "management_discussion", "financial_quarter", "call_date"])

# Fetch the mapping of financial quarter to call date from Santander's page.
call_date_mapping = get_call_dates()
print("Call Date Mapping:", call_date_mapping)

# Process each MP4 file in the raw directory.
mp4_files = glob.glob(os.path.join(raw_dir, "*.mp4"))

for mp4_file in mp4_files:
    print(f"\nProcessing MP4 file: {mp4_file}")
    # Transcribe the video using Whisper.
    result = model.transcribe(mp4_file)
    transcript_text = result["text"]

    # Use the file's base name as an identifier.
    base_name = os.path.splitext(os.path.basename(mp4_file))[0]

    # Extract the financial quarter from the filename.
    financial_quarter = parse_financial_quarter(base_name)
    # Look up the call date from our mapping; default to "Unknown" if not found.
    call_date = call_date_mapping.get(financial_quarter, "Unknown")

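    # Check for duplicates. The flag is only reported here: the row is still appended below,
    # so re-running this cell adds another copy of every transcript already in the CSV.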
    duplicate_flag = "Yes" if base_name in existing_files else "No"
    existing_files.add(base_name)
    if duplicate_flag == "Yes":
        print(f"Duplicate found for {base_name}.")

    # Append the new record to the CSV with headers: filename, management_discussion, financial_quarter, call_date.
    with open(all_transcripts_csv, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([base_name, transcript_text, financial_quarter, call_date])

    print(f"Transcript for '{base_name}' appended (financial_quarter: {financial_quarter}, call_date: {call_date}).")


Call Date Mapping: {}

Processing MP4 file: /content/drive/MyDrive/BOE/bank_of_england/data/raw/santander/video_2024_Q1_1.mp4
Transcript for 'video_2024_Q1_1' appended (financial_quarter: 2024 Q1, call_date: Unknown).

Processing MP4 file: /content/drive/MyDrive/BOE/bank_of_england/data/raw/santander/video_2024_Q2_2.mp4
Transcript for 'video_2024_Q2_2' appended (financial_quarter: 2024 Q2, call_date: Unknown).

Processing MP4 file: /content/drive/MyDrive/BOE/bank_of_england/data/raw/santander/video_2024_Q3_3.mp4
Transcript for 'video_2024_Q3_3' appended (financial_quarter: 2024 Q3, call_date: Unknown).

Processing MP4 file: /content/drive/MyDrive/BOE/bank_of_england/data/raw/santander/video_2024_Q4_4.mp4
Transcript for 'video_2024_Q4_4' appended (financial_quarter: 2024 Q4, call_date: Unknown).

Processing MP4 file: /content/drive/MyDrive/BOE/bank_of_england/data/raw/santander/video_2023_Q1_5.mp4
Transcript for 'video_2023_Q1_5' appended (financial_quarter: 2023 Q1, call_date: Unknown)
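
Because transcription is the slow step, one option is to filter out files that already have a row in the CSV before calling Whisper at all. The sketch below shows that idea in isolation, assuming the `filename` column holds the base name without extension, as in the cell above. The helper names and the sample data are illustrative and not part of the original notebook.

```python
import csv
import io
import os

def load_processed_names(csv_file):
    """Collect the non-empty 'filename' values from an open CSV file object."""
    return {row["filename"] for row in csv.DictReader(csv_file) if row.get("filename")}

def select_unprocessed(mp4_paths, processed_names):
    """Keep only paths whose base name (without extension) is not yet in the CSV."""
    pending = []
    for path in mp4_paths:
        base_name = os.path.splitext(os.path.basename(path))[0]
        if base_name not in processed_names:
            pending.append(path)
    return pending

# Illustrative data only; a real run would open the CSV on Drive instead.
sample_csv = io.StringIO(
    "filename,management_discussion,financial_quarter,call_date\n"
    "video_2024_Q1_1,some transcript,2024 Q1,Unknown\n"
)
processed = load_processed_names(sample_csv)
todo = select_unprocessed(
    ["/data/video_2024_Q1_1.mp4", "/data/video_2024_Q2_2.mp4"], processed
)
print(todo)
```

Looping over the filtered list rather than every MP4 file would make the cell safe to re-run without repeating transcription or adding duplicate rows.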

In [4]:
import nltk
# Ensure the 'punkt_tab' resource is downloaded
nltk.download('punkt_tab')

from nltk.tokenize import sent_tokenize

def chunk_text(text, max_chunk_size=500):
    """
    Splits the input text into chunks of at most max_chunk_size characters, breaking only at
    sentence boundaries. A single sentence longer than max_chunk_size is kept whole as its own
    chunk, so that chunk can exceed the limit.

    Parameters:
        text (str): The full text to be chunked.
        max_chunk_size (int): Maximum number of characters per chunk.

    Returns:
        List[str]: A list of text chunks.
    """
    # Tokenize the text into sentences.
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # If adding the next sentence would exceed the maximum size, save the current chunk.
        if len(current_chunk) + len(sentence) + 1 > max_chunk_size:
            if current_chunk:
                chunks.append(current_chunk.strip())
                current_chunk = sentence + " "
            else:
                # If a single sentence exceeds max_chunk_size, add it as its own chunk.
                chunks.append(sentence.strip())
                current_chunk = ""
        else:
            current_chunk += sentence + " "

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

# Example usage:
transcript_text = (
    "I'm going to start with the presentation. Please everyone and welcome to our third quarter earnings presentation. "
    "Like always, the presentation will start with our CEO's comments, followed by my detailed explanation of the P&L. "
    "He will then offer his concluding remarks and we will open it up for questions. "
    "This quarter has been a record quarter for Santander, with profit up 12% compared to last quarter. "
    "On the back of a strong customer base of 171 million, we continue to demonstrate resilience in our business model. "
    "Our efficiency has improved, and our balance sheets remain solid. "
    "We are on track to exceed our targets for the year. "
    "Now, let's move into a more detailed discussion on our financial performance, starting with revenue growth and cost management. "
    "We are confident that the strategies implemented will continue to deliver strong results in the upcoming quarters."
)

# Split the transcript into chunks (max 500 characters per chunk).
chunks = chunk_text(transcript_text, max_chunk_size=500)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}:\n{chunk}\n")


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Chunk 1:
I'm going to start with the presentation. Please everyone and welcome to our third quarter earnings presentation. Like always, the presentation will start with our CEO's comments, followed by my detailed explanation of the P&L. He will then offer his concluding remarks and we will open it up for questions. This quarter has been a record quarter for Santander, with profit up 12% compared to last quarter.

Chunk 2:
On the back of a strong customer base of 171 million, we continue to demonstrate resilience in our business model. Our efficiency has improved, and our balance sheets remain solid. We are on track to exceed our targets for the year. Now, let's move into a more detailed discussion on our financial performance, starting with revenue growth and cost management. We are confident that the strategies implemented will continue to deliver strong results in the upcoming quarters.
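
To apply chunking to the transcripts stored in the CSV, each row can be expanded into one record per chunk while keeping its metadata. The sketch below takes the chunking function as a parameter so that `chunk_text` from the cell above could be passed in. The helper `explode_into_chunks`, the `chunk_id` field, and the simple stand-in splitter are assumptions for illustration only.

```python
import csv
import io

def explode_into_chunks(rows, chunker):
    """Yield one record per chunk, carrying the row's metadata along."""
    for row in rows:
        chunks = chunker(row["management_discussion"])
        for index, chunk in enumerate(chunks, start=1):
            yield {
                "filename": row["filename"],
                "financial_quarter": row["financial_quarter"],
                "chunk_id": index,
                "chunk": chunk,
            }

def split_on_full_stop(text):
    """Stand-in chunker for this demo only: one chunk per full-stop-delimited piece."""
    return [piece.strip() + "." for piece in text.split(".") if piece.strip()]

# Illustrative data only; a real run would read the transcripts CSV on Drive.
sample_csv = io.StringIO(
    "filename,management_discussion,financial_quarter,call_date\n"
    "video_2024_Q3_3,Profit is up. Costs are down.,2024 Q3,Unknown\n"
)
records = list(explode_into_chunks(csv.DictReader(sample_csv), split_on_full_stop))
for record in records:
    print(record)
```

In the notebook, a call such as `explode_into_chunks(reader, lambda t: chunk_text(t, max_chunk_size=500))` would reuse the sentence-aware chunker instead of the stand-in.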

