# Cleaning Pipeline with SpaCy + Time Stamp Prompt Engineering

This notebook contains:

- A **cleaning pipeline** using SpaCy.
- Updated **chunking** techniques.
- A function to convert timestamps into `HH:MM:SS.sss` format.
- **Prompts** for extracting key steps with **time stamps**.

---


Install packages

In [None]:
# ! pip install spacy

In [None]:
# !python -m spacy download en_core_web_lg  # loading english model
# !python -m spacy download de_core_news_lg # loading german model

In [None]:
# !pip install transformers pandas requests openai tiktoken


In [None]:
# Import packages
import os
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
import json
import requests
from typing import List, Dict
import openai
from openai import OpenAI
import tiktoken  # To count tokens accurately
from datetime import datetime
import spacy
import re

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


### Define OpenAI key and instance client
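
The cell that defines the key is not part of this export, yet later cells pass `api_key` to `process_transcript_in_chunks`. A minimal sketch of one way to supply it, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (the helper `load_api_key` is hypothetical and not from the original notebook):

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var` (assumed name), or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key

# In the notebook this would be called once before Step 3:
# api_key = load_api_key()
```

Reading the key from the environment keeps it out of the saved notebook and its outputs.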

### Function to save outputs

In [None]:
def save_model_output(output, folder="model_outputs", filename_prefix="output"):
    # Ensure the folder exists
    os.makedirs(folder, exist_ok=True)

    # Generate a timestamped filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{filename_prefix}_{timestamp}.json"
    filepath = os.path.join(folder, filename)

    # Save output as JSON
    with open(filepath, "w", encoding="utf-8") as f:
        if isinstance(output, dict):
            json.dump(output, f, indent=4, ensure_ascii=False)
        else:
            json.dump({"output": output}, f, indent=4, ensure_ascii=False)

    print(f"Output saved to: {filepath}")
    return filepath

### Function to convert time

In [None]:
def convert_to_seconds(time_str):
    match = re.match(r"(\d{2}):(\d{2}):(\d{2}\.\d+)", time_str)
    if not match:
        raise ValueError(f"Invalid time format: {time_str}")

    hours, minutes, seconds = map(float, match.groups())
    return hours * 3600 + minutes * 60 + seconds

def format_time(value):
    try:
        # If the value is a string in HH:MM:SS.sss format, convert it to seconds
        if isinstance(value, str) and ":" in value:
            value = convert_to_seconds(value)
        else:
            value = float(value)  # Ensure value is float

        # Convert seconds into HH:MM:SS.sss format
        hours = int(value // 3600)
        minutes = int((value % 3600) // 60)
        seconds = value % 60
        return f"{hours:02}:{minutes:02}:{seconds:06.3f}"
    except (TypeError, ValueError):  # TypeError covers non-numeric values such as None
        print(f"Warning: Could not convert value '{value}' to float")
        return value  # Return original value if conversion fails

# Recursively update time values
def update_times(data):
    if isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, dict):
                update_times(value)  # Recursively process nested dictionaries
            elif key in ["startTime", "endTime", "duration"]:
                data[key] = format_time(value)  # Convert to proper format
    return data
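
`convert_to_seconds` only accepts `HH:MM:SS.sss`. A value such as `00:27.660` (minutes and seconds only), which the model sometimes returns for `duration`, fails the match, and `format_time` then hands it back unchanged. A sketch of a more tolerant parser, assuming that a two-field value means `MM:SS` and that the fractional part is optional (the names `parse_clock_time` and `to_hms` are hypothetical):

```python
import re

# Optional hours field, then minutes, then seconds with an optional fraction
_CLOCK_RE = re.compile(r"(?:(\d+):)?(\d{1,2}):(\d{1,2}(?:\.\d+)?)")

def parse_clock_time(time_str):
    """Convert 'HH:MM:SS(.sss)' or 'MM:SS(.sss)' to seconds (assumed interpretation)."""
    match = _CLOCK_RE.fullmatch(time_str.strip())
    if not match:
        raise ValueError(f"Invalid time format: {time_str}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)

def to_hms(total_seconds):
    """Format a number of seconds as HH:MM:SS.sss, mirroring format_time."""
    hours = int(total_seconds // 3600)
    minutes = int((total_seconds % 3600) // 60)
    seconds = total_seconds % 60
    return f"{hours:02}:{minutes:02}:{seconds:06.3f}"

print(to_hms(parse_clock_time("00:27.660")))
print(to_hms(parse_clock_time("00:01:56.199")))
```

With this interpretation, a short duration is normalised to the same three-field layout as the start and end times instead of being passed through as-is.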


### Define Prompts

In [None]:
system_message = """
    You're an AI with medical knowledge and you're asked to read a text from a video and answer the questions that are asked to you.
    Act like your input is a medical case. Please respond only in plain text without using Markdown formatting or special characters
    for headings, bullet points, code blocks, or links. Examine the raw transcription of the medical operation video and extract essential
    components such as procedure name, doctors' names, steps, tools, patient considerations, and outcomes. While formulating your responses,
    never mention the transcript. Speak as if your information comes directly from the video recording of the operation. Ensure that all details
    remain true to the original transcription without altering or simplifying the language. Identify and highlight the most relevant sections of the
    transcription that accurately describe the procedure without modifying the medical terminology or phrasing. Ensure that the original language
    is preserved fully to maintain fidelity to the transcription.

"""

In [None]:
time_stamp_prompt2 = '''
The document consists of **segmented transcript chunks**, each containing a portion of a surgical procedure.

Your task is to analyze the given chunk and summarize **only its relevant procedural details**, ensuring accuracy and avoiding redundancy.
Format your response as a **dictionary** with the following structure:

{
    "procedure_name": {
            "<Step Name>": {
                "description": "<Detailed explanation of this specific step in the procedure>",
                "startTime": <start_time>,
                "endTime": <end_time>,
                "duration": <step_duration>
            }
        }
    }

### **Guidelines:**
- **Each chunk contains only part of the full procedure**. Do NOT attempt to summarize missing information.
- Avoid repeating procedural steps already described in previous chunks.
- If a chunk introduces **a new step**, extract its details along with accurate timestamps.
- If a chunk **continues a previously described step**, provide additional details without redundancy.
- **Preserve medical terminology** and ensure timestamps align with procedural actions.
- If a chunk contains **no relevant surgical details**, do not provide a summary or time stamps for that segment.
- Include any events that are unforeseen or abnormal within a chunk.
- Maintain the original language of the text, the output should be in the same language as the input.
### **Example Output (for a single chunk):**
{
    "Transcatheter Aortic Valve Replacement (TAVR)": {
        "Positioning of the Delivery Catheter": {
            "description": "The catheter is advanced under fluoroscopic guidance toward the aortic valve. Proper positioning is confirmed using angiographic imaging.",
            "startTime": "HH:MM:SS",
            "endTime": "HH:MM:SS",
            "duration": "HH:MM:SS"
        }
    }
}
'''

subsequent_chunk_prompt = '''
The document is a JSON file containing transcript segments with start and end times.

Your task is to **continue summarizing** the surgical procedure in a structured format. **Do not include the procedure name again** as it has already been established. Your response **must be a dictionary format** with the following structure:
{ "<Step Name>": {
            "description": "<Detailed step explanation including different key aspects of the step>",
            "startTime": <start_time>,
            "endTime": <end_time>,
            "duration": <step_duration>
        }
    }

### **Guidelines:**
- **Each chunk contains only part of the full procedure**. Do NOT attempt to summarize missing information.
- If a chunk discusses **a medical step**, extract its details along with accurate timestamps.
- **Preserve medical terminology** and ensure timestamps align with procedural actions.
- If a chunk contains **no relevant surgical details**, do not provide a summary for that segment.
- Include any events that are unforeseen or abnormal within a chunk relevant to the medical procedure.
'''

### Functions to clean text, NLP functions

In [None]:
def clean_text_in_chunks(text, chunk_size=50000, lang='en'):
    # Load the correct spaCy model based on the language parameter
    models = {
        "en": "en_core_web_lg",
        "fr": "fr_core_news_lg",
        "es": "es_core_news_lg",
        "de": "de_core_news_lg",
    }

    if lang not in models:
        raise ValueError(f"Unsupported language: {lang}")

    # Load the NLP model
    nlp = spacy.load(models[lang])

    # Raise max_length because spaCy rejects texts longer than its default limit of 1,000,000 characters
    nlp.max_length = max(len(text), 2_000_000)

    original_length = len(text.split())

    words = text.split()
    cleaned_chunks = []

    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])  # Take a chunk of words
        doc = nlp(chunk)  # Process only the chunk

        cleaned_tokens = [
            token.lemma_ if token.lemma_ != "-PRON-" else token.text
            for token in doc
            if not token.is_stop
            and (not token.is_punct or token.text in "{}:")  # keep braces and colons so the model can still read the JSON-like formatting
        ]

        cleaned_chunks.append(" ".join(cleaned_tokens))  # Store the cleaned chunk

    cleaned_text = " ".join(cleaned_chunks)  # Recombine chunks

    cleaned_length = len(cleaned_text.split())

    print(f"Original Length: {original_length} words")
    print(f"Cleaned Length: {cleaned_length} words")

    return cleaned_text

# function to clean dorian translated text
def extract_quoted_text(input_text):

    # Find all text within double quotes
    pattern = r'"(.*?)"'
    matches = re.findall(pattern, input_text)

    # Combine all matches into a single string
    extracted_text = " ".join(matches)

    return extracted_text


In [None]:
# we want to get an idea of the type of words that were removed
def get_removed_words(original_text, cleaned_text):
    original_words = set(original_text.split())
    cleaned_words = set(cleaned_text.split())
    removed_words = original_words - cleaned_words  # Words that were in original but not in cleaned version
    return list(removed_words)

### Chunk text

In [None]:
# Load OpenAI's tokenizer for token counting (cl100k_base, the encoding used by GPT-3.5-turbo and GPT-4)
tokenizer = tiktoken.get_encoding("cl100k_base")

def split_transcript(transcript, max_tokens = 4000, overlap=300):

    words = transcript.split()  # Split into words
    token_counts = [len(tokenizer.encode(word)) for word in words]  # Tokenize words

    chunks = []
    current_chunk = []
    current_token_count = 0
    i = 0

    while i < len(words):
        if current_token_count + token_counts[i] <= max_tokens:
            current_chunk.append(words[i])
            current_token_count += token_counts[i]
        else:
            # When reaching token limit, store current chunk and start a new one
            chunks.append(" ".join(current_chunk))

            # Start a new chunk with the last `overlap` words for context,
            # then include the current word so it is not dropped at the boundary
            overlap_start = max(0, i - overlap)
            current_chunk = words[overlap_start:i + 1]
            current_token_count = sum(token_counts[overlap_start:i + 1])

        i += 1

    # Append the final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
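
To see how the overlap behaves, the loop can be exercised on a toy input. The sketch below reimplements the same idea with a pluggable `count_tokens` function (a hypothetical parameter; the notebook uses `tiktoken`), counting one token per word so the result is easy to check by hand. It keeps the word that triggers a new chunk, and it assumes `overlap` is measured in words, as in `split_transcript`.

```python
def split_words_with_overlap(words, max_tokens, overlap, count_tokens=lambda w: 1):
    """Greedy word-level chunking; each new chunk starts with the last `overlap` words."""
    counts = [count_tokens(w) for w in words]
    chunks, current, current_count = [], [], 0
    for i, word in enumerate(words):
        # Close the current chunk when adding this word would exceed the budget
        if current and current_count + counts[i] > max_tokens:
            chunks.append(current)
            start = max(0, i - overlap)
            current = words[start:i]  # slice copy: the overlap carried forward
            current_count = sum(counts[start:i])
        current.append(word)
        current_count += counts[i]
    if current:
        chunks.append(current)
    return chunks

words = [f"w{n}" for n in range(10)]
chunks = split_words_with_overlap(words, max_tokens=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk after the first begins with the final word of the previous one, and every input word appears in at least one chunk. Note that a large `overlap` relative to `max_tokens` can still produce chunks above the budget, since the overlap is not trimmed.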

In [None]:
# Function to chunk text based on token limit without overlap

def chunk_text(text, max_tokens=3000):
    enc = tiktoken.encoding_for_model("gpt-4-turbo")
    tokens = enc.encode(text)

    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk = enc.decode(tokens[i : i + max_tokens])
        chunks.append(chunk)

    return chunks

### OPENAI Functions

In [None]:
# Generic function to call OpenAI with the defined parameters
def GenTimeStampSummary(transcript, system_message, prompt, temperature, client, model = "gpt-4-turbo"):

    # Construct the message prompt
    messages = [
        {"role": "system", "content": system_message + prompt},
        {"role": "user", "content": transcript}
    ]

    # Call the OpenAI API
    response = client.chat.completions.create(
        model = model,
        messages = messages,
        max_tokens = 4000,
        temperature = temperature
    )

    # Extract the summary from the response
    outputs = {"summary": response.choices[0].message.content}
    return outputs

def process_transcript_in_chunks(
    chunks, system_message, first_prompt, subsequent_prompt, api_key=None, temperature=0.3
):
    # Initialize OpenAI client
    client = OpenAI(api_key=api_key)
    structured_output = {}

    for idx, chunk in enumerate(chunks):
        print(f"Processing chunk {idx+1} of {len(chunks)}...")

        if idx == 0:
            # First chunk uses the full prompt
            result = GenTimeStampSummary(
                chunk,
                system_message,
                first_prompt,
                temperature,
                client
            )
            structured_output = result
        else:
            # Process subsequent chunks with simplified prompt
            result = GenTimeStampSummary(
                chunk,
                system_message,
                subsequent_prompt,
                temperature,
                client
            )

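            # NOTE: GenTimeStampSummary returns {"summary": <str>}, so the only key of
            # structured_output is "summary" and `result` never contains "steps".
            # As written, this merge never runs and only the first chunk's summary is returned.
            if result and isinstance(result, dict):
                if structured_output and len(structured_output) > 0:
                    procedure_name = list(structured_output.keys())[0]
                    if "steps" in result and procedure_name in structured_output:
                        structured_output[procedure_name]["steps"].update(result["steps"])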

    return structured_output
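
Because `GenTimeStampSummary` returns `{"summary": <text>}`, combining chunks requires parsing each summary first. A hedged sketch of one way to do that, assuming every usable summary is valid JSON, that the first chunk is shaped `{procedure_name: {step: {...}}}`, and that later chunks are shaped `{step: {...}}` as the two prompts request (`merge_chunk_summaries` is a hypothetical helper, not part of the notebook):

```python
import json

def merge_chunk_summaries(summaries):
    """Merge per-chunk JSON strings into one {procedure_name: {step: details}} dict."""
    merged = {}
    procedure_name = None
    for idx, text in enumerate(summaries):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            print(f"Chunk {idx + 1}: summary is not valid JSON, skipped")
            continue
        if not isinstance(parsed, dict) or not parsed:
            continue
        if procedure_name is None:
            # First usable chunk: its first top-level key names the procedure
            procedure_name = next(iter(parsed))
            merged[procedure_name] = dict(parsed[procedure_name])
        else:
            # Later chunks: top-level keys are step names
            merged[procedure_name].update(parsed)
    return merged

example = merge_chunk_summaries([
    '{"TMVR": {"Step A": {"startTime": "0.0", "endTime": "10.0"}}}',
    '{"Step B": {"startTime": "10.0", "endTime": "25.5"}}',
    'No relevant surgical details.',
])
print(json.dumps(example, indent=4))
```

Steps with the same name in a later chunk overwrite earlier ones here; whether to overwrite or concatenate descriptions is a design choice the notebook does not specify.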

# CALLING Pipeline

### Step 1: Text preprocessing using SpaCy (cleaning/reduction)

In [None]:
# Loading Transcript1
with open('transcripts/Pasteur_TMVR.txt', 'r') as file:
    transcript1 = file.read()

In [None]:
cleaned_text1 = clean_text_in_chunks(transcript1, chunk_size=50000, lang = 'en')

Original Length: 3762 words
Cleaned Length: 1609 words


In [None]:
cleaned_text1

'{ transcript":"so welcome clinic pastor transcatheter mitral valve valve case uh discuss baseline clinical Characteristic imaging uh review uh case uh uh detail step step transcatheter mitral valve valve implantation highlight um advanced feature periprocedural imaging ","startTime":"0.0","endTime":"27.51"},{"transcript":"We benefit guide uh case uh case guide transesophageal echography colleague doctor patient course reason general anesthesia care doctor ","starttime":"27.719","endtime":"41.75"},{"transcript":"um think wait day briefly review far um mention patient general dys benefit uh transesophageal echography guidance um right groin position uh um sheath trans valve implant uh uh 29 millimeter accord sizing,","starttime":"42.439","endtime":"68.22"},{"transcript":"we\'ve review small art line monitor arterial pressure blood sample monitor act uh uh t image inside surgical frame um decide protect brain cerebral protection device sentinel,","starttime":"68.47","endtime":"90.61"},{"

Cleaning removes about 57% of the words (3762 down to 1609).


Next, check which kinds of English and German words are removed, for comparison with other models.

In [None]:
removed_words = get_removed_words(transcript1, cleaned_text1)

In [None]:
removed_words

['Yes,',
 'proceed.","startTime":"761.789","endTime":"791.489"},{"transcript":"So',
 'difficulties',
 'Ok.","startTime":"1148.869","endTime":"1165.689"},{"transcript":"So,',
 'E',
 'mentioned,',
 'directed',
 'tips',
 'case.',
 'that.',
 'bit.',
 'to,","startTime":"916.51","endTime":"929.08"},{"transcript":"to',
 'here,',
 'Remember',
 'these',
 'uh","startTime":"875.19","endTime":"879.13"},{"transcript":"our',
 'putting',
 'mercury.","startTime":"116.9","endTime":"121.339"},{"transcript":"And',
 'bioprosthesis,',
 'valve.","startTime":"1449.449","endTime":"1464.01"},{"transcript":"I',
 'plane.',
 'appearing',
 'The',
 'So',
 'first',
 'will',
 'sir.","startTime":"527.479","endTime":"557.03"},{"transcript":"So',
 'configuration","startTime":"141.229","endTime":"158.33"},{"transcript":"in',
 'seconds.","startTime":"177.52","endTime":"201.75"},{"transcript":"And',
 'back,',
 'Well,',
 'intention,',
 'lot.',
 'being',
 'Is',
 'either',
 'Yeah.',
 'easier.',
 'stop.',
 'decided',
 'to","st

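We can see that some start and end times were removed. However, this does not seem to have an overall negative impact when the model does the timestamping. Some medical terms also appear in the list: balloon, valve, bioprosthesis, ventricle. Part of this is an artifact of the comparison: `get_removed_words` compares whitespace-separated tokens, so a word attached to punctuation or JSON keys (for example `bioprosthesis,`) counts as removed even when the word itself survives, as `valve` does in the cleaned text above. We could look into processors that preserve medical vocabulary. However, the model still understands the text and context, so this may not be an issue.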
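
One way to separate genuinely dropped vocabulary from formatting noise is to normalise both texts before comparing. The sketch below is an illustration rather than the notebook's method: it lowercases the text and keeps only runs of letters, so `bioprosthesis,` and `valve."` reduce to plain words. It still compares surface forms, so a word that survives only as its lemma (for example `decided` becoming `decide`) is reported as removed. The function name and the sample strings are hypothetical.

```python
import re

def get_removed_words_normalized(original_text, cleaned_text):
    """Return words (lowercased, letters only) present in the original but absent after cleaning."""
    def vocabulary(text):
        # [^\W\d_]+ matches runs of Unicode letters, excluding digits and underscores
        return set(re.findall(r"[^\W\d_]+", text.lower()))
    return sorted(vocabulary(original_text) - vocabulary(cleaned_text))

original = '{"transcript":"So the valve, and the bioprosthesis.","startTime":"1.0"}'
cleaned = '{ transcript":"valve bioprosthesis ","starttime":"1.0"}'
print(get_removed_words_normalized(original, cleaned))
```

On this toy input only the stop words are reported, while the medical terms and the JSON keys are recognised as still present.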

### Step 2: Chunk text (so it is under the 4096-token limit)

In [None]:
chunks1 = split_transcript(cleaned_text1, max_tokens=4000, overlap=250)

In [None]:
chunks1

['{ transcript":"so welcome clinic pastor transcatheter mitral valve valve case uh discuss baseline clinical Characteristic imaging uh review uh case uh uh detail step step transcatheter mitral valve valve implantation highlight um advanced feature periprocedural imaging ","startTime":"0.0","endTime":"27.51"},{"transcript":"We benefit guide uh case uh case guide transesophageal echography colleague doctor patient course reason general anesthesia care doctor ","starttime":"27.719","endtime":"41.75"},{"transcript":"um think wait day briefly review far um mention patient general dys benefit uh transesophageal echography guidance um right groin position uh um sheath trans valve implant uh uh 29 millimeter accord sizing,","starttime":"42.439","endtime":"68.22"},{"transcript":"we\'ve review small art line monitor arterial pressure blood sample monitor act uh uh t image inside surgical frame um decide protect brain cerebral protection device sentinel,","starttime":"68.47","endtime":"90.61"},{

### Step 3: CALL OpenAI

In [None]:
# Step 3: call the function using the chunked transcript
final_summary = process_transcript_in_chunks(
    chunks1,
    system_message = system_message,
    first_prompt = time_stamp_prompt2,  # Full prompt for first chunk
    subsequent_prompt= subsequent_chunk_prompt,  # Simplified prompt
    temperature=0.0,
    api_key=api_key
)

print(final_summary['summary'])

Processing chunk 1 of 1...
{
    "Transcatheter Mitral Valve Replacement": {
        "Review of Baseline Clinical and Imaging Data": {
            "description": "The case begins with a discussion of the patient's baseline clinical characteristics and imaging data. This includes a review of the patient's general condition, the benefits of using transesophageal echography for guidance, and the positioning of the patient under general anesthesia. The imaging review highlights the thin mitral valve leaflets with prolapse of the anterior leaflet and severe intraproietic mitral regurgitation with a peri-valvular leak. The mean gradient across the valve is noted, and the sizing for the valve implantation is determined to be 29 millimeters based on these assessments.",
            "startTime": "0.0",
            "endTime": "116.199",
            "duration": "116.199"
        },
        "Preparation and Positioning for Valve Implantation": {
            "description": "The procedure continues 

In [None]:
summary_dict1 = json.loads(final_summary["summary"])

# Update the dictionary
updated_data = update_times(summary_dict1)

# Print the updated dictionary
print(updated_data)

{'Transcatheter Mitral Valve Replacement': {'Review of Baseline Clinical and Imaging Data': {'description': "The case begins with a discussion of the patient's baseline clinical characteristics and imaging data. This includes a review of the patient's general condition, the benefits of using transesophageal echography for guidance, and the positioning of the patient under general anesthesia. The imaging review highlights the thin mitral valve leaflets with prolapse of the anterior leaflet and severe intraproietic mitral regurgitation with a peri-valvular leak. The mean gradient across the valve is noted, and the sizing for the valve implantation is determined to be 29 millimeters based on these assessments.", 'startTime': '00:00:00.000', 'endTime': '00:01:56.199', 'duration': '00:01:56.199'}, 'Preparation and Positioning for Valve Implantation': {'description': 'The procedure continues with the preparation for the transcatheter mitral valve implantation. This includes the positioning o

In [None]:
save_model_output(updated_data, folder="outputs", filename_prefix="pastuer_timestamp_gpt4")

Output saved to: outputs/pastuer_timestamp_gpt4_20250322_234240.json


'outputs/pastuer_timestamp_gpt4_20250322_234240.json'

## German transcript using Pipeline


This German transcript posed many challenges. As confirmed by a German speaker, the doctors used a lot of slang and only some medical terminology, which lowered the apparent professionalism of the procedure (as interpreted by GPT). When the transcript was translated into English, some of the timestamps were eroded, which in turn affected the timestamp prompt engineering. In addition, two doctors hold a conversation throughout most of the procedure. This contrasts with the other two transcripts, which give a better guide to what is visible in the procedure. This case is a good example of where video, images, or computer vision would be useful alongside the audio, since audio alone provides minimal context.

Take a look at the translated text and see what is readable to you. It is very hard to grasp the ideas, so it is impressive that the LLM was able to make sense of it.

In [None]:
# Step 0: load the transcript (already translated to English)
with open('transcripts/helsinki_cleand_translated.txt', 'r') as file:
    transcript4 = file.read()

In [None]:
transcript4

'[{\'translated_text\': "Test test record 62 frequency preparation procedure makes sorry finds Ki mal table video works hopefully works sound times today\'s mobile phones laid yes four pieces na already Christmas Have Kita App family App Kita Yes yes practical ne find Yes yes yes watch pure Wirsing kale apple jelly",\n  \'start_time\': None,\n  \'end_time\': 63.819},\n {\'translated_text\': \'Croquettes sometimes impressive delicious Kita today looking pure gave broccoli uh yes home yes mom potato porridge Kita good masses home potato best cook appetit eat just yes yes oh yeah funny please free and sorry find really horny thanks says serious needs drive already a bit just once\',\n  \'start_time\': 64.169,\n  \'end_time\': 153.83},\n {\'translated_text\': "Little driving actually exactly 25 o\'clock 35 two always yes corner Zero gravity relate yes nice Nee say yes even Say emphasize first come \'s always consecrated \'s calmly learn times have to guide bag must hand exactly best both h

In [None]:
cleaned_german = extract_quoted_text(transcript4)
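
`extract_quoted_text` keeps only double-quoted strings, so segments whose text is wrapped in single quotes in the raw preview above (such as the one starting with `Croquettes`) are not captured. If the file really is a printed Python list of dictionaries, as that preview suggests, it could be parsed structurally instead. A sketch under that assumption, using the standard-library `ast.literal_eval` (`extract_translated_text` and the sample string are hypothetical):

```python
import ast

def extract_translated_text(raw, key="translated_text"):
    """Parse a Python-literal list of segment dicts and join their text fields."""
    segments = ast.literal_eval(raw)
    return " ".join(seg[key] for seg in segments if seg.get(key))

sample = """[{'translated_text': "today's test", 'start_time': None, 'end_time': 63.819},
 {'translated_text': 'Croquettes sometimes', 'start_time': 64.169, 'end_time': 153.83}]"""
print(extract_translated_text(sample))
```

This keeps both the single-quoted and the double-quoted segments, and it leaves the timestamps available in `segments` if they are needed later.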


In [None]:
cleaned_german

"Test test record 62 frequency preparation procedure makes sorry finds Ki mal table video works hopefully works sound times today's mobile phones laid yes four pieces na already Christmas Have Kita App family App Kita Yes yes practical ne find Yes yes yes watch pure Wirsing kale apple jelly Little driving actually exactly 25 o'clock 35 two always yes corner Zero gravity relate yes nice Nee say yes even Say emphasize first come 's always consecrated 's calmly learn times have to guide bag must hand exactly best both hands would like to learn times lesson would like to learn yes on it pretty foolproof already yes means go hand in ok yes yes yes ah yes right yes R R R r r do 's wrong already do 's wrong yes get there ok even simple Yes Something fits eggs yes old a three a four four fold up one two yes goes yes even far ne say down folded ne yes yes two down ne comes let's rather give new gloves old people driving licence yes safety first new gloves yes 30 yes exactly extra ordered applau

In [None]:
# step 1: clean text
cleaned4 = clean_text_in_chunks(transcript4, chunk_size = 50000, lang='en')

Original Length: 3534 words
Cleaned Length: 3041 words


In [None]:
cleaned4

"{ translated_text : test test record 62 frequency preparation procedure make sorry find Ki mal table video work hopefully work sound time today mobile phone lay yes piece na Christmas Kita App family App Kita yes yes practical ne find yes yes yes watch pure Wirsing kale apple jelly start_time : end_time : 63.819 } { translated_text : Croquettes impressive delicious Kita today look pure give broccoli uh yes home yes mom potato porridge Kita good masse home potato good cook appetit eat yes yes oh yeah funny free sorry find horny thank say need drive bit start_time : 64.169 end_time : 153.83 } { translated_text : little drive actually exactly 25 o'clock 35 yes corner Zero gravity relate yes nice Nee yes emphasize come consecrate calmly learn time guide bag hand exactly good hand like learn time lesson like learn start_time : 153.839 end_time : 252.139 } { translated_text : yes pretty foolproof yes mean hand ok yes yes yes ah yes right yes r r r r r wrong wrong yes ok simple yes fit egg y

In [None]:
# step 2: split text into chunks
chunks4 = split_transcript(cleaned4, max_tokens= 1500, overlap= 450)

In [None]:
chunks4

["{ translated_text : test test record 62 frequency preparation procedure make sorry find Ki mal table video work hopefully work sound time today mobile phone lay yes piece na Christmas Kita App family App Kita yes yes practical ne find yes yes yes watch pure Wirsing kale apple jelly start_time : end_time : 63.819 } { translated_text : Croquettes impressive delicious Kita today look pure give broccoli uh yes home yes mom potato porridge Kita good masse home potato good cook appetit eat yes yes oh yeah funny free sorry find horny thank say need drive bit start_time : 64.169 end_time : 153.83 } { translated_text : little drive actually exactly 25 o'clock 35 yes corner Zero gravity relate yes nice Nee yes emphasize come consecrate calmly learn time guide bag hand exactly good hand like learn time lesson like learn start_time : 153.839 end_time : 252.139 } { translated_text : yes pretty foolproof yes mean hand ok yes yes yes ah yes right yes r r r r r wrong wrong yes ok simple yes fit egg 

In [None]:
removed_words4 = get_removed_words(transcript4, cleaned4)

In [None]:
# look at removed words
removed_words4

['front",',
 'punctured',
 '"pushen',
 'lock",',
 'found',
 'looking',
 'patients',
 'wonders',
 'yes",',
 'Something',
 '2694.989,',
 'knows',
 'falls',
 'anyway',
 'daughters',
 'jumps',
 'first',
 '252.139},',
 'minutes',
 "today's",
 'tries',
 'Mhm',
 '1267.3},',
 '3800.439,',
 'understands',
 'picked',
 '2025.3},',
 'takes',
 "'yes",
 "'connected",
 'years',
 'decided',
 'Good',
 'freezes',
 "[{'translated_text':",
 "23',",
 'all',
 'even',
 'same',
 '606.919,',
 'going',
 'balloons',
 '865.739},',
 'means',
 '3704.409},',
 'why',
 'Buy',
 '4436.229,',
 'call",',
 'higher',
 'often',
 'slotted',
 'ordered',
 '153.83},',
 '1966.829,',
 'four-digit',
 'rocks',
 "okay',",
 'whole',
 'bottom',
 'know",',
 'top',
 'Below',
 '733.14,',
 'findings',
 'meant',
 'phones',
 '"Little',
 'Previously,',
 '1529.64},',
 'could',
 "'talk",
 '153.839,',
 '1110.989,',
 'her',
 'between',
 '1046.798},',
 'show',
 'used',
 "'A",
 'eggs',
 "'stuffed",
 'At',
 '"was',
 '2424.199,',
 '"same',
 '4554.529

In [None]:
# Step 3: call the OpenAI function using the chunked, cleaned data plus the first and subsequent prompts

final_summary2 = process_transcript_in_chunks(
    chunks4,
    system_message = system_message,
    first_prompt = time_stamp_prompt2,  # Full prompt for first chunk
    subsequent_prompt= subsequent_chunk_prompt,  # Simplified prompt
    temperature=0.3,  # Raised the temperature to make the model less deterministic: the transcript is of poor quality and the model needs to be more creative
    api_key=api_key
)

Processing chunk 1 of 5...
Processing chunk 2 of 5...
Processing chunk 3 of 5...
Processing chunk 4 of 5...
Processing chunk 5 of 5...


In [None]:
print(final_summary2)

{'summary': '{\n    "Venous Ultrasound": {\n        "Controlled Functions": {\n            "description": "Venous Ultrasound Controlled Functions were performed under local anesthesia with XY Loka. The procedure involved closure and ummodeling of the site.",\n            "startTime": "1048.89",\n            "endTime": "1110.579",\n            "duration": "00:01:01.689"\n        }\n    },\n    "Multipolar Catheter Insertion": {\n        "Catheter Insertion": {\n            "description": "A multipolar catheter was inserted at 13:25. This step involved careful monitoring and adjustment of the catheter position.",\n            "startTime": "1267.31",\n            "endTime": "1420.859",\n            "duration": "00:02:33.549"\n        }\n    },\n    "Anticoagulation": {\n        "Administration of Heparin": {\n            "description": "Administered 10,000 units of Heparin to the patient to prevent clotting during the procedure.",\n            "startTime": "1420.869",\n            "endTim

In [None]:
summary_dict2 = json.loads(final_summary2["summary"])

# Update the dictionary
updated_data2 = update_times(summary_dict2)

# Print the updated dictionary
print(updated_data2)

{'Venous Ultrasound': {'Controlled Functions': {'description': 'Venous Ultrasound Controlled Functions were performed under local anesthesia with XY Loka. The procedure involved closure and ummodeling of the site.', 'startTime': '00:17:28.890', 'endTime': '00:18:30.579', 'duration': '00:01:01.689'}}, 'Multipolar Catheter Insertion': {'Catheter Insertion': {'description': 'A multipolar catheter was inserted at 13:25. This step involved careful monitoring and adjustment of the catheter position.', 'startTime': '00:21:07.310', 'endTime': '00:23:40.859', 'duration': '00:02:33.549'}}, 'Anticoagulation': {'Administration of Heparin': {'description': 'Administered 10,000 units of Heparin to the patient to prevent clotting during the procedure.', 'startTime': '00:23:40.869', 'endTime': '00:24:50.920', 'duration': '00:01:10.051'}}, 'Pacemaker Function Check': {'Pacemaker Testing': {'description': 'The functionality of the pacemaker was tested, ensuring proper feedback and operation.', 'startTim

In [None]:
save_model_output(updated_data2, folder="outputs", filename_prefix="german_timestamp_gpt4")

Output saved to: outputs/german_timestamp_gpt4_20250322_233918.json


'outputs/german_timestamp_gpt4_20250322_233918.json'

## Long English text: Mitravalve (transcript 2)

In [None]:
# Loading transcript 2 (first file)
with open('transcripts/AVAM_Mitraclip_2024.txt', 'r') as file:
    transcript2 = file.read()

# Loading transcript 2 (second file)
with open('transcripts/AVAM_Mitraclip_2024_2.txt', 'r') as file:
    transcript3 = file.read()

In [None]:
full_transcript2 = transcript2 + transcript3

In [None]:
full_transcript2

'[{"transcript":"With me because there are so many other potential variables at the time of the implant and how the cows will respond or not. And the expansion, yeah, rings are challenging as well because sometimes we have been.","startTime":"0.25","endTime":"14.14"},{"transcript":"But depending, even sometimes the anterior leaflet is 2324 is quite a borderline. But again, it\'s a small ring and sometimes the centimeter but a little bit about, it\'s almost like all rings should have some leaflet modification as well despite the, I don\'t know how that interior will behave.","startTime":"14.34","endTime":"39.189"},{"transcript":"So, so has anyone here on the uh panel experienced LDLT obstruction and what, what was the predicted? And in those cases, first of all,","startTime":"39.389","endTime":"56.02"},{"transcript":"do you guys use a cut off for lampooning or,","startTime":"56.029","endTime":"72.669"},{"transcript":"or leaflet length? Um-hum like, you know, they\'re 2022 23 thrown arou

In [None]:
# step 1: clean text
cleaned2 = clean_text_in_chunks(full_transcript2, chunk_size = 3000, lang='en')

Original Length: 26448 words
Cleaned Length: 10915 words


In [None]:
# Checking words removed

removed_words2 = get_removed_words(full_transcript2, cleaned2)  # compare against the full text that was cleaned
print(removed_words2)

['Yes,', 'be,', 'creating', 'workflow.', 'repair,', 'found', 'damage.","startTime":"1125.011","endTime":"1153.489"},{"transcript":"So', 'Who,', 'here,', 'see","startTime":"6056.189","endTime":"6056.839"},{"transcript":"50","startTime":"6058.029","endTime":"6059.259"},{"transcript":"when', 'these', 'might', 'AM', 'this.","startTime":"701.5","endTime":"721.239"},{"transcript":"And', 'minutes', 'be","startTime":"4068.419","endTime":"4071.879"},{"transcript":"comfortable.","startTime":"4074.649","endTime":"4075.87"},{"transcript":"Ok.', 'procedure.', 'either', 'together.","startTime":"6097.35","endTime":"6119.6"},{"transcript":"That\'s', 'takes', 'they,', 'predictable,', 'like,', 'further', 'basket.', 'microsurgery,', 'state,","startTime":"181.389","endTime":"197.46"},{"transcript":"we\'ve', 'happening', 'young.","startTime":"4236.77","endTime":"4237.79"},{"transcript":"Uh', 'why', 'result,', 'working', 'devices', 'where","startTime":"4078.33","endTime":"4082.26"},{"transcript":"you', 'cal

In [None]:
# step 2: split text into chunks
chunks_AVAM = split_transcript(cleaned2, max_tokens=4000, overlap= 60)

In [None]:
# Step 3: call the OpenAI function using the chunked, cleaned data plus the first and subsequent prompts

final_summary4 = process_transcript_in_chunks(
    chunks_AVAM,
    system_message = system_message,
    first_prompt = time_stamp_prompt2,  # Full prompt for first chunk
    subsequent_prompt= subsequent_chunk_prompt,  # Simplified prompt
    temperature=0.0,
    api_key=api_key
)

Processing chunk 1 of 11...
Processing chunk 2 of 11...
Processing chunk 3 of 11...
Processing chunk 4 of 11...
Processing chunk 5 of 11...
Processing chunk 6 of 11...
Processing chunk 7 of 11...
Processing chunk 8 of 11...
Processing chunk 9 of 11...
Processing chunk 10 of 11...
Processing chunk 11 of 11...


In [None]:
print(final_summary4["summary"])

{
    "Transcatheter Mitral Valve Repair (TMVR)": {
        "Leaflet Modification and Device Orientation": {
            "description": "Surgeons perform microsurgery to remove the anterior leaflet to ensure the device is oriented correctly for optimal procedural outcomes.",
            "startTime": "244.139",
            "endTime": "271.799",
            "duration": "00:27.660"
        },
        "Cutting of the Leaflet": {
            "description": "The cutting of the actual leaflet is done liberally despite the large opening area, aiming for good results for the patient.",
            "startTime": "271.989",
            "endTime": "300.792",
            "duration": "00:28.803"
        },
        "Assessment of LVOT and Leaflet Size": {
            "description": "The left ventricular outflow tract (LVOT) size is assessed at 55 mm to evaluate the risk of obstruction. The anterior leaflet is noted to be long, measuring 25 mm.",
            "startTime": "402.769",
            "endTime

In [None]:
summary_dict3 = json.loads(final_summary4["summary"])

# Update the dictionary
updated_data3 = update_times(summary_dict3)

# Print the updated dictionary
print(updated_data3)

{'Transcatheter Mitral Valve Repair (TMVR)': {'Leaflet Modification and Device Orientation': {'description': 'Surgeons perform microsurgery to remove the anterior leaflet to ensure the device is oriented correctly for optimal procedural outcomes.', 'startTime': '00:04:04.139', 'endTime': '00:04:31.799', 'duration': '00:27.660'}, 'Cutting of the Leaflet': {'description': 'The cutting of the actual leaflet is done liberally despite the large opening area, aiming for good results for the patient.', 'startTime': '00:04:31.989', 'endTime': '00:05:00.792', 'duration': '00:28.803'}, 'Assessment of LVOT and Leaflet Size': {'description': 'The left ventricular outflow tract (LVOT) size is assessed at 55 mm to evaluate the risk of obstruction. The anterior leaflet is noted to be long, measuring 25 mm.', 'startTime': '00:06:42.769', 'endTime': '00:07:17.709', 'duration': '00:34.940'}, 'Pericardium Planning and Device Positioning': {'description': 'Planning involves considering the pericardium and

In [None]:
save_model_output(updated_data3, folder="outputs", filename_prefix="AVAM_1_gpt4")


Output saved to: outputs/AVAM_1_gpt4_20250322_234130.json


'outputs/AVAM_1_gpt4_20250322_234130.json'