In [None]:
"""
README
William Ian McKinley
MSBI 31600 Advanced Project Winter 2024

1. Description of the project
This project is an initial framework for a pipeline that takes in audio files recorded in operating rooms for a 
research project examining teacher-learner verbal interactions. It uses the AWS Transcribe Medical service to 
transcribe each audio file into a JSON file, which is pulled from the AWS servers and parsed into a formatted 
transcript. The transcript is collected as a dataframe that can be passed to a downstream analysis.

2. Why I selected the project
I am the PI for this study, and I value the ability to plan for a long-term functional database of transcripts for analysis.
I think that we will begin to identify verbal patterns that we currently don't describe and which are unique to surgery, so
having something with a longstanding structure will be important when we need to include something retroactively. Also I'm
secretly super skeptical of the ability of AI to outperform humans in transcription, especially in our very specific use case
that is probably otherwise not well represented in the training data, so I was curious to explore it. Currently, the project
suffers from a limitation caused by the upper limits of capacity of human transcriptionists, which is compounded by the cost;
using AI would be a lot cheaper, but maybe not better in the long run.

I have learned a lot about the need for incredibly clean data in order for use of AI to be successful. Our audio files 
(the research files, which are not provided here) are real-world audio; they contain overlap, background noise, and 
other elements which easily confuse the software. To be fair, they often also confuse humans, but the humans know to 
notice it and find clarification or otherwise adjust. I also have a much greater appreciation for anyone who makes something
that passes between systems because it is quite challenging. We will need to do a lot of work before this will be a viable
solution to our problem, but it was enough to get me interested, and I have a couple of ideas about where to go next 
(two channel audio input, custom vocabulary). 

3. Themes selected
Theme 1: JSON
I used json.loads() to get the data from a stream into a local variable; I also broke it in about every way you can imagine,
which is not shown here because I care about your happiness. I used the structure of the JSON to my advantage to reliably
parse it to a dataframe representing my desired format. Choosing this was almost a default choice; the organization structure
means I will almost certainly have a lot of my future occupied in the JSON format, so it would seem crazy to have possibly 
made a different choice.
Theme 2: Server/client relationships & APIs
I interfaced with the AWS API using the boto3 package. I selected this theme because I plan to do a lot of work with large
data sets in the future; this will inevitably mean use of a cloud service or remote HPC, and either will function along
similar lines so mastering navigation of this relationship is prudent. These elements are applied in the class declaration
for 'server_grab' because this is what connects to the AWS servers, starts the transcription job, and then pulls the data
in through a data stream.

4. What I would do differently
First, I would have liked to make this a much more in-depth exploration of the elements. However, limitations stemming
from both real life and the nature of writing code meant that this wasn't always possible or realistic. I would have really
liked to make a UI that allows the user to pick the files to be transcribed (currently you have to either manually list them,
have a folder for them which gets selected and then you go move them out manually later, or something like that) so that
more of the recordings could be stored in a single location but not all be transcribed repeatedly. I would also probably
have started versioning earlier, so that I could build a framework and piece it together. Instead, I have a smattering
of files which represent me going through the pieces and then a master where I eventually put them together. 

5. How to run the project
Normally, this would be run cell by cell like any other notebook. AWS, however, requires keys and configs that don't
pass well through homework submissions, so I submitted a video alongside this notebook which shows the code running.
If you wanted to test the AWS components yourself, you would need to alter the configurations to be suitable for your
account and local machine. Instead, I marked one of the cells below as the cell where grading should begin and have 
supplied the HIPAA-compliant JSON file produced by Amazon Transcribe Medical. This JSON file loads in directly, 
and then you can check that the file parsing functions as expected. You will have to change the file path to match
your local file path.

6. Challenges & Obstacles
This project was definitely challenging in the way I expected. I knew that I fundamentally would be able to make the JSON 
parsing work out, but it wasn't always easy to get the time modules to cooperate. I really did not enjoy engaging the AWS
API, and I found it very confusing to start, but I eventually found my footing. I was able to answer a lot of questions for
myself about what is realistically possible, and was also able to determine which elements we will still need to work to 
overcome in the long term that can't be fixed programmatically (like having 2 channel audio input, for example).

7. Sources
Sources are cited in-line throughout the code wherever they were used. I also used the AWS documentation for the Transcribe
Medical service to get me moving. 

8. Extra credit
I met the criteria for using try/except in the JSON parsing cell while setting the first starttime.
I also used an additional theme of versioning; this is the first project where I actually used GitHub to version my code
instead of saving a bunch of copies locally with some kind of naming scheme. This was a big growth experience for me,
because many times I will use the new notebook as a way to set things up piece by piece and reassure myself that it's 
doing what I think it is while I piece it all back together. I did this at one point during this project when I added
the file_parser class, which is why I wound up with a second document as the master. I'm still not sure I am handling them correctly,
but at least they exist now where they previously didn't. If this isn't enough, I also used boto3 and dataclass in this,
which are additional themes I didn't select, and boto3 isn't a package we discussed (although I wouldn't classify
it as being particularly cool or unique).
"""

In [1]:
import json
import datetime
from dataclasses import dataclass
import numpy as np
import pandas as pd
import boto3

In [2]:
j = dict()
full_data = pd.DataFrame(columns=['spk','start_time','end_time','transcript'])

@dataclass
class server_grab():

    full_data = pd.DataFrame(columns=['spk','start_time','end_time','transcript'])

    def __init__(self):
        self.spk:str = "spk_0"
        self.start_time = datetime.timedelta(days=0,seconds=0,milliseconds=0)
        self.end_time = datetime.timedelta(days=0,seconds=0,milliseconds=0)
        self.transcript:str = ""
        self.s3_client = boto3.client('s3')
        self.client = boto3.client('transcribe')
        self.s3_resource = boto3.resource('s3')
        bucket = self.s3_resource.Bucket(name='msbi31600')

    #Here we can submit the job to the transcription service
    #Future improvements would include dynamic file naming, selection for output locations, and multiple file handling
    #A better version would also add error handling for job submissions
    def sub_job(self):
        job = self.client.start_medical_transcription_job(
            MedicalTranscriptionJobName='test_job',
            LanguageCode='en-US',
            MediaFormat='m4a',
            Media={
                'MediaFileUri': 's3://msbi31600/test_audio.m4a'
            },
            OutputBucketName='msbi31600',
            #OutputKey='outputs/test_output2.json',
            Settings={
                'ShowSpeakerLabels': True,
                'MaxSpeakerLabels': 2,
                'ChannelIdentification': False
            },
            Specialty='PRIMARYCARE',
            Type='CONVERSATION'
        )
    
    #This function gets the transcription JSON file from the bucket and puts it into a variable for our use
    def get_json(self):
        data = self.s3_client.get_object(Bucket='msbi31600',Key='medical/test_job.json')
        body=data['Body']
        j = json.loads(body.read().decode("utf-8"))
        return j

    #This function deletes the job if you encounter the error that the job already exists
    #Future error handling might be able to catch this error and delete the job automatically, but it would have to be a precise situation because this could cause big problems
    def del_job(self):
        dummy = self.client.delete_medical_transcription_job(
            MedicalTranscriptionJobName='test_job'
        )

class file_parser():

    def __init__(self, file):
        self.file = file
        self.start_time = datetime.timedelta(days=0,seconds=0,milliseconds=0)
        self.end_time = datetime.timedelta(days=0,seconds=0,milliseconds=0)
        self.second_start=int()
        self.minute_start=int()
        self.second_end=int()
        self.minute_end=int()
        self.spk = "spk_0"
        self.transcript = ""

    def parse(self):
        global full_data
        #This segment is the broadest loop, which is pulling each 'item' from the Amazon-produced JSON file and parsing it to the useful chunks we need
        for item in self.file['results']['items']:
    
            #This block checks if the speaker label has changed at its outermost scope, and behaves differently depending on whether or not the speaker has changed
            #We may need to change the behavior at this level when we start using multi channel input in the future
            if item['speaker_label'] == self.spk:
        
                #First we have to see if the item is a punctuation mark, and if so, we need to add it to the transcript without a space and without changing the timestamp
                if item['type'] == 'punctuation':
                    self.transcript = self.transcript + item['alternatives'][0]['content']
                elif item['type'] == 'pronunciation':

                    #If the speaker hasn't changed, we add the content to the transcript, preceded by a space
                    #Punctuation is handled in the branch above, where it is added without the space and without touching the timestamps
                    self.transcript = self.transcript + " " + item['alternatives'][0]['content']
        
                    while True:
                        try:
                            #We also check to see if the start time is listed as being 0.0 (i.e. if it has been reset) and if so, we set it to the first value we come across since the reset
                            #This allows us to set the correct start time for each line of text per speaker, which will be important for assigning pauses in the future
                            if self.start_time.total_seconds() == 0.0:
                                split_start = item['start_time'].split('.')
                                second_start = int(split_start[0])
                                millisecond_start = (split_start[1])
                                if len(millisecond_start) == 1:
                                    millisecond_start = int(millisecond_start) * 100
                                elif len(millisecond_start) == 2:
                                    millisecond_start = int(millisecond_start) * 10
                                else:
                                    millisecond_start = int(millisecond_start)
                                start_time_hold = datetime.timedelta(days=0,seconds=second_start,milliseconds=millisecond_start)
                
                                #Lifted this directly from https://stackoverflow.com/a/539360 & modified it slightly for own use
                                s1= start_time_hold.total_seconds()
                                hours1, remain1 = divmod(s1, 3600)
                                minutes1, remain2_1 = divmod(remain1, 60)
                                #Remember that divisor on the following line is 1, because the whole thing is indexed to seconds
                                seconds1, remain3_1 = divmod(remain2_1, 1)
                                #We added this modifier to increase the remainder so the milliseconds format correctly
                                remain3_1 = remain3_1 * 1000
                                var1 = ('{:02}:{:02}:{:02}:{:03}'.format(int(hours1), int(minutes1), int(seconds1), int(remain3_1)))

                                self.start_time = var1
                                break
                            else:
                                break
                        except AttributeError:
                            break
        
                    #This block takes the end time from the JSON file, parses it into a timedelta object by splitting it, and then formats it as an HH:MM:SS:mmm string
                    split_end = item['end_time'].split('.')
                    second_end = int(split_end[0])
                    millisecond_end = (split_end[1])
                    if len(millisecond_end) == 1:
                        millisecond_end = int(millisecond_end) * 100
                    elif len(millisecond_end) == 2:
                        millisecond_end = int(millisecond_end) * 10
                    else:
                        millisecond_end = int(millisecond_end)
                    end_time_hold = datetime.timedelta(days=0,seconds=second_end,milliseconds=millisecond_end)
            
                    #Lifted this directly from https://stackoverflow.com/a/539360 & modified it slightly for own use
                    s2= end_time_hold.total_seconds()
                    hours2, remain2 = divmod(s2, 3600)
                    minutes2, remain2_2 = divmod(remain2, 60)
                    #Remember that divisor on the following line is 1, because the whole thing is indexed to seconds
                    seconds2, remain3_2 = divmod(remain2_2, 1)
                    #We added this modifier to increase the remainder so the milliseconds format correctly
                    remain3_2 = remain3_2 * 1000
                    var2 = ('{:02}:{:02}:{:02}:{:03}'.format(int(hours2), int(minutes2), int(seconds2), int(remain3_2)))
            
                    #Then we replace the end time with the holder, to be sure it is updated every time and no failures will occur
                    self.end_time = var2
    
            #This is the block which is called if the speaker changes
            elif item['speaker_label'] != self.spk:
        
                #We add the data currently in our containers to the dataframe
                full_data = full_data.append({'spk':self.spk,'start_time':self.start_time,'end_time':self.end_time,'transcript':self.transcript},ignore_index=True)
        
                #Change the speaker label to the new speaker
                self.spk = item['speaker_label']
        
                #Reset the start time to the current value
                split_start = item['start_time'].split('.')
                second_start = int(split_start[0])
                millisecond_start = (split_start[1])
                if len(millisecond_start) == 1:
                    millisecond_start = int(millisecond_start) * 100
                elif len(millisecond_start) == 2:
                    millisecond_start = int(millisecond_start) * 10
                else:
                    millisecond_start = int(millisecond_start)
                start_time_hold = datetime.timedelta(days=0,seconds=second_start,milliseconds=millisecond_start)
        
                #Lifted this directly from https://stackoverflow.com/a/539360 & modified it slightly for own use
                s3= start_time_hold.total_seconds()
                hours3, remain3 = divmod(s3, 3600)
                minutes3, remain2_3 = divmod(remain3, 60)
                #Remember that divisor on the following line is 1, because the whole thing is indexed to seconds
                seconds3, remain3_3 = divmod(remain2_3, 1)
                #We added this modifier to increase the remainder so the milliseconds format correctly
                remain3_3 = remain3_3 * 1000
                var3 = ('{:02}:{:02}:{:02}:{:03}'.format(int(hours3), int(minutes3), int(seconds3), int(remain3_3)))
                self.start_time = var3

                #This block takes the end time from the JSON file, parses it into a timedelta object by splitting it, and then formats it as an HH:MM:SS:mmm string
                split_end = item['end_time'].split('.')
                second_end = int(split_end[0])
                millisecond_end = (split_end[1])
                if len(millisecond_end) == 1:
                    millisecond_end = int(millisecond_end) * 100
                elif len(millisecond_end) == 2:
                    millisecond_end = int(millisecond_end) * 10
                else:
                    millisecond_end = int(millisecond_end)
                end_time_hold = datetime.timedelta(days=0,seconds=second_end,milliseconds=millisecond_end)
        
                #Lifted this directly from https://stackoverflow.com/a/539360 & modified it slightly for own use
                sx= end_time_hold.total_seconds()
                hoursx, remainx = divmod(sx, 3600)
                minutesx, remainx2 = divmod(remainx, 60)
                #Remember that divisor on the following line is 1, because the whole thing is indexed to seconds
                secondsx, remainx3 = divmod(remainx2, 1)
                #We added this modifier to increase the remainder so the milliseconds format correctly
                remainx3 = remainx3 * 1000
                var4 = ('{:02}:{:02}:{:02}:{:03}'.format(int(hoursx), int(minutesx), int(secondsx), int(remainx3)))
                self.end_time = var4

                #Reset the transcript to be blank
                self.transcript = ""
                #Then add the current content to the transcript
                self.transcript = self.transcript + " " + item['alternatives'][0]['content']

        #When the whole thing has run, the final line of data will not have been added to the dataframe, so we add it here
        full_data = full_data.append({'spk':self.spk,'start_time':self.start_time,'end_time':self.end_time,'transcript':self.transcript},ignore_index=True)
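
The four near-identical timestamp blocks in parse() could be folded into one helper. The following is a minimal sketch, assuming the times arrive as decimal-second strings such as "11.489", as they do in the parsing code above. The name format_timestamp is hypothetical and does not appear in the original notebook. It works in whole milliseconds throughout, which avoids the int(remainder * 1000) step, where binary floating point can land just below a whole number and come out one millisecond low.

```python
def format_timestamp(raw: str) -> str:
    """Convert a decimal-seconds string such as '72.388' to 'HH:MM:SS:mmm'.

    Hypothetical helper (not part of the original notebook).
    """
    whole, _, frac = raw.partition('.')
    # Pad or trim the fractional part to exactly three digits, so '5' means
    # 500 ms and '05' means 50 ms, matching the length checks in parse().
    millis = int((frac + '000')[:3])
    total_ms = int(whole) * 1000 + millis
    # Integer divmod keeps every step exact.
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return '{:02}:{:02}:{:02}:{:03}'.format(hours, minutes, seconds, millis)


print(format_timestamp('11.489'))
```

With a helper like this, each of the four blocks would reduce to a single call such as format_timestamp(item['end_time']).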


In [3]:
mine = server_grab()

In [5]:
#Use this cell if you need to delete the job
mine.del_job()
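
The comment on del_job mentions catching the "job already exists" error in the future. A minimal sketch of one cautious way to do that follows, assuming the client is a boto3 'transcribe' client, which exposes its modeled errors under client.exceptions. The wrapper name start_job_if_new is hypothetical. It only reports the conflict and leaves the decision to delete with the user, since automatic deletion is the risky part.

```python
def start_job_if_new(client, **job_kwargs):
    """Submit a medical transcription job; return None if the name is taken.

    Hypothetical wrapper (not part of the original notebook). `client` is
    assumed to be a boto3 'transcribe' client.
    """
    try:
        return client.start_medical_transcription_job(**job_kwargs)
    except client.exceptions.ConflictException:
        # The job name already exists. Report it and let the caller decide
        # whether to delete the old job, rather than deleting automatically.
        name = job_kwargs.get('MedicalTranscriptionJobName')
        print(f"Job {name!r} already exists; delete it or pick a new name.")
        return None
```

In this notebook it would be called with the same keyword arguments that sub_job passes, for example start_job_if_new(mine.client, MedicalTranscriptionJobName='test_job', ...).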

In [6]:
#After you submit the job, you have to wait a while for the transcription to complete
#I need to add something to check the status of the job to enhance the user experience
mine.sub_job()
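
The status check mentioned in the comments above could look roughly like the sketch below. It assumes a boto3 'transcribe' client and uses its get_medical_transcription_job call; the helper name wait_for_job and the defaults for poll_seconds and max_polls are illustrative choices, not values from the original project.

```python
import time


def wait_for_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll until the job reaches COMPLETED or FAILED, then return that status.

    Hypothetical helper (not part of the original notebook).
    """
    for _ in range(max_polls):
        response = client.get_medical_transcription_job(
            MedicalTranscriptionJobName=job_name
        )
        status = response['MedicalTranscriptionJob']['TranscriptionJobStatus']
        if status in ('COMPLETED', 'FAILED'):
            return status
        # Still QUEUED or IN_PROGRESS: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_name!r} did not finish after {max_polls} polls")
```

Here that would be wait_for_job(mine.client, 'test_job'), run before calling mine.get_json().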

In [9]:
#This is a functional solution to getting the file out; it is inelegant, but good enough for government work
my_file = mine.get_json()

In [None]:
#Jason, you should start grading at this cell when you examine it

#You will have to change the file path
with open('/Users/ianmckinley/Documents/9.1.json') as f:
    t_file = json.load(f)

In [10]:
#This cell instantiates the file parser and passes the loaded JSON (t_file) into it
myp = file_parser(t_file)
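
For orientation, the parser reads only a handful of fields from each entry under results -> items. The sample below is hand-written for illustration and is not real Transcribe Medical output; it simply mirrors the keys that parse() accesses (speaker_label, type, alternatives, start_time, end_time).

```python
# Hand-written sample for illustration only; not real service output.
sample = {
    'results': {
        'items': [
            {'type': 'pronunciation', 'speaker_label': 'spk_0',
             'start_time': '0.5', 'end_time': '0.92',
             'alternatives': [{'content': 'Scalpel'}]},
            {'type': 'punctuation', 'speaker_label': 'spk_0',
             'alternatives': [{'content': '.'}]},
            {'type': 'pronunciation', 'speaker_label': 'spk_1',
             'start_time': '1.3', 'end_time': '1.75',
             'alternatives': [{'content': 'Thank'}]},
        ]
    }
}

for item in sample['results']['items']:
    word = item['alternatives'][0]['content']
    # The punctuation entry in this sample has no time fields, so .get()
    # is used to avoid a KeyError when reading them.
    print(item['speaker_label'], item['type'], item.get('start_time'), word)
```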

In [11]:
#This cell parses the JSON file into the dataframe
#Although I am aware that .append() is deprecated, I had a lot of trouble getting .concat() to work, so I used this instead because I needed to move on
#I am still working out the problem with concat; I have tried joining 2 DFs by making the series a DF to start, joining the series directly, and a few other things
#But for some reason it just becomes a big sloppy mess when I do that
myp.parse()

  full_data = full_data.append({'spk':self.spk,'start_time':self.start_time,'end_time':self.end_time,'transcript':self.transcript},ignore_index=True)
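
One way around the .append() warning, sketched here under the assumption that rows can be gathered before the dataframe is built: collect each finished line as a dict in a plain list and create the DataFrame once at the end. The names rows and demo_data and the transcript text are illustrative only. The last step shows pd.concat with a one-row DataFrame for the case where a row has to be added to a frame that already exists.

```python
import pandas as pd

columns = ['spk', 'start_time', 'end_time', 'transcript']

# Accumulate finished lines as plain dicts (illustrative values).
rows = []
rows.append({'spk': 'spk_0', 'start_time': '00:00:11:489',
             'end_time': '00:01:12:388', 'transcript': 'first line'})
rows.append({'spk': 'spk_1', 'start_time': '00:01:17:209',
             'end_time': '00:01:17:588', 'transcript': 'Thank you.'})

# Build the frame once instead of growing it row by row.
demo_data = pd.DataFrame(rows, columns=columns)

# If a row must be added to an existing frame, wrap it in a one-row
# DataFrame (note the list around the dict) and concatenate.
extra = pd.DataFrame([{'spk': 'spk_0', 'start_time': '00:01:22:769',
                       'end_time': '00:02:29:689', 'transcript': 'next line'}])
demo_data = pd.concat([demo_data, extra], ignore_index=True)

print(demo_data)
```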

In [12]:
full_data

index,spk,start_time,end_time,transcript
0,spk_0,00:00:11:489,00:01:12:388,This is gonna be a silent teacher. It's the w...
1,spk_1,00:01:17:209,00:01:17:588,Thank you.
2,spk_0,00:01:22:769,00:02:29:689,That. Yeah. Yeah. Yeah. Yeah. Keep all that f...
3,spk_1,00:02:32:179,00:02:32:889,so need to be
4,spk_0,00:02:32:899,00:02:58:118,gone that. Yeah. It uh Right. Right. Got it.
...,...,...,...,...
330,spk_0,02:45:46:200,02:45:51:218,like that. Yeah cause you see how you almost ...
331,spk_1,02:45:51:780,02:45:52:388,Oh I understand.
332,spk_0,02:45:56:668,02:47:08:370,Very good. Yeah. Take that. No. Yeah. Can we ...
333,spk_1,02:47:11:218,02:47:12:290,one more and then cut
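
The comments in parse() mention assigning pauses between speakers later. A minimal sketch of that step follows, assuming the 'HH:MM:SS:mmm' string format shown in the table above; the helper name to_timedelta is hypothetical.

```python
import datetime


def to_timedelta(stamp: str) -> datetime.timedelta:
    """Parse an 'HH:MM:SS:mmm' string into a timedelta (hypothetical helper)."""
    hours, minutes, seconds, millis = (int(part) for part in stamp.split(':'))
    return datetime.timedelta(hours=hours, minutes=minutes,
                              seconds=seconds, milliseconds=millis)


# Gap between the end of row 0 and the start of row 1 in the table above.
pause = to_timedelta('00:01:17:209') - to_timedelta('00:01:12:388')
print(pause.total_seconds())
```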
