# Convert Word (Transcript) to DataFrame

Transcription in MS Word produces a format in which the timestamp and speaker are on one line and the utterance is on the next. This notebook converts that format (from a Word docx file) to a CSV or Excel file by first loading it into a Pandas DataFrame.

In [28]:
import docx2txt
import pandas as pd
from IPython.display import display, Markdown

### Get relevant text from the Word Doc
The Word document often contains some preamble, typically followed by the word `Transcript`, after which the transcript begins. Test this assumption by printing the contents of the variable `transcript` below. If it holds, extract only the text _after_ the word 'Transcript'.
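As a minimal sketch of that extraction step, the helper below keeps everything after the first occurrence of the marker and returns the text unchanged when the marker is absent. The function name `text_after_marker` and the sample string are made up for illustration; they are not part of the original notebook.

```python
def text_after_marker(text, marker="Transcript"):
    """Return the text after the first occurrence of marker.

    If the marker does not occur, the text is returned unchanged.
    """
    before, found, after = text.partition(marker)
    return after if found else text


# Hypothetical document text: a short preamble, the marker, then the transcript
sample = "Audio file\nmeeting.m4a\n\nTranscript\n\n00:00:00 Speaker_1\n\nCool."
print(repr(text_after_marker(sample)))
```

Using `str.partition` cuts at the first occurrence of the marker, whereas `split(marker)[-1]` cuts at the last one, which would discard most of the transcript if a speaker happens to say the word "Transcript".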

In [42]:
transcript = docx2txt.process("data/Transcript_odine_avery_pinar_1.docx")
# Uncomment if the document has a preamble that ends with the word 'Transcript'
# transcript = transcript.split("Transcript")[-1]
transcript

"00:00:00 Speaker_1\n\nCool.\n\n00:00:01 Speaker_2\n\nSo yeah.\n\n00:00:02 Speaker_1\n\nOK.\n\n00:00:05 Speaker_2\n\nOK, So what?\n\n00:00:07 Speaker_2\n\nWhat? What could we do with those posts?\n\n00:00:18 Speaker_2\n\nOr.\n\n00:00:19 Speaker_1\n\nIt's something 2D.\n\n00:00:21 Speaker_1\n\nOr 3D is possible too.\n\n00:00:24 Speaker_2\n\nYeah.\n\n00:00:24 Speaker_1\n\nI think it's possible.\n\n00:00:25\n\nYeah.\n\n00:00:27 Speaker_2\n\nOK.\n\n00:00:30 Speaker_2\n\nWe do want to make a because it says here you could create a product to process a system or engage with a particular environment.\n\n00:00:39 Speaker_2\n\nOr even tell a story. Tell a story like.\n\n00:00:41 Speaker_1\n\nWhere where is where does it say that?\n\n00:00:45 Speaker_2\n\nThis one ohh.\n\n00:00:47 Speaker_2\n\nCould do like a level you know.\n\n00:00:51 Speaker_1\n\nOhh that's that's a good idea.\n\n00:00:53 Speaker_2\n\nSomething like.\n\n00:00:55 Speaker_3\n\nOr.\n\n00:00:58\n\nYeah.\n\n00:01:00 Speaker_1\n\nE

### Clean up the text
The text contains a number of newline characters (`\n`), which are useful for splitting it into lines. The code below does the following:
1. Splits the text into lines, where a line is either a `timestamp, speaker` combo, an `utterance`, or an empty line (in the case of multiple successive newlines);
2. Strips surrounding whitespace and removes all empty lines;
3. Replaces all non-breaking spaces (unicode: `\xa0`) with regular spaces;
4. Splits the `'timestamp, speaker'` string combination into a list with `['timestamp', 'speaker']` in it.
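The steps above can also be sketched as a single function. This variant recognises header lines with a regular expression instead of counting colons, so an utterance that happens to contain two or more colons is not mistaken for a `timestamp, speaker` line. The helper name `parse_lines`, the assumed `HH:MM:SS` header pattern, and the sample text are illustrative assumptions rather than part of the original notebook.

```python
import re

# Assumed header shape: HH:MM:SS, optionally followed by a speaker label
HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2})(?:\s+(.+))?$")


def parse_lines(text):
    """Split raw transcript text into header lists and utterance strings."""
    items = []
    for raw in text.split("\n"):
        # Normalise non-breaking spaces, then trim surrounding whitespace
        line = raw.replace("\xa0", " ").strip()
        if not line:
            continue  # drop empty lines
        match = HEADER.match(line)
        if match:
            timestamp, speaker = match.groups()
            # Headers without a speaker label keep only the timestamp
            items.append([timestamp, speaker] if speaker else [timestamp])
        else:
            items.append(line)
    return items


# Hypothetical sample, including an utterance with two colons
sample = (
    "00:00:00 Speaker_1\n\nCool.\n\n00:00:25\n\nYeah.\n\n"
    "00:00:30\xa0Speaker_2\n\nNote: the ratio is 1:2."
)
print(parse_lines(sample))
```

One limitation of this sketch is that an utterance which itself begins with a timestamp-like string would still be treated as a header.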

In [46]:
transcript_list = transcript.split("\n")
transcript_list = [s.strip() for s in transcript_list if len(s.strip()) > 0]
transcript_list = [s.replace('\xa0',  ' ') for s in transcript_list]
transcript_list = [s.split(' ') if len(s.split(':')) > 2 else s for s in transcript_list]
transcript_list

[['00:00:00', 'Speaker_1'],
 'Cool.',
 ['00:00:01', 'Speaker_2'],
 'So yeah.',
 ['00:00:02', 'Speaker_1'],
 'OK.',
 ['00:00:05', 'Speaker_2'],
 'OK, So what?',
 ['00:00:07', 'Speaker_2'],
 'What? What could we do with those posts?',
 ['00:00:18', 'Speaker_2'],
 'Or.',
 ['00:00:19', 'Speaker_1'],
 "It's something 2D.",
 ['00:00:21', 'Speaker_1'],
 'Or 3D is possible too.',
 ['00:00:24', 'Speaker_2'],
 'Yeah.',
 ['00:00:24', 'Speaker_1'],
 "I think it's possible.",
 ['00:00:25'],
 'Yeah.',
 ['00:00:27', 'Speaker_2'],
 'OK.',
 ['00:00:30', 'Speaker_2'],
 'We do want to make a because it says here you could create a product to process a system or engage with a particular environment.',
 ['00:00:39', 'Speaker_2'],
 'Or even tell a story. Tell a story like.',
 ['00:00:41', 'Speaker_1'],
 'Where where is where does it say that?',
 ['00:00:45', 'Speaker_2'],
 'This one ohh.',
 ['00:00:47', 'Speaker_2'],
 'Could do like a level you know.',
 ['00:00:51', 'Speaker_1'],
 "Ohh that's that's a good 

In [47]:
print("Remove last item if it is incomplete.")
print('--------------------------- LAST 4 LINES ---------------------------')
for t in transcript_list[-4:] :
    print(t)
print('-------------------------------- END -------------------------------')

# Uncomment and re-run this cell if the last line is found to be incomplete
# if isinstance(transcript_list[-1], list):
#     transcript_list.pop()

Remove last item if it is incomplete.
--------------------------- LAST 4 LINES ---------------------------
['00:15:02']
Yeah, yeah.
['00:15:08']
I always.
-------------------------------- END -------------------------------


### Convert into DataFrame
Convert the list, whose successive items are `['timestamp', 'speaker']` and `'utterance'`, into a DataFrame.

In [50]:
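transcript_triad = []
for ind, line in enumerate(transcript_list):
    # Header lines are lists; utterances are plain strings
    if not isinstance(line, list):
        continue
    # Skip a header with no utterance after it (e.g. a truncated final line)
    if ind + 1 >= len(transcript_list) or isinstance(transcript_list[ind + 1], list):
        continue
    turn_data = {
        'timestamp': line[0],
        # A header with no speaker label contains only the timestamp
        'speaker': line[1] if len(line) > 1 else 'Unclear',
        'utterance': transcript_list[ind + 1],
    }
    transcript_triad.append(turn_data)

print(transcript_triad)
df = pd.DataFrame(transcript_triad)
df.sample(3)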

[{'timestamp': '00:00:00', 'speaker': 'Speaker_1', 'utterance': 'Cool.'}, {'timestamp': '00:00:01', 'speaker': 'Speaker_2', 'utterance': 'So yeah.'}, {'timestamp': '00:00:02', 'speaker': 'Speaker_1', 'utterance': 'OK.'}, {'timestamp': '00:00:05', 'speaker': 'Speaker_2', 'utterance': 'OK, So what?'}, {'timestamp': '00:00:07', 'speaker': 'Speaker_2', 'utterance': 'What? What could we do with those posts?'}, {'timestamp': '00:00:18', 'speaker': 'Speaker_2', 'utterance': 'Or.'}, {'timestamp': '00:00:19', 'speaker': 'Speaker_1', 'utterance': "It's something 2D."}, {'timestamp': '00:00:21', 'speaker': 'Speaker_1', 'utterance': 'Or 3D is possible too.'}, {'timestamp': '00:00:24', 'speaker': 'Speaker_2', 'utterance': 'Yeah.'}, {'timestamp': '00:00:24', 'speaker': 'Speaker_1', 'utterance': "I think it's possible."}, {'timestamp': '00:00:25', 'speaker': 'Unclear', 'utterance': 'Yeah.'}, {'timestamp': '00:00:27', 'speaker': 'Speaker_2', 'utterance': 'OK.'}, {'timestamp': '00:00:30', 'speaker': 'S

|     | timestamp | speaker   | utterance                         |
|-----|-----------|-----------|-----------------------------------|
| 202 | 00:11:31  | Speaker_1 | Or should it be a perfect square? |
| 272 | 00:18:26  | Speaker_3 | The bathroom.                     |
| 866 | 00:08:28  | Unclear   | Yeah. Amazing.                    |


### Save DataFrame
Because the utterances contain punctuation such as commas and semicolons, which can be confused with CSV delimiters, a CSV file is not recommended. Instead, save the DataFrame as an Excel file.

In [51]:
df.to_excel('data/edi_2024_odine_avery_pinar.xlsx')