# Convert Word (Transcript) to DataFrame

Transcription in MS Word produces a format where the timestamp and speaker are on one line and the utterance is on the next. This notebook converts that format (from a Word `.docx` file) to a CSV or Excel file by first loading it into a Pandas DataFrame.

In [135]:
import docx2txt
import pandas as pd
from IPython.display import display, Markdown

### Get relevant text from the Word Doc
The Word document often contains some preamble, typically followed by the word `Transcript`, after which the transcript begins. Test this assumption by printing the contents of the variable `transcript` below. If it holds, extract only the text _after_ the word `Transcript`.
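
Once the document text is available as a string, keeping only the part after the marker could look like the sketch below. It assumes the first occurrence of the word `Transcript` is the marker (a file name or title containing the same word would break this), and `strip_preamble` is a hypothetical helper name, not part of this notebook.

```python
def strip_preamble(text: str, marker: str = "Transcript") -> str:
    """Return the text after the first occurrence of `marker`.

    If the marker is absent, the text is returned unchanged.
    """
    before, found, after = text.partition(marker)
    return after if found else text


# Made-up sample text, only to illustrate the behaviour.
sample = "Audio file\nmeeting.m4a\n\nTranscript\n\n00:00:00 Stan\n\nHello."
body = strip_preamble(sample)
```

If the printed text already starts with a timestamp, there is no preamble and the string can be used as it is.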

In [220]:
transcript_path = "data/xavier-transcript.docx"
transcript = docx2txt.process(transcript_path)
# print the first 500 characters, just to check.
transcript[:500]

"00:00:00 Stan\n\nStarting the recording right now. \n\n00:00:03 Stan\n\nDoes everybody give consent? (general laugh)\n\n00:00:05 Xavier\n\nNo, now we have an excuse.\n\n00:00:12 Xavier\n\nI mean even that is going to get analysed.\n\n00:00:15 Stan\n\nYep (laugh) (..) All right\n\n00:00:20 Xavier\n\nIt's an (.) It's an interesting assignment.\n\n00:00:21 Rueben\n\nI guess I need a pen.\n\n00:00:22 Xavier\n\nYep.\n\n00:00:24 Xavier\n\nYeah, we might need. I mean, first, I don't know if we need the (..), I just need a pen and a pap"

### Clean up the text
The text contains a number of line breaks. These are useful for splitting the text into lines, so the cleanup does the following:
1. Splits the text into lines, where a line is either a `timestamp, speaker` combo, an `utterance`, or an empty line (in case of multiple successive line breaks);
2. Removes all empty lines;
3. Replaces all non-breaking spaces (unicode: `\xa0`) with regular spaces;
4. Splits the `'timestamp, speaker'` string combination into its `timestamp` and `speaker` parts (this step is carried out later, when the DataFrame is built).
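
As a minimal sketch of step 4, a header line can be split into its two parts with a regular expression. This assumes the timestamp always has the form `HH:MM:SS` and comes first on the line; `HEADER_RE` and `split_header` are hypothetical names introduced for illustration.

```python
import re

# Assumed header shape: "HH:MM:SS Speaker name"
HEADER_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(.+)$")


def split_header(line: str) -> list[str]:
    """Split 'timestamp speaker' into ['timestamp', 'speaker'].

    Lines without a leading timestamp get an empty timestamp.
    """
    match = HEADER_RE.match(line)
    if match:
        return [match.group(1), match.group(2)]
    return ["", line]


parts = split_header("00:00:05 Xavier")
```

Speaker names containing spaces stay intact, because only the first run of whitespace after the timestamp is treated as the separator.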

In [221]:
transcript_list = transcript.split("\n")
transcript_list = [s.strip() for s in transcript_list if len(s.strip()) > 0]
transcript_list = [s.replace('\xa0',  ' ') for s in transcript_list]
transcript_list

['00:00:00 Stan',
 'Starting the recording right now.',
 '00:00:03 Stan',
 'Does everybody give consent? (general laugh)',
 '00:00:05 Xavier',
 'No, now we have an excuse.',
 '00:00:12 Xavier',
 'I mean even that is going to get analysed.',
 '00:00:15 Stan',
 'Yep (laugh) (..) All right',
 '00:00:20 Xavier',
 "It's an (.) It's an interesting assignment.",
 '00:00:21 Rueben',
 'I guess I need a pen.',
 '00:00:22 Xavier',
 'Yep.',
 '00:00:24 Xavier',
 "Yeah, we might need. I mean, first, I don't know if we need the (..), I just need a pen and a paper or something to right on to try and think and understand what what we're meant to do with this. (Laughter)",
 '00:00:34 Xavier',
 'Cause I am confused.',
 '00:00:37 Julia',
 'So I brought candy, if you want it.',
 '00:00:40 Xavier',
 'It’s food for the soul',
 '00:00:42 Stan',
 "It's been years since I tried on of these.",
 '00:00:50 Xavier',
 'This course is not what I expected it to be, I have to say.',
 '00:00:53 Stan',
 'In the good way 

In [222]:
print("Remove last item if it is incomplete.")
print('--------------------------- LAST 4 LINES ---------------------------')
for t in transcript_list[-4:] :
    print(t)
print('-------------------------------- END -------------------------------')

# Uncomment and re-run this cell if the last item is a timestamp/speaker line
# with no utterance after it (the list then has an odd number of items)
# if len(transcript_list) % 2 == 1 :
#     transcript_list.pop()

Remove last item if it is incomplete.
--------------------------- LAST 4 LINES ---------------------------
01:41:37 Xavier
Julia is going to be confused, but.
01:41:47 Stan
There you go. OK, now I can end the recording
-------------------------------- END -------------------------------
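
The same check can be done programmatically rather than by eye. The sketch below flags a list whose last item looks like a timestamp/speaker line, which would mean its utterance is missing. It assumes `HH:MM:SS` timestamps and that no utterance begins with such a pattern; `has_dangling_header` is a hypothetical helper.

```python
import re

# Assumed timestamp shape at the start of a header line
TIMESTAMP_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\b")


def has_dangling_header(lines: list[str]) -> bool:
    """True if the last line looks like a header with no utterance after it."""
    return bool(lines) and TIMESTAMP_RE.match(lines[-1]) is not None


complete = ["01:41:47 Stan", "There you go. OK, now I can end the recording"]
cut_off = complete + ["01:41:50 Xavier"]
```

Here `has_dangling_header(cut_off)` is true and `has_dangling_header(complete)` is false, so only the cut-off list would need its last item removed.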


### Convert into DataFrame
Convert the list, in which `'timestamp speaker'` items alternate with `'utterance'` items, into a DataFrame.

In [223]:
transcript_triad = []
for ind, line in enumerate(transcript_list) :
    if ind % 2 == 0 and ind+1 < len(transcript_list):
        # these are speaker/timestamp lines
        if line[0].isdigit() :
            # the start is a timestamp
            timestamp = line.split(' ')[0]
            speaker = ' '.join(line.split(' ')[1:])
            utterance = transcript_list[ind+1]
        else :
            # The line contains a speaker but no timestamp
            timestamp = ''
            speaker = line
            if len(line.split(' ')) > 2 :
                print("*****************************************")
                print("Problem found!")
                print(transcript_list[max(ind-3, 0):ind+1])
                print(ind)
                print("*****************************************")
                break
            utterance = transcript_list[ind+1]
        turn_data = { 'timestamp' : timestamp,
                      'speaker'   : speaker,
                      'utterance' : utterance
                    }
        transcript_triad.append(turn_data)
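
An alternative way to express the same pairing is to zip the even-positioned items (headers) with the odd-positioned items (utterances). This is a sketch under the same assumption as the loop above, namely that headers and utterances strictly alternate. It omits the "Problem found!" check, and `pair_turns` is a hypothetical name.

```python
def pair_turns(lines: list[str]) -> list[dict[str, str]]:
    """Pair each header line with the utterance that follows it."""
    turns = []
    # Even positions are headers, odd positions are utterances;
    # zip drops a trailing header that has no utterance.
    for header, utterance in zip(lines[0::2], lines[1::2]):
        first, _, rest = header.partition(" ")
        if first[:1].isdigit():
            # The header starts with a timestamp
            timestamp, speaker = first, rest
        else:
            # The header contains a speaker but no timestamp
            timestamp, speaker = "", header
        turns.append(
            {"timestamp": timestamp, "speaker": speaker, "utterance": utterance}
        )
    return turns


turns = pair_turns(
    ["00:00:00 Stan", "Starting the recording right now.", "00:00:03 Stan"]
)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `transcript_triad`.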
            

In [224]:
# print(transcript_triad)
df = pd.DataFrame(transcript_triad)
df.sample(5)

| | timestamp | speaker | utterance |
|---|---|---|---|
| 948 | 01:22:51 | Xavier | I don't know, you choose, here. |
| 163 | 00:13:45 | Julia | Yeah, true. |
| 221 | 00:18:38 | Stan | And then maybe write down it like pass this on... |
| 190 | 00:15:20 | Stan | Yeah, that's interesting. |
| 496 | 00:45:17 | Stan | I had a really cool assignment where we needed... |


### Save DataFrame
Because the text can contain punctuation, including commas and semicolons, a CSV file is not recommended. Instead, save it as an Excel file.

In [216]:
excel_name = transcript_path.split(".docx")[0] + ".xlsx"
df.to_excel(excel_name)
print("File saved to:", excel_name)

File saved to: data/xavier-transcript.xlsx
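
The output file name can also be derived with `pathlib`, which replaces only the final extension instead of splitting on the string `.docx`. A minimal sketch, with `excel_path_for` as a hypothetical helper name:

```python
from pathlib import Path


def excel_path_for(docx_path: str) -> Path:
    """Return the same path with its final extension replaced by .xlsx."""
    return Path(docx_path).with_suffix(".xlsx")


excel_path = excel_path_for("data/xavier-transcript.docx")
```

The result can be passed straight to `df.to_excel(excel_path)`. Adding `index=False` leaves out the DataFrame's integer index column, and writing `.xlsx` files requires the `openpyxl` package to be installed.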
