### **Overview**

**Functionality:**

Given one or more `.docx` files, each containing an unformatted transcript, the program outputs one `.xlsx` file per input that:


*   converts the unformatted transcript into a tabular form
*   breaks up the transcript into utterances based on specified punctuation marks, e.g. ?!.
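
As a rough illustration of the second point, the sketch below splits a line of speech after each `.`, `!` or `?` that is followed by one or more spaces, using the same regular expression that appears in Part II. The helper name `split_into_utterances` and the sample sentence are made up for this example.

```python
import re

def split_into_utterances(words):
    # Hypothetical helper: split after '.', '!' or '?' when followed by spaces.
    # The lookbehind keeps the punctuation mark attached to its utterance.
    return re.split(r'(?<=[.!?]) +', words)

sample = "All right. Does that sound good? Okay!"
print(split_into_utterances(sample))
# → ['All right.', 'Does that sound good?', 'Okay!']
```

Because the split is purely punctuation-based, an abbreviation such as "Dr. Smith" would also be cut in two.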

Each `.docx` file produces exactly one `.xlsx` file, but several `.docx` files can be passed to the program at once to avoid repetitive manual input. A custom directory for the output files can also be set, for cases where the output files should be located in a different drive from that of the input files. Only one custom directory can be defined, so output files cannot be split across two or more directories.

See the play and book reading examples in the *Demo* section.

**Example:**

Single file

*   input file: [s2h8R4YENaZeNEnqm_Play_20190730_Text_clean_20200501.docx](https://docs.google.com/document/d/10CJCVZWi9H8WBuQd5zr1z4zZlsLMB-qX/edit?usp=sharing&ouid=118243912754580335163&rtpof=true&sd=true)
*   output file: [s2h8R4YENaZeNEnqm_Play_Formatted_08242021_20190730_Cleaned_Text.xlsx](https://docs.google.com/spreadsheets/d/1-yI4Q8p4lLwW3TFeGJza5msxoYPiX4vN/edit?usp=sharing&ouid=118243912754580335163&rtpof=true&sd=true)

Multiple files

*   input files: [[4yzKmafHk3Zyckcdk_Bookreading_20190701_Text_clean_20200515.docx](https://docs.google.com/document/d/1kkZMpJunphoZFSZ7Y38FoJV0BhgV0FMF/edit?usp=sharing&ouid=118243912754580335163&rtpof=true&sd=true), [Wzto8KhNvw7eKBqww_Bookreading_20190724_Text_clean_20200501.docx](https://docs.google.com/document/d/1AnLLxZKOIzkehACtPf6U7AGT8l7Ud2-g/edit?usp=sharing&ouid=118243912754580335163&rtpof=true&sd=true)]
*   output files: [[4yzKmafHk3Zyckcdk_Bookreading_Formatted_09052021_20190701_Cleaned_Text.xlsx](https://docs.google.com/spreadsheets/d/1-q_hNPlhQgdGUAdeM__ZkJbP6p6hYSnh/edit?usp=sharing&ouid=118243912754580335163&rtpof=true&sd=true), [Wzto8KhNvw7eKBqww_Bookreading_Formatted_09052021_20190724_Cleaned_Text.xlsx](https://docs.google.com/spreadsheets/d/1-HZeDtMN763le0aIDJUreR7c1h1cxZBx/edit?usp=sharing&ouid=118243912754580335163&rtpof=true&sd=true)]

**Input file requirements:**


*   The filename must be prefixed by id and task type in that order: "*id_tasktype_...*". 
  * For example: "s2h8R4YENaZeNEnqm_Play_20190730_Text_clean_20200501.docx" or "s2h8R4YENaZeNEnqm_Play_Bilingual_Text_20190730_clean_20200501.docx"
*   The actual transcript content in the file must be preceded by "Transcription results:" as seen in the example input file above.
*   The program is not language-specific, so it works on English and Spanish transcripts as well as transcripts in other languages.
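
The filename requirement above can be checked before running the formatter. The following is a minimal sketch and not part of the original program: `parse_filename_prefix` is a hypothetical helper, and it assumes only what the requirement states, namely underscore-separated tokens with the id first, the task type second, and at least one more token after them.

```python
def parse_filename_prefix(filename):
    # Hypothetical helper: return (id, task type) from "id_tasktype_...".
    tokens = filename.split('_')
    if len(tokens) < 3 or not tokens[0] or not tokens[1]:
        raise ValueError("Expected 'id_tasktype_...', got: " + filename)
    return tokens[0], tokens[1]

print(parse_filename_prefix("s2h8R4YENaZeNEnqm_Play_20190730_Text_clean_20200501.docx"))
# → ('s2h8R4YENaZeNEnqm', 'Play')
```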

**How to use:**

Follow the descriptions for each part and the green code comments. Edit every line marked "TODO" with your own inputs.


### **Part I. Setup**


Install and import all needed modules.



In [1]:
# To read .docx file
!pip install python-docx

Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
     |████████████████████████████████| 5.6 MB 4.4 MB/s 
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184508 sha256=9506444590904c83f918decbcee1777a738d564970ebe764ce0aaa151d621689
  Stored in directory: /root/.cache/pip/wheels/f6/6f/b9/d798122a8b55b74ad30b5f52b01482169b445fbb84a11797a6
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.11


In [2]:
import os
import shutil
import time
import re
import docx
import pandas as pd

Set the current working directory to the Drive folder containing the transcript documents to be formatted.

In [3]:
# To access transcripts stored in Google Drive
from google.colab import drive
drive.mount('/content/drive')

# TODO: Change the path accordingly! Only the last assignment to input_path
# takes effect, so comment out the one you are not using.
# For play
input_path = "/content/drive/MyDrive/Play/Play Transcripts Unformatted"
# For book reading
input_path = "/content/drive/MyDrive/Book reading/Book Reading Transcripts Word"

os.chdir(input_path)

Mounted at /content/drive


Set a custom directory for the output files.

In [None]:
# TODO: Change the path accordingly! Only the last assignment to output_path
# takes effect, so comment out the one you are not using.
# For play
output_path = "/content/drive/MyDrive/Play/Play Formatted Transcripts Spreadsheet"
# For book reading
output_path = "/content/drive/MyDrive/Book reading/Book Reading Transcripts Spreadsheet"

### **Part II. Format Transcript Into Utterances**

In [8]:
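def get_output_filename(input_filename):
  # Filename format: "id_tasktype_..." (see the input file requirements)
  participant_id, task_type, *other = input_filename.split('_')
  # Use the first all-digit token after the task type as the date that is
  # carried over into the output filename
  transcribed_date = None
  for token in other:
    if token.isnumeric():
      transcribed_date = token
      break
  if transcribed_date is None:
    raise ValueError("No date token found in filename: " + input_filename)
  today_date = time.strftime("%m%d%Y")

  if 'Bilingual' in input_filename:
    return '_'.join([participant_id, task_type, "Bilingual", "Formatted", today_date, transcribed_date, "Cleaned", "Text.xlsx"])
  else:
    return '_'.join([participant_id, task_type, "Formatted", today_date, transcribed_date, "Cleaned", "Text.xlsx"])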

In [9]:
# Example with bilingual transcript
get_output_filename("6eKvaXeyFoAjdbf3k_Play_Bilingual_Formatted_06292021_20190621_Cleaned_Text.docx")

'6eKvaXeyFoAjdbf3k_Play_Bilingual_Formatted_09052021_06292021_Cleaned_Text.xlsx'

In [None]:
def format_into_utterances(input_file):
  # Prepare to read the input file
  doc = docx.Document(input_file)

  # Utterance table that will eventually contain all formatted utterances
  df = None

  # Indicator for having started reading the transcript content 
  # i.e. after the metadata header of date, participants, transcript ID, etc.
  begin_transcript = False

  # Read paragraph by paragraph from the input file
  for paragraph in doc.paragraphs:
    text = paragraph.text

    # If the paragraph marks the start of the transcript content
    if "Transcription results:" in text:
      begin_transcript = True
      continue

    # If the paragraph is part of the transcript content and actually contains words
    if begin_transcript and text:
      # Split the paragraph into [speaker, words] at the tab character. Note
      # that speaker includes the timestamp, e.g. "Mother 00:00"
      tokenized = text.split(sep='\t')
      speaker = tokenized[0]
      # Split words into utterances after each '.', '!' or '?' that is
      # followed by one or more spaces
      utterances = re.split('(?<=[.!?]) +', tokenized[1])

      # Each utterance will be a separate row in the utterance table.
      # Since the table requires two columns ("Speaker" and "Words"),
      # this means that an utterance must have a speaker, either like 
      # "Mother 00:00" or a dummy filler ''
      num_utterances = len(utterances)
      if num_utterances > 1:
        speaker = [speaker] + [''] * (num_utterances - 1)

      # Append the new list of utterances and their associated speaker to
      # the existing utterance table
      row_dict = {"Speaker": speaker, "Words": utterances}
      new_df = pd.DataFrame(row_dict)
      df = pd.concat([df, new_df], ignore_index=True)

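  # Derive the output filename from the input filename instead of relying on
  # a global variable set elsewhere in the notebook
  output_file = get_output_filename(os.path.basename(input_file))

  # Write the formatted transcript (which is in table form)
  df.to_excel(output_file, index=False, header=False)

  return df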
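
To make the row layout concrete, here is a small pandas-free sketch of how one transcript line becomes table rows: the speaker label stays on the first utterance and an empty string fills the speaker column for the rest. `line_to_rows` is a hypothetical name used only for this illustration, and the sample line is invented.

```python
import re

def line_to_rows(line):
    # Hypothetical helper mirroring the steps above without pandas.
    speaker, words = line.split('\t')[:2]
    utterances = re.split(r'(?<=[.!?]) +', words)
    # First row carries the speaker (with timestamp); later rows get ''.
    speakers = [speaker] + [''] * (len(utterances) - 1)
    return list(zip(speakers, utterances))

for row in line_to_rows("Mother 00:00\tLook at this. What is it?"):
    print(row)
# → ('Mother 00:00', 'Look at this.')
# → ('', 'What is it?')
```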

In [None]:
def move_output_files_to(directory):
  # Move every spreadsheet in the current working directory to `directory`
  for filename in os.listdir(os.getcwd()):
    if filename.endswith(".xlsx"):
      shutil.move(filename, directory)
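
A self-contained way to try the same idea without touching Drive is sketched below. It is a variant under stated assumptions, not the function above: `move_xlsx_files` is a hypothetical name, it takes the source folder as an argument instead of using the current working directory, and it creates the destination folder if it is missing.

```python
import os
import shutil
import tempfile

def move_xlsx_files(source_dir, target_dir):
    # Hypothetical variant: move every .xlsx file from source_dir to target_dir.
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    for filename in sorted(os.listdir(source_dir)):
        if filename.endswith(".xlsx"):
            shutil.move(os.path.join(source_dir, filename),
                        os.path.join(target_dir, filename))
            moved.append(filename)
    return moved

# Try it on throwaway folders with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in")
    dst = os.path.join(tmp, "out")
    os.makedirs(src)
    for name in ["a.xlsx", "b.docx"]:
        open(os.path.join(src, name), "w").close()
    print(move_xlsx_files(src, dst))   # → ['a.xlsx']
    print(sorted(os.listdir(src)))     # → ['b.docx']
```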

### **Part III. Demo**

Input the transcript file(s) to be formatted.

In [None]:
# COMMENT OUT IF USING BOOK READING
# For play
# input_file = "s2h8R4YENaZeNEnqm_Play_20190730_Text_clean_20200501.docx"
# output_file = get_output_filename(input_file)
# df = format_into_utterances(input_file)
# df

In [None]:
# COMMENT OUT IF USING PLAY
# For book reading
input_file_ids = ["4yzKmafHk3Zyckcdk", 
                  "7tfqwquuWmhXsBvh3",
                  "8Y2QGv8gWFFrZjCw3",
                  "ABK3ZCr9iYCwFj8B9",
                  "FAhxZzRvAnGpkHc2T",
                  "TWAbYzhZiLxj4e9o7",
                  # "W8JasNsGRtYQSAXQh", <-- not able to parse for some reason...
                  "Wzto8KhNvw7eKBqww",
                  "aa67icGii4hSGxjWs",
                  "bqzz2LWhDJS9MMdRt",
                  "d8jweRx4u4gxFJxST",
                  "dcZ3RapxdkETMgBf9",
                  "embb8jnivTFR6Y5yK",
                  "guEAkx2MXqFKmxq7Y",
                  "jL9ot7JgnxEuPE3k4"
                  ]

for filename in os.listdir(os.getcwd()):
  tokenized = filename.split('_')
  if tokenized[0] in input_file_ids and filename.endswith(".docx"):
    output_file = get_output_filename(filename)
    df = format_into_utterances(filename)
    # Report the file only after it has actually been formatted
    print("Done: ", output_file)

move_output_files_to(output_path)

df # <-- Utterance table for the last input file

Done:  TWAbYzhZiLxj4e9o7_Bookreading_Formatted_09052021_20190726_Cleaned_Text.xlsx
Done:  Wzto8KhNvw7eKBqww_Bookreading_Formatted_09052021_20190724_Cleaned_Text.xlsx
Done:  d8jweRx4u4gxFJxST_Bookreading_Formatted_09052021_20190719_Cleaned_Text.xlsx
Done:  dcZ3RapxdkETMgBf9_Bookreading_Formatted_09052021_20190701_Cleaned_Text.xlsx
Done:  FAhxZzRvAnGpkHc2T_Bookreading_Formatted_09052021_20190809_Cleaned_Text.xlsx
Done:  guEAkx2MXqFKmxq7Y_Bookreading_Formatted_09052021_20190707_Cleaned_Text.xlsx
Done:  7tfqwquuWmhXsBvh3_Bookreading_Formatted_09052021_20190719_Cleaned_Text.xlsx
Done:  8Y2QGv8gWFFrZjCw3_Bookreading_Formatted_09052021_20190624_Cleaned_Text.xlsx
Done:  aa67icGii4hSGxjWs_Bookreading_Formatted_09052021_20190615_Cleaned_Text.xlsx
Done:  ABK3ZCr9iYCwFj8B9_Bookreading_Formatted_09052021_20190628_Cleaned_Text.xlsx
Done:  bqzz2LWhDJS9MMdRt_Bookreading_Formatted_09052021_20190807_Cleaned_Text.xlsx
Done:  jL9ot7JgnxEuPE3k4_Bookreading_Formatted_09052021_20190721_Cleaned_Text.xlsx
Done

|     | Speaker          | Words                                             |
| --- | ---------------- | ------------------------------------------------- |
| 0   | Researcher 13:06 | All right.                                        |
| 1   |                  | So this last activity, I'm going to give you t... |
| 2   |                  | And I'll give you five minutes for this activity. |
| 3   |                  | Does that sound good?                             |
| 4   | Child 13:19      | Okay.                                             |
| ... | ...              | ...                                               |
| 132 |                  | [crosstalk]--                                     |
| 133 | Mother 18:39     | Yeah, I want to know what happens.                |
| 134 | Researcher 18:40 | Yeah.                                             |
| 135 |                  | Yeah.                                             |


In [None]:
format_into_utterances("W8JasNsGRtYQSAXQh_Bookreading_20190702_Text_clean_00000000.docx")

AttributeError: ignored
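
The error message is cut off, so the cause for this particular file is not known. One thing that can be ruled out quickly: in `format_into_utterances`, `df` is still `None` when `to_excel` is called if no paragraph contains "Transcription results:", and that raises an `AttributeError`. The sketch below checks a list of paragraph strings for the marker; `has_transcript_marker` is a hypothetical helper, and the sample paragraphs are invented.

```python
def has_transcript_marker(paragraph_texts, marker="Transcription results:"):
    # Hypothetical check: True if any paragraph contains the start marker.
    return any(marker in text for text in paragraph_texts)

with_marker = ["Participants: Mother, Child", "Transcription results:", "Mother 00:00\tHi."]
without_marker = ["Participants: Mother, Child", "Mother 00:00\tHi."]
print(has_transcript_marker(with_marker))     # → True
print(has_transcript_marker(without_marker))  # → False
```

With python-docx, the list of paragraph strings could be built as `[p.text for p in docx.Document(path).paragraphs]`.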