# 10-process-data
> Importing, cleaning, testing, and saving data

This notebook does two main things:
1. It cleans the docx files (strips the leading 't ' from speech lines and drops 'c .' lines)
2. It converts the docx files to csv files

# Helpful packages and preliminaries

In [None]:
#all_no_test
#default_exp text_preprocessing

In [None]:
#export
# data access and processing
import pandas as pd
import numpy as np

# File helpers
import glob

# python helpers
import os.path
import re
import warnings

# docx helpers
import docx
import docx2txt

# Set the file path
Change the 'base_prefix' variable below to match your computer environment. In this example, Soyeon's local file path was used.

In [None]:
#base_prefix = os.path.expanduser('~/Box Sync/DSI Documents/')
base_prefix = '/data/p_dsi/wise/data/'
file_directory = base_prefix + 'Audio Files & Transcripts/Transcripts'
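
As a side note, paths can also be assembled with os.path.join, which avoids depending on a trailing slash in the prefix. This is only a sketch: 'example_prefix' and the '~/wise_data' location are hypothetical and stand in for your own data root.

```python
import os.path

# Hypothetical data root for illustration; substitute your own location.
example_prefix = os.path.expanduser('~/wise_data')

# os.path.join inserts the separators, so the prefix needs no trailing slash.
example_dir = os.path.join(example_prefix, 'cleaned_data', 'cleaned_transcripts')
print(example_dir)
```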

# Cleaning docx files

### 1. Define getText() function to import docx into text in Python

In [None]:
#export
def getText(filename):
    """
    Read a docx file and return its text
    
    Parameters
    ---------
    filename : str
        a document's file path
        
    Returns
    -------
    str
        the document's contents
    """
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

### 2. Read all transcripts into a Pandas dataframe

In [None]:
# get filenames list
filenames = glob.glob(file_directory + '/*.docx')

# read file contents
file_contents = []
file_id = []
for file in filenames:
    file_id.append(file.split("/")[-1].split(" ")[0])
    file_contents.append(getText(file))
    
# convert to df
file_df = pd.DataFrame({'file_id': file_id, 'text': file_contents})
file_df.head()    

Unnamed: 0,file_id,text
0,123-1-198,- 00:00:00.00\nt uhhuh [SI-0] [INF]. {NEU}\nc ...
1,046-2-198,- 00:00:00.00\nuhhuh. {NEU}\n(okay) you're col...
2,273-1-198,- 00:00:00.00\nt today we're talking about the...
3,083-1-198,-00:00:00.00 \nt ready [SI-0]? {OTR}\nc .\nt ...
4,108-2-198,- 00:00:00.00\nt (okay) we're gonna be talkin ...


### 3. Strip the leading 't ', drop 'c .' lines, and remove all [*] annotations

In [None]:
for i in range(file_df.shape[0]):
    strip_text = []
    for line in file_df.loc[i, "text"].split("\n"):
        if line[:2] == "t ":
            new = line[2:]
        elif line[:3] == "c .":
            continue
        else:
            new = line
        strip_text.append(new)
    strip_text = "\n".join(strip_text)
    file_df.loc[i, "strip_text"] = strip_text


for i in range(file_df.shape[0]):
    strip_text = []
    for line in file_df.loc[i, "strip_text"].split("\n"):
        # Raw string so that \D and \[ reach the regex engine unchanged
        new = re.sub(r" \[\D.*?\]", "", line)
        strip_text.append(new)
    strip_text = "\n".join(strip_text)
    file_df.loc[i, "strip_text"] = strip_text
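
The two passes above can also be checked on a single string before running them over the whole dataframe. The sketch below is not part of the exported module: 'strip_transcript' is a hypothetical helper that combines both passes, and the sample text is made up from the line formats shown in the output above.

```python
import re

def strip_transcript(text):
    """Hypothetical helper: apply both cleaning passes to one transcript string."""
    cleaned = []
    for line in text.split("\n"):
        if line.startswith("c ."):
            continue  # drop 'c .' lines entirely
        if line.startswith("t "):
            line = line[2:]  # strip the leading 't '
        # remove bracketed annotations such as ' [SI-0]' or ' [INF]'
        cleaned.append(re.sub(r" \[\D.*?\]", "", line))
    return "\n".join(cleaned)

sample = "- 00:00:00.00\nt uhhuh [SI-0] [INF]. {NEU}\nc .\nt ready [SI-0]? {OTR}"
result = strip_transcript(sample)
print(result)
```

The timestamp line and the {NEU}/{OTR} labels are left untouched; they are handled later in the conversion steps.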

### 4. Save all stripped and cleaned text into the cleaned transcripts folder

In [None]:
cleaned_transcripts_dir = base_prefix + 'cleaned_data/cleaned_transcripts/' 

In [None]:
for i in range(file_df.shape[0]):
    document = docx.Document()
    document.add_paragraph(file_df.iloc[i]["strip_text"]) 
    document.save(cleaned_transcripts_dir + file_df.iloc[i]["file_id"] + " final analyses.docx")

# Convert docx file to csv file
In this section, we convert the documents to csv files using Pandas. We show the individual steps so that, if any particular file fails, we can use this section to investigate the reason and, if necessary, adapt the steps for the larger final operations.

### 1. Make a list having all docx files' path
First, we get the paths of the document files we want to convert to csv files.

In [None]:
file_names = glob.glob(cleaned_transcripts_dir + '*.docx')

In [None]:
file_names[:5]

['/data/p_dsi/wise/data/cleaned_data/cleaned_transcripts/129-1-198 final analyses.docx',
 '/data/p_dsi/wise/data/cleaned_data/cleaned_transcripts/088-3-198 final analyses.docx',
 '/data/p_dsi/wise/data/cleaned_data/cleaned_transcripts/273-2-198 final analyses.docx',
 '/data/p_dsi/wise/data/cleaned_data/cleaned_transcripts/107-1-198 final analyses.docx',
 '/data/p_dsi/wise/data/cleaned_data/cleaned_transcripts/116-1-198 final analyses.docx']

### 2. Create "docx_to_df" function converting a docx file to a dataframe
The following function performs two primary tasks:
1. Convert a docx file to a dataframe in which each row is one line of the document.
2. Add 4 new columns ["transcript_filepath", "id", "transcriber_id", "wave_filepath"].

Since the values of the new columns are derived from the file path, these tasks are handled in this function as well.

In [None]:
#export
def docx_to_df(file_path):    
    """
    Convert docx file to dataframe
    
    Parameters
    ----------
    file_path : str
        The file path of a document
        
    Returns
    -------
    dataframe
        speech | transcript_filepath | id  | transcriber_id | wave_filepath
        ------------------------------------------------------------------
        00:00  | Users/Soyeon/~~~.   |119-2| 113.           | Users/~~~~
        
    """
    # Convert docx file to dataframe
    text = docx2txt.process(file_path)
    text_list = text.split('\n')
    df = pd.DataFrame(text_list, columns = ["speech"])

    # Add [transcript_filepath] column
    df["transcript_filepath"] = file_path

    # Add [id], [transcriber_id] columns
    extract = re.search(r'(\d{3})-(\d{1})-(\d{3})', file_path)
    if extract is not None:
        df["id"] = extract.group(1) + "-" + extract.group(2)
        df["transcriber_id"] = extract.group(3)
    else:
        df["id"] = None
        df["transcriber_id"] = None
        warnings.warn('File {0} seems to have the wrong title format for extracting id and transcriber_id'.format(file_path))

    # Add [wave_filepath] column
    audio_path = base_prefix + "Audio Files & Transcripts/Audio Files/"
    df["wave_filepath"] = audio_path + df["id"] + ".wav"
    
    return df
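
To see what the id regex extracts without opening a docx file, the matching step can be isolated. This is a minimal sketch: 'parse_ids' and 'ID_PATTERN' are hypothetical names used only for illustration, and 'notes.docx' is a made-up file name that does not follow the expected format.

```python
import re

ID_PATTERN = re.compile(r'(\d{3})-(\d{1})-(\d{3})')

def parse_ids(file_path):
    """Hypothetical helper: return (id, transcriber_id), or (None, None) if no match."""
    match = ID_PATTERN.search(file_path)
    if match is None:
        return None, None
    return match.group(1) + "-" + match.group(2), match.group(3)

ids_ok = parse_ids('/data/p_dsi/wise/data/cleaned_data/cleaned_transcripts/129-1-198 final analyses.docx')
ids_bad = parse_ids('notes.docx')
print(ids_ok, ids_bad)
```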

### 3. Merge all dataframes
Using a list comprehension, we build a list of all the dataframes converted from the docx files with the "docx_to_df" function. Then we create megadata, the result of concatenating all of these dataframes.

In [None]:
# Create a list having all dataframes converted from the docx files
dfs_list = [docx_to_df(file) for file in file_names]
megadata = pd.concat(dfs_list)

In [None]:
megadata.head()

Unnamed: 0,speech,transcript_filepath,id,transcriber_id,wave_filepath
0,- 00:00:00.00,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,129-1,198,/data/p_dsi/wise/data/Audio Files & Transcript...
1,can you do it really quickly. {NEU},/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,129-1,198,/data/p_dsi/wise/data/Audio Files & Transcript...
2,just stick it in your cubby really quick pleas...,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,129-1,198,/data/p_dsi/wise/data/Audio Files & Transcript...
3,thank you. {NEU},/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,129-1,198,/data/p_dsi/wise/data/Audio Files & Transcript...
4,(okay) so we are going to > {NEU},/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,129-1,198,/data/p_dsi/wise/data/Audio Files & Transcript...


### 4. Create "find_timestamp" function finding timestamp lines
The find_timestamp function creates a column called "digit": a speech line is marked 0 and a timestamp line is marked 1. It also normalizes the timestamp format, for example changing '- 00:00:00.00' into '00:00:00.00'. The timestamp is identified via regex (to be checked for formatting in the future) and saved into the start_timestamp column.

* every line matching the timestamp regex (in the transcripts, the lines starting with '-' or '['): timestamp (1)
* anything else: speech (0)

For example, ['00:00:00.00', 'Hi', 'What is your name?', '00:03:12.00'] generates -> [1, 0, 0, 1] in the digit column.
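
The flagging rule can be tried on this example list without a dataframe. The sketch below reuses the same regex as find_timestamp; the variable names are only for illustration.

```python
import re

# Same pattern as in find_timestamp, written as a raw string
timestamp_pat = re.compile(r'(\d\d:\d\d:\d\d. *\d\d)')

lines = ['00:00:00.00', 'Hi', 'What is your name?', '00:03:12.00']

# 1 for a timestamp line, 0 for a speech line
digits = [1 if timestamp_pat.search(line) else 0 for line in lines]
print(digits)

# The captured group drops the leading '- ' from a raw timestamp line
cleaned = timestamp_pat.search('- 00:00:00.00').group(1)
print(cleaned)
```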

In [None]:
#export
def find_timestamp(text_list):
    """
    Flag a row as timestamp or speech and extract the timestamp
    
    Parameters
    ----------
    text_list : pandas Series
        One row of the dataframe (the function is applied with axis=1)
        
    Returns
    -------
    pandas Series
        The row with new entries ["start_timestamp", "digit"]
        The digit entry helps filling start_timestamp and end_timestamp
    """
    # Raw string so that \d reaches the regex engine unchanged
    pat = re.compile(r'(\d\d:\d\d:\d\d. *\d\d)')
    matches = pat.search(text_list['speech'])
    if matches is not None:
        # Timestamp line: keep only the hh:mm:ss.ss part
        text_list['start_timestamp'] = matches.group(1)
        text_list['digit'] = 1
    else:
        # Speech line
        text_list['digit'] = 0
        text_list['start_timestamp'] = None

    return text_list

In [None]:
megadata = megadata.apply(find_timestamp, axis=1)

### 5. Fill "start_timestamp" and "end_timestamp"
Looking at the structure above, we can see that the end timestamp of an utterance section is the same as the start timestamp shifted up one row (i.e., the start of one series of utterances is the end of the one that came before it). We perform this programmatically.

Then, we fill all of the None values with the beginning timestamp of the series of utterances (by filling non-NA values forward until they reach the next non-NA value and repeat), and then backfill the end timestamp similarly.

Finally, we remove the rows of speech that are just timestamps. These are indicated by where digit==1, meaning we keep everywhere that digit==0. And then we drop this column since it has served its purpose.

Some timestamps contain an unnecessary space, so we remove it as well.

In [None]:
# Remove unnecessary space in start_timestamp
megadata['start_timestamp'] = megadata['start_timestamp'].str.replace(' ', '')

# Create a column "end_timestamp"
megadata['end_timestamp'] = megadata['start_timestamp'].shift(-1)

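# Fill NA data in "start_timestamp" and "end_timestamp" columns
# (assign the result back: inplace=True on a selected column may not update megadata under pandas copy-on-write)
megadata['start_timestamp'] = megadata['start_timestamp'].ffill()
megadata['end_timestamp'] = megadata['end_timestamp'].bfill()

# Remove rows of which "speech" is timestamp
# (.copy() avoids SettingWithCopyWarning when columns are added later)
megadata = megadata.loc[megadata['digit']==0, :].copy()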
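
The shift and fill logic is easier to follow on a tiny example. The sketch below uses a made-up five-row dataframe ('toy' is a hypothetical name) with two timestamp rows and three speech rows, and applies the same steps as the cell above.

```python
import pandas as pd

# Made-up rows: two timestamp lines (digit 1) and three speech lines (digit 0)
toy = pd.DataFrame({
    'speech': ['- 00:00:00.00', 'hi', 'hello', '- 00:02:02.12', 'bye'],
    'start_timestamp': ['00:00:00.00', None, None, '00:02:02.12', None],
    'digit': [1, 0, 0, 1, 0],
})

# The end of a section is the start of the next one, shifted up one row
toy['end_timestamp'] = toy['start_timestamp'].shift(-1)

# Propagate start timestamps forward and end timestamps backward
toy['start_timestamp'] = toy['start_timestamp'].ffill()
toy['end_timestamp'] = toy['end_timestamp'].bfill()

# Keep only the speech rows
toy = toy.loc[toy['digit'] == 0]
print(toy)
```

In this toy example the last speech row has no end timestamp because no timestamp line follows it, so a later dropna() would remove it. Whether that situation occurs in the real transcripts depends on whether each file ends with a timestamp line.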

In [None]:
# Check result
megadata.head()

Unnamed: 0,digit,id,speech,start_timestamp,transcriber_id,transcript_filepath,wave_filepath,end_timestamp
1,0,129-1,can you do it really quickly. {NEU},00:00:00.00,198,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,00:02:02.12
2,0,129-1,just stick it in your cubby really quick pleas...,00:00:00.00,198,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,00:02:02.12
3,0,129-1,thank you. {NEU},00:00:00.00,198,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,00:02:02.12
4,0,129-1,(okay) so we are going to > {NEU},00:00:00.00,198,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,00:02:02.12
5,0,129-1,you're gonna do the same thing name. {NEU},00:00:00.00,198,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,00:02:02.12


### 6. Create "label" column

Using a regex, we check whether each speech line carries a label such as {NEU} and save it in the label column. Then we remove the label from the speech line so that only the actual speech remains.

In [None]:
# Raw strings so that \{ and \} reach the regex engine unchanged
label_pat = re.compile(r'.*\{([A-Z]{3,3})\}.*')
megadata['label'] = megadata['speech'].apply(lambda x: None if label_pat.match(x) is None else label_pat.match(x).group(1))
megadata['speech'] = megadata['speech'].str.replace(r'\{[A-Z]{3,3}\}', '', regex=True)

# Remove unnecessary space in label and speech
megadata.label = megadata.label.str.strip()
megadata.speech = megadata.speech.str.strip()
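
The label step can be checked on a few plain strings. This is a simplified sketch, not the notebook's exact code: 'split_label' is a hypothetical helper, and it uses search() so it returns the first label in a line, whereas the greedy pattern above returns the last one. The two agree when a line has a single label.

```python
import re

label_pat = re.compile(r'\{([A-Z]{3})\}')

def split_label(line):
    """Hypothetical helper: return (speech_without_label, label_or_None)."""
    match = label_pat.search(line)
    label = match.group(1) if match else None
    speech = label_pat.sub('', line).strip()
    return speech, label

samples = ['thank you. {NEU}', '(okay) so we are going to > {NEU}', 'no label here']
pairs = [split_label(s) for s in samples]
print(pairs)
```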

### 7. Clean megadata
We remove the "digit" column and reorder the columns. In addition, some rows have no label. We remove these rows as well.

In [None]:
megadata = megadata.drop('digit', axis=1)
megadata = megadata[["id", "transcript_filepath", "wave_filepath", "speech", 'start_timestamp', 'end_timestamp', "label", "transcriber_id"]]

# Remove rows having NA values.
megadata = megadata.dropna()

In [None]:
megadata.head()

Unnamed: 0,id,transcript_filepath,wave_filepath,speech,start_timestamp,end_timestamp,label,transcriber_id
1,129-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,can you do it really quickly.,00:00:00.00,00:02:02.12,NEU,198
2,129-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,just stick it in your cubby really quick pleas...,00:00:00.00,00:02:02.12,NEU,198
3,129-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,thank you.,00:00:00.00,00:02:02.12,NEU,198
4,129-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,(okay) so we are going to >,00:00:00.00,00:02:02.12,NEU,198
5,129-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,you're gonna do the same thing name.,00:00:00.00,00:02:02.12,NEU,198


In [None]:
megadata.tail()

Unnamed: 0,id,transcript_filepath,wave_filepath,speech,start_timestamp,end_timestamp,label,transcriber_id
156,088-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,you know what he did get really big and turn i...,00:00:00.00,00:02:05.18,OTR,198
157,088-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,he was small when he came out of the egg.,00:00:00.00,00:02:05.18,NEU,198
158,088-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,remember name.,00:00:00.00,00:02:05.18,NEU,198
159,088-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,and how did he get to be so big name.,00:00:00.00,00:02:05.18,OTR,198
160,088-1,/data/p_dsi/wise/data/cleaned_data/cleaned_tra...,/data/p_dsi/wise/data/Audio Files & Transcript...,what did he do?,00:00:00.00,00:02:05.18,OTR,198


Ok, this looks great! Now, let's save this dataframe as csv files.

### 8. Save dataframe as csv files
We are done! Let's save the result. There are two options here:
1. Save the megadata (containing all documents' data) as one csv file.
2. Group megadata by id and create a separate csv file for each id.

To actually write the csv files, remove the "#" in front of the to_csv lines.

In [None]:
new_filepath = base_prefix + "cleaned_data/csv_files/"

# (Option #1)Save the megadata as one csv file 
#megadata.to_csv(new_filepath + "megadata.csv", index = False)

# (Option #2)Save dataframe grouped by id as a csv file
id_list = megadata["id"].unique()
megadata_groupby_id = megadata.groupby('id')
for i in id_list:
    df = megadata_groupby_id.get_group(i)
#    df.to_csv(new_filepath + i + ".csv", index = False)
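
Option #2 can be tried safely on made-up data before writing into the real folder. In this sketch, 'toy' and 'out_dir' are hypothetical stand-ins for megadata and new_filepath, and the files go to a temporary directory.

```python
import os
import tempfile
import pandas as pd

# Made-up rows standing in for megadata
toy = pd.DataFrame({
    'id': ['129-1', '129-1', '088-1'],
    'speech': ['thank you.', 'remember name.', 'what did he do?'],
    'label': ['NEU', 'NEU', 'OTR'],
})

out_dir = tempfile.mkdtemp()  # stand-in for new_filepath

# Iterating over a groupby yields (key, sub-dataframe) pairs
for group_id, group_df in toy.groupby('id'):
    group_df.to_csv(os.path.join(out_dir, group_id + '.csv'), index=False)

written = sorted(os.listdir(out_dir))
print(written)
```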