# 45-restructure-audio-fixed-length
> Restructure audio segmentation into fixed-length segments, each with multiple labels

In this notebook, we inspect a technique used in assessing teachers. A fixed duration is chosen, and every label that occurs within that duration is assigned to it. This turns the task into a multilabel classification problem.

We'll do something similar here, except we can't enforce the durations exactly, because each transcribed statement already has fixed start and end timestamps (e.g., a duration boundary may fall in the middle of a statement). To work around this, we'll assign each statement to a window based on its `end_timestamp`.
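
As a minimal sketch of that rule, consider a statement that starts in one window and ends in the next: it is assigned to the window containing its end. The toy timestamps and the 5000 ms window below are illustrative assumptions, not values from the data.

```python
import pandas as pd

# Hypothetical statements; the second one straddles the 5000 ms boundary.
toy = pd.DataFrame({
    "start_ms": [0, 4200, 6500],
    "end_ms": [3000, 6100, 9000],
})
window_ms = 5000  # illustrative window length

# Integer division on the end time picks the window the statement ends in.
toy["ts_group"] = toy["end_ms"] // window_ms
# For comparison: the window the statement starts in.
toy["start_group"] = toy["start_ms"] // window_ms
# A statement crosses a boundary when its start and end windows differ.
toy["crosses_boundary"] = toy["ts_group"] != toy["start_group"]
print(toy)
```

The straddling statement lands in group 1 even though most of it may lie in group 0, which is the trade-off of grouping on the end time.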

In [None]:
#all_no_test

In [None]:
#export audio_preprocessing
# data science packages
import pandas as pd
import numpy as np

# other python packages
import os.path
import glob
import re

# Try restructuring on a single file
Here, we'll first work with a single, fairly short file and use only a subset of the helpful columns. Later on, we'll define a function that does all of this in one step.

In [None]:
#set base
base_prefix = '/data/p_dsi/wise/data/'
embedding_prefix = base_prefix + 'embedding_parquet/'
demo_df = pd.read_parquet(embedding_prefix+'083-1.parquet',
                          columns=['id', 'speech', 'label', 'start_timestamp', 'end_timestamp',
                                   'start_ms', 'end_ms', 'duration_ms',
                                   'start_index', 'end_index'])
display(demo_df.head())
demo_df.shape

Unnamed: 0,id,speech,label,start_timestamp,end_timestamp,start_ms,end_ms,duration_ms,start_index,end_index
0,083-1,ready?,OTR,00:00.000,00:03.330,0,3330,3330,0,53280
1,083-1,alright.,NEU,00:04.000,00:05.640,4000,5640,1640,64000,90240
2,083-1,let's quietly work together so everybody has t...,NEU,00:06.000,00:12.210,6000,12210,6210,96000,195360
3,083-1,(okay).,NEU,00:12.210,00:12.810,12210,12810,600,195360,204960
4,083-1,word number two.,OTR,00:13.000,00:15.820,13000,15820,2820,208000,253120


(123, 10)

## Add group numbers
Here, we add numbers corresponding to the group that each individual statement should be in. We do this by using integer division on the end milliseconds.

In [None]:
# 5-second groups (5000 ms)
length_in_ms = 5000

## demo list
test_end_ts = list(range(0, 3001, 500))

In [None]:
[(test_ts, test_ts//length_in_ms) for test_ts in test_end_ts]

[(0, 0), (500, 0), (1000, 0), (1500, 0), (2000, 0), (2500, 0), (3000, 0)]

In [None]:
#Add groups
demo_df['ts_group'] = demo_df['end_ms']//length_in_ms

#View results
#with pd.option_context('display.max_rows', 150):
#    display(demo_df[['id', 'speech', 'start_ms', 'end_ms', 'ts_group']])
demo_df[['id', 'speech', 'start_ms', 'end_ms', 'ts_group', 'duration_ms']].head(10)

Unnamed: 0,id,speech,start_ms,end_ms,ts_group,duration_ms
0,083-1,ready?,0,3330,0,3330
1,083-1,alright.,4000,5640,1,1640
2,083-1,let's quietly work together so everybody has t...,6000,12210,2,6210
3,083-1,(okay).,12210,12810,2,600
4,083-1,word number two.,13000,15820,3,2820
5,083-1,friendly.,17000,18660,3,1660
6,083-1,find the word friendly.,20000,24750,4,4750
7,083-1,just write number one down.,26000,27900,5,1900
8,083-1,we'll come back and get it later.,27900,29230,5,1330
9,083-1,name I'm gonna ask please that you be quiet or...,30000,42530,8,12530
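
Note that the group numbers in the table jump from 5 to 8: the last statement shown runs from 30000 ms to 42530 ms, so no statement ends in windows 6 or 7. A small sketch for spotting such gaps, using the `end_ms` values displayed above (the variable names are illustrative):

```python
import pandas as pd

length_in_ms = 5000
# end_ms values of the ten statements displayed above
end_ms = pd.Series([3330, 5640, 12210, 12810, 15820,
                    18660, 24750, 27900, 29230, 42530])
ts_group = end_ms // length_in_ms

# Windows between the first and last group in which no statement ends
all_groups = set(range(ts_group.min(), ts_group.max() + 1))
missing_groups = sorted(all_groups - set(ts_group))
print(missing_groups)
```

Whether such empty windows should appear as rows with no labels depends on the downstream model; a plain `groupby` on `ts_group` simply omits them.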


There are two complications we can notice here that may have downstream effects.

1. **Total group length is not `length_in_ms`.** Statements are assigned to groups by their end time, so the summed statement durations in a group rarely equal `length_in_ms`. They can be shorter (group 0 holds only 3330 ms of speech) or longer (the last statement shown is 12530 ms on its own), although the total time we pass into the model to classify will be `length_in_ms`.
2. **Excess noise in the audio passed in for segments.** When the speech in a group is shorter than the window, the audio we pass in contains intervening "noise" between statements that we don't really need to classify.
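
To put rough numbers on both points, one can compare each group's span (first start to last end) with the summed statement durations. The sketch below uses the ten rows displayed above; `speech_ms`, `span_ms` and `gap_ms` are names introduced here for illustration.

```python
import pandas as pd

length_in_ms = 5000
rows = pd.DataFrame({
    "start_ms": [0, 4000, 6000, 12210, 13000, 17000, 20000, 26000, 27900, 30000],
    "end_ms": [3330, 5640, 12210, 12810, 15820, 18660, 24750, 27900, 29230, 42530],
})
rows["duration_ms"] = rows["end_ms"] - rows["start_ms"]
rows["ts_group"] = rows["end_ms"] // length_in_ms

# One row per group: where it starts, where it ends, how much speech it holds
summary = rows.groupby("ts_group").agg(
    start_ms=("start_ms", "min"),
    end_ms=("end_ms", "max"),
    speech_ms=("duration_ms", "sum"),
)
summary["span_ms"] = summary["end_ms"] - summary["start_ms"]
# Time inside the span that is not covered by any statement
summary["gap_ms"] = summary["span_ms"] - summary["speech_ms"]
print(summary)
```

For these rows no group spans exactly 5000 ms, and group 3 spans 5660 ms while containing only 4480 ms of speech, leaving a 1180 ms gap.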

Let's keep this in mind. We can start by using just each group's `start_ms` and `end_ms`, but we could also chain together the audio of the individual statements in a group. We can explore this second option later, but for now, we take the simplest approach first.
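
The second option could look roughly like the sketch below: slice each statement out of the waveform by its `start_index`/`end_index` and concatenate the slices, so that gaps between statements are dropped. The synthetic `audio` array is a stand-in for a real recording, `chain_statement_audio` is a hypothetical helper that is not part of the exported module, and the `start_index` of the second statement is an assumption inferred from the 16-samples-per-millisecond ratio visible in the displayed rows.

```python
import numpy as np
import pandas as pd

def chain_statement_audio(audio, group_df):
    """Concatenate the audio slices of every statement in one group."""
    pieces = [audio[start:end]
              for start, end in zip(group_df["start_index"], group_df["end_index"])]
    return np.concatenate(pieces)

# Stand-in waveform, long enough to cover the indices below
audio = np.arange(300_000, dtype=np.float32)

# The two statements of ts_group 3 ("word number two." and "friendly.").
# 272000 is an assumed start_index for "friendly." (17000 ms x 16 samples/ms).
group_df = pd.DataFrame({"start_index": [208000, 272000],
                         "end_index": [253120, 298560]})

chained = chain_statement_audio(audio, group_df)
# Simplest approach for comparison: one slice from group start to group end
span = audio[group_df["start_index"].iloc[0]:group_df["end_index"].iloc[-1]]
print(len(span), len(chained))
```

With these indices the span holds 90560 samples while the chained audio holds 71680, the difference being the silence between the two statements.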

## Restructure data
We'll start by adding the label IDs, using a `label2id` mapping.

In [None]:
#Make label list
label_list = ["OTR", "NEU", "REP", "PRS"]
    
#Create label encoding
label2id = {lab:ind for ind, lab in enumerate(label_list)}

#Do the encoding
demo_df['label_id'] = demo_df['label'].replace(label2id)

Now, we'll actually get the restructured data, where each aggregated value is a list of speech or labels.

In [None]:
#exporti audio_preprocessing
def _group_statements(group_info):
    '''
    Function _group_statements: pandas apply helper function to group sets of statements into one fixed length row
    Input: group_info: pandas group with minimally elements id, speech, label, label_id, start_ms, end_ms,
                       start_timestamp, end_timestamp, duration_ms
    Output: new dataframe with single row for sets of statements
    '''
    
    #Get overall info
    row_id = group_info['id'].iloc[0]
    speech_list = group_info['speech'].tolist()
    speech = ' '.join(speech_list)
    label = group_info['label'].tolist()
    label_id = group_info['label_id'].tolist()
    
    #Get start info
    start_ms = group_info['start_ms'].iloc[0]
    start_timestamp = group_info['start_timestamp'].iloc[0]
    start_index = group_info['start_index'].iloc[0]
    
    #Get end info
    end_ms = group_info['end_ms'].iloc[-1]
    end_timestamp = group_info['end_timestamp'].iloc[-1]
    end_index = group_info['end_index'].iloc[-1]
    
    #Get duration info
    duration_ms = group_info['duration_ms'].sum()
    
    #Make dataframe
    df = pd.DataFrame({'id':row_id, 'speech_list':[speech_list], 'speech':speech,
                       'label':[label], 'label_id':[label_id],
                       'start_timestamp':start_timestamp, 'end_timestamp':end_timestamp,
                       'start_ms':start_ms, 'end_ms':end_ms, 'duration_ms':duration_ms,
                       'start_index': start_index, 'end_index':end_index},
                     index=['1'])
    
    return df

In [None]:
res = demo_df.groupby('ts_group').apply(_group_statements).reset_index(drop=True)
res.head()

Unnamed: 0,id,speech_list,speech,label,label_id,start_timestamp,end_timestamp,start_ms,end_ms,duration_ms,start_index,end_index
0,083-1,[ready?],ready?,[OTR],[0],00:00.000,00:03.330,0,3330,3330,0,53280
1,083-1,[alright.],alright.,[NEU],[1],00:04.000,00:05.640,4000,5640,1640,64000,90240
2,083-1,[let's quietly work together so everybody has ...,let's quietly work together so everybody has t...,"[NEU, NEU]","[1, 1]",00:06.000,00:12.810,6000,12810,6810,96000,204960
3,083-1,"[word number two., friendly.]",word number two. friendly.,"[OTR, OTR]","[0, 0]",00:13.000,00:18.660,13000,18660,4480,208000,298560
4,083-1,[find the word friendly.],find the word friendly.,[OTR],[0],00:20.000,00:24.750,20000,24750,4750,320000,396000
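
The same one-row-per-group shape can also be sketched with `groupby(...).agg(...)` and named aggregations, which avoids building a small DataFrame per group. This is an alternative sketch, not the notebook's exported helper; the toy rows are copied from the tables above and only a subset of the columns is aggregated.

```python
import pandas as pd

demo = pd.DataFrame({
    "id": ["083-1"] * 4,
    "speech": ["ready?", "alright.", "word number two.", "friendly."],
    "label": ["OTR", "NEU", "OTR", "OTR"],
    "start_ms": [0, 4000, 13000, 17000],
    "end_ms": [3330, 5640, 15820, 18660],
    "duration_ms": [3330, 1640, 2820, 1660],
})
demo["ts_group"] = demo["end_ms"] // 5000

grouped = (
    demo.groupby("ts_group")
    .agg(
        id=("id", "first"),
        speech_list=("speech", list),     # keep the individual statements
        speech=("speech", " ".join),      # and a single joined string
        label=("label", list),
        start_ms=("start_ms", "first"),   # start of the first statement
        end_ms=("end_ms", "last"),        # end of the last statement
        duration_ms=("duration_ms", "sum"),
    )
    .reset_index(drop=True)
)
print(grouped)
```

The last row reproduces the "word number two. friendly." group shown above, with a summed duration of 4480 ms.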


## Add one-hot and counts
Now, let's add the one-hot-style label columns and counts. We'll start by adding a zero-filled column for each label.

In [None]:
res[label_list] = 0
res

Unnamed: 0,id,speech_list,speech,label,label_id,start_timestamp,end_timestamp,start_ms,end_ms,duration_ms,start_index,end_index,OTR,NEU,REP,PRS
0,083-1,[ready?],ready?,[OTR],[0],00:00.000,00:03.330,0,3330,3330,0,53280,0,0,0,0
1,083-1,[alright.],alright.,[NEU],[1],00:04.000,00:05.640,4000,5640,1640,64000,90240,0,0,0,0
2,083-1,[let's quietly work together so everybody has ...,let's quietly work together so everybody has t...,"[NEU, NEU]","[1, 1]",00:06.000,00:12.810,6000,12810,6810,96000,204960,0,0,0,0
3,083-1,"[word number two., friendly.]",word number two. friendly.,"[OTR, OTR]","[0, 0]",00:13.000,00:18.660,13000,18660,4480,208000,298560,0,0,0,0
4,083-1,[find the word friendly.],find the word friendly.,[OTR],[0],00:20.000,00:24.750,20000,24750,4750,320000,396000,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,083-1,"[look up here baby girl., no.]",look up here baby girl. no.,"[OTR, NEU]","[0, 1]",09:35.000,09:38.990,575000,578990,3980,9200000,9263840,0,0,0,0
76,083-1,"[where's the S word?, name don't give it away.]",where's the S word? name don't give it away.,"[OTR, REP]","[0, 2]",09:39.000,09:44.400,579000,584400,5390,9264000,9350400,0,0,0,0
77,083-1,[that's following directions darling.],that's following directions darling.,[REP],[2],09:44.410,09:46.440,584410,586440,2030,9350560,9383040,0,0,0,0
78,083-1,[we don't shout out answers.],we don't shout out answers.,[REP],[2],09:47.000,09:51.370,587000,591370,4370,9392000,9461920,0,0,0,0


In [None]:
#exporti audio_preprocessing
def _add_label_counts(row_info):
    '''
    Function _add_label_counts: helper function for pandas apply; adds label counts as individual columns. Not to
    be used directly.
    Inputs: row_info: pandas Series of row info with minimally 'label'
    Output: returns pandas Series of row info with new label counts added for that row.
    '''
    
    #Get counts of labels
    vc = pd.Series(row_info['label']).value_counts()
    
    #Add it back into the index
    row_info[vc.index]=vc
    
    return row_info
    

In [None]:
final_pd = res.apply(_add_label_counts, axis=1)
final_pd.head(10)

Unnamed: 0,id,speech_list,speech,label,label_id,start_timestamp,end_timestamp,start_ms,end_ms,duration_ms,start_index,end_index,OTR,NEU,REP,PRS
0,083-1,[ready?],ready?,[OTR],[0],00:00.000,00:03.330,0,3330,3330,0,53280,1,0,0,0
1,083-1,[alright.],alright.,[NEU],[1],00:04.000,00:05.640,4000,5640,1640,64000,90240,0,1,0,0
2,083-1,[let's quietly work together so everybody has ...,let's quietly work together so everybody has t...,"[NEU, NEU]","[1, 1]",00:06.000,00:12.810,6000,12810,6810,96000,204960,0,2,0,0
3,083-1,"[word number two., friendly.]",word number two. friendly.,"[OTR, OTR]","[0, 0]",00:13.000,00:18.660,13000,18660,4480,208000,298560,2,0,0,0
4,083-1,[find the word friendly.],find the word friendly.,[OTR],[0],00:20.000,00:24.750,20000,24750,4750,320000,396000,1,0,0,0
5,083-1,"[just write number one down., we'll come back ...",just write number one down. we'll come back an...,"[OTR, NEU]","[0, 1]",00:26.000,00:29.230,26000,29230,3230,416000,467680,1,1,0,0
6,083-1,[name I'm gonna ask please that you be quiet o...,name I'm gonna ask please that you be quiet or...,[REP],[2],00:30.000,00:42.530,30000,42530,12530,480000,680480,0,0,1,0
7,083-1,[I know you don't care but what I'm telling yo...,I know you don't care but what I'm telling you...,[REP],[2],00:43.000,00:49.110,43000,49110,6110,688000,785760,0,0,1,0
8,083-1,[you are bothering them.],you are bothering them.,[NEU],[1],00:49.110,00:51.520,49110,51520,2410,785760,824320,0,1,0,0
9,083-1,[number two is friendly.],number two is friendly.,[OTR],[0],00:53.000,00:59.270,53000,59270,6270,848000,948320,1,0,0,0
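
The label columns above hold counts rather than 0/1 indicators. If a multilabel model expects multi-hot targets, the counts can be thresholded. A minimal sketch on a few count rows copied from the table (the `multi_hot` name is illustrative):

```python
import pandas as pd

label_list = ["OTR", "NEU", "REP", "PRS"]
# Count rows 2, 3, 5 and 6 from the table above
counts = pd.DataFrame(
    [[0, 2, 0, 0], [2, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0]],
    columns=label_list,
)

# Any count above zero becomes 1, everything else stays 0
multi_hot = (counts > 0).astype(int)
print(multi_hot)
```

Keeping the counts and deriving the indicators on demand preserves more information than storing only the indicators.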


This looks great! Let's make the final function.

# Resectioning Function
Below is the final function, which uses the helper functions above to perform the restructuring.

In [None]:
#export audio_preprocessing
def get_fixed_length_segments(csv_df, length_in_ms=2000, label_list=None):
    '''
    Function get_fixed_length_segments: Function to regroup dataframe into fixed length segments
    Inputs: csv_df: dataframe with minimally speech, label, all timestamps, all milliseconds, duration, and indices.
            length_in_ms (default 2000): integer of time of fixed length in milliseconds
            label_list (default None): list of accepted labels in dataframe; default label list used if None
    Outputs: regrouped dataframe with one row per fixed-length segment, with counts of each label
    '''
    
    #Make label list and generate encodings in df
    if label_list is None:
        label_list = ["OTR", "NEU", "REP", "PRS"]
    
    #Create label encoding
    label2id = {lab:ind for ind, lab in enumerate(label_list)}

    #Do the encoding
    csv_df['label_id'] = csv_df['label'].replace(label2id)
    
    #Add groups
    csv_df['ts_group'] = csv_df['end_ms']//length_in_ms
    
    #Get groups and get in a reasonable format
    csv_df = csv_df.groupby('ts_group').apply(_group_statements).reset_index(drop=True)
    
    #Add an area for the label counts to be filled in
    csv_df[label_list] = 0
    
    #Add label counts
    csv_df = csv_df.apply(_add_label_counts, axis=1)
    
    #All done!
    return csv_df
    

## Check that it works
We'll quickly check that it works.

In [None]:
embeddings_list = glob.glob(embedding_prefix + '*.parquet')
len(embeddings_list)

11

In [None]:
#export audio_preprocessing
def short_embedding_csv_load(fname):
    '''
    Function short_embedding_csv_load: Function to load a subset of data from input parquet file
    Input: String of full filepath
    Output: dataframe with only columns of interest
    '''
    
    df = pd.read_parquet(fname,
                         columns=['id', 'speech', 'label', 'start_timestamp', 'end_timestamp',
                                  'start_ms', 'end_ms', 'duration_ms',
                                  'start_index', 'end_index'])
    
    return df

In [None]:
#read dataframe
test_df = short_embedding_csv_load(embeddings_list[0])

#get output
test_df_fl = get_fixed_length_segments(test_df)
test_df_fl.tail()

Unnamed: 0,id,speech_list,speech,label,label_id,start_timestamp,end_timestamp,start_ms,end_ms,duration_ms,start_index,end_index,OTR,NEU,REP,PRS
115,083-2,"[you don't know?, did you check?]",you don't know? did you check?,"[NEU, NEU]","[1, 1]",09:27.000,09:31.272,567000,571272,4271,9072000,9140352,0,2,0,0
116,083-2,[well you were in there a long time so I would...,well you were in there a long time so I would ...,[REP],[2],09:32.000,09:38.454,572000,578454,6454,9152000,9255264,0,0,1,0
117,083-2,[so what were you doing in there all that time?],so what were you doing in there all that time?,[REP],[2],09:39.000,09:43.281,579000,583281,4281,9264000,9332496,0,0,1,0
118,083-2,[well let's go quick check and we won't play t...,well let's go quick check and we won't play th...,[REP],[2],09:45.000,09:52.552,585000,592552,7552,9360000,9480832,0,0,1,0
119,083-2,[excuse me.],excuse me.,[REP],[2],09:54.000,09:56.704,594000,596704,2704,9504000,9547264,0,0,1,0
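
Beyond eyeballing the tail, a couple of invariants can be asserted: every input statement should be counted exactly once, and the durations should add up. The sketch below uses a hypothetical `check_regrouping` helper and small hand-built frames with values taken from the earlier `083-1` tables; it does not call the exported functions.

```python
import pandas as pd

def check_regrouping(original, regrouped, label_list):
    """Raise AssertionError if the regrouped frame lost or duplicated statements."""
    # Each statement contributes exactly one label count
    assert int(regrouped[label_list].to_numpy().sum()) == len(original)
    # Summed durations are preserved by the grouping
    assert regrouped["duration_ms"].sum() == original["duration_ms"].sum()
    return True

label_list = ["OTR", "NEU", "REP", "PRS"]
# Four statements before grouping
original = pd.DataFrame({"label": ["OTR", "NEU", "OTR", "OTR"],
                         "duration_ms": [3330, 1640, 2820, 1660]})
# The three groups they fall into, with their label counts
regrouped = pd.DataFrame({"duration_ms": [3330, 1640, 4480],
                          "OTR": [1, 0, 2], "NEU": [0, 1, 0],
                          "REP": [0, 0, 0], "PRS": [0, 0, 0]})
ok = check_regrouping(original, regrouped, label_list)
print(ok)
```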


# Generate and save all files
Now, let's generate the fixed-length segments for every file and save the results to the `multilabel_parquet` folder on ACCRE.

In [None]:
file_nos = [re.split(r'/|\.', file)[-2] for file in embeddings_list]
save_filepaths = [base_prefix + 'multilabel_parquet/' + file_no + '.parquet' for file_no in file_nos]
save_filepaths

['/data/p_dsi/wise/data/multilabel_parquet/083-2.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/251-1.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/134-1.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/083-3.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/008-1.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/273-3.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/123-1.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/055-1.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/083-1.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/120-1.parquet',
 '/data/p_dsi/wise/data/multilabel_parquet/134-2.parquet']
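
An alternative sketch for deriving the file number uses `pathlib`, which handles the separator and the extension without a regular expression. The example input path is an assumption that follows the `embedding_parquet` layout used for the demo file earlier; `PurePosixPath` is used so the result does not depend on the operating system.

```python
from pathlib import PurePosixPath

base_prefix = "/data/p_dsi/wise/data/"
# Assumed example input path, following the embedding_parquet layout
example_path = base_prefix + "embedding_parquet/083-2.parquet"

# .stem drops the directory and the final extension
file_no = PurePosixPath(example_path).stem
save_path = str(PurePosixPath(base_prefix) / "multilabel_parquet" / f"{file_no}.parquet")
print(save_path)
```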

In [None]:
#Do a short loading of these files
embeds_dfs = [short_embedding_csv_load(fname) for fname in embeddings_list]

#Get fixed length segments for all files
fl_dfs = [get_fixed_length_segments(df) for df in embeds_dfs]

In [None]:
#save files
#[df.to_parquet(fname, index=False) for df, fname in zip(fl_dfs, save_filepaths)];