Skip to content

Conversation

yhliang2018
Copy link
Contributor

Hi All,

We add the model evaluation part (Haoliang works on it), and also apply multiprocessing to dataset in this PR. Let us know if you have any comments. Thank you!

@yhliang2018 yhliang2018 requested a review from a team as a code owner June 29, 2018 00:37
@yhliang2018 yhliang2018 requested a review from karmel June 29, 2018 00:38

def _preprocess_transcript(transcript, token_to_index):
"""Process transcript as label features."""
return featurizer.compute_label_feature(transcript, token_to_index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we should avoid the one line wrap function with same parameter.

def compute_label_feature(text, token_to_idx):
"""Convert string to a list of integers."""
tokens = list(text.strip().lower())
feats = [token_to_idx[token] for token in tokens]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just return here.

Arguments:
labels (string): mapping from integers to characters.
blank_index (int, optional): index for the blank '_' character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value is different from the one in comments. Also why call underscore with blank?

labels (string): mapping from integers to characters.
blank_index (int, optional): index for the blank '_' character.
Defaults to 0.
space_index (int, optional): index for the space ' ' character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same.

space_index (int, optional): index for the space ' ' character.
Defaults to 28.
"""
# e.g. labels = "[a-z]' _"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do u need a param for the single quote index?

for i, char in enumerate(sequence):
if char != self.int_to_char[self.blank_index]:
# if this char is a repetition and remove_repetitions=true,
# skip.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should wrap this into the previous line.

space. Option to remove repetitions (e.g. 'abbca' -> 'abca').
Arguments:
sequences: list of 1-d array of integers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be list of string.


def process_string(self, remove_repetitions, sequence):
"""Process each given sequence."""
seq_string = ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case your input string is very long, you should use list to hold individual char, and join them at the end with "". Python string is immutable object. concat string, means allocate new string and delete the old one.

https://waymoot.org/home/python_string/

target_strings = self.process_strings(
target_strings, remove_repetitions=True)
wer = 0
for i in xrange(len(decoded_strings)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just:

for (decoded_string, target_string) in zip(decoded_strings, target_strings):
  wer += self.wer(decoded_string, target_string) / float(len(target_string.split()))

target_strings = self.process_strings(
target_strings, remove_repetitions=True)
cer = 0
for i in xrange(len(decoded_strings)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, zip is cleaner.

Copy link
Contributor

@robieta robieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor style points. Nothing major.

strings = []
for x in xrange(len(sequences)):
seq_len = sizes[x] if sizes is not None else len(sequences[x])
string = self._convert_to_string(sequences[x], seq_len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nooooo. str is a builtin type. result_string?

seq_string += char
return seq_string

def wer(self, output, target):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general acronyms as function names is discouraged.


return distance.edit_distance(''.join(new_output), ''.join(new_target))

def cer(self, output, target):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems superfluous.


# Use multiprocessing for feature/label extraction
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=num_cores)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One trick I learned recently is that contextlib can let you use a context manager in 2 & 3.

with contextlib.closing(multiprocessing.Pool(processes=num_cores)):
  pool.map(...)

@robieta robieta self-requested a review July 9, 2018 17:07
Copy link
Contributor

@robieta robieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine tuning will be done in a follow up PR.

@yhliang2018 yhliang2018 merged commit 1635e56 into master Jul 9, 2018
@yhliang2018 yhliang2018 deleted the feat/deep_speech_eval branch July 9, 2018 17:08
djoshea pushed a commit to djoshea/models that referenced this pull request Jul 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants