-
Notifications
You must be signed in to change notification settings - Fork 45.5k
Add eval and parallel dataset #4651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
def _preprocess_transcript(transcript, token_to_index): | ||
"""Process transcript as label features.""" | ||
return featurizer.compute_label_feature(transcript, token_to_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we should avoid the one line wrap function with same parameter.
def compute_label_feature(text, token_to_idx): | ||
"""Convert string to a list of integers.""" | ||
tokens = list(text.strip().lower()) | ||
feats = [token_to_idx[token] for token in tokens] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just return here.
Arguments: | ||
labels (string): mapping from integers to characters. | ||
blank_index (int, optional): index for the blank '_' character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value is different from the one in comments. Also why call underscore with blank?
labels (string): mapping from integers to characters. | ||
blank_index (int, optional): index for the blank '_' character. | ||
Defaults to 0. | ||
space_index (int, optional): index for the space ' ' character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same.
space_index (int, optional): index for the space ' ' character. | ||
Defaults to 28. | ||
""" | ||
# e.g. labels = "[a-z]' _" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do u need a param for the single quote index?
for i, char in enumerate(sequence): | ||
if char != self.int_to_char[self.blank_index]: | ||
# if this char is a repetition and remove_repetitions=true, | ||
# skip. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should wrap this into the previous line.
space. Option to remove repetitions (e.g. 'abbca' -> 'abca'). | ||
Arguments: | ||
sequences: list of 1-d array of integers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be list of string.
|
||
def process_string(self, remove_repetitions, sequence): | ||
"""Process each given sequence.""" | ||
seq_string = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case your input string is very long, you should use list to hold individual char, and join them at the end with "". Python string is immutable object. concat string, means allocate new string and delete the old one.
target_strings = self.process_strings( | ||
target_strings, remove_repetitions=True) | ||
wer = 0 | ||
for i in xrange(len(decoded_strings)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just:
for (decoded_string, target_string) in zip(decoded_strings, target_strings):
wer += self.wer(decoded_string, target_string) / float(len(target_string.split()))
target_strings = self.process_strings( | ||
target_strings, remove_repetitions=True) | ||
cer = 0 | ||
for i in xrange(len(decoded_strings)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, zip is cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor style points. Nothing major.
strings = [] | ||
for x in xrange(len(sequences)): | ||
seq_len = sizes[x] if sizes is not None else len(sequences[x]) | ||
string = self._convert_to_string(sequences[x], seq_len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nooooo. str
is a builtin type. result_string
?
seq_string += char | ||
return seq_string | ||
|
||
def wer(self, output, target): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general acronyms as function names is discouraged.
|
||
return distance.edit_distance(''.join(new_output), ''.join(new_target)) | ||
|
||
def cer(self, output, target): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems superfluous.
|
||
# Use multiprocessing for feature/label extraction | ||
num_cores = multiprocessing.cpu_count() | ||
pool = multiprocessing.Pool(processes=num_cores) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One trick I learned recently is that contextlib can let you use a context manager in 2 & 3.
with contextlib.closing(multiprocessing.Pool(processes=num_cores)):
pool.map(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine tuning will be done in a follow up PR.
Hi All,
We add the model evaluation part (Haoliang works on it), and also apply multiprocessing to dataset in this PR. Let us know if you have any comments. Thank you!