Add eval and parallel dataset #4651

Merged

yhliang2018 merged 1 commit into master from feat/deep_speech_eval

Jul 9, 2018

Contributor

yhliang2018 commented Jun 29, 2018

Hi All,

This PR adds the model evaluation part (Haoliang works on it) and also applies multiprocessing to the dataset. Let us know if you have any comments. Thank you!


          Add eval and parallel dataset

a6b69bc

yhliang2018 requested a review from a team as a code owner

June 29, 2018 00:37

googlebot added the cla: yes label

yhliang2018 requested a review from karmel

June 29, 2018 00:38

qlzh727 reviewed

research/deep_speech/data/dataset.py

    
              def _preprocess_transcript(transcript, token_to_index):

                """Process transcript as label features."""

                return featurizer.compute_label_feature(transcript, token_to_index)

Member

qlzh727 Jun 29, 2018

Usually we should avoid one-line wrapper functions that take the same parameters as the function they wrap.

research/deep_speech/data/featurizer.py

    
              def compute_label_feature(text, token_to_idx):

                """Convert string to a list of integers."""

                tokens = list(text.strip().lower())

                feats = [token_to_idx[token] for token in tokens]

Member

qlzh727 Jun 29, 2018

You can just return the list directly here.
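A minimal sketch of that suggestion, assuming the function ends by returning `feats` (the diff context cuts off before that line). The sample `token_to_idx` mapping in the usage note is hypothetical.

```python
def compute_label_feature(text, token_to_idx):
    """Convert string to a list of integers."""
    tokens = list(text.strip().lower())
    # Return the comprehension directly instead of binding it to `feats` first.
    return [token_to_idx[token] for token in tokens]
```

For example, with a hypothetical mapping `{"a": 1, "b": 2}`, calling `compute_label_feature(" Ab ", {"a": 1, "b": 2})` strips and lowercases the text before looking up each character.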

research/deep_speech/decoder.py

    
                  Arguments:

                    labels (string): mapping from integers to characters.

                    blank_index (int, optional): index for the blank '_' character.

Member

qlzh727 Jun 29, 2018

The default value is different from the one in the comment. Also, why is the underscore called blank?

research/deep_speech/decoder.py

    
                    labels (string): mapping from integers to characters.

                    blank_index (int, optional): index for the blank '_' character.

                      Defaults to 0.

                    space_index (int, optional): index for the space ' ' character.

Member

qlzh727 Jun 29, 2018

Same here.

research/deep_speech/decoder.py

    
                    space_index (int, optional): index for the space ' ' character.

                      Defaults to 28.

                  """

                  # e.g. labels = "[a-z]' _"

Member

qlzh727 Jun 29, 2018

Do you need a parameter for the single-quote index?

research/deep_speech/decoder.py

    
                  for i, char in enumerate(sequence):

                    if char != self.int_to_char[self.blank_index]:

                      # if this char is a repetition and remove_repetitions=true,

                      # skip.

Member

qlzh727 Jun 29, 2018

This should be wrapped into the previous line.

research/deep_speech/decoder.py

    
                  space. Option to remove repetitions (e.g. 'abbca' -> 'abca').

                  Arguments:

                    sequences: list of 1-d array of integers

Member

qlzh727 Jun 29, 2018

Should be a list of strings.

research/deep_speech/decoder.py

    
                def process_string(self, remove_repetitions, sequence):

                  """Process each given sequence."""

                  seq_string = ''

Member

qlzh727 Jun 29, 2018

In case your input string is very long, you should use a list to hold the individual characters and join them at the end with "".join(). Python strings are immutable, so each concatenation allocates a new string and discards the old one.

https://waymoot.org/home/python_string/
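A rough sketch of the list-and-join pattern applied to this kind of loop. The function name `collapse_sequence` and its defaults are hypothetical and only approximate the visible parts of `process_string`; this is not the PR's actual implementation.

```python
def collapse_sequence(sequence, blank_char="_", remove_repetitions=True):
    """Drop blank characters and, optionally, consecutive repeats."""
    chars = []  # Collect characters in a list instead of growing a string.
    for i, char in enumerate(sequence):
        if char == blank_char:
            continue
        # Skip a character that repeats the one immediately before it.
        if remove_repetitions and i != 0 and char == sequence[i - 1]:
            continue
        chars.append(char)
    # A single join at the end builds the result string once.
    return "".join(chars)
```

With the docstring's own example, `collapse_sequence("abbca")` gives `"abca"`.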

research/deep_speech/decoder.py

    
                  target_strings = self.process_strings(

                      target_strings, remove_repetitions=True)

                  wer = 0

                  for i in xrange(len(decoded_strings)):

Member

qlzh727 Jun 29, 2018

How about just:

for (decoded_string, target_string) in zip(decoded_strings, target_strings):
  wer += self.wer(decoded_string, target_string) / float(len(target_string.split()))
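To make the zip-based loop concrete, here is a self-contained sketch. The helper `word_edit_distance` is a hypothetical stand-in for the decoder's own distance call, and averaging over the batch in `mean_wer` is an assumption; the diff context does not show what happens to `wer` after the loop.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(ref_words) + 1))
    for i, hyp in enumerate(hyp_words, start=1):
        curr = [i]
        for j, ref in enumerate(ref_words, start=1):
            cost = 0 if hyp == ref else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_wer(decoded_strings, target_strings):
    """Average per-utterance word error rate (assumes non-empty targets)."""
    total = 0.0
    # zip pairs each decoded string with its target; no index bookkeeping.
    for decoded, target in zip(decoded_strings, target_strings):
        target_words = target.split()
        distance = word_edit_distance(decoded.split(), target_words)
        total += distance / float(len(target_words))
    return total / len(target_strings)
```

Note that an empty target string would cause a division by zero here, just as in the one-liner above, so a real implementation may want to guard against that case.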

research/deep_speech/decoder.py

    
                  target_strings = self.process_strings(

                      target_strings, remove_repetitions=True)

                  cer = 0

                  for i in xrange(len(decoded_strings)):

Member

qlzh727 Jun 29, 2018

Same here; zip is cleaner.

robieta reviewed

Contributor

robieta left a comment

Just some minor style points. Nothing major.

research/deep_speech/decoder.py

    
                  strings = []

                  for x in xrange(len(sequences)):

                    seq_len = sizes[x] if sizes is not None else len(sequences[x])

                    string = self._convert_to_string(sequences[x], seq_len)

Contributor

robieta Jun 29, 2018

Nooooo. string is the name of a standard library module, and it is too close to the builtin type str. result_string?

research/deep_speech/decoder.py

    
                        seq_string += char

                  return seq_string

                def wer(self, output, target):

Contributor

robieta Jun 29, 2018

In general, acronyms as function names are discouraged.

research/deep_speech/decoder.py

    
                  return distance.edit_distance(''.join(new_output), ''.join(new_target))

                def cer(self, output, target):

Contributor

robieta Jul 2, 2018

This seems superfluous.

research/deep_speech/data/dataset.py

    
                # Use multiprocessing for feature/label extraction

                num_cores = multiprocessing.cpu_count()

                pool = multiprocessing.Pool(processes=num_cores)

Contributor

robieta Jul 2, 2018

One trick I learned recently is that contextlib lets you use a context manager in both Python 2 and 3.

with contextlib.closing(multiprocessing.Pool(processes=num_cores)) as pool:
  pool.map(...)
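A runnable sketch of the same pattern. It uses `multiprocessing.pool.ThreadPool` as a stand-in so the snippet works without a `__main__` guard; `multiprocessing.Pool` exposes the same `map`, `close`, and `join` methods. The function `extract_lengths` and its arguments are hypothetical.

```python
import contextlib
from multiprocessing.pool import ThreadPool


def extract_lengths(transcripts, num_workers=2):
    """Map len() over the transcripts using a pool that is always closed."""
    # contextlib.closing calls pool.close() on exit, even if map raises.
    with contextlib.closing(ThreadPool(processes=num_workers)) as pool:
        lengths = pool.map(len, transcripts)
    # close() only stops new work from being submitted; join() waits for
    # the workers to finish shutting down.
    pool.join()
    return lengths
```

The `as pool` binding matters: without it, the name used inside the block would not refer to the pool created in the `with` statement.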

robieta self-requested a review

July 9, 2018 17:07

robieta approved these changes

Contributor

robieta left a comment

Fine-tuning will be done in a follow-up PR.

yhliang2018 merged commit 1635e56 into master

yhliang2018 deleted the feat/deep_speech_eval branch

July 9, 2018 17:08

djoshea pushed a commit to djoshea/models that referenced this pull request


          Add eval and parallel dataset (tensorflow#4651)

56aa5de
