I did similar experiments with IBM Bluemix, using videos of recorded presentations (not videos with high-quality narration). The results were abysmal. It only got basic English words right, such as stop words and a few other simple words. It got every word that mattered wrong.
My intention was to make videos full-text searchable based on what was said in them, but the search engine ignores stop words anyway, so I gave up.
Can you elaborate a bit on that "90%" number and the nature & quality of the audio?
"transcript": "NC is only an obstacle we have to move them out of the way so we can fight the number one present yet South Africa which is drug test"
"transcript": " why does ANC might be experiencing its huge internal weaknesses the institutionalization of this party and its infrastructure and resources still has death"
Hmm... I'm impressed but not impressed :)
For example, "ANC" is a very important keyword, and it gets it wrong.
Also, the transcript has "still has death" where he said "still has depth", which can be a problem given the "strength" of the word "death".
Your results with Google's Speech API are certainly better than mine from IBM Bluemix, but I'm still unsure this transcript is good enough to put in front of users.
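One way to put a number like "90%" on firmer ground is word error rate (WER): the word-level edit distance between what was said and what was transcribed, divided by the number of words spoken. Below is a minimal sketch in plain Python. The `reference` string is an assumption: it is the transcript above with only the "death"/"depth" fix applied, since the thread does not include the full spoken text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the words of ref seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference: the API output with only "death" corrected to "depth".
reference = ("the institutionalization of this party and its "
             "infrastructure and resources still has depth")
hypothesis = ("the institutionalization of this party and its "
              "infrastructure and resources still has death")

wer = word_error_rate(reference, hypothesis)  # 1 substitution over 13 words
print(f"WER: {wer:.3f}, word accuracy: {1 - wer:.1%}")
```

On this fragment the word accuracy comes out above 90%, yet the single wrong word changes the meaning, so a raw accuracy figure says little about whether the words that matter survived.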
My plan was to use the automated transcript for my search engine's "Find videos by words uttered" feature (to extend beyond searching metadata text), but people are more likely to type in "ANC" than "the number one".
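To make the concern concrete, here is a rough sketch of a "words uttered" index built from the two transcripts quoted above. Everything in it is illustrative: the video ids (`clip-1`, `clip-2`), the tiny `STOP_WORDS` set, and the helper names are assumptions, not part of any real search engine or of snickers.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real search engine ships its own.
STOP_WORDS = {
    "the", "is", "a", "an", "and", "of", "to", "we", "so", "its", "this",
    "has", "be", "it", "only", "have", "them", "out", "can", "which",
    "why", "does", "might", "still", "yet",
}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(transcripts):
    """Map each keyword to the set of video ids whose transcript contains it."""
    index = defaultdict(set)
    for video_id, text in transcripts.items():
        for word in tokenize(text):
            index[word].add(video_id)
    return index


def search(index, query):
    """Return the video ids that contain every non-stop word of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Hypothetical video ids paired with the transcripts quoted in this thread.
transcripts = {
    "clip-1": "NC is only an obstacle we have to move them out of the way so "
              "we can fight the number one present yet South Africa which is "
              "drug test",
    "clip-2": "why does ANC might be experiencing its huge internal weaknesses "
              "the institutionalization of this party and its infrastructure "
              "and resources still has death",
}

index = build_index(transcripts)
print(search(index, "ANC"))    # only clip-2: clip-1 was transcribed as "NC"
print(search(index, "death"))  # clip-2 matches a word that was never said
```

A query for "ANC" finds only the second clip because the first was transcribed as "NC", and a query for "death" returns a clip where the word was never spoken. Both failures come from the keywords, which are exactly what a stop-word-filtered index relies on.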
Having said that I'm going to go back and re-investigate Google as an option for my videos with really clear and crisp sound.
Perhaps the outcome of this is not to actually automate it, but to document how you'd go about doing it if you're interested. You know, to avoid snickers being too tightly coupled to vendors like Google.
Recently I played with @google's speech API and it seems they have a pretty accurate speech-to-text feature. I tested by extracting the audio of some @nytimes videos using
and sent it to the Speech API. I got ~90% accuracy.
It would be a blast if we had this transcription generation as a feature of snickers.