Ruby-based web service for speech recognition, using the PocketSphinx gstreamer module
Switch branches/tags
Nothing to show
Pull request Compare This branch is 30 commits behind alumae:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
lib
scripts
views
LICENSE
README.rdoc
conf.yaml
config.ru
dummy.fsg
unicorn.conf.rb

README.rdoc

Introduction

Ruby-based web service for speech recognition, using the PocketSphinx gstreamer module.

Requirements

  • Ruby 1.8

  • Sinatra

  • Rack

  • Unicorn

  • Additional Ruby modules: gdbm, rack/throttle, maruku

  • Pocketsphinx (NOTE: some features of the server require modifications of Pocketsphinx that are not yet public. Sorry about that.)

  • Some acoustic and language models for Pocketsphinx

Running

unicorn -c unicorn.conf.rb config.ru

Configuration

Web service

Unicorn configuration is in file unicorn.conf.rb. See unicorn.bogomips.org/examples/unicorn.conf.rb for more info.

Recognizer

See conf.yaml

Using the web service

Example 1

Record a sentence to a wav file, in mono (hit Ctrl-C when done speaking):

rec -c 1 lause.wav

Send it to the web service:

curl   -X POST --data-binary @lause.wav -H "Content-Type: audio/x-wav"  http://localhost:8080/recognize

Output (encoded using json):

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": [
        "t\u00e4na on v\u00e4ljas \u00fcsna ilus ilm"
      ]
    }
  ],
  "id": "e30f54561135d681599915562d77d240"
}

Example 2

Record a raw file using arecord:

arecord --format=S16_LE  --file-type raw  --channels 1 --rate 16000 > sentence2.raw

Send it to web service:

curl -X POST --data-binary @senrence.raw -H "Content-Type: audio/x-raw-int; rate=16000"  http://localhost:8080/recognize

Example 3

Record a 5 second audio, pipe it to curl, which streams it directly to web service using PUT (and gets almost instant response):

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  http://localhost:8080/recognize

Support for JSGF grammars

Users can now use their own grammars to recognize certain sentences. The grammars should be in JSGF format.

Example JSGF (let's call it robot.jsgf)

#JSGF V1.0;

grammar robot;

public <command> = (liigu | mine ) [ ( üks | kaks | kolm | neli | viis ) meetrit ] (edasi | tagasi);

NB! Grammars should be in the same charset that the server is using for dictionary, which currently is latin-1 (sorry for that).

You need to upload the JSGF file to somewhere where the server can fetch it, let's say www.example.com/robot.txt

Now, let the server download and compile it:

curl -vv  http://localhost:8080/fetch-lm?url=http://www.example.com/robot.jsgf

This should result in HTTP/1.1 200 OK.

Now you can use the grammar to recognize a sentence that is accepted by the grammar:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | \
curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  http://localhost:8080/recognize?lm=http://www.example.com/robot.jsgf

Result:

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "mine viis meetrit tagasi"
    }
  ],
  "id": "9e3895e9ee0b5138e73c6fca30f51a58"
}

If you update the grammar on the server, you need to make the /fetch-jsgf request again, as the server doesn't check for changes every time a recognition request is done (for efficiency reasons).

Support for GF grammars

GF (Grammatical Framework) grammars are now supported.

A GF grammar must be compiled into a .pgf file. To upload it to the server, use the fetch-pgf API call, e.g.:

curl "http://bark.phon.ioc.ee/speech-api/v1/fetch-lm?url=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf&lang=Est"

The 'lang' attribute (defaults to 'Est') specifies input languages of the grammar. Many comma-seperated languages can be specified, e.g lang=Est,Est2

To recognize with a GF, use similar request as with JSGF, e.g.:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  "http://localhost:8080/recognize?lm=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf

You can also specify output language(s) that will be used to linearize the raw recognition result, e.g.:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  "http://localhost:8080/recognize?lm=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf&output-lang=App"

Output:

{
 "status": 0,
 "hypotheses": [
   {
     "utterance": "viis minutit sekundites",
     "linearizations": [
       {
         "lang": "App",
         "output": "5 ' IN \""
       },
       {
         "lang": "App",
         "output": "5 min IN s"
       }
     ]
   }
 ],
 "id": "83486feaca30995401ed4a66951a3f23"
}

Multiple output languages can be used, by using comma-speperated values: “..&output-lang=App,App2”