Class probability computation is very inefficient (patch enclosed) #11

Closed
ralfbrown opened this issue Aug 1, 2013 · 7 comments

@ralfbrown

The following patch produces the same output with a 4.4-fold speedup for language identification (not counting startup time) in --line mode given 650-byte average line lengths, and a 33-fold speedup with 62-byte average line lengths when using the default language model. Larger models with more features show an even larger speedup.

The speedup comes from avoiding a matrix multiplication against a feature-count vector that is mostly zeros. You may wish to tweak the cut-over from "short" to "long" texts by adjusting the self.nb_numfeats / 10 threshold; it could probably be moved higher, but I was being conservative.

    259a260,302
    >   ## optimized version by Ralf Brown
    >   def instance2classprobs(self, text):
    >     """
    >     Compute class probabilities for an instance according to the trained model
    >     """
    >     if isinstance(text, unicode):
    >       text = text.encode('utf8')
    >
    >     # Convert the text to a sequence of ascii values
    >     ords = map(ord, text)
    >
    >     state = 0
    >     if len(ords) < self.nb_numfeats / 10:
    >         # for very short texts, just apply each production every time the
    >         # state changes, rather than counting the number of occurrences of
    >         # each state
    >         pdc = np.zeros(len(self.nb_classes))
    >         for letter in ords:
    >             state = self.tk_nextmove[(state << 8) + letter]
    >             for index in self.tk_output.get(state, []):
    >                 # compute the dot product incrementally, avoiding lots
    >                 # of multiplications by zero with a sparse
    >                 # feature-count vector
    >                 pdc += self.nb_ptc[index]
    >     else:
    >         # Count the number of times we enter each state
    >         statecount = defaultdict(int)
    >         for letter in ords:
    >             state = self.tk_nextmove[(state << 8) + letter]
    >             statecount[state] += 1
    >
    >         # Update all the productions corresponding to the state
    >         arr = np.zeros((self.nb_numfeats,), dtype='uint32')
    >         for state in statecount:
    >             for index in self.tk_output.get(state, []):
    >                 arr[index] += statecount[state]
    >         # compute the partial log-probability of the document given each class
    >         pdc = np.dot(arr, self.nb_ptc)
    >
    >     # compute the partial log-probability of the document in each class
    >     pd = pdc + self.nb_pc
    >     return pd
    271,272c314,315
    <     fv = self.instance2fv(text)
    <     probs = self.norm_probs(self.nb_classprobs(fv))
    ---
    >     probs = self.instance2classprobs(text)
    >     probs = self.norm_probs(probs)
    282,283c325,326
    <     fv = self.instance2fv(text)
    <     probs = self.norm_probs(self.nb_classprobs(fv))
    ---
    >     probs = self.instance2classprobs(text)
    >     probs = self.norm_probs(probs)
@ralfbrown
Author

Oops, let's try that again (first-time poster on GitHub):

   ***************
   *** 258,261 ****
   --- 258,304 ----
         return arr

   +   ## optimized version by Ralf Brown
   +   def instance2classprobs(self, text):
   +     """
   +     Compute class probabilities for an instance according to the trained model
   +     """
   +     if isinstance(text, unicode):
   +       text = text.encode('utf8')
   +
   +     # Convert the text to a sequence of ascii values
   +     ords = map(ord, text)
   +
   +     state = 0
   +     if len(ords) < self.nb_numfeats / 10:
   +         # for very short texts, just apply each production every time the
   +         # state changes, rather than counting the number of occurrences of
   +         # each state
   +         pdc = np.zeros(len(self.nb_classes))
   +         for letter in ords:
   +             state = self.tk_nextmove[(state << 8) + letter]
   +             for index in self.tk_output.get(state, []):
   +                 # compute the dot product incrementally, avoiding lots
   +                 # of multiplications by zero with a sparse
   +                 # feature-count vector
   +                 pdc += self.nb_ptc[index]
   +     else:
   +         # Count the number of times we enter each state
   +         statecount = defaultdict(int)
   +         for letter in ords:
   +             state = self.tk_nextmove[(state << 8) + letter]
   +             statecount[state] += 1
   +
   +         # Update all the productions corresponding to the state
   +         arr = np.zeros((self.nb_numfeats,), dtype='uint32')
   +         for state in statecount:
   +             for index in self.tk_output.get(state, []):
   +                 arr[index] += statecount[state]
   +         # compute the partial log-probability of the document given each class
   +         pdc = np.dot(arr,self.nb_ptc)
   +
   +     # compute the partial log-probability of the document in each class
   +     pd = pdc + self.nb_pc
   +     return pd
   +
       def nb_classprobs(self, fv):
         # compute the partial log-probability of the document given each class
   ***************
   *** 269,274 ****
         Classify an instance.
         """
   !     fv = self.instance2fv(text)
   !     probs = self.norm_probs(self.nb_classprobs(fv))
         cl = np.argmax(probs)
         conf = float(probs[cl])
   --- 312,317 ----
         Classify an instance.
         """
   !     probs = self.instance2classprobs(text)
   !     probs = self.norm_probs(probs)
         cl = np.argmax(probs)
         conf = float(probs[cl])
   ***************
   *** 280,285 ****
         Return a list of languages in order of likelihood.
         """
   !     fv = self.instance2fv(text)
   !     probs = self.norm_probs(self.nb_classprobs(fv))
         return [(str(k),float(v)) for (v,k) in sorted(zip(probs, self.nb_classes), reverse=True)]

   --- 323,328 ----
         Return a list of languages in order of likelihood.
         """
   !     probs = self.instance2classprobs(text)
   !     probs = self.norm_probs(probs)
         return [(str(k),float(v)) for (v,k) in sorted(zip(probs, self.nb_classes), reverse=True)]

@saffsd
Owner

saffsd commented Aug 1, 2013

Hi Ralf! Thanks for your patch. I've applied it to a new branch. Dawid Weiss addresses the same issue (sparse feature vector) in his Java implementation (https://github.com/carrotsearch/langid-java). His solution uses a different data structure. I do appreciate the simplicity of your patch, handling "long" and "short" strings separately.
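To make the "sparse feature vector" idea concrete, here is a rough sketch (illustrative only, not langid-java's actual data structure) that skips the dense count vector entirely; the names statecount, tk_output, nb_ptc and nb_pc follow your patch:

    def sparse_classprobs(statecount, tk_output, nb_ptc, nb_pc):
        # statecount: {state: times entered}, as built in the patch above
        # tk_output:  {state: feature indices}; nb_ptc: per-feature rows of log-probs;
        # nb_pc: per-class priors (numpy arrays, as in langid.py)
        feat_counts = {}
        for state, count in statecount.items():
            for index in tk_output.get(state, []):
                feat_counts[index] = feat_counts.get(index, 0) + count
        pd = nb_pc.astype(float)                 # start from the class priors
        for index, count in feat_counts.items():
            pd = pd + count * nb_ptc[index]      # only touch rows whose count is non-zero
        return pd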

@saffsd
Owner

saffsd commented Aug 1, 2013

I did some experimentation to try to determine a suitable threshold for transitioning from "short" to "long" handling. Your initial estimate using the in-built feature set was num_feats / 10, which worked out to 748. I've made a slight change to that, allowing the user to specify the document length in bytes at which the changeover should occur. I experimented with samples from a European government dataset, sampling 1000 strings of n bytes, for 100 <= n <= 3000 in increments of 100. The break-even point in my experiments appears to be right around 650 bytes, and I've made this change in the aforementioned branch.
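Roughly, the measurement looked like this (a sketch only; sample_strings, classify_short and classify_long stand in for the corpus sampler and the two code paths and are not part of langid.py):

    import time

    def mean_time(fn, samples):
        # average seconds per call over one batch of sample strings
        start = time.time()
        for s in samples:
            fn(s)
        return (time.time() - start) / len(samples)

    # 1000 strings of n bytes, for 100 <= n <= 3000 in increments of 100
    for n in range(100, 3001, 100):
        samples = sample_strings(n, count=1000)        # placeholder: draw n-byte strings from the corpus
        t_short = mean_time(classify_short, samples)   # placeholder: incremental (short-text) path
        t_long = mean_time(classify_long, samples)     # placeholder: dense dot-product (long-text) path
        print "%4d bytes: short %.6fs  long %.6fs -> %s" % (
            n, t_short, t_long, 'short' if t_short < t_long else 'long')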

@saffsd saffsd closed this as completed Aug 1, 2013
@ralfbrown
Author

I'm glad I could help to improve your code. The speed issue came up
when I went to compare the performance of langid.py against my own
language identifier (a module within
http://la-strings.sourceforge.net) on the very large set of languages
I'm working with. Given 50,000+ features and 1290 classes (1188
languages, some with multiple encodings), langid.py was taking 0.7
seconds per string, which meant that an evaluation run would take over
a week, compared to less than two minutes for my program.... It will
still take a full day based on a 70-second startup time with the big
models (180-200MB on disk, ~2.5GB RAM after gc); I haven't decided yet
whether it will be worth my time to run langid.py in server mode and
write a client for my evaluation script to use.

The solution in my patch is in fact the same approach I use in my own
'whatlang' to incrementally compute the cosine similarity between the
input text and each language's model. 'whatlang' has been optimized
for short strings from the beginning.
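In outline, the incremental cosine computation looks something like this (a stripped-down sketch, not whatlang's actual code; models is assumed to map each language to a dict of n-gram weights with unit-length weight vectors):

    import math

    def rank_by_cosine(text, models, n=4):
        # models: {lang: {ngram: weight}}
        dots = dict((lang, 0.0) for lang in models)
        counts = {}
        sumsq = 0.0
        for i in range(len(text) - n + 1):
            gram = text[i:i+n]
            c = counts.get(gram, 0)
            sumsq += 2 * c + 1       # ||text||^2 grows by (c+1)^2 - c^2 when gram's count goes c -> c+1
            counts[gram] = c + 1
            for lang, model in models.items():
                w = model.get(gram)
                if w:
                    dots[lang] += w  # each occurrence of gram adds its model weight to the dot product
        norm = math.sqrt(sumsq) or 1.0
        return sorted(((lang, d / norm) for lang, d in dots.items()),
                      key=lambda item: item[1], reverse=True)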

I took a quick look at langid-java, but it seems to be a raw library
without a main() to let it run standalone, so I haven't tried running
it yet.

BTW, I set the switchover point relative to num_feats because it will
be higher for larger feature sets. My initial estimate was based on
the fact that most states in the FSA seem to fire two or three
productions, and an N-byte input can reach no more than N states, so a
string of num_feats/10 bytes would yield 30% or fewer non-zero entries
in the feature count vector. OTOH, 650 bytes is big enough that few
individual lines in real text will exceed it (Europarl isn't your
everyday text :-).
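The back-of-the-envelope arithmetic, using the built-in model's numbers quoted above (num_feats / 10 = 748 implies roughly 7480 features):

    num_feats = 7480                  # built-in model: num_feats / 10 == 748
    threshold = num_feats / 10        # 748-byte cut-over
    productions_per_state = 3         # most states fire two or three productions
    max_nonzero = threshold * productions_per_state
    print max_nonzero, float(max_nonzero) / num_feats   # 2244, ~0.30 -> at most ~30% non-zero entries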


@saffsd
Owner

saffsd commented Aug 2, 2013

I'm familiar with your work; I read your paper when it came out. Great to see others releasing open-source tools!

I've never pushed this implementation of langid.py to the limits you're taking it to, so I'll be very interested to see the final outcome of your comparison. I hope you are planning to publish it? I'm not sure server mode will help; its main advantage is to avoid unpacking the model multiple times. Are you doing that? If so, it would be faster to access langid.py as a library, creating a single instance of the LanguageIdentifier class and then reusing it. I think this implementation of langid.py is as fast as I'm able to make it while sticking with numpy; I suspect I could do a fair bit better if I implemented the core estimation as a C extension module.
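Something along these lines (a sketch only; the exact import path and constructor name here are illustrative and may differ between versions, so check the module itself):

    from langid.langid import LanguageIdentifier, model

    # Unpack the model once, then reuse the same instance for every document.
    identifier = LanguageIdentifier.from_modelstring(model)

    for path in test_files:          # test_files: whatever list the evaluation script iterates over
        text = open(path, 'rb').read()
        lang, score = identifier.classify(text)
        print path, lang, score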

Ah, I see your rationale for the num_feats/10; that is quite sensible.

As an aside, I've experimented with la-strings briefly. Is there any way to access the language identifier component directly? What I ended up doing was invoking it as "la-strings -i -I 1 ", and then taking the most common prediction as the prediction for the whole document, which seems like a rather roundabout way of doing it.
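That is, a wrapper along these lines (hypothetical sketch; it assumes each output line of "la-strings -i -I 1" begins with a language code, which may not match the real output format):

    import subprocess
    from collections import Counter

    def identify_document(path):
        # per-line identification, then a majority vote over the lines
        output = subprocess.check_output(['la-strings', '-i', '-I', '1', path])
        langs = [line.split()[0] for line in output.splitlines() if line.strip()]
        return Counter(langs).most_common(1)[0][0] if langs else None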

@ralfbrown
Author

> I'm not sure server mode will help; its main advantage is to avoid unpacking the model multiple times.

That's in fact what is happening, because my evaluation script calls
the selected language identification program once for each of the test
files (currently 1203).

> I've experimented with la-strings briefly. Is there any way to access the language identifier component directly?

The language identification code is all in the langident/ subdirectory
and separately compilable, with a front-end program called 'whatlang'.
You can tell 'whatlang' to give identification on
slightly-overlapping fixed-size blocks, the "entire" file (up to the
first 500,000 bytes), or line-by-line. There's also a
LanguageIdentifier class for incorporation into other programs.

> I'll be very interested to see the final outcome of your comparison.

I have a paper at the Text, Speech and Dialogue conference next
month. The published version compares whatlang against mguesser,
libtextcat, and Shuyo's Java LangDetect. I'm planning to add
langid.py for the oral presentation. Based on evaluating a 29-file
subset of the test set using the full model, it looks like langid.py
will be the only one to give whatlang a run for the money on accuracy
if I can get 6-grams working -- hence yesterday's memory-reduction
patch. With 5-grams and 2000 features per class (--df_tokens 20000),
langid.py made 586 classification errors out of 21607 versus 478 for
whatlang using 6-grams and 3500 features per class. For a bunch of
the languages, byte 5-grams are effectively character bigrams.

@saffsd
Owner

saffsd commented Aug 3, 2013

I've added handling of the --line option in --batch mode; this should be the most effective way for you to avoid unpacking the model over and over. You can find the modified version of langid.py in the new branch I opened for this issue. The file itself is here. Please let me know if this is helpful.

Congratulations on your publication! I will look for it when the proceedings come online.
