
Can Naive bayes in sklearn support good-turing smoothing? #12862

Open
12ycli opened this issue Dec 25, 2018 · 10 comments · May be fixed by #12943

Comments

@12ycli

12ycli commented Dec 25, 2018

Description

Naive Bayes in sklearn seems to support additive smoothing only.
Could it be changed to support Good-Turing smoothing or other smoothing methods, and how?
In some experiments, Good-Turing smoothing performs better than additive smoothing.
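
For reference, the only smoothing control currently exposed on MultinomialNB is the additive (Laplace/Lidstone) smoothing parameter alpha; a minimal sketch with made-up counts:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-count matrix: 4 documents, 6 vocabulary terms (made-up data).
X = np.array([[2, 1, 0, 0, 0, 0],
              [3, 0, 1, 0, 0, 0],
              [0, 0, 0, 2, 1, 0],
              [0, 1, 0, 3, 0, 2]])
y = np.array([0, 0, 1, 1])

# alpha is the additive smoothing parameter, which is the only smoothing
# option MultinomialNB currently offers.
clf = MultinomialNB(alpha=1.0).fit(X, y)
print(clf.predict(X))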

@jnothman
Member

jnothman commented Dec 25, 2018 via email

@12ycli
Author

12ycli commented Dec 26, 2018

I'm sorry for stating my problem unclearly.
I'm using MultinomialNB in sklearn. This classifier supports additive smoothing only. However, other smoothing methods, such as Good-Turing smoothing, can sometimes give better results.
I would really appreciate it if you would consider implementing alternative smoothing methods.
Thank you!

@jnothman
Member

I don't see why not. I'm not sure if Good-Turing is the only reasonable pick.

@12ycli
Author

12ycli commented Dec 26, 2018

I am new to the machine learning area. However, I searched and found the smoothing methods listed below.
The number after each one is the number of Google search results, from which you can perhaps estimate how popular each method is. You may pick some smoothing methods accordingly.
• Good-Turing estimate (28,200,000)
• Additive smoothing (5,210,000)
• Absolute discounting (4,180,000)
• Katz smoothing (backoff) (336,000)
• Witten-Bell smoothing (40,600)
• Kneser-Ney smoothing (31,600)
• Jelinek-Mercer smoothing (interpolation) (24,300)

@psendyk
Contributor

psendyk commented Jan 2, 2019

@jnothman How do you see this being implemented? Should we add a smoothing argument to MultinomialNB with a default value of "additive" and other possible values like "good_turing", "absolute_discounting", etc.?

@jnothman
Copy link
Member

jnothman commented Jan 2, 2019

I don't think users will benefit much from having a wide range of choices. But yes, it would be controlled by such a parameter.
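
For illustration only, a sketch of how the proposed parameter might be used; the smoothing argument and its value names come from this discussion and are not part of the current scikit-learn API:

# Hypothetical sketch of the proposed interface (not an existing sklearn API).
# X_counts is assumed to be a matrix of raw term counts, y the class labels.
clf = MultinomialNB(smoothing='good_turing')  # default would remain 'additive'
clf.fit(X_counts, y)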

@AbhishekBabuji

Has someone started working on this? I'd like to follow the specific code that goes into it so I can understand how you structure things when adding new features.

@psendyk
Contributor

psendyk commented Jan 7, 2019

Working on it now. I've only tested the code manually, but I'll open a PR so that you can see the structure. I'm open to any suggestions as to which smoothing methods should be implemented. Jelinek-Mercer, Absolute Discounting, and Dirichlet all seem like good choices (their performance depends on the size of the training data) and are very straightforward, but I'm not sure how many we want to support.

@psendyk
Contributor

psendyk commented Jan 7, 2019

@@ -698,10 +700,15 @@ class MultinomialNB(BaseDiscreteNB):
     C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
     Information Retrieval. Cambridge University Press, pp. 234-265.
     https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
+
+    Gale, William A., and Geoffrey Sampson. 1996. Good-Turing Frequency
+    Estimation without Tears. Brighton: University of Sussex.
+    https://www.grsampson.net/AGtf1.html
     """

-    def __init__(self, alpha=1.0, fit_prior=True, class_prior=None):
+    def __init__(self, alpha=1.0, smoothing='additive', fit_prior=True, class_prior=None):
         self.alpha = alpha
+        self.smoothing = smoothing
         self.fit_prior = fit_prior
         self.class_prior = class_prior

@@ -714,11 +721,11 @@ class MultinomialNB(BaseDiscreteNB):

     def _update_feature_log_prob(self, alpha):
         """Apply smoothing to raw counts and recompute log probabilities"""
-        smoothed_fc = self.feature_count_ + alpha
-        smoothed_cc = smoothed_fc.sum(axis=1)
-
-        self.feature_log_prob_ = (np.log(smoothed_fc) -
-                                  np.log(smoothed_cc.reshape(-1, 1)))
+        if self.smoothing == 'additive':
+            self.feature_log_prob_ = self._additive_smoothing(alpha)
+        elif self.smoothing == 'good-turing':
+            self.feature_log_prob_ = np.apply_along_axis(
+                    self._simple_good_turing, 1, self.feature_count_)

     def _joint_log_likelihood(self, X):
         """Calculate the posterior log probability of the samples X"""
@@ -728,6 +735,77 @@ class MultinomialNB(BaseDiscreteNB):
         return (safe_sparse_dot(X, self.feature_log_prob_.T) +
                 self.class_log_prior_)

+    def _additive_smoothing(self, alpha):
+        smoothed_fc = self.feature_count_ + alpha
+        smoothed_cc = smoothed_fc.sum(axis=1)
+        return np.log(smoothed_fc) - np.log(smoothed_cc.reshape(-1, 1))
+
+    def _simple_good_turing(self, fc):
+        N = fc.sum()
+        # Get the frequencies of frequencies
+        n = dict()
+        for r in fc:
+            if r == 0:
+                continue
+            n[r] = n.get(r, 0) + 1
+
+        # The probability of unseen samples
+        P_0 = n.get(1, 0)/N
+
+        # Compute the Z values
+        r = np.array(sorted(n.items(), key=lambda keyval: keyval[0]), dtype='int')[:, 0]
+        n = np.array(sorted(n.items(), key=lambda keyval: keyval[0]), dtype='int')[:, 1]
+        Z = dict()
+        Z[r[0]] = 2*n[0] / r[1]
+        Z[r[-1]] = n[-1] / (r[-1] - r[-2])
+        for (idx, j) in enumerate(r):
+            if idx == 0 or idx >= len(r) - 1:
+                continue
+            i = r[idx-1]
+            k = r[idx+1]
+            Z[j] = 2*n[idx]/(k-i)
+        Z = np.array(sorted(Z.items()))[:, 1]
+
+        # Compute S by linear regression in log-log space (requires LinearRegression from sklearn.linear_model)
+        log_r = np.log10(r).reshape(-1, 1)
+        log_Z = np.log10(Z).reshape(-1, 1)
+        reg = LinearRegression().fit(log_r, log_Z)
+        a, b = reg.intercept_, reg.coef_[0]
+        S = np.power(10, a+b*np.log10(np.arange(1, np.max(r)+2)))  # log10 to match the regression above
+
+        # Compute r*
+        cease_x = False
+        r_star = np.zeros(len(r))
+        for i in range(len(r)):
+            y = (r[i]+1)*(S[r[i]]/S[r[i]-1])
+            if i >= len(r) - 1 or r[i+1] != r[i] + 1:
+                cease_x = True
+            if cease_x:
+                r_star[i] = y
+                continue
+            x = (r[i]+1)*(n[i+1]/n[i])
+            threshold = 1.96*np.sqrt(((r[i]+1)**2)*(n[i+1]/n[i]**2)*(1+(n[i+1]/n[i])))
+            if np.abs(x-y) <= threshold:
+                r_star[i] = y
+                cease_x = True
+                continue
+            r_star[i] = x
+
+        # Compute the probability of each frequency
+        N_prime = np.inner(n, r_star)
+        p_r = (1-P_0)*(r_star/N_prime)
+
+        # Calculate probability of each feature
+        total_unseen = np.count_nonzero(fc == 0)
+        unseen_prob = P_0/total_unseen
+        prob = np.zeros(fc.size)
+        for i in range(len(p_r)):
+            idx = np.where(fc == r[i])
+            prob[idx] = p_r[i]
+        prob[np.where(fc == 0)] = unseen_prob
+
+        return np.log(prob)
+

This is what I have so far. The notation is the same as in the paper linked in the references.

Since Good-Turing uses raw counts, the code won't work if the input is transformed using, for example, TfidfTransformer. Would throwing an error be a reasonable solution when the input is not raw counts?
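
For example, one possible check (just a sketch; the helper name and exact condition are mine, not an agreed design) could be:

import numpy as np
from scipy.sparse import issparse

def _check_raw_counts(X):
    # Sketch: Good-Turing needs non-negative integer counts, so reject input
    # that looks transformed (e.g. tf-idf weighted).
    data = X.data if issparse(X) else np.asarray(X)
    if np.any(data < 0) or not np.allclose(data, np.rint(data)):
        raise ValueError("Good-Turing smoothing requires raw term counts.")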

Below are some other smoothing algorithms from "Enhancing Naive Bayes with Various Smoothing Methods for Short Text Classification" that might be worth considering.

import numpy as np

def jelinek_mercer(fc, alpha):
    # Interpolate each class's MLE with the corpus-wide MLE.
    mle = fc.sum(axis=0)/fc.sum()
    return np.log((1-alpha)*(fc/fc.sum(axis=1).reshape(-1, 1)) + alpha*mle)

def dirichlet(fc, alpha):
    # Add pseudo-counts proportional to the corpus-wide MLE (Dirichlet prior).
    mle = fc.sum(axis=0)/fc.sum()
    smoothed_fc = fc + alpha*mle
    smoothed_cc = smoothed_fc.sum(axis=1)
    return np.log(smoothed_fc) - np.log(smoothed_cc.reshape(-1, 1))

def absolute_discounting(fc, alpha):
    # Subtract a constant discount from seen counts and redistribute the
    # freed mass according to the corpus-wide MLE.
    mle = fc.sum(axis=0)/fc.sum()
    count_unique = np.count_nonzero(fc, axis=1).reshape(-1, 1)
    cc = fc.sum(axis=1).reshape(-1, 1)
    return (np.log(np.maximum(fc - alpha, 0) + alpha*count_unique*mle) -
            np.log(cc))

def two_stage(fc, lam, mu):
    # Dirichlet smoothing (parameter mu) followed by Jelinek-Mercer
    # interpolation (parameter lam).
    mle = fc.sum(axis=0)/fc.sum()
    smoothed_fc = fc + mu*mle
    smoothed_cc = smoothed_fc.sum(axis=1).reshape(-1, 1)
    return np.log((1-lam)*(smoothed_fc/smoothed_cc) + lam*mle)

@jnothman
Member

jnothman commented Jan 8, 2019 via email
