
Can Naive bayes in sklearn support good-turing smoothing? #12862

Open
12ycli opened this issue Dec 25, 2018 · 10 comments · May be fixed by #12943

Comments

@12ycli

12ycli commented Dec 25, 2018

Description

Naive Bayes in sklearn seems to support additive smoothing only.
Could it be changed to support Good-Turing smoothing or other smoothing methods, and how?
In some experiments, Good-Turing smoothing performs better than additive smoothing.
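
For reference, the only smoothing control currently exposed on MultinomialNB is the additive (Laplace/Lidstone) smoothing parameter alpha; a minimal sketch with made-up counts:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-count matrix: 4 documents, 6 vocabulary terms (made-up data).
X = np.array([[2, 1, 0, 0, 0, 0],
              [3, 0, 1, 0, 0, 0],
              [0, 0, 0, 2, 1, 0],
              [0, 1, 0, 3, 0, 2]])
y = np.array([0, 0, 1, 1])

# alpha is the additive smoothing parameter, which is the only smoothing
# option MultinomialNB currently offers.
clf = MultinomialNB(alpha=1.0).fit(X, y)
print(clf.predict(X))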

@jnothman
Member

jnothman commented Dec 25, 2018 via email

@12ycli
Author

12ycli commented Dec 26, 2018

I'm sorry for stating my problem unclearly.
I'm using MultinomialNB in sklearn. This classifier supports additive smoothing only. However, other smoothing methods, such as Good-Turing smoothing, can sometimes give better results.
I would really appreciate it if you would consider implementing alternative smoothing methods.
Thank you!

@jnothman
Member

I don't see why not. I'm not sure if Good-Turing is the only reasonable pick.

@12ycli
Author

12ycli commented Dec 26, 2018

I am new to the machine learning area. However, I searched and found the smoothing methods listed below.
The number after each one is the number of Google search results, from which you can perhaps estimate how popular each method is. You may pick some smoothing methods accordingly.
• Good-Turing estimate (28,200,000)
• Additive smoothing (5,210,000)
• Absolute discounting (4,180,000)
• Katz smoothing (backoff) (336,000)
• Witten-Bell smoothing (40,600)
• Kneser-Ney smoothing (31,600)
• Jelinek-Mercer smoothing (interpolation) (24,300)

@psendyk
Contributor

psendyk commented Jan 2, 2019

@jnothman How do you see this being implemented? Should we add a smoothing argument to MultinomialNB with a default value of "additive" and other possible values like "good_turing", "absolute_discounting", etc.?

@jnothman
Copy link
Member

jnothman commented Jan 2, 2019

I don't think users will benefit much from having a wide range of choices. But yes, it would be controlled by such a parameter.
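
For illustration only, a sketch of how the proposed parameter might be used; the smoothing argument and its value names come from this discussion and are not part of the current scikit-learn API:

# Hypothetical sketch of the proposed interface (not an existing sklearn API).
# X_counts is assumed to be a matrix of raw term counts, y the class labels.
clf = MultinomialNB(smoothing='good_turing')  # default would remain 'additive'
clf.fit(X_counts, y)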

@AbhishekBabuji

Has someone started working on this? I'd like to follow the specific code that goes into it so I can understand how you structure things when adding new features.

@psendyk
Contributor

psendyk commented Jan 7, 2019

Working on it now. I've only tested the code manually, but I'll open a PR so that you can see the structure. I'm open to any suggestions as to which smoothing methods should be implemented. Jelinek-Mercer, Absolute Discounting, and Dirichlet all seem like good choices (their performance depends on the size of the training data) and are very straightforward, but I'm not sure how many we want to support.

@psendyk
Contributor

psendyk commented Jan 7, 2019

@@ -698,10 +700,15 @@ class MultinomialNB(BaseDiscreteNB):
     C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
     Information Retrieval. Cambridge University Press, pp. 234-265.
     https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
+
+    Gale, William A., and Geoffrey Sampson. 1996. Good-Turing Frequency
+    Estimation without Tears. Brighton: University of Sussex.
+    https://www.grsampson.net/AGtf1.html
     """

-    def __init__(self, alpha=1.0, fit_prior=True, class_prior=None):
+    def __init__(self, alpha=1.0, smoothing='additive', fit_prior=True, class_prior=None):
         self.alpha = alpha
+        self.smoothing = smoothing
         self.fit_prior = fit_prior
         self.class_prior = class_prior

@@ -714,11 +721,11 @@ class MultinomialNB(BaseDiscreteNB):

     def _update_feature_log_prob(self, alpha):
         """Apply smoothing to raw counts and recompute log probabilities"""
-        smoothed_fc = self.feature_count_ + alpha
-        smoothed_cc = smoothed_fc.sum(axis=1)
-
-        self.feature_log_prob_ = (np.log(smoothed_fc) -
-                                  np.log(smoothed_cc.reshape(-1, 1)))
+        if self.smoothing == 'additive':
+            self.feature_log_prob_ = self._additive_smoothing(alpha)
+        elif self.smoothing == 'good-turing':
+            self.feature_log_prob_ = np.apply_along_axis(
+                    self._simple_good_turing, 1, self.feature_count_)

     def _joint_log_likelihood(self, X):
         """Calculate the posterior log probability of the samples X"""
@@ -728,6 +735,77 @@ class MultinomialNB(BaseDiscreteNB):
         return (safe_sparse_dot(X, self.feature_log_prob_.T) +
                 self.class_log_prior_)

+    def _additive_smoothing(self, alpha):
+        smoothed_fc = self.feature_count_ + alpha
+        smoothed_cc = smoothed_fc.sum(axis=1)
+        return np.log(smoothed_fc) - np.log(smoothed_cc.reshape(-1, 1))
+
+    def _simple_good_turing(self, fc):
+        N = fc.sum()
+        # Get the frequencies of frequencies
+        n = dict()
+        for r in fc:
+            if r == 0:
+                continue
+            n[r] = n.get(r, 0) + 1
+
+        # The probability of unseen samples
+        P_0 = n.get(1, 0)/N
+
+        # Compute the Z values
+        r = np.array(sorted(n.items(), key=lambda keyval: keyval[0]), dtype='int')[:, 0]
+        n = np.array(sorted(n.items(), key=lambda keyval: keyval[0]), dtype='int')[:, 1]
+        Z = dict()
+        Z[r[0]] = 2*n[0] / r[1]
+        Z[r[-1]] = n[-1] / (r[-1] - r[-2])
+        for (idx, j) in enumerate(r):
+            if idx == 0 or idx >= len(r) - 1:
+                continue
+            i = r[idx-1]
+            k = r[idx+1]
+            Z[j] = 2*n[idx]/(k-i)
+        Z = np.array(sorted(Z.items()))[:, 1]
+
+        # Compute S by linear regression in log-log space (requires LinearRegression from sklearn.linear_model)
+        log_r = np.log10(r).reshape(-1, 1)
+        log_Z = np.log10(Z).reshape(-1, 1)
+        reg = LinearRegression().fit(log_r, log_Z)
+        a, b = reg.intercept_, reg.coef_[0]
+        S = np.power(10, a+b*np.log10(np.arange(1, np.max(r)+2)))  # log10 to match the regression above
+
+        # Compute r*
+        cease_x = False
+        r_star = np.zeros(len(r))
+        for i in range(len(r)):
+            y = (r[i]+1)*(S[r[i]]/S[r[i]-1])
+            if i >= len(r) - 1 or r[i+1] != r[i] + 1:
+                cease_x = True
+            if cease_x:
+                r_star[i] = y
+                continue
+            x = (r[i]+1)*(n[i+1]/n[i])
+            threshold = 1.96*np.sqrt(((r[i]+1)**2)*(n[i+1]/n[i]**2)*(1+(n[i+1]/n[i])))
+            if np.abs(x-y) <= threshold:
+                r_star[i] = y
+                cease_x = True
+                continue
+            r_star[i] = x
+
+        # Compute the probability of each frequency
+        N_prime = np.inner(n, r_star)
+        p_r = (1-P_0)*(r_star/N_prime)
+
+        # Calculate probability of each feature
+        total_unseen = np.count_nonzero(fc == 0)
+        unseen_prob = P_0/total_unseen
+        prob = np.zeros(fc.size)
+        for i in range(len(p_r)):
+            idx = np.where(fc == r[i])
+            prob[idx] = p_r[i]
+        prob[np.where(fc == 0)] = unseen_prob
+
+        return np.log(prob)
+

This is what I have so far. The notation is the same as in the paper linked in the references.

Since Good-Turing uses raw counts, the code won't work if the input is transformed using, for example, TfidfTransformer. Would throwing an error be a reasonable solution when the input is not raw counts?
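
For example, one possible check (just a sketch; the helper name and exact condition are mine, not an agreed design) could be:

import numpy as np
from scipy.sparse import issparse

def _check_raw_counts(X):
    # Sketch: Good-Turing needs non-negative integer counts, so reject input
    # that looks transformed (e.g. tf-idf weighted).
    data = X.data if issparse(X) else np.asarray(X)
    if np.any(data < 0) or not np.allclose(data, np.rint(data)):
        raise ValueError("Good-Turing smoothing requires raw term counts.")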

Below are some other smoothing algorithms from "Enhancing Naive Bayes with Various Smoothing Methods for Short Text Classification" that might be worth considering.

import numpy as np

def jelinek_mercer(fc, alpha):
    # Interpolate each class's MLE with the corpus-wide MLE.
    mle = fc.sum(axis=0)/fc.sum()
    return np.log((1-alpha)*(fc/fc.sum(axis=1).reshape(-1, 1)) + alpha*mle)

def dirichlet(fc, alpha):
    # Add pseudo-counts proportional to the corpus-wide MLE (Dirichlet prior).
    mle = fc.sum(axis=0)/fc.sum()
    smoothed_fc = fc + alpha*mle
    smoothed_cc = smoothed_fc.sum(axis=1)
    return np.log(smoothed_fc) - np.log(smoothed_cc.reshape(-1, 1))

def absolute_discounting(fc, alpha):
    # Subtract a constant discount from seen counts and redistribute the
    # freed mass according to the corpus-wide MLE.
    mle = fc.sum(axis=0)/fc.sum()
    count_unique = np.count_nonzero(fc, axis=1).reshape(-1, 1)
    cc = fc.sum(axis=1).reshape(-1, 1)
    return (np.log(np.maximum(fc - alpha, 0) + alpha*count_unique*mle) -
            np.log(cc))

def two_stage(fc, lam, mu):
    # Dirichlet smoothing (parameter mu) followed by Jelinek-Mercer
    # interpolation (parameter lam).
    mle = fc.sum(axis=0)/fc.sum()
    smoothed_fc = fc + mu*mle
    smoothed_cc = smoothed_fc.sum(axis=1).reshape(-1, 1)
    return np.log((1-lam)*(smoothed_fc/smoothed_cc) + lam*mle)

@jnothman
Member

jnothman commented Jan 8, 2019 via email
