In Jeremy Howard's kernel [NB-SVM strong linear baseline](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline), the log-count ratio `r` (equation (2) of [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf)) is implemented differently from the paper: the paper normalizes by the L1 norm, whereas the kernel normalizes by the vector length, i.e. the number of rows with that label (`(y==y_i).sum()`). Many people have asked about this, and about JH's explanation: "Normally yes, but here that is rolled into the bias term in the logistic regression automatically."

The last part of this notebook, "Explain the 1norm", looks into this difference and offers an explanation. The rest of the original "NB-SVM strong linear baseline" kernel is left unchanged.

## Introduction

This kernel shows how to use NBSVM (Naive Bayes - Support Vector Machine) to create a strong baseline for the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) competition. NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf). In this kernel, we use sklearn's logistic regression, rather than SVM, although in practice the two are nearly identical (sklearn uses the liblinear library behind the scenes).

If you're not familiar with naive bayes and bag of words matrices, I've made a preview available of one of fast.ai's upcoming *Practical Machine Learning* course videos, which introduces this topic. Here is a link to the section of the video which discusses this: [Naive Bayes video](https://youtu.be/37sFIak42Sc?t=3745).

In [None]:
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
subm = pd.read_csv('../input/sample_submission.csv')

## Looking at the data

The training data contains a row per comment, with an id, the text of the comment, and 6 different labels that we'll try to predict.

Here's a couple of examples of comments, one toxic, and one with no labels.

The length of the comments varies a lot.

We'll create a list of all the labels to predict, and we'll also create a 'none' label so we can see how many comments have no labels. We can then summarize the dataset.

In [None]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train['none'] = 1-train[label_cols].max(axis=1)
train.describe()
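As a small illustration of the `none` column, here is a sketch on a made-up 0/1 label matrix (a NumPy stand-in for `train[label_cols]`, not real data). The row-wise maximum is 1 whenever any label is set, so `1 - max` is 1 only for comments with no labels at all.

```python
import numpy as np

# Toy 0/1 label matrix: 3 comments x 6 labels (made-up values).
labels = np.array([[0, 0, 0, 0, 0, 0],
                   [1, 0, 1, 0, 1, 0],
                   [0, 0, 0, 1, 0, 0]])

# Row-wise max is 1 if any label is set; subtracting from 1 flags unlabeled rows.
none = 1 - labels.max(axis=1)
print(none)  # [1 0 0]
```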

There are a few empty comments that we need to get rid of, otherwise sklearn will complain.

In [None]:
COMMENT = 'comment_text'
train[COMMENT] = train[COMMENT].fillna("unknown")
test[COMMENT] = test[COMMENT].fillna("unknown")

## Building the model

We'll start by creating a *bag of words* representation, as a *term document matrix*. We'll use ngrams, as suggested in the NBSVM paper.

In [None]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
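To make the tokenizer's behaviour concrete, here is a short self-contained sketch that repeats the definition above and applies it to a made-up sentence (not taken from the dataset). Each punctuation character is padded with spaces so that it becomes a token of its own.

```python
import re, string

# Same pattern as the cell above: every listed punctuation character is captured.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # Pad each punctuation mark with spaces, then split on whitespace.
    return re_tok.sub(r' \1 ', s).split()

sample = "Don't shout, please!"   # made-up example comment
tokens = tokenize(sample)
print(tokens)  # ['Don', "'", 't', 'shout', ',', 'please', '!']
```

Note that the apostrophe splits contractions into three tokens, which is why bigrams help recover some of the lost context.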

It turns out that using TF-IDF gives even better priors than the binarized features used in the paper. I don't think this has been mentioned in any paper before, but it improves leaderboard score from 0.59 to 0.55.

In [None]:
n = train.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
               smooth_idf=1, sublinear_tf=1 )
trn_term_doc = vec.fit_transform(train[COMMENT])
test_term_doc = vec.transform(test[COMMENT])

This creates a *sparse matrix* with only a small number of non-zero elements (*stored elements* in the representation below).

In [None]:
trn_term_doc, test_term_doc
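For intuition about what the `use_idf`, `smooth_idf` and `sublinear_tf` settings do to each count, here is a rough NumPy sketch on a made-up dense count matrix. It follows the formulas in the scikit-learn documentation (smoothed IDF `ln((1+n)/(1+df)) + 1`, sublinear TF `1 + ln(tf)`, then the default L2 row normalization) and leaves out tokenization, n-grams and the `min_df`/`max_df` filtering, so it is an approximation of the weighting step only.

```python
import numpy as np

# Toy raw counts: 3 documents x 4 terms (made-up numbers).
counts = np.array([[2., 0., 1., 0.],
                   [0., 1., 1., 0.],
                   [1., 0., 1., 3.]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of documents containing each term

# smooth_idf: act as if one extra document contained every term.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# sublinear_tf: replace each nonzero count tf with 1 + ln(tf).
tf = np.zeros_like(counts)
nz = counts > 0
tf[nz] = 1 + np.log(counts[nz])

# Weight, then scale every row to unit Euclidean length (the default norm='l2').
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
```

Zero counts stay zero throughout, which is what keeps the real matrix sparse.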

Here's the basic naive bayes feature equation:

In [None]:
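def pr(y_i, y):
    # `x` is the global term-document matrix assigned in the next cell.
    # Smoothed per-term feature sum over rows labelled y_i, divided by the smoothed row count.
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)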
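For reference, the two definitions can be written side by side. The first line is equation (2) of the paper, where labels are ±1, `\alpha` is the smoothing constant and `f^{(i)}` is the feature vector of document `i`. The second is what `pr` yields once the two classes are combined as `log(pr(1,y) / pr(0,y))`; the symbols `N_1` and `N_0` (the number of rows labelled 1 and 0) are notation introduced here, not taken from the paper.

```latex
% Paper (Wang & Manning, eq. 2):
p = \alpha + \sum_{i:\,y^{(i)}=1} f^{(i)}, \qquad
q = \alpha + \sum_{i:\,y^{(i)}=-1} f^{(i)}, \qquad
r_{\text{paper}} = \log\!\left(\frac{p/\lVert p\rVert_1}{q/\lVert q\rVert_1}\right)

% Kernel, with \alpha = 1 and labels in \{0, 1\}:
r_{\text{kernel}} = \log\!\left(\frac{p/(N_1+1)}{q/(N_0+1)}\right)
```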

In [None]:
x = trn_term_doc
test_x = test_term_doc

Fit a model for one dependent at a time:

In [None]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
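    # dual=True is only supported by the liblinear solver, so request it explicitly.
    m = LogisticRegression(C=4, dual=True, solver='liblinear')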
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r
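To see what `r` and `x_nb` look like, here is a minimal sketch on a made-up dense matrix. The names `x_toy`, `y_toy` and `pr_toy` are hypothetical stand-ins for the sparse `x`, the label vector and `pr`; the arithmetic is the same.

```python
import numpy as np

# Toy term-document matrix: 4 documents x 3 terms (made-up counts).
x_toy = np.array([[1., 0., 2.],
                  [0., 1., 1.],
                  [3., 0., 0.],
                  [0., 2., 1.]])
y_toy = np.array([1, 0, 1, 0])

def pr_toy(x, y_i, y):
    # Smoothed feature sum for class y_i, divided by the smoothed class size.
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Log-count ratio: positive for terms more typical of class 1, negative for class 0.
r_toy = np.log(pr_toy(x_toy, 1, y_toy) / pr_toy(x_toy, 0, y_toy))

# Scale every column of the features by its ratio before fitting the classifier.
x_nb_toy = x_toy * r_toy
print(r_toy)  # approximately [ 1.609 -1.386  0.   ]
```

The third term appears equally often in both classes, so its ratio is 0 and it contributes nothing to `x_nb_toy`.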

And finally, create the submission file.
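The cell that writes the submission is not included in this copy of the notebook. As a rough sketch only: assuming `preds` is an array with one column of predicted probabilities per label (for example column 1 of `m.predict_proba(test_x.multiply(r))` for each model returned by `get_mdl`), the file could be assembled as follows. The helper `make_submission` and the toy ids and values are hypothetical.

```python
import numpy as np
import pandas as pd

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def make_submission(ids, preds, label_cols):
    # One row per test comment: its id followed by one probability column per label.
    out = pd.DataFrame({'id': list(ids)})
    for i, col in enumerate(label_cols):
        out[col] = preds[:, i]
    return out

# Toy stand-ins for the test ids and the predicted probabilities.
toy_ids = ['a1', 'b2', 'c3']
toy_preds = np.full((len(toy_ids), len(label_cols)), 0.5)

submission = make_submission(toy_ids, toy_preds, label_cols)
# submission.to_csv('submission.csv', index=False)  # uncomment to write the file
```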

## Explain the 1norm

Many people have asked how `r` is computed. In the paper it is equation (2), which uses the L1 norm, but the code above uses the vector length (`(y==y_i).sum()`). Here we take one label and break down the steps to calculate `r`.

Take the first label.

In [None]:
x = trn_term_doc
j = label_cols[0]
j

In [None]:
y = train[j]
y = y.values

Use `p` and `q` as the paper does.

In [None]:
p = x[y==1].sum(0)+1
q = x[y==0].sum(0)+1

This is the paper's version, using the L1 norms `p_n_bk` and `q_n_bk`. (`p` and `q` are positive, so the L1 norm is just the sum.)

In [None]:
p_n_bk = p.sum()
q_n_bk = q.sum()
r_bk = np.log( (p/p_n_bk) / (q/q_n_bk))
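A quick check of the claim that the L1 norm of a positive vector is its plain sum, on a made-up vector. One caveat for anyone swapping in `np.linalg.norm`: with `ord=1` it computes the vector L1 norm only for 1-D input, while for a 2-D array (such as the `1 x n` matrix `p` above) it returns the maximum absolute column sum, so the sketch uses a 1-D array.

```python
import numpy as np

p_toy = np.array([5.0, 1.0, 3.0])        # made-up smoothed sums, all positive

l1_norm = np.linalg.norm(p_toy, ord=1)   # sum of absolute values for 1-D input
plain_sum = p_toy.sum()

print(l1_norm, plain_sum)  # 9.0 9.0
```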

We can see that because log(AB) = log(A) + log(B), `r` can be broken down into `np.log(p/q)` and `np.log(q_n_bk/p_n_bk)`, where the L1 norms appear only in the second part.

In [None]:
np.allclose(r_bk, np.log(p/q) + np.log(q_n_bk/p_n_bk))

This is JH's implementation in the kernel. Instead of the L1 norms, the normalizing terms are the vector lengths `p_n_jh` and `q_n_jh`.

In [None]:
p_n_jh = (y==1).sum()+1
q_n_jh = (y==0).sum()+1
r_jh = np.log( (p/p_n_jh) / (q/q_n_jh))

We can see that because log(AB) = log(A) + log(B), `r` can be broken down into `np.log(p/q)` and `np.log(q_n_jh/p_n_jh)`, where the normalizing terms appear only in the second part.

In [None]:
np.allclose(r_jh, np.log(p/q) + np.log(q_n_jh/p_n_jh))

Therefore the `r` from the paper and the `r` from JH's kernel differ only by a constant, `np.log(q_n_jh/p_n_jh) - np.log(q_n_bk/p_n_bk)`, which is added to every element of `r_bk` to give `r_jh`.

In [None]:
cnst = np.log(q_n_jh/p_n_jh) - np.log(q_n_bk/p_n_bk)
cnst

In [None]:
np.allclose(r_jh, r_bk + cnst)
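Written out as a derivation, the steps above look like this. Here `N_1` and `N_0` denote the number of rows with `y==1` and `y==0` (so `p_n_jh` is `N_1+1` and `q_n_jh` is `N_0+1`), and `c` stands for `cnst`; this notation is introduced only for this summary.

```latex
r_{\text{jh}} = \log\frac{p/(N_1+1)}{q/(N_0+1)}
              = \log\frac{p}{q} + \log\frac{N_0+1}{N_1+1}

r_{\text{bk}} = \log\frac{p/\lVert p\rVert_1}{q/\lVert q\rVert_1}
              = \log\frac{p}{q} + \log\frac{\lVert q\rVert_1}{\lVert p\rVert_1}

r_{\text{jh}} - r_{\text{bk}}
  = \log\frac{N_0+1}{N_1+1} - \log\frac{\lVert q\rVert_1}{\lVert p\rVert_1}
  = c
```

The `\log(p/q)` term is a vector with one entry per feature, while `c` is a single scalar shared by all features.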

So when it comes to the elementwise product:

Paper version `xm`

In [None]:
xm = x.multiply(r_bk) 

JH kernel version `xmjh`

In [None]:
xmjh = x.multiply(r_jh)

We have `xmjh = xm + x*cnst`

In [None]:
np.allclose(xmjh.tocsr()[0].todense(), 
            (xm + x*cnst)[0].todense())

So `x*cnst` is the only difference in what is fed into the logistic regression models, hence JH's explanation: "Normally yes, but here that is rolled into the bias term in the logistic regression automatically".
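To trace how that difference reaches a linear model's score, here is a small self-contained sketch on made-up dense data. The weight vector `w` is arbitrary and purely illustrative (it is not a fitted model), and `x_toy` and `y_toy` are hypothetical stand-ins for the real matrices. For any fixed `w`, the two scores differ by `cnst` times the score of the unscaled features.

```python
import numpy as np

rng = np.random.default_rng(0)
x_toy = rng.random((6, 4))                 # made-up non-negative features
y_toy = np.array([1, 0, 1, 0, 0, 1])
w = rng.normal(size=4)                     # arbitrary weights, for illustration only

p = x_toy[y_toy == 1].sum(0) + 1
q = x_toy[y_toy == 0].sum(0) + 1

# Paper version: normalize by the L1 norms (plain sums, since p and q are positive).
r_bk = np.log((p / p.sum()) / (q / q.sum()))

# Kernel version: normalize by the smoothed class sizes.
p_n_jh = (y_toy == 1).sum() + 1
q_n_jh = (y_toy == 0).sum() + 1
r_jh = np.log((p / p_n_jh) / (q / q_n_jh))

cnst = np.log(q_n_jh / p_n_jh) - np.log(q.sum() / p.sum())

# Linear scores before any bias term is added.
score_bk = (x_toy * r_bk) @ w
score_jh = (x_toy * r_jh) @ w
same = np.allclose(score_jh, score_bk + cnst * (x_toy @ w))
print(same)  # True
```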

