## Introduction

This kernel is a replica of Jeremy Howard's kernel: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
Its goal is to provide a simple baseline built on a linear model.

This kernel shows how to use NBSVM (Naive Bayes - Support Vector Machine) to create a baseline for the Quora Insincere Questions competition. NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf). In this kernel, we use sklearn's logistic regression rather than an SVM, although in practice the two are nearly identical (sklearn can use the liblinear library behind the scenes; the model below is configured with the `sag` solver).

If you're not familiar with Naive Bayes and bag-of-words matrices, a preview of one of the videos from fast.ai's upcoming *Practical Machine Learning* course introduces the topic. This link goes to the relevant section of the video: [Naive Bayes video](https://youtu.be/37sFIak42Sc?t=3745).

In [None]:
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
subm = pd.read_csv('../input/sample_submission.csv')

## Building the model

We'll start by creating a *bag of words* representation, stored as a *term-document matrix*. We'll use n-grams, as suggested in the NBSVM paper.

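As a minimal sketch of what a term-document matrix with n-grams looks like, the following runs sklearn's `CountVectorizer` on two made-up questions. The sentences, the `toy_` names, and the `ngram_range=(1, 2)` setting are illustrative assumptions, not the configuration used later in this kernel. Each row is a document and each column counts one unigram or bigram.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only
toy_docs = ["why is the sky blue", "why is the sea salty"]

# Count unigrams and bigrams
toy_vec = CountVectorizer(ngram_range=(1, 2))
toy_term_doc = toy_vec.fit_transform(toy_docs)

# Rows are documents, columns are n-grams
print(toy_term_doc.shape)
print(sorted(toy_vec.vocabulary_))
print(toy_term_doc.toarray())
```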
In [None]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

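To see what this kind of tokenizer does, here is a simplified sketch that handles ASCII punctuation only. The names `re_tok_ascii` and `tokenize_ascii` are hypothetical, and `re.escape` is used here as an assumption to keep the character class safe; the kernel's own pattern also covers a set of non-ASCII symbols. Each punctuation mark is padded with spaces so it becomes a separate token.

```python
import re
import string

# Simplified variant: ASCII punctuation only, escaped for use inside a character class
re_tok_ascii = re.compile(f"([{re.escape(string.punctuation)}])")

def tokenize_ascii(s):
    # Surround each punctuation mark with spaces, then split on whitespace
    return re_tok_ascii.sub(r" \1 ", s).split()

tokens = tokenize_ascii("Isn't this hard?")
print(tokens)
```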
Split the training data into training and validation sets

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split

train_df, val_df = train_test_split(train, test_size=0.07, random_state=2018)

Naive Bayes SVM Classifier

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.linear_model import LogisticRegression
from scipy import sparse
class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1, solver='sag', max_iter = 100):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs
        self.solver = solver
        self.max_iter = max_iter
        
    def predict(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

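    def fit(self, x, y):
        # Accept a pandas Series or any other array-like as labels
        y = np.asarray(y)
        # Check that x and y have consistent shapes; x may stay sparse
        x, y = check_X_y(x, y, accept_sparse=True)

        def pr(x, y_i, y):
            # Smoothed per-feature sum over the rows belonging to class y_i
            p = x[y==y_i].sum(0)
            return (p+1) / ((y==y_i).sum()+1)

        # Log-count ratio between the positive and the negative class
        self._r = sparse.csr_matrix(np.log(pr(x,1,y) / pr(x,0,y)))
        # Scale every feature by its ratio before fitting the linear model
        x_nb = x.multiply(self._r)
        self._clf = LogisticRegression(C=self.C, dual=self.dual, n_jobs=self.n_jobs, solver=self.solver,
                                      max_iter=self.max_iter).fit(x_nb, y)
        return self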

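The core of the classifier is the log-count ratio: for each feature, the smoothed feature sum in the positive class divided by the smoothed sum in the negative class, then logged. The sketch below reproduces that step on a tiny made-up matrix (the values in `toy_x` and `toy_y` are assumptions chosen so the arithmetic is easy to follow). A positive ratio marks a feature that is more typical of class 1, and a negative ratio marks one that is more typical of class 0.

```python
import numpy as np
from scipy import sparse

# Made-up data: 4 documents, 2 features; the first two documents are class 1
toy_x = sparse.csr_matrix(np.array([[1, 0], [1, 1], [0, 1], [0, 1]], dtype=float))
toy_y = np.array([1, 1, 0, 0])

def pr(x, y_i, y):
    # Smoothed per-feature sum over the rows belonging to class y_i
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Log-count ratio, flattened to a 1-D array for readability
toy_r = np.asarray(np.log(pr(toy_x, 1, toy_y) / pr(toy_x, 0, toy_y))).ravel()
print(toy_r)

# Scale each feature column by its ratio, as done before fitting the linear model
toy_x_nb = toy_x.multiply(sparse.csr_matrix(toy_r.reshape(1, -1)))
print(toy_x_nb.toarray())
```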
The following function searches for the classification threshold that maximizes the F1 score

In [None]:
from sklearn.metrics import f1_score

def threshold_search(y_true, y_proba):
    best_threshold = 0
    best_score = 0
    for threshold in [i * 0.01 for i in range(100)]:
        score = f1_score(y_true=y_true, y_pred=y_proba > threshold)
        if score > best_score:
            best_threshold = threshold
            best_score = score
    search_result = {'threshold': best_threshold, 'f1': best_score}
    return search_result

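To illustrate how the search behaves, the sketch below repeats the same function on four made-up predictions (the `toy_true` and `toy_proba` values are assumptions for illustration). Because the comparison `score > best_score` is strict, the lowest threshold that reaches the best F1 is the one returned.

```python
import numpy as np
from sklearn.metrics import f1_score

def threshold_search(y_true, y_proba):
    # Try thresholds 0.00, 0.01, ..., 0.99 and keep the one with the best F1
    best_threshold = 0
    best_score = 0
    for threshold in [i * 0.01 for i in range(100)]:
        score = f1_score(y_true=y_true, y_pred=y_proba > threshold)
        if score > best_score:
            best_threshold = threshold
            best_score = score
    return {'threshold': best_threshold, 'f1': best_score}

# Made-up labels and predicted probabilities
toy_true = np.array([0, 0, 1, 1])
toy_proba = np.array([0.105, 0.405, 0.355, 0.995])

toy_result = threshold_search(toy_true, toy_proba)
print(toy_result)
```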
In [None]:
train.head()

We use unigrams, bigrams, and trigrams, keeping at most 50,000 features

In [None]:
N = 50000

# Boolean options are passed as True rather than 1
vec = TfidfVectorizer(ngram_range=(1,3), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True, max_features=N)

trn_term_doc = vec.fit_transform(train_df['question_text'])
val_term_doc = vec.transform(val_df['question_text'])
test_term_doc = vec.transform(test['question_text'])

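The effect of `ngram_range` and `max_features` can be seen on a small scale. This sketch uses three made-up questions, sklearn's default tokenizer rather than the custom `tokenize` function, and an arbitrary cap of 8 features; all of these are assumptions for illustration. The vocabulary is cut down to the most frequent n-grams, and each row that keeps at least one feature is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only
toy_questions = [
    "how do i learn python",
    "how do i learn to cook",
    "why is the sky blue",
]

# Unigrams through trigrams, capped at the 8 most frequent n-grams
toy_tfidf = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True, max_features=8)
toy_matrix = toy_tfidf.fit_transform(toy_questions)

print(toy_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))

# L2 norm of each row of the TF-IDF matrix
toy_row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(toy_row_norms)
```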
In [None]:
model = NbSvmClassifier(solver='sag', C = 1e1, max_iter=200)

In [None]:
model.fit(trn_term_doc, train_df['target'])

In [None]:
preds_val = model.predict_proba(val_term_doc)[:,1]
preds_test = model.predict_proba(test_term_doc)[:,1]

Use the validation set to find the optimal threshold

In [None]:
best_threshold = threshold_search(y_true=val_df['target'], y_proba=preds_val)

Best threshold and its F1 score on the validation set

In [None]:
best_threshold

In [None]:
# Binarize the test probabilities with the threshold chosen on the validation set
pred_test_y = (preds_test > best_threshold['threshold']).astype(int)
# `test` was loaded earlier, so its qid column is reused instead of re-reading the file
out_df = pd.DataFrame({"qid": test["qid"].values, "prediction": pred_test_y})
out_df.to_csv("submission.csv", index=False)