## Introduction

This kernel shows how to use NBSVM (Naive Bayes - Support Vector Machine) to create (at the very least) a strong baseline for the [DonorsChoose.org Application Screening](https://www.kaggle.com/c/donorschoose-application-screening) playground competition. NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf). In this kernel we use sklearn's logistic regression instead of an SVM; in practice the two perform nearly identically (sklearn uses the liblinear library behind the scenes).

If you're not familiar with [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) or [bag-of-words matrices](https://en.wikipedia.org/wiki/Bag-of-words_model), another competitor made available a preview of one of the videos from fast.ai's upcoming *Practical Machine Learning* course, which [introduces this topic](https://youtu.be/37sFIak42Sc?t=3745).

In [1]:
import numpy as np
import pandas as pd
import math
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
subm = pd.read_csv('../input/sample_submission.csv')

## Quickly Exploring the Data

The training data contains one row per application, with an id, the text of the essays, and a single binary label (`project_is_approved`) that we'll try to predict.

In [3]:
train.head()

Here are a couple of example essays (essay 1 and essay 2) from the first application.

In [4]:
train['project_essay_1'][0]

In [5]:
train['project_essay_2'][0]

The essay lengths vary a little, although I think there is a word limit on the essays.

In [7]:
lens1 = train.project_essay_1.str.len()
lens2 = train.project_essay_2.str.len()
lens3 = train.project_title.str.len()
lens4 = train.project_resource_summary.str.len()
# Print each summary explicitly; a notebook cell only displays its last expression.
for lens in (lens1, lens2, lens3, lens4):
    print(lens.mean(), lens.std(), lens.max())

In [8]:
lens1.hist(grid=False, bins=30);
lens2.hist(grid=False, bins=30);
lens3.hist(grid=False, bins=30);
lens4.hist(grid=False, bins=30);

We'll create a list of the labels to predict (here there is only one, `project_is_approved`). We can then summarize the dataset.

In [9]:
label_cols = ['project_is_approved']
train.describe()

In [10]:
train_length = len(train)
test_length = len(test)
print("Train set has", train_length, "pieces of data.")
print("Test set has", test_length, "pieces of data.")

There are a few empty text fields that we need to fill in, otherwise sklearn will complain. It's worth mentioning here that, of the four essays, we're only looking at essays 1 and 2: essays 3 and 4 don't seem as important to the test dataset, and even if they were, there aren't many of them to train on anyway.

In [11]:
ESSAY1 = 'project_essay_1'
ESSAY2 = 'project_essay_2'
TITLE = 'project_title'
RESOURCES = 'project_resource_summary'
# Assign the filled column back instead of calling fillna(inplace=True) on a
# column selection, which may not modify the DataFrame in recent pandas versions.
for col in (ESSAY1, ESSAY2, TITLE, RESOURCES):
    train[col] = train[col].fillna("unknown")
    test[col] = test[col].fillna("unknown")

## Building the Model and Making a Submission (essay 1)

We'll start by creating a *bag of words* representation as a [*term document matrix*](https://en.wikipedia.org/wiki/Document-term_matrix). Because the paper mentioned earlier suggests using [*n*-grams](https://en.wikipedia.org/wiki/N-gram), we will use them here (unigrams and bigrams).

In [12]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
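
To see what this tokenizer does, here's a small sketch that runs it on a made-up sentence (the sentence is not from the dataset). Each punctuation mark gets padded with spaces, so it ends up as its own token.

```python
import re
import string

# Same pattern as the cell above: put spaces around every punctuation mark.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    return re_tok.sub(r' \1 ', s).split()

# Made-up example sentence, for illustration only.
sample = "My students can't wait (really!) to read."
tokens = tokenize(sample)
print(tokens)
```

Note that contractions such as "can't" are split into three tokens, which is one reason bigrams can help recover some of the lost context.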

It turns out that using TFIDF gives even better priors than the binarized features used in the paper. I don't think this has been mentioned in any published literature before, but it improves the public leaderboard score.

In [13]:
n = train.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
               smooth_idf=1, sublinear_tf=1 )
trn_term_doc = vec.fit_transform(train[ESSAY1])
test_term_doc = vec.transform(test[ESSAY1])
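
As a rough illustration of what `ngram_range=(1,2)` produces, here's a minimal sketch on a made-up three-document corpus. It assumes sklearn's default tokenizer rather than the custom `tokenize` above, and the names `docs`, `toy_vec` and `toy_matrix` are only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, for illustration only.
docs = [
    "students need books",
    "students need tablets",
    "books for reading",
]

# Unigrams and bigrams, with sublinear (1 + log) term-frequency scaling.
toy_vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
toy_matrix = toy_vec.fit_transform(docs)

# The vocabulary holds every unigram and bigram seen in the corpus.
print(sorted(toy_vec.vocabulary_))
print(toy_matrix.shape)

# Each row is L2-normalised by default.
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(row_norms)
```

The matrix has one row per document and one column per distinct unigram or bigram, which is why the real matrix gets very wide.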

This creates a [*sparse matrix*](https://en.wikipedia.org/wiki/Sparse_matrix) with only a small number of non-zero elements (*stored elements* in the representation below).

In [14]:
trn_term_doc, test_term_doc
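
If you want a feel for how sparse such a matrix is, a quick sketch like the following works. The small matrix is made up, and `density` is just a name for this example; the same `nnz` and `shape` attributes are available on `trn_term_doc`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 3 x 4 matrix with mostly zero entries.
dense = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
])
sparse = csr_matrix(dense)

# nnz is the number of stored (non-zero) elements.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)
```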

Here's the basic Naive Bayes feature equation:

In [15]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)
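
Written out, and as far as I can tell from the code, `pr` computes the following for each feature $j$ and class $c$, where $x_{ij}$ is the TF-IDF value of feature $j$ in document $i$. The log-count ratio $r$ used when fitting the model is then the log of the ratio between the two classes. Note that this is a sketch of what the code does, not the exact formula from the paper: the paper normalizes the smoothed counts by their L1 norm, whereas this code divides by the smoothed number of documents in the class.

```latex
p_j(c) = \frac{1 + \sum_{i \,:\, y_i = c} x_{ij}}{1 + \lvert \{ i : y_i = c \} \rvert},
\qquad
r_j = \log \frac{p_j(1)}{p_j(0)}
```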

In [16]:
x = trn_term_doc
test_x = test_term_doc

Fit our model for the `project_is_approved` binary classifier:

In [17]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
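    # dual=True is only supported by the liblinear solver, so request it explicitly
    m = LogisticRegression(C=4, dual=True, solver='liblinear')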
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r
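
To make the log-count ratio concrete, here's a small worked example on a made-up matrix with four documents and two features. The names `x_toy`, `y_toy`, `pr_toy`, `r_toy` and `x_nb_toy` are only for this sketch; `pr_toy` follows the same arithmetic as `pr` above but takes the matrix as an argument instead of reading a global.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up feature matrix: 4 documents, 2 features.
x_toy = csr_matrix(np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]))
y_toy = np.array([1, 1, 1, 0])

def pr_toy(x, y_i, y):
    # Smoothed per-feature sum for class y_i, divided by the smoothed class size.
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Class 1: (2+1)/4 = 0.75 and (1+1)/4 = 0.5; class 0: (0+1)/2 = 0.5 and (1+1)/2 = 1.0
r_toy = np.asarray(np.log(pr_toy(x_toy, 1, y_toy) / pr_toy(x_toy, 0, y_toy)))

# Scale every feature column by its log-count ratio before fitting the classifier.
x_nb_toy = csr_matrix(x_toy.multiply(r_toy))
print(r_toy)
print(x_nb_toy.toarray())
```

The first feature only appears in approved documents, so its ratio is positive; the second appears relatively more often in the rejected class, so its ratio is negative.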

In [18]:
preds1 = np.zeros((len(test), len(label_cols)))

for i, j in enumerate(label_cols):
    print('Fitting for ', j, '.')
    m,r = get_mdl(train[j])
    preds1[:,i] = m.predict_proba(test_x.multiply(r))[:,1]

print(preds1[0:20])

And finally, create the submission file.

In [19]:
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds1, columns = label_cols)], axis=1)
submission.to_csv('essay1_output.csv', index=False)

## Building the Model and Making a Submission (essay 2)

We're going to do the exact same thing we did before for the first project essay, but this time with the second project essay.
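
Since the steps are identical for every text column, they could also be wrapped in a single helper. The following is only a sketch under a few assumptions: `nbsvm_predict` is a hypothetical name, it uses sklearn's default tokenizer and fewer vectorizer options than the cells in this kernel, and the toy data is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def nbsvm_predict(train_text, test_text, y, C=4):
    """Hypothetical helper: TF-IDF -> log-count ratio -> logistic regression."""
    vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
    x = vec.fit_transform(train_text)
    test_x = vec.transform(test_text)
    y = np.asarray(y)

    def pr(y_i):
        # Smoothed per-feature sum for class y_i over the smoothed class size.
        return (x[y == y_i].sum(0) + 1) / ((y == y_i).sum() + 1)

    r = np.asarray(np.log(pr(1) / pr(0)))
    # dual=True is only supported by the liblinear solver.
    model = LogisticRegression(C=C, dual=True, solver='liblinear')
    model.fit(csr_matrix(x.multiply(r)), y)
    # Probability of the positive class for each test document.
    return model.predict_proba(csr_matrix(test_x.multiply(r)))[:, 1]

# Made-up toy data, for illustration only.
toy_train = ["books for reading", "books for class", "candy and toys", "toys and games"]
toy_labels = [1, 1, 0, 0]
toy_test = ["reading books", "games and candy"]
toy_preds = nbsvm_predict(toy_train, toy_test, toy_labels)
print(toy_preds)
```

With a helper like this, each column would need only one call, for example with `train[ESSAY2]`, `test[ESSAY2]` and `train['project_is_approved']`.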

In [20]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

In [21]:
n = train.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
               smooth_idf=1, sublinear_tf=1 )
trn_term_doc = vec.fit_transform(train[ESSAY2])
test_term_doc = vec.transform(test[ESSAY2])

In [22]:
trn_term_doc, test_term_doc

In [23]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [24]:
x = trn_term_doc
test_x = test_term_doc

In [25]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
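    # dual=True is only supported by the liblinear solver, so request it explicitly
    m = LogisticRegression(C=4, dual=True, solver='liblinear')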
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r

In [26]:
preds2 = np.zeros((len(test), len(label_cols)))

for i, j in enumerate(label_cols):
    print('Fitting for ', j)
    m,r = get_mdl(train[j])
    preds2[:,i] = m.predict_proba(test_x.multiply(r))[:,1]
    
print(preds2[0:20])

In [27]:
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds2, columns = label_cols)], axis=1)
submission.to_csv('essay2_output.csv', index=False)

## Blending Essay1 and Essay2
We use the geometric mean here because it's likely that one bad essay would weigh the other one down, and the geometric mean is pulled toward the lower of the two predictions.

In [28]:
# Element-wise geometric mean of the two essay models' predictions
essaypreds = np.sqrt(preds1 * preds2)
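
Here's a small sketch of how the geometric mean differs from a plain average. The probabilities are made up, and `p_a` and `p_b` stand in for the predictions of two models on three applications.

```python
import numpy as np

# Made-up probabilities from two hypothetical models for three applications.
p_a = np.array([0.9, 0.9, 0.5])
p_b = np.array([0.9, 0.1, 0.5])

geometric = np.sqrt(p_a * p_b)
arithmetic = (p_a + p_b) / 2
print(geometric)
print(arithmetic)
```

When the two models agree, both means give the same value. When they disagree (the second application), the geometric mean lands closer to the lower prediction than the arithmetic mean does.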

In [29]:
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(essaypreds, columns = label_cols)], axis=1)
submission.to_csv('essaysonlysubmission.csv', index=False)

## Building the Model and Making a Submission (title)

We're going to do the exact same thing we did before for the essays, but this time with the title.

In [30]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

In [31]:
n = train.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
               smooth_idf=1, sublinear_tf=1 )
trn_term_doc = vec.fit_transform(train[TITLE])
test_term_doc = vec.transform(test[TITLE])

In [32]:
trn_term_doc, test_term_doc

In [33]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [34]:
x = trn_term_doc
test_x = test_term_doc

In [35]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
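    # dual=True is only supported by the liblinear solver, so request it explicitly
    m = LogisticRegression(C=4, dual=True, solver='liblinear')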
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r

In [36]:
preds3 = np.zeros((len(test), len(label_cols)))

for i, j in enumerate(label_cols):
    print('Fitting for ', j)
    m,r = get_mdl(train[j])
    preds3[:,i] = m.predict_proba(test_x.multiply(r))[:,1]
    
print(preds3[0:20])

In [37]:
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds3, columns = label_cols)], axis=1)
submission.to_csv('title_output.csv', index=False)

## Building the Model and Making a Submission (resource summary)

In [38]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

In [39]:
n = train.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
               smooth_idf=1, sublinear_tf=1 )
trn_term_doc = vec.fit_transform(train[RESOURCES])
test_term_doc = vec.transform(test[RESOURCES])

In [40]:
trn_term_doc, test_term_doc

In [41]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [42]:
x = trn_term_doc
test_x = test_term_doc

In [43]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
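    # dual=True is only supported by the liblinear solver, so request it explicitly
    m = LogisticRegression(C=4, dual=True, solver='liblinear')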
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r

In [44]:
preds4 = np.zeros((len(test), len(label_cols)))

for i, j in enumerate(label_cols):
    print('Fitting for ', j)
    m,r = get_mdl(train[j])
    preds4[:,i] = m.predict_proba(test_x.multiply(r))[:,1]
    
print(preds4[0:20])

In [45]:
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds4, columns = label_cols)], axis=1)
submission.to_csv('resourcesummary_output.csv', index=False)

## Blending It All

In [46]:
# Simple average of the blended essay, resource summary, and title predictions
finalpreds = (essaypreds + preds4 + preds3) / 3

# essays = 71.515
# resource summary =
# title =

In [47]:
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(finalpreds, columns = label_cols)], axis=1)
submission.to_csv('submission.csv', index=False)