# Bayesian Classification

Linear regression is useful for predicting values on a continuum, but it is a poor fit when the answer is essentially a boolean. Assigning something to a category (a boolean is just the two-category case) is a task Bayesian inference is good at.

To build a spam filter, we expose it to a lot of messages labeled as spam and a lot labeled as non-spam (aka ham), and look for features present in each group to make a prediction about new data. The more labeled spam and ham a filter is exposed to, the better it should get if it's using Bayesian inference.
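
To make the idea concrete, here is a minimal sketch of Bayes' theorem applied to a single word. The probabilities are hypothetical values chosen for illustration, not figures measured from the Ling-Spam data, and `posterior_spam` is a made-up helper name.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' theorem."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message.
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence


# Hypothetical numbers: the word shows up in 20% of spam, 1% of ham,
# and half of all messages are spam.
p = posterior_spam(p_word_given_spam=0.20, p_word_given_ham=0.01, p_spam=0.5)
print(round(p, 3))
```

Under these assumed numbers, seeing the word moves the belief that a message is spam from 50% to roughly 95%. A real filter combines evidence like this across many words.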

Let's classify a message as spam or ham, and build a belief system.

In [3]:
import glob

# glob.glob returns a list of the file paths that match a wildcard pattern.
# We want to loop over all the training files and give their contents to our filter,
# which means reading in a bunch of files.

def read_ling_spam(directory):
    files = glob.glob('ling-spam/{}/*.txt'.format(directory))
    texts = []
    for file in files:
        with open(file) as f:
            texts.append(f.read())
    return texts

In [4]:
glob.glob('ling-spam/nonspam-train/*.txt')[:5]

['ling-spam/nonspam-train/3-380msg4.txt',
 'ling-spam/nonspam-train/3-384msg0.txt',
 'ling-spam/nonspam-train/3-385msg1.txt',
 'ling-spam/nonspam-train/3-385msg2.txt',
 'ling-spam/nonspam-train/3-390msg3.txt']
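
As a small, self-contained illustration of the pattern matching used above, the sketch below creates a few throwaway files in a temporary directory and globs for the `.txt` ones. The file names are hypothetical. The `sorted` call is an addition: `glob.glob` does not promise any particular order, so sorting keeps the result reproducible.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create two text files and one file that should not match the pattern.
    for name in ["a.txt", "b.txt", "notes.md"]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("example text for " + name)

    # Only paths matching the wildcard pattern are returned.
    matches = sorted(glob.glob(os.path.join(tmp, "*.txt")))
    names = [os.path.basename(path) for path in matches]

print(names)
```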

In [9]:
spam_train = read_ling_spam('spam-train')
ham_train = read_ling_spam('nonspam-train')
spam_test = read_ling_spam('spam-test')
ham_test = read_ling_spam('nonspam-test')



In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer takes text and produces a matrix of counts for the words in the text

In [8]:
vectorizer = CountVectorizer()
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [10]:
vectorizer.fit(spam_train + ham_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
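
What fitting and transforming do can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary`, and `transform` are hypothetical, the two documents are made up, and only the lowercasing and the `token_pattern` shown in the output above are mimicked.

```python
import re
from collections import Counter

# Same pattern as the token_pattern above: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(texts):
    """Learn the vocabulary: map each distinct word to a column index."""
    words = sorted({tok for text in texts for tok in tokenize(text)})
    return {word: idx for idx, word in enumerate(words)}


def transform(texts, vocabulary):
    """Turn each text into a row of counts, one column per vocabulary word."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
        # The dict was built in sorted order, so iterating it follows column order.
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


docs = ["Free money, free prizes", "Meeting notes for the money committee"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

Words that were not seen during fitting have no column, so `transform` silently ignores them. One-character tokens such as "a" are dropped by the pattern.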

In [28]:
# .fit (above) learned the vocabulary: one column per distinct word in the training text.
# .transform turns each message into a row of word counts over that vocabulary.

X_train = vectorizer.transform(ham_train + spam_train)
y_train = ['ham'] * len(ham_train) + ['spam'] * len(spam_train)
X_test = vectorizer.transform(ham_test + spam_test)
y_test = ['ham'] * len(ham_test) + ['spam'] * len(spam_test)


In [17]:
row = X_train.getrow(0)

In [18]:
row.nonzero()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0], dtype=int32),
 array([  846,  1031,  1648,  1954,  3368,  4912,  5194,  6450,  7119,
         7276,  7349,  8248,  8409,  9455, 10903, 12328, 12752, 13076,
        13405, 13912, 15527, 15711, 15777, 16379, 16792, 17055, 18708], dtype=int32))

In [24]:
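# row.nonzero() returns (row_indices, col_indices); iterate over the column indices only.
# (Indexing the feature-name list with [1, col] raises "TypeError: list indices must be integers, not tuple".)
for col in row.nonzero()[1]:
    print(vectorizer.get_feature_names()[col], X_train[0, col])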

In [25]:
vectorizer.get_feature_names()[846]

'anyone'

In [27]:
y_train

['ham',
 'ham',
 'ham',
 ...


In [29]:
X_train

<700x19073 sparse matrix of type '<class 'numpy.int64'>'
	with 110609 stored elements in Compressed Sparse Row format>

`X_train` is a _sparse matrix_, meaning it stores only the non-zero values to save memory. If most of the elements of a matrix are expected to be zero, a sparse matrix is a good data structure. If most of the elements are non-zero, the matrix is considered "dense," and a sparse matrix is not a good data structure.

`y_train` is a list with one label per message, either ham or spam. Its indexes match the rows of `X_train`, so row `i` of `X_train` and `y_train[i]` describe the same message.
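
The arithmetic behind the sparse-storage trade-off can be checked directly from the shape and stored-element count reported for `X_train` above. The toy row and the `lookup` helper below are hypothetical. The dict is only an illustration of skipping zeros: the Compressed Sparse Row format itself uses index arrays rather than a dict.

```python
# Shape and stored-element count reported for X_train above.
n_rows, n_cols, n_stored = 700, 19073, 110609

# Fraction of entries that are non-zero.
density = n_stored / (n_rows * n_cols)
print("fraction of non-zero entries: {:.4f}".format(density))

# Hypothetical toy row: a dense list versus a {column: count} mapping.
dense_row = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
sparse_row = {col: value for col, value in enumerate(dense_row) if value != 0}
print(sparse_row)


def lookup(sparse, col):
    """Any column that is not stored is implicitly zero."""
    return sparse.get(col, 0)


print(lookup(sparse_row, 2), lookup(sparse_row, 5))
```

Fewer than 1% of the entries in `X_train` are non-zero, which is why storing only those entries pays off here.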

In [30]:
# get_feature_names lets us see what words correspond to which column in the sparse matrix

vectorizer.get_feature_names()[1000:1010]

['armidale',
 'armin',
 'armstrong',
 'army',
 'arnagardur',
 'arnhem',
 'arnold',
 'aron',
 'aronoff',
 'around']

In [31]:
len(spam_train+ham_train)

700

In [32]:
len(vectorizer.get_feature_names())

19073

In [34]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

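The Multinomial Naive Bayes classifier relates our rows (messages) to our columns (word counts): for each class it estimates how likely each word is from the counts in that class's training messages, then uses Bayes' theorem to pick the most probable class for a new message, treating the words as independent of one another.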

In [35]:
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
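
For intuition about what fitting a multinomial Naive Bayes model involves, here is a from-scratch sketch in plain Python. It is an illustration under stated assumptions, not scikit-learn's code: the function names and the four tiny documents are hypothetical, and the additive smoothing value mirrors the `alpha=1.0` shown in the output above.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs is a list of token lists; labels holds one class name per doc."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior = {}
    log_likelihood = {}
    for c in classes:
        class_docs = [doc for doc, label in zip(docs, labels) if label == c]
        # Prior: the share of training documents that belong to this class.
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        # Additive smoothing keeps unseen words from getting probability zero.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


docs = [
    ["free", "money", "now"],
    ["free", "prize", "money"],
    ["meeting", "notes", "attached"],
    ["conference", "notes", "deadline"],
]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
prediction = predict(["free", "money"], log_prior, log_likelihood)
print(prediction)
```

Working in log space turns the product of many small word probabilities into a sum, which avoids numerical underflow on long messages.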

In [36]:
classifier.score(X_train, y_train)

0.99857142857142855

# Ooo! Ahh! Holy Bayes!

So it's really accurate... for the data we trained it on...

We can't judge accuracy based on training data. We need to use our test data, which wasn't part of our fit, to really judge accuracy.

Let's check our test data.

In [37]:
classifier.score(X_test, y_test)

0.97692307692307689

Here the classifier is making inferences about data it hasn't seen.
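
For a classifier, the score reported above is the mean accuracy: the fraction of messages whose predicted label matches the true label. A minimal sketch of that calculation follows, with a hypothetical `accuracy` helper and made-up labels.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels for five held-out messages.
predicted = ["ham", "ham", "spam", "spam", "ham"]
actual = ["ham", "spam", "spam", "spam", "ham"]
result = accuracy(predicted, actual)
print(result)
```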

In [40]:
# message = ['hi how are you i have a loan for you to consider today my friend is a prince has large sum of money']

message = ['i think the natural language processing features of scikit are very useful, and i have had students tell me the same']

classifier.predict(vectorizer.transform(message))

array(['ham'], 
      dtype='<U4')

# Pipelines

Pipelines make it simple for us to do a lot of transformations on data.

With the above example, we had to:
* fit the vectorizer to the source data
* transform the source data into count vectors
* fit our MultinomialNB classifier to those vectors

You can create a pipeline to do all that in one step.

In [50]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

pipeline_map = [('bag_of_words', CountVectorizer()),
                #('tfidf', TfidfTransformer()),
                ('bayes', MultinomialNB())]

# Conceptually, the pipeline feeds the output of each step into the next:
# text -> CountVectorizer -> (optionally TfidfTransformer) -> MultinomialNB

In [51]:
pipeline = Pipeline(pipeline_map)

In [52]:
# now we need to get this data into our pipeline
pipeline.fit(ham_train + spam_train, y_train)

Pipeline(steps=[('bag_of_words', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('bayes', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

Now we can pass our test data into the pipeline and see what it predicts.

In [55]:
res = pipeline.predict(ham_test)

In [56]:
list(res).count('ham')

126

In [57]:
pipeline.score(ham_test + spam_test, y_test)

0.97692307692307689
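
To show the chaining idea on its own, here is a simplified sketch of a pipeline in plain Python. `SimplePipeline`, `WordSets`, and `OverlapClassifier` are hypothetical toy classes written for this illustration, not scikit-learn's implementation. They only follow the same convention: every step except the last has `fit` and `transform`, and the last step has `fit` and `predict`.

```python
class WordSets:
    """Toy transformer: turn each text into a set of lowercase words."""

    def fit(self, X, y=None):
        return self  # nothing to learn for this step

    def transform(self, X):
        return [set(text.lower().split()) for text in X]


class OverlapClassifier:
    """Toy estimator: predict the label whose training words overlap most."""

    def fit(self, X, y):
        self.words_by_label = {}
        for words, label in zip(X, y):
            self.words_by_label.setdefault(label, set()).update(words)
        return self

    def predict(self, X):
        return [
            max(self.words_by_label,
                key=lambda label: len(words & self.words_by_label[label]))
            for words in X
        ]


class SimplePipeline:
    """Run each transformer in order, then hand the result to the last step."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


pipe = SimplePipeline([("words", WordSets()), ("overlap", OverlapClassifier())])
pipe.fit(["free money", "meeting notes"], ["spam", "ham"])
predictions = pipe.predict(["FREE prizes", "notes attached"])
print(predictions)
```

The caller hands over raw text for both fitting and prediction, and the pipeline applies the same transformations in the same order each time.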

## Links

* [Ling-Spam dataset](http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html)
  * Preprocessed
* [Bag of Words](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
  * CountVectorizer
    * binary=True
* [Multinomial vs Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)
* [20 Newsgroups](http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset)