* [Ling-Spam dataset](http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html)
  * Preprocessed
* [Bag of Words](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
  * CountVectorizer
    * binary=True
* [Multinomial vs Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)
* [20 Newsgroups](http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset)

# Bayesian Classification

So far we've covered linear regression, which is handy for predicting a continuous value Y from one or more input variables X. But we don't always want to predict a number - sometimes we want to find out which category something falls into. Spam filters do this all the time.

The way it works is that your spam filter has seen a lot of spam - more than any human should have to - and it's been told which messages are spam so it can learn to find more. Doing this feeds its hunger for more spam. The more labeled spam it reads, the better it gets.

This is a Bayesian method of classification: the filter strengthens (or weakens) its belief that something is true (or false) as it sees more data.

Take our coin flip for example. With a fair coin we may get somewhat crazy results from a tiny data set, but with millions of flips we can accurately estimate the probability of a heads or tails outcome.
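
The coin-flip intuition can be sketched with a quick simulation. This is only an illustration: the helper name `estimate_heads_probability` and the fixed seed are assumptions made for this example, not part of the original notebook.

```python
import random

def estimate_heads_probability(num_flips, seed=0):
    """Flip a simulated fair coin num_flips times and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# Small samples can land far from 0.5; large samples settle close to it
for n in (10, 1_000, 100_000):
    print(n, estimate_heads_probability(n))
```

The estimate from ten flips can be well off, while the estimate from a hundred thousand flips should sit very close to 0.5.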

In our case, we want to classify a message as spam or not spam given a strong history of learning what spam looks like.
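
To see how a belief gets updated, here is a minimal sketch of Bayes' rule for a single word. The probabilities are made-up numbers chosen for illustration, and `posterior_spam` is a hypothetical helper, not something from scikit-learn.

```python
def posterior_spam(p_word_given_spam, p_word_given_non_spam, p_spam):
    """Return P(spam | word) for a two-class problem using Bayes' rule."""
    p_non_spam = 1 - p_spam
    # Total probability of seeing the word in any message
    evidence = p_word_given_spam * p_spam + p_word_given_non_spam * p_non_spam
    return p_word_given_spam * p_spam / evidence

# Assumed numbers: the word shows up in 80% of spam and 10% of non-spam,
# and half of all messages are spam
print(posterior_spam(0.8, 0.1, 0.5))
```

With these assumed numbers, seeing the word moves our belief that the message is spam from 50% to roughly 89%.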

In [4]:
import glob

The handy function below will help us read in our data quickly. Since we have **a lot** of spam (and not spam), we can train our prediction models well. We'll load our training spam/non-spam files as well as our testing spam/non-spam files.

In [5]:
def read_ling_spam(directory):
    files = glob.glob("ling-spam/{}/*.txt".format(directory))
    texts = []
    for file in files:
        with open(file) as fh:
            texts.append(fh.read())
    return texts

spam_train = read_ling_spam("spam-train")
non_spam_train = read_ling_spam("nonspam-train")
spam_test = read_ling_spam("spam-test")
non_spam_test = read_ling_spam("nonspam-test")

A count vectorizer tokenizes a corpus of text and produces a matrix of word counts, with one row per document and one column per word in the vocabulary.

Read more on `CountVectorizer` [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
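
As a rough illustration of what the vectorizer produces, here is a sketch on a tiny made-up corpus (the three sentences and the `toy_` names are assumptions for this example only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up corpus, just to show the shape of the output
toy_corpus = [
    "buy cheap pills now",
    "meeting notes for the linguistics seminar",
    "cheap cheap pills",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index in the count matrix
vocab = toy_vectorizer.vocabulary_
print(toy_counts.shape)                 # (documents, distinct words)
print(toy_counts[2, vocab["cheap"]])    # how often "cheap" appears in the third document
```

Each row is one document and each column is one word, so the third row holds a count of 2 in the "cheap" column.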

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

Instantiate our vectorizer the same way we instantiated our Linear Regression model before.

<!---
vectorizer = CountVectorizer()
vectorizer
--->

In [7]:
vectorizer = CountVectorizer()

'want best economical hunt vacation life want best hunt camp vacation life felton s hunt camp wild wonderful west virginium per day pay room three home cook meal pack lunch want stay wood noon cozy accomodation reserve space follow season book buck season nov dec doe season announce please call muzzel loader deer dec dec archery deer oct dec turkey sesson oct nov e mail us compuserve com '

Call the `.fit()` method with our combined non-spam and spam training data.

<!---
vectorizer.fit(non_spam_train + spam_train)
--->

In [11]:
vectorizer.fit(spam_train + non_spam_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

... and transform our training data and testing data into count matrices.

<!---
X_train = vectorizer.transform(non_spam_train + spam_train)
y_train = ['non-spam'] * len(non_spam_train) + ['spam'] * len(spam_train)
X_test = vectorizer.transform(non_spam_test + spam_test)
y_test = ['non-spam'] * len(non_spam_test) + ['spam'] * len(spam_test)
--->

In [12]:
X_train = vectorizer.transform(non_spam_train + spam_train)
y_train = ['non-spam'] * len(non_spam_train) + ['spam'] * len(spam_train)
X_test = vectorizer.transform(non_spam_test + spam_test)
y_test = ['non-spam'] * len(non_spam_test) + ['spam'] * len(spam_test)

Our `X_train` is what is called a `sparse matrix`:

> In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense. The fraction of zero elements over the total number of elements in a matrix is called the sparsity (density). 

`y_train` labels each message as spam or non-spam. Think of each entry as the label for the message at the same row index in the `X_train` sparse matrix.
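
To make the sparsity idea concrete, here is a small sketch on a made-up 3x3 matrix. The values and the `sparsity` helper are illustrative assumptions, not part of the notebook. The density is simply one minus the sparsity.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny made-up matrix: only 2 of its 9 entries are nonzero
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [1, 0, 0]])
sparse = csr_matrix(dense)

def sparsity(matrix):
    """Fraction of entries that are zero in a scipy sparse matrix."""
    rows, cols = matrix.shape
    # nnz is the number of stored (nonzero) entries
    return 1.0 - matrix.nnz / (rows * cols)

print(sparsity(sparse))
```

The same helper could be applied to `X_train` to see how much of the count matrix is empty.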

Let's look at a few of the feature names by slicing the list we get back from `vectorizer.get_feature_names()` (in scikit-learn 1.0 and later, use `get_feature_names_out()` instead).

<!---
vectorizer.get_feature_names()[1000:1005]
--->

In [24]:
vectorizer.get_feature_names()[1000:1010]

['armidale',
 'armin',
 'armstrong',
 'army',
 'arnagardur',
 'arnhem',
 'arnold',
 'aron',
 'aronoff',
 'around']

If we want to compare the number of training messages with the number of features our vectorizer has created for us, we can print both lengths.

Notice the big difference between the two? We have far more features (words) than messages.

<!---
print(len(spam_train + non_spam_train), len(vectorizer.get_feature_names()))
--->

In [25]:
print(len(spam_train + non_spam_train), len(vectorizer.get_feature_names()))


700 19073


Enter stage right, our Multinomial Naive Bayes Classifier!

In [26]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

Fit the classifier to our `X_train` features and `y_train` labels.

<!---
classifier.fit(X_train, y_train)
--->

In [28]:
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
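
To give a feel for what a multinomial Naive Bayes classifier does with word counts, here is a minimal from-scratch sketch with add-one smoothing (matching the `alpha=1.0` shown above). It is a simplified illustration under assumptions, not scikit-learn's actual implementation, and the tiny corpus and the `train_nb`/`predict_nb` helpers are made up for this example.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Count words per class and documents per class."""
    vocab = {word for doc in docs for word in doc.split()}
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    return vocab, word_counts, class_counts

def predict_nb(model, doc, alpha=1.0):
    """Pick the class with the highest log prior plus smoothed log likelihood."""
    vocab, word_counts, class_counts = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in doc.split():
            if word not in vocab:
                continue  # ignore words never seen during training
            smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

toy_docs = ["cheap pills buy now", "buy cheap watches now",
            "linguistics seminar schedule", "seminar notes attached"]
toy_labels = ["spam", "spam", "non-spam", "non-spam"]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, "cheap pills"))
print(predict_nb(toy_model, "seminar schedule"))
```

The smoothing term keeps a word that was never seen in one class from driving that class's probability to zero.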

Now let's get a score.

<!---
classifier.score(X_train, y_train)
--->

In [29]:
classifier.score(X_train, y_train)

0.99857142857142855

# HOLY BAYES!!!

Oh wait a second, what did we do wrong? <sub>Did you get it?<sub> are you sure? </sub></sub>

Yup, you probably did: we scored our classifier on the same training data it had already seen. WRONNNNNNG!

Let's try again, but this time let's pass our TEST data sets into our classifier and see how we do.

<!---
classifier.score(X_test, y_test)
--->

In [30]:
classifier.score(X_test, y_test)

0.97692307692307689

Not bad at all! Roughly 97.7% accuracy for spam prediction on messages the classifier has never seen.
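
For a classifier, `score` reports accuracy: the fraction of predictions that match the true labels. A minimal sketch of that calculation, using a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up predictions and labels: three of the four match
toy_predicted = ["spam", "non-spam", "spam", "spam"]
toy_actual = ["spam", "non-spam", "non-spam", "spam"]
print(accuracy(toy_predicted, toy_actual))
```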

We can also pass a brand-new message to the classifier, after transforming it with the same vectorizer, and it will give us an answer - though remember, 97% accuracy still leaves room for the occasional mistake.

<!---
classifier.predict(X_test[0])
--->

In [46]:
message = ["Hi, Bekk how are you?  I am great did you get that message I sent you?"]
transformed_message = vectorizer.transform(message)

classifier.predict(transformed_message)


array(['spam'], 
      dtype='<U8')
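
A single label hides how confident the model is. As a sketch, `predict_proba` returns one probability per class, in the order given by `classes_`. The toy corpus and the `toy_` names below are made up so the example is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data, just large enough to fit a model
toy_texts = ["cheap pills buy now", "buy cheap watches now",
             "linguistics seminar schedule", "seminar notes attached"]
toy_labels = ["spam", "spam", "non-spam", "non-spam"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# One row of probabilities per message, one column per class
probabilities = toy_classifier.predict_proba(
    toy_vectorizer.transform(["cheap pills"]))[0]
by_class = dict(zip(toy_classifier.classes_, probabilities))
print(by_class)
```

Looking at the probabilities for a surprising prediction is a quick way to tell a confident mistake from a coin toss.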

# [Pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [47]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

A pipeline chains a series of transformation steps and a final model so they can be fit and applied as a single object.

As you may have noticed above, we had to do a few steps before we were able to score our prediction model:

 - We had to vectorize the corpus
 - We had to transform the vectorized data
 - We had to fit our Multinomial Naive Bayes classifier to our data
 
What if I told you there was a way you could create a pipeline for all of those to be done in one step?

In [48]:
pipeline_map = [('bag_of_words', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('bayes', MultinomialNB())]

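def pipeline(x, y):
    # Conceptually, a pipeline feeds each step's output into the next step
    counts = CountVectorizer().fit_transform(x)
    tfidf = TfidfTransformer().fit_transform(counts)
    return MultinomialNB().fit(tfidf, y)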
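
The `tfidf` step re-weights the raw counts so that words appearing in many documents count for less, and with its default settings it scales each document's row to unit length. A small sketch on a made-up corpus (the `toy_` names are assumptions for this example):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# A made-up corpus, just to show what the transformer does to counts
toy_corpus = ["cheap pills cheap", "seminar notes", "cheap seminar"]

toy_counts = CountVectorizer().fit_transform(toy_corpus)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

# Same shape as the count matrix, but each row now has L2 norm 1
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()
print(toy_tfidf.shape, row_norms)
```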

First let's pass our `pipeline_map` to the `Pipeline` constructor.

<!---
spam_pipe = Pipeline()
--->

In [49]:
pipeline = Pipeline(pipeline_map)

We still need our `y_train` data which, if you remember, holds our spam/non-spam labels.

Let's fit the pipeline on our combined spam/non-spam training texts and our `y_train` labels.

<!---
spam_pipe.fit(non_spam_train + spam_train, y_train)
--->

In [50]:
pipeline.fit(spam_train + non_spam_train, y_train)

Pipeline(steps=[('bag_of_words', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
     ...ear_tf=False, use_idf=True)), ('bayes', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
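
Once fitted, a pipeline accepts raw text directly, and each step remains reachable by name through `named_steps`. Here is a self-contained sketch on a made-up corpus (the `toy_` names and texts are assumptions for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data; texts and labels are listed in the same order
toy_texts = ["cheap pills buy now", "buy cheap watches now",
             "linguistics seminar schedule", "seminar notes attached"]
toy_labels = ["spam", "spam", "non-spam", "non-spam"]

toy_pipe = Pipeline([('bag_of_words', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('bayes', MultinomialNB())])
toy_pipe.fit(toy_texts, toy_labels)

# Raw strings go in; the pipeline vectorizes, re-weights and classifies them
predictions = toy_pipe.predict(["cheap pills", "seminar schedule"])
print(predictions)

# Individual steps can be inspected by the names given in the step list
vocabulary_size = len(toy_pipe.named_steps['bag_of_words'].vocabulary_)
print(vocabulary_size)
```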

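And after all of that, we can pass our raw `non_spam_test` messages straight into the pipeline and get predictions back. Watch the labels, though: the pipeline above was fit on `spam_train + non_spam_train`, while `y_train` lists the non-spam labels first, so the class names are effectively swapped and genuine non-spam messages mostly come back as `'spam'`. Passing the texts in the same order as the labels (`non_spam_train + spam_train` when fitting, `non_spam_test + spam_test` when scoring) keeps each label attached to the right class.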

<!---
spam_pipe.predict(non_spam_test)
--->

In [53]:
pipeline.predict(non_spam_test)

array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'non-spam', 'spam', 'spam', 'spam', 'spam', 'non-spam', 'non-spam',
       'spam', 'spam', 'spam', 'non-spam', 'non-spam', 'spam', 'spam',
       'non-spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'non-spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'non-spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam'

And by scoring our combined spam/non-spam test sets against our `y_test` labels (with the test texts in the same order that was used for fitting) we get...

<!---
spam_pipe.score(non_spam_test + spam_test, y_test)
--->

In [55]:
pipeline.score(spam_test + non_spam_test, y_test)

0.96923076923076923
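
Because the texts and labels in this walkthrough are built in separate expressions, their order has to be kept in sync by hand. One way to reduce that risk, sketched here with a hypothetical `build_dataset` helper that is not part of the original notebook, is to build both lists in one place:

```python
def build_dataset(non_spam_texts, spam_texts):
    """Return texts and labels in one matching order: non-spam first, then spam."""
    texts = list(non_spam_texts) + list(spam_texts)
    labels = ['non-spam'] * len(non_spam_texts) + ['spam'] * len(spam_texts)
    return texts, labels

# Made-up messages standing in for the real training files
toy_texts, toy_labels = build_dataset(
    ["seminar notes attached"],
    ["cheap pills now", "buy watches"],
)
print(list(zip(toy_texts, toy_labels)))
```

The same helper could be called once for the training folders and once for the test folders, so that fitting and scoring always see texts and labels in the same order.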