### Text Classification Using spaCy and scikit-learn

In this exercise you will use spaCy to preprocess the documents, then use scikit-learn's Naive Bayes classifier to categorize news stories by topic.

The main goal is to explore how features of the spaCy text analysis pipeline can produce good term features for the document classifier.

The directory *documents* contains BBC news articles, each about one of these topics: business, entertainment, politics, sport, and tech. The name of the subfolder under *documents* that holds an article is its true category. You can assume that every document is "about" exactly one category.

In [1]:
# Function from the lab to train a classifier and print the cross-validation accuracy score

from sklearn.model_selection import cross_val_score

def eval_classifier(clf, X, y):
    # Fit on the full data so clf is left trained after the call;
    # cross_val_score works on its own clones of clf.
    clf.fit(X, y)
    mean_accuracy = cross_val_score(clf, X, y, cv=10).mean()
    print(f"Results for {clf}")
    print(f"Cross validation mean accuracy: {mean_accuracy}")


In [2]:
# Function to help in loading the training set
#
# Argument is a directory with one subdirectory per topic, and the subdirectory name is 
# the topic name.  Each subdirectory contains text files, which are articles that are "about" that topic.
# 
# Return value from this function is a list of the form [(document_text, topic), ... ]   where document_text is a string
# containing the full document content, and topic is the topic label (a string)

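import os

def load_docs(dir="documents"):
    docs = []
    # Each immediate subdirectory of dir is a topic; its name is the label
    for topic in sorted(os.listdir(dir)):
        topic_dir = os.path.join(dir, topic)
        if not os.path.isdir(topic_dir):
            continue
        for name in sorted(os.listdir(topic_dir)):
            path = os.path.join(topic_dir, name)
            if os.path.isfile(path):
                # Close each file as soon as it has been read
                with open(path) as f:
                    docs.append((f.read(), topic))
    return docs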


In [3]:
import spacy

# Inputs are (1) the list of (document_text, topic) pairs returned by load_docs, and
#            (2) a tokenizer function that takes a document string, and returns a list 
#                of term strings.  
# Return value from this function is [(list_of_term_strings, topic), ...]

def tokenize_docs(docs, tokenizer):
    return [(tokenizer(t[0]), t[1]) for t in docs]

###################
#  Just chains together load_docs and tokenize_docs

def load_and_tokenize_docs(dir, tokenizer):
    return tokenize_docs(load_docs(dir), tokenizer)

For example, the call below applies a very simple tokenizer, tokenize_text, that just splits
the input using spaCy's default tokenization rules.  Notice this is not what a good tokenizer
should do, since it does not case fold and does not handle punctuation in any reasonable way.

<pre>
load_and_tokenize_docs("documents", tokenize_text)

[(['Call', 'centre', 'users', "'", 'lose', 'patience', "'", '\n\n', ... ], 'business'), ... ]
</pre>
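
The `tokenize_text` function used above is not defined in this notebook. A minimal sketch of what it might look like follows. It assumes a blank English pipeline (`spacy.blank("en")`), which provides only the tokenizer and needs no model download; the pipeline used in the lab may differ.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because only the
# tokenizer is needed. A trained pipeline loaded with spacy.load would also
# work, but it runs extra components that this tokenizer ignores.
nlp = spacy.blank("en")

def tokenize_text(text):
    # One term per spaCy token, with no case folding and no filtering
    return [token.text for token in nlp(text)]

tokens = tokenize_text("Call centre users 'lose patience'")
print(tokens)
```

Because the function returns the raw token text, capitalized words and punctuation marks end up as separate vocabulary terms, which is the weakness described above.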
  


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# If you use spaCy to preprocess the input documents but are using scikit-learn
# to do the classification, you need a vectorizer that skips the preprocessing steps.
# This function returns a vectorizer with no-ops for tokenizer and preprocessor.  

def dummyCountVectorizer():
    def dummy(doc):
        return doc
    return CountVectorizer(tokenizer=dummy, preprocessor=dummy)  
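
To see what the no-op vectorizer does, here is a small sketch that feeds it already-tokenized documents. The token lists and labels are made up for illustration, standing in for the output of `load_and_tokenize_docs`, and the function definition is repeated only so the block runs on its own. Depending on your scikit-learn version, you may see a harmless warning that `token_pattern` is unused.

```python
from sklearn.feature_extraction.text import CountVectorizer

def dummyCountVectorizer():
    # Same definition as in the cell above, repeated so this sketch runs alone
    def dummy(doc):
        return doc
    return CountVectorizer(tokenizer=dummy, preprocessor=dummy)

# Made-up stand-in for the output of load_and_tokenize_docs
tokenized_docs = [
    (["Call", "centre", "users", "lose", "patience"], "business"),
    (["users", "praise", "new", "phone", "users"], "tech"),
]

# Split the (tokens, label) pairs into two parallel sequences
token_lists, labels = zip(*tokenized_docs)

vectorizer = dummyCountVectorizer()
X = vectorizer.fit_transform(token_lists)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Note that "Call" keeps its capital letter in the vocabulary: supplying a custom preprocessor means the vectorizer no longer lowercases anything, so any case folding has to happen in your own tokenizer.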

First, establish a baseline classification accuracy without using spaCy at all: use the scikit-learn preprocessor and tokenizer, as we did in lab, with a "real" CountVectorizer from scikit-learn and a Multinomial Naive Bayes classifier.

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Loads documents, instantiates a count vectorizer and Multinomial Naive Bayes
# classifier, trains the model and calls eval_classifier to get the cross_val accuracy
def baseline_accuracy(dir="documents"):
    # YOUR CODE HERE

For example,

<pre>
baseline_accuracy()

Results for MultinomialNB()
Cross validation mean accuracy: 0.89
</pre>
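
As a reminder of how the baseline pieces fit together, here is a rough sketch on a made-up miniature corpus. The sentences, labels, and `cv=2` are assumptions chosen so the block runs on its own; they are not the BBC data, and the score it prints says nothing about the real task.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature corpus standing in for load_docs("documents")
docs = [
    ("Shares rose as the bank reported record profits", "business"),
    ("The bank cut interest rates to support the market", "business"),
    ("Profits fell and the market closed lower", "business"),
    ("Investors sold shares after the profits warning", "business"),
    ("The striker scored twice in the cup final", "sport"),
    ("The team won the match with a late goal", "sport"),
    ("The coach praised the team after the final", "sport"),
    ("A late goal decided the cup match", "sport"),
]
texts = [text for text, _ in docs]
labels = [label for _, label in docs]

# The default CountVectorizer lowercases and tokenizes the raw strings itself
X = CountVectorizer().fit_transform(texts)

# cv=2 only because the toy corpus is tiny; eval_classifier uses cv=10
scores = cross_val_score(MultinomialNB(), X, labels, cv=2)
print(scores.mean())
```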


Next, try the simplest possible spaCy tokenizer: use the default splitting pattern, then just use the text of each token as the term (that is the tokenizer in the example above).

In [6]:
# Argument is the location of the training data.  Use a very simple spaCy tokenizer, and train a MultinomialNB model.
# Function calls eval_classifier, which prints the cross-val accuracy

def basic_spacy_accuracy(dir="documents"):
    # YOUR CODE HERE


For example, the result below is surprising: even with a VERY crude tokenizer, accuracy is pretty good, though not as good as the scikit-learn baseline.
<pre>
basic_spacy_accuracy()

Results for MultinomialNB()
Cross validation mean accuracy: 0.8699999999999999
</pre>

Now use additional spaCy pipeline features to improve your tokenizer, for example: case folding, removing punctuation, lemmatization, and stop word removal. This function will be the same as basic_spacy_accuracy, except that it uses a different tokenizer.
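
One possible shape for such a tokenizer is sketched below. The name `better_tokenize` is hypothetical, and the blank pipeline is an assumption made so the block runs without a model download. A blank pipeline has no lemmatizer, so the sketch falls back to the token text when no lemma is available.

```python
import spacy

# Assumption: a blank pipeline so the sketch runs without a model download.
# It has no lemmatizer, so token.lemma_ may be empty here; a trained
# pipeline loaded with spacy.load would fill it in.
nlp = spacy.blank("en")

def better_tokenize(text):
    terms = []
    for token in nlp(text):
        # Drop punctuation, whitespace, and stop words
        if token.is_punct or token.is_space or token.is_stop:
            continue
        # Prefer the lemma when one is available, then case fold
        terms.append((token.lemma_ or token.text).lower())
    return terms

terms = better_tokenize("The markets rallied, and shares rose!")
print(terms)
```

Which of these filters actually help is an empirical question, so it is worth switching them on and off one at a time and comparing the cross-validation scores.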

In [9]:
# Function is identical to basic_spacy_accuracy, but uses a better tokenizer

def better_spacy_accuracy(dir="documents"):
    # YOUR CODE HERE

For example, a better tokenizer gives some improvement! It is also not surprising that this value is essentially the same as the scikit-learn baseline, as the vectorizers are almost identical.
<pre>
better_spacy_accuracy()

Results for MultinomialNB()
Cross validation mean accuracy: 0.8866666666666667
</pre>

Finally, use the result of the Named Entity Recognition to try to improve the classifier.  

Some ideas to start:
* If your tokenizer generates a "constant" token like TOKEN_MONEY for every piece of the document that was classified as entity type MONEY, the token signals that there was a money amount in the text, which is useful, without specifying the amount. Leaving out the amount itself is probably a good thing to do. There are several similar entity types, like CARDINAL, PERCENT, and QUANTITY, for which including the entity token but not the actual document text is probably a good idea.
* For other entities like ORG or GPE, it's probably useful to insert a "constant" token like TOKEN_GPE to signify that there was a location name in the text, but it might also be useful to include the document token(s) as well.

But please experiment with various ideas about how to incorporate the NER entities into the document's token stream.
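
As a starting point, the two ideas above might be sketched like this. The function name `ner_terms` and the two label sets are hypothetical choices. The demo document is labeled by hand so the block runs without a trained model; in the exercise, the entities would come from the NER component of a trained spaCy pipeline.

```python
import spacy
from spacy.tokens import Doc, Span

# Entity types replaced entirely by a constant token (the text is dropped)
MASKED = {"MONEY", "CARDINAL", "PERCENT", "QUANTITY"}
# Entity types that get a constant token AND keep their original text
MARKED = {"ORG", "GPE"}

def ner_terms(doc):
    terms = []
    for token in doc:
        label = token.ent_type_
        starts_entity = token.ent_iob_ == "B"
        if label in MASKED:
            # Emit one constant token per entity and skip the entity's words
            if starts_entity:
                terms.append(f"TOKEN_{label}")
            continue
        if label in MARKED and starts_entity:
            terms.append(f"TOKEN_{label}")
        if not (token.is_punct or token.is_space):
            terms.append(token.text.lower())
    return terms

# Hand-labeled demo document (an assumption for illustration only)
nlp = spacy.blank("en")
words = ["Apple", "paid", "$", "5", "million", "in", "London"]
doc = Doc(nlp.vocab, words=words)
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 2, 5, label="MONEY"),
    Span(doc, 6, 7, label="GPE"),
]

terms = ner_terms(doc)
print(terms)
```

Moving a label from one set to the other, or dropping it from both, is a quick way to test whether a given entity type helps the classifier.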

In [11]:
# Function is identical to basic_spacy_accuracy and better_spacy_accuracy, but uses a tokenizer that includes 
# NER entities

def ner_spacy_accuracy(dir="documents"):
    ## YOUR CODE HERE

For example.  Tragedy!  For me, NER did not improve things at all!   But I'm sure you will do better!
<pre>
ner_spacy_accuracy()

Results for MultinomialNB()
Cross validation mean accuracy: 0.8866666666666667
</pre>