## Document Classification using Machine Learning

It’s pretty safe to assume that for an organization called "Library Carpentry", classification of text documents is not a novel idea.  However, the massive increase in the volume of new information and data presents new challenges.  Classification is more essential than ever, but the sheer volume of information makes it impractical to do manually.  

A new set of techniques, derived from machine learning, makes it possible to train a computer to recognize patterns in text documents (and other types of data) and assign new documents to categories based on those patterns.  Generally speaking, machine learning for document classification falls into two broad categories: supervised and unsupervised.  In supervised learning, a machine is trained to recognize and assign documents to predetermined categories.  In unsupervised learning, a machine extracts categories from the documents on its own, without being given them in advance.  

In this tutorial, we’ll be applying machine learning techniques for *supervised* learning.  We will use a training set of documents, which have already been sorted into categories, to teach a computer to recognize which documents belong in a category, and which don’t.  We will then apply the trained system to classify new documents that weren’t part of the training set.  

To illustrate these techniques, we’ll train a computer to determine whether a movie review is positive or negative.  We’ll use a set of existing movie reviews, some positive, some negative, to train a computer to analyze text and figure out which words or combinations of words are commonly associated with good or bad reviews.  Once the computer has been trained to recognize a good or bad movie based on reviews, we’ll apply the system to determine whether a new review should be classified as good or, well, not good.  

To do this, we’ll apply five common supervised learning algorithms: naive Bayes, logistic regression, support vector machine, neural network, and random forest.  Note that we’re approaching this from a data carpentry perspective rather than a mathematical or algorithmic perspective.  Our focus is on preparing data, writing Python code to build and run a machine learning model, and interpreting the results.  Although we’ll review the algorithms at a high level, we won’t go into depth about how they work (I will provide links for further reading if you’re interested in getting deeper into the mathematics of various machine learning algorithms).  

## The overall approach

Although different machine learning approaches often apply extremely different underlying mathematics, the overall approach to classifying text is very similar from a data carpentry perspective.  We will start with a set of reviews that have already been classified as good or bad, called a training set.  From here, we will take a “bag of words” approach as follows: 

1. Generate an ordered list, called a “bag of words”, listing the most frequently occurring words in a training set of documents.
2. Create a histogram, or word vector, for each document in our training set.  This word vector will provide a count of how frequently each term from the “bag of words” shows up in a training document.
3. Use this histogram, along with the existing rating of good or bad, to create a set of rules mapping the word vector to “good” or “bad” reviews.
4. Apply this set of rules to new, as yet uncategorized reviews.  

If this seems a little abstract, don’t worry.  It’s much easier to follow through a concrete example.  In the next few sections, we’ll use Python to go through each of these steps.  
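
As a small preview, here is a minimal sketch of the four steps using only Python’s standard library.  The four reviews are made up for illustration, and the names `tokenize`, `word_vector`, and `classify` are hypothetical helpers, not part of any library.  The "rule" in step 3 is a deliberately crude stand-in (add up how often each word appears in good versus bad reviews); it is not one of the five algorithms listed earlier, which learn their rules in more sophisticated ways.

```python
from collections import Counter

# Toy training set, invented for illustration: (review text, label).
training_reviews = [
    ("a great film with a great cast", "good"),
    ("great story and wonderful acting", "good"),
    ("a boring film with a terrible script", "bad"),
    ("terrible pacing and boring dialogue", "bad"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


# Step 1: build the "bag of words", the most frequent words in the training set.
word_counts = Counter()
for text, _ in training_reviews:
    word_counts.update(tokenize(text))
vocabulary = [word for word, _ in word_counts.most_common(10)]


def word_vector(text):
    """Step 2: count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Step 3: derive a rule from the word vectors and the known labels.
# Assumption: a crude per-word weight (occurrences in good reviews minus
# occurrences in bad reviews) stands in for a real learning algorithm.
weights = [0] * len(vocabulary)
for text, label in training_reviews:
    sign = 1 if label == "good" else -1
    for i, count in enumerate(word_vector(text)):
        weights[i] += sign * count


def classify(text):
    """Step 4: apply the rule to a review that was not in the training set."""
    score = sum(w * c for w, c in zip(weights, word_vector(text)))
    return "good" if score > 0 else "bad"


print(classify("a wonderful film with great acting"))
print(classify("boring and terrible"))
```

Words that appear equally often in good and bad reviews, such as "a" or "film", end up with a weight of zero and have no effect on the result, while words like "great" and "terrible" push the score one way or the other.  The algorithms covered later follow the same overall shape of vocabulary, word vectors, fitted rule, and prediction, but replace the hand-rolled weights with a trained model.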

