# Classification Walkthrough - Forward Interest

This walkthrough will take you through a simple classification task. By the end of the tutorial, you should be able to create a simple machine learning workflow that classifies sentences for `forward interest`.

## Create a project

To start, we'll quickly create a Squirro project that we can work in. To do this you'll need a running Squirro cluster and a valid API token.

In [None]:
CLUSTER = ""
TOKEN = ""

# get a client
from squirro_client import SquirroClient
client = SquirroClient(client_id=None, client_secret=None, cluster=CLUSTER)
client.authenticate(refresh_token=TOKEN)

# create a project
PROJECT_ID = client.new_project("Classification Walkthrough").get("id")
print(PROJECT_ID)

## Loading data

The next step is to load data into our Squirro instance. We run a pre-made Squirro data loader script to insert our data set:

In [None]:
!{"./load.sh %s %s %s" % (CLUSTER, TOKEN, PROJECT_ID)}

## Examine the dataset

The first step in any machine learning project should be to look carefully at your dataset. Try to answer questions like:
- How many labeled samples do I have?
- Are the labels evenly distributed between categories?
- How accurate would I be if I labeled the samples randomly?
- How accurate would I be if I labeled all the samples as only one category?

Answering these questions will give you an idea of what method to use, what parameters to use for that method, and what the baseline performance might be.
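
The last two questions can be answered directly from the label counts. Below is a minimal sketch, assuming hypothetical counts (the numbers are illustrative and are not taken from the loaded dataset):

```python
# Hypothetical label counts, for illustration only.
label_counts = {"neg": 850, "pos": 150}

total = sum(label_counts.values())

# Baseline 1: always predict the most frequent label.
majority_label = max(label_counts, key=label_counts.get)
majority_accuracy = label_counts[majority_label] / float(total)

# Baseline 2: pick one of the labels uniformly at random.
# The expected accuracy is 1 / (number of labels), whatever the balance.
uniform_random_accuracy = 1.0 / len(label_counts)

print("majority baseline (%s): %.2f" % (majority_label, majority_accuracy))
print("uniform random baseline: %.2f" % uniform_random_accuracy)
```

Any model worth keeping should beat the higher of the two baselines.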

For the dataset we just loaded, we'll first look at a few samples:

In [None]:
# print a positive item
for item in client.query(project_id=PROJECT_ID,
                         query='dataset:train label:pos',
                         fields=['body','keywords'], count=1)['items']:
    print(u'{label} - {body}'.format(body=item['body'], label=item['keywords']['label'][0]))

# print a negative item
for item in client.query(project_id=PROJECT_ID,
                         query='dataset:train label:neg',
                         fields=['body','keywords'], count=1)['items']:
    print(u'{label} - {body}'.format(body=item['body'], label=item['keywords']['label'][0]))

As you can see, we've printed a single positive and negative example to get an idea of what we're looking for. No big surprises so far.

Next we'll look at the dataset as a whole to get an idea of the balance between the two labels:

In [None]:
res = client.query(project_id=PROJECT_ID, query='*', aggregations={'label': {}})
for value in res['aggregations']['label']['label']['values']:
    print(u'{label} - {count}'.format(label=value['key'], count=value['value']))

As you can see, while our dataset has a decent number of samples (almost 18K), it has many more negative examples than positive ones. This imbalance is worth noting, as it tells us a couple of things. First, if we were to guess that every item was negative, we'd already be ~85% correct! This is considerably higher than the 50% baseline of guessing between the two labels uniformly at random. Second, this imbalance might skew our model towards the negative category. To account for this, we should consider weighting the model categories when we construct our model.
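
One common way to weight categories is to make each weight inversely proportional to the label frequency, which is the heuristic scikit-learn applies for `class_weight="balanced"`. The sketch below uses hypothetical counts; whether and how libNLP forwards such weights to the underlying model is an assumption to check against its documentation.

```python
def balanced_class_weights(label_counts):
    """Weight each label by n_samples / (n_classes * count)."""
    n_samples = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: n_samples / float(n_classes * count)
            for label, count in label_counts.items()}

# Hypothetical counts, for illustration only.
weights = balanced_class_weights({"neg": 850, "pos": 150})
print(weights)
```

The rarer label receives the larger weight, so mistakes on positive examples cost more during training.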

## Build the model workflow

Now that we have an idea of the data we're dealing with, we can move on to building our classification model. To reiterate the goal, we want to build a model that can guess whether or not a sentence is forward-looking based on the text therein.

The heart of Squirro's Machine Learning Service is our custom natural language processing library libNLP. It is what actually does all the processing. Thus our model workflow is simply a libNLP workflow, which we'll walk through now. (For extended documentation for libNLP, see https://squirro.github.io/nlp/).

The libNLP workflow is simply a JSON document with specifications for the individual components required for machine learning, so we start with an empty Python dictionary that we will fill in step by step:

In [None]:
workflow = {}

### Specify the dataset

The first thing we need to do is tell libNLP which dataset to operate on. We do this by providing Squirro queries for the `train`, `test`, and `infer` data sets. `train` is the data we want to train the model on, `test` is the data we'd like to test the model on, and `infer` is the data we'd like to predict on (which is typically unlabeled). In this walkthrough we only define `train` and `test`.

In [None]:
workflow["dataset"] = {
    "train": {"query_string": "dataset:train"},
    "test": {"query_string": "dataset:test"}
}

Here we have already split our dataset into a training and test set using a `dataset` facet during loading. Notice also that `query_string` can be any Squirro query, making it easy to carve out your samples.

### Specify the analyzer

Next we want to tell libNLP the type of machine learning task we have. That way we can later analyze how well we are doing at this task.

In [None]:
workflow["analyzer"] = {
    "type": "classification",
    "label_field": "keywords.label",
    "tag_field": "keywords.label_pred"
}

Here we said we have a `classification` task, where the ground-truth label is stored in `keywords.label` and our predicted label is written to `keywords.label_pred`.

### Specify the pipeline

Finally we need to tell libNLP the steps we'll use to go from unstructured text to a prediction for each item. We do so by defining a pipeline composed of sequential steps, where each step performs some operation on an internal stream of items.

Here we only present the steps that we need for this task. For a list of all steps and associated documentation, see https://squirro.github.io/nlp/.

First we instantiate an empty pipeline:

In [None]:
workflow["pipeline"] = []

#### Loader step

The first step is to load the data from Squirro into libNLP and convert it to libNLP's internal format. Since this step is the beginning of the pipeline, it is passed the `dataset` settings we gave above.

In [None]:
workflow['pipeline'].append({
    "step": "loader",
    "type": "squirro_query",
    "fields": ["body", "keywords.label"]
})

Notice that we specified the `fields` we wanted to import to make loading more efficient.

Also note that when the loader step gets content, it always turns it into a flat dictionary before passing it to the next step in the pipeline. This is why we prepend `keywords.` to the fields.
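
To make the flattening concrete, here is a minimal sketch of turning a nested item into a flat dictionary with dotted keys. The `flatten` helper and the sample item are hypothetical, and libNLP's own implementation may differ in its details.

```python
def flatten(item, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in item.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Hypothetical item, shaped like the query results shown earlier.
item = {"body": "We expect growth next year.", "keywords": {"label": ["pos"]}}
flat_item = flatten(item)
print(flat_item)
```

After flattening, the label is reachable under the single key `keywords.label`, which is the name used throughout the rest of the workflow.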

#### Normalization step

We next need to normalize the incoming data so that all the training samples are in the same format. This makes training the model simpler since it shrinks the space of data it has to be able to predict on.

In [None]:
workflow['pipeline'].append({
    "step": "normalizer",
    "types": ["html", "character", "punctuation", "lowercase"],
    "fields": ["body"]
})

Here for the field `body`, we are first stripping out `html`, numeric `character`s, and `punctuation`, and then making everything `lowercase`.
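
As a rough illustration of what these four normalizations do to a sentence, the sketch below approximates them with regular expressions. It is based only on the description above; the `normalize` function is hypothetical and the exact rules libNLP applies may differ.

```python
import re
import string

def normalize(text):
    """Rough approximation of the html, character, punctuation, and lowercase steps."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags
    text = re.sub(r"\d+", " ", text)       # drop numeric characters
    # Drop every ASCII punctuation character.
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
    return text.lower()

# Hypothetical input sentence.
normalized = normalize("<p>We expect 20% growth, in 2019!</p>")
print(normalized)
```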

#### Tokenization step

Now we need to split our input from a single string of text into a list of tokens. For this particular case, we can use the `spaces` tokenizer to get a sequential list of words.

In [None]:
workflow['pipeline'].append({
    "step": "tokenizer",
    "type": "spaces",
    "fields": ["body"]
})
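
Conceptually, tokenizing on spaces amounts to the following sketch. The `tokenize_on_spaces` function is hypothetical, and the real tokenizer may treat consecutive or non-space whitespace differently.

```python
def tokenize_on_spaces(text):
    """Split text on runs of whitespace, dropping empty tokens."""
    return text.split()

# Hypothetical, already normalized sentence.
tokens = tokenize_on_spaces("we expect growth in the next quarter")
print(tokens)
```

This simple approach works here because the normalization step has already removed punctuation that would otherwise stay attached to the words.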

#### Embedding step

Right before classification, we have to convert our list of tokenized words into numbers. This is done via an `embedder` step. Squirro comes shipped with some pre-trained embeddings, but for this case, we'll make our own TF-IDF embeddings.

In [None]:
workflow['pipeline'].append({
    "step": "embedder",
    "type": "tfidf",
    "input_field": "body",
    "output_field": "embedded_body"
})
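
For intuition about what a TF-IDF embedding contains, here is a minimal pure-Python sketch. It uses the plain `tf * log(N / df)` weighting as an assumption; the variant libNLP uses (for example with smoothing or normalization) may differ, and the `tfidf_embed` function and sample sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_embed(documents):
    """Return the sorted vocabulary and one TF-IDF vector per tokenized document.

    Assumes every document contains at least one token.
    """
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    # Plain idf = log(N / df); libraries often use smoothed variants.
    idf = {token: math.log(n_docs / float(doc_freq[token])) for token in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[token] / float(len(doc)) * idf[token]
                        for token in vocabulary])
    return vocabulary, vectors

# Hypothetical tokenized sentences.
docs = [["we", "expect", "growth"], ["we", "reported", "growth"]]
vocabulary, vectors = tfidf_embed(docs)
print(vocabulary)
print(vectors)
```

Tokens that appear in every document get a weight of zero, while tokens that distinguish one document from the others get a positive weight.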

#### Classification step

We are now ready to classify the incoming items. For this task we'll use a simple SVM classifier from scikit-learn.

In [None]:
workflow['pipeline'].append({
    "step": "classifier",
    "type": "sklearn",
    "model_type": "SVC",
    "model_kwargs": {"probability": True},
    "use_sparse": True,
    "input_fields": ["embedded_body"],
    "label_field": "keywords.label",
    "output_field": "keywords.label_pred",
    "explanation_field": "keywords.label_pred_explanation"
})

This classifier takes our input field `embedded_body` and attempts to predict the label field `keywords.label`. It writes its prediction to the output field `keywords.label_pred`.

Some models also provide an explanation of their prediction (though the SVM does not). Here it's written to `keywords.label_pred_explanation`.

#### Saver step

Finally we want to save our predictions back to Squirro. We do this through a saver step:

In [None]:
workflow["pipeline"].append({
    "step": "saver",
    "type": "squirro_item",
    "fields": ["keywords.label_pred"]
})

Note that only the fields we specify in `fields` will be sent back to Squirro.

### All together

Putting it all together, our libNLP workflow looks like this:

In [None]:
import json
print(json.dumps(workflow, indent=2))

## Train the model

Now we're ready to train our proposed workflow. To do that we can simply push it to the Squirro Machine Learning Service:

In [None]:
ml_workflow_id = client.new_machinelearning_workflow(
    PROJECT_ID, name='forward_interest', config=workflow).get('id')
print(ml_workflow_id)

Now we create a training job for the workflow. This will tell the Machine Learning Service to schedule a job that runs the workflow with the `train` dataset we specified above.

In [None]:
training_job_id = client.new_machinelearning_job(
    PROJECT_ID, ml_workflow_id=ml_workflow_id, type='training').get('id')
print(training_job_id)

Now we just wait for it to finish. Depending on the size of the dataset, the size of the model, and the number of free parameters, this can take anywhere from a few seconds to days. Because of this, it's always a good idea to START SMALL with a test dataset and model until you're confident things are working well.

Since training will take up to 5 minutes to finish, we write the simple function below that pings the job status every 5 seconds. Once this cell is done evaluating, we'll be ready to move on.

In [None]:
import sys
import time

def wait_for_ml_job(project_id, ml_workflow_id, ml_job_id, max_wait_time=600):
    """Poll the ML job every 5 seconds until it succeeds, fails, or times out."""
    start_time = time.time()
    while True:
        job = client.get_machinelearning_job(
            project_id, ml_workflow_id, ml_job_id, include_run_log=True).get('machinelearning_job')
        # The job is finished once either timestamp has been set.
        if job.get('last_error_at') is not None or job.get('last_success_at') is not None:
            print(job.get('logs'))
            break
        else:
            # Progress marker while we wait.
            sys.stdout.write('. ')
            sys.stdout.flush()
            time.sleep(5)
        # Give up once we have waited longer than max_wait_time seconds.
        if (time.time() - start_time) > max_wait_time:
            print('max_wait_time has been exceeded!')
            print(job.get('logs'))
            break

wait_for_ml_job(PROJECT_ID, ml_workflow_id, training_job_id, max_wait_time=300)

## Analyze the model quality

Now that our model is trained, we can check how it performed on our test data set (the items matched by the `test` query we specified above).

In [None]:
result = client.get_machinelearning_job(
    PROJECT_ID, ml_workflow_id, training_job_id).get('machinelearning_job').get('last_result')
print(json.dumps(result, indent=2))

The above results tell us several things. First, we see the `precision` and `recall` of each predicted label. For a given category, `precision` is the number of true positives divided by the total number of items predicted as that category, and `recall` is the number of true positives divided by the total number of items that actually belong to that category.

We also see the `confusion matrix`, which shows us where we are most likely to mis-predict. In a perfect classifier, only the diagonal of the matrix would be populated. Here we see, though, that some off-diagonal elements are populated as well, meaning we are mis-predicting in some cases.
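
The relationship between the confusion matrix and the two metrics can be sketched as follows. The matrix values are hypothetical, and the layout (rows are true labels, columns are predicted labels) is an assumption that may not match the layout in `last_result`.

```python
def precision_recall(confusion, labels):
    """Compute per-label precision and recall.

    confusion[i][j] is the number of items whose true label is labels[i]
    and whose predicted label is labels[j] (an assumed layout).
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = confusion[k][k]
        predicted_as_label = sum(row[k] for row in confusion)  # column sum
        actually_label = sum(confusion[k])                     # row sum
        metrics[label] = {
            "precision": true_positives / float(predicted_as_label) if predicted_as_label else 0.0,
            "recall": true_positives / float(actually_label) if actually_label else 0.0,
        }
    return metrics

# Hypothetical confusion matrix, for illustration only.
labels = ["neg", "pos"]
confusion = [[80, 5],
             [10, 5]]
metrics = precision_recall(confusion, labels)
print(metrics)
```

With an imbalanced dataset like ours, the metrics of the rarer label are usually the more informative ones, since overall accuracy is dominated by the frequent label.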

## Validate the model on new data

Since our model has reasonable (though not perfect) quality, we can now move on to validating it on samples that don't yet have a `label`. We can do this in a few different ways, which we cover below.

### Direct inference

First, it's good to do a sanity check. The simplest way to check our model on new data is to run a direct inference on items we have.

In [None]:
test_items = [{"id": 0, "body": "The board gives us its full support."},
              {"id": 1, "body": "We will make sure we have enough runway for the next 12 months."},
              {"id": 2, "body": "We are actively looking for investment."},
              {"id": 3, "body": "Luke, I am your father."}]

test_items_pred = client.run_machinelearning_workflow(
    PROJECT_ID, ml_workflow_id, data={'items': test_items}).get('items')
for item, item_pred in zip(test_items, test_items_pred):
    print(u'{label} - {body}'.format(body=item['body'], label=item_pred['keywords']['label_pred']))

Seems reasonable...

### Add a pipelet for ingestion

Now that we have some confidence in our trained model, we can set up a pipelet step that will run items through it during ingestion. For this we have made an example pipelet here: https://github.com/squirro/delivery/tree/master/templates/pipelets/machinelearning.

### Add an inference job for future data

If we want to avoid blocking the ingestion process, we can instead create an asynchronous inference job that will tag new items with our trained model:

In [None]:
inference_job_id = client.new_machinelearning_job(
    PROJECT_ID, ml_workflow_id=ml_workflow_id, type='inference', scheduling_options={}).get('id')
print(inference_job_id)

## Reset

WARNING: This deletes the project!!!

In [None]:
client.delete_project(PROJECT_ID)