# SigOpt for ML: Automatically Tuning Text Classifiers Walkthrough
*This [Jupyter notebook](http://jupyter.readthedocs.org/en/latest/install.html) is a walkthrough companion for our blog post [SigOpt for ML: Automatically Tuning Text Classifiers](http://blog.sigopt.com/post/133089144983/sigopt-for-ml-automatically-tuning-text) by Research Engineer [Ian Dewancker](https://sigopt.com/about#ian).*

Text classification problems appear quite often in modern information systems, and you might imagine building a small document/tweet/blogpost classifier for any number of purposes. In this example, we're building a pipeline to label [Amazon product reviews](http://www.cs.jhu.edu/~mdredze/datasets/sentiment/) as either favorable or not. First, we'll use scikit-learn to build a logistic regression classifier, performing k-fold cross validation to measure the accuracy. Next, we'll use SigOpt to tune the hyperparameters of both the feature extraction and the model building. 
## Setup
If you'd like to try running this example, you'll need to install:
 * [scikit-learn](http://scikit-learn.org/stable/install.html)
 * SigOpt's [Python client](https://sigopt.com/docs/overview/python)


In [1]:
# Learn more about authenticating the SigOpt API: sigopt.com/docs/overview/authentication
# Uncomment the following line and add your SigOpt API token:
# SIGOPT_API_TOKEN=<YOUR API TOKEN>

In [2]:
import json, urllib
(negative_file, _ ) = urllib.urlretrieve(
    "http://sigopt-public.s3-website-us-west-2.amazonaws.com/NEGATIVE_list.json",
    "NEGATIVE_LIST.json"
)
(positive_file, _) = urllib.urlretrieve(
    "http://sigopt-public.s3-website-us-west-2.amazonaws.com/POSITIVE_list.json",
    "POSITIVE_LIST.json"
)
POSITIVE_TEXT = json.load(open(positive_file))
NEGATIVE_TEXT = json.load(open(negative_file))
print "POSITIVE: {}".format(POSITIVE_TEXT[0])
print "NEGATIVE: {}".format(NEGATIVE_TEXT[0])

POSITIVE: My husband is a BMW mechanic and he drives cars all day.  I wish I could get him one for every car

NEGATIVE: Far better models for similair price range - dissapointed with the quality of the finish etc etc. works fine though.



# Part I: Setting up the Text Classifier
## Feature Representation
The [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class in scikit-learn is a convenient mechanism for transforming a corpus of text documents into vectors using [bag of words representations (BOW)](https://en.wikipedia.org/wiki/Bag-of-words_model). scikit-learn offers quite a bit of control in determining which [n-grams](https://en.wikipedia.org/wiki/N-gram) make up the vocabulary for your BOW vectors.  As a quick refresher, n-grams are sequences of text tokens as shown in the example below:

Original Text | “SigOpt optimizes any complicated system”
--- | ---
1-grams | { “SigOpt”, “optimizes”, “any”, “complicated”, “system”}
2-grams | { “SigOpt_optimizes”, “optimizes_any”, “any_complicated” ... }
3-grams | { “SigOpt_optimizes_any”, “optimizes_any_complicated” ... }

The number of times each n-gram appears in a given piece of text is then encoded in the BOW vector describing that text.  CountVectorizer allows you to control the range of n-grams that are included in the vocabulary (`ngram_range`), as well as filtering n-grams outside a specified [document-frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) range (`min_df`, `max_df`). For example, if a rare 3-gram like “hi_diddly_ho” doesn’t appear with at least `min_df` frequency in the corpus, it is not included in the vocabulary.  Similarly, n-grams that occur in nearly every document (1-grams like “the”, “a” etc) can also be filtered using the `max_df` parameter. 
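
To make the counting and filtering concrete, here is a minimal pure-Python sketch of the same idea. It is not how CountVectorizer is implemented: the helper names (`ngrams`, `bag_of_words`, `filter_vocabulary`) are hypothetical, tokenization is a plain whitespace split, and `min_df` / `max_df` are treated only as fractions of documents. It is written for Python 3.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (joined with '_') found in a list of tokens."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, ngram_range=(1, 1)):
    """Count every n-gram whose length lies inside ngram_range (inclusive)."""
    tokens = text.lower().split()  # simplification: whitespace tokenization
    counts = Counter()
    low, high = ngram_range
    for n in range(low, high + 1):
        counts.update(ngrams(tokens, n))
    return counts


def filter_vocabulary(documents, ngram_range, min_df, max_df):
    """Keep n-grams whose document-frequency fraction lies in [min_df, max_df]."""
    doc_freq = Counter()
    for text in documents:
        # set(...) so each n-gram is counted at most once per document
        doc_freq.update(set(bag_of_words(text, ngram_range)))
    total = float(len(documents))
    return sorted(g for g, df in doc_freq.items() if min_df <= df / total <= max_df)


corpus = [
    "SigOpt optimizes any complicated system",
    "SigOpt tunes any model",
    "the system works",
]
print(bag_of_words(corpus[0], ngram_range=(1, 2)))
print(filter_vocabulary(corpus, (1, 1), min_df=0.5, max_df=1.0))
```

With `min_df=0.5`, only the 1-grams that appear in at least half of the three toy documents survive the filter.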

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
# Create a CountVectorizer with the default parameters
vectorizer = CountVectorizer()
# Transform corpus into vectors using bag-of-words
text_features = vectorizer.fit_transform(POSITIVE_TEXT + NEGATIVE_TEXT)

## Logistic Regression Error Cost Parameters
Using the [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) class in scikit-learn, we can succinctly formulate and solve the logistic regression learning problem.  For a full description of the error function for logistic regression, see the [original blog post](http://blog.sigopt.com/post/133089144983/sigopt-for-ml-automatically-tuning-text). For brevity, we'll examine the exposed parameters of `alpha`, the weight of the [regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) term in our cost function, and `l1_ratio`, which controls the [mixture](https://en.wikipedia.org/wiki/Elastic_net_regularization) of l1 and l2 regularization.
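
As a rough illustration of how `alpha` and `l1_ratio` interact, the sketch below computes an elastic-net penalty for a weight vector. The function name is hypothetical, and the exact form (an l1 term weighted by `l1_ratio` plus half the squared l2 norm weighted by `1 - l1_ratio`, all scaled by `alpha`) is our reading of the scikit-learn SGD documentation rather than code taken from the library.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Elastic-net regularization term for a list of model weights.

    Assumed form: alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio) * 0.5 * ||w||_2^2)
    """
    l1 = sum(abs(w) for w in weights)
    l2 = 0.5 * sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)


weights = [1.0, -2.0]
print(elastic_net_penalty(weights, alpha=0.1, l1_ratio=0.0))  # pure l2
print(elastic_net_penalty(weights, alpha=0.1, l1_ratio=1.0))  # pure l1
print(elastic_net_penalty(weights, alpha=0.1, l1_ratio=0.5))  # even mixture
```

Setting `l1_ratio` to 0 or 1 recovers the pure l2 or pure l1 penalty, and `alpha` scales the whole term relative to the data-fit part of the cost.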

In [4]:
from sklearn.linear_model import SGDClassifier
# Create an SGDClassifier for logistic regression with an elastic-net penalty;
# alpha and l1_ratio keep their default values
classifier = SGDClassifier(loss='log', penalty='elasticnet')

## Objective Metric
SigOpt finds parameter configurations that **maximize** any metric, so we need to pick one that is appropriate for this classification task. In designing our objective metric, [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision), the fraction of reviews that are classified correctly, is obviously important, but we also want assurance that our model generalizes and performs well on data it was not trained on.  This is where the idea of [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) comes into play.  

Cross-validation requires us to split up our entire labeled dataset into two distinct sets: one to train on and one to validate our trained classifier on.  We then consider metrics like accuracy on only the validation set.  Taking this further and considering not one, but many possible splits of the labeled data is the idea of [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#k-fold_cross-validation), where multiple training/validation splits are generated and the validation metrics are aggregated (e.g., by mean, min, or max) to give a single estimate of performance.

In this case, we’ll use the mean of the k-fold cross-validation accuracies (see the [original blog post](http://blog.sigopt.com/post/133089144983/sigopt-for-ml-automatically-tuning-text) for a further discussion). We use k=5 splits, each assigning a random 70% of the entire dataset to training and the remaining 30% to validation.
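
The splitting and averaging described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not scikit-learn's implementation: `shuffle_split`, `mean_validation_accuracy`, and the `fit_predict` callback are hypothetical names, and the validation-set size is rounded rather than computed the way the library does it. It is written for Python 3.

```python
import random


def shuffle_split(n_samples, n_splits=5, test_size=0.3, seed=0):
    """Yield (train_indices, validation_indices) for random 70/30-style splits."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def mean_validation_accuracy(fit_predict, features, labels, splits):
    """Average the validation accuracy over every split.

    fit_predict(train_x, train_y, valid_x) must return predicted labels
    for valid_x; it stands in for training and applying a real classifier.
    """
    accuracies = []
    for train_idx, valid_idx in splits:
        predictions = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in valid_idx],
        )
        correct = sum(1 for p, i in zip(predictions, valid_idx) if p == labels[i])
        accuracies.append(correct / float(len(valid_idx)))
    return sum(accuracies) / len(accuracies)


def sign_rule(train_x, train_y, valid_x):
    """Toy 'classifier' that ignores the training data and predicts by sign."""
    return [1 if x >= 0 else -1 for x in valid_x]


toy_features = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]
toy_labels = [-1] * 5 + [1] * 5
splits = shuffle_split(len(toy_features), n_splits=5, test_size=0.3, seed=0)
print(mean_validation_accuracy(sign_rule, toy_features, toy_labels, splits))
```

Fixing the seed keeps the splits identical from one evaluation to the next, so differences in the metric come from the parameters being tuned rather than from the random split.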

In [5]:
from sklearn import cross_validation
cv_split = cross_validation.ShuffleSplit(
    text_features.shape[0], 
    n_iter=5, 
    test_size=0.3, 
    random_state=0
)
# Target classification of 1 for positive sentiment, -1 for negative
target = [1] * len(POSITIVE_TEXT) + [-1] * len(NEGATIVE_TEXT)
cv_scores = cross_validation.cross_val_score(
    classifier, 
    text_features, 
    target, 
    cv=cv_split
)
import numpy
print "Mean of 5-folded cross-validation accuracies: {}".format(
    numpy.mean(cv_scores)
)

Mean of 5-folded cross-validation accuracies: 0.839165419162


# Part II: Tuning the Text Classifier
## Tunable Parameters
The objective metric is controlled by a set of parameters that potentially influence its performance. While the logistic regression model might be conceptually simple and implemented in many statistics and machine learning software packages, valuable engineering time and resources are often wasted experimenting with feature representation and parameter tuning via trial and error.  SigOpt can automatically and intelligently optimize your objective, letting you get back to working on other tasks.

Within SigOpt, [Parameters](https://sigopt.com/docs/objects/parameter) can be defined on integer, continuous, or categorical domains. The first step is to define a SigOpt [Experiment](https://sigopt.com/docs/objects/experiment) with parameters to be tuned.

The parameters used in this experiment can be split into two groups: those governing the feature representation of the review text in the `vectorizer` (`min_df`, `max_df` and `ngram_range`) and those governing the cost function of logistic regression in the `classifier` (`alpha`, `l1_ratio`). We can examine the default values of these parameters below:

In [6]:
print "Default parameters of vectorizer: {}".format(json.dumps({
    'min_df': vectorizer.min_df,
    'max_df': vectorizer.max_df,
    'ngram_range': vectorizer.ngram_range,
}))
print "Default parameters of classifier: {}".format(json.dumps({
    'alpha': classifier.alpha,
    'l1_ratio': classifier.l1_ratio,
}))

Default parameters of vectorizer: {"ngram_range": [1, 1], "max_df": 1.0, "min_df": 1}
Default parameters of classifier: {"alpha": 0.0001, "l1_ratio": 0.15}


### Tricks to Tuning with SigOpt
As you can see, `ngram_range` is really two integer parameters, a minimum and a maximum. To tune it, we tell SigOpt to tune the minimum as an integer parameter, `min_ngram`, together with a non-negative integer offset, `ngram_offset`, where `max_ngram = min_ngram + ngram_offset`. This guarantees that the maximum is never smaller than the minimum. We'll use the same trick for the document-frequency range (`min_df`, `max_df`) of the vectorizer.

When the plausible range of a parameter spans several orders of magnitude, it often makes sense to search over the parameter on the [log scale](https://en.wikipedia.org/wiki/Logarithmic_scale), as we'll do with the `log_min_df` and `log_reg_coef` parameters. To transform back to the original parameter, exponentiate: `min_df = math.exp(log_min_df)`.
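
Both tricks amount to a small translation step between the values SigOpt tunes and the values scikit-learn expects. A minimal sketch of that step follows; the helper name `to_native_parameters` and the example values are hypothetical, while the assignment names match the ones used in this walkthrough.

```python
import math


def to_native_parameters(assignments):
    """Map tuned assignments (offsets, log-scale values) to native parameters."""
    min_ngram = assignments['min_ngram']
    min_df = math.exp(assignments['log_min_df'])
    return {
        # offset trick: the maximum is the minimum plus a non-negative offset
        'ngram_range': (min_ngram, min_ngram + assignments['ngram_offset']),
        'min_df': min_df,
        'max_df': min_df + assignments['df_offset'],
        # log-scale trick: exponentiate to recover the original parameter
        'alpha': math.exp(assignments['log_reg_coef']),
    }


example = {
    'min_ngram': 1,
    'ngram_offset': 2,
    'log_min_df': math.log(0.01),
    'df_offset': 0.2,
    'log_reg_coef': math.log(0.0001),
}
print(to_native_parameters(example))
```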

To finish, we choose reasonable bounds for every parameter, and create a SigOpt experiment with the [python client](https://github.com/sigopt/sigopt-python).

In [7]:
# Learn more about authenticating the SigOpt API: sigopt.com/docs/overview/authentication
from sigopt import Connection
conn = Connection(client_token=SIGOPT_API_TOKEN)

In [8]:
import math
experiment = conn.experiments().create(
    name='Sentiment LR Classifier',
    project='sigopt-examples',
    metrics=[dict(name='cv_scores', objective='maximize')],
    parameters=[{ 
        'name':'l1_coef', 
        'type': 'double', 
        'bounds': { 'min': 0, 'max': 1.0 }
    }, { 
        'name':'log_reg_coef', 
        'type': 'double', 
        'bounds': { 'min': math.log(0.000001), 'max': math.log(100.0) }
    }, { 
        'name':'min_ngram', 
        'type': 'int',
        'bounds': { 'min': 1, 'max': 2 }
    }, { 
        'name':'ngram_offset',
        'type': 'int',
        'bounds': { 'min': 0, 'max': 2 }
    }, { 
        'name':'log_min_df', 
        'type': 'double',
        'bounds': { 'min': math.log(0.00000001), 'max': math.log(0.1) }
    }, { 
        'name':'df_offset', 
        'type': 'double',
        'bounds': { 'min': 0.01, 'max': 0.25 }
    }],
)
print "View your experiment details at https://sigopt.com/experiment/{0}".format(experiment.id)

### Putting it all Together
Now that we've defined our experiment for SigOpt, we should re-write our classifier code so that it accepts a dictionary of assignments as they will be returned from the [SigOpt API](https://sigopt.com/docs/overview). Here, our function also accepts the lists of positive and negative sentiment text. You can see the relationship between the parameters defined in the SigOpt experiment and the parameters used by CountVectorizer and SGDClassifier.

In [9]:
def sentiment_metric(positive, negative, assignments):
    min_ngram = assignments['min_ngram']
    max_ngram = min_ngram + assignments['ngram_offset']
    min_doc_frequency = math.exp(assignments['log_min_df'])
    max_doc_frequency = min_doc_frequency + assignments['df_offset']
    vectorizer = CountVectorizer(
        min_df=min_doc_frequency, 
        max_df=max_doc_frequency,                          
        ngram_range=(min_ngram, max_ngram),
    )
    text_features = vectorizer.fit_transform(positive + negative)
    target = [1] * len(positive) + [-1] * len(negative)
    
    alpha = math.exp(assignments['log_reg_coef'])
    l1_ratio = assignments['l1_coef']
    classifier = SGDClassifier(
        loss='log', 
        penalty='elasticnet', 
        alpha=alpha, 
        l1_ratio=l1_ratio
    )
    cv = cross_validation.ShuffleSplit(
        text_features.shape[0], 
        n_iter=5, 
        test_size=0.3, 
        random_state=0
    )
    cv_scores = cross_validation.cross_val_score(classifier, text_features, target, cv=cv)
    return numpy.mean(cv_scores)

### Optimization Loop
After creating an experiment, run the [optimization loop](https://sigopt.com/docs/overview/optimization) to tune the parameters:
1. Receive suggested assignments from SigOpt
2. Calculate the sentiment metric given these assignments
3. Report the observed sentiment metric to SigOpt

Based on our observations, we advise running iterations equal to 10-20 times the number of parameters. Since we have 6 parameters, we'll run 60 iterations.
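
If you want to see the shape of this loop without an API token, the sketch below runs the same three steps offline. Note the substitutions: suggestions come from uniform random search rather than from SigOpt, the objective is a toy function rather than the sentiment metric, and `run_loop` and `toy_objective` are hypothetical names. It is written for Python 3.

```python
import random


def run_loop(objective, bounds, n_iterations, seed=0):
    """Suggest / evaluate / report loop with random search as the suggester."""
    rng = random.Random(seed)
    observations = []
    for _ in range(n_iterations):
        # 1. Receive suggested assignments (here: a uniform random draw)
        assignments = {
            name: rng.uniform(low, high) for name, (low, high) in bounds.items()
        }
        # 2. Calculate the metric for these assignments
        value = objective(assignments)
        # 3. Report the observation (here: just remember it)
        observations.append((value, assignments))
    best = max(observations, key=lambda pair: pair[0])
    return best, observations


def toy_objective(assignments):
    """Toy metric to maximize; its optimum is at x=0.3, y=-0.2."""
    return -(assignments['x'] - 0.3) ** 2 - (assignments['y'] + 0.2) ** 2


bounds = {'x': (0.0, 1.0), 'y': (-1.0, 1.0)}
budget = 10 * len(bounds)  # 10 iterations per parameter, as advised above
best, observations = run_loop(toy_objective, bounds, budget)
print("Best value {0} at {1}".format(best[0], best[1]))
```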

In [10]:
for i in range(60):
    print "Training model candidate {0}".format(i)
    suggestion = conn.experiments(experiment.id).suggestions().create()
    opt_metric = sentiment_metric(POSITIVE_TEXT, NEGATIVE_TEXT, suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=opt_metric,
    )

# Closing Remarks
See the [original blog post](http://blog.sigopt.com/post/133089144983/sigopt-for-ml-automatically-tuning-text) for a discussion of results averaged over 20 runs of this optimization loop for grid search, random search, and SigOpt, all versus the baseline of no parameter tuning. (Hint: you see accuracy gains with parameter tuning.)

This short example scratches the surface of the types of ML-related experiments one could conduct using SigOpt.  For example, SGDClassifier has lots of variations from which to select; another experiment might be to treat the loss function as a [categorical variable](https://sigopt.com/docs/objects/parameter).  What sort of models are you building that could benefit from better experimentation or optimization?  Stay tuned for more posts in our series on integrating SigOpt with various ML frameworks to solve **real problems more efficiently!**

In [11]:
experiment = conn.experiments(experiment.id).fetch()

In [12]:
best_observation = experiment.progress.best_observation
print "Best value: {value}, found at:\n{assignments}".format(
    value=best_observation.value, 
    assignments=json.dumps(
        best_observation.assignments.to_json(),
        sort_keys=True,
        indent=4, 
        separators=(',', ': ')
    )
)

In [13]:
def sentiment_classifier(positive, negative, assignments):
    min_ngram = assignments['min_ngram']
    max_ngram = min_ngram + assignments['ngram_offset']
    min_doc_frequency = math.exp(assignments['log_min_df'])
    max_doc_frequency = min_doc_frequency + assignments['df_offset']
    vectorizer = CountVectorizer(
        min_df=min_doc_frequency, 
        max_df=max_doc_frequency,                          
        ngram_range=(min_ngram, max_ngram),
    )
    text_features = vectorizer.fit_transform(positive + negative)
    target = [1] * len(positive) + [-1] * len(negative)
    
    alpha = math.exp(assignments['log_reg_coef'])
    l1_ratio = assignments['l1_coef']
    classifier = SGDClassifier(
        loss='log', 
        penalty='elasticnet', 
        alpha=alpha, 
        l1_ratio=l1_ratio
    )
    classifier.fit(text_features, target)
    return (vectorizer, classifier)

In [14]:
import math
(best_vectorizer, best_classifier) = sentiment_classifier(
    POSITIVE_TEXT, 
    NEGATIVE_TEXT, 
    best_observation.assignments
)

In [15]:
def predict_sentiment(text, verbose=True):
    text_features = best_vectorizer.transform([text])
    sentiment = best_classifier.predict(text_features)[0]
    if verbose:
        if sentiment == 1:
            print "positive: {}".format(text)
        else:
            print "negative: {}".format(text)
    return sentiment

In [16]:
import random
sentiment = predict_sentiment(random.choice(POSITIVE_TEXT))
sentiment = predict_sentiment(random.choice(NEGATIVE_TEXT))

In [17]:
print "View your final results at https://sigopt.com/experiment/{0}".format(experiment.id)

View your final results at https://sigopt.com/experiment/2328
