In [1]:
from __future__ import print_function

from traitlets.config.manager import BaseJSONConfigManager
path = '/Users/jmk/anaconda2/envs/data601/etc/jupyter/nbconfig'
cm = BaseJSONConfigManager(config_dir=path)
cm.update('livereveal', {
              'theme': 'night',
              'scroll': True,
              #'transition': 'zoom',
              'start_slideshow_at': 'selected',
})

{'scroll': True, 'start_slideshow_at': 'selected', 'theme': 'night'}

# So Machine Learning is really three separate steps

* Get labelled training data
* Convert your training data into n-dimensional vectors
* Run the ML algorithm

The second step is what we call "feature engineering".  It's the primary way that we encode human knowledge (wisdom?) into the ML process.  We pick which things the algorithm can look at, but we also need to describe them in a way that's meaningful to the algorithm.

# Get labelled training data

Supervised machine learning uses the "answer key" to help it adjust the way it makes decisions.

In order to use supervised machine learning methods, we _must_ get labelled training data.

This normally means either a known value over time or a label to place it into a class (e.g. spam vs. not-spam)

How can we get labelled training data?
* Find a dataset that includes labels
* Label it by hand ourselves
* Trick users into labelling it for us
* Hire users to label it for us (e.g. Amazon Mechanical Turk, Crowdflower)

## Find a dataset that includes labels

* Talk to the customer
* Use Google
* Check places like Kaggle, the UCI Machine Learning Repository, etc.
* Ask at the Open Data Stack Exchange
* Explore other open sources of data (data.gov, etc.)

## Label it by hand ourselves

* Assumes you have enough domain expertise
  * Probably fine if it's identifying dogs vs. cats
  * Not fine if it's diagnosing radiological images
* Slow 
* Time consuming
* Expensive

## Trick users into labelling it for us

Can we find a part of the average user's workflow that gives us the answer already?

* Conversion testing for web sites.  
  * We can track mouse movements and clicks to know if they did or didn't "convert".
  * Some cool tools out there for this (heapanalytics is one)
* House number identification
  * reCaptcha works data labelling into the captcha process

* Takes time to collect (unless you already have data logged)
* May take effort to instrument the system to record the desired behaviors

## Hire Annotators
* Can be expensive, depending on expertise required
* Requires carefully defining the task at hand
* Still have to pay for mistakes
* Takes time to coordinate
* Crowdsourcing is an option (e.g. Amazon Mechanical Turk, Crowdflower)

#  Convert Data Into n-dimensional Vectors

* In order to send data as input to ML algorithms, we must convert it to a vector of numbers
* For _quantitative_ variables, this is normally straightforward
* What about categorical variables?  What could we do?

* Ordinal variables, while categorical, are probably ok.  
  * e.g. "Strongly agree", "Agree", "Neither agree nor disagree", "Disagree", "Strongly disagree"
  * Because they're _ordinal_, we can assign them to a sequence of integers
* What about _nominal_ variables?
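
For the ordinal case above, a minimal sketch of the integer assignment might look like the following. The particular codes (0 through 4) are an assumption for illustration; any assignment that preserves the order would serve the same purpose.

```python
# Likert-scale labels listed from lowest to highest agreement.
# The integer codes are an illustrative assumption, not a standard.
LIKERT_ORDER = [
    "Strongly disagree",
    "Disagree",
    "Neither agree nor disagree",
    "Agree",
    "Strongly agree",
]

# Map each label to its rank in the ordering.
likert_to_int = {label: rank for rank, label in enumerate(LIKERT_ORDER)}

# Hypothetical survey responses.
responses = ["Agree", "Strongly disagree", "Neither agree nor disagree"]
encoded = [likert_to_int[r] for r in responses]
print(encoded)
```

Because the codes follow the order of the labels, "Agree" ends up numerically closer to "Strongly agree" than to "Strongly disagree", which is the property we want the algorithm to see.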

# Categorical variables:  One-hot encoding

Let's say we have bicycles to rent and they come in three colors: red, green, and blue.

How can we turn that into a value in a numerical vector?  If we just assign random numbers to them, the algorithm can get confused since it will assume that those distances are meaningful.

What we really want is for "redness" to be a dimension, "blueness" to be a dimension, and "greenness" to be a dimension.
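
To see why arbitrary integers are a problem, here is a small sketch comparing distances under the two representations. The integer codes are made up for illustration; the per-color vectors simply give each color its own dimension, as described above.

```python
import numpy as np

# Arbitrary integer codes (an assumption for illustration).
int_codes = {"red": 0, "green": 1, "blue": 2}

# One dimension per color: (redness, greenness, blueness).
per_color = {
    "red": np.array([1, 0, 0]),
    "green": np.array([0, 1, 0]),
    "blue": np.array([0, 0, 1]),
}

# With integer codes, red looks "closer" to green than to blue.
int_red_green = abs(int_codes["red"] - int_codes["green"])
int_red_blue = abs(int_codes["red"] - int_codes["blue"])
print("integer codes:", int_red_green, int_red_blue)

# With one dimension per color, every pair of colors is equally far apart.
vec_red_green = np.linalg.norm(per_color["red"] - per_color["green"])
vec_red_blue = np.linalg.norm(per_color["red"] - per_color["blue"])
print("per-color dimensions:", vec_red_green, vec_red_blue)
```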

To do this we use _one-hot encoding_.

In [2]:
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()

#  This learns that red will be [0,0,1], blue will be [1, 0, 0], etc.
binarizer.fit(['red', 'blue', 'green'])

#  This takes what it learned previously and replaces each string with its
#  one-hot encoding
binarizer.transform(['red', 'blue', 'green', 'red', 'blue'])

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

In [3]:
binarizer.classes_


array(['blue', 'green', 'red'], dtype='<U5')

Now we need to merge these values back into our array of features.

Note that the positions of the various features need to be the same for each data sample.  That is, color can't be the first three elements of one sample and the last three of the next.
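
Done by hand, the merge could look roughly like this. The numeric columns (`hours_rented`, `rider_age`) are hypothetical and only there to give the color columns something to sit beside; the color columns are in the `(blue, green, red)` order reported by `binarizer.classes_`.

```python
import numpy as np

# Hypothetical numeric features, one row per rental: [hours_rented, rider_age]
numeric = np.array([
    [2.0, 31.0],
    [0.5, 24.0],
    [4.0, 45.0],
])

# One-hot color columns in (blue, green, red) order for red, blue, green.
color_one_hot = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])

# Stack side by side so every sample has the same column layout:
# [hours_rented, rider_age, blue, green, red]
features = np.hstack([numeric, color_one_hot])
print(features.shape)
print(features)
```

Every row now has color in columns 2 through 4, which is the consistency the note above asks for.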

Fortunately for us, this is so common that sklearn has a built-in mechanism for taking an existing feature matrix and one-hot encoding some fields but not necessarily all:  The `OneHotEncoder`.

In [4]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
enc = OneHotEncoder()
city_enc = LabelEncoder()
country_enc = LabelEncoder()

city_enc.fit(['Atlanta', 'Baltimore',  'Zurich', 'Charlotte'])
country_enc.fit(['USA', 'Switzerland', 'Germany'])

samples = [['Atlanta', 'USA'],
           ['Charlotte', 'USA'],
           ['Baltimore', 'USA'],
           ['Baltimore', 'Germany'],
           ['Baltimore', 'Switzerland'],
           ['Zurich', 'Switzerland']]
city_samples = city_enc.transform([sample[0] for sample in samples])
print('City samples: ', city_samples)
country_samples = country_enc.transform([sample[1] for sample in samples])
print('Country samples: ', country_samples)

City samples:  [0 2 1 1 1 3]
Country samples:  [2 2 2 0 1 1]


In [5]:
transformed_samples = list(zip(city_samples, country_samples))
print(transformed_samples)

[(0, 2), (2, 2), (1, 2), (1, 0), (1, 1), (3, 1)]


In [6]:
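# Note: `n_values_` and `feature_indices_` were deprecated in scikit-learn 0.20
# and removed in 0.22, so this cell needs an older release to run as written.
enc.fit(transformed_samples)
print(enc.n_values_)         # number of distinct values in each input column
print(enc.feature_indices_)  # where each column's block starts in the output
for i in range(len(samples)):
    feature_vector = enc.transform([transformed_samples[i]]).todense()
    print('%s => %s' % (samples[i], feature_vector))
#   print('city: %s, country: %s' % (feature_vector[enc.feature_indices_[0]:enc.feature_indices_[1]], 
#                                    feature_vector[enc.feature_indices_[1]:enc.feature_indices_[2]]))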

[4 3]
[0 4 7]
['Atlanta', 'USA'] => [[1. 0. 0. 0. 0. 0. 1.]]
['Charlotte', 'USA'] => [[0. 0. 1. 0. 0. 0. 1.]]
['Baltimore', 'USA'] => [[0. 1. 0. 0. 0. 0. 1.]]
['Baltimore', 'Germany'] => [[0. 1. 0. 0. 1. 0. 0.]]
['Baltimore', 'Switzerland'] => [[0. 1. 0. 0. 0. 1. 0.]]
['Zurich', 'Switzerland'] => [[0. 0. 0. 1. 0. 1. 0.]]
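
As a hedged aside: in scikit-learn 0.20 and later, `OneHotEncoder` accepts string categories directly, so the `LabelEncoder` step is not needed, and the learned categories are exposed as `categories_`. A minimal sketch under that assumption, reusing the same samples:

```python
from sklearn.preprocessing import OneHotEncoder

samples = [['Atlanta', 'USA'],
           ['Charlotte', 'USA'],
           ['Baltimore', 'USA'],
           ['Baltimore', 'Germany'],
           ['Baltimore', 'Switzerland'],
           ['Zurich', 'Switzerland']]

enc = OneHotEncoder()

# fit_transform returns a sparse matrix; convert it to a dense array to print.
encoded = enc.fit_transform(samples).toarray()

# One sorted array of categories per input column (city, then country).
print(enc.categories_)
for sample, row in zip(samples, encoded):
    print('%s => %s' % (sample, row))
```

The resulting vectors should have the same layout as the output above: four city columns followed by three country columns.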


# Handling Text

Text data are particularly tricky to represent as vectors because they're so far removed from numerical data.

The traditional approach here is what we call a "bag of words" model:

* We take all the words in the corpus and assign them a number
* We make a new vector where each index means a specific word (e.g. `v[0]` is `1` if the word `dog` is present and `0` otherwise)
* We then take all of the documents and "map" each one to a vector by marking a `1` for every word position that's present
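
The three steps above can be sketched in plain Python. This is a toy version that assumes whitespace tokenization and a made-up three-document corpus; real vectorizers handle punctuation, casing, and much larger vocabularies.

```python
# Hypothetical toy corpus.
docs = ["the dog barks", "the cat sleeps", "dog chases cat"]

# Step 1: collect every word in the corpus and assign each one a number.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

# Steps 2 and 3: map a document to a vector with a 1 at the position
# of every word it contains and a 0 everywhere else.
def to_vector(doc):
    vector = [0] * len(vocab)
    for word in doc.split():
        vector[index[word]] = 1
    return vector

vectors = [to_vector(doc) for doc in docs]
print(vocab)
for doc, vector in zip(docs, vectors):
    print('%r => %s' % (doc, vector))
```

Note that word order is thrown away entirely, which is where the "bag" in the name comes from.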

This gives _surprisingly_ good results.

There are a few variations here too:  

* Use the frequency of words (`CountVectorizer`)
* Use something called _tf-idf_ (`TfidfVectorizer`), which down-weights words that are common across all documents, not just this one.
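
A small sketch of the difference between the two variations, using a made-up two-sentence corpus. With raw counts, "the" dominates the first document; with tf-idf, a word that appears in every document gets a lower inverse-document-frequency weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "the" and "cat" occur in both documents, "dog" in one.
corpus = ["the dog chased the cat", "the cat sat"]

# Word frequencies: each column is a word, each cell is how often it occurs.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(corpus).toarray()
the_count = counts[0, count_vect.vocabulary_["the"]]
dog_count = counts[0, count_vect.vocabulary_["dog"]]
print("counts in doc 0: the=%d, dog=%d" % (the_count, dog_count))

# tf-idf: the learned idf_ weights are higher for rarer words.
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(corpus)
the_idf = tfidf_vect.idf_[tfidf_vect.vocabulary_["the"]]
dog_idf = tfidf_vect.idf_[tfidf_vect.vocabulary_["dog"]]
print("idf weights: the=%.3f, dog=%.3f" % (the_idf, dog_idf))
```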

The following example is lightly modified from the sklearn documentation.

It illustrates the use of several new features of sklearn: `CountVectorizer`, `TfidfTransformer`, and `Pipeline`, which is a way to bundle up multiple steps, taken in order, for easier use.

First, the boilerplate...

In [7]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')

Now we want to load the dataset and prepare it

In [8]:
# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

#  Make a dataframe from this so it's easier to see.
import pandas as pd
df = pd.DataFrame(data.data)
df['target'] = pd.Series(data.target)
df.columns = ['text', 'target']
df.head()

Downloading 20news dataset. This may take a few minutes.
2018-04-18 16:28:36,579 INFO Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
2018-04-18 16:28:36,582 INFO Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
857 documents
2 categories



|   | text | target |
|---|------|--------|
| 0 | From: mangoe@cs.umd.edu (Charley Wingate)\nSub... | 0 |
| 1 | Subject: Re: There must be a creator! (Maybe)\... | 0 |
| 2 | From: MANDTBACKA@FINABO.ABO.FI (Mats Andtbacka... | 0 |
| 3 | From: royc@rbdc.wsnc.org (Roy Crabtree)\nSubje... | 1 |
| 4 | Subject: Re: "Imaginary" Friends - Info and Ex... | 1 |


In [11]:
#  Hide some deprecation warnings that are coming from inside sklearn.  
#
#  We can't actually fix these ourselves, so for now they're just noise
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) 

# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    #('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1),), # (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(data.data, data.target)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))        

Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__penalty': ('l2', 'elasticnet'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1),)}
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:    6.5s finished


done in 7.522s

Best score: 0.904
Best parameters set:
	clf__alpha: 1e-06
	clf__penalty: 'elasticnet'
	vect__max_df: 0.5
	vect__ngram_range: (1, 1)
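
The "12 candidates" in the log above is the product of the number of options for each parameter (3 × 1 × 2 × 2). A quick way to check this is `ParameterGrid`, which enumerates the combinations of a parameter dictionary like the one passed to `GridSearchCV`. The sketch below repeats the uncommented parameters and takes the fold count of 3 from the log.

```python
from sklearn.model_selection import ParameterGrid

# Keys follow the <step name>__<parameter name> convention used by Pipeline.
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1),),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)
n_folds = 3  # taken from the "Fitting 3 folds" line in the log above

print("candidates:", n_candidates)
print("total fits:", n_candidates * n_folds)

# Each candidate is one concrete setting for every parameter.
first_candidate = list(grid)[0]
print(first_candidate)
```

Uncommenting more entries in `parameters` multiplies the candidate count, which is why the processing time grows combinatorially.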


In [12]:
print('Here\'s an example of a transformed text object:')
steps = grid_search.best_estimator_.steps[0:-1]
print(steps)
vectorizer = Pipeline(steps)
vectorizer.transform([data.data[0]])

Here's an example of a transformed text object:
[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None))]


<1x18053 sparse matrix of type '<class 'numpy.int64'>'
	with 53 stored elements in Compressed Sparse Row format>

Note that it's a row of over 18,000 values!  That's one index for each possible word in the vocabulary across all the samples.
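
Storing such a row densely would be wasteful, since almost every entry is zero. The sketch below builds a row with the same shape and number of non-zero entries as the one above (the positions of the non-zero entries are made up) to show what the sparse format keeps track of.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A dense row with the same shape as the transformed document above.
dense_row = np.zeros((1, 18053), dtype=np.int64)

# Hypothetical positions: mark 53 entries as present.
dense_row[0, :53] = 1

# Compressed Sparse Row format stores only the non-zero entries.
sparse_row = csr_matrix(dense_row)
print(sparse_row.shape)
print(sparse_row.nnz, "stored elements out of", dense_row.size)
```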