# Intro to Machine Learning with Python

## Info
- Scott Bailey (CIDR), *scottbailey@stanford.edu*
- Javier de la Rosa (CIDR), *versae@stanford.edu*
- Ashley Jester (CIDR/SSDS), *ajester@stanford.edu*

## What are we covering today?
- What is AI? What is ML?
- Learning styles
- Classification workflow with `scikit-learn`
- Text classification
- Training, testing and validation
- Model selection
- Parameter optimization

## Goals

By the end of the workshop, we hope you'll be able to identify when a problem related to a textual dataset might be approached using a classification strategy. Moreover, we hope you'll get a basic understanding of the different steps involved in the general workflow of machine learning using `scikit-learn`. We will be learning to use document classification to sort literary text by genre.

## What is AI? What is ML?

> In computer science, the field of **AI** research defines itself as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of success at some goal.

<div align="right">&mdash; Source: [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)</div>

> Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, **machine learning** explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a **model** from inputs.

<div align="right">&mdash; Source: [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning)</div>


<figure>
  <img style="margin: 0 0 -55px 0;" src="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAhLAAAAJDFiODI3ZmNmLTFkOTgtNGYzZS04MWMzLTlkNmVmYjU2NTVlNw.png" />  <figcaption><div align="right" style="padding-top: 4px;">&mdash; Source: <a href="https://www.linkedin.com/pulse/deep-learning-next-big-thing-adtech-volker-ballueder">Is Deep Learning the next big thing in adtech?</a></div></figcaption>
</figure>

## Learning styles

Depending on the *input*: 
- Supervised, learning a mapping between example inputs and desired outputs.
- Unsupervised, learning patterns and structure in unlabeled inputs.
- Reinforcement, learning how to achieve a goal in a dynamic environment by giving feedback in terms of rewards and punishments.

Depending on the *output*: 
- Supervised:
  - Classification, inputs are labeled with one or more classes, and the learner, called the **classifier**, must produce a model that assigns unseen inputs to one or more of these classes. *Example*: in spam filtering, the inputs are email messages and the classes are "spam" and "not spam".
    
    Algorithms:
    - Support Vector Machines
    - Decision Trees
    - AdaBoost
    - Gradient Boosting
    - Random Forest
    - Logistic Regression
    - Neural Networks
    - Maximum Entropy Classifier
    - k-Nearest Neighbor
    - Naïve Bayesian
    - Discriminant Analysis

  - Regression, outputs are continuous rather than discrete, and the learner is referred to as the **regressor**.
    
    Algorithms:
    - Support Vector Regression
    - Gaussian Process
    - Regression Trees
    - Gradient Boosting
    - Random Forest
    - RBF Networks
    - OLS
    - LASSO
    - Ridge Regression

- Unsupervised:
  - Clustering, inputs' memberships are not known, and the learner must divide the inputs into groups, or **clusters**, by some kind of similarity measure.
    
    Algorithms:
    - DBScan
    - K-Means
    - Hierarchical Clustering
    - Self-Organizing Maps
    - Spectral Clustering
    - Minimum Entropy Clustering

  - Dimensionality reduction, the inputs' properties are reduced to a smaller number, either by keeping the most informative ones, **feature selection**, or by transforming them into a space of fewer dimensions, **feature extraction**.
    
    Algorithms:
    - Principal Components Analysis
    - Kernel PCA
    - Linear Discriminant Analysis
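
The difference between the supervised and unsupervised styles above can be sketched on the same toy data. This is only an illustration: the points and the labels `"a"`/`"b"` are made up, and `KNeighborsClassifier` and `KMeans` are just two convenient choices from the lists above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of made-up 2D points (hypothetical data)
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
y = np.array(["a", "a", "a", "b", "b", "b"])

# Supervised (classification): the learner sees inputs AND their labels
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
predicted = clf.predict([[0.1, 0.1], [5.1, 5.1]])
print("Predicted classes:", list(predicted))

# Unsupervised (clustering): the learner sees only the inputs and
# assigns its own arbitrary group ids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_
print("Cluster ids:", list(clusters))
```

The classifier answers with the class names it was given, while the cluster ids carry no meaning beyond "same group" or "different group".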

## Classification workflow with `scikit-learn`

The workflow for a classification task is usually as follows:
1. **Collect** or create clean **labeled data**
2. **Transform** that data into a numeric representation
  - Each numeric value representing a characteristic of the data is called a **feature**
  - The set of all features representing a single input is called its **feature vector**
  - The whole data is split into at least a training set and a test set
3. **Train**, learn, or fit a model on a part of the transformed labeled data (training set)
4. **Test** the model predictions on unseen transformed labeled data (test set) to evaluate its performance
5. Assess your model and tweak each of the previous steps
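
The five steps above can be sketched end to end on a toy dataset. This is a minimal illustration under stated assumptions: the `texts` and `labels` lists are made up, and `MultinomialNB` is just one possible classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 1. Collect: a tiny, made-up labeled dataset (hypothetical examples)
texts = [
    "win a free prize now", "free money click now",
    "claim your free prize", "cheap pills free offer",
    "meeting moved to monday", "lunch at noon tomorrow",
    "see you at the meeting", "notes from the lecture",
]
labels = ["spam"] * 4 + ["ham"] * 4

# 2. Transform: split the data, then turn the texts into word counts
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts_train)  # learn vocabulary on training texts
X_test = vectorizer.transform(texts_test)        # reuse the same vocabulary

# 3. Train
clf = MultinomialNB()
clf.fit(X_train, y_train)

# 4. Test
y_pred = clf.predict(X_test)

# 5. Assess
print("Accuracy:", accuracy_score(y_test, y_pred))
```

In this sketch the vocabulary is learned from the training texts only, so nothing about the test set leaks into the model; with eight documents the accuracy itself is not meaningful.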

### `scikit-learn`

In Python, one solid choice for machine learning is the library [`scikit-learn`](http://scikit-learn.org/stable/):
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

### Simple and unified API
All learning algorithms in `scikit-learn` share a uniform and limited API consisting of complementary interfaces:
- An `estimator` interface for building and fitting models
- A `predictor` interface for making predictions
- A `transformer` interface for converting data

The goal is to enforce a simple and consistent API to make it trivial to swap or plug algorithms.

Estimators

```python
class Estimator(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        ...
        return self
    
    def predict(self, X):
        """Predict class for unseen data."""
        ...
        return
    
    def transform(self, X):
        """Converts data."""
        ...
        return
    
    def fit_transform(self, X, y=None):
        """Fit and then transform data."""
        ...
        return
```
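
Because every estimator exposes the same `fit` and `predict` methods, swapping algorithms can be as small as changing one object. The sketch below assumes a made-up four-document dataset; the three classifiers are arbitrary picks from `scikit-learn`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up documents and labels (hypothetical data)
texts = ["good great fine", "great good", "bad awful", "awful terrible bad"]
labels = ["pos", "pos", "neg", "neg"]

# Transformer interface: fit and transform in one call
X = CountVectorizer().fit_transform(texts)

predictions = {}
for model in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    model.fit(X, labels)                       # estimator interface
    name = type(model).__name__
    predictions[name] = list(model.predict(X))  # predictor interface
    print(name, predictions[name])
```

The loop body never needs to know which algorithm it is running, which is the point of the unified API.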

## Text classification


In the specific task of assigning a category to a text:
1. Collect. For example, in the case of tweets, this would require downloading the tweets using the Twitter API, extracting the text from the JSON response, and assigning a label, such as a specific hashtag.
2. Transform. There are many strategies for turning textual data into numbers. One of the most widely used and reliable for text classification is counting words (Bag of Words, or BOW), which in `scikit-learn` is implemented by a vectorizer called `CountVectorizer`. One of the decisions to make in this type of transformation is how many words to count. This is a **hyper-parameter**.
3. Train. After selecting a model for classification, we train on part of the input data.
4. Test. We then test on the remainder.
5. Assess the model performance and tweak the hyper-parameters if needed.

Let's import `scikit-learn`'s `CountVectorizer`:

In [83]:
from sklearn.feature_extraction.text import CountVectorizer

In [84]:
documents = [
    "This is a piece of text. This is also some random text. Also text.",
]

In [85]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(documents)
print("Vocabulary size:", len(count_vectorizer.vocabulary_))
count_vectorizer.vocabulary_

Vocabulary size: 8


{'also': 0,
 'is': 1,
 'of': 2,
 'piece': 3,
 'random': 4,
 'some': 5,
 'text': 6,
 'this': 7}

In [86]:
counts = count_vectorizer.transform(documents)
print(counts)
print("   ^  ^         ^\n   |  |         |\n  doc word_id count")

  (0, 0)	2
  (0, 1)	2
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	3
  (0, 7)	2
   ^  ^         ^
   |  |         |
  doc word_id count


In [87]:
doc = 0  # first document
for word, word_id in count_vectorizer.vocabulary_.items():
    print(word, ":", counts[doc,word_id])

is : 2
some : 1
text : 3
random : 1
piece : 1
also : 2
this : 2
of : 1


In [88]:
counts = count_vectorizer.fit_transform(documents)
print(counts)

  (0, 4)	1
  (0, 5)	1
  (0, 0)	2
  (0, 6)	3
  (0, 2)	1
  (0, 3)	1
  (0, 1)	2
  (0, 7)	2


`CountVectorizer` also has some options to disregard stopwords, count ngrams instead of words, cap the max number of words to count, normalize spelling, or count terms within a frequency range. It is worth exploring the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
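
A few of these options can be tried on the same sentence used earlier. This is a small sketch; the parameter values (`(1, 2)`, `3`) are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a piece of text. This is also some random text. Also text."]

# Disregard common English stopwords such as "is" and "this"
no_stop = CountVectorizer(stop_words="english").fit(docs)
print(sorted(no_stop.vocabulary_))

# Count single words and pairs of consecutive words (unigrams and bigrams)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(bigrams.vocabulary_), "unigrams and bigrams")

# Cap the vocabulary at the 3 most frequent terms
top3 = CountVectorizer(max_features=3).fit(docs)
print(sorted(top3.vocabulary_))
```

Each option changes the size of the vocabulary, and therefore the number of features per document.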

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Fill in the text in the code cell below so that, after transforming it with a `CountVectorizer`, the counts are as shown (the order of the words is not important).

```
flowers : 1
garden : 1
up : 1
some : 1
the : 1
morning : 1
place : 1
every : 1
pick : 1
from : 1
my : 1
by : 1
```
<!--
<br/>
* **Hint**: You could count the characters in the words.*
-->
</p>
</div>

In [108]:
documents = [
    "Every morning, I pick up some flowers from the garden by my place",
    # Opportunity to explain that different sentences might have same counts!
]
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(documents)
doc = 0  # first document
for word, word_id in count_vectorizer.vocabulary_.items():
    print(word, ":", counts[doc,word_id])

flowers : 1
garden : 1
up : 1
some : 1
the : 1
morning : 1
place : 1
every : 1
pick : 1
from : 1
my : 1
by : 1
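
Since a bag of words ignores word order, different sentences can end up with identical counts. A minimal sketch, using two made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with the same words in a different order
sentences = [
    "the dog chased the cat",
    "the cat chased the dog",
]
counts = CountVectorizer().fit_transform(sentences).toarray()
print(counts)

# Both rows hold the same counts, so a classifier cannot tell them apart
same = bool((counts[0] == counts[1]).all())
print("Identical vectors:", same)
```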


## Training, testing, and validation

`scikit-learn` provides functions to split a dataset into training and testing sets. Often, if part of the data is used to fine-tune the model, holding out another split for validation is a good idea.
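
One common way to get the three splits is to call `train_test_split` twice. The sketch below uses placeholder data (the numbers 0 to 99 with alternating labels) and an assumed 60/20/20 proportion.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with alternating labels (hypothetical)
X = list(range(100))
y = [i % 2 for i in X]

# First hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Then carve the validation set out of the remainder (0.25 * 80 = 20 samples)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is used while tuning, and the test set is touched only once, at the very end.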

For our purposes, we will be using the [Brown corpus](http://icame.uib.no/brown/bcm-los.html) included in NLTK, which comprises more than a million words of English text from 500 sources, categorized by genre. Specifically, we will use 2 categories: `fiction` (e.g., W.E.B. Du Bois: Worlds of Color), and `romance` (e.g., Callaghan: A Passion in Rome). The goal will be to create a classifier able to assign a text to `fiction` or `romance` based only on its textual data.

In [282]:
import nltk
nltk.download('all', quiet=True);

In [257]:
import random
from nltk.corpus import brown

dataset = []
categories = ['fiction', 'romance']
for category in categories:
    for fileid in brown.fileids(category):
        text = " ".join(brown.words(fileids=fileid))
        dataset.append((text, category))

random.shuffle(dataset)
print(len(dataset), "documents:", ", ".join(" ".join((str(len(brown.fileids(c))), c)) for c in categories))
"..." + text[35:166] + "...", category

58 documents: 29 fiction, 29 romance


("...You know , I've got one of your cars at home . As a prominent industrialist , you ought to be interested in his nibs' support group...",
 'romance')


In [292]:
from sklearn.model_selection import train_test_split

labels = []
for _, label in dataset:
    labels.append(label)

texts = []
for text, _ in dataset:
    texts.append(text)

X = CountVectorizer().fit_transform(texts)  # don't forget to transform the text!

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

print("{} training documents with {} features per document".format(*X_train.shape))
print("{} testing documents with {} features per document".format(*X_test.shape))

43 training documents with 12004 features per document
15 testing documents with 12004 features per document


Let's start with one of the Naïve Bayes classifiers.

Naïve Bayes classifiers are a family of classifiers based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions related to it. Although the formulation can get confusing, all the math boils down to counting, multiplication and division, making Naïve Bayes classifiers very fast. On the other hand, they assume that all of the features in the data set are equally important and independent, which is rarely true in real life.

Because of these characteristics, Naïve Bayes classifiers are well suited to text classification, where word counts are treated as independent features.
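
To see that the math really is counting, multiplication and division, the sketch below scores one new document by hand and compares the result with `MultinomialNB`. The four training texts and the new document are made up, and add-one smoothing is assumed (it matches `alpha=1.0`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts and labels (hypothetical data)
texts = ["love kiss heart", "love heart", "ship storm sea", "sea storm"]
labels = np.array(["romance", "romance", "adventure", "adventure"])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()
new_doc = vectorizer.transform(["love storm heart"]).toarray()[0]

# By hand: prior times the product of smoothed word likelihoods, per class
scores = {}
for c in np.unique(labels):
    word_counts = X[labels == c].sum(axis=0)              # counting
    likelihood = (word_counts + 1) / (word_counts.sum() + X.shape[1])  # division
    prior = (labels == c).mean()
    scores[c] = prior * np.prod(likelihood ** new_doc)    # multiplication
total = sum(scores.values())
manual = {c: s / total for c, s in scores.items()}

# The same computation done by scikit-learn
clf = MultinomialNB(alpha=1.0).fit(X, labels)
sk = dict(zip(clf.classes_, clf.predict_proba([new_doc])[0]))

print("By hand:     ", manual)
print("scikit-learn:", sk)
```

Both computations give "romance" a probability of 0.75: two of the three words in the new document were seen only in romance texts.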

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

clf.fit(X_train, y_train)

Now we can predict the categories of the held-out texts and assess how well our classifier did.

In [311]:
from collections import Counter

y_pred = clf.predict(X_test)

# Count (predicted, actual) pairs; accuracy is the share of matching pairs
matrix = Counter(zip(y_pred, y_test))
(matrix[('fiction', 'fiction')] + matrix[('romance', 'romance')]) / len(y_test)

0.7333333333333333

In [310]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.73333333333333328
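
Accuracy is simply the share of predictions that match the true label, which is why the two cells above agree. The sketch below repeats both computations on made-up labels (not the workshop's results) and also builds a confusion matrix, which shows where the mistakes fall.

```python
from collections import Counter

from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (hypothetical, not the workshop output)
y_true = ["fiction", "fiction", "romance", "romance", "romance", "fiction"]
y_hat = ["fiction", "romance", "romance", "romance", "fiction", "fiction"]

# By hand: count (true, predicted) pairs and keep the matching ones
pairs = Counter(zip(y_true, y_hat))
correct = sum(n for (true, pred), n in pairs.items() if true == pred)
manual_accuracy = correct / len(y_true)
print("By hand:       ", manual_accuracy)
print("accuracy_score:", accuracy_score(y_true, y_hat))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=["fiction", "romance"])
print(cm)
```

Here 4 of the 6 predictions are correct, and the off-diagonal cells of the matrix show one mistake in each direction.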

In [7]:
import pickle
import datetime
import nltk
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [5]:
from collections import defaultdict
from nltk.corpus import brown,stopwords
import random
import nltk

In [6]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [None]:
def generate_xy(texts, labels):
    # texts and labels are paths to files with one document / one label per line
    pipeline = Pipeline([
        ("count", CountVectorizer(stop_words='english', min_df=0.0,
                                  # max_features=10000,
                                  binary=False)),
        # ("tfidf", TfidfTransformer(norm="l2"))
    ])
    with open(texts, 'rb') as ftext:
        # X = pipeline.fit_transform(map(lambda line: calc_ngrams(line), ftext))
        X = pipeline.fit_transform(ftext)
    with open(labels, 'rb') as flabel:
        y = np.loadtxt(flabel)
    return X, y

In [2]:
# Helper functions
import requests
from urllib.request import urlopen

def get_text(url):
    # Try requests first; fall back to urllib if the request fails
    try:
        return requests.get(url).text
    except Exception:
        return urlopen(url).read().decode("utf8")
        
def get_speech(url):
    page = get_text(url)
    full_text = page.split('\n')
    return " ".join(full_text[2:])

In [4]:
!python -m nltk.downloader all



In [6]:
from collections import defaultdict
from nltk.corpus import brown,stopwords
import random
import nltk

dataset = []  # 500 samples

for category in brown.categories():
    for fileid in brown.fileids(category):
        dataset.append((brown.words(fileids=fileid), category))

dataset = [([w.lower() for w in text], category) for text, category in dataset]

def feature_extractor(text, bag):
    # bag -> bag of words
    # NOTE: as written, this counts the words that are NOT in `bag`;
    # use `if word in bag` to count the words of the bag themselves
    frec = defaultdict(int)
    for word in text:
        if word not in bag:
            frec[word] += 1

    return frec

# training & test 90%-10% naivebayes nltk

def train_and_test(featureset, n=90):

    random.shuffle(featureset)
    split = int((len(featureset) * n) / 100)
    train, test = featureset[:split], featureset[split:]
    classifier = nltk.NaiveBayesClassifier.train(train)
    accuracy = nltk.classify.accuracy(classifier, test)
    return accuracy

# English stopword list passed as the bag (renamed to avoid shadowing the module)
stopword_list = stopwords.words("english")  # 153 words

featureset = [(feature_extractor(text, stopword_list), category) for text, category in dataset]

print("Accuracy: ", train_and_test(featureset))  # varies between runs (random shuffle)

Accuracy:  0.02


In [None]:
import sys

import pickle
import datetime
import nltk
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# total number of sentences (combined)
NTOTAL = 1788280

def calc_ngrams(line):
    """Converts line into one string of unigram, bigram and trigram tokens."""
    words = nltk.word_tokenize(line.lower())
    word_str = " ".join(words)
    bigrams = nltk.bigrams(words)
    bigram_str = " ".join(["0".join(bigram) for bigram in bigrams])
    trigrams = nltk.trigrams(words)
    trigram_str = " ".join(["0".join(trigram) for trigram in trigrams])
    return " ".join([word_str, bigram_str, trigram_str])

def generate_xy(texts, labels):
    """Builds the TF-IDF matrix X from a text file and loads the labels y."""
    pipeline = Pipeline([
        ("count", CountVectorizer(stop_words='english', min_df=0.0,
                                  # max_features=10000,
                                  binary=False)),
        ("tfidf", TfidfTransformer(norm="l2"))
    ])
    with open(texts, 'rb') as ftext:
        # X = pipeline.fit_transform(map(lambda line: calc_ngrams(line), ftext))
        X = pipeline.fit_transform(ftext)  # each line of the file is one document
    with open(labels, 'rb') as flabel:
        y = np.loadtxt(flabel)
    return X, y

def crossvalidate_model(X, y, nfolds):
    """Trains and scores a LinearSVC on each of nfolds train/test splits."""
    kfold = KFold(n_splits=nfolds)
    avg_accuracy = 0
    for train, test in kfold.split(X):
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
        clf = LinearSVC()
        clf.fit(Xtrain, ytrain)
        ypred = clf.predict(Xtest)
        accuracy = accuracy_score(ytest, ypred)
        print("...accuracy = ", accuracy)
        avg_accuracy += accuracy
    print("Average Accuracy: ", (avg_accuracy / nfolds))

def train_model(X, y, binmodel):
    """Fits a LinearSVC, reports on the training data, and pickles the model."""
    model = LinearSVC()
    model.fit(X, y)
    # reports
    ypred = model.predict(X)
    print("Confusion Matrix (Train):")
    print(confusion_matrix(y, ypred))
    print("Classification Report (Train)")
    print(classification_report(y, ypred))
    with open(binmodel, 'wb') as fmodel:
        pickle.dump(model, fmodel)

def test_model(X, y, binmodel):
    """Loads the pickled model and reports on the test data."""
    with open(binmodel, 'rb') as fmodel:
        model = pickle.load(fmodel)
    if y is not None:
        # reports
        ypred = model.predict(X)
        print("Confusion Matrix (Test)")
        print(confusion_matrix(y, ypred))
        print("Classification Report (Test)")
        print(classification_report(y, ypred))

def print_timestamp(message):
    print(message, datetime.datetime.now())

def usage():
    # The accepted modes match the checks in main()
    print("Usage: python classify.py [xval|run]")
    sys.exit(-1)

def main():
    if len(sys.argv) != 2:
        usage()
    print_timestamp("started:")
    X, y = generate_xy("data/sentences.txt", "data/labels.txt")
    if sys.argv[1] == "xval":
        crossvalidate_model(X, y, 10)
    elif sys.argv[1] == "run":
        Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
            test_size=0.1, random_state=42)
        train_model(Xtrain, ytrain, "data/model.bin")
        test_model(Xtest, ytest, "data/model.bin")
    else:
        usage()
    print_timestamp("finished:")

if __name__ == "__main__":
    main()

Use accuracy as our model selection metric

Explain briefly over-fitting.
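
One way to use accuracy for model selection is to compare candidates by their cross-validated accuracy, so that each model is scored on documents it was not trained on. The sketch below is only an illustration: the eight texts are made up, and the two candidates and `cv=4` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up labeled texts (hypothetical data)
texts = [
    "win a free prize now", "free money click now",
    "claim your free prize", "cheap pills free offer",
    "meeting moved to monday", "lunch at noon tomorrow",
    "see you at the meeting", "notes from the lecture",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Each candidate bundles the vectorizer with a classifier, so the
# vocabulary is re-learned on the training part of every fold
candidates = {
    "MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "LinearSVC": make_pipeline(CountVectorizer(), LinearSVC()),
}

mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, texts, labels, cv=4, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(name, "mean accuracy:", mean_accuracy[name])

best = max(mean_accuracy, key=mean_accuracy.get)
print("Selected model:", best)
```

A model that only memorizes its training data (over-fitting) can look perfect when scored on that same data, which is why the comparison uses held-out folds.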