# Intro to Machine Learning with Python

## Info
- Scott Bailey (CIDR), *scottbailey@stanford.edu*
- Javier de la Rosa (CIDR), *versae@stanford.edu*
- Ashley Jester (CIDR/SSDS), *ajester@stanford.edu*

## What are we covering today?
- What is AI? What is ML?
- Learning styles
- Classification workflow with `scikit-learn`
- Text classification
- Training, testing and validation
- Model evaluation and selection

## Requirements
Add `conda-forge` as a new channel in Anaconda.
- `jupyter`
- `scikit-learn`
- `matplotlib`
- `lime`

## Goals

By the end of the workshop, we hope you'll be able to identify when a problem related to a textual dataset might be approached using a classification strategy. Moreover, we hope you'll get a basic understanding of the different steps involved in the general workflow of machine learning using `scikit-learn`. We will be learning to use document classification to sort literary text by genre.

## What is AI? What is ML?

> In computer science, the field of **AI** research defines itself as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of success at some goal.

<div align="right">&mdash; Source: [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)</div>

> Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, **machine learning** explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a **model** from inputs.

<div align="right">&mdash; Source: [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning)</div>


<figure>
  <img style="margin: 0 0 -55px 0;" src="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAhLAAAAJDFiODI3ZmNmLTFkOTgtNGYzZS04MWMzLTlkNmVmYjU2NTVlNw.png" />  <figcaption><div align="right" style="padding-top: 4px;">&mdash; Source: <a href="https://www.linkedin.com/pulse/deep-learning-next-big-thing-adtech-volker-ballueder">Is Deep Learning the next big thing in adtech?</a></div></figcaption>
</figure>

## Learning styles

Depending on the *input*: 
- Supervised, learning a mapping between example inputs and desired outputs.
- Unsupervised, learning patterns and structure in unlabeled inputs.
- Reinforcement, learning how to achieve a goal in a dynamic environment by giving feedback in terms of rewards and punishments.

Depending on the *output*: 
- Supervised:
  - Classification, inputs are labeled with one or more classes, and the learner, called the **classifier**, must produce a model that assigns unseen inputs to one or more of these classes. *Example*: spam filtering, where the inputs are email messages and the classes are "spam" and "not spam".
    
    Algorithms:
    - Support Vector Machines
    - Decision Trees
    - AdaBoost
    - Gradient Boosting
    - Random Forest
    - Logistic Regression
    - Neural Networks
    - Maximum Entropy Classifier
    - k-Nearest Neighbor
    - Naïve Bayesian
    - Discriminant Analysis

  - Regression, outputs are continuous rather than discrete, and the learner is referred to as the **regressor**.
    
    Algorithms:
    - Support Vector Regression
    - Gaussian Process
    - Regression Trees
    - Gradient Boosting
    - Random Forest
    - RBF Networks
    - OLS
    - LASSO
    - Ridge Regression

- Unsupervised:
  - Clustering, inputs' memberships are not known, and the learner must divide the inputs into groups, or **clusters**, by some kind of similarity measure.
    
    Algorithms:
    - DBScan
    - K-Means
    - Hierarchical Clustering
    - Self-Organizing Maps
    - Spectral Clustering
    - Minimum Entropy Clustering

  - Dimensionality reduction, inputs' properties are reduced to a smaller number, either by keeping the most informative ones, **feature selection**, or by transforming them into a space of fewer dimensions, **feature extraction**.
    
    Algorithms:
    - Principal Components Analysis
    - Kernel PCA
    - Linear Discriminant Analysis

## Classification workflow with `scikit-learn`

The workflow for a classification task is usually as follows:
1. **Collect** or create clean **labeled data**
2. **Transform** that data into a numeric representation
  - Each numeric value representing a characteristic of the data is called a **feature**
  - The set of all features representing a single input is called the **feature vector**
  - The whole data is split into at least a training set and a test set
3. **Train**, learn, or fit a model on a part of the transformed labeled data (training set)
4. **Test** the model predictions on unseen transformed labeled data (test set) to evaluate its performance
5. Assess your model and tweak each of the previous steps
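
The steps above can be sketched end to end in a few lines. This is only a minimal illustration: the tiny spam/ham dataset is made up, and `CountVectorizer` and `MultinomialNB` are just one possible choice of transformation and model. The texts are split before the vectorizer is fitted so that the vocabulary is learned from the training set only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 1. Collect: a made-up labeled dataset
texts = [
    "win money now", "cheap money offer", "free prize money", "win a free prize",
    "meeting at noon", "lunch meeting tomorrow", "project meeting notes", "lunch at noon",
]
labels = ["spam"] * 4 + ["ham"] * 4

# 2. Transform: hold out a test set, then turn the texts into word counts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary here
X_test = vectorizer.transform(test_texts)        # reuse it for unseen texts

# 3. Train: fit a model on the training set
model = MultinomialNB().fit(X_train, y_train)

# 4. Test: evaluate the predictions on the held-out set
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy on {} test documents: {}".format(len(test_texts), accuracy))

# 5. Assess: change the vectorizer, the model, or the split, and repeat
```

With so few documents the accuracy is not meaningful; the point is only the order of the steps.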

### `scikit-learn`

In Python, one solid choice for machine learning is the library [`scikit-learn`](http://scikit-learn.org/stable/):
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

### Simple and unified API
All learning algorithms in `scikit-learn` share a uniform and limited API consisting of complementary interfaces:
- An `estimator` interface for building and fitting models
- A `predictor` interface for making predictions
- A `transformer` interface for converting data

The goal is to enforce a simple and consistent API to make it trivial to swap or plug algorithms.

Estimators

```python
class Estimator(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        ...
        return self
    
    def predict(self, X):
        """Predict class for unseen data."""
        ...
        return
    
    def transform(self, X):
        """Converts data."""
        ...
        return
    
    def fit_transform(self, X, y=None):
        """Fit and then transform data."""
        ...
        return
```
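
To see why a uniform API makes algorithms easy to swap, here is a small sketch in which a hand-written baseline and `MultinomialNB` are used through exactly the same `fit` and `predict` calls. `MajorityClassifier` is a hypothetical class written for this example, not part of `scikit-learn`, and the spam/ham texts are made up.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return np.array([self.majority_] * X.shape[0])


texts = ["win money now", "cheap money offer", "free money prize",
         "meeting at noon", "lunch meeting tomorrow"]
labels = ["spam", "spam", "spam", "ham", "ham"]

# Transformer interface: fit_transform on known texts, transform on new ones
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
X_new = vectorizer.transform(["lunch at noon"])

# Estimator and predictor interfaces: the loop body is the same for both models
predictions = {}
for model in (MajorityClassifier(), MultinomialNB()):
    model.fit(X, labels)
    predictions[type(model).__name__] = model.predict(X_new)[0]
print(predictions)
```

Because both objects expose the same methods, replacing one model with another only changes the line that creates it.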

## Text classification


In the specific task of assigning a category to a text:
1. Collect. For example, in the case of tweets, this would mean downloading the tweets using the Twitter API, extracting the text from the JSON response, and assigning a label, such as a specific hashtag.
2. Transform. There are many strategies for turning your textual data into numbers. One of the most used and reliable for text classification is to look at the occurrences of words in documents (Bag of Words or BoW), or to count them (BoW counts), which in `scikit-learn` is a form of `Vectorizer` known as the `CountVectorizer`. One of the decisions to make in this type of transformation is the number of words we are going to count. This is a **hyper-parameter**.
3. Train. After selecting a model for classification, we train on part of the input data.
4. Test. We then test on the remainder.
5. Assess the model performance and tweak the hyper-parameters if needed.

Let's import `scikit-learn`'s `CountVectorizer`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
documents = [
    "This is a piece of text. This is also some random text. Also text.",
]

In [None]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(documents)
print("Vocabulary size:", len(count_vectorizer.vocabulary_))
count_vectorizer.vocabulary_

In [None]:
counts = count_vectorizer.transform(documents)
print(counts)
print("   ^  ^         ^\n   |  |         |\n  doc word_id count")

In [None]:
doc = 0  # first document
for word, word_id in count_vectorizer.vocabulary_.items():
    print(word, ":", counts[doc, word_id])

In [None]:
counts = count_vectorizer.fit_transform(documents)
print(counts)

`CountVectorizer` also has some options to disregard stopwords, count ngrams instead of words, cap the max number of words to count, normalize spelling, or count terms within a frequency range. It is worth exploring the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
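
A few of these options can be tried on a pair of made-up sentences. The sketch below uses the `stop_words`, `ngram_range`, `max_features`, and `min_df` parameters; the example documents are an assumption for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The dog sat on the log"]

# Drop common English words such as "the" and "on"
no_stopwords = CountVectorizer(stop_words="english").fit(docs)
print(sorted(no_stopwords.vocabulary_))

# Count pairs of consecutive words (bigrams) instead of single words
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)
print(sorted(bigrams.vocabulary_))

# Keep only the 2 most frequent words across the whole corpus
top_words = CountVectorizer(max_features=2).fit(docs)
print(sorted(top_words.vocabulary_))

# Keep only words that appear in at least 2 documents
shared = CountVectorizer(min_df=2).fit(docs)
print(sorted(shared.vocabulary_))
```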

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Fill in the text in the code cell below so after transforming it using a `CountVectorizer` the counts are as shown (order of words is not important).

```
flowers : 1
garden : 1
up : 1
some : 1
the : 1
morning : 1
place : 1
every : 1
pick : 1
from : 1
my : 1
by : 1
```
<!--
<br/>
* **Hint**: You could count the characters in the words.*
-->
</p>
</div>

In [None]:
documents = [
    ...
]
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(documents)
doc = 0  # first document
for word, word_id in count_vectorizer.vocabulary_.items():
    print(word, ":", counts[doc,word_id])

## Training, testing, and validation

`scikit-learn` provides functions to split a dataset into training and testing sets. Often, if part of the data is used to fine-tune the model, holding out another split for validation is a good idea.
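
One simple way to obtain the three splits is to call `train_test_split` twice, as in the sketch below. The placeholder arrays, the 60/20/20 proportions, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 samples with 2 features each and balanced binary labels
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# First hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Then split the remainder: 25% of it (20% of the total) is used for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(len(X_train), "training,", len(X_val), "validation,", len(X_test), "testing")
```

The validation set is used while tuning hyper-parameters, so the test set stays untouched until the final evaluation.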

For our purposes, we will be using the [Brown corpus](http://icame.uib.no/brown/bcm-los.html) included in NLTK, which comprises more than a million words of English text from 500 sources, categorized by genre. Specifically, we will use 2 categories: `fiction` (e.g., W.E.B. Du Bois: *Worlds of Color*), and `romance` (e.g., Callaghan: *A Passion in Rome*). The goal will be to create a classifier able to assign a text to `romance` or `fiction` based only on its textual data.

In [None]:
import nltk
nltk.download('all', quiet=True);

In [None]:
import random
from nltk.corpus import brown

dataset = []
categories = ['fiction', 'romance']
for category in categories:
    for fileid in brown.fileids(category):
        text = " ".join(brown.words(fileids=fileid))
        dataset.append((text, category))

random.shuffle(dataset)
print(len(dataset), "documents:", ", ".join(" ".join((str(len(brown.fileids(c))), c)) for c in categories))

"..." + text[35:166] + "...", category

From the dataset, we can now separate the labels from the texts into two different variables. Usually, the variable containing the labels is named `y`, and the one containing the array of features (in our case, the transformed version of the texts) is named `X`: the output `y` is obtained as a function of the inputs `X`, which is the core abstraction in `scikit-learn`.

In [None]:
import numpy as np  # scikit-learn works internally with NumPy arrays

texts = []
labels = []
for text, label in dataset:
    texts.append(text)
    labels.append(label)
    
X = np.array(texts)
y = np.array(labels)

In [None]:
from sklearn.model_selection import train_test_split

(X_train_texts, X_test_texts,
 y_train, y_test) = train_test_split(X, y, test_size=0.25, random_state=42)

print("{} training documents".format(*X_train_texts.shape))
print("{} testing documents".format(*X_test_texts.shape))

In [None]:
# Don't forget to transform the text!
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_texts)
X_test = vectorizer.transform(X_test_texts)

print("{} training documents with {} features per document".format(*X_train.shape))
print("{} testing documents with {} features per document".format(*X_test.shape))

Let's start with one of the Naïve Bayes classifiers.

**Naïve Bayes** classifiers are a family of classifiers based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions related to it. Although the formulation can get confusing, all the math boils down to counting, multiplication, and division, making Naïve Bayes classifiers very fast. On the other hand, they assume that all of the features in the data set are equally important and independent, which is rarely true in real life.

Given these characteristics, Naïve Bayes classifiers are well suited to text classification, where word counts are treated as independent features.
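
To make the "counting, multiplication, and division" concrete, here is a sketch that reproduces multinomial Naïve Bayes predictions by hand with NumPy and compares them with `scikit-learn`'s `MultinomialNB`. The four example texts are made up, and the hand-written formula assumes additive smoothing with `alpha=1.0`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "cheap money offer",
         "meeting at noon", "lunch meeting tomorrow"]
labels = np.array(["spam", "spam", "ham", "ham"])

counts = CountVectorizer().fit_transform(texts).toarray()
classes = np.unique(labels)
alpha = 1.0  # smoothing, so unseen words never get a probability of zero

# Counting: how often each word appears in each class (plus smoothing)
word_counts = np.array([counts[labels == c].sum(axis=0) for c in classes]) + alpha

# Division: turn the counts into P(word | class) and P(class)
word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
class_probs = np.array([(labels == c).mean() for c in classes])

# Multiplication, done as a sum of logarithms to avoid tiny numbers
log_scores = counts @ np.log(word_probs).T + np.log(class_probs)
manual_predictions = classes[log_scores.argmax(axis=1)]

sklearn_predictions = MultinomialNB(alpha=alpha).fit(counts, labels).predict(counts)
print(manual_predictions)
print(sklearn_predictions)
```

The independence assumption shows up in the `counts @ np.log(word_probs).T` line: each word contributes to the score on its own, regardless of the words around it.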

Three commonly used Naïve Bayes algorithms in `scikit-learn` are:
- Gaussian. It assumes that features follow a normal distribution.
- Multinomial. Good for discrete counts, like in text classification problems using counts of words.
- Bernoulli. Useful for feature vectors that are binary (i.e. zeros and ones), like classic bag of words.

Given that our feature vectors are counts of bag of words, we will use `MultinomialNB`.

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

clf.fit(X_train, y_train)

Now we can predict the category of a new text that the classifier has not seen.

In [None]:
samples = [
    "She will travel at the speed of light in that rocket",
]
transformed_samples = vectorizer.transform(samples)
clf.predict(transformed_samples)

If we want to know which words are being used to decide whether a sentence is `romance` or `fiction`, we need to take into account the coefficients calculated by the model at training time, and the vocabulary built by the vectorizer we used to transform our data.

In [None]:
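def most_informative_features(classifier, vectorizer=None, n=10):
    # A fitted pipeline can be passed alone: take its vectorizer and classifier steps
    if vectorizer is None:
        vectorizer, classifier = classifier.steps[0][1], classifier.steps[-1][1]
    class_labels = classifier.classes_
    # get_feature_names() was renamed get_feature_names_out() in newer scikit-learn
    if hasattr(vectorizer, "get_feature_names_out"):
        feature_names = vectorizer.get_feature_names_out()
    else:
        feature_names = vectorizer.get_feature_names()
    # Newer scikit-learn no longer exposes coef_ on MultinomialNB; with two classes
    # it held the per-word log probabilities of the second class
    if hasattr(classifier, "coef_"):
        coefs = classifier.coef_[0]
    else:
        coefs = classifier.feature_log_prob_[1]
    ranked = sorted(zip(coefs, feature_names))
    for coef, feat in reversed(ranked[-n:]):
        print(class_labels[1], coef, feat)
    print()
    for coef, feat in ranked[:n]:
        print(class_labels[0], coef, feat)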

most_informative_features(clf, vectorizer)

In [None]:
%%capture --no-display
# Do not worry about this particular chunk of code

%matplotlib inline
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

def explain(entry, clf, vectorizer=None, n=10):
    if vectorizer is None:
        class_names = clf.steps[-1][1].classes_.tolist()
        pipeline = clf
    else:
        class_names = clf.classes_.tolist()
        pipeline = make_pipeline(vectorizer, clf)
    explainer = LimeTextExplainer(class_names=class_names)
    exp = explainer.explain_instance(entry, pipeline.predict_proba, num_features=n)
    exp.show_in_notebook()

explain(samples[0], clf, vectorizer)

That doesn't seem very good. How can we better assess and improve the performance of our model?

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
[`MultinomialNB`](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) has some hyper-parameters that can be fine-tuned. Play around with the `alpha` (from `0` to `1`) and `fit_prior` (`True` or `False`) to see if you can get "better" important features.
<!--
<br/>
* **Hint**: Y.*
-->
</p>
</div>

In [None]:
alpha = 0.1
fit_prior = False

most_informative_features(
    MultinomialNB(alpha=alpha, fit_prior=fit_prior).fit(X_train, y_train),
    vectorizer
)

## Model evaluation and selection

Changing a model's hyper-parameters without a way to assess its performance can get problematic. If, for example, you find more informative features but the classifier fails 50% of the time, it doesn't really matter how good you think those features are. In some cases, high-performing classifiers make use of unexpected or counterintuitive features.

The only real way to assess its performance is to make the classifier predict labels for unseen data, and then compare the predicted labels with the real labels. This is why separating training and testing data is so important.

In [None]:
y_pred = clf.predict(X_test)
y_pred

In [None]:
print("Label     Predicted   Result\n-----     ---------   ------")
for i, real_label in enumerate(y_test):
    predicted_label = y_pred[i]
    if real_label == predicted_label:
        result =  "hit"
    else:
        result = "miss"
    print(real_label, " ", predicted_label, "   ", result)

While hit vs. miss gives an idea of what's going on, it doesn't tell us much about the nature of the failures. Were the `romance` texts being classified as `fiction`, or the other way around? This might not seem very important, but if instead of texts you were trying to test whether a person has a deadly disease or not (here the labels are positive, has the disease, or negative, does not have the disease), you might be more interested in classifiers with a lower false negative rate.

Taking this into account, we could count how many `romance` texts have been correctly classified as `romance` (we will call this `TR`, as in true romance), and how many have been incorrectly classified as `fiction` (`FF`, as in false fiction); and the same for `fiction`: how many `fiction` texts have been classified as `fiction` (`TF`, as in true fiction), and how many as `romance` (`FR`, as in false romance).

In [None]:
print("Label     Predicted   Result\n-----     ---------   ------")
counts = {"TR": 0, "TF": 0, "FF": 0, "FR": 0}
for i, real_label in enumerate(y_test):
    predicted_label = y_pred[i]
    if real_label == predicted_label:
        if real_label == "romance":
            result =  "TR"
        else:
            result = "TF"
    else:
        if real_label == "romance":
            result =  "FF"
        else:
            result = "FR"
    counts[result] += 1
    print(real_label, " ", predicted_label, "   ", result)
counts

In [None]:
print(" TF FR")
print([counts["TF"], counts["FR"]])
print([counts["FF"], counts["TR"]])
print(" FF TR")

Fortunately, `scikit-learn` comes with a function that calculates just that table, which is called the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix

print(" TF FR")
print(confusion_matrix(y_test, y_pred))
print(" FF TR")

In [None]:
(TF, FR), (FF, TR) = confusion_matrix(y_test, y_pred)
print(" TF FR")
print([TF, FR])
print([FF, TR])
print(" FF TR")
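
The same unpacking works for the disease example mentioned earlier. The sketch below uses made-up test results and passes `labels=` to `confusion_matrix` so that the order of rows and columns is explicit; the variable names `tn`, `fp`, `fn`, and `tp` are chosen for this example.

```python
from sklearn.metrics import confusion_matrix

# Made-up test results for the disease example
y_true = ["negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["negative", "positive", "positive", "negative", "positive", "negative"]

# labels= fixes the order of the rows (real labels) and columns (predicted labels)
matrix = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
(tn, fp), (fn, tp) = matrix

# Share of people who have the disease that the classifier missed
false_negative_rate = fn / (fn + tp)
print(matrix)
print("False negative rate:", false_negative_rate)
```

Two classifiers with the same number of hits can have very different false negative rates, which is why the full table is worth inspecting when one kind of error is costlier than the other.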

Depending on your problem, inspecting the confusion matrix might not be necessary, as in the case of distinguishing a `romance` from a `fiction` text, as long as the overall performance is good. **Accuracy** is one of those measures that gives an overall value of performance. If the accuracy is 0, every prediction is wrong; if it is 0.5, the classifier fails 50% of the time, which for two equally frequent classes is about as good as a random classifier; if it is 1, the predictions are perfect.

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Accuracy is defined as the ratio of hits to the total number of predictions. Using the numbers in the confusion matrix above, do the math to get the accuracy of our classifier.
<br/>
**Hint**: Use the variables `TR`, `TF`, `FR`, and `FF`.
</p>
</div>

In [None]:
accuracy = ... / ...
accuracy

In [None]:
print("{:.4}%".format(accuracy * 100))

Or you can do it the `scikit-learn` way.

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

In [None]:
alpha = 0.0000001
fit_prior = False
tuned_clf = MultinomialNB(alpha=alpha, fit_prior=fit_prior).fit(X_train, y_train)
tuned_y_pred = tuned_clf.predict(X_test)
accuracy_score(y_test, tuned_y_pred)

What are other moving parts involved in our workflow? Let's look at our vectorizer.

In [None]:
def get_accuracy(X, y, vectorizer):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    clf = MultinomialNB()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    return accuracy_score(y_test, y_pred)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
The function `get_accuracy()` defined above takes input data in `X` and its labels in `y`, and returns the accuracy of a `MultinomialNB` using the `vectorizer` passed in as another argument. Using the function `get_accuracy()`, complete the code cell below so it performs a search over vocabulary size (10000, 1000, 500, 300, 150, 100), both using and ignoring stop words (prepositions, articles, etc.). Which set of parameters performs best?
</p>
</div>

In [None]:
for stopwords in [None, 'english']:
    for vocabulary_size in ...:
        print("stopwords:", ...)
        print("vocabulary_size:", ...)
        vectorizer = CountVectorizer(...)
        accuracy = ...
        print("accuracy:", accuracy)
        print()

However, the split itself might also have an effect on the model performance. One common strategy to reduce the influence of any single split is to divide the dataset several times into different **folds**, calculate whatever measure you want on each, and then average the scores. This process is called **cross-validation**, and it also helps us notice when the model has learned the training data too well (**over-fitting**).

In order to cross-validate our model, we need to create a pipeline in `scikit-learn` by combining both the vectorizer and the classifier, then calling the `cross_val_score` function.
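
To show what `cross_val_score` does with a pipeline, the sketch below repeats the same procedure by hand: for each fold, a fresh copy of the pipeline is fitted on the training part and scored on the held-out part. The twelve short texts are made up, and the manual loop assumes stratified folds without shuffling, which is what `cross_val_score` uses for classifiers when `cv` is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = np.array([
    "win money now", "cheap money offer", "free prize money",
    "win a free prize", "cheap offer now", "money prize offer",
    "meeting at noon", "lunch meeting tomorrow", "project meeting notes",
    "lunch at noon", "notes for tomorrow", "project lunch tomorrow",
])
labels = np.array(["spam"] * 6 + ["ham"] * 6)

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Manual cross-validation: the vectorizer is re-fitted inside every fold,
# so the held-out texts never influence the vocabulary
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(texts, labels):
    fold_model = clone(pipe).fit(texts[train_idx], labels[train_idx])
    manual_scores.append(fold_model.score(texts[test_idx], labels[test_idx]))

library_scores = cross_val_score(pipe, texts, labels, cv=3)
print("manual: ", np.mean(manual_scores))
print("library:", library_scores.mean())
```

Wrapping the vectorizer and the classifier in one pipeline is what lets each fold repeat the whole fit-and-transform sequence on its own training data.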

In [None]:
from sklearn.model_selection import cross_val_score

for stopwords in [None, 'english']:
    for vocabulary_size in [10000, 1000, 500, 300, 150, 100]:
        print("stopwords:", stopwords)
        print("vocabulary_size:", vocabulary_size)
        vectorizer = CountVectorizer(
            max_features=vocabulary_size, stop_words=stopwords)
        
        pipe = make_pipeline(vectorizer, MultinomialNB())
        accuracies = cross_val_score(pipe, X, y, cv=10)
        
        print("accuracy:", accuracies.mean())
        print()

By looking at this data, which is more reliable, it looks like we could get a classifier with up to ~75% accuracy. However, if we want a more intuitive model with better explanatory power, we could go with the model that filters out stop words, although its accuracy is only ~65%.

In [None]:
vectorizer = CountVectorizer(max_features=100, stop_words="english")
X_train = vectorizer.fit_transform(X_train_texts)
X_test = vectorizer.transform(X_test_texts)

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

most_informative_features(clf, vectorizer, n=10)
entry = -1
print()
print("Label:", y_test[entry])
print("Predicted:", y_pred[entry])
explain(X_test_texts[entry], clf, vectorizer)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
In the code cell below, add a new for loop to the parameters search using cross-validation so other models (`LinearSVC`, `SVC`, and `RandomForestClassifier` with default parameters) are also used. What's the best model and parameters for our dataset?
</p>
</div>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC

for ...:
    for stopwords in [None, 'english']:
        for vocabulary_size in [10000, 1000, 500, 300, 150, 100]:
            print("model:", ...)
            print("stopwords:", stopwords)
            print("vocabulary_size:", vocabulary_size)
            vectorizer = CountVectorizer(
                max_features=vocabulary_size, stop_words=stopwords)

            pipe = make_pipeline(vectorizer, ...)
            accuracies = cross_val_score(pipe, X, y, cv=10)

            print("accuracy:", accuracies.mean())
            print()

<figure>
  <img  src="http://scikit-learn.org/stable/_static/ml_map.png" />  <figcaption><div align="right" style="padding-top: 4px;">&mdash; Source: <a href="http://scikit-learn.org/stable/tutorial/machine_learning_map/">`scikit-learn`: Choosing the right estimator</a></div></figcaption>
</figure>