# Workshop: Machine learning with text in scikit-learn
## Outline

1. [Introduction to supervised learning in scikit-learn](#Introduction-to-supervised-learning-in-scikit-learn)
1. [From text to feature vectors](#From-text-to-feature-vectors)
1. [Classifying creditors from the Czech Insolvency Register](#Classifying-creditors-from-the-Czech-Insolvency-Register)
    1. [Loading and preprocessing the dataset](#Loading-and-preprocessing-the-dataset)
    1. [Vectorizing the dataset](#Vectorizing-the-dataset)
    1. [Building and evaluating the model](#Building-and-evaluating-the-model)
    1. [Examining the model](#Examining-the-model)
1. [Topics not covered](#Topics-not-covered)

In [None]:
import pandas as pd
import unicodedata
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
# for Python 2 users
from __future__ import print_function

## Introduction to supervised learning in scikit-learn

**From <a href="https://en.wikipedia.org/wiki/Supervised_learning">Wikipedia</a>:**<br>
**Supervised learning** is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value.

**Note:** We will consider a classification task, i.e., samples belong to two or more classes that we want to predict.

In [None]:
# Load the iris dataset.
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
X = iris.data      # feature matrix, one sample per row
y = iris.target    # target vector

In [None]:
# Let's examine the shapes of X and y.
print(X.shape)
print(y.shape)

In [None]:
n_features = X.shape[1]
n_features

In [None]:
# Nicer overview of our dataset.
dataset = pd.DataFrame(X, columns=iris.feature_names)
dataset["label"] = y
dataset.head()

In [None]:
# Let's examine the target vector
print(y)

In [None]:
# Init logistic regression model with default params.
clf = LogisticRegression()

# Fit the model. 
clf.fit(X, y)

In [None]:
# Take a sample from the training data.
s = X[0]
s

In [None]:
# Let's try to predict the target value for this sample.
clf.predict([s])

**Note:** In scikit-learn, an estimator for classification is a Python object that implements the methods **fit(X, y)** and **predict(samples)**. 

**To summarize the general process:**
1. Get a dataset in form **X** (feature matrix) and **y** (target variable)
2. Pick a model and fit it using **.fit(X, y)**
3. Predict values of new, unobserved samples using **.predict(samples)**

You can also check the basic introduction to ML with scikit-learn in the <a href="http://scikit-learn.org/stable/tutorial/basic/tutorial.html">documentation</a>.
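
As a minimal sketch of the three steps above, the following holds out part of the iris data so that the prediction step runs on samples the model has not seen during fitting. The `test_size`, `random_state`, and `max_iter` values are illustrative assumptions, not settings used elsewhere in this workshop.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Get the dataset as X (feature matrix) and y (target vector).
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 25% of the samples to play the role of new, unobserved data.
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2. Pick a model and fit it on the training part only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# 3. Predict the classes of the held-out samples.
predictions = clf.predict(X_new)
print(predictions.shape)
```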

## From text to feature vectors

In [None]:
text_dataset = ["A coward judges all he sees by what he is.",
                "There are people who need people to need them.",
                "Never's the word God listens for when he needs a laugh."]

### Problem

From <a target="_blank" href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction">scikit-learn documentation</a>:<br>
Text Analysis is a major application field for machine learning algorithms. However the raw data, **a sequence of symbols cannot be fed directly to the algorithms themselves** as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

### Solution

From <a target="_blank" href="http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html">scikit-learn documentation</a>:<br>

In order to perform machine learning on text documents, we first need to turn the text content into **numerical feature vectors**.

The most intuitive way to do so is the **bags of words** representation:
1. assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
1. for each document **#i**, count the number of occurrences of each word w and store it in **X[i, j]** as the value of feature **#j** where **j** is the index of word **w** in the dictionary

The bags of words representation implies that **n_features** is the number of distinct words in the corpus: **this number is typically larger than 100,000**.

Fortunately, **most values in X will be zeros** since for a given document less than a couple thousands of distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.
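
The two steps above can be sketched by hand in plain Python. This is only an illustration on a made-up two-document corpus with a simple `\w+` tokenizer (both are assumptions); it is not how scikit-learn implements vectorization internally.

```python
import re

docs = ["the cat sat", "the cat saw the dog"]

# Step 1: assign a fixed integer id to each distinct word in the corpus.
vocabulary = {}
for doc in docs:
    for word in re.findall(r"\w+", doc.lower()):
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: X[i][j] counts how often word j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in re.findall(r"\w+", doc.lower()):
        X[i][vocabulary[word]] += 1

# Sparse storage: keep only the non-zero counts as {(row, column): count}.
sparse = {(i, j): count
          for i, row in enumerate(X)
          for j, count in enumerate(row) if count}

print(vocabulary)
print(X)
print(len(sparse), "non-zero entries out of", len(X) * len(vocabulary))
```

Even in this tiny corpus some cells are zero; with a realistic vocabulary the zero cells dominate, which is why the sparse form pays off.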


We will use scikit-learn's **CountVectorizer** to convert text into a **matrix of token counts (document-term matrix)**:

In [None]:
# Init CountVectorizer with the default params.
vectorizer = CountVectorizer()

In [None]:
# Learn the vocabulary from the text data.
vectorizer.fit(text_dataset)

In [None]:
# Examine the vocabulary.
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print("Vocabulary size: {0}".format(len(vocabulary)))
print("Vocabulary:")
print(vocabulary)

In [None]:
# Transform text data into a document-term matrix.
dtm = vectorizer.transform(text_dataset)
dtm

In [None]:
# Let's examine the obtained document-term matrix.
pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())

**Summary:**<br>
**Vectorization** is a general process of turning a collection of text documents into numerical feature vectors.<br>
**CountVectorizer** is one of the vectorizers available in scikit-learn.<br>
All vectorizers are used as follows:
* use **.fit(data)** to learn the vocabulary
* use **.transform(data)** to build the document-term matrix from text data
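
One consequence of this fit/transform split is worth seeing on a small case: words that were not in the data passed to `.fit` are silently ignored by `.transform`. The sketch below uses made-up documents (an assumption for illustration) and the default `CountVectorizer` settings, which also drop single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["people need people", "a coward judges"]
new_docs = ["people judge cowards"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and builds the matrix in one call.
dtm_train = vectorizer.fit_transform(train_docs)

# transform reuses the learned vocabulary; unseen words get no column.
dtm_new = vectorizer.transform(new_docs)

print(sorted(vectorizer.vocabulary_))
print(dtm_train.toarray())
print(dtm_new.toarray())
```

Only "people" from the new document matches the learned vocabulary, so the new row has a single non-zero count.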

## Classifying creditors from the Czech Insolvency Register

### Loading and preprocessing the dataset

In [None]:
# Load the dataset with pandas.
dataset = pd.read_table("./data/receivables.tsv", encoding="utf-8", header=0)
dataset.info()

In [None]:
# Check the dataset.
dataset.head()

In [None]:
# Check the number of samples per class.
dataset.groupby('creditor').count()

In [None]:
# Since we have texts written in Czech in the dataset, let's remove the accents (diacritics) from the text first.
def remove_accents(s):
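    # Decompose accented characters into base letters plus combining marks.
    nfkd_form = unicodedata.normalize('NFKD', s)
    # Drop the non-ASCII combining marks and decode back to a text string.
    return nfkd_form.encode('ASCII', 'ignore').decode('ASCII')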

dataset["text"] = dataset["text"].apply(remove_accents)

In [None]:
# Check the dataset without accents again.
dataset.head()

In [None]:
# We will work with numerical labels, so let's convert
# the creditors' names to numbers first using LabelEncoder.
label_encoder = LabelEncoder()
numeric_labels = label_encoder.fit_transform(dataset['creditor'])
numeric_labels

In [None]:
# Check the classes.
label_encoder.classes_

In [None]:
# Example usage of label_encoder.
label_encoder.transform(["CSOB"])
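
The mapping also works in the other direction, which is handy for turning predicted numbers back into creditor names. A minimal sketch on a three-item list (a made-up stand-in for the `creditor` column, reusing two names that appear in this workshop):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for dataset['creditor'].
creditors = ["CSOB", "Ceska sporitelna", "CSOB"]

encoder = LabelEncoder()
codes = encoder.fit_transform(creditors)

# classes_ holds the sorted unique labels; a label's code is its position there.
print(encoder.classes_)
print(codes)

# inverse_transform maps numeric codes back to the original names.
names = encoder.inverse_transform([1, 0])
print(names)
```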

In [None]:
# Add the numeric_labels to the dataset so that we have everything in one place.
dataset["numeric_label"] = numeric_labels
dataset.head()

In [None]:
# Get the feature vectors and target variables.
# The feature vector still contains just raw texts.
X = dataset["text"]
y = dataset["numeric_label"]
print(X.shape)
print(y.shape)

In [None]:
# Split X and y into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Vectorizing the dataset

In [None]:
# Load the prepared list of stopwords for the Czech language.
stopwords = pd.read_csv('data/stopwords_cz.txt', encoding='utf-8', header=None, names=["word"])
stopwords["word"] = stopwords["word"].apply(remove_accents)
print("Number of stopwords: {0}".format(len(stopwords)))
stopwords.head()

In [None]:
# Initialize the CountVectorizer, this time with customized params.
vectorizer = CountVectorizer(lowercase=True,
                             ngram_range=(1,3),
                             stop_words=list(stopwords["word"].values),
                             max_df = 0.5,
                             min_df = 30,
                             tokenizer = lambda x: re.split("[\r\t\n .,;:'\"()?!/]+", x))



**Parameters:**
* **lowercase** - convert all characters to lowercase before tokenizing.
* **ngram_range** - the lower and upper boundary of the range of n-values for different n-grams to be extracted. 
* **stop_words** - list of stopwords which will be removed from the vocabulary.
* **max_df** - ignore terms that have a document frequency strictly higher than this threshold (float from [0.0, 1.0] for a relative value or integer for an absolute value).
* **min_df** - ignore terms that have a document frequency strictly lower than this threshold (float from [0.0, 1.0] for a relative value or integer for an absolute value).
* **tokenizer** - used to specify a custom tokenization (i.e. splitting text to words) step.
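
To get a feel for how these parameters shape the vocabulary, here is a small sketch on a made-up four-document corpus. The corpus and the thresholds are assumptions chosen so the effect is easy to follow; they are unrelated to the values used for the insolvency data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["red apple", "red car", "red apple pie", "green car"]

# Unigrams only, no frequency filtering.
plain = CountVectorizer().fit(docs)

# "red" occurs in 3 of 4 documents (> 50%), so max_df=0.5 drops it.
# "pie" and "green" occur in 1 document (< 2), so min_df=2 drops them.
filtered = CountVectorizer(max_df=0.5, min_df=2).fit(docs)

# ngram_range=(1, 2) keeps single words and adds pairs of adjacent words.
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
print(sorted(with_bigrams.vocabulary_))
```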

In [None]:
# Learn the vocabulary and check its size.
vectorizer.fit(X_train)
len(vectorizer.get_feature_names_out())

In [None]:
# Transform train data into a document-term matrix.
X_train_dtm = vectorizer.transform(X_train)
X_train_dtm

In [None]:
# Transform test data into a document-term matrix.
X_test_dtm = vectorizer.transform(X_test)
X_test_dtm

### Building and evaluating the model

In [None]:
# Init logistic regression model, this time with slightly changed params.
# The L1 penalty needs a solver that supports it, such as liblinear.
clf = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')

In [None]:
# Train the model and time it with IPython magic command.
%time clf.fit(X_train_dtm, y_train)

In [None]:
# Make predictions for test data.
y_predictions = clf.predict(X_test_dtm)

In [None]:
# Calculate the accuracy of your predictions.
metrics.accuracy_score(y_test, y_predictions)

In [None]:
# Print the confusion matrix.
pd.DataFrame(metrics.confusion_matrix(y_test, y_predictions), 
             index=label_encoder.classes_, columns=label_encoder.classes_)
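
The accuracy and the confusion matrix are two views of the same predictions: correct predictions sit on the diagonal of the matrix, so accuracy is the diagonal sum divided by the total count. A small sketch with made-up labels (an assumption for illustration) shows the relationship:

```python
from sklearn import metrics

# Made-up true and predicted labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions lie on the diagonal.
accuracy_from_cm = float(cm.trace()) / cm.sum()
accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy_from_cm, accuracy)
```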

### Examining the model
Can we somehow find out what the model has actually learned?

In [None]:
# Check the logistic regression's parameters.
print(clf.coef_.shape)
print(clf.coef_)

In [None]:
# Get list of feature names and classes.
feature_names = vectorizer.get_feature_names_out()
classes = label_encoder.classes_

In [None]:
# Create a readable version of our model and check its learned parameters.
readable_model = pd.DataFrame(clf.coef_.transpose(), columns=classes)
readable_model.insert(0, "ngram", feature_names)
readable_model[7000:8000]

In [None]:
# Examine parameters for specific class.
readable_model.sort_values(by="Ceska sporitelna", ascending=False).head(10)

What has the model learned? Check
<a target="_blank" href="https://www.google.cz/maps/@50.0448653,14.4483488,3a,75y,195.68h,98.1t/data=!3m6!1e1!3m4!1s0K232cRC0B4x6gwNN8GG7Q!2e0!7i13312!8i6656!6m1!1e1?hl=en">Olbrachtova 1929, Praha 4</a>.
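
The same inspection technique can be tried on a toy problem where the answer is obvious. The documents and class names below are invented for illustration (they are not taken from the insolvency data), and the classifier is fitted directly on string labels with default settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented documents and classes, two documents per class.
docs = ["loan bank loan", "bank account loan",
        "phone bill", "phone data bill",
        "rent flat", "flat rent deposit"]
labels = ["bank", "bank", "telco", "telco", "landlord", "landlord"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Order the terms by their column index in the document-term matrix.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# coef_ has one row per class and one column per term; transpose for reading.
weights = pd.DataFrame(clf.coef_.T, index=terms, columns=clf.classes_)

# Terms with the largest weights push a document toward the "bank" class.
top_bank = weights["bank"].sort_values(ascending=False).head(3)
print(top_bank)
```

The highest-weighted terms for a class are the ones that occur mostly in that class's documents, which is the same pattern to look for in the creditor model.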

## Topics not covered
* Advanced text preprocessing.
* Better model evaluation.
* Comparison of different models.
* (Meta) Parameter optimization of each model.
* ... and much much more.