# Workshop: Machine learning with text in Scikit-learn
## Outline

1. [Introduction to supervised learning in scikit-learn](#Introduction-to-supervised-learning-in-scikit-learn)
1. [From text to feature vectors](#From-text-to-feature-vectors)
1. [Classifying creditors from the Czech Insolvency Register](#Classifying-creditors-from-the-Czech-Insolvency-Register)
    1. [Loading and preprocessing the dataset](#Loading-and-preprocessing-the-dataset)
    1. [Vectorizing the dataset](#Vectorizing-the-dataset)
    1. [Building and evaluating the model](#Building-and-evaluating-the-model)
    1. [Examining the model](#Examining-the-model)
1. [Topics not covered](#Topics-not-covered)

In [1]:
import pandas as pd
import unicodedata
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [2]:
# for Python 2 users
from __future__ import print_function

## Introduction to supervised learning in scikit-learn

**From <a href="https://en.wikipedia.org/wiki/Supervised_learning">Wikipedia</a>:**<br>
**Supervised learning** is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value.

**Note:** We will consider a classification task, i.e., samples belong to two or more classes that we want to predict.

In [3]:
# Load the iris dataset.
from sklearn.datasets import load_iris
iris = load_iris()

In [4]:
X = iris.data      # feature matrix, one sample per row
y = iris.target    # target vector

In [5]:
# Let's examine the shapes of X and y.
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [6]:
n_features = X.shape[1]
n_features

4

In [7]:
# Nicer overview of our dataset.
dataset = pd.DataFrame(X, columns=iris.feature_names)
dataset["label"] = y
dataset.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [8]:
# Let's examine the target vector
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [9]:
# Init logistic regression model with default params.
clf = LogisticRegression()

# Fit the model. 
clf.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [10]:
# Take a sample from the training data.
s = X[0]
s

array([ 5.1,  3.5,  1.4,  0.2])

In [11]:
# Let's try to predict the target value for this sample.
clf.predict([s])

array([0])

**Note:** In scikit-learn, an estimator for classification is a Python object that implements the methods **fit(X, y)** and **predict(samples)**. 

**To summarize the general process:**
1. Get a dataset in form **X** (feature matrix) and **y** (target variable)
2. Pick a model and fit it using **.fit(X, y)**
3. Predict values of new, unobserved samples using **.predict(samples)**
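
The three steps above can be sketched end to end on the iris data. This is only an illustration: the held-out split, `test_size=30`, `random_state=0` and `max_iter=1000` are assumptions made for the example and are not part of the workshop code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Get the dataset as X (feature matrix) and y (target vector).
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30 samples so that step 3 really predicts unseen data.
# (test_size and random_state are arbitrary illustrative choices.)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=30, random_state=0)

# 2. Pick a model and fit it.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# 3. Predict the classes of the held-out samples.
predicted = clf.predict(X_new)
print(predicted.shape)
```

Predicting on samples the model has not seen during fitting is what the later sections do with the training and testing sets.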

You can also check the basic introduction to ML with scikit-learn in the <a href="http://scikit-learn.org/stable/tutorial/basic/tutorial.html">documentation</a>.

## From text to feature vectors

In [12]:
text_dataset = ["A coward judges all he sees by what he is.",
                "There are people who need people to need them.",
                "Never's the word God listens for when he needs a laugh."]

### Problem

From <a target="_blank" href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction">scikit-learn documentation</a>:<br>
Text Analysis is a major application field for machine learning algorithms. However the raw data, **a sequence of symbols cannot be fed directly to the algorithms themselves** as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

### Solution

From <a target="_blank" href="http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html">scikit-learn documentation</a>:<br>

In order to perform machine learning on text documents, we first need to turn the text content into **numerical feature vectors**.

The most intuitive way to do so is the **bags of words** representation:
1. assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
1. for each document **#i**, count the number of occurrences of each word w and store it in **X[i, j]** as the value of feature **#j** where **j** is the index of word **w** in the dictionary

The bags of words representation implies that **n_features** is the number of distinct words in the corpus: **this number is typically larger than 100,000**.

Fortunately, **most values in X will be zeros** since for a given document less than a couple thousands of distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.
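
To make the two steps concrete, here is a minimal pure-Python sketch of a bag-of-words representation that stores only the non-zero counts. The `tokenize` helper and the dict-per-document layout are assumptions chosen for readability; scikit-learn uses its own tokenizer and SciPy sparse matrices instead.

```python
import re

documents = ["A coward judges all he sees by what he is.",
             "There are people who need people to need them."]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    # (an approximation of CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: assign a fixed integer id to every distinct word.
vocabulary = {}
for doc in documents:
    for word in tokenize(doc):
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: store only the non-zero counts, one {column: count} dict per document.
sparse_rows = []
for doc in documents:
    row = {}
    for word in tokenize(doc):
        j = vocabulary[word]
        row[j] = row.get(j, 0) + 1
    sparse_rows.append(row)

print(len(vocabulary))                     # number of features
print(sparse_rows[0][vocabulary["he"]])    # count of "he" in the first document
```

Here ids are handed out in order of first appearance, so the column order is specific to this sketch.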


We will use scikit-learn's **CountVectorizer** to convert text into a **matrix of token counts (document-term matrix)**:

In [13]:
# Init CountVectorizer with the default params.
vectorizer = CountVectorizer()

In [14]:
# Learn the vocabulary from the text data.
vectorizer.fit(text_dataset)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [15]:
# Examine the vocabulary.
vocabulary = vectorizer.get_feature_names()
print("Vocabulary size: {0}".format(len(vocabulary)))
print("Vocabulary:")
print(vocabulary)

Vocabulary size: 24
Vocabulary:
[u'all', u'are', u'by', u'coward', u'for', u'god', u'he', u'is', u'judges', u'laugh', u'listens', u'need', u'needs', u'never', u'people', u'sees', u'the', u'them', u'there', u'to', u'what', u'when', u'who', u'word']


In [16]:
# Transform text data into a document-term matrix.
dtm = vectorizer.transform(text_dataset)
dtm

<3x24 sparse matrix of type '<type 'numpy.int64'>'
	with 25 stored elements in Compressed Sparse Row format>

In [17]:
# Let's examine the obtained document-term matrix.
pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names())

Unnamed: 0,all,are,by,coward,for,god,he,is,judges,laugh,...,people,sees,the,them,there,to,what,when,who,word
0,1,0,1,1,0,0,2,1,1,0,...,0,1,0,0,0,0,1,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,2,0,0,1,1,1,0,0,1,0
2,0,0,0,0,1,1,1,0,0,1,...,0,0,1,0,0,0,0,1,0,1


**Summary:**<br>
**Vectorization** is a general process of turning a collection of text documents into numerical feature vectors.<br>
**CountVectorizer** is one of the vectorizers available in scikit-learn.<br>
All vectorizers are used as follows:
* use **.fit(data)** to learn the vocabulary
* use **.transform(data)** to build the document-term matrix from text data
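
One consequence of this two-step usage is that words not seen during `.fit` are silently ignored by `.transform`. The sketch below illustrates this on made-up sentences (the texts are assumptions for the example); it reads the learned vocabulary through the `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["people need people", "he needs a laugh"]
new_texts = ["people need a holiday"]    # "holiday" was never seen during fit

vectorizer = CountVectorizer()
vectorizer.fit(train_texts)              # learn the vocabulary
dtm = vectorizer.transform(new_texts)    # build the document-term matrix

# vocabulary_ maps each learned word to its column index.
vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(dtm.shape)
print(dtm[0, vocab["people"]], dtm[0, vocab["need"]])
```

When the same data is used for both steps, `.fit_transform(data)` performs them in a single call.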

## Classifying creditors from the Czech Insolvency Register

### Loading and preprocessing the dataset

In [18]:
# Load the dataset with pandas.
dataset = pd.read_table("./data/receivables.tsv", encoding="utf-8", header=0)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 3 columns):
documentId    3500 non-null int64
creditor      3500 non-null object
text          3500 non-null object
dtypes: int64(1), object(2)
memory usage: 82.1+ KB


In [19]:
# Check the dataset.
dataset.head()

Unnamed: 0,documentId,creditor,text
0,3771532,Cetelem CR,přihláška pohledávky sülldí krajský soud v brn...
1,435902,Cetelem CR,krajsky soud v. ceskych nugajuariaau _ -i;i5;'...
2,4354526,Cetelem CR,liřlhll-íšiíá puhledáxřke” krajský soud v ústi...
3,46081,Cetelem CR,prihlaska pohledaefkt' 501152 krajsky soud v c...
4,4764086,Cetelem CR,přihláška pohledávky soud: krajský soud v plzn...


In [20]:
# Check the number of samples per class.
dataset.groupby('creditor').count()

creditor,documentId,text
CEZ Prodej,250,250
CSOB,250,250
Ceska sporitelna,250,250
Cetelem CR,250,250
Cofidis,250,250
Essox,250,250
GE Money Bank,250,250
Home Credit,250,250
Komercni banka,250,250
Profidebt,250,250


In [21]:
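# The texts in the dataset are written in Czech, so let's remove the accents (diacritics) first.
def remove_accents(s):
    # NFKD decomposition separates base letters from their combining marks.
    nfkd_form = unicodedata.normalize('NFKD', s)
    # Drop the non-ASCII marks and decode back to text (str rather than bytes on Python 3).
    ascii_string = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
    return ascii_string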

dataset["text"] = dataset["text"].apply(remove_accents)
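
A quick way to see what accent removal does is to run it on a few strings. The helper below is a standalone sketch under the hypothetical name `strip_accents`, so it can be tried without loading the dataset.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into a base letter plus combining marks;
    # encoding to ASCII with errors='ignore' then drops the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("přihláška pohledávky"))   # prihlaska pohledavky
print(strip_accents("Česká spořitelna"))       # Ceska sporitelna
```

Note that this approach drops any non-ASCII character that has no decomposition (for example the Polish "ł"), which is acceptable for Czech but worth checking for other languages.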

In [22]:
# Check the dataset without accents again.
dataset.head()

Unnamed: 0,documentId,creditor,text
0,3771532,Cetelem CR,prihlaska pohledavky sulldi krajsky soud v brn...
1,435902,Cetelem CR,krajsky soud v. ceskych nugajuariaau _ -i;i5;'...
2,4354526,Cetelem CR,lirlhll-isiia puhledaxrke krajsky soud v usti ...
3,46081,Cetelem CR,prihlaska pohledaefkt' 501152 krajsky soud v c...
4,4764086,Cetelem CR,prihlaska pohledavky soud: krajsky soud v plzn...


In [23]:
# The model will be trained on numeric labels, so let's convert
# the creditors' names to numbers first using LabelEncoder.
label_encoder = LabelEncoder()
numeric_labels = label_encoder.fit_transform(dataset['creditor'])
numeric_labels

array([ 3,  3,  3, ..., 12, 12, 12])

In [24]:
# Check the classes.
label_encoder.classes_

array([u'CEZ Prodej', u'CSOB', u'Ceska sporitelna', u'Cetelem CR',
       u'Cofidis', u'Essox', u'GE Money Bank', u'Home Credit',
       u'Komercni banka', u'Profidebt', u'T-Mobile CR', u'VZP', u'unknown'], dtype=object)

In [25]:
# Example usage of label_encoder.
label_encoder.transform(["CSOB"])

array([1])
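
The encoding can also be reversed, which is handy when turning predictions back into creditor names. A minimal sketch on a toy list of labels (the list is an assumption for the example):

```python
from sklearn.preprocessing import LabelEncoder

creditors = ["CSOB", "Essox", "CSOB", "Cofidis"]   # toy labels

encoder = LabelEncoder()
codes = encoder.fit_transform(creditors)

print(list(encoder.classes_))                    # sorted class names
print(list(codes))                               # position of each label in classes_
print(list(encoder.inverse_transform([2, 0])))   # numbers back to names
```

Each numeric label is simply the index of the class name in the sorted `classes_` array.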

In [26]:
# Add the numeric_labels to the dataset so that we have everything in one place.
dataset["numeric_label"] = numeric_labels
dataset.head()

Unnamed: 0,documentId,creditor,text,numeric_label
0,3771532,Cetelem CR,prihlaska pohledavky sulldi krajsky soud v brn...,3
1,435902,Cetelem CR,krajsky soud v. ceskych nugajuariaau _ -i;i5;'...,3
2,4354526,Cetelem CR,lirlhll-isiia puhledaxrke krajsky soud v usti ...,3
3,46081,Cetelem CR,prihlaska pohledaefkt' 501152 krajsky soud v c...,3
4,4764086,Cetelem CR,prihlaska pohledavky soud: krajsky soud v plzn...,3


In [27]:
# Get the feature vectors and target variables.
# The feature vector still contains just raw texts.
X = dataset["text"]
y = dataset["numeric_label"]
print(X.shape)
print(y.shape)

(3500,)
(3500,)


In [28]:
# Split X and y into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(2625,)
(875,)
(2625,)
(875,)
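
The split above is random and uses the default 25% test share (875 of 3500 samples). If a reproducible split that keeps the class proportions is wanted, `random_state` and `stratify` can be passed. The sketch below uses toy data, and the value 42 is an arbitrary choice.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

texts = ["doc {0}".format(i) for i in range(40)]   # toy documents
labels = [0] * 20 + [1] * 20                       # two balanced classes

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.25,      # the default when neither size is given
    stratify=labels,     # keep class proportions in both parts
    random_state=42)     # make the split repeatable

print(len(X_train), len(X_test))
print(Counter(y_test))
```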


### Vectorizing the dataset

In [29]:
# Load the prepared list of stopwords for the Czech language.
stopwords = pd.read_csv('data/stopwords_cz.txt', encoding='utf-8', header=None, names=["word"])
stopwords["word"] = stopwords["word"].apply(remove_accents)
print("Number of stopwords: {0}".format(len(stopwords)))
stopwords.head()

Number of stopwords: 256


Unnamed: 0,word
0,ackoli
1,ahoj
2,ale
3,anebo
4,ano


In [30]:
# Initialize the CountVectorizer, this time with customized params.
vectorizer = CountVectorizer(lowercase=True,
                             ngram_range=(1,3),
                             stop_words=list(stopwords["word"].values),
                             max_df = 0.5,
                             min_df = 30,
                             tokenizer = lambda x: re.split("[\r\t\n .,;:'\"()?!/]+", x))



**Parameters:**
* **lowercase** - convert all characters to lowercase before tokenizing.
* **ngram_range** - the lower and upper boundary of the range of n-values for different n-grams to be extracted. 
* **stop_words** - list of stopwords which will be removed from the vocabulary.
* **max_df** - ignore terms that have a document frequency strictly higher than this threshold (float from [0.0, 1.0] for relative value or integer for absolute value).
* **min_df** - ignore terms that have a document frequency strictly lower than this threshold (float from [0.0, 1.0] for relative value or integer for absolute value).
* **tokenizer** - used to specify a custom tokenization (i.e. splitting text to words) step.
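
To get a feeling for **tokenizer** and **ngram_range**, the following pure-Python sketch applies the same separator pattern to a short string and builds the word n-grams by hand. The `word_ngrams` helper is hypothetical and only approximates what the vectorizer does internally.

```python
import re

def tokenize(text):
    # Same separator pattern as the tokenizer passed to CountVectorizer above.
    return re.split("[\r\t\n .,;:'\"()?!/]+", text)

def word_ngrams(tokens, min_n, max_n):
    # All runs of min_n..max_n consecutive tokens, joined by a space.
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = tokenize("krajsky soud v brne.")
print(tokens)

# Drop empty strings before building the n-grams.
clean = [t for t in tokens if t]
ngrams = word_ngrams(clean, 1, 3)
print(ngrams)
```

Because the text ends with a separator, `re.split` returns a trailing empty string. Filtering such tokens out, as done here, is one possible refinement of the tokenizer.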

In [31]:
# Learn the vocabulary and check its size.
vectorizer.fit(X_train)
len(vectorizer.get_feature_names())

13700

In [32]:
# Transform train data into a document-term matrix.
X_train_dtm = vectorizer.transform(X_train)
X_train_dtm

<2625x13700 sparse matrix of type '<type 'numpy.int64'>'
	with 1874777 stored elements in Compressed Sparse Row format>

In [33]:
# Transform test data into a document-term matrix.
X_test_dtm = vectorizer.transform(X_test)
X_test_dtm

<875x13700 sparse matrix of type '<type 'numpy.int64'>'
	with 603056 stored elements in Compressed Sparse Row format>

### Building and evaluating the model

In [34]:
# Init logistic regression model, this time with L1 regularization.
# The liblinear solver supports the L1 penalty.
clf = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')

In [35]:
# Train the model and time it with IPython magic command.
%time clf.fit(X_train_dtm, y_train)

CPU times: user 2.45 s, sys: 28 ms, total: 2.48 s
Wall time: 2.49 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [36]:
# Make predictions for test data.
y_predictions = clf.predict(X_test_dtm)

In [37]:
# Calculate the accuracy of the predictions.
metrics.accuracy_score(y_test, y_predictions)

0.9474285714285714
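
Accuracy is simply the fraction of test samples whose predicted label matches the true one; the score above corresponds to 829 correct predictions out of 875. A minimal sketch of the computation on toy labels (the `accuracy` helper is written for illustration only):

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the predicted label equals the true one.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(accuracy(y_true, y_pred))   # 0.8
```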

In [38]:
# Print the confusion matrix.
pd.DataFrame(metrics.confusion_matrix(y_test, y_predictions), 
             index=label_encoder.classes_, columns=label_encoder.classes_)

Unnamed: 0,CEZ Prodej,CSOB,Ceska sporitelna,Cetelem CR,Cofidis,Essox,GE Money Bank,Home Credit,Komercni banka,Profidebt,T-Mobile CR,VZP,unknown
CEZ Prodej,60,0,0,0,0,0,0,0,0,0,0,0,0
CSOB,0,56,0,0,0,0,0,0,0,0,0,0,6
Ceska sporitelna,0,0,50,0,0,0,0,0,0,0,0,0,1
Cetelem CR,0,0,1,66,0,0,0,1,0,0,0,0,1
Cofidis,0,0,0,0,60,0,0,0,0,0,0,0,2
Essox,0,0,0,0,0,59,0,0,0,0,0,0,1
GE Money Bank,0,0,0,0,1,0,68,0,0,0,0,0,2
Home Credit,0,0,0,1,1,0,0,56,0,0,0,0,3
Komercni banka,0,1,0,0,0,0,0,0,56,0,0,0,4
Profidebt,0,0,0,0,0,0,0,0,0,67,0,0,0
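
In the confusion matrix, rows are the true classes and columns are the predicted ones, so per-class precision and recall can be read off the column and row sums. For instance, 56 of the 62 CSOB test documents above were recognized correctly. The sketch below does the arithmetic on a small made-up matrix.

```python
# Toy 3-class confusion matrix: rows are true classes, columns are predictions.
cm = [[9, 1, 0],
      [0, 8, 2],
      [1, 0, 9]]

n = len(cm)
precision, recall = [], []
for k in range(n):
    row_total = sum(cm[k])                       # samples whose true class is k
    col_total = sum(cm[i][k] for i in range(n))  # samples predicted as class k
    recall.append(cm[k][k] / float(row_total))
    precision.append(cm[k][k] / float(col_total))

for k in range(n):
    print(k, round(precision[k], 2), round(recall[k], 2))
```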


### Examining the model
Can we find out what the model has actually learned?

In [39]:
# Check the logistic regression's parameters.
print(clf.coef_.shape)
print(clf.coef_)

(13, 13700)
[[ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [-0.04184195  0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.22476086  0.         ...,  0.          0.          0.        ]]
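
The shape of `coef_` is (number of classes, number of features): each row holds one weight per n-gram for one class. As a rough sketch of how these weights are used, the example below recomputes the per-class linear scores by hand on the iris data and checks that picking the highest score reproduces `predict`. The dataset choice and `max_iter=1000` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

print(model.coef_.shape)        # (n_classes, n_features)
print(model.intercept_.shape)   # (n_classes,)

# One linear score per class for every sample.
scores = X.dot(model.coef_.T) + model.intercept_

# The predicted class is the one with the highest score.
manual = model.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual, model.predict(X)))
```

A large positive weight therefore pushes a document towards the class whenever the corresponding n-gram occurs in it.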


In [40]:
# Get list of feature names and classes.
feature_names = vectorizer.get_feature_names()
classes = label_encoder.classes_

In [41]:
# Create a readable version of our model and check its learned parameters.
readable_model = pd.DataFrame(clf.coef_.transpose(), columns=classes)
readable_model.insert(0, "ngram", feature_names)
readable_model[7000:8000]

Unnamed: 0,ngram,CEZ Prodej,CSOB,Ceska sporitelna,Cetelem CR,Cofidis,Essox,GE Money Bank,Home Credit,Komercni banka,Profidebt,T-Mobile CR,VZP,unknown
7000,majetku,0.0,0.0,0.540936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7001,maleho,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7002,maleho bydliste,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7003,mama,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7004,mamma,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7005,man,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7006,mandanta,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7007,marcela,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7008,marie,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7009,martin,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.322980


In [42]:
# Examine parameters for specific class.
readable_model.sort_values(by="Ceska sporitelna", ascending=False).head(10)

Unnamed: 0,ngram,CEZ Prodej,CSOB,Ceska sporitelna,Cetelem CR,Cofidis,Essox,GE Money Bank,Home Credit,Komercni banka,Profidebt,T-Mobile CR,VZP,unknown
11123,sporitelna,0.0,0.0,3.355158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1676,14000,0.0,0.0,1.6797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1790,1929,0.0,0.0,1.647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10113,radny,0.0,0.0,1.335466,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9298,praha 4,0.0,0.0,1.240059,-0.794842,-0.092286,0.0,0.0,-1.135765,-1.056659,0.0,0.169751,-0.70201,0.0
7948,olbrachtova,0.0,0.0,1.046949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6399,knih,0.0,0.0,0.761867,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3732,autorita,0.781963,0.0,0.730911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10637,sestava,0.0,0.0,0.719761,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12935,vzniku smlouva,0.0,0.0,0.714382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
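
The same ranking can be produced for every class at once. The sketch below works on a tiny hand-written weight table; the numbers, the two class names and the `top_features` helper are made up for illustration and are not taken from the trained model.

```python
feature_names = ["sporitelna", "praha 4", "martin", "knih"]

# Hypothetical per-class weights, aligned with feature_names.
weights_by_class = {
    "Ceska sporitelna": [3.36, 1.24, 0.0, 0.76],
    "unknown":          [0.0, 0.0, 0.32, 0.0],
}

def top_features(names, weights, k):
    # Pair each n-gram with its weight and keep the k largest weights.
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

for class_name in sorted(weights_by_class):
    best = top_features(feature_names, weights_by_class[class_name], 2)
    print(class_name, [name for name, _ in best])
```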


What has the model learned? Check
<a target="_blank" href="https://www.google.cz/maps/@50.0448653,14.4483488,3a,75y,195.68h,98.1t/data=!3m6!1e1!3m4!1s0K232cRC0B4x6gwNN8GG7Q!2e0!7i13312!8i6656!6m1!1e1?hl=en">Olbrachtova 1929, Praha 4</a>.

## Topics not covered
* Advanced text preprocessing.
* Better model evaluation.
* Comparison of different models.
* Hyperparameter optimization for each model.
* ... and much much more.