## Advanced text mining with Python - Text Classification

In [1]:
import sys
print(sys.version)

3.5.4 |Anaconda 4.0.0 (64-bit)| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)]


In [2]:
import sklearn

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import metrics

## Part 1: Representing text as numerical data

In [3]:
# example text for model training (short descriptions of a university)
simple_train = ['The University of Chicago',
                'private research university in Chicago',
                'culturally rich and ethnically diverse coeducational research university']

In [4]:
simple_train

['The University of Chicago',
 'private research university in Chicago',
 'culturally rich and ethnically diverse coeducational research university']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

In scikit-learn we have a choice of four different feature extraction mechanisms:
* CountVectorizer - Convert a collection of text documents to a matrix of token counts
* HashingVectorizer - Convert a collection of text documents to a matrix of token occurrences
* TfidfTransformer - Transform a count matrix to a normalized TF or TF-IDF representation
* TfidfVectorizer - Convert a collection of raw documents to a matrix of TF-IDF features.  Equivalent to CountVectorizer followed by TfidfTransformer

### CountVectorizer

In [5]:
countvectorizer = CountVectorizer()
countvectorizer_matrix = countvectorizer.fit_transform(simple_train)
countvectorizer_matrix.shape

(3, 13)

In [6]:
countvectorizer_matrix_df = pd.DataFrame(countvectorizer_matrix.toarray(), columns=countvectorizer.get_feature_names())
countvectorizer_matrix_df

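| | and | chicago | coeducational | culturally | diverse | ethnically | in | of | private | research | rich | the | university |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |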


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
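
The bag-of-words idea quoted above can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: it assumes lowercasing and a token pattern that keeps tokens of two or more word characters, which matches the `token_pattern` shown in the `CountVectorizer` defaults later in this notebook. The names `tokenize`, `vocabulary` and `matrix` are made up for the example.

```python
import re
from collections import Counter

# Assumption: same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

docs = [
    "The University of Chicago",
    "private research university in Chicago",
    "culturally rich and ethnically diverse coeducational research university",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# one column per distinct token, in alphabetical order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# one row per document, holding the count of each vocabulary token
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Word order inside each document is discarded; only the per-document counts survive.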

#### Removing stop-words from CountVectorizer

In [7]:
countvectorizer = CountVectorizer(stop_words='english')
countvectorizer_matrix = countvectorizer.fit_transform(simple_train)
countvectorizer_matrix.shape

(3, 9)

In [8]:
countvectorizer_matrix_df = pd.DataFrame(countvectorizer_matrix.toarray(), columns=countvectorizer.get_feature_names())
countvectorizer_matrix_df

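| | chicago | coeducational | culturally | diverse | ethnically | private | research | rich | university |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |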


#### Adding N-Gram features to CountVectorizer
The `analyzer` parameter controls what the n-grams are built from:
* word - n-grams of word tokens
* char - n-grams of characters
* char_wb - character n-grams built only from text inside word boundaries; n-grams at the edges of words are padded with a space
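
As a rough sketch of what the `word` and `char` analyzers produce, the helpers below build n-grams by hand. They are hypothetical functions written for this example, and the character version is simplified (it does no whitespace normalization or word-boundary padding), so treat it as an approximation of the idea rather than of scikit-learn's exact output.

```python
import re

# Assumption: same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams with min_n <= n <= max_n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

def char_ngrams(text, min_n=1, max_n=3):
    """Return all character n-grams with min_n <= n <= max_n (simplified)."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

docs = [
    "The University of Chicago",
    "private research university in Chicago",
    "culturally rich and ethnically diverse coeducational research university",
]

vocabulary = sorted({gram for doc in docs for gram in word_ngrams(doc)})
print(word_ngrams(docs[0]))
print(len(vocabulary))
```

The vocabulary size printed here should agree with the `(3, 37)` shape reported for `ngram_range=(1,3)`.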

In [9]:
countvectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3))
#countvectorizer = CountVectorizer(analyzer='char', ngram_range=(1,3))
#countvectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1,3))
countvectorizer_matrix = countvectorizer.fit_transform(simple_train)
countvectorizer_matrix.shape

(3, 37)

In [10]:
countvectorizer_matrix_df = pd.DataFrame(countvectorizer_matrix.toarray(), columns=countvectorizer.get_feature_names())
countvectorizer_matrix_df

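| | and | and ethnically | and ethnically diverse | chicago | coeducational | coeducational research | coeducational research university | culturally | culturally rich | culturally rich and | ... | rich and | rich and ethnically | the | the university | the university of | university | university in | university in chicago | university of | university of chicago |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |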


#### Controlling features in CountVectorizer

max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter is a proportion of documents; if an integer, it is an absolute count. This parameter is ignored if vocabulary is not None.

min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter is a proportion of documents; if an integer, it is an absolute count. This parameter is ignored if vocabulary is not None.

max_features: If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
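
The three parameters can be illustrated with a small hand-written filter. `select_vocabulary` is a hypothetical helper, not a scikit-learn function, and the alphabetical tie-breaking used for `max_features` is an assumption made so that the example is deterministic.

```python
import re
from collections import Counter

# Assumption: same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def select_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # document frequency: number of documents containing the term
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    # term frequency: total occurrences across the corpus
    term_freq = Counter(tok for tokens in tokenized for tok in tokens)
    # floats are proportions of documents, integers are absolute counts
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    kept = [t for t, df in doc_freq.items() if min_count <= df <= max_count]
    if max_features is not None:
        # most frequent terms first; ties broken alphabetically (assumption)
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)

docs = [
    "The University of Chicago",
    "private research university in Chicago",
    "culturally rich and ethnically diverse coeducational research university",
]

print(select_vocabulary(docs, max_df=0.8))      # drops 'university' (in all 3 docs)
print(select_vocabulary(docs, min_df=2))        # keeps terms found in 2+ docs
print(select_vocabulary(docs, max_features=2))  # keeps the 2 most frequent terms
```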

In [11]:
countvectorizer = CountVectorizer(lowercase=False, stop_words='english',
                                  max_df=0.8, min_df=0.2, max_features=1000, ngram_range=(1,3))
countvectorizer_matrix = countvectorizer.fit_transform(simple_train)
countvectorizer_matrix.shape

(3, 29)

In [12]:
countvectorizer_matrix_df = pd.DataFrame(countvectorizer_matrix.toarray(), columns=countvectorizer.get_feature_names())
countvectorizer_matrix_df

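| | Chicago | The | The University | The University Chicago | University | University Chicago | coeducational | coeducational research | coeducational research university | culturally | ... | private research | private research university | research | research university | research university Chicago | rich | rich ethnically | rich ethnically diverse | university | university Chicago |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |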


### HashingVectorizer
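The HashingVectorizer has a parameter n_features, which is 2^20 = 1048576 by default. Unlike CountVectorizer, it does not build a dictionary that maps each term to a unique column index. Instead, it hashes each term to a column index and relies on n_features being large enough that collisions are rare. With n_features=15, as in the cell below, collisions are likely. Because no vocabulary is stored, columns cannot be mapped back to terms, which is why the resulting table has numeric column names. By default the hashed values are signed and each row is L2-normalized, so the matrix contains values such as 0.707107 and -0.57735 rather than raw counts.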
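
A minimal sketch of the hashing trick follows. It is not what scikit-learn runs internally: `hashlib.md5` is used here purely as a convenient stand-in hash, normalization is left out, and `hashed_counts` is a hypothetical helper. The point is only that a token's column comes from its hash, so no vocabulary needs to be fitted.

```python
import hashlib
import re

# Assumption: same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def hashed_counts(text, n_features=15):
    """Map each token to a column via a hash; no vocabulary is stored."""
    vector = [0] * n_features
    for token in TOKEN_PATTERN.findall(text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # interpret the first 4 bytes as a signed 32-bit integer
        value = int.from_bytes(digest[:4], "big", signed=True)
        index = abs(value) % n_features
        # the sign of the hash decides whether the count is added or subtracted,
        # so colliding tokens can partly cancel instead of always piling up
        sign = 1 if value >= 0 else -1
        vector[index] += sign
    return vector

print(hashed_counts("The University of Chicago"))
# a token never seen before still lands in some column
print(hashed_counts("institution"))
```

Since the mapping depends only on the hash, the same function can be applied to training and test text without any fitting step.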


In [13]:
hashingvectorizer = HashingVectorizer(n_features=15)
hashingvectorizer_matrix = hashingvectorizer.fit_transform(simple_train)
hashingvectorizer_matrix.shape

(3, 15)

In [14]:
hashingvectorizer_matrix_df = pd.DataFrame(hashingvectorizer_matrix.toarray())
hashingvectorizer_matrix_df

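| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.707107 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | -0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.57735 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.316228 | -0.316228 | -0.632456 | 0.316228 | -0.316228 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.316228 | 0.316228 |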


### TfidfVectorizer

In [15]:
tfidfvectorizer = TfidfVectorizer(stop_words='english')
tfidfvectorizer_matrix = tfidfvectorizer.fit_transform(simple_train)
tfidfvectorizer_matrix.shape

(3, 9)

In [16]:
tfidfvectorizer_matrix_df = pd.DataFrame(tfidfvectorizer_matrix.toarray(), columns=tfidfvectorizer.get_feature_names())
tfidfvectorizer_matrix_df

| | chicago | coeducational | culturally | diverse | ethnically | private | research | rich | university |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.789807 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.613356 |
| 1 | 0.480458 | 0.0 | 0.0 | 0.0 | 0.0 | 0.631745 | 0.480458 | 0.0 | 0.373119 |
| 2 | 0.0 | 0.410747 | 0.410747 | 0.410747 | 0.410747 | 0.0 | 0.312384 | 0.410747 | 0.242594 |
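
The weights in this table can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The four-word `STOP_WORDS` set is a stand-in for the built-in English list; it contains only the stop words that occur in this tiny corpus.

```python
import math
import re
from collections import Counter

# Assumption: same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Assumption: stand-in for stop_words='english', limited to this corpus
STOP_WORDS = {"the", "of", "in", "and"}

docs = [
    "The University of Chicago",
    "private research university in Chicago",
    "culturally rich and ethnically diverse coeducational research university",
]

tokenized = [
    [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in STOP_WORDS]
    for doc in docs
]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# document frequency and smoothed inverse document frequency
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf.append([w / norm for w in weights])  # L2-normalize each row

for row in tfidf:
    print([round(w, 6) for w in row])
```

Terms that appear in every document, such as "university", get the lowest idf (exactly 1 here), so rarer terms dominate each row.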


## Part 2: Fitting the vocabulary of the training data

In [17]:
# instantiate CountVectorizer (we will use default parameters)
countvectorizer = CountVectorizer()
# learn the 'vocabulary' of the training data (occurs in-place)
countvectorizer.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [18]:
# examine the fitted vocabulary
countvectorizer.get_feature_names()

['and',
 'chicago',
 'coeducational',
 'culturally',
 'diverse',
 'ethnically',
 'in',
 'of',
 'private',
 'research',
 'rich',
 'the',
 'university']

In [19]:
# example text for model testing
simple_test = ["University of Chicago is a private institution"]

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [20]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_matrix = countvectorizer.transform(simple_test)
simple_test_matrix.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1]], dtype=int64)

In [21]:
simple_test_matrix_df = pd.DataFrame(simple_test_matrix.toarray(), columns=countvectorizer.get_feature_names())
simple_test_matrix_df

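| | and | chicago | coeducational | culturally | diverse | ethnically | in | of | private | research | rich | the | university |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |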


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
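
The fit/transform split summarized above can be mimicked with a toy class. `TinyCountVectorizer` is a hypothetical stand-in written for this example; it only imitates the two-step interface and assumes the default lowercasing and token pattern.

```python
import re

class TinyCountVectorizer:
    """Toy illustration of the fit/transform split (not scikit-learn code)."""

    # Assumption: same pattern as CountVectorizer's default token_pattern
    token_pattern = re.compile(r"(?u)\b\w\w+\b")

    def _tokenize(self, text):
        return self.token_pattern.findall(text.lower())

    def fit(self, docs):
        # learn the vocabulary: token -> column index, alphabetical order
        tokens = {t for doc in docs for t in self._tokenize(doc)}
        self.vocabulary_ = {t: i for i, t in enumerate(sorted(tokens))}
        return self

    def transform(self, docs):
        # count only tokens that are already in the fitted vocabulary
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for token in self._tokenize(doc):
                index = self.vocabulary_.get(token)
                if index is not None:  # unseen tokens are silently ignored
                    row[index] += 1
            rows.append(row)
        return rows

train = [
    "The University of Chicago",
    "private research university in Chicago",
    "culturally rich and ethnically diverse coeducational research university",
]
test = ["University of Chicago is a private institution"]

vect = TinyCountVectorizer().fit(train)
print(vect.transform(test))
```

Here "is" and "institution" are absent from the training vocabulary and "a" is too short to be a token, so none of them contributes a column.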

## Part 3: Reading a text-based dataset into pandas

In [22]:
# read file into pandas using an absolute path
directory = 'C://Users//Nick//Documents//Teaching//Data Projects//Text//Classification//'
path = directory+'md_traffic_10K.csv'
citation = pd.read_table(path, skiprows=1, header=None, sep=',', names=['description', 'violation'])
#citation = pd.read_table(path, sep=',', header='infer')

In [23]:
# examine the shape
citation.shape

(9999, 2)

In [24]:
# examine the first 5 rows
citation.head(5)

| | description | violation |
|---|---|---|
| 0 | DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED ... | Citation |
| 1 | HEADLIGHTS (*) | ESERO |
| 2 | FAILURE TO DISPLAY TWO LIGHTED FRONT LAMPS WHE... | Warning |
| 3 | DRIVER FAILURE TO STOP AT STOP SIGN LINE | Warning |
| 4 | STOP LIGHTS (*) | ESERO |


In [25]:
# examine the 10 most frequent descriptions
citation.description.value_counts().head(10)

DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS                           702
FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER                                   467
DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION                         416
DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION                                               326
FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND            246
STOP LIGHTS (*)                                                                                      235
PERSON DRIVING MOTOR VEHICLE ON HIGHWAY OR PUBLIC USE PROPERTY ON SUSPENDED LICENSE AND PRIVILEGE    225
DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE                                            224
DRIVER FAILURE TO STOP AT STOP SIGN LINE                                                             223
DRIVING VEHICLE ON HIGHWAY WITHOUT CURRENT REGISTRATION

In [26]:
# convert label to a numerical variable (three classes)
citation['violation_flag'] = citation.violation.map({'Warning':0, 'Citation':1, 'ESERO':2})

In [27]:
# check that the conversion worked
citation.head(10)

| | description | violation | violation_flag |
|---|---|---|---|
| 0 | DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED ... | Citation | 1 |
| 1 | HEADLIGHTS (*) | ESERO | 2 |
| 2 | FAILURE TO DISPLAY TWO LIGHTED FRONT LAMPS WHE... | Warning | 0 |
| 3 | DRIVER FAILURE TO STOP AT STOP SIGN LINE | Warning | 0 |
| 4 | STOP LIGHTS (*) | ESERO | 2 |
| 5 | DRIVING MOTOR VEHICLE ON HIGHWAY WITHOUT REQU... | Citation | 1 |
| 6 | DRIVING VEHICLE ON HIGHWAY WITH AN EXPIRED LIC... | Citation | 1 |
| 7 | FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DI... | Citation | 1 |
| 8 | FAILURE TO DISPLAY REGISTRATION CARD UPON DEMA... | Citation | 1 |
| 9 | PERSON DRIVING MOTOR VEHICLE ON HIGHWAY OR PUB... | Citation | 1 |


In [28]:
# define X and y (from the citation data); X stays a 1-D Series of strings, which is what CountVectorizer expects
X = citation.description
y = citation.violation_flag
print(X.shape)
print(y.shape)

(9999,)
(9999,)


In [29]:
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(7499,)
(2500,)
(7499,)
(2500,)
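
The shapes above follow from the default split: when neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, rounding the test count up. A quick check of the arithmetic, with the variable names chosen for this example:

```python
import math

n_samples = 9999
test_size = 0.25  # default used by train_test_split when no size is passed

n_test = math.ceil(test_size * n_samples)  # test count is rounded up
n_train = n_samples - n_test               # the remainder is used for training

print(n_train, n_test)
```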


## Part 3: Vectorizing our dataset

In [30]:
# instantiate the vectorizer
vect = CountVectorizer()

In [31]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [32]:
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

In [33]:
# examine the document-term matrix
X_train_dtm

<7499x775 sparse matrix of type '<class 'numpy.int64'>'
	with 66918 stored elements in Compressed Sparse Row format>

In [34]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<2500x775 sparse matrix of type '<class 'numpy.int64'>'
	with 22188 stored elements in Compressed Sparse Row format>

## Part 4: Building and evaluating models

### Naive Bayes Model

In [35]:
# instantiate a Multinomial Naive Bayes model
nb = MultinomialNB()

In [36]:
# train and time the model using X_train_dtm
%time nb.fit(X_train_dtm, y_train)

Wall time: 10 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [37]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [38]:
# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.7448


In [39]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.71      0.78      0.75      1199
          1       0.75      0.68      0.71      1145
          2       0.94      0.94      0.94       156

avg / total       0.75      0.74      0.74      2500



In [40]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[941 252   6]
 [368 774   3]
 [  9   0 147]]
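
The accuracy, precision and recall reported above can all be derived from this confusion matrix. In scikit-learn's layout, rows are the true classes and columns are the predicted classes. The sketch below recomputes the figures by hand; the variable names are chosen for this example.

```python
# rows = true class, columns = predicted class (Naive Bayes results above)
confusion = [
    [941, 252, 6],
    [368, 774, 3],
    [9, 0, 147],
]
n_classes = len(confusion)
total = sum(sum(row) for row in confusion)

# accuracy: correctly classified (diagonal) over all test rows
accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

# recall: diagonal entry divided by its row total (all true members of the class)
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]

# precision: diagonal entry divided by its column total (all predictions of the class)
precision = [
    confusion[i][i] / sum(confusion[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

print(round(accuracy, 4))
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
```

Most of the errors are confusions between classes 0 (Warning) and 1 (Citation); class 2 (ESERO) is rarely mistaken for either.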


### Logistic Regression Model

In [41]:
# instantiate a logistic regression model
logreg = LogisticRegression()

In [42]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

Wall time: 170 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [43]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [44]:
# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.7532


In [45]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.69      0.87      0.77      1199
          1       0.81      0.60      0.69      1145
          2       1.00      0.97      0.98       156

avg / total       0.77      0.75      0.75      2500



In [46]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[1041  158    0]
 [ 454  691    0]
 [   5    0  151]]


### Support Vector Machine

In [47]:
# instantiate a linear SVM trained with stochastic gradient descent (SGDClassifier uses hinge loss by default)
svm = SGDClassifier(max_iter=100, tol=None)

In [48]:
# train the model using X_train_dtm
%time svm.fit(X_train_dtm, y_train)

Wall time: 190 ms


SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [49]:
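# calculate accuracy (note: y_pred_class still holds the logistic regression predictions here, so this repeats the 0.7532 result)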
metrics.accuracy_score(y_test, y_pred_class)

0.75319999999999998

In [50]:
# make class predictions for X_test_dtm
y_pred_class = svm.predict(X_test_dtm)

In [51]:
# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

0.7556


In [52]:
# calculate precision and recall
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.70      0.87      0.77      1199
          1       0.82      0.60      0.69      1145
          2       1.00      1.00      1.00       156

avg / total       0.77      0.76      0.75      2500



In [53]:
# calculate the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[1045  154    0]
 [ 457  688    0]
 [   0    0  156]]


## Part 5: Improving model performance

In [54]:
# show default parameters for CountVectorizer
countvectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [55]:
# remove English stop words
countvectorizer = CountVectorizer(stop_words='english')

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

In [56]:
# include n-grams
countvectorizer = CountVectorizer(ngram_range=(1, 3))

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [57]:
# ignore terms that appear in more than 50% of the documents
countvectorizer = CountVectorizer(max_df=0.5)

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [58]:
# only keep terms that appear in at least 2 documents
countvectorizer = CountVectorizer(min_df=2)

## Part 6: Troubleshooting results
Focus on the predictions the model is least confident about: test rows where no class receives a predicted probability of 0.6 or more.

In [59]:
X_test_df = pd.DataFrame(X_test)
X_test_df.reset_index(inplace=True, drop=True)

y_test_df = pd.DataFrame(y_test)
y_test_df.reset_index(inplace=True, drop=True)

#### Scoring the test results and appending both class and probabilities

In [60]:
y_pred_prob = logreg.predict_proba(X_test_dtm)
y_pred_class = logreg.predict(X_test_dtm)

In [61]:
y_pred_prob_df = pd.DataFrame(y_pred_prob)
y_pred_prob_df.columns = ['0-prob', '1-prob', '2-prob']

y_pred_class_df = pd.DataFrame(y_pred_class)
y_pred_class_df.columns = ['predicted']

#### Combining the results and focusing on low confidence levels

In [62]:
results_df = X_test_df.join(y_test_df).join(y_pred_class_df).join(y_pred_prob_df)

In [63]:
results_review_df = results_df[(results_df['0-prob'] < 0.6) & (results_df['1-prob'] < 0.6) & (results_df['2-prob'] < 0.6)]
results_review_df.shape

(437, 6)

In [64]:
results_review_df.head(10)

| | description | violation_flag | predicted | 0-prob | 1-prob | 2-prob |
|---|---|---|---|---|---|---|
| 6 | OPERATING VEHICLE ON HIGHWAY WITH UNAUTHORIZED... | 0 | 0 | 0.57153 | 0.425385 | 0.003085 |
| 17 | FAILURE TO SECURELY FASTENREGISTRATION PLATE T... | 0 | 1 | 0.475838 | 0.523973 | 0.000189 |
| 23 | OPERATING VEHICLE ON HIGHWAY WITH UNAUTHORIZED... | 1 | 0 | 0.57153 | 0.425385 | 0.003085 |
| 32 | DRIVING A MOTOR VEH WITHOUT A VALID MEDICAL EX... | 0 | 0 | 0.545378 | 0.453863 | 0.000759 |
| 36 | DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI... | 0 | 0 | 0.55549 | 0.444434 | 7.5e-05 |
| 42 | DISPLAYING EXPIRED REGISTRATION PLATE ISSUED B... | 1 | 0 | 0.568944 | 0.428898 | 0.002158 |
| 46 | DRIVER USING HANDS TO USE HANDHELD TELEPHONE W... | 1 | 0 | 0.50619 | 0.493447 | 0.000363 |
| 47 | OPERATING VEHICLE ON HIGHWAY WITH UNAUTHORIZED... | 1 | 0 | 0.57153 | 0.425385 | 0.003085 |
| 50 | DRIVER USING HANDS TO USE HANDHELD TELEPHONE W... | 0 | 0 | 0.50619 | 0.493447 | 0.000363 |
| 56 | DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI... | 0 | 0 | 0.55549 | 0.444434 | 7.5e-05 |
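
Requiring every class probability to be below 0.6 is the same as requiring the largest probability to be below 0.6. The sketch below applies that rule to a few rows in plain Python and also ranks them by the margin between the top two classes. The first two rows are copied from the table above; the third row and the helper names `low_confidence` and `margin` are made up for illustration.

```python
def low_confidence(rows, threshold=0.6):
    """Keep rows whose most likely class has probability below the threshold."""
    return [(label, probs) for label, probs in rows if max(probs) < threshold]

def margin(probs):
    """Gap between the two most likely classes; small means uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

rows = [
    ("row 6", (0.57153, 0.425385, 0.003085)),
    ("row 46", (0.50619, 0.493447, 0.000363)),
    ("hypothetical confident row", (0.95, 0.04, 0.01)),
]

review = low_confidence(rows)
# least certain rows first
ranked = sorted(review, key=lambda item: margin(item[1]))

for label, probs in ranked:
    print(label, round(margin(probs), 6))
```

Sorting by margin puts near coin-flip rows, such as the handheld telephone descriptions, at the top of the review list.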
