# <center>7. Learning to Classify Text with Python</center>

## 1. Load the text data and check it

In [1]:
# load training data
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [2]:
# the size of training data
len(twenty_train.filenames)

11314

In [3]:
# class labels of training data
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
# the size of class labels
len(twenty_train.target_names)

20

In [5]:
# check the type of training data
twenty_train.data.__class__

list

In [6]:
# inspect the raw text of one training document,
# e.g., the sample at index 10 (the 11th document, since indexing starts at 0)
print(twenty_train.data[10])

From: irwin@cmptrc.lonestar.org (Irwin Arnstein)
Subject: Re: Recommendation on Duc
Summary: What's it worth?
Distribution: usa
Expires: Sat, 1 May 1993 05:00:00 GMT
Organization: CompuTrac Inc., Richardson TX
Keywords: Ducati, GTS, How much? 
Lines: 13

I have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs
very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel.  The shop will fix trans and oil 
leak.  They sold the bike to the 1 and only owner.  They want $3495, and
I am thinking more like $3K.  Any opinions out there?  Please email me.
Thanks.  It would be a nice stable mate to the Beemer.  Then I'll get
a jap bike and call myself Axis Motors!

-- 
-----------------------------------------------------------------------
"Tuba" (Irwin)      "I honk therefore I am"     CompuTrac-Richardson,Tx
irwin@cmptrc.lonestar.org    DoD #0826          (R75/6)
-------------------------------------------------------------------

In [7]:
# inspect the label (numerical and textual) of a training document,
# e.g., the label of the sample at index 10
print(twenty_train.target[10],twenty_train.target_names[twenty_train.target[10]])

8 rec.motorcycles


## 2. Extract features from raw text

Text documents are ordered sequences of words. To run machine learning algorithms on them, we need to **convert the text into numerical feature vectors**. We will use the **bag of words** model in this example: each document is represented by the counts of the words it contains, and word order is discarded.

<div align=center>
<img src="https://github.com/zhangjianzhang/text_mining/blob/master/files/codes/lecture_7/bow.png?raw=true">
<br>
<center><em><strong>Bag of Words</strong></em></center>
</div>
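To make the bag of words idea concrete, here is a minimal sketch in plain Python on a two-sentence toy corpus (the sentences and the helper names `tokenize` and `to_count_vector` are made up for illustration). It only approximates what `CountVectorizer` does: it lowercases the text, keeps tokens of two or more word characters, and counts them against a sorted vocabulary.

```python
import re
from collections import Counter

# Toy corpus (illustrative only, not taken from 20 Newsgroups).
docs = [
    "The bike runs well",
    "The bike leaks oil and the shop will fix the oil leak",
]

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = {token for doc in docs for token in tokenize(doc)}
vocabulary = {word: idx for idx, word in enumerate(sorted(all_tokens))}

def to_count_vector(text):
    # One count per vocabulary entry, in column order.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]

# Document-term matrix: one row per document, one column per word.
matrix = [to_count_vector(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows large. This is why the real document-term matrix is stored in a sparse format.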

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

In [10]:
# X_train_counts is a Document-Term matrix and its shape is [n_samples, n_features].
# i.e., there are 11314 training samples and 130107 words in the vocabulary
X_train_counts.shape

(11314, 130107)

Let's examine the contents of `X_train_counts` and see what they mean.

In [11]:
X_train_counts.__class__

scipy.sparse.csr.csr_matrix

In [12]:
X_train_counts.getrow(10)

<1x130107 sparse matrix of type '<class 'numpy.int64'>'
	with 109 stored elements in Compressed Sparse Row format>
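The "109 stored elements in Compressed Sparse Row format" message means that only the nonzero counts of this row are kept in memory. The sketch below shows the idea behind the CSR layout in plain Python. The toy matrix and the helper names `to_csr` and `get_row` are assumptions for illustration, and SciPy's real implementation is far more efficient.

```python
# Dense toy matrix: 3 documents x 6 vocabulary entries (illustrative values).
dense = [
    [2, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 1],
]

def to_csr(rows):
    """Return (data, indices, indptr) holding only the nonzero entries."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the nonzero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

def get_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the three CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row

data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
print(get_row(data, indices, indptr, 2, n_cols=6))
```

Row `i` occupies the slice `indptr[i]:indptr[i + 1]` of `data` and `indices`, so an empty row costs nothing beyond one pointer.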

In [13]:
arr = X_train_counts.getrow(10).toarray()

In [14]:
arr

array([[2, 0, 0, ..., 0, 0, 0]])

In [15]:
arr.nonzero()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([     0,   1049,   1410,   3802,   5791,   6437,   6475,   8042,
         11976,  12963,  21480,  25568,  27721,  28146,  28601,  29451,
         30868,  32311,  32489,  33301,  33527,  35151,  35194,  37423,
         40477,  40647,  41633,  47982,  48421,  49328,  49331,  51268,
         53441,  54163,  55597,  56979,  58830,  59534,  60731,  61959,
         62221,  63910,  64186,  66670,  68524,  68532,  68766,  69511,
         72384,  75028,  75033,  75901,  76007,  76032,  76377,  76681,
         79785,  80005,  80638,  83256,  83706,  83914,  84681,  85447,
         87170,  89362,  89550,  89860,  89919,  90097,

In [16]:
count_vect.vocabulary_.__class__

dict

In [17]:
# there are 130107 words in the vocabulary in total
len(count_vect.vocabulary_)

130107

In [18]:
# the word "good" is in the vocabulary
'good' in count_vect.vocabulary_

True

In [19]:
# let us find the value of "good", i.e., the numerical id of "good"
count_vect.vocabulary_['good']

59779

In [20]:
idx2word = {v:k for k,v in count_vect.vocabulary_.items()}

In [21]:
word_list = []
for idx in arr.nonzero()[1]:
    word = idx2word.get(idx)
    word_list.append(word)

In [22]:
print(word_list)

['00', '05', '0826', '13', '17k', '1978', '1993', '1st', '3495', '3k', '900gts', 'accel', 'am', 'and', 'any', 'arnstein', 'axis', 'be', 'beemer', 'bike', 'bit', 'bronze', 'brown', 'call', 'clock', 'cmptrc', 'computrac', 'distribution', 'dod', 'duc', 'ducati', 'email', 'expires', 'faded', 'fix', 'from', 'get', 'gmt', 'gts', 'hard', 'have', 'honk', 'how', 'inc', 'irwin', 'is', 'it', 'jap', 'keywords', 'leak', 'leaks', 'like', 'line', 'lines', 'll', 'lonestar', 'mate', 'may', 'me', 'model', 'more', 'motors', 'much', 'myself', 'nice', 'of', 'oil', 'on', 'only', 'opinions', 'orange', 'org', 'organization', 'out', 'owner', 'paint', 'please', 'pops', 'r75', 're', 'recommendation', 'richardson', 'runs', 'sat', 'shop', 'sold', 'stable', 'subject', 'summary', 'thanks', 'the', 'then', 'there', 'therefore', 'they', 'thinking', 'to', 'trans', 'tuba', 'tx', 'usa', 'very', 'want', 'well', 'what', 'will', 'with', 'worth', 'would']


In [23]:
print(twenty_train.data[10])

From: irwin@cmptrc.lonestar.org (Irwin Arnstein)
Subject: Re: Recommendation on Duc
Summary: What's it worth?
Distribution: usa
Expires: Sat, 1 May 1993 05:00:00 GMT
Organization: CompuTrac Inc., Richardson TX
Keywords: Ducati, GTS, How much? 
Lines: 13

I have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs
very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel.  The shop will fix trans and oil 
leak.  They sold the bike to the 1 and only owner.  They want $3495, and
I am thinking more like $3K.  Any opinions out there?  Please email me.
Thanks.  It would be a nice stable mate to the Beemer.  Then I'll get
a jap bike and call myself Axis Motors!

-- 
-----------------------------------------------------------------------
"Tuba" (Irwin)      "I honk therefore I am"     CompuTrac-Richardson,Tx
irwin@cmptrc.lonestar.org    DoD #0826          (R75/6)
-------------------------------------------------------------------

In [24]:
idx2word[0]

'00'

In [25]:
# "00" occurs 2 times in the training sample at index 10
X_train_counts[10,0]

2

<div align=center>
<img src="https://github.com/zhangjianzhang/text_mining/blob/master/files/codes/lecture_7/tfidf.png?raw=true">
<br>
<center><em><strong>Term Frequency-Inverse Document Frequency</strong></em></center>
</div>

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [27]:
# X_train_tfidf is a document-tfidf matrix
# its shape is the same as the shape of the above document-count matrix
# i.e., [n_samples, n_features]
X_train_tfidf.shape

(11314, 130107)
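As a worked illustration of the TF-IDF weighting, the sketch below recomputes it by hand on a tiny count matrix. It assumes the smoothed formula that scikit-learn documents as the default for `TfidfTransformer`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The toy counts and the function name `tfidf` are illustrative.

```python
import math

# Toy document-term counts: 3 documents x 3 terms (illustrative values).
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
]

def tfidf(counts):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed default formula).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted_rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        # L2-normalize so that every document vector has unit length.
        weighted_rows.append([x / norm if norm else 0.0 for x in weighted])
    return idf, weighted_rows

idf, weights = tfidf(counts)
print(idf)
for row in weights:
    print(row)
```

The first term occurs in every document, so it gets the lowest possible idf of 1.0, while the rarer terms are weighted up.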

## 3. Train text classifiers with different ML algorithms

In [26]:
# train a Naive Bayes classifier (here on the raw count matrix X_train_counts)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, twenty_train.target)

**Building a pipeline**: we can write less code and chain the steps above (vectorizing, TF-IDF weighting, and training a classifier) by building a pipeline as follows:

In [27]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                    ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
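To see what a pipeline does conceptually, here is a simplified sketch in plain Python. `TinyPipeline`, `Tokenize`, and `OverlapClassifier` are hypothetical stand-ins written for this illustration, not scikit-learn classes. The sketch shows only the chaining logic: every step except the last transforms the data, and the last step is the classifier.

```python
class TinyPipeline:
    """Hypothetical, simplified stand-in for a pipeline of named steps."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # Fit and apply every transformer in order, then fit the final step.
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Apply the already fitted transformers, then predict with the final step.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class Tokenize:
    """Toy transformer: turn each document into a set of lowercase tokens."""

    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        return [set(doc.lower().split()) for doc in X]


class OverlapClassifier:
    """Toy classifier: pick the class sharing the most tokens with the input."""

    def fit(self, X, y):
        self.class_tokens = {}
        for tokens, label in zip(X, y):
            self.class_tokens.setdefault(label, set()).update(tokens)
        return self

    def predict(self, X):
        return [
            max(self.class_tokens, key=lambda c: len(tokens & self.class_tokens[c]))
            for tokens in X
        ]


train_docs = ["Ducati bike for sale", "new graphics card driver"]
train_labels = ["rec.motorcycles", "comp.graphics"]

toy_clf = TinyPipeline([("tok", Tokenize()), ("clf", OverlapClassifier())])
toy_clf.fit(train_docs, train_labels)
print(toy_clf.predict(["my bike leaks oil", "graphics driver crash"]))
```

Because the whole chain behaves like a single estimator, raw text can be passed straight to `fit` and `predict`, which is what makes the real pipeline convenient for the grid search later on.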

In [28]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)

In [29]:
# the accuracy is about 77.39%
np.mean(predicted == twenty_test.target)

0.7738980350504514
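The expression `np.mean(predicted == twenty_test.target)` is simply the fraction of test documents whose predicted label matches the true label. A small sketch of the same computation in plain Python, with made-up labels and a hypothetical helper named `accuracy`:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Made-up label ids for five documents (illustrative only).
toy_predicted = [8, 1, 4, 8, 0]
toy_actual = [8, 1, 7, 8, 2]

toy_score = accuracy(toy_predicted, toy_actual)
print(toy_score)  # 3 of the 5 labels match
```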

Let’s try a different algorithm, a linear **SVM** (trained with stochastic gradient descent via `SGDClassifier` with hinge loss), and see if we get better performance.

In [30]:
from sklearn.linear_model import SGDClassifier

In [31]:
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge',
                                                   penalty='l2',
                                                   alpha=1e-3,
                                                   n_iter_no_change=5,
                                                   random_state=42))])

In [32]:
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

In [33]:
predicted_svm = text_clf_svm.predict(twenty_test.data)

In [34]:
# the accuracy is about 82.41%
np.mean(predicted_svm == twenty_test.target)

0.8240839086563994
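The settings `loss='hinge'` and `penalty='l2'` used above correspond to a linear SVM objective: the average hinge loss over the training samples plus `alpha` times an L2 penalty on the weights. The sketch below evaluates that objective by hand for a binary toy problem. It assumes the penalty is half the squared norm of the weights, and all numbers and function names are illustrative. It computes the objective only and does not implement the SGD training loop.

```python
def hinge_loss(y, score):
    # y is +1 or -1; score is the raw decision value w.x + b.
    # The loss is zero once the sample is on the correct side with margin >= 1.
    return max(0.0, 1.0 - y * score)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def objective(weights, bias, samples, alpha):
    # Average hinge loss over the samples ...
    data_term = sum(
        hinge_loss(y, dot(weights, x) + bias) for x, y in samples
    ) / len(samples)
    # ... plus alpha times the (assumed) L2 penalty 0.5 * ||w||^2.
    penalty = 0.5 * sum(w * w for w in weights)
    return data_term + alpha * penalty

# Toy weights and three labelled samples (illustrative values).
weights, bias = [1.0, -1.0], 0.0
samples = [
    ([2.0, 0.0], +1),  # score  2.0 -> loss 0.0 (correct, outside the margin)
    ([0.0, 0.5], -1),  # score -0.5 -> loss 0.5 (correct, inside the margin)
    ([1.0, 1.0], +1),  # score  0.0 -> loss 1.0 (on the decision boundary)
]

value = objective(weights, bias, samples, alpha=1e-3)
print(value)
```

A larger `alpha` pushes the weights toward zero more strongly, which is why it is a natural parameter to tune.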

## 4. Grid Search for Selecting the Best Parameters

Almost all classifiers have parameters that can be tuned to improve performance.

Scikit-learn provides an extremely useful tool for this, `GridSearchCV`.

In [35]:
from sklearn.model_selection import GridSearchCV

In [36]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
             }

Here, we create a dictionary of the parameters we would like to tune, each mapped to the candidate values to try.

Every parameter name starts with the name of the pipeline component it belongs to (the names we gave the steps earlier), followed by a double underscore. For example, `vect__ngram_range` tells the grid search to try both unigrams only, `(1, 1)`, and unigrams plus bigrams, `(1, 2)`, and keep whichever performs better.

Next, we create an instance of the grid search by passing the classifier, the parameters, and `n_jobs=-1`, which tells it to use all available cores on the machine.

In [37]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

This might take a few minutes to run, depending on the machine configuration.
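The run time follows from the number of models that must be trained. The sketch below enumerates the candidate combinations of the grid with `itertools.product`. The fold count of 5 is an assumption based on the default `cv` of recent scikit-learn versions, and the helper `split_name` is a hypothetical function that only shows how a name such as `vect__ngram_range` splits into a step name and a parameter name.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

def split_name(name):
    # 'vect__ngram_range' -> ('vect', 'ngram_range')
    step, param = name.split('__', 1)
    return step, param

# Every combination of one value per parameter is one candidate setting.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_folds = 5  # assumption: default number of cross-validation folds
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates x", n_folds, "folds =", total_fits, "fits")
```

With two values for each of three parameters there are 8 candidates, and under the 5-fold assumption that means 40 pipeline fits before the best setting is refit on the full training set. Adding one more value to any parameter grows this number quickly.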

Lastly, to see the best mean score and the corresponding parameters, run the following code:

In [38]:
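# the best mean cross-validated accuracy on the training data is about 91.58%
# (not directly comparable to the 77.39% measured on the test set)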
gs_clf.best_score_

0.9157684864695698

In [39]:
# the best parameters are as follows:
gs_clf.best_params_

{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

Let's tune the SVM classifier with grid search.

In [40]:
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3),
}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)

In [41]:
# the best mean cross-validated accuracy on the training data is about 90.52%
gs_clf_svm.best_score_

0.9051618841994754

In [42]:
gs_clf_svm.best_params_

{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

You can further optimize the SVM classifier by tuning other parameters; this is left for you to explore.

## 5. Some Useful Tips for Improving the Performance

**Removing stop words**: remove common words (*the*, *then*, etc.) from the data.

You should do this only when stop words are not useful for the underlying problem.

In most text classification problems, stop words are indeed not useful.

Let’s see if removing stop words increases the accuracy. Update the code that creates the `CountVectorizer` object as follows:

In [43]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                    ])

In [44]:
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [45]:
# after stop word removal, the test accuracy increases from 77.39% to 81.69%
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.8169144981412639
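What `stop_words='english'` does to a single document can be sketched in plain Python. The stop list below is a tiny illustrative subset chosen for this example, not the list scikit-learn actually uses, and the helper names are hypothetical.

```python
import re

# Tiny illustrative stop list (NOT scikit-learn's built-in English list).
STOP_WORDS = {"the", "then", "and", "a", "of", "to", "is", "it"}

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def remove_stop_words(tokens):
    # Keep only tokens that carry topical information.
    return [token for token in tokens if token not in STOP_WORDS]

text = "The shop will fix the oil leak and then sell the bike"
tokens = tokenize(text)
filtered = remove_stop_words(tokens)

print(tokens)
print(filtered)
```

The surviving tokens are the ones that help tell a motorcycle post from, say, a graphics post, while the removed ones occur in almost every newsgroup.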

**`fit_prior=False`**: when `fit_prior` is set to `False` for `MultinomialNB`, the class prior probabilities are not learned from the training data and a uniform prior is used instead.

In [46]:
# test accuracy: 81.69% -> 82.14%, a small improvement
import numpy as np
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(fit_prior=False)),
                    ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.8214285714285714
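The difference between a learned and a uniform class prior can be sketched in a few lines of plain Python. The function `class_log_priors` and the toy labels are illustrative; the sketch only mirrors the idea of log priors taken from class frequencies versus equal priors for every class.

```python
import math
from collections import Counter

def class_log_priors(labels, fit_prior=True):
    """Log prior per class: learned from frequencies, or uniform."""
    counts = Counter(labels)
    classes = sorted(counts)
    if fit_prior:
        total = len(labels)
        # Learned prior: relative frequency of each class in the training labels.
        return {c: math.log(counts[c] / total) for c in classes}
    # Uniform prior: every class is equally likely before seeing the text.
    return {c: math.log(1 / len(classes)) for c in classes}

# Imbalanced toy labels: 6 x "a", 3 x "b", 1 x "c" (illustrative only).
labels = ["a"] * 6 + ["b"] * 3 + ["c"]

learned = class_log_priors(labels, fit_prior=True)
uniform = class_log_priors(labels, fit_prior=False)

print({c: round(math.exp(v), 3) for c, v in learned.items()})
print({c: round(math.exp(v), 3) for c, v in uniform.items()})
```

With a learned prior, frequent classes get a head start. With a uniform prior, the decision depends on the word evidence alone, which can help when the class frequencies in the training and test data differ.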

**Stemming**: stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. For example, a stemming algorithm reduces the words *fishing*, *fished*, and *fisher* to the root word *fish*.
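To illustrate the idea only, here is a deliberately naive suffix-stripping function in plain Python. It is not the Snowball algorithm, which applies a much richer and more careful rule set, and the suffix list and the name `toy_stem` are assumptions made for this sketch.

```python
# Illustrative suffix list; real stemmers use many more rules and conditions.
SUFFIXES = ("ing", "ed", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["fishing", "fished", "fisher", "fish"]
stems = [toy_stem(w) for w in words]
print(stems)
```

After stemming, the four surface forms share one vocabulary entry, which shrinks the feature space. A rule set this crude would also mangle many words, which is one reason to rely on an established stemmer in practice.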

Below we use the Snowball stemmer from `NLTK`, which works well for English.

In [47]:
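import nltk
from nltk.stem.snowball import SnowballStemmer
# ignore_stopwords=True needs the NLTK stopwords corpus;
# if it is missing, run nltk.download('stopwords') first
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # reuse the default analyzer (tokenizing, lowercasing, stop word removal),
        # then stem each token it produces
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]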
    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False)),
                            ])

# test accuracy: 82.14% -> 81.68%, a small decrease
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)
predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
np.mean(predicted_mnb_stemmed == twenty_test.target)

0.8167817312798725

As an exercise, try the WordNet lemmatizer in `NLTK` yourself.