# Text Classification


by   
[__Michael Granitzer__ (michael.granitzer@uni-passau.de)]( http://www.mendeley.com/profiles/michael-granitzer/)  
[Konstantin Ziegler (konstantin.ziegler@uni-passau.de)](http://zieglerk.net)  
Jörg Schlötterer (joerg.schloetterer@uni-passau.de)

with examples taken from the [scikit-learn documentation](http://scikit-learn.org/stable/)

__License__

This work is licensed under a [Creative Commons Attribution 3.0 Unported License](http://creativecommons.org/licenses/by/3.0/) (CC BY 3.0)

In [1]:
%matplotlib inline

import numpy as np
from matplotlib import pyplot as plt

## Naive Bayes
### Data Set
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:




<table border=1>
<tr>
<td>comp.graphics<br>comp.os.ms-windows.misc<br>comp.sys.ibm.pc.hardware<br>comp.sys.mac.hardware<br>comp.windows.x</td>
<td>rec.autos<br>rec.motorcycles<br>rec.sport.baseball<br>rec.sport.hockey</td>
<td>sci.crypt<br>sci.electronics<br>sci.med<br>sci.space</td>
</tr><tr>
<td>misc.forsale</td>
<td>talk.politics.misc<br>talk.politics.guns<br>talk.politics.mideast</td>
<td>talk.religion.misc<br>alt.atheism<br>soc.religion.christian</td>
</tr>
</table>

The "bydate" version of the data set is sorted by date and split into a training set (60%) and a test set (40%). It does not include cross-posts (duplicates) or newsgroup-identifying headers (Xref, Newsgroups, Path, Followup-To, Date).

<div class="alert alert-success">
Implement your own Naive Bayes classifier and apply it to the 20 Newsgroups data set.  
</div>
* Take a look at the [notebook about 20newsgroups](../1-Datasets_Visualization_and_preprocessing/5-20newsgroups.ipynb) to obtain the data
* Read the files and tokenize the text to obtain a "bag of words"
* Implement the Naive Bayes classifier (pseudocode is given below)
* Evaluate your classifier on the training and test sets. What accuracy can you achieve?
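
The "bag of words" step from the list above can be sketched with `collections.Counter`. This is only a minimal illustration on a made-up sentence; the helper name `bag_of_words` is hypothetical and lowercasing is an assumption, not a requirement of the exercise.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count every token of two or more word characters."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return Counter(tokens)


# Made-up example sentence; word order is discarded, only counts remain.
bow = bag_of_words("The cat sat on the mat. The mat was red.")
print(bow.most_common(2))  # [('the', 3), ('mat', 2)]
```

A bag of words ignores word order entirely, which is exactly the independence assumption that Naive Bayes relies on.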

### Naive Bayes Pseudocode
#### TrainMultinomialNB($\mathbb C$,$\mathbb D$)  
$V \leftarrow extractVocabulary(\mathbb D)$  
$N \leftarrow countDocs(\mathbb D)$    
for $c \in \mathbb C$:  
&nbsp;&nbsp;&nbsp;&nbsp;$N_c \leftarrow countDocsInClass(\mathbb D, c)$  
&nbsp;&nbsp;&nbsp;&nbsp;$prior[c] \leftarrow \frac{N_c}{N}$  
&nbsp;&nbsp;&nbsp;&nbsp;$text_c \leftarrow concatenateTextOfAllDocsInClass(\mathbb D, c)$   
&nbsp;&nbsp;&nbsp;&nbsp;for $t \in V$:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$T_{ct} \leftarrow countTokensOfTerm(text_c,t)$  
&nbsp;&nbsp;&nbsp;&nbsp;for $t \in V$:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$condprob[t][c] \leftarrow \frac{T_{ct} + 1}{\sum_{t'}(T_{ct'} + 1)}$  
return $V,prior,condprob$
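
The training pseudocode can be traced on a tiny corpus. The following is a sketch under stated assumptions: the three documents and the labels `sport` and `tech` are invented, documents are passed as already tokenized lists, and the function name `train_multinomial_nb` is hypothetical. The result is indexed as `condprob[c][t]` rather than `condprob[t][c]`.

```python
from collections import Counter


def train_multinomial_nb(docs):
    """docs: list of (class_label, tokens). Returns (vocab, prior, condprob)."""
    vocab = sorted({t for _, tokens in docs for t in tokens})
    n_docs = len(docs)
    prior, condprob = {}, {}
    for c in sorted({label for label, _ in docs}):
        class_docs = [tokens for label, tokens in docs if label == c]
        prior[c] = len(class_docs) / n_docs
        # T_ct: how often term t occurs in all documents of class c
        counts = Counter(t for tokens in class_docs for t in tokens)
        # add-one (Laplace) smoothing: denominator sums (T_ct' + 1) over the vocabulary
        denom = sum(counts[t] + 1 for t in vocab)
        condprob[c] = {t: (counts[t] + 1) / denom for t in vocab}
    return vocab, prior, condprob


# Invented toy corpus for illustration only.
toy_docs = [
    ("sport", ["ball", "goal", "ball"]),
    ("sport", ["goal", "team"]),
    ("tech", ["code", "bug", "code"]),
]
vocab, prior, condprob = train_multinomial_nb(toy_docs)
print(prior["sport"])             # 2 of 3 documents
print(condprob["sport"]["ball"])  # (2 + 1) / (5 + 5)
print(condprob["sport"]["code"])  # (0 + 1) / (5 + 5), non-zero thanks to smoothing
```

Without the `+ 1`, a term never seen in a class would get probability zero and veto that class for every document containing it.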

#### ApplyMultinomialNB($\mathbb C,V,prior,condprob,d$)
$W \leftarrow extractTokensFromDoc(V,d)$   
for $c \in \mathbb C$:  
&nbsp;&nbsp;&nbsp;&nbsp;$score[c] \leftarrow log(prior[c])$  
&nbsp;&nbsp;&nbsp;&nbsp;for $t \in W$:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$score[c] += log(condprob[t][c])$  
return $argmax_{c \in \mathbb C} score[c]$
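
The scoring step sums logarithms instead of multiplying probabilities. A short sketch shows why: the product of many small probabilities underflows to zero in floating point, while the sum of their logs stays finite. The `prior` and `condprob` values below are hand-written toy numbers, and the function name `apply_multinomial_nb` is hypothetical.

```python
import math


def apply_multinomial_nb(prior, condprob, tokens):
    """Return (best_class, scores); tokens outside the vocabulary are skipped."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c])
        for t in tokens:
            if t in condprob[c]:
                scores[c] += math.log(condprob[c][t])
    return max(scores, key=scores.get), scores


# Hand-written toy model (assumed values, each row sums to 1).
prior = {"sport": 2 / 3, "tech": 1 / 3}
condprob = {
    "sport": {"ball": 0.3, "bug": 0.1, "code": 0.1, "goal": 0.3, "team": 0.2},
    "tech": {"ball": 0.125, "bug": 0.25, "code": 0.375, "goal": 0.125, "team": 0.125},
}

label, scores = apply_multinomial_nb(prior, condprob, ["code", "bug", "unknown"])
print(label)  # tech

# Why logs: 400 factors of 0.1 underflow as a product, but not as a sum of logs.
product = 0.1 ** 400
log_sum = 400 * math.log(0.1)
print(product, log_sum)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability.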

**Some snippets that might be useful for the implementation:**

In [1]:
# tokenization
import re
def tokenize(doc):
    return re.findall(r'\b\w\w+\b', doc)  # all tokens of two or more word characters

tokenize("This is a test string.")

['This', 'is', 'test', 'string']

In [3]:
# list files (or directories)
import os
for directory in os.listdir('./../1-Datasets_Visualization_and_preprocessing/newsgroups/20news-bydate-train/'):
    print(directory)

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


In [19]:
import codecs
# simple file reading
with open('./Bayes-Learning.ipynb') as f:
    doc = f.read()
    print(doc[80:366])
    
# codecs can help if you run into encoding problems
with codecs.open('./Bayes-Learning.ipynb', encoding='latin1') as f:
    doc = f.read()
    print(doc[80:366])

    "# Text Classification\n",
    "\n",
    "\n",
    "by   \n",
    "[__Michael Granitzer__ (michael.granitzer@uni-passau.de)]( http://www.mendeley.com/profiles/michael-granitzer/)  \n",
    "[Konstantin Ziegler (konstantin.ziegler@uni-passau.de)](http://zieglerk.net)  \n",
    "Jörg
    "# Text Classification\n",
    "\n",
    "\n",
    "by   \n",
    "[__Michael Granitzer__ (michael.granitzer@uni-passau.de)]( http://www.mendeley.com/profiles/michael-granitzer/)  \n",
    "[Konstantin Ziegler (konstantin.ziegler@uni-passau.de)](http://zieglerk.net)  \n",
    "JÃ¶r
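
The second output ends in `JÃ¶r` instead of `Jör`. Assuming the notebook file is stored as UTF-8, this is what happens when its bytes are decoded as latin1: the two-byte UTF-8 sequence for `ö` is read as two separate characters. A minimal sketch of the effect:

```python
text = "Jörg"
raw = text.encode("utf-8")    # 'ö' becomes the two bytes 0xC3 0xB6

wrong = raw.decode("latin1")  # each byte is mapped to one latin1 character
right = raw.decode("utf-8")

print(len(text), len(raw))    # 4 5
print(wrong)                  # JÃ¶rg
print(right)                  # Jörg
```

Decoding as latin1 never raises an error, because every byte value maps to some character. That makes it a convenient fallback for files of unknown encoding, at the price of garbling non-ASCII characters when the guess is wrong.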


In [39]:
import pprint
with codecs.open('./../1-Datasets_Visualization_and_preprocessing/newsgroups/20news-bydate-train/alt.atheism/49960', encoding='latin1') as file:
    pprint.pprint(file.readlines()[:40])

['From: mathew <mathew@mantis.co.uk>\n',
 'Subject: Alt.Atheism FAQ: Atheist Resources\n',
 'Summary: Books, addresses, music -- anything related to atheism\n',
 'Keywords: FAQ, atheism, books, music, fiction, addresses, contacts\n',
 'Expires: Thu, 29 Apr 1993 11:57:19 GMT\n',
 'Distribution: world\n',
 'Organization: Mantis Consultants, Cambridge. UK.\n',
 'Supersedes: <19930301143317@mantis.co.uk>\n',
 'Lines: 290\n',
 '\n',
 'Archive-name: atheism/resources\n',
 'Alt-atheism-archive-name: resources\n',
 'Last-modified: 11 December 1992\n',
 'Version: 1.0\n',
 '\n',
 '                              Atheist Resources\n',
 '\n',
 '                      Addresses of Atheist Organizations\n',
 '\n',
 '                                     USA\n',
 '\n',
 'FREEDOM FROM RELIGION FOUNDATION\n',
 '\n',
 'Darwin fish bumper stickers and assorted other atheist paraphernalia are\n',
 'available from the Freedom From Religion Foundation in the US.\n',
 '\n',
 'Write to:  FFRF, P.O. Box 750, Madis

### Implementation

In [53]:
def tokenize(doc_file):
    with codecs.open(doc_file, encoding='latin1') as doc:
        doc = doc.read().lower()
        _header, _blankline, body = doc.partition('\n\n')
        return re.findall(r'\b\w\w+\b',body)

In [54]:
pprint.pprint(tokenize('./../1-Datasets_Visualization_and_preprocessing/newsgroups/20news-bydate-train/alt.atheism/49960')[:30])

['archive',
 'name',
 'atheism',
 'resources',
 'alt',
 'atheism',
 'archive',
 'name',
 'resources',
 'last',
 'modified',
 '11',
 'december',
 '1992',
 'version',
 'atheist',
 'resources',
 'addresses',
 'of',
 'atheist',
 'organizations',
 'usa',
 'freedom',
 'from',
 'religion',
 'foundation',
 'darwin',
 'fish',
 'bumper',
 'stickers']
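
The `tokenize` function above drops the header block by splitting at the first blank line with `str.partition`. A small sketch on an invented message (the address and text are made up) shows what the three returned parts look like, including the edge case of a text without any blank line:

```python
# Invented message: two header lines, a blank line, then the body.
message = "From: alice@example.org\nSubject: Test\n\nFirst body line.\n\nSecond paragraph."

header, separator, body = message.partition("\n\n")
print(repr(header))  # the header lines up to the first blank line
print(repr(body))    # everything after it, later blank lines included

# Edge case: without a blank line the whole text lands in the first part
# and the body is empty.
only_header, sep, empty_body = "no blank line here".partition("\n\n")
print(repr(empty_body))  # ''
```

Only the first blank line is used as the split point, so blank lines inside the body are preserved.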


In [55]:
import os
import re
import math
import codecs
from sklearn import metrics


class NaiveBayesClassifier:
    def __init__(self, min_count=1):
        self.min_count = min_count
        self.vocabulary = {}
        self.num_docs = 0
        self.classes = {}
        self.priors = {}
        self.conditionals = {}

    def train(self, path):
        self.num_docs = 0
        for d in os.listdir(path):
            self.classes[d] = {'doc_counts':0, 'terms':{}}
            print(d)
            for f in os.listdir(path + d):
                terms = tokenize(path + d + '/' + f)
                self.num_docs += 1
                self.classes[d]['doc_counts'] += 1
                
                # build vocabulary and count terms
                for term in terms:
                    if not term in self.vocabulary:
                        self.vocabulary[term] = 1
                        self.classes[d]['terms'][term] = 1
                    else:
                        self.vocabulary[term] += 1
                        if not term in self.classes[d]['terms']:
                            self.classes[d]['terms'][term] = 1
                        else:
                            self.classes[d]['terms'][term] += 1
                            
        # keep only terms that occur more than min_count times in the training set
        # (with the default min_count=1 this drops terms seen only once)
        self.vocabulary = {k:v for k,v in self.vocabulary.items() if v > self.min_count}

        for c in self.classes:
            # calculate priors
            self.priors[c] = math.log(self.classes[c]['doc_counts']) - math.log(self.num_docs)
            
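            # calculate smoothed conditionals in log space
            self.conditionals[c] = {}
            # total number of tokens in class c; this also counts tokens of terms
            # pruned from the vocabulary, so the estimates can sum to less than 1
            c_len = sum(self.classes[c]['terms'].values())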
            for term in self.vocabulary:
                t_ct = 1
                if term in self.classes[c]['terms']:
                    t_ct += self.classes[c]['terms'][term]
                self.conditionals[c][term] = math.log(t_ct) - math.log(c_len + len(self.vocabulary))

    def classify(self, doc):
        scores = {}
        for c in self.classes:
            scores[c] = self.priors[c]
            for term in doc:
                if term in self.vocabulary:
                    scores[c] += self.conditionals[c][term]

        return scores, max(scores, key=scores.get)

In [56]:
clf = NaiveBayesClassifier()

clf.train('./../1-Datasets_Visualization_and_preprocessing/newsgroups/20news-bydate-train/')


test_path = './../1-Datasets_Visualization_and_preprocessing/newsgroups/20news-bydate-test/'

out_y = []
true_y = []
for cl in clf.classes:
    for f in os.listdir(test_path + cl):
        _, result_class = clf.classify(tokenize(test_path+cl+'/'+f))
        out_y.append(result_class)
        true_y.append(cl)

print('accuracy',metrics.accuracy_score(true_y,out_y))

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
accuracy 0.757302177377
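
A single accuracy figure does not show which newsgroups are classified well and which are confused. As a rough sketch, the fraction of correctly classified documents per true class can be computed from two label lists. The names `true_y` and `out_y` mirror the cell above, but the labels here are toy values and the helper `per_class_accuracy` is hypothetical.

```python
from collections import Counter


def per_class_accuracy(true_y, out_y):
    """Fraction of documents of each true class that were predicted correctly."""
    total = Counter(true_y)
    correct = Counter(t for t, p in zip(true_y, out_y) if t == p)
    return {c: correct[c] / total[c] for c in total}


# Toy labels for illustration only.
true_y = ["a", "a", "a", "b", "b"]
out_y = ["a", "a", "b", "b", "a"]

overall = sum(t == p for t, p in zip(true_y, out_y)) / len(true_y)
by_class = per_class_accuracy(true_y, out_y)
print(overall)   # 0.6
print(by_class)
```

Applied to the real predictions, this would likely reveal lower values for closely related groups, though the notebook does not report such a breakdown.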


### Compare with sklearn

In [45]:
from sklearn.datasets import load_files
from sklearn import feature_extraction
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB


twenty_train = load_files('./../1-Datasets_Visualization_and_preprocessing/newsgroups/20news-bydate-train/', encoding='latin1')
twenty_test = load_files('./../1-Datasets_Visualization_and_preprocessing/newsgroups/20news-bydate-test/', encoding='latin1')

vectorizer = feature_extraction.text.CountVectorizer()
train_X = vectorizer.fit_transform(twenty_train.data)
print(train_X.shape)

clf = MultinomialNB()
clf.fit(train_X,twenty_train.target)

pred = clf.predict(vectorizer.transform(twenty_test.data))

print('accuracy',metrics.accuracy_score(twenty_test.target,pred))

(11314, 130107)
accuracy 0.772835900159
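
The printed shape `(11314, 130107)` is the number of training documents by the number of distinct terms. The following pure-Python sketch builds the same kind of document-term count matrix for three invented sentences. It is not how `CountVectorizer` is implemented (which produces a sparse matrix); it only illustrates what the rows and columns hold.

```python
import re

# Invented mini corpus for illustration.
docs = ["the cat sat", "the dog sat down", "a dog"]

tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
column = {term: i for i, term in enumerate(vocab)}

# One row per document, one column per vocabulary term, entries are counts.
matrix = [[0] * len(vocab) for _ in docs]
for row, tokens in zip(matrix, tokenized):
    for t in tokens:
        row[column[t]] += 1

shape = (len(matrix), len(vocab))
print(vocab)   # ['cat', 'dog', 'down', 'sat', 'the']
print(shape)   # (3, 5)
print(matrix)
```

Most entries of such a matrix are zero, which is why a sparse representation pays off at the scale of 130,107 columns.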


### Can we do better?

In [62]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV


clf = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB()),
               ])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1.0, 0.1),
             }

gs_clf = GridSearchCV(clf, parameters, n_jobs=-1)
gs_clf.fit(twenty_train.data, twenty_train.target)
pred = gs_clf.predict(twenty_test.data)
print('accuracy',metrics.accuracy_score(twenty_test.target,pred))

for param_name in sorted(parameters.keys()):
    print(param_name,":", gs_clf.best_params_[param_name])

accuracy 0.817843866171
clf__alpha : 0.1
tfidf__use_idf : True
vect__ngram_range : (1, 2)
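
The best setting `vect__ngram_range : (1, 2)` means that both single words and pairs of adjacent words are used as features. A minimal sketch of that feature extraction on a pre-tokenized example (the helper `word_ngrams` is hypothetical and only mimics the idea, not the library code):

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = word_ngrams(["linear", "support", "vector"], 1, 2)
print(features)
# ['linear', 'support', 'vector', 'linear support', 'support vector']
```

Bigrams recover a little of the word order that a plain bag of words throws away, at the cost of a much larger vocabulary.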


In [63]:
from sklearn.linear_model import SGDClassifier
#SGDClassifier with hinge loss gives a linear SVM

clf = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42)),
               ])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
             }

gs_clf = GridSearchCV(clf, parameters, n_jobs=-1)
gs_clf.fit(twenty_train.data, twenty_train.target)
pred = gs_clf.predict(twenty_test.data)
print('accuracy',metrics.accuracy_score(twenty_test.target,pred))

for param_name in sorted(parameters.keys()):
    print(param_name,":", gs_clf.best_params_[param_name])



accuracy 0.834838024429
clf__alpha : 0.001
tfidf__use_idf : True
vect__ngram_range : (1, 2)
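
The comment in the cell above notes that hinge loss yields a linear SVM. For a label $y \in \{-1, +1\}$ and a raw decision score $s$, the hinge loss is $\max(0, 1 - y \cdot s)$. The sketch below evaluates it for a few hand-picked scores to show that only points inside the margin or on the wrong side contribute.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)


print(hinge_loss(+1, 2.5))   # correct and outside the margin: no loss
print(hinge_loss(+1, 0.3))   # correct but inside the margin: small loss
print(hinge_loss(-1, 0.3))   # wrong side of the boundary: larger loss
```

The `alpha` parameter searched in the grid weights the L2 penalty that is added to this loss during training.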


Further details can be found in the scikit-learn tutorial on working with text data: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html