# Assignment  Practice Text classification with Naive Bayes 

# Practice Text classification with Naive Bayes  
        
        
        
<h3>Abstract</h3>
<p>We will do text classification on a collection of Dutch parliamentary questions.
    The website <a href="https://zoek.officielebekendmakingen.nl/zoeken/parlementaire_documenten">officielebekendmakingen.nl</a>lets you search in "kamervragen".
    You can donwload
    <a href='http://data.politicalmashup.nl/kamervragen/PoliDocs_Kamervragen.zip'>this zipfile with Kamervragen in XML</a>
    to see some of the  data in XML format. 
    It also contains style sheets to show the XML well in a browser.  
    The <a href='http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/'>MYSQL directory</a> contains an <a href='http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR14807.xml'>example   Kamervraag XML file</a> and a file `kvr.csv.gz` with all those kamervragen in a handy csv format. Note that in your browser you see the result of applying stylesheets. So choose View Source or open it in an editor.</p>

<h3>First exploration</h3>

See below.

<h2>Exercise</h2>

<p>We will use the fields in elements of the form <tt> &lt;item attribuut="Afkomstig_van"></tt> as our classes. 
    These are the ministeries to whom the question is addressed.
    An example is 
    <pre>
        &lt;item attribuut="Afkomstig_van">Landbouw, Natuurbeheer en Visserij (LNV)&lt;/item>
    </pre>
    Note that these labels are <strong>not normalized</strong>, see e.g. the counts below:
    <pre>
Justitie (JUS)                                                   3219
Volksgezondheid, Welzijn en Sport (VWS)                          2630
Buitenlandse Zaken (BUZA)                                        1796
Verkeer en Waterstaat (VW)                                       1441
Justitie                                                         1333
Sociale Zaken en Werkgelegenheid (SZW)                           1231
Onderwijs, Cultuur en Wetenschappen (OCW)                        1187
Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer (VROM)     984
FinanciÃ«n (FIN)                                                   960
Volksgezondheid, Welzijn en Sport                                 951
Economische Zaken (EZ)                                            946
Buitenlandse Zaken                                                753
Binnenlandse Zaken en Koninkrijksrelaties (BZK)                   725
Verkeer en Waterstaat                                             724
Defensie (DEF)                                                    646
Sociale Zaken en Werkgelegenheid                                  607
Landbouw, Natuurbeheer en Visserij (LNV)                          586
Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer            554
Onderwijs, Cultuur en Wetenschappen                               532
Vreemdelingenzaken en Integratie (VI)                             466
    </pre>
</p>

  <ol>
      <li>Normalize the values for "ministerie" and choose 10 ministeries to work with. </li>
      <li>Implement the two algorithms in Fig MRS.13.2, using your earlier code for creating term and document frequencies.
      It might be easier to use the representation and formula given in MRS section 13.4.1.</li>
      <li>On this collection, train NB text classifiers for 10 different classes with enough and interesting data.</li>
      <li>Compute for each term and each of your 10 classes its utility for that class using mutual information.</li>
      <li>For each class, show the top 10 words as in Figure 13.7 in MRS.</li>
      <li>Evaluate your classifiers using Precision, Recall and F1. (
           <br/>
          Give a table in which you show these values for using the top 10, top 100 terms and all terms, for all of your 10 classes.
          Thus do feature selection per class, and use for each class the top n best features for that class. 
          <br/>
      Also show the microaverage(s) for all 10 classes together.
      <br/>
      If you like you can also present this in a figure like MRS.13.8. 
      Then compute the F1 measure for the same number of terms as in that figure.</li>
      <li>Reflect and report briefly about your choices in this process and about the obtained results. </li>
  </ol>

<h3>Training/Testing</h3>
<p>It is important that you do not test your classifier using documents that have also been used in training.
    So split up your collection in a training set and a test set. A 80%-20% split is reasonable.

<br/>
    If you have too little data you can use 5 or <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation">10-fold cross validation</a>.</p>

<h2>Form of presentation</h2>
<ul>
    <li>Make slides or wikipages and have your system running (this could just be an IPython notebook with a classify function that accepts any string and classifies it.) ~~and be able to accept documents from the web.~~ </li>
    <li>Create one or two slides or wikipages for each of the sub exercises listed above.
</li>
<li>Make it clear in the heading of the slides which sub exercises you talk about.</li>
    <li>Show running code with one or two  good examples (a TP of course, but also a FP and an error-analysis is nice to show). </li>

</ul>

<h2>Form of handing in your final product</h2>
<ul>
    <li>An IPython notebook would be perfect, with clear indications which part of the code answers which subquestion.</li>
    <li>A clear git repo, with good comments and a clear separation and indication what code does what is also fine.</li>
    <li> You are free to program in whatever language you prefer.</li>
</ul>


In [4]:
import pandas as pd
import timeit
import nltk
import math
import numpy
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from collections import Counter, defaultdict
from __future__ import division

# The classifier:

In [1]:
# Change to KVR1000.csv.gz if this becomes too slow for you
kvrdf= pd.read_csv('http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR.csv.gz', 
                   compression='gzip', sep='\t', encoding='utf-8',
                   index_col=0, names=['jaar', 'partij','titel','vraag','antwoord','ministerie']) 


# Normalize the values for "ministerie" and choose 10 ministeries to work with.
ourMinisteries={"ustitie":"JUS", "nderwijs":"OCW", "uitenlandse":"BUZA", "innenlandse":"BZK", "efensie":"DEF", "reemdeling":"VI", "andbouw":"LNV", "erkeer":"VW", "olkshuisvesting":"VROM", "olksgezondheid":"VWS"}
ourNormMinisteries=["JUS", "OCW", "BUZA", "BZK", "DEF", "VI", "LNV", "VW", "VROM", "VWS"]
check = True
for index, row in kvrdf.iterrows():
    if row.ministerie in ourNormMinisteries:
        if check:
            print "You've already run this code!" 
            check = False
    elif isinstance(row.ministerie, basestring) and any(x in row.ministerie for x in ourMinisteries):
        for key in ourMinisteries:
            if key in row.ministerie:
                kvrdf.set_value(index, 'ministerie', ourMinisteries[key])
                break
    else:
        kvrdf=kvrdf.drop([index])

print kvrdf.shape
# Divide data into test and training sets:
kvrTrain = kvrdf[4201:]
kvrTest = kvrdf[:4200]
print kvrTrain.shape, kvrTest.shape

# List of stopwords and special chars
DutchStop = stopwords.words('dutch') + [u',',u'.',u'?',u')',u'(',u"''",u'-']

# List the ministeries from most to least common.
counts = kvrTrain.ministerie.value_counts()

# Show the classes we will be using
classList = list(counts[:10].axes[0][0:10])
print classList

SyntaxError: Missing parentheses in call to 'print'. Did you mean print("You've already run this code!")? (<ipython-input-1-4305cbff1829>, line 14)

In [11]:
# From the previous assignment: Making an inverted index
def index_collection(docs):
    MyIndex= defaultdict(Counter) # initialize MyIndex
    print 'Indexing...'
    count = len(docs)
    for index, doc in docs.iterrows():  # loop over each file
        print "\r", count,
        count -= 1
        f = doc.titel + doc.vraag + doc.antwoord
        text = [w for w in nltk.word_tokenize(f)]   # get text and tokenize
        for w in text:    # update MyIndex with each token 
            if not w.isupper(): # leave completely upper case words
                w = w.lower()   # lower case others
            MyIndex[w][index]+=1
    print 'Done!'            
    return MyIndex

trainingIndex = index_collection(kvrTrain)

 Indexing...
1 Done!                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       


197177

In [58]:
# Implement the two algorithms

# Training function
def trainMNB(classes, docs):
    print 'Training...'
    # Extract vocabulary v
    ## DEPRECATED ##
    # v = vocab(docs)
    vsize = len(trainingIndex)
    print 'vocab size: ', vsize
    # n is the number of documents used
    n = len(docs)
    # for each class do:
    p_nc = {}
    condProb = {}
    print 'Training class: '
    for c in classes:
        print c
        condProb[c] = {}
        # Count documents in class (docs, c)
        nc = docs.ministerie.value_counts().loc[c]
        # Calculate the prior
        p_nc[c] = nc / n
        # Get all text for the class
        t_counts = Counter(classText(docs, c))
        t_total = sum(t_counts.values())
        # Calculate conditional probability
        time = len(trainingIndex)
        for (t, counter) in trainingIndex.items():
            print '\r', time,
            time -= 1
            if t in t_counts:
                condProb[c][t] = (t_counts[t] + 1) / (t_total + vsize)
            else:
                condProb[c][t] = 1 / (t_total + vsize)
    print 'Done!'
    return p_nc, condProb

# Get all words from a set of documents as a list of strings
def vocab(docs):
    result = []
    # Get text from these index keys:
    text = docs.titel + docs.vraag + docs.antwoord
    # Tokenize each string and turn into a bag of words
    count = 0
    print 'Processing documents:'
    for item in text:
        count += 1
        print '\r', len(text)-count,
        BoW = list([w for w in nltk.word_tokenize(item) if not w in set(DutchStop)])
        result += BoW
    return result

# Get all words for a class and set of documents as a list of strings
def classText(docs, c):
    result = []
    classdocs = docs[docs.ministerie==c]
    text = classdocs.titel + classdocs.vraag + classdocs.antwoord
    # Tokenize and turn into list
    for item in text:
        BoW = list([w for w in nltk.word_tokenize(item) if not w in set(DutchStop)])
        result += BoW
    return result

In [78]:
# Training the classifier
# list(set(vocab(kvrTest[0:100])))
p, cp = trainMNB(classList, kvrTrain)
#timeit.timeit(trainMNB(classList, kvrTrain))

Training...
vocab size:  197177
Training class: 
VIFM         Done!


In [60]:
# Apply the trained model to a document
def applyMNB(classes, prior, cp, doc):
    score = {}
    w = nltk.word_tokenize(doc.titel + doc.vraag + doc.antwoord)
    for c in classes:
        score[c] = math.log(prior[c])
        for t in w:
            if t in cp[c]:
                # print '\r', score[c],
                score[c] += math.log(cp[c][t])
    # print score
    return max(score, key=lambda i: score[i])

In [84]:
# Apply the MNB 
# --unfinished--
def applySet(docs,p,cp):
    correct = 0
    mnbCount = defaultdict(Counter)
    time = len(docs)
    print 'Documents remaining:'
    for index, doc in docs.iterrows():
        print '\r', time,
        time -= 1
        mnb = applyMNB(classList, p, cp, doc)
        if mnb == doc.ministerie:
            mnbCount[mnb]['correct'] += 1
            correct += 1
        else:
            mnbCount[mnb]['incorrect'] += 1
    print 'precision: ', correct / len(docs)
    return mnbCount

In [85]:
# Try classifying a document:
#row = kvrTest.ix[4]
#applyMNB(classList, p, cp, row)

# Classify a set of documents
applySet(kvrTest,p,cp)

Documents remaining:
1 precision:  0.262380952381                                                                                                                                                                                                    


defaultdict(collections.Counter,
            {'BUZA': Counter({'correct': 36}),
             'BZK': Counter({'correct': 165, 'incorrect': 106}),
             'DEF': Counter({'correct': 172, 'incorrect': 639}),
             'JUS': Counter({'correct': 1}),
             'LNV': Counter({'correct': 209, 'incorrect': 93}),
             'OCW': Counter({'correct': 191, 'incorrect': 13}),
             'VI': Counter({'incorrect': 2163}),
             'VROM': Counter({'correct': 218, 'incorrect': 73}),
             'VW': Counter({'correct': 102, 'incorrect': 11}),
             'VWS': Counter({'correct': 8})})

In [73]:
# Compute for each term and each of your 10 classes its utility for that class using mutual information. 
# --unfinished--
def mutualInfo(clist, docs, index):
    result = {}
    for c in clist:
        v = list(set(classText(docs, c)))
        result[c] = {}
        print 'Processing terms for: ', c
        time = len(index)
        for (t, counter) in index:
            time -= 1
            print '\r', time,
            classDocs = docs.ministerie.loc[c]
            n = len(classDocs)
            n1 = len(counter)
            n0 = len(docs) - n1
            n11, n10, n01, n00 = (1,)*4
            for doc, count in counter.items():
                if t in tokens:
                    n1 += 1
                    if doc.ministerie == c:
                        n11 += 1
                    else:
                        n10 += 1
                else:
                    n0 += 1
                    if doc.ministerie == c:
                        n01 += 1
                    else:
                        n00 += 1
            # print n, n1, n0, n11, n10, n01, n00
            t1 = (n11 / n) * math.log((n*n11)/(n1*(n01+n11)),2)
            t2 = (n01/n) * math.log((n*n01)/((n00+n01)*(n01+n11)),2)
            t3 = (n10/n) * math.log((n*n10)/((n10+n11)*(n00+n10)),2)
            t4 = (n00/n) * math.log((n*n00)/((n00+n01)*(n00+n10)),2)
            # print t1, t2, t3, t4
            result[c][t] = t1 + t2 + t3 + t4
            # print '\r', result
    return result


In [57]:
from operator import itemgetter
#print len(list(set(v)))
mi = mutualInfo([classList[0]], kvrTrain[:10])
print sorted(mi['JUS'].items(), key=itemgetter(1), reverse = True)[:10]


 Processing terms 
[]


In [92]:
kvrTrain.ministerie.value_counts()

JUS     3687
VWS     3009
BUZA    2184
VW      1743
OCW     1606
BZK     1457
VROM    1217
LNV     1186
DEF      873
VI       554
Name: ministerie, dtype: int64