# Practice Text classification with Naive Bayes  
        
        
        
<h3>Abstract</h3>
<p>We will do text classification on a collection of Dutch parliamentary questions.
    The website <a href="https://zoek.officielebekendmakingen.nl/zoeken/parlementaire_documenten">officielebekendmakingen.nl</a>lets you search in "kamervragen".
    <!--You can donwload
    <a href='http://data.politicalmashup.nl/kamervragen/PoliDocs_Kamervragen.zip'>this zipfile with Kamervragen in XML</a>
    to see some of the  data in XML format. 
    It also contains style sheets to show the XML well in a browser.  
-->
    The <a href='http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/'>MYSQL directory</a> contains an <a href='http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR14807.xml'>example   Kamervraag XML file</a> and a file `kvr.csv.gz` with 40K kamervragen in a handy csv format. Note that in your browser you see the result of applying stylesheets. So choose View Source or open it in an editor.</p>

<h3>First exploration</h3>

See below.

<h2>Exercises</h2>

<p>We will use the fields in elements of the form <tt> &lt;item attribuut="Afkomstig_van"></tt> as our classes. 
    These are the ministeries to whom the question is addressed.
    An example is 
    <pre>
        &lt;item attribuut="Afkomstig_van">Landbouw, Natuurbeheer en Visserij (LNV)&lt;/item>
    </pre>
    Note that these labels are <strong>not normalized</strong>, see e.g. the counts below:
    <pre>
Justitie (JUS)                                                   3219
Volksgezondheid, Welzijn en Sport (VWS)                          2630
Buitenlandse Zaken (BUZA)                                        1796
Verkeer en Waterstaat (VW)                                       1441
Justitie                                                         1333
Sociale Zaken en Werkgelegenheid (SZW)                           1231
Onderwijs, Cultuur en Wetenschappen (OCW)                        1187
Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer (VROM)     984
FinanciÃ«n (FIN)                                                   960
Volksgezondheid, Welzijn en Sport                                 951
Economische Zaken (EZ)                                            946
Buitenlandse Zaken                                                753
Binnenlandse Zaken en Koninkrijksrelaties (BZK)                   725
Verkeer en Waterstaat                                             724
Defensie (DEF)                                                    646
Sociale Zaken en Werkgelegenheid                                  607
Landbouw, Natuurbeheer en Visserij (LNV)                          586
Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer            554
Onderwijs, Cultuur en Wetenschappen                               532
Vreemdelingenzaken en Integratie (VI)                             466
    </pre>
</p>


<h2>Form of handing in your final product</h2>

* An IPython notebook with for each question, a MarkDown cell containing the question, a code cell which solves the question, an output cell with the output, followed by a MarkDown cell with explanation/reflection  

In [1]:
import pandas as pd

names=['jaar', 'partij','titel','vraag','antwoord','ministerie']

# Change to KVR1000.csv.gz if this becomes too slow for you
# kvrdf= pd.read_csv('http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR.csv.gz', 
kvrdf= pd.read_csv('http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR1000.csv.gz', 
                   compression='gzip', sep='\t', 
                   index_col=0, names=names,
                   ) 

for kolom in names[1:]:
    kvrdf[kolom]= kvrdf[kolom].astype(str)
print(kvrdf.shape)
kvrdf.head()



(1000, 6)


Unnamed: 0,jaar,partij,titel,vraag,antwoord,ministerie
KVR1000.xml,1994,PvdA,De vragen betreffen de betrouwbaarheid van de...,Hebt u kennisgenomen van het televisieprogram...,Ja. Het bedoelde geluidmeetpunt is eigendom v...,Verkeer en Waterstaat
KVR10000.xml,1999,PvdA,Vragen naar aanleiding van berichten (uitzend...,Kent u de berichten over de situatie in de Me...,,Justitie
KVR10001.xml,1999,SP,"Vragen naar aanleiding van de berichten ""Nede...",Kent u de berichten «Nederland steunt de Soeh...,,Financien
KVR10002.xml,1999,PvdA,Vragen over de gebrekkige opvang van verpleeg...,Kent u het bericht over onderzoek van Nu91 me...,Ja. Het onderzoek van NU’91 wijst uit dat het...,"Volksgezondheid, Welzijn en Sport"
KVR10003.xml,1999,PvdA,Vragen over onbetrouwbaarheid van filemeldingen.,Hebt u kennisgenomen van de berichten over de...,Ja. Nee. Door de waarnemers van het Algemeen ...,Verkeer en Waterstaat


In [12]:
from nltk.corpus import stopwords
import nltk
from collections import Counter
import itertools
from math import log

DutchStop= stopwords.words('dutch')
allvragen= '\n'.join(list(kvrdf.titel))
classes = list(kvrdf.ministerie)

KVR1000.xml       De vragen betreffen de betrouwbaarheid van de...
 KVR10000.xml     Vragen naar aanleiding van berichten (uitzend...
 KVR10001.xml     Vragen naar aanleiding van de berichten "Nede...
 KVR10002.xml     Vragen over de gebrekkige opvang van verpleeg...
 KVR10003.xml     Vragen over onbetrouwbaarheid van filemeldingen.
 KVR10004.xml     Vragen naar aanleiding van het bericht in de ...
 KVR10006.xml                                                     
 KVR10007.xml     Vragen naar aanleiding van het artikel ''Eila...
 KVR10008.xml     Vragen over de mogelijke terugkeer naar het o...
 KVR10009.xml     Vragen over de uiteindelijke invulling van ui...
Name: titel, dtype: object


Words are treated as lowercase & stopwords are filtered out below

In [3]:
# Term frequencies per word
def str_to_tf(string):
    return Counter([w for w in nltk.word_tokenize(string.lower()) if w.isalpha() and not w in set(DutchStop)])


tfdict = {d:str_to_tf(kvrdf.loc[d].titel) for d in list(kvrdf.index)}

print(list(tfdict.keys())[0])
print(tfdict[list(tfdict.keys())[0]])


KVR1000.xml
Counter({'vragen': 1, 'betreffen': 1, 'betrouwbaarheid': 1, 'geluidsmetingen': 1, 'schiphol': 1})


In [4]:
# document frequecies per word
k = [list(set([w for w in nltk.word_tokenize(t.lower()) if w.isalpha() and not w in set(DutchStop)])) for t in kvrdf.titel]

dfdict = Counter(list(itertools.chain.from_iterable(k)))
print(dfdict['vragen'])



896


In [5]:
# class occurance & prior probabilities, given complete data

class_frequency = Counter([c for c in kvrdf.ministerie])
class_priors = {c:val/sum(class_frequency.values()) for c, val in class_frequency.items()}

In [6]:
# complete rows per class

class_rows_full = {c:kvrdf.loc[kvrdf.ministerie == c] for c in classes}


In [11]:

def str_to_tk(string):
    """ returns token count in a string
    """
    return len([w for w in nltk.word_tokenize(string.lower()) if w.isalpha() and not w in set(DutchStop)])
    

# term frequency per class
class_text = {c:str_to_tf('\n'.join(list(class_rows_full.get(c).titel))) for c in classes}
class_tk = {c:str_to_tk('\n'.join(list(class_rows_full.get(c).titel))) for c in classes}


In [13]:
print(class_text[classes[2]].get('vragen'))


7


In [14]:
V = dfdict.keys()
cond_prob = {t:{c:(class_text[c][t] + 1)/(class_tk[c] + 1) for c in classes} for t in V}

In [None]:
class_df = {c:{t:} for t in 

In [64]:
def ApplyMultinomialNB(classes, priors, conditionals, d):
    W = str_to_tf(d)
    score = {c:log(priors[c]) for c in classes}
    for k in score.keys():
        score[k] += sum([log(conditionals[t][k]) for t in W.keys()])
    
    return sorted(score, key = lambda key: score[key])[0]

ApplyMultinomialNB(classes, class_priors, cond_prob, "vragen")


' Binnenlandse Zaken en Koninkrijksrelaties (BZK) Financiën (FIN)'

## 1. 
Normalize the values for "ministerie" and choose 10 ministeries to work with.

## 2.
Implement the two algorithms in Fig MRS.13.2, using your earlier code for creating term and document frequencies. It might be easier to use the representation and formula given in MRS section 13.4.1.

## 3.
On this collection, train NB text classifiers for 10 different classes with enough and interesting data.

## 4.
Compute for each term and each of your 10 classes its utility for that class using mutual information.

In [None]:
def MI(df, class_text):
    N = sum(tf.values())
    MIdict = {}
    for c in class_text.keys():
        MIdict[c] = {}
        for t in c.keys()
            N11 = class_text[c][t]
            N10 = df[t] - N11
            N01 = sum(class_text[c].values()) - class_text[c][t]
            N00 = N - df[t] - (sum(class_text[c].values() - class_text[c][t]))
            N1 = N11 + N10
            N0 = N01 + N00
            MIdict[c][t] = N11/N * log( (N*N11)/(N1*N1), 2)
            MIdict[c][t] += N01/N * log( (N*N01)/(N0*N1) , 2)
            MIdict[c][t] += N10/N * log( (N*N10)/(N1*N0) , 2)
            MIdict[c][t] += N00/N * log(  (N*N00)/(N0*N0), 2)
    

## 5.
For each class, show the top 10 words as in Figure 13.7 in MRS.

## 6.
Evaluate your classifiers using Precision, Recall and F1. (Give a table in which you show these values for using the top 10, top 100 terms and all terms, for all of your 10 classes)
          Thus do feature selection per class, and use for each class the top n best features for that class. 
          <br/>
      Also show the microaverage(s) for all 10 classes together.
      <br/>
      If you like you can also present this in a figure like MRS.13.8. 
      Then compute the F1 measure for the same number of terms as in that figure.

## 7 
You have done the complete implementation by yourself. Congratulations! You can also use `scikit-learn` routines for all of this work. Do that. So follow [this text classification tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)  and implement the same steps but now with your kamervragen dataset. Also use [mutual information feature selection](http://scikit-learn.org/stable/modules/feature_selection.html) to select the K-best features, and compare the results as before.


## 8

Reflect and report briefly about your choices in this process and about the obtained results. Also reflect on the differences between the scikit learn approach and the "own implementation approach".

<h3>Training/Testing</h3>
<p>It is important that you do not test your classifier using documents that have also been used in training.
    So split up your collection in a training set and a test set. A 80%-20% split is reasonable.

<br/>
    If you have too little data you can use 5 or <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation">10-fold cross validation</a>.</p>



<h2>Form of handing in your final product</h2>

* An IPython notebook with for each question, a MarkDown cell containing the question, a code cell which solves the question, an output cell with the output, followed by a MarkDown cell with explanation/reflection  