# Practice Text classification with Naive Bayes  
        
        
        
<h3>Abstract</h3>
<p>We will do text classification on a collection of Dutch parliamentary questions.
    The website <a href="https://zoek.officielebekendmakingen.nl/zoeken/parlementaire_documenten">officielebekendmakingen.nl</a>lets you search in "kamervragen".
    <!--You can donwload
    <a href='http://data.politicalmashup.nl/kamervragen/PoliDocs_Kamervragen.zip'>this zipfile with Kamervragen in XML</a>
    to see some of the  data in XML format. 
    It also contains style sheets to show the XML well in a browser.  
-->
    The <a href='http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/'>MYSQL directory</a> contains an <a href='http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR14807.xml'>example   Kamervraag XML file</a> and a file `kvr.csv.gz` with 40K kamervragen in a handy csv format. Note that in your browser you see the result of applying stylesheets. So choose View Source or open it in an editor.</p>

<h3>First exploration</h3>

See below.

<h2>Exercises</h2>

<p>We will use the fields in elements of the form <tt> &lt;item attribuut="Afkomstig_van"></tt> as our classes. 
    These are the ministeries to whom the question is addressed.
    An example is 
    <pre>
        &lt;item attribuut="Afkomstig_van">Landbouw, Natuurbeheer en Visserij (LNV)&lt;/item>
    </pre>
    Note that these labels are <strong>not normalized</strong>, see e.g. the counts below:
    <pre>
Justitie (JUS)                                                   3219
Volksgezondheid, Welzijn en Sport (VWS)                          2630
Buitenlandse Zaken (BUZA)                                        1796
Verkeer en Waterstaat (VW)                                       1441
Justitie                                                         1333
Sociale Zaken en Werkgelegenheid (SZW)                           1231
Onderwijs, Cultuur en Wetenschappen (OCW)                        1187
Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer (VROM)     984
FinanciÃ«n (FIN)                                                   960
Volksgezondheid, Welzijn en Sport                                 951
Economische Zaken (EZ)                                            946
Buitenlandse Zaken                                                753
Binnenlandse Zaken en Koninkrijksrelaties (BZK)                   725
Verkeer en Waterstaat                                             724
Defensie (DEF)                                                    646
Sociale Zaken en Werkgelegenheid                                  607
Landbouw, Natuurbeheer en Visserij (LNV)                          586
Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer            554
Onderwijs, Cultuur en Wetenschappen                               532
Vreemdelingenzaken en Integratie (VI)                             466
    </pre>
</p>


<h2>Form of handing in your final product</h2>

* An IPython notebook with for each question, a MarkDown cell containing the question, a code cell which solves the question, an output cell with the output, followed by a MarkDown cell with explanation/reflection  

In [1]:
import pandas as pd

names=['jaar', 'partij','titel','vraag','antwoord','ministerie']

# Change to KVR1000.csv.gz if this becomes too slow for you
# kvrdf= pd.read_csv('http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR.csv.gz', 
kvrdf= pd.read_csv('http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR.csv.gz', 
                   compression='gzip', sep='\t', 
                   index_col=0, names=names,
                   ) 

for kolom in names[1:]:
    kvrdf[kolom]= kvrdf[kolom].astype(str)

train = kvrdf.sample(frac=0.2,random_state=200)
test = kvrdf.drop(train.index)

kvrdf = train

print(kvrdf.shape)
kvrdf.head()

(8103, 6)


Unnamed: 0,jaar,partij,titel,vraag,antwoord,ministerie
0000115403.xml,1986,,Welke zijn de gevolgen van de beslissing om d...,De overeenkomsten van 14 oktober 1985 bevatte...,,
V040508590.xml,2005,CDA,Vragen naar aanleiding van een ANP-bericht va...,Deelt u de opvatting dat «de uiterste inspann...,"ANP-bericht, 11 februari 2005 Kamerstuk 27 92...",
KVR8178.xml,1998,CDA,,Hoe verklaart u de toename van het aantal all...,In de eerste acht maanden van 1997 meldden zi...,Justitie (JUS)
0000019654.xml,1991,Groen Links,,Hebt u kennis genomen van de inhoud en conclu...,Ja. In het kader van het CFK-aktieprogramma l...,
KVR9578.xml,1999,D66,Vragen naar aanleiding van de uitspraak van d...,Bent u op de hoogte van de uitspraak van de a...,Ja. Nadat de kinderen door de moeder in Neder...,Justitie


In [2]:
from nltk.corpus import stopwords
import nltk
from collections import Counter
import itertools
from math import log

DutchStop= stopwords.words('dutch')
allvragen= '\n'.join(list(kvrdf.vraag))
classes = list(set(list(kvrdf.ministerie)))

Words are treated as lowercase & stopwords are filtered out below

In [3]:
# Term frequencies per word

def strip_string(string):
    """
    
    :param string: string of unfiltered word/symbols
    :return: list of all tokens extracted from the string (lowercasing, stopwords, alpha)
    """
    return [w for w in nltk.word_tokenize(string.lower()) if w.isalpha() and not w in set(DutchStop)]

def str_to_tf(string):
    """
    
    :param string: string of unfiltered word/symbols
    :return: dictionary of all term frequencies: occurance of term in string
    """
    return Counter(strip_string(string))





In [4]:

def str_to_tk(string):
    """ returns token count in a string
    """
    return len(strip_string(string))
    



## 1. 
Normalize the values for "ministerie" and choose 10 ministeries to work with.

In [5]:
# Only return classes that are about a single ministerie.
def determine_classes(classes):
    """ Remove any class that spans multiple ministeries
    
    :param classes: All classes that occur in kamervragen
    :return:
        norm_classes, a subset of classes with only the classes that span a single ministerie.
    """

    norm_classes = set()
    
    for c in classes:
        add = True

        if c == 'nan':
            continue

        for nc in norm_classes:
            if nc in c:
                add = False
                break
            elif c in nc:
                norm_classes.remove(nc)
                break

        if add:
            norm_classes.add(c)
            
    return norm_classes


# Normalize class c by replacing strange e's with a normal e
# and removing anything between parenthesis.
def normalize_class(c):
    """ Normalize class c by replacing different representations of the e by a normal e
    
    :param c: str, a single class name
    :return:
        str, a single class name with 'normal' e's and no trailing whitespace.
    """

    nc = ""
    parenthesis = False
    
    for char in c:
        if char == '(':
            parenthesis = True
            
        elif char == 'ë':
            char = 'e'
        elif char == 'Ã':
            char = 'e'
        elif char == '«':
            char = ''
            
        if not parenthesis:
            nc += char
            
        if char == ')':
            parenthesis = False
    
    return nc.strip()


def choose_10_classes(class_rows, norm_classes):
    """ Only return the 10 most occuring classes.
    
    :param class_rows: dict, a dictionary mapping a class name to all rows that class occurs in.
    :param norm_classes: set, all normalized class names of classes that only span a single ministerie.
    
    :return:
        set, the normalized names of the 10 most occuring classes
    """
    count = Counter()
    for key, rows in class_rows.items():
        key = normalize_class(key)
        
        if key in norm_classes:
            count[key] += len(rows)
            
    return set([name for name, _ in count.most_common(10)])


def kvrdf_to_10_classes(kvrdf, norm_classes):
    """ Return a copy of kvrdf with normalized class names and only containing rows with classes in norm_classes
    
    :param kvrdf: pd.DataFrame, containing all downloaded information: questions, ministerie, answers, etc.
    :param norm_classes: set, containing all normalized class names of the 10 most occuring classes.
    
    :return: pd.DataFrame, kvrdf but with all rows removed that don't contain any of the ministeries in norm_classes.
    """   
    all_classes = set(kvrdf.ministerie)
    
    for i, c in enumerate(all_classes):
        nc = normalize_class(c)
        
        if not (nc in norm_classes):
            kvrdf = kvrdf[kvrdf.ministerie != c]            
        else:
            kvrdf.loc[kvrdf["ministerie"] == c, "ministerie"] = nc
            
    return kvrdf

# classes is a set of strings of all classes that are in kvrdf
classes = set(kvrdf.ministerie)

# norm classes is a set of strings of classes that only span a single ministerie with normalized names.
norm_classes = {normalize_class(c) for c in determine_classes(classes)}

# class_rows_full is a dictionary mapping a class name to all rows that contain that class.
class_rows_full = {c:kvrdf.loc[kvrdf.ministerie == c] for c in classes}

# classes is adjusted so it now is a set that contains the normalized strings of the 10 most occuring classes.
classes = choose_10_classes(class_rows_full, norm_classes)
print(classes)

# train and test data are adjusted so any rows that don't contain the class in 'classes' are removed.
# also, all class names are normalized.
kvrdf = kvrdf_to_10_classes(kvrdf, classes)
test = kvrdf_to_10_classes(test, classes)
kvrdf.head()

{'Justitie', 'Financien', 'Verkeer en Waterstaat', 'Buitenlandse Zaken', 'Onderwijs, Cultuur en Wetenschappen', 'Sociale Zaken en Werkgelegenheid', 'Economische Zaken', 'Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer', 'Volksgezondheid, Welzijn en Sport', 'Landbouw, Natuurbeheer en Visserij'}


  object.__getattribute__(self, name)
  return object.__setattr__(self, name, value)


Unnamed: 0,jaar,partij,titel,vraag,antwoord,ministerie
KVR8178.xml,1998,CDA,,Hoe verklaart u de toename van het aantal all...,In de eerste acht maanden van 1997 meldden zi...,Justitie
KVR9578.xml,1999,D66,Vragen naar aanleiding van de uitspraak van d...,Bent u op de hoogte van de uitspraak van de a...,Ja. Nadat de kinderen door de moeder in Neder...,Justitie
KVR16328.xml,2002,GroenLinks,Vragen naar aanleiding van een bericht over e...,Kent u het bericht over een toename van meldi...,Ja. De toename van het aantal meldingen over ...,"Onderwijs, Cultuur en Wetenschappen"
KVR6500.xml,1997,GroenLinks,Vragen over besmetting van vlees als gevolg v...,Kent u het bericht dat volgens de microbioloo...,Ja. Het bericht is gebaseerd op het onderzoek...,"Landbouw, Natuurbeheer en Visserij"
KVR5330.xml,1997,CDA,Vragen over het nader onderzoek naar de voorm...,Wanneer wordt in de relevante politierapporte...,Het nadere onderzoek dat ik uw Kamer bij mijn...,Justitie



Classes are extracted from the ministerie column in the dataframe. Any classes that don't occur often or span multiple ministeries, are removed from the dataframe. All other classes have their names normalized.

The normalization is done by changing all variants of the 'e' in a class to a normal 'e'. Also any class that is named 'nan' were removed. And finally, any starting or trailing whitespace were removed from the class name.

## 2.
Implement the two algorithms in Fig MRS.13.2, using your earlier code for creating term and document frequencies. It might be easier to use the representation and formula given in MRS section 13.4.1.

In [6]:
# document frequencies per term
k = [list(set(strip_string(t))) for t in kvrdf.vraag]

# KLS: dfdict here to train w/ MI
dfdict = Counter(list(itertools.chain.from_iterable(k)))

#document frequencies per class per term
def TrainMultinomialNB(pd_df, classes, V=None):
    """ more reference from p258 onwards of MRS
    
    :param classes: classes which are taken into account for training
    :param pd_df: kamervragen dataframe
    :return:
        V: set, of terms which which are used for classification
        class_priors: dictionary, prior probabilities for each class: P(c) prior[class]
        cond_prob: dictionary, conditional probabilities for each term per class: P(t|c) cond[class][term]
        dfdict: dictionary with document frequencies
        classes: list, classes actually used, temporary until class input is synchronised with pandas dataframe
    """
    class_frequency = Counter([c for c in pd_df.ministerie])
    
    
    # P(c)
    class_priors = {c:class_frequency[c]/sum([class_frequency[cn] for cn in classes]) for c in classes}

    #vocabulary
    if not V:
        V = set(dfdict.keys())
    
    # complete rows per class
    class_rows_full = {c:kvrdf.loc[kvrdf.ministerie == c] for c in classes}

    # term frequency per class
    class_tf = {c:str_to_tf('\n'.join(list(class_rows_full.get(c).vraag))) for c in classes}
    
    # token count per class (for sum over Tct' in (t' in V), (p259, formula 13.6)
    class_tk = {c:str_to_tk('\n'.join(list(class_rows_full.get(c).vraag))) for c in classes}

    #P(t|c)
    cond_prob = {t:{c:(class_tf[c][t] + 1)/(class_tk[c] + 1) for c in classes} for t in V}
    
    
    # KLS: ... so no dfdict returned here
    return V, class_priors, cond_prob, classes

def filter_vocab(W, V):
    """ 
    
    :param W: list; stripped query
    :param V: set; of vocab used for classification (determined by one or anoter method)
    :return: a list of words that occur in the used vocabulary
    """
    return [w for w in W if w in V]

def ApplyMultinomialNB(classes, priors, conditionals, d, Vocab=None):
    """ Applies a Naive Bayes classifier on previously derived priors & conditionals on a query d

    :param classes: list, optional ministeries to classify between
    :param priors: dictionary, prior probabilities for each class: P(c) prior[class]
    :param conditionals: dictionary, conditional probabilities for each term per class: P(t|c) cond[class][term]
    :param d: string, query
    :param Vocab: set, of terms which which are used for classification
    :return: best classification for query
    """
    W = strip_string(d)
    if Vocab:
        W = filter_vocab(W, Vocab)
    # KLS: change here to handle single class application
    #if isinstance(priors, float):
    #    score = {classes:log(priors)}
    #else:
    score = {c:log(priors[c]) for c in classes}

    for c in classes:
        score[c] += sum([log(conditionals[t][c]) for t in W if t in conditionals])
    best_class = sorted(score, key = lambda key: score[key])[-1]
    return best_class



## 3.
On this collection, train NB text classifiers for 10 different classes with enough and interesting data.

In [7]:
# KLS: commented out to train later w/ MI
V, class_priors, cond_prob, classes = TrainMultinomialNB(kvrdf, classes)

## 4.
Compute for each term and each of your 10 classes its utility for that class using mutual information.

In [8]:
def extract_class_df(pddf, classes):
    class_dfdict = {c:{} for c in classes}

    for i, c in enumerate(classes):
        for d in kvrdf[kvrdf.ministerie == c].vraag:
            for elem in set(strip_string(d)):
                if class_dfdict[c].get(elem, 0):
                    class_dfdict[c][elem] += 1
                else:
                    class_dfdict[c][elem] = 1
    return class_dfdict

cddf = extract_class_df(kvrdf, classes)

In [9]:
def MI(doc_freq, class_doc_freq):
    """ creates a mutual information dictionary for each term, class pair
    
    :param doc_freq: dictionary, contains documents frequencies per term, doc_freq[term]
    :param class_doc_freq: dictionary, contains documents frequencies per term per class, cdf[class][term]
    :return MIdict: dictionary, contains mutual information for each class/term pair, MI[class][term]
    """
    N = sum(doc_freq.values())
    MIdict = {}
    for c in class_doc_freq.keys():
        MIdict[c] = {}
        for t in class_doc_freq[c].keys():
            N11 = class_doc_freq[c][t]
            N10 = doc_freq[t] - N11
            N01 = sum(class_doc_freq[c].values()) - class_doc_freq[c][t]
            N00 = N - doc_freq[t] - (sum(class_doc_freq[c].values()) - class_doc_freq[c][t])
            N1 = N11 + N10
            N0 = N01 + N00
            if N11:
                MIdict[c][t] = N11/N * log( (N*N11)/(N1*N1), 2)
            if N01:
                MIdict[c][t] += N01/N * log( (N*N01)/(N0*N1) , 2)
            if N10:
                MIdict[c][t] += N10/N * log( (N*N10)/(N1*N0) , 2)
            if N00:
                MIdict[c][t] += N00/N * log(  (N*N00)/(N0*N0), 2)
    return MIdict

MIdict = MI(dfdict, cddf)

## 5.
For each class, show the top 10 words as in Figure 13.7 in MRS.

In [10]:
def top_MI(MIdict, N):
    """ 
    
    :param MIdict: a dictionary with mutual information for each MIdict[class][term]
    :param N: top N MI terms returned per class
    :return: a dictionary of the top N MI predictors of each class    
    """
    resdict = {}
    for c in MIdict.keys():
        resdict[c] = sorted(MIdict[c].items(), key = lambda key: -MIdict[c][key[0]])[:N]
    return resdict

def top_MI_to_markdown(topMI):
    firstline = classes_table_markdown(topMI)
    rest_of_table = values_table_markdown(topMI)
    return firstline + rest_of_table


def classes_table_markdown(topmidict):
    return "| []() | " + " | []() | ".join(topmidict.keys()) + "|\n" + "|" + " | ".join(['----- | -----' for _ in topmidict.keys()]) + "| \n"

def values_table_markdown(topmidict):
    returnstring = ""
    for i, elem in enumerate(topmidict[list(topmidict.keys())[0]]):
        returnstring += " | " + " | ".join([pair_to_markdown(topmidict[c][i]) for c in topmidict.keys()]) + " | \n"
    return returnstring
            
def pair_to_markdown(pair):
    return pair[0] + " | " + str(pair[1])


def topMI_to_vocab(topmidict):
    vocab_l_o_l = [[elem[0] for elem in topmidict[c]] for c in topmidict.keys()]
    return set(list(itertools.chain.from_iterable(vocab_l_o_l)))


In [11]:
#topfeatures = top_MI(MIdict, 1)
#topfeatures10 = top_MI(MIdict, 10)
#topfeatures100 = top_MI(MIdict, 100)

dfdict2 = {t:v for t,v in dfdict.items() if v > 5}
cddf2 = {c:{t:cddf[c][t] for t in cddf[c].keys() if t in dfdict2.keys()} for c in cddf}

MIdict2 = MI(dfdict2, cddf2)
topfeatures2_30 = top_MI(MIdict2, 30)

#topvocab10 = topMI_to_vocab(topfeatures2_10)
#topvocab100 = topMI_to_vocab(topfeatures100)

In [12]:
from IPython.display import display, Markdown, Latex


In [13]:
topfeatures_MD2 = top_MI_to_markdown(topfeatures2_30)

# bug is because of empty classes & not being able to recognize classes in normalized form

display(Markdown(topfeatures_MD2))

| []() | Justitie | []() | Financien | []() | Verkeer en Waterstaat | []() | Buitenlandse Zaken | []() | Onderwijs, Cultuur en Wetenschappen | []() | Sociale Zaken en Werkgelegenheid | []() | Economische Zaken | []() | Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer | []() | Volksgezondheid, Welzijn en Sport | []() | Landbouw, Natuurbeheer en Visserij|
|----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----| 
 | griffier | 2.4285798500549296 | bnb | 0.5853723320925814 | tunnels | 0.9562544027350109 | un | 1.2256791892157386 | vsnu | 0.8188053099694912 | werkvoorziening | 0.8669835918569925 | gaswet | 0.6053225478094617 | bedrijventerreinen | 0.7649050958316113 | generieke | 2.203491347348963 | veevoer | 0.499002326631561 | 
 | behoorlijke | 2.4285798500549296 | omzetbelasting | 0.5853723320925814 | fiets | 0.9562544027350109 | molukken | 1.2256791892157386 | onderwijsblad | 0.8188053099694912 | uitzendkrachten | 0.8669835918569925 | tnt | 0.6052934176968702 | bestemmingsplannen | 0.7649050958316113 | geneesmiddelenwet | 2.203491347348963 | diersoorten | 0.4989722449236479 | 
 | averechts | 2.4285798500549296 | imf | 0.5853430360090361 | wegennet | 0.9562544027350109 | milities | 1.2256791892157386 | onderwijsbond | 0.8188053099694912 | xv | 0.8669835918569925 | afnemers | 0.6052934176968702 | wro | 0.7649050958316113 | borstkanker | 2.203491347348963 | staatsbosbeheer | 0.4989722449236479 | 
 | topambtenaar | 2.4285798500549296 | tax | 0.5853430360090361 | spoorlijn | 0.9562544027350109 | ambassadeurs | 1.2256791892157386 | havo | 0.8188053099694912 | arbeidsongeschikt | 0.8669835918569925 | windenergie | 0.6052934176968702 | verbranding | 0.7648771357546459 | bedrijvenpoli | 2.203491347348963 | faunawet | 0.4989722449236479 | 
 | mensensmokkel | 2.4285798500549296 | invordering | 0.5853222306708862 | inpassing | 0.9562275800396735 | civil | 1.2256791892157386 | examens | 0.8187776944239534 | werkhervatting | 0.8669835918569925 | energiesector | 0.6052727783199998 | woningstichting | 0.7648576663531585 | toetsingscommissies | 2.203491347348963 | zeehonden | 0.4989722449236479 | 
 | notarissen | 2.4285798500549296 | terugbetaald | 0.5853222306708862 | rijbewijs | 0.9562275800396735 | peace | 1.2256791892157386 | middelbare | 0.8187776944239534 | arbeidsongeschikten | 0.8669835918569925 | jaarverslagen | 0.6052727783199998 | vierkante | 0.7648576663531585 | ccmo | 2.203491347348963 | laser | 0.49895065401126204 | 
 | opsporingsonderzoek | 2.4285798500549296 | belastinginkomsten | 0.5853222306708862 | maximumsnelheid | 0.9562275800396735 | rwandese | 1.2256791892157386 | gemeentefonds | 0.8187585695385503 | uitvoeringsinstelling | 0.8669562671368469 | brandstofprijzen | 0.6052727783199998 | geel | 0.7648576663531585 | mensgebonden | 2.203491347348963 | zeldzame | 0.49895065401126204 | 
 | kansspelbeleid | 2.4285798500549296 | aanmerkt | 0.5853055751325945 | landingsrechten | 0.9562275800396735 | strafhof | 1.2256536604819206 | bibliotheken | 0.8187585695385503 | wsw | 0.8669562671368469 | bonafide | 0.6052727783199998 | leegstand | 0.7648576663531585 | toedienen | 2.203491347348963 | dierenwelzijn | 0.49895065401126204 | 
 | rechercheurs | 2.428558124460832 | monitor | 0.5853055751325945 | vervoerder | 0.9562275800396735 | hamas | 1.2256536604819206 | leermiddelen | 0.8187585695385503 | solliciteren | 0.8669374330646434 | abonnement | 0.6052727783199998 | woningbouw | 0.7648576663531585 | gvs | 2.203491347348963 | eten | 0.49895065401126204 | 
 | tamil | 2.428558124460832 | schatkist | 0.5853055751325945 | snelwegen | 0.9562092479727526 | oekraïne | 1.2256536604819206 | verdelen | 0.8187585695385503 | verzekeringskamer | 0.8669374330646434 | installeren | 0.6052562887333139 | lawaai | 0.7648576663531585 | dbc | 2.203491347348963 | natuurwaarden | 0.49895065401126204 | 
 | terroristen | 2.428558124460832 | ingedeeld | 0.5853055751325945 | verkeerstekens | 0.9562092479727526 | uganda | 1.2256536604819206 | erfgoed | 0.8187585695385503 | flexibele | 0.8669374330646434 | kinderarbeid | 0.6052562887333139 | planbureau | 0.7648423466816193 | verpleeghuiszorg | 2.203491347348963 | structuurschema | 0.49895065401126204 | 
 | racisme | 2.428558124460832 | tegenvallers | 0.5853055751325945 | calamiteit | 0.9561950655887234 | soedanese | 1.2256536604819206 | les | 0.8187585695385503 | integratiebeleid | 0.8669374330646434 | mislukt | 0.6052562887333139 | dorpen | 0.7648423466816193 | zorgvraag | 2.203491347348963 | schapen | 0.49895065401126204 | 
 | revu | 2.428558124460832 | topman | 0.5853055751325945 | ptt | 0.9561950655887234 | office | 1.2256536604819206 | rijksmiddelen | 0.8187585695385503 | anw | 0.8669374330646434 | grip | 0.6052562887333139 | bedrijvigheid | 0.7648423466816193 | tandheelkunde | 2.203491347348963 | gebiedsgerichte | 0.49895065401126204 | 
 | vreemdelingencirculaire | 2.428558124460832 | ozb | 0.5853055751325945 | telefoon | 0.9561950655887234 | war | 1.2256536604819206 | sponsor | 0.8187585695385503 | samenwonen | 0.8669374330646434 | opererende | 0.6052562887333139 | koepel | 0.7648423466816193 | bijsluiter | 2.203491347348963 | ethisch | 0.49895065401126204 | 
 | cbp | 2.428558124460832 | uitgeverijen | 0.5853055751325945 | prognose | 0.9561950655887234 | kernwapens | 1.2256536604819206 | creatief | 0.8187435943677542 | gerespecteerd | 0.8669227486948543 | zomaar | 0.6052562887333139 | onroerend | 0.7648423466816193 | voorschrijfgedrag | 2.203491347348963 | aardappelen | 0.49895065401126204 | 
 | vreemdelingendiensten | 2.428558124460832 | rechtsbescherming | 0.5853055751325945 | zand | 0.9561950655887234 | mensenrechtenactivist | 1.2256536604819206 | vo | 0.8187435943677542 | gemeentebesturen | 0.8669227486948543 | gejaagd | 0.6052562887333139 | klimaatverandering | 0.7648423466816193 | thuiszorginstelling | 2.203491347348963 | natuurgebieden | 0.49895065401126204 | 
 | vuurwapens | 2.428558124460832 | lening | 0.5853055751325945 | ameland | 0.9561950655887234 | mensenrechtenverdragen | 1.2256536604819206 | homoseksualiteit | 0.8187435943677542 | kroon | 0.8669227486948543 | xiii | 0.6052562887333139 | vrijkomen | 0.7648423466816193 | ziektekostenverzekering | 2.203491347348963 | uitbraak | 0.49893321294885873 | 
 | beveiligingsmaatregelen | 2.428558124460832 | bewindvoerder | 0.5853055751325945 | milieueffecten | 0.9561950655887234 | declaration | 1.2256536604819206 | ergste | 0.8187435943677542 | veeleer | 0.8669227486948543 | dollar | 0.6052562887333139 | opruimen | 0.7648423466816193 | ciz | 2.203491347348963 | houdend | 0.49893321294885873 | 
 | verhoor | 2.428558124460832 | toegepaste | 0.5852925542778851 | aanschaffen | 0.9561950655887234 | leider | 1.2256536604819206 | versnippering | 0.8187435943677542 | toetst | 0.8669227486948543 | consortium | 0.6052434338205386 | kopers | 0.7648423466816193 | genezer | 2.203491347348963 | polder | 0.49893321294885873 | 
 | ontvlucht | 2.428558124460832 | element | 0.5852925542778851 | mer | 0.9561950655887234 | gevangene | 1.2256536604819206 | nogal | 0.8187435943677542 | zomerreces | 0.8669227486948543 | verbindingen | 0.6052434338205386 | velzen | 0.7648423466816193 | palliatieve | 2.203491347348963 | subsidiebedrag | 0.49893321294885873 | 
 | heropenen | 2.4285448893752846 | verspreiden | 0.5852925542778851 | stimulans | 0.9561950655887234 | says | 1.2256536604819206 | stevige | 0.8187435943677542 | cnv | 0.8669227486948543 | gerelateerde | 0.6052434338205386 | wateren | 0.7648423466816193 | cak | 2.203491347348963 | nul | 0.49893321294885873 | 
 | willem | 2.4285448893752846 | indirecte | 0.5852925542778851 | vlaamse | 0.9561950655887234 | bloedbad | 1.2256366223340378 | overblijven | 0.8187435943677542 | premieheffing | 0.8669227486948543 | verzameld | 0.6052434338205386 | vuurwerk | 0.7648423466816193 | specialist | 2.2034690522114957 | natuurlijk | 0.49893321294885873 | 
 | arbeidsrecht | 2.4285448893752846 | efficiënt | 0.5852925542778851 | aanleggen | 0.9561950655887234 | action | 1.2256366223340378 | regeringsstandpunt | 0.8187322537952991 | halsema | 0.8669227486948543 | verzamelde | 0.6052434338205386 | verhandeld | 0.7648423466816193 | sporters | 2.2034690522114957 | besmettelijke | 0.49893321294885873 | 
 | heren | 2.4285448893752846 | automatische | 0.5852925542778851 | snelheden | 0.9561950655887234 | law | 1.2256366223340378 | bezorgen | 0.8187322537952991 | kinderarbeid | 0.8669227486948543 | importeur | 0.6052434338205386 | brandstoffen | 0.7648423466816193 | buitengewone | 2.2034690522114957 | fte | 0.49893321294885873 | 
 | lekken | 2.4285448893752846 | liter | 0.5852925542778851 | deadline | 0.9561950655887234 | understanding | 1.2256366223340378 | veiling | 0.8187322537952991 | lukt | 0.8669116989112152 | vervaardigd | 0.6052434338205386 | veroorzaakte | 0.7648423466816193 | contracteren | 2.2034690522114957 | japanse | 0.4989194066201563 | 
 | belemmerend | 2.4285448893752846 | leermiddelen | 0.5852925542778851 | passeren | 0.9561950655887234 | herald | 1.2256366223340378 | gezette | 0.8187322537952991 | administratiekantoor | 0.8669116989112152 | voetbal | 0.6052434338205386 | complex | 0.7648423466816193 | ambulances | 2.2034690522114957 | benutting | 0.4989194066201563 | 
 | bewaard | 2.4285448893752846 | stemt | 0.5852925542778851 | voorleggen | 0.9561845177713224 | an | 1.2256366223340378 | anoniem | 0.8187322537952991 | beoordelingen | 0.8669116989112152 | minima | 0.6052434338205386 | benutting | 0.7648423466816193 | dosis | 2.2034690522114957 | beleidsmatig | 0.4989194066201563 | 
 | bvd | 2.4285448893752846 | ruimen | 0.5852925542778851 | keus | 0.9561845177713224 | inlichtingendienst | 1.2256366223340378 | uitgerust | 0.8187322537952991 | explosief | 0.8669116989112152 | overheidsdiensten | 0.6052434338205386 | gemodificeerde | 0.7648423466816193 | medicatie | 2.2034690522114957 | session | 0.4989194066201563 | 
 | auteurswet | 2.4285448893752846 | leeftijdsdiscriminatie | 0.5852925542778851 | materialen | 0.9561845177713224 | multilaterale | 1.2256366223340378 | arnhemse | 0.8187322537952991 | privaat | 0.8669116989112152 | concurrerende | 0.6052434338205386 | bestaand | 0.7648423466816193 | medical | 2.2034690522114957 | produktie | 0.4989194066201563 | 
 | implicaties | 2.4285448893752846 | opvolger | 0.5852925542778851 | opruimen | 0.9561845177713224 | koerdische | 1.2256366223340378 | beantwoordt | 0.8187322537952991 | vloeien | 0.8669116989112152 | voorkomend | 0.6052434338205386 | milieueffecten | 0.7648306616237612 | verkorten | 2.2034690522114957 | afgewacht | 0.4989194066201563 | 


## 6.
Evaluate your classifiers using Precision, Recall and F1. (Give a table in which you show these values for using the top 10, top 100 terms and all terms, for all of your 10 classes)
          Thus do feature selection per class, and use for each class the top n best features for that class. 
          <br/>
      Also show the microaverage(s) for all 10 classes together.
      <br/>
      If you like you can also present this in a figure like MRS.13.8. 
      Then compute the F1 measure for the same number of terms as in that figure.

In [15]:
def precision(tp, tpfp):
    return tp/(tpfp)
    
def recall(guesses, classes):
    return guesses/classes
    
def F1_measure(P, R):
    return 2*P*R/(P+R)


In [17]:


def getstats_macro_fix(vocabs, test):
    """ takes a vocabulary and testset and collects performance in Precision, Recall & F1 per class
    """
    results = []
    for vocab in vocabs:
        print("starting with vocab")
        recall_dict = {c:{'retrieved':0, 'total relevant':len(test.loc[test["ministerie"] == c, "ministerie"])} for c in classes} 
        precision_dict = {c:{'tp':1, 'tp+fp':1} for c in classes}
        
        for i, query in enumerate(test.vraag):
            guess = ApplyMultinomialNB(classes, class_priors, cond_prob, query, Vocab=vocab)
            if guess == test.ministerie[i]:
                recall_dict[guess]['retrieved'] += 1
                precision_dict[guess]['tp'] += 1
                precision_dict[guess]['tp+fp'] += 1
            else:
                precision_dict[guess]['tp+fp'] += 1
        
        
        P = [precision(precision_dict[c]['tp'], precision_dict[c]['tp+fp']) for c in recall_dict.keys()]
        R = [recall(recall_dict[c]['retrieved'], recall_dict[c]['total relevant']) for c in recall_dict.keys()]
        F1 = [F1_measure(p, r) for p,r in zip(P,R)]
        
        results.extend((p,r,f1,c) for p,r,f1,c in zip(P,R,F1,[c for c in recall_dict.keys()]))
        return results




In [19]:
#micro = getstats_microavg()
# KLS: commented out because it takes a long time, my results are stored below
# macro = getstats_macroavg()
#macro = [(0.04666320586029878, 1.0, 0.08916565634299571), (0.10070946530541616, 1.0, 0.18299009589687157), (0.1626002191844033, 1.0, 0.2797181980551696), (0.06881236661475457, 1.0, 0.12876416621694547), (0.07809886370190922, 1.0, 0.14488256380075973), (0.11830189767549172, 1.0, 0.21157416958943676), (0.08461671569475687, 1.0, 0.15603063178047225), (0.20989790621214743, 1.0, 0.34696796338672775), (0.07163869181519295, 1.0, 0.13369933796221542), (0.058660667935629004, 1.0, 0.11082052958483166), (0.04666320586029878, 1.0, 0.08916565634299571), (0.10070946530541616, 1.0, 0.18299009589687157), (0.1626002191844033, 1.0, 0.2797181980551696), (0.06881236661475457, 1.0, 0.12876416621694547), (0.07809886370190922, 1.0, 0.14488256380075973), (0.11830189767549172, 1.0, 0.21157416958943676), (0.08461671569475687, 1.0, 0.15603063178047225), (0.20989790621214743, 1.0, 0.34696796338672775), (0.07163869181519295, 1.0, 0.13369933796221542), (0.058660667935629004, 1.0, 0.11082052958483166), (0.04666320586029878, 1.0, 0.08916565634299571), (0.10070946530541616, 1.0, 0.18299009589687157), (0.1626002191844033, 1.0, 0.2797181980551696), (0.06881236661475457, 1.0, 0.12876416621694547), (0.07809886370190922, 1.0, 0.14488256380075973), (0.11830189767549172, 1.0, 0.21157416958943676), (0.08461671569475687, 1.0, 0.15603063178047225), (0.20989790621214743, 1.0, 0.34696796338672775), (0.07163869181519295, 1.0, 0.13369933796221542), (0.058660667935629004, 1.0, 0.11082052958483166)]

#vocabs = [V, topvocab10, topvocab100]
vocabs = [V]

#macro = getstats_macro_fix(vocabs, test)
print(macro)
print()

['Justitie', 'Financien', 'Verkeer en Waterstaat', 'Buitenlandse Zaken', 'Onderwijs, Cultuur en Wetenschappen', 'Sociale Zaken en Werkgelegenheid', 'Economische Zaken', 'Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer', 'Volksgezondheid, Welzijn en Sport', 'Landbouw, Natuurbeheer en Visserij']
[(0.9188940092165898, 0.5476779334982138, 0.6863051956933509), (0.43393466601657726, 0.7451802179379715, 0.5484784110409693), (0.7981220657276995, 0.6809851088201604, 0.734915293677067), (0.8373493975903614, 0.8127742564602632, 0.8248788293572038), (0.7298888162197514, 0.8234859675036927, 0.7738675872594257), (0.6521739130434783, 0.7866394001363326, 0.7131233649865079), (0.4206036745406824, 0.6293018682399213, 0.5042104596876831), (0.6941923774954628, 0.6151368760064412, 0.6522779954666645), (0.9498873873873874, 0.5980844271018092, 0.7340092998841676), (0.35188047398248323, 0.8430160692212608, 0.4965131009870975)]



In [20]:
def stats_to_markdown(stats, vocabs, measures):
    firstline = classes_table_markdown(topfeatures)
    values = values_to_markdown(stats, vocabs, measures)
    return firstline + values

def values_to_markdown(stats, vocabs, measures):
    print("Macro averages")
    returnstring = ""
    stats1 = stats[:9]
    stats2 = stats[9:19]
    stats3 = stats[19:]
    for i, stat in enumerate([stats1, stats2, stats3]):
        returnstring += " | " + vocabs[i] + " | \n"
        for j, measure in enumerate(measures):
            for s in stats:
                returnstring += " | " + measure + " | " + str((s[j]))
            returnstring += " | \n"
            
    return returnstring

def show_micro(stats, vocabs, measures):
    print("Micro averages \n")
    for i, vocab in enumerate(vocabs):
        print(vocab + ":")
        for j, measure in enumerate(measures):
            print(measure + " = " + str(stats[i][j]))
        print("\n")

print(macro)
vocabs = ["V = all"]
measures = ["Precision", "Recall", "F1"]
show_macro = stats_to_markdown(macro, vocabs, measures)
display(Markdown(show_macro))
show_micro(micro, vocabs, measures)

[(0.9188940092165898, 0.5476779334982138, 0.6863051956933509), (0.43393466601657726, 0.7451802179379715, 0.5484784110409693), (0.7981220657276995, 0.6809851088201604, 0.734915293677067), (0.8373493975903614, 0.8127742564602632, 0.8248788293572038), (0.7298888162197514, 0.8234859675036927, 0.7738675872594257), (0.6521739130434783, 0.7866394001363326, 0.7131233649865079), (0.4206036745406824, 0.6293018682399213, 0.5042104596876831), (0.6941923774954628, 0.6151368760064412, 0.6522779954666645), (0.9498873873873874, 0.5980844271018092, 0.7340092998841676), (0.35188047398248323, 0.8430160692212608, 0.4965131009870975)]


NameError: name 'topfeatures' is not defined

## 7 
You have done the complete implementation by yourself. Congratulations! You can also use `scikit-learn` routines for all of this work. Do that. So follow [this text classification tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)  and implement the same steps but now with your kamervragen dataset. Also use [mutual information feature selection](http://scikit-learn.org/stable/modules/feature_selection.html) to select the K-best features, and compare the results as before.


In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
    

def make_bunch(kvrdf, classes):
    """ 
    Constructs a sklearn Bunch object from a pandas dataframe. A Bunch is an Object similar to a Python dictionary
    that provides attribute-style access. (dataset["target"] == dataset.target)
    Input: pandas df, list of class names
    """

    kvrdf_np = kvrdf.values
    class_names = list(kvrdf_np[:][:, 5])
    data = list(kvrdf_np[:][:,3])
    classes = list(classes)

    target = []
    x = 0
    for name in class_names:
        target.append(classes.index(name)) 

    
    dataset = datasets.base.Bunch(data=data, target=target, target_names=classes)
    return(dataset)
    
kvrdf_train, kvrdf_test = train_test_split(kvrdf, test_size=0.2)
dataset = make_bunch(kvrdf_train, classes)
test_dataset = make_bunch(kvrdf_test, classes)

Tokenizing text files with scikit-learn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Build a dictionary of features and transforms documents to feature vectors
# Absolute word count
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(dataset.data)
X_train_counts.shape

# CALCULATE TF AND IDF
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

In [None]:
#feature SELECTION
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X, y = X_train_tfidf, dataset.target
X.shape
selectkbest = SelectKBest(chi2, k=5000)
X_new = selectkbest.fit_transform(X, y)
X_new.shape

In [None]:
# TRAINING A CLASSIFIER
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_new, dataset.target)

In [None]:
import numpy as np
# Trying classifier on test data
X_new_counts = count_vect.transform(test_dataset.data)

X_new_tfidf = tfidf_transformer.transform(X_new_counts)

X_selected_f = selectkbest.transform(X_new_tfidf)


predicted = clf.predict(X_selected_f)

print(np.mean(predicted == test_dataset.target))

# for doc, category in zip(docs_new, predicted):
#     print('%r => %s' % (doc, dataset.target_names[category]))
#     print("\n")   

Instead of implementing everything yourself, you can also use a library, such as scikit-learn, for text classification. In the scikit-learn tutorial for naive Bayes classification, there are examples that walk you through all the steps of using the library to implement a naive Bayes classifier. First the data has to be loaded into Python. In the tutorial a scikit-learn Bunch object was used, so we opted to use this data type, as it would make it easier to follow along with the tutorial.

After loading in the data, the documents are tokenized the documentes are transformed into feature vectors, with a CountVectorizer() object. Counting the absolute occurence of words is a good start, but there is a problem. Documents of different lengths about the same subject have different count values. We therefore use term frequencies instead of absolute word count.

Scikit-learn has built-in functionality for feature selection. Here, we've selected the K-best features with SelectKBest. We experimented with different values for K and we seemed to get the best results with a value for K around 5000.

Using this transformed training data, we trained a Naive Bayes classifier with MultinomialNB and applied the classifier to our training data.

This led to an overall performance of 0.5893604303646145

## 8

Reflect and report briefly about your choices in this process and about the obtained results. Also reflect on the differences between the scikit learn approach and the "own implementation approach".

<h3>Training/Testing</h3>
<p>It is important that you do not test your classifier using documents that have also been used in training.
    So split up your collection in a training set and a test set. A 80%-20% split is reasonable.

<br/>
    If you have too little data you can use 5 or <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation">10-fold cross validation</a>.</p>



<h2>Form of handing in your final product</h2>

* An IPython notebook with for each question, a MarkDown cell containing the question, a code cell which solves the question, an output cell with the output, followed by a MarkDown cell with explanation/reflection  