# Title: Identifying regulations containing outdated technologies

#### Member Names: Shaofang Xu, Dongrui Zhang


# Introduction:

#### Problem Description: 

The government faces a heavy burden of regulations that refer to outdated technologies. Such regulations can be identified by searching for words such as: paper, scan, print, mail, fax, signature, written, pen, pencil, ink, physical, carbon copy, hard copy, original copy, in person, proof, signed, notary, on site, file, filing, submit, submission (non-electronic), document (excluding electronic).

#### Context of the Problem: 

To update regulations for a modern world in which electronic devices are widely and conveniently used, our project attempts to provide a means of finding these regulations.

#### Limitations of Other Approaches:

We tried running a named entity tagger on text from Justice Canada that contains these words and phrases, and then generated a score based on some heuristics. However, the score only addressed the sequence labeling task for individual words; it was not very helpful for classifying sentences and documents that refer to outdated technologies.

#### Solution: 

Maximum Entropy Classifier (MaxEnt) and Logistic regression
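
For a binary label, the two models named above coincide. As a brief sketch in standard notation (the symbols $w$, $f$, $\phi$ are introduced here for illustration and are not used elsewhere in this notebook), a conditional maximum entropy model has the form

$$p(y \mid x) = \frac{\exp\big(w \cdot f(x, y)\big)}{\sum_{y'} \exp\big(w \cdot f(x, y')\big)}$$

With $y \in \{0, 1\}$ and features that fire only for $y = 1$, that is $f(x, 1) = \phi(x)$ and $f(x, 0) = 0$, this reduces to

$$p(y = 1 \mid x) = \frac{\exp\big(w \cdot \phi(x)\big)}{1 + \exp\big(w \cdot \phi(x)\big)} = \frac{1}{1 + \exp\big(-w \cdot \phi(x)\big)}$$

which is the sigmoid used by logistic regression.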



# Background

Related work is summarized in the following table.

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Miles Osborne et al. [1] | Showed how maximum entropy can be used for sentence extraction, and that adding a prior helps deal with the categorical nature of the features | NA | Costly to produce |
| John Mount et al. [2] | Showed that the simpler derivation already given is a very good way to demonstrate the equivalence of logistic regression and maximum entropy modeling | NA | The sigmoid form is less trouble than appealing to maximum entropy |



# Methodology

1. Build datasets:

1) A parser extracts all text from the XML files on the Justice Canada FTP server (ftp://205.193.86.89/) and removes stop words.

2) Create a function that generates a table of sentences containing any of the technology keywords.

3) Review and label the table manually.

2. Train and test datasets:

1) Split the features from the label, and use k-fold cross-validation.

2) Train a Maximum Entropy classifier on the dataset.

3) Train a logistic regression model on the dataset.
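
The k-fold step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's own split: the function name `k_fold_indices`, the shuffling, and the fixed `seed` are hypothetical additions made here so that every row lands in exactly one test fold even when the row count is not divisible by `k`.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx

# 23 rows is an arbitrary size chosen to show the uneven-fold case.
splits = list(k_fold_indices(n_rows=23, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Shuffling before splitting is a design choice: if the rows are ordered by regulation, contiguous folds can give each fold a different mix of documents.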

# Implementation


In [None]:
import os
import nltk
import re
import numpy as np
from nltk.stem import PorterStemmer
import xml.etree.ElementTree as ET
import pandas as pd
import spacy
import en_core_web_sm

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)

In [2]:
"""
This parser extracts all text from XML in one field alongside some other fields in the form of Python dictionary 
object.

@author: Shariyar

"""
import xml.etree.ElementTree

def updateContent(child: xml.etree.ElementTree.Element, content: str):
    if (child.text is not None):
        content=content+child.text+" "
    if (child.tail is not None):
        content=content+child.tail+ " "
    return content
            
    
# Function to parse regulations only
def parseRegulation(root: xml.etree.ElementTree.Element):
    '''
      Parses regulation based xml file
      :param root: xml node of type xml.etree.ElementTree
      :returns dictionary of keys and values
    '''
    xRefXternal=list()
    content=""
    modifiedYear="NA"
    regYear="NA"
    consolidationYear="NA"
    #dt=""
    #modFlag=False
    xmlDict=dict()
    
    
    for child in root.iter():
        #print (child.tag,child.text,child.tail)
        if child.tag=="XRefExternal":
            txt=""
            if child.text is None:
                if child[0].tail is not None:
                    txt=child[0].tail
                else:
                     txt=child[0].text
            else:
                txt=child.text
            xRefXternal.append(txt)
            content=updateContent(child,content)
            #content+=txt+" "
        # get modified date
        elif ("ModifiedDate" in child.tag):
         
            dateTag=child[0]
            #print(date[1].text)
            modifiedYear=dateTag[0].text
        elif (child.tag=="RegistrationDate"):
            dateTag=child[0]
            #print(date[1].text)
            regYear=dateTag[0].text
        elif (child.tag=="ConsolidationDate"):
            dateTag=child[0]
            #print(date[1].text)
            consolidationYear=dateTag[0].text 
        elif (child.tag=="InstrumentNumber"):
            xmlDict["instrumentNumber"]=child.text
        elif (child.tag=="ShortTitle"):
            xmlDict["shorttitle"]=child.text
        elif (child.tag=="RegulationMaker"):
            xmlDict["regulationmaker"]=child.text
        elif (child.tag=="LongTitle"):
            xmlDict["longtitle"]=child.text 
            # Guard against an empty <LongTitle>, which would raise a TypeError
            if child.text is not None:
                content+=child.text+". "
        elif (child.tag=="TitleText"):
            if child.text is None:
                if len(child)>0  and child[0].text is not None:
                    content+=child[0].text+". "
            else:
                content+=child.text+". "
       
            
        elif child.tag=="MM" or child.tag=="DD" or child.tag=="YYYY"  or child.tag=="Label":
            continue
        elif child.tag=="Repealed":
            # print (child.text)
            return None # ignore regulations marked as repealed
            
        else:
            content=updateContent(child,content)
            #print(child.tag,child.text)
            '''if (child.text is not None):
            
                content=content+child.text+" "
            if (child.tail is not None):
                content=content+child.tail+ " "'''
                
           
    xmlDict["modifiedyear"]=modifiedYear
    xmlDict["registrationyear"]=regYear
    xmlDict["consolidationyear"]=consolidationYear
    xmlDict["xrefxternal"]=xRefXternal
    xmlDict["content"]=content
    
    
    return (xmlDict)


In [3]:
def parseReg(directory, filename):
    path=os.path.join(directory, filename)
    tree=ET.parse(path)
    d=parseRegulation(tree.getroot())
    return d
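
The way `updateContent` gathers text relies on the `text` and `tail` attributes of ElementTree nodes. The following sketch shows the difference on a made-up fragment (the XML string is hypothetical and is not taken from the Justice Canada files).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, used only to illustrate text vs. tail.
snippet = "<Text>Send the form by <Emphasis>fax</Emphasis> to the Minister.</Text>"
root = ET.fromstring(snippet)

parts = []
for node in root.iter():
    if node.text is not None:
        parts.append(node.text)   # text between the opening tag and the first child
    if node.tail is not None:
        parts.append(node.tail)   # text that follows the element's closing tag
content = " ".join(p.strip() for p in parts)
print(content)
```

One caveat: because a node's `tail` is appended as soon as the node is visited, text can come out of document order when an element has both child elements and a tail.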

In [4]:
directory="regulations"

# List of stemmed keywords
keyWords = ['paper', 'scan', 'print', 'mail', 'fax', 'signatur', 'written', 'pen', 'pencil', 'ink', 'physic',
            'carbon copi', 'hard copi', 'hardcopi', 'origin copi', 'in person', 'proof', 'sign', 'notari', 'on site', 
            'file', 'submit', 'submiss', 'document']

In [5]:
# Build a table of sentences that contain any of the technology keywords.
def buildSentTable(directory, kw):
    
    doc = pd.DataFrame()
    # Initialize stemmer
    stemmer = PorterStemmer()
    kw_str = '|'.join(kw)
    
    for filename in os.listdir(directory):

        try:        
            d = parseReg(directory, filename)
            num_sent = 0
            sents = nltk.sent_tokenize(d['content'])
            del d['content']
            
            for sent in sents:
                
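                # Stem each token (stemming the whole sentence at once only affects
                # its last word), then keep the sentence if any keyword is present
                stemmed = ' '.join(stemmer.stem(w) for w in nltk.word_tokenize(sent))
                if re.search(kw_str, stemmed):
                    
                    num_sent += 1
                    d['Related Sentence'] = sent
                    d['Sentence #'] = num_sent
                    d['Regulation Filename'] = filename
                    # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
                    doc = pd.concat([doc, pd.DataFrame([d])], ignore_index=True)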

        except Exception:
            # Skip files that fail to parse and repealed regulations (parseReg returns None)
            continue
    
    return doc
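
Because the keyword pattern is a plain alternation, short stems such as `pen` or `ink` also match inside longer words. The sketch below contrasts that behaviour with a word-boundary variant. The `\b...\b` wrapping and the names `loose` and `strict` are assumptions added here for illustration; they are not part of the function above.

```python
import re

# Already-stemmed keywords; a small subset of keyWords for illustration.
kw = ['pen', 'ink', 'sign', 'hard copi']

# Pattern built the same way as kw_str in buildSentTable.
loose = re.compile('|'.join(kw))
# Assumption: anchoring the alternation at word boundaries.
strict = re.compile(r'\b(?:' + '|'.join(map(re.escape, kw)) + r')\b')

stemmed_sentence = "the offic shall open the link and assign a design"
print(bool(loose.search(stemmed_sentence)))    # matches 'pen' inside 'open'
print(bool(strict.search(stemmed_sentence)))   # no whole-word keyword present
```

A stricter pattern would reduce the number of sentences sent to manual review, at the cost of missing inflected forms that the stemmer does not normalise.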

In [54]:
# Generate sentence table for manual review
sentTable = buildSentTable(directory, keyWords)
# to_pickle writes the file and returns None, so there is nothing to assign
sentTable.to_pickle('sentTable_v2.pkl')

In [57]:
# Import reviewed sentence table
reviewedTable = pd.read_pickle('reviewed.pkl')
reviewedTable.head()

Unnamed: 0,index,id,Related Sentence,Sentence #,consolidationyear,instrumentNumber,longtitle,modifiedyear,registrationyear,shorttitle,xrefxternal,regulationmaker,label
0,0,2095,If a request for review is submitted to a comp...,10,2018,SI/2017-41,Proclamation giving notice of the entry into f...,2017,2017.0,,"['OLD AGE SECURITY ACT', 'Old Age Security Act...",,0
1,1,6320,Content of application Unless these Rules prov...,77,2018,SOR/2012-256,Refugee Protection Division Rules,2012,2012.0,,"['IMMIGRATION AND REFUGEE PROTECTION ACT', 'Im...",P.C.,0
2,2,4687,The master of a vessel carrying solid bulk car...,11,2018,SOR/2007-128,"Cargo, Fumigation and Tackle Regulations",2007,2007.0,,"['CANADA SHIPPING ACT, 2001', 'Canada Shipping...",P.C.,0
3,3,9934,The notice referred to in subsection (1) shall...,1,2018,SOR/92-677,Regulations Respecting the Confirmation of Spe...,2008,1992.0,Specific Agreement Confirmation Regulations,"['INDIAN LANDS AGREEMENT (1986) ACT', 'Indian ...",P.C.,0
4,4,7868,Termination for failure to meet deadlines The ...,7,2018,SOR/2015-167,Mutual Property and Casualty Insurance Company...,2015,2015.0,,"['INSURANCE COMPANIES ACT', 'Insurance Compani...",P.C.,0


In [64]:
# Build a table with one row per keyword occurrence in the reviewed sentences, together with its associated features.
def buildWordTable(sent_df, kw):
    
    doc = pd.DataFrame()
    features = dict()
    
    # Initialize stemmer
    stemmer = PorterStemmer()
    
    # Pretrained statistical models in spaCy. It assigns context-specific token vectors, POS tags, 
    # dependency parse and named entities.
    nlp = en_core_web_sm.load()
    kw_str = '|'.join(kw)
    cols = list(sent_df.columns)
    
    for s in range(len(sent_df)):
        
        for col in ['instrumentNumber', 'Sentence #', 'label', 'modifiedyear', 'registrationyear']:
            features[col] = sent_df.iloc[s, cols.index(col)]
        
        sent = sent_df.iloc[s, cols.index('Related Sentence')]
        doc_nlp = nlp(sent)
        l = len(list(doc_nlp))

        for i in range(l):

            txt1W = doc_nlp[i]
            stem_txt1W = stemmer.stem(txt1W.text)
            # Create 2-word phrase
            txt2W = [doc_nlp[i], doc_nlp[min(i+1, l-1)]]
            
            if stem_txt1W in kw or stemmer.stem(txt2W[0].text+" "+txt2W[1].text) in kw:
                
                for j in range(1,4):
                
                    # Add features for previous word(s); leave them empty when
                    # fewer than j tokens precede position i (a negative index
                    # would otherwise wrap around to the end of the sentence)
                    if i - j < 0:
                        features["previous " + str(j) + " word"]=''
                        features["previous " + str(j) + " POS"]=''
                        features["previous " + str(j) + " POS tag"]=''
                        features["previous " + str(j) + " Entity tag"]=''
                    else:
                        features["previous " + str(j) + " word"]=doc_nlp[i-j].text
                        features["previous " + str(j) + " POS"]=doc_nlp[i-j].pos_
                        features["previous " + str(j) + " POS tag"]=doc_nlp[i-j].tag_
                        if doc_nlp[i-j].ent_iob_ != 'O':
                            features["previous " + str(j) + " Entity tag"] = doc_nlp[i-j].ent_type_
                        else:
                            features["previous " + str(j) + " Entity tag"] = 'O'
                            
                # Add other features
                if stem_txt1W in kw:
                    features['Original word'] = txt1W.text
                    features['Stemmed word'] = stem_txt1W
                    features['lemma'] = txt1W.lemma_
                    features["isCapital"] = txt1W.text[0].upper()==txt1W.text[0]
                    features['Dependency'] = txt1W.dep_ # Syntactic dependency, i.e. the relation between tokens.
                    # features['Shape'] = txt1W.shape_ # The word shape – capitalization, punctuation, digits.
                    # features['Is Alpha char'] = txt1W.is_alpha # Is the token an alpha character? (a letter of the alphabet)
                    # features['Is Stop word'] = txt1W.is_stop
                    features['POS'] = txt1W.pos_
                    features['POS Tag'] = txt1W.tag_
                else:
                    features['Original word'] = txt2W[0].text + " " + txt2W[1].text
                    features['Stemmed word'] = stemmer.stem(features['Original word'])
                    features['lemma'] = ' '.join(list(map(lambda x: x.lemma_, txt2W)))
                    features["isCapital"] = txt2W[0].text[0].upper()==txt2W[0].text[0]
                    features['Dependency'] = txt2W[1].dep_
                    features['POS'] = ' '.join(list(map(lambda x: x.pos_, txt2W)))
                    features['POS Tag'] = ' '.join(list(map(lambda x: x.tag_, txt2W)))

                # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
                doc = pd.concat([doc, pd.DataFrame([features])], ignore_index=True)
                
    return doc
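
The "previous word" features are a fixed-size context window to the left of each keyword. A minimal sketch of that idea on a plain token list follows; the helper name `previous_tokens` and the `pad` argument are hypothetical and are introduced only for this example.

```python
def previous_tokens(tokens, i, window=3, pad=''):
    """Return the `window` tokens before position i, nearest first, padded at the start."""
    out = []
    for j in range(1, window + 1):
        # A negative index would wrap to the end of the list, so check the bound explicitly.
        out.append(tokens[i - j] if i - j >= 0 else pad)
    return out

tokens = ['a', 'written', 'notice', 'shall', 'be', 'filed']
print(previous_tokens(tokens, 1))   # only one token precedes 'written'
print(previous_tokens(tokens, 5))   # full window before 'filed'
```

Padding with an empty string mirrors the empty feature values used in the table for keywords near the start of a sentence.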

In [65]:
wordTable = buildWordTable(reviewedTable, keyWords)
wordTable.to_pickle('data for modeling_v3.pkl')

In [75]:
pd.set_option('display.max_rows', 50)
wordTable.tail()

Unnamed: 0,Dependency,Original word,POS,POS Tag,Sentence #,Stemmed word,instrumentNumber,isCapital,label,lemma,modifiedyear,previous 1 Entity tag,previous 1 POS,previous 1 POS tag,previous 1 word,previous 2 Entity tag,previous 2 POS,previous 2 POS tag,previous 2 word,previous 3 Entity tag,previous 3 POS,previous 3 POS tag,previous 3 word,registrationyear
320,amod,written,VERB,VBN,27.0,written,SOR/90-264,0.0,1.0,write,2006.0,O,ADV,WRB,where,O,PUNCT,",",",",O,NOUN,NNS,mechanisms,1990.0
321,relcl,files,VERB,VBZ,1.0,file,SOR/86-547,0.0,0.0,file,2010.0,O,PRON,WP,who,O,NOUN,NN,person,O,DET,DT,a,1986.0
322,amod,physical,ADJ,JJ,16.0,physic,SOR/90-264,0.0,0.0,physical,2006.0,O,ADP,IN,of,O,CCONJ,CC,and,O,PUNCT,",",",",1990.0
323,pobj,document,NOUN,NN,39.0,document,SOR/2010-277,0.0,0.0,document,2010.0,O,DET,DT,the,O,ADP,IN,of,O,NOUN,NN,version,2010.0
324,pobj,document,NOUN,NN,39.0,document,SOR/2010-277,0.0,0.0,document,2010.0,O,DET,DT,the,O,ADP,IN,to,O,NOUN,NN,link,2010.0


### Train with the Maximum Entropy classifier

In [119]:
from nltk.classify import maxent

In [63]:
import pandas as pd
wordTable = pd.read_pickle('data for modeling_v3.pkl')

# Train and test using all features
data = wordTable.set_index(['Sentence #', 'instrumentNumber'])

In [64]:
dic = data.to_dict(orient='records')
dic[:10]

[{'Dependency': 'advcl',
  'Original word': 'submitted',
  'POS': 'VERB',
  'POS Tag': 'VBN',
  'Stemmed word': 'submit',
  'isCapital': 0.0,
  'label': 0.0,
  'lemma': 'submit',
  'modifiedyear': 2017.0,
  'previous 1 Entity tag': 'O',
  'previous 1 POS': 'AUX',
  'previous 1 POS tag': 'VBZ',
  'previous 1 word': 'is',
  'previous 2 Entity tag': 'O',
  'previous 2 POS': 'NOUN',
  'previous 2 POS tag': 'NN',
  'previous 2 word': 'review',
  'previous 3 Entity tag': 'O',
  'previous 3 POS': 'ADP',
  'previous 3 POS tag': 'IN',
  'previous 3 word': 'for',
  'registrationyear': 2017.0},
 {'Dependency': 'relcl',
  'Original word': 'submitted',
  'POS': 'VERB',
  'POS Tag': 'VBN',
  'Stemmed word': 'submit',
  'isCapital': 0.0,
  'label': 0.0,
  'lemma': 'submit',
  'modifiedyear': 2017.0,
  'previous 1 Entity tag': 'O',
  'previous 1 POS': 'AUX',
  'previous 1 POS tag': 'VBD',
  'previous 1 word': 'was',
  'previous 2 Entity tag': 'O',
  'previous 2 POS': 'NOUN',
  'previous 2 POS tag': 'N

In [65]:
data.head()

Sentence #,instrumentNumber,Dependency,Original word,POS,POS Tag,Stemmed word,isCapital,label,lemma,modifiedyear,previous 1 Entity tag,...,previous 1 word,previous 2 Entity tag,previous 2 POS,previous 2 POS tag,previous 2 word,previous 3 Entity tag,previous 3 POS,previous 3 POS tag,previous 3 word,registrationyear
10.0,SI/2017-41,advcl,submitted,VERB,VBN,submit,0.0,0.0,submit,2017.0,O,...,is,O,NOUN,NN,review,O,ADP,IN,for,2017.0
10.0,SI/2017-41,relcl,submitted,VERB,VBN,submit,0.0,0.0,submit,2017.0,O,...,was,O,NOUN,NN,request,O,DET,DT,the,2017.0
77.0,SOR/2012-256,amod,written,VERB,VBN,written,0.0,0.0,write,2012.0,O,...,a,O,ADP,IN,in,O,PUNCT,",",",",2012.0
11.0,SOR/2007-128,dobj,document,NOUN,NN,document,0.0,0.0,document,2007.0,O,...,a,O,NOUN,NN,board,O,ADP,IN,on,2007.0
1.0,SOR/92-677,relcl,submitted,VERB,VBN,submit,0.0,0.0,submit,2008.0,O,...,be,O,PART,TO,to,O,NOUN,NN,question,1992.0


In [66]:
def splitData(df):
    
    dic = df.to_dict(orient='records')
    
    train1 = [list() for i in range(5)]
    test1 = [list() for i in range(5)]
    trainFeature = [list() for i in range(5)]
    trainLabel = [list() for i in range(5)]
    testLabel = [list() for i in range(5)]
    
    trainInx = [list() for i in range(5)]
    testInx = [list() for i in range(5)]
    
    lst = list(range(len(dic)))
    tLen = round(len(dic)*0.2)
    
    for j in range(5):
        
        testInx[j] = lst[j*tLen:(j+1)*tLen]
        trainInx[j] = lst[:j*tLen] + lst[(j+1)*tLen:]
        
        for t in testInx[j]:
            features = dic[t].copy()
            testLabel[j].append(dic[t]['label'])
            features.pop('label')
            test1[j].append(features)

        for i in trainInx[j]:
            features = dic[i].copy()
            label = dic[i]['label']
            trainLabel[j].append(label)
            features.pop('label')
            trainFeature[j].append(features)
            train1[j].append((features, label))
            
    return train1, trainFeature, trainLabel, test1, testLabel

In [126]:
def maxEntTrain(df):
    
    train1, trainFeature, trainLabel, test1, testLabel = splitData(df)
    acc_hist = []
    
    for j in range(5):
        
        classifier = maxent.MaxentClassifier.train(train1[j])
        pred = classifier.classify_many(test1[j])
        
        match = sum([1 if pred[x] == testLabel[j][x] else 0 for x in range(len(pred))])
        acc = match/len(pred)
        acc_hist.append(acc)
        print("\nThe accuracy is {:.2%}".format(acc))
        
    return acc_hist
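
To make concrete what a classifier does with feature dictionaries like the ones above, here is a small pure-Python sketch of binary logistic regression over `name=value` indicator features, trained by stochastic gradient ascent. It is an illustration under stated assumptions, not NLTK's own training routine: the functions `featurize`, `train_logistic`, and `predict`, the learning rate, and the toy rows are all hypothetical.

```python
import math

def featurize(d):
    """Turn a feature dict into a set of 'name=value' indicator features."""
    return {"{}={}".format(k, v) for k, v in d.items()}

def train_logistic(examples, epochs=200, lr=0.5):
    """Fit binary logistic regression by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for feats, label in examples:
            active = featurize(feats)
            z = bias + sum(weights.get(f, 0.0) for f in active)
            p = 1.0 / (1.0 + math.exp(-z))
            err = label - p   # gradient of the log-likelihood with respect to z
            bias += lr * err
            for f in active:
                weights[f] = weights.get(f, 0.0) + lr * err
    return weights, bias

def predict(weights, bias, feats):
    """Return 1 when the model assigns probability above 0.5 to the positive class."""
    z = bias + sum(weights.get(f, 0.0) for f in featurize(feats))
    return 1 if z > 0 else 0

# Toy examples (hypothetical, not rows from the labelled table).
toy = [
    ({'Stemmed word': 'fax', 'POS': 'NOUN'}, 1),
    ({'Stemmed word': 'written', 'POS': 'VERB'}, 1),
    ({'Stemmed word': 'file', 'POS': 'VERB'}, 0),
    ({'Stemmed word': 'document', 'POS': 'NOUN'}, 0),
]
weights, bias = train_logistic(toy)
preds = [predict(weights, bias, f) for f, _ in toy]
print(preds)
```

Treating every `name=value` pair as its own indicator means numeric columns such as the year fields are handled as categories, which is one reason to test the model with and without them.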

In [127]:
accHist = maxEntTrain(data)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.208
             2          -0.36218        0.792
             3          -0.32496        0.800
             4          -0.29493        0.838
             5          -0.27086        0.873
             6          -0.25119        0.888
             7          -0.23477        0.923
             8          -0.22080        0.927
             9          -0.20871        0.931
            10          -0.19811        0.942
            11          -0.18871        0.954
            12          -0.18030        0.958
            13          -0.17272        0.962
            14          -0.16584        0.969
            15          -0.15956        0.969
            16          -0.15379        0.973
            17          -0.14848        0.977
            18          -0.14356        0.977
            19          -0.13899        0.977
 

            74          -0.04873        1.000
            75          -0.04820        1.000
            76          -0.04769        1.000
            77          -0.04718        1.000
            78          -0.04668        1.000
            79          -0.04619        1.000
            80          -0.04572        1.000
            81          -0.04525        1.000
            82          -0.04479        1.000
            83          -0.04435        1.000
            84          -0.04391        1.000
            85          -0.04348        1.000
            86          -0.04305        1.000
            87          -0.04264        1.000
            88          -0.04223        1.000
            89          -0.04183        1.000
            90          -0.04144        1.000
            91          -0.04106        1.000
            92          -0.04068        1.000
            93          -0.04031        1.000
            94          -0.03995        1.000
            95          -0.03959  

            47          -0.09026        1.000
            48          -0.08878        1.000
            49          -0.08735        1.000
            50          -0.08596        1.000
            51          -0.08462        1.000
            52          -0.08332        1.000
            53          -0.08206        1.000
            54          -0.08083        1.000
            55          -0.07965        1.000
            56          -0.07849        1.000
            57          -0.07737        1.000
            58          -0.07629        1.000
            59          -0.07523        1.000
            60          -0.07420        1.000
            61          -0.07320        1.000
            62          -0.07222        1.000
            63          -0.07127        1.000
            64          -0.07035        1.000
            65          -0.06945        1.000
            66          -0.06857        1.000
            67          -0.06771        1.000
            68          -0.06688  

In [132]:
print(accHist)
print('\nThe average accuracy with 5-Fold Cross-Validation is: {:.2%}'.format(np.sum(accHist)/5))

[0.6307692307692307, 0.5230769230769231, 0.8, 0.7538461538461538, 0.8153846153846154]

The average accuracy with 5-Fold Cross-Validation is: 70.46%
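
The average hides a wide spread between folds. A short sketch using only the fold accuracies printed above summarises that spread; using the sample standard deviation here is an assumption about which summary is most useful.

```python
import statistics

# Fold accuracies copied from the printed output above.
acc = [0.6307692307692307, 0.5230769230769231, 0.8,
       0.7538461538461538, 0.8153846153846154]

mean_acc = statistics.mean(acc)
std_acc = statistics.stdev(acc)   # sample standard deviation across the 5 folds
print("mean = {:.2%}, std = {:.2%}, min = {:.2%}, max = {:.2%}".format(
    mean_acc, std_acc, min(acc), max(acc)))
```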


In [135]:
# Train and test without previous 3rd word features
data2 = data.drop(['previous 3 Entity tag','previous 3 POS','previous 3 POS tag','previous 3 word'], axis=1)
accHist2 = maxEntTrain(data2)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.208
             2          -0.35993        0.792
             3          -0.32396        0.804
             4          -0.29545        0.846
             5          -0.27290        0.873
             6          -0.25464        0.892
             7          -0.23949        0.908
             8          -0.22663        0.915
             9          -0.21553        0.927
            10          -0.20581        0.927
            11          -0.19719        0.935
            12          -0.18948        0.935
            13          -0.18253        0.942
            14          -0.17621        0.950
            15          -0.17043        0.954
            16          -0.16513        0.965
            17          -0.16023        0.965
            18          -0.15569        0.969
            19          -0.15146        0.969
 

            75          -0.07190        0.985
            76          -0.07134        0.985
            77          -0.07079        0.985
            78          -0.07026        0.985
            79          -0.06973        0.985
            80          -0.06921        0.985
            81          -0.06870        0.985
            82          -0.06820        0.985
            83          -0.06771        0.985
            84          -0.06723        0.985
            85          -0.06675        0.985
            86          -0.06628        0.985
            87          -0.06582        0.985
            88          -0.06537        0.985
            89          -0.06492        0.985
            90          -0.06448        0.985
            91          -0.06405        0.985
            92          -0.06362        0.985
            93          -0.06320        0.985
            94          -0.06279        0.985
            95          -0.06238        0.985
            96          -0.06198  

            49          -0.11431        0.981
            50          -0.11295        0.981
            51          -0.11163        0.981
            52          -0.11034        0.981
            53          -0.10909        0.981
            54          -0.10787        0.981
            55          -0.10668        0.981
            56          -0.10552        0.981
            57          -0.10439        0.981
            58          -0.10329        0.981
            59          -0.10222        0.981
            60          -0.10117        0.981
            61          -0.10015        0.981
            62          -0.09915        0.981
            63          -0.09818        0.981
            64          -0.09723        0.981
            65          -0.09629        0.981
            66          -0.09538        0.981
            67          -0.09449        0.981
            68          -0.09362        0.981
            69          -0.09277        0.981
            70          -0.09193  

In [137]:
print(accHist2)
print('\nThe average accuracy after removing previous 3rd word is: {:.2%}'.format(np.sum(accHist2)/5))

[0.6307692307692307, 0.5230769230769231, 0.8307692307692308, 0.7384615384615385, 0.8]

The average accuracy after removing previous 3rd word is: 70.46%


In [138]:
# Train and test without previous 2nd & 3rd word features
data3 = data.drop(['previous 3 Entity tag','previous 3 POS','previous 3 POS tag','previous 3 word','previous 2 Entity tag',
              'previous 2 POS','previous 2 POS tag','previous 2 word'], axis=1)
accHist3 = maxEntTrain(data3)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.208
             2          -0.35983        0.792
             3          -0.32659        0.808
             4          -0.30071        0.842
             5          -0.28052        0.854
             6          -0.26434        0.881
             7          -0.25098        0.885
             8          -0.23970        0.892
             9          -0.23000        0.904
            10          -0.22151        0.915
            11          -0.21400        0.919
            12          -0.20729        0.919
            13          -0.20124        0.919
            14          -0.19573        0.919
            15          -0.19070        0.927
            16          -0.18607        0.927
            17          -0.18179        0.927
            18          -0.17782        0.927
            19          -0.17412        0.931
 

            76          -0.10042        0.965
            77          -0.09990        0.965
            78          -0.09939        0.965
            79          -0.09888        0.965
            80          -0.09839        0.965
            81          -0.09790        0.965
            82          -0.09742        0.965
            83          -0.09695        0.965
            84          -0.09648        0.965
            85          -0.09602        0.965
            86          -0.09557        0.965
            87          -0.09512        0.965
            88          -0.09468        0.965
            89          -0.09425        0.965
            90          -0.09382        0.965
            91          -0.09340        0.965
            92          -0.09298        0.965
            93          -0.09257        0.965
            94          -0.09217        0.965
            95          -0.09177        0.965
            96          -0.09137        0.965
            97          -0.09098  

            51          -0.14428        0.946
            52          -0.14309        0.946
            53          -0.14194        0.946
            54          -0.14082        0.946
            55          -0.13972        0.946
            56          -0.13865        0.946
            57          -0.13761        0.946
            58          -0.13659        0.946
            59          -0.13560        0.946
            60          -0.13462        0.946
            61          -0.13367        0.946
            62          -0.13275        0.946
            63          -0.13184        0.946
            64          -0.13095        0.950
            65          -0.13008        0.950
            66          -0.12923        0.950
            67          -0.12839        0.950
            68          -0.12758        0.954
            69          -0.12678        0.954
            70          -0.12599        0.954
            71          -0.12522        0.954
            72          -0.12447  

In [139]:
print(accHist3)
print('\nThe average accuracy after removing previous 2nd&3rd words is: {:.2%}'.format(np.sum(accHist3)/5))

[0.5538461538461539, 0.5384615384615384, 0.8153846153846154, 0.7076923076923077, 0.8]

The average accuracy after removing previous 2nd&3rd words is: 68.31%


In [140]:
# Train and test without year features
data4 = data.drop(['modifiedyear', 'registrationyear'], axis=1)
accHist4 = maxEntTrain(data4)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.208
             2          -0.36486        0.792
             3          -0.33124        0.796
             4          -0.30411        0.831
             5          -0.28231        0.862
             6          -0.26443        0.888
             7          -0.24942        0.900
             8          -0.23656        0.908
             9          -0.22534        0.919
            10          -0.21543        0.923
            11          -0.20658        0.927
            12          -0.19859        0.938
            13          -0.19134        0.946
            14          -0.18471        0.946
            15          -0.17860        0.950
            16          -0.17297        0.950
            17          -0.16773        0.958
            18          -0.16286        0.958
            19          -0.15831        0.962
 

            76          -0.05316        0.996
            77          -0.05266        0.996
            78          -0.05217        0.996
            79          -0.05169        0.996
            80          -0.05123        0.996
            81          -0.05077        0.996
            82          -0.05032        0.996
            83          -0.04988        0.996
            84          -0.04945        0.996
            85          -0.04903        0.996
            86          -0.04861        0.996
            87          -0.04821        0.996
            88          -0.04781        0.996
            89          -0.04741        0.996
            90          -0.04703        0.996
            91          -0.04665        0.996
            92          -0.04628        0.996
            93          -0.04592        0.996
            94          -0.04556        0.996
            95          -0.04521        0.996
            96          -0.04486        0.996
            97          -0.04452  

            50          -0.10598        0.988
            51          -0.10456        0.992
            52          -0.10317        0.992
            53          -0.10183        0.992
            54          -0.10052        0.992
            55          -0.09925        0.992
            56          -0.09801        0.992
            57          -0.09681        0.992
            58          -0.09564        0.992
            59          -0.09451        0.992
            60          -0.09340        0.992
            61          -0.09232        0.992
            62          -0.09126        0.992
            63          -0.09024        0.992
            64          -0.08924        0.992
            65          -0.08826        0.992
            66          -0.08731        0.992
            67          -0.08638        0.992
            68          -0.08547        0.992
            69          -0.08458        0.992
            70          -0.08371        0.992
            71          -0.08286  

In [141]:
print(accHist4)
print('\nThe average accuracy after removing year features is: {:.2%}'.format(np.sum(accHist4)/5))

[0.676923076923077, 0.5076923076923077, 0.8, 0.7846153846153846, 0.7692307692307693]

The average accuracy after removing year features is: 70.77%


In [142]:
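# Feature groups to ablate one at a time; the first entry drops all three word-form columns together
rmList = [['Original word', 'Stemmed word', 'lemma'], 'POS', 'POS Tag', 'Dependency', 'isCapital']
accHistList = []

for col in rmList:
    # Retrain without this feature group and record the mean accuracy across folds
    d = data.drop(col, axis=1)
    acc = maxEntTrain(d)
    accHistList.append(np.mean(acc))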

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.208
             2          -0.37195        0.792
             3          -0.33781        0.792
             4          -0.30909        0.812
             5          -0.28515        0.850
             6          -0.26505        0.877
             7          -0.24796        0.896
             8          -0.23323        0.915
             9          -0.22037        0.931
            10          -0.20904        0.942
            11          -0.19896        0.958
            12          -0.18991        0.962
            13          -0.18175        0.969
            14          -0.17434        0.973
            15          -0.16756        0.973
            16          -0.16135        0.977
            17          -0.15563        0.977
            18          -0.15033        0.977
            19          -0.14542        0.985
 

            75          -0.04851        1.000
            76          -0.04797        1.000
            77          -0.04745        1.000
            78          -0.04694        1.000
            79          -0.04644        1.000
            80          -0.04595        1.000
            81          -0.04546        1.000
            82          -0.04499        1.000
            83          -0.04453        1.000
            84          -0.04408        1.000
            85          -0.04364        1.000
            86          -0.04321        1.000
            87          -0.04278        1.000
            88          -0.04236        1.000
            89          -0.04195        1.000
            90          -0.04155        1.000
            91          -0.04116        1.000
            92          -0.04077        1.000
            93          -0.04040        1.000
            94          -0.04002        1.000
            95          -0.03966        1.000
            96          -0.03930  

            48          -0.08981        1.000
            49          -0.08834        1.000
            50          -0.08692        1.000
            51          -0.08555        1.000
            52          -0.08421        1.000
            53          -0.08292        1.000
            54          -0.08167        1.000
            55          -0.08046        1.000
            56          -0.07928        1.000
            57          -0.07813        1.000
            58          -0.07702        1.000
            59          -0.07594        1.000
            60          -0.07489        1.000
            61          -0.07387        1.000
            62          -0.07287        1.000
            63          -0.07191        1.000
            64          -0.07096        1.000
            65          -0.07004        1.000
            66          -0.06915        1.000
            67          -0.06828        1.000
            68          -0.06743        1.000
            69          -0.06660  

            21          -0.12715        0.977
            22          -0.12347        0.977
            23          -0.12002        0.981
            24          -0.11677        0.988
            25          -0.11371        0.988
            26          -0.11081        0.988
            27          -0.10807        0.988
            28          -0.10548        0.988
            29          -0.10301        0.988
            30          -0.10067        0.988
            31          -0.09843        0.988
            32          -0.09630        0.992
            33          -0.09427        0.992
            34          -0.09233        0.992
            35          -0.09047        0.992
            36          -0.08869        0.992
            37          -0.08698        0.992
            38          -0.08534        0.992
            39          -0.08377        0.992
            40          -0.08226        0.992
            41          -0.08080        0.992
            42          -0.07940  

            98          -0.03723        1.000
            99          -0.03690        1.000
         Final          -0.03658        1.000

The accuracy is 52.31%
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.258
             2          -0.43449        0.742
             3          -0.39038        0.773
             4          -0.35593        0.823
             5          -0.32838        0.850
             6          -0.30573        0.877
             7          -0.28664        0.892
             8          -0.27023        0.919
             9          -0.25591        0.931
            10          -0.24326        0.931
            11          -0.23197        0.935
            12          -0.22180        0.942
            13          -0.21259        0.954
            14          -0.20419        0.958
            15          -0.19649        0.965
            16         

            70          -0.06274        1.000
            71          -0.06198        1.000
            72          -0.06125        1.000
            73          -0.06053        1.000
            74          -0.05983        1.000
            75          -0.05915        1.000
            76          -0.05848        1.000
            77          -0.05782        1.000
            78          -0.05718        1.000
            79          -0.05656        1.000
            80          -0.05595        1.000
            81          -0.05535        1.000
            82          -0.05476        1.000
            83          -0.05418        1.000
            84          -0.05362        1.000
            85          -0.05307        1.000
            86          -0.05253        1.000
            87          -0.05200        1.000
            88          -0.05148        1.000
            89          -0.05098        1.000
            90          -0.05048        1.000
            91          -0.04999  

            42          -0.08002        0.992
            43          -0.07866        0.992
            44          -0.07734        0.992
            45          -0.07607        0.992
            46          -0.07484        0.992
            47          -0.07366        0.996
            48          -0.07251        0.996
            49          -0.07140        0.996
            50          -0.07033        0.996
            51          -0.06929        0.996
            52          -0.06828        0.996
            53          -0.06730        0.996
            54          -0.06636        0.996
            55          -0.06544        0.996
            56          -0.06454        0.996
            57          -0.06368        0.996
            58          -0.06283        0.996
            59          -0.06201        0.996
            60          -0.06122        0.996
            61          -0.06044        0.996
            62          -0.05968        0.996
            63          -0.05895  

            14          -0.20678        0.954
            15          -0.19908        0.958
            16          -0.19198        0.962
            17          -0.18543        0.973
            18          -0.17934        0.981
            19          -0.17368        0.981
            20          -0.16839        0.981
            21          -0.16344        0.981
            22          -0.15879        0.981
            23          -0.15443        0.981
            24          -0.15031        0.981
            25          -0.14643        0.981
            26          -0.14275        0.985
            27          -0.13927        0.988
            28          -0.13597        0.988
            29          -0.13283        0.988
            30          -0.12985        0.988
            31          -0.12700        0.988
            32          -0.12429        0.988
            33          -0.12170        0.988
            34          -0.11922        0.988
            35          -0.11685  

            90          -0.05130        1.000
            91          -0.05080        1.000
            92          -0.05032        1.000
            93          -0.04984        1.000
            94          -0.04937        1.000
            95          -0.04891        1.000
            96          -0.04846        1.000
            97          -0.04802        1.000
            98          -0.04758        1.000
            99          -0.04716        1.000
         Final          -0.04674        1.000

The accuracy is 75.38%
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.277
             2          -0.45550        0.723
             3          -0.40819        0.773
             4          -0.37215        0.815
             5          -0.34388        0.854
             6          -0.32093        0.881
             7          -0.30174        0.892
             8         

            62          -0.06086        0.996
            63          -0.06012        0.996
            64          -0.05939        0.996
            65          -0.05868        0.996
            66          -0.05799        0.996
            67          -0.05732        0.996
            68          -0.05667        0.996
            69          -0.05603        0.996
            70          -0.05540        0.996
            71          -0.05479        0.996
            72          -0.05419        0.996
            73          -0.05361        0.996
            74          -0.05304        0.996
            75          -0.05248        0.996
            76          -0.05194        0.996
            77          -0.05141        0.996
            78          -0.05088        0.996
            79          -0.05037        0.996
            80          -0.04987        0.996
            81          -0.04938        0.996
            82          -0.04890        0.996
            83          -0.04843  

             6          -0.32278        0.869
             7          -0.30321        0.892
             8          -0.28644        0.904
             9          -0.27184        0.912
            10          -0.25895        0.915
            11          -0.24745        0.919
            12          -0.23709        0.927
            13          -0.22769        0.931
            14          -0.21911        0.931
            15          -0.21123        0.942
            16          -0.20397        0.950
            17          -0.19724        0.950
            18          -0.19100        0.958
            19          -0.18517        0.958
            20          -0.17973        0.958
            21          -0.17462        0.973
            22          -0.16982        0.981
            23          -0.16530        0.981
            24          -0.16104        0.981
            25          -0.15700        0.985
            26          -0.15318        0.985
            27          -0.14956  

            82          -0.04756        0.996
            83          -0.04711        0.996
            84          -0.04666        0.996
            85          -0.04622        0.996
            86          -0.04578        0.996
            87          -0.04536        0.996
            88          -0.04495        0.996
            89          -0.04454        0.996
            90          -0.04414        0.996
            91          -0.04375        0.996
            92          -0.04336        0.996
            93          -0.04299        0.996
            94          -0.04261        0.996
            95          -0.04225        0.996
            96          -0.04189        0.996
            97          -0.04154        0.996
            98          -0.04120        0.996
            99          -0.04086        0.996
         Final          -0.04052        0.996

The accuracy is 63.08%
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      -----------------

            54          -0.08350        0.996
            55          -0.08235        0.996
            56          -0.08123        0.996
            57          -0.08014        0.996
            58          -0.07908        0.996
            59          -0.07805        0.996
            60          -0.07705        0.996
            61          -0.07608        0.996
            62          -0.07513        0.996
            63          -0.07421        0.996
            64          -0.07331        0.996
            65          -0.07243        0.996
            66          -0.07158        0.996
            67          -0.07074        0.996
            68          -0.06993        0.996
            69          -0.06914        0.996
            70          -0.06837        0.996
            71          -0.06761        0.996
            72          -0.06687        0.996
            73          -0.06615        0.996
            74          -0.06545        0.996
            75          -0.06476  

            27          -0.14700        0.992
            28          -0.14359        0.992
            29          -0.14034        0.992
            30          -0.13724        0.992
            31          -0.13429        0.992
            32          -0.13147        0.992
            33          -0.12877        0.992
            34          -0.12619        0.992
            35          -0.12372        0.992
            36          -0.12134        0.992
            37          -0.11907        0.992
            38          -0.11688        0.992
            39          -0.11477        0.992
            40          -0.11274        0.992
            41          -0.11079        0.992
            42          -0.10891        0.992
            43          -0.10710        0.992
            44          -0.10534        0.992
            45          -0.10365        0.992
            46          -0.10202        0.992
            47          -0.10043        0.996
            48          -0.09890  

In [144]:
for rmCol, avgAcc in zip(rmList, accHistList):
    print('The average accuracy after removing', rmCol, 'feature is: {:.2%}'.format(avgAcc))

The average accuracy after removing ['Original word', 'Stemmed word', 'lemma'] feature is: 70.77%
The average accuracy after removing POS feature is: 70.46%
The average accuracy after removing POS Tag feature is: 70.46%
The average accuracy after removing Dependency feature is: 71.08%
The average accuracy after removing isCapital feature is: 70.46%
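
To judge whether differences of this size are meaningful, it can help to compare them with the fold-to-fold spread. The sketch below reuses the five fold accuracies printed for the run without year features and the five ablation averages printed above. The variable names are illustrative only, and the comparison is a rough sanity check, not a formal significance test.

```python
import statistics

# Per-fold accuracies printed above for the run without year features
fold_acc = [0.676923076923077, 0.5076923076923077, 0.8,
            0.7846153846153846, 0.7692307692307693]

# Average accuracies printed above for each removed feature group
ablation_avg = {
    "word forms": 0.7077,
    "POS": 0.7046,
    "POS Tag": 0.7046,
    "Dependency": 0.7108,
    "isCapital": 0.7046,
}

fold_mean = statistics.mean(fold_acc)
fold_std = statistics.stdev(fold_acc)  # sample standard deviation across folds
ablation_range = max(ablation_avg.values()) - min(ablation_avg.values())

print("mean fold accuracy: {:.2%}".format(fold_mean))
print("fold standard deviation: {:.2%}".format(fold_std))
print("range of ablation averages: {:.2%}".format(ablation_range))
```

If the range of the ablation averages is much smaller than the fold standard deviation, the ablation cannot reliably rank the feature groups on a sample of this size.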


### Train by Logistic Regression

In [90]:
import keras
from keras.layers import Dense
from keras.models import Sequential
import numpy as np

# Every column except the label column is an input feature
numFeatures = len(data.columns) - 1

In [84]:
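# Baseline network: one hidden layer of 10 ReLU units and a 2-unit output for the one-hot labels
model = Sequential()
model.add(Dense(10, activation='relu', input_dim=numFeatures))
# Note: softmax is the usual output activation for categorical_crossentropy;
# sigmoid is kept here as in the original experiment
model.add(Dense(2, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])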
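
A note on the output layer: with one-hot targets and `categorical_crossentropy`, two independent sigmoid outputs are not guaranteed to sum to 1, whereas a two-way softmax does, and its second component equals a sigmoid applied to the difference of the two logits. This is the two-class link between logistic regression and maximum entropy models. The sketch below checks this numerically; the function names and logit values are illustrative assumptions, not part of the experiment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(z0, z1):
    """Softmax over two logits, shifted by the maximum for numerical stability."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


z0, z1 = 0.3, 1.2  # arbitrary example logits
p0, p1 = softmax2(z0, z1)

print(p0 + p1)                     # softmax probabilities add up to one
print(abs(p1 - sigmoid(z1 - z0)))  # gap between softmax and sigmoid of the logit difference
print(sigmoid(z0) + sigmoid(z1))   # independent sigmoids need not add up to one
```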

In [85]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode every column (features and label) as integer codes, refitting the encoder per column
encodedData = data.apply(le.fit_transform)
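
As a rough illustration of what this per-column encoding does, the pure-Python sketch below mimics `LabelEncoder` on a single column: the distinct values are sorted and each value is replaced by its index. The helper name `encode_column` and the sample tags are made up for illustration. Note that the integer codes carry an arbitrary order, so a model may treat categories as if they were numerically related; one-hot encoding is a common alternative.

```python
def encode_column(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    return [index[v] for v in values], classes


# Hypothetical POS-tag column, for illustration only
codes, classes = encode_column(["NN", "VB", "NN", "DT"])
print(classes)  # ['DT', 'NN', 'VB']
print(codes)    # [1, 2, 1, 0]
```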

In [93]:
def logstic(df, model):
    """Train and evaluate `model` on each cross-validation split; return the per-fold test accuracy."""
    acc = []
    train1, trainFeature, trainLabel, test1, testLabel = splitData(df)

    # Note: the same model instance is fitted on every fold without resetting its weights,
    # so later folds start from the weights learned on earlier folds.
    for x in range(len(trainFeature)):
        # Stack the per-sample feature dicts into (numSamples, numFeatures) matrices
        mat = np.array([list(sample.values()) for sample in trainFeature[x]])
        tsMat = np.array([list(sample.values()) for sample in test1[x]])

        # One-hot encode the binary labels to match the 2-unit output layer
        oneHotYTrain = keras.utils.to_categorical(np.array(trainLabel[x]), num_classes=2)
        oneHotYTest = keras.utils.to_categorical(np.array(testLabel[x]), num_classes=2)

        model.fit(mat, oneHotYTrain, epochs=50, batch_size=32)
        score = model.evaluate(tsMat, oneHotYTest, batch_size=32)

        # score[0] is the loss; score[1] is the categorical accuracy
        acc.append(score[1])
        print("\n{}: {:.2%}".format(model.metrics_names[1], score[1]))

    return acc
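
Because `logstic` reuses one model across all folds, the per-fold scores are not independent of each other. One possible alternative, sketched here under assumptions, is to pass a factory that builds a fresh model for every fold. `cross_validate`, `build_model`, and `MajorityClassifier` are hypothetical names, and the toy classifier only stands in for the Keras network so that the sketch runs without extra dependencies.

```python
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for a real model: predicts the most common training label."""

    def fit(self, features, labels):
        self.prediction = Counter(labels).most_common(1)[0][0]

    def evaluate(self, features, labels):
        correct = sum(1 for y in labels if y == self.prediction)
        return correct / len(labels)


def cross_validate(build_model, folds):
    """Train a freshly built model on each fold and return the test accuracies."""
    scores = []
    for train_x, train_y, test_x, test_y in folds:
        model = build_model()  # new instance, so no state leaks between folds
        model.fit(train_x, train_y)
        scores.append(model.evaluate(test_x, test_y))
    return scores


# Two tiny hypothetical folds: (train features, train labels, test features, test labels)
folds = [
    ([[0], [1], [2]], [0, 0, 1], [[3], [4]], [0, 1]),
    ([[3], [4], [0]], [1, 1, 0], [[1], [2]], [0, 1]),
]
scores = cross_validate(MajorityClassifier, folds)
print(scores)
```

With Keras, `build_model` could be a function that constructs and compiles the `Sequential` network each time it is called.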

In [96]:
model = Sequential()
# Two hidden layers with ReLU activations, followed by a 2-unit output layer
model.add(Dense(10, activation='relu', input_dim=numFeatures))
model.add(Dense(5, activation='relu'))
model.add(Dense(2, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

acc = logstic(encodedData, model)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

categorical_accuracy: 66.15%
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoc

Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

categorical_accuracy: 86.15%
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

categorical_accuracy: 86.15%
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/5

In [97]:
print('Categorical_accuracy history is:', acc)
print('The average accuracy is {:.2%}'.format(np.mean(acc)))

Categorical_accuracy history is: [0.6615384817123413, 0.5076923370361328, 0.8615384697914124, 0.8615384697914124, 0.9384615421295166]
The average accuracy is 76.62%


In [98]:
model2 = Sequential()
# Same architecture, but with softmax activations in the hidden layers
model2.add(Dense(10, activation='softmax', input_dim=numFeatures))
model2.add(Dense(5, activation='softmax'))
model2.add(Dense(2, activation='sigmoid'))
model2.compile(optimizer='adam',
               loss='categorical_crossentropy',
               metrics=['categorical_accuracy'])

acc2 = logstic(encodedData, model2)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

categorical_accuracy: 66.15%
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoc

Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

categorical_accuracy: 86.15%
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

categorical_accuracy: 86.15%
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/5

In [99]:
print('Categorical_accuracy history is:', acc2)
print('The average accuracy is {:.2%}'.format(np.mean(acc2)))

Categorical_accuracy history is: [0.6615384817123413, 0.5076923370361328, 0.8615384697914124, 0.8615384697914124, 0.9384615421295166]
The average accuracy is 76.62%


# Conclusion and Future Direction

The results show that the Maximum Entropy Classifier and Logistic Regression reach similar prediction accuracy; both are above 70% in our tests. However, we could not identify the features with the greatest weight by comparing the results after removing different features: the average accuracy stayed between 70.46% and 71.08% regardless of which feature group was removed. Because we had limited time to manually label documents as outdated or not, our sample size is small, so the results are not ideal. Building a larger labeled dataset for training is a future task for us.

# References:

[1]: Miles Osborne, Using Maximum Entropy for Sentence Extraction, 2002.

[2]: John Mount, The Equivalence of Logistic Regression and Maximum Entropy Models, 2011.