# Python Tutorial 1: Part-of-Speech Tagging 1

**(C) 2016-2018 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**

**Version:** 1.1, September 2018

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

## Introduction

This is a tutorial about developing simple [Part-of-Speech taggers](https://en.wikipedia.org/wiki/Part-of-speech_tagging) using Python 3.x and the [NLTK](http://nltk.org/).

This tutorial was developed as part of the course material for the course Advanced Natural Language Processing in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/).

The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is distributed as part of the [NLTK Data](http://www.nltk.org/data.html). To use the [NLTK Data](http://www.nltk.org/data.html) and the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) on your local machine, install the data as described on [the Installing NLTK Data page](http://www.nltk.org/data.html). If you want to use iPython on your local machine, I recommend installing a Python 3.x distribution, for example the most recent [Anaconda release](https://www.continuum.io/downloads), and reading the instructions on how to run [iPython on Anaconda](http://jupyter.readthedocs.io/en/latest/install.html).

## Part-of-Speech Tagging

We refer to Part-of-Speech (PoS) tagging as the task of assigning class information to individual words (tokens) in a text. The tags are defined in tagsets, which specify character sequences that represent sets of, for example, lexical, morphological, syntactic, or semantic features. For more details, see the [Categorizing and Tagging Words chapter](http://www.nltk.org/book/ch05.html) of the [NLTK book](http://www.nltk.org/book/).

## Using the Brown Corpus

The documentation of the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) design and properties can be found on [this page](http://clu.uni.no/icame/brown/bcm.html).

The following lines import NLTK and download the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) to the local NLTK data directory if it is not already there. This makes the tokens and PoS-tags from the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) available for further processing.

In [4]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\shant\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

Our goal is to assign PoS-tags to a sequence of words that represent a phrase, utterance, or sentence.

Let us assume that the probability of a sequence of 5 tags $t_1\ t_2\ t_3\ t_4\ t_5$ given a sequence of 5 tokens $w_1\ w_2\ w_3\ w_4\ w_5$ is $P(t_1\ t_2\ t_3\ t_4\ t_5\ |\ w_1\ w_2\ w_3\ w_4\ w_5)$ and can be computed as the product of the probability of one tag given another, e.g. the probability of tag 2 given that tag 1 occurred: $P(t_2\ |\ t_1)$, and the probability of one word and a specific tag, e.g. the probability of word 2 given that tag 2 occurred: $P(w_2\ |\ t_2)$.

Let us assume that we use two extra symbols *S* and *E*. *S* stands for sentence beginning and *E* for sentence end. We use these symbols to keep track of different distributions of tags and tokens relative to sentence positions. The token *the* for example is very unlikely to occur in sentence final and more likely to occur in sentence initial position.

$$P(t_1 \dots t_5\ |\ w_1 \dots w_5) = P(t_1|S)\ P(w_1|t_1)\ P(t_2|t_1)\ P(w_2|t_2)\ P(t_3|t_2)\ P(w_3|t_3)\ P(t_4|t_3)\ P(w_4|t_4)\ P(t_5|t_4)\ P(w_5|t_5)\ P(E|t_5)$$

This equation can be abbreviated as follows:

$$P(t_1 \dots t_5\ |\ w_1 \dots w_5) = P(t_1\ |\ S)\ P(w_1\ |\ t_1)\ P(E\ |\ t_5)\ \prod_{i=1}^{4} P(t_{i+1}\ |\ t_i)\ P(w_{i+1}\ |\ t_{i+1})$$
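The product of start, transition, emission, and end probabilities described above can be sketched in a few lines of Python. This is only an illustration: the tables `start`, `end`, `transition`, and `emission` and all numbers in them are invented toy values, not estimates from the Brown corpus.

```python
# Toy probability tables; every number here is invented for illustration.
start = {"AT": 0.6, "NN": 0.4}                          # P(t_1 | S)
end = {"AT": 0.1, "NN": 0.9}                            # P(E | t_n)
transition = {("AT", "NN"): 0.5, ("NN", "NN"): 0.2}     # P(t_i | t_{i-1}), keyed (previous, current)
emission = {("the", "AT"): 0.7, ("cat", "NN"): 0.001}   # P(w_i | t_i), keyed (word, tag)


def sequence_probability(words, tags):
    """Multiply start, emission, transition, and end probabilities."""
    # First position: P(t_1 | S) * P(w_1 | t_1)
    prob = start.get(tags[0], 0.0) * emission.get((words[0], tags[0]), 0.0)
    # Remaining positions: P(t_i | t_{i-1}) * P(w_i | t_i)
    for i in range(1, len(words)):
        prob *= transition.get((tags[i - 1], tags[i]), 0.0)
        prob *= emission.get((words[i], tags[i]), 0.0)
    # Sentence end: P(E | t_n)
    return prob * end.get(tags[-1], 0.0)


p = sequence_probability(["the", "cat"], ["AT", "NN"])
print(p)
```

Any pair missing from a table is treated as probability 0 here, so a single unseen combination makes the whole product 0.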

We extract the probabilities for a word (or token) given that a certain tag occurred, that is $P(w_1\ |\ t_1)$, from the frequency profile for tags and tokens in the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus). The necessary data should be available after executing the code cell above.

Once the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) reader is imported, we can use its methods to access tokens and PoS-tags from the corpus. The following lines of code unzip the list of tuples that contain tokens and tags in sequence as found in the corpus, and store the tokens in *tokens* and the tags in *tags*. Note that the \* operator is used here to unzip a list. For more details, see the [documentation page of the Python zip-function](https://docs.python.org/3.5/library/functions.html#zip). The function *brown.tagged_words()* returns a list of tuples *(token, tag)*. The *zip*-function produces two tuples, which are assigned to the variables *tokens* and *tags* respectively.

In [6]:
from nltk.corpus import brown
tokens, tags = zip(*brown.tagged_words())
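The unzip idiom is easier to see on a small example. The sketch below uses three hand-written token-tag pairs (the names `pairs`, `toy_tokens`, and `toy_tags` are made up for this illustration) instead of the full corpus:

```python
# A short list of (token, tag) pairs, in the shape returned by brown.tagged_words()
pairs = [("The", "AT"), ("jury", "NN"), ("said", "VBD")]

# zip(*pairs) is the same as zip(pairs[0], pairs[1], pairs[2]):
# it regroups the first elements together and the second elements together.
toy_tokens, toy_tags = zip(*pairs)

print(toy_tokens)
print(toy_tags)
```

Both results are tuples, not lists, which is why the outputs in this notebook are shown in parentheses.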

You can inspect the resulting list of *tokens* by printing it out:

In [164]:
tokens

('The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that',
 'any',
 'irregularities',
 'took',
 'place',
 '.',
 'The',
 'jury',
 'further',
 'said',
 'in',
 'term-end',
 'presentments',
 'that',
 'the',
 'City',
 'Executive',
 'Committee',
 ',',
 'which',
 'had',
 'over-all',
 'charge',
 'of',
 'the',
 'election',
 ',',
 '``',
 'deserves',
 'the',
 'praise',
 'and',
 'thanks',
 'of',
 'the',
 'City',
 'of',
 'Atlanta',
 "''",
 'for',
 'the',
 'manner',
 'in',
 'which',
 'the',
 'election',
 'was',
 'conducted',
 '.',
 'The',
 'September-October',
 'term',
 'jury',
 'had',
 'been',
 'charged',
 'by',
 'Fulton',
 'Superior',
 'Court',
 'Judge',
 'Durwood',
 'Pye',
 'to',
 'investigate',
 'reports',
 'of',
 'possible',
 '``',
 'irregularities',
 "''",
 'in',
 'the',
 'hard-fought',
 'primary',
 'which',
 'was',
 'won',
 'by',
 'Mayor-nominate'

You can print the *tags* as well:

In [185]:
tags

('AT',
 'NP-TL',
 'NN-TL',
 'JJ-TL',
 'NN-TL',
 'VBD',
 'NR',
 'AT',
 'NN',
 'IN',
 'NP$',
 'JJ',
 'NN',
 'NN',
 'VBD',
 '``',
 'AT',
 'NN',
 "''",
 'CS',
 'DTI',
 'NNS',
 'VBD',
 'NN',
 '.',
 'AT',
 'NN',
 'RBR',
 'VBD',
 'IN',
 'NN',
 'NNS',
 'CS',
 'AT',
 'NN-TL',
 'JJ-TL',
 'NN-TL',
 ',',
 'WDT',
 'HVD',
 'JJ',
 'NN',
 'IN',
 'AT',
 'NN',
 ',',
 '``',
 'VBZ',
 'AT',
 'NN',
 'CC',
 'NNS',
 'IN',
 'AT',
 'NN-TL',
 'IN-TL',
 'NP-TL',
 "''",
 'IN',
 'AT',
 'NN',
 'IN',
 'WDT',
 'AT',
 'NN',
 'BEDZ',
 'VBN',
 '.',
 'AT',
 'NP',
 'NN',
 'NN',
 'HVD',
 'BEN',
 'VBN',
 'IN',
 'NP-TL',
 'JJ-TL',
 'NN-TL',
 'NN-TL',
 'NP',
 'NP',
 'TO',
 'VB',
 'NNS',
 'IN',
 'JJ',
 '``',
 'NNS',
 "''",
 'IN',
 'AT',
 'JJ',
 'NN',
 'WDT',
 'BEDZ',
 'VBN',
 'IN',
 'NN-TL',
 'NP',
 'NP',
 'NP',
 '.',
 '``',
 'RB',
 'AT',
 'JJ',
 'NN',
 'IN',
 'JJ',
 'NNS',
 'BEDZ',
 'VBN',
 "''",
 ',',
 'AT',
 'NN',
 'VBD',
 ',',
 '``',
 'IN',
 'AT',
 'JJ',
 'NN',
 'IN',
 'AT',
 'NN',
 ',',
 'AT',
 'NN',
 'IN',
 'NNS',
 'CC',


The sequence of *tokens* and *tags* is aligned, that is, the first tag in the *tags* list belongs to the first token in the *tokens* list. You can print the token-tag pair out in the following way:

In [10]:
print("Token:", tokens[0], "Tag:", tags[0])

Token: The Tag: AT


To create a frequency profile of tags for example, we can make use of the [*Counter* container datatype](http://docs.python.org/3/library/collections.html#collections.Counter) from the [*collections* module](http://docs.python.org/3/library/collections.html). We import the [*Counter* datatype](http://docs.python.org/3/library/collections.html#collections.Counter) with the following code:

In [11]:
from collections import Counter

We can create a frequency profile of the tags from the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) and store it in the variable *tagCounter* using the following code:

In [12]:
tagCounter = Counter(tags)

The *tagCounter* datatype now contains a hash-table with *tags* as keys and their frequencies as values. Accessing the frequency of a specific *tag* can be achieved using the following code:

In [13]:
tagCounter["NNS"]

55110

The frequency of a specific *token* can be accessed by generating a frequency profile from the *token*-list in the same way as for *tags*:

In [14]:
tokenCounter = Counter(tokens)

We access the *token* frequency in the same way as for *tags*:

In [15]:
tokenCounter["the"]

62713
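A few properties of *Counter* are worth knowing before we go on. The following sketch uses a short hand-written tag list (`toy_tags` and `toy_counter` are names made up for this example):

```python
from collections import Counter

toy_tags = ["AT", "NN", "VBD", "AT", "NN", "NN"]
toy_counter = Counter(toy_tags)

print(toy_counter["NN"])           # frequency of a tag that occurs in the list
print(toy_counter["JJ"])           # a missing key yields 0 instead of a KeyError
print(toy_counter.most_common(2))  # the two most frequent tags with their counts
print(sum(toy_counter.values()))   # total number of counted items
```

That missing keys count as 0 is convenient for the probability estimates below, because unseen events do not need special handling when we look up their frequency.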

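Since one type (or word) in the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) can have more than one corresponding tag, each with a specific frequency, we need to store this information in a suitable data structure: a dictionary that maps each token to a *Counter* of its tags. For this we import the *defaultdict* datatype: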

In [16]:
from collections import defaultdict

The following loop reads from the list of *token-tag*-tuples in *brown.tagged_words* the individual *token* and *tag* pairs and sets their counter in the *dictionary* of *Counter* datastructures.

In [17]:
tokenTags = defaultdict(Counter)
for token, tag in brown.tagged_words():
    tokenTags[token][tag] +=1

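We can now ask for the *Counter* datastructure for the key *John*. The *Counter* datastructure is a hash-table with tags as keys and the corresponding frequency as values. We can also look up the frequency of one specific token-tag pair, for example *the* with the tag *AT*.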

In [19]:
tokenTags["John"]

Counter({'NN-TL': 1, 'NP': 303, 'NP-HL': 9, 'NP-TL': 47})

In [18]:
tokenTags["the"]["AT"]

62288
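To make the nested structure concrete, here is a minimal sketch with a handful of invented token-tag pairs (`toy_pairs` and `toy_token_tags` are hypothetical names used only here). It also shows how to read off the most frequent tag of a token:

```python
from collections import Counter, defaultdict

toy_pairs = [("time", "NN"), ("flies", "VBZ"), ("time", "VB"), ("time", "NN")]

# Outer dictionary: token -> Counter; inner Counter: tag -> frequency
toy_token_tags = defaultdict(Counter)
for token, tag in toy_pairs:
    toy_token_tags[token][tag] += 1

# The most frequent tag of a token and its count
best_tag, best_count = toy_token_tags["time"].most_common(1)[0]
print(best_tag, best_count)

# Looking up an unseen token returns an empty Counter,
# but it also inserts that token as a new key into the defaultdict.
print("arrow" in toy_token_tags)
print(toy_token_tags["arrow"]["NN"])
print("arrow" in toy_token_tags)
```

If the side effect of inserting new keys is not wanted, `toy_token_tags.get(token, Counter())` reads the value without modifying the dictionary.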

For the calculation of the probability of a $tag_2$ given that a $tag_1$ occurred, that is $P(tag_2\ |\ tag_1)$, we will need to count the bigrams from the *tags* list. The NLTK *util* module provides the convenient *ngrams* function to achieve this:

In [20]:
from nltk.util import ngrams

As for the *tokenTags* datatype above, we can create a *tags* bigram model using a dictionary of *Counter* datatypes. The dictionary keys will be the first tag of the tag-bigram. The value will contain a Counter datatype with the second tag of the tag-bigram as the key and the frequency of the bigram as value.

In [21]:
tagTags = defaultdict(Counter)

Using the *ngrams* function we generate the list of bigrams from the *tags* list and store it in the variable *posBigrams* using the following code:

In [22]:
posBigrams = list(ngrams(tags, 2))

The following loop goes through the list of bigram tuples, assigns the left bigram tag to the variable *tag1* and the right bigram tag to the variable *tag2*, and stores the count of the bigram in the *tagTags* datastructure:

In [23]:
for tag1, tag2 in posBigrams:
    tagTags[tag1][tag2] += 1
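For bigrams, the same pairs can also be produced with plain Python by zipping a sequence with itself shifted by one position. The sketch below uses a short invented tag sequence (`toy_tags`, `toy_bigrams`, and `toy_tag_tags` are names made up for this example):

```python
from collections import Counter, defaultdict

toy_tags = ["AT", "NN", "VBD", "AT", "NN"]

# Pair every tag with its successor; a sequence of n tags yields n - 1 bigrams
toy_bigrams = list(zip(toy_tags, toy_tags[1:]))
print(toy_bigrams)

# First tag -> Counter of the tags that follow it
toy_tag_tags = defaultdict(Counter)
for tag1, tag2 in toy_bigrams:
    toy_tag_tags[tag1][tag2] += 1

print(toy_tag_tags["AT"]["NN"])
```

Note that the *tags* sequence used in this tutorial is one flat sequence over the whole corpus, so a few of the counted bigrams span a sentence boundary, for example a sentence-final *.* followed by the first tag of the next sentence.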

We can now list all *tags* that follow the *AT* tag with the corresponding frequency:

In [46]:
tagTags["AT"]

Counter({"'": 24,
         "''": 7,
         '(': 15,
         ')': 1,
         '*': 4,
         ',': 12,
         '--': 4,
         '.': 4,
         'ABN': 42,
         'AP': 3007,
         'AP$': 1,
         'AP-TL': 2,
         'AT': 1,
         'BEZ-NC': 1,
         'CC': 4,
         'CD': 981,
         'CD-TL': 29,
         'FW-CC': 1,
         'FW-IN': 7,
         'FW-IN-TL': 4,
         'FW-JJ': 2,
         'FW-JJ-TL': 8,
         'FW-JJT': 1,
         'FW-NN': 76,
         'FW-NN$': 1,
         'FW-NN-TL': 52,
         'FW-NN-TL-NC': 1,
         'FW-NNS': 11,
         'FW-NNS-TL': 5,
         'FW-RB': 1,
         'FW-VB': 2,
         'FW-VBN': 1,
         'HV': 1,
         'IN': 3,
         'IN-TL': 1,
         'JJ': 19488,
         'JJ-HL': 1,
         'JJ-TL': 1414,
         'JJR': 630,
         'JJR-TL': 3,
         'JJS': 206,
         'JJS-TL': 2,
         'JJT': 675,
         'JJT-TL': 3,
         'MD': 1,
         'NIL': 1,
         'NN': 48376,
         'NN$': 907,
    

We can request the frequency of the tag-bigram *AT NN* using the following code:

In [24]:
tagTags["AT"]["NN"]

48376

We can calculate the total number of bigrams and relativize the count of any particular bigram:

In [25]:
total = len(tags)
print(total)
tagTags["NNS"]["NNS"]/float(total-1)

1161192


0.00012228823681892126

If we want to know how many times a certain tag occurs in sentence-initial position, for example to estimate initial probabilities for start states in a Hidden Markov Model, we can loop through the sentences and count the tags in initial position.

In [26]:
offset = 0
initialTags = Counter()
for x in brown.sents():
    initTag = tags[offset]
    initialTags[initTag] += 1
    offset += len(x)
print("Example:")
print("AT:", initialTags["AT"])

Example:
AT: 8297


Note, for the code above, I do not know how to access the initial sentence tag directly, thus I am indirectly accessing the tag over an offset count. If you know a better way, let me know, please.
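One possible alternative: NLTK's tagged corpus readers, including *brown*, also provide a *tagged_sents()* method that returns the corpus as a list of sentences, each of which is a list of *(token, tag)* tuples. With that shape the initial tag of every sentence can be read directly. The sketch below uses a small hand-written stand-in (`toy_tagged_sents`, a name made up here) so that it runs without the corpus:

```python
from collections import Counter

# Stand-in for brown.tagged_sents(): a list of sentences,
# each sentence a list of (token, tag) tuples.
toy_tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("It", "PPS"), ("works", "VBZ")],
    [("The", "AT"), ("end", "NN")],
]

# sent[0] is the first (token, tag) pair, sent[0][1] its tag; empty sentences are skipped
toy_initial_tags = Counter(sent[0][1] for sent in toy_tagged_sents if sent)
print(toy_initial_tags["AT"])

# Relative frequency of AT among all sentence-initial tags
p_initial_at = toy_initial_tags["AT"] / len(toy_tagged_sents)
print(p_initial_at)
```

On the real corpus the loop would iterate over `brown.tagged_sents()` instead of the toy list.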

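We can now estimate how likely a tag is to occupy sentence-initial position. The code below divides by *total*, the number of tokens, which gives the relative frequency of sentence-initial *AT* among all tokens. To estimate the conditional probability $P(AT\ |\ S)$ itself, one would divide by the number of sentences, *len(brown.sents())*, instead.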

In [39]:
initialTags["AT"]/float(total)

0.007145243852868432

We can estimate the relative frequency of a tag bigram, here *AT NN*, among all bigrams in the following way:

In [29]:
tagTags["AT"]["NN"]/float(total-1)

0.04166067425600095

Note, we are dividing by *total - 1*, since a sequence of *total* tags contains exactly *total - 1* bigrams.

We can estimate the likelihood of a tag token combination using the *tokenTags* data-structure:

In [30]:
tokenTags["John"]["NN"]/float(total)

0.0

Given the data structures *tokenTags* and *tagCounter*, we can now estimate the probability of a word given a specific tag, for example $P(cat\ |\ NN)$ for the token *cat* and the tag *NN*. We use the following equation and corresponding code, with $C(cat\ NN)$ as the absolute frequency or count of the *cat NN* tuple, and $C(NN)$ as the count of the *NN*-tag:

$$P(w_n\ |\ t_n) = \frac{C(w_n\ t_n)}{C(t_n)}$$

In [159]:
tokenTags["an"]["DT"] / tagCounter["DT"]

0.0
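The result above is 0.0 because *an* is tagged *AT* in the Brown corpus (see the token and tag outputs earlier), so the pair *an DT* has a count of 0. The estimate can be wrapped in a small helper that also guards against tags that never occur. This is a sketch on invented toy counts; `toy_token_tags`, `toy_tag_counter`, and `emission_probability` are hypothetical names, not part of the tutorial's code:

```python
from collections import Counter, defaultdict

toy_token_tags = defaultdict(Counter)   # token -> Counter of tags
toy_tag_counter = Counter()             # tag -> frequency
for token, tag in [("the", "AT"), ("cat", "NN"), ("the", "AT"),
                   ("dog", "NN"), ("an", "AT")]:
    toy_token_tags[token][tag] += 1
    toy_tag_counter[tag] += 1


def emission_probability(token, tag):
    """Estimate P(token | tag) as C(token tag) / C(tag)."""
    tag_count = toy_tag_counter[tag]
    if tag_count == 0:
        # The tag never occurs, so there is nothing to divide by
        return 0.0
    # .get avoids inserting unseen tokens into the defaultdict
    return toy_token_tags.get(token, Counter())[tag] / tag_count


print(emission_probability("the", "AT"))
print(emission_probability("cat", "NN"))
print(emission_probability("an", "DT"))
```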

We can estimate the probability of a $tag_2$ following a $tag_1$ using a similar approach:

$$P(t_n\ |\ t_{n-1}) = \frac{C(t_{n-1}\ t_n)}{C(t_{n-1})}$$

Here $C(t_{n-1}\ t_n)$ is the count of the bigram of these two tags in sequence. $C(t_{n-1})$ is the count or absolute frequency of the first or left tag in the bigram. Let us assume that the input sequence was *the cat ...* and that the most likely initial tag for *the* was *AT*, then the probability of the tag *NN* given that a tag *AT* occurred can be estimated as:

In [36]:
tagTags["AT"]["NN"] / tagCounter["AT"]

0.0

The product of the two probabilities $P(w_n\ |\ t_n)\ P(t_n\ |\ t_{n-1})$ for the tokens *the cat* and the possible tags *AT NN* should be:

In [33]:
(tokenTags["cat"]["NN"] / tagCounter["NN"]) * (tagTags["AT"]["NN"] / tagCounter["AT"])

6.477854781687473e-05

## My Assignment Solutions

Q1 (a) Encode the FSA in terms of matrices, including initial and final states. (This is the transition matrix!)

In [171]:
import numpy as np

In [172]:
a= np.matrix([[0,0,0,1],[0,0,0,1],[0,0,1,0],[0,0,0,1]])
b= np.matrix([[0,1,0,0],[0,1,0,0],[0,0,0,0],[0,1,0,0]])
c= np.matrix([[0,0,1,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]])

In [173]:
#print("FSA matrix a")
print(a)

[[0 0 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [0 0 0 1]]


In [174]:
# print("FSA matrix b")
print(b)

[[0 1 0 0]
 [0 1 0 0]
 [0 0 0 0]
 [0 1 0 0]]


In [175]:
# print("FSA matrix c")
print(c)

[[0 0 1 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]


In [178]:
initial_state = np.array([1,0,0,0])
final_state = np.array([0,0,0,1])

In [177]:
print("Initial State")
print(initial_state)

Initial State
[1 0 0 0]


In [179]:
print("final state")
print(final_state)

final state
[0 0 0 1]
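One way to check such an encoding is to run strings through it. The sketch below copies the three matrices as plain nested lists and assumes that an entry `[i][j] == 1` means "reading the symbol moves the automaton from state i to state j"; that reading, and the helper names `step` and `accepts`, are assumptions of this example:

```python
# Transition matrices from the cells above, one per input symbol.
# Assumed convention: rows are source states, columns are target states.
transitions = {
    "a": [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]],
    "b": [[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0]],
    "c": [[0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
}
initial = [1, 0, 0, 0]
final = [0, 0, 0, 1]


def step(vector, matrix):
    """Row vector times matrix: the states reachable after one symbol."""
    n = len(matrix)
    return [sum(vector[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def accepts(string):
    """True if the string leads from the initial state to a final state."""
    vector = initial
    for symbol in string:
        if symbol not in transitions:
            return False
        vector = step(vector, transitions[symbol])
    # Dot product with the final-state vector
    return sum(v * f for v, f in zip(vector, final)) > 0


for candidate in ["a", "b", "ba", "ab", "ca"]:
    print(candidate, accepts(candidate))
```

Running a few candidate strings this way is a quick consistency check between the matrices and a proposed regular expression.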


(b) Describe the language that is accepted by the FSA as a regular expression.

ca |(a*b+)

2. Markov Chains

There are three telephone lines, and at any given moment 0, 1, 2 or 3 of them can be busy. Once every minute we will observe how many of them are busy. This can be described as a (finite) Markov chain by assuming that the number of busy lines will depend only on the number of lines that were busy the last time we observed them, and not on the previous history.
Use the following matrices to answer the following questions. You can use online matrix multipliers
(e.g., http://wims.unice.fr/wims/en_tool~linear~matmult.html). Please explain your answers.

In [181]:
initial_vector = [0.5,0.3,0.2,0.0]
a = [[0.2, 0.5, 0.2, 0.1], [0.3, 0.4, 0.2, 0.1], [0.1, 0.3, 0.4, 0.2], [0.1, 0.1, 0.3, 0.5]]

# distribution over 0, 1, 2, 3 busy lines after four steps: initial_vector * a^4
after_four_steps= np.matmul(initial_vector, np.matmul(np.matmul(a, a),np.matmul(a, a)))

print("probability that after 4 steps exactly 3 lines are busy ", after_four_steps[3])#Q 3
# index 1 of the vector corresponds to exactly 1 busy line
print("1 busy line has the highest probability after 4 steps  and its probability is " ,after_four_steps[1])

probability that after 4 steps exactly 3 lines are busy  0.20484000000000005
1 busy line has the highest probability after 4 steps  and its probability is  0.33393000000000006
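The same computation can be written without NumPy by multiplying the distribution with the transition matrix once per minute, which also makes it easy to see which number of busy lines is most likely. This is a sketch; `vec_mat`, `distribution`, and `most_likely` are names made up for this example:

```python
def vec_mat(vector, matrix):
    """Multiply a row vector with a matrix given as nested lists."""
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(len(matrix[0]))]


# Entry k of the distribution is the probability that k lines are busy
distribution = [0.5, 0.3, 0.2, 0.0]
transition = [[0.2, 0.5, 0.2, 0.1],
              [0.3, 0.4, 0.2, 0.1],
              [0.1, 0.3, 0.4, 0.2],
              [0.1, 0.1, 0.3, 0.5]]

# One multiplication per observed minute
for _ in range(4):
    distribution = vec_mat(distribution, transition)

# Index of the largest entry = most likely number of busy lines
most_likely = max(range(len(distribution)), key=distribution.__getitem__)
print(distribution)
print(most_likely)
```

Because every row of the transition matrix sums to 1, the entries of the distribution still sum to 1 after each step, which is a useful sanity check.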


3. POS Tagging
Consider the transition network for the sentence *time flies like an arrow* and the conditional probabilities reading them from the Brown corpus using the methods that we discussed in class, see also the iPython tutorial (URL available). From these, calculate the most likely sequence.

In [101]:
List_OF_SEQUENCES=[]

# Python program to print all paths from a source to destination. 
#Reference https://www.geeksforgeeks.org/find-paths-given-source-destination/
from collections import defaultdict 
class Graph: 
   
    def __init__(self,vertices): 
        #No. of vertices 
        self.V= vertices  
          
        # default dictionary to store graph 
        self.graph = defaultdict(list)  
   
    # function to add an edge to graph 
    def addEdge(self,u,v): 
        self.graph[u].append(v) 
        
    def printAllPathsUtil(self, u, d, visited, path):    
        visited[u]= True
        path.append(u) 
        if u == d: 
            List_OF_SEQUENCES.append(path[1:6])
        else: 
        
            for i in self.graph[u]: 
                if visited[i]==False: 
                    self.printAllPathsUtil(i, d, visited, path) 
        path.pop() 
        visited[u]= False
    def printAllPaths(self,s, d): 
        visited =[False]*(self.V) 
        path = [] 
        self.printAllPathsUtil(s, d,visited, path) 
# Create a transition Graph
g = Graph(12) 
g.addEdge(0,1) 
g.addEdge(0,2) 
g.addEdge(0,3) 
g.addEdge(1,4) 
g.addEdge(1,5) 
g.addEdge(2,4) 
g.addEdge(2,5) 
g.addEdge(4,6) 
g.addEdge(5,6) 
g.addEdge(5,7) 
g.addEdge(5,8) 
g.addEdge(6,9) 
g.addEdge(7,9) 
g.addEdge(8,9)
g.addEdge(9,10)
g.addEdge(10,11)
#Start 0
# NN  1  
# VB  2 
# JJ  3
# VBZ 4
# NNS 5
# IN  6
# VB  7
# RB  8
# DT  9
# NN  10
# End 11

s = 0 ; d = 11
print ("Following are all different paths from %d to %d :" %(s, d)) 
g.printAllPaths(s, d) 

print(List_OF_SEQUENCES)

POS_TAG =['START','NN'  ,'VB'  ,'JJ' ,'VBZ','NNS','IN'  ,'VB' ,'RB' ,'DT' ,'NN' ,'End' ]
POS_TAGS = {}

words= ["time" ,"flies" ,"like" ,"an", "arrow"]
likelihood_dict= {}

# for w in words:
#     for pt in pos_tags:
#         likelihood_dict[w,pt]= tokenTags[w][pt]/float(total)    # p(w|t)


Following are all different paths from 0 to 11 :
[[1, 4, 6, 9, 10], [1, 5, 6, 9, 10], [1, 5, 7, 9, 10], [1, 5, 8, 9, 10], [2, 4, 6, 9, 10], [2, 5, 6, 9, 10], [2, 5, 7, 9, 10], [2, 5, 8, 9, 10]]


In [183]:
words = ["time", "flies", "like", "an", "arrow"]
most_likely_sequence = []
most_likely_sequence_probability = 0

for seq in List_OF_SEQUENCES:
    sequence_probability = 1
    previous_tag = None
    # each path holds one graph node per word; POS_TAG maps a node to its tag
    for word, node in zip(words, seq):
        tag = POS_TAG[node]
        wi_ti = tokenTags[word][tag]          # C(w_i t_i)
        if previous_tag is None:
            # first word: relative frequency of the tag in sentence-initial position
            transition = initialTags[tag] / float(total)
        else:
            # unseen word-tag pairs get a tiny pseudo-count instead of zero
            if wi_ti == 0:
                wi_ti = 1 / (2 * len(tokens))
            # P(t_i | t_{i-1}), with the previous tag taken from the path itself
            transition = tagTags[previous_tag][tag] / tagCounter[previous_tag]
        # multiply in P(w_i | t_i) and the transition probability
        sequence_probability = sequence_probability * (wi_ti / tagCounter[tag]) * transition
        previous_tag = tag

    if most_likely_sequence_probability < sequence_probability:
        most_likely_sequence = seq
        most_likely_sequence_probability = sequence_probability

print("Most Likely Sequence IS")
for i in most_likely_sequence:
    print(POS_TAG[i])



Most Likely Sequence IS
NN
NNS
IN
DT
NN


4. Read Ratnaparkhi (1996)
http://aclweb.org/anthology/W/W96/W96-0213.pdf
and explain, in half a page or so, how the tagger works, roughly speaking, and how entropy is being employed.

Note: quoted text is taken verbatim from the paper.

General Stuff
"The paper briefly describes the maximum entropy and maximum likelihood properties of the model,
features used for POS tagging, and the experiments on the Penn Treebank Wall St. Journal corpus"
The model uses many contextual features to predict the POS tag.
Specialized features are used to model difficult tagging decisions.
"A Maximum Entropy model combines diverse forms of contextual information in a principled manner, and does not impose any distributional assumptions on the training data."
The Maximum Entropy model allows arbitrary binary-valued features on the context, so it can use additional specialized, i.e., word-specific, features to correctly tag the "residue" that the baseline features cannot model. Since such features typically occur infrequently, the training set consistency must be good enough to yield reliable statistics. Otherwise the specialized features will model noise and perform poorly on test data.

------------------------------------------------------------------------------------------------------------------------------
The model's probability of a history h together with a tag t is defined as p(h,t), as given in Equation 1 of the paper.
The model parameter alpha_j in Equation 1 effectively serves as a "weight" for a certain contextual predictor, and it thereby also influences the entropy of the model.
The goal is to maximize the entropy subject to some constraints.
If p has the form given in Equation 1 and satisfies Equation 2 (the model's feature expectation equals the observed feature expectation for each of the k constraints), then the entropy is maximized. In other words, from the models that satisfy these constraints, the one with maximum entropy is chosen.
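
The equations referred to above should read roughly as follows (notation to be checked against the paper). Here $\pi$ is a normalization constant, $\mu$ and $\alpha_1 \dots \alpha_k$ are the positive model parameters, and each feature $f_j(h, t)$ is either 0 or 1:

$$p(h, t) = \pi \mu \prod_{j=1}^{k} \alpha_j^{f_j(h, t)}$$

The constraints require that the expectation of every feature under the model equals its observed expectation in the training data:

$$E f_j = \tilde{E} f_j, \quad 1 \le j \le k$$

Among the distributions that satisfy these constraints, the model is the one that maximizes the entropy:

$$H(p) = -\sum_{h, t} p(h, t) \log p(h, t)$$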

--------------------------------------------------------------------------------------
The model treats features whose counts fall below a fixed cutoff as unreliable and ignores them.

------------------------------------------
The tagger uses a search algorithm on the test data, and it also relies on specialized features and on the consistency of the training annotation.



If we wanted to calculate this for any sequence of words, we would wrap this code in a function and loop over all tokens. To avoid an underflow from the product of many probabilities, we can sum up the logarithms of these probabilities instead. We would then calculate the score for all possible tag combinations assigned to the sequence of words or tokens and select the combination with the largest score as the best.
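The underflow problem and the log-based remedy can be demonstrated with a short sketch. The probability values are invented, and `log_likelihood` is a helper name made up for this example:

```python
import math


def log_likelihood(probabilities):
    """Sum the natural logarithms of the given probabilities."""
    log_sum = 0.0
    for p in probabilities:
        if p == 0.0:
            # log(0) is undefined; an impossible event makes the sequence impossible
            return float("-inf")
        log_sum += math.log(p)
    return log_sum


# 100 factors of 1e-5: the true product is 1e-500, too small for a float
factors = [1e-5] * 100

product = 1.0
for p in factors:
    product *= p

print(product)                   # the product underflows to 0.0
print(log_likelihood(factors))   # the log-sum stays a usable finite number
```

Since the logarithm is monotonically increasing, the tag sequence with the largest log-sum is also the one with the largest probability product, so the ranking of candidate sequences is unchanged.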

In the next section we will discuss Hidden Markov Models (HMMs) for Part-of-Speech Tagging.

## References

Manning, Chris and Hinrich Schütze (1999) *[Foundations of Statistical Natural Language Processing](http://nlp.stanford.edu/fsnlp/)*, MIT Press. Cambridge, MA.
