In [1]:
from hmm import SUTDHMM
import os

### Part 2

Estimate the emission parameters from the training set using MLE. Our approach is to count, during ```load_data```, every occurrence of a word given a certain label as well as every occurrence of each label. The emission parameters themselves are calculated inside the ```calculate_emission``` method (which is called by the ```train``` method). Please see [hmm.py](./hmm.py) for details.
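
As a rough illustration of the counting described here, the MLE estimate is $e(x \mid y) = \frac{\text{count}(y \to x)}{\text{count}(y)}$. The sketch below is not the code in `hmm.py`; the function names `estimate_emissions` and `predict_label` and the toy data are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def estimate_emissions(tagged_tokens):
    """MLE emission estimates: e(word | label) = count(label -> word) / count(label)."""
    pair_counts = Counter(tagged_tokens)  # keyed by (word, label)
    label_counts = Counter(label for _, label in tagged_tokens)
    emissions = defaultdict(dict)
    for (word, label), n in pair_counts.items():
        emissions[label][word] = n / label_counts[label]
    return dict(emissions)


def predict_label(word, emissions):
    """Pick the label with the highest emission probability for this word."""
    return max(emissions, key=lambda label: emissions[label].get(word, 0.0))


# Toy data, purely illustrative (not taken from the real training files).
toy = [("the", "O"), ("pizza", "B-positive"), ("the", "O"),
       ("was", "O"), ("pizza", "O")]
emissions = estimate_emissions(toy)
best = predict_label("pizza", emissions)
```

In this toy set, "pizza" is emitted by `B-positive` with probability 1.0 but by `O` with only 0.25, so the emission-only predictor picks `B-positive`. Rare labels tend to win this comparison, which may help explain why the emission-only baseline predicts far more entities than the gold data contains.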

In [2]:
from hmm import SUTDHMM
languages = ['EN', 'SG', 'CN', 'FR']

for l in languages:
    model = SUTDHMM()
    model.train(input_filename='./{}/train'.format(l))
    print("Finish training for {}".format(l))

    with open("./{}/dev.in".format(l)) as in_file, open("./{}/dev.p2.out".format(l), 'w+') as out_file:
        for line in in_file:
            word = line.strip()
            if (word == ''):
                out_file.write("\n")
            else:
                out_file.write("{} {}\n".format(word, model.predict_label_using_emission(word)))
    print("Finished: {}".format(l))
    
    output = os.popen("python3 EvalScript/evalResult.py {0}/dev.out {0}/dev.p2.out".format(l)).read()
    print("Language: {}".format(l))
    print(output)
    print("----------------------")

Finish training for EN
Finished: EN
Language: EN

#Entity in gold data: 226
#Entity in prediction: 706

#Correct Entity : 133
Entity  precision: 0.1884
Entity  recall: 0.5885
Entity  F: 0.2854

#Correct Sentiment : 45
Sentiment  precision: 0.0637
Sentiment  recall: 0.1991
Sentiment  F: 0.0966

----------------------
Finish training for SG
Finished: SG
Language: SG

#Entity in gold data: 1382
#Entity in prediction: 2764

#Correct Entity : 511
Entity  precision: 0.1849
Entity  recall: 0.3698
Entity  F: 0.2465

#Correct Sentiment : 272
Sentiment  precision: 0.0984
Sentiment  recall: 0.1968
Sentiment  F: 0.1312

----------------------
Finish training for CN
Finished: CN
Language: CN

#Entity in gold data: 362
#Entity in prediction: 1688

#Correct Entity : 114
Entity  precision: 0.0675
Entity  recall: 0.3149
Entity  F: 0.1112

#Correct Sentiment : 71
Sentiment  precision: 0.0421
Sentiment  recall: 0.1961
Sentiment  F: 0.0693

----------------------
Finish training for FR
Finished: FR
Langua

### Part 3

We calculate the transition parameters in the same way as the emission parameters. We count all occurrences of each transition and of each label at the ```load_data``` step, then compute the parameters in the ```calculate_transition``` method (which is in turn called by the ```train``` method). Please see [hmm.py](./hmm.py) for details.
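
A minimal sketch of this counting, assuming each sentence is wrapped in `START` and `STOP` labels: $q(v \mid u) = \frac{\text{count}(u, v)}{\text{count}(u)}$. The function name `estimate_transitions` and the toy sequences are hypothetical and are not taken from `hmm.py`.

```python
from collections import Counter


def estimate_transitions(label_sequences):
    """MLE transition estimates: q(next | prev) = count(prev, next) / count(prev)."""
    pair_counts = Counter()
    from_counts = Counter()
    for labels in label_sequences:
        # Wrap every sentence with the START and STOP states.
        path = ["START"] + list(labels) + ["STOP"]
        for prev, nxt in zip(path, path[1:]):
            pair_counts[(prev, nxt)] += 1
            from_counts[prev] += 1
    return {pair: n / from_counts[pair[0]] for pair, n in pair_counts.items()}


# Two toy sentences, given as label sequences only.
toy = [["O", "B-positive", "O"], ["O", "O"]]
transitions = estimate_transitions(toy)
```

Here `O` is left four times in total (once to `B-positive`, once to `O`, twice to `STOP`), so for example $q(\text{STOP} \mid \text{O}) = 2/4$.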

Our Viterbi algorithm is implemented as the method ```viterbi``` of the main class ```SUTDHMM```, which makes use of the previously calculated emission and transition parameters.
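
For reference, the Viterbi recursion can be sketched as below. This is a simplified stand-alone version under our own assumptions (parameters stored in plain dictionaries keyed by `(prev, next)` and `(label, word)`, missing entries treated as probability 0); the actual ```viterbi``` method in `hmm.py` may differ in its interface and details.

```python
def viterbi(words, labels, transitions, emissions):
    """Return the most likely label path and its joint probability."""
    # pi[i][v]: probability of the best path that ends in label v at position i.
    pi = [{v: transitions.get(("START", v), 0.0) * emissions.get((v, words[0]), 0.0)
           for v in labels}]
    back = []  # back[i][v]: best previous label for v at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for v in labels:
            best_u = max(labels, key=lambda u: pi[-1][u] * transitions.get((u, v), 0.0))
            scores[v] = (pi[-1][best_u] * transitions.get((best_u, v), 0.0)
                         * emissions.get((v, word), 0.0))
            pointers[v] = best_u
        pi.append(scores)
        back.append(pointers)
    # Final step: transition into STOP, then follow the back-pointers.
    last = max(labels, key=lambda u: pi[-1][u] * transitions.get((u, "STOP"), 0.0))
    prob = pi[-1][last] * transitions.get((last, "STOP"), 0.0)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], prob


# Toy parameters, purely illustrative.
toy_transitions = {("START", "O"): 0.8, ("START", "B"): 0.2,
                   ("O", "O"): 0.5, ("O", "B"): 0.3, ("O", "STOP"): 0.2,
                   ("B", "O"): 0.7, ("B", "B"): 0.1, ("B", "STOP"): 0.2}
toy_emissions = {("O", "the"): 0.6, ("O", "pizza"): 0.1, ("B", "pizza"): 0.9}
path, prob = viterbi(["the", "pizza"], ["O", "B"], toy_transitions, toy_emissions)
```

On the toy parameters the best path is `O`, `B` with probability $0.8 \times 0.6 \times 0.3 \times 0.9 \times 0.2 = 0.02592$. Multiplying raw probabilities can underflow on long sentences, so summing log-probabilities is a common alternative.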

In [3]:
languages = ['EN', 'SG', 'CN', 'FR']

for l in languages:
    model = SUTDHMM()
    model.train(input_filename='./{}/train'.format(l))

    print("Finish training for {}".format(l))

    print("----------Viterbi for {0}------------".format(l))
    with open("./{}/dev.in".format(l)) as in_file, open("./{}/dev.p3.out".format(l), 'w+') as out_file:
        read_data = in_file.read()
        sentences = list(filter(lambda x: len(x) > 0, read_data.split('\n\n')))
        sentences = list(map(lambda x: ' '.join(x.split('\n')), sentences))
        for sentence in sentences:
            sentence_labels, chance = model.viterbi(sentence)
            for idx, word in enumerate(sentence.split()):
                out_file.write("{} {}\n".format(word, sentence_labels[idx]))
            out_file.write('\n')
        out_file.close()
        in_file.close()

    print("Viterbi Finished: {}".format(l))
    
    output = os.popen(
        "python3 EvalScript/evalResult.py {0}/dev.out {0}/dev.p3.out".format(l)).read()
    print("Language: {}".format(l))
    print(output)

Finish training for EN
----------Viterbi for EN------------
Viterbi Finished: EN
Language: EN

#Entity in gold data: 226
#Entity in prediction: 116

#Correct Entity : 74
Entity  precision: 0.6379
Entity  recall: 0.3274
Entity  F: 0.4327

#Correct Sentiment : 51
Sentiment  precision: 0.4397
Sentiment  recall: 0.2257
Sentiment  F: 0.2982

Finish training for SG
----------Viterbi for SG------------
Viterbi Finished: SG
Language: SG

#Entity in gold data: 1382
#Entity in prediction: 499

#Correct Entity : 228
Entity  precision: 0.4569
Entity  recall: 0.1650
Entity  F: 0.2424

#Correct Sentiment : 152
Sentiment  precision: 0.3046
Sentiment  recall: 0.1100
Sentiment  F: 0.1616

Finish training for CN
----------Viterbi for CN------------
Viterbi Finished: CN
Language: CN

#Entity in gold data: 362
#Entity in prediction: 173

#Correct Entity : 28
Entity  precision: 0.1618
Entity  recall: 0.0773
Entity  F: 0.1047

#Correct Sentiment : 17
Sentiment  precision: 0.0983
Sentiment  recall: 0.0470
Se

### Part 4

The max-marginal approach chooses, at each position $i$, the state with the highest posterior probability given the whole sentence:

$ y_i^* = \arg\max_{y_i} \{p (y_i \mid x_1, x_2,...,x_n; \theta)\} $

The conditional probability of a state $u$ occurring for $y_i$ is given as follows:

$ p (y_i = u \mid x_1, x_2,...,x_n; \theta) = \frac {p(x_1, x_2,...x_{i-1},y_i=u,x_i,...,x_n; \theta)}{p(x_1,...x_n; \theta)} $

As $x_i,...x_n$ are conditionally independent of $x_1,...x_{i-1}$ once $y_i$ is known in a Hidden Markov Model, the conditional probability can be written as:

$ p (y_i = u \mid x_1, x_2,...,x_n; \theta) =\frac {p(x_1, x_2,...x_{i-1},y_i=u; \theta)p(x_i,...,x_n \mid y_i = u;\theta)}{p(x_1,...x_n; \theta)} $

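$\qquad\qquad\qquad\qquad\quad=\frac {\alpha_u(i)\beta_u(i)}{\sum_{v}\alpha_v(j)\beta_v(j)} , \quad \text{for any} \quad j \in \{1,2,...,n\} $

Here $\alpha_u(i) = p(x_1,...,x_{i-1}, y_i=u; \theta)$ is the forward probability and $\beta_u(i) = p(x_i,...,x_n \mid y_i=u; \theta)$ is the backward probability.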

Since the denominator does not depend on $u$, it can be dropped from the maximisation, which gives the optimal state for each $y_i$:

$ y_i^* = \arg\max_{u} \frac {\alpha_u(i)\beta_u(i)}{\sum_{v}\alpha_v(j)\beta_v(j)} =  \arg\max_{u} \alpha_u(i)\beta_u(i) $
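
The derivation above can be sketched with a forward-backward pass. This is only an illustration under our own assumptions: $\alpha_u(i)$ excludes the emission of $x_i$ and $\beta_u(i)$ includes it, matching the numerator above, and parameters are plain dictionaries with missing entries treated as 0. The actual ```max_marginal``` method in `hmm.py` may be organised differently.

```python
def max_marginal(words, labels, transitions, emissions):
    """Return the per-position argmax of alpha * beta, plus the normaliser at each position."""
    n = len(words)

    def a(u, v):  # transition probability u -> v
        return transitions.get((u, v), 0.0)

    def b(u, x):  # emission probability of word x from label u
        return emissions.get((u, x), 0.0)

    # Forward: alpha[i][u] = p(x_1..x_{i-1}, y_i = u)
    alpha = [{u: a("START", u) for u in labels}]
    for i in range(n - 1):
        alpha.append({u: sum(alpha[i][v] * b(v, words[i]) * a(v, u) for v in labels)
                      for u in labels})

    # Backward: beta[i][u] = p(x_i..x_n | y_i = u)
    beta = [None] * n
    beta[n - 1] = {u: b(u, words[n - 1]) * a(u, "STOP") for u in labels}
    for i in range(n - 2, -1, -1):
        beta[i] = {u: b(u, words[i]) * sum(a(u, v) * beta[i + 1][v] for v in labels)
                   for u in labels}

    path = [max(labels, key=lambda u: alpha[i][u] * beta[i][u]) for i in range(n)]
    normalisers = [sum(alpha[i][v] * beta[i][v] for v in labels) for i in range(n)]
    return path, normalisers


# Toy parameters, purely illustrative.
toy_transitions = {("START", "O"): 0.8, ("START", "B"): 0.2,
                   ("O", "O"): 0.5, ("O", "B"): 0.3, ("O", "STOP"): 0.2,
                   ("B", "O"): 0.7, ("B", "B"): 0.1, ("B", "STOP"): 0.2}
toy_emissions = {("O", "the"): 0.6, ("O", "pizza"): 0.1, ("B", "pizza"): 0.9}
path, normalisers = max_marginal(["the", "pizza"], ["O", "B"],
                                 toy_transitions, toy_emissions)
```

The `normalisers` list holds $\sum_v \alpha_v(j)\beta_v(j)$ for every position $j$. All entries agree up to floating-point error, which is why the denominator can be evaluated at any $j$.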

In [4]:
languages = ['EN', 'SG', 'CN', 'FR']

for l in languages:
    model = SUTDHMM()
    model.train(input_filename='./{}/train'.format(l))

    print("Finish training for {}".format(l))

    print("----------Max Marginal for {0}------------".format(l))
    with open("./{}/dev.in".format(l)) as in_file, open("./{}/dev.p4.out".format(l), 'w+') as out_file:
        read_data = in_file.read()
        sentences = list(filter(lambda x: len(x) > 0, read_data.split('\n\n')))
        sentences = list(map(lambda x: ' '.join(x.split('\n')), sentences))
        for sentence in sentences:
            sentence_labels = model.max_marginal(sentence)
            for idx, word in enumerate(sentence.split()):
                out_file.write("{} {}\n".format(word, sentence_labels[idx]))
            out_file.write('\n')
        out_file.close()
        in_file.close()

    print("Max Marginal Finished: {}".format(l))
    
    output = os.popen(
        "python3 EvalScript/evalResult.py {0}/dev.out {0}/dev.p4.out".format(l)).read()
    print("Language: {}".format(l))
    print(output)

Finish training for EN
----------Max Marginal for EN------------
Max Marginal Finished: EN
Language: EN

#Entity in gold data: 226
#Entity in prediction: 113

#Correct Entity : 71
Entity  precision: 0.6283
Entity  recall: 0.3142
Entity  F: 0.4189

#Correct Sentiment : 45
Sentiment  precision: 0.3982
Sentiment  recall: 0.1991
Sentiment  F: 0.2655

Finish training for SG
----------Max Marginal for SG------------
Max Marginal Finished: SG
Language: SG

#Entity in gold data: 1382
#Entity in prediction: 464

#Correct Entity : 258
Entity  precision: 0.5560
Entity  recall: 0.1867
Entity  F: 0.2795

#Correct Sentiment : 178
Sentiment  precision: 0.3836
Sentiment  recall: 0.1288
Sentiment  F: 0.1928

Finish training for CN
----------Max Marginal for CN------------
Max Marginal Finished: CN
Language: CN

#Entity in gold data: 362
#Entity in prediction: 187

#Correct Entity : 71
Entity  precision: 0.3797
Entity  recall: 0.1961
Entity  F: 0.2587

#Correct Sentiment : 52
Sentiment  precision: 0.278

### Part 5

#### Improvement 1: Default Emission Params

The first proposed improvement to the current algorithm is to set a small default emission parameter.
We realised that, if a word is never tagged with a specific label in the training set, any "path" that passes through that word-label pair always has a probability of 0, no matter how likely the transitions between that label and the previous/next labels are.
Also, in real life ("the universal bag of words"), there is always some probability, however small, that a word is tagged with a specific label, so it makes more sense to give every word-label pair a default probability (emission parameter).
We tested our hypothesis by implementing it as an option in our main algorithm class. We used the elbow method over a range from 10<sup>-3</sup> to 10<sup>-20</sup> to identify the best default parameter.
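
The idea can be sketched as a lookup with a fallback. How the `default_emission` option is applied inside `hmm.py` is not shown in this notebook, so the helper `emission` below and its dictionary layout are assumptions made for illustration only.

```python
def emission(emissions, label, word, default_emission=1e-6):
    """Look up e(word | label), falling back to a small constant for unseen pairs."""
    return emissions.get((label, word), default_emission)


# Candidate defaults for a sweep from 10^-3 down to 10^-20.
candidates = [10.0 ** -k for k in range(3, 21)]

# Toy emission table, purely illustrative.
toy_emissions = {("O", "the"): 0.6, ("B", "pizza"): 0.9}
seen = emission(toy_emissions, "O", "the")    # observed pair keeps its MLE estimate
unseen = emission(toy_emissions, "B", "the")  # unseen pair gets the default instead of 0
```

With this fallback, a path through an unseen word-label pair is heavily penalised but no longer ruled out. Note that the emission values for a label then no longer sum exactly to one, which does not matter for the argmax in Viterbi.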

Results on EN dataset
![EN](./EN/score.png)

Results on FR dataset
![FR](./FR/score.png)

In [5]:
languages = ['EN', 'FR']

for l in languages:
    model = SUTDHMM(default_emission=0.000001)
    model.train(input_filename='./{}/train'.format(l))

    print("Finish training for {}".format(l))

    print("----------Viterbi for {0}------------".format(l))
    with open("./{}/dev.in".format(l)) as in_file, open("./{}/dev.pdefault.out".format(l), 'w+') as out_file:
        read_data = in_file.read()
        sentences = list(filter(lambda x: len(x) > 0, read_data.split('\n\n')))
        sentences = list(map(lambda x: ' '.join(x.split('\n')), sentences))
        for sentence in sentences:
            sentence_labels, chance = model.viterbi(sentence)
            for idx, word in enumerate(sentence.split()):
                out_file.write("{} {}\n".format(word, sentence_labels[idx]))
            out_file.write('\n')
        out_file.close()
        in_file.close()

    print("Viterbi Finished: {}".format(l))
    
    output = os.popen(
        "python3 EvalScript/evalResult.py {0}/dev.out {0}/dev.pdefault.out".format(l)).read()
    print("Language: {}".format(l))
    print(output)

Finish training for EN
----------Viterbi for EN------------
Viterbi Finished: EN
Language: EN

#Entity in gold data: 226
#Entity in prediction: 200

#Correct Entity : 109
Entity  precision: 0.5450
Entity  recall: 0.4823
Entity  F: 0.5117

#Correct Sentiment : 69
Sentiment  precision: 0.3450
Sentiment  recall: 0.3053
Sentiment  F: 0.3239

Finish training for FR
----------Viterbi for FR------------
Viterbi Finished: FR
Language: FR

#Entity in gold data: 223
#Entity in prediction: 196

#Correct Entity : 119
Entity  precision: 0.6071
Entity  recall: 0.5336
Entity  F: 0.5680

#Correct Sentiment : 80
Sentiment  precision: 0.4082
Sentiment  recall: 0.3587
Sentiment  F: 0.3819



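__Conclusion__: The default emission parameter performs best on our datasets at 10<sup>-6</sup>. This method gives a better result than our original Viterbi algorithm: on EN, the entity F score rises from 0.4327 to 0.5117 and the sentiment F score from 0.2982 to 0.3239.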

#### Improvement 2: Predicting Entity and Sentiment separately

The second proposed improvement to the current algorithm is to separate the prediction of entity and sentiment. We theorize that, because entity and sentiment labels are not related, joining them into a single label lets each affect the probability of the other and lowers accuracy. For example, entity B might show up more frequently in the dataset together with the sentiment "negative" even though the two are not actually related; in that case, the probability of predicting B-negative is significantly higher than that of other tags. This implies a false correlation between the entity tag (B) and the sentiment tag (negative). Hence, predicting them separately would eliminate this false correlation and thus produce better results.
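
The splitting and re-joining of tags can be sketched as two small helpers. The names `split_tag` and `join_tag` are hypothetical; they mirror the inline logic used in the cells of this notebook rather than any function in `hmm.py`.

```python
SPECIAL = {"O", "START", "STOP"}


def split_tag(tag):
    """Split a joint tag such as 'B-negative' into (entity, sentiment)."""
    if tag in SPECIAL:
        return tag, tag  # special tags are kept as-is in both training files
    entity, sentiment = tag.split("-", 1)
    return entity, sentiment


def join_tag(entity, sentiment):
    """Recombine the two separate predictions into one output tag."""
    if entity in SPECIAL:
        return entity
    if sentiment in SPECIAL:
        # The entity model found an entity but the sentiment model did not:
        # fall back to the sentiment model's special tag.
        return sentiment
    return "{}-{}".format(entity, sentiment)


pairs = [split_tag(t) for t in ["B-negative", "I-negative", "O"]]
joined = [join_tag(e, s) for e, s in pairs]
```

One design choice worth noting: when the two models disagree about whether a word is part of an entity, the combined output falls back to a special tag such as `O`, so this scheme cannot emit an entity without a sentiment.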

In [6]:
languages = ['EN', 'FR']

for l in languages:
    model = SUTDHMM()
    model.load_data(data_filename='./{}/train'.format(l))
    with open('./{}/train.ent'.format(l), 'w+') as ent_in_file:
        for token in model.tokens_list:
            word = token[0]
            tag = token[1].split(
                '-')[0] if token[1] not in ['O', 'START', 'STOP'] else token[1]
            ent_in_file.write('{} {}\n'.format(word, tag))
            if token[1] == 'STOP':
                ent_in_file.write('\n')
        ent_in_file.close()
    with open('./{}/train.sen'.format(l), 'w+') as sen_in_file:
        for token in model.tokens_list:
            word = token[0]
            tag = token[1].split(
                '-')[1] if token[1] not in ['O', 'START', 'STOP'] else token[1]
            sen_in_file.write('{} {}\n'.format(word, tag))
            if token[1] == 'STOP':
                sen_in_file.write('\n')
        sen_in_file.close()

    ent_model = SUTDHMM()
    ent_model.train(input_filename='./{}/train.ent'.format(l))
    sen_model = SUTDHMM()
    sen_model.train(input_filename='./{}/train.sen'.format(l))
    print('Finished training for {}'.format(l))
    with open('./{}/dev.in'.format(l)) as in_file, open('./{}/dev.psep.out'.format(l), 'w+') as out_file:
        read_data = in_file.read()
        sentences = list(filter(lambda x: len(x) > 0, read_data.split('\n\n')))
        sentences = list(map(lambda x: ' '.join(x.split('\n')), sentences))
        for sentence in sentences:
            sentence_ent, prob = ent_model.viterbi(sentence=sentence)
            sentence_sen, prob = sen_model.viterbi(sentence=sentence)
            # Iterate over the words of this sentence so that `word` is the token
            # being labelled, not a stale value left over from the loops above.
            for idx, word in enumerate(sentence.split()):
                entity = sentence_ent[idx]
                sentiment = sentence_sen[idx]
                if entity not in ['O', 'START', 'STOP'] and sentiment not in ['O', 'START', 'STOP']:
                    out_file.write(
                        "{} {}-{}\n".format(word, entity, sentiment))
                elif entity in ['O', 'START', 'STOP']:
                    out_file.write("{} {}\n".format(word, entity))
                else:
                    out_file.write('{} {}\n'.format(word, sentiment))
            out_file.write('\n')
    
    output = os.popen("python3 EvalScript/evalResult.py {0}/dev.out {0}/dev.psep.out".format(l)).read()
    print("Language: {}".format(l))
    print(output)

Finished training for EN
Language: EN

#Entity in gold data: 226
#Entity in prediction: 115

#Correct Entity : 74
Entity  precision: 0.6435
Entity  recall: 0.3274
Entity  F: 0.4340

#Correct Sentiment : 53
Sentiment  precision: 0.4609
Sentiment  recall: 0.2345
Sentiment  F: 0.3109

Finished training for FR
Language: FR

#Entity in gold data: 223
#Entity in prediction: 104

#Correct Entity : 74
Entity  precision: 0.7115
Entity  recall: 0.3318
Entity  F: 0.4526

#Correct Sentiment : 49
Sentiment  precision: 0.4712
Sentiment  recall: 0.2197
Sentiment  F: 0.2997



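__Conclusion__: This method gives only a slightly better result on the current datasets (on EN, entity F 0.4340 vs. 0.4327 and sentiment F 0.3109 vs. 0.2982 for the original Viterbi). However, it could make a much larger difference if the datasets were skewed towards certain pairs of tags.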

#### Improvement 3

The third proposed improvement to the current algorithm is to implement discriminative training using the perceptron algorithm (adopted from this [paper](http://www.aclweb.org/anthology/W02-1001)). The basic steps of the implementation are as follows:

![steps](perceptron-steps.png)

We have implemented it inside our main class ```SUTDHMM``` as a dedicated training method ```train_perceptron``` and a dedicated prediction method ```predict_perceptron```. Please see the code for details.
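
As a rough sketch of the idea, a plain (unaveraged) structured perceptron with emission and transition indicator features could look like the code below. The functions `features`, `decode` and `train` are hypothetical stand-ins written for this example; they are not the ```train_perceptron``` and ```predict_perceptron``` methods of `SUTDHMM`, whose internals are not shown here.

```python
from collections import Counter, defaultdict


def features(words, labels):
    """Count the emission and transition features of one labelled sentence."""
    phi = Counter()
    path = ["START"] + list(labels) + ["STOP"]
    for word, label in zip(words, labels):
        phi[("emit", label, word)] += 1
    for prev, nxt in zip(path, path[1:]):
        phi[("trans", prev, nxt)] += 1
    return phi


def decode(words, labels, w):
    """Viterbi over additive feature weights instead of probabilities."""
    score = {v: w.get(("trans", "START", v), 0.0) + w.get(("emit", v, words[0]), 0.0)
             for v in labels}
    back = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for v in labels:
            u = max(labels, key=lambda p: score[p] + w.get(("trans", p, v), 0.0))
            new_score[v] = (score[u] + w.get(("trans", u, v), 0.0)
                            + w.get(("emit", v, word), 0.0))
            pointers[v] = u
        score = new_score
        back.append(pointers)
    last = max(labels, key=lambda p: score[p] + w.get(("trans", p, "STOP"), 0.0))
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def train(data, labels, epochs=5):
    """On each mistake, add the gold features and subtract the predicted ones."""
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(words, labels, w)
            if pred != list(gold):
                for f, c in features(words, gold).items():
                    w[f] += c
                for f, c in features(words, pred).items():
                    w[f] -= c
    return w


# Toy training set, purely illustrative.
toy = [(["the", "pizza"], ["O", "B"]), (["the", "service"], ["O", "O"])]
weights = train(toy, ["O", "B"])
pred_pizza = decode(["the", "pizza"], ["O", "B"], weights)
pred_service = decode(["the", "service"], ["O", "B"], weights)
```

On this toy set the weights stop changing after the first epoch and both training sentences are decoded correctly. Averaging the weight vectors over all updates, as also discussed in the paper, is a common refinement that this sketch omits.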

In [7]:
languages = ['EN', 'FR']

for l in languages:
    model = SUTDHMM()
    model.train_perceptron(input_filename='./{}/train'.format(l))
    print("Finish training for {}".format(l))

    print("----------Perceptron for {0}------------".format(l))
    with open("./{}/dev.in".format(l)) as in_file, open("./{}/dev.perceptron.out".format(l), 'w+') as out_file:
        read_data = in_file.read()
        sentences = list(
            filter(lambda x: len(x) > 0, read_data.split('\n\n')))
        sentences = list(map(lambda x: ' '.join(x.split('\n')), sentences))
        for sentence in sentences:
            sentence_labels, chance = model.predict_perceptron(sentence)
            for idx, word in enumerate(sentence.split()):
                out_file.write("{} {}\n".format(
                    word, sentence_labels[idx]))
            out_file.write('\n')
        out_file.close()
        in_file.close()

    print("Perceptron Finished: {}".format(l))

    output = os.popen(
        "python3 EvalScript/evalResult.py {0}/dev.out {0}/dev.perceptron.out".format(l)).read()
    print("Language: {}".format(l))
    print(output)

Finish training for EN
----------Perceptron for EN------------
Perceptron Finished: EN
Language: EN

#Entity in gold data: 226
#Entity in prediction: 103

#Correct Entity : 63
Entity  precision: 0.6117
Entity  recall: 0.2788
Entity  F: 0.3830

#Correct Sentiment : 45
Sentiment  precision: 0.4369
Sentiment  recall: 0.1991
Sentiment  F: 0.2736

Finish training for FR
----------Perceptron for FR------------
Perceptron Finished: FR
Language: FR

#Entity in gold data: 223
#Entity in prediction: 109

#Correct Entity : 78
Entity  precision: 0.7156
Entity  recall: 0.3498
Entity  F: 0.4699

#Correct Sentiment : 51
Sentiment  precision: 0.4679
Sentiment  recall: 0.2287
Sentiment  F: 0.3072



__Conclusion__: This proposed method does not improve the accuracy for our specific use case (on EN, entity F 0.3830 vs. 0.4327 for the original Viterbi).

#### Combination

Finally, we attempt to combine our previously proposed improvements to the algorithm design. We found that the best combination is method 1 and method 2 together. The implementation is below; it yields the best results on the test set.

In [10]:
languages = ['EN', 'SG', 'CN', 'FR']

for l in languages:
    model = SUTDHMM()
    model.load_data(data_filename='./{}/train'.format(l))
    with open('./{}/train.ent'.format(l), 'w+') as ent_in_file:
        for token in model.tokens_list:
            word = token[0]
            tag = token[1].split(
                '-')[0] if token[1] not in ['O', 'START', 'STOP'] else token[1]
            ent_in_file.write('{} {}\n'.format(word, tag))
            if token[1] == 'STOP':
                ent_in_file.write('\n')
        ent_in_file.close()
    with open('./{}/train.sen'.format(l), 'w+') as sen_in_file:
        for token in model.tokens_list:
            word = token[0]
            tag = token[1].split(
                '-')[1] if token[1] not in ['O', 'START', 'STOP'] else token[1]
            sen_in_file.write('{} {}\n'.format(word, tag))
            if token[1] == 'STOP':
                sen_in_file.write('\n')
        sen_in_file.close()

    ent_model = SUTDHMM(default_emission=0.0000001)
    ent_model.train(input_filename='./{}/train.ent'.format(l))
    sen_model = SUTDHMM(default_emission=0.0000001)
    sen_model.train(input_filename='./{}/train.sen'.format(l))
    print('Finished training for {}'.format(l))
    with open('./{}/dev.in'.format(l)) as in_file, open('./{}/dev.pdefaultsep.out'.format(l), 'w+') as out_file:
        read_data = in_file.read()
        sentences = list(filter(lambda x: len(x) > 0, read_data.split('\n\n')))
        sentences = list(map(lambda x: ' '.join(x.split('\n')), sentences))
        for sentence in sentences:
            sentence_ent, prob = ent_model.viterbi(sentence=sentence)
            sentence_sen, prob = sen_model.viterbi(sentence=sentence)
            # Iterate over the words of this sentence so that `word` is the token
            # being labelled, not a stale value left over from the loops above.
            for idx, word in enumerate(sentence.split()):
                entity = sentence_ent[idx]
                sentiment = sentence_sen[idx]
                if entity not in ['O', 'START', 'STOP'] and sentiment not in ['O', 'START', 'STOP']:
                    out_file.write(
                        "{} {}-{}\n".format(word, entity, sentiment))
                elif entity in ['O', 'START', 'STOP']:
                    out_file.write("{} {}\n".format(word, entity))
                else:
                    out_file.write('{} {}\n'.format(word, sentiment))
            out_file.write('\n')
    
    output = os.popen("python3 EvalScript/evalResult.py {0}/dev.out {0}/dev.pdefaultsep.out".format(l)).read()
    print("Language: {}".format(l))
    print(output)

Finished training for EN
Language: EN

#Entity in gold data: 226
#Entity in prediction: 196

#Correct Entity : 106
Entity  precision: 0.5408
Entity  recall: 0.4690
Entity  F: 0.5024

#Correct Sentiment : 69
Sentiment  precision: 0.3520
Sentiment  recall: 0.3053
Sentiment  F: 0.3270

Finished training for SG
Language: SG

#Entity in gold data: 1382
#Entity in prediction: 1003

#Correct Entity : 460
Entity  precision: 0.4586
Entity  recall: 0.3329
Entity  F: 0.3857

#Correct Sentiment : 278
Sentiment  precision: 0.2772
Sentiment  recall: 0.2012
Sentiment  F: 0.2331

Finished training for CN
Language: CN

#Entity in gold data: 362
#Entity in prediction: 219

#Correct Entity : 45
Entity  precision: 0.2055
Entity  recall: 0.1243
Entity  F: 0.1549

#Correct Sentiment : 29
Sentiment  precision: 0.1324
Sentiment  recall: 0.0801
Sentiment  F: 0.0998

Finished training for FR
Language: FR

#Entity in gold data: 223
#Entity in prediction: 167

#Correct Entity : 109
Entity  precision: 0.6527
Entit