# IN4080 – Natural Language Processing

This assignment has two parts:
* Part A. Sequence labeling
* Part B. Word embeddings

## Part A

In this part we will experiment with sequence classification and tagging. We will combine some of
the tools for tagging from NLTK with scikit-learn to build various taggers. We will start with simple
examples from NLTK where the tagger only considers the token to be tagged (not its context)
and work towards more advanced logistic regression taggers (also called maximum entropy taggers).
Finally, we will compare to some tagging algorithms installed in NLTK.

In [1]:
import re
import pprint
import nltk
from nltk.corpus import brown
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

In [2]:
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents, features=pos_features):
        self.features = features
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)
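
To make the feature representation concrete, the sketch below applies a copy of `pos_features` to a made-up sentence. The function body repeats the definition above so that the block runs on its own; the example sentence is an assumption for illustration and is not taken from the corpus.

```python
def pos_features(sentence, i, history):
    # Same definition as in the cell above, repeated to keep the block self-contained.
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

# Hypothetical example sentence (an assumption, not corpus data).
sentence = ["The", "jury", "praised", "the", "administration"]

first = pos_features(sentence, 0, [])   # sentence-initial token
third = pos_features(sentence, 2, [])   # token with a left neighbour
print(first)
print(third)
```

Note that the word itself ("praised") is not among the features, and that `history` is accepted but never read by this feature selector.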

In [3]:
tagger = ConsecutivePosTagger(train_sents)
print(round(tagger.evaluate(test_sents), 4))

0.7915


### 1) Tag set and baseline

**Part a:** Tag set and experimental set-up

In [4]:
tagged_sents_uni = brown.tagged_sents(categories='news',tagset = 'universal')

# size is already 10% of the news sentences, so slice_ind is about 1% of them.
slice_ind = round(size*10/100)
news_test = tagged_sents_uni[:slice_ind]
news_dev_test = tagged_sents_uni[slice_ind:slice_ind*2]
news_train = tagged_sents_uni[slice_ind*2:]

In [5]:
tagger_a = ConsecutivePosTagger(news_train)
print(round(tagger_a.evaluate(news_dev_test), 4))

0.8751


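Accuracy is higher than before (0.8751 against 0.7915). The universal tag set is much coarser than the original Brown tag set, so the classifier has fewer tags to choose between. Note that the two numbers are computed on different evaluation sets.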

**Part b:** Baseline

In [6]:
tagger_b = ConsecutivePosTagger(news_train)
print(round(tagger_b.evaluate(news_train), 4))

0.8765


In [7]:
for i in range(len(news_train)):
    for word,tag in news_train[i]:
        pass

# Find the frequency of each (word, tag) pair in news_train.
# If a word occurs with several tags, use the most frequent one.
# Compare with news_dev_test. Words from dev_test that do not occur
# in train get the overall most frequent tag.
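
The plan in the comments above can be sketched as a most-frequent-tag baseline. This is a minimal sketch rather than a finished solution: the helper names `train_baseline` and `baseline_accuracy` and the toy data are assumptions, and unknown words fall back to the overall most frequent tag, which is one reading of the last comment.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Return (word -> most frequent tag, overall most frequent tag)."""
    tags_per_word = defaultdict(Counter)
    tag_counts = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            tags_per_word[word][tag] += 1
            tag_counts[tag] += 1
    best_tag = {w: c.most_common(1)[0][0] for w, c in tags_per_word.items()}
    default_tag = tag_counts.most_common(1)[0][0]
    return best_tag, default_tag

def baseline_accuracy(tagged_sents, best_tag, default_tag):
    """Token-level accuracy; unseen words receive default_tag."""
    correct = total = 0
    for sent in tagged_sents:
        for word, gold in sent:
            predicted = best_tag.get(word, default_tag)
            correct += int(predicted == gold)
            total += 1
    return correct / total

# Toy data (hypothetical), standing in for news_train and news_dev_test.
toy_train = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
             [("dog", "NOUN"), ("runs", "VERB")],
             [("the", "DET"), ("runs", "NOUN"), ("house", "NOUN")]]
toy_dev = [[("the", "DET"), ("cat", "NOUN"), ("runs", "VERB"), ("fast", "ADV")]]

best_tag, default_tag = train_baseline(toy_train)
accuracy = baseline_accuracy(toy_dev, best_tag, default_tag)
print(best_tag["runs"], default_tag, accuracy)
```

On the real data the same two functions would be called with `news_train` and `news_dev_test` in place of the toy lists.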

### 2) scikit-learn and tuning

Our goal will be to improve the tagger compared to the simple suffix-based tagger. For the further
experiments, we move to scikit-learn, which offers more options for considering various alternatives.
We have reimplemented the ConsecutivePosTagger to use scikit-learn classifiers below. We have
made the classifier a parameter so that it can easily be exchanged. We start with the BernoulliNB
classifier, which should correspond to the way it is done in NLTK.

In [8]:
import numpy as np
import sklearn

from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer


class ScikitConsecutivePosTagger(nltk.TaggerI): 

    def __init__(self, train_sents, features=pos_features, clf = BernoulliNB()):
        # Using pos_features as default.
        self.features = features
        train_features = []
        train_labels = []
        for tagged_sent in train_sents:
            history = []
            untagged_sent = nltk.tag.untag(tagged_sent)
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history)
                train_features.append(featureset)
                train_labels.append(tag)
                history.append(tag)
        v = DictVectorizer()
        X_train = v.fit_transform(train_features)
        y_train = np.array(train_labels)
        clf.fit(X_train, y_train)
        self.classifier = clf
        self.dict = v

    def tag(self, sentence):
        test_features = []
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            test_features.append(featureset)
        X_test = self.dict.transform(test_features)
        tags = self.classifier.predict(X_test)
        return zip(sentence, tags)
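
The tagger above relies on `DictVectorizer` to turn feature dictionaries into a numeric matrix. The sketch below shows that step in isolation; the feature dicts are made up for illustration and are an assumption, not output from the corpus.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical feature dicts of the kind a feature selector returns.
train_features = [{"suffix(1)": "e", "prev-word": "<START>"},
                  {"suffix(1)": "y", "prev-word": "The"},
                  {"suffix(1)": "e", "prev-word": "jury"}]

v = DictVectorizer()
X_train = v.fit_transform(train_features)   # sparse matrix, one row per token

# Each distinct (feature, value) pair becomes one binary column.
print(v.feature_names_)
print(X_train.toarray())

# A value that was never seen during fit gets no column and is ignored.
X_new = v.transform([{"suffix(1)": "z", "prev-word": "The"}])
print(X_new.toarray())
```

This is why the fitted vectorizer is stored on the tagger (`self.dict`) and reused in `tag`: the test features must be mapped onto the same columns as the training features.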

**Part a)** Train the ScikitConsecutivePosTagger on the *news_train* set and test it on the *news_dev_test* set
with the *pos_features*.

In [9]:
tagger_scikit = ScikitConsecutivePosTagger(news_train)
print(round(tagger_scikit.evaluate(news_dev_test), 4))

0.8787


With the same data and the same features, the scikit-learn tagger scores 0.8787 on *news_dev_test*, close to the 0.8751 of the NLTK tagger, so the two implementations behave similarly but not identically.

**Part b)** The smoothing may still be too strong. *BernoulliNB()* from scikit-learn uses Laplace ("add-one") smoothing by default. This is generalized to Lidstone smoothing, which is controlled by the alpha parameter of *BernoulliNB(alpha=…)*. We therefore tune alpha to find the best value.
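
As a rough illustration of what alpha does, the sketch below computes the additively smoothed estimate (count + alpha) / (n + 2 * alpha) for a binary feature. The helper name `lidstone_estimate` and the counts are assumptions chosen for illustration; they are not taken from the tagger.

```python
def lidstone_estimate(feature_count, class_count, alpha):
    """Smoothed P(feature present | class) for a binary feature."""
    return (feature_count + alpha) / (class_count + 2 * alpha)

# Hypothetical counts: a suffix never seen with a tag that occurs 100 times.
for alpha in [1, 0.5, 0.1, 0.01, 0.001, 0.0001]:
    p = lidstone_estimate(feature_count=0, class_count=100, alpha=alpha)
    print(f"alpha={alpha:<7} P(feature | tag) = {p:.8f}")
```

A smaller alpha moves less probability mass to unseen (feature, tag) combinations, so the estimates stay closer to the raw relative frequencies.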

In [16]:
def run_classifier(features):
    alphas = [1, 0.5, 0.1, 0.01, 0.001, 0.0001]
    accuracies = []
    for alpha in alphas:
        tagger_sci = ScikitConsecutivePosTagger(news_train, features=features,
                                                clf=BernoulliNB(alpha=alpha))
        accuracies.append(round(tagger_sci.evaluate(news_dev_test), 4))

    return alphas, accuracies
    
def visualize_results(alphas, accuracies):
    acc_alphas = {'alpha':alphas,'Accuracies':accuracies}
    import pandas as pd
    df = pd.DataFrame(acc_alphas)
    print(df)

    best_acc = max(accuracies)
    best_ind = accuracies.index(max(accuracies))
    best_alpha = alphas[best_ind]
    print("")
    print(f'Best alpha: {best_alpha} - accuracy: {best_acc}')

In [17]:
alphas,accuracies = run_classifier(pos_features)

In [18]:
visualize_results(alphas, accuracies)

    alpha  Accuracies
0  1.0000      0.8787
1  0.5000      0.8859
2  0.1000      0.8706
3  0.0100      0.8715
4  0.0010      0.8643
5  0.0001      0.8616

Best alpha: 0.5 - accuracy: 0.8859


With the best alpha (0.5), scikit-learn's BernoulliNB reaches 0.8859, slightly better than the 0.8787 obtained with the default alpha of 1.

**Part c)** To improve the results, we may change the feature selector or the machine learner. We start with
a simple improvement of the feature selector. The NLTK selector considers the previous word, but
not the word itself. Intuitively, the word itself should be a stronger feature. We therefore extend the NLTK
feature selector with a feature for the token to be tagged and rerun the alpha tuning.

In [33]:
def pos_features_tagged(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}    
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        
    #same structure, but included the token to be tagged.
    features['tagged_word'] = sentence[i]

    return features


In [34]:
alphas_tag,accuracies_tag = run_classifier(pos_features_tagged)
visualize_results(alphas_tag, accuracies_tag)

    alpha  Accuracies
0  1.0000      0.8985
1  0.5000      0.9263
2  0.1000      0.9272
3  0.0100      0.9371
4  0.0010      0.9416
5  0.0001      0.9380

Best alpha: 0.001 - accuracy: 0.9416


### 3) Logistic regression 

**Part a)** We proceed with the best feature selector from the last exercise. We will study the effect of the
learner.

In [44]:
from sklearn.linear_model import LogisticRegression

#increased the max_iter from default 100 to 500 in order to make it converge:
logClf = LogisticRegression(max_iter = 500) 

In [45]:
tagger_log = ScikitConsecutivePosTagger(news_train,features = pos_features ,clf = logClf)
acc_log = (round(tagger_log.evaluate(news_dev_test), 4))
print(f'Logistic accuracy = {acc_log}')

Logistic accuracy = 0.9695


The *Logistic Regression* tagger, run here with *pos_features* (without the token to be tagged), reaches 0.9695, which is better than all of the *BernoulliNB* results above.