# IN4080 – Natural Language Processing

This assignment has two parts:
* Part A. Sequence labeling
* Part B. Word embeddings

## Part A

In this part we will experiment with sequence classification and tagging. We will combine some of
the tools for tagging from NLTK with scikit-learn to build various taggers.We will start with simple
examples from NLTK where the tagger only considers the token to be tagged—not its context—
and work towards more advanced logistic regression taggers (also called maximum entropy taggers).
Finally, we will compare to some tagging algorithms installed in NLTK.

In [1]:
import re
import pprint
import nltk
from nltk.corpus import brown
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

In [2]:
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents, features=pos_features):
        self.features = features
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

In [3]:
tagger = ConsecutivePosTagger(train_sents)
print(round(tagger.evaluate(test_sents), 4))

0.7915


### 1) Tag set and baseline

**Part a:** Tag set and experimental set-up

In [54]:
def split_data(tagged_sents_uni):
    size = len(tagged_sents_uni)
    slice_ind = round(size*10/100)
    news_test = tagged_sents_uni[:slice_ind]
    news_dev_test = tagged_sents_uni[slice_ind:slice_ind*2]
    news_train = tagged_sents_uni[slice_ind*2:]

    return news_test, news_dev_test,news_train

In [55]:
tagged_sents_uni = brown.tagged_sents(categories='news',tagset = 'universal')
news_test, news_dev_test,news_train = split_data(tagged_sents_uni)

In [56]:
tagger_a = ConsecutivePosTagger(news_train)
print(round(tagger_a.evaluate(news_dev_test), 4))

0.8689


We got higher accuracy. 

**Part b:** Part b. Baseline

### 2) scikit-learn and tuning

Our goal will be to improve the tagger compared to the simple suffix-based tagger. For the further
experiments, we move to scikit-learn which yields more options for considering various alternatives.
We have reimplemented the ConsecutivePosTagger to use scikit-learn classifiers below. We have
made the classifier a parameter so that it can easily be exchanged. We start with the BernoulliNBclassifier which should correspond to the way it is done in NLTK.

In [6]:
import numpy as np
import sklearn

from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer


class ScikitConsecutivePosTagger(nltk.TaggerI): 

    def __init__(self, train_sents, features=pos_features, clf = BernoulliNB()):
        # Using pos_features as default.
        self.features = features
        train_features = []
        train_labels = []
        for tagged_sent in train_sents:
            history = []
            untagged_sent = nltk.tag.untag(tagged_sent)
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history)
                train_features.append(featureset)
                train_labels.append(tag)
                history.append(tag)
        v = DictVectorizer()
        X_train = v.fit_transform(train_features)
        y_train = np.array(train_labels)
        clf.fit(X_train, y_train)
        self.classifier = clf
        self.dict = v

    def tag(self, sentence):
        test_features = []
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            test_features.append(featureset)
        X_test = self.dict.transform(test_features)
        tags = self.classifier.predict(X_test)
        return zip(sentence, tags)

**Part a)** Training the ScikitConsecutivePosTagger with *news_train* set and test on the *news_dev_test* set
with the *pos_features*.

In [57]:
tagger_scikit = ScikitConsecutivePosTagger(news_train)
print(round(tagger_scikit.evaluate(news_dev_test), 4))

0.857


We can see that, by using the same data and same features we get a bit inferior results.

**Part b)** One explanation could be that the smoothing is too strong. *BernoulliNB()* from scikit-learn uses Laplace smoothing as default (“add-one”). The smoothing is generalized to Lidstone smoothing which is expressed by the alpha parameter to *BernoulliNB(alpha=…)*. Therefore, we will tune the alpha parameter to find the most optimal one. 

In [8]:
def tunning_bernoulli(pos_features):
    alphas = [1, 0.5, 0.1, 0.01, 0.001, 0.0001]
    accuracies = []
    for alpha in alphas:
        tagger_sci = ScikitConsecutivePosTagger(news_train,features = pos_features ,clf = BernoulliNB(alpha=alpha))
        accuracies.append(round(tagger_sci.evaluate(news_dev_test), 4))
    
    return alphas,accuracies
    
def visualize_results(alphas, accuracies):
    acc_alphas = {'alpha':alphas,'Accuracies':accuracies}
    import pandas as pd
    df = pd.DataFrame(acc_alphas)
    print(df)

    best_acc = max(accuracies)
    best_ind = accuracies.index(max(accuracies))
    best_alpha = alphas[best_ind]
    print("")
    print(f'Best alpha: {best_alpha} - accuracy: {best_acc}')

In [9]:
alphas,accuracies = tunning_bernoulli(pos_features)

In [10]:
visualize_results(alphas, accuracies)

    alpha  Accuracies
0  1.0000      0.8787
1  0.5000      0.8859
2  0.1000      0.8706
3  0.0100      0.8715
4  0.0010      0.8643
5  0.0001      0.8616

Best alpha: 0.5 - accuracy: 0.8859


We can see that we get a little bit better result with Scikits BernoulliNB with the best alpha.

**Part c)** To improve the results, we may change the feature selector or the machine learner. We start with
a simple improvement of the feature selector. The NLTK selector considers the previous word, but
not the word itself. Intuitively, the word itself should be a stronger feature. By extending the NLTK
feature selector with a feature for the token to be tagged, we try to find the best results.

In [11]:
def pos_features_tagged(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}    
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        
    #same structure, but included the token to be tagged.
    features['tagged_word'] = sentence[i]

    return features


In [12]:
alphas_tag,accuracies_tag = tunning_bernoulli(pos_features_tagged)
visualize_results(alphas_tag, accuracies_tag)

    alpha  Accuracies
0  1.0000      0.8985
1  0.5000      0.9263
2  0.1000      0.9272
3  0.0100      0.9371
4  0.0010      0.9416
5  0.0001      0.9380

Best alpha: 0.001 - accuracy: 0.9416


### 3) Logistic regression 

**Part a)** We proceed with the best feature selector from the last exercise. We will study the effect of the
learner.

In [13]:
from sklearn.linear_model import LogisticRegression

#increased the max_iter from default 100 to 500 in order to make it converge:
logClf = LogisticRegression(max_iter = 500) 

In [14]:
tagger_log = ScikitConsecutivePosTagger(news_train,features = pos_features ,clf = logClf)
acc_log = (round(tagger_log.evaluate(news_dev_test), 4))
print(f'Logistic accuracy = {acc_log}')

Logistic accuracy = 0.9066


The *Logistic Regression* classifier is better than all of the *BernoulliNB* methods without the token to be tagged.

**Part b)** Similarly to the Naive Bayes classifier, we will study the effect of smoothing. Smoothing for LogisticRegression is done by regularization. In scikit-learn, regularization is expressed by the parameter C. A smaller C means a heavier smoothing (C is the inverse of the parameter $\alpha$ in the lectures). We will tune the C parameter in order to find the most optimal model.

In [19]:
def tunning_logistic(pos_features):
    C_values = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    accuracies = []
    for C in C_values:
        print(f"Running: LogisticRegression(C = {C})")
        logClf = LogisticRegression(C=C,max_iter = 10000) 
        tagger_log = ScikitConsecutivePosTagger(news_train,features = pos_features ,clf = logClf)
        accuracies.append(round(tagger_log.evaluate(news_dev_test), 4))
    
    return C_values,accuracies

In [20]:
C_values,accuracies_log = tunning_logistic(pos_features)

Running: LogisticRegression(C = 0.01)
Running: LogisticRegression(C = 0.1)
Running: LogisticRegression(C = 1.0)
Running: LogisticRegression(C = 10.0)
Running: LogisticRegression(C = 100.0)
Running: LogisticRegression(C = 1000.0)


In [21]:
visualize_results(C_values, accuracies_log)

     alpha  Accuracies
0     0.01      0.8589
1     0.10      0.9066
2     1.00      0.9066
3    10.00      0.9084
4   100.00      0.8895
5  1000.00      0.8832

Best alpha: 10.0 - accuracy: 0.9084


### 4) Features

**Part a)** We will now stick to the LogisticRegression() with the optimal C from the last point and see
whether we are able to improve the results further by extending the feature extractor with more
features. First, try adding a feature for the next word in the sentence, and then train and test.

In [22]:
def pos_features_extended(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}    
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        
    #next word in the secquence:
    if i == len(sentence) - 1:
        features['next-word'] = sentence[i]
    else:
        features['next-word'] = sentence[i+1]
    
    return features

In [23]:
def find_accuracy(pos_features,news_train,news_dev_test):
    best_ind = accuracies_log.index(max(accuracies_log))
    optimal_C = C_values[best_ind]

    clf = LogisticRegression(C=optimal_C,solver= 'liblinear')
    tagger = ScikitConsecutivePosTagger(news_train,features = pos_features ,clf = clf)
    acc = (round(tagger.evaluate(news_dev_test), 4))
    return acc

In [24]:
acc_opt_log = find_accuracy(pos_features_extended,news_train,news_dev_test)
print(f'Logistic regression with optimal C: {acc_opt_log}')

Logistic regression with optimal C: 0.9281


**Part b)** We will continue to add more features to get an even better tagger.

In [25]:
def pos_features_decapilized(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}    
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        
    #next word in the secquence:
    if i == len(sentence) - 1:
       features['next-word'] = sentence[i]
    else:
        features['next-word'] = sentence[i+1]
    features['current-word'] = sentence[i]  

    
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~—'
    
    s = sentence[i]

    if s.isupper():
        s = s.lower()
    elif s.isdigit():
        features['type'] = 'digit'
    elif s in punctuation: 
        features['type'] = 'punctuation'
    else:
        features['type'] = 'other'

    return features

In [26]:
acc_extended = find_accuracy(pos_features_decapilized,news_train,news_dev_test)
print(f'Logistic regression with optimal C: {acc_extended}')

Logistic regression with optimal C: 0.9739


By adding the current word, we get very much more improvement.

### 5) Larger corpus and evaluation
**Part a)** We will now test our best tagger so far on the news_test set.

In [27]:
acc_test_data = find_accuracy(pos_features_decapilized,news_train,news_test)
print(f'Logistic regression - accuracy = {acc_test_data}')

Logistic regression - accuracy = 0.9777


**Part b)** Now,we will use nearly the whole Brown corpus. But we will take away two categories for later evaluation: *adventure* and *hobbies*. We will also initially stay clear of *news* to be sure not to mix training and test data.

In [59]:
categories = brown.categories()
categories.remove('news')
categories.remove('adventure')
categories.remove('hobbies')
tagged_sents = brown.tagged_sents(categories=categories)
brown_data = brown.tagged_sents(categories = categories , tagset = 'unvisersal')

rest_test, rest_dev_test,rest_train = split_data(brown_data)

In [60]:
#merging the datasets

train = rest_train + news_train
test = rest_test + news_test
dev_test = rest_dev_test + news_dev_test

## establish baseline!!

**Part c)** We can then build our tagger for this larger domain. By using the best setting, we will try to find the accuracy for this dataset.

In [68]:
#acc_large = find_accuracy(pos_features_decapilized,train,test)
best_ind = accuracies_log.index(max(accuracies_log))
optimal_C = C_values[best_ind]

optimal_clf = LogisticRegression(C=optimal_C,solver= 'liblinear')
tagger_domain = ScikitConsecutivePosTagger(train,features = pos_features_decapilized ,clf = optimal_clf)
acc_domain = (round(tagger_domain.evaluate(test), 4))



In [69]:
print(f'The accuracy for the tagger for whole domain = {acc_domain}')

The accuracy for the tagger for whole domain = 0.8774


**Part d)** Now, testing the big tagger on *adventure* and *hobbies* categories of Brown corpus.

In [70]:
adventures_sents = brown.tagged_sents(categories = 'adventure' , tagset = 'unvisersal')
hobbies_sents = brown.tagged_sents(categories = 'hobbies' , tagset = 'unvisersal')

In [71]:
acc_adventure = (round(tagger_domain.evaluate(adventures_sents), 4))
acc_hobbies = (round(tagger_domain.evaluate(hobbies_sents), 4))

In [72]:
print(f'Accuracy: {acc_adventure} - adventures')
print(f'Accuracy: {acc_hobbies} - hobbies')

Accuracy: 0.9919 - adventures
Accuracy: 0.9824 - hobbies


Describe the difference!

### 6) Comparing to other taggers

**Part a)** NLTK comes with an HMM-tagger
which we may train and test on our own corpus. It can be trained and testet by

In [78]:
news_hmm_tagger = nltk.HiddenMarkovModelTagger.train(news_train)
news_hmm_acc = round(news_hmm_tagger.evaluate(news_test), 4)
print(f"The news HMM tagger accuracy: {news_hmm_acc}")

The news HMM tagger accuracy: 0.8995


Training and testing on the whole data:

In [80]:
big_hmm_tagger = nltk.HiddenMarkovModelTagger.train(train)
big_hmm_acc = round(big_hmm_tagger.evaluate(test), 4)
print(f"The HMM tagger accuracy: {big_hmm_acc}")

The HMM tagger accuracy: 0.6738


This method of tagging has better speed for training og evaluating, however the accuracy is not quite good.

**Part b)** NLTK also comes with an averaged perceptron tagger which we may train and test. It is currently
considered the best tagger included with NLTK. It can be trained as follows:

In [85]:
def run_per_tagger(train,test,name):
    per_tagger = nltk.PerceptronTagger(load=False)
    per_tagger.train(train)
    per_acc = round(per_tagger.evaluate(test), 4)

    print(f'Perceptron tagger accuracy: {per_acc} - {name}')

In [86]:
run_per_tagger(news_train,news_test,'news_data')
run_per_tagger(news_train,news_test,'all_data')

Perceptron tagger accuracy: 0.9649 - news_data
Perceptron tagger accuracy: 0.9647 - all_data


This is definitely the tagger in this assignment, both in terms of speed and accuracy. It got much better results for *train* data than the best tagger above, but did the computing in much less time. However, it did not as good accuracy as the best model for the *news_data*.  

## Part B

In this part we will use the gensim package to familiarize ourselves with word embeddings and
**word2vec**.

In [87]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import gensim.downloader as api 
wv = api.load('word2vec-google-news-300')

### 1) Basics
**a)** The amount of different words in the model:

In [92]:
total_words = len(wv)
print(f'Total words in the model: {total_words}')

Total words in the model: 3000000


In [88]:
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

The word 'cameroon' does not appear in this model


**b)** ) Implementing a function for calculating the norm (the length) of an (embedding) vector, and a function for calculating the cosine between two vectors.

In [102]:
import numpy as np
def norm(vector):
    return np.linalg.norm(vector)

def similarity(vector1, vector2):
    cosine = np.dot(vector1,vector2)/(norm(vector1) * norm(vector2))
    return cosine

**c)** Comparing the functions with:

In [104]:
print(wv.similarity('king','queen'))
print(similarity(wv['king'],wv['queen']))

0.6510957
0.6510957
