# Code for module paper

This code covers 4 tasks: 1) creating word pairs for intrinsic evaluation, 2) preparing 3 different word embeddings, 3) evaluating these embeddings with an intrinsic method and 4) evaluating them with an extrinsic method. 

#### 1) Creating word pairs 

It creates a set of 200 word pairs from our corpus. It excludes words that occur very frequently as well as words that occur 30 times or fewer. So that the pairs can be used with the pre-trained embeddings of the second task, it keeps only words that exist in the pre-trained vocabulary. 
It builds the word pairs with WordNet, which groups words into synsets of nouns, verbs, adverbs and adjectives. It collects only nouns out of these 4 parts of speech. For the similarity of two nouns, it computes the similarities of all synset pairs and averages them. For example, the first word "boat" has two synsets and the second word "ship" has one synset, so it averages the similarities of boat-ship and gravy_boat-ship. 

In [1]:
from nltk.corpus import wordnet as wn

print("Boat synsets:",wn.synsets('boat', pos='n'))
print("Ship synsets:",wn.synsets('ship', pos='n'))

Boat synsets: [Synset('boat.n.01'), Synset('gravy_boat.n.01')]
Ship synsets: [Synset('ship.n.01')]
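
The averaging over synset pairs can be sketched without WordNet installed. This is a minimal sketch: `average_pairwise_similarity` is a hypothetical helper, and the values in `toy_scores` are made-up placeholders, not real Wu-Palmer scores.

```python
from itertools import product

def average_pairwise_similarity(senses1, senses2, similarity):
    """Average `similarity` over every (sense1, sense2) combination."""
    scores = [similarity(s1, s2) for s1, s2 in product(senses1, senses2)]
    return sum(scores) / len(scores)

# Made-up placeholder scores, NOT real WordNet values.
toy_scores = {
    ("boat.n.01", "ship.n.01"): 0.9,
    ("gravy_boat.n.01", "ship.n.01"): 0.5,
}

boat_senses = ["boat.n.01", "gravy_boat.n.01"]
ship_senses = ["ship.n.01"]

avg = average_pairwise_similarity(
    boat_senses, ship_senses, lambda a, b: toy_scores[(a, b)]
)
print(round(avg, 2))
```

With real data, the `similarity` argument would be a function that calls `wup_similarity` on two synsets.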


#### 2) Using 3 different word embeddings

It trains Skip-gram and CBOW models on our corpus and loads one pre-trained model, the GoogleNews word2vec vectors. 

#### 3) Evaluating intrinsic method 

It evaluates the 3 different word embeddings on the set of word pairs created in task 1. 

#### 4) Evaluating extrinsic method 

The 3 different word embeddings evaluated in task 3 are then used for a text classification problem. The corpus consists of 18 datasets collected by 18 participants, and each dataset has a label. The goal of the text classification is to predict this label. 

## Step 1) Reading the Corpus

The texts in the corpus are already split into sentences. It reads these sentences and tokenizes each sentence. 

In [2]:
import logging
from pathlib import Path

logging_level = logging.DEBUG
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging_level)

import gzip
import json

def load_data(filename):
    return json.loads(gzip.GzipFile(filename).read().decode('utf-8'))
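
Judging from how the corpus is iterated in the following cells, the file appears to hold a JSON object that maps each collector to a list of texts. The round trip below is a minimal sketch under that assumption; `toy_corpus` and the file name are hypothetical.

```python
import gzip
import json
import os
import tempfile

def load_data(filename):
    # Same approach as the cell above: decompress, decode, parse JSON.
    with gzip.open(filename, "rb") as handle:
        return json.loads(handle.read().decode("utf-8"))

# Hypothetical miniature corpus: collector name -> list of texts.
toy_corpus = {
    "designer_a": ["First text. Second sentence."],
    "designer_b": ["Another text."],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_corpus.json.gz")
    with gzip.open(path, "wb") as handle:
        handle.write(json.dumps(toy_corpus).encode("utf-8"))
    loaded = load_data(path)

for designer, texts in loaded.items():
    print(designer, len(texts), sum(len(text) for text in texts))
```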

In [None]:
course_corpus = load_data('./material/nlpcm_corpus_1.json.gz')
logging.info(' The course corpus consists of %d subcorpora:' % len(course_corpus))

index = 0
for designer, texts in course_corpus.items():
    logging.info(' %d: %6d texts gathered by %s with %d characters in total.' % (index, len(texts), designer, sum([len(text) for text in texts])))
    index += 1

## Step 2) Tokenizing the sentences

In [4]:
import nltk
sentences = []
index = 0
for designer, texts in course_corpus.items():
    print('Tokenizing text of subcorpus %d of %d' % (index, len(course_corpus)))
    index += 1
    for text in texts:
        for sentence in nltk.sent_tokenize(text, language="english"):
            tokenized_sentence = nltk.word_tokenize(sentence, language="english")
            sentences.append(tokenized_sentence)
print('Corpus number of sentences: %d' % len(sentences))
print('Corpus number of tokens: %d' % sum([len(sentence) for sentence in sentences]))

Tokenizing text of subcorpus 0 of 18
Tokenizing text of subcorpus 1 of 18
Tokenizing text of subcorpus 2 of 18
Tokenizing text of subcorpus 3 of 18
Tokenizing text of subcorpus 4 of 18
Tokenizing text of subcorpus 5 of 18
Tokenizing text of subcorpus 6 of 18
Tokenizing text of subcorpus 7 of 18
Tokenizing text of subcorpus 8 of 18
Tokenizing text of subcorpus 9 of 18
Tokenizing text of subcorpus 10 of 18
Tokenizing text of subcorpus 11 of 18
Tokenizing text of subcorpus 12 of 18
Tokenizing text of subcorpus 13 of 18
Tokenizing text of subcorpus 14 of 18
Tokenizing text of subcorpus 15 of 18
Tokenizing text of subcorpus 16 of 18
Tokenizing text of subcorpus 17 of 18
Corpus number of sentences: 658902
Corpus number of tokens: 14548903


## Step 3) Preprocessing

Instead of using every word that occurs in the corpus, it preprocesses the text to remove unhelpful words such as stopwords. In line with Zipf's law, a few terms such as "the" and "a" occur very often in a corpus but are not helpful for NLP tasks. The 100 most frequent words cover 50% of all tokens. For the task, it excludes these words, together with rare words that occur 30 times or fewer. 

In [5]:
from collections import Counter

wortfrequenz = Counter()

for satz in sentences:
    wortfrequenz.update(satz)

vocabulary = [w for w,f in wortfrequenz.items() if 30 < f < 19000]
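
The frequency band filter can be tried on a tiny corpus. This is a minimal sketch: `frequency_band` and `toy_sentences` are hypothetical, and the bounds 1 and 3 stand in for the bounds 30 and 19000 used above.

```python
from collections import Counter

toy_sentences = [
    ["the", "boat", "sails"],
    ["the", "ship", "sails"],
    ["the", "boat", "sinks"],
]

frequencies = Counter()
for sentence in toy_sentences:
    frequencies.update(sentence)

def frequency_band(counter, lower, upper):
    """Keep words whose count is strictly between lower and upper."""
    return [word for word, count in counter.items() if lower < count < upper]

# "the" (3 occurrences) is too frequent; "ship" and "sinks" (1 each) are too rare.
toy_vocabulary = frequency_band(frequencies, 1, 3)
print(toy_vocabulary)
```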

In [None]:
import matplotlib.pylab as plt
%matplotlib inline

sorted_words = sorted(wortfrequenz, key=lambda word: wortfrequenz[word], reverse=True)
word_ranks = {word: rank+1 for rank, word in enumerate(sorted_words)}
frequency_ranks = {word_ranks[word]: frequency for word, frequency in wortfrequenz.items()}

lists = sorted(frequency_ranks.items())
x, y = zip(*lists)

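# Total number of tokens, computed once instead of on every loop iteration.
total_count = sum(wortfrequenz.values())

# Walk down the frequency ranking until half of all tokens are covered.
half_sum = 0
stop_rank = 0
for rank in range(len(y)):
    if half_sum <= total_count * 0.5:
        half_sum += y[rank]
        stop_rank = rank
    else:
        break
print("{0} words cover 50% of words. Its sum is {1}".format(stop_rank+1, half_sum))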

In [None]:
n_limit = 100
plt.plot(x[:n_limit], y[:n_limit], color='blue')
plt.show()
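
The coverage count can also be written with a running total. This is a minimal sketch: `words_to_cover` and `toy_counts` are hypothetical, and the function stops as soon as the running total reaches the target, so it can differ by one from the loop above when the total lands exactly on the threshold.

```python
from collections import Counter
from itertools import accumulate

def words_to_cover(counter, fraction):
    """Return how many top-ranked words are needed to reach `fraction` of all tokens."""
    target = fraction * sum(counter.values())
    ranked_counts = [count for _, count in counter.most_common()]
    for rank, running_total in enumerate(accumulate(ranked_counts), start=1):
        if running_total >= target:
            return rank
    return len(ranked_counts)

# Hypothetical counts that sum to 100 tokens.
toy_counts = Counter({"the": 50, "of": 30, "boat": 10, "ship": 6, "gravy": 4})
print(words_to_cover(toy_counts, 0.5))
print(words_to_cover(toy_counts, 0.9))
```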

## Step 4) Collecting nouns

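It creates noun word pairs for evaluation. By utilizing WordNet, it extracts the nouns from our vocabulary: it keeps the words that WordNet lists as nouns and drops every word that WordNet also lists as an adjective, verb or adverb. 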

In [8]:
from itertools import chain

filtered_nouns = vocabulary

# Drop every word that WordNet lists as an adjective ('a'), satellite adjective ('s'),
# verb ('v') or adverb ('r'). Sets make the membership tests fast.
for pos in ['a','s','v','r']:
    pos_set = {x.lower() for x in chain(*[i.lemma_names() for i in wn.all_synsets(pos)])}
    filtered_nouns = [w for w in filtered_nouns if w.lower() not in pos_set]

# Keep only the remaining words that WordNet lists as nouns.
noun_set = {x.lower() for x in chain(*[i.lemma_names() for i in wn.all_synsets('n')])}
filtered_nouns = [w for w in filtered_nouns if w.lower() in noun_set]

In [9]:
print('The number of words tokenized : %d' % sum([len(sentence) for sentence in sentences]))
print('The number of words without stopwords : %d' % len(vocabulary))
print('The number of nouns: %d' % len(filtered_nouns))

The number of words tokenized : 14548903
The number of words without stopwords : 18099
The number of nouns: 4954
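
The noun filter boils down to two set membership tests. This is a minimal sketch with hypothetical lemma sets in place of the WordNet lemma lists; `keep_unambiguous_nouns` is an illustrative name.

```python
def keep_unambiguous_nouns(words, noun_lemmas, other_pos_lemmas):
    """Keep words that are noun lemmas and not lemmas of any other part of speech."""
    nouns = {lemma.lower() for lemma in noun_lemmas}
    others = {lemma.lower() for lemma in other_pos_lemmas}
    return [w for w in words if w.lower() in nouns and w.lower() not in others]

# Hypothetical toy lemma sets, not taken from WordNet.
toy_noun_lemmas = {"scotland", "run", "ordeal"}
toy_other_lemmas = {"run", "quickly"}

toy_words = ["Scotland", "run", "quickly", "ordeal"]
toy_nouns = keep_unambiguous_nouns(toy_words, toy_noun_lemmas, toy_other_lemmas)
print(toy_nouns)
```

The comparison is case-insensitive, but the words keep their original casing, which is why capitalized words can appear in the word pairs.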


## Step 5) Computing similarity for all noun pairs 

It uses Wu-Palmer similarity, which scores how similar two word senses are based on the depths of the two senses in the WordNet taxonomy and the depth of their most specific common ancestor. 

In [10]:
def num_synsets(word):
    return len(wn.synsets(word, pos='n'))

def wup_similarity(word1, word2):
    return word1.wup_similarity(word2)
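
To show what the score measures, here is a minimal sketch of the Wu-Palmer formula, 2 * depth(lcs) / (depth(s1) + depth(s2)), on a hypothetical toy tree. It assumes a single-parent taxonomy with the root at depth 1; WordNet itself allows several hypernym paths, so NLTK's values are computed with more care than this.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
toy_parent = {
    "entity": None,
    "vehicle": "entity",
    "vessel": "vehicle",
    "boat": "vessel",
    "ship": "vessel",
    "car": "vehicle",
}

def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = [node]
    while toy_parent[path[-1]] is not None:
        path.append(toy_parent[path[-1]])
    return path

def depth(node):
    """Number of nodes on the path to the root, so the root has depth 1."""
    return len(path_to_root(node))

def toy_wup(node1, node2):
    ancestors2 = set(path_to_root(node2))
    # The first node on node1's path that is also above node2 is the deepest shared ancestor.
    lcs = next(n for n in path_to_root(node1) if n in ancestors2)
    return 2 * depth(lcs) / (depth(node1) + depth(node2))

print(toy_wup("boat", "ship"))
print(toy_wup("boat", "car"))
```

Here boat and ship share the deep ancestor "vessel" and score higher than boat and car, which only share "vehicle".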

In [11]:
Num_nouns = len(filtered_nouns)

# Look up the noun synsets of every word once instead of inside the inner loops.
noun_synsets = {w: wn.synsets(w, pos='n') for w in filtered_nouns}

pairs = []
for i in range(Num_nouns) : 
    for j in range(Num_nouns) : 
        if i != j :
            w1, w2 = filtered_nouns[i], filtered_nouns[j]
            synsets1, synsets2 = noun_synsets[w1], noun_synsets[w2]

            # Average the Wu-Palmer similarity over all synset pairs of the two words.
            total = 0
            for word1 in synsets1 :
                for word2 in synsets2 :
                    total += wup_similarity(word1, word2)
            dist = total / (len(synsets1) * len(synsets2))
            pair = [w1, w2, dist]
            pairs.append(pair)

## Step 6) Removing words not in pre-trained method: GoogleNews

It removes word pairs in which the first and the second word are the same. Moreover, it excludes word pairs that contain a word missing from the GoogleNews vocabulary. 

In [12]:
from gensim.models import KeyedVectors
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
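# `vocab` is the gensim < 4.0 attribute; gensim 4 exposes the vocabulary as `key_to_index`.
# A set keeps the membership tests below fast.
pre_trained = set(word2vec.vocab)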

INFO:loading projection weights from GoogleNews-vectors-negative300.bin.gz
DEBUG:{'uri': 'GoogleNews-vectors-negative300.bin.gz', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
INFO:loaded (3000000, 300) matrix from GoogleNews-vectors-negative300.bin.gz


In [13]:
import pandas as pd
df = pd.DataFrame(pairs, columns=['word1', 'word2', 'similarity']) 
df = df.loc[df['word1'].str.lower()!=df['word2'].str.lower()]
df = df[df['word1'].isin(pre_trained) & df['word2'].isin(pre_trained)]
print('The number of pairs: %d' % len(df))

The number of pairs: 24082208
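
The two filters can be checked on a small DataFrame. This is a minimal sketch: `toy_pairs` and `known_words` are hypothetical stand-ins for the full pair list and the pre-trained vocabulary.

```python
import pandas as pd

toy_pairs = pd.DataFrame(
    [["Boat", "boat", 1.0], ["boat", "ship", 0.9], ["boat", "zzyzx", 0.2]],
    columns=["word1", "word2", "similarity"],
)
# Hypothetical stand-in for the pre-trained vocabulary.
known_words = {"Boat", "boat", "ship"}

# Filter 1: drop pairs whose two words are identical up to casing.
not_same_word = toy_pairs["word1"].str.lower() != toy_pairs["word2"].str.lower()
# Filter 2: keep pairs whose two words are both in the vocabulary.
both_known = toy_pairs["word1"].isin(known_words) & toy_pairs["word2"].isin(known_words)

filtered = toy_pairs[not_same_word & both_known]
print(filtered)
```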


## Step 7) Sampling 200 word pairs 

It samples 100 pairs whose similarity is at least 0.5 and 100 pairs whose similarity is below 0.5. 

In [14]:
# Draw 100 pairs from each side of the 0.5 threshold and renumber the rows.
set_pairs = pd.concat((df[df.similarity>=0.5].sample(100),df[df.similarity<0.5].sample(100))).reset_index(drop=True)

In [15]:
set_pairs.head()

   word1     word2      similarity
0  Scotland  Detroit    0.7
1  capo      da         0.545455
2  ordeal    Keeping    0.524269
3  panty     classroom  0.5
4  Benny     Einstein   0.600752
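
Because `sample` draws at random, the selected pairs change on every run. The sketch below shows one way to make the balanced sampling repeatable with pandas' `random_state` argument; `sample_balanced` and the toy data are hypothetical.

```python
import pandas as pd

def sample_balanced(pairs, threshold, n_per_side, seed):
    """Draw n_per_side pairs at or above the threshold and n_per_side pairs below it."""
    high = pairs[pairs.similarity >= threshold].sample(n_per_side, random_state=seed)
    low = pairs[pairs.similarity < threshold].sample(n_per_side, random_state=seed)
    return pd.concat((high, low)).reset_index(drop=True)

# Hypothetical toy pairs with similarities 0.0, 0.1, ..., 0.9.
toy_pairs = pd.DataFrame({
    "word1": [f"a{i}" for i in range(10)],
    "word2": [f"b{i}" for i in range(10)],
    "similarity": [i / 10 for i in range(10)],
})

sample = sample_balanced(toy_pairs, threshold=0.5, n_per_side=2, seed=0)
print(sample)
```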


## Step 8) 3 different word embeddings 

Here, it trains Skip-gram and CBOW word embeddings and loads the word embeddings pre-trained on the GoogleNews dataset. For Skip-gram and CBOW, it trains for 30 iterations with window = 5. The dimension of the word embeddings in all three models is 300. 

In [None]:
import gensim 
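# gensim < 4.0 parameter names; gensim 4 renamed `iter` to `epochs` and `size` to `vector_size`.
# sg=1 selects Skip-gram, sg=0 selects CBOW.
skip_gram = gensim.models.Word2Vec(sentences, iter=30, min_count=30, window=5, size=300, sg=1, negative=20)
CBOW = gensim.models.Word2Vec(sentences, iter=30, min_count=30, window=5, size=300, sg=0, negative=20)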

In [17]:
import numpy as np  

def mag(x):
    return np.sqrt(x.dot(x))

vectors_skip_gram = {w:skip_gram.wv[w]/mag(skip_gram.wv[w]) for w in vocabulary}
vectors_CBOW = {w:CBOW.wv[w]/mag(CBOW.wv[w]) for w in vocabulary }
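
The vectors are divided by their length so that a plain dot product of two of them equals their cosine similarity. A minimal sketch with two hypothetical 2-dimensional vectors:

```python
import numpy as np

def unit(vector):
    """Scale a vector to length 1."""
    return vector / np.sqrt(vector.dot(vector))

# Hypothetical toy vectors, both of length 5.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# After normalization the dot product is the cosine of the angle between a and b.
cosine = unit(a).dot(unit(b))
print(round(float(cosine), 2))
```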

For the word "love", the first five components of each word representation look as follows:

In [18]:
print('Skip-gram: ', vectors_skip_gram['love'][0:5])
print('CBOW: ', vectors_CBOW['love'][0:5])
print('Pre-trained: ', word2vec.word_vec('love')[0:5])

Skip-gram:  [-0.05764098 -0.05651434 -0.02088199  0.05422454 -0.05084962]
CBOW:  [ 0.10642056 -0.01309693 -0.04140827  0.00130632  0.02166286]
Pre-trained:  [ 0.10302734 -0.15234375  0.02587891  0.16503906 -0.16503906]


## Step 9) Intrinsic evaluation 

Using the set of word pairs from task 1, it evaluates the three word embedding methods by the Pearson correlation between the WordNet similarities and the dot products of the word vectors. 

In [19]:
import math 

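def evaluate(data, vectors):
    # Returns the Pearson correlation between the gold similarities in `data`
    # and the dot products of the corresponding word vectors.
    # `vectors` may be a dict (indexed with []) or a lookup function such as word2vec.word_vec.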
    gold = []
    predicted = []
    for v,w,sim in data:
        try :
            pred = vectors[v].dot(vectors[w])
        except TypeError : 
            pred = vectors(v).dot(vectors(w))
        gold.append(sim)
        predicted.append(pred)
    
    av_p = sum(predicted)/len(predicted)
    av_g = sum(gold)/len(gold)
    
    cov = 0
    var_g = 0
    var_p = 0
    for s,t in zip(gold,predicted):
        cov += (s-av_g) * (t-av_p)
        var_g += (s-av_g) * (s-av_g)
        var_p += (t-av_p) * (t-av_p)
        
    return cov / math.sqrt(var_g*var_p)
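
The value returned above is the Pearson correlation coefficient. As a sanity check, the hand-written formula can be compared with NumPy's `corrcoef`. This is a minimal sketch; the `gold` and `predicted` lists are made-up numbers, not results from the embeddings.

```python
import math
import numpy as np

def pearson(gold, predicted):
    """Pearson correlation, written out like the loop in the cell above."""
    mean_g = sum(gold) / len(gold)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_g * var_p)

# Made-up illustration values.
gold = [0.1, 0.4, 0.5, 0.9]
predicted = [0.2, 0.3, 0.7, 0.8]

r = pearson(gold, predicted)
reference = np.corrcoef(gold, predicted)[0, 1]
print(abs(r - reference) < 1e-12)
```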

In [20]:
df = df.loc[df['word1'].isin(vocabulary) & df['word2'].isin(vocabulary)]

In [21]:
# Resample 100 pairs from each side of the 0.5 threshold and renumber the rows.
set_pairs = pd.concat((df[df.similarity>=0.5].sample(100),df[df.similarity<0.5].sample(100))).reset_index(drop=True)
set_pairs.to_csv('test.csv', index=False) 

val_skip_gram = evaluate(set_pairs.values.tolist(), vectors_skip_gram)
val_CBOW = evaluate(set_pairs.values.tolist(), vectors_CBOW)
val_pre_trained = evaluate(set_pairs.values.tolist(), word2vec.word_vec)

print('Skip-gram: ', val_skip_gram)
print('CBOW: ', val_CBOW)
print('Pre-trained: ', val_pre_trained)

Skip-gram:  0.30420823489512194
CBOW:  0.4031173769615571
Pre-trained:  0.3882292940473257


## Step 10) Text classification 

This task needs several preparation steps: 1) creating an index for each word, 2) converting the sentences into instances that contain only indices, 3) labeling the 18 datasets from the different participants and 4) splitting the data into training and test sets. After this preparation, the neural network for text classification can be built. 

In [22]:
word_index = {"<PAD>": 0, "<UNK>": 1}
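# To evaluate another embedding, replace skip_gram here and in the loop below.
# Rows 0 (<PAD>) and 1 (<UNK>) keep their random initial values.
embedding_matrix = np.random.uniform(-1, 1, (len(word_index) + len(skip_gram.wv.vocab), 300))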
for word in skip_gram.wv.vocab:
    index = len(word_index)
    word_index[word] = index
    embedding_matrix[index] = skip_gram.wv[word]
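
The same construction on a hypothetical two-word vocabulary shows how the reserved rows and the `<UNK>` fallback work. This is a minimal sketch; `toy_vectors` and all `toy_` names are illustrative.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for a two-word vocabulary.
toy_vectors = {"boat": np.array([1.0, 0.0]), "ship": np.array([0.0, 1.0])}

toy_word_index = {"<PAD>": 0, "<UNK>": 1}
rng = np.random.default_rng(0)
# One row per reserved token plus one row per vocabulary word.
toy_matrix = rng.uniform(-1, 1, (len(toy_word_index) + len(toy_vectors), 2))

for word, vector in toy_vectors.items():
    index = len(toy_word_index)   # next free row
    toy_word_index[word] = index
    toy_matrix[index] = vector

# Words outside the vocabulary fall back to the <UNK> index.
encoded = [toy_word_index.get(token, toy_word_index["<UNK>"])
           for token in ["boat", "yacht", "ship"]]
print(encoded)
```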

In [None]:
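sentence_min_length = 10

import nltk
nltk.download('punkt')

instances = {designer: [] for designer in course_corpus}
for designer, texts in course_corpus.items():
    # Count the instances collected so far; len(instances) would only give the number of designers.
    n_instances = sum(len(designer_instances) for designer_instances in instances.values())
    logging.info('Processing texts of designer %s; %d instances so far.' % (designer, n_instances))
    for text in texts:
        # Iterate directly so the tokenized `sentences` list from Step 2 is not overwritten.
        for sentence in nltk.sent_tokenize(text):
            tokens = nltk.word_tokenize(sentence)
            # Skip short sentences and map every token to its index (<UNK> for unknown words).
            if len(tokens) >= sentence_min_length:
                instances[designer].append([word_index.get(word, word_index["<UNK>"]) for word in tokens])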

In [None]:
import random 
import tensorflow as tf
from tensorflow import keras

TEST_TRAIN_RATIO = 0.1

MAX_SEQUENCE_LENGTH = 30 

def pad_input(sentences):
    return keras.preprocessing.sequence.pad_sequences(
        sentences, maxlen=MAX_SEQUENCE_LENGTH, dtype='int32', padding='pre', truncating='pre', value=word_index["<PAD>"])

train_labeled_data = []
test_labeled_data = []
designer_index = {}
for designer, designer_instances in instances.items():
    designer_index[designer] = len(designer_index)
    random.shuffle(designer_instances)
    test_labeled_data += [(inst, designer_index[designer]) for inst in designer_instances[:round(len(designer_instances)*TEST_TRAIN_RATIO)]]
    train_labeled_data += [(inst, designer_index[designer]) for inst in designer_instances[round(len(designer_instances)*TEST_TRAIN_RATIO):]]

random.shuffle(train_labeled_data)
train_data = pad_input([inst[0] for inst in train_labeled_data])
train_labels = [inst[1] for inst in train_labeled_data]

random.shuffle(test_labeled_data)
test_data = pad_input([inst[0] for inst in test_labeled_data])
test_labels = [inst[1] for inst in test_labeled_data]
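
With `padding='pre'` and `truncating='pre'`, short sentences are filled at the front and long sentences lose their first tokens. The NumPy function below is a minimal sketch of that behavior for illustration only; `pad_pre` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, sequence in enumerate(sequences):
        kept = sequence[-maxlen:]                      # 'pre' truncation drops leading tokens
        if kept:
            padded[row, maxlen - len(kept):] = kept    # 'pre' padding fills the front
    return padded

toy_batch = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_batch.tolist())
```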

In [None]:
input_layer = keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype='int32')

embedding_layer = keras.layers.Embedding(len(word_index),
                                         300,
                                         weights=[embedding_matrix],
                                         input_length=MAX_SEQUENCE_LENGTH,
                                         trainable=False)(input_layer)
text_encoder = keras.layers.LSTM(64,
                                 dropout=0.2,
                                 recurrent_dropout=0.2)(embedding_layer)

output_layer = keras.layers.Dense(len(course_corpus),
                                  activation='softmax')(text_encoder)


model = keras.models.Model(inputs=input_layer, outputs=output_layer)
model_loss = keras.losses.SparseCategoricalCrossentropy()
model.compile(loss=model_loss,
              optimizer=keras.optimizers.Adam(),
              metrics=['acc'])

model.summary()

In [None]:
model.fit(np.array(train_data),
          np.array(train_labels),
          epochs=100,
          batch_size = 512,
          verbose=1)

model.evaluate(np.array(test_data), np.array(test_labels))