# Distributional Semantic Models in Python
## EMLAR, 2021

### Raquel G. Alhama, Tilburg University
### Andrew Jessop, University of Liverpool



In [8]:
#First we import some libraries that will be useful
import nltk #Natural Language Toolkit
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity 

## Data Preprocessing


You can try this code with any textual data. For this example, we will use some children's books from Project Gutenberg:
https://www.gutenberg.org/ebooks/bookshelves/search/?query=children|christmas|child|school



In [3]:
#Load data from a book that we have previously downloaded from Project Gutenberg
with open("pg6328.txt", 'r') as f:
    raw = f.read()

#The variable "raw" now contains all the text from this file.

#Let's have a look at an excerpt of the data:
print(raw[4000:5000])

Baba and the Forty Thieves

_From the painting by Edmund Dulac_

HE DESIRED I WOULD STAND LIKE A COLOSSUS

Gulliver at the Court of Lilliput

_From the painting by Arthur Rackham _

THEY WERE VERY TIRED WHEN AT LAST THEY CAME TO THE FOREST OF ARDEN

As You Like It

_From the painting by Charles Folkard _

CHRISTIAN NIMBLY STRETCHED OUT HIS HAND FOR HIS SWORD

Christian's Fight with the Monster Apollyon

_From the etching by William Strang _




PREFACE


Consciously or unconsciously we are influenced by the characters
we admire. A book that exerts a deep as well as a wide influence
must produce changes in the reader's way of thinking, and excite
him to activity; the world for him can never be quite the same
that it was before. Such books have an important part in moulding
the character of a people.

It is because the books represented in this volume have been doing
just that for many years that they have become so prized. In the
characters of Crusoe, Gulliver and Christian, to mention 

In [4]:
#At the moment, raw contains a single string of characters. But we are interested in words rather than characters.

#How can we separate this text into sentences, and each sentence into words?

#This process is called "tokenization".

#First we need to download a tokenizer from NLTK
#(recent NLTK releases may ask for the 'punkt_tab' resource instead):
nltk.download('punkt')

#Now let's tokenize each sentence.
#nltk.sent_tokenize gives us a method to do so
sentences = nltk.sent_tokenize(raw)

#And we tokenize all the words in each sentence and collect them together 
tokenized = []
header = True
for sentence in sentences:
    if header and sentence.startswith("CONTENTS"):
        header = False
    if not header: #We ignore everything before the table of contents
        tokenized.append(nltk.wordpunct_tokenize(sentence))
    

#Let's look at the words. Do you spot any problem? 
print(tokenized)

[nltk_data] Downloading package punkt to /home/andrew/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.




In [16]:
# Let's lowercase the text:
lowercased = []
for sentence in tokenized:
    lowercased.append( [s.lower() for s in sentence] )
        
print(lowercased)



In this part we are going to use some *regular expressions* via the [re](https://docs.python.org/3/library/re.html) package. Regular expressions (often shortened to *regex*) are a useful system for finding patterns in text. They are helpful when preparing data for modelling, for example to remove unwanted characters (like punctuation) or to search text.
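
As a small illustration of the kind of pattern used in this notebook, the sketch below applies `re.sub` to a few made-up tokens (the token list is hypothetical and not taken from the book). The pattern `[^a-z]` matches any character that is not a lowercase letter, so digits, apostrophes and hyphens are removed along with punctuation.

```python
import re

# Hypothetical example tokens, not taken from the book
tokens = ["don't", "forty-two", "cave!", "1001", "..."]

# [^a-z] matches any character that is NOT a lowercase letter a-z;
# re.sub replaces every such character with the empty string
cleaned = [re.sub(r"[^a-z]", "", t) for t in tokens]
print(cleaned)
```

Tokens that contain no lowercase letters at all end up as empty strings, which is why they need to be filtered out afterwards.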

In [46]:
#Let's remove all the punctuation (in fact, every character that is not a lowercase letter a-z)
wordsonly = []
for sentence in lowercased:
    words = []
    for s in sentence:
        word = re.sub(r'[^a-z]', '', s)
        if word != '': # We don't want to add empty strings 
            words.append(word)
    wordsonly.append(words)
print(wordsonly)



In [47]:
#Now let's count the words! 

#We first build a dictionary with word types and their frequencies
word_frequencies = {}
for sentence in wordsonly:
    for s in sentence:
        word_frequencies[s] = word_frequencies.get(s, 0) + 1

print(word_frequencies)



In [48]:
#Let's look at some of the most frequent words.
# We construct a list of words ordered from most frequent to least frequent
sorted_keys = sorted(word_frequencies, key = word_frequencies.get, reverse = True)
print("These are the 10 most frequent words: ", sorted_keys[:10])
print("These are the 10 most infrequent words: ", sorted_keys[-10:])

These are the 10 most frequent words:  ['the', 'and', 'to', 'of', 'i', 'a', 'in', 'was', 'he', 'that']
These are the 10 most infrequent words:  ['scanning', 'items', 'portions', 'header', 'trailer', 'reprinted', 'sales', 'hardware', 'product', 'ver']


In [49]:
# It is generally better to restrict models to words with a minimum frequency.
# We define a frequency threshold of 10 and keep only the words that occur more than 10 times:
minfreq = 10
target_freqs = {word: freq for word, freq in word_frequencies.items() if freq > minfreq}
#Now target_freqs is a dictionary with all the words we are interested in (we call them targets), and their frequency.

#It will also be useful to have the collection of targets (a dictionary view, so membership tests are fast):
targets = target_freqs.keys()
#And the vocabulary size
vocabulary_size = len(target_freqs)
print(vocabulary_size)

1363


### Exercise:
Also remove the most frequent words ("stopwords").
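
One possible starting point, sketched on a toy dictionary with made-up counts: rank the words by frequency and treat the top few as stopwords. The names `toy_frequencies` and `n_stopwords` are hypothetical, and the cut-off is an arbitrary assumption; a curated stopword list would be an equally valid choice.

```python
# Toy frequency dictionary (made-up counts, for illustration only)
toy_frequencies = {"the": 120, "and": 95, "cave": 14, "lamp": 12, "of": 80, "door": 11}

n_stopwords = 3  # hypothetical cut-off: how many top words to treat as stopwords

# Rank words from most to least frequent and take the top n as stopwords
ranked = sorted(toy_frequencies, key=toy_frequencies.get, reverse=True)
stopwords = set(ranked[:n_stopwords])

# Keep only the words that are not stopwords
filtered = {w: f for w, f in toy_frequencies.items() if w not in stopwords}
print(sorted(stopwords))
print(filtered)
```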

In [52]:
# It will be useful to have a numerical index for each word
# We will use it later to locate the word in the co-occurrence matrix
# Word to index:
w2i = {w: i for i, w in enumerate(targets)}
# Index to word:
i2w = {i: w for i, w in enumerate(targets)}

#Example:
print("The code for the word \"cave\" is {}".format(w2i["cave"]))

The code for the word "cave" is 404


## Building the Distributional Semantic Model

In [53]:
#Here we write the functions that we use to build the Distributional Semantic Model.

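# This function is used to build the co-occurrence matrix.
# For every target word, it counts the target words that appear up to `window` positions
# to its left or right within the same sentence.
# Note that it relies on the global variables `targets` and `w2i` defined above.
def calculate_cooccurrences(tokenized, vocabulary_size, window):
    matrix = np.zeros([vocabulary_size, vocabulary_size])
    for sentence in tokenized:
        for position, word in enumerate(sentence):
            #Context positions: from position-window to position+window, clipped to the sentence boundaries
            for j in range(max(position-window, 0), min(position+window+1, len(sentence))):
                context = sentence[j]
                if j != position and word in targets and context in targets:
                    matrix[w2i[word]][w2i[context]] += 1
    return matrix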


#This function will give us the co-occurrence counts between two words, given a co-occurrence matrix
def get_cooccurrence(word1, word2, counts):
    return counts[w2i[word1]][w2i[word2]]

In [54]:
#We now compute the co-occurrences in our tokenized text
count_matrix = calculate_cooccurrences(wordsonly, vocabulary_size, 2)

#Let's have a look at some co-occurrences:
# print(get_cooccurrence("horse", "house", count_matrix))
# print(get_cooccurrence("next", "morning", count_matrix))
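
To see what the window-based counting does, it can help to run it on a corpus small enough to check by hand. The sketch below is a self-contained variant of the counting function above. The helper name `count_cooccurrences`, the toy sentences and the toy index are all hypothetical, and the index is passed in as an argument instead of being read from a global variable.

```python
import numpy as np

def count_cooccurrences(sentences, w2i, window):
    """Count how often each pair of vocabulary words occurs within `window` positions."""
    matrix = np.zeros((len(w2i), len(w2i)))
    for sentence in sentences:
        for position, word in enumerate(sentence):
            if word not in w2i:
                continue  # skip words that are not targets
            start = max(position - window, 0)
            end = min(position + window + 1, len(sentence))
            for j in range(start, end):
                context = sentence[j]
                if j != position and context in w2i:
                    matrix[w2i[word], w2i[context]] += 1
    return matrix

# Toy corpus: "down" is deliberately left out of the vocabulary
toy_sentences = [["the", "cat", "sat", "down"], ["the", "dog", "sat"]]
toy_w2i = {"the": 0, "cat": 1, "sat": 2, "dog": 3}

toy_counts = count_cooccurrences(toy_sentences, toy_w2i, window=1)
print(toy_counts)
```

With a window of 1, only immediate neighbours are counted, so "the" co-occurs once with "cat" and once with "dog", and the resulting matrix is symmetric.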

Here we can apply any transformation to this matrix of counts (e.g. Pointwise Mutual Information). We leave this as an exercise!
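
As a hint for this exercise, here is a minimal sketch of one common variant, Positive Pointwise Mutual Information (PPMI), applied to a made-up 3x3 count matrix. PMI compares the observed count of a word pair with the count expected if the two words were independent; PPMI additionally replaces negative values with zero. The function name `ppmi` and the toy matrix are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI: max(0, log2( P(w,c) / (P(w) * P(c)) ))."""
    total = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)  # shape (n, 1)
    col_totals = counts.sum(axis=0, keepdims=True)  # shape (1, n)
    # Count expected for each cell if rows and columns were independent
    expected = row_totals @ col_totals / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    # Zero counts give log2(0) = -inf and empty rows/columns give 0/0 = nan
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

# Made-up co-occurrence counts, for illustration only
toy_counts = np.array([[0., 2., 0.],
                       [2., 0., 1.],
                       [0., 1., 0.]])
toy_ppmi = ppmi(toy_counts)
print(toy_ppmi)
```

In this toy matrix every non-zero cell is observed twice as often as expected under independence, so each of them gets a PPMI of 1.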

## Distances (semantic similarity)

In [None]:
#We can compute a matrix with the cosine similarity between every pair of words:
similarities = cosine_similarity(count_matrix)
#print(similarities)

#This function will give us the similarity between two words
def get_similarities(word1, word2, similarities):
    return similarities[w2i[word1]][w2i[word2]]


In [None]:
#Let's look at the similarities between these words
print(get_similarities("eat", "drink", similarities))
print(get_similarities("lamp","door", similarities))
print(get_similarities("lamp","drink", similarities))

Now that you know how to compute the similarities, you can run many types of analysis.
For example, you can correlate the model's similarities with human similarity judgements, find the closest neighbours of a word, or compare similarities between models of books from different periods.
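
As an illustration of the nearest-neighbour analysis, the sketch below ranks the words most similar to a given word. The vectors are made up, and the function name `nearest_neighbours` and the toy indices are hypothetical. The cosine similarities are computed directly with NumPy here so that the sketch depends on nothing else.

```python
import numpy as np

def nearest_neighbours(word, similarities, w2i, i2w, k=3):
    """Return the k words most similar to `word`, with their similarity scores."""
    row = similarities[w2i[word]].copy()
    row[w2i[word]] = -np.inf  # exclude the word itself
    best = np.argsort(row)[::-1][:k]  # indices sorted from most to least similar
    return [(i2w[int(i)], float(row[i])) for i in best]

# Made-up word vectors (rows), for illustration only
toy_words = ["eat", "drink", "lamp", "door"]
toy_vectors = np.array([[5., 4., 0.],
                        [4., 5., 0.],
                        [0., 1., 5.],
                        [1., 0., 5.]])
toy_w2i = {w: i for i, w in enumerate(toy_words)}
toy_i2w = {i: w for i, w in enumerate(toy_words)}

# Cosine similarity: dot product of the length-normalised vectors
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_similarities = unit @ unit.T

neighbours = nearest_neighbours("eat", toy_similarities, toy_w2i, toy_i2w, k=2)
print(neighbours)
```

In the toy data "eat" and "drink" have nearly parallel vectors, so "drink" comes out as the closest neighbour of "eat".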

## Exercise
The file thomas.txt contains child-directed speech from the Thomas corpus in CHILDES.

Re-use the previous code to build a Distributional Semantic Model for this data.