# Simple Word Embeddings
# Co-occurrence Matrix

This notebook shows a very simple implementation for creating word embeddings. It then demonstrates, with examples, how word semantics (meaning) are captured in those embeddings.

Import Libraries

In [1]:
import re
import time

# Numerical python library
import numpy as np
# Dimensionality Reduction to visualise high dimensional data
from sklearn.manifold import TSNE
# Library to create visualisations
import matplotlib.pyplot as plt

### Data Pre-Processing
Read the book into a Python list, where each element of the list is one line from the text file. The other books are then appended to the same list.

In [2]:
with open('books/a_tale_of_two_cities.txt', 'r') as ins:
	book_list = []
	for line in ins:
		book_list.append(line)

# add more books
other_books = ['books/1400-0.txt', 'books/766-0.txt', 'books/786-0.txt',
               'books/pg1023.txt', 'books/pg730.txt']

for book_fn in other_books:
    with open(book_fn, 'r') as ins:
        for line in ins:
            book_list.append(line)    

Look at the first 10 elements of the list

In [3]:
print(book_list[:10])

['A TALE OF TWO CITIES\n', '\n', 'A STORY OF THE FRENCH REVOLUTION\n', '\n', 'By Charles Dickens\n', '\n', '\n', 'Book the First--Recalled to Life\n', '\n', '\n']


We notice that each element ends with '\n', the string representation of a new line in Python. The next step is to remove the newline characters and join consecutive non-empty lines, so that each block of text separated by empty lines becomes a single paragraph.

In [4]:
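book = []
# start the first paragraph with the first line of the file
current_line = book_list[0].replace('\n', '')
for i in range(1, len(book_list)-1):
	line = book_list[i].replace('\n', '')
	if len(line) > 0:
		# non-empty line: extend the current paragraph
		current_line = current_line + ' ' + line
	else:
		# empty line: store the finished paragraph (if any) and start a new one
		if len(current_line) > 0:
			book.append(current_line)
		current_line = ''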
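The loop above stops one line before the end of `book_list`, and whatever is still held in `current_line` when the loop finishes is never appended. A minimal sketch of a variant that also flushes the last paragraph is shown here. The helper name `join_paragraphs` is hypothetical and not part of the original notebook, and it uses `strip()`, so paragraphs no longer start with a space.

```python
def join_paragraphs(lines):
    """Group consecutive non-empty lines into paragraphs.

    Hypothetical helper for illustration; not part of the notebook.
    """
    paragraphs = []
    current = []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(' '.join(current))
            current = []
    if current:
        # Flush the final paragraph when the text does not end with a blank line.
        paragraphs.append(' '.join(current))
    return paragraphs


sample = ['A TALE OF TWO CITIES\n', '\n', 'It was the best of times,\n',
          'it was the worst of times\n']
print(join_paragraphs(sample))
```

Swapping this in for the cell above could change the counts reported later in the notebook slightly, since a little more text would be included.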

Print the first 6 elements of our new list

In [5]:
for i in range(0, 6):
    print(book[i])

A TALE OF TWO CITIES
 A STORY OF THE FRENCH REVOLUTION
 By Charles Dickens
 Book the First--Recalled to Life
 I. The Period
 It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.


Next, we split each element of this list into individual words. For simplicity, we also convert all text to lower case and replace every non-alphanumeric character with a space.

In [6]:
token_book = []
for paragraph in book:
	# convert to lower case
	line = paragraph.lower()
	# replace every character that is not a-z or 0-9 with a space
	line = re.sub(r'[^a-z0-9]', ' ', line)
	# tokenise: split on whitespace
	tokens = line.split()
	token_book.append(tokens)

In [7]:
for i in range(0, 5):
    print(token_book[i])

['a', 'tale', 'of', 'two', 'cities']
['a', 'story', 'of', 'the', 'french', 'revolution']
['by', 'charles', 'dickens']
['book', 'the', 'first', 'recalled', 'to', 'life']
['i', 'the', 'period']
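One side effect of this simple rule is worth seeing on a small input: apostrophes and hyphens are also replaced by spaces, so contractions are split into separate tokens. The sketch below wraps the same two steps in a hypothetical helper called `tokenise` (the name is an assumption, not from the notebook).

```python
import re


def tokenise(text):
    # Lower-case, replace anything that is not a-z or 0-9 with a space, then split.
    return re.sub(r'[^a-z0-9]', ' ', text.lower()).split()


print(tokenise("Book the First--Recalled to Life"))
# ['book', 'the', 'first', 'recalled', 'to', 'life']
print(tokenise("Don't stop; it's 1775!"))
# ['don', 't', 'stop', 'it', 's', '1775']
```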


Now we can easily use this nested list of words to make a dictionary of word counts, which we will use to determine which words we include in the co-occurrence matrix.

In [8]:
word_counts = {}
for line in token_book:
	for token in line:
		# unseen words start at 0, then every occurrence adds 1
		word_counts[token] = word_counts.get(token, 0) + 1
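The same tally can also be written with `collections.Counter` from the standard library. This is an alternative sketch on a two-paragraph toy input (`toy_token_book` and `toy_counts` are illustrative names), not the code the notebook ran.

```python
from collections import Counter

# Toy stand-in for `token_book`: a list of tokenised paragraphs.
toy_token_book = [['a', 'tale', 'of', 'two', 'cities'],
                  ['a', 'story', 'of', 'the', 'french', 'revolution']]

# Flatten the nested list and count every token in one pass.
toy_counts = Counter(token for line in toy_token_book for token in line)

# A Counter returns 0 for words it has never seen instead of raising KeyError.
print(toy_counts['a'], toy_counts['tale'], toy_counts['missing'])
```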

### Data Exploration

In [9]:
i = 0
for key in word_counts:
    print('The word', '"' + key + '"', 'appears', word_counts[key], 'times in this book')
    i += 1
    if i > 4:
        break

The word "a" appears 28692 times in this book
The word "tale" appears 18 times in this book
The word "of" appears 32083 times in this book
The word "two" appears 1511 times in this book
The word "cities" appears 4 times in this book


In [10]:
print('These books have', len(word_counts), 'unique words.')

These books have 25099 unique words.


### Create a Co-occurrence Matrix
Now that our data is ready we can create a co-occurrence matrix.

First we get the list of words that appear 70 or more times in the books and assign each word a unique index, which determines the row and column of the matrix that the word occupies. We only use words that appear at least 70 times so that there are enough occurrences of each word to get a good idea of the types of context in which it is used.

In [11]:
i = 0
word_index = {}
for k in word_counts:
	if word_counts[k] >= 70:
		word_index[k] = i
		i += 1

Create a matrix of zeros in which each row and each column represents one of the words in the word index created above

In [12]:
w_n = len(word_index)
# initialise matrix of zeros
com = np.zeros((w_n, w_n))

print(com.shape)

(1581, 1581)


Now that we have our matrix of zeros, we go through each paragraph and store the contexts of each word.

This is done as seen in the following example:

Here we have an example sentence and an empty matrix, like we created above.


![alt text](stills/w_e_1.png "Title")

We assign a window size 'w'. Here we choose 'w = 3'. We start at position 'w + 1', which is our first centre word. Our window is made up of the 'w' words that come before and the 'w' words that come after our centre word.


![alt text](stills/w_e_2.png "Title")


Now that we have our first centre word and window, we go to the matrix and, in the row of that centre word, add 1 to the count in the column of each window word. Note that if a word appears which is not in our matrix, we do nothing and move on.

![alt text](stills/w_e_3.png "Title")


Next we slide our centre word and window down one and run the same operation to make the additions to the matrix

![alt text](stills/w_e_4.png "Title")

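We do this for each word in the paragraph until we reach the end of the paragraph, and for each paragraph in the corpus until we reach the end of the corpus. Only positions with a full window on both sides are used as centre words, so paragraphs shorter than '2w + 1' words contribute nothing.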
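To make the sliding-window steps concrete, here is a small sketch that runs them on a single short sentence. The helper name `cooccurrence_counts`, the toy sentence, and the five-word vocabulary are assumptions for illustration. It uses nested lists rather than a NumPy array so that it has no dependencies.

```python
def cooccurrence_counts(tokens, vocab, w):
    """Count window co-occurrences for one tokenised paragraph.

    Hypothetical helper for illustration; not part of the notebook.
    """
    index = {word: i for i, word in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    # Only positions with w full words on each side are used as centre words.
    for i in range(w, len(tokens) - w):
        centre = tokens[i]
        if centre not in index:
            continue
        # The w words before and the w words after the centre word.
        window = tokens[i - w:i] + tokens[i + 1:i + 1 + w]
        for other in window:
            # Window words outside the vocabulary are ignored.
            if other in index:
                counts[index[centre]][index[other]] += 1
    return counts


tokens = 'it was the best of times it was the worst of times'.split()
vocab = ['it', 'was', 'the', 'of', 'times']
for word, row in zip(vocab, cooccurrence_counts(tokens, vocab, w=3)):
    print(word, row)
```

Each printed row is the count vector for one centre word, with columns in the same order as `vocab`. The words 'best' and 'worst' are not in the toy vocabulary, so they are skipped both as centre words and as window words.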

In [13]:
# set window size - w - how many words in the context of the centre word
w = 3

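# move sliding window over all paragraphs, add co-occurrences to the matrix
for line in token_book:
	words_in_line = len(line)
	# only paragraphs long enough to hold a full window are used
	if words_in_line >= 2*w+1:
		for i in range(w, words_in_line-w):
			c = line[i]
			# skip centre words that are not in the word index
			if c not in word_index:
				continue
			row = word_index[c]
			# the w words before and the w words after the centre word
			window_words = line[i-w:i] + line[i+1:i+1+w]
			for ww in window_words:
				# window words that are not in the word index are ignored
				if ww in word_index:
					com[row, word_index[ww]] += 1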
                

Example of the co-occurrence matrix

In [14]:
for i in range(22, 27):
    for word in word_index:
        if word_index[word] == i:
            print(i-21, word)
print('\n')
print(com[22:27, 22:27].astype(int))

1 darkness
2 hope
3 we
4 had
5 everything


[[  0   0   0   4   0]
 [  0   0  13  27   1]
 [  0  13  83 377   7]
 [  4  25 377 441  16]
 [  0   1   8  16   2]]


### Exploring word embeddings

We can now take the row for each word and use it as a vector that represents that word. Each word vector (or word embedding) contains information about the words that are commonly used around that word. One consequence is that we can now see which words are similar types of words. The word vectors we have here have 1581 elements (one for each word we counted).
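One common way to compare two such rows is cosine similarity. The notebook does not compute this itself, so treat the sketch below as an assumption about how the comparison could be done. The vectors in `toy_rows` are made up for illustration and are not taken from the real matrix.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up co-occurrence rows, for illustration only.
toy_rows = {
    'hand': [8, 1, 0, 6],
    'arm': [7, 2, 0, 5],
    'france': [0, 9, 7, 1],
}
print(cosine_similarity(toy_rows['hand'], toy_rows['arm']))
print(cosine_similarity(toy_rows['hand'], toy_rows['france']))
```

With the real matrix, the same function could be called on two rows, for example `cosine_similarity(com[word_index['hope']], com[word_index['darkness']])`.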

As these vectors are of high dimensionality, we cannot visualise them as they are. What we can do is reduce the dimensionality, either by Principal Component Analysis (PCA) or by t-distributed stochastic neighbor embedding (t-SNE).
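As a rough sketch of the PCA route, the rows can be projected onto their first two principal components with NumPy's SVD. The helper name `pca_2d` and the random `toy_matrix` are assumptions standing in for the real `com` matrix. The notebook does not show which reduction settings produced the figures that follow.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    # Centre each column so the components describe variance around the mean.
    X_centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ vt[:2].T


rng = np.random.default_rng(0)
toy_matrix = rng.poisson(lam=3.0, size=(10, 50))  # random stand-in for `com`
points = pca_2d(toy_matrix)
print(points.shape)  # (10, 2)
```

Each row of `points` is a 2-D coordinate that could be passed to `plt.scatter`. The t-SNE route would instead use the `TSNE` class imported at the top of the notebook, for example `TSNE(n_components=2).fit_transform(com)`.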

Here we can look at some different groups of words and get some clues about how these word embeddings <b>might</b> be storing information about the meaning of the words.

Here we can see different groups of words: in the top left we have body parts, in the top right we have country-related words, and to the right we have people's names.

![alt text](stills/we_vis_1.png "Title")


In this example we look at the relationship between words by comparing words in their plural form to the same words in their singular form: going from 'women' to 'woman' and from 'men' to 'man'. We can see that the vectors between these pairs (represented here by the red arrows) are similar. I think this is a good example precisely because the arrows are not exactly the same. This is important to remember because, firstly, we have reduced a 1581-dimensional space to a 2-dimensional space, so information is lost, and secondly, the process of creating word embeddings is not perfect. But that does not mean that these word embeddings are not very useful.


![alt text](stills/we_vis_2.png "Title")

### THE END