# Word2vec Implementation

**Author**: Ramy Ghorayeb

In this tutorial, we will build our own word2vec implementation and brush up on our NLP skills.

Download the dataset we will work with [here](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html).

In [1]:
import io
import os
import csv
import numpy as np
import scipy
import keras
from progressbar import ProgressBar, Percentage, Bar, AnimatedMarker, Counter, Timer, ETA, FileTransferSpeed

Using TensorFlow backend.


## Corpus Cleaning

### Book cleaning

Let's take a look at one of the books:

In [2]:
book = open("corpus/test.txt")
for line in book:
    print(line)



LINCOLN LETTERS



By Abraham Lincoln





Published by The Bibilophile Society









NOTE



The letters herein by Lincoln are so thoroughly characteristic of

the man, and are in themselves so completely self-explanatory, that

it requires no comment to enable the reader fully to understand and

appreciate them. It will be observed that the philosophical

admonitions in the letter to his brother, Johnston, were written on

the same sheet with the letter to his father.



The promptness and decision with which Lincoln despatched the

multitudinous affairs of his office during the most turbulent

scenes of the Civil War are exemplified in his unequivocal order to

the Attorney-General, indorsed on the back of the letter of Hon.

Austin A. King, requesting a pardon for John B. Corner. The

indorsement bears even date with the letter itself, and Corner was

pardoned on the following day.











We see there is a lot of cleaning to do:
* (1) titles are written in uppercase > we will lowercase everything with the lower() method
* (2) lines are broken at the page margin > we will build a remove_breaks function to join them
* (3) a line should end where a sentence ends > we will build an add_breaks function
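
Before looking at the full implementation, the three steps can be sketched on a toy input. This is only an illustration under simplifying assumptions: the hypothetical `clean_text` helper below treats blank lines as paragraph boundaries and splits after every period, whereas the functions in this notebook rely on line lengths and handle initials.

```python
import re

def clean_text(raw):
    """Toy version of the three cleaning steps (not the notebook's functions)."""
    # (1) lowercase everything
    text = raw.lower()
    # (2) join the lines of each paragraph; assumes blank lines separate paragraphs
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    # (3) start a new line after each period that is followed by whitespace
    sentences = []
    for p in paragraphs:
        if p:
            sentences.extend(re.split(r"(?<=\.)\s+", p))
    return sentences

sample = "NOTE\n\nThe letters are\nshort. They are\nclear.\n"
print(clean_text(sample))
# ['note', 'the letters are short.', 'they are clear.']
```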

In [218]:
def remove_breaks(corpus):
    corpus = list(corpus)
    corpus_new = []
    line_prev = ''
    line_to_add = ''

    for i, line in enumerate(corpus):
                
        if i>0:
            line_prev = corpus[i-1]
        word = line.split(' ')[0]
                
        # dealing with the last line
        if i == len(corpus)-1 and len(line) > 1:
            corpus_new.append(line_to_add.lower())
            corpus_new.append(line.lower())
                    
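        # An empty previous line needs no special case: it falls through to the
        # else branch below, which flushes the paragraph accumulated so far.

        # the previous line was wrapped at the margin (the first word of the
        # current line would not have fit within 67 characters): keep accumulating
        if len(line_prev) >= 67 or len(line_prev) + len(word) >= 67:
            line_to_add += line_prev.strip() + ' '
            
        # otherwise the previous line ended naturally: emit the accumulated text
        else: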
            line_to_add += line_prev.strip()
            if len(line_to_add) > 1:
                corpus_new.append(line_to_add.lower())
                line_to_add = ''

    return corpus_new

def add_breaks(corpus):
    
    corpus = list(corpus)
    corpus_new = list(corpus)
    count = 0
    count_for_line = 0
    line_to_add = ''
            
    for i,line in enumerate(corpus):
    
        line_to_add = ''
        count_for_line = 0
        
        for j, char in enumerate(line):
            
            first_if = 'NO'
            line_to_add += char
            
            # if there is a dot in the middle of the line
            if char == '.' and j != len(line)-1 and len(line)>1 and count_for_line == 0:
                # exclude dots for surname
                if line[j-2] != ' ' and line[j-2] != '.':
                    corpus_new.pop(i+count)
                    corpus_new.insert(i+count,line_to_add.strip())
                    line_to_add = ''
                    count_for_line += 1
                    first_if = 'YES'
                    
            # add another part of the line if multiple sentence in same line 
            if char == '.' and j != len(line)-1 and len(line)>1 and count_for_line > 0 and first_if == 'NO':
                if line[j-2] != ' ' :
                    corpus_new.insert(i+count+count_for_line,line_to_add.strip())
                    line_to_add = ''
                    count_for_line += 1
                
            # add end of the line, take into account additional lines if multiple sentences in same line
            if char == '.' and j == len(line)-1 and len(line)>1 and count_for_line > 0 and first_if == 'NO':
                corpus_new.insert(i+count+count_for_line,line_to_add.strip())
                line_to_add = ''
                count += count_for_line

            # lines with no dot in the middle are left unchanged in corpus_new

    return corpus_new    

def book_cleaning(corpus):
    corpus = remove_breaks(corpus)
    corpus = add_breaks(corpus)
    return corpus

In [219]:
book = open("corpus/test.txt")
book = book_cleaning(book)
for line in book:
    print(line)

lincoln letters
by abraham lincoln
published by the bibilophile society
note
the letters herein by lincoln are so thoroughly characteristic of the man, and are in themselves so completely self-explanatory, that it requires no comment to enable the reader fully to understand and appreciate them.
it will be observed that the philosophical admonitions in the letter to his brother, johnston, were written on the same sheet with the letter to his father.
the promptness and decision with which lincoln despatched the multitudinous affairs of his office during the most turbulent scenes of the civil war are exemplified in his unequivocal order to the attorney-general, indorsed on the back of the letter of hon.
austin a. king, requesting a pardon for john b. corner.
the indorsement bears even date with the letter itself, and corner was pardoned on the following day.


The book looks good now. The splitter still breaks a line incorrectly when a period follows an abbreviation longer than one character without ending the sentence (for example "hon." above), but it is a fine approximation.
Let's check the cleaning on a second book, then save the cleaned version of every book. Together they will form the corpus we will work with.
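
The mis-split after "hon." in the output above comes from treating every period after a longer word as the end of a sentence. One possible refinement, which the rest of this tutorial does not use, is to check each token against a list of known abbreviations. The sketch below is a minimal illustration: `ABBREVIATIONS` and `split_sentences` are hypothetical names, and the abbreviation list is a made-up sample that a real corpus would need to extend.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"hon.", "mr.", "mrs.", "dr.", "st."}

def split_sentences(line):
    """Split a lowercased line after each period, unless the token ending in
    the period is a known abbreviation or a single-letter initial."""
    sentences, current = [], []
    for token in line.split():
        current.append(token)
        # single-letter initials such as "a." or "b."
        is_initial = len(token) == 2 and token[0].isalpha() and token.endswith(".")
        if token.endswith(".") and token not in ABBREVIATIONS and not is_initial:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that has no final period
        sentences.append(" ".join(current))
    return sentences

line = ("indorsed on the back of the letter of hon. austin a. king, "
        "requesting a pardon for john b. corner. the indorsement bears even date.")
for sentence in split_sentences(line):
    print(sentence)
```

With this sample list, "hon." no longer ends a sentence, so the line is split only after "corner." and "date.".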

In [224]:
book = open("corpus/Andrew Lang___Andrew Lang's Introduction to The Compleat Angler.txt")
book = book_cleaning(book)
for line in book:
    print(line)

andrew lang's introduction to the compleat angler
to write on walton is, indeed, to hold a candle to the sun.
the editor has been content to give a summary of the chief or rather the only known, events in walton's long life, adding a notice of his character as displayed in his biographies and in _the compleat angler_, with comments on the ancient and modern practice of fishing, illustrated by passages from walton's foregoers and contemporaries.
like all editors of walton, he owes much to his predecessors, sir john hawkins, oldys, major, and, above all, to the learned sir harris nicolas.
his life
the few events in the long life of izaak walton have been carefully investigated by sir harris nicolas.
all that can be extricated from documents by the alchemy of research has been selected, and i am unaware of any important acquisitions since sir harris nicolas's second edition of 1860.
izaak was of an old family of staffordshire yeomen, probably descendants of george walton of yoxhall, who d

### Corpus cleaning

In [5]:
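def corpus_cleaning(direct, direct_new):
    """Clean every book in `direct`, save each cleaned book in `direct_new`,
    and return the cleaned lines of all the books as a single list."""
    files = os.listdir(direct)
    nb_books = len(files)
    os.makedirs(direct_new, exist_ok=True)  # do not fail if the folder already exists
    pbar = ProgressBar(widgets=[Counter(),'/',str(nb_books),' ',
                                Percentage(), ' ',
                                Bar(marker=AnimatedMarker()),' ',
                                ETA()])
    corpus = []  # cleaned lines of all the books, concatenated

    for file in pbar(files):
        # the with block closes each file once it has been cleaned
        with open(os.path.join(direct, file)) as raw_book:
            book = book_cleaning(raw_book)
        np.savetxt(os.path.join(direct_new, file), np.array(book), fmt='%s')
        corpus.extend(book)

    return corpus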

In [54]:
direct = 'corpus'
direct_new = 'corpus_new'
corpus = corpus_cleaning(direct,direct_new)

48/3038   1% |                                                  | ETA:  0:09:28

KeyboardInterrupt: 
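
Since each cleaned book is written to disk with one sentence per line, the corpus can also be rebuilt from the saved files without cleaning the books again. The following is a minimal sketch under that assumption; `load_corpus` is a hypothetical helper that is not part of the original notebook.

```python
import os

def load_corpus(direct_new):
    """Read every cleaned book in `direct_new` and return all the
    non-empty lines as one list (one sentence per entry)."""
    corpus = []
    # sorted() makes the order reproducible; os.listdir() order is arbitrary
    for file in sorted(os.listdir(direct_new)):
        with open(os.path.join(direct_new, file)) as book:
            corpus.extend(line.strip() for line in book if line.strip())
    return corpus

# Usage, assuming the 'corpus_new' folder written by corpus_cleaning exists:
# corpus = load_corpus('corpus_new')
```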