# Word2vec Implementation

**Author**: Ramy Ghorayeb

In this tutorial, we will build our own word2vec implementation and brush our NLP skills.

In [21]:
import io
import os
import numpy as np
import scipy
import keras

## Corpus Cleaning

Let's take a look at one of the books:

In [483]:
f = open("txt/AAA.txt")
for line in f:
    print(line)



LINCOLN LETTERS



By Abraham Lincoln





Published by The Bibilophile Society









NOTE



The letters herein by Lincoln are so thoroughly characteristic of

the man, and are in themselves so completely self-explanatory, that

it requires no comment to enable the reader fully to understand and

appreciate them. It will be observed that the philosophical

admonitions in the letter to his brother, Johnston, were written on

the same sheet with the letter to his father.



The promptness and decision with which Lincoln despatched the

multitudinous affairs of his office during the most turbulent

scenes of the Civil War are exemplified in his unequivocal order to

the Attorney-General, indorsed on the back of the letter of Hon.

Austin A. King, requesting a pardon for John B. Corner. The

indorsement bears even date with the letter itself, and Corner was

pardoned on the following day.











We see there is a lot of cleaning to do:
* (1) uppercases titles to remove > we will use the lower() method
* (2) line breaks to remove due to the margins > we will build a remove_breaks function
* (3) break the line when a sentence ends > we will build a end_sentence_break function

In [480]:
def remove_breaks(corpus):
    corpus = list(corpus)
    corpus_new = []
    word = '' 
    line_prev = ''
    line_to_add = ''

    for i, line in enumerate(corpus):
        
        word = ''
        line_prev = corpus[i-1]
        
        # dealing with the last line
        if i == len(corpus)-1 and line != '\n':
            corpus_new.append(line.lower())
        
        # word of the next line
        for char in line:
            if char == ' ' or char == '\n':
                break
            word = word + char
            
        # skip empty line  
        if line_prev == '\n':
            next
            
        # save this line
        if len(line_prev) >= 67 or len(line_prev) + len(word) >= 67:
            line_to_add += line_prev[:-1] + ' '
            
        # add all the lines before the break
        else:
            line_to_add += line_prev[:-1]
            if line_to_add != '':
                corpus_new.append(line_to_add.lower())
                line_to_add = '' 
                
    return corpus_new

def end_sentence_break(corpus):
    
    corpus = list(corpus)
    corpus_new = corpus
    count=1
    
    for i,line in enumerate(corpus):
        
        line_to_add = ''
        
        for j, char in enumerate(line):
            
            line_to_add += char
            
            # if there is a dot and it is not for a surname
            if char == '.' and line[j-2] != ' ' and j != len(line)-1:
                corpus_new.insert(i+count,line_to_add)
                count += 1
                line_to_add = ''
            else:
                next

    return corpus_new    

def cleaning(corpus):
    corpus = remove_breaks(corpus)
    corpus = end_sentence_break(corpus)
    return corpus

In [481]:
f = open("txt/AAA.txt")
f = cleaning(f)
print(f)

['lincoln letters', 'by abraham lincoln', 'published by the bibilophile society', 'note', 'the letters herein by lincoln are so thoroughly characteristic of the man, and are in themselves so completely self-explanatory, that it requires no comment to enable the reader fully to understand and appreciate them. it will be observed that the philosophical admonitions in the letter to his brother, johnston, were written on the same sheet with the letter to his father.', 'the letters herein by lincoln are so thoroughly characteristic of the man, and are in themselves so completely self-explanatory, that it requires no comment to enable the reader fully to understand and appreciate them.', 'the promptness and decision with which lincoln despatched the multitudinous affairs of his office during the most turbulent scenes of the civil war are exemplified in his unequivocal order to the attorney-general, indorsed on the back of the letter of hon. austin a. king, requesting a pardon for john b. cor

The book looks good now. Let's save an array concatenating of all the cleaned version of the books. I twill be the corpus we will work with.