# Preprocessing Steps - Word Embedding
## Read Data

In [2]:
%matplotlib inline
from keras.preprocessing.text import text_to_word_sequence, one_hot, Tokenizer 
from keras.layers import Embedding
import pandas as pd
import numpy as np
import os
import pickle
import re

Using TensorFlow backend.


In [6]:
dataPath = "./wendygao16/Quora_NLP/"
train = pd.read_csv(dataPath+'train.csv', usecols=['question1', 'question2', 'id'])
test  = pd.read_csv(dataPath+'test.csv',  usecols=['question1', 'question2', 'test_id'])
test.dropna(inplace=True)
train.dropna(inplace=True) # remove two rows as in NLP feature creation

# For testing only
#train = train[:1000] 
#test = test[:1000]

question = train.question1[0]
print(question)
print(text_to_word_sequence(question), '\n', type(text_to_word_sequence(question)))

What is the step by step guide to invest in share market in india?
['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india'] 
 <class 'list'>


In [7]:
train

| | id | question1 | question2 |
|---|---|---|---|
| 0 | 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... |
| 1 | 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... |
| 2 | 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... |
| 3 | 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... |
| 4 | 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? |
| 5 | 5 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... |
| 6 | 6 | Should I buy tiago? | What keeps childern active and far from phone ... |
| 7 | 7 | How can I be a good geologist? | What should I do to be a great geologist? |
| 8 | 8 | When do you use シ instead of し? | When do you use "&" instead of "and"? |
| 9 | 9 | Motorola (company): Can I hack my Charter Moto... | How do I hack Motorola DCX3400 for free internet? |


## Step 1. Lemmatization

Questions are preprocessed so that different ways of writing the same text (like "don't" and "do not") are matched. Lemmatization, similar to the one done in the first part of the project, helps here as well.

#### Lemmatize with *WordNetLemmatizer*:

In [3]:
#import nltk
#nltk.download()

from nltk.stem.wordnet import WordNetLemmatizer
WNL = WordNetLemmatizer()
def cutter(word):
    if len(word) < 4:
        return word
    return WNL.lemmatize(WNL.lemmatize(word, "n"), "v")

#### Function *cutter()* lemmatizes a word to its standard form, first as a noun and then as a verb. Words shorter than 4 characters are returned unchanged.

In [4]:
cutter('visualizing')

'visualize'

#### Preprocessing transformations of questions

In [5]:
def preprocess(string):
    # standardize expressions with apostrophes, replace some special symbols with words;
    # specific contractions ("won't", "can't") must come before the generic "n't" rule
    string = string.lower().replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'") \
        .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not") \
        .replace("n't", " not").replace("what's", "what is").replace("it's", "it is") \
        .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are") \
        .replace("he's", "he is").replace("she's", "she is").replace("'s", " own") \
        .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ") \
        .replace("€", " euro ").replace("'ll", " will").replace("=", " equal ").replace("+", " plus ")
    # remove punctuation and special symbols; a raw string avoids invalid-escape warnings.
    # The backslash itself is not in the class, so tokens such as "\text" survive.
    string = re.sub(r'[“”()\'…!^".;:,\-?？{}\[\]/*@]', ' ', string)
    # abbreviate remaining round numbers: 5000000 -> 5m, 5000 -> 5k
    string = re.sub(r"([0-9]+)000000", r"\1m", string)
    string = re.sub(r"([0-9]+)000", r"\1k", string)
    # lemmatize each word
    string = ' '.join([cutter(w) for w in string.split()])
    return string

print(preprocess("she's"))

she is
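
The order of the replacements in *preprocess()* matters: a specific contraction such as "won't" has to be rewritten before the generic "n't" rule, and millions have to be abbreviated before thousands. A minimal sketch of these two ideas in isolation follows. The helper names `expand_contractions` and `abbreviate_numbers` are hypothetical and cover only a small subset of the rules above.

```python
import re

def expand_contractions(text):
    # Specific forms first: "won't" would otherwise become "wo not"
    # under the generic "n't" rule.
    text = text.lower().replace("won't", "will not").replace("can't", "can not")
    return text.replace("n't", " not")

def abbreviate_numbers(text):
    # Millions before thousands, so "5000000" becomes "5m" rather than "5000k".
    text = re.sub(r"([0-9]+)000000", r"\1m", text)
    return re.sub(r"([0-9]+)000", r"\1k", text)

print(expand_contractions("I won't go and don't care"))  # i will not go and do not care
print(abbreviate_numbers("earn 5000000 or 20000"))       # earn 5m or 20k
```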


#### Apply preprocessing to train sample. 

#### All transformations applied to train will be applied to test as well

In [6]:
print('Question 1: %s' % train["question1"][1])
print('Question 2: %s' % train["question2"][1])
train["question1"] = train["question1"].fillna("").apply(preprocess)
train["question2"] = train["question2"].fillna("").apply(preprocess)
print('Question 1 processed: %s' % train.question1[1])
print('Question 2 processed: %s' % train.question2[1])

# Question 1: What is the story of Kohinoor (Koh-i-Noor) Diamond?
# Question 2: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?
# Question 1 processed: what is the story of kohinoor koh i noor diamond
# Question 2 processed: what would happen if the indian government steal the kohinoor koh i noor diamond back
print(type(train["question1"].fillna("")))

Question 1: What is the story of Kohinoor (Koh-i-Noor) Diamond?
Question 2: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?
Question 1 processed: what is the story of kohinoor koh i noor diamond
Question 2 processed: what would happen if the indian government steal the kohinoor koh i noor diamond back
<class 'pandas.core.series.Series'>


## Step 2. Creating vocabulary of frequent words

Create a vocabulary of relatively frequent words: words that appear in at least *MIN_WORD_OCCURRENCE* distinct questions (this is what the *min_df* argument counts). 

For the small dataset *MIN_WORD_OCCURRENCE* is set to a small value, but for the whole dataset it should be much larger (perhaps in the range 50-150).

For the word counts, use the familiar *CountVectorizer*.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

MIN_WORD_OCCURRENCE = 3 # 3 for demo and testing in local environment. Select number for final results

all_questions = pd.Series(train["question1"].tolist() + train["question2"].tolist()).unique()
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+", # a token is any run of non-whitespace characters
                             min_df=MIN_WORD_OCCURRENCE)
vectorizer.fit(all_questions)
top_words = set(vectorizer.vocabulary_.keys())
print(len(top_words),'top_words')
print('Top words %s' % list(top_words)[:10]) # a set is unordered: this is an arbitrary sample, not the 10 most frequent words

28890 top_words
Top words ['psychosis', 'circle', 'functionalism', 'premam', 'cardiology', 'review', 'sangam', 'supernova', 'solicitation', 'inguinal']
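
To make the meaning of the threshold concrete, here is a minimal pure-Python sketch of the same selection rule: keep a word if it occurs in at least `min_df` distinct questions. The function name `frequent_words` and the toy questions are assumptions for illustration. The notebook itself relies on *CountVectorizer*.

```python
from collections import Counter

def frequent_words(questions, min_df):
    # Count in how many questions each token appears (document frequency).
    # set() ensures a word repeated inside one question is counted once.
    doc_freq = Counter()
    for q in questions:
        doc_freq.update(set(q.split()))
    return {w for w, n in doc_freq.items() if n >= min_df}

toy_questions = [
    "how do i learn python",
    "how do i learn python python python",
    "what is python",
    "how old is the earth",
]
vocab = frequent_words(toy_questions, min_df=3)
print(sorted(vocab))  # ['how', 'python']
```

Note that "python" is counted three times, not five: repetitions within a single question do not raise the document frequency.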


## Step 3. Remove rare words

Each run of consecutive rare words is replaced with the single word "suspense". The result is limited to the last 30 words of the question. 

In long questions the leading words are dropped, since the end of a question is usually more important. 

Add "suspense" to *top_words* so that it is treated as a regular vocabulary word.

In [8]:
REPLACE_WORD = "suspense"
top_words.add(REPLACE_WORD)
MAX_SEQUENCE_LENGTH = 30

In [9]:
def prepare(q):
    new_q = []
    new_suspense = True # ready to add REPLACE_WORD 
    # a[::-1] invert order of list a, so we start from the end
    for w in q.split()[::-1]:
        if w in top_words:
            new_q = [w] + new_q # add word from top_words
            new_suspense = True
        elif new_suspense:
            new_q = [REPLACE_WORD] + new_q
            new_suspense = False  # only 1 REPLACE_WORD for group of rare words
        if len(new_q) == MAX_SEQUENCE_LENGTH:
            break
    new_q = " ".join(new_q)
    return new_q

question = train.question1[9]
print('Question: %s' % question)
print('Prepared question: %s' % prepare(question))

Question: motorola company can i hack my charter motorolla dcx3400
Prepared question: motorola company can i hack my charter suspense
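
The two rules (one placeholder per run of rare words, and keeping only the trailing words) can be checked on a toy vocabulary. The sketch below is a self-contained variant of *prepare()* that takes the vocabulary and limits as parameters. The name `collapse_rare` and the toy data are assumptions for illustration.

```python
def collapse_rare(question, vocab, placeholder="suspense", max_len=30):
    kept = []
    can_insert = True  # False while we are inside a run of rare words
    # Walk from the end so that truncation drops the leading words.
    for w in reversed(question.split()):
        if w in vocab:
            kept.append(w)
            can_insert = True
        elif can_insert:
            kept.append(placeholder)
            can_insert = False
        if len(kept) == max_len:
            break
    return " ".join(reversed(kept))

toy_vocab = {"can", "i", "hack", "my", "router"}
print(collapse_rare("can i hack my zyx9 qq7 router", toy_vocab))
# can i hack my suspense router
print(collapse_rare("can i hack my zyx9 qq7 router", toy_vocab, max_len=3))
# my suspense router
```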


Apply the function to the train questions.

In [10]:
q1s_train = train.question1.apply(prepare)
q2s_train = train.question2.apply(prepare)
print(q1s_train[0])

what is the step by step guide to invest in share market in india


## Step 4. Create embedding index

Build the embedding index: a dictionary with words from *top_words* as keys and their vector representations as values.

Take the vector representations from the Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download) embedding file [glove.840B.300d](http://nlp.stanford.edu/data/glove.840B.300d.zip). Each line of the file contains a word followed by the space-separated components of its vector.

In [11]:
EMBEDDING_DIM = 300
EMBEDDING_FILE = "/user/wendygao16/fs_Quora/glove.840B.300d.txt"

def get_embedding():
    embeddings_index = {}
    with open(EMBEDDING_FILE, encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            if len(values) == EMBEDDING_DIM + 1 and word in top_words:
                coefs = np.asarray(values[1:], dtype="float32")
                embeddings_index[word] = coefs
    return embeddings_index
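
The parsing logic can be tried without the 2 GB download. The sketch below applies the same per-line rule to an in-memory stand-in for the file with 3-dimensional vectors. The function name `load_vectors` and the toy lines are assumptions, and plain Python lists are used instead of NumPy arrays.

```python
import io

def load_vectors(lines, dim, wanted):
    index = {}
    for line in lines:
        values = line.split()
        word = values[0]
        # Skip lines with the wrong number of components
        # and words outside the wanted vocabulary.
        if len(values) == dim + 1 and word in wanted:
            index[word] = [float(v) for v in values[1:]]
    return index

# Toy stand-in for the embedding file.
toy_file = io.StringIO(
    "python 0.1 0.2 0.3\n"
    "earth 0.4 0.5 0.6\n"
    "broken 0.7 0.8\n"
)
vectors = load_vectors(toy_file, dim=3, wanted={"python", "broken", "moon"})
print(vectors)  # {'python': [0.1, 0.2, 0.3]}
```

Here "earth" is skipped because it is not wanted, "broken" because its line has too few components, and "moon" because it never appears in the file.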

Build *embeddings_index* and reduce *top_words* to those having vector representation.

In [12]:
embeddings_index = get_embedding()
print("Words not found in the embedding:", top_words - embeddings_index.keys())
top_words = embeddings_index.keys()

Words not found in the embedding: {'rubymotion', 'gokano', 'premam', 's4a', 'kiip', 'baelish', 'm335', 'a&f', 'oscp', '7ghz', 'eklavya', 'ocpjp', 'marathahalli', 'capf', 'vjti', 'gdpi', 'rustom', 'μtorrent', '\\text', 'nutanix', 'muoet', '£5k', 'iimc', 'kaththi', 'isil', 'au111tx', 'calment', 'glyx', 'h3o', 'crisc', 'vantablack', 'urjit', 'shkreli', 'pakalu', 'housebuildup', 'vasistha', 'tarly', 'ios9', 'h&e', '160q', 'amulyam', 'iilm', 'nirbhaya', 'zoomcar', 'savitar', 'elitmus', 'rajdeep', 'r&s', 'codeacademy', 'callidus', 'klarna', 'muapt', 'pgpx', 'patani', 'x\\to\\infty', 'wynk', 'pogba', '6500u', 'mcit', 'pdpu', 'gss1', 'msme', 'ropar', '8462852', 'instacart', 'bitsians', '‘this', 'outbrain', 'hofstede', 'railwire', 'ocjp', 'm20x', 'tanmay', 'i140', 'udacity', 'swtor2credits', 'drumpf', 'h>', 'ɽφʉʛƕ', 'bpharm', '\\int', '\\right', 'politecnico', 'unacademy', 'meldonium', '2–3', 'bokaro', '\\end', 'surathkal', '\\cdots', '10°c', 'nptel', 'pravana', 'dylann', 'classpass', 'j&k', 'i

## Step 5. Transform questions into integer valued sequences of equal lengths

*Tokenizer.texts_to_sequences* converts each question to a list of integers, but such lists may have different lengths for different questions. Keras provides a method for fixing this:

*keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)* 

It transforms a list of *num_samples* sequences (lists of scalars) into a 2D Numpy array of shape 

*(num_samples, num_timesteps)*, 

where *num_timesteps* is either *maxlen* argument (if provided), or the length of the longest sequence.

Sequences that are shorter than *num_timesteps* are padded with *value*. Sequences longer than *num_timesteps* are truncated so that they have the desired length. 

The position where padding or truncation happens is determined by *padding* or *truncating*, respectively. Both default to *'pre'*, so by default zeros are added at the beginning and extra items are dropped from the beginning.

In [13]:
from keras.preprocessing.sequence import pad_sequences
sequences = [[1,2],[1,2,3,4,5]]
print('Original sequences: %s' % sequences)
print('Padded default: %s' % pad_sequences(sequences))
print('Padded with maxlen=4: %s' % pad_sequences(sequences,maxlen=4))
print('Padded with maxlen=4, padding=post: %s' % pad_sequences(sequences,maxlen=4,padding='post'))
print('Padded with maxlen=4, padding=post, truncating=post: %s' \
      %pad_sequences(sequences,maxlen=4,padding='post',truncating='post'))

Original sequences: [[1, 2], [1, 2, 3, 4, 5]]
Padded default: [[0 0 0 1 2]
 [1 2 3 4 5]]
Padded with maxlen=4: [[0 0 1 2]
 [2 3 4 5]]
Padded with maxlen=4, padding=post: [[1 2 0 0]
 [2 3 4 5]]
Padded with maxlen=4, padding=post, truncating=post: [[1 2 0 0]
 [1 2 3 4]]
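
The semantics of *padding* and *truncating* can be restated in a few lines of plain Python. This is a simplified sketch that returns nested lists and assumes a positive *maxlen*. The function name `pad` is hypothetical, and it is not the Keras implementation.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    out = []
    for s in sequences:
        # Truncate: 'pre' drops leading items, 'post' drops trailing ones.
        s = s[-maxlen:] if truncating == "pre" else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        # Pad: 'pre' puts the fill in front, 'post' puts it at the end.
        out.append(fill + s if padding == "pre" else s + fill)
    return out

sequences = [[1, 2], [1, 2, 3, 4, 5]]
print(pad(sequences))                    # [[0, 0, 0, 1, 2], [1, 2, 3, 4, 5]]
print(pad(sequences, maxlen=4))          # [[0, 0, 1, 2], [2, 3, 4, 5]]
print(pad(sequences, maxlen=4, padding="post", truncating="post"))
# [[1, 2, 0, 0], [1, 2, 3, 4]]
```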


Fit *Tokenizer* to the questions obtained after Step 3 and apply *texts_to_sequences* and *pad_sequences* to them.

In [14]:
tokenizer = Tokenizer(filters="")
tokenizer.fit_on_texts(np.append(q1s_train, q2s_train))
word_index = tokenizer.word_index

data_1 = pad_sequences(tokenizer.texts_to_sequences(q1s_train), maxlen=MAX_SEQUENCE_LENGTH)
data_2 = pad_sequences(tokenizer.texts_to_sequences(q2s_train), maxlen=MAX_SEQUENCE_LENGTH)
print('Final representation of first question 1:')
print(data_1[0])
print('Final representation of first question 2:')
print(data_2[0])

Final representation of first question 1:
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    2    3    1  646   62  646 1922    7  430    8  498  189    8   38]
Final representation of first question 2:
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    2    3    1  646   62  646 1922    7  430    8  498  189]


Each question now is represented by a vector of 30 numbers.
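
The integers are word ranks: the most frequent word gets index 1, the next one index 2, and so on, while 0 is left free for padding. The sketch below imitates this idea in plain Python. The names `build_word_index` and `to_sequence` are hypothetical, and the real *Tokenizer* has more options (filters, lowercasing, out-of-vocabulary tokens) that are not modelled here.

```python
from collections import Counter

def build_word_index(texts):
    counts = Counter()
    for t in texts:
        counts.update(t.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    # Indices start at 1 so that 0 can serve as the padding value.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    # Words missing from the index are dropped in this sketch.
    return [word_index[w] for w in text.split() if w in word_index]

texts = ["what is the step by step guide", "what is the guide"]
word_index = build_word_index(texts)
print(word_index)
# {'what': 1, 'is': 2, 'the': 3, 'step': 4, 'guide': 5, 'by': 6}
print(to_sequence("what is the step by step guide", word_index))  # [1, 2, 3, 4, 6, 4, 5]
print(to_sequence("what is the moon", word_index))                # [1, 2, 3]
```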

Repeat the same steps with the *test* set and create:

*q1s_test -> test_data_1*  
*q2s_test -> test_data_2*  

Do not refit the Tokenizer. Use the one fitted on *train*, so that the same word maps to the same index in both sets.

In [15]:
test["question1"] = test["question1"].fillna("").apply(preprocess)
test["question2"] = test["question2"].fillna("").apply(preprocess)
q1s_test = test.question1.apply(prepare)
q2s_test = test.question2.apply(prepare)
test_data_1 = pad_sequences(tokenizer.texts_to_sequences(q1s_test), 
                       maxlen=MAX_SEQUENCE_LENGTH)
test_data_2 = pad_sequences(tokenizer.texts_to_sequences(q2s_test), 
                       maxlen=MAX_SEQUENCE_LENGTH)

## Step 6. Create embedding matrix

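Now build the embedding matrix of weights from the embedding index. 

The *i-th* row of this matrix is the vector representation of the word with index *i* in *word_index*. Row 0 stays zero because index 0 is the padding value, and words without a pretrained vector also keep an all-zero row. 

The embedding matrix will be used as the weight matrix of the embedding layer.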

In [16]:
nb_words = len(word_index) + 1
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))  # matrix of zeros

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Create embedding layer from embedding matrix as follows.

In [17]:
embedding_layer = Embedding(nb_words, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Setting *trainable=False* freezes the layer: its weights are not updated during training. 

This layer just transforms sequences of integers (word indexes) into sequences of their vector representations.  
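
With frozen weights the layer is effectively a table lookup. The sketch below builds a toy matrix the same way as above and performs the lookup by hand, using plain lists instead of NumPy and Keras. The names `build_matrix` and `embed` and the 2-dimensional toy vectors are assumptions for illustration.

```python
def build_matrix(word_index, vectors, dim):
    # One row per index, plus row 0 for the padding value; all rows start at zero.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

def embed(sequence, matrix):
    # The lookup a frozen embedding layer performs: index -> row.
    return [matrix[i] for i in sequence]

toy_index = {"what": 1, "is": 2, "python": 3}
toy_vectors = {"what": [0.1, 0.2], "python": [0.5, 0.6]}  # "is" has no vector
matrix = build_matrix(toy_index, toy_vectors, dim=2)
print(embed([0, 1, 2, 3], matrix))
# [[0.0, 0.0], [0.1, 0.2], [0.0, 0.0], [0.5, 0.6]]
```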

## Step 7. Save the data

We prepared the following variables for the neural network:


- *data_1*, *data_2*: padded numeric sequences for questions 1 and 2 in the train sample 
- *test_data_1*, *test_data_2*: padded numeric sequences for questions 1 and 2 in the test sample
- *nb_words*: length of the dictionary *word_index* plus one (index 0 is reserved for padding) 
- *embedding_matrix*: matrix for the transformation in the embedding layer

Save these variables to *.pkl* files.

In [18]:
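os.makedirs('./savedData', exist_ok=True)  # create the output folder if it does not exist
with open('./savedData/data_1.pkl', 'wb') as f: pickle.dump(data_1, f, -1)  # -1 selects the highest pickle protocol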
with open('./savedData/data_2.pkl', 'wb') as f: pickle.dump(data_2, f, -1)
with open('./savedData/nb_words.pkl', 'wb') as f: pickle.dump(nb_words, f, -1)
with open('./savedData/embedding_matrix.pkl', 'wb') as f: pickle.dump(embedding_matrix, f, -1)
with open('./savedData/test_data_1.pkl', 'wb') as f: pickle.dump(test_data_1, f, -1)
with open('./savedData/test_data_2.pkl', 'wb') as f: pickle.dump(test_data_2, f, -1)    
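
The saved files can later be read back with *pickle.load*. Below is a minimal round-trip sketch that writes to a temporary folder rather than *./savedData*. The helper names `save_all` and `load_one` and the toy values are assumptions for illustration.

```python
import os
import pickle
import tempfile

def save_all(variables, folder):
    # Write each variable to <folder>/<name>.pkl, creating the folder if needed.
    os.makedirs(folder, exist_ok=True)
    for name, value in variables.items():
        with open(os.path.join(folder, name + ".pkl"), "wb") as f:
            pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)

def load_one(name, folder):
    with open(os.path.join(folder, name + ".pkl"), "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    save_all({"nb_words": 5, "data_1": [[0, 1, 2]]}, tmp)
    restored = load_one("data_1", tmp)
    restored_nb_words = load_one("nb_words", tmp)

print(restored, restored_nb_words)  # [[0, 1, 2]] 5
```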