# Data preprocessing for NLP


Before training a machine learning model, one important step is to preprocess the data. This step matters regardless of the task, because noisy data or a wrong input can produce a completely different result. Although I'm focusing on preprocessing data for NLP tasks, I believe this tutorial will also be
helpful for other tasks.

In this tutorial, I will talk about how to turn raw text into a NumPy matrix, or even a
tensor for libraries like PyTorch or TensorFlow. The steps I will cover include:

1. Simple methods for cleaning raw text data.

2. Tokenizing the data into Python structures that help with training the model.

3. Batchifying the data into inputs for models such as recurrent neural networks.

To begin with, I'll assume the NLP task is some form of sentence classification.
One example is sentiment analysis, where we take a sentence and determine whether it is a positive,
neutral, or negative comment. In this tutorial, however, we will prepare for another task:
predicting the object of a triple (subject,verb,object) from its subject and verb. For example, given the triple
(dog,likes,bone), the model should be more likely to predict the word "bone" than "iphone" when it sees
"dog" and "likes".

In [86]:
#Below are the libraries that will be used throughout the tutorial.
#It is fine not to have torch, since it is only used in the last sections
#to demonstrate transforming the data into torch tensors.
import pickle
import string
import torch
from torch.autograd import Variable
import numpy as np

Before we start, here is a link to a brief introduction to the dataset we'll be using.

http://rtw.ml.cmu.edu/resources/svo/

The SVO triples we'll be using are extracted from concepts learned by a computer.
For the purpose of this tutorial, I've truncated the data to only 10000 rows and
kept only the subject, verb, object, and frequency. Even after this preprocessing, the data
is still full of noise.

In [97]:
#You can take a look at what's inside the text file: each row holds a
#subject, verb, object and frequency separated by whitespace.
with open('./tutorial_data.txt','r') as data:
    for line in data:
        print(line)
        

Foyle  cost  lot  1 

Foyle  is  wrecked  1 

MP  supports  called  1 

Fr  ate  state  1 

Fr  graces  occasional  1 

Antonio  is  pastor  1 

Cullen  joined  Society  2 

Edward  offer  Mass  1 

Luiz  was  celebrant  2 

Mathew  introduced  bishops  1 

Peter  hold  book  1 

Tony  allow  family  1 

ee  am  heater  1 

Dimension  give  students  1 

Geometry  contain  wealth  1 

arts  is  matter  1 

patterns  filled  screen  2 

Fractals  are  things  1 

navigate  examinee  positione  1 

Decimal  render  men  1 

crystallization  is  removal  8 

Decimals  are  typed  9 

's  is  ability  1 

Debris  take  PO  1 

Version  Listen  previews  1 

weapons  have  chance  1 

Fragility  heighten  beauty  1 

Fragment  is  translationally  2 

programs  implementation  computationally  1 

Fragmentation  is  concerning  2 

Problem  is  phenomenon  1 

thinking  is  exceptionalism  1 

Fragonard  illustrate  works  1 


Fragrance  s  growth  1 

Crafts  find  somethingism  1 

room 

When cleaning data, it is highly recommended to look at the data itself to see if
everything makes sense. Since this dataset was generated not by humans but from concepts learned by a computer,
you might find sentences that are semantically odd or even syntactically wrong. For example:

4  show  verificationism  1  
2.0  missed  attention  1  
two  may  fruit  2

There are a couple of ways to filter this noise out. We will try three of them
in this tutorial.

1. We will filter sentences by their frequency.
2. We will filter sentences against other databases.
3. We will filter out sentences with noisy nouns.

The first method is to filter by frequency. The code is simple: we
threshold the last element of each row.

In [88]:
with open('./tutorial_data.txt','r') as data:
    for line in data:
        lines=line.split()
        ##We know the last item is the frequency. Here you can control
        ##how frequent a triple must be. We would expect a higher frequency to mean
        ##more semantically correct sentences, but a higher threshold
        ##also leaves less data, so there is a tradeoff here.
        if int(lines[-1]) ==1: 
            continue
        print(lines)


['Cullen', 'joined', 'Society', '2']
['Luiz', 'was', 'celebrant', '2']
['patterns', 'filled', 'screen', '2']
['crystallization', 'is', 'removal', '8']
['Decimals', 'are', 'typed', '9']
['Fragment', 'is', 'translationally', '2']
['Fragmentation', 'is', 'concerning', '2']
['Fragrances', 'have', 'mysteryes', '2']
['vanilla', 'toast', 'nuts', '4']
['Frame', 'struck', 'relationship', '2']
['timeline', 'adds', 'motion', '34']
['Framed', 'are', 'hypothesised', '3']
['Frames', 'are', 'metal', '2']
['Frames', 'are', 'part', '2']
['Frames', 'make', 'website', '2']
['Direct', 'release', 'footage', '2']
['Framework', 'is', 'ADO.NET', '2']
['3.5', 'contain', 'number', '18']
['3.5', 'take', 'advantage', '3']
['Microsoft', 'got', 'copy', '2']
['Scott', 'kicking', 'things', '2']
['Casey', 'got', 'life', '2']
['Francais', 'are', 'times', '2']
['France', 'awards', 'tribes', '2']
['France', 'becamee', 'part', '2']
['France', 'bring', 'talk', '2']
['France', 'condemnation', 'war', '2']
['France', 'develop
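
The tradeoff mentioned in the comments above is easier to explore when the threshold is a parameter. The following is only a minimal sketch: the function name `filter_by_frequency`, the `min_count` parameter, and the three in-memory sample rows are assumptions made for illustration, not part of the original notebook.

```python
def filter_by_frequency(lines, min_count=2):
    """Keep rows whose last field (the frequency) is at least min_count."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            # Skip blank or malformed rows instead of crashing on them.
            continue
        if int(fields[-1]) >= min_count:
            kept.append(fields)
    return kept


# Hypothetical sample in the same format as tutorial_data.txt.
sample = [
    "Foyle  cost  lot  1",
    "Cullen  joined  Society  2",
    "crystallization  is  removal  8",
]

# A higher threshold keeps fewer rows.
for threshold in (1, 2, 4):
    print(threshold, len(filter_by_frequency(sample, threshold)))
```

With `min_count=2` this sketch behaves like the cell above, which skips every row whose frequency is 1.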

Another trick is to cross-filter against another database. For example,
you can use the Freebase dataset as a filter. Here we use a small database provided by Professor Tom Mitchell,
which contains the most frequent verbs he found during data collection. If we use it as a filter,
we get the following result.

In [89]:
database=set()
with open('verbs484.txt','r') as txt:
    for line in txt:
        lines=line.split()
        database.add(lines[0])
        if len(lines)>1:
            #In the verbs484.txt, some rows only have one column.
            database.add(lines[1])  

with open('tutorial_data.txt','r') as txt:
    for line in txt:
        lines=line.split()
        if lines[1] not in database:
            #Skip the triple if its verb does not appear in the database.
            continue
        print(lines)


['Foyle', 'cost', 'lot', '1']
['Fr', 'ate', 'state', '1']
['Peter', 'hold', 'book', '1']
['Tony', 'allow', 'family', '1']
['Geometry', 'contain', 'wealth', '1']
['patterns', 'filled', 'screen', '2']
['room', 'allow', 'marye', '1']
['Fragrant', 'greeted', 'nose', '1']
['info', 'contain', 'collectiones', '1']
['Framers', 'built', 'Constitution', '1']
['Frames', 'hold', 'objects', '1']
['Framework', 'trace', 'destruction', '1']
['3.5', 'contain', 'number', '18']
['Source', 'forced', 'brilliance', '1']
['Tips', 'look', 'days', '1']
['Fran', 'manage', 'place', '1']
['France', 'broke', 'terms', '1']
['France', 'claim', 'victory', '1']
['France', 'commanded', 'divisional', '1']
['France', 'employed', 'mixture', '1']
['France', 'spend', 'onee', '2']
['France', 'won', 'battle', '1']
['France', 'won', 'prize', '1']
["'s", 'reflected', 'perfectionism', '1']
['Frances', 'answered', 'tailored', '1']
['Frances', 'felt', 'eyesed', '1']
['Frances', 'learn', 'code', '2']
['Frances', 'loved', 'cats', '1
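
The same cross-filtering idea can be tried without any files. This is a sketch under stated assumptions: the helper names `build_verb_set` and `filter_by_verb` are hypothetical, and the verb rows below are made up to mimic a file in which each row holds one or two verb forms. They are not the actual contents of verbs484.txt.

```python
def build_verb_set(rows):
    """Collect the first one or two whitespace-separated tokens of each row."""
    verbs = set()
    for row in rows:
        verbs.update(row.split()[:2])
    return verbs


def filter_by_verb(triples, verbs):
    """Keep triples whose second field (the verb) is in the verb set."""
    kept = []
    for triple in triples:
        fields = triple.split()
        # Guard against short rows that have no verb column at all.
        if len(fields) > 1 and fields[1] in verbs:
            kept.append(fields)
    return kept


# Made-up verb rows; some rows have only one column.
verb_rows = ["cost costs", "hold held", "allow"]
verbs = build_verb_set(verb_rows)

triples = [
    "Foyle  cost  lot  1",
    "Foyle  is  wrecked  1",
    "Peter  hold  book  1",
]
print(filter_by_verb(triples, verbs))
```

Note that the membership test is an exact, case-sensitive match, so a verb written as "Listen" would not match an entry "listen" unless both sides are lowercased first.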

We can also filter out useless nouns, such as numbers like 10 or 2,000, or tokens that do not make
sense as nouns in the corpus, such as 10:00 or 3-5. There are many ways of doing this. Below is
one example.

In [90]:
with open('tutorial_data.txt','r') as txt:
    puncs=list(string.punctuation)
    for line in txt:
        lines=line.split()
        valid=True
        if len(lines[0])<=2 or len(lines[2])<=2:
            #Nouns with two or fewer characters are usually invalid.
            continue
        if lines[0].isdigit() or lines[2].isdigit():
            #Check whether it only consists of digits.
            continue
        for punc in puncs:
            #Check whether it contains punctuation. We also filter this out.
            if punc in lines[0] or punc in lines[2]:
                valid=False
                break
        if valid:
            print(lines)
        

['Foyle', 'cost', 'lot', '1']
['Foyle', 'is', 'wrecked', '1']
['Antonio', 'is', 'pastor', '1']
['Cullen', 'joined', 'Society', '2']
['Edward', 'offer', 'Mass', '1']
['Luiz', 'was', 'celebrant', '2']
['Mathew', 'introduced', 'bishops', '1']
['Peter', 'hold', 'book', '1']
['Tony', 'allow', 'family', '1']
['Dimension', 'give', 'students', '1']
['Geometry', 'contain', 'wealth', '1']
['arts', 'is', 'matter', '1']
['patterns', 'filled', 'screen', '2']
['Fractals', 'are', 'things', '1']
['navigate', 'examinee', 'positione', '1']
['Decimal', 'render', 'men', '1']
['crystallization', 'is', 'removal', '8']
['Decimals', 'are', 'typed', '9']
['Version', 'Listen', 'previews', '1']
['weapons', 'have', 'chance', '1']
['Fragility', 'heighten', 'beauty', '1']
['Fragment', 'is', 'translationally', '2']
['programs', 'implementation', 'computationally', '1']
['Fragmentation', 'is', 'concerning', '2']
['Problem', 'is', 'phenomenon', '1']
['thinking', 'is', 'exceptionalism', '1']
['Fragonard', 'illustrate',
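
The three noun checks can also be collected into one small predicate, which makes it easy to test individual words. This is a sketch that applies the same rules as the cell above (length, all digits, punctuation); the names `is_clean_noun`, `keep_triple`, and `min_length` are assumptions introduced here.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_clean_noun(word, min_length=3):
    """Return True if the word passes the length, digit, and punctuation checks."""
    if len(word) < min_length:
        return False
    if word.isdigit():
        return False
    if any(ch in PUNCTUATION for ch in word):
        return False
    return True


def keep_triple(line):
    """Keep a row only if both its subject and its object look like clean nouns."""
    fields = line.split()
    return len(fields) >= 3 and is_clean_noun(fields[0]) and is_clean_noun(fields[2])


for word in ["Foyle", "Fr", "'s", "2000", "10:00", "ADO.NET"]:
    print(word, is_clean_noun(word))
```

The last word shows the cost of a simple rule: a legitimate name such as "ADO.NET" is rejected because it contains a period.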

And of course, we can combine all these filters into one. The functions below are responsible for everything we did above.

In [91]:
database_path='./verbs484.txt'
tutorial_data_path='./tutorial_data.txt'

def load_484verb(path):
    database=set()
    with open(path,'r') as txt:
        for line in txt:
            lines=line.split()
            database.add(lines[0])
            if len(lines)>1:
                database.add(lines[1])  
    return database

def generate_text(path,database):
    #A combination of the methods above. The output is also written to a file so that
    #we can reuse it later on.
    with open(path,'r') as txt,open('filtered_data.txt','w') as output:
        puncs=list(string.punctuation)
        for line in txt:
            lines=line.split()
            valid=True
            if len(lines[0])<=2 or len(lines[2])<=2 or int(lines[-1])<4:
                #Drop short nouns and triples seen fewer than 4 times.
                continue
            if lines[1] not in database:
                #Drop triples whose verb is not in the verb database.
                continue
            if lines[0].isdigit() or lines[2].isdigit():
                continue
            for punc in puncs:
                if punc in lines[0] or punc in lines[2]:
                    valid=False
                    break
            if valid:
                #Write only subject, verb and object; the frequency is dropped.
                output.write((' '.join(lines[:-1])))
                output.write('\n')
            
    
            

verb_database=load_484verb(database_path)
generate_text(tutorial_data_path,verb_database)

If you check the data now, you will find that although it is still not perfectly noise-free, it makes much more sense than before. Now that we've cleaned our data, we can start to tokenize it and prepare the input. Here is a step-by-step walkthrough, with an example, of what the code below does.

First, I would like to create two dictionaries. The first one maps every unique word to a unique ID; the second does the opposite and maps every ID back to its word. The word_to_index dictionary is used to transform words into valid model inputs, and the index_to_word dictionary can be used to turn predicted IDs back into words when you
inspect your results later.

Let's have an example first. Say we originally have two SVOs:

("Dogs like bones","Cats like milk"). 

I'll transform these two SVOs into an index dictionary that holds:

{0:Dogs,1:like,2:bones,3:Cats,4:milk} and a word dictionary {Dogs:0,like:1,bones:2,Cats:3,milk:4}.

As you can see, the word "like" appears twice, so it is only counted once. Then I use the word dictionary to transform the two SVOs, so "Dogs like bones" becomes [0,1,2] and "Cats like milk" becomes [3,1,4]. I eventually output a list of lists [[0,1,2],[3,1,4]] along with the two dictionaries. To see why we convert to this form, I recommend this tutorial about word embeddings: http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
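
The worked example can be reproduced in a few lines. This is a minimal sketch rather than the class used later: the function name `build_vocab` is an assumption, and it builds both dictionaries and the encoded list in a single pass over in-memory strings instead of reading from a file.

```python
def build_vocab(sentences):
    """Assign each new word the next free ID and encode every sentence."""
    word_to_index = {}
    index_to_word = {}
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.split():
            if word not in word_to_index:
                # The next free ID is simply the current vocabulary size.
                new_id = len(word_to_index)
                word_to_index[word] = new_id
                index_to_word[new_id] = word
            ids.append(word_to_index[word])
        encoded.append(ids)
    return word_to_index, index_to_word, encoded


word_to_index, index_to_word, encoded = build_vocab(
    ["Dogs like bones", "Cats like milk"]
)
print(encoded)                                  # [[0, 1, 2], [3, 1, 4]]
print([index_to_word[i] for i in encoded[1]])   # ['Cats', 'like', 'milk']
```

The second print shows the reverse direction: the index dictionary turns a list of IDs back into readable words.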

If you don't know what pickle is, here is a link: https://docs.python.org/3/library/pickle.html.
Basically, it's a library that lets you serialize and deserialize objects, so that you can dump them in one place and
load them in another. Below is an example class that handles all of this.


In [93]:
data_path=['./filtered_data.txt']
class Dictionary(object):
    def __init__(self):
        self.dict_index={}
        self.dict_word={}
        self.dict_index2vec={}
        self.count=0
        
        
        self.generate_tokens()

        
        self.active_lists=self.generate_text()
        self.output_pickle()


    def generate_tokens(self):
        for path in data_path:
            #The for loop here is just for easy modification. If I want to
            #add more data source, I'm able to add it in my data_path list.
            with open(path,'r') as data:
                for line in data:
                    words=line.split()
                    for word in words:
                        if word not in self.dict_word:
                            self.dict_index[self.count]=word
                            self.dict_word[word]=self.count
                            self.count+=1
    
    
    def generate_text(self):
        lists=[]
        for path in data_path:
            with open(path,'r') as active:
                for line in active:
                    tmp=[]
                    for word in line.split():
                        tmp.append(self.dict_word[word])
                    if len(tmp)>0:
                        lists.append(tmp)
        return lists
                
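    def output_pickle(self):
        import os  #Only needed here, to create the output directory.
        data_dir = './preprocessed_data'
        #Create the output directory if it does not exist yet.
        os.makedirs(data_dir,exist_ok=True)
        #Use with-blocks so that every file is closed after writing.
        with open(data_dir+'/index_to_word.p','wb') as f:
            pickle.dump(self.dict_index,f)
        with open(data_dir+'/word_to_index.p','wb') as f:
            pickle.dump(self.dict_word,f)
        with open(data_dir+'/active_lists.p','wb') as f:
            pickle.dump(self.active_lists,f)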

a=Dictionary()
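
To check that structures saved this way load back unchanged, a pickle round trip can be sketched as follows. The dictionary contents and the temporary directory are assumptions chosen so that the example does not touch the tutorial's own output files.

```python
import os
import pickle
import tempfile

# A tiny stand-in for the word dictionary built by the class above.
word_to_index = {"Dogs": 0, "like": 1, "bones": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "word_to_index.p")
    # Pickle files are binary, so they are opened in 'wb' and 'rb' mode.
    with open(path, "wb") as handle:
        pickle.dump(word_to_index, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == word_to_index)  # True
```

As the pickle documentation warns, only unpickle files that come from a source you trust.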

After saving the structures, we can finally load them and prepare for training.
Here is a sample batching function that transforms the list of lists into a PyTorch Variable.
For people who are not familiar with PyTorch tensors and Variables, here is a quick
introduction: http://pytorch.org/tutorials/beginner/former_torchies_tutorial.html

In [94]:
def batchify(path):
    #Use a with-block so that the pickle file is closed after loading.
    with open(path,'rb') as f:
        active_lists=pickle.load(f)
    np_arr=np.array(active_lists)
    #Change the list of lists into a torch Variable.
    return Variable(torch.LongTensor(np_arr))

train_data=batchify('./preprocessed_data/active_lists.p')
#print(train_data)
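
The conversion in batchify relies on every row having the same length, since ragged rows do not form a regular 2-D integer array. A quick check before converting can be sketched like this; the helper name `check_rectangular` and the sample rows are assumptions made for illustration.

```python
def check_rectangular(rows, width=3):
    """Return the indices of rows whose length differs from the expected width."""
    return [i for i, row in enumerate(rows) if len(row) != width]


# The last row is missing its object, for example because of a malformed line.
rows = [[0, 1, 2], [3, 1, 4], [5, 6]]
bad_rows = check_rectangular(rows)
print(bad_rows)  # [2]

# Keep only well-formed rows before building the array.
clean_rows = [row for i, row in enumerate(rows) if i not in bad_rows]
print(clean_rows)  # [[0, 1, 2], [3, 1, 4]]
```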

Below is a function that is called during training to slice the training data into
batches. Notice that after slicing, it returns a batch of shape [batch_size x steps],
whereas some models expect their input to be [steps x batch_size]. In that case, you can either
pass batch_first=True when constructing the model (PyTorch's recurrent layers accept it) or transpose the batch. Also, since our task is to
predict the object, we want to separate it out: the input is the subject and verb, and the target is the object.

In [95]:
def get_batch(batch_size,train_data,cur_pos):
    max_size=min(len(train_data),cur_pos+batch_size)
    #There might be less than batch_size's data in the end,
    #so we take min of the length of data and current position plus batch_size.
    input_data=train_data[cur_pos:max_size,:2]
    target_data=train_data[cur_pos:max_size,2]
    
    return input_data,target_data


batch_size=5
my_input,my_target=get_batch(batch_size,train_data,0)
print(my_input,my_target)

Variable containing:
  0   1
  3   4
  6   7
  9  10
 12  13
[torch.LongTensor of size 5x2]
 Variable containing:
  2
  5
  8
 11
 14
[torch.LongTensor of size 5]
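
The slicing and the [batch_size x steps] versus [steps x batch_size] distinction can also be illustrated with plain Python lists, without torch. This is only a sketch: `iter_batches`, `transpose`, and the seven sample rows are assumptions, and the list-based slicing stands in for the tensor slicing used in get_batch.

```python
def iter_batches(rows, batch_size):
    """Yield (inputs, targets) pairs; the last batch may be smaller."""
    for start in range(0, len(rows), batch_size):
        # List slicing past the end is safe, so no explicit min() is needed here.
        batch = rows[start:start + batch_size]
        inputs = [row[:2] for row in batch]   # subject and verb
        targets = [row[2] for row in batch]   # object
        yield inputs, targets


def transpose(batch):
    """Turn a [batch_size x steps] list into a [steps x batch_size] list."""
    return [list(column) for column in zip(*batch)]


rows = [[3 * i, 3 * i + 1, 3 * i + 2] for i in range(7)]
batches = list(iter_batches(rows, batch_size=5))

first_inputs, first_targets = batches[0]
print(first_inputs)             # [[0, 1], [3, 4], [6, 7], [9, 10], [12, 13]]
print(first_targets)            # [2, 5, 8, 11, 14]
print(transpose(first_inputs))  # [[0, 3, 6, 9, 12], [1, 4, 7, 10, 13]]
print(len(batches[-1][0]))      # 2
```

The first batch matches the tensor output above, and the final batch holds only the two leftover rows.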



For your reference, here is a sample train function. The commented-out section sketches what a
real train function might do.

In [96]:
def train(batch_size,train_data):
    for i in range(0,len(train_data),batch_size):
        my_input,my_target=get_batch(batch_size,train_data,i)
        
        #print(my_input,my_target)
        
        ##output=rnn_model(my_input)
        ##loss=loss_layer(output,my_target)
        ##rnn_model.zero_grad()
        ##loss.backward()
        ##update_parameter()

        
batch_size=5
epoch=1
train_data=batchify('./preprocessed_data/active_lists.p')

print('Training Start')
for e in range(epoch):
    train(batch_size,train_data)

print('Training Complete')

Training Start
Training Complete
