## Text sequences to TFRecords
----

Hello everyone! In this tutorial, I am going to show you how to parse your raw text data into TFRecords. I know that many people struggle with input processing pipelines, especially when starting to work on a personal project, so I really hope this will be useful to you :)!

### Tutorial flowchart
----
![img](tutorials_graphics/text2tfrecords.png)


### Dummy IMDB text data
----
For practice, I have chosen a few data samples from the Large Movie Review Dataset offered by Stanford.

### Import useful libraries
----

In [2]:
from nltk.tokenize import word_tokenize
import tensorflow as tf
import pandas as pd
import pickle
import random
import glob
import nltk
import re

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

### Parse data to TFRecords
---

In [4]:
def imdb2tfrecords(path_data='datasets/dummy_text/', min_word_frequency=5,
                   max_words_review=700):
    '''
    This script processes the data and saves it in the default TensorFlow 
    file format: tfrecords.
    
    Args:
        path_data: the path where the imdb data is stored.
        min_word_frequency: the minimum frequency of a word, to keep it
                            in the vocabulary.
        max_words_review: the maximum number of words allowed in a review.
    '''
    # Get the filenames of the positive/negative reviews 
    pos_files = glob.glob(path_data + 'pos/*')
    neg_files = glob.glob(path_data + 'neg/*')

    # Concatenate both positive and negative reviews filenames
    filenames = pos_files + neg_files
    
    # List with all the reviews in the dataset; the context manager
    # closes each file as soon as it has been read
    reviews = []
    for filename in filenames:
        with open(filename, 'r') as f:
            reviews.append(f.read())
    
    # Remove HTML tags
    reviews = [re.sub(r'<[^>]+>', ' ', review) for review in reviews]
        
    # Tokenize each review individually
    reviews = [word_tokenize(review) for review in reviews]
    
    # Compute the length of each review
    len_reviews = [len(review) for review in reviews]

    # Flatten nested list
    reviews = [word for review in reviews for word in review]
    
    # Compute the frequency of each word (sorted from most to least frequent)
    word_frequency = pd.Series(reviews).value_counts()
    
    # Keep only words with frequency higher than minimum
    vocabulary = word_frequency[word_frequency>=min_word_frequency].index.tolist()
    
    # Add the Unknown and End tokens
    extra_tokens = ['Unknown_token', 'End_token']
    vocabulary += extra_tokens
    
    # Create a word2idx dictionary
    word2idx = {word: idx for idx, word in enumerate(vocabulary)}
    
    # Write word vocabulary to disk
    with open(path_data + 'word2idx.pkl', 'wb') as f:
        pickle.dump(word2idx, f)
        
    def text2tfrecords(filenames, writer, vocabulary, word2idx,
                       max_words_review):
        '''
        Parse each review individually and write it to disk
        as a tfrecord.
        
        Args:
            filenames: the paths of the review files.
            writer: the writer object for tfrecords.
            vocabulary: list with all the words included in the vocabulary.
            word2idx: dictionary of words and their corresponding indexes.
            max_words_review: the maximum number of words kept per review.
        '''
        # Shuffle filenames
        random.shuffle(filenames)
        for filename in filenames:
            with open(filename, 'r') as f:
                review = f.read()
            review = re.sub(r'<[^>]+>', ' ', review)
            review = word_tokenize(review)
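            # Keep only the last max_words_review tokens of the review
            review = review[-max_words_review:]
            # Replace words with their index from word2idx; words outside the
            # vocabulary map to the Unknown token. Looking words up in the
            # dict instead of the vocabulary list keeps each check O(1).
            review = [word2idx.get(word, word2idx['Unknown_token'])
                      for word in review]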
            indexed_review = review + [word2idx['End_token']]
            sequence_length = len(indexed_review)
            # Split on both '/' and '\' so the label is also found on Windows
            target = 1 if re.split(r'[\\/]', filename)[-2] == 'pos' else 0
            # Create a Sequence Example to store our data in
            ex = tf.train.SequenceExample()
            # Add non-sequential features to our example
            ex.context.feature['sequence_length'].int64_list.value.append(sequence_length)
            ex.context.feature['target'].int64_list.value.append(target)
            # Add sequential feature
            token_indexes = ex.feature_lists.feature_list['token_indexes']
            for token_index in indexed_review:
                token_indexes.feature.add().int64_list.value.append(token_index)
            writer.write(ex.SerializeToString())
    
    ##########################################################################
    # Write data to tfrecords. This might take a while.
    ##########################################################################
    # The context manager closes the writer, so all records are flushed to disk
    with tf.python_io.TFRecordWriter(path_data + 'dummy.tfrecords') as writer:
        text2tfrecords(filenames, writer, vocabulary, word2idx,
                       max_words_review)

In [5]:
imdb2tfrecords(path_data='datasets/dummy_text/')
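
To make the vocabulary and indexing steps easier to follow, here is a minimal, standard-library-only sketch of the same idea on a toy corpus. The toy reviews, the `encode` helper and the use of `collections.Counter` in place of pandas are assumptions made for illustration; they are not part of the function above.

```python
from collections import Counter

# Toy tokenized reviews, made up for illustration.
toy_reviews = [
    ['great', 'movie', ',', 'great', 'cast'],
    ['boring', 'movie', ',', 'weak', 'cast'],
]
min_word_frequency = 2
max_words_review = 3

# Count every token across the corpus and keep only the frequent ones.
word_frequency = Counter(word for review in toy_reviews for word in review)
vocabulary = [word for word, count in word_frequency.most_common()
              if count >= min_word_frequency]

# Reserve two extra entries, as in the tutorial.
vocabulary += ['Unknown_token', 'End_token']
word2idx = {word: idx for idx, word in enumerate(vocabulary)}


def encode(review):
    """Keep the last max_words_review tokens, index them, append End."""
    review = review[-max_words_review:]
    unknown = word2idx['Unknown_token']
    indexed = [word2idx.get(word, unknown) for word in review]
    return indexed + [word2idx['End_token']]


encoded = encode(toy_reviews[1])
print(encoded)
```

Here 'boring' and 'weak' appear only once, so they fall below the frequency threshold and are encoded with the index of `Unknown_token`. Note that slicing with `[-max_words_review:]` keeps the end of a long review, not its beginning.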

### Parse TFRecords to TF tensors
----

In [6]:
def parse_imdb_sequence(record):
    '''
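    Parse a single serialized imdb SequenceExample.
    
    Args:
        record: a scalar string tensor with one serialized SequenceExample.
    
    Returns:
        token_indexes: sequence of token indexes present in the review.
        target: the sentiment label of the review (1 positive, 0 negative).
        sequence_length: the length of the sequence, End token included.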
    '''
    context_features = {
        'sequence_length': tf.FixedLenFeature([], dtype=tf.int64),
        'target': tf.FixedLenFeature([], dtype=tf.int64),
        }
    sequence_features = {
        'token_indexes': tf.FixedLenSequenceFeature([], dtype=tf.int64),
        }
    context_parsed, sequence_parsed = tf.parse_single_sequence_example(record, 
        context_features=context_features, sequence_features=sequence_features)
        
    return (sequence_parsed['token_indexes'], context_parsed['target'],
            context_parsed['sequence_length'])
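
As a rough sketch of how such a parser can be plugged into an input pipeline, the example below writes two toy records and reads them back as a padded batch. It assumes TensorFlow 2.x (eager execution by default) and therefore uses the `tf.io.*` names instead of the TF 1.x names used in this tutorial. The `make_example` and `parse_sequence` helpers, the toy token indexes and the temporary file path are hypothetical.

```python
import os
import tempfile

import tensorflow as tf


def make_example(token_ids, target):
    """Build a SequenceExample with the feature names used in this tutorial."""
    ex = tf.train.SequenceExample()
    ex.context.feature['sequence_length'].int64_list.value.append(len(token_ids))
    ex.context.feature['target'].int64_list.value.append(target)
    tokens = ex.feature_lists.feature_list['token_indexes']
    for token_id in token_ids:
        tokens.feature.add().int64_list.value.append(token_id)
    return ex


def parse_sequence(record):
    """Same logic as parse_imdb_sequence, written with the tf.io names."""
    context_features = {
        'sequence_length': tf.io.FixedLenFeature([], dtype=tf.int64),
        'target': tf.io.FixedLenFeature([], dtype=tf.int64),
    }
    sequence_features = {
        'token_indexes': tf.io.FixedLenSequenceFeature([], dtype=tf.int64),
    }
    context, sequence = tf.io.parse_single_sequence_example(
        record, context_features=context_features,
        sequence_features=sequence_features)
    return (sequence['token_indexes'], context['target'],
            context['sequence_length'])


# Write two toy records of different lengths to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'toy.tfrecords')
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_example([4, 8, 15], 1).SerializeToString())
    writer.write(make_example([16, 23], 0).SerializeToString())

# Parse each record, then pad the variable-length sequences per batch.
dataset = tf.data.TFRecordDataset(path).map(parse_sequence)
dataset = dataset.padded_batch(2, padded_shapes=([None], [], []))

batch_tokens, batch_targets, batch_lengths = next(iter(dataset))
print(batch_tokens.numpy())
print(batch_targets.numpy(), batch_lengths.numpy())
```

Because reviews differ in length, `padded_batch` pads the shorter sequences with zeros up to the longest one in the batch; the stored `sequence_length` lets a model ignore the padded positions.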

If you would like me to add anything to this tutorial, please let me know and I will be happy to extend it :).