# Language correction - missing articles

This notebook demonstrates a working example of applying a character-level sequence-to-sequence model with attention to a language correction problem. We limit the scope of the problem to fixing a single missing article (a, an, the). This means that, given an input text with a missing article, the model recommends a language-corrected text as output.

In particular, this notebook performs the following steps. Note that most of the code lives in separate Python files inside the same folder (data_iterator.py, preprocessor.py, tf_graph.py, and training_manager.py).
- Initial data investigation to see what the data look like
- Data preprocessing, i.e., converting the raw data to a format that is ready to be trained on by a TensorFlow graph for recurrent network modelling
- Splitting the data into three parts: 1) training set, 2) validation set, and 3) test set
- Specifying the hyperparameters for the training
- Training a sequence-to-sequence model (with attention) on the training set and evaluating an accuracy metric on the validation set. The training stops when the loss metric on the validation set no longer improves (using a simple early stopping method).
- Saving the trained model to files
- Loading the trained model
- Evaluating the performance of the model qualitatively by using it on some random input data
- Evaluating the performance of the model quantitatively by using the accuracy metric on the test set
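
The early stopping step in the list above can be sketched as follows. This is a minimal illustration rather than the logic in training_manager.py, which is not shown here; the function name, the `patience` parameter, and the example loss values are all assumptions made for the sketch.

```python
def train_with_early_stopping(val_losses, patience=1):
    """Return the epoch index at which training would stop.

    val_losses: validation loss per epoch (a stand-in for real evaluation).
    patience: number of consecutive non-improving epochs to tolerate
              (hypothetical knob, not taken from the notebook's config).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember it and reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: training ran through all epochs
    return len(val_losses) - 1


# Made-up validation losses, one per epoch
stop_epoch = train_with_early_stopping([0.9, 0.6, 0.5, 0.55, 0.4])
print(stop_epoch)
```

With `patience=1` the loop stops at the first epoch whose validation loss fails to improve; a larger value tolerates temporary plateaus at the cost of extra training time.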

Before you begin, please execute the following steps:
- Download the dataset archive file (missing_article.tar.gz) from https://github.com/rerngvit/dataset/blob/master/nlp/language_correction/missing_article.tar.gz
- Extract the dataset to the same folder as this notebook. You should then have the folder "dataset" after the extraction process.
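
If you prefer to do the extraction step from Python, a sketch along these lines should work once the archive has been downloaded manually. The helper name `extract_dataset` and its default arguments are assumptions for illustration; only the standard-library `tarfile` module is used.

```python
import tarfile


def extract_dataset(archive_path="missing_article.tar.gz", target_dir="."):
    """Extract the already-downloaded archive into target_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target_dir)


# Usage, assuming the archive sits next to this notebook:
# extract_dataset("missing_article.tar.gz", ".")
```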

In [1]:
src_data_path  = "./dataset/kaggle-book/source.csv"
dest_data_path = "./dataset/kaggle-book/dest.csv"

# First, let's have a look at what the data look like

We read out the first 5 lines of the source and destination files.

In [2]:
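# Print the first 5 source/destination line pairs; the files are closed automatically
with open(src_data_path, encoding='utf8') as src_file, \
     open(dest_data_path, encoding='utf8') as dest_file:
    for count, (src_line, dest_line) in enumerate(zip(src_file, dest_file), start=1):
        print(src_line + dest_line)
        print("============")
        if count >= 5:
            break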
            

The Da Vinci Code book is just awesome.
1

"this was the first clive cussler i've ever read, but even books like Relic, and Da Vinci code were more plausible than this."
1

i liked the Da Vinci Code a lot.
1

i liked the Da Vinci Code a lot.
1

I liked the Da Vinci Code but it ultimatly didn't seem to hold it's own.
1



The data represent the fixing of a missing article (at most one mistake per text) and minor punctuation fixes.

# Let's begin the training process

First, we specify the folder that will store the trained models and TensorBoard data.

In [3]:
model_base_dir = "./trained_models/sentimental_analysis"


First, let's have a look at the devices available for TensorFlow to train on.
You should see a GPU in the output of the command below.
Otherwise, the training process will take considerably longer than it normally would (15-20x or more).

In [4]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()



[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 17406597845783354212]
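
The output above lists only a CPU device. A small helper like the one below can make that check explicit. It is a sketch: `gpu_device_names` is a name introduced here, and the `SimpleNamespace` object is a stand-in for the entries returned by `device_lib.list_local_devices()`, which expose `name` and `device_type` fields.

```python
from types import SimpleNamespace


def gpu_device_names(devices):
    """Return the names of all devices whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


# Stand-in for the device list printed above (CPU only)
devices = [SimpleNamespace(name="/device:CPU:0", device_type="CPU")]
gpus = gpu_device_names(devices)
print(gpus if gpus else "No GPU found; training will run on the CPU.")
```

In the notebook itself the call would be `gpu_device_names(device_lib.list_local_devices())`.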

In [5]:
import numpy as np
import warnings
import matplotlib.pyplot as plt
import time
import os
import json
warnings.filterwarnings('ignore')

In [6]:
import tensorflow as tf
tf.__version__

'1.4.1'

In [7]:
from preprocessor import CharacterLevelSeq2SeqPreprocessor

In [8]:
lang_data_preprocessor = CharacterLevelSeq2SeqPreprocessor(src_data_path=src_data_path, 
               dest_data_path=dest_data_path, 
               max_number_of_samples=-1,
               max_character_length=80)

 Processing in total 5214 lines
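
To give an idea of what character-level preprocessing involves, here is a minimal sketch of building a token index and encoding a text as a padded sequence of indices. It is not the code in preprocessor.py: the placeholder symbols and both function names are assumptions, and the special tokens are simply ordered so that padding, end-of-sequence, and start receive indices 0, 1, and 2, as in the saved configuration shown later in this notebook.

```python
# Hypothetical placeholder symbols; the real preprocessor defines its own
PAD, EOS, START = "<pad>", "<eos>", "<start>"


def build_token_index(texts):
    """Map each special token and each distinct character to an integer."""
    specials = [PAD, EOS, START]
    chars = sorted({ch for text in texts for ch in text})
    return {token: idx for idx, token in enumerate(specials + chars)}


def encode(text, token_index, max_length):
    """Encode text as [START, chars..., EOS], truncated or padded to max_length."""
    ids = [token_index[START]] + [token_index[ch] for ch in text] + [token_index[EOS]]
    ids = ids[:max_length]  # note: truncation may drop the EOS token
    return ids + [token_index[PAD]] * (max_length - len(ids))


token_index = build_token_index(["0", "1"])
print(encode("1", token_index, max_length=3))  # prints [2, 4, 1]
```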


Below are the hyperparameters used for the model.

In [9]:
config = { 'hidden_units' : 256,
           'depth'     : 12,
           'attention_type' : 'Bahdanau',
           'use_embedding' :  True,
           'embedding_size' : 128,
           'num_encoder_symbols' : lang_data_preprocessor.num_src_tokens,
           'num_decoder_symbols' : lang_data_preprocessor.num_dest_tokens,
           'use_residual' : True,
           'attn_input_feeding': True,
           'use_dropout': True,
           'dropout_rate' : 0.02,
           'learning_rate' : 0.0005,
           'max_gradient_norm' : 1.0,
           'batch_size' : 64,
           'num_epochs' : 20,
           'optimizer'  : 'adam',
           'use_fp16'   : False,
           'max_src_seq_length'  : lang_data_preprocessor.max_src_seq_length,
           'max_dest_seq_length' : lang_data_preprocessor.max_dest_seq_length,
           'max_decode_step' : lang_data_preprocessor.max_dest_seq_length,
           'padding_token_index' : lang_data_preprocessor.dest_token_index[
                                       lang_data_preprocessor.PADDED_character],
           'dest_start_token_index' : lang_data_preprocessor.dest_token_index[
                                       lang_data_preprocessor.SEQ_START_CHARACTER],
           'dest_eos_token_index' : lang_data_preprocessor.dest_token_index[
                                       lang_data_preprocessor.EOS_character],
           'model_base_dir': model_base_dir,
           'model_saved_path' : model_base_dir + "/trained_model",
           'use_beamsearch_decode' : False,
           'beam_width' : 64,          
           'saving_last_model' : True,
           'default_device' : '/cpu:0',
           'run_full_trace': False
         }
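
One setting worth illustrating is `max_gradient_norm`. Assuming it is used for clipping gradients by their global norm (a common choice in TensorFlow seq2seq code, but not confirmed here because tf_graph.py is not shown), the idea can be sketched in plain Python. The function name and the toy gradient values are made up for illustration.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Scale all gradients so that their joint L2 norm is at most max_norm.

    gradients: list of flat lists of floats, one list per variable.
    Returns the (possibly rescaled) gradients and the original global norm.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_norm:
        return [list(grad) for grad in gradients], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


# Two toy "variables" whose joint gradient norm is 5.0
clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm, clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is capped.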

In [10]:
from tf_graph import CharacterSeq2SeqModel
from data_iterator import DataFeedIterator
from training_manager import TrainingManager

In [11]:
training_manager = TrainingManager(lang_data_preprocessor=lang_data_preprocessor,
                                   src_data_path=src_data_path, 
                                   dest_data_path=dest_data_path, 
                                   batch_size=config["batch_size"])
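
The three-way split mentioned in the overview happens inside `TrainingManager`, whose code is not shown in this notebook. As a rough sketch of one way such a split could be done, the snippet below shuffles the samples and cuts them into training, validation, and test parts. The function name, the 10% fractions, and the seed are assumptions, not values taken from training_manager.py.

```python
import random


def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle samples and split them into (train, validation, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# 5214 is the number of lines reported by the preprocessor above
train, val, test = split_dataset(range(5214))
print(len(train), len(val), len(test))
```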

In [12]:
import json
os.makedirs(model_base_dir, exist_ok=True)
cfg_file_path = model_base_dir + "/config.json"
# Persist the configuration so that it can be reloaded for decoding later
with open(cfg_file_path, 'w') as file:
    json.dump(config, file)

# The code below will start the training process
* Note that a single run is expected to take around 15-20 hours with a GPU. Without a GPU, it can take even longer.
* After the training process finishes, the model is saved to a file.

In [13]:
training_manager.fit_eval_dnn(config=config)

 Setting default device to  /cpu:0
building model..
building encoder..
building decoder and attention..
 Initial state for decoder cell batch size is  64
 Decoder input encoded is  (64, 3, 256)
logits train shape is  (64, ?, 5)
decoder_targets_train train shape is  (64, ?)
setting optimizer..
running vars are  [<tf.Variable 'decoder/accuracy/total:0' shape=() dtype=float32_ref>, <tf.Variable 'decoder/accuracy/count:0' shape=() dtype=float32_ref>]
Execution an epoch: type =  Train
[ Train ] Epoch 0 Step 0 Teacher-Force Acc 0.18 Loss 0.66 Perplex 1.94 Elapsed Time 1.93 3.32 samples/s Cur time:  2018-02-02 08:43 

[ Train ] Epoch 0 Step 10 Teacher-Force Acc 0.77 Loss 2.94 Perplex 18.99 Elapsed Time 17.39 3.68 samples/s Cur time:  2018-02-02 08:46 

[ Train ] Epoch 0 Step 20 Teacher-Force Acc 0.86 Loss 0.45 Perplex 1.56 Elapsed Time 17.88 3.58 samples/s Cur time:  2018-02-02 08:49 

[ Train ] Epoch 0 Step 30 Teacher-Force Acc 0.90 Loss 0.19 Perplex 1.21 Elapsed Time 19.17 3.34 samples/s Cu

0.07141225822269916
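
The Loss and Perplex columns in the log above appear to be related by perplexity = exp(loss), up to the rounding of the printed values. This is an observation from the numbers, not something confirmed from the training code. The sketch below checks it for the four logged steps; the helper name `consistent` is introduced here for illustration.

```python
import math

# (loss, perplexity) pairs copied from the training log above
logged = [(0.66, 1.94), (2.94, 18.99), (0.45, 1.56), (0.19, 1.21)]


def consistent(loss, perplexity, rounding=0.005):
    """True if perplexity could equal exp(true_loss) given 2-decimal rounding."""
    low = math.exp(loss - rounding) - rounding
    high = math.exp(loss + rounding) + rounding
    return low <= perplexity <= high


for loss, perplexity in logged:
    print(loss, perplexity, round(math.exp(loss), 2), consistent(loss, perplexity))
```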

# Loading the trained model from the saved file

In [53]:
import json

In [54]:
# Use a context manager so that the config file is closed after reading
with open(cfg_file_path) as cfg_file:
    loaded_config = json.load(cfg_file)
loaded_config

{'attention_type': 'Bahdanau',
 'attn_input_feeding': True,
 'batch_size': 64,
 'beam_width': 64,
 'default_device': '/cpu:0',
 'depth': 12,
 'dest_eos_token_index': 1,
 'dest_start_token_index': 2,
 'dropout_rate': 0.02,
 'embedding_size': 128,
 'hidden_units': 256,
 'learning_rate': 0.0005,
 'max_decode_step': 3,
 'max_dest_seq_length': 3,
 'max_gradient_norm': 1.0,
 'max_src_seq_length': 80,
 'model_base_dir': './trained_models/sentimental_analysis',
 'model_saved_path': './trained_models/sentimental_analysis/trained_model',
 'num_decoder_symbols': 5,
 'num_encoder_symbols': 95,
 'num_epochs': 20,
 'optimizer': 'adam',
 'padding_token_index': 0,
 'run_full_trace': False,
 'saving_last_model': True,
 'use_beamsearch_decode': False,
 'use_dropout': True,
 'use_embedding': True,
 'use_fp16': False,
 'use_residual': True}

In [62]:
decoding_config = {
    'beam_width': 1,
    'use_beamsearch_decode' : False,
    'max_decode_step': lang_data_preprocessor.max_dest_seq_length,
    'write_n_best' : False,
    'log_device_placement' : True,
    'batch_size': 32,
    'default_device' : '/cpu:0'
    
}

In [63]:
# Copy every training setting that the decoding config does not explicitly override
for k, v in loaded_config.items():
    if k not in decoding_config:
        decoding_config[k] = v

In [64]:
decoding_config

{'attention_type': 'Bahdanau',
 'attn_input_feeding': True,
 'batch_size': 32,
 'beam_width': 1,
 'default_device': '/cpu:0',
 'depth': 12,
 'dest_eos_token_index': 1,
 'dest_start_token_index': 2,
 'dropout_rate': 0.02,
 'embedding_size': 128,
 'hidden_units': 256,
 'learning_rate': 0.0005,
 'log_device_placement': True,
 'max_decode_step': 3,
 'max_dest_seq_length': 3,
 'max_gradient_norm': 1.0,
 'max_src_seq_length': 80,
 'model_base_dir': './trained_models/sentimental_analysis',
 'model_saved_path': './trained_models/sentimental_analysis/trained_model',
 'num_decoder_symbols': 5,
 'num_encoder_symbols': 95,
 'num_epochs': 20,
 'optimizer': 'adam',
 'padding_token_index': 0,
 'run_full_trace': False,
 'saving_last_model': True,
 'use_beamsearch_decode': False,
 'use_dropout': True,
 'use_embedding': True,
 'use_fp16': False,
 'use_residual': True,
 'write_n_best': False}

In [65]:
import tensorflow as tf
tf.reset_default_graph() # to clean out all the variables to allow for rerunning the model
sess = tf.Session(config=tf.ConfigProto(
      allow_soft_placement=True))

In [66]:
# Start the model
decoding_model = CharacterSeq2SeqModel(session=sess,
                              config=decoding_config, 
                              mode='decode')
# Restoring model parameters
decoding_model.restore(sess, decoding_config["model_saved_path"])

 Setting default device to  /cpu:0
building model..
building encoder..
building decoder and attention..
 Initial state for decoder cell batch size is  32
building greedy decoder..
INFO:tensorflow:Restoring parameters from ./trained_models/sentimental_analysis/trained_model
model restored from ./trained_models/sentimental_analysis/trained_model


# Evaluating the model qualitatively with unseen data

In [73]:
def sentimental_analysis(input_text):
    return training_manager.seq2seq_execution(
    sess, decoding_model, decoding_config['batch_size'], 
    input_text=input_text)

In [74]:
sentimental_analysis("This is a great movie.")

'0\n'

In [80]:
sentimental_analysis("Awesome movie.")

'1\n'

In [None]:
sentimental_analysis("I really like it.")

In [None]:
sentimental_analysis("Such a lovely movie.")

In [75]:
sentimental_analysis("I do not like this one at all.")

'1\n'

In [76]:
sentimental_analysis("This sucks.")

'0\n'

In [77]:
sentimental_analysis("I hate this completely.")

'0\n'

In [78]:
sentimental_analysis("This sucks big time!")

'0\n'

In [79]:
sentimental_analysis("This is a very bad movie.")

'0\n'

# Evaluating the model quantitatively on the test set

The code below retrieves the data batch by batch, uses the model to generate the predictions (decoded text), and compares them with the ground truth (dest text). At the end of the evaluation, the code outputs the accuracy metric, which is the fraction of samples whose prediction matches the ground truth.

In [81]:
def evaluate_model(dataset):
    num_samples, num_corrected = 0, 0
    
    for source, source_len, dest, dest_len in dataset:
        for sample_idx in range(config['batch_size']):
            decoding_src_seq  = source[sample_idx, :]
            decoding_dest_seq = dest[sample_idx, :]
            
            input_text = lang_data_preprocessor.src_seq_to_text(decoding_src_seq)
            # Decode with the helper defined in the qualitative evaluation section
            predicted_text = sentimental_analysis(input_text)
            actual_text = lang_data_preprocessor.dest_seq_to_text(
                decoding_dest_seq)[1:]   # Ignoring the first SEQ_START character
            
            print("Input text = '%s'"     % input_text)
            print("Dest text = '%s'"     %  actual_text)
            print("Decoded text = '%s'  " % predicted_text)
            print(" =======================")
            
            if actual_text.strip() == predicted_text.strip():
                num_corrected += 1
            
            num_samples += 1
    
    print(" Number of corrected decoding = ", num_corrected, ", out of ", num_samples)
    accuracy = (num_corrected + 0.0) / num_samples
    print(" Accuracy = %s " % accuracy)
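
Stripped of the model and preprocessor calls, the accuracy metric computed above reduces to an exact-match comparison after trimming whitespace. The following self-contained sketch shows just that part; the function name and the example strings are made up for illustration.

```python
def exact_match_accuracy(actual_texts, predicted_texts):
    """Fraction of pairs whose texts are equal after stripping whitespace."""
    pairs = list(zip(actual_texts, predicted_texts))
    if not pairs:
        raise ValueError("need at least one sample to compute accuracy")
    num_correct = sum(actual.strip() == predicted.strip()
                      for actual, predicted in pairs)
    return num_correct / len(pairs)


# Three of the four predictions match their ground truth
accuracy = exact_match_accuracy(["1\n", "0\n", "1\n", "0\n"],
                                ["1\n", "0\n", "0\n", "0"])
print(accuracy)  # prints 0.75
```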



In [82]:
train_data_list = list(training_manager.train_set)
train_data_list

[(array([[44., 71., 81., ...,  0.,  0.,  0.],
         [ 5., 64., 87., ...,  0.,  0.,  0.],
         [ 5., 50., 82., ...,  0.,  0.,  0.],
         ...,
         [35., 63.,  3., ...,  0.,  0.,  0.],
         [17., 35., 77., ...,  0.,  0.,  0.],
         [ 5., 39., 63., ...,  0.,  0.,  0.]]),
  array([36, 65, 75, 33, 78, 33, 35, 22, 35, 31, 77, 27, 65, 34, 68, 49, 42,
         67, 38, 72, 31, 76, 33, 27, 49, 71, 78, 54, 54, 33, 71, 76, 38, 31,
         62, 77, 54, 31, 33, 30, 49, 30, 43, 76, 23, 30, 30, 35, 21, 42, 27,
         79, 27, 52, 39, 58, 78, 23, 23, 41, 38, 23, 58, 70]),
  array([[2., 4., 1.],
         [2., 3., 1.],
         [2., 3., 1.],
         [2., 4., 1.],
         [2., 3., 1.],
         [2., 4., 1.],
         [2., 4., 1.],
         [2., 4., 1.],
         [2., 3., 1.],
         [2., 3., 1.],
         [2., 4., 1.],
         [2., 3., 1.],
         [2., 3., 1.],
         [2., 4., 1.],
         [2., 4., 1.],
         [2., 3., 1.],
         [2., 4., 1.],
         [2., 4., 1.],


In [83]:
import random

First, let's evaluate the performance on the training set. We sample the data here, with *k* representing the number of batches, to obtain early results before progressing to a larger *k*.

In [84]:
evaluate_model(random.sample(train_data_list, k=2))

Input text = 'friday hung out with kelsie and we went and saw The Da Vinci Code SUCKED!!!!!
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'holy crap i loved mission impossible 3..
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'friday hung out with kelsie and we went and saw The Da Vinci Code SUCKED!!!!!
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'The Da Vinci Code was absolutely AWESOME!
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '"I have to say, I hated Brokeback Mountain, though."
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'Brokeback Mountain was boring.
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'I love Harry Potter.
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'i heard da vinci code sucked soo much only 2.5 stars:
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'we're gonna like watch Mission Impossible or Hoot.(
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '"I, too, like Harry Potter.."
'
Dest text

Input text = 'I love Brokeback Mountain....
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '"The Da Vinci Code was awesome, I can't wait to read it..."
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'friday hung out with kelsie and we went and saw The Da Vinci Code SUCKED!!!!!
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'Da Vinci Code sucked..
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'da vinci code sucks...
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'dudeee i LOVED brokeback mountain!!!!
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Harry Potter is AWESOME I don't care if anyone says differently!..
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'we're gonna like watch Mission Impossible or Hoot.(
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '"I really love Brokeback Mountain, its a wonderful film!!!"
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '"I liked the first "" Mission Impossible."
'
Dest text = '1


In [85]:
evaluate_model(random.sample(train_data_list, k=10))

Input text = '"Anyway, thats why I love "" Brokeback Mountain."
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '"Is it just me, or does Harry Potter suck?..."
'
Dest text = '0
'
Decoded text = '0
'  
Input text = '"Oh, and Brokeback Mountain is a TERRIBLE movie..."
'
Dest text = '0
'
Decoded text = '0
'  
Input text = '"Oh, and Brokeback Mountain was a terrible movie."
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'dudeee i LOVED brokeback mountain!!!!
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'I want Harry Potter back!..
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Da Vinci Code sucks be...
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'Da Vinci Code was AWESOME..
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Da Vinci Code sucked..
'
Dest text = '0
'
Decoded text = '0
'  
Input text = '"He's like,'YEAH I GOT ACNE AND I LOVE BROKEBACK MOUNTAIN '.."
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'I love The Da Vinci Code...
'


Input text = 'I wanted desperately to love'The Da Vinci Code as a film.
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '"Always knows what I want, not guy crazy, hates Harry Potter.."
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'i heard da vinci code sucked soo much only 2.5 stars:
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'friday hung out with kelsie and we went and saw The Da Vinci Code SUCKED!!!!!
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'Da Vinci Code sucks be...
'
Dest text = '0
'
Decoded text = '0
'  
Input text = '"Always knows what I want, not guy crazy, hates Harry Potter.."
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'Ok brokeback mountain is such a horrible movie.
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'Ok brokeback mountain is such a horrible movie.
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'we're gonna like watch Mission Impossible or Hoot.(
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 

Input text = 'the people who are worth it know how much i love the da vinci code.
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Da Vinci Code sucks be...
'
Dest text = '0
'
Decoded text = '0
'  
Input text = '"The Da Vinci Code was awesome, I can't wait to read it..."
'
Dest text = '1
'
Decoded text = '1
'  
Input text = '0Sometimes all I said was'Harry Potter sucks!
'
Dest text = '0
'
Decoded text = '0
'  
Input text = '"The Da Vinci Code was awesome, I can't wait to read it..."
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Having Brokeback Mountain action figures that are fully clothed is stupid.
'
Dest text = '0
'
Decoded text = '0
'  
Input text = '"I liked the first "" Mission Impossible."
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'i love being a sentry for mission impossible and a station for bonkers.
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'I love Brokeback Mountain....
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Da Vinci Co

Input text = 'This quiz sucks and Harry Potter sucks ok bye..
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'I love Harry Potter..
'
Dest text = '1
'
Decoded text = '1
'  


KeyboardInterrupt: 

We also evaluate on the validation set to get a feel for the performance differences.

# The key evaluation on the test set

Below is the performance of the model on the unseen test set.

In [None]:
evaluate_model(training_manager.test_set)

Input text = 'I love Harry Potter..
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Da Vinci Code sucked..
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'man i loved brokeback mountain!
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'the people who are worth it know how much i love the da vinci code.
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'I love Harry Potter.
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'I love Brokeback Mountain....
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Brokeback Mountain was awesome..
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'Da Vinci Code sucks.
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'The Da Vinci Code was awesome in my opinion -- the book and the movie.
'
Dest text = '1
'
Decoded text = '1
'  
Input text = 'So Brokeback Mountain was really depressing.
'
Dest text = '0
'
Decoded text = '0
'  
Input text = 'Mission Impossible 3 sucked!..
'
Dest text = '0
'
Decoded text = '0
' 

Number of corrected decoding =  66564 , out of  88064
Accuracy = 0.755859375 
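
As a quick sanity check of the reported figures, the arithmetic can be reproduced directly. The numbers are copied from the output above; the batch size of 64 comes from the training configuration, and the variable names are chosen here for illustration.

```python
num_correct, num_samples = 66564, 88064  # copied from the output above
batch_size = 64                          # config['batch_size']

accuracy = num_correct / num_samples
num_errors = num_samples - num_correct
num_batches, remainder = divmod(num_samples, batch_size)

print(accuracy)                # prints 0.755859375
print(num_errors)              # prints 21500
print(num_batches, remainder)  # prints 1376 0
```

The sample count is an exact multiple of the batch size, which is consistent with `evaluate_model` iterating over `config['batch_size']` rows for every batch.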