## **Text-based inference on a production allocation problem (keyword extraction, abbreviation detection, seq2seq encoder-decoder)**

Information Extraction (IE) is one of the most active research topics in natural languag processing (NLP) domain, comprising of a number of sophisticated tasks to go through prior to being able to make human-like inference digesting on the input texts. Tokenization, POS tagging, anguage models, entity recognition, text summarizations, neural machine translations, etc. are bunch of work analysts might play with to get the unstructured form of strings to be analyzable and predictable to complete targets on custom contexts.


In a highly dynamic and uncertian business operation of a supply chain or production environment, how many resources were to be allocated to specific parties may sometimes not be easily clear-cut. External factors like the development lead time, availability of raw materials and warehouse capacity constraints could change the rules frequently. 


For a problem trying to automatically deduce the rules based on some short texts of instructions, with as minimal supervised efforts and prior expertise as possible, these short texts consisted of short forms of named parties, implicit or explicit numerical expressions, and even some latent contextual meanings, e.g. when it was meant to be equally shared, the machine needs to consider the number of parties involved and simply dividing 100% by it. But it would be extremely inefficient to write if-else condition programme to define actions to take, and also difficult to pin-point the short forms to the special nouns which were specific in my context that no pre-trained corpa of named entities could help.


To complete such challenge using NLP techniques, my idea included extracting keywords from the instruction texts by building a classifier, converting the short forms by a self-defined matching function, and constructing an encoder-decoder structure of seq2seq model to learn the rules between named parties and their percentages.



In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk
import re
nltk.download('stopwords')
nltk.download('punkt')

In [0]:
from google.colab import files
files.upload()

In [0]:
df = pd.read_csv('NLP information extraction and allocation action.csv')

In [0]:
## Extract vendor name and vendor class entities
## Tokenize the short texts of allocation rules

identity = []
combined_text = []
tokenized_text = []
word_list = []

n = 0
k = 1
for i in range(1, len(df)):
  if df['5-digit Item Number'][i]!=df['5-digit Item Number'][i-1]:
    k += 1

for j in range(k):
  while (not identity) or (df['5-digit Item Number'][n]==df['5-digit Item Number'][n-1]):
    identity.append(df[['Vendor Name', 'Vendor Class']].iloc[n].apply(lambda x: ''.join(x)).values)
    n += 1
    if n >= len(df):
      break

  series = list(np.array(identity).flatten())
  series = [x.lower() for x in series]
  series = [x.strip() for x in series]

  combined_text.append(series)
  identity = []

  processed_text = df['Allocation rules'][n-1]
  processed_text = processed_text.lower()
  processed_text = processed_text.strip()
  processed_text = re.findall(r'\d+\.?\d+?%|\d+?%|\d+/\d+/\d+|\w+', processed_text)
  tokenized_text.append(processed_text)
  
  ## Make tokens of vendor names consistent before word2vec; 
  ## handling cases like of mis-spelling / missing space in-between the names
  
  for v in range(0, len(combined_text[j]), 2):
    if len(combined_text[j][v].split()) > 1:
      w = 0
      while w < len(tokenized_text[j]):
        if tokenized_text[j][w]==combined_text[j][v].split()[0]:
          m = 0
          while tokenized_text[j][w+m]==combined_text[j][v].split()[m]:
            m += 1
            if m >= len(combined_text[j][v].split()) or w + 1 >= len(tokenized_text[j]):
              break
          if m > 1 or len(combined_text[j][v].split()) > 1:
            tokenized_text[j][w:w+m] = []
            tokenized_text[j].insert(w, combined_text[j][v])
        w += 1
        if w + 1 > len(tokenized_text[j]):
          break
          
  word_list.append(combined_text[j] + tokenized_text[j])

### Word representations:

The above operations tokenized the raw strings, and read the entities (factory name and factory category) into respective lists. For processing information from text data, Word2Vec is a common technique to obtain neural representations of the texts. It can be used as features in the keyword classification task. I tested to query the top relevant words for the abbreviation tokens, and the fitted word2vec model successfully returned the desired token as the most similar word for some of them, e.g. "fs" matched with "funskool". Though "mp" also found "micro plastics" in the 6th position, it showed that the word2vec model was still not robust enough to detect correctly the referred word of each abbreviation in the corpus.


In [0]:
from gensim.models import word2vec

In [0]:
## use Word2Vec to find nearest neighbouring expressions for mining abbreviations related to name entities
w2v = word2vec.Word2Vec(word_list, size=1000, window=3, min_count=1, seed=42)
w2v.train(word_list, total_examples=len(word_list), epochs=500)

(827568, 1942500)

In [0]:
## load the trained word2vec model
w2v = word2vec.Word2Vec.load("allocation word vectors")

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [0]:
w2v.wv.most_similar('mp')

  if np.issubdtype(vec.dtype, np.int):


[('check', 0.5507981777191162),
 ('gsp', 0.46539390087127686),
 ('growth master', 0.460766464471817),
 ('allocation', 0.41754984855651855),
 ('gm', 0.41087865829467773),
 ('micro plastics', 0.37347841262817383),
 ('kh', 0.3706154227256775),
 ('40%', 0.3654845356941223),
 ('60%', 0.3639428913593292),
 ('45%', 0.34594470262527466)]

In [0]:
w2v.wv.most_similar('fs')

  if np.issubdtype(vec.dtype, np.int):


[('funskool', 0.8913756608963013),
 ('90%', 0.7004472613334656),
 ('item', 0.6796683073043823),
 ('growth master', 0.6771235466003418),
 ('wt', 0.6760456562042236),
 ('under', 0.6524773240089417),
 ('gm', 0.6307111382484436),
 ('35%', 0.602761447429657),
 ('ex', 0.6009894609451294),
 ('2.1%', 0.5619425773620605)]

In [0]:
## save vocab list
vocab = w2v.wv.vocab

### Transforming tokens into dataframe for manual annotations & Classification problem to tag keyowrds:

For each short text being of different lengths, I re-construct a dataframe format to present each token in a row, allowing easy creations of fields for tagging of  keywords and abbreviations. Here, I load the tagged dataset, map the word2vec features and import them for training a Gradient Boosting classifier on the binary indicator of keywords. Overall, 92% of balanced-class accuracy and 93% of F1 score have been achieved. 


In [0]:
## Construct a dataframe of vocab
vocab_item = []
vocab_df = []

k = 1
for i in range(1, len(df)):
  if not vocab_item:
    vocab_item.append(df['5-digit Item Number'][0])
  if df['5-digit Item Number'][i]!=df['5-digit Item Number'][i-1]:
    k += 1
    vocab_item.append(df['5-digit Item Number'][i])

for j in range(k):
  vocab_df.append(pd.concat([pd.Series([vocab_item[j]]*len(tokenized_text[j])), 
                             pd.Series(np.array(tokenized_text[j]).reshape(len(tokenized_text[j]), ))], 
                            axis=1))
  
vocab_df = pd.concat(vocab_df, axis=0)
vocab_df.columns = ['Item', 'Token']

In [0]:
## manually tagged dataset
vocab_df = pd.read_csv('keyword_encode_2.csv')
vocab_df.iloc[0:5,1:]

Unnamed: 0,Token,Critical Word Indicator,Abbreviation Token Indicator,Abbr_Name,Abbr_Class
0,fy,0,0,0,0
1,target,0,0,0,0
2,keep,1,0,0,0
3,jp,1,1,1,0
4,remove,1,0,0,0


In [0]:
## For classification of critical words,
## Map the trained word vectors for each word

word_vec = []
word_vec_dim = []

for x in range(len(vocab_df)):
  word_vec.append(pd.DataFrame(w2v[vocab_df['Token'][x]]).transpose())

for dim in range(1000):
  word_vec_dim.append('Dim' + str(dim + 1))

word_vec_df = pd.concat(word_vec, axis=0)
word_vec_df.columns = word_vec_dim
word_vec_df.reset_index(inplace=True, drop=True)

word_vec_df

  


Unnamed: 0,Dim1,Dim2,Dim3,Dim4,Dim5,Dim6,Dim7,Dim8,Dim9,Dim10,Dim11,Dim12,Dim13,Dim14,Dim15,Dim16,Dim17,Dim18,Dim19,Dim20,Dim21,Dim22,Dim23,Dim24,Dim25,Dim26,Dim27,Dim28,Dim29,Dim30,Dim31,Dim32,Dim33,Dim34,Dim35,Dim36,Dim37,Dim38,Dim39,Dim40,...,Dim961,Dim962,Dim963,Dim964,Dim965,Dim966,Dim967,Dim968,Dim969,Dim970,Dim971,Dim972,Dim973,Dim974,Dim975,Dim976,Dim977,Dim978,Dim979,Dim980,Dim981,Dim982,Dim983,Dim984,Dim985,Dim986,Dim987,Dim988,Dim989,Dim990,Dim991,Dim992,Dim993,Dim994,Dim995,Dim996,Dim997,Dim998,Dim999,Dim1000
0,0.079781,-0.025519,-0.243857,0.481250,-0.097796,-0.025103,0.096601,0.291724,-0.456119,-0.527140,-0.492613,-0.523470,0.057695,0.069989,0.291863,-0.085822,-0.160219,0.104416,-0.224287,-0.249879,-0.227970,0.352441,0.137310,-0.031105,-0.292824,0.007318,0.384589,-0.171379,0.141584,0.191231,-0.029099,-0.054023,0.066428,-0.333136,0.065857,-0.119944,0.190828,0.419122,-0.147303,0.160678,...,0.491943,-0.233385,-0.110467,0.176024,-0.132378,0.034769,-0.170813,-0.419283,0.432286,0.342213,-0.521197,-0.300880,-0.256072,0.147677,0.017774,-0.079317,0.233280,-0.723025,0.320899,0.387058,-0.136161,-0.100076,-0.115502,-0.366601,-0.481600,0.230335,-0.186191,0.341697,-0.063077,0.236464,0.179565,0.499097,0.384646,0.061515,0.026558,-0.441899,0.105285,-0.091680,0.157746,0.151333
1,-0.183910,-0.061212,0.129800,0.520310,-0.011274,0.357148,-0.276536,0.138316,0.168843,-0.218067,-0.007209,-0.393613,0.588654,-0.389109,0.642491,0.075533,0.383125,-0.088680,0.355325,0.174431,0.021438,-0.014471,-0.222796,-0.398659,-0.249722,-0.352323,-0.184781,-0.310528,-0.057301,0.047881,-0.292628,-0.197707,0.057850,-0.606778,0.315939,-0.085344,0.247765,0.667252,0.321471,-0.014339,...,-0.008724,-0.055227,0.058422,0.092477,-0.103028,-0.340897,0.357281,0.079935,0.123921,-0.079473,-0.006668,0.081586,0.088914,-0.010275,-0.319333,0.317449,0.205100,0.011300,-0.007748,0.345069,-0.128609,0.009163,-0.139763,-0.364857,-0.124788,-0.283637,-0.016459,0.235692,0.049831,0.174068,0.137410,0.120423,0.049469,0.514741,0.224115,-0.366353,-0.273836,0.234740,0.307678,0.253161
2,-0.271267,0.124715,0.024563,-0.094865,0.091894,0.062833,0.100830,0.017258,-0.022628,0.120851,0.005456,0.179561,0.145151,-0.274339,0.257788,0.043923,0.388484,0.002192,-0.009557,-0.321865,-0.102209,0.244217,0.220352,0.033578,0.139470,-0.269418,0.078591,0.318461,0.018931,0.003448,-0.111183,-0.190392,-0.032441,0.191246,0.098037,-0.207961,0.325078,-0.145846,-0.483748,-0.166002,...,0.053646,-0.069114,-0.075449,0.055255,-0.393169,-0.066259,0.134060,0.162547,-0.021797,-0.137397,0.233625,-0.139859,-0.098502,0.256120,0.167748,0.031214,0.318820,-0.205395,-0.200689,-0.161129,-0.120770,0.007059,0.116352,-0.054278,0.043781,0.067517,0.208657,0.226606,-0.309472,0.120875,0.193496,0.283361,0.050496,-0.081303,0.132771,-0.360822,-0.000573,-0.373614,0.108963,0.042342
3,-0.334936,-0.237400,-0.167446,0.307774,0.552991,0.099674,-0.128793,0.184903,-0.300067,-0.160469,-0.137007,-0.380960,0.067295,-0.350619,0.319733,0.312155,0.234408,0.456366,0.103505,-0.163476,-0.112946,0.273597,0.274887,-0.008497,-0.404606,-0.219477,0.200002,0.470362,0.067487,-0.013588,-0.049962,-0.066518,-0.108149,-0.090050,0.477854,-0.432016,0.279654,0.297045,-0.167063,0.094208,...,0.295686,-0.071417,-0.103228,-0.008193,-0.482496,-0.107093,0.380227,-0.081455,0.418951,-0.213351,-0.182586,-0.167677,0.183418,-0.053785,0.092943,-0.330301,-0.024515,-0.559753,0.022932,0.180525,0.188712,0.027331,-0.235138,-0.330035,0.122914,0.102582,-0.103969,-0.070550,-0.079660,0.035513,-0.149414,0.560538,-0.103845,0.242430,0.017625,-0.328206,-0.005047,0.019417,0.093238,0.037537
4,-0.127603,0.146247,-0.090662,-0.003150,-0.134571,-0.179288,0.151212,-0.235700,-0.211594,0.191210,0.101248,0.148342,0.046597,-0.303030,0.110344,-0.064218,0.293248,0.008584,-0.247761,-0.101321,0.153911,0.034562,0.227221,-0.114165,-0.025303,-0.336854,0.231757,-0.088892,-0.383875,0.013731,-0.083015,-0.102411,0.060261,0.046408,-0.071695,-0.071815,0.122069,-0.096766,-0.461869,-0.396110,...,0.006019,0.224938,0.450019,0.103116,-0.477780,0.180737,0.233281,-0.350057,-0.360580,0.072298,0.610170,0.077580,-0.379926,0.154614,0.459915,0.218706,-0.019017,-0.152010,-0.155587,-0.052518,-0.018161,0.111680,0.089554,-0.123803,-0.360023,-0.430826,-0.089904,0.501573,-0.132513,-0.036738,0.229372,0.062292,0.022761,0.130935,-0.145899,-0.445411,0.181044,-0.870611,0.122402,-0.205108
5,0.034763,-0.161011,0.140570,0.175992,0.459962,-0.242631,-0.051741,-0.128755,-0.103677,0.199563,0.148768,-0.170274,0.048759,-0.182653,0.170006,0.142879,0.371174,0.162317,0.039228,-0.147637,-0.082493,0.169107,0.159363,-0.050029,-0.253319,-0.006641,0.125198,0.405186,-0.030552,0.112613,0.132836,0.021062,-0.007054,0.252117,0.444617,-0.160032,0.111042,0.080811,-0.343690,-0.009980,...,0.224006,-0.061652,-0.208635,-0.064257,-0.036795,-0.040645,0.127223,0.199031,-0.140367,-0.037575,-0.032066,0.010635,-0.133837,0.170067,0.165061,0.126627,-0.377569,-0.390835,-0.121297,-0.071974,0.188917,0.155280,-0.100495,0.101031,-0.283114,0.066059,-0.020677,-0.016811,-0.240801,0.352777,0.112451,0.187223,-0.088103,0.135670,-0.655425,0.151951,0.354230,-0.027659,0.158659,0.315700
6,0.110494,0.263056,0.020745,-0.083690,0.317877,-0.159514,0.098328,-0.285037,0.190302,0.257754,0.368462,0.116496,0.321317,-0.243893,0.233373,0.031371,0.067454,0.040376,0.082453,0.054444,0.122221,-0.016186,0.073776,0.393353,-0.043752,-0.453130,-0.073461,0.517180,-0.115861,-0.350563,-0.017058,0.031625,0.118507,0.088435,-0.056510,0.007562,0.014261,-0.367977,-0.868241,-0.190956,...,-0.041060,0.009688,0.015103,-0.035007,-0.201611,0.184168,0.185245,0.343298,-0.242966,0.017674,0.051220,-0.200444,0.154912,-0.083202,0.183110,0.121521,0.066193,-0.111828,-0.217700,-0.325334,0.153115,0.052534,-0.242556,0.125184,0.057218,0.133159,0.253385,0.144550,-0.422281,0.218494,0.175572,-0.431394,0.025009,-0.254654,0.137788,-0.129653,0.100244,-0.034440,0.027490,-0.012786
7,-0.268107,0.011173,0.048714,0.000897,0.252820,-0.074468,0.012479,-0.102106,-0.005142,0.445745,0.230651,0.321489,0.168059,-0.206180,0.222813,0.288158,0.522919,0.159832,0.053895,-0.021069,-0.162871,0.265996,0.114369,-0.054784,0.023998,-0.497897,0.410240,0.488191,0.036219,-0.009524,-0.285679,-0.130584,-0.054320,0.201591,0.101228,-0.093514,0.542190,-0.395032,-0.727288,-0.142969,...,0.141931,-0.059457,-0.250941,-0.015322,-0.559130,0.077343,0.219120,0.493471,-0.270443,-0.062482,0.399766,-0.150523,-0.219753,0.268484,0.413417,0.117013,0.060151,-0.307243,-0.350530,-0.242909,-0.119714,-0.107818,0.164371,-0.154829,0.051248,0.138710,0.168785,0.267176,-0.535444,0.173816,0.175458,0.235406,0.049407,-0.028393,-0.073177,-0.351295,0.038374,-0.718208,0.211355,-0.037011
8,0.151989,0.100551,0.126916,0.019886,0.164062,-0.271461,0.094277,0.014035,-0.041832,0.211793,0.339095,0.403866,0.013407,0.178462,0.143547,0.158849,0.393222,0.103208,-0.198883,0.118967,-0.085434,0.190440,0.450476,0.179067,0.035952,0.207082,0.228413,0.231009,-0.144860,-0.163931,0.195402,0.269231,0.086868,0.155731,-0.013920,0.073765,0.236592,-0.062063,-0.524370,-0.131448,...,-0.099256,-0.177377,-0.253751,-0.249019,-0.177402,0.273509,0.060083,0.225747,-0.199107,-0.032746,0.010176,0.031080,0.002973,0.460473,0.285270,0.009200,-0.276626,0.151094,-0.226739,-0.356126,0.221871,-0.252445,-0.294362,0.744626,0.164862,0.241230,0.262753,-0.357006,-0.281661,0.065284,-0.118605,-0.239055,-0.194856,0.068432,-0.525261,-0.172969,0.267976,-0.323329,0.060081,-0.062461
9,0.174452,-0.107560,0.257954,-0.254951,-0.132018,-0.228701,0.150555,-0.245272,-0.210707,-0.057574,-0.246034,-0.213247,-0.065317,-0.030010,-0.194335,-0.097309,0.177408,0.421320,0.226604,0.166160,-0.287422,-0.087127,0.411904,-0.074701,-0.411938,0.011719,0.256076,-0.242647,-0.332309,-0.186846,0.121583,-0.190332,-0.066937,-0.115632,-0.604305,0.025050,-0.109032,0.040153,0.265488,-0.115242,...,-0.040703,-0.161550,0.106037,-0.479894,0.198198,0.073080,0.445559,-0.207496,-0.082609,0.356588,0.196779,-0.195157,0.065719,0.110567,0.187485,-0.167426,-0.717204,-0.460665,-0.047989,0.040978,-0.054262,0.019067,-0.412437,0.161885,0.086397,-0.202234,0.304916,-0.246528,-0.230728,0.105194,-0.402785,-0.689989,0.260279,0.272787,-0.064803,0.272452,0.134051,0.088473,-0.113517,-0.400128


In [0]:
import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, brier_score_loss


GBmodel = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', num_iterations=5000,
                            learning_rate=0.001, num_leaves=50, max_depth=-1, random_state=42,
                            min_data_in_leaf=10, class_weight='balanced', verbose=0,
                            lambda_l1=0.1, lambda_l2=0.01)

def make_train_test(x, y, model):
  dataframe = pd.concat([x, y], axis=1)
  skf = StratifiedKFold(n_splits=4, random_state=42)
  train_x, test_x, train_y, test_y = train_test_split(dataframe.iloc[:,0:x.shape[1]], 
                                                      dataframe.iloc[:,-(y.shape[1]):], 
                                                      test_size=0.2, shuffle=True, random_state=42,
                                                      stratify=dataframe.iloc[:,-(y.shape[1]):])
  
  performance = []
  feature_importance = []
  predicted_prob = []
  
  for i in range(y.shape[1]):
    for train_index, test_index in skf.split(train_x, train_y.iloc[:,i]):
      X_train, X_val = train_x.iloc[train_index], train_x.iloc[test_index]
      Y_train, Y_val = train_y.iloc[train_index, i], train_y.iloc[test_index, i]

      m = model.fit(X_train, Y_train, eval_metric=['binary_error', 'binary_logloss'],
                         eval_set=[(X_val, Y_val)], verbose=0)

    pred = m.predict(test_x)
    pred_prob = m.predict_proba(test_x)

    performance.append("balanced_accuracy_score : " + str(balanced_accuracy_score(test_y.iloc[:,i], pred)))
    performance.append("cohen_kappa_score : " + str(cohen_kappa_score(test_y.iloc[:,i], pred, weights='quadratic')))
    performance.append(classification_report(test_y.iloc[:,i], pred))
    performance.append(confusion_matrix(test_y.iloc[:,i], pred))
    performance.append("roc_auc_score : " + str(roc_auc_score(pd.get_dummies(test_y.iloc[:,i]), pred_prob)))
    performance.append("brier_score_loss : " + str(brier_score_loss(test_y.iloc[:,i], pred_prob[:,1])))

    feature_importance.append(pd.DataFrame(sorted(zip(m.feature_importances_, train_x.columns)), 
                                           columns=['Value','Feature']))

    predicted_prob.append(pred_prob)
  
  return performance, feature_importance, predicted_prob

In [0]:
keyword_pf, keyword_imp, keyword_prob = make_train_test(word_vec_df, vocab_df[['Critical Word Indicator']], GBmodel)

In [0]:
for p in range(len(keyword_pf)):
  print(keyword_pf[p])
  print('')

balanced_accuracy_score : 0.9204169708421008

cohen_kappa_score : 0.8361749602314583

              precision    recall  f1-score   support

           0       0.88      0.90      0.89       146
           1       0.95      0.94      0.95       319

    accuracy                           0.93       465
   macro avg       0.92      0.92      0.92       465
weighted avg       0.93      0.93      0.93       465


[[131  15]
 [ 18 301]]

roc_auc_score : 0.9762206381242753

brier_score_loss : 0.05620281118875214



### Solving the abbreviation problem searching its reference entities by self-defined function:

The word2vec model using cosine similarity as demonstrated above could not guanrantee a fully accurate matching to the reference words in full-form. I had to design a calculating function to ensure it was corresponding to the right names of factory. Beforehand, the abbreviation tokens were extracted and populated into lists, with their indices in the dataframe for identification.

Jaccard similarity was intuitively working in this abbreviation referencing context, because it measures the number of elements co-occurred in both tokens. For counting the number of unique tokens, the abbreviation token would be set as target, because this could eliminate the effect of the lengths of referred factory names. Given 2 names have been found some letter from the abbreviation token, longer name tends to have greater number of union items and hence diminishing its jaccard similarity score even though it may be the correct reference name.

<img src='https://i.ytimg.com/vi/Ah_4xqvS1WU/maxresdefault.jpg' width="400"/>

Yet, another problem would appear if the full form of the referred words contains some repeated letters of the short form, those words would tend to be scoring higher. Therefore, I added bonus points to the jaccard score if the sequence of the matched letters appeared in the referred words was exactly the same as the sequence of letters in the abbreviation token.


In [0]:
## for conversion of classified abbreviation
## extract a list of abbreviations
abbr = []
abbr_index = []
w_name = []
w_index = []
y_label = []

for r in range(len(vocab_df['Token'])):
  y_label.append(vocab_df['Abbreviation Token Indicator'][r])
y_label = np.array(y_label)

n = 0
kk = 1
for d in range(1, len(vocab_df)):
  if vocab_df['Item'][d]!=vocab_df['Item'][d-1]:
    kk += 1

for d2 in range(kk):
  start = 0
  while (start == 0) or (vocab_df['Item'][n]==vocab_df['Item'][n-1]):
    if y_label[n]==1:
      w_name.append(vocab_df['Token'][n])
      w_index.append(n)
    n += 1
    start += 1
    if n >= len(vocab_df):
      break
  
  abbr.append(w_name)
  abbr_index.append(w_index)
  w_name = []
  w_index = []

In [0]:
print(abbr[0:10])
print(abbr_index[0:10])

[['jp'], ['kh'], ['fw', 'kh'], [], ['cw', 'mp'], ['cw', 'mp'], ['cw', 'mp'], ['gs', 'gs'], ['wf', 'gm'], ['gs', 'gs']]
[[3], [13], [17, 19], [], [31, 36], [40, 45], [49, 54], [58, 65], [68, 69], [73, 80]]


In [0]:
## use character-by-character matching and jaccard similarity
def get_jacc_score(abbreviation, abbreviation_index, entity_list):
  
  def jaccard_similarity(abbr_w, ent_w):
    ## for characters in entity token; extract all intersected digits from the entity
    intersection = [value for value in ent_w if value in abbr_w]
    ## extract unique characters from the abbreviation expression; 
    ## not from the entity token; thus not considering the lengths of the entity strings
    union = [value for value in abbr_w if value not in ent_w]
    if len(union)==0:
        jacc_index_basic = len(intersection) / (len(union) + 1)
    else:
        jacc_index_basic = len(intersection) / len(union)
    
    return intersection, jacc_index_basic
  
  def intersect_sequence_generator(dictionary, intersect_list):                    
    intersect_idx = []
    for x2 in range(len(intersect_list)):
      idx = dictionary.get(intersect_list[x2])  
      intersect_idx.append(idx)
      
    return intersect_idx
  
  def jaccard_similarity_for_seq(abbr_s, ent_s):
    score = 0
    if len(ent_s)>0:
      for d in range(min(len(abbr_s), len(ent_s))):
        if ent_s[d]==abbr_s[d]:
          score += 1
    ## bonus point for all matches
    if score == min(len(abbr_s), len(ent_s)):
        score += 1
    
    return score
  
  jacc_sim = []
  jacc_sim_seq = []
  jacc_adjusted = []
  vdr_reference = []
  
  char_list_abbr = []
  for ca in abbreviation:
    char_list_abbr.append(ca)
    
  abbr_dict = {i:v for i,v in enumerate(char_list_abbr)}
  abbr_dict = {y:x for x,y in abbr_dict.items()}
  jacc_abbr_idx = []
  for x1 in range(len(abbreviation)):
    jacc_abbr_idx.append(abbr_dict.get(abbreviation[x1]))
  
  if vocab_df['Abbr_Name'][abbreviation_index]==1:
    start = 0
  elif vocab_df['Abbr_Class'][abbreviation_index]==1:
    start = 1

  if not (vocab_df['Abbreviation Token Indicator'][abbreviation_index]==0):
    for entity in range(start, len(entity_list), 2):
      char_list_entity = []
      for cb in entity_list[entity]:
        char_list_entity.append(cb)
      
      jacc_entity_matched_idx = []
      output_intersect, output_jacc_sim = jaccard_similarity(char_list_abbr, char_list_entity)
      output_entity_intersect_idx = intersect_sequence_generator(abbr_dict, output_intersect)
      jacc_entity_matched_idx.append(output_entity_intersect_idx)
      jacc_sim.append(output_jacc_sim)
      vdr_reference.append(entity_list[entity])

      jacc_score_seq = jaccard_similarity_for_seq(jacc_abbr_idx, output_entity_intersect_idx)
      jacc_sim_seq.append(jacc_score_seq)

      jacc_adjusted_score = output_jacc_sim + jacc_score_seq
      jacc_adjusted.append(jacc_adjusted_score)
  
  return abbr_dict, output_intersect, jacc_abbr_idx, jacc_entity_matched_idx, jacc_sim, jacc_sim_seq, jacc_adjusted, vdr_reference

In [0]:
abbr2entity_char2v = []

for v in range(kk):
  if len(abbr[v]) > 0:
    matched_entity_list = []
    for abb in range(len(abbr[v])):
      abbr_dict, intersection, jacc_abbr_idx, jacc_entity_matched_idx, jacc_sim, jacc_sim_seq, jacc_adjusted, vdr_reference \
          = get_jacc_score(abbr[v][abb], abbr_index[v][abb], combined_text[v]) 
      max_jacc = max(jacc_adjusted)
      arg_max_jacc = jacc_adjusted.index(max(jacc_adjusted))
      matched_entity = vdr_reference[arg_max_jacc]
      matched_entity_list.append(matched_entity)
    abbr2entity_char2v.append(matched_entity_list)
  else:
    abbr2entity_char2v.append([])

In [0]:
## some of matching results
print(abbr[0:3])
print(abbr2entity_char2v[0:3])

[['cw', 'mp'], ['lb', 'el'], ['mppl']]
[['combine will (indonesia)', 'micro plastics'], ['lucky bell', 'early light'], ['micro plastics']]


In [0]:
## substitute the abbreviations in the tokenized text
vocab_df['Rev_Token'] = vocab_df['Token']
for v in range(len(abbr_index)):
    for u in range(len(abbr_index[v])):
        vocab_df['Rev_Token'][abbr_index[v][u]] = abbr2entity_char2v[v][u]

### Learning to decode a sequence with the order of name entities + allocated percentages:

The short texts of instruction from production managers were treated as an encoder sequence, and artificially I would like the machine to be able to decode a sequence forcing it to follow an order of:
> factory name A, factory class A, percentage A, factory name B, factory class B, percentage B, ...

Inferring the percenatge attributed to the right factory purely from the instruction texts was challenging. Seq2Seq has been mainly applied in neural machine translations of different lanaguages. It may not be the best solution here to generate a fixed style of decision-indicating sequences for all kinds of encoded instructions, but as a scope for exploring data science approach to automate the process, it consumes less domain expertise and supervision in advance. 

The pre-processing steps included creating encoder and decoder texts, loading the tokens in the corpus into a key-index paired dictionary object separately for encoder and decoder, and transfer the string sequences to numerical indices readable to the model.

For trainning a seq2seq model, we delay the decoder input sequence by one time step by inserting a tag of "BOS" indicating the start of sequence with reference to the decoder output. Meanwhile, a "EOS" tag could indicate the end of sequence such that the model will learn to predict a "EOS" tag at certain time point, by not repeating sampling from the token pool. To maintain unique shape for the inputting array, maximum length of sequneces in the training samples would serve as a dimension, and all samples with a shorter sequence length than this maximum length would be padded with all zeros until reaching the maximum length. 

<img src='http://opennmt.net/OpenNMT/img/input_feed.png' width="300" align="center"/>



In [0]:
## generate encoder text
encoder = []
k = 0
n = 0
for i in range(0, len(tokenized_text)):
  k = n
  item_keyword = []
  while vocab_df['Item'][k]==vocab_df['Item'][k+1]:
    if vocab_df['Critical Word Indicator'][k]==1:
      item_keyword.append(vocab_df['Rev_Token'][k])
    k += 1
    n += 1
    if k >= len(vocab_df) - 1:
      break
  if vocab_df['Critical Word Indicator'][k]==1:
    item_keyword.append(vocab_df['Rev_Token'][k])
  n += 1
  for t in combined_text[i]:
      item_keyword.append(t)
  encoder.append(item_keyword)

In [0]:
## import decoder text (targeted output strings)
## we got 2 sets of instruction texts:
###  - one for allocation before finishing product development and first delivery
###  - one for allocation after finishing product development and first delivery

before_exfty = pd.read_excel('keyword_encode_2.xlsx', sheet_name=3)
after_exfty = pd.read_excel('keyword_encode_2.xlsx', sheet_name=4)

In [0]:
## generate decoder target sequence of words
def generate_decode_string(df, num_doc):
    _item = df['Item']
    _name = df['Name']
    _class = df['Class']
    _percent = df.iloc[:,-1]
    decode_list = []
    m = 0
    n = 0
    for g in range(num_doc):
        m = n
        decode_sublist = []
        while _item[m]==_item[m+1]:
            vdr_name = _name[m].lower()
            vdr_class = _class[m].lower()
            percent = _percent[m].lower()
            decode_sublist.append(vdr_name)
            decode_sublist.append(vdr_class)
            decode_sublist.append(percent)
            m += 1
            n += 1
            if m >= len(df) - 1:
                break
        vdr_name = _name[m].lower()
        vdr_class = _class[m].lower()
        percent = _percent[m].lower()
        decode_sublist.append(vdr_name)
        decode_sublist.append(vdr_class)
        decode_sublist.append(percent)
        n += 1
        decode_list.append(decode_sublist)
    return decode_list

decoder_before_exfty = generate_decode_string(before_exfty, len(encoder))
decoder_after_exfty = generate_decode_string(before_exfty, len(encoder))

In [0]:
## convert encoder & decoder dictionary
def process_seq2seq_encoder_input(encoder):
    reserved = {'<PAD>': 0, '<UNK>': 1}
    enc_list = [w for i in encoder for w in i]
    enc_dict = {e:i+2 for i,e in enumerate(set(enc_list))}
    enc_dict = {**reserved, **enc_dict}
    enc_seq = []
    ## reserved key-index for padding sequence length, out-of-dictionary words
    for e in range(len(encoder)):
      enc_sub_seq = []
      for se in encoder[e]:
        enc_sub_seq.append(enc_dict.get(se))
      enc_seq.append(enc_sub_seq)
    return enc_dict, enc_seq
    
def process_seq2seq_decoder_input(decoder):
    reserved = {'<PAD>': 0, '<UNK>': 1, '<BOS>':2, '<EOS>':3}
    dec_list = [w for i in decoder for w in i]
    dec_dict = {e:i+4 for i,e in enumerate(set(dec_list))}
    dec_dict = {**reserved, **dec_dict}
    dec_seq= []
    ## pad <BOS> and <EOS> at the beginning and ending of decoder inputs as indicator for teacher forcing in 3-D outputs
    ## (normally only applied to sentence level tokenization, i.e. multiple phrases in one list element)
    for f in range(len(decoder)):
      dec_sub_seq = []
      dec_sub_seq.append(dec_dict.get('<BOS>'))
      for sf in decoder[f]:
        dec_sub_seq.append(dec_dict.get(sf))
      dec_sub_seq.append(dec_dict.get('<EOS>'))
      dec_seq.append(dec_sub_seq)
    return dec_dict, dec_seq

## create an one-hot encoded vector for each token in positions of the sequence length 
def process_seq2seq_decoder_y(decoder_text, decoder_dict):
  max_length_de = max([len(x) for x in decoder_text])
  len_de = len(decoder_dict)
  decoder_output_label = np.zeros((len(decoder_text), max_length_de, len_de), dtype="float32")
  ## decoder output data would be ahead of decoder input data by one timestep
  for i, s1 in enumerate(decoder_text):
    for j, s2 in enumerate(s1):
      if j > 0:
        decoder_output_label[i][j-1][s2] = 1
  return decoder_output_label
  
encoder_dict, encoder_seq = process_seq2seq_encoder_input(encoder)
decoder_before_exfty_dict, decoder_before_exfty_seq = process_seq2seq_decoder_input(decoder_before_exfty)
decoder_after_exfty_dict, decoder_after_exfty_seq = process_seq2seq_decoder_input(decoder_after_exfty)
decoder_before_exfty_seq_y = process_seq2seq_decoder_y(decoder_before_exfty_seq, decoder_before_exfty_dict)
decoder_after_exfty_seq_y = process_seq2seq_decoder_y(decoder_after_exfty_seq, decoder_after_exfty_dict)

In [0]:
from keras.preprocessing.sequence import pad_sequences
## fill up to max length by zero (padding)
def padding(sequences, MAX_LEN):
  padded_seq = pad_sequences(sequences, maxlen=MAX_LEN, padding='post')
  return padded_seq

encoder_seq = padding(encoder_seq, max([len(x) for x in encoder_seq]))
decoder_before_exfty_seq = padding(decoder_before_exfty_seq, max([len(x) for x in decoder_before_exfty_seq]))
decoder_after_exfty_seq = padding(decoder_after_exfty_seq, max([len(x) for x in decoder_after_exfty_seq]))

In [0]:
print(encoder_seq[0:10])

[[ 34  49 122 124  91  54  54  95  49  95 124  95   0   0   0   0   0   0
    0   0   0]
 [129 116  83  25 116  25   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0]
 [ 79  83  79 116  83  25 116  25   0   0   0   0   0   0   0   0   0   0
    0   0   0]
 [107  35   7  85  85  83 116  83  25 116  25 107  95   0   0   0   0   0
    0   0   0]
 [ 11 110   5 127  12  48  72  11  95 127  25  48  95   0   0   0   0   0
    0   0   0]
 [ 11 110   5 127  12  48  72  11  95 127  25  48  95   0   0   0   0   0
    0   0   0]
 [ 11 110   5 127  12  48  72  11  95 127  25  48  95   0   0   0   0   0
    0   0   0]
 [129  95 102  71  13  37  95  74  37 124  95  62  25   0   0   0   0   0
    0   0   0]
 [ 92  99  79  99  25  92  25   0   0   0   0   0   0   0   0   0   0   0
    0   0   0]
 [129  95 102  71  13  37  95  74  37 131  25 124  95   0   0   0   0   0
    0   0   0]]


In [0]:
print(decoder_before_exfty_seq[0:10])

[[ 2 68 76 37 47 76 37 54 76 20  3  0  0  0  0  0  0]
 [ 2 48 63 20 38 63 62  3  0  0  0  0  0  0  0  0  0]
 [ 2 48 63 37 38 63 37  3  0  0  0  0  0  0  0  0  0]
 [ 2 48 63 37 38 63 37 24 76 20  3  0  0  0  0  0  0]
 [ 2 30 76 18 58 63 32 45 76 25  3  0  0  0  0  0  0]
 [ 2 30 76 18 58 63 32 45 76 25  3  0  0  0  0  0  0]
 [ 2 30 76 18 58 63 32 45 76 25  3  0  0  0  0  0  0]
 [ 2 54 76 20 14 63 62  3  0  0  0  0  0  0  0  0  0]
 [ 2  8 63 37 69 63 37  3  0  0  0  0  0  0  0  0  0]
 [ 2 71 63 37 54 76 37  3  0  0  0  0  0  0  0  0  0]]


In [0]:
## generate 3-D vectors of encoder inputs and decoder inputs
from keras.utils import to_categorical
encoder_seq_cat = to_categorical(encoder_seq, num_classes=len(encoder_dict))
decoder_before_exfty_seq_cat = to_categorical(decoder_before_exfty_seq, num_classes=len(decoder_before_exfty_dict))
decoder_after_exfty_seq_cat = to_categorical(decoder_after_exfty_seq, num_classes=len(decoder_after_exfty_dict))

In [0]:
print(decoder_before_exfty_seq_y.shape)
print(decoder_after_exfty_seq_y.shape)

(370, 17, 79)
(370, 17, 79)


In [0]:
print('decoder inputs at time t0')
print(decoder_before_exfty_seq_cat[0][0])
print('decoder outputs at time t0')
print(decoder_before_exfty_seq_y[0][0])

decoder inputs at time t0
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
decoder outputs at time t0
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


In [0]:
print('decoder inputs at time t1')
print(decoder_before_exfty_seq_cat[0][1])
print('decoder outputs at time t1')
print(decoder_before_exfty_seq_y[0][1])

decoder inputs at time t1
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
decoder outputs at time t1
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0.]


In [0]:
from random import sample, seed
from sklearn.utils import resample

len_sentence = len(tokenized_text)
_index = list(range(0, len_sentence))
seed(42)
train_idx = sample(range(0, len_sentence), int(len_sentence*0.96))

encoder_train = [encoder[re] for re in train_idx]
encoder_seq_train = [encoder_seq[re] for re in train_idx]
encoder_seq_cat_train = [encoder_seq_cat[re] for re in train_idx]
decoder_before_exfty_train = [decoder_before_exfty[re] for re in train_idx]
decoder_before_exfty_seq_train = [decoder_before_exfty_seq[re] for re in train_idx]
decoder_before_exfty_seq_cat_train = [decoder_before_exfty_seq_cat[re] for re in train_idx]
decoder_before_exfty_seq_y_train = [decoder_before_exfty_seq_y[re] for re in train_idx]
decoder_after_exfty_train = [decoder_after_exfty[re] for re in train_idx]
decoder_after_exfty_seq_train = [decoder_after_exfty_seq[re] for re in train_idx]
decoder_after_exfty_seq_cat_train = [decoder_after_exfty_seq_cat[re] for re in train_idx]
decoder_after_exfty_seq_y_train = [decoder_after_exfty_seq_y[re] for re in train_idx]

encoder_test = [encoder[re] for re in range(0, len_sentence) if re not in train_idx]
encoder_seq_test = [encoder_seq[re] for re in range(0, len_sentence) if re not in train_idx]
encoder_seq_cat_test = [encoder_seq_cat[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_before_exfty_test = [decoder_before_exfty[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_before_exfty_seq_test = [decoder_before_exfty_seq[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_before_exfty_seq_cat_test = [decoder_before_exfty_seq_cat[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_before_exfty_seq_y_test = [decoder_before_exfty_seq_y[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_after_exfty_test = [decoder_after_exfty[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_after_exfty_seq_test = [decoder_after_exfty_seq[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_after_exfty_seq_cat_test = [decoder_after_exfty_seq_cat[re] for re in range(0, len_sentence) if re not in train_idx]
decoder_after_exfty_seq_y_test = [decoder_after_exfty_seq_y[re] for re in range(0, len_sentence) if re not in train_idx]

In [0]:
encoder_seq_train = np.array(encoder_seq_train)
encoder_seq_cat_train = np.array(encoder_seq_cat_train)
decoder_before_exfty_seq_train = np.array(decoder_before_exfty_seq_train)
decoder_before_exfty_seq_cat_train = np.array(decoder_before_exfty_seq_cat_train)
decoder_before_exfty_seq_y_train = np.array(decoder_before_exfty_seq_y_train)
decoder_after_exfty_seq_train = np.array(decoder_after_exfty_seq_train)
decoder_after_exfty_seq_cat_train = np.array(decoder_after_exfty_seq_cat_train)
decoder_after_exfty_seq_y_train = np.array(decoder_after_exfty_seq_y_train)

encoder_seq_test = np.array(encoder_seq_test)
encoder_seq_cat_test = np.array(encoder_seq_cat_test)
decoder_before_exfty_seq_test = np.array(decoder_before_exfty_seq_test)
decoder_before_exfty_seq_cat_test = np.array(decoder_before_exfty_seq_cat_test)
decoder_before_exfty_seq_y_test = np.array(decoder_before_exfty_seq_y_test)
decoder_after_exfty_seq_test = np.array(decoder_after_exfty_seq_test)
decoder_after_exfty_seq_cat_test = np.array(decoder_after_exfty_seq_cat_test)
decoder_after_exfty_seq_y_test = np.array(decoder_after_exfty_seq_y_test)

In [0]:
print(encoder_seq_train.shape)
print(encoder_seq_cat_train.shape)
print(decoder_before_exfty_seq_train.shape)
print(decoder_before_exfty_seq_cat_train.shape)
print(decoder_after_exfty_seq_train.shape)
print(decoder_after_exfty_seq_cat_train.shape)
print(decoder_before_exfty_seq_y_train.shape)
print(decoder_after_exfty_seq_y_train.shape)

(355, 21)
(355, 21, 135)
(355, 17)
(355, 17, 79)
(355, 17)
(355, 17, 79)
(355, 17, 79)
(355, 17, 79)


### Seq2Seq model settings:

Small sample size and bootstrapping:
> My dataset was small containing only 370 instruction short texts, and upon splitting, 96% (355) would be used for training. I created 10 bootstrap samples here to repeat the training batches validating on 15 samples for each, but for reproducibility, I specified the random states.

Word Embedding:
> The embedding layer was found to be critical in boosting the performance of the seq2seq model, probably because the one-hot encoded arrays had the problem of suffering sparsity that was not easily regularized. Word embedding created latent dimensions to represent the features of the words, same as the concepts adopted in the word2vec model. Both encoder and decoder sequences were trained with 500-dimension embeddings.

Encoder structure:
> Bi-direcional LSTM structure was used with 250 units of cells, outputting in total 500 dimensions of hidden state vectors by setting *return_sequence = True*.
> The forward-propagating and backward-propagating encoder hidden state vectors would be concatenated, and passed to be the initial hidden state for decoder.

<img src='http://opennmt.net/OpenNMT/img/brnn.png'/>

Decoder structure:
> Two-layer stacked LSTM was used as the decoder. The initial decoder input was set to be the "BOS" tag. Teacher forcing was implemented by outputting one token index with highest probability at current time step, and passing this token as the input for next token prediction, recursively throughout the maximum equence length.

Attention mechanism (Luong Attention):
> Attention has been invented in the researches of seq2seq models. The basic idea resembling the behaviour of human reading a sentence is that people would recognize the context from each of the words in encoder, rather than simply read through all words at one time to memorize it, and decode a sentence. Since during decoding, only the hidden states of the previous LSTM output would be used to predict the next token, this could sacrifice some of the information conveyed in the encoder. 

> Against this problem, a weighted score was calculated for each encoder hidden state dotted on the current decoder hidden state. A context vector would be produced on the softmax-activated attention scores. Eventually the context vector would be concatenated with the current decoder hidden state, performing tanh and softmax activation to get a probability for each token in the corpus of decoder dictionary.

<img src='http://opennmt.net/OpenNMT/img/global-attention-model.png'/>


In [0]:
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Activation, Concatenate, Dot
from keras.optimizers import Adam, RMSprop
from nltk.translate.bleu_score import sentence_bleu

Using TensorFlow backend.


In [0]:
## Encoder structure with Bi-directional LSTM
## return only states from encoder to pass to decoder

encoder_inputs = Input(shape=(None, ))
encoder_embed = Embedding(input_dim=135, output_dim=500)(encoder_inputs)
encoder_LSTM = Bidirectional(LSTM(250, return_state=True, return_sequences=True))
encoder_hidden_vec, forward_last_h, forward_last_c, backward_last_h, backward_last_c = encoder_LSTM(encoder_embed)
enc_state_last_h = Concatenate()([forward_last_h, backward_last_h])
enc_state_last_c = Concatenate()([forward_last_c, backward_last_c])
encoder_states = [enc_state_last_h, enc_state_last_c]

## Decoder structure with 2-layer stacked LSTM
decoder_inputs = Input(shape=(None, ))
decoder_embed = Embedding(input_dim=79, output_dim=500)(decoder_inputs)
decoder_LSTM = LSTM(units=500, return_state=True, return_sequences=True)
decoder_LSTM_layer = decoder_LSTM(decoder_embed, initial_state = encoder_states)
decoder_LSTM2 = LSTM(units=500, return_state=True, return_sequences=True)
decoder_hidden_vec, dec_state_last_h, dec_state_last_c = decoder_LSTM2(decoder_LSTM_layer)
    
## Attention mechanism
attention_score = Dot([2,2])([decoder_hidden_vec, encoder_hidden_vec])
attention_weight = Activation('softmax')(attention_score)
context = Dot([2,1])([attention_weight, encoder_hidden_vec])
decoder_outputs_combined_context = Concatenate()([context, decoder_hidden_vec])
hidden_state_outputs = TimeDistributed(Dense(500, activation='tanh'))(decoder_outputs_combined_context)
outputs = TimeDistributed(Dense(79, activation='softmax'))(hidden_state_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_9 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_9 (Embedding)         (None, None, 500)    67500       input_9[0][0]                    
__________________________________________________________________________________________________
input_10 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
bidirectional_5 (Bidirectional) [(None, None, 500),  1502000     embedding_9[0][0]                
__________________________________________________________________________________________________
embedding_

In [0]:
## Seq2Seq Model - number-of-sample-sequence-length 2D inputs with embedding layer
def seq2seq_2D_embedding_one_hot_seq_attention(encoder_seq_train, 
                                               decoder_input_seq_train, decoder_output_seq_train, 
                                               encoder_seq_test, 
                                               decoder_input_seq_test, decoder_output_seq_test,
                                               encoder_dict, decoder_dict,
                                               batch_size, num_epochs):
  
  ## fitting with 10 bootstrap samples
  random_state = [1, 4, 20, 21, 42, 99, 101, 111, 231, 999]

  def bootstrap_samples(num_training_samples, self_defined_random_state, 
                        encoder_training_samples, 
                        decoder_training_samples, decoder_training_samples_output):
    sample_index = list(range(0, num_training_samples))
    boot = resample(sample_index, replace=False, 
                    n_samples = int(num_training_samples*0.96), 
                    random_state = self_defined_random_state)
    
    enc_train = [encoder_training_samples[ref] for ref in boot]
    enc_val = [encoder_training_samples[ref] for ref in range(0, len(encoder_training_samples)) if ref not in boot]
    
    dec_train_in = [decoder_training_samples[ref] for ref in boot]
    dec_val_in = [decoder_training_samples[ref] for ref in range(0, len(decoder_training_samples)) if ref not in boot]
    
    dec_train_out = [decoder_training_samples_output[ref] for ref in boot]
    dec_val_out = [decoder_training_samples_output[ref] for ref in range(0, len(decoder_training_samples_output)) if ref not in boot]
    
    enc_train = np.array(enc_train)
    enc_val = np.array(enc_val)
    dec_train_in = np.array(dec_train_in)
    dec_val_in = np.array(dec_val_in)
    dec_train_out = np.array(dec_train_out)
    dec_val_out = np.array(dec_val_out)
    
    return enc_train, enc_val, dec_train_in, dec_val_in, dec_train_out, dec_val_out
  
  ## Using 2D array inputs (arrays of max sequence length) WITH Embedding

  def define_seq2seq_model_embedding(encoder_dict, decoder_dict, encoder, decoder):

    ## embedding layer shape => number of unique words in the dictionary
    ## a) Training part
    ## b) Inference part

    len_en = len(encoder_dict)
    len_de = len(decoder_dict)
    max_length_en = max([len(x) for x in encoder])
    max_length_de = max([len(x) for x in decoder])

    ## Encoder structure with Bi-directional LSTM
    ## return only states from encoder to pass to decoder

    encoder_inputs = Input(shape=(None, ))
    encoder_embed = Embedding(input_dim=len_en, output_dim=500)(encoder_inputs)
    encoder_LSTM = Bidirectional(LSTM(250, return_state=True, return_sequences=True))
    encoder_hidden_vec, forward_last_h, forward_last_c, backward_last_h, backward_last_c = encoder_LSTM(encoder_embed)
    enc_state_last_h = Concatenate()([forward_last_h, backward_last_h])
    enc_state_last_c = Concatenate()([forward_last_c, backward_last_c])
    encoder_states = [enc_state_last_h, enc_state_last_c]

    ## Decoder structure with 2-layer stacked LSTM
    decoder_inputs = Input(shape=(None, ))
    decoder_embed = Embedding(input_dim=len_de, output_dim=500)(decoder_inputs)
    decoder_LSTM = LSTM(units=500, return_state=True, return_sequences=True)
    decoder_LSTM_layer = decoder_LSTM(decoder_embed, initial_state = encoder_states)
    decoder_LSTM2 = LSTM(units=500, return_state=True, return_sequences=True)
    decoder_hidden_vec, dec_state_last_h, dec_state_last_c = decoder_LSTM2(decoder_LSTM_layer)

    ## Attention mechanism
    attention_score = Dot([2,2])([decoder_hidden_vec, encoder_hidden_vec])
    attention_weight = Activation('softmax')(attention_score)
    context = Dot([2,1])([attention_weight, encoder_hidden_vec])
    decoder_outputs_combined_context = Concatenate()([context, decoder_hidden_vec])
    hidden_state_outputs = TimeDistributed(Dense(500, activation='tanh'))(decoder_outputs_combined_context)
    outputs = TimeDistributed(Dense(len_de, activation='softmax'))(hidden_state_outputs)
    
    model = Model([encoder_inputs, decoder_inputs], outputs)
    
    return model

  ## function to generate target given source sequence
  def predict_sequence_embedding(model, input_encoder_seq, n_steps_in_seq):
    # set zero for the start of the target sequence
    dec_input = np.zeros((1, n_steps_in_seq))
    # populate the <BOS> tag of the targeted generated sequence
    dec_input[0, 0] = 2
    # initializations
    output = []
    for t in range(n_steps_in_seq):
      # predict next element (token) from decoder model
      dec_output = model.predict([input_encoder_seq, dec_input])
      output.append(dec_output[0,t,:])
      # update target sequence recurrently
      # with teacher forcing: search for the activated index and update the sequence positions as the next input
      # without teacher forcing: use its own prediction probability as the next input
      activated_index = np.argmax(dec_output[0,t,:])
      if t+1 < n_steps_in_seq:
        dec_input[0, t+1] = activated_index
      
    return np.array(output)
    
  ## main part operations
  predicted_seq = []
  validated_seq = []
  bleu = []
  bleu_sample = []
  avg_acc = []
  avg_acc_positive = []
  accuracy_per_run = []
  accuracy_per_run_positive = []
  training_history = []

  # call model for training
  model = define_seq2seq_model_embedding(encoder_dict, decoder_dict, 
                                         encoder_seq_train, decoder_input_seq_train)
  model.compile(optimizer=RMSprop(lr=0.00001), loss='categorical_crossentropy', metrics=['acc'])
  
  for b in range(len(random_state)):
    enc_train, enc_val, dec_train_in, dec_val_in, dec_train_out, dec_val_out = \
    bootstrap_samples(len(encoder_seq_train), random_state[b], 
                      encoder_seq_train, decoder_input_seq_train, decoder_output_seq_train)
    # training the main model
    model.fit([enc_train, dec_train_in], dec_train_out, 
              batch_size=10, epochs=40, validation_data=([enc_val, dec_val_in], dec_val_out))

    # make predictions using the inference models
    n_steps_in_seq = len(decoder_output_seq_train[0])
    inference_seq = []
    for t in range(len(encoder_seq_test)):
      y_estimated = predict_sequence_embedding(model, encoder_seq_test[t].reshape(1, encoder_seq_test[t].shape[0]), n_steps_in_seq)
      inference_seq.append(y_estimated)

    acc_score = 0
    total = 0

    for samples in range(len(inference_seq)):
      pred = []
      actual = []
      for p in range(len(inference_seq[samples])):
        total += 1
        predicted_token_index = np.argmax(inference_seq[samples][p])
        validated_token_index = np.argmax(decoder_output_seq_test[samples][p])
        predicted_token = list(decoder_dict.keys())[list(decoder_dict.values()).index(predicted_token_index)]
        validated_token = list(decoder_dict.keys())[list(decoder_dict.values()).index(validated_token_index)]
        if predicted_token_index==validated_token_index:
          acc_score += 1
        pred.append(predicted_token)
        actual.append(validated_token)
      predicted_seq.append(pred)
      validated_seq.append(actual)
      bleu_sample.append(sentence_bleu([pred], actual))
      accuracy = acc_score / total
      accuracy_per_run.append(accuracy)
      
    avg_acc.append(np.mean(np.array(accuracy_per_run)))
    bleu.append(np.mean(np.array(bleu_sample)))
    
  return model, inference_seq, accuracy_per_run, avg_acc, bleu_sample, bleu, predicted_seq, validated_seq, training_history

### Model evaluation:

Accuracy was measured on each token predicted. It turned out to be fairly good close to 0.2, since "PAD" tags were placed after the "EOS" tag in the actual output, while the model kept giving "EOS" tag as observed in below test samples. The real accuracy score should be higher than shown here as "PAD" and "EOS" contextually made no difference. BLEU score is generally a better metric for quantifying the performance of seq2seq model, which depends on the the counts of matched n-gram tokens in the predicted sequence. The results attained over 0.4 for BLEU.

For the first comparison of predicted sequence and the actual validated sequence below, the model predicted correctly after training with 2 bootstrap samples. Over-fitting probably occurred as it continued training with the remaining bootstrap samples. The second comparison quoted successfully predicted the target allocation percentage after training with the bootstrap samples at the 5th, 6th and 7th run.

It showed seq2seq had potentials in sequential relationship learning problems, given one did not want to manually investigate each instruction texts, but definitely more data would be required to feed the model to learn with a more robust performance.


In [0]:
model_before_exfty_2, infer_seq_before_exfty_2, accuracy_per_run_before_exfty_2, \
avg_acc_before_exfty_2, bleu_sample_before_exfty_2, bleu_before_exfty_2, \
pred_before_exfty_2, val_before_exfty_2, hist_before_exfty_2 \
= seq2seq_2D_embedding_one_hot_seq_attention(encoder_seq_train, 
                                             decoder_before_exfty_seq_train, 
                                             decoder_before_exfty_seq_y_train,
                                             encoder_seq_test, 
                                             decoder_before_exfty_seq_test, 
                                             decoder_before_exfty_seq_y_test,
                                             encoder_dict, decoder_before_exfty_dict, 
                                             batch_size=10, num_epochs=40)

In [0]:
print(["%.4f" % d for d in avg_acc_before_exfty_2])
print(["%.4f" % d for d in bleu_before_exfty_2])

['0.1653', '0.1680', '0.1723', '0.1734', '0.1784', '0.1784', '0.2070', '0.2064', '0.2029', '0.2037']
['0.4337', '0.4198', '0.4300', '0.4242', '0.4222', '0.4280', '0.4214', '0.4132', '0.4140', '0.4114']


In [0]:
model_after_exfty_2, infer_seq_after_exfty_2, accuracy_per_run_after_exfty_2, \
avg_acc_after_exfty_2, bleu_sample_after_exfty_2, bleu_after_exfty_2, \
pred_after_exfty_2, val_after_exfty_2, hist_after_exfty_2 \
= seq2seq_2D_embedding_one_hot_seq_attention(encoder_seq_train, 
                                             decoder_after_exfty_seq_train, 
                                             decoder_after_exfty_seq_y_train,
                                             encoder_seq_test, 
                                             decoder_after_exfty_seq_test, 
                                             decoder_after_exfty_seq_y_test,
                                             encoder_dict, decoder_after_exfty_dict, 
                                             batch_size=10, num_epochs=40)

Train on 340 samples, validate on 15 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Train on 340 samples, validate on 15 samples
Epoch 1/40
 10/340 [..............................] - ETA: 5s - loss: 0.9289 - acc: 0.2000

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Train on 340 samples, validate on 15 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Train on 340 samples, va

In [0]:
print(["%.4f" % d for d in avg_acc_after_exfty_2])
print(["%.4f" % d for d in bleu_after_exfty_2])

['0.1785', '0.1957', '0.1919', '0.1864', '0.1840', '0.1829', '0.1814', '0.1800', '0.1805', '0.1812']
['0.4775', '0.4410', '0.4025', '0.4046', '0.4142', '0.4044', '0.3938', '0.3837', '0.3837', '0.3758']


In [0]:
## Evaluate the first example:

print("Encoder inputs:")
print(encoder_test[14])
print("Decoder outputs:")
print(decoder_after_exfty_test[14])
print("Predicted decoder:")
for x in range(14,150,15):
  print(pred_after_exfty_2[x])

Encoder inputs:
['ws', '30%', 'jp', '70%', 'jp', 'global sourcing', 'ws', 'china']
Decoder outputs:
['jp', 'global sourcing', '70%', 'ws', 'china', '30%']
Predicted decoder:
['jp', 'global sourcing', 'ws', 'ws', 'china', 'china', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['jp', 'global sourcing', '70%', 'ws', 'china', '30%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '0%', '0%', '0%', '0%', '0%']
['jp', 'global sourcing', '70%', 'ws', 'china', '70%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '0%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['jp', 'global sourcing', '30%', 'ws', 'china', '70%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['jp', 'global sourcing', '30%', 'ws', 'china', '70%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['jp', 'global sourcing', '30%', 'ws', 'china', '30%', '<EOS>', '<EOS>', 

In [0]:
## Evaluate the second example:

print("Encoder inputs:")
print(encoder_test[10])
print("Decoder outputs:")
print(decoder_after_exfty_test[10])
print("Predicted decoder:")
for x in range(10,150,15):
  print(pred_after_exfty_2[x])

Encoder inputs:
['dream', 'dual', 'with', 'plush', 'mb', 'china', 'plush', 'global sourcing', 'dream', 'global sourcing']
Decoder outputs:
['mb', 'china', '0%', 'plush', 'global sourcing', '50%', 'dream', 'global sourcing', '50%']
Predicted decoder:
['mp', 'global sourcing', 'global sourcing', 'china', 'china', 'global sourcing', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['mp', 'global sourcing', '50%', 'wf', 'china', '50%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['mp', 'global sourcing', '50%', 'wf', 'china', '50%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['rs (vietnam)', 'global sourcing', '50%', 'plush', 'global sourcing', '50%', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['dream', 'global sourcing', '50%', 'plush', 'global sourcing', '50%', '<EOS>', '<EO