In [1]:
# used to customize the display/format of the cells
# for more info and resources check these links:
# http://stackoverflow.com/questions/34303422/how-to-change-the-font-size-and-color-of-markdown-cell-in-ipython-py-2-7-noteb
# http://nbviewer.jupyter.org/github/Carreau/posts/blob/master/Blog1.ipynb
from IPython.display import HTML
HTML("""
<style>

div.cell { /* Tunes the space between cells */
margin-top:0.5em;
margin-bottom:0.5em;
}

div.text_cell_render h1 { /* Main titles bigger, left-aligned */
font-size: 1.7em;
line-height:1.1em;
text-align:left;
}

div.text_cell_render h2 { /* Section titles closer to the text */
margin-bottom: -0.4em;
}


div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.3em;
line-height:1.3em;
padding-left:1em;
padding-right:1em;
}
</style>
""")

In this tutorial, we will learn how to:
<ul>
<li> read <span style="font-weight:bold">sequences</span> from a file (including a discussion of the <span style="font-weight:bold">file/input format</span>)</li>
<li> represent <span style="font-weight:bold">sequences</span> to be later used by the <a href="https://bitbucket.org/A_2/pyseqlab">PySeqLab package</a> for
building/training models.</li>
</ul>

We start by discussing what sequences are and the file format used to store them.

# Representing sequences

Generally speaking, a sequence is a list of elements that follow an order [<a href="https://en.wikipedia.org/wiki/Sequence">wiki</a>]. The order may come from an inherent structure, as in sentences (sequences of words), or from time, as in readings/measurements from a sensor.
<br/>

<span style="font-weight:bold">Sequence labeling</span> is an important task in multiple domains where given a sequence of observations, the goal is to label/tag each observation using a set of permissible tags that represent higher order syntactic structure. 
For example, given a sentence (sequence of words), the goal is to tag/label each word by its <a href="https://en.wikipedia.org/wiki/Part_of_speech">part-of-speech</a>.
<br/>

An example of a sequence labeling task is chunking/shallow parsing using the <a href="http://www.cnts.ua.ac.be/conll2000/chunking/">CoNLL 2000 dataset</a>. Given a set of sentences (our sequences) where each sentence is composed of <span style="font-weight:bold;color:green;">words</span> and their corresponding <span style="font-weight:bold;color:green;">part-of-speech</span> tags, the goal is to predict the <span style="font-weight:bold;color:green;">chunk/shallow parse label</span> of every word in the sentence.

With these preliminary definitions in mind, we can start our investigation of how to represent/parse sequences. In this tutorial we will use CoNLL sentences as an example of sequences.

# Input file format

The input file comprising the sequences follows a <span style="font-weight:bold;color:red;">column-format</span> template. Sequences are separated by a blank line, and the observations/elements of each sequence are laid out each on a separate line. The last column is reserved for the tag/label that we aim to predict.
<br/>

The dataset files (training and test files) of the CoNLL task follow this input file template, which can hold the sequences for any sequence labeling/tagging task. An excerpt of the training file:
<pre style="font-size:0.8em;">
w pos chunk
Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
to TO I-VP
take VB I-VP
another DT B-NP
sharp JJ I-NP
dive NN I-NP
if IN B-SBAR
trade NN B-NP
figures NNS I-NP
for IN B-PP
September NNP B-NP
, , O
due JJ B-ADJP
for IN B-PP
release NN B-NP
tomorrow NN B-NP
, , O
fail VB B-VP
to TO I-VP
show VB I-VP
a DT B-NP
substantial JJ I-NP
improvement NN I-NP
from IN B-PP
July NNP B-NP
and CC I-NP
August NNP I-NP
's POS B-NP
near-record JJ I-NP
deficits NNS I-NP
. . O

Chancellor NNP O
of IN B-PP
the DT B-NP
Exchequer NNP I-NP
Nigel NNP B-NP
Lawson NNP I-NP
's POS B-NP
restated VBN I-NP
commitment NN I-NP
to TO B-PP
a DT B-NP
firm NN I-NP
monetary JJ I-NP
policy NN I-NP
has VBZ B-VP
helped VBN I-VP
to TO I-VP
prevent VB I-VP
a DT B-NP
freefall NN I-NP
in IN B-PP
sterling NN B-NP
over IN B-PP
the DT B-NP
past JJ I-NP
week NN I-NP
. . O
</pre>

Looking at the two sentences, we can identify two tracks of observations: (1) words and (2) part-of-speech. The two tracks appear as separate columns separated by a space, and the last column represents the label/tag. Sentences are separated by a blank line. To be consistent with the terminology, we will use the following terms/definitions:
<ul>
<li><span style="font-weight:bold;color:green;">sequence</span>: to refer to a list of elements that follow an order</li>
<li><span style="font-weight:bold;color:green;">observation</span>: to refer to an element in the sequence</li>
<li><span style="font-weight:bold;color:green;">track</span>: to refer to different types of observations. In the chunking example, we have a track for the words and another for the part-of-speech</li>
<li><span style="font-weight:bold;color:green;">label/tag</span>: to refer to the outcome/class we want to predict</li>
</ul>

This file format can support as many tracks as we want: new tracks are added as separate columns, while the last column is kept for the tag/label.
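To make the column-format template concrete, here is a minimal sketch in plain Python that splits such text into sequences. It is only an illustration of the format and not PySeqLab's parser; the function name `parse_column_format` and the `(tracks, labels)` return shape are assumptions made for this example.

```python
def parse_column_format(text, column_sep=" "):
    """Parse column-format text that has one header line on top.

    Returns a list of (tracks, labels) pairs, one per sequence, where
    tracks is a list of {track_name: observation} dictionaries.
    """
    lines = text.strip("\n").split("\n")
    header = lines[0].split(column_sep)
    track_names = header[:-1]          # the last column is the label/tag
    sequences, tracks, labels = [], [], []
    for line in lines[1:]:
        if not line.strip():           # a blank line ends the current sequence
            if tracks:
                sequences.append((tracks, labels))
                tracks, labels = [], []
            continue
        *observations, label = line.split(column_sep)
        tracks.append(dict(zip(track_names, observations)))
        labels.append(label)
    if tracks:                         # flush the last sequence
        sequences.append((tracks, labels))
    return sequences


# two short sequences taken from the beginning of the excerpt above
sample = """w pos chunk
Confidence NN B-NP
in IN B-PP

Chancellor NNP O
of IN B-PP
"""

parsed = parse_column_format(sample)
for tracks, labels in parsed:
    print(tracks, labels)
```

Adding a track to the file only adds a column before the label, so the same sketch keeps working because the track names are read from the header.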

To read this file format, <a href="https://bitbucket.org/A_2/pyseqlab">PySeqLab</a> provides a parser, <span style="font-weight:bold;color:blue;">DataFileParser()</span>, under the utilities module.
<br/>

As a reminder, a visual tree directory for the dataset folder under the current directory (<span style="font-weight:bold;">tutorials</span>) is provided below:
<pre style="font-size:0.8em;">
|---tutorials
|        |---datasets
|        |       |---conll2000
|        |       |        |---test.txt
|        |       |        |---train.txt
</pre>

# Parse sequences (sentences)

The data file parser <span style="font-weight:bold;color:blue;">DataFileParser()</span> has a <span style="font-weight:bold;color:blue;">read_file()</span> method with the following:
<br/>

<span style="font-weight:bold;">Arguments</span>:
<ul>
<li><span style="font-weight:bold;">file_path</span>: (string), directory/path to the file to be read</li>
<li><span style="font-weight:bold;">header</span>: (string or list)
    <ul><li><span style="font-weight:bold;color:green;">'main'</span>: in case there is only one header at the top of the file (like the CoNLL dataset file)</li>
        <li><span style="font-weight:bold;color:green;">'per_sequence'</span>: in case there is a header before every sequence</li>
        <li>a list of keywords such as <span style="font-weight:bold;color:green;">['w', 'part_of_speech']</span> in case no header is provided in the file</li>
    </ul>
</li>
</ul>

<span style="font-weight:bold;">Keyword arguments</span>:
<ul>
<li><span style="font-weight:bold;">y_ref</span>: (boolean), specifying if the last column is the tag/label column</li>
<li><span style="font-weight:bold;">seg_other_symbol</span>: (string or None as default), it decides whether we parse sequences or segments. We will explain this in detail later</li>
<li><span style="font-weight:bold;">column_sep</span>: (string) specifying the separator between the tracks (columns of observations) to be read</li>
</ul>

For the CoNLL task, we set both the arguments and the keyword arguments in the following cells to read the training file.

In [11]:
# importing and defining relevant directories
import sys
import os
# docs directory
docs_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
# pyseqlab package directory
package_dir = os.path.abspath(os.path.join(docs_dir, os.pardir))
print("package dir:", package_dir)
# inserting the pyseqlab directory into Python's system path -- if pyseqlab is already installed this could be commented out
sys.path.insert(0, package_dir)
# the project directory is the docs directory defined above
print("project_dir:", docs_dir)
# current directory (tutorials)
tutorials_dir = os.path.join(docs_dir, 'tutorials')
print("tutorials_dir:", tutorials_dir)
dataset_dir = os.path.join(tutorials_dir, 'datasets', 'conll2000')
print("dataset_dir:", dataset_dir)


package dir: /home/aa/git/pyseqlab_exp
project_dir: /home/aa/git/pyseqlab_exp/docs
tutorials_dir: /home/aa/git/pyseqlab_exp/docs/tutorials
dataset_dir: /home/aa/git/pyseqlab_exp/docs/tutorials/datasets/conll2000


In [13]:
from pyseqlab.utilities import DataFileParser
# initialize a data file parser
dparser = DataFileParser()
# provide the options to the parser, such as the header info, the column separator and whether the y label is present in the file
# main means the header is found in the first line of the file
header = "main"
# y_ref is a boolean indicating if the label to predict is already found in the file
y_ref = True
# separator between the columns (tracks of observations)
column_sep = " "
seqs = []
for seq in dparser.read_file(os.path.join(dataset_dir, 'train.txt'), header, y_ref=y_ref, column_sep=column_sep):
    seqs.append(seq)
    
# printing one sequence for display
print(seqs[0])
print("type(seq):", type(seqs[0]))
print("number of parsed sequences is: ", len(seqs))

Y sequence:
 ['B-NP', 'B-PP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'I-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-SBAR', 'B-NP', 'I-NP', 'B-PP', 'B-NP', 'O', 'B-ADJP', 'B-PP', 'B-NP', 'B-NP', 'O', 'B-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-NP', 'I-NP', 'I-NP', 'O']
X sequence:
 {1: {'pos': 'NN', 'w': 'Confidence'}, 2: {'pos': 'IN', 'w': 'in'}, 3: {'pos': 'DT', 'w': 'the'}, 4: {'pos': 'NN', 'w': 'pound'}, 5: {'pos': 'VBZ', 'w': 'is'}, 6: {'pos': 'RB', 'w': 'widely'}, 7: {'pos': 'VBN', 'w': 'expected'}, 8: {'pos': 'TO', 'w': 'to'}, 9: {'pos': 'VB', 'w': 'take'}, 10: {'pos': 'DT', 'w': 'another'}, 11: {'pos': 'JJ', 'w': 'sharp'}, 12: {'pos': 'NN', 'w': 'dive'}, 13: {'pos': 'IN', 'w': 'if'}, 14: {'pos': 'NN', 'w': 'trade'}, 15: {'pos': 'NNS', 'w': 'figures'}, 16: {'pos': 'IN', 'w': 'for'}, 17: {'pos': 'NNP', 'w': 'September'}, 18: {'pos': ',', 'w': ','}, 19: {'pos': 'JJ', 'w': 'due'}, 20: {'pos': 'IN', 'w': 'for'}, 21: {'pos': 'NN', 'w': 'release'}, 2

In [16]:
seq = seqs[0]
print("X:")
print(seq.X)
print("-"*40)
print("Y:")
print(seq.Y)
print("-"*40)
print("flat_y:")
print(seq.flat_y)
print("-"*40)

X:
{1: {'pos': 'NN', 'w': 'Confidence'}, 2: {'pos': 'IN', 'w': 'in'}, 3: {'pos': 'DT', 'w': 'the'}, 4: {'pos': 'NN', 'w': 'pound'}, 5: {'pos': 'VBZ', 'w': 'is'}, 6: {'pos': 'RB', 'w': 'widely'}, 7: {'pos': 'VBN', 'w': 'expected'}, 8: {'pos': 'TO', 'w': 'to'}, 9: {'pos': 'VB', 'w': 'take'}, 10: {'pos': 'DT', 'w': 'another'}, 11: {'pos': 'JJ', 'w': 'sharp'}, 12: {'pos': 'NN', 'w': 'dive'}, 13: {'pos': 'IN', 'w': 'if'}, 14: {'pos': 'NN', 'w': 'trade'}, 15: {'pos': 'NNS', 'w': 'figures'}, 16: {'pos': 'IN', 'w': 'for'}, 17: {'pos': 'NNP', 'w': 'September'}, 18: {'pos': ',', 'w': ','}, 19: {'pos': 'JJ', 'w': 'due'}, 20: {'pos': 'IN', 'w': 'for'}, 21: {'pos': 'NN', 'w': 'release'}, 22: {'pos': 'NN', 'w': 'tomorrow'}, 23: {'pos': ',', 'w': ','}, 24: {'pos': 'VB', 'w': 'fail'}, 25: {'pos': 'TO', 'w': 'to'}, 26: {'pos': 'VB', 'w': 'show'}, 27: {'pos': 'DT', 'w': 'a'}, 28: {'pos': 'JJ', 'w': 'substantial'}, 29: {'pos': 'NN', 'w': 'improvement'}, 30: {'pos': 'IN', 'w': 'from'}, 31: {'pos': 'NNP', 

The parser reads 8936 sequences from the training file. Each sequence is an instance of the <span style="font-weight:bold;color:blue;">SequenceStruct</span> class, which is also
found under the utilities module in the <a href="https://bitbucket.org/A_2/pyseqlab">PySeqLab package</a>.
The three main attributes of a sequence are as follows:

<ul>
<li><span style="font-weight:bold;">X</span>: dictionary of dictionaries that holds, for every position, the observations keyed by their track names (extracted from the header while parsing the file). Example:
<pre style="font-size:0.8em;">
X:
{1: {'pos': 'NN', 'w': 'Confidence'}, 2: {'pos': 'IN', 'w': 'in'}, 3: {'pos': 'DT', 'w': 'the'}, 4: {'pos': 'NN', 'w': 'pound'}, 5: {'pos': 'VBZ', 'w': 'is'}, 6: {'pos': 'RB', 'w': 'widely'}, 7: {'pos': 'VBN', 'w': 'expected'}, 8: {'pos': 'TO', 'w': 'to'}, 9: {'pos': 'VB', 'w': 'take'}, 10: {'pos': 'DT', 'w': 'another'}, 11: {'pos': 'JJ', 'w': 'sharp'}, 12: {'pos': 'NN', 'w': 'dive'}, 13: {'pos': 'IN', 'w': 'if'}, 14: {'pos': 'NN', 'w': 'trade'}, 15: {'pos': 'NNS', 'w': 'figures'}, 16: {'pos': 'IN', 'w': 'for'}, 17: {'pos': 'NNP', 'w': 'September'}, 18: {'pos': ',', 'w': ','}, 19: {'pos': 'JJ', 'w': 'due'}, 20: {'pos': 'IN', 'w': 'for'}, 21: {'pos': 'NN', 'w': 'release'}, 22: {'pos': 'NN', 'w': 'tomorrow'}, 23: {'pos': ',', 'w': ','}, 24: {'pos': 'VB', 'w': 'fail'}, 25: {'pos': 'TO', 'w': 'to'}, 26: {'pos': 'VB', 'w': 'show'}, 27: {'pos': 'DT', 'w': 'a'}, 28: {'pos': 'JJ', 'w': 'substantial'}, 29: {'pos': 'NN', 'w': 'improvement'}, 30: {'pos': 'IN', 'w': 'from'}, 31: {'pos': 'NNP', 'w': 'July'}, 32: {'pos': 'CC', 'w': 'and'}, 33: {'pos': 'NNP', 'w': 'August'}, 34: {'pos': 'POS', 'w': "'s"}, 35: {'pos': 'JJ', 'w': 'near-record'}, 36: {'pos': 'NNS', 'w': 'deficits'}, 37: {'pos': '.', 'w': '.'}}
</pre>
The keys in the dictionary are the numbered positions such as {<span style="font-weight:bold;color:red;">1</span>: {<span style="font-weight:bold;color:red;">'pos'</span>: <span style="font-weight:bold;color:green;">'NN'</span>, <span style="font-weight:bold;color:red;">'w'</span>: <span style="font-weight:bold;color:green;">'Confidence'</span>}} where <span style="font-weight:bold;color:red;">1</span> is the position where we are inspecting the sequence and {<span style="font-weight:bold;color:red;">'pos'</span>: <span style="font-weight:bold;color:green;">'NN'</span>, <span style="font-weight:bold;color:red;">'w'</span>: <span style="font-weight:bold;color:green;">'Confidence'</span>} are the observations detected at that position. Moreover, <span style="font-weight:bold;color:green;">'Confidence'</span> observation belongs to the <span style="font-weight:bold;color:red;">word</span> track (using <span style="font-weight:bold;color:red;">'w'</span> as key) and <span style="font-weight:bold;color:green;">'NN'</span> observation to the <span style="font-weight:bold;color:red;">part-of-speech track</span> (using <span style="font-weight:bold;color:red;">'pos'</span> as key).
</li>
<li><span style="font-weight:bold;">Y</span>: dictionary where keys are the boundaries of the label and the values are the labels/tags. Example:
<pre style="font-size:0.8em;">
Y:
{(35, 35): 'I-NP', (20, 20): 'B-PP', (13, 13): 'B-SBAR', (29, 29): 'I-NP', (6, 6): 'I-VP', (2, 2): 'B-PP', (31, 31): 'B-NP', (12, 12): 'I-NP', (11, 11): 'I-NP', (7, 7): 'I-VP', (23, 23): 'O', (27, 27): 'B-NP', (25, 25): 'I-VP', (16, 16): 'B-PP', (22, 22): 'B-NP', (34, 34): 'B-NP', (37, 37): 'O', (33, 33): 'I-NP', (21, 21): 'B-NP', (26, 26): 'I-VP', (5, 5): 'B-VP', (10, 10): 'B-NP', (36, 36): 'I-NP', (4, 4): 'I-NP', (9, 9): 'I-VP', (17, 17): 'B-NP', (30, 30): 'B-PP', (24, 24): 'B-VP', (8, 8): 'I-VP', (32, 32): 'I-NP', (14, 14): 'B-NP', (18, 18): 'O', (3, 3): 'B-NP', (28, 28): 'I-NP', (19, 19): 'B-ADJP', (15, 15): 'I-NP', (1, 1): 'B-NP'}
</pre>
The keys in the dictionary are the boundaries (positions) that the label/tag spans. When parsing/modeling sequences, a label/tag spans exactly one observation, and hence the boundaries are a tuple repeating the position of the label, such as (1, 1). However, if we are modeling/parsing segments, the boundaries vary, as labels are allowed to span multiple observations. More on that when we discuss segment parsing.
</li>
<li><span style="font-weight:bold;">flat_y</span>: list of labels for every observation. Example:
<pre style="font-size:0.8em;">
flat_y:
['B-NP', 'B-PP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'I-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-SBAR', 'B-NP', 'I-NP', 'B-PP', 'B-NP', 'O', 'B-ADJP', 'B-PP', 'B-NP', 'B-NP', 'O', 'B-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-NP', 'I-NP', 'I-NP', 'O']
</pre>
</li>
</ul>
There are other attributes of the sequence instance that can be explored by consulting the API docs of the <a href="https://bitbucket.org/A_2/pyseqlab">PySeqLab package</a>.
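The relation between the <span style="font-weight:bold;">Y</span> and <span style="font-weight:bold;">flat_y</span> attributes in the sequence case can be sketched with plain dictionaries and lists. This is a minimal illustration that assumes the structure shown in the examples above; it does not use or reproduce PySeqLab's internal code.

```python
# labels of a short sequence, one per observation (same shape as flat_y)
flat_y = ['B-NP', 'B-PP', 'B-NP', 'I-NP']

# sequence case: every label covers exactly one position, so the
# boundaries are (position, position), with positions starting at 1
Y = {(pos, pos): label for pos, label in enumerate(flat_y, start=1)}

# sort the boundaries to read the labels in positional order
for boundaries in sorted(Y):
    print(boundaries, Y[boundaries])

# the flat list can be recovered by walking the sorted boundaries
recovered = [Y[boundaries] for boundaries in sorted(Y)]
print(recovered == flat_y)
```

Sorting the keys is only needed for display: the printed Y examples above list the boundaries in arbitrary order, while flat_y always follows the order of the observations.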

# Constructing sequences programmatically

So far, we have seen how to parse sequences from a text file following the input file format (i.e. <span style="font-weight:bold;color:red;">column-format</span> template). Can we also construct sequences from code (i.e. on the fly)? The answer is a definite <span>yes</span>. To demonstrate this, suppose we have the sentence s = "The dog barks." and we want to represent it as an instance of our SequenceStruct class.
<br/>

First, we determine the different components of the sequence. Following the terminology defined earlier, the sentence s is a sequence with four observations (three words and the final period), all belonging to one type (i.e. track) representing the words. So we use 'w' as the name of the track and proceed to build the X attribute of the sequence. For the labels, we have two options: (1) if no labels are defined, we pass an empty list for Y, or (2) if labels are defined, we pass the list of labels for Y. See the next cell for a demonstration.
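As a minimal sketch of the first step, the list of observations for the 'w' track can be built from the raw sentence with a simple tokenizer. The helper name `sentence_to_observations` and the regular expression are assumptions made for this illustration; any tokenizer that yields one token per observation would do.

```python
import re


def sentence_to_observations(sentence, track_name="w"):
    """Split a sentence into tokens and wrap each one as an observation."""
    # runs of word characters are words; any other non-space character
    # (such as the final period) becomes an observation of its own
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return [{track_name: token} for token in tokens]


X = sentence_to_observations("The dog barks.")
print(X)
```

The resulting list has the same shape as the X argument passed to SequenceStruct in the next cell.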


In [25]:
# import SequenceStruct class
from pyseqlab.utilities import SequenceStruct
# define the X attribute
X = [{'w':'The'}, {'w':'dog'}, {'w':'barks'}, {'w':'.'}]
# empty label/tag sequence
Y = []
seq_1 = SequenceStruct(X, Y)
print("labels are not defined")
print("seq_1:")
print("X:", seq_1.X)
print("Y:", seq_1.Y)
print("flat_y:", seq_1.flat_y)

print("-"*40)
print("labels are defined")
# defined label/tag sequence
Y = ['DT', 'N', 'V', '.']
seq_2 = SequenceStruct(X, Y)
print("X:", seq_2.X)
print("Y:", seq_2.Y)
print("flat_y:", seq_2.flat_y)



labels are not defined
seq_1:
X: {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
Y: {}
flat_y: []
----------------------------------------
labels are defined
X: {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
Y: {(3, 3): 'V', (4, 4): '.', (1, 1): 'DT', (2, 2): 'N'}
flat_y: ['DT', 'N', 'V', '.']


# Constructing segments

So far, the discussion has focused on representing sequences. Another option is to represent/parse segments. By definition, segments are sequences in which labels may extend over more than one observation. For example, take the sentence s = "Yale is found in New Haven." where each observation is a word and the labels are of three types {'University', 'Location', 'Other'}. The goal is to identify named entities such as university and location/city, with every remaining word tagged as other (non-named entity). So the labels corresponding to sentence s are ["University", "Other", "Other", "Other", "Location", "Location"]. As can be seen, the "Location" label spans two observations ("New" and "Haven"), and hence we can model s either as a sequence or as a segment. The next cell demonstrates the two representations.

In [26]:
# define the X attribute
X = [{'w':'Yale'}, {'w':'is'}, {'w':'found'}, {'w':'in'}, {'w':'New'}, {'w':'Haven'}]
# one label per observation
Y = ["University", "Other", "Other", "Other", "Location", "Location"]
# model as a sequence
seq_1 = SequenceStruct(X, Y)
print("Modeled as a sequence")
print("seq_1:")
print("X:", seq_1.X)
print("Y:", seq_1.Y)
print("flat_y:", seq_1.flat_y)

print("-"*40)
print("Modeled as a segment")
seq_2 = SequenceStruct(X, Y, seg_other_symbol="Other")
print("X:", seq_2.X)
print("Y:", seq_2.Y)
print("flat_y:", seq_2.flat_y)

Modeled as a sequence
seq_1:
X: {1: {'w': 'Yale'}, 2: {'w': 'is'}, 3: {'w': 'found'}, 4: {'w': 'in'}, 5: {'w': 'New'}, 6: {'w': 'Haven'}}
Y: {(6, 6): 'Location', (5, 5): 'Location', (2, 2): 'Other', (4, 4): 'Other', (1, 1): 'University', (3, 3): 'Other'}
flat_y: ['University', 'Other', 'Other', 'Other', 'Location', 'Location']
----------------------------------------
Modeled as a segment
X: {1: {'w': 'Yale'}, 2: {'w': 'is'}, 3: {'w': 'found'}, 4: {'w': 'in'}, 5: {'w': 'New'}, 6: {'w': 'Haven'}}
Y: {(5, 6): 'Location', (3, 3): 'Other', (4, 4): 'Other', (1, 1): 'University', (2, 2): 'Other'}
flat_y: ['University', 'Other', 'Other', 'Other', 'Location', 'Location']


As can be seen, the Y attribute is modeled differently for segments and sequences. In a sequence, each label spans exactly one observation, while in a segment a label can span multiple observations (as in the case of the "Location" label). The seg_other_symbol keyword determines whether we are modeling segments or sequences. If it is left at its default (None), the constructed instances will be sequences. If we specify the non-entity symbol (in this case "Other"), the constructed instances will be segments.
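The difference between the two Y dictionaries can be sketched with a small plain-Python function. The merging rule used here (consecutive identical labels other than the "other" symbol are joined into one span) is an assumption inferred from the output shown above, and the function name `labels_to_boundaries` is hypothetical; this is not PySeqLab's implementation.

```python
def labels_to_boundaries(flat_y, seg_other_symbol=None):
    """Map a flat label list to a {(start, end): label} dictionary."""
    Y = {}
    start = 1                          # positions are 1-based
    for pos in range(1, len(flat_y) + 1):
        label = flat_y[pos - 1]
        is_last = pos == len(flat_y)
        # extend the span only in segment mode, for labels other than the
        # "other" symbol, and only while the next label is the same
        extend = (
            seg_other_symbol is not None
            and label != seg_other_symbol
            and not is_last
            and flat_y[pos] == label
        )
        if not extend:
            Y[(start, pos)] = label
            start = pos + 1
    return Y


flat_y = ["University", "Other", "Other", "Other", "Location", "Location"]
as_sequence = labels_to_boundaries(flat_y)
as_segment = labels_to_boundaries(flat_y, seg_other_symbol="Other")
print("sequence:", as_sequence)
print("segment: ", as_segment)
```

With the default of None every label keeps its own (position, position) key, while passing "Other" joins the two "Location" labels under the key (5, 6) and leaves each "Other" observation on its own.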