## Using the complex word sequence labeller

To use the complex word models, you must download the sequence labeller files available [here](https://github.com/marekrei/sequence-labeler). Please cite both the sequence labeller paper and the CWI sequence labelling paper if you use these models for research.

Below is example code showing each method of the `Complexity_labeller` class.

In [1]:
import sys
sys.path.insert(0, './sequence-labeler-master')

from complex_labeller import Complexity_labeller
model_path = './cwi_seq.model'
temp_path = './temp_file.txt'

In [2]:
model = Complexity_labeller(model_path, temp_path)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


There are two options when converting text to CoNLL-type tab-separated format:

- `convert_format_string`
- `convert_format_token`

In [3]:
Complexity_labeller.convert_format_string(model, 'You can convert a string like this')

In [4]:
Complexity_labeller.convert_format_token(model, ['You','can','convert','tokens','like','this'])
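
For orientation, a CoNLL-type tab-separated file typically holds one token per line, followed by a tab and a label column, with a blank line closing each sentence. The sketch below only illustrates that layout and is not the labeller's own code: `to_conll_lines` is a hypothetical helper, and the placeholder label `N` and the whitespace tokenization are assumptions made for the example.

```python
def to_conll_lines(tokens, placeholder_label="N"):
    """Return CoNLL-style lines: one 'token<TAB>label' per token, then a blank line."""
    # placeholder_label is an assumed dummy value, not taken from the labeller.
    lines = [f"{token}\t{placeholder_label}" for token in tokens]
    lines.append("")  # a blank line marks the end of the sentence
    return lines


# Naive whitespace tokenization, used here only for illustration.
tokens = "You can convert a string like this".split()
conll_lines = to_conll_lines(tokens)
print("\n".join(conll_lines))
```

The actual layout that `convert_format_string` and `convert_format_token` write to the temporary file may differ in its label column or tokenization.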

Once the text has been converted, there are three methods to access complexity information:

- `get_dataframe`
- `get_bin_labels`
- `get_prob_labels`

In [5]:
#Converting example sentence:'Based in an armoured train parked in its sidings, he met with numerous ministers'

Complexity_labeller.convert_format_string(model,'Based in an armoured train parked in its sidings, he met with numerous ministers')

The `get_dataframe` method returns a dataframe containing the original tokenized sentence, binary complexity labels and complex class probabilities.

If a word receives a binary label of 1, it has been classified as a complex word.

In [6]:
dataframe = Complexity_labeller.get_dataframe(model)

In [7]:
dataframe

| | index | sentences | labels | probs |
|---|---|---|---|---|
| 0 | 0 | [Based, in, an, armoured, train, parked, in, i... | [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0] | [[0.96680665, 0.033193372], [0.99995637, 4.359... |


The example below shows how to pair each token with its binary label using the dataframe:

In [15]:
list(zip(dataframe['sentences'].values[0],dataframe['labels'].values[0]))

[('Based', 0),
 ('in', 0),
 ('an', 0),
 ('armoured', 1),
 ('train', 0),
 ('parked', 0),
 ('in', 0),
 ('its', 0),
 ('sidings', 1),
 (',', 0),
 ('he', 0),
 ('met', 0),
 ('with', 0),
 ('numerous', 1),
 ('ministers', 0)]
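
If only the complex words are needed, the token and label pairs can be filtered. This is a minimal sketch that copies the tokens and labels from the output above into plain lists so that it runs without the model; the variable names are illustrative.

```python
# Tokens and binary labels copied from the example output above.
tokens = ['Based', 'in', 'an', 'armoured', 'train', 'parked', 'in', 'its',
          'sidings', ',', 'he', 'met', 'with', 'numerous', 'ministers']
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Keep only the tokens whose binary label is 1 (classified as complex).
complex_words = [token for token, label in zip(tokens, labels) if label == 1]
print(complex_words)  # ['armoured', 'sidings', 'numerous']
```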

The `get_bin_labels` method returns the binary complexity labels for the input.

In [16]:
Complexity_labeller.get_bin_labels(model)

[array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0])]

The `get_prob_labels` method returns the probability of each token belonging to the complex class.

In [17]:
Complexity_labeller.get_prob_labels(model)

[0.033193372,
 4.359562e-05,
 0.00011993743,
 0.9801681,
 0.01585573,
 0.26787525,
 4.0525378e-05,
 0.00021037977,
 0.8165311,
 6.47893e-05,
 0.000112162525,
 0.010358469,
 6.7463385e-05,
 0.8968867,
 0.40755495]
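
The probabilities can be paired with the tokens to apply a custom cutoff. The sketch below copies the values from the outputs above so that it runs without the model. `words_above` is a hypothetical helper, and the thresholds are illustrative: in this example a cutoff of 0.5 selects the same words as the binary labels, but whether the model applies that cutoff internally is an assumption.

```python
# Tokens and complex-class probabilities copied from the example outputs above.
tokens = ['Based', 'in', 'an', 'armoured', 'train', 'parked', 'in', 'its',
          'sidings', ',', 'he', 'met', 'with', 'numerous', 'ministers']
complex_probs = [0.033193372, 4.359562e-05, 0.00011993743, 0.9801681,
                 0.01585573, 0.26787525, 4.0525378e-05, 0.00021037977,
                 0.8165311, 6.47893e-05, 0.000112162525, 0.010358469,
                 6.7463385e-05, 0.8968867, 0.40755495]


def words_above(tokens, probs, threshold):
    """Return (token, probability) pairs whose probability meets the threshold."""
    return [(token, prob) for token, prob in zip(tokens, probs) if prob >= threshold]


strict = words_above(tokens, complex_probs, 0.5)     # same words as the binary labels
lenient = words_above(tokens, complex_probs, 0.25)   # also flags borderline words
print([token for token, _ in strict])
print([token for token, _ in lenient])
```

Lowering the threshold to 0.25 additionally flags `parked` and `ministers`, which can be useful when recall matters more than precision.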