# Keyphrase extraction

In [1]:
import os

## Module to load corpus and models

In [2]:
import resources.dataset as rd

### Load corpus

To load the corpus, use the following method.

```
rd.load_corpus()
```
This method loads the file contents of a dataset into memory and extracts the PoS tag sequences from the train dataset.

The corpus loaded by default is [SemEval 2017 Task 10](https://scienceie.github.io/resources.html), which is currently the only one available. The files for this dataset should be placed in the package under the path configured in config/config.py (default: corpus/).
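
Before loading, it can help to check that the corpus directories are actually in place. The sketch below is only an illustration: `CORPUS_DIRS` and `missing_corpus_dirs` are hypothetical names, and the two paths are copied from the `Config` output printed later in this notebook, so adjust them to your own config/config.py.

```python
import os

# Hypothetical copy of two entries from the printed corpus config;
# these are assumptions, not values read from config/config.py.
CORPUS_DIRS = {
    "train-labeled": "corpus/semeval2017-task10/train2/",
    "dev-labeled": "corpus/semeval2017-task10/dev/",
}


def missing_corpus_dirs(config):
    """Return the config keys whose directory does not exist on disk."""
    return [name for name, path in config.items() if not os.path.isdir(path)]


missing = missing_corpus_dirs(CORPUS_DIRS)
if missing:
    print("Missing corpus directories:", ", ".join(missing))
else:
    print("All corpus directories found.")
```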



In [3]:
default_corpus = rd.load_corpus()
print(default_corpus)

<resources.semeval2017.SemEval2017 object at 0x7f21036b7908>


### Info about dataset

In [4]:
print("Name:", default_corpus.name)
print("Config:", default_corpus.config)

print("len(Train):", len(default_corpus.train))
# print("Train:", default_corpus.train.popitem()[1]["raw"]["txt"])
# print("Train:", default_corpus.train.popitem()[1]["tags"])
# print("Train:", default_corpus.train.popitem()[1]["keyphrases"])
print("PoS sequences:", len(default_corpus.pos_sequences))
# Note: dict.popitem() removes the entry it returns from pos_sequences.
print("PoS sequences example:", default_corpus.pos_sequences.popitem())

print("len(Dev):", len(default_corpus.dev))
# print("Dev:", default_corpus.dev.popitem()[1]["raw"]["txt"])

print("len(Test):", len(default_corpus.test))
# print("Test:", default_corpus.test.popitem()[1]["raw"]["txt"])

Name: semeval2017-task10
Config: {'train-labeled': 'corpus/semeval2017-task10/train2/', 'dev-labeled': 'corpus/semeval2017-task10/dev/', 'test-unlabeled': 'corpus/semeval2017-task10/scienceie2017_test_unlabelled/', 'test-labeled': 'corpus/semeval2017-task10/semeval_articles_test/'}
len(Train): 350
PoS sequences: 1486
PoS sequences example: ('NNS IN DT NN JJ', {'tags': ['NNS', 'IN', 'DT', 'NN', 'JJ'], 'count': 1})
len(Dev): 50
len(Test): 100


In [5]:
# key, train = default_corpus.train.popitem()
# print("\n".join([str(ann) for ann in train["tags"]]))
# print("\n".join([str(keyphrase) for keyphrase in train["keyphrases"].items()]))
# print("\n".join([str(ac) for ac in default_corpus.annotated_candidates_spans[key]]))
# print("Candidates", len(default_corpus.annotated_candidates_spans[key]))

## PoS tag sequences

### Load dataset

In [6]:
default_corpus.load_pos_sequences()
pos_sequences = default_corpus.pos_sequences

### Example of PoS sequences

In [7]:
print("\n".join([str(pos_seq) for pos_seq in pos_sequences.items()][:10]))

('JJ NN', {'tags': ['JJ', 'NN'], 'count': 526})
('NN NN', {'tags': ['NN', 'NN'], 'count': 444})
('NN IN NN NN', {'tags': ['NN', 'IN', 'NN', 'NN'], 'count': 9})
('VBN NN VBZ VBN', {'tags': ['VBN', 'NN', 'VBZ', 'VBN'], 'count': 1})
('NN TO VB PRP$ NN CC NN NN NNS', {'tags': ['NN', 'TO', 'VB', 'PRP$', 'NN', 'CC', 'NN', 'NN', 'NNS'], 'count': 1})
('NN', {'tags': ['NN'], 'count': 684})
('JJ NN NN', {'tags': ['JJ', 'NN', 'NN'], 'count': 200})
('JJ NNS JJ JJ NN', {'tags': ['JJ', 'NNS', 'JJ', 'JJ', 'NN'], 'count': 1})
('NN NN NN', {'tags': ['NN', 'NN', 'NN'], 'count': 89})
('JJ NN IN JJ JJ NNS', {'tags': ['JJ', 'NN', 'IN', 'JJ', 'JJ', 'NNS'], 'count': 2})


The `count` values shown above are used as limits to select the PoS sequences used to train the model.
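
As a rough illustration of where such counts come from, the sketch below builds a table with the same shape as `pos_sequences` from a list of tag sequences. It is not the package's implementation: `count_pos_sequences` and the toy `tagged_keyphrases` input are hypothetical.

```python
from collections import Counter

# Hypothetical input: one list of PoS tags per annotated keyphrase.
tagged_keyphrases = [
    ["JJ", "NN"],
    ["NN", "NN"],
    ["JJ", "NN"],
    ["NN"],
]


def count_pos_sequences(tag_lists):
    """Map each space-joined tag sequence to its tags and occurrence count."""
    counts = Counter(" ".join(tags) for tags in tag_lists)
    return {
        seq: {"tags": seq.split(" "), "count": n}
        for seq, n in counts.items()
    }


pos_sequences_sketch = count_pos_sequences(tagged_keyphrases)
for item in pos_sequences_sketch.items():
    print(item)
```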

## Example of keyphrase extraction

Example of text from the test dataset. 

In [8]:
key = list(default_corpus.test.keys())[1]
text = default_corpus.test[key]["raw"]["txt"]

print("Document example...\n")
print("Name of document:", key)
print("Content to label:\n\n", text)

Document example...

Name of document: S0304399111001811
Content to label:

 We have developed the theory of electrons carrying quantized orbital angular momentum. To make connection to realistic situations, we considered a plane wave moving along the optic axis of a lens system, intercepted by a round, centered aperture.88In the experiment, this aperture carries the holographic mask. It turns out that the movement along the optic axis can be separated off; the reduced Schrödinger equation operating in the plane of the aperture can be mapped onto Bessel's differential equation. The ensuing eigenfunctions fall into families with discrete orbital angular momentum ℏm along the optic axis where m is a magnetic quantum number. Those vortices can be produced by matching a plane wave after passage through a holographic mask with a fork dislocation to the eigenfunctions of the cylindrical problem. Vortices can be focussed by magnetic lenses into volcano-like charge distributions with very narr

### Train or load model

Train the model before labeling, using the method shown in the next cell.

The following parameter value is recommended. It is the minimum number of occurrences a PoS sequence needs in the train dataset to be used for filtering the candidates that train the CRF model.

```
filter_min_count = 3
```
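
To make the effect of the threshold concrete, here is a minimal sketch of such a filter. It assumes the comparison is "at least `filter_min_count` occurrences", which this notebook does not confirm; `select_pos_sequences` is a hypothetical helper, and the sample entries are copied from the PoS sequence listing earlier in this notebook.

```python
# Sample entries copied from the PoS sequence listing in this notebook.
sample_sequences = {
    "JJ NN": {"tags": ["JJ", "NN"], "count": 526},
    "NN IN NN NN": {"tags": ["NN", "IN", "NN", "NN"], "count": 9},
    "VBN NN VBZ VBN": {"tags": ["VBN", "NN", "VBZ", "VBN"], "count": 1},
    "JJ NN IN JJ JJ NNS": {
        "tags": ["JJ", "NN", "IN", "JJ", "JJ", "NNS"],
        "count": 2,
    },
}


def select_pos_sequences(pos_sequences, filter_min_count):
    """Keep the sequences seen at least filter_min_count times (assumed >=)."""
    return {
        seq: info
        for seq, info in pos_sequences.items()
        if info["count"] >= filter_min_count
    }


selected = select_pos_sequences(sample_sequences, filter_min_count=3)
print(sorted(selected))  # → ['JJ NN', 'NN IN NN NN']
```

With a threshold of 1 every sequence is kept, so higher values trade coverage for more frequent, and presumably more reliable, patterns.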

In [9]:
# Train or load model
default_corpus.training(filter_min_count=3)

### Label text

To label text with the trained model, use the following method.
```
default_corpus.label_text(text)
```

In [10]:
# Labeling
keyphrases = default_corpus.label_text(text)

Keyphrases are returned as a list of tuples, one per keyphrase.

Each tuple has the following fields.
```
("Keyphrase ID", ("Label", (Start, End)), "Text of keyphrase")
    string        string     int   int          string
```

Example:

```
[
    ('T4', ('KEYPHRASE', (293, 309)), 'holographic mask'), 
    ('T10', ('KEYPHRASE', (735, 751)), 'holographic mask')
]
```
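
Because each tuple nests the label and the span, it can be unpacked in one step. The sketch below uses a toy text and a hypothetical helper, `find_span_mismatches`, to check that each `(Start, End)` span really reproduces the keyphrase text, assuming the offsets are character indices with an exclusive end.

```python
# Toy text and keyphrases in the same tuple layout as the example above.
sample_text = "The holographic mask has a fork dislocation."
sample_keyphrases = [
    ("T1", ("KEYPHRASE", (4, 20)), "holographic mask"),
    ("T2", ("KEYPHRASE", (27, 43)), "fork dislocation"),
]


def find_span_mismatches(text, keyphrases):
    """Return the IDs of keyphrases whose span does not reproduce their text."""
    mismatches = []
    for kp_id, (label, (start, end)), kp_text in keyphrases:
        if text[start:end] != kp_text:
            mismatches.append(kp_id)
    return mismatches


print(find_span_mismatches(sample_text, sample_keyphrases))  # → []
```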



In [11]:
print("Example of labeled keyphrases:\n\n", keyphrases)

Example of labeled keyphrases:

 [('T4', ('KEYPHRASE', (293, 309)), 'holographic mask'), ('T10', ('KEYPHRASE', (735, 751)), 'holographic mask'), ('T11', ('KEYPHRASE', (805, 824)), 'cylindrical problem'), ('T14', ('KEYPHRASE', (1010, 1030)), 'spherical aberration'), ('T28', ('KEYPHRASE', (978, 995)), 'diffraction plane'), ('T101', ('KEYPHRASE', (875, 908)), 'volcano-like charge distributions'), ('T106', ('KEYPHRASE', (32, 41)), 'electrons'), ('T234', ('KEYPHRASE', (389, 427)), 'reduced Schrödinger equation operating'), ('T255', ('KEYPHRASE', (997, 1030)), 'Inclusion of spherical aberration'), ('T268', ('KEYPHRASE', (735, 775)), 'holographic mask with a fork dislocation')]


The result can be formatted as [brat](http://brat.nlplab.org/examples.html#annotation-examples) annotations with the following method.

```
rd.keyphrases2brat(keyphrases)
```

In [12]:
print("\nKeyphrases in %s.ann:\n" % key)

print(rd.keyphrases2brat(keyphrases))


Keyphrases in S0304399111001811.ann:

T4	KEYPHRASE 293 309	holographic mask
T10	KEYPHRASE 735 751	holographic mask
T11	KEYPHRASE 805 824	cylindrical problem
T14	KEYPHRASE 1010 1030	spherical aberration
T28	KEYPHRASE 978 995	diffraction plane
T101	KEYPHRASE 875 908	volcano-like charge distributions
T106	KEYPHRASE 32 41	electrons
T234	KEYPHRASE 389 427	reduced Schrödinger equation operating
T255	KEYPHRASE 997 1030	Inclusion of spherical aberration
T268	KEYPHRASE 735 775	holographic mask with a fork dislocation
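
The output above has one line per keyphrase: the ID, a tab, the label with start and end offsets, a tab, and the keyphrase text. A conversion producing that layout could look like the sketch below. This is an illustration only, not the code of `rd.keyphrases2brat`; `to_brat_lines` is a hypothetical name.

```python
def to_brat_lines(keyphrases):
    """Format keyphrase tuples as tab-separated brat text-bound annotations."""
    lines = []
    for kp_id, (label, (start, end)), kp_text in keyphrases:
        lines.append("%s\t%s %d %d\t%s" % (kp_id, label, start, end, kp_text))
    return "\n".join(lines)


# Two entries copied from the labeled output shown in this notebook.
example = [
    ("T4", ("KEYPHRASE", (293, 309)), "holographic mask"),
    ("T11", ("KEYPHRASE", (805, 824)), "cylindrical problem"),
]
print(to_brat_lines(example))
```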


### Examples of how to save answer to files

The possible values for selecting PoS sequences are the distinct occurrence counts of the PoS sequences in the train dataset.

In [13]:
filter_limits = sorted(set([pos_seq["count"] for pos_seq in pos_sequences.values()]))
print("Occurrences of each PoS Sequence in the train dataset\n", filter_limits)

Occurrences of each PoS Sequence in the train dataset
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 24, 25, 26, 27, 28, 34, 37, 39, 50, 63, 89, 103, 107, 130, 177, 200, 219, 283, 335, 444, 526, 627, 684]


The following examples iterate over the dev and test datasets of SemEval 2017 Task 10. The outputs are saved under output-dev/ and output-test/ in the main directory of this package. The results can be evaluated using the script ./eval_example_output_semeval2017.sh
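
The per-document writing step could also be factored into a small helper. The sketch below is one possible shape, assuming the .txt file should hold the raw document text and the .ann file the brat annotations; `save_brat_document` is a hypothetical name, and the demo writes into a temporary directory rather than output-dev/.

```python
import tempfile
from pathlib import Path


def save_brat_document(output_dir, key, text, ann):
    """Write <key>.txt (raw text) and <key>.ann (annotations) to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create nested folders if needed
    (out / (key + ".txt")).write_text(text, encoding="utf-8")
    ann_path = out / (key + ".ann")
    ann_path.write_text(ann, encoding="utf-8")
    return ann_path


# Demo with toy content in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = save_brat_document(
        Path(tmp) / "output-dev" / "3",
        "example-doc",
        "The holographic mask.",
        "T1\tKEYPHRASE 4 20\tholographic mask",
    )
    print(path.name, path.exists())
```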

In [14]:
for fl in filter_limits:
    default_corpus.training(filter_min_count=fl)
    output = "output-dev/%s/" % fl
    if not rd.path_exists(output):
        os.makedirs(output)
    for i, (key, tmp_dataset) in enumerate(default_corpus.dev.items()):
        text = tmp_dataset["raw"]["txt"]
        keyphrases = default_corpus.label_text(text)
        with open(output + key + ".ann", "w", encoding="utf-8") as fout:
            fout.write(rd.keyphrases2brat(keyphrases))
        # The .txt file holds the raw text that the .ann offsets refer to.
        with open(output + key + ".txt", "w", encoding="utf-8") as fout:
            fout.write(text)
    print("Written docs: %d \
          \n - Minimum occurence of PoS sequences to train the model: %d" % (i + 1, fl))

Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 1
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 2
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 3
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 4
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 5
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 6
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 7
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 8
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 9
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 10
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 11
Written docs: 50           
 -

In [15]:
for fl in filter_limits:
    default_corpus.training(filter_min_count=fl)
    output = "output-test/%s/" % fl
    if not rd.path_exists(output):
        os.makedirs(output)
    for i, (key, tmp_dataset) in enumerate(default_corpus.test.items()):
        text = tmp_dataset["raw"]["txt"]
        keyphrases = default_corpus.label_text(text)
        with open(output + key + ".ann", "w", encoding="utf-8") as fout:
            fout.write(rd.keyphrases2brat(keyphrases))
        # The .txt file holds the raw text that the .ann offsets refer to.
        with open(output + key + ".txt", "w", encoding="utf-8") as fout:
            fout.write(text)
    print("Written docs: %d \
          \n - Minimum occurence of PoS sequences to train the model: %d" % (i + 1, fl))

Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 1
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 2
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 3
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 4
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 5
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 6
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 7
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 8
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 9
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 10
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 11
Written docs: 100  