# Kleis - Keyphrase extraction

In [1]:
import os

## Module to load corpus and models

In [2]:
import kleis.resources.dataset as kl

### Load corpus

To load the corpus use the following method.

```
kl.load_corpus()
```

The corpus loaded by default is [SemEval 2017 Task 10](https://scienceie.github.io/resources.html); at the moment it is the only one available. The corpus files are searched for in ~/kleis_data/corpus/semeval2017-task10 or ./kleis_data/corpus/semeval2017-task10
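
Before calling the loader, it can help to check that the corpus is in one of these two locations. The sketch below is only an illustration: `find_corpus_dir` and `CORPUS_SUBDIR` are hypothetical names, not part of the kleis API, and the lookup order (home directory first, then the current directory) is an assumption.

```python
from pathlib import Path

# Relative location of the corpus, as given in the text above.
CORPUS_SUBDIR = Path("kleis_data") / "corpus" / "semeval2017-task10"


def find_corpus_dir(bases=None):
    """Return the first existing corpus directory, or None if there is none.

    Hypothetical helper for illustration only; kleis may resolve paths differently.
    """
    if bases is None:
        # Assumed order: home directory first, then the current directory.
        bases = [Path.home(), Path.cwd()]
    for base in bases:
        candidate = Path(base) / CORPUS_SUBDIR
        if candidate.is_dir():
            return candidate
    return None


print("Corpus directory:", find_corpus_dir())
```

If this prints `None`, the corpus is in neither location.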


In [3]:
default_corpus = kl.load_corpus()
print(default_corpus)

<kleis.resources.semeval2017.SemEval2017 object at 0x7f1af08aa0b8>


### Info about dataset

In [4]:
print("Name:", default_corpus.name)
print("Config:", default_corpus.config)

print("len(Train):", len(default_corpus.train) if default_corpus.train else None)
# print("Train:", default_corpus.train.popitem())
print("len(Dev):", len(default_corpus.dev) if default_corpus.dev else None)
# print("Dev:", default_corpus.dev.popitem())
print("len(Test):", len(default_corpus.test) if default_corpus.test else None)
# print("Test:", default_corpus.test.popitem())

Name: semeval2017-task10
Config: {'train-labeled': '/home/snov/Projects/univ-paris13/nlp/code/kleis-keyphrase-extraction/src/kleis/kleis_data/corpus/semeval2017-task10/train2/', 'dev-labeled': '/home/snov/Projects/univ-paris13/nlp/code/kleis-keyphrase-extraction/src/kleis/kleis_data/corpus/semeval2017-task10/dev/', 'test-unlabeled': '/home/snov/Projects/univ-paris13/nlp/code/kleis-keyphrase-extraction/src/kleis/kleis_data/corpus/semeval2017-task10/scienceie2017_test_unlabelled/', 'test-labeled': '/home/snov/Projects/univ-paris13/nlp/code/kleis-keyphrase-extraction/src/kleis/kleis_data/corpus/semeval2017-task10/semeval_articles_test/'}

## PoS tag sequences

Example of PoS sequences

In [5]:
default_corpus.load_pos_sequences()
pos_sequences = default_corpus.pos_sequences

In [6]:
print("\n".join([str(pos_seq) for pos_seq in pos_sequences.items()][:10]))

('JJ NN', {'tags': ['JJ', 'NN'], 'count': 526})
('NN NN', {'tags': ['NN', 'NN'], 'count': 444})
('NN IN NN NN', {'tags': ['NN', 'IN', 'NN', 'NN'], 'count': 9})
('NN', {'tags': ['NN'], 'count': 684})
('JJ NN NN', {'tags': ['JJ', 'NN', 'NN'], 'count': 200})
('NN NN NN', {'tags': ['NN', 'NN', 'NN'], 'count': 89})
('NNP NN', {'tags': ['NNP', 'NN'], 'count': 177})
('NNP', {'tags': ['NNP'], 'count': 627})
('JJ NNS NNS', {'tags': ['JJ', 'NNS', 'NNS'], 'count': 3})
('JJ NN NNS', {'tags': ['JJ', 'NN', 'NNS'], 'count': 130})
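
The dictionary is not ordered by frequency, so ranking the entries makes the dominant patterns easier to see. This is a minimal sketch on a few entries copied from the output above; it assumes only the `{'tags': [...], 'count': n}` structure shown there.

```python
# A few entries copied from the output above.
pos_sequences = {
    "JJ NN": {"tags": ["JJ", "NN"], "count": 526},
    "NN NN": {"tags": ["NN", "NN"], "count": 444},
    "NN IN NN NN": {"tags": ["NN", "IN", "NN", "NN"], "count": 9},
    "NN": {"tags": ["NN"], "count": 684},
    "NNP": {"tags": ["NNP"], "count": 627},
}

# Sort (sequence, info) pairs from most to least frequent.
ranked = sorted(
    pos_sequences.items(),
    key=lambda item: item[1]["count"],
    reverse=True,
)

for sequence, info in ranked:
    print(f"{info['count']:>5}  {len(info['tags'])} tag(s)  {sequence}")
```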


## Keyphrase extraction with Kleis

In [7]:
text = """Information extraction is the process of extracting structured data from unstructured text, \
which is relevant for several end-to-end tasks, including question answering. \
This paper addresses the tasks of named entity recognition (NER), \
a subtask of information extraction, using conditional random fields (CRF). \
Our method is evaluated on the ConLL-2003 NER corpus.
"""

print("Document example...\n")
print("Content to label:\n\n", text)

Document example...

Content to label:

 Information extraction is the process of extracting structured data from unstructured text, which is relevant for several end-to-end tasks, including question answering. This paper addresses the tasks of named entity recognition (NER), a subtask of information extraction, using conditional random fields (CRF). Our method is evaluated on the ConLL-2003 NER corpus.



### Train or load model

Before labeling keyphrases, call the following method.

```
default_corpus.training()
```

It loads the model used to label keyphrases or, if that model does not exist yet, starts the training.

Using the default arguments is recommended. If no model exists for the given arguments, a new one is generated, which can take a long time.

```
Default: filter_min_count = 3
Default: tagging_notation="BILOU"
```
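
BILOU marks each token as the Beginning, Inside, or Last token of a multi-token keyphrase, as Outside any keyphrase, or as a Unit (single-token) keyphrase. The sketch below illustrates the notation only: `bilou_tags` is a hypothetical helper, and the `-KEYPHRASE` suffix and token-index spans are assumptions, not the way kleis encodes its training data.

```python
def bilou_tags(n_tokens, spans):
    """Return one BILOU tag per token position.

    spans: non-overlapping (start, end) token indices, end exclusive.
    Hypothetical helper for illustration; not part of the kleis API.
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            # A single-token keyphrase gets the Unit tag.
            tags[start] = "U-KEYPHRASE"
        else:
            tags[start] = "B-KEYPHRASE"
            for i in range(start + 1, end - 1):
                tags[i] = "I-KEYPHRASE"
            tags[end - 1] = "L-KEYPHRASE"
    return tags


tokens = ["using", "conditional", "random", "fields", "(", "CRF", ")"]
tags = bilou_tags(len(tokens), [(1, 4), (5, 6)])

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```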

In [8]:
# Train or load model
default_corpus.training()

### Labeling text

To label text with the trained model use the following method.

```
default_corpus.label_text(text)
```

In [9]:
# Labeling
keyphrases = default_corpus.label_text(text)

Keyphrases are returned as a list in which each element describes one keyphrase.

Each element is a tuple with the following fields.
```
("Keyphrase ID", ("Label", (Start, End)), "Text of keyphrase")
    string        string     int   int         string
```

Example:

```
[
    ('T4', ('KEYPHRASE', (293, 309)), 'holographic mask'), 
    ('T10', ('KEYPHRASE', (735, 751)), 'holographic mask')
]
```
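
Judging from the examples, Start and End look like character offsets into the labeled text, with End exclusive, so slicing the text should give back the keyphrase. The sketch below unpacks tuples of this shape and checks that property on a short hand-made text; `mismatched_ids` is a hypothetical helper, and the offset interpretation is an assumption inferred from the examples.

```python
text = "Information extraction using conditional random fields (CRF)."

# Hand-made keyphrases in the tuple layout described above.
keyphrases = [
    ("T1", ("KEYPHRASE", (0, 22)), "Information extraction"),
    ("T2", ("KEYPHRASE", (29, 54)), "conditional random fields"),
    ("T3", ("KEYPHRASE", (56, 59)), "CRF"),
]


def mismatched_ids(text, keyphrases):
    """Return the IDs whose (start, end) slice differs from the stored text."""
    bad = []
    for key_id, (_label, (start, end)), phrase in keyphrases:
        if text[start:end] != phrase:
            bad.append(key_id)
    return bad


for key_id, (label, (start, end)), phrase in keyphrases:
    print(key_id, label, start, end, repr(text[start:end]))

print("Mismatched IDs:", mismatched_ids(text, keyphrases))
```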



In [10]:
print("Example of labeled keyphrases:\n\n", keyphrases)

Example of labeled keyphrases:

 [('T3', ('KEYPHRASE', (0, 22)), 'Information extraction'), ('T4', ('KEYPHRASE', (150, 168)), 'question answering'), ('T5', ('KEYPHRASE', (210, 228)), 'entity recognition'), ('T6', ('KEYPHRASE', (249, 271)), 'information extraction'), ('T24', ('KEYPHRASE', (230, 233)), 'NER'), ('T25', ('KEYPHRASE', (306, 309)), 'CRF'), ('T28', ('KEYPHRASE', (279, 304)), 'conditional random fields')]


The result can be formatted as [brat](http://brat.nlplab.org/examples.html#annotation-examples) annotations with the following method.

```
kl.keyphrases2brat(keyphrases)
```

In [11]:
print(kl.keyphrases2brat(keyphrases))

T3	KEYPHRASE 0 22	Information extraction
T4	KEYPHRASE 150 168	question answering
T5	KEYPHRASE 210 228	entity recognition
T6	KEYPHRASE 249 271	information extraction
T24	KEYPHRASE 230 233	NER
T25	KEYPHRASE 306 309	CRF
T28	KEYPHRASE 279 304	conditional random fields
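
Each line of this output holds the ID, a tab, the label with its two offsets, another tab, and the keyphrase text. The sketch below reproduces that layout from the tuple structure. `to_brat` is a hypothetical stand-in written for illustration and is not the implementation of `kl.keyphrases2brat`.

```python
def to_brat(keyphrases):
    """Format keyphrase tuples as brat text-bound annotation lines.

    Hypothetical re-implementation for illustration only.
    """
    lines = []
    for key_id, (label, (start, end)), phrase in keyphrases:
        # Layout: ID <TAB> LABEL START END <TAB> TEXT
        lines.append(f"{key_id}\t{label} {start} {end}\t{phrase}")
    return "\n".join(lines)


sample = [
    ("T3", ("KEYPHRASE", (0, 22)), "Information extraction"),
    ("T24", ("KEYPHRASE", (230, 233)), "NER"),
]
print(to_brat(sample))
```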


### Testing with the SemEval 2017 Task 10 dataset

The candidate thresholds for selecting PoS sequences (passed as filter_min_count) are the distinct occurrence counts of the PoS sequences in the training dataset.

In [12]:
filter_limits = sorted(set([pos_seq["count"] for pos_seq in pos_sequences.values()]))
print("Occurrences of each PoS Sequence in the train dataset\n", filter_limits)

Occurrences of each PoS Sequence in the train dataset
 [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 24, 25, 26, 27, 28, 34, 37, 39, 50, 63, 89, 103, 107, 130, 177, 200, 219, 283, 335, 444, 526, 627, 684]
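
To get a feel for how the threshold changes the set of PoS sequences, the sketch below applies a minimum-count filter to a few counts taken from the earlier output. It assumes that a sequence is kept when its count is greater than or equal to the threshold. `surviving` is a hypothetical helper, and the real filtering inside kleis may differ.

```python
# A few (sequence, count) pairs taken from the PoS sequence output above.
pos_counts = {
    "JJ NN": 526,
    "NN NN": 444,
    "NN IN NN NN": 9,
    "NN": 684,
    "JJ NNS NNS": 3,
}


def surviving(pos_counts, min_count):
    """Return the sequences whose count is at least min_count (assumed rule)."""
    return sorted(seq for seq, count in pos_counts.items() if count >= min_count)


for min_count in (3, 9, 500):
    kept = surviving(pos_counts, min_count)
    print(f"min_count={min_count}: {len(kept)} sequence(s) kept -> {kept}")
```

Under this assumption, higher thresholds keep only the most frequent patterns.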


The following examples iterate over the dev and test datasets of SemEval 2017 Task 10. The outputs are saved under output-dev/ and output-test/ in the main directory of this package. The results can be evaluated using the script ./eval_example_output_semeval2017.sh
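
As a rough idea of what such an evaluation measures, the sketch below computes exact-match precision, recall, and F1 between two sets of brat annotations. It is not the official SemEval scorer or the script mentioned above. `parse_brat` and `precision_recall_f1` are hypothetical helpers, the annotations are toy data rather than real results, and only contiguous text-bound annotations (lines starting with `T`) are handled.

```python
def parse_brat(ann_text):
    """Return the set of (label, start, end) spans found in a brat .ann string."""
    spans = set()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes, and blank lines
        _key_id, info, _phrase = line.split("\t", 2)
        label, start, end = info.split(" ")
        spans.add((label, int(start), int(end)))
    return spans


def precision_recall_f1(gold, predicted):
    """Exact-match scores between two sets of spans."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy annotations for illustration only.
gold = parse_brat(
    "T1\tKEYPHRASE 0 22\tInformation extraction\n"
    "T2\tKEYPHRASE 150 168\tquestion answering\n"
)
predicted = parse_brat(
    "T3\tKEYPHRASE 0 22\tInformation extraction\n"
    "T5\tKEYPHRASE 210 228\tentity recognition\n"
)

print(precision_recall_f1(gold, predicted))
```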

Be sure to have the corpus in the correct path and to load each dataset.

```
default_corpus.load_train()
default_corpus.load_test()
default_corpus.load_dev()
```


In [13]:
default_corpus.load_train()
default_corpus.load_test()
default_corpus.load_dev()

In [14]:
for fl in filter_limits:
    default_corpus.training(filter_min_count=fl)
    output = "output-dev/%s/" % fl
    if not kl.path_exists(output):
        os.makedirs(output)
    for i, (key, tmp_dataset) in enumerate(default_corpus.dev.items()):
        text = tmp_dataset["raw"]["txt"]
        keyphrases = default_corpus.label_text(text)
        with open(output + key + ".ann", "w", encoding="utf-8") as fout:
            fout.write(kl.keyphrases2brat(keyphrases))
        # The .txt file holds the raw document text that the .ann offsets refer to.
        with open(output + key + ".txt", "w", encoding="utf-8") as fout:
            fout.write(text)
    print("Written docs: %d \
          \n - Minimum occurence of PoS sequences to train the model: %d" % (i + 1, fl))

Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 3
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 4
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 5
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 6
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 7
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 8
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 9
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 10
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 11
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 12
Written docs: 50           
 - Minimum occurence of PoS sequences to train the model: 13
Written docs: 50           


In [15]:
for fl in filter_limits:
    default_corpus.training(filter_min_count=fl)
    output = "output-test/%s/" % fl
    if not kl.path_exists(output):
        os.makedirs(output)
    for i, (key, tmp_dataset) in enumerate(default_corpus.test.items()):
        text = tmp_dataset["raw"]["txt"]
        keyphrases = default_corpus.label_text(text)
        with open(output + key + ".ann", "w", encoding="utf-8") as fout:
            fout.write(kl.keyphrases2brat(keyphrases))
        # The .txt file holds the raw document text that the .ann offsets refer to.
        with open(output + key + ".txt", "w", encoding="utf-8") as fout:
            fout.write(text)
    print("Written docs: %d \
          \n - Minimum occurence of PoS sequences to train the model: %d" % (i + 1, fl))

Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 3
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 4
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 5
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 6
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 7
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 8
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 9
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 10
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 11
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 12
Written docs: 100           
 - Minimum occurence of PoS sequences to train the model: 13
Written docs: 100