Merge 9221709 into 6ea1d53
matgrioni committed Nov 11, 2018
2 parents 6ea1d53 + 9221709 commit bba8d14
Showing 40 changed files with 1,318 additions and 169 deletions.
545 changes: 545 additions & 0 deletions .pylintrc

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions .travis.yml
@@ -8,5 +8,6 @@ install:
- pip install coveralls
script:
- make coveragetest
- make lint
after_success:
- coveralls
18 changes: 18 additions & 0 deletions CHANGELOG
@@ -9,6 +9,24 @@ The format is based on Keep a Changelog and this project adheres to
Semantic Versioning.


[1.1.0] - 2018-11-11

Added

- pylint to build process
- Conllable abstract base class to mark CoNLL serializable components
- Tree data type construction of a sentence

Changed

- Linting patches suggested by pylint.
- Removed _end_line_number from Sentence constructor. This is an
internal patch, as this parameter was not meant to be used
by callers.
- New, improved, and clearer documentation
- Update of requests dependency due to security flaw


[1.0.1] - 2018-09-14

Changed
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,18 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [1.1.0] - 2018-11-11
### Added
- ``pylint`` to build process
- ``Conllable`` abstract base class to mark CoNLL serializable components
- Tree data type construction of a sentence

### Changed
- Linting patches suggested by ``pylint``.
- Removed ``_end_line_number`` from ``Sentence`` constructor. This is an internal patch, as this parameter was not meant to be used by callers.
- New, improved, and clearer documentation
- Update of ``requests`` dependency due to security flaw

## [1.0.1] - 2018-09-14
### Changed
- Removed test packages from final shipped package.
21 changes: 21 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,27 @@ The format is based on `Keep a
Changelog <http://keepachangelog.com/en/1.0.0/>`__ and this project
adheres to `Semantic Versioning <http://semver.org/spec/v2.0.0.html>`__.

[1.1.0] - 2018-11-11
--------------------

Added
~~~~~

- ``pylint`` to build process
- ``Conllable`` abstract base class to mark CoNLL serializable
components
- Tree data type construction of a sentence

Changed
~~~~~~~

- Linting patches suggested by ``pylint``.
- Removed ``_end_line_number`` from ``Sentence`` constructor. This is
an internal patch, as this parameter was not meant to be used by
callers.
- New, improved, and clearer documentation
- Update of ``requests`` dependency due to security flaw

[1.0.1] - 2018-09-14
--------------------

3 changes: 3 additions & 0 deletions Makefile
@@ -3,6 +3,9 @@
format:
yapf -p -r -i pyconll/ tests/

lint:
pylint --rcfile .pylintrc pyconll/

test:
python -m pytest -vv

6 changes: 3 additions & 3 deletions README
@@ -5,7 +5,7 @@ pyconll

_Easily work with CONLL files using the familiar syntax of PYTHON._

The current version is 1.0.1. This version is fully functional, stable,
The current version is 1.1.0. This version is fully functional, stable,
tested, documented, and actively developed.

Links
@@ -76,8 +76,8 @@ Backporting to python 2.7 is not in future plans.
Documentation

The full API documentation can be found online at
https://pyconll.readthedocs.io/. A growing number of examples can be
found in the examples folder.
https://pyconll.readthedocs.io/. Examples can be found in the examples
folder and also in the tests folder.

Contributing

4 changes: 2 additions & 2 deletions README.md
@@ -6,7 +6,7 @@

*Easily work with **CoNLL** files using the familiar syntax of **python**.*

The current version is 1.0.1. This version is fully functional, stable, tested, documented, and actively developed.
The current version is 1.1.0. This version is fully functional, stable, tested, documented, and actively developed.

##### Links
- [Homepage](https://pyconll.github.io)
@@ -56,7 +56,7 @@ This package is designed for, and only tested with python 3.4 and above. Backpor

### Documentation

The full API documentation can be found online at [https://pyconll.readthedocs.io/](https://pyconll.readthedocs.io/). A growing number of examples can be found in the `examples` folder.
The full API documentation can be found online at [https://pyconll.readthedocs.io/](https://pyconll.readthedocs.io/). Examples can be found in the `examples` folder and also in the ``tests`` folder.


### Contributing
6 changes: 3 additions & 3 deletions README.rst
@@ -6,7 +6,7 @@ pyconll
*Easily work with **CoNLL** files using the familiar syntax of
**python**.*

The current version is 1.0.1. This version is fully functional, stable,
The current version is 1.1.0. This version is fully functional, stable,
tested, documented, and actively developed.

Links
@@ -87,8 +87,8 @@ Documentation
~~~~~~~~~~~~~

The full API documentation can be found online at
https://pyconll.readthedocs.io/. A growing number of examples can be
found in the ``examples`` folder.
https://pyconll.readthedocs.io/. Examples can be found in the
``examples`` folder and also in the ``tests`` folder.

Contributing
~~~~~~~~~~~~
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -29,9 +29,9 @@
author = 'Matias Grioni'

# The short X.Y version
version = '0.1'
version = '1.0.2'
# The full version, including alpha/beta/rc tags
release = '0.1'
release = '1.0.2'


# -- General configuration ---------------------------------------------------
5 changes: 4 additions & 1 deletion docs/index.rst
@@ -9,15 +9,18 @@ pyconll
README <readme>
changelog

pyconll/conllable
pyconll/exception
pyconll/load
pyconll/tree/sentencetree
pyconll/tree/tree
pyconll/util
pyconll/unit/conll
pyconll/unit/sentence
pyconll/unit/token


This is the homepage of the ``pyconll`` documentation. Here you can find most of the information you need about module interfaces, changes in previous versions, and example code. Simply look to the table of contents above for more info.
This is the homepage for ``pyconll`` documentation. Here you can find module interfaces, changelogs, and example code. Simply look above to the table of contents for more info.

If you are looking for example code, please see the ``examples`` directory on github_.

11 changes: 11 additions & 0 deletions docs/pyconll/conllable.rst
@@ -0,0 +1,11 @@
conllable
===================================

``Conllable`` marks a class that can be output as a CoNLL formatted string. ``Conllable`` classes implement a ``conll`` method.

API
----------------------------------
.. automodule:: pyconll.conllable
:members:
:special-members:
:exclude-members: __dict__, __weakref__
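
The idea of an abstract base class that marks CoNLL serializable components can be sketched with the standard ``abc`` module. This is a minimal illustration and not the code from ``pyconll.conllable``: only the ``Conllable`` name and the ``conll`` method come from the description above, while ``CommentLine`` and ``is_abstract_enforced`` are hypothetical names invented for the example.

```python
import abc


class Conllable(abc.ABC):
    """Marks a component that can be serialized to a CoNLL string.

    Sketch only; the real pyconll class may differ in detail.
    """

    @abc.abstractmethod
    def conll(self):
        """Return this component as a CoNLL formatted string."""


class CommentLine(Conllable):
    """Hypothetical component used only for illustration."""

    def __init__(self, key, value=None):
        self.key = key
        self.value = value

    def conll(self):
        # A comment without a value is written as a singleton.
        if self.value is None:
            return '# {}'.format(self.key)
        return '# {} = {}'.format(self.key, self.value)


def is_abstract_enforced():
    """Return True if Conllable itself cannot be instantiated."""
    try:
        Conllable()
    except TypeError:
        return True
    return False
```

Because ``conll`` is declared abstract, any subclass that forgets to implement it fails at instantiation time rather than at serialization time.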
4 changes: 3 additions & 1 deletion docs/pyconll/exception.rst
@@ -1,7 +1,9 @@
exception
===================================

These are custom exceptions for pyconll. Right now, this only consists of a ``ParseError``.
Custom exceptions for pyconll. These errors are a ``ParseError`` and a ``FormatError``.

A ``ParseError`` occurs when the source input to a CoNLL component is invalid, and a ``FormatError`` occurs when the internal state of the component is invalid, and the component cannot be output to a CoNLL string.


API
4 changes: 2 additions & 2 deletions docs/pyconll/load.rst
@@ -1,14 +1,14 @@
load
===================================

This is the main module you should interface with if wanting to load an entire CoNLL file, rather than individual sentences which should be less common. The API allows for loading CoNLL data from a string or from a file, and allows for iteration over the data, rather than storing a large CoNLL object in memory if so desired.
This is the main module to interface with to load an entire CoNLL treebank resource. The module defines methods for loading a CoNLL treebank through a string, file, or network. There also exist methods that iterate over the CoNLL resource data rather than storing the large CoNLL object in memory, if so desired.

Note that the fully qualified name is ``pyconll.load``, but these methods can also be accessed using the ``pyconll`` namespace.


Example
-----------------------------------
This example counts the number of times a token with a lemma of ``linguistic`` appeared in the treebank. Note that if all the operations that will be done on the CoNLL file are readonly, consider using the ``iter_from`` alternatives. These methods will return an iterator over each sentence in the CoNLL file rather than storing an entire CoNLL object in memory, which can be convenient when dealing with large files that do not need to persist.
This example counts the number of times a token with a lemma of ``linguistic`` appears in the treebank. If all the operations that will be done on the CoNLL file are read-only or are data aggregations, the ``iter_from`` alternatives are more efficient and recommended. These methods return an iterator over the sentences in the CoNLL resource rather than storing the CoNLL object in memory, which can be convenient when dealing with large files that do not need to be completely loaded.

::

13 changes: 13 additions & 0 deletions docs/pyconll/tree/sentencetree.rst
@@ -0,0 +1,13 @@
sentencetree
===================================

A ``SentenceTree`` is a thin wrapper around a ``Sentence`` that also provides a tree based representation of the sentence. The sentence for a ``SentenceTree`` can be retrieved through the ``sentence`` property and the tree through the ``tree`` property.

This wrapper is very bare currently and only seeks to create the tree based representation, and does not provide additional logic. Please create a github issue if you would like to see functionality added in this area!

API
----------------------------------
.. automodule:: pyconll.tree.sentencetree
:members:
:special-members:
:exclude-members: __dict__, __weakref__
11 changes: 11 additions & 0 deletions docs/pyconll/tree/tree.rst
@@ -0,0 +1,11 @@
tree
===================================

``Tree`` is a very basic immutable tree class. A ``Tree`` can have multiple children and has one parent. The parent of a tree is established when a ``Tree`` is added as the child of another ``Tree``.

API
----------------------------------
.. automodule:: pyconll.tree.tree
:members:
:special-members:
:exclude-members: __dict__, __weakref__
4 changes: 2 additions & 2 deletions docs/pyconll/unit/conll.rst
@@ -1,9 +1,9 @@
conll
===================================

A collection of CoNLL annotated sentences. This collection should rarely be created by API callers, that is what the ``pyconll.load`` module is for which allows for easy APIs to load CoNLL files from a string or file (no network yet). The Conll object can be thought of as a simple list of sentences. There is very little more of a wrapper around this.
A collection of CoNLL annotated sentences. For creating new instances of this object, API callers should use the ``pyconll.load`` module to abstract over the resource type. The ``Conll`` object can be thought of as a simple wrapper around a list of sentences that can be serialized into a CoNLL format.

``Conll`` is a subclass of ``MutableSequence`` this means that ``append``, ``reverse``, ``extend``, ``pop``, ``remove``, and ``__iadd__`` are available free of charge. There is no implementation of them, but they are provided by ``MutableSequence`` by implementing the base abstract methods. This means that ``Conll`` behaves almost exactly like a ``list`` with the same methods.
``Conll`` is a subclass of ``MutableSequence``, so ``append``, ``reverse``, ``extend``, ``pop``, ``remove``, and ``__iadd__`` are available free of charge, even though they are not defined below.
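
The "free of charge" methods come from ``collections.abc.MutableSequence``, which supplies them as mixins once five base methods are defined. The sketch below shows this with a hypothetical ``SentenceList`` class backed by a plain list; it is not the ``Conll`` implementation and it stores strings instead of sentences to stay self-contained.

```python
from collections.abc import MutableSequence


class SentenceList(MutableSequence):
    """Hypothetical list-backed container, not pyconll's Conll class."""

    def __init__(self, items=()):
        self._items = list(items)

    # The five methods below are the only ones MutableSequence requires.
    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        self._items[index] = value

    def __delitem__(self, index):
        del self._items[index]

    def __len__(self):
        return len(self._items)

    def insert(self, index, value):
        self._items.insert(index, value)


sentences = SentenceList(['s1', 's2'])

# None of these methods are defined on SentenceList itself.
sentences.append('s3')
sentences.extend(['s4'])
sentences += ['s5']
sentences.reverse()
last = sentences.pop()
sentences.remove('s3')
```

After these calls ``last`` holds ``'s1'`` and the container holds ``'s5'``, ``'s4'``, and ``'s2'``, in that order.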


API
10 changes: 5 additions & 5 deletions docs/pyconll/unit/sentence.rst
@@ -1,21 +1,21 @@
sentence
===================================

The Sentence module represents an entire CoNLL sentence. A sentence is composed of two main parts, the comments and the tokens.
The ``Sentence`` module represents an entire CoNLL sentence, which is composed of two main parts: the comments and the tokens.

Comments
----------------------------------
Comments are treated as key-value pairs, where the separating character between key and value is ``=``. If there is no ``=`` present then the comment is treated as a singleton and the corresponding value is ``None``. To access and write to these values look for values related to meta (the meta data of the sentence).
Comments are treated as key-value pairs, where the separating character between key and value is ``=``. If there is no ``=`` present then the comment is treated as a singleton, where the key is the comment string and the corresponding value is ``None``. Methods for reading and writing this data are prefixed with ``meta_``.

Some things to keep in mind is that the id and text of a sentence can be accessed through member properties directly rather than through method APIs. So ``sentence.id``, rather than ``sentence.meta_value('id')``. Note that since this API does not support changing the forms of tokens, and focuses on the annotation of tokens, the text value cannot be changed of a sentence, but all other meta values can be.
For convenience, the id and text of a sentence can be accessed through member properties directly rather than through metadata methods. So ``sentence.id``, rather than ``sentence.meta_value('id')``. Since this API does not support changing a token's form, the ``text`` comment cannot be changed.
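
The key-value treatment of comments can be sketched as a small parsing function. ``parse_comment`` is a hypothetical helper written for this illustration and is not part of the ``pyconll`` API; details such as whitespace trimming and splitting on the first ``=`` only are assumptions.

```python
def parse_comment(line):
    """Split a CoNLL-U comment line into a (key, value) pair.

    Hypothetical helper: a comment without '=' is a singleton whose
    key is the comment string and whose value is None.
    """
    body = line.lstrip('#').strip()
    if '=' not in body:
        return body, None

    # Assumption: only the first '=' separates key from value.
    key, value = body.split('=', 1)
    return key.strip(), value.strip()
```

For instance, ``parse_comment('# newpar')`` yields the key ``'newpar'`` with a value of ``None``, while ``parse_comment('# sent_id = fr-ud-dev_00001')`` yields the key ``'sent_id'`` and the value ``'fr-ud-dev_00001'``.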

Document and Paragraph ID
----------------------------------
Document and paragraph id of a sentence are automatically inferred from a CoNLL treebank given the comments on each sentence. Note that if you wish to reassign these ids, it will have to be at the sentence level, there is no simplifying API to allow for easier mass assignment of this.
The document and paragraph id of a sentence are automatically inferred from a CoNLL treebank given sentence comments. Reassigning ids must be done through comments on the sentence level, and there is no API for simplifying this reassignment.

Tokens
----------------------------------
These are the meat of the sentence. Some things to note for tokens are that they can be accessed either through id as defined in the CoNLL data as a string or as numeric index. The string id indexing allows for multitoken and null nodes to be included easily. So the same indexing syntax understands both, ``sentence['2-3']`` and ``sentence[2]``.
These are the meat of the sentence. Tokens can be accessed through their id defined in the CoNLL annotation as a string or as a numeric index. So the same indexing syntax understands ``sentence['5']``, ``sentence['2-3']``, and ``sentence[2]``.
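
One way to support both kinds of key is to dispatch on the key type inside ``__getitem__``. The sketch below uses a hypothetical ``TokenSequence`` class holding ``(id, form)`` pairs; it is not the ``Sentence`` implementation, and treating an integer as a zero-based position in file order is an assumption of the sketch.

```python
class TokenSequence:
    """Hypothetical container illustrating id and positional lookup."""

    def __init__(self, tokens):
        # tokens: iterable of (id, form) pairs in file order.
        self._tokens = list(tokens)
        self._by_id = {token[0]: token for token in self._tokens}

    def __getitem__(self, key):
        # String keys are CoNLL ids, which covers multiword ranges
        # such as '1-2'. Anything else is treated as a position.
        if isinstance(key, str):
            return self._by_id[key]
        return self._tokens[key]


seq = TokenSequence([('1-2', 'du'), ('1', 'de'), ('2', 'le'), ('3', 'chat')])
multiword = seq['1-2']
by_id = seq['1']
by_position = seq[0]
```

Here ``seq['1']`` and ``seq[0]`` return different tokens, which is why ids are kept as strings rather than converted to integers.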


API
43 changes: 13 additions & 30 deletions docs/pyconll/unit/token.rst
@@ -1,58 +1,41 @@
token
===================================

The Token module represents a single token (multiword or otherwise) in a CoNLL-U file. In text, this corresponds to one non-empty, non-comment line. Token has several members that correspond with the columns of the lines. All values are stored as strings. So ids are strings and not numeric. These fields are listed below and correspond exactly with those found in the Universal Dependencies project:

.. highlight
id
form
lemma
upos
xpos
feats
head
deprel
deps
misc

.. highlight:: none

The Token module represents a CoNLL token annotation. In a CoNLL file, this corresponds to a non-empty, non-comment line. ``Token`` members correspond directly with the Universal Dependencies CoNLL definition and all values are stored as strings. This means ids are strings as well. These fields are: ``id``, ``form``, ``lemma``, ``upos``, ``xpos``, ``feats``, ``head``, ``deprel``, ``deps``, ``misc``

Fields
-----------------------------------
Currently, all fields are strings except for ``feats``, ``deps``, and ``misc``, which are ``dicts``. There are specific semantics for each of these according to the UDv2 guidelines. Again, the current approach is for these fields to be ``dicts`` as described below rather than providing an extra interface for these fields.
All fields are strings except for ``feats``, ``deps``, and ``misc``, which are ``dicts``. Each of these fields has specific semantics per the UDv2 guidelines.

Since all of these fields are ``dicts``, modifying non existent keys will result in a ``KeyError``. This means that new values must be added as in a normal ``dict``. For ``set`` based ``dicts``, ``feats`` and specific fields of ``misc``, the new key must be assigned to an empty ``set`` to start. More details on this below.

feats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``feats`` is a dictionary of attribute value pairs, where there can be multiple values. So the values for ``feats`` is a ``set`` when parsed. The keys are ``str`` and the values are ``set``. Do not assign a value to a ``str`` or any other type. Note that any keys with empty ``sets`` will not be output.
``feats`` is a key value mapping from ``str`` to ``set``. Note that any keys with empty ``sets`` will throw an error, as all keys must have at least one feature.

deps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``deps`` is also a dictionary of attribute value pairs, where the values are tuples of cardinality 4. Most Universal Dependencies, only use a token index and relation in the ``deps``, but according to documentation, there are up to 4 components in this field, not including the token index. Note that this fixed parsing was introduced in version 1.0 and is not backward compatible. When adding new ``deps``, the values should also be of 4 tuples therefore.
``deps`` is a key value mapping from ``str`` to ``tuple`` of cardinality 4. Most Universal Dependencies treebanks only use 2 of these 4 dimensions: the token index and the relation. See the Universal Dependencies guideline for more information on these 4 components. When adding new ``deps``, the values must also be tuples of cardinality 4. Note that ``deps`` parsing is broken before version 1.0.

misc
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lastly, for ``misc``, the documentation only specifies that the values are separated by a '|'. So the values can either be an attribute values pair like ``feats`` or it can be a single value. So for this reason, the value for ``misc`` is either ``None`` for entries with no '=', and an attribute values pair, otherwise, with the value being a ``set`` of ``str``. A key with a value of ``None`` is output as a singleton, while a key with an empty ``set`` is not output like with ``feats``.
Lastly, for ``misc``, the documentation only specifies that the values are separated by a '|', so not all components have to have a value. The values on ``misc`` are therefore either ``None`` for entries with no '=', or a ``set`` of ``str``. A key with a value of ``None`` is output as a singleton.
When adding a new key, the key must first be initialized manually as so:

.. highlight
Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Below is an example of adding a new feature to a token, where the key must first be initialized:

token.misc['NewFeature'] = set(('No', ))
.. code-block:: python
.. highlight:: python
token.feats['NewFeature'] = set(('No', ))
or alternatively as:

.. highlight
token.misc['NewFeature'] = set()
token.misc['NewFeature'].add('No')
.. code-block:: python
.. highlight:: python
token.feats['NewFeature'] = set()
token.feats['NewFeature'].add('No')
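
To make the ``dict`` semantics concrete, the sketch below turns such a mapping back into a column string. ``serialize_pairs`` is a hypothetical function, not part of ``pyconll``: the alphabetical ordering, the use of ``_`` for an empty column, and ``ValueError`` as a stand-in for the library's own error type are all assumptions made for the illustration.

```python
def serialize_pairs(values):
    """Serialize a feats or misc style dict into a column string.

    Hypothetical sketch: None is written as a singleton key, a
    non-empty set as Key=Val1,Val2, and an empty set is rejected.
    """
    if not values:
        # Assumption: an empty column is written as an underscore.
        return '_'

    parts = []
    for key in sorted(values):
        value = values[key]
        if value is None:
            parts.append(key)
        elif not value:
            # Stand-in for the library's own format error.
            raise ValueError('key {!r} has no values'.format(key))
        else:
            parts.append('{}={}'.format(key, ','.join(sorted(value))))

    return '|'.join(parts)
```

With this sketch, a mapping of ``SpaceAfter`` to ``{'No'}`` and ``Polysemous`` to ``None`` serializes with ``Polysemous`` as a bare key next to ``SpaceAfter=No``.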
API
2 changes: 1 addition & 1 deletion docs/pyconll/util.rst
@@ -1,7 +1,7 @@
util
===================================

This is a module that provides some useful functionality on top of pyconll. This adds logic on top of the API layer rather than extending it. Right now this module is pretty sparse, but it can be easily extended as demand arises.
This module provides additional, common methods that build off of the API layer. This module simply adds logic, rather than extending the API. Right now this module is pretty sparse, but will be extended as needed.


API
5 changes: 4 additions & 1 deletion examples/add_misc_features.py
@@ -19,7 +19,10 @@
for sentence in corpus:
for token in sentence:
if token.lemma == 'dog' and token.upos == 'VERB':
token.misc['Polysemous'] = True
# Note: This means that 'Polysemous' will be present as a singleton
# in the token line. To remove 'Polysemous' from the token's
# features, call del token.misc['Polysemous']
token.misc['Polysemous'] = None

# Print to standard out which can then be redirected.
print(corpus.conll())
5 changes: 3 additions & 2 deletions examples/reannotate_ngram.py
@@ -3,8 +3,9 @@
#
# Change the annotation of a certain ngram. In this case, change any instances
# of 'de plus' (FR) to a fixed annotation if they were previously connected.
# The condition of having been previously connected requires conditions that are
# difficult for most other tools I've seen to handle inside their DSL.
# A variation of this, where right or left headedness of a relation is
# important, can be quite difficult or impossible to express in some DSL query
# languages for CoNLL-U.
#
# Format:
# reannotate_ngram.py filename > transform.conll
