CorpConv

Introduction

CorpConv is a tool for converting corpora between several common formats: conll, osl, tsv and vrt.

Installation

CorpConv can be installed with pip:

pip install CorpConv

Alternatively, you can download and decompress the latest release or clone the git repository:

git clone https://github.com/tsproisl/CorpConv.git

In the new directory, run the following command:

python3 setup.py install

Usage

Using the corpconv executable

You can use the converter as a standalone program from the command line. General usage information is available via the -h option:

corpconv -h

To convert a corpus from one format to another, call the converter like this:

corpconv -i <input_format> -o <output_format> <file>

Supported formats are:

conll: Tab-separated, one token per line with token IDs, empty line after sentences, empty fields marked with an underscore (_), sentence IDs as leading comments (e.g. # sent_id = 5)

# sent_id = 1
1   They     they    PRON    PRP
2   buy      buy     VERB    VBP
3   and      and     CONJ    CC
4   sell     sell    VERB    VBP
5   books    book    NOUN    NNS
6   .        .       PUNCT   .
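
To make the layout concrete, here is a minimal, standalone sketch of how such input could be read. It is not CorpConv's own reader: the function name parse_conll is hypothetical, and the choice to accept the sent_id comment with or without an equals sign is an assumption made for illustration.

```python
def parse_conll(lines):
    """Yield (sentence_id, tokens) pairs from CoNLL-style lines.

    Each token is the list of its tab-separated fields, token ID first.
    Hypothetical helper, not part of CorpConv.
    """
    sent_id, tokens = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Leading comment: pick up the sentence ID if present.
            parts = line.lstrip("#").replace("=", " ").split()
            if len(parts) == 2 and parts[0] == "sent_id":
                sent_id = parts[1]
        elif line.strip() == "":
            # An empty line closes the current sentence.
            if tokens:
                yield sent_id, tokens
            sent_id, tokens = None, []
        else:
            tokens.append(line.split("\t"))
    if tokens:
        # Input ended without a trailing empty line.
        yield sent_id, tokens


sample = [
    "# sent_id = 1",
    "1\tThey\tthey\tPRON\tPRP",
    "2\tbuy\tbuy\tVERB\tVBP",
    "",
]
parsed = list(parse_conll(sample))
```

Fields that contain only an underscore are kept as-is here; a caller could map them to None if empty values need special handling.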

osl: One sentence per line, tokens separated by spaces, annotations attached to each token with a custom delimiter (/ in the example below)

They/they/PRON/PRP buy/buy/VERB/VBP and/and/CONJ/CC sell/sell/VERB/VBP books/book/NOUN/NNS ././PUNCT/.
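
The following sketch shows how a line in this layout can be split into tokens and joined back together. The helper names parse_osl_line and format_osl_line are hypothetical and the code is independent of CorpConv; it only illustrates the format.

```python
def parse_osl_line(line, delimiter="/"):
    """Split one sentence line into tokens, each a list of fields."""
    return [token.split(delimiter) for token in line.split()]


def format_osl_line(tokens, delimiter="/"):
    """Join tokens (lists of fields) back into one sentence line."""
    return " ".join(delimiter.join(fields) for fields in tokens)


line = "They/they/PRON/PRP buy/buy/VERB/VBP ././PUNCT/."
tokens = parse_osl_line(line)
round_trip = format_osl_line(tokens)
```

A naive split like this breaks when the delimiter occurs inside a field (a token such as 1/2 would yield too many fields with / as delimiter), so a delimiter that does not appear in the data is the safer choice.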

tsv: Tab-separated, one token per line, empty line after sentences

They     they    PRON    PRP
buy      buy     VERB    VBP
and      and     CONJ    CC
sell     sell    VERB    VBP
books    book    NOUN    NNS
.        .       PUNCT   .
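
As an illustration of the sentence boundaries in this format, the sketch below groups token lines into sentences at empty lines. The function name read_tsv_sentences is hypothetical; this is not how CorpConv itself is implemented.

```python
import io


def read_tsv_sentences(lines):
    """Yield sentences as lists of tokens; a token is a list of fields."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":
            # An empty line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:
        # Last sentence without a trailing empty line.
        yield sentence


text = "They\tthey\tPRON\tPRP\nbuy\tbuy\tVERB\tVBP\n\nbooks\tbook\tNOUN\tNNS\n"
tsv_sentences = list(read_tsv_sentences(io.StringIO(text)))
```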

vrt: Tab-separated, one token per line, sentences as s-tags

<s id="s1">
They     they    PRON    PRP
buy      buy     VERB    VBP
and      and     CONJ    CC
sell     sell    VERB    VBP
books    book    NOUN    NNS
.        .       PUNCT   .
</s>
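
A rough sketch of producing this layout from sentences held in memory is shown below. The function name sentences_to_vrt and the running s1, s2, ... numbering are assumptions modelled on the example above; the code does not use CorpConv and does not escape XML special characters in tokens.

```python
def sentences_to_vrt(sentences):
    """Yield VRT lines: each sentence wrapped in numbered s-tags."""
    for number, tokens in enumerate(sentences, start=1):
        yield f'<s id="s{number}">'
        for fields in tokens:
            yield "\t".join(fields)
        yield "</s>"


vrt_lines = list(sentences_to_vrt([
    [["They", "they", "PRON", "PRP"], [".", ".", "PUNCT", "."]],
]))
```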

Using the module

You can also use the readers and writers in your own Python projects. Here is a small example for converting from osl to vrt:

from corpconv import corpus_readers
from corpconv import corpus_writers

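# "corpus.osl" is a placeholder; substitute the path to your own file.
with open("corpus.osl", encoding="utf-8") as file_object:
    sentences = corpus_readers.read_osl(file_object)
    for line in corpus_writers.write_vrt(sentences):
        print(line)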
