Forpus is a Python library for converting plain-text corpora into various corpus formats. Most NLP tools expect their own idiosyncratic input format; this library helps you convert a corpus to the desired format easily.
It is called Forpus because you are formatting a corpus; Forpus is also a genus of parrots in the family Psittacidae.
This library supports conversions to:
- JSON
- Document-term matrix
- Graph
- David Blei's LDA-C
- Thorsten Joachims' SVMlight
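
To give a feel for what two of these target formats look like, here is a minimal sketch in plain Python. It does not use Forpus and does not reflect its API; the helper names (`build_vocabulary`, `to_ldac_line`, `to_svmlight_line`) and the toy documents are hypothetical. It assumes the usual conventions of each format: an LDA-C line starts with the number of distinct terms followed by `term_id:count` pairs, and an SVMlight line starts with a target value followed by `feature_id:value` pairs with feature IDs starting at 1 and in increasing order.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to an integer ID, in order of first appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_ldac_line(tokens, vocabulary):
    """Format one document as an LDA-C line: '<n_distinct> <id>:<count> ...'."""
    counts = Counter(vocabulary[token] for token in tokens)
    pairs = [f"{term_id}:{count}" for term_id, count in sorted(counts.items())]
    return " ".join([str(len(counts))] + pairs)


def to_svmlight_line(tokens, vocabulary, target=0):
    """Format one document as an SVMlight line: '<target> <id>:<value> ...'.

    Feature IDs are shifted by one because SVMlight features start at 1.
    The default target of 0 is only a placeholder (assumption).
    """
    counts = Counter(vocabulary[token] + 1 for token in tokens)
    pairs = [f"{feature_id}:{count}" for feature_id, count in sorted(counts.items())]
    return " ".join([str(target)] + pairs)


# Two toy, already tokenized documents (hypothetical data).
documents = [["parrot", "green", "parrot"], ["green", "bird"]]
vocabulary = build_vocabulary(documents)

ldac_lines = [to_ldac_line(tokens, vocabulary) for tokens in documents]
svmlight_lines = [to_svmlight_line(tokens, vocabulary) for tokens in documents]

print("\n".join(ldac_lines))
print("\n".join(svmlight_lines))
```

Both formats encode the same bag-of-words counts; they differ mainly in the leading field and in whether IDs start at 0 or 1, which is the kind of detail a conversion library takes care of.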
Forpus requires Python 3.6 and some additional libraries:
- pandas, at least v0.21.1
- networkx, at least v2.0
- metadata-toolbox, at least v0.1
See Getting Started for how to install Forpus.