# A Walk Through Forpus

[Forpus](https://severinsimmler.github.io/forpus) is a Python library for converting plain-text corpora into various corpus formats. Most NLP tools expect their own idiosyncratic input format; Forpus lets you convert a corpus to the format you need with little effort.

To install Forpus, run the following command on the command line:

```bash
pip install forpus
```

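You will also need a directory containing `.txt` files. In this walk-through, the file names follow the pattern `{author}_{title}`, from which the metadata is extracted.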
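
If you do not have such a directory at hand, the following minimal sketch creates one. The three texts are copied from the output shown later in this walk-through; the names `SAMPLE_DOCUMENTS` and `write_sample_corpus` are hypothetical helpers and not part of Forpus.

```python
import tempfile
from pathlib import Path

# Sample texts; the file stems follow the '{author}_{title}' pattern.
SAMPLE_DOCUMENTS = {
    'peter_doc1': "This is the first document. It's written by Peter. "
                  "And it contains a lot of words.\n",
    'paul_doc2': 'There is also a second document. This one is by Paul. '
                 'Furthermore, this also contains a lot of tokens.\n',
    'mary_doc3': 'Mary has written the third and last document, '
                 'but this is also pretty nice.\n',
}


def write_sample_corpus(directory):
    """Write one UTF-8 encoded .txt file per sample document."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for stem, text in SAMPLE_DOCUMENTS.items():
        (directory / (stem + '.txt')).write_text(text, encoding='utf-8')
    # Return the sorted file names so the result is easy to inspect.
    return sorted(path.name for path in directory.glob('*.txt'))


# A temporary directory keeps this sketch free of side effects.
corpus_dir = Path(tempfile.mkdtemp()) / 'corpus'
print(write_sample_corpus(corpus_dir))
```

To match the paths used in the cells below, pass `Path('..', 'corpus')` instead of the temporary directory.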

In [1]:
import json
import re
import sys
from pathlib import Path

try:
    from forpus import forpus
except ModuleNotFoundError:
    # Forpus is not installed: fall back to the checkout one directory up.
    sys.path.insert(0, str(Path('.').absolute().parent))
    from forpus import forpus

In [2]:
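# Directory containing the .txt files
SOURCE = Path('..', 'corpus')
# File name pattern; each field in braces becomes a metadata column
FNAME_PATTERN = '{author}_{title}'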
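
To make the role of the file name pattern concrete, here is a minimal sketch of how a pattern such as `{author}_{title}` could be turned into metadata. This is an assumption for illustration only and not Forpus's actual implementation; `parse_fname` is a hypothetical helper.

```python
import re
from pathlib import Path


def parse_fname(path, pattern='{author}_{title}'):
    """Extract metadata from a file stem (hypothetical, not Forpus code)."""
    # Splitting on a capture group alternates literal text and field names.
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)              # literal text, e.g. '_'
        else:
            regex += '(?P<{0}>.+?)'.format(part)  # one named group per field
    match = re.fullmatch(regex, Path(path).stem)
    if match is None:
        raise ValueError('{0} does not match {1}'.format(path, pattern))
    return match.groupdict()


print(parse_fname('../corpus/mary_doc3.txt'))
```

In this sketch the first field stops at the first separator, so a stem like `mary_doc_3` would yield the author `mary` and the title `doc_3`.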

## 1. Converting to JSON

In [3]:
Corpus = forpus.Corpus(source=SOURCE,
                       target='json',
                       fname_pattern=FNAME_PATTERN)

### 1.1. Calling the method

In [4]:
Corpus.to_json()

### 1.2. Checking the output

In [5]:
with Path('json', 'corpus.json').open('r', encoding='utf-8') as file:
    print(json.load(file))

{'mary_doc3': {'author': 'mary', 'title': 'doc3', 'text': 'Mary has written the third and last document, but this is also pretty nice.\n'}, 'peter_doc1': {'author': 'peter', 'title': 'doc1', 'text': "This is the first document. It's written by Peter. And it contains a lot of words.\n"}, 'paul_doc2': {'author': 'paul', 'title': 'doc2', 'text': 'There is also a second document. This one is by Paul. Furthermore, this also contains a lot of tokens.\n'}}
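
The output above is a dictionary that maps each file stem to a record with the metadata fields and the full text. As a small sketch of how such a structure can be consumed, the hypothetical helper `summarize` below lists author, title and the number of whitespace-separated tokens per document. The two records are copied from the output; nothing here is part of Forpus.

```python
# Two records as returned by json.load() in the cell above.
corpus = {
    'mary_doc3': {
        'author': 'mary',
        'title': 'doc3',
        'text': 'Mary has written the third and last document, '
                'but this is also pretty nice.\n',
    },
    'peter_doc1': {
        'author': 'peter',
        'title': 'doc1',
        'text': "This is the first document. It's written by Peter. "
                "And it contains a lot of words.\n",
    },
}


def summarize(corpus):
    """Return (author, title, token count) for every document."""
    return [(record['author'], record['title'], len(record['text'].split()))
            for record in corpus.values()]


for author, title, n_tokens in summarize(corpus):
    print(author, title, n_tokens)
```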


## 2. Converting to LDA-C

In [6]:
Corpus = forpus.Corpus(source=SOURCE,
                       target='ldac',
                       fname_pattern=FNAME_PATTERN)

In [7]:
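def tokenizer(document):
    # Raw string, so that \w is not treated as a string escape
    return re.findall(r'\w+', document)

def drop_stopwords(tokens, stopwords=('the', 'and')):
    # Tuple default avoids a shared mutable default argument
    return [token for token in tokens if token not in stopwords]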
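
It can help to try the two callables on a single sentence before handing them to Forpus. The sketch below repeats them so that it runs on its own, and applies them to Peter's document from the sample corpus. Two details become visible: `\w+` splits `It's` into `It` and `s`, and the stop word filter is case-sensitive, so `And` survives. The case-insensitive variant `drop_stopwords_lower` is a hypothetical alternative, not something this walk-through uses.

```python
import re


def tokenizer(document):
    return re.findall(r'\w+', document)


def drop_stopwords(tokens, stopwords=('the', 'and')):
    return [token for token in tokens if token not in stopwords]


def drop_stopwords_lower(tokens, stopwords=('the', 'and')):
    # Hypothetical variant: compare the lowercased token with the stop words.
    return [token for token in tokens if token.lower() not in stopwords]


text = ("This is the first document. It's written by Peter. "
        "And it contains a lot of words.\n")
tokens = tokenizer(text)

print(drop_stopwords(tokens))        # keeps 'And'
print(drop_stopwords_lower(tokens))  # drops 'And' as well
```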

In [8]:
Corpus.to_ldac(tokenizer=tokenizer,
               drop_stopwords=drop_stopwords)
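
In the LDA-C format, each line describes one document as `N id:count id:count ...`, where `N` is the number of distinct terms and each `id` is the zero-based position of a term in the vocabulary file. The following sketch shows one way to produce this encoding. It is an assumption about the general technique, not Forpus's internal code, and `encode_ldac` is a hypothetical name; term ids are assigned in order of first appearance.

```python
import re
from collections import Counter


def encode_ldac(documents, stopwords=('the', 'and')):
    """Encode documents as LDA-C lines plus a vocabulary list (sketch)."""
    vocab = {}  # token -> integer id, in order of first appearance
    lines = []
    for document in documents:
        tokens = [token for token in re.findall(r'\w+', document)
                  if token not in stopwords]
        counts = Counter()
        for token in tokens:
            # Reuse the id of a known token, otherwise assign the next one.
            term_id = vocab.setdefault(token, len(vocab))
            counts[term_id] += 1
        pairs = ' '.join('{0}:{1}'.format(term_id, n)
                         for term_id, n in counts.items())
        lines.append('{0} {1}'.format(len(counts), pairs))
    return lines, list(vocab)


documents = [
    'Mary has written the third and last document, '
    'but this is also pretty nice.\n',
    "This is the first document. It's written by Peter. "
    "And it contains a lot of words.\n",
    'There is also a second document. This one is by Paul. '
    'Furthermore, this also contains a lot of tokens.\n',
]
lines, vocab = encode_ldac(documents)
print('\n'.join(lines))
```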

In [9]:
with Path('ldac', 'corpus.ldac').open('r', encoding='utf-8') as file:
    print('corpus.ldac:\n{0}\n'.format(file.read()))

with Path('ldac', 'corpus.vocab').open('r', encoding='utf-8') as file:
    print('corpus.vocab:\n{0}\n'.format(file.read()))

with Path('ldac', 'corpus.metadata').open('r', encoding='utf-8') as file:
    print('corpus.metadata (a simple CSV-file):\n{0}'.format(file.read()))

corpus.ldac:
12 0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1
16 12:1 8:1 13:1 5:1 14:1 15:1 2:1 16:1 17:1 18:1 19:1 20:1 21:1 22:1 23:1 24:1
16 25:1 8:2 9:2 21:2 26:1 5:1 12:1 27:1 16:1 28:1 29:1 7:1 20:1 22:1 23:1 30:1

corpus.vocab:
Mary
has
written
third
last
document
but
this
is
also
pretty
nice
This
first
It
s
by
Peter
And
it
contains
a
lot
of
words
There
second
one
Paul
Furthermore
tokens

corpus.metadata (a simple CSV-file):
,author,title
../corpus/mary_doc3.txt,mary,doc3
../corpus/peter_doc1.txt,peter,doc1
../corpus/paul_doc2.txt,paul,doc2
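
To read the `.ldac` output, each `id:count` pair has to be looked up in the vocabulary, where the id is the zero-based line number in `corpus.vocab`. A minimal sketch of this lookup, using the first line of `corpus.ldac` and the first twelve vocabulary entries from above (`decode_ldac_line` is a hypothetical helper, not part of Forpus):

```python
def decode_ldac_line(line, vocab):
    """Map one LDA-C line back to a {term: frequency} dictionary."""
    n_terms, *pairs = line.split()
    counts = {}
    for pair in pairs:
        term_id, frequency = pair.split(':')
        counts[vocab[int(term_id)]] = int(frequency)
    # The leading number declares how many distinct terms follow.
    if int(n_terms) != len(counts):
        raise ValueError('declared {0} terms, found {1}'.format(
            n_terms, len(counts)))
    return counts


vocab = ['Mary', 'has', 'written', 'third', 'last', 'document',
         'but', 'this', 'is', 'also', 'pretty', 'nice']
line = '12 0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1'

print(decode_ldac_line(line, vocab))
```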

