# A Walk Through Forpus

[Forpus](https://severinsimmler.github.io/forpus) is a Python library for converting plain text corpora into various corpus formats. Most NLP tools expect their own idiosyncratic input format. This library helps you convert a corpus to the desired format easily.

In [1]:
import json
import re
from pathlib import Path
from collections import Counter
import pandas as pd

import forpus

In [2]:
SOURCE = Path('..', 'corpus')
FNAME_PATTERN = '{author}, {title}'

## 1. Converting to JSON

In [3]:
Corpus = forpus.Corpus(source=SOURCE, # the source directory
                       target='json', # the target directory
                       fname_pattern=FNAME_PATTERN)

### 1.1. Calling the method

In [4]:
Corpus.to_json()

### 1.2. Checking the output

In [5]:
with Path('json', 'corpus.json').open('r', encoding='utf-8') as file:
    print(json.load(file))

{'mary_doc3': {'stem': 'mary_doc3', 'text': 'Mary has written the third and last document, but this is also pretty nice.\n'}, 'peter_doc1': {'stem': 'peter_doc1', 'text': "This is the first document. It's written by Peter. And it contains a lot of words.\n"}, 'paul_doc2': {'stem': 'paul_doc2', 'text': 'There is also a second document. This one is by Paul. Furthermore, this also contains a lot of tokens.\n'}}
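
Because the file maps each document's stem to a record with `stem` and `text` keys, it is easy to work with once loaded. The following sketch uses a small inline sample that mirrors the structure printed above instead of reading `json/corpus.json`, so that it runs on its own; the variable names are only illustrative.

```python
import json

# Inline sample mirroring the structure of corpus.json shown above.
raw = json.dumps({
    'mary_doc3': {
        'stem': 'mary_doc3',
        'text': 'Mary has written the third and last document, '
                'but this is also pretty nice.\n'},
    'peter_doc1': {
        'stem': 'peter_doc1',
        'text': "This is the first document. It's written by Peter. "
                "And it contains a lot of words.\n"},
})

corpus = json.loads(raw)

# Map each document stem to its text, without the trailing newline.
texts = {stem: record['text'].strip() for stem, record in corpus.items()}

for stem, text in texts.items():
    print(stem, text, sep='\t')
```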


## 2. Converting to a document-term matrix

In [6]:
Corpus = forpus.Corpus(source=SOURCE, # the source directory
                       target='matrix', # the target directory
                       fname_pattern=FNAME_PATTERN)

### 2.1. Creating preprocessing functions
Some corpus formats require a tokenized corpus. You can define a simple regex-based tokenizer as below, or use a library such as [NLTK](https://www.nltk.org).

In [7]:
def tokenizer(document):
    return re.compile(r'\w+').findall(document.lower())
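
To see what this tokenizer produces, the sketch below applies it to a sentence from the sample corpus. Note that `\w+` splits on the apostrophe, which is why a lone `s` shows up in the vocabulary of the outputs in this walk-through.

```python
import re


def tokenizer(document):
    # Lowercase the text, then collect runs of word characters.
    return re.compile(r'\w+').findall(document.lower())


tokens = tokenizer("It's written by Peter.")
print(tokens)  # ['it', 's', 'written', 'by', 'peter']
```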

You can pass additional preprocessing functions, which are applied to the return value of your tokenizer. To remove stopwords, for example, you can use a function like the one below.

In [8]:
def drop_stopwords(tokens, stopwords=('the', 'and')):
    return [token for token in tokens if token not in stopwords]
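
How such functions chain together can be sketched as follows. The helper `preprocess` is hypothetical: it only illustrates the idea of feeding the tokenizer's return value through each further function in turn, and is not Forpus's internal code.

```python
import re


def tokenizer(document):
    return re.compile(r'\w+').findall(document.lower())


def drop_stopwords(tokens, stopwords=('the', 'and')):
    return [token for token in tokens if token not in stopwords]


def preprocess(document, tokenizer, *functions):
    # Hypothetical helper: tokenize first, then apply each function in order.
    tokens = tokenizer(document)
    for function in functions:
        tokens = function(tokens)
    return tokens


result = preprocess('Mary has written the third and last document.',
                    tokenizer, drop_stopwords)
print(result)  # ['mary', 'has', 'written', 'third', 'last', 'document']
```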

### 2.2. Calling the method

Pass each function as an argument to the method `to_document_term_matrix()`.

This example uses the class `collections.Counter` as the counter, but you could write (or use) a function that, for example, normalizes the frequencies.

In [9]:
Corpus.to_document_term_matrix(tokenizer=tokenizer,
                               counter=Counter,
                               drop_stopwords=drop_stopwords)

### 2.3. Checking the output

In [10]:
pd.read_csv(Path('matrix', 'corpus.matrix'), index_col=0)

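| Unnamed: 0 | is | this | also | document | a | lot | written | contains | it | by | ... | nice | pretty | third | one | furthermore | words | there | s | first | tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mary_doc3 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| peter_doc1 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| paul_doc2 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |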
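
The matrix above holds absolute counts because `collections.Counter` was passed as the counter. A function that normalizes the frequencies could look like the minimal sketch below. It assumes that the `counter` argument only needs to be a callable that takes a list of tokens and returns a mapping from token to value; this walk-through does not spell out that contract, and `relative_frequencies` is a hypothetical name.

```python
from collections import Counter


def relative_frequencies(tokens):
    # Count the tokens, then divide each count by the document length.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


freqs = relative_frequencies(['it', 's', 'written', 'by', 'it'])
print(freqs['it'])  # 0.4
```

Under that assumption, you would pass `counter=relative_frequencies` instead of `counter=Counter`.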


## 3. Converting to a graph

In [11]:
Corpus = forpus.Corpus(source=SOURCE, # the source directory
                       target='graph', # the target directory
                       fname_pattern=FNAME_PATTERN)

### 3.1. Calling the method

In [12]:
Corpus.to_graph(tokenizer=tokenizer,
                counter=Counter,
                variant='gexf', # there are more variants
                drop_stopwords=drop_stopwords)

### 3.2. Checking the output

In [13]:
with Path('graph', 'corpus.gexf').open('r', encoding='utf-8') as file:
    print(file.read())

<?xml version='1.0' encoding='utf-8'?>
<gexf version="1.2" xmlns="http://www.gexf.net/1.2draft" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/XMLSchema-instance">
  <graph defaultedgetype="directed" mode="static" name="">
    <attributes class="edge" mode="static">
      <attribute id="1" title="frequency" type="long" />
    </attributes>
    <attributes class="node" mode="static">
      <attribute id="0" title="stem" type="string" />
    </attributes>
    <meta>
      <creator>NetworkX 2.0</creator>
      <lastmodified>16/03/2018</lastmodified>
    </meta>
    <nodes>
      <node id="mary_doc3" label="mary_doc3">
        <attvalues>
          <attvalue for="0" value="mary_doc3" />
        </attvalues>
      </node>
      <node id="mary" label="mary" />
      <node id="has" label="has" />
      <node id="written" label="written" />
      <node id="third" label="third" />
      <node id="last" label="last" />
      <node id="document" l
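
Judging from the output above, the graph contains a node for each document and for each token, and edges carry an attribute called `frequency`. The sketch below builds a matching edge list with the standard library only. It assumes that each document is linked to the tokens it contains, which the truncated output does not show; it illustrates that inferred structure and is not Forpus's own code, and `document_edges` is a hypothetical name.

```python
from collections import Counter


def document_edges(stem, tokens):
    # One directed edge per distinct token: (document, token, attributes).
    return [(stem, token, {'frequency': count})
            for token, count in Counter(tokens).items()]


edges = document_edges('peter_doc1',
                       ['it', 's', 'written', 'by', 'peter', 'it'])
for source, target, attributes in edges:
    print(source, '->', target, attributes)
```

With NetworkX installed, a list like this could be passed to `DiGraph.add_edges_from()` and saved with `networkx.write_gexf()`.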

## 4. Converting to LDA-C

In [14]:
Corpus = forpus.Corpus(source=SOURCE,
                       target='ldac',
                       fname_pattern=FNAME_PATTERN)

### 4.1. Calling the method

In [15]:
Corpus.to_ldac(tokenizer=tokenizer,
               counter=Counter,
               drop_stopwords=drop_stopwords)

### 4.2. Checking the output

Three files have been generated:

1. `corpus.ldac` contains one document per line, in the format described by [David Blei](https://github.com/blei-lab/lda-c/blob/master/readme.txt).
2. `corpus.tokens` contains the vocabulary, one type per line. The line index (starting at 0) is the type index.
3. `corpus.metadata` contains the metadata extracted from the filenames. This is a simple CSV file.

In [16]:
with Path('ldac', 'corpus.ldac').open('r', encoding='utf-8') as file:
    print('corpus.ldac:\n{0}\n'.format(file.read()))

with Path('ldac', 'corpus.tokens').open('r', encoding='utf-8') as file:
    print('corpus.tokens:\n{0}\n'.format(file.read()))

print(pd.read_csv(Path('ldac', 'corpus.metadata'), index_col=0))

corpus.ldac:
12 0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1
14 7:1 8:1 12:1 5:1 13:2 14:1 2:1 15:1 16:1 17:1 18:1 19:1 20:1 21:1
15 22:1 8:2 9:2 18:2 23:1 5:1 7:2 24:1 15:1 25:1 26:1 17:1 19:1 20:1 27:1


corpus.tokens:
mary
has
written
third
last
document
but
this
is
also
pretty
nice
first
it
s
by
peter
contains
a
lot
of
words
there
second
one
paul
furthermore
tokens

                                stem    basename
../corpus/mary_doc3.txt    mary_doc3   mary_doc3
../corpus/peter_doc1.txt  peter_doc1  peter_doc1
../corpus/paul_doc2.txt    paul_doc2   paul_doc2
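
Each line in `corpus.ldac` follows the pattern `M index:count index:count ...`, where `M` is the number of distinct types in the document. The sketch below reproduces the first two lines of the output above, assuming that type indices are assigned in order of first appearance. `to_ldac_line` is a hypothetical helper, not part of Forpus.

```python
from collections import Counter


def to_ldac_line(tokens, vocabulary):
    # Assign the next free index to every type not seen before.
    counts = Counter(tokens)
    for token in counts:
        vocabulary.setdefault(token, len(vocabulary))
    pairs = ' '.join('{0}:{1}'.format(vocabulary[token], count)
                     for token, count in counts.items())
    return '{0} {1}'.format(len(counts), pairs)


# Tokens of the first two documents after lowercasing and stopword removal.
mary = ['mary', 'has', 'written', 'third', 'last', 'document',
        'but', 'this', 'is', 'also', 'pretty', 'nice']
peter = ['this', 'is', 'first', 'document', 'it', 's', 'written', 'by',
         'peter', 'it', 'contains', 'a', 'lot', 'of', 'words']

vocabulary = {}
first_line = to_ldac_line(mary, vocabulary)
second_line = to_ldac_line(peter, vocabulary)
print(first_line)
print(second_line)
```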


## 5. Converting to SVMlight

In [17]:
Corpus = forpus.Corpus(source=SOURCE,
                       target='svmlight',
                       fname_pattern=FNAME_PATTERN)

### 5.1. Calling the method

In [18]:
Corpus.to_svmlight(tokenizer=tokenizer,
                   classes=[0 for n in range(3)], # each document belongs to class '0'
                   counter=Counter,
                   drop_stopwords=drop_stopwords)

### 5.2. Checking the output

Three files have been generated:

1. `corpus.svmlight` contains one document per line, in the format described by [Thorsten Joachims](http://svmlight.joachims.org/).
2. `corpus.tokens` contains the vocabulary, one type per line. The line index (starting at 1 in this format, as the output below shows) is the type index.
3. `corpus.metadata` contains the metadata extracted from the filenames. This is a simple CSV file.

In [19]:
with Path('svmlight', 'corpus.svmlight').open('r', encoding='utf-8') as file:
    print('corpus.svmlight:\n{0}\n'.format(file.read()))

with Path('svmlight', 'corpus.tokens').open('r', encoding='utf-8') as file:
    print('corpus.tokens:\n{0}\n'.format(file.read()))

print(pd.read_csv(Path('svmlight', 'corpus.metadata'), index_col=0))

corpus.svmlight:
0 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1
0 8:1 9:1 13:1 6:1 14:2 15:1 3:1 16:1 17:1 18:1 19:1 20:1 21:1 22:1
0 23:1 9:2 10:2 19:2 24:1 6:1 8:2 25:1 16:1 26:1 27:1 18:1 20:1 21:1 28:1


corpus.tokens:
mary
has
written
third
last
document
but
this
is
also
pretty
nice
first
it
s
by
peter
contains
a
lot
of
words
there
second
one
paul
furthermore
tokens

                                stem    basename
../corpus/mary_doc3.txt    mary_doc3   mary_doc3
../corpus/peter_doc1.txt  peter_doc1  peter_doc1
../corpus/paul_doc2.txt    paul_doc2   paul_doc2
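
To read a line of `corpus.svmlight` back into tokens, you can look up each feature index in `corpus.tokens`. The output above suggests that the feature indices count vocabulary lines from 1 rather than 0; the sketch below relies on that observation. `parse_svmlight_line` is a hypothetical helper, and the input is a shortened version of the second line above.

```python
def parse_svmlight_line(line, vocabulary):
    # The first field is the class label, the rest are index:value pairs.
    label, *pairs = line.split()
    counts = {}
    for pair in pairs:
        index, value = pair.split(':')
        # Feature indices start at 1, list indices at 0.
        counts[vocabulary[int(index) - 1]] = int(value)
    return int(label), counts


# The first 15 lines of corpus.tokens.
vocabulary = ['mary', 'has', 'written', 'third', 'last', 'document', 'but',
              'this', 'is', 'also', 'pretty', 'nice', 'first', 'it', 's']

label, counts = parse_svmlight_line('0 8:1 9:1 13:1 6:1 14:2 15:1 3:1',
                                    vocabulary)
print(label, counts)
```

SVMlight's documentation asks for feature/value pairs ordered by increasing feature number, so sorting the pairs of each line may be necessary before passing the file to a strict parser.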
