# Welcome to PyDKPro

In this tutorial, we demonstrate how to build [DKPro Core](https://dkpro.github.io/dkpro-core/)-based pipelines, run them on input strings or files, annotate text, and use individual DKPro Core components. We also demonstrate how to interface [spaCy](https://spacy.io/) and [NLTK](http://www.nltk.org/) (Python-based NLP toolkits) with UIMA CAS objects.

## 1. Installing PyDKPro

PyDKPro supports Python 3.6 and above and uses a [Docker](https://www.docker.com/) container that hosts web services for all Java-based DKPro Core operations. To use this demo, perform the following steps:

1. Install the required Python libraries:
   
   `pip install dkpro-cassis==0.2.7 spacy==2.2.1 nltk==3.4.5`
   `python -m spacy download en_core_web_sm`
   

2. Clone the repository.

3. Open this Jupyter notebook and run the following commands:


In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))

if module_path not in sys.path:
    sys.path.append(module_path)

### Constructing Pipeline

To process a piece of text, you first need to construct a `Pipeline` from DKPro Core components. The pipeline is language-specific, so you need to specify the language first (see the examples).

- By default, the pipeline components use their default parameters. However, you can always specify which parameters to include or change. The available component parameters will be listed in the PyDKPro documentation.

**REC:** Why in the PyDKPro documentation and not in the DKPro Core documentation? Considering that you seem to want PyDKPro to support multiple DKPro Core versions, it would be quite redundant to duplicate the parameter lists of all DKPro Core versions in PyDKPro...



In [2]:
from pydkpro import Pipeline, Component

p = Pipeline(version="2.0.0", language='en')
p.add(Component().clearNlpSegmenter())

# REC: Is printTagSet not false by default?
p.add(Component().stanfordPosTagger(variant='fast-caseless', printTagSet='false'))

# REC: If it has a build method, maybe it should be a PipelineBuilder? Also, the build method should return something,
# i.e. the actual pipeline, which can maybe be used later to shut the pipeline down?
p.build() 

Container web service for the provided pipeline is fired up. To stop use finish method


<pydkpro.pipeline.Pipeline at 0x1224f7290>

### Pipeline processing

After constructing the pipeline, you run it on the text you want to process (see the example below). If the language parameter is not provided, a language detector is used to detect the language of the text. The output of the pipeline is a `cas` object, a container for accessing linguistic annotations that follow the DKPro Core type system. PyDKPro provides the DKPro Core type systems, which the CAS object uses to extract annotations such as `tokens`, `sentence`, `pos tags`, and `ner`, depending on the defined pipeline structure.

The examples below demonstrate how to extract token text and POS tags.

In [3]:
cas = p.process('Backgammon is one of the oldest known board games.', language='en')

In [4]:
from pydkpro import DKProCoreTypeSystem as dts

cas.select(dts().token).as_text() 

# REC: I wonder which version of the DKProCoreTypeSystem is used here...

# REC: Why is dts() a method? It would seem quite redundant to always have to call this method.
# At least to me, it would feel more natural to write something like this:
#
#     from pydkpro.typesystem import Token
#
#     cas.select(Token).as_text()

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games.']

In [5]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

### Adding Annotations

Similar to [DKPro cassis](https://github.com/dkpro/dkpro-cassis), to add manual annotations to a CAS object, we need to define them with the `typesystem`. For the given text, Token annotations that have an `id` and a `pos` feature can be added as follows.

**REC**: The `CAS` type in Java UIMA is written with all upper-case letters. I think it would introduce an unnecessary difference to call it `Cas` here.

**REC**: Why does `CAS(dts().typesystem)()` have two sets of parentheses? That looks very odd.

In [6]:
from pydkpro.cas import CAS

Token = dts().typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token')

cas = CAS(dts().typesystem)()
cas.sofa_string = "I like cheese ."

tokens = [
    Token(begin=0, end=1, id='0', pos='NNP'),
    Token(begin=2, end=6, id='1', pos='VBD'),
    Token(begin=7, end=13, id='2', pos='IN'),
    Token(begin=14, end=15, id='3', pos='.')
]


for token in tokens:
    cas.add_annotation(token)
    
# REC: This would probably be a request to DKPro Cassis, but...
# - To be more in line with Java UIMA, IMHO there should be an `add_to_indexes(token)` method. An annotation
#   can also be added to a CAS via a reference from another annotation. It would be serialized, but it could not be
#   retrieved via select methods unless it has also been added to the indexes.
# - It might be convenient if `add_to_indexes()` accepted multiple feature structures at once, e.g.
#   `add_to_indexes(tokens)` or `add_to_indexes(token1, token2, token3)`.

Token features can be printed as follows:

In [7]:
print([x.get_covered_text() for x in cas.select_all()])
print([x.pos for x in cas.select_all()])

['I', 'like', 'cheese', '.']
['NNP', 'VBD', 'IN', '.']
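
The `begin` and `end` values used above are character offsets into `sofa_string`, with `end` pointing one past the last covered character. The following plain-Python sketch is independent of PyDKPro and cassis; the `Span` class and the `covered_text` function are hypothetical names used only to illustrate how offsets map to covered text.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for an annotation: two offsets plus a POS value."""
    begin: int  # index of the first covered character
    end: int    # index one past the last covered character
    pos: str


def covered_text(text: str, span: Span) -> str:
    """Return the substring of `text` that the span covers."""
    return text[span.begin:span.end]


sofa_string = "I like cheese ."
spans = [
    Span(0, 1, "NNP"),
    Span(2, 6, "VBD"),
    Span(7, 13, "IN"),
    Span(14, 15, "."),
]

texts = [covered_text(sofa_string, s) for s in spans]
print(texts)
```

Because the annotations only store offsets, the text itself is never copied into them, which is why `sofa_string` has to be set before the covered text can be retrieved.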


### SpaCy interfacing

Generated CAS objects can also be converted to the spaCy type system and vice versa.

**REC:** IMHO this is not the "spaCy type system" but "spaCy annotation object model" - IMHO the type system is a concept of the CAS and the CAS is the annotation object model used by UIMA.

In [8]:
cas = p.process('Backgammon is one of the oldest known board games.', language='en')

Conversion to spaCy

**REC:** It seems strange that the conversions from/to spacy are both classes. I would find it more natural to work with a more stateless API. Mind that the CAS may contain information which cannot be passed on to spacy but the user may not want to lose this data. So IMHO the conversion process should support merging results from spacy back into the original CAS.

```
# REC proposal: a stateless converter module
from pydkpro import spacyconverter

# Export any annotations from the CAS that are supported into the spaCy object model
spacyconverter.from_dkprocore(cas, spacy_document)

# Let spaCy do some processing

# Import (merge) any annotations from spaCy that are supported back into the CAS document
spacyconverter.to_dkprocore(spacy_document, cas)

```

In [9]:
from pydkpro import To_spacy, From_spacy

for token in To_spacy(cas)(): 
    print(token.text, token.tag_) 

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Conversion from spaCy

In [30]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
cas = From_spacy(doc)()

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [11]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

### NLTK interfacing

Similar to spaCy, `NLTK` objects can also be converted into CAS objects and vice versa.

In [12]:
from pydkpro.external import To_nltk, From_nltk 

Conversion to NLTK: since this toolkit does not have a common data structure, PyDKPro provides helper functions such as `tagger` (see the example below).

In [13]:
To_nltk().tagger(cas)

[('Backgammon', 'NNP'),
 ('is', 'VBZ'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('oldest', 'JJS'),
 ('known', 'VBN'),
 ('board', 'NN'),
 ('games', 'NNS'),
 ('.', '.')]

In [14]:
import nltk
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(To_nltk().tagger(cas))

In [15]:
print(chunked)

(S
  (Chunk Backgammon/NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  oldest/JJS
  known/VBN
  board/NN
  games/NNS
  ./.)
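
The chunk grammar `{<RB.?>*<VB.?>*<NNP>}` reads: any number of adverb tags, then any number of verb tags, then exactly one `NNP` tag. The sketch below imitates that matching with only the standard `re` module, to show why `Backgammon` is the only chunk in this sentence. It is a simplification and not NLTK's implementation; the name `find_chunks` is hypothetical.

```python
import re

# Simplified translation of the tag pattern <RB.?>*<VB.?>*<NNP>:
# inside a tag, "." may match any single character except angle brackets.
CHUNK_PATTERN = re.compile(r"(?:<RB[^<>]?>)*(?:<VB[^<>]?>)*<NNP>")


def find_chunks(tagged):
    """Return the word lists whose tag sequence matches CHUNK_PATTERN."""
    # Encode the tag sequence as one string, e.g. "<NNP><VBZ><CD>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in CHUNK_PATTERN.finditer(tag_string):
        # Counting "<" converts character positions back to token indices.
        first = tag_string.count("<", 0, match.start())
        last = first + match.group().count("<")
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks


tagged = [
    ("Backgammon", "NNP"), ("is", "VBZ"), ("one", "CD"), ("of", "IN"),
    ("the", "DT"), ("oldest", "JJS"), ("known", "VBN"), ("board", "NN"),
    ("games", "NNS"), (".", "."),
]
chunks = find_chunks(tagged)
print(chunks)
```

The verb `is/VBZ` is not chunked because the pattern requires an `NNP` tag to follow the verbs, and the next tag is `CD`.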


Similar helper functions are provided for converting NLTK output into a PyDKPro CAS, as follows:

In [16]:
from nltk.tokenize import TweetTokenizer
cas = From_nltk().tokenizer(TweetTokenizer().tokenize('Backgammon is one of the oldest known board games.'))
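
`From_nltk().tokenizer` receives only a list of token strings, whereas CAS annotations are addressed by character offsets. How PyDKPro derives those offsets is not shown in this tutorial. The sketch below is one possible approach, assuming the original text is still available; the function name `align_tokens` is hypothetical.

```python
from typing import List, Tuple


def align_tokens(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in `text`, left to right, and return (begin, end) pairs."""
    offsets = []
    cursor = 0  # search only after the previous token to keep the order
    for token in tokens:
        begin = text.find(token, cursor)
        if begin == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = begin + len(token)
        offsets.append((begin, end))
        cursor = end
    return offsets


text = "Backgammon is one of the oldest known board games."
tokens = ["Backgammon", "is", "one", "of", "the", "oldest",
          "known", "board", "games", "."]
offsets = align_tokens(text, tokens)
print(offsets)
```

This only works for tokenizers that leave the token text unchanged; a tokenizer that normalizes characters would need a more tolerant alignment.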

### CAS processing

A PyDKPro pipeline can also process a CAS object directly, as the example below demonstrates:

In [17]:
p = Pipeline()
p.add(Component().stanfordPosTagger())
p.build()

Container web service for the provided pipeline is fired up. To stop use finish method


<pydkpro.pipeline.Pipeline at 0x124685890>

In [18]:
cas = p.process(cas)

In [19]:
cas.select(dts().token).as_text() 

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

In [20]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

### Shortcut for running single components

A single component can also be run without the need to build a pipeline first:

**REC:** It seems like `Component()` is a factory, so maybe call it `ComponentFactory()` - is it necessary to be an object? Why not use "static" factory methods? It seems strange that `clearNlpSegmenter()` constructs a component - it sounds more like a method that is "clearing the NLP segmenter". How about something like `ComponentFactory.create('ClearNlpSegmenter')` or `ComponentFactory.create_ClearNlpSegmenter()` instead? Since you aim to support multiple DKPro Core versions which might come with different sets of components, it might be wise not to hard-code the component names.

In [21]:
tokenizer = Component().clearNlpSegmenter() 

In [22]:
cas = tokenizer.process('Backgammon is one of the oldest known board games.')
cas.select(dts().token).as_text()

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

### Working with list of strings

Multiple strings in a list can also be processed; each element of the list is treated as a separate document.

In [23]:
str_list = ['Backgammon is one of the oldest known board games.', 'I like playing cricket.']

In [24]:
# Use a name other than `str` so the built-in type is not shadowed
for text in str_list:
    cas = p.process(text)
    print(cas.select(dts().token).as_text())

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games.']
['I', 'like', 'playing', 'cricket.']


### Working with text documents

Pipelines can also be directly run on text documents:

**REC:** Why is File2str a class? It would seem a utility method like `read_as_string('blah.txt')` would make more sense since there shouldn't be any need to maintain an object/object state. It would also be good to be able to include a file encoding here. Btw, is there not a simple method for reading a file into a string in Python?

In [31]:
from pydkpro.external import File2str

In [26]:
cas = p.process(File2str('test_data/input/test2.txt')())

In [27]:
cas.select(dts().token).as_text()

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']
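
On the question raised in the review note above: the Python standard library can read a file into a string directly and lets the encoding be stated explicitly. The sketch below uses `pathlib.Path.read_text`; the helper name `read_as_string` is taken from the review note and is not part of PyDKPro.

```python
import tempfile
from pathlib import Path


def read_as_string(path, encoding="utf-8"):
    """Read a whole text file into a string using an explicit encoding."""
    return Path(path).read_text(encoding=encoding)


# A temporary file keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text("Backgammon is one of the oldest known board games.",
                      encoding="utf-8")
    text = read_as_string(sample)

print(text)
```

The resulting string could then be passed to `p.process(...)` like any other input text.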

### Working with multiple text documents

Multiple documents can also be processed by providing their paths or file name matching patterns.

In [28]:
# Documents located at different paths can be provided in a list
docs = ['test_data/input/1.txt', 'test_data/input/2.txt']
for doc in docs:
    p.process(File2str(doc)())
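
The loop above lists each path explicitly. When documents should be selected by a file name pattern instead, the standard `pathlib` module can collect the matching paths first. This sketch is independent of PyDKPro; the function name `find_documents` and the default pattern `*.txt` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_documents(directory, pattern="*.txt"):
    """Return the files in `directory` that match `pattern`, sorted by path."""
    return sorted(path for path in Path(directory).glob(pattern) if path.is_file())


# A temporary directory with three files keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("1.txt", "2.txt", "notes.md"):
        (Path(tmp_dir) / name).write_text("example", encoding="utf-8")
    names = [path.name for path in find_documents(tmp_dir)]

print(names)
```

Each returned path could then be read into a string and passed to `p.process(...)` as in the loop above.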

### End collection process

The following command completes the pipeline's collection process (alternatively, a `with` statement can be used):

In [29]:
p.finish()

0
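
The `with` alternative mentioned above is not shown in this tutorial. As a rough sketch of the idea, the standard `contextlib` module can wrap any object that has a `finish()` method so that `finish()` runs even if processing raises an exception. `DummyPipeline` and `finishing` are hypothetical names; `DummyPipeline` only stands in for a built pipeline so that the example runs without Docker.

```python
from contextlib import contextmanager


class DummyPipeline:
    """Stand-in for a built pipeline; it records whether finish() was called."""

    def __init__(self):
        self.finished = False

    def process(self, text):
        # Placeholder processing: split on whitespace.
        return text.split()

    def finish(self):
        self.finished = True
        return 0


@contextmanager
def finishing(pipeline):
    """Yield the pipeline and always call its finish() method afterwards."""
    try:
        yield pipeline
    finally:
        pipeline.finish()


pipeline = DummyPipeline()
with finishing(pipeline) as active:
    tokens = active.process("I like playing cricket.")

print(tokens, pipeline.finished)
```

The `try`/`finally` inside `finishing` is what guarantees cleanup, so a forgotten `finish()` call cannot leave the container web service running.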