# Welcome to PyDKPro

In this tutorial, we demonstrate how to build [DKPro Core](https://dkpro.github.io/dkpro-core/)-based pipelines, run them on input strings or files, annotate text, and use individual DKPro components. We also show how to interface [spaCy](https://spacy.io/) and [NLTK](http://www.nltk.org/) (Python-based NLP toolkits) with DKPro CAS objects.

## 1. Installing PyDKPro

PyDKPro supports Python 3.6 and above. It uses a [Docker](https://www.docker.com/) container that hosts web services for all Java-based DKPro Core operations. To use this demo, perform the following steps:

1. Install the required Python libraries:
   
   `pip install dkpro-cassis==0.2.7 spacy==2.2.1 nltk==3.4.5`
   

2. Clone the repository.

3. Open this Jupyter notebook and run the following commands:


In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

### Constructing Pipeline

To process a piece of text, first construct a `Pipeline` from `DKPro Core` components. The pipeline is language-specific, so specify the language when constructing it (see the example below).

- By default, each component in the pipeline uses its default parameters, but you can override any of them. The parameter list for each component will be provided in the `PyDKPro` documentation.



In [2]:
from pydkpro import Pipeline, Component  
p = Pipeline(version="2.0.0", language='en')
p.add(Component().clearNlpSegmenter())
p.add(Component().stanfordPosTagger(variant='fast-caseless', printTagSet='false'))
p.build() 

Container web service for the provided pipeline is fired up. To stop use finish method


<pydkpro.pipeline.Pipeline at 0x1190322d0>

### Pipeline processing

After constructing the pipeline, run it on the text you want to process (see the example below). If the `language` parameter is not provided, a language detector is used to detect the language of the text. The output is a `cas` object, a container for accessing linguistic annotations that follow the DKPro-defined type system. PyDKPro provides the DKPro Core type system, which the `cas` object uses to extract annotations such as `tokens`, `sentence`, `pos tags`, and `ner`, depending on the components in the pipeline.

The examples below demonstrate how to extract token text and POS tags.



In [3]:
cas = p.process('Backgammon is one of the oldest known board games.', language='en')

In [4]:
from pydkpro import DKProCoreTypeSystem as dts

cas.select(dts().token).as_text() 

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games.']

In [5]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

### Adding Annotations

Similar to [DKPro cassis](https://github.com/dkpro/dkpro-cassis), manual annotations are added to a `cas` object by first creating them with a type from the `typesystem`. For the given text, `Token` annotations with an `id` and a `pos` feature can be added as follows.


In [6]:
from pydkpro.cas import Cas
Token = dts().typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token')
cas = Cas(dts().typesystem)()
cas.sofa_string = "I like cheese ."
tokens = [
    Token(begin=0, end=1, id='0', pos='NNP'),
    Token(begin=2, end=6, id='1', pos='VBD'),
    Token(begin=7, end=13, id='2', pos='IN'),
    Token(begin=14, end=15, id='3', pos='.')
]


for token in tokens:
    cas.add_annotation(token)
    


The attributes of the CAS tokens can be printed as follows:

In [7]:
print([x.get_covered_text() for x in cas.select_all()])
print([x.pos for x in cas.select_all()])

['I', 'like', 'cheese', '.']
['NNP', 'VBD', 'IN', '.']
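The `begin` and `end` values used above are character offsets into `sofa_string`, with `end` exclusive. As a minimal sketch, assuming whitespace-separated tokens, the offsets can be computed instead of typed by hand. The helper name `whitespace_offsets` is hypothetical and not part of PyDKPro or cassis.

```python
import re


def whitespace_offsets(text):
    """Return (begin, end, covered_text) for each whitespace-separated token.

    `end` is exclusive, so text[begin:end] is the covered text.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


offsets = whitespace_offsets("I like cheese .")
for begin, end, covered in offsets:
    print(begin, end, covered)
```

The resulting `begin` and `end` pairs match the ones passed to `Token(...)` in the cell above.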


### SpaCy interfacing

Generated CAS objects can also be typecast to the spaCy type system and vice-versa. 


In [8]:
from pydkpro import To_spacy, From_spacy
cas = p.process('Backgammon is one of the oldest known board games.', language='en')


Conversion to spaCy

In [9]:
for token in To_spacy(cas)(): 
    print(token.text, token.tag_) 

Backgammon NNP
is VBZ
one CD
of IN
the DT
oldest JJS
known VBN
board NN
games NNS
. .


Conversion from spaCy

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
cas = From_spacy(doc)()

In [11]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

### NLTK interfacing

Similar to spaCy, `NLTK` objects can also be converted into CAS objects and vice versa.

In [12]:
from pydkpro.external import To_nltk, From_nltk 

Conversion to NLTK: since NLTK has no common data structure, PyDKPro provides helper functions, e.g. `tagger` (see the example below).

In [13]:
To_nltk().tagger(cas)

[('Backgammon', 'NNP'),
 ('is', 'VBZ'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('oldest', 'JJS'),
 ('known', 'VBN'),
 ('board', 'NN'),
 ('games', 'NNS'),
 ('.', '.')]

In [14]:
import nltk
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(To_nltk().tagger(cas))

In [15]:
print(chunked)

(S
  (Chunk Backgammon/NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  oldest/JJS
  known/VBN
  board/NN
  games/NNS
  ./.)
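To make the chunk grammar above easier to read, the following sketch emulates a tag pattern such as `<RB.?>*<VB.?>*<NNP>` with the standard `re` module. It is a simplified illustration of the idea, not NLTK's implementation: the function `chunk_spans` is hypothetical, and it assumes that each `<...>` group is one tag and that `.` stands for any single character of a tag.

```python
import re


def chunk_spans(tagged, tag_pattern):
    """Return (start, end) token index pairs whose tag sequence matches tag_pattern."""
    # Encode the tag sequence as one string, e.g. '<NNP><VBZ><CD>'.
    tag_string = "".join("<{}>".format(tag) for _, tag in tagged)
    # Treat every <...> as a single group so that '*' repeats the whole tag.
    pattern = re.sub(r"<([^<>]+)>", r"(?:<\1>)", tag_pattern)
    # Keep '.' from matching across tag boundaries.
    pattern = pattern.replace(".", "[^<>]")
    spans = []
    for match in re.finditer(pattern, tag_string):
        if match.start() == match.end():
            continue  # ignore empty matches
        # Every tag starts with '<', so counting them maps characters to tokens.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        spans.append((start, end))
    return spans


# Tagged tokens copied from the tagger output above.
tagged = [('Backgammon', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'),
          ('the', 'DT'), ('oldest', 'JJS'), ('known', 'VBN'), ('board', 'NN'),
          ('games', 'NNS'), ('.', '.')]
spans = chunk_spans(tagged, "<RB.?>*<VB.?>*<NNP>")
chunks = [[word for word, _ in tagged[start:end]] for start, end in spans]
print(chunks)
```

Only `Backgammon` is tagged `NNP`, and no adverb or verb precedes it, so the sketch finds a single one-token chunk, in line with the parse tree printed above.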


Similar helper functions are provided for converting NLTK output to a PyDKPro CAS, as follows:

In [16]:
from nltk.tokenize import TweetTokenizer
cas = From_nltk().tokenizer(TweetTokenizer().tokenize('Backgammon is one of the oldest known board games.'))

### Cas processing

A PyDKPro pipeline can also process a `cas` object directly, as demonstrated in the example below:

In [17]:
p = Pipeline()
p.add(Component().stanfordPosTagger())
p.build()

Container web service for the provided pipeline is fired up. To stop use finish method


<pydkpro.pipeline.Pipeline at 0x129cadb10>

In [18]:
cas = p.process(cas)

In [20]:
cas.select(dts().token).as_text() 

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

In [21]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']
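Both calls above return plain Python lists in the outputs shown, so they can be combined with ordinary list operations. The following is a minimal sketch, assuming the two lists are aligned by position; `pair_tokens_with_tags` is a hypothetical helper, not a PyDKPro function.

```python
def pair_tokens_with_tags(tokens, tags):
    """Zip tokens with POS tags, refusing lists of different lengths."""
    if len(tokens) != len(tags):
        raise ValueError(
            "got {} tokens but {} tags".format(len(tokens), len(tags))
        )
    return list(zip(tokens, tags))


# Values copied from the two outputs above.
tokens = ['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known',
          'board', 'games', '.']
tags = ['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

pairs = pair_tokens_with_tags(tokens, tags)
# Keep every token whose tag starts with 'NN' (NN, NNS, NNP, ...).
nouns = [token for token, tag in pairs if tag.startswith('NN')]
print(nouns)
```

The explicit length check is a design choice: a silent `zip` would drop trailing items if the tokenizer and tagger outputs ever disagreed in length.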

### Shortcut for running single components

A single component can also be run without the need to build a pipeline first:

In [22]:
tokenizer = Component().clearNlpSegmenter() 

In [23]:
cas = tokenizer.process('Backgammon is one of the oldest known board games.')
cas.select(dts().token).as_text()

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

### Working with list of strings

Multiple strings can be processed as a list, where each element of the list is treated as a separate document.

In [24]:
str_list = ['Backgammon is one of the oldest known board games.', 'I like playing cricket.']

In [25]:
for text in str_list:  # avoid shadowing the built-in `str`
    cas = p.process(text)
    print(cas.select(dts().token).as_text())

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games.']
['I', 'like', 'playing', 'cricket.']


### Working with text documents

Pipelines can also be directly run on text documents:

In [26]:
from pydkpro.external import File2str

In [27]:
cas = p.process(File2str('test_data/input/test2.txt')())

In [28]:
cas.select(dts().token).as_text()

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

### Working with multiple text documents

Multiple documents can be processed by providing a list of document paths, which could also be collected with document name matching patterns:

In [29]:
# documents located at different paths can be provided in a list
docs = ['test_data/input/1.txt', 'test_data/input/2.txt']
for doc in docs:
    p.process(File2str(doc)())
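Name matching patterns are mentioned above but not shown. As a hedged sketch using only the standard library, the paths could be collected with `pathlib.Path.glob`. The helpers `collect_documents` and `read_documents` are hypothetical and not part of PyDKPro, and the demonstration writes to a temporary directory so that it runs without the tutorial's test data.

```python
import tempfile
from pathlib import Path


def collect_documents(directory, pattern="*.txt"):
    """Return the files in `directory` whose names match `pattern`, sorted."""
    return sorted(p for p in Path(directory).glob(pattern) if p.is_file())


def read_documents(directory, pattern="*.txt", encoding="utf-8"):
    """Yield (path, text) for every matching document."""
    for path in collect_documents(directory, pattern):
        yield path, path.read_text(encoding=encoding)


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1.txt").write_text(
        "Backgammon is one of the oldest known board games.", encoding="utf-8")
    Path(tmp, "2.txt").write_text("I like playing cricket.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("not a text document", encoding="utf-8")
    loaded = [(path.name, text) for path, text in read_documents(tmp)]

print(loaded)
```

Each `text` could then be passed to `p.process(text)`, in the same way as the strings in the earlier list example.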

### End collection process

The following command completes the pipeline's collection process (alternatively, a `with` statement can be used):

In [30]:
p.finish()

0
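How `Pipeline` itself supports `with` is not shown in this tutorial, so the following is only a generic sketch of the idea: a small context manager that guarantees `finish()` is called, even if processing raises an error. Both `finishing` and `FakePipeline` are hypothetical names; the fake class stands in for a real pipeline so that the sketch runs without Docker or PyDKPro.

```python
from contextlib import contextmanager


@contextmanager
def finishing(pipeline):
    """Yield `pipeline` and call its finish() method on exit, even after an error."""
    try:
        yield pipeline
    finally:
        pipeline.finish()


class FakePipeline:
    """Hypothetical stand-in for a pipeline object with process() and finish()."""

    def __init__(self):
        self.finished = False

    def process(self, text):
        # Placeholder behaviour: split on whitespace instead of calling DKPro.
        return text.split()

    def finish(self):
        self.finished = True
        return 0


fake = FakePipeline()
with finishing(fake) as pipe:
    tokens = pipe.process("I like playing cricket.")

print(tokens, fake.finished)
```

With a real pipeline, the same wrapper would be used as `with finishing(p) as pipe:`, assuming `p` exposes the `finish()` method shown above.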