# Welcome to PyDKPro

In this tutorial, we demonstrate how to build [DKPro Core](https://dkpro.github.io/dkpro-core/)-based pipelines, how to run them on input strings or files, how to annotate text, and how to use individual DKPro Core components. We also show how [spaCy](https://spacy.io/) and [NLTK](http://www.nltk.org/) (Python-based NLP toolkits) interface with DKPro CAS objects.

## 1. Installing PyDKPro

PyDKPro supports Python 3.6 and above. It uses a [Docker](https://www.docker.com/) container that hosts web services for all Java-based DKPro Core operations. To use this demo, make sure that Python 3.6 or later and pip are installed, then perform the following steps in your terminal (step 1 is optional).

1. To create a virtual environment,

    `python -m pip install virtualenv` (Install virtualenv if not already installed)

    `mkdir [env_name]`
    
    `virtualenv -p python3 [env_name]` or `python3 -m venv [env_name]`
    
    To activate the environment,
    
    On Windows, run:

    `[env_name]\Scripts\activate.bat`

    On Unix or MacOS, run:

    `source [env_name]/bin/activate`

    If you want to create a conda (version 4.6 or later) environment instead,

    `conda create --name [env_name] python=3.6`
    
    To activate the environment (on Windows, MacOS and Unix),
    
    `conda activate [env_name]`
    
    

2. Install the latest Python libraries:
   
   `python -m pip install -r requirements.txt`
   
   `python -m spacy download en_core_web_sm`
   
   

3. Clone the repository.

    `git clone https://github.com/zesch/pydkpro.git`
    
    

4. Open this jupyter notebook in your browser.

    `jupyter notebook`
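
Before continuing, the prerequisites named above (Python 3.6 or later, pip, and a Docker executable) can be checked from Python itself. The snippet below is a minimal sketch that uses only the standard library; the function name `check_prerequisites` is hypothetical and not part of PyDKPro, and finding a `docker` executable on the PATH does not prove that the Docker daemon is running.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_version=(3, 6)):
    """Return a dict describing whether each prerequisite appears to be met."""
    return {
        # The tutorial requires Python 3.6 or later.
        "python": sys.version_info >= min_version,
        # pip is importable when it is installed for this interpreter.
        "pip": importlib.util.find_spec("pip") is not None,
        # shutil.which returns None when no `docker` executable is on PATH.
        "docker": shutil.which("docker") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```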



In [1]:
# Run this cell to make sure that all dependencies are installed successfully.

import pkg_resources
from pkg_resources import DistributionNotFound, VersionConflict

dependencies = [
  'dkpro-cassis==0.2.9',
  'spacy==2.2.1',
  'nltk==3.4.5'
]
 
uninstalled_libraries = []
for each_dependency in dependencies:
    try:
        pkg_resources.require([each_dependency])
    except (DistributionNotFound, VersionConflict):
        uninstalled_libraries.append(each_dependency)
if len(uninstalled_libraries) > 0:
    print(" %s dependencies are not installed properly. Please install them and rerun this jupyter notebook" 
          %(','.join(uninstalled_libraries)))
else:
    print("All dependencies are installed successfully")


 dkpro-cassis==0.2.7 dependencies are not installed properly. Please install them and rerun this jupyter notebook
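
`pkg_resources` is deprecated in recent setuptools releases. As an alternative, the same check can be sketched with `importlib.metadata` from the standard library. This needs Python 3.8 or later, which is newer than the 3.6 minimum stated above. The helper name `find_unmet` is hypothetical, and the sketch only understands exact `name==version` pins.

```python
from importlib.metadata import PackageNotFoundError, version


def find_unmet(requirements):
    """Return the 'name==version' pins that are missing or at another version."""
    unmet = []
    for requirement in requirements:
        name, _, wanted = requirement.partition("==")
        try:
            installed = version(name)
        except PackageNotFoundError:
            unmet.append(requirement)
            continue
        # An empty `wanted` means no version was pinned, so any version passes.
        if wanted and installed != wanted:
            unmet.append(requirement)
    return unmet


unmet = find_unmet(["dkpro-cassis==0.2.9", "spacy==2.2.1", "nltk==3.4.5"])
print("Unmet pins:", unmet if unmet else "none")
```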


In [2]:
import os
import sys

# Make the repository root and the pydkpro package importable from this notebook.
for path in (os.path.abspath('..'), os.path.abspath('../pydkpro')):
    if path not in sys.path:
        sys.path.append(path)

### Constructing Pipeline

To process a piece of text, you first need to construct a `Pipeline` from `DKPro Core` components. The pipeline is language-specific, so you need to specify the language first (see the example below).

- By default, pipeline components use their default parameters. However, you can always specify which parameters to include or change. The parameter list of each component is provided in the [DKPro Core](https://dkpro.github.io/dkpro-core/) documentation.
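
How defaults and user-supplied parameters combine can be pictured with a small stand-alone sketch. It is not PyDKPro's implementation: `ComponentSpec`, `with_params`, and the parameter names `language` and `model_variant` are all hypothetical and only illustrate the idea of overriding selected defaults while keeping the rest.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComponentSpec:
    """Hypothetical description of a component and its parameters."""
    name: str
    params: dict = field(default_factory=dict)

    def with_params(self, **overrides):
        # Overrides win over defaults; untouched defaults are kept as they are.
        merged = {**self.params, **overrides}
        return ComponentSpec(self.name, merged)


defaults = ComponentSpec("segmenter", {"language": "en", "model_variant": "default"})
custom = defaults.with_params(model_variant="fast")

print(defaults.params)
print(custom.params)
```

Returning a new object instead of mutating the defaults keeps the original specification reusable for other pipelines.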



In [3]:
from pydkpro import Pipeline, Component  
p = Pipeline(version="2.0.0", language='en')
p.add(Component().opennlp_segmenter())
p.add(Component().opennlp_postagger(param_tagset=True))
p.build() 

(b'com.zz', None)
(b'test', None)
(b'0.0.1-SNAPSHOT', None)
✓ Got all neccessary informations
✓ Server code generation
✓ Compiling Project
✓ Building container...
Container is running on port: 3000
(b'', None)
Container web service for the provided pipeline is fired up. To stop use finish method


<pydkpro.pipeline.Pipeline at 0x10df5c908>

### Pipeline processing

After constructing the pipeline, you run it on the text you want to process (see the example below). If the `language` parameter is not provided, a language detector is used to detect the language of the text. The output of a processed pipeline is a `CAS` object, a container for accessing linguistic annotations that follow the type system defined by DKPro Core. PyDKPro provides the DKPro Core type systems, which the `CAS` object uses to extract annotations such as `tokens`, `sentence`, `pos tags`, and `ner`, depending on the components in the pipeline.
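
Conceptually, a CAS keeps the document text in one place and stores every annotation as a typed span that points into it. The sketch below imitates that idea in plain Python; `MiniCas` and `Span` are hypothetical stand-ins, not the PyDKPro or cassis classes.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A stand-off annotation: a typed character range plus features."""
    type_name: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class MiniCas:
    """Hypothetical, minimal stand-in for a CAS."""
    text: str
    spans: list = field(default_factory=list)

    def add(self, span):
        self.spans.append(span)

    def select(self, type_name):
        # Return annotations of one type in document order.
        matches = [s for s in self.spans if s.type_name == type_name]
        return sorted(matches, key=lambda s: (s.begin, s.end))

    def covered_text(self, span):
        return self.text[span.begin:span.end]


mini = MiniCas("Lets play Cricket.")
mini.add(Span("Token", 0, 4))
mini.add(Span("Token", 5, 9))
mini.add(Span("Token", 10, 17))
mini.add(Span("Token", 17, 18))
mini.add(Span("Sentence", 0, 18))

print([mini.covered_text(s) for s in mini.select("Token")])
```

Because annotations only reference offsets, several layers (tokens, sentences, named entities) can describe the same text without modifying it.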


The examples below demonstrate how to extract token text and POS tags.


In [4]:
cas = p.process('Lets play Cricket.', language='en')

✓ Pinging...
b'<?xml version="1.0" encoding="UTF-8"?><xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:pos="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos.ecore" xmlns:tcas="http:///uima/tcas.ecore" xmlns:cas="http:///uima/cas.ecore" xmlns:tweet="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/tweet.ecore" xmlns:morph="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/morph.ecore" xmlns:type2="http:///de/tudarmstadt/ukp/dkpro/core/api/frequency/tfidf/type.ecore" xmlns:dependency="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/dependency.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/anomaly/type.ecore" xmlns:type6="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type.ecore" xmlns:type3="http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore" xmlns:type4="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmlns:type5="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:constituent="http:/

In [5]:
from pydkpro import DKProCoreTypeSystem as dts

ts_token = dts().token

cas.select(ts_token).as_text() 

['"', 'Lets', 'play', 'Cricket', '.', '"']

In [6]:
cas.select(ts_token).get_pos()

['PUNCT', 'NOUN', 'VERB', 'PROPN', 'PUNCT', 'PUNCT']

In [7]:
p.finish()

'Container service is successfully destroyed'

### Adding Annotations

Similar to [DKPro cassis](https://github.com/dkpro/dkpro-cassis), manual annotations are added to a CAS object by first obtaining the annotation type from the `typesystem`. For the given text, `Token` annotations with `id` and `pos` features can be added as follows.


In [8]:
from pydkpro.cas import Cas
Token = dts().typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token')
cas = Cas(dts().typesystem)()
cas.sofa_string = "I like cheese ."
tokens = [
    Token(begin=0, end=1, id='0', pos='NNP'),
    Token(begin=2, end=6, id='1', pos='VBD'),
    Token(begin=7, end=13, id='2', pos='IN'),
    Token(begin=14, end=15, id='3', pos='.')
]


for token in tokens:
    cas.add_annotation(token)
    


Token features can be printed as follows:

In [9]:
print([x.get_covered_text() for x in cas.select_all()])
print([x.pos for x in cas.select_all()])

['I', 'like', 'cheese', '.']
['NNP', 'VBD', 'IN', '.']
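
The `begin` and `end` values in the token list above are character offsets into the sofa string, with `end` exclusive. For whitespace-separated text they do not have to be counted by hand. The sketch below derives them with the standard `re` module; `whitespace_token_offsets` is a hypothetical helper, not a PyDKPro function.

```python
import re


def whitespace_token_offsets(text):
    """Return (begin, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


text = "I like cheese ."
offsets = whitespace_token_offsets(text)
print(offsets)
print([text[begin:end] for begin, end in offsets])
```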


### SpaCy interfacing

Generated CAS objects can also be converted to the spaCy annotation object model and vice versa.


In [23]:
from pydkpro import To_spacy, From_spacy
cas = p.process('Backgammon is one of the oldest known board games.', language='en')


Conversion to spaCy

In [24]:
for token in To_spacy(cas)(): 
    print(token.text, token.tag_) 

Backgammon NNP
is VBZ
one CD
of IN
the DT
oldest JJS
known VBN
board NN
games NNS
. .


Conversion from spaCy

In [25]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
cas = From_spacy(doc)()

In [26]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

### NLTK interfacing

Similar to spaCy, `NLTK` objects can also be converted into CAS objects and vice versa.

In [27]:
from pydkpro.external import To_nltk, From_nltk 

Conversion to NLTK: since NLTK has no common data structure for annotations, PyDKPro provides helper functions such as `tagger` (see the example below).

In [28]:
To_nltk().tagger(cas)

[('Backgammon', 'NNP'),
 ('is', 'VBZ'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('oldest', 'JJS'),
 ('known', 'VBN'),
 ('board', 'NN'),
 ('games', 'NNS'),
 ('.', '.')]
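
The list of `(word, tag)` tuples shown above is the plain format that NLTK taggers and chunkers exchange. A rough idea of how such a list can be derived from stand-off token annotations is sketched below. `to_tagged_tuples` and the dictionary layout are hypothetical and do not reflect how `To_nltk` is implemented.

```python
def to_tagged_tuples(text, tokens):
    """Turn stand-off tokens into NLTK-style (word, tag) tuples.

    `tokens` is an iterable of dicts with 'begin', 'end' and 'pos' keys.
    """
    ordered = sorted(tokens, key=lambda t: t["begin"])
    return [(text[t["begin"]:t["end"]], t["pos"]) for t in ordered]


sample_text = "Backgammon is one"
sample_tokens = [
    {"begin": 11, "end": 13, "pos": "VBZ"},
    {"begin": 0, "end": 10, "pos": "NNP"},
    {"begin": 14, "end": 17, "pos": "CD"},
]
tagged = to_tagged_tuples(sample_text, sample_tokens)
print(tagged)
```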

In [29]:
import nltk
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(To_nltk().tagger(cas))

In [30]:
print(chunked)

(S
  (Chunk Backgammon/NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  oldest/JJS
  known/VBN
  board/NN
  games/NNS
  ./.)
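
The chunk grammar `{<RB.?>*<VB.?>*<NNP>}` reads as: any number of adverbs, then any number of verbs, then one proper noun. The pattern is matched against the sequence of POS tags, not against the words. The sketch below imitates that idea with the standard `re` module so the matching logic is visible. It is a simplified illustration rather than NLTK's implementation, and `find_chunks` is a hypothetical helper.

```python
import re

# Regex equivalent of the tag pattern <RB.?>*<VB.?>*<NNP>, written so that
# the optional character can never cross a tag boundary.
CHUNK_PATTERN = r"(?:<RB[^<>]?>)*(?:<VB[^<>]?>)*<NNP>"


def find_chunks(tagged, pattern=CHUNK_PATTERN):
    """Return the words of every tag sequence that matches `pattern`."""
    tag_string = ""
    starts = []  # character position where each token's tag begins
    for _, tag in tagged:
        starts.append(len(tag_string))
        tag_string += f"<{tag}>"

    chunks = []
    for match in re.finditer(pattern, tag_string):
        first = starts.index(match.start())
        last = first
        # Advance until the next tag starts at or after the end of the match.
        while last < len(starts) and starts[last] < match.end():
            last += 1
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks


sentence = [("Backgammon", "NNP"), ("is", "VBZ"), ("one", "CD"), ("of", "IN")]
print(find_chunks(sentence))
```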


Similar helper functions are provided for converting NLTK output to a PyDKPro CAS:

In [31]:
from nltk.tokenize import TweetTokenizer
cas = From_nltk().tokenizer(TweetTokenizer().tokenize('Backgammon is one of the oldest known board games.'))
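
A tokenizer such as `TweetTokenizer` returns bare strings, so character offsets have to be recovered before the tokens can be stored as annotations. One common approach is to search for each token starting from the end of the previous one, as sketched below. `recover_offsets` is a hypothetical stand-alone helper; whether `From_nltk` works this way is not stated in this tutorial.

```python
def recover_offsets(tokens, text):
    """Locate each token in `text`, scanning left to right.

    Raises ValueError when a token cannot be found after the previous one.
    """
    offsets = []
    cursor = 0
    for token in tokens:
        begin = text.find(token, cursor)
        if begin == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = begin + len(token)
        offsets.append((begin, end))
        cursor = end
    return offsets


source_text = "Backgammon is one of the oldest known board games."
token_strings = ["Backgammon", "is", "one", "of", "the",
                 "oldest", "known", "board", "games", "."]
token_offsets = recover_offsets(token_strings, source_text)
print(token_offsets[:3])
```

This only works when the tokenizer leaves the token text unchanged; tokenizers that normalize characters would need a more tolerant alignment.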

### Cas processing

A PyDKPro pipeline can also process a CAS object directly, as the following example demonstrates:

In [32]:
p = Pipeline()
p.add(Component().stanfordPosTagger())
p.build()

Container web service for the provided pipeline is fired up. To stop use finish method


<pydkpro.pipeline.Pipeline at 0x1309111d0>

In [33]:
cas = p.process(cas)

In [34]:
cas.select(dts().token).as_text() 

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

In [35]:
cas.select(dts().token).get_pos()

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

### Shortcut for running single components

A single component can also be run without the need to build a pipeline first:

In [36]:
tokenizer = Component().clearNlpSegmenter() 

In [37]:
cas = tokenizer.process('Backgammon is one of the oldest known board games.')
cas.select(dts().token).as_text()

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

### Working with list of strings

Multiple strings in a list can also be processed; each element of the list is treated as a separate document.

In [38]:
str_list = ['Backgammon is one of the oldest known board games.', 'I like playing cricket.']

In [39]:
# Use a name other than `str` so the built-in type is not shadowed.
for text in str_list:
    cas = p.process(text)
    print(cas.select(dts().token).as_text()) 

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games.']
['I', 'like', 'playing', 'cricket.']


### Working with text documents

Pipelines can also be directly run on text documents:

In [40]:
from pydkpro.external import File2str

In [41]:
cas = p.process(File2str('test_data/input/test2.txt')())

In [42]:
cas.select(dts().token).as_text()

['Backgammon',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'known',
 'board',
 'games',
 '.']

### Working with multiple text documents

Multiple documents can also be processed by providing their paths, which can be listed explicitly or collected with a file-name pattern:

In [43]:
# Documents located at different paths can be provided in a list.
docs = ['test_data/input/1.txt', 'test_data/input/2.txt']
for doc in docs:
    p.process(File2str(doc)())
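
When the documents share a directory and a naming scheme, the path list can be built from a pattern instead of being typed by hand. The sketch below uses `pathlib` from the standard library; `collect_documents` is a hypothetical helper, and the commented-out loop assumes the `p` and `File2str` objects from the cells above.

```python
from pathlib import Path


def collect_documents(directory, pattern="*.txt"):
    """Return the paths of files in `directory` that match `pattern`, sorted by name."""
    base = Path(directory)
    if not base.is_dir():
        # A missing directory simply yields no documents.
        return []
    return sorted(str(path) for path in base.glob(pattern) if path.is_file())


docs = collect_documents("test_data/input", "*.txt")
print(docs)

# Each path could then be passed to the pipeline as in the cell above:
# for doc in docs:
#     p.process(File2str(doc)())
```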

### End collection process

The following command completes the pipeline's collection process and stops the container service. Alternatively, a `with` block can be used.

In [44]:
p.finish()

0
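
The point of the `with` form is that the container is shut down even if processing raises an exception. The pattern can be sketched with `contextlib`. `FakePipeline` below is a hypothetical stand-in that only records calls, so the example runs without Docker; with a real pipeline, `running(p)` would wrap `p.build()` and `p.finish()` in the same way, assuming those methods behave as shown in the cells above.

```python
from contextlib import contextmanager


class FakePipeline:
    """Hypothetical stand-in that records which methods were called."""

    def __init__(self):
        self.calls = []

    def build(self):
        self.calls.append("build")
        return self

    def process(self, text):
        self.calls.append("process")
        return text.split()

    def finish(self):
        self.calls.append("finish")


@contextmanager
def running(pipeline):
    """Build the pipeline, hand it to the caller, and always finish it."""
    pipeline.build()
    try:
        yield pipeline
    finally:
        pipeline.finish()


fake = FakePipeline()
with running(fake) as active:
    words = active.process("I like playing cricket.")

print(fake.calls)
```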