# Slovene preprocessing, rule-based systems, regular expressions

<sup>This notebook is part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [slavko.zitnik@fri.uni-lj.si](mailto:slavko.zitnik@fri.uni-lj.si) with any comments.</sup>

## Slovene text preprocessing

Most of the tools and corpora are available at [ssj.slovenscina.eu](http://ssj.slovenscina.eu/) or [clarin.si](http://www.clarin.si/info/o-projektu/). Some direct links to the available tools are:

* [Part-of-speech tagger](http://oznacevalnik.slovenscina.eu/Vsebine/Sl/ProgramskaOprema/Navodila.aspx): The tool includes a part-of-speech tagger, a tokenizer and the Lemmagen lemmatizer. Part-of-speech annotation examples are described at [http://bos.zrc-sazu.si/bibliografija/o_oznake.html](http://bos.zrc-sazu.si/bibliografija/o_oznake.html), [http://nl.ijs.si/imp/msd/html-sl/](http://nl.ijs.si/imp/msd/html-sl/) or [https://www.jakopin.net/primoz/disertacija/oblikozn.php](https://www.jakopin.net/primoz/disertacija/oblikozn.php).
* [Dependency parser](http://razclenjevalnik.slovenscina.eu/Default.aspx): The labels are explained in ["Površinskoskladenjsko označevanje korpusa Slovene Dependency Treebank," bachelor thesis by Nina Ledinek](https://nl.ijs.si/sdt/bib/SDT-diploma-nina.pdf).
* [Lemmatizer Lemmagen](http://lemmatise.ijs.si/Software/): A port to Java also exists - [Lemmagen4J](https://github.com/szitnik/Lemmagen4J)
* The best lemmatizer, tokenizer and part-of-speech tagger as of 2022: CLASSLA, which is based on Stanford's Stanza. It is available in the [CLARIN.SI repository](https://www.clarin.si/repository/xmlui/handle/11356/1412) and regularly updated in the [CLASSLA GitHub repo](https://github.com/clarinsi/classla).
* [SloBENCH](https://slobench.cjvt.si/): Slovenian NLP tools benchmark.
* Project RSDO (2020-2023) results: [slovenscina.eu](http://slovenscina.eu/).

Project PoVeJMo is currently ongoing (2023-2026).

Some pre-trained deep neural network models already exist; we will discuss them later.

In [1]:
import classla
# download standard models for Slovenian
classla.download('sl')                  

Downloading https://raw.githubusercontent.com/clarinsi/classla-resources/main/resources_2.1.json: 11.4kB [00:00, 23.8MB/s]                   
2024-02-27 03:20:24 INFO: Downloading these customized packages for language: sl (Slovenian)...
| Processor | Package  |
------------------------
| tokenize  | standard |
| pos       | standard |
| lemma     | standard |
| depparse  | standard |
| ner       | standard |
| pretrain  | standard |

2024-02-27 03:20:26 INFO: File exists: /Users/slavkoz/classla_resources/sl/pos/standard.pt.
2024-02-27 03:20:26 INFO: File exists: /Users/slavkoz/classla_resources/sl/lemma/standard.pt.
2024-02-27 03:20:26 INFO: File exists: /Users/slavkoz/classla_resources/sl/depparse/standard.pt.
2024-02-27 03:20:26 INFO: File exists: /Users/slavkoz/classla_resources/sl/ner/standard.pt.
2024-02-27 03:20:27 INFO: File exists: /Users/slavkoz/classla_resources/sl/pretrain/standard.pt.
2024-02-27 03:20:27 INFO: Finished downloading models and saved to /Users/slavkoz/classla

In [2]:
# RTV SLO article, March 7, 2022
text = """
V Belorusiji se je začel tretji krog pogajanj o premirju. 
Na mizi so zahteve o notranjepolitični rešitvi krize, mednarodno 
humanitarnem in vojaškem vidiku. Ruska vojska je že zjutraj odprla 
humanitarne koridorje, a jih večina vodi v Rusijo in Belorusijo.

Tretji krog mirovnih pogajanj poteka v narodnem parku Beloveška 
pušča, ki se razteza med Poljsko in Belorusijo. Ruske pogajalce 
vodi Vladimir Medinski, ki je dejal, da je Rusija 
"nedvomno pripravljena" na pogovore z ukrajinsko stranjo, poroča 
ruska tiskovna agencija TASS.
"""

In [3]:
nlp = classla.Pipeline('sl', processors='tokenize,ner,pos,lemma,depparse')                      
doc = nlp(text)     
doc

2024-02-27 03:20:27 INFO: Loading these models for language: sl (Slovenian):
| Processor | Package  |
------------------------
| tokenize  | standard |
| pos       | standard |
| lemma     | standard |
| depparse  | standard |
| ner       | standard |

2024-02-27 03:20:27 INFO: Use device: cpu
2024-02-27 03:20:27 INFO: Loading: tokenize
2024-02-27 03:20:27 INFO: Loading: pos
2024-02-27 03:20:36 INFO: Loading: lemma
2024-02-27 03:20:48 INFO: Loading: depparse
2024-02-27 03:20:48 INFO: Loading: ner
2024-02-27 03:20:48 INFO: Done loading processors!


[
  [
    [
      {
        "id": 1,
        "text": "V",
        "lemma": "v",
        "upos": "ADP",
        "xpos": "Sl",
        "feats": "Case=Loc",
        "head": 2,
        "deprel": "case",
        "ner": "O"
      },
      {
        "id": 2,
        "text": "Belorusiji",
        "lemma": "Belorusija",
        "upos": "PROPN",
        "xpos": "Npfsl",
        "feats": "Case=Loc|Gender=Fem|Number=Sing",
        "head": 5,
        "deprel": "obl",
        "ner": "B-LOC"
      },
      {
        "id": 3,
        "text": "se",
        "lemma": "se",
        "upos": "PRON",
        "xpos": "Px------y",
        "feats": "PronType=Prs|Reflex=Yes|Variant=Short",
        "head": 5,
        "deprel": "expl",
        "ner": "O"
      },
      {
        "id": 4,
        "text": "je",
        "lemma": "biti",
        "upos": "AUX",
        "xpos": "Va-r3s-n",
        "feats": "Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin",
        "head": 5,
        "deprel": "aux",
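
The `ner` values in the output above look like BIO tags: `B-LOC` opens a location and `O` marks a token outside any entity. The following is a minimal sketch of how such tags could be grouped into entity spans. It assumes plain BIO tagging and works on simple token dictionaries rather than on the `classla` document object. `group_entities` is a hypothetical helper; the first three tokens are copied from the output above, while the last two are made-up values added only to illustrate a multi-token entity.

```python
def group_entities(tokens):
    """Collect (entity_text, entity_type) pairs from BIO-tagged token dicts."""
    entities = []
    words, etype = [], None
    for token in tokens:
        tag = token.get("ner", "O")
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            # Close the entity in progress (if any) and open a new one
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [token["text"]], tag[2:]
        elif tag.startswith("I-"):
            # Continue the current entity
            words.append(token["text"])
        else:
            # "O" tag: close the entity in progress (if any)
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
    if words:
        entities.append((" ".join(words), etype))
    return entities


tokens = [
    {"text": "V", "ner": "O"},               # from the output above
    {"text": "Belorusiji", "ner": "B-LOC"},  # from the output above
    {"text": "se", "ner": "O"},              # from the output above
    {"text": "Vladimir", "ner": "B-PER"},    # hypothetical illustration
    {"text": "Medinski", "ner": "I-PER"},    # hypothetical illustration
]
print(group_entities(tokens))
```

Keeping the helper independent of the library object makes it easy to test on hand-written token lists before applying it to real pipeline output.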
 

## Rule-based systems

Rule-based systems use a manually defined set of rules to extract structured data. The extraction workflow generally consists of the following steps (*Rule-Based Information Extraction for Structured Data Acquisition using TEXTMARKER* (2008), Martin Atzmueller et al.):

<img src="rule-based-system.png" />

The use of statistical models or deep learning networks for structured data extraction is increasing, although commercial systems still rely heavily on manual extraction rules. Please read the IBM paper *Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!* (2013) by Laura Chiticariu et al.
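
As a toy illustration of what a manually defined rule set can look like, the sketch below pairs a label with a regular expression and applies every rule to a text. It is not taken from any of the systems mentioned here; the rule labels, the `apply_rules` helper and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical hand-written rules: (label, compiled pattern)
RULES = [
    ("PERSON", re.compile(r"\b(?:Dr|Mr|Ms)\. [A-Z]\w+")),
    ("DATE", re.compile(
        r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December) \d{4}\b")),
]


def apply_rules(text, rules=RULES):
    """Return (start, label, matched_text) triples sorted by position."""
    found = []
    for label, pattern in rules:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group(0)))
    return sorted(found)


sample = "Dr. Novak arrived in Ljubljana on 7 March 2022."
for start, label, value in apply_rules(sample):
    print(f"{start:>3} {label:<7} {value}")
```

Real rule-based systems add much more on top of this idea, such as gazetteers, rule priorities and rules that refer to annotations produced by earlier rules.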

### GATE

[GATE](https://gate.ac.uk/) is a natural language processing framework that consists of a GUI application, a set of extensible tools and a Java library to develop custom extensions. 

It includes a rule-based information extraction system called [ANNIE](https://gate.ac.uk/sale/tao/splitch6.html#x9-1200006) that relies heavily on [JAPE](https://gate.ac.uk/sale/tao/splitch8.html#x12-2070008) (Java Annotation Patterns Engine) rules.

<img src="gate-developer.png" />

## Other platforms

### UIMA

[Apache UIMA](https://uima.apache.org) is another framework for information extraction. Apart from open-source plugins, some commercial versions exist. The framework became especially well known after IBM Watson's win at the [Jeopardy challenge](https://www.youtube.com/watch?v=WFR3lOm_xhE), because the Watson engine was based on UIMA.

### KNIME Analytics Platform

[KNIME](https://www.knime.com/) is an open-source tool for data analysis and visualization without programming. Its low-code/no-code user interface is intended for both beginners and advanced users, who can draw on a large number of ready-made plugins. The company that develops and supports the tool also offers premium features such as cloud collaboration, private spaces, custom plugin development and integration with other systems.

<img src="knime.png" />

Also check a review of tools for NLP democratization: [Zasnova splošnega ogrodja in podatkovnega modela za obdelavo naravnega jezika – ANGLEr (2013)](https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/522/852/9449) by Slavko Žitnik.

## Regular expressions

One standard way to extract structured information from text is to use regular expressions. To get familiar with regular expressions, follow the tutorial at [RegexOne](https://regexone.com/) or go through the explanations at [Regular-Expressions.info](http://www.regular-expressions.info/).

While you are learning, or whenever you want to test a regular expression against some text, you can use tools such as [Regex101](https://regex101.com/).

### Regular expressions in Python
Let's try some regular expressions in Python.

In [4]:
text = """Mr. Swensen, 62, runs the school’s $25.4 billion endowment, one of the largest in the country. 
Since November 1 2016 he is joined by his intellectual sparring partner, Mr. Dean Takahashi, his senior director."""

In [5]:
import re

# Raw string, with the dot escaped so that it matches a literal "." only
regex = r"Mr\. (\w+)"

# Find person
personPattern = re.compile(regex)
match = personPattern.search(text)
print("Found person: '{}'.".format( match.group(1) ))

Found person: 'Swensen'.


In [6]:
# Find all persons
matches = re.finditer(regex, text)
for match in matches:
    print("Found person: '{}'.".format( match.group(1) ))

Found person: 'Swensen'.
Found person: 'Dean'.


In [7]:
# Find money amounts
regex = r"[$€]\s*[0-9.,]+"
matches = re.finditer(regex, text)
for match in matches:
    print("Amount: '{}'.".format( match.group(0) ))

Amount: '$25.4'.
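
The pattern above stops at `$25.4` and loses the word "billion". One possible refinement, sketched below, uses named groups to capture the currency sign, the number and an optional scale word separately. The group names, the list of scale words and the shortened `sample` sentence are assumptions chosen for this example.

```python
import re

# Named groups: currency sign, number, optional scale word
money_pattern = re.compile(
    r"(?P<currency>[$€])\s*"
    r"(?P<number>\d+(?:[.,]\d+)*)"
    r"(?:\s+(?P<scale>thousand|million|billion|trillion))?"
)

sample = "Mr. Swensen, 62, runs the school's $25.4 billion endowment."
amounts = [m.groupdict() for m in money_pattern.finditer(sample)]
print(amounts)
```

When no scale word follows the number, the `scale` group is simply `None`, so the same pattern also covers plain amounts such as `€ 100`.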


In [8]:
# TODO: write a general regex for dates while identifying month, day and year separately
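
One possible starting point for this task is sketched below. It only covers dates written as an English month name followed by a day and a four-digit year (for example "November 1 2016" or "March 7, 2022"); other formats, such as numeric dates, would need additional alternatives. The names `MONTHS`, `date_pattern` and `sample` are chosen for this example only.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)

# Month name, day (optional ordinal suffix and comma), four-digit year
date_pattern = re.compile(
    rf"\b(?P<month>{MONTHS})\s+"
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?,?\s+"
    r"(?P<year>\d{4})\b"
)

sample = "Since November 1 2016 he is joined by his intellectual sparring partner."
dates = [m.groupdict() for m in date_pattern.finditer(sample)]
print(dates)
```

Named groups keep the month, day and year individually accessible through `match.group("month")` and similar calls, which is what the task asks for.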

### Exercise

#### GATE and JAPE rules

Install GATE Developer, run ANNIE on some English news text and investigate the JAPE rules (some predefined ones are listed [here](https://gate.ac.uk/sale/tao/splitap6.html#x38-773000F)). Then try to define some new JAPE rules.

#### NLTK regular expressions

To check some simple examples using Python and NLTK, please read *Regular Expressions for Natural Language Processing* (2006) by Steven Bird and Ewan Klein. Solve the exercises at the end of the document.

#### NLP platforms

Check the review of platforms (see above), then install and try a platform of your choice. Solve the problem below using the selected platform.

#### CMU Seminars

1. Download the [CMU Seminar Announcements](https://people.cs.umass.edu/~mccallum/data/sa-tagged.tar.gz) dataset.
2. Remove all XML tags from texts.
3. Use regular expressions to recognize parts of the announcements. Which parts can you recognize (date, time, location, names, emails)?
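
A minimal sketch of steps 2 and 3 is given below as a starting point. The `announcement` string is a made-up snippet in the style of a tagged seminar announcement, not a record from the dataset, and the `TAG`, `TIME` and `EMAIL` patterns are deliberately simple assumptions that will miss many real-world variants.

```python
import re

TAG = re.compile(r"<[^>]+>")  # any XML-like tag
TIME = re.compile(r"\b\d{1,2}:\d{2}(?:\s?[ap]m)?\b", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def strip_tags(text):
    """Remove XML-like tags while keeping the text between them."""
    return TAG.sub("", text)


# Hypothetical example text, not taken from the dataset
announcement = (
    "Time: <stime>3:30 PM</stime>\n"
    "Place: <location>Wean Hall 5409</location>\n"
    "Contact: jdoe@example.org"
)

plain = strip_tags(announcement)
times = TIME.findall(plain)
emails = EMAIL.findall(plain)
print(plain)
print(times, emails)
```

Since the original tags mark the correct answers, keeping a tagged copy of each announcement allows you to check how many of the annotated parts your expressions recover.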