## Tasks


1. Download the [Docker image](https://hub.docker.com/r/djstrong/krnnt2) of KRNNT2. It includes the following tools:


    i. Morfeusz2 - morphological dictionary  
    ii. Corpus2 - corpus access library
    iii. Toki - tokenizer for Polish
    iv. Maca - morphosyntactic analyzer   
    v. KRNNT - Polish tagger

2. As an alternative, you can use the tagger interfaces at [Clarin-Pl](https://ws.clarin-pl.eu/tager.shtml).
3. Use the tool to tag and lemmatize the law corpus.
4. Using the tagged corpus, compute bigram statistics for tokens containing:


    i. lemmatized, downcased word
    ii. morphosyntactic **category** of the word (`subst`, `fin`, `adj`, etc.)
   

In [1]:
# Lemmatization reduces a word to its base (dictionary) form, e.g. wiórkami → wiórek, jeżdżący → jeździć.
# Tagging marks each word in a text (corpus) with its part of speech.

import requests


r = requests.post('http://localhost:9200',data = "ma. kota, i skacze:")

r.text.split("\n")[0:1]
# print(r.text)

['ma\tnone']

In [2]:
import locale
import os
import pickle

# python -m spacy download en_core_web_sm
# python -m spacy download pl_core_news_sm
import string
import tarfile
from collections import Counter
import matplotlib
import matplotlib.pyplot as plt
import morfeusz2
import numpy as np
import pandas as pd
import regex
import spacy
from spacy.tokenizer import *
import math
import operator
import time
import Levenshtein
from PIL import Image


matplotlib.style.use("ggplot")
%matplotlib inline
locale.setlocale(locale.LC_COLLATE, "pl_PL.UTF-8")


'pl_PL.UTF-8'

In [3]:



class Token:
    def __init__(self, lemma: str, morf: str, word: str) -> None:
        self.word: str = word  # surface form, e.g. "kota"
        self.morf: str = morf  # full tag, e.g. "subst:sg:acc:m2"
        self.lemma: str = lemma  # base form, e.g. "kot"
        # morphosyntactic category of the word (`subst`, `fin`, `adj`, etc.)
        self.flexeme: str = morf.split(":")[0]


def to_tokens(response):
    tokens = []
    word = None
    for line in response.text.split("\n"):
        if not line.startswith("\t"):
            # Unindented line: the surface form followed by the preceding separator.
            word = line.split("\t")[0]
        else:
            # Tab-indented line: lemma, tag, and the "disamb" marker.
            fields = line.split("\t")
            lemma = fields[1]
            if len(lemma.split(" ")) > 1:
                continue  # skip lemmas that contain spaces
            morf = fields[2]
            tokens.append(Token(lemma, morf, word))
    return tokens

    

In [4]:

path = "../data/ustawy"

tokens = []

for i, filename in enumerate(os.listdir(path), start=1):
    with open(os.path.join(path, filename), "r", encoding="utf-8") as file:
        act = file.read()
    act = regex.sub(r"\s+", " ", act)  # collapse runs of whitespace
    act = regex.sub("\u00ad", "", act)  # strip soft hyphens
    act = act.lower()
    response = requests.post("http://localhost:9200", data=act.encode("utf-8"))
    tokens += to_tokens(response)

    if i == 4:
        break  # debugging limit: only the first 4 files are tagged

    if i % 200 == 0:
        print(i)



In [5]:
def bigrams(words):
    words = zip(words, words[1:])
    return [" ".join(pair) for pair in words]


In [6]:
text = [i.lemma   for i in tokens]
gram2 = bigrams(text)

gram2 = [
    token
    for token in gram2
    if all(char not in string.punctuation and not char.isdigit() for char in token)
]

Counter(gram2).most_common(5)


[('bank regionalny', 105),
 ('bank spółdzielczy', 101),
 ('w artykuł', 84),
 ('o który', 63),
 ('który mowa', 63)]

In [14]:
text = [i.flexeme for i in tokens]
gram2 = bigrams(text)

gram2 = [
    pair
    for pair in gram2
    if "interp" not in pair
]

Counter(gram2).most_common(5)

[('subst adj', 1187),
 ('prep subst', 876),
 ('subst subst', 844),
 ('adj subst', 589),
 ('subst prep', 400)]

In [13]:
# print(tokens[0].flexeme)
gram2

['brev interp',
 'interp prep',
 'prep interp',
 'interp prep',
 'prep adj',
 'adj brev',
 'brev interp',
 'interp brev',
 'brev num',
 'num interp',
 ...

In [8]:
i=8
print(tokens[i].word,tokens[i].morf)

nr brev:npun



5. Discard bigrams containing characters other than letters. Make sure that you discard the invalid entries after computing the bigram counts.
6. For example: "Ala ma kota", which is tagged as:

   ```
   Ala	none
           Ala	subst:sg:nom:f	disamb
   ma	space
           mieć	fin:sg:ter:imperf	disamb
   kota	space
           kot	subst:sg:acc:m2	disamb
   .	none
           .	interp	disamb
   ```
   
   the algorithm should return the following bigrams: `ala:subst mieć:fin` and `mieć:fin kot:subst`.
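The example above can be sketched in code as follows. This is a minimal sketch, not the reference solution: the helper names (`parse_tagged`, `count_bigrams`, `letters_only`, `fmt`) are hypothetical, and it assumes that every analysis line in the tagger output starts with a tab, as the `to_tokens` cell above also expects. Tokens are kept as `(lemma, category)` tuples and only formatted as `lemma:category` for display, so a lemma that itself contains `:` cannot be confused with the separator.

```python
from collections import Counter

# Tagger output for "Ala ma kota." (assumed format: tab-indented analysis lines).
SAMPLE = (
    "Ala\tnone\n"
    "\tAla\tsubst:sg:nom:f\tdisamb\n"
    "ma\tspace\n"
    "\tmieć\tfin:sg:ter:imperf\tdisamb\n"
    "kota\tspace\n"
    "\tkot\tsubst:sg:acc:m2\tdisamb\n"
    ".\tnone\n"
    "\t.\tinterp\tdisamb\n"
)


def parse_tagged(text):
    """Return one (downcased lemma, category) tuple per analysis line."""
    tagged = []
    for line in text.split("\n"):
        if not line.startswith("\t"):
            continue  # surface-form line or blank separator
        fields = line.split("\t")  # ['', lemma, tag, 'disamb']
        if len(fields) < 3:
            continue
        lemma, tag = fields[1], fields[2]
        tagged.append((lemma.lower(), tag.split(":")[0]))
    return tagged


def count_bigrams(tokens):
    """Count adjacent token pairs, including pairs that will be discarded later."""
    return Counter(zip(tokens, tokens[1:]))


def letters_only(counts):
    """Keep only bigrams whose two lemmas consist solely of letters."""
    return Counter({
        pair: n
        for pair, n in counts.items()
        if all(lemma.isalpha() for lemma, _ in pair)
    })


def fmt(pair):
    """Format a bigram as 'lemma:category lemma:category'."""
    return " ".join(f"{lemma}:{cat}" for lemma, cat in pair)


counts = count_bigrams(parse_tagged(SAMPLE))  # count first ...
for pair, n in letters_only(counts).items():  # ... discard afterwards
    print(fmt(pair), n)
```

For the sample sentence this leaves the two bigrams named in the task and drops the pair that contains the full stop. Filtering after counting means the discarded pairs still contribute to the totals used by later statistics.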

7. Compute LLR statistic for this dataset.
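One common way to compute the LLR statistic is the log-likelihood ratio over a 2×2 contingency table for each bigram:

$$G^2 = 2 \sum_{i,j} k_{ij} \ln \frac{k_{ij}}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row}_i)(\text{col}_j)}{N}$$

The sketch below assumes this formulation; the function names `llr` and `bigram_llr` are hypothetical. It expects the unfiltered bigram counts, so that the marginal totals include the pairs that are discarded only afterwards.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: count of (w1, w2)        k12: count of (w1, not w2)
    k21: count of (not w1, w2)    k22: count of (not w1, not w2)
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def bigram_llr(bigram_counts):
    """Map each bigram (w1, w2) to its LLR score."""
    total = sum(bigram_counts.values())
    first = Counter()   # how often each token opens a bigram
    second = Counter()  # how often each token closes a bigram
    for (w1, w2), n in bigram_counts.items():
        first[w1] += n
        second[w2] += n

    scores = {}
    for (w1, w2), k11 in bigram_counts.items():
        k12 = first[w1] - k11
        k21 = second[w2] - k11
        k22 = total - k11 - k12 - k21
        scores[(w1, w2)] = llr(k11, k12, k21, k22)
    return scores
```

A score of 0 means the two tokens co-occur exactly as often as independence would predict; larger values indicate a stronger association.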
8. Partition the entries based on the syntactic categories of the words, i.e. all bigrams having the form of 
   `w1:adj` `w2:subst` should be placed in one partition (the order of the words may not be changed).

9. Select the 10 largest partitions (partitions with the largest number of entries).
10. Use the computed LLR measure to select 5 bigrams for each of the largest categories.
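Steps 8 to 10 can be sketched as a single grouping pass. This is an illustrative sketch only: the function name `top_bigrams_by_category` is hypothetical, and it assumes the scores are keyed by bigrams of `(lemma, category)` tuples. Partition size is taken to mean the number of distinct bigrams in the partition.

```python
from collections import defaultdict


def top_bigrams_by_category(scores, n_partitions=10, n_bigrams=5):
    """Group scored bigrams by category pair and keep the best of the largest groups.

    scores: dict mapping ((lemma1, cat1), (lemma2, cat2)) to an LLR score.
    Returns a dict from (cat1, cat2) to a list of (score, lemma1, lemma2),
    ordered from the largest partition to the smallest.
    """
    partitions = defaultdict(list)
    for ((lemma1, cat1), (lemma2, cat2)), score in scores.items():
        # Word order is preserved: (adj, subst) and (subst, adj) stay separate.
        partitions[(cat1, cat2)].append((score, lemma1, lemma2))

    largest = sorted(
        partitions.items(), key=lambda item: len(item[1]), reverse=True
    )[:n_partitions]

    return {
        cats: sorted(entries, reverse=True)[:n_bigrams]
        for cats, entries in largest
    }


# Made-up scores, for illustration only.
example_scores = {
    (("bank", "subst"), ("regionalny", "adj")): 50.0,
    (("bank", "subst"), ("spółdzielczy", "adj")): 40.0,
    (("w", "prep"), ("artykuł", "subst")): 30.0,
}
for cats, entries in top_bigrams_by_category(example_scores).items():
    print(cats, entries)
```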

11. Using the results from the previous step answer the following questions:


    i. What types of bigrams have been found? 
    ii. Which of the category-pairs indicate valuable multiword expressions? Do they have anything in common?
    iii. Which signal, LLR score or syntactic category, is more useful for determining genuine multiword expressions?
    iv. Can you describe a different use-case where the morphosyntactic category is useful for resolving a real-world problem?