## Tasks


1. Download the [Docker image](https://hub.docker.com/r/djstrong/krnnt2) of KRNNT2. It includes the following tools:


    i. Morfeusz2 - morphological dictionary  
    ii. Corpus2 - corpus access library
    iii. Toki - tokenizer for Polish
    iv. Maca - morphosyntactic analyzer   
    v. KRNNT - Polish tagger

2. As an alternative, you can use the tagger interfaces at [Clarin-PL](https://ws.clarin-pl.eu/tager.shtml).
3. Use the tool to tag and lemmatize the law corpus.
4. Using the tagged corpus, compute bigram statistics for tokens consisting of:


    i. lemmatized, downcased word
    ii. morphosyntactic **category** of the word (`subst`, `fin`, `adj`, etc.)
   

In [1]:
# Lemmatization: reducing a word to its base (dictionary) form, which represents the word, e.g. wiórkami → wiórek, jeżdżący → jeździć.
# Tagging: marking each word in a text (corpus) with the part of speech it corresponds to.

import requests


r = requests.post('http://localhost:9200',data = "ma. kota, i skacze:")

r.text.split("\n")[0:1]
# print(r.text)

['ma\tnone']

In [2]:
import locale
import os
import pickle

# python -m spacy download en_core_web_sm
# python -m spacy download pl_core_news_sm
import string
import tarfile
from collections import Counter
import matplotlib
import matplotlib.pyplot as plt
import morfeusz2
import numpy as np
import pandas as pd
import regex
import spacy
from spacy.tokenizer import *
import math
import operator
import time
import Levenshtein
from PIL import Image


matplotlib.style.use("ggplot")
%matplotlib inline
locale.setlocale(locale.LC_COLLATE, "pl_PL.UTF-8")


'pl_PL.UTF-8'

In [3]:



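class Token:
    """A single tagged token returned by KRNNT."""

    def __init__(self, lemma: str, morf: str, word: str) -> None:
        self.word: str = word  # surface form, e.g. "kota"
        self.morf: str = morf  # full morphosyntactic tag, e.g. "subst:sg:acc:m2"
        self.lemma: str = lemma  # base form, e.g. "kot"
        # morphosyntactic category of the word (`subst`, `fin`, `adj`, etc.)
        self.flexeme: str = morf.split(":")[0]


def to_tokens(response):
    """Parse the tagger's tab-separated response into a list of Token objects."""
    tokens = []
    word = None
    for line in response.text.split("\n"):
        if not line.startswith("\t"):
            # Surface-form line: the word is the first tab-separated field
            word = line.split("\t")[0]
        else:
            # Interpretation line: "\t<lemma>\t<tag>\t..."
            fields = line.split("\t")
            lemma = fields[1]
            if len(lemma.split(" ")) > 1:
                continue  # skip lemmas that consist of more than one word
            tokens.append(Token(lemma, fields[2], word))
    return tokens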

    

In [4]:

path = "../data/ustawy"

tokens = []
i = 0

for filename in os.listdir(path):
    with open(os.path.join(path, filename), "r", encoding="utf-8") as file:
        act = file.read()
        act = regex.sub(r"\s+", " ", act)
        act = regex.sub(r"­", "", act)
        act = act.lower()
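        # Send the normalized act to the local KRNNT server and collect its tokens
        response = requests.post("http://localhost:9200", data=act.encode("utf-8"))
        tokens += to_tokens(response)

        i += 1

        # NOTE: only the first 4 files are processed; remove this limit to tag the whole corpus
        if i == 4:
            break
        if i % 200 == 0:
            print(i)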



In [5]:
def bigrams(words):
    words = zip(words, words[1:])
    return [" ".join(pair) for pair in words]


In [6]:
text = [i.lemma   for i in tokens]
gram2 = bigrams(text)

gram2 = [
    token
    for token in gram2
    if all(char not in string.punctuation and not char.isdigit() for char in token)
]

Counter(gram2).most_common(5)


[('bank regionalny', 105),
 ('bank spółdzielczy', 101),
 ('w artykuł', 84),
 ('o który', 63),
 ('który mowa', 63)]

In [7]:
text = [i.flexeme for i in tokens]
gram2 = bigrams(text)

# Drop bigrams that contain punctuation (tagged as `interp`)
gram2 = [i for i in gram2 if "interp" not in i]

Counter(gram2).most_common(5)

[('subst adj', 1187),
 ('prep subst', 876),
 ('subst subst', 844),
 ('adj subst', 589),
 ('subst prep', 400)]

5. Discard bigrams containing characters other than letters. Make sure that you discard the invalid entries after computing the bigram counts.
6. For example, for the sentence "Ala ma kota.", which is tagged as:

   ```
   Ala	none
           Ala	subst:sg:nom:f	disamb
   ma	space
           mieć	fin:sg:ter:imperf	disamb
   kota	space
           kot	subst:sg:acc:m2	disamb
   .	none
           .	interp	disamb
   ```
   
   the algorithm should return the following bigrams: `ala:subst mieć:fin` and `mieć:fin kot:subst`.
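
The example above can be checked with a small, self-contained sketch. It assumes that interpretation lines start with a tab character (as `to_tokens` also assumes) and that "letters only" can be tested with `str.isalpha()`. The names `SAMPLE`, `parse_tagged` and `count_bigrams` are hypothetical helpers introduced only for this illustration. Bigrams are counted first and the invalid entries are dropped afterwards, as step 5 asks.

```python
from collections import Counter

# Tagger output for "Ala ma kota." from the task description (tab-separated).
SAMPLE = (
    "Ala\tnone\n"
    "\tAla\tsubst:sg:nom:f\tdisamb\n"
    "ma\tspace\n"
    "\tmieć\tfin:sg:ter:imperf\tdisamb\n"
    "kota\tspace\n"
    "\tkot\tsubst:sg:acc:m2\tdisamb\n"
    ".\tnone\n"
    "\t.\tinterp\tdisamb\n"
)


def parse_tagged(text):
    """Return (lowercased lemma, category) pairs, one per interpretation line."""
    pairs = []
    for line in text.split("\n"):
        if not line.startswith("\t"):
            continue  # surface-form line or empty line
        fields = line.split("\t")  # ['', lemma, tag, 'disamb']
        lemma, tag = fields[1], fields[2]
        pairs.append((lemma.lower(), tag.split(":")[0]))
    return pairs


def count_bigrams(pairs):
    """Count all adjacent pairs, then drop entries whose lemmas are not purely alphabetic."""
    counts = Counter(zip(pairs, pairs[1:]))
    return Counter({
        f"{l1}:{c1} {l2}:{c2}": n
        for ((l1, c1), (l2, c2)), n in counts.items()
        if l1.isalpha() and l2.isalpha()
    })


sample_counts = count_bigrams(parse_tagged(SAMPLE))
print(sample_counts)
```

Filtering after counting means that a bigram such as `kot:subst .:interp` is removed as a whole, so `kot` is never paired with a word from the next sentence.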

In [8]:
def bigrams(tokens):
    """Return all pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def isLetter(token):
    """True if the surface form contains neither punctuation nor digits."""
    return all(char not in string.punctuation and not char.isdigit() for char in token.word)


gram2 = bigrams(tokens)
gram2 = [(i, j) for (i, j) in gram2 if isLetter(i) and isLetter(j)]
gram2_tokens = gram2  # keep the Token pairs for the category partitioning in step 8
gram2 = [f"{i.lemma}:{i.flexeme} {j.lemma}:{j.flexeme}" for (i, j) in gram2]

In [9]:
gram2_count =Counter(gram2)

In [10]:
gram2_count

Counter({'ustawa:subst z:prep': 20,
         'z:prep dzień:subst': 35,
         'o:prep zmiana:subst': 12,
         'zmiana:subst ustawa:subst': 4,
         'ustawa:subst o:prep': 9,
         'o:prep system:subst': 2,
         'system:subst oświata:subst': 3,
         'oświata:subst artykuł:brev': 1,
         'w:prep ustawa:subst': 15,
         'oraz:conj z:prep': 3,
         'i:conj numer:brev': 5,
         'wprowadzać:fin się:qub': 5,
         'się:qub następujący:adj': 5,
         'następujący:adj zmiana:subst': 5,
         'w:prep artykuł:brev': 84,
         'po:prep punkt:brev': 2,
         'dodawać:fin się:qub': 31,
         'się:qub punkt:brev': 10,
         'w:prep brzmienie:subst': 26,
         'opieka:subst nad:prep': 1,
         'nad:prep uczeń:subst': 1,
         'uczeń:subst z:prep': 1,
         'z:prep znaczny:adj': 1,
         'znaczny:adj lub:conj': 1,
         'lub:conj sprząc:ppas': 1,
         'sprząc:ppas dysfunkcja:subst': 1,
         'dysfunkcja:subst poprzez:prep

7. Compute LLR statistic for this dataset.

In [11]:
from collections import defaultdict

# Count how often each token occurs in a bigram, in either position
token_count = defaultdict(int)

for bigram, count in gram2_count.items():
    first_token, second_token = bigram.split(" ")
    token_count[first_token] += count
    token_count[second_token] += count

# Total number of bigram occurrences
total = sum(gram2_count.values())


In [12]:
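def H(k):
    """Return sum(p * log(p)) over the cells of k, i.e. the negated Shannon entropy."""
    N = np.sum(k)
    # The masked log skips log(0); .filled(0) turns those terms into 0
    return np.sum(k / N * np.ma.log(k / N).filled(0))


def llr(a, b):
    """Log-likelihood ratio of the bigram "a b", based on a 2x2 contingency table."""
    k11 = gram2_count[a + " " + b]  # occurrences of the bigram "a b"
    k12 = token_count[b] - k11  # other bigram slots filled by b
    k21 = token_count[a] - k11  # other bigram slots filled by a
    k22 = total - k21 - k12 - k11  # everything else
    k = np.array([[k11, k12], [k21, k22]])
    rowSums = np.sum(k, axis=1).tolist()
    colSums = np.sum(k, axis=0).tolist()

    return 2 * np.sum(k) * (H(k) - H(rowSums) - H(colSums))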
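

The `token_count` dictionary above adds up a token's occurrences in both bigram positions. A common alternative fills the contingency table with position-specific counts: how often `a` appears as the first element and how often `b` appears as the second. The sketch below shows that variant with the same entropy-based formula. The function names (`entropy_term`, `llr_from_table`, `llr_scores`) and the toy counts are assumptions made for illustration, and the scores it produces will differ from the output shown in this notebook.

```python
import math
from collections import Counter


def entropy_term(counts):
    """Sum of (k/N) * log(k/N) over the non-zero counts (the negated Shannon entropy)."""
    total = sum(counts)
    return sum(k / total * math.log(k / total) for k in counts if k > 0)


def llr_from_table(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    cells = [k11, k12, k21, k22]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    return 2 * n * (entropy_term(cells) - entropy_term(rows) - entropy_term(cols))


def llr_scores(bigram_counts):
    """Score every (first, second) bigram using position-specific marginal counts."""
    first_counts, second_counts = Counter(), Counter()
    for (first, second), n in bigram_counts.items():
        first_counts[first] += n
        second_counts[second] += n
    total = sum(bigram_counts.values())

    scores = {}
    for (first, second), k11 in bigram_counts.items():
        k12 = first_counts[first] - k11    # `first` followed by a different token
        k21 = second_counts[second] - k11  # `second` preceded by a different token
        k22 = total - k11 - k12 - k21      # bigrams with neither token in that position
        scores[(first, second)] = llr_from_table(k11, k12, k21, k22)
    return scores


# Toy counts (illustrative only): two perfectly associated bigrams.
toy = Counter({("bank", "spółdzielczy"): 5, ("rada", "nadzorcza"): 5})
toy_scores = llr_scores(toy)
print(toy_scores)
```

With these toy counts each table is `[[5, 0], [0, 5]]`, so both bigrams get the same score of `20 * ln 2`. When the two tokens are independent of each other the score is 0.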



In [13]:
gram2_llr = {}
length = len(gram2)
i = 0
for key in gram2:
    if len(key.split()) > 2:
        print(key)
    gram2_llr[key] = llr(*key.split())
    if i % (int(length / 10)) == 0:
        print(f"{i}/{length}")
    # print(key,gram2_llr[key])
    i += 1

0/7459
745/7459
1490/7459
2235/7459
2980/7459
3725/7459
4470/7459
5215/7459
5960/7459
6705/7459
7450/7459


In [14]:
def sort_dict(dictionary):
    return dict(sorted(dictionary.items(), key=operator.itemgetter(1), reverse=True))


gram2_llr = sort_dict(gram2_llr)
list(gram2_llr.items())[:10]


[('otrzymywać:fin brzmienie:subst', 469.0043113555203),
 ('który:adj mowa:subst', 315.4923648172178),
 ('bank:subst spółdzielczy:adj', 289.0264606855246),
 ('bank:subst regionalny:adj', 284.754995921074),
 ('o:prep który:adj', 267.51968422832385),
 ('samorząd:subst terytorialny:adj', 243.48476244902707),
 ('BGŻ:subst sa:subst', 216.73228519474628),
 ('jednostka:subst samorząd:subst', 204.4617207453875),
 ('w:prep artykuł:brev', 188.06953653583406),
 ('dodawać:fin się:qub', 178.40172327146752)]

8. Partition the entries based on the syntactic categories of the words, i.e. all bigrams having the form of 
   `w1:adj` `w2:subst` should be placed in one partition (the order of the words may not be changed).

In [18]:

gram2 = [f'{i.flexeme} {j.flexeme}' for (i, j) in gram2_tokens]
top10 = Counter(gram2).most_common(10)
top10

[('subst adj', 1087),
 ('prep subst', 876),
 ('subst subst', 836),
 ('adj subst', 511),
 ('subst prep', 380),
 ('conj subst', 295),
 ('subst conj', 210),
 ('prep brev', 190),
 ('adj prep', 181),
 ('fin subst', 179)]

9. Select the 10 largest partitions (partitions with the largest number of entries).
10. Use the computed LLR measure to select 5 bigrams for each of the largest categories.
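
Steps 8 to 10 can also be expressed as a single grouping pass. The sketch below is one possible reading, not the notebook's own implementation: it measures partition size in distinct bigrams, whereas the cell above counts bigram occurrences, and either reading of "entries" seems defensible. `category_pair` and `top_by_category` are hypothetical helper names, and the scores are rounded from the LLR output shown earlier.

```python
import heapq
from collections import defaultdict


def category_pair(bigram):
    """'bank:subst spółdzielczy:adj' -> 'subst adj'."""
    first, second = bigram.split(" ")
    return f"{first.rsplit(':', 1)[1]} {second.rsplit(':', 1)[1]}"


def top_by_category(scores, n_categories=10, n_bigrams=5):
    """Group scored bigrams by category pair, keep the largest groups, rank each by score."""
    partitions = defaultdict(list)
    for bigram, score in scores.items():
        partitions[category_pair(bigram)].append((score, bigram))
    # Largest partitions first (size = number of distinct bigrams in the partition)
    largest = heapq.nlargest(n_categories, partitions, key=lambda c: len(partitions[c]))
    return {
        c: [bigram for _, bigram in heapq.nlargest(n_bigrams, partitions[c])]
        for c in largest
    }


# Scores rounded from the LLR output above; a tiny subset for illustration.
scores = {
    "bank:subst spółdzielczy:adj": 289.0,
    "bank:subst regionalny:adj": 284.8,
    "samorząd:subst terytorialny:adj": 243.5,
    "w:prep artykuł:brev": 188.1,
}
result = top_by_category(scores, n_categories=2, n_bigrams=2)
print(result)
```

Grouping first and ranking inside each group avoids rescanning the globally sorted LLR list for every category.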

In [16]:
example = "otrzymywać:fin brzmienie:subst"


def preprocess(bigram):
    """Split "lemma1:cat1 lemma2:cat2" into ([lemma1, cat1], [lemma2, cat2])."""
    token1, token2 = bigram.split(" ")
    return (token1.split(":"), token2.split(":"))


def inCategory(bigram, category):
    """True if the bigram's category pair equals `category`, e.g. "fin subst"."""
    ([word1, flexeme1], [word2, flexeme2]) = preprocess(bigram)
    return category == f"{flexeme1} {flexeme2}"


inCategory(example, "fin subst")

True

In [25]:
# Walk the bigrams in descending LLR order and collect the first 5 for each of the top-10 categories
categories = {category: [] for category, _ in top10}

for bigram, score in gram2_llr.items():
    finished = True
    for category in categories:
        if len(categories[category]) < 5:
            finished = False
            if inCategory(bigram, category):
                categories[category].append(bigram + " " + str(score))
    if finished:
        break

df = pd.DataFrame.from_dict(categories, orient="index").transpose()

df

| | subst adj | prep subst | subst subst | adj subst | subst prep | conj subst | subst conj | prep brev | adj prep | fin subst |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bank:subst spółdzielczy:adj 289.0264606855246 | do:prep sprawa:subst 125.74200765926477 | BGŻ:subst sa:subst 216.73228519474628 | który:adj mowa:subst 315.4923648172178 | bank:subst w:prep 149.12163868041247 | i:conj placówka:subst 48.61320366984871 | bank:subst i:conj 83.29717019147425 | w:prep artykuł:brev 188.06953653583406 | właściwy:adj do:prep 119.26079007557013 | otrzymywać:fin brzmienie:subst 469.0043113555203 |
| 1 | bank:subst regionalny:adj 284.754995921074 | z:prep dzień:subst 80.97208187625974 | jednostka:subst samorząd:subst 204.4617207453875 | niniejszy:adj ustawa:subst 87.2303001901888 | mowa:subst w:prep 111.60544473471974 | i:conj wychowanie:subst 41.349171086543585 | bank:subst oraz:conj 26.878691763379017 | w:prep ustęp:brev 125.17130291145133 | dochodowy:adj od:prep 68.42120766524118 | mieć:fin zastosowanie:subst 50.12909530059465 |
| 2 | samorząd:subst terytorialny:adj 243.4847624490... | w:prep bank:subst 70.06412955089863 | kurator:subst oświata:subst 161.90221103369143 | walny:adj zgromadzenie:subst 84.10593778532356 | bank:subst z:prep 55.505229454553216 | i:conj gimnazjum:subst 27.611800568028027 | oświata:subst i:conj 17.286242248469502 | po:prep ustęp:brev 38.21901353718173 | regionalny:adj w:prep 39.67609091332702 | dokonać:fin rejestracja:subst 31.35040406424977 |
| 3 | gospodarka:subst żywnościowy:adj 177.834960111... | od:prep dzień:subst 69.18080752739775 | minister:subst finanse:subst 154.74949521377766 | centralny:adj komisja:subst 70.3230325493746 | miesiąc:subst od:prep 54.249265641169295 | i:conj bank:subst 23.255364834870964 | zawód:subst lub:conj 15.105292069534952 | w:prep punkt:brev 14.6725720748113 | długi:adj niż:prep 31.707898077811116 | odwoływać:fin rada:subst 27.293889788256564 |
| 4 | minister:subst właściwy:adj 177.1439965092652 | na:prep podstawa:subst 60.89198882266925 | droga:subst rozporządzenie:subst 133.566681659... | okręgowy:adj komisja:subst 68.72321833030415 | porozumienie:subst z:prep 53.16394973362357 | i:conj szkoła:subst 22.164985965988524 | ustawa:subst i:conj 10.946287197889102 | po:prep artykuł:brev 10.250835275554309 | który:adj w:prep 29.502663380487746 | przeprowadzać:fin rozliczenie:subst 24.7020638... |


11. Using the results from the previous step answer the following questions:


    i. What types of bigrams have been found? 
    ii. Which of the category-pairs indicate valuable multiword expressions? Do they have anything in common?
    iii. Which signal: LLR score or syntactic category is more useful for determining genuine multiword expressions?
    iv. Can you describe a different use-case where the morphosyntactic category is useful for resolving a real-world problem?

In [None]:
best_bigrams = [*10]

In [None]:
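df = pd.DataFrame({"a": [], "b": []})

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, pd.DataFrame([{"a": "Ma kota"}])], ignore_index=True)
df = pd.concat([df, pd.DataFrame([{"a": "Ma kota"}])], ignore_index=True)
df = pd.concat([df, pd.DataFrame([{"b": "Ma kota"}])], ignore_index=True)

df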

In [None]:
print(len(list(df['b'])))