# Multiword expression identification and extraction

This task demonstrates two simple methods for identifying multiword expressions (MWEs) in corpora.

## Tasks

1. Use the spaCy [tokenizer API](https://spacy.io/api/tokenizer) to tokenize the text from the law corpus.

In [1]:
# Language models used below can be installed with:
#   python -m spacy download en_core_web_sm
#   python -m spacy download pl_core_news_sm
import locale
import math
import operator
import os
import pickle
import re
import string
import tarfile
import time
from collections import Counter

import Levenshtein
import matplotlib
import matplotlib.pyplot as plt
import morfeusz2
import numpy as np
import pandas as pd
import regex
import spacy
from elasticsearch import *
from elasticsearch.helpers import *
from elasticsearch_dsl import *
from elasticsearch_dsl import query
from spacy.tokenizer import *

%matplotlib inline
matplotlib.style.use("ggplot")

# Use Polish collation rules when sorting strings.
locale.setlocale(locale.LC_COLLATE, 'pl_PL.UTF-8')


'pl_PL.UTF-8'

2. Compute **bigram** counts of downcased tokens. Given the sentence: "The quick brown fox jumps over the
   lazy dog.", the bigram counts are as follows:

   * "the quick": 1
   * "quick brown": 1
   * "brown fox": 1
   * ...
   * "dog .": 1
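
As a quick check of the expected result, the counts listed above can be reproduced with the standard library alone. This is a minimal sketch, not the pipeline used in the cells that follow: the regular expression that splits off punctuation and the helper name `bigram_counts` are assumptions made for illustration.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent pairs of lowercased tokens (illustrative helper)."""
    # Words and single punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(zip(tokens, tokens[1:]))


counts = bigram_counts("The quick brown fox jumps over the lazy dog.")
for (first, second), n in counts.items():
    print(f'"{first} {second}": {n}')
```

Storing bigrams as tuples rather than space-joined strings avoids having to split them again when the individual words are needed later.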

In [2]:
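nlp = spacy.load("pl_core_news_sm")
tokenizer = Tokenizer(nlp.vocab)

tokens = {}       # file path -> list of tokens in that act
tokens_list = []  # all tokens of the corpus, in reading order
path = "../data/ustawy"
for i, filename in enumerate(os.listdir(path), start=1):
    with open(os.path.join(path, filename), "r", encoding="utf-8") as file:
        act = file.read()
    act = regex.sub(r"\s+", " ", act)  # collapse runs of whitespace
    act = act.replace("\u00ad", "")    # drop soft hyphens (U+00AD)
    act = act.lower()
    words = [token.text for token in tokenizer(act)]
    tokens[file.name] = words
    tokens_list.extend(words)          # in-place; avoids copying the list per file
    if i % 200 == 0:
        print(i)                       # progress indicator

# Keep a reference to the raw tokens for later inspection.
old_tokens_list = tokens_list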

200
400
600
800
1000


In [3]:
tokens_list[0:10]

[' ', 'dz.u.', 'z', '1998', 'r.', 'nr', '117,', 'poz.', '759', 'ustawa']

In [4]:
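def separate_puctuations(tokens):
    """Split each token into words and single punctuation marks.

    Characters matched by neither alternative (e.g. brackets) are dropped.
    """
    # Pattern from https://stackoverflow.com/questions/367155/splitting-a-string-into-words-and-punctuation
    new_tokens = []
    for token in tokens:
        new_tokens.extend(regex.findall(r"[\w']+|[.,!?;]", token))
    return new_tokens

# Quick check on a small sample (kept separate from the corpus-level `tokens` dict).
sample_tokens = ['new,', 'fast,', 'expensive']
separate_puctuations(sample_tokens)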

['new', ',', 'fast', ',', 'expensive']

In [5]:
def bigrams(words):
    """Return every pair of adjacent words, joined by a single space."""
    words = [word.strip() for word in words]
    return [' '.join(pair) for pair in zip(words, words[1:])]

text = "The quick brown fox jumps over the lazy dog."
words = [token.text for token in tokenizer(text)]
print(bigrams(words))

['The quick', 'quick brown', 'brown fox', 'fox jumps', 'jumps over', 'over the', 'the lazy', 'lazy dog.']


In [6]:
tokens_list = separate_puctuations(tokens_list)
gram2 = bigrams(tokens_list)

In [7]:
Counter(gram2).most_common(5)


[('art .', 83779),
 ('ust .', 53552),
 ('poz .', 45222),
 ('. 1', 43484),
 (', poz', 43192)]

   
3. Discard bigrams containing characters other than letters. Make sure that you discard the invalid entries **after**
   computing the bigram counts.
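
The order of the two steps matters, and a toy example may make the difference concrete. The sketch below uses an invented token sequence and `str.isalpha` as the "letters only" test (both are assumptions for illustration); it compares filtering the counted bigrams with filtering the tokens before pairing them.

```python
from collections import Counter

# Invented token sequence in the style of the corpus.
tokens = ["art", ".", "5", "ust", ".", "2", "i", "ust", ".", "3"]

# Count first, then drop every bigram that contains a non-letter token.
counts = Counter(zip(tokens, tokens[1:]))
valid = Counter({pair: n for pair, n in counts.items()
                 if all(word.isalpha() for word in pair)})

# For contrast: filtering the tokens first glues together words that
# were never adjacent in the text.
letters_only = [t for t in tokens if t.isalpha()]
spurious = Counter(zip(letters_only, letters_only[1:]))

print(valid)     # bigrams that really occur in the text
print(spurious)  # also contains pairs such as ('art', 'ust')
```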
    

In [8]:
# Keep only bigrams free of punctuation and digits (the space joining the two words is allowed).
gram2 = [token for token in gram2 if all(char not in string.punctuation and not char.isdigit() for char in token)]
Counter(gram2).most_common(5)

[('w art', 32045),
 ('mowa w', 28471),
 ('w ust', 23557),
 ('o których', 13885),
 ('których mowa', 13858)]

4. Use [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) to compute the measure 
   for all pairs of words. 
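
For a bigram "x y", the measure is PMI(x, y) = log2(p(x, y) / (p(x) · p(y))), where the probabilities are typically estimated as relative frequencies. The following is a minimal sketch on an invented token sequence; the variable names and the choice to estimate p(x, y) over all bigrams are assumptions, not necessarily what the notebook cells do.

```python
import math
from collections import Counter

# Invented toy data: a repeated legal phrase plus one rare word pair.
tokens = "gałki ocznej o których mowa w art o których mowa w ust".split()

token_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = sum(token_counts.values())
n_bigrams = sum(bigram_counts.values())


def pmi(x, y):
    """PMI in bits, with probabilities estimated as relative frequencies."""
    p_xy = bigram_counts[(x, y)] / n_bigrams
    p_x = token_counts[x] / n_tokens
    p_y = token_counts[y] / n_tokens
    return math.log2(p_xy / (p_x * p_y))


ranking = sorted(bigram_counts, key=lambda pair: pmi(*pair), reverse=True)
for pair in ranking[:3]:
    print(pair, round(pmi(*pair), 3))
```

On this toy data the pair whose words each occur exactly once receives the highest score, which illustrates why unfiltered PMI tends to favor very rare bigrams.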

In [9]:

def to_probabilities(tokens):
    """Map each distinct item to its relative frequency."""
    tokens_freq = Counter(tokens)
    count = sum(tokens_freq.values())
    return {k: v/count for k, v in tokens_freq.items()}

p_bigram = to_probabilities(gram2)
p_token = to_probabilities(tokens_list)

In [10]:
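def pmi(x, y):
    """Pointwise mutual information of the bigram "x y", in bits."""
    result = p_bigram[x + " " + y] / (p_token[x] * p_token[y])
    return math.log2(result)

# Score each distinct bigram once; iterating over gram2 itself would
# recompute the same value for every occurrence.
bigram_pmis = {}
for key in p_bigram:
    bigram_pmis[key] = pmi(*key.split())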


    

In [11]:
pmis = dict(sorted(bigram_pmis.items(), key=operator.itemgetter(1),reverse=True))
list(pmis.items())[:5]


{'korzy stający': 23.024484997199306,
 'gałki ocznej': 23.024484997199306,
 'przedemery talne': 23.024484997199306,
 'organa uchwałodawcze': 23.024484997199306,
 'kropki wstawić': 23.024484997199306,
 'antykonkurencyjnym koncentracjom': 23.024484997199306,
 'skupiających kibiców': 23.024484997199306,
 'chuli gańskich': 23.024484997199306,
 'znająca pjm': 23.024484997199306,
 'przyspo sobieniu': 23.024484997199306,
 'skoczów komorowice': 23.024484997199306,
 'mówił szczerą': 23.024484997199306,
 'uczucie wstydu': 23.024484997199306,
 'nuklidy rozszczepialne': 23.024484997199306,
 'nm nanometrów': 23.024484997199306,
 'aparatami rentgenowskimi': 23.024484997199306,
 'stabilnym jodem': 23.024484997199306,
 'stemplach kontrolerskich': 23.024484997199306,
 'uczynieniu nieczytelnym': 23.024484997199306,
 'odvjetnik odvjetnica': 23.024484997199306,
 'środowiskami przestępczymi': 23.024484997199306,
 'hal aukcyjnych': 23.024484997199306,
 'odszko dowania': 23.024484997199306,
 'pompowni bolko'

In [12]:

# "korzy stający" is the word "korzystający" broken across two lines of the source act:
#     środków trwałych, jeżeli w umowie leasingu zastrzeżono, że korzy
#     stający będzie ponosił ciężar tych podatków i składek niezależnie
old_tokens_list.index('korzy')
old_tokens_list[22536:22560]

['w',
 'umowie',
 'leasingu',
 'zastrzeżono,',
 'że',
 'korzy',
 'stający',
 'będzie',
 'ponosił',
 'ciężar',
 'tych',
 'podatków',
 'i',
 'składek',
 'niezależnie',
 'od',
 'opłat',
 'za',
 'używanie,',
 '3)',
 'kaucji',
 'określonej',
 'w',
 'umowie']

In [13]:
tokens_list.index('zrze')

<class 'ValueError'>: 'zrze' is not in list

In [None]:
tokens_list[4421639:4421650]

In [None]:
print(*["Ala","ma","aligatora"])

In [None]:
separate_puctuations(["tery­torialnego"])

5. Sort the word pairs according to that measure in descending order and determine the top 10 entries.

6. Filter out bigrams that occur fewer than 5 times. Determine the top 10 entries for the remaining dataset (>=5
   occurrences).

In [25]:
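# Keep bigrams seen at least 5 times; note that this ranks them by raw count, not by PMI.
gram2_filtered = {k: v for k, v in Counter(gram2).items() if v >= 5}
most_frequent = dict(sorted(gram2_filtered.items(), key=operator.itemgetter(1), reverse=True))
list(most_frequent.items())[:5]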


[('w art', 32045),
 ('mowa w', 28471),
 ('w ust', 23557),
 ('o których', 13885),
 ('których mowa', 13858)]
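
Ranking the frequent bigrams by PMI rather than by count could look roughly like the sketch below. It is an illustration on invented data: the helper name `top_pmi`, its parameters, and the decision to estimate probabilities over all bigrams before applying the threshold are assumptions.

```python
import math
from collections import Counter


def top_pmi(tokens, min_count=5, n=10):
    """Rank bigrams seen at least `min_count` times by PMI (hypothetical helper)."""
    token_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = sum(token_counts.values())
    n_bigrams = sum(bigram_counts.values())

    scores = {}
    for (x, y), count in bigram_counts.items():
        if count < min_count:
            continue  # too rare for a reliable estimate
        p_xy = count / n_bigrams
        p_x = token_counts[x] / n_tokens
        p_y = token_counts[y] / n_tokens
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy corpus: a repeated phrase, a repeated function-word pair,
# and a pair that occurs only once.
tokens = ("o których mowa w art " * 5 + "w ust " * 5 + "gałki ocznej").split()
for pair, score in top_pmi(tokens, min_count=5):
    print(pair, round(score, 2))
```

With `min_count=1` the single-occurrence pair comes out on top; with the threshold of 5 it is excluded from the ranking altogether.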

7. Use [log likelihood ratio](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) (LLR) to compute the measure
   for all pairs of words.

8. Sort the word pairs according to that measure in descending order and display the top 10 entries.
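
One way to compute LLR for a bigram is to build a 2x2 contingency table (both words together, first without second, second without first, neither) and combine the entropies of the table, its row sums and its column sums. The sketch below follows that formulation; the function names, the toy data and the choice to take the marginals over bigram positions are assumptions made for illustration.

```python
import math
from collections import Counter


def _xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    """Unnormalized Shannon entropy of a collection of counts."""
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp rounding noise


def bigram_llr(tokens):
    """Score every bigram; marginals are taken over bigram positions."""
    pairs = Counter(zip(tokens, tokens[1:]))
    first = Counter(tokens[:-1])   # how often a word starts a bigram
    second = Counter(tokens[1:])   # how often a word ends a bigram
    total = sum(pairs.values())
    scores = {}
    for (x, y), k11 in pairs.items():
        k12 = first[x] - k11       # x followed by something else
        k21 = second[y] - k11      # y preceded by something else
        k22 = total - k11 - k12 - k21
        scores[(x, y)] = llr(k11, k12, k21, k22)
    return scores


tokens = ("o których mowa w art " * 5 + "w ust " * 5 + "gałki ocznej").split()
scores = bigram_llr(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(score, 2))
```

Unlike PMI, this score grows with the amount of evidence, so on the toy data the repeated pair "o których" scores higher than the single occurrence of "gałki ocznej".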

9. Compute **trigram** counts for the whole corpus and perform the same filtering.

10. Use PMI (with a threshold of 5 occurrences) and LLR to compute the top 10 results for the trigrams. Devise a method for computing the values, based on the
   results for bigrams.
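
There is more than one reasonable way to extend the measures to trigrams. The sketch below shows one possible reading for PMI, comparing the trigram probability with the product of the three word probabilities; an alternative would be to treat the trigram as a pair (first two words, third word) and reuse a bigram measure unchanged. The helper names, the `str.isalpha` filter and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def trigram_pmi(tokens, min_count=1):
    """PMI of (x, y, z) against the independence of all three words."""
    unigrams = Counter(tokens)
    trigrams = Counter(ngrams(tokens, 3))
    n_tokens = sum(unigrams.values())
    n_trigrams = sum(trigrams.values())
    scores = {}
    for gram, count in trigrams.items():
        # Keep only trigrams made of letters and seen often enough.
        if count < min_count or not all(w.isalpha() for w in gram):
            continue
        p_gram = count / n_trigrams
        p_indep = math.prod(unigrams[w] / n_tokens for w in gram)
        scores[gram] = math.log2(p_gram / p_indep)
    return scores


tokens = ("o których mowa w art . 5 " * 5 + "gałki ocznej").split()
scores = trigram_pmi(tokens, min_count=5)
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(" ".join(gram), round(score, 2))
```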

11. Create a table comparing the methods (separate table for bigrams and trigrams).
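
A side-by-side table can be produced from the ranked lists with a few lines of plain Python. The helper below is a hypothetical sketch that renders Markdown; the entries in `example` are placeholders, not results.

```python
def comparison_table(rankings, n=10):
    """Render {method name: ranked list of n-grams} as a Markdown table."""
    methods = list(rankings)
    lines = ["| rank | " + " | ".join(methods) + " |",
             "|" + " --- |" * (len(methods) + 1)]
    for i in range(n):
        cells = [rankings[m][i] if i < len(rankings[m]) else "" for m in methods]
        lines.append(f"| {i + 1} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder entries only; real rankings come from the scoring steps.
example = {
    "PMI": ["<bigram 1>", "<bigram 2>"],
    "PMI (>= 5)": ["<bigram 1>", "<bigram 2>"],
    "LLR": ["<bigram 1>", "<bigram 2>"],
}
table = comparison_table(example, n=2)
print(table)
```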

12. Answer the following questions:

   a. Why do we have to filter the bigrams, rather than the token sequence?
   
   b. Which measure (PMI, PMI with filtering, LLR) works better for the bigrams and which for the trigrams?
   
   c. What types of expressions are discovered by the methods?
   
   d. Can you devise a different type of filtering that would yield better results?