## Zhlukovanie článkov Wikipédie do kategórií na základe ich vedecko-spoločenskej oblasti

**Vypracoval:** Tomáš Babjak

**Predmet:** Vyhľadávanie informácii

**GitHub:** https://github.com/tomasbabjak/VINF_Wikipedia

Imports

In [None]:
import regex
import re
import datamuse
import nltk
import json
import string
import time
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import *
from sklearn.feature_extraction.text import TfidfVectorizer

### 1. Vytvoriť testovaciu vzorku dát, na ktorej budeme prvotne projekt realizovať

Read XML file with Wiki articles and parse articles to list:

In [134]:
def read_xml(file_name, n_first_articles):
    
    start_tag = f'<page>'
    end_tag = f'</page>'
    
    start_found = False
    articles_found = []
    lines = ''
    
    with open(file_name, encoding="utf8") as f:
        for line in f:
            if start_tag in line:
                start_found = True
            if start_found:
                lines += line
            if end_tag in line:
                start_found = False
                articles_found.append(lines)
                lines = ''
            if len(articles_found) == n_first_articles:
                break
    with open(f'../data/wiki_{n_first_articles}_before.json', 'w') as outfile:
        json.dump(articles_found, outfile, indent=4)
    return articles_found

In [133]:
def read_xml_modified(file_name, n_first_articles):
    
    start_tag = f'<page>'
    end_tag = f'</page>'
    
    start_found = False
    articles_found = []
    lines = ''
    counter = 0
    
    with open(file_name, encoding="utf8") as f:
        for line in f:
            if start_tag in line:
                start_found = True
            if start_found:
                lines += line
            if end_tag in line:
                start_found = False
                articles_found.append(lines)
                lines = ''
            if len(articles_found) == n_first_articles:
                counter += 1
                with open(f'../data/wiki_{counter}_before.json', 'w') as outfile:
                    json.dump(articles_found, outfile, indent=4)
                articles_found = []
    return articles_found

Extract Title and Text attributes from article and create dictionary from them:

In [132]:
def extract_text(text):
    title_regex = r'<title[^>]*>([^<]+)</title>'
    text_regex = r'<text[^>]*>([^<]+)</text>'
    pages = []
    for page in text:
        title = regex.findall(title_regex, page)
        text = regex.findall(text_regex, page)
        pages.append({"title": title[0] if title else '',
                      "text": text[0] if text else ''})
    return pages

## Najst infobox, anchor texty a wiki kategorie

### 4. Z článkov testovacej sady vyhľadať dôležité pojmy - zamerať sa na Infobox, kde sa nachádzajú dôležité informácie o článku

### 5. Vyhľadať odkazy na iné články Wikipédie (anchor text), ktoré môžu smerovať priamo na oblasť alebo aspoň priblížiť kontext článku

Find Infobox and Achor texts from Text attribute of article and add them to dictionary

In [129]:
def find_infobox_anchor(text):
    regex_infobox = r"(?=\{Infobox )(\{([^{}]|(?1))*\})"
    regex_anchor = r"\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]"
    regex_category = r"\[\[Category:([^\]]*\b)"
    for page in text:
        page['infobox'] = regex.findall(regex_infobox, page['text'])
        page['anchors'] = regex.findall(regex_anchor, page['text'])
        page['category_wiki'] = regex.findall(regex_category, page['text'])
        page['text'] = regex.sub(regex_infobox, '', page['text'])
        page['text'] = regex.sub(regex_anchor, '', page['text'])
        page['text'] = regex.sub(regex_category, '', page['text'])
    return text

Separate Redirect articles from others into two separate lists

In [130]:
def find_redirect(text):
    regex_redirect = r"^#redirect[^\[]*\[\[([^\]]+)"
    redirect_pages = []
    article_pages = []
    for page in text:
        if regex.findall(regex_redirect, page['text']):
            redirect_pages.append(page)
        else:
            article_pages.append(page)
    return (redirect_pages, article_pages)

In [None]:
#redirects, articles_test = find_redirect(find_infobox_anchor(extract_text(read_xml_modified('../data/enwiki-latest-pages-articles.xml', 100))))

In [None]:
#read_xml_modified('../data/enwiki-latest-pages-articles.xml', 100000)
for i in range(12,207):
    with open(f'../data/wiki_{i}_before.json', 'r+') as json_file:
        print(i)
        articles_all = json.load(json_file)
        redirects, articles_all = find_redirect(find_infobox_anchor(extract_text(articles_all)))
    with open(f'../data/wiki_{i}_before.json', 'w+') as json_file:
        json.dump(articles_all, json_file, indent=4)

In [None]:
for i in range(12,207):
    with open(f'../data/wiki_{i}_before.json', 'r+') as json_file:
        print(i)
        articles_all = json.load(json_file)
        redirects, articles_all = find_redirect(articles_all)
    with open(f'../data/wiki_{i}_before.json', 'w+') as json_file:
        json.dump(articles_all, json_file, indent=4)

### 2. Vytvoriť zoznam (strom) spoločensko-vedných oblastí, do ktorých budeme jednotlivé stránky zaraďovať, ku každej oblasti nájsť aj slová, ktoré sa s ňou spájajú

Find terms related to our categories with Datamuse library. Split words of each category and find 100 terms related to them

In [None]:
categories = [
    'Culture, literature and the arts',
    'Geography - places and states',
    'Medicine - health and fitness',
    'History and events',
    'Mathematics and logic',
    'Nature and physics',
    'Technology and computing',
    'Philosophy and thinking',
    'Religion and belief',
    'Society, politics and people'
]

new_categories = [
    'Culture',
    'Food',
    'Language',
    'Literature',
    'Art',
    'Dance',
    'Film',
    'Music',
    'Theatre',
    'Architecture',
    'Painting',
    'Sculpture',
    'Games',
    'Sport',
    'Recreation',
    'Media',
    'Internet',
    'Geography',
    'Earth',
    'Health',
    'Fitness',
    'Exercise',
    'Life',
    'Medicine',
    'History',
    'Education',
    'Crime',
    'War',
    'Transport',
    'Mathematics',
    'Logic',
    'Statistics',
    'Biology',
    'Nature',
    'Science',
    'Philosophy',
    'Religion',
    'Belief',
    'Society',
    'Technology',
    'Computing',
    'Electronics',
    'Engineering']

### Vytvorit gazeteer pomocou Wiki clankov mojich kategorii

Ku kazdej z mojich kategorii najst clanok wikipedie s rovnakym nazvom a pomocou neho neskor vytvorit gazeteer.

In [None]:
def find_categories_articles(file_name):
    start_tag = f'<page>'
    end_tag = f'</page>'
    title_regex = r'<title[^>]*>([^<]+)</title>'

    start_found = False
    reading = False
    start_just = False
    articles_found = []
    lines = ''
    
    try:
        with open(file_name, encoding="utf8") as f:
            for line in f:
                if start_tag in line:
                    start_found = True
                    start_just = True
                    continue
                if start_just:
                    category = regex.findall(title_regex, line)
                    if category[0] in new_categories:
                        print(category[0])
                        reading = True
                    start_just = False
                if start_found and reading:
                    lines += line
                if end_tag in line:
                    start_found = False
                    reading = False
                    if category[0] in new_categories:
                        articles_found.append(lines)
                    category = ''
                    lines = ''
                if len(articles_found) == len(new_categories):
                    break
        with open(f'../data/wiki_categories.json', 'w') as outfile:
            json.dump(articles_found, outfile)
        return articles_found
    except:
          print("An exception occurred")
    finally:
        with open(f'../data/wiki_categories.json', 'w') as outfile:
            json.dump(articles_found, outfile)    
        return articles_found

In [127]:
categories_articles = find_categories_articles('../data/enwiki-latest-pages-articles.xml')

Art
Computing
Crime
Dance
Earth
Engineering
Education
Electronics
Food
Games
Internet
Language
Life
Mathematics
Music
Medicine
Nature
Recreation
Religion
Statistics
Science
Sculpture
Technology
War
Society
Media
Health
Belief
An exception occurred


### Vytvorit gazeteer pomocou Datamuse kniznice

Ku kazdej z mojich kategorii najst gazeteer pomocou kniznice Datamuse - related words

In [39]:
api = datamuse.Datamuse()
        
def categories_find_related(categories):
    cats_with_words = []

    for c in categories:
        api_words = api.words(ml=c, max=20)
        result = list(map(lambda x: x.get('word'), api_words))
        result.append(c.lower())
        cats_with_words.append({'category':c,'related_words':result})

    return cats_with_words

## Predspracovanie

### 3. Články vhodne predspracovať - stemming, tokenizácia, odstránenie stop slov

In [56]:
def tokenize_text(text):
    text_tokens = word_tokenize(text)
    text_tokens = [token.lower() for token in text_tokens if token not in ["*+'-./:;,|<=>?@[\]^_`{}~!\"#$%&()\n"]]
    return text_tokens

In [57]:
def remove_stops(text):
    tokens_without_stops = list(filter(lambda x: (x not in string.punctuation) and (x not in stopwords.words('english')),text))
    return tokens_without_stops

In [58]:
stemmer = PorterStemmer()

def stem_list(llist):
    return [stemmer.stem(word) for word in llist]

In [59]:
def preprocess_text(text):
    if not text:
        return []
    else:
        text = tokenize_text(text)
        text = remove_stops(text)
        text = stem_list(text)
        return text

### Predspracovat slova textu clanku, kategorii, infoboxov a anchor textov

In [None]:
for art in articles_test:
    # Predspracovat slova textu clanku:
    art['text_tokens'] = preprocess_text(art.get('text'))
    # Predspracovat slova z kategorii:
    art['category_wiki_tokens'] = preprocess_text(' '.join(art.get('category_wiki')))
    # Predspracovat slova z infoboxov:
    if art.get('infobox'):
        art['infobox_tokens'] = preprocess_text(' '.join(art.get('infobox')[0]))
    else:
        art['infobox_tokens'] = []
    # Predspracovat slova z anchor textov:
    art['anchors_tokens'] = preprocess_text(' '.join([' '.join(tups) for tups in art.get('anchors')]))

#### Testovanie pre jeden subor

In [None]:
import time

for i in range(11,12):
    with open(f'../data/wiki_{i}_before.json', 'r+') as json_file:
        articles_all = json.load(json_file)
        start_time = time.time()
        for index, art in zip(range(100,1100), articles_all):
            print(index)
            art['text_tokens'] = preprocess_text(art.pop('text',''))
            art['category_wiki_tokens'] = preprocess_text(' '.join(art.pop('category_wiki','')))
            if art.get('infobox'):
                art['infobox_tokens'] = preprocess_text(' '.join(art.pop('infobox','')[0]))
            else:
                art['infobox_tokens'] = []
            art['anchors_tokens'] = preprocess_text(' '.join([' '.join(tups) for tups in art.pop('anchors','')]))
        print("--- %s seconds ---" % (time.time() - start_time))
#     with open(f'../data/wiki_{i}_before.json', 'w+') as json_file:
#         json.dump(articles_all, json_file, indent=4)

In [None]:
arts = articles_all[100:1100]

#### Vsetky subory

In [None]:
for i in range(12,207):
    with open(f'../data/wiki_{i}_before.json', 'r+') as json_file:
        articles_all = json.load(json_file)
        for index, art in zip(range(10), items):
            print(index)
            art['text_tokens'] = preprocess_text(art.pop('text',''))
            art['category_wiki_tokens'] = preprocess_text(' '.join(art.pop('category_wiki','')))
            if art.get('infobox'):
                art['infobox_tokens'] = preprocess_text(' '.join(art.pop('infobox','')[0]))
            else:
                art['infobox_tokens'] = []
            art['anchors_tokens'] = preprocess_text(' '.join([' '.join(tups) for tups in art.pop('anchors','')]))
    with open(f'../data/wiki_{i}_before.json', 'w+') as json_file:
        json.dump(articles_all, json_file, indent=4)

### Predspracovat slova pre gazeteer - DATAMUSE

In [None]:
cats_with_words = categories_find_related(new_categories)
for cat in cats_with_words:
    cat['related_tokens'] = preprocess_text(' '.join(cat.get('related_words')))

### Predspracovat slova pre gazeteer - WIKI clanky

In [135]:
# categories_articles = find_categories_articles('../data/enwiki-latest-pages-articles.xml')
with open('../data/wiki_categories.json') as json_file:
    categories_articles = json.load(json_file)

categories_articles = find_infobox_anchor(extract_text(categories_articles))
for art in categories_articles:
    art['text_tokens'] = preprocess_text(art.get('text'))# num = 0
    art['category_wiki_tokens'] = preprocess_text(' '.join(art.get('category_wiki')))
    if art.get('infobox'):
        art['infobox_tokens'] = preprocess_text(' '.join(art.get('infobox')[0]))
    else:
        art['infobox_tokens'] = []
    art['anchors_tokens'] = preprocess_text(' '.join([' '.join(tups) for tups in art.get('anchors')]))

## TF-IDF

In [None]:
import pandas as pd 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import numpy.linalg as LA

In [None]:
# OLD VERSION

# def tfidf_train(train_set):
#     vectorizer = TfidfVectorizer(use_idf=True)
#     trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
#     print(trainVectorizerArray)
#     return vectorizer, trainVectorizerArray

# def tfdif_test_cosine(test_set, vectorizer, trainVectorizerArray):
#     if not test_set[0]:
#         return {}
    
#     testVectorizerArray = vectorizer.transform(test_set).toarray()
#     cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
    
#     categories_sims = {}
#     for vector, category in zip(trainVectorizerArray, new_categories):
#         for testV in testVectorizerArray:
#             cosine = cx(vector, testV)
#             if cosine != 0:
#                 categories_sims[category] = cosine
#     return categories_sims

In [124]:
def tfidf_train(train_set):
    vectorizer = TfidfVectorizer()
    docs_tfidf = vectorizer.fit_transform(train_set)
    return vectorizer, docs_tfidf

def tfdif_test_cosine(query, vectorizer, docs_tfidf):
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    categories_sims = {}
    
    for cosine, category in zip(cosineSimilarities, new_categories):
        if cosine != 0:
            categories_sims[category] = cosine
    return categories_sims

## Kosinusova podobnost TF-IDF

Natrenovana Term frequency - Inverse document frequency na datasete mojho gazeteeru, teda slov jednotlivych kategorii.

Tento model vyhodnotime na slovach textu clankov, infoboxov, kategorii a anchor textoch a vypocitame kosinusovu podobnost s kategoriami gazeteeru.

### Trenovanie TF-IDF na gazeteere z DATAMUSE

In [None]:
train_set = [' '.join(cat.get('related_tokens')) for cat in cats_with_words]

vectorizer_datamuse, trained_model_datamuse = tfidf_train(train_set)

### Trenovanie TF-IDF na gazeteere z WIKI clankov kategorii

In [None]:
# train_set_text = [' '.join(cat.get('text_tokens')) for cat in categories_articles]
train_set_category = [' '.join(cat.get('category_wiki_tokens')) for cat in categories_articles]
train_set_infobox = [' '.join(cat.get('infobox_tokens')) for cat in categories_articles]
train_set_anchors = [' '.join(cat.get('anchors_tokens')) for cat in categories_articles]

vectorizer_wiki0, trained_model_wiki0 = tfidf_train(train_set_anchors)
vectorizer_wiki1, trained_model_wiki1 = tfidf_train(train_set_infobox)
vectorizer_wiki2, trained_model_wiki2 = tfidf_train(train_set_category)

In [136]:
train_set_text = [' '.join(cat.get('text_tokens')) for cat in categories_articles]
vectorizer_wiki3, trained_model_wiki3 = tfidf_train(train_set_text)

In [126]:
categories_articles[0]

'    <title>Art</title>\n    <ns>0</ns>\n    <id>752</id>\n    <revision>\n      <id>979038585</id>\n      <parentid>979038551</parentid>\n      <timestamp>2020-09-18T11:47:28Z</timestamp>\n      <contributor>\n        <username>Materialscientist</username>\n        <id>7852030</id>\n      </contributor>\n      <minor />\n      <comment>Reverted edits by [[Special:Contributions/Russian_r_maybe|Russian_r_maybe]] ([[User talk:Russian_r_maybe|talk]]) ([[WP:HG|HG]]) (3.4.10)</comment>\n      <model>wikitext</model>\n      <format>text/x-wiki</format>\n      <text bytes="98875" xml:space="preserve">{{about|the general concept of art|the group of creative disciplines|The arts|other uses|Art (disambiguation)}}\n{{pp-semi-indef}}\n{{pp-move-indef}}\n{{short description|Creative work to evoke emotional response}}\n{{Use dmy dates|date=July 2020}}\n[[File:Art-portrait-collage 2.jpg|thumb|upright=1.5|Clockwise from upper left: an 1887 [[self-portrait]] by [[Vincent van Gogh]]; a female ancestor f

### Kosinusova podobnost s Kategorickymi clankami Wiki - DATAMUSE TFIDF

### Anchor text - kosinusova podobnost

In [None]:
for art in categories_articles:
    art['anchor_sims'] = tfdif_test_cosine(' '.join(art.get('anchors_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['anchor_sims'] = {k: v for k, v in sorted(art.get('anchor_sims').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('anchor_sims'))

### Kategorie Wiki - kosinusova podobnost

In [None]:
for art in categories_articles:
    art['categories_sims'] = tfdif_test_cosine(' '.join(art.get('category_wiki_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['categories_sims'] = {k: v for k, v in sorted(art.get('categories_sims').items(), key = lambda item: item[1], reverse=True)}        
    print(art.get('title'))
    print(art.get('categories_sims'))

### Infobox - kosinusova podobnost

In [None]:
for art in categories_articles:
    art['infobox_sims'] = tfdif_test_cosine(' '.join(art.get('infobox_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['infobox_sims'] = {k: v for k, v in sorted(art.get('infobox_sims').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('infobox_sims'))

###  Text clanku - kosinusova podobnost

In [None]:
for art in categories_articles:
    art['text_sims'] = tfdif_test_cosine(' '.join(art.get('text_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['text_sims'] = {k: v for k, v in sorted(art.get('text_sims').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('text_sims'))

### Kosinusova podobnost s testovacimi clankami - DATAMUSE TFIDF

### Anchor text - kosinusova podobnost

In [None]:
for art in articles_test:
    art['anchor_sims'] = tfdif_test_cosine(' '.join(art.get('anchors_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['anchor_sims'] = {k: v for k, v in sorted(art.get('anchor_sims').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('anchor_sims'))

### Kategorie Wiki - kosinusova podobnost

In [None]:
for art in articles_test:
    art['categories_sims'] = tfdif_test_cosine(' '.join(art.get('category_wiki_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['categories_sims'] = {k: v for k, v in sorted(art.get('categories_sims').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('categories_sims'))

### Infobox - kosinusova podobnost

In [None]:
for art in articles_test:
    art['infobox_sims'] = tfdif_test_cosine(' '.join(art.get('infobox_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['infobox_sims'] = {k: v for k, v in sorted(art.get('infobox_sims').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('infobox_sims'))

###  Text clanku - kosinusova podobnost

In [None]:
for art in articles_test:
    art['text_sims'] = tfdif_test_cosine(' '.join(art.get('text_tokens')), vectorizer_datamuse, trained_model_datamuse)
    art['text_sims'] = {k: v for k, v in sorted(art.get('text_sims').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('text_sims'))

### Kosinusova podobnost s  s testovacimi clankami - WIKI Infobox TFIDF

### Anchor text - kosinusova podobnost

In [None]:
for art in articles_test:
    art['anchor_sims_info'] = tfdif_test_cosine(' '.join(art.get('anchors_tokens')), vectorizer_wiki1, trained_model_wiki1)
    art['anchor_sims_info'] = {k: v for k, v in sorted(art.get('anchor_sims_info').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('anchor_sims_info'))

### Kategorie Wiki - kosinusova podobnost

In [None]:
for art in articles_test:
    art['categories_sims_info'] = tfdif_test_cosine(' '.join(art.get('category_wiki_tokens')), vectorizer_wiki1, trained_model_wiki1)
    art['categories_sims_info'] = {k: v for k, v in sorted(art.get('categories_sims_info').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('categories_sims_info'))

### Infobox - kosinusova podobnost

In [None]:
for art in articles_test:
    art['infobox_sims_info'] = tfdif_test_cosine(' '.join(art.get('infobox_tokens')), vectorizer_wiki1, trained_model_wiki1)
    art['infobox_sims_info'] = {k: v for k, v in sorted(art.get('infobox_sims_info').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('infobox_sims_info'))

###  Text clanku - kosinusova podobnost

In [None]:
for art in articles_test:
    art['text_sims_info'] = tfdif_test_cosine(' '.join(art.get('text_tokens')), vectorizer_wiki1, trained_model_wiki1)
    art['text_sims_info'] = {k: v for k, v in sorted(art.get('text_sims_info').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('text_sims_info'))

### Kosinusova podobnost  s testovacimi clankami - WIKI kategorie TFIDF

### Anchor text - kosinusova podobnost

In [None]:
for art in articles_test:
    art['anchor_sims_cat'] = tfdif_test_cosine(' '.join(art.get('anchors_tokens')), vectorizer_wiki2, trained_model_wiki2)
    art['anchor_sims_cat'] = {k: v for k, v in sorted(art.get('anchor_sims_cat').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('anchor_sims_cat'))

### Kategorie Wiki - kosinusova podobnost

In [None]:
for art in articles_test:
    art['categories_sims_cat'] = tfdif_test_cosine(' '.join(art.get('category_wiki_tokens')), vectorizer_wiki2, trained_model_wiki2)
    art['categories_sims_cat'] = {k: v for k, v in sorted(art.get('categories_sims_cat').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('categories_sims_cat'))

### Infobox - kosinusova podobnost

In [None]:
for art in articles_test:
    art['infobox_sims_cat'] = tfdif_test_cosine(' '.join(art.get('infobox_tokens')), vectorizer_wiki2, trained_model_wiki2)
    art['infobox_sims_cat'] = {k: v for k, v in sorted(art.get('infobox_sims_cat').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('infobox_sims_cat'))

###  Text clanku - kosinusova podobnost

In [None]:
for art in articles_test:
    art['text_sims_cat'] = tfdif_test_cosine(' '.join(art.get('text_tokens')), vectorizer_wiki2, trained_model_wiki2)
    art['text_sims_cat'] = {k: v for k, v in sorted(art.get('text_sims_cat').items(), key = lambda item: item[1], reverse=True)}
    print(art.get('title'))
    print(art.get('text_sims_cat'))

### Kosinusova podobnost  s testovacimi clankami - WIKI text TFIDF

### Anchor text - kosinusova podobnost

In [138]:
for art in articles_test:
    art['anchor_sims_text'] = tfdif_test_cosine(' '.join(art.get('anchors_tokens')), vectorizer_wiki3, trained_model_wiki3)
    art['anchor_sims_text'] = {k: v for k, v in sorted(art.get('anchor_sims_text').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('anchor_sims_text'))

Anarchism
{'Earth': 0.07244911022677851, 'History': 0.06700589993343714, 'Medicine': 0.05714992036223981, 'Language': 0.05355468495119101, 'Film': 0.04890920237097767, 'Fitness': 0.04139444246229392, 'War': 0.03825578079125836, 'Health': 0.03587335537628782, 'Media': 0.03208074024430446, 'Culture': 0.030875891552559164, 'Life': 0.029800582617354285, 'Education': 0.027006592809053043, 'Internet': 0.024166673530668835, 'Sport': 0.022914222787118445, 'Crime': 0.0221760342602772, 'Geography': 0.02193233462149937, 'Dance': 0.02133418194400668, 'Exercise': 0.021286249582868833, 'Sculpture': 0.02074728769975692, 'Painting': 0.020670919688639114, 'Theatre': 0.020064317849952683, 'Literature': 0.018524342217998507, 'Games': 0.017923024019223268, 'Food': 0.01623841074336768, 'Recreation': 0.013805809269872623, 'Music': 0.011387356496594418, 'Art': 0.009886175700932328}
Autism
{'Media': 0.10393937988900835, 'Crime': 0.05878668975014259, 'Sculpture': 0.04480115784310468, 'Health': 0.03962452758501

Andre Agassi
{'History': 0.030308603991457895, 'Sculpture': 0.028145206474875943, 'Culture': 0.023672894379247137, 'Film': 0.02356806135920569, 'Geography': 0.02320421349164917, 'Painting': 0.02305220168066994, 'Medicine': 0.022514716270426133, 'Fitness': 0.02169978178193662, 'Earth': 0.020779975788907043, 'Food': 0.01815939735398965, 'Exercise': 0.0168629852298678, 'Recreation': 0.016685940145354097, 'Crime': 0.016514344911032663, 'Language': 0.01644092324532501, 'Internet': 0.015330794545304862, 'Games': 0.0153035552989865, 'Life': 0.01486302252894969, 'Media': 0.014767367080669908, 'Sport': 0.014415782769178011, 'Dance': 0.01367381990046725, 'War': 0.01322897190513676, 'Art': 0.013075080175810804, 'Health': 0.012122100327839525, 'Music': 0.011675382593045383, 'Literature': 0.011121387186636102, 'Education': 0.010335745352183219, 'Theatre': 0.010038530816272642, 'Architecture': 0.0032845475746463232}


### Kategorie Wiki - kosinusova podobnost

In [139]:
for art in articles_test:
    art['categories_sims_text'] = tfdif_test_cosine(' '.join(art.get('category_wiki_tokens')), vectorizer_wiki3, trained_model_wiki3)
    art['categories_sims_text'] = {k: v for k, v in sorted(art.get('categories_sims_text').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('anchor_sims_text'))

Anarchism
{'Earth': 0.07244911022677851, 'History': 0.06700589993343714, 'Medicine': 0.05714992036223981, 'Language': 0.05355468495119101, 'Film': 0.04890920237097767, 'Fitness': 0.04139444246229392, 'War': 0.03825578079125836, 'Health': 0.03587335537628782, 'Media': 0.03208074024430446, 'Culture': 0.030875891552559164, 'Life': 0.029800582617354285, 'Education': 0.027006592809053043, 'Internet': 0.024166673530668835, 'Sport': 0.022914222787118445, 'Crime': 0.0221760342602772, 'Geography': 0.02193233462149937, 'Dance': 0.02133418194400668, 'Exercise': 0.021286249582868833, 'Sculpture': 0.02074728769975692, 'Painting': 0.020670919688639114, 'Theatre': 0.020064317849952683, 'Literature': 0.018524342217998507, 'Games': 0.017923024019223268, 'Food': 0.01623841074336768, 'Recreation': 0.013805809269872623, 'Music': 0.011387356496594418, 'Art': 0.009886175700932328}
Autism
{'Media': 0.10393937988900835, 'Crime': 0.05878668975014259, 'Sculpture': 0.04480115784310468, 'Health': 0.03962452758501

Apollo
{'Earth': 0.0651557104158204, 'Exercise': 0.04855255781538288, 'Sculpture': 0.045134256032517514, 'Culture': 0.024935123313762304, 'Life': 0.02057515583134431, 'Fitness': 0.019550199658203236, 'Internet': 0.017626926769124952, 'Art': 0.017294367174790902, 'Media': 0.016113517073282006, 'Language': 0.015883013991788743, 'War': 0.015551602681729382, 'Recreation': 0.0144758027898894, 'Sport': 0.012944159888665568, 'Dance': 0.012900172993502363, 'Medicine': 0.012334115410461357, 'History': 0.011891497107943922, 'Education': 0.011674189296116414, 'Games': 0.01054514983311521, 'Film': 0.007993335013138396, 'Geography': 0.007865016341335117, 'Literature': 0.007503965725233466, 'Theatre': 0.006075353682193001, 'Food': 0.005548534494577702, 'Music': 0.005002396619327778, 'Health': 0.0043914995587015105, 'Crime': 0.0042500288488837796, 'Painting': 0.003739606338154417}
Andre Agassi
{'History': 0.030308603991457895, 'Sculpture': 0.028145206474875943, 'Culture': 0.023672894379247137, 'Film'

### Infobox - kosinusova podobnost

In [140]:
for art in articles_test:
    art['infobox_sims_text'] = tfdif_test_cosine(' '.join(art.get('infobox_tokens')), vectorizer_wiki3, trained_model_wiki3)
    art['infobox_sims_text'] = {k: v for k, v in sorted(art.get('infobox_sims_text').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('anchor_sims_text'))

Anarchism
{'Earth': 0.07244911022677851, 'History': 0.06700589993343714, 'Medicine': 0.05714992036223981, 'Language': 0.05355468495119101, 'Film': 0.04890920237097767, 'Fitness': 0.04139444246229392, 'War': 0.03825578079125836, 'Health': 0.03587335537628782, 'Media': 0.03208074024430446, 'Culture': 0.030875891552559164, 'Life': 0.029800582617354285, 'Education': 0.027006592809053043, 'Internet': 0.024166673530668835, 'Sport': 0.022914222787118445, 'Crime': 0.0221760342602772, 'Geography': 0.02193233462149937, 'Dance': 0.02133418194400668, 'Exercise': 0.021286249582868833, 'Sculpture': 0.02074728769975692, 'Painting': 0.020670919688639114, 'Theatre': 0.020064317849952683, 'Literature': 0.018524342217998507, 'Games': 0.017923024019223268, 'Food': 0.01623841074336768, 'Recreation': 0.013805809269872623, 'Music': 0.011387356496594418, 'Art': 0.009886175700932328}
Autism
{'Media': 0.10393937988900835, 'Crime': 0.05878668975014259, 'Sculpture': 0.04480115784310468, 'Health': 0.03962452758501

###  Text clanku - kosinusova podobnost

In [141]:
for art in articles_test:
    art['text_sims_text'] = tfdif_test_cosine(' '.join(art.get('text_tokens')), vectorizer_wiki3, trained_model_wiki3)
    art['text_sims_text'] = {k: v for k, v in sorted(art.get('text_sims_text').items(), key = lambda item: item[1], reverse=True)}    
    print(art.get('title'))
    print(art.get('text_sims_text'))

Anarchism
{'Fitness': 0.39933652965431, 'Life': 0.36300868513842904, 'Games': 0.3356660211844196, 'Film': 0.31506541008390543, 'Culture': 0.3021243497686672, 'Earth': 0.3004439751184369, 'Sport': 0.28926622263670654, 'Medicine': 0.2869505605002411, 'Internet': 0.2773459231364972, 'Sculpture': 0.2696662763516006, 'Art': 0.2625904653395812, 'Dance': 0.25933183615493094, 'Painting': 0.24503529601477533, 'War': 0.24064286486673614, 'Food': 0.2375597205384696, 'History': 0.23754299174257565, 'Music': 0.22962742766588254, 'Media': 0.2170663673247739, 'Health': 0.19237482408122264, 'Crime': 0.18934380658895186, 'Geography': 0.18032836885607498, 'Language': 0.17692380173652184, 'Theatre': 0.16043712686108288, 'Literature': 0.1353309713435972, 'Recreation': 0.13472867537075325, 'Exercise': 0.12570660614677312, 'Education': 0.03060649118953453, 'Architecture': 0.0009211383659994514}
Autism
{'Art': 0.4722744443395023, 'Games': 0.469157118319701, 'Fitness': 0.4534333082501079, 'Life': 0.4163940631

Academy Awards
{'Games': 0.5070169212845437, 'Internet': 0.4720164168192308, 'Painting': 0.46302550099452694, 'Fitness': 0.45333874104801053, 'Life': 0.4521203638395053, 'Art': 0.43570882002864353, 'Culture': 0.39910122325966474, 'Film': 0.3926109669805059, 'Medicine': 0.38013096130422785, 'Dance': 0.37379167562689347, 'Sport': 0.35001299477041176, 'Earth': 0.3497612862161938, 'Food': 0.34259004874901894, 'Crime': 0.30985362094045854, 'Theatre': 0.3002311112736856, 'Geography': 0.2966884522449866, 'Media': 0.2950302928790549, 'War': 0.27945972119664414, 'Music': 0.2728395790582755, 'Health': 0.2529291648245821, 'Sculpture': 0.24906210694585074, 'Language': 0.23534914559814069, 'History': 0.21670056292160148, 'Exercise': 0.21016938307690894, 'Literature': 0.18115075463563365, 'Recreation': 0.1770493380084246, 'Education': 0.03615442964702903, 'Architecture': 0.006726200569632919}
Actrius
{'Games': 0.5127864806573236, 'Art': 0.4855286505530308, 'Fitness': 0.4810416107840157, 'Life': 0.45

ASCII
{'Fitness': 0.494854530156508, 'Art': 0.48244334296120317, 'Games': 0.4458548829367324, 'Culture': 0.4338992436352611, 'Life': 0.43003571792525996, 'Earth': 0.41378608557146734, 'Sport': 0.41197206951734655, 'War': 0.38736406348115804, 'Dance': 0.3864063382358102, 'Medicine': 0.37721239204602613, 'Sculpture': 0.3692991632302381, 'Painting': 0.367260498287583, 'Internet': 0.3629515283333163, 'Food': 0.3431442909680095, 'Film': 0.33870165097423305, 'Language': 0.3229508905924473, 'Music': 0.31124060837541745, 'Media': 0.30609924193833044, 'Crime': 0.3014153912471974, 'Health': 0.2965019995583556, 'Exercise': 0.2869776168191269, 'Geography': 0.28492651483824527, 'Theatre': 0.2709912221547795, 'Recreation': 0.2708235029945509, 'Literature': 0.2667912529998638, 'History': 0.21957194004841446, 'Education': 0.02307065477116307, 'Architecture': 0.0012592414485311188}
Austin (disambiguation)
{'Education': 0.020770524978731584, 'Sculpture': 0.0170901024074799, 'Earth': 0.010860599710672028

In [143]:
with open(f'../data/test_30_tested.json', 'w') as outfile:
    json.dump(articles_test, outfile, indent=4)

## Testovanie a vyhodnotenie

### DATAMUSE gazeteer

#### Anchor text +  Kategorie Wiki + Infobox + Text clanku

In [158]:
from more_itertools import take
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import recall_score, precision_score, f1_score

with open('../data/test_30_tested.json') as json_file:
    articles_test = json.load(json_file)
    
titles = [art.get('title') for art in articles_test]
anchors = [0] * 30
categories = [0] * 30
infoboxes = [0] * 30
texts = [0] * 30
extended = [0] * 30
counter = 0

for art in articles_test:
    len_annotated = len(art.get('annotated_categories'))
    
    n_anchors = take(len_annotated, art.get('anchor_sims').keys())
    n_anchors.extend(take(len_annotated, art.get('categories_sims').keys()))
    n_anchors.extend(take(len_annotated, art.get('infobox_sims').keys()))
    n_anchors.extend(take(len_annotated, art.get('text_sims').keys()))
    art['extended_sims'] = list(set(n_anchors))
    
    for cat in art.get('annotated_categories'):
        if cat in art.get('anchor_sims') and not pd.isna(art.get('anchor_sims').get(cat)) and list(art.get('anchor_sims')).index(cat) + 1 <= len_annotated:
            anchors[counter] += 1
        if cat in art.get('categories_sims') and not pd.isna(art.get('categories_sims').get(cat)) and list(art.get('categories_sims')).index(cat) + 1 <= len_annotated:
            categories[counter] += 1
        if cat in art.get('infobox_sims') and not pd.isna(art.get('infobox_sims').get(cat)) and list(art.get('infobox_sims')).index(cat) + 1 <= len_annotated:
            infoboxes[counter] += 1
        if cat in art.get('text_sims') and not pd.isna(art.get('text_sims').get(cat)) and list(art.get('text_sims')).index(cat) + 1 <= len_annotated:
            texts[counter] += 1
        if cat in art.get('extended_sims'):
            extended[counter] += 1
    anchors[counter] = float("{:.2f}".format(anchors[counter] / len_annotated * 100))
    categories[counter] = float("{:.2f}".format(categories[counter] / len_annotated * 100))
    infoboxes[counter] = float("{:.2f}".format(infoboxes[counter] / len_annotated * 100))
    texts[counter] = float("{:.2f}".format(texts[counter] / len_annotated* 100))
    extended[counter] = float("{:.2f}".format(extended[counter] / len_annotated* 100))

    counter += 1
    
cars = {
    'Anchor': anchors,
    'Categories': categories,
    'Infobox': infoboxes, 
    'Text': texts,
    'All': extended
}

df = pd.DataFrame(cars, columns = ['Anchor','Categories','Infobox','Text','All'], index=titles)

print(df)
print (df.describe())
print('\nAverage of All column: ', float("{:.2f}".format(df['All'].mean())))

A=[art['annotated_categories'] for art in articles_test]
B=[art['extended_sims'] for art in articles_test]
#A=[[ "Culture","Philosophy","Belief","Society"],['Society']]
#B=[['Culture', 'Philosophy', 'Belief','dsdasd'],['dsd']]

multi = MultiLabelBinarizer()

y_true = multi.fit(A).transform(A)
y_pred = multi.transform(B)

print('Precision: ',precision_score(y_true, y_pred,average='weighted',zero_division=1))
print('Recall: ',recall_score(y_true, y_pred, average='weighted',zero_division=1))
print('F1:' ,f1_score(y_true, y_pred, average='weighted'))

                                          Anchor  Categories  Infobox    Text  \
Anarchism                                  25.00       75.00     0.00   50.00   
Autism                                     66.67        0.00    33.33    0.00   
Albedo                                      0.00        0.00     0.00    0.00   
A                                          50.00        0.00    50.00    0.00   
Alabama                                     0.00        0.00     0.00    0.00   
Achilles                                    0.00       66.67     0.00   66.67   
Abraham Lincoln                            40.00       40.00    20.00   20.00   
Aristotle                                  57.14       57.14    57.14   42.86   
An American in Paris                       50.00       25.00     0.00   50.00   
Academy Award for Best Production Design   66.67       66.67    66.67   66.67   
Academy Awards                             66.67       66.67    66.67   66.67   
Actrius                     



### WIKI kategorie gazeteer

#### Anchor text +  Kategorie Wiki + Infobox + Text clanku

In [45]:
with open('../data/test_30_tested.json') as json_file:
    articles_test = json.load(json_file)
    
titles = [art.get('title') for art in articles_test]
anchors = [0] * 30
categories = [0] * 30
infoboxes = [0] * 30
texts = [0] * 30
extended = [0] * 30
counter = 0

for art in articles_test:
    len_annotated = len(art.get('annotated_categories'))

    n_anchors = take(len_annotated, art.get('anchor_sims_cat').keys())
    n_anchors.extend(take(len_annotated, art.get('categories_sims_cat').keys()))
    n_anchors.extend(take(len_annotated, art.get('infobox_sims_cat').keys()))
    n_anchors.extend(take(len_annotated, art.get('text_sims_cat').keys()))
    art['extended_sims_cat'] = list(set(n_anchors))

    for cat in art.get('annotated_categories'):
        if cat in art.get('anchor_sims_cat') and not pd.isna(art.get('anchor_sims_cat').get(cat)) and list(art.get('anchor_sims_cat')).index(cat) + 1 <= len_annotated:
            anchors[counter] += 1
            #print(cat, art.get('anchor_sims_cat').get(cat),list(art.get('anchor_sims_cat')).index(cat) + 1)
        if cat in art.get('categories_sims_cat') and not pd.isna(art.get('categories_sims_cat').get(cat)) and list(art.get('categories_sims_cat')).index(cat) + 1 <= len_annotated:
            categories[counter] += 1
            #print(cat, art.get('categories_sims_cat').get(cat),list(art.get('categories_sims_cat')).index(cat) + 1)
        if cat in art.get('infobox_sims_cat') and not pd.isna(art.get('infobox_sims_cat').get(cat)) and list(art.get('infobox_sims_cat')).index(cat) + 1 <= len_annotated:
            infoboxes[counter] += 1
            #print(cat, art.get('infobox_sims_cat').get(cat),list(art.get('infobox_sims_cat')).index(cat) + 1)
        if cat in art.get('text_sims_cat') and not pd.isna(art.get('text_sims_cat').get(cat)) and list(art.get('text_sims_cat')).index(cat) + 1 <= len_annotated:
            texts[counter] += 1
            #print(cat, art.get('text_sims_cat').get(cat),list(art.get('text_sims_cat')).index(cat) + 1)
        if cat in art.get('extended_sims_cat'):
            extended[counter] += 1
    anchors[counter] = float("{:.2f}".format(anchors[counter] / len_annotated * 100))
    categories[counter] = float("{:.2f}".format(categories[counter] / len_annotated * 100))
    infoboxes[counter] = float("{:.2f}".format(infoboxes[counter] / len_annotated * 100))
    texts[counter] = float("{:.2f}".format(texts[counter] / len_annotated* 100))
    extended[counter] = float("{:.2f}".format(extended[counter] / len_annotated* 100))

    counter += 1
    
cars = {
    'Anchor': anchors,
    'Categories': categories,
    'Infobox': infoboxes, 
    'Text': texts,
    'All': extended
}

df = pd.DataFrame(cars, columns = ['Anchor','Categories','Infobox','Text','All'], index=titles)

print(df)
print (df.describe())
print('\nAverage of All column: ', float("{:.2f}".format(df['All'].mean())))

A=[art['annotated_categories'] for art in articles_test]
B=[art['extended_sims_cat'] for art in articles_test]

multi = MultiLabelBinarizer()

y_true = multi.fit(A).transform(A)
y_pred = multi.transform(B)

print('Precision: ',precision_score(y_true, y_pred,average='weighted',zero_division=1))
print('Recall: ',recall_score(y_true, y_pred, average='weighted',zero_division=1))
print('F1:' ,f1_score(y_true, y_pred, average='weighted'))

                                          Anchor  Categories  Infobox   Text  \
Anarchism                                   0.00       25.00     0.00  25.00   
Autism                                      0.00        0.00     0.00   0.00   
Albedo                                      0.00        0.00     0.00   0.00   
A                                           0.00        0.00     0.00   0.00   
Alabama                                     0.00        0.00     0.00   0.00   
Achilles                                    0.00        0.00     0.00   0.00   
Abraham Lincoln                             0.00        0.00     0.00   0.00   
Aristotle                                  28.57        0.00    28.57  14.29   
An American in Paris                        0.00        0.00     0.00   0.00   
Academy Award for Best Production Design    0.00        0.00     0.00   0.00   
Academy Awards                              0.00        0.00     0.00   0.00   
Actrius                                 



### WIKI Infobox gazeteer

#### Anchor text +  Kategorie Wiki + Infobox + Text clanku

In [46]:
with open('../data/test_30_tested.json') as json_file:
    articles_test = json.load(json_file)
    
titles = [art.get('title') for art in articles_test]
anchors = [0] * 30
categories = [0] * 30
infoboxes = [0] * 30
texts = [0] * 30
extended = [0] * 30
counter = 0

for art in articles_test:
    len_annotated = len(art.get('annotated_categories'))

    n_anchors = take(len_annotated, art.get('anchor_sims_info').keys())
    n_anchors.extend(take(len_annotated, art.get('categories_sims_info').keys()))
    n_anchors.extend(take(len_annotated, art.get('infobox_sims_info').keys()))
    n_anchors.extend(take(len_annotated, art.get('text_sims_info').keys()))
    art['extended_sims_info'] = list(set(n_anchors))

    for cat in art.get('annotated_categories'):
        if cat in art.get('anchor_sims_info') and not pd.isna(art.get('anchor_sims_info').get(cat)) and list(art.get('anchor_sims_info')).index(cat) + 1 <= len_annotated:
            anchors[counter] += 1
        if cat in art.get('categories_sims_info') and not pd.isna(art.get('categories_sims_info').get(cat)) and list(art.get('categories_sims_info')).index(cat) + 1 <= len_annotated:
            categories[counter] += 1
        if cat in art.get('infobox_sims_info') and not pd.isna(art.get('infobox_sims_info').get(cat)) and list(art.get('infobox_sims_info')).index(cat) + 1 <= len_annotated:
            infoboxes[counter] += 1
        if cat in art.get('text_sims_info') and not pd.isna(art.get('text_sims_info').get(cat)) and list(art.get('text_sims_info')).index(cat) + 1 <= len_annotated:
            texts[counter] += 1
        if cat in art.get('extended_sims_info'):
            extended[counter] += 1
    anchors[counter] = float("{:.2f}".format(anchors[counter] / len_annotated * 100))
    categories[counter] = float("{:.2f}".format(categories[counter] / len_annotated * 100))
    infoboxes[counter] = float("{:.2f}".format(infoboxes[counter] / len_annotated * 100))
    texts[counter] = float("{:.2f}".format(texts[counter] / len_annotated* 100))
    extended[counter] = float("{:.2f}".format(extended[counter] / len_annotated* 100))

    counter += 1
    
cars = {
    'Anchor': anchors,
    'Categories': categories,
    'Infobox': infoboxes, 
    'Text': texts,
    'All': extended
}

df = pd.DataFrame(cars, columns = ['Anchor','Categories','Infobox','Text','All'], index=titles)

print(df)
print (df.describe())
print('\nAverage of All column: ', float("{:.2f}".format(df['All'].mean())))

A=[art['annotated_categories'] for art in articles_test]
B=[art['extended_sims_info'] for art in articles_test]

multi = MultiLabelBinarizer()

y_true = multi.fit(A).transform(A)
y_pred = multi.transform(B)

print('Precision: ',precision_score(y_true, y_pred,average='weighted',zero_division=1))
print('Recall: ',recall_score(y_true, y_pred, average='weighted',zero_division=1))
print('F1:' ,f1_score(y_true, y_pred, average='weighted'))

                                          Anchor  Categories  Infobox    Text  \
Anarchism                                   0.00         0.0     0.00   50.00   
Autism                                      0.00         0.0    33.33    0.00   
Albedo                                      0.00         0.0     0.00    0.00   
A                                           0.00         0.0    50.00    0.00   
Alabama                                     0.00         0.0     0.00    0.00   
Achilles                                    0.00         0.0     0.00   66.67   
Abraham Lincoln                             0.00         0.0    20.00   20.00   
Aristotle                                   0.00         0.0    57.14   42.86   
An American in Paris                       25.00         0.0     0.00   50.00   
Academy Award for Best Production Design   33.33         0.0    66.67   66.67   
Academy Awards                             33.33         0.0    66.67   66.67   
Actrius                     



### WIKI Text gazeteer

#### Anchor text +  Kategorie Wiki + Infobox + Text clanku

In [144]:
with open('../data/test_30_tested.json') as json_file:
    articles_test = json.load(json_file)
    
titles = [art.get('title') for art in articles_test]
anchors = [0] * 30
categories = [0] * 30
infoboxes = [0] * 30
texts = [0] * 30
extended = [0] * 30
counter = 0

for art in articles_test:
    len_annotated = len(art.get('annotated_categories'))

    n_anchors = take(len_annotated, art.get('anchor_sims_text').keys())
    n_anchors.extend(take(len_annotated, art.get('categories_sims_text').keys()))
    n_anchors.extend(take(len_annotated, art.get('infobox_sims_text').keys()))
    n_anchors.extend(take(len_annotated, art.get('text_sims_text').keys()))
    art['extended_sims_text'] = list(set(n_anchors))

    for cat in art.get('annotated_categories'):
        if cat in art.get('anchor_sims_text') and not pd.isna(art.get('anchor_sims_text').get(cat)) and list(art.get('anchor_sims_text')).index(cat) + 1 <= len_annotated:
            anchors[counter] += 1
        if cat in art.get('categories_sims_text') and not pd.isna(art.get('categories_sims_text').get(cat)) and list(art.get('categories_sims_text')).index(cat) + 1 <= len_annotated:
            categories[counter] += 1
        if cat in art.get('infobox_sims_text') and not pd.isna(art.get('infobox_sims_text').get(cat)) and list(art.get('infobox_sims_text')).index(cat) + 1 <= len_annotated:
            infoboxes[counter] += 1
        if cat in art.get('text_sims_text') and not pd.isna(art.get('text_sims_text').get(cat)) and list(art.get('text_sims_text')).index(cat) + 1 <= len_annotated:
            texts[counter] += 1
        if cat in art.get('extended_sims_text'):
            extended[counter] += 1
    anchors[counter] = float("{:.2f}".format(anchors[counter] / len_annotated * 100))
    categories[counter] = float("{:.2f}".format(categories[counter] / len_annotated * 100))
    infoboxes[counter] = float("{:.2f}".format(infoboxes[counter] / len_annotated * 100))
    texts[counter] = float("{:.2f}".format(texts[counter] / len_annotated* 100))
    extended[counter] = float("{:.2f}".format(extended[counter] / len_annotated* 100))

    counter += 1
    
cars = {
    'Anchor': anchors,
    'Categories': categories,
    'Infobox': infoboxes, 
    'Text': texts,
    'All': extended
}

df = pd.DataFrame(cars, columns = ['Anchor','Categories','Infobox','Text','All'], index=titles)

print(df)
print (df.describe())
print('\nAverage of All column: ', float("{:.2f}".format(df['All'].mean())))

A=[art['annotated_categories'] for art in articles_test]
B=[art['extended_sims_text'] for art in articles_test]

multi = MultiLabelBinarizer()

y_true = multi.fit(A).transform(A)
y_pred = multi.transform(B)

print('Precision: ',precision_score(y_true, y_pred,average='weighted',zero_division=1))
print('Recall: ',recall_score(y_true, y_pred, average='weighted',zero_division=1))
print('F1:' ,f1_score(y_true, y_pred, average='weighted'))

                                          Anchor  Categories  Infobox   Text  \
Anarchism                                   0.00       25.00     0.00   0.00   
Autism                                      0.00        0.00     0.00   0.00   
Albedo                                      0.00        0.00     0.00   0.00   
A                                           0.00        0.00     0.00   0.00   
Alabama                                     0.00        0.00     0.00   0.00   
Achilles                                    0.00        0.00     0.00   0.00   
Abraham Lincoln                            20.00       40.00     0.00   0.00   
Aristotle                                  14.29       14.29    14.29  14.29   
An American in Paris                        0.00        0.00     0.00  25.00   
Academy Award for Best Production Design    0.00        0.00    33.33  33.33   
Academy Awards                              0.00        0.00     0.00   0.00   
Actrius                                 



In [157]:
print(articles_test[0].get('extended_sims'))

None


## Invertovany index

In [150]:
from collections import defaultdict

class invertedIndex(object):

    def __init__(self,docs,method):
        self.docSets = defaultdict(set)
        for doc in docs:
            index = doc.get('title')
            t = [preprocess_text(a) for a in doc.get(method)]
            for term in [item for sublist in t for item in sublist]:
                self.docSets[term].add(index)
        #print(self.docSets)
        
    def search(self, term, andor):
        pole=set()            
        for a in preprocess_text(term):
            #print(self.docSets[a])
            if andor == 'and':
                if len(pole) == 0:
                    pole = self.docSets[a]
                else:
                    pole = pole.intersection(self.docSets[a])
            elif andor == 'or':
                pole = pole.union(self.docSets[a])
        return pole

### Inverted index - hladanie podla textu

In [161]:
i=invertedIndex(articles_test, 'text_tokens')
#print(i)

print(i.search("art film", "and"))

{'Achilles', 'Alien', 'Academy Awards', 'Animation', 'Alchemy', 'Ayn Rand', 'Alabama', 'Academy Award for Best Production Design', 'Algeria', 'Aristotle', 'An American in Paris', 'Anthropology'}


### Invertovany index - hladanie podla kategorie (infobox)

In [162]:
j=invertedIndex(articles_test, 'extended_sims_info')
#print(j)

print(j.search("art film", "and"))

{'Alien', 'Academy Awards', 'Animation', 'Actrius', 'Academy Award for Best Production Design'}


### Invertovany index - hladanie podla kategorie (text)

In [159]:
# j=invertedIndex(articles_test, 'extended_sims')
#print(j)

print(j.search("art film", "and"))

{'Academy Award for Best Production Design'}


## Presna zhoda

### 6. Z tela článku vyhľadať najčastejšie používané termy a tie, ktoré boli identifikované v kroku 2

Find exact match words or expressions with categorised words

In [None]:
Find exact match words or expressions with categorised words

def find_exact_match(articles, categories):
    for article in articles:
        article['categories_exact_text'] = []
        article['categories_exact_anchors'] = []
        article['categories_exact_infobox'] = []
        for category in categories:
            related_words = category.get('related_words')
            found_text = []
            found_anchors = []
            found_infobox = []
            found_text = list(filter(lambda word: re.findall(rf'\W+({word})\W+', article['text'], re.IGNORECASE), related_words))
            found_anchors = list(filter(lambda word: re.findall(rf'\W+({word})\W+', str(article['anchors']).strip('[]'), re.IGNORECASE), related_words))
            found_infobox = list(filter(lambda word: re.findall(rf'\W+({word})\W+', str(article['infobox']).strip('[]'), re.IGNORECASE), related_words))
            if found_text:
                article['categories_exact_text'].append({'category':category.get('category'),'related_words':found_text})
            if found_anchors:
                article['categories_exact_anchors'].append({'category':category.get('category'),'related_words':found_anchors})
            if found_infobox:
                article['categories_exact_infobox'].append({'category':category.get('category'),'related_words':found_infobox})
    return articles

In [None]:
def save_articles(articles, file_name):
    with open(f'../data/{file_name}.json', 'w') as outfile:
        json.dump(articles, outfile, indent=4)

In [None]:
exact_match = find_exact_match(articles, cats_with_words)
save_articles(exact_match, 'wiki_100_exact_match')

### Vyskusat PySpark

In [None]:
import findspark
findspark.init()
findspark.find()
import pyspark
findspark.find()

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('SparkApp').setMaster("local")
sc = pyspark.SparkContext(conf = conf)
spark = SparkSession(sc)

In [None]:
tic = time.perf_counter()
numeric_val = sc.parallelize(articles)
square_udf_int = udf(lambda z: remove_stop_words(z))
#numeric_val.map(lambda x: remove_stop_words(x)).collect()
toc = time.perf_counter()
print(f"Performed in {toc - tic:0.4f} seconds")

In [None]:
tic = time.perf_counter()
numeric_val.map(lambda x: square_udf_int(x)).collect()
toc = time.perf_counter()
print(f"Performed in {toc - tic:0.4f} seconds")

In [None]:
def square(x):
    return x**2

In [None]:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
square_udf_int = udf(lambda z: square(z), IntegerType())

In [None]:
square_udf_int([1,2,3])