## Clustering Wikipedia articles into categories based on their scientific and social field

**Author:** Tomáš Babjak

**Course:** Information Retrieval

**GitHub:** https://github.com/tomasbabjak/VINF_Wikipedia

Imports

In [None]:
import regex
import re
import datamuse
import nltk
import json

### 1. Create a test data sample on which the project will first be developed

Read XML file with Wiki articles and parse articles to list:

In [None]:
def read_xml(file_name, n_first_articles):
    
    start_tag = '<page>'
    end_tag = '</page>'
    
    start_found = False
    articles_found = []
    lines = ''
    
    with open(file_name, encoding="utf8") as f:
        for line in f:
            if start_tag in line:
                start_found = True
            if start_found:
                lines += line
            if end_tag in line:
                start_found = False
                articles_found.append(lines)
                lines = ''
            # TODO: filter out <title> and <text>, possibly also <id> for indexing
            if len(articles_found) == n_first_articles:
                break
    with open(f'../data/wiki_{n_first_articles}_before.json', 'w') as outfile:
        json.dump(articles_found, outfile)
    return articles_found
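
The same line-scanning idea can also be written as a generator, so that pages are produced one at a time instead of being collected in a list first. The sketch below is only an illustration: `iter_pages` and `sample_lines` are hypothetical names that do not appear in the project, and the input is a short in-memory list rather than the real dump.

```python
from itertools import islice

def iter_pages(lines):
    """Yield the raw text of each <page>...</page> block from an iterable of lines."""
    buffer = []
    inside = False
    for line in lines:
        if '<page>' in line:
            inside = True
        if inside:
            buffer.append(line)
        if inside and '</page>' in line:
            inside = False
            yield ''.join(buffer)
            buffer = []

# Hypothetical miniature dump used only for illustration
sample_lines = [
    "<mediawiki>\n",
    "  <page>\n",
    "    <title>A</title>\n",
    "  </page>\n",
    "  <page>\n",
    "    <title>B</title>\n",
    "  </page>\n",
    "</mediawiki>\n",
]

# islice stops the scan after the first page, much like n_first_articles does
first_page = list(islice(iter_pages(sample_lines), 1))
all_pages = list(iter_pages(sample_lines))
print(len(first_page), len(all_pages))
```

Because an open file is also an iterable of lines, the same function could in principle be fed the file handle directly.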

Extract Title and Text attributes from article and create dictionary from them:

In [None]:
def extract_text(text):
    title_regex = r'<title[^>]*>([^<]+)</title>'
    text_regex = r'<text[^>]*>([^<]+)</text>'
    pages = []
    for page in text:
        title = regex.findall(title_regex, page)
        # Use a separate name so the `text` parameter is not shadowed inside the loop
        body = regex.findall(text_regex, page)
        pages.append({"title": title[0] if title else '',
                      "text": body[0] if body else ''})
    return pages
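
To see what the two patterns capture, they can be tried on a single shortened page. This is a minimal sketch using the standard `re` module in place of `regex` (both accept these patterns); `sample_page` is a made-up fragment, not real dump content.

```python
import re

# Hypothetical, heavily shortened <page> element used only for illustration
sample_page = (
    "<page>\n"
    "  <title>Anarchism</title>\n"
    "  <id>12</id>\n"
    "  <revision>\n"
    '    <text bytes="42" xml:space="preserve">{{Infobox x}} Body with [[Politics|political]] links.</text>\n'
    "  </revision>\n"
    "</page>\n"
)

title_regex = r'<title[^>]*>([^<]+)</title>'
text_regex = r'<text[^>]*>([^<]+)</text>'

title = re.findall(title_regex, sample_page)
body = re.findall(text_regex, sample_page)

# Same fallback to an empty string as in extract_text
record = {"title": title[0] if title else '', "text": body[0] if body else ''}
print(record["title"])
```

The `[^<]+` part stops at the first `<`. This works for article text because the XML dump stores literal `<` characters as `&lt;`.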

### 4. Extract important terms from the test-set articles, focusing on the Infobox, which holds key information about the article

### 5. Find links to other Wikipedia articles (anchor text) that may point directly to a field or at least clarify the article's context

Find the Infobox and anchor texts in the Text attribute of each article and add them to the dictionary:

In [None]:
def find_infobox_anchor(text):
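    # Brace-balanced block starting at "{Infobox "; the recursive (?1) needs the `regex` module, not `re`
    regex_infobox = r"(?=\{Infobox )(\{([^{}]|(?1))*\})"
    # Piped wiki links [[target|label]]; links whose target or label contains ':' are skipped
    regex_anchor = r"\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]"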
    for page in text:
        page['infobox'] = regex.findall(regex_infobox, page['text'])
        page['anchors'] = regex.findall(regex_anchor, page['text'])
    return text
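
The anchor pattern is easier to follow on a small example. The sketch below applies it with the standard `re` module to a made-up sentence (`sample` is hypothetical). The infobox pattern is left out because it relies on recursion, which only the third-party `regex` module supports.

```python
import re

regex_anchor = r"\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]"

# Hypothetical wikitext: one piped link, one plain link, one file link
sample = "See [[Political philosophy|philosophy]], [[Anarchy]] and [[File:Flag.svg|thumb]]."

# Each match is a (target, label) tuple
anchors = re.findall(regex_anchor, sample)
print(anchors)
```

Only the piped link matches. `[[Anarchy]]` has no `|`, and the `File:` link is rejected because `:` is excluded from both groups, which also filters out namespaced links such as categories.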

Separate Redirect articles from others into two separate lists

In [None]:
def find_redirect(text):
    regex_redirect = r"^#REDIRECT[^\[]*\[\[([^\]]+)"
    redirect_pages = []
    article_pages = []
    for page in text:
        if regex.findall(regex_redirect, page['text']):
            redirect_pages.append(page)
        else:
            article_pages.append(page)
    return (redirect_pages, article_pages)
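
As an illustration of what the redirect pattern captures, the sketch below runs it over two made-up page dictionaries (`pages` is hypothetical) and records the redirect target of each match. It uses the standard `re` module, which accepts the same pattern.

```python
import re

regex_redirect = r"^#REDIRECT[^\[]*\[\[([^\]]+)"

# Hypothetical pages in the shape produced by extract_text
pages = [
    {"title": "AccessibleComputing",
     "text": "#REDIRECT [[Computer accessibility]]\n\n{{R from move}}"},
    {"title": "Anarchism",
     "text": "{{short description|Political philosophy}}\nAnarchism is a political philosophy."},
]

redirect_targets = {}
for page in pages:
    match = re.findall(regex_redirect, page["text"])
    if match:
        # The captured group is the title the redirect points to
        redirect_targets[page["title"]] = match[0]

print(redirect_targets)
```

The pattern is anchored at the start of the text and is case-sensitive, so a page that begins with lowercase `#redirect` would be treated as a regular article.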

Run the whole pipeline on the first 100 articles:

In [None]:
redirects, articles = find_redirect(find_infobox_anchor(extract_text(read_xml('../data/enwiki-latest-pages-articles.xml', 100))))

### 2. Create a list (tree) of scientific and social fields into which the individual pages will be classified, and for each field find the words associated with it

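Find terms related to our categories with the Datamuse library. Split each category name into keywords and request up to 11 related terms per keyword for two-keyword categories, or up to 7 per keyword otherwise, which gives roughly 21 to 22 API terms per category. The lowercased keywords themselves are added to the list as well.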

In [None]:
api = datamuse.Datamuse()

def categories_find_related():
    
    categories = [
        'Culture, literature and the arts',
        'Geography - places and states',
        'Medicine - health and fitness',
        'History and events',
        'Mathematics and logic',
        'Nature and physics',
        'Technology and computing',
        'Philosophy and thinking',
        'Religion and belief',
        'Society, politics and people'
    ]
    cats_with_words = []
    
    for c in categories:
        keywords = regex.split(' - |, | and ',c)
        num = 11 if len(keywords) == 2 else 7
        api_words = []
        for word in keywords:
            api_words.extend(api.words(ml=word, max=num))
        result = list(map(lambda x: x.get('word'), api_words))
        result.extend(list(map(lambda x: x.lower() ,keywords)))
        cats_with_words.append({'category':c,'related_words':result})
    return cats_with_words

In [None]:
cats_with_words = categories_find_related()
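
The keyword split and the per-keyword limit can be checked without calling the Datamuse API. The helper `split_category` below is a hypothetical name introduced only for this sketch; it repeats the two lines from `categories_find_related` using the standard `re` module.

```python
import re

def split_category(category):
    """Return the keywords of a category name and the per-keyword term limit."""
    keywords = re.split(' - |, | and ', category)
    per_keyword = 11 if len(keywords) == 2 else 7
    return keywords, per_keyword

three = split_category('Culture, literature and the arts')
two = split_category('History and events')
print(three)
print(two)
```

Two keywords at 11 terms each give at most 22 terms, and three keywords at 7 terms each give at most 21, so the categories end up with related-word lists of similar size.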

### 3. Preprocess the articles appropriately: stemming, tokenization, stop-word removal

In [None]:
def preprocess_articles(articles):
    for article in articles:
        nltk_tokens = nltk.word_tokenize(article.get('text'))
        print(nltk_tokens)

In [None]:
text = '{short description|Political philosophy and movement}}\n{{redirect2|Anarchist|Anarchists|other uses|'
nltk_tokens = nltk.word_tokenize(text)
grams_2 = nltk.ngrams('This is my super text.'.split(), 2)

print(nltk_tokens)

In [None]:
# custom word tokenizer
def tokzr_WORD(txt): 
    return ('WORD', re.findall(r'(?ms)\W*(\w+)', txt))
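
The cells above only tokenize. A minimal sketch of how stop-word removal and stemming could be chained after tokenization is shown below. It uses only the standard library: `STOP_WORDS` is a tiny hand-written list and `naive_stem` is a crude suffix stripper. Both are assumptions standing in for a full stop-word list and a real stemmer such as NLTK's `PorterStemmer`, and neither is part of the project.

```python
import re

# Tiny illustrative stop list (assumption; a real run would use a complete list)
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "that", "it"}

# Crude suffix list (assumption; stands in for a real stemming algorithm)
SUFFIXES = ("ing", "ers", "er", "ies", "es", "ed", "s")

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize on word characters, drop stop words, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

result = preprocess("Anarchism is a political philosophy that rejects rulers")
print(result)
```

The order matters: stop words are removed before stemming, so that the stop list is compared against unmodified tokens.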

### 6. Find the most frequently used terms in the article body, along with the terms identified in step 2

Find exact matches of the category-related words and expressions in each article's text, anchors, and infobox:

In [None]:
def find_exact_match(articles, categories):
    for article in articles:
        article['categories_exact_text'] = []
        article['categories_exact_anchors'] = []
        article['categories_exact_infobox'] = []
        # Flatten the anchor and infobox match lists to strings once per article
        anchors_str = str(article['anchors']).strip('[]')
        infobox_str = str(article['infobox']).strip('[]')
        for category in categories:
            related_words = category.get('related_words')

            def found_in(haystack):
                # re.escape keeps related words with regex metacharacters from altering the pattern
                return [word for word in related_words
                        if re.search(rf'\W+({re.escape(word)})\W+', haystack, re.IGNORECASE)]

            found_text = found_in(article['text'])
            found_anchors = found_in(anchors_str)
            found_infobox = found_in(infobox_str)
            if found_text:
                article['categories_exact_text'].append({'category':category.get('category'),'related_words':found_text})
            if found_anchors:
                article['categories_exact_anchors'].append({'category':category.get('category'),'related_words':found_anchors})
            if found_infobox:
                article['categories_exact_infobox'].append({'category':category.get('category'),'related_words':found_infobox})
    return articles

In [None]:
def save_articles(articles, file_name):
    with open(f'../data/{file_name}.json', 'w') as outfile:
        json.dump(articles, outfile, indent=4)

In [None]:
exact_match = find_exact_match(articles, cats_with_words)
save_articles(exact_match, 'wiki_100_exact_match')
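
Step 6 also mentions the most frequently used terms, which the exact-match search above does not compute. One possible way to obtain them is sketched below with `collections.Counter`. The function name `most_frequent_terms`, the `min_length` filter, and the sample sentence are all assumptions made for this illustration, not part of the project.

```python
import re
from collections import Counter

def most_frequent_terms(text, n=5, min_length=4):
    """Return the n most common lowercase alphabetic tokens of at least min_length characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if len(t) >= min_length)
    return counts.most_common(n)

# Hypothetical article body used only for illustration
sample = "Physics studies matter and energy. Energy and matter interact; physics models energy."
top_terms = most_frequent_terms(sample, n=3)
print(top_terms)
```

The length filter is a rough substitute for stop-word removal. In a fuller version the tokens would first go through the preprocessing from step 3.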