# Creating a dictionary of pairs for Named Entity Recognition: Wiki page - type

The project is currently divided into two parts.

1. The first part downloads the files needed for processing in the project (the Wikipedia dump).
2. The second part parses the files and assigns a category to each article.

## Part 1: Downloading Wikipedia articles


In [1]:
import requests
from bs4 import BeautifulSoup
import os
import re
from functools import reduce

Download the data from the Wikipedia dump page and filter all files whose names contain "pages-articles".

In [4]:
base_url = 'https://dumps.wikimedia.org/enwiki/20201001/'
base_html = requests.get(base_url).text
base_html[:15]

'<!DOCTYPE html '

In [5]:
soup_dump = BeautifulSoup(base_html, 'html.parser')
soup_dump.find_all('li', {'class': 'file'}, limit = 10)[0]

<li class="file"><a href="/enwiki/20201001/enwiki-20201001-pages-articles-multistream.xml.bz2">enwiki-20201001-pages-articles-multistream.xml.bz2</a> 17.5 GB</li>

In [6]:
files = []
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))
files[:5]

[('enwiki-20201001-pages-articles-multistream.xml.bz2', ['17.5', 'GB']),
 ('enwiki-20201001-pages-articles-multistream-index.txt.bz2', ['215.8', 'MB']),
 ('enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2',
  ['231.7', 'MB']),
 ('enwiki-20201001-pages-articles-multistream-index1.txt-p1p41242.bz2',
  ['222', 'KB']),
 ('enwiki-20201001-pages-articles-multistream2.xml-p41243p151573.bz2',
  ['313.2', 'MB'])]

In [7]:
# raw string and an escaped dot, so that '.xml' is matched literally
files_to_download = [file[0] for file in files if re.search(r'pages-articles\d{1,2}\.xml-p', file[0])]
files_to_download[:5]

['enwiki-20201001-pages-articles1.xml-p1p41242.bz2',
 'enwiki-20201001-pages-articles2.xml-p41243p151573.bz2',
 'enwiki-20201001-pages-articles3.xml-p151574p311329.bz2',
 'enwiki-20201001-pages-articles4.xml-p311330p558391.bz2',
 'enwiki-20201001-pages-articles5.xml-p558392p958045.bz2']

The keras library is used to download these files (the dataset). Only files that have not been downloaded yet are fetched.

In [70]:
import sys
from keras.utils import get_file
directory = '/home/xminarikd/.keras/datasets/'

In [74]:
data_paths = []
file_info = []

for file in files_to_download:
    path = directory + file
    
    if not os.path.exists(directory):
        print('Dataset directory does not exist')
    # download only when the file does not exist yet
    if not os.path.exists(directory + file):
        print('Downloading')
        data_paths.append(get_file(file, base_url + file))
        file_size = os.stat(path).st_size / 1e6
        
        # Find the number of articles
        file_articles = int(file.split('p')[-1].split('.')[-2]) - int(file.split('p')[-2])
        file_info.append((file, file_size, file_articles))
        
    # the file already exists
    else:
        data_paths.append(path)
        file_size = os.stat(path).st_size / 1e6
        
        file_number = int(file.split('p')[-1].split('.')[-2]) - int(file.split('p')[-2])
        file_info.append((file.split('-')[-1], file_size, file_number))

Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles1.xml-p1p41242.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles2.xml-p41243p151573.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles3.xml-p151574p311329.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles4.xml-p311330p558391.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles5.xml-p558392p958045.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles6.xml-p958046p1483661.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles7.xml-p1483662p2134111.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/2020
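
The article count in the download loop is derived from the page-ID range encoded in each file name (`p<first>p<last>`). A minimal sketch of the same idea with a regular expression; the helper name `page_id_range` is hypothetical and not part of the notebook:

```python
import re

def page_id_range(file_name):
    """Return (first_id, last_id) parsed from a dump partition file name."""
    match = re.search(r'-p(\d+)p(\d+)\.bz2$', file_name)
    if match is None:
        raise ValueError(f'no page-id range in {file_name!r}')
    return int(match.group(1)), int(match.group(2))

first, last = page_id_range('enwiki-20201001-pages-articles1.xml-p1p41242.bz2')
print(first, last, last - first)
```

The difference is the width of the ID range, so it only approximates the number of articles in a partition.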

## Part 2: Parsing data

Parsing runs sequentially over all files while they are still compressed. The "bzcat" subprocess reads each file and delivers it line by line. The data is processed with an XML SAX parser. The parser is given a ContentHandler class, which keeps the incoming lines in a buffer while looking for the tags page, title, and text. When the closing page tag is found, the whole article is processed.
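
The incremental SAX approach can be tried on a tiny in-memory sample before running it on a real dump. This is a simplified sketch: the class name `PageCollector` and the sample lines are made up for the illustration, and the lines are fed one by one the way the bzcat output would be.

```python
import xml.sax

class PageCollector(xml.sax.handler.ContentHandler):
    """Collects (title, text) pairs; a simplified stand-in for the real handler."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._tag = None   # tag whose characters are currently buffered
        self._buf = []
        self._parts = {}

    def startElement(self, name, attrs):
        if name == 'page':
            self._parts = {}
        if name in ('title', 'text'):
            self._tag = name
            self._buf = []

    def characters(self, content):
        if self._tag:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._tag:
            self._parts[name] = ''.join(self._buf)
            self._tag = None
        if name == 'page':
            # the whole article is available here
            self.pages.append((self._parts.get('title'), self._parts.get('text')))

# Hypothetical sample, split into lines like a decompressed dump
sample = [
    b'<mediawiki>\n',
    b'<page><title>Alpha</title>\n',
    b'<text>{{Infobox person}}</text></page>\n',
    b'<page><title>Beta</title>\n',
    b'<text>#REDIRECT [[Alpha]]</text></page>\n',
    b'</mediawiki>\n',
]

handler = PageCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
for line in sample:
    parser.feed(line)   # the parser keeps its state between calls
parser.close()

print(handler.pages)
```

Because `feed` accepts partial input, the dump never has to be decompressed to disk or held in memory as a whole.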

The following information is extracted from each article with regular expressions:
* **infobox**
    * infobox attributes
    * infobox type
* **article categories**

The category of the article is then determined from this information.
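
The extraction step can be sketched with the standard library only. Note the assumptions: this sketch finds the infobox by counting braces instead of using the recursive pattern from the third-party `regex` module, and the function names and the sample wikitext are hypothetical.

```python
import re

def extract_infobox(text):
    """Return the first {{Infobox ...}} block using brace counting, or None."""
    start = text.find('{{Infobox')
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == '{{':
            depth += 1
            i += 2
        elif pair == '}}':
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None  # unbalanced braces

def article_features(text):
    """Collect infobox type, infobox parameter names and article categories."""
    infobox = extract_infobox(text)
    info = {'type': None, 'parameters': [], 'categories': []}
    if infobox is not None:
        type_match = re.match(r'\{\{Infobox\s+([^|\n<]+)', infobox)
        if type_match:
            info['type'] = type_match.group(1).strip().lower()
        info['parameters'] = re.findall(r'^\s*\|\s*(\w+)\s*=', infobox, flags=re.MULTILINE)
    info['categories'] = re.findall(r'\[\[Category:([^\]|]+)', text)
    return info

sample = """{{Infobox person
| name = Ada Lovelace
| birth_date = {{birth date|1815|12|10}}
}}
'''Ada''' was a mathematician.
[[Category:1815 births]]
[[Category:English mathematicians]]"""

features = article_features(sample)
print(features)
```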

In [1]:
import subprocess
import xml.sax
import regex
import pandas as pd
from functools import reduce
import requests
from bs4 import BeautifulSoup
import re
import csv
import json
import os
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
import gc
from nltk.util import ngrams
import ipywidgets as widgets
from ipywidgets import interact
from itertools import chain, islice 
#nltk.download('punkt')
#nltk.download('stopwords')

Currently the following categories are assigned: Person, Company, Organisation, Place.
The rules are checked in the order below, and the first rule that matches decides the category:
* **infobox type** - whether the article's infobox type appears in the list for the given category
* **infobox attributes:**
    * **person** - birth_date
    * **company** - industry, trade_name, products, brands
    * **organisation** - none so far
    * **place** - coordinates, locations, and the infobox _does not contain_ date, founded, founder, founders
* **article categories:**
    * **organisation** - the categories contain the word organisation(s)
* **article text** - not used yet, but planned for cases where the article has no infobox and the categories provide no information
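
The rule order amounts to a chain in which the first rule that returns a value wins. A minimal sketch with simplified, hypothetical rules; the infobox type table here is made up for the example, while the notebook loads the real lists from Wikipedia:

```python
def by_infobox_type(info):
    # hypothetical, tiny type table used only for this illustration
    types = {'person': 'Person', 'company': 'Company', 'settlement': 'Location'}
    return types.get(info.get('type'))

def by_attributes(info):
    params = set(info.get('parameters', []))
    if 'birth_date' in params:
        return 'Person'
    if params & {'industry', 'trade_name', 'products', 'brands'}:
        return 'Company'
    if params & {'coordinates', 'locations'} and not params & {'date', 'founded', 'founder', 'founders'}:
        return 'Location'
    return None

def by_categories(info):
    if any('organisation' in c.lower() for c in info.get('categories', [])):
        return 'Organization'
    return None

def predict(info, rules=(by_infobox_type, by_attributes, by_categories), default='Other'):
    """Apply the rules in priority order and return the first non-None result."""
    for rule in rules:
        result = rule(info)
        if result is not None:
            return result
    return default

print(predict({'type': 'company', 'parameters': ['birth_date']}))
print(predict({'type': 'unknown', 'parameters': ['coordinates', 'founded']}))
```

In the first call the infobox type decides before the attributes are even checked, which is the point of ordering the rules by reliability.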

In [2]:
#Get Infobox and Infobox type from article text
def ArticleHandler(infobox_types=None, evaluation=None):
    #source:(stackof) https://regex101.com/r/kT1jF4/1
    infobox_regex = r'(?=\{Infobox )(\{([^{}]|(?1))*\})'
    inf_type_regex = r'(?<=Infobox)(.*?)(?=(\||\n|<!-|<--))'
    #https://regex101.com/r/1vJlms/1
    inf_parameters = r'(?(?<=\|)|(?<=\|\s))(\w*)\s*=\s*[\w{\[]'
    #https://regex101.com/r/fl5hAw/1 https://regex101.com/r/Xj0fM3/1
    redirect_title = r'(?<=\[\[)(.*)(?=\]\])'
    categories = r'(?<=\[\[Category:)([^\]]*)(?=\]\])'
    testing = evaluation
    
    Person=['player', 'male', 'actor', 'sportspeopl', 'medalist', 'actress', 'expatri', 'singer', 'musician', 'live', 'writer', 'politician', 'f.c', 'alumni', 'personnel', 'olymp', '20th-centuri', 'faculti', 'coach', 'guitarist']
    Company=['brand', 'merger', 'retail', 'exchang', 'stock', 'label', 'nasdaq', 'multin', 'subsidiari', 'acquisit', 'onlin', 'offer', 'held', 'conglomer', 'store', 'bankruptci']
    Organisation=['scout', 'think', 'non-profit', 'gang', 'event', 'recur', 'religi', 'child-rel', 'non-align', 'non-government', 'critic', 'evangel', 'yakuza', 'advocaci']
    Location=['regist', 'unincorpor', 'station', 'popul', 'complet', 'aerodrom', 'villag', 'town', 'landform', 'parish', 'river', 'seaplan', 'open', 'census-design', 'mountain', 'attract', 'neighbourhood', 'suburb', 'rang', 'airport']
    
    PersonBi=['living peopl', '20th century', 'f c', 'c play', 'century american', '21st century', 'american male', 'association football', 'league play', 'expatriate footballers', 'expatriate sportspeople', 'cup play', 'international footbal', 'rugby league', 'university alumni', 'fc play', 'musical groups', 'ice hockey', 'world cup', 'american people', 'fifa world', 'male actors', 'football league', 'male actor', 'expatriate footbal', 'military personnel', 'people educated', 'hockey players', 'male writ', 'records artist', 'draft pick', 'century indian', 'football manag', 'male television', 'film actor', 'uk mps', 'male film', 'soccer play', 'television actor', 'united f', 'year birth', 'living people', 'winter olymp', 'birth missing', 'football midfield', 'missing living', 'major league', 'school alumni', '19th century', 'new zealand']
    CompanyBi=['companies established', 'companies based', 'companies united', 'services companies', 'financial services', 'mergers acquisit', 'american companies', 'stock exchanges', 'video game', 'game companies', 'manufacturing companies', 'chains united', 'companies listed', 'exchanges africa', 'restaurants established', 'stock exchang', 'companies disestablished', 'mass media', 'media companies', 'restaurant chains', 'retail companies', 'companies filed', 'defunct companies', 'filed chapter', 'internet properties', 'properties established', 'retailers united', '11 bankruptcy', 'chapter 11', 'companies canada', 'companies formerly', 'established 1960', 'established 1995', 'fast food', 'formerly listed', 'listed new', 'manufacturers united', 'york stock', 'established 1989', 'establishments california', 'based austin', 'british companies', 'clothing companies', 'companies england', 'companies isle', 'development compani', 'established 1950', 'established 1974', 'established 2003', 'food chains']
    OrganisationBi=['based united', 'non profit', 'learned societies', 'profit organizations', 'organizations established', 'organisations based', 'organizations based', '3 organ', '501 c', 'associations based', 'c 3', 'charities based', 'consultative status', 'established 1946', 'professional associations', 'psychology organizations', 'relief organ', 'societies canada', 'status united', 'english football', 'establishments united', 'youth organizations', 'united nations', '1845 establishments', '1859 establishments', '1864 establishments', '1907 establishments', '1908 establishments', '1959 establishments', '1982 establishments', '1996 establishments', '19th centuri', 'academy financial', 'advocacy organ', 'aid organ', 'air ambulance', 'ambulance servic', 'ambulance services', 'american council', 'american organized', 'ancient near', 'awards h', 'banks texa', 'bar associ', 'based geneva', 'based hong', 'based montr', 'based surrey', 'based switzerland', 'based tyne']
    LocationBi=['pyrénées atlantiqu', 'communes pyrénées', 'articles needing', 'atlantiques communes', 'communes articles', 'french wikipedia', 'needing translation', 'pyrénées atlantiques', 'translation french', 'lower navarr', 'populated places', 'register historic', 'national register', 'historic places', 'towns luxembourg', 'unincorporated communities', 'civil parishes', 'buildings structures', 'neighborhoods pittsburgh', 'sports venues', 'western australia', 'parishes leicestershir', 'villages leicestershir', 'cities towns', 'protected areas', 'borough charnwood', 'places established', 'suburbs perth', 'west virginia', 'tourist attractions', 'buildings completed', 'city rockingham', 'suburbs city', 'shopping malls', 'new jersey', 'county west', 'historic house', 'house museums', 'venues completed', 'county california', 'county virginia', 'alzette canton', 'communes esch', 'county massachusett', 'esch sur', 'former communes', 'mountains hills', 'rhode island', 'road bridges', 'sur alzette']
    
    
    # infobox_types = getInfoboxTypesList()
    
    def filterArticles(title):
        if regex.search("^Category:|^Template:|^File:", title):
            return True
        return False
    
    def getCategories(text):
        return regex.findall(categories, text)
    
    
    def getArticleAtributes(infobox,text):
        i_par = regex.findall(inf_parameters, infobox)
        i_type = regex.search(inf_type_regex, infobox)
        i_type = i_type.group(0).strip() if i_type is not None else "none"
        return {'type': i_type.lower(), 'parameters': i_par, 'categories': list(getCategories(text))}
    
    
    def remove_stop_words(data):
        stopwords = nltk.corpus.stopwords.words('english')
        return [w for w in data if w not in stopwords]


    def tokenize(data):
        symbols = "!\"#$%&()*+'-./:;,|<=>?@[\\]^_`{}~\n"
        tokens = word_tokenize(data)
        tokens = [token.lower() for token in tokens if token not in list(symbols)]
        return tokens


    def stemming(data):
        stemmer= PorterStemmer()
        return [stemmer.stem(token) for token in data]
        
    def processCategories(data):
        data = tokenize(data)
        data = remove_stop_words(data)
        data = stemming(data)
        return data
    
    
    def get_bigrams(text):
        bigrams = []
        for sen in text:
            token = nltk.word_tokenize(sen)
            bigrams.append(list(map(lambda x: ' '.join(x),list(ngrams(token,2)))))
        return bigrams


    stopwords = nltk.corpus.stopwords.words('english')
    def process_whole_sentence(text):
        sen = ' '.join(w for w in text.split() if w not in stopwords)
        sen = re.sub(r'\W', ' ', str(sen))
        sen = re.sub(r'\s+', ' ', sen, flags=re.I)
        sen = sen.lower()
        return sen
        
        
    def isRedirect(text):
        return regex.search(r"^#redirect\s*\[\[", text, flags=regex.IGNORECASE)
        
        
    def getInfobox(text):
        infobox = regex.search(infobox_regex, text)
        return infobox.group() if infobox is not None else "redirect" if isRedirect(text) is not None else "no infobox/redirect"
    
    
    def categoryBy_infoboxType(info):
        if info['type'] in infobox_types['person']:
            return 'Person'
        elif info['type'] in infobox_types['company']:
            return 'Company'
        elif info['type'] in infobox_types['org']:
            return 'Organization'
        elif info['type'] in infobox_types['location']:
            return 'Location'
        else:
            return None
  

    def anotherCategoryBy_infoboxType(info):
        if info['type'] in infobox_types['other']:
            return 'Another'
        else:
            return None

        
    def categoryBy_atributes(info):
        if 'birth_date' in info['parameters']:
            return "A_Person"
        elif any(i in info['parameters'] for i in ['industry', 'trade_name', 'products', 'brands']):
            return 'A_Company'
        elif any(i in info['parameters'] for i in ['coordinates', 'locations']) and not(any(i in info['parameters'] for i in ['date', 'founded', 'founder', 'founders'])):
            return 'A_Location'
        else:
            return None
        
        
    def categoryBy_categories(info):
        stemmed_categories = reduce(lambda x,y: x+y,map(lambda x: processCategories(x), info['categories']),[])
        bigramCategories = sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x), info['categories']))),[])
        
#         if any(i in PersonBi for i in bigramCategories) or any(i in Person for i in stemmed_categories):
#             return 'C_Person'
#         elif any(i in CompanyBi for i in bigramCategories) or any(i in Company for i in stemmed_categories):
#             return 'C_Company'
#         elif any(i in OrganisationBi for i in bigramCategories) or any(i in Organisation for i in stemmed_categories):
#             return 'C_Organization'
#         elif any(i in LocationBi for i in bigramCategories) or any(i in Location for i in stemmed_categories):
#             return 'C_Location'
        
#         if any(i in Person for i in stemmed_categories):
#             return 'C_Person'
#         elif any(i in Company for i in stemmed_categories):
#             return 'C_Company'
#         elif any(i in Organisation for i in stemmed_categories):
#             return 'C_Organization'
#         elif any(i in Location for i in stemmed_categories):
#             return 'C_Location'

        if any(i in PersonBi for i in bigramCategories):
            return 'C_Person'
        elif any(i in CompanyBi for i in bigramCategories):
            return 'C_Company'
        elif any(i in OrganisationBi for i in bigramCategories):
            return 'C_Organization'
        elif any(i in LocationBi for i in bigramCategories):
            return 'C_Location'
        
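        # raw strings are required here: in a normal string '\b' is a backspace, not a word boundary
        elif list(filter(lambda x: regex.search(r'^\d*\sbirths*', x, flags=regex.IGNORECASE), info['categories'])):
            return 'C_Person'
        elif list(filter(lambda x: regex.search(r'\b(compan(y|ies))\b', x, flags=regex.IGNORECASE), info['categories'])):
            return 'C_Company'
        elif list(filter(lambda x: regex.search(r'(organisations*)', x, flags=regex.IGNORECASE), info['categories'])):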
            return 'C_Organization'
        else:
            return None
    
    
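    def first_true(iterable,data=None, default='Other'):
        # evaluate each rule only once and return the first result that is not None
        results = (item(data) for item in iterable)
        return next((result for result in results if result is not None), default)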
    
    
    def predictCategory(infobox, info):
        if infobox not in ['redirect', 'no infobox/redirect']:
            if not(testing):
                return first_true([categoryBy_infoboxType,categoryBy_atributes, anotherCategoryBy_infoboxType, categoryBy_categories], info)
            else:
                if categoryBy_infoboxType(info) is not None or anotherCategoryBy_infoboxType(info) is not None: 
                    return first_true([categoryBy_atributes,categoryBy_categories], info)
                else:
                    return None
            # these articles only have categories
        elif infobox == 'no infobox/redirect':
            return first_true([categoryBy_categories], info,"Other/None")
        else:
            return 'redirect::'+info

    
    def processArticle(title, text):
        infobox = getInfobox(text)
        
        if filterArticles(title):
            return None
        
        if infobox == "redirect":
            info = regex.search(redirect_title, text)
            if info is None:
                return None
            info = info.group(0)
        
        elif infobox == 'no infobox/redirect':
            info = {'categories': list(getCategories(text))}
            if info['categories'] == []:
                return None
        else:
            info = getArticleAtributes(infobox, text)

        return (title, infobox, info, predictCategory(infobox, info))
    return processArticle

In [3]:
#docs: https://docs.python.org/3.8/library/xml.sax.handler.html
class ContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self, testing=None):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buf = None
        self._last_tag = None
        self._parts = {}
        self.output = []
        self.evaluation = testing
        self.article_process = ArticleHandler(infobox_types=getInfoboxTypesList(),evaluation=self.evaluation)

    def characters(self, content):
        if self._last_tag:
            self._buf.append(content)

    def startElement(self, name, attrs):
        if name == 'page':
            self._parts = {}
        if name in ('title', 'text'):
            self._last_tag = name
            self._buf = []

    def endElement(self, name):
        if name == self._last_tag:
            self._parts[name] = ''.join(self._buf)
        
        #whole article
        if name == 'page':
            data = self.article_process(**self._parts)
            if data is not None:
                self.output.append(data)

In [4]:
def parseWiki(data=None, limit = 200, save = True, test_sample=False, evaluation=False):
    
    if test_sample:
        data = os.getcwd().rsplit('/', 1)[0]
        data = f'{data}/data/sample_wiki_articles2.xml.bz2'
        print(data)
    elif data is None:
        data = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
    
    handler = ContentHandler(testing=evaluation)

    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    for i, line in enumerate(subprocess.Popen(['bzcat'], 
                             stdin = open(data), 
                             stdout = subprocess.PIPE).stdout):

#         if (i + 1) % 10000 == 0:
#             print(f'Processed {i + 1} lines.', end = '\r')
#             print('')
        try:
            parser.feed(line)
        except StopIteration:
            break
        
        # get only some results
        if limit and len(handler.output) >= limit:
            break
        
    if save:
        output_dir = os.getcwd().rsplit('/', 1)[0]
        partition_name = data.split('/')[-1].split('-')[-1].split('.')[0]
        if not(evaluation):
            output_file = f'{output_dir}/output/{partition_name}.tsv'
        else:
            output_file = f'{output_dir}/output/eval/{partition_name}.tsv'

        
        # context managers make sure both files are flushed and closed
        with open(output_file, 'w+', newline='\n') as f1, \
             open(f'{output_file}-redirects', 'w+', newline='\n') as f2:
            writer1 = csv.writer(f1, delimiter='\t')
            writer2 = csv.writer(f2, delimiter='\t')
            writer1.writerow(["Title","Category"])
            writer2.writerow(["Title","Source"])

            for x in handler.output:
                if x[1] == 'redirect':
                    writer2.writerow([x[0],x[2] or 'None'])
                else:
                    writer1.writerow([x[0],x[3] or 'None'])
        
        print(f'{output_file} done', end='\r')
        del handler
        del parser
        gc.collect()
        return None
    else:
        return handler.output

Download and parse the Wikipedia page that lists the infobox types. The list also groups these types into categories, which makes it easy to get, for example, all infoboxes related to people.

In [6]:
def getInfoboxTypesList():
    infobox_list_url = 'https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes'
    infobox_list_html = requests.get(infobox_list_url).text
    soup_dump = BeautifulSoup(infobox_list_html, 'html.parser')
    #sib = soup_dump.find_all("div" ,{'id': 'toc'}).next_sibling
    other = {}

    template_list = dict()
    prev = None
    prev_tag = None
    prev_parent = None
    prev_parent_tag = 2

    for i, sibling in enumerate(soup_dump.find(id="toc").next_siblings):

        if prev_parent == 'Other':
            break

        if sibling.name == 'h2':
            template_list[sibling.findChild().text] = {}
            prev_parent = sibling.findChild().text
            prev_tag = 2

        if sibling.name == 'h3':
            if prev_tag < 3:
                template_list[prev_parent][sibling.findChild().text] = list()
                prev_tag = 3
                prev = sibling.findChild().text

            if prev_tag == 3:
                template_list[prev_parent][sibling.findChild().text] = list()
                prev = sibling.findChild().text

        if sibling.name == 'ul':
            a = sibling.find_all('a', title=re.compile('^Template:Infobox'))
            # the link text is lowercased first, so the pattern is lowercase as well
            b = map(lambda x: regex.findall(r'(?<=template:infobox )(.*)', x.text.lower()), a)
            c = reduce(lambda x,y: x+y, b, list())

            if prev_tag >=3:
                template_list[prev_parent][prev] = [y for x in [template_list[prev_parent][prev], list(c)] for y in x] 
            else:
                template_list[prev_parent] = list(c)

    
    for k ,v in template_list.items():
        if(k not in ['Person', "Place", 'Society and social science', "Other"]):
            other.update({k:v})
        elif k == 'Society and social science':
            tmp = {}
            for k2,v2 in v.items():
                if k2 not in ['Business and economics', "Organization"]:
                    tmp.update({k2:v2})
            other.update({k:tmp})
            
    other = sum(sum((map(lambda x: list(x.values()) if isinstance(x, dict) else [x] ,list(other.values()))),[]),[])
    persons = list(reduce(lambda x,y: x+y, template_list["Person"].values()))
    locations = list(reduce(lambda x,y: x+y, template_list["Place"].values()))
    companies = template_list['Society and social science']['Business and economics']
    organizations = template_list['Society and social science']['Organization']
    
    return {'person': persons, 'location': locations, 'company': companies, 'org': organizations, 'other': other}

## Utils

Simple functions used in several parts of the project.

In [2]:
def getPathFiles(path, endwith):
    out_path = os.getcwd().rsplit('/', 1)[0]
    files = f'{out_path}{path}/'
    files = [files+file for file in os.listdir(files) if file.endswith(endwith)]
    return files

In [3]:
def mapperCategories(arg):
    switcher = {
        'Person': 0,
        'A_Person':0,
        'C_Person':0,
        'Company': 1,
        'A_Company':1,
        'C_Company':1,
        'Organization':2,
        'A_Organization':2,
        'C_Organization':2,
        'Location':3,
        'A_Location':3,
        'C_Location':3,
        'Another':4
    }
    return switcher.get(arg,4)

In [4]:
def mapperCategoriesNames(arg):
    switcher = {
        'Person': 'Person',
        'A_Person':'Person',
        'C_Person':'Person',
        'Company': 'Company',
        'A_Company':'Company',
        'C_Company':'Company',
        'Organization':'Organization',
        'A_Organization':'Organization',
        'C_Organization':'Organization',
        'Location':'Location',
        'A_Location':'Location',
        'C_Location':'Location',
        'Another':'Another'
    }
    return switcher.get(arg,'Another')

In [5]:
def readTsv(file):
    output = []
    with open(file) as f:
        for line in csv.DictReader(f, delimiter='\t'): 
            output.append(line)
    return output

In [18]:
def readTsvGenerator(file):
#     output = []
    with open(file) as f:
        for line in csv.DictReader(f, delimiter='\t'): 
            yield line
#     return output

### Main

Runs the program on one file; this is not the main run. This approach demonstrates the functionality on part of a single file. A multiprocessing alternative was created for processing the whole dataset.

In [65]:
data = parseWiki(test_sample=True, limit=0, save=False, evaluation=False)

for i, x in enumerate(data):
    if i > 150:
        break
    print(x[0], '<-->', x[3])

/home/xminarikd/Documents/VINF/data/sample_wiki_articles2.xml.bz2
David Stagg <--> A_Person
Amaranthus mantegazzianus <--> redirect::Amaranthus caudatus
Amaranthus quitensis <--> redirect::Amaranthus hybridus
Maud Queen of Norway <--> redirect::Maud of Wales
Milligram per litre <--> redirect::Gram per litre
Utica Psychiatric Center <--> Location
Olean Wholesale Grocery <--> C_Company
Queen Tiye <--> redirect::Tiye
Queen Hatshepsut <--> redirect::Hatshepsut
Clibanarii <--> Other/None
Political documentary <--> redirect::Documentary film
Final fantasy legends <--> redirect::Final Fantasy Dimensions
Queen Marie Amelie Therese <--> redirect::Maria Amalia of Naples and Sicily
Political documentaries <--> redirect::Documentary film
E-767 <--> redirect::Boeing E-767
Prince Edward-Lennox <--> redirect::Prince Edward\xe2\x80\x94Lennox
Arthur Hill (actor) <--> A_Person
Periodic paralysis <--> Other
Greenstripe <--> redirect::Amaranthus acanthochiton
Amaranthus cruentus <--> C_Location
Careless w

## Multiprocessing

With this approach (multiprocessing), the whole dataset, which is split into 58 files, could be processed faster.
It was tested on 4 cores, which ran at 100% utilization.
Processing the whole dataset took 1h 44min 51s.

In [14]:
from multiprocessing import Pool 
from tqdm.notebook import tqdm
from functools import partial
import uuid

In [15]:
dataset_dir = '/home/xminarikd/.keras/datasets/'
dataset = [dataset_dir+file for file in os.listdir(dataset_dir)]
len(dataset)

58

In [16]:
%%time
pool = Pool(processes=4)
results = []

map_parser = partial(parseWiki, limit = 0, save = True,evaluation=False)

for x in tqdm(pool.imap_unordered(map_parser, dataset), total = len(dataset)):
    results.append(x)

pool.close()
pool.join()


/home/xminarikd/Documents/VINF/output/p23716198p25216197.tsv done
CPU times: user 397 ms, sys: 81.3 ms, total: 478 ms
Wall time: 1h 44min 51s


# Index

Creating the index and then searching in it. The Elasticsearch framework is used for this purpose. Communication with it goes through the Python library of the same name.

The created index contains two fields, title and category. Both fields are of type text. The index is populated through the bulk API, which allows several index operations to be submitted in one step, without using commit.

In [6]:
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm as tq
def connect_elasticsearch():
    _es = None
    _es = Elasticsearch([{'host': 'localhost', 'port': 9200}], timeout=120, max_retries=10, retry_on_timeout=True)
    if _es.ping():
        print(':) Connect')
    else:
        print(':( could not connect!')
    return _es

es = connect_elasticsearch()

:) Connect


Create the index and define its fields and their types.

In [8]:
def create_index(es_object, index_name='wiki'):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
        },
        "mappings": {
          "properties": {
            "category": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "title": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
    }
 
    try:
        if not es_object.indices.exists(index=index_name):
            # create the index only when it does not exist yet
            es_object.indices.create(index=index_name, body=settings)
            print('Created Index')
        created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created

In [37]:
create_index(es,'wiki')

Created Index


True

Function for adding new documents to the index. Using it to add many records at once is not recommended because it is very slow.

In [43]:
def toElastic(files, elastic):
    for x in tq(files):
        data = readTsv(x)
        for item in tq(data):
            # store the category name so that searches by name (e.g. 'Person') can match
            res = elastic.index(index='wiki', id=str(uuid.uuid4()), body={'title': item['Title'], 'category': mapperCategoriesNames(item['Category'])})
            if res['result'] != 'created':
                print('Warning, Error', res)
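
Indexing one document per request is slow for a dataset of this size. The notebook does not show the bulk variant, so the following is only a sketch of how the actions for `elasticsearch.helpers.bulk` could be generated; the function name `bulk_actions` and the in-memory sample are hypothetical.

```python
import csv
import io

def bulk_actions(rows, index_name='wiki'):
    """Yield one bulk action per TSV row, in the dict format used by
    elasticsearch.helpers.bulk."""
    for row in rows:
        yield {
            '_index': index_name,
            '_source': {'title': row['Title'], 'category': row['Category']},
        }

# In-memory stand-in for one of the generated .tsv files
sample_tsv = 'Title\tCategory\nDavid Stagg\tPerson\nOlean Wholesale Grocery\tCompany\n'
rows = csv.DictReader(io.StringIO(sample_tsv), delimiter='\t')
actions = list(bulk_actions(rows))
print(actions[0])

# With a running cluster the actions could be sent in batches, for example:
# from elasticsearch import helpers
# helpers.bulk(es, bulk_actions(readTsvGenerator(path)))
```

Because the actions come from a generator, a whole file never has to be loaded into memory before it is sent.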

## Searching

Simple search that queries both fields at the same time.

In [28]:
@interact(query="")
def searchByTitle(query):
    res = es.search(index='wiki', body={
        "query":{
            "multi_match":{
                "query": query,
                "type": "cross_fields",
                "analyzer" : "standard",
                "fields": ["title","category^3"]
            }
        }
    })
    return list(map(lambda x: x['_source'],res['hits']['hits']))


A search that takes an exact required category and then searches only by the title parameter. It also lets the user set the number of requested results.

In [27]:
@interact(title="", category=["Person", "Location", "Company", "Organization"], i=widgets.IntSlider(value=10, description='Limit', max=100, min=1))
def searchByTitle2(title, category,i):
    res = es.search(index='wiki', body={
        "from" : 0,
        "size" : i,
#         "min_score": 0.8,
        "query":{
            "bool": {
                "must": {
                    "bool": {
                        "should": {
                            "match":{
                                "title": title
                            }
                        },
                        "must": {
                            "match": {"category": category}
                        }
                    }
                }
            }
        }
    })
    df = pd.DataFrame(list(map(lambda x: x['_source'],res['hits']['hits'])), columns=["title", "category"])
    df.set_index('title', inplace=True)
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        print(df)
#     return list(map(lambda x: x['_source'],res['hits']['hits']))


In [33]:
interact(searchExactMatchByTitle,title="")


<function __main__.searchExactMatchByTitle(title)>

{'title': 'X Æ A-XII', 'category': 'Person'}

In [32]:
def searchExactMatchByTitle(title):
    res = es.search(index="wiki",body=
    {
       "size": 1,
       "query" : {
          "term" : {
             "title.keyword" : title
          }
       }
    })
    if res['hits']['hits']:
        return res['hits']['hits'][0]['_source']
    else:
        return None

In [34]:
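# Note: documents are indexed with a lowercase 'title' field, so 'Title.keyword' matches nothing (0 hits below).
res= es.search(index='wiki',body={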
   "query" : {
      "term" : {
         "Title.keyword" : "Andy"
      }
   }
})
print(res)

{'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}}


## Processing redirects

In [7]:
redirect_files = getPathFiles('/output','-redirects')
len(redirect_files)

58

In [8]:
output_dir = os.getcwd().rsplit('/', 1)[0]
output_file = f'{output_dir}/output/redirects/processedRedirects.tsv'
f1 = open(output_file, 'w+', newline='\n')
f1.truncate(0)
writer1 = csv.writer(f1, delimiter='\t')
writer1.writerow(["Title","Category"])

16

In [None]:
%%time
for file in redirect_files:
    data = readTsv(file)
    resSearch = es.msearch(body=multiSearch(file))
    for item, res in zip(data, resSearch["responses"]):
        if res["hits"]["hits"]:
            writer1.writerow([item["Title"],res["hits"]["hits"][0]["_source"]["category"]])
#             founded.append({"Title": item["Title"], "Category": res["hits"]["hits"][0]["_source"]["category"] })

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 6.2 µs


In [130]:
len(founded)

3060367

In [8]:
output_dir = os.getcwd().rsplit('/', 1)[0]
output_file = f'{output_dir}/output/redirects/processedRedirects.tsv'

In [8]:
len(readTsv(output_file))

7968615

In [19]:
def bulkRedirectData(file):
    metadata = '{ "index": { "_index": "wiki" }}'
    for item in readTsvGenerator(file):
        # Map the category once and index only the four target classes.
        category = mapperCategoriesNames(item["Category"])
        if category in ["Person", "Company", "Organization", "Location"]:
            curr = {"title": item["Title"], "category": category}
            # The bulk API expects newline-delimited JSON, so join with '\n' rather than os.linesep.
            yield f'{metadata}\n{json.dumps(curr)}'

In [20]:
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

Elasticsearch limits the maximum size of a bulk request. The data is therefore sent in chunks of 500 000 records.

In [23]:
%%time
for chunk in chunks(bulkRedirectData(output_file), size=500000):
    res = es.bulk(index='wiki', body= chunk)
    if res['errors']:
        print(res)

CPU times: user 1min 1s, sys: 3.09 s, total: 1min 4s
Wall time: 4min 28s
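
To make the chunking behaviour concrete, here is a minimal sketch that runs the same `chunks` helper on a small stand-in generator. The `records` generator and the chunk size of 3 are assumptions chosen for illustration only; they are not part of the original pipeline.

```python
from itertools import chain, islice

def chunks(iterable, size=10):
    # Same helper as in the notebook, repeated so the example is self-contained.
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

# Hypothetical stand-in for the generator of bulk lines.
records = (f"record-{i}" for i in range(7))

# Each chunk is lazy and shares the underlying iterator,
# so it has to be consumed before the next chunk is requested.
batches = [list(chunk) for chunk in chunks(records, size=3)]
print(batches)
```

Because every chunk reads from the same iterator, the whole file never has to be held in memory; the trade-off is that chunks cannot be consumed out of order.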


In [30]:
def multiSearch(file):
    header = '{"index": "wiki"}'
    for item in readTsvGenerator(file):
        curr = {"size": 1, "query" : { "term" : {"title.keyword" : item['Source'] }} }
        # msearch expects newline-delimited JSON, so join with '\n' rather than os.linesep.
        yield f'{header}\n{json.dumps(curr)}'

## Building the index from a file

To load the whole processed dataset into the index, we use the bulk API, which performs many operations on Elasticsearch in a single request without intermediate commits. This approach made building the index many times faster than inserting records one by one.

The bulk API requires a precisely defined request structure. The function below, a generator, produces that structure.

In [26]:
def bulkMyData(file):
    metadata = '{ "index": { "_index": "wiki" }}'
    for item in readTsvGenerator(file):
        curr = {"title": item["Title"], "category": mapperCategoriesNames(item["Category"])}
        # The bulk API expects newline-delimited JSON, so join with '\n' rather than os.linesep.
        yield f'{metadata}\n{json.dumps(curr)}'
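
As a rough illustration of the structure the bulk API expects, the sketch below builds a newline-delimited payload for two made-up rows. The helper `to_bulk_lines` and the sample rows are hypothetical and exist only for this example; unlike the generator above, it yields the action line and the document line as separate items.

```python
import json

def to_bulk_lines(rows, index_name="wiki"):
    """Yield an action line followed by a document line for every row."""
    action = json.dumps({"index": {"_index": index_name}})
    for row in rows:
        yield action
        yield json.dumps({"title": row["Title"], "category": row["Category"]})

# Hypothetical rows in the shape produced by the TSV reader.
rows = [
    {"Title": "Ada Lovelace", "Category": "Person"},
    {"Title": "Bratislava", "Category": "Location"},
]

# Newline-delimited JSON, terminated by a final newline.
payload = "\n".join(to_bulk_lines(rows)) + "\n"
print(payload)
```

Each document therefore occupies two lines: the first says what to do and in which index, the second carries the document itself.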

In [10]:
test_data = getPathFiles('/output','.tsv')
len(test_data)

58

In [16]:
for file in test_data:
    res = es.bulk(index='wiki', body=bulkMyData(file))
    if res['errors']:
        print(res)

### Delete all records

Deletes the given index from Elasticsearch.

In [11]:
def deleteIndex(elastic, index):
    if elastic.indices.exists(index=index):
        elastic.indices.delete(index=index)
        print(f'Deleted index {index}')
    else:
        print(f'Index {index} does not exist')

deleteIndex(es, 'wiki')

# Finding common categories

This part analyses the Wikipedia categories of articles to find the keywords that are characteristic of each of our target classes. The analysis uses the tf-idf technique.

In [84]:
from functools import reduce
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
from nltk.util import ngrams  # used by get_bigrams
from collections import Counter
import threading

In [11]:
stopwords = nltk.corpus.stopwords.words('english')
def process_whole_sentence(text):
    sen = ' '.join(w for w in text.split() if w not in stopwords)
    sen = re.sub(r'\W', ' ', str(sen))
    sen = re.sub(r'\s+', ' ', sen, flags=re.I)
    sen = sen.lower()
    return sen

In [12]:
def get_bigrams(text):
    bigrams = []
    for sen in text:
        token = nltk.word_tokenize(sen)
        bigrams.append(list(map(lambda x: ' '.join(x),list(ngrams(token,2)))))
    return bigrams

In [13]:
def process_bigrams(data_cat):
    categories_processed = []
    
    cat_per = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] == 'Person',data_cat),[]))
    cat_com = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Company'],data_cat),[]))
    cat_org = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Organization'],data_cat),[]))
    cat_loc = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Location'],data_cat),[]))
    
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_per))),[]))
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_com))),[]))
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_org))),[]))
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_loc))),[]))
    
    return categories_processed

In [107]:
def remove_stop_words(data):
    stopwords = nltk.corpus.stopwords.words('english')
    return [w for w in data if w not in stopwords]


def tokenize(data):
    # Single-character punctuation tokens to drop ('\\' keeps the literal backslash).
    symbols = set("!\"#$%&()*+'-./:;,<=>?@[\\]^_`{|}~\n")
    tokens = word_tokenize(data)
    tokens = [token.lower() for token in tokens if token not in symbols]
    return tokens


def stemming(data):
    stemmer= PorterStemmer()
    return [stemmer.stem(token) for token in data]


def preprocess(data):
    data = tokenize(data)
    data = remove_stop_words(data)
    data = stemming(data)
    return data

In [15]:
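def dfcount(data):
    # Collect, for each token, the set of document indices it occurs in,
    # then reduce every set to its size (the document frequency).
    df = {}
    for i, doc in enumerate(data):
        for token in doc:
            df.setdefault(token, set()).add(i)
    return {token: len(docs) for token, docs in df.items()}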

In [16]:
def tf_idf(data, doc_freq):
    tfidf = {}
    for i in range(len(data)):
        counter = Counter(data[i])
        count_w = len(data[i])
        for token in np.unique(data[i]):
            tf = counter[token]/count_w
            df = doc_freq[token]
            idf = np.log((len(data)+1)/(df+1))
            tfidf[i, token] = tf*idf
    return tfidf
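
As a rough illustration of the weighting used in `tf_idf`, the sketch below applies the same formula, tf = count / document length and idf = log((N + 1) / (df + 1)), to three toy token lists. The names `toy_docs` and `toy_tfidf` and the data are made up for this example and do not come from the Wikipedia dump.

```python
import math
from collections import Counter

# Hypothetical toy "documents": one token list per class.
toy_docs = [
    ["player", "birth", "player"],
    ["brand", "birth"],
    ["station", "birth"],
]

def toy_tfidf(docs):
    """Return {(doc_index, token): score} using tf * log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        counts = Counter(doc)
        for token, count in counts.items():
            tf = count / len(doc)
            idf = math.log((n_docs + 1) / (doc_freq[token] + 1))
            scores[i, token] = tf * idf
    return scores

scores = toy_tfidf(toy_docs)
# "birth" occurs in every document, so its idf is log(4 / 4) = 0.
print(scores[0, "birth"])
# "player" occurs only in document 0, so it gets a positive weight.
print(scores[0, "player"])
```

With only four documents (one per class), a term that appears in all classes is weighted down to zero, which is why generic words drop out of the top lists.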

The previous functions for processing text, sentences and words and for computing the tf-idf score are combined into a single runnable function.

In [18]:
def getSignificanteCategories(limit=2000, write=True):
    categories = []

    data_cat = parseWiki(limit=limit ,test_sample=False, save=False)
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] == 'Person',data_cat),[]),[]))
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Company'],data_cat),[]),[]))
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Organization'],data_cat),[]),[]))
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Location'],data_cat),[]),[]))
    
    del data_cat
    
    DF = dfcount(categories)
    tfidf = tf_idf(categories, DF)
    
    c_person = None
    c_company = None
    c_org = None
    c_location = None
    
    def task1():
        nonlocal c_person  # assign to the enclosing variable, not a new local
        c_person = {term:x for (doc, term), x in tfidf.items() if doc == 0}
        c_person = sorted(c_person, key=c_person.__getitem__,reverse=True)
        print('Person: ', c_person[:20])

    def task2():
        nonlocal c_company
        c_company = {term:x for (doc, term), x in tfidf.items() if doc == 1}
        c_company = sorted(c_company, key=c_company.__getitem__,reverse=True)
        print('Company: ', c_company[:20])

    def task3():
        nonlocal c_org
        c_org = {term:x for (doc, term), x in tfidf.items() if doc == 2}
        c_org = sorted(c_org, key=c_org.__getitem__,reverse=True)
        print('Organisation: ', c_org[:20])

    def task4():
        nonlocal c_location
        c_location = {term:x for (doc, term), x in tfidf.items() if doc == 3}
        c_location = sorted(c_location, key=c_location.__getitem__, reverse=True)
        print('Location: ', c_location[:20])
    
    t1 = threading.Thread(target=task1, name='t1') 
    t2 = threading.Thread(target=task2, name='t2') 
    t3 = threading.Thread(target=task3, name='t3') 
    t4 = threading.Thread(target=task4, name='t4')
    
    t1.start()
    t2.start()
    t3.start()
    t4.start()
    
    
    t1.join()
    t2.join()
    t3.join()
    t4.join()
    
    if write:
        print('Person: ', c_person[:20])
        print('')
        print('Company: ', c_company[:20])
        print('')
        print('Organisation: ', c_org[:20])
        print('')
        print('Location: ', c_location[:20])
    
    return {'person': c_person, 'company': c_company, 'org': c_org, 'location': c_location}


def getSignificanteCategoriesBigrams(limit=2000, write=True):
    data_cat = parseWiki(limit=limit ,test_sample=False, save=False)
    categories_processed = []
    
    cat_per = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] == 'Person',data_cat),[]))
    cat_com = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Company'],data_cat),[]))
    cat_org = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Organization'],data_cat),[]))
    cat_loc = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Location'],data_cat),[]))
    
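    # Note: stemming() receives whole sentences here, so in effect only the end of each
    # sentence is stemmed (visible in the output, e.g. 'living peopl' next to 'living people').
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_per))),[]))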
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_com))),[]))
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_org))),[]))
    categories_processed.append(sum(get_bigrams(stemming(map(lambda x: process_whole_sentence(x),cat_loc))),[]))
    
    del data_cat
    
    DF = dfcount(categories_processed)
    tfidf = tf_idf(categories_processed, DF)
    
    # filter by the requested class and sort by tf-idf score
    c_person = {term:x for (doc, term), x in tfidf.items() if doc == 0}
    c_person = sorted(c_person, key=c_person.__getitem__,reverse=True)

    c_company = {term:x for (doc, term), x in tfidf.items() if doc == 1}
    c_company = sorted(c_company, key=c_company.__getitem__,reverse=True)

    c_org = {term:x for (doc, term), x in tfidf.items() if doc == 2}
    c_org = sorted(c_org, key=c_org.__getitem__,reverse=True)

    c_location = {term:x for (doc, term), x in tfidf.items() if doc == 3}
    c_location = sorted(c_location, key=c_location.__getitem__, reverse=True)
    
    if write:
        print('Person: ', c_person[:20])
        print('')
        print('Company: ', c_company[:20])
        print('')
        print('Organisation: ', c_org[:20])
        print('')
        print('Location: ', c_location[:20])
    
    return {'person': c_person, 'company': c_company, 'org': c_org, 'location': c_location}

In [141]:
bires = getSignificanteCategoriesBigrams(limit=10000,write=True)

Person:  ['living peopl', '20th century', 'f c', 'c play', 'century american', '21st century', 'american male', 'association football', 'league play', 'expatriate footballers', 'expatriate sportspeople', 'cup play', 'international footbal', 'rugby league', 'university alumni', 'fc play', 'musical groups', 'ice hockey', 'world cup', 'american people']

Company:  ['companies established', 'companies based', 'companies united', 'services companies', 'financial services', 'mergers acquisit', 'american companies', 'stock exchanges', 'video game', 'game companies', 'manufacturing companies', 'chains united', 'companies listed', 'exchanges africa', 'restaurants established', 'stock exchang', 'companies disestablished', 'mass media', 'media companies', 'restaurant chains']

Organisation:  ['based united', 'non profit', 'learned societies', 'profit organizations', 'organizations established', 'organisations based', 'organizations based', '3 organ', '501 c', 'associations based', 'c 3', 'chariti

In [77]:
res = getSignificanteCategories(limit=30000, write=False)

Person:  ['player', 'birth', 'male', 'death', 'expatri', 'peopl', 'alumni', 'live', 'sportspeopl', 'actor', 'writer', 'descent', 'footbal', '21st-centuri', 'cricket', '20th-centuri', 'singer', 'politician', 'musician', 'actress']
Company:  ['exchang', 'brand', 'acquisit', 'merger', 'defunct', 'manufactur', 'softwar', 'label', 'cloth', 'vehicl', 'retail', 'disestablish', 'restaur', 'video', 'fast-food', 'nasdaq', 'onlin', 'publish', 'chain', 'stock']
Organisation:  ['sahara', 'scout', 'youth', '501', 'gang', 'non-profit', 'polisario', 'chariti', 'c', 'learn', 'polit', 'societi', 'advocaci', 'ambul', 'anti-christian', 'anti-vaccin', 'child-rel', 'kazakhstan', 'multi-sport', 'non-government']
Location:  ['station', 'build', 'pyrénées-atlantiqu', 'regist', 'popul', 'complet', 'place', 'airport', 'venu', 'town', 'school', 'aerodrom', 'need', 'counti', 'railway', 'villag', 'unincorpor', 'great', 'mountain', 'open']


In [148]:
def printResults(result, limit):
    print('Person: ',result['person'][:limit])
    print('')
    print('Company: ', result['company'][:limit])
    print('')
    print('Organisation: ',result['org'][:limit])
    print('')
    print('Location: ', result['location'][:limit])

In [149]:
printResults(bires,50)

Person:  ['living peopl', '20th century', 'f c', 'c play', 'century american', '21st century', 'american male', 'association football', 'league play', 'expatriate footballers', 'expatriate sportspeople', 'cup play', 'international footbal', 'rugby league', 'university alumni', 'fc play', 'musical groups', 'ice hockey', 'world cup', 'american people', 'fifa world', 'male actors', 'football league', 'male actor', 'expatriate footbal', 'military personnel', 'people educated', 'hockey players', 'male writ', 'records artist', 'draft pick', 'century indian', 'football manag', 'male television', 'film actor', 'uk mps', 'male film', 'soccer play', 'television actor', 'united f', 'year birth', 'living people', 'winter olymp', 'birth missing', 'football midfield', 'missing living', 'major league', 'school alumni', '19th century', 'new zealand']

Company:  ['companies established', 'companies based', 'companies united', 'services companies', 'financial services', 'mergers acquisit', 'american com

Processing 300 000 articles takes about 45 minutes; the code needs refactoring.

Person=['player', 'male', 'actor', 'sportspeopl', 'medalist', 'actress', 'expatri', 'singer', 'musician', 'live', 'writer', 'politician', 'f.c', 'alumni', 'personnel', 'olymp', '20th-centuri', 'faculti', 'coach', 'guitarist']

Company=['brand', 'merger', 'retail', 'exchang', 'stock', 'label', 'nasdaq', 'multin', 'subsidiari', 'acquisit', 'onlin', 'offer', 'held', 'conglomer', 'drink', 'vehicl', 'softwar', 'equip', 'store', 'bankruptci']

Organisation=['scout', 'think', 'non-profit', 'girl', 'gang', 'multi-sport', 'event', 'recur', 'religi', 'tank', 'child-rel', 'non-align', 'non-government', 'critic', 'right', 'chess', 'evangel', 'movement|', 'yakuza', 'advocaci']

Location=['regist', 'unincorpor', 'station', 'popul', 'complet', 'aerodrom', 'villag', 'town', 'landform', 'parish', 'river', 'seaplan', 'open', 'census-design', 'mountain', 'attract', 'neighbourhood', 'suburb', 'rang', 'airport']


300 000 articles, first 100 terms per class:
Person:  ['player', 'male', 'actor', 'sportspeopl', 'medalist', 'actress', 'expatri', 'singer', 'musician', 'live', 'writer', 'politician', 'f.c', 'alumni', 'personnel', 'olymp', '20th-centuri', 'faculti', 'coach', 'guitarist', 'basebal', 'novelist', 'emigr', 'descent', 'cup', '21st-centuri', 'mp', 'painter', 'femal', 'journalist', 'poet', 'compos', 'draft', 'pick', 'repres', '19th-centuri', 'summer', 'champion', 'screenwrit', 'lawyer', 'director', 'swimmer', 'soccer', 'forward', 'skater', 'burial', 'midfield', 'field', 'ice', 'gold', 'non-fict', 'basketbal', 'winter', 'recipi', 'comedian', 'fifa', 'filipino', 'businesspeopl', 'defend', 'senat', 'silver', 'major', 'songwrit', 'scientist', 'minist', 'medal', 'fc', 'medallist', 'staff', 'singer-songwrit', 'voic', 'scholar', 'fellow', 'boxer', 'wrestler', 'historian', 'pan', 'drummer', 'universiad', 'rock', 'figur', 'bundesliga', 'cemeteri', 'rugbi', 'bronz', 'pianist', 'dramatist', 'merit', 'playwright', 'cyclist', 'stage', 'inducte', 'mayor', 'under-21', 'activist', 'xi', 'republican', 'first', 'governor', 'presid']

Company:  ['brand', 'merger', 'retail', 'exchang', 'stock', 'label', 'nasdaq', 'multin', 'subsidiari', 'acquisit', 'onlin', 'offer', 'held', 'conglomer', 'drink', 'vehicl', 'softwar', 'equip', 'store', 'bankruptci', 'file', 'cloth', 'non-renew', 'chapter', 'shoe', 'supermarket', 'initi', 'formerli', 'properti', 'publish', 'portfolio', 'chain', 'supplier', 'chocol', 'luxuri', 'tokyo', 'equiti', 'phone', 'applianc', 'part', 'ga', 'motor', 'truck', 'bakeri', 'group|', 'midwestern', 'toy', 'housebuild', 'web', 'hold', 'fashion', 'headquart', 'studio', 'breweri', '11', 'government-own', 'snack', 'spin-off', 'energi', 'fast-food', 'oil', 'pharmaceut', 'amplifi', 'eyewear', 'nationalis', 'encyclopedia', '2010', 'resourc', 'discontinu', 'euronext', 'outsourc', 'r.a', 're-establish', 'guitar', 'colorado', '2017', 'magazin', 'mobil', 'firearm', 'googl', 'warrant', '2008', 'indiana', 'pipelin', 'provid', 'chaebol', 'condiment', 'dairi', 'discount', 'index', 'mortgag', 'poultri', 'coffe', 'cosmet', 'distribut', 'fuel', '2020', 'consult', 'rock', 'station']

Organisation:  ['scout', 'think', 'non-profit', 'girl', 'gang', 'multi-sport', 'event', 'recur', 'religi', 'tank', 'child-rel', 'non-align', 'non-government', 'critic', 'right', 'chess', 'evangel', 'movement|', 'yakuza', 'advocaci', 'patronag', 'usa', 'games|', 'sahara', 'accreditor', 'america|', 'associations|', 'association|', 'hispanic-american', 'ioc-recognis', 'lobbi', 'metalwork', 'polisario', 'supraorgan', '501', 'bolivia', 'femin', 'intergovernment', 'secret', 'traffick', 'learn', 'asian', 'publish', 'ambul', 'anti-abort', 'anti-vaccin', 'consortia', 'feminist', 'parachurch', 'shelter', 'veteran', 'diego', 'adi', 'advaita', 'anti-vivisect', 'awards|', 'caloust', 'churches|thailand', 'education|', 'federation|', 'foundation|', 'genet', 'gmb', 'gulbenkian', 'irredent', 'metric', 'pageants|california', 'philanthrop', 'positiv', 'puri', 'shankara', 'shankaracharya', 'states–european', 'sub-confeder', 'taxat', 'treati', 'trust|', 'vedanta', 'vexillolog', 'center', 'confeder', 'local', 'nebraska', 'olymp', 'anglican', 'denomin', 'labor', 'missionari', 'scientolog', 'welfar', '1778', 'activist', 'anti-christian', 'biblic', 'carpent', 'certif', 'combat', 'emerg', 'homeless', 'israeli–palestinian']

Location:  ['regist', 'unincorpor', 'station', 'popul', 'complet', 'aerodrom', 'villag', 'town', 'landform', 'parish', 'river', 'seaplan', 'open', 'census-design', 'mountain', 'attract', 'neighbourhood', 'suburb', 'rang', 'airport', 'certifi', 'secondari', 'district|', 'site', 'skyscrap', 'pyrénées-atlantiqu', 'basketbal', 'stadium', 'demolish', 'need', 'vaud', 'coast', 'tributari', 'arena', 'neighborhood', 'dam', 'tunnel', 'saskatchewan', 'monument', 'serv', 'multi-purpos', 'mall', 'lighthous', 'pradesh', 'locat', 'volcano', 'norfolk', 'coastal', 'mojav', 'territori', 'canton', 'township', 'subprefectur', 'desert', 'volleybal', 'derbyshir', 'grassland', 'hill', 'censu', 'castl', 'casino', 'landmark', 'governor', 'voivodeship', 'glacier', 'line', 'valley', 'residenti', 'subway', 'nova', 'colorado', 'close', 'scotia', 'princ', 'reservoir', 'grade', 'offic', 'properti', 'abellio', 'scotrail', 'local', 'indoor', 'lrt', 'uninhabit', 'metropolitan', 'oklahoma', 'suffolk', 'wikipedia', 'montana', 'translat', 'cumbria', 'indiana', 'dioces', 'sculptur', 'divis', 'punggol', 'navarr', 'instal', 'reserv', 'verd']


Person:  ['living peopl', '20th century', 'f c', 'c play', 'century american', '21st century', 'american male', 'association football', 'league play', 'expatriate footballers', 'expatriate sportspeople', 'cup play', 'international footbal', 'rugby league', 'university alumni', 'fc play', 'musical groups', 'ice hockey', 'world cup', 'american people', 'fifa world', 'male actors', 'football league', 'male actor', 'expatriate footbal', 'military personnel', 'people educated', 'hockey players', 'male writ', 'records artist', 'draft pick', 'century indian', 'football manag', 'male television', 'film actor', 'uk mps', 'male film', 'soccer play', 'television actor', 'united f', 'year birth', 'living people', 'winter olymp', 'birth missing', 'football midfield', 'missing living', 'major league', 'school alumni', '19th century', 'new zealand']

Company:  ['companies established', 'companies based', 'companies united', 'services companies', 'financial services', 'mergers acquisit', 'american companies', 'stock exchanges', 'video game', 'game companies', 'manufacturing companies', 'chains united', 'companies listed', 'exchanges africa', 'restaurants established', 'stock exchang', 'companies disestablished', 'mass media', 'media companies', 'restaurant chains', 'retail companies', 'companies filed', 'defunct companies', 'filed chapter', 'internet properties', 'properties established', 'retailers united', '11 bankruptcy', 'chapter 11', 'companies canada', 'companies formerly', 'established 1960', 'established 1995', 'fast food', 'formerly listed', 'listed new', 'manufacturers united', 'york stock', 'established 1989', 'establishments california', 'based austin', 'british companies', 'clothing companies', 'companies england', 'companies isle', 'development compani', 'established 1950', 'established 1974', 'established 2003', 'food chains']

Organisation:  ['based united', 'non profit', 'learned societies', 'profit organizations', 'organizations established', 'organisations based', 'organizations based', '3 organ', '501 c', 'associations based', 'c 3', 'charities based', 'consultative status', 'established 1946', 'professional associations', 'psychology organizations', 'relief organ', 'societies canada', 'status united', 'english football', 'establishments united', 'youth organizations', 'united nations', '1845 establishments', '1859 establishments', '1864 establishments', '1907 establishments', '1908 establishments', '1959 establishments', '1982 establishments', '1996 establishments', '19th centuri', 'academy financial', 'advocacy organ', 'aid organ', 'air ambulance', 'ambulance servic', 'ambulance services', 'american council', 'american organized', 'ancient near', 'awards h', 'banks texa', 'bar associ', 'based geneva', 'based hong', 'based montr', 'based surrey', 'based switzerland', 'based tyne']

Location:  ['pyrénées atlantiqu', 'communes pyrénées', 'articles needing', 'atlantiques communes', 'communes articles', 'french wikipedia', 'needing translation', 'pyrénées atlantiques', 'translation french', 'lower navarr', 'populated places', 'register historic', 'national register', 'historic places', 'towns luxembourg', 'unincorporated communities', 'civil parishes', 'buildings structures', 'neighborhoods pittsburgh', 'sports venues', 'western australia', 'parishes leicestershir', 'villages leicestershir', 'cities towns', 'protected areas', 'borough charnwood', 'places established', 'suburbs perth', 'west virginia', 'tourist attractions', 'buildings completed', 'city rockingham', 'suburbs city', 'shopping malls', 'new jersey', 'county west', 'historic house', 'house museums', 'venues completed', 'county california', 'county virginia', 'alzette canton', 'communes esch', 'county massachusett', 'esch sur', 'former communes', 'mountains hills', 'rhode island', 'road bridges', 'sur alzette']


# Evaluating the category assignment

In [106]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [145]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

[[79537     5     6     9  1207]
 [   13  4924    69    20   682]
 [   37   164  1143    40   625]
 [  721   414   230 45134 10983]
 [ 6153  2428  1923  8684 73095]]
              precision    recall  f1-score   support

           0       0.92      0.98      0.95     80764
           1       0.62      0.86      0.72      5708
           2       0.34      0.57      0.42      2009
           3       0.84      0.79      0.81     57482
           4       0.84      0.79      0.82     92283

    accuracy                           0.86    238246
   macro avg       0.71      0.80      0.75    238246
weighted avg       0.86      0.86      0.86    238246

0.8555568613953645
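
The per-class figures in the report can be recomputed by hand from the confusion matrix, which makes it easier to see where the weaker classes lose precision or recall. The sketch below copies the matrix from the output above and assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [
    [79537,     5,     6,     9,  1207],
    [   13,  4924,    69,    20,   682],
    [   37,   164,  1143,    40,   625],
    [  721,   414,   230, 45134, 10983],
    [ 6153,  2428,  1923,  8684, 73095],
]

n = len(cm)
row_sums = [sum(row) for row in cm]                         # support per true class
col_sums = [sum(row[j] for row in cm) for j in range(n)]    # predictions per class

precision = [cm[k][k] / col_sums[k] for k in range(n)]
recall = [cm[k][k] / row_sums[k] for k in range(n)]
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_sums)

print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
print(accuracy)
```

Reading the matrix this way shows, for example, that class 2 has the lowest precision mostly because of the many class 4 records predicted as class 2 (the last row of the third column).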


Collect the processed files located in the ./output directory.

In [27]:
out_path = os.getcwd().rsplit('/', 1)[0]
data_files_original = f'{out_path}/output/'
data_files_original = [file for file in os.listdir(data_files_original) if file.endswith('.tsv')]
len(data_files_original)

58

Build the test dataset by keeping only the records that are considered ground truth.

In [144]:
y_test = []
y_pred = []

for x in data_files_original:
    original = readTsv(f'{out_path}/output/{x}')
    evaluated = readTsv(f'{out_path}/output/eval/{x}')
    for o, e in zip(original, evaluated):
        if o['Category'] in ["Person",'Company','Organization','Location','Another']:
            y_test.append(mapperCategories(o['Category']))
            y_pred.append(mapperCategories(e['Category']))

The test set contains 238 246 records in total.

In [125]:
len(y_test)

238246

## Results

This part shows the results of several approaches and variants of category assignment.

The "other" category was not tested in these experiments, so the results cannot be considered fully accurate.

# Stats

The total number of processed pages is 15 669 453, of which 6 124 163 are articles and 9 545 290 are redirects.

In [76]:
size = 0
redirectSize = 0
files = getPathFiles('/output','.tsv')
redirects = getPathFiles('/output','-redirects')
for file in files:
    size += len(readTsv(file))
for file in redirects:
    redirectSize += len(readTsv(file))
print(size)
print(redirectSize)

6124163
9545290


Number of articles in category Person: 1 834 543

Number of articles in category Location: 1 038 893

Number of articles in category Organization: 69 638

Number of articles in category Company: 116 931

Other: 3 064 158

In [80]:
results = { "Person": 0, "Organization": 0, "Location": 0, "Company": 0, "Another": 0 }

files = getPathFiles('/output','.tsv')
for file in files:
    for item in readTsv(file):
        results[mapperCategoriesNames(item['Category'])] += 1
print(results)

{'Person': 1834543, 'Organization': 69638, 'Location': 1038893, 'Company': 116931, 'Another': 3064158}


In [None]:
Pages with the most redirects:

In [81]:
redirecttargets = []
redirects = getPathFiles('/output','-redirects')
for file in redirects:
    for item in readTsv(file):
        redirecttargets.append(item["Source"])

In [85]:
counter = Counter(redirecttargets)

In [91]:
for title, count in counter.most_common(20):
    print(f'{title} \t {count}')

Hangul 	 7072
Private Use Areas 	 6410
Category:World Current Research Publishing academic journals 	 3948
OMICS Publishing Group 	 2848
Category:British Open Research Publications academic journals 	 2172
Category:European Union Research Publishing academic journals 	 1850
Category:Eurasian Research Publishing academic journals 	 1816
Category:North American Research Publishing academic journals 	 1810
Category:Academic Knowledge and Research Publishing academic journals 	 1785
Category:American Research Publications academic journals 	 1695
Category:Academic and Scientific Publishing academic journals 	 1616
Category:Canadian Research Publication academic journals 	 1594
Category:Asian and American Research Publishing Group academic journals 	 1513
Category:Science and Technology Publishing academic journals 	 1445
Category:Research and Knowledge Publication academic journals 	 1380
Science Publishing Group 	 1164
Habeas corpus petitions of Guantanamo Bay detainees 	 1087
Chlaenius 	

The most common letters are a, e and i.

In [103]:
counter = Counter()
files = getPathFiles('/output','.tsv')
for file in files:
    for item in readTsv(file):
        counter.update(list(item["Title"].lower()))
for char, count in counter.most_common():
    print(f'{char} \t {count}')

In [119]:
counter = Counter()
files = getPathFiles('/output','.tsv')
for file in files:
    for item in readTsv(file):
        counter.update(tokenize(item["Title"]))
for word, count in counter.most_common(20):
    print(f'{word} \t {count}')

In [125]:
counter3 = Counter()
files = getPathFiles('/output','.tsv')
for file in files:
    for item in readTsv(file):
        if(mapperCategoriesNames(item["Category"]) == "Person"):
            counter3.update(tokenize(item["Title"]))
for word, count in counter3.most_common(20):
    print(f'{word} \t {count}')