# Vytvorenie slovníka dvojíc pre účely Named Entity Recognizing
#### Creating a dictionary of pairs for the purposes of Named Entity Recognizing: Wiki page - type

Projekt je momentalne rozdeleny do 2 časti.

1. časť tvorí stahovanie potrebných súborov(wikipedia dump) na účely spracovania v projekte.
2. časť tvorí parsovanie súborov spolu s priradením kategorie jednotlivym clankom

## 1. Part : Downloading Wikipedia articles


In [18]:
import requests
from bs4 import BeautifulSoup
import os
import re
from functools import reduce

Stiahnutie dát zo stránky wikipédie. Vyfiltrovanie všetkých súborov, ktoré obsahujú v názve "pages-articles".

In [4]:
base_url = 'https://dumps.wikimedia.org/enwiki/20201001/'
base_html = requests.get(base_url).text
base_html[:15]

'<!DOCTYPE html '

In [5]:
soup_dump = BeautifulSoup(base_html, 'html.parser')
soup_dump.find_all('li', {'class': 'file'}, limit = 10)[0]

<li class="file"><a href="/enwiki/20201001/enwiki-20201001-pages-articles-multistream.xml.bz2">enwiki-20201001-pages-articles-multistream.xml.bz2</a> 17.5 GB</li>

In [6]:
files = []
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))
files[:5]

[('enwiki-20201001-pages-articles-multistream.xml.bz2', ['17.5', 'GB']),
 ('enwiki-20201001-pages-articles-multistream-index.txt.bz2', ['215.8', 'MB']),
 ('enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2',
  ['231.7', 'MB']),
 ('enwiki-20201001-pages-articles-multistream-index1.txt-p1p41242.bz2',
  ['222', 'KB']),
 ('enwiki-20201001-pages-articles-multistream2.xml-p41243p151573.bz2',
  ['313.2', 'MB'])]

In [7]:
files_to_download = [file[0] for file in files if re.search('pages-articles\d{1,2}.xml-p',file[0])]
files_to_download[:5]

['enwiki-20201001-pages-articles1.xml-p1p41242.bz2',
 'enwiki-20201001-pages-articles2.xml-p41243p151573.bz2',
 'enwiki-20201001-pages-articles3.xml-p151574p311329.bz2',
 'enwiki-20201001-pages-articles4.xml-p311330p558391.bz2',
 'enwiki-20201001-pages-articles5.xml-p558392p958045.bz2']

Použitie knižnice keras na stiahnutie týchto súborov/datasetu. Stiahnú sa len tie súbory, ktoré ešte nie sú stahnuté

In [70]:
import sys
from keras.utils import get_file
directory = '/home/xminarikd/.keras/datasets/'

In [74]:
data_paths = []
file_info = []

for file in files_to_download:
    path = directory + file
    
    if not os.path.exists(directory):
        print('neexistuje')
    # downaload only when file dont exist
    if not os.path.exists(directory + file):
        print('Downloading')
        data_paths.append(get_file(file, base_url + file))
        file_size = os.stat(path).st_size / 1e6
        
        # Find the number of articles
        file_articles = int(file.split('p')[-1].split('.')[-2]) - int(file.split('p')[-2])
        file_info.append((file, file_size, file_articles))
        
    # when file already exist
    else:
        data_paths.append(path)
        file_size = os.stat(path).st_size / 1e6
        
        file_number = int(file.split('p')[-1].split('.')[-2]) - int(file.split('p')[-2])
        file_info.append((file.split('-')[-1], file_size, file_number))

Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles1.xml-p1p41242.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles2.xml-p41243p151573.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles3.xml-p151574p311329.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles4.xml-p311330p558391.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles5.xml-p558392p958045.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles6.xml-p958046p1483661.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles7.xml-p1483662p2134111.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/2020

In [76]:
sorted(file_info, key = lambda x: x[1], reverse = True)[:5]

[('enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2',
  512.145263,
  1109141),
 ('enwiki-20201001-pages-articles10.xml-p4045403p5399366.bz2',
  502.44982,
  1353963),
 ('enwiki-20201001-pages-articles11.xml-p5399367p6899366.bz2',
  486.770345,
  1499999),
 ('enwiki-20201001-pages-articles8.xml-p2134112p2936260.bz2',
  471.652241,
  802148),
 ('enwiki-20201001-pages-articles7.xml-p1483662p2134111.bz2',
  462.504094,
  650449)]

## 2. Part Parsing data

Parsovanie prebieha postupne na všetkých súboroch v kompresovanom tvare. Na tento účel je použitý podproces "bzcat", ktorý číta a dodáva súbor po jednotlivých riadkoch. Na spracovanie týchto dát je použitý XML SAX parser. Tento parser obsahuje metódu ContentHandler, ktorá zabezpečuje uchovanie riadkov v buffery, pričom sa hľadajú tagy (page, title, text). Po nájdeni ukončovacieho znaku tagu page prebieha spracovanie celého článku.

Z článku sú pomocou regulárnych výrazov extrahované informácie:
* **infobox**
    * atribúty infoboxu
    * typ infoboxu
* **kategórie čklánku**

Následne na základe týchto informácií je určená kategória článku.

In [1]:
import subprocess
import xml.sax
import regex
import pandas as pd
from functools import reduce
import requests
from bs4 import BeautifulSoup
import re
import csv
import json
import os
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
import gc
#nltk.download('punkt')
#nltk.download('stopwords')

Momentálne sú priradzované kategórie: Person, Company, Organisation, Place.
Priradzovanie prebieha podľa vyššieho poradia na základe parametrov v poradí:
* **typ infoboxu** - či sa v zozname danej kategorie nachádza infobox daného článku
* **atribúty infoboxu:**
    * **person** - birth_date
    * **company** - industry, trade_name, products, brands
    * **organisation** - zatiaľ žiadne
    * **place** - coordinates, locations _|neobsahuje|_ date, founded, founder, founders
* **kategorie článku:**
    * **organisation** - obsahuje v kategóriach slovo organisaion/s
* **text článku** - zatiaľ nepoužité, ale plánované pre prípady, kedy článok neobsahuje infobox a kategórie neposkytnú žiadnu informáciu

In [41]:
#Get Infobox and Infobox type from article text
def ArticleHandler(infobox_types=None):
    #source:(stackof) https://regex101.com/r/kT1jF4/1
    infobox_regex = '(?=\{Infobox )(\{([^{}]|(?1))*\})'
    inf_type_regex = '(?<=Infobox)(.*?)(?=(\||\n|<!-|<--))'
    #https://regex101.com/r/1vJlms/1
    inf_parameters = '(?(?<=\|)|(?<=\|\s))(\w*)\s*=\s*[\w{\[]'
    #https://regex101.com/r/fl5hAw/1 https://regex101.com/r/Xj0fM3/1
    redirect_title = '(?<=\[\[)(.*)(?=\]\])'
    categories = '(?<=\[\[Category:)([^\]]*)(?=\]\])'
    
    Person=['player', 'male', 'actor', 'sportspeopl', 'medalist', 'actress', 'expatri', 'singer', 'musician', 'live', 'writer', 'politician', 'f.c', 'alumni', 'personnel', 'olymp', '20th-centuri', 'faculti', 'coach', 'guitarist']
    Company=['brand', 'merger', 'retail', 'exchang', 'stock', 'label', 'nasdaq', 'multin', 'subsidiari', 'acquisit', 'onlin', 'offer', 'held', 'conglomer', 'store', 'bankruptci']
    Organisation=['scout', 'think', 'non-profit', 'gang', 'multi-sport', 'event', 'recur', 'religi', 'child-rel', 'non-align', 'non-government', 'critic', 'right', 'chess', 'evangel', 'yakuza', 'advocaci']
    Location=['regist', 'unincorpor', 'station', 'popul', 'complet', 'aerodrom', 'villag', 'town', 'landform', 'parish', 'river', 'seaplan', 'open', 'census-design', 'mountain', 'attract', 'neighbourhood', 'suburb', 'rang', 'airport']
    
    
    # infobox_types = getInfoboxTypesList()
    
    def getCategories(text):
        return regex.findall(categories, text)
    
    
    def getArticleAtributes(infobox,text):
        i_par = regex.findall(inf_parameters, infobox)
        i_type = regex.search(inf_type_regex, infobox)
        i_type = i_type.group(0).strip() if i_type is not None else "none"
        return {'type': i_type.lower(), 'parameters': i_par, 'categories': list(getCategories(text))}
    
    
    def remove_stop_words(data):
        stopwords = nltk.corpus.stopwords.words('english')
        return [w for w in data if w not in stopwords]


    def tokenize(data):
        symbols = symbols = "!\"#$%&()*+'-./:;,|<=>?@[\]^_`{}~\n"
        tokens = word_tokenize(data)
        tokens = [token.lower() for token in tokens if token not in list(symbols)]
        return tokens


    def stemming(data):
        stemmer= PorterStemmer()
        return [stemmer.stem(token) for token in data]
        
    def processCategories(data):
        data = tokenize(data)
        data = remove_stop_words(data)
        data = stemming(data)
        return data
        
    def isRedirect(text):
        return regex.search("^#redirect\s*\[\[(?i)", text)
        
        
    def getInfobox(text):
        infobox = regex.search(infobox_regex, text)
        return infobox.group() if infobox is not None else "redirect" if isRedirect(text) is not None else "no infobox/redirect"
    
    def categoryBy_infoboxType(info):
        if info['type'] in infobox_types['person']:
            return "Person"
        elif info['type'] in infobox_types['company']:
            return 'Company'
        elif info['type'] in infobox_types['org']:
            return "Organization"
        elif info['type'] in infobox_types['location']:
            return "Location"
        else:
            return None
        
    def categoryBy_atributes(info):
        if 'birth_date' in info['parameters']:
            return "A_Person"
        elif any(i in info['parameters'] for i in ['industry', 'trade_name', 'products', 'brands']):
            return 'A_Company'
        elif any(i in info['parameters'] for i in ['coordinates', 'locations']) and not(any(i in info['parameters'] for i in ['date', 'founded', 'founder', 'founders'])):
            return 'A_Location'
        else:
            return None
        
    def categoryBy_categories(info):
        stemmed_categories = reduce(lambda x,y: x+y,map(lambda x: processCategories(x), info['categories']),[])
        
        if any(i in Person for i in stemmed_categories):
            return 'C_Person'
        elif any(i in Company for i in stemmed_categories):
            return 'C_Company'
        elif any(i in Organisation for i in stemmed_categories):
            return 'C_Organization'
        elif any(i in Location for i in stemmed_categories):
            return 'C_Location'
        
        elif list(filter(lambda x: regex.search('\b(compan(y|ies))\b(?i)', x), info['categories'])):
            return 'C_Company'
        elif list(filter(lambda x: regex.search('(organisations*)(?i)', x), info['categories'])):
            return 'C_Organization'
        else:
            return None
    
    
    def first_true(iterable,data=None, default='Other'):
        return next((item(data) for item in iterable if item(data) is not None), default)
    
    
    def predictCategory(infobox, info):
        if infobox not in ['redirect', 'no infobox/redirect']:
            return first_true([categoryBy_infoboxType,categoryBy_atributes,categoryBy_categories], info)
            
            #tieto clanky maju len kategorie
        elif infobox == 'no infobox/redirect':
            return first_true([categoryBy_categories], info,"Other/None")
        else:
            return 'redirect::'+info

    
    def processArticle(title, text):
        infobox = getInfobox(text)
        
        if infobox == "redirect":
            info = regex.search(redirect_title, text).group(0)
        
        elif infobox == 'no infobox/redirect':
            info = {'categories': list(getCategories(text))}
            if info['categories'] == []:
                return None
        else:
            info = getArticleAtributes(infobox, text)

        return (title, infobox, info, predictCategory(infobox, info))
    return processArticle

In [27]:
sss = {'categories':[]}
if sss['categories'] == []:
    print('ano')

ano


In [37]:
#docs: https://docs.python.org/3.8/library/xml.sax.handler.html
class ContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buf = None
        self._last_tag = None
        self._parts = {}
        self.output = []
        self.article_process = ArticleHandler(infobox_types=getInfoboxTypesList())

    def characters(self, content):
        if self._last_tag:
            self._buf.append(content)

    def startElement(self, name, attrs):
        if name == 'page':
            self._parts = {}
        if name in ('title', 'text'):
            self._last_tag = name
            self._buf = []

    def endElement(self, name):
        if name == self._last_tag:
            self._parts[name] = ''.join(self._buf)
        
        #whole article
        if name == 'page':
            data = self.article_process(**self._parts)
            if data is not None:
                self.output.append(data)

In [20]:
def parseWiki(data=None, limit = 200, save = True, test_sample=False):
    
    if test_sample:
        data = os.getcwd().rsplit('/', 1)[0]
        data = f'{data}/data/sample_wiki_articles2.xml.bz2'
        print(data)
    elif data is None:
        data = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
    
    handler = ContentHandler()

    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    for i, line in enumerate(subprocess.Popen(['bzcat'], 
                             stdin = open(data), 
                             stdout = subprocess.PIPE).stdout):

#         if (i + 1) % 10000 == 0:
#             print(f'Spracovanych {i + 1} riadkov.', end = '\r')
#             print('')
        try:
            parser.feed(line)
        except StopIteration:
            break
        
        # get only some results
        if len(handler.output) >= limit:
            break
        
    if save:
        output_dir = os.getcwd().rsplit('/', 1)[0]
        partition_name = data.split('/')[-1].split('-')[-1].split('.')[0]
        output_file = f'{output_dir}/output/{partition_name}.tsv'

        with open(output_file, 'w+', newline='\n') as file:
            writer = csv.writer(file, delimiter='\t')
            writer.writerow(["Title", "Category"])
            for x in handler.output:
                writer.writerow([x[0],x[3] or 'None'])

#         with open(output_file, 'w+', newline='\n') as file:
#             for x in handler.output:
#                 file.write(json.dumps({"title": x[0], "category":x[3]}))
        
        print(f'{output_file} done', end='\r')
        del handler
        del parser
        gc.collect()
        return None
    else:
        return handler.output

Stiahnutie a parsovanie stránky wikipédie, ktorá obsahuje zoznam typov infoboxov. Tento zoznam obsahuje aj členeie týchto typov do rôznych kategórií. Vďaka tomuto je možné jednoducho získať všetky infoboxy, ktoré sú spojené napríklad s osobami.

In [21]:
def getInfoboxTypesList():
    infobox_list_url = 'https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes'
    infobox_list_html = requests.get(infobox_list_url).text
    soup_dump = BeautifulSoup(infobox_list_html, 'html.parser')
    #sib = soup_dump.find_all("div" ,{'id': 'toc'}).next_sibling

    template_list = dict();
    prev = None
    prev_tag = None
    prev_parent = None
    prev_parent_tag = 2

    for i, sibling in enumerate(soup_dump.find(id="toc").next_siblings):

        if prev_parent == 'Other':
            break

        if sibling.name == 'h2':
            template_list[sibling.findChild().text] = {}
            prev_parent = sibling.findChild().text
            prev_tag = 2

        if sibling.name == 'h3':
            if prev_tag < 3:
                template_list[prev_parent][sibling.findChild().text] = list()
                prev_tag = 3
                prev = sibling.findChild().text

            if prev_tag == 3:
                template_list[prev_parent][sibling.findChild().text] = list()
                prev = sibling.findChild().text

        if sibling.name == 'ul':
            a = sibling.find_all('a', title=re.compile('^Template:Infobox'))
            b = map(lambda x: regex.findall('(?<=Template:Infobox )(.*)(?i)', x.text.lower()), a)
            c = reduce(lambda x,y: x+y, b, list())

            if prev_tag >=3:
                template_list[prev_parent][prev] = [y for x in [template_list[prev_parent][prev], list(c)] for y in x] 
            else:
                template_list[prev_parent] = list(c)

    persons = list(reduce(lambda x,y: x+y, template_list["Person"].values()))
    locations = list(reduce(lambda x,y: x+y, template_list["Place"].values()))
    companies = template_list['Society and social science']['Business and economics']
    organizations = template_list['Society and social science']['Organization']
    
    return {'person': persons, 'location': locations, 'company': companies, 'org': organizations}

### Main

Spustenie funkcie na spracovanie súborov.

In [42]:
data = parseWiki(test_sample=False, limit=300, save=False)

for i, x in enumerate(data):
    if i > 150:
        break
    if x[1] == 'redirect':
        print(x[0], '<-->', x[1])
    else:
        print(x[0], '<-->', x[3])

David Stagg <--> Person
Amaranthus mantegazzianus <--> redirect
Amaranthus quitensis <--> redirect
Maud Queen of Norway <--> redirect
Milligram per litre <--> redirect
Utica Psychiatric Center <--> Location
Olean Wholesale Grocery <--> C_Company
Queen Tiye <--> redirect
Queen Hatshepsut <--> redirect
Clibanarii <--> Other/None
Political documentary <--> redirect
Final fantasy legends <--> redirect
Queen Marie Amelie Therese <--> redirect
Political documentaries <--> redirect
E-767 <--> redirect
Prince Edward-Lennox <--> redirect
Arthur Hill (actor) <--> Person
Periodic paralysis <--> Other
Greenstripe <--> redirect
Amaranthus cruentus <--> Other/None
Careless weed <--> redirect
Zamil idris <--> redirect
Khada sag <--> redirect
Million instructions per second <--> redirect
Ashtadiggajas <--> Other/None
John C.Harsanyi <--> redirect
Société entomologique de France <--> C_Organization
Sangorache <--> redirect
Joseph's coat <--> redirect
Recipients of the Distinguished Service Award of the

## Multiprocessing

In [44]:
from multiprocessing import Pool 
import tqdm
from functools import partial
import uuid

In [45]:
dataset_dir = '/home/xminarikd/.keras/datasets/'
dataset = [dataset_dir+file for file in os.listdir(dataset_dir)]
len(dataset)

58

In [46]:
%%time
pool = Pool(processes=4)
results = []

map_parser = partial(parseWiki, limit = 100, save = True)

for x in tqdm.tqdm_notebook(pool.imap_unordered(map_parser, dataset), total = len(dataset)):
    results.append(x)

pool.close()
pool.join()

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=58.0), HTML(value='')))

/home/xminarikd/Documents/VINF/output/p15824603p17324602.tsv done/home/xminarikd/Documents/VINF/output/p311330p558391.tsv done/home/xminarikd/Documents/VINF/output/p32808443p34308442.tsv done/home/xminarikd/Documents/VINF/output/p5399367p6899366.tsv done/home/xminarikd/Documents/VINF/output/p52064554p53564553.tsv done/home/xminarikd/Documents/VINF/output/p23716198p25216197.tsv done
CPU times: user 264 ms, sys: 150 ms, total: 414 ms
Wall time: 16.4 s


# Indexes

In [15]:
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
es

<Elasticsearch([{'host': 'localhost', 'port': 9200}])>

In [94]:
def readTsv(file):
    output = []
    with open(file) as f:
        for line in csv.DictReader(f, delimiter='\t'): 
            output.append(line)
    return output

In [95]:
out_path = os.getcwd().rsplit('/', 1)[0]
data_files = f'{out_path}/output/'
data_files = [data_files+file for file in os.listdir(data_files) if file.endswith('.tsv')]
len(data_files)

58

In [96]:
def toElastic(files, elastic):
    for x in files:
        data = readTsv(x)
        for item in data:
            res = elastic.index(index='test', doc_type="wikipedia", id=uuid.uuid4(), body=item)
            if res['result'] != 'created':
                print('Warning, Error', res)

In [97]:
toElastic(data_files,es)



#### Sample searching

In [99]:
res= es.search(index='test',body={'query':{'match':{'Title':'Hotline'}}})
print(res['hits']['hits'])

[{'_index': 'test', '_type': 'wikipedia', '_id': '7178c4f0-47c1-4cc2-825e-86e47b48880b', '_score': 11.579972, '_source': {'Title': 'Hotline', 'Category': 'Other/None'}}]


#### Delete all records

In [93]:
es.indices.delete(index='test')

{'acknowledged': True}

## Finding common categories

In [20]:
from functools import reduce
import numpy as np
categories = []
data_cat = parseWiki(limit=200 ,test_sample=False, save=False)

cat_per = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] == 'Person',data_cat),[]))
cat_com = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Company', 'W_Company', 'Q_Company'],data_cat),[]))
cat_org = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Organization', 'W_Organization', 'Q_Organization'],data_cat),[]))
cat_loc = list(reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Location', 'W_Location'],data_cat),[]))

Spracovanych 10000 riadkov.

In [54]:
abc = reduce(lambda x,y: x+y,map(lambda x: preprocess(x),data_cat[0][2]['categories']))
abc

['1983',
 'birth',
 'australian',
 'rugbi',
 'leagu',
 'player',
 'rugbi',
 'leagu',
 'player',
 'queensland',
 'brisban',
 'bronco',
 'player',
 'canterbury-bankstown',
 'bulldog',
 'player',
 'queensland',
 'rugbi',
 'leagu',
 'state',
 'origin',
 'player',
 'rugbi',
 'leagu',
 'five-eighth',
 'rugbi',
 'leagu',
 'centr',
 'rugbi',
 'leagu',
 'lock',
 'peopl',
 'educ',
 'padua',
 'colleg',
 'brisban',
 'sportspeopl',
 'townsvil',
 'rugbi',
 'leagu',
 'second-row',
 'wynnum',
 'manli',
 'seagul',
 'player',
 'live',
 'peopl']

In [12]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
from collections import Counter
#nltk.download('punkt')
#nltk.download('stopwords')

In [10]:
def remove_stop_words(data):
    stopwords = nltk.corpus.stopwords.words('english')
    return [w for w in data if w not in stopwords]


def tokenize(data):
    symbols = symbols = "!\"#$%&()*+'-./:;,<=>?@[\]^_`{|}~\n"
    tokens = word_tokenize(data)
    tokens = [token.lower() for token in tokens if token not in list(symbols)]
    return tokens


def stemming(data):
    stemmer= PorterStemmer()
    return [stemmer.stem(token) for token in data]

In [11]:
def preprocess(data):
    data = tokenize(data)
    data = remove_stop_words(data)
    data = stemming(data)
    return data

In [117]:
categories_processed = []
categories_processed.append(reduce(lambda x,y: x+preprocess(y),cat_per,[]))
categories_processed.append(reduce(lambda x,y: x+preprocess(y),cat_com,[]))
categories_processed.append(reduce(lambda x,y: x+preprocess(y),cat_org,[]))
categories_processed.append(reduce(lambda x,y: x+preprocess(y),cat_loc,[]))

In [12]:
def dfcount(data):
    df = {}
    for i in range(len(data)):
        for token in data[i]:
            try:
                df[token].add(i)
            except:
                df[token] = {i}
    for i in df:
        df[i] = len(df[i])
    return df

In [119]:
DF = dfcount(categories_processed)
DF

{'1983': 3,
 'birth': 3,
 'australian': 3,
 'rugbi': 2,
 'leagu': 4,
 'player': 2,
 'queensland': 4,
 'brisban': 2,
 'bronco': 1,
 'canterbury-bankstown': 1,
 'bulldog': 1,
 'state': 4,
 'origin': 2,
 'five-eighth': 1,
 'centr': 2,
 'lock': 1,
 'peopl': 4,
 'educ': 4,
 'padua': 1,
 'colleg': 4,
 'sportspeopl': 1,
 'townsvil': 1,
 'second-row': 1,
 'wynnum': 1,
 'manli': 1,
 'seagul': 1,
 'live': 4,
 '1922': 2,
 '2006': 3,
 'death': 2,
 'canadian': 3,
 'male': 2,
 'film': 3,
 'actor': 1,
 'stage': 1,
 'televis': 4,
 'alzheim': 1,
 "'s": 4,
 'diseas': 1,
 'disease-rel': 1,
 'california': 4,
 'toni': 2,
 'award': 3,
 'winner': 3,
 'melfort': 1,
 'saskatchewan': 2,
 'univers': 4,
 'british': 3,
 'columbia': 3,
 'alumni': 2,
 'expatri': 1,
 'unit': 4,
 '1963': 3,
 'american': 4,
 'lawyer': 1,
 'lo': 4,
 'angel': 4,
 'hugo': 1,
 'chávez': 1,
 'earli': 1,
 'individual|chavez': 1,
 '1980': 4,
 'panola': 1,
 'counti': 3,
 'texa': 4,
 'footbal': 3,
 'lineback': 1,
 'kilgor': 1,
 'ranger': 1,
 'n

In [120]:
len(DF)

8641

In [13]:
def tf_idf(data, doc_freq):
    tfidf = {}
    for i in range(len(data)):
        counter = Counter(data[i])
        count_w = len(data[i])
        for token in np.unique(data[i]):
            tf = counter[token]/count_w
            df = doc_freq[token]
            idf = np.log((len(data)+1)/(df+1))
            tfidf[i, token] = tf*idf
    return tfidf

In [131]:
tmp = tf_idf(categories_processed, DF)

In [123]:
c_person = {term:x for (doc, term), x in tmp.items() if doc == 0}
c_person = sorted(c_person, key=c_person.__getitem__,reverse=True)

c_company = {term:x for (doc, term), x in tmp.items() if doc == 1}
c_company = sorted(c_company, key=c_company.__getitem__,reverse=True)

c_org = {term:x for (doc, term), x in tmp.items() if doc == 2}
c_org = sorted(c_org, key=c_org.__getitem__,reverse=True)

c_location = {term:x for (doc, term), x in tmp.items() if doc == 3}
c_location = sorted(c_location, key=c_location.__getitem__, reverse=True)

In [125]:
print(c_person[:20])
print(c_company[:20])
print(c_org[:20])
print(c_location[:20])

['player', 'expatri', 'death', 'sportspeopl', 'male', 'actor', 'birth', 'alumni', 'writer', 'cricket', 'footbal', 'actress', 'singer', 'fc', 'f.c', '21st-centuri', 'coach', 'medalist', 'soccer', 'hockey']
['exchang', 'manufactur', 'brand', 'defunct', 'acquisit', 'merger', 'label', 'stock', 'chain', 'retail', 'softwar', 'publish', 'pipelin', 'cloth', 'vehicl', 'fast-food', 'motor', 'disestablish', 'record', 'oil']
['organis', 'chariti', 'non-profit', 'parti', 'manama', 'tobago', 'ambul', 'gang', 'learn', 'youth', 'são', '501', 'country|bel', 'event', 'slough', 'treati', 'guid', 'nation', 'advocaci', 'claro']
['pyrénées-atlantiqu', 'regist', 'station', 'airport', 'complet', 'aerodrom', 'commun', 'need', 'popul', 'counti', 'river', 'railway', 'unincorpor', 'villag', 'venu', 'build', 'french', 'place', 'translat', 'leicestershir']


In [75]:
from functools import reduce
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
from collections import Counter
import threading

def getSignificanteCategories(limit=2000, write=True):
    categories = []

    data_cat = parseWiki(limit=limit ,test_sample=False, save=False)
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] == 'Person',data_cat),[]),[]))
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Company', 'A_Company'],data_cat),[]),[]))
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Organization'],data_cat),[]),[]))
    categories.append(reduce(lambda x,y: x+preprocess(y),reduce(lambda x,y: x+y[2]['categories'],filter(lambda x: x[3] in ['Location', 'A_Location'],data_cat),[]),[]))
    
    del data_cat
    
    DF = dfcount(categories)
    tfidf = tf_idf(categories, DF)
    
    c_person = None
    c_company = None
    c_org = None
    c_location = None
    
    def task1():
        c_person = {term:x for (doc, term), x in tfidf.items() if doc == 0}
        c_person = sorted(c_person, key=c_person.__getitem__,reverse=True)
        print('Person: ', c_person[:20])

    def task2():
        c_company = {term:x for (doc, term), x in tfidf.items() if doc == 1}
        c_company = sorted(c_company, key=c_company.__getitem__,reverse=True)
        print('Company: ', c_company[:20])

    def task3():
        c_org = {term:x for (doc, term), x in tfidf.items() if doc == 2}
        c_org = sorted(c_org, key=c_org.__getitem__,reverse=True)
        print('Organisation: ', c_org[:20])

    def task4():
        c_location = {term:x for (doc, term), x in tfidf.items() if doc == 3}
        c_location = sorted(c_location, key=c_location.__getitem__, reverse=True)
        print('Location: ', c_location[:20])
    
    t1 = threading.Thread(target=task1, name='t1') 
    t2 = threading.Thread(target=task2, name='t2') 
    t3 = threading.Thread(target=task3, name='t3') 
    t4 = threading.Thread(target=task4, name='t4')
    
    t1.start()
    t2.start()
    t3.start()
    t4.start()
    
    
    t1.join()
    t2.join()
    t3.join()
    t4.join()
    
    if write:
        print('Person: ', c_person[:20])
        print('')
        print('Company: ', c_company[:20])
        print('')
        print('Organisation: ', c_org[:20])
        print('')
        print('Location: ', c_location[:20])
    
    return {'person': c_person, 'company': c_company, 'org': c_org, 'location': c_location}

In [77]:
res = getSignificanteCategories(limit=30000, write=False)

Person:  ['player', 'birth', 'male', 'death', 'expatri', 'peopl', 'alumni', 'live', 'sportspeopl', 'actor', 'writer', 'descent', 'footbal', '21st-centuri', 'cricket', '20th-centuri', 'singer', 'politician', 'musician', 'actress']
Company:  ['exchang', 'brand', 'acquisit', 'merger', 'defunct', 'manufactur', 'softwar', 'label', 'cloth', 'vehicl', 'retail', 'disestablish', 'restaur', 'video', 'fast-food', 'nasdaq', 'onlin', 'publish', 'chain', 'stock']
Organisation: Location:  ['station', 'build', 'pyrénées-atlantiqu', 'regist', 'popul', 'complet', 'place', 'airport', 'venu', 'town', 'school', 'aerodrom', 'need', 'counti', 'railway', 'villag', 'unincorpor', 'great', 'mountain', 'open']
 ['sahara', 'scout', 'youth', '501', 'gang', 'non-profit', 'polisario', 'chariti', 'c', 'learn', 'polit', 'societi', 'advocaci', 'ambul', 'anti-christian', 'anti-vaccin', 'child-rel', 'kazakhstan', 'multi-sport', 'non-government']


In [74]:
print('Person: ',res['person'][:100])
print('')
print('Company: ', res['company'][:100])
print('')
print('Organisation: ',res['org'][:100])
print('')
print('Location: ', res['location'][:100])

TypeError: 'NoneType' object is not subscriptable

300 000 articles cca 45 minutes need refactoring

Person=['player', 'male', 'actor', 'sportspeopl', 'medalist', 'actress', 'expatri', 'singer', 'musician', 'live', 'writer', 'politician', 'f.c', 'alumni', 'personnel', 'olymp', '20th-centuri', 'faculti', 'coach', 'guitarist']

Company=['brand', 'merger', 'retail', 'exchang', 'stock', 'label', 'nasdaq', 'multin', 'subsidiari', 'acquisit', 'onlin', 'offer', 'held', 'conglomer', 'drink', 'vehicl', 'softwar', 'equip', 'store', 'bankruptci']

Organisation=['scout', 'think', 'non-profit', 'girl', 'gang', 'multi-sport', 'event', 'recur', 'religi', 'tank', 'child-rel', 'non-align', 'non-government', 'critic', 'right', 'chess', 'evangel', 'movement|', 'yakuza', 'advocaci']

Location=['regist', 'unincorpor', 'station', 'popul', 'complet', 'aerodrom', 'villag', 'town', 'landform', 'parish', 'river', 'seaplan', 'open', 'census-design', 'mountain', 'attract', 'neighbourhood', 'suburb', 'rang', 'airport']


300 000 first 100
Person:  ['player', 'male', 'actor', 'sportspeopl', 'medalist', 'actress', 'expatri', 'singer', 'musician', 'live', 'writer', 'politician', 'f.c', 'alumni', 'personnel', 'olymp', '20th-centuri', 'faculti', 'coach', 'guitarist', 'basebal', 'novelist', 'emigr', 'descent', 'cup', '21st-centuri', 'mp', 'painter', 'femal', 'journalist', 'poet', 'compos', 'draft', 'pick', 'repres', '19th-centuri', 'summer', 'champion', 'screenwrit', 'lawyer', 'director', 'swimmer', 'soccer', 'forward', 'skater', 'burial', 'midfield', 'field', 'ice', 'gold', 'non-fict', 'basketbal', 'winter', 'recipi', 'comedian', 'fifa', 'filipino', 'businesspeopl', 'defend', 'senat', 'silver', 'major', 'songwrit', 'scientist', 'minist', 'medal', 'fc', 'medallist', 'staff', 'singer-songwrit', 'voic', 'scholar', 'fellow', 'boxer', 'wrestler', 'historian', 'pan', 'drummer', 'universiad', 'rock', 'figur', 'bundesliga', 'cemeteri', 'rugbi', 'bronz', 'pianist', 'dramatist', 'merit', 'playwright', 'cyclist', 'stage', 'inducte', 'mayor', 'under-21', 'activist', 'xi', 'republican', 'first', 'governor', 'presid']

Company:  ['brand', 'merger', 'retail', 'exchang', 'stock', 'label', 'nasdaq', 'multin', 'subsidiari', 'acquisit', 'onlin', 'offer', 'held', 'conglomer', 'drink', 'vehicl', 'softwar', 'equip', 'store', 'bankruptci', 'file', 'cloth', 'non-renew', 'chapter', 'shoe', 'supermarket', 'initi', 'formerli', 'properti', 'publish', 'portfolio', 'chain', 'supplier', 'chocol', 'luxuri', 'tokyo', 'equiti', 'phone', 'applianc', 'part', 'ga', 'motor', 'truck', 'bakeri', 'group|', 'midwestern', 'toy', 'housebuild', 'web', 'hold', 'fashion', 'headquart', 'studio', 'breweri', '11', 'government-own', 'snack', 'spin-off', 'energi', 'fast-food', 'oil', 'pharmaceut', 'amplifi', 'eyewear', 'nationalis', 'encyclopedia', '2010', 'resourc', 'discontinu', 'euronext', 'outsourc', 'r.a', 're-establish', 'guitar', 'colorado', '2017', 'magazin', 'mobil', 'firearm', 'googl', 'warrant', '2008', 'indiana', 'pipelin', 'provid', 'chaebol', 'condiment', 'dairi', 'discount', 'index', 'mortgag', 'poultri', 'coffe', 'cosmet', 'distribut', 'fuel', '2020', 'consult', 'rock', 'station']

Organisation:  ['scout', 'think', 'non-profit', 'girl', 'gang', 'multi-sport', 'event', 'recur', 'religi', 'tank', 'child-rel', 'non-align', 'non-government', 'critic', 'right', 'chess', 'evangel', 'movement|', 'yakuza', 'advocaci', 'patronag', 'usa', 'games|', 'sahara', 'accreditor', 'america|', 'associations|', 'association|', 'hispanic-american', 'ioc-recognis', 'lobbi', 'metalwork', 'polisario', 'supraorgan', '501', 'bolivia', 'femin', 'intergovernment', 'secret', 'traffick', 'learn', 'asian', 'publish', 'ambul', 'anti-abort', 'anti-vaccin', 'consortia', 'feminist', 'parachurch', 'shelter', 'veteran', 'diego', 'adi', 'advaita', 'anti-vivisect', 'awards|', 'caloust', 'churches|thailand', 'education|', 'federation|', 'foundation|', 'genet', 'gmb', 'gulbenkian', 'irredent', 'metric', 'pageants|california', 'philanthrop', 'positiv', 'puri', 'shankara', 'shankaracharya', 'states–european', 'sub-confeder', 'taxat', 'treati', 'trust|', 'vedanta', 'vexillolog', 'center', 'confeder', 'local', 'nebraska', 'olymp', 'anglican', 'denomin', 'labor', 'missionari', 'scientolog', 'welfar', '1778', 'activist', 'anti-christian', 'biblic', 'carpent', 'certif', 'combat', 'emerg', 'homeless', 'israeli–palestinian']

Location:  ['regist', 'unincorpor', 'station', 'popul', 'complet', 'aerodrom', 'villag', 'town', 'landform', 'parish', 'river', 'seaplan', 'open', 'census-design', 'mountain', 'attract', 'neighbourhood', 'suburb', 'rang', 'airport', 'certifi', 'secondari', 'district|', 'site', 'skyscrap', 'pyrénées-atlantiqu', 'basketbal', 'stadium', 'demolish', 'need', 'vaud', 'coast', 'tributari', 'arena', 'neighborhood', 'dam', 'tunnel', 'saskatchewan', 'monument', 'serv', 'multi-purpos', 'mall', 'lighthous', 'pradesh', 'locat', 'volcano', 'norfolk', 'coastal', 'mojav', 'territori', 'canton', 'township', 'subprefectur', 'desert', 'volleybal', 'derbyshir', 'grassland', 'hill', 'censu', 'castl', 'casino', 'landmark', 'governor', 'voivodeship', 'glacier', 'line', 'valley', 'residenti', 'subway', 'nova', 'colorado', 'close', 'scotia', 'princ', 'reservoir', 'grade', 'offic', 'properti', 'abellio', 'scotrail', 'local', 'indoor', 'lrt', 'uninhabit', 'metropolitan', 'oklahoma', 'suffolk', 'wikipedia', 'montana', 'translat', 'cumbria', 'indiana', 'dioces', 'sculptur', 'divis', 'punggol', 'navarr', 'instal', 'reserv', 'verd']


## Test and data searching area

In [17]:
import pandas as pd
import numpy as np
import matplotlib as plt
import seaborn

In [43]:
df = pd.DataFrame(data)
df.head(10)

Unnamed: 0,0,1,2,3
0,David Stagg,{Infobox rugby league biography\n|name ...,"{'type': 'rugby league biography', 'parameters...",Person
1,Amaranthus mantegazzianus,redirect,Amaranthus caudatus,redirect::Amaranthus caudatus
2,Amaranthus quitensis,redirect,Amaranthus hybridus,redirect::Amaranthus hybridus
3,Maud Queen of Norway,redirect,Maud of Wales,redirect::Maud of Wales
4,Milligram per litre,redirect,Gram per litre,redirect::Gram per litre
5,Utica Psychiatric Center,"{Infobox NRHP | name =Utica State Hospital, Ma...","{'type': 'nrhp', 'parameters': ['name', 'nrhp_...",Location
6,Olean Wholesale Grocery,no infobox/redirect,{'categories': ['Companies based in Cattaraugu...,C_Company
7,Queen Tiye,redirect,Tiye,redirect::Tiye
8,Queen Hatshepsut,redirect,Hatshepsut,redirect::Hatshepsut
9,Clibanarii,no infobox/redirect,"{'categories': ['Cavalry', 'Asian armour', 'Ty...",Other/None


In [87]:
df[3].value_counts()

Other/None        79378
Other             21720
Person            20884
C_Person          12142
Location          11074
C_Location         6246
A_Location         4270
C_Organization     3251
C_Company          2491
Company            1830
A_Person           1591
Organization        681
A_Company             9
Name: 3, dtype: int64

In [314]:
data_path = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
sample_data_path = '/home/xminarikd/Documents/VINF/data/sample_wiki_articles2.xml.bz2'
# Object for handling xml
handler = ContentHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    parser.feed(line)
    
    if len(handler.output) > 20000:
        break

print(handler.output[2][1])
#print(regex.search(exp_inf_type, infobox).group().strip())

redirect


In [32]:
df2 = pd.DataFrame(data)
rr = df2.loc[df2[3] == 'Other/None']
rr = rr.loc[rr[2] == {'categories':[]}]

Speciesbox
Citation
Image
div
Licensing
summary
May refers to
Use dmy dates

In [33]:
rr

Unnamed: 0,0,1,2,3,4


In [61]:
temp5 = ['History of Atlanta',
  'North Carolina in the American Civil War',
  'Shipping companies of the United States',
  'Companies based in Virginia']
list(filter(lambda x: regex.search('(compan[y|ies])(?i)', x), temp5))

['Shipping companies of the United States', 'Companies based in Virginia']

In [299]:
temp2 = ['ano','nie jasd sad', 'asdasdasd asd']
temp3 = []

ano


In [322]:
list(filter(lambda x: regex.search('(organisations*|associations*)(?i)', x),temp))

['Organisations based in Manama']

In [40]:
tt = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
tt.split('/')[-1].split('-')[-1].split('.')[0]

'p2936261p4045402'

In [7]:
import os
dirname = os.getcwd().rsplit('/', 1)[0]
dirname = f'{dirname}/data/sample_wiki_articles2.xml.bz2'
dirname

'/home/xminarikd/Documents/VINF/data/sample_wiki_articles2.xml.bz2'

In [1]:
tt = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
tt

'/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'