# Creating a dictionary of pairs for the purposes of Named Entity Recognition
#### Dictionary of pairs: Wiki page - type

The project is currently divided into 2 parts.

1. The first part downloads the files needed for processing in the project (the Wikipedia dump).
2. The second part parses the files and assigns a category to each article.

## 1. Part : Downloading Wikipedia articles


In [104]:
import requests
from bs4 import BeautifulSoup
import os
import re
from functools import reduce

Download the file listing from the Wikipedia dump page and filter all files whose name contains "pages-articles".

In [4]:
base_url = 'https://dumps.wikimedia.org/enwiki/20201001/'
base_html = requests.get(base_url).text
base_html[:15]

'<!DOCTYPE html '

In [5]:
soup_dump = BeautifulSoup(base_html, 'html.parser')
soup_dump.find_all('li', {'class': 'file'}, limit = 10)[0]

<li class="file"><a href="/enwiki/20201001/enwiki-20201001-pages-articles-multistream.xml.bz2">enwiki-20201001-pages-articles-multistream.xml.bz2</a> 17.5 GB</li>

In [6]:
files = []
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))
files[:5]

[('enwiki-20201001-pages-articles-multistream.xml.bz2', ['17.5', 'GB']),
 ('enwiki-20201001-pages-articles-multistream-index.txt.bz2', ['215.8', 'MB']),
 ('enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2',
  ['231.7', 'MB']),
 ('enwiki-20201001-pages-articles-multistream-index1.txt-p1p41242.bz2',
  ['222', 'KB']),
 ('enwiki-20201001-pages-articles-multistream2.xml-p41243p151573.bz2',
  ['313.2', 'MB'])]

In [7]:
files_to_download = [file[0] for file in files if re.search(r'pages-articles\d{1,2}\.xml-p', file[0])]
files_to_download[:5]

['enwiki-20201001-pages-articles1.xml-p1p41242.bz2',
 'enwiki-20201001-pages-articles2.xml-p41243p151573.bz2',
 'enwiki-20201001-pages-articles3.xml-p151574p311329.bz2',
 'enwiki-20201001-pages-articles4.xml-p311330p558391.bz2',
 'enwiki-20201001-pages-articles5.xml-p558392p958045.bz2']

The Keras library is used to download these files (the dataset). Only files that have not been downloaded yet are fetched.

In [70]:
import sys
from keras.utils import get_file
directory = '/home/xminarikd/.keras/datasets/'

In [74]:
data_paths = []
file_info = []

for file in files_to_download:
    path = directory + file

    if not os.path.exists(directory):
        print('Directory does not exist')
    # Download only when the file does not exist yet
    if not os.path.exists(path):
        print('Downloading')
        data_paths.append(get_file(file, base_url + file))
    # When the file already exists, reuse the local copy
    else:
        data_paths.append(path)

    # File size in MB
    file_size = os.stat(path).st_size / 1e6

    # Approximate the number of articles from the page-id range in the file name
    file_articles = int(file.split('p')[-1].split('.')[-2]) - int(file.split('p')[-2])
    file_info.append((file, file_size, file_articles))

Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles1.xml-p1p41242.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles2.xml-p41243p151573.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles3.xml-p151574p311329.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles4.xml-p311330p558391.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles5.xml-p558392p958045.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles6.xml-p958046p1483661.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles7.xml-p1483662p2134111.bz2
Downloading
Downloading data from https://dumps.wikimedia.org/enwiki/2020

In [76]:
sorted(file_info, key = lambda x: x[1], reverse = True)[:5]

[('enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2',
  512.145263,
  1109141),
 ('enwiki-20201001-pages-articles10.xml-p4045403p5399366.bz2',
  502.44982,
  1353963),
 ('enwiki-20201001-pages-articles11.xml-p5399367p6899366.bz2',
  486.770345,
  1499999),
 ('enwiki-20201001-pages-articles8.xml-p2134112p2936260.bz2',
  471.652241,
  802148),
 ('enwiki-20201001-pages-articles7.xml-p1483662p2134111.bz2',
  462.504094,
  650449)]
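
The article counts above come from the page-id range embedded in each file name. A minimal sketch of the same calculation with a single regular expression instead of chained `split` calls follows. `page_range` is a hypothetical helper that is not part of the notebook, and like the original code it only approximates the article count as the difference of the two page ids.

```python
import re

def page_range(file_name):
    """Return (first_page_id, last_page_id) parsed from a dump partition name."""
    match = re.search(r'-p(\d+)p(\d+)\.bz2$', file_name)
    if match is None:
        raise ValueError(f'no page range in {file_name!r}')
    return int(match.group(1)), int(match.group(2))

first, last = page_range('enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2')
approx_articles = last - first  # same value as the third tuple field above
```

Raising on a non-matching name makes an unexpected file name visible immediately instead of producing a wrong count.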

## 2. Part : Parsing data

Parsing runs sequentially over all files while they stay in compressed form. A `bzcat` subprocess reads each file and delivers it line by line. The data is processed with an XML SAX parser. The parser is given a `ContentHandler` class that keeps the incoming text in a buffer while looking for the tags (page, title, text). Once the closing `page` tag is found, the whole article is processed.

The following information is extracted from the article with regular expressions:
* **infobox**
    * infobox attributes
    * infobox type
* **article categories**

The category of the article is then determined from this information.
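
The streaming idea can be sketched without the external `bzcat` process. The example below is only an illustration under stated assumptions: it swaps in Python's standard `bz2` module, uses a tiny in-memory XML sample instead of a real dump, and `TitleCollector` is a hypothetical handler that collects titles only.

```python
import bz2
import io
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.handler.ContentHandler):
    """Collects the text of every <title> element in the stream."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None

    def startElement(self, name, attrs):
        if name == 'title':
            self._buf = []

    def characters(self, content):
        # buffer text only while inside a <title> element
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self._buf))
            self._buf = None

# Tiny stand-in for a dump partition, compressed in memory
sample_xml = (b'<mediawiki>\n'
              b'<page><title>A</title><text>x</text></page>\n'
              b'<page><title>B</title><text>y</text></page>\n'
              b'</mediawiki>\n')
compressed = io.BytesIO(bz2.compress(sample_xml))

handler = TitleCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Decompress lazily and feed the parser one line at a time
with bz2.open(compressed, 'rb') as stream:
    for line in stream:
        parser.feed(line)
parser.close()
```

Because the parser is fed incrementally, memory use stays bounded by the size of a single article rather than the whole partition.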

In [10]:
import subprocess
import xml.sax
import regex
import pandas as pd
from functools import reduce
import requests
from bs4 import BeautifulSoup
import re
import json

Currently the following categories are assigned: Person, Company, Organisation, Place.
Assignment follows a priority order, based on these parameters in the order listed:
* **infobox type** - whether the infobox of the article appears in the list for the given category
* **infobox attributes:**
    * **person** - birth_date
    * **company** - industry, trade_name, products, brands
    * **organisation** - none yet
    * **place** - coordinates, locations _|does not contain|_ date, founded, founder, founders
* **article categories:**
    * **organisation** - the categories contain the word organisation(s)
* **article text** - not used yet, but planned for cases where the article has no infobox and the categories provide no information
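
The rules listed above can be sketched as a first-match-wins chain. This is a simplified illustration, not the notebook's `predictCategory`: the `assign_category` function and the contents of `type_lists` are hypothetical, and the sketch ignores the `W_`/`Q_` confidence prefixes used in the actual code.

```python
def assign_category(info, type_lists):
    """Return the first category whose rule matches, checked in priority order."""
    params = set(info.get('parameters', []))
    categories = info.get('categories', [])
    i_type = info.get('type')

    if i_type in type_lists['person'] or 'birth_date' in params:
        return 'Person'
    if i_type in type_lists['company'] or params & {'industry', 'trade_name', 'products', 'brands'}:
        return 'Company'
    if i_type in type_lists['organisation'] or any('organisation' in c.lower() for c in categories):
        return 'Organisation'
    has_place_attr = params & {'coordinates', 'locations'}
    has_event_attr = params & {'date', 'founded', 'founder', 'founders'}
    if i_type in type_lists['place'] or (has_place_attr and not has_event_attr):
        return 'Place'
    return 'Other'

# Hypothetical infobox type lists, for illustration only
type_lists = {
    'person': {'person', 'officeholder'},
    'company': {'company'},
    'organisation': {'organization'},
    'place': {'settlement'},
}

school = assign_category({'type': 'school', 'parameters': ['name', 'coordinates']}, type_lists)
event = assign_category({'type': 'x', 'parameters': ['coordinates', 'founded']}, type_lists)
```

The second call shows the exclusion rule: a `coordinates` attribute alone does not make a place when a founding attribute is also present.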

In [65]:
# Get the infobox and the infobox type from the article text
def ArticleHandler():
    # balanced {Infobox ...} block; the recursive (?1) needs the `regex` module
    infobox_regex = r'(?=\{Infobox )(\{([^{}]|(?1))*\})'
    inf_type_regex = r'(?<=Infobox)(.*?)(?=(\||\n|<!-|<--))'
    #https://regex101.com/r/1vJlms/1
    inf_parameters = r'(?(?<=\|)|(?<=\|\s))(\w*)\s*=\s*[\w{\[]'
    #https://regex101.com/r/fl5hAw/1 https://regex101.com/r/Xj0fM3/1
    redirect_title = r'(?<=\[\[)(.*)(?=\]\])'
    categories = r'(?<=\[\[Category:)([^\]]*)(?=\]\])'
    
    
    def getCategories(text):
        return regex.findall(categories, text)
    
    
    def getArticleAtributes(infobox,text):
        i_par = regex.findall(inf_parameters, infobox)
        i_type = regex.search(inf_type_regex, infobox)
        i_type = i_type.group(0).strip() if i_type is not None else "none"
        return {'type': i_type.lower(), 'parameters': i_par, 'categories': list(getCategories(text))}
        
        
    def isRedirect(text):
        return regex.search(r"^#redirect\s*\[\[", text, flags=regex.IGNORECASE)
        
        
    def getInfobox(text):
        infobox = regex.search(infobox_regex, text)
        return infobox.group() if infobox is not None else "redirect" if isRedirect(text) is not None else "no infobox/redirect"
    
    
    def predictCategory(infobox, info, text):
        if infobox not in ['redirect', 'no infobox/redirect']:
            if info['type'] in persons or 'birth_date' in info['parameters']:
                return "Person"
            elif info['type'] in companies:
                return 'Company'
            elif any(i in info['parameters'] for i in ['industry', 'trade_name', 'products', 'brands']):
                return 'W_Company'
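            # raw string: in a normal string literal '\b' is a backspace, not a word boundary
            elif list(filter(lambda x: regex.search(r'\b(compan(y|ies))\b', x, flags=regex.IGNORECASE), info['categories'])):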
                return 'Q_Company'
            
            elif info['type'] in organizations:
                return "Organization"
            elif list(filter(lambda x: regex.search('(organisations*)(?i)', x), info['categories'])):
                return 'W_Organization'
            
            elif info['type'] in locations:
                return "Location"
            elif any(i in info['parameters'] for i in ['coordinates', 'locations']) and not(any(i in info['parameters'] for i in ['date', 'founded', 'founder', 'founders'])):
                return 'W_Location'
            
            else:
                return "Other"
            # these articles only have categories
        elif infobox == 'no infobox/redirect':
            # alternation (y|ies) instead of the character class [y|ies]
            if list(filter(lambda x: regex.search(r'\b(compan(y|ies))\b', x, flags=regex.IGNORECASE), info['categories'])):
                return 'Q_Company'
            elif list(filter(lambda x: regex.search('(organisations*)(?i)', x), info['categories'])):
                return 'Q_Organization'
            else:
                return None
            
            
            
    
    
    def processArticle(title, text):
        infobox = getInfobox(text)
        
        if infobox == "redirect":
            info = regex.search(redirect_title, text).group(0)
        
        elif infobox == 'no infobox/redirect':
            info = {'categories': list(getCategories(text))}
        
        else:
            info = getArticleAtributes(infobox, text)

        return (title, infobox, info, text, predictCategory(infobox, info, text))
    return processArticle
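
The infobox extraction above relies on a recursive pattern from the third-party `regex` module. As a rough alternative sketch, the same balanced-brace matching can be done with a plain depth counter. `extract_infobox` is a hypothetical helper that is not part of the notebook, and it assumes the template is opened with the literal text `{{Infobox`.

```python
def extract_infobox(text):
    """Return the first balanced {{Infobox ...}} template in text, or None."""
    start = text.find('{{Infobox')
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == '{{':
            depth += 1
            i += 2
        elif pair == '}}':
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None  # template was never closed

article = ('Intro {{Infobox person\n| name = A\n'
           '| birth_date = {{birth date|1970|1|1}}\n}} rest of the article')
infobox = extract_infobox(article)
```

The nested `{{birth date|...}}` template raises the depth to 2, so its closing braces do not end the infobox early.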

In [12]:
#docs: https://docs.python.org/3.8/library/xml.sax.handler.html
class ContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buf = None
        self._last_tag = None
        self._parts = {}
        self.output = []
        self.article_process = ArticleHandler()

    def characters(self, content):
        if self._last_tag:
            self._buf.append(content)

    def startElement(self, name, attrs):
        if name == 'page':
            self._parts = {}
        if name in ('title', 'text'):
            self._last_tag = name
            self._buf = []

    def endElement(self, name):
        if name == self._last_tag:
            self._parts[name] = ''.join(self._buf)
            # stop buffering until the next title/text element starts
            self._last_tag = None

        # closing </page>: the whole article has been read
        if name == 'page':
            self.output.append(self.article_process(**self._parts))

In [19]:
def parseWiki(limit = 2000, save = False, test_sample=False, data = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'):
    
    if test_sample:
        data = os.getcwd().rsplit('/', 1)[0]
        data = f'{data}/data/sample_wiki_articles2.xml.bz2'
        print(data)
    
    handler = ContentHandler()

    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

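    # Stream the decompressed dump line by line through bzcat
    with open(data, 'rb') as compressed:
        process = subprocess.Popen(['bzcat'], stdin=compressed, stdout=subprocess.PIPE)
        for i, line in enumerate(process.stdout):
            if (i + 1) % 10000 == 0:
                print(f'Processed {i + 1} lines.', end='\r')
            parser.feed(line)

            # stop once enough articles have been collected
            if len(handler.output) >= limit:
                break
        # do not leave bzcat running after an early break
        process.kill()
        process.wait()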
        
    if save:
        output_dir = os.getcwd().rsplit('/', 1)[0]
        partition_name = data.split('/')[-1].split('-')[-1].split('.')[0]
        output_file = f'{output_dir}/output/{partition_name}.json'

        with open(output_file, 'w+') as file:
            for x in handler.output:
                file.write(json.dumps({x[0]:x[4]}) + '\n')

    
    return handler.output
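
With `save=True`, `parseWiki` writes one JSON object per line, each mapping an article title to its predicted type. A minimal sketch of reading such a file back into a single dictionary follows. `load_pairs` is a hypothetical helper, and an in-memory buffer stands in for a real output file.

```python
import io
import json

def load_pairs(lines):
    """Merge one-JSON-object-per-line records into a single title -> type dict."""
    pairs = {}
    for line in lines:
        line = line.strip()
        if line:
            pairs.update(json.loads(line))
    return pairs

# In-memory stand-in for a saved partition file
saved = io.StringIO('{"David Stagg": "Person"}\n{"KTXT": null}\n')
pairs = load_pairs(saved)
```

Any iterable of lines works, so an opened output file can be passed directly in place of the `StringIO` buffer.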

Download and parse the Wikipedia page that contains the list of infobox types. The list also groups these types into various categories, which makes it easy to obtain, for example, all infoboxes related to persons.

In [14]:
infobox_list_url = 'https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes'
infobox_list_html = requests.get(infobox_list_url).text
soup_dump = BeautifulSoup(infobox_list_html, 'html.parser')
#sib = soup_dump.find_all("div" ,{'id': 'toc'}).next_sibling

template_list = dict();
prev = None
prev_tag = None
prev_parent = None
prev_parent_tag = 2

for i, sibling in enumerate(soup_dump.find(id="toc").next_siblings):
    
    if prev_parent == 'Other':
        break
    
    if sibling.name == 'h2':
        template_list[sibling.findChild().text] = {}
        prev_parent = sibling.findChild().text
        prev_tag = 2
    
    if sibling.name == 'h3':
        # start a new subsection list under the current section
        template_list[prev_parent][sibling.findChild().text] = list()
        prev = sibling.findChild().text
        prev_tag = 3
            
            
    if sibling.name == 'ul':
        a = sibling.find_all('a', title=re.compile('^Template:Infobox'))
        b = map(lambda x: regex.findall('(?<=Template:Infobox )(.*)(?i)', x.text.lower()), a)
        c = reduce(lambda x,y: x+y, b, list())
        
        if prev_tag >=3:
            template_list[prev_parent][prev] = [y for x in [template_list[prev_parent][prev], list(c)] for y in x] 
        else:
            template_list[prev_parent] = list(c)
#template_list

In [15]:
persons = list(reduce(lambda x,y: x+y, template_list["Person"].values()))
locations = list(reduce(lambda x,y: x+y, template_list["Place"].values()))
companies = template_list['Society and social science']['Business and economics']
organizations = template_list['Society and social science']['Organization']
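
The flattening of the nested structure into `persons` and `locations` can also be sketched with `itertools.chain`. The `template_list` excerpt below is made up for illustration, and `flatten_section` is a hypothetical helper. Unlike `reduce` without an initial value, this version also handles a section with no subsections.

```python
from itertools import chain

# Hypothetical excerpt of the nested structure built above
template_list = {
    'Person': {'General': ['person'], 'Sports': ['football biography', 'nfl player']},
    'Place': {'Settlements': ['settlement', 'uk place']},
}

def flatten_section(section):
    """Concatenate all subsection lists of one top-level section."""
    return list(chain.from_iterable(section.values()))

persons = flatten_section(template_list['Person'])
locations = flatten_section(template_list['Place'])
empty = flatten_section({})
```

If the lists are only used for membership tests, converting the result to a `set` would make each lookup constant time.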

### Main

Run the function that processes the files.

In [67]:
data = parseWiki(test_sample=True, save=True)

for i, x in enumerate(data):
    if i > 150:
        break
    if x[1] == 'redirect':
        print(x[0], '<-->', x[1])
    else:
        print(x[0], '<-->', x[4])

/home/xminarikd/Documents/VINF/data/sample_wiki_articles2.xml.bz2
David Stagg <--> Person
Amaranthus mantegazzianus <--> redirect
Amaranthus quitensis <--> redirect
KTXT <--> None
Maud Queen of Norway <--> redirect
Milligram per litre <--> redirect
Utica Psychiatric Center <--> Location
Wikipedia:Articles for deletion/Studiomuscle <--> None
Olean Wholesale Grocery <--> Q_Company
Queen Tiye <--> redirect
Queen Hatshepsut <--> redirect
Clibanarii <--> None
File:Hanns Martin Schleyer in captivity.jpg <--> None
Political documentary <--> redirect
Wikipedia:Articles for deletion/Slarp <--> None
Final fantasy legends <--> redirect
Queen Marie Amelie Therese <--> redirect
Political documentaries <--> redirect
E-767 <--> redirect
Wikipedia:Articles for deletion/"dirty thirty" <--> None
Prince Edward-Lennox <--> redirect
Arthur Hill (actor) <--> Person
Periodic paralysis <--> Other
Greenstripe <--> redirect
Amaranthus cruentus <--> None
Careless weed <--> redirect
Zamil idris <--> redirect
Khad

## Test and data searching area

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

In [15]:
for i, line in enumerate(subprocess.Popen(['bzcat'], 
                              stdin = open(test_data_path), 
                              stdout = subprocess.PIPE).stdout):
    print(i,'<==>', line)
    if i > 50:
        break

0 <==> b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">\n'
1 <==> b'  <siteinfo>\n'
2 <==> b'    <sitename>Wikipedia</sitename>\n'
3 <==> b'    <dbname>enwiki</dbname>\n'
4 <==> b'    <base>https://en.wikipedia.org/wiki/Main_Page</base>\n'
5 <==> b'    <generator>MediaWiki 1.36.0-wmf.10</generator>\n'
6 <==> b'    <case>first-letter</case>\n'
7 <==> b'    <namespaces>\n'
8 <==> b'      <namespace key="-2" case="first-letter">Media</namespace>\n'
9 <==> b'      <namespace key="-1" case="first-letter">Special</namespace>\n'
10 <==> b'      <namespace key="0" case="first-letter" />\n'
11 <==> b'      <namespace key="1" case="first-letter">Talk</namespace>\n'
12 <==> b'      <namespace key="2" case="first-letter">User</namespace>\n'
13 <==> b'      <namespace key="3" case="first-letter">

In [68]:
df = pd.DataFrame(data)
df.head(80)

Unnamed: 0,0,1,2,3,4
0,David Stagg,{Infobox rugby league biography\n'\nb'|name ...,"{'type': 'rugby league biography\n'', 'paramet...",{{short description|Australian rugby league fo...,Person
1,Amaranthus mantegazzianus,redirect,Amaranthus caudatus,#REDIRECT [[Amaranthus caudatus]]\n'\nb'\n'\nb...,
2,Amaranthus quitensis,redirect,Amaranthus hybridus,#redirect [[Amaranthus hybridus]] {{R from tax...,
3,KTXT,no infobox/redirect,{'categories': []},\'\'\'KTXT\'\'\' may refer to:\n'\nb'\n'\nb'* ...,
4,Maud Queen of Norway,redirect,Maud of Wales,#REDIRECT [[Maud of Wales]],
5,Milligram per litre,redirect,Gram per litre,#REDIRECT [[Gram per litre]],
6,Utica Psychiatric Center,"{Infobox NRHP | name =Utica State Hospital, Ma...","{'type': 'nrhp', 'parameters': ['name', 'nrhp_...",{{Use mdy dates|date=December 2017}}\n'\nb'{{I...,Location
7,Wikipedia:Articles for deletion/Studiomuscle,no infobox/redirect,{'categories': []},"<div class=""boilerplate metadata vfd"" style=""b...",
8,Olean Wholesale Grocery,no infobox/redirect,{'categories': ['Companies based in Cattaraugu...,{{Citation style|date=October 2019}}[[File:Ole...,Q_Company
9,Queen Tiye,redirect,Tiye,#REDIRECT[[Tiye]],


In [12]:
df[2].unique()

array(['rugby league biography', 'No infobox', 'NRHP', 'person',
       'medical condition (new)', 'Italian comune', 'sports season',
       'NFL player', 'school', 'character', 'officeholder', 'comic',
       'UK place', 'song', 'military unit', 'medical person', 'bridge',
       'Basketball club', 'civilian attack'], dtype=object)

In [22]:
df[1].value_counts()

person                  284
settlement              255
album                   192
airport                 153
football biography      148
                       ... 
Public transit            1
Congressperson            1
cattle breed              1
Monstertruck              1
laboratory equipment      1
Name: 1, Length: 444, dtype: int64

In [314]:
data_path = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
sample_data_path = '/home/xminarikd/Documents/VINF/data/sample_wiki_articles2.xml.bz2'
# Object for handling xml
handler = ContentHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    parser.feed(line)
    
    if len(handler.output) > 20000:
        break

print(handler.output[2][1])
#print(regex.search(exp_inf_type, infobox).group().strip())

redirect


In [28]:
df2 = pd.DataFrame(data)
rr = df2.loc[df2[4] == 'W_Location']
rr

Unnamed: 0,0,1,2,3,4
59,Bartley Secondary School,{Infobox school\n | name = Bartle...,"{'type': 'school', 'parameters': ['name', 'mot...",{{Infobox school\n | name = Bartl...,W_Location
377,WBXX-TV,{Infobox television station\n| callsign ...,"{'type': 'television station', 'parameters': [...",{{short description|CW affiliate in Crossville...,W_Location
532,"Linn, Aargau",{Infobox Swiss town\n | subject_name = Linn\n...,"{'type': 'swiss town', 'parameters': ['subject...",{{Use dmy dates|date=June 2013}}\n{{Infobox Sw...,W_Location
535,Leichlingen,{Infobox German location\n|type ...,"{'type': 'german location', 'parameters': ['ty...",{{Infobox German location\n|type ...,W_Location
553,Hammelburg,{Infobox German location\n|type ...,"{'type': 'german location', 'parameters': ['ty...",{{Infobox German location\n|type ...,W_Location
568,Engstingen,{Infobox German location\n|image_photo =...,"{'type': 'german location', 'parameters': ['im...",{{Infobox German location\n|image_photo ...,W_Location
597,Place Saint-Sulpice,{Infobox street\n| name = Place Saint-Sulpice\...,"{'type': 'street', 'parameters': ['name', 'ima...",{{Use dmy dates|date=June 2012}}\n{{Unreferenc...,W_Location
606,Mount Florida railway station,{Infobox UK station\n|symbol = rail\n|na...,"{'type': 'uk station', 'parameters': ['symbol'...",{{Use dmy dates|date=March 2020}}\n{{Infobox U...,W_Location
657,Budenheim,{Infobox German location\n| name ...,"{'type': 'german location', 'parameters': ['im...",{{Infobox German location\n| name ...,W_Location
927,Shavers Fork Mountain Complex,{Infobox mountain range\n<!-- *** Name section...,"{'type': 'mountain range', 'parameters': ['nam...",{{Infobox mountain range\n<!-- *** Name sectio...,W_Location


Speciesbox
Citation
Image
div
Licensing
summary
May refers to
Use dmy dates

In [46]:
temp = data[52][2]
temp

{'categories': ['History of Atlanta',
  'North Carolina in the American Civil War',
  'Shipping companies of the United States',
  'Companies based in Virginia']}

In [61]:
temp5 = ['History of Atlanta',
  'North Carolina in the American Civil War',
  'Shipping companies of the United States',
  'Companies based in Virginia']
list(filter(lambda x: regex.search(r'\b(compan(y|ies))\b', x, flags=regex.IGNORECASE), temp5))

['Shipping companies of the United States', 'Companies based in Virginia']
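
For reference, a character class such as `compan[y|ies]` matches `compan` followed by any single one of the characters `y`, `|`, `i`, `e`, `s`, whereas an alternation `compan(y|ies)` matches the two whole endings. A small sketch of the difference using the standard `re` module follows; the sample strings are made up.

```python
import re

char_class = re.compile(r'compan[y|ies]', re.IGNORECASE)
alternation = re.compile(r'\bcompan(?:y|ies)\b', re.IGNORECASE)

samples = ['Companies based in Virginia',
           'Companions of the Prophet',
           'History of Atlanta']

# the character class also accepts "Compani..." in "Companions"
loose = [s for s in samples if char_class.search(s)]
strict = [s for s in samples if alternation.search(s)]
```

On real category lists the two patterns often agree, which is why the looser form can go unnoticed.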

In [299]:
temp2 = ['ano','nie jasd sad', 'asdasdasd asd']
temp3 = []

ano


In [322]:
list(filter(lambda x: regex.search('(organisations*|associations*)(?i)', x),temp))

['Organisations based in Manama']

In [40]:
tt = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
tt.split('/')[-1].split('-')[-1].split('.')[0]

'p2936261p4045402'

In [7]:
import os
dirname = os.getcwd().rsplit('/', 1)[0]
dirname = f'{dirname}/data/sample_wiki_articles2.xml.bz2'
dirname

'/home/xminarikd/Documents/VINF/data/sample_wiki_articles2.xml.bz2'

In [1]:
tt = '/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'
tt

'/home/xminarikd/.keras/datasets/enwiki-20201001-pages-articles9.xml-p2936261p4045402.bz2'