## 1. Domain

We choose the following domain: writers, the books they wrote, and the years in which those books were published.

In [234]:
import pandas as pd
import numpy as np
import re
from SPARQLWrapper import SPARQLWrapper, JSON

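The SPARQL query that retrieves the data for this domain is defined below. The intent is to consider only books written in English. Note, however, that the language pattern is wrapped in `OPTIONAL`, so it does not actually restrict the results to English-language books.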

In [264]:
sparql_query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbp: <http://dbpedia.org/property/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    SELECT DISTINCT ?writer ?book ?date
    WHERE {
    ?writer rdf:type <http://dbpedia.org/ontology/Person> .
    ?writer dbo:notableWork ?book .
    ?book rdf:type <http://dbpedia.org/ontology/Book> .
    #?book dbo:language ?lang .
    #FILTER( ?lang IN (dbr:English_language, dbr:English)) .
    OPTIONAL {?book dbo:language dbr:English_language} .

    OPTIONAL {?book dbp:releaseDate ?date} .
    OPTIONAL {?book dbp:englishReleaseDate ?date} .
    OPTIONAL {?book dbp:pubDate ?date} .
    OPTIONAL {?book dbp:published ?date} .
    OPTIONAL {?book dbp:publicationDate ?date} .

}
"""

In [265]:
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(sparql_query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
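
For orientation, `convert()` with the JSON return format yields a dictionary in the standard SPARQL JSON results layout: the rows sit under `results["results"]["bindings"]`, and each bound variable maps to an object with a `value` field. The sketch below walks a hand-written sample of that shape. The sample rows and the names `sample_results` and `local_name` are assumptions made for illustration, not actual endpoint output.

```python
# Hand-written sample in the SPARQL JSON results layout (illustrative, not real endpoint output).
sample_results = {
    "head": {"vars": ["writer", "book", "date"]},
    "results": {
        "bindings": [
            {
                "writer": {"type": "uri", "value": "http://dbpedia.org/resource/Samuel_Beckett"},
                "book": {"type": "uri", "value": "http://dbpedia.org/resource/Watt_(novel)"},
                "date": {"type": "literal", "value": "1953"},
            },
            {
                # An OPTIONAL pattern that did not match leaves the variable unbound,
                # so the "date" key is simply absent from this binding.
                "writer": {"type": "uri", "value": "http://dbpedia.org/resource/Samuel_Beckett"},
                "book": {"type": "uri", "value": "http://dbpedia.org/resource/Murphy_(novel)"},
            },
        ]
    },
}


def local_name(uri: str) -> str:
    """Return the last path segment of a resource URI."""
    return uri.rsplit("/", 1)[-1]


rows = []
for binding in sample_results["results"]["bindings"]:
    date = binding.get("date", {}).get("value")  # None when ?date is unbound
    rows.append((local_name(binding["writer"]["value"]),
                 local_name(binding["book"]["value"]),
                 date))

print(rows)
```

Reading the date with `.get` keeps rows whose date is unbound, whereas indexing `binding["date"]` directly raises `KeyError` for them.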

In [266]:
def convert_from_dbpedia_book_str(dbpedia_book_str: str):
    book = re.sub(r'\(.*\)', '', dbpedia_book_str).replace('_', ' ').strip()
    return book

In [287]:
def acquire_years_from_string(s: str):
    # Always return a set so callers can merge results with set.update().
    if s is None:
        return set()
    years = re.findall(r"(\d\d\d\d)", s)
    return {int(year) for year in years}

In [326]:
def acquire_year_from_string(s: str):
    if s is None:
        return None
    year = re.search(r"(\d\d\d\d)", s)
    if year is None:
        return None
    return int(year[0])
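
To make the behaviour of the two year helpers concrete, here is a minimal standalone sketch of the same idea: any run of four digits is treated as a year. The names `YEAR_PATTERN`, `years_in`, and `first_year_in` and the sample strings are assumptions made for illustration.

```python
import re

# Four consecutive digits; a raw string avoids invalid-escape warnings.
YEAR_PATTERN = re.compile(r"\d{4}")


def years_in(text: str) -> set:
    """Collect every four-digit run in `text` as an integer."""
    return {int(match) for match in YEAR_PATTERN.findall(text)}


def first_year_in(text: str):
    """Return the first four-digit run in `text`, or None if there is none."""
    match = YEAR_PATTERN.search(text)
    return int(match.group()) if match else None


print(years_in("1847-1848 (serial), 1848 (book)"))
print(first_year_in("published in June 1953"))
print(first_year_in("no date given"))

# Caveat: the pattern has no boundary checks, so the first four digits of a
# longer number are also picked up as a "year".
print(years_in("12345"))
```

If such false positives matter, one option is to add word boundaries to the pattern or restrict the accepted range of years.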

In [289]:
def convert_sparql_result_to_data_frame(results):
    df_dict = {
        'writer' : [],
        'book': [],
        'year': []
    }
    
    for result in results["results"]["bindings"]:
        try:
            writer = result["writer"]["value"].split('/')[-1]
            book = result["book"]["value"].split('/')[-1]
            book = convert_from_dbpedia_book_str(book)
            year = acquire_years_from_string(result["date"]["value"])
                    
            df_dict['writer'].append(writer)
            df_dict['book'].append(book)
            df_dict['year'].append(year)
        except KeyError:
            # Skip bindings without a date (none of the OPTIONAL date patterns matched).
            continue
        
    return pd.DataFrame(df_dict)

In [302]:
def convert_sparql_result_to_writer_books_dict(results):
    # writer -> book title -> set of publication years
    writer_books_dict: dict[str, dict[str, set[int]]] = dict()
    for result in results["results"]["bindings"]:
        try:
            writer = result["writer"]["value"].split('/')[-1]
            book = result["book"]["value"].split('/')[-1]
            book = convert_from_dbpedia_book_str(book)
            years = acquire_years_from_string(result["date"]["value"])

            if writer not in writer_books_dict:
                writer_books_dict[writer] = dict()
            
            if book in writer_books_dict[writer]:
                writer_books_dict[writer][book].update(years)
            else:
                writer_books_dict[writer][book] = years
            
        except KeyError:
            # Skip bindings without a date (none of the OPTIONAL date patterns matched).
            continue
        
    return writer_books_dict

In [303]:
df = convert_sparql_result_to_data_frame(results)

In [304]:
df.head()

| | writer | book | year |
|---|---|---|---|
| 0 | William_Makepeace_Thackeray | Vanity Fair | {1848, 1847} |
| 1 | Nalo_Hopkinson | Brown Girl in the Ring | {1998} |
| 2 | Theodore_Judson | Fitzpatrick's War | {2004} |
| 3 | Robert_M._Pirsig | Lila: An Inquiry into Morals | {1991} |
| 4 | Samuel_Beckett | Watt | {1953} |


In [305]:
dbpedia_writer_books_dict = convert_sparql_result_to_writer_books_dict(results)

In [308]:
dbpedia_writer_books_dict

{'William_Makepeace_Thackeray': {'Vanity Fair': {1847, 1848}},
 'Nalo_Hopkinson': {'Brown Girl in the Ring': {1998},
  'Skin Folk': {2001},
  'The Salt Roads': {2003}},
 'Theodore_Judson': {"Fitzpatrick's War": {2004},
  "The Martian General's Daughter": {2008}},
 'Robert_M._Pirsig': {'Lila: An Inquiry into Morals': {1991},
  'Zen and the Art of Motorcycle Maintenance': {1974}},
 'Samuel_Beckett': {'Watt': {1953},
  'Murphy': {1938},
  'Malone Dies': {1951},
  'Molloy': {1951, 1955}},
 'Alan_Lawrence_Sitomer': {'The Hoopster': {2005},
  'Hip Hop High School': {2006},
  'Homeboyz': {2007}},
 'Ben_Cormack': {'The Story of Egmo': {2006}},
 'Melina_Marchetta': {'Looking for Alibrandi': {1992},
  'On the Jellicoe Road': {2006}},
 'Steele_Rudd': {'On Our Selection': {1899}},
 'Chinua_Achebe': {'Arrow of God': {1964},
  'A Man of the People': {1966},
  'Anthills of the Savannah': {1987},
  'No Longer at Ease': {1960},
  'Things Fall Apart': {1958}},
 'Eric_L._Harry': {'Arc Light': {1994}},
 '

In [306]:
len(dbpedia_writer_books_dict)

1336

## 2. Fact extraction

In [34]:
import wikipediaapi
from typing import Dict, List, Tuple

In [21]:
wiki_en = wikipediaapi.Wikipedia('en')

2.1. Write a program that finds Wikipedia articles about entities belonging to your domain and extracts the text of those articles.

In [28]:
def acquire_wiki_text(wiki, page_name: str):
    page = wiki.page(page_name)
    if not page.exists():
        return None
    
    return page.text

In [23]:
writer_wikitext_dict: Dict[str, str] = dict()

In [343]:
for writer in dbpedia_writer_books_dict.keys():    
    wiki_text = acquire_wiki_text(wiki_en, writer)
    if wiki_text is None:
        continue
        
    writer_wikitext_dict[writer] = wiki_text

In [344]:
writer_wikitext_dict['Nalo_Hopkinson']

'Nalo Hopkinson (born 20 December 1960) is a Jamaican-born Canadian speculative fiction writer and editor. She currently lives and teaches in Riverside, California. Her novels (Brown Girl in the Ring, Midnight Robber, The Salt Roads, The New Moon\'s Arms) and short stories such as those in her collection Skin Folk often draw on Caribbean history and language, and its traditions of oral and written storytelling.\nHopkinson has edited two fiction anthologies (Whispers From the Cotton Tree Root: Caribbean Fabulist Fiction and Mojo: Conjure Stories). She was the co-editor with Uppinder Mehan for the anthology So Long Been Dreaming: Postcolonial Visions of the Future, and with Geoff Ryman for Tesseracts 9.\nHopkinson defended George Elliott Clarke\'s novel Whylah Falls on the CBC\'s Canada Reads 2002. She was the curator of Six Impossible Things, an audio series of Canadian fantastical fiction on CBC Radio One.\n\nEarly life and education\nNalo Hopkinson was born 20 December 1960 in Kingsto

2.2. Write a program that processes the article text (the raw text itself, not tables, if there are any) and extracts information about your domain from it. You will compare this information with the database you built.

In [345]:
import spacy

In [346]:
from typing import Dict, List, Optional, Tuple


class BookFinder:

    # Unused in the current implementation.
    DELTA_POS_BOOK_YEAR = 3

    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")
        # spaCy v2 API; in spaCy v3 the equivalent is self.nlp.add_pipe("sentencizer").
        self.sentencizer = self.nlp.create_pipe("sentencizer")
        self.nlp.add_pipe(self.sentencizer)

    def find_books_written_by_author(self, author_description: str) -> Dict[str, Optional[int]]:
        """Map each WORK_OF_ART entity in the text to a year found in the same sentence."""
        result: Dict[str, Optional[int]] = dict()

        sentences: List[str] = self.sent_tokenize(author_description)
        for doc in self.nlp.pipe(sentences, disable=["tagger", "parser"]):
            book_with_pos_list = self._find_book_with_positions(doc.ents)
            for book, pos in book_with_pos_list:
                year = self._find_year_for_book(doc.ents, pos)
                # Keep the first year found for a title; only a missing year is overwritten.
                if book not in result or result[book] is None:
                    result[book] = year
        return result

    def sent_tokenize(self, text: str) -> List[str]:
        """Split the text into sentences using only the sentencizer."""
        sentences = []
        for doc in self.nlp.pipe([text], disable=["tagger", "parser", "ner"]):
            for sent in doc.sents:
                sentences.append(sent.text)
        return sentences

    def _find_book_with_positions(self, entities) -> List[Tuple[str, int]]:
        """Return (text, index) for every WORK_OF_ART entity."""
        return [(entity.text, i) for i, entity in enumerate(entities)
                if entity.label_ == 'WORK_OF_ART']

    def _find_year_for_book(self, entities, book_pos: int) -> Optional[int]:
        """Scan forward from the title for a DATE entity with a year, then backward."""
        for i in range(book_pos, len(entities)):
            if entities[i].label_ == 'DATE':
                year = acquire_year_from_string(entities[i].text)
                if year is not None:
                    return year

        for i in range(book_pos, -1, -1):
            if entities[i].label_ == 'DATE':
                year = acquire_year_from_string(entities[i].text)
                if year is not None:
                    return year

        return None
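
The year lookup in `BookFinder` is a proximity heuristic over the entity list of a sentence: it scans forward from the title for the first `DATE` entity containing a four-digit year and falls back to scanning backward. The sketch below reproduces that logic on plain `(label, text)` tuples so it can be followed without loading a spaCy model. The tuple representation, the names `year_of` and `nearest_year`, and the sample entities are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

Entity = Tuple[str, str]  # (label, text), standing in for a spaCy entity span


def year_of(text: str) -> Optional[int]:
    """Return the first four-digit run in `text` as an int, or None."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


def nearest_year(entities: List[Entity], book_pos: int) -> Optional[int]:
    """Forward scan from the title, then backward scan, for a DATE with a year."""
    forward = range(book_pos, len(entities))
    backward = range(book_pos, -1, -1)
    for indices in (forward, backward):
        for i in indices:
            label, text = entities[i]
            if label == "DATE":
                year = year_of(text)
                if year is not None:
                    return year
    return None


# Illustrative entity sequence for a single sentence.
entities = [
    ("DATE", "1998"),
    ("WORK_OF_ART", "Brown Girl in the Ring"),
    ("ORG", "Warner Aspect"),
    ("WORK_OF_ART", "Midnight Robber"),
    ("DATE", "2000"),
]

# The forward scan wins whenever a later DATE exists, even if an earlier one is closer.
print(nearest_year(entities, 1))
print(nearest_year(entities, 3))
print(nearest_year([("WORK_OF_ART", "Skin Folk")], 0))
```

Because the forward scan takes precedence, a title can be paired with a year that belongs to a different work mentioned later in the same sentence, which is one plausible source of year mismatches.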

In [348]:
book_filder = BookFinder()

In [349]:
book_filder.find_books_written_by_author(writer_wikitext_dict['Nalo_Hopkinson'])

{'Brown Girl in the Ring': None,
 'Skin Folk': None,
 'the World Fantasy Award': 2003,
 'the Prix Aurora Award': 2008,
 'Love With Hominids': 1998,
 'A Habit of Waste': 1999,
 'The Glass Bottle Trick': 2000,
 'Greedy Choke Puppy': 2001,
 'Ganger (Ball Lightning': 2001,
 'Midnight Robber': 2001,
 'Young Bloods: Stories': 2001,
 'Queer Fear II': 2002,
 'Shift': 2002,
 'Conjunctions: the New Wave Fabulists': 2002,
 'Herbal': 2004,
 'Whose Upward Flight I Love': 2004,
 'The Smile on the Face': 2004,
 'Girls Who Bite Back: Witches, Mutants, Slayers and Freaks': 2004,
 '"Making the Impossible Possible: An Interview with Nalo Hopkinson"': 2004,
 'Waving at Trains': 2017}

Let's compute the result for all writers.

In [350]:
wiki_writer_books_dict: Dict[str, Dict[str, str]] = dict()
for writer in dbpedia_writer_books_dict.keys():
    if writer not in wiki_writer_books_dict:
        wiki_writer_books_dict[writer] = dict()
    
    if writer not in writer_wikitext_dict:
        continue
        
    writer_wiki_description = writer_wikitext_dict[writer]
    wiki_writer_books_dict[writer] = book_filder.find_books_written_by_author(writer_wiki_description)

In [353]:
wiki_writer_books_dict.items()



## 3. Evaluating the results

Design a metric that shows how well the information you extracted from the articles agrees with the information in your database. How much information is missing? Are there partial matches? (For example, the name of a company's CEO matches only partially, or the CEO's name matches but the years in office differ.)
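
One way to make this concrete is a pair of per-writer ratios: the share of database titles that also appear among the extracted titles (using a substring test to allow partial title matches), and, among the matched titles, the share whose extracted year is one of the database years. The following is a minimal sketch of that idea on toy data. The names `find_partial_match` and `match_ratios` and the sample dictionaries are assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def find_partial_match(title: str, candidates: Dict[str, Optional[int]]) -> Optional[str]:
    """Return the first candidate title that contains `title` or is contained in it."""
    for candidate in candidates:
        if title in candidate or candidate in title:
            return candidate
    return None


def match_ratios(db_books: Dict[str, Set[int]],
                 extracted: Dict[str, Optional[int]]) -> Tuple[Optional[float], Optional[float]]:
    """Return (share of database titles found, share of found titles with an agreeing year)."""
    matched = 0
    years_ok = 0
    for title, db_years in db_books.items():
        hit = find_partial_match(title, extracted)
        if hit is None:
            continue
        matched += 1
        year = extracted[hit]
        # With no database years, only a missing extracted year counts as agreement.
        if (not db_years and year is None) or year in db_years:
            years_ok += 1
    book_ratio = matched / len(db_books) if db_books else None
    year_ratio = years_ok / matched if matched else None
    return book_ratio, year_ratio


# Toy data: four database titles, three extracted titles.
db_books = {"Molloy": {1951, 1955}, "Watt": {1953}, "Murphy": {1938}, "Malone Dies": {1951}}
extracted = {"Molloy": 1951, "Watt (novel)": 1945, "Endgame": 1957}

# "Molloy" matches exactly with an agreeing year; "Watt" matches partially with a different year.
print(match_ratios(db_books, extracted))
```

Both ratios are computed relative to the database, so titles that were extracted but are absent from the database do not lower either value.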



In [190]:
def check_match_book(book: str, books_dict):
    for book_key in books_dict:
        if book in book_key:
            return book_key
        if book_key in book:
            return book_key
    return None

In [354]:
def are_years_matched(wiki_year, dbpedia_years):
    # With no DBpedia years, only a missing Wikipedia year counts as a match.
    if len(dbpedia_years) == 0:
        return wiki_year is None

    return wiki_year in dbpedia_years

In [366]:
matched_dict = dict()
matched_result_dict = {
    'writer' : [],
    'num_dbpedia_books' : [],
    'num_wiki_books' : [],
    'matched_wiki_with_dbpedia_books' : [],
    'diff_num_dbpedia_and_wiki_books' : [],
    'matched_years_for_matched_books' : []
}
for writer in dbpedia_writer_books_dict:
    dbpedia_books = dbpedia_writer_books_dict[writer]
    wiki_books = wiki_writer_books_dict[writer]

    matched_result_dict['writer'].append(writer)
    matched_result_dict['num_dbpedia_books'].append(len(dbpedia_books))
    matched_result_dict['num_wiki_books'].append(len(wiki_books))
    matched_result_dict['diff_num_dbpedia_and_wiki_books'].append(len(dbpedia_books) - len(wiki_books))

    # Count DBpedia books found among the Wikipedia titles,
    # and how many of those also agree on the year.
    matched_books, matched_year = 0, 0
    for dbpedia_book in dbpedia_books:
        wiki_book_matched = check_match_book(dbpedia_book, wiki_books)

        if wiki_book_matched is None:
            continue

        matched_books += 1

        if are_years_matched(wiki_books[wiki_book_matched], dbpedia_books[dbpedia_book]):
            matched_year += 1

    if len(dbpedia_books) > 0:
        matched_result_dict['matched_wiki_with_dbpedia_books'].append(matched_books / len(dbpedia_books))
    elif len(wiki_books) == 0:
        matched_result_dict['matched_wiki_with_dbpedia_books'].append(1)
    else:
        matched_result_dict['matched_wiki_with_dbpedia_books'].append(None)

    # The year coefficient is undefined when no book was matched.
    matched_years_for_matched_books = matched_year / matched_books if matched_books > 0 else None
    matched_result_dict['matched_years_for_matched_books'].append(matched_years_for_matched_books)

matched_result_df = pd.DataFrame(matched_result_dict)

The table below shows, for each writer, the fraction of DBpedia books that were also found among the books extracted from Wikipedia, and, for the matched books, the fraction whose publication years agree. It also lists the number of books per writer obtained from DBpedia and from Wikipedia. The counts show that more books are typically extracted from Wikipedia for a given writer than were obtained from DBpedia.

In [367]:
matched_result_df

| | writer | num_dbpedia_books | num_wiki_books | matched_wiki_with_dbpedia_books | diff_num_dbpedia_and_wiki_books | matched_years_for_matched_books |
|---|---|---|---|---|---|---|
| 0 | William_Makepeace_Thackeray | 1 | 26 | 0.000000 | -25 | |
| 1 | Nalo_Hopkinson | 3 | 20 | 0.666667 | -17 | 0.0 |
| 2 | Theodore_Judson | 2 | 2 | 0.000000 | 0 | |
| 3 | Robert_M._Pirsig | 2 | 3 | 0.500000 | -1 | 0.0 |
| 4 | Samuel_Beckett | 4 | 25 | 0.250000 | -21 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 1331 | Chinu_Modi | 1 | 3 | 0.000000 | -2 | |
| 1332 | Charlie_Jane_Anders | 1 | 8 | 1.000000 | -7 | 0.0 |
| 1333 | Kavi_Kant | 1 | 2 | 1.000000 | -1 | 1.0 |
| 1334 | Seth_Dickinson | 1 | 4 | 1.000000 | -3 | 0.0 |


In [379]:
# Drop writers for whom the coefficient is undefined (stored as None/NaN).
matched_books = matched_result_df['matched_wiki_with_dbpedia_books'].dropna().values
print("The average matched book coefficient: ", 100 * np.average(matched_books), "%")

The average matched book coefficient:  39.228091436175276 %


This coefficient may seem too high, given that book detection relied solely on named entities. The explanation is that the number of books extracted from a Wikipedia page far exceeds the number of books obtained from DBpedia, which raises the probability that a given DBpedia book is found among the books extracted from Wikipedia.
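
The effect can be seen on a toy example. The coefficient counts how many database titles are found among the extracted ones, so additional extracted titles can only keep it the same or raise it. The sketch below, with hand-picked toy lists and a hypothetical helper `match_share`, also reports the share of extracted titles that correspond to a database title, which drops as unrelated entities accumulate. This second ratio is not computed in the notebook and is shown only for contrast.

```python
from typing import List


def is_partial_match(a: str, b: str) -> bool:
    """Substring test in either direction, as a simple notion of a partial title match."""
    return a in b or b in a


def match_share(targets: List[str], pool: List[str]) -> float:
    """Share of `targets` that partially match at least one item in `pool`."""
    if not targets:
        return 0.0
    hits = sum(any(is_partial_match(t, p) for p in pool) for t in targets)
    return hits / len(targets)


db_titles = ["Skin Folk", "The Salt Roads"]
few_extracted = ["Skin Folk"]
many_extracted = few_extracted + ["the World Fantasy Award", "The Salt Roads", "Shift", "Herbal"]

for extracted in (few_extracted, many_extracted):
    found = match_share(db_titles, extracted)      # share of database titles that were found
    relevant = match_share(extracted, db_titles)   # share of extracted titles that are in the database
    print(len(extracted), found, relevant)
```

With the longer list, every database title is found, yet most extracted items are not database books, so reading the coefficient alongside the size difference between the two sources gives a fuller picture.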

In [407]:
matched_years = matched_result_df['matched_years_for_matched_books'].values
matched_years = matched_years[~np.isnan(matched_years)]
print("The average matched years coefficient, where books are matched: ", 100 * np.average(matched_years), "%")

The average matched years coefficient, where books are matched:  37.10706751054852 %


In [410]:
diff = matched_result_df['diff_num_dbpedia_and_wiki_books'].values
print("The average difference between number of dbpedia books and num of wiki books", np.average(diff))

The average difference between number of dbpedia books and num of wiki books -10.326347305389222


As we can see, the average difference is negative and fairly large. This is because, in most cases, Wikipedia lists considerably more books by a given writer than DBpedia does. For example, the DBpedia data on Nalo Hopkinson (http://dbpedia.org/page/Nalo_Hopkinson) lists only three books (dbo:notableWork), whereas her Wikipedia page (https://en.wikipedia.org/wiki/Nalo_Hopkinson) mentions many more.