## 1. Domain

We choose the following domain: writers, the books they wrote, and the years those books were published.

In [202]:
import pandas as pd
import numpy as np
import re
from SPARQLWrapper import SPARQLWrapper, JSON

The SPARQL query that retrieves the data for this domain is defined below. In this case, we consider only books written in English.

In [2]:
sparql_query = """
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbp:  <http://dbpedia.org/property/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT DISTINCT ?writer ?book ?date
    WHERE {
    ?writer rdf:type <http://dbpedia.org/ontology/Person> .
    ?writer dbo:notableWork ?book .
    ?book rdf:type <http://dbpedia.org/ontology/Book> .
    ?book dbo:language dbr:English_language .
    ?book dbp:releaseDate ?date
}
"""

In [3]:
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(sparql_query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

In [213]:
def convert_from_dbpedia_book_str(dbpedia_book_str: str):
    book = re.sub(r'\(.*\)', '', dbpedia_book_str).replace('_', ' ').strip()
    return book

In [214]:
def convert_sparql_result_to_data_frame(results):
    df_dict = {
        'writer' : [],
        'book': [],
        'year': []
    }
    
    for result in results["results"]["bindings"]:
        try:
            writer = result["writer"]["value"].split('/')[-1]
            book = result["book"]["value"].split('/')[-1]
            book = convert_from_dbpedia_book_str(book)
            year = int(result["date"]["value"])
            
            if year > 2020 or year < 0:
                continue
        
            df_dict['writer'].append(writer)
            df_dict['book'].append(book)
            df_dict['year'].append(year)
        except (KeyError, ValueError):
            # Skip rows with a missing field or a date that is not a plain integer year
            continue
        
    return pd.DataFrame(df_dict)

In [215]:
def convert_sparql_result_to_writer_books_dict(results):
    writer_books_dict: Dict[str, Dict[str, int]] = dict()
    for result in results["results"]["bindings"]:
        try:
            writer = result["writer"]["value"].split('/')[-1]
            book = result["book"]["value"].split('/')[-1]
            book = convert_from_dbpedia_book_str(book)
            year = int(result["date"]["value"])

            if year > 2020 or year < 0:
                continue
            if writer not in writer_books_dict:
                writer_books_dict[writer] = dict()
            
            writer_books_dict[writer][book] = year
        except (KeyError, ValueError):
            # Skip rows with a missing field or a date that is not a plain integer year
            continue
        
    return writer_books_dict

In [216]:
df = convert_sparql_result_to_data_frame(results)

In [217]:
df.head()

|   | writer | book | year |
|---|--------|------|------|
| 0 | Nalo_Hopkinson | Brown Girl in the Ring | 1998 |
| 1 | Theodore_Judson | Fitzpatrick's War | 2004 |
| 2 | Robert_M._Pirsig | Lila: An Inquiry into Morals | 1991 |
| 3 | Samuel_Beckett | Watt | 1953 |
| 4 | Alan_Lawrence_Sitomer | The Hoopster | 2005 |


In [218]:
dbpedia_writer_books_dict = convert_sparql_result_to_writer_books_dict(results)

In [219]:
len(dbpedia_writer_books_dict)

116

In [220]:
dbpedia_writer_books_dict.keys()

## 2. Fact extraction

In [34]:
import wikipediaapi
from typing import Dict, List, Tuple

In [21]:
wiki_en = wikipediaapi.Wikipedia('en')

2.1. Write a program that searches Wikipedia for articles about entities belonging to your domain and extracts the texts of those articles.

In [28]:
def acquire_wiki_text(wiki, page_name: str):
    page = wiki.page(page_name)
    if not page.exists():
        return None
    
    return page.text

In [23]:
writer_wikitext_dict: Dict[str, str] = dict()

In [169]:
for writer in dbpedia_writer_books_dict.keys():    
    wiki_text = acquire_wiki_text(wiki_en, writer)
    if wiki_text is None:
        continue
        
    writer_wikitext_dict[writer] = wiki_text

In [170]:
writer_wikitext_dict['Nalo_Hopkinson']

'Nalo Hopkinson (born 20 December 1960) is a Jamaican-born Canadian speculative fiction writer and editor. She currently lives and teaches in Riverside, California. Her novels (Brown Girl in the Ring, Midnight Robber, The Salt Roads, The New Moon\'s Arms) and short stories such as those in her collection Skin Folk often draw on Caribbean history and language, and its traditions of oral and written storytelling.\nHopkinson has edited two fiction anthologies (Whispers From the Cotton Tree Root: Caribbean Fabulist Fiction and Mojo: Conjure Stories). She was the co-editor with Uppinder Mehan for the anthology So Long Been Dreaming: Postcolonial Visions of the Future, and with Geoff Ryman for Tesseracts 9.\nHopkinson defended George Elliott Clarke\'s novel Whylah Falls on the CBC\'s Canada Reads 2002. She was the curator of Six Impossible Things, an audio series of Canadian fantastical fiction on CBC Radio One.\n\nEarly life and education\nNalo Hopkinson was born 20 December 1960 in Kingsto

2.2. Write a program that processes the article text (the raw text itself, not tables, if there are any) and extracts information about your domain from it. You will compare this information with the database you built.

In [171]:
import spacy

In [172]:
class BookFinder:
    """Extracts (work title, date) pairs from free text using spaCy NER."""

    DELTA_POS_BOOK_YEAR = 3  # currently unused

    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")
        # spaCy v2 API; in spaCy v3 use self.nlp.add_pipe("sentencizer") instead.
        self.sentencizer = self.nlp.create_pipe("sentencizer")
        self.nlp.add_pipe(self.sentencizer)

    def find_books_written_by_author(self, author_description: str) -> Dict[str, str]:
        """Map each WORK_OF_ART entity to a DATE entity from the same sentence (or None)."""
        result: Dict[str, str] = dict()

        sentences: List[str] = self.sent_tokenize(author_description)
        for doc in self.nlp.pipe(sentences, disable=["tagger", "parser"]):
            book_with_pos_list = self._find_book_with_positions(doc.ents)
            for book, pos in book_with_pos_list:
                date = self._find_date_for_book(doc.ents, pos)
                # Keep the first date found; only fill in a date that is still missing.
                if book not in result or result[book] is None:
                    result[book] = date
        return result

    def sent_tokenize(self, text: str) -> List[str]:
        """Split the text into sentences with the rule-based sentencizer."""
        sentences = []
        for doc in self.nlp.pipe([text], disable=["tagger", "parser", "ner"]):
            for sent in doc.sents:
                sentences.append(sent.text)
        return sentences

    def _find_book_with_positions(self, entities):
        """Return (entity text, index in entities) for every WORK_OF_ART entity."""
        result = list()
        for i, entity in enumerate(entities):
            if entity.label_ == 'WORK_OF_ART':
                result.append((entity.text, i))
        return result

    def _find_date_for_book(self, entities, book_pos):
        """Return the first DATE after the book, else the closest DATE before it."""
        # Search forward from the book entity first.
        i = book_pos
        while i < len(entities):
            if entities[i].label_ == 'DATE':
                return entities[i].text
            i += 1

        # No date after the book: fall back to searching backward.
        i = book_pos
        while i >= 0:
            if entities[i].label_ == 'DATE':
                return entities[i].text
            i -= 1

        return None

In [173]:
book_filder = BookFinder()

In [174]:
book_filder.find_books_written_by_author(writer_wikitext_dict['Nalo_Hopkinson'])

{'Brown Girl in the Ring': None,
 'Skin Folk': None,
 'the World Fantasy Award': '2003',
 'the Prix Aurora Award': '2008',
 'Love With Hominids': '1998',
 'A Habit of Waste': '1999',
 'The Glass Bottle Trick': '2000',
 'Greedy Choke Puppy': '2001',
 'Ganger (Ball Lightning': '2001',
 'Midnight Robber': '2001',
 'Young Bloods: Stories': '2001',
 'Queer Fear II': '2002',
 'Shift': '2002',
 'Conjunctions: the New Wave Fabulists': '2002',
 'Herbal': '2004',
 'Whose Upward Flight I Love': '2004',
 'The Smile on the Face': '2004',
 'Girls Who Bite Back: Witches, Mutants, Slayers and Freaks': '2004',
 '"Making the Impossible Possible: An Interview with Nalo Hopkinson"': '2004',
 'Waving at Trains': 'October 18, 2017'}

Let's compute the result for all writers.

In [187]:
wiki_writer_books_dict: Dict[str, Dict[str, str]] = dict()
for writer in dbpedia_writer_books_dict.keys():
    if writer not in wiki_writer_books_dict:
        wiki_writer_books_dict[writer] = dict()
    
    if writer not in writer_wikitext_dict:
        continue
        
    writer_wiki_description = writer_wikitext_dict[writer]
    wiki_writer_books_dict[writer] = book_filder.find_books_written_by_author(writer_wiki_description)

In [188]:
wiki_writer_books_dict

{'Nalo_Hopkinson': {'Brown Girl in the Ring': None,
  'Skin Folk': None,
  'the World Fantasy Award': '2003',
  'the Prix Aurora Award': '2008',
  'Love With Hominids': '1998',
  'A Habit of Waste': '1999',
  'The Glass Bottle Trick': '2000',
  'Greedy Choke Puppy': '2001',
  'Ganger (Ball Lightning': '2001',
  'Midnight Robber': '2001',
  'Young Bloods: Stories': '2001',
  'Queer Fear II': '2002',
  'Shift': '2002',
  'Conjunctions: the New Wave Fabulists': '2002',
  'Herbal': '2004',
  'Whose Upward Flight I Love': '2004',
  'The Smile on the Face': '2004',
  'Girls Who Bite Back: Witches, Mutants, Slayers and Freaks': '2004',
  '"Making the Impossible Possible: An Interview with Nalo Hopkinson"': '2004',
  'Waving at Trains': 'October 18, 2017'},
 'Theodore_Judson': {"Blog\nJudson's": None, 'Daughter at Pyr': None},
 'Robert_M._Pirsig': {'a Guggenheim Fellowship': '1991',
  'Motorcycle': '1968',
  "Yourself: Revisiting 'Zen and the Art of Motorcycle Maintenance'": '1968'},
 'Samuel_

## 3. Evaluating the results

Develop a metric that shows how well the information you extracted from the articles matches the information in your database. How much information is missing? Are there partial matches? (For example, the name of a company's CEO matches only partially, or the CEO's name matches but the years in office differ.)
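
One possible way to separate exact matches, partial matches, and missing titles is sketched below. It is not the approach used in the cells that follow (those rely on substring containment); it swaps in a similarity ratio from the standard-library `difflib` module. The function names `classify_title` and `match_report`, and the `threshold` value of 0.8, are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def classify_title(title: str, candidates: Dict[str, Optional[str]],
                   threshold: float = 0.8) -> Tuple[str, Optional[str]]:
    """Return ('exact' | 'partial' | 'missing', best matching candidate or None)."""
    norm = title.casefold().strip()
    best_key, best_ratio = None, 0.0
    for key in candidates:
        # Similarity in [0, 1]; 1.0 means the normalized strings are identical.
        ratio = SequenceMatcher(None, norm, key.casefold().strip()).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = key, ratio
    if best_ratio == 1.0:
        return "exact", best_key
    if best_ratio >= threshold:  # hypothetical cut-off, tune on real data
        return "partial", best_key
    return "missing", None


def match_report(reference: Dict[str, int],
                 extracted: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Share of reference books that are exact, partial, or missing in the extraction."""
    counts = {"exact": 0, "partial": 0, "missing": 0}
    for title in reference:
        status, _ = classify_title(title, extracted)
        counts[status] += 1
    total = len(reference) or 1  # avoid division by zero for an empty reference
    return {status: n / total for status, n in counts.items()}
```

Called per writer as `match_report(dbpedia_writer_books_dict[writer], wiki_writer_books_dict[writer])`, this would report the missing share directly instead of leaving it implicit in a single match rate.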



In [190]:
def check_match_book(book: str, books_dict):
    for book_key in books_dict:
        if book in book_key:
            return book_key
        if book_key in book:
            return book_key
    return None

In [193]:
def are_years_matched(year1, year2):
    if year1 is None:
        return False
    if year2 is None:
        return False
    
    if year1 in year2:
        return True
    if year2 in year1:
        return True
    
    return False
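
The containment check above treats `'19'` as matching `'1998'`, and it depends on the date strings having a convenient shape. A stricter alternative, sketched here under the assumption that only four-digit years between 1000 and 2099 matter, is to pull the year out of each value first and compare integers. The names `YEAR_PATTERN`, `extract_year`, and `years_equal` are hypothetical and do not appear elsewhere in this notebook.

```python
import re
from typing import Optional

# A standalone four-digit year from 1000 to 2099 (assumed range for this domain).
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_year(value) -> Optional[int]:
    """Return the first standalone four-digit year in the value, or None."""
    if value is None:
        return None
    match = YEAR_PATTERN.search(str(value))
    return int(match.group(1)) if match else None


def years_equal(year1, year2) -> bool:
    """True only if both values contain a year and the years are identical."""
    y1, y2 = extract_year(year1), extract_year(year2)
    return y1 is not None and y1 == y2
```

With this sketch, `years_equal(2017, 'October 18, 2017')` holds, while a decade expression such as `'the 1990s'` yields no year because the digits are not a standalone token.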

In [221]:
matched_dict = dict()
for writer in dbpedia_writer_books_dict:
    dbpedia_books = dbpedia_writer_books_dict[writer]
    wiki_books = wiki_writer_books_dict[writer]

    matched_books, matched_year = 0, 0
    for dbpedia_book in dbpedia_books:
        print(dbpedia_book)
        wiki_book_matched = check_match_book(dbpedia_book, wiki_books)

        if wiki_book_matched is None:
            continue

        matched_books += 1

        if are_years_matched(str(dbpedia_books[dbpedia_book]), wiki_books[wiki_book_matched]):
            matched_year += 1

    # Per writer: (share of DBpedia books found in the article,
    #              share of DBpedia books found with a matching year)
    matched_dict[writer] = (matched_books / len(dbpedia_books), matched_year / len(dbpedia_books))

Brown Girl in the Ring
Fitzpatrick's War
Lila: An Inquiry into Morals
Watt
The Hoopster
On Our Selection
Arrow of God
Dragonkeeper
Once Were Warriors
What Becomes of the Broken Hearted%3F
Jake's Long Shadow
To the Islands
The Roving Party
House
Vengeance
Roma Sub Rosa
The Autobiography of Miss Jane Pittman
The Broken Sword
The Grifters
True Grit
My Brother Jack
The Fabulous Clipjoint
The Wing of Night
Dark Angel
Out of Africa
Slugs
Psycho
Psycho II
Billy Bathgate
Homer & Langley
Blue Highways
Careless
Death Wish
Gilgamesh
The A-List
The Dark
The Barracks
Clockers
Taxi
Life or Death
Are You There God%3F It's Me, Margaret.
The Two Georges
A Night in the Lonesome October
Doorways in the Sand
The Summer Tree
Mrs. Eckdorf in O'Neill's Hotel
The Children of Dynmouth
Love and Summer
Felicia's Journey
Less Than One: Selected Essays
Zone One
Fletch
The Krishna Key
Chanakya's Chant
The Face of Fear
The Beast House
Antonina: A Byzantine Slut
Count No Man Happy: A Byzantine Fantasy
Exquisite Corps

In [225]:
matched_dict

{'Nalo_Hopkinson': (1.0, 0.0),
 'Theodore_Judson': (0.0, 0.0),
 'Robert_M._Pirsig': (0.0, 0.0),
 'Samuel_Beckett': (0.0, 0.0),
 'Alan_Lawrence_Sitomer': (1.0, 0.0),
 'Steele_Rudd': (1.0, 0.0),
 'Chinua_Achebe': (0.0, 0.0),
 'Carole_Wilkinson': (0.0, 0.0),
 'Alan_Duff': (0.3333333333333333, 0.3333333333333333),
 'Randolph_Stow': (0.0, 0.0),
 'Rohan_Wilson': (0.0, 0.0),
 'Ted_Dekker': (0.0, 0.0),
 'George_Jonas': (0.0, 0.0),
 'Steven_Saylor': (0.0, 0.0),
 'Ernest_J._Gaines': (1.0, 1.0),
 'Poul_Anderson': (0.0, 0.0),
 'Jim_Thompson_(writer)': (0.0, 0.0),
 'Charles_Portis': (1.0, 1.0),
 'George_Johnston_(novelist)': (0.0, 0.0),
 'Fredric_Brown': (1.0, 0.0),
 'Brenda_Walker': (1.0, 0.0),
 'John_Dale_(writer)': (0.0, 0.0),
 'Karen_Blixen': (1.0, 0.0),
 'Shaun_Hutson': (0.0, 0.0),
 'Robert_Bloch': (1.0, 0.0),
 'E._L._Doctorow': (0.5, 0.0),
 'William_Least_Heat-Moon': (0.0, 0.0),
 'Deborah_Robertson': (0.0, 0.0),
 'Brian_Garfield': (1.0, 0.0),
 'Joan_London_(Australian_author)': (0.0, 0.0),
 '

In [230]:
avg_matched_book = sum([item[0] for item in matched_dict.values()]) / len(matched_dict)

In [231]:
print("Average share of matched books per writer: ", 100 * avg_matched_book, '%')

Average share of matched books per writer:  37.42816091954023 %


In [232]:
avg_matched_year = sum([item[1] for item in matched_dict.values()]) / len(matched_dict)

In [233]:
print("Average share of books with a matching title and year per writer: ", 100 * avg_matched_year, '%')

Average share of books with a matching title and year per writer:  12.140804597701148 %
