# Relation extraction using distant supervision

This seminar is based on the Stanford course CS224U; see the course [materials](http://web.stanford.edu/class/cs224u/2019/).

Papers on distant supervision:

- Mintz et al. 2009, [Distant supervision for relation extraction without labeled data](https://www.aclweb.org/anthology/P09-1113/)
- Ye, Z.-X., and Ling, Z.-H. 2019, [Distant Supervision Relation Extraction with Intra-Bag and Inter-Bag Attentions](https://arxiv.org/pdf/1904.00143.pdf)




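# Relation extraction: the general task

The goal is to extract relational triples of the form (relation, subject, object) from text, for example: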

```
(founders, SpaceX, Elon_Musk)
(has_spouse, Elon_Musk, Talulah_Riley)
(worked_at, Elon_Musk, Tesla_Motors)
```
The main application is building and extending knowledge bases (KBs). Such a base is useful for many tasks, for example question answering systems.


### Distant supervision
We want to use supervised learning but do not want to spend resources on manual annotation. Distant supervision is a way to use knowledge that already exists in structured form in a knowledge base to obtain new knowledge. Given the relation `(founders, SpaceX, Elon_Musk)`, we assume that sentences mentioning the same two participants express that relation. We label such sentences automatically and add them to our training set.

Drawbacks of the approach:
- the resulting data is noisy
- we need something to start from: with this approach we cannot build a KB from scratch and cannot discover new relation types.
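
The labeling assumption can be illustrated with a minimal sketch. The data and the names `Triple`, `Sentence`, `toy_kb`, `toy_sentences` and `distant_labels` are invented for this illustration and are not part of the seminar code. Note how the second sentence receives the `founders` label even though it does not express that relation, which is exactly the noise mentioned above.

```python
from collections import namedtuple

# Toy data, invented for illustration only.
Triple = namedtuple('Triple', 'rel, sbj, obj')
Sentence = namedtuple('Sentence', 'entity_1, entity_2, text')

toy_kb = [Triple('founders', 'SpaceX', 'Elon_Musk')]
toy_sentences = [
    Sentence('SpaceX', 'Elon_Musk', 'SpaceX was founded by Elon Musk in 2002 .'),
    Sentence('SpaceX', 'Elon_Musk', 'SpaceX shares rose after Elon Musk spoke .'),
    Sentence('SpaceX', 'NASA', 'SpaceX signed a contract with NASA .'),
]

def distant_labels(kb, sentences):
    """Label each sentence with every KB relation known for its entity pair."""
    rels_by_pair = {}
    for t in kb:
        rels_by_pair.setdefault((t.sbj, t.obj), []).append(t.rel)
    return [(s.text, rels_by_pair.get((s.entity_1, s.entity_2), []))
            for s in sentences]

labeled = distant_labels(toy_kb, toy_sentences)
for text, rels in labeled:
    print(rels, text)
```

The third sentence gets no label because its pair is absent from the toy KB.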


In [16]:
import gzip
import numpy as np
import random
import sys
import os

from collections import Counter, defaultdict, namedtuple
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split


## The corpus

We need a corpus in which entities are annotated.
Two conditions must hold:
1. Ambiguous names are disambiguated (sense disambiguation)
2. If the same entity can be expressed in different ways, all of its mentions in the corpus point to a single identifier (entity resolution)

We will use a corpus based on [Wikilinks](https://code.google.com/archive/p/wiki-links/). Google announced Wikilinks in [2013](https://research.googleblog.com/2013/03/learning-from-big-data-40-million.html). The corpus contains 40 million mentions of 3 million named entities (NEs), and every mention is annotated with a link to Wikipedia.

Today's corpus is a transformed version of Wikilinks. Each example in it has two entity mentions and the context surrounding them.
To make the examples easier to work with, we will use the `Corpus` class.


In [13]:
Example = namedtuple('Example',
    'entity_1, entity_2, left, mention_1, middle, mention_2, right, '
    'left_POS, mention_1_POS, middle_POS, mention_2_POS, right_POS')

class Corpus(object):
    """
    Class for representing and working with the raw text we use
    as evidence for making relation predictions.
    Parameters
    ----------
    src_filename_or_examples : str or list
        If str, this is assumed to be the full path to the gzip file
        that contains the examples to use. The method `read_examples`
        is used to open it in that case. If this is a list, then it
        should be a list of `Example` instances.
    Attributes
    ----------
    examples_by_entities : dict
        A 2d dictionary mapping `ex.entity_1` to a dict mapping entity
        `ex.entity_2` to the full `Example` instance `ex`. This is
        created by the method `_index_examples_by_entities`.
    """
    def __init__(self, src_filename_or_examples):
        if isinstance(src_filename_or_examples, str):
            self.examples = self.read_examples(src_filename_or_examples)
        else:
            self.examples = src_filename_or_examples
        self.examples_by_entities = {}
        self._index_examples_by_entities()

    @staticmethod
    def read_examples(src_filename):
        """
        Read `src_filename`, assumed to be a `gzip` file with
        tab-separated lines that can be turned into `Example`
        instances.
        Parameters
        ----------
        src_filename :  str
            Assumed to be the full path to the gzip file that contains
            the examples.
        Returns
        -------
        list of Example
        """
        examples = []
        with gzip.open(src_filename, mode='rt', encoding='utf8') as f:
            for line in f:
                fields = line[:-1].split('\t')
                examples.append(Example(*fields))
        return examples

    def _index_examples_by_entities(self):
        """
        Fill `examples_by_entities` as a 2d dictionary mapping
        `ex.entity_1` to a dict mapping entity `ex.entity_2` to the
        full `Example` instance `ex`.
        """
        for ex in self.examples:
            if ex.entity_1 not in self.examples_by_entities:
                self.examples_by_entities[ex.entity_1] = {}
            if ex.entity_2 not in self.examples_by_entities[ex.entity_1]:
                self.examples_by_entities[ex.entity_1][ex.entity_2] = []
            self.examples_by_entities[ex.entity_1][ex.entity_2].append(ex)

    def get_examples_for_entities(self, e1, e2):
        """
        Given two entities `e1` and `e2` as strings, return
        examples from `self.examples_by_entities`, as a list of
        `Example` instances."""
        try:
            return self.examples_by_entities[e1][e2]
        except KeyError:
            return []


    def __str__(self):
        return 'Corpus with {0:,} examples'.format(len(self.examples))

    def __repr__(self):
        return str(self)

    def __len__(self):
        return len(self.examples)

In [17]:
corpus = Corpus('corpus.tsv.gz')

len(corpus.examples)

331696

In [18]:
corpus.examples[10]

Example(entity_1='The_Official_Story', entity_2='The_Secret_in_Their_Eyes', left='to well connected political families . These and other topics related to the Dirty War have been explored in books and films including Apartment Zero ,', mention_1='The Official Story', middle='and', mention_2='The Secret In Their Eyes .', right='Holtzman , who lives in Philadelphia , was born in Buenos Aires in 1947 and grew up in Latin America and Southeast Asia . He moved to the', left_POS='to/TO well/RB connected/JJ political/JJ families/NNS ./. These/DT and/CC other/JJ topics/NNS related/VBN to/TO the/DT Dirty/NNP War/NNP have/VBP been/VBN explored/VBN in/IN books/NNS and/CC films/NNS including/VBG Apartment/NNP Zero/NNP ,/,', mention_1_POS='The/DT Official/NNP Story/NNP', middle_POS='and/CC', mention_2_POS='The/DT Secret/JJ In/IN Their/PRP$ Eyes/NNS ./.', right_POS='Holtzman/NNP ,/, who/WP lives/VBZ in/IN Philadelphia/NNP ,/, was/VBD born/VBN in/IN Buenos/NNP Aires/NNP in/IN 1947/CD and/CC grew/VBD 

The `entity_1` and `entity_2` fields hold the titles of the Wikipedia articles that correspond to the entities, for example https://en.wikipedia.org/wiki/The_Secret_in_Their_Eyes


In [None]:
ex = corpus.examples[1]

In [None]:
# Join with spaces so the mentions do not run into the surrounding context.
' '.join((ex.left, ex.mention_1, ex.middle, ex.mention_2, ex.right))

'to all Spanish-occupied lands . The horno has a beehive shape and uses wood as the only heat source . The procedure still used in parts of New Mexico and Arizona is to build a fire inside the Horno and , when the proper amount of time has passed , remove the embers and ashes and insert the'

Let's look at the most frequent NEs in the corpus.

In [7]:
counter = Counter()
for example in corpus.examples:
    counter[example.entity_1] += 1
    counter[example.entity_2] += 1
print('The corpus contains {} entities'.format(len(counter)))
counts = sorted([(count, key) for key, count in counter.items()], reverse=True)
print('The most common entities are:')
for count, key in counts[:20]:
    print('{:10d} {}'.format(count, key))

The corpus contains 95909 entities
The most common entities are:
      8137 India
      5240 England
      4121 France
      4040 Germany
      3937 Australia
      3779 Canada
      3633 Italy
      3138 California
      2894 New_York_City
      2745 Pakistan
      2213 New_Zealand
      2183 New_York
      2148 United_Kingdom
      2030 Spain
      2005 Japan
      1891 Russia
      1806 Philippines
      1748 Malaysia
      1721 Indonesia
      1670 China


Let's look at the examples for the NEs `Elon_Musk` and `Tesla_Motors`.

In [19]:
len(corpus.get_examples_for_entities('Elon_Musk', 'Tesla_Motors'))

5

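Order matters: examples are indexed by (`entity_1`, `entity_2`), so swapping the arguments returns a different set of examples.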

In [20]:
len(corpus.get_examples_for_entities('Tesla_Motors', 'Elon_Musk'))


2
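
Since the index is directional, collecting all evidence for a pair means querying both orders. A minimal sketch of that lookup, using a toy nested dict in the same shape as `Corpus.examples_by_entities`; the names `toy_index` and `examples_either_direction` and the placeholder strings are hypothetical:

```python
from collections import defaultdict

# Toy index shaped like `Corpus.examples_by_entities`; the strings stand in
# for `Example` instances and are invented for illustration.
toy_index = defaultdict(lambda: defaultdict(list))
toy_index['Elon_Musk']['Tesla_Motors'] += ['ex_a', 'ex_b']
toy_index['Tesla_Motors']['Elon_Musk'] += ['ex_c']

def examples_either_direction(index, e1, e2):
    """Return examples for (e1, e2) followed by examples for (e2, e1)."""
    # `.get` avoids creating empty entries in the defaultdict on a miss.
    forward = index.get(e1, {}).get(e2, [])
    backward = index.get(e2, {}).get(e1, [])
    return forward + backward

both = examples_either_direction(toy_index, 'Elon_Musk', 'Tesla_Motors')
print(len(both))
```

Keeping the two directions separate in the index preserves the information about which entity was mentioned first, which can later serve as a feature.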

## Knowledge base (KB)

We will use a subset of the [Freebase](https://en.wikipedia.org/wiki/Freebase_(database)) knowledge base. The project was shut down in 2016, but dumps are still publicly available (freebase-easy.cs.uni-freiburg.de/dump/).
You can browse what Freebase contains and how it can be used at http://freebase-easy.cs.uni-freiburg.de/browse/


A knowledge base can be represented as a set of triples: _relation_, _subject_, _object_. Examples of such triples:

```
(place_of_birth, Barack_Obama, Honolulu)
(has_spouse, Barack_Obama, Michelle_Obama)
(author, The_Audacity_of_Hope, Barack_Obama)
```

Let's look at our KB. We use the `KB` class, which indexes it both by NE and by relation.


In [10]:
KBTriple = namedtuple('KBTriple', 'rel, sbj, obj')

class KB(object):
    """
    Class for representing and working with the knowledge base.
    Parameters
    ----------
    src_filename_or_triples : str or list
        If str, this is assumed to be the full path to the gzip file
        that contains the KB. The method `read_kb_triples` is used to
        open it in that case. If this is a list, then it should be a
        list of `KBTriple` instances.
    Attributes
    ----------
    all_relations : list
        Built by `_index_kb_triples_by_relation` as a list of str.
    all_entity_pairs : list
        Built by `_collect_all_entity_pairs`, as a sorted list of
        (subject, object) tuples.
    kb_triples_by_relation : dict
        Built by `_index_kb_triples_by_relation`, as a dict mapping
        relations (str) to `KBTriple` lists.
    kb_triples_by_entities : dict
        Built by `_index_kb_triples_by_entities`, as a dict mapping
        subject (str) to dict mapping object (str) to
        `KBTriple` lists.
    """
    def __init__(self, src_filename_or_triples):
        if isinstance(src_filename_or_triples, str):
            self.kb_triples = self.read_kb_triples(src_filename_or_triples)
        else:
            self.kb_triples = src_filename_or_triples
        self.all_relations = []
        self.all_entity_pairs = []
        self.kb_triples_by_relation = {}
        self.kb_triples_by_entities = {}
        self._collect_all_entity_pairs()
        self._index_kb_triples_by_relation()
        self._index_kb_triples_by_entities()

    @staticmethod
    def read_kb_triples(src_filename):
        """
        Read `src_filename`, assumed to be a `gzip` file with
        tab-separated lines that can be turned into `KBTriple`
        instances.
        Parameters
        ----------
        src_filename :  str
            Assumed to be the full path to the gzip file that contains
            the triples
        Returns
        -------
        list of KBTriple
        """
        kb_triples = []
        with gzip.open(src_filename, mode='rt', encoding='utf8') as f:
            for line in f:
                rel, sbj, obj = line[:-1].split('\t')
                kb_triples.append(KBTriple(rel, sbj, obj))
        return kb_triples

    def _collect_all_entity_pairs(self):
        pairs = set()
        for kbt in self.kb_triples:
            pairs.add((kbt.sbj, kbt.obj))
        self.all_entity_pairs = sorted(list(pairs))

    def _index_kb_triples_by_relation(self):
        for kbt in self.kb_triples:
            if kbt.rel not in self.kb_triples_by_relation:
                self.kb_triples_by_relation[kbt.rel] = []
            self.kb_triples_by_relation[kbt.rel].append(kbt)
        self.all_relations = sorted(list(self.kb_triples_by_relation))

    def _index_kb_triples_by_entities(self):
        for kbt in self.kb_triples:
            if kbt.sbj not in self.kb_triples_by_entities:
                self.kb_triples_by_entities[kbt.sbj] = {}
            if kbt.obj not in self.kb_triples_by_entities[kbt.sbj]:
                self.kb_triples_by_entities[kbt.sbj][kbt.obj] = []
            self.kb_triples_by_entities[kbt.sbj][kbt.obj].append(kbt)

    def get_triples_for_relation(self, rel):
        """
        Given a relation name (str), return all of the `KBTriple`
        instances that involve it.
        """
        try:
            return self.kb_triples_by_relation[rel]
        except KeyError:
            return []

    def get_triples_for_entities(self, e1, e2):
        """
        Given a pair of entities `e1` and `e2` (both str), return
        all of the `KBTriple` instances that involve them.
        """
        try:
            return self.kb_triples_by_entities[e1][e2]
        except KeyError:
            return []

    def __str__(self):
        return 'KB with {0:,} triples'.format(len(self.kb_triples))

    def __repr__(self):
        return str(self)

    def __len__(self):
        return len(self.kb_triples)


In [11]:
kb = KB('kb.tsv.gz')

len(kb.kb_triples)

45884


Let's look at:
- how many relation types there are
- the number of triples per relation
- example triples
- how many unique NEs are mentioned in the KB

In [12]:
len(kb.all_relations)

16

In [14]:
for rel in kb.all_relations:
    print('{} {}'.format(len(kb.get_triples_for_relation(rel)), rel))

1702 adjoins
2671 author
522 capital
18681 contains
3947 film_performance
1960 founders
824 genre
2563 has_sibling
2994 has_spouse
2542 is_a
1598 nationality
1586 parents
1097 place_of_birth
831 place_of_death
1216 profession
1150 worked_at


In [15]:
for rel in kb.all_relations:
    print(tuple(kb.get_triples_for_relation(rel)[0]))

('adjoins', 'France', 'Spain')
('author', 'Uncle_Silas', 'Sheridan_Le_Fanu')
('capital', 'Panama', 'Panama_City')
('contains', 'Brickfields', 'Kuala_Lumpur_Sentral_railway_station')
('film_performance', 'Colin_Hanks', 'The_Great_Buck_Howard')
('founders', 'Lashkar-e-Taiba', 'Hafiz_Muhammad_Saeed')
('genre', '8_Simple_Rules', 'Sitcom')
('has_sibling', 'Ari_Emanuel', 'Rahm_Emanuel')
('has_spouse', 'Percy_Bysshe_Shelley', 'Mary_Shelley')
('is_a', 'Bhanu_Athaiya', 'Costume_designer')
('nationality', 'Ruben_Rausing', 'Sweden')
('parents', 'Rosanna_Davison', 'Chris_de_Burgh')
('place_of_birth', 'William_Penny_Brookes', 'Much_Wenlock')
('place_of_death', 'Jean_Drapeau', 'Montreal')
('profession', 'Rufus_Wainwright', 'Actor')
('worked_at', 'Brian_Greene', 'Columbia_University')


With the `kb.get_triples_for_entities()` method we can look up triples by subject and object.

In [16]:
kb.get_triples_for_entities('France', 'Germany')

[KBTriple(rel='adjoins', sbj='France', obj='Germany')]

Some relations are symmetric.

In [17]:
kb.get_triples_for_entities('Germany', 'France')

[KBTriple(rel='adjoins', sbj='Germany', obj='France')]

In [None]:
kb.get_triples_for_entities('Tesla_Motors','Elon_Musk')

[KBTriple(rel='founders', sbj='Tesla_Motors', obj='Elon_Musk')]

In [None]:
kb.get_triples_for_entities('Elon_Musk', 'Tesla_Motors')

[KBTriple(rel='worked_at', sbj='Elon_Musk', obj='Tesla_Motors')]

In [18]:
kb.get_triples_for_entities('Cleopatra', 'Ptolemy_XIII_Theos_Philopator')

[KBTriple(rel='has_sibling', sbj='Cleopatra', obj='Ptolemy_XIII_Theos_Philopator'),
 KBTriple(rel='has_spouse', sbj='Cleopatra', obj='Ptolemy_XIII_Theos_Philopator')]

How many unique entities are there in the KB, and in how many triples does each occur?

In [19]:
counter = Counter()
for kbt in kb.kb_triples:
    counter[kbt.sbj] += 1
    counter[kbt.obj] += 1
print('The KB contains {} entities'.format(len(counter)))
counts = sorted([(count, key) for key, count in counter.items()], reverse=True)
print('The most common entities are:')
for count, key in counts[:20]:
    print('{} {}'.format(count, key))

The KB contains 40141 entities
The most common entities are:
945 England
786 India
438 Italy
414 France
412 California
400 Germany
372 United_Kingdom
366 Canada
302 New_York_City
247 New_York
236 Australia
219 Philippines
215 Japan
212 Scotland
208 Russia
198 Actor
172 Pakistan
170 Ontario
169 Ireland
168 New_Zealand


The KB contains far fewer entities than the corpus (40,141 versus 95,909). The point of the task is to find new NEs that are linked by relations from a predefined list, and to add them to the KB.

## Problem formulation

When we talk about relation extraction (RE), the task comes in several variants:

- What exactly we classify
    - A pair of entity mentions (NE mentions) in context
    - A pair of entities
- What we predict
    - Zero or one specific relation (multi-class classification)
    - n relations for an entity pair (multi-label classification)


The way we connect the KB and the corpus depends on the approach.
- When __classifying entity mentions__, we use the KB to label individual sentences.
- When __classifying the entities themselves__, we use all corpus sentences containing the pair to extract features that describe the pair itself. (This is our choice today.)



So we will solve the following task: the system takes a pair of entities as input and outputs the relation(s) between them. The resulting triples are added to the KB.
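
Describing a pair by all of its corpus examples can be sketched as a bag-of-words feature function over the text between the two mentions. This is only an illustration under stated assumptions: `ToyCorpus`, `Triple`, `Ex` and `middle_bag_of_words` are hypothetical names, the data is invented, and the only thing borrowed from this notebook is the `(kbt, corpus, feature_counter)` signature that `Dataset.featurize` expects from a featurizer.

```python
from collections import Counter, namedtuple

Triple = namedtuple('Triple', 'rel, sbj, obj')
Ex = namedtuple('Ex', 'entity_1, entity_2, middle')

class ToyCorpus:
    """Hypothetical stand-in that only offers `get_examples_for_entities`."""
    def __init__(self, examples):
        self.examples = examples

    def get_examples_for_entities(self, e1, e2):
        return [ex for ex in self.examples
                if ex.entity_1 == e1 and ex.entity_2 == e2]

def middle_bag_of_words(kbt, corpus, feature_counter):
    """Count the words between the two mentions, over both mention orders."""
    for e1, e2 in ((kbt.sbj, kbt.obj), (kbt.obj, kbt.sbj)):
        for ex in corpus.get_examples_for_entities(e1, e2):
            for word in ex.middle.split(' '):
                feature_counter[word] += 1
    return feature_counter

toy_corpus = ToyCorpus([
    Ex('Elon_Musk', 'Tesla_Motors', 'is the CEO of'),
    Ex('Tesla_Motors', 'Elon_Musk', 'CEO'),
    Ex('Elon_Musk', 'SpaceX', 'founded'),
])
features = middle_bag_of_words(
    Triple('worked_at', 'Elon_Musk', 'Tesla_Motors'), toy_corpus, Counter())
print(features)
```

The counter aggregates evidence from every sentence that mentions the pair, so one feature vector describes the pair rather than a single sentence.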

In [20]:
class Dataset(object):
    """
    Class for unifying a `Corpus` and a `KB`.
    Parameters
    ----------
    corpus : `Corpus`
    kb : `KB`
    """
    def __init__(self, corpus, kb):
        self.corpus = corpus
        self.kb = kb

    def find_unrelated_pairs(self):
        unrelated_pairs = set()
        for ex in self.corpus.examples:
            if self.kb.get_triples_for_entities(ex.entity_1, ex.entity_2):
                continue
            if self.kb.get_triples_for_entities(ex.entity_2, ex.entity_1):
                continue
            unrelated_pairs.add((ex.entity_1, ex.entity_2))
            unrelated_pairs.add((ex.entity_2, ex.entity_1))
        return unrelated_pairs

    def featurize(self, kbts_by_rel, featurizers, vectorizer=None, vectorize=True):
        """
        Featurize by relation.
        Parameters
        ----------
        kbts_by_rel : dict
            A map from relation (str) to lists of `KBTriples`.
        featurizers : list of func
            Each function has to have the signature
            `kbt, corpus, feature_counter`, where `kbt` is a `KBTriple`,
            `corpus` is a `Corpus`, and `feature_counter` is a count
            dictionary.
        vectorizer : DictVectorizer or None:
            If None, a new `DictVectorizer` is created and used via
            `fit`. This is primarily for training. If not None, then
            `transform` is used. This is primarily for testing.
        vectorize: bool
            If True, the feature functions in `featurizers` are presumed
            to create feature dicts, and a `DictVectorizer` is used. If
            False, then `featurizers` is required to have exactly one
            function in it, and that function must return exactly the
            sort of objects that the models in the model factory take
            as inputs.
        Returns
        -------
        feat_matrices_by_rel, vectorizer
            where `feat_matrices_by_rel` is a dict mapping relation names
            to (i) lists of representation if `vectorize=False`, else
            to `np.array`s, and (ii) and `vectorizer` is a
            `DictVectorizer` if `vectorize=True`, else None
        """
        if not vectorize:

            feat_matrices_by_rel = defaultdict(list)
            if len(featurizers) != 1:
                raise ValueError(
                    "If `vectorize=False`, the `featurizers` argument "
                    "must contain exactly one function.")
            featurizer = featurizers[0]
            for rel, kbts in kbts_by_rel.items():
                for kbt in kbts:
                    rep = featurizer(kbt, self.corpus)
                    feat_matrices_by_rel[rel].append(rep)
            return feat_matrices_by_rel, None

        # Create feature counters for all instances (kbts).
        feat_counters_by_rel = defaultdict(list)
        for rel, kbts in kbts_by_rel.items():
            for kbt in kbts:
                feature_counter = Counter()
                for featurizer in featurizers:
                    feature_counter = featurizer(kbt, self.corpus, feature_counter)
                feat_counters_by_rel[rel].append(feature_counter)
        feat_matrices_by_rel = defaultdict(list)
        # If we haven't been given a Vectorizer, create one and fit
        # it to all the feature counters.
        if vectorizer is None:
            vectorizer = DictVectorizer(sparse=True)
            def traverse_dicts():
                for dict_list in feat_counters_by_rel.values():
                    for d in dict_list:
                        yield d
            vectorizer.fit(traverse_dicts())
        # Now use the Vectorizer to transform feature dictionaries
        # into feature matrices.
        for rel, feat_counters in feat_counters_by_rel.items():
            feat_matrices_by_rel[rel] = vectorizer.transform(feat_counters)
        return feat_matrices_by_rel, vectorizer

    def build_dataset(self, include_positive=True, sampling_rate=0.1, seed=1):
        unrelated_pairs = self.find_unrelated_pairs()
        random.seed(seed)
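        # `random.sample` needs a sequence (sets are rejected since Python 3.11);
        # sorting also makes the sample reproducible for a given seed.
        unrelated_pairs = random.sample(
            sorted(unrelated_pairs), int(sampling_rate * len(unrelated_pairs)))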
        kbts_by_rel = defaultdict(list)
        labels_by_rel = defaultdict(list)
        for index, rel in enumerate(self.kb.all_relations):
            if include_positive:
                for kbt in self.kb.get_triples_for_relation(rel):
                    kbts_by_rel[rel].append(kbt)
                    labels_by_rel[rel].append(True)
            for sbj, obj in unrelated_pairs:
                kbts_by_rel[rel].append(KBTriple(rel, sbj, obj))
                labels_by_rel[rel].append(False)
        return kbts_by_rel, labels_by_rel

    def build_splits(self,
            split_names=['tiny', 'train', 'dev'],
            split_fracs=[0.01, 0.74, 0.25],
            seed=1):
        if len(split_names) != len(split_fracs):
            raise ValueError('split_names and split_fracs must be of equal length')
        if not np.isclose(sum(split_fracs), 1.0):  # tolerate floating-point rounding
            raise ValueError('split_fracs must sum to 1')
        n = len(split_fracs) # for convenience only

        def split_list(xs):
            xs = sorted(xs) # sorted for reproducibility
            if seed:
                random.seed(seed)
            random.shuffle(xs)
            split_points = [0] + [int(round(frac * len(xs)))
                                  for frac in np.cumsum(split_fracs)]
            return [xs[split_points[i]:split_points[i + 1]] for i in range(n)]

        # first, split the entities that appear as subjects in the KB
        sbjs = list(set([kbt.sbj for kbt in self.kb.kb_triples]))
        sbj_splits = split_list(sbjs)
        sbj_split_dict = {sbj: i for i, split in enumerate(sbj_splits)
                                 for sbj in split}
        # next, split the KB triples based on their subjects
        kbt_splits = [[kbt for kbt in self.kb.kb_triples if sbj_split_dict[kbt.sbj] == i]
                      for i in range(n)]
        # now split examples based on the entities they contain
        ex_splits = [[] for i in range(n + 1)] # include an extra split
        for ex in self.corpus.examples:
            if ex.entity_1 in sbj_split_dict:
                # if entity_1 is a sbj in the KB, assign example to split of that sbj
                ex_splits[sbj_split_dict[ex.entity_1]].append(ex)
            elif ex.entity_2 in sbj_split_dict:
                # if entity_2 is a sbj in the KB, assign example to split of that sbj
                ex_splits[sbj_split_dict[ex.entity_2]].append(ex)
            else:
                # otherwise, put in extra split to be redistributed
                ex_splits[-1].append(ex)
        # reallocate the examples that weren't assigned to a split on first pass
        extra_ex_splits = split_list(ex_splits[-1])
        ex_splits = [ex_splits[i] + extra_ex_splits[i] for i in range(n)]

        # create a Corpus and a KB for each split
        data = {}
        for i in range(n):
            data[split_names[i]] = Dataset(Corpus(ex_splits[i]), KB(kbt_splits[i]))
        data['all'] = self
        return data

    def count_examples(self):
        counter = Counter()
        for rel in self.kb.all_relations:
            for kbt in self.kb.get_triples_for_relation(rel):
                # count examples in both forward and reverse directions
                counter[rel] += len(self.corpus.get_examples_for_entities(kbt.sbj, kbt.obj))
                counter[rel] += len(self.corpus.get_examples_for_entities(kbt.obj, kbt.sbj))
        # report results
        print('{:20s} {:>10s} {:>10s} {:>10s}'.format(
            '', '', '', 'examples'))
        print('{:20s} {:>10s} {:>10s} {:>10s}'.format(
            'relation', 'examples', 'triples', '/triple'))
        print('{:20s} {:>10s} {:>10s} {:>10s}'.format(
            '--------', '--------', '-------', '-------'))
        for rel in self.kb.all_relations:
            nx = counter[rel]
            nt = len(self.kb.get_triples_for_relation(rel))
            print('{:20s} {:10d} {:10d} {:10.2f}'.format(
                rel, nx, nt, 1.0 * nx / nt))

    def __str__(self):
        return "{}; {}".format(self.corpus, self.kb)

    def __repr__(self):
        return str(self)



In [21]:
dataset = Dataset(corpus, kb)

For each of the 16 relations in the KB and each entity pair in that relation, we look up the corpus examples that contain the pair and count them.

In [22]:
dataset.count_examples()

                                             examples
relation               examples    triples    /triple
--------               --------    -------    -------
adjoins                   58854       1702      34.58
author                    11768       2671       4.41
capital                    7443        522      14.26
contains                  75952      18681       4.07
film_performance           8994       3947       2.28
founders                   5846       1960       2.98
genre                      1576        824       1.91
has_sibling                8525       2563       3.33
has_spouse                12013       2994       4.01
is_a                       5112       2542       2.01
nationality                3403       1598       2.13
parents                    3802       1586       2.40
place_of_birth             1657       1097       1.51
place_of_death             1523        831       1.83
profession                 1851       1216       1.52
worked_at                  3

### Negative examples

We have obtained a large number of examples illustrating particular relations. To train a classifier we also need negative examples. For these we take co-occurrences of NEs that are not linked by any relation in our KB.


In [23]:
unrelated_pairs = dataset.find_unrelated_pairs()
print('Found {} unrelated pairs, including:'.format(len(unrelated_pairs)))
for pair in list(unrelated_pairs)[:10]:
    print('   ', pair)

Found 247405 unrelated pairs, including:
    ('Le_Corbusier', 'Alvar_Aalto')
    ('Philip_Glass', 'Robert_Frank')
    ('Puerto_Rico', 'Boston')
    ('Show_Boat', 'The_King_and_I')
    ('Herat', 'Kandahar_Province')
    ('Genre', 'Dance_music')
    ('Iowa_Central_Community_College', 'Morrisville_State_College')
    ('The_Fresh_Prince_of_Bel-Air', 'Baywatch')
    ('Chris_Pratt', 'Casey_Bond')
    ('Amit_Chaudhuri', 'Royal_Society_of_Literature')


### Multi-label classification

Why do we choose multi-label classification?
Implement a function that counts how many cases in the KB have several relations holding between the same two NEs.
Useful members: `kb.all_entity_pairs`, `kb.get_triples_for_entities`


In [24]:
counts = Counter()
for a, b in kb.all_entity_pairs:
    rels = kb.get_triples_for_entities(a, b)
    if len(rels) > 1:
        rels = tuple([rel.rel for rel in rels])
        counts[rels] += 1

In [25]:
counts

Counter({('adjoins', 'contains'): 6,
         ('author', 'founders'): 1,
         ('capital', 'contains'): 207,
         ('contains', 'adjoins'): 5,
         ('contains', 'capital'): 196,
         ('has_sibling', 'has_spouse'): 3,
         ('has_spouse', 'has_sibling'): 4,
         ('is_a', 'is_a'): 3,
         ('is_a', 'is_a', 'is_a'): 1,
         ('is_a', 'profession'): 615,
         ('nationality', 'place_of_birth'): 32,
         ('nationality', 'place_of_birth', 'place_of_death'): 1,
         ('nationality', 'place_of_death'): 3,
         ('nationality', 'worked_at'): 1,
         ('parents', 'has_spouse'): 1,
         ('place_of_birth', 'nationality'): 29,
         ('place_of_birth', 'place_of_death'): 79,
         ('place_of_death', 'nationality'): 6,
         ('place_of_death', 'place_of_birth'): 64,
         ('place_of_death', 'place_of_birth', 'nationality'): 2,
         ('profession', 'is_a'): 601,
         ('worked_at', 'parents'): 2})

Most of the relation combinations look very natural. That is why we treat the task as multi-label: it is unclear how to choose a single relation for an entity pair out of the three `nationality`, `place_of_birth`, `place_of_death`.
The simplest way to do multi-label classification is to predict a binary output separately for each relation. In that case, however, we lose the ability to exploit dependencies between relations that are clearly correlated.
In effect this is equivalent to binary classification of relation-subject-object triples (does this relation hold between these participants or not).
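
A minimal sketch of this reduction on invented data: each entity pair gets one boolean label per relation, and each relation's label list could then be handed to its own binary classifier. The entity names `A`, `B`, `X`, `Y` and the variable names are hypothetical.

```python
# Toy triples (relation, subject, object), invented for illustration only.
toy_triples = [
    ('nationality', 'A', 'X'),
    ('place_of_birth', 'A', 'X'),
    ('place_of_death', 'A', 'X'),
    ('nationality', 'B', 'Y'),
]

relations = sorted({rel for rel, _, _ in toy_triples})
pairs = sorted({(sbj, obj) for _, sbj, obj in toy_triples})
known = set(toy_triples)

# One independent binary problem per relation:
# does (rel, sbj, obj) hold for this pair or not?
labels_by_rel = {
    rel: [(rel, sbj, obj) in known for sbj, obj in pairs]
    for rel in relations
}

for rel in relations:
    print(rel, labels_by_rel[rel])
```

The pair `('A', 'X')` is positive for all three relations at once, which a single multi-class label could not express.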

### Building the training dataset

For simplicity we train a separate classifier for each relation. For training we need a dataset of the following form: a list of `KBTriple`s and a list of boolean labels, one per triple.
Positive examples come from the KB; negative examples are sampled from the unrelated pairs.

In [26]:
kbts_by_rel, labels_by_rel = dataset.build_dataset(
    include_positive=True, sampling_rate=0.1, seed=1)

In [27]:
print(kbts_by_rel['adjoins'][1], labels_by_rel['adjoins'][1])

KBTriple(rel='adjoins', sbj='Thailand', obj='Laos') True


In [28]:
print(kbts_by_rel['capital'][637], labels_by_rel['capital'][637])

KBTriple(rel='capital', sbj='The_Bicentennial_Man', obj='Film') False


### Splitting the data

We split the data into train / dev / tiny (the last is 1% of the data and is handy for checking that a model runs at all).

The split is non-trivial because we need to split both the corpus and the KB.
Ideally, entities seen in training would not appear in the test split, and the training portion of the KB would contain only entities from the training corpus. To get close to this, we split as follows:

- split the NEs that occur as subjects across the splits, so that the splits do not overlap
- assign each triple to the split of its subject
- split the corpus examples:

  - if the first NE of the example belongs to a split, assign the whole example to that split
  - otherwise, if the second NE of the example belongs to a split, assign the whole example to that split
  - if neither entity belongs to a split, assign the example to a split at random
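
The first two steps can be sketched on toy data as follows. This is a simplified illustration, not the notebook's `build_splits`: the function name `split_by_subject` and the triples are invented, and the corpus examples are left out.

```python
import random

def split_by_subject(triples, fracs, seed=1):
    """Partition (rel, sbj, obj) triples so that each subject lands in one split."""
    subjects = sorted({sbj for _, sbj, _ in triples})  # sorted for reproducibility
    random.Random(seed).shuffle(subjects)
    # Turn the fractions into cumulative cut points over the subject list.
    cuts, total = [0], 0.0
    for frac in fracs:
        total += frac
        cuts.append(int(round(total * len(subjects))))
    split_of = {}
    for i in range(len(fracs)):
        for sbj in subjects[cuts[i]:cuts[i + 1]]:
            split_of[sbj] = i
    # Each triple follows its subject.
    splits = [[] for _ in fracs]
    for triple in triples:
        splits[split_of[triple[1]]].append(triple)
    return splits

# Toy triples, invented for illustration only.
toy_triples = [
    ('adjoins', 'France', 'Spain'), ('capital', 'France', 'Paris'),
    ('adjoins', 'Spain', 'France'), ('capital', 'Panama', 'Panama_City'),
    ('author', 'Uncle_Silas', 'Sheridan_Le_Fanu'),
]
train, dev = split_by_subject(toy_triples, fracs=[0.5, 0.5])
print(len(train), len(dev))
```

Subjects never cross splits, but an entity can still appear as an object in another split (here `France` and `Spain`), which is why the scheme only approximates the ideal separation.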

In [29]:
splits = dataset.build_splits(
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.74, 0.25],
    seed=1)

splits

{'all': Corpus with 331,696 examples; KB with 45,884 triples,
 'dev': Corpus with 79,219 examples; KB with 11,210 triples,
 'tiny': Corpus with 3,474 examples; KB with 445 triples,
 'train': Corpus with 249,003 examples; KB with 34,229 triples}

### Метрики

Будем использовать макро-усреднение по f-мере. Так как в задаче наполнения KB нам не так страшно пропустить отношение, как добавить много ложных, будем использовать f0,5-меру (за низкую точность штрафует сильнее, чем за низкую полноту)

Evaluation is done with the function `rel_ext.evaluate()`.
It takes:
- `splits`: a dictionary of `Dataset` instances
- `classifier`: a function that takes a list of `KBTriple`s and returns a list of boolean predictions
- `test_split`: the split to test on
- `verbose`: whether to print the per-relation statistics
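
To make the `classifier` argument concrete, here is a minimal sketch of a function with the expected shape. `KBTriple` below is a stand-in namedtuple so that the snippet runs on its own, and `always_true_classifier` is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for the KBTriple type used in this notebook.
KBTriple = namedtuple('KBTriple', 'rel, sbj, obj')


def always_true_classifier(kb_triples):
    """Take a list of KBTriples and return one boolean prediction per triple."""
    return [True for _ in kb_triples]


toy_triples = [KBTriple('adjoins', 'Thailand', 'Laos'),
               KBTriple('adjoins', 'Sicily', 'Italy')]
predictions = always_true_classifier(toy_triples)
```

A function like this could be passed as `evaluate(splits, always_true_classifier)`. Because it predicts `True` everywhere, its recall would be 1.0 and its precision would equal the share of positive examples.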

In [30]:
def print_statistics_header():
    print('{:20s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s}'.format(
        'relation', 'precision', 'recall', 'f-score', 'support', 'size'))
    print('{:20s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s}'.format(
        '-' * 18, '-' * 9, '-' * 9, '-' * 9, '-' * 9, '-' * 9))


def print_statistics_row(rel, result):
    print('{:20s} {:10.3f} {:10.3f} {:10.3f} {:10d} {:10d}'.format(rel, *result))


def print_statistics_footer(avg_result):
    print('{:20s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s}'.format(
        '-' * 18, '-' * 9, '-' * 9, '-' * 9, '-' * 9, '-' * 9))
    print('{:20s} {:10.3f} {:10.3f} {:10.3f} {:10d} {:10d}'.format('macro-average', *avg_result))


def macro_average_results(results):
    avg_result = [np.average([r[i] for r in results.values()]) for i in range(3)]
    avg_result.append(np.sum([r[3] for r in results.values()]))
    avg_result.append(np.sum([r[4] for r in results.values()]))
    return avg_result


def evaluate(splits, classifier, test_split='dev', sampling_rate=0.1, verbose=True):
    test_kbts_by_rel, true_labels_by_rel = splits[test_split].build_dataset(sampling_rate=sampling_rate)
    results = {}
    if verbose:
        print_statistics_header()
    for rel in splits['all'].kb.all_relations:
        pred_labels = classifier(test_kbts_by_rel[rel])
        stats = precision_recall_fscore_support(true_labels_by_rel[rel], pred_labels, beta=0.5)
        stats = [stat[1] for stat in stats]  # stat[1] is the value for label True
        stats.append(len(pred_labels)) # number of examples
        results[rel] = stats
        if verbose:
            print_statistics_row(rel, results[rel])
    avg_result = macro_average_results(results)
    if verbose:
        print_statistics_footer(avg_result)
    return avg_result[2]  # return f_0.5 score as summary statistic


### Random guessing

Let's try random guessing as a baseline to see how bad things can get.

In [31]:
def lift(f):
    return lambda xs: [f(x) for x in xs]

def make_random_classifier(p=0.50):
    def random_classify(kb_triple):
        return random.random() < p
    return lift(random_classify)

In [32]:
evaluate(splits, make_random_classifier())

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.062      0.543      0.075        407       7057
author                    0.095      0.519      0.113        657       7307
capital                   0.019      0.508      0.023        126       6776
contains                  0.402      0.501      0.419       4487      11137
film_performance          0.127      0.494      0.149        984       7634
founders                  0.064      0.484      0.078        469       7119
genre                     0.031      0.507      0.038        205       6855
has_sibling               0.085      0.494      0.102        625       7275
has_spouse                0.098      0.481      0.116        754       7404
is_a                      0.085      0.503      0.102        618       7268
nationality               0.062      0.567      0.076        386       7036
parents     

0.09720548338767715
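
A quick sanity check on the table above: a classifier that ignores its input and answers `True` with probability p should have recall close to p and precision close to the share of positives, support / size. The helper name `expected_random_scores` is hypothetical; the numbers are copied from the `adjoins` and `contains` rows.

```python
def expected_random_scores(support, size, p=0.5):
    """Expected (precision, recall) of a classifier that says True with probability p."""
    precision = support / size  # share of positives among the evaluated examples
    recall = p                  # each positive is accepted with probability p
    return precision, recall


adjoins_precision, adjoins_recall = expected_random_scores(support=407, size=7057)
contains_precision, contains_recall = expected_random_scores(support=4487, size=11137)
```

The expected precision is about 0.058 for `adjoins` and about 0.403 for `contains`, close to the observed 0.062 and 0.402. This is why `contains`, the relation with the largest share of positives, gets the highest score under random guessing.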

## Baseline: frequent phrases between entities

The idea is to collect, for each relation, the most frequent phrases that occur between its participants. Classification then works as follows: for a new entity pair, if any of the contexts for this pair contains one of the frequent phrases of a given relation, we assign the pair to that relation.
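
The counting step can be illustrated on toy data before looking at the full function. The `observations` list below is invented for this sketch; in the notebook the middles come from the corpus.

```python
from collections import Counter, defaultdict

# Hypothetical (relation, middle phrase) observations.
observations = [
    ('author', 'by'), ('author', 'by'), ('author', 'by'),
    ('author', ','), ('author', ','),
    ('author', 'written by'),
    ('adjoins', ','), ('adjoins', ','), ('adjoins', 'and'),
]

# Count how often each middle phrase occurs for each relation.
middles_by_rel = defaultdict(Counter)
for rel, middle in observations:
    middles_by_rel[rel][middle] += 1

# Keep only the top_k most frequent middles per relation.
top_k = 2
top_middles = {rel: {mid for mid, _ in counts.most_common(top_k)}
               for rel, counts in middles_by_rel.items()}


def matches(rel, middle):
    """True if the middle phrase is among the frequent ones for the relation."""
    return middle in top_middles[rel]
```

With `top_k = 2`, the phrase `by` counts as evidence for `author`, while the rarer `written by` does not. This hints at why such a baseline tends to have low recall.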

In [35]:
def find_common_middles(split, top_k=3, show_output=False):
    corpus = split.corpus
    kb = split.kb
    mids_by_rel = {
        'fwd': defaultdict(lambda: defaultdict(int)),
        'rev': defaultdict(lambda: defaultdict(int))}
    for rel in kb.all_relations:
        for kbt in kb.get_triples_for_relation(rel):
            for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
                mids_by_rel['fwd'][rel][ex.middle] += 1
            for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
                mids_by_rel['rev'][rel][ex.middle] += 1
    def most_frequent(mid_counter):
        return sorted([(cnt, mid) for mid, cnt in mid_counter.items()], reverse=True)[:top_k]
    for rel in kb.all_relations:
        for direction in ['fwd', 'rev']:  # renamed to avoid shadowing the built-in dir()
            top = most_frequent(mids_by_rel[direction][rel])
            if show_output:
                for cnt, mid in top:
                    print('{:20} {:7} {:10} {:}'.format(rel, direction, cnt, mid))
            mids_by_rel[direction][rel] = set([mid for cnt, mid in top])
    return mids_by_rel

_ = find_common_middles(splits['train'], show_output=True)

adjoins              fwd           7667 ,
adjoins              fwd           5134 and
adjoins              fwd            903 , and
adjoins              rev           4582 ,
adjoins              rev           3000 and
adjoins              rev            507 , and
author               fwd           1007 by
author               fwd            124 ,
author               fwd            105 , by
author               rev            816 's
author               rev            210 ‘ s
author               rev            142 ’ s
capital              fwd             33 ,
capital              fwd             17 , after
capital              fwd             14 in
capital              rev           2506 ,
capital              rev            121 in
capital              rev             73 , the capital of
contains             fwd            319 's
contains             fwd            296 ,
contains             fwd            211 (
contains             rev          18511 ,
contains             rev       

In [33]:
def train_top_k_middles_classifier(top_k=3):
    split = splits['train']
    corpus = split.corpus
    top_k_mids_by_rel = find_common_middles(split=split, top_k=top_k)
    def classify(kb_triple):
        fwd_mids = top_k_mids_by_rel['fwd'][kb_triple.rel]
        rev_mids = top_k_mids_by_rel['rev'][kb_triple.rel]
        for ex in corpus.get_examples_for_entities(kb_triple.sbj, kb_triple.obj):
            if ex.middle in fwd_mids:
                return True
        for ex in corpus.get_examples_for_entities(kb_triple.obj, kb_triple.sbj):
            if ex.middle in rev_mids:
                return True
        return False
    return lift(classify)

In [36]:
evaluate(splits, train_top_k_middles_classifier())

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.273      0.285      0.275        407       7057
author                    0.323      0.078      0.198        657       7307
capital                   0.089      0.159      0.098        126       6776
contains                  0.582      0.064      0.222       4487      11137
film_performance          0.312      0.005      0.024        984       7634
founders                  0.150      0.038      0.095        469       7119
genre                     0.000      0.000      0.000        205       6855
has_sibling               0.263      0.176      0.239        625       7275
has_spouse                0.349      0.211      0.309        754       7404
is_a                      0.070      0.024      0.051        618       7268
nationality               0.103      0.036      0.075        386       7036
parents     

0.11079552290042358

### Bag-of-words baseline

In [None]:
def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter

In [37]:
def train_models(
        splits,
        featurizers,
        split_name='train',
        model_factory=(lambda: LogisticRegression(
            fit_intercept=True, solver='liblinear', random_state=42)),
        sampling_rate=0.1,
        vectorize=True,
        verbose=True):
    train_dataset = splits[split_name]
    train_o, train_y = train_dataset.build_dataset(sampling_rate=sampling_rate)
    train_X, vectorizer = train_dataset.featurize(
        train_o, featurizers, vectorize=vectorize)
    models = {}
    for rel in splits['all'].kb.all_relations:
        models[rel] = model_factory()
        models[rel].fit(train_X[rel], train_y[rel])
    return {
        'featurizers': featurizers,
        'vectorizer': vectorizer,
        'models': models,
        'all_relations': splits['all'].kb.all_relations,
        'vectorize': vectorize}


def predict(splits, train_result, split_name='dev', sampling_rate=0.1, vectorize=True):
    assess_dataset = splits[split_name]
    assess_o, assess_y = assess_dataset.build_dataset(sampling_rate=sampling_rate)
    test_X, _ = assess_dataset.featurize(
        assess_o,
        featurizers=train_result['featurizers'],
        vectorizer=train_result['vectorizer'],
        vectorize=vectorize)
    predictions = {}
    for rel in train_result['all_relations']:
        predictions[rel] = train_result['models'][rel].predict(test_X[rel])
    return predictions, assess_y


def evaluate_predictions(predictions, test_y, verbose=True):
    results = {}  # one result row for each relation
    if verbose:
        print_statistics_header()
    for rel, preds in predictions.items():
        stats = precision_recall_fscore_support(test_y[rel], preds, beta=0.5)
        stats = [stat[1] for stat in stats]  # stat[1] is the value for label True
        stats.append(len(test_y[rel]))
        results[rel] = stats
        if verbose:
            print_statistics_row(rel, results[rel])
    avg_result = macro_average_results(results)
    if verbose:
        print_statistics_footer(avg_result)
    return avg_result[2]  # return f_0.5 score as summary statistic



In [38]:
kbt = kb.kb_triples[2]
kbt

KBTriple(rel='has_sibling', sbj='Ari_Emanuel', obj='Rahm_Emanuel')

In [None]:
splits['train'].corpus.get_examples_for_entities(kbt.sbj, kbt.obj)

[Example(entity_1='Ari_Emanuel', entity_2='Rahm_Emanuel', left='stop copyright infringers , asking them to punish them as they would shoplifters . I guess Murdoch doesn ’ t understand the different between theft and copyright infringement . And', mention_1='Ari Emmanuel', middle=',', mention_2='Rahm Emmanuel', right='‘ s brother ( and the inspiration for the Entourage character Ari ) , has been lobbying President Obama to implement some sort of three-strikes policy , like they', left_POS="stop/VB copyright/NN infringers/NNS ,/, asking/VBG them/PRP to/TO punish/VB them/PRP as/IN they/PRP would/MD shoplifters/NNS ./. I/PRP guess/VBP Murdoch/NNP doesn/NN '/POS t/NN understand/VBP the/DT different/JJ between/IN theft/NN and/CC copyright/NN infringement/NN ./. And/CC", mention_1_POS='Ari/NNP Emmanuel/NNP', middle_POS=',/,', mention_2_POS='Rahm/NNP Emmanuel/NNP', right_POS='`/`` s/NNS brother/NN -LRB-/-LRB- and/CC the/DT inspiration/NN for/IN the/DT Entourage/NN character/NN Ari/NNP -RRB-/-R

In [None]:
simple_bag_of_words_featurizer(kbt, splits['train'].corpus, Counter())

Counter({"'s": 1,
         ',': 4,
         '-': 1,
         '--': 1,
         'Chief': 3,
         'House': 2,
         'Obama': 1,
         'President': 1,
         'Staff': 3,
         'White': 2,
         '[': 2,
         ']': 2,
         'brother': 2,
         'congressman': 1,
         'is': 1,
         'of': 5})
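
Before a model can be trained, such counters have to be turned into rows of a feature matrix. The notebook imports scikit-learn's `DictVectorizer`, which produces a sparse matrix; the pure-Python sketch below only illustrates the idea with dense lists and is not how `featurize` is implemented. The name `vectorize_counters` and the toy counters are assumptions.

```python
from collections import Counter


def vectorize_counters(counters):
    """Turn feature Counters into dense rows over a shared, sorted vocabulary."""
    vocabulary = sorted({feature for counter in counters for feature in counter})
    column = {feature: i for i, feature in enumerate(vocabulary)}
    rows = []
    for counter in counters:
        row = [0] * len(vocabulary)
        for feature, count in counter.items():
            row[column[feature]] = count
        rows.append(row)
    return vocabulary, rows


# One Counter per entity pair.
counters = [Counter({'brother': 2, 'of': 5}), Counter({'of': 1, 'wife': 1})]
vocabulary, rows = vectorize_counters(counters)
```

Each distinct word becomes one column, so the matrix gets as many columns as there are distinct words, and most entries are zero. That is why a sparse representation is used in practice.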

In [None]:
featurized = dataset.featurize(kbts_by_rel, featurizers=[simple_bag_of_words_featurizer])

In [None]:
len(kbts_by_rel['worked_at'])

25890

In [None]:
featurized[0]['worked_at']

<25890x34383 sparse matrix of type '<class 'numpy.float64'>'
	with 116812 stored elements in Compressed Sparse Row format>

In [None]:
train_result = train_models(splits, featurizers=[simple_bag_of_words_featurizer])

In [None]:
predictions, true_labels = predict(splits, train_result, split_name='dev')

In [None]:
evaluate_predictions(predictions, true_labels)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.860      0.393      0.695        407       7057
author                    0.810      0.505      0.723        657       7307
capital                   0.681      0.254      0.510        126       6776
contains                  0.772      0.599      0.730       4487      11137
film_performance          0.787      0.578      0.734        984       7634
founders                  0.826      0.414      0.688        469       7119
genre                     0.464      0.156      0.333        205       6855
has_sibling               0.878      0.254      0.589        625       7275
has_spouse                0.901      0.337      0.675        754       7404
is_a                      0.690      0.223      0.487        618       7268
nationality               0.607      0.168      0.399        386       7036
parents     

0.5644409661831543

In [None]:
def find_new_relation_instances(
        dataset,
        featurizers,
        train_split='train',
        test_split='dev',
        model_factory=(lambda: LogisticRegression(
            fit_intercept=True, solver='liblinear', random_state=42)),
        k=10,
        vectorize=True,
        verbose=True):
    splits = dataset.build_splits()
    # train models
    train_result = train_models(
        splits,
        split_name=train_split,
        featurizers=featurizers,
        model_factory=model_factory,
        vectorize=vectorize,
        verbose=verbose)  # forward the caller's setting instead of hard-coding True
    test_split = splits[test_split]
    neg_o, neg_y = test_split.build_dataset(
        include_positive=False,
        sampling_rate=1.0)
    neg_X, _ = test_split.featurize(
        neg_o,
        featurizers=featurizers,
        vectorizer=train_result['vectorizer'],
        vectorize=vectorize)
    # Report highest confidence predictions:
    for rel, model in train_result['models'].items():
        print('Highest probability examples for relation {}:\n'.format(rel))
        probs = model.predict_proba(neg_X[rel])
        probs = [prob[1] for prob in probs] # probability for class True
        sorted_probs = sorted([(p, idx) for idx, p in enumerate(probs)], reverse=True)
        for p, idx in sorted_probs[:k]:
            print('{:10.3f} {}'.format(p, neg_o[rel][idx]))
        print()

In [None]:
find_new_relation_instances(dataset,featurizers=[simple_bag_of_words_featurizer])

Highest probability examples for relation adjoins:

     1.000 KBTriple(rel='adjoins', sbj='Canada', obj='Vancouver')
     1.000 KBTriple(rel='adjoins', sbj='Vancouver', obj='Canada')
     1.000 KBTriple(rel='adjoins', sbj='Sicily', obj='Italy')
     1.000 KBTriple(rel='adjoins', sbj='Italy', obj='Sicily')
     1.000 KBTriple(rel='adjoins', sbj='Atlantic_Ocean', obj='Mexico')
     1.000 KBTriple(rel='adjoins', sbj='Mexico', obj='Atlantic_Ocean')
     1.000 KBTriple(rel='adjoins', sbj='Lahore', obj='Pakistan')
     1.000 KBTriple(rel='adjoins', sbj='Pakistan', obj='Lahore')
     1.000 KBTriple(rel='adjoins', sbj='Europe', obj='Great_Britain')
     1.000 KBTriple(rel='adjoins', sbj='Great_Britain', obj='Europe')

Highest probability examples for relation author:

     1.000 KBTriple(rel='author', sbj='Brave_New_World', obj='Aldous_Huxley')
     1.000 KBTriple(rel='author', sbj='A_Christmas_Carol', obj='Charles_Dickens')
     1.000 KBTriple(rel='author', sbj='Aldous_Huxley', obj='The_Door

### Exercise (think about it and try to implement it)
So, despite its simplicity, bag of words performs quite well. How could we extract features more cleverly from the sentences that illustrate an entity pair? As an exercise, improve the baseline, for example by working out how to use vector representations of the sentence contexts.
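
As one possible starting point (a simpler idea than context vectors, and not a reference solution), the featurizer could record the direction in which each word was seen, so that `by` between subject and object differs from `by` between object and subject. `ToyCorpus`, `Example`, and `KBTriple` below are minimal stand-ins so that the sketch runs on its own, and the suffixes `_SO` and `_OS` are arbitrary.

```python
from collections import Counter, namedtuple

# Minimal stand-ins for the notebook's types (assumptions for this sketch).
Example = namedtuple('Example', 'entity_1, entity_2, middle')
KBTriple = namedtuple('KBTriple', 'rel, sbj, obj')


class ToyCorpus:
    def __init__(self, examples):
        self.examples = examples

    def get_examples_for_entities(self, e1, e2):
        return [ex for ex in self.examples
                if ex.entity_1 == e1 and ex.entity_2 == e2]


def directional_bag_of_words_featurizer(kbt, corpus, feature_counter):
    """Bag of words over middles, with a suffix marking the entity order."""
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word + '_SO'] += 1  # subject precedes object
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word + '_OS'] += 1  # object precedes subject
    return feature_counter


corpus = ToyCorpus([
    Example('Brave_New_World', 'Aldous_Huxley', 'by'),
    Example('Aldous_Huxley', 'Brave_New_World', "'s novel"),
])
kbt = KBTriple('author', 'Brave_New_World', 'Aldous_Huxley')
features = directional_bag_of_words_featurizer(kbt, corpus, Counter())
```

The function has the same signature as `simple_bag_of_words_featurizer`, so it could presumably be passed in the `featurizers` list in the same way.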

In [None]:
#your code here