# Complete System
In this notebook we will be joining the pieces developed in the previous notebooks to create the final pipeline used to obtain the topics of a given publication.

Each component has been shown in the following notebooks:
* Named Entity Recognition: Notebook 4.
* Entity Linking, Topic Extraction: Notebook 6.
* Text Preprocessor, Vectorizer, Topic Model: Notebook 3.
* Topic Model (automatic labelling of topics): Notebook 5.

In this notebook we work on the Topic Combination module, which merges the lists of candidate topics produced by the other components into the final list of topics returned by the system, each with a confidence score. Once this module is finished, the complete system is used to obtain the topics of each text in the dataset, and the pipeline is saved so that it can later infer topics from new data.

# Setup

In [1]:
%run __init__.py

In [2]:
import os  # used below to build file paths; harmless if __init__.py already imports it

import pandas as pd

GIT_FILE_PATH = os.path.join(NOTEBOOK_1_RESULTS_DIR, 'git_dataframe.pkl')

git_df = pd.read_pickle(GIT_FILE_PATH)
git_repositories = git_df['full_text_cleaned'].values

## Loading the model
The two main pipelines that retrieve lists of topics from the text are loaded here:

In [3]:
from herc_common.utils import load_object

lda_pipe = load_object(os.path.join(NOTEBOOK_3_RESULTS_DIR, 'lda_pipe_with_labels.pkl'))
ner_pipe = load_object(os.path.join(NOTEBOOK_5_RESULTS_DIR, 'topic_extraction_from_ner_pipe.pkl'))

## Combining topics
To join the results of both pipelines we will be making use of the [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) class from scikit-learn. This class will serve as a single transformer which concatenates the results of the previous pipelines:

In [4]:
from sklearn.pipeline import FeatureUnion

union = FeatureUnion([("ner", ner_pipe),
                      ("lda", lda_pipe)])
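
The concatenation idea can be illustrated without scikit-learn. The following is a minimal pure-Python sketch, not scikit-learn's implementation: `toy_union`, `toy_ner` and `toy_lda` are hypothetical stand-ins introduced only for illustration. Every transformer receives the same documents, and their outputs are joined per document.

```python
from typing import Callable, List, Sequence, Tuple

Topic = Tuple[str, float]
Transformer = Callable[[Sequence[str]], List[List[Topic]]]


def toy_union(transformers: Sequence[Tuple[str, Transformer]],
              documents: Sequence[str]) -> List[List[Topic]]:
    """Run every transformer on the same documents and join their outputs per document."""
    outputs = [transform(documents) for _, transform in transformers]
    combined = []
    for i in range(len(documents)):
        row: List[Topic] = []
        for output in outputs:
            row.extend(output[i])  # keep the transformer order: ner first, then lda
        combined.append(row)
    return combined


# Hypothetical stand-ins for the real pipelines (assumptions, for illustration only).
def toy_ner(documents: Sequence[str]) -> List[List[Topic]]:
    return [[("entity:" + doc.split()[0], 1.0)] for doc in documents]


def toy_lda(documents: Sequence[str]) -> List[List[Topic]]:
    return [[("topic:len", float(len(doc.split())))] for doc in documents]


docs = ["graph databases", "deep learning models"]
combined = toy_union([("ner", toy_ner), ("lda", toy_lda)], docs)
print(combined)
```

Each inner list holds the candidates of one document, with the _ner_ candidates followed by the _lda_ ones.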

## Building the final pipeline
To build our final system, we will make use of a custom class that will combine the topics from the _ner_ and _lda_ pipelines. Once the topics are concatenated by the feature union they will go to the combiner, where a final list of topics will be returned:

In [5]:
from sklearn.pipeline import Pipeline

from herc_common.topic import TopicCombiner


combiner = TopicCombiner()
final_pipe = Pipeline([('union', union),
                      ('combiner', combiner)])
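
The internals of `TopicCombiner` are not shown in this notebook. As a rough illustration of what a combination step could look like, the sketch below merges duplicate candidates by summing their scores and then normalises the totals. The function `combine_topics` and its scoring rule are assumptions made for this example, not the behaviour of `herc_common.topic.TopicCombiner`.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def combine_topics(candidates: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Merge duplicate topics by summing scores, normalise, and rank by score.

    Hypothetical combination rule, used only to illustrate the idea.
    """
    totals = defaultdict(float)
    for topic, score in candidates:
        totals[topic] += score
    norm = sum(totals.values())
    if norm == 0:
        return []
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(((topic, score / norm) for topic, score in totals.items()),
                  key=lambda pair: (-pair[1], pair[0]))


# Candidates as they might arrive from the union: the same topic can appear twice.
candidates = [("biology", 0.5), ("statistics", 0.25), ("biology", 0.25)]
ranked = combine_topics(candidates)
print(ranked)
```

A rule like this rewards topics proposed by both pipelines, which is one plausible reason to combine them rather than returning either list on its own.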

In [6]:
import string

import en_core_sci_lg
import en_core_web_md

from collections import Counter

from tqdm import tqdm

en_core_web_md.load()
en_core_sci_lg.load()

<spacy.lang.en.English at 0x10c25f98c48>

In [7]:
final_pipe.transform([git_repositories[-1]])

[[(biology (Q420), 0.1483065953654189),
  (mathematics (Q395), 0.14788482047635976),
  (geology (Q1069), 0.147849780831655),
  (mathematical analysis (Q7754), 0.14594784235761898),
  (statistics (Q12483), 0.1456752655538695),
  (science (Q336), 0.14472921257103097),
  (education (Q8434), 0.14338235294117646)]]
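
The result holds one list of (topic, score) pairs per input text. The sketch below shows one way such a list could be narrowed down to the strongest candidates. It assumes plain strings in place of the topic objects returned by the pipeline, and `top_topics` is a hypothetical helper that is not part of `herc_common`.

```python
from typing import List, Sequence, Tuple


def top_topics(doc_topics: Sequence[Tuple[str, float]],
               k: int = 3,
               min_score: float = 0.0) -> List[Tuple[str, float]]:
    """Return at most k topics whose score is at least min_score, best first."""
    ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
    return [(topic, score) for topic, score in ranked if score >= min_score][:k]


# Scores copied from the output above; labels are plain strings here (assumption).
doc_topics = [
    ("biology (Q420)", 0.1483065953654189),
    ("mathematics (Q395)", 0.14788482047635976),
    ("geology (Q1069)", 0.147849780831655),
    ("mathematical analysis (Q7754)", 0.14594784235761898),
    ("statistics (Q12483)", 0.1456752655538695),
    ("science (Q336)", 0.14472921257103097),
    ("education (Q8434)", 0.14338235294117646),
]

best = top_topics(doc_topics, k=3)
print([label for label, _ in best])
```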

## Predicting the final topics for the dataset
Now that we have our final system ready, we will obtain the list of topics for every repository in the Git dataset loaded above:

In [8]:
topics = final_pipe.transform(git_repositories)

In [9]:
from rdflib import URIRef, BNode, Literal
from rdflib import Namespace
from rdflib import Graph
from rdflib.namespace import RDF, RDFS

EDMA = Namespace("http://edma.org/challenge/")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")
NIF = Namespace("https://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")

g = Graph()
g.bind('edma', EDMA)
g.bind('itsrdf', ITSRDF)
g.bind('nif', NIF)
text = git_repositories[-1]
text_topics = topics[-1]

def add_text_topics_to_graph(uri, c_id, text, topics, g):
    """Add a text and its (topic, score) pairs to graph g and return the context node."""
    context_element = URIRef(f"{EDMA}{c_id}")
    text_element = Literal(text)
    g.add((context_element, NIF.isString, text_element))
    g.add((context_element, NIF.sourceURL, URIRef(uri)))
    g.add((context_element, NIF.predominantLanguage, Literal('en')))
    for topic, score in topics:
        # Build the topic URI from its label, replacing spaces with underscores.
        topic_label = '_'.join(str(topic).split(' '))
        topic_element = URIRef(f"{EDMA}{topic_label}")
        g.add((topic_element, RDF.type, NIF.annotation))
        # Use the confidence unpacked from the (topic, score) pair.
        g.add((topic_element, NIF.confidence, Literal(score)))
        for lang, val in topic.labels.items():
            g.add((topic_element, RDFS.label, Literal(val, lang=lang)))
        # Named topic_uri so the loop does not shadow the `uri` argument.
        for topic_uri in topic.uris:
            g.add((topic_element, ITSRDF.taIdentRef, URIRef(topic_uri)))
        g.add((context_element, NIF.topic, topic_element))
    return context_element


add_text_topics_to_graph('https://github.com/pauldevos/BasketballAnalytics', '223627473',
                         text, text_topics, g)
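# JSON-LD matches the output shown below; rdflib < 6 needs the rdflib-jsonld plugin for it.
serialized = g.serialize(format="json-ld")
# rdflib < 6 returns bytes here, newer versions return str.
print(serialized.decode("utf-8") if isinstance(serialized, bytes) else serialized)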

[
  {
    "@id": "http://edma.org/challenge/223627473",
    "https://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#isString": [
      {
        "@value": "Repository which contains various scripts and work with various basketball statistics. Basketball Analytics. This repository and scripts in it will be focusing on the statistics revolving around NBA and basketball in general. All code is written in Python using the Jupyter Notebooks which allow live preview of the images and thus making it nice and easy to analyze and visualize data. Current mini \"projects\": 2019-20 Season. Shots and assist by Doncic vs Pelicans . Mid-Range heavy Shot chart by DeMar Derozan . 2018-19 Season. Evolution of Brook Lopez, a look at change of Lopez' shot charts through the seasons . Predicting MVP for 2018-19 nba season. . Here is notebook which shows players that excel in the clutch. . Big Luka Dončić analysis, notebook . James Harden analysis, notebook . 2017-18 Season. Assists distribution b

## Saving results
Finally, we are going to save the complete pipeline for further use with new data:

In [None]:
from herc_common.utils import save_object

save_object(final_pipe, os.path.join(NOTEBOOK_6_RESULTS_DIR, 'final_pipe.pkl'))
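
The saved file can later be restored with `load_object`, as done for the two pipelines at the start of this notebook. The sketch below illustrates the round trip under the assumption that these helpers wrap Python's `pickle` module, which the `.pkl` extension suggests but this notebook does not confirm. `save_with_pickle` and `load_with_pickle` are hypothetical names, and a small dictionary stands in for the real pipeline.

```python
import os
import pickle
import tempfile


def save_with_pickle(obj, path):
    """Serialise obj to path (hypothetical stand-in for save_object)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_with_pickle(path):
    """Restore an object from path (hypothetical stand-in for load_object)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A toy object stands in for final_pipe in this sketch.
toy_pipe = {"steps": ["union", "combiner"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pipe_path = os.path.join(tmp_dir, "final_pipe.pkl")
    save_with_pickle(toy_pipe, pipe_path)
    restored = load_with_pickle(pipe_path)

print(restored == toy_pipe)
```

If pickle is indeed the mechanism, the classes used inside the pipeline must be importable in the session that loads it, since pickle stores references to classes rather than their code.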