# Complete System
In this notebook we join the pieces developed in the previous notebooks into the final pipeline used to obtain the topics of a given publication. The pipeline is illustrated in the following image:
![Dataflow Publications](img/dataflow_publications.png)

Each component was introduced in one of the previous notebooks:
* Named Entity Recognition: Notebook 4.
* Entity Linking, Topic Extraction: Notebook 6.
* Text Preprocessor, Vectorizer, Topic Model: Notebook 3.
* Topic Model (automatic labelling of topics): Notebook 5.

In this notebook we work on the Topic Combination module, which merges the lists of candidate topics produced by the other components into the final list of topics returned by the system, each with a confidence score. Once this module is finished, we use the complete system to obtain the topics of every article in the dataset, and we save the pipeline so that it can later be used to infer topics from new data.

# Setup

In [1]:
%run __init__.py

In [2]:
import os

import pandas as pd

PMC_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'pmc_dataframe.pkl')

pmc_df = pd.read_pickle(PMC_FILE_PATH)
publications = pmc_df['text_cleaned'].values

## Loading the model
The two main pipelines that retrieve lists of topics from the text are loaded here:

In [3]:
from herc_common.utils import load_object

lda_pipe = load_object(os.path.join(NOTEBOOK_5_RESULTS_DIR, 'lda_pipe_with_labels.pkl'))
ner_pipe = load_object(os.path.join(NOTEBOOK_6_RESULTS_DIR, 'topic_extraction_from_ner_pipe.pkl'))

## Combining topics
To join the results of both pipelines we use the [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) class from scikit-learn. It acts as a single transformer that applies each pipeline to the same input and concatenates their results:

In [7]:
from sklearn.pipeline import FeatureUnion

union = FeatureUnion([("ner", ner_pipe),
                      ("lda", lda_pipe)])

## Building the final pipeline
To build the final system, we use a custom class, `TopicCombiner`, that combines the topics from the _ner_ and _lda_ pipelines. Once the feature union has concatenated the topics, they are passed to the combiner, which returns the final list of topics:

In [8]:
from sklearn.pipeline import Pipeline

from herc_common.topic import TopicCombiner


combiner = TopicCombiner()
final_pipe = Pipeline([('union', union),
                      ('combiner', combiner)])
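
The internals of `TopicCombiner` are not shown in this notebook. As a rough illustration of what a combination step could look like, the sketch below merges two scored topic lists for one document. The function name `combine_topics`, the "keep the highest score per label" rule, and the toy scores are all assumptions made for this example, not the behaviour of `herc_common.topic.TopicCombiner`.

```python
from typing import Dict, List, Tuple

Topic = Tuple[str, float]


def combine_topics(ner_topics: List[Topic], lda_topics: List[Topic]) -> List[Topic]:
    """Merge two scored topic lists (hypothetical rule: keep the best score per label)."""
    best: Dict[str, float] = {}
    for label, score in ner_topics + lda_topics:
        key = label.lower()  # treat 'Botany' and 'botany' as the same topic
        if score > best.get(key, float("-inf")):
            best[key] = score
    # Highest-confidence topics first; ties are broken alphabetically.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Toy inputs, made up for illustration only.
ner_topics = [("botany", 0.19), ("taxon", 0.18)]
lda_topics = [("statistics", 0.20), ("Botany", 0.15)]

combined = combine_topics(ner_topics, lda_topics)
print(combined)
```

Other merge rules, such as summing or averaging the scores of topics found by both pipelines, would fit the same interface.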

In [9]:
import string

import en_core_sci_lg
import en_core_web_md

from collections import Counter

from tqdm import tqdm

en_core_web_md.load()
en_core_sci_lg.load()

<spacy.lang.en.English at 0x24f0b02ad08>

In [10]:
final_pipe.transform([publications[-1]])

[[('interaction science', 0.38543432290211627),
  ('statistics', 0.19584709768758848),
  ('management', 0.1923968474733426),
  ('taxon', 0.18988789750629145),
  ('forestry science', 0.18988789750629145),
  ('botany', 0.18846503178928248),
  ('control', 0.1866846603688709)]]

## Predicting the final topics for the dataset
Now that we have our final system ready, we will obtain the list of topics for the Agriculture dataset:

In [11]:
topics = final_pipe.transform(publications)

In [12]:
topics[0][:5]

[('organism', 0.4999017373890787),
 ('chemistry', 0.20094382706652458),
 ('breastfeeding', 0.19604930937175108),
 ('pharmacology', 0.19434628975265017),
 ('sociology', 0.1929542464551966)]
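
Since each topic comes with a confidence score, downstream code may want to keep only the most confident ones. The sketch below applies a cut-off to the five topics shown above. The helper `filter_topics` and the threshold of 0.2 are hypothetical choices for this example and are not part of the pipeline.

```python
def filter_topics(doc_topics, min_score=0.2):
    """Keep the (label, score) pairs whose score reaches min_score."""
    return [(label, score) for label, score in doc_topics if score >= min_score]


# First five topics of the first publication, copied from the output above.
first_doc_topics = [
    ("organism", 0.4999017373890787),
    ("chemistry", 0.20094382706652458),
    ("breastfeeding", 0.19604930937175108),
    ("pharmacology", 0.19434628975265017),
    ("sociology", 0.1929542464551966),
]

confident = filter_topics(first_doc_topics)
print([label for label, _ in confident])
```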

## Saving results
Finally, we save the complete pipeline so that it can be reused on new data:

In [14]:
from herc_common.utils import save_object

save_object(final_pipe, os.path.join(NOTEBOOK_7_RESULTS_DIR, 'final_pipe.pkl'))
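
The implementation of `save_object` is not shown here. Assuming it is a thin wrapper around pickle-style serialization (an assumption, suggested only by the `.pkl` extension), the save and reload cycle can be sketched with the standard library. The dictionary below is a stand-in for `final_pipe`, and the file is written to a temporary directory rather than to `NOTEBOOK_7_RESULTS_DIR`.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
stand_in_pipe = {"steps": ["union", "combiner"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_pipe.pkl")

    # Serialize the object to disk.
    with open(path, "wb") as f:
        pickle.dump(stand_in_pipe, f)

    # Later (for example in another session), load it back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_pipe)
```

With the real pipeline, the reloading side would presumably use the `load_object` helper imported earlier in this notebook, and the session would need the same libraries and spaCy models available before unpickling.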