# Complete System
In this notebook we join the pieces developed in the previous notebooks into the final pipeline that extracts the topics of a given publication. The pipeline is illustrated in the following image:
![Dataflow Publications](img/dataflow_publications.png)

Each component was presented in one of the previous notebooks:
* Named Entity Recognition: Notebook 4.
* Entity Linking, Topic Extraction: Notebook 6.
* Text Preprocessor, Vectorizer, Topic Model: Notebook 3.
* Topic Model (automatic labelling of topics): Notebook 5.

In this notebook we build the Topic Combination module, which merges the lists of candidate topics into the final list returned by the system, with a confidence score for each topic. Once this module is in place, we use the complete system to obtain the topics of every article in the dataset, and we save the pipeline so that it can later infer topics for new data.
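To make the idea of combining candidate lists concrete, here is a minimal sketch of one possible merging rule. The function name `combine_topics`, the "keep the highest score" rule, and the sample scores are all assumptions made for illustration; the real logic lives in `TopicCombiner` in `src.topic` and may well differ.

```python
from typing import Dict, List, Tuple

Topic = Tuple[str, float]


def combine_topics(*topic_lists: List[Topic], top_n: int = 5) -> List[Topic]:
    """Merge several (topic, score) lists into one ranked list.

    Hypothetical rule: when a topic appears in several lists, keep its
    highest score. This is NOT necessarily what TopicCombiner does.
    """
    best: Dict[str, float] = {}
    for topics in topic_lists:
        for name, score in topics:
            key = name.lower()  # treat "Botany" and "botany" as the same topic
            best[key] = max(score, best.get(key, 0.0))
    # Sort by descending score, breaking ties alphabetically for determinism
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Made-up inputs standing in for the NER-based and LDA-based candidate lists
ner_topics = [("Botany", 0.19), ("taxon", 0.25)]
lda_topics = [("botany", 0.40), ("statistics", 0.20)]

combined = combine_topics(ner_topics, lda_topics)
print(combined)
```

Taking the maximum is only one option; averaging or weighting each source would be equally plausible, depending on how much each extractor is trusted.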

# Setup

In [1]:
%run __init__.py

In [2]:
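import os

import pandas as pd

# NOTEBOOK_2_RESULTS_DIR is expected to come from __init__.py (run above)
PMC_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'pmc_dataframe.pkl')

# Load the cleaned publication texts saved in the Notebook 2 results directory
pmc_df = pd.read_pickle(PMC_FILE_PATH)
publications = pmc_df['text_cleaned'].values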



## Loading the models

In [3]:
from src.utils import load_object

lda_pipe = load_object(os.path.join(NOTEBOOK_5_RESULTS_DIR, 'lda_pipe_with_labels.pkl'))
ner_pipe = load_object(os.path.join(NOTEBOOK_6_RESULTS_DIR, 'topic_extraction_from_ner_pipe.pkl'))

## Combining topics

In [4]:
from sklearn.pipeline import FeatureUnion

union = FeatureUnion([("ner", ner_pipe),
                      ("lda", lda_pipe)])

In [5]:
from src.topic import TopicCombiner

combiner = TopicCombiner()

## Building the final pipeline

In [6]:
from sklearn.pipeline import Pipeline

final_pipe = Pipeline([('union', union),
                       ('combiner', combiner)])

In [7]:
import string

import en_core_sci_lg
import en_core_web_md

from collections import Counter

from tqdm import tqdm

en_core_web_md.load()
en_core_sci_lg.load()

<spacy.lang.en.English at 0x7f17dbd9bba8>

In [8]:
final_pipe.transform([publications[-1]])

1it [00:01,  1.22s/it]


[[('interaction science', 0.3854343229021163),
  ('statistics', 0.19608778625954199),
  ('taxon', 0.19010175763182238),
  ('forestry science', 0.18988218988218988),
  ('botany', 0.18844566712517194),
  ('process', 0.18827301878149336),
  ('control', 0.18732907930720147)]]
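The result is one list of `(topic, score)` pairs per publication, ordered by descending score. If only the most confident topics are needed downstream, a simple cut-off can be applied, as in the sketch below. The threshold `MIN_SCORE` and the helper `filter_topics` are hypothetical and not part of the pipeline; the scores are copied from the output above.

```python
# Output of final_pipe.transform for one publication, copied from above
doc_topics = [
    ("interaction science", 0.3854343229021163),
    ("statistics", 0.19608778625954199),
    ("taxon", 0.19010175763182238),
    ("forestry science", 0.18988218988218988),
    ("botany", 0.18844566712517194),
    ("process", 0.18827301878149336),
    ("control", 0.18732907930720147),
]

MIN_SCORE = 0.19  # hypothetical cut-off, chosen only for this example


def filter_topics(topics, min_score):
    """Keep the (topic, score) pairs whose score reaches min_score."""
    return [(name, score) for name, score in topics if score >= min_score]


confident = filter_topics(doc_topics, MIN_SCORE)
for name, score in confident:
    print(f"{name}: {score:.3f}")
```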

## Predicting the final topics for the dataset

In [9]:
topics = final_pipe.transform(publications)

126it [02:06,  1.00s/it]
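Once every publication has its topics, a quick sanity check is to count how often each topic is ranked first. The sketch below assumes that `topics` has the same shape as the single-publication output shown earlier (one list of `(topic, score)` pairs per publication, sorted by descending score). The data in `toy_topics` is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# Toy stand-in for `topics`: one ranked (topic, score) list per publication
toy_topics = [
    [("botany", 0.40), ("statistics", 0.20)],
    [("statistics", 0.35), ("taxon", 0.22)],
    [("botany", 0.31)],
    [],  # a publication for which no topic was found
]


def top_topic_counts(all_topics):
    """Count how often each topic is the highest-ranked one for a publication."""
    # Publications with an empty topic list are skipped
    return Counter(doc[0][0] for doc in all_topics if doc)


counts = top_topic_counts(toy_topics)
print(counts.most_common())
```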


## Saving results

In [14]:
from src.utils import save_object

save_object(final_pipe, os.path.join(NOTEBOOK_7_RESULTS_DIR, 'final_pipe.pkl'))