# 6. Complete System
This notebook joins the pieces developed in the previous notebooks into the final pipeline used to obtain the topics of a given publication.

We will work on the Topic Combination module, which combines the lists of candidate topics produced by the earlier stages into the final list of topics returned by the system, each with a confidence score. Once this module is finished, we will use the complete system to obtain the topics of each article in the dataset and save it for later use when inferring topics from new data.

# Setup

In [1]:
%run __init__.py

INFO:root:Starting logger


In [2]:
import os

import pandas as pd

DF_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'protocols_dataframe.pkl')

df = pd.read_pickle(DF_FILE_PATH)
protocols = df['full_text_cleaned'].values

In [3]:
df.head()

Unnamed: 0,pr_id,title,abstract,materials,procedure,equipment,background,categories,authors,full_text,full_text_no_abstract,full_text_cleaned,full_text_no_abstract_cleaned,num_chars_text
0,e100,Scratch Wound Healing Assay,The scratch wound healing assay has been widel...,"Human MDA-MB-231 cell line (ATCC, catalog numb...",Grow cells in DMEM supplemented with 10% FBS.|...,BD Falcon 24-well tissue culture plate (Fisher...,,Cancer Biology|Invasion & metastasis|Cell biol...,Yanling Chen,Scratch Wound Healing Assay. The scratch wound...,Scratch Wound Healing Assay. Grow cells in DME...,Scratch Wound Healing Assay. The scratch wound...,Scratch Wound Healing Assay. Grow cells in DME...,2583
1,e1029,ADCC Assay Protocol,Antibody-dependent cell-mediated cytotoxicity ...,Raji cells (ATCC)|A/California/04/2009 (H1N1) ...,Preperation of Target Cells\n\n\t\t\n\n\n\t\t\...,Round bottomed 96-well plate|Temperature contr...,,Immunology|Immune cell function|Cytotoxicity|C...,Vikram Srivastava|Zheng Yang|Ivan Fan Ngai...,ADCC Assay Protocol. Antibody-dependent cell-m...,ADCC Assay Protocol. Preperation of Target Cel...,ADCC Assay Protocol. Antibody-dependent cell-m...,ADCC Assay Protocol. Preperation of Target Cel...,3824
2,e1072,Catalase Activity Assay in Candida glabrata,Commensal and pathogenic fungi are exposed to ...,Yeast strains \nNote: BG14 was used as the C. ...,Preparation of total soluble extracts\n\t\t\n\...,Orbital incubator shaker|Microfuge tubes|50 ml...,,Microbiology|Microbial biochemistry|Protein|Ac...,Emmanuel Orta-Zavalza|Marcela Briones-Martin...,Catalase Activity Assay in Candida glabrata. C...,Catalase Activity Assay in Candida glabrata. P...,Catalase Activity Assay in Candida glabrata. C...,Catalase Activity Assay in Candida glabrata. P...,4207
3,e1077,RNA Isolation and Northern Blot Analysis,The northern blot is a technique used in molec...,Vero cells (kidney epithelial cells extracted ...,RNA extraction\n\t\t\n\n\t\t\t\tCells were see...,"100 mm cell culture dishes (Corning, catalog n...",,Microbiology|Microbial genetics|RNA|RNA extrac...,Ying Liao|To Sing Fung|Mei Huang|Shouguo Fang|...,RNA Isolation and Northern Blot Analysis. The ...,RNA Isolation and Northern Blot Analysis. RNA ...,RNA Isolation and Northern Blot Analysis. The ...,RNA Isolation and Northern Blot Analysis. RNA ...,6890
4,e1090,Flow Cytometric Analysis of Autophagic Activit...,Flow cytometry allows very sensitive and relia...,"Cells lines of interest (HepG2, HUH7, CMK, K56...",Maintain cells under standard tissue culture c...,"37 °C, 5% CO2 humidified incubator|Centrifuge|...",,Microbiology|Antimicrobial assay|Autophagy ass...,Metodi Stankov|Diana Panayotova-Dimitrova|Ma...,Flow Cytometric Analysis of Autophagic Activit...,Flow Cytometric Analysis of Autophagic Activit...,Flow Cytometric Analysis of Autophagic Activit...,Flow Cytometric Analysis of Autophagic Activit...,5890


## Loading the model
The two main pipelines that retrieve the list of topics from the text are loaded here:

In [4]:
from herc_common.utils import load_object

lda_pipe = load_object(os.path.join(NOTEBOOK_4_RESULTS_DIR, 'lda_pipe_with_labels.pkl'))
ner_pipe = load_object(os.path.join(NOTEBOOK_5_RESULTS_DIR, 'topic_extraction_from_ner_pipe.pkl'))

## Combining topics
To join the results of both pipelines we use the [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) class from scikit-learn. It acts as a single transformer that concatenates the outputs of the two pipelines:

In [5]:
from sklearn.pipeline import FeatureUnion

union = FeatureUnion([("ner", ner_pipe),
                      ("lda", lda_pipe)])
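As a minimal sketch of what the concatenation does, the toy example below joins two simple transformers with `FeatureUnion`. The functions `double` and `row_sum` and the input array are hypothetical stand-ins for illustration; they are not the NER or LDA pipelines used in this notebook.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def double(X):
    # Hypothetical transformer: doubles every value.
    return X * 2


def row_sum(X):
    # Hypothetical transformer: one column holding the sum of each row.
    return X.sum(axis=1, keepdims=True)


toy_union = FeatureUnion([("double", FunctionTransformer(double)),
                          ("row_sum", FunctionTransformer(row_sum))])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Each transformer receives the same input; the outputs are stacked side by side.
combined = toy_union.fit_transform(X)
print(combined)
```

Each row of the result holds the output of the first transformer followed by the output of the second, so downstream steps see both results at once.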

## Building the final pipeline
To build the final system, we use a custom class that combines the topics from the _ner_ and _lda_ pipelines. Once the feature union has concatenated the topics, they are passed to the combiner, which returns the final list of topics:

In [6]:
from sklearn.pipeline import Pipeline

from herc_common.topic import TopicCombiner


combiner = TopicCombiner()
final_pipe = Pipeline([('union', union),
                       ('combiner', combiner)])
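The internals of `TopicCombiner` are not shown in this notebook. Purely as an illustration of one way such a step could work, the sketch below merges several lists of `(label, score)` pairs by keeping the highest score seen for each label and ranking the result. The function `combine_topics`, its `top_n` parameter, the merge rule, and the sample scores are all assumptions made for this example, not the `herc_common` implementation.

```python
def combine_topics(topic_lists, top_n=5):
    """Hypothetical combiner: merge (label, score) lists into a single ranking."""
    best_scores = {}
    for topics in topic_lists:
        for label, score in topics:
            # Assumed merge rule: keep the highest score seen for each label.
            if score > best_scores.get(label, float("-inf")):
                best_scores[label] = score
    # Rank labels from highest to lowest score and keep the first top_n.
    ranked = sorted(best_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up scores, for illustration only.
ner_topics = [("protein", 0.21), ("cell", 0.20)]
lda_topics = [("protein", 0.37), ("biochemistry", 0.19)]

combined_topics = combine_topics([ner_topics, lda_topics])
print(combined_topics)
```

Other merge rules, such as averaging or summing the scores of topics found by both pipelines, would fit the same interface.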

In [7]:
import string

import en_core_sci_lg
import en_core_web_md

from collections import Counter

from tqdm import tqdm

en_core_web_md.load()
en_core_sci_lg.load()

<spacy.lang.en.English at 0x2cb8a3dc0c8>

In [8]:
final_pipe.transform([protocols[-1]])

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.

[[(biopolymer (Q422649), 0.21697490092470278),
  (protein (Q8054), 0.21379759192971037),
  (cell (Q7868), 0.20830691185795816),
  (polynucleotide (Q80756), 0.20797720797720798),
  (biochemistry (Q7094), 0.20738636363636365),
  (brain (Q1073), 0.20315891866847588),
  (biomolecule (Q206229), 0.20221606648199447)]]

## Predicting the final topics for the dataset
Now that the final system is ready, we will obtain the list of topics for every protocol in the dataset:

In [9]:
topics = final_pipe.transform(protocols)

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.




In [10]:
topics[0][:5]

[(software (Q7397), 0.19620991253644315),
 (chemistry (Q2329), 0.19507246376811593),
 (science (Q336), 0.19495944380069524),
 (research (Q42240), 0.19462116830537884),
 (chemical substance (Q79529), 0.18963088193857425)]

## Saving results
Finally, we save the complete pipeline for later use with new data, together with the topics predicted for each protocol:

In [11]:
from herc_common.utils import save_object

save_object(final_pipe, os.path.join(NOTEBOOK_6_RESULTS_DIR, 'final_pipe.pkl'))
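How `save_object` and `load_object` serialize the pipeline is not shown here. As a minimal sketch of the save-and-reload pattern, assuming a pickle-style serializer, the example below uses the standard-library `pickle` module and a toy scikit-learn pipeline (both stand-ins, not the project's code) to check that a reloaded pipeline produces the same output as the original.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy fitted pipeline standing in for final_pipe (hypothetical example).
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_pipe = Pipeline([("scale", StandardScaler())]).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipe.pkl")
    # Save the fitted pipeline to disk.
    with open(path, "wb") as f:
        pickle.dump(toy_pipe, f)
    # Load it back, as a later session would before handling new data.
    with open(path, "rb") as f:
        reloaded_pipe = pickle.load(f)

# The reloaded pipeline should transform data exactly like the original.
same_output = np.allclose(toy_pipe.transform(X), reloaded_pipe.transform(X))
print(same_output)
```

When a pickled pipeline is loaded in a new session, every class it contains must be importable there, and using the same library versions as at save time avoids compatibility problems.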

In [12]:
results_df = df[['pr_id', 'title', 'categories']]
results_df = results_df.assign(y_pred=topics)
results_df.head()

Unnamed: 0,pr_id,title,categories,y_pred
0,e100,Scratch Wound Healing Assay,Cancer Biology|Invasion & metastasis|Cell biol...,"[(software, 0.19620991253644315), (chemistry, ..."
1,e1029,ADCC Assay Protocol,Immunology|Immune cell function|Cytotoxicity|C...,"[(protein, 0.37364396498206315), (biological p..."
2,e1072,Catalase Activity Assay in Candida glabrata,Microbiology|Microbial biochemistry|Protein|Ac...,"[(botany, 0.3104649910786212), (protein, 0.214..."
3,e1077,RNA Isolation and Northern Blot Analysis,Microbiology|Microbial genetics|RNA|RNA extrac...,"[(protein, 0.21430251857314592), (biological p..."
4,e1090,Flow Cytometric Analysis of Autophagic Activit...,Microbiology|Antimicrobial assay|Autophagy ass...,"[(process, 0.19932614555256065), (chemical com..."


In [13]:
results_df.to_csv(os.path.join(NOTEBOOK_6_RESULTS_DIR, 'protocol_results.csv'), index=False)