# 5. Topic extraction from NER
In this notebook we select topics from the entities recognized in each text. The process is as follows:
* The Named Entity Recognizer object created in notebook _4_Named_Entity_Recognition_ will be loaded.
* The list of entities for a given text will be retrieved.
* After we have the list of entities, they will be linked to Wikidata.
* From the list of linked entities, we will create a graph with the Wikidata entities obtained by expanding some of their properties.
* Once the graph has been obtained, we will apply centrality algorithms to select its most representative entities, which will serve as topics for the text.

## Setup

In [1]:
%run __init__.py

logger.setLevel(logging.INFO)

In [2]:
from bokeh.io import output_notebook

output_notebook()



## Loading the data
We will start by loading the Git repositories dataframe created in notebook 1. After that, we will select the last repository to demonstrate the main data workflow:

In [3]:
import pandas as pd

GIT_FILE_PATH = os.path.join(NOTEBOOK_1_RESULTS_DIR, 'git_dataframe.pkl')

git_df = pd.read_pickle(GIT_FILE_PATH)
git_repositories = git_df['full_text_cleaned'].values

In [4]:
git_df.head(n=25)

Unnamed: 0,gh_id,name,description,owner_name,languages,readme_text,issues_text,commits_text,filenames,comments_text,full_text,full_text_cleaned,num_chars_text
0,216602979,LIRICAL,LIkelihood Ratio Interpretation of Clinical Ab...,cmungall,"{'Java': 492423, 'FreeMarker': 13149, 'Python'...",LIRICAL. LIkelihood Ratio Interpretation of C...,,Merge pull request #442 from TheJacksonLaborat...,\nCHANGELOG\nREADME\nhoxc13 output\nlirical to...,note that the Jannovar dependency does not nee...,LIkelihood Ratio Interpretation of Clinical Ab...,LIkelihood Ratio Interpretation of Clinical Ab...,3770
1,199330464,wikidata_ontomatcher,Matches ontology classes against wikidata,cmungall,"{'Prolog': 14691, 'Makefile': 1472, 'Dockerfil...",Match an ontology to Wikidata. This applicatio...,Will help with #1 and with https://github.com/...,Adding skos:altLabel\n\nhttps://github.com/cmu...,\nREADME\ninstall\npack\nwikidata ontomatcher\...,,Matches ontology classes against wikidata. Mat...,Matches ontology classes against wikidata. Mat...,519
2,253207181,ro-crate-ruby,"A Ruby gem for creating, manipulating and read...",markwilkinson,"{'Ruby': 52724, 'HTML': 1319}","ro-crate-ruby. This is a WIP gem for creating,...",,Update LICENSE\nBump version\nTidy up and chec...,\n travis\nGemfile\nREADME\nROCrate\nContact P...,*\n * Expands the tree to the target element a...,"A Ruby gem for creating, manipulating and read...","A Ruby gem for creating, manipulating and read...",2559
3,212556220,Misc_Training_scripts,A place for me to keep various miscellanelous ...,markwilkinson,"{'Shell': 15815, 'Ruby': 9445}",Misc_Training_scripts. A place for me to keep ...,,added new cool 3-federated query\nfinished edi...,README\nSpecies Abundance Pub2015\nSpecies Inf...,,A place for me to keep various miscellanelous ...,A place for me to keep various miscellanelous ...,545
4,155879756,FAIRifier,A tool to make data FAIR,mikel-egana-aranguren,"{'Java': 3514431, 'JavaScript': 967765, 'HTML'...",Dependencies: Java 8. Apache Ant. Building. in...,,Merge pull request #16 from Shamanou/developme...,\norg eclipse core resources\norg eclipse jdt ...,*\n * Main class for Refine server application...,A tool to make data FAIR. Dependencies: Java 8...,A tool to make data FAIR. Dependencies: Java 8...,57859
5,90349931,elda,Epimorphics implementation of the Linked Data API,mikel-egana-aranguren,"{'Java': 1892893, 'JavaScript': 1757647, 'XSLT...","Elda, an implementation of the Linked Data API...",,Proper reference Config\nConfiguracion ELDA de...,\nCONTRIBUTING\nLICENCE\nREADME demo\nREADME\n...,Everything that's part of the resource set is ...,Epimorphics implementation of the Linked Data ...,Epimorphics implementation of the Linked Data ...,15907
6,126633812,music-genre-classification,Recognizing the genre of music files using mac...,HareeshBahuleyan,"{'Jupyter Notebook': 7532041, 'Python': 8296}",Music Genre Classification. \n Overview. Reco...,,Update LICENSE\nUpdate README.md\nUpdate READM...,1 audio retrieval\n2 plot spectrogram\n3 1 vgg...,,Recognizing the genre of music files using mac...,Recognizing the genre of music files using mac...,3078
7,173520377,probabilistic_nlg,Tensorflow Implementation of Stochastic Wasser...,HareeshBahuleyan,{'Python': 303839},Stochastic Wasserstein Autoencoder for Probabi...,Bumps [tensorflow-gpu](https://github.com/tens...,Update LICENSE\nUpdate requirements.txt\nUpdat...,README\n init \ndf movie test\ndf movie trai...,,Tensorflow Implementation of Stochastic Wasser...,Tensorflow Implementation of Stochastic Wasser...,3725
8,103798851,DataStructures-Algorithms-InC,Programs of Data Structures and Algorithms in ...,gauravtheP,"{'C': 117644, 'Makefile': 54504, 'C++': 9409, ...",,,Minor modification is done in chainingInHashin...,dep\n01Knapsack Problem\nFloyd Warshall Algor...,Time Complexity: O(nlogn)\nTime Complexity\n W...,Programs of Data Structures and Algorithms in ...,Programs of Data Structures and Algorithms in ...,1330
9,153249816,Music-Generation-Using-Deep-Learning,A Deep Learning Case Study to Generate Music S...,gauravtheP,{'Jupyter Notebook': 52835},Music Generation Using Deep-Learning. Check ou...,Does this model also include chord generation?...,Blog Link Updated\nUpdate README.md\nUpdate RE...,Generate Music\nMusic Generation Train1\nMusic...,,A Deep Learning Case Study to Generate Music S...,A Deep Learning Case Study to Generate Music S...,3922


In [5]:
text = git_repositories[-1]

## Loading the NER model
The named entity recognition model created in notebook 4 will now be loaded and used to obtain the entities in the selected text:

In [6]:
import en_core_sci_lg

from herc_common.utils import load_object
from collections import Counter

ner = load_object(os.path.join(NOTEBOOK_4_RESULTS_DIR, 'ner_system.pkl'))

In [7]:
nlp = en_core_sci_lg.load()
entities = ner.transform([text])
entities[0][:10]

['Repository',
 'scripts',
 'basketball',
 'Basketball Analytics',
 'repository',
 'scripts',
 'statistics',
 'NBA',
 'basketball',
 'code']

## Entity linking
Next, we obtain the Wikidata URI of each recognized entity. We first try DBpedia Spotlight, mapping each DBpedia resource to Wikidata through its owl:sameAs links; the WikidataEntityLinker class is used further below:

In [6]:
import json
import requests


DBPEDIA_BASE = 'http://dbpedia.org'
DBPEDIA_SPOTLIGHT_BASE = 'http://api.dbpedia-spotlight.org/en'
OWL_SAME_AS = 'http://www.w3.org/2002/07/owl#sameAs'

class DBPediaEntityLinker():
    def __init__(self, confidence_threshold=0.4):
        self.confidence = confidence_threshold
    
    def link_entities(self, text):
        payload = {'confidence': self.confidence, 'text': text}
        reqheaders = {'accept': 'application/json'}
        res = requests.post(f"{DBPEDIA_SPOTLIGHT_BASE}/annotate",
                            data=payload,
                            headers=reqheaders)
        if res.status_code != 200:
            print(res.content)
            print("Error annotating text with DBPedia Spotlight: ", res)
            return []
        
        res_dict = json.loads(res.content)
        if 'Resources' not in res_dict:
            return []
        
        return [(resource['@surfaceForm'], resource['@URI'])
                for resource in res_dict['Resources']]


def _convert_to_wd(dbpedia_linked_entities):
    wd_entities = []
    for name, url in dbpedia_linked_entities:
        resource_url = url.replace(f"{DBPEDIA_BASE}/resource", f"{DBPEDIA_BASE}/data")
        resource_url += ".json"
        res = requests.get(resource_url)
        if res.status_code != 200:
            print(f"Error loading resource '{resource_url}': ", res)
            continue

        res_dict = json.loads(res.content)
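        try:
            mappings = res_dict[url][OWL_SAME_AS]
        except KeyError:
            # No owl:sameAs links for this resource: keep the entity unlinked
            # and skip the loop below, where `mappings` would be undefined.
            wd_entities.append((name, None))
            continue

        for mapping in mappings:
            mapping_url = mapping['value']
            if 'http://www.wikidata.org/' in mapping_url:
                wd_entities.append((name, mapping_url))
                break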
    return wd_entities
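
The mapping step above can be checked offline. The following is a minimal sketch, assuming the JSON description of a DBpedia resource is a dictionary keyed by resource URI and then by property URI, which is the structure `_convert_to_wd` reads. The helper names `to_data_url` and `pick_wikidata_uri`, as well as the sample dictionary and its Wikidata identifier, are hypothetical and hand-written for illustration; they are not real DBpedia output.

```python
DBPEDIA_BASE = 'http://dbpedia.org'
OWL_SAME_AS = 'http://www.w3.org/2002/07/owl#sameAs'


def to_data_url(resource_url):
    """Turn a DBpedia resource URI into the URL of its JSON description."""
    return resource_url.replace(f"{DBPEDIA_BASE}/resource",
                                f"{DBPEDIA_BASE}/data") + ".json"


def pick_wikidata_uri(res_dict, resource_url):
    """Return the first owl:sameAs value pointing at Wikidata, or None."""
    same_as = res_dict.get(resource_url, {}).get(OWL_SAME_AS, [])
    for mapping in same_as:
        if 'http://www.wikidata.org/' in mapping['value']:
            return mapping['value']
    return None


# Hand-written sample (assumption): mimics the nesting read by _convert_to_wd.
gene_uri = f"{DBPEDIA_BASE}/resource/Gene"
sample = {
    gene_uri: {
        OWL_SAME_AS: [
            {'type': 'uri', 'value': 'http://example.org/other/Gene'},
            {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q_EXAMPLE'},
        ]
    }
}

data_url = to_data_url(gene_uri)
wikidata_uri = pick_wikidata_uri(sample, gene_uri)
missing_uri = pick_wikidata_uri(sample, f"{DBPEDIA_BASE}/resource/Unknown")
print(data_url, wikidata_uri, missing_uri)
```

Returning `None` for a resource without a Wikidata link keeps the output aligned with the input entities, which matches the `(name, None)` tuples seen elsewhere in this notebook.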


In [12]:
dbpedia_linker = DBPediaEntityLinker()
dbpedia_linked_entities = [dbpedia_linker.link_entities(text)
                           for text in git_repositories]
dbpedia_linked_entities[0][:5]

b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Error</title>\n</head><body>\n<h1>Internal Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n [no address given] to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n<hr>\n<address>Apache/2.4.25 (Debian) Server at api.dbpedia-spotlight.org Port 80</address>\n</body></html>\n'
Error annotating text with DBPedia Spotlight:  <Response [500]>


[('phenotypic', 'http://dbpedia.org/resource/Phenotype'),
 ('Human Phenotype Ontology',
  'http://dbpedia.org/resource/Human_Phenotype_Ontology'),
 ('genotypes', 'http://dbpedia.org/resource/Genotype'),
 ('VCF', 'http://dbpedia.org/resource/Variant_Call_Format'),
 ('gene', 'http://dbpedia.org/resource/Gene')]

In [None]:
linked_entities = [_convert_to_wd(entities)
                   for entities in dbpedia_linked_entities]
linked_entities[0][:5]

## CONTINUE HERE

In [7]:
from herc_common.entity_linking import WikidataEntityLinker

linker = WikidataEntityLinker()
linked_entities = linker.fit_transform(entities)
linked_entities[0][:5]


[('Repository', 'http://www.wikidata.org/entity/Q3133368'),
 ('scripts', 'http://www.wikidata.org/entity/Q187432'),
 ('basketball', 'http://www.wikidata.org/entity/Q5372'),
 ('Basketball Analytics', None),
 ('repository', 'http://www.wikidata.org/entity/Q3133368')]

## Building the graph
After each entity has been linked to Wikidata, we will explore its neighbourhood in the knowledge graph to obtain a list of candidates for our final topics:

In [None]:
from herc_common.graph import WikidataGraphBuilder

graph_builder = WikidataGraphBuilder(max_hops=2)
entity_graph = graph_builder.build_graph(linked_entities[0])
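
The internals of `WikidataGraphBuilder` are not shown in this notebook. As a rough illustration of what a hop-limited neighbourhood expansion could look like, here is a minimal sketch that assumes a breadth-first traversal and replaces Wikidata with a small in-memory dictionary. The `expand` function and the `TOY_NEIGHBOURS` links are made up for the example.

```python
from collections import deque

# Toy stand-in for Wikidata (assumption): node -> directly related nodes.
TOY_NEIGHBOURS = {
    'basketball': ['team sport', 'ball game'],
    'team sport': ['sport'],
    'ball game': ['sport'],
    'sport': ['human activity'],
    'human activity': [],
}


def expand(seeds, neighbours, max_hops):
    """Breadth-first expansion keeping nodes at most max_hops from a seed."""
    edges = set()
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # nodes at the hop limit are kept but not expanded
        for other in neighbours.get(node, []):
            edges.add((node, other))
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return seen, edges


nodes, edges = expand(['basketball'], TOY_NEIGHBOURS, max_hops=2)
print(sorted(nodes))
print(sorted(edges))
```

With `max_hops=2`, 'sport' is reached but 'human activity' (three hops away) is not, which shows how the hop limit bounds the size of the candidate graph.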

In [None]:
from bokeh.io import show
from bokeh.layouts import gridplot

from herc_common.bokeh_utils import build_graph_plot

plot = build_graph_plot(entity_graph, "Linked entities graph")
show(plot)

Since the graph above is not fully connected, we will keep only its largest connected subgraph:

In [None]:
from herc_common.graph import get_largest_connected_subgraph

connected_entity_subgraph = get_largest_connected_subgraph(entity_graph)

plot = build_graph_plot(connected_entity_subgraph, "Linked entities graph")
show(plot)
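
For reference, the same idea can be expressed directly with networkx. This is a minimal sketch on a toy graph, assuming an undirected graph; it is not necessarily how `get_largest_connected_subgraph` is implemented.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([('a', 'b'), ('b', 'c'), ('c', 'a'),  # component of size 3
                  ('x', 'y')])                          # component of size 2
g.add_node('isolated')                                  # component of size 1

# connected_components yields one set of nodes per component.
largest_nodes = max(nx.connected_components(g), key=len)
# Copy the induced subgraph so it can be modified independently of g.
largest = g.subgraph(largest_nodes).copy()

print(sorted(largest.nodes), largest.number_of_edges())
```

Working on a single connected component matters for the next step, since several centrality measures are only well defined, or only comparable across nodes, when every node can reach every other node.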

Now that we have the Wikidata graph built from the entities in the text, we will try several centrality algorithms to obtain the 9 entities that best represent it. These entities can be seen as candidate topics for the text:

In [None]:
import networkx.algorithms as nxa

from herc_common.graph import get_centrality_algorithm_results

def try_centrality_algorithms(g, algorithms, stop_uris, top_n=9):
    for (algorithm, name) in algorithms:
        print(f'Algorithm: {name}')
        result = get_centrality_algorithm_results(g, algorithm, stop_uris, top_n)
        print("Topics:", [(t[0]['label'], t[1]) for t in result])
        print()
        
algorithms = [
    (nxa.centrality.information_centrality, "Information centrality"),
    (nxa.centrality.eigenvector_centrality_numpy, "Eigenvector centrality"),
    (nxa.centrality.closeness_centrality, "Closeness centrality"),
    (nxa.centrality.betweenness_centrality, "Betweenness centrality"),
    (nxa.centrality.load_centrality, "Load centrality")
]

stop_uris = ['Q4167836', 'Q11862829', 'Q13442814',
             'Q17339814', 'Q24017414', 'Q4671286',
             'Q47154513']
try_centrality_algorithms(connected_entity_subgraph,
                          algorithms,
                          stop_uris)
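
The helper `get_centrality_algorithm_results` is not shown in this notebook. The sketch below illustrates the general idea under the assumption that it scores every node, discards the nodes in the stop list, and keeps the `top_n` best. The function `top_central_nodes` and the toy graph are hypothetical.

```python
import networkx as nx


def top_central_nodes(g, algorithm, stop_nodes, top_n):
    """Score nodes with a centrality function, drop stop nodes, keep the best."""
    scores = algorithm(g)  # networkx centrality functions return {node: score}
    kept = [(node, score) for node, score in scores.items()
            if node not in stop_nodes]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


# Toy graph: 'hub' touches four nodes, and 'a' has one extra neighbour 'e'.
g = nx.Graph()
g.add_edges_from([('hub', 'a'), ('hub', 'b'), ('hub', 'c'),
                  ('hub', 'd'), ('a', 'e')])

closeness_top = top_central_nodes(g, nx.closeness_centrality, set(), top_n=2)
betweenness_top = top_central_nodes(g, nx.betweenness_centrality, set(), top_n=2)
filtered_top = top_central_nodes(g, nx.closeness_centrality, {'hub'}, top_n=2)

print([node for node, _ in closeness_top])
print([node for node, _ in betweenness_top])
print([node for node, _ in filtered_top])
```

The stop list plays the same role as `stop_uris` above: very generic nodes tend to be central in almost any graph, so they are removed from the ranking instead of being reported as topics.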

## Setting up the pipeline
Now that we have seen the main data flow, we will build the final pipeline. This pipeline will receive a list of texts, and return 7 potential topics for each text by executing the steps described above:

In [31]:
from sklearn.pipeline import Pipeline

from herc_common.topic import TopicLabeller


topic_extractor = TopicLabeller(graph_builder, nxa.centrality.closeness_centrality,
                                num_labels_per_topic=7, stop_uris=stop_uris)
topic_pipe = Pipeline([('ner', ner),
                       ('entity_linker', linker),
                       ('topic_extractor', topic_extractor)])
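
To make the chaining explicit, here is a minimal sketch of how a scikit-learn `Pipeline` passes the output of each step to the next one. `ToyNER` and `ToyLinker` are hypothetical stand-ins written for this example; they are not the `herc_common` classes, and the lookup URI is made up.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ToyNER(BaseEstimator, TransformerMixin):
    """Stand-in recognizer: treats every capitalized word as an entity."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[word.strip('.,') for word in text.split()
                 if word[:1].isupper()]
                for text in X]


class ToyLinker(BaseEstimator, TransformerMixin):
    """Stand-in linker: looks each entity up in a fixed dictionary."""

    def __init__(self, lookup=None):
        self.lookup = lookup

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lookup = self.lookup or {}
        return [[(entity, lookup.get(entity)) for entity in entities]
                for entities in X]


toy_pipe = Pipeline([
    ('ner', ToyNER()),
    ('entity_linker', ToyLinker(lookup={'NBA': 'http://example.org/NBA'})),
])

toy_output = toy_pipe.fit_transform(['Scripts for NBA statistics.'])
print(toy_output)
```

Each step receives a list with one element per input text and returns a list of the same length, so the final result stays aligned with the rows of the original dataframe.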

The pipeline will now be saved for later use in the final system:

In [None]:
from herc_common.utils import save_object

PIPE_OUTPUT_FILE_NAME = "topic_extraction_from_ner_pipe.pkl"

save_object(topic_pipe, os.path.join(NOTEBOOK_5_RESULTS_DIR, PIPE_OUTPUT_FILE_NAME))

### Obtaining the topics
Before finishing this notebook, we will obtain the list of inferred topics for each repository in the dataset. To do so, we just have to call the _fit_transform_ method of our pipeline:

In [32]:
results = topic_pipe.fit_transform(git_repositories)
results[:5]

https://github.com/indigo-dc/udocker/blob/master/doc/installation_manual.md#22-install-from-indigo-datacloud-repositories
Spreadsheet('#x-spreadsheet-demo'
Spreadsheet("#x-spreadsheet-demo
https://github.com/SheetJS/sheetjs/tree/master/demos/xspreadsheet#saving-data


INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.

[[Topic(label='software', qid='Q7397', desc='non-tangible executable component of a computer', score=0.21499176276771004, t_type='ner'),
  Topic(label='computer science', qid='Q21198', desc='study of the theoretical foundations of computation', score=0.21150729335494328, t_type='ner'),
  Topic(label='artificial intelligence', qid='Q11660', desc='intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals', score=0.2084664536741214, t_type='ner'),
  Topic(label='interaction science', qid='Q97008347', desc='scientific discipline', score=0.20841096619643332, t_type='ner'),
  Topic(label='engineering', qid='Q11023', desc='applied science', score=0.20621543323676586, t_type='ner'),
  Topic(label='automation', qid='Q184199', desc='use of various control systems for operating equipment', score=0.20518867924528303, t_type='ner'),
  Topic(label='mathematical analysis', qid='Q7754', desc='', score=0.20438527799530148, t_type='ner')],
 [Topic(lab

### Saving the results
Now we will add the results to our dataframe and save them to a CSV file. This file will contain the id and name of each repository, together with the topics inferred by the system:

In [33]:
NEW_COL_NAME = 'topics_from_ner'

git_df[NEW_COL_NAME] = ['\n'.join([f"{topic.label}, {topic.score:.4f}" for topic in result])
                        for result in results]
git_df.head()

Unnamed: 0,gh_id,name,description,owner_name,languages,readme_text,issues_text,commits_text,filenames,comments_text,full_text,full_text_cleaned,num_chars_text,topics_from_ner
0,216602979,LIRICAL,LIkelihood Ratio Interpretation of Clinical Ab...,cmungall,"{'Java': 492423, 'FreeMarker': 13149, 'Python'...",LIRICAL. LIkelihood Ratio Interpretation of C...,,Merge pull request #442 from TheJacksonLaborat...,\nCHANGELOG\nREADME\nhoxc13 output\nlirical to...,note that the Jannovar dependency does not nee...,LIkelihood Ratio Interpretation of Clinical Ab...,LIkelihood Ratio Interpretation of Clinical Ab...,3770,"software, 0.2150\ncomputer science, 0.2115\nar..."
1,199330464,wikidata_ontomatcher,Matches ontology classes against wikidata,cmungall,"{'Prolog': 14691, 'Makefile': 1472, 'Dockerfil...",Match an ontology to Wikidata. This applicatio...,Will help with #1 and with https://github.com/...,Adding skos:altLabel\n\nhttps://github.com/cmu...,\nREADME\ninstall\npack\nwikidata ontomatcher\...,,Matches ontology classes against wikidata. Mat...,Matches ontology classes against wikidata. Mat...,519,"Wikidata, 0.4091\nonline database, 0.3375\nkno..."
2,253207181,ro-crate-ruby,"A Ruby gem for creating, manipulating and read...",markwilkinson,"{'Ruby': 52724, 'HTML': 1319}","ro-crate-ruby. This is a WIP gem for creating,...",,Update LICENSE\nBump version\nTidy up and chec...,\n travis\nGemfile\nREADME\nROCrate\nContact P...,*\n * Expands the tree to the target element a...,"A Ruby gem for creating, manipulating and read...","A Ruby gem for creating, manipulating and read...",2559,"information, 0.2005\nabstract object, 0.1957\n..."
3,212556220,Misc_Training_scripts,A place for me to keep various miscellanelous ...,markwilkinson,"{'Shell': 15815, 'Ruby': 9445}",Misc_Training_scripts. A place for me to keep ...,,added new cool 3-federated query\nfinished edi...,README\nSpecies Abundance Pub2015\nSpecies Inf...,,A place for me to keep various miscellanelous ...,A place for me to keep various miscellanelous ...,545,"species, 0.6111\ntaxonomic rank, 0.5116\nsubge..."
4,155879756,FAIRifier,A tool to make data FAIR,mikel-egana-aranguren,"{'Java': 3514431, 'JavaScript': 967765, 'HTML'...",Dependencies: Java 8. Apache Ant. Building. in...,,Merge pull request #16 from Shamanou/developme...,\norg eclipse core resources\norg eclipse jdt ...,*\n * Main class for Refine server application...,A tool to make data FAIR. Dependencies: Java 8...,A tool to make data FAIR. Dependencies: Java 8...,57859,"software, 0.2099\ninteraction science, 0.2051\..."


In [34]:
results_df = git_df[['gh_id', 'name', NEW_COL_NAME]]
results_df.head()

Unnamed: 0,gh_id,name,topics_from_ner
0,216602979,LIRICAL,"software, 0.2150\ncomputer science, 0.2115\nar..."
1,199330464,wikidata_ontomatcher,"Wikidata, 0.4091\nonline database, 0.3375\nkno..."
2,253207181,ro-crate-ruby,"information, 0.2005\nabstract object, 0.1957\n..."
3,212556220,Misc_Training_scripts,"species, 0.6111\ntaxonomic rank, 0.5116\nsubge..."
4,155879756,FAIRifier,"software, 0.2099\ninteraction science, 0.2051\..."


In [35]:
OUTPUT_FILE_NAME = "git_df_with_ner_topics.csv"

results_df.to_csv(os.path.join(NOTEBOOK_5_RESULTS_DIR, OUTPUT_FILE_NAME), index=False)