# 3. Topic labelling

## Setup
As always, we will begin by loading a set of constants and initializing the logging system. Since we will be using Bokeh in this notebook, we will configure it to output the results in the Jupyter notebook:

In [1]:
%run __init__.py

In [2]:
from bokeh.io import output_notebook

output_notebook()

In [3]:
import os

import pandas as pd

GIT_FILE_PATH = os.path.join(NOTEBOOK_1_RESULTS_DIR, 'git_dataframe.pkl')

git_df = pd.read_pickle(GIT_FILE_PATH)
git_repositories = git_df['full_text_cleaned'].values

## Entity linking

### Using the entity linking class
An entity linking class has been defined in the _entity_linking.py_ module of the _src_ directory. This class links each given word to its Wikidata entity by using the [wbsearchentities](https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities) module of the MediaWiki API:

In [4]:
from herc_common.entity_linking import WikidataEntityLinker

entity_linker = WikidataEntityLinker()
res = entity_linker.link_entity('python')
res

('python', 'http://www.wikidata.org/entity/Q28865')
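
For readers unfamiliar with the API, the lookup can be approximated as follows. This is a minimal sketch, not the `WikidataEntityLinker` implementation: the helper names `build_search_params` and `first_entity_uri` are hypothetical, and it assumes the linker simply keeps the first search hit. Only the request parameters and the response parsing are shown, so the snippet runs without network access; actually sending the request to `https://www.wikidata.org/w/api.php` is left out.

```python
WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def build_search_params(term, language="en", limit=1):
    """Query parameters for a wbsearchentities request (hypothetical helper)."""
    return {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }


def first_entity_uri(term, response_json):
    """Return (term, uri) for the first hit, or (term, None) if nothing matched."""
    hits = response_json.get("search", [])
    if not hits:
        return term, None
    return term, WIKIDATA_ENTITY_PREFIX + hits[0]["id"]


# Hand-written response fragments, shaped like the API's JSON answer.
sample_response = {"search": [{"id": "Q28865", "label": "Python"}]}
print(first_entity_uri("python", sample_response))
print(first_entity_uri("apispec", {"search": []}))
```

Terms without any match come back with `None` as their URI, which is the shape of the `('Renderer', None)` entries seen later in this notebook.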

### Linking each topic's term to Wikidata
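In the following cells we are going to load the topic model trained in the previous notebook on the Git repositories dataset, obtain the term distribution of each topic, and link each term to Wikidata. The variable names below keep the `lda` prefix, although the files loaded are _git\_nmf\_model.pkl_ and _git\_dtm\_tfidf.pkl_. We will start by loading both the model pipeline and the document-term matrix: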

In [5]:
from herc_common.utils import load_object

lda_agriculture_pipe_filename = "git_nmf_model.pkl"
dtm_tf_filename = "git_dtm_tfidf.pkl"

lda_pipe = load_object(os.path.join(NOTEBOOK_2_RESULTS_DIR, lda_agriculture_pipe_filename))
dtm_tf = load_object(os.path.join(NOTEBOOK_2_RESULTS_DIR, dtm_tf_filename))

To obtain the list of terms for each topic, we will use the _get\_topic\_terms\_by\_relevance_ function, which returns the most relevant terms of each topic (see [Sievert & Shirley](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) for more information). When `lambda_` is 1 or greater, the function below skips the relevance ranking and simply takes the terms with the highest weight in each topic.

In [6]:
from herc_common.utils import get_topic_terms_by_relevance

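def link_topic_terms(entity_linker, model, vectorizer,
                     dtm_tf, n_top_words, lambda_=0.6):
    """Link the top terms of every topic to Wikidata.

    Returns one list per topic containing (term, wikidata_uri) tuples.
    """
    if lambda_ < 1.0:
        # Rank terms by relevance (Sievert & Shirley) instead of raw weight
        topic_terms = get_topic_terms_by_relevance(model, vectorizer, dtm_tf,
                                                   n_top_words, lambda_)
    else:
        # get_feature_names() was removed in scikit-learn 1.2
        if hasattr(vectorizer, 'get_feature_names_out'):
            feature_names = vectorizer.get_feature_names_out()
        else:
            feature_names = vectorizer.get_feature_names()
        # Indices of the n_top_words largest weights, in descending order
        topic_terms = [[feature_names[i] for i in topic.argsort()[:-n_top_words - 1: -1]]
                       for topic in model.components_]
    return [[entity_linker.link_entity(entity) for entity in topic]
            for topic in topic_terms]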
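
The relevance measure from Sievert & Shirley scores a term $w$ in topic $t$ as $\lambda \log \phi_{tw} + (1 - \lambda) \log(\phi_{tw} / p_w)$, where $\phi_{tw}$ is the probability of the term within the topic and $p_w$ is its overall probability in the corpus. The snippet below is a small standalone sketch of that ranking on toy numbers. It is not the code of `get_topic_terms_by_relevance`, whose internals are not shown in this notebook, and the function name `relevance_ranking` is hypothetical.

```python
import math


def relevance_ranking(topic_term_probs, term_marginals, lambda_=0.6, n_top=3):
    """Rank the term indices of one topic by relevance, best first."""
    scores = []
    for idx, (phi, p) in enumerate(zip(topic_term_probs, term_marginals)):
        # lambda_ weighs the in-topic probability against the lift phi / p
        score = lambda_ * math.log(phi) + (1 - lambda_) * math.log(phi / p)
        scores.append((score, idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:n_top]]


# Toy data: three terms, their probability in one topic and in the corpus.
phi = [0.5, 0.3, 0.2]
marginals = [0.5, 0.1, 0.4]

print(relevance_ranking(phi, marginals, lambda_=1.0))  # ranks by phi only
print(relevance_ranking(phi, marginals, lambda_=0.0))  # ranks by lift only
```

With `lambda_=1.0` the ranking reduces to sorting by the in-topic probability, which is why `link_topic_terms` can fall back to the raw topic weights in that case.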


Finally, we can make use of the function defined above to link each term to Wikidata. The output of the following cell will be a 2D array, with the first dimension corresponding to each topic, and the second one consisting of tuples containing the pair ('term', 'wikidata_uri') for every term of the topic:

In [7]:
linked_terms = link_topic_terms(entity_linker, lda_pipe.named_steps['model'],
                                lda_pipe.named_steps['vectorizer'], dtm_tf, 
                                n_top_words=10, lambda_=1)
linked_terms

[[('base', 'http://www.wikidata.org/entity/Q191360'),
  ('array', 'http://www.wikidata.org/entity/Q186152'),
  ('create', 'http://www.wikidata.org/entity/Q4953270'),
  ('core', 'http://www.wikidata.org/entity/Q23595'),
  ('Index', 'http://www.wikidata.org/entity/Q873506'),
  ('Operation', 'http://www.wikidata.org/entity/Q1079196'),
  ('dialog', 'http://www.wikidata.org/entity/Q131395'),
  ('property', 'http://www.wikidata.org/entity/Q1400881'),
  ('map', 'http://www.wikidata.org/entity/Q4006'),
  ('Exporter', 'http://www.wikidata.org/entity/Q73090528')],
 [('Property', 'http://www.wikidata.org/entity/Q1400881'),
  ('spec', 'http://www.wikidata.org/entity/Q2101564'),
  ('Elda', 'http://www.wikidata.org/entity/Q608985'),
  ('Utils', 'http://www.wikidata.org/entity/Q95958203'),
  ('Renderer', None),
  ('Query', 'http://www.wikidata.org/entity/Q11169'),
  ('apispec', None),
  ('Player', 'http://www.wikidata.org/entity/Q937857'),
  ('min', 'http://www.wikidata.org/entity/Q7727'),
  ('Contex

## Obtaining each topic's graphs
In this phase we are going to explore the neighbourhood of each term linked before, to obtain a graph with their related terms from Wikidata. Each set of terms obtained before will be the seed concepts used to obtain the final graph, and a set of properties from Wikidata will be explored recursively to expand the final graph. 

For more information about the implementation of the graph building process, the class used can be accessed at the _graph.py_ module in the source directory.

In the following cell we will configure the graph builder to build a graph with a maximum depth of two from every seed node. Higher depth values might make the resulting topic labels very general, while a smaller value risks leaving the seed nodes unconnected:

In [8]:
from herc_common.graph import WikidataGraphBuilder

graph_builder = WikidataGraphBuilder(max_hops=2)
topic_graphs = [graph_builder.build_graph(topic) for topic in linked_terms]
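
The effect of the depth limit can be illustrated with a breadth-first expansion over a toy graph. This is only a sketch under stated assumptions: the `neighbours` dictionary stands in for the Wikidata property lookups, the function name `expand_neighbourhood` is hypothetical, and `WikidataGraphBuilder` may work differently internally.

```python
from collections import deque


def expand_neighbourhood(seeds, neighbours, max_hops=2):
    """Breadth-first expansion that stops max_hops away from the seeds."""
    depth = {seed: 0 for seed in seeds}
    edges = set()
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # nodes at the depth limit are kept but not expanded
        for other in neighbours.get(node, []):
            edges.add((node, other))
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return depth, edges


# Toy stand-in for "related entity" lookups on Wikidata.
toy_neighbours = {
    "python": ["programming language"],
    "programming language": ["language"],
    "language": ["concept"],
    "array": ["data structure"],
    "data structure": ["concept"],
}

depth, edges = expand_neighbourhood(["python", "array"], toy_neighbours)
print(depth)
print(sorted(edges))
```

In this toy case the two seeds only become connected if the expansion goes deep enough, which is the trade-off described above.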

Now that we have obtained the neighbourhood graph of each topic, we are going to plot the results using Bokeh. Each node will have a different color depending on its depth with respect to the seed nodes, which will be painted in blue. This will allow us to perform an initial exploration of these graphs:

In [9]:
from bokeh.io import show
from bokeh.layouts import gridplot

from herc_common.bokeh_utils import build_graph_plot


plots = [build_graph_plot(g, f"Topic {idx}") 
         for idx, g in enumerate(topic_graphs)]
grid = gridplot(plots, ncols=2)
show(grid)

The ideal result would be to have every seed term connected in the final graph. However, there will be some subgraphs that are isolated from the main one. In the following section we will address this issue.

## Getting the main connected subgraph
As we have described before, some of the topic graphs that we have obtained are not fully connected. Small subgraphs which are isolated from the main subgraph will be considered as noise, and removed before the following computations.

In the following cells, we are going to retrieve the largest connected subgraph from each topic's graph, and plot the results to analyse them:

In [10]:
from herc_common.graph import get_largest_connected_subgraph

connected_topic_subgraphs = [get_largest_connected_subgraph(g) 
                             for g in topic_graphs]
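
The idea behind this step can be sketched without any graph library. The function below is an illustration only, assuming an undirected graph stored as a dictionary of neighbour sets. It is not the code of `get_largest_connected_subgraph`, and the name `largest_connected_component` is hypothetical.

```python
def largest_connected_component(adjacency):
    """Return the node set of the largest connected component."""
    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy graph: one component of three nodes and an isolated pair.
toy_graph = {
    "map": {"cartographic work"},
    "cartographic work": {"map", "work"},
    "work": {"cartographic work"},
    "dialog": {"conversation"},
    "conversation": {"dialog"},
}

print(sorted(largest_connected_component(toy_graph)))
```

Seed terms that end up in a discarded component, such as the isolated pair in the toy graph, no longer influence the label chosen for the topic.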

In [11]:
plots = [build_graph_plot(g, f"Largest Connected subgraph for topic {idx}") 
         for idx, g in enumerate(connected_topic_subgraphs)]
grid = gridplot(plots, ncols=2)
show(grid)

In this section we aim to obtain large graphs that contain as many seed nodes as possible. Graphs with few seed nodes from the original term distribution will tend to be less representative of the original topic.

## Obtaining the main component of each topic
Now that we have the final subgraph for each topic, we will apply several centrality measures to obtain the node that best represents the topic. In the following cell we have defined an auxiliary function that receives a list of algorithms and prints, for each of them, the best _n_ entities that represent each topic:

In [12]:
import networkx.algorithms as nxa

from herc_common.graph import get_centrality_algorithm_results

def try_centrality_algorithms(topic_subgraphs, algorithms, stop_uris, top_n=4):
    """Print the top_n nodes of every topic graph for each centrality algorithm."""
    for (algorithm, name) in algorithms:
        print(f'Algorithm: {name}')
        results = [get_centrality_algorithm_results(g, algorithm, stop_uris, top_n)
                   for g in topic_subgraphs]
        # Keep only the human-readable label and the centrality score of each node
        results_labels = [[(node[0]['label'], node[1]) for node in topic]
                          for topic in results]
        for idx, result in enumerate(results_labels):
            print(f"Topic {idx}:", result)
            print()
        print()

        
algorithms = [
    (nxa.centrality.information_centrality, "Information centrality"),
    (nxa.centrality.eigenvector_centrality_numpy, "Eigenvector centrality"),
    (nxa.centrality.closeness_centrality, "Closeness centrality"),
    (nxa.centrality.betweenness_centrality, "Betweenness centrality"),
    (nxa.centrality.communicability_betweenness_centrality, "Communicability betweenness centrality")
]

try_centrality_algorithms(connected_topic_subgraphs,
                          algorithms,
                          ['Q4167836', 'Q11862829'])

Algorithm: Information centrality
Topic 0: [('map', 0.0030716084118354667), ('type of map', 0.00286892732204285), ('cartographic material', 0.0028388206059795904), ('cartographic work', 0.0028279062448461397)]

Topic 1: [('minute', 0.039280896785896996), ('hour', 0.03792607895091389), ('second', 0.037758204417419255), ('unit of time', 0.03627888993899819)]

Topic 2: [('concept', 0.015957446808510654), ('mathematical concept', 0.015228426395939089), ('axiom', 0.01339285714285716), ('abstract object', 0.013392857142857153)]

Topic 3: [('abstract object', 0.0035540722636019386), ('object', 0.0035165777492774204), ('product', 0.003455814546576825), ('textile', 0.003374190510044103)]

Topic 4: [('protein', 0.023529411764705875), ('first-order metaclass', 0.015686274509803918), ('Probable ABC transporter ATP-binding protein MlaF/Mkl', 0.015686274509803918), ('protein family', 0.014814814814814815)]

Topic 5: [('DNA', 0.009435336266224661), ('nucleic acid', 0.008457930852327596), ('polynucleo

Topic 0: [('database', 0.6523977462419074), ('index', 0.27525590489962254), ('work', 0.2171137265964696), ('collection', 0.2145687820373118)]

Topic 1: [('minute', 0.461085266827684), ('hour', 0.40950288026938225), ('second', 0.36168728418658136), ('unit of time', 0.33663880909950683)]

Topic 2: [('variable-order metaclass', 0.5275526639437641), ('concept', 0.5094105705080683), ('abstract object', 0.42064072890534776), ('mathematical concept', 0.2199519803849073)]

Topic 3: [('textile', 0.5361289151611236), ('clothing', 0.37663471476262206), ('artificial physical object', 0.34511935504224495), ('artificial entity', 0.29262925206990453)]

Topic 4: [('protein', 0.7048830810703123), ('first-order metaclass', 0.13417954279797137), ('Probable ABC transporter ATP-binding protein MlaF/Mkl', 0.13417954279797137), ('amino acid', 0.12429850169690303)]

Topic 5: [('DNA', 0.523799424145078), ('nucleotides', 0.331306021362576), ('nucleic acid', 0.26388135589943545), ('polynucleotide', 0.26226507131

Topic 0: [('map', 0.23711340206185566), ('cartographic work', 0.22384428223844283), ('type of map', 0.22384428223844283), ('work', 0.21800947867298578)]

Topic 1: [('minute', 0.5897435897435898), ('second', 0.5476190476190477), ('hour', 0.5111111111111111), ('time interval', 0.5111111111111111)]

Topic 2: [('concept', 0.4307692307692308), ('mathematical concept', 0.4117647058823529), ('axiom', 0.36363636363636365), ('abstract object', 0.3373493975903614)]

Topic 3: [('abstract object', 0.23870967741935484), ('object', 0.2364217252396166), ('product', 0.23270440251572327), ('textile', 0.2276923076923077)]

Topic 4: [('protein', 0.7755102040816326), ('first-order metaclass', 0.5135135135135135), ('Probable ABC transporter ATP-binding protein MlaF/Mkl', 0.5135135135135135), ('protein catabolic process, modulating synaptic transmission', 0.4418604651162791)]

Topic 5: [('DNA', 0.5662650602409639), ('nucleotides', 0.44549763033175355), ('polynucleotide', 0.4392523364485981), ('nucleic acid'

Topic 0: [('map', 0.608337314859054), ('type of map', 0.5011944577161969), ('work', 0.464405160057334), ('first-order metaclass', 0.43645484949832775)]

Topic 1: [('second', 0.4752493882928664), ('minute', 0.33088650479954823), ('time interval', 0.27667984189723316), ('hour', 0.24788255223037825)]

Topic 2: [('concept', 0.7142857142857142), ('mathematical concept', 0.5634920634920635), ('axiom', 0.5132275132275133), ('axiomatic system', 0.2698412698412698)]

Topic 3: [('textile', 0.644144144144144), ('abstract object', 0.5879303961495742), ('object', 0.5053683820807109), ('product', 0.503517215845983)]

Topic 4: [('protein', 0.9608819345661451), ('protein family', 0.2496443812233286), ('first-order metaclass', 0.13229018492176386), ('Probable ABC transporter ATP-binding protein MlaF/Mkl', 0.13229018492176386)]

Topic 5: [('DNA', 0.7461469534050177), ('nucleotides', 0.3129718599862732), ('mitochondrion', 0.16591931670860996), ('genome', 0.15438496148859904)]

Topic 6: [('Generalizabilit

Topic 0: [('map', 0.615062985423239), ('type of map', 0.5051210480224867), ('work', 0.4661800986402719), ('first-order metaclass', 0.4447677436596028)]

Topic 1: [('minute', 0.5639578854801077), ('second', 0.559828499884562), ('hour', 0.45536879380298534), ('time interval', 0.41107964941810554)]

Topic 2: [('concept', 0.7367868927067245), ('mathematical concept', 0.5799596414035303), ('axiom', 0.523303917668663), ('axiomatic system', 0.279059101848867)]

Topic 3: [('textile', 0.6572223823110261), ('abstract object', 0.6004877340034696), ('product', 0.5099750501409359), ('object', 0.5085349190701691)]

Topic 4: [('protein', 0.966660573460857), ('protein family', 0.2577425062029279), ('first-order metaclass', 0.1873658616716221), ('Probable ABC transporter ATP-binding protein MlaF/Mkl', 0.1873658616716221)]

Topic 5: [('DNA', 0.8524274823423242), ('nucleotides', 0.4176593117931271), ('polynucleotide', 0.2817988474447526), ('nucleic acid', 0.25583902755838106)]

Topic 6: [('Generalizabili
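
To make the ranking step more concrete, closeness centrality (one of the measures compared above) can be written out in a few lines. The sketch below is standalone and uses only the standard library, whereas the notebook relies on NetworkX. It assumes a connected, undirected graph given as a dictionary of neighbour sets, and it assumes that the stop list simply excludes nodes from the final ranking. The function names are hypothetical.

```python
from collections import deque


def closeness_centrality(adjacency):
    """Closeness of each node: (n - 1) / sum of shortest-path distances."""
    n = len(adjacency)
    scores = {}
    for source in adjacency:
        # Breadth-first search gives shortest distances in unweighted graphs
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for other in adjacency[node]:
                if other not in dist:
                    dist[other] = dist[node] + 1
                    queue.append(other)
        total = sum(dist.values())
        scores[source] = (n - 1) / total if total else 0.0
    return scores


def top_nodes(scores, stop_nodes, top_n=4):
    """Best-scoring nodes, skipping the ones listed in stop_nodes."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(node, score) for node, score in ranked if node not in stop_nodes][:top_n]


# Toy star graph: every leaf is connected to "hub".
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = closeness_centrality(star)
print(top_nodes(scores, stop_nodes=set(), top_n=2))
print(top_nodes(scores, stop_nodes={"hub"}, top_n=2))
```

Excluding very generic nodes through the stop list keeps them from winning the ranking just because many paths pass close to them.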

## Add labels to LDA model
Finally, we will attach the best label of each topic to the previously trained model. When the model is loaded again and a topic is inferred for a given text, it will also be able to return a representative label for that topic, which is linked to Wikidata as well:

In [13]:
from herc_common.topic import Topic

final_results = [get_centrality_algorithm_results(g,
                                                  nxa.centrality.information_centrality,
                                                  ['Q4167836', 'Q11862829'], top_n=1)
                 for g in connected_topic_subgraphs]

final_results_topics = [Topic.from_node(topic[0], topic[1], "lda") 
                        for result in final_results for topic in result]
lda_model = lda_pipe.named_steps['model']

In [14]:
from tqdm import tqdm

import en_core_web_md
import string
import numpy as np

en_core_web_md.load()

<spacy.lang.en.English at 0x1a50c2c62c8>

In [15]:
from herc_common.topic import LabelledTopicModel

labelled_topic_model = LabelledTopicModel(lda_model, final_results_topics)

lda_pipe.steps.pop()
lda_pipe.steps.append(('model', labelled_topic_model))
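
The wrapping idea can be pictured with a toy example. The classes below are purely hypothetical and are not the `LabelledTopicModel` implementation. They only assume that the wrapper delegates to the original model's `transform` and pairs each topic weight with the label chosen for that topic.

```python
class LabelledModelSketch:
    """Hypothetical wrapper that attaches one label to every topic."""

    def __init__(self, model, labels):
        self.model = model
        self.labels = labels  # one label per topic, in topic order

    def transform(self, X):
        # Pair every topic weight with its label, highest weight first
        return [sorted(zip(self.labels, weights),
                       key=lambda pair: pair[1], reverse=True)
                for weights in self.model.transform(X)]


class FakeTopicModel:
    """Toy model returning fixed document-topic weights (illustration only)."""

    def transform(self, X):
        return [[0.1, 0.7, 0.2] for _ in X]


wrapped = LabelledModelSketch(FakeTopicModel(), ["map", "minute", "protein"])
print(wrapped.transform(["some document"]))
```

Because the wrapper exposes the same `transform` method as the model it replaces, it can take the model's place as the last step of the pipeline.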

In [16]:
from herc_common.utils import save_object

save_object(lda_pipe, os.path.join(NOTEBOOK_3_RESULTS_DIR, 'lda_pipe_with_labels.pkl'))

## Obtaining the results for every repository in the dataset

In [17]:
import en_core_sci_lg

en_core_sci_lg.load()

<spacy.lang.en.English at 0x1a55feeb188>

In [18]:
results = lda_pipe.transform(git_repositories)

## Saving the results

In [19]:
NEW_COL_NAME = 'topics_from_lda'

git_df[NEW_COL_NAME] = ['\n'.join([f"{str(topic)}, {topic.score:.5f}" for topic in result])
                        for result in results]

results_df = git_df[['gh_id', 'name', NEW_COL_NAME]]
results_df.head()

| | gh_id | name | topics_from_lda |
|---|---|---|---|
| 0 | 216602979 | LIRICAL | Protein kinase domain, 0.03182\ntextile, 0.000... |
| 1 | 199330464 | wikidata_ontomatcher | Wikidata, 0.06549\ngoods, 0.00000\neducational... |
| 2 | 253207181 | ro-crate-ruby | information, 0.07154\ntextile, 0.00000\ndocume... |
| 3 | 212556220 | Misc_Training_scripts | virtuoso, 0.17386\nlibrary science, 0.00000\nc... |
| 4 | 155879756 | FAIRifier | archaeological find, 0.13087\nmap, 0.03841\nen... |


In [20]:
OUTPUT_FILE_NAME = "git_df_with_lda_topics.csv"

results_df.to_csv(os.path.join(NOTEBOOK_3_RESULTS_DIR, OUTPUT_FILE_NAME), index=False)
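
Since every cell of the new column packs several `label, score` pairs into one string, anyone reading the CSV back has to split it again. The helper below is a small sketch of one way to do that. The name `parse_topics_cell` is hypothetical, and the split is done from the right because some labels contain commas themselves.

```python
def parse_topics_cell(cell):
    """Turn 'label, score' lines back into (label, score) tuples."""
    pairs = []
    for line in cell.split("\n"):
        # Split on the last ', ' only, so commas inside the label survive
        label, score = line.rsplit(", ", 1)
        pairs.append((label, float(score)))
    return pairs


cell = "Protein kinase domain, 0.03182\ntextile, 0.00000"
print(parse_topics_cell(cell))
```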

In [21]:
results_df

| | gh_id | name | topics_from_lda |
|---|---|---|---|
| 0 | 216602979 | LIRICAL | Protein kinase domain, 0.03182\ntextile, 0.000... |
| 1 | 199330464 | wikidata_ontomatcher | Wikidata, 0.06549\ngoods, 0.00000\neducational... |
| 2 | 253207181 | ro-crate-ruby | information, 0.07154\ntextile, 0.00000\ndocume... |
| 3 | 212556220 | Misc_Training_scripts | virtuoso, 0.17386\nlibrary science, 0.00000\nc... |
| 4 | 155879756 | FAIRifier | archaeological find, 0.13087\nmap, 0.03841\nen... |
| 5 | 90349931 | elda | minute, 0.42500\ngoods, 0.00000\nwork, 0.00000 |
| 6 | 126633812 | music-genre-classification | music, 0.10166\nfirst-order metaclass, 0.03619... |
| 7 | 173520377 | probabilistic_nlg | cosmology, 0.04792\ninformation, 0.00000\nmath... |
| 8 | 103798851 | DataStructures-Algorithms-InC | computer science, 0.12205\nwork, 0.00000\nenti... |
| 9 | 153249816 | Music-Generation-Using-Deep-Learning | music, 0.02836\ncomputer science, 0.00000\nent... |
