# Objectives

- test different packages and models
    - spacy
    - tensorflow
        - LDA
        - BERT
        - fasttext

# Data info

https://www.kaggle.com/anmolkumar/topic-modeling-for-research-articles-20/tasks?taskId=2470

Task Details
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more and more difficult. Tagging or topic modelling gives each research article a clear identifying label, which facilitates recommendation and search.

Earlier, on Independence Day, we conducted a hackathon to predict the topics for each article included in the test set. Continuing with the same problem, in this live hackathon we go one step further and predict the tags associated with the articles.

Given the abstracts for a set of research articles, predict the tags for each article included in the test set.
Note that a research article can have multiple tags. The research article abstracts are sourced from the following 4 topics:

- Computer Science
- Mathematics
- Physics
- Statistics

The possible tags are as follows:

[Analysis of PDEs, Applications, Artificial Intelligence, Astrophysics of Galaxies, Computation and Language, Computer Vision and Pattern Recognition, Cosmology and Nongalactic Astrophysics, Data Structures and Algorithms, Differential Geometry, Earth and Planetary Astrophysics, Fluid Dynamics, Information Theory, Instrumentation and Methods for Astrophysics, Machine Learning, Materials Science, Methodology, Number Theory, Optimization and Control, Representation Theory, Robotics, Social and Information Networks, Statistics Theory, Strongly Correlated Electrons, Superconductivity, Systems and Control]


In [72]:
import pandas as pd
import numpy as np
import pickle
import random

import spacy
from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
from spacy.util import minibatch
from spacy.training import Example

In [6]:
train = pd.read_csv('data/train.csv')
tags = pd.read_csv('data/tags.csv')

In [12]:
train.head(2)

Unnamed: 0,id,ABSTRACT,Computer Science,Mathematics,Physics,Statistics,Analysis of PDEs,Applications,Artificial Intelligence,Astrophysics of Galaxies,...,Methodology,Number Theory,Optimization and Control,Representation Theory,Robotics,Social and Information Networks,Statistics Theory,Strongly Correlated Electrons,Superconductivity,Systems and Control
0,1824,a ever-growing datasets inside observational a...,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3094,we propose the framework considering optimal $...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


- input: the abstract text
- output: a representation of the tags; one abstract can have several
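One way to picture that output representation is a multi-hot vector: one 0/1 slot per tag, with as many ones as the abstract has tags. The sketch below is illustrative only. `encode_tags` and `decode_tags` are hypothetical helper names, and the three-tag list is a shortened stand-in for the full 25-tag list.

```python
# Illustrative subset of the tag list (the real task has 25 tags).
TAGS = ["Machine Learning", "Number Theory", "Robotics"]

def encode_tags(article_tags, all_tags=TAGS):
    """Turn a collection of tag names into a multi-hot list of 0/1 values."""
    present = set(article_tags)
    unknown = present - set(all_tags)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return [1 if tag in present else 0 for tag in all_tags]

def decode_tags(vector, all_tags=TAGS):
    """Recover the tag names from a multi-hot list."""
    return [tag for tag, flag in zip(all_tags, vector) if flag]

vector = encode_tags(["Robotics", "Machine Learning"])
print(vector)               # [1, 0, 1]
print(decode_tags(vector))  # ['Machine Learning', 'Robotics']
```

The tag columns of `train.csv` already have this shape: one 0/1 column per tag.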

In [75]:
# spaCy
# https://www.kaggle.com/matleonard/text-classification (note: the API used there changed in spaCy 3.0)
# https://spacy.io/api/textcategorizer


nlp = spacy.blank("en")


config = {
    "threshold": 0.5,
    "model": DEFAULT_MULTI_TEXTCAT_MODEL,
}

textcat = nlp.add_pipe("textcat_multilabel", config=config)

In [76]:
# add all tags
for tag in np.ravel(tags.values):
    textcat.add_label(tag)

In [35]:
textcat.labels

('Analysis of PDEs',
 'Applications',
 'Artificial Intelligence',
 'Astrophysics of Galaxies',
 'Computation and Language',
 'Computer Vision and Pattern Recognition',
 'Cosmology and Nongalactic Astrophysics',
 'Data Structures and Algorithms',
 'Differential Geometry',
 'Earth and Planetary Astrophysics',
 'Fluid Dynamics',
 'Information Theory',
 'Instrumentation and Methods for Astrophysics',
 'Machine Learning',
 'Materials Science',
 'Methodology',
 'Number Theory',
 'Optimization and Control',
 'Representation Theory',
 'Robotics',
 'Social and Information Networks',
 'Statistics Theory',
 'Strongly Correlated Electrons',
 'Superconductivity',
 'Systems and Control')

In [62]:
# text preprocessing

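def process_text(df, tags):
    """Return the abstracts and, per row, a spaCy-style {'cats': {tag: bool}} dict."""
    texts = df.ABSTRACT

    labels = []
    # iterate over the rows directly so a non-default index cannot break the lookup
    for _, row in df[list(tags)].iterrows():
        label_dict = {tag: bool(row[tag] == 1) for tag in tags}
        labels.append({'cats': label_dict})

    return texts, labels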


In [53]:
process_text(train, textcat.labels)[0]

0        a ever-growing datasets inside observational a...
1        we propose the framework considering optimal $...
2        nanostructures with open shell transition meta...
3        stars are self-gravitating fluids inside which...
4        deep neural perception and control networks ar...
                               ...                        
13999    a methodology of automatic detection of a even...
14000    we consider a case inside which the robot has ...
14001    despite being usually considered two competing...
14002    we present the framework and its implementatio...
14003    here we report small-angle neutron scattering ...
Name: ABSTRACT, Length: 14004, dtype: object

In [55]:
train_data = list(zip(*process_text(train, textcat.labels)))

In [56]:
train_data[1]

("we propose the framework considering optimal $t$-matchings excluding a prescribed $t$-factors inside bipartite graphs. a proposed framework was the generalization of a nonbipartite matching problem and includes several problems, such as a triangle-free $2$-matching, square-free $2$-matching, even factor, and arborescence problems. inside this paper, we demonstrate the unified understanding of these problems by commonly extending previous important results. we solve our problem under the reasonable assumption, which was sufficiently broad to include a specific problems listed above. we first present the min-max theorem and the combinatorial algorithm considering a unweighted version. we then provide the linear programming formulation with dual integrality and the primal-dual algorithm considering a weighted version. the key ingredient of a proposed algorithm was the technique to shrink forbidden structures, which corresponds to a techniques of shrinking odd cycles, triangles, squares,

In [58]:
# use a context manager so the file handle is closed after writing
with open('data/spacy_text_label.p', 'wb') as f:
    pickle.dump(train_data, f)

In [78]:
# training a text categorizer model
# changed based on https://spacy.io/usage/v3#migrating

random.seed(1)
spacy.util.fix_random_seed(1)

examples = []

for text, labels in train_data:
    examples.append(Example.from_dict(nlp.make_doc(text), labels))
nlp.initialize(lambda: examples)

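# losses is created once and never reset inside the loop, so each print
# shows a running total over all epochs so far
losses = {}
for epoch in range(10):
    random.shuffle(examples)
    # Create the batch generator with batch size = 8
    batches = minibatch(examples, size=8)
    # Iterate through minibatches and accumulate the loss
    for batch in batches:
        nlp.update(batch, losses=losses)
    print(losses)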


{'textcat_multilabel': 22.6571195081342}
{'textcat_multilabel': 36.55466588283889}
{'textcat_multilabel': 46.095829123980366}
{'textcat_multilabel': 52.780450004240265}
{'textcat_multilabel': 57.51071749042603}
{'textcat_multilabel': 60.97146983833227}
{'textcat_multilabel': 63.68135900271591}
{'textcat_multilabel': 65.80210949639513}
{'textcat_multilabel': 67.48517272951358}
{'textcat_multilabel': 69.00130326519229}
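
These values grow from epoch to epoch, which is what a `losses` dictionary looks like when it is never reset between epochs. Assuming that is what happened here (an assumption, not something the log states), each epoch's own loss is the difference between consecutive values. A small sketch using the numbers printed above, rounded to four decimals; `per_epoch` is a hypothetical helper name:

```python
# Running totals copied from the training log above (rounded).
cumulative = [22.6571, 36.5547, 46.0958, 52.7805, 57.5107,
              60.9715, 63.6814, 65.8021, 67.4852, 69.0013]

def per_epoch(running_totals):
    """Difference consecutive running totals to recover each epoch's own loss."""
    previous = 0.0
    result = []
    for total in running_totals:
        result.append(total - previous)
        previous = total
    return result

for epoch, loss in enumerate(per_epoch(cumulative), start=1):
    print(f"epoch {epoch:2d}: {loss:.4f}")
```

Under that assumption the per-epoch loss falls steadily, from about 22.7 in the first epoch to about 1.5 in the last.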


In [79]:
nlp.to_disk('model/spacy_simple')

In [105]:
# evaluation on the training data (no held-out split, so this is an optimistic estimate)

docs = [nlp.tokenizer(text) for text in train.ABSTRACT.values]

textcat = nlp.get_pipe('textcat_multilabel')
scores = textcat.predict(docs)

predicted = (scores > .5).astype(int)
true = train[[*textcat.labels]].values

In [111]:
true.shape

(14004, 25)

In [113]:
# element-wise accuracy over all (abstract, tag) cells
(predicted == true).sum() / (true.shape[0] * true.shape[1])

0.9977606398171951
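
Element-wise accuracy also counts every correctly predicted 0, and with 25 tags per abstract most cells are likely 0, so the score can look high even when few tags are found. A complementary view is micro-averaged F1, which only rewards correctly predicted tags. The function below is a minimal NumPy sketch; `micro_f1` is a hypothetical helper, not part of spaCy, and the toy arrays stand in for `true` and `predicted`.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, tag) cells of two 0/1 arrays."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # tags predicted and present
    fp = np.sum(~y_true & y_pred)   # tags predicted but absent
    fn = np.sum(y_true & ~y_pred)   # tags present but missed
    denominator = 2 * tp + fp + fn
    # With no positive labels and no positive predictions, define F1 as 0.
    return float(2 * tp / denominator) if denominator else 0.0

# Toy stand-ins for `true` and `predicted` (3 abstracts, 4 tags).
toy_true = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
toy_pred = np.array([[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]])

print((toy_pred == toy_true).mean())  # element-wise accuracy
print(micro_f1(toy_true, toy_pred))   # micro-averaged F1
```

On the toy data 9 of 12 cells match, so the accuracy is 0.75, while only one of the three true tags is recovered and micro-F1 is 0.4. The same call would work on the notebook's arrays as `micro_f1(true, predicted)`.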

In [81]:
# prediction

test = pd.read_csv('data/test.csv')
docs = [nlp.tokenizer(text) for text in test.ABSTRACT.values]

In [87]:
textcat = nlp.get_pipe('textcat_multilabel')
scores = textcat.predict(docs)

In [97]:
submission = pd.read_csv('data/submission.csv')
submission.iloc[:,1:] = (scores > .5).astype(int)

In [101]:
submission.iloc[:,1:].sum()

Analysis of PDEs                                 245
Applications                                     222
Artificial Intelligence                          464
Astrophysics of Galaxies                         190
Computation and Language                         264
Computer Vision and Pattern Recognition          294
Cosmology and Nongalactic Astrophysics           257
Data Structures and Algorithms                   144
Differential Geometry                            237
Earth and Planetary Astrophysics                 194
Fluid Dynamics                                   145
Information Theory                                72
Instrumentation and Methods for Astrophysics     219
Machine Learning                                1376
Materials Science                                283
Methodology                                      232
Number Theory                                    153
Optimization and Control                         112
Representation Theory                         

In [103]:
submission.to_csv('data/submission_spacy_simple.csv', index=False)