# Topic Modeling for Research Articles 

Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging, or topic modeling, attaches identifying labels to each research article, which makes recommendation and search easier.

Given the abstract and title for a set of research articles, predict the topics for each article included in the test set. 

Note that a research article can have more than one topic. The abstracts and titles are drawn from the following six topics:

1. Computer Science

2. Physics

3. Mathematics

4. Statistics

5. Quantitative Biology

6. Quantitative Finance
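
Because an article can carry several topics at once, the target for each article is a vector of six 0/1 flags rather than a single class. The sketch below shows that encoding on a made-up example. The `TOPICS` list mirrors the six topics above; the helper name `to_multi_hot` is hypothetical and used only for illustration.

```python
# The six topics, in the same order as the label columns of the training data.
TOPICS = [
    "Computer Science",
    "Physics",
    "Mathematics",
    "Statistics",
    "Quantitative Biology",
    "Quantitative Finance",
]


def to_multi_hot(topic_names):
    """Return one 0/1 flag per topic, in the order of TOPICS (hypothetical helper)."""
    wanted = set(topic_names)
    unknown = wanted - set(TOPICS)
    if unknown:
        raise ValueError(f"Unknown topics: {sorted(unknown)}")
    return [1 if topic in wanted else 0 for topic in TOPICS]


# An article tagged with two topics gets two flags set.
print(to_multi_hot(["Computer Science", "Statistics"]))  # [1, 0, 0, 1, 0, 0]
```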

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/janatahack-independence-day-2020-ml-hackathon/sample_submission_UVKGLZE.csv
/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv
/kaggle/input/janatahack-independence-day-2020-ml-hackathon/train.csv
/kaggle/input/glove-global-vectors-for-word-representation/glove.6B.100d.txt
/kaggle/input/glove-global-vectors-for-word-representation/glove.6B.200d.txt
/kaggle/input/glove-global-vectors-for-word-representation/glove.6B.50d.txt


In [2]:
train = pd.read_csv("/kaggle/input/janatahack-independence-day-2020-ml-hackathon/train.csv")
test = pd.read_csv("/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv")
train.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


In [3]:
train.nunique()

ID                      20972
TITLE                   20972
ABSTRACT                20972
Computer Science            2
Physics                     2
Mathematics                 2
Statistics                  2
Quantitative Biology        2
Quantitative Finance        2
dtype: int64

In [4]:
topic_cols = ["Computer Science", "Physics", "Mathematics", "Statistics", "Quantitative Biology", "Quantitative Finance"]
# Count how many topics each article is tagged with
train["No_of_topics"] = train[topic_cols].sum(axis=1)
train[train["No_of_topics"] > 1]

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,No_of_topics
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0,2
21,22,Many-Body Localization: Stability and Instability,Rare regions with weak disorder (Griffiths r...,0,1,1,0,0,0,2
28,29,Minimax Estimation of the $L_1$ Distance,We consider the problem of estimating the $L...,0,0,1,1,0,0,2
29,30,Density large deviations for multidimensional ...,We investigate the density large deviation f...,0,1,1,0,0,0,2
30,31,mixup: Beyond Empirical Risk Minimization,"Large deep neural networks are powerful, but...",1,0,0,1,0,0,2
...,...,...,...,...,...,...,...,...,...,...
20963,20964,Faithful Inversion of Generative Models for Ef...,Inference amortization methods share informa...,1,0,0,1,0,0,2
20964,20965,A social Network Analysis of the Operations Re...,We study the U.S. Operations Research/Indust...,1,0,0,1,0,0,2
20967,20968,Contemporary machine learning: a guide for pra...,Machine learning is finding increasingly bro...,1,1,0,0,0,0,2
20970,20971,On the Efficient Simulation of the Left-Tail o...,The sum of Log-normal variates is encountere...,0,0,1,1,0,0,2


In [5]:
train.No_of_topics.value_counts()

1    15928
2     4793
3      251
Name: No_of_topics, dtype: int64

In [6]:
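# Join title and abstract with an explicit space so the last title word cannot fuse with the first abstract word
train["content"] = train["TITLE"] + " " + train["ABSTRACT"]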
train.drop(labels = ["ID","TITLE","ABSTRACT","No_of_topics"],axis=1,inplace = True)
train.head()

Unnamed: 0,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,content
0,1,0,0,0,0,0,Reconstructing Subject-Specific Effect Maps P...
1,1,0,0,0,0,0,Rotation Invariance Neural Network Rotation i...
2,0,0,1,0,0,0,Spherical polyharmonics and Poisson kernels fo...
3,0,0,1,0,0,0,A finite element approximation for the stochas...
4,1,0,0,1,0,0,Comparative study of Discrete Wavelet Transfor...


In [7]:
from collections import Counter
def vocab(texts):
    cnt = Counter()
    for row in texts.values:
        for i in row.split():
            cnt[i] += 1
    return len(cnt)
vocab_size = vocab(train.content)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics


labels = ['Computer Science', 'Physics', 'Mathematics','Statistics', 
          'Quantitative Biology', 'Quantitative Finance']

for label in labels:
    print(label)
    print('')
    print('Value counts:')
    print(train[label].value_counts())

    X = train['content']
    y = train[label]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.33)
    
    text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', LinearSVC()),
    ])

    text_clf.fit(X_train, y_train)  

    predictions = text_clf.predict(X_test)

    print(metrics.confusion_matrix(y_test,predictions))
    print('')
    print(metrics.classification_report(y_test,predictions))
    print('')
    print('')
    print('')
    print('')

Computer Science

Value counts:
0    12378
1     8594
Name: Computer Science, dtype: int64
[[3575  497]
 [ 500 2349]]

              precision    recall  f1-score   support

           0       0.88      0.88      0.88      4072
           1       0.83      0.82      0.82      2849

    accuracy                           0.86      6921
   macro avg       0.85      0.85      0.85      6921
weighted avg       0.86      0.86      0.86      6921





Physics

Value counts:
0    14959
1     6013
Name: Physics, dtype: int64
[[4754  154]
 [ 313 1700]]

              precision    recall  f1-score   support

           0       0.94      0.97      0.95      4908
           1       0.92      0.84      0.88      2013

    accuracy                           0.93      6921
   macro avg       0.93      0.91      0.92      6921
weighted avg       0.93      0.93      0.93      6921





Mathematics

Value counts:
0    15354
1     5618
Name: Mathematics, dtype: int64
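
The loop above scores each topic separately. To summarise all six classifiers with a single number, one option is the micro-averaged F1, which pools true positives, false positives and false negatives across every label before computing the score. The sketch below uses made-up toy arrays and a hypothetical helper `micro_f1`; it is not part of the original notebook. On binary indicator matrices, scikit-learn's `f1_score(..., average='micro')` should give the same value.

```python
import numpy as np


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a 2-D 0/1 label matrix (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # label predicted and present
    fp = np.sum(~y_true & y_pred)   # label predicted but absent
    fn = np.sum(y_true & ~y_pred)   # label present but missed
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Toy data: 3 articles x 6 topics (values invented for illustration).
y_true = np.array([[1, 0, 0, 1, 0, 0],
                   [0, 1, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0],
                   [0, 1, 1, 0, 0, 0],
                   [0, 0, 1, 1, 0, 0]])

print(round(micro_f1(y_true, y_pred), 3))  # 0.8
```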


In [None]:
import matplotlib.pyplot as plt

# Use a 10 x 8 inch default figure size
plt.rcParams["figure.figsize"] = (10, 8)
train_labels = train[['Computer Science', 'Physics', 'Mathematics','Statistics','Quantitative Biology', 'Quantitative Finance']]
train_labels.sum(axis=0).plot.bar()

In [None]:
import re
# from nltk.corpus import stopwords
# stop_words = set(stopwords.words('english'))
def preprocess_text(sen):
    # Remove URLs (raw string avoids the invalid "\s" escape warning)
    sentence = re.sub(r"https?://\S+", " ", sen)
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    
    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    
    return sentence
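
To see what these substitutions do, the sketch below reruns the same four steps on a made-up sentence. The function name `clean_demo` and the sample text are illustrative only. Note that case is preserved and a trailing space can remain.

```python
import re


def clean_demo(sen):
    """Same four substitutions as preprocess_text, repeated so the demo is self-contained."""
    sentence = re.sub(r"http[s]*://[^\s]+", " ", sen)        # drop URLs
    sentence = re.sub(r"[^a-zA-Z]", " ", sentence)           # drop punctuation and digits
    sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)      # drop isolated single letters
    sentence = re.sub(r"\s+", " ", sentence)                 # collapse runs of whitespace
    return sentence


sample_text = "See https://arxiv.org/abs/1234 for a proof of the $L_1$ bound."
cleaned = clean_demo(sample_text)
print(repr(cleaned))  # 'See for proof of the bound '
```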

In [None]:
X = []
sentences = list(train["content"])
for sen in sentences:
    X.append(preprocess_text(sen))
y = train_labels.values
y


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

In [None]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
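
`pad_sequences` is called here with `padding='post'` but without a `truncating` argument, and Keras truncates from the front (`truncating='pre'`) by default. For a sequence longer than `maxlen`, the earliest tokens, which here come from the title, are the ones discarded. The pure-Python sketch below imitates that behaviour on toy data; `pad_post_truncate_pre` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Imitate pad_sequences(padding='post', truncating='pre'); assumes maxlen >= 1."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                      # keep only the LAST maxlen tokens
        padded.append(kept + [value] * (maxlen - len(kept)))  # pad at the end
    return padded


# Toy token ids: one short sequence, one longer than maxlen.
toy_sequences = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
print(pad_post_truncate_pre(toy_sequences, maxlen=4))
# [[5, 8, 2, 0], [9, 4, 3, 6]]
```

If the beginning of each text matters more than the end, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.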

In [None]:
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

# Read the 200-dimensional GloVe vectors into a word -> vector dictionary
with open('../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

embedding_matrix = zeros((vocab_size, 200))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
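
`Tokenizer(num_words=5000)` restricts the indices that `texts_to_sequences` emits to those below 5000, yet `tokenizer.word_index` still lists every word seen during fitting. The embedding matrix built above therefore has rows for words the model never receives. One possible variation, not what this notebook does, is to cap the matrix at `num_words` rows. The sketch below shows the idea with toy stand-ins: `word_index`, `pretrained`, `num_words` and `dim` are all made-up values, and `build_embedding_matrix` is a hypothetical helper.

```python
import numpy as np


def build_embedding_matrix(word_index, pretrained, num_words, dim):
    """Build a (rows, dim) matrix holding vectors only for indices below num_words."""
    rows = min(num_words, len(word_index) + 1)   # row 0 is reserved for padding
    matrix = np.zeros((rows, dim), dtype="float32")
    for word, index in word_index.items():
        if index >= rows:
            continue                             # index is never emitted, skip it
        vector = pretrained.get(word)
        if vector is not None:                   # words without a vector stay all-zero
            matrix[index] = vector
    return matrix


# Toy vocabulary ordered by frequency, and toy 2-dimensional "pretrained" vectors.
word_index = {"the": 1, "model": 2, "quark": 3, "zzyzx": 4}
pretrained = {
    "the": np.array([0.1, 0.2]),
    "model": np.array([0.3, 0.4]),
    "quark": np.array([0.5, 0.6]),
}

matrix = build_embedding_matrix(word_index, pretrained, num_words=3, dim=2)
print(matrix.shape)  # (3, 2)
```

With a capped matrix, the first argument of the `Embedding` layer would need to match the number of rows.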

In [None]:
from keras.layers import Embedding,Dense,GlobalMaxPool1D,Dropout,Flatten,Bidirectional,LSTM
from keras.models import Sequential
# Model 1
# deep_inputs = Input(shape=(maxlen,))
# embedding_layer = Embedding(vocab_size, 200, weights=[embedding_matrix], trainable=False)(deep_inputs)
# LSTM_Layer_1 = LSTM(128)(embedding_layer)
# maxpool = GlobalMaxPooling1D()
# dense_layer2 =  Dense(128, activation='relu')(maxpool)
# dense_layer_1 = Dense(6, activation='sigmoid')(LSTM_Layer_1)
# model = Model(inputs=deep_inputs, outputs=dense_layer_1)
# Model 2
model = Sequential([
    Embedding(vocab_size, 200, input_length=maxlen, weights=[embedding_matrix], trainable=False),
    Bidirectional(LSTM(100, return_sequences=True)),
    GlobalMaxPool1D(),
    Dense(128, activation='relu'),
    # Dense(64, activation='relu'),
    # Dense(16, activation='relu'),
    Dense(6, activation='sigmoid'),  # one independent sigmoid output per topic
])


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
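
With one sigmoid output per topic and `binary_crossentropy` as the loss, each of the six outputs is treated as an independent yes/no decision. In this setting Keras normally resolves `acc` to binary accuracy, that is, the fraction of individual label decisions that are correct, which can look much higher than the fraction of articles whose whole topic set is right. The sketch below contrasts the two on made-up numbers; the arrays are invented for illustration and do not come from the model.

```python
import numpy as np

# Toy ground truth and toy sigmoid outputs for 3 articles x 6 topics.
y_true = np.array([[1, 0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 0],
                   [0, 0, 1, 1, 0, 0]])
probs = np.array([[0.9, 0.1, 0.2, 0.4, 0.1, 0.0],
                  [0.2, 0.8, 0.1, 0.1, 0.0, 0.1],
                  [0.1, 0.1, 0.7, 0.6, 0.2, 0.1]])

y_pred = (probs > 0.5).astype(int)

# Per-label accuracy: every one of the 18 decisions counts separately.
label_accuracy = float((y_pred == y_true).mean())

# Exact-match accuracy: an article counts only if all 6 labels are right.
exact_match = float((y_pred == y_true).all(axis=1).mean())

print(round(label_accuracy, 3))  # 0.944
print(round(exact_match, 3))     # 0.667
```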

In [None]:
model.summary()

In [None]:
from keras.utils import plot_model
plot_model(model, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

In [None]:
history = model.fit(X_train, y_train, batch_size=256, epochs=5, verbose=1, validation_split=0.2)


In [None]:
score = model.evaluate(X_test, y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

In [None]:
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()

In [None]:
test.head()

In [None]:
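# Join title and abstract with an explicit space, matching how the training text is meant to be built
test['content'] = test["TITLE"] + " " + test["ABSTRACT"]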
test.drop(labels = ["ID","TITLE","ABSTRACT"],axis=1,inplace = True)
test.head()


In [None]:
test_df = []
rows = list(test.content)
for sent in rows:
    test_df.append(preprocess_text(sent))
    
#     return rows
# rows

In [None]:
# The tokenizer was already fitted on the training texts, so it is only applied here, not re-fitted.
# Note: this overwrites the held-out X_test from the earlier split with the competition test set.
X_test = tokenizer.texts_to_sequences(test_df)
X_test = pad_sequences(X_test, maxlen=maxlen, padding='post')


In [None]:
preds = model.predict(X_test)
# Turn the sigmoid outputs into 0/1 labels with a 0.5 threshold
preds = np.where(preds > 0.5, 1, 0)


In [None]:
preds = preds.astype("int32")
preds
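
Thresholding at 0.5 can leave an article with no topic at all, even though every training article has at least one (the `No_of_topics` counts earlier start at 1 and add up to all 20,972 rows). One optional guard, which is an addition here and not part of the original pipeline, is to fall back to the single most probable topic whenever no score clears the threshold. The toy `probs` array and the helper name `threshold_with_fallback` are made up for illustration.

```python
import numpy as np


def threshold_with_fallback(probs, threshold=0.5):
    """Binarise scores; rows with no label above the threshold get their argmax label."""
    probs = np.asarray(probs)
    labels = (probs > threshold).astype("int32")
    empty_rows = np.flatnonzero(labels.sum(axis=1) == 0)   # articles with no topic yet
    labels[empty_rows, probs[empty_rows].argmax(axis=1)] = 1
    return labels


# Toy scores: the second article has no score above 0.5.
probs = np.array([[0.70, 0.20, 0.60, 0.10, 0.00, 0.10],
                  [0.30, 0.45, 0.20, 0.10, 0.05, 0.10]])

print(threshold_with_fallback(probs).tolist())
# [[1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0]]
```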

In [None]:
df = pd.DataFrame(data = preds,columns = ['Computer Science', 'Physics', 'Mathematics','Statistics','Quantitative Biology', 'Quantitative Finance'])
df.head()

In [None]:
sample = pd.read_csv("../input/janatahack-independence-day-2020-ml-hackathon/sample_submission_UVKGLZE.csv")
# The ID column was dropped from `test`, so take the IDs from the sample submission
final_df = pd.DataFrame({"ID": sample.ID})
final = pd.concat([final_df,df],axis=1)
final.to_csv("submission.csv",index = False)
print(final.shape)
final.head()