# Experimenting with Topic Modeling using Word Embeddings

The data set contains research paper titles and abstracts, each labeled as Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, Quantitative Finance, or some combination of those labels.  My approach is to convert the text to a vector using word embeddings trained on this data set, then train a separate classifier for each label.  At the end, I create a function that takes text as input and returns the likely topic(s) of the title or abstract.

This function evaluates the input text on each classifier separately, then returns an array with the result of each one, in the same order as the label columns in the training dataset.
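
As a small illustration of that output format, the sketch below maps a 0/1 output vector back to label names with a boolean mask. The `example_preds` vector is made up for illustration and is not a real model output.

```python
import numpy as np

# Label names in the same order as the label columns of the training data.
labels = np.array(['Computer Science', 'Physics', 'Mathematics',
                   'Statistics', 'Quantitative Biology', 'Quantitative Finance'])

# Hypothetical output vector: one 0/1 prediction per label classifier.
example_preds = np.array([1, 0, 0, 1, 0, 0])

# A boolean mask picks out the names of the predicted labels.
predicted_labels = labels[example_preds == 1]
print(predicted_labels)
```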

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


import sys
import matplotlib.pyplot as plt
import random


from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from gensim.models import Word2Vec
import gensim.utils

import tensorflow as tf

%matplotlib notebook
print('You\'re running python %s' % sys.version.split(' ')[0])

#### Load the training data:

In [None]:
train = pd.read_csv('/kaggle/input/topic-modeling-for-research-articles/train.csv',keep_default_na=False)

#### Take a look at the training data:

In [None]:
train.head()

#### Create array with labels for later:

In [None]:
labels = np.array(['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance'],dtype='str')

#### Break up dataset into lists that can be used for training and testing sets:

In [None]:
# the title in this row yields no tokens after simple_preprocess, so I removed the row
issueRowNumber = 8270

# X's
inputAbstracts = train['ABSTRACT'].tolist()
inputTitles = train['TITLE'].tolist()
inputAbstracts.pop(issueRowNumber)
inputTitles.pop(issueRowNumber)

# y's or labels
labelColumns = [None]*len(labels)
for i in range(len(labels)):
    col = train[labels[i]].tolist()
    col.pop(issueRowNumber)
    labelColumns[i] = col
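
Instead of hard-coding the row number, rows with no tokens could be located programmatically. The sketch below is one way to do that, under assumptions: `rough_tokenize` is a hypothetical stand-in that only approximates `gensim.utils.simple_preprocess` (lowercase, alphabetic tokens, length limits), and the example titles are made up.

```python
import re

def rough_tokenize(text, min_len=2, max_len=15):
    """Hypothetical stand-in for gensim.utils.simple_preprocess: lowercase,
    keep alphabetic runs, drop tokens that are too short or too long."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def find_empty_rows(texts):
    """Return the indices of texts that produce no tokens at all."""
    return [i for i, text in enumerate(texts) if not rough_tokenize(text)]

# Made-up titles for illustration; the second one has no usable tokens.
example_titles = ["Deep learning for graphs", "2 + 2 = 4", "A note on primes"]
empty_rows = find_empty_rows(example_titles)
print(empty_rows)
```

In the notebook itself, the real `simple_preprocess` would presumably replace the stand-in tokenizer.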

#### Tokenize titles and abstracts:

In [None]:
#tokenize titles:
inputTitleTokens = []
for title in inputTitles:
    tokens = gensim.utils.simple_preprocess(title)
    inputTitleTokens.append(tokens)
    
#tokenize abstracts:   
inputAbstractTokens = []
for abstract in inputAbstracts:
    tokens = gensim.utils.simple_preprocess(abstract)
    inputAbstractTokens.append(tokens)

#### Create Word Embeddings for article titles and abstracts using Word2Vec

In [None]:
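# Note: gensim 4.0 renamed the `size` argument to `vector_size`; use that name on newer versions.
W2V_model_title = Word2Vec(inputTitleTokens, min_count=1, size=100, workers=3, window=5, sg=1)
W2V_model_abstract = Word2Vec(inputAbstractTokens, min_count=1, size=100, workers=3, window=5, sg=1)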

#### Vectorize article titles and abstracts using Word Embeddings:

In [None]:
vectorizedTitles = [None]*len(inputTitleTokens)
for i in range(len(inputTitleTokens)):
    title=[]
    for word in inputTitleTokens[i]:
        try:
            title.append(W2V_model_title.wv[word])
        except KeyError:
            pass  # skip words that are not in the vocabulary
    title_avg = np.mean(np.array(title, dtype='f'),axis=0)
    vectorizedTitles[i]=title_avg

vectorizedAbstracts = [None]*len(inputAbstractTokens)
for i in range(len(inputAbstractTokens)):
    abstract=[]
    for word in inputAbstractTokens[i]:
        try:
            abstract.append(W2V_model_abstract.wv[word])
        except KeyError:
            pass  # skip words that are not in the vocabulary
    abstract_avg = np.mean(np.array(abstract, dtype='f'),axis=0)
    vectorizedAbstracts[i]=abstract_avg
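
The averaging step produces `nan` when a text has no tokens, which is why one row was removed earlier. A minimal sketch of an alternative, assuming a zero vector is an acceptable fallback, is shown below. The helper `average_embedding` and the toy 3-dimensional vectors are hypothetical and only for illustration.

```python
import numpy as np

def average_embedding(tokens, lookup, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    if not vectors:
        return np.zeros(dim, dtype='f')
    return np.mean(np.array(vectors, dtype='f'), axis=0)

# Toy 3-dimensional embeddings, invented for illustration only.
toy_lookup = {
    'prime': np.array([1.0, 0.0, 2.0]),
    'number': np.array([3.0, 2.0, 0.0]),
}

known = average_embedding(['prime', 'number', 'unknown'], toy_lookup, dim=3)
empty = average_embedding([], toy_lookup, dim=3)
print(known)
print(empty)
```

Here `lookup` is a plain dictionary. With a trained model, an object that supports `in` and indexing by word, such as the model's `.wv` attribute, should be usable in its place.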

#### Split up testing and training sets:

In [None]:
test_size = len(inputTitles)//5
train_size = len(inputTitles)-test_size
print('Testing set size: '+str(test_size),'|','Training set size: '+str(train_size),'|','Total size: '+str(test_size+train_size))

In [None]:
#create the X test and training matrices for the article titles
temp = np.array(vectorizedTitles)
X_title_test,X_title_train = temp[train_size:],temp[:train_size]
#create the X test and training matrices for the article abstracts
temp = np.array(vectorizedAbstracts)
X_abstract_test,X_abstract_train = temp[train_size:],temp[:train_size]

#create the Y test and training arrays for the article labels (list of "np.array columns")
Y_train,Y_test = [None]*len(labelColumns),[None]*len(labelColumns)
for colNumber in range(len(labelColumns)):
    temp = np.array(labelColumns[colNumber])
    Y_test[colNumber],Y_train[colNumber]  = temp[train_size:],temp[:train_size]
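
The split above takes the last fifth of the rows in file order as the test set. If the file happens to be ordered in some way, a shuffled split may be safer. The sketch below shows one way to do this with a fixed seed; `shuffled_split` is a hypothetical helper and the toy arrays are made up.

```python
import numpy as np

def shuffled_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle rows with a fixed seed, then hold out test_fraction of them."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    test_size = int(len(X) * test_fraction)
    train_idx = order[:len(X) - test_size]
    test_idx = order[len(X) - test_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows with 2 features each, and one 0/1 label per row.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
X_tr, X_te, y_tr, y_te = shuffled_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)
```

With several label columns, as in this notebook, the same index arrays would need to be applied to every column so that features and labels stay aligned.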

#### Create random forest classifiers for each label:

In [None]:
print('TITLES:')
title_classifiers = [None]*len(Y_train)
for colNumber in range(len(Y_train)):
    temp = RandomForestClassifier(max_depth=6,n_estimators=10)
    temp.fit(X_title_train, Y_train[colNumber])
    title_classifiers[colNumber] = temp
    print(colNumber,labels[colNumber])
    print('Training accuracy:',np.sum(temp.predict(X_title_train)==Y_train[colNumber])/len(X_title_train))
    print('Testing accuracy:',np.sum(temp.predict(X_title_test)==Y_test[colNumber])/len(X_title_test))
    print()

print('ABSTRACTS:')
abstract_classifiers = [None]*len(Y_train)
for colNumber in range(len(Y_train)):
    temp = RandomForestClassifier(max_depth=6,n_estimators=10)
    temp.fit(X_abstract_train, Y_train[colNumber])
    abstract_classifiers[colNumber] = temp
    print(colNumber,labels[colNumber])
    print('Training accuracy:',np.sum(temp.predict(X_abstract_train)==Y_train[colNumber])/len(X_abstract_train))
    print('Testing accuracy:',np.sum(temp.predict(X_abstract_test)==Y_test[colNumber])/len(X_abstract_test))
    print()
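
Accuracy alone can look high for a label that is rare, because a classifier that never predicts the label is still right most of the time. The sketch below illustrates this with made-up labels and a hypothetical `precision_recall` helper. It makes no claim about the class balance of this data set.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    pred_pos = np.sum(y_pred == 1)
    actual_pos = np.sum(y_true == 1)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return float(precision), float(recall)

# Made-up labels: only 2 of 10 papers carry the label.
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
all_zero = np.zeros(10, dtype=int)  # a classifier that never predicts the label

accuracy = float(np.mean(all_zero == y_true))
scores = precision_recall(y_true, all_zero)
print(accuracy, scores)
```

Here the accuracy is 80% even though the classifier finds none of the positive papers, so precision and recall per label may be worth printing alongside accuracy.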

#### Create classifier functions that evaluate input text on all six labels:
One function is for the title, and one is for the abstract.

In [None]:
def title_classifier(title):
    global title_classifiers
    tokenTitle = gensim.utils.simple_preprocess(title)
    vecTitle = []
    for word in tokenTitle:
        try:
            vecTitle.append(W2V_model_title.wv[word])
        except KeyError:
            pass  # skip words that are not in the vocabulary
    vecTitle = np.mean(np.array(vecTitle, dtype='f'),axis=0)
    preds = [None]*len(title_classifiers)
    for index in range(len(title_classifiers)):
        preds[index] = int(title_classifiers[index].predict(vecTitle.reshape(1, -1))[0])
    return np.array(preds)

def abstract_classifier(abstract):
    global abstract_classifiers
    tokenAbstract = gensim.utils.simple_preprocess(abstract)
    vecAbstract = []
    for word in tokenAbstract:
        try:
            # use the embeddings trained on abstracts, not the title model
            vecAbstract.append(W2V_model_abstract.wv[word])
        except KeyError:
            pass  # skip words that are not in the vocabulary
    vecAbstract = np.mean(np.array(vecAbstract, dtype='f'),axis=0)
    preds = [None]*len(abstract_classifiers)
    for index in range(len(abstract_classifiers)):
        preds[index] = int(abstract_classifiers[index].predict(vecAbstract.reshape(1, -1))[0])
    return np.array(preds)

#### Try out the classifier on some made-up article titles:


In [None]:
articleName = "New Methods for KNN with text data"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

In [None]:
articleName = "Pi used in new formula"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

In [None]:
articleName = "New prime number discovered"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

In [None]:
articleName = "New Data distribution used to speed up training"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

## Summary:

While this ensemble classifier does not work perfectly, it does a fairly decent job of classifying the papers correctly. With the way these models have been trained, the title seems to give more of an indication of the field of the paper than the abstract does.  Having thought about this a bit, I think the abstract classifiers might be more accurate if the window of their word embeddings were larger: abstracts have more words, so the relevant context may be longer than 5 words.  Overall, test and training accuracy above 80% for the titles is a good indication that this classification can be done well.  I am sure it is possible to improve the accuracy a bit.  I believe a better method of combining the word embeddings, instead of the simple average used here, might achieve greater accuracy.
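
One possible alternative to the simple average is to weight each word vector by its inverse document frequency (IDF), so that words appearing in almost every text contribute less. The sketch below illustrates the idea only; it was not tried in this notebook. The helper names, toy documents, and 2-dimensional vectors are all hypothetical.

```python
import math
import numpy as np

def inverse_document_frequency(token_lists):
    """Map each token to log(N / number of documents containing it)."""
    n_docs = len(token_lists)
    doc_counts = {}
    for tokens in token_lists:
        for token in set(tokens):
            doc_counts[token] = doc_counts.get(token, 0) + 1
    return {t: math.log(n_docs / c) for t, c in doc_counts.items()}

def weighted_embedding(tokens, lookup, idf, dim):
    """IDF-weighted average of token vectors; zeros if no weight is usable."""
    pairs = [(lookup[t], idf.get(t, 0.0)) for t in tokens if t in lookup]
    total = sum(w for _, w in pairs)
    if total == 0:
        return np.zeros(dim)
    return sum(v * w for v, w in pairs) / total

# Toy tokenized documents and embeddings, invented for illustration.
docs = [['new', 'prime', 'number'], ['new', 'formula'], ['new', 'data']]
idf = inverse_document_frequency(docs)
lookup = {
    'new': np.array([1.0, 0.0]),
    'prime': np.array([0.0, 1.0]),
    'number': np.array([0.0, 3.0]),
}
vec = weighted_embedding(docs[0], lookup, idf, dim=2)
print(vec)
```

In this toy example the word "new" appears in every document, so its IDF weight is zero and it drops out of the average entirely.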