# Topic Modeling

Topic modeling is an unsupervised learning approach that groups documents by the topics found in their contents. It works much like clustering algorithms such as K-Means and Expectation-Maximization. Because the topics are discovered from the words in each document, every document has to be represented by the distribution of its words, and values are assigned to each document based on that distribution. This greatly increases the amount of data to process, so the document-term matrices are stored in efficient sparse data structures.
I (Rehan) will implement two approaches to topic modeling and compare their results: LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization).

## Latent Dirichlet Allocation (LDA)

LDA, or Latent Dirichlet Allocation, is a probabilistic model. To obtain topic assignments it uses two probability values: P(word | topic) and P(topic | document). Starting from an initial random assignment, these values are recalculated for each word in each document to decide its topic assignment. The procedure is iterative: the probabilities are recalculated repeatedly until the algorithm converges.
LDA is good at identifying coherent topics.
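
As a rough illustration of the two probability tables, the sketch below fits scikit-learn's `LatentDirichletAllocation` on a made-up four-sentence corpus. The names `toy_docs`, `toy_vec`, `toy_lda`, `doc_topic` and `word_topic` are hypothetical and not part of the notebook. It assumes that normalizing each row of `components_` gives P(word | topic), which is how the scikit-learn documentation describes that attribute.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "stars and galaxies form in clusters",
    "galaxies contain stars and gas",
    "neural networks learn from training data",
    "deep neural networks need training data",
]

toy_vec = CountVectorizer(stop_words="english")
toy_counts = toy_vec.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# P(topic | document): one row per document, one column per topic.
doc_topic = toy_lda.fit_transform(toy_counts)

# components_ holds unnormalized topic-word weights;
# dividing each row by its sum gives P(word | topic).
word_topic = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.round(3))
print(word_topic.sum(axis=1))
```

Both tables are row-normalized, so each document's topic probabilities and each topic's word probabilities sum to 1.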


In [1]:
import pandas as pd 

In [5]:
df = pd.read_csv('E:/train.csv')

In [43]:
df.ABSTRACT[0]

"  Predictive models allow subject-specific inference when analyzing disease\nrelated alterations in neuroimaging data. Given a subject's data, inference can\nbe made at two levels: global, i.e. identifiying condition presence for the\nsubject, and local, i.e. detecting condition effect on each individual\nmeasurement extracted from the subject's data. While global inference is widely\nused, local inference, which can be used to form subject-specific effect maps,\nis rarely used because existing models often yield noisy detections composed of\ndispersed isolated islands. In this article, we propose a reconstruction\nmethod, named RSM, to improve subject-specific detections of predictive\nmodeling approaches and in particular, binary classifiers. RSM specifically\naims to reduce noise due to sampling error associated with using a finite\nsample of examples to train classifiers. The proposed method is a wrapper-type\nalgorithm that can be used with different binary classifiers in a diagn

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

In [13]:
dt = countvec.fit_transform(df['ABSTRACT'])

In [14]:
dt

<20972x27580 sparse matrix of type '<class 'numpy.int64'>'
	with 1349506 stored elements in Compressed Sparse Row format>
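
To see why a sparse format matters here, the sketch below works out the density of the matrix reported above and builds a small CSR matrix by hand. The toy array `dense` is made up for illustration; only the three numbers taken from the repr come from the notebook.

```python
import numpy as np
from scipy import sparse

# Shape and stored-element count copied from the repr above.
rows, cols, stored = 20972, 27580, 1349506
density = stored / (rows * cols)
print(f"fraction of non-zero cells: {density:.4%}")

# Hypothetical toy document-term matrix: 4 documents, 6 vocabulary terms.
dense = np.array([
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 4],
    [0, 0, 0, 0, 0, 0],
])

# CSR keeps only the non-zero counts plus their column indices and row offsets.
csr = sparse.csr_matrix(dense)
print(csr.nnz, "of", dense.size, "cells stored")
```

Well under one percent of the cells in the real matrix are non-zero, so a dense array would spend almost all of its memory on zeros.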

In [15]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=7, random_state=42)

In [16]:
lda.fit(dt)

LatentDirichletAllocation(n_components=7, random_state=42)

In [19]:
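# Grab the vocabulary of words
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)

len(countvec.get_feature_names_out())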

27580

In [21]:
len(lda.components_)

7

In [23]:
(lda.components_).shape

(7, 27580)

In [24]:
lda.components_

array([[ 0.14442402, 50.54828949,  0.14314037, ...,  3.14264268,
         0.14285718,  0.14285718],
       [14.74120455, 18.51109232, 12.30203299, ...,  0.14285715,
         0.14285716,  0.14285716],
       [ 2.78220809,  0.14306021,  4.98339272, ...,  0.14294142,
         0.14321377,  4.03954514],
       ...,
       [ 0.14308457,  0.14301463,  0.14285931, ...,  0.14298731,
         0.14384837,  0.24606986],
       [ 0.14333558,  0.40720527,  0.14285716, ...,  0.14285715,
         2.14095154,  0.14295636],
       [ 8.90256799, 57.02961118,  0.14286031, ...,  0.14285715,
         0.14305125,  0.14285715]])

In [25]:
# Grab the first topic
singletopic = lda.components_[0]

In [26]:
# argsort returns the indices that would sort the array in ascending order:
singletopic.argsort()

array([23951,  1381,  3649, ...,  6554, 23536, 15181], dtype=int64)

In [33]:
import numpy
arr = numpy.array([12,1,199])

In [34]:
arr.argsort()

array([1, 0, 2], dtype=int64)
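
The same idiom can be used to pick the highest-weighted entries and map them back to labels. The sketch below uses made-up `weights` and `vocab` arrays (both hypothetical) to show the slicing step on its own.

```python
import numpy as np

# Hypothetical topic-word weights and a matching hypothetical vocabulary.
weights = np.array([0.2, 5.0, 1.3, 0.1, 3.7])
vocab = np.array(["alpha", "beta", "gamma", "delta", "epsilon"])

order = weights.argsort()   # indices from the smallest weight to the largest
top3 = order[-3:][::-1]     # last three indices, reversed so the largest comes first

print(vocab[top3])
```

Reversing the slice is optional; without it the top entries are listed from smallest to largest, which is the order used in the cells of this notebook.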

In [35]:
# argsort orders index positions from the smallest value to the greatest,
# so the last 10 positions point to the 10 greatest values

singletopic.argsort()[-10:]

array([   70,  4882, 17302, 10486, 23625, 10485, 23547,  6554, 23536,
       15181], dtype=int64)

In [36]:
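top_ten = singletopic.argsort()[-10:]  # index positions of the top 10 words in the first topic

In [38]:
# Look these indices up in the CountVectorizer vocabulary to get the words:
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
feature_names = countvec.get_feature_names_out()
for index in top_ten:
    print(feature_names[index])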

10
cluster
observations
galaxy
stellar
galaxies
stars
data
star
mass


In [40]:
# Grab the highest-probability words per topic:
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
feature_names = countvec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"Top 15 words in topic {i} are")
    print(feature_names[topic.argsort()[-15:]].tolist())
    print()

Top 15 words in topic 0 are
['observed', 'present', 'planet', 'clusters', 'formation', '10', 'cluster', 'observations', 'galaxy', 'stellar', 'galaxies', 'stars', 'data', 'star', 'mass']

Top 15 words in topic 1 are
['frequency', 'range', 'large', 'optical', 'results', 'magnetic', 'low', 'time', 'temperature', 'phase', 'model', 'using', 'field', 'high', 'energy']

Top 15 words in topic 2 are
['finite', 'space', 'dimensional', 'states', 'non', 'phase', 'prove', 'theory', 'study', 'group', 'mathcal', 'field', 'quantum', 'spin', 'mathbb']

Top 15 words in topic 3 are
['analysis', 'used', 'results', 'learning', 'proposed', 'algorithm', 'systems', 'performance', 'approach', 'information', 'time', 'using', 'paper', 'based', 'data']

Top 15 words in topic 4 are
['model', 'problems', 'distribution', 'new', 'order', 'matrix', 'non', 'algorithm', 'functions', 'linear', 'function', 'results', 'paper', 'method', 'problem']

Top 15 words in topic 5 are
['paper', 's

In [44]:
df.ABSTRACT

0          Predictive models allow subject-specific inf...
1          Rotation invariance and translation invarian...
2          We introduce and develop the notion of spher...
3          The stochastic Landau--Lifshitz--Gilbert (LL...
4          Fourier-transform infra-red (FTIR) spectra o...
                               ...                        
20967      Machine learning is finding increasingly bro...
20968      Polycrystalline diamond coatings have been g...
20969      We present a new approach for identifying si...
20970      The sum of Log-normal variates is encountere...
20971      Recently, optional stopping has been a subje...
Name: ABSTRACT, Length: 20972, dtype: object

In [45]:
results = lda.transform(dt)

In [47]:
results[0]

array([8.01737950e-04, 5.52641515e-02, 8.02141178e-04, 8.00890144e-04,
       5.29845338e-02, 8.01020097e-04, 8.88545525e-01])
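
Each row of `results` is a distribution over the 7 topics. As a quick check, the sketch below copies the values printed above for the first abstract into a hypothetical array `first_doc` and reads off the most likely topic and its probability.

```python
import numpy as np

# Topic distribution of the first abstract, copied from the output above.
first_doc = np.array([8.01737950e-04, 5.52641515e-02, 8.02141178e-04, 8.00890144e-04,
                      5.29845338e-02, 8.01020097e-04, 8.88545525e-01])

total = first_doc.sum()               # the probabilities should add up to about 1
best_topic = int(first_doc.argmax())  # index of the most likely topic
confidence = float(first_doc.max())   # probability mass on that topic

print(f"sum={total:.6f}, best topic={best_topic}, probability={confidence:.3f}")
```

Taking the argmax discards the rest of the distribution, so abstracts whose largest probability is small are assigned to a topic just as firmly as clear-cut ones.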

In [48]:
df['topic'] = results.argmax(axis=1)

In [49]:
df.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,topic
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0,6
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0,6
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0,2
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0,4
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0,6


## Non-Negative Matrix Factorization (NMF)
Non-negative Matrix Factorization is a linear-algebraic model that factors high-dimensional vectors into a low-dimensional representation. Like Principal Component Analysis (PCA) it reduces dimensionality, but NMF also takes advantage of the fact that the input vectors are non-negative: when it factors them into the lower-dimensional form, it forces the coefficients to be non-negative as well.
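
A minimal sketch of the factorization on a made-up matrix may help. Here `V`, `W`, `H` and `toy_nmf` are hypothetical names, and the `init` and `max_iter` settings are illustrative choices rather than the ones used in this notebook. The idea is that a non-negative matrix `V` is approximated by the product of two smaller non-negative matrices.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix: 4 documents x 5 terms.
V = np.array([
    [3.0, 2.0, 0.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 3.0, 0.0],
    [0.0, 1.0, 3.0, 4.0, 0.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 5)

# Both factors stay non-negative, and W @ H approximates V.
print(bool((W >= 0).all() and (H >= 0).all()))
print(np.round(W @ H, 1))
```

In the topic-modeling setting, the rows of `H` play the role of topics and the rows of `W` say how strongly each document expresses each topic.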

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [51]:
dfm = tfidf.fit_transform(df['ABSTRACT'])

In [52]:
dfm

<20972x27580 sparse matrix of type '<class 'numpy.float64'>'
	with 1349506 stored elements in Compressed Sparse Row format>

In [53]:
from sklearn.decomposition import NMF

In [54]:
nmf_model = NMF(n_components=7, random_state=42)

In [55]:
nmf_model.fit(dfm)

NMF(n_components=7, random_state=42)

In [58]:
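# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
    print(f'Top 15 words for topic {index}')
    print(tfidf_names[topic.argsort()[-15:]].tolist())
    print('\n')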

Top 15 words for topic 0
['proposed', 'control', 'solution', 'gradient', 'time', 'convergence', 'stochastic', 'convex', 'method', 'optimal', 'algorithms', 'problems', 'optimization', 'problem', 'algorithm']


Top 15 words for topic 1
['theory', 'solutions', 'theorem', 'algebra', 'type', 'equation', 'functions', 'spaces', 'groups', 'finite', 'space', 'prove', 'group', 'mathcal', 'mathbb']


Top 15 words for topic 2
['surface', 'mass', 'systems', 'electron', 'density', 'state', 'transition', 'states', 'temperature', 'energy', 'phase', 'field', 'quantum', 'magnetic', 'spin']


Top 15 words for topic 3
['image', 'machine', 'models', 'adversarial', 'convolutional', 'trained', 'task', 'classification', 'tasks', 'training', 'deep', 'networks', 'network', 'neural', 'learning']


Top 15 words for topic 4
['user', 'users', 'used', 'different', 'paper', 'methods', 'method', 'real', 'using', 'time', 'approach', 'analysis', 'based', 'information', 'data']


Top 15 words for topic 5
['algorithm', 'd

In [59]:
topic_results = nmf_model.transform(dfm)

In [61]:
topic_results[0].argmax()

4

In [62]:
topic_results.argmax(axis=1)

array([4, 3, 1, ..., 3, 6, 6], dtype=int64)

In [63]:
df['Topic'] = topic_results.argmax(axis=1)

In [65]:
# Note: 'topic' (lowercase) holds the LDA assignment; 'Topic' holds the NMF assignment.
topic_dict = {0: 'Problem Solving', 1: 'Equation Algebra', 2: 'Physics', 3: 'Neural Network', 4: 'Data Analysis', 5: 'Graph Structure', 6: 'Algorithm details'}
df['Topic_Label'] = df['Topic'].map(topic_dict)

In [66]:
df.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,topic,Topic,Topic_Label
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0,6,4,Data Analysis
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0,6,3,Neural Network
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0,2,1,Equation Algebra
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0,4,1,Equation Algebra
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0,6,4,Data Analysis


In [67]:
Dataset = df.drop(columns=['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance'])

In [69]:
Dataset.head()

Unnamed: 0,ID,TITLE,ABSTRACT,topic,Topic,Topic_Label
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,6,4,Data Analysis
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,6,3,Neural Network
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,2,1,Equation Algebra
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,4,1,Equation Algebra
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,6,4,Data Analysis
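
One possible way to compare the two models is a cross-tabulation of the LDA column `topic` against the NMF column `Topic`. The sketch below does this for only the five rows shown above; the `sample` frame is a hypothetical stand-in for `Dataset`. Topic ids are arbitrary in both models, so only the co-occurrence pattern is meaningful, not whether the numbers match.

```python
import pandas as pd

# Stand-in for Dataset: LDA ('topic') and NMF ('Topic') assignments of the five rows shown above.
sample = pd.DataFrame({
    "topic": [6, 6, 2, 4, 6],
    "Topic": [4, 3, 1, 1, 4],
})

# Rows: LDA topic ids, columns: NMF topic ids, cells: number of abstracts.
agreement = pd.crosstab(sample["topic"], sample["Topic"])
print(agreement)
```

Applied to the full data, `pd.crosstab(Dataset['topic'], Dataset['Topic'])` would produce the same kind of table for all 20972 abstracts. Rows dominated by a single column would suggest that the two models found a similar topic.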
