
## **Importing Data and Libraries**

All the necessary libraries and functions are imported first. Any further libraries or functions needed later are imported separately.

NumPy lets us work with arrays, while pandas lets us work with dataframes and read our data. Pickle lets us serialize and de-serialize Python objects using binary protocols. Re and spaCy support data cleansing. Gensim lets us work with topic modelling, document indexing, and similarity retrieval problems.

In [None]:
import numpy as np
import pandas as pd

import pickle as pkl
import re
import spacy

import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LdaMulticore, CoherenceModel

The data set is read into a dataframe with read_csv, which reads a CSV (Comma-Separated Values) file. A glimpse of the data is shown using head(). If the number of rows is not specified, the first five rows are shown.

In [None]:
no3 = pd.read_csv('/content/data_3D.csv')
no3.head()

Unnamed: 0.1,Unnamed: 0,ABSTRACT
0,300,One of the popular approaches for low-rank t...
1,301,Nous tentons dans cet article de proposer un...
2,302,X-ray computed tomography (CT) using sparse ...
3,303,A singular (or Hermann) foliation on a smoot...
4,304,The Weyl semimetal phase is a recently disco...


We check the number of rows and columns using shape. We then check the number of null values and the data type of each variable using info.

In [None]:
no3.shape

(100, 2)

In [None]:
no3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  100 non-null    int64 
 1   ABSTRACT    100 non-null    object
dtypes: int64(1), object(1)
memory usage: 1.7+ KB


## **Creating User-Defined Functions**

A function to cleanse our data is defined below. All letters are first converted into lowercase. Certain patterns are then either removed or replaced with a white space. The mentioned patterns are:

- re.sub(r'\d+', '', i ): removing one or more digit characters
- re.sub(r'[^\w]', ' ', i): replacing every non-word character (anything other than a letter, digit, or underscore) with a white space
- re.sub(r'https', '', i): removing https
- re.sub(r'com', '', i): removing com
- re.sub(r'((?<=^)|(?<= )).((?=$)|(?= ))', '', i): removing single character words
- re.sub(r'\s+', ' ', i): replacing one or more white spaces with a single white space
- re.sub(r'\s$', '', i): removing a white space at the end of the string

In [None]:
def cleansing(df):
    df_clean = df.str.lower()
    df_clean = [re.sub(r'\d+', '', i ) for i in df_clean]
    df_clean = [re.sub(r'[^\w]', ' ', i) for i in df_clean]
    df_clean = [re.sub(r'https', '', i) for i in df_clean]
    df_clean = [re.sub(r'com', '', i) for i in df_clean]
    df_clean = [re.sub(r'((?<=^)|(?<= )).((?=$)|(?= ))', '', i) for i in df_clean]
    df_clean = [re.sub(r'\s+', ' ', i) for i in df_clean]
    df_clean = [re.sub(r'\s$', '', i) for i in df_clean]
    return df_clean
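To see what these substitutions do in practice, the sketch below applies the same patterns, in the same order, to a single string. The helper name cleanse_one and the sample sentence are assumptions made for illustration; they are not part of the notebook.

```python
import re

def cleanse_one(text):
    """Apply the same substitutions as cleansing() to a single string."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)        # drop digits
    text = re.sub(r'[^\w]', ' ', text)     # non-word characters become spaces
    text = re.sub(r'https', '', text)      # drop the literal "https"
    text = re.sub(r'com', '', text)        # drop the literal "com", wherever it occurs
    text = re.sub(r'((?<=^)|(?<= )).((?=$)|(?= ))', '', text)  # drop single characters
    text = re.sub(r'\s+', ' ', text)       # collapse runs of white space
    text = re.sub(r'\s$', '', text)        # drop one trailing white space
    return text

sample = "Visit https://example.com: 3 compact models!"
print(cleanse_one(sample))  # visit example pact models
```

Note that the pattern for com is not anchored to word boundaries, so it also shortens words such as "compact" to "pact".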

A function to tokenize our data is defined below. It also removes the stopwords in gensim's built-in stopword list. Every word in our data that is not a stopword and is longer than 3 characters is added to a list.

In [None]:
def preprocess(text):
    result = []
    for token in simple_preprocess(text) :
        if token not in STOPWORDS and len(token)>3:
            result.append(token)
    return result
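The filtering rule can be illustrated without gensim. The sketch below is only an approximation: it tokenizes with a simple regular expression instead of simple_preprocess, and SMALL_STOPWORDS is a hypothetical stand-in for gensim's much larger STOPWORDS set.

```python
import re

# Hypothetical stand-in for gensim's STOPWORDS; the real set is far larger.
SMALL_STOPWORDS = {"the", "this", "that", "with", "from", "and", "are"}

def preprocess_sketch(text):
    """Keep lowercase alphabetic tokens that are not stopwords and are longer than 3 characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in SMALL_STOPWORDS and len(t) > 3]

print(preprocess_sketch("This model learns topics from the abstracts with LDA"))
# ['model', 'learns', 'topics', 'abstracts']
```

The token "lda" is dropped by the length rule, not by the stopword list, which shows that short but meaningful abbreviations are lost with this threshold.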

A function to build a dictionary based on our data is defined below. It filters out tokens that appear in fewer than 3 documents (no_below) or in more than 30% of all documents (no_above). It then keeps only the 30,000 most frequent of the remaining tokens (keep_n).

In [None]:
def build_dic(text):
    dictionary = Dictionary(text)
    dictionary.filter_extremes(no_below=3, no_above=0.3, keep_n= 30000)
    return dictionary
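Both thresholds are based on document frequency, that is, the number of documents a token appears in. The following sketch reproduces that rule in plain Python on a toy corpus; the function name and the toy documents are assumptions for illustration, and keep_n is left out for brevity.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above):
    """Keep tokens found in at least no_below documents and in at most a no_above share of them."""
    # set(doc) counts each token once per document, giving document frequency.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

docs = [
    ["quantum", "energy", "model"],
    ["quantum", "model", "data"],
    ["neural", "model", "data"],
    ["neural", "model", "network"],
]
print(filter_extremes_sketch(docs, no_below=2, no_above=0.5))
# ['data', 'neural', 'quantum']
```

Here "model" is removed because it occurs in every document, while "energy" and "network" are removed because they occur in only one.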

A function to build a Bag-of-Words (BOW) corpus is defined below. For each document, it counts the occurrences of each unique word, converts the word to its integer word ID, and returns the result as a sparse vector of (word ID, count) pairs.

In [None]:
def build_vec(text, dictionary):
    bow_corpus = [dictionary.doc2bow(doc) for doc in text]
    return bow_corpus
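As a rough illustration of the sparse vector format, the sketch below builds (word ID, count) pairs by hand. The token2id mapping is a hypothetical three-word dictionary; tokens that are not in it are ignored, which is also what happens to words filtered out of the real dictionary.

```python
from collections import Counter

def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from token2id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical dictionary, for illustration only.
token2id = {"data": 0, "model": 1, "quantum": 2}
print(doc2bow_sketch(["model", "quantum", "model", "energy"], token2id))
# [(1, 2), (2, 1)]
```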

A function to transform a word-document co-occurrence matrix into a locally/globally weighted TF-IDF matrix is defined below. It uses the previously defined corpus.

In [None]:
def vector_tfidf(bow_corpus):
    tfidf = TfidfModel(bow_corpus)
    corpus_tfidf = tfidf[bow_corpus]
    return corpus_tfidf
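The weighting itself can be sketched in a few lines. The version below assumes the scheme term frequency × log2(N / document frequency) followed by scaling each document vector to unit length, which is how the gensim documentation describes the TfidfModel defaults; it is an illustration, not gensim's implementation.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalise per document."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    result = []
    for doc in bow_corpus:
        weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens that occur in every document get weight 0 and are dropped.
        result.append([(tid, w / norm) for tid, w in weights if w > 0] if norm > 0 else [])
    return result

toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
print(tfidf_sketch(toy_corpus))
# [[(1, 1.0)], [(2, 1.0)]]
```

Token 0 appears in both toy documents, so its weight is log2(2/2) = 0 and it disappears from the output.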

The Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning method which infers possible topics from the words in the documents using a Bayesian network. It uses a generative probabilistic model and Dirichlet distributions to do so. A function which applies LDA with gensim's multicore implementation, here with two worker processes (workers=2), is defined below.

In [None]:
def model_lda(dictionary, bow_corpus, num_topic, alpha, eta):
    lda_model = LdaMulticore(bow_corpus, num_topics=num_topic, id2word=dictionary, passes=10, workers=2, alpha=alpha, eta=eta)
    return lda_model
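The generative model that LDA assumes can be sketched as follows: every topic has a word distribution drawn from a Dirichlet prior (eta), every document has a topic mixture drawn from another Dirichlet prior (alpha), and each word is produced by first picking a topic and then picking a word from it. The code below only simulates this forward process with the standard library; fitting a model means inferring these distributions from observed text, which is what LdaMulticore does. The vocabulary and function names are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalising independent gamma samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, num_topics, alpha, eta, length, seed=0):
    rng = random.Random(seed)
    # One word distribution per topic, drawn with prior eta.
    topic_word = [sample_dirichlet(eta, len(vocab), rng) for _ in range(num_topics)]
    # One topic mixture for this document, drawn with prior alpha.
    doc_topic = sample_dirichlet(alpha, num_topics, rng)
    words = []
    for _ in range(length):
        topic = rng.choices(range(num_topics), weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return doc_topic, words

# Hypothetical vocabulary, for illustration only.
vocab = ["learning", "neural", "quantum", "energy", "surface", "data"]
doc_topic, words = generate_document(vocab, num_topics=4, alpha=0.7, eta=0.7, length=8)
print(doc_topic)
print(words)
```

Priors below 1, such as the 0.7 used here, tend to produce mixtures concentrated on fewer topics or words than a prior of 1 or more would.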

A function to compute the coherence score of our topic model is defined below. Coherence score is a metric which measures how interpretable the topics are to humans and how similar the words are to each other.

In [None]:
def score_perf(lda_model, text, dictionary):
    coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    return coherence_lda

## **Data Preprocessing**

Rows 1 to 69 of our data are allocated for the training set; because iloc[1:70] starts at position 1, the first row is skipped and the set holds 69 abstracts rather than 70. The previously defined functions are applied to the training data: it is cleansed, tokenized, and stripped of stopwords. The result is shown below.

In [None]:
x_train = no3.iloc[1:70]
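iloc slices by position with the same half-open convention as Python lists: the start is included and the stop is excluded. The sketch below uses a plain list as a stand-in for the 100 row positions of the dataframe to show how many rows a given slice selects.

```python
rows = list(range(100))  # stand-in for the row positions 0..99

selected = rows[1:70]
print(selected[0], selected[-1], len(selected))  # 1 69 69

from_start = rows[:70]
print(from_start[0], from_start[-1], len(from_start))  # 0 69 70
```

A slice that should cover the first 70 rows therefore starts at 0 (or omits the start), and a slice for the remaining 30 rows would start at 70.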

In [None]:
clean_train = cleansing(x_train['ABSTRACT'])

In [None]:
process_train = []
for i in range(len(clean_train)):
    process_train.append(preprocess(clean_train[i]))
process_train

[['nous',
  'tentons',
  'dans',
  'article',
  'proposer',
  'thèse',
  'cohérente',
  'concernant',
  'formation',
  'notion',
  'involution',
  'dans',
  'brouillon',
  'project',
  'desargues',
  'pour',
  'cela',
  'nous',
  'donnons',
  'analyse',
  'détaillée',
  'premières',
  'pages',
  'dudit',
  'brouillon',
  'prenant',
  'développements',
  'particuliers',
  'aident',
  'prendre',
  'intention',
  'desargues',
  'nous',
  'mettons',
  'cette',
  'analyse',
  'regard',
  'lecture',
  'fait',
  'jean',
  'beaugrand',
  'trouve',
  'dans',
  'advis',
  'charitables',
  'purpose',
  'article',
  'propose',
  'coherent',
  'thesis',
  'girard',
  'desargues',
  'arrived',
  'notion',
  'involution',
  'brouillon',
  'project',
  'purpose',
  'detailed',
  'analysis',
  'pages',
  'brouillon',
  'including',
  'developments',
  'particular',
  'cases',
  'help',
  'understand',
  'goal',
  'desargues',
  'clarify',
  'links',
  'notion',
  'involution',
  'harmonic',
  'division

The dictionary is built first. It is then used to build the BOW corpus. The corpus is transformed into TF-IDF vectors.

In [None]:
dictionary = build_dic(process_train)
bow_corpus = build_vec(process_train, dictionary)
corpus_tfidf = vector_tfidf(bow_corpus)

The LDA model is trained on the TF-IDF corpus with 4 topics, using 0.7 as both the prior on the document-topic distribution (alpha) and the prior on the topic-word distribution (eta). The coherence score is then shown.

In [None]:
lda_model = model_lda(dictionary, corpus_tfidf, 4, 0.7, 0.7)
coherence_score = score_perf(lda_model, process_train, dictionary)
coherence_score

0.4713491051050879

Each topic and its representative words are shown below.

In [None]:
for topic, words in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(topic, words))
    print("\n")

Topic: 0 
Words: 0.006*"learning" + 0.005*"deep" + 0.004*"data" + 0.004*"neural" + 0.004*"network" + 0.004*"networks" + 0.004*"models" + 0.004*"directly" + 0.003*"model" + 0.003*"sparse"


Topic: 1 
Words: 0.004*"data" + 0.003*"type" + 0.003*"quantum" + 0.003*"model" + 0.003*"energy" + 0.003*"case" + 0.003*"finite" + 0.003*"learning" + 0.003*"operator" + 0.003*"magnetic"


Topic: 2 
Words: 0.003*"data" + 0.003*"type" + 0.003*"model" + 0.003*"energy" + 0.003*"surface" + 0.003*"quantum" + 0.003*"curves" + 0.003*"finite" + 0.003*"case" + 0.003*"learning"


Topic: 3 
Words: 0.003*"type" + 0.003*"data" + 0.003*"energy" + 0.003*"case" + 0.003*"operator" + 0.003*"learning" + 0.003*"model" + 0.003*"paper" + 0.003*"study" + 0.003*"quantum"




Rows 71 to 99 of our data are allocated for the testing set; because iloc[71:100] starts at position 71, row 70 belongs to neither set and the testing set holds 29 abstracts rather than 30. The previously defined functions are applied to the testing data: it is cleansed, tokenized, and stripped of stopwords.

In [None]:
x_test = no3.iloc[71:100]

In [None]:
clean_test = cleansing(x_test['ABSTRACT'])

In [None]:
process_test = []
for i in range(len(clean_test)):
    process_test.append(preprocess(clean_test[i]))

The BOW corpus is also built using the previously defined dictionary, but instead of using the training data, this one uses the testing data. The corpus is transformed into TF-IDF vectors.

In [None]:
bow_corpus_test = build_vec(process_test, dictionary)
corpus_tfidf_test = vector_tfidf(bow_corpus_test)

All cleansed testing data is stored in 'text'. The topic with the highest score is stored in 'topic1'. Its score is stored in 'score1'. The topic with the second highest score is stored in 'topic2'. Its score is stored in 'score2'.

In [None]:
test_shape = len(clean_test)
text = []
topic1 = []
score1 = []
topic2 = []
score2 = []
for doc in range(test_shape):
    # Infer the topic distribution once per document: each lda_model[...] call
    # re-runs inference and can return slightly different scores and rankings.
    ranked = sorted(lda_model[bow_corpus_test[doc]], key=lambda tup: -1*tup[1])
    text.append(clean_test[doc])
    topic1.append(ranked[0][0])
    score1.append(ranked[0][1])
    topic2.append(ranked[1][0])
    score2.append(ranked[1][1])
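The ranking step works on a list of (topic ID, score) pairs, which is the form in which the model returns a document's topic distribution. The sketch below uses hypothetical scores to show how the two highest-scoring topics are picked from a single sorted list.

```python
def top_two(topic_scores):
    """Return the two (topic_id, score) pairs with the highest scores."""
    ranked = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)
    return ranked[0], ranked[1]

# Hypothetical topic distribution for one document.
scores = [(0, 0.41), (1, 0.21), (2, 0.18), (3, 0.20)]
first, second = top_two(scores)
print(first, second)  # (0, 0.41) (1, 0.21)
```

Sorting one list and reading both entries from it guarantees that the first and second topics are different, which is not guaranteed when the distribution is recomputed for every lookup.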

The mentioned variables are converted into dataframe columns and combined into a single dataframe. reset_index(drop=True) gives each of them the default integer index beginning at 0, so the columns line up row by row when they are concatenated.

In [None]:
text = pd.DataFrame(text, columns=['Text']).reset_index(drop=True)
topic1 = pd.DataFrame(topic1, columns=['Topic 1']).reset_index(drop=True)
score1 = pd.DataFrame(score1, columns=['Score 1']).reset_index(drop=True)
topic2 = pd.DataFrame(topic2, columns=['Topic 2']).reset_index(drop=True)
score2 = pd.DataFrame(score2, columns=['Score 2']).reset_index(drop=True)
test_result = pd.concat([text, topic1, score1, topic2, score2], axis=1)
test_result.head()

Unnamed: 0,Text,Topic 1,Score 1,Topic 2,Score 2
0,this paper discusses metropolis hastings algo...,0,0.407849,1,0.212673
1,luke lee is tan chin tuan centennial professo...,3,0.272429,1,0.255043
2,topology has appeared in different physical c...,0,0.3675,1,0.211295
3,we study the scale and tidy subgroups of an e...,1,0.271495,1,0.262018
4,the coupled exciton vibrational dynamics of t...,0,0.48308,2,0.19024


Each topic score and its representative words of the first sample are shown below.

In [None]:
# Each tuple is (topic ID, score); the words come from print_topic.
for topic_id, score in sorted(lda_model[bow_corpus_test[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nWords: {}".format(score, lda_model.print_topic(topic_id, 10)))


Score: 0.4077267646789551	 
Words: 0.006*"learning" + 0.005*"deep" + 0.004*"data" + 0.004*"neural" + 0.004*"network" + 0.004*"networks" + 0.004*"models" + 0.004*"directly" + 0.003*"model" + 0.003*"sparse"

Score: 0.21268802881240845	 
Words: 0.004*"data" + 0.003*"type" + 0.003*"quantum" + 0.003*"model" + 0.003*"energy" + 0.003*"case" + 0.003*"finite" + 0.003*"learning" + 0.003*"operator" + 0.003*"magnetic"

Score: 0.1950809210538864	 
Words: 0.003*"type" + 0.003*"data" + 0.003*"energy" + 0.003*"case" + 0.003*"operator" + 0.003*"learning" + 0.003*"model" + 0.003*"paper" + 0.003*"study" + 0.003*"quantum"

Score: 0.18450433015823364	 
Words: 0.003*"data" + 0.003*"type" + 0.003*"model" + 0.003*"energy" + 0.003*"surface" + 0.003*"quantum" + 0.003*"curves" + 0.003*"finite" + 0.003*"case" + 0.003*"learning"


Each topic score and its representative words of the second sample are shown below.

In [None]:
# Each tuple is (topic ID, score); the words come from print_topic.
for topic_id, score in sorted(lda_model[bow_corpus_test[1]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nWords: {}".format(score, lda_model.print_topic(topic_id, 10)))


Score: 0.2718086838722229	 
Words: 0.003*"type" + 0.003*"data" + 0.003*"energy" + 0.003*"case" + 0.003*"operator" + 0.003*"learning" + 0.003*"model" + 0.003*"paper" + 0.003*"study" + 0.003*"quantum"

Score: 0.25593870878219604	 
Words: 0.004*"data" + 0.003*"type" + 0.003*"quantum" + 0.003*"model" + 0.003*"energy" + 0.003*"case" + 0.003*"finite" + 0.003*"learning" + 0.003*"operator" + 0.003*"magnetic"

Score: 0.24049757421016693	 
Words: 0.003*"data" + 0.003*"type" + 0.003*"model" + 0.003*"energy" + 0.003*"surface" + 0.003*"quantum" + 0.003*"curves" + 0.003*"finite" + 0.003*"case" + 0.003*"learning"

Score: 0.23175503313541412	 
Words: 0.006*"learning" + 0.005*"deep" + 0.004*"data" + 0.004*"neural" + 0.004*"network" + 0.004*"networks" + 0.004*"models" + 0.004*"directly" + 0.003*"model" + 0.003*"sparse"


Each topic score and its representative words of the last sample are shown below.

In [None]:
# Each tuple is (topic ID, score); the words come from print_topic.
for topic_id, score in sorted(lda_model[bow_corpus_test[-1]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nWords: {}".format(score, lda_model.print_topic(topic_id, 10)))


Score: 0.3414439558982849	 
Words: 0.006*"learning" + 0.005*"deep" + 0.004*"data" + 0.004*"neural" + 0.004*"network" + 0.004*"networks" + 0.004*"models" + 0.004*"directly" + 0.003*"model" + 0.003*"sparse"

Score: 0.2275521159172058	 
Words: 0.004*"data" + 0.003*"type" + 0.003*"quantum" + 0.003*"model" + 0.003*"energy" + 0.003*"case" + 0.003*"finite" + 0.003*"learning" + 0.003*"operator" + 0.003*"magnetic"

Score: 0.21994377672672272	 
Words: 0.003*"data" + 0.003*"type" + 0.003*"model" + 0.003*"energy" + 0.003*"surface" + 0.003*"quantum" + 0.003*"curves" + 0.003*"finite" + 0.003*"case" + 0.003*"learning"

Score: 0.21106018126010895	 
Words: 0.003*"type" + 0.003*"data" + 0.003*"energy" + 0.003*"case" + 0.003*"operator" + 0.003*"learning" + 0.003*"model" + 0.003*"paper" + 0.003*"study" + 0.003*"quantum"


## **Conclusion**

A low coherence score indicates that the topics are vague, noisy, or irrelevant, while a high coherence score indicates that the topics are consistent, clear, and relevant. With a coherence score of 0.47, our topic model seems to have some difficulty in producing good topics: Topics 1, 2, and 3 share most of their top words (data, type, model, energy, quantum), and only Topic 0 stands out. This may be due to our data not having the best reliability and accuracy.