# Minapharm ML Engineer Technical Assessment

---


**Candidate Name** : WALEED ELBADRY

**Date** : 14/04/2022

**Email** : wbadry@live.com

---


### Introduction

---

A data repository of _5000 articles_ is provided. The goal is to extract the most important topics and assign them to each article with a probability score.


### Libraries and dependencies

---

The developed notebook ran on Ubuntu _Linux 20.04_

![Linux OS](images/os.png)

Since the dataset is distributed in **rar format**, an extra step is needed to extract it.

It is recommended to install the `unrar` application using the terminal command:

```bash
sudo apt install unrar
```

The application is used as a tool to extract the downloaded `Minapharm dataset`.

You may run the following command in the terminal to install all of the libraries used:

```bash
python -m pip install -U pandas numpy scikit-learn gensim progressbar rarfile nltk spacy matplotlib pyldavis openpyxl
```

Last but not least, install the spaCy English model used for tokenization and lemmatization:

```bash
python -m spacy download en_core_web_sm
```


In [69]:
# imports
# The spaCy model must be installed first (run in terminal or command prompt):
# python -m spacy download en_core_web_sm
import urllib.request
import progressbar
from rarfile import RarFile
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

# Sklearn
import sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('mode.chained_assignment', None)
pd.set_option('display.max_columns', 100)

# Libraries versions
pprint('Libraries:')
pprint('progressbar : {:}'.format(progressbar.__version__))
pprint('pandas : {:}'.format(pd.__version__))
pprint('numpy : {:}'.format(np.__version__))
pprint('scikit-learn : {:}'.format(sklearn.__version__))
pprint('gensim : {:}'.format(gensim.__version__))


'Libraries:'
'progressbar : 2.5'
'pandas : 1.4.2'
'numpy : 1.21.5'
'scikit-learn : 1.0.2'
'gensim : 4.1.2'


In [70]:

class MyProgressBar():
    """Class to display the progress bar during the download of the dataset
    """

    def __init__(self):
        """Initialize the progress bar count
        """
        self.pbar = None

    def __call__(self, block_num, block_size, total_size):
        """Progress bar visualization

        Args:
            block_num (int): Number of blocks transferred so far
            block_size (int): Block size in bytes
            total_size (int): Total size of the file in bytes
        """
        if not self.pbar:
            self.pbar = progressbar.ProgressBar(maxval=total_size)
            self.pbar.start()

        downloaded = block_num * block_size
        if downloaded < total_size:
            self.pbar.update(downloaded)
        else:
            self.pbar.finish()


In [71]:
# Download the dataset from minapharm provided link
url = 'https://www.minapharm.com/gShare/Pubmed5k.rar'
print('Downloading the compressed file, please wait ...')
status = urllib.request.urlretrieve(url, 'data/Pubmed5k.rar', MyProgressBar())
print('Download is completed')


Downloading the compressed file, please wait ...


100% |########################################################################|

Download is completed





In [72]:
# Extract the excel file from the rar file
import rarfile
rarfile.UNRAR_TOOL = 'unrar'
rar_file_path = 'data/Pubmed5k.rar'
with RarFile(rar_file_path) as file:
    file.extract(file.namelist()[0], path='data')

In [73]:
# Read the excel file into dataframe
excel_file_path = 'data/Pubmed5k.xlsx'
df = pd.read_excel(excel_file_path, sheet_name='random 5k', header=0)

### 1. Data wrangling

---

The first step after importing the data is to inspect the **data structure**.


In [74]:
# Some quick insights
print('The dataset has:')
print('{0:5d} records'.format(df.shape[0]))
print('{0:5d} columns'.format(df.shape[1]))
print()


print('The number of empty records in any column:')
print(df.isnull().sum())
print()

print('Number of unique records:')
print(df.nunique())
print()

print('Exploring columns datatypes:')
print(df.dtypes)


The dataset has:
 4999 records
    3 columns

The number of empty records in any column:
ArticleID    0
Title        0
Abstract     0
dtype: int64

Number of unique records:
ArticleID    4999
Title        4999
Abstract     4989
dtype: int64

Exploring columns datatypes:
ArticleID     int64
Title        object
Abstract     object
dtype: object


From the _previous insights_, `ArticleID` and `Title` are unique for all 4999 records, while `Abstract` has only 4989 unique values, so 10 records repeat an abstract that already appears elsewhere.


### 2. Cleaning articles with missing abstracts

---

Each article is stored in 3 columns:

- `ArticleID` : The unique ID under which the article is stored.
- `Title` : The published title or main header of the article.
- `Abstract` : Summary of the article.


In [75]:
# Show the duplicated records
count = df.Abstract.duplicated().sum()
print("There are {:} duplicated records in the abstract".format(count))

df_dup = df[df.Abstract.duplicated()]
df_dup


There are 10 duplicated records in the abstract


|      | ArticleID | Title | Abstract |
| :--: | :-------: | :---- | :------- |
| 2590 | 34669440 | Peptide-based urinary monitoring of fibrotic nonalcoholic steatohepatitis by mass-barcoded activity-based sensors. | [Figure: see text]. |
| 2591 | 34669441 | A rapid assay provides on-site quantification of tetrahydrocannabinol in oral fluid. | [Figure: see text]. |
| 2592 | 34669442 | Fatal enhanced respiratory syncytial virus disease in toddlers. | [Figure: see text]. |
| 2593 | 34669443 | Macrophage migration inhibitory factor drives pathology in a mouse model of spondyloarthritis and is associated with human disease. | [Figure: see text]. |
| 2594 | 34669444 | Development of ICT01, a first-in-class, anti-BTN3A antibody for activating V?9Vd2 T cell-mediated antitumor immune response. | [Figure: see text]. |
| 3872 | 34258891 | Too much of a good thing in ischemic mitral: lessons for surgeons and cardiologists. | No abstract present. |
| 3873 | 34258892 | COVID-19 infection and cardiometabolic complications: short- and long-term treatment and management considerations. | No abstract present. |
| 3874 | 34258893 | Comments on Cardiovascular effects of waterpipe smoking: a systematic review and meta-analysis. | No abstract present. |
| 3875 | 34258894 | A case of COVID-19 infection quickly relieved after nasal instillations and gargles with povidone iodine. | No abstract present. |
| 4757 | 34425679 | Study of anabolic activity of dry extracts of leaves and rhizomes of Iris hungarica on a model of hydrocortisone-induced protein catabolism. | This article presents the results of the study of the anabolic effect of dry extracts of Iris hungarica leaves and rhizomes on the model of hydrocortisoneinduced protein catabolism. Previous studies have established the presence of anabolic activity of dry extracts of Iris hungarica leaves and rhizomes in intact animals. Therefore, it was reasonable to study the effect of the experimental extracts on the state of protein metabolism, which is regulated by glucocorticoids. The model of hydrocortisoneinduced protein catabolism was used to determine anabolic activity for dry extracts of Iris hungarica leaves and rhizomes at a dose of 150 mg/kg by monitoring the recovery of body weight and the increase in the total protein in the cardiac muscle of rats and in muscle tissue homogenate, which is aimed to promote myofibrillar hypertrophy. Dry extract of Iris hungarica rhizomes reduced urea excretion, normalized metabolism, restored nitrogen balance, and inhibited protein catabolism. The ... |


Removing these records should not affect the **topic modeling task**, since the expected information loss is small:

$$
Information \ Loss (\%) = {10 \over 4999} \times 100 \approx 0.2 \%
$$

The candidate assessed this loss to be acceptable. Therefore, the 10 repeated records are removed, keeping the first occurrence of each abstract.
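
As a quick check of the figure above, the following pure-Python sketch mimics keeping the first occurrence of each abstract, which is what `duplicated()` flags by default in pandas, and computes the share of records dropped. The sample list and the helper name `drop_repeated` are illustrative assumptions and not part of the notebook; the counts 10 and 4999 come from the outputs above.

```python
def drop_repeated(values):
    """Keep the first occurrence of each value, preserving order (hypothetical helper)."""
    seen = set()
    kept = []
    for value in values:
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return kept

# Toy sample: two placeholder abstracts are repeated
sample = ["A", "[Figure: see text].", "B", "[Figure: see text].",
          "No abstract present.", "No abstract present."]
kept = drop_repeated(sample)

# Share of records dropped, using the counts reported above
total_records = 4999
dropped_records = 10
loss_percent = dropped_records / total_records * 100

print(kept)
print("Information loss: {:.1f} %".format(loss_percent))
```

Note that the first `[Figure: see text].` and `No abstract present.` placeholders survive this kind of cleaning, since only the repeats are dropped.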


In [76]:
# Remove duplicated records (copy so that later column assignments do not act on a view of df)
df_clean = df[~df.Abstract.duplicated()].copy()

# Show number of records after cleaning it
print('The cleaned dataset has:')
print('{0:5d} records'.format(df_clean.shape[0]))
print('{0:5d} columns'.format(df_clean.shape[1]))


The cleaned dataset has:
 4989 records
    3 columns


The two text columns, `Title` and `Abstract`, contain strings but were read as `object`. We may convert them to the `string` dtype for easier tokenization later.


In [77]:
# Convert the Title and Abstract to string
df_clean['Title'] = df_clean['Title'].astype("string")
df_clean['Abstract'] = df_clean['Abstract'].astype("string")

# Verify the conversion
print('Verifying columns datatypes:')
print(df_clean.dtypes)


Verifying columns datatypes:
ArticleID     int64
Title        string
Abstract     string
dtype: object


### 3. Topic modeling

---


In **Natural Language Processing (NLP)**, `topic modeling` is the task of identifying which topics a given text or article is about.

In other words, it extracts descriptive hidden topics from a large volume of text, also known as a _corpus_. This can be achieved by fitting a statistical model, which can be viewed as:

#### 3.1. _Dimensionality reduction_ :

If we consider a text ${T}$ written in the language vocabulary ${V}$, we may use feature encoding to map the words ${Word_i}$ found in that text to topics ${Topic_i}$ with weights or probabilities ${Weight_i}$. Mathematically, this can be written as:

$${Word_i,T} \in {V} \rightarrow {Topic_i,Weight_i} \in {Topics}$$

#### 3.2. _Unsupervised learning_ :

As in clustering algorithms, where each object is assigned to a cluster centroid by its Euclidean distance, texts or words are mapped to topics with a probability or weight score.

<p align="center">
    <img src="images/topic_modeling.jpeg" alt="Image"/>
</p>


### 4. Algorithms for topic modeling

---

Several algorithms have been proposed over the years for topic modeling. Common algorithms include:

- Latent Dirichlet Allocation (LDA)
- Non Negative Matrix Factorization (NMF)
- Latent Semantic Analysis (LSA)
- Parallel Latent Dirichlet Allocation (PLDA)
- Pachinko Allocation Model (PAM)

Many resources, including Kaggle and Medium posts, suggest using LDA. Therefore, we may start with it.

### 5. Latent Dirichlet Allocation (LDA)

---

Latent Dirichlet Allocation is a statistical graphical model used to uncover the relationships between multiple documents in a corpus. It is commonly fitted with the Variational Expectation Maximization (VEM) algorithm, which approximates the maximum likelihood estimate over the whole corpus of text.

Mathematically :

$${P(topic \ t | document \ d)} \ * {P(word \ w | topic \ t)}$$

where:

${P(topic \ t | document \ d)}$ : Proportion of words in document d that are assigned to topic t.

${P(word \ w | topic \ t)}$ : Proportion of assignments to topic t, across all documents, that come from word w.

As in clustering algorithms, this is an iterative process: topic assignments are updated until the conditional probabilities no longer change between successive iterations.
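
To make the two proportions concrete, here is a small pure-Python sketch over hand-made word-to-topic assignments. The documents, the assignments, and the function name `topic_weight` are hypothetical, and the sketch leaves out the Dirichlet smoothing terms that a real LDA implementation adds, so it should be read as an illustration of the formula rather than as the fitting algorithm.

```python
# Each document is a list of (word, assigned_topic) pairs; all values are made up
docs = {
    "d1": [("drug", 0), ("dose", 0), ("pain", 0), ("heart", 1)],
    "d2": [("heart", 1), ("rhythm", 1), ("drug", 1), ("dose", 0)],
}

def topic_weight(doc_id, word, topic):
    """Return P(topic | document) * P(word | topic) from raw assignment counts."""
    doc = docs[doc_id]
    # Proportion of words in the document that are assigned to the topic
    p_topic_given_doc = sum(1 for _, t in doc if t == topic) / len(doc)
    # Proportion of all assignments to the topic that come from the word
    all_pairs = [pair for d in docs.values() for pair in d]
    topic_total = sum(1 for _, t in all_pairs if t == topic)
    word_in_topic = sum(1 for w, t in all_pairs if w == word and t == topic)
    p_word_given_topic = word_in_topic / topic_total
    return p_topic_given_doc * p_word_given_topic

# Compare both topics for the word "drug" in document d1
weights = {topic: topic_weight("d1", "drug", topic) for topic in (0, 1)}
print(weights)
```

In this toy corpus, topic 0 receives the larger weight for "drug" in `d1`, because most words of `d1` are already assigned to topic 0.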

The expected output is a set of topics, each described as a mixture. For example, in pharmaceuticals:

- Topic 1 : 60% Analgesics , 40% Antacids
- Topic 2 : 80% Antiarrhythmics, 6% Antacids, 4% Antibacterials

Thereafter, each article is assigned to every topic with a probability:

<div align="center">

| Article ID |       Topic 1       |       Topic 2       |    Topic n    |
| :--------: | :-----------------: | :-----------------: | :-----------: |
|  5654562   |         0.6         |         0.3         |      ...      |
|  4564562   |        0.95         |        0.01         |      ...      |
|     M      | ${P_{max}(t\|d)_1}$ | ${P_{max}(t\|d)_2}$ | ${P(t\|d)_n}$ |

</div>


#### 5.1. Abstract cleaning and lemmatization

Since the abstract summarizes the article, we preprocess it first:

- Convert the `Abstract` column into a list of strings
- Using [regular expressions](https://www.dataquest.io/blog/regex-cheatsheet/) , we may :
  - remove `author emails`, `new line` breaks, `single quotes` and `words of three characters or fewer`


In [78]:
# Convert to list (read from the original df, so the duplicated abstracts are still included)
data = df.Abstract.values.tolist()
# Remove emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Collapse new lines and repeated whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]
# Remove words of 1 to 3 characters
data = [re.sub(r'\b\w{1,3}\b', "", sent) for sent in data]

# Example of cleaned abstract (still having some punctuation)
pprint(data[:1])

['Coordination variability ()  commonly analyzed  understand dynamical '
 'qualities  human locomotion.  purpose  this study   develop guidelines   '
 'number  trials required  inform  calculation   stable mean lower limb  '
 'during overground locomotion. Three-dimensional lower limb kinematics were '
 'captured   recreational runners performing  trials each  preferred  fixed '
 'speed walking  running. Stance phase   calculated   segment  joint couplings '
 'using  modified vector coding technique.  number  trials required  achieve   '
 'mean within %   strides average  determined  each coupling  individual.  '
 'statistical outputs  mode (walking  running)  speed (preferred  fixed) were '
 'compared when informed  differing numbers  trials.  minimum   trials were '
 'required  stable mean stance phase . With fewer than  trials,   '
 'underestimated     oversight  significant differences between mode  speed. '
 'Future overground locomotion  research  healthy populations using  vecto
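
To see what each pattern does in isolation, the sketch below applies the same four substitutions, in the same order, to one short string. The sentence and the e-mail address are invented for illustration and do not come from the dataset.

```python
import re

# Hypothetical sentence, not taken from the dataset
text = "Contact: jane.doe@example.org\nThe patient's BMI was low in 12 of 40 cases."

text = re.sub(r'\S*@\S*\s?', '', text)    # remove e-mail addresses
text = re.sub(r'\s+', ' ', text)          # collapse new lines and repeated spaces
text = re.sub(r"\'", "", text)            # remove single quotes
text = re.sub(r'\b\w{1,3}\b', '', text)   # remove words of 1 to 3 characters

remaining = text.split()
print(remaining)
```

Short but meaningful tokens such as `BMI`, `12`, and `40` are removed along with `The` and `was`, which explains the gaps and the empty parentheses `()` visible in the cleaned abstract above.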

The next step is [tokenization](https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/): splitting each cleaned abstract into a list of `words` using the [gensim library](https://radimrehurek.com/gensim/).

In [79]:
def sent_to_words(sentences):
    for sentence in sentences:
        # Tokenize and lowercase; punctuation is dropped and deacc=True strips accents
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))

# Tokenization of the first abstract in the articles dataset
print(data_words[:1])

[['coordination', 'variability', 'commonly', 'analyzed', 'understand', 'dynamical', 'qualities', 'human', 'locomotion', 'purpose', 'this', 'study', 'develop', 'guidelines', 'number', 'trials', 'required', 'inform', 'calculation', 'stable', 'mean', 'lower', 'limb', 'during', 'overground', 'locomotion', 'three', 'dimensional', 'lower', 'limb', 'kinematics', 'were', 'captured', 'recreational', 'runners', 'performing', 'trials', 'each', 'preferred', 'fixed', 'speed', 'walking', 'running', 'stance', 'phase', 'calculated', 'segment', 'joint', 'couplings', 'using', 'modified', 'vector', 'coding', 'technique', 'number', 'trials', 'required', 'achieve', 'mean', 'within', 'strides', 'average', 'determined', 'each', 'coupling', 'individual', 'statistical', 'outputs', 'mode', 'walking', 'running', 'speed', 'preferred', 'fixed', 'were', 'compared', 'when', 'informed', 'differing', 'numbers', 'trials', 'minimum', 'trials', 'were', 'required', 'stable', 'mean', 'stance', 'phase', 'with', 'fewer', 'th

#### 5.2. Lemmatization

Using the [`SpaCy` library](https://spacy.io/), the words are lemmatized. In other words, every word is reduced to its dictionary form (lemma), so that `has` and `had` both become `have`. Only nouns and verbs are kept for topic modeling of the articles.

In [80]:
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokenized abstracts, keeping only the selected parts of speech

    Args:
        texts (list of list of str): tokenized abstracts, one list of words per abstract
        allowed_postags (sequence of str, optional): part-of-speech tags to keep. Defaults to ('NOUN', 'ADJ', 'VERB', 'ADV').

    Returns:
        list of str: one space-joined string of lemmas per abstract
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Load the spaCy English model, disabling the parser and NER components (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# Lemmatize, keeping only nouns and verbs
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'VERB'])
# Example of the words after lemmatization
print(data_lemmatized[:2])

['analyze understand quality locomotion purpose study develop guideline number trial require inform calculation mean limb locomotion limb kinematic capture runner perform trial prefer fix speed walk run stance phase calculate segment coupling use modify vector code technique number trial require achieve average determine couple output mode walk run speed prefer fix compare inform differ number trial trial require stance phase trial underestimate oversight difference mode speed locomotion research population use vector code approach trial researcher aware consequence number trial study finding', 'scenario knee valgus alteration knee lead risk injury weakness musculature abduction extension hext rotation contribute increase landing task focus question decrease strength associate increase landing task athlete summary finding study include randomize control trial cohort study case control study find decrease strength contribute increase landing task study find extensor contribute control a

#### 5.3. Vectorization

[Text Vectorization](https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/) is the process of counting how often each word occurs in each document to build a document-word count matrix. This matrix can be used to compute the conditional probability of a word with respect to a topic or document.
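
The idea can be sketched in pure Python on a toy corpus. This is only an illustration of the kind of matrix a count vectorizer produces, not scikit-learn's implementation; the three documents and the `min_df` value of 2 are assumptions chosen to keep the example small.

```python
from collections import Counter

# Hypothetical lemmatized documents
docs = [
    "trial speed trial runner",
    "knee injury trial",
    "knee strength injury",
]
min_df = 2  # keep words that appear in at least 2 documents

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for doc in docs for word in set(doc.split()))
vocabulary = sorted(word for word, count in doc_freq.items() if count >= min_df)

# One row per document, one column per vocabulary word
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Words that occur in a single document (`speed`, `runner`, `strength`) are dropped, and the remaining columns hold raw counts, so `trial` gets a 2 in the first row.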

In [81]:
vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,                         # minimum number of documents a word must appear in
                             stop_words='english',              # remove stop words
                             lowercase=True,                    # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',   # keep tokens of at least 3 characters
                             max_features=50000,                # maximum number of unique words
)
# Vectorization
data_vectorized = vectorizer.fit_transform(data_lemmatized)

data_vectorized[:1]

<1x2930 sparse matrix of type '<class 'numpy.int64'>'
	with 47 stored elements in Compressed Sparse Row format>

### 6. Hyperparameter Tuning

---


According to [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) of scikit-learn, there are many parameters to tune. However, the most influential hyperparameters are:

- `n_components` : Number of topics.
- `learning_decay` : Controls the learning rate in the online learning method. The value should be in the interval (0.5, 1.0] to guarantee asymptotic convergence.

These parameters can be tuned using `Grid Search`, which evaluates every combination in a grid of candidate values.


In [82]:
# Define Search Param
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
# Init the Model
lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params, verbose=2)
# Do the Grid Search
model.fit(data_vectorized)


Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV] END ................learning_decay=0.5, n_components=10; total time=   6.4s
[CV] END ................learning_decay=0.5, n_components=10; total time=   6.3s
[CV] END ................learning_decay=0.5, n_components=10; total time=   6.3s
[CV] END ................learning_decay=0.5, n_components=10; total time=   6.3s
[CV] END ................learning_decay=0.5, n_components=10; total time=   6.2s
[CV] END ................learning_decay=0.5, n_components=15; total time=   6.1s
[CV] END ................learning_decay=0.5, n_components=15; total time=   6.2s
[CV] END ................learning_decay=0.5, n_components=15; total time=   6.2s
[CV] END ................learning_decay=0.5, n_components=15; total time=   6.3s
[CV] END ................learning_decay=0.5, n_components=15; total time=   6.2s
[CV] END ................learning_decay=0.5, n_components=20; total time=   6.4s
[CV] END ................learning_decay=0.5, n_c

GridSearchCV(estimator=LatentDirichletAllocation(learning_method='online',
                                                 learning_offset=50.0,
                                                 max_iter=5, random_state=0),
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_components': [10, 15, 20, 25, 30]},
             verbose=2)
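
The count of 75 fits reported above follows directly from the size of the grid. The sketch below enumerates the same parameter grid with `itertools.product`; the fold count of 5 is taken from the log line above, and the variable names are illustrative.

```python
from itertools import product

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
n_folds = 5  # number of cross-validation folds, as reported in the log above

# Build every combination of the candidate values
names = sorted(search_params)
candidates = [dict(zip(names, values))
              for values in product(*(search_params[name] for name in names))]

total_fits = len(candidates) * n_folds
print(len(candidates), "candidates")
print(total_fits, "fits in total")
print(candidates[0])
```

Since each candidate costs about 6 seconds per fold in the log above, adding one more value to either list adds a noticeable amount of run time.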

#### 6.1. Model selection

A model with a `higher log-likelihood` and a `lower perplexity`, defined as exp(-1. * log-likelihood per word), is considered to be `good`.
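
The relation between the two scores can be checked with a few lines of Python. The corpus size and the per-word probabilities below are made-up values chosen so that the result is easy to verify by hand: a model that gives every word probability 1/8 has a perplexity of 8.

```python
import math

def perplexity(log_likelihood, n_words):
    """Perplexity as exp(-1 * log-likelihood per word)."""
    return math.exp(-log_likelihood / n_words)

# Hypothetical corpus of 1000 words
n_words = 1000
# Model A gives every word probability 1/8, model B gives every word probability 1/4
log_likelihood_a = n_words * math.log(1 / 8)
log_likelihood_b = n_words * math.log(1 / 4)

perplexity_a = perplexity(log_likelihood_a, n_words)
perplexity_b = perplexity(log_likelihood_b, n_words)
print(round(perplexity_a, 6), round(perplexity_b, 6))
```

Model B has the higher log-likelihood and therefore the lower perplexity, which is why the two criteria point in the same direction.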

In [83]:
# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.5, 'n_components': 10}
Best Log Likelihood Score:  -539653.5520135149
Model Perplexity:  907.480004528889


From the `Grid search` , we may find:
- The best `learning_decay` is `0.5`
- The articles dataset is best described by `10 topics` among the candidate values

### 7. Evaluation of the topic model

---

As `requested` , for each article we select the `3 topics` with the highest probabilities, and display the probabilities of all `10 topics` of the best LDA model.

In [84]:
# Create Document — Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)
# column names
topicnames = ['Topic' + str(i) for i in range(best_lda_model.n_components)]
# index names: one row per article ID
docnames = [str(ID) for ID in df.ArticleID]
# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
# Get the three most probable topics for each document
top3_idx = np.argsort(-df_document_topic.values, axis=1)[:, :3]
dominant_topic = pd.DataFrame(np.array(topicnames)[top3_idx], index=docnames, columns=['1st Max', '2nd Max', '3rd Max'])
df_document_topic = pd.concat([df_document_topic, dominant_topic], axis=1)
# change article ID to be column
df_document_topic.reset_index(inplace=True)
df_document_topic = df_document_topic.rename(columns = {'index':'ArticleID'})
df_document_topic_15 = df_document_topic.head(15)
df_document_topic_15


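|     | ArticleID | Topic0 | Topic1 | Topic2 | Topic3 | Topic4 | Topic5 | Topic6 | Topic7 | Topic8 | Topic9 | 1st Max | 2nd Max | 3rd Max |
| :-: | :-------: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :-----: | :-----: | :-----: |
|  0  | 34153941 | 0.0 | 0.0 | 0.0 | 0.0 | 0.18 | 0.46 | 0.03 | 0.03 | 0.04 | 0.24 | Topic5 | Topic9 | Topic4 |
|  1  | 34153942 | 0.09 | 0.0 | 0.29 | 0.0 | 0.14 | 0.14 | 0.0 | 0.0 | 0.0 | 0.34 | Topic9 | Topic2 | Topic4 |
|  2  | 34153964 | 0.0 | 0.0 | 0.38 | 0.13 | 0.0 | 0.27 | 0.0 | 0.0 | 0.13 | 0.08 | Topic2 | Topic5 | Topic3 |
|  3  | 34153968 | 0.0 | 0.0 | 0.7 | 0.21 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | 0.0 | Topic2 | Topic3 | Topic8 |
|  4  | 34153978 | 0.15 | 0.0 | 0.43 | 0.0 | 0.0 | 0.0 | 0.15 | 0.0 | 0.27 | 0.0 | Topic2 | Topic8 | Topic0 |
|  5  | 34153979 | 0.38 | 0.0 | 0.29 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32 | 0.0 | Topic0 | Topic8 | Topic2 |
|  6  | 34153980 | 0.0 | 0.38 | 0.0 | 0.3 | 0.0 | 0.19 | 0.0 | 0.0 | 0.13 | 0.0 | Topic1 | Topic3 | Topic5 |
|  7  | 34153982 | 0.0 | 0.0 | 0.36 | 0.49 | 0.05 | 0.0 | 0.0 | 0.0 | 0.09 | 0.0 | Topic3 | Topic2 | Topic8 |
|  8  | 34153983 | 0.03 | 0.0 | 0.0 | 0.62 | 0.0 | 0.24 | 0.0 | 0.0 | 0.0 | 0.09 | Topic3 | Topic5 | Topic9 |
|  9  | 34153984 | 0.0 | 0.24 | 0.0 | 0.41 | 0.0 | 0.18 | 0.11 | 0.0 | 0.0 | 0.06 | Topic3 | Topic1 | Topic5 |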
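
The ranking logic can be verified by hand on a single row. The sketch below uses plain Python instead of NumPy and takes the topic probabilities of article 34153941 from the first row of the table above; the variable names are illustrative.

```python
# Topic probabilities of article 34153941, copied from the first row of the table above
probabilities = [0.0, 0.0, 0.0, 0.0, 0.18, 0.46, 0.03, 0.03, 0.04, 0.24]
topicnames = ['Topic' + str(i) for i in range(len(probabilities))]

# Sort the topic indices by decreasing probability and keep the first three
order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
top3 = [topicnames[i] for i in order[:3]]
print(top3)
```

The result matches the `1st Max`, `2nd Max`, and `3rd Max` columns reported for that article.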


In [85]:
# Save the result to article_topic.xlsx
df_document_topic.to_excel('data/article_topic.xlsx',sheet_name='topics for articles')

### 8. Topics insights

---

We may check how many words represent each topic:

In [86]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# Assign Column and Index (get_feature_names() is deprecated since scikit-learn 1.0)
df_topic_keywords.columns = vectorizer.get_feature_names_out()
df_topic_keywords.index = topicnames

pprint("Each topic has:")
pprint( "  {:} words".format(df_topic_keywords.shape[1]))
# View
df_topic_keywords.head(model.best_params_['n_components'])

'Each topic has:'
'  2930 words'




Unnamed: 0,aberration,ability,ablation,abnormality,absence,absorb,absorption,abstract,abundance,abuse,accelerate,acceleration,accept,acceptability,acceptance,access,accessibility,accessory,accident,accompany,accord,accordance,account,accounting,accumulate,accumulation,accuracy,ace,achieve,achievement,acid,acknowledge,acquire,acquisition,act,action,activate,activation,activity,actor,acuity,acute,adapt,adaptability,adaptation,add,addiction,addition,additive,address,...,ward,warning,warrant,waste,wastewater,water,wave,wavelength,way,weaken,weakness,wealth,wear,weather,website,week,weight,welfare,wellbee,wheat,width,wildlife,willingness,window,winter,wish,withdraw,withdrawal,woman,women,word,work,worker,workflow,workforce,working,workload,workplace,workshop,world,worsen,wound,write,xenograft,year,yeast,yield,youth,zinc,zone
Topic0,0.1,0.107395,0.100006,0.10002,0.100042,0.100013,0.100011,21.068396,0.100009,33.776803,0.100019,0.100101,0.10013,0.102737,0.100018,0.100288,0.100323,0.100004,0.100032,0.100067,21.454473,13.791648,48.927458,11.19927,0.100026,0.100004,0.100008,0.100018,0.100027,0.100142,0.100352,0.100027,0.100044,12.296457,0.100046,0.310672,0.100052,0.100028,270.689381,2.520677,0.100013,0.10036,19.146317,2.166852,0.100061,11.417815,22.677746,28.007477,0.100018,43.988327,...,0.100018,0.100057,9.87594,0.100148,0.100002,0.109528,0.101326,0.100011,0.100038,0.100103,0.100034,31.895891,0.100043,0.100008,29.553382,0.374262,4.90154,0.100143,30.067521,0.100006,0.100008,0.100001,0.100011,0.100022,0.10005,0.100013,0.100154,0.100019,442.695719,39.751305,0.100103,106.965331,18.234956,0.100006,0.100015,12.869422,0.100012,9.311776,0.10001,28.988408,20.591144,0.100023,0.702614,0.100002,303.274724,0.100035,4.50249,57.075486,0.100021,0.101914
Topic1,20.198554,67.496501,0.10001,0.100028,22.778556,5.426876,0.102874,0.100011,46.015356,0.105511,20.915125,0.100019,0.100051,0.100027,0.10001,0.100021,0.100007,0.100459,0.100014,17.477084,0.101733,0.100053,9.10723,0.10004,34.355251,75.043939,0.100004,24.759219,0.119445,0.100012,208.44188,0.100014,13.511978,0.100014,36.611142,70.095557,117.592907,158.481859,420.188303,0.100006,0.100008,0.100015,0.100059,0.10017,0.100024,11.947367,0.100294,122.782462,0.100039,9.904374,...,0.100023,0.100001,3.721624,0.100016,0.100056,0.100139,0.100011,0.100005,0.100018,3.455308,0.10002,0.1,0.100001,0.100001,0.1,0.100025,45.544659,0.100005,0.1,35.00546,0.100009,0.100007,0.100002,21.88896,0.100009,0.100001,0.100006,0.10011,0.100014,0.100001,0.100043,18.121381,0.100006,0.100051,0.1,0.100013,0.1,0.100001,0.1,0.100029,0.100012,32.434335,0.100001,16.148006,0.100032,59.090164,10.709311,0.100004,6.282828,0.100016
Topic2,0.100025,0.100023,0.100001,9.700153,25.38001,0.100035,0.100013,0.100007,0.100857,0.100024,0.100053,0.100065,5.597971,0.100017,0.10001,0.100022,0.100021,0.1,25.616259,0.100032,65.523623,0.100012,34.474274,2.987933,0.10004,0.100008,0.100012,0.100014,0.100164,0.100062,0.100023,6.88904,0.100028,0.100021,0.100026,0.100027,0.100006,0.100021,162.464531,0.100003,4.743919,12.962936,0.100123,0.100039,0.100015,0.100053,0.100024,19.514503,0.100012,0.100031,...,0.100027,13.554809,10.536692,0.100397,0.100009,0.100042,55.472075,0.100003,0.100008,0.100202,2.138227,0.100026,6.759169,17.897085,0.100005,304.692156,175.347712,0.100106,0.10002,0.100019,0.100026,0.100005,0.100017,0.100081,0.100035,0.100005,6.351425,0.100073,114.813399,0.100021,0.100022,28.837509,0.100018,0.100027,0.100007,6.204825,0.10001,0.100007,0.100029,4.749965,13.353788,0.100008,0.100013,0.100026,750.244116,0.100007,0.100024,0.100012,0.100004,0.100019
Topic3,0.100071,0.100064,25.055031,60.393058,19.029974,0.1,0.100006,0.100041,0.100003,0.100002,0.10005,0.100002,17.174452,0.100009,0.100016,0.100027,0.100002,13.610915,7.585698,15.064317,74.638834,0.10006,16.831334,0.100027,0.100009,0.100032,0.100014,0.100006,52.137059,0.100021,0.100026,0.100016,28.813705,0.100007,0.100178,0.100011,0.100002,0.100014,0.100122,0.100012,0.100023,28.593208,0.10002,0.100005,0.100011,19.636451,0.100025,46.966385,0.100007,0.100023,...,5.170173,0.100021,18.310894,0.100015,0.100004,0.100004,0.100057,0.10001,0.100017,0.100014,3.603899,0.100014,0.10002,0.100008,0.100005,26.204043,9.476335,0.100006,0.100001,0.100006,1.281008,0.100003,0.100015,2.594475,0.100014,0.100007,0.100046,6.627129,185.589669,0.100003,0.100022,0.100546,0.100006,0.100006,0.100002,6.702462,0.100008,0.100001,0.100011,45.780213,27.495483,80.186915,0.10001,1.584407,632.047058,0.100001,7.477801,0.100006,0.100003,6.641749
Topic4,0.100086,30.144463,0.10001,0.100025,40.97299,5.49055,0.100082,0.100007,0.100008,0.100015,3.546516,12.431171,0.100016,0.100042,0.100018,8.509455,0.100024,0.100018,0.100123,0.100031,31.797824,0.107859,48.525995,0.100005,0.100028,9.760327,58.276992,0.100002,40.177603,0.100062,0.100012,0.100015,0.100056,26.173632,0.100012,0.100017,0.10001,0.10004,39.282746,0.100038,13.465467,0.100011,15.957531,0.100054,39.612514,2.724872,0.100035,5.568338,0.100008,0.234742,...,0.100014,0.100008,0.100038,0.239765,0.100005,0.116527,0.100028,0.100147,0.101624,0.100068,0.100156,0.100007,12.845155,0.100014,0.100008,0.100031,12.758868,0.100013,0.100016,0.100012,42.756608,0.100004,0.100007,0.100089,0.100005,0.100001,0.100119,0.100035,0.10001,0.100011,0.100043,44.73034,0.100004,8.298018,0.100002,0.100469,0.100019,0.100007,0.100003,0.100224,0.100011,0.100061,0.10001,0.100044,61.277265,0.100056,0.100035,0.100015,0.100003,0.103039
Topic5,0.100026,14.931481,0.100009,0.100024,0.100038,0.10001,0.100315,0.100023,0.100012,0.10001,0.10004,0.255394,17.140379,0.100028,0.100012,0.100033,0.100018,0.100001,0.487866,0.100891,71.778714,14.92145,0.10005,0.100033,0.100032,0.100045,253.068252,0.100003,54.90834,0.10003,3.474495,0.100047,5.457576,0.100032,0.100035,0.100021,2.929346,0.110073,4.733668,0.100049,0.100006,6.380554,0.100048,0.100294,14.48682,13.976595,0.100005,18.327891,0.100015,1.167011,...,0.100012,0.100066,0.316329,0.100033,0.100013,0.100013,0.100038,0.100029,0.100019,0.100175,0.100012,0.100015,0.100057,0.100045,0.100044,89.607387,39.229147,0.100039,0.100008,0.100011,0.460407,0.100008,0.100005,0.100054,0.100003,0.100008,4.097326,9.472477,0.100018,0.100027,0.10004,0.150763,0.100023,0.100044,0.100001,0.10004,0.100015,0.100018,0.100001,0.100029,0.100007,0.177934,0.100015,0.100003,3.859796,0.100009,15.080189,0.100011,0.100018,0.100039
Topic6,0.100066,35.061954,0.100005,0.100018,0.100028,0.100218,57.016749,0.100011,0.100028,0.100003,13.632296,0.100027,0.100049,0.100152,0.10003,0.100018,0.10002,0.100081,0.100007,0.100022,0.100073,0.100079,0.100034,0.100022,28.141853,0.100044,5.46147,0.100005,105.026931,3.861633,170.796525,0.100054,0.100025,2.405345,0.100033,0.10003,0.100021,0.100016,0.100039,0.1,0.100007,0.10003,0.100194,0.100201,0.100011,22.401617,0.100001,88.245722,15.118781,13.159733,...,0.100015,0.100006,0.100013,52.058296,60.646644,471.068532,0.100027,16.741859,0.100025,8.583673,0.100008,0.100002,3.647817,0.100115,0.100006,0.100013,0.100049,0.100015,0.100001,0.100053,0.100018,0.100008,0.100001,0.100038,0.100004,0.100004,0.10008,0.100056,0.100007,0.100011,0.992914,103.443687,0.100005,0.100072,0.100001,0.100063,0.100026,0.100005,0.1,0.126923,0.100004,0.100208,0.100003,0.100028,0.101127,0.100092,95.144298,0.100006,26.688902,0.100031
Topic7,0.100018,0.100015,0.1,0.100027,32.093823,0.100052,0.100004,1.654593,41.723982,0.100001,1.349595,0.100003,0.10004,0.100017,0.100004,0.10003,0.100004,0.100044,0.100026,0.100041,21.521805,0.100052,0.100046,0.100023,0.100034,0.10004,0.100008,0.100009,0.100019,0.100018,0.100371,0.100006,0.12021,7.981186,0.100024,0.100019,0.100005,0.100006,0.100025,0.100001,0.100001,0.10002,0.100027,0.100128,21.444545,0.100051,0.1,30.46607,0.100073,0.174031,...,0.100006,0.100023,0.100068,0.100774,0.100019,93.606937,0.100015,0.100002,0.100012,0.10004,0.100024,0.100006,0.100006,0.10008,0.100016,0.100013,0.100028,0.100026,0.100013,0.100047,0.10001,27.164715,0.100014,14.521201,37.912137,0.1,0.100041,0.100016,0.100005,0.100015,0.100028,1.280437,0.100008,0.100141,0.100002,0.100008,0.100001,0.100001,0.100066,35.90569,0.100021,0.100004,0.10001,0.100003,55.52484,0.100023,24.404672,0.100001,0.100027,93.396939
Topic8,0.1,40.850804,0.100002,0.100007,0.100057,0.100043,0.100011,7.72895,0.100001,0.753593,16.628437,0.100005,0.100187,36.960055,44.641845,251.573111,14.657784,0.100004,0.100104,8.638034,17.413369,0.100037,0.100053,0.100045,0.100052,0.100002,0.100011,0.100007,52.063892,24.867351,0.100002,0.100035,0.100033,1.595047,0.100041,46.520105,0.100029,0.100013,14.774789,9.480564,0.100079,0.100229,33.762113,8.565183,0.100045,8.743048,0.100027,56.269091,0.100004,191.463851,...,15.459584,0.100173,0.192755,0.100094,0.100005,0.100008,0.100251,0.100001,10.33047,0.100143,0.100047,0.100012,0.114577,0.100038,0.100036,0.100028,0.100014,34.218304,0.100073,0.100007,0.100012,0.100006,39.301219,0.100016,0.100025,7.517493,0.1002,0.100004,2.935329,8.799969,5.149436,203.58747,231.622139,0.10002,18.835271,0.100153,3.00337,8.852491,21.718439,34.353675,0.145267,0.100021,47.443062,0.1,124.043528,0.100004,0.100043,25.242914,0.100001,0.100009
Topic9,0.10003,47.141969,0.100005,0.100032,0.100034,0.100009,0.100019,0.100018,0.100015,0.100011,2.118168,0.100131,0.100135,0.107568,0.100009,12.932976,112.315175,0.100068,0.100168,10.497145,0.10004,0.100051,0.100064,0.100026,0.100061,0.100016,0.100015,0.100006,38.847641,0.100041,0.100011,6.480777,42.054799,0.100046,21.506813,21.682854,0.100061,22.414747,140.687553,0.100032,0.100013,0.100065,51.895171,0.100145,0.100036,10.222037,0.100011,85.034086,0.100044,78.505097,...,0.100031,0.10033,10.219934,0.100054,0.100009,0.100022,79.30665,0.100019,30.576518,0.102733,9.583115,0.100029,41.857209,0.100139,0.100034,0.101334,0.100021,30.123782,0.100029,0.100028,0.100019,0.10001,0.100024,0.100095,0.100025,0.100285,0.100075,3.293459,0.100015,0.100038,31.549948,195.621075,0.100015,39.538741,0.100015,0.100109,5.811612,0.100014,0.100014,55.419418,0.100024,0.100035,0.100026,0.100003,15.942244,0.100021,5.30302,3.285782,0.100003,6.725997


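Each `topic` is a row of weights over the same `2930` vocabulary words, so every topic is described by all `2930` words; what differs between topics is the weight given to each word.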
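
Because every topic carries a weight for every word, the most useful view is usually the few highest-weighted words per topic. The sketch below shows one way to extract them from a topic-by-word weight matrix. The small `topic_word` array and `vocabulary` list are hypothetical stand-ins for the fitted model's weights and feature names, and `top_keywords` is an illustrative helper, not part of the notebook. Values close to `0.1` in the table above are consistent with scikit-learn's default topic-word prior of `1 / n_components` for 10 topics, which would mean the word has essentially no support in that topic; that reading is an assumption about how the table was produced.

```python
import numpy as np

# Hypothetical stand-in for the topic-by-word weight matrix (3 topics x 6 words).
vocabulary = np.array(["cell", "drug", "gene", "patient", "protein", "trial"])
topic_word = np.array([
    [0.10, 45.2, 0.10, 0.10, 60.7, 0.10],
    [0.10, 3.5, 0.10, 80.1, 0.10, 52.3],
    [71.4, 0.10, 90.6, 0.10, 12.9, 0.10],
])


def top_keywords(weights, vocab, n_words=3):
    """Return the n_words highest-weighted words for each topic row."""
    # argsort is ascending, so reverse each row to get descending weights.
    order = np.argsort(weights, axis=1)[:, ::-1][:, :n_words]
    return [vocab[row].tolist() for row in order]


keywords = top_keywords(topic_word, vocabulary, n_words=2)
for topic_id, words in enumerate(keywords):
    print(f"Topic{topic_id}: {words}")
```

Ranking by weight keeps the full matrix available for export while giving a compact, readable summary of each topic.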

In [87]:
# Save the result to topic_words.xlsx
df_topic_keywords.to_excel('data/topic_words.xlsx', sheet_name='topics for articles')

### 9. Communication and findings

---------------------------------------------

- After `wrangling the data`, the dataset contains nearly 5000 articles, each with the following fields:
  - ArticleID : The article ID.
  - Title : The article title.
  - Abstract : The summary of the article.
- `0.2 % of the articles` had `no summary`, so they were removed from the dataset.
- `The abstract` was selected as the field for `topic modeling` because it contains a rich set of words summarizing the article.
- Topic modeling starts by `slicing` the `sentences` into `words`. After tokenization, the words are `lemmatized` to their roots and cleaned of punctuation and common three-letter words.
- The `Latent Dirichlet Allocation (LDA) algorithm` builds the `topic model` based on the `likelihood` of word `frequencies`.
- Many parameters affect the performance of topic modeling; two major `hyperparameters` are the `learning rate` and the `number of topics`.
- `Grid search` is one method for selecting the best parameters. Its drawback is the intensive computation time.
- For the `current dataset`, a total of `10 topics` was found to be enough to model all the articles, with around `2900 words per topic`.
- The number of words can be `drastically reduced` by further optimization that keeps the words with higher frequency and omits the words that are unlikely to be repeated.
- `Each article` in the dataset was `mapped` to all `10 topics` with the `probability` of each topic, along with `the highest 3 topics`, and the result was saved to an Excel file.
- The `bag of words` for each topic was `listed` and also saved to an Excel file.
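
The findings on vocabulary reduction and on mapping each article to its most probable topics can be sketched on a toy corpus. This is a minimal illustration under stated assumptions, not the notebook's actual configuration: the `abstracts` list is invented, `min_df=2` is an assumed document-frequency threshold, and `n_components=4` is an arbitrary topic count chosen so that a top-3 selection is meaningful.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy abstracts (hypothetical, not taken from the assessment dataset).
abstracts = [
    "protein binding assay for drug discovery",
    "drug discovery with protein assay screening",
    "clinical trial patient outcome study",
    "patient outcome in clinical trial cohort",
    "gene expression in tumor cell",
    "tumor cell gene expression profile",
]

# Compare the full vocabulary with one pruned by document frequency.
# min_df=2 (assumed threshold) keeps words found in at least 2 abstracts.
full_vectorizer = CountVectorizer(stop_words="english")
pruned_vectorizer = CountVectorizer(stop_words="english", min_df=2)
full_vectorizer.fit(abstracts)
counts = pruned_vectorizer.fit_transform(abstracts)
full_vocab_size = len(full_vectorizer.vocabulary_)
pruned_vocab_size = len(pruned_vectorizer.vocabulary_)

# Fit LDA on the pruned counts; 4 topics is an arbitrary choice here.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topic = lda.fit_transform(counts)  # rows: articles, columns: topic probabilities

# Indices of the 3 most probable topics per article, most probable first.
top3 = np.argsort(-doc_topic, axis=1)[:, :3]

print(f"vocabulary: {full_vocab_size} -> {pruned_vocab_size} words")
for article_id, topics in enumerate(top3):
    probabilities = np.round(doc_topic[article_id, topics], 3).tolist()
    print(article_id, topics.tolist(), probabilities)
```

On a real corpus, the `max_df` parameter of `CountVectorizer` can additionally drop words that appear in almost every document; suitable thresholds depend on the data and would need to be tuned.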

### 10. References

--------------------------------------------------------------------
1. [How to generate an LDA Topic Model for Text Analysis](https://yanlinc.medium.com/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6)
2. [Topic Modeling and Latent Dirichlet Allocation (LDA) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)
3. [David M. Blei et al., Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003) 993-1022](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
4. [Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
5. [A Deeper Meaning: Topic Modeling in Python](https://www.toptal.com/python/topic-modeling-python)
6. [Latent Dirichlet Allocation: Intuition, math, implementation and visualisation with pyLDAvis](https://towardsdatascience.com/latent-dirichlet-allocation-intuition-math-implementation-and-visualisation-63ccb616e094)