## Topic Modeling Introduction

Refer to [Topic Modeling With Latent Dirichlet Allocation](https://medium.com/towards-data-science/topic-modeling-with-latent-dirichlet-allocation-ea3ebb2be9f4)

Topic modeling is an unsupervised learning technique that uncovers the underlying “topics” in a given collection of documents. Because it can group or separate documents by topic, it is a valuable tool for businesses, and it appears in many applications such as recommendation systems and search engines.

One of the most popular methods of topic modeling is Latent Dirichlet Allocation (LDA).

This technique rests on the following two assumptions:

1. Each document comprises a mixture of topics
2. Each topic comprises a mixture of words

LDA represents documents with topic probabilities and represents topics with word probabilities.

Because LDA is fitted by iterative inference over the whole corpus, training it can require considerable computation.
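The two assumptions can be made concrete with a small numeric sketch. The topic names, vocabulary, and probabilities below are made up by hand for illustration and do not come from any fitted model. The point is only that, under these assumptions, a document's word distribution is a topic-weighted mixture of the topics' word distributions.

```python
# Hypothetical toy numbers, chosen by hand for illustration only.
# Assumption 2: each topic is a probability distribution over words.
topic_word = {
    "energy": {"solar": 0.5, "wind": 0.4, "vote": 0.1},
    "policy": {"solar": 0.1, "wind": 0.1, "vote": 0.8},
}

# Assumption 1: each document is a probability distribution over topics.
doc_topic = {"energy": 0.7, "policy": 0.3}


def word_distribution(doc_topic, topic_word):
    """Mix the topics' word distributions using the document's topic weights."""
    mixed = {}
    for topic, weight in doc_topic.items():
        for word, prob in topic_word[topic].items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


doc_words = word_distribution(doc_topic, topic_word)
for word, prob in sorted(doc_words.items()):
    print(f"{word}: {prob:.2f}")
```

Fitting LDA runs in the opposite direction: it starts from the observed words and estimates both sets of probabilities.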

## Stemming and Lemmatization

Stemming refers to reducing a word to its stem form. For instance, the stem of the words computer, computed, and computing is “comput.” To perform stemming, you can use the PorterStemmer class from the nltk.stem module. The word that you want to stem is passed to the stem() method of a PorterStemmer object.

Lemmatization refers to reducing a word to its dictionary form (its lemma). Lemmatization is different from stemming. In stemming, a word is reduced to its stem even if that stem is not a real word. In lemmatization, on the other hand, a word is reduced to a meaningful form, as found in a dictionary.

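To perform lemmatization, you can use the WordNetLemmatizer class from the nltk.stem module. The word that you want to lemmatize is passed to the lemmatize() method of a WordNetLemmatizer object. By default, lemmatize() treats the word as a noun (pos="n"), so verb forms such as "acted" and "acting" are returned unchanged unless you pass pos="v".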

The two cells below show an example of each.

In [1]:
from nltk.stem import PorterStemmer
words = ["Compute", "Computer", "Computing", "Computed", "Computes"]
ps = PorterStemmer()
for word in words:
    stem = ps.stem(word)
    print(stem)

comput
comput
comput
comput
comput


In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
words = ["acts","acted", "acting", "smiles", "smile"]

for word in words:
    lemma = wordnet_lemmatizer.lemmatize(word)
    print(lemma)

act
acted
acting
smile
smile
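
The difference between the two outputs above comes from how each technique works: a stemmer applies suffix rules, while a lemmatizer looks words up in a dictionary. The sketch below imitates both ideas in plain Python. It is a toy illustration only: the suffix list and the mini-dictionary are hypothetical, and it is neither the Porter algorithm nor WordNet.

```python
# Toy illustration only: this is NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "es", "er", "e", "s")  # hypothetical suffix list

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"computing": "compute", "computed": "compute", "computes": "compute"}


def toy_stem(word):
    """Strip the first matching suffix, whether or not a real word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words come back unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["Compute", "Computer", "Computing", "Computed", "Computes"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The toy stemmer maps every word to the non-word "comput", whereas the toy lemmatizer only returns forms that exist in its dictionary and leaves "computer" alone.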


## Topic Modeling on Climate Change Tweets

In [113]:
import pandas as pd
import nltk
import re
from nltk.stem import WordNetLemmatizer
import warnings

warnings.simplefilter("ignore", DeprecationWarning)

stemmer = WordNetLemmatizer()

nltk.download('stopwords')

en_stop = nltk.corpus.stopwords.words('english')

# add additional stop words

additional_stopwords=['rt', 'climate', 'change', '#climatechange', 'dey', 'amp']

en_stop=en_stop+additional_stopwords

df=pd.read_csv('climate_change02.csv', sep='|')

# create a corpus of documents; each tweet is treated as one document
corpus = list(df['text'])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sli\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Cleaning

In [114]:
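def text_cleaning(doc):
    # make sure we are working with a string (e.g. missing values are read as NaN)
    doc = str(doc)

    # remove links: drops "https" and everything after it on the same line
    doc = re.sub(r"https.+", " ", doc)

    # replace special characters with spaces; keep letters, digits, # and @
    doc = re.sub(r"[^a-zA-Z0-9#@]", " ", doc)

    # remove single letters surrounded by whitespace
    doc = re.sub(r"\s+[a-zA-Z]\s+", " ", doc)

    # collapse runs of whitespace into a single space
    doc = re.sub(r"\s+", " ", doc)

    # remove a leading "b " (e.g. left over from a bytes literal such as b'...')
    doc = re.sub(r"^b\s+", "", doc)

    doc = doc.lower()

    words = doc.split()

    # remove purely numeric tokens, but keep words that contain digits
    words = [word for word in words if not word.isnumeric()]

    # lemmatize (despite its name, `stemmer` is a WordNetLemmatizer),
    # then drop stop words and tokens shorter than three characters
    words = [stemmer.lemmatize(word) for word in words]
    words = [word for word in words if word not in en_stop]
    words = [word for word in words if len(word) >= 3]

    return words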

In [115]:
formated_data = []
for doc in corpus:
    words = text_cleaning(doc)
    formated_data.append(words)
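
To see what the regular expressions in the cleaning function do, the sketch below applies them one at a time to a single tweet. The tweet is hypothetical and written for this example, not taken from the dataset. Lemmatization and stop word removal are left out so that the snippet runs with only the standard library.

```python
import re

# Hypothetical tweet, written for this example (not from the dataset).
tweet = "RT @user: Big news on #ClimateChange in 2022 & a new report! https://t.co/abc123"

text = re.sub(r"https.+", " ", tweet)        # drop "https" and everything after it
text = re.sub(r"[^a-zA-Z0-9#@]", " ", text)  # keep letters, digits, # and @
text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # drop isolated single letters
text = re.sub(r"\s+", " ", text).lower()     # collapse whitespace, then lowercase

# Drop purely numeric tokens and tokens shorter than three characters.
tokens = [t for t in text.split() if not t.isnumeric() and len(t) >= 3]
print(tokens)
# ['@user', 'big', 'news', '#climatechange', 'new', 'report']
```

Mentions and hashtags survive because `#` and `@` are kept, while "rt", "on", "in", and "2022" are dropped by the length and numeric filters.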

### Topic Modeling with LDA

In [116]:
import gensim

from gensim import corpora

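# create a dictionary that maps each token to an integer id
gensim_dict = corpora.Dictionary(formated_data)

# convert each tokenized tweet to a bag-of-words list of (token_id, count) pairs;
# the dictionary already contains every token, so allow_update is not needed
gensim_corpus = [gensim_dict.doc2bow(doc) for doc in formated_data]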
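
The dictionary and bag-of-words steps above can be sketched in plain Python to show the shape of the data that LDA receives. This is an illustrative re-implementation under simplifying assumptions, not gensim's code: the documents are hypothetical, and the ids gensim assigns to tokens may differ.

```python
from collections import Counter

# Hypothetical cleaned documents (lists of tokens), for illustration only.
docs = [["ipcc", "report", "report"], ["report", "health"]]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)    # {'ipcc': 0, 'report': 1, 'health': 2}
print(bow_corpus)  # [[(0, 1), (1, 2)], [(1, 1), (2, 1)]]
```

Word order is discarded at this point; each document is reduced to how often each token occurs.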

In [117]:
# define number of topics

num_topics=5

lda_topic_models = gensim.models.ldamodel.LdaModel(
    gensim_corpus,
    num_topics=num_topics,
    id2word=gensim_dict,
    passes=20,
    random_state=42,
)

In [118]:
lda_topics = lda_topic_models.print_topics(num_words=15)
for topic_name in lda_topics:
    print(topic_name)

(0, '0.013*"like" + 0.013*"report" + 0.013*"@ipcc" + 0.012*"new" + 0.010*"today" + 0.010*"time" + 0.009*"want" + 0.008*"year" + 0.008*"world" + 0.008*"#ipcc" + 0.008*"chang" + 0.008*"longer" + 0.008*"health" + 0.007*"#climate" + 0.007*"@jordanmulinzi"')
(1, '0.030*"make" + 0.030*"music" + 0.028*"american" + 0.027*"sing" + 0.025*"must" + 0.022*"get" + 0.020*"drug" + 0.016*"country" + 0.015*"awarded" + 0.015*"still" + 0.014*"violence" + 0.013*"@tomiwebstr" + 0.012*"grammy" + 0.011*"say" + 0.011*"fuel"')
(2, '0.053*"@tammtamp" + 0.043*"song" + 0.033*"@golferic1" + 0.032*"album" + 0.031*"disaster" + 0.028*"kidjo" + 0.027*"angelique" + 0.025*"danger" + 0.025*"jack" + 0.025*"@letter" + 0.024*"talked" + 0.024*"induced" + 0.024*"dropped" + 0.023*"headline" + 0.011*"imagine"')
(3, '0.026*"ipcc" + 0.019*"africa" + 0.016*"panel" + 0.015*"report" + 0.015*"win" + 0.015*"join" + 0.014*"even" + 0.014*"intergovernmental" + 0.013*"still" + 0.011*"latest" + 0.011*"threat" + 0.011*"fela" + 0.011*"help" +

## Visualize the result

In [119]:
import pyLDAvis
import pyLDAvis.gensim_models as gensim_models

# visualize LDA model results
pyLDAvis.enable_notebook()

gensim_models.prepare(lda_topic_models, dictionary=gensim_dict, corpus=gensim_corpus)



## Assign topic probabilities to each tweet

In [120]:
# get the topic distribution for each tweet

topics=lda_topic_models.get_document_topics(gensim_corpus)


In [121]:
# assign topic back to data frame

# set maximum column width
pd.options.display.max_colwidth = 150

df['topics']=topics

df[['text', 'topics']].head(5)

|   | text | topics |
|---|---|---|
| 0 | @jemmaspatronus It was a stolen joke. As usual. https://t.co/gkG3t53wIj | [(0, 0.046375226), (1, 0.046422217), (2, 0.8144232), (3, 0.04639462), (4, 0.04638471)] |
| 1 | @jwmkup @SuarezMiami Much of what is going on in society should be concerning to us all. There’s widespread injusti… https://t.co/HfdfHRaEyW | [(0, 0.35286978), (1, 0.031001113), (2, 0.031225368), (3, 0.2143347), (4, 0.37056902)] |
| 2 | RT @NewsNancy9: @J_a_l_i_USA Man-made climate change is a cover for the biggest heist in world history. People better wake up before they’… | [(0, 0.24055485), (1, 0.016450785), (2, 0.18389916), (3, 0.10282883), (4, 0.4562664)] |
| 3 | RT @tearfundaus: "Any serious discipleship in these days needs to take climate change seriously." - Rev Tim Costello AO. Download Tearfund… | [(0, 0.59891015), (1, 0.23359235), (2, 0.12784173), (3, 0.020031832), (4, 0.01962396)] |
| 4 | RT @bapslosangeles: #BAPSLosAngeles and other @BAPS mandirs joined iconic landmarks by turning off our non-essential lights at 8:30 this… | [(0, 0.36484426), (1, 0.024251105), (2, 0.16621505), (3, 0.023777297), (4, 0.42091227)] |


## Find the dominant topic for each tweet

In [122]:
# convert each tweet's list of (topic_id, probability) pairs to a dictionary

topic_dict=[dict(topic) for topic in list(df['topics'])]

# find the dominant topic (the key with the highest probability)

dominant_topic=[max(topic, key=topic.get) for topic in topic_dict]

# assign the dominant topic for each tweet

df['dominant_topic']=dominant_topic
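
The dominant-topic step can be checked by hand on a single tweet. The sketch below uses the topic distribution from row 0 of the table above and compares the dictionary approach with an equivalent one-liner that skips the intermediate dictionary. The variable names are chosen for this example.

```python
# Topic distribution copied from row 0 of the table above.
topics = [(0, 0.046375226), (1, 0.046422217), (2, 0.8144232),
          (3, 0.04639462), (4, 0.04638471)]

# Same approach as the cell above: build a dict, take the key with the largest value.
as_dict = dict(topics)
dominant_via_dict = max(as_dict, key=as_dict.get)

# Equivalent without the intermediate dict: compare the pairs by their probability.
dominant_id, dominant_prob = max(topics, key=lambda pair: pair[1])

print(dominant_via_dict, dominant_id, round(dominant_prob, 3))
# 2 2 0.814
```

Both versions only look at the pairs that are present, so they still work if gensim leaves very low-probability topics out of a tweet's list.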

In [125]:
df[['topics', 'dominant_topic']].head()

|   | topics | dominant_topic |
|---|---|---|
| 0 | [(0, 0.046375226), (1, 0.046422217), (2, 0.8144232), (3, 0.04639462), (4, 0.04638471)] | 2 |
| 1 | [(0, 0.35286978), (1, 0.031001113), (2, 0.031225368), (3, 0.2143347), (4, 0.37056902)] | 4 |
| 2 | [(0, 0.24055485), (1, 0.016450785), (2, 0.18389916), (3, 0.10282883), (4, 0.4562664)] | 4 |
| 3 | [(0, 0.59891015), (1, 0.23359235), (2, 0.12784173), (3, 0.020031832), (4, 0.01962396)] | 0 |
| 4 | [(0, 0.36484426), (1, 0.024251105), (2, 0.16621505), (3, 0.023777297), (4, 0.42091227)] | 4 |


In [126]:
df['dominant_topic'].value_counts()

0    21662
2    17671
4    15545
3    12645
1    10727
Name: dominant_topic, dtype: int64

In [139]:
df[df['dominant_topic']==4]['text'].head(20)

1         @jwmkup @SuarezMiami Much of what is going on in society should be concerning to us all. There’s widespread injusti… https://t.co/HfdfHRaEyW
2         RT @NewsNancy9: @J_a_l_i_USA Man-made climate change is a cover for the biggest heist in world history.  People better wake up before they’…
4          RT @bapslosangeles: #BAPSLosAngeles and other @BAPS mandirs joined iconic landmarks   by turning off our non-essential lights at 8:30 this…
5     RT @DelhiAkshardham: Supporting #EarthHour2022 @DelhiAkshardham &amp; other @BAPS mandirs joined with iconic landmarks by switching off non-ess…
7     RT @DelhiAkshardham: Supporting #EarthHour2022 @DelhiAkshardham &amp; other @BAPS mandirs joined with iconic landmarks by switching off non-ess…
10    RT @pacific_rcc: 📆 26-29 April — discussing the impact of La Niña, cyclone season &amp; climate change in the Pacific 🌊  Pacific Island Meteoro…
11                                                                              @wtmpacific CL