* **Purpose**: Explore topic modelling (LDA) for COVID Tweets
* **Ideal outcome**: Use topic modelling to cluster the many thousands of tweets into a few buckets, allowing me to infer the topic focus and political bias of the accounts that created the tweets.
* **Method**: Forked "Topic Modelling (LDA) on Elon Tweets" and ran it on English-language tweets from the USA with a few tweaks. Saved the tweet clusters and manually inspected them.
* **Actual outcome**: Even though measures such as the "Intertopic Distance Map" showed clear separation of topics, I did not find obvious clusters when manually inspecting the clustered tweets.
* **Analysis**: LDA in its standard form treats words individually and does not factor in context, and it does not incorporate recent contextual NLP models such as BERT. Thus, for a novice ML user like me, it may not be easy to extract topics from nuanced discussions. To put this in perspective, LDA would probably find it easy to distinguish COVID tweets from tweets about government elections or Black Lives Matter protests. It would be harder to extract topics within COVID, such as opening up the economy vs. vaccines, as both may use similar vocabulary. It would also be hard to extract topics tied to different news events, such as "New York COVID cases spike" and "Florida open economy plan delayed".
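
To make the "words individually, no context" point concrete, here is a minimal sketch of a bag-of-words representation. The two example tweets and the `bag_of_words` helper are made up for illustration and are not part of the dataset or the notebook's pipeline.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

# Two hypothetical tweets with opposite priorities but identical vocabulary
tweet_a = "economy first then vaccine"
tweet_b = "vaccine first then economy"

bow_a = bag_of_words(tweet_a)
bow_b = bag_of_words(tweet_b)

# A model that only sees these counts cannot tell the two tweets apart
print(bow_a == bow_b)
```

Since LDA only receives counts like these, two tweets that argue opposite positions with the same words look identical to it.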

In [None]:
import os
import pandas as pd
import numpy as np
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import gensim
from gensim import corpora, models, similarities
import logging
import tempfile
from nltk.corpus import stopwords
from string import punctuation
from collections import OrderedDict
import seaborn as sns
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline

init_notebook_mode(connected=True) #do not miss this line

import warnings
warnings.filterwarnings("ignore")

In [None]:
datafile = '/kaggle/input/coronavirus-covid19-tweets-late-april/2020-04-16 Coronavirus Tweets.CSV'

In [None]:
# read tweets, filter to USA and English, sample to 1000

tweets1 = pd.read_csv(datafile, encoding='latin1')
tweets1 =tweets1[(tweets1.country_code == "US") & (tweets1.lang == "en")].reset_index(drop = True)
tweets2 = tweets1[['text']].copy()
tweets2 = tweets2.rename(columns={'text': 'Tweet'})
tweets = tweets2.sample(1000).reset_index()

print("Number of tweets: ",len(tweets['Tweet']))
tweets.head(5)

In [None]:
# Preparing a corpus for analysis and checking first 10 entries

corpus = tweets['Tweet'].tolist()

corpus[0:10]

In [None]:
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# Remove common words and tokenize. Also remove the COVID hashtags, as they appeared in all LDA topics equally.
# Tokens are lowercased before filtering, so the stop list only needs lowercase entries.

list1 = ['rt','#coronavirus','#covid19','#covid_19']
stoplist = set(stopwords.words('english') + list(punctuation) + list1)

texts = [[word for word in str(document).lower().split() if word not in stoplist] for document in corpus]

dictionary = corpora.Dictionary(texts)
dictionary.save(os.path.join(TEMP_FOLDER, 'elon.dict'))  # store the dictionary, for future reference

In [None]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'elon.mm'), corpus)  # store to disk, for later use

In the previous cells, I created a corpus of documents represented as a stream of vectors. To continue, let's use that corpus with the help of Gensim.

### Creating a transformation


The transformations are standard Python objects, typically initialized by means of a training corpus:

Different transformations may require different initialization parameters; in case of TfIdf, the “training” consists simply of
going through the supplied corpus once and computing document frequencies of all its features.
Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and,
consequently, takes much more time.
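
As a rough illustration of what that single pass computes, here is a minimal pure-Python sketch of TF-IDF weighting. The toy documents, the `tfidf` helper, and the exact formula (term frequency times log2 of N over document frequency, without length normalization) are assumptions made for illustration; a library implementation may weight and normalize differently.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical, for illustration only)
docs = [
    ["lockdown", "economy", "economy"],
    ["vaccine", "trial"],
    ["lockdown", "vaccine"],
]

# "Training": one pass over the corpus, counting how many documents contain each token
doc_freq = Counter(token for doc in docs for token in set(doc))
num_docs = len(docs)

def tfidf(doc):
    """Weight each token by term frequency * log2(N / document frequency)."""
    counts = Counter(doc)
    return {tok: tf * math.log2(num_docs / doc_freq[tok]) for tok, tf in counts.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "economy" gets a larger weight than "lockdown" because it occurs twice and appears in fewer documents.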

In [None]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model


### Note
Transformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying a different string preprocessing, using different feature ids, or using bag-of-words input vectors where TfIdf vectors are expected, will result in feature mismatch during transformation calls and consequently in either garbage output and/or runtime exceptions.
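
A small sketch of how a preprocessing mismatch changes the resulting vector. The `vocab` mapping and the `to_bow` helper are hypothetical stand-ins for a trained dictionary; dropping unknown tokens is an assumption of this sketch.

```python
# Vocabulary fixed at training time: token -> feature id (hypothetical)
vocab = {"lockdown": 0, "economy": 1, "vaccine": 2}

def to_bow(tokens, vocab):
    """Map tokens to sorted (feature_id, count) pairs, dropping tokens not in vocab."""
    counts = {}
    for tok in tokens:
        if tok in vocab:
            counts[vocab[tok]] = counts.get(vocab[tok], 0) + 1
    return sorted(counts.items())

text = "Vaccine economy vaccine"

# Same preprocessing as training (lowercased): both mentions of "vaccine" are counted
same_preprocessing = to_bow(text.lower().split(), vocab)

# Different preprocessing (not lowercased): "Vaccine" is unknown and silently lost
different_preprocessing = to_bow(text.split(), vocab)

print(same_preprocessing, different_preprocessing)
```

No error is raised in the mismatched case; the vector is simply wrong, which is why such bugs are easy to miss.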

From now on, tfidf is treated as a read-only object that can be used to apply a transformation to a whole corpus:

In [None]:
corpus_tfidf = tfidf[corpus]  # step 2 -- use the model to transform vectors

### LDA:
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Latent Dirichlet Allocation (LDA) is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).
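
The "topics as word distributions, documents as mixtures" idea can be sketched with made-up numbers. The vocabulary, the two topics, and the document mixture below are all hypothetical; a trained model would estimate these from the corpus.

```python
vocab = ["lockdown", "economy", "vaccine", "trial"]

# Two hypothetical topics, each a probability distribution over the vocabulary
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0: mostly "reopening" words
    [0.05, 0.05, 0.50, 0.40],  # topic 1: mostly "vaccine" words
]

# A document's soft mixture over the two topics
doc_topic = [0.7, 0.3]

# p(word | document) = sum over topics of p(topic | document) * p(word | topic)
word_probs = [
    sum(doc_topic[k] * topic_word[k][j] for k in range(len(doc_topic)))
    for j in range(len(vocab))
]

for word, prob in zip(vocab, word_probs):
    print(word, round(prob, 3))
```

The resulting word probabilities still sum to one, and the document leans towards "reopening" words because topic 0 carries most of its mixture weight.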

In [None]:
total_topics = 8

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=total_topics)
corpus_lda = lda[corpus] # the model was trained on bag-of-words counts, so apply it to the same bag-of-words corpus: bow->lda

In [None]:
# Show the 10 most important words for each topic:
lda.show_topics(total_topics,10)

In [None]:
for topic_id in range(lda.num_topics):
    topk = lda.show_topic(topic_id, 30)
    topk_words = [ w for w, _ in topk ]
    print('{}: {}'.format(topic_id, ' '.join(topk_words)))

In [None]:
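# Add top topic and score for each document

top_topics = []
for bow in corpus:
    # Reuse the bag-of-words vectors built with the training preprocessing.
    # get_document_topics returns (topic_id, probability) pairs that are not sorted
    # by probability, so pick the most probable pair instead of the first one.
    top_topics.append(max(lda.get_document_topics(bow), key=lambda pair: pair[1]))

# Assign whole columns at once to avoid chained-assignment warnings
tweets['topic1'] = [topic_id for topic_id, _ in top_topics]
tweets['score1'] = [score for _, score in top_topics]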

# Save to file to be downloaded

dfTopic = tweets.sort_values(by=['score1'], ascending = False).head(500)
dfTopic = dfTopic.sort_values(by=['topic1'])
dfTopic.to_csv('/kaggle/working/topics.csv',index=False)
dfTopic.head(50)


In [None]:
data_lda = {i: OrderedDict(lda.show_topic(i,10)) for i in range(total_topics)}

In [None]:
df_lda = pd.DataFrame(data_lda)
df_lda = df_lda.fillna(0).T
print(df_lda.shape)

In [None]:
df_lda

In [None]:
g=sns.clustermap(df_lda.corr(), center=0, standard_scale=1, cmap="RdBu", metric='cosine', linewidths=.75, figsize=(15, 15))
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
#plt.setp(ax_heatmap.get_yticklabels(), rotation=0)  # For y axis

In [None]:
pyLDAvis.enable_notebook()
# prepare() expects the bag-of-words corpus the model was trained on, not the LDA-transformed corpus
panel = pyLDAvis.gensim.prepare(lda, corpus, dictionary, mds='tsne')
panel