<a href="https://colab.research.google.com/github/ustab/machine-learning/blob/master/lda_topic_modeling_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Extracting Topic Modeling Features using LDA (Latent Dirichlet Allocation)

In natural language processing, Latent Dirichlet Allocation (LDA) is an unsupervised generative statistical model that explains a set of observations through unobserved groups, where each group accounts for why some parts of the data are similar. LDA is an example of a topic model: observations (e.g., words) are collected into documents, and each word's presence is attributed to one of the document's topics. Each document is assumed to contain only a small number of topics.
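
As a rough illustration of the generative story described above, the sketch below samples one toy document with NumPy. The vocabulary, the number of topics, the document length, and the Dirichlet parameters `alpha` and `eta` are made-up values chosen for illustration; none of them come from the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = np.array(["sleep", "tired", "game", "play", "anxiety", "support"])  # toy vocabulary (assumption)
n_topics = 2      # hypothetical number of topics
doc_length = 8    # hypothetical number of words in the document
alpha = 0.5       # Dirichlet prior on the document's topic mixture (assumption)
eta = 0.5         # Dirichlet prior on each topic's word distribution (assumption)

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.full(len(vocab), eta), size=n_topics)

# Each document has its own mixture over topics.
doc_topic = rng.dirichlet(np.full(n_topics, alpha))

# For every word position: pick a topic, then pick a word from that topic.
z = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = [str(vocab[rng.choice(len(vocab), p=topic_word[k])]) for k in z]

print(doc_topic)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.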

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [5]:
dataset = pd.read_csv('/content/Mental-Health-Twitter.csv').drop(['Unnamed: 0'],axis=1)

In [6]:
# Show the full post_text of each row when displaying the dataset
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 50)

In [7]:
dataset.head()

Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.,1013187241,84,211,251,837,0,1
1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,"It's Sunday, I need a break, so I'm planning to spend as little time as possible on the #A14...",1013187241,84,211,251,837,1,1
2,637749345908051968,Sat Aug 29 22:11:07 +0000 2015,Awake but tired. I need to sleep but my brain has other ideas...,1013187241,84,211,251,837,0,1
3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,RT @SewHQ: #Retro bears make perfect gifts and are great for beginners too! Get stitching with October's Sew on sale NOW! #yay http://t.co/…,1013187241,84,211,251,837,2,1
4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,It’s hard to say whether packing lists are making life easier or just reinforcing how much still needs doing... #movinghouse #anxiety,1013187241,84,211,251,837,1,1


In [11]:
dataset['user_id'].nunique()
# total unique users

72

In [13]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [21]:
pip install emoji --upgrade

Successfully installed emoji-2.0.0


In [22]:
# Text Pre-processing (NLP)
import re
import nltk
import emoji

In [23]:
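# Custom Tokenizer
def tokenize(text):
    text = emoji.replace_emoji(text, replace='')    # Removes emojis from the text
    text = re.sub(r"http\S+", "", text)             # Removes URLs from the text
    text = re.sub(r"[^\w\s]", "", text)             # Removes punctuation (anything that is not a word character or whitespace)
    tokens = [word for word in nltk.word_tokenize(text) if len(word)>3]   # Tokenizes and keeps only words with at least 4 characters
    stems = [stemmer.stem(item) for item in tokens]                       # Stemming (computed but not returned)
    # NOTE: the unstemmed tokens are returned, which matches the vocabulary shown below.
    # Return `stems` instead to build the vocabulary from stemmed words.
    return tokens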
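
To see what the two regular expressions in `tokenize` do, here is a small standard-library sketch applied to a made-up tweet. It uses `str.split` as a stand-in for `nltk.word_tokenize` and skips the emoji step so that it runs without extra packages; both simplifications, and the helper name `clean_and_split`, are assumptions for illustration only.

```python
import re

def clean_and_split(text):
    """Simplified stand-in for the notebook's tokenize (hypothetical helper)."""
    text = re.sub(r"http\S+", "", text)    # drop URLs
    text = re.sub(r"[^\w\s]", "", text)    # drop punctuation, including '#' and '@'
    # Keep only words with at least 4 characters.
    return [word for word in text.split() if len(word) > 3]

sample = "Awake but tired, need #sleep! http://t.co/abc"
print(clean_and_split(sample))
# ['Awake', 'tired', 'need', 'sleep']
```

Because `#` and `@` are stripped as punctuation, hashtags and user handles survive as plain words, which is why handles show up in the vocabulary later on. Lowercasing is not done here; when a function like this is passed as `tokenizer=` to a scikit-learn vectorizer, the vectorizer lowercases the text before calling it.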

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [31]:
import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

In [32]:
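# use_idf=False and norm=None switch off IDF weighting and normalization,
# so the resulting matrix holds raw term counts
vectorizer_tf = TfidfVectorizer(tokenizer=tokenize, stop_words='english', max_df=0.75, max_features=10000, use_idf=False, norm=None)
tf_vectors = vectorizer_tf.fit_transform(dataset.post_text)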
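
The `use_idf=False, norm=None` settings mean this `TfidfVectorizer` behaves like a plain term counter. The following sketch checks that on a tiny made-up corpus by comparing against `CountVectorizer`; the three toy documents are assumptions, and the custom tokenizer and stop-word options from the cell above are left out to keep the comparison minimal.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "sleep tired sleep",
    "anxiety support anxiety anxiety",
    "tired anxiety",
]  # toy corpus (assumption)

tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs)
counts = CountVectorizer().fit_transform(docs)

# Same vocabulary order and same values: the "tf" matrix is just raw counts.
print(tf.toarray())
same = np.array_equal(tf.toarray(), counts.toarray())
print(same)
```

Raw counts are a natural input here, since LDA models how often each word occurs in a document rather than a reweighted score.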

In [33]:
vectorizer_tf.get_feature_names_out()[100:200]

# vectorizer_tf.get_feature_names_out().shape

array(['70412', '70685', '7yrold', '800656hope', '8664887386',
       '89stevesmith', '933flz', '_5sosfamupdates', '_ahmadqushairi',
       '_betwixt', '_dpiddy', '_fatinsalihah', '_fawky', '_gabrielpicolo',
       '_kelibatbangsat', '_snape_', '_spalala', '_spetty',
       'a1sincedaynone', 'a__twat', 'a_venture3', 'aaay', 'aag2016',
       'aaron', 'aaron_ariff', 'aaroncarpenter', 'aartorias', 'abandoned',
       'abby', 'abcnetwork', 'ability', 'able', 'abroad', 'absence',
       'absolute', 'absolutely', 'abuse', 'abused', 'abusing',
       'acaciabrinley', 'academic', 'academy', 'accent', 'accept',
       'acceptable', 'acceptance', 'accepted', 'accepts', 'access',
       'accessories', 'accident', 'accidentally', 'accomplish',
       'accomplished', 'according', 'account', 'accountability',
       'accountants', 'accounting', 'accounts', 'accurate', 'accurately',
       'accused', 'accuses', 'accustomed', 'acetaminophen', 'ache',
       'achieve', 'achieved', 'achievement', 'achi

In [34]:
from sklearn import decomposition

# Fit LDA with 25 topics (clusters) to summarize the corpus
lda = decomposition.LatentDirichletAllocation(n_components=25, max_iter=10, learning_method='online', learning_offset=50, n_jobs=1, random_state=42)

W1 = lda.fit_transform(tf_vectors)
H1 = lda.components_

In [35]:
W1.shape

(20000, 25)
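
The shape above is (number of documents, number of topics). As a small sketch of what the two matrices contain, the example below fits scikit-learn's LDA on a made-up four-document corpus. The corpus, the variable names, and the choice of two topics are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "sleep tired sleep night",
    "anxiety support anxiety help",
    "tired night sleep",
    "support help anxiety",
]  # toy corpus (assumption)

X = CountVectorizer().fit_transform(docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

W = toy_lda.fit_transform(X)   # document-topic matrix, one row per document
H = toy_lda.components_        # topic-word matrix, one row per topic

print(W.shape, H.shape)
print(W.sum(axis=1))           # each row is a topic distribution and sums to 1

# components_ holds unnormalized weights; dividing by the row sums
# turns each topic into a probability distribution over words.
topic_word = H / H.sum(axis=1, keepdims=True)
print(topic_word.sum(axis=1))
```

The rows of `W` are what the rest of this notebook uses as per-document features.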

In [36]:
# Collects the 15 highest-weighted words for each of the 25 topics
num_words = 15

vocab = np.array(vectorizer_tf.get_feature_names_out())

top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_words-1:-1]]
topic_words = ([top_words(t) for t in H1])
topics = [' '.join(t) for t in topic_words]
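
The slice `[:-num_words-1:-1]` in `top_words` can be hard to read at a glance. This sketch applies the same expression to one made-up row of weights; the toy vocabulary, the weights, and `k` are assumptions for illustration.

```python
import numpy as np

toy_vocab = np.array(["sleep", "tired", "game", "play", "anxiety"])  # toy vocabulary (assumption)
weights = np.array([0.1, 2.0, 0.3, 5.0, 1.5])                        # one hypothetical topic row
k = 3

# argsort sorts ascending; the reversed slice walks back from the end
# and stops after k items, giving the k largest weights, largest first.
top_idx = np.argsort(weights)[:-k-1:-1]
print(top_idx)
# [3 1 4]
print(toy_vocab[top_idx].tolist())
# ['play', 'tired', 'anxiety']
```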

In [37]:
topics

['time following really hard morning care tomorrow heart post important story high came make guess',
 'depression make treatments years overcome sure read team anxiety honestly support children treatment diagnosed salon',
 'actually wait start adclaidekanes tweets 5sostumblrx lies understand seen arent vine wouldnt month weeks giving',
 'thank need fuck happy yall live birthday mind literally check free started kind coming tired',
 'said putin change season truth veganrevoiution ways ready russian wasnt proud albino animals middle checked',
 'like thanks sleep guys tweet baby cool maybe left fine sweet goes lady fucked young',
 'hello right game lydiamcrtins lets zaynmalik okay head looking idea playing song listen forget ugly',
 'today hope play damn hananyxnyx mnwild azfarovski bea_viel88 true voting took bitch minutes word death',
 'misslusyd genevieveverso thefuxedos azarkansero tauri3l trying pretty liz_smith333 wanted house fake place account wants nothingalarming',
 'want thing 

In [38]:
# Assign each document to its most probable topic/cluster (argmax over its topic distribution)

colnames = ["Topic" + str(i) for i in range(lda.n_components)]
docnames = ["Doc" + str(i) for i in range(len(dataset.post_text))]

df_doc_topic_pos = pd.DataFrame(np.round(W1,2),columns=colnames,index=docnames)
significanttopic = np.argmax(df_doc_topic_pos.values,axis=1)

df_doc_topic_pos['dominant_topic'] = significanttopic
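
One subtlety in the cell above is that the argmax is taken over values already rounded to two decimals. When several topics are nearly tied, rounding can make them exactly equal, and `np.argmax` then returns the first of them. The sketch below shows the effect on a made-up matrix (the numbers are assumptions, not taken from `W1`).

```python
import numpy as np

W = np.array([
    [0.644, 0.101, 0.255],     # clear winner: topic 0
    [0.332, 0.3345, 0.3335],   # near tie: topic 1 is largest by a small margin
])

raw = np.argmax(W, axis=1)                   # argmax on the unrounded values
rounded = np.argmax(np.round(W, 2), axis=1)  # argmax after rounding, as in the cell above

print(raw.tolist())
# [0, 1]
print(rounded.tolist())
# [0, 0]
```

If exact tie-breaking matters, taking the argmax on the unrounded matrix and rounding only for display avoids this.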

In [39]:
# Final features

df_doc_topic_pos

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,...,Topic16,Topic17,Topic18,Topic19,Topic20,Topic21,Topic22,Topic23,Topic24,dominant_topic
Doc0,0.00,0.64,0.00,0.00,0.00,0.00,0.00,0.09,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.10,0.09,1
Doc1,0.12,0.00,0.00,0.12,0.00,0.00,0.00,0.00,0.00,0.12,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.12,0.00,12
Doc2,0.01,0.01,0.01,0.72,0.01,0.15,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,3
Doc3,0.00,0.09,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.73,0.00,0.00,0.00,0.00,0.09,0.00,18
Doc4,0.09,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.25,0.00,0.50,0.00,23
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Doc19995,0.01,0.01,0.01,0.26,0.01,0.26,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.26,0.01,3
Doc19996,0.01,0.01,0.21,0.01,0.01,0.01,0.01,0.01,0.21,0.41,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,9
Doc19997,0.00,0.00,0.12,0.00,0.23,0.00,0.00,0.00,0.12,0.00,...,0.00,0.00,0.00,0.00,0.00,0.12,0.00,0.00,0.00,4
Doc19998,0.01,0.01,0.01,0.01,0.17,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.17,0.01,0.01,0.01,0.17,10


In [60]:
df_doc_topic_pos.columns

Index(['Topic0', 'Topic1', 'Topic2', 'Topic3', 'Topic4', 'Topic5', 'Topic6',
       'Topic7', 'Topic8', 'Topic9', 'Topic10', 'Topic11', 'Topic12',
       'Topic13', 'Topic14', 'Topic15', 'Topic16', 'Topic17', 'Topic18',
       'Topic19', 'Topic20', 'Topic21', 'Topic22', 'Topic23', 'Topic24',
       'dominant_topic'],
      dtype='object')

In [62]:
df_doc_topic_pos.to_csv("lda_features.csv")

In [63]:
df = pd.read_csv('lda_features.csv', index_col=0)  # index_col=0 restores the Doc labels as the index
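
When a frame with a labelled row index is written with `to_csv` and read back, the index returns as an ordinary column unless `index_col` is given. The sketch below shows both cases using an in-memory buffer instead of a file; the small `features` frame is a made-up stand-in for `df_doc_topic_pos`.

```python
import io
import pandas as pd

features = pd.DataFrame(
    {"Topic0": [0.64, 0.12], "Topic1": [0.36, 0.88], "dominant_topic": [0, 1]},
    index=["Doc0", "Doc1"],
)  # toy stand-in for df_doc_topic_pos (assumption)

buffer = io.StringIO()
features.to_csv(buffer)

buffer.seek(0)
plain = pd.read_csv(buffer)                 # index comes back as a column named "Unnamed: 0"
buffer.seek(0)
indexed = pd.read_csv(buffer, index_col=0)  # first column is used as the index again

print(plain.columns.tolist())
# ['Unnamed: 0', 'Topic0', 'Topic1', 'dominant_topic']
print(indexed.columns.tolist())
# ['Topic0', 'Topic1', 'dominant_topic']
```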

### Visualizing LDA Features

In [71]:
!pip install pyLDAvis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [72]:
import pyLDAvis.gensim_models

In [73]:
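import pyLDAvis
# Note: in recent pyLDAvis releases (3.4 and later) this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
# Build the interactive topic visualization from the fitted model, the count matrix, and the vectorizer
pyLDAvis.sklearn.prepare(lda, tf_vectors, vectorizer_tf)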

