<a href="https://colab.research.google.com/github/qChenSKIM/claims_topic_modeling/blob/main/topic_modeling_genism.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gensim: topic modeling
https://radimrehurek.com/gensim/index.html

https://radimrehurek.com/gensim/models/ldaseqmodel.html

In [34]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from collections import defaultdict
from gensim import corpora
import pandas as pd

In [35]:
# input =".".join(combined_text_list[0])
# print("the first cluster:",input)

In [36]:
documents = inputs =['Has a white color',
                     'Has a color that matches the taste',
       'Interesting shape', 'Looks well stocked', 'Has a soft texture',
       'Has smooth edges', 'Feels easy to squeeze',
       'Feels comfortable between the gums and lips',
       'Provides the same satisfaction as other nicotine products, e.g. cigarettes, vape, snus, heat-not-burn tobacco products, etc',
       'More enjoyable to use than other nicotine products, such as cigarettes, vaping, heat-not-burn tobacco products, etc',
       ' Provides a pleasant tingling sensation on the gums',
       'Does not cause drooling in the mouth or throat',
       'Invisible when placed between gums & lips (no protrusion)',
       'Spit free No need to spit while using',
       'It feels comfortable to talk or laugh while using',
       'Remain in position and do not move once placed in the mouth',
       'Has less health risks than cigarettes, for example, contains fewer toxins & harmful chemicals',
       'Does not cause irritation to the tongue',
       'Does not cause irritation to the gums',
       'Does not cause throat irritation',
       'Made from food-grade ingredients', 
       'Tobacco free',
       'Made of high quality materials',
       'Made from natural ingredients',
       'Does not disturb others, for example there is no smell of smoke or vape vapor',
       'Can be used anytime, anywhere',
       'Not messy, unlike snus/snuff or chewing tobacco',
       'Easy to use. No matches, macis or charger needed',
       'It feels satisfying',
      'It feels like it lasts a long time',
       'Has a delicious aftertaste in the mouth',
       'The smell is delicious', 'Releases long-lasting nicotine',
       'Get rid of nicotine straight away',
       'Has varying levels of nicotine',
       'Delivers nicotine through absorption. No need to suck or chew',
       'Does not stain teeth or fingers like cigarettes',
       'Does not leave a smoky or unpleasant odor on hair or clothes like cigarettes',
       'Does not cause breath smell like cigarettes',
       'Does not cause premature skin aging like cigarettes',
       'Environmentally friendly, for example 100% biodegradable',
       'The cans are made of 100% recyclable and recyclable plastic',
       'The cans are made of environmentally friendly materials, such as paper/pulp',
       'Made in an environmentally friendly way', 'The shape is slim',
       'Mini shape', 'Moist texture', 'The texture is dry',
       'Can be used at any time, for example after waking up, while traveling, at work, etc',
       'Suitable for socializing moments, for example with adult friends, family, or coworkers',
       'Can be used without having to take special time, for example, no need to take a break to smoke',
       'Comes in various flavors that suit certain moments, for example the taste of coffee after coffee',
       'Made by experts, for example taste experts',
       'Made in Indonesia',
       'Made in the Nordic region, for example Sweden or Denmark',
       'Undergo rigorous scientific testing',
       'Comes from a trusted and well-known brand',
       'Comes from the best selling brand',
       'Comes from a premium brand',
       'Coming from a brand I can understand/feel familiar with',
       ' No need to inhale smoke or vape (unlike, for example, cigarettes/heat-not-burn or vaping products',
       'Does not stick between gums & lips when removed',
       'Can be placed in different locations between the gums & lips',
       'Easy to take out of the can',
       'Has a variety of different flavors', 'Has an intense taste',
       'It tastes sweet',
       'Has a cool/refreshing taste, like mint or menthol',
       'Offers a cleaner experience than cigarettes, e.g. without fire, ash, smoke or unpleasant odors',
       'Can be disposed of in the disposal container at the top of the can lid',
       'Each one is wrapped in paper',
       'Has a paper slip that can be used to dispose of used bags',
       "Doesn't make me feel judged by others (compared to eg cigarettes)",
       'Is an alternative to nicotine products that makes me feel socially acceptable',
       'Is an alternative to nicotine products that makes me feel that I am making a better choice',
       'Is an alternative to nicotine products that makes me feel progressive & modern',
       'The can is easy to open & close',
       'The can feels durable & sturdy',
       ' The nicotine bag looks neat & organized when I open the can',
       'The can keeps the nicotine pouch fresh over time']

In [37]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]
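
Because `str.split()` keeps punctuation attached, tokens such as `products,` and `cigarettes,` show up in the HDP topics further down. The following is a minimal sketch of a punctuation-aware tokenizer, assuming a simple regular expression is adequate for this short English text. `tokenize` is a hypothetical helper that is not part of the notebook, and switching to it would change the outputs shown in later cells.

```python
import re

# Same stoplist as the cell above.
stoplist = set('for a of the and to in'.split())

def tokenize(document):
    """Lowercase, keep runs of letters/digits/apostrophes, drop stopwords."""
    words = re.findall(r"[a-z0-9']+", document.lower())
    return [word for word in words if word not in stoplist]

print(tokenize('More enjoyable to use than other nicotine products, such as cigarettes'))
# ['more', 'enjoyable', 'use', 'than', 'other', 'nicotine', 'products', 'such', 'as', 'cigarettes']
```

With this pattern, hyphenated terms such as `heat-not-burn` are split into separate tokens, which may or may not be desirable for this corpus.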

In [38]:
# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
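
To make the bag-of-words step concrete, here is a simplified pure-Python stand-in for what `corpora.Dictionary` and `doc2bow` produce: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. `build_vocab` and `to_bow` are hypothetical helpers for illustration only; gensim's own id assignment may differ.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(text, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())

toy_texts = [['has', 'color'], ['has', 'color', 'taste'], ['taste', 'taste', 'sweet']]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, vocab) for text in toy_texts]

print(vocab)       # {'has': 0, 'color': 1, 'taste': 2, 'sweet': 3}
print(toy_corpus)  # [[(0, 1), (1, 1)], [(0, 1), (1, 1), (2, 1)], [(2, 2), (3, 1)]]
```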

Term Frequency * Inverse Document Frequency (Tf-Idf) expects a bag-of-words (integer-valued) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality, except that features which were rare in the training corpus have their values increased. It therefore converts integer-valued vectors into real-valued ones while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.
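
The weighting can be sketched in plain Python. This sketch assumes the common scheme of raw term count times `log2(N / df)`, followed by scaling to unit length. gensim's `TfidfModel` supports other weighting schemes, so treat this as an illustration rather than a reimplementation. `fit_idf` and `tfidf` are hypothetical names, and the toy corpus is made up.

```python
import math

def fit_idf(corpus):
    """Compute idf = log2(N / df) for every token id in a bag-of-words corpus."""
    num_docs = len(corpus)
    doc_freq = {}
    for bow in corpus:
        for token_id, _count in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return {token_id: math.log2(num_docs / df) for token_id, df in doc_freq.items()}

def tfidf(bow, idf):
    """Weight counts by idf, then scale the vector to unit Euclidean length."""
    weighted = [(token_id, count * idf.get(token_id, 0.0)) for token_id, count in bow]
    norm = math.sqrt(sum(weight * weight for _id, weight in weighted))
    if norm == 0.0:
        return []
    return [(token_id, weight / norm) for token_id, weight in weighted if weight != 0.0]

# Token 0 occurs in three of four documents, token 1 in only two.
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (1, 2), (3, 1)], [(2, 1)]]
idf = fit_idf(toy_corpus)
vector = tfidf([(0, 1), (1, 1)], idf)
print(vector)
```

Both tokens occur once in the transformed document, yet the rarer token 1 ends up with the larger weight, which is the effect described above.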

In [39]:
from gensim import models

tp_model = models.TfidfModel(corpus)  # step 1 -- initialize a model

Latent Semantic Indexing (LSI, sometimes called LSA) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality. This notebook uses only 5 latent dimensions for its small corpus, but on real corpora a target dimensionality of 200–500 is recommended as a “golden standard”.

In [40]:
# tp_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)  # corpus_tfidf is built in cell 45

Random Projections (RP) aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents by throwing in a little randomness. The recommended target dimensionality is again in the hundreds or thousands, depending on your dataset.
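
The idea that a random matrix approximately preserves distances can be sketched without any library. This is a generic random projection with +1/-1 entries, not a claim about how `RpModel` is implemented. All helper names, the dimensions, and the two sparse vectors are made up for illustration.

```python
import math
import random

def make_projection(num_features, num_dims, seed=0):
    """Build a num_dims x num_features matrix of random +1/-1 entries."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(num_features)]
            for _ in range(num_dims)]

def project(sparse_vec, projection):
    """Project a sparse (id, value) vector, scaling by 1/sqrt(num_dims)."""
    scale = 1.0 / math.sqrt(len(projection))
    return [scale * sum(row[i] * value for i, value in sparse_vec)
            for row in projection]

def sparse_distance(a, b):
    """Euclidean distance between two sparse (id, value) vectors."""
    diff = dict(a)
    for i, value in b:
        diff[i] = diff.get(i, 0.0) - value
    return math.sqrt(sum(value * value for value in diff.values()))

def dense_distance(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Reduce a 1000-dimensional space to 300 dimensions.
projection = make_projection(num_features=1000, num_dims=300)
doc_a = [(3, 0.5), (40, 0.5), (712, 0.7)]
doc_b = [(3, 0.6), (515, 0.8)]

original = sparse_distance(doc_a, doc_b)
reduced = dense_distance(project(doc_a, projection), project(doc_b, projection))
print(f"original distance: {original:.3f}, after projection: {reduced:.3f}")
```

The two printed distances should be close but not identical. The gap is the price of the reduced dimensionality and shrinks as `num_dims` grows.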

In [41]:
# tp_model = models.RpModel(corpus_tfidf, num_topics=500)  # corpus_tfidf is built in cell 45

Latent Dirichlet Allocation (LDA) is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

In [42]:
# tp_model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

Hierarchical Dirichlet Process (HDP) is a non-parametric Bayesian method (note the missing number of requested topics):

In [43]:
# tp_model = models.HdpModel(corpus, id2word=dictionary)

In [44]:
doc_bow = [(0, 1), (1, 1)]
print(tp_model[doc_bow])  # step 2 -- use the model to transform vectors

[(0, 0.8806890932176118), (1, 0.4736947551826392)]


In [45]:
corpus_tfidf = tp_model[corpus]
# for doc in corpus_tfidf:
#     print(doc)

In [46]:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

lsi_model.print_topics(5)

[(0,
  '0.392*"not" + 0.369*"does" + 0.347*"or" + 0.300*"cause" + 0.287*"like" + 0.246*"cigarettes" + 0.213*"has" + 0.186*"irritation" + 0.181*"nicotine" + 0.126*"gums"'),
 (1,
  '-0.463*"nicotine" + -0.387*"has" + 0.237*"not" + 0.235*"does" + 0.227*"cause" + -0.198*"can" + -0.189*"that" + -0.170*"an" + -0.158*"is" + 0.154*"cigarettes"'),
 (2,
  '0.644*"made" + 0.436*"from" + 0.296*"brand" + 0.264*"comes" + 0.257*"ingredients" + -0.187*"nicotine" + 0.171*"example" + -0.166*"has" + 0.095*"environmentally" + 0.079*"friendly"'),
 (3,
  '-0.380*"can" + 0.367*"has" + -0.352*"feels" + -0.261*"&" + -0.259*"it" + -0.215*"easy" + -0.199*"between" + -0.199*"lips" + -0.195*"be" + -0.189*"gums"'),
 (4,
  '0.439*"made" + -0.438*"brand" + -0.386*"comes" + -0.381*"from" + 0.212*"or" + -0.188*"has" + 0.184*"example" + 0.184*"it" + 0.148*"feels" + 0.123*"no"')]
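
LSI weights are signed. The overall sign of a dimension is arbitrary, but words with opposite signs pull in opposite directions along it. A small sketch for splitting a printed topic string into its positive and negative words follows. `parse_topic` is a hypothetical helper that only handles the quoted format shown above, and the sample string is copied from topic 1. gensim's `show_topics(formatted=False)` should return the pairs directly, which avoids string parsing.

```python
import re

# Matches terms such as -0.463*"nicotine" in a formatted LSI topic string.
TERM_PATTERN = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.392*"not" + 0.369*"does"' into [('not', 0.392), ('does', 0.369)]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

topic_1 = '-0.463*"nicotine" + -0.387*"has" + 0.237*"not" + 0.235*"does"'
terms = parse_topic(topic_1)
positive = [word for word, weight in terms if weight > 0]
negative = [word for word, weight in terms if weight < 0]

print(positive)  # ['not', 'does']
print(negative)  # ['nicotine', 'has']
```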

In [47]:
# both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
# for doc, as_text in zip(corpus_lsi, documents):
#     print(doc, as_text)

In [48]:
test_model = models.HdpModel(corpus, id2word=dictionary)
test_model.print_topics(5)

[(0,
  '0.082*open + 0.056*when + 0.054*unpleasant + 0.043*comfortable + 0.035*products, + 0.031*like + 0.030*nicotine + 0.027*than + 0.025*no + 0.023*are'),
 (1,
  '0.063*smoke + 0.059*using + 0.051*made + 0.049*brand + 0.041*between + 0.040*comes + 0.029*& + 0.027*color + 0.027*example + 0.026*mouth'),
 (2,
  '0.063*feel + 0.056*does + 0.044*other + 0.039*between + 0.038*texture + 0.035*taste + 0.034*need + 0.029*without + 0.028*cause + 0.025*such'),
 (3,
  '0.048*such + 0.042*cigarettes, + 0.040*smoke + 0.035*easy + 0.032*looks + 0.032*me + 0.032*100% + 0.032*taste + 0.031*that + 0.026*are'),
 (4,
  '0.065*shape + 0.057*or + 0.053*at + 0.048*environmentally + 0.036*than + 0.031*other + 0.028*coffee + 0.027*vape + 0.025*on + 0.025*friendly')]

In [51]:
# corpus_tfidf = tp_model[corpus]
# for doc in corpus_tfidf:
#     print(doc)

## How each doc distributes across topics
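
Each document's LSI vector is a list of `(topic_id, weight)` pairs. One simple way to summarize it is to pick the dimension with the largest absolute weight. `dominant_topic` is a hypothetical helper for this sketch, and the sample vectors are rounded from the output printed by the cell below.

```python
def dominant_topic(doc_vector):
    """Return the (topic_id, weight) pair with the largest absolute weight."""
    return max(doc_vector, key=lambda pair: abs(pair[1]))

# Rounded LSI vectors for 'Has a white color' and 'Interesting shape'.
has_white_color = [(0, 0.138), (1, -0.277), (2, -0.112), (3, 0.276), (4, -0.140)]
interesting_shape = [(0, 0.017), (1, -0.042), (2, -0.019), (3, 0.000), (4, 0.002)]

print(dominant_topic(has_white_color))    # (1, -0.277)
print(dominant_topic(interesting_shape))  # (1, -0.042)
```

A document whose weights are all close to zero, like 'Interesting shape', is poorly represented by the five retained dimensions, so its dominant topic carries little meaning.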

In [50]:
# both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, 0.13764681653012198), (1, -0.2765318206592029), (2, -0.11190312307773062), (3, 0.2757158753677313), (4, -0.14015856849602426)] Has a white color
[(0, 0.15615051160625396), (1, -0.34931898059850175), (2, -0.08188315949232393), (3, 0.28209318068570494), (4, -0.10849360416690498)] Has a color that matches the taste
[(0, 0.017424086818623594), (1, -0.04210946607708844), (2, -0.019466842706042518), (3, 0.0002593820468874149), (4, 0.002125834554148545)] Interesting shape
[(0, 0.018646437843185787), (1, -0.040654206093178546), (2, -0.014753481217476829), (3, -0.05283213424125874), (4, -0.012822187426571341)] Looks well stocked
[(0, 0.15176118378698406), (1, -0.30508508042483345), (2, -0.1397703799027402), (3, 0.3025816714136205), (4, -0.17613632159060616)] Has a soft texture
[(0, 0.2128637272991892), (1, -0.38739851017306337), (2, -0.16592700140183914), (3, 0.36706822563297975), (4, -0.18752088078806706)] Has smooth edges
[(0, 0.11550914281872517), (1, -0.09891670627700631), (2, -0.03690