# Working with LDA

- Pre-processing text
    - Tokenize, normalize (lowercase)
    - Stop word removal
    - Stemming
- Convert tokenized documents to a document-term matrix
- Build LDA models on the doc-term matrix
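
To make the second step concrete: a document-term matrix has one row per document and one column per vocabulary term, and each cell holds how often that term occurs in that document. The following is a minimal pure-Python sketch of the idea, not the method used later in this notebook. The toy documents and the helper name `build_doc_term_matrix` are made up for illustration.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (vocabulary, matrix) for a list of tokenized documents."""
    # A sorted vocabulary gives every term a stable column index
    vocab = sorted({token for doc in docs for token in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical toy documents, already tokenized and stemmed
toy_docs = [["genet", "test", "genet"], ["drug", "test"]]
vocab, matrix = build_doc_term_matrix(toy_docs)
print(vocab)
print(matrix)
```

Libraries typically store this matrix in a sparse form, because most terms are absent from most documents.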

In [1]:
text = """
In recent years, genetic testing has revolutionized the landscape of personalized medicine, offering tailored treatment strategies based on an individual's genetic makeup. This advancement allows healthcare providers to predict disease susceptibility, identify optimal drug therapies, and personalize preventive care plans.

Genetic testing involves analyzing an individual's DNA to detect variations that may indicate predisposition to certain diseases or influence drug metabolism. For instance, pharmacogenomics uses genetic information to predict how patients will respond to medications, minimizing adverse reactions and optimizing efficacy.

In oncology, genetic testing plays a crucial role in identifying mutations associated with cancer development, guiding treatment decisions such as targeted therapies and immunotherapies. Moreover, in reproductive medicine, genetic testing helps assess the risk of inherited disorders in embryos during in vitro fertilization (IVF), offering prospective parents valuable information to make informed decisions about family planning.

Despite its benefits, ethical considerations and privacy concerns surround genetic testing. Issues such as the potential misuse of genetic information and the psychological impact of test results on patients highlight the importance of comprehensive genetic counseling and stringent privacy safeguards.

Looking ahead, ongoing research in genomics promises further advancements in personalized medicine, potentially uncovering new therapeutic targets and refining diagnostic approaches. As technology continues to evolve, integrating genetic testing into routine clinical practice holds the promise of improving patient outcomes and shaping the future of healthcare delivery.

This text outlines the transformative impact of genetic testing in personalized medicine, emphasizing its multifaceted applications across various medical disciplines while acknowledging the ethical implications that accompany its widespread adoption.
"""

In [18]:
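import nltk
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# One-time setup if the NLTK data is missing (resource names can vary by NLTK version):
# nltk.download('stopwords'); nltk.download('punkt')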

In [44]:
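# Normalize: lowercase, then strip punctuation
lower_case = text.lower()
rm_punctuation = lower_case.translate(str.maketrans('', '', string.punctuation))

# Tokenize and drop English stop words
sw = set(stopwords.words('english'))
token_words = nltk.word_tokenize(rm_punctuation)
rm_sw = [word for word in token_words if word not in sw]

# Stem each remaining token with the Porter stemmer
porter = nltk.PorterStemmer()
doc_text = [porter.stem(t) for t in rm_sw]

# gensim expects a list of tokenized documents; here the whole text is a single document
doc_set = [doc_text]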
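
Note that `doc_set` holds a single document. LDA infers topics from how words co-occur across documents, so with only one document the model has little to separate. One option, sketched here as an assumption and not as the notebook's approach, is to treat each paragraph as its own document and run the same preprocessing on each. `split_paragraphs` and `sample_text` are hypothetical names used only for this example.

```python
def split_paragraphs(raw_text):
    """Split text on blank lines and drop empty chunks."""
    chunks = raw_text.split("\n\n")
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Short stand-in for the full text defined in the first cell
sample_text = """
Genetic testing informs treatment.

Privacy concerns remain.
"""

paragraphs = split_paragraphs(sample_text)
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, paragraph)
```

Each paragraph would then be lowercased, tokenized, filtered, and stemmed separately, so that `doc_set` ends up with one token list per paragraph.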

In [45]:
doc_set

[['recent',
  'year',
  'genet',
  'test',
  'revolution',
  'landscap',
  'person',
  'medicin',
  'offer',
  'tailor',
  'treatment',
  'strategi',
  'base',
  'individu',
  'genet',
  'makeup',
  'advanc',
  'allow',
  'healthcar',
  'provid',
  'predict',
  'diseas',
  'suscept',
  'identifi',
  'optim',
  'drug',
  'therapi',
  'person',
  'prevent',
  'care',
  'plan',
  'genet',
  'test',
  'involv',
  'analyz',
  'individu',
  'dna',
  'detect',
  'variat',
  'may',
  'indic',
  'predisposit',
  'certain',
  'diseas',
  'influenc',
  'drug',
  'metabol',
  'instanc',
  'pharmacogenom',
  'use',
  'genet',
  'inform',
  'predict',
  'patient',
  'respond',
  'medic',
  'minim',
  'advers',
  'reaction',
  'optim',
  'efficaci',
  'oncolog',
  'genet',
  'test',
  'play',
  'crucial',
  'role',
  'identifi',
  'mutat',
  'associ',
  'cancer',
  'develop',
  'guid',
  'treatment',
  'decis',
  'target',
  'therapi',
  'immunotherapi',
  'moreov',
  'reproduct',
  'medicin',
  'gen

In [29]:
import gensim
from gensim import corpora, models

In [64]:
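# Map each unique token to an integer id
dictionary = corpora.Dictionary(doc_set)
# Bag-of-words corpus: each document becomes a list of (token_id, token_count) tuples
corpus = [dictionary.doc2bow(doc) for doc in doc_set]
# Fit a 4-topic LDA model with 50 passes over the corpus
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
print(ldamodel.print_topics(num_topics=4, num_words=5))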

[(0, '0.051*"genet" + 0.038*"test" + 0.019*"person" + 0.019*"inform" + 0.019*"medicin"'), (1, '0.007*"risk" + 0.007*"role" + 0.007*"shape" + 0.007*"disord" + 0.007*"disciplin"'), (2, '0.007*"genet" + 0.007*"safeguard" + 0.007*"predisposit" + 0.007*"misus" + 0.007*"genom"'), (3, '0.007*"adopt" + 0.007*"risk" + 0.007*"moreov" + 0.007*"parent" + 0.007*"ongo"')]


In [65]:
print(ldamodel.print_topics(num_topics=4, num_words=1))

[(0, '0.051*"genet"'), (1, '0.007*"risk"'), (2, '0.007*"genet"'), (3, '0.007*"adopt"')]
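
`print_topics` returns each topic as a formatted string of `weight*"word"` terms, which is awkward to process further. The sketch below uses only the standard library and the string format visible in the five-word output above to turn such a string into `(word, weight)` pairs. The helper name `parse_topic` is hypothetical. gensim's `show_topic` method returns such pairs directly and is usually the simpler choice when the model object is at hand.

```python
import re


def parse_topic(topic_string):
    """Parse '0.051*"genet" + 0.038*"test"' into [(word, weight), ...]."""
    # Each term looks like: <weight>*"<word>"
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Topic 0 exactly as printed in the output above
topic_0 = '0.051*"genet" + 0.038*"test" + 0.019*"person" + 0.019*"inform" + 0.019*"medicin"'

for word, weight in parse_topic(topic_0):
    print(f"{word}: {weight}")
```

Read this way, only topic 0 carries distinct weights. The other three topics sit at roughly 0.007 for every listed word, which is consistent with the model having seen a single document.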
