<a href="https://colab.research.google.com/github/simon-clematide/casdmit-fs21/blob/master/notebooks/zora_topic_modeling_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup




In [18]:
#! pip install --upgrade pandas-profiling jupyter
! pip install pyldavis==3.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyldavis==3.4.0
  Downloading pyLDAvis-3.4.0-py3-none-any.whl (2.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.6/2.6 MB[0m [31m40.9 MB/s[0m eta [36m0:00:00[0m
Collecting funcy (from pyldavis==3.4.0)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyldavis
Successfully installed funcy-2.0 pyldavis-3.4.0


In [19]:
# avoid warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# `pyLDAvis` 

pyLDAvis also supports LDA models fitted with scikit-learn. Let's look at this in more detail, using a sample of lemmatized English abstracts from ZORA.

In [20]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

## Load Zora Abstract Sample


In [2]:
! test -e zora-eng-dewey.lemmatized.fasttext.tsv || curl https://files.ifi.uzh.ch/cl/siclemat/lehre/fs23/bibliosuisse/data/zora-eng-dewey.lemmatized.fasttext.tsv -o zora-eng-dewey.lemmatized.fasttext.tsv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.0M  100 15.0M    0     0  6691k      0  0:00:02  0:00:02 --:--:-- 6688k


In [6]:
def load_data(inputfile):
    """Return the second tab-separated column (the text) of every line."""
    texts = []

    with open(inputfile, "r", encoding="utf-8") as infile:
        for line in infile:
            texts.append(line.strip().split("\t")[1])
    return texts

In [7]:
docs_raw = load_data("zora-eng-dewey.lemmatized.fasttext.tsv")

In [8]:
print(len(docs_raw))

10267


In [9]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [10]:
print(docs_raw[72])

multifocal epithelial tumors and field cancerization from loss of mesenchymal csl signal it be currently unclear whether tissue change surround multifocal epithelial tumor be a cause or consequence of cancer . here , we provide evidence that loss of mesenchymal notch / csl signaling cause tissue alteration , include stromal atrophy and inflammation , which precede and be potent trigger for epithelial tumor . mouse carry a mesenchymal - specific deletion of csl / rbp - jk , a key notch effector , exhibit spontaneous multifocal keratinocyte tumor that develop after dermal atrophy and inflammation . csl - deficient dermal fibroblast promote increase tumor cell proliferation through upregulation of c - jun and c - fos expression and consequently high level of diffusible growth factor , inflammatory cytokine , and matrix - remodel enzyme . in human skin sample , stromal field adjacent to multifocal premalignant actinic keratosis lesion exhibit decrease notch / csl signal and associate molec

## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw counts or in tf-idf form.

In [13]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5,  # exclude terms that occur in more than 50% of the documents
                                min_df = 20    # exclude terms that occur in fewer than 20 documents
                                )
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(10267, 4883)


In [14]:
# What does a given document look like in this representation?
# Get the mapping of column indices to vocabulary items
index2vocabulary_item = tf_vectorizer.get_feature_names_out()

# Get a dense representation of one row of the document-term matrix
doc_index = 72  # Index of the document to show
doc_matrix = dtm_tf.getrow(doc_index).toarray()

# Print the words and their counts in the document
for i, count in enumerate(doc_matrix[0]):
    if count > 0:
        word = index2vocabulary_item[i]
        print(f"{word}: {count}")

adjacent: 1
alteration: 1
associate: 1
atrophy: 2
cancer: 2
carry: 1
cause: 3
cell: 1
change: 3
consequence: 1
consequently: 1
currently: 1
cutaneous: 1
cytokine: 1
decrease: 1
deficient: 1
deletion: 1
dermal: 2
develop: 1
effector: 1
environmental: 1
enzyme: 1
epithelial: 3
evidence: 1
exhibit: 2
expression: 2
factor: 1
fibroblast: 1
field: 3
gene: 1
growth: 1
high: 1
human: 1
importantly: 1
include: 1
increase: 1
induce: 1
inflammation: 2
inflammatory: 1
key: 1
know: 1
lesion: 1
level: 1
loss: 2
matrix: 1
mesenchymal: 3
molecular: 1
mouse: 1
potent: 1
precede: 1
proliferation: 1
promote: 1
provide: 1
sample: 1
signal: 2
signaling: 1
skin: 2
specific: 1
spontaneous: 1
stromal: 2
surround: 1
tissue: 2
trigger: 1
tumor: 4
unclear: 1
upregulation: 1


Alternatively, we can build a tf-idf document-term matrix.
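A minimal sketch of the tf-idf variant, using scikit-learn's `TfidfVectorizer` on a hypothetical toy corpus (`toy_docs` and the variable names are illustrative, not from the notebook). On the real data, the same `max_df` and `min_df` settings as above could be passed. Note that LDA is formulated over counts, so the tf-idf matrix is an alternative representation rather than a drop-in improvement.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "tumor cell growth in skin tissue",
    "skin tissue inflammation and tumor growth",
    "market price growth and inflation",
]

tfidf_vectorizer = TfidfVectorizer(
    stop_words="english",
    lowercase=True,
    token_pattern=r"\b[a-zA-Z]{3,}\b",
)
dtm_tfidf = tfidf_vectorizer.fit_transform(toy_docs)

# With the default norm="l2", every document row has unit length
row_norms = np.linalg.norm(dtm_tfidf.toarray(), axis=1)
print(dtm_tfidf.shape[0], row_norms)

# Terms that occur in many documents get a lower inverse document frequency
vocab = tfidf_vectorizer.vocabulary_
idf_growth = tfidf_vectorizer.idf_[vocab["growth"]]
idf_market = tfidf_vectorizer.idf_[vocab["market"]]
print(idf_growth < idf_market)
```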

## Fit Latent Dirichlet Allocation models

Finally, the LDA model is fitted. `n_components` is the number of topics.

In [15]:
# fit LDA on the term-frequency (TF) document-term matrix
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0, verbose=1, max_iter=10)
lda_tf.fit(dtm_tf)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10



## Visualizing the models with pyLDAvis
pyLDAvis places the topics on a 2D map using multidimensional scaling, a form of dimensionality reduction.

Can you reidentify the subjects? 

Hover over the topic circles and terms to explore the connection between words and topics.

In [21]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)

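To illustrate how multidimensional scaling yields a 2D topic map, here is a rough sketch using only NumPy: pairwise Jensen-Shannon divergences between topic-term distributions, followed by classical MDS (principal coordinate analysis). The topic distributions are random and hypothetical, the function names are illustrative, and pyLDAvis's own implementation may differ in detail.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence between two strictly positive distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * (kl_pm + kl_qm)


def pcoa_2d(dist):
    """Classical MDS (PCoA): embed a distance matrix in two dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram-like matrix
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # eigh returns eigenvalues in ascending order; keep the two largest
    top = np.argsort(eigvals)[::-1][:2]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Hypothetical topic-term distributions: 5 topics over 30 terms
rng = np.random.default_rng(0)
topic_term = rng.dirichlet(np.ones(30), size=5)

n_topics = topic_term.shape[0]
dist = np.zeros((n_topics, n_topics))
for i in range(n_topics):
    for j in range(i + 1, n_topics):
        dist[i, j] = dist[j, i] = js_divergence(topic_term[i], topic_term[j])

coords = pcoa_2d(dist)  # one (x, y) position per topic
print(coords.round(3))
```

For a fitted scikit-learn model, the topic-term distributions could be obtained as `lda_tf.components_ / lda_tf.components_.sum(axis=1, keepdims=True)`.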


How could the topic modeling be improved?