text-topics

Identify topics of a text corpus and classify documents into topics with different methods.

NMF implementations for 2 cases:

  • if you know the number of topics: nmf_fixed_k.py
  • if you don't know the number of topics: nmf_unknown_k.py

LDA implementation incl. grid search for an unknown number of topics (k): lda.py
This is included for the sake of completeness, since

  • most parts of the code are no different from the NMF usage (hence a little redundant)
  • the LDA results are not as good as those of NMF
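For the unknown-k case, the grid-search idea can be sketched like this (a minimal sketch; the candidate values for k are illustrative, not taken from lda.py). scikit-learn's GridSearchCV picks the k whose model scores the best approximate log-likelihood on held-out documents:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

documents = [
    "king henry england throne marriage",
    "architect building design furniture",
    "film actress hollywood star cinema",
    "queen elizabeth england king son",
    "designer office architecture movement",
    "actor film star cinema hollywood",
]

# LDA expects raw term counts, not tf-idf
counts = CountVectorizer().fit_transform(documents)

# search over candidate topic numbers; the best model is chosen
# by the estimator's score (approximate log-likelihood)
params = {"n_components": [2, 3, 4]}
search = GridSearchCV(LatentDirichletAllocation(random_state=1), params, cv=2)
search.fit(counts)

best_k = search.best_params_["n_components"]
print(best_k)
```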

A simple Keras implementation of a multiclass text classifier (with known classes): keras_simple_classifier.py
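keras_simple_classifier.py itself is not reproduced here; as a compact illustration of the same task (supervised multiclass classification with known class labels), here is a sketch that uses scikit-learn's LogisticRegression instead of Keras, with made-up training snippets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# made-up training snippets with known class labels ("a", "b", "c")
texts = [
    "king henry england throne", "queen elizabeth marriage son",
    "architect building design", "furniture designer office",
    "film actress hollywood", "actor star cinema",
]
labels = ["a", "a", "b", "b", "c", "c"]

# tf-idf features fed into a multiclass logistic regression
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["tudor king england"]))  # expected: class "a"
```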

Human readable topics

A topic can be represented and interpreted by the most important tokens / phrases of its documents. Sometimes, this is not as clear as one would like.
These scripts:

  • identify_topic.py
  • _request_wikipedia.py

try to solve this problem by querying Wikipedia with the top tokens on document level and processing the returned categories for each topic.

The results are quite satisfying, as shown in the following example:

Top phrases from each topic:

[
  [
    "henry", "england", "elizabeth", "king", "anne", "marriage", "death", "son", "throne", "college"
  ],
  [
    "design", "architect", "architecture", "niemeyer", "building", "office", "movement", "designer", "furniture", "site"
  ],
  [
    "film", "swanson", "keaton", "bow", "hollywood", "actress", "cinema", "star", "pickford", "actor"
  ]
]
Top 3 phrases for the same topics from Wikipedia category phrase processing:

topic 0: 16th century | english | monarchs
topic 1: american | architects | 20th century
topic 2: american | actresses | 20th century
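Leaving the Wikipedia requests aside, the category-processing step can be sketched like this: collect the category strings returned for each document of a topic, split them into words (the actual scripts may use a more elaborate phrase split), and keep the most common ones as the topic label. The category strings below are made-up stand-ins for real API responses:

```python
from collections import Counter

# made-up category strings returned by Wikipedia for the top tokens
# of the documents in one topic
categories_per_document = [
    ["16th-century English monarchs", "English queens"],
    ["16th-century English monarchs", "House of Tudor"],
    ["English monarchs", "House of Tudor"],
]

# count lower-cased category words across all documents of the topic
counter = Counter()
for categories in categories_per_document:
    for category in categories:
        counter.update(word.lower() for word in category.split())

# the most common words become the human-readable topic label
top_phrases = [word for word, _ in counter.most_common(3)]
print(" | ".join(top_phrases))  # english | monarchs | 16th-century
```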

Data

The directories in /data:

  • source_texts:
    Excerpts of Wikipedia biographies falling into 3 broad topics:
    • Tudor dynasty (marked with "a")
    • Midcentury Architects / Designers (marked with "b")
    • Stars of the silent movie era (marked with "c")
  • target_texts:
    Very short texts based on the source texts with varying similarity, marked according to the source texts. Also, one text about a movie star not included in the source texts and one text about "Charlie Brown" without any topic affiliation (marked with "d").

This data corresponds to: https://github.com/zushicat/text-similarity-extractive

Further Reading

General

Text Cleaning (NLTK)

  • "Stemming and Lemmatization in Python": datacamp.com/community/tutorials/stemming-lemmatization-python

Tokenizer / Vectorizer

LDA / NMF
