Code for Text Engineering courses, University of Cologne



Information Retrieval

Course plan and material (in German)

Text Mining

Course plan and material (in German)

| Session | Functional | Technical | Uses | Literature |
|---|---|---|---|---|
| tm1 | Corpus and data access | OOD and TDD basics; object database and native queries | DB4O; Crawler (ir6) | Gamma et al. (1995), Ch. 1; Bloch (2008), Item 16 |
| tm2 | Data enrichment with standoff annotation | Generics; XML binding for export and import; schema generation as a form of MDD (code-first) | Index (ir2); TF-IDF (ir5); JAXB (or Java 6) | Thompson & McKelvie (1997); Bloch (2008), Ch. 5; Naftalin & Wadler (2006), Part 1 |
| tm3 | Text classification with naive Bayes | Delegation and strategy for modular classification | Crawler (ir6) | Gamma et al. (1995), p. 315; Bloch (2008), Item 21 |
| tm4 | Comparative text classification and evaluation | Using the Weka API; adapter for integration | Weka (developer version) | Gamma et al. (1995), p. 139; Witten & Frank (2005) |
| tm5 | Flat k-means clustering and purity evaluation | Java Concurrency API (CopyOnWriteArrayList, ExecutorService); visualization with Graphviz DOT | TF-IDF vectors and cosine similarity (ir5) | Bloch (2008), Item 68 |
| tm6 | Release engineering | CRISP builds with Ant | All previous code | Clark (2006), Ch. 2 |
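
Sessions tm2 and tm5 build on TF-IDF vectors and cosine similarity. The following is a minimal sketch of that idea, not the course's actual implementation: the class `TfIdfSketch`, its method names, the toy documents, and the unsmoothed weighting tf * log(N / df) are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: TF-IDF weighting and cosine similarity (not the course code). */
final class TfIdfSketch {

    /** Builds one sparse TF-IDF vector (term -> weight) per tokenized document. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: the number of documents each term occurs in.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequency within this document.
            Map<String, Double> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1.0, Double::sum);
            }
            // Assumed weighting: tf * log(N / df), without smoothing.
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Double> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    /** Cosine similarity of two sparse vectors; 0 if either vector has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0 ? 0 : dot / norms;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy documents, invented for this example.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("text", "mining", "with", "java"),
            Arrays.asList("text", "mining", "with", "weka"),
            Arrays.asList("release", "engineering", "with", "ant"));
        List<Map<String, Double>> vectors = tfIdf(docs);
        System.out.println(cosine(vectors.get(0), vectors.get(1)));
        System.out.println(cosine(vectors.get(0), vectors.get(2)));
    }
}
```

In this toy corpus, "with" occurs in every document, so its weight is log(3/3) = 0. The first and third documents share no other term, so their similarity is 0, while the first two documents share "text" and "mining" and get a positive similarity.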


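Session tm5 evaluates clusterings by purity. As a rough illustration, assuming the standard definition (each cluster is credited with its most frequent gold label, and the credited counts are divided by the total number of items), purity could be computed as follows. The class `PuritySketch` and the label data are hypothetical and not taken from the course code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: purity of a flat clustering (not the course code). */
final class PuritySketch {

    /**
     * Computes purity. Each inner list holds the gold class labels of the
     * members of one cluster. Returns 0 for an empty clustering.
     */
    static double purity(List<List<String>> clusters) {
        int total = 0;
        int majoritySum = 0;
        for (List<String> cluster : clusters) {
            // Count the labels in this cluster and track the largest count.
            Map<String, Integer> counts = new HashMap<>();
            int majority = 0;
            for (String label : cluster) {
                majority = Math.max(majority, counts.merge(label, 1, Integer::sum));
            }
            majoritySum += majority;
            total += cluster.size();
        }
        return total == 0 ? 0.0 : (double) majoritySum / total;
    }

    public static void main(String[] args) {
        // Invented gold labels for three clusters.
        List<List<String>> clusters = Arrays.asList(
            Arrays.asList("a", "a", "b"),
            Arrays.asList("b", "b", "b"),
            Arrays.asList("c", "a"));
        System.out.println(purity(clusters)); // prints 0.75, i.e. (2 + 3 + 1) / 8
    }
}
```

A purity of 1.0 means every cluster contains a single class. Purity grows trivially with the number of clusters, so it is best compared across runs with the same k.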
  • For each session, files runnable as a Java application and as a JUnit test can be found in package (X for the session number)
  • To run all tests: run as JUnit test (needs corpora in data/; run as Java application to generate them)
  • The Ant script can compile and deploy the code as an executable JAR (ant deploy), generate Javadoc (ant doc), and run the tests (ant test), whose results are summarized in an HTML report (ant report)


  • Bloch, Joshua (2008), Effective Java, Second Edition, Addison-Wesley.
  • Clark, Mike (2006), Projekt-Automatisierung, Hanser.
  • Gamma, Erich, Richard Helm, Ralph Johnson and John Vlissides (1995), Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley.
  • Naftalin, Maurice and Philip Wadler (2006), Java Generics and Collections, O’Reilly.
  • Thompson, H. S. and D. McKelvie (1997), Hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe ’97: The next decade – Pushing the Envelope, pp. 227–229.
  • Witten, Ian H. and Eibe Frank (2005), Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann.