Code for Text Engineering courses, University of Cologne

Information Retrieval

Course plan and material (in German)

Text Mining

Course plan and material (in German)

Session | Functional | Technical | Uses | Literature
tm1 | Corpus and data access | OOD and TDD basics; object DB and native queries | DB4O; Crawler (ir6) | Gamma et al. (1994), Ch. 1; Bloch (2008), Item 16
tm2 | Data enrichment with standoff annotation | Generics; XML binding for export and import; schema generation as a form of MDD (code-first) | Index (ir2); TF-IDF (ir5); JAXB (or Java 6) | Thompson & McKelvie (1997); Bloch (2008), Ch. 5; Naftalin & Wadler (2006), Part 1
tm3 | Text classification with naive Bayes | Delegation and strategy for modular classification | Crawler (ir6) | Gamma et al. (1994), p. 315; Bloch (2008), Item 21
tm4 | Comparative text classification and evaluation | Using the Weka API, adapter for integration | Weka (developer version) | Gamma et al. (1994), p. 139; Witten & Frank (2005)
tm5 | Flat k-means clustering and purity evaluation | Java Concurrency API (CopyOnWriteArrayList, ExecutorService); visualization with Graphviz DOT | TF-IDF vectors and cosine similarity (ir5) | Bloch (2008), Item 68
tm6 | Release engineering | CRISP builds with Ant | All previous code | Clark (2006), Ch. 2
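
Session tm5 combines cosine similarity over TF-IDF vectors with purity evaluation of a flat clustering. The following is a minimal sketch of those two measures, not the course implementation: the class name ClusterEvalSketch, the method names, and the representation of vectors as term-to-weight maps are assumptions made for illustration, and the code assumes Java 9 or later for Map.of and List.of.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the course code base. */
final class ClusterEvalSketch {

    /** Cosine similarity of two sparse term-weight vectors (e.g. TF-IDF). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms; // empty vectors yield 0
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /**
     * Purity: every cluster is assigned its most frequent gold label, and
     * purity is the share of documents that carry that majority label.
     * Each inner list holds the gold labels of the documents in one cluster.
     */
    static double purity(List<List<String>> clusters) {
        int total = 0;
        int majoritySum = 0;
        for (List<String> cluster : clusters) {
            Map<String, Integer> counts = new HashMap<>();
            int best = 0;
            for (String label : cluster) {
                best = Math.max(best, counts.merge(label, 1, Integer::sum));
            }
            total += cluster.size();
            majoritySum += best;
        }
        return total == 0 ? 0.0 : (double) majoritySum / total;
    }

    public static void main(String[] args) {
        // Parallel vectors: similarity is 1 up to floating-point rounding.
        Map<String, Double> d1 = Map.of("text", 1.0, "mining", 2.0);
        Map<String, Double> d2 = Map.of("text", 2.0, "mining", 4.0);
        System.out.println(cosine(d1, d2));

        // 4 of 5 documents carry the majority label of their cluster.
        List<List<String>> clusters = List.of(
                List.of("sport", "sport", "politics"),
                List.of("politics", "politics"));
        System.out.println(purity(clusters)); // prints 0.8
    }
}
```

Purity is easy to compute but rises as the number of clusters grows (one document per cluster gives a purity of 1), so it is best compared across runs with the same k.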


  • Files runnable as Java application and JUnit test for each session can be found in package (X for the session number)
  • To run all tests: run as JUnit test (needs corpora in data/, run as Java application to generate)
  • The Ant script can compile and deploy the code as an executable JAR (ant deploy), generate Javadoc (ant doc), and run the tests (ant test), whose results are summarized in an HTML report (ant report)


