This repository has been archived by the owner on Jul 18, 2023. It is now read-only.

stylometry

Parallel stylometric framework in Python for big data on clusters

Features

  • Parallelized (thus fast)
  • Intended to integrate with a database-based corpus
  • A variety of feature-generation techniques:
    • byte-ngrams
    • word-ngrams
    • readability metrics
    • simple statistics
    • part-of-speech tagging
    • part-of-speech ngrams
    • word/pos hybrids
  • Plugs into a variety of stylometric techniques:
    • ppm-c (compression)
    • dmc (compression)
    • gvc (spam-filter)
    • sofia-ml (machine learning)
  • Some graphing utilities to show performance
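
As a rough illustration of the two simplest feature types listed above, here is a minimal sketch of byte n-gram and word n-gram counting. The function names `byte_ngrams` and `word_ngrams` are hypothetical and are not taken from this framework's code; the whitespace tokenization and lowercasing are assumptions made for brevity.

```python
from collections import Counter


def byte_ngrams(text, n=3):
    """Count overlapping n-grams over the UTF-8 bytes of text."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping n-grams over lowercased, whitespace-split words.

    Assumption: whitespace tokenization; a real pipeline may tokenize differently.
    """
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


if __name__ == "__main__":
    print(byte_ngrams("abab", n=2).most_common())
    print(word_ngrams("the cat saw the cat", n=2).most_common())
```

Each document yields one `Counter` per feature type, so documents can be processed independently, which is what makes this kind of feature generation easy to parallelize.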

We also provide plugs to convert existing corpora into the database format and to export features in SVM-light sparse data format.
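
For readers unfamiliar with the SVM-light sparse format, each line holds a label followed by `index:value` pairs in increasing index order, with zero-valued features omitted. The sketch below shows one way to write such a line; `to_svmlight_line` and the `vocab` mapping are hypothetical names for illustration, not this framework's exporter.

```python
def to_svmlight_line(label, features, vocab):
    """Format one example as an SVM-light line.

    label:    class label (e.g. an author id)
    features: dict mapping feature name -> numeric value
    vocab:    dict mapping feature name -> 1-based feature index (assumed prebuilt)
    """
    # Keep only non-zero features and sort by index, as the format expects.
    pairs = sorted(
        (vocab[name], value) for name, value in features.items() if value != 0
    )
    return " ".join([str(label)] + [f"{index}:{value}" for index, value in pairs])


if __name__ == "__main__":
    vocab = {"cat": 1, "dog": 2, "the": 3}
    print(to_svmlight_line(1, {"the": 2, "cat": 1, "dog": 0}, vocab))
```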

Assumptions

We assume you have lots of RAM, lots of time, lots of CPU cores, or all three.

Haphazard off-the-cuff observed metrics

  • 30 million comments generally take about a day to process for one type of feature
  • 3 million posts generally take about an hour to process for one type of feature
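
To put the figures above on a common scale, the following sketch converts them to items per second. It assumes "about a day" means 24 hours and "about an hour" means 60 minutes, so the results are only rough estimates; `items_per_second` is a hypothetical helper.

```python
def items_per_second(items, seconds):
    """Average throughput for a run that processed `items` in `seconds`."""
    return items / seconds


if __name__ == "__main__":
    # Assumption: "about a day" = 86,400 s, "about an hour" = 3,600 s.
    print(round(items_per_second(30_000_000, 86_400)))  # comments per second
    print(round(items_per_second(3_000_000, 3_600)))    # posts per second
```

Under those assumptions the two figures work out to roughly 350 comments per second and roughly 830 posts per second per feature type.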