
@wareya wareya released this Aug 12, 2017 · 1 commit to master since this release

Use 64-bit Java if the analyzer gets stuck in initialization.

Update: The user dictionary has been updated. If you downloaded this release before the update, please download the new user dictionary file from the attachments. It fixes a problem with Sino-Japanese words and adds some important overrides for family member names. The updated user dictionary is also included in the archive.

Added functionality that makes the analyzer easier to use for mining. It was contributed by another author and is detailed in #1.


@wareya wareya released this Jul 22, 2017 · 9 commits to master since this release

  • Allow analyzing frequency by lexeme or word, not just individual spellings
  • Analyzing by lexeme or word causes word or spelling information (respectively) to be pulled out into auxiliary columns
  • Normalizer understands this but does not normalize word/spelling occurrences, just adds them
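
The grouping described above can be pictured with a small sketch. It is not the analyzer's actual code: the token dictionaries, the `frequency_table` name, and the assumed lexeme > word > spelling hierarchy are all illustrative. It counts occurrences at the chosen level and collects the next finer level into an auxiliary column.

```python
from collections import Counter, defaultdict

# Assumed hierarchy, coarsest to finest. The real analyzer may define
# these levels differently; this is only an illustration.
LEVELS = ("lexeme", "word", "spelling")


def frequency_table(tokens, level="spelling"):
    """Count tokens at `level` and gather the next finer level as an auxiliary column.

    `tokens` is an iterable of dicts with the keys listed in LEVELS
    (a hypothetical stand-in for the analyzer's token objects).
    Returns a list of (key, count, auxiliary) rows, most frequent first.
    """
    idx = LEVELS.index(level)
    finer = LEVELS[idx + 1] if idx + 1 < len(LEVELS) else None
    counts = Counter()
    variants = defaultdict(Counter)
    for tok in tokens:
        key = tok[level]
        counts[key] += 1
        if finer is not None:
            variants[key][tok[finer]] += 1
    return [(key, n, dict(variants.get(key, {}))) for key, n in counts.most_common()]


# Made-up sample tokens, not output from the analyzer.
sample = [
    {"lexeme": "見る", "word": "見る", "spelling": "見る"},
    {"lexeme": "見る", "word": "見る", "spelling": "みる"},
    {"lexeme": "見る", "word": "観る", "spelling": "観る"},
]

for row in frequency_table(sample, level="lexeme"):
    print(row)
```

Counting by spelling leaves the auxiliary column empty, which matches the idea that only the coarser modes pull extra information out.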


@wareya wareya released this May 28, 2017 · 12 commits to master since this release

  • Replace particle, number, and name filter with custom user filters (userfilter.csv).
  • Allow command line interface to use user dictionary (enabled by default, like it is in the GUI)
  • Normalizer/merger still included


userfilter.csv: Format is BaseWrittenForm, POS1, POS2, POS3, POS4. Words that match all listed fields are ignored. Unused fields are blank. Trailing fields can be omitted.
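
As a rough illustration of that matching rule, the sketch below parses filter rows and tests a token against them. It is not the analyzer's implementation: the function names, the token dictionary shape, and the sample part-of-speech values are assumptions made for the example, and the sample rows are not taken from the shipped userfilter.csv.

```python
import csv
import io

# Column order as described in the release note.
FIELDS = ("base", "pos1", "pos2", "pos3", "pos4")


def load_filters(text):
    """Parse filter rows; omitted trailing fields are padded with blanks."""
    rules = []
    for row in csv.reader(io.StringIO(text)):
        cells = [c.strip() for c in row]
        if not any(cells):
            continue  # skip empty lines and rows with no fields set
        cells = (cells + [""] * len(FIELDS))[: len(FIELDS)]
        rules.append(dict(zip(FIELDS, cells)))
    return rules


def is_ignored(token, rules):
    """True if every non-blank field of at least one rule matches the token."""
    return any(
        all(value == "" or token.get(field) == value for field, value in rule.items())
        for rule in rules
    )


# Illustrative rows only: blank first field means "any base written form".
example_csv = ",助詞\n,名詞,数詞\n"
rules = load_filters(example_csv)

particle = {"base": "は", "pos1": "助詞", "pos2": "係助詞", "pos3": "", "pos4": ""}
noun = {"base": "猫", "pos1": "名詞", "pos2": "普通名詞", "pos3": "一般", "pos4": ""}
print(is_ignored(particle, rules), is_ignored(noun, rules))
```

Treating a blank field as a wildcard is what lets one short row, such as a row that only sets POS1, filter an entire part of speech.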


@wareya wareya released this May 17, 2017 · 14 commits to master since this release

Analyzer:

  • Add option to load a user dictionary from userdict.csv. Example userdict.csv contains 自転車.
  • Names dictionary still baked in, will remove it and make a second download later.

Normalizer:

  • Remove outliers instead of trying to adjust distribution. Simpler, works better, smaller impact on resulting distribution.

Pre-release

@wareya wareya released this May 15, 2017 · 15 commits to master since this release

Analyzer:

  • Output lemma to avoid criss-crossing completely homophonous terms with the same grammatical categories

Normalizer/merger:

  • Let adding single-file inputs work
  • Make deskew operate per input list
  • Final exponent now manually set

Pre-release

@wareya wareya released this Apr 7, 2017 · 18 commits to master since this release

Adds several megabytes of proper names to the dictionary as dummy entries with worst-case-scenario weight. They have no pronunciation or part of speech information, but they allow kuromoji to generate better segmentations, and therefore a more accurate frequency list, without including the monstrosity that is neologd.

If you hang or run out of memory, make sure you're using 64-bit Java.

Use the companion program to combine lists made from different sources: https://github.com/wareya/normalizer

Pre-release

@wareya wareya released this Apr 6, 2017 · 19 commits to master since this release

  • Fixed mistake where part of speech filter wasn't catching proper names
  • Add second branch working off the neologd dictionary instead of the kanaaccent one.

The neologd version uses an indev version of kuromoji.

If your corpus is reasonably large (hundreds of megabytes or larger), you want analyzer.zip.

If your corpus is small (tens of megabytes or smaller) and contains a high density of proper names (novels, VNs, etc.), you want analyzer_neologd.zip.


EDIT: There was an issue with analyzer.jar in analyzer.zip in this release. If you downloaded it within the first 10 minutes after the release, please redownload it if you have any problems with the part of speech filter.

Pre-release

@wareya wareya released this Apr 2, 2017 · 23 commits to master since this release

Fix GUI file selection bug

Pre-release

@wareya wareya released this Mar 22, 2017 · 26 commits to master since this release

Add a basic GUI

Known GUI bug: the GUI only works with files in the same folder as the program. The CLI does not have this issue.

Pre-release

@wareya wareya released this Mar 22, 2017 · 27 commits to master since this release

Add option to count the line of the initial occurrence of each term
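
A minimal sketch of what tracking the initial occurrence could look like follows. It assumes 1-based line numbers and uses whitespace splitting as a stand-in tokenizer; the real analyzer segments text with kuromoji, and the function name here is hypothetical.

```python
from collections import Counter


def count_with_first_line(lines, tokenize=str.split):
    """Return (counts, first_line) for the terms found in `lines`.

    `first_line[term]` is the 1-based number of the line where the term
    first appears. `tokenize` defaults to whitespace splitting, which is
    only a placeholder for a real morphological analyzer.
    """
    counts = Counter()
    first_line = {}
    for lineno, line in enumerate(lines, start=1):
        for term in tokenize(line):
            counts[term] += 1
            # setdefault keeps the earliest line and ignores later sightings.
            first_line.setdefault(term, lineno)
    return counts, first_line


# Made-up, pre-segmented sample text.
text = ["猫 が 好き", "犬 が 好き", "猫 と 犬"]
counts, first_line = count_with_first_line(text)
for term, n in counts.most_common():
    print(term, n, first_line[term])
```

Recording the first line alongside the count makes it possible to sort a frequency list by order of appearance in the source text.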
