Refactor #6

Merged
kmike merged 37 commits into scrapinghub:master on Feb 12, 2014

Member

kmike commented Dec 13, 2013

This PR contains a lot of breaking changes. Check the new README.rst to get an overview.

@tpeng could you check it and merge?

Member

kmike commented Dec 13, 2013

FYI: this is a connection graph of various labels (thresholded by transition weights; layout is also based on the weights):

[image: connection graph of labels]

Contributor

tpeng commented Dec 23, 2013

the change looks good. here are my comments:

  • HtmlFeaturesExtractor is gone, but WapitiChunker is not updated to use the new HtmlFeatureExtractor and HtmlTokenizer
  • NerFeatureExtractor in train-token-model2 looks generic, can we add it to the source?
  • we're missing an example global feature function. it's a little confusing there.
Member

kmike commented Dec 23, 2013

Hey Terry,

You're right, there are some unfinished parts here:

  1. WapitiChunker is not updated - I'll update it this week;
  2. as for global feature functions, there is the https://github.com/kmike/webstruct/tree/gazetteers branch in my fork that uses some of them to match multi-word tokens. I think that before making the example we should use it ourselves.
  3. the training process is complicated now, and we need something like NerFeatureExtractor for sure; I tried to make it generic, but I don't feel it is good enough;
  4. I don't like that it is HtmlTokenizer that extracts the subset of tags from annotated data, because if you want to check which tags work better you can't reuse the tokenization results.
  5. I just realized I reintroduced the tags/labels confusion: sometimes variables named "tags" mean "HTML tags", sometimes they mean "tags from our tagset". I think it is better to consistently use the name "labels" for our tagset and "html_tags" for HTML tags. What do you prefer?
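
One way to picture points 4 and 5 together: tokenization keeps every label from the annotated data, and a separate step narrows the label set afterwards, so the tokenization results can be reused across experiments. This is only a rough sketch, not webstruct code. It assumes labels are IOB-style strings such as "B-ORG", and `restrict_labels` and the entity names are hypothetical.

```python
def restrict_labels(labels, kept_entities):
    """Map labels whose entity type is not in kept_entities to "O".

    Assumes IOB-style labels ("B-ORG", "I-ORG", "O"); this encoding is an
    assumption made for illustration, not something stated in this thread.
    """
    result = []
    for label in labels:
        if label == "O":
            result.append(label)
            continue
        # "B-ORG" -> prefix "B", entity "ORG"
        _prefix, _sep, entity = label.partition("-")
        result.append(label if entity in kept_entities else "O")
    return result


# Tokenize once with the full tagset, then try different label subsets.
all_labels = ["B-ORG", "I-ORG", "O", "B-TEL"]
org_only = restrict_labels(all_labels, {"ORG"})
org_and_tel = restrict_labels(all_labels, {"ORG", "TEL"})
```

Here "labels" always refers to the tagset, which matches the naming proposed in point 5; HTML tag names would live in separately named variables such as `html_tags`.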
Member

kmike commented Dec 27, 2013

TODO:

  • WapitiCRF should handle wapiti model files better, because it is hard to use/move an unpickled model (keep them in memory and write them to temporary locations for wapiti?);
  • update README;
  • original wapiti C binary is required for training - check if we can create a transparent pip-installable wrapper.
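
The first TODO item (keep the model in memory, write it to a temporary location only when wapiti needs a file) could look roughly like the sketch below. It is an illustration under assumptions, not the WapitiCRF implementation: `InMemoryModel` and its methods are hypothetical names, and the model is treated as opaque bytes.

```python
import os
import tempfile
from contextlib import contextmanager


class InMemoryModel:
    """Hypothetical holder for a trained model's raw bytes (not webstruct API)."""

    def __init__(self, model_bytes):
        self.model_bytes = model_bytes

    @classmethod
    def from_file(cls, path):
        # Read the model file produced by training into memory.
        with open(path, "rb") as f:
            return cls(f.read())

    @contextmanager
    def as_tempfile(self):
        # Write the bytes to a temporary file for tools that need a path,
        # and remove the file when the block exits.
        fd, path = tempfile.mkstemp(suffix=".model")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(self.model_bytes)
            yield path
        finally:
            os.remove(path)


model = InMemoryModel(b"placeholder model data")
with model.as_tempfile() as model_path:
    # A file-based tool would be pointed at model_path here.
    size_on_disk = os.path.getsize(model_path)
```

Because the object stores only bytes, it no longer depends on a file at a fixed path, which is the property that makes a saved model easier to move between machines.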

kmike added a commit that referenced this pull request Feb 12, 2014

kmike merged commit 9064c58 into scrapinghub:master Feb 12, 2014
