Refactor #6

Merged: 37 commits, Feb 12, 2014
37 commits
819cfce
training data fixes
kmike Dec 13, 2013
31a85fc
big refactoring
kmike Dec 13, 2013
01949fa
split features into token features and global features
kmike Dec 17, 2013
8e3f7e3
make HtmlToken.token unicode
kmike Dec 17, 2013
61553a6
«tag» now means NER tag
kmike Dec 24, 2013
ccaf723
HtmlLoader
kmike Dec 24, 2013
f7efc8b
added support for WebAnnotator > 1.14 title annotation feature
kmike Dec 25, 2013
848e94b
break interface again: fit/transform methods now accepts multiple seq…
kmike Dec 25, 2013
e1257eb
tokenization changes: split by «|», make tokenizer aware of some unic…
kmike Dec 26, 2013
2f5d34d
one more tokenization fix
kmike Dec 26, 2013
9867983
trainer for Wapiti CRF models
kmike Dec 26, 2013
00c328e
make HtmlTokenizer and HtmlFeatureExtractor work on lists of trees by…
kmike Dec 26, 2013
5a5b455
add load_trees helper for bulk loading data
kmike Dec 26, 2013
7f5ecb1
add WapitiCRF to top-level exports
kmike Dec 26, 2013
02b79f5
update requirements.txt
kmike Dec 26, 2013
00bdf22
remove WapitiChunker; add transform and score methods to WapitiCRF
kmike Dec 26, 2013
a09b0a4
attributes are renamed to fix serialization and __repr__
kmike Dec 27, 2013
8942108
HtmlLoader cleanup
kmike Dec 27, 2013
fcf9840
smart_join utility function
kmike Dec 27, 2013
0bf420e
add support for auto-extracting dev data for wapiti training
kmike Dec 27, 2013
8917fba
move load_trees to the bottom and expose it in webstruct top-level na…
kmike Dec 27, 2013
874a2d5
a couple of helpers for easier training and prediction
kmike Dec 27, 2013
89e5779
smarter smart_join
kmike Dec 27, 2013
04e1728
improved docstring for IobEncoder.group
kmike Jan 9, 2014
113ae9c
extract_raw method for model.NER
kmike Jan 9, 2014
62131d9
heuristic algorithm for grouping entities into clusters
kmike Jan 10, 2014
e809864
minor docstring fix
kmike Jan 10, 2014
3892a7c
[wip] gazetteers support
kmike Dec 17, 2013
81a567c
Drop prebuilt gazetteer features; better utils for creating own gazet…
kmike Jan 13, 2014
a262e0b
minor docstring fix for geonames.read_geonames; extract csv parameter…
kmike Jan 14, 2014
e2eb818
utility for reading zipped geonames files
kmike Jan 14, 2014
b205c04
import pandas only on demand
kmike Jan 14, 2014
ab797de
don’t remove forms and annoying tags by default
kmike Jan 18, 2014
0111de9
support passing LongestMatch instances to LongestMatchGlobalFeature
kmike Jan 29, 2014
6d8793c
split token_shape feature function into several smaller functions
kmike Jan 29, 2014
945f219
split prefix and suffix features
kmike Jan 29, 2014
545ae30
cut-off support for HtmlFeatureExtractor
kmike Jan 30, 2014
2 changes: 1 addition & 1 deletion README.rst
@@ -5,7 +5,7 @@ This package contains a library for extracting contact information from
 HTML pages.

 Supported functionality (so far)
-------------
+--------------------------------

 - American contact information extraction
 - Netherlands open hours extraction
2 changes: 1 addition & 1 deletion notebooks/train-token-model.ipynb
@@ -18,7 +18,7 @@
 "import random\n",
 "import itertools\n",
 "\n",
-"sys.path.insert(0, '../../webstruct/')\n",
+"sys.path.insert(0, '..')\n",
 "\n",
 "import wapiti\n",
 "from webstruct.feature_extraction import HtmlFeaturesExtractor\n",
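The one-line notebook change above replaces `'../../webstruct/'` with `'..'` on `sys.path`, presumably so that `webstruct` is imported as a package from the repository root. A relative `sys.path` entry is resolved against the current working directory, so it only works when the notebook is started from `notebooks/`. A minimal sketch of a variant that resolves the path to an absolute one at the time of the call, so that later changes of working directory do not affect imports. The helper name `add_parent_to_path` is hypothetical and is not part of webstruct or of this pull request:

```python
import os
import sys


def add_parent_to_path(start_dir="."):
    """Put the absolute parent directory of start_dir at the front of sys.path.

    Hypothetical helper for illustration only; not part of webstruct.
    Returns the absolute path that is (now) on sys.path.
    """
    # Resolve "<start_dir>/.." to a normalized absolute path right away.
    parent = os.path.abspath(os.path.join(start_dir, os.pardir))
    # Avoid stacking duplicate entries when a notebook cell is re-run.
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


if __name__ == "__main__":
    # When run from notebooks/, this makes the repository root importable.
    print(add_parent_to_path())
```

Called with no arguments from `notebooks/`, this has the same effect as `sys.path.insert(0, '..')`, except that the stored entry is absolute and is not added twice.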