GitHub - xiaohan2012/capitalization-restoration-train: Training code for the capitalization restorer

Producing new data set for CRF classifier

Use the working script new_data_pipeline.sh, which runs the steps listed below.

Or run the following steps by hand (not recommended):

1. python print_filenames_and_titles.py: get the file paths and news titles that meet our requirements (non-monocase title and non-empty article body)
2. python copy_puls_file_to_local.py: copy the files to a location that is writable and accessible
3. python extract_doc_ids.py: save the IDs of the documents to be used
4. puls-core-process-document.sh: preprocess the documents with PULS; this generates the .auxil files
5. process_and_save_capitalized_headlines.py: save the malformed headlines somewhere
6. make_puls_data.py: extract the features for the CRF classifier
7. train_puls_model.sh: train the model

Producing new data set for rule-based classifier

The process has two parts. The first is shared with the data creation process for the CRF classifier (steps 1 to 5).

The second outputs the labels in separate files for the rule-based classifier to use.

Run make_rule_based_corpus.sh

Evaluation

CRF classifier evaluation

Refer to the comments in train_puls_model.sh and comment or uncomment the relevant lines to run the evaluation.

Intermediate performance statistics (to be processed later) are saved to the target paths specified in that script.

Rule-based classifier evaluation

Do the following:

Change the variables in puls-rule-based-parallel.sh if needed
Run puls-rule-based-parallel.sh to process the evaluation data with the IE rule-based capitalization recovery tool
Go to the directory specified by the $result_dir variable in puls-rule-based-parallel.sh and concatenate all the result files (those starting with id_) into a single result file
Run python evaluate.py to print the result matrix, where each row holds the statistics for one label and the columns are the number of matches, the number of predictions made by the model, and the number of true labels
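
The result matrix described in the last step can be illustrated with a minimal sketch. This is not the code in evaluate.py; the function name `result_matrix` and the label names (IC, AL, AU) are hypothetical and chosen only for the example.

```python
from collections import Counter


def result_matrix(gold, pred):
    """Return {label: (n_match, n_predicted, n_actual)} for two label sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    match, predicted, actual = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        actual[g] += 1       # how often the label occurs in reality
        predicted[p] += 1    # how often the model predicted the label
        if g == p:
            match[g] += 1    # prediction agrees with the true label
    labels = sorted(set(actual) | set(predicted))
    return {lab: (match[lab], predicted[lab], actual[lab]) for lab in labels}


# Hypothetical labels: IC = initial capital, AL = all lowercase, AU = all uppercase
gold = ["IC", "AL", "AL", "AU", "IC"]
pred = ["IC", "AL", "IC", "AU", "AL"]
matrix = result_matrix(gold, pred)
for label, row in matrix.items():
    print(label, *row)
```

Each printed row corresponds to one label, with the three columns in the order given above.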

Shortcut:

Once the data preparation is done, evaluating the rule-based classifier only requires running the steps above plus the final score post-processing.

Post processing

Both evaluation scripts print intermediate results (such as the number of correct predictions and the support) from which the final scores are computed. Replace the data in calc_cv_result.py as described in the comments in that script, then run it.
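
As a rough illustration of how final scores can be derived from such intermediate counts, the sketch below pools per-fold counts and computes precision, recall and F1. It is an assumption, not the logic of calc_cv_result.py: the source does not say whether counts are pooled or per-fold scores are averaged, and the function names and the counts are made up.

```python
def pool_folds(fold_counts):
    """Sum (n_match, n_predicted, n_actual) triples over cross-validation folds."""
    return tuple(sum(column) for column in zip(*fold_counts))


def precision_recall_f1(n_match, n_predicted, n_actual):
    """Compute precision, recall and F1 from raw counts, guarding against zero division."""
    precision = n_match / n_predicted if n_predicted else 0.0
    recall = n_match / n_actual if n_actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for one label over two folds: (match, predicted, actual)
folds = [(8, 10, 12), (9, 10, 8)]
pooled = pool_folds(folds)
p, r, f = precision_recall_f1(*pooled)
print(pooled, round(p, 3), round(r, 3), round(f, 3))
```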

Trainable document ID path

Documents are filtered by whether their title is trainable (correctly capitalized) and whether their body is non-empty. The list of document IDs is saved under data/tmp/2015-08-18/filtered_trainable_doc_ids.txt
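
The filter criterion can be sketched as follows. This is a simplified assumption of what "trainable" means (a title containing both uppercase and lowercase letters, plus a non-empty body), not the repository's actual filtering code; `is_trainable` and the sample documents are hypothetical.

```python
def is_trainable(title, body):
    """True if the title is not monocase and the body contains non-whitespace text."""
    has_upper = any(c.isupper() for c in title)
    has_lower = any(c.islower() for c in title)
    return has_upper and has_lower and bool(body.strip())


# Hypothetical documents: id -> (title, body)
docs = {
    "d1": ("Nokia Shares Rise After Earnings", "Nokia said on Tuesday ..."),
    "d2": ("NOKIA SHARES RISE", "Some body text."),   # monocase title
    "d3": ("Nokia Shares Rise", "   "),               # empty body
}
trainable_ids = [doc_id for doc_id, (title, body) in docs.items()
                 if is_trainable(title, body)]
print(trainable_ids)
```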

Printing error example

For the CRF classifier, pred_err.py prints the error examples as well as the confusion matrix.

> # Example: python pred_err.py --model ${model_path} --sent_path ${test_sentence_path} --crfsuite_path ${test_sentence_feature_path}
> DATA_ROOT=/cs/taatto/home/hxiao/capitalization-recovery
> python pred_err.py --model ${DATA_ROOT}/result/feature/cap/1+2+3+4+5+6/model --sent_path ${DATA_ROOT}/corpus/news_title_cap/30000/test.txt --crfsuite_path ${DATA_ROOT}/result/feature/cap/1+2+3+4+5+6/test.crfsuite.txt

For the rule-based classifier, evaluate.py plays the same role. Note that you need to set print_errors=True when calling eval_rule_based in the evaluate.py script.
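
For readers unfamiliar with this kind of output, here is a minimal sketch of collecting a confusion matrix and a list of error examples from token-level predictions. It does not reproduce pred_err.py or evaluate.py; the function name, the label names and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def confusion_and_errors(tokens, gold, pred):
    """Count (gold, predicted) label pairs and collect the mislabeled tokens."""
    confusion = Counter(zip(gold, pred))
    errors = [(tok, g, p) for tok, g, p in zip(tokens, gold, pred) if g != p]
    return confusion, errors


# Hypothetical labels: IC = initial capital, AL = all lowercase, MX = mixed case
tokens = ["Nokia", "shares", "rise", "on", "TSX-Venture"]
gold = ["IC", "AL", "AL", "AL", "MX"]
pred = ["IC", "AL", "IC", "AL", "IC"]

confusion, errors = confusion_and_errors(tokens, gold, pred)
for (g, p), count in sorted(confusion.items()):
    print(f"gold={g} pred={p} count={count}")
for tok, g, p in errors:
    print(f"error: {tok!r} should be {g}, predicted {p}")
```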

TODO

Add more features to handle mixed-case words such as TSX-Venture, or split such words at the hyphen
In capitalized titles (where more information is preserved), some words are already all-uppercase or mixed-case. The dictionary feature does not take mixed-case words into account.
Spelling/morphology, e.g. funds = fund + s
The POS tag for capitalized words tends to be NNP. Maybe lowercase the sentence and capitalize it?
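
The hyphen-splitting idea in the first item could look roughly like the sketch below. It is only one possible approach, not an existing feature of this repository; `case_shape`, `hyphen_part_shapes` and the shape labels are hypothetical.

```python
def case_shape(word):
    """Classify a word's casing: AU (all upper), AL (all lower), IC (initial cap), MX (mixed)."""
    if word.isupper():
        return "AU"
    if word.islower():
        return "AL"
    if word[:1].isupper() and word[1:].islower():
        return "IC"
    return "MX"


def hyphen_part_shapes(word):
    """Split a word at hyphens and classify the casing of each non-empty part."""
    return [case_shape(part) for part in word.split("-") if part]


word = "TSX-Venture"
print(case_shape(word))          # shape of the whole token
print(hyphen_part_shapes(word))  # shapes of the hyphen-separated parts
```

Taken as a whole, the token is mixed-case, while its parts have simpler shapes that a classifier may handle more easily.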

