Add documents source #1

Closed
lecw opened this issue Jul 14, 2019 · 2 comments
Comments

@lecw
Contributor

lecw commented Jul 14, 2019

What would be useful? Having the source text files of the annotated documents.

They would need to be extracted and formatted, either by hand from the Google Doc "Annotation WL+FG" (or "Documents à annoter"), or programmatically from the JSON resource.

@lecw
Contributor Author

lecw commented Jul 14, 2019

For example, this file contains all the tokenized sentences, which gets us halfway there.

wire_tokenized_sentences.txt

@rali-udem
Owner

Commit 056fd7c now provides excerpts for the original documents, as well as their source URLs. Tokenization is left to the reader, which is not an enormous task given the number of documents.
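
For readers who want a starting point, here is a minimal sketch of such a tokenization pass in Python. It assumes the excerpts are plain text and uses a simple regex rule. The pattern and the function names `tokenize` and `tokenize_lines` are illustrative assumptions, not the tokenizer that produced wire_tokenized_sentences.txt, so its output may not match that file exactly.

```python
import re

# Assumed rule (hypothetical, not the project's tokenizer): a token is either
# a run of word characters, optionally joined by internal hyphens or
# apostrophes, or a single non-space, non-word symbol such as punctuation.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")


def tokenize(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def tokenize_lines(lines):
    """Tokenize each non-empty line and join its tokens with single spaces."""
    return [" ".join(tokenize(line)) for line in lines if line.strip()]
```

Calling `tokenize_lines` on the lines of an excerpt yields one space-separated token string per non-empty line. A dedicated NLP tokenizer would likely handle abbreviations, URLs, and numbers more carefully than this rule does.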
