Targets for Refactoring #27
Ken's notes: proposed design of a NEW corpus object
(1) The corpus object is an S3 class defined as a special class of list
(2) Corpus list elements:
(3) Index flag (TRUE or FALSE) - gets reset depending on the operation
(3) Note: all options can be overridden when using specific commands (dfm, kwic)
Methods:
corpus(texts, ...) - replaces corpusCreate. Similar to data.frame which
print.corpus(corpus) - displays summary information on a corpus, esp. metadata
summary.corpus(corpus) - details of the texts in a corpus
'+' corpus - concatenates the texts in two corpus objects
index.corpus(corpus) - recompiles the corpus index. Could include counts,
subset.corpus() - as it now exists
sample.corpus(corpus, level=c("sentence", "document", "word", "paragraph"), size, replace=TRUE, prob=NULL)
Extractor/Assignment functions for corpus slots:
documents.corpus(corpus)
metadata.corpus(corpus, level=c("documents", "corpus"))
stopwords.corpus(corpus) - extracts or assigns stopwords associated with corpus
collocations.corpus(corpus)
stemming.corpus(corpus) - TRUE or FALSE flag to be set with corpus
trim.corpus(corpus) - min doc and min word trimming features
encoding(corpus) - set or extract encodings of attribs$texts
dictionary(corpus, name="dictionaryname")
Extractor only (no assignment):
sentences.corpus(corpus) - extracts sentence list from a corpus
Analysis of corpus directly: (also defined for .character whenever applicable)
readability.corpus(corpus, [options])
Manipulation/conversion of corpus:
export.corpus(to=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), from=c("quanteda"), [options])
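To make the '+' method from the notes concrete, here is a minimal sketch of how concatenation could work for an S3 corpus. The helper `new_corpus()` and the single `texts` element are assumptions made for illustration, since the notes do not spell out the list layout.

```r
# Hypothetical stand-in for the corpus object: a list with class "corpus".
# The `texts` element name is an assumption, not the actual design.
new_corpus <- function(texts) {
  structure(list(texts = texts), class = "corpus")
}

# S3 method for the `+` operator: concatenate the texts of two corpora.
`+.corpus` <- function(e1, e2) {
  stopifnot(inherits(e1, "corpus"), inherits(e2, "corpus"))
  new_corpus(c(e1$texts, e2$texts))
}

a <- new_corpus(c(d1 = "one"))
b <- new_corpus(c(d2 = "two", d3 = "three"))
ab <- a + b
names(ab$texts)
```

A real implementation would also need to decide how to merge metadata and how to handle duplicate document names, which this sketch ignores.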
Constructor for corpus as outlined here: http://adv-r.had.co.nz/OO-essentials.html#s3
The constructor should be a generic function named "corpus". If no arguments are passed, getTextsGui can be run.
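A rough sketch of what a generic constructor could look like, following the usual S3 pattern. The argument names (`docvars`, `notes`) and the list elements are assumptions for illustration only. The no-argument branch just raises an error here, since the behaviour of getTextsGui is not described in this thread.

```r
# Generic constructor: dispatches on the class of `x`.
corpus <- function(x, ...) {
  if (missing(x)) {
    # Per the notes, this is where getTextsGui could be run instead.
    stop("no texts supplied")
  }
  UseMethod("corpus")
}

# Method for a character vector of texts.
# The list layout below is hypothetical, not the agreed design.
corpus.character <- function(x, docvars = NULL, notes = NULL, ...) {
  if (is.null(names(x))) {
    names(x) <- paste0("text", seq_along(x))
  }
  obj <- list(
    texts = x,
    docvars = docvars,
    metadata = list(notes = notes, created = Sys.time())
  )
  class(obj) <- c("corpus", "list")
  obj
}

# Short summary line when the object is printed.
print.corpus <- function(x, ...) {
  cat("Corpus consisting of", length(x$texts), "documents.\n")
  invisible(x)
}

my_corpus <- corpus(c("First text.", "Second text."))
print(my_corpus)
```

Making `corpus` a generic leaves room for further methods later (for example one that takes a directory or a data.frame) without changing the calling code.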
Some issues resolved by last hackathon, others distributed into new issues.
Accessor functions:
texts()
words()
data() - (return only the attribs or texts + attribs?)
tokenizedTexts() - I suggest that when we run tokenize() we should store the result in the corpus object and simply retrieve the tokenized texts afterwards
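The suggestion to store tokenized texts in the corpus could be sketched roughly as follows. Everything here is hypothetical: the `tokens` element, the `new_corpus()` helper, and the deliberately naive tokenizer are stand-ins, not the package's actual code.

```r
# Hypothetical corpus layout with a slot for cached tokens.
new_corpus <- function(texts) {
  structure(list(texts = texts, tokens = NULL), class = "corpus")
}

# Accessor for the raw texts.
texts <- function(x) UseMethod("texts")
texts.corpus <- function(x) x$texts

# tokenize() is generic: it works on plain character vectors and on a corpus.
tokenize <- function(x, ...) UseMethod("tokenize")

# Naive tokenizer: lowercase, then split on runs of non-alphanumerics.
tokenize.character <- function(x, ...) {
  toks <- strsplit(tolower(x), "[^[:alnum:]]+")
  names(toks) <- names(x)
  toks
}

# For a corpus, store the result in the object and return the updated corpus.
tokenize.corpus <- function(x, ...) {
  x$tokens <- tokenize(x$texts, ...)
  x
}

# Accessor that retrieves the stored tokens instead of recomputing them.
tokenizedTexts <- function(x) UseMethod("tokenizedTexts")
tokenizedTexts.corpus <- function(x) {
  if (is.null(x$tokens)) stop("run tokenize() on the corpus first")
  x$tokens
}

crp <- new_corpus(c(d1 = "The cat sat.", d2 = "A dog barked!"))
crp <- tokenize(crp)
tokenizedTexts(crp)$d1
```

Because R copies on modification, the caller has to reassign the result (`crp <- tokenize(crp)`) for the cached tokens to persist. Any operation that changes the texts would also need to clear or refresh the stored tokens.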
Generic Functions:
clean() corpus, text, (dfm?)
tokenize() corpus, text
stopwords() corpus, text, dfm
sample() corpus, dfm