
Targets for Refactoring #27

Closed
pnulty opened this issue Aug 11, 2014 · 3 comments

Comments

pnulty (Collaborator) commented Aug 11, 2014

Accessor functions:

texts()
words()
data() - (should this return only the attribs, or the texts plus attribs?)
tokenizedTexts() - I suggest that when we run tokenize(), we store the result in the corpus object so that the tokenized texts can simply be retrieved afterwards

Generic Functions:
clean() corpus, text, (dfm?)
tokenize() corpus, text
stopwords() corpus, text, dfm
sample() corpus, dfm
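A minimal base-R sketch of how one of these generics could dispatch across corpus and character objects (none of these functions exist yet; the bodies and the `attribs$texts` slot name follow the proposal below and are illustrative only):

```r
# clean() as an S3 generic with character and corpus methods -- a sketch
clean <- function(x, ...) UseMethod("clean")

# character method: lower-case, strip punctuation, optionally strip numbers
clean.character <- function(x, removeNumbers = TRUE, ...) {
    x <- tolower(x)
    x <- gsub("[[:punct:]]", "", x)
    if (removeNumbers) x <- gsub("[[:digit:]]", "", x)
    x
}

# corpus method: apply the character method to the stored texts
clean.corpus <- function(x, ...) {
    x$attribs$texts <- clean.character(x$attribs$texts, ...)
    x
}
```

The same pattern (one generic, one method per object type) would cover tokenize(), stopwords(), and sample().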

pnulty (Collaborator, Author) commented Aug 11, 2014

Ken's notes:

Proposed the design of a NEW corpus object

(1) The corpus object is an S3 class defined as a special class of list

(2) Corpus list elements:
a) data.frame of documents, called attribs (as now) consisting of:
i. texts
vector of the texts in the corpus, with an Encoding() flag set on
each element
ii. user-defined variables associated with each document
iii. row.names(attribs) will be a unique key of document names
b) data.frame of document-level metadata
automatically defined or defined by the user. row.names correspond to those
in the documents ("attribs") data.frame
- original file name
- source (disk, assignment, etc.)
- notes
- LANGUAGE
- optional info from the "Dublin Scheme"
c) list of corpus-level meta-data, including
- notes
- citation information
- creation details
d) user-supplied variables-level meta-data
- details on each user-defined "attribute"
e) collocations. List of word sequences that will be treated as single
types when extracting word-based features
f) dictionaries. Named list of dictionaries associated with the corpus
g) stopwords. List of character elements associated with the stopwords
h) stemming. TRUE or FALSE depending on whether to use stemming with this corpus
i) clean rules, such as punctuation/number/case handling

(3) Index flag (TRUE or FALSE) - gets reset depending on the operation

(4) Note: all options can be overridden when using specific commands (dfm, kwic)
but the settings will determine the defaults. This is for replication purposes
and convenience if a user determines that for a corpus, there should be a
"standard" set of option settings.
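The list structure above could be sketched as a constructor roughly like this (a hypothetical sketch only: the field names `attribs`, `docmeta`, `metadata`, etc. are illustrative, not a final API):

```r
# Sketch of a constructor building the proposed corpus list structure
corpus <- function(texts, docnames = NULL, notes = NULL) {
    if (is.null(docnames)) docnames <- paste0("text", seq_along(texts))
    attribs <- data.frame(texts = texts, row.names = docnames,
                          stringsAsFactors = FALSE)
    structure(list(attribs  = attribs,                          # (a) documents
                   docmeta  = data.frame(row.names = docnames), # (b) doc-level metadata
                   metadata = list(notes = notes,               # (c) corpus-level metadata
                                   created = date()),
                   collocations = character(0),                 # (e)
                   dictionaries = list(),                       # (f)
                   stopwords = character(0),                    # (g)
                   stemming = FALSE,                            # (h)
                   indexed = FALSE),                            # index flag, point (3)
              class = "corpus")
}
```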

Methods:

corpus(texts, ...) <- replaces corpusCreate. Similar to data.frame which
creates a data.frame. Would be nice to combine existing
functions into options (for reading from file= or
directory= etc options)

print.corpus(corpus) displays summary information on a corpus, esp. metadata
and citation information and current settings for things
like collocations, stemming, dictionaries, etc.

summary.corpus(corpus) details of the texts in a corpus

'+' corpus concatenate texts in two corpus objects
union of meta-data, first gets priority
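The '+' operator could look something like the following sketch, assuming the `attribs` and `metadata` slots proposed above; `modifyList()` applied this way gives the first corpus's metadata priority:

```r
# Sketch of the proposed '+' operator for corpus objects
"+.corpus" <- function(c1, c2) {
    # concatenate the documents
    c1$attribs <- rbind(c1$attribs, c2$attribs)
    # union of corpus-level metadata, first corpus gets priority
    c1$metadata <- modifyList(c2$metadata, c1$metadata)
    c1
}
```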

index.corpus(corpus) recompiles the corpus index. Could include counts,
word syllable counts, document, paragraph, and sentence
locations. Or POS for each word.

subset.corpus() as it now exists

sample.corpus(corpus, level=c("sentence", "document", "word", "paragraph"), size, replace=TRUE, prob=NULL)
for producing a sample of texts and meta-data from a corpus where the resampling
of the texts is performed at the "level" option. Meta-data is matched to the
sampled document units.
sample.character(characterVector) core engine of sample.corpus
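One wrinkle worth noting: base::sample() is not an S3 generic, so the package would need to define its own generic with a default method falling through to base R. A sketch of sample.character as the core engine:

```r
# Make sample() generic, keeping base behaviour as the default
sample <- function(x, ...) UseMethod("sample")
sample.default <- function(x, ...) base::sample(x, ...)

# Core engine: resample the elements of a character vector
sample.character <- function(x, size = length(x), replace = TRUE, prob = NULL) {
    x[base::sample(seq_along(x), size = size, replace = replace, prob = prob)]
}
```

sample.corpus would then call sample.character on the texts at the chosen level and subset the metadata rows to match.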

Extractor/Assignment functions for corpus slots:


documents.corpus(corpus)
extracts or assigns the texts (same as current getTexts())

metadata.corpus(corpus, level=c("documents", "corpus"))
extracts or assigns corpus metadata

stopwords.corpus(corpus) extracts or assigns stopwords associated with corpus

collocations.corpus(corpus)
extracts or assigns collocations to be treated as "features"
when extracting features from the corpus

stemming.corpus(corpus) TRUE or FALSE flag to be set with corpus

trim.corpus(corpus) min doc and min word trimming features

encoding(corpus) set or extract encodings of attribs$texts

dictionary(corpus, name="dictionaryname")
to extract or set the dictionaries associated with corpus
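In R, "extracts or assigns" means pairing a getter with a replacement function. A sketch of the pattern for stopwords, assuming the `stopwords` slot proposed above (the other extractor/assignment pairs would follow the same shape):

```r
# Getter
stopwords <- function(x, ...) UseMethod("stopwords")
stopwords.corpus <- function(x, ...) x$stopwords

# Replacement function, enabling stopwords(mycorpus) <- c("the", "a", "an")
"stopwords<-" <- function(x, value) UseMethod("stopwords<-")
"stopwords<-.corpus" <- function(x, value) {
    x$stopwords <- value
    x
}
```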

Extractor only (no assignment):


sentences.corpus(corpus) extract sentence list from a corpus
words/vocabulary.corpus(corpus) extract list of word types from a corpus (given settings)

Analysis of corpus directly: (also defined for .character whenever applicable)


readability.corpus(corpus, [options])
kwic.corpus(corpus, [options])
collocations.corpus(corpus, [options])

Manipulation/conversion of corpus


export.corpus(to=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), from=c("quanteda"), [options])
import.corpus(from=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), to=c("quanteda"))

pnulty (Collaborator, Author) commented Aug 12, 2014

Constructor for corpus as outlined here: http://adv-r.had.co.nz/OO-essentials.html#s3

The constructor should be a generic function named "corpus". If no arguments are passed, getTextsGui can be run.
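A short sketch of that dispatch pattern, with the no-argument GUI fallback (getTextsGui is the existing interactive reader mentioned above; the method body is illustrative):

```r
# "corpus" as a generic constructor; no arguments launches the GUI reader
corpus <- function(x, ...) {
    if (missing(x)) return(getTextsGui())
    UseMethod("corpus")
}

# character method: build a corpus from a vector of texts
corpus.character <- function(x, ...) {
    structure(list(attribs = data.frame(texts = x, stringsAsFactors = FALSE)),
              class = "corpus")
}
```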

kbenoit (Collaborator) commented May 5, 2015

Some issues resolved by last hackathon, others distributed into new issues.

@kbenoit kbenoit closed this as completed May 5, 2015