
Targets for Refactoring #27

Closed
pnulty opened this issue Aug 11, 2014 · 3 comments

Comments

pnulty (Collaborator) commented Aug 11, 2014

Accessor functions:

texts()
words()
data() - (should this return only the attribs, or the texts plus attribs?)
tokenizedTexts() - I suggest that when we run tokenize(), we store the result in the corpus object so that the tokenized texts can simply be retrieved afterwards

Generic Functions:
clean() corpus, text, (dfm?)
tokenize() corpus, text
stopwords() corpus, text, dfm
sample() corpus, dfm
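A minimal base-R sketch of how one of these generics could dispatch across corpus and character objects (none of these functions exist yet; the bodies and the `attribs$texts` slot name follow the proposal below and are illustrative only):

```r
# clean() as an S3 generic with character and corpus methods -- a sketch
clean <- function(x, ...) UseMethod("clean")

# character method: lower-case, strip punctuation, optionally strip numbers
clean.character <- function(x, removeNumbers = TRUE, ...) {
    x <- tolower(x)
    x <- gsub("[[:punct:]]", "", x)
    if (removeNumbers) x <- gsub("[[:digit:]]", "", x)
    x
}

# corpus method: apply the character method to the stored texts
clean.corpus <- function(x, ...) {
    x$attribs$texts <- clean.character(x$attribs$texts, ...)
    x
}
```

The same pattern (one generic, one method per object type) would cover tokenize(), stopwords(), and sample().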

pnulty (Collaborator, Author) commented Aug 11, 2014

Ken's notes:

Proposed the design of a NEW corpus object

(1) The corpus object is an S3 class defined as a special class of list

(2) Corpus list elements:
a) data.frame of documents, called attribs (as now) consisting of:
i. texts
vector of the texts in the corpus, with an Encoding() flag set on
each element
ii. user-defined variables associated with each document
iii. row.names(attribs) will be a unique key of document names
b) data.frame of document-level metadata
automatically defined or defined by the user. row.names correspond to those
in the documents ("attribs") data.frame
- original file name
- source (disk, assignment, etc.)
- notes
- LANGUAGE
- optional info from the "Dublin Scheme"
c) list of corpus-level meta-data, including
- notes
- citation information
- creation details
d) user-supplied variables-level meta-data
- details on each user-defined "attribute"
e) collocations. List of word sequences that will be treated as single
types when extracting word-based features
f) dictionaries. Named list of dictionaries associated with the corpus
g) stopwords. List of character elements associated with the stopwords
h) stemming. TRUE or FALSE depending on whether to use stemming with this corpus
i) clean rules, such as punctuation/number/case handling

(3) Index flag (TRUE or FALSE) - gets reset depending on the operation

(4) Note: all options can be overridden when using specific commands (dfm, kwic)
but the settings will determine the defaults. This is for replication purposes
and convenience if a user determines that for a corpus, there should be a
"standard" set of option settings.
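The list structure above could be sketched as a constructor roughly like this (a hypothetical sketch only: the field names `attribs`, `docmeta`, `metadata`, etc. are illustrative, not a final API):

```r
# Sketch of a constructor building the proposed corpus list structure
corpus <- function(texts, docnames = NULL, notes = NULL) {
    if (is.null(docnames)) docnames <- paste0("text", seq_along(texts))
    attribs <- data.frame(texts = texts, row.names = docnames,
                          stringsAsFactors = FALSE)
    structure(list(attribs  = attribs,                          # (a) documents
                   docmeta  = data.frame(row.names = docnames), # (b) doc-level metadata
                   metadata = list(notes = notes,               # (c) corpus-level metadata
                                   created = date()),
                   collocations = character(0),                 # (e)
                   dictionaries = list(),                       # (f)
                   stopwords = character(0),                    # (g)
                   stemming = FALSE,                            # (h)
                   indexed = FALSE),                            # index flag, point (3)
              class = "corpus")
}
```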

Methods:

corpus(texts, ...) <- replaces corpusCreate. Similar to data.frame which
creates a data.frame. Would be nice to combine existing
functions into options (for reading from file= or
directory= etc options)

print.corpus(corpus) displays summary information on a corpus, esp. metadata
and citation information and current settings for things
like collocations, stemming, dictionaries, etc.

summary.corpus(corpus) details of the texts in a corpus

'+' corpus concatenate texts in two corpus objects
union of meta-data, first gets priority
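The '+' operator could look something like the following sketch, assuming the `attribs` and `metadata` slots proposed above; `modifyList()` applied this way gives the first corpus's metadata priority:

```r
# Sketch of the proposed '+' operator for corpus objects
"+.corpus" <- function(c1, c2) {
    # concatenate the documents
    c1$attribs <- rbind(c1$attribs, c2$attribs)
    # union of corpus-level metadata, first corpus gets priority
    c1$metadata <- modifyList(c2$metadata, c1$metadata)
    c1
}
```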

index.corpus(corpus) recompiles the corpus index. Could include counts,
word syllable counts, document, paragraph, and sentence
locations. Or POS for each word.

subset.corpus() as it now exists

sample.corpus(corpus, level=c("sentence", "document", "word", "paragraph"), size, replace=TRUE, prob=NULL)
for producing a sample of texts and meta-data from a corpus where the resampling
of the texts is performed at the "level" option. Meta-data is matched to the
sampled document units.
sample.character(characterVector) core engine of sample.corpus
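One wrinkle worth noting: base::sample() is not an S3 generic, so the package would need to define its own generic with a default method falling through to base R. A sketch of sample.character as the core engine:

```r
# Make sample() generic, keeping base behaviour as the default
sample <- function(x, ...) UseMethod("sample")
sample.default <- function(x, ...) base::sample(x, ...)

# Core engine: resample the elements of a character vector
sample.character <- function(x, size = length(x), replace = TRUE, prob = NULL) {
    x[base::sample(seq_along(x), size = size, replace = replace, prob = prob)]
}
```

sample.corpus would then call sample.character on the texts at the chosen level and subset the metadata rows to match.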

Extractor/Assignment functions for corpus slots:


documents.corpus(corpus)
extracts or assigns the texts (same as current getTexts())

metadata.corpus(corpus, level=c("documents", "corpus"))
extracts or assigns corpus metadata

stopwords.corpus(corpus) extracts or assigns stopwords associated with corpus

collocations.corpus(corpus)
extracts or assigns collocations to be treated as "features"
when extracting features from the corpus

stemming.corpus(corpus) TRUE or FALSE flag to be set with corpus

trim.corpus(corpus) min doc and min word trimming features

encoding(corpus) set or extract encodings of attribs$texts

dictionary(corpus, name="dictionaryname")
to extract or set the dictionaries associated with corpus
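In R, "extracts or assigns" means pairing a getter with a replacement function. A sketch of the pattern for stopwords, assuming the `stopwords` slot proposed above (the other extractor/assignment pairs would follow the same shape):

```r
# Getter
stopwords <- function(x, ...) UseMethod("stopwords")
stopwords.corpus <- function(x, ...) x$stopwords

# Replacement function, enabling stopwords(mycorpus) <- c("the", "a", "an")
"stopwords<-" <- function(x, value) UseMethod("stopwords<-")
"stopwords<-.corpus" <- function(x, value) {
    x$stopwords <- value
    x
}
```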

Extractor only (no assignment):


sentences.corpus(corpus) extract sentence list from a corpus
words/vocabulary.corpus(corpus) extract list of word types from a corpus (given settings)

Analysis of corpus directly: (also defined for .character whenever applicable)


readability.corpus(corpus, [options])
kwic.corpus(corpus, [options])
collocations.corpus(corpus, [options])

Manipulation/conversion of corpus


export.corpus(to=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), from=c("quanteda"), [options])
import.corpus(from=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), to=c("quanteda"))

pnulty (Collaborator, Author) commented Aug 12, 2014

Constructor for corpus as outlined here: http://adv-r.had.co.nz/OO-essentials.html#s3

The constructor should be a generic function named "corpus". If no arguments are passed, getTextsGui can be run.
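A short sketch of that dispatch pattern, with the no-argument GUI fallback (getTextsGui is the existing interactive reader mentioned above; the method body is illustrative):

```r
# "corpus" as a generic constructor; no arguments launches the GUI reader
corpus <- function(x, ...) {
    if (missing(x)) return(getTextsGui())
    UseMethod("corpus")
}

# character method: build a corpus from a vector of texts
corpus.character <- function(x, ...) {
    structure(list(attribs = data.frame(texts = x, stringsAsFactors = FALSE)),
              class = "corpus")
}
```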

kbenoit (Collaborator) commented May 5, 2015

Some issues resolved by last hackathon, others distributed into new issues.

@kbenoit kbenoit closed this as completed May 5, 2015