Milestones API Documentation

Milestones are typically location designators (e.g. chapter numbers, line numbers, or groups of sentences) that help you identify structural divisions within documents.

Note

The milestones submodule provides useful functionality for other features of Lexos and will eventually be made into a separate module. For now, it can be accessed as a submodule of Rolling Windows.

Since milestones can span multiple tokens, each token in the document is classified using the IOB method (also used by spaCy's named entity recognition component). The value I assigned to milestone_iob indicates that a token is "inside" (part of) a milestone. The value O indicates that the token is "outside" of (not part of) a milestone. The value B indicates that the token is the "beginning" of (the first token in) a milestone. The milestone_label attribute provides a text representation of the combined tokens. Note, however, that by default it is truncated after twenty characters. Its main function is thus as a point of reference for the user.
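The IOB scheme described above can be illustrated without spaCy. The following is a minimal sketch, assuming milestones are given as half-open (start, end) token index pairs; the function name iob_tags and the sample tokens are hypothetical and are not part of Lexos.

```python
from typing import List, Tuple


def iob_tags(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Return one IOB tag per token for half-open (start, end) token spans."""
    tags = ["O"] * n_tokens  # every token starts "outside" a milestone
    for start, end in spans:
        tags[start] = "B"  # first token of the milestone
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens are "inside" the milestone
    return tags


tokens = ["Chapter", "1", "It", "is", "a", "truth", "Chapter", "2", "However"]
milestone_spans = [(0, 2), (6, 8)]  # "Chapter 1" and "Chapter 2"
tags = iob_tags(len(tokens), milestone_spans)

for token, tag in zip(tokens, tags):
    print(f"{token:10}{tag}")
```

In a Milestones object the equivalent values are stored per token in the milestone_iob attribute rather than in a separate list.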

Note

Custom attributes in spaCy are accessed with the ._. prefix, so the milestone_iob and milestone_label values for the first token in the document would be accessed with ms.doc[0]._.milestone_iob and ms.doc[0]._.milestone_label.

If you already have a doc with milestone attributes, you can simply initialise a Milestones object using that doc, and the milestone_iob and milestone_label attributes will be available. If you have used the Milestones class to create these attributes, in most cases you will want to replace your original doc with the one in your Milestones object with doc = ms.doc.

lexos.milestones.helpers

lexos.milestones.helpers mostly consists of deprecated functions. The only one currently used is lexos.milestones.helpers.ensure_list. The deprecated functions are not documented below.

lexos.milestones.helpers.ensure_list

Wraps any input in a list if it is not already a list.

def ensure_list(input: Any) -> list
Parameter Description Required
input: Any An input variable. Yes
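The described behaviour can be sketched in a few lines. This is an illustration based only on the one-line description above, not the library's source; the name ensure_list_sketch is hypothetical, and how the real function treats tuples or other iterables is an assumption.

```python
from typing import Any


def ensure_list_sketch(item: Any) -> list:
    """Wrap item in a list unless it is already a list."""
    # Assumption: only true lists pass through unchanged; tuples, strings,
    # and other iterables are wrapped as a single element.
    return item if isinstance(item, list) else [item]


print(ensure_list_sketch("Chapter"))
print(ensure_list_sketch(["Chapter", "Book"]))
```

A helper like this lets callers pass either a single pattern or a list of patterns to the same parameter.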

lexos.milestones.get_multiple_milestones

Get a list of Milestone objects from a list of docs. This function may be deprecated.

def get_multiple_milestones(docs: List[spacy.tokens.doc.Doc], nlp: str = "xx_sent_ud_sm", patterns: Any = None, case_sensitive: bool = True, mode: str = None, skip_token: bool = False, remove_token: bool = False, split_lines: bool = False, split_sentences: bool = False, step: int = None, remove_milestone: bool = True) -> List[Milestones]
Parameter Description Required
docs: List[spacy.tokens.doc.Doc] A list of spaCy Doc objects. Yes
nlp: str The name of a spaCy language model. Default is xx_sent_ud_sm. No
patterns: Any The list of patterns to match milestone spans or line breaks. If nothing is supplied, get_line_spans() will use the default pattern for line breaks. Default is None. No
case_sensitive: bool Whether to use case sensitive matching. Default is True. No
mode: str The mode to use for token matching. Default is None. No
skip_token: bool Set milestone start to the token following the milestone span. Default is False. No
remove_token: bool Set milestone start to the token following the milestone span and remove the milestone span. Default is False. No
split_lines: bool Use set_line_spans() instead of set_milestones(). Default is False. No
split_sentences: bool Use set_sentence_spans() instead of set_milestones(). Default is False. No
step: int The number of lines or sentences to include in the spans. By default, all are included. Default is None. No
remove_milestone: bool Whether or not to remove the linebreak using split_lines. Default is True. No

lexos.milestones.Milestones

Creates a Milestones object. The object has the property spans, which returns the value of Milestones.doc.spans["milestones"].

class Milestones(doc: spacy.tokens.doc.Doc, *, nlp: str = "xx_sent_ud_sm", patterns: Any = None, case_sensitive: bool = True)
Attribute Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
nlp: str The name of a spaCy language model. Default is xx_sent_ud_sm. No
patterns: Any A pattern or list of patterns to match to milestones. Default is None. No
case_sensitive: bool Whether to use case sensitive matching. Default is True. No

Private Methods

lexos.milestones.Milestones.__iter__

Returns a generator that iterates over Milestones.spans.

def __iter__(self)
lexos.milestones.Milestones._assign_token_attributes

Assign token attributes in the doc based on spans.

def _assign_token_attributes(self, spans: List[spacy.tokens.span.Span])
Parameter Description Required
spans: List[spacy.tokens.span.Span] A list of spaCy Span objects. Yes
lexos.milestones.Milestones._autodetect_mode

Autodetect mode for matching milestones if not supplied (experimental). Returns a string to supply to the mode parameter of lexos.milestones.Milestones.get_matches.

def _autodetect_mode(self, patterns: Any) -> str
Parameter Description Required
patterns: Any The pattern(s) to match. Yes
lexos.milestones.Milestones._get_string_matches

Get matches to milestone patterns in strings. Returns a list of spaCy spans matching the pattern.

def _get_string_matches(self, patterns: Any, flags: Enum) -> List[spacy.tokens.Span]
Parameter Description Required
patterns: Any The pattern(s) to match. Yes
flags: Enum An enum containing Python re flags. Yes
lexos.milestones.Milestones._get_phrase_matches

Get matches to milestone patterns in phrases. Returns a list of spaCy spans matching the pattern.

def _get_phrase_matches(self, patterns: Any, attr: str = "ORTH") -> List[spacy.tokens.Span]
Parameter Description Required
patterns: Any The pattern(s) to match. Yes
attr: str A string indicating the spaCy token attribute to match. Default is ORTH. No
lexos.milestones.Milestones._get_rule_matches

Get matches to milestone patterns using spaCy Matcher rules. Returns a list of spaCy spans matching the pattern.

def _get_rule_matches(self, patterns: Any) -> List[spacy.tokens.Span]
Parameter Description Required
patterns: Any The pattern(s) to match. Yes
lexos.milestones.Milestones._remove_duplicate_spans

Remove duplicate spans, generally created when a pattern is added.

def _remove_duplicate_spans(self, spans: List[spacy.tokens.Span]) -> List[spacy.tokens.Span]
Parameter Description Required
spans: List[spacy.tokens.Span] A list of spaCy Span objects. Yes
lexos.milestones.Milestones._set_case_sensitivity

Set the object's case sensitivity.

def _set_case_sensitivity(self, case_sensitive: bool = True)
Parameter Description Required
case_sensitive: bool Whether or not to perform case-sensitive searching. Default is True. No
lexos.milestones.Milestones._to_spacy_span

Convert a re.Match object to a spaCy Span object.

def _to_spacy_span(self, match: Match) -> spacy.tokens.Span
Parameter Description Required
match: re.Match A re.Match object. Yes

Public Methods

lexos.milestones.Milestones.add

Add patterns. Note that the resulting patterns are unsorted. Depending on what you are doing, you may need to call ms.patterns = sorted(ms.patterns).

def add(self, patterns: Any, mode: str = "string") -> None
Parameter Description Required
patterns: Any The pattern(s) to match. Yes
mode: str The mode to use for matching. Default is string. No
lexos.milestones.Milestones.get_matches

Get matches to milestone patterns. Returns a list of spaCy spans matching the pattern.

def get_matches(self, patterns: Any = None, mode: str = None, case_sensitive: bool = True)
Parameter Description Required
patterns: Any The pattern(s) to match. Default is None. No
mode: str The mode to use for matching: string (match milestone patterns in the document text), phrase (match milestone patterns in phrases), rule (match milestone patterns with spaCy rules), or sentence (match milestone patterns in sentences). Default is None. No
case_sensitive: bool Whether to use case sensitive matching. Default is True. No

The mode parameter identifies the function to use for matching patterns. The string mode matches character sequences in the document's text. The phrase mode matches token sequences in the document using spaCy's Phrase Matcher. The rule mode matches a spaCy Rule Matcher pattern. The sentence mode works somewhat differently: it returns a list of sentences in the document. Since it uses spaCy's sentence detection component, it will only work if that component is available in the selected language model. If no mode is provided, Lexos will attempt to auto-detect the most appropriate mode based on the pattern.

Pattern matching may not work as desired in RTL languages like Arabic and Hebrew. Some functions to handle RTL languages have been prototyped but are not part of this version of Milestones.

Tip

The string mode matches patterns using regular expressions, which may occasionally cause mismatches. For instance, matching "Mr. Darcy" will return matches to "Mrs Darcy" since "." indicates any single character in regular expressions. Typically, this problem can be avoided by selecting the phrase mode.
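The mismatch described in this tip can be reproduced with Python's re module alone. The sketch below does not call Lexos; the sample sentence is invented, and escaping the pattern with re.escape is shown only to illustrate the regular expression behaviour, not as a documented feature of the string mode.

```python
import re

text = "Mr. Darcy bowed to Mrs Darcy."

# Unescaped: "." matches any single character, so "Mrs Darcy" also matches.
unescaped = re.findall(r"Mr. Darcy", text)

# Escaped: "." is treated as a literal full stop.
escaped = re.findall(re.escape("Mr. Darcy"), text)

print(unescaped)
print(escaped)
```

The unescaped search finds both names, while the escaped search finds only the literal "Mr. Darcy".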

Caution

Calling Milestones.get_matches() will overwrite any pre-existing patterns. If you wish to add patterns to existing ones, use the Milestones.add() method, which updates the list of patterns and sets the milestones matching both the previous and the new patterns. You can also remove patterns with the Milestones.remove() method. Both methods accept the mode parameter. Finally, you can clear the pattern list by calling the Milestones.reset() method. This will also reset all milestone_iob values to "O" and all milestone_label values to empty strings.

lexos.milestones.Milestones.remove

Remove patterns.

def remove(self, patterns: Any, mode: str = "string") -> None
Parameter Description Required
patterns: Any The pattern(s) to match. Yes
mode: str The mode to use for matching. Default is string. No
lexos.milestones.Milestones.reset

Reset all milestone values to defaults. Does not modify patterns or any other settings.

def reset(self)
lexos.milestones.Milestones.set_custom_spans

Generate spans based on a custom list. Returns a list of spaCy spans.

def set_custom_spans(self, spans: List[spacy.tokens.Span], step: int = None, type: str = "custom") -> List[spacy.tokens.Span]
Parameter Description Required
spans: List[spacy.tokens.Span] The custom list of spaCy Span objects to use as milestones. Yes
step: int The number of spans to group into each milestone span. By default, all spans are included. Default is None. No
type: str The type of span used. Default is custom. No
lexos.milestones.Milestones.set_line_spans

Generate spans based on line breaks. Returns a list of spaCy spans.

def set_line_spans(self, pattern: str = r".+?\n", step: int = None, remove_milestone: bool = True) -> List[spacy.tokens.Span]
Parameter Description Required
pattern: str The string or regex pattern to use to identify the milestone. Default is r".+?\n". No
step: int The number of spans to group into each milestone span. By default, all lines are included. Default is None. No
remove_milestone: bool Whether or not to remove the line break character. Default is True. No
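To see what the default pattern r".+?\n" captures, it can be applied directly with Python's re module. This sketch only demonstrates the regular expression on an invented sample string; how set_line_spans converts matches into spaCy spans is not shown here.

```python
import re

text = "First line\nSecond line\nThird line"

# Each match is one non-empty line together with its trailing line break.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r".+?\n", text)]

for start, end, line in matches:
    print(start, end, repr(line))
```

Because the pattern requires a line break, the final line in this sample is not matched, and blank lines are skipped because ".+?" needs at least one character. If your text does not end with a line break, you may want to check whether the last line is included.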
lexos.milestones.Milestones.set_milestones

Commit milestones to the object instance.

def set_milestones(self, spans: List[spacy.tokens.span.Span], skip_token: bool = False, remove_token: bool = False) -> None
Parameter Description Required
spans: List[spacy.tokens.span.Span] The span(s) to use for identifying token attributes. Yes
skip_token: bool Set milestone start to the token following the milestone span. Default is False. No
remove_token: bool Set milestone start to the token following the milestone span and remove the milestone span. Default is False. No
lexos.milestones.Milestones.set_sentence_spans

Generate spans with n sentences per span. Returns a list of spaCy spans.

def set_sentence_spans(self, step: int = None) -> List[spacy.tokens.Span]
Parameter Description Required
step: int The number of sentences to group into each milestone span. By default, all sentences are included. Default is None. No
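The grouping controlled by step can be pictured as chunking a list. The sketch below is an assumption about the general idea only: group_by_step is a hypothetical helper working on plain strings, and whether Lexos keeps a shorter final group in the same way is not confirmed by this documentation.

```python
from typing import List, Sequence


def group_by_step(items: Sequence[str], step: int) -> List[List[str]]:
    """Group consecutive items into chunks of at most `step` items."""
    return [list(items[i:i + step]) for i in range(0, len(items), step)]


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
groups = group_by_step(sentences, 2)
print(groups)
```

With five sentences and a step of 2, this sketch produces two full groups and one final group containing the remaining sentence.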
lexos.milestones.Milestones.to_list

Get a list of milestone dictionaries. Some language models include a final punctuation mark in the token string, particularly at the end of a sentence. The strip_punct argument is a somewhat hacky convenience option to remove it. However, the user may wish instead to do some post-processing in order to use the output for their own purposes.

def to_list(self, strip_punct: bool = True) -> List[dict]
Parameter Description Required
strip_punct: bool Strip single punctuation mark at the end of the character string. Default is True. No
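For users who prefer to post-process the output themselves, stripping a single trailing punctuation mark might look like the following. This is a minimal sketch, not the library's implementation: strip_final_punct is a hypothetical helper, and it only recognises the ASCII characters in string.punctuation.

```python
import string


def strip_final_punct(text: str) -> str:
    """Remove one trailing punctuation mark, if present."""
    if text and text[-1] in string.punctuation:
        return text[:-1]
    return text


print(strip_final_punct("Chapter 1."))
print(strip_final_punct("Chapter 1"))
```

Only one character is removed, so a string ending in several punctuation marks keeps all but the last.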

lexos.milestones.util

lexos.milestones.util.chars_to_tokens

Generate a characters to tokens mapping. Returns a dictionary mapping character indexes to token indexes.

def chars_to_tokens(doc: spacy.tokens.doc.Doc) -> Dict[int, int]
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
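The mapping can be pictured with plain Python data. In this sketch each token is represented by a hypothetical (start_char, text) tuple standing in for a spaCy token's character offset and text; the function name chars_to_tokens_sketch is illustrative, and the treatment of whitespace between tokens is an assumption.

```python
from typing import Dict, List, Tuple


def chars_to_tokens_sketch(tokens: List[Tuple[int, str]]) -> Dict[int, int]:
    """Map each character index covered by a token to that token's index."""
    mapping: Dict[int, int] = {}
    for token_index, (start_char, text) in enumerate(tokens):
        for char_index in range(start_char, start_char + len(text)):
            mapping[char_index] = token_index
    return mapping


# "Mr. Darcy" split into three tokens starting at characters 0, 2 and 4.
tokens = [(0, "Mr"), (2, "."), (4, "Darcy")]
mapping = chars_to_tokens_sketch(tokens)
print(mapping)
```

A mapping of this kind makes it possible to turn the character offsets of a regular expression match into token indexes. In this sketch the space at character 3 belongs to no token and is therefore absent from the dictionary.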

lexos.milestones.util.lowercase_spacy_rules

Converts a spaCy Matcher rule to lower case. Performs the same function as rollingwindows.calculators.spacy_rule_to_lower.

def spacy_rule_to_lower(patterns: Union[Dict, List[Dict]], old_key: Union[List[str], str] = ["TEXT", "ORTH"], new_key: str = "LOWER") -> list
Parameter Description Required
patterns: Union[Dict, List[Dict]] A spaCy Matcher rule pattern or list of patterns to convert. Yes
old_key: Union[List[str], str] A dictionary key or list of keys to rename. Default is ["TEXT", "ORTH"]. No
new_key: str The new key name. Default is LOWER. No
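The key renaming can be sketched with ordinary dictionaries. This is an illustration under stated assumptions rather than the library's code: rule_to_lower_sketch is a hypothetical name, and lowercasing string values (so that they can match spaCy's LOWER attribute) is assumed, not confirmed by the description above.

```python
from typing import Dict, List, Sequence, Union


def rule_to_lower_sketch(
    patterns: Union[Dict, List[Dict]],
    old_key: Union[Sequence[str], str] = ("TEXT", "ORTH"),
    new_key: str = "LOWER",
) -> List[Dict]:
    """Rename case-sensitive keys in Matcher-style token dicts to new_key."""
    old_keys = [old_key] if isinstance(old_key, str) else list(old_key)
    rules = [patterns] if isinstance(patterns, dict) else list(patterns)
    converted = []
    for rule in rules:
        new_rule = {}
        for key, value in rule.items():
            if key in old_keys:
                # Assumption: string values are lowercased along with the key.
                new_rule[new_key] = value.lower() if isinstance(value, str) else value
            else:
                new_rule[key] = value
        converted.append(new_rule)
    return converted


rule = [{"TEXT": "Chapter"}, {"IS_DIGIT": True}]
lowered = rule_to_lower_sketch(rule)
print(lowered)
```

The original rule is left unchanged; the sketch returns a new list of dictionaries.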

lexos.milestones.util.filter_doc

Applies a filter to a document and returns a new document. This function is a duplicate of rollingwindows.filters.filter_doc.

def filter_doc(doc: spacy.tokens.doc.Doc, keep_ids: List[int], spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> spacy.tokens.doc.Doc
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
keep_ids: List[int] A list of spaCy Token ids to keep in the filtered Doc. Yes
spacy_attrs: List[str] A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with util.* No
force_ws: bool Force a whitespace at the end of every token except the last. Default is True. No

* The default list of spaCy token attributes can be inspected by calling util.SPACY_ATTRS.

lexos.milestones.util.get_doc_array

Converts a spaCy Doc object into a numpy array. This function is a duplicate of rollingwindows.filters.get_doc_array.

def get_doc_array(doc: spacy.tokens.doc.Doc, spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> np.ndarray
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
spacy_attrs: List[str] A list of spaCy Token attributes to include in the array. Default is the SPACY_ATTRS list imported with util.* No
force_ws: bool Force a whitespace at the end of every token except the last. Default is True. No

* The default list of spaCy token attributes can be inspected by calling util.SPACY_ATTRS.

The following options are available for handling whitespace:

  1. force_ws=True ensures that token_with_ws and whitespace_ attributes are preserved, but all tokens will be separated by whitespaces in the text of a doc created from the array.
  2. force_ws=False with SPACY in spacy_attrs preserves the token_with_ws and whitespace_ attributes and their original values. This may cause tokens to be merged if subsequent processing operates on the doc.text.
  3. force_ws=False without SPACY in spacy_attrs does not preserve the token_with_ws and whitespace_ attributes or their values. By default, doc.text displays a single space between each token.
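The effect of forcing whitespace on the reconstructed text can be sketched with plain tuples. Here each token is a hypothetical (text, trailing_whitespace) pair standing in for a spaCy token, and rebuild_text is an illustrative helper, not a Lexos function; it contrasts options 1 and 2 above.

```python
from typing import List, Tuple


def rebuild_text(tokens: List[Tuple[str, str]], force_ws: bool) -> str:
    """Rebuild a text string from (text, trailing_whitespace) pairs."""
    if force_ws:
        # A single space after every token except the last.
        return " ".join(text for text, _ in tokens)
    # Use each token's original trailing whitespace.
    return "".join(text + ws for text, ws in tokens)


tokens = [("Mr", ""), (".", " "), ("Darcy", "")]
print(rebuild_text(tokens, force_ws=True))
print(rebuild_text(tokens, force_ws=False))
```

With forced whitespace the full stop is separated from "Mr", whereas the original whitespace values keep them joined.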