Milestones are typically location designators (e.g. chapter numbers, line numbers, or groups of sentences) that help you identify structural divisions within documents.
Note
The milestones submodule provides useful functionality for other features of Lexos and will eventually be made into a separate module. For now, it can be accessed as a submodule of Rolling Windows.
Since milestones can span multiple tokens, each token in the document is classified using the IOB method (also used by spaCy's named entity recognition component). The value I assigned to milestone_iob indicates that a token is "inside" (part of) a milestone. The value O indicates that the token is "outside" of (not part of) a milestone. The value B indicates that the token is the "beginning" of (the first token in) a milestone. The milestone_label attribute provides a text representation of the combined tokens. Note, however, that by default it is truncated after twenty characters. Its main function is thus as a point of reference for the user.
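The IOB scheme described above can be illustrated without spaCy. The following is a minimal sketch, not the Lexos implementation: the function name tag_iob, the use of (start, end) token index pairs, the plain slice used for truncation, and the choice to copy the label onto every token in the milestone are all assumptions made for illustration.

```python
from typing import List, Tuple


def tag_iob(
    tokens: List[str], spans: List[Tuple[int, int]], max_label: int = 20
) -> List[Tuple[str, str]]:
    """Return an (iob, label) pair for each token.

    Spans are (start, end) token indexes with an exclusive end.
    Hypothetical helper for illustration only.
    """
    iob = ["O"] * len(tokens)       # every token starts outside a milestone
    labels = [""] * len(tokens)
    for start, end in spans:
        # Text of the combined tokens, truncated to max_label characters
        label = " ".join(tokens[start:end])[:max_label]
        for i in range(start, end):
            iob[i] = "B" if i == start else "I"
            labels[i] = label       # assumption: every token carries the label
    return list(zip(iob, labels))


tokens = ["Chapter", "1", "It", "was", "a", "dark", "night"]
tags = tag_iob(tokens, [(0, 2)])
for token, (iob_value, label) in zip(tokens, tags):
    print(token, iob_value, repr(label))
```

Here "Chapter" is tagged B, "1" is tagged I, and the remaining tokens are tagged O with empty labels.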
Note
Custom attributes in spaCy are accessed with the ._. prefix, so the milestone_iob and milestone_label values for the first token in the document would be accessed with ms.doc[0]._.milestone_iob and ms.doc[0]._.milestone_label.
If you already have a doc with milestone attributes, you can simply initialise a Milestones object using that doc, and the milestone_iob and milestone_label attributes will be available. If you have used the Milestones class to create these attributes, in most cases you will want to replace your original doc with the one in your Milestones object with doc = ms.doc.
lexos.milestones.helpers mostly consists of deprecated functions. The only one currently used is lexos.milestones.helpers.ensure_list. The deprecated functions are not documented below.
Wraps any input in a list if it is not already a list.
def ensure_list(input: Any) -> list
Parameter | Description | Required |
---|---|---|
input : Any | An input variable. | Yes |
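The behaviour described for ensure_list can be sketched in a few lines. This is an illustrative stand-in written from the one-line description above, not the Lexos source; the name ensure_list_sketch is hypothetical.

```python
from typing import Any


def ensure_list_sketch(item: Any) -> list:
    """Wrap item in a list unless it is already a list (illustrative only)."""
    return item if isinstance(item, list) else [item]


print(ensure_list_sketch("Chapter"))       # a single string is wrapped
print(ensure_list_sketch(["Chapter", "Book"]))  # a list is returned unchanged
```

A helper like this lets the rest of the code accept either a single pattern or a list of patterns and always iterate over a list.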
Get a list of Milestone objects from a list of docs. This function may be deprecated.
def get_multiple_milestones(docs: List[spacy.tokens.doc.Doc], nlp: str = "xx_sent_ud_sm", patterns: Any = None, case_sensitive: bool = True, mode: str = None, skip_token: bool = False, remove_token: bool = False, split_lines: bool = False, split_sentences: bool = False, step: int = None, remove_milestone: bool = True) -> List[Milestones]
Parameter | Description | Required |
---|---|---|
docs : List[spacy.tokens.doc.Doc] | A list of spaCy Doc objects. | Yes |
nlp : str | The name of a spaCy language model. Default is xx_sent_ud_sm. | No |
patterns : Any | The list of patterns to match milestone spans or line breaks. If nothing is supplied, get_line_spans() will use the default pattern for line breaks. Default is None. | No |
case_sensitive : bool | Whether to use case sensitive matching. Default is True. | No |
mode : str | The mode to use for token matching. Default is None. | No |
skip_token : bool | Set milestone start to the token following the milestone span. Default is False. | No |
remove_token : bool | Set milestone start to the token following the milestone span and remove the milestone span. Default is False. | No |
split_lines : bool | Use set_line_spans() instead of set_milestones(). Default is False. | No |
split_sentences : bool | Use set_sentence_spans() instead of set_milestones(). Default is False. | No |
step : int | The number of lines or sentences to include in the spans. By default, all are included. Default is None. | No |
remove_milestone : bool | Whether or not to remove the line break when using split_lines. Default is True. | No |
Creates a Milestones object. The object has the property spans, which returns the value of Milestones.doc.spans["milestones"].
class Milestones(doc: spacy.tokens.doc.Doc, *, nlp: str = "xx_sent_ud_sm", patterns: Any = None, case_sensitive: bool = True)
Attribute | Description | Required |
---|---|---|
doc : spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
nlp : str | The name of a spaCy language model. Default is xx_sent_ud_sm. | No |
patterns : Any | A pattern or list of patterns to match to milestones. Default is None. | No |
case_sensitive : bool | Whether to use case sensitive matching. Default is True. | No |
Returns a generator over Milestones.spans.
def __iter__(self)
Assign token attributes in the doc based on spans.
def _assign_token_attributes(self, spans: List[spacy.tokens.span.Span])
Parameter | Description | Required |
---|---|---|
spans : List[spacy.tokens.span.Span] | A list of spaCy Span objects. | Yes |
Autodetect mode for matching milestones if not supplied (experimental). Returns a string to supply to the mode parameter of lexos.milestones.Milestones.get_matches.
def _autodetect_mode(self, patterns: Any) -> str
Parameter | Description | Required |
---|---|---|
patterns : Any | The pattern(s) to match. | Yes |
Get matches to milestone patterns in strings. Returns a list of spaCy spans matching the pattern.
def _get_string_matches(self, patterns: Any, flags: Enum) -> List[spacy.tokens.Span]
Parameter | Description | Required |
---|---|---|
patterns : Any | The pattern(s) to match. | Yes |
flags : Enum | An enum containing Python re flags. | Yes |
Get matches to milestone patterns in phrases. Returns a list of spaCy spans matching the pattern.
def _get_phrase_matches(self, patterns: Any, attr: str = "ORTH") -> List[spacy.tokens.Span]
Parameter | Description | Required |
---|---|---|
patterns : Any | The pattern(s) to match. | Yes |
attr : str | A string indicating the spaCy token attribute to match. Default is ORTH. | No |
Get matches to milestone patterns with spaCy rules. Returns a list of spaCy spans matching the pattern.
def _get_rule_matches(self, patterns: Any) -> List[spacy.tokens.Span]
Parameter | Description | Required |
---|---|---|
patterns : Any | The pattern(s) to match. | Yes |
Remove duplicate spans, generally created when a pattern is added.
def _remove_duplicate_spans(self, spans: List[spacy.tokens.Span]) -> List[spacy.tokens.Span]
Parameter | Description | Required |
---|---|---|
spans : List[spacy.tokens.Span] | A list of spaCy Span objects. | Yes |
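As a rough illustration of de-duplication, the sketch below removes repeated spans while preserving their order. It assumes spans are represented as plain (start, end) tuples rather than spaCy Span objects, and the function name is hypothetical; the Lexos method may compare spans differently.

```python
from typing import List, Tuple


def remove_duplicate_spans_sketch(
    spans: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Drop repeated (start, end) pairs, keeping first occurrences in order."""
    seen = set()
    unique = []
    for span in spans:
        if span not in seen:
            seen.add(span)
            unique.append(span)
    return unique


# The same span found twice, e.g. after a pattern has been added again
unique_spans = remove_duplicate_spans_sketch([(0, 2), (5, 7), (0, 2)])
print(unique_spans)
```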
Set the object's case sensitivity.
def _set_case_sensitivity(self, case_sensitive: bool = True)
Parameter | Description | Required |
---|---|---|
case_sensitive : bool | Whether or not to perform case-sensitive searching. Default is True. | No |
Convert a re.Match object to a spaCy Span object.
def _to_spacy_span(self, match: Match) -> spacy.tokens.Span
Parameter | Description | Required |
---|---|---|
match : re.Match | A re.Match object. | Yes |
Add patterns. Note that the resulting patterns are unsorted. Depending on what you are doing, you may need to call ms.patterns = sorted(ms.patterns).
def add(self, patterns: Any, mode: str = "string") -> None
Parameter | Description | Required |
---|---|---|
patterns : Any | The pattern(s) to match. | Yes |
mode : str | The mode to use for matching. Default is string. | No |
Get matches to milestone patterns. Returns a list of spaCy spans matching the pattern.
def get_matches(self, patterns: Any = None, mode: str = None, case_sensitive: bool = True)
Parameter | Description | Required |
---|---|---|
patterns : Any | The pattern(s) to match. Default is None. | No |
mode : str | The mode to use for matching: string (match milestone patterns in the document text), phrase (match to milestone patterns in phrases), rule (match to milestone patterns with spaCy rules), or sentence (match milestone patterns in sentences). Default is None. | No |
case_sensitive : bool | Whether to use case sensitive matching. Default is True. | No |
The mode parameter identifies the function to use for matching patterns. The string mode matches character sequences in the document's text. The phrase mode matches token sequences in the document using spaCy's Phrase Matcher. The rule mode matches a spaCy Rule Matcher pattern. The sentence mode works somewhat differently: it returns a list of the sentences in the document. Since it uses spaCy's sentence detection component, it will only work if that component is available in the selected language model. If no mode is provided, Lexos will attempt to auto-detect the most appropriate mode based on the pattern.
Pattern matching may not work as desired in RTL languages like Arabic and Hebrew. Some functions to handle RTL languages have been prototyped but are not part of this version of Milestones.
Tip
The string mode matches patterns using regular expressions, which may occasionally cause mismatches. For instance, matching "Mr. Darcy" will return matches to "Mrs Darcy" since "." indicates any single character in regular expressions. Typically, this problem can be avoided by selecting the phrase mode.
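The mismatch described in the tip can be reproduced with Python's re module alone. The snippet below is a sketch that does not call Lexos; escaping the pattern with re.escape is shown only as a general regular expression technique and is an assumption about how you might prepare a pattern, not a documented Lexos feature.

```python
import re

text = "Mr. Darcy bowed to Mrs Darcy."

# Unescaped: "." matches any single character, so "Mrs Darcy" also matches
loose = re.findall(r"Mr. Darcy", text)

# Escaped: "." is treated as a literal full stop
strict = re.findall(re.escape("Mr. Darcy"), text)

print(loose)
print(strict)
```

The unescaped pattern finds both names, while the escaped pattern finds only "Mr. Darcy".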
Caution
Calling Milestones.get_matches() will overwrite any pre-existing patterns. If you wish to add patterns to existing ones, use the Milestones.add() method, which updates the list of patterns and sets the milestones matching both the previous and the new patterns. You can also remove patterns with the Milestones.remove() method. Both methods accept the mode parameter. Finally, you can clear the pattern list by calling the Milestones.reset() method. This will also reset all milestone_iob values to "O" and all milestone_label values to empty strings.
Remove patterns.
def remove(self, patterns: Any, mode: str = "string") -> None
Parameter | Description | Required |
---|---|---|
patterns : Any | The pattern(s) to match. | Yes |
mode : str | The mode to use for matching. Default is string. | No |
Reset all milestone values to defaults. Does not modify patterns or any other settings.
def reset(self)
Generate spans based on a custom list. Returns a list of spaCy spans.
def set_custom_spans(self, spans: List[spacy.tokens.Span], step: int = None, type: str = "custom") -> List[spacy.tokens.Span]
Parameter | Description | Required |
---|---|---|
spans : List[spacy.tokens.Span] | A list of spaCy Span objects to use as the basis for the milestone spans. | Yes |
step : int | The number of spans to group into each milestone span. By default, all spans are included. Default is None. | No |
type : str | The type of span used. Default is custom. | No |
Generate spans based on line breaks. Returns a list of spaCy spans.
def set_line_spans(self, pattern: str = r".+?\n", step: int = None, remove_milestone: bool = True) -> List[spacy.tokens.Span]
Parameter | Description | Required |
---|---|---|
pattern : str | The string or regex pattern to use to identify the milestone. Default is r".+?\n". | No |
step : int | The number of lines to group into each milestone span. By default, all lines are included. Default is None. | No |
remove_milestone : bool | Whether or not to remove the line break character. Default is True. | No |
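To see what the default pattern r".+?\n" selects, it can be applied directly with Python's re module. This is a sketch of the regular expression's behaviour on a plain string, not of set_line_spans itself; stripping the newline with rstrip is an assumed stand-in for what remove_milestone does.

```python
import re

text = "First line\nSecond line\nThird line"
pattern = r".+?\n"  # the default pattern documented above

# Each match is one line including its trailing line break
lines = [m.group() for m in re.finditer(pattern, text)]

# Assumed equivalent of remove_milestone=True: drop the line break
stripped = [line.rstrip("\n") for line in lines]

print(lines)
print(stripped)
```

When the pattern is used this way, a final line with no trailing line break ("Third line" here) is not matched.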
Commit milestones to the object instance.
def set_milestones(self, spans: List[spacy.tokens.span.Span], skip_token: bool = False, remove_token: bool = False) -> None
Parameter | Description | Required |
---|---|---|
spans : List[spacy.tokens.span.Span] | The span(s) to use for identifying token attributes. | Yes |
skip_token : bool | Set milestone start to the token following the milestone span. Default is False. | No |
remove_token : bool | Set milestone start to the token following the milestone span and remove the milestone span. Default is False. | No |
Generate spans with n sentences per span. Returns a list of spaCy spans.
def set_sentence_spans(self, step: int = None) -> List[spacy.tokens.Span]
Parameter | Description | Required |
---|---|---|
step : int | The number of sentences to group into each milestone span. By default, all sentences are included. Default is None. | No |
Get a list of milestone dictionaries. Some language models include a final punctuation mark in the token string, particularly at the end of a sentence. The strip_punct argument is a somewhat hacky convenience option to remove it. However, the user may wish instead to do some post-processing in order to use the output for their own purposes.
def to_list(self, strip_punct: bool = True) -> List[dict]
Parameter | Description | Required |
---|---|---|
strip_punct : bool | Strip single punctuation mark at the end of the character string. Default is True. | No |
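The effect of stripping a single final punctuation mark can be sketched as follows. This is an assumption about the general idea, written from the parameter description; the helper name strip_final_punct and the use of string.punctuation to define "punctuation" are hypothetical and may differ from what to_list does.

```python
import string


def strip_final_punct(text: str) -> str:
    """Remove at most one punctuation character from the end of text."""
    if text and text[-1] in string.punctuation:
        return text[:-1]
    return text


print(strip_final_punct("Chapter 1."))  # trailing full stop removed
print(strip_final_punct("Chapter 1"))   # unchanged
print(strip_final_punct("Wait..."))     # only one mark is removed
```

If you need different behaviour (for example, removing all trailing punctuation), post-processing the output of to_list(strip_punct=False) is likely the safer route.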
Generate a characters to tokens mapping. Returns a dictionary mapping character indexes to token indexes.
def chars_to_tokens(doc: spacy.tokens.doc.Doc) -> Dict[int, int]
Parameter | Description | Required |
---|---|---|
doc : spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
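A character-to-token mapping is what allows a regular expression match, which reports character offsets, to be converted into a token span. The sketch below shows the idea with whitespace tokenization in place of a spaCy Doc; the helper names and the tokenization are assumptions for illustration and do not reproduce the Lexos code.

```python
import re
from typing import Dict, List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def chars_to_tokens_sketch(offsets: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map every character index inside a token to that token's index."""
    mapping = {}
    for token_index, (start, end) in enumerate(offsets):
        for char_index in range(start, end):
            mapping[char_index] = token_index
    return mapping


text = "Chapter 1 It was a dark night"
mapping = chars_to_tokens_sketch(tokenize_with_offsets(text))

# Convert a regex match (character offsets) to a token span (exclusive end)
match = re.search(r"Chapter \d+", text)
span = (mapping[match.start()], mapping[match.end() - 1] + 1)
print(span)
```

Characters that fall between tokens (the spaces here) have no entry in the mapping, so a match that starts or ends on whitespace would need extra handling.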
Converts a spaCy Matcher rule to lower case. Performs the same function as rollingwindows.calculators.spacy_rule_to_lower.
def spacy_rule_to_lower(patterns: Union[Dict, List[Dict]], old_key: Union[List[str], str] = ["TEXT", "ORTH"], new_key: str = "LOWER") -> list
Parameter | Description | Required |
---|---|---|
patterns : Union[Dict, List[Dict]] | A spaCy Matcher rule (a dictionary) or a list of such dictionaries to convert. | Yes |
old_key : Union[List[str], str] | A dictionary key or list of keys to rename. Default is ["TEXT", "ORTH"]. | No |
new_key : str | The new key name. Default is LOWER. | No |
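The conversion can be pictured as renaming keys in each token dictionary and lower-casing the associated string. The following sketch is written from the parameter descriptions above and is an assumption about the behaviour, not the Lexos source; the function name rule_to_lower_sketch is hypothetical, and a tuple is used for the default keys to avoid a mutable default argument.

```python
from typing import Dict, List, Sequence, Union


def rule_to_lower_sketch(
    patterns: Union[Dict, List[Dict]],
    old_key: Union[Sequence[str], str] = ("TEXT", "ORTH"),
    new_key: str = "LOWER",
) -> list:
    """Rename old_key entries to new_key and lower-case their string values."""
    if isinstance(patterns, dict):
        patterns = [patterns]
    old_keys = [old_key] if isinstance(old_key, str) else list(old_key)
    converted_rule = []
    for token_rule in patterns:
        converted = {}
        for key, value in token_rule.items():
            if key in old_keys:
                # Only string values can be lower-cased
                converted[new_key] = value.lower() if isinstance(value, str) else value
            else:
                converted[key] = value
        converted_rule.append(converted)
    return converted_rule


rule = [{"TEXT": "Chapter"}, {"IS_DIGIT": True}]
lowered = rule_to_lower_sketch(rule)
print(lowered)
```

Matching on LOWER instead of TEXT or ORTH is the usual way to make a spaCy Matcher rule case-insensitive.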
Applies a filter to a document and returns a new document. This function is a duplicate of rollingwindows.filters.filter_doc.
def filter_doc(doc: spacy.tokens.doc.Doc, keep_ids: List[int], spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> spacy.tokens.doc.Doc
Parameter | Description | Required |
---|---|---|
doc : spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
keep_ids : List[int] | A list of spaCy Token ids to keep in the filtered Doc. | Yes |
spacy_attrs : List[str] | A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with util.* | No |
force_ws : bool | Force a whitespace at the end of every token except the last. Default is True. | No |
* The default list of spaCy token attributes can be inspected by calling util.SPACY_ATTRS.
Converts a spaCy Doc object into a numpy array. This function is a duplicate of rollingwindows.filters.get_doc_array.
def get_doc_array(doc: spacy.tokens.doc.Doc, spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> np.ndarray
Parameter | Description | Required |
---|---|---|
doc : spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
spacy_attrs : List[str] | A list of spaCy Token attributes to include in the array. Default is the SPACY_ATTRS list imported with util.* | No |
force_ws : bool | Force a whitespace at the end of every token except the last. Default is True. | No |
* The default list of spaCy token attributes can be inspected by calling util.SPACY_ATTRS.
The following options are available for handling whitespace:
- force_ws=True ensures that token_with_ws and whitespace_ attributes are preserved, but all tokens will be separated by whitespaces in the text of a doc created from the array.
- force_ws=False with SPACY in spacy_attrs preserves the token_with_ws and whitespace_ attributes and their original values. This may cause tokens to be merged if subsequent processing operates on the doc.text.
- force_ws=False without SPACY in spacy_attrs does not preserve the token_with_ws and whitespace_ attributes or their values. By default, doc.text displays a single space between each token.
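The difference between forcing whitespace and keeping the original whitespace can be seen by rebuilding text from tokens. The sketch below models each token as a (text, has_trailing_space) pair, which is an assumption standing in for spaCy's token text and whitespace_ attribute; it covers only the first two options and is not the Lexos implementation.

```python
from typing import List, Tuple


def rebuild_text(tokens: List[Tuple[str, bool]], force_ws: bool) -> str:
    """Join tokens, either forcing a space after each one or keeping the original."""
    parts = []
    last = len(tokens) - 1
    for i, (text, has_space) in enumerate(tokens):
        if force_ws:
            # A space after every token except the last
            space = " " if i < last else ""
        else:
            # Keep whatever whitespace the token originally had
            space = " " if has_space else ""
        parts.append(text + space)
    return "".join(parts)


tokens = [("Hello", False), (",", True), ("world", False)]
forced = rebuild_text(tokens, force_ws=True)
original = rebuild_text(tokens, force_ws=False)
print(repr(forced))
print(repr(original))
```

With forced whitespace the comma becomes a separate word in the text ("Hello , world"), whereas keeping the original whitespace leaves "Hello," joined, which is the merging effect noted above for processing that operates on doc.text.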