# Milestones Tutorial

Milestones are specified locations in the text that designate structural or sectional divisions. A milestone can be either a designated unit *within* the text or a placemarker inserted between sections of text. The Lexos `milestones` module provides methods for identifying milestone locations by searching for patterns you designate. There are three separate classes for identifying milestones in different ways: `StringMilestones`, `TokenMilestones`, and `SpanMilestones`. We will look at each of these in turn.

## `StringMilestones`

The `StringMilestones` class is used for extracting and storing milestones in strings or spaCy Doc objects. It uses regular expressions to find patterns and returns their locations and text.

Here is a basic example in which we will search for the word "Chapter" followed by one or more digits:

In [None]:
# Import the StringMilestones class
from lexos.milestones.string_milestones import StringMilestones

# Sample text
text = "Chapter 1\nThis is a sample text.\nChapter 2: The Journey Begins\nThis is the second chapter."

# Create StringMilestones object with the text and pattern to search for
milestones = StringMilestones(doc=text, patterns="Chapter \\d+")

# Print the start character, end character, and text of each milestone
for milestone in milestones:
    print(milestone.start, milestone.end, milestone.text)
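
Conceptually, this kind of search resembles what Python's standard `re` module does with `finditer`. The following is a minimal sketch of that idea, not the Lexos implementation; the helper name `find_string_milestones` is hypothetical and exists only for illustration.

```python
import re


def find_string_milestones(text, pattern):
    """Return (start, end, text) tuples for every regex match in `text`.

    Hypothetical helper for illustration only; not part of Lexos.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


text = "Chapter 1\nThis is a sample text.\nChapter 2: The Journey Begins\nThis is the second chapter."

# Each tuple holds character offsets into `text` plus the matched string
for start, end, matched in find_string_milestones(text, r"Chapter \d+"):
    print(start, end, matched)
```

In this sketch, slicing `text[start:end]` returns the matched string, which is a convenient way to check that a set of offsets is correct.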

You can supply a list of patterns:

In [None]:
# Create StringMilestones object with the text and a list of patterns to search for
milestones = StringMilestones(doc=text, patterns=["Chapter", "chapter"])

# Print the start character, end character, and text of each milestone
for milestone in milestones:
    print(milestone.start, milestone.end, milestone.text)

You can make your search case-insensitive by setting `case_sensitive=False`.

In [None]:
# Create StringMilestones object with the text and pattern to search for
milestones = StringMilestones(doc=text, patterns="Chapter", case_sensitive=False)

# Print the start character, end character, and text of each milestone
for milestone in milestones:
    print(milestone.start, milestone.end, milestone.text)
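
In plain regular-expression terms, a list of patterns behaves much like an alternation, and case-insensitive matching corresponds to the `re.IGNORECASE` flag. The sketch below illustrates this analogy with the standard `re` module on the same sample text. It assumes nothing about how Lexos combines patterns internally.

```python
import re

text = "Chapter 1\nThis is a sample text.\nChapter 2: The Journey Begins\nThis is the second chapter."

# A list of patterns can be joined into a single alternation
patterns = ["Chapter", "chapter"]
combined = re.compile("|".join(patterns))
from_list = [m.group() for m in combined.finditer(text)]

# A single pattern with the IGNORECASE flag finds the same matches here
insensitive = re.compile("Chapter", flags=re.IGNORECASE)
from_flag = [m.group() for m in insensitive.finditer(text)]

print(from_list)
print(from_flag)
```

The two approaches agree on this text, but the case-insensitive flag would also catch variants such as "CHAPTER" that the two-item list does not cover.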

You can use the `set` method to change any previously assigned milestones.

In [None]:
milestones.set("The", case_sensitive=False)

for milestone in milestones:
    print(milestone.start, milestone.end, milestone.text)

The `StringMilestones` class also accepts a spaCy `Doc` object.

In [None]:
# Import the Lexos Tokenizer class
from lexos.tokenizer import Tokenizer

text = "the Chapter 1: Introduction. Chapter 2: Methods."

# Create a Tokenizer instance and create a spaCy Doc object
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc(text)

# Create StringMilestones object with the spaCy Doc and print the milestones
milestones = StringMilestones(doc=doc, patterns="Chapter", case_sensitive=False)

for milestone in milestones:
    print(milestone.start, milestone.end, milestone.text)

## `TokenMilestones`

The `TokenMilestones` class is used for extracting and storing milestones in tokenized text, such as spaCy `Doc` objects. It differs from `StringMilestones` in that it matches against full tokens. Furthermore, the process has two steps. First, you must generate a list of matches using the `get_matches` method. Next, you must commit those matches to the `Milestones` object and the `Doc` object by passing this list to the `set_milestones` method.

In [None]:
# Import the Lexos TokenMilestones class
from lexos.milestones.token_milestones import TokenMilestones

# Create a spaCy Doc object
text = "Chapter 1: Introduction. Chapter 2: Methods."
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc(text)

# Create TokenMilestones object with the spaCy Doc and find matches for the pattern
milestones = TokenMilestones(doc=doc)
matches = milestones.get_matches(patterns="Chapter")

# Set new milestones using the matches found
milestones.set_milestones(matches)


The `set_milestones` method creates two custom attributes in the document's tokens: `milestone_iob` and `milestone_label`. The first indicates whether the token is inside the milestone ("I"), outside the milestone ("O"), or at the beginning of the milestone ("B"). Tokens at the beginning of a milestone contain the complete text of the milestone as the `milestone_label` value; for all other tokens, `milestone_label` is an empty string.

In [None]:
# Display the tokens with their milestone information
for token in doc:
    print(token.text, token.i, token._.milestone_iob, token._.milestone_label)

By setting `mode` to "phrase", you can match multiple tokens.

In [None]:
# Create a TokenMilestones object with the spaCy Doc and find matches for the phrase
milestones = TokenMilestones(doc=doc)
matches = milestones.get_matches(patterns="Chapter 1", mode="phrase")

# Set new milestones using the matches found
milestones.set_milestones(matches)

# Display the tokens with their milestone information
for token in doc:
    print(token.text, token.i, token._.milestone_iob, token._.milestone_label)
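
With a multi-token match, all three IOB tags come into play. The sketch below imitates the tagging scheme described earlier using plain Python lists instead of spaCy tokens. The helper `tag_milestones` and the hand-written token list are hypothetical, and the label is built by joining tokens with spaces, which is an assumption rather than documented Lexos behaviour.

```python
def tag_milestones(tokens, matches):
    """Assign an IOB tag and a label to every token.

    `matches` holds (start, end) token indices, with `end` exclusive.
    Hypothetical helper for illustration only; not part of Lexos.
    """
    iob = ["O"] * len(tokens)
    labels = [""] * len(tokens)
    for start, end in matches:
        # The first token of a match is tagged "B" and carries the full label
        iob[start] = "B"
        labels[start] = " ".join(tokens[start:end])
        # Remaining tokens of the match are tagged "I" and keep an empty label
        for i in range(start + 1, end):
            iob[i] = "I"
    return iob, labels


tokens = ["Chapter", "1", ":", "Introduction", ".", "Chapter", "2", ":", "Methods", "."]
iob, labels = tag_milestones(tokens, [(0, 2), (5, 7)])

for i, token in enumerate(tokens):
    print(token, i, iob[i], repr(labels[i]))
```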

Set `mode` to "rule" to use more complex spaCy rule matching patterns. See the [spaCy documentation](https://spacy.io/usage/rule-based-matching) for a full description of the syntax.

In [None]:
pattern = [{"TEXT": "Chapter"}, {"IS_DIGIT": True}]
milestones = TokenMilestones(doc=doc)
matches = milestones.get_matches(patterns=[pattern], mode="rule")
milestones.set_milestones(matches)
for token in doc:
    print(token.text, token._.milestone_iob, token._.milestone_label)
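
To make the pattern above easier to read, here is a toy re-implementation of the matching logic in plain Python. It supports only the `TEXT` and `IS_DIGIT` keys and none of spaCy's operators or other attributes, so treat it as an illustration of how a list of dictionaries lines up with consecutive tokens, not as a substitute for spaCy's `Matcher`. The function names are hypothetical.

```python
def token_matches(token, spec):
    """Check one token string against one pattern dictionary (TEXT and IS_DIGIT only)."""
    if "TEXT" in spec and token != spec["TEXT"]:
        return False
    if "IS_DIGIT" in spec and token.isdigit() != spec["IS_DIGIT"]:
        return False
    return True


def find_rule_matches(tokens, pattern):
    """Return (start, end) token indices where each dictionary matches in order."""
    width = len(pattern)
    return [
        (i, i + width)
        for i in range(len(tokens) - width + 1)
        if all(token_matches(tokens[i + j], spec) for j, spec in enumerate(pattern))
    ]


tokens = ["Chapter", "1", ":", "Introduction", ".", "Chapter", "2", ":", "Methods", "."]
pattern = [{"TEXT": "Chapter"}, {"IS_DIGIT": True}]

for start, end in find_rule_matches(tokens, pattern):
    print(start, end, tokens[start:end])
```

Each dictionary in the pattern describes exactly one token, so a two-item pattern can only match a sequence of two adjacent tokens.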

## `SpanMilestones`

Span milestones are used to group spans together for analysis or visualization. They differ from the string and token milestones described above in that their milestones are "invisible" structural boundaries between spans or groups of spans (e.g. sentence or line breaks) rather than matched text. Thus, instead of storing a list of patterns representing milestones, span milestones store the groups of spans themselves.

There are three subclasses that inherit from `SpanMilestones`: `LineMilestones`, `SentenceMilestones`, and `CustomMilestones`.

### LineMilestones

The `LineMilestones` class is the easiest to understand. It splits the text on line breaks and generates a list of spaCy `Span` objects. These can be accessed through the `spans` attribute of both the `Milestones` and the `Doc` objects:


In [None]:
# Import the Lexos LineMilestones class
from lexos.milestones.span_milestones import LineMilestones

# Create a spaCy Doc object
text = "Chapter 1: Introduction.\nChapter 2: Methods."
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc(text)

# Create LineMilestones object with the spaCy Doc and set the milestones
milestones = LineMilestones(doc=doc)
milestones.set()

# Print the milestone span text in both the milestones and Doc objects
print(milestones.spans)
print(doc.spans)

You can iterate over the `milestones` object directly to access each span in `milestones.spans`, as shown below:

In [None]:
for milestone in milestones:
    print(milestone.start, milestone.end, milestone.text)

There is also a `to_list()` method, which returns a list of dictionaries providing additional indexing information, should you need it.

In [None]:
print(milestones.to_list())

By default, the pattern used to identify line breaks is "\n", but this can be customized with the `pattern` keyword when calling `set`. By default, line breaks are excluded from the milestone spans; to keep them, set `remove_linebreak=False`.
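
The effect of these two options can be sketched with plain Python lists. The helper `split_lines` is hypothetical: its parameter names mirror the keywords mentioned above, but it assumes that each line break is its own token and that a retained line break is attached to the end of the preceding line. Lexos may handle both details differently.

```python
def split_lines(tokens, pattern="\n", remove_linebreak=True):
    """Group tokens into lines, splitting at each token equal to `pattern`.

    Hypothetical helper for illustration only; not part of Lexos.
    """
    lines, current = [], []
    for token in tokens:
        if token == pattern:
            # Optionally keep the line break at the end of the current line
            if not remove_linebreak:
                current.append(token)
            lines.append(current)
            current = []
        else:
            current.append(token)
    # Keep any trailing tokens that follow the last line break
    if current:
        lines.append(current)
    return lines


tokens = ["Chapter", "1", "\n", "Chapter", "2"]
print(split_lines(tokens))
print(split_lines(tokens, remove_linebreak=False))
```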

### SentenceMilestones

The `SentenceMilestones` class works in a similar way:

In [None]:
# Import the Lexos SentenceMilestones class
from lexos.milestones.span_milestones import SentenceMilestones

# Create a spaCy Doc object using a model with a sentence segmenter
text = "This is sentence 1. This is sentence 2."
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc(text)
print(f"Doc Sentences: {list(doc.sents)}")

# Create SentenceMilestones object with the spaCy Doc and set the milestones
milestones = SentenceMilestones(doc=doc)
milestones.set()

# Print the text of the spans in a list
print(f"Milestone spans: {milestones.spans}")
print(f"Doc spans: {doc.spans}")
print(f"Milestones list: {milestones.to_list()}")

Note that the `Doc` object already has a `sents` attribute that contains a generator of sentence spans. This is generated automatically *if and only if* your language model has a sentence segmenter. If it does not, you cannot use the `SentenceMilestones` class and will need to rely on the custom approach discussed below. See the [spaCy documentation](https://spacy.io/usage/linguistic-features#sbd) for further information on creating `Doc` objects with sentence segmentation in the pipeline.

### CustomMilestones

The `CustomMilestones` class can be used to generate milestones based on arbitrary spans. A good way to demonstrate this is to reproduce the sentence segments shown above.

In [None]:
# Import the Lexos CustomMilestones class
from lexos.milestones.span_milestones import CustomMilestones

# Create two spaCy Span objects from the existing Doc
spans = [doc[0:5], doc[5:10]]

# Create CustomMilestones object with the spaCy Doc and set the milestones
milestones = CustomMilestones(doc=doc)
milestones.set(spans)

# Print the text of the spans in a list
print(f"Milestone spans: {milestones.spans}")

Here we have manually set our spans to include the first and last five tokens, which happen to coincide with sentence boundaries. But we could easily create spans separated in other ways.

Note that, unlike the previous two classes, `CustomMilestones` requires you to pass a list of `Span` objects to the `set` method.

### Additional Settings and Methods

All three classes have additional `max_label_length` and `step` parameters. The `max_label_length` parameter sets the maximum number of characters in a token's `milestone_label` attribute (the default is 20). The `step` parameter takes an integer indicating the number of spans per item in the milestones list. For instance, if you wanted to have a milestone every tenth sentence, setting `step=10` would mean that every item in the `milestones.spans` list would consist of ten sentences. This parameter can similarly be used to group lines or custom spans.
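
The grouping performed by `step` can be pictured as chunking a list. The sketch below uses plain strings in place of spaCy spans, and the helper `group_by_step` is hypothetical. It keeps any leftover spans as a shorter final group, which is an assumption; the text above does not say how Lexos treats a remainder.

```python
def group_by_step(spans, step=1):
    """Combine every `step` consecutive spans into one group.

    Hypothetical helper for illustration only; not part of Lexos.
    """
    return [spans[i:i + step] for i in range(0, len(spans), step)]


sentences = [f"Sentence {n}." for n in range(1, 8)]

# Seven sentences grouped three at a time
for group in group_by_step(sentences, step=3):
    print(group)
```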

All three classes have a `reset` method, which will remove all spans from both the `Milestones` and `Doc` objects.
