# Rolling Windows Tutorial

This notebook will guide you through analyzing text patterns using the Lexos `rolling_windows` module. Whether you want to track word frequencies, compare character usage, or visualize how themes change throughout a document, this guide will show you exactly what to do.

First, let's load a sample text:

In [None]:
# CHANGE THIS: Replace "A_Scandal_in_Bohemia.txt" with your filename
filename = "A_Scandal_in_Bohemia.txt"

# Import the Lexos Loader class
from lexos.io import Loader

# Create a Loader instance and load the file
loader = Loader()
loader.load([filename])

# Remove extra whitespace and newlines
text = loader.texts[0]
text = " ".join(text.split())
print(f"✓ Loaded {len(text)} characters from {filename}")

## Creating Windows

Windows are overlapping segments of your text. Think of a sliding frame that moves through your document, taking a "snapshot" of the text at each position.

Before beginning your analysis, you need to decide on your window settings. The first decision is how to segment your text, by *n* characters or *n* tokens. Segmenting by characters is good for analysing patterns of symbols such as punctuation marks or individual letters. Segmenting by tokens is good for studying word patterns or patterns of larger semantic units.
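
To make the difference concrete, here is a minimal plain-Python sketch (not the Lexos implementation) that slides a three-unit window over the same sentence, first by characters and then by words. The variable names and the naive `split()` tokenization are assumptions made for illustration only; Lexos uses spaCy for real tokenization.

```python
# Plain-Python illustration only; in practice the Lexos Windows class does this for you.
sentence = "To Sherlock Holmes she is always the woman."
n = 3  # window size

# Character windows: each window is a substring of n characters.
char_windows = [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

# Token windows: each window is a list of n words (naive whitespace tokens).
tokens = sentence.split()
token_windows = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

print(char_windows[:3])   # ['To ', 'o S', ' Sh']
print(token_windows[:2])  # [['To', 'Sherlock', 'Holmes'], ['Sherlock', 'Holmes', 'she']]
```

Character windows can cut through the middle of a word, which is fine for studying letters or punctuation but awkward for studying vocabulary.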

In the cell below, we'll create windows of 10,000 characters each.

In [None]:
# Import the Windows class
from lexos.rolling_windows import Windows

# Create a Windows object
windows = Windows()

# Define your window settings and generate windows
windows(input=text, n=10000, window_type="characters")

# Print the number of windows
print(f"Generated windows: {windows.length}")

# Iterate through the first five windows and print the first 20 characters of each
for window in list(windows)[:5]:
    print(window[:20])

print(f"Remaining windows: {windows.length}")


Some important observations:

1. Each window is 10,000 characters long (we're only displaying the first 20), and each window starts one character after the previous one.
2. The `windows` instance produces a generator. If you need to know the number of windows, use the `length` property.
3. The generator is exhausted once you iterate through it (in the cell above, this happens when it is converted to a list). If you need to access your windows a second time, you will have to regenerate them.
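
The observations above can be reproduced with an ordinary Python generator. The sketch below is not how Lexos is implemented; `sliding_windows` is a hypothetical helper written for this illustration, and it assumes windows advance one unit at a time.

```python
from typing import Iterator, Sequence


def sliding_windows(units: Sequence, n: int) -> Iterator[Sequence]:
    """Yield every run of n consecutive units, advancing one unit at a time."""
    for start in range(len(units) - n + 1):
        yield units[start:start + n]


sample = "abcdefgh"
windows_gen = sliding_windows(sample, n=5)

first_pass = list(windows_gen)   # consumes the generator
second_pass = list(windows_gen)  # nothing left to yield

print(first_pass)   # ['abcde', 'bcdef', 'cdefg', 'defgh']
print(second_pass)  # []

# To iterate again, build a fresh generator.
windows_gen = sliding_windows(sample, n=5)
print(len(list(windows_gen)))  # 4, which is len(sample) - n + 1
```

This is why the cells in this tutorial rebuild their windows before each new calculation.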

We can similarly generate windows of tokens if we have a spaCy `Doc` object. We'll create one below using our text and spaCy's small English language model.

In [None]:
# Import the Lexos Tokenizer class
from lexos.tokenizer import Tokenizer

# Create a Tokenizer instance and make a spaCy Doc
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc(text)

We can now generate windows from our spaCy doc.

In [None]:
# Define your window settings and generate windows
windows(input=doc, n=100, window_type="tokens")

# Iterate through the first five windows and print the first 20 characters of each
for window in list(windows)[:5]:
    print(window[:20])

Here we can see that our windows consist of tokens, in this case words (100 per window). Each window advances one word.

Larger structural divisions can also be used as tokens. For instance, we can split our text on line breaks and submit the list of lines as our input. The lines will be counted like tokens. This can be useful for studying texts like poetry. Note that the code below removes extra whitespace but keeps single line breaks.

In [None]:
import re

# Strip any leading/trailing whitespace
lines = loader.texts[0].strip()
# Replace multiple spaces with a single space
lines = re.sub(' +', ' ', lines)
# Replace multiple newlines with a single newline
lines = re.sub('\n+', '\n', lines)
# Split into lines and strip each line
lines = lines.split('\n')
lines = [line.strip() for line in lines]

# Generate windows of lines (treating each line as a token)
windows(input=lines, n=5, window_type="tokens")
for window in list(windows)[:5]:
    # print the first 20 characters of each window
    print(window[:20])
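
To see what the cleanup steps do, here is the same sequence applied to a tiny made-up sample, followed by plain-Python windows of two lines each. The sample text and variable names are invented for illustration, and the list comprehension only approximates what the `Windows` class produces.

```python
import re

sample = "  Roses are   red,\n\n\nViolets are blue.  \n Sugar is sweet.\n"

cleaned = sample.strip()                 # drop leading/trailing whitespace
cleaned = re.sub(" +", " ", cleaned)     # collapse runs of spaces
cleaned = re.sub("\n+", "\n", cleaned)   # collapse runs of newlines
sample_lines = [line.strip() for line in cleaned.split("\n")]
print(sample_lines)
# ['Roses are red,', 'Violets are blue.', 'Sugar is sweet.']

# Treat each line as a token and slide a window of two lines over the list.
n = 2
line_windows = [sample_lines[i:i + n] for i in range(len(sample_lines) - n + 1)]
print(line_windows)
# [['Roses are red,', 'Violets are blue.'], ['Violets are blue.', 'Sugar is sweet.']]
```

One caveat of this approach: a "blank" line that contains only spaces is not removed by the newline pattern and ends up as an empty string in the list, so you may want to filter out empty lines before generating windows.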

## Rolling Windows Calculators

Once we have created our windows, we want to calculate how the frequency of patterns changes from window to window. By "patterns", we mean characters or character sequences, tokens such as words, or perhaps larger spans such as lines or sentences. "Frequency" can be understood in many ways. At a bare minimum, it can be a raw count of the number of occurrences. However, we can also apply further calculations to normalize for window or document length, or to compare the frequencies of different patterns. The `rolling_windows` module lets you select a calculator class to decide which kind of calculation to apply. Lexos comes with three calculator classes: `Counts`, `Averages`, and `Ratios`. (You can also design your own and have it inherit from the `BaseCalculator` class.) In the sections below, we will go over the usage of each of the built-in classes.

### The `Counts` Class

The `Counts` class counts the raw number of occurrences of each of your desired patterns in your windows, without normalization. It is a good way to learn how to implement each of the other classes. A sample implementation is given in the cell below. Try changing `patterns_to_count` (some examples are given in comments).

Other settings to experiment with:

- Try changing the window size (`n`) and the `window_type` when you generate your windows.
- Try changing the case sensitivity option in the calculator. This determines whether the counter matches patterns regardless of case or distinguishes upper and lower case when matching patterns.

In [None]:
# Import the Counts calculator
from lexos.rolling_windows.calculators.counts import Counts

# CHANGE THESE: Your patterns to search for (leave exactly one line uncommented)
# patterns_to_count = ["a", "e"]  # Example: Common letters
# patterns_to_count = ["love", "hate"]  # Example: Emotional words
patterns_to_count = ["Sherlock", "Watson"]  # Example: Character names

# Create a new set of windows for counting
windows = Windows()
windows(input=doc, n=5000, window_type="tokens")

# Create the calculator and run the analysis
counter = Counts()
counter(
    patterns=patterns_to_count,
    windows=windows,
    mode="exact",
    case_sensitive=True
)

# Get results as a DataFrame
count_results = counter.to_df()
print(f"✓ Counted patterns in {len(count_results)} windows")
print("\nSample results (5 random windows):")
print(count_results.sample(5))
print("\nTotal occurrences across all windows:")
print(count_results.sum())

> Remember that your windows are a generator, which will be exhausted when you run it through a calculator. If you change your calculator settings, you will need to rebuild your windows. 
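
As a rough mental model of what a raw count per window means, and of what the case sensitivity option changes, here is a plain-Python sketch for character windows. `count_in_window` is a hypothetical helper, not a Lexos function, and the real `Counts` calculator may match patterns differently (especially for token windows).

```python
def count_in_window(window: str, pattern: str, case_sensitive: bool = True) -> int:
    """Count non-overlapping occurrences of pattern in one character window."""
    if not case_sensitive:
        window, pattern = window.lower(), pattern.lower()
    return window.count(pattern)


sample = "Holmes and Watson. HOLMES laughed; holmes sat."

print(count_in_window(sample, "Holmes"))                        # 1
print(count_in_window(sample, "holmes", case_sensitive=False))  # 3

# Rolling version: one count per 20-character window, advancing one character.
n = 20
rolling_counts = [
    count_in_window(sample[i:i + n], "holmes", case_sensitive=False)
    for i in range(len(sample) - n + 1)
]
print(rolling_counts[0])  # 1 (the first window is "Holmes and Watson. H")
```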

Notice that the calculator's `mode` parameter is set to "exact". This tells the calculator to match the string patterns you pass to it literally (ignoring case when `case_sensitive=False`). You can also pass regular expressions, using `mode="regex"`. An example is given below:

In [None]:
# Create a new set of windows for counting
windows = Windows()
windows(input=text, n=5000, window_type="characters")

# Create the calculator and run the analysis matching tokens starting with "sh"
counter = Counts()
counter(
    patterns=[r"\bsh\w*"],  # \b = word boundary, \w* = any word characters
    windows=windows,
    mode="regex",
    case_sensitive=False
)

# Get results as a DataFrame
count_results = counter.to_df()
print("REGEX EXAMPLE - Words starting with 'sh':\n")
print(f"✓ Counted patterns in {len(count_results)} windows")
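# Look up the total outside the f-string: a backslash inside an f-string expression is a SyntaxError before Python 3.12
sh_total = count_results[r"\bsh\w*"].sum()
print(f"✓ Total occurrences across all windows: {sh_total}")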
print("✓ Sample results (5 random windows):")
count_results.sample(5)
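
If you are unsure what a regular expression will match, it can help to try it on a short string with Python's built-in `re` module first. The sketch below assumes the calculator's regex mode behaves like `re` with or without `re.IGNORECASE`; that is an assumption for illustration, not a statement about Lexos internals.

```python
import re

window = "She sells seashells by the shore; Sherlock shrugged."
pattern = r"\bsh\w*"  # \b = word boundary, \w* = any word characters

insensitive = re.findall(pattern, window, flags=re.IGNORECASE)
sensitive = re.findall(pattern, window)

print(insensitive)  # ['She', 'shore', 'Sherlock', 'shrugged']
print(sensitive)    # ['shore', 'shrugged']
```

Note that "seashells" is not matched: its "sh" is in the middle of the word, so there is no word boundary in front of it.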


You can also search for multi-word phrases in spaCy `Doc` objects. Set your window output to `spans` and your calculator's `mode` to "multi_token".

In [None]:
# Create a new windows instance for counting
windows = Windows()
windows = windows(
    input=doc,
    n=5000,
    window_type="tokens",
    output="spans"
)

# Create the calculator and run the analysis matching the token pattern "sherlock holmes"
counter = Counts()
counter(
    patterns=["sherlock holmes"],
    windows=windows,
    mode="multi_token",
    case_sensitive=False
)

results = counter.to_df()
print("MULTI-WORD PHRASE EXAMPLE - 'sherlock holmes':")
print(f"Found in {len(results)} windows")
print(f"Total occurrences: {results.sum().values[0] if len(results) > 0 else 0}")
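
Conceptually, a multi-word match looks for the words of the phrase as consecutive tokens. The plain-Python sketch below shows that idea on a list of strings; `count_phrase` is a hypothetical helper, and the Lexos calculator works on spaCy spans rather than on plain lists, so treat this only as an illustration.

```python
def count_phrase(tokens: list[str], phrase: str, case_sensitive: bool = True) -> int:
    """Count positions where the words of phrase occur as consecutive tokens."""
    target = phrase.split()
    if not target:
        return 0
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
        target = [t.lower() for t in target]
    size = len(target)
    return sum(
        tokens[i:i + size] == target
        for i in range(len(tokens) - size + 1)
    )


window_tokens = "To Sherlock Holmes she is always the woman , said Sherlock Holmes".split()

print(count_phrase(window_tokens, "sherlock holmes", case_sensitive=False))  # 2
print(count_phrase(window_tokens, "sherlock holmes"))                        # 0
```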


If you have token windows derived from a spaCy `Doc`, you can search for linguistic patterns. In the example below, we search for proper nouns using the part-of-speech tags assigned by spaCy's small English-language model.

In [None]:
# Create a new Windows instance for counting
windows = Windows()
windows = windows(
    input=doc,
    n=5000,
    window_type="tokens",
    output="spans"
)

# Create the calculator and run the analysis matching proper nouns
counter = Counts()
counter(
    patterns=[[{"POS": "PROPN"}]],
    windows=windows,
    mode="spacy_rule",
    model="en_core_web_sm"
)

results = counter.to_df()
print("LINGUISTIC PATTERN EXAMPLE - Proper nouns:")
print(f"Found in {len(results)} windows")
print(f"Total occurrences: {results.sum().values[0]}")


### The `Averages` Class

The `Averages` calculator is used when you want to know **how often** patterns appear. Calculating the average frequency per unit normalizes the frequency, which can be helpful when comparing across window sizes or texts of different lengths.

The usage is very similar to the `Counts` class.

In [None]:
# Import the Averages calculator
from lexos.rolling_windows.calculators.averages import Averages

# CHANGE THESE: Your patterns to analyze
patterns_to_average = ["a", "e"]  # Same format as counting

# Create fresh windows for averaging
windows = Windows()
windows = windows(
    input=text,
    n=5000,
    window_type="characters",
    output="strings"
)

# Create and call the calculator
calculator = Averages()
calculator(
    patterns=patterns_to_average,
    windows=windows,
    mode="exact",
    case_sensitive=False
)

# Get results
results = calculator.to_df()
print(f"✓ Calculated average frequencies in {len(results)} windows")
print("\nRandom 5 windows:")
print(results.sample(5))
print("\nOverall average frequency:")
print(results.mean())
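
The sketch below illustrates why normalizing helps. It assumes that an average is simply the raw count divided by the window length; this is an assumption made for the example, and the `Averages` calculator may normalize differently. `average_frequency` is a hypothetical helper, not part of Lexos.

```python
def average_frequency(window: str, pattern: str) -> float:
    """Occurrences of pattern per character of the window (case-insensitive)."""
    if not window:
        return 0.0
    return window.lower().count(pattern.lower()) / len(window)


short_window = "banana"      # 6 characters
long_window = "banana" * 3   # 18 characters

# Raw counts differ only because the windows differ in size...
print(short_window.count("a"), long_window.count("a"))  # 3 9

# ...while the normalized frequencies are directly comparable.
print(average_frequency(short_window, "a"))  # 0.5
print(average_frequency(long_window, "a"))   # 0.5
```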

### The `Ratios` Class

The `Ratios` calculator class is used to compare the **balance between exactly two patterns**. The following notes are helpful in interpreting the results:

- **0.0** = Only the second pattern appears
- **0.5** = Both patterns appear equally
- **1.0** = Only the first pattern appears
- Values closer to 0 favor the second pattern
- Values closer to 1 favor the first pattern

> **Important:** The `Ratios` calculator requires **exactly** 2 patterns. For comparing more than two patterns, use `Counts` or `Averages`.

In [None]:
# Import the Ratios calculator
from lexos.rolling_windows.calculators.ratios import Ratios

# CHANGE THESE: Exactly 2 patterns to compare
patterns_to_compare = ["a", "e"]  # Will calculate a / (a + e)
# patterns_to_compare = ["positive", "negative"]  # Example: sentiment balance
# patterns_to_compare = ["past", "present"]  # Example: tense analysis

# Create fresh windows for ratios
windows = Windows()
windows = windows(input=text, n=5000, window_type="characters", output="strings")

# Create and run the calculator
calculator = Ratios()
calculator(
    patterns=patterns_to_compare, windows=windows, mode="exact", case_sensitive=False
)

# Get results
results = calculator.to_df()
print(f"✓ Calculated ratios in {len(results)} windows")
print("\nSample results (5 random windows):")
print(results.sample(5))

# Interpret the results
avg_ratio = results.iloc[:, 0].mean()
print(f"\nAverage ratio: {avg_ratio:.3f}")
print(
    f"→ '{patterns_to_compare[0]}' appears more frequently overall"
    if avg_ratio > 0.5
    else f"→ '{patterns_to_compare[1]}' appears more frequently overall"
    if avg_ratio < 0.5
    else "→ Both patterns appear equally overall"
)
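
The arithmetic behind a single ratio can be sketched in a few lines. The formula first / (first + second) follows the comment in the cell above; returning NaN when neither pattern occurs is a choice made for this sketch only, and the `Ratios` calculator may handle that case differently. `pattern_ratio` is a hypothetical helper.

```python
import math


def pattern_ratio(first_count: int, second_count: int) -> float:
    """Return first / (first + second), or NaN when neither pattern occurs."""
    total = first_count + second_count
    return first_count / total if total else math.nan


window = "a sea tale"
a_count = window.count("a")  # 3
e_count = window.count("e")  # 2

print(pattern_ratio(a_count, e_count))  # 0.6, so "a" is favored in this window
print(pattern_ratio(0, 4))              # 0.0, only the second pattern appears
print(pattern_ratio(4, 0))              # 1.0, only the first pattern appears
```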


## Visualizing Results

Now let's create visualizations of your analysis. You can create either static plots (for reports, publications, and presentations) using the `SimplePlotter` class or interactive plots (for exploration) using the `PlotlyPlotter` class.

### Using `SimplePlotter` for Static Plots

For illustration, we'll calculate averages again, this time for the patterns "holmes" and "watson".

In [None]:
# Import the Averages calculator
from lexos.rolling_windows.calculators.averages import Averages

# Create fresh windows for averaging
windows = Windows()
windows = windows(
    input=text,
    n=5000,
    window_type="characters",
    output="strings"
)

# Create and call the calculator
calculator = Averages()
calculator(
    patterns=["holmes", "watson"],
    windows=windows,
    mode="exact",
    case_sensitive=False
)

# Get results
results = calculator.to_df()

Now import the `SimplePlotter` class and generate the plot. You can uncomment the code at the bottom of the cell to save the plot.

In [None]:
# Import the SimplePlotter class
from lexos.rolling_windows.plotters.simple_plotter import SimplePlotter

# CHANGE THIS: Use your results from above
data_to_plot = results

# Create a static plot
plotter = SimplePlotter(
    df=data_to_plot,
    title="Pattern Frequency Analysis",  # CHANGE THIS
    xlabel="Window Position",
    ylabel="Frequency",  # CHANGE THIS based on your analysis type
    figsize=(10, 6),  # (width, height) in inches
    show_legend=True,
    show_grid=True
)

# Generate the plot
plotter.plot()

# Optional: Save the plot
# plotter.save("my_analysis.png", dpi=300)
# plotter.save("my_analysis.pdf")  # For publications

By default, the `plotter.plot()` method displays the plot. If you do not want this, for example because you only want to save the file, use `plotter.plot(show=False)`. To show the figure later, call `plotter.show()`.

#### Adding Milestones

Milestones mark important points in your text (like chapter breaks, scene changes, etc.), which can help you interpret the graph. You can pass a dictionary of milestone labels and location indexes. For instance, "Chapter 1" might occur at the beginning of the document (index 0) and "Chapter 2" might begin at character or token 5000. In this case, your dictionary would look like `{"Chapter 1": 0, "Chapter 2": 5000}`.

You can construct a milestone dictionary on your own, but it is easier to use the Lexos `milestones` module to calculate the milestone locations. A simple example is given below (see the Lexos `milestones` documentation for all the options).

We know that our source document has three chapters with the word "CHAPTER" all in capital letters and the chapter numbers in Roman numerals. It is therefore possible to use a fairly simple regular expression to detect all three chapters.

In [None]:
from lexos.milestones.string_milestones import StringMilestones

milestones = StringMilestones(doc=text, patterns="CHAPTER I+")
milestone_dict = {milestone.text: milestone.start for milestone in milestones}
milestone_dict
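
For comparison, a dictionary of the same shape can be built by hand with Python's `re` module. The sketch below uses a short made-up sample rather than the tutorial text, and it assumes that milestone locations are character offsets; check the `milestones` documentation for what `milestone.start` holds in your setup.

```python
import re

sample = (
    "CHAPTER I To Sherlock Holmes she is always the woman. "
    "CHAPTER II At three o'clock precisely I was at Baker Street. "
    "CHAPTER III I slept at Baker Street that night."
)

# Map each matched heading to the character offset where it starts.
manual_milestones = {
    match.group(): match.start()
    for match in re.finditer(r"CHAPTER I+", sample)
}

print(list(manual_milestones))  # ['CHAPTER I', 'CHAPTER II', 'CHAPTER III']
print(manual_milestones["CHAPTER I"])  # 0
```

The pattern `CHAPTER I+` is enough for three chapters, but it would cut "CHAPTER IV" short at "CHAPTER I"; a longer text would need a broader pattern such as `CHAPTER [IVXLC]+`.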


We're now ready to create our plot with milestones.

In [None]:
# Create plot with calculated milestones
plotter = SimplePlotter(
    df=data_to_plot,
    title="Analysis with Calculated Markers",
    figsize=(12, 6),
    show_milestones=True,
    milestone_labels=milestone_dict,
    milestone_colors="red",
    milestone_style="--",  # Dashed lines
    show_milestone_labels=True,
    milestone_labels_rotation=45
)

plotter.plot()

### Using `PlotlyPlotter` for Interactive Plots

Interactive plots can be useful for exploration or web presentation because you can pan and zoom around the image, as well as access detailed information when hovering over coordinates in the plot. Lexos has the `PlotlyPlotter` class, which uses the Python <a href="https://plotly.com/python/" target="_blank">Plotly</a> library to generate a graph from Rolling Windows data. Most of the parameters parallel those in the `SimplePlotter` class, but see the API documentation for a list of all options. The example below illustrates a typical usage similar to the previous example.

Here is a quick list of the interactive features (a toolbar appears when you hover over the plot):

- **Hover** over points to see exact values
- **Zoom** in/out with mouse wheel or zoom controls
- **Pan** by clicking and dragging
- **Toggle** lines on/off by clicking legend items
- **Download** plot as PNG using the camera icon

In [None]:
# Import the PlotlyPlotter class
from lexos.rolling_windows.plotters.plotly_plotter import PlotlyPlotter

# Create an interactive plot
plotter = PlotlyPlotter(
    df=data_to_plot,
    title="Interactive Pattern Analysis",  # CHANGE THIS
    xlabel="Window Position",
    ylabel="Frequency",  # CHANGE THIS based on your analysis type
    width=800,  # pixels
    height=600,  # pixels
    showlegend=True,
    show_milestones=True,
    milestone_labels=milestone_dict,
    show_milestone_labels=True,
)

# Generate the plot
plotter.plot()

# Optional: Save as interactive HTML
# plotter.save("my_analysis.html")

## Saving Your Work

In the cells above, we have commented out code to save the plotter output. For the `SimplePlotter` class, this is the equivalent of matplotlib's <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html" target="_blank">pyplot.savefig</a> function and will accept any of its keyword arguments. The `PlotlyPlotter` class accepts any keyword arguments for the <a href="https://plotly.github.io/plotly.py-docs/generated/plotly.io.write_html.html" target="_blank">plotly.io.write_html</a> and <a href="https://plotly.github.io/plotly.py-docs/generated/plotly.io.write_image.html" target="_blank">plotly.io.write_image</a> functions. If your filename ends in `.html`, Lexos will call `write_html`; otherwise, it calls `write_image`.

To save your data, the easiest method is to create a pandas DataFrame and use its built-in methods. We have already repeatedly used `results = calculator.to_df()` to obtain our data. If we want to save it as a CSV file, we can use `pandas.DataFrame.to_csv`. This is illustrated in the cell below using the data we plotted last. For good measure, we'll generate some summary statistics and save them to a CSV file.

In [None]:
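import pandas as pd
from pathlib import Path

# Configure your output file path here
output = "sample_outputs/sample_analysis_data.csv"
summary_output = "sample_outputs/sample_analysis_summary.csv"

# Make sure the output folder exists before writing
Path(output).parent.mkdir(parents=True, exist_ok=True)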

# Save your numerical results as CSV
data_to_plot.to_csv(output, index=True)
print(f"✓ Saved data to {output}")

# Create a summary report
summary = pd.DataFrame({
    'Pattern': data_to_plot.columns,
    'Total': data_to_plot.sum().values,
    'Average': data_to_plot.mean().values,
    'Max': data_to_plot.max().values,
    'Min': data_to_plot.min().values,
    'Std_Dev': data_to_plot.std().values
})

print("\n=== SUMMARY REPORT ===")
print(summary)

# Save summary
summary.to_csv(summary_output, index=False)
print(f"✓ Saved summary to {summary_output}")

---

## Troubleshooting Common Issues

**Problem:** "Windows are consumed" error  
**Solution:** Create fresh windows for each calculator call

**Problem:** spaCy patterns don't work  
**Solution:** Ensure `window_type="tokens"` and a spaCy-based output such as `output="spans"` (as in the `spacy_rule` example above)

**Problem:** No matches found  
**Solution:** Check case sensitivity; try `case_sensitive=False`

**Problem:** Regex not working  
**Solution:** Use raw strings (`r"pattern"`) and escape special characters

**Problem:** Memory issues with large texts  
**Solution:** Reduce window size or limit text length (e.g., `text[:1000000]`)