**Damask** (ˈdæməsk; دمشق) is a reversible patterned fabric of silk, wool, linen, cotton, or synthetic fibers, with a pattern formed by weaving. 

This library, much like its namesake damask fabric, intertwines complexity and functionality into a seamless whole. 

Just as damask is known for its intricate, reversible patterns woven into a single piece of fabric, the Damask class weaves together text and annotations, allowing for rich, layered analysis without cutting or altering the original 'fabric' of the text. 

The library's ability to segment and annotate text non-destructively mirrors the way patterns in damask fabric are an integral part of its structure, rather than being merely printed or dyed on.

At its core, a Damask is meant to be a drop-in replacement for a string.

## Basic Text Operations

In [None]:
from damask.models import Damask


In [None]:

myDamask = Damask(
    "Sitting on the dock of the bay, waiting for the ..."
)

print(myDamask)

We're going to load Paul Graham's essay into a Damask to provide a more substantial piece of text to work with.

In [None]:
# Read the whole essay; the context manager closes the file afterwards
with open("PaulGrahamEssay.txt", "r") as essay_file:
    text = essay_file.read()

essay = Damask(text)

print(essay[0:100])

In the context of text processing, it's often necessary to divide large blocks of text into smaller segments. Typically, this process involves breaking the text into a new array of substrings, which can lead to the loss of the original text structure.

However, Damask offers a non-destructive alternative. Instead of creating an array of substrings, Damask retains the entire original text and records the positions where splits occur. This way, you can access the individual segments without losing the context of the whole text. Damask provides an easy-to-use interface to interact with these segments, allowing you to handle large texts more effectively while preserving their integrity.
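The difference between the two approaches can be sketched in plain Python. This is only an illustration of the idea of recording split positions, not Damask's implementation; the sample text and the regular expressions are assumptions chosen for the example.

```python
import re

text = "Sitting on the dock of the bay. Waiting for the tide. Wasting time."

# Destructive: only the pieces survive; their positions in the source are lost.
pieces = [p for p in re.split(r"(?<=\.)\s+", text) if p]

# Non-destructive: keep the source intact and record (start, end) offsets.
# The pattern is a deliberately naive sentence matcher used for illustration.
spans = [(m.start(), m.end()) for m in re.finditer(r"\S[^.]*\.", text)]

# Any segment can be recovered on demand by slicing the original text.
for start, end in spans:
    print(start, end, repr(text[start:end]))
```

Because the offsets point back into the unmodified source, the surrounding context of any segment is always one slice away.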

Splitting is performed by **Segmenters**.

Several are provided, but you can also write your own logic and apply it to a Damask by subclassing Segmenter and creating generators that parse the original text and identify the start/end indices of your desired segments.
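As a rough idea of what such a generator might look like, here is a standalone function that yields start/end indices for blank-line-separated paragraphs. The name `paragraph_spans` is hypothetical and not part of Damask, and the way a generator is wired into a Segmenter subclass is not shown here because it depends on the base class interface.

```python
import re
from typing import Iterator, Tuple


def paragraph_spans(text: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) offsets of paragraphs separated by blank lines.

    Hypothetical helper for illustration; not part of the Damask library.
    """
    start = 0
    # A separator is a newline, optional whitespace, then another newline.
    for separator in re.finditer(r"\n\s*\n", text):
        if separator.start() > start:
            yield (start, separator.start())
        start = separator.end()
    # Emit whatever follows the last separator, if anything.
    if start < len(text):
        yield (start, len(text))


sample = "First paragraph.\n\nSecond one,\nstill second.\n\n\nThird."
for begin, end in paragraph_spans(sample):
    print((begin, end), repr(sample[begin:end]))
```

The generator never copies or modifies the text; it only reports where each segment begins and ends.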

## Segmentation with Damask

In [None]:
from damask.segmenters import SentenceSegmenter, WordSegmenter, ChunkSegmenter

split_into_sentences = SentenceSegmenter()
split_into_words = WordSegmenter()
split_into_chunks = ChunkSegmenter(1024)


The Damask method **segment_text** applies a segmenter to the Damask and stores the resulting segments under a key (the `annotation_type` argument).

In [None]:
essay.segment_text(segmenter=split_into_sentences, annotation_type="sentences")
essay.segment_text(segmenter=split_into_words, annotation_type="words")
essay.segment_text(segmenter=split_into_chunks, annotation_type="chunks")
essay.segment_text(segmenter=ChunkSegmenter(512), annotation_type="chunks512")

A single Damask can maintain multiple different segmentations within the same structure:
- The original text is unchanged
- You can divide the text into sentences, words or custom-sized chunks or create your own segmentation logic.
- These divisions are not destructive - they are only created when you need them.
- You can access and work with different segments at any time without losing the context of the original full text.

You can access the segments via:

\<your instance\>.\<your chunk key name\>.texts (to get all the chunks)

\<your instance\>.\<your chunk key name\>.annotations (to access the annotation and associated metadata)
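As a rough mental model of these two accessors, the sketch below keeps several named sets of offsets over one source string and derives the segment strings only when `texts` is read. The class `SketchAnnotationSet` and the dictionary layout are assumptions made for illustration; they are not Damask's actual classes.

```python
class SketchAnnotationSet:
    """Illustrative stand-in: offsets over a shared source string."""

    def __init__(self, source, spans):
        self.source = source
        self.annotations = list(spans)  # (start, end) pairs

    @property
    def texts(self):
        # Strings are built on demand by slicing the untouched source.
        return [self.source[start:end] for start, end in self.annotations]


source = "dock of the bay"
sets = {
    "words": SketchAnnotationSet(source, [(0, 4), (5, 7), (8, 11), (12, 15)]),
    "halves": SketchAnnotationSet(source, [(0, 7), (8, 15)]),
}

print(sets["words"].texts)
print(sets["halves"].texts)
```

Both sets refer to the same source string, so adding another segmentation costs only a list of offsets.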



In [None]:
print(essay.sentences.texts[0:10])
print(essay.get_annotation_sets())
print(essay.chunks.texts[0:10])

This is achieved by keeping track of each **Annotation**, which is essentially a segment with some metadata attached.

In [None]:
# We print just the first annotation for each annotation type

print(essay.sentences.annotations[0])
print(essay.words.annotations[0])
print(essay.chunks.annotations[0])

You can attach any metadata to the annotations. By default, segmenting creates a basic annotation with a start index, an end index, and a metadata dictionary that contains a type name and a UUID.
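A minimal sketch of such a record, assuming field names `start`, `end` and `metadata` and metadata keys `"type"` and `"uuid"`. These names are illustrative guesses; printing a real annotation shows the fields Damask actually uses.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchAnnotation:
    """Illustrative annotation: a span plus free-form metadata."""

    start: int
    end: int
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_annotation(start: int, end: int, type_name: str) -> SketchAnnotation:
    # The "type" and "uuid" key names are assumptions for this sketch.
    return SketchAnnotation(start, end, {"type": type_name, "uuid": str(uuid.uuid4())})


text = "Sitting on the dock of the bay."
annotation = make_annotation(0, 7, "words")

# Extra metadata can be attached later without touching the text.
annotation.metadata["length"] = annotation.end - annotation.start
print(text[annotation.start:annotation.end], annotation.metadata)
```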

We're now going to use NLTK to annotate each sentence with a sentiment score. We can use any Python functionality in this annotation function.

In [None]:
from damask.annotators import PosAnnotator, SentimentAnnotator, LengthAnnotator
from damask.annotators import EmbeddingAnnotator, ChatCompletionAnnotator
essay.enrich_annotations(
    enricher=SentimentAnnotator(), annotation_type="sentences", parallel=True, workers=20
)
essay.enrich_annotations(
    enricher=PosAnnotator(), annotation_type="words", parallel=True, workers=20
)
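Conceptually, enrichment maps a function over every annotation of one type and merges the result into that annotation's metadata. The sketch below shows the pattern with a thread pool and a toy word-list scorer. It does not use NLTK, the names `toy_sentiment`, `POSITIVE` and `NEGATIVE` are hypothetical, and using threads is an assumption for the example rather than a statement about how `enrich_annotations` parallelises work.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy lexicon, for illustration only; a real annotator would use a proper model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def toy_sentiment(sentence: str) -> dict:
    """Return a metadata fragment with a crude positive-minus-negative count."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {"sentiment": score}


sentences = ["I love this essay.", "The ending was bad.", "It has three parts."]
records = [{"text": s, "metadata": {}} for s in sentences]

# pool.map returns results in input order, so zip pairs them up correctly.
with ThreadPoolExecutor(max_workers=4) as pool:
    for record, extra in zip(records, pool.map(toy_sentiment, sentences)):
        record["metadata"].update(extra)

for record in records:
    print(record)
```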


Similarly, we can embed the "sentences":

In [None]:
"""
essay.enrich_annotations(
    enricher=EmbeddingAnnotator(), annotation_type="sentences", parallel=True, workers=40
)
"""


...or use chat completion to list questions a "chunk" answers

In [None]:
prompt = "You are a binary classifier that is given: A context, A user question and a Classifier task."
user_prompt = """Context: \"\"\"{text}\"\"\" 
User Question: \"What did the author do in summer of 2006?\"
Task: If the context directly answers the question, return 1, and cite how it answers the question, linking context to question.
If not, return 0
"""

"""
essay.enrich_annotations(
    enricher=ChatCompletionAnnotator(
        system_prompt=prompt,
        user_prompt=user_prompt,
    ),
    annotation_type="chunks",
    parallel=True,
    workers=10,
)
"""


In [None]:
# print(essay.chunks.annotations)
"""
for chunk in essay.chunks.annotations:
    if chunk.metadata["chat_completion"] != "0":
        print(chunk)
        print(chunk.metadata)  # This will print the metadata dictionary for each chunk
"""

In [None]:
print(essay.sentences.annotations[0:3])


In [None]:
print(essay.words.annotations[0:10])

We also provide a straightforward method to display the contents of the annotation sets of the damask.

In [None]:
print(essay.annotation_sets_as_table())

In [None]:
output = ""
counter = 1
for sentence in essay.sentences.texts:
    # skip empty sentences
    if not sentence.strip():
        continue
    output += f"<|{counter}|>{sentence}</|{counter}|>\n"
    counter += 1

print(output)