## Preprocess Story Files

The data comes from the [DeepMind Question Answering Dataset](https://cs.nyu.edu/~kcho/DMQA/), which consists of CNN news stories. Each story in the dataset is packaged as a `.story` file. A `.story` file contains the paragraphs of the story text, followed by several manually created highlights (phrases or sentences), each preceded by an `@highlight` annotation.
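To make the layout concrete, here is a minimal sketch that splits a `.story`-style string into its paragraphs and its highlights. The sample text is made up for illustration and is not taken from the dataset, and `split_story` is a hypothetical helper name, not part of any library.

```python
# Made-up miniature story for illustration only (NOT from the dataset).
SAMPLE_STORY = """\
First paragraph of the story. It has two sentences.

Second paragraph of the story.

@highlight

First highlight

@highlight

Second highlight
"""


def split_story(text):
    """Split raw .story text into (paragraphs, highlights).

    Hypothetical helper: assumes all story paragraphs come before the
    first "@highlight" marker, as described above.
    """
    paragraphs, highlights = [], []
    in_highlights = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith("@highlight"):
            in_highlights = True  # everything after this is a highlight
            continue
        (highlights if in_highlights else paragraphs).append(line)
    return paragraphs, highlights


paragraphs, highlights = split_story(SAMPLE_STORY)
print(paragraphs)
print(highlights)
```

The sketch works on an in-memory string, so it can be tried without downloading the dataset or installing spaCy.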

The story file in our example [looks like this](../data/be8fc9fffd65a5a38153e83acc304a83c2e206e1.story).

In this notebook, we convert a specified `.story` file into two separate files: `sentences.txt`, which contains the story text with one sentence per line, and `summary.txt`, which contains the highlights, one per line.

In [1]:
import os
import spacy

In [2]:
STORY_FILE = "be8fc9fffd65a5a38153e83acc304a83c2e206e1.story"

DATA_DIR = "../data"
STORY_PATH = os.path.join(DATA_DIR, STORY_FILE)
SENTENCE_PATH = os.path.join(DATA_DIR, "sentences.txt")
SUMMARY_PATH = os.path.join(DATA_DIR, "summary.txt")

### Preprocess

We use the sentence tokenizer from spaCy to split each input paragraph into sentences.

The `@highlight` annotations are at the end of the file, so when we see the first one, we set a flag and from then on write every remaining non-empty line to the summary file, skipping the `@highlight` marker lines themselves.

In [3]:
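# Load the small English pipeline and append a rule-based sentence splitter.
# Note: create_pipe() followed by add_pipe(component) is the spaCy v2 API;
# spaCy v3 expects nlp.add_pipe("sentencizer") instead.
nlp = spacy.load('en_core_web_sm')
sent_tokenizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sent_tokenizer)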

In [4]:
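def tokenize_to_sentences(line, nlp):
    """Split a paragraph into sentences using the spaCy pipeline `nlp`."""
    doc = nlp(line)
    # Span.string was removed in spaCy v3; Span.text works in both v2 and v3.
    # strip() avoids writing trailing whitespace at the end of each sentence.
    return [s.text.strip() for s in doc.sents]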

In [5]:
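start_summary = False

# Context managers close all three files even if an error occurs mid-loop.
with open(STORY_PATH, "r", encoding="utf-8") as fstories, \
     open(SENTENCE_PATH, "w", encoding="utf-8") as fsentences, \
     open(SUMMARY_PATH, "w", encoding="utf-8") as fsummary:
    for line in fstories:
        line = line.strip()
        if len(line) == 0:
            continue
        # Each highlight is preceded by an "@highlight" marker line; skip the
        # marker and route all following text to the summary file.
        if line.startswith("@highlight"):
            start_summary = True
            continue
        if start_summary:
            fsummary.write(line + "\n")
        else:
            # Story paragraph: write one sentence per line.
            sents = tokenize_to_sentences(line, nlp)
            for sent in sents:
                fsentences.write(sent + "\n")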

### Output files

The `sentences.txt` file [looks like this](../data/sentences.txt).

The `summary.txt` file [looks like this](../data/summary.txt).
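To check the output without leaving the notebook, a small helper can print the first few lines of either file. This is only a sketch: `head` is a hypothetical helper name introduced here, and it assumes the files were written as UTF-8 text.

```python
from itertools import islice


def head(path, n=3):
    """Return the first n lines of a text file, without trailing newlines.

    Hypothetical convenience helper; assumes the file is UTF-8 encoded.
    """
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `head("../data/sentences.txt")` should return the first three sentences of the story, and `head("../data/summary.txt", n=10)` up to ten highlights.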