## Introduction

Throughout this series of notebooks, we will learn about spaCy, a powerful natural language processing (NLP) library. In a previous version of this series (2020), I worked with spaCy 2.0. This series is based on spaCy 3.0, which brings a lot of new bells and whistles, including transformer-based models such as BERT. We'll cover those towards the end of the series. This notebook, however, has one purpose: preparing the data for processing.

The text data that we will work with is the first Harry Potter book by J.K. Rowling. This is a fun toy example because it gives us the chance to work with some real-world data (yes, scholars study Harry Potter) and test the power of the spaCy library.

## Importing the Data

In order to clean the text data, we must first open it. In this book, the data is stored in the "data" subfolder as "harry_potter.txt". We will load this data so that we can clean it and store it as "data/harry_potter_cleaned.txt". In the cell below, we open the text.

In [38]:
file = "./data/harry_potter.txt"
with open(file, "r", encoding="utf-8") as f:
    text = f.read()

Now that we have loaded the data, let's print it off to see what it looks like.

In [40]:
print(text[0:2000])

Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly twice the usual amount of neck, which came in very useful as she
spent so much of her time craning over garden fences, spying on the
neighbors. The Dursleys had a small son called Dudley and in their
opinion there was no finer boy anywhere.

The Dursleys had everything they wanted, but they also had a secret, and
their greatest fear was that somebody would discover it. They didn't
think they could bear it if anyone found out about the Potters. Mr

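To a human reader, this looks quite good. For a machine, though, the text is not yet in a usable structure: lines are hard-wrapped in the middle of sentences, and the chapter headings are mixed in with the body text. We need to apply a few standard cleaning steps to the data.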
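One way to see the structural problem is to look at the raw characters rather than the rendered text. The sketch below uses a short excerpt copied from the output above (the variable names are illustrative, not part of the original notebook) and compares the number of physical lines with the number of actual paragraphs.

```python
# A short excerpt standing in for the raw file, copied from the output above.
sample = (
    "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\n"
    "that they were perfectly normal, thank you very much. They were the last\n"
    "people you'd expect to be involved in anything strange or mysterious,\n"
    "because they just didn't hold with such nonsense.\n"
    "\n"
    "Mr. Dursley was the director of a firm called Grunnings, which made\n"
    "drills."
)

# repr() makes the otherwise invisible newline characters visible.
print(repr(sample[:80]))

# Physical (hard-wrapped) lines that contain text.
nonblank_lines = [line for line in sample.splitlines() if line.strip()]

# Actual paragraphs, which are separated by a blank line.
paragraphs = [p for p in sample.split("\n\n") if p.strip()]

print(len(nonblank_lines), "lines,", len(paragraphs), "paragraphs")
```

The excerpt has six text lines but only two paragraphs, so a sentence can be cut across several lines. That mismatch is what the cleaning steps address.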

## Cleaning the Data

Here, we will begin to clean the data to prepare it for processing with spaCy. The first thing we want to do is separate the entire text into individual chapters. When manipulating textual data in this way, it is always a good idea to look for patterns in the data that make it easy to split or transform. In our case, the Harry Potter text begins each chapter with a capitalized "CHAPTER" followed by the chapter number, spelled out. We can use this to split the entire text into chapters.

In [8]:
chapters = text.split("CHAPTER ")[1:]
print(chapters[0][0:300])

ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense.

Mr. Dursley 
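
A plain `split("CHAPTER ")` would also cut the text wherever the string "CHAPTER " happened to appear inside a sentence. As a more defensive alternative (not the approach used in this notebook), the sketch below uses a regular expression that only matches the word at the start of a line, followed by an uppercase word alone on that line. The toy text and the assumption that chapter numbers are single uppercase words are mine, for illustration only.

```python
import re

# Toy text mimicking the layout of the source file (an assumption for illustration).
toy_text = (
    "A Book Title\n\n\n"
    "CHAPTER ONE\n\nFIRST TITLE\n\n"
    "Some text that mentions CHAPTER headings inline.\n\n\n"
    "CHAPTER TWO\n\nSECOND TITLE\n\nMore text.\n"
)

# Match "CHAPTER " only at the start of a line, and only when the rest of
# that line is a single uppercase word (the spelled-out chapter number).
heading = re.compile(r"^CHAPTER (?=[A-Z]+[ \t]*$)", flags=re.MULTILINE)

# As with str.split, the first piece is everything before the first chapter.
toy_chapters = heading.split(toy_text)[1:]

# The plain string split is fooled by the inline mention.
naive_chapters = toy_text.split("CHAPTER ")[1:]

print(len(toy_chapters), len(naive_chapters))
```

On this toy input the regular expression finds two chapters, while the plain split produces three pieces because of the inline mention. If the real text never uses "CHAPTER " outside of headings, both approaches give the same result.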


Now that we have each chapter separated, we can break the text down further. We can split it at the paragraph level and use those paragraphs as the basic unit of data that we pass to spaCy. We can take advantage of the fact that paragraphs are separated by two line breaks. Within a paragraph, a single line break is just a hard wrap in the source file, not a break in the prose, so we replace each one with a space. This lets us store each paragraph as a separate string.

In [37]:
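data = []
for chapter in chapters:
    paras = []
    # Paragraphs are separated by a blank line (two consecutive newlines).
    paragraphs = chapter.split("\n\n")
    for paragraph in paragraphs:
        # Join hard-wrapped lines with spaces and trim stray whitespace.
        paragraph = paragraph.replace("\n", " ").strip()
        # Skip empty fragments left over from extra blank lines.
        if paragraph:
            paras.append(paragraph)
    # The first two fragments are the chapter number and the chapter title.
    num = paras[0]
    title = paras[1]
    paras = paras[2:]
    data.append((num, title, paras))

# Write one paragraph per line, with blank lines between chapters.
with open("data/harry_potter_cleaned.txt", "w", encoding="utf-8") as f:
    for num, title, paras in data:
        f.write(f"CHAPTER {num}: {title}\n")
        for para in paras:
            f.write(para + "\n")
        f.write("\n\n")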
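
To see what the paragraph step does to a single chapter, here is a small sketch on a toy string. The helper name `split_chapter` is hypothetical and not part of the notebook; it follows the same logic as the loop above and also strips surrounding whitespace so that whitespace-only fragments are skipped.

```python
def split_chapter(chapter):
    """Return (number, title, paragraphs) for one chapter string."""
    paras = []
    for block in chapter.split("\n\n"):
        # Replace hard wraps with spaces, then trim leftover whitespace.
        block = block.replace("\n", " ").strip()
        if block:
            paras.append(block)
    # First fragment: chapter number. Second fragment: chapter title.
    return paras[0], paras[1], paras[2:]


# Toy chapter in the same layout as the source text, with extra trailing newlines.
toy_chapter = (
    "ONE\n\nTHE BOY WHO LIVED\n\n"
    "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\n"
    "that they were perfectly normal, thank you very much.\n\n"
    "Mr. Dursley was the director of a firm called Grunnings, which made\n"
    "drills.\n\n\n"
)

num, title, paras = split_chapter(toy_chapter)
print(num, "|", title, "|", len(paras), "paragraphs")
for para in paras:
    print(repr(para))
```

Each paragraph comes back as a single string with no embedded newlines, which is the shape we want before handing the text to spaCy.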

    

Now that we have cleaned the data, we can begin analyzing it. That analysis will be our goal over the next few notebooks.