<font color="grey">Qi Yu (University of Konstanz)  |  ZHAW, March 03-04, 2022</font>

# Hands-On: Regular Expressions and File I/O
# SAMPLE SOLUTION

In this exercise, you will work on the file ```peterpan.txt```. Please follow the instructions below to complete the exercise.

## 0. Read in file

Read in the file ```peterpan.txt``` as a list of lines by executing the following code.

Lines 1-65 (the preamble) and lines 6287-6644 (the licence) are not part of the main text, so the last line in the block below removes them.


In [1]:
# The with-statement closes the file automatically
with open("peterpan.txt", "r") as f:
    lines = f.readlines()

lines = lines[65:6286]
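
The slice bounds can look off by one at first, because line numbers in a text editor start at 1 while Python list indices start at 0, and the end index of a slice is exclusive. A minimal sketch on a made-up ten-line list (`toy_lines` is a stand-in for illustration, not the real file) shows the mapping:

```python
# Toy stand-in for the file: "line 1" ... "line 10" (hypothetical data).
toy_lines = [f"line {n}\n" for n in range(1, 11)]

# Drop lines 1-3 (a pretend preamble) and lines 9-10 (a pretend licence).
# Index 3 is the 4th line; the end index 8 is exclusive, so the 8th line
# is the last one that is kept.
kept = toy_lines[3:8]

print(kept[0], kept[-1], len(kept))
```

By the same logic, `lines[65:6286]` keeps lines 66 to 6286 of the file.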

## 1. Text normalization

Text normalization is an essential step for (almost) all computational tasks on text data. Even with the preamble and the licence removed, the remaining text still contains some noise. For each line, please carry out the following cleaning steps and append the cleaned line to the list ```lines_cleaned```:

1. The quotation marks in the file are not in the standard form, e.g., ```“``` is used instead of ```"```. In practice, this may cause problems for downstream tasks. Please make the following replacements:

    1. Replace the double quotation marks ```“``` and ```”``` with ```"```.
    2. Replace the single quotation mark ```’``` with ```'```.


2. Some expressions are surrounded by underscores, e.g., ```_Jolly Roger_```, ```_embonpoint_```. Please remove all the underscores.

3. The chapter titles (see below for an example) remain in the text. For use cases in which we are only interested in the main text, we will need to get rid of them. 

    ```Chapter I.```
    
    ```PETER BREAKS THROUGH```
    
    Please use the method ```re.sub()``` and follow the guide below to remove all chapter titles:

    1. First, substitute all chapter numerations, which have the form "Chapter XXX.", e.g., ```Chapter III.```, ```Chapter IV.```, with empty strings ```''```.
    2. Since all chapter names (e.g., ```PETER BREAKS THROUGH``` above) are written in capital letters only, we can remove them by substituting an empty string for every line that consists solely of the following characters:
        1. capital letters, and/or...
        2. spaces, and/or...
        3. one or more of the following punctuation marks: comma ```,```, exclamation mark ```!```, question mark ```?```, double quotation mark ```"```, and single quotation mark ```'```.
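
Steps 1 and 2 above are plain character replacements. Besides chained `str.replace()` calls, one possible alternative is a translation table built with `str.maketrans()`. The following is only a sketch on a made-up sample line; the names `table` and `sample` are chosen for illustration:

```python
# Map each typographic character to its plain replacement;
# a value of None deletes the character.
table = str.maketrans({"“": '"', "”": '"', "’": "'", "_": None})

sample = "“It’s the _Jolly Roger_,” he said.\n"  # made-up sample line
print(sample.translate(table))
```

A single `translate()` call handles all four characters in one pass over the string, which can be convenient when the list of replacements grows.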

In [2]:
import re

In [3]:
# A trick: a playground to check whether a regular expression matches what is intended
# (uncomment the three lines below to run it)
# for line in lines:
#     if re.fullmatch(r"[\sA-Z,!?\"']+\n", line):
#         print(line)

PETER BREAKS THROUGH

THE SHADOW

COME AWAY, COME AWAY!

THE FLIGHT

THE ISLAND COME TRUE

THE LITTLE HOUSE

THE HOME UNDER THE GROUND

THE NEVER BIRD

THE HAPPY HOME

THE CHILDREN ARE CARRIED OFF

DO YOU BELIEVE IN FAIRIES?

THE PIRATE SHIP

THE RETURN HOME

WHEN WENDY GREW UP



In [4]:
lines_cleaned = []

for line in lines: 
    line = line.replace("“", '"')
    line = line.replace("”", '"')
    line = line.replace("’", "'")
    line = line.replace("_", "")
    
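    # Raw strings (r"...") keep backslashes such as \s intact for the regex engine
    line = re.sub(r"Chapter\s\w+\.\n", " ", line)
    # Caution: re.sub() matches anywhere in the line, so this pattern also replaces
    # the tail of an ordinary line that ends in e.g. a comma followed by the newline
    line = re.sub(r"[\sA-Z,!?\"']+\n", " ", line) #  Alternative solution:  \W*[\sA-Z]{2,}\W*\n   (Credit to: Giulia D'Agostino)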
    
    if line:
        lines_cleaned.append(line)
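
Because `re.sub()` replaces a match wherever it occurs, the chapter-name pattern can also hit the end of a normal line (for example a trailing comma plus newline). One way to restrict the removal to whole lines is `re.fullmatch()`. The following is a minimal sketch on made-up sample lines, not a replacement for the solution above; `is_title`, `CHAPTER_NUMBER` and `CHAPTER_NAME` are hypothetical names introduced here:

```python
import re

# Patterns from the exercise, written as raw strings.
CHAPTER_NUMBER = re.compile(r"Chapter\s\w+\.\s*")
CHAPTER_NAME = re.compile(r"[\sA-Z,!?\"']+")

def is_title(line):
    """Return True if the whole line is a chapter numeration or an all-capitals chapter name."""
    if not line.strip():
        return False  # blank lines are left for the whitespace-removal step
    return bool(CHAPTER_NUMBER.fullmatch(line) or CHAPTER_NAME.fullmatch(line))

sample = [
    "Chapter I.\n",
    "PETER BREAKS THROUGH\n",
    'heart and cried, "Oh,\n',
    "\n",
]
kept = [line for line in sample if not is_title(line)]
print(kept)
```

With this check, the line ending in `"Oh,` keeps its comma and newline, while both title lines are dropped. A body line that happens to consist of capital letters only would still be treated as a title.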

In [5]:
# Check the result
lines_cleaned

[' ',
 ' ',
 '\n',
 '\n',
 'All children, except one, grow up. They soon know that they will grow\n',
 'up, and the way Wendy knew was this. One day when she was two years old\n',
 'she was playing in a garden, and she plucked another flower and ran\n',
 'with it to her mother. I suppose she must have looked rather\n',
 'delightful, for Mrs. Darling put her hand to her heart and cried, "Oh ',
 'why can\'t you remain like this for ever!" This was all that passed\n',
 'between them on the subject, but henceforth Wendy knew that she must\n',
 'grow up. You always know after you are two. Two is the beginning of the\n',
 'end.\n',
 '\n',
 'Of course they lived at 14, and until Wendy came her mother was the\n',
 'chief one. She was a lovely lady, with a romantic mind and such a sweet\n',
 'mocking mouth. Her romantic mind was like the tiny boxes, one within\n',
 'the other, that come from the puzzling East, however many you discover\n',
 'there is always one more; and her sweet mocking mouth

4. Finally, please remove all lines that contain only whitespace (including newlines ```\n```) from ```lines_cleaned```.

In [6]:
lines_final = []

for line in lines_cleaned:
    if re.fullmatch(r"\s+", line) is None:
        lines_final.append(line)        
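
The same filter can also be written without a regular expression. As a sketch under the assumption that only visible content matters, `str.strip()` returns an empty (and therefore falsy) string for whitespace-only lines; the list `sample` is made up for illustration:

```python
sample = ["All children, except one, grow up.\n", "\n", " ", "end.\n"]  # made-up input

# str.strip() removes leading and trailing whitespace; an empty result is falsy,
# so only lines with visible content pass the filter.
non_blank = [line for line in sample if line.strip()]
print(non_blank)
```

One small difference: an empty string `""` is dropped by this version, whereas `re.fullmatch(r"\s+", "")` returns `None` and the regex version would keep it.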

In [7]:
# Check the result
lines_final

['All children, except one, grow up. They soon know that they will grow\n',
 'up, and the way Wendy knew was this. One day when she was two years old\n',
 'she was playing in a garden, and she plucked another flower and ran\n',
 'with it to her mother. I suppose she must have looked rather\n',
 'delightful, for Mrs. Darling put her hand to her heart and cried, "Oh ',
 'why can\'t you remain like this for ever!" This was all that passed\n',
 'between them on the subject, but henceforth Wendy knew that she must\n',
 'grow up. You always know after you are two. Two is the beginning of the\n',
 'end.\n',
 'Of course they lived at 14, and until Wendy came her mother was the\n',
 'chief one. She was a lovely lady, with a romantic mind and such a sweet\n',
 'mocking mouth. Her romantic mind was like the tiny boxes, one within\n',
 'the other, that come from the puzzling East, however many you discover\n',
 'there is always one more; and her sweet mocking mouth had one kiss on\n',
 'it that We

## 2. Write file

As a last step, please write all the cleaned lines in ```lines_final``` into a new file named ```peterpan_cleaned.txt```.

In [8]:
# The with-statement closes the file automatically, even if an error occurs
with open("peterpan_cleaned.txt", "w") as out:
    for line in lines_final:
        out.write(line)
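
To check that a write step like this worked, the file can be read back and compared with the list in memory. The sketch below uses made-up lines and a temporary directory so that it leaves nothing behind; the file name `demo_cleaned.txt` and the explicit `encoding="utf-8"` are assumptions for illustration:

```python
import os
import tempfile

sample = ["All children, except one, grow up.\n", "end.\n"]  # made-up lines

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_cleaned.txt")  # hypothetical file name

    # writelines() writes each string as-is; it does not add newlines.
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(sample)

    # Read the file back to confirm that nothing was lost or altered.
    with open(path, "r", encoding="utf-8") as check:
        round_trip = check.readlines()

print(round_trip == sample)
```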