# Pre-processing Texts

This notebook is used to cleaned the text and put into chunks based on the model requirements.

Current process:
* Clean off top and bottom unnecessary words e.g. table of content
* Chunk into 450 tokens (for BERT) - can be changed later depends on the model
* Put into rows with Author name as csv


In [24]:
import nltk
import numpy as np
import random
import pandas as pd
import re
from collections import OrderedDict, defaultdict
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sye\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [25]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

## Support Functions

In [78]:
def split_by_chapter(text):
    thisdict = defaultdict()
    # ALERT - manual check if there is 3 digit chapters
    # split by chapter with one or two digits number with newline
    split_text = re.split("CHAPTER \w+|CONCLUSION", text)
    i = 0
    for s in split_text:
        if i > 0:
            Chapter = "chapter_" + str(i)
            thisdict[Chapter] = s
        i+=1
    return thisdict

## Read Data

In [119]:
# ALERT - manual check your file location
f = open("Dataset/Original_Book/Mark_Twain/ThePrinceAndThePauper.txt", "r", encoding="utf8")

In [120]:
book = f.read().replace('\n', ' ').replace('_', '')

In [121]:
###SCRATCH
split_text = re.split("CHAPTER \w+|CONCLUSION", book)
len(split_text)
split_text[1]

'. The birth of the Prince and the Pauper.  In the ancient city of London, on a certain autumn day in the second quarter of the sixteenth century, a boy was born to a poor family of the name of Canty, who did not want him.  On the same day another English child was born to a rich family of the name of Tudor, who did want him. All England wanted him too.  England had so longed for him, and hoped for him, and prayed God for him, that, now that he was really come, the people went nearly mad for joy.  Mere acquaintances hugged and kissed each other and cried. Everybody took a holiday, and high and low, rich and poor, feasted and danced and sang, and got very mellow; and they kept this up for days and nights together.  By day, London was a sight to see, with gay banners waving from every balcony and housetop, and splendid pageants marching along.  By night, it was again a sight to see, with its great bonfires at every corner, and its troops of revellers making merry around them.  There was 

### Clean words, space, newline

In [122]:
book = re.sub('([.,!?()""])', r' \1 ', book)

In [123]:
book_dict = split_by_chapter(book)

In [124]:
for key,item in book_dict.items():
    ## remove newline and space at beginning and end
    book_dict[key] = re.sub(' +', ' ', item)  

In [125]:
book_dict.keys()

dict_keys(['chapter_1', 'chapter_2', 'chapter_3', 'chapter_4', 'chapter_5', 'chapter_6', 'chapter_7', 'chapter_8', 'chapter_9', 'chapter_10', 'chapter_11', 'chapter_12', 'chapter_13', 'chapter_14', 'chapter_15', 'chapter_16', 'chapter_17', 'chapter_18', 'chapter_19', 'chapter_20', 'chapter_21', 'chapter_22', 'chapter_23', 'chapter_24', 'chapter_25', 'chapter_26', 'chapter_27', 'chapter_28', 'chapter_29', 'chapter_30', 'chapter_31', 'chapter_32', 'chapter_33'])

In [126]:
# ALERT - manual check quickly to scan through
book_dict['chapter_1']

' . The birth of the Prince and the Pauper . In the ancient city of London , on a certain autumn day in the second quarter of the sixteenth century , a boy was born to a poor family of the name of Canty , who did not want him . On the same day another English child was born to a rich family of the name of Tudor , who did want him . All England wanted him too . England had so longed for him , and hoped for him , and prayed God for him , that , now that he was really come , the people went nearly mad for joy . Mere acquaintances hugged and kissed each other and cried . Everybody took a holiday , and high and low , rich and poor , feasted and danced and sang , and got very mellow; and they kept this up for days and nights together . By day , London was a sight to see , with gay banners waving from every balcony and housetop , and splendid pageants marching along . By night , it was again a sight to see , with its great bonfires at every corner , and its troops of revellers making merry ar

### Chunk data into 128 tokens each

BERT can hadle up to 512.

In [127]:
# ALERT - I change to 128 tokens, feel free to change your chunk size
# First check the chapter length
for key,item in book_dict.items():
    if len(item.split()) > 128:
        print(key, len(item.split()))

chapter_1 306
chapter_2 1815
chapter_3 2579
chapter_4 1424
chapter_5 2763
chapter_6 2924
chapter_7 1356
chapter_8 983
chapter_9 1006
chapter_10 3346
chapter_11 1786
chapter_12 4455
chapter_13 1652
chapter_14 4327
chapter_15 4071
chapter_16 1063
chapter_17 3994
chapter_18 3199
chapter_19 2196
chapter_20 2540
chapter_21 1687
chapter_22 2176
chapter_23 1281
chapter_24 1092
chapter_25 2734
chapter_26 1605
chapter_27 3780
chapter_28 1408
chapter_29 791
chapter_30 956
chapter_31 2339
chapter_32 4531
chapter_33 3928


In [128]:

# create two empty lists for df 
chapterindex = []
text = []

for key,item in book_dict.items():
    # wordcount starts from 0
    wordcount = 0 
    # keep adding sentences until 128 tokens
    chapter_chunk_text = ""
    # sentences in each chapter
    sentences = tokenizer.tokenize(item)
    # loop through each sentence 
    for sent in sentences:
        # check the word count per sentence
        wv = len(sent.split())
        # if adding this sentence makes it over 128 tokens 
        # ALERT - I change to 128 tokens, feel free to change your chunk size
        if wordcount + wv >= 128:
            # push row and clean cache
            chapterindex.append(key)
            text.append(chapter_chunk_text)
            wordcount = 0
            chapter_chunk_text = sent
        # if not exceed, append the text and add wordcount
        chapter_chunk_text += ' ' + sent
        wordcount += wv
    # once a chapter finished, push all rest text 
    chapterindex.append(key)
    text.append(chapter_chunk_text)
        

In [129]:
# create dataframe 
book_df = pd.DataFrame(
    {'chapter': chapterindex,
     'text': text
    })

In [130]:
# create total words count per row
book_df['totalwords'] = book_df['text'].str.split().str.len()

In [131]:
book_df.iloc[0]['text']

'  . The birth of the Prince and the Pauper . In the ancient city of London , on a certain autumn day in the second quarter of the sixteenth century , a boy was born to a poor family of the name of Canty , who did not want him . On the same day another English child was born to a rich family of the name of Tudor , who did want him . All England wanted him too . England had so longed for him , and hoped for him , and prayed God for him , that , now that he was really come , the people went nearly mad for joy . Mere acquaintances hugged and kissed each other and cried .'

In [132]:
book_df.shape

(708, 3)

In [133]:
book_df['totalwords'].sum()

102250

In [134]:
# ALERT - depending on how well tokenizer.tokenize(item) 
# split the sentence, some sentences can be VERY long because the 
# function fails to identify the sentence end, which makes it possible
# to be over 300 token limit 
book_df.describe()

Unnamed: 0,totalwords
count,708.0
mean,144.420904
std,35.148394
min,38.0
25%,126.0
50%,141.0
75%,160.0
max,398.0


In [135]:
book_df['author'] = 'MarkTwain'

### Save as csv with author name

In [136]:
Author = "MarkTwain"
Book = "TheAdventuresOfTomSawyer"
Book = "ThePrinceAndThePauper"
book_df.to_csv(Author+"_"+Book+".csv")

In [137]:
f.close()