# Project: Literature Analysis

### Reading is great. Great books also inspire movies, reviews, and summaries, but those usually give only a partial picture of what the book is actually like. With data science and natural language processing, I hope to add another dimension to how we understand literature.

For this project, I am looking at the following eight writings:
* **Foundation by Isaac Asimov** - a book I am currently reading, by my favorite sci-fi writer
* **A Clockwork Orange by Anthony Burgess** - the novel behind the famous, extravagant Stanley Kubrick movie, a book with a unique writing style and vocabulary
* **Comments on the Society of the Spectacle by Guy Debord** - a continuation of a book I was taught at university about the influence of capitalist media on society
* **A Brief History of Time by Stephen Hawking** - a book that excited millions about the workings of our universe
* **For Whom the Bell Tolls by Ernest Hemingway** - a novel with a distinctive style and themes specific to American writers
* **Carrie by Stephen King** - one of the most well-known horror novels out there
* **The Hobbit by J.R.R. Tolkien** - a very long journey by very short people, one that so many people and communities hold dear to their hearts
* **Slaughterhouse Five by Kurt Vonnegut** - a book highly recommended to me

# Data Scraping and Cleaning

## Outline

1. **Finding data**
    - go to Archive.org and find .txt versions of the above-mentioned books
    
    
2. **Collecting data**
    - scrape the book texts with the requests and Beautiful Soup Python libraries
    
    
3. **Cleaning the Data**
    - **Corpus**
        - Create a pandas dataframe 
        - **Round 1 Cleaning** - delete new lines
        - **Round 2 Cleaning** - clean up things like copyright notes
    - **Document-Term Matrix**
        - **Round 3 Cleaning** - tokenize text (i.e. lowercase, remove punctuation, remove digits) 
        - Create a document-term matrix using CountVectorizer
        

## Collecting Data

Luckily, I found the full texts of the books I wanted to look at online. Archive.org makes them available for free for non-profit and educational purposes.

In [None]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes book text from archive.org
def url_to_booktext(url):
    '''Returns the full text of a book from its archive.org page, as a list of <pre> blocks.'''
    page = requests.get(url).text # download the page's HTML with requests
    soup = BeautifulSoup(page, "lxml") # parse the HTML (requires the lxml parser to be installed)
    container = soup.find(class_="container container-ia") # the element that wraps the book text
    text = [pre.text for pre in container.find_all('pre')] # collect all preformatted text (pre) inside it
    print(url) # show progress
    return text

# URLs of book texts in scope
urls = ['https://archive.org/stream/SlaughterhouseFiveOrTheChildrensCrusade/Slaughterhouse%20Five%20or%20The%20Children%27s%20Crusade_djvu.txt',
        'https://archive.org/stream/5FoundationAndEarthAsimovIsaac/1_-_Foundation_-_Asimov_Isaac_djvu.txt',
        'https://archive.org/stream/in.ernet.dli.2015.463378/2015.463378.For-Whom_djvu.txt',
        'https://archive.org/stream/CarrieStephenKing/Carrie_-_Stephen_King_djvu.txt',
        'https://archive.org/stream/ABriefHistoryOfTimeByStephenHawking/A%20Brief%20History%20Of%20Time%20by%20Stephen%20Hawking_djvu.txt',
        'https://archive.org/stream/CommentsOnTheSocietyOfTheSpectacle/Comments%20on%20the%20Society%20of%20the%20Spectacle_djvu.txt',
        'https://archive.org/stream/TheHobbitByJRRTolkienEBOOK/The%20Hobbit%20byJ%20%20RR%20Tolkien%20EBOOK_djvu.txt',
        'https://archive.org/stream/AnthonyBurgessAClockworkOrange/Anthony-Burgess-A-Clockwork-Orange_djvu.txt']

# Writers' names
writers = ['vonnegut', 'asimov', 'hemingway', 'king', 'hawking', 'debord', 'tolkien', 'burgess']

In [None]:
# Actually request the book texts (takes some time to run)
writings = [url_to_booktext(u) for u in urls]

In [None]:
# Pickle the scraped texts for later use
# Run this cell only once. On a second run, mkdir fails with "A subdirectory or file writings already exists."
# The files get a .txt extension but contain pickled data, so do not create or edit them by hand.
# Hand-made text files cannot be loaded as pickles later.

# Make a new directory to hold the pickled files
!mkdir writings

for i, c in enumerate(writers): # for each author number i and author name c
    with open("writings/" + c + ".txt", "wb") as file: # open writings/<author name>.txt for binary writing
        pickle.dump(writings[i], file) # pickle the text scraped for this author

In [None]:
# Load pickled files
data = {}
for i, c in enumerate(writers): # for each author number i and author name c
    with open("writings/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file) # load pickled files into data
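
The "already exists" error mentioned above can be avoided by creating the directory from Python instead of the shell. Below is a minimal sketch of that idea, not the code this notebook actually ran: the helper names `save_texts` and `load_texts`, the `.pkl` extension, and the toy `sample` dictionary are all my own assumptions for illustration.

```python
import os
import pickle
import tempfile

def save_texts(texts_by_writer, directory):
    """Pickle one file per writer; exist_ok=True makes the call safe to repeat."""
    os.makedirs(directory, exist_ok=True)
    for writer, text in texts_by_writer.items():
        with open(os.path.join(directory, writer + ".pkl"), "wb") as file:
            pickle.dump(text, file)

def load_texts(writers, directory):
    """Load the pickled text for each writer back into a dictionary."""
    loaded = {}
    for writer in writers:
        with open(os.path.join(directory, writer + ".pkl"), "rb") as file:
            loaded[writer] = pickle.load(file)
    return loaded

# Toy stand-in for the scraped data (hypothetical content)
sample = {"asimov": ["page one", "page two"], "king": ["only page"]}

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "writings")
    save_texts(sample, target)
    save_texts(sample, target)  # running it a second time does not raise
    restored = load_texts(sample.keys(), target)
```

Keying the files by writer name rather than by list position also means the order of `urls` and `writers` no longer has to stay in sync after the scrape.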

In [None]:
# Double check to make sure data has been loaded properly
data.keys()
# Later on, it ends up being easier to work with authors in alphabetical order,
# so that any change in their order is easy to spot

In [None]:
# More checks
data['vonnegut'][:2]

## Cleaning the data



The output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

In [None]:
# Let's take a look at our data again
next(iter(data.keys()))

In [None]:
# Note that our dictionary is currently in key: writer, value: list of text format
next(iter(data.values()))

In [None]:
# We are going to change this to key: writer, value: string format
def combine_text(list_of_text): 
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text) # join the objects from the list with ' ' space in between
    return combined_text


In [None]:
# Combine it!
# The format is now: key - writer, value - a giant string of text
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [None]:
# We could keep our data in dictionary format
# But for visual and later application purposes let's put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150) # show up to 150 characters per cell

data_df = pd.DataFrame.from_dict(data_combined).transpose() # input the data_combined dictionary into a pd dataframe and transpose
data_df.columns = ['writing'] # name the columns "writing"
data_df = data_df.sort_index() # sort by index - in this case alphabetically
data_df # display the new pandas dataframe

In [None]:
# Let's take a look at the writing of Isaac Asimov
data_df.writing.loc['asimov'] # select the 'asimov' row of the writing column

## Corpus prep - Data Cleaning Round 1

In [None]:
# Create a first round of text cleaning techniques
import re # regular expressions
import string # will be used to remove punctuation

def clean_text_round1(text):
    '''Remove all newline characters so the text reads as one continuous string.'''
    text = re.sub(r'\n', '', text) # replace every newline with an empty string
    return text

# Create a lambda function to apply this change to all books
round1 = lambda x: clean_text_round1(x) 

In [None]:
# Actually apply round 1 of cleaning
data_df = pd.DataFrame(data_df.writing.apply(round1))
data_df
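
One thing to watch with round 1: deleting a newline outright joins the last word of one line to the first word of the next unless the line already ends in a space. The sketch below compares that approach with an alternative that collapses any run of whitespace into a single space. The function names and the sample string are hypothetical, and I have not checked which case applies to each scraped book. Switching approaches would also change the text that the manual deletions in round 2 have to match.

```python
import re

def remove_newlines(text):
    """Same idea as round 1: delete every newline character."""
    return re.sub(r'\n', '', text)

def collapse_whitespace(text):
    """Alternative: turn any run of whitespace (newlines included) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample whose lines do not end in a space
sample = "It was a dark\nand stormy night.\n\nThe end."

glued = remove_newlines(sample)       # words on either side of a newline are joined
spaced = collapse_whitespace(sample)  # words stay separated
```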

## Corpus prep - Data Cleaning Round 2 - Manual Useless-Text Cleaning

As we saw earlier when displaying the writing, each book carries unnecessary material at its beginning and end - text that is not the author's own writing and would clog the corpus later on. Since this material differs in length and content from book to book, I remove the most obvious pieces manually here.
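
A caveat about manual removal: `str.replace` does nothing, and raises no error, when the snippet is not found, so a single mistyped character can leave the boilerplate in place unnoticed. Here is a small sketch of a helper that reports which snippets were missing. `strip_boilerplate` and the toy strings are hypothetical names of my own, not part of the cells that follow.

```python
def strip_boilerplate(text, snippets):
    """Remove each snippet from text; return the cleaned text and the snippets not found."""
    missing = []
    for snippet in snippets:
        if snippet in text:
            text = text.replace(snippet, "")
        else:
            missing.append(snippet)  # nothing was removed for this one
    return text, missing

# Toy example (hypothetical content)
book = "TITLE PAGE Chapter 1. Once upon a time. ABOUT THE AUTHOR"
cleaned, missing = strip_boilerplate(
    book, ["TITLE PAGE ", " ABOUT THE AUTHOR", "COPYRIGHT 1999"]
)
```

Checking that `missing` is empty after each book gives a quick confirmation that every deletion actually happened.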

In [None]:
# Let's take a look at Asimov's writing first
data_df.writing.loc['asimov']

In [None]:
# Right away we can see unnecessary characters at the beginning

# Put text into a separate variable to make the actual execution more understandable and visible
del_text_asimov1 = 'FOUNDATION CUE^.TIhJIlT^«l T-WCf Hi THE FOUNDATION HOVELS F oundation 1 - F oundation OceanofPDF. com F oundation 1 - F oundation Foundation 1 - Foundation By Asimov, Isaac THE STORY BEHIND THE “FOUNDATION” By ISAAC ASIMOV '
data_df.writing.loc['asimov'] = data_df.writing.loc['asimov'].replace(del_text_asimov1, "") # take out the chosen text

# There is also some non-Asimov writing in the end (although it is still definitely worth reading!)
del_text_asimov2 = 'ABOUT THE AUTHOR Isaac Asimov was born in the Soviet Union to his great surprise. He moved quickly to correct the situation. When his parents emigrated to the United States, Isaac (three years old at the time) stowed away in their baggage. He has been an American citizen since the age of eight. Brought up in Brooklyn, and educated in its public schools, he eventually found his way to Columbia University and, over the protests of the school administration, managed to annex a series of degrees in chemistry, up to and including a Ph.D. He then infiltrated Boston University and climbed the academic ladder, ignoring all cries of outrage, until he found himself Professor of Biochemistry. Meanwhile, at the age of nine, he found the love of his life (in the inanimate sense) when he discovered his first science-fiction magazine. By the time he was eleven, he began to write stories, and at eighteen, he actually worked up the nerve to submit one. It was rejected. After four long months of tribulation and suffering, he sold his first story and, thereafter, he never looked back. In 1941, when he was twenty-one years old, he wrote the classic short story “Nightfall” and his future was assured. Shortly before that he had begun writing his robot stories, and shortly after that he had begun his Foundation series. What was left except quantity? At the present time, he has published over 260 books, distributed through every major division of the Dewey system of library classification, and shows no signs of slowing up. He remains as youthful, as lively, and as lovable as ever, and grows more handsome with each year. You can be sure that this is so since he has written this little essay himself and his devotion to absolute objectivity is notorious. He is married to Janet Jeppson, psychiatrist and writer, has two children by a previous marriage, and lives in New York City. OceanofPDF.com'
data_df.writing.loc['asimov'] = data_df.writing.loc['asimov'].replace(del_text_asimov2, "")

data_df.writing.loc['asimov']

In [None]:
# Now to Burgess
data_df.writing.loc['burgess']

In [None]:
# Repeat the same process as above:
    # Look at the text and choose what to delete
del_text_burgess1 = 'A CLOCKWORK ORANGE (UK Version) by ANTHONY BURGESS Contents Introduction (A Clockwork Orange Resucked) Part 1 Part 2 Part 3 Glossary of Nadsat Language Anthony Burgess was born in Manchester in 1917 and was a graduate of the University there. After six years in the Army he worked as an instructor for the Central Advisory Council for Forces Education, as a lecturer in Phonetics and as a grammar school master. From 1954 till 1960 he was an education officer in the Colonial Service, stationed in Malaya and Brunei. He has been called one of the very few literary geniuses of our time. Certainly he borrowed from no other literary source than himself. That source produced thirty-two novels, a volume of verse, two plays, and sixteen works of nonfiction-together with countless music compositions, including symphonies, operas, and jazz. His most recent work was A Mouthful of Air: Language, Languages... Especially English. Anthony Burgess died in 1993. Introduction A Clockwork Orange Resucked I first published the novella A Clockwork Orange in 1962, which ought to be far enough in the past for it to be erased from the world\'s literary memory. It refuses to be erased, however, and for this the film version of the book made by Stanley Kubrick may A CLOCKWORK ORANGE (UK Version) by ANTHONY BURGESS Contents Introduction (A Clockwork Orange Resucked) Part 1 Part 2 Part 3 Glossary of Nadsat Language Anthony Burgess was born in Manchester in 1917 and was a graduate of the University there. After six years in the Army he worked as an instructor for the Central Advisory Council for Forces Education, as a lecturer in Phonetics and as a grammar school master. From 1954 till 1960 he was an education officer in the Colonial Service, stationed in Malaya and Brunei. He has been called one of the very few literary geniuses of our time. Certainly he borrowed from no other literary source than himself. 
That source produced thirty-two novels, a volume of verse, two plays, and sixteen works of nonfiction-together with countless music compositions, including symphonies, operas, and jazz. His most recent work was A Mouthful of Air: Language, Languages...Especially English. Anthony Burgess died in 1993. Introduction A Clockwork Orange Resucked '
data_df.writing.loc['burgess'] = data_df.writing.loc['burgess'].replace(del_text_burgess1, "")

del_text_burgess2 = 'Document Outline • Page 1 o Introduction • Page 2 • Page 3 • Page 4 • Page 5 o Part 1 • Page 6 • Page 7 • Page 8 • Page 9 • Page 10 • Page 11 • Page 12 • Page 13 • Page 14 • Page 15 • Page 16 • Page 17 • Page 18 • Page 19 • Page 20 • Page 21 • Page 22 • Page 23 • Page 24 • Page 25 • Page 26 • Page 27 • Page 28 • Page 29 • Page 30 • Page 31 • Page 32 • Page 33 • Page 34 • Page 35 • Page 36 • Page 37 • Page 38 • Page 39 • Page 40 • Page 41 • Page 42 • Page 43 • Page 44 o Part Two • Page 45 • Page 46 • Page 47 • Page 48 • Page 49 • Page 50 • Page 51 • Page 52 • Page 53 • Page 54 • Page 55 • Page 56 • Page 57 • Page 58 • Page 59 • Page 60 • Page 61 • Page 62 • Page 63 • Page 64 • Page 65 • Page 66 • Page 67 • Page 68 • Page 69 • Page 70 • Page 71 • Page 72 • Page 73 ° Part Three • Page 74 • Page 75 • Page 76 • Page 77 • Page 78 • Page 79 • Page 80 • Page 81 • Page 82 • Page 83 • Page 84 • Page 85 • Page 86 • Page 87 • Page 88 • Page 89 • Page 90 • Page 91 • Page 92 • Page 93 • Page 94 • Page 95 • Page 96 • Page 97 • Page 98 • Page 99 • Page 100 • Page 101 • Page 102 • Page 103 • Page 104 • Page 105 • Page 106 ° Glossary • Page 107 • Page 108 • Page 109'
data_df.writing.loc['burgess'] = data_df.writing.loc['burgess'].replace(del_text_burgess2, "")

del_text_burgess3 = 'Glossary of Nadsat Language Words that do not appear to be of Russian origin are distinguished by asterisks. (For help with the Russian, I am indebted to the kindness of my colleague Nora Montesinos and a number of correspon-dents.) *appy polly loggy - apology choodesny - wonderful baboochka - old woman *chumble - to mumble *baddiwad - bad clop - to knock banda - band cluve - beak bezoomny - mad collocoll - bell biblio - library *crack - to break up or bust\' bitva - battle *crark - to yowl? Bog - God crast - to steal or rob; bolnoy - sick robbery bolshy - big, great creech - to shout or scream brat, bratty - brother *cutter - money bratchny - bastard dama - lady britva - razor ded - old man brooko - belly deng - money brosay - to throw devotchka - girl bugatty - rich dobby - good cal - feces *dook - trace, ghost *cancer - cigarette domy - house cantora - office dorogoy - dear, valuable carman - pocket dratsing - fighting chai - tea *drencrom - drug *charles, charlie - chaplain droog - friend chasha - cup *dung - to defecate chasso - guard dva - two cheena - woman eegra - game cheest - to wash eemya - name chelloveck - person, man, *eggiweg - egg fellow *filly - to play or fool with chepooka - nonsense *firegold - drink *fist - to punch loveted - caught *flip - wild? 
lubbilubbing - making love forella - \'trout\' ^luscious glory - hair gazetta - newspaper malchick - boy glazz - eye malenky - little, tiny gloopy - stupid maslo - butter *golly - unit of money merzky - filthy goloss - voice messel - thought, fancy goober - lip mesto - place gooly - to walk millicent - policeman gorlo - throat minoota - minute govoreet - to speak or talk molodoy - young grahzny - dirty moloko - milk grazzy - soiled moodge - man gromky - loud morder - snout groody - breast *mounch - snack gruppa - group mozg - brain *guff - guffaw nachinat - to begin gulliver - head nadmenny - arrogant *guttiwuts - guts nadsat - teenage *hen-korm - chickenfeed nagoy - naked *horn - to cry out *nazz - fool horrorshow - good, well neezhnies - underpants *in-out in-out - copulation nochy - night interessovat - to interest hoga - foot, leg itty - to go nozh - knife *jammiwam - jam nuking - smelling jeezny - life oddy knocky - lonesome kartoffel - potato odin - one keeshkas - guts okno - window kleb - bread oobivat - to kill klootch - key ookadeet - to leave knopka - button ooko - ear kopat - to \'dig\' oomny - brainy koshka - cat oozhassny - terrible kot - tomcat oozy - chain krowy - blood osoosh - to wipe kupet - to buy otchkies - eyeglasses lapa - paw *pan-handle - erection lewdies - people *pee and em - parents *lighter - crone? 
peet - to drink litso - face pishcha - food lomtick, piece, bit platch - to cry platties - clothes *shlaga - club pletcho - shoulder shlapa - hat plenny - prisoner shoom - noise plesk - splash shoot - fool *plosh - to splash *sinny - cinema plott - body skazat - to say podooshka - pillow *skolliwoll - school pol - sex skorry - quick, quickly polezny - useful *skriking - scratching *polyclef - skeleton key skvat - to grab pony - to understand sladky - sweet poogly - frightened sloochat - to happen pooshka - \'cannon\' sloosh, slooshy - to hear, to prestoopnick - criminal listen privodeet - to lead slovo - word somewhere smeck - laugh *pretty polly - money smot - to look prod - to produce sneety - dream ptitsa - \'chick\' *snoutie - tobacco? pyahnitsa - drunk *snuff it - to die rabbit - work, job sobirat - to pick up radosty - joy *sod - to fornicate, fornicator raskazz - story soomka - \'bag\' rassoodock - mind soviet - advice, order raz - time spat - to sleep razdraz - upset *splodge, splosh - splash razrez - to rip, ripping *spoogy - terrified rook, rooker - hand, arm *Staja - State Jail rot - mouth starry - ancient rozz - policeman strack - horror sabog - shoe *synthemesc - drug sakar - sugar tally - waist sammy - generous *tashtook - handkerchief *sarky - sarcastic *tass - cup scoteena - \'cow\' tolchock - to hit or push; blow, shaika - gang beating *sharp - female toofles - slippers sharries - buttocks tree - three shest - barrier vareet - to \'cook up\' *shilarny - concern *vaysay - washroom *shive - slice veck - (see chelloveck) shiyah - neck *vellocet - drug shlem - helmet veshch - thing viddy - to see or look yeckate - to drive voloss - hair *warble - song von - smell zammechat - remarkable vred - to harm or damage zasnoot - sleep yahma - hole zheena - wife *yahoodies - Jews zoobies - teeth yahzick - tongue zvonock - bellpull *yarbles - testicles zvook - sound  '
data_df.writing.loc['burgess'] = data_df.writing.loc['burgess'].replace(del_text_burgess3, "")

data_df.writing.loc['burgess'] = data_df.writing.loc['burgess'].replace("\\", "")

data_df.writing.loc['burgess']

In [None]:
data_df.writing.loc['hawking']

In [None]:
del_text_hawking1 = 'A Brief History of Time - Stephen Hawking A Brief History of Time Stephen Hawking Chapter 1 - Our Picture of the Universe Chapter 2 - Space and Time Chapter 3 - The Expanding Universe Chapter 4 - The Uncertainty Principle Chapter 5 - Elementary Particles and the Forces of Nature Chapter 6 - Black Holes Chapter 7 - Black Holes Ain\'t So Black Chapter 8 - The Origin and Fate of the Universe Chapter 9 - The Arrow of Time Chapter 10 - Wormholes and Time Travel Chapter 1 1 - The Unification of Physics Chapter 12 - Conclusion Glossary Acknowledgments & About The Author FOREWARD '
data_df.writing.loc['hawking'] = data_df.writing.loc['hawking'].replace(del_text_hawking1, "")
data_df.writing.loc['hawking']

In [None]:
data_df.writing.loc['hemingway']

In [None]:
del_text_hemingway1 = 'FOR WHOM THE BELL TOLLS by ERNEST HEMINGWAY JONATHAN CAPE THIRTY BEDFORD SQUARE LONDON FIRST PUBLISHED. MARCH 1923 SECOND IMPRESSION, MARCH 1 94 1 THIRD IMPRESSION, MARCH 1£>4X FOURTH IMPRESSION, JUNE 1941 FIFTH IMPRESSION, OCTOBER 1941 SIXTH IMPRESSION, FEBRUARY 1 942 SEVENTH IMPRESSION, SEPTEMBER 1942 EIGHTH IMPRESSION, FEBRUARY 1943 NINTH IMPRESSION, MARCH 1 944 TENTH IMPRESSION, MAY 1945 ELEVENTH IMPRESSION, AUGUST 1946 TWELFTH IMPRESSION, SEPTEMBER 1947 THIRTEENTH IMPRESSION, JANUARY 1949 FOURTEENTH IMPRESSION, NOVEMBER 1 950 FIFTEENTH IMPRESSION, SEPTEMBER 1952 SIXTEENTH IMPRESSION, APRIL 1954 SEVENTEENTH IMPRESSION, JUNE 1955 EIGHTEENTH IMPRESSION, 195 8 PRINTED IN GREAT BRITAIN IN THE CITY OF OXFORD AT THE ALDBN PRESS BOUND BY A. W. BAIN Sc CO. LTD., LONDON '
data_df.writing.loc['hemingway'] = data_df.writing.loc['hemingway'].replace(del_text_hemingway1, "")

data_df.writing.loc['hemingway'] = data_df.writing.loc['hemingway'].replace("/", "")

data_df.writing.loc['hemingway']


In [None]:
data_df.writing.loc['king']

In [None]:
del_text_king1 = 'STEPHEN KING Carrie DOUBLEDAY New York London Toronto Sydney Auckland CARRIE Contents Title Page Dedication Part One: Blood Sport News item from the Westover. . . Part Two: Prom Night She put the dress on for the first. . . Part Three: Wreckage Westover Mercy Hospital/Report of Decease From the national AP ticker . . . By Stephen King Copyright '
data_df.writing.loc['king'] = data_df.writing.loc['king'].replace(del_text_king1, "")

del_text_king2 = 'Melia All my love, BY STEPHEN KING NOVELS AS RICHARD BACHMAN Carrie Rage \'Salem\'s Lot The Long Walk The Shining Roadwork The Stand The Running Man The Dead Zone Thinner Firestarter Cujo COLLECTIONS The Dark Tower: Night Shift The Gunslinger Different Seasons Christine Skeleton Crew Pet Sematary The Talisman NONFICTION (with Peter Straub) Danse Macabre It The Eyes of the Dragon SCREENPLAYS Misery Creepshow The Tommyknockers Cat\'s Eye The Dark Tower II: Silver Bullet Drawing of the Three The Dark Half The Stand: The Complete & Uncut Edition Footnotes To return to the corresponding text, click on the reference number or "Return to text." * 1 Lyrics from JUST LIKE A WOMAN by Bob Dylan. Copyright © 1966 Dwarf Music. Used by permission of Dwarf Music. Return to text. * 2 Lyrics from TOMBSTONE BLUES by Bob Dylan. Copyright © 1965 M. Witmark & Sons. All Rights Reserved. Used by permission of WARNER BROS. MUSIC. Return to text. PUBLISHED BY DOUBLEDAY a division of Bantam Doubleday Dell Publishing Group, Inc. 1540 Broadway, New York, New York 10036 DOUBLEDAY and the portrayal of an anchor with a dolphin are trademark of Doubleday, a division of Bantam Doubleday Dell Publishing Group, Inc. www.doubleday.com LIBRARY OF CONGRESS CATALOG CARD NUMBER 73-9037 Copyright © 1974 by Stephen King ALL RIGHTS RESERVED elSBN: 978-0-385-52883-2 v3.0 '
data_df.writing.loc['king'] = data_df.writing.loc['king'].replace(del_text_king2, "")

data_df.writing.loc['king'] = data_df.writing.loc['king'].replace("\'", "")

data_df.writing.loc['king']

In [None]:
data_df.writing.loc['tolkien']

In [None]:
del_text_tolkien1 = 'THE HOBBIT OR THERE AND BACK AGAIN J.R.R. TOLKIEN The Hobbit is a tale of high adventure, undertaken by a company of dwarves, in search of dragon-guarded gold. A reluctant partner in this perilous quest is Bilbo Baggins, a comfort- loving, unambitious hobbit, who surprises even himself by his resourcefulness and his skill as a burglar. Encounters with trolls, goblins, dwarves, elves and giant spiders, conversations with the dragon, Smaug the Magnificent, and a rather unwilling presence at the Battle of the Five Armies are some of the adventures that befall Bilbo. But there are lighter moments as well: good fellowship, welcome meals, laughter and song. Bilbo Baggins has taken his place among the ranks of the immortals of children’s fiction. Written for Professor Tolkien’s own children, The Hobbit met with instant acclaim when published. It is a complete and marvellous tale in itself, but it also forms a prelude to The Lord of the Rings. ‘One of the most influential books of our generation’ The Times CONTENTS COVER PAGE TITLE PAGE LIST OF ILLUSTRATIONS NOTE ON THE TEXT AUTHOR’S NOTE CHAPTER I: AN UNEXPECTED PARTY CHAPTER II: ROAST MUTTON CHAPTER III: A SHORT REST CHAPTER IV: OVER HILL AND UNDER HILL CHAPTER V: RIDDLES IN THE DARK CHAPTER VI: OUT OF THE FRYING-PAN INTO THE FIRE CHAPTER VII: QUEER LODGINGS CHAPTER VIII: FLIES AND SPIDERS CHAPTER IX: BARRELS OUT OF BOND CHAPTER X: A WARM WELCOME CHAPTER XI: ON THE DOORSTEP CHAPTER XII: INSIDE INFORMATION CHAPTER XIII: NOT AT HOME CHAPTER XIV: FIRE AND WATER CHAPTER XV: THE GATHERING OF THE CLOUDS CHAPTER XVI: A THIEF IN THE NIGHT CHAPTER XVII: THE CLOUDS BURST CHAPTER XVIII: THE RETURN JOURNEY CHAPTER XIX: THE LAST STAGE WORKS BYJ.R.R. TOLKIEN COPYRIGHT ABOUT THE PUBLISHER ILLUSTRATIONS Thror\'s Map The Trolls The Mountain-path The Misty Mountains looking West Beorn’s Hall The Elvenkina’s Gate Lake Town The Front Gate The Hall at Baa-End Map of Wilderland NOTE ON THE TEXT '
data_df.writing.loc['tolkien'] = data_df.writing.loc['tolkien'].replace(del_text_tolkien1, "")

del_text_tolkien2 = 'mmma ginrawira WORKS BY J.R.R. TOLKIEN The Hobbit Leaf by Niggle On Fairy-Stories Farmer Giles of Ham The Homecoming of Beorhtnoth The Lord of the Rings The Adventures of Tom Bombadil The Road Goes Ever On (with Donald Swann) Smith of Wootton Major WORKS PUBLISHED POSTHUMOUSLY Sir Gawain and the Green Knight, Pearl and Sir Orfeo The Father Christmas Letters The Silmarillion Pictures by J.R.R. Tolkien Unfinished Tales The Letters of J.R.R. Tolkien Finn and Hengest Mr Bliss The Monsters and the Critics & Other Essays Roverandom The Children of Hurin The Legend of Sigurd and Gudrun THE HISTORY OF MIDDLE-EARTH - BY CHRISTOPHER TOLKIEN I The Book of Lost Tales, Part One II The Book of Lost Tales, Part Two III The Lays of Beleriand IV The Shaping of Middle-earth V The Lost Road and Other Writings VI The Return of the Shadow VII The Treason of Isengard VIII The War of the Ring IX Sauron Defeated X Morgoth’s Ring XI The War of the Jewels XII The Peoples of Middle-earth COPYRIGHT HarperCollins Publishers 77-85 Fulham Palace Road, Hammersmith, London W6 8JB www.tolkien.co.uk 135798642 This new reset edition is based on the edition first published in 1995 First published by HarperCollins Publishers 1991 Fifth edition (reset) 1995 First published in Great Britain by George Allen & Unwin 1937 Second edition 1951 Third edition 1966 Fourth edition 1978 Copyright © The J. R. R. Tolkien Copyright Trust 1937, 1951, 1966, 1978, 1995 and ‘Tolkien’ ®are registered trademarks of The J. R. R. Tolkien Estate Limited EPub Edition MARCH 2009 ISBN: 978-0-007- 32260-2 All rights reserved under International and Pan- American Copyright Conventions. By payment of the required fees, you have been granted the non-exclusive, non-transferable right to access and read the text of this e-book on-screen. 
No part of this text may be reproduced, transmitted, down-loaded, decompiled, reverse engineered, or stored in or introduced into any information storage and retrieval system, in any form or by any means, whether electronic or mechanical, now known or hereinafter invented, without the express written permission of HarperCollins e- books. ABOUT THE PUBLISHER Australia HarperCollins Publishers (Australia) Pty. Ltd. 25 Ryde Road (PO Box 321) Pymble, NSW 2073, Australia http://www.harpercollinsebooks.com.au Canada HarperCollins Canada 2 Bloor Street East - 20th Floor Toronto, ON, M4W, 1A8, Canada http://www.harpercollinsebooks.ca New Zealand HarperCollinsPublishers (New Zealand) Limited P.O. Box 1 Auckland, New Zealand http://www.harpercollinsebooks.co.nz United Kingdom HarperCollins Publishers Ltd. 77-85 Fulham Palace Road London, W6 8JB, UK http://www.harpercollinsebooks.co.uk United States HarperCollins Publishers Inc. 10 East 53rd Street New York, NY 10022 http://www.harpercollinsebooks.com - The reason for this use is given in The Lord of the Rings, III, 1136. Son of Azog. See ± '
data_df.writing.loc['tolkien'] = data_df.writing.loc['tolkien'].replace(del_text_tolkien2, "")

data_df.writing.loc['tolkien']


In [None]:
data_df.writing.loc['vonnegut']

In [None]:
del_text_vonnegut1 = 'SLAUGHTERHOUSE-FIVE OR THE C HIL DREN’S CRUSADE A Duty-dance with Death KURT VONNEGUT, JR. [NAL Release #21] [15 jan 2001 - OCR errors removed - vl] A fourth-generation German-American now living in easy circumstances on Cape Cod [and smoking too much], who, as an American infantry scout hors de combat, as a prisoner of war, witnessed the fire-bombing of Dresden, Germany, \'The Florence of the Elbe,\' a long time ago, and survived to tell the tale. This is a novel somewhat in the telegraphic schizophrenic manner of tales of the planet Tralfamadore, where the flying saucers come from. Peace. Granada Publishing Limited Published in 1972 by Panther Books Ltd Frogmore, St Albans, Herts AL2 2NF Reprinted 1972, 1973 (twice), 1974, 1975 First published in Great Britain by Jonathan Cape Ltd 1970 Copyright (D Kurt Vonnegut Jr. 1969 Made and printed in Great Britain by Richard Clay (The Chaucer Press) Ltd Bungy, Suffolk Set in Linotype Plantin This book is sold subject to the condition that it shah not, by way of trade or otherwise, be lent, re-sold, hired out or otherwise circulated without the publisher\'s prior consent in any form of binding or cover other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser. This book is published at a net price and is supplied subject to the Publishers Association Standard Conditions of Sale registered under the Restrictive Trade Practices Act, 1956. Grateful acknowledgment is made for permission to reprint the following material: \'The Waking\': copyright 1953 by Theodore Roethke from THE COLLECTED POEMS OF THEODORE ROETHKE printed by pennission of Doubleday & Company, Inc. THE DESTRUCTION OF DRESDEN by David Irving: From the Introduction by Ira C. Eaker, Lt. Gen. USAF (RET.) and Foreword by Air Marshall Sir Robert Saundby. Copyright ©1963 by William Kimber and Co. Limited. Reprinted by permission of Holt, Rinehart and Winston, Inc. 
and William Kimber and Co. Limited. \'Leven Cent Cotton’ by Bob Miller and Emma Dermer: Copyright © 1928, 1929 by MCA Music, a Division of MCA Inc. Copyright renewed 1955,1956 and assigned to MCA Music, a division of MCA Inc. Used by pennission. for Mary O’ Hare and Gerhard Muller '
data_df.writing.loc['vonnegut'] = data_df.writing.loc['vonnegut'].replace(del_text_vonnegut1, "")

data_df.writing.loc['vonnegut'] = data_df.writing.loc['vonnegut'].replace("\'", "")

data_df.writing.loc['vonnegut']


In [None]:
# Finally, let's look at our relatively clean data!
data_df

In [None]:
# Let's add the books' full names as well
# The list must follow the alphabetical order of the dataframe index (asimov ... vonnegut)
full_names = ['Asimov - Foundation', 'Burgess - A Clockwork Orange', 'Debord - Comments on the Society of the Spectacle', 'Hawking - A Brief History of Time', 'Hemingway - For Whom the Bell Tolls', 'King - Carrie', 'Tolkien - The Hobbit', 'Vonnegut - Slaughterhouse Five']

data_df['full_name'] = full_names
data_df

In [None]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

## Document-term Matrix prep - Data Cleaning Round 3

In [None]:
# Apply a 3rd round of text cleaning techniques

def clean_text_round3(text):
    '''Make text lowercase, remove punctuation, and remove digits and words containing digits.'''
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove ASCII punctuation
    text = re.sub('[‘’“”…]', '', text) # remove curly quotes and ellipses, which string.punctuation does not cover
    text = re.sub(r'\w*\d\w*', '', text) # remove any token that contains a digit (\d), along with the word characters (\w*) around it
    return text

round3 = lambda x: clean_text_round3(x)
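
To see what round 3 does, here is a small self-contained check on two made-up sentences. The function is repeated only so the snippet runs on its own, and the sample strings are hypothetical. Note two side effects: removed tokens leave extra spaces behind, and hyphenated words are joined.

```python
import re
import string

def clean_text_round3(text):
    """Lowercase, strip punctuation, and drop tokens containing digits."""
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Hypothetical samples
cleaned = clean_text_round3("Chapter 12: “Hello,” said Bilbo… in 1937!")
tokens = cleaned.split()  # split() ignores the leftover extra spaces

hyphenated = clean_text_round3("twenty-one dwarves")  # the hyphen is removed, not replaced by a space
```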

In [None]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.writing.apply(round3))
data_clean

In [None]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format)
data_clean.to_pickle('data_clean.pkl')

## Data for the document-term matrix is ready! Now let's actually create our DTM

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words - common words that add no additional meaning to text such as 'a', 'the', etc.
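
Before using CountVectorizer, it may help to see a document-term matrix built by hand. The sketch below uses only the standard library and is an illustration of the idea, not of scikit-learn's implementation: the tiny `STOP_WORDS` set, the `document_term_matrix` helper, and the two toy documents are all assumptions of mine, and scikit-learn's built-in English stop word list is much longer.

```python
from collections import Counter

# Tiny hypothetical stop word list for illustration only
STOP_WORDS = {"the", "a", "and", "of"}

def document_term_matrix(docs):
    """Return (sorted vocabulary, list of count rows), one row per document."""
    # Count the non-stop words of each document separately
    counts = [Counter(w for w in doc.split() if w not in STOP_WORDS) for doc in docs]
    # The columns are the union of all words seen, in alphabetical order
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

docs = ["the hobbit walked", "the dragon slept and the hobbit ran"]
vocabulary, rows = document_term_matrix(docs)
```

Each row is one document and each column is one word, which is the same layout the CountVectorizer cell below produces for the eight books.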

In [None]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english') # create a count vectorizer excluding stop words
data_cv = cv.fit_transform(data_clean.writing) # learn the vocabulary and count the words in each book
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out()) # dense 2D array with one column per word
data_dtm.index = data_clean.index # label the rows with the writers' names
data_dtm

In [None]:
# Let's pickle the dtm for later use
data_dtm.to_pickle("dtm.pkl")

In [None]:
# Pickle the count vectorizer object; the with block closes the file afterwards
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)

## Next up - Exploratory Data Analysis!