# Me--Smith - by Caroline Lockhart
## My analysis of the most popular download from Project Gutenberg on July 13, 2019
I got the txt file from https://www.gutenberg.org/ebooks/27438

In [3]:
txt = 'Me-Smith-pg27438.txt'

with open(txt, 'r') as f:
    text = f.read()

Through trial and error, I settled on how many leading characters to leave out of the analysis. The last "ME--SMITH" below is the title of the book, just before the chapter numeral that opens the story. I want to analyse only the actual text, as nearly as possible.

In [37]:
print(len(text), '\n', text[:3060])

406554 
 The Project Gutenberg EBook of 'Me-Smith', by Caroline Lockhart

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: 'Me-Smith'

Author: Caroline Lockhart

Illustrator: Gayle Hoskins

Release Date: December 8, 2008 [EBook #27438]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK 'ME-SMITH' ***




Produced by Roger Frank and the Online Distributed
Proofreading Team at http://www.pgdp.net





[Illustration: "THAT LOOK IN YOUR EYES--THAT LOOK AS IF YOU HADN'T
NOTHIN' TO HIDE--IS IT TRUE?" Page 59]




"ME-SMITH"

BY

CAROLINE LOCKHART

WITH ILLUSTRATIONS BY

GAYLE HOSKINS

NEW YORK

GROSSET & DUNLAP

PUBLISHERS




Copyright 1911
By J. B. Lippincott Company

Published February 15, 1911
Second printing, February 25, 1911
Third printing, March 5, 1911
Fourth pri

So I'll put the pre-text into its own variable and cut off the post-text too, with that cutoff also found by trial and error.

In [74]:
pretext = text[:3060]
posttext = text[-30200:]
# print(posttext)

In [84]:
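# note: the slice starts at 3061, so the single character at index 3060 is in neither pretext nor data
data = text[3061:-30200]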
print(data[:50])
print('**********')
print(data[-50:])

I

"ME--SMITH"


A man on a tired gray horse reine
**********
uns--tell the Schoolmarm I died game,
me--Smith!"
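
As an alternative to hand-tuned offsets, the license wrapper can be located by its marker lines. The sketch below is only an illustration: `strip_gutenberg_boilerplate` is a hypothetical helper, and the `*** END OF` marker is an assumption, since only the `*** START OF` line is visible in the output above. The sample string is made up for the demonstration.

```python
import re

def strip_gutenberg_boilerplate(raw):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Hypothetical helper; the END marker format is assumed, not confirmed.
    """
    start = re.search(r"^\*\*\* ?START OF.*$", raw, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF.*$", raw, flags=re.MULTILINE)
    begin = start.end() if start else 0        # fall back to the whole text
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()

# Tiny made-up sample that mimics the file layout
sample = (
    "The Project Gutenberg EBook of 'Me-Smith'\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK 'ME-SMITH' ***\n"
    "\n"
    "A man on a tired gray horse\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK 'ME-SMITH' ***\n"
    "License text follows here.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

This would only remove the license wrapper. The title page and copyright lines sit after the start marker, so a second cut would still be needed to reach chapter I. If the file begins with a UTF-8 byte-order mark, opening it with `encoding='utf-8-sig'` drops the mark, though any offsets found by trial and error would then shift.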



### Check out some of the data

In [97]:
import numpy as np

In [105]:
print("Before being cleaned up, there are about:")
print("\n{} characters \n{} words \n{} unique words".format(len(data), len(data.split()), len(set(data.split()))))

lines = data.split('\n')
print("\n{} lines of text \naverage of {:.4} words per line".format(len(lines), np.average([len(line.split()) for line in lines])))

Before being cleaned up, there are about:

373293 characters 
66296 words 
12812 unique words

8437 lines of text 
average of 7.858 words per line
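
The 7.858 figure is 66296 words divided by all 8437 entries from `split('\n')`, blank lines included. A rough sketch of how the average changes when blank lines are skipped follows; `words_per_line` is a hypothetical helper and the two-line sample is made up.

```python
def words_per_line(text, skip_blank=True):
    """Average number of whitespace-separated words per line."""
    counts = [len(line.split()) for line in text.split("\n")]
    if skip_blank:
        # ignore lines that contain no words at all
        counts = [c for c in counts if c > 0]
    return sum(counts) / len(counts) if counts else 0.0

# Made-up sample: two text lines separated and followed by blank lines
sample = "A man on a tired gray horse\n\nrode slowly down the trail\n"
with_blanks = words_per_line(sample, skip_blank=False)
without_blanks = words_per_line(sample, skip_blank=True)
print(with_blanks, without_blanks)
```

On text with many paragraph breaks, the blank-inclusive average understates how long the printed lines really are.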


#### Clean up the data
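I want to clean up the data a bit first:
- convert everything to lower case
- remove the illustration captions (the text between square brackets `[]`)
- get rid of stopwords
- check whether stemming the words is worthwhile

All of this is straightforward with the `nltk` tools and regex. Note that `text_cleaner` below only replaces non-alphanumeric characters with spaces, so the words inside the bracketed captions stay in the counts.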
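
One possible way to drop the bracketed illustration captions before tokenizing is sketched here. `drop_bracketed` is a hypothetical helper, not part of the notebook, and it assumes brackets are never nested. It would also remove any other bracketed text, such as transcriber notes.

```python
import re

def drop_bracketed(text):
    """Replace every [ ... ] span with a space, including spans that wrap across lines."""
    # [^\]]* also matches newlines, so multi-line captions are covered
    return re.sub(r"\[[^\]]*\]", " ", text)

# The caption is copied from the file header; 'before' and 'after' are filler
sample = (
    "before\n"
    "[Illustration: \"THAT LOOK IN YOUR EYES--THAT LOOK AS IF YOU HADN'T\n"
    "NOTHIN' TO HIDE--IS IT TRUE?\" Page 59]\n"
    "after"
)
cleaned = drop_bracketed(sample)
print(cleaned.split())
```

Running this on `data` before the cleaning step would change the word counts reported in this notebook.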

In [106]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [107]:
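# may as well create a function to do this
def text_cleaner(text):
    """Lower-case, strip non-alphanumerics, drop stopwords; return (words, stemmed)."""
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    # build the stopword set once; calling stopwords.words() for every token is slow
    stop_words = set(stopwords.words("english"))

    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    words = [w for w in text.split() if w not in stop_words]
    stemmed = [stemmer.stem(w) for w in words]

    return words, stemmed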

In [110]:
words, stemmed = text_cleaner(data)

In [116]:
print("After cleaning it up, there are \n\n{} words and {} stemmed words, and".format(len(words), len(stemmed)))
print("{} unique words and {} unique stemmed words".format(len(set(words)), len(set(stemmed))))

After cleaning it up, there are 

32966 words and 32966 stemmed words, and
7077 unique words and 5066 unique stemmed words


#### Let's see what kind of common words are in there!

In [119]:
freqDistWords = nltk.FreqDist(words)
freqDistStems = nltk.FreqDist(stemmed)

In [120]:
print(freqDistWords)
print(freqDistStems)

<FreqDist with 7077 samples and 32966 outcomes>
<FreqDist with 5066 samples and 32966 outcomes>
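
In the summary above, "samples" are the distinct tokens and "outcomes" are the total tokens counted. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the same two numbers can be sketched with the standard library alone. The token list here is made up for illustration.

```python
from collections import Counter

# Made-up token list standing in for the cleaned words
tokens = ["smith", "susie", "smith", "horse", "smith", "susie"]
counts = Counter(tokens)

samples = len(counts)             # number of distinct tokens
outcomes = sum(counts.values())   # total number of tokens counted
print(samples, outcomes)
print(counts.most_common(2))
```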


In [127]:
# %pprint
freqDistWords.most_common(50)

[('smith', 610),
 ('susie', 295),
 ('like', 289),
 ('ralston', 228),
 ('said', 211),
 ('would', 197),
 ('eyes', 172),
 ('tubbs', 170),
 ('could', 168),
 ('upon', 166),
 ('woman', 164),
 ('one', 161),
 ('man', 151),
 ('dora', 145),
 ('mcarthur', 139),
 ('time', 135),
 ('horse', 132),
 ('little', 124),
 ('looked', 123),
 ('back', 112),
 ('white', 109),
 ('face', 108),
 ('indian', 103),
 ('get', 97),
 ('see', 96),
 ('hand', 94),
 ('never', 93),
 ('horses', 91),
 ('go', 88),
 ('come', 86),
 ('know', 86),
 ('good', 84),
 ('make', 84),
 ('say', 84),
 ('head', 83),
 ('got', 83),
 ('look', 82),
 ('saddle', 81),
 ('thought', 79),
 ('made', 78),
 ('think', 78),
 ('knew', 77),
 ('take', 77),
 ('house', 75),
 ('way', 75),
 ('schoolmarm', 74),
 ('mother', 73),
 ('something', 73),
 ('babe', 73),
 ('long', 72)]

In [128]:
freqDistStems.most_common(50)

[('smith', 610),
 ('like', 324),
 ('susi', 295),
 ('look', 252),
 ('ralston', 228),
 ('hors', 223),
 ('eye', 213),
 ('said', 211),
 ('would', 197),
 ('tubb', 170),
 ('could', 168),
 ('one', 166),
 ('upon', 166),
 ('woman', 164),
 ('time', 155),
 ('hand', 155),
 ('man', 151),
 ('indian', 150),
 ('dora', 145),
 ('mcarthur', 139),
 ('littl', 124),
 ('get', 123),
 ('back', 119),
 ('come', 119),
 ('know', 117),
 ('face', 116),
 ('go', 116),
 ('white', 112),
 ('see', 110),
 ('make', 109),
 ('say', 107),
 ('thought', 98),
 ('head', 96),
 ('never', 93),
 ('think', 92),
 ('take', 92),
 ('want', 91),
 ('saddl', 89),
 ('seem', 87),
 ('good', 85),
 ('got', 83),
 ('day', 79),
 ('made', 78),
 ('knew', 77),
 ('long', 76),
 ('hous', 76),
 ('way', 76),
 ('schoolmarm', 75),
 ('mother', 73),
 ('feel', 73)]
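
Comparing the two lists, a stem's count is the sum over the surface forms that collapse to it: 'horse' (132) plus 'horses' (91) gives the 223 reported for 'hors'. The sketch below reproduces that arithmetic. The `to_stem` dictionary is a hand-written stand-in for the stemmer, assuming both forms map to 'hors'.

```python
from collections import Counter

# Counts copied from the unstemmed top-50 list
word_counts = Counter({"horse": 132, "horses": 91})

# Hand-written mapping standing in for the Porter stemmer (assumption)
to_stem = {"horse": "hors", "horses": "hors"}

stem_counts = Counter()
for word, n in word_counts.items():
    # add each surface form's count to its shared stem
    stem_counts[to_stem[word]] += n

print(stem_counts["hors"])
```

The same merging explains why the number of unique entries falls from 7077 words to 5066 stems while the total stays at 32966.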