# ParselTongue: Text Analytics on Python

Python is a great scripting tool for text analytics. This post is about the basics of Natural Language Processing using python, specifically, the NLTK library. Here are some common steps taken to prepare text for analysis:
1. Parse text from files.
* Tokenize text into sentences.
* Tokenize sentences into words.
* Explore the words.
* Lowercase and Tokenize words.
* Remove punctuations. 
* Identify entities.
* Remove stopwords.
* Stem words.
* Create a term dictionary.
* Create a document term matrix.
* Identify topics.

The text used in this notebook is the Sherlock Holmes thriller "The sign of 4" by Sir Arthur Conan Dolye. This book can be retrieved from the Project Gutenberg via this [link](http://www.gutenberg.org/cache/epub/2097/pg2097.txt).

In [5]:
import urllib

txt = urllib.urlopen("http://www.gutenberg.org/cache/epub/2097/pg2097.txt").read()
print txt

﻿The Project Gutenberg EBook of The Sign of the Four, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: The Sign of the Four

Author: Arthur Conan Doyle

Posting Date: November 19, 2008 [EBook #2097]
Release Date: March, 2000
[This file last updated March 2, 2011]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK THE SIGN OF THE FOUR ***














The Sign of the Four


By

Sir Arthur Conan Doyle




Contents





Chapter I

The Science of Deduction

Sherlock Holmes took his bottle from the corner of the mantel-piece and
his hypodermic syringe from its neat morocco case. With his long,
white, nervous fingers he adjusted the delicate needle, and rolled back
his left shirt-cuff.  For some

The whole book has been saved in a variable called txt. 

A quick look at the text suggests that the text is in a regular and structured format. The content does not start until the word "Chapter I". Thereafter, each chapter starts with the chapter numbers, followed by the chapter title. The end of the story is indicated by the line "End of Project Gutenberg's The Sign of the Four, by Arthur Conan Doyle" followed by the license details of the Project Gutenberg.

In [24]:
'''First split the contents by the line "End of Project Gutenberg's" indicating the end of the content and taking only 
the first item of the resulting list. This is followed by spliting by "Chapter " to separate the introduction of the 
ebook from the chapters. The chapters themselves are separate items in the list'''
chapters = txt.split("End of Project Gutenberg's")[0].split("Chapter ")[1:]

print "There are {0} chapters in this book".format(len(chapters))

There are 12 chapters in this book


In [41]:
re.sub("^[IVX]{1,2}(\n){1,2}", "", "V\n")

''

In [44]:
chapters

['I\r\n\r\nThe Science of Deduction\r\n\r\nSherlock Holmes took his bottle from the corner of the mantel-piece and\r\nhis hypodermic syringe from its neat morocco case. With his long,\r\nwhite, nervous fingers he adjusted the delicate needle, and rolled back\r\nhis left shirt-cuff.  For some little time his eyes rested thoughtfully\r\nupon the sinewy forearm and wrist all dotted and scarred with\r\ninnumerable puncture-marks. Finally he thrust the sharp point home,\r\npressed down the tiny piston, and sank back into the velvet-lined\r\narm-chair with a long sigh of satisfaction.\r\n\r\nThree times a day for many months I had witnessed this performance, but\r\ncustom had not reconciled my mind to it.  On the contrary, from day to\r\nday I had become more irritable at the sight, and my conscience swelled\r\nnightly within me at the thought that I had lacked the courage to\r\nprotest.  Again and again I had registered a vow that I should deliver\r\nmy soul upon the subject, but there was 

In [47]:
import re

titles = []
contents = []

# Save the title of the chapter into a list called titles.
# Save a list of paragraphs of each chapter into a list call contents.
# Hence contents[] is a list of list.
for chapter in chapters:
    chap = re.sub("^[IVX]{1,2}\r(\n){1,2}", "", chapter)
    temp = chap.split("\n\n")
    titles.append(temp[0])
    contents.append(temp[1:])

Now that we have the content of the book, text pre-processing can proceed. We start by identifying the sentences, followed by tokenizing the words. For this, the nltk library is used.

In [18]:
contents

[[], [], [], [], [], [], [], [], [], [], [], []]

In [17]:
import nltk

# Ignore the chapters and paragraphs for now.
# split each paragraphs into sentences,
sentences_in_paragraphs = [nltk.sent_tokenize(paragraphs) for content in contents for paragraphs in content]

# Examine the sentence list
sentences_in_paragraphs[:3]

[]

In [12]:
# split each sentences into words
tokens_in_sentences = [nltk.word_tokenize(s) for paragraph in sentences_in_paragraphs for s in paragraph]

# Examine the words list
tokens_in_sentences[:3]

[]

Common English words like "he", "the", "with" often bring no meaning except in a grammatical contruct. These are called stopwords and should be removed. The words are first cleaned by lowering the cases, followed by removing the stop words, before stemming the words.

In [13]:
lower_case = [[token.lower() for token in s] for s in tokens_in_sentences]

from nltk.corpus import stopwords
stop_list = stopwords.words('english')
no_stop_words = [[w for w in s if w not in stop_list] for s in lower_case]

#Examine no_stop_words
no_stop_words[:5]

[]

In [14]:
from nltk.stem.porter import *
stemmer = PorterStemmer()
stemmed_words = [[stemmer.stem(w) for w in s] for s in no_stop_words]

# Examine the stemmed_words
stemmed_words[:5]

[]

Before further analysis, let's take a look at the terms used in the book using a word cloud. After examining the word cloud, further pre-processing may be needed to remove more stopwords, clean the words further, prevent some words from being cleaned etc.

In [15]:
from wordcloud import WordCloud

# Remove punctuations and create a long string of words separated by a space.
words_only = ' '.join([w for s in stemmed_words for w in s if re.match("^[a-z1-9]+$", w)])

wordcloud = WordCloud(background_color='black',
                      width=1800,
                      height=1400,
                      font_path='C:\\Windows\\Fonts\\CabinSketch-Bold.ttf').generate(words_only)

IndexError: list index out of range