# NLP Project - A Comparison of Writing Styles in Detective-like Novels

***

This project is based on five novels by two different authors, so there is a lot of material to import and read.

The books and their respective authors are:

By Arthur Conan Doyle:
* The Adventures of Sherlock Holmes
* The Memoirs of Sherlock Holmes
* The Return of Sherlock Holmes

By Anna Katharine Green:
* That Affair Next Door
* The Leavenworth Case

**Note:** All the books are available on the Project Gutenberg website.

### Step 1 - Plan Analysis
***

Almost everyone knows who Sherlock Holmes is; you'll hardly find someone unfamiliar with the name. A far less familiar name is Anna Katharine Green, once called "the mother of the detective novel" (Wikipedia). Also according to Wikipedia, she was one of the first writers of detective fiction in the United States. The two authors were chosen for this analysis because their books were written in the same era (Anna was born only 13 years before Arthur). This analysis therefore focuses on comparing their writing styles through the grammatical structures they use in novels about detective work.

The writing styles will be compared by measuring how often two grammatical structures occur throughout each book, which makes it possible to identify a preference for one structure over the other. This project will analyse the frequency of **Noun Phrases** and **Verb Phrases** in the five books and, with some luck, we'll discover something interesting. Is there a preference for one structure over another?

The amount of detail in the books will also be analysed to see which author provides more complete descriptions of scenarios, characters, situations, etc. This can be done by counting the adjectives in each book. Adjectives modify nouns, so their frequency serves as a rough indicator of how much descriptive detail a novel contains.
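
As a rough illustration of the adjective-based measure described above, the sketch below computes the share of tokens tagged as adjectives (Penn Treebank tags JJ, JJR and JJS). The helper name `adjective_ratio` is hypothetical, and the sample is the opening of a sentence taken from the tagged output of *That Affair Next Door* in Step 4. A real comparison would run over each whole tagged book.

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags


def adjective_ratio(tagged_sentences):
    """Return the fraction of tokens tagged as adjectives (hypothetical helper)."""
    total = 0
    adjectives = 0
    for sentence in tagged_sentences:
        for _word, tag in sentence:
            total += 1
            if tag in ADJECTIVE_TAGS:
                adjectives += 1
    return adjectives / total if total else 0.0


# One partial sentence, copied from the tagged output in Step 4
sample = [[
    ('i', 'NN'), ('am', 'VBP'), ('not', 'RB'), ('an', 'DT'),
    ('inquisitive', 'JJ'), ('woman', 'NN'), (',', ','), ('but', 'CC'),
    ('when', 'WRB'), (',', ','), ('in', 'IN'), ('the', 'DT'),
    ('middle', 'NN'), ('of', 'IN'), ('a', 'DT'), ('certain', 'JJ'),
    ('warm', 'JJ'), ('night', 'NN'),
]]

print(f"{adjective_ratio(sample):.3f}")  # 3 adjectives out of 18 tokens
```

Note that punctuation tokens are counted in the total here; filtering them out first is an equally reasonable choice, as long as it is applied the same way to every book.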

#### Tools and libraries

To search for those structures in Python, the following tools and techniques were used to manipulate the data and search through it:

* Regular Expressions (regex)
* NLTK - Natural Language Toolkit
* POS tagging (part-of-speech tagging)
* Tokenization
* Chunking
* Syntax Parsing Analysis

**Note:** Preprocessing/Normalization is a common part of most NLP-related work. However, in this case, techniques such as *Lemmatization* (reducing words to their dictionary form) or *Stopwords removal* (removing grammatical words such as prepositions) would make it impossible to search for patterns in grammatical structures. Therefore, the **only** preprocessing stage included in this project is **tokenization** (breaking paragraphs and sentences into smaller units).
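
To see why stopword removal would interfere, the sketch below filters a tiny hand-tagged sentence with an illustrative stopword set (an assumption, not NLTK's actual list) and then looks for a determiner immediately followed by a plural noun. The helper `has_dt_nns` is hypothetical and only stands in for the chunk patterns defined in Step 5.

```python
# Illustrative stopword set (an assumption; real stopword lists are longer)
STOPWORDS = {"the", "a", "an", "of", "in"}

tagged = [('the', 'DT'), ('facts', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('case', 'NN')]


def has_dt_nns(tagged_sentence):
    """Return True if a determiner is directly followed by a plural noun."""
    pairs = zip(tagged_sentence, tagged_sentence[1:])
    return any(a[1] == 'DT' and b[1] == 'NNS' for a, b in pairs)


without_stopwords = [(w, t) for w, t in tagged if w not in STOPWORDS]

print(has_dt_nns(tagged))             # determiner still present
print(has_dt_nns(without_stopwords))  # determiner removed with the stopwords
```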

**POS tags:** https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

### Step 2 - Importing books
***

In [2]:
sh_adventures = open("The Adventures of Sherlock Holmes.txt",encoding='utf-8').read().lower()

In [5]:
sh_adventures



In [3]:
sh_memoirs = open("The Memoirs of Sherlock Holmes.txt",encoding='utf-8').read().lower()
sh_memoirs



In [4]:
sh_return = open("The Return of Sherlock Holmes.txt",encoding='utf-8').read().lower()
sh_return



In [5]:
anna_leavenworth = open("THE LEAVENWORTH CASE.txt",encoding='utf-8').read().lower()
anna_leavenworth



In [6]:
anna_nd_affair = open("THE AFFAIR NEXT DOOR.txt",encoding='utf-8').read().lower()
anna_nd_affair
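
The five import cells repeat the same open, read and lowercase steps. A minimal sketch of a shared helper is shown below; `load_book` is a hypothetical name, and the `with` block simply ensures the file is closed after reading.

```python
from pathlib import Path


def load_book(path, encoding="utf-8"):
    """Read a plain-text book and return its contents in lowercase."""
    with open(Path(path), encoding=encoding) as handle:
        return handle.read().lower()


# Example usage, assuming the file sits next to the notebook:
# sh_adventures = load_book("The Adventures of Sherlock Holmes.txt")
```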



### Step 3 - Tokenization
***

In [7]:
from nltk.tokenize import PunktSentenceTokenizer, word_tokenize

In [8]:
def book_tokenizer(book):
    """Split a book into sentences, then split each sentence into word tokens."""
    # Passing the text to the constructor trains Punkt on this book
    sentence_tokenizer = PunktSentenceTokenizer(book)
    tokenized_sentences = sentence_tokenizer.tokenize(book)

    # One list of word tokens per sentence
    return [word_tokenize(sentence) for sentence in tokenized_sentences]

In [9]:
tokenized_sh_adventures = book_tokenizer(sh_adventures)

In [120]:
tokenized_sh_memoirs = book_tokenizer(sh_memoirs)

In [11]:
tokenized_sh_return = book_tokenizer(sh_return)

In [12]:
tokenized_anna_leavenworth = book_tokenizer(anna_leavenworth)

In [13]:
tokenized_anna_nd_affair = book_tokenizer(anna_nd_affair)

### Step 4 - POS Tagging
***

Now that every book has been broken down into words grouped by sentence, the words can receive tags indicating their part of speech using NLTK's **pos_tag()** function. Those tags will be used to define the structures when searching for them in the books!

A link with the complete table of symbols is available in *Step 1*.

As you might observe from the first tagged book, the *pos_tag()* function isn't perfect, and some mistakes do occur. For example, it tags the author's name 'doyle' as JJ (adjective). Lowercasing the text on import probably contributes to this, since it removes the capitalization that helps a tagger recognise proper nouns. Although some mistakes happen, the tool seems accurate enough to support the desired structures, and some of the resulting chunks will be checked to make sure they match what we are searching for.

**The Adventures of Sherlock Holmes**

In [2]:
from nltk import pos_tag, RegexpParser

In [15]:
def book_tagging(book):
    
    tagged_book = []
    
    # Each element of `book` is one tokenized sentence (a list of words)
    for sentence in book:
        tagged_book.append(pos_tag(sentence))
    
    return tagged_book

In [16]:
tagged_sh_adventures = book_tagging(tokenized_sh_adventures)

In [17]:
tagged_sh_adventures

[[('the', 'DT'),
  ('adventures', 'NNS'),
  ('of', 'IN'),
  ('sherlock', 'NN'),
  ('holmes', 'NNS'),
  ('by', 'IN'),
  ('arthur', 'NN'),
  ('conan', 'NN'),
  ('doyle', 'JJ'),
  ('contents', 'NNS'),
  ('i', 'NN'),
  ('.', '.')],
 [('a', 'DT'),
  ('scandal', 'NN'),
  ('in', 'IN'),
  ('bohemia', 'NN'),
  ('ii', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('red-headed', 'JJ'),
  ('league', 'NN'),
  ('iii', 'NN'),
  ('.', '.'),
  ('a', 'DT'),
  ('case', 'NN'),
  ('of', 'IN'),
  ('identity', 'NN'),
  ('iv', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('boscombe', 'NN'),
  ('valley', 'NN'),
  ('mystery', 'NN'),
  ('v.', 'IN'),
  ('the', 'DT'),
  ('five', 'CD'),
  ('orange', 'NN'),
  ('pips', 'NNS'),
  ('vi', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('man', 'NN'),
  ('with', 'IN'),
  ('the', 'DT'),
  ('twisted', 'JJ'),
  ('lip', 'NN'),
  ('vii', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('blue', 'JJ'),
  ('carbuncle', 'NN'),
  ('viii', 'NN'),
  ('.'

**The Memoirs of Sherlock Holmes**

In [18]:
tagged_sh_memoirs = book_tagging(tokenized_sh_memoirs)

In [19]:
tagged_sh_memoirs

[[('the', 'DT'),
  ('memoirs', 'NN'),
  ('of', 'IN'),
  ('sherlock', 'NN'),
  ('holmes', 'NNS'),
  ('by', 'IN'),
  ('arthur', 'NN'),
  ('conan', 'NN'),
  ('doyle', 'JJ'),
  ('contents', 'NNS'),
  ('i', 'NN'),
  ('.', '.')],
 [('silver', 'NN'),
  ('blaze', 'NN'),
  ('ii', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('cardboard', 'NN'),
  ('box', 'NN'),
  ('iii', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('yellow', 'JJ'),
  ('face', 'NN'),
  ('iv', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('stockbroker', 'NN'),
  ('’', 'NNP'),
  ('s', 'NN'),
  ('clerk', 'NN'),
  ('v.', 'IN'),
  ('the', 'DT'),
  ('“', 'NNP'),
  ('_gloria', 'NNP'),
  ('scott_', 'NN'),
  ('”', 'NNP'),
  ('vi', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('musgrave', 'JJ'),
  ('ritual', 'JJ'),
  ('vii', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('reigate', 'NN'),
  ('squires', 'VBZ'),
  ('viii', 'NNS'),
  ('.', '.')],
 [('the', 'DT'),
  ('crooked', 'JJ'),
  ('man', 'NN'),
  ('ix', '

**The Return of Sherlock Holmes**

In [20]:
tagged_sh_return = book_tagging(tokenized_sh_return)

In [21]:
tagged_sh_return

[[('the', 'DT'),
  ('return', 'NN'),
  ('of', 'IN'),
  ('sherlock', 'NN'),
  ('holmes', 'NNS'),
  ('by', 'IN'),
  ('sir', 'NN'),
  ('arthur', 'NN'),
  ('conan', 'NN'),
  ('doyle', 'NN'),
  ('contents', 'VBZ'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('empty', 'JJ'),
  ('house', 'NN'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('norwood', 'NN'),
  ('builder', 'VB'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('dancing', 'VBG'),
  ('men', 'NNS'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('solitary', 'JJ'),
  ('cyclist', 'NN'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('priory', 'JJ'),
  ('school', 'NN'),
  ('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('black', 'JJ'),
  ('peter', 'NN'),
  ('.', '.')],
 [('the', 'DT'),
  ('adventure', 'NN'),
  ('of', 'IN'),
  ('charles', 'NNS'),
  ('augustus', 'VBP'),
  ('milverto

**The Leavenworth Case**

In [22]:
tagged_anna_leavenworth = book_tagging(tokenized_anna_leavenworth)

In [23]:
tagged_anna_leavenworth

[[('the', 'DT'),
  ('leavenworth', 'NN'),
  ('case', 'NN'),
  ('by', 'IN'),
  ('anna', 'JJ'),
  ('katherine', 'NN'),
  ('green', 'JJ'),
  ('contents', 'NNS'),
  ('book', 'NN'),
  ('i', 'NN'),
  ('.', '.')],
 [('the', 'DT'), ('problem', 'NN'), ('i', 'NN'), ('.', '.')],
 [('“', 'VB'),
  ('a', 'DT'),
  ('great', 'JJ'),
  ('case', 'NN'),
  ('”', 'NNP'),
  ('ii', 'NN'),
  ('.', '.'),
  ('the', 'DT'),
  ('coroner', 'NN'),
  ('’', 'NNP'),
  ('s', 'NN'),
  ('inquest', 'NN'),
  ('iii', 'NN'),
  ('.', '.'),
  ('facts', 'NNS'),
  ('and', 'CC'),
  ('deductions', 'NNS'),
  ('iv', 'VBP'),
  ('.', '.'),
  ('a', 'DT'),
  ('cuts', 'JJ'),
  ('v.', 'NN'),
  ('expert', 'JJ'),
  ('testimony', 'NN'),
  ('vi', 'NN'),
  ('.', '.'),
  ('side-lights', 'NNS'),
  ('vii', 'NN'),
  ('.', '.'),
  ('mary', 'JJ'),
  ('leavenworth', 'NN'),
  ('viii', 'NN'),
  ('.', '.')],
 [('circumstantial', 'JJ'),
  ('evidence', 'NN'),
  ('ix', 'NN'),
  ('.', '.'),
  ('a', 'DT'),
  ('discovery', 'NN'),
  ('x.', 'IN'),
  ('mr.', 'NN')

**That Affair Next Door**

In [24]:
tagged_anna_nd_affair = book_tagging(tokenized_anna_nd_affair)

In [25]:
tagged_anna_nd_affair

[[('that', 'DT'),
  ('affair', 'VBD'),
  ('next', 'JJ'),
  ('door', 'NN'),
  ('by', 'IN'),
  ('anna', 'JJ'),
  ('katharine', 'NN'),
  ('green', 'JJ'),
  ('_book', 'NNP'),
  ('i._', 'NN'),
  ('miss', 'VBD'),
  ('butterworth', 'NN'),
  ("'s", 'POS'),
  ('window', 'NN'),
  ('.', '.')],
 [('i', 'NN'), ('.', '.')],
 [('a', 'DT'), ('discovery', 'NN'), ('.', '.')],
 [('i', 'NN'),
  ('am', 'VBP'),
  ('not', 'RB'),
  ('an', 'DT'),
  ('inquisitive', 'JJ'),
  ('woman', 'NN'),
  (',', ','),
  ('but', 'CC'),
  ('when', 'WRB'),
  (',', ','),
  ('in', 'IN'),
  ('the', 'DT'),
  ('middle', 'NN'),
  ('of', 'IN'),
  ('a', 'DT'),
  ('certain', 'JJ'),
  ('warm', 'JJ'),
  ('night', 'NN'),
  ('in', 'IN'),
  ('september', 'NN'),
  (',', ','),
  ('i', 'RB'),
  ('heard', 'VBD'),
  ('a', 'DT'),
  ('carriage', 'NN'),
  ('draw', 'VBZ'),
  ('up', 'RP'),
  ('at', 'IN'),
  ('the', 'DT'),
  ('adjoining', 'NN'),
  ('house', 'NN'),
  ('and', 'CC'),
  ('stop', 'NN'),
  (',', ','),
  ('i', 'NN'),
  ('could', 'MD'),
  ('no

### Step 5 - Defining the Grammatical Structures in Regular Expressions
***

Now we've come to the interesting part. As with any NLP work, we first had to apply some techniques to the data to prepare it for analysis. Now, using regular expressions, the patterns we're looking for will be defined in terms of the tags we just assigned to each word of the five books.

These are the structures we're looking for:
* Noun Phrases
* Verb Phrases

From these structures, some chunks of language will be retrieved from the books to give an idea of how the author writes.

Along with the structures to be used, a parser object needs to be defined. The parser takes a pattern of POS tags, searches each tagged sentence for it, and groups the matching words into chunks. For this, NLTK's **RegexpParser** will be used. One parser will be created for each desired structure.

The order of elements in a *Noun Phrase* can be verified in the official Cambridge Grammar online dictionary. The order is: **Determiner + Adjective + Nouns acting as modifiers (optional) + Head (the main noun, the one being modified)**.

Verb Phrases can be simple or complex. A simple verb phrase consists of a main verb alone, which indicates what type of clause it is (declarative, imperative, etc.). Our focus is on complex verb phrases, which follow this order: **Modal Verb + Auxiliary Verb(s) + Main Verb**.

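We can configure these structures using the tag for each part of speech and specifying its quantity (e.g. one or more). Note that in an NLTK tag pattern `.` stands for exactly one character, so `<NN.>` matches three-letter tags such as NNS and NNP but not the singular NN, while `<V..?>` matches verb tags of two or three letters (VB, VBD, VBN, ...).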

In [54]:
np_chunk = "NP: {<DT><JJ>*<NN.>?<NN.>}" # Noun Phrase
np_parser = RegexpParser(np_chunk)

In [68]:
vp_chunk = "VP: {<MD><V..?>*<V..?>}" # Verb Phrase
vp_parser = RegexpParser(vp_chunk)
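
To make the behaviour of these tag patterns concrete, the sketch below runs the same Noun Phrase pattern on a short hand-tagged sentence, alongside a looser variant. The variant using `<NN.*>` is an assumption added for comparison only and is not the pattern behind the results in this notebook. Because `.` matches exactly one character in an NLTK tag pattern, `<NN.>` accepts NNS or NNP but not the singular NN.

```python
from nltk import RegexpParser

# Hand-tagged sentence, so no tagger model is needed
sentence = [('the', 'DT'), ('red-headed', 'JJ'), ('league', 'NN'),
            ('has', 'VBZ'), ('the', 'DT'), ('facts', 'NNS')]

strict_parser = RegexpParser("NP: {<DT><JJ>*<NN.>?<NN.>}")   # pattern used above
loose_parser = RegexpParser("NP: {<DT><JJ>*<NN.*>*<NN.*>}")  # illustrative variant


def noun_phrases(parser, tagged_sentence):
    """Return each NP chunk as a string of its words."""
    tree = parser.parse(tagged_sentence)
    return [" ".join(word for word, _tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]


print(noun_phrases(strict_parser, sentence))  # only the plural-headed phrase
print(noun_phrases(loose_parser, sentence))   # singular heads are chunked too
```

This would explain why the Noun Phrases listed in Step 6 all end in a plural or proper noun.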

With the structures and the parsers ready, we need to store the resulting chunks in variables to analyse them later!

### Step 6 - Book Analysis
***

**The Adventures of Sherlock Holmes - Chunks**

In [64]:
np_chunks_sh_adventures = []
vp_chunks_sh_adventures = []

In [69]:
for word in tagged_sh_adventures:
    np_chunks_sh_adventures.append(np_parser.parse(word))
    vp_chunks_sh_adventures.append(vp_parser.parse(word))

With our chunks parsed, we can check how many chunks were found. For this, a chunk counter function is required.

In [32]:
from collections import Counter

In [57]:
def np_chunk_counter(chunked_sentences):
    
    chunks = list()

    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'NP'):
            chunks.append(tuple(subtree))

    chunk_counter = Counter()

    for chunk in chunks:
        chunk_counter[chunk] += 1

    return chunk_counter

In [58]:
np_chunk_counter(np_chunks_sh_adventures)

Counter({(('the', 'DT'), ('adventures', 'NNS')): 1,
         (('all', 'DT'), ('emotions', 'NNS')): 1,
         (('the', 'DT'), ('home-centred', 'JJ'), ('interests', 'NNS')): 1,
         (('those', 'DT'), ('clues', 'NNS')): 1,
         (('those', 'DT'), ('mysteries', 'NNS')): 1,
         (('these', 'DT'), ('signs', 'NNS')): 1,
         (('the', 'DT'), ('readers', 'NNS')): 1,
         (('the', 'DT'), ('dark', 'JJ'), ('incidents', 'NNS')): 1,
         (('a', 'DT'), ('few', 'JJ'), ('centuries', 'NNS')): 1,
         (('the', 'DT'), ('edges', 'NNS')): 2,
         (('the', 'DT'), ('steps', 'NNS')): 8,
         (('some', 'DT'), ('hundreds', 'NNS')): 2,
         (('these', 'DT'), ('little', 'JJ'), ('problems', 'NNS')): 2,
         (('all', 'DT'), ('quarters', 'NNS')): 2,
         (('a', 'DT'), ('large', 'JJ'), ('“', 'NNP')): 2,
         (('the', 'DT'), ('‘', 'NNP')): 8,
         (('the', 'DT'), ('small', 'JJ'), ('‘', 'NNP')): 1,
         (('the', 'DT'), ('stairs', 'NNS')): 6,
         (('a', 'D

In [59]:
len(np_chunk_counter(np_chunks_sh_adventures))

622

We have 622 distinct Noun Phrases in the first Sherlock Holmes book (the length of the Counter is the number of unique chunks, not the total number of occurrences). Impressive. But the book has more to offer. Let's take a look at the Verb Phrases.

In [66]:
def vp_chunk_counter(chunked_sentences):

    chunks = list()

    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'VP'):
            chunks.append(tuple(subtree))

    chunk_counter = Counter()

    for chunk in chunks:
        chunk_counter[chunk] += 1

    return chunk_counter

In [70]:
vp_chunk_counter(vp_chunks_sh_adventures)

Counter({(('will', 'MD'), ('is', 'VBZ')): 2,
         (('would', 'MD'), ('have', 'VB'), ('placed', 'VBN')): 1,
         (('might', 'MD'), ('throw', 'VB')): 1,
         (('should', 'MD'), ('have', 'VB'), ('thought', 'VBN')): 1,
         (('can', 'MD'), ('’', 'VB')): 11,
         (('must', 'MD'), ('be', 'VB')): 31,
         (('may', 'MD'), ('be', 'VB')): 22,
         (('will', 'MD'), ('call', 'VB')): 3,
         (('would', 'MD'), ('be', 'VB')): 60,
         (('may', 'MD'), ('want', 'VB')): 1,
         (('would', 'MD'), ('call.', 'VB')): 1,
         (('may', 'MD'), ('address', 'VB')): 1,
         (('may', 'MD'), ('trust', 'VB')): 1,
         (('may', 'MD'), ('say', 'VB')): 5,
         (('must', 'MD'), ('begin', 'VB')): 1,
         (('will', 'MD'), ('be', 'VB')): 26,
         (('may', 'MD'), ('have', 'VB')): 8,
         (('will', 'MD'), ('excuse', 'VB')): 10,
         (('may', 'MD'), ('confess', 'VB')): 1,
         (('might', 'MD'), ('grow', 'VB')): 1,
         (('would', 'MD'), ('condesce

In [71]:
len(vp_chunk_counter(vp_chunks_sh_adventures))

729

Interesting! There are more distinct Verb Phrases than Noun Phrases in the first Sherlock Holmes book. It's hard to say which is more important, since both are essential parts of any language. However, is the focus on Verb Phrases a characteristic of this type of novel? Perhaps...

To know this, we'll need to see the other books.
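
One caveat when reading these numbers: `len()` of a `Counter` gives the number of distinct chunks, not how many times chunks occur in total. The sketch below shows the difference on a few Verb Phrase counts copied from the output above; summing the values gives the total number of occurrences.

```python
from collections import Counter

# A few Verb Phrase counts copied from the output above (words only)
vp_counts = Counter({
    ('would', 'be'): 60,
    ('must', 'be'): 31,
    ('may', 'be'): 22,
    ('might', 'throw'): 1,
})

distinct_chunks = len(vp_counts)             # number of different phrases
total_occurrences = sum(vp_counts.values())  # how often they occur overall

print(distinct_chunks, total_occurrences)
```

Either measure can be used for the comparison, as long as the same one is applied to both structures and to every book.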

Before moving on to the other books, we're going to go the extra mile and see which Noun Phrases and Verb Phrases are the most common in the first book. We'll take the top 15 chunks of each kind, using the same function with a single addition: the **most_common()** method.

In [73]:
def np_common_chunks(chunked_sentences):
    
    chunks = list()

    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'NP'):
            chunks.append(tuple(subtree))

    chunk_counter = Counter()

    for chunk in chunks:
        chunk_counter[chunk] += 1

    return chunk_counter.most_common(15)

In [74]:
def vp_common_chunks(chunked_sentences):

    chunks = list()

    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'VP'):
            chunks.append(tuple(subtree))

    chunk_counter = Counter()

    for chunk in chunks:
        chunk_counter[chunk] += 1

    return chunk_counter.most_common(15)
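
The four helper functions above differ only in the chunk label and in whether `most_common()` is applied. A single parameterised helper could replace them; the sketch below is one possible shape, with `count_chunks` as a hypothetical name. It assumes the input is a list of NLTK `Tree` objects like those produced by `RegexpParser.parse()`.

```python
from collections import Counter

from nltk import Tree


def count_chunks(chunked_sentences, label, top_n=None):
    """Count chunks with the given label; return the top_n most common if requested."""
    counter = Counter(
        tuple(subtree)
        for sentence in chunked_sentences
        for subtree in sentence.subtrees(filter=lambda t: t.label() == label)
    )
    return counter.most_common(top_n) if top_n is not None else counter


# Tiny hand-built example standing in for the parser output
demo = [
    Tree('S', [Tree('NP', [('the', 'DT'), ('facts', 'NNS')]), ('are', 'VBP')]),
    Tree('S', [('see', 'VB'), Tree('NP', [('the', 'DT'), ('facts', 'NNS')])]),
]

print(count_chunks(demo, 'NP'))
print(count_chunks(demo, 'NP', top_n=1))
```

With such a helper, `count_chunks(np_chunks_sh_adventures, 'NP', top_n=15)` would play the role of `np_common_chunks(np_chunks_sh_adventures)`.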

In [75]:
np_common_chunks(np_chunks_sh_adventures)

[((('the', 'DT'), ('facts', 'NNS')), 40),
 ((('the', 'DT'), ('papers', 'NNS')), 36),
 ((('the', 'DT'), ('windows', 'NNS')), 32),
 ((('a', 'DT'), ('few', 'JJ'), ('minutes', 'NNS')), 32),
 ((('the', 'DT'), ('words', 'NNS')), 18),
 ((('the', 'DT'), ('steps', 'NNS')), 16),
 ((('the', 'DT'), ('‘', 'NNP')), 16),
 ((('the', 'DT'), ('others', 'NNS')), 14),
 ((('the', 'DT'), ('streets', 'NNS')), 14),
 ((('some', 'DT'), ('years', 'NNS')), 14),
 ((('the', 'DT'), ('shutters', 'NNS')), 14),
 ((('the', 'DT'), ('stairs', 'NNS')), 12),
 ((('the', 'DT'), ('hands', 'NNS')), 12),
 ((('the', 'DT'), ('trees', 'NNS')), 12),
 ((('the', 'DT'), ('initials', 'NNS')), 12)]

In [76]:
vp_common_chunks(vp_chunks_sh_adventures)

[((('would', 'MD'), ('be', 'VB')), 60),
 ((('should', 'MD'), ('be', 'VB')), 38),
 ((('must', 'MD'), ('be', 'VB')), 31),
 ((('could', 'MD'), ('see', 'VB')), 30),
 ((('will', 'MD'), ('be', 'VB')), 26),
 ((('shall', 'MD'), ('be', 'VB')), 25),
 ((('may', 'MD'), ('be', 'VB')), 22),
 ((('might', 'MD'), ('be', 'VB')), 19),
 ((('will', 'MD'), ('find', 'VB')), 13),
 ((('should', 'MD'), ('like', 'VB')), 12),
 ((('must', 'MD'), ('have', 'VB')), 12),
 ((('can', 'MD'), ('’', 'VB')), 11),
 ((('must', 'MD'), ('have', 'VB'), ('been', 'VBN')), 11),
 ((('can', 'MD'), ('be', 'VB')), 11),
 ((('will', 'MD'), ('excuse', 'VB')), 10)]

Interestingly, most of the top Noun Phrases begin with the determiner "the", and the most common verb in the Verb Phrases is "be". Great, now we can begin to analyse the other books and try to look for a pattern!

**The Memoirs of Sherlock Holmes - Chunks**

In [78]:
np_chunks_sh_memoirs = []
vp_chunks_sh_memoirs = []

In [79]:
for word in tagged_sh_memoirs:
    np_chunks_sh_memoirs.append(np_parser.parse(word))
    vp_chunks_sh_memoirs.append(vp_parser.parse(word))

In [80]:
np_chunk_counter(np_chunks_sh_memoirs)

Counter({(('the', 'DT'), ('“', 'NNP'), ('_gloria', 'NNP')): 1,
         (('the', 'DT'), ('quarter-mile', 'JJ'), ('posts', 'NNS')): 1,
         (('the', 'DT'), ('_chronicle_', 'NNS')): 1,
         (('those', 'DT'), ('cases', 'NNS')): 2,
         (('the', 'DT'), ('embellishments', 'NNS')): 1,
         (('the', 'DT'), ('special', 'JJ'), ('points', 'NNS')): 1,
         (('both', 'DT'), ('colonel', 'NNS')): 1,
         (('some', 'DT'), ('ways', 'NNS')): 2,
         (('the', 'DT'), ('essential', 'JJ'), ('facts', 'NNS')): 1,
         (('the', 'DT'), ('cushions', 'NNS')): 2,
         (('the', 'DT'), ('points', 'NNS')): 3,
         (('the', 'DT'), ('events', 'NNS')): 6,
         (('the', 'DT'), ('prizes', 'NNS')): 1,
         (('those', 'DT'), ('odds', 'NNS')): 1,
         (('these', 'DT'), ('lads', 'NNS')): 1,
         (('the', 'DT'), ('others', 'NNS')): 8,
         (('the', 'DT'), ('stables', 'NNS')): 15,
         (('no', 'DT'), ('children', 'NNS')): 2,
         (('the', 'DT'), ('horses', 'NN

In [81]:
len(np_chunk_counter(np_chunks_sh_memoirs))

603

In [82]:
vp_chunk_counter(vp_chunks_sh_memoirs)

Counter({(('shall', 'MD'), ('have', 'VB')): 11,
         (('could', 'MD'), ('challenge', 'VB')): 1,
         (('should', 'MD'), ('be', 'VB')): 18,
         (('would', 'MD'), ('confer', 'VB')): 1,
         (('will', 'MD'), ('go', 'VB')): 2,
         (('would', 'MD'), ('oblige', 'VB')): 2,
         (('should', 'MD'), ('be', 'VB'), ('used', 'VBN')): 1,
         (('may', 'MD'), ('be', 'VB'), ('drawn', 'VBN')): 1,
         (('would', 'MD'), ('think', 'VB')): 3,
         (('shall', 'MD'), ('enumerate', 'VB')): 1,
         (('may', 'MD'), ('wish', 'VB')): 1,
         (('should', 'MD'), ('drink', 'VB')): 1,
         (('would', 'MD'), ('be', 'VB')): 41,
         (('can', 'MD'), ('buy.', 'VB')): 1,
         (('may', 'MD'), ('put', 'VB')): 2,
         (('could', 'MD'), ('give', 'VB')): 6,
         (('may', 'MD'), ('add', 'VB')): 2,
         (('could', 'MD'), ('hear', 'VB')): 3,
         (('could', 'MD'), ('be', 'VB'), ('got', 'VBN')): 2,
         (('could', 'MD'), ('see', 'VB')): 29,
         (('

In [83]:
len(vp_chunk_counter(vp_chunks_sh_memoirs))

659

A pattern appears to be emerging. The second Sherlock Holmes book (The Memoirs of Sherlock Holmes) also has more distinct Verb Phrases than Noun Phrases. Could the Verb Phrases be an important part of this type of novel?

Let's see the most common structures for both!

In [84]:
np_common_chunks(np_chunks_sh_memoirs)

[((('the', 'DT'), ('facts', 'NNS')), 18),
 ((('a', 'DT'), ('few', 'JJ'), ('minutes', 'NNS')), 16),
 ((('the', 'DT'), ('stables', 'NNS')), 15),
 ((('the', 'DT'), ('papers', 'NNS')), 13),
 ((('no', 'DT'), ('means', 'NNS')), 11),
 ((('the', 'DT'), ('stairs', 'NNS')), 10),
 ((('some', 'DT'), ('years', 'NNS')), 9),
 ((('the', 'DT'), ('details', 'NNS')), 9),
 ((('the', 'DT'), ('others', 'NNS')), 8),
 ((('the', 'DT'), ('windows', 'NNS')), 8),
 ((('no', 'DT'), ('signs', 'NNS')), 7),
 ((('the', 'DT'), ('servants', 'NNS')), 7),
 ((('the', 'DT'), ('events', 'NNS')), 6),
 ((('the', 'DT'), ('_gloria', 'NNP')), 6),
 ((('the', 'DT'), ('questions', 'NNS')), 6)]

In [85]:
vp_common_chunks(vp_chunks_sh_memoirs)

[((('would', 'MD'), ('be', 'VB')), 41),
 ((('could', 'MD'), ('see', 'VB')), 29),
 ((('shall', 'MD'), ('be', 'VB')), 23),
 ((('will', 'MD'), ('be', 'VB')), 23),
 ((('must', 'MD'), ('be', 'VB')), 19),
 ((('should', 'MD'), ('be', 'VB')), 18),
 ((('could', 'MD'), ('be', 'VB')), 18),
 ((('will', 'MD'), ('find', 'VB')), 17),
 ((('must', 'MD'), ('have', 'VB'), ('been', 'VBN')), 16),
 ((('can', 'MD'), ('be', 'VB')), 14),
 ((('may', 'MD'), ('be', 'VB')), 13),
 ((('should', 'MD'), ('like', 'VB')), 13),
 ((('can', 'MD'), ('imagine', 'VB')), 13),
 ((('shall', 'MD'), ('have', 'VB')), 11),
 ((('should', 'MD'), ('have', 'VB')), 11)]

The results are very similar to the most common structures in the first book, which gives us an idea of the author's writing style. That's two out of three; should we expect similar results for the third book?

**The Return of Sherlock Holmes**

In [88]:
np_chunks_sh_return = []
vp_chunks_sh_return = []

In [89]:
for word in tagged_sh_return:
    np_chunks_sh_return.append(np_parser.parse(word))
    vp_chunks_sh_return.append(vp_parser.parse(word))

In [90]:
np_chunk_counter(np_chunks_sh_return)

Counter({(('those', 'DT'), ('particulars', 'NNS')): 1,
         (('the', 'DT'), ('facts', 'NNS')): 15,
         (('those', 'DT'), ('glimpses', 'NNS')): 1,
         (('the', 'DT'), ('thoughts', 'NNS')): 1,
         (('the', 'DT'), ('various', 'JJ'), ('problems', 'NNS')): 1,
         (('the', 'DT'), ('efforts', 'NNS')): 2,
         (('the', 'DT'), ('australian', 'JJ'), ('colonies', 'NNS')): 1,
         (('no', 'DT'), ('enemies', 'NNS')): 1,
         (('no', 'DT'), ('particular', 'JJ'), ('vices', 'NNS')): 1,
         (('some', 'DT'), ('months', 'NNS')): 6,
         (('the', 'DT'), ('hours', 'NNS')): 1,
         (('the', 'DT'), ('cards', 'NNS')): 1,
         (('some', 'DT'), ('weeks', 'NNS')): 3,
         (('some', 'DT'), ('figures', 'NNS')): 1,
         (('the', 'DT'), ('names', 'NNS')): 2,
         (('the', 'DT'), ('circumstances', 'NNS')): 6,
         (('the', 'DT'), ('flowers', 'NNS')): 3,
         (('any', 'DT'), ('marks', 'NNS')): 1,
         (('a', 'DT'), ('hundred', 'JJ'), ('yards'

In [91]:
len(np_chunk_counter(np_chunks_sh_return))

681

In [92]:
vp_chunk_counter(vp_chunks_sh_return)

Counter({(('should', 'MD'), ('have', 'VB'), ('considered', 'VBN')): 1,
         (('can', 'MD'), ('be', 'VB'), ('imagined', 'VBN')): 1,
         (('would', 'MD'),
          ('have', 'VB'),
          ('been', 'VBN'),
          ('supplemented', 'VBN')): 1,
         (('will', 'MD'), ('recapitulate', 'VB')): 1,
         (('would', 'MD'), ('hurt', 'VB')): 1,
         (('might', 'MD'), ('have', 'VB'), ('lost', 'VBN')): 1,
         (('could', 'MD'), ('be', 'VB'), ('got', 'VBN')): 1,
         (('could', 'MD'), ('be', 'VB'), ('given', 'VBN')): 1,
         (('should', 'MD'), ('have', 'VB'), ('fastened', 'VBN')): 1,
         (('could', 'MD'), ('have', 'VB'), ('climbed', 'VBN')): 1,
         (('must', 'MD'), ('have', 'VB'), ('caused', 'VBN')): 1,
         (('could', 'MD'), ('reconcile', 'VB')): 1,
         (('must', 'MD'), ('be', 'VB')): 21,
         (('could', 'MD'), ('help', 'VB')): 2,
         (('may', 'MD'), ('i', 'VB'), ('ask', 'VB')): 4,
         (('must', 'MD'), ('have', 'VB'), ('fainted', '

In [93]:
len(vp_chunk_counter(vp_chunks_sh_return))

807

In [94]:
np_common_chunks(np_chunks_sh_return)

[((('the', 'DT'), ('papers', 'NNS')), 32),
 ((('the', 'DT'), ('facts', 'NNS')), 15),
 ((('the', 'DT'), ('eyes', 'NNS')), 10),
 ((('the', 'DT'), ('servants', 'NNS')), 10),
 ((('a', 'DT'), ('few', 'JJ'), ('hours', 'NNS')), 9),
 ((('a', 'DT'), ('few', 'JJ'), ('minutes', 'NNS')), 9),
 ((('the', 'DT'), ('grounds', 'NNS')), 8),
 ((('the', 'DT'), ('letters', 'NNS')), 8),
 ((('the', 'DT'), ('others', 'NNS')), 7),
 ((('the', 'DT'), ('windows', 'NNS')), 7),
 ((('the', 'DT'), ('busts', 'NNS')), 7),
 ((('some', 'DT'), ('months', 'NNS')), 6),
 ((('the', 'DT'), ('circumstances', 'NNS')), 6),
 ((('the', 'DT'), ('tracks', 'NNS')), 6),
 ((('the', 'DT'), ('hands', 'NNS')), 6)]

In [95]:
vp_common_chunks(vp_chunks_sh_return)

[((('would', 'MD'), ('be', 'VB')), 36),
 ((('should', 'MD'), ('be', 'VB')), 32),
 ((('will', 'MD'), ('be', 'VB')), 31),
 ((('could', 'MD'), ('see', 'VB')), 28),
 ((('may', 'MD'), ('be', 'VB')), 24),
 ((('shall', 'MD'), ('be', 'VB')), 23),
 ((('must', 'MD'), ('be', 'VB')), 21),
 ((('can', 'MD'), ('’', 'VB')), 20),
 ((('can', 'MD'), ('be', 'VB')), 12),
 ((('will', 'MD'), ('find', 'VB')), 12),
 ((('might', 'MD'), ('be', 'VB')), 11),
 ((('would', 'MD'), ('have', 'VB')), 11),
 ((('will', 'MD'), ('tell', 'VB')), 11),
 ((('will', 'MD'), ('see', 'VB')), 10),
 ((('should', 'MD'), ('like', 'VB')), 10)]

**The Leavenworth Case**

In [96]:
np_chunks_leavenworth = []
vp_chunks_leavenworth = []

In [97]:
for word in tagged_anna_leavenworth:
    np_chunks_leavenworth.append(np_parser.parse(word))
    vp_chunks_leavenworth.append(vp_parser.parse(word))

In [98]:
np_chunk_counter(np_chunks_leavenworth)

Counter({(('the', 'DT'), ('summons', 'NNS')): 4,
         (('the', 'DT'), ('stairs', 'NNS')): 14,
         (('both', 'DT'), ('mr.', 'NNP'), ('veeley', 'NNP')): 1,
         (('the', 'DT'), ('misses', 'NNS')): 4,
         (('the', 'DT'), ('ladies', 'NNS')): 20,
         (('the', 'DT'),
          ('few', 'JJ'),
          ('other', 'JJ'),
          ('preparations', 'NNS')): 1,
         (('a', 'DT'), ('few', 'JJ'), ('words', 'NNS')): 7,
         (('a', 'DT'), ('half-dozen', 'JJ'), ('steps', 'NNS')): 1,
         (('these', 'DT'), ('ladies', 'NNS')): 4,
         (('all', 'DT'), ('intercourse', 'JJ'), ('upon', 'NNS')): 1,
         (('the', 'DT'), ('importunities', 'NNS')): 1,
         (('the', 'DT'), ('steps', 'NNS')): 2,
         (('the', 'DT'), ('young', 'JJ'), ('ladies', 'NNS')): 7,
         (('the', 'DT'), ('fastenings', 'NNS')): 1,
         (('these', 'DT'), ('things', 'NNS')): 4,
         (('the', 'DT'), ('repositories', 'NNS')): 1,
         (('the', 'DT'), ('secrets', 'NNS')): 1,
      

In [99]:
len(np_chunk_counter(np_chunks_leavenworth))

467

In [100]:
vp_chunk_counter(vp_chunks_leavenworth)

Counter({(('will', 'MD'), ('make', 'VB')): 4,
         (('will', 'MD'), ('be', 'VB'), ('overwhelmed', 'VBN')): 1,
         (('can', 'MD'), ('be', 'VB')): 7,
         (('would', 'MD'), ('be', 'VB')): 54,
         (('will', 'MD'), ('go.', 'VB')): 3,
         (('will', 'MD'), ('do', 'VB')): 7,
         (('must', 'MD'), ('have', 'VB'), ('been', 'VBN')): 9,
         (('will', 'MD'), ('defer', 'VB')): 1,
         (('might', 'MD'), ('succeed', 'VB')): 1,
         (('should', 'MD'), ('think', 'VB')): 6,
         (('would', 'MD'), ('wish', 'VB')): 1,
         (('will', 'MD'), ('go', 'VB')): 5,
         (('would', 'MD'), ('seem', 'VB')): 5,
         (('will', 'MD'), ('miss', 'VB')): 1,
         (('should', 'MD'), ('occur', 'VB')): 1,
         (('must', 'MD'), ('have', 'VB'), ('advanced', 'VBN')): 1,
         (('will', 'MD'), ('convince', 'VB')): 1,
         (('may', 'MD'), ('have', 'VB'), ('come', 'VBN')): 1,
         (('must', 'MD'), ('be', 'VB')): 13,
         (('might', 'MD'), ('require', 'VB

In [101]:
len(vp_chunk_counter(vp_chunks_leavenworth))

709

Once again, we come across this result: there are more distinct Verb Phrases than Noun Phrases in the book. This time we've analysed a different author, so having more Verb Phrases could indicate a grammatical preference in this type of novel. Let's see the common structures.
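
To compare books of different sizes, the ratio of Verb Phrases to Noun Phrases can be more telling than the raw totals. The sketch below uses the distinct-chunk counts reported so far in this notebook; the dictionary layout and variable names are assumptions made for illustration.

```python
# (distinct NP chunks, distinct VP chunks) as reported by len() above
chunk_counts = {
    "The Adventures of Sherlock Holmes": (622, 729),
    "The Memoirs of Sherlock Holmes": (603, 659),
    "The Return of Sherlock Holmes": (681, 807),
    "The Leavenworth Case": (467, 709),
}

# Ratio above 1 means more distinct Verb Phrases than Noun Phrases
vp_np_ratio = {title: vp / np_ for title, (np_, vp) in chunk_counts.items()}

for title, ratio in vp_np_ratio.items():
    print(f"{title}: {ratio:.2f}")
```

On these numbers every ratio is above 1, and the ratio for The Leavenworth Case is the highest of the four.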

In [102]:
np_common_chunks(np_chunks_leavenworth)

[((('the', 'DT'), ('papers', 'NNS')), 25),
 ((('the', 'DT'), ('ladies', 'NNS')), 20),
 ((('the', 'DT'), ('stairs', 'NNS')), 14),
 ((('a', 'DT'), ('few', 'JJ'), ('minutes', 'NNS')), 14),
 ((('the', 'DT'), ('servants', 'NNS')), 14),
 ((('the', 'DT'), ('words', 'NNS')), 13),
 ((('the', 'DT'), ('consequences', 'NNS')), 11),
 ((('the', 'DT'), ('facts', 'NNS')), 10),
 ((('these', 'DT'), ('words', 'NNS')), 10),
 ((('a', 'DT'), ('few', 'JJ'), ('words', 'NNS')), 7),
 ((('the', 'DT'), ('young', 'JJ'), ('ladies', 'NNS')), 7),
 ((('all', 'DT'), ('events', 'NNS')), 7),
 ((('the', 'DT'), ('suspicions', 'NNS')), 7),
 ((('the', 'DT'), ('circumstances', 'NNS')), 7),
 ((('all', 'DT'), ('others', 'NNS')), 6)]

In [103]:
vp_common_chunks(vp_chunks_leavenworth)

[((('would', 'MD'), ('be', 'VB')), 54),
 ((('will', 'MD'), ('be', 'VB')), 26),
 ((('may', 'MD'), ('be', 'VB')), 15),
 ((('must', 'MD'), ('be', 'VB')), 13),
 ((('would', 'MD'), ('have', 'VB')), 11),
 ((('will', 'MD'), ('have', 'VB')), 11),
 ((('would', 'MD'), ('have', 'VB'), ('been', 'VBN')), 11),
 ((('can', 'MD'), ('’', 'VB')), 11),
 ((('could', 'MD'), ('be', 'VB')), 10),
 ((('can', 'MD'), ('give', 'VB')), 10),
 ((('must', 'MD'), ('have', 'VB'), ('been', 'VBN')), 9),
 ((('can', 'MD'), ('do', 'VB')), 9),
 ((('should', 'MD'), ('be', 'VB')), 8),
 ((('shall', 'MD'), ('be', 'VB')), 8),
 ((('can', 'MD'), ('be', 'VB')), 7)]

The common structures used by Anna are very similar to the ones used by Arthur: "the" is the predominant determiner and "be" is the predominant verb in the book. Very interesting. Will the other book provide the same results?
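
One way to put a number on "very similar" is to measure how much two top-15 Verb Phrase lists overlap. The sketch below applies the Jaccard index (shared items divided by all distinct items) to the lists printed above for The Adventures of Sherlock Holmes and The Leavenworth Case. This metric is a suggestion for illustration, not part of the original analysis.

```python
# Top-15 Verb Phrases (words only), copied from the outputs above
adventures_top = {
    "would be", "should be", "must be", "could see", "will be",
    "shall be", "may be", "might be", "will find", "should like",
    "must have", "can ’", "must have been", "can be", "will excuse",
}
leavenworth_top = {
    "would be", "will be", "may be", "must be", "would have",
    "will have", "would have been", "can ’", "could be", "can give",
    "must have been", "can do", "should be", "shall be", "can be",
}


def jaccard(a, b):
    """Return |a & b| / |a | b| for two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


shared = adventures_top & leavenworth_top
print(len(shared), round(jaccard(adventures_top, leavenworth_top), 2))
```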

**That Affair Next Door**

In [104]:
np_chunks_nd_affair = []
vp_chunks_nd_affair = []

In [112]:
for word in tagged_anna_nd_affair:
    np_chunks_nd_affair.append(np_parser.parse(word))
    vp_chunks_nd_affair.append(vp_parser.parse(word))

In [113]:
np_chunk_counter(np_chunks_nd_affair)

Counter({(('the', 'DT'), ('summons', 'NNS')): 6,
         (('the', 'DT'), ('stairs', 'NNS')): 21,
         (('both', 'DT'), ('mr.', 'NNP'), ('veeley', 'NNP')): 1,
         (('the', 'DT'), ('misses', 'NNS')): 16,
         (('the', 'DT'), ('ladies', 'NNS')): 21,
         (('the', 'DT'),
          ('few', 'JJ'),
          ('other', 'JJ'),
          ('preparations', 'NNS')): 1,
         (('a', 'DT'), ('few', 'JJ'), ('words', 'NNS')): 10,
         (('a', 'DT'), ('half-dozen', 'JJ'), ('steps', 'NNS')): 1,
         (('these', 'DT'), ('ladies', 'NNS')): 4,
         (('all', 'DT'), ('intercourse', 'JJ'), ('upon', 'NNS')): 1,
         (('the', 'DT'), ('importunities', 'NNS')): 1,
         (('the', 'DT'), ('steps', 'NNS')): 6,
         (('the', 'DT'), ('young', 'JJ'), ('ladies', 'NNS')): 13,
         (('the', 'DT'), ('fastenings', 'NNS')): 1,
         (('these', 'DT'), ('things', 'NNS')): 5,
         (('the', 'DT'), ('repositories', 'NNS')): 1,
         (('the', 'DT'), ('secrets', 'NNS')): 2,
   

In [114]:
len(np_chunk_counter(np_chunks_nd_affair))

878

In [115]:
vp_chunk_counter(vp_chunks_nd_affair)

Counter({(('will', 'MD'), ('make', 'VB')): 7,
         (('will', 'MD'), ('be', 'VB'), ('overwhelmed', 'VBN')): 1,
         (('can', 'MD'), ('be', 'VB')): 12,
         (('would', 'MD'), ('be', 'VB')): 85,
         (('will', 'MD'), ('go.', 'VB')): 3,
         (('will', 'MD'), ('do', 'VB')): 16,
         (('must', 'MD'), ('have', 'VB'), ('been', 'VBN')): 17,
         (('will', 'MD'), ('defer', 'VB')): 1,
         (('might', 'MD'), ('succeed', 'VB')): 1,
         (('should', 'MD'), ('think', 'VB')): 10,
         (('would', 'MD'), ('wish', 'VB')): 1,
         (('will', 'MD'), ('go', 'VB')): 12,
         (('would', 'MD'), ('seem', 'VB')): 7,
         (('will', 'MD'), ('miss', 'VB')): 1,
         (('should', 'MD'), ('occur', 'VB')): 1,
         (('must', 'MD'), ('have', 'VB'), ('advanced', 'VBN')): 1,
         (('will', 'MD'), ('convince', 'VB')): 2,
         (('may', 'MD'), ('have', 'VB'), ('come', 'VBN')): 1,
         (('must', 'MD'), ('be', 'VB')): 19,
         (('might', 'MD'), ('require'

In [116]:
len(vp_chunk_counter(vp_chunks_nd_affair))

1165
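Note that calling `len()` on a `Counter` returns the number of distinct chunk patterns, not the total number of times chunks occur in the text. The sketch below shows both figures side by side; `chunk_summary` is a hypothetical helper, and the sample counter holds only three entries copied from the output above.

```python
from collections import Counter

def chunk_summary(chunk_counter):
    """Return (distinct patterns, total occurrences) for a chunk Counter."""
    distinct = len(chunk_counter)            # number of different chunk patterns
    total = sum(chunk_counter.values())      # how many times chunks occurred overall
    return distinct, total

# Three entries copied from the verb-phrase Counter above
sample_vp = Counter({
    (("would", "MD"), ("be", "VB")): 85,
    (("must", "MD"), ("be", "VB")): 19,
    (("will", "MD"), ("defer", "VB")): 1,
})

distinct, total = chunk_summary(sample_vp)
print(distinct, total)
```

Comparing total occurrences as well as distinct patterns may give a fuller picture of how often each structure is actually used.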

For the fifth time, **Verb Phrases** were found to be more common than **Noun Phrases**. The similarities are clear and apparently consistent! Are the common structures the same?

In [117]:
np_common_chunks(np_chunks_nd_affair)

[((('the', 'DT'), ('papers', 'NNS')), 35),
 ((('a', 'DT'), ('few', 'JJ'), ('minutes', 'NNS')), 31),
 ((('the', 'DT'), ('rings', 'NNS')), 27),
 ((('the', 'DT'), ('hands', 'NNS')), 25),
 ((('the', 'DT'), ('words', 'NNS')), 22),
 ((('the', 'DT'), ('stairs', 'NNS')), 21),
 ((('the', 'DT'), ('ladies', 'NNS')), 21),
 ((('these', 'DT'), ('words', 'NNS')), 20),
 ((('the', 'DT'), ('facts', 'NNS')), 19),
 ((('the', 'DT'), ('keys', 'NNS')), 19),
 ((('the', 'DT'), ('shelves', 'NNS')), 19),
 ((('the', 'DT'), ('misses', 'NNS')), 16),
 ((('the', 'DT'), ('consequences', 'NNS')), 15),
 ((('the', 'DT'), ('streets', 'NNS')), 14),
 ((('the', 'DT'), ('servants', 'NNS')), 14)]

In [118]:
vp_common_chunks(vp_chunks_nd_affair)

[((('would', 'MD'), ('be', 'VB')), 85),
 ((('will', 'MD'), ('be', 'VB')), 55),
 ((('may', 'MD'), ('be', 'VB')), 26),
 ((('will', 'MD'), ('have', 'VB')), 24),
 ((('would', 'MD'), ('have', 'VB'), ('been', 'VBN')), 22),
 ((('should', 'MD'), ('be', 'VB')), 22),
 ((('would', 'MD'), ('have', 'VB')), 21),
 ((('must', 'MD'), ('be', 'VB')), 19),
 ((('must', 'MD'), ('have', 'VB'), ('been', 'VBN')), 17),
 ((('should', 'MD'), ('like', 'VB')), 17),
 ((('will', 'MD'), ('do', 'VB')), 16),
 ((('could', 'MD'), ('be', 'VB')), 16),
 ((('could', 'MD'), ('see', 'VB')), 16),
 ((('can', 'MD'), ('do', 'VB')), 14),
 ((('should', 'MD'), ('say', 'VB')), 13)]

Yes, they are! This leads us to believe that these two authors have similar writing styles.
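One way to put a number on "similar" is to measure how much two top-chunk lists overlap. The sketch below uses the Jaccard index (shared patterns divided by all patterns in either list). It is only an illustration: `top_chunk_overlap` is a hypothetical helper, and the sample data are the top five verb phrases from the two Anna Katharine Green outputs above, but the same function could be applied to any pair of lists, including one from each author.

```python
def top_chunk_overlap(common_a, common_b):
    """Return the shared chunk patterns and the Jaccard index of two top lists."""
    chunks_a = {chunk for chunk, _count in common_a}
    chunks_b = {chunk for chunk, _count in common_b}
    shared = chunks_a & chunks_b
    union = chunks_a | chunks_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Top five verb phrases copied from the two outputs above
leavenworth_top = [
    ((("would", "MD"), ("be", "VB")), 54),
    ((("will", "MD"), ("be", "VB")), 26),
    ((("may", "MD"), ("be", "VB")), 15),
    ((("must", "MD"), ("be", "VB")), 13),
    ((("would", "MD"), ("have", "VB")), 11),
]
affair_top = [
    ((("would", "MD"), ("be", "VB")), 85),
    ((("will", "MD"), ("be", "VB")), 55),
    ((("may", "MD"), ("be", "VB")), 26),
    ((("will", "MD"), ("have", "VB")), 24),
    ((("would", "MD"), ("have", "VB"), ("been", "VBN")), 22),
]

shared, jaccard = top_chunk_overlap(leavenworth_top, affair_top)
print(len(shared), round(jaccard, 3))
```

A value close to 1 means the two lists contain nearly the same patterns; a value close to 0 means they share almost nothing.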

### Step 6 - Complexity Analysis
***

We will now go the extra mile and analyse the complexity of the books to see which author packs in more detail.

We could say that the amount of detail in a book depends on adjectives, since they are the words that add characteristics and qualities to other words. Thus, the number of adjectives should be **directly related** to the amount of detail present in a novel (or pretty much any other type of book). Let's see how detailed the books are. To do this, we'll once again use regex to search for part-of-speech tags!

In [3]:
# Chunk grammar that matches single tokens tagged JJ (adjective)
adj = "ADJ: {<JJ>}"
adj_parser = RegexpParser(adj)
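Because the books differ in length, raw adjective counts are hard to compare directly. As a rough cross-check, the share of tokens tagged as adjectives can be computed straight from the tagged sentences. This is a minimal sketch, not the notebook's method: `adjective_ratio` is a hypothetical helper, the sample sentences are made up from chunks seen earlier, and by default it counts only the `JJ` tag to mirror the grammar above (pass `("JJ", "JJR", "JJS")` to include comparatives and superlatives).

```python
def adjective_ratio(tagged_sentences, adjective_tags=("JJ",)):
    """Return the fraction of tokens whose POS tag is in adjective_tags."""
    adjectives = 0
    tokens = 0
    for sentence in tagged_sentences:
        for _word, tag in sentence:
            tokens += 1
            if tag in adjective_tags:
                adjectives += 1
    return adjectives / tokens if tokens else 0.0

# Illustrative tagged sentences built from chunks shown earlier
sample = [
    [("the", "DT"), ("young", "JJ"), ("ladies", "NNS")],
    [("a", "DT"), ("few", "JJ"), ("words", "NNS")],
    [("the", "DT"), ("stairs", "NNS")],
]

ratio = adjective_ratio(sample)
print(ratio)
```

Running it on each book's tagged sentences would give one comparable figure per book, regardless of length.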