# **Text Analysis of *The Iliad* and *The Picture of Dorian Gray***

## **Introduction**
This project aims to analyze two classic literary works, *The Iliad* by Homer and *The Picture of Dorian Gray* by Oscar Wilde, using Natural Language Processing (NLP) techniques. The primary goal is to extract meaningful insights by tokenizing sentences, tagging parts of speech, and identifying the most common noun and verb phrases. 

By leveraging Python and the Natural Language Toolkit (NLTK), we process both texts to break them down into structured data, enabling us to compare linguistic patterns and writing styles.

## **Scope and Goals**
The key objectives of this project include:
1. **Tokenization:** Breaking down the texts into sentences and words.
2. **POS (Part-of-Speech) Tagging:** Identifying grammatical categories for each word.
3. **Chunking:** Extracting common **Noun Phrases (NPs)** and **Verb Phrases (VPs)** from the text.
4. **Frequency Analysis:** Identifying the 30 most common NPs and VPs in each text.
5. **Comparative Analysis:** Observing differences in phrase usage between the two works.

---

## **Workflow Process and Steps**

### **1. Data Import and Preprocessing**
The two texts are loaded from separate files (`the_iliad.txt` and `dorian_gray.txt`). They are then converted to lowercase for consistency in processing.

### **2. Tokenization**
We tokenize each text in two stages:
   - **Sentence Tokenization:** Splitting the text into individual sentences.
   - **Word Tokenization:** Breaking each sentence into its component words.
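
The two stages produce a list of sentences, each of which is itself a list of word tokens. The sketch below illustrates that shape with a deliberately simplified, regex-based stand-in; it is not NLTK's Punkt algorithm, and the function name `naive_word_sentence_tokenize` is hypothetical.

```python
import re

def naive_word_sentence_tokenize(text: str) -> list[list[str]]:
    """Simplified stand-in for two-stage tokenization (not NLTK's Punkt algorithm)."""
    # Stage 1: split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Stage 2: words are runs of word characters; each punctuation mark is its own token.
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences if sentence]

tokens = naive_word_sentence_tokenize("the gods looked on. hector fell!")
print(tokens)
# [['the', 'gods', 'looked', 'on', '.'], ['hector', 'fell', '!']]
```

A rule this simple would split after abbreviations such as "mr.", which is one reason the project relies on NLTK's tokenizers instead.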

### **3. POS Tagging**
Each word in the tokenized sentences is assigned a part-of-speech (POS) tag. This step helps categorize words as nouns, verbs, adjectives, etc.
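
NLTK's `pos_tag` returns Penn Treebank tags by default, where related tags share a prefix (`NN`, `NNS`, `NNP` for nouns; `VB`, `VBD`, `VBZ` for verbs, and so on). As a rough illustration of how tags map to broad word classes, the sketch below groups tags by their first two letters. The helper `coarse_category` is hypothetical and not part of NLTK, and the input is a hand-copied excerpt of tagger output.

```python
def coarse_category(tag: str) -> str:
    """Map a Penn Treebank tag to a coarse word class by its two-letter prefix."""
    prefixes = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}
    return prefixes.get(tag[:2], "other")

# (word, tag) pairs in the format produced by nltk.pos_tag
tagged = [("the", "DT"), ("commonest", "JJS"), ("thing", "NN"),
          ("is", "VBZ"), ("delightful", "JJ")]
categories = [(word, coarse_category(tag)) for word, tag in tagged]
print(categories)
# [('the', 'other'), ('commonest', 'adjective'), ('thing', 'noun'),
#  ('is', 'verb'), ('delightful', 'adjective')]
```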

### **4. Chunking (NP and VP Extraction)**
Using **regular expression-based chunk grammars**, we define:
   - **Noun Phrases (NPs):** Determiner (optional) → Adjective(s) (optional) → Noun.
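   - **Verb Phrases (VPs):** Determiner (optional) → Adjective(s) (optional) → Noun → Verb → Adverb (optional). Each VP chunk therefore includes the noun that precedes the verb.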

These chunk grammars are applied using NLTK's `RegexpParser`.
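
To see what the two grammars capture, the sketch below applies them to a single hand-tagged sentence, so no tagger model is needed. The sentence and the helper `chunks_with_label` are made up for illustration; the grammar strings are the ones used in this project.

```python
from nltk import RegexpParser

# Hand-tagged sentence, so no tagger model is needed to run this sketch.
tagged_sentence = [("the", "DT"), ("old", "JJ"), ("king", "NN"),
                   ("spoke", "VBD"), ("slowly", "RB"), (".", ".")]

np_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
vp_parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")

def chunks_with_label(tree, label):
    """Return each chunk carrying `label` as a list of (word, tag) pairs."""
    return [list(subtree) for subtree in tree.subtrees(filter=lambda t: t.label() == label)]

np_chunks = chunks_with_label(np_parser.parse(tagged_sentence), "NP")
vp_chunks = chunks_with_label(vp_parser.parse(tagged_sentence), "VP")
print(np_chunks)  # [[('the', 'DT'), ('old', 'JJ'), ('king', 'NN')]]
print(vp_chunks)  # [[('the', 'DT'), ('old', 'JJ'), ('king', 'NN'), ('spoke', 'VBD'), ('slowly', 'RB')]]
```

Note that `<NN>` matches only the exact tag `NN`, so plural (`NNS`) and proper (`NNP`) nouns fall outside both grammars, whereas `<VB.*>` matches every verb tag.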

### **5. Frequency Analysis of NPs and VPs**
We extract and count the **30 most common** noun and verb phrases from both texts.
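
Counting relies on each chunk being stored as a tuple of `(word, tag)` pairs, which is hashable and can therefore serve as a `Counter` key. A minimal sketch with a handful of made-up occurrences:

```python
from collections import Counter

# Each chunk is a tuple of (word, tag) pairs; tuples are hashable, lists are not.
chunks = [
    (("the", "DT"), ("plain", "NN")),
    (("war", "NN"),),
    (("the", "DT"), ("plain", "NN")),
    (("war", "NN"),),
    (("war", "NN"),),
    (("fate", "NN"),),
]
top_two = Counter(chunks).most_common(2)
print(top_two)
# [((('war', 'NN'),), 3), ((('the', 'DT'), ('plain', 'NN')), 2)]
```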

---


##### Import Libraries

In [1]:
import nltk
from nltk.tokenize import PunktSentenceTokenizer, word_tokenize
from nltk import pos_tag, RegexpParser
from collections import Counter

#### Helper Functions for Data Preprocessing and Analysis

We define three helper functions below to preprocess the data and analyze the text.

- `word_sentence_tokenize(text: str)`
This function tokenizes the input text into sentences and further tokenizes each sentence into words.

Parameters: `text` (*str*): The input text.

Returns: A list of lists, where each inner list contains word tokens of a sentence.


- `np_chunk_counter(chunked_sentences)`
This function extracts noun phrase (NP) chunks from chunked sentences and returns the 30 most common NP chunks.

Parameters: `chunked_sentences` (*list*): A list of chunked sentences (output of a `RegexpParser`).

Returns: A list of tuples, where each tuple contains a noun phrase chunk and its frequency.

- `vp_chunk_counter(chunked_sentences)`
This function extracts verb phrase (VP) chunks from chunked sentences and returns the 30 most common VP chunks.

Parameters: `chunked_sentences` (*list*): A list of chunked sentences (output of a `RegexpParser`).

Returns: A list of tuples, where each tuple contains a verb phrase chunk and its frequency.

In [2]:
def word_sentence_tokenize(text: str):
    # Train an unsupervised Punkt sentence tokenizer on the input text itself
    sentence_tokenizer = PunktSentenceTokenizer(text)
    # Tokenize text into sentences
    sentence_tokenized = sentence_tokenizer.tokenize(text)
    
    # Tokenize each sentence into words
    word_tokenized = [word_tokenize(sentence) for sentence in sentence_tokenized]
    return word_tokenized

In [3]:
def np_chunk_counter(chunked_sentences):
    chunks = []
    for chunked_sentence in chunked_sentences:
        # Extract all subtrees with label 'NP'
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'NP'):
            # Convert subtree to a tuple for hashability
            chunks.append(tuple(subtree))
    
    # Count frequency of each NP chunk
    chunk_counter = Counter(chunks)
    return chunk_counter.most_common(30)

In [4]:
def vp_chunk_counter(chunked_sentences):
    chunks = []
    for chunked_sentence in chunked_sentences:
        # Extract all subtrees with label 'VP'
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'VP'):
            chunks.append(tuple(subtree))
    
    # Count frequency of each VP chunk
    chunk_counter = Counter(chunks)
    return chunk_counter.most_common(30)
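
The two counters above differ only in the chunk label they filter on. One possible refactor, sketched here under the assumption that nothing else needs to vary, is a single function that takes the label and the cutoff as parameters. The name `chunk_counter` and the hand-built sample trees are illustrative and not part of the original notebook.

```python
from collections import Counter
from nltk import Tree

def chunk_counter(chunked_sentences, label, n=30):
    """Count chunks carrying `label` and return the n most common."""
    chunks = [
        tuple(subtree)  # tuple of (word, tag) pairs, hashable for Counter
        for sentence in chunked_sentences
        for subtree in sentence.subtrees(filter=lambda t: t.label() == label)
    ]
    return Counter(chunks).most_common(n)

# Two hand-built chunked sentences standing in for RegexpParser output.
sample = [
    Tree("S", [Tree("NP", [("the", "DT"), ("king", "NN")]), ("spoke", "VBD")]),
    Tree("S", [Tree("NP", [("the", "DT"), ("king", "NN")]), ("fell", "VBD"),
               Tree("NP", [("troy", "NN")])]),
]
top_np = chunk_counter(sample, "NP", n=2)
print(top_np)
# [((('the', 'DT'), ('king', 'NN')), 2), ((('troy', 'NN'),), 1)]
```

With this shape, `chunk_counter(np_chunked_text, "NP")` and `chunk_counter(vp_chunked_text, "VP")` would stand in for the two separate helpers.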

### Main Processing Pipeline for Iliad Text

The following code reads `the_iliad.txt`, processes it through the NLP pipeline, and displays:
 - A sample word-tokenized sentence.
 - A sample POS-tagged sentence.
 - The 30 most common noun phrase (NP) chunks.
 - The 30 most common verb phrase (VP) chunks.

#### Import text

In [5]:
try:
    with open("the_iliad.txt", encoding='utf-8') as file:
        text = file.read().lower()
except FileNotFoundError:
    raise FileNotFoundError("The file 'the_iliad.txt' was not found in the current directory.")

#### Sentence and word tokenization


In [6]:
word_tokenized_text = word_sentence_tokenize(text)
print("Sample word-tokenized sentence (index 100):")
print(word_tokenized_text[100])  # display sample sentence

Sample word-tokenized sentence (index 100):
['he', 'appears', 'as', 'the', 'enunciator', 'of', 'opinions', 'as', 'different', 'in', 'their', 'tone', 'as', 'those', 'of', 'the', 'writers', 'who', 'have', 'handed', 'them', 'down', '.']


#### Part-of-speech tagging for each tokenized sentence

In [7]:
pos_tagged_text = [pos_tag(sentence) for sentence in word_tokenized_text]

print("\nSample POS-tagged sentence (index 101):")
print(pos_tagged_text[101])


Sample POS-tagged sentence (index 101):
[('when', 'WRB'), ('we', 'PRP'), ('have', 'VBP'), ('read', 'VBN'), ('plato', 'JJ'), ('_or_', 'NNP'), ('xenophon', 'NNP'), (',', ','), ('we', 'PRP'), ('think', 'VBP'), ('we', 'PRP'), ('know', 'VBP'), ('something', 'NN'), ('of', 'IN'), ('socrates', 'NNS'), (';', ':'), ('when', 'WRB'), ('we', 'PRP'), ('have', 'VBP'), ('fairly', 'RB'), ('read', 'VBN'), ('and', 'CC'), ('examined', 'VBN'), ('both', 'DT'), (',', ','), ('we', 'PRP'), ('feel', 'VBP'), ('convinced', 'JJ'), ('that', 'IN'), ('we', 'PRP'), ('are', 'VBP'), ('something', 'NN'), ('worse', 'JJR'), ('than', 'IN'), ('ignorant', 'NN'), ('.', '.')]


#### Define chunk grammars and create chunk parsers

In [8]:
# Noun Phrase (NP) Grammar: Optional determiner, any number of adjectives, followed by a noun.
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# Verb Phrase (VP) Grammar: A simple pattern. Optional determiner, any number of adjectives, and a noun, followed by a verb and an optional adverb.
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

# Create chunk parsers using RegexpParser
np_chunk_parser = RegexpParser(np_chunk_grammar)
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

#### Apply chunking on POS-tagged sentences 

In [9]:
np_chunked_text = [np_chunk_parser.parse(sentence) for sentence in pos_tagged_text]
vp_chunked_text = [vp_chunk_parser.parse(sentence) for sentence in pos_tagged_text]

#### Get and display the most common NP and VP chunks

In [10]:
most_common_np_chunks = np_chunk_counter(np_chunked_text)
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)


print("\nMost common Noun Phrase (NP) chunks:")
for chunk, freq in most_common_np_chunks:
    print(f"{chunk}: {freq}")

print("\nMost common Verb Phrase (VP) chunks:")
for chunk, freq in most_common_vp_chunks:
    print(f"{chunk}: {freq}")


Most common Noun Phrase (NP) chunks:
(('hector', 'NN'),): 323
(('i', 'NN'),): 277
(('jove', 'NN'),): 257
(('troy', 'NN'),): 208
(('vain', 'NN'),): 195
(('war', 'NN'),): 193
(('son', 'NN'),): 170
(('thou', 'NN'),): 158
(('the', 'DT'), ('plain', 'NN')): 157
(('the', 'DT'), ('field', 'NN')): 154
(('the', 'DT'), ('ground', 'NN')): 138
(('death', 'NN'),): 134
(('hand', 'NN'),): 134
(('greece', 'NN'),): 128
(('heaven', 'NN'),): 127
(('fate', 'NN'),): 127
(('breast', 'NN'),): 122
(('thee', 'NN'),): 122
(('the', 'DT'), ('trojan', 'NN')): 120
(('the', 'DT'), ('god', 'NN')): 119
(('the', 'DT'), ('greeks', 'NN')): 117
(('the', 'DT'), ('war', 'NN')): 117
(('blood', 'NN'),): 115
(('homer', 'NN'),): 112
(('the', 'DT'), ('king', 'NN')): 105
(('rage', 'NN'),): 103
(('force', 'NN'),): 103
(('care', 'NN'),): 99
(('head', 'NN'),): 98
(('man', 'NN'),): 97

Most common Verb Phrase (VP) chunks:
(("'t", 'NN'), ('is', 'VBZ')): 19
(('i', 'NN'), ('am', 'VBP')): 11
(("'t", 'NN'), ('was', 'VBD')): 11
(('the', 'D

 ### Main Processing Pipeline for Dorian Gray Text

The following code reads `dorian_gray.txt`, processes it through the NLP pipeline, and displays:
 - A sample word-tokenized sentence.
 - A sample POS-tagged sentence.
 - The 30 most common noun phrase (NP) chunks.
 - The 30 most common verb phrase (VP) chunks.

#### Import text

In [11]:
try:
    with open("dorian_gray.txt", encoding='utf-8') as file:
        text = file.read().lower()
except FileNotFoundError:
    raise FileNotFoundError("The file 'dorian_gray.txt' was not found in the current directory.")

#### Sentence and word tokenization


In [12]:
word_tokenized_text = word_sentence_tokenize(text)
print("Sample word-tokenized sentence (index 100):")
print(word_tokenized_text[100])  # display sample sentence

Sample word-tokenized sentence (index 100):
['it', 'seems', 'to', 'be', 'the', 'one', 'thing', 'that', 'can', 'make', 'modern', 'life', 'mysterious', 'or', 'marvellous', 'to', 'us', '.']


#### Part-of-speech tagging for each tokenized sentence

In [13]:
pos_tagged_text = [pos_tag(sentence) for sentence in word_tokenized_text]

print("\nSample POS-tagged sentence (index 101):")
print(pos_tagged_text[101])


Sample POS-tagged sentence (index 101):
[('the', 'DT'), ('commonest', 'JJS'), ('thing', 'NN'), ('is', 'VBZ'), ('delightful', 'JJ'), ('if', 'IN'), ('one', 'CD'), ('only', 'RB'), ('hides', 'VBZ'), ('it', 'PRP'), ('.', '.')]


#### Define chunk grammars and create chunk parsers

In [14]:
# Noun Phrase (NP) Grammar: Optional determiner, any number of adjectives, followed by a noun.
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# Verb Phrase (VP) Grammar: A simple pattern. Optional determiner, any number of adjectives, and a noun, followed by a verb and an optional adverb.
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

# Create chunk parsers using RegexpParser
np_chunk_parser = RegexpParser(np_chunk_grammar)
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

####  Apply chunking on POS-tagged sentences

In [15]:
np_chunked_text = [np_chunk_parser.parse(sentence) for sentence in pos_tagged_text]
vp_chunked_text = [vp_chunk_parser.parse(sentence) for sentence in pos_tagged_text]

#### Get and display the most common NP and VP chunks

In [16]:
most_common_np_chunks = np_chunk_counter(np_chunked_text)
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)


print("\nMost common Noun Phrase (NP) chunks:")
for chunk, freq in most_common_np_chunks:
    print(f"{chunk}: {freq}")

print("\nMost common Verb Phrase (VP) chunks:")
for chunk, freq in most_common_vp_chunks:
    print(f"{chunk}: {freq}")


Most common Noun Phrase (NP) chunks:
(('i', 'NN'),): 965
(('henry', 'NN'),): 202
(('lord', 'NN'),): 197
(('life', 'NN'),): 171
(('harry', 'NN'),): 136
(('dorian', 'JJ'), ('gray', 'NN')): 127
(('something', 'NN'),): 126
(('nothing', 'NN'),): 93
(('basil', 'NN'),): 84
(('the', 'DT'), ('world', 'NN')): 70
(('everything', 'NN'),): 69
(('anything', 'NN'),): 68
(('hallward', 'NN'),): 68
(('the', 'DT'), ('man', 'NN')): 61
(('the', 'DT'), ('room', 'NN')): 60
(('face', 'NN'),): 57
(('the', 'DT'), ('door', 'NN')): 56
(('love', 'NN'),): 55
(('art', 'NN'),): 52
(('course', 'NN'),): 51
(('the', 'DT'), ('picture', 'NN')): 46
(('the', 'DT'), ('lad', 'NN')): 46
(('head', 'NN'),): 44
(('round', 'NN'),): 44
(('hand', 'NN'),): 44
(('the', 'DT'), ('table', 'NN')): 40
(('sibyl', 'NN'),): 40
(('the', 'DT'), ('painter', 'NN')): 38
(('sir', 'NN'),): 38
(('a', 'DT'), ('moment', 'NN')): 38

Most common Verb Phrase (VP) chunks:
(('i', 'NN'), ('am', 'VBP')): 101
(('i', 'NN'), ('was', 'VBD')): 40
(('i', 'NN'), ('

## **Results and Key Findings**

### **Most Common Noun Phrases (NPs)**
#### *The Iliad*
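- The most frequent noun phrases name heroes, gods, and places (`hector`, `jove`, `troy`, `greece`) and the vocabulary of battle and mortality (`war`, `death`, `fate`, `blood`, `the field`, `the plain`). The pronouns `i`, `thou`, and `thee` also appear in the list because the tagger labels them `NN`.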

#### *The Picture of Dorian Gray*
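- The most frequent noun phrases are character names (`henry`, `harry`, `basil`, `dorian gray`, `sibyl`), indefinite pronouns (`something`, `nothing`, `everything`, `anything`), and interior settings (`the room`, `the door`, `the table`), alongside abstract nouns such as `life`, `love`, and `art`. The first-person `i`, tagged `NN`, leads the list with 965 occurrences.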

### **Most Common Verb Phrases (VPs)**
#### *The Iliad*
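- Action-oriented phrases would fit the epic’s focus on combat and heroism, but the top-ranked chunks shown in the output are copular constructions (`'t is`, `i am`, `'t was`), each occurring fewer than 20 times.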

#### *The Picture of Dorian Gray*
- First-person constructions lead the list (`i am` with 101 occurrences, `i was` with 40), in keeping with the novel's introspective style and Wilde’s philosophical themes.
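
One way to make the comparison between the two texts concrete is to split the two top-30 lists into chunks they share and chunks specific to each. The sketch below assumes inputs in the `most_common` format returned by the counters; the function name `compare_top_chunks` is hypothetical, and the sample entries are a small excerpt copied from the NP outputs above.

```python
def compare_top_chunks(chunks_a, chunks_b):
    """Split two most_common-style lists into shared and text-specific chunks."""
    set_a = {chunk for chunk, _ in chunks_a}
    set_b = {chunk for chunk, _ in chunks_b}
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# A few (chunk, frequency) entries copied from the NP outputs above.
iliad_np = [((("hector", "NN"),), 323), ((("hand", "NN"),), 134), ((("head", "NN"),), 98)]
dorian_np = [((("henry", "NN"),), 202), ((("head", "NN"),), 44), ((("hand", "NN"),), 44)]

result = compare_top_chunks(iliad_np, dorian_np)
print(result["shared"])  # [(('hand', 'NN'),), (('head', 'NN'),)]
print(result["only_a"])  # [(('hector', 'NN'),)]
print(result["only_b"])  # [(('henry', 'NN'),)]
```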

---

## **Conclusion and Insights**
This analysis highlights key differences in linguistic patterns between *The Iliad* and *The Picture of Dorian Gray*:
1. **The Iliad** is structured around action-heavy, heroic storytelling with frequent use of battle-related noun and verb phrases.
2. **The Picture of Dorian Gray** employs a more philosophical and descriptive style, with an emphasis on abstract and aesthetic themes.
3. Chunking and phrase frequency analysis provide a structured way to compare different writing styles and themes in literary works.

By utilizing NLP techniques, we gain a deeper understanding of how language is used in different genres, offering insights into both storytelling structures and thematic elements.
