
# Introduction to Natural Language Processing with Python and NLTK

In this notebook, we will explore the basic capabilities of Python's **Natural Language Toolkit (NLTK)** package for text analysis and processing. This lab serves as both a refresher of Python programming concepts and an introduction to fundamental NLP operations.

Natural Language Processing (NLP) is a field that focuses on the interaction between computers and human language. It combines:
- Computer Science
- Artificial Intelligence
- Linguistics

Contents:
- Setting up NLTK
- Basic Text Processing
- Word Frequency Analysis
- Simple Text Statistics
- Working with the Gutenberg Corpus
- Practice Exercises

What you will learn:
- How to install and use NLTK
- Basic text processing techniques
- Working with text data in Python
- Basic text analysis and statistics

Prerequisites:
- Basic Python programming knowledge
- Familiarity with Python data structures (lists, dictionaries)
- Understanding of basic file operations in Python

Sources:
- [NLTK Book](http://www.nltk.org/book/)
- [NLTK Documentation](https://www.nltk.org/)
- [Python Documentation](https://docs.python.org/3/)

## Setting up NLTK

First, we need to install NLTK. If you're running this notebook in Colab, run the following cell:

In [1]:
!pip install nltk
import nltk



## Downloading NLTK Packages for This Lab

In this lab, we will utilize various Natural Language Toolkit (NLTK) resources to perform text processing and analysis. To ensure that we have all the necessary components, we need to download the following NLTK packages:

1. **punkt**: This package is used for tokenization, which allows us to break down text into individual words or sentences.
2. **averaged_perceptron_tagger**: This package provides a part-of-speech (POS) tagger, which helps us identify the grammatical roles of words in a sentence.
3. **gutenberg**: This package includes a collection of literary texts from Project Gutenberg, which we will use for our text analysis tasks.
4. **stopwords**: This package contains a list of common words (such as 'and', 'the', 'in') that are often filtered out in text processing because they do not add significant meaning.

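Sentence tokenization depends on the pre-trained Punkt models. Recent NLTK releases load them from a separate resource named **punkt_tab** (and the English tagger from **averaged_perceptron_tagger_eng**), so download those as well if a `LookupError` names them.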

To download these packages, you can execute the following commands in a code cell:

In [2]:
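# Download the necessary NLTK packages
nltk.download('punkt')
nltk.download('punkt_tab')  # sentence tokenizer data expected by newer NLTK releases
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # tagger data name used by newer NLTK releases
nltk.download('gutenberg')
nltk.download('stopwords')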

[nltk_data] Downloading package punkt to /home/wafaa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/wafaa/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package gutenberg to /home/wafaa/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to /home/wafaa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Basic Text Processing

Before starting the lab exercises, it's important to familiarize yourself with some key concepts in Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK). To help solidify your understanding, please answer the following questions.

---

### NLTK Basics
Consider the following sentence:

"Amir and Ahmed visited Tamanrasset last summer; they had so much fun. It was a memorable trip filled with adventure."

#### Word Tokenization

Word tokenization allows us to break down text into individual words, or tokens. This process is essential for text analysis, as it enables us to manipulate and analyze each component of the text separately.

- **Question**: How do we tokenize a text into words using NLTK?

  - **Hint**: Look into the `word_tokenize()` function from the `nltk.tokenize` module. This function is specifically designed to split a string into individual words based on whitespace and punctuation.

  - **Task**: Tokenize the provided sentence.  

#### Sentence Tokenization

Sentence tokenization, or sentence segmentation, is the process of dividing a text into its constituent sentences. This is crucial for understanding the structure of a text, as it allows us to analyze individual thoughts or ideas conveyed by each sentence.

- **Question**: How do we split a text into sentences using NLTK?

  - **Hint**: You can use the `sent_tokenize()` function from the `nltk.tokenize` module, which identifies sentence boundaries and separates the text accordingly.

  - **Task**: Apply sentence tokenization to the following text:  
    "Amir and Ahmed visited Tamanrasset last summer; they had so much fun. It was a memorable trip filled with adventure."

#### Part-of-Speech (POS) Tagging

Part-of-Speech (POS) tagging involves assigning grammatical categories to words in a sentence, such as nouns, verbs, and adjectives. This is essential for understanding the structure and meaning of sentences.

- **Question**: What are Part-of-Speech tags, and how can we tag words in a sentence?

  - **Hint**: NLTK has a function called `pos_tag()` that tags words with their respective POS.

  - **Task**: Identify the nouns, verbs, and adjectives using POS tagging in the provided sentence.

In [10]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag

# Your text for analysis
text = "Amir and Ahmed visited Tamanrasset last summer; they had so much fun. It was a memorable trip filled with adventure."

print("\n=== Basic Text Processing ===")

# 1. Sentence Tokenization
sentences = sent_tokenize(text)
print("\n1. Sentence Tokenization:")
print(sentences)

# 2. Word Tokenization
words = word_tokenize(text)
print("\n2. Word Tokenization:")
print(words)

# 3. POS Tagging
tagged_words = pos_tag(words)
print("\n3. POS Tagging:")
print(tagged_words)




=== Basic Text Processing ===


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/wafaa/nltk_data'
    - '/home/wafaa/anaconda3/envs/env_ds/nltk_data'
    - '/home/wafaa/anaconda3/envs/env_ds/share/nltk_data'
    - '/home/wafaa/anaconda3/envs/env_ds/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [9]:
help(word_tokenize)

Help on function word_tokenize in module nltk.tokenize:

word_tokenize(text, language='english', preserve_line=False)
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    
    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
    :type preserve_line: bool
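
The `preserve_line` flag in the help text above controls whether `word_tokenize()` splits the text into sentences before splitting it into words. The following is a minimal sketch of that flag, assuming NLTK is installed. With `preserve_line=True` the sentence-splitting step is skipped, so this particular call should not need the Punkt resource; as far as we understand the Treebank-style rules, only the period at the very end of the string is then separated into its own token.

```python
from nltk.tokenize import word_tokenize

text = ("Amir and Ahmed visited Tamanrasset last summer; they had so much fun. "
        "It was a memorable trip filled with adventure.")

# preserve_line=True skips sentence splitting and tokenizes the string as one line
tokens = word_tokenize(text, preserve_line=True)
print(tokens)
```

With the default `preserve_line=False`, `word_tokenize()` calls `sent_tokenize()` first, which is the step that needs the Punkt sentence models to be downloaded.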



## Python and NLTK Exercises

In the following exercises, we will use the story **Alice in Wonderland** from the **Gutenberg** corpus.

## Exercise 1: Book Word Analysis

In this exercise, we will analyze the story to extract insights about word usage. We will focus on identifying frequently occurring words, specifically filtering by word length and grammatical categories.

### Objectives

1. **Identify Frequent Words**: Our first task is to write a program that analyzes the text to find the 10 most frequently occurring words that are longer than 5 characters (don't forget to remove the stop words).
   - **Hint**: You may find Python's `collections.defaultdict` and NLTK's `FreqDist` useful for counting word occurrences efficiently.

2. **Analyze Grammatical Categories**: Next, we will extend our analysis to focus on the grammatical categories of words. Specifically, we will identify the 10 most frequently occurring nouns, verbs, and adjectives in the text.
   - This will help us understand the text's focus based on the types of words used.

3. **Visualization**: Finally, we will visualize the top 10 most frequent nouns, verbs, and adjectives using a plot.
   - Visualizing this data will provide a clear representation of the word distribution and help in interpreting the results.
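
The first objective can be sketched with the standard library alone. This is one possible approach, not the reference solution: `sample_text`, `sample_stop_words`, and `top_long_words` are hypothetical names, the regex is a crude stand-in for `word_tokenize()`, and the tiny stop-word set stands in for `stopwords.words('english')`.

```python
import re
from collections import Counter

# Toy sample and a tiny hand-picked stop-word list (both hypothetical stand-ins)
sample_text = (
    "Alice thought the Rabbit looked curious. The curious Rabbit hurried "
    "through the garden, and Alice followed the Rabbit through the garden."
)
sample_stop_words = {"the", "and", "through"}

def top_long_words(text, stop_words, min_length=6, n=10):
    """Return the n most common words with at least min_length characters."""
    words = re.findall(r"[a-z]+", text.lower())  # crude regex tokenizer
    kept = [w for w in words if len(w) >= min_length and w not in stop_words]
    return Counter(kept).most_common(n)

top_words = top_long_words(sample_text, sample_stop_words)
print(top_words)
```

In the notebook, the same function could be called with `alice_text` and the `stop_words` set loaded in the cell below. "Longer than 5 characters" is implemented here as a minimum length of 6.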

In [7]:
import nltk
import random
from nltk.tokenize import word_tokenize
from nltk import FreqDist, pos_tag
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
from nltk.corpus import gutenberg, stopwords

# Load the "Alice in Wonderland" text from NLTK's Gutenberg corpus
alice_text = gutenberg.raw('carroll-alice.txt')
stop_words = set(stopwords.words('english'))
print("Total characters:", len(alice_text))
print("\nFirst 500 characters:\n", alice_text[:500])


def analyze_book_words():
    """
    Analyzes word frequency and grammatical categories in Alice in Wonderland
    """
    print("\n=== Exercise 1: Book Word Analysis ===")


analyze_book_words()

Total characters: 144395

First 500 characters:
 [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an

=== Exercise 1: Book Word Analysis ===


## Exercise 2: Text Statistics

Write a program that tracks how often the following main characters appear together within the same paragraph: Alice, Queen, King, Rabbit, Hatter, Duchess.

Output: Display character pairs and the frequency of their co-occurrence.
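
One way to approach this, shown as a sketch on a toy text rather than the full book: treat blank lines as paragraph boundaries, find which character names occur in each paragraph, and count every unordered pair once per paragraph. The sample text and the helper name `count_cooccurrences` are assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

main_characters = {"Alice", "Queen", "King", "Rabbit", "Hatter", "Duchess"}

# Toy text with three paragraphs separated by blank lines (hypothetical sample)
sample_text = (
    "Alice saw the Rabbit run past.\n\n"
    "The Queen shouted at the King while Alice watched.\n\n"
    "The Rabbit bowed to the Queen."
)

def count_cooccurrences(text, characters):
    """Count character pairs that are mentioned in the same paragraph."""
    pair_counts = Counter()
    for paragraph in re.split(r"\n\s*\n", text):
        words = set(re.findall(r"[A-Za-z]+", paragraph))
        present = sorted(characters & words)
        # Each unordered pair counts once per paragraph
        pair_counts.update(combinations(present, 2))
    return pair_counts

pair_counts = count_cooccurrences(sample_text, main_characters)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Sorting the names before pairing them keeps `('Alice', 'Queen')` and `('Queen', 'Alice')` from being counted as different pairs.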

In [8]:
import nltk
import random
from nltk.tokenize import word_tokenize
from nltk import FreqDist, pos_tag
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
from nltk.corpus import gutenberg, stopwords

# Load the "Alice in Wonderland" text from NLTK's Gutenberg corpus
alice_text = gutenberg.raw('carroll-alice.txt')
stop_words = set(stopwords.words('english'))

def analyze_character_cooccurrence():
    """Count how often pairs of main characters appear in the same paragraph"""
    print("\n=== Exercise 2: Character Co-occurrence Analysis ===")

    main_characters = {'Alice', 'Queen', 'King', 'Rabbit', 'Hatter', 'Duchess'}


analyze_character_cooccurrence()


=== Exercise 2: Character Co-occurrence Analysis ===


## Exercise 3: Next Word Prediction and Sentence Generation

1. **Next Word Prediction**:
   - Develop a program that predicts the **next word** in a text given a specific word.
   
   Hint: Track the words that most commonly appear **immediately after** the given word in the text.
   
2. **Randomized Word Selection**:
   - Modify the word prediction process so that, for a given word, the next word is chosen **randomly from the top 5 most frequent** words that typically follow it in the text.

3. **Generate a Sentence**:
   - Starting with a given word, generate a sentence of 10 words by predicting the next word based on the previous word in the sentence. Randomly choose the next word from the top 5 most frequent followers.

4. **Optional Question: Consecutive Word Pair Prediction (Bigrams)**:
   - Extend the program to predict the most frequent **pair of consecutive words** (bigrams) instead of just a single word. Modify the sentence generation process to randomly choose the next **pair of words** from the top 5 most frequent consecutive word pairs that follow a given word in the text.

### Example Output:
- Given the word **"Alice"**, you might generate sentences like:
  - "Alice was beginning to feel very tired of sitting by the"
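
The first three steps can be sketched as follows. This is an illustration on a toy token list, not the expected solution: `sample_tokens`, `build_followers`, and `generate` are hypothetical names, and a `Counter` per word is used here in place of the list-based `defaultdict(list)` from the template cell.

```python
import random
from collections import Counter, defaultdict

# Toy token list (hypothetical); in the notebook this would come from the tokenizer
sample_tokens = "alice was tired and alice was sleepy and alice went home".split()

def build_followers(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    followers = defaultdict(Counter)
    for current_word, next_word in zip(tokens, tokens[1:]):
        followers[current_word][next_word] += 1
    return followers

def generate(start_word, followers, length=10, top_k=5, rng=None):
    """Build up to `length` words, sampling each next word from the top_k followers."""
    rng = rng or random.Random()
    sentence = [start_word]
    while len(sentence) < length:
        counts = followers.get(sentence[-1], Counter())
        candidates = [word for word, _ in counts.most_common(top_k)]
        if not candidates:  # dead end: the last word never has a follower
            break
        sentence.append(rng.choice(candidates))
    return " ".join(sentence)

followers = build_followers(sample_tokens)
print(followers["alice"].most_common(1))  # most likely next word after "alice"
print(generate("alice", followers, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing results between students. The generated sentence can be shorter than `length` when the chain reaches a word that has no recorded follower.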



In [10]:
import nltk
import random
from nltk.tokenize import word_tokenize
from nltk import FreqDist, pos_tag
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
from nltk.corpus import gutenberg, stopwords

# Load the "Alice in Wonderland" text from NLTK's Gutenberg corpus
alice_text = gutenberg.raw('carroll-alice.txt')
stop_words = set(stopwords.words('english'))

# Tokenize the text and keep all words, including stop words
def preprocess_text(text):
    # Tokenize the text
    tokens = []
    return tokens

tokens = preprocess_text(alice_text)

# Function to track the words that come immediately after a given word
def analyze_next_word(tokens):
    next_word_dict = defaultdict(list)
    # Loop through tokens and collect the next words
    return next_word_dict

# Function to track consecutive word pairs (bigrams) after a given word
def analyze_next_bigram(tokens):
    next_bigram_dict = defaultdict(list)
    # Loop through tokens and collect the next bigram (pair of words)
    return next_bigram_dict

# Sentence generation using predicted next words
def generate_sentence(start_word, next_word_dict, sentence_length=10):
    sentence = [start_word]
    current_word = start_word

    # Generate the rest of the sentence by predicting the next word

    return ' '.join(sentence)

# Sentence generation using predicted next word pairs (bigrams)
def generate_sentence_bigram(start_word, next_bigram_dict, sentence_length=10):
    sentence = [start_word]
    current_word = start_word

    # Generate the rest of the sentence by predicting the next bigram (pair of words)
    return ' '.join(sentence[:sentence_length])

# Example usage
start_word = 'alice'

# Step 1: Analyze the next word for each word in the text
next_word_dict = analyze_next_word(tokens)

# Step 2: Analyze the next bigram (pair of consecutive words) for each word
next_bigram_dict = analyze_next_bigram(tokens)

# Step 3: Generate a sentence using next word prediction
print("Generated Sentence (Next Word Prediction):")
print(generate_sentence(start_word, next_word_dict))

# Step 4: Generate a sentence using next bigram prediction
print("\nGenerated Sentence (Bigram Prediction):")
print(generate_sentence_bigram(start_word, next_bigram_dict))


Generated Sentence (Next Word Prediction):
alice

Generated Sentence (Bigram Prediction):
alice


## Exercise 4: Chapter Vocabulary Profiler

In this exercise, you will create a program that evaluates the vocabulary complexity of each chapter in the story. This analysis will provide insights into the linguistic richness of the chapters by calculating various metrics.

### Objectives

1. **Average Word Length**:
   - Calculate the average length of words in each chapter. This metric helps determine the complexity of the vocabulary used.

2. **Number of Unique Words**:
   - Count the number of unique words in each chapter. This indicates the diversity of the vocabulary.

3. **Frequency of Complex Words**:
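   - Identify and count the frequency of complex words, defined here as words longer than a fixed threshold (for example, more than 6 letters). Pick one threshold and use it for every chapter so that the counts are comparable. This metric highlights the use of advanced vocabulary.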

4. **Merging Metrics**:
   - Merge the three metrics—average word length, number of unique words, and frequency of complex words—into a single vocabulary score using the following formula:

   Vocabulary Score = (0.4 * Average Word Length) + (0.4 * Number of Unique Words) + (0.2 * Number of Complex Words)

   Apply this formula to generate a combined score for each chapter, allowing for easier comparison between them.
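
The formula above can be applied per chapter as in the sketch below. It runs on a toy two-chapter text and makes several assumptions: chapters are found by splitting on heading lines of the form `CHAPTER I.`, words are extracted with a simple regex instead of `word_tokenize()`, and "complex" means more than 6 letters. The names `sample_text` and `chapter_scores` are hypothetical.

```python
import re

# Toy two-chapter text (hypothetical); headings mimic the "CHAPTER I." style
sample_text = (
    "CHAPTER I. Down the Hole\n"
    "Alice was tired. Alice was sleepy.\n"
    "CHAPTER II. The Pool\n"
    "Curiouser adventures bewildered everybody considerably.\n"
)

def chapter_scores(text, complex_length=6):
    """Return one dict of vocabulary metrics per chapter."""
    # Split on heading lines such as "CHAPTER I. ..." and drop the text before the first one
    bodies = re.split(r"(?m)^CHAPTER [IVXLC]+\..*$", text)[1:]
    results = []
    for body in bodies:
        words = re.findall(r"[a-z]+", body.lower())
        if not words:  # skip empty chapters to avoid dividing by zero
            continue
        avg_len = sum(map(len, words)) / len(words)
        unique = len(set(words))
        complex_count = sum(1 for w in words if len(w) > complex_length)
        score = 0.4 * avg_len + 0.4 * unique + 0.2 * complex_count
        results.append({"avg_len": avg_len, "unique": unique,
                        "complex": complex_count, "score": score})
    return results

scores = chapter_scores(sample_text)
for number, metrics in enumerate(scores, start=1):
    print(number, metrics)
```

On real chapters the unique-word count is typically in the hundreds while the average word length is a single-digit number, so the unweighted count tends to dominate the score. Rescaling each metric to a common range before weighting is one option to consider.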



In [11]:
import nltk
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt

# Load the "Alice in Wonderland" text from NLTK's Gutenberg corpus
from nltk.corpus import gutenberg
alice_text = gutenberg.raw('carroll-alice.txt')

def analyze_chapter_vocabulary():
    """Analyze vocabulary complexity of each chapter and sort them by complexity"""
    print("\n=== Exercise 4: Chapter Vocabulary Analysis ===")


# Call the function
analyze_chapter_vocabulary()



=== Exercise 4: Chapter Vocabulary Analysis ===


## Exercise 5: Character Action Tracker

In this exercise, you will develop a program to identify and track the actions (verbs) most commonly associated with the main characters in the story. This analysis will help you understand the behaviors and activities of the characters throughout the narrative.

### Main Characters

For this analysis, we will focus on the following main characters: **Alice, Queen, King, Rabbit, Hatter, Duchess**.

### Objectives

1. **Identify Actions**:
   - Analyze the text to extract **verbs** associated with each main character. This involves tracking the occurrences of verbs that appear within a short context around the character's name.
   - **Hint**: Focus on identifying verbs that occur **within the next 4 words** after the character's name appears in a sentence.

2. **Count and Rank**:
   - Count how many times each verb is associated with each character and rank them to find the most frequent actions.

3. **Output Results**:
   - For each main character, print a list of the top verbs associated with them, along with their frequencies.
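
The windowing logic from the hint can be sketched independently of the tagger. The example below assumes the sentences have already been tagged and uses hand-tagged toy data shaped like the output of `pos_tag()`; `tagged_sentences` and `collect_actions` are hypothetical names.

```python
from collections import Counter, defaultdict

main_characters = {"Alice", "Queen", "King", "Rabbit", "Hatter", "Duchess"}

# Hand-tagged toy sentences (hypothetical), shaped like the output of pos_tag()
tagged_sentences = [
    [("Alice", "NNP"), ("quickly", "RB"), ("ran", "VBD"), ("away", "RB")],
    [("The", "DT"), ("Queen", "NNP"), ("shouted", "VBD"), ("and", "CC"),
     ("Alice", "NNP"), ("ran", "VBD")],
]

def collect_actions(sentences, characters, window=4):
    """Count verbs found within `window` tokens after each character name."""
    actions = defaultdict(Counter)
    for sentence in sentences:
        for i, (word, _) in enumerate(sentence):
            if word in characters:
                for next_word, tag in sentence[i + 1:i + 1 + window]:
                    if tag.startswith("VB"):  # Penn Treebank verb tags
                        actions[word][next_word.lower()] += 1
    return actions

actions = collect_actions(tagged_sentences, main_characters)
for name, verbs in actions.items():
    print(name, verbs.most_common(3))
```

Note that in the second toy sentence the window after "Queen" also reaches "ran", which belongs to Alice. A fixed window is a rough heuristic, so some verbs will be attributed to the wrong character.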


In [12]:
import nltk
import random
from nltk.tokenize import word_tokenize
from nltk import FreqDist, pos_tag
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
from nltk.corpus import gutenberg, stopwords

# Load the "Alice in Wonderland" text from NLTK's Gutenberg corpus
alice_text = gutenberg.raw('carroll-alice.txt')
stop_words = set(stopwords.words('english'))

def track_character_actions():
    """Track character actions using POS tagging"""
    print("\n=== Exercise 5: Character Action Tracker ===")


track_character_actions()


=== Exercise 5: Character Action Tracker ===


## Exercise 6: Sentence Type Classification

In this exercise, we will categorize sentences in the text based on their type, focusing on **Questions**, **Exclamations**, **Statements**, and **Imperatives**. By analyzing the structure of each sentence, we can gain a better understanding of how characters express themselves and how the narrative develops.

### Tasks

1. **Questions**:
   - Identify sentences that ask for information or clarification.
   - Specifically, we will focus on questions that start with one of the following words:
     - "Who"
     - "What"
     - "Why"
     - "Where"
   - Ensure that these sentences end with a question mark (`?`).

2. **Exclamations**:
   - Identify sentences that express strong emotions or surprise.
   - These sentences typically end with an exclamation mark (`!`).

3. **Statements**:
   - Identify sentences that declare facts or provide information.
   - These sentences usually start with a **noun** or **pronoun** (e.g., "Alice," "She," "The cat").
   - They end with a period (`.`).

4. **Imperatives**:
   - Identify sentences that give commands, requests, or instructions.
   - These sentences usually start with a **verb in its base form** (e.g., "Go," "Run," "Take").
   - They can end with a period (`.`) or an exclamation mark (`!`).


### Tasks to Complete:
- Extract and classify sentences from the text into the four categories mentioned above.
- Print up to 5 example sentences from each category to understand the different sentence structures and their usage in the story.
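
The four rules can be combined into a single classifier, sketched below under some assumptions: the POS tag of the first word is passed in (for example, taken from `pos_tag()`), imperatives are checked before exclamations because both may end with `!`, and the rules are applied exactly as written above. `classify_sentence` and the hand-tagged examples are hypothetical.

```python
QUESTION_WORDS = {"who", "what", "why", "where"}

def classify_sentence(sentence, first_tag):
    """Classify a sentence given the POS tag of its first word (e.g. from pos_tag())."""
    sentence = sentence.strip()
    if not sentence:
        return None
    first_word = sentence.split()[0].lower()
    if sentence.endswith("?") and first_word in QUESTION_WORDS:
        return "Questions"
    # Check imperatives before exclamations, since both may end with "!"
    if first_tag == "VB" and sentence.endswith((".", "!")):
        return "Imperatives"
    if sentence.endswith("!"):
        return "Exclamations"
    if sentence.endswith(".") and first_tag.startswith(("NN", "PRP")):
        return "Statements"
    return None  # sentence fits none of the four rules

# Hand-tagged examples (hypothetical): (sentence, tag of first word)
examples = [
    ("Who are you?", "WP"),
    ("Take off your hat.", "VB"),
    ("Oh dear!", "UH"),
    ("Alice was tired.", "NNP"),
    ("The cat grinned.", "DT"),
]
labels = [classify_sentence(s, t) for s, t in examples]
print(labels)
```

The last example is left unclassified because "The" is tagged as a determiner, not a noun or pronoun. If sentences such as "The cat ..." should count as statements, accepting the `DT` tag as a valid first tag is one way to extend the rule.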


In [13]:
import nltk
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize, word_tokenize


# Load the "Alice in Wonderland" text from NLTK's Gutenberg corpus
from nltk.corpus import gutenberg
alice_text = gutenberg.raw('carroll-alice.txt')

# Function to check if a word is a noun or pronoun
def is_noun_or_pronoun(word):
    """Check if a word is a noun or pronoun based on its part of speech tag."""
    return False


# Function to extract and categorize sentences
def categorize_sentences(text):
    """Categorize sentences into Questions, Exclamations, Statements, and Imperatives."""
    sentences = sent_tokenize(text)

    # Initialize categories
    categories = {'Questions': [], 'Exclamations': [], 'Statements': [], 'Imperatives': []}

    return categories

# Categorize the sentences in the text
categories = categorize_sentences(alice_text)

# Function to plot the frequencies of the categories
def plot_category_frequencies(categories):
    """Plot the frequencies of different sentence categories."""


# Plot the frequencies
plot_category_frequencies(categories)



## Exercise 7: Analysis of Violent Actions and Death Mentions

In this exercise, you will develop a program to identify and analyze sentences in the story that contain mentions of violent actions or death. This analysis aims to explore the themes of violence and mortality within the narrative.

### Tasks

1. **Identify Relevant Sentences**:
   - Write a program that scans the text for sentences containing specific keywords related to violent actions or death. Use the following keywords as indicators:
     - **Death-Related Words**:
       - "kill"
       - "murder"
       - "die"
       - "death"
     - **Violent Action Words**:
       - "attack"
       - "stab"
       - "shoot"
       - "injure"
       - "hurt"
       - "blood"
       - "assault"

2. **Categorize Instances**:
   - After identifying sentences containing the relevant keywords, categorize them into two distinct groups:
     - **Death Indicators**: Sentences containing any of the death-related words.
     - **Violent Actions**: Sentences containing any of the violent action words.

3. **Display Results**:
   - For each category, print up to 10 sentences that contain each keyword. This will provide insights into how often these themes appear in the text and in what context.
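
A minimal sketch of the keyword scan, using only the standard library on a toy text. The regex sentence splitter is a naive stand-in for `sent_tokenize()`, and `sample_text` and `find_keyword_sentences` are hypothetical names. Keywords are matched as whole lowercase words, so inflected forms such as "killed" are not caught by this version.

```python
import re

death_related_words = {"kill", "murder", "die", "death"}
violent_action_words = {"attack", "stab", "shoot", "injure", "hurt", "blood", "assault"}

# Toy text (hypothetical sample)
sample_text = (
    "The Queen wanted to kill the gardener. Nobody was hurt in the end. "
    "Alice walked home. It would hurt and might kill him!"
)

def find_keyword_sentences(text, keywords):
    """Map each keyword to the sentences that contain it as a whole word."""
    # Naive splitter: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    matches = {keyword: [] for keyword in keywords}
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for keyword in keywords & words:
            matches[keyword].append(sentence)
    return matches

death_matches = find_keyword_sentences(sample_text, death_related_words)
violence_matches = find_keyword_sentences(sample_text, violent_action_words)
for keyword, found in sorted(death_matches.items()):
    print(keyword, found[:10])
```

A sentence that contains words from both sets, like the last one in the toy text, appears in both categories. Matching on word stems instead of exact words would be one way to also catch forms like "died" or "killing".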


In [14]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import gutenberg

# Load Alice in Wonderland text from the NLTK Gutenberg corpus
alice_text = gutenberg.raw('carroll-alice.txt')

# Keywords related to death and violence
death_related_words = {"kill", "murder", "die", "death"}
violent_action_words = {"attack", "stab", "shoot", "injure", "hurt", "blood", "assault"}


