In [15]:
import nltk
from nltk.tokenize import sent_tokenize
# Punkt sentence-tokenizer data; newer NLTK releases look for punkt_tab
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/jason/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/jason/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

# Understanding a Corpus in Natural Language Processing

## What is a Corpus?

In Natural Language Processing (NLP), a **corpus** (plural: **corpora**) is a large, structured collection of texts used for linguistic research, language analysis, or machine learning. A corpus can include written or spoken material such as books, articles, transcripts, and web content.

### Purpose of a Corpus

- **Linguistic Analysis**: A corpus provides data that researchers can analyze to understand language patterns, usage, and structure.
- **Model Training**: In machine learning, a corpus serves as training data for algorithms, helping them learn to recognize patterns, generate text, or make predictions.
- **Language Resource**: It can be used to create dictionaries, thesauri, and other language resources by providing examples of word usage in context.
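
To make "examples of word usage in context" concrete, here is a minimal keyword-in-context sketch in plain Python. The function name `keyword_in_context`, the `window` parameter, and the two-sentence `sample_corpus` are hypothetical and only illustrate the idea. It uses a simple regular expression rather than an NLTK tokenizer.

```python
import re

def keyword_in_context(text, keyword, window=3):
    """Return each occurrence of `keyword` with up to `window` words on either side."""
    words = re.findall(r"\w+", text)  # crude word split, punctuation dropped
    hits = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append(" ".join(left + [word] + right))
    return hits

# Hypothetical two-sentence corpus where "bank" has two different senses
sample_corpus = (
    "The bank approved the loan. "
    "We walked along the river bank at dusk."
)
contexts = keyword_in_context(sample_corpus, "bank", window=2)
print(contexts)
```

Each hit shows the neighbouring words that a dictionary writer or a model could use to tell the two senses of "bank" apart.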

### Example of a Corpus

Real corpora are far larger, but the following two-sentence string serves as a toy corpus for the rest of this notebook:



In [20]:
corpus = "Hello my name is Jason Tagud and I am a 2nd year university student at Ulster University. I love studying A.I"

# Tokenization in Natural Language Processing (NLP)

Tokenization is the process of splitting text into smaller units called tokens, which can be words, sentences, or characters. It is usually the first step in an NLP pipeline, because it turns raw text into units that later steps can count, tag, or feed to a model.

## Types of Tokenization

1. **Word Tokenization**: Splits text into individual words.
   - Example: "Hello, my name." → `["Hello", "my", "name"]` (punctuation dropped for simplicity; NLTK's `word_tokenize` keeps `,` and `.` as separate tokens)

2. **Sentence Tokenization**: Splits text into sentences.
   - Example: "Hello. I am a student." → `["Hello.", "I am a student."]`

3. **Character Tokenization**: Divides text into individual characters.
   - Example: "Hello" → `["H", "e", "l", "l", "o"]`
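
The three types above can be sketched with the standard library alone. This is a simplified illustration, not how NLTK does it: the regular expressions are assumptions chosen to reproduce the examples in the list, and they would mishandle abbreviations and contractions.

```python
import re

# Word tokenization: keep runs of word characters, drop punctuation
word_tokens = re.findall(r"\w+", "Hello, my name.")

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", "Hello. I am a student.")

# Character tokenization: every character becomes a token
char_tokens = list("Hello")

print(word_tokens)
print(sentence_tokens)
print(char_tokens)
```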

## Importance

- Prepares raw text for analysis.
- Enables feature extraction for machine learning.
- Gives downstream tasks such as sentiment analysis and text classification consistent units to work with.
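
As a rough illustration of feature extraction, the sketch below turns tokens into bag-of-words count vectors. The two documents in `docs` are made up for the example, and the regex word split stands in for a real tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini-collection of two documents
docs = ["I love NLP", "I love studying NLP and I love Python"]

# Lowercase and split each document into word tokens
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all distinct tokens
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One count vector per document, aligned with the vocabulary order
vectors = []
for toks in tokenized:
    counts = Counter(toks)
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
print(vectors)
```

Each vector has one entry per vocabulary term, which is the kind of numeric input a classifier can consume.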

## Example with NLTK

In [22]:
# Split the corpus into sentences with NLTK's pretrained Punkt model
documents = sent_tokenize(corpus)
print(documents)

['Hello my name is Jason Tagud and I am a 2nd year university student at Ulster University.', 'I love studying A.I']


In [23]:
for sentence in documents:
    print(sentence)

Hello my name is Jason Tagud and I am a 2nd year university student at Ulster University.
I love studying A.I


# Word Tokenization with the corpus

In [26]:
from nltk.tokenize import word_tokenize
word_tokenize(corpus)

['Hello',
 'my',
 'name',
 'is',
 'Jason',
 'Tagud',
 'and',
 'I',
 'am',
 'a',
 '2nd',
 'year',
 'university',
 'student',
 'at',
 'Ulster',
 'University',
 '.',
 'I',
 'love',
 'studying',
 'A.I']

In [27]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'my', 'name', 'is', 'Jason', 'Tagud', 'and', 'I', 'am', 'a', '2nd', 'year', 'university', 'student', 'at', 'Ulster', 'University', '.']
['I', 'love', 'studying', 'A.I']


# WordPunctTokenizer in NLTK

`WordPunctTokenizer` is an NLTK tokenizer that splits text into runs of alphanumeric characters and runs of punctuation, so punctuation never stays attached to a word. The `wordpunct_tokenize` function used in the code cell is a convenience wrapper around it.

## Key Features

- **Separates Words and Punctuation**: Words and punctuation marks always end up in separate tokens.
- **Splits on Every Punctuation Mark**: Commas, periods, exclamation marks, question marks, and apostrophes all trigger a split, even inside a word, so `A.I` becomes `A`, `.`, `I`.
- **Useful for Preprocessing**: A good fit when punctuation must be clearly separated from words before analysis or machine learning.

## Example Usage

Here's how to apply it to each sentence through NLTK's `wordpunct_tokenize` function:

In [28]:
from nltk.tokenize import wordpunct_tokenize
for sentence in documents:
    print(wordpunct_tokenize(sentence))

['Hello', 'my', 'name', 'is', 'Jason', 'Tagud', 'and', 'I', 'am', 'a', '2nd', 'year', 'university', 'student', 'at', 'Ulster', 'University', '.']
['I', 'love', 'studying', 'A', '.', 'I']
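
The output above can be reproduced without NLTK. As far as the NLTK documentation describes it, `WordPunctTokenizer` is based on the regular expression `\w+|[^\w\s]+`. The helper name `regex_wordpunct` below is hypothetical; treat this as a sketch of the idea rather than the library's exact implementation.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def regex_wordpunct(text):
    """Split text into alternating word and punctuation tokens."""
    return re.findall(WORDPUNCT_PATTERN, text)

abbreviation_tokens = regex_wordpunct("I love studying A.I")
contraction_tokens = regex_wordpunct("Don't stop!")

print(abbreviation_tokens)
print(contraction_tokens)
```

Note that the apostrophe in "Don't" is split off as its own token, which is often not what you want for contractions.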


# TreebankWordTokenizer in NLTK

The `TreebankWordTokenizer` is a tokenizer in NLTK that closely follows the tokenization conventions used in the Penn Treebank corpus. It handles contractions and punctuation with hand-written rules and expects text that has already been split into sentences, which is why it is applied to each item in `documents` rather than to the whole corpus.

## Key Features

- **Handles Contractions**: Splits words like "don't" into `["do", "n't"]`.
- **Splits Punctuation Properly**: Separates punctuation such as parentheses and commas from adjacent words, while a period inside a token is kept, so `A.I` stays whole.
- **Preserves Linguistic Conventions**: Useful when working with corpora like Penn Treebank, which require specific tokenization rules.

## Example Usage


In [31]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
for sentence in documents:
    print(tokenizer.tokenize(sentence))

['Hello', 'my', 'name', 'is', 'Jason', 'Tagud', 'and', 'I', 'am', 'a', '2nd', 'year', 'university', 'student', 'at', 'Ulster', 'University', '.']
['I', 'love', 'studying', 'A.I']
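
The corpus used here contains no contractions, parentheses, or commas, so the output above does not show the features listed earlier. The short sketch below runs the same tokenizer on a made-up sentence that does contain them. The sentence and the variable names are illustrative only.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Hypothetical sentence with a contraction, parentheses, and a comma
treebank_tokens = tokenizer.tokenize("I don't like (most) commas, really.")
print(treebank_tokens)

# The contraction is split into a stem and the clitic "n't"
contraction_pair = treebank_tokens[1:3]
print(contraction_pair)
```

Unlike `wordpunct_tokenize`, which would produce `don`, `'`, `t`, the Treebank rules keep the negation as a single `n't` token.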
