```{contents}
```
# Terminologies


* **Corpus** → A collection of texts (paragraphs, sentences, or whole documents). Example: one long paragraph can serve as a small corpus.

* **Document** → An individual text unit inside the corpus, such as a sentence, a paragraph, or a single review.

* **Words** → The individual elements (tokens) within a sentence.

* **Vocabulary** → The set of **unique words** in the corpus.

* **Tokenization**:

  * Breaking text into smaller units (**tokens**).
  * Levels of tokenization:

    1. **Paragraph → Sentences** (sentence tokenization).
    2. **Sentence → Words** (word tokenization).
  * Example:

    * Corpus: *“My name is Krish. I am also a YouTuber.”*
    * Sentence Tokens:

      1. “My name is Krish”
      2. “I am also a YouTuber”
    * Word Tokens: \[My, name, is, Krish, I, am, also, a, YouTuber].

* **Vocabulary Example**:

  * Text: *“I like to drink apple juice. My friend likes mango juice.”*
  * Total words = 11, unique words = 10 when “like” and “likes” are counted separately (9 if both are reduced to the same base form).
  * Vocabulary = {I, like, to, drink, apple, juice, my, friend, likes, mango}.

* **Importance** → Tokenization is a **key step in text preprocessing** for NLP tasks: models operate on numerical representations (vectors), and tokens are the units that get encoded.
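
The counts in the vocabulary example can be checked with a few lines of plain Python. This is a minimal sketch that assumes a word is simply a run of letters and that text is lowercased first; both choices are assumptions made for illustration, not part of the original example.

```python
import re

text = "I like to drink apple juice. My friend likes mango juice."

# Assumption: a word is a run of letters; lowercase so "My" and "my" match.
tokens = re.findall(r"[A-Za-z]+", text.lower())
vocabulary = set(tokens)

print(len(tokens))         # total words
print(len(vocabulary))     # unique words, with "like" and "likes" kept apart
print(sorted(vocabulary))
```

Only "juice" repeats, so the vocabulary is one entry smaller than the token list.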

---

## Elaboration & Deeper Insights

1. **Why Corpus, Document, Vocabulary Matter?**

   * Corpus = dataset you are working with (like raw text).
   * Document = training instance (like one review in sentiment analysis).
   * Vocabulary = the dictionary of your text world → forms the **basis for encoding** words into vectors.
   * Example: In text classification, your vocabulary determines the **feature space**.
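
To make the link between vocabulary and feature space concrete, the sketch below builds a vocabulary from two toy documents and encodes a new document as integer ids. The documents, the `word_to_id` mapping, and the `encode` helper are hypothetical names chosen for illustration, and whitespace splitting stands in for a real tokenizer.

```python
documents = [
    "i like apple juice",
    "my friend likes mango juice",
]

# Vocabulary: every unique word across all documents, sorted for stable ids.
vocabulary = sorted({word for doc in documents for word in doc.split()})
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}

def encode(doc):
    """Map each known word to its integer id; skip words outside the vocabulary."""
    return [word_to_id[w] for w in doc.split() if w in word_to_id]

print(len(vocabulary))                 # size of the feature space
print(encode("i like mango juice"))
```

Each vocabulary entry becomes one dimension, so a larger vocabulary means a wider feature vector.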

---

2. **Types of Tokenization**

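   * **Sentence Tokenization** → Splits text at sentence boundaries, usually marked by punctuation (., !, ?). Abbreviations such as “Dr.” make naive splitting unreliable.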

     * Useful in summarization, translation.
   * **Word Tokenization** → Splits on whitespace and punctuation.

     * Useful in bag-of-words, embeddings.
   * **Subword Tokenization (modern NLP)** → Breaks words into smaller chunks (e.g., “playing” → “play” + “ing”).

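     * Used by models such as **BERT** (WordPiece) and **GPT** (byte-pair encoding), available through libraries such as Hugging Face Transformers, to handle unknown words and keep the vocabulary size manageable.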
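
The three levels can be sketched with the standard library alone. The regular expressions below are simplifying assumptions (they ignore abbreviations, for instance), and `subword_tokenize` with its `toy_vocab` is a hypothetical greedy longest-match splitter: real subword vocabularies are learned from data, so treat this as an illustration of the idea rather than how BERT or GPT tokenizers are actually built.

```python
import re

text = "My name is Krish. I am also a YouTuber!"

# Sentence level: split after ., ! or ? when followed by whitespace.
# Assumption: no abbreviations such as "Dr." that would cause false splits.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word level: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

def subword_tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy subword vocabulary, for illustration only.
toy_vocab = {"play", "ing", "ed", "s", "jump"}

print(sentences)
print(words)
print(subword_tokenize("playing", toy_vocab))
print(subword_tokenize("jumped", toy_vocab))
```

Because unknown words fall back to smaller known pieces, a small subword vocabulary can still cover words it never saw whole.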

---

3. **Practical Importance of Tokenization**

   * Models such as Naive Bayes, SVMs, and neural networks require text to be converted into numbers.
   * Tokenization provides the **units for encoding** into:

     * **Count Vectors (Bag of Words)**
     * **TF-IDF**
     * **Word Embeddings (Word2Vec, GloVe)**
     * **Transformer embeddings (BERT, GPT)**
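
As a rough illustration of the first two encodings, the sketch below computes count vectors and TF-IDF weights by hand for two pre-tokenized toy documents. It uses the textbook form `tf * log(N / df)`; libraries typically apply smoothed or normalized variants, so the exact numbers here are an assumption of this sketch and will differ from library output.

```python
import math
from collections import Counter

# Two toy documents, already tokenized.
documents = [
    ["i", "like", "apple", "juice"],
    ["my", "friend", "likes", "mango", "juice"],
]

vocabulary = sorted({token for doc in documents for token in doc})

def count_vector(tokens):
    """Bag of words: how often each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def tfidf_vector(tokens):
    """Term frequency times log inverse document frequency (unsmoothed)."""
    counts = Counter(tokens)
    n_docs = len(documents)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)
        df = sum(1 for doc in documents if word in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector

print(vocabulary)
print(count_vector(documents[0]))
print(tfidf_vector(documents[0]))
```

The word "juice" appears in both documents, so its TF-IDF weight is zero: it does not help tell the documents apart.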

---

4. **Challenges in Tokenization**

   * **Ambiguity**: “New York” is a single unit of meaning, but whitespace splitting produces two tokens.
   * **Languages**: In Chinese/Japanese, words are not separated by spaces.
   * **Morphology**: “like” vs “likes” vs “liked” → same base meaning but different tokens.
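   * Modern approaches mitigate these issues: **lemmatization** maps inflected forms to a base form, and **subword tokenization** helps with unseen words and with languages that lack explicit word boundaries.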
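
The first two challenges are easy to reproduce. In the sketch below, `merge_multiword` and its `phrases` set are hypothetical: the helper only handles two-word phrases, and in practice the phrase list might come from a gazetteer or from collocation statistics.

```python
def merge_multiword(tokens, phrases):
    """Join adjacent token pairs that form a known multiword expression."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrase list, for illustration only.
phrases = {("New", "York")}

tokens = "I moved to New York last year".split()
print(tokens)                             # "New" and "York" are separate tokens
print(merge_multiword(tokens, phrases))   # "New York" becomes one token

# Whitespace splitting finds no word boundaries in text written without spaces.
print("我喜欢苹果汁".split())
```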

---

**Takeaway**

* **Corpus → Documents → Sentences → Words → Vocabulary** forms the basic hierarchy of NLP.
* **Tokenization** is the *gateway* to all NLP tasks: without breaking text into structured units, no ML/DL model can process it.
* Future steps after tokenization include: **normalization, stopword removal, stemming, lemmatization, embeddings**.
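
Two of those follow-up steps, normalization and stopword removal, can be sketched as below. The `STOPWORDS` set is a deliberately tiny, hypothetical list and `normalize` is an illustrative helper; real stopword lists are much longer and depend on the task.

```python
import string

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"i", "my", "is", "am", "a", "to", "also"}
PUNCTUATION = set(string.punctuation)

def normalize(tokens):
    """Lowercase tokens, drop punctuation marks, then drop stopwords."""
    lowered = [t.lower() for t in tokens]
    no_punct = [t for t in lowered if t not in PUNCTUATION]
    return [t for t in no_punct if t not in STOPWORDS]

tokens = ["My", "name", "is", "Krish", ".", "I", "like", "apple", "juice", "."]
print(normalize(tokens))
```

Without this step, "My" and "my" count as different vocabulary entries and "." counts as a word.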


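```python
# Demonstration of NLP basics: Corpus, Documents, Vocabulary, Tokenization

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter

# Download the Punkt tokenizer data used by sent_tokenize and word_tokenize.
# Newer NLTK releases may look for "punkt_tab" instead; download that resource
# if a LookupError asks for it.
nltk.download("punkt")

# Example corpus
corpus = "My name is Krish. I am also a YouTuber. I like to drink apple juice. My friend likes mango juice."

# Sentence tokenization: corpus -> list of sentences
sentences = sent_tokenize(corpus)

# Word tokenization: corpus -> list of tokens (punctuation marks become tokens too)
words = word_tokenize(corpus)

# Vocabulary: the set of unique tokens (case-sensitive, includes ".")
vocabulary = set(words)

# Token frequency across the corpus
word_freq = Counter(words)

# Display all four results (shown as the cell output in a notebook)
sentences, words, vocabulary, word_freq.most_common(10)
```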



```
(['My name is Krish.',
  'I am also a YouTuber.',
  'I like to drink apple juice.',
  'My friend likes mango juice.'],
 ['My',
  'name',
  'is',
  'Krish',
  '.',
  'I',
  'am',
  'also',
  'a',
  'YouTuber',
  '.',
  'I',
  'like',
  'to',
  'drink',
  'apple',
  'juice',
  '.',
  'My',
  'friend',
  'likes',
  'mango',
  'juice',
  '.'],
 {'.',
  'I',
  'Krish',
  'My',
  'YouTuber',
  'a',
  'also',
  'am',
  'apple',
  'drink',
  'friend',
  'is',
  'juice',
  'like',
  'likes',
  'mango',
  'name',
  'to'},
 [('.', 4),
  ('My', 2),
  ('I', 2),
  ('juice', 2),
  ('name', 1),
  ('is', 1),
  ('Krish', 1),
  ('am', 1),
  ('also', 1),
  ('a', 1)])
```