# ML for NLP

## Text Processing

### Tokenization
Tokenization is the process of breaking down text into smaller units such as sentences, words, or subwords. These units are called tokens.

- **Corpus**: A collection of text data used for analysis and training machine learning models.
- **Documents**: Individual pieces of text within the corpus, such as articles, paragraphs, or files.
- **Vocabulary**: The set of unique tokens present in a corpus after tokenization.
- **Words**: The usual token unit in word-level tokenization, typically delimited by whitespace or punctuation. Subword tokenizers split words further into smaller pieces.
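
To make these terms concrete, here is a minimal sketch that tokenizes a tiny corpus and builds its vocabulary. The two documents, the `tokenize` helper, and the regular expression are illustrative assumptions, not a production tokenizer.

```python
import re
from typing import List

# Hypothetical corpus: each string is one document.
corpus = [
    "Tokenization splits text into tokens.",
    "Tokens can be words, subwords, or sentences.",
]

def tokenize(document: str) -> List[str]:
    """Lowercase a document and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", document.lower())

# One token list per document.
tokenized_docs = [tokenize(doc) for doc in corpus]

# The vocabulary is the set of unique tokens across the whole corpus.
vocabulary = sorted({token for doc in tokenized_docs for token in doc})
```

Lowercasing before building the vocabulary is a design choice: it merges "Tokens" and "tokens" into one entry, at the cost of losing case information.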

**Advantages**:
- Simplifies text into manageable units for analysis.
- Enables further preprocessing steps like stemming, lemmatization, and vectorization.
- Supports sentence, word, or subword-level processing for different use cases.

**Limitations**:
- Can split words incorrectly (e.g., hyphenated words or abbreviations).
- Does not consider the semantic context of tokens.
- Simple whitespace- or rule-based tokenizers handle languages with long compound words or agglutinative morphology (e.g., German, Turkish, Finnish) poorly.
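
The incorrect-splitting limitation can be sketched with two regular expressions. Both patterns and the sample sentence are assumptions chosen for illustration; real tokenizers use far more elaborate rules.

```python
import re

text = "The U.S. team built a state-of-the-art model."

# Naive pattern: every run of word characters or single punctuation mark is a token.
# This breaks "U.S." and "state-of-the-art" into fragments.
naive = re.findall(r"\w+|[^\w\s]", text)

# More careful pattern: try dotted abbreviations first, then words that may
# contain internal hyphens, then any remaining punctuation mark.
careful = re.findall(r"(?:[A-Za-z]\.){2,}|\w+(?:-\w+)*|[^\w\s]", text)
```

Here `naive` starts with `["The", "U", ".", "S", "."]`, while `careful` keeps `"U.S."` and `"state-of-the-art"` as single tokens.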

---

### Stemming
Stemming is the process of reducing words to a root or base form by stripping affixes with heuristic rules. Most stemmers remove suffixes, and the resulting stem does not have to be a valid word. For example:
- "running" → "run"
- "flies" → "fli"

Common algorithms include Porter Stemmer and Snowball Stemmer.
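
The rule-based idea behind stemming can be sketched with a toy suffix stripper. This is a deliberately simplified heuristic and not the Porter or Snowball algorithm; the rule list and the `simple_stem` name are assumptions made for illustration.

```python
def simple_stem(word: str) -> str:
    """Strip a few common English suffixes (toy heuristic, not Porter)."""
    word = word.lower()
    # (suffix, replacement) pairs, checked in order; longer suffixes first.
    rules = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
    for suffix, replacement in rules:
        # Require at least two characters before the suffix so short
        # words such as "is" or "sing" are left unchanged.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: len(word) - len(suffix)] + replacement
            # Undo consonant doubling after "-ing"/"-ed": "runn" -> "run".
            if (
                suffix in ("ing", "ed")
                and len(stem) >= 3
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeiouls"
            ):
                stem = stem[:-1]
            return stem
    return word

stems = {w: simple_stem(w) for w in ["running", "flies", "cats", "jumped"]}
```

With these rules, "running" becomes "run" and "flies" becomes "fli", matching the examples above. For real work, a tested implementation such as the Porter or Snowball stemmer from an NLP library is the safer choice.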

**Advantages**:
- Reduces dimensionality by grouping similar words.
- Computationally faster than lemmatization.

**Limitations**:
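- May produce stems that are not valid words (e.g., "flies" → "fli").
- Ignores part of speech and context, so words with different meanings can be reduced to the same stem.
- Over-stemming (merging unrelated words, e.g., "universe" and "university" → "univers" with the Porter stemmer) or under-stemming (failing to merge related forms) can occur, leading to inaccurate grouping.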

---

### Lemmatization
Lemmatization also reduces words to their base or dictionary form (the lemma), but it takes the word's part of speech and meaning into account. For example:
- "running" → "run"
- "better" → "good" (when used as an adjective)

Unlike stemming, lemmatization relies on a vocabulary and morphological analysis, so its output is a dictionary form rather than a truncated string.
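
A minimal sketch of dictionary-based lemmatization follows. The `LEMMA_TABLE` lexicon, the tag names, and the `lemmatize` helper are hypothetical; real lemmatizers use large lexical resources and morphological rules.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("flies", "VERB"): "fly",
    ("flies", "NOUN"): "fly",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word with a given part of speech.

    Falls back to the lowercased word when the pair is not in the table.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

# The same surface form can map to different lemmas depending on its tag.
as_verb = lemmatize("meeting", "VERB")   # "meet"
as_noun = lemmatize("meeting", "NOUN")   # "meeting"
```

Keying the table on both the word and its part of speech is what separates this from stemming: "better" maps to "good" only when it is tagged as an adjective.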

**Advantages**:
- Produces meaningful and context-aware base forms.
- Improves accuracy in downstream tasks like sentiment analysis or machine translation.

**Limitations**:
- Slower than stemming because it relies on dictionary lookups and context analysis.
- Requires language-specific resources such as dictionaries and part-of-speech tagging.

---

### Stop Words
Stop words are very frequent words that are often removed during text preprocessing because they carry little topical information on their own. Examples include:
- "the"
- "is"
- "and"

**Advantages**:
- Reduces noise in the data.
- Improves model efficiency by focusing on meaningful words.
- Decreases vocabulary size, which can speed up processing.

**Limitations**:
- Removing stop words may discard contextually important words, depending on the task (e.g., in sentiment analysis, words like "not" are critical).
- Requires language-specific stop word lists.
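
Stop word filtering, and the negation caveat above, can be sketched as follows. The five-word `STOP_WORDS` set and the `remove_stop_words` helper are assumptions for illustration; real lists are much longer and language-specific.

```python
# Small hypothetical stop word list; it deliberately includes "not".
STOP_WORDS = {"the", "is", "and", "a", "not"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "The movie is not good".split()

# Default list: the negation disappears and the sentence looks positive.
filtered = remove_stop_words(tokens)

# Task-specific list: keep "not" because it matters for sentiment analysis.
kept_negation = remove_stop_words(tokens, STOP_WORDS - {"not"})
```

`filtered` is `["movie", "good"]`, while `kept_negation` is `["movie", "not", "good"]`, which shows why the stop word list should be chosen with the downstream task in mind.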

---

### Part of Speech (PoS) Tagging
PoS tagging assigns a part of speech (e.g., noun, verb, adjective) to each token in the text. It is used to understand the grammatical structure and context of the text. Example:
- "She is running." → 
  - She (pronoun)
  - is (verb)
  - running (verb)
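
The example above can be reproduced with a toy tagger that combines a lexicon lookup with suffix rules. The `LEXICON` entries, the tag names, and the fallback rules are illustrative assumptions; practical taggers are statistical or neural models trained on labeled data.

```python
# Hypothetical lexicon of known words and their tags.
LEXICON = {"she": "PRON", "he": "PRON", "is": "VERB", "the": "DET", "dog": "NOUN"}

def tag(tokens):
    """Assign a coarse part-of-speech tag to each token."""
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            pos = LEXICON[lower]          # known word: use the lexicon
        elif lower.endswith("ing") or lower.endswith("ed"):
            pos = "VERB"                  # crude suffix rule for verb forms
        elif lower.endswith("ly"):
            pos = "ADV"                   # crude suffix rule for adverbs
        else:
            pos = "NOUN"                  # default guess for unknown words
        tagged.append((token, pos))
    return tagged

tagged = tag("She is running".split())
```

Because the rules ignore context, this sketch would also tag the noun "building" as a verb, which is the kind of ambiguity that context-aware taggers are meant to resolve.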

**Advantages**:
- Provides grammatical insights that enhance tasks like parsing, NER, and dependency analysis.
- Supports feature extraction for machine learning models.

**Limitations**:
- Requires accurate models and labeled data for tagging.
- Errors in tagging can propagate to downstream tasks.
- Ambiguous words (e.g., "run" as a verb or noun) can be misclassified.

---

### Named Entity Recognition (NER)
NER identifies and categorizes named entities in text into predefined categories, such as:
- **Persons**: "John Doe"
- **Organizations**: "Google"
- **Locations**: "Paris"
- **Dates**: "January 1st, 2024"
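
As a rough sketch, these categories can be matched with a gazetteer (a list of known names) plus a regular expression for dates. The `GAZETTEER` entries, the `DATE_PATTERN`, and the `find_entities` helper are assumptions for illustration; modern NER systems are typically trained sequence models rather than lookup tables.

```python
import re

# Hypothetical gazetteer: known entity names and their categories.
GAZETTEER = {
    "John Doe": "PERSON",
    "Google": "ORG",
    "Paris": "LOC",
}

# Matches dates written like "January 1st, 2024".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
    r" \d{1,2}(?:st|nd|rd|th)?, \d{4}\b"
)

def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    # Sort by start offset, then drop the offset from the result.
    return [(span, label) for _, span, label in sorted(found)]

entities = find_entities("John Doe joined Google in Paris on January 1st, 2024.")
```

A lookup like this finds nothing in a sentence such as "Jane Roe visited Lyon.", since neither name is in the gazetteer. Learned models help with such unseen names, though not perfectly.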

**Advantages**:
- Extracts structured information from unstructured text.
- Useful for real-world applications like information retrieval, chatbots, and summarization.
- Simplifies text by focusing on relevant entities.

**Limitations**:
- Performance depends on the quality of training data and the domain (e.g., general vs. medical texts).
- Struggles with entities not seen during training and with informal text (e.g., social media posts).
- Errors in recognizing entities can reduce system effectiveness in downstream tasks.
