# **Natural Language Processing**
Natural Language Processing (NLP) is a branch of Artificial Intelligence that enables computers to understand, interpret, analyze, and generate human language. It combines linguistics, computer science, and machine learning to process textual or spoken data in a meaningful way.

## Why NLP is Needed

- Humans communicate in natural language, not in binary or structured formats
- Massive amounts of unstructured text exist (social media, emails, reviews, chats)
- Enables automation of language tasks (chatbots, translation, sentiment analysis)
- Helps extract useful insights from text data
- Bridges the communication gap between humans and machines

## Text Preprocessing
Text preprocessing is the initial step in NLP, where raw text is cleaned and transformed into a consistent form suitable for analysis. Because natural language is unstructured and noisy, preprocessing usually improves model performance.

- #### **Tokenization**:
    Tokenization is the process of breaking text into smaller units called tokens.
    These tokens can be:

    - Words
    - Sentences
    - Subwords

    Example:
    “Machine learning is amazing” → [Machine, learning, is, amazing]

    It converts continuous text into manageable pieces.
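
A minimal sketch of word and sentence tokenization using only Python's standard `re` module. The function names `tokenize_words` and `tokenize_sentences` are made up for illustration, and the regex rules are deliberately naive (no handling of abbreviations, contractions, or subwords).

```python
import re

def tokenize_words(text):
    """Split text into word tokens, dropping punctuation."""
    # \w+ matches runs of letters, digits, and underscores
    return re.findall(r"\w+", text)

def tokenize_sentences(text):
    """Naive sentence split: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(tokenize_words("Machine learning is amazing"))
# ['Machine', 'learning', 'is', 'amazing']
print(tokenize_sentences("NLP is fun. It is useful!"))
# ['NLP is fun.', 'It is useful!']
```

Real tokenizers handle many more edge cases; this only shows the basic idea of turning a string into a list of tokens.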

- #### **Stemming**:
    Stemming reduces words to a root form (the stem) by stripping suffixes with heuristic rules.
    The resulting stem is not always a valid word.

    Examples:
    - Playing → Play
    - Studies → Studi
    - Running → Run

    It is rule-based and faster than lemmatization, but less accurate.
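
To make the rule-based idea concrete, here is a toy suffix stripper. It is a hypothetical sketch with three hand-picked rules, not the Porter algorithm or any library's stemmer, and it lowercases its output.

```python
def toy_stem(word):
    """Strip a few common suffixes. Illustrative only, not a real stemmer."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"            # studies -> studi
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]                  # running -> runn
        # Undo a doubled final consonant: runn -> run
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                  # cats -> cat
    return word

print([toy_stem(w) for w in ["Playing", "Studies", "Running"]])
# ['play', 'studi', 'run']
```

Note that `studi` is not a dictionary word, which is exactly the trade-off described above: simple rules are fast but can produce non-words.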

- #### **Lemmatization**:

    Lemmatization reduces words to their base or dictionary form (lemma) using linguistic knowledge.

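    Examples:
    - Running → Run
    - Better → Good
    - Studies → Study

    Unlike stemming, the output is a valid dictionary word.
    It is more accurate but computationally heavier, and it often depends on the word's part of speech (for example, “better” maps to “good” only when treated as an adjective).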
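
A rough sketch of dictionary-based lemmatization. The `LEMMA_TABLE` entries and the `toy_lemmatize` helper are invented for illustration; real lemmatizers rely on a full lexicon and morphological rules rather than a four-entry table.

```python
# Tiny hand-written lemma table keyed by (word, part_of_speech). Illustrative only.
LEMMA_TABLE = {
    ("running", "verb"): "run",
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("studies", "noun"): "study",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("Running", "verb"))  # run
print(toy_lemmatize("Better", "adj"))    # good
print(toy_lemmatize("Better", "adv"))    # well
print(toy_lemmatize("Studies", "noun"))  # study
```

The same surface word can map to different lemmas depending on the part-of-speech tag, which is one reason lemmatization costs more than stemming.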

## Text Vectorization
Text vectorization is the process of converting textual data into numerical representations so that machine learning models can process and analyze it.
Since algorithms operate on numbers, vectorization transforms words, sentences, or documents into mathematical vectors.

- #### **Bag of Words (BoW)**:
    Bag of Words represents text by counting the frequency of words in a document, ignoring grammar and word order.

    How it works:

    - Create a vocabulary of all unique words
    - Count how many times each word appears in a document
    - Represent document as a frequency vector

    Example:

    - Sentence 1: “I love NLP”
    - Sentence 2: “I love ML”

    Vocabulary → [I, love, NLP, ML]

    Vectors:
    - S1 → [1, 1, 1, 0]
    - S2 → [1, 1, 0, 1]
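
The three steps above can be sketched in plain Python. The helper names `build_vocabulary` and `bow_vector` are assumptions for this example, and tokens are split on whitespace only.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect unique tokens in first-seen order."""
    vocab = []
    for doc in documents:
        for token in doc.split():
            if token not in vocab:
                vocab.append(token)
    return vocab

def bow_vector(document, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocab]

docs = ["I love NLP", "I love ML"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(d, vocab) for d in docs]

print(vocab)    # ['I', 'love', 'NLP', 'ML']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Each vector has one slot per vocabulary word, so vector length grows with vocabulary size and most entries are zero for large corpora.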

- #### **TF-IDF (Term Frequency – Inverse Document Frequency)**:
    TF-IDF improves on BoW by weighting each word: words that are frequent in one document but rare across the corpus get more weight, while words common to most documents get less.

    Components:

    - TF (Term Frequency) → How often a word appears in a document
    - IDF (Inverse Document Frequency) → Penalizes words that appear in many documents

    Formula idea:
    TF-IDF = TF × IDF

    A common choice is IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t.

    Words like “the” and “is” get low weight.
    Rare but distinctive words get high weight.
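
A small sketch of the TF × IDF idea, assuming the unsmoothed variant IDF(t) = log(N / df(t)) and TF as a relative frequency. The three-document corpus and the function names are made up for illustration; libraries often use smoothed or normalized variants, so their numbers will differ.

```python
import math

def tf(term, tokens):
    """Relative frequency of a term within one document."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """log(N / df): lower for terms that occur in many documents."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ["the", "cat", "mat"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
# the 0.0
# cat 0.0676
# mat 0.1831
```

Here “the” appears in every document, so its IDF is log(1) = 0, while “mat” appears in only one document and gets the highest weight.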

- #### **N-grams**:
    N-grams consider sequences of N consecutive words instead of single words.

    Types:
    - Unigram → single word
    - Bigram → two words
    - Trigram → three words

    Example:

    Sentence: “I love NLP”

    - Unigrams → I, love, NLP
    - Bigrams → I love, love NLP

    Why needed?

    Because “not good” ≠ “good”.

    Unigram BoW counts “not” and “good” separately, so it misses the negation.
    N-grams partially capture this local context.
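
N-gram extraction is a sliding window over the token list. A minimal sketch, with `ngrams` as an illustrative helper name and whitespace tokenization assumed:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love NLP".split()
print(ngrams(tokens, 1))  # ['I', 'love', 'NLP']
print(ngrams(tokens, 2))  # ['I love', 'love NLP']
print(ngrams(tokens, 3))  # ['I love NLP']

print(ngrams("this is not good".split(), 2))
# ['this is', 'is not', 'not good']
```

With bigrams, “not good” becomes its own feature, which a unigram count cannot represent.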

- #### **Word2Vec**:
    Word2Vec is a neural network-based model that converts words into dense vectors based on context.

    Core idea:
    Words appearing in similar contexts have similar meanings.

    Example:
    King - Man + Woman ≈ Queen
    (Yes, algebra with words 😏)

    Features:

    - Dense vectors (low-dimensional, typically 100–300 dims)
    - Captures semantic similarity
    - Learned from context, but static: each word gets a single vector regardless of the sentence it appears in
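
The analogy arithmetic can be illustrated without training a model. The sketch below uses hand-made 3-dimensional vectors (the `TOY_VECTORS` values are invented, not learned embeddings) and picks the nearest word by cosine similarity. Real Word2Vec vectors are learned from a large corpus and have far more dimensions.

```python
import math

# Hand-made 3-d vectors (royalty, maleness, femaleness); NOT trained embeddings.
TOY_VECTORS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", TOY_VECTORS))  # queen
```

Excluding the three input words from the candidates mirrors how such analogy queries are usually evaluated; otherwise the nearest neighbour is often one of the inputs.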

- #### **Average Word2Vec**:
    Word2Vec gives vectors for individual words.
    But what about a sentence?

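    Solution:
    Take the average of all word vectors in the sentence.

    Example:
    sentence_vec = (word1_vec + word2_vec + word3_vec) / 3

    Averaging ignores word order, so sentences containing the same words in a different order get the same vector.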
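
A short sketch of sentence averaging. The 2-dimensional `toy_vectors` are invented for illustration, and skipping out-of-vocabulary tokens is one possible design choice, not the only one.

```python
def average_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Made-up 2-d word vectors, standing in for real Word2Vec output.
toy_vectors = {"i": [1.0, 0.0], "love": [0.0, 1.0], "nlp": [2.0, 2.0]}

print(average_vector(["i", "love", "nlp"], toy_vectors))  # [1.0, 1.0]
print(average_vector(["nlp", "love", "i"], toy_vectors))  # [1.0, 1.0]
```

Both orderings give the same result, which shows that the averaged representation keeps no information about word order.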