## NLP 

Natural Language Processing (NLP) is a subfield of AI that focuses on enabling computers to understand, interpret, and generate human language. NLP combines computational linguistics with machine learning and deep learning to process language in a way that is both meaningful and actionable.

### Uses

- Search engines
- Virtual assistants
- Translation services
- Email filtering
- Social media monitoring
- Generative models
- Document classification

## Key terms

- **Corpus**:  A large collection of text data.
  
- **Tokenization**: Breaking text into smaller units such as words or sentences.

  
- **Stemming**: Reducing words to their **root form**.
  
    Example: ideally "running", "runner", and "ran" would all reduce to "run", but a rule-based stemmer such as PorterStemmer only strips suffixes, so outputs vary: ["running","runner","ran"] --> ["run","runner","ran"]

- **Lemmatization**: Reducing words to their **base or dictionary form**.
  
    Example: "running" and "ran" both become "run" when using WordNetLemmatizer with the correct part of speech (verb).

- **POS Tagging**: Part-of-speech tagging; identifying grammatical parts of speech.
  
    Example: "She runs fast" --> [("She","PRP"),("runs","VBZ"),("fast","RB")]

- **NER**: Named Entity Recognition; identifying names of people, locations, dates, etc.

- **Stop Words**: Common words that are usually removed before processing (e.g., "the", "and").
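
To make the difference between stemming and lemmatization concrete, here is a minimal sketch in plain Python. The names `toy_stem`, `toy_lemmatize`, and `TOY_LEMMAS` are hypothetical and invented for illustration; this is not how NLTK implements either step, only a rough picture of "strip suffixes by rule" versus "look up the dictionary form".

```python
# Toy illustration only: real stemmers and lemmatizers use far richer rules and data.

def toy_stem(word):
    """Strip a few common suffixes without checking that the result is a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
TOY_LEMMAS = {"running": "run", "ran": "run", "studies": "study", "better": "good"}

def toy_lemmatize(word):
    """Look the word up in the mini-dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

words = ["running", "ran", "studies"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print("Stems: ", stems)    # ['runn', 'ran', 'studi']
print("Lemmas:", lemmas)   # ['run', 'run', 'study']
```

The stems need not be real words and the irregular form "ran" is left untouched, while the lookup-based approach returns dictionary forms for everything it knows about.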
  

## Components of NLP

NLP is not a single monolithic technique; it is composed of several components, each contributing to the overall understanding of language. The main levels of language that NLP tries to model are syntax, semantics, pragmatics, and discourse.

### Syntax:
- **Definition**: Syntax pertains to the arrangement of words and phrases to create well-structured sentences in a language.
- **Example**: Consider the sentence "The cat sat on the mat." Syntax involves analyzing the grammatical structure of this sentence and checking that it adheres to the grammatical rules of English, such as subject-verb agreement and proper word order.

### Semantics:
- **Definition**: Semantics is concerned with understanding the meaning of words and how they create meaning when combined in sentences.
- **Example**: In the sentence "The panda eats shoots and leaves," semantics helps distinguish whether the panda eats plants (shoots and leaves) or is involved in a violent act (shoots) and then departs (leaves), based on the meaning of the words and the context.

### Pragmatics:
- **Definition**: Pragmatics deals with understanding language in context, so that the intended meaning is derived from the situation, the speaker's intent, and shared knowledge.
- **Example**: If someone says, "Can you pass the salt?", pragmatics involves understanding that this is a request rather than a question about one's ability to pass the salt, interpreting the speaker's intent based on the dining context.

### Discourse
- **Definition**: Discourse focuses on the analysis and interpretation of language beyond the sentence level, considering how sentences relate to each other in texts and conversations.
- **Example**: In a conversation where one person says, "I'm freezing," and another responds, "I'll close the window," discourse analysis involves understanding the coherence between the two statements, recognizing that the second statement is a response to the implied request in the first.



## 2. Working with NLTK (Natural Language Toolkit)

NLTK is one of the most widely used Python libraries for teaching and working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.

In [2]:
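import nltk
# The cells below need NLTK data: run nltk.download() once for "punkt", "stopwords", and "wordnet".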

## Tokenization

In [5]:


from nltk.tokenize import word_tokenize, sent_tokenize
text = "Natural Language Processing with NLTK is fun and educational. It is not boring."

print("Word Tokenization:",word_tokenize(text))
print("Sentence Tokenization:",sent_tokenize(text))

Word Tokenization: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.', 'It', 'is', 'not', 'boring', '.']
Sentence Tokenization: ['Natural Language Processing with NLTK is fun and educational.', 'It is not boring.']


## Stop Words Removal:

In [8]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

words = word_tokenize(text)
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)

Filtered Words: ['Natural', 'Language', 'Processing', 'NLTK', 'fun', 'educational', '.', 'boring', '.']


## Stemming:

In [11]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

words = ["running","ran","runs"]
stemmed = [stemmer.stem(word) for word in words]
print("Stemmed Words:", stemmed)

Stemmed Words: ['run', 'ran', 'run']


## Lemmatization

In [14]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["running","ran","better","ate","eating"]
lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in words]
print("Lemmatized words:", lemmatized)

Lemmatized words: ['run', 'run', 'better', 'eat', 'eat']


## Text Preprocessing in NLP

In Natural Language Processing, before we apply any machine learning models, we need to clean the text data properly. This process is called **text preprocessing**. It helps to remove unnecessary parts from text and bring it into a format that the model can understand.

### Common Text Preprocessing Steps:
- Lowercasing
- Removing punctuation
- Tokenization
- Removing stop words
- Stemming or Lemmatization
- Removing numbers
- Handling special characters

### Real World Use Cases:
- Preprocessing user reviews for sentiment analysis
- Cleaning documents for topic modeling
- Preparing data for search engine indexing

In [2]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

text = "NLTK is a leading platform for building Python programs to work with human language data."

text = text.lower()
text = re.sub(r"[^a-zA-Z\s]", "", text)  # remove punctuation and numbers
tokens = word_tokenize(text)  # tokenization

# stop words removal
filtered = [word for word in tokens if word not in stop_words]

# lemmatization (words are treated as nouns by default)
lemmatized = [lemmatizer.lemmatize(word) for word in filtered]

print("Preprocessed text: ", lemmatized)

Preprocessed text:  ['nltk', 'leading', 'platform', 'building', 'python', 'program', 'work', 'human', 'language', 'data']


## 4. Vectorizers

When working with text data in NLP, vectorizers are used to convert raw text into numerical feature representations that machine learning models can work with.

### Common Vectorizers in NLP

#### 1. CountVectorizer
- What it does: Converts a collection of text documents to a matrix of token counts.
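- Example: Text: "I love spam and ham also ham burger" Features: {i: 1, love: 1, spam: 1, and: 1, ham: 2, also: 1, burger: 1}. Note that scikit-learn's default settings lowercase the text and drop single-character tokens such as "i".
- Good for: Simple bag-of-words representations.
- Limitation: Doesn't consider term importance across documents.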
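
The counting idea behind a bag-of-words representation can be sketched with the standard library alone. This is a conceptual illustration, not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, and its simple regular expression keeps single-letter tokens such as "i", which scikit-learn's default tokenizer would drop.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = bag_of_words("I love spam and ham also ham burger")
print(dict(counts))
# {'i': 1, 'love': 1, 'spam': 1, 'and': 1, 'ham': 2, 'also': 1, 'burger': 1}
```

Word order is discarded entirely: only how often each token occurs is kept, which is exactly why this representation is called a "bag" of words.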

#### 2. TfidfVectorizer 
TF-IDF = Term Frequency - Inverse Document Frequency

- What it does: Weights words by how important they are to a document relative to the whole corpus.
  
    - Words that are common across many documents get lower weight.
    - Words that are frequent in a document but rare in the rest of the corpus get higher weight.
    - Better than CountVectorizer for capturing importance.

- Use case: When you want to reduce the impact of frequent but less informative words.

#### 3. Word2Vec (from Gensim or others)

- What it does: Learns dense vector representations (embeddings) of words.
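- Learned from context: Words that appear in similar contexts, and so tend to have similar meanings, end up close in vector space. Each word gets a single static vector regardless of the sentence it appears in.
- Trained using:
    - CBOW (Continuous Bag of Words)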
    - Skip-Gram
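
"Close in vector space" is usually measured with cosine similarity. The sketch below shows the calculation on made-up data: the three-dimensional vectors in `toy_vectors` are hypothetical values chosen for illustration, not output from a trained Word2Vec model, which would typically use hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these invented numbers, "king" and "queen" score much higher than "king" and "apple", which is the behaviour one hopes to see from embeddings of related and unrelated words.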

#### 4. GloVe (Global Vectors for Word Representation)

Pre-trained embeddings from large corpora (Wikipedia, Common Crawl).

Similar to Word2Vec but trained differently (by factorizing a word co-occurrence matrix).


#### 5. FastText (by Facebook)

Like Word2Vec but includes subword information (character n-grams).

Better at handling out-of-vocabulary (OOV) words.
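
The subword idea can be illustrated by extracting character n-grams from a word. This is a rough sketch rather than FastText's own code: `char_ngrams` is a hypothetical helper, the `<` and `>` boundary markers follow the way the approach is usually described, and a single n-gram length of 3 is assumed here for brevity.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']

# A word missing from the vocabulary still shares n-grams with known words.
shared = set(char_ngrams("playing")) & set(char_ngrams("replaying"))
print(sorted(shared))
```

Because an unseen word such as "replaying" overlaps with the n-grams of "playing", a subword model can still assemble a reasonable vector for it from the pieces it has already learned.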

#### 6. BERT Vectorizers (Contextual Embeddings)
From models like BERT, RoBERTa, DistilBERT, etc.

Contextual: Word vectors depend on surrounding words.

Typically used via libraries like:

transformers (by Hugging Face)

sentence

## Explanation of Tools Used:

### TfidfVectorizer
TfidfVectorizer is a tool from scikit-learn used to convert a collection of text documents into a matrix of TF-IDF features. TF-IDF stands for Term Frequency-Inverse Document Frequency, and it measures the importance of a word within a document relative to all other documents in a corpus.

**Working Mechanism:**
- Term Frequency (TF): Measures how frequently a term (word) appears in a document. It is calculated as:

TF = Frequency of word in document / Total words in document

- Inverse Document Frequency (IDF):
Measures how important a term is in the entire corpus. The idea is that words that appear in many documents are less informative and hence given less weight. It is calculated as:

IDF = log(Total number of documents / Number of documents containing the word)

- The TF-IDF score for a term is then calculated as:

TF-IDF = TF × IDF
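
The three formulas above can be worked through on a tiny corpus. This is a minimal sketch in plain Python, assuming the natural logarithm and pre-tokenized documents; the corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ from this plain version.

```python
import math

# Hypothetical toy corpus, already split into tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf(term, doc):
    """Frequency of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tf_idf("the", docs[0], docs)  # frequent here, but also in another document
score_cat = tf_idf("cat", docs[0], docs)  # appears in this document only

print(f"the: {score_the:.4f}")
print(f"cat: {score_cat:.4f}")
```

Although "the" occurs twice in the first document and "cat" only once, "cat" ends up with the higher score because it appears in fewer documents.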

**Why it Matters:**

- Feature Representation: TF-IDF provides a numeric representation of words that reflects their importance, which is crucial for machine learning models.
- Downplays Common Words: Words like "the," "is," and "and" are common and may occur in all documents. TF-IDF reduces their weight to prevent them from dominating the feature space.
- Improves Classification: By weighting more important words, TF-IDF ensures that the model focuses on distinctive terms for classification tasks.

### WordNetLemmatizer
The WordNetLemmatizer from NLTK is a tool used for lemmatization, which is the process of reducing a word to its base or dictionary form (lemma). For example, "running" becomes "run" when treated as a verb, and "better" becomes "good" when treated as an adjective.

**Working Mechanism:**

Lemmatization involves looking up a word in a lexical database called WordNet, which contains relationships between words (like synonyms and antonyms). The lemmatizer uses the part of speech (POS) passed to it, treating the word as a noun by default, to determine the lemma. Unlike stemming, which simply strips suffixes, lemmatization ensures that the word is reduced to a valid dictionary form.

Example:

Input: "running"

Output: "run" (using the lemma from WordNet)

**Why it Matters:**
- Reduces Variability: Lemmatization helps reduce different forms of a word to one common form, ensuring that the model treats variations of the same word as a single entity.
- Improves Model Efficiency: By reducing words to their lemmas, lemmatization shrinks the feature space and can improve the generalization of machine learning models.
- More Accurate than Stemming: Unlike stemming, lemmatization preserves the meaning of the word, making it more useful in contexts where understanding is important.

### MultinomialNB
MultinomialNB is a Naive Bayes classifier used for classification tasks, especially with discrete features like word counts. It is part of the scikit-learn library. This classifier assumes that the features (terms in text) are conditionally independent given the class and follow a multinomial distribution.

**Working Mechanism:**

The Naive Bayes algorithm calculates the probability of a document belonging to each class (category) using Bayes' theorem:

𝑃(𝐶∣𝑋) = 𝑃(𝑋∣𝐶) × 𝑃(𝐶) / 𝑃(𝑋)

Where:

- 𝑃(𝐶∣𝑋) is the posterior probability of class 𝐶 given the features 𝑋
- 𝑃(𝑋∣𝐶) is the likelihood of observing the features 𝑋 given the class 𝐶
- 𝑃(𝐶) is the prior probability of class 𝐶
- 𝑃(𝑋) is the probability of observing the features 𝑋 across all classes.

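Essentially, the classifier estimates how likely each word is to occur in each class, combines those per-word probabilities with the class prior, and chooses the class with the highest resulting probability. Because 𝑃(𝑋) is the same for every class, it can be ignored when comparing classes.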
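
The calculation can be sketched from scratch on a toy dataset. This is an illustrative version under stated assumptions, not scikit-learn's MultinomialNB: the training data and the names `log_posterior` and `predict` are hypothetical, add-one (Laplace) smoothing is assumed so unseen words do not produce zero probabilities, and log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label).
train = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]

class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {word for tokens, _ in train for word in tokens}

def log_posterior(tokens, label, alpha=1.0):
    """log P(C) + sum of log P(word | C), up to the constant log P(X)."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for word in tokens:
        count = word_counts[label][word]
        # Add-alpha smoothing over the vocabulary.
        score += math.log((count + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    """Return the class with the highest (unnormalized) log posterior."""
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict("free money".split()))     # spam
print(predict("lunch meeting".split()))  # ham
```

The term 𝑃(𝑋) never appears in the code because it is identical for every class and therefore does not change which class scores highest.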