#1) What are regular expressions?

Regular expressions (regex) are patterns that describe sets of strings. They are used to search, match, extract, and replace text.

###Pattern to extract emails containing both digits and letters

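A regex like `\b(?=[A-Za-z0-9._%+-]*[A-Za-z])(?=[A-Za-z0-9._%+-]*[0-9])[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b` can capture emails that have at least one letter and one digit before the @. Each lookahead `(?=...)` checks the text before the @ for a letter or a digit without consuming it.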

###Advantages of regular expressions

They express pattern-based matching, extraction, and cleaning concisely, without long hand-written string-handling code.

###Limitations of regular expressions

They match character patterns only and cannot interpret the meaning or context of the text.

In [None]:
import re

text = "Hello, my email is john123@example.com and my friend's email is alice99@testmail.org. Contact us!"

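# Lookaheads require at least one letter and one digit before the "@"
pattern = r'\b(?=[A-Za-z0-9._%+-]*[A-Za-z])(?=[A-Za-z0-9._%+-]*[0-9])[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'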
emails = re.findall(pattern, text)

print("Extracted emails:", emails)

Extracted emails: ['john123@example.com', 'alice99@testmail.org']
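
To check that a pattern really enforces "at least one letter and one digit before the @", it helps to run it against addresses that should and should not match. The sketch below is one possible way to write that rule, using two lookaheads. The sample addresses and the names `LOCAL` and `PATTERN` are made up for illustration.

```python
import re

# Characters allowed in the local part (the text before "@"), as in the pattern above.
LOCAL = r"[A-Za-z0-9._%+-]"

# The two lookaheads scan the local part without consuming it:
# the first requires a letter somewhere in it, the second requires a digit.
PATTERN = re.compile(
    rf"\b(?={LOCAL}*[A-Za-z])(?={LOCAL}*[0-9]){LOCAL}+@[A-Za-z0-9.-]+\.[A-Za-z]{{2,}}\b"
)

# Made-up addresses: letters+digits, letters only, digits only, digit first.
samples = "john123@example.com alice@example.com 12345@example.com 1bob@example.org"
matches = PATTERN.findall(samples)

print(matches)  # ['john123@example.com', '1bob@example.org']
```

The letters-only and digits-only addresses are skipped, while an address that starts with a digit is still accepted as long as a letter appears somewhere before the @.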


#2) What is the Bag of Words (BoW) technique?

Bag of Words is a method that represents each document as a vector of word counts over a shared vocabulary, ignoring grammar and word order.

###How it differs from regular expressions

BoW turns each document into numeric word counts for analysis or modeling, while regex finds or extracts substrings that fit a pattern.

###Limitations of BoW

It ignores word order, meaning, and context, which can cause information loss.
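
The loss of word order can be seen with a minimal sketch that needs no libraries beyond the standard one. The helper `bag_of_words` is a hypothetical, simplified stand-in (lowercase plus whitespace split), not the tokenizer a real vectorizer would use.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(sentence.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences contain the same words once each, so the bags are identical
# even though the sentences mean different things.
print(a == b)  # True
```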

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love Python programming",
    "Python programming is fun",
    "I love NLP with Python"
]

# By default CountVectorizer lowercases text and ignores one-character tokens such as "I"
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Feature names:", vectorizer.get_feature_names_out())
print("Bag of Words matrix:\n", X.toarray())

Feature names: ['fun' 'is' 'love' 'nlp' 'programming' 'python' 'with']
Bag of Words matrix:
 [[0 0 1 0 1 1 0]
 [1 1 0 0 1 1 0]
 [0 0 1 1 0 1 1]]


#3) What is TF-IDF (Term Frequency–Inverse Document Frequency)?

TF-IDF is a numerical weight that scores a word higher when it is frequent in a document but rare across the other documents in the corpus.

###Advantages of TF-IDF

It reduces the weight of words that appear in most documents and emphasizes distinctive terms, which usually gives a more informative representation than raw counts.

###How it differs from regex and BoW

Unlike regex (pattern search) and BoW (raw counts), TF-IDF considers word importance relative to all documents.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses `corpus` from the Bag of Words cell above
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("Feature names:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", X.toarray())

Feature names: ['fun' 'is' 'love' 'nlp' 'programming' 'python' 'with']
TF-IDF matrix:
 [[0.         0.         0.61980538 0.         0.61980538 0.48133417
  0.        ]
 [0.5844829  0.5844829  0.         0.         0.44451431 0.34520502
  0.        ]
 [0.         0.         0.44451431 0.5844829  0.         0.34520502
  0.5844829 ]]
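
The first row of the matrix above can be reproduced by hand, which shows where the numbers come from. This sketch assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row) and uses a simplified tokenizer that lowercases and drops one-character tokens. The helper name `tfidf` is made up for illustration.

```python
import math
from collections import Counter

corpus = [
    "I love Python programming",
    "Python programming is fun",
    "I love NLP with Python",
]

# Lowercase and keep tokens of 2+ characters (enough to mimic the default
# tokenization for this small, punctuation-free corpus).
docs = [[w for w in doc.lower().split() if len(w) > 1] for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

first = tfidf(docs[0])
print({w: round(v, 4) for w, v in sorted(first.items())})
# {'love': 0.6198, 'programming': 0.6198, 'python': 0.4813}
```

"python" appears in all three documents, so its IDF is the minimum value of 1 and it ends up with the lowest weight in the row.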


#4) Word Embeddings

What it means in plain English:
A way to turn words into lists of numbers (vectors) that carry meaning.
Words with similar meanings (like “king” and “queen”) end up close to each other in a “map” of words.

###Breaking the name:

Word → a single term from text.

Embedding → mapping it into a multi-dimensional space (think coordinates).

###Why use it?

Captures meaning and relationships between words.

Lets computers measure similarity and do arithmetic on meaning:

“man” and “woman” → close together, because both are people.

“king” - “man” + “woman” ≈ “queen”.

In [None]:
# Toy example: hand-picked 2-D vectors standing in for learned word embeddings

embeddings = {
    "king": [0.9, 0.8],
    "queen": [0.88, 0.82],
    "man": [0.4, 0.3],
    "woman": [0.42, 0.35]
}

print("Vector for 'king':", embeddings["king"])
print("Vector for 'queen':", embeddings["queen"])

Vector for 'king': [0.9, 0.8]
Vector for 'queen': [0.88, 0.82]
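
With the same made-up vectors, "closeness" and the king/queen analogy can be checked numerically. This is only a sketch: it uses plain Euclidean distance because the toy vectors are 2-D, whereas real embeddings are learned from text, have many more dimensions, and are usually compared with cosine similarity. The helper `distance` is a hypothetical name.

```python
import math

# Same made-up 2-D vectors as above; real embeddings are learned from text.
embeddings = {
    "king": [0.9, 0.8],
    "queen": [0.88, 0.82],
    "man": [0.4, 0.3],
    "woman": [0.42, 0.35],
}

def distance(a, b):
    """Euclidean distance between two vectors (smaller means more similar here)."""
    return math.dist(a, b)

# "king" sits much closer to "queen" than to "man" in this toy space.
print(round(distance(embeddings["king"], embeddings["queen"]), 3))  # 0.028
print(round(distance(embeddings["king"], embeddings["man"]), 3))    # 0.707

# Analogy: king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Find the word whose vector lies nearest to the target point.
nearest = min(embeddings, key=lambda word: distance(embeddings[word], target))
print(nearest)  # queen
```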


#5) What are stop words and how to remove them using NLTK?

Stop words are very common words (like “the”, “is”, “in”) that carry little meaning on their own and are often removed so that analysis focuses on the meaningful content.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

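# One-time downloads: tokenizer models and the stop word lists
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

text = "I love learning NLP in Python because it is interesting"
stop_words = set(stopwords.words('english'))  # a set gives fast membership checks

words = word_tokenize(text)
# Compare in lowercase so capitalized stop words such as "I" are also removed
filtered_words = [w for w in words if w.lower() not in stop_words]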

print("Filtered words:", filtered_words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Filtered words: ['love', 'learning', 'NLP', 'Python', 'interesting']
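
The `w.lower()` in the filter above matters because the lookup is case-sensitive. The sketch below shows the difference with a tiny, hypothetical stop list and a whitespace split, so it runs without NLTK. The list and the sentence are made up for illustration.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
stop_words = {"the", "is", "in"}

words = "The cat is in the garden".split()

# Case-sensitive check: "The" slips through because only "the" is in the set.
case_sensitive = [w for w in words if w not in stop_words]

# Lowercasing each token before the lookup catches both "The" and "the".
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)    # ['The', 'cat', 'garden']
print(case_insensitive)  # ['cat', 'garden']
```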


#6) What is sentence tokenization and word tokenization?

Sentence tokenization splits text into sentences, while word tokenization splits text into individual words and punctuation marks.

In [None]:
import nltk  # needed for nltk.download below
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

sample_text = "NLP is fun. I love learning new techniques!"

print("Sentence tokenization:", sent_tokenize(sample_text))
print("Word tokenization:", word_tokenize(sample_text))

Sentence tokenization: ['NLP is fun.', 'I love learning new techniques!']
Word tokenization: ['NLP', 'is', 'fun', '.', 'I', 'love', 'learning', 'new', 'techniques', '!']
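
For intuition, both kinds of tokenization can be roughly approximated with regular expressions. The sketch below is a naive stand-in, not how NLTK works internally. The helpers `naive_sentences` and `naive_words` are hypothetical names, and the second sample sentence is made up.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def naive_words(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample_text = "NLP is fun. I love learning new techniques!"
print(naive_sentences(sample_text))
# ['NLP is fun.', 'I love learning new techniques!']
print(naive_words(sample_text))
# ['NLP', 'is', 'fun', '.', 'I', 'love', 'learning', 'new', 'techniques', '!']

# The simple rule also splits after an abbreviation, which is usually wrong.
print(naive_sentences("Dr. Smith teaches NLP. It is fun!"))
# ['Dr.', 'Smith teaches NLP.', 'It is fun!']
```

On the simple sample the naive rules give the same tokens as the output above, but the "Dr." case shows why a dedicated tokenizer is usually preferred over hand-written splitting rules.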


