**Question 1:** What is Computational Linguistics and how does it relate to NLP? 

Ans- Computational linguistics is a theoretical field that uses computational methods to study language, while Natural Language Processing (NLP) is its applied counterpart, focusing on developing practical applications that enable computers to process and understand human language. In short, computational linguistics explores the theory and models of language, whereas NLP uses those theories and models to build things like chatbots, speech recognition, and machine translation. 

**Question 2:** Briefly describe the historical evolution of Natural Language Processing.

Ans- NLP has evolved from early rule-based systems in the 1950s to statistical methods in the 1980s-2000s, and has since been revolutionized by neural networks and deep learning, especially with the rise of large language models in the 2010s and 2020s. Key advancements include the shift from hand-written rules to statistical models that learn from data, the development of recurrent neural networks (RNNs), and the emergence of transformers and other deep learning models that can handle vast amounts of text. 

**Question 3:** List and explain three major use cases of NLP in today’s tech industry.

Ans- Three major use cases of NLP in today's tech industry are customer service automation through chatbots, search and information retrieval like in search engines, and business intelligence and analytics such as sentiment analysis and market trend prediction. NLP enables businesses to automate interactions, understand user intent, and extract actionable insights from vast amounts of text data. 

**Question 4:** What is text normalization and why is it essential in text processing tasks?

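Ans- Text normalization is the process of converting raw text into a standard, consistent format, for example by lowercasing, removing punctuation and extra whitespace, expanding contractions, and unifying spelling variants. It is essential in text processing because it makes surface variants such as "Run", "run", and "RUN" map to the same token, which shrinks the vocabulary, reduces noise, and generally improves model performance.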
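
As a rough illustration, normalization can be sketched with the standard library alone. The function name `normalize` and the particular choice of steps (Unicode folding, lowercasing, punctuation removal, whitespace collapsing) are assumptions made for this example, not a fixed recipe.

```python
import re
import unicodedata

def normalize(text):
    """Toy normalizer: Unicode-fold, lowercase, strip punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", text)   # unify equivalent Unicode forms
    text = text.lower()                          # case folding
    text = re.sub(r"[^\w\s]", " ", text)         # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace

print(normalize("  The APP   crashed... AGAIN!! "))  # the app crashed again
```

Which steps to apply depends on the task: punctuation and casing can carry signal for sentiment analysis, so they are not always safe to discard.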

**Question 5:** Compare and contrast stemming and lemmatization with suitable 
examples.

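Ans- Stemming and lemmatization are both techniques used in Natural Language Processing (NLP) to reduce words to their base or root form, but they differ in their approach and accuracy. Stemming strips suffixes with heuristic rules, so it is fast but can produce non-words: the Porter stemmer turns "studies" into "studi" and "running" into "run". Lemmatization uses a vocabulary and the word's part of speech to return a valid dictionary form (the lemma): "studies" becomes "study" and, as an adjective, "better" becomes "good". Stemming suits speed-sensitive tasks such as search indexing, while lemmatization is preferred when the output must consist of real words.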
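
The contrast can be sketched with two toy functions. Both are deliberately simplified stand-ins: `toy_stem` is a crude suffix stripper (not the Porter algorithm), and `TOY_LEMMAS` is a hypothetical four-entry dictionary standing in for a real lexicon such as WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping: chops common endings without checking the result is a word."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer consults a full lexicon.
TOY_LEMMAS = {"studies": "study", "running": "run", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    """Dictionary lookup: returns a real base form, or the word unchanged if unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ("studies", "running", "better", "mice"):
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer yields fragments such as "stud" and "runn" and leaves irregular forms like "mice" untouched, whereas the lookup returns valid words but only for entries it knows about.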

**Question 6:** Write a Python program that uses regular expressions (regex) to extract all 
email addresses from the following block of text: 
“Hello team, please contact us at support@xyz.com for technical issues, or reach out to 
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny 
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.” 
(Include your Python code and output in the code box below.)

In [1]:
import re

text = """
    Hello team, please contact us at support@xyz.com for technical issues, or reach out to 
    our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny 
    via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.
"""

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = re.findall(pattern, text)

print(emails)

['support@xyz.com', 'hr@xyz.com', 'john.doe@xyz.org', 'jenny_clarke126@mail.co.us', 'partners@xyz.biz']


**Question 7:** Given the sample paragraph below, perform string tokenization and 
frequency distribution using Python and NLTK: 
“Natural Language Processing (NLP) is a fascinating field that combines linguistics, 
computer science, and artificial intelligence. It enables machines to understand, 
interpret, and generate human language. Applications of NLP include chatbots, 
sentiment analysis, and machine translation. As technology advances, the role of NLP 
in modern solutions is becoming increasingly critical.” 
(Include your Python code and output in the code box below.)

In [2]:
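import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# word_tokenize needs NLTK's Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (newer NLTK releases ask for 'punkt_tab' instead).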

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Tokenization
tokens = word_tokenize(text)
print("Tokens:")
print(tokens)

# Frequency Distribution
freq_dist = FreqDist(tokens)
print("\nFrequency Distribution:")
for word, freq in freq_dist.items():
    print(f"{word}: {freq}")


Tokens:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Frequency Distribution:
Natural: 1
Language: 1
Processing: 1
(: 1
NLP: 3
): 1
is: 2
a: 1
fascinating: 1
field: 1
that: 1
combines: 1
linguistics: 1
,: 7
computer: 1
science: 1
and: 3
artificial: 1
intelligence: 1
.: 4
It: 1
enables: 1
machines: 1
to: 1
understand: 1
interpret: 1
generate: 1
human: 1
language: 1
Applications: 1
of: 2
include: 1
chatbots: 1
sentiment: 1
analysis: 1
machine: 1
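translation: 1
As: 1
technology: 1
advances: 1
the: 1
role: 1
in: 1
modern: 1
solutions: 1
becoming: 1
increasingly: 1
critical: 1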

**Question 8:** Create a custom annotator using spaCy or NLTK that identifies and labels 
proper nouns in a given text. 
(Include your Python code and output in the code box below.)

In [3]:
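import nltk
from nltk import word_tokenize, pos_tag

# pos_tag needs NLTK's averaged perceptron tagger data. If it is missing, run
# nltk.download('averaged_perceptron_tagger') once (newer NLTK releases may ask
# for 'averaged_perceptron_tagger_eng' instead).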

def annotate_proper_nouns(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)

    annotations = []
    for i, (word, tag) in enumerate(tagged):
        if tag in ("NNP", "NNPS"):
            annotations.append({
                "token": word,
                "tag": tag,
                "index": i,
                "label": "PROPER_NOUN"
            })
    return annotations

text = "Apple hired John to work in New York."
anns = annotate_proper_nouns(text)

for ann in anns:
    print(ann)


{'token': 'Apple', 'tag': 'NNP', 'index': 0, 'label': 'PROPER_NOUN'}
{'token': 'John', 'tag': 'NNP', 'index': 2, 'label': 'PROPER_NOUN'}
{'token': 'New', 'tag': 'NNP', 'index': 6, 'label': 'PROPER_NOUN'}
{'token': 'York', 'tag': 'NNP', 'index': 7, 'label': 'PROPER_NOUN'}


**Question 9:** Using Gensim, demonstrate how to train a simple Word2Vec model on the 
following dataset consisting of example sentences: 
dataset = [ 
"Natural language processing enables computers to understand human language", 
"Word embeddings are a type of word representation that allows words with similar 
meaning to have similar representation", 
"Word2Vec is a popular word embedding technique used in many NLP applications", 
"Text preprocessing is a critical step before training word embeddings", 
"Tokenization and normalization help clean raw text for modeling" 
] 
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using 
Gensim. 
(Include your Python code and output in the code box below.)

In [4]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

tokenized_sentences = [simple_preprocess(doc) for doc in dataset]

print("Tokenized & preprocessed sentences:")
for sent in tokenized_sentences:
    print(sent)

model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,
    window=3,
    min_count=1,
    workers=4,
    sg=1,
    epochs=100
)

print("\nModel vocabulary:")
print(list(model.wv.index_to_key))

print("\nVector for 'language':")
print(model.wv["language"])

print("\nMost similar words to 'word':")
for word, score in model.wv.most_similar("word", topn=5):
    print(f"{word}: {score:.4f}")

Tokenized & preprocessed sentences:
['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language']
['word', 'embeddings', 'are', 'type', 'of', 'word', 'representation', 'that', 'allows', 'words', 'with', 'similar', 'meaning', 'to', 'have', 'similar', 'representation']
['word', 'vec', 'is', 'popular', 'word', 'embedding', 'technique', 'used', 'in', 'many', 'nlp', 'applications']
['text', 'preprocessing', 'is', 'critical', 'step', 'before', 'training', 'word', 'embeddings']
['tokenization', 'and', 'normalization', 'help', 'clean', 'raw', 'text', 'for', 'modeling']

Model vocabulary:
['word', 'text', 'is', 'similar', 'representation', 'embeddings', 'to', 'language', 'modeling', 'for', 'raw', 'clean', 'help', 'normalization', 'and', 'tokenization', 'training', 'before', 'step', 'critical', 'preprocessing', 'applications', 'nlp', 'many', 'in', 'used', 'technique', 'embedding', 'popular', 'vec', 'have', 'meaning', 'with', 'words', 'allows', 'that', 'of

**Question 10:** Imagine you are a data scientist at a fintech startup. You’ve been tasked 
with analyzing customer feedback. Outline the steps you would take to clean, process, 
and extract useful insights using NLP techniques from thousands of customer reviews. 

### 1. Problem Understanding & Data Collection

- Identify the business goals:
  - Detect main pain points (e.g., payment failures, KYC issues, app crashes).
  - Measure overall customer satisfaction (sentiment).
  - Track changes in feedback over time or across product features.
- Collect data from:
  - App store reviews (Google Play, App Store).
  - In-app feedback forms.
  - Customer support tickets, emails, chat logs.
- Capture useful metadata:
  - Review text, rating, timestamp, product/feature, customer ID, platform (Android/iOS).

---

### 2. Data Cleaning

- Remove duplicate reviews.
- Drop empty or very short reviews (e.g., fewer than 3–4 words if not useful).
- Handle missing values in important fields (text/rating/timestamp).
- Fix encoding issues (ensure UTF-8).
- Standardize basic formats:
  - Timestamps to a single format.
  - Consistent language (filter or flag non-English reviews if focusing on English).
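
A minimal sketch of the de-duplication and length filter above, assuming each review is a dict with a `text` key. That schema, the function name `clean_reviews`, and the `min_words` default are illustrative assumptions.

```python
def clean_reviews(reviews, min_words=3):
    """Drop empty, too-short, and duplicate reviews (hypothetical dict schema)."""
    seen = set()
    cleaned = []
    for review in reviews:
        text = (review.get("text") or "").strip()
        if len(text.split()) < min_words:
            continue                              # empty or too short to be useful
        key = " ".join(text.lower().split())      # case/whitespace-insensitive key
        if key in seen:
            continue                              # duplicate of an earlier review
        seen.add(key)
        cleaned.append({**review, "text": text})
    return cleaned

sample = [
    {"text": "Payment failed twice today", "rating": 1},
    {"text": "payment  failed twice today", "rating": 1},
    {"text": "ok", "rating": 3},
    {"text": None, "rating": 5},
    {"text": "Love the cashback offers", "rating": 5},
]
print(len(clean_reviews(sample)))  # 2
```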

---

### 3. Text Preprocessing

Apply standard NLP preprocessing steps to the review text:

- **Lowercasing**: Convert all text to lowercase.
- **Remove noise**:
  - Punctuation, extra spaces, special characters.
  - URLs, email addresses, user IDs.
  - Optional: numbers (unless important, like error codes).
- **Tokenization**:
  - Split text into tokens (words or subwords).
- **Stopword removal**:
  - Remove very common words (e.g., “the”, “is”, “and”) that don’t add meaning.
- **Lemmatization / Stemming**:
  - Convert words to their base forms (e.g., “running” → “run”, “cards” → “card”).
- **Handle emojis and slang** (optional but useful):
  - Map emojis or common slang to sentiment tags or words if needed.
- **Spelling correction** (optional):
  - Correct frequent typos in domain-specific terms (e.g., “paytm”, “upi”, brand names).
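
The core of these steps (lowercasing, noise removal, tokenization, stopword removal) might look like the sketch below. The stopword set is a tiny illustrative list, and the URL and email patterns are simplifying assumptions; a real pipeline would typically use a fuller stopword list from NLTK or spaCy and add lemmatization.

```python
import re

# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "is", "and", "a", "to", "at", "my", "i", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def preprocess(review):
    text = review.lower()                      # lowercasing
    text = URL_RE.sub(" ", text)               # drop URLs
    text = EMAIL_RE.sub(" ", text)             # drop email addresses
    tokens = re.findall(r"[a-z0-9]+", text)    # tokenize, discarding punctuation
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The UPI payment FAILED again! Mail me at joe@example.com or see https://example.com/help"))
# ['upi', 'payment', 'failed', 'again', 'mail', 'me', 'or', 'see']
```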

---

### 4. Exploratory Analysis (EDA for Text)

Use basic statistics to understand the data:

- Review length distribution (number of words/chars per review).
- Overall rating distribution (1–5 stars).
- Word frequency analysis:
  - Most common words after preprocessing.
  - Most frequent bigrams/trigrams (e.g., “payment failed”, “app crash”, “kyc issue”).
- Compare positive vs negative reviews:
  - Common words and phrases in each group.
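
For the bigram counts, one simple approach is to pair each token with its successor and tally the pairs. The sketch assumes the reviews are already tokenized; the helper name `top_bigrams` is hypothetical.

```python
from collections import Counter

def top_bigrams(token_lists, n=3):
    """Count adjacent token pairs across all reviews and return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))   # consecutive pairs within one review
    return counts.most_common(n)

reviews = [
    ["payment", "failed", "again"],
    ["app", "crash", "after", "update"],
    ["payment", "failed", "on", "upi"],
]
print(top_bigrams(reviews, n=1))  # [(('payment', 'failed'), 2)]
```

Running the same function separately on low-rated and high-rated reviews gives the positive-versus-negative comparison described above.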

---

### 5. Sentiment Analysis

Measure how customers feel:

- Label reviews as **positive**, **negative**, or **neutral** using:
  - Rule-based models (e.g., VADER) or
  - ML models (Logistic Regression, SVM, or deep learning) or
  - Pre-trained transformer models (e.g., BERT-based sentiment classifier).
- Validate model performance:
  - Manually label a sample of reviews.
  - Evaluate using accuracy, F1-score, precision, recall.
- Aggregate sentiment:
  - By time (daily/weekly/monthly).
  - By product/feature (UPI, credit card, loans).
  - By platform (Android vs iOS).
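
To make the labelling and aggregation steps concrete, here is a toy rule-based scorer with a monthly roll-up. The word lists are a hypothetical mini-lexicon (a tool such as VADER uses a far larger, weighted one), and the `(month, tokens)` input shape is an assumption for the example.

```python
from collections import defaultdict

# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"love", "great", "fast", "easy"}
NEGATIVE = {"failed", "crash", "slow", "stuck"}

def label_sentiment(tokens):
    """Label by counting lexicon hits: more positive than negative words, or vice versa."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def negative_share_by_month(reviews):
    """reviews: iterable of (month, tokens) pairs -> {month: fraction labelled negative}."""
    totals, negatives = defaultdict(int), defaultdict(int)
    for month, tokens in reviews:
        totals[month] += 1
        negatives[month] += label_sentiment(tokens) == "negative"
    return {month: negatives[month] / totals[month] for month in totals}

sample = [
    ("2024-01", ["love", "fast", "app"]),
    ("2024-01", ["payment", "failed"]),
    ("2024-02", ["app", "crash", "slow"]),
    ("2024-02", ["kyc", "stuck"]),
]
print(negative_share_by_month(sample))  # {'2024-01': 0.5, '2024-02': 1.0}
```

A lexicon this small ignores negation ("not slow") and domain vocabulary, which is why the validation step against manually labelled reviews matters.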

---

### 6. Topic Modeling (Discover Main Themes)

Find the main topics/issues in the reviews:

- Convert reviews to features:
  - Bag-of-Words or TF-IDF vectors.
  - Or embeddings (Word2Vec, Doc2Vec, sentence transformers).
- Apply topic modeling:
  - LDA (Latent Dirichlet Allocation).
  - NMF (Non-negative Matrix Factorization).
  - Or advanced methods like BERTopic (using sentence embeddings).
- Interpret and label topics:
  - E.g., “Payment Failure”, “KYC Verification Problems”, “Cashback Delays”, “App Performance/Crashes”.
- Link topics with sentiment:
  - Identify which topics are associated with the most negative or the most positive sentiment.

---

### 7. Keyword & Phrase Extraction

Highlight key phrases customers use:

- Use techniques like:
  - TF-IDF to get important keywords per segment (e.g., per topic or per product).
  - RAKE or YAKE for keyphrase extraction.
  - KeyBERT or similar embedding-based methods for more semantic keyphrases.
- Extract top keywords for:
  - Negative reviews.
  - Each topic or product category.
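
TF-IDF scoring can be written out by hand to show what it rewards: terms that are frequent in one segment but rare across the others. This sketch uses the plain, unsmoothed formula tf × ln(N / df); libraries such as scikit-learn apply a smoothed variant by default, so their scores will differ. The function name and sample documents are illustrative.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=2):
    """docs: list of token lists. Returns the top_n highest-scoring tokens per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for a stable result.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results

docs = [
    ["payment", "failed", "payment", "pending"],
    ["kyc", "verification", "pending"],
    ["cashback", "delayed", "cashback"],
]
print(tfidf_keywords(docs, top_n=1))  # [['payment'], ['kyc'], ['cashback']]
```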

---

### 8. Named Entity Recognition (NER)

Identify important entities in the text:

- Detect mentions of:
  - Brand names, banks, payment gateways.
  - Features (UPI, debit card, credit card, loans).
  - Locations or countries (if relevant).
- Use this to:
  - See which partners or features are mentioned often in negative reviews.
  - Track complaints tied to a particular bank or payment provider.
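
For domain terms that a general-purpose NER model may miss, a dictionary (gazetteer) lookup is a common fallback. The sketch below is one way to do it with a regex; the entries in `GAZETTEER`, including the bank name, are made up for illustration.

```python
import re

# Hypothetical gazetteer: surface form -> entity label.
GAZETTEER = {
    "upi": "FEATURE",
    "credit card": "FEATURE",
    "debit card": "FEATURE",
    "examplebank": "BANK",
}

# Longest terms first so multi-word entries win over shorter overlapping ones.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in sorted(GAZETTEER, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def find_entities(text):
    """Return (matched text, label) pairs for every gazetteer hit, in order of appearance."""
    return [(m.group(0), GAZETTEER[m.group(0).lower()]) for m in _PATTERN.finditer(text)]

print(find_entities("UPI transfer to ExampleBank failed, but my credit card worked."))
# [('UPI', 'FEATURE'), ('ExampleBank', 'BANK'), ('credit card', 'FEATURE')]
```

Counting these hits within negative reviews only gives the per-partner and per-feature complaint tallies mentioned above.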

---

### 9. Trend & Correlation Analysis

Connect text insights with business metrics:

- Track **trend over time**:
  - How sentiment changes around app releases or new features.
  - Spike in certain topics (e.g., “server down”, “login issue”) after a deployment.
- Correlate:
  - Sentiment with app ratings.
  - Certain topics with churn, support tickets, or complaint volume.
  - Platform (Android/iOS) with particular error or UX issues.

---

### 10. Visualization & Reporting

Create clear visualizations and dashboards:

- Tools: Power BI, Tableau, or custom dashboards (Streamlit, Dash).
- Example visuals:
  - Sentiment over time.
  - Topic distribution (share of each topic).
  - Top complaint categories per product or feature.
  - Word clouds for positive vs negative reviews.
- Summarize findings:
  - Use simple tables, charts, and short textual summaries.

---

### 11. Actionable Insights for the Fintech Startup

Translate analysis into concrete actions, for example:

- “40% of 1-star reviews mention ‘payment failed’ or ‘transaction pending’.”
- “Significant spike in negative sentiment after last app update on [date].”
- “KYC verification issues are the most common topic for new users in the last 3 months.”
- “Users appreciate cashback offers but are frustrated by payout delays.”

Propose actions:

- Prioritize bug fixes for high-impact issues (e.g., failed payments).
- Improve onboarding and KYC flow.
- Refine communication around cashback timelines and eligibility.
- Use positive feedback to highlight strengths in marketing (e.g., UX, speed, rewards).

---

### 12. Automation & Productionization

- Build an **end-to-end pipeline**:
  - Data ingestion → cleaning → preprocessing → sentiment & topic modeling → dashboards.
- Schedule regular updates (daily/weekly).
- Monitor model performance and refresh models periodically.
- Integrate alerts:
  - E.g., notify product team when negative sentiment on a topic spikes.
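
The alerting rule could be as simple as comparing the latest period with the historical average. The function name `spike_alert` and the 0.15 threshold are assumptions for this sketch; a production system would likely tune the threshold per topic and account for review volume.

```python
from statistics import mean

def spike_alert(negative_shares, threshold=0.15):
    """negative_shares: chronological fractions of negative reviews for one topic.

    Returns True when the latest value exceeds the mean of earlier periods
    by more than `threshold`.
    """
    if len(negative_shares) < 2:
        return False                      # not enough history to compare against
    *history, latest = negative_shares
    return latest - mean(history) > threshold

weekly = [0.20, 0.22, 0.18, 0.45]
print(spike_alert(weekly))  # True
```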

---
