**Question 1: What is Computational Linguistics and how does it relate to NLP?**

Computational Linguistics is the scientific study of language from a computational perspective, focusing on how human language can be modeled, analyzed, and understood using algorithms and formal rules. It provides the theoretical and linguistic foundations (such as syntax, semantics, and grammar) for Natural Language Processing (NLP), while NLP applies these theories practically to build systems like chatbots, translators, and speech recognition tools.

**Question 2: Briefly describe the historical evolution of Natural Language Processing.**

Natural Language Processing began with rule-based approaches in the 1950s and 1960s, where language understanding relied on handcrafted grammars and linguistic rules. In the 1990s, the field shifted to statistical and machine learning methods trained on large text corpora. More recently, deep learning and transformer-based models have driven major advances, substantially improving accuracy on tasks such as translation, sentiment analysis, and conversational AI.

**Question 3: List and explain three major use cases of NLP in today’s tech industry.**

**Chatbots and Virtual Assistants**: NLP enables systems like customer support bots and voice assistants to understand user queries, generate relevant responses, and carry out conversations in natural language.

**Machine Translation**: NLP is used to automatically translate text or speech between languages (e.g., Google Translate), helping break language barriers in global communication.

**Sentiment Analysis**: Companies use NLP to analyze customer reviews, social media posts, and feedback to understand opinions, emotions, and overall public sentiment toward products or brands.

**Question 4: What is text normalization and why is it essential in text processing tasks?**

Text normalization is the process of cleaning and standardizing raw text into a consistent format through steps such as lowercasing, removing punctuation, correcting spelling, and expanding abbreviations. It is essential because it reduces noise and variability in text data, helping NLP models process input more accurately and improving the performance of tasks like text classification, search, and information retrieval.
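
The normalization steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a production normalizer: the `ABBREVIATIONS` table and the `normalize` function are hypothetical names invented for this example, and spelling correction is left out because it needs a dictionary or a dedicated library.

```python
import string

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"u": "you", "pls": "please", "thx": "thanks"}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, collapse whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses runs of whitespace.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(normalize("Pls   reset my PASSWORD, thx!!"))
# → please reset my password thanks
```

Two inputs that differ only in case, spacing, or punctuation now map to the same string, which is what reduces variability for downstream models.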


**Question 5: Compare and contrast stemming and lemmatization with suitable examples.**

Stemming and lemmatization are text preprocessing techniques used to reduce words to their base form, but they differ in accuracy and approach.

Stemming strips word suffixes using simple heuristic rules, often producing non-dictionary words.

Example (Porter stemmer): running, runs → run; studies → studi

Lemmatization uses vocabulary and grammatical analysis to return the meaningful base (dictionary) form of a word.

Example: running → run (as a verb), better → good (as an adjective)

In short, stemming is faster but less accurate, while lemmatization is slower but more linguistically correct.
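
The contrast can be made concrete with a deliberately tiny sketch. Both functions below are toys written for this illustration: `SUFFIXES` and `LEMMAS` are invented tables, `toy_stem` is not the Porter algorithm, and `toy_lemmatize` stands in for a real lemmatizer that would consult a full vocabulary and part-of-speech information.

```python
# Toy suffix list and lemma dictionary, invented for this illustration.
SUFFIXES = ("ing", "es", "ed", "s")
LEMMAS = {"running": "run", "studies": "study", "better": "good"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, without checking the result is a word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in a dictionary of known base forms."""
    return LEMMAS.get(word, word)


for word in ("running", "studies", "better"):
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
# running: stem=runn, lemma=run
# studies: stem=studi, lemma=study
# better: stem=better, lemma=good
```

The rule-based stemmer is cheap but produces fragments such as "runn" and cannot relate "better" to "good", while the dictionary lookup returns real words at the cost of needing linguistic resources. Real stemmers add repair rules on top of suffix stripping, so their output is cleaner than this toy's.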

In [1]:
# Question 6: Write a Python program that uses regular expressions (regex) to extract all
# email addresses from the following block of text:
# “Hello team, please contact us at support@xyz.com for technical issues, or reach out to
# our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
# via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

import re

text = """Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz."""

# Regex pattern for email addresses
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find all email addresses
emails = re.findall(pattern, text)

print(emails)

['support@xyz.com', 'hr@xyz.com', 'john.doe@xyz.org', 'jenny_clarke126@mail.co.us', 'partners@xyz.biz']


In [4]:
# Question 7: Given the sample paragraph below, perform string tokenization and
# frequency distribution using Python and NLTK:
# “Natural Language Processing (NLP) is a fascinating field that combines linguistics,
# computer science, and artificial intelligence. It enables machines to understand,
# interpret, and generate human language. Applications of NLP include chatbots,
# sentiment analysis, and machine translation. As technology advances, the role of NLP
# in modern solutions is becoming increasingly critical.”

import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

# Download the tokenizer models used by word_tokenize
nltk.download('punkt_tab')

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Lowercase the text, then split it into word and punctuation tokens
tokens = word_tokenize(text.lower())

# Count how often each token occurs
freq_dist = FreqDist(tokens)

print("Tokens:")
print(tokens)

print("Frequency:")
for word, freq in freq_dist.items():
    print(f"{word}: {freq}")

Tokens:
['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'it', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'applications', 'of', 'nlp', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'as', 'technology', 'advances', ',', 'the', 'role', 'of', 'nlp', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']
Frequency:
natural: 1
language: 2
processing: 1
(: 1
nlp: 3
): 1
is: 2
a: 1
fascinating: 1
field: 1
that: 1
combines: 1
linguistics: 1
,: 7
computer: 1
science: 1
and: 3
artificial: 1
intelligence: 1
.: 4
it: 1
enables: 1
machines: 1
to: 1
understand: 1
interpret: 1
generate: 1
human: 1
applications: 1
of: 2
include: 1
chatbots: 1
sentiment: 1
analysis: 1
machine: 1
translation: 1
as: 1
technology: 1
advances: 1
the: 1
role: 1
in: 1
modern: 1
solutions: 1
becoming: 1
increasingly: 1
critical: 1

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
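
The frequency table above counts punctuation marks as tokens, so the comma ends up as the most frequent entry. A possible variation, sketched here with only the standard library, keeps alphabetic words and lists the most frequent ones first. The regex tokenizer is an assumption made for this sketch and is a simplification of `word_tokenize`, not a replacement for it.

```python
import re
from collections import Counter

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Keep runs of letters only, which drops punctuation and parentheses.
words = re.findall(r"[a-z]+", text.lower())

# Counter plays the same role as FreqDist for simple counting.
counts = Counter(words)

# Print the five most frequent words, highest count first.
for word, freq in counts.most_common(5):
    print(f"{word}: {freq}")
```

Sorting by count makes the dominant terms ("nlp" and "and", three occurrences each) visible at a glance.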


In [8]:
# Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels
# proper nouns in a given text.
import spacy

nlp = spacy.load("en_core_web_sm")

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

doc = nlp(text)

# Annotate each token that the tagger marks as a proper noun (PROPN)
print("proper noun")
for token in doc:
    if token.pos_ == "PROPN":
        print(f"Token: {token.text} - POS Tag: {token.pos_}")

proper noun
Token: Natural - POS Tag: PROPN
Token: Language - POS Tag: PROPN
Token: Processing - POS Tag: PROPN
Token: NLP - POS Tag: PROPN
Token: NLP - POS Tag: PROPN
Token: NLP - POS Tag: PROPN


In [11]:
# Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the
# following dataset consisting of example sentences:
# dataset = [
#  "Natural language processing enables computers to understand human language",
#  "Word embeddings are a type of word representation that allows words with similar
# meaning to have similar representation",
#  "Word2Vec is a popular word embedding technique used in many NLP applications",
#  "Text preprocessing is a critical step before training word embeddings",
#  "Tokenization and normalization help clean raw text for modeling"
# ]
# Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
# Gensim.

!pip install gensim

from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]

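# Tokenize and preprocess each sentence.
# simple_preprocess lowercases the text, keeps only alphabetic tokens
# (dropping punctuation and digits), and discards one-character tokens such as "a".
tokenized_data = [simple_preprocess(sentence) for sentence in dataset]

# Train Word2Vec on the tokenized sentences
model = Word2Vec(
    sentences=tokenized_data,
    vector_size=100,  # dimensionality of each word vector
    window=5,         # maximum distance between target and context word
    min_count=1,      # keep every word, since the corpus is tiny
    workers=4,        # number of training threads
)

# Get the vector for a single word
word_vector = model.wv["language"]

# Get the words most similar to "word"
similar_words = model.wv.most_similar("word")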

print(f"word vector: {word_vector}")
print(f"similar word: {similar_words}")

word vector: [ 8.1681199e-03 -4.4430327e-03  8.9854337e-03  8.2536647e-03
 -4.4352221e-03  3.0310510e-04  4.2744912e-03 -3.9263200e-03
 -5.5599655e-03 -6.5123225e-03 -6.7073823e-04 -2.9592158e-04
  4.4630850e-03 -2.4740540e-03 -1.7260908e-04  2.4618758e-03
  4.8675989e-03 -3.0808449e-05 -6.3394094e-03 -9.2608072e-03
  2.6657581e-05  6.6618943e-03  1.4660227e-03 -8.9665223e-03
 -7.9386048e-03  6.5519023e-03 -3.7856805e-03  6.2549924e-03
 -6.6810320e-03  8.4796622e-03 -6.5163244e-03  3.2880199e-03
 -1.0569858e-03 -6.7875278e-03 -3.2875966e-03 -1.1614120e-03
 -5.4709399e-03 -1.2113475e-03 -7.5633135e-03  2.6466595e-03
  9.0701487e-03 -2.3772502e-03 -9.7651005e-04  3.5135616e-03
  8.6650876e-03 -5.9218528e-03 -6.8875779e-03 -2.9329848e-03
  9.1476962e-03  8.6626766e-04 -8.6784009e-03 -1.4469790e-03
  9.4794659e-03 -7.5494875e-03 -5.3580985e-03  9.3165627e-03
 -8.9737261e-03  3.8259076e-03  6.6544057e-04  6.6607012e-03
  8.3127534e-03 -2.8507852e-03 -3.9923131e-03  8.8979173e-03
  2.0896459

**Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.**

First, I would clean and preprocess the customer reviews by removing noise such as HTML tags, URLs, punctuation, and stopwords, converting text to lowercase, and applying tokenization and lemmatization. This step ensures the text is consistent and ready for analysis, reducing ambiguity and improving model performance.

Next, I would analyze the processed text using NLP techniques like sentiment analysis to understand customer opinions and topic modeling to identify common themes or issues. I would then summarize and visualize these insights to help the fintech team make data-driven decisions and improve customer satisfaction.
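
The workflow above can be sketched end to end on a handful of made-up reviews. Everything named here is an assumption for illustration: the `REVIEWS` list, the `POSITIVE`, `NEGATIVE`, and `STOPWORDS` word sets, and the `clean` and `sentiment` helpers. The lexicon-based scorer is a simple stand-in for a trained sentiment model, and lemmatization and topic modeling are omitted to keep the example within the standard library.

```python
import html
import re
from collections import Counter

# Hypothetical sample reviews and word lists, for illustration only.
REVIEWS = [
    "<p>Love the app! Transfers are fast and easy.</p>",
    "Terrible support, my card was blocked. See https://example.com/ticket",
    "Fees are confusing and the app is slow.",
]
POSITIVE = {"love", "fast", "easy", "great"}
NEGATIVE = {"terrible", "blocked", "confusing", "slow"}
STOPWORDS = {"the", "are", "and", "my", "was", "is", "see", "a"}


def clean(review: str) -> list[str]:
    """Strip HTML tags and URLs, lowercase, tokenize, and drop stopwords."""
    text = html.unescape(review)
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def sentiment(tokens: list[str]) -> str:
    """Label a review by counting positive and negative lexicon hits."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = []
complaint_terms = Counter()
for review in REVIEWS:
    tokens = clean(review)
    label = sentiment(tokens)
    labels.append(label)
    # Collect vocabulary from negative reviews to surface recurring issues.
    if label == "negative":
        complaint_terms.update(tokens)

print(Counter(labels))
print(complaint_terms.most_common(5))
```

On real data, the sentiment counts would feed a dashboard or trend chart, and the terms gathered from negative reviews would point to the themes worth investigating first.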