Question 1: What is Computational Linguistics and how does it relate to NLP?

Ans

**Computational Linguistics** is an interdisciplinary field that studies how computers can understand, analyze, and generate human language using linguistic rules and computational methods. It combines knowledge from linguistics (such as grammar, syntax, and semantics) and computer science to model language in a structured way.

Computational Linguistics is closely related to **Natural Language Processing (NLP)**, and the two fields overlap heavily. Computational Linguistics leans toward modeling and explaining language itself, using both rule-based and statistical methods, while NLP focuses on engineering practical systems that process language data, today mostly with machine learning. In simple terms, Computational Linguistics provides the linguistic foundation, and NLP applies those concepts to build real-world applications like chatbots, machine translation, speech recognition, and sentiment analysis.


Question 2: Briefly describe the historical evolution of Natural Language Processing.

Ans

The historical evolution of Natural Language Processing (NLP) can be divided into four broad stages. In the 1950s and 1960s, NLP started with **rule-based systems**, where researchers used hand-written grammar rules for tasks like machine translation. These systems were limited because they required extensive manual effort and could not handle language variations well.

In the 1980s and 1990s, NLP shifted toward **statistical methods**, where machine learning techniques and probability models were used to analyze large text datasets. Algorithms like Hidden Markov Models (HMM) and n-grams became popular during this time.

In the 2000s, **machine learning-based approaches** became more advanced with the availability of large datasets and better computing power. Techniques such as Support Vector Machines (SVM) and decision trees were widely used.

After 2010, NLP entered the **deep learning era**, where neural networks like RNNs, LSTMs, and later Transformers significantly improved performance in tasks like translation, text generation, and sentiment analysis. Today, large pre-trained language models like BERT and GPT, both built on the Transformer architecture, can be adapted to many different NLP tasks instead of training a separate model from scratch for each one.


Question 3: List and explain three major use cases of NLP in today’s tech industry.


Ans

Three major use cases of NLP in today’s tech industry are **Chatbots and Virtual Assistants, Sentiment Analysis, and Machine Translation**.

**1. Chatbots and Virtual Assistants:**
NLP is used to build chatbots and voice assistants that can understand and respond to human language. These systems process user queries and provide relevant answers. For example, customer support chatbots on websites and virtual assistants like Alexa or Google Assistant use NLP to interact with users naturally.

**2. Sentiment Analysis:**
Sentiment analysis uses NLP to determine whether a piece of text expresses a positive, negative, or neutral opinion. Companies use it to analyze customer reviews, social media posts, and feedback to understand customer satisfaction and improve products or services.

**3. Machine Translation:**
NLP enables automatic translation of text from one language to another. Applications like Google Translate use NLP techniques to translate sentences while trying to preserve meaning and context.

These use cases show how NLP is widely applied in communication, business, and global technology platforms.


Question 4: What is text normalization and why is it essential in text processing tasks?

Ans

**Text normalization** is the process of converting raw text into a clean and consistent format so that it can be easily processed by machine learning models. It involves steps such as converting text to lowercase, removing punctuation and special characters, correcting spelling, removing stop words, and sometimes applying stemming or lemmatization.

Text normalization is essential in text processing tasks because raw text often contains variations that can confuse models. For example, words like “Running,” “running,” and “RUNNING” should be treated as the same word. By normalizing text, we reduce noise, improve data quality, and make the model more accurate and efficient. It helps in better feature extraction and improves the overall performance of NLP applications such as sentiment analysis, chatbots, and text classification.
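The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a production normalizer: the `normalize` function and the tiny `STOP_WORDS` set are hypothetical names made up for this example, and only ASCII punctuation is handled.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real projects usually
# take one from a library such as NLTK or spaCy.
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "of", "to"}

# Translation table that maps every ASCII punctuation character to a space.
# Non-ASCII punctuation (for example curly quotes) is not covered here.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text, remove_stop_words=True):
    """Lowercase, strip punctuation, collapse whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    tokens = text.split()  # split() also collapses repeated whitespace
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(normalize("Running, running, and RUNNING!"))
```

All three spellings of "running" end up as the same token, which is the effect described above. Stemming, lemmatization, and spelling correction are left out to keep the sketch short.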


Question 5: Compare and contrast stemming and lemmatization with suitable
examples.

Ans

**Stemming** and **Lemmatization** are text preprocessing techniques used in NLP to reduce words to their base or root form, but they work in different ways.

**Stemming** is a simple and fast process that strips affixes, usually suffixes, from words using heuristic rules, without considering the meaning of the word. It may sometimes produce incorrect or incomplete words. For example, the words *“playing,” “played,” and “plays”* may all be reduced to *“play.”* However, a word like *“studies”* might be reduced to *“studi,”* which is not a proper word.

**Lemmatization**, on the other hand, is a more advanced and accurate method. It uses a vocabulary and morphological analysis, usually together with the word's part of speech, to convert a word into its correct base form (called a lemma). For example, *“running”* as a verb becomes *“run,”* and *“better”* as an adjective becomes *“good.”* Because the lemma is looked up rather than chopped off, lemmatization usually produces meaningful words.

In summary, stemming is faster but less accurate, while lemmatization is slower but more accurate and linguistically correct.
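The contrast can be made concrete with a toy example. The suffix rules and the lemma dictionary below are invented for illustration only: `toy_stem` is not the Porter algorithm, and `LEMMAS` is a hypothetical three-entry lookup standing in for a real vocabulary such as WordNet.

```python
# Toy suffix rules, checked in order. NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("studies", "noun"): "study",
    ("running", "verb"): "run",
    ("better", "adj"): "good",
}


def toy_stem(word):
    """Chop a known suffix off the word without looking at its meaning."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters to remain so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word, pos):
    """Look the word up by part of speech; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


for w in ["playing", "played", "plays", "studies", "better"]:
    print(w, "-> stem:", toy_stem(w))

print("better -> lemma:", toy_lemmatize("better", "adj"))
print("studies -> lemma:", toy_lemmatize("studies", "noun"))
```

The stemmer leaves "better" unchanged and turns "studies" into "studi", while the lookup returns "good" and "study". The price is that the lemmatizer needs a dictionary and a part-of-speech tag for every word.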


Question 6: Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:
“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

In [1]:
import re

# Given text
text = """Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz."""

# Regular expression pattern for extracting email addresses
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find all email addresses
emails = re.findall(pattern, text)

# Print extracted emails
print("Extracted Email Addresses:")
for email in emails:
    print(email)

Extracted Email Addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz
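For readers who want to see what each part of the pattern does, here is the same expression written with `re.VERBOSE` so every piece carries a comment. The `find_emails` helper and its de-duplication step are additions for illustration, not part of the original answer, and the pattern remains a practical approximation rather than a full validator of email syntax.

```python
import re

EMAIL_RE = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+      # local part: letters, digits and a few symbols
    @                      # separator between local part and domain
    [a-zA-Z0-9.-]+         # domain labels, possibly including subdomains
    \.[a-zA-Z]{2,}         # final dot plus a top-level domain of 2+ letters
    """,
    re.VERBOSE,
)


def find_emails(text):
    """Return unique addresses in order of first appearance."""
    # dict.fromkeys keeps insertion order, so duplicates are dropped in place
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


sample = "Mail hr@xyz.com. Again hr@xyz.com, or jenny_clarke126@mail.co.us."
print(find_emails(sample))
```

The sentence-ending period after an address is not captured, because the last part of the pattern must end in letters and the regex engine backtracks to the last valid top-level domain.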


Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:
“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

In [5]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download the punkt_tab tokenizer data (only needed once)
nltk.download('punkt_tab')

# Given paragraph
text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Convert text to lowercase
text = text.lower()

# Tokenization
tokens = word_tokenize(text)

# Keep only alphabetic tokens (drops punctuation and numbers)
words = [word for word in tokens if word.isalpha()]

# Frequency Distribution
freq_dist = FreqDist(words)

# Print tokens
print("Tokens:")
print(words)

# Print frequency distribution
print("\nFrequency Distribution:")
for word, frequency in freq_dist.items():
    print(f"{word}: {frequency}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...


Tokens:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'it', 'enables', 'machines', 'to', 'understand', 'interpret', 'and', 'generate', 'human', 'language', 'applications', 'of', 'nlp', 'include', 'chatbots', 'sentiment', 'analysis', 'and', 'machine', 'translation', 'as', 'technology', 'advances', 'the', 'role', 'of', 'nlp', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical']

Frequency Distribution:
natural: 1
language: 2
processing: 1
nlp: 3
is: 2
a: 1
fascinating: 1
field: 1
that: 1
combines: 1
linguistics: 1
computer: 1
science: 1
and: 3
artificial: 1
intelligence: 1
it: 1
enables: 1
machines: 1
to: 1
understand: 1
interpret: 1
generate: 1
human: 1
applications: 1
of: 2
include: 1
chatbots: 1
sentiment: 1
analysis: 1
machine: 1
translation: 1
as: 1
technology: 1
advances: 1
the: 1
role: 1
in: 1
modern: 1
solutions: 1
becoming: 1
incr

[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
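As a rough cross-check that needs no downloads, a similar count can be produced with the standard library. This is a sketch under a simplifying assumption: tokens are plain runs of ASCII letters, which is cruder than NLTK's `word_tokenize` and will split contractions and drop accented characters. The function name `word_frequencies` is made up for this example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic tokens in the text."""
    # Lowercase first, then keep every run of ASCII letters as one token
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


sample = (
    "NLP enables machines to understand language. "
    "Applications of NLP include chatbots."
)

# most_common() sorts by count, highest first
for word, count in word_frequencies(sample).most_common(3):
    print(f"{word}: {count}")
```

`Counter.most_common()` is handy when only the top few words matter, whereas iterating over `items()` as in the cell above lists words in order of first appearance.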


Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels
proper nouns in a given text.

In [3]:
# Install spaCy first (if not installed)
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "John Doe works at Microsoft in New York. He recently visited India and met Sundar Pichai."

# Process text
doc = nlp(text)

print("Proper Nouns Identified:\n")

# Identify and label proper nouns
for token in doc:
    if token.pos_ == "PROPN":
        print(f"{token.text} --> Proper Noun")

Proper Nouns Identified:

John --> Proper Noun
Doe --> Proper Noun
Microsoft --> Proper Noun
New --> Proper Noun
York --> Proper Noun
India --> Proper Noun
Sundar --> Proper Noun
Pichai --> Proper Noun


Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the
following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar
meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim

In [6]:
# Install gensim if not installed
!pip install gensim

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

# Download punkt tokenizer data (only once); newer NLTK releases may need 'punkt_tab' instead
nltk.download('punkt')

# Given dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# Preprocessing: lowercase, tokenize, remove punctuation
processed_data = []

for sentence in dataset:
    sentence = sentence.lower()
    tokens = word_tokenize(sentence)
    tokens = [word for word in tokens if word.isalpha()]  # remove punctuation
    processed_data.append(tokens)

# Train Word2Vec model
model = Word2Vec(
    sentences=processed_data,
    vector_size=100,   # size of word vectors
    window=5,          # context window size
    min_count=1,       # include all words
    workers=4
)

# Example: Get vector for a word
print("Vector for 'language':\n", model.wv['language'])

# Example: Find similar words
print("\nWords similar to 'word':")
print(model.wv.most_similar('word'))

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0
Vector for 'language':
 [-9.5782308e-03  8.9439554e-03  4.1630305e-03  9.2365965e-03
  6.6424119e-03  2.9268002e-03  9.8025510e-03 -4.4268500e-03
 -6.8041803e-03  4.2267689e-03  3.7317213e-03 -5.6662858e-03
  9.7040217e-03 -3.5548718e-03  9.5477831e-03  8.3319913e-04
 -6.3350094e-03 -1.9779371e-03 -7.3799482e-03 -2.9804893e-03
  1.0420545e-03  9.4868196e-03  9.3578557e-03 -6.5933121e-03
  3.4730809e-03  2.2753577e-03 -2.4914318e-03 -9.2325537e-03
  1.0250713e-03 -8.1645567e-03  6.3203699e-03 -5.8030877e-03
  5.5381078e-03  9.8315831e-03 -1.5997297e-04  4.5317081e-03
 -1.810438

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.

Ans

As a data scientist at a fintech startup, I would follow a structured NLP pipeline to analyze thousands of customer reviews and extract useful insights.

First, I would **collect and organize the data**, ensuring all reviews are stored in a structured format like CSV or a database. Then, I would perform **data cleaning**: removing duplicates, handling missing values, converting text to lowercase, and removing special characters, numbers, and unnecessary spaces.

Next, I would apply **text preprocessing techniques** like tokenization, stop word removal, stemming or lemmatization, and text normalization to prepare the data for analysis. If needed, I would also perform spelling correction and remove irrelevant content like URLs or emojis.

After preprocessing, I would perform **exploratory text analysis**, such as generating word frequency distributions, word clouds, and identifying common keywords or phrases. Then, I would apply **sentiment analysis** to classify reviews as positive, negative, or neutral to understand customer satisfaction levels.

To gain deeper insights, I would use **topic modeling** (such as LDA) to identify common themes like payment issues, app performance, loan approvals, or customer support. I might also use **text classification models** to categorize feedback into predefined categories (e.g., complaints, feature requests, praise).

Finally, I would visualize the results using dashboards and charts, highlight major pain points, track sentiment trends over time, and provide actionable insights to product and business teams. This approach would help the company improve services, reduce customer complaints, and enhance overall user experience.
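The first few stages of this pipeline can be sketched in a few lines of standard-library Python. Everything here is a simplifying assumption made for illustration: the sample reviews are invented, the `POSITIVE` and `NEGATIVE` word lists are toy lexicons, and a real project would more likely use a trained sentiment model and a topic-modeling library than keyword counts.

```python
import re
from collections import Counter

# Hypothetical toy lexicons and stop words, for illustration only
POSITIVE = {"great", "fast", "easy", "helpful", "love"}
NEGATIVE = {"slow", "failed", "crash", "bad", "rejected"}
STOP_WORDS = {"the", "a", "is", "and", "to", "my", "was", "i", "it", "this"}


def clean(review):
    """Lowercase, remove URLs and non-letters, tokenize, drop stop words."""
    review = review.lower()
    review = re.sub(r"https?://\S+", " ", review)  # drop URLs
    review = re.sub(r"[^a-z\s]", " ", review)      # drop digits, punctuation, emojis
    return [t for t in review.split() if t not in STOP_WORDS]


def sentiment(tokens):
    """Label by counting lexicon hits: positive minus negative."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def analyze(reviews):
    """Deduplicate, skip empty entries, then count labels and terms."""
    unique = list(dict.fromkeys(r.strip() for r in reviews if r and r.strip()))
    labels, terms = Counter(), Counter()
    for review in unique:
        tokens = clean(review)
        labels[sentiment(tokens)] += 1
        terms.update(tokens)
    return labels, terms


reviews = [
    "Love the app, transfers are fast!",
    "Payment failed twice and support was slow.",
    "Payment failed twice and support was slow.",  # exact duplicate
    "",                                            # missing value
    "Loan status page is OK https://example.com",
]
labels, terms = analyze(reviews)
print(dict(labels))
print(terms.most_common(5))
```

The label counts would feed the sentiment trend charts, and the term counts are a crude stand-in for the keyword and topic analysis described above.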
