
## NLP Introduction & Text Processing Assignment  
----------------------------------------


## Question 1: What is Computational Linguistics and how does it relate to NLP?

### Answer:

Computational Linguistics is an interdisciplinary field that combines linguistics and computer science to analyze and model natural language using computational methods.

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that applies computational linguistics techniques to build practical applications such as chatbots, translation systems, and sentiment analysis tools.

Computational Linguistics provides the theoretical foundation, while NLP focuses on real-world applications.

## Question 2: Historical Evolution of NLP

### Answer:

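1. **Rule-Based Systems (1950s–1980s)**  
   Relied on hand-written grammar rules and lexicons.

2. **Statistical NLP (1990s–2000s)**  
   Used probabilistic models estimated from corpora, such as Hidden Markov Models (HMMs) and Naive Bayes.

3. **Machine Learning Era (2000s–early 2010s)**  
   Feature-based classifiers such as SVMs and decision trees became standard. This period overlaps with the statistical era.

4. **Deep Learning Era (2010s–Present)**  
   Neural networks and learned representations such as Word2Vec (2013), BERT (2018), and GPT (2018) reshaped the field.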

## Question 3: List and explain three major use cases of NLP in today’s tech industry.

### Answer:

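1. **Chatbots & Virtual Assistants** – Interpret user queries and reply in natural language, for example to automate customer support.
2. **Sentiment Analysis** – Classify the opinion expressed in reviews, social media posts, or survey responses as positive, negative, or neutral.
3. **Machine Translation** – Automatically translate text from one language to another.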

## Question 4: What is text normalization and why is it essential in text processing tasks?

### Answer:

Text normalization is the process of converting text into a consistent format.

Steps include:
- Lowercasing
- Removing punctuation
- Removing stopwords
- Stemming or lemmatization

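Normalization is essential because surface variants such as "Run", "run", and "run!" would otherwise be treated as different tokens. Collapsing them shrinks the vocabulary and reduces noise, which usually improves downstream model accuracy.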
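
The first three steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the `STOPWORDS` set and the `normalize` function are hypothetical names introduced here, the stopword list is a tiny hand-picked sample (libraries such as NLTK ship much fuller lists), and stemming or lemmatization is left out.

```python
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of", "in"}

def normalize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]

sample = "The App is GREAT, and the support team is quick to respond!"
print(normalize(sample))
# ['app', 'great', 'support', 'team', 'quick', 'respond']
```

Differences in case and punctuation disappear, and only the content-bearing words remain.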

## Question 5: Compare and contrast stemming and lemmatization with suitable examples.

### Answer:

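| Feature | Stemming (Porter stemmer) | Lemmatization |
|----------|------------|----------------|
| Method | Strips suffixes with heuristic rules | Looks up the dictionary form (lemma), using the part of speech |
| Speed | Fast | Slower |
| Accuracy | Less accurate; output may not be a real word | More accurate; output is a valid word |
| Example (running, verb) | run | run |
| Example (studies) | studi | study |
| Example (better, adjective) | better | good |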
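
The contrast can be made concrete with a toy sketch. Both functions below are illustrative assumptions rather than library code: `naive_stem` is a deliberately simple suffix stripper (real stemmers such as Porter's apply further rules, for example for doubled consonants), and `lookup_lemma` uses a three-entry hand-written table where a real lemmatizer would consult a full dictionary and the word's part of speech.

```python
# Suffixes tried in order by the toy stemmer (hypothetical, far from complete).
SUFFIXES = ("ing", "es", "ed", "s")

# Hand-written lemma table standing in for a real dictionary (an assumption).
LEMMA_TABLE = {"running": "run", "studies": "study", "better": "good"}

def naive_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ("running", "studies", "better"):
    print(f"{w:10} stem={naive_stem(w):8} lemma={lookup_lemma(w)}")
```

The rule-based stripper is fast but produces non-words such as `runn` and `studi`, and it cannot relate `better` to `good`. The lookup returns valid words but depends on having a dictionary entry.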

## Question 6: Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text:

“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

In [1]:
import re

text = """Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz."""

# Local part, "@", then a domain that ends in a dot plus a TLD of two or more letters
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = re.findall(pattern, text)

print("Extracted Emails:")
for email in emails:
    print(email)

Extracted Emails:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz
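
As a possible follow-up, the same character classes can be wrapped in named groups so that the extracted addresses can be summarized, for example by domain. This is a sketch under assumptions: the names `EMAIL_RE` and `domain_counts` and the shortened sample text are introduced here for illustration only.

```python
import re
from collections import Counter

text = ("Contact support@xyz.com or hr@xyz.com. John is at john.doe@xyz.org, "
        "Jenny at jenny_clarke126@mail.co.us; partners write to partners@xyz.biz.")

# Same character classes as the pattern above, with named groups added.
EMAIL_RE = re.compile(
    r"(?P<user>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"
)

# Count how many addresses belong to each domain (case-insensitive).
domain_counts = Counter(m.group("domain").lower() for m in EMAIL_RE.finditer(text))
print(domain_counts.most_common())
```

Because the final part of the pattern must end in letters, the sentence-ending period after `hr@xyz.com.` is not captured as part of the domain.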


## Question 7: Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:
“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download the tokenizer resources used by word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')   # newer NLTK releases look for this resource

text = """Natural Language Processing (NLP) is a fascinating field that combines
linguistics, computer science, and artificial intelligence. It enables machines
to understand, interpret, and generate human language. Applications of NLP
include chatbots, sentiment analysis, and machine translation. As technology
advances, the role of NLP in modern solutions is becoming increasingly critical."""

# Tokenization
tokens = word_tokenize(text.lower())

# Frequency Distribution
fdist = FreqDist(tokens)

print("Tokens:")
print(tokens)

print("\nTop 10 Most Common Words:")
print(fdist.most_common(10))


Tokens:
['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'it', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'applications', 'of', 'nlp', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'as', 'technology', 'advances', ',', 'the', 'role', 'of', 'nlp', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Top 10 Most Common Words:
[(',', 7), ('.', 4), ('nlp', 3), ('and', 3), ('language', 2), ('is', 2), ('of', 2), ('natural', 1), ('processing', 1), ('(', 1)]
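
In the output above, punctuation marks top the frequency list. One way to keep them out is to count only runs of letters. The sketch below does this with the standard library (`re` and `collections.Counter`) instead of NLTK, which is a substitution made here for self-containment. The regex `[a-z]+` is a simplifying assumption that would split hyphenated words or contractions, although this paragraph contains none.

```python
import re
from collections import Counter

text = """Natural Language Processing (NLP) is a fascinating field that combines
linguistics, computer science, and artificial intelligence. It enables machines
to understand, interpret, and generate human language. Applications of NLP
include chatbots, sentiment analysis, and machine translation. As technology
advances, the role of NLP in modern solutions is becoming increasingly critical."""

# Keep only runs of letters, so punctuation never becomes a token.
words = re.findall(r"[a-z]+", text.lower())

counts = Counter(words)
print(counts.most_common(5))
```

With punctuation removed, the most frequent tokens are `nlp` and `and` (three occurrences each), followed by `language`, `is`, and `of` (two each).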


## Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.

In [4]:
!pip install spacy
!python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

text = "John works at Microsoft in New York."

doc = nlp(text)

print("Proper Nouns Identified:")
for token in doc:
    if token.pos_ == "PROPN":
        print(token.text)

Proper Nouns Identified:
John
Microsoft
New
York
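
The output lists `New` and `York` separately because each token is tagged on its own. A possible refinement is to merge runs of adjacent proper nouns into one phrase. The sketch below shows only that merging step: the `tagged` list is hard-coded for illustration (the non-PROPN tags are assumed, not taken from the run above), and `merge_proper_nouns` is a hypothetical helper, not a spaCy function.

```python
from itertools import groupby

# (token, POS) pairs as a tagger might return them; hard-coded for illustration.
tagged = [
    ("John", "PROPN"), ("works", "VERB"), ("at", "ADP"),
    ("Microsoft", "PROPN"), ("in", "ADP"),
    ("New", "PROPN"), ("York", "PROPN"), (".", "PUNCT"),
]

def merge_proper_nouns(pairs):
    """Join runs of adjacent PROPN tokens into single phrases."""
    phrases = []
    for is_propn, group in groupby(pairs, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            phrases.append(" ".join(token for token, _ in group))
    return phrases

print(merge_proper_nouns(tagged))
# ['John', 'Microsoft', 'New York']
```

With spaCy, the same pairs could be built as `[(token.text, token.pos_) for token in doc]`.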


## Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences:
dataset = [
"Natural language processing enables computers to understand human language",
"Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
"Word2Vec is a popular word embedding technique used in many NLP applications",
"Text preprocessing is a critical step before training word embeddings",
"Tokenization and normalization help clean raw text for modeling"
]

Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.

In [5]:
!pip install gensim

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')

dataset = [
"Natural language processing enables computers to understand human language",
"Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
"Word2Vec is a popular word embedding technique used in many NLP applications",
"Text preprocessing is a critical step before training word embeddings",
"Tokenization and normalization help clean raw text for modeling"
]

processed_data = [word_tokenize(sentence.lower()) for sentence in dataset]

model = Word2Vec(sentences=processed_data, vector_size=50, window=3, min_count=1, workers=4)

print("Vector for word 'word':")
print(model.wv['word'])

print("\nMost similar words to 'word':")
print(model.wv.most_similar('word'))

Vector for word 'word':
[-1.0403453e-03  5.0025561e-04  1.0216394e-02  1.8061617e-02
 -1.8591337e-02 -1.4261860e-02  1.2926748e-02  1.7953532e-02
 -1.0032079e-02 -7.5593363e-03  1.4738411e-02 -3.0673896e-03
 -9.0098204e-03  1.3120006e-02 -9.7130043e-03 -3.6296644e-03
  5.7513574e-03  1.9671449e-03 -1.6544867e-02 -1.8923346e-02
  1.4675597e-02  1.0175380e-02  1.3475348e-02  1.5095623e-03
  1.2712638e-02 -6.8120449e-03 -1.9073798e-03  1.1505213e-02
 -1.4995098e-02 -7.8761997e-03 -1.5040041e-02 -1.8752393e-03
  1.9053303e-02 -1.4657243e-02 -4.6755383e-03 -3.8616473e-03
  1.613911
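
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below shows that computation with the standard library. The three short vectors are made-up values for illustration, not output from the model above, and `cosine_similarity` is a helper defined here rather than a Gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings (real ones here have 50 dimensions).
vectors = {
    "word": [0.2, 0.1, 0.4],
    "embedding": [0.4, 0.2, 0.8],   # same direction as "word"
    "text": [-0.3, 0.5, 0.0],       # points elsewhere
}

query = vectors["word"]
ranked = sorted(
    ((other, cosine_similarity(query, vec))
     for other, vec in vectors.items() if other != "word"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Note that with only five training sentences the learned vectors stay close to their random initialization, so the neighbours returned by `most_similar` should not be read as meaningful.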



## Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.

In [6]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

reviews = [
"The app is very easy to use and secure.",
"Customer service is terrible and slow.",
"I love the new update, very smooth experience."
]

sia = SentimentIntensityAnalyzer()

for review in reviews:
    score = sia.polarity_scores(review)
    print("Review:", review)
    print("Sentiment Score:", score)
    print("------")


Review: The app is very easy to use and secure.
Sentiment Score: {'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.6801}
------
Review: Customer service is terrible and slow.
Sentiment Score: {'neg': 0.383, 'neu': 0.617, 'pos': 0.0, 'compound': -0.4767}
------
Review: I love the new update, very smooth experience.
Sentiment Score: {'neg': 0.0, 'neu': 0.588, 'pos': 0.412, 'compound': 0.6369}
------
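
To turn these scores into insights, a possible next step is to map each compound score to a label and aggregate the results. The sketch below assumes the commonly used VADER convention of ±0.05 as the cut-off. The `label` function and the `threshold` parameter are names introduced here, and the scores are copied from the output above rather than recomputed.

```python
from collections import Counter

# Compound scores copied from the output above.
scored_reviews = [
    ("The app is very easy to use and secure.", 0.6801),
    ("Customer service is terrible and slow.", -0.4767),
    ("I love the new update, very smooth experience.", 0.6369),
]

def label(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label (assumed cut-off)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Overall distribution of sentiment across all reviews.
summary = Counter(label(score) for _, score in scored_reviews)

# Negative reviews are usually the ones to route to the support team first.
negatives = [review for review, score in scored_reviews if label(score) == "negative"]

print(summary)
print(negatives)
```

On thousands of reviews, the same aggregation could be grouped by product area or release date to show where negative feedback concentrates.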
