# 1. What is Computational Linguistics and how does it relate to NLP?
- Computational Linguistics is a field that combines linguistics and computer science to understand, model, and process human language using computational methods. It studies how language works in formal, computable terms so that computers can interpret and generate natural language. Natural Language Processing (NLP) is the closely related, application-oriented side of this work, often described as a subfield of computational linguistics. It focuses on building practical systems that understand and respond to human language, such as chatbots and translation software. In short, computational linguistics provides the theoretical foundation, while NLP applies it to real-world language tasks.

# 2. Briefly describe the historical evolution of Natural Language Processing.
- Natural Language Processing (NLP) has evolved significantly since the 1950s. It began with rule-based systems in the 1950s and 1960s, exemplified by early efforts like the Georgetown-IBM machine translation experiment (1954) and the ELIZA chatbot (1966). In the 1980s and 1990s, statistical methods and machine learning became dominant, improving tasks such as speech recognition and part-of-speech tagging with models like Hidden Markov Models. From the 2010s onward, deep learning and neural networks took over, with breakthroughs such as word embeddings (Word2Vec, 2013) and transformer models like BERT and GPT (2018). Today, NLP continues to advance rapidly, with large-scale language models enabling sophisticated text generation and comprehension.

# 3. List and explain three major use cases of NLP in today’s tech industry.
  - Three major use cases of Natural Language Processing (NLP) in today’s tech industry are:

    - Customer Service & Virtual Agents: NLP powers AI chatbots and virtual assistants that enable near-human dialogue, responding to customer queries via voice, text, or email, improving response accuracy and personalized engagement.

    - Real-Time Translation & Subtitling: NLP enables real-time multilingual translation and auto-captioning, facilitating communication in international meetings, live streams, and cross-border e-commerce.

    - Sentiment Analysis & Market Intelligence: NLP analyzes large volumes of text data from social media, reviews, financial reports, and customer feedback to ascertain sentiment, trends, and actionable insights for businesses.

# 4. What is text normalization and why is it essential in text processing tasks?
  - Text normalization is the process of converting text into a consistent, standard format by removing variations such as capitalization differences, punctuation, contractions, and inflections. This makes the text easier to process in NLP tasks by reducing complexity and variability.

It is essential because raw text contains many unpredictable variations that can confuse algorithms. Normalization reduces noise in the data, lowers the dimensionality of input, and improves the efficiency and accuracy of models in tasks like text classification, search, and machine learning. Key techniques include lowercasing, stemming, lemmatization, and removing stop words.
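
The normalization steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the `CONTRACTIONS` and `STOP_WORDS` tables and the `normalize` function are hypothetical names with deliberately tiny contents, and a real project would more likely rely on a library such as NLTK or spaCy.

```python
import string

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration)
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "of"}

def normalize(text):
    """Lowercase, expand contractions, strip punctuation, drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Splitting on whitespace also collapses repeated spaces
    return [token for token in text.split() if token not in STOP_WORDS]

print(normalize("The App is GREAT, and I don't regret it!!"))
```

Two surface variants such as "The APP is great!" and "the app is great" reduce to the same token list, which is the reduction in variability that makes downstream counting and classification more reliable.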

# 5. Compare and contrast stemming and lemmatization with suitable examples.
  - Stemming and lemmatization are both text normalization techniques used in Natural Language Processing (NLP) to reduce words to their base forms, but they differ in approach and accuracy.

Stemming: It is a rule-based process that chops off word endings (suffixes) to get the root form (stem), which may not be a valid word. It is faster and simpler but can be less accurate, sometimes producing stems that are not meaningful (e.g., the Porter stemmer maps "studies" → "studi"). It does not consider the context or part of speech.

Lemmatization: It is a more complex, context-aware process that reduces words to their dictionary base form (lemma) by considering the part of speech and word meaning. It usually requires a dictionary and linguistic knowledge, producing valid words (e.g., "running" → "run"). It is slower but yields more accurate and meaningful results.

Example:

Stemming: "better" → "better" (a Porter-style stemmer finds no suffix rule that applies, so the link to "good" is missed)

Lemmatization: "better" → "good" (correct base form based on meaning)
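
The contrast can be made concrete with a toy sketch. The code below is not a real stemmer or lemmatizer: `SUFFIX_RULES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented for illustration, with a handful of suffix rules standing in for a rule-based stemmer and a small lookup table standing in for a dictionary-backed lemmatizer.

```python
# Toy suffix rules, checked in order; a real stemmer such as Porter has many more
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical mini-dictionary mapping (word, part of speech) to its lemma
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("caring", "v"): "care",
    ("better", "a"): "good",
}

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two characters of stem so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word, pos):
    """Look up the dictionary form for this word and part of speech."""
    return LEMMA_TABLE.get((word, pos), word)

for word, pos in [("studies", "n"), ("caring", "v"), ("better", "a")]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The stemmer only sees character patterns, so "caring" becomes "car", while the lookup needs both the word and its part of speech to return "care" or "good". In practice, NLTK's `PorterStemmer` and `WordNetLemmatizer` play these two roles.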

In [None]:
# 6. Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text:

import re

text = """Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org
and jenny via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz."""

# Regex pattern to match email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract email addresses
emails = re.findall(email_pattern, text)

print("Extracted email addresses:")
for email in emails:
    print(email)

In [None]:
# 7. Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:

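import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# word_tokenize needs the Punkt tokenizer data; newer NLTK releases look for 'punkt_tab'
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)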

# Sample paragraph
text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence.
It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis,
and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."""

# Tokenization
tokens = word_tokenize(text)

# Frequency Distribution
freq_dist = FreqDist(tokens)

# Display tokens and their frequency
print("Tokens and frequency distribution:")
for word, frequency in freq_dist.items():
    print(f"{word}: {frequency}")

In [None]:
# 8. Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.

import spacy
from spacy.language import Language
from spacy.tokens import Span

# Load the small English model (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Register the custom component under a name (spaCy v3 adds components by name)
@Language.component("proper_noun_annotator")
def proper_noun_annotator(doc):
    # Create one single-token span per token tagged 'PROPN' (proper noun)
    spans = [Span(doc, token.i, token.i + 1, label="PROPN_ENTITY")
             for token in doc if token.pos_ == 'PROPN']

    # Store the spans in a span group. Appending them to doc.ents would fail,
    # because they overlap the entities already set by the 'ner' component.
    doc.spans["proper_nouns"] = spans

    return doc

# Add the custom component to the pipeline
nlp.add_pipe("proper_noun_annotator", after='ner')

# Sample text
text = "John and Mary went to London to attend a conference. Meanwhile, Mr. Smith stayed in New York."

# Process the text
doc = nlp(text)

# Print the text spans labeled by the custom annotator
print("Proper Nouns identified by custom annotator:")
for span in doc.spans["proper_nouns"]:
    print(span.text, span.label_)

In [None]:
# 9. Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences.

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Dataset of example sentences
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# Tokenize and preprocess dataset
tokenized_data = [simple_preprocess(sentence) for sentence in dataset]

# Train a Word2Vec model
model = Word2Vec(sentences=tokenized_data, vector_size=50, window=3, min_count=1, workers=2, sg=1)

# Print some example word vectors and similarity
print("Vector for 'word':")
print(model.wv['word'])

print("\nMost similar words to 'language':")
print(model.wv.most_similar('language', topn=3))

In [None]:
# 10. Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.

# Outline with example code snippets illustrating key steps in analyzing customer feedback using NLP

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from textblob import TextBlob

# Step 1: Load customer feedback data (simulate with sample data)
customer_reviews = [
    "I love the service! Quick and easy transactions.",
    "The app crashes frequently. Very frustrating experience.",
    "Customer support was helpful and resolved my issue quickly.",
    "Too many hidden fees, not transparent at all.",
    "Excellent user interface and seamless navigation!"
]

# Step 2: Text Cleaning - lowercase, remove special characters, numbers, and extra spaces
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

cleaned_reviews = [clean_text(review) for review in customer_reviews]

# Step 3: Tokenization and Stop Words Removal
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data looked up by newer NLTK releases
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [t for t in tokens if t not in stop_words]
    return filtered_tokens

tokenized_reviews = [tokenize_and_remove_stopwords(review) for review in cleaned_reviews]

# Step 4: Frequency Analysis - Get most common words
all_tokens = [token for sublist in tokenized_reviews for token in sublist]
word_freq = Counter(all_tokens)
print("Most common words:", word_freq.most_common(5))

# Step 5: Sentiment Analysis using TextBlob
sentiments = [TextBlob(review).sentiment.polarity for review in cleaned_reviews]
print("Sentiment scores:", sentiments)

# Step 6: Visualize with Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_tokens))
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()