# NLP Assignment

Question 1: What is Computational Linguistics and how does it relate to NLP?
  - Computational linguistics is an interdisciplinary field that focuses on the computational aspects of language, while natural language processing (NLP) is a subfield that applies these principles to build practical applications. Computational linguistics is more theoretical, exploring the underlying linguistic structures and rules that computers must understand, whereas NLP is applied, aiming to create systems like chatbots and machine translators that can process and generate human language. The relationship is symbiotic; computational linguistics provides the theoretical foundation that enables NLP applications, and NLP's practical work provides data and insights that inform computational linguistics research.

Question 2: Briefly describe the historical evolution of Natural Language Processing.
  - The historical evolution of Natural Language Processing (NLP) progressed through three major paradigms: rule-based systems, statistical methods, and the modern deep learning revolution. This journey reflects a shift from human-coded instructions to machine learning models that learn from vast amounts of data.
  The Dawn of NLP: Rule-Based Systems (1950s-1970s):
  Key Milestones: The 1954 Georgetown-IBM experiment successfully translated 60 Russian sentences into English using a system that matched words and applied basic rules. In the mid-1960s, the chatbot ELIZA simulated conversation by recognizing keywords and using simple pattern-matching responses, though it did not genuinely "understand" human language.
  Limitations: These systems struggled with the inherent ambiguity, nuances, and exceptions of human language, leading to an "AI winter" in NLP research after the 1966 ALPAC report deemed machine translation efforts ineffective.
  The Statistical Revolution (1980s-2000s):
  Key Milestones: Techniques like Hidden Markov Models (HMMs) became crucial for speech recognition and part-of-speech tagging. Neural language models emerged in the early 2000s and laid the groundwork for word embeddings such as Word2Vec (2013), which represent words as numerical vectors that capture semantic relationships.
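  Applications: This era saw the rise of more practical applications, including improved search engines, spam filters, and the launch of Google Translate in 2006, which used statistical machine translation.
  The Deep Learning Revolution (2010s-present):
  Key Milestones: Recurrent neural networks, and later the Transformer architecture (2017), led to large pretrained language models such as BERT (2018) and the GPT series, which learn language representations from vast amounts of text rather than from hand-written rules or hand-engineered features.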

Question 3: List and explain three major use cases of NLP in today’s tech industry.
  - Text Classification:
  One major use case of Natural Language Processing (NLP) is text classification, which involves programmatically categorizing textual content into predefined groups. This is used extensively in industry for things like spam detection in email and sentiment analysis on social media reviews, allowing systems to automatically filter content or gauge public opinion at scale.
  Named Entity Recognition (NER):
  Another key application is Named Entity Recognition (NER), a process that identifies and categorizes key information (entities) within unstructured text. This powers applications such as customer support systems that automatically extract crucial details like product names, tracking numbers, or personal details, streamlining data entry and improving search functionality.
  Machine Translation:
  Finally, machine translation utilizes NLP to automatically translate text or speech from one natural language to another while aiming to preserve both meaning and context. Services like Google Translate rely on sophisticated NLP models to break down source language syntax and semantics, enabling seamless global communication for individuals and businesses alike.
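
The text classification use case above can be illustrated with a minimal sketch of a multinomial Naive Bayes spam filter. The training data, the function names `train_naive_bayes` and `classify`, and the choice of Naive Bayes itself are assumptions made for illustration; production systems typically train on far larger corpora and often use other model families.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (text, label) pairs.
TRAINING_DATA = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for tomorrow", "ham"),
    ("project report due tomorrow", "ham"),
]

def train_naive_bayes(examples):
    """Count word occurrences per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def classify(text, word_counts, label_counts, vocabulary):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothed log likelihood of the word given the label.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(TRAINING_DATA)
print(classify("claim your free money", *model))
print(classify("report for the meeting", *model))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and add-one smoothing keeps a word that is unseen for one label from zeroing out that label's score.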

Question 4: What is text normalization and why is it essential in text processing tasks?
  - Text normalization is a pre-processing step that converts raw text into a consistent, standard format to improve processing efficiency and accuracy. It is essential because it reduces complexity by handling variations like different cases (e.g., "Hello" and "hello"), punctuation, and word forms (e.g., "running" and "ran"), allowing models to treat them as the same. This standardization reduces vocabulary size, decreases dimensionality, and improves the model's ability to generalize and perform better across various Natural Language Processing (NLP) tasks.
  Text normalization involves a series of techniques to standardize text into a consistent format, such as:
  Case normalization: Converting all letters to a single case, usually lowercase, so that "Apple" and "apple" are treated as the same word.
  Punctuation removal: Removing punctuation marks like commas, periods, and question marks, which may not be relevant for many analyses.
  Stemming and lemmatization: Reducing words to their root or base form. Stemming chops off endings (e.g., "running" becomes "run"), while lemmatization uses a dictionary to find the base form (lemma).
  Stop word removal: Removing common words like "a," "the," and "is," which often do not carry significant meaning.
  It is essential because:
  Improves accuracy: By standardizing words like "run," "runs," and "running" into a single form, models can better understand the underlying meaning, leading to more accurate results.
  Reduces data complexity: Normalization significantly reduces the number of unique words (vocabulary), which is crucial for models that have limitations on vocabulary size.
  Decreases dimensionality: With fewer unique terms, the complexity of the data is reduced, making it more efficient for models to process and analyze.
  Enhances generalization: By treating variations of the same word identically, models can generalize better and are less likely to be confused by different but semantically identical inputs.
  Increases efficiency: Reducing the size and complexity of the data allows machine learning models to train faster and more efficiently.
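
As a rough illustration of the vocabulary reduction described above, the following sketch applies case normalization, punctuation removal, and stop word removal to a short string and compares the number of unique tokens before and after. The `STOP_WORDS` set, the `normalize` function, and the sample sentence are hypothetical and chosen only for the example.

```python
import re

# Hypothetical stop word list; real projects typically use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "and"}

def normalize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation characters
    return [tok for tok in text.split() if tok not in STOP_WORDS]

raw = "The Apple is red. An apple a day! APPLE, apple..."

raw_vocab = set(raw.split())              # naive whitespace tokens
normalized_vocab = set(normalize(raw))    # tokens after normalization

print(len(raw_vocab), "unique raw tokens")
print(len(normalized_vocab), "unique normalized tokens")
print(sorted(normalized_vocab))
```

Without normalization, "Apple", "apple", "APPLE," and "apple..." each count as a separate vocabulary entry; after normalization they collapse into one.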

Question 5: Compare and contrast stemming and lemmatization with suitable examples.
  - Stemming is a faster, rule-based process that chops off word endings to get a root form, which may not be a valid dictionary word. Lemmatization uses a dictionary and morphological analysis to return the valid, base or dictionary form (lemma) of a word, considering its context.
  Comparison of Stemming and Lemmatization
  Both techniques aim to reduce inflectional forms of words to a common base form to aid in text processing tasks like information retrieval and text classification. However, their methodologies and outputs differ significantly.
  | Aspect | Lemmatization | Stemming |
  |---|---|---|
  | Methodology | Uses a dictionary and morphological analysis. | Uses heuristic, rule-based algorithms (like Porter or Snowball stemmers) to strip affixes. |
  | Output | The output, called a lemma, is always a meaningful word found in a dictionary. | The output, called a stem, may not be a linguistically valid or actual word. |
  | Context | It is context-aware, often requiring the word's part of speech (POS) to determine the correct base form. | It is context-agnostic, operating on a single word by applying rules without considering the surrounding text or meaning. |
  | Speed | Slower and more computationally intensive due to dictionary lookups and complex analysis. | Faster and less computationally intensive due to simple rule application. |
  | Accuracy | Generally more accurate and suitable for tasks requiring deep language understanding. | Less accurate and prone to errors like over-stemming (unrelated words reduced to the same stem) or under-stemming (related words not reduced). |
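
To make the contrast concrete, here is a toy sketch of both approaches. The suffix list is a deliberately crude assumption (it is not the Porter or Snowball algorithm), and `LEMMA_DICTIONARY` is a hypothetical five-entry lookup standing in for a real lexical resource; the point is only to show why stems can be non-words while lemmas are dictionary forms.

```python
# Toy suffix rules, an assumption for illustration (not the Porter algorithm).
SUFFIXES = ("ies", "ing", "ed", "es", "s")

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_DICTIONARY = {
    "studies": "study",
    "running": "run",
    "ran": "run",
    "better": "good",
    "mice": "mouse",
}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICTIONARY.get(word, word)

for word in ["studies", "running", "ran", "better", "mice"]:
    print(f"{word:>8}  stem={toy_stem(word):<8} lemma={toy_lemmatize(word)}")
```

In this sketch the rule-based stemmer turns "studies" into "stud" and "running" into "runn", neither of which is a word, and it leaves the irregular forms "ran", "better", and "mice" untouched. The dictionary lookup maps all five to valid base forms, at the cost of needing an entry for every form it should handle.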
          


In [1]:
# Question 6: Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text:

import re

# Compiling the regex pattern for email validation
regex = re.compile(
    r"(?i)"  # Case-insensitive matching
    r"(?:[A-Z0-9!#$%&'*+/=?^_`{|}~-]+"  # Unquoted local part
    r"(?:\.[A-Z0-9!#$%&'*+/=?^_`{|}~-]+)*"  # Dot-separated atoms in local part
    r"|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]"  # Quoted strings
    r"|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")"  # Escaped characters in local part
    r"@"  # Separator
    r"[A-Z0-9](?:[A-Z0-9-]*[A-Z0-9])?"  # First domain label
    # One or more further dot-separated labels, ending with the top-level domain.
    # The dot is inside the repeated group so multi-part domains like "co.uk" match.
    r"(?:\.[A-Z0-9](?:[A-Z0-9-]*[A-Z0-9])?)+"
)

def isValid(email):
    """Check if the given email address is valid."""
    return "Valid email" if regex.fullmatch(email) else "Invalid email"

# Example Usage
print(isValid("name.surname@gmail.com"))
print(isValid("anonymous123@yahoo.co.uk"))
print(isValid("anonymous123@...uk"))
print(isValid("...@domain.us"))

Valid email
Valid email
Invalid email
Invalid email
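
The cell above validates individual addresses, whereas the question asks for extraction from a block of text. Since the original text block is not included here, the following is a minimal sketch using `re.findall` on a hypothetical sample string. The simplified pattern and the names `EMAIL_PATTERN`, `extract_emails`, and `sample_text` are assumptions for illustration and do not cover every address form the stricter pattern accepts.

```python
import re

# Simplified pattern (an assumption, not a full RFC 5322 implementation):
# a local part, "@", then at least two dot-separated domain labels.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)

# Hypothetical sample text standing in for the assignment's text block.
sample_text = (
    "Contact support@example.com or sales.team@example.co.uk for help. "
    "The address user@localhost has no top-level domain and is skipped."
)

print(extract_emails(sample_text))
```

The pattern uses only non-capturing groups, so `findall` returns whole matches rather than group contents.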


In [1]:
# Question 7: Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Ensure the tokenizer models are downloaded.
# nltk.data.find raises LookupError when a resource is missing.
# Newer NLTK releases load 'punkt_tab' instead of 'punkt', so both are checked.
for resource in ("punkt", "punkt_tab"):
    try:
        nltk.data.find(f"tokenizers/{resource}")
    except LookupError:
        nltk.download(resource)

# Sample paragraph
paragraph = "Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

# 1. Tokenization
tokens = word_tokenize(paragraph)
print("Tokens:")
print(tokens)

# 2. Frequency Distribution
fdist = FreqDist(tokens)
print("\nFrequency Distribution (Top 10):")
print(fdist.most_common(10))

In [2]:
# Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.

import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

def extract_proper_nouns_spacy(text):
    """
    Identifies and labels proper nouns in a given text using spaCy.
    """
    doc = nlp(text)
    proper_nouns = []
    for token in doc:
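        # Proper nouns carry the coarse-grained POS tag 'PROPN' in token.pos_
        # (the fine-grained tags 'NNP'/'NNPS' live in token.tag_).
        # Tokens inside any named entity are also included; note that this can
        # pull in entity types that are not proper nouns (e.g., dates or numbers).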
        if token.pos_ == "PROPN" or token.ent_type_ != "":
            proper_nouns.append((token.text, "PROPER_NOUN"))
    return proper_nouns

# Example usage
text = "Apple Inc. is a technology company based in Cupertino, California. Tim Cook is the CEO."
identified_nouns = extract_proper_nouns_spacy(text)
print(identified_nouns)




[('Apple', 'PROPER_NOUN'), ('Inc.', 'PROPER_NOUN'), ('Cupertino', 'PROPER_NOUN'), ('California', 'PROPER_NOUN'), ('Tim', 'PROPER_NOUN'), ('Cook', 'PROPER_NOUN'), ('CEO', 'PROPER_NOUN')]


In [3]:
# Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences.

# Requires the gensim package (e.g., pip install gensim)
import re

from gensim.models import Word2Vec

# 1. Define the dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# 2. Preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove everything except lowercase letters, digits, and whitespace
    # (digits are kept so that "word2vec" stays a single token in the vocabulary)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Tokenize (split into words) and remove short words
    tokens = [word for word in text.split() if len(word) > 2]
    return tokens

# 3. Tokenize and preprocess the entire dataset
processed_sentences = [preprocess_text(sentence) for sentence in dataset]

# Display the processed sentences
print("Processed sentences (tokens):")
for sentence in processed_sentences:
    print(sentence)
print("-" * 30)

# 4. Create the Word2Vec model (no corpus is passed here, so nothing is trained yet)
# Parameters explained:
#   vector_size: Dimensionality of the word vectors (e.g., 20)
#   window: Maximum distance between the current and predicted word within a sentence (e.g., 5)
#   min_count: Ignores all words with a frequency lower than this (e.g., 1)
#   sg: Training algorithm: 0 for CBOW (default), 1 for Skip-gram (e.g., 1 here)
model = Word2Vec(vector_size=20, window=5, min_count=1, sg=1)

# 5. Build the vocabulary from the pre-processed sentences (list of lists of words)
model.build_vocab(processed_sentences)

# 6. Train the model
# Passing sentences= to the constructor would perform steps 5 and 6 automatically;
# they are kept separate here so each step is visible and training runs only once.
model.train(processed_sentences, total_examples=model.corpus_count, epochs=10)

# 7. Demonstrate the model (optional)
print("Model vocabulary size:", len(model.wv.index_to_key))
print("Vector for the word 'language':\n", model.wv['language'])
print("-" * 30)

# Example of finding similar words
try:
    similar_words = model.wv.most_similar('word2vec', topn=2)
    print("Words most similar to 'word2vec':", similar_words)
except KeyError as e:
    print(f"Word not in vocabulary or not enough context to find similarities: {e}")



Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.
  - Data Cleaning and Preprocessing
To begin, you must clean and preprocess the raw text data to make it suitable for NLP analysis. [1] This involves several crucial steps to standardize the text.
Remove Noise: Eliminate irrelevant information such as HTML tags, URLs, numbers, and special characters, as these do not contribute to the semantic meaning of the reviews.
Case Conversion: Convert all text to lowercase to ensure consistency and prevent the same word in different cases from being treated as separate tokens.
Tokenization: Break down the continuous text into individual words or phrases (tokens) which are the fundamental units for NLP.
Stop Word Removal: Filter out common, non-informative words like "the," "a," "is," which are frequent but carry little analytical value. [1]
Lemmatization/Stemming: Reduce words to their base or root form (e.g., a lemmatizer maps "running," "runs," and "ran" to "run"). Lemmatization is often preferred because it uses a dictionary to return a valid word (lemma), whereas a stemmer only strips affixes and would leave an irregular form such as "ran" unchanged. [1]
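
The noise-removal steps above can be sketched for a single review as follows. The regular expressions, the `STOP_WORDS` set, the `clean_review` function, and the sample review are all hypothetical; lemmatization is omitted because it would need an external dictionary resource.

```python
import re

# Hypothetical stop word list for illustration.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "it", "i", "my"}

def clean_review(raw_review):
    """Strip HTML tags and URLs, lowercase, keep letters only, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_review)   # HTML tags
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)        # numbers, punctuation, symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]

review = "<p>The app is GREAT!</p> Fees were refunded in 2 days, see https://example.com/help"
print(clean_review(review))
```

URLs are removed before punctuation so that fragments such as "https" or "example" do not survive as tokens. Whether numbers should be dropped is a judgment call: in fintech feedback, amounts and durations may carry signal worth keeping.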
Feature Engineering and Insight Extraction
Once the data is clean, you can use various NLP techniques to extract meaningful insights. These methods transform the text into numerical formats that machine learning models can understand.
Sentiment Analysis: Use models to classify reviews as positive, negative, or neutral. [1] This provides an overall understanding of customer satisfaction and helps identify general trends in feedback.
Topic Modeling: Employ techniques like Latent Dirichlet Allocation (LDA) to automatically discover abstract "topics" present in the review collection. [1] This helps identify common themes, such as "app usability," "fees," or "customer support," allowing you to pinpoint specific areas for improvement.
Text Vectorization: Convert the processed text into numerical vectors using methods like TF-IDF (Term Frequency-Inverse Document Frequency) or modern embeddings (word embeddings). [1] This numerical representation is essential for quantitative analysis and subsequent modeling.
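
As a minimal sketch of the sentiment analysis and vectorization steps, the code below labels pre-cleaned reviews with a tiny word lexicon and computes unsmoothed TF-IDF weights by hand. The review tokens, the `POSITIVE` and `NEGATIVE` word sets, and the function names are hypothetical; real pipelines would more likely use trained sentiment models and a library vectorizer, and topic modeling is not covered here.

```python
import math
from collections import Counter

# Hypothetical pre-cleaned, tokenized reviews.
reviews = [
    ["app", "easy", "great"],
    ["fees", "high", "bad"],
    ["support", "slow", "fees", "bad"],
]

# Hypothetical sentiment lexicon for illustration only.
POSITIVE = {"great", "easy", "fast"}
NEGATIVE = {"bad", "slow", "high"}

def sentiment(tokens):
    """Label a review by counting lexicon hits."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

labels = [sentiment(r) for r in reviews]
vectors = tf_idf(reviews)

print(labels)
for label, vec in zip(labels, vectors):
    top_terms = sorted(vec, key=vec.get, reverse=True)[:2]
    print(label, top_terms)
```

Terms that appear in fewer reviews receive a higher weight, so a word like "support" stands out more than "fees", which occurs in two of the three reviews. With this unsmoothed formula, a term present in every review would get a weight of zero.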
