## NLP Introduction & Text Processing

### Q1. What is Computational Linguistics and how does it relate to NLP?

  **Computational Linguistics (CL)** is an interdisciplinary field that combines **linguistics, computer science, and artificial intelligence** to study and model **human language using computational methods**. Its main goal is to understand how language works and to create **formal, rule-based or statistical models** that represent linguistic structure such as syntax, semantics, morphology, and pragmatics.

  **Natural Language Processing (NLP)** is an **application-oriented field**, closely related to CL, that focuses on enabling computers to **process, understand, and generate human language** in practical systems.

  ### **Relationship between Computational Linguistics and NLP**

  | Aspect   | Computational Linguistics           | Natural Language Processing      |
  | -------- | ----------------------------------- | -------------------------------- |
  | Focus    | Theory and modeling of language     | Practical language applications  |
  | Approach | Linguistic rules, grammar, theory   | Algorithms, ML, deep learning    |
  | Goal     | Understand language computationally | Build language-enabled systems   |
  | Output   | Language models, grammars, theories | Chatbots, translators, analyzers |


  ### **Examples**

  * CL helps define **grammar rules** and **semantic structures**
  * NLP uses those rules to build:
    * Machine Translation (Google Translate)
    * Chatbots & Voice Assistants
    * Sentiment Analysis
    * Text Summarization
    * Speech Recognition


### Q2. Briefly describe the historical evolution of Natural Language Processing (NLP)

  The evolution of **Natural Language Processing (NLP)** can be broadly divided into **four major phases**, each driven by advances in computing power, data availability, and algorithms.


  ### **1. Rule-Based Era (1950s–1970s)**

  * Early NLP systems relied on **handcrafted linguistic rules and grammar**.
  * Language processing was done using **if–else rules, dictionaries, and syntax trees**.
  * Example tasks: basic machine translation and grammar checking.
  * **Limitation:** Rigid rules, hard to scale, and poor handling of ambiguity.

  ### **2. Statistical NLP Era (1980s–2000s)**

  * Shift from rules to **probability and statistics**.
  * Models learned patterns from **large text corpora**.
  * Techniques included **n-grams, Hidden Markov Models (HMMs), and probabilistic parsers**.
  * **Advantage:** Better handling of uncertainty and real-world language usage.


  ### **3. Machine Learning Era (2000s–2010s)**

  * Introduction of **supervised and unsupervised learning**.
  * Algorithms like **Naive Bayes, SVMs, Decision Trees** became common.
  * NLP tasks such as **spam detection, sentiment analysis, and text classification** improved significantly.
  * **Limitation:** Heavy feature engineering was required.


  ### **4. Deep Learning & Transformer Era (2015–Present)**

  * Use of **neural networks**, especially **RNNs, LSTMs, and Transformers**.
  * Models automatically learn features from data.
  * Enabled high performance in **translation, summarization, question answering, and chatbots**.
  * **Advantage:** State-of-the-art accuracy with minimal manual feature design.


### Q3. Three Major Use Cases of NLP in Today’s Tech Industry


  **1. Chatbots and Virtual Assistants**

  NLP enables systems to understand user queries, identify intent, and generate appropriate responses. These systems are widely used in customer support, banking, and e-commerce to provide instant and automated assistance.

  **Example:** Customer support chatbots, voice assistants.


  **2. Sentiment Analysis**

  Sentiment analysis uses NLP techniques to determine the emotional tone of text such as positive, negative, or neutral. It is commonly used to analyze customer reviews, social media posts, and feedback to understand user opinions.

  **Example:** Brand monitoring, product review analysis.


  **3. Machine Translation**

  NLP allows automatic translation of text from one language to another while preserving meaning and context. This helps businesses and users communicate across different languages.

  **Example:** Language translation tools, multilingual websites.



### Q4. What is Text Normalization and why is it essential in text processing tasks?

  **Text normalization** is the process of **cleaning and standardizing raw text** into a consistent format so that it can be effectively processed by NLP algorithms. It transforms text into a uniform representation by reducing variations that do not add meaningful information.


  ### **Common Text Normalization Techniques**

  * Converting text to **lowercase**
  * Removing **punctuation, special characters, and extra spaces**
  * **Tokenization** (splitting text into words)
  * Removing **stop words** (e.g., *is, the, and*)
  * **Stemming** or **lemmatization** (reducing words to their root form)
  * Expanding **contractions** (e.g., *don’t → do not*)
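
Several of the techniques listed above can be chained in a few lines of standard-library Python. The sketch below is only illustrative: the `CONTRACTIONS` and `STOP_WORDS` tables and the `normalize` function are hypothetical names made up for this example, the lists are deliberately tiny, and stemming or lemmatization is left out.

```python
import re
import string

# Tiny illustrative tables (assumptions for this sketch, not a real resource).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"is", "the", "and", "a", "to", "it"}


def normalize(text):
    """Lowercase, expand contractions, strip punctuation, tokenize, drop stop words."""
    text = text.lower().replace("’", "'")  # unify curly and straight apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace punctuation with spaces so "fast,easy" does not fuse into one token.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    tokens = text.split()  # whitespace split also collapses extra spaces
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(normalize("Don't worry, it's FAST and   easy!"))
# ['do', 'not', 'worry', 'fast', 'easy']
```

Contractions are expanded before punctuation is stripped; otherwise the apostrophe would be removed first and *don't* would no longer match the table.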


  ### **Why Text Normalization is Essential**

  * **Improves model accuracy** by reducing noise and inconsistencies
  * **Ensures uniform representation** of words (e.g., *Run, running, runs → run*)
  * **Reduces vocabulary size**, making models more efficient
  * **Enhances feature extraction** for machine learning algorithms
  * **Improves consistency** across large datasets


  ### **Example**

  Raw text:

  > *“Running, runs, and RUN!”*

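  After normalization (lowercasing, punctuation and stop-word removal, lemmatization), every token maps to the same form:

  > *“run run run”*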



### Q5. Compare and contrast Stemming and Lemmatization with suitable examples

  **Stemming** and **Lemmatization** are text normalization techniques used in NLP to reduce words to their base form, but they differ in approach and accuracy.


  ### **Stemming**

  * Reduces words by **removing suffixes** using simple rules.
  * Does **not consider grammar or word meaning**.
  * Output may **not be a valid dictionary word**.
  * Faster but less accurate.

  **Example:**

  * *running → run*
  * *studies → studi*
  * *better → better* (left unchanged by the Porter stemmer, which cannot relate it to *good*)


  ### **Lemmatization**

  * Reduces words to their **dictionary base form (lemma)**.
  * Considers **part of speech and context**.
  * Output is always a **valid word**.
  * Slower but more accurate.

  **Example:**

  * *running → run*
  * *studies → study*
  * *better → good* (when tagged as an adjective)


  ### **Comparison Table**

  | Aspect        | Stemming                  | Lemmatization                    |
  | ------------- | ------------------------- | -------------------------------- |
  | Approach      | Rule-based suffix removal | Dictionary + linguistic analysis |
  | Accuracy      | Lower                     | Higher                           |
  | Speed         | Faster                    | Slower                           |
  | Output        | May be invalid word       | Always valid word                |
  | POS awareness | No                        | Yes                              |
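
The contrast in the table can be made concrete with a toy implementation of each idea. This is a minimal sketch under loud assumptions: `SUFFIX_RULES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented here, the three suffix rules are not the Porter algorithm, and the lemma table stands in for a real lexicon. The outputs are those of these toy rules only.

```python
# Toy suffix rules, tried in order; a real stemmer has many more (assumption).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Toy lemma dictionary keyed by (word, part of speech); stands in for a lexicon.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("studies", "NOUN"): "study",
    ("better", "ADJ"): "good",
}


def toy_stem(word):
    """Strip the first matching suffix, then undouble a trailing consonant pair."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)] + replacement
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]  # "runn" -> "run"
            return stem
    return word  # no rule applied


def toy_lemmatize(word, pos):
    """Look the word up together with its POS tag; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


for word, pos in [("running", "VERB"), ("studies", "NOUN"), ("better", "ADJ")]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word, pos)}")
# running: stem=run, lemma=run
# studies: stem=studi, lemma=study
# better: stem=better, lemma=good
```

The stemmer only sees characters, so it can produce a non-word such as *studi*. The lemmatizer needs both a dictionary and a POS tag, which is where its extra cost comes from.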


In [2]:
'''
Q6. Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text:

“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”
'''

import re

text = """
Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.
"""

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = re.findall(pattern, text)
print(emails)


['support@xyz.com', 'hr@xyz.com', 'john.doe@xyz.org', 'jenny_clarke126@mail.co.us', 'partners@xyz.biz']


In [4]:
'''
Q7. Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:

“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”
'''


import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist


nltk.download('punkt')
nltk.download('punkt_tab')

text = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

text = text.lower()
tokens = word_tokenize(text)
tokens = [word for word in tokens if word not in string.punctuation]
freq_dist = FreqDist(tokens)


print("Tokens:")
print(tokens)

print("\nFrequency Distribution:")
print(freq_dist)

print("\nTop 10 Most Common Words:")
print(freq_dist.most_common(10))



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Tokens:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'it', 'enables', 'machines', 'to', 'understand', 'interpret', 'and', 'generate', 'human', 'language', 'applications', 'of', 'nlp', 'include', 'chatbots', 'sentiment', 'analysis', 'and', 'machine', 'translation', 'as', 'technology', 'advances', 'the', 'role', 'of', 'nlp', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical']

Frequency Distribution:
<FreqDist with 43 samples and 50 outcomes>

Top 10 Most Common Words:
[('nlp', 3), ('and', 3), ('language', 2), ('is', 2), ('of', 2), ('natural', 1), ('processing', 1), ('a', 1), ('fascinating', 1), ('field', 1)]


In [8]:
'''
Q8.  Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.
'''

import spacy
from spacy.language import Language
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

@Language.component("proper_noun_annotator")
def proper_noun_annotator(doc):
    spans = []
    for token in doc:
        if token.pos_ == "PROPN":
            span = Span(doc, token.i, token.i + 1, label="PROPER_NOUN")
            spans.append(span)
    doc.ents = spans
    return doc

nlp.add_pipe("proper_noun_annotator", last=True)

text = "John works at Google in New York and studies NLP at Stanford University."

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "->", ent.label_)


John -> PROPER_NOUN
Google -> PROPER_NOUN
New -> PROPER_NOUN
York -> PROPER_NOUN
NLP -> PROPER_NOUN
Stanford -> PROPER_NOUN
University -> PROPER_NOUN
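
The output above labels each proper-noun token separately, so "New York" and "Stanford University" are split in two. One possible refinement is to merge runs of adjacent `PROPN` tokens into a single span. The sketch below shows only the grouping logic on a hand-written list of `(text, pos)` pairs rather than a live spaCy `Doc`; `group_proper_nouns` is a hypothetical helper, and the tags are copied by hand to mirror the tokens printed above.

```python
def group_proper_nouns(tagged_tokens):
    """Merge runs of adjacent PROPN tokens into (start, end, text) spans.

    tagged_tokens is a list of (text, pos) pairs; end is exclusive.
    """
    spans = []
    start = None
    for i, (_, pos) in enumerate(tagged_tokens):
        if pos == "PROPN":
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:  # a run that reaches the end of the text
        spans.append((start, len(tagged_tokens)))
    return [(s, e, " ".join(t for t, _ in tagged_tokens[s:e])) for s, e in spans]


# Hand-written tags (assumption), not output produced by spaCy in this sketch.
tagged = [
    ("John", "PROPN"), ("works", "VERB"), ("at", "ADP"), ("Google", "PROPN"),
    ("in", "ADP"), ("New", "PROPN"), ("York", "PROPN"), ("and", "CCONJ"),
    ("studies", "VERB"), ("NLP", "PROPN"), ("at", "ADP"),
    ("Stanford", "PROPN"), ("University", "PROPN"), (".", "PUNCT"),
]

for start, end, text in group_proper_nouns(tagged):
    print(start, end, text)
# 0 1 John
# 3 4 Google
# 5 7 New York
# 9 10 NLP
# 11 13 Stanford University
```

Inside the spaCy component, each `(start, end)` pair could be passed to `Span(doc, start, end, label="PROPER_NOUN")` in place of the single-token spans.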


In [9]:
'''
Q9. Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar
meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.
'''

!pip install gensim


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [10]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
import string


In [11]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

processed_data = []

for sentence in dataset:
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(sentence)
    processed_data.append(tokens)
print(processed_data)


[['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language'], ['word', 'embeddings', 'are', 'a', 'type', 'of', 'word', 'representation', 'that', 'allows', 'words', 'with', 'similar', 'meaning', 'to', 'have', 'similar', 'representation'], ['word2vec', 'is', 'a', 'popular', 'word', 'embedding', 'technique', 'used', 'in', 'many', 'nlp', 'applications'], ['text', 'preprocessing', 'is', 'a', 'critical', 'step', 'before', 'training', 'word', 'embeddings'], ['tokenization', 'and', 'normalization', 'help', 'clean', 'raw', 'text', 'for', 'modeling']]


In [13]:
model = Word2Vec(
    sentences=processed_data,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4
)


In [14]:
vector = model.wv['language']
print("Vector size:", len(vector))

similar_words = model.wv.most_similar('word', topn=3)
print(similar_words)


Vector size: 100
[('tokenization', 0.21880948543548584), ('modeling', 0.21611614525318146), ('embedding', 0.19551844894886017)]
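
The scores returned by `most_similar` are cosine similarities between word vectors. With only five training sentences, values around 0.2 probably reflect random initialization more than meaning, and they are likely to change from run to run. The sketch below shows the measure itself in plain Python; `cosine_similarity` is a helper written for this example, and the vectors are made up rather than taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, not values from the Word2Vec model above.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(parallel, 6), orthogonal)
```

Vectors pointing in the same direction score close to 1, unrelated (orthogonal) vectors score 0, and the length of the vectors does not matter.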


### Q10. Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.



  As a data scientist analyzing **thousands of customer reviews**, the goal is to **clean text, process it, and extract actionable insights** such as common issues, customer sentiment, and frequently discussed topics.


  ## **Step-by-Step NLP Workflow**

  ### **1. Data Collection**

  * Collect customer reviews from app stores, emails, surveys, or support tickets.

  ### **2. Text Cleaning**

  * Convert text to lowercase
  * Remove punctuation, numbers, and special characters
  * Remove stopwords (e.g., *is, the, and*); in practice this is applied to the tokens produced in the next step

  ### **3. Tokenization**

  * Split text into individual words (tokens)

  ### **4. Normalization**

  * Apply lemmatization to reduce words to base form

  ### **5. Feature Extraction**

  * Use word frequency or TF-IDF to identify important terms

  ### **6. Sentiment Analysis**

  * Classify reviews as **positive, negative, or neutral**

  ### **7. Insight Generation**

  * Identify common complaints, popular features, and overall customer satisfaction
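
Step 5 mentions TF-IDF, which weights a term by how often it appears in one review and how rare it is across all reviews. The following is a minimal sketch of one common variant (`tf = count / document length`, `idf = log(N / df)`); libraries often use smoothed formulas that give different numbers. The `tf_idf` function and the three tokenized reviews are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Uses tf = count / doc length and idf = log(N / df), one common variant.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already-cleaned reviews (assumption for this sketch).
docs = [
    ["app", "easy", "fast"],
    ["app", "crash", "slow"],
    ["payment", "failed", "slow"],
]
second_review = tf_idf(docs)[1]
print({term: round(score, 3) for term, score in second_review.items()})
# {'app': 0.135, 'crash': 0.366, 'slow': 0.135}
```

In the second review, *crash* scores highest because it appears in no other review, while *app* and *slow* are shared with other reviews and are down-weighted.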


In [15]:
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

reviews = [
    "The app is very easy to use and transactions are super fast!",
    "Customer support is slow and the app crashes frequently.",
    "Great experience with instant loan approval.",
    "Poor service, my payment failed multiple times.",
    "User interface is clean but login issues are frustrating."
]

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()

processed_tokens = []
sentiment_results = []

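for review in reviews:
    # Clean: lowercase and strip punctuation
    review = review.lower()
    review = review.translate(str.maketrans('', '', string.punctuation))
    # Tokenize, drop stop words, and lemmatize what remains
    tokens = word_tokenize(review)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    processed_tokens.extend(tokens)
    # Note: VADER is scored on the cleaned text here; it also uses capitalization
    # and punctuation as intensity cues, so scores on the raw review may differ.
    sentiment = sia.polarity_scores(review)
    sentiment_results.append(sentiment)

freq_dist = FreqDist(processed_tokens)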

print("Most Common Words:")
print(freq_dist.most_common(10))

print("\nSentiment Scores:")
for score in sentiment_results:
    print(score)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Most Common Words:
[('app', 2), ('easy', 1), ('use', 1), ('transaction', 1), ('super', 1), ('fast', 1), ('customer', 1), ('support', 1), ('slow', 1), ('crash', 1)]

Sentiment Scores:
{'neg': 0.0, 'neu': 0.585, 'pos': 0.415, 'compound': 0.796}
{'neg': 0.0, 'neu': 0.748, 'pos': 0.252, 'compound': 0.4019}
{'neg': 0.0, 'neu': 0.357, 'pos': 0.643, 'compound': 0.802}
{'neg': 0.561, 'neu': 0.439, 'pos': 0.0, 'compound': -0.7506}
{'neg': 0.303, 'neu': 0.551, 'pos': 0.146, 'compound': -0.4588}
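
The scores above still need to be turned into the positive, negative, or neutral labels described in step 6. A minimal sketch follows, assuming the conventional VADER cut-offs of ±0.05 on the compound score; `label_sentiment` is a hypothetical helper, and the threshold would normally be tuned on labeled reviews.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the output above, in review order.
compound_scores = [0.796, 0.4019, 0.802, -0.7506, -0.4588]
labels = [label_sentiment(score) for score in compound_scores]
print(labels)
# ['positive', 'positive', 'positive', 'negative', 'negative']
```

The second review ("Customer support is slow and the app crashes frequently.") comes out as positive even though it reads as a complaint, possibly because the lexicon scores the word *support* positively. Cases like this are a reason to spot-check lexicon-based labels before reporting insights.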
