# **Assignment: NLP Introduction & Text Processing**

**Question 1: What is Computational Linguistics and how does it relate to NLP?**

Ans: Computational Linguistics (CL) is an interdisciplinary field that deals with the statistical or rule-based modeling of natural language from a computational perspective. It is concerned with the theoretical aspects of using computers to understand and process human language.

Natural Language Processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Essentially, CL provides the linguistic theories and models, while NLP focuses on building practical systems and applications that can understand and process human language using those theories and models. You can think of CL as the science and NLP as the engineering that stems from that science.

**Question 2: Briefly describe the historical evolution of Natural Language Processing.**

Ans: The history of NLP can be broadly divided into several periods:

*   **Early Years (1950s-1960s):** Focused on rule-based approaches, with attempts to translate languages using grammatical rules and dictionaries. This era saw early systems such as the Georgetown-IBM experiment (1954).
*   **Statistical Revolution (1970s-1980s):** Rule-based systems remained dominant early in this period, but research gradually shifted towards statistical methods that estimate probabilities from text corpora, a shift that took hold by the late 1980s. This was driven by the availability of larger datasets and increased computational power.
*   **Machine Learning Era (1990s-2000s):** Saw significant advancements in machine learning techniques applied to NLP tasks, such as support vector machines, decision trees, and hidden Markov models. This led to improved performance in areas like part-of-speech tagging and named entity recognition.
*   **Deep Learning Era (2010s-Present):** Characterized by the rise of deep neural network models. Architectures like recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers have revolutionized NLP, achieving state-of-the-art results in various tasks, including machine translation, text generation, and sentiment analysis. The availability of massive datasets and powerful hardware like GPUs has been crucial for this progress.

**Question 3: List and explain three major use cases of NLP in today’s tech industry.**

Three major use cases of NLP in today's tech industry are:

1.  **Chatbots and Virtual Assistants:** NLP enables the development of chatbots and virtual assistants (like Siri, Alexa, and Google Assistant) that can understand and respond to human language. They are used in customer service, information retrieval, and task automation.
2.  **Sentiment Analysis:** NLP is used to analyze text data (social media posts, reviews, feedback) to determine the sentiment expressed (positive, negative, neutral). This is valuable for businesses to understand customer opinions, brand perception, and market trends.
3.  **Machine Translation:** NLP powers machine translation services (like Google Translate) that automatically translate text or speech from one language to another. This facilitates communication and access to information across language barriers.

**Question 4: What is text normalization and why is it essential in text processing tasks?**

Ans: Text normalization is the process of transforming text into a canonical (standard) form. This involves tasks such as converting text to lowercase, removing punctuation, correcting spelling errors, and handling numbers and dates in a consistent way.

It is essential in text processing tasks because it reduces variations in the text data, making it easier for algorithms to process and analyze. For example, without normalization, "Run," "run," and "RUN" would be treated as distinct tokens, and, unless stemming or lemmatization is also applied, so would "run," "running," and "ran," even though they refer to the same concept. Normalization helps to group these variations together, improving the accuracy and efficiency of NLP models.
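
The basic steps can be sketched with the standard library alone. This is a minimal illustration, not a complete normalizer: the function name `normalize` and the sample sentence are made up for this example, and it covers only lowercasing, punctuation removal, and whitespace cleanup (spelling correction and number handling are left out).

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = "  The App is GREAT!!!  Really, great. "
print(normalize(sample))
```

After this step, "GREAT" and "great" map to the same token. Grouping "run," "running," and "ran" would additionally need stemming or lemmatization.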

**Question 5: Compare and contrast stemming and lemmatization with suitable examples.**

Ans: Both stemming and lemmatization are techniques used in text processing to reduce words to their base or root form. However, they differ in their approach and the quality of the resulting root form:

**Stemming:**

*   **Approach:** Stemming is a simpler, rule-based process that chops off the ends of words to arrive at a root form. It's often a heuristic process that doesn't consider the word's context or meaning.
*   **Result:** The resulting stem may not be a valid word.
*   **Speed:** Generally faster than lemmatization.
*   **Use cases:** Often used in information retrieval systems where speed is crucial and perfect accuracy of the root form is not strictly necessary.

**Example of Stemming:**

*   "running" -> "run"
*   "flies" -> "fli"
*   "studies" -> "studi"

**Lemmatization:**

*   **Approach:** Lemmatization is a more sophisticated process that uses vocabulary and morphological analysis to return the base or dictionary form of a word, known as the lemma. It typically takes the word's part of speech, and sometimes its wider context, into account.
*   **Result:** The resulting lemma is always a valid word.
*   **Speed:** Generally slower than stemming as it requires more linguistic knowledge.
*   **Use cases:** Preferred in applications where the meaning and grammatical correctness of the root form are important, such as in natural language understanding and machine translation.

**Example of Lemmatization:**

*   "running" -> "run"
*   "flies" -> "fly"
*   "studies" -> "study"
*   "better" -> "good" (lemmatization can handle irregular forms, here given that the word is tagged as an adjective)

**Key Differences Summarized:**

| Feature        | Stemming                       | Lemmatization                     |
| :------------- | :----------------------------- | :-------------------------------- |
| **Approach**   | Rule-based, heuristic          | Dictionary and morphological analysis |
| **Result**     | May not be a valid word        | Always a valid word               |
| **Speed**      | Faster                         | Slower                            |
| **Context**    | Does not consider context      | Considers context                 |
| **Complexity** | Simpler                        | More complex                      |

In essence, lemmatization is more accurate and linguistically informed than stemming, but it is also computationally more expensive. The choice between stemming and lemmatization depends on the specific requirements of the NLP task.
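
The contrast can be sketched in a few lines of plain Python. Both functions below are toy stand-ins written for this illustration: `toy_stem` is a crude suffix stripper (not the Porter algorithm), and `toy_lemmatize` is a hand-written lookup table (a real lemmatizer uses a full dictionary and part-of-speech information). In practice, NLTK's `PorterStemmer` and `WordNetLemmatizer` are common choices.

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripping; a toy stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant left behind ("runn" -> "run").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical mini-dictionary standing in for a real lemma lexicon.
TOY_LEMMAS = {"running": "run", "flies": "fly", "studies": "study", "better": "good"}


def toy_lemmatize(word: str) -> str:
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["running", "flies", "studies", "better"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer produces "fli" and "studi" because it only applies string rules, while the lookup returns dictionary forms and can map the irregular "better" to "good".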

**Question 6: Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text:**

“Hello team, please contact us at support@xyz.com for technical issues, or reach out to our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

In [1]:
import re

text = "Hello team, please contact us at support@xyz.com for technical issues, or reach out to our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz."

# Regex pattern for email addresses
# This pattern is a simplified one for demonstration.
# A more robust pattern would be needed for real-world scenarios.
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find all email addresses in the text
email_addresses = re.findall(email_pattern, text)

# Print the extracted email addresses
print("Extracted email addresses:")
for email in email_addresses:
    print(email)

Extracted email addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz
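
To make the pattern easier to read, the same expression can be written with `re.VERBOSE`, which allows whitespace and comments inside the pattern. This is only a restatement of the simplified pattern above, not a full validator; the name `EMAIL_RE` is chosen for this example.

```python
import re

# Same simplified pattern as above, split into commented parts.
EMAIL_RE = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+   # local part: letters, digits, and . _ % + -
    @                   # separator
    [a-zA-Z0-9.-]+      # domain labels, possibly dotted (e.g. mail.co)
    \.[a-zA-Z]{2,}      # final dot and a top-level domain of 2+ letters
    """,
    re.VERBOSE,
)

text = (
    "Hello team, please contact us at support@xyz.com for technical issues, "
    "or reach out to our HR at hr@xyz.com. You can also connect with John at "
    "john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us. "
    "For partnership inquiries, email partners@xyz.biz."
)

emails = EMAIL_RE.findall(text)
print(emails)
```

Because the last part requires a dot followed by letters, the sentence-ending period after an address is not captured.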


**Question 7: Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:**

“Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical.”

In [7]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

text = "Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

# Tokenize the text
tokens = word_tokenize(text)

# Calculate frequency distribution
fdist = FreqDist(tokens)

# Print the tokens and frequency distribution
print("Tokens:")
print(tokens)
print("\nFrequency Distribution:")
print(fdist.most_common(10)) # Display the 10 most common tokens

Tokens:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Frequency Distribution:
[(',', 7), ('.', 4), ('NLP', 3), ('and', 3), ('is', 2), ('of', 2), ('Natural', 1), ('Language', 1), ('Processing', 1), ('(', 1)]
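
The most frequent "tokens" above are punctuation marks, and "Language" and "language" are counted separately. One way to get more informative counts is to lowercase the text and keep only alphabetic tokens. The sketch below assumes a simple regex tokenizer and `collections.Counter` in place of `word_tokenize` and `FreqDist`, so it runs without any NLTK downloads; the variable names are chosen for this example.

```python
import re
from collections import Counter

text = (
    "Natural Language Processing (NLP) is a fascinating field that combines "
    "linguistics, computer science, and artificial intelligence. It enables "
    "machines to understand, interpret, and generate human language. "
    "Applications of NLP include chatbots, sentiment analysis, and machine "
    "translation. As technology advances, the role of NLP in modern solutions "
    "is becoming increasingly critical."
)

# Lowercase first, then keep runs of letters only (drops punctuation).
words = re.findall(r"[a-z]+", text.lower())
word_counts = Counter(words)

print(word_counts.most_common(5))
```

With this filtering, "language" is counted twice because the capitalized and lowercase forms are merged.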


**Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.**

In [9]:
# Install spaCy
!pip install spacy

# Download a spaCy language model (e.g., English)
# You might need to restart the runtime after this step
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [10]:
import spacy

# Load the English language model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading en_core_web_sm model...")
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

text = "Barack Obama was the 44th President of the United States of America. He was born in Honolulu, Hawaii."

# Process the text with spaCy
doc = nlp(text)

# Identify and label proper nouns
print("Proper Nouns:")
for token in doc:
    if token.pos_ == "PROPN":
        print(f"{token.text} ({token.pos_})")

Proper Nouns:
Barack (PROPN)
Obama (PROPN)
President (PROPN)
United (PROPN)
States (PROPN)
America (PROPN)
Honolulu (PROPN)
Hawaii (PROPN)
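
The loop above labels each token on its own, so "Barack" and "Obama" appear as separate entries. A small helper can merge consecutive proper nouns into one span. The sketch below is an assumption about how such an annotator might look: `group_proper_nouns` is a made-up name, and it works on plain `(text, pos)` pairs so it can run without spaCy. The `PROPN` tags match the output above; the tags shown for the other tokens are illustrative only.

```python
from typing import Iterable, List, Tuple


def group_proper_nouns(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Merge runs of consecutive PROPN tokens into single strings."""
    spans: List[str] = []
    current: List[str] = []
    for text, pos in tagged:
        if pos == "PROPN":
            current.append(text)
        elif current:
            # A non-proper-noun token ends the current run.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# (text, pos) pairs for the first sentence; non-PROPN tags are illustrative.
tagged_tokens = [
    ("Barack", "PROPN"), ("Obama", "PROPN"), ("was", "AUX"), ("the", "DET"),
    ("44th", "ADJ"), ("President", "PROPN"), ("of", "ADP"), ("the", "DET"),
    ("United", "PROPN"), ("States", "PROPN"), ("of", "ADP"),
    ("America", "PROPN"), (".", "PUNCT"),
]

print(group_proper_nouns(tagged_tokens))
```

With a spaCy `doc`, the same helper could be called as `group_proper_nouns((token.text, token.pos_) for token in doc)`.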


**Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences:**

In [11]:
# Install Gensim
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [12]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')

dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# Preprocess the dataset (tokenize and lowercase)
processed_sentences = [word_tokenize(sentence.lower()) for sentence in dataset]

# Train the Word2Vec model
model = Word2Vec(processed_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Print the vocabulary size
print(f"Vocabulary size: {len(model.wv)}")

# Example: Get the vector for a word
word_vector = model.wv['language']
print(f"\nVector for 'language': {word_vector[:10]}...") # Print first 10 elements

# Example: Find similar words
similar_words = model.wv.most_similar('language')
print(f"\nWords similar to 'language': {similar_words}")

Vocabulary size: 46

Vector for 'language': [-0.00958061  0.00894419  0.00416531  0.00923353  0.00664613  0.00292132
  0.00980621 -0.004423   -0.0067969   0.00421717]...

Words similar to 'language': [('technique', 0.28540539741516113), ('text', 0.19906380772590637), ('processing', 0.19070622324943542), ('are', 0.10012832283973694), ('step', 0.09662030637264252), ('is', 0.07467351853847504), ('that', 0.07278265804052353), ('representation', 0.060818735510110855), ('meaning', 0.04675615578889847), ('modeling', 0.044769588857889175)]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
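
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that computation in plain Python on made-up two- and three-dimensional vectors (real Word2Vec vectors here have 100 dimensions); the function name `cosine_similarity` is chosen for this example. With only five training sentences, the similarity scores above are mostly noise, so low values such as 0.28 are expected.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up vectors: same direction, orthogonal, and opposite.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Values near 1 mean the vectors point in the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.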


**Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.**

Here is an outline of the steps to clean, process, and extract useful insights from thousands of customer reviews using NLP techniques:

1.  **Data Loading and Inspection:**
    *   Load the customer feedback data from its source (e.g., CSV file, database, API).
    *   Inspect the data to understand its structure, identify relevant columns (e.g., review text, rating, timestamp), and check for missing values or inconsistencies.

2.  **Text Cleaning and Preprocessing:**
    *   **Handle missing values:** Decide how to deal with missing reviews (e.g., remove rows, impute with a placeholder).
    *   **Convert to lowercase:** Standardize the text by converting everything to lowercase.
    *   **Remove punctuation and special characters:** Eliminate characters that do not contribute to the meaning of the text.
    *   **Remove numbers:** Decide whether to remove or replace numbers depending on the analysis goals.
    *   **Remove stopwords:** Eliminate common words (e.g., "the," "a," "is") that have little analytical value.
    *   **Perform stemming or lemmatization:** Reduce words to their root form to group similar words together (choose based on the task's needs).
    *   **Handle emojis and emoticons:** Decide how to process or remove these based on whether they convey sentiment.
    *   **Correct spelling errors:** Use spell-checking techniques if necessary.

3.  **Text Representation (Feature Extraction):**
    *   Convert the cleaned text into numerical representations that can be used by machine learning models. Common techniques include:
        *   **Bag-of-Words (BoW):** Represents text as a bag of its words, ignoring grammar and word order but keeping multiplicity.
        *   **TF-IDF (Term Frequency-Inverse Document Frequency):** Weights words based on their frequency in a document and their inverse frequency across the entire dataset, highlighting important words.
        *   **Word Embeddings (e.g., Word2Vec, GloVe, FastText):** Represents words as dense vectors in a continuous vector space, capturing semantic relationships between words.
        *   **Sentence Embeddings (e.g., Sentence-BERT):** Represents entire sentences as vectors, capturing the meaning of the sentence.

4.  **Exploratory Data Analysis (EDA) with NLP:**
    *   **Word clouds:** Visualize the most frequent words in the reviews.
    *   **N-gram analysis:** Analyze the frequency of sequences of words (bigrams, trigrams) to identify common phrases.
    *   **Sentiment analysis:** Apply sentiment analysis techniques (lexicon-based or machine learning-based) to determine the overall sentiment of reviews.
    *   **Topic modeling (e.g., LDA, NMF):** Discover underlying topics or themes present in the reviews.

5.  **Insight Extraction and Modeling:**
    *   **Sentiment analysis:** Categorize reviews as positive, negative, or neutral to understand customer satisfaction.
    *   **Aspect-based sentiment analysis:** Identify the specific aspects of the product or service that customers are talking about and their sentiment towards those aspects.
    *   **Keyphrase extraction:** Identify important phrases or keywords that summarize the main points of the reviews.
    *   **Text classification:** Train models to classify reviews into predefined categories (e.g., bug report, feature request, usability issue).
    *   **Relationship extraction:** Identify relationships between entities in the text (e.g., "app" is "slow").

6.  **Visualization and Reporting:**
    *   Visualize the insights gained from the analysis (e.g., sentiment distribution, topic trends, keyphrase frequency).
    *   Create reports or dashboards to communicate the findings to stakeholders, highlighting actionable insights for product improvement, customer service, or marketing.

7.  **Model Evaluation (if machine learning models are used):**
    *   Evaluate the performance of any trained models using appropriate metrics (e.g., accuracy, precision, recall, F1-score).

8.  **Monitoring and Iteration:**
    *   Continuously monitor new incoming reviews and periodically re-run the analysis to track changes in customer feedback and identify emerging issues or trends. Refine the NLP pipeline and models as needed.
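
A minimal sketch of the cleaning, term-counting, and lexicon-based sentiment steps from the outline, using only the standard library. Everything in it is a placeholder invented for illustration: the sample reviews, the tiny stopword list, and the positive and negative word lists. A real pipeline would load the actual reviews and would likely rely on libraries such as NLTK, spaCy, or scikit-learn for these steps.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative word lists (assumptions, not a real lexicon).
STOPWORDS = {"the", "a", "an", "is", "are", "and", "to", "it", "i", "of"}
POSITIVE = {"great", "love", "fast", "easy", "helpful"}
NEGATIVE = {"slow", "crashes", "confusing", "bad", "failed"}


def clean(review: str) -> List[str]:
    """Lowercase, drop non-letters, split on whitespace, remove stopwords."""
    review = re.sub(r"[^a-z\s]", " ", review.lower())
    return [tok for tok in review.split() if tok not in STOPWORDS]


def sentiment(tokens: List[str]) -> str:
    """Label by counting lexicon hits: positive minus negative."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up reviews standing in for the real dataset.
reviews = [
    "The app is GREAT and transfers are fast!",
    "Login is slow and the app crashes a lot.",
    "I opened an account yesterday.",
]

tokenized = [clean(r) for r in reviews]
labels = [sentiment(toks) for toks in tokenized]
term_counts = Counter(tok for toks in tokenized for tok in toks)

for review, label in zip(reviews, labels):
    print(f"{label:8} | {review}")
print(term_counts.most_common(3))
```

Lexicon counting is the simplest option and misses negation ("not great") and sarcasm, which is one reason the outline also lists machine learning-based sentiment analysis and model evaluation.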