# Useful NLP Libraries & Networks | Assignment

**Question 1: Compare and contrast NLTK and spaCy in terms of features, ease of use, and performance.**

NLTK is primarily an educational and research-oriented library. It offers a wide range of NLP tools, including tokenization, stemming, parsing, and access to many corpora, but it often requires more code and manual setup, which makes it less convenient for production use. spaCy, by contrast, is designed for industrial, real-world applications: it ships fast, pre-trained pipelines for tasks such as POS tagging, NER, and dependency parsing behind a clean, consistent API. In terms of performance, spaCy is generally faster and more memory-efficient on large volumes of text, while NLTK is more flexible for learning concepts and experimentation but slower and less optimized for large-scale applications.

**Question 2: What is TextBlob and how does it simplify common NLP tasks like sentiment analysis and translation?**

TextBlob is a high-level Python NLP library built on top of NLTK and Pattern that provides a simple, intuitive API for common text processing tasks. It offers ready-to-use methods for sentiment analysis, part-of-speech tagging, noun phrase extraction, and spelling correction, often in just one or two lines of code. For example, sentiment analysis is exposed as a built-in `sentiment` property that returns a polarity score (from -1 to 1) and a subjectivity score (from 0 to 1). Translation was historically handled through a simple method call that delegated to the Google Translate service; that feature has been deprecated and removed in recent TextBlob releases, so current code should call a translation API directly. Because none of this requires deep knowledge of the underlying NLP models, TextBlob is especially useful for beginners and quick prototyping.

**Question 3: Explain the role of Stanford NLP in academic and industry NLP projects.**

Stanford NLP plays an important role in both academic research and industry NLP projects by providing state-of-the-art, research-backed tools for core NLP tasks such as tokenization, POS tagging, named entity recognition, parsing, coreference resolution, and sentiment analysis. In academia, it is widely used for research, experimentation, and benchmarking due to its strong theoretical foundations and high-quality models. In industry, Stanford NLP (especially Stanford CoreNLP) is valued for its accuracy, language support, and robustness, making it suitable for building reliable NLP pipelines in applications like information extraction, question answering, and text analytics, particularly in systems where correctness and linguistic depth are more important than speed.


**Question 4: Describe the architecture and functioning of a Recurrent Neural Network (RNN).**

A Recurrent Neural Network (RNN) is a neural network architecture designed to handle sequential data such as text, speech, or time-series data. Its key feature is the presence of recurrent connections, which allow information from previous time steps to be passed forward as a hidden state, enabling the network to capture temporal dependencies. At each time step, the RNN takes the current input and the previous hidden state and produces a new hidden state and an output, using the same weights at every position in the sequence. This architecture allows RNNs to model sequence order and context, but standard RNNs suffer from vanishing and exploding gradients, which limit their ability to learn long-term dependencies. Gated variants such as LSTM and GRU were introduced to address this issue.
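The recurrence described above is commonly written as $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ with output $y_t = W_{hy} h_t + b_y$. The NumPy sketch below illustrates one forward pass under that formulation. The function name `rnn_forward`, the weight names, the dimensions, and the random initialization are all assumptions made for illustration; they are not taken from any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence of input vectors.

    inputs has shape (seq_len, input_dim). Returns the hidden states
    with shape (seq_len, hidden_dim) and the outputs with shape
    (seq_len, output_dim).
    """
    h = np.zeros(W_hh.shape[0])  # initial hidden state h_0
    hidden_states, outputs = [], []
    for x_t in inputs:
        # The same weights are reused at every time step.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y = W_hy @ h + b_y
        hidden_states.append(h)
        outputs.append(y)
    return np.stack(hidden_states), np.stack(outputs)

# Illustrative sizes (assumptions): 5 time steps, 3 input features,
# 4 hidden units, 2 output units.
rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

x = rng.normal(size=(seq_len, input_dim))
hs, ys = rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Because `W_hh` is multiplied into the hidden state once per time step, gradients flowing back through a long sequence pass through many repeated multiplications, which is where the vanishing and exploding gradient problems come from.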

**Question 5: What is the key difference between LSTM and GRU networks in NLP applications?**

The key difference between LSTM and GRU networks lies in their internal gating mechanisms. LSTM uses three gates (input, forget, and output) along with a separate cell state to control the flow of information, making it powerful for capturing long-term dependencies but computationally heavier. GRU, on the other hand, combines these functions into two gates (reset and update) and does not use a separate cell state, resulting in a simpler architecture with fewer parameters. In NLP applications, GRUs are generally faster to train and more memory-efficient, while LSTMs may perform slightly better on tasks requiring very long-range context.
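The "fewer parameters" point can be made concrete with a little arithmetic. The sketch below counts the trainable parameters of a single recurrent layer under the textbook formulations, where each gate (and the candidate state) has one input weight matrix, one recurrent weight matrix, and one bias vector. The helper names are hypothetical, and framework implementations may differ slightly, for example by keeping an extra bias vector per gate.

```python
def lstm_param_count(input_dim, hidden_dim):
    # Four weight sets: input, forget, and output gates,
    # plus the candidate cell update.
    per_set = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return 4 * per_set

def gru_param_count(input_dim, hidden_dim):
    # Three weight sets: reset gate, update gate,
    # and the candidate hidden state.
    per_set = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return 3 * per_set

# Illustrative sizes (assumptions): 16-dimensional embeddings, 16 units.
input_dim, hidden_dim = 16, 16
lstm_params = lstm_param_count(input_dim, hidden_dim)
gru_params = gru_param_count(input_dim, hidden_dim)
print("LSTM:", lstm_params)               # LSTM: 2112
print("GRU: ", gru_params)                # GRU:  1584
print("ratio:", gru_params / lstm_params)  # ratio: 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, regardless of the dimensions chosen.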

In [None]:
# Question 6: Write a Python program using TextBlob to perform sentiment analysis on
# the following paragraph of text:
# “I had a great experience using the new mobile banking app. The interface is intuitive,
# and customer support was quick to resolve my issue. However, the app did crash once
# during a transaction, which was frustrating"
# Your program should print out the polarity and subjectivity scores.

!pip install textblob
!python -m textblob.download_corpora

from textblob import TextBlob
text = """I had a great experience using the new mobile banking app. The interface is intuitive,
and customer support was quick to resolve my issue. However, the app did crash once
during a transaction, which was frustrating"""

blob = TextBlob(text)
# blob.sentiment is a namedtuple: polarity lies in [-1, 1], subjectivity in [0, 1]
print(blob.sentiment)

Sentiment(polarity=0.21742424242424244, subjectivity=0.6511363636363636)


In [2]:
# Question 7: Given the sample paragraph below, perform string tokenization and
# frequency distribution using Python and NLTK:
# “Natural Language Processing (NLP) is a fascinating field that combines linguistics,
# computer science, and artificial intelligence. It enables machines to understand,
# interpret, and generate human language. Applications of NLP include chatbots,
# sentiment analysis, and machine translation. As technology advances, the role of NLP
# in modern solutions is becoming increasingly critical.”


import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer models
nltk.download('punkt_tab')

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

tokens = word_tokenize(text)
print("Tokens:\n", tokens)
freq_dist = FreqDist(tokens)
print("\nFrequency Distribution:")
for word, freq in freq_dist.items():
    print(word, ":", freq)

Tokens:
 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Frequency Distribution:
Natural : 1
Language : 1
Processing : 1
( : 1
NLP : 3
) : 1
is : 2
a : 1
fascinating : 1
field : 1
that : 1
combines : 1
linguistics : 1
, : 7
computer : 1
science : 1
and : 3
artificial : 1
intelligence : 1
. : 4
It : 1
enables : 1
machines : 1
to : 1
understand : 1
interpret : 1
generate : 1
human : 1
language : 1
Applications : 1
of : 2
include : 1
chatbots : 1
sentiment : 1
...



In [3]:
# Question 8: Implement a basic LSTM model in Keras for a text classification task using
# the following dummy dataset. Your model should classify sentences as either positive
# (1) or negative (0).
# # Dataset
# texts = [
# “I love this project”, #Positive
# “This is an amazing experience”, #Positive
# “I hate waiting in line”, #Negative
# “This is the worst service”, #Negative
# “Absolutely fantastic!” #Positive
# ]
# labels = [1, 1, 0, 0, 1]
# Preprocess the text, tokenize it, pad sequences, and build an LSTM model to train on
# this data. You may use Keras with TensorFlow backend.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

texts = [
"I love this project", #Positive
"This is an amazing experience", #Positive
"I hate waiting in line", #Negative
"This is the worst service", #Negative
"Absolutely fantastic!" #Positive
]

labels = np.array([1, 1, 0, 0, 1])

# Build the vocabulary and map each sentence to a list of word indices
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)

# Pad every sequence with trailing zeros to a fixed length
max_len = 6
X = pad_sequences(sequences, maxlen=max_len, padding='post')

# +1 because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1

# An explicit Input layer replaces the deprecated `input_length` argument
model = Sequential([
    Input(shape=(max_len,)),
    Embedding(input_dim=vocab_size, output_dim=16),
    LSTM(16),
    Dense(1, activation='sigmoid')
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model.summary()

model.fit(X, labels, epochs=20, verbose=1)

test = ["I love this service"]
test_seq = tokenizer.texts_to_sequences(test)
test_pad = pad_sequences(test_seq, maxlen=max_len, padding='post')

prediction = model.predict(test_pad)
score = float(prediction[0][0])  # predicted probability of the positive class

print("\nPrediction:", prediction)
print("Class:", 1 if score > 0.5 else 0)



Epoch 1/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3s/step - accuracy: 0.6000 - loss: 0.6892
Epoch 2/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step - accuracy: 0.6000 - loss: 0.6873
Epoch 3/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 140ms/step - accuracy: 0.6000 - loss: 0.6854
Epoch 4/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step - accuracy: 0.6000 - loss: 0.6835
Epoch 5/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step - accuracy: 0.6000 - loss: 0.6815
Epoch 6/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step - accuracy: 0.6000 - loss: 0.6795
Epoch 7/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step - accuracy: 0.6000 - loss: 0.6774
Epoch 8/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 57ms/step - accuracy: 0.6000 - loss: 0.6753
Epoch 9/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

In [4]:
# Question 9: Using spaCy, build a simple NLP pipeline that includes tokenization,
# lemmatization, and entity recognition. Use the following paragraph as your dataset:
# “Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the
# development of India’s atomic energy program. He was the founding director of the Tata
# Institute of Fundamental Research (TIFR) and was instrumental in establishing the
# Atomic Energy Commission of India.”
# Write a Python program that processes this text using spaCy, then prints tokens, their
# lemmas, and any named entities found.


import spacy

# Load English language model
nlp = spacy.load("en_core_web_sm")

text = """Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the
development of India’s atomic energy program. He was the founding director of the Tata
Institute of Fundamental Research (TIFR) and was instrumental in establishing the
Atomic Energy Commission of India."""

# Process text
doc = nlp(text)

# 1. Tokenization + Lemmatization
print("TOKENS AND LEMMAS\n")
for token in doc:
    print(f"Token: {token.text:15} Lemma: {token.lemma_}")

# 2. Named Entity Recognition
print("\nNAMED ENTITIES\n")
for ent in doc.ents:
    print(f"Entity: {ent.text:40} Label: {ent.label_}")

TOKENS AND LEMMAS

Token: Homi            Lemma: Homi
Token: Jehangir        Lemma: Jehangir
Token: Bhaba           Lemma: Bhaba
Token: was             Lemma: be
Token: an              Lemma: an
Token: Indian          Lemma: indian
Token: nuclear         Lemma: nuclear
Token: physicist       Lemma: physicist
Token: who             Lemma: who
Token: played          Lemma: play
Token: a               Lemma: a
Token: key             Lemma: key
Token: role            Lemma: role
Token: in              Lemma: in
Token: the             Lemma: the
Token: 
               Lemma: 

Token: development     Lemma: development
Token: of              Lemma: of
Token: India           Lemma: India
Token: ’s              Lemma: ’s
Token: atomic          Lemma: atomic
Token: energy          Lemma: energy
Token: program         Lemma: program
Token: .               Lemma: .
Token: He              Lemma: he
Token: was             Lemma: be
Token: the             Lemma: the
Token: founding        Lemma: fou

**Question 10: You are working on a chatbot for a mental health platform. Explain how you would leverage LSTM or GRU networks along with libraries like spaCy or Stanford NLP to understand and respond to user input effectively. Detail your architecture, data preprocessing pipeline, and any ethical considerations.**


To build a mental-health chatbot, LSTM or GRU networks can be used to understand the context and emotion in user messages. These models are suitable because they capture long-term dependencies in text, such as expressions of sadness or anxiety across multiple sentences. spaCy or Stanford NLP can be used in the initial stage for tokenization, lemmatization, part-of-speech tagging, and named entity recognition to convert raw user input into structured linguistic features.

The architecture would start with a preprocessing pipeline where the user text is cleaned, tokenized, and converted into word embeddings. The processed sequence is then passed to an LSTM/GRU layer that performs intent and emotion classification, such as detecting depression, stress, or crisis situations. Based on the predicted intent, a dialogue manager selects an appropriate empathetic response or escalates the conversation to a human counselor when necessary.
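The dialogue-manager step described above can be sketched as a small routing function that sits after the LSTM/GRU classifier. Everything in this sketch is hypothetical and chosen only for illustration: the label names, the `CRISIS_THRESHOLD` value, the canned responses, and the `route` function itself. The classifier is assumed to return a mapping from label to probability.

```python
from dataclasses import dataclass

# Hypothetical label and threshold (assumptions, not tuned values).
CRISIS_LABEL = "crisis"
CRISIS_THRESHOLD = 0.30  # deliberately low: prefer false alarms over misses

# Placeholder responses keyed by predicted intent (illustrative only).
RESPONSES = {
    "stress": "That sounds like a lot to carry. Would you like to talk it through?",
    "low_mood": "I'm sorry you're feeling this way. I'm here to listen.",
    "neutral": "Thanks for sharing. How has your day been?",
}

@dataclass
class Action:
    kind: str     # "respond" or "escalate"
    message: str

def route(probabilities):
    """Choose an action from classifier output (label -> probability)."""
    # Safety check first: escalate whenever crisis is plausible,
    # even if it is not the most likely label.
    if probabilities.get(CRISIS_LABEL, 0.0) >= CRISIS_THRESHOLD:
        return Action(
            "escalate",
            "I'm concerned about your safety. I'm connecting you with a human counselor.",
        )
    top_label = max(probabilities, key=probabilities.get)
    # Unknown or below-threshold crisis labels fall back to a neutral reply.
    return Action("respond", RESPONSES.get(top_label, RESPONSES["neutral"]))

print(route({"crisis": 0.05, "stress": 0.80, "neutral": 0.15}).kind)  # respond
print(route({"crisis": 0.40, "stress": 0.50, "neutral": 0.10}).kind)  # escalate
```

Checking the crisis probability before taking the most likely label is a design choice: in this setting a missed crisis is far more costly than an unnecessary hand-off to a human.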

Ethical considerations are critical in such a system. The chatbot must protect user privacy, avoid storing sensitive personal data, and clearly state that it is not a replacement for professional therapy. It should include safety mechanisms to detect self-harm or suicidal intent and immediately provide helpline information and encourage human support. Regular monitoring and bias-free, compassionate responses are essential to ensure user well-being.
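One way to act on the privacy point is to mask obvious identifiers before a message is logged or stored. The sketch below is a minimal illustration using two hand-written regular expressions; the patterns and the `redact` helper are assumptions for demonstration and would miss many real-world formats, so they should not be treated as a complete anonymization solution.

```python
import re

# Illustrative patterns only (assumptions); real systems need broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +91 98765 43210."))
# Reach me at [EMAIL] or [PHONE].
```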