Question 1: Compare and contrast NLTK and spaCy in terms of features, ease of use,
and performance.

Answer:
NLTK and spaCy are popular Python libraries for natural language processing (NLP), but they target different needs: NLTK emphasizes educational flexibility, while spaCy prioritizes production efficiency.

## Key Features
NLTK offers a broad toolkit: tokenization, stemming, lemmatization, POS tagging, NER, chunking and parsing, sentiment analysis (for example via its VADER analyzer), and access to a large collection of corpora and lexical resources such as WordNet.

spaCy provides non-destructive tokenization, POS tagging, dependency parsing, NER, text classification, sentence segmentation, and pre-trained pipelines for more than 20 languages (the exact count depends on the release), with multi-task learning support.

Both support core tasks like tokenization and NER, but NLTK excels in corpora and customization, while spaCy focuses on trainable pipelines and visualizers.

## Ease of Use
NLTK has a steeper learning curve due to its modular, code-heavy approach, suiting learners who want to explore algorithms deeply.

spaCy features a simple, consistent API for quick implementation, with less code needed for common tasks, though custom training can be complex.

NLTK requires more setup for production but offers extensive documentation; spaCy is more intuitive for beginners in applied NLP.
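The difference in API style can be illustrated with a small sketch. It assumes both libraries are installed and deliberately uses only components that need no downloaded data: NLTK's regex-based `wordpunct_tokenize` and `PorterStemmer`, and a tokenizer-only spaCy pipeline from `spacy.blank("en")`. The sample sentence is made up for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
import spacy

text = "spaCy keeps the original text intact, while NLTK returns plain strings."

# NLTK: each step is a separate function or object that you wire together.
# wordpunct_tokenize is regex-based and needs no downloaded resources.
nltk_tokens = wordpunct_tokenize(text)
stemmer = PorterStemmer()
nltk_stems = [stemmer.stem(tok) for tok in nltk_tokens]

# spaCy: one call returns a Doc whose tokens carry their own attributes.
# spacy.blank("en") builds a tokenizer-only pipeline (no trained model).
nlp = spacy.blank("en")
doc = nlp(text)
spacy_tokens = [(tok.text, tok.idx) for tok in doc]  # idx = character offset

print(nltk_tokens)
print(nltk_stems)
print(spacy_tokens)

# Non-destructive tokenization: the Doc can reproduce the input exactly.
print(doc.text == text)
```

With a trained pipeline such as `en_core_web_sm`, the same `Doc` would also expose lemmas, POS tags, and entities without further wiring, which is where the "less code" claim mostly comes from.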

## Performance
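spaCy is typically much faster because its core is written in Cython, which makes it well suited to large-scale or real-time use. Figures such as 10,000+ tokens per second, or up to 50x faster than NLTK on tokenization, are often quoted, but actual throughput depends heavily on hardware, the pipeline components enabled, and the text itself.

NLTK is slower (a commonly quoted figure is around 2,000 tokens per second) because it is largely pure Python; it is better suited to prototyping or small datasets.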

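spaCy's trained pipelines often deliver higher out-of-the-box accuracy (reported scores are commonly in the 90-95% range, depending on the task, model, and benchmark); NLTK can match this on specific tasks but usually needs more tuning.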
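Because throughput figures like those above vary between machines and configurations, it can be more reliable to measure them directly. The following is a minimal timing sketch using only the standard library; `tokens_per_second` and `whitespace_tokenize` are hypothetical helper names introduced here, and the whitespace tokenizer is just a stand-in so the code runs without any NLP library.

```python
import time
from typing import Callable, List


def tokens_per_second(
    tokenize: Callable[[str], List[str]],
    texts: List[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput (tokens/second) over several runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = sum(len(tokenize(t)) for t in texts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_tokens / elapsed)
    return best


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: split on whitespace only."""
    return text.split()


corpus = ["Natural language processing enables machines to read text."] * 2000
rate = tokens_per_second(whitespace_tokenize, corpus)
print(f"whitespace baseline: {rate:,.0f} tokens/second")
```

To compare the two libraries, pass `nltk.word_tokenize` or a wrapper such as `lambda t: [tok.text for tok in nlp(t)]` (with `nlp` a loaded spaCy pipeline) in place of the stand-in, and run both on the same corpus.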

## Comparison Table

| Aspect          | NLTK                                      | spaCy                                      |
|-----------------|-------------------------------------------|--------------------------------------------|
| **Strengths**   | Educational, flexible, vast corpora | Production-ready, fast, accurate models  |
| **Weaknesses**  | Slower, steeper curve               | Less flexible for research, memory-heavy  |
| **Best For**    | Learning/research                 | Apps/large data                      |

Question 2: What is TextBlob and how does it simplify common NLP tasks like
sentiment analysis and translation?

Answer: TextBlob is a simple Python library built on NLTK and Pattern, designed for straightforward text processing in NLP. It provides an intuitive, "Pythonic" interface that abstracts complex operations into one-liners.

## Core Features
TextBlob supports tokenization (words/sentences), POS tagging, noun phrase extraction, sentiment analysis, classification (Naive Bayes/Decision Tree), n-grams, lemmatization, spelling correction, and WordNet integration.

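Older releases also included translation and language detection backed by the Google Translate web API; these methods were deprecated in TextBlob 0.16 and removed in 0.18. Word inflection for plurals and singulars remains available.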

## Simplifying Sentiment Analysis
Sentiment analysis returns polarity (-1.0 negative to 1.0 positive) and subjectivity (0.0 objective to 1.0 subjective) with a single call: `TextBlob(text).sentiment`. This uses a rule-based approach from Pattern, avoiding manual model training for quick prototyping.

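Example: `TextBlob("Python is great!").sentiment` returns a `Sentiment(polarity, subjectivity)` named tuple; here the polarity is strongly positive and the subjectivity is high, indicating positive, opinionated text. Exact values depend on the installed lexicon version.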

## Simplifying Translation
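In older releases translation was a one-liner: `TextBlob(text).translate(to='es')` detected the source language and called the Google Translate web API, so it needed network access but no API key. The method has since been removed from TextBlob, so current code has to use a dedicated translation library or service instead.

Example (older releases only): `TextBlob("Hello world").translate(to='fr')` returned a French rendering such as "Bonjour le monde".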

## Key Advantages
Its simplicity suits beginners and quick tasks, since it needs far less code than raw NLTK. The trade-off is that its lexicon- and rule-based methods are generally less accurate than modern neural models.

Question 3: Explain the role of Stanford NLP in academic and industry NLP projects.

Answer: Stanford NLP, primarily through tools like Stanford CoreNLP and Stanza, provides robust, research-grade software for linguistic annotations in text processing. It plays a foundational role in both advancing theoretical NLP and enabling practical applications.

## Academic Role
In academia, Stanford NLP tools support research in computational linguistics, machine learning, and cognitive science; Stanza alone provides pipelines for over 60 languages.

It powers educational materials, student projects, and peer-reviewed papers, with CoreNLP used for tasks like dependency parsing and coreference resolution in experiments.

The Stanford NLP Group fosters interdisciplinary work, influencing curricula and tools for training future researchers.

## Industry Role
Industry adopts Stanford tools for production systems in sentiment analysis, NER, relation extraction, and text classification, integrating them into apps for customer feedback and search.

Companies use these tools in scalable NLP pipelines, which lets techniques developed in academic research move into deployed products.

## Key Impacts
Stanford NLP sets benchmarks for accuracy in parsing and annotation, enabling hybrid academia-industry collaborations that boost innovation.

| Context     | Primary Contributions                  | Examples                          |
|-------------|----------------------------------------|-----------------------------------|
| **Academic**| Research tools, education, experiments | Stanza for 60+ languages, papers  |
| **Industry**| Applied annotations, scalable apps  | Sentiment, NER in products  |

Question 4: Describe the architecture and functioning of a Recurrent Neural Network
(RNN).

Answer: Recurrent Neural Networks (RNNs) are neural networks designed for sequential data, where connections form cycles to maintain a "memory" of previous inputs. They process inputs step-by-step, updating an internal state to capture dependencies over time.

## Core Architecture
RNNs consist of three main layers: an input layer for sequential data (like words in a sentence), a hidden (recurrent) layer that holds the state, and an output layer for predictions.

The key is the recurrent connection: at each time step \( t \), the hidden state \( h_t \) is computed as \( h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \), where \( x_t \) is the current input, \( h_{t-1} \) is the prior hidden state, \( W \) are weight matrices, and \( b_h \) is bias.

This creates a chain-like unfolding over time, sharing weights across steps for efficiency.

## Functioning Process
RNNs process sequences iteratively: start with initial hidden state \( h_0 \) (often zeros), feed each input \( x_t \) to update \( h_t \), then compute output \( y_t = W_{hy} h_t + b_y \).
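The update and output equations above can be sketched directly in NumPy. This is a minimal illustration rather than a trainable implementation: the dimensions, the random weights, and the function name `rnn_forward` are assumptions chosen only for the example.

```python
import numpy as np


def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence.

    xs has shape (T, input_dim). Returns (hidden_states, outputs) with
    shapes (T, hidden_dim) and (T, output_dim).
    """
    h = np.zeros(W_hh.shape[0])  # h_0 initialised to zeros
    hs, ys = [], []
    for x_t in xs:
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        # y_t = W_hy h_t + b_y
        ys.append(W_hy @ h + b_y)
        hs.append(h)
    return np.stack(hs), np.stack(ys)


rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2

# The same weights are reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

For classification, a softmax or sigmoid would normally be applied to each \( y_t \) (or only to the last one).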

During training, backpropagation through time (BPTT) unrolls the network, computing gradients across steps to optimize parameters.

They excel at tasks like language modeling but suffer from vanishing gradients on long sequences.
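The vanishing-gradient effect can be made concrete with a small numerical sketch. It tracks the Jacobian \( \partial h_t / \partial h_0 \), which is a product of one-step Jacobians \( \mathrm{diag}(1 - h_t^2)\, W_{hh} \). The hidden size, sequence length, and the choice to rescale \( W_{hh} \) to a largest singular value of 0.5 are assumptions made here to put the network in the vanishing regime.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 8, 50

# Rescale W_hh so its largest singular value is 0.5 (illustrative choice).
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)

h = np.zeros(hidden_dim)
jacobian = np.eye(hidden_dim)  # d h_t / d h_0, starts as the identity
norms = []
for _ in range(T):
    pre_input = rng.normal(size=hidden_dim)  # stand-in for W_xh x_t + b_h
    h = np.tanh(W_hh @ h + pre_input)
    # One-step Jacobian: d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_hh
    step = (1.0 - h ** 2)[:, None] * W_hh
    jacobian = step @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

print(f"||dh_1/dh_0||  = {norms[0]:.3e}")
print(f"||dh_{T}/dh_0|| = {norms[-1]:.3e}")
```

Each one-step Jacobian has spectral norm at most 0.5 here, so the product shrinks at least geometrically with the number of steps; this is why gradients from distant time steps contribute almost nothing during BPTT.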

## Comparison of Components

| Component      | Role                                      | Example Equation/Note                  |
|----------------|-------------------------------------------|----------------------------------------|
| **Input**      | Current sequence element           | \( x_t \) (e.g., word embedding)      |
| **Hidden State** | Memory from prior steps      | \( h_t = f(h_{t-1}, x_t) \)           |
| **Output**     | Prediction per step               | \( y_t \) for classification/seq gen  |
| **Weights**    | Learned parameters (shared)    | \( W_{xh}, W_{hh}, W_{hy} \)          |

Question 5: What is the key difference between LSTM and GRU networks in NLP
applications?

Answer: LSTM and GRU are both gated RNN variants that mitigate vanishing gradients, but their key difference lies in gating mechanisms and architectural simplicity, impacting efficiency in NLP tasks like sequence modeling.

## Gating Mechanisms
LSTM employs three gates—input, forget, and output—plus a distinct cell state to regulate information flow, enabling precise control over long-term dependencies.

GRU streamlines this with two gates—reset and update—merging cell and hidden states into one, reducing parameters by about 25% for faster computation.
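The "about 25%" figure follows from counting weight blocks: an LSTM layer has four (three gates plus the candidate cell), a GRU has three (two gates plus the candidate hidden state). A small sketch of the arithmetic, assuming the textbook formulation with one bias vector per block; the layer sizes are arbitrary and the function names are introduced here for illustration.

```python
def lstm_params(input_dim: int, hidden_dim: int) -> int:
    # Four blocks (input, forget, output gates + candidate cell), each with
    # input weights, recurrent weights, and one bias vector.
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def gru_params(input_dim: int, hidden_dim: int) -> int:
    # Three blocks (reset, update gates + candidate hidden state).
    return 3 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


input_dim, hidden_dim = 100, 128
lstm = lstm_params(input_dim, hidden_dim)
gru = gru_params(input_dim, hidden_dim)
print(lstm, gru, f"{1 - gru / lstm:.0%} fewer")  # 117248 87936 25% fewer
```

Framework implementations can differ slightly; for instance, Keras's default GRU (`reset_after=True`) keeps two bias vectors per block, so its reported count is a little higher than this formula gives.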

## NLP Applications
In NLP, LSTMs shine in tasks needing deep context retention, such as document-level sentiment analysis or machine translation with long sequences.

GRUs excel in resource-constrained or real-time scenarios like NER or chatbots, offering comparable accuracy with quicker training and inference.

## Comparison Table

| Aspect          | LSTM                               | GRU                                |
|-----------------|------------------------------------|------------------------------------|
| **Gates**       | 3 (input, forget, output)  | 2 (reset, update)          |
| **States**      | Hidden + cell          | Hidden only               |
| **Parameters**  | More (slower)              | Fewer (faster)             |
| **NLP Best For**| Long sequences         | Efficiency/real-time       |

Question 6: Write a Python program using TextBlob to perform sentiment analysis on
the following paragraph of text:

“I had a great experience using the new mobile banking app. The interface is intuitive,
and customer support was quick to resolve my issue. However, the app did crash once
during a transaction, which was frustrating.”

Your program should print out the polarity and subjectivity scores.

(Include your Python code and output in the code box below.)

Answer:

In [1]:
from textblob import TextBlob

# Given text
text = (
    "I had a great experience using the new mobile banking app. "
    "The interface is intuitive, and customer support was quick to resolve my issue. "
    "However, the app did crash once during a transaction, which was frustrating."
)

# Create TextBlob object
blob = TextBlob(text)

# Get sentiment scores
polarity = blob.sentiment.polarity
subjectivity = blob.sentiment.subjectivity

# Print results
print("Polarity:", polarity)
print("Subjectivity:", subjectivity)


Polarity: 0.21742424242424244
Subjectivity: 0.6511363636363636


Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:

“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

(Include your Python code and output in the code box below.)

Answer:

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download tokenizer data (only needed the first time)
nltk.download('punkt_tab')  # newer NLTK releases use 'punkt_tab' instead of 'punkt'

# Given paragraph
text = (
    "Natural Language Processing (NLP) is a fascinating field that combines linguistics, "
    "computer science, and artificial intelligence. It enables machines to understand, "
    "interpret, and generate human language. Applications of NLP include chatbots, "
    "sentiment analysis, and machine translation. As technology advances, the role of NLP "
    "in modern solutions is becoming increasingly critical."
)

# Tokenization
tokens = word_tokenize(text)

# Frequency distribution
freq_dist = FreqDist(tokens)

# Print tokens
print("Tokens:")
print(tokens)

# Print frequency distribution
print("\nFrequency Distribution:")
for word, freq in freq_dist.items():
    print(f"{word}: {freq}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Tokens:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Frequency Distribution:
Natural: 1
Language: 1
Processing: 1
(: 1
NLP: 3
): 1
is: 2
a: 1
fascinating: 1
field: 1
that: 1
combines: 1
linguistics: 1
,: 7
computer: 1
science: 1
and: 3
artificial: 1
intelligence: 1
.: 4
It: 1
enables: 1
machines: 1
to: 1
understand: 1
interpret: 1
generate: 1
human: 1
language: 1
Applications: 1
of: 2
include: 1
chatbots: 1
sentiment: 1
analysis: 1
machine: 1
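translation: 1
As: 1
technology: 1
advances: 1
the: 1
role: 1
in: 1
modern: 1
solutions: 1
becoming: 1
increasingly: 1
critical: 1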

Question 8: Implement a basic LSTM model in Keras for a text classification task using
the following dummy dataset. Your model should classify sentences as either positive
(1) or negative (0).

# Dataset
texts = [
    "I love this project",            # Positive
    "This is an amazing experience",  # Positive
    "I hate waiting in line",         # Negative
    "This is the worst service",      # Negative
    "Absolutely fantastic!"           # Positive
]

labels = [1, 1, 0, 0, 1]

Preprocess the text, tokenize it, pad sequences, and build an LSTM model to train on
this data. You may use Keras with TensorFlow backend.  

(Include your Python code and output in the code box below.)

Answer:  

In [4]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Dataset
texts = [
    "I love this project",                 # Positive
    "This is an amazing experience",       # Positive
    "I hate waiting in line",               # Negative
    "This is the worst service",            # Negative
    "Absolutely fantastic!"                # Positive
]

labels = [1, 1, 0, 0, 1]

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Padding sequences
max_len = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

# Convert labels to numpy array
labels = np.array(labels)

# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# Build LSTM model
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=10, input_length=max_len),
    LSTM(16),
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Model summary
model.summary()

# Train model
history = model.fit(
    padded_sequences,
    labels,
    epochs=20,
    verbose=1
)

# Test prediction
test_text = ["I really love this experience"]
test_seq = tokenizer.texts_to_sequences(test_text)
test_pad = pad_sequences(test_seq, maxlen=max_len, padding='post')
prediction = model.predict(test_pad)

print("\nPrediction for test sentence:", prediction)




Epoch 1/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3s/step - accuracy: 0.6000 - loss: 0.6920
Epoch 2/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step - accuracy: 0.6000 - loss: 0.6906
Epoch 3/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step - accuracy: 0.6000 - loss: 0.6892
Epoch 4/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step - accuracy: 0.6000 - loss: 0.6877
Epoch 5/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step - accuracy: 0.6000 - loss: 0.6863
Epoch 6/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step - accuracy: 0.6000 - loss: 0.6848
Epoch 7/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step - accuracy: 0.6000 - loss: 0.6833
Epoch 8/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step - accuracy: 0.6000 - loss: 0.6817
Epoch 9/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m

Question 9: Using spaCy, build a simple NLP pipeline that includes tokenization,
lemmatization, and entity recognition. Use the following paragraph as your dataset:

“Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the
development of India’s atomic energy program. He was the founding director of the Tata
Institute of Fundamental Research (TIFR) and was instrumental in establishing the
Atomic Energy Commission of India.”

Write a Python program that processes this text using spaCy, then prints tokens, their
lemmas, and any named entities found.

(Include your Python code and output in the code box below.)

Answer:

In [5]:
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Given paragraph
text = (
    "Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the "
    "development of India’s atomic energy program. He was the founding director of the "
    "Tata Institute of Fundamental Research (TIFR) and was instrumental in establishing "
    "the Atomic Energy Commission of India."
)

# Process the text
doc = nlp(text)

# Print tokens and lemmas
print("Tokens and Lemmas:")
for token in doc:
    print(f"{token.text:<15} -> {token.lemma_}")

# Print named entities
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Tokens and Lemmas:
Homi            -> Homi
Jehangir        -> Jehangir
Bhaba           -> Bhaba
was             -> be
an              -> an
Indian          -> indian
nuclear         -> nuclear
physicist       -> physicist
who             -> who
played          -> play
a               -> a
key             -> key
role            -> role
in              -> in
the             -> the
development     -> development
of              -> of
India           -> India
’s              -> ’s
atomic          -> atomic
energy          -> energy
program         -> program
.               -> .
He              -> he
was             -> be
the             -> the
founding        -> found
director        -> director
of              -> of
the             -> the
Tata            -> Tata
Institute       -> Institute
of              -> of
Fundamental     -> Fundamental
Research        -> Research
(               -> (
TIFR            -> TIFR
)               -> )
and             -> and
was             -> be
instrume

Question 10: You are working on a chatbot for a mental health platform. Explain how
you would leverage LSTM or GRU networks along with libraries like spaCy or Stanford
NLP to understand and respond to user input effectively. Detail your architecture, data
preprocessing pipeline, and any ethical considerations.

(Include your Python code and output in the code box below.)

Answer:  

In [6]:
import spacy
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample mental-health related dataset
texts = [
    "I feel very anxious today",
    "I am happy and relaxed",
    "I feel sad and overwhelmed",
    "Everything is going well"
]

# Labels: 1 = Distressed, 0 = Not distressed
labels = [1, 0, 1, 0]

# spaCy preprocessing (lemmatization)
processed_texts = []
for text in texts:
    doc = nlp(text)
    lemmas = " ".join([token.lemma_ for token in doc])
    processed_texts.append(lemmas)

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(processed_texts)
sequences = tokenizer.texts_to_sequences(processed_texts)

# Padding
max_len = max(len(seq) for seq in sequences)
X = pad_sequences(sequences, maxlen=max_len, padding='post')
y = np.array(labels)

# Build LSTM model
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=16),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train model
model.fit(X, y, epochs=10, verbose=1)

# Test input
test_text = "I am feeling very stressed and tired"
doc = nlp(test_text)
test_lemmas = " ".join([token.lemma_ for token in doc])
test_seq = tokenizer.texts_to_sequences([test_lemmas])
test_pad = pad_sequences(test_seq, maxlen=max_len, padding='post')

prediction = model.predict(test_pad)
print("Distress probability:", prediction)


Epoch 1/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step - accuracy: 0.5000 - loss: 0.6940
Epoch 2/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 65ms/step - accuracy: 0.5000 - loss: 0.6930
Epoch 3/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 69ms/step - accuracy: 0.5000 - loss: 0.6920
Epoch 4/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 69ms/step - accuracy: 0.7500 - loss: 0.6909
Epoch 5/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 138ms/step - accuracy: 1.0000 - loss: 0.6899
Epoch 6/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 139ms/step - accuracy: 1.0000 - loss: 0.6888
Epoch 7/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 147ms/step - accuracy: 1.0000 - loss: 0.6876
Epoch 8/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 66ms/step - accuracy: 1.0000 - loss: 0.6864
Epoch 9/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 