# NLP Libraries Assignment

Question 1: Compare and contrast NLTK and spaCy in terms of features, ease of use,
and performance.
  - NLTK (Natural Language Toolkit) and spaCy are both prominent Python libraries for Natural Language Processing (NLP), but they differ significantly in their approach and target audience.
  Features:
  NLTK: Offers a wide array of algorithms and resources, including tokenizers, stemmers, taggers, parsers, and various corpora. It provides extensive flexibility for research and experimentation, allowing users to choose and compare different algorithms for the same task. However, it lacks built-in support for word embeddings and advanced neural network models.
  spaCy: Focuses on providing efficient and production-ready NLP pipelines with pre-trained statistical models for various languages. It includes features like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and word vectors. spaCy is designed for speed and accuracy in real-world applications.
  Ease of use:
  NLTK: Can have a steeper learning curve due to its extensive collection of algorithms and the need for manual configuration for many tasks. It's more geared towards researchers who want fine-grained control over the NLP process.
  spaCy: Is generally considered more user-friendly, especially for developers and those looking for quick integration into applications. It offers a streamlined API and pre-trained models that simplify common NLP tasks, making it easier to get started and achieve results.
  Performance:
  NLTK: Performance can vary depending on the chosen algorithms and the complexity of the task. While it offers a wide range of options, optimizing for speed and efficiency often requires careful selection and configuration.
  spaCy: Is optimized for performance and speed, making it suitable for large-scale text processing and real-time applications. Its pre-trained models and efficient algorithms generally deliver faster processing times and higher accuracy for many standard NLP tasks compared to NLTK's default settings.

Question 2: What is TextBlob and how does it simplify common NLP tasks like
sentiment analysis and translation?
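  - TextBlob is a Python library that simplifies common Natural Language Processing (NLP) tasks by providing a simple, intuitive, high-level API built on top of the more complex NLTK and Pattern libraries. It allows beginners and professionals to perform operations like sentiment analysis and translation with minimal code. Translation was exposed through a translate() method backed by the Google Translate service; that method has been deprecated since TextBlob 0.16.0, so current code should call a translation API directly.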
  TextBlob streamlines sentiment analysis by abstracting the complex underlying rule-based models into a single, accessible property of a TextBlob object.
  Simple API: You simply create a TextBlob object from your text and access its .sentiment property.
  Direct Scores: This property returns a tuple containing two floats: polarity and subjectivity.
  Interpretable Results: The polarity score ranges from -1.0 (very negative) to 1.0 (very positive), while subjectivity ranges from 0.0 (objective) to 1.0 (subjective), making the results immediately understandable without needing to manage the underlying dictionaries or models.
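
As an illustration of how those two scores might be read, here is a minimal sketch that turns a polarity/subjectivity pair into plain-language labels. The helper name `describe_sentiment`, the 0.1 neutral band, and the 0.5 subjectivity cut-off are assumptions chosen for this example; they are not part of TextBlob.

```python
def describe_sentiment(polarity, subjectivity, neutral_band=0.1):
    """Map TextBlob-style scores to readable labels (thresholds are illustrative)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if not 0.0 <= subjectivity <= 1.0:
        raise ValueError("subjectivity must lie in [0.0, 1.0]")

    # Scores close to zero are treated as neutral rather than forced into a class
    if polarity > neutral_band:
        tone = "positive"
    elif polarity < -neutral_band:
        tone = "negative"
    else:
        tone = "neutral"

    stance = "subjective" if subjectivity >= 0.5 else "objective"
    return tone, stance


# In practice the two numbers would come from TextBlob(text).sentiment
print(describe_sentiment(0.0, 0.7))   # ('neutral', 'subjective')
print(describe_sentiment(0.6, 0.9))   # ('positive', 'subjective')
```

A polarity of exactly 0.0 can mean either that no sentiment words were found or that positive and negative words cancelled out, so a neutral label is best read together with the subjectivity score.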

Question 3: Explain the role of Stanford NLP in academic and industry NLP projects.
  - Stanford NLP serves as a foundational suite of NLP tools and libraries, significantly influencing both academic research and industry applications. [1] It provides a comprehensive collection of pre-trained models and robust software for common tasks like tokenization, part-of-speech tagging, and named entity recognition.
  In academia, Stanford NLP is heavily utilized as a baseline for new research and a core teaching tool.
  Benchmarking: Researchers often use its models as a standard to compare against novel algorithms and models developed in their studies. [2]
  Education: It is frequently integrated into university curricula to teach students the fundamentals of how NLP tasks are implemented.
  Accessible Research: The open-source nature and well-documented APIs allow researchers globally to build upon existing work efficiently.

Question 4: Describe the architecture and functioning of a Recurrent Neural Network
(RNN).
  - A Recurrent Neural Network (RNN) is a class of artificial neural networks designed to recognize patterns in sequences of data, such as text, handwriting, or spoken language. Unlike traditional feedforward networks where information flows in only one direction, RNNs incorporate loops that allow information to persist, effectively giving them a form of memory about previous inputs. This architecture makes them highly suitable for tasks involving sequential data.
  Architecture of an RNN:
  The Recurrent Structure
  Input Layer: Receives the data for the current time step (e.g., a single word in a sentence).
  Hidden State (Memory): This is the crucial component. At each time step, the network takes the current input and the previous hidden state (the output from the loop in the prior step) to calculate a new hidden state. This hidden state captures information learned from all preceding elements in the sequence.
  Output Layer: Uses the current hidden state to produce an output (e.g., the next predicted word, a sentiment score).
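
The three components above can be sketched as a single loop in NumPy. This is a minimal forward pass of a vanilla RNN under assumed dimensions and randomly initialized weights; the function name `rnn_forward` and the weight names are chosen for this example only, and no training is shown.

```python
import numpy as np


def rnn_forward(inputs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a vanilla RNN over a sequence; return per-step outputs and the final hidden state."""
    hidden = np.zeros(W_hh.shape[0])  # initial memory: all zeros
    outputs = []
    for x_t in inputs:  # one time step per input vector
        # New hidden state mixes the current input with the previous hidden state
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        # Output layer reads only the current hidden state
        outputs.append(W_hy @ hidden + b_y)
    return np.array(outputs), hidden


rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, steps = 4, 3, 2, 5  # illustrative sizes

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (the loop)
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

sequence = rng.normal(size=(steps, input_dim))
outputs, final_hidden = rnn_forward(sequence, W_xh, W_hh, b_h, W_hy, b_y)
print(outputs.shape)       # (5, 2)
print(final_hidden.shape)  # (3,)
```

The same weight matrices are reused at every time step, which is what lets the network handle sequences of any length.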

Question 5: What is the key difference between LSTM and GRU networks in NLP
applications?
  - The key difference between LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks is the number and type of gates they use to control information flow. LSTMs employ three gates: the input gate, forget gate, and output gate, alongside a dedicated cell state. GRUs, designed for simplicity, use only two gates: the reset gate and the update gate, and lack a separate cell state, merging the cell state and hidden state into one.
  The architectural disparity leads to LSTMs having more parameters and potentially higher accuracy in specific complex sequence modeling tasks, while GRUs are generally faster to train and computationally less intensive due to their simpler structure.
  Key components:
  LSTM: Utilizes a cell state to carry information across time steps, governed by the three distinct gates that regulate what information to remember, forget, and output. This provides a more granular control over information flow.
  GRU: Merges the cell state and hidden state, simplifying the architecture. The two gates control how much of the past information to forget (reset gate) and how much of the new information to incorporate into the current state (update gate).
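
The parameter difference can be checked with a little arithmetic. The sketch below counts trainable weights for the textbook formulations, where each gate or candidate block has input weights, recurrent weights, and one bias vector. The function names are hypothetical, and the sizes (16-dimensional inputs, 32 units) are example values.

```python
def recurrent_params(weight_blocks, input_dim, units):
    """Weights in one recurrent layer: each block has input weights, recurrent weights and a bias."""
    per_block = units * input_dim + units * units + units
    return weight_blocks * per_block


def lstm_params(input_dim, units):
    # 3 gates (input, forget, output) + 1 candidate cell update = 4 weight blocks
    return recurrent_params(4, input_dim, units)


def gru_params(input_dim, units):
    # 2 gates (reset, update) + 1 candidate hidden state = 3 weight blocks
    return recurrent_params(3, input_dim, units)


input_dim, units = 16, 32
print(lstm_params(input_dim, units))  # 6272
print(gru_params(input_dim, units))   # 4704
```

With equal sizes the GRU needs three quarters of the LSTM's weights. Some implementations keep an extra bias vector per block (the default GRU variant in Keras is one), so the counts a framework reports can be slightly higher.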
  
   

In [2]:
# Write a Python program using TextBlob to perform sentiment analysis on the following paragraph:

from textblob import TextBlob

text = "I enjoy coding, but debugging can be frustrating."
blob = TextBlob(text)

# .sentiment is a namedtuple of (polarity, subjectivity)
sentiment = blob.sentiment
print(sentiment)

Sentiment(polarity=0.0, subjectivity=0.7)


In [3]:
# Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download the tokenizer data only if it is missing.
# nltk.data.find raises LookupError for an absent resource; nltk.downloader has no DownloadError.
# 'punkt_tab' is included because newer NLTK releases load it in place of 'punkt'.
for resource in ("punkt", "punkt_tab"):
    try:
        nltk.data.find(f"tokenizers/{resource}")
    except LookupError:
        nltk.download(resource)

# Sample paragraph
paragraph = "Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

# 1. Tokenization
# Convert the paragraph to lowercase to treat "NLP" and "nlp" as the same token
lower_case_paragraph = paragraph.lower()
tokens = word_tokenize(lower_case_paragraph)

print("Tokens:")
print(tokens)

# 2. Frequency Distribution
fdist = FreqDist(tokens)

print("\nFrequency Distribution:")
for word, frequency in fdist.most_common(10):  # Display top 10 most common words
    print(f"{word}: {frequency}")

In [4]:
# Question 8. Implement a basic LSTM model in Keras for a text classification task using the following dummy dataset. Your model should classify sentences.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# 1. Dummy Dataset (already provided in the prompt)
texts = [
    "I love this project",
    "This is an amazing experience",
    "I hate waiting in line",
    "This is the worst service",
    "Absolutely fantastic!"
]
labels = [1, 1, 0, 0, 1]

# Convert labels to a numpy array
labels = np.array(labels)

# 2. Text Preprocessing and Tokenization
vocab_size = 100
oov_tok = "<OOV>" # Out of vocabulary token

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

print(f"Word Index: {word_index}\n")

# Convert texts to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)

print(f"Sequences: {sequences}\n")

# 3. Padding Sequences
max_length = max([len(x) for x in sequences]) # Determine max length in the dataset
padding_type = 'post'
trunc_type = 'post'

padded_sequences = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(f"Padded Sequences (Shape: {padded_sequences.shape}):\n{padded_sequences}\n")

# 4. Building the LSTM Model
embedding_dim = 16

model = Sequential([
    # Input layer expects integer indices, outputs dense vectors of size embedding_dim
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    # LSTM layer processes the sequence data
    LSTM(units=32),
    # Dense output layer with a single neuron and sigmoid activation for binary classification
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

# 5. Training the Model
num_epochs = 50

# Note: With such a tiny dataset, the results are unlikely to generalize well or show smooth learning curves,
# but it demonstrates the implementation process.
history = model.fit(
    padded_sequences,
    labels,
    epochs=num_epochs,
    verbose=0 # Set verbose=1 to see training progress per epoch
)

print(f"\nTraining finished after {num_epochs} epochs.")

# Optional: Display final training accuracy
loss, accuracy = model.evaluate(padded_sequences, labels, verbose=0)
print(f"Final Training Accuracy: {accuracy*100:.2f}%")

# 6. Example Prediction Function
def predict_sentiment(text_list):
    # Preprocess new text exactly like the training data
    new_sequences = tokenizer.texts_to_sequences(text_list)
    new_padded = pad_sequences(new_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

    # Predict
    predictions = model.predict(new_padded)

    for text, pred in zip(text_list, predictions):
        sentiment = "Positive (1)" if pred[0] >= 0.5 else "Negative (0)"
        print(f"Text: '{text}' -> Prediction Score: {pred[0]:.4f} -> Sentiment: {sentiment}")

print("\n--- Testing Model Predictions ---")
test_texts = [
    "I love this!", # Should lean positive
    "Worst experience ever", # Should lean negative
    "It was okay" # Neutral/Unseen words
]

predict_sentiment(test_texts)

Word Index: {'<OOV>': 1, 'this': 2, 'i': 3, 'is': 4, 'love': 5, 'project': 6, 'an': 7, 'amazing': 8, 'experience': 9, 'hate': 10, 'waiting': 11, 'in': 12, 'line': 13, 'the': 14, 'worst': 15, 'service': 16, 'absolutely': 17, 'fantastic': 18}

Sequences: [[3, 5, 2, 6], [2, 4, 7, 8, 9], [3, 10, 11, 12, 13], [2, 4, 14, 15, 16], [17, 18]]

Padded Sequences (Shape: (5, 5)):
[[ 3  5  2  6  0]
 [ 2  4  7  8  9]
 [ 3 10 11 12 13]
 [ 2  4 14 15 16]
 [17 18  0  0  0]]






Training finished after 50 epochs.
Final Training Accuracy: 80.00%

--- Testing Model Predictions ---
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 190ms/step
Text: 'I love this!' -> Prediction Score: 0.8529 -> Sentiment: Positive (1)
Text: 'Worst experience ever' -> Prediction Score: 0.7783 -> Sentiment: Positive (1)
Text: 'It was okay' -> Prediction Score: 0.8193 -> Sentiment: Positive (1)


In [5]:
# Question 9. Using spaCy, build a simple NLP pipeline that includes tokenization, lemmatization, and entity recognition.

import spacy

# Load the English NLP model
# You might need to run `python -m spacy download en_core_web_sm` in your terminal first
nlp = spacy.load("en_core_web_sm")

# The provided text dataset
text = "Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the development of India's atomic energy program. He was the founding director of the Tata Institute of Fundamental Research (TIFR) and was instrumental in establishing the Atomic Energy Commission of India."

# Process the text with the spaCy pipeline
doc = nlp(text)

# --- 1. Tokenization and Lemmatization ---
print("--- Token, Lemma, POS Tag ---")
for token in doc:
    # Print the token text, its lemma, and its part-of-speech tag for clarity
    print(f"{token.text:<20} {token.lemma_:<20} {token.pos_:<10}")

print("\n" + "="*40 + "\n")

# --- 2. Named Entity Recognition (NER) ---
print("--- Named Entities ---")
for ent in doc.ents:
    # Print the entity text and its label (type)
    print(f"{ent.text:<40} {ent.label_:<10}")

--- Token, Lemma, POS Tag ---
Homi                 Homi                 PROPN     
Jehangir             Jehangir             PROPN     
Bhaba                Bhaba                PROPN     
was                  be                   AUX       
an                   an                   DET       
Indian               indian               ADJ       
nuclear              nuclear              ADJ       
physicist            physicist            NOUN      
who                  who                  PRON      
played               play                 VERB      
a                    a                    DET       
key                  key                  ADJ       
role                 role                 NOUN      
in                   in                   ADP       
the                  the                  DET       
development          development          NOUN      
of                   of                   ADP       
India                India                PROPN     
's              

Question 10: You are working on a chatbot for a mental health platform. Explain how
you would leverage LSTM or GRU networks along with libraries like spaCy or Stanford
NLP to understand and respond to user input effectively. Detail your architecture, data
preprocessing pipeline, and any ethical considerations.
(Include your Python code and output in the code box below.)
  - For starters, mental health disorders affect an estimated 792 million people worldwide.
  That’s roughly 1 in 10 people globally.
  In Canada, where I’m from, the problem is even worse: 1 in 5 Canadians experience a mental illness or addiction problem every year, and 1 in 2 will have experienced one by the time they reach the age of 40.
  About 70% of mental health problems begin during childhood or adolescence, and young people experience higher rates than any other age group. This matters because mental illness can reduce life expectancy by 10–20 years.
  

In [6]:
# Answer 10.

import numpy as np
import nltk
# nltk.download('punkt')
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def tokenize(sentence):
    """
    split sentence into array of words/tokens
    a token can be a word or punctuation character, or number
    """
    return nltk.word_tokenize(sentence)
def stem(word):
    """
    stemming = find the root form of the word
    examples:
    words = ["organize", "organizes", "organizing"]
    words = [stem(w) for w in words]
    -> ["organ", "organ", "organ"]
    """
    return stemmer.stem(word.lower())
def bag_of_words(tokenized_sentence, words):
    """
    return bag of words array:
    1 for each known word that exists in the sentence, 0 otherwise
    example:
    sentence = ["hello", "how", "are", "you"]
    words = ["hi", "hello", "I", "you", "bye", "thank", "cool"]
    bag   = [  0 ,    1 ,    0 ,   1 ,    0 ,    0 ,      0]
    """
    # stem each word
    sentence_words = [stem(word) for word in tokenized_sentence]
    # initialize bag with 0 for each word
    bag = np.zeros(len(words), dtype=np.float32)
    for idx, w in enumerate(words):
        if w in sentence_words:
            bag[idx] = 1
    return bag
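
A bag-of-words vector records which words occur but not their order, whereas an LSTM or GRU reads tokens one step at a time. The sketch below shows one possible way to turn tokenized sentences into padded integer sequences for such a model. The helper names `build_vocab` and `encode`, the reserved indices 0 (padding) and 1 (unknown), and the sample sentences are all assumptions made for illustration.

```python
def build_vocab(tokenized_sentences):
    """Assign an integer id to every distinct token; 0 = padding, 1 = unknown."""
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in tokenized_sentences:
        for token in sentence:
            # setdefault keeps the existing id if the token was already seen
            vocab.setdefault(token.lower(), len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, truncate to max_len, then right-pad with the padding id."""
    ids = [vocab.get(t.lower(), vocab["<UNK>"]) for t in tokens][:max_len]
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))


# Hypothetical training sentences, already tokenized
training = [["I", "feel", "anxious", "today"], ["I", "feel", "fine"]]
vocab = build_vocab(training)

print(encode(["I", "feel", "anxious"], vocab, max_len=5))        # [2, 3, 4, 0, 0]
print(encode(["Today", "I", "feel", "lost"], vocab, max_len=5))  # [5, 2, 3, 1, 0]
```

Sequences in this form can be passed to an embedding layer followed by an LSTM or GRU layer, so the model sees the words in the order the user wrote them.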