
Question 1: Document Similarity Using Similarity Measures
Objective: Measure the similarity between two given textual documents using two different similarity measures.


Task:
• Select two different similarity measures (e.g., Cosine Similarity, Jaccard Similarity, Euclidean Distance, or Word Mover's Distance).
• Implement an algorithm to compute similarity between two given documents (corpus or paragraphs).
• Compare and analyze the results obtained from both similarity measures.
• Discuss how different similarity measures impact document comparison in NLP applications such as information retrieval and text clustering.
Expected Outcome:
• A comparison of similarity scores for different methods.
• Insights on which method works best for different types of text (e.g., short vs. long documents).

In [2]:
import numpy as np
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Sample documents
doc1 = "Football is a popular sport played worldwide."
doc2 = "Soccer is a widely loved game enjoyed by millions."

# Build the stopword set once; calling stopwords.words() inside the loop
# would reload the list for every token.
STOP_WORDS = set(stopwords.words('english'))

# Preprocessing function
def preprocess(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenize words
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(tokens)

# Apply preprocessing
doc1_clean = preprocess(doc1)
doc2_clean = preprocess(doc2)

# 1. Cosine Similarity (using TF-IDF)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc1_clean, doc2_clean])
cos_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]

# 2. Jaccard Similarity (set-based)
set1, set2 = set(doc1_clean.split()), set(doc2_clean.split())
jaccard_sim = len(set1.intersection(set2)) / len(set1.union(set2))

print(f"Cosine Similarity: {cos_sim:.4f}")
print(f"Jaccard Similarity: {jaccard_sim:.4f}")



Cosine Similarity: 0.0000
Jaccard Similarity: 0.0000
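
Both scores are zero because, after stopword removal, the two sample sentences share no tokens; neither measure knows that "football" and "soccer" are related. The sketch below reproduces this in plain Python so the overlap can be inspected directly. It rests on assumptions that are not in the notebook: a small hand-written `STOPWORDS` set stands in for NLTK's English list, cosine is computed on raw term counts rather than TF-IDF weights, and `doc3` is an invented sentence added only to show a non-zero case.

```python
import math
from collections import Counter

# Hypothetical mini stopword list, standing in for NLTK's English list.
STOPWORDS = {"is", "a", "by", "the"}

def tokens(text):
    """Lowercase, drop full stops, split on whitespace, remove stopwords."""
    words = text.lower().replace(".", "").split()
    return [w for w in words if w not in STOPWORDS]

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of distinct tokens."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the angle between raw term-count vectors (not TF-IDF)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "Football is a popular sport played worldwide."
doc2 = "Soccer is a widely loved game enjoyed by millions."
doc3 = "Football is a sport loved by millions worldwide."  # invented for illustration

t1, t2, t3 = tokens(doc1), tokens(doc2), tokens(doc3)

print("Shared tokens doc1/doc2:", set(t1) & set(t2))  # empty set
print(f"doc1 vs doc2: Jaccard={jaccard(t1, t2):.4f}, Cosine={cosine(t1, t2):.4f}")
print(f"doc1 vs doc3: Jaccard={jaccard(t1, t3):.4f}, Cosine={cosine(t1, t3):.4f}")
```

With three shared tokens out of seven distinct ones, `doc1` and `doc3` get a Jaccard score of 3/7, while cosine on the count vectors gives 0.6. Both measures are purely lexical, so capturing the football/soccer relationship would need an embedding-based measure such as Word Mover's Distance.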




Question 2: Sentiment Analysis Using Pretrained NLP Libraries (TextBlob, VADER, or Flair)
Objective: Implement sentiment analysis using a pretrained NLP library and analyze sentiment features such as polarity and subjectivity.


Task:
• Use TextBlob to analyze the text (any other library may be used instead).
• Preprocess the text dataset and apply the chosen library for sentiment extraction.
• Analyze two key features:
o Polarity: Determines whether the sentiment is positive, negative, or neutral (range: -1 to +1).
o Subjectivity: Measures the degree of opinion versus factual information (range: 0 to 1).

In [3]:
from textblob import TextBlob

# Sample texts
text1 = "I absolutely love football! It's the best sport in the world."
text2 = "The match was okay, but I expected a better performance."

# Function to analyze sentiment
def analyze_sentiment(text):
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    return polarity, subjectivity

# Analyze both texts
for text in [text1, text2]:
    polarity, subjectivity = analyze_sentiment(text)
    print(f"Text: {text}")
    print(f"  ➝ Polarity: {polarity:.2f} (Positive if > 0, Negative if < 0, Neutral if ~0)")
    print(f"  ➝ Subjectivity: {subjectivity:.2f} (Factual if ~0, Opinionated if ~1)\n")

Text: I absolutely love football! It's the best sport in the world.
  ➝ Polarity: 0.81 (Positive if > 0, Negative if < 0, Neutral if ~0)
  ➝ Subjectivity: 0.45 (Factual if ~0, Opinionated if ~1)

Text: The match was okay, but I expected a better performance.
  ➝ Polarity: 0.30 (Positive if > 0, Negative if < 0, Neutral if ~0)
  ➝ Subjectivity: 0.47 (Factual if ~0, Opinionated if ~1)
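
A raw polarity score still needs an interpretation rule before it becomes a label. The helper below is a minimal sketch of one such rule: scores inside a small band around zero count as neutral. The function name `label_polarity` and the band width of 0.1 are assumptions chosen for illustration, not part of TextBlob.

```python
def label_polarity(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    neutral_band is a hypothetical threshold: scores whose absolute
    value is at most neutral_band are treated as neutral.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

# Scores taken from the output above, plus two made-up values.
for score in (0.81, 0.30, 0.05, -0.40):
    print(f"{score:+.2f} -> {label_polarity(score)}")
```

Under this rule the second text (0.30) is still labeled positive even though it reads as lukewarm, which is a reminder that lexicon-based scores can miss contrast markers such as "but".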



Question 3: Sentiment Analysis Using Bayesian Classification
Objective: Implement a Naïve Bayes classifier for sentiment analysis of textual data.


Task:
• Preprocess a given dataset (e.g., stopword removal, stemming, and tokenization).
• Use Multinomial Naïve Bayes or Bernoulli Naïve Bayes for classification.
• Train the classifier on a labeled dataset (e.g., IMDb, Twitter sentiment dataset).
• Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
• Discuss the impact of feature selection (e.g., TF-IDF vs. Bag of Words) on classification performance.
Expected Outcome:
• Understanding of probabilistic approaches in sentiment classification.
• Evaluation of bias and variance in Naïve Bayes for NLP tasks.
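
To make the probabilistic approach concrete, here is a minimal from-scratch sketch of Multinomial Naïve Bayes with Laplace smoothing on bag-of-words counts. It is not how scikit-learn implements `MultinomialNB`; the function names `train_nb` and `predict_nb` and the four toy reviews are invented for illustration, and words unseen in training are simply skipped at prediction time.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log P(class)
        for w in doc.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed log P(word | class)
            score += math.log(
                (word_counts[label][w] + alpha)
                / (total_words + alpha * len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set (invented): 1 = positive, 0 = negative.
docs = [
    "great film loved it",
    "terrible plot boring film",
    "loved the acting great fun",
    "boring and terrible",
]
labels = [1, 0, 1, 0]

model = train_nb(docs, labels)
print(predict_nb(model, "great acting"))
print(predict_nb(model, "boring plot"))
```

The smoothing term `alpha` keeps a word that never occurred in a class from driving that class's probability to zero, which is the main source of variance on small datasets.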

In [6]:
!pip install datasets

from datasets import load_dataset

# Load IMDb dataset
dataset = load_dataset("imdb")

# Extract train and test data
train_texts = dataset['train']['text']
train_labels = dataset['train']['label']
test_texts = dataset['test']['text']
test_labels = dataset['test']['label']

# Check dataset size
print(f"Training samples: {len(train_texts)}")
print(f"Testing samples: {len(test_texts)}")

# Print a sample
print("Sample review:", train_texts[0])
print("Label:", train_labels[0])  # 0 = Negative, 1 = Positive

Training samples: 25000
Testing samples: 25000
Sample review: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. 

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

# Use the IMDb dataset already loaded from Hugging Face
train_texts, test_texts = dataset['train']['text'], dataset['test']['text']
train_labels, test_labels = dataset['train']['label'], dataset['test']['label']

# Create a text classification pipeline with TF-IDF + Naïve Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(train_texts, train_labels)

# Predict on test data
y_pred = model.predict(test_texts)

# Evaluate performance
accuracy = accuracy_score(test_labels, y_pred)
report = classification_report(test_labels, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:\n', report)

Accuracy: 0.8296
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.89      0.84     12500
           1       0.87      0.77      0.82     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000
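
The per-class figures in the report follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them by hand; the helper names `binary_metrics` and `f1_from` and the six-element label lists are made up for illustration. The last lines recompute F1 from the rounded precision and recall printed in the report above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Invented toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))

# Consistency check against the report's rounded values.
print(f"class 0 F1: {f1_from(0.79, 0.89):.2f}")
print(f"class 1 F1: {f1_from(0.87, 0.77):.2f}")
```

The asymmetry in the report (higher recall for class 0, higher precision for class 1) indicates that the classifier leans toward predicting the negative class.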



Question 4: Sentiment Analysis Using Recurrent Neural Networks (RNN)

Objective: Build a Recurrent Neural Network (RNN)-based sentiment classifier.
Task:
• Preprocess text data (tokenization, word embeddings, padding).
• Implement a Simple RNN using TensorFlow/Keras or PyTorch.
• Train the model on a dataset (e.g., Amazon Reviews, Twitter Sentiment Dataset).
• Analyze the performance of RNN in capturing sequential dependencies in text.
• Compare RNN performance with Naïve Bayes in sentiment analysis.
Expected Outcome:
• Understanding how sequential patterns in text influence sentiment classification.
• Limitations of RNNs (e.g., vanishing gradient problem).
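
The vanishing gradient problem can be seen in a one-dimensional toy RNN. The sketch below assumes a scalar recurrence h_t = tanh(w * h_{t-1} + u * x_t) and tracks the magnitude of dh_t/dh_0, which is the product of the per-step factors w * (1 - h_t^2). The weights `w=0.5` and `u=1.0` and the constant input are arbitrary illustrative choices.

```python
import math

def rnn_gradient_norms(inputs, w=0.5, u=1.0):
    """Return |d h_t / d h_0| for each step of a scalar tanh RNN."""
    h = 0.0
    grad = 1.0
    norms = []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        norms.append(abs(grad))
    return norms

norms = rnn_gradient_norms([0.1] * 50)
for step in (1, 10, 25, 50):
    print(f"step {step:2d}: |dh_t/dh_0| = {norms[step - 1]:.3e}")
```

Because every factor has magnitude at most |w| < 1, the product shrinks geometrically, so the earliest tokens of a long review contribute almost nothing to the gradient. Gated architectures such as LSTM and GRU were designed to ease this.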

In [2]:
!pip install torch datasets transformers


In [3]:
from datasets import load_dataset

dataset = load_dataset("amazon_polarity")
train_data = dataset["train"]
test_data = dataset["test"]

# Extract text and labels
train_texts, train_labels = train_data["content"][:50000], train_data["label"][:50000]
test_texts, test_labels = test_data["content"][:10000], test_data["label"][:10000]



In [4]:
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize & Pad
train_enc = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
test_enc = tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt")

train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)


In [None]:
from torch.utils.data import TensorDataset, DataLoader

batch_size = 64
train_dataset = TensorDataset(train_enc["input_ids"], train_labels)
test_dataset = TensorDataset(test_enc["input_ids"], test_labels)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [1]:
import torch.nn as nn

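class SentimentRNN(nn.Module):
    """Embedding -> single-layer vanilla RNN -> linear classifier."""

    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.embedding(x)  # (batch, seq_len) -> (batch, seq_len, embed_size)
        _, h = self.rnn(x)     # h: (1, batch, hidden_size), state after the last time step
        # Inputs are padded to a common length, so for every sequence shorter
        # than the longest one the last time step is a padding token.
        return self.fc(h.squeeze(0))  # (batch, output_size) logits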

# Initialize model
vocab_size = tokenizer.vocab_size
model = SentimentRNN(vocab_size, embed_size=128, hidden_size=128, output_size=2)


In [7]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 3
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}")


Epoch 1, Loss: 0.6958
Epoch 2, Loss: 0.6940
Epoch 3, Loss: 0.6946


In [8]:
from sklearn.metrics import accuracy_score

model.eval()
predictions, true_labels = [], []

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        preds = torch.argmax(outputs, dim=1)
        predictions.extend(preds.tolist())
        true_labels.extend(labels.tolist())

accuracy = accuracy_score(true_labels, predictions)
print(f"Test Accuracy: {accuracy:.4f}")

Test Accuracy: 0.5125
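
A training loss that stays near 0.69 (about ln 2) together with roughly 51% test accuracy is chance level for two classes. One plausible contributor is that the model reads the hidden state after the final time step, which for most reviews comes after a long run of padding tokens. The sketch below shows one common remedy, packing the sequences so the RNN stops at each review's last real token. It is an illustration under assumptions, not a verified fix for this notebook: the class name `PackedSentimentRNN` is invented, padding is assumed to be on the right, and `pad_id=0` is assumed (in practice it would be read from `tokenizer.pad_token_id`).

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedSentimentRNN(nn.Module):
    """Vanilla RNN classifier that ignores right-padding via packing."""

    def __init__(self, vocab_size, embed_size, hidden_size, output_size, pad_id=0):
        super().__init__()
        self.pad_id = pad_id  # assumed padding token id
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_id)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, input_ids):
        # Number of non-padding tokens per sequence (at least 1 to keep packing valid).
        lengths = (input_ids != self.pad_id).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(input_ids), lengths,
            batch_first=True, enforce_sorted=False,
        )
        _, h = self.rnn(packed)  # h[-1]: state at each sequence's last real token
        return self.fc(h[-1])    # (batch, output_size) logits

# Tiny smoke test with made-up token ids; 0 is the padding id.
model = PackedSentimentRNN(vocab_size=100, embed_size=8, hidden_size=16, output_size=2)
short = torch.tensor([[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]])
longer = torch.cat([short, torch.zeros(2, 3, dtype=torch.long)], dim=1)

with torch.no_grad():
    out_short = model(short)
    out_longer = model(longer)

print(out_short.shape)
print(torch.allclose(out_short, out_longer, atol=1e-6))
```

The second print checks that appending extra padding does not change the logits, which is the property the unpacked model lacks. Capping the tokenizer's `max_length` would also shorten the sequences the RNN has to carry information across.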
