<a href="https://colab.research.google.com/github/rahul-727/NLP-Lab-work/blob/main/Rahul_544_Lab_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Find the similarity between two documents



*   Cosine Similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In text analysis, these vectors usually hold term weights for each document (term frequency or TF-IDF vectors). A cosine similarity of 1 means the two vectors point in the same direction, that is, the documents use the same terms in the same proportions; a value of 0 means they share no terms.
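
To make the definition concrete, here is a minimal sketch that computes cosine similarity by hand from raw term counts. The helper name `cosine_from_counts` and the lowercase whitespace tokenization are assumptions made for illustration; they are not part of the lab code, which uses scikit-learn and TF-IDF weights.

```python
import math
from collections import Counter

def cosine_from_counts(text_a, text_b):
    """Cosine similarity of two texts using raw term counts as vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Cosine is undefined for a zero vector; returning 0.0 is a convention chosen here
        return 0.0
    return dot / (norm_a * norm_b)

# Two of three terms are shared, so the score is about 0.667
print(cosine_from_counts("the cat sat", "the cat ran"))
# No shared terms, so the score is 0.0
print(cosine_from_counts("the cat sat", "dogs bark loudly"))
```

Because term counts are never negative, the score stays between 0 and 1 for text vectors.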



In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open('/content/text1.txt', 'r', encoding='utf-8') as file:
    doc1 = file.read()

with open('/content/text2.txt', 'r', encoding='utf-8') as file:
    doc2 = file.read()

documents = [doc1, doc2]

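# Fit TF-IDF on both documents so they share one vocabulary
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Compare row 0 (doc1) with row 1 (doc2); the result is a 1x1 matrix
cos_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
print("Cosine Similarity between the two documents:", cos_sim[0, 0])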


Cosine Similarity between the two documents: 0.4356132262843411


TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical representation of a document (a piece of text) that captures the importance of each word within that document relative to a collection of documents.

Term Frequency (TF): This measures how often a word appears in a document. If a word appears more frequently in a document, its TF value will be higher.

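Inverse Document Frequency (IDF): This measures how unique or rare a word is across all the documents in the collection. If a word appears in many documents, its IDF value will be lower.

The TF-IDF weight of a word is the product of its TF and its IDF, so words that are frequent in one document but rare across the collection receive the highest weights.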
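
The sketch below works through TF and IDF on a tiny made-up corpus. It assumes the textbook form `tf * log(N / df)` with lowercase whitespace tokens; the function name `tf_idf` and the three example sentences are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # tf is the term's share of the document; idf shrinks for common terms
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the flight was late", "the flight was great", "the crew was friendly"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this corpus "the" and "was" occur in every document, so their IDF is log(1) = 0 and they carry no weight, while "late" (one document) outweighs "flight" (two documents).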

* Jaccard Similarity measures the similarity between two sets as the size of their intersection divided by the size of their union. For text analysis, each document is converted into a set of tokens, so the score is the share of distinct tokens that the two documents have in common. It ranges from 0 (no shared tokens) to 1 (identical token sets).
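
Written as a formula, with a small worked example (the two token sets are made up for illustration):

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

For $A = \{\text{the}, \text{flight}, \text{was}, \text{late}\}$ and $B = \{\text{the}, \text{flight}, \text{was}, \text{great}\}$, the intersection has 3 tokens and the union has 5, so

$$
J(A, B) = \frac{3}{5} = 0.6
$$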

In [12]:
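def jaccard_similarity(doc1, doc2):
    # Treat each document as a set of whitespace-separated tokens
    set1 = set(doc1.split())
    set2 = set(doc2.split())
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    # Guard against division by zero when both documents are empty
    return intersection / union if union else 0.0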

jac_sim = jaccard_similarity(doc1, doc2)
print("Jaccard Similarity between the two documents:", jac_sim)


Jaccard Similarity between the two documents: 0.09649122807017543


# 2. Implement the Sentiment Analysis using Bayesian Classification.

In [13]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Bayesian Classification

Bayesian classification is a machine learning method that predicts the class of a data point by estimating the probability that it belongs to each class and choosing the most probable one. It is based on Bayes' theorem, a fundamental result in probability theory.

* Here we analyze the sentiment of a statement to decide whether it is positive or negative.
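
For reference, the multinomial naive Bayes rule commonly used for text can be written as follows. Add-one (Laplace) smoothing is an assumption here; it is one common way to keep unseen words from giving a class zero probability. $V$ is the vocabulary and $w_1, \dots, w_n$ are the tokens of the text being classified.

$$
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big],
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on longer texts.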

In [14]:
import math
from collections import defaultdict

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

df = pd.read_csv('/content/Tweets.csv')

# Keep only the two classes this classifier distinguishes; neutral tweets are dropped
dataset = df[df['airline_sentiment'].isin(['positive', 'negative'])][['text', 'airline_sentiment']]

stop_words = set(stopwords.words('english'))

def tokenize(text):
    # Lowercase, keep alphanumeric tokens, and drop English stop words
    return [w.lower() for w in word_tokenize(text) if w.isalnum() and w.lower() not in stop_words]

# word_freq[word] = [count in negative tweets, count in positive tweets]
word_freq = defaultdict(lambda: [0, 0])
doc_count = [0, 0]  # [negative tweets, positive tweets]
for text, label in zip(dataset['text'], dataset['airline_sentiment']):
    idx = int(label == 'positive')
    doc_count[idx] += 1
    for word in tokenize(text):
        word_freq[word][idx] += 1

vocab_size = len(word_freq)
total_negative = sum(counts[0] for counts in word_freq.values())
total_positive = sum(counts[1] for counts in word_freq.values())

# Class priors come from the share of tweets in each class, not from word counts
log_prior_negative = math.log(doc_count[0] / sum(doc_count))
log_prior_positive = math.log(doc_count[1] / sum(doc_count))

def classify(text):
    log_prob_negative = log_prior_negative
    log_prob_positive = log_prior_positive
    for word in tokenize(text):
        # .get avoids adding unseen words to the training counts
        neg_count, pos_count = word_freq.get(word, (0, 0))
        # Laplace (add-one) smoothing keeps unseen words from zeroing out a class
        log_prob_negative += math.log((neg_count + 1) / (total_negative + vocab_size))
        log_prob_positive += math.log((pos_count + 1) / (total_positive + vocab_size))
    return 'positive' if log_prob_positive > log_prob_negative else 'negative'

test_data = ["@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA"]
for text in test_data:
    sentiment = classify(text)
    print(f"Sentiment of '{text}': {sentiment}")


Sentiment of '@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA': negative


# 3. Implement the Sentiment Analysis using RNN.

* RNNs (recurrent neural networks) are a class of artificial neural networks designed to recognize patterns in sequential data such as text, genomes, handwriting, or numerical time series. Unlike feedforward networks, RNNs contain loops that carry a hidden state from one time step to the next, so each output depends on the current input and on what the network has already seen. This makes them well suited to sequential tasks such as language processing. The model in this section uses a Keras `LSTM` layer, which is one kind of recurrent layer.
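
The loop described above can be sketched with a single-unit recurrent cell in plain Python. This is an illustration only: the function names `rnn_step` and `run_rnn` and the weight values are hypothetical and untrained, and a real layer uses weight matrices rather than scalars.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One simple RNN step: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0  # initial hidden state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# Same values in a different order: the final state differs because h carries history
print(run_rnn([1.0, 0.0, 0.0])[-1])
print(run_rnn([0.0, 0.0, 1.0])[-1])
```

The two sequences contain the same numbers, yet they end in different hidden states, which is the order sensitivity that makes recurrent models useful for text.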

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

In [16]:
df = pd.read_csv('/content/Tweets.csv')

In [17]:
texts = df['text'].tolist()
labels = df['airline_sentiment'].tolist()

label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

labels_one_hot = to_categorical(labels_encoded)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

max_len = max(len(seq) for seq in sequences)
sequences_padded = pad_sequences(sequences, maxlen=max_len, padding='post')

X_train, X_test, y_train, y_test = train_test_split(sequences_padded, labels_one_hot, test_size=0.2, random_state=42)

model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=32, input_length=max_len))
model.add(LSTM(32))
model.add(Dense(3, activation='softmax'))  # Use 3 neurons in the output layer, with a softmax activation function

model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)  # Adjust batch size for efficiency

loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 1.017911434173584
Test Accuracy: 0.7616119980812073


# 4. Implement the Sentiment Analysis using LSTM.

* Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) that works well for sentiment analysis because they can capture long-term dependencies in text. This helps the model use context from earlier in a sentence, which matters when sentiment depends on words that are far apart.
* LSTMs were designed to address the vanishing gradient problem of traditional RNNs, which lets them learn from and retain information over long sequences.
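
For reference, one common formulation of a single LSTM step is shown below. The symbol names are the usual textbook ones, not taken from this notebook: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The cell state $c_t$ is updated additively, gated by $f_t$ and $i_t$, which is what lets gradients flow across many time steps more easily than in a plain RNN.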

In [18]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense, SpatialDropout1D
from keras.callbacks import EarlyStopping


In [19]:
data = pd.read_csv("/content/Tweets.csv")

In [20]:
X = data['text']
y = pd.get_dummies(data['airline_sentiment']).values

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)
X = pad_sequences(X, maxlen=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Sequential()
model.add(Embedding(5000, 128, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=3, verbose=1, restore_best_weights=True)

history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test), callbacks=[early_stop])

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test Accuracy:', accuracy)




Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 5: early stopping
Test Accuracy: 0.8005464673042297
