
# **INFO5731 Assignment Three**

In this assignment, you are required to conduct information extraction and semantic analysis based on **the dataset you collected from assignment two**. You may use the scipy and numpy packages in this assignment.

# **Question 1: Understand N-gram**

(45 points). Write a Python program to conduct N-gram analysis based on the dataset from your assignment two:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the **noun phrases** and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print out the result in a table with all the noun phrases as column names and all 100 reviews (abstracts, or tweets) as row names.
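
The bigram formula in part (2) divides each bigram count by the count of the bigram's first word, as in the count(really like) / count(really) example. A minimal sketch in plain Python follows; the function name `bigram_probabilities` and the toy sentence are assumptions made for illustration, chosen so that "really" occurs three times and "really like" once.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Return {(first, second): count(first second) / count(first)}."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        bigram: count / unigram_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }

# Hypothetical toy text: "really" occurs three times, "really like" once.
tokens = "i really like it i really really do".split()
probs = bigram_probabilities(tokens)
print(round(probs[("really", "like")], 2))  # 1 / 3, rounded to 0.33
```

Counting unigrams and bigrams separately keeps the denominator tied to single-word counts, which is what the formula asks for.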


In [None]:
!pip install nltk




In [None]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.probability import FreqDist
from nltk.tag import pos_tag
from collections import Counter
import pandas as pd

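# Downloading the necessary NLTK resources
nltk.download('punkt')  # tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab')
nltk.download('averaged_perceptron_tagger')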


dataset = [
    "This is a sample sentence for N-gram analysis.",
    "N-gram analysis helps in understanding language patterns.",
    # Add more sentences from your dataset here
]

# Function to extract N-grams and calculate frequency
def calculate_ngram_frequency(text, n):
    tokens = word_tokenize(text.lower())
    n_grams = list(ngrams(tokens, n))
    return FreqDist(n_grams)

# Function to calculate bigram probabilities
def calculate_bigram_probabilities(text):
    tokens = word_tokenize(text.lower())
    bigrams = list(ngrams(tokens, 2))

    # Counting of bigrams
    bigram_freq = FreqDist(bigrams)

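    # Counting single words so each bigram can be divided by the count of its first word
    unigram_freq = FreqDist(tokens)

    # count(first second) / count(first), e.g. count(really like) / count(really)
    probabilities = {bigram: count / unigram_freq[bigram[0]] for bigram, count in bigram_freq.items()}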

    return probabilities

# Function to extract noun phrases and calculate relative probabilities
def calculate_noun_phrase_probabilities(text, all_texts):
    # Use POS tagging to extract noun phrases
    tagged_tokens = pos_tag(word_tokenize(text))
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(tagged_tokens)

    noun_phrases = [' '.join(leaf[0] for leaf in subtree.leaves()) for subtree in tree.subtrees() if subtree.label() == 'NP']

    # If no noun phrases are detected, return an empty dictionary
    if not noun_phrases:
        return {}

    # Calculating frequency of each noun phrase in the whole dataset
    noun_phrase_counter = Counter(noun_phrases)

    # Calculating relative probabilities: frequency(noun phrase) / max frequency(noun phrase)
    max_frequency = max(noun_phrase_counter.values())
    relative_probabilities = {noun_phrase: count / max_frequency for noun_phrase, count in noun_phrase_counter.items()}

    return relative_probabilities

# Function to calculate frequency of a noun phrase in the whole dataset
def calculate_noun_phrase_frequency(noun_phrase):
    return sum(text.count(noun_phrase) for text in dataset)

# Creating a DataFrame to display results
result_df = pd.DataFrame()

# Performing N-gram analysis (N=3)
n = 3
for i, text in enumerate(dataset):
    ngram_freq = calculate_ngram_frequency(text, n)
    result_df[f'Review {i+1}'] = pd.Series(ngram_freq)

# Performing bigram analysis and add to the DataFrame
bigram_probabilities = calculate_bigram_probabilities(' '.join(dataset))
result_df['Bigram Probabilities'] = pd.Series(bigram_probabilities)

# Performing noun phrase analysis and add to the DataFrame
noun_phrase_probabilities = calculate_noun_phrase_probabilities(' '.join(dataset), dataset)
result_df['Noun Phrase Probabilities'] = pd.Series(noun_phrase_probabilities)

# Displaying the result DataFrame
result_df.fillna(0, inplace=True)
result_df


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Felloh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


| Token 1 | Token 2 | Token 3 | Review 1 | Review 2 | Bigram Probabilities | Noun Phrase Probabilities |
|---|---|---|---|---|---|---|
| this | is | a | 1 | 0.0 | 0 | 0.0 |
| is | a | sample | 1 | 0.0 | 0 | 0.0 |
| a | sample | sentence | 1 | 0.0 | 0 | 0.0 |
| sample | sentence | for | 1 | 0.0 | 0 | 0.0 |
| sentence | for | n-gram | 1 | 0.0 | 0 | 0.0 |
| for | n-gram | analysis | 1 | 0.0 | 0 | 0.0 |
| n-gram | analysis | . | 1 | 0.0 | 0 | 0.0 |
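
Part (3) of Question 1 asks for a table with one row per review and one column per noun phrase. The sketch below shows one possible reading of it in plain Python. It assumes the noun phrases have already been extracted for each review (the lists are hypothetical), and it interprets "max frequency" as the highest count a phrase reaches in any single review. The helper name `relative_frequency_table` is also an assumption.

```python
from collections import Counter

# Hypothetical noun phrases already extracted from three reviews.
review_phrases = [
    ["a sample sentence", "analysis", "analysis"],
    ["analysis", "language"],
    ["a sample sentence"],
]

def relative_frequency_table(phrases_per_review):
    """Rows are reviews, columns are noun phrases, cells are
    count in the review / highest count of that phrase in any review."""
    counts = [Counter(phrases) for phrases in phrases_per_review]
    columns = sorted(set().union(*counts))
    max_count = {p: max(c[p] for c in counts) for p in columns}
    rows = [[c[p] / max_count[p] for p in columns] for c in counts]
    return columns, rows

columns, rows = relative_frequency_table(review_phrases)
print("review", *columns, sep=" | ")
for i, row in enumerate(rows, start=1):
    print(f"Review {i}", *(f"{v:.2f}" for v in row), sep=" | ")
```

If "max frequency" is instead meant as a single maximum over all phrases in the dataset, only the `max_count` line would need to change.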


# **Question 2: Understand TF-IDF and Document representation**

(20 points). Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the **document-term weights (tf*idf) matrix**.

(2) To rank the documents with respect to query (design a query by yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using **cosine similarity**.
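
To make the two steps concrete, here is a small sketch in plain Python of tf*idf weighting followed by cosine-similarity ranking. It uses the textbook weight tf * log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ. The helper names (`build_idf`, `vectorize`, `cosine`) and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def build_idf(token_lists):
    """idf(t) = log(N / df(t)) over the document collection."""
    n_docs = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(n_docs / df[t]) for t in df}

def vectorize(tokens, idf, vocab):
    """tf*idf vector over vocab; terms outside vocab are ignored."""
    tf = Counter(tokens)
    length = max(len(tokens), 1)
    return [tf[t] / length * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy documents and query, already lowercased and stop-word free.
docs = [
    "outstanding movie haunting performance".split(),
    "engaging plot".split(),
    "gripping movie brilliant performances".split(),
]
query = "outstanding haunting movie".split()

idf = build_idf(docs)
vocab = sorted(idf)
doc_vectors = [vectorize(d, idf, vocab) for d in docs]
query_vector = vectorize(query, idf, vocab)

scores = [cosine(query_vector, d) for d in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(f"Document {i + 1}: {scores[i]:.3f}")
```

Here the IDF is computed from the documents only, and the query is weighted with those document-level IDF values. This is one common choice, not the only one.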

In [None]:
# Write your code here
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


dataset = [
    "This movie is outstanding with a haunting performance and best character development.",
    "The plot is engaging and the characters are well-developed.",
    "An intense and gripping movie with brilliant character performances.",

]

# Designing a query
query = "An Outstanding movie with a haunting performance and best character development"

# Combining the query with the dataset
documents = dataset + [query]

# Creating a TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Converting the TF-IDF matrix to a DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out(), index=range(1, len(documents) + 1))

# Extracting the TF-IDF vectors for the query and documents
query_tfidf_vector = tfidf_df.iloc[-1].values.reshape(1, -1)
document_tfidf_vectors = tfidf_df.iloc[:-1]

# Calculating cosine similarity between the query and each document
cosine_similarities = cosine_similarity(query_tfidf_vector, document_tfidf_vectors).flatten()

# Creating a DataFrame to display the ranking
ranking_df = pd.DataFrame({'Document': range(1, len(documents)), 'Cosine Similarity': cosine_similarities})

# Ranking the documents based on cosine similarity
ranking_df = ranking_df.sort_values(by='Cosine Similarity', ascending=False)

# Displaying the TF-IDF matrix and document ranking
print("TF-IDF Matrix:")
print(tfidf_df)
print("\nDocument Ranking:")
print(ranking_df)







TF-IDF Matrix:
       best  brilliant  character  characters  developed  development  \
1  0.398067   0.000000   0.322269         0.0        0.0     0.398067   
2  0.000000   0.000000   0.000000         0.5        0.5     0.000000   
3  0.000000   0.455732   0.290888         0.0        0.0     0.000000   
4  0.398067   0.000000   0.322269         0.0        0.0     0.398067   

   engaging  gripping  haunting   intense     movie  outstanding  performance  \
1       0.0  0.000000  0.398067  0.000000  0.322269     0.398067     0.398067   
2       0.5  0.000000  0.000000  0.000000  0.000000     0.000000     0.000000   
3       0.0  0.455732  0.000000  0.455732  0.290888     0.000000     0.000000   
4       0.0  0.000000  0.398067  0.000000  0.322269     0.398067     0.398067   

   performances  plot  
1      0.000000   0.0  
2      0.000000   0.5  
3      0.455732   0.0  
4      0.000000   0.0  

Document Ranking:
   Document  Cosine Similarity
0         1           1.000000
2         3 

# **Question 3: Create your own word embedding model**

(20 points). Use the data you collected for assignment two to build a word embedding model:

(1) Train a 300-dimension word embedding (it can be word2vec, glove, ulmfit, bert, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/
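
One common way to visualize 300-dimensional embeddings is to project them onto two dimensions with PCA and plot the resulting coordinates. The sketch below does the projection with numpy only. The random vectors are stand-ins, not trained embeddings, and the helper name `pca_2d` is an assumption. With a trained gensim model, the rows would come from `model.wv[word]` instead.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in data: random 300-dimensional vectors, NOT trained embeddings.
rng = np.random.default_rng(0)
words = ["movie", "outstanding", "performance", "character", "development"]
vectors = rng.normal(size=(len(words), 300))

coords = pca_2d(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word:12s} {x:8.3f} {y:8.3f}")
```

The two columns of `coords` can be passed to any scatter-plot routine, with each point labelled by its word.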

In [None]:
!pip install gensim




In [None]:
# Write your code here

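import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# stopwords.words('english') needs the NLTK stopwords corpus
nltk.download('stopwords')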


dataset = [
    "This movie is outstanding with a haunting performance and best character development.",
    "The plot is engaging and the characters are well-developed.",
    "An intense and gripping movie with brilliant character performances.",
    # Add more documents from your dataset here
]

# Tokenizing and preprocessing the text
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words and word not in string.punctuation]
    return tokens

tokenized_data = [preprocess_text(text) for text in dataset]

# Training the Word2Vec model
embedding_size = 300
window_size = 5
min_count = 1  # Ignoring all words with a total frequency lower than this

word2vec_model = Word2Vec(sentences=tokenized_data, vector_size=embedding_size, window=window_size, min_count=min_count, workers=4)

# Save the model
word2vec_model.save("word2vec_model")




# Visualizing the Word Embeddings
words_to_visualize = ["movie", "outstanding", "performance", "character", "development"]

# Getting the vectors for the specified words
word_vectors = [word2vec_model.wv[word] for word in words_to_visualize]

# Visualizing the vectors
print("Word Embeddings:")
for word, vector in zip(words_to_visualize, word_vectors):
    print(f"{word}: {vector}")

# Saving the word vectors to a file for further visualization
with open("word_vectors.txt", "w") as f:
    for word, vector in zip(words_to_visualize, word_vectors):
        f.write(f"{word}: {vector}\n")






Word Embeddings:
movie: [-2.7475592e-03  3.0997850e-03 -6.5886976e-05 -6.5575878e-04
  1.5345435e-03 -1.3651053e-03  9.1437140e-04  2.3133222e-03
  2.0218086e-03 -2.5035981e-03  3.1274501e-03  1.5572695e-03
  1.3220402e-03 -2.0811686e-03  2.8199931e-03 -7.1672164e-04
  2.9417293e-03 -1.7873343e-03 -2.7098064e-03  2.2748529e-03
  5.5706420e-04 -7.3283631e-04  3.1712004e-03  3.1646185e-03
 -3.2580157e-03  8.3507615e-04  2.0522308e-03  1.2908188e-03
  6.7426247e-04  1.4350057e-04  2.2454381e-04 -1.2735454e-03
 -2.3800833e-03 -6.9629075e-04  1.3079660e-03  2.9395611e-03
  3.0863835e-03 -1.9919788e-03 -3.1342236e-03  3.2547922e-03
  1.1432616e-03  1.7220390e-03  2.0941151e-03 -9.3475421e-04
  2.4409012e-03  9.4342389e-04  9.5700147e-04 -7.9345662e-04
 -1.0427498e-03 -7.9004723e-04  1.4254788e-03  2.5352638e-05
 -3.1947596e-03 -3.2218480e-03 -2.0493979e-03 -4.2856533e-05
  6.6580536e-04  3.1439892e-03  1.8614503e-03 -1.4302321e-03
  9.2772243e-05  1.6547863e-03  2.5661031e-03 -3.8140774e-04


# **Question 4: Create your own training and evaluation data for sentiment analysis**

(15 points). **You don't need to write a program for this question!** Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral). Save the annotated dataset into a csv file with three columns (document_id, clean_text, sentiment), upload the csv file to GitHub and submit the file link below. This dataset will be used for assignment four: sentiment analysis and text classification.


In [None]:
# The GitHub link of your final csv file



# Link:
https://github.com/srivamsikakarla/venkatasuryasatya_INFO5731_Fall2023/blob/main/sentiment_dataset.csv


