
# **INFO5731 Assignment Three**

In this assignment, you are required to conduct information extraction and semantic analysis based on **the dataset you collected from assignment two**. You may use the scipy and numpy packages in this assignment.

# **Question 1: Understand N-gram**

(45 points). Write a Python program to conduct N-gram analysis based on the dataset in your assignment two:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the **noun phrases** and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print out the result in a table with all the noun phrases as column names and all the 100 reviews (abstracts, or tweets) as row names.
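
The bigram formula in (2) can be checked by hand on a tiny corpus. The sketch below is only an illustration: the three sentences, the whitespace tokenizer, and the helper name `bigram_probabilities` are assumptions made up for this example, not part of the assignment data. It divides each bigram count by the count of the bigram's first word, which reproduces the `really like` example.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Return {(first, second): count(first second) / count(first)}."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        bigram: count / unigram_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }

# Hypothetical toy corpus: "really" occurs 3 times, "really like" once
toy_corpus = [
    "i really like this movie",
    "it was really good",
    "really great acting",
]
probabilities = bigram_probabilities(toy_corpus)
print(round(probabilities[("really", "like")], 2))  # 0.33
```

Bigrams are collected per sentence here, so no bigram spans two sentences; tokenizing the whole file at once would also count pairs that cross line boundaries.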

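The table asked for in (3) can be sketched in the same spirit. This example assumes the noun phrases have already been extracted for each review, and it reads "max frequency" as the highest total count of any noun phrase on the whole dataset; that reading, the toy phrases, and the helper name `relative_frequency_table` are assumptions for illustration.

```python
from collections import Counter

def relative_frequency_table(phrases_per_review):
    """Rows are reviews, columns are noun phrases, cells are count / max count."""
    total_counts = Counter(p for phrases in phrases_per_review for p in phrases)
    columns = sorted(total_counts)
    # Highest total count of any phrase on the whole dataset (1 if there are none)
    max_frequency = max(total_counts.values(), default=1)
    rows = []
    for phrases in phrases_per_review:
        review_counts = Counter(phrases)
        rows.append([review_counts[c] / max_frequency for c in columns])
    return columns, rows

# Hypothetical noun phrases already extracted from three reviews
toy_phrases = [
    ["character development", "sound design"],
    ["character development", "character development"],
    ["sound design"],
]
columns, rows = relative_frequency_table(toy_phrases)
print(columns)
for review_id, row in enumerate(rows, start=1):
    print(f"review {review_id}", [round(value, 2) for value in row])
```

Each row can be passed to `pandas.DataFrame` together with `columns` to print the result as a table.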

In [None]:
# Write your code here
import nltk
from nltk.util import ngrams
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
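import pandas as pd

# Download the tokenizer and POS tagger models once per environment
# (newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)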

# Load the twitter_tweets dataset
with open('INFO_5731/assgn_4/reviews.txt', 'r', encoding='utf-8') as file:
    data = file.read()

# Tokenize the tweets into words
words = word_tokenize(data)

# Generate N-grams (N=3)
n = 3
ngrams_list = list(ngrams(words, n))

# Count the frequency of all N-grams
freq_dist = FreqDist(ngrams_list)

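# Calculate the probabilities for all bigrams:
# bigram count divided by the count of the bigram's first word
unigram_freq_dist = FreqDist(words)
bigram_freq_dist = FreqDist(ngrams(words, 2))
probabilities = {}
for (word1, word2), count in bigram_freq_dist.items():
    probabilities[(word1, word2)] = count / unigram_freq_dist[word1]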

# Extract noun phrases from each review (one review per line).
# A noun phrase is approximated here as two adjacent noun tokens.
reviews = [line for line in data.splitlines() if line.strip()]
noun_phrases_per_review = []
for review in reviews:
    tagged = pos_tag(word_tokenize(review))
    phrases = [
        tagged[i][0] + ' ' + tagged[i + 1][0]
        for i in range(len(tagged) - 1)
        if tagged[i][1].startswith('NN') and tagged[i + 1][1].startswith('NN')
    ]
    noun_phrases_per_review.append(phrases)

# Count each noun phrase on the whole dataset and find the maximum frequency
noun_phrase_freq_dist = FreqDist(
    phrase for phrases in noun_phrases_per_review for phrase in phrases
)
max_freq = noun_phrase_freq_dist.most_common(1)[0][1]

# Relative probability per review:
# frequency in the review / max frequency on the whole dataset
noun_phrases_list = list(noun_phrase_freq_dist.keys())
table = []
for phrases in noun_phrases_per_review:
    review_freq_dist = FreqDist(phrases)
    table.append([review_freq_dist[phrase] / max_freq for phrase in noun_phrases_list])

# Print the table: one row per review, one column per noun phrase
row_names = [f"Review {i + 1}" for i in range(len(reviews))]
df = pd.DataFrame(table, columns=noun_phrases_list, index=row_names)
print(df)






# **Question 2: Understand TF-IDF and Document representation**

(20 points). Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the **documents-terms weights (tf*idf) matrix**.

(2) To rank the documents with respect to a query (design a query by yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using **cosine similarity**.
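
To make the two steps concrete, here is a minimal sketch that uses only the standard library. It weights terms with plain tf * log(N / df) and compares sparse vectors stored as dicts; the toy documents, the query, and the helper names `tfidf_vectors` and `cosine` are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return the idf weights and one {term: tf * idf} dict per document."""
    n_docs = len(tokenized_docs)
    document_frequency = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in document_frequency.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in tokenized_docs
    ]
    return idf, vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy documents, already tokenized by whitespace
toy_docs = [
    "outstanding movie with a haunting performance".split(),
    "boring movie with weak plot".split(),
    "great character development and haunting score".split(),
]
idf, doc_vectors = tfidf_vectors(toy_docs)

# Query terms that never occur in the documents are dropped
query_tokens = "haunting performance and character development".split()
query_vector = {t: c * idf[t] for t, c in Counter(query_tokens).items() if t in idf}

# Document indices sorted from most to least similar to the query
ranking = sorted(
    range(len(toy_docs)),
    key=lambda i: cosine(query_vector, doc_vectors[i]),
    reverse=True,
)
print(ranking)  # [2, 0, 1]
```

The second document shares no term with the query, so its similarity is 0 and it ranks last.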

In [None]:
# Write your code here

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
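from sklearn.metrics.pairwise import cosine_similarity

# Download the tokenizer model and stopword list once per environment
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)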

# Load the twitter_tweets dataset
with open('INFO_5731/assgn_4/reviews.txt', 'r', encoding='utf-8') as file:
    data = file.readlines()

# Remove stopwords and punctuation from the tweets
stop_words = set(stopwords.words("english"))
punctuation = ['.', ',', '!', '?', ':', ';', '(', ')', '[', ']', '{', '}', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=', '/', '\\', '|', '<', '>', '`', '~']
preprocessed_data = []
for tweet in data:
    tweet = nltk.word_tokenize(tweet)
    tweet = [word.lower() for word in tweet if word not in punctuation and word.lower() not in stop_words]
    tweet = ' '.join(tweet)
    preprocessed_data.append(tweet)

# Build the tf-idf matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_data)
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
documents = tfidf_matrix.toarray()

# Rank the documents with respect to a query using cosine similarity
query = "An Outstanding movie with a haunting performance and best character development"
query = query.lower()
query_vector = vectorizer.transform([query]).toarray()
similarities = cosine_similarity(query_vector, documents)
ranked_documents_indices = similarities.argsort()[0][::-1]

# Print the ranked documents
print("Ranked Documents:")
for i, index in enumerate(ranked_documents_indices):
    print(f"{i+1}. Document {index + 1}: Similarity = {similarities[0][index]:.3f}")
    print(data[index].strip())
    print()






# **Question 3: Create your own word embedding model**

(20 points). Use the data you collected for assignment two to build a word embedding model: 

(1) Train a 300-dimension word embedding (it can be word2vec, glove, ulmfit, bert, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/
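
If word2vec is the chosen model, its skip-gram variant learns from (center word, context word) pairs taken from a sliding window over each sentence. The sketch below only illustrates how such pairs are formed; the helper name `skipgram_pairs`, the toy sentence, and the window size are assumptions for this example, and libraries such as gensim build their training examples internally.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

# Hypothetical toy sentence, tokenized by whitespace
toy_sentence = "the movie was really good".split()
pairs = list(skipgram_pairs(toy_sentence, window=1))
print(pairs)
```

A larger `window` produces more pairs per word and captures broader context.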

In [None]:
# Write your code here

import gensim
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Load the twitter_tweets dataset
with open('INFO_5731/assgn_4/reviews.txt', 'r', encoding='utf-8') as file:
    data = file.readlines()

# Preprocess the data by tokenizing the tweets
preprocessed_data = [gensim.utils.simple_preprocess(tweet) for tweet in data]

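# Train the word2vec model
# (gensim >= 4.0 names the dimension parameter vector_size; older releases used size)
model = Word2Vec(vector_size=300, window=5, min_count=1, workers=4)
model.build_vocab(preprocessed_data)
model.train(preprocessed_data, total_examples=model.corpus_count, epochs=30)

# Visualize the word embeddings using t-SNE
# (gensim >= 4.0 exposes the vocabulary as wv.index_to_key instead of wv.vocab)
vocabulary = list(model.wv.index_to_key)
embeddings = model.wv[vocabulary]  # 2-D array, one 300-dimension row per word
tsne = TSNE(n_components=2)
embeddings_tsne = tsne.fit_transform(embeddings)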

# Plot the word embeddings
plt.figure(figsize=(10, 10))
plt.scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1])
for i, word in enumerate(vocabulary):
    plt.annotate(word, (embeddings_tsne[i, 0], embeddings_tsne[i, 1]))
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('Word Embeddings Visualization')
plt.show()






# **Question 4: Create your own training and evaluation data for sentiment analysis**

(15 points). **You don't need to write a program for this question!** Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral). Save the annotated dataset into a csv file with three columns (document_id, clean_text, sentiment), upload the csv file to GitHub, and submit the file link below. This dataset will be used for assignment four: sentiment analysis and text classification.


In [None]:
# The GitHub link of your final csv file



# Link: 



