### Shreyashi Mukhopadhyay

## Problem 1:

(3 points) TF-IDF (compute comment similarity between different ratings)

The Amazon rating system runs from 1 to 5 stars, with 5 stars being the best. The comments for each rating can be viewed as a separate document, giving 5 documents in total: one each for the 1-star, 2-star, 3-star, 4-star, and 5-star comments. Use this document representation to group all the comments into 5 documents by rating.

Each document (the 1-star, 2-star, 3-star, 4-star, or 5-star comments) can be represented as an N-dimensional vector. You need to perform stopword removal and stemming before generating each document. Each dimension of this vector space is defined by a unigram drawn from all comments in all documents, while the weight of each unigram in a rating document is defined by TF-IDF. Specifically, we need to use "sub-linear TF scaling" to compute the normalized TF of each unigram in a document (e.g., `sklearn.feature_extraction.text.TfidfVectorizer(sublinear_tf=True)`).

Construct the vector space representations for the 1-star to 5-star reviews and find the reviews most similar to the 1-star, 3-star, and 5-star reviews, where the similarity metric is cosine similarity.
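The setup described in the task can be sketched as follows: join the comments of each rating into one document, vectorize the five documents with sublinear TF scaling, and compare them with cosine similarity. This is only a minimal sketch on made-up comments; the `toy` DataFrame and its contents are assumptions for illustration, and stopword removal and stemming are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the real comments
toy = pd.DataFrame({
    "Rating": [1, 1, 2, 3, 3, 4, 5, 5],
    "Review": [
        "broke after one day terrible",
        "terrible battery waste of money",
        "battery drains fast not great",
        "works okay average battery",
        "average screen okay price",
        "good screen nice price",
        "great phone love the screen",
        "love it great battery great price",
    ],
})

# One document per rating: join all comments that share a rating
rating_docs = toy.groupby("Rating")["Review"].apply(" ".join)

# Rows are ratings, columns are unigrams, weights are sublinear TF-IDF
vectorizer = TfidfVectorizer(sublinear_tf=True)
rating_vectors = vectorizer.fit_transform(rating_docs)

# 5 x 5 matrix of cosine similarities between the rating documents
sim = pd.DataFrame(
    cosine_similarity(rating_vectors),
    index=rating_docs.index,
    columns=rating_docs.index,
)

# For each rating, pick the most similar *other* rating
masked = sim.to_numpy().copy()
np.fill_diagonal(masked, -np.inf)
nearest = pd.Series(sim.columns[masked.argmax(axis=1)], index=sim.index)
print(nearest.loc[[1, 3, 5]])
```

Masking the diagonal before taking the maximum keeps a document from being reported as its own nearest neighbour.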



In [None]:
# Import google drive

from google.colab import drive
drive.mount('/content/gdrive')

### Read the data

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

In [None]:
import nltk
nltk.download('all')

In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/Text Analytics/HW1/Amazon_Comments.csv',  delimiter="^", header=None, names=["No", "Title", "Date", "Bool", "Review", "Rating"])
df.head()

## Data has 2038 rows and 6 columns with no null values

In [None]:
df.info()
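The row, column, and null-value figures quoted in the heading can also be checked programmatically. Below is a minimal sketch, assuming a caret-delimited layout like the one read above; the two sample rows (including their date format) are invented for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same caret-delimited layout
sample = io.StringIO(
    "1^Great phone^2013-01-05^True^Love the battery life^5\n"
    "2^Stopped working^2013-02-11^False^Broke after a week^1\n"
)
cols = ["No", "Title", "Date", "Bool", "Review", "Rating"]
demo = pd.read_csv(sample, delimiter="^", header=None, names=cols)

# Shape gives (rows, columns); isna().sum() counts nulls per column
n_rows, n_cols = demo.shape
null_counts = demo.isna().sum()
print(n_rows, n_cols)
print(null_counts)
```

On the real DataFrame the same two calls, `df.shape` and `df.isna().sum()`, give the numbers directly without reading them off the `df.info()` output.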

## Visualization of Ratings Distribution

In [None]:
import matplotlib.pyplot as plt

# Plot the count of each rating

df['Rating'].value_counts().to_frame("Count").plot(kind='bar', figsize=(10, 4), title='Rating Counts', colormap='Accent')
plt.title('Rating Counts', fontsize = 15)
plt.xlabel('Rating', fontsize= 12)
plt.ylabel('Count', fontsize = 12)
plt.show()

# Data Cleaning

In [None]:
# drop() returns a new DataFrame, so a separate copy() is not needed
df1 = df.drop(columns=["No", "Title", "Date", "Bool"])
df1.head()

## Clean the data

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string
import re




# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Function to clean and preprocess text
def clean_text(text):

    # Keep only letters and whitespace (this drops digits and punctuation)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Convert text to lowercase
    text = text.lower()

    # Tokenize text
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Stemming alternative (disabled here; lemmatization is used instead)
    # stemmer = PorterStemmer()
    # tokens = [stemmer.stem(word) for word in tokens]

    # Lemmatization
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    # Reconstruct cleaned text
    cleaned_text = ' '.join(tokens)

    return cleaned_text
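
The task statement asks for stemming, while the function above lemmatizes. A stemming variant could look roughly like the sketch below. It assumes a tiny hand-written stopword list (`DEMO_STOPWORDS`, hypothetical) and plain whitespace tokenization so that it runs without any NLTK data downloads; the function name `clean_text_stemmed` is also an assumption.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list
DEMO_STOPWORDS = {"the", "are", "is", "a", "and", "of"}
stemmer = PorterStemmer()

def clean_text_stemmed(text):
    # Keep letters and whitespace only, then lowercase
    text = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    # Whitespace tokenization keeps the sketch free of NLTK data downloads
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS]
    # Reduce each remaining token to its Porter stem
    return " ".join(stemmer.stem(t) for t in tokens)

print(clean_text_stemmed("The batteries are running!"))
```

Stems are not always dictionary words, so the vocabulary is typically smaller but less readable than with lemmatization.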


In [None]:
# Apply the clean_text function to each review in the DataFrame
df2 = df1.copy()

df2['Clean_Review'] = df1['Review'].apply(clean_text)

# Print the cleaned reviews
df2.head()

## Visualize Tokens

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords

# Download the nltk stopwords if you haven't done so
nltk.download('stopwords')

# Create a set of stopwords
stop_words = set(stopwords.words('english'))

# Join all cleaned reviews into one string to visualize the frequent words
text = " ".join([sentence for sentence in df2['Clean_Review']])

# Generate a word cloud, excluding stop words
wordcloud = WordCloud(width=800, height=400, stopwords=stop_words, background_color='white').generate(text)

# plot the graph
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
import collections
from collections import Counter
from itertools import chain

word_tokenize = nltk.word_tokenize

# Tokenize the cleaned reviews
df2['Tokenized_Text'] = df2['Clean_Review'].apply(word_tokenize)

# Collect the token lists, one per review
corpus = df2['Tokenized_Text'].tolist()

# Flatten the list of lists into a single list of tokens
tokens = list(chain.from_iterable(corpus))

# Count each unique token
unique_freq = Counter(tokens)

# Convert to a DataFrame and rename the columns
unique_freq_df = pd.DataFrame.from_dict(unique_freq, orient='index').reset_index()
unique_freq_df = unique_freq_df.rename(columns={'index': 'Token', 0: 'Count'})

# Sort by count and keep the 20 most frequent tokens
unique_freq_df = unique_freq_df.sort_values('Count', ascending=False).head(20)

unique_freq_df1 = unique_freq_df.reset_index(drop=True)
unique_freq_df2 = unique_freq_df1.set_index("Token")
unique_freq_df2
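
A shorter route to the same top-N table is `Counter.most_common`, which returns the tokens already sorted by count. The sketch below uses a few invented token lists in place of the tokenized reviews.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized reviews
tokenized = [["phone", "battery", "good"], ["battery", "bad"], ["phone", "battery"]]

# Flatten and count in one step, then take the two most frequent tokens
counts = Counter(chain.from_iterable(tokenized))
top2 = counts.most_common(2)
print(top2)  # [('battery', 3), ('phone', 2)]
```

Passing the result to `pd.DataFrame(top2, columns=["Token", "Count"])` would give a frame ready for plotting.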

In [None]:
#plt.colormaps()
unique_freq_df2.plot(kind="bar", figsize= (15,5), grid=False, colormap = "Spectral_r")
plt.show()

## Calculating the TF of each unigram in the document

In [None]:
df3 = df2.copy()
df3.head()

In [None]:
# Step 1: Create a TF-IDF Vectorizer with sublinear TF scaling
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True)

# Step 2: Fit the Vectorizer to the Text Data
tfidf_matrix = tfidf_vectorizer.fit_transform(df3['Clean_Review'])

# Step 3: Convert the TF-IDF Matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

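# tfidf_df now holds one row per review and one column per unigram,
# weighted by TF-IDF with sub-linear TF scaling.
# Transpose so that rows are unigrams and columns are reviews.
tfidf = tfidf_df.T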

tfidf
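
To make the weighting concrete, the sketch below recomputes the values by hand on three invented sentences and compares them with `TfidfVectorizer(sublinear_tf=True)`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the term frequency becomes `1 + ln(tf)`, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus
docs = ["good good good phone", "bad phone", "good battery bad screen"]

# Raw term counts (rows: documents, columns: alphabetically sorted unigrams)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Sub-linear TF: 1 + ln(tf) for terms that occur, 0 otherwise
nonzero = counts > 0
tf = np.zeros_like(counts)
tf[nonzero] = 1.0 + np.log(counts[nonzero])

# Smoothed IDF as used by scikit-learn's defaults
doc_freq = nonzero.sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Weight, then L2-normalize each document vector
weights = tf * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)

reference = TfidfVectorizer(sublinear_tf=True).fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

The effect of the scaling is that a word occurring three times counts as roughly `1 + ln(3) ≈ 2.1` rather than 3, which dampens the influence of repeated words.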

## Tf-idf Normalization

In [None]:
# Mean-center the matrix: subtract each unigram's (row's) mean weight across all reviews

tfidf_norm = tfidf.subtract(tfidf.mean(axis=1), axis = 0)
tfidf_norm.head()
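
The centering step can be checked on a small example. The sketch below assumes a made-up term-by-review matrix with the same orientation as `tfidf` (rows are terms, columns are reviews) and confirms that every row has zero mean afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical term-by-review weights (rows: terms, columns: reviews)
toy = pd.DataFrame(
    [[0.2, 0.0, 0.4], [0.5, 0.5, 0.5], [0.0, 0.9, 0.0]],
    index=["battery", "phone", "screen"],
)

# Subtract each row's mean; axis=0 aligns the means with the row index
centered = toy.subtract(toy.mean(axis=1), axis=0)
print(centered.mean(axis=1))
```

Note that `TfidfVectorizer` already L2-normalizes each review vector by default, so the centering here is an additional step rather than a requirement of TF-IDF. A term with the same weight in every review, such as `phone` in this example, becomes an all-zero row.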

## Cosine Similarity

In [None]:
# Calculate cosine similarity between all the reviews.
# Rows of tfidf_norm are unigrams, so transpose to get one row per review.
cosine_similarity_df = pd.DataFrame(cosine_similarity(tfidf_norm.T), index=tfidf_norm.columns, columns=tfidf_norm.columns)

cosine_similarity_df.head()
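
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out with NumPy on invented vectors; the helper name `cosine` is an assumption.

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine(a, b))
print(cosine(a, c))

# For unit-length vectors the plain dot product already equals the cosine
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(float(np.dot(a_unit, b_unit)))
```

The last lines show why `linear_kernel` applied to L2-normalized TF-IDF rows gives the same values as `cosine_similarity`, while skipping the normalization step.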

## Predict the reviews most similar to 1 star, 3 star and 5 star reviews

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Separate reviews by star rating
one_star_reviews = df3[df3['Rating'] == 1]
three_star_reviews = df3[df3['Rating'] == 3]
five_star_reviews = df3[df3['Rating'] == 5]

# Create a TF-IDF vectorizer
tfidf_vectorizer =  TfidfVectorizer(sublinear_tf=True)

# Fit and transform the entire text data
tfidf_matrix = tfidf_vectorizer.fit_transform(df3['Clean_Review'])

# Initialize empty lists to store the most similar reviews and their similarity scores
most_similar_one_star = []
most_similar_three_star = []
most_similar_five_star = []

# For each 1-star review, find its nearest neighbour among all reviews.
# Query with the cleaned text so it matches the text the vectorizer was fitted on.
for one_star_review in one_star_reviews['Clean_Review']:
    similarity_scores = cosine_similarity(tfidf_vectorizer.transform([one_star_review]), tfidf_matrix)
    # The highest score is normally the review itself, so take the second-highest
    most_similar_idx = similarity_scores.argsort()[0][-2]
    most_similar_one_star.append((df3['Review'].iloc[most_similar_idx], similarity_scores[0][most_similar_idx]))
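
The per-review loop can also be written as a single matrix operation. The sketch below is one possible variant on invented reviews: it computes all pairwise similarities at once and masks the diagonal, so the self-match is excluded explicitly rather than by taking the second-highest score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical cleaned reviews and their ratings
reviews = [
    "battery died terrible phone",
    "terrible battery died fast",
    "great screen love phone",
    "love great screen",
]
ratings = np.array([1, 1, 5, 5])

matrix = TfidfVectorizer(sublinear_tf=True).fit_transform(reviews)

# All pairwise similarities in one call
sim = cosine_similarity(matrix)

# Exclude self-matches explicitly instead of relying on sort position
np.fill_diagonal(sim, -1.0)
nearest_idx = sim.argmax(axis=1)

# Report the nearest neighbour of every 1-star review
for i in np.flatnonzero(ratings == 1):
    j = nearest_idx[i]
    print(f"{reviews[i]!r} -> {reviews[j]!r} (score {sim[i, j]:.3f})")
```

This avoids re-transforming each review and stays correct when two different reviews are exact duplicates, a case in which the `[-2]` position may point back at the query itself.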

## 1 Star Reviews

In [None]:
# Print the most similar reviews and their similarity scores

print("Most similar to 1-star review:")
for review, similarity_score in most_similar_one_star:
    print(f"Review: {review}\nSimilarity Score: {similarity_score}\n")

## 3 Star Reviews

In [None]:
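# For each 3-star review, find its nearest neighbour among all reviews,
# querying with the cleaned text the vectorizer was fitted on
for three_star_review in three_star_reviews['Clean_Review']:
    similarity_scores = cosine_similarity(tfidf_vectorizer.transform([three_star_review]), tfidf_matrix)
    most_similar_idx = similarity_scores.argsort()[0][-2]  # second-highest score; the highest is normally the review itself
    most_similar_three_star.append((df3['Review'].iloc[most_similar_idx], similarity_scores[0][most_similar_idx]))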


# Print the most similar reviews and their similarity scores

print("\nMost similar to 3-star review:")
for review, similarity_score in most_similar_three_star:
    print(f"Review: {review}\nSimilarity Score: {similarity_score}\n")


## 5 Star Reviews

In [None]:

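# For each 5-star review, find its nearest neighbour among all reviews,
# querying with the cleaned text the vectorizer was fitted on
for five_star_review in five_star_reviews['Clean_Review']:
    similarity_scores = cosine_similarity(tfidf_vectorizer.transform([five_star_review]), tfidf_matrix)
    most_similar_idx = similarity_scores.argsort()[0][-2]  # second-highest score; the highest is normally the review itself
    most_similar_five_star.append((df3['Review'].iloc[most_similar_idx], similarity_scores[0][most_similar_idx]))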

print("\nMost similar to 5-star review:")
for review, similarity_score in most_similar_five_star:
    print(f"Review: {review}\nSimilarity Score: {similarity_score}\n")

## Correct dataset

In [None]:
# Create the correct dataset (copy so later assignments do not touch a view of df)
df_correct = df[df['flag'] == 1].copy()

# Remove everything except letters and spaces, then lowercase
df_correct['question'] = df2['text'].str.replace('[^A-Za-z ]+', '', regex=True)
df_correct['question'] = df_correct['question'].str.lower()

# create a bag of words
df_correct['question_bow'] = df_correct['question'].str.split(' ')
df_correct['question_bow']

# Remove stopwords (build the set once rather than once per row)
english_stopwords = set(nltk.corpus.stopwords.words('english'))
df_correct['question_bow'] = df_correct['question_bow'].apply(lambda x: [item for item in x if item not in english_stopwords])

# create a list of all the words in the bag of words
all_words = df_correct['question_bow'].explode().unique()
all_words

# # count the number of times each word appears
word_counts = df_correct['question_bow'].explode().value_counts().to_frame("Counts").reset_index()
word_counts


# # sort the words by their frequency
sorted_word_counts = word_counts.sort_values(by="Counts",ascending=False)
sorted_word_counts

In [None]:
# print the top 30 words that appear most frequently in questions that students tend to do poorly on
sorted_30 = sorted_word_counts.drop(index=0).head(30)
sorted_30

In [None]:
sorted_30.plot(kind='bar', x='index', y='Counts', figsize=(20,5)  )
plt.title('Word Frequency of Top 30 words in Correct Data', size = 20)
plt.xlabel('Words', size = 15)
plt.ylabel('Frequency', size = 15)
plt.xticks(size = 12)
plt.yticks(size = 12)
plt.show()