
# **INFO5731 Assignment Four**

In this assignment, you will conduct topic modeling and sentiment analysis on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [2]:
# Install bertopic
!pip install bertopic

# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import TruncatedSVD
from bertopic import BERTopic
from google.colab import files

# Read the CSV file
file_path = '/content/imdb_reviews_sentimentanalysis.csv'
df = pd.read_csv(file_path)

# Text preprocessing
# The 'clean_text' column is used as-is; add cleaning, tokenization, or lemmatization here if needed

# Define a function to fit LDA model and print top words for each topic
def print_lda_topics(lda_model, vectorizer, top_n_words=10):
    features = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(lda_model.components_):
        print(f"Topic {topic_idx + 1}:")
        topic_words = [features[i] for i in topic.argsort()[:-top_n_words - 1:-1]]
        print(", ".join(topic_words))
        print()

# Define a function to print top words for each topic in LSA
def print_lsa_topics(lsa_model, vectorizer, top_n_words=10):
    features = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(lsa_model.components_):
        print(f"Topic {topic_idx + 1}:")
        topic_words = [features[i] for i in topic.argsort()[:-top_n_words - 1:-1]]
        print(", ".join(topic_words))
        print()

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(df['clean_text'])

# Fit LDA model
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_model.fit(X)

# Print top words for each topic in LDA
print("Top 10 topics using LDA:")
print_lda_topics(lda_model, vectorizer)

# Fit LSA model
lsa_model = TruncatedSVD(n_components=10, random_state=42)
lsa_model.fit(X)

# Print top words for each topic in LSA
print("Top 10 topics using LSA:")
print_lsa_topics(lsa_model, vectorizer)

# Fit BERTopic model
bertopic_model = BERTopic(language="english")
topics, _ = bertopic_model.fit_transform(df['clean_text'])

# Print the clusters found by BERTopic
# (BERTopic chooses the number of clusters itself, so there may be fewer than 10)
print("Top 10 clusters for BERTopic:")
top_topics = bertopic_model.get_topics()

for topic_id, words in top_topics.items():
    top_words = ", ".join([word[0] for word in words[:10]])  # Extract top 10 words for each topic
    topic_count = bertopic_model.get_topic_freq(topic_id)  # Get the count of documents in the topic
    print(f"Cluster {topic_id} - Words: {top_words}")
    print(f"Count: {topic_count}")
    print()

# Analyze sentiment
sentiment_counts = df['sentiment_analysis'].str.strip().value_counts()
print("Sentiment Analysis:")
print(sentiment_counts)



Top 10 clusters for BERTopic:
Cluster 0 - Words: the, and, to, of, it, that, in, was, movie, but
Count: 43

Cluster 1 - Words: the, and, is, of, in, barbie, to, it, but, was
Count: 29

Cluster 2 - Words: and, the, movie, of, that, to, this, in, was, is
Count: 16

Cluster 3 - Words: the, to, barbie, was, and, movie, it, of, this, they
Count: 12

Sentiment Analysis:
sentiment_analysis
negative    49
positive    35
nuetral     15
neutral      1
Name: count, dtype: int64
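
The counts above show the neutral class split across two spellings ("nuetral" and "neutral"), which a classifier would treat as two separate labels. A minimal sketch of merging them before modeling, in plain Python. The helper name `normalize_label` and the correction table are illustrative assumptions, not part of the original notebook, and applying this step would change the results reported later.

```python
from collections import Counter

# Hypothetical correction table: maps misspelled labels to their intended form.
LABEL_FIXES = {"nuetral": "neutral"}

def normalize_label(label: str) -> str:
    """Strip whitespace, lowercase, and apply known spelling fixes."""
    cleaned = label.strip().lower()
    return LABEL_FIXES.get(cleaned, cleaned)

# Rebuild the label distribution reported above (49 / 35 / 15 / 1).
raw_labels = ["negative"] * 49 + ["positive"] * 35 + ["nuetral"] * 15 + ["neutral"]
label_counts = Counter(normalize_label(label) for label in raw_labels)
print(label_counts)
```

With pandas, the same mapping could be applied to the label column through `Series.replace` after stripping whitespace.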


LDA Topics:

Topic 1: Possibly related to reviews discussing movies featuring Ryan Gosling, with positive sentiments like "great" and "expecting."

Topic 2: Likely discussing the plot ("storyline") and characters of movies, with mentions of "ken," "watch," and "looks."

Topic 3: Potentially focused on Barbie movies, with words like "barbie," "ken," and "men."

Topic 4: Covers various aspects of movies, including humor, realism, and messages, with terms like "funny," "real," and "message."

Topic 5: Likely discussing films catering to women and children, with mentions of "woman," "kids," and "audience."

Topic 6: Focuses on critiques and opinions about movies, with terms like "reason," "jokes," and "costuming."

Topic 7: Possibly discussing Barbie movies and characters, including Barbie, Ken, and Margot Robbie.

Topic 8: Discusses films in the context of observations and societal themes like patriarchy, with mentions of "observations" and "patriarchy".

Topic 9: Possibly discussing movies and their realism or quality, with mentions of "world," "real," and "acting."

Topic 10: Discusses movies in a general sense, possibly focusing on the concept or idea of movies, with terms like "movie," "kind," and "film."

LSA Topics:

Topic 1: Likely discussing Barbie movies and characters, with mentions of Barbie, Ken, and Ryan Gosling.

Topic 2: Discusses films and their visuals, potentially related to Barbie movies, with terms like "visuals" and "mattel."

Topic 3: Focuses on moments and observations in movies, possibly related to specific scenes or themes, with mentions of "moments" and "observations."

Topic 4: Discusses societal topics and messaging in movies, with terms like "messaging," "change," and "men."

Topic 5: Possibly discussing Ryan Gosling and related topics, with mentions of "ryan," "gosling," and "margot."

Topic 6: Discusses themes related to Barbie movies and their success or messaging, with mentions of "barbieland," "kens," and "messaging."

Topic 7: Focuses on aspects like storytelling and audience reception, with terms like "like," "way," and "audience."

Topic 8: Discusses the quality or appeal of movies, with terms like "good," "reason," and "dude."

Topic 9: Possibly discussing aspects of movies and their impact, with terms like "like," "just," and "end."

Topic 10: Discusses aspects like costuming and humor in movies, with terms like "costuming," "jokes," and "wasnt."

BERTopic Clusters:

Cluster 0: Dominated by stop words ("the," "and," "to"); "movie" is the only content word, so the cluster reflects generic review language rather than a distinct theme.

Cluster 1: Also dominated by stop words; the presence of "barbie" suggests discussion of the film itself, but the theme is not specific.

Cluster 2: Stop words plus "movie"; no clear theme.

Cluster 3: Stop words plus "barbie" and "movie"; the cluster lacks coherence and specificity.


# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two supervised learning algorithms/models from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build one sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [1]:
# Sentiment classification with TF-IDF features, a linear SVM, and a Random Forest
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Read the CSV file
file_path = '/content/imdb_reviews_sentimentanalysis.csv'
df = pd.read_csv(file_path)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['clean_text'], df['sentiment_analysis'], test_size=0.2, random_state=42)

# Feature extraction: Using TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train and evaluate SVM classifier
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(X_train_tfidf, y_train)
svm_predictions = svm_classifier.predict(X_test_tfidf)

# Evaluate SVM performance
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_precision = precision_score(y_test, svm_predictions, average='weighted')
svm_recall = recall_score(y_test, svm_predictions, average='weighted')
svm_f1 = f1_score(y_test, svm_predictions, average='weighted')

print("SVM Classifier Performance:")
print(f"Accuracy: {svm_accuracy}")
print(f"Precision: {svm_precision}")
print(f"Recall: {svm_recall}")
print(f"F1 Score: {svm_f1}")

# Train and evaluate Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_tfidf, y_train)
rf_predictions = rf_classifier.predict(X_test_tfidf)

# Evaluate Random Forest performance
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_precision = precision_score(y_test, rf_predictions, average='weighted')
rf_recall = recall_score(y_test, rf_predictions, average='weighted')
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')

print("\nRandom Forest Classifier Performance:")
print(f"Accuracy: {rf_accuracy}")
print(f"Precision: {rf_precision}")
print(f"Recall: {rf_recall}")
print(f"F1 Score: {rf_f1}")





SVM Classifier Performance:
Accuracy: 0.7
Precision: 0.7050000000000001
Recall: 0.7
F1 Score: 0.6924242424242424

Random Forest Classifier Performance:
Accuracy: 0.65
Precision: 0.646969696969697
Recall: 0.65
F1 Score: 0.6463369963369964
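
In both result blocks above, recall equals accuracy. This follows from the `average='weighted'` setting: each class's score is weighted by its share of the true labels, and for recall that weighted sum reduces to the overall fraction of correct predictions. A minimal sketch of the computation in plain Python, using made-up toy labels (the function name `weighted_metrics` is an illustrative assumption):

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return (precision, recall, f1), averaged over classes weighted by support."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_c = tp / n_pred if n_pred else 0.0  # class precision (0 if never predicted)
        r_c = tp / n_true                     # class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_true / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Toy labels, made up for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]
precision, recall, f1 = weighted_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

A macro average (equal weight per class) would expose weak performance on small classes such as the neutral reviews more clearly than the weighted average does.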


Feature Selection Explanation:
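TF-IDF (Term Frequency-Inverse Document Frequency) vectorization is used for feature extraction. TF-IDF weights each word by how often it appears in a document and how rare it is across the corpus, so words that occur in most reviews receive low weights while distinctive words receive high weights. English stop words are removed explicitly through the vectorizer's stop_words='english' setting, and the vocabulary is limited to the 1,000 most frequent terms (max_features=1000).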
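
To make the weighting concrete, here is a minimal sketch of TF-IDF in plain Python on a made-up toy corpus. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the variant documented as scikit-learn's default. The function name `tfidf` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute smoothed IDF values and L2-normalized TF-IDF vectors."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors

# Toy corpus, made up for illustration only.
docs = [
    ["barbie", "movie", "funny"],
    ["barbie", "movie", "boring"],
    ["barbie", "ken"],
]
idf, vectors = tfidf(docs)
for term in ("barbie", "movie", "funny"):
    print(term, round(idf[term], 3))
```

In this toy corpus "barbie" appears in every document and gets the minimum IDF, while "funny" appears once and gets the highest weight, which is the behavior the explanation above relies on.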

Selection of Supervised Learning Algorithms:
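Support Vector Machine (SVM) and Random Forest are chosen for sentiment analysis. A linear SVM works well in high-dimensional, sparse feature spaces such as TF-IDF vectors and finds a maximum-margin separating hyperplane. Random Forest is an ensemble method that combines many decision trees, which reduces variance and provides feature-importance estimates. On the test set, the SVM scored higher on all four metrics (accuracy 0.70 vs. 0.65, weighted F1 0.69 vs. 0.65), although with only 20 test reviews this gap corresponds to a single review.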
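
The question also asks for 5-fold or 10-fold cross-validation, which the code above does not perform. The following is a minimal sketch of how k-fold splitting partitions the training data, in plain Python; the function name `k_fold_indices` is an illustrative assumption. In practice, scikit-learn's `cross_val_score` with `cv=5` would handle the splitting, fitting, and scoring.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# 80 training reviews (80% of the 100 collected) split into 5 folds.
folds = list(k_fold_indices(80, k=5))
for i, (train, val) in enumerate(folds, start=1):
    print(f"Fold {i}: {len(train)} train, {len(val)} validation")
```

Each review serves as validation data exactly once, so the averaged score is less sensitive to one lucky or unlucky split than a single 80/20 evaluation. The TF-IDF vectorizer should be refit inside each fold so that validation reviews do not influence the vocabulary.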

# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split the data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the training dataset
train_df = pd.read_csv('/content/train.csv')

# Load the testing dataset
test_df = pd.read_csv('/content/test.csv')

# Exploratory Data Analysis (EDA) and Data Cleaning
# Drop columns with a high percentage of missing values or deemed irrelevant for analysis
train_df.drop(columns=['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], inplace=True)

# Fill missing values in numerical columns with mean
numerical_cols = train_df.select_dtypes(include=np.number).columns
train_df[numerical_cols] = train_df[numerical_cols].fillna(train_df[numerical_cols].mean())

# Impute missing values in categorical columns with mode
categorical_cols = train_df.select_dtypes(include='object').columns
train_df[categorical_cols] = train_df[categorical_cols].fillna(train_df[categorical_cols].mode().iloc[0])

# Feature Selection
# Based on EDA results, select a number of features for the regression model
selected_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']

# Split data into features (X) and target variable (y) for training dataset
X_train = train_df[selected_features]
y_train = train_df['SalePrice']

# Split the training dataset into training and testing sets
X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Initialize and train a linear regression model
model = LinearRegression()
model.fit(X_train_split, y_train_split)

# Predict on the test set
y_pred = model.predict(X_test_split)

# Calculate evaluation metrics
mse = mean_squared_error(y_test_split, y_pred)
r2 = r2_score(y_test_split, y_pred)

# Print evaluation results
print("\nEvaluation Metrics (Testing Dataset):")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")



Evaluation Metrics (Testing Dataset):
Mean Squared Error (MSE): 1576962754.8842602
R-squared (R2): 0.7944073417103643
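
The MSE above is hard to interpret because it is expressed in squared price units. Taking its square root gives the RMSE, which is in the same units as SalePrice. A small sketch using the value reported above:

```python
import math

mse = 1576962754.8842602  # MSE reported in the output above
rmse = math.sqrt(mse)     # back to the units of SalePrice
print(f"RMSE: {rmse:,.0f}")
```

This comes to roughly 39,700, a typical prediction error in price units, while the R-squared of about 0.79 indicates that the six selected features explain about 79% of the variance in sale price on the held-out split.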


The selected features for the regression model are:

OverallQual: Overall material and finish quality of the house.

GrLivArea: Above ground living area square feet.

GarageCars: Size of garage in car capacity.

TotalBsmtSF: Total square feet of basement area.

FullBath: Number of full bathrooms above grade.

YearBuilt: Original construction year.

These features are chosen based on their potential correlation with the target variable (house price) and their relevance in influencing the price of residential homes. For instance, the overall quality of a house, living area size, garage capacity, basement area, number of full bathrooms, and the year of construction are common factors known to affect house prices. Therefore, they are deemed important for inclusion in the regression model to capture the variations in house prices based on these characteristics.
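
One way to back up this kind of selection is to rank candidate features by the absolute value of their Pearson correlation with the target. The sketch below does this in plain Python on made-up toy numbers; the values, the "RandomNoise" column, and the function name `pearson` are illustrative assumptions and are not taken from train.csv.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data, made up for illustration only.
toy_features = {
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [1100, 1500, 1400, 2000, 2600],
    "RandomNoise": [6, 2, 11, 6, 3],
}
sale_price = [120000, 160000, 175000, 240000, 330000]

# Rank features from most to least correlated with the target.
ranking = sorted(
    toy_features,
    key=lambda name: abs(pearson(toy_features[name], sale_price)),
    reverse=True,
)
for name in ranking:
    print(name, round(pearson(toy_features[name], sale_price), 3))
```

On the real data, the same ranking could be obtained from the numeric columns of the training DataFrame with pandas' `DataFrame.corr`, keeping only the SalePrice column of the result.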

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources,  number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the model's performance by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [4]:
import pandas as pd
import torch
from transformers import pipeline, BertTokenizer, BertForSequenceClassification

# Load the dataset
data_path = "/content/imdb_reviews_sentimentanalysis.csv"
data = pd.read_csv(data_path)

# Select a relevant Pre-trained Language Model (PLM)
# For this example, let's choose BERT
plm = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(plm)
model = BertForSequenceClassification.from_pretrained(plm)

# Set the model to evaluation mode
model.eval()

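# Define a function to perform sentiment analysis using BERT
# Note: the classification head of "bert-base-uncased" is newly initialized (see the
# warning in the output), so these predictions do not come from a trained classifier.
def predict_sentiment(text):
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs)
    logits = outputs.logits
    probabilities = logits.softmax(dim=1)
    predicted_class = torch.argmax(probabilities, dim=1).item()
    # Binary mapping: the "neutral" label can never be predicted
    return "positive" if predicted_class == 1 else "negative"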

# Perform sentiment analysis using the selected PLM
predictions = [predict_sentiment(text) for text in data["clean_text"]]

# Evaluate performance
ground_truths = data["sentiment_analysis"].tolist()
accuracy = sum(1 for truth, pred in zip(ground_truths, predictions) if truth == pred) / len(ground_truths)

# Calculate precision, recall, and F1 score for the "positive" class
true_positives = sum(1 for truth, pred in zip(ground_truths, predictions) if truth == pred == "positive")
# False positive: predicted positive, but the true label is something else
false_positives = sum(1 for truth, pred in zip(ground_truths, predictions) if truth != "positive" and pred == "positive")
# False negative: a truly positive review that was not predicted as positive
false_negatives = sum(1 for truth, pred in zip(ground_truths, predictions) if truth == "positive" and pred != "positive")

precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) != 0 else 0
recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) != 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0

# Print evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1_score)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.49
Precision: 0
Recall: 0.0
F1 Score: 0



BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model pre-trained on English Wikipedia and BooksCorpus. The "bert-base-uncased" variant has around 110 million parameters. No task-specific fine-tuning was applied. Because this checkpoint ships without a sentiment classification head, BertForSequenceClassification initializes the classifier weights randomly, as the warning in the output states, so the predictions above do not reflect learned sentiment knowledge. This explains the results: the model assigned no review to the positive class, giving precision, recall, and F1 of 0, and the accuracy of 0.49 is consistent with the share of negative reviews in the dataset (49 of 100). The prediction function also maps outputs to only "positive" or "negative", so neutral reviews can never be classified correctly. A checkpoint already fine-tuned for sentiment classification or natural language inference would be needed for a meaningful zero-shot evaluation.




Advantages of BERT:
- BERT has been pre-trained on a large corpus of text data, enabling it to capture intricate language patterns.
- After fine-tuning, it performs well on various natural language processing tasks, including sentiment analysis.
- Many fine-tuned BERT checkpoints are available on the Hugging Face Hub and can be applied directly without further training.

Disadvantages of BERT:
- BERT may require substantial computational resources for inference due to its large number of parameters.
- It may not generalize well to domains or tasks significantly different from its pre-training data.
- Interpretability of BERT predictions can be challenging due to its complex architecture and tokenization process.

Challenges encountered during implementation:
- Ensuring compatibility between the dataset format and the input requirements of the BERT model.
- Handling any preprocessing or tokenization steps necessary for the input text data.
- Interpreting and analyzing the model predictions: the untrained classification head predicted a single class for every review, and the binary output could not represent the neutral label.