
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [None]:
!pip install -q BERTopic


In [None]:
# Write your code here

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Load the 20 newsgroups dataset
data = fetch_20newsgroups(subset='all', shuffle=True, remove=('headers', 'footers', 'quotes'))

In [None]:
# check size of data

print(data.target.shape)

(18846,)


In [None]:
print(data.target[:5])

[10  3 17  3  4]


In [None]:
# Sample 10% of the data by taking every 10th document
# (a step of len(data.data) // 10 would keep only about 10 documents in total)
sampled_indices = range(0, len(data.data), 10)
sampled_data = [data.data[i] for i in sampled_indices]
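
The cell above samples by stepping through the index. A seeded random sample is a possible alternative when the document order might carry structure. The sketch below uses only the standard library; `sample_fraction` and the `documents` list are hypothetical names, with `documents` standing in for `data.data`.

```python
import random


def sample_fraction(items, fraction=0.10, seed=42):
    """Return a reproducible random sample holding `fraction` of `items`."""
    k = max(1, round(len(items) * fraction))
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(items, k)


# Hypothetical stand-in for data.data (a list of document strings)
documents = [f"document {i}" for i in range(1000)]
subset = sample_fraction(documents)
print(len(subset))
```

Fixing the seed keeps the sample identical across notebook runs, which makes topic results easier to compare.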


In [None]:
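# Initialize BERTopic model
model = BERTopic()

# Fit BERTopic model to the sampled dataset
topics, _ = model.fit_transform(sampled_data)

# get_topics(10) treats 10 as the `full` flag and returns a dict keyed by "Main",
# so list topics with get_topic_info() and skip the outlier topic (-1) instead
topic_info = model.get_topic_info()
top_topic_ids = topic_info[topic_info["Topic"] != -1]["Topic"].head(10)

# Print the top words of up to 10 clusters/topics
for i, topic_id in enumerate(top_topic_ids):
    print(f"Cluster {i+1} (topic {topic_id}):")
    print(model.get_topic(topic_id))
    print("\n")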




# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% of the data for training and 20% for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build two sentiment classifiers respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# List of encodings to try
encodings = ["utf-8", "latin-1", "iso-8859-1", "cp1252"]

# Try reading the CSV file with different encodings
for encoding in encodings:
    try:
        data = pd.read_csv("annotated_paper.csv", encoding=encoding)
        print(f"CSV file successfully read with encoding: {encoding}")
        break  # Stop trying encodings if successful
    except Exception as e:
        print(f"Failed to read CSV file with encoding {encoding}: {e}")

# Drop rows with NaN values
data.dropna(subset=['Abstract'], inplace=True)

# Check the class distribution of the target variable
print(data['sentiment'].value_counts())

# Selecting features and target
X = data['Abstract'].astype(str)
y = data['sentiment']

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X)

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

CSV file successfully read with encoding: utf-8
sentiment
negative    2983
positive    2979
neutral     2869
Name: count, dtype: int64


In [None]:
X_train

<7065x101467 sparse matrix of type '<class 'numpy.float64'>'
	with 1142369 stored elements in Compressed Sparse Row format>

In [None]:
y_test

9800    positive
1235    negative
3392     neutral
2312     neutral
9687     neutral
          ...   
7049    positive
5338    negative
6812    positive
9714     neutral
6779    negative
Name: sentiment, Length: 1767, dtype: object

In [None]:
from sklearn.preprocessing import LabelEncoder
# Encode categorical labels in y_test
label_encoder = LabelEncoder()
y_test_encoded = label_encoder.fit_transform(y_test)


In [None]:
# Models
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier()
}

# Training and evaluation
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(y_pred)

Training Logistic Regression...
['positive' 'positive' 'negative' ... 'negative' 'negative' 'negative']
Training Random Forest...
['negative' 'positive' 'negative' ... 'negative' 'negative' 'neutral']
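
The loop above prints raw predictions only. One way the cross-validation and metric comparison requested in the question could look is sketched below. It is an illustration under assumptions, not the notebook's result: the toy `texts` and `labels` are hypothetical stand-ins for `data['Abstract']` and `data['sentiment']`, and wrapping TF-IDF in a pipeline is a design choice made here so that each fold fits its own vocabulary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the notebook would use data['Abstract'] and data['sentiment']
texts = (
    [f"great excellent result number {i}" for i in range(20)]
    + [f"poor terrible failure number {i}" for i in range(20)]
    + [f"method described section number {i}" for i in range(20)]
)
labels = ["positive"] * 20 + ["negative"] * 20 + ["neutral"] * 20

# 80/20 split on the raw text, stratified so every class appears in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

results = {}
for name, clf in models.items():
    # The pipeline refits TF-IDF inside every fold, so held-out text never shapes the vocabulary
    pipe = make_pipeline(TfidfVectorizer(), clf)
    cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")

    # Final fit on the full training split, then evaluation on the untouched test split
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0
    )
    results[name] = {
        "cv_accuracy": cv_scores.mean(),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, results[name])
```

Weighted averaging is used because the three sentiment classes are close to, but not exactly, balanced; macro averaging would be a reasonable alternative.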


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you select those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

The four steps are approached as follows:

### 1. Exploratory Data Analysis (EDA) and Data Cleaning:
We'll start by loading the training and testing datasets, exploring their structure, handling missing values, and performing basic visualizations to understand the data better.

### 2. Feature Selection:
Based on the EDA results, we'll select a subset of features that seem most relevant for predicting house prices. We may consider features that have a strong correlation with the target variable and those that are not highly correlated with each other to avoid multicollinearity issues.
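
The selection rule described here (strong correlation with the target, weak correlation among the chosen features) can be sketched in plain Python. The thresholds 0.5 and 0.8, the helper names `pearson` and `select_features`, the column `LotAreaEstimate`, and all numbers are assumptions made for illustration; the other column names follow the housing data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, min_target_corr=0.5, max_pair_corr=0.8):
    """Greedy filter: strongest target correlation first, skip near-duplicates."""
    ranked = sorted(columns, key=lambda c: abs(pearson(columns[c], target)), reverse=True)
    selected = []
    for name in ranked:
        if abs(pearson(columns[name], target)) < min_target_corr:
            break  # every remaining column is even weaker
        if all(abs(pearson(columns[name], columns[s])) <= max_pair_corr for s in selected):
            selected.append(name)
    return selected


# Made-up values for illustration only
columns = {
    "OverallQual": [4, 5, 6, 7, 8, 9],
    "LotArea": [8500, 7000, 12000, 8500, 13000, 11000],
    "LotAreaEstimate": [9000, 7000, 12000, 8000, 13000, 11000],  # near-duplicate of LotArea
    "MoSold": [6, 2, 9, 5, 3, 8],
}
sale_price = [170000, 170000, 240000, 220000, 290000, 290000]

selected = select_features(columns, sale_price)
print(selected)
```

On this toy data the filter keeps `OverallQual` and `LotArea`, drops `LotAreaEstimate` as collinear with `LotArea`, and drops `MoSold` as weakly related to the price.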

### 3. Regression Model Development:
We'll develop a regression model using the selected features from the training dataset. We can start with a simple linear regression model and then explore more advanced models if necessary.

### 4. Model Evaluation:
We'll evaluate the performance of the regression model using appropriate evaluation metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) score. We'll use the test set to assess how well the model generalizes to new data.
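
For reference, the four metrics named above can be written out from their definitions. The sketch below uses only the standard library; `regression_metrics` is a hypothetical helper and the prices are made-up values, not model output.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e ** 2 for e in errors) / n          # mean squared error
    rmse = math.sqrt(mse)                          # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up sale prices and predictions for illustration
y_true = [200000, 150000, 300000, 250000]
y_pred = [210000, 140000, 280000, 260000]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are in dollars, so they are easy to interpret; RMSE penalizes large misses more heavily, and R-squared reports the share of price variance the model explains.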

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the training and testing datasets
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Display basic information about the datasets
print("Training Data Info:")
print(train_data.info())
print("\nTesting Data Info:")
print(test_data.info())

# Check for missing values
print("\nMissing Values in Training Data:")
print(train_data.isnull().sum())
print("\nMissing Values in Testing Data:")
print(test_data.isnull().sum())

Training Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   i

In [None]:
# Drop columns with more than 10% null values in both train and test data
train_data = train_data.drop(columns=train_data.columns[train_data.isnull().mean() > 0.1])
test_data = test_data.drop(columns=test_data.columns[test_data.isnull().mean() > 0.1])

# Check for null values again
print("\nMissing Values in Training Data:")
print(train_data.isnull().sum())
print("\nMissing Values in Testing Data:")
print(test_data.isnull().sum())



Missing Values in Training Data:
Id               0
MSSubClass       0
MSZoning         0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 74, dtype: int64

Missing Values in Testing Data:
Id               0
MSSubClass       0
MSZoning         4
LotArea          0
Street           0
                ..
MiscVal          0
MoSold           0
YrSold           0
SaleType         1
SaleCondition    0
Length: 73, dtype: int64


In [None]:
# remove null

train_data = train_data.dropna()
test_data = test_data.dropna()


In [None]:
train_data.head()

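|  | Id | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 8450 | Pave | Reg | Lvl | AllPub | Inside | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 9600 | Pave | Reg | Lvl | AllPub | FR2 | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 11250 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 9550 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 14260 | Pave | IR1 | Lvl | AllPub | FR2 | Gtl | ... | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | WD | Normal | 250000 |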


In [None]:
# Label-encode all categorical variables

from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

# Iterate through the categorical columns
for col in train_data.select_dtypes(include=['object']):
  # Fit the encoder to the data
  le.fit(train_data[col])
  # Transform the data
  train_data[col] = le.transform(train_data[col])
  test_data[col] = le.transform(test_data[col])

train_data.head()


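|  | Id | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 3 | 8450 | 1 | 3 | 3 | 0 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 8 | 4 | 208500 |
| 1 | 2 | 20 | 3 | 9600 | 1 | 3 | 3 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 8 | 4 | 181500 |
| 2 | 3 | 60 | 3 | 11250 | 1 | 0 | 3 | 0 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 8 | 4 | 223500 |
| 3 | 4 | 70 | 3 | 9550 | 1 | 0 | 3 | 0 | 0 | 0 | ... | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 8 | 0 | 140000 |
| 4 | 5 | 60 | 3 | 14260 | 1 | 0 | 3 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 8 | 4 | 250000 |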
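
One caveat with fitting the encoder on a training column and reusing it on the test column: `LabelEncoder.transform` raises an error for any category it did not see during `fit`. The sketch below shows the underlying idea in plain Python with a fallback code for unseen values; `fit_category_codes`, `encode`, and the zoning values are hypothetical. scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"` offers comparable behavior.

```python
def fit_category_codes(values):
    """Map each distinct training category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, codes, unknown=-1):
    """Encode values with a fitted mapping; unseen categories get `unknown`."""
    return [codes.get(value, unknown) for value in values]


# Hypothetical category values for one column
train_zoning = ["RL", "RM", "RL", "FV", "RH"]
test_zoning = ["RL", "C (all)", "RM"]  # "C (all)" never appears in train_zoning

codes = fit_category_codes(train_zoning)
train_encoded = encode(train_zoning, codes)
test_encoded = encode(test_zoning, codes)

print(codes)
print(train_encoded)
print(test_encoded)
```

Integer codes also impose an arbitrary order on nominal categories, which a linear model will treat as numeric; one-hot encoding is a common alternative when that matters.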


In [None]:
# Split the data into features and target variable
X_train = train_data.drop(columns=["SalePrice"])  # Features
y_train = train_data["SalePrice"]  # Target variable
X_test = test_data

# Split the training data into training and validation sets
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [None]:
# Develop a regression model
model = LinearRegression()
model.fit(X_train_split, y_train_split)

# Predict house prices on the validation set
y_val_pred = model.predict(X_val)

# Evaluate the model
mae = mean_absolute_error(y_val, y_val_pred)
mse = mean_squared_error(y_val, y_val_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_val_pred)

print("\nModel Evaluation:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2) Score: {r2}")


Model Evaluation:
Mean Absolute Error (MAE): 22154.410142191173
Mean Squared Error (MSE): 975065503.5832952
Root Mean Squared Error (RMSE): 31226.0388711616
R-squared (R2) Score: 0.7721333610442048


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources,  number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform the sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting, NO finetuning is required. Evaluate performance of the model by comparing with the groundtruths (labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


To accomplish this task, I'll follow these steps:

1. Select a pre-trained language model (PLM) from the Hugging Face Repository.
2. Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3 in the zero-shot setting.
3. Evaluate the performance of the model by comparing it with the ground truths (labels annotated in Assignment 3) on Accuracy, Precision, Recall, and F1 metrics.
4. Discuss the advantages and disadvantages of the selected PLM, as well as any challenges encountered during the implementation.

Let's start by selecting a relevant PLM and providing a brief description. Then, we'll proceed with implementing sentiment analysis using the chosen PLM.

### 1. Selection of Pre-trained Language Model (PLM)

For this task, let's choose BERT (Bidirectional Encoder Representations from Transformers) from the Hugging Face Repository.

**Description of BERT:**
- **Original Pretraining Data Sources:** BERT was trained on a large corpus of text from BooksCorpus (800 million words) and English Wikipedia (2,500 million words).
- **Number of Parameters:** BERT-base has 110 million parameters, while BERT-large has 340 million parameters.
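- **Task-specific Fine-tuning:** BERT can be fine-tuned on specific tasks by adding task-specific layers and training on task-specific data. For this task, no fine-tuning is performed on the Assignment 3 data. Note that the Hugging Face `sentiment-analysis` pipeline, when called without a `model` argument, loads a default checkpoint that has already been fine-tuned for sentiment classification (a DistilBERT model trained on SST-2), so "zero-shot" here means only that the model never sees our annotated labels.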

### 2. Sentiment Analysis Using BERT (Zero-shot)

We'll apply the pre-trained sentiment pipeline to the data collected in Assignment 3 without any additional training, then evaluate its predictions against the annotated labels using Accuracy, Precision, Recall, and F1 metrics.

### 3. Discussion of Advantages, Disadvantages, and Challenges

**Advantages of BERT:**
- At its release, BERT set state-of-the-art results on several NLP benchmarks, and fine-tuned BERT variants perform strongly on sentiment analysis.
- It can capture complex relationships and contexts in text data due to its bidirectional nature.
- BERT is available as a pre-trained model and can be easily fine-tuned for specific tasks.

**Disadvantages of BERT:**
- BERT requires significant computational resources for training and inference due to its large number of parameters.
- Fine-tuning BERT for specific tasks may require labeled task-specific data, which can be expensive and time-consuming to obtain.
- BERT's WordPiece tokenizer splits rare or domain-specific terms into subword pieces, which can weaken how well specialized vocabulary is represented.

**Challenges Encountered:**
- Understanding and selecting the appropriate settings for using BERT, such as whether to use BERT-base or BERT-large, and whether to fine-tune or use zero-shot learning.
- Processing large amounts of text data efficiently, especially when dealing with a large dataset.
- Evaluating the performance of BERT and interpreting the results in the context of the specific task requirements.

Overall, BERT is a powerful tool for natural language processing tasks like sentiment analysis, but it requires careful consideration of various factors such as computational resources, fine-tuning strategies, and evaluation metrics.

In [None]:
# Write your code here


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# List of encodings to try
encodings = ["utf-8", "latin-1", "iso-8859-1", "cp1252"]

# Try reading the CSV file with different encodings
for encoding in encodings:
    try:
        data = pd.read_csv("annotated_paper.csv", encoding=encoding)
        print(f"CSV file successfully read with encoding: {encoding}")
        break  # Stop trying encodings if successful
    except Exception as e:
        print(f"Failed to read CSV file with encoding {encoding}: {e}")

# Drop rows with NaN values
data.dropna(subset=['Abstract', 'sentiment'], inplace=True)

CSV file successfully read with encoding: utf-8


In [None]:
from collections import Counter

from transformers import pipeline
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initialize the sentiment analysis pipeline; with no model argument, transformers
# loads its default checkpoint (a DistilBERT model fine-tuned on SST-2)
sentiment_pipeline = pipeline("sentiment-analysis", framework="pt")

# Split text into chunks of at most max_length characters (characters, not tokens)
def split_text(text, max_length=512):
    return [text[i:i+max_length] for i in range(0, len(text), max_length)]

# Perform sentiment analysis on each abstract, keeping exactly one label per abstract
predicted_sentiments = []

for abstract in data['Abstract'].astype(str).tolist():
    chunks = split_text(abstract)
    # truncation=True guards against chunks that still exceed the model's token limit
    chunk_sentiments = [result['label'] for result in sentiment_pipeline(chunks, truncation=True)]
    # Majority vote across chunks so predictions stay aligned with data['sentiment']
    predicted_sentiments.append(Counter(chunk_sentiments).most_common(1)[0][0])
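
When a long abstract is split into chunks, the pipeline returns one result per chunk, while the evaluation needs one label per abstract. A minimal sketch of one possible aggregation rule follows: sum the confidence per label and keep the label with the largest total. `aggregate_chunks` is a hypothetical helper, and the dictionaries mimic the `label`/`score` shape of the pipeline's output with made-up scores.

```python
from collections import defaultdict


def aggregate_chunks(chunk_results):
    """Sum the confidence per label over all chunks and return the top label, lowercased."""
    totals = defaultdict(float)
    for result in chunk_results:
        totals[result["label"]] += result["score"]
    return max(totals, key=totals.get).lower()


# Made-up pipeline outputs for one abstract split into three chunks
chunk_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.61},
    {"label": "POSITIVE", "score": 0.75},
]

abstract_label = aggregate_chunks(chunk_results)
print(abstract_label)
```

Applying this once per abstract yields a prediction list with the same length as the ground-truth labels, which is what `accuracy_score` and the other metrics require.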


In [None]:
predicted_sentiments = [sentiment.lower() for sentiment in predicted_sentiments]

# Evaluate performance
y_true = data['sentiment'].tolist()
accuracy = accuracy_score(y_true, predicted_sentiments)
# zero_division=0 avoids undefined-metric warnings for any class that is never predicted
precision = precision_score(y_true, predicted_sentiments, average='weighted', zero_division=0)
recall = recall_score(y_true, predicted_sentiments, average='weighted', zero_division=0)
f1 = f1_score(y_true, predicted_sentiments, average='weighted', zero_division=0)

# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")