
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster. 
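
Item (1) asks for the text representation. As a rough illustration of what a bag-of-words document-term matrix looks like, the sketch below builds one by hand. The three-sentence corpus is made up for illustration and is not the assignment-three dataset, and the whitespace tokenization is a simplification.

```python
from collections import Counter

# Hypothetical toy corpus; the real input would be the review texts from assignment three.
docs = [
    "great movie great cast",
    "boring plot weak cast",
    "great plot",
]

# Vocabulary: every distinct token, sorted so that the column order is deterministic.
vocab = sorted({token for doc in docs for token in doc.split()})

# Document-term matrix: one row per document, one column per vocabulary term.
doc_term = []
for doc in docs:
    counts = Counter(doc.split())
    doc_term.append([counts[term] for term in vocab])

print(vocab)
for row in doc_term:
    print(row)
```

Each row counts how often every vocabulary term occurs in one document; a topic model such as LDA takes a matrix of this shape as input.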


In [None]:
# Write your code here

# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the dataset
df = pd.read_csv('INFO_5731/assgn_3/final.csv')

# Create the document-term matrix
vectorizer = CountVectorizer(stop_words='english')
doc_term_matrix = vectorizer.fit_transform(df['review_text'])

# Fit the LDA model
num_topics = 10
lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda_model.fit(doc_term_matrix)

# Print the top 10 clusters for topic modeling
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
top_clusters = {}
for i, topic in enumerate(lda_model.components_):
    top_clusters[f"Cluster {i+1}"] = [feature_names[idx] for idx in topic.argsort()[-10:][::-1]]
    print(f"Cluster {i+1}: {top_clusters[f'Cluster {i+1}']}")

# Summarize and describe the topic for each cluster
# Compute the document-topic distribution once (rows = documents, columns = topics)
doc_topic = lda_model.transform(doc_term_matrix)
for i, (cluster, words) in enumerate(top_clusters.items()):
    print(f"\n{cluster}:")
    for word in words:
        print(f"- {word}")
    print("\nTopic summary (most representative review):")
    # The review with the highest weight on topic i serves as an example of that topic
    best_doc = doc_topic[:, i].argmax()
    print(df['review_text'].iloc[best_doc])
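
The cell above picks the top words of each topic by sorting the rows of the topic-term weight matrix. A minimal sketch of that ranking step, using plain lists and made-up weights (the vocabulary and numbers are hypothetical, not output from the fitted model):

```python
# Hypothetical vocabulary and topic-term weights (one row per topic).
vocab = ["acting", "plot", "music", "boring", "great"]
topic_weights = [
    [0.10, 0.50, 0.05, 0.30, 0.05],  # topic 0
    [0.40, 0.05, 0.20, 0.05, 0.30],  # topic 1
]

def top_terms(weights, vocab, k):
    """Return the k vocabulary terms with the largest weights, highest first."""
    ranked = sorted(range(len(weights)), key=lambda idx: weights[idx], reverse=True)
    return [vocab[idx] for idx in ranked[:k]]

top_words = [top_terms(row, vocab, k=2) for row in topic_weights]
for topic_id, words in enumerate(top_words):
    print(f"Topic {topic_id}: {words}")
```

This is the same idea as `topic.argsort()[-10:][::-1]`, written without NumPy so that the ordering is easy to follow.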



# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

(1) Report the features used for sentiment classification and explain why you selected them.

(2) Select two supervised learning algorithms from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9. 
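
To make the four metrics in item (3) concrete, here is a small hand-rolled calculation for a single positive class. The labels are made up for illustration, and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "positive", "negative", "negative",
          "positive", "negative", "positive", "positive"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With more than two sentiment classes, scikit-learn's `average='weighted'` option computes these scores per class and averages them weighted by class support.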

In [None]:
# Write your code here

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

# Load the dataset
df = pd.read_csv('INFO_5731/assgn_3/final.csv')

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['review_text'], df['sentiment'], test_size=0.2, random_state=42)

# Feature extraction
vectorizer = TfidfVectorizer(stop_words='english')
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Model 1: Multinomial Naive Bayes
mnb_model = MultinomialNB()
mnb_model.fit(X_train_features, y_train)

# Cross-validation for MNB model
mnb_scores = cross_val_score(mnb_model, X_train_features, y_train, cv=5)
print("Multinomial Naive Bayes cross-validation scores:", mnb_scores)

# Model 2: Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train_features, y_train)

# Cross-validation for Logistic Regression model
lr_scores = cross_val_score(lr_model, X_train_features, y_train, cv=5)
print("Logistic Regression cross-validation scores:", lr_scores)

# Evaluate model performance on test set
mnb_pred = mnb_model.predict(X_test_features)
lr_pred = lr_model.predict(X_test_features)

print("Multinomial Naive Bayes performance:")
print("Accuracy:", accuracy_score(y_test, mnb_pred))
print("Precision:", precision_score(y_test, mnb_pred, average='weighted'))
print("Recall:", recall_score(y_test, mnb_pred, average='weighted'))
print("F1 score:", f1_score(y_test, mnb_pred, average='weighted'))

print("\nLogistic Regression performance:")
print("Accuracy:", accuracy_score(y_test, lr_pred))
print("Precision:", precision_score(y_test, lr_pred, average='weighted'))
print("Recall:", recall_score(y_test, lr_pred, average='weighted'))
print("F1 score:", f1_score(y_test, lr_pred, average='weighted'))
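
One possible refinement, sketched under assumptions: wrapping the vectorizer and the classifier in a pipeline lets each cross-validation fold fit TF-IDF on its own training part only, and `cross_validate` can report all four metrics per fold. The toy reviews below are made up and stand in for `df['review_text']` and `df['sentiment']`; `max_iter=1000` is an arbitrary choice, not a setting from the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews: 10 positive and 10 negative, enough for stratified 5-fold CV.
pos_words = ["great", "excellent", "wonderful", "loved", "amazing"]
neg_words = ["terrible", "awful", "boring", "hated", "poor"]
texts = [f"{w} movie" for w in pos_words * 2] + [f"{w} movie" for w in neg_words * 2]
labels = ["positive"] * 10 + ["negative"] * 10

scoring = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]
models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in models.items():
    # The vectorizer is refit inside every fold, so no vocabulary leaks from the held-out part.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    cv = cross_validate(pipeline, texts, labels, cv=5, scoring=scoring)
    results[name] = {metric: cv[f"test_{metric}"].mean() for metric in scoring}
    print(name, results[name])
```

The mean scores per model can then be placed side by side to compare the two algorithms.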




# **Question 3: House price prediction**

(40 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning task. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.
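
As a warm-up for the idea of regression, the sketch below fits a straight line to a single explanatory variable with the closed-form least-squares formulas. The four data points (living area in thousands of square feet, price in thousands of dollars) are invented for illustration and do not come from the Canvas files.

```python
# Invented data: living area (1000 sq ft) and sale price (1000 USD).
area = [1.0, 1.5, 2.0, 2.5]
price = [150.0, 200.0, 250.0, 300.0]

n = len(area)
mean_x = sum(area) / n
mean_y = sum(price) / n

# Least-squares slope = covariance(x, y) / variance(x); the intercept follows from the means.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(area, price))
var_x = sum((x - mean_x) ** 2 for x in area)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

print(f"price = {intercept:.1f} + {slope:.1f} * area")
```

With 79 explanatory variables the same principle applies, but the coefficients are solved jointly, which is what `LinearRegression` does.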


In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the datasets
train_df = pd.read_csv('INFO_5731/assgn_4/ques_3/train.csv')
test_df = pd.read_csv('INFO_5731/assgn_4/ques_3/test.csv')

# Load the data description file into a {column name: description} dictionary
desc_dict = {}
with open('INFO_5731/assgn_4/ques_3/data_description.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if ':' in line:
            # Split on the first colon only, so text containing further colons does not raise ValueError
            key, val = line.split(':', 1)
            desc_dict[key.strip()] = val.strip()

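# Preprocessing: impute missing values, then one-hot encode the categorical columns
combined_df = pd.concat([train_df.drop('SalePrice', axis=1), test_df], axis=0, ignore_index=True)
for col in combined_df.columns:
    if combined_df[col].dtype == 'object':
        # Fill missing categories with a constant placeholder: the column description
        # if one was parsed, otherwise 'Missing' (avoids a KeyError for unknown columns)
        combined_df[col] = combined_df[col].fillna(desc_dict.get(col, 'Missing'))
    else:
        # Fill missing numeric values with the column mean
        combined_df[col] = combined_df[col].fillna(combined_df[col].mean())
combined_df = pd.get_dummies(combined_df)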

# Split the datasets
X_train = combined_df[:train_df.shape[0]]
X_test = combined_df[train_df.shape[0]:]
y_train = train_df['SalePrice']

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hold out 20% of the labeled training data for validation (test.csv has no SalePrice column)
X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train_split, y_train_split)

# Make predictions on the held-out validation split
y_pred = model.predict(X_test_split)

# Evaluate the model
mse = mean_squared_error(y_test_split, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_split, y_pred)

print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R-squared: {r2:.2f}")
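
For reference, the two scores printed above can be reproduced by hand. The sketch below computes RMSE and R-squared from their definitions; the four prices and predictions are made up, and `rmse_and_r2` is a hypothetical helper.

```python
import math

def rmse_and_r2(y_true, y_pred):
    """Compute root mean squared error and R-squared from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return rmse, r2

# Made-up sale prices and predictions for illustration.
y_true = [200000, 150000, 250000, 300000]
y_pred = [210000, 140000, 240000, 310000]

rmse, r2 = rmse_and_r2(y_true, y_pred)
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R-squared: {r2:.3f}")
```

RMSE is in the same unit as the target (dollars here), while R-squared is the share of variance in the true prices that the predictions explain.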
