
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how its output connects to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.


In [None]:
# Write your code here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import pandas as pd

In [None]:
file_path = 'srmtd_reviews_cleaned.xl (1).xls'
data = pd.read_excel(file_path)
documents = data['Cleaned_Review']

In [None]:
# Text Representation using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# LDA Model
n_topics = 10
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
lda.fit(tfidf)
# Display the top words in each topic
def display_topics(model, feature_names, no_top_words):
    topic_summaries = []
    for topic_idx, topic in enumerate(model.components_):
        topic_top_words = " ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]])
        topic_summaries.append((topic_idx, topic_top_words))
    return topic_summaries

no_top_words = 10
topic_summaries = display_topics(lda, tfidf_vectorizer.get_feature_names_out(), no_top_words)
# Display
topic_summaries

[(0,
  'harsha babu nice performance story love help charuseela millionaire fall'),
 (1, 'film acting hero flop routine time movie expecting got review'),
 (2,
  'movie time character great babu like action rajendraprasad decent worked'),
 (3,
  'rating telugu multimillionaire different charusheela movie story adopts directed review'),
 (4,
  'good performance film india job dancer especially wonder commercial mahesh'),
 (5, 'content new formula dont routine different director movie story good'),
 (6, 'village vardhan come people babu rest harsha know father son'),
 (7, 'wasted said new story mahesh movie sruthi routine siva usual'),
 (8, 'film dont know good movie audience bit flop telugu great'),
 (9, 'people babu form bit needy process instead life family ravikanth')]
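
The top words describe each topic, but summarizing a cluster is easier when you can also see which reviews fall into it. The following is a minimal sketch, not the notebook's own method, of assigning each document to its dominant topic from the document-topic matrix. The reviews in `toy_documents` and the choice of two topics are hypothetical stand-ins; with the real data the same steps would apply to `lda.transform(tfidf)`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in reviews; the real input would be data['Cleaned_Review'].
toy_documents = [
    "great acting and a great story",
    "routine story and a flop film",
    "great performance by the hero",
    "routine film with weak acting",
    "the story is great and the hero shines",
    "a flop with a routine plot",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_documents)

# Two topics only because the toy corpus is tiny (assumption for illustration).
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# fit_transform returns one row per document; each row is a topic distribution.
doc_topic = toy_lda.fit_transform(toy_tfidf)

# The dominant topic is the column with the largest weight in each row.
dominant_topic = doc_topic.argmax(axis=1)
topic_sizes = np.bincount(dominant_topic, minlength=toy_lda.n_components)

for doc, topic in zip(toy_documents, dominant_topic):
    print(topic, doc)
print("documents per topic:", topic_sizes)
```

Reading a handful of reviews per dominant topic next to the top words usually makes the written summary for each cluster more defensible.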

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two supervised learning algorithms from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance of the two algorithms you selected in terms of accuracy, precision, recall, and F1 score. Here is a reference on how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [None]:
# Write your code here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.svm import SVC

In [None]:
# Data Cleaning
cleaned_data = data.dropna(subset=['Cleaned_Review', 'Sentiment'])
X = cleaned_data['Cleaned_Review']
y = cleaned_data['Sentiment'].str.lower()

In [None]:
# Splitting into training and testing sets
# 80% for training and 20% for testing, as the assignment requires
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [None]:
# Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

LogisticRegression(max_iter=1000)

In [None]:
# Evaluation
y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(accuracy)
print(report)

0.25
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         2
     neutral       0.00      0.00      0.00         4
    positive       0.25      1.00      0.40         2

    accuracy                           0.25         8
   macro avg       0.08      0.33      0.13         8
weighted avg       0.06      0.25      0.10         8
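
In the report above, precision for `negative` and `neutral` is undefined (no test review was predicted as either class), which is the situation where scikit-learn substitutes 0 and emits an `UndefinedMetricWarning`. A confusion matrix makes this visible directly. The sketch below uses hypothetical labels, chosen so that every review is predicted `positive`, which is consistent with the numbers reported; the names `y_true_demo` and `y_pred_demo` are illustrative only.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: class supports 2/4/2, every prediction is "positive".
labels = ["negative", "neutral", "positive"]
y_true_demo = ["negative", "negative",
               "neutral", "neutral", "neutral", "neutral",
               "positive", "positive"]
y_pred_demo = ["positive"] * len(y_true_demo)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# zero_division=0 reports 0.0 for undefined precision without emitting a warning.
report_demo = classification_report(
    y_true_demo, y_pred_demo, labels=labels, zero_division=0, output_dict=True
)
print(report_demo["positive"])
```

With the real data, `confusion_matrix(y_test, y_pred, labels=labels)` would show whether the classifier collapses onto a single class in the same way.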

In [None]:
# model for comparison
logistic_model = LogisticRegression(max_iter=1000)
svm_model = SVC()
cv_folds = 5

In [None]:
scoring_metrics_standard = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
logistic_cv_results_standard = cross_validate(logistic_model, X_train_tfidf, y_train, cv=cv_folds,
                                              scoring=scoring_metrics_standard)
svm_cv_results_standard = cross_validate(svm_model, X_train_tfidf, y_train, cv=cv_folds,
                                         scoring=scoring_metrics_standard)

def average_cv_results_standard(cv_results):
    return {metric: np.mean(cv_results[f'test_{metric}']) for metric in scoring_metrics_standard}

logistic_performance_standard = average_cv_results_standard(logistic_cv_results_standard)
svm_performance_standard = average_cv_results_standard(svm_cv_results_standard)


In [None]:
logistic_performance_standard, svm_performance_standard

({'accuracy': 0.6333333333333332,
  'precision_macro': 0.29999999999999993,
  'recall_macro': 0.4666666666666666,
  'f1_macro': 0.3644444444444444},
 {'accuracy': 0.6333333333333332,
  'precision_macro': 0.29999999999999993,
  'recall_macro': 0.4666666666666666,
  'f1_macro': 0.3644444444444444})
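
One caveat with cross-validating on `X_train_tfidf` is that the TF-IDF vocabulary and IDF weights were fitted on the whole training set, so each validation fold has already influenced the features. A hedged sketch of an alternative is shown below: the vectorizer is placed inside a scikit-learn pipeline so it is refitted per fold, the folds are stratified by label, and the macro scorers use `zero_division=0`. The toy texts and labels are hypothetical; with the real data the raw `X_train` and `y_train` would be passed instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy reviews: five per class so that 5-fold stratification is possible.
toy_texts = [
    "great movie", "loved the acting", "wonderful story", "excellent film", "superb performance",
    "boring movie", "terrible acting", "awful story", "worst film", "weak performance",
    "average movie", "okay acting", "ordinary story", "watchable film", "decent performance",
]
toy_labels = ["positive"] * 5 + ["negative"] * 5 + ["neutral"] * 5

# Macro-averaged scorers that return 0 instead of warning on undefined values.
scoring = {
    "accuracy": "accuracy",
    "precision_macro": make_scorer(precision_score, average="macro", zero_division=0),
    "recall_macro": make_scorer(recall_score, average="macro", zero_division=0),
    "f1_macro": make_scorer(f1_score, average="macro", zero_division=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


def summarize(estimator):
    """Cross-validate a TF-IDF + estimator pipeline and average each metric."""
    pipe = make_pipeline(TfidfVectorizer(), estimator)  # vectorizer is refitted in every fold
    results = cross_validate(pipe, toy_texts, toy_labels, cv=cv, scoring=scoring)
    return {name: float(np.mean(results[f"test_{name}"])) for name in scoring}


logistic_summary = summarize(LogisticRegression(max_iter=1000))
svm_summary = summarize(SVC())
print(logistic_summary)
print(svm_summary)
```

The scores on the toy data are meaningless in themselves; the point is only the structure of the evaluation.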

# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [None]:
# Write your code here
import pandas as pd

train_file_path = 'assignment4-question3-data (1)/train.csv'
test_file_path = 'assignment4-question3-data (1)/test.csv'
train_data = pd.read_csv(train_file_path)
test_data = pd.read_csv(test_file_path)
train_data.head(), train_data.shape, test_data.shape

(   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
 0   1          60       RL         65.0     8450   Pave   NaN      Reg   
 1   2          20       RL         80.0     9600   Pave   NaN      Reg   
 2   3          60       RL         68.0    11250   Pave   NaN      IR1   
 3   4          70       RL         60.0     9550   Pave   NaN      IR1   
 4   5          60       RL         84.0    14260   Pave   NaN      IR1   
 
   LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
 0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
 1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
 2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
 3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
 4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   
 
   YrSold  SaleType  SaleCondition  SalePrice  
 0   2

In [None]:
# filling missing values
def fill_missing_values(df, fill_strategy):
    """Fill or drop columns of df in place, following a {column: strategy} mapping."""
    for column, strategy in fill_strategy.items():
        # Assign the result back to the column: calling fillna(inplace=True)
        # on df[column] may modify a temporary copy instead of df itself.
        if strategy == 'mean':
            df[column] = df[column].fillna(df[column].mean())
        elif strategy == 'median':
            df[column] = df[column].fillna(df[column].median())
        elif strategy == 'mode':
            df[column] = df[column].fillna(df[column].mode()[0])
        elif strategy == 'none':
            df[column] = df[column].fillna('None')
        elif strategy == 'drop':
            df.drop(columns=column, inplace=True)

# define strategies
fill_strategies_train = {
    'LotFrontage': 'mean',
    'Alley': 'none',
    'MasVnrType': 'mode',
    'MasVnrArea': 'mean',
    'BsmtQual': 'mode',
    'FireplaceQu': 'none',
    'GarageType': 'mode',
    'GarageYrBlt': 'mean',
    'PoolQC': 'none',
    'Fence': 'none',
    'MiscFeature': 'none',
}

# the test set uses the same strategies as the training set
fill_strategies_test = fill_strategies_train.copy()

In [None]:
fill_missing_values(train_data, fill_strategies_train)
fill_missing_values(test_data, fill_strategies_test)
# verifying
missing_values_train_after = train_data.isnull().sum().sum()
missing_values_test_after = test_data.isnull().sum().sum()

In [None]:
missing_values_train_after, missing_values_test_after

(394, 429)
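
The counts above show that missing values remain after the per-column strategies, and the cells stop before any model is fitted. The following is a minimal sketch, under stated assumptions, of how imputation, categorical encoding, and a linear regression could be combined in one scikit-learn pipeline so that every column is handled. The frames `toy_train` and `toy_test` are hypothetical miniatures: the column names follow `train.csv`, the values are illustrative, and `LinearRegression` is just one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature frames; real code would use train_data and test_data.
toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 10000],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, 70.0],
    "MSZoning": ["RL", "RL", "RM", np.nan, "RL", "RM"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 175000],
})
toy_test = pd.DataFrame({
    "LotArea": [9000, 12000],
    "LotFrontage": [np.nan, 75.0],
    "MSZoning": ["RL", np.nan],
})

features = toy_train.drop(columns="SalePrice")
target = toy_train["SalePrice"]

# Split columns by dtype so each group gets a suitable imputer.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Imputation statistics are learned from the training frame only.
regressor = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
regressor.fit(features, target)

predictions = regressor.predict(toy_test)
print(predictions)
```

Because the imputers live inside the pipeline, the test set is filled with statistics from the training data rather than its own, which keeps the two frames consistent.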

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **pre-trained Large Language Model (LLM) from the Hugging Face Repository** for your specific task using the data collected in Assignment 3. After creating an account on Hugging Face (https://huggingface.co/), choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any Meta-based text analysis model. Provide a brief description of the selected LLM, including its original source, significant parameters, and any task-specific fine-tuning if applied.

Perform a detailed analysis of the LLM's performance on your task, including key metrics, strengths, and limitations. Additionally, discuss any challenges encountered during the implementation and potential strategies for improvement. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.
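
As one possible starting point, the sketch below separates the two steps this question asks for: labeling reviews with a pre-trained model and scoring those labels against the `Sentiment` column. Everything here is an assumption rather than a prescribed solution: the checkpoint name `cardiffnlp/twitter-roberta-base-sentiment-latest` is only one candidate three-class sentiment model, its label names are assumed to be `negative`, `neutral`, and `positive`, and `predict_sentiment` and `score_predictions` are hypothetical helper names. The demo labels at the bottom are made up to show the scoring step without downloading a model.

```python
from collections import Counter

from sklearn.metrics import accuracy_score, f1_score

# Assumed checkpoint; any three-class sentiment model on the Hub could be substituted.
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"


def predict_sentiment(texts, model_name=MODEL_NAME):
    """Label each text with a Hugging Face pipeline (downloads the model on first use)."""
    # Imported lazily so the scoring helper below works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    outputs = classifier(list(texts), truncation=True)
    # Lowercase the labels so they match the lowercased Sentiment column.
    return [out["label"].lower() for out in outputs]


def score_predictions(y_true, y_pred):
    """Return accuracy, macro F1, and the distribution of predicted labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "predicted_counts": dict(Counter(y_pred)),
    }


# Hypothetical gold labels and predictions, used only to show the scoring step.
demo_true = ["positive", "negative", "neutral", "positive"]
demo_pred = ["positive", "negative", "positive", "positive"]
demo_scores = score_predictions(demo_true, demo_pred)
print(demo_scores)
```

With the real data, the call would look like `score_predictions(y.tolist(), predict_sentiment(X.tolist()))`, assuming the model's labels line up with the dataset's sentiment categories.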


In [None]:
# Write your code here
