# **ML Lab - Amazon Fine Food Reviews (Sentiment Analysis)**
Urlana Suresh Kumar - 22071A6662

# Objective
Perform sentiment analysis on the Amazon Fine Food Reviews dataset to predict whether a review is positive or negative.


## Step 1: Load the Dataset
1. Download the **Amazon Fine Food Reviews** dataset from [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews).
2. Save the file as `Reviews.csv` in your working directory.
3. Load the dataset into a pandas DataFrame.
4. Inspect the dataset for:
   - Column names and types.
   - Missing values.
   - Duplicate entries.
5. Retain only the necessary columns for analysis, such as `Text` (review text) and `Score` (rating).
6. Display basic statistics and a few sample rows to understand the data.


In [None]:
# Step 1: Install the Kaggle API client
!pip install kaggle

# Step 2: Upload the Kaggle API Token (kaggle.json)
from google.colab import files
files.upload()  # This will prompt you to upload the kaggle.json file

# Step 3: Move kaggle.json to the correct location
import os
os.makedirs('/root/.kaggle', exist_ok=True)
!cp kaggle.json /root/.kaggle/

# Step 4: Set the correct permissions for the Kaggle API token
!chmod 600 /root/.kaggle/kaggle.json

# Step 5: Download the dataset from Kaggle (identifier taken from the Kaggle URL in Step 1)
!kaggle datasets download -d snap/amazon-fine-food-reviews

# Step 6: Extract the downloaded dataset
import zipfile

# The archive name assumes the Kaggle CLI default of naming the zip after the dataset slug
with zipfile.ZipFile('amazon-fine-food-reviews.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/')

# Step 7: Import necessary libraries for analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
import re
from wordcloud import WordCloud

# Step 8: Load the dataset
df = pd.read_csv('/content/Reviews.csv')  # Replace with the path where the CSV file is extracted

# Step 9: Display dataset structure (df.info() prints directly and returns None)
print("Dataset Structure:")
df.info()

# Step 10: Display first few rows
print("\nSample Rows:\n", df.head())

# Step 11: Check for missing and duplicate values
print("\nMissing Values:\n", df.isnull().sum())
print("\nDuplicate Rows:", df.duplicated().sum())

# Step 12: Retain only relevant columns (Text and Score)
df = df[['Text', 'Score']]

# Step 13: Display basic statistics
print("\nDataset Statistics:\n", df.describe())

# Step 14: Display cleaned dataset shape
print("\nDataset Shape (after retaining relevant columns):", df.shape)




## Step 2: Preprocess the Data

### Objectives
- Clean and prepare the text data for analysis.
- Remove unnecessary information like punctuation, numbers, and stopwords.
- Standardize the text to lowercase for uniformity.

### Steps
1. **Handle Missing Values**: Remove any rows with missing or null values.
2. **Remove Duplicate Reviews**: Drop duplicate rows to ensure data integrity.
3. **Text Cleaning**:
   - Remove punctuation and special characters.
   - Remove numerical digits.
   - Convert all text to lowercase.
   - Remove common stopwords to focus on meaningful words.
4. **Add a Cleaned Column**: Save the preprocessed text as a new column for further analysis.

### Tools Used
- Python `re` module for regular expressions.
- NLTK library for natural language processing tasks (stopwords removal).
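
The cleaning steps listed above can be tried on a single review before running them on the full dataset. The sketch below is illustrative only: `clean_text_demo` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's English stopword list), and punctuation is replaced with a space rather than removed so that adjacent words do not merge.

```python
import re

# Hypothetical mini stopword list for illustration; the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "is", "and", "i", "it", "this", "a", "of", "in"}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    text = re.sub(r"\d+", " ", text)      # digits -> space
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

sample = "This is the BEST coffee I bought in 2012!!! Great taste, and cheap."
cleaned = clean_text_demo(sample)
print(cleaned)  # best coffee bought great taste cheap
```

Lowercasing first means the stopword comparison does not need to handle mixed case.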


In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already available
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define a function to clean text
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.lower()  # Convert to lowercase
    text = " ".join(word for word in text.split() if word not in stop_words)  # Remove stopwords
    return text

# Drop missing values first so clean_text never receives NaN, then drop duplicates
df = df.dropna().drop_duplicates().copy()

# Apply the cleaning function to the dataset
df['Cleaned_Text'] = df['Text'].apply(clean_text)

# Check the first few rows of the cleaned text
print(df[['Text', 'Cleaned_Text']].head())


## Step 3: Exploratory Data Analysis (EDA)

### 1. Analyze Sentiment Distribution
- Visualize the distribution of positive and negative reviews using bar plots.
- Understand the dataset balance to decide if resampling is necessary.

### 2. Most Common Words
- Identify the most frequent words in positive and negative reviews.
- Visualize using word clouds for a quick overview of the commonly used terms.

### 3. Review Length Analysis
- Calculate and plot the distribution of review lengths.
- Compare review lengths for positive and negative sentiments to identify patterns.

### 4. Correlation with Scores
- Explore if there are patterns or correlations between review length and sentiment.

### 5. Missing and Duplicate Values
- Check the dataset for missing values or duplicates.
- Remove duplicates or handle missing values if any.
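
The EDA below relies on a `Sentiment` label, which has to be derived from `Score`. The sketch shows one way to do that on a toy DataFrame and then compute the class balance and average review length per class. The labeling rule (4-5 stars positive, 1-2 stars negative, 3 stars dropped) is an assumption, and the `toy` data is made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real Reviews.csv columns.
toy = pd.DataFrame({
    "Text": ["great tea", "awful stale chips", "okay I guess", "love it", "terrible"],
    "Score": [5, 1, 3, 4, 2],
})

# Assumed rule: drop neutral 3-star reviews, then label by score.
toy = toy[toy["Score"] != 3].copy()
toy["Sentiment"] = toy["Score"].apply(lambda s: "Positive" if s > 3 else "Negative")

# Class balance as proportions, and mean character length per class.
balance = toy["Sentiment"].value_counts(normalize=True)
toy["Review_Length"] = toy["Text"].str.len()
mean_length = toy.groupby("Sentiment")["Review_Length"].mean()

print(balance)
print(mean_length)
```

If the proportions are far from even, stratified splitting or resampling is worth considering before training.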


In [None]:
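# Derive the binary Sentiment label used below. The notebook never defines it,
# so this rule is an assumption: scores 4-5 are Positive, 1-2 are Negative,
# and neutral 3-star reviews are dropped.
df = df[df['Score'] != 3].copy()
df['Sentiment'] = np.where(df['Score'] > 3, 'Positive', 'Negative')

# Visualize sentiment distribution
sns.countplot(data=df, x='Sentiment', palette='viridis')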
plt.title('Distribution of Sentiments')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

# Generate word clouds for positive and negative reviews
positive_text = " ".join(df[df['Sentiment'] == 'Positive']['Cleaned_Text'])
negative_text = " ".join(df[df['Sentiment'] == 'Negative']['Cleaned_Text'])

wordcloud_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_text)
wordcloud_negative = WordCloud(width=800, height=400, background_color='white').generate(negative_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.title("Positive Reviews Word Cloud")
plt.axis("off")
plt.show()

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.title("Negative Reviews Word Cloud")
plt.axis("off")
plt.show()

# Review length analysis
df['Review_Length'] = df['Cleaned_Text'].apply(len)

sns.histplot(data=df, x='Review_Length', hue='Sentiment', kde=True, bins=50)
plt.title('Distribution of Review Lengths by Sentiment')
plt.xlabel('Review Length')
plt.ylabel('Count')  # histplot's default statistic is a count, not a density
plt.show()

# Check for missing and duplicate values
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Duplicate entries: {df.duplicated().sum()}")

# Correlation between review length and sentiment
positive_length = df[df['Sentiment'] == 'Positive']['Review_Length']
negative_length = df[df['Sentiment'] == 'Negative']['Review_Length']

print(f"Average length of positive reviews: {positive_length.mean()}")
print(f"Average length of negative reviews: {negative_length.mean()}")


## Step 4: Train-Test Split
- Divide the dataset into training and testing sets.
- Use an 80-20 split where 80% of the data is used for training the model, and 20% is reserved for testing.
- Ensure the split maintains a balance between the positive and negative sentiments (stratified sampling if necessary).
- Prepare `X` (features) and `y` (target variable) from the preprocessed data.
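
A minimal sketch of the split described above, on a toy corpus. It assumes the text is split first and the TF-IDF vectorizer is fitted on the training portion only, so no vocabulary or document-frequency information from the test set reaches the model. The example texts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up reviews: 10 positive and 10 negative.
positive = ["great coffee", "love this tea", "tasty snack",
            "fresh and delicious", "excellent value"] * 2
negative = ["stale chips", "awful taste", "bad packaging",
            "terrible smell", "waste of money"] * 2
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# 80-20 split; stratify keeps the class ratio identical in both parts.
train_text, test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit on training text only, then reuse the fitted vocabulary for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

print(X_train.shape, X_test.shape, sum(y_test), len(y_test))
```

Both matrices stay sparse, which matters on the full dataset where a dense array may not fit in memory.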


In [None]:
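from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Define raw text features and the encoded target
X_text = df['Cleaned_Text']
y = df['Sentiment'].map({'Positive': 1, 'Negative': 0})  # Encode target variable

# Split the data into training and testing sets (80-20 split, stratified on the label)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

# Fit TF-IDF on the training text only to avoid leaking test data, and keep the
# matrices sparse (calling .toarray() on this dataset can exhaust memory).
# max_features=5000 is an assumed cap; the original cell never defined the vectorizer.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)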

# Check the sizes of the resulting datasets
print("Training Set Size:", X_train.shape)
print("Testing Set Size:", X_test.shape)


## Step 5: Model Building

- Train a supervised machine learning model to classify reviews as positive or negative.
- Use the following models:
  - **Logistic Regression**: A simple yet effective baseline classifier.
  - **Naïve Bayes**: Effective for text classification tasks.
  - **Random Forest**: A robust ensemble method.
- Perform hyperparameter tuning to optimize the models.

### Substeps:
1. Train the model using the training data.
2. Predict sentiment on the test dataset.
3. Compare the performance of different models using evaluation metrics.
4. Select the best-performing model based on metrics like accuracy, precision, recall, and F1-score.
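
The hyperparameter tuning mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the notebook's own procedure: the toy corpus, the grid of `C` values, the 2-fold cross-validation, and the F1 scoring choice are all illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus; in the notebook this would be the cleaned review text and labels.
texts = ["great coffee love taste", "excellent tea fresh", "tasty snack great price",
         "love flavor fresh", "wonderful treat great"] * 2 + [
         "stale chips bad taste", "awful smell terrible", "bad packaging broken",
         "terrible flavor stale", "horrible waste awful"] * 2
labels = [1] * 10 + [0] * 10

# Putting the vectorizer inside the pipeline refits TF-IDF on each training fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to `MultinomialNB` (tuning `alpha`) or `RandomForestClassifier` (tuning `n_estimators` and `max_depth`) by swapping the `clf` step and the grid keys.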


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Logistic Regression (max_iter raised from the default 100 so the solver can converge on TF-IDF features)
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
log_pred = log_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_pred))
print("Logistic Regression Report:\n", classification_report(y_test, log_pred))

# Naïve Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)
print("Naïve Bayes Accuracy:", accuracy_score(y_test, nb_pred))
print("Naïve Bayes Report:\n", classification_report(y_test, nb_pred))

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
print("Random Forest Report:\n", classification_report(y_test, rf_pred))

## Step 6: Model Evaluation

After training the model, it is essential to evaluate its performance using various metrics. In this case, we will use classification metrics such as **Accuracy**, **Precision**, **Recall**, **F1-Score**, and the **Confusion Matrix**.

### 1. **Accuracy**
Accuracy measures the proportion of correct predictions (both positive and negative) out of the total predictions made by the model.

### 2. **Precision and Recall**
- **Precision** measures how many of the predicted positive reviews are actually positive.
- **Recall** measures how many of the actual positive reviews the model successfully predicted.

### 3. **F1-Score**
The **F1-Score** is the harmonic mean of Precision and Recall, and it gives a balanced measure of both.

### 4. **Confusion Matrix**
The confusion matrix helps visualize the performance of the classification model. It shows how many actual positive and negative samples are correctly or incorrectly classified by the model.

We will use the following functions and metrics:
- `accuracy_score()`: to calculate the accuracy of the model.
- `classification_report()`: to generate the classification report with Precision, Recall, and F1-Score.
- `confusion_matrix()`: to generate and visualize the confusion matrix.
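
To make the definitions above concrete, the metrics can be computed by hand from the four confusion-matrix counts. The counts below are invented for illustration and are not results from this notebook; `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Invented counts: 40 TP, 30 TN, 10 FP, 20 FN.
accuracy, precision, recall, f1 = metrics_from_counts(tp=40, tn=30, fp=10, fn=20)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.700 precision=0.800 recall=0.667 f1=0.727
```

Here precision is higher than recall because the model makes few false-positive errors but misses a third of the actual positives.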

---


In [None]:
# Model evaluation

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Predicting on the test set with the Logistic Regression model trained in Step 5
y_pred = log_model.predict(X_test)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Classification report: Precision, Recall, F1-Score
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Optionally, print out individual components of the confusion matrix
TP = cm[1, 1]  # True Positive
TN = cm[0, 0]  # True Negative
FP = cm[0, 1]  # False Positive
FN = cm[1, 0]  # False Negative

print(f"True Positive: {TP}")
print(f"True Negative: {TN}")
print(f"False Positive: {FP}")
print(f"False Negative: {FN}")

## Step 7: Visualize Results

After training and evaluating the model, we will visualize the performance of our model through various plots to better understand the results.

### 1. Confusion Matrix
The confusion matrix is a great way to visualize how well the classifier performed. It shows the true positives, true negatives, false positives, and false negatives.

- **True Positive (TP)**: Correctly predicted positive reviews.
- **True Negative (TN)**: Correctly predicted negative reviews.
- **False Positive (FP)**: Negative reviews incorrectly predicted as positive.
- **False Negative (FN)**: Positive reviews incorrectly predicted as negative.

We will use a heatmap to visualize the confusion matrix.

### 2. Classification Report
The classification report will display precision, recall, and F1-score for both classes (positive and negative).

### 3. Word Clouds
Word clouds are a great tool for visualizing the most frequent words used in the reviews for both positive and negative sentiments. We will generate two separate word clouds:
- One for **Positive Reviews**
- One for **Negative Reviews**
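
A word cloud is a picture of word frequencies, so the same information can be checked numerically. The sketch below counts the most frequent words per class with the standard library; the example reviews and the `top_words` helper are made up for illustration.

```python
from collections import Counter

def top_words(texts, n=2):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)

# Made-up cleaned reviews.
positive_reviews = ["great coffee great price", "love coffee", "great taste"]
negative_reviews = ["stale chips", "stale taste bad", "stale packaging bad"]

top_positive = top_words(positive_reviews)
top_negative = top_words(negative_reviews)
print(top_positive)  # [('great', 3), ('coffee', 2)]
print(top_negative)  # [('stale', 3), ('bad', 2)]
```

Words that rank highly in both classes carry little sentiment signal and are candidates for an extended stopword list.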

---

### Visualizations:
1. **Confusion Matrix Heatmap**
2. **Classification Report** (printed as text)
3. **Word Cloud for Positive Reviews**
4. **Word Cloud for Negative Reviews**


In [None]:
# Import additional libraries for visualizations
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report
from wordcloud import WordCloud

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# 2. Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

# 3. Word Cloud for Positive Reviews
positive_text = " ".join(df[df['Sentiment'] == 'Positive']['Cleaned_Text'])
wordcloud_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.title("Positive Reviews Word Cloud")
plt.axis("off")
plt.show()

# 4. Word Cloud for Negative Reviews
negative_text = " ".join(df[df['Sentiment'] == 'Negative']['Cleaned_Text'])
wordcloud_negative = WordCloud(width=800, height=400, background_color='white').generate(negative_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.title("Negative Reviews Word Cloud")
plt.axis("off")
plt.show()


# Conclusion

In this analysis, we have successfully implemented a sentiment analysis model on the **Amazon Fine Food Reviews** dataset. The key steps involved:

1. **Data Preprocessing**:
   - The dataset was cleaned by removing missing values and duplicates.
   - Text data was processed using text normalization techniques such as removing punctuation, numbers, and stopwords, and converting text to lowercase.

2. **Feature Extraction**:
   - The text data was converted into numerical form using **TF-IDF vectorization**. This helped capture important features from the reviews and prepare the data for model training.

3. **Model Training**:
   - **Logistic Regression**, **Naïve Bayes**, and **Random Forest** models were trained to classify reviews as **positive** or **negative**.
   - The Logistic Regression model was then evaluated in detail using **accuracy**, **precision**, **recall**, **F1-score**, and the **confusion matrix**.

4. **Model Evaluation**:
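   - The Logistic Regression model reached high accuracy on the test set (the exact figure is printed by the evaluation cell). The classification report and confusion matrix show how well it distinguishes positive from negative reviews, which matters because accuracy alone can hide weak performance on the less frequent class.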

5. **Visualization**:
   - Word clouds were generated to visualize the most frequently occurring words in positive and negative reviews. This allowed us to observe common themes and sentiments in both types of reviews.

### Key Findings:
- The **Logistic Regression** model performed well in classifying the reviews with high accuracy.
- The word clouds highlighted key terms associated with positive and negative sentiments, which can be further explored for better feature engineering or model improvement.

In future work, other models such as **SVM** or **Neural Networks** could be explored to potentially improve accuracy, alongside the Naïve Bayes and Random Forest models already trained in Step 5. Additionally, **Hyperparameter Tuning** and **Cross-Validation**, which Step 5 lists but the code does not yet perform, may be applied to further optimize model performance.


In [None]:
# Conclusion: Evaluating Model Performance
# Here, we visualize the results and discuss key findings from the sentiment analysis model.

# Displaying accuracy score
print("Model Accuracy:", accuracy_score(y_test, y_pred))

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix Plot
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Word Cloud visualization for Positive Reviews
positive_text = " ".join(df[df['Sentiment'] == 'Positive']['Cleaned_Text'])
wordcloud_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.title("Positive Reviews Word Cloud")
plt.axis("off")
plt.show()

# Word Cloud visualization for Negative Reviews
negative_text = " ".join(df[df['Sentiment'] == 'Negative']['Cleaned_Text'])
wordcloud_negative = WordCloud(width=800, height=400, background_color='white').generate(negative_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.title("Negative Reviews Word Cloud")
plt.axis("off")
plt.show()

# Conclusion: Provide suggestions for future improvements and model optimization
print("Conclusion: The Logistic Regression model performed well in classifying reviews.")
print("Further improvements could include exploring other models such as SVM or Neural Networks.")


# Summary

This project involves sentiment analysis on the **Amazon Fine Food Reviews** dataset. The key steps were:

1. **Data Preprocessing**: Cleaned the dataset by removing missing values, duplicates, and performing text normalization (lowercasing, removing punctuation, numbers, and stopwords).
2. **Feature Extraction**: Used **TF-IDF vectorization** to convert text data into numerical form.
3. **Model Training**: Trained **Logistic Regression**, **Naïve Bayes**, and **Random Forest** models to classify reviews as **positive** or **negative**.
4. **Evaluation**: Evaluated model performance using **accuracy**, **precision**, **recall**, **F1-score**, and the **confusion matrix**.
5. **Visualization**: Generated **word clouds** to visualize common words in positive and negative reviews.

## Key Findings:
- The Logistic Regression model performed well, classifying reviews with good accuracy.
- Word clouds helped highlight significant terms associated with sentiment.

## Future Improvements:
- Explore other models such as **SVM** or **Neural Networks** to improve accuracy.
- Implement **Hyperparameter Tuning** and **Cross-Validation** for better model performance.
