# 📊 Sentiment Analysis of Amazon Fine Food Reviews

This notebook explores the Amazon Fine Food Reviews dataset with exploratory data analysis (EDA), labels each review's sentiment from its star rating, and prepares the data for later NLP modeling.


## 🎯 Project Objectives

- Understand the distribution of review scores and sentiments  
- Clean and preprocess the dataset for ML modeling  
- Label reviews as **positive** or **negative**  
- Visualize key insights for use in a Power BI dashboard  


## 📁 Dataset Information

- **Source**: Amazon Fine Food Reviews (Kaggle)
- **File**: `Reviews.csv`
- **Original Columns Used**: `Time`, `Score`, `Text`, `Summary`, `ProductId`, `HelpfulnessNumerator`, `HelpfulnessDenominator`


In [None]:
import pandas as pd

# Step 1: Load the dataset from the same folder
df = pd.read_csv('Reviews.csv')

# Step 2: Basic info to confirm it's loaded
print("✅ Dataset loaded successfully!")
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())

## 🧹 Data Cleaning & Sentiment Labeling

We'll keep only relevant columns, convert timestamps, and create a new sentiment label based on review scores:

- **Score < 3** → Negative  
- **Score > 3** → Positive  
- **Score = 3** → Neutral (will be removed)


In [None]:
from tabulate import tabulate
import matplotlib.pyplot as plt
import seaborn as sns

original_shape = df.shape
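# Keep only the columns used downstream; .copy() avoids SettingWithCopyWarning on later assignments
df = df[['Time', 'Score', 'Text', 'Summary', 'ProductId', 'HelpfulnessNumerator', 'HelpfulnessDenominator']].copy()
df = df.dropna()  # drop rows with a missing value in any kept column

# Convert Unix timestamps (seconds) to datetime
df['Time'] = pd.to_datetime(df['Time'], unit='s')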

# Sentiment labeling
def label_sentiment(score):
    if score < 3:
        return 'negative'
    elif score > 3:
        return 'positive'
    else:
        return 'neutral'

df['sentiment'] = df['Score'].apply(label_sentiment)
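# Drop neutral reviews; .copy() so later column assignments do not raise SettingWithCopyWarning
df = df[df['sentiment'] != 'neutral'].copy()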
cleaned_shape = df.shape

# Save cleaned data
df.to_csv('cleaned_reviews.csv', index=False)

## 📊 Dataset Cleaning Summary

In [None]:
summary_table = [
    ['Original Rows', original_shape[0]],
    ['Original Columns', original_shape[1]],
    ['Rows after cleaning', cleaned_shape[0]],
    ['Columns after cleaning', cleaned_shape[1]],
    ['Rows dropped (missing values + neutral)', original_shape[0] - cleaned_shape[0]]
]

print(tabulate(summary_table, headers=['Metric', 'Value'], tablefmt='fancy_grid'))
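
The difference between the original and cleaned row counts combines two effects: rows removed for missing values and rows removed as neutral. A minimal sketch of counting them separately is shown here on a small made-up frame; the `toy` data and the variable names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the reviews table (hypothetical values, illustration only)
toy = pd.DataFrame({
    'Score': [5, 1, 3, 4, 2, 3],
    'Text': ['great', 'bad', 'ok', None, 'poor', 'meh'],
})

rows_start = len(toy)

# Step 1: remove rows with any missing value
no_missing = toy.dropna()
rows_after_dropna = len(no_missing)

# Step 2: remove neutral (3-star) reviews
labeled = no_missing[no_missing['Score'] != 3]
rows_after_neutral = len(labeled)

dropped_missing = rows_start - rows_after_dropna
dropped_neutral = rows_after_dropna - rows_after_neutral

print('Dropped for missing values:', dropped_missing)
print('Dropped as neutral:', dropped_neutral)
```

Applying the same two-step count to `df` would mean recording the row count right after `dropna()` and again after the neutral filter.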

## 📈 Sentiment Distribution

In [None]:
sentiment_count = df['sentiment'].value_counts().reset_index()
sentiment_count.columns = ['Sentiment', 'Count']

print(tabulate(sentiment_count.values, headers=sentiment_count.columns.tolist(), tablefmt='fancy_grid'))
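
Raw counts can hide how imbalanced the two classes are, which matters for later modeling. One way to see the shares, sketched here on a made-up label column (the `toy_labels` series is an assumption for illustration), is `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical labels standing in for df['sentiment']
toy_labels = pd.Series(['positive', 'positive', 'positive', 'negative'])

# normalize=True returns fractions of the total instead of raw counts
shares = toy_labels.value_counts(normalize=True)

print((shares * 100).round(1))
```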

## 📊 Visualizations

### ✅ Sentiment Count


In [None]:
sns.countplot(data=df, x='sentiment', palette='Set2')
plt.title('Sentiment Distribution')
plt.savefig('sentiment_distribution.png')
plt.show()

### ⭐ Review Score Distribution

In [None]:
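# Scores are integers, so draw one bar per rating (a KDE curve would suggest a continuous scale);
# 3-star reviews were removed during cleaning, so that bar is empty
sns.histplot(df['Score'], discrete=True, color='orange')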
plt.title('Review Score Distribution')
plt.savefig('score_distribution.png')
plt.show()

### 📅 Monthly Review Count

In [None]:
# Calendar month (1-12), pooled across all years in the dataset
df['Month'] = df['Time'].dt.month
df['Month'].value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.title("Review Count by Month")
plt.xlabel("Month")
plt.ylabel("Reviews")
plt.savefig('reviews_by_month.png')
plt.show()
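
Grouping by `dt.month` pools the same calendar month from different years into one bar, which shows seasonality but not growth over time. If a chronological trend is wanted, one option is to group by year-month periods instead. The sketch below uses a few made-up Unix timestamps (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds, in the same format as the `Time` column
toy = pd.DataFrame({'Time': [1303862400, 1304467200, 1335484800, 1336089600, 1336694400]})
toy['Time'] = pd.to_datetime(toy['Time'], unit='s')

# Calendar month only: April 2011 and April 2012 fall into the same bucket
by_calendar_month = toy['Time'].dt.month.value_counts().sort_index()

# Year-month periods keep the years apart, giving a chronological series
by_year_month = toy['Time'].dt.to_period('M').value_counts().sort_index()

print(by_calendar_month)
print(by_year_month)
```

Calling `.plot(kind='line')` on the year-month series would give a time trend rather than a seasonal profile.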

## 🔚 Conclusion & Next Steps

- Data is cleaned, labeled, and saved as `cleaned_reviews.csv`
- Visualizations cover the sentiment split, the score distribution, and review counts by calendar month
- This data is now ready for:
  - Machine Learning modeling (TF-IDF, Logistic Regression)
  - Power BI dashboard integration
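
As a rough illustration of the modeling step named above, the following sketch chains TF-IDF features into a logistic regression classifier using scikit-learn. The toy `texts` and `labels` lists are made up for illustration; in practice they would come from the `Text` and `sentiment` columns of `cleaned_reviews.csv`. This is a minimal sketch, not a tuned model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the `Text` and `sentiment` columns
texts = [
    'delicious and fresh, would buy again',
    'great taste and fast delivery',
    'tasty snack, my kids love it',
    'stale and awful, waste of money',
    'terrible flavor, arrived damaged',
    'bad quality, very disappointed',
]
labels = ['positive', 'positive', 'positive', 'negative', 'negative', 'negative']

# The pipeline turns raw text into TF-IDF vectors, then fits the classifier on them
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

preds = model.predict(['fresh and tasty', 'awful and stale'])
print(preds)
```

On the full dataset it would be sensible to hold out a test split (for example with `train_test_split`) and to check the class balance before trusting accuracy figures.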

---
👨‍💻 Project by **Shaurya Verma**  
B.Tech CSE | LPU | Data Science Enthusiast
