# Loading the "Fine Food Reviews" Dataset

In this notebook I start exploring the data and prepare it for further preprocessing.

## Dataset Overview

The shape of the dataset (rows, columns) gives an initial sense of its size.
I will preview the first few rows to understand the structure of the data, including the review text and rating fields.

In [7]:
import pandas as pd

df = pd.read_csv("../data/FineFoodReviews.csv")

df.shape

(568454, 10)

In [8]:
# Print column names
print(df.columns)

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')


In [10]:
# Print first few rows of the dataframe
print(df.head())

   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0                       0      5  1350777600   

                 Summary                                               Text  
0  Good Quality Dog Food  I have bought several of the Vitality canned d...  
1 

## Score Distribution

I will use the "Score" column to sort the reviews into sentiment categories.
The resulting distribution shows how many reviews fall into each category.
To prepare the dataset for classification, I convert the 1–5 rating into:

- 1–2 → negative  
- 3 → neutral  
- 4–5 → positive  

This creates a three-class sentiment problem. It is not balanced yet: the counts below show that positive reviews far outnumber the other two classes.

In [11]:
def score_to_sentiment(score):
    if score <= 2:
        return 'negative'
    elif score == 3:
        return 'neutral'
    else:
        return 'positive'
    
df["Sentiment"] = df["Score"].apply(score_to_sentiment)
df["Sentiment"].value_counts()

Sentiment
positive    443777
negative     82037
neutral      42640
Name: count, dtype: int64
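The same mapping can also be sketched without a row-by-row `apply`, using `pd.cut` to bin the scores in one vectorised step. This is only an illustrative alternative, not the approach used above: the `toy` frame and its values are made up for the example, and the bin edges assume scores are integers from 1 to 5.

```python
import pandas as pd

# Hypothetical toy frame standing in for the full dataset
toy = pd.DataFrame({"Score": [1, 2, 3, 4, 5, 5]})

# Right-closed bins: (0, 2] -> negative, (2, 3] -> neutral, (3, 5] -> positive
toy["Sentiment"] = pd.cut(
    toy["Score"],
    bins=[0, 2, 3, 5],
    labels=["negative", "neutral", "positive"],
)

# Share of each class as a fraction of all rows
shares = toy["Sentiment"].value_counts(normalize=True)
print(shares)
```

Passing `normalize=True` to `value_counts` reports fractions instead of raw counts, which makes the degree of class imbalance easier to read at a glance.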

## Balancing the Dataset

The dataset is heavily skewed towards positive reviews (about 78% of all rows), so I will downsample it to an equal number of reviews per class.
The smallest class is "neutral" with 42,640 reviews, which caps a balanced sample at 42,640 per class. I take a round 40,000 from each category, which keeps nearly all of the available neutral reviews while staying balanced.


In [12]:
# Creating a balanced subset with 40k per class
balanced_df = (
    df.groupby("Sentiment")
    .sample(n=40000, random_state=42)
    .reset_index(drop=True)
)

balanced_df["Sentiment"].value_counts()

Sentiment
negative    40000
neutral     40000
positive    40000
Name: count, dtype: int64
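If a round number is not needed, the per-class sample size could be derived from the data instead of being hard-coded. The sketch below shows that idea on a made-up frame; `toy`, `n_per_class`, and `balanced_toy` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame with an imbalanced label column
toy = pd.DataFrame({
    "Sentiment": ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 2,
    "Text": [f"review {i}" for i in range(11)],
})

# The rarest class sets the upper bound for a balanced sample
n_per_class = int(toy["Sentiment"].value_counts().min())

# Draw the same number of rows from every class, reproducibly
balanced_toy = (
    toy.groupby("Sentiment")
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)

print(balanced_toy["Sentiment"].value_counts())
```

Applied to the real data, this would keep every neutral review rather than 40,000 of them, at the cost of a less tidy class size.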