**Exploratory Data Analysis (EDA) on Beer Ratings Dataset**

This script performs initial data exploration and visualization on a beer reviews dataset 
downloaded via KaggleHub. It inspects the structure, content, and quality of the data, 
including missing values, duplicates, and data types. Additionally, it highlights key insights 
such as the most reviewed beers, top beer styles, and breweries, and visualizes distributions 
and correlations among review scores.

## Data loading

In [1]:
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns

# Walk up to the repository root, then add it to sys.path so custom transformers can be imported
while any(marker in os.getcwd() for marker in ('exercises', 'notebooks', 'students', 'research', 'projects')):
    os.chdir("..")
sys.path.append('.')

In [None]:
from dotenv import load_dotenv
import pandas as pd

In [None]:
env_path = 'projects/proj_3_team_5/.env'
load_dotenv(env_path)

In [4]:
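# RAW_DATA_DIR is read from the .env file; it is passed to pd.read_csv, so it must point to the CSV file
df_raw_path = os.getenv('RAW_DATA_DIR')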

In [None]:
df = pd.read_csv(df_raw_path)

In [None]:
df.sample(5)

## Basic Data Inspection

In [None]:
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

In [None]:
print("\nMissing values:\n", df.isnull().sum())

In [None]:
print("\nData types:\n", df.dtypes)

In [None]:
print("\nBasic stats:\n", df.describe(include='all'))

In [None]:
print("\nDuplicates:", df.duplicated().sum())

## Most Reviewed Beers and Styles

In [None]:
top_beers = df['Name'].value_counts().head(10)
print("\nTop beers by number of reviews:\n", top_beers)

top_styles = df['Style'].value_counts().head(10)
print("\nTop styles by number of reviews:\n", top_styles)

## Distribution of Review Scores

In [None]:
rating_cols = ['review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'review_overall']
df[rating_cols].hist(bins=20, figsize=(12, 8))
plt.suptitle("Distribution of Review Scores", fontsize=16)
plt.tight_layout()
plt.show()

## Correlation Heatmap

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(df[rating_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation between Review Scores")
plt.show()

## Top Beer Styles by Rating

In [None]:
style_rating = df.groupby('Style')['review_overall'].mean().sort_values(ascending=False).head(10)
style_rating.plot(kind='bar', figsize=(10, 5), title="Top 10 Beer Styles by Average Overall Rating")
plt.ylabel("Average Overall Rating")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Top Breweries by Number of Reviews


In [None]:
top_breweries = df['Brewery'].value_counts().head(10)
sns.barplot(x=top_breweries.values, y=top_breweries.index, color='orchid')
plt.title("Top 10 Breweries by Number of Reviews")
plt.xlabel("Number of Reviews")
plt.tight_layout()
plt.show()

## Conclusions

**1. Dataset Overview**
- The dataset contains 3,197 beer reviews and 25 features.
- There are no missing values and no duplicate entries, making the data clean and ready for further analysis.
- It includes both categorical (e.g., Name, Style, Brewery) and numerical (e.g., ABV, review_overall) features.
- Some beers and styles appear frequently, which may influence clustering outcomes if not properly normalized.
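One way to quantify how strongly a few beers or styles dominate is to look at each category's share of all rows. The sketch below is only illustrative: it runs on a small made-up frame rather than the real dataset, the `Style` column name is taken from the notebook, and `category_share` is a hypothetical helper, not part of the project code.

```python
import pandas as pd


def category_share(frame: pd.DataFrame, column: str, top_n: int = 3) -> pd.Series:
    """Return the fraction of rows taken up by each of the top_n most frequent values."""
    return frame[column].value_counts(normalize=True).head(top_n)


# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"Style": ["IPA", "IPA", "IPA", "Stout", "Stout", "Lager"]})
shares = category_share(toy, "Style")
print(shares)
```

If a handful of categories account for most rows, any cluster summary grouped by that column will mostly reflect those categories.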

**2. Distribution of Review Scores**
- All review aspects (aroma, appearance, palate, taste, overall) are roughly normally distributed with a slight right skew.
- Most scores cluster between 3.0 and 4.5, indicating generally positive reviews.
- `review_overall` scores are slightly more concentrated, mostly between 3.5 and 4.2, suggesting that reviewers agree more closely on overall quality than on individual aspects.

**3. Correlation Between Review Scores**
- Several pairs of attributes show very high positive correlations:
   - review_palate ↔ review_taste: 0.95
   - review_taste ↔ review_overall: 0.94
   - review_palate ↔ review_overall: 0.92
- This suggests multicollinearity, which should be addressed before clustering (for example, through PCA or by removing redundant features).
- `review_appearance` shows a slightly weaker correlation with the overall rating, so it may offer more unique information for clustering.
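To make the multicollinearity check repeatable, the strongly correlated pairs can be listed programmatically instead of read off the heatmap. The following is a minimal sketch under stated assumptions: `high_corr_pairs` is a hypothetical helper, the 0.9 threshold is an arbitrary choice, and the data is synthetic (it only reuses the notebook's column names).

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """List (col_a, col_b, r) for column pairs whose |Pearson r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Synthetic scores: taste and palate share a common signal, appearance is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(3.8, 0.4, size=500)
toy = pd.DataFrame({
    "review_taste": base + rng.normal(0, 0.05, size=500),
    "review_palate": base + rng.normal(0, 0.05, size=500),
    "review_appearance": rng.normal(3.8, 0.4, size=500),
})
pairs = high_corr_pairs(toy, threshold=0.9)
print(pairs)
```

On the real data, the same call could be applied to `df[rating_cols]`; one column from each reported pair is then a candidate for removal.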

**4. Top 10 Beer Styles by Average Overall Rating**

The highest-rated styles include:
- IPA - New England
- Wild Ale
- Stout - American Imperial

These styles could potentially form distinct clusters, as they may differ significantly in attributes like ABV, taste profile, and consumer preferences.

**5. Key Insights**
- Features such as `review_taste`, `review_palate`, and `review_overall` are highly correlated, so dimensionality reduction may help improve clustering quality.
- Clustering should take into account that some beers or styles dominate the dataset, which could lead to imbalanced clusters.
- Standardization or normalization of features (especially review scores and ABV) will be crucial to avoid bias in distance-based algorithms like K-Means.
- Including or excluding categorical features like Style or Brewery should be carefully considered based on the clustering goal (e.g., grouping by taste vs. origin).
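The standardization and dimensionality-reduction points above can be sketched as follows. This is an illustration under assumptions, not the project's pipeline: the data is synthetic, `standardize` and `pca_explained_variance` are hypothetical helpers, and PCA is computed with a plain NumPy SVD rather than any particular library implementation.

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score each column: subtract the mean and divide by the standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca_explained_variance(X_std: np.ndarray) -> np.ndarray:
    """Explained-variance ratio per principal component, via SVD of centered data."""
    _, singular_values, _ = np.linalg.svd(X_std, full_matrices=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic stand-in: two correlated score columns plus an ABV-like column on a larger scale.
rng = np.random.default_rng(42)
signal = rng.normal(0, 1, size=300)
X = np.column_stack([
    3.8 + 0.4 * signal + rng.normal(0, 0.05, size=300),  # taste-like score
    3.7 + 0.4 * signal + rng.normal(0, 0.05, size=300),  # palate-like score
    rng.normal(6.5, 2.0, size=300),                      # ABV-like feature
])
X_std = standardize(X)
ratios = pca_explained_variance(X_std)
print(ratios)
```

After z-scoring, the ABV-like column no longer dominates distances because of its larger scale, and the two correlated score columns collapse mostly into a single component, which is the behavior the bullets above anticipate for the real review scores.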