# Vivino Wine Analysis

## Project Overview
This notebook contains the analysis for the Vivino wine dataset. The project is structured as follows:
1. **Exploratory Data Analysis & Unsupervised Exploration**
2. **Data preprocessing, preparation & train-val-test splits**
3. **Baseline results with basic Linear & Ensemble Models**

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Configuration
%matplotlib inline
sns.set_theme(style="whitegrid")
pd.set_option('display.max_columns', None)

In [None]:
# Load Data
DATA_PATH = 'data/25-11-2025.csv'
try:
    df = pd.read_csv(DATA_PATH)
except FileNotFoundError:
    print(f"Error: File not found. Please check the path '{DATA_PATH}'")
    raise  # Stop here: every later cell depends on df being defined
print("Data loaded successfully!")
display(df.head())
print(f"Dataset shape: {df.shape}")

## 1. Exploratory Data Analysis & Unsupervised Exploration
In this section, we explore the dataset to understand the distributions of the variables, their correlations, and potential outliers. Unsupervised exploration (e.g., clustering) can follow where the features support it.
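
As an illustration of the unsupervised exploration mentioned above, here is a minimal clustering sketch. It runs on synthetic data, not the Vivino CSV: the `toy` frame, its values, and the choice of `k=2` are assumptions made only so the example runs on its own. The column names `acidity` and `tannin` are borrowed from the feature list used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two taste columns; real values would come from df.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "acidity": np.concatenate([rng.normal(2.0, 0.2, 50), rng.normal(4.0, 0.2, 50)]),
    "tannin": np.concatenate([rng.normal(1.5, 0.2, 50), rng.normal(3.5, 0.2, 50)]),
})

# Standardize so that no single column dominates the Euclidean distances.
scaled = StandardScaler().fit_transform(toy)

# k=2 is an arbitrary choice for this toy data, not a recommendation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(scaled)

# Per-cluster means give a quick profile of each group.
print(toy.groupby("cluster")[["acidity", "tannin"]].mean())
```

On the real data, the number of clusters would need to be chosen with care, for example by comparing inertia or silhouette scores across several values of `k`.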

In [None]:
# EDA: Visualizations

# 1. Distribution of the target variable 'rating'
plt.figure(figsize=(10, 6))
sns.histplot(df['rating'], bins=20, kde=True)
plt.title('Distribution of Wine Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

# 2. Price vs Rating
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='price', y='rating', alpha=0.5)
plt.title('Price vs Rating')
plt.xlabel('Price')
plt.ylabel('Rating')
plt.xscale('log') # Log scale for price as it can be skewed
plt.show()

# 3. Correlation Matrix (Numeric features)
numeric_cols = df.select_dtypes(include=[np.number]).columns
plt.figure(figsize=(12, 10))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

df.info()
df.describe()
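
Alongside `df.info()` and `df.describe()`, a per-column summary of missing values and cardinality can help decide how to impute and encode each feature. The sketch below uses a small hand-made frame; the `toy` data and the `summary` name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hand-made example rows; the real notebook would use df instead of toy.
toy = pd.DataFrame({
    "vintage": ["2015", "N.V.", "2019", None],
    "price": [12.5, 30.0, np.nan, 8.9],
    "country": ["France", "Italy", "France", "Spain"],
})

# One row per column: dtype, fraction of missing values, number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing_frac": toy.isna().mean(),
    "n_unique": toy.nunique(),
})
print(summary)
```

Columns with a high `n_unique` relative to the row count are the ones where a cap on one-hot categories matters most.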

## 2. Data preprocessing, preparation & train-val-test splits
Here we prepare the data for modeling: handling missing values, encoding categorical variables, scaling features, and splitting the data. The cell below performs a single 80/20 train/test split; a separate validation set is not created yet.

In [None]:
# Preprocessing

# Separate target and features
# We drop 'id' and 'name' as they are identifiers and not useful for prediction
X = df.drop(['rating', 'id', 'name'], axis=1)
y = df['rating']

# Handle 'vintage' column which might contain 'N.V.' or other non-numeric values
# We coerce errors to NaN, so 'N.V.' becomes NaN
X['vintage'] = pd.to_numeric(X['vintage'], errors='coerce')

# Define feature groups based on the schema
numeric_features = ['vintage', 'price', 'acidity', 'intensity', 'sweetness', 'tannin']
categorical_features = ['country', 'winery', 'grapes', 'flavor_rank1', 'flavor_rank2', 'flavor_rank3']

# Create transformers
# Numeric: Impute missing values with median, then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

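# Categorical: Impute missing with 'missing', then OneHotEncode
# max_categories=20 keeps the most frequent categories of each column and groups
# the rest into one 'infrequent' column, which tames high-cardinality columns
# like winery and grapes. Note: sparse_output requires scikit-learn >= 1.2.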
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False, max_categories=20))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

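# Fit the preprocessor on the training split only, then apply it to the test split,
# so imputation medians, scaling statistics and category lists are not learned from test data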
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

print("Training data shape:", X_train.shape)
print("Test data shape:", X_test.shape)
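
The project outline mentions train-val-test splits, while the cell above creates only a train and a test set. One common way to obtain all three is to call `train_test_split` twice. The sketch below is a minimal example on toy arrays; the `_toy` names and the 60/20/20 proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Step 1: hold out 20% of all samples as the test set.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 of the total).
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))  # 60 20 20
```

With such a split, the validation set would be used for model selection and tuning, and the test set would be touched only once for the final estimate.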

## 3. Baseline results with basic Linear & Ensemble Models
We establish baseline performance with two models, Linear Regression and a Random Forest ensemble, and compare them by RMSE and R² on the held-out test set.
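
A possible alternative to transforming the data once up front is to chain the preprocessor and the model in a single `Pipeline` and score it with cross-validation, so that preprocessing is refit inside each fold. The following is a sketch on synthetic data; the `toy` frame, `toy_rating` target, and `toy_preprocessor` are hypothetical stand-ins for the notebook's `X`, `y`, and `preprocessor`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic features and target, loosely shaped like price/country/rating.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "price": rng.lognormal(3.0, 0.5, n),
    "country": rng.choice(["France", "Italy", "Spain"], n),
})
toy_rating = 3.5 + 0.2 * np.log(toy["price"]) + rng.normal(0, 0.1, n)

# Scale the numeric column and one-hot encode the categorical one.
toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])
model = Pipeline([("prep", toy_preprocessor), ("reg", LinearRegression())])

# scikit-learn maximizes scores, so RMSE is reported negated; flip the sign.
scores = cross_val_score(model, toy, toy_rating, cv=5,
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores
print(cv_rmse.mean())
```

The per-fold spread of `cv_rmse` also gives a rough sense of how stable a baseline is, which a single train/test split cannot show.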

In [None]:
# Baseline Models

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Random Forest
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluation
# Compute RMSE as the square root of MSE. This works on any scikit-learn version;
# the `squared=False` shortcut of mean_squared_error is deprecated in recent releases.
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print("Linear Regression RMSE:", rmse_lr)
print("Random Forest RMSE:", rmse_rf)

print("Linear Regression R2:", r2_score(y_test, y_pred_lr))
print("Random Forest R2:", r2_score(y_test, y_pred_rf))
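
RMSE and R² values are easier to interpret next to a trivial reference model. A minimal sketch, assuming synthetic ratings (the `_toy` arrays are made up): `DummyRegressor` with `strategy="mean"` always predicts the training mean, which gives a floor that any useful model should beat.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic ratings; features are irrelevant to a mean predictor, so zeros suffice.
rng = np.random.default_rng(0)
y_train_toy = rng.normal(3.8, 0.3, 200)
y_test_toy = rng.normal(3.8, 0.3, 50)
X_train_toy = np.zeros((200, 1))
X_test_toy = np.zeros((50, 1))

# Always predicts the mean of the training targets.
dummy = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
y_pred_dummy = dummy.predict(X_test_toy)

rmse_dummy = np.sqrt(mean_squared_error(y_test_toy, y_pred_dummy))
r2_dummy = r2_score(y_test_toy, y_pred_dummy)
print("Dummy RMSE:", rmse_dummy)
print("Dummy R2:", r2_dummy)
```

A constant prediction cannot achieve a positive R² on held-out data, so the dummy RMSE is roughly the standard deviation of the target.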

In [None]:
# Visualizing Model Performance

# 1. Actual vs Predicted (Random Forest)
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Rating')
plt.ylabel('Predicted Rating')
plt.title('Random Forest: Actual vs Predicted Ratings')
plt.show()

# 2. Feature Importance (Random Forest)
# Get feature names from the fitted preprocessor (prefixed with 'num__' / 'cat__').
# The encoder above already uses sparse_output, which needs scikit-learn >= 1.2,
# so get_feature_names_out is always available here and no fallback is required.
feature_names = preprocessor.get_feature_names_out()

# Get feature importances
importances = rf.feature_importances_

# Create a DataFrame for visualization
feature_imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_imp_df = feature_imp_df.sort_values(by='Importance', ascending=False).head(20) # Top 20 features

plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_imp_df, palette='viridis')
plt.title('Top 20 Feature Importances (Random Forest)')
plt.show()
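
Impurity-based importances such as `rf.feature_importances_` are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is one commonly used cross-check. The sketch below runs on synthetic data in which only the first column carries signal; all names ending in `_toy`, `_fit` or `_held` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on column 0 only; columns 1 and 2 are noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

# Simple split: first 200 rows to fit, last 100 rows held out.
X_fit, X_held = X_toy[:200], X_toy[200:]
y_fit, y_held = y_toy[:200], y_toy[200:]

rf_toy = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each column in turn on held-out data and measure the drop in score.
result = permutation_importance(rf_toy, X_held, y_held, n_repeats=5, random_state=0)
print(result.importances_mean)
```

In the notebook this would be applied to the fitted `rf` with the transformed `X_test` and `y_test`. Note that with one-hot encoded inputs, each dummy column is permuted separately.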