# Wine Cultivar Origin Prediction System

**Student Name:** Victor Emeka  
**Matric Number:** 23cg034065  
**Algorithm:** Random Forest Classifier  

## Project Overview
This notebook develops a machine learning model that predicts the wine cultivar (origin/class) from chemical properties, using the Wine dataset bundled with scikit-learn (`sklearn.datasets.load_wine`).

### Dataset Information
- **Source:** UCI Machine Learning Repository / sklearn.datasets
- **Classes:** 3 wine cultivars
- **Features:** 13 chemical properties
- **Selected Features (6):** alcohol, malic_acid, ash, total_phenols, flavanoids, color_intensity

## 1. Import Required Libraries

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix
)
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✅ All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 2. Load and Explore the Wine Dataset

In [None]:
# Load the wine dataset
wine_data = load_wine()

# Create DataFrame
df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
df['cultivar'] = wine_data.target

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())
print("\nClass Distribution:")
print(df['cultivar'].value_counts().sort_index())

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print("\nTotal Missing Values:", df.isnull().sum().sum())

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

## 3. Feature Selection

### Selected 6 Features:
1. **alcohol** - Alcohol content
2. **malic_acid** - Malic acid content
3. **ash** - Ash content
4. **total_phenols** - Total phenols
5. **flavanoids** - Flavanoids content
6. **color_intensity** - Color intensity

In [None]:
# Select 6 features as per project requirements
selected_features = [
    'alcohol',
    'malic_acid',
    'ash',
    'total_phenols',
    'flavanoids',
    'color_intensity'
]

# Create feature matrix (X) and target vector (y)
X = df[selected_features]
y = df['cultivar']

print(f"Selected Features: {selected_features}")
print(f"\nFeature Matrix Shape: {X.shape}")
print(f"Target Vector Shape: {y.shape}")
print("\nFeature Data Sample:")
print(X.head())

## 4. Data Visualization

In [None]:
# Visualize class distribution
plt.figure(figsize=(8, 5))
df['cultivar'].value_counts().sort_index().plot(kind='bar', color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
plt.title('Wine Cultivar Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Cultivar Class', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("Class Distribution:")
for cultivar in sorted(y.unique()):
    count = (y == cultivar).sum()
    percentage = (count / len(y)) * 100
    print(f"Cultivar {cultivar}: {count} samples ({percentage:.2f}%)")

In [None]:
# Correlation heatmap for selected features
plt.figure(figsize=(10, 8))
correlation_matrix = X.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

## 5. Data Preprocessing

In [None]:
# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print("Data Split Summary:")
print(f"Training Set: {X_train.shape[0]} samples")
print(f"Testing Set: {X_test.shape[0]} samples")
print("\nTraining Set Class Distribution:")
print(y_train.value_counts().sort_index())
print("\nTesting Set Class Distribution:")
print(y_test.value_counts().sort_index())

In [None]:
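# Feature scaling (standardization). Random forests split on per-feature thresholds,
# so scaling has little effect on this model; it is kept as a fixed preprocessing step.
# The scaler is fitted on the training set only to avoid leaking test-set statistics.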
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature Scaling Applied: StandardScaler")
print("\nOriginal Feature Ranges (Training Set):")
print(X_train.describe().loc[['min', 'max']])
print("\nScaled Feature Ranges (Training Set):")
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=selected_features)
print(X_train_scaled_df.describe().loc[['min', 'max']])
print("\n✅ Features have been standardized (mean=0, std=1)")
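
The scaler above is fitted on the training set and then reused on the test set. One way to make that pairing hard to get wrong is to wrap both steps in a scikit-learn `Pipeline`. The sketch below is an illustration rather than part of the original workflow: it reloads the data so it runs on its own, and the names `DEMO_FEATURES`, `X_demo`, and `demo_pipeline` are assumptions introduced here.

```python
# Sketch: bundle scaling and the classifier in a single scikit-learn Pipeline.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same six features as the notebook; the *_demo names are illustrative only.
DEMO_FEATURES = ['alcohol', 'malic_acid', 'ash',
                 'total_phenols', 'flavanoids', 'color_intensity']

wine_demo = load_wine(as_frame=True)
X_demo = wine_demo.data[DEMO_FEATURES]
y_demo = wine_demo.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns the scaler statistics from the training data only;
# predict() and score() reuse them automatically.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
demo_pipeline.fit(X_tr, y_tr)

pipeline_accuracy = demo_pipeline.score(X_te, y_te)
print(f"Pipeline test accuracy: {pipeline_accuracy:.4f}")
```

A single pipeline object can also be saved with Joblib as one file, which removes the need to keep a separate scaler file in sync with the model.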

## 6. Model Training - Random Forest Classifier

In [None]:
# Initialize Random Forest Classifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

print("Model: Random Forest Classifier")
print("\nModel Hyperparameters:")
print(f"  - Number of Trees: {model.n_estimators}")
print(f"  - Max Depth: {model.max_depth}")
print(f"  - Min Samples Split: {model.min_samples_split}")
print(f"  - Min Samples Leaf: {model.min_samples_leaf}")
print(f"  - Random State: {model.random_state}")

# Train the model
print("\n🔄 Training model...")
model.fit(X_train_scaled, y_train)
print("✅ Model training completed!")

## 7. Model Evaluation

In [None]:
# Make predictions
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("="*60)
print("MODEL PERFORMANCE SUMMARY")
print("="*60)
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Testing Accuracy:  {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print("="*60)
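
The hold-out test set here has 36 wines (20% of 178), so a single misclassification shifts test accuracy by roughly 2.8 percentage points. Stratified k-fold cross-validation is a common way to get an estimate that depends less on one particular split. The sketch below assumes 5 folds and reuses the notebook's hyperparameters; the names `CV_FEATURES`, `cv_forest`, and `cv_scores` are introduced for illustration. The scaler is left out because tree splits are unaffected by rescaling individual features.

```python
# Sketch: stratified 5-fold cross-validation for the same Random Forest settings.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

CV_FEATURES = ['alcohol', 'malic_acid', 'ash',
               'total_phenols', 'flavanoids', 'color_intensity']

wine_cv = load_wine(as_frame=True)
X_cv = wine_cv.data[CV_FEATURES]
y_cv = wine_cv.target

cv_forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)

# Each fold keeps roughly the same class proportions as the full dataset.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_forest, X_cv, y_cv, cv=folds, scoring="accuracy")

print("Fold accuracies:", [f"{score:.3f}" for score in cv_scores])
print(f"Mean accuracy: {cv_scores.mean():.4f} (std: {cv_scores.std():.4f})")
```

The spread across folds gives a rough sense of how much the single-split test accuracy could vary.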

In [None]:
# Calculate multiclass classification metrics (weighted and macro averages)
precision_weighted = precision_score(y_test, y_test_pred, average='weighted')
recall_weighted = recall_score(y_test, y_test_pred, average='weighted')
f1_weighted = f1_score(y_test, y_test_pred, average='weighted')

precision_macro = precision_score(y_test, y_test_pred, average='macro')
recall_macro = recall_score(y_test, y_test_pred, average='macro')
f1_macro = f1_score(y_test, y_test_pred, average='macro')

print("\nMULTICLASS CLASSIFICATION METRICS")
print("="*60)
print("\nWeighted Average (each class weighted by its number of true samples):")
print(f"  Precision: {precision_weighted:.4f}")
print(f"  Recall:    {recall_weighted:.4f}")
print(f"  F1-Score:  {f1_weighted:.4f}")

print("\nMacro Average (treats all classes equally):")
print(f"  Precision: {precision_macro:.4f}")
print(f"  Recall:    {recall_macro:.4f}")
print(f"  F1-Score:  {f1_macro:.4f}")
print("="*60)
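
The two averages differ only in how the per-class scores are combined: macro takes a plain mean, while weighted multiplies each class score by its support (number of true samples). The small example below uses made-up toy labels, not results from this notebook, to show the arithmetic for recall.

```python
# Sketch: macro vs. weighted recall computed by hand on hypothetical toy labels.
import numpy as np
from sklearn.metrics import recall_score

y_true_toy = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 0, 0, 1, 0, 2, 2, 2, 1])

# Per-class recall: class 0 -> 4/4, class 1 -> 1/2, class 2 -> 3/4
per_class_recall = recall_score(y_true_toy, y_pred_toy, average=None)
support = np.bincount(y_true_toy)  # true samples per class: [4, 2, 4]

macro_manual = per_class_recall.mean()
weighted_manual = np.average(per_class_recall, weights=support)

print(f"Macro recall:    {macro_manual:.4f}")     # 0.7500
print(f"Weighted recall: {weighted_manual:.4f}")  # 0.8000
```

Weighted recall always equals overall accuracy (8 of 10 correct here), whereas macro recall drops further when a small class is predicted poorly.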

In [None]:
# Detailed Classification Report
print("\nDETAILED CLASSIFICATION REPORT")
print("="*60)
print(classification_report(y_test, y_test_pred, 
                          target_names=['Cultivar 0', 'Cultivar 1', 'Cultivar 2']))
print("="*60)

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Cultivar 0', 'Cultivar 1', 'Cultivar 2'],
            yticklabels=['Cultivar 0', 'Cultivar 1', 'Cultivar 2'],
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Random Forest Classifier', fontsize=14, fontweight='bold', pad=20)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

print("\nConfusion Matrix:")
print(cm)

In [None]:
# Feature Importance
feature_importance = pd.DataFrame({
    'Feature': selected_features,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'], 
         color='steelblue', edgecolor='navy')
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance - Random Forest Classifier', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nFeature Importance Ranking:")
print(feature_importance.to_string(index=False))
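
`feature_importances_` is an impurity-based measure computed from the training data, and it can overstate features with many distinct values. Permutation importance on held-out data is a common cross-check: it measures how much accuracy drops when one feature's values are shuffled. The sketch below uses `sklearn.inspection.permutation_importance`; the choice of 20 repeats and the names `PERM_FEATURES`, `perm_forest`, and `perm_table` are assumptions made for illustration.

```python
# Sketch: permutation importance on the held-out split as a cross-check.
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

PERM_FEATURES = ['alcohol', 'malic_acid', 'ash',
                 'total_phenols', 'flavanoids', 'color_intensity']

wine_perm = load_wine(as_frame=True)
X_perm = wine_perm.data[PERM_FEATURES]
y_perm = wine_perm.target

X_tr_p, X_te_p, y_tr_p, y_te_p = train_test_split(
    X_perm, y_perm, test_size=0.2, random_state=42, stratify=y_perm
)

perm_forest = RandomForestClassifier(n_estimators=100, random_state=42)
perm_forest.fit(X_tr_p, y_tr_p)

# Shuffle each feature 20 times and record the resulting drop in test accuracy.
perm_result = permutation_importance(
    perm_forest, X_te_p, y_te_p,
    scoring="accuracy", n_repeats=20, random_state=42,
)

perm_table = pd.DataFrame({
    "Feature": PERM_FEATURES,
    "MeanAccuracyDrop": perm_result.importances_mean,
    "Std": perm_result.importances_std,
}).sort_values("MeanAccuracyDrop", ascending=False)

print(perm_table.to_string(index=False))
```

If the two rankings broadly agree, that supports the impurity-based ordering; large disagreements are worth a closer look, particularly among correlated features.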

## 8. Save the Trained Model and Scaler

In [None]:
# Save the trained model using Joblib
model_filename = 'wine_cultivar_model.pkl'
scaler_filename = 'scaler.pkl'

joblib.dump(model, model_filename)
joblib.dump(scaler, scaler_filename)

print("✅ Model saved successfully!")
print(f"   - Model file: {model_filename}")
print(f"   - Scaler file: {scaler_filename}")
print("\nModel Persistence Method: Joblib")

# Verify saved files
import os
print("\nFile Verification:")
print(f"  - {model_filename}: {os.path.exists(model_filename)} ({os.path.getsize(model_filename) / 1024:.2f} KB)")
print(f"  - {scaler_filename}: {os.path.exists(scaler_filename)} ({os.path.getsize(scaler_filename) / 1024:.2f} KB)")
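
Pickled scikit-learn estimators are not guaranteed to load correctly under a different library version, and the model also depends on the feature order used at training time. A possible refinement, sketched below, is to store the model, scaler, feature list, and library version together in one Joblib file. The dictionary layout and the file name `wine_bundle.pkl` are hypothetical, and the block writes to a temporary directory so it does not touch the files saved above.

```python
# Sketch: save model, scaler, feature order, and library version as one artifact.
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

BUNDLE_FEATURES = ['alcohol', 'malic_acid', 'ash',
                   'total_phenols', 'flavanoids', 'color_intensity']

wine_bundle = load_wine(as_frame=True)
X_bundle = wine_bundle.data[BUNDLE_FEATURES]

# Stand-ins for the notebook's trained objects so the block runs on its own.
bundle_scaler = StandardScaler().fit(X_bundle)
bundle_model = RandomForestClassifier(n_estimators=50, random_state=42)
bundle_model.fit(bundle_scaler.transform(X_bundle), wine_bundle.target)

artifact = {
    "model": bundle_model,
    "scaler": bundle_scaler,
    "features": BUNDLE_FEATURES,
    "sklearn_version": sklearn.__version__,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "wine_bundle.pkl")  # hypothetical name
    joblib.dump(artifact, bundle_path)
    restored = joblib.load(bundle_path)

# Warn if the loading environment differs from the training environment.
if restored["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with a different scikit-learn version")

restored_predictions = restored["model"].predict(
    restored["scaler"].transform(X_bundle[restored["features"]])
)
print(f"Restored bundle predicted {len(restored_predictions)} samples")
```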

## 9. Model Testing - Load and Predict

In [None]:
# Load the saved model and scaler
loaded_model = joblib.load(model_filename)
loaded_scaler = joblib.load(scaler_filename)

print("✅ Model and scaler loaded successfully!")

# Test with a sample prediction
sample_data = X_test.iloc[0:3]  # Take first 3 test samples
print("\nSample Input Data:")
print(sample_data)

# Scale the sample data
sample_scaled = loaded_scaler.transform(sample_data)

# Make predictions
predictions = loaded_model.predict(sample_scaled)
print("\nPredicted Cultivars:")
for i, pred in enumerate(predictions):
    print(f"  Sample {i+1}: Cultivar {pred}")

# Compare with actual labels
actual_labels = y_test.iloc[0:3].values
print("\nActual Cultivars:")
for i, actual in enumerate(actual_labels):
    print(f"  Sample {i+1}: Cultivar {actual}")

print("\n✅ Model testing completed successfully!")

## 10. Summary and Conclusions

### Model Development Summary:
- **Algorithm:** Random Forest Classifier
- **Features Used:** 6 chemical properties (alcohol, malic_acid, ash, total_phenols, flavanoids, color_intensity)
- **Dataset Split:** 80% training, 20% testing
- **Preprocessing:** StandardScaler for feature scaling
- **Model Persistence:** Joblib

### Key Findings:
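- The Random Forest Classifier performed well on the wine cultivar prediction task using only 6 of the 13 available features (see the accuracy, precision, recall, and F1-score reported in Section 7)
- Feature scaling (StandardScaler) was fitted on the training set and applied to both sets to standardize the varying ranges of the chemical properties
- Test performance was consistent with training performance on the stratified 80/20 split; because the test set holds only 36 samples, this estimate is coarse
- Feature importance analysis ranked the six chemical properties by their contribution to classification (see the ranking in Section 7)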

### Next Steps:
- Deploy the model using a web-based GUI (Streamlit/Flask)
- Host the application on a cloud platform (Render/Streamlit Cloud)
- Allow users to input wine chemical properties and receive cultivar predictions
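
Whichever front end is chosen, the prediction step behind it could look like the sketch below: a function that takes the six measurements as a dictionary, validates them, applies the scaler, and returns the predicted cultivar with class probabilities. The function `predict_cultivar` and the `demo_*` objects are hypothetical; in a deployed app the model and scaler would come from the saved Joblib files rather than being refitted.

```python
# Sketch: a prediction helper that a web GUI could call (hypothetical function).
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

APP_FEATURES = ['alcohol', 'malic_acid', 'ash',
                'total_phenols', 'flavanoids', 'color_intensity']


def predict_cultivar(measurements, model, scaler):
    """Return (predicted class, {class: probability}) for one wine sample."""
    missing = [name for name in APP_FEATURES if name not in measurements]
    if missing:
        raise ValueError(f"Missing features: {missing}")

    # Build a one-row frame in the exact feature order used for training.
    row = pd.DataFrame(
        [[float(measurements[name]) for name in APP_FEATURES]],
        columns=APP_FEATURES,
    )
    probabilities = model.predict_proba(scaler.transform(row))[0]
    predicted = int(model.classes_[probabilities.argmax()])
    return predicted, dict(zip(model.classes_.tolist(),
                               probabilities.round(3).tolist()))


# Stand-ins for joblib.load(...) so the block runs on its own.
wine_app = load_wine(as_frame=True)
demo_scaler = StandardScaler().fit(wine_app.data[APP_FEATURES])
demo_model = RandomForestClassifier(n_estimators=100, random_state=42)
demo_model.fit(demo_scaler.transform(wine_app.data[APP_FEATURES]), wine_app.target)

example_input = wine_app.data[APP_FEATURES].iloc[0].to_dict()
label, probs = predict_cultivar(example_input, demo_model, demo_scaler)
print(f"Predicted cultivar: {label}")
print(f"Class probabilities: {probs}")
```

Returning probabilities alongside the label lets the interface show how confident the model is, which may be more useful to users than a bare class number.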