# Wine Quality Prediction

**Objective:** Build a binary classifier to predict wine quality (Good vs Bad) using chemical properties. This notebook includes EDA, preprocessing, model training, evaluation, and feature importance.

**Dataset:** `WineQT.csv` (make sure this file is in the same directory as the notebook).

## 1. Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# set_theme is the preferred name; sns.set is kept only as an alias
sns.set_theme(style='whitegrid')
%matplotlib inline

## 2. Load data

In [None]:
df = pd.read_csv('WineQT.csv')
print('Shape:', df.shape)
df.head()

## 3. Basic info, nulls and data types

In [None]:
# info() prints its summary directly and returns None, so do not wrap it in print()
df.info()
print('\nMissing values:')
print(df.isnull().sum())

# Basic statistics
print('\nDescribe:')
print(df.describe())

## 4. Drop unnecessary columns

In [None]:
if 'Id' in df.columns:
    df = df.drop('Id', axis=1)

print('Shape after dropping Id (if present):', df.shape)

## 5. Exploratory Data Analysis (basic)
- Distribution of quality
- Correlation heatmap

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='quality', data=df, palette='viridis')
plt.title('Wine Quality Distribution')
plt.show()

plt.figure(figsize=(12,8))
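# Named `corr` so it is not confused with the confusion matrix `cm` computed later
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')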
plt.title('Correlation Heatmap')
plt.show()

## 6. Prepare features and target
Convert `quality` to binary: Good (1) if quality >= 7, else Bad (0).

In [None]:
X = df.drop('quality', axis=1)
# Vectorized threshold: 1 (Good) when quality >= 7, otherwise 0 (Bad)
y = (df['quality'] >= 7).astype(int)

print('Feature shape:', X.shape)
print('Target distribution:\n', y.value_counts())

## 7. Train-test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)
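
The split above passes `stratify=y`, which keeps the share of Good wines the same in the train and test sets. A minimal sketch of that effect follows, assuming a hypothetical 15% positive rate; `X_demo` and `y_demo` are synthetic stand-ins and not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target (15% positives) standing in for the binary `y`.
rng = np.random.default_rng(42)
y_demo = np.array([1] * 150 + [0] * 850)
X_demo = rng.normal(size=(1000, 3))  # synthetic features, not the wine columns

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# With stratify, both splits keep the overall positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f'Positive rate - overall: {y_demo.mean():.3f}, '
      f'train: {train_rate:.3f}, test: {test_rate:.3f}')
```

Running the same check on `y_train` and `y_test` from the notebook shows whether the real split is balanced in the same way.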

## 8. Train Random Forest Classifier

In [None]:
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print('Training completed')

## 9. Evaluate model

In [None]:
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc:.4f}\n')
print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
cm = confusion_matrix(y_test, y_pred)
print(cm)
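
When one class is much rarer than the other, accuracy alone can look good even if few Good wines are found. As a rough illustration of how to read the confusion matrix, the sketch below unpacks its four cells and derives precision and recall for the positive class. The labels are made-up toy values, not results from this model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Toy labels for illustration only (not output of the wine model).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels {0, 1}, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted Good that are truly Good
recall = tp / (tp + fn)     # share of truly Good that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'tn={tn} fp={fp} fn={fn} tp={tp}')
print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')
```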

## 10. Feature Importance

In [None]:
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances = feat_importances.sort_values(ascending=True)
plt.figure(figsize=(8,6))
feat_importances.plot(kind='barh')
plt.title('Feature Importance')
plt.show()

print('\nTop features:')
print(feat_importances.tail(10)[::-1])

## 11. Save the trained model (optional)
You can save the trained model for later use with `joblib` or `pickle`. Uncomment the following lines to save it.

In [None]:
# import joblib
# joblib.dump(model, 'wine_quality_rf.pkl')
# print('Model saved to wine_quality_rf.pkl')
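
To reuse a saved model, it can be read back with `joblib.load`. The sketch below shows a save-and-load round trip under stated assumptions: it trains a small throwaway model on synthetic data, and writes to a temporary directory under the hypothetical file name `wine_quality_rf_demo.pkl` so nothing in the working directory is touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model, used only to demonstrate the round trip.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'wine_quality_rf_demo.pkl')  # hypothetical file name
    joblib.dump(demo_model, path)   # serialize the fitted model to disk
    restored = joblib.load(path)    # read it back as a new object

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print('Round trip preserves predictions:', same_predictions)
```

Only load model files from sources you trust, and preferably with the same scikit-learn version that was used to save them.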

## Conclusion
Summarize the findings, model performance, and possible next steps such as hyperparameter tuning, cross-validation, or using other classifiers (XGBoost, SVM).
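
As one possible starting point for the cross-validation step mentioned above, the sketch below scores a random forest with stratified 5-fold cross-validation. It is an illustration under assumptions: the data is synthetic (`make_classification` with a hypothetical 85/15 class balance), and the fold count, `n_estimators=50`, and the F1 metric are arbitrary choices rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y; replace with the notebook's own X and y.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=42)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42)

# F1 on the positive class is less flattering than accuracy on imbalanced data.
scores = cross_val_score(rf, X_demo, y_demo, cv=cv, scoring='f1')
print('F1 per fold:', scores.round(3))
print(f'Mean F1: {scores.mean():.3f} (std {scores.std():.3f})')
```

The spread across folds gives a sense of how much a single train-test split could over- or understate performance.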