# Wine Quality Classification

This notebook predicts the quality of red wine from physicochemical measurements such as acidity, residual sugar, and pH. The original 3 to 8 quality score is reduced to a binary good/bad label, and a random forest classifier is trained on a class-balanced subset of the data. The notebook goes through the following steps:

- **Data Loading:** Importing and examining the dataset.
- **Exploratory Data Analysis (EDA):** Understanding the data distribution and relationships.
- **Data Preprocessing:** Binarizing the quality score, encoding the labels, and balancing the classes.
- **Modeling:** Tuning a random forest classifier with a cross-validated grid search.
- **Evaluation:** Measuring performance on a held-out test set.
- **Conclusion:** Summarizing insights and discussing potential future improvements.


In [6]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

%matplotlib inline

## Data Loading

In this section, we load the `winequality-red.csv` dataset using the Pandas library. The dataset contains various physicochemical properties of red wine and a quality score assigned by experts. We will also preview the first few rows to understand the structure and contents of the data.


In [7]:
wine = pd.read_csv('winequality-red.csv')
wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
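
Before exploring further, it can help to check the loaded table for missing values and repeated rows. The sketch below rebuilds the five previewed rows for three feature columns plus `quality` and counts both. The column subset and the name `preview` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# The five rows shown by wine.head(), restricted to four columns for brevity
# (illustrative subset; the real table has twelve columns)
preview = pd.DataFrame({
    'fixed acidity': [7.4, 7.8, 7.8, 11.2, 7.4],
    'volatile acidity': [0.7, 0.88, 0.76, 0.28, 0.7],
    'citric acid': [0.0, 0.0, 0.04, 0.56, 0.0],
    'quality': [5, 5, 5, 6, 5],
})

# Number of missing values per column
missing_per_column = preview.isna().sum()

# Number of rows that exactly repeat an earlier row
n_duplicates = int(preview.duplicated().sum())

print(missing_per_column)
print('duplicate rows:', n_duplicates)
```

On the full dataset the equivalent calls would be `wine.isna().sum()` and `wine.duplicated().sum()`. In the preview, row 4 repeats row 0.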


## Exploratory Data Analysis (EDA)

Here, we perform an initial exploration of the data. This includes:
- Displaying basic statistics and data types.
- Visualizing the distribution of key features (e.g., acidity, sugar levels).
- Investigating relationships between features and the wine quality.
  
These steps help us identify potential data quality issues, understand feature distributions, and guide our preprocessing and model selection.


In [8]:
wine.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [9]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [10]:
wine['quality'].value_counts()

quality
5    681
6    638
7    199
4     53
8     18
3     10
Name: count, dtype: int64
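
The counts above are easier to compare as proportions. The following sketch copies the counts from the output into a Series (the names `counts`, `n_high`, and `n_low` are illustrative) and shows how few wines score 7 or higher.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}, name='count')

# Share of each quality score among all wines
proportions = (counts / counts.sum()).round(3)

# Wines scoring 7 or 8 versus wines scoring 3 to 6
n_high = int(counts[counts.index >= 7].sum())
n_low = int(counts.sum()) - n_high

print(proportions)
print('scores 7-8:', n_high, '| scores 3-6:', n_low)
```

Scores 5 and 6 together cover most of the 1,599 wines, so any split into high and low quality will be imbalanced.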

In [11]:
# One bar plot per feature: mean feature value for each quality score
features = wine.columns.drop('quality')
fig = plt.figure(figsize=(15,10))

for i, feature in enumerate(features, start=1):
    plt.subplot(3,4,i)
    sns.barplot(x='quality',y=feature,data=wine)

plt.tight_layout()
plt.savefig('output.jpg',dpi=1000)

## Data Preprocessing

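This section prepares the data for modeling. The dataset has no missing values (see the `wine.info()` output above), so the preprocessing consists of:
- Binarizing the target: scores of 6 or lower become `bad`, and scores of 7 or 8 become `good`.
- Encoding the labels as integers (`bad` as 0, `good` as 1).
- Undersampling the majority `bad` class so that both classes have the same number of rows.
- Checking how each feature correlates with the binary target.

No feature scaling is applied, since tree-based models such as random forests do not require it. Balancing the classes keeps the model from simply favoring the majority class.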


In [12]:
# Scores in (2, 6.5], i.e. 3 to 6, are labelled 'bad'; scores in (6.5, 8], i.e. 7 and 8, are labelled 'good' (8 is the highest score in the data)
ranges = (2,6.5,8) 
groups = ['bad','good']
wine['quality'] = pd.cut(wine['quality'],bins=ranges,labels=groups)
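
To see how these bins behave, the sketch below applies the same `pd.cut` call to one example of each score. The Series `scores` is a hand-made illustration, not data from the notebook. By default `pd.cut` uses right-closed intervals, so 8 falls in the last bin.

```python
import pandas as pd

# One example of each quality score present in the dataset
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Same bins and labels as above: (2, 6.5] -> 'bad', (6.5, 8] -> 'good'
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=['bad', 'good'])

print(dict(zip(scores, labels)))
```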

In [13]:
le = LabelEncoder()
wine['quality'] = le.fit_transform(wine['quality'])
wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
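
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `bad` becomes 0 and `good` becomes 1. A minimal sketch on a hand-made list of labels (the list and the name `mapping` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['bad', 'good', 'good', 'bad'])

# classes_ lists the original labels; a label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}

print(mapping)
print(le.inverse_transform([0, 1]))
```

`inverse_transform` converts the integer codes back into the original labels when reading predictions.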


In [14]:
wine['quality'].value_counts()

quality
0    1382
1     217
Name: count, dtype: int64

In [15]:
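# Separate the two classes
good_quality = wine[wine['quality']==1]
bad_quality = wine[wine['quality']==0]

# Undersample the majority class: shuffle the 'bad' rows and keep as many as there are 'good' rows
bad_quality = bad_quality.sample(frac=1)
bad_quality = bad_quality[:len(good_quality)]

# Combine both classes and shuffle the rows (no random_state is set, so the sample differs between runs)
new_df = pd.concat([good_quality,bad_quality])
new_df = new_df.sample(frac=1)
new_df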

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
343,10.9,0.390,0.47,1.8,0.118,6.0,14.0,0.99820,3.30,0.75,9.8,0
1235,6.0,0.330,0.32,12.9,0.054,6.0,113.0,0.99572,3.30,0.56,11.5,0
747,8.6,0.330,0.40,2.6,0.083,16.0,68.0,0.99782,3.30,0.48,9.4,0
634,7.9,0.350,0.21,1.9,0.073,46.0,102.0,0.99640,3.27,0.58,9.5,0
600,8.2,0.915,0.27,2.1,0.088,7.0,23.0,0.99620,3.26,0.47,10.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
715,7.2,0.490,0.18,2.7,0.069,13.0,34.0,0.99670,3.29,0.48,9.2,0
1312,8.0,1.180,0.21,1.9,0.083,14.0,41.0,0.99532,3.34,0.47,10.5,0
459,11.6,0.580,0.66,2.2,0.074,10.0,47.0,1.00080,3.25,0.57,9.0,0
952,8.2,0.310,0.40,2.2,0.058,6.0,10.0,0.99536,3.31,0.68,11.2,1


In [16]:
new_df['quality'].value_counts()

quality
0    217
1    217
Name: count, dtype: int64
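
Because the undersampling step draws rows at random, results can change from run to run. The sketch below shows one way to make the same idea repeatable by passing `random_state` to `DataFrame.sample`. The toy frame, the seed value 42, and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame with a similar imbalance: 8 'bad' (0) rows and 3 'good' (1) rows
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, 9.8, 9.5, 10.0, 9.2, 10.5, 9.0, 11.2, 12.0, 11.5],
    'quality': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

minority = toy[toy['quality'] == 1]
majority = toy[toy['quality'] == 0]

# Draw as many majority rows as there are minority rows; the seed makes the draw repeatable
majority_down = majority.sample(n=len(minority), random_state=42)

# Combine both classes and shuffle the rows
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)

print(balanced['quality'].value_counts())
```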

In [17]:
new_df.corr()['quality'].sort_values(ascending=False)

quality                 1.000000
alcohol                 0.589490
citric acid             0.248055
sulphates               0.216591
fixed acidity           0.124716
residual sugar          0.100897
pH                     -0.030783
free sulfur dioxide    -0.170571
chlorides              -0.182496
total sulfur dioxide   -0.234800
density                -0.240006
volatile acidity       -0.364845
Name: quality, dtype: float64

## Model Training

In this section, we train a random forest classifier to predict the binary quality label. We hold out 30% of the balanced data as a test set, then use `GridSearchCV` with 10-fold cross-validation on the training set to choose the number of trees (`n_estimators`) from 100 to 1000 in steps of 100, scoring each candidate by accuracy.


In [18]:
from sklearn.model_selection import train_test_split

X = new_df.drop('quality',axis=1) 
y = new_df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
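
The split above is not stratified, so the class proportions can drift slightly between the training and test sets. As a hedged alternative, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The sketch uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names), not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced data: 100 rows per class, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 100 + [1] * 100)

# stratify=y_demo keeps the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

print('train class counts:', np.bincount(y_tr))
print('test class counts:', np.bincount(y_te))
```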

In [19]:
param = {'n_estimators':[100,200,300,400,500,600,700,800,900,1000]}

grid_rf = GridSearchCV(RandomForestClassifier(),param,scoring='accuracy',cv=10)
grid_rf.fit(X_train, y_train)

print('Best parameters --> ', grid_rf.best_params_)

pred = grid_rf.predict(X_test)

Best parameters -->  {'n_estimators': 100}
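
Beyond `best_params_`, a fitted `GridSearchCV` object exposes the mean cross-validated score of every candidate in `cv_results_`, which helps judge whether the winner is clearly better than the alternatives. The sketch below runs on synthetic stand-in data with a deliberately small grid and fold count to keep it fast. None of the names or values come from the notebook's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

# Small grid and few folds, chosen only to keep the run short
param_demo = {'n_estimators': [10, 50]}
grid_demo = GridSearchCV(RandomForestClassifier(random_state=0), param_demo,
                         scoring='accuracy', cv=3)
grid_demo.fit(X_demo, y_demo)

# Mean cross-validated accuracy for every candidate, followed by the winner
for params, score in zip(grid_demo.cv_results_['params'],
                         grid_demo.cv_results_['mean_test_score']):
    print(params, round(score, 3))
print('best:', grid_demo.best_params_, round(grid_demo.best_score_, 3))
```

If the scores of all candidates are nearly identical, the smallest forest is usually the cheaper choice.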


## Model Evaluation

After training, we evaluate the tuned random forest on the held-out test set using:
- Accuracy
- Precision
- Recall
- F1 Score

We also print the confusion matrix, which shows how many wines of each class were classified correctly and which kinds of errors the model makes.


In [20]:
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
print('\n')
print(accuracy_score(y_test,pred))

[[51 12]
 [ 8 60]]


              precision    recall  f1-score   support

           0       0.86      0.81      0.84        63
           1       0.83      0.88      0.86        68

    accuracy                           0.85       131
   macro avg       0.85      0.85      0.85       131
weighted avg       0.85      0.85      0.85       131



0.8473282442748091
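
The numbers in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 from the four counts printed above. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = np.array([[51, 12],
               [8, 60]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print('accuracy :', round(accuracy, 4))
print('precision:', precision.round(2))
print('recall   :', recall.round(2))
print('f1       :', f1.round(2))
```

For example, 111 of the 131 test wines lie on the diagonal, which gives the reported accuracy of about 0.85.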


## Conclusion & Future Work

In this notebook, we developed a random forest model that classifies red wine as good or bad based on its physicochemical properties. Key takeaways include:
- The quality scores are heavily imbalanced: after binarization, only 217 of the 1,599 wines are labelled good.
- On the balanced subset used in this run, alcohol has the strongest positive correlation with good quality (0.59) and volatile acidity the strongest negative correlation (-0.36).
- The tuned random forest reaches about 85% accuracy on the 131-wine test set, with similar precision and recall for both classes.

For future work, potential improvements could include:
- Tuning hyperparameters other than `n_estimators` in the cross-validated grid search.
- Incorporating additional data or features.
- Experimenting with other ensemble methods or deep learning approaches.

This notebook summarizes our approach and can serve as a reference for further refinements and experimentation.
