This notebook clusters the Wine Quality dataset using PCA followed by Ward hierarchical clustering, then checks the resulting clusters against the known wine type.

In [21]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
from sklearn import cluster

git_url = 'https://raw.githubusercontent.com/vishal-git/dapt-631/main/data'

In [5]:
df = pd.read_csv(f'{git_url}/winequality.csv', index_col=0)
df.shape

(6497, 13)

In [6]:
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine type
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,White
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,White
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,White
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,White
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,White


We will cluster on all 12 numeric columns (including `quality`) and leave out `wine type`, which we hold back to evaluate the clusters later.

In [7]:
clus_cols = df.columns[:-1]
clus_cols

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [18]:
len(clus_cols)

12

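Let's standardize the features first. They are on very different scales (compare `total sulfur dioxide` with `chlorides`), and both PCA and Ward clustering are sensitive to scale.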

In [13]:
X = StandardScaler().fit_transform(df[clus_cols])
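
As a quick sanity check on what standardization does, here is a minimal sketch on a small array. The values are the `fixed acidity` and `total sulfur dioxide` entries from the first four rows shown above; the names `toy`, `toy_scaled` and `manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (first four rows of the head() output)
toy = np.array([[7.0, 170.0],
                [6.3, 132.0],
                [8.1, 97.0],
                [7.2, 186.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0), which is NumPy's default
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(toy_scaled, manual))
print(toy_scaled.mean(axis=0).round(6), toy_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distances used later.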

Reduce dimensionality using PCA.

In [14]:
pca = PCA(random_state=314)
pca.fit(X)

In [17]:
np.cumsum(pca.explained_variance_ratio_)

array([0.25346226, 0.47428343, 0.61107566, 0.70012777, 0.77016947,
       0.82520274, 0.87218827, 0.91518684, 0.95338453, 0.97830228,
       0.9972679 , 1.        ])

The first 9 principal components capture just over 95% of the total variance in this dataset (cumulative ratio 0.953).
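
The number of components can also be read off programmatically rather than by eye. The following is a minimal sketch that uses the cumulative ratios copied from the output above; the 0.95 threshold and the names `cum_var`, `threshold` and `n_keep` are assumptions made for illustration.

```python
import numpy as np

# Cumulative explained-variance ratios copied from the output above
cum_var = np.array([0.25346226, 0.47428343, 0.61107566, 0.70012777, 0.77016947,
                    0.82520274, 0.87218827, 0.91518684, 0.95338453, 0.97830228,
                    0.9972679, 1.0])

threshold = 0.95  # assumed target share of total variance

# argmax returns the first index where the condition is True;
# adding 1 converts that zero-based index into a component count
n_keep = int(np.argmax(cum_var >= threshold)) + 1
print(n_keep)  # 9
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it keeps the smallest number of components whose explained variance exceeds that fraction.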

In [19]:
components_to_keep = 9

pca = PCA(n_components=components_to_keep, random_state=314)

In [22]:
X_pc = pca.fit_transform(X)

Let's find two clusters, since we know the dataset contains two kinds of wine: red and white.

In [23]:
ward = cluster.AgglomerativeClustering(n_clusters=2,
                                       linkage='ward').fit(X_pc)

In [24]:
y_pred = ward.labels_.astype(int)
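
Ward linkage builds clusters bottom-up, at each step merging the pair of clusters whose merge increases the total within-cluster variance the least. The sketch below runs the same estimator on a few made-up points (hypothetical values, not taken from the wine data) to show the shape of `labels_`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy points (made up for illustration): two tight, well-separated groups
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

toy_labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(pts).labels_

# Points in the same group share a label; which group gets 0 and which
# gets 1 is arbitrary, so labels only have meaning relative to each other
print(toy_labels)
```

Because the cluster ids are arbitrary, they cannot be compared to the wine types directly; a cross-tabulation is needed to see which cluster corresponds to which type.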

Check whether the identified clusters align with the true wine types (red and white).

In [26]:
pd.crosstab(y_pred, df['wine type'])

wine type,Red,White
row_0,,
0,39,4741
1,1560,157


In [27]:
crosstab = pd.crosstab(y_pred, df['wine type'])
crosstab.div(crosstab.sum(axis=1), axis=0)

wine type,Red,White
row_0,,
0,0.008159,0.991841
1,0.908561,0.091439


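This table is normalized by row, so it describes the composition of each cluster: cluster 1 is 91% red and 9% white, while cluster 0 is more than 99% white. Measured against wine type instead (using the counts in the first crosstab), 1,560 of the 1,599 red wines (about 98%) fall in cluster 1 and 4,741 of the 4,898 white wines (about 97%) fall in cluster 0. The two clusters therefore align closely with the red/white split.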
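
A column-normalized version of the table shows what share of each wine type landed in each cluster. Here is a minimal sketch with the counts copied from the crosstab output above; the names `counts`, `by_type` and `agreement` are introduced only for this illustration.

```python
import pandas as pd

# Counts copied from the crosstab output above
# (rows: cluster label, columns: wine type)
counts = pd.DataFrame({'Red': [39, 1560], 'White': [4741, 157]}, index=[0, 1])

# Divide each column by its total: share of each wine type per cluster
by_type = counts.div(counts.sum(axis=0), axis=1)
print(by_type.round(4))

# Share of all wines that belong to the majority type of their cluster
agreement = counts.max(axis=1).sum() / counts.values.sum()
print(round(agreement, 4))
```

On the full data, `pd.crosstab(y_pred, df['wine type'], normalize='columns')` should produce the same column-normalized table in one step.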