Note: This red wine analysis using K-Means is one part of a group project that I completed with my teammates. For the full analysis using different machine learning models, please refer to my other notebook, "Red-Wine Analysis (Full)".

# Importing Libraries

In [None]:
#import libraries 

#structures
import numpy as np
import pandas as pd

#visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from mpl_toolkits.mplot3d import Axes3D

#get model duration
import time
from datetime import date

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Description of data

In [None]:
#load dataset
data = '../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv'
dataset = pd.read_csv(data)
dataset.shape

The red wine data consists of 1599 rows and 12 columns.

In [None]:
dataset.dtypes

In [None]:
dataset.describe()

# Data Cleaning

In [None]:
#check for missing data
dataset.isnull().any().any()

In [None]:
#check for unreasonable data: True only if every value is a real number
dataset.applymap(np.isreal).all().all()

# Data visualisation

In [None]:
sns_plot = sns.pairplot(dataset)

In [None]:
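#distplot is deprecated in recent seaborn versions; histplot with kde=True gives a similar plot
sns_plot = sns.histplot(dataset['quality'], kde=True)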

# Pre-processing

In [None]:
#set x and y
from sklearn.preprocessing import StandardScaler

X = dataset.iloc[:,0:11]
y = dataset['quality']

#standardize data (zero mean, unit variance per feature)
X_scaled = StandardScaler().fit_transform(X)

In [None]:
dataset.head()

# Feature Engineering

1. Feature extraction: Principal component analysis
2. Feature selection: Pearson's correlation

# 1. Principal component analysis

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
pca = PCA(n_components=6)
pc_X = pca.fit_transform(X_scaled)
pc_columns = ['pc1','pc2','pc3','pc4','pc5','pc6']
print(pca.explained_variance_ratio_.sum())

In [None]:
print(pca.explained_variance_ratio_)

# 2. Pearson's Correlation

In [None]:
#get correlation map
corr_mat=dataset.corr()

In [None]:
#visualise data
plt.figure(figsize=(13,5))
sns_plot=sns.heatmap(data=corr_mat, annot=True, cmap='GnBu')
plt.show()

A correlation matrix was computed to sieve out features that are highly correlated with the quality of red wine, using a benchmark range of -0.5 to 0.6. Our results show that all features fall within this range, so none is flagged as highly correlated with quality.

From the heatmap, it can be seen that most features are weakly correlated with the quality of wine, with the exception of alcohol (0.48), which shows a moderate correlation.

**Direction of relationship** <br>
Volatile acidity (-0.39), chlorides (-0.13), free sulfur dioxide (-0.051), total sulfur dioxide (-0.19), density (-0.17) and pH (-0.058) are negatively correlated with the quality of wine: as these variables decrease, quality tends to increase, and vice versa. <br> <br>

Conversely, fixed acidity (0.12), citric acid, residual sugar (0.014), sulphates (0.25) and alcohol (0.48) are positively correlated with the quality of wine: as these variables increase, quality tends to improve.
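
To make the numbers in the heatmap concrete, here is a minimal sketch of how a single Pearson coefficient is computed. The helper name `pearson_r` and the five sample values are made up for illustration and are not taken from the wine data; `dataset.corr()` applies the same formula to every pair of columns.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()          # centre both variables
    yc = y - y.mean()
    # covariance divided by the product of the standard deviations
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# made-up illustrative values (NOT rows from the dataset)
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]

r = pearson_r(alcohol, quality)
print(round(r, 3))
# cross-check against NumPy's own implementation
print(np.isclose(r, np.corrcoef(alcohol, quality)[0, 1]))
```

The coefficient always lies between -1 and +1, and its sign gives the direction of the relationship described above.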

In [None]:
#check for highly correlated values to be removed
target = 'quality'
candidates = corr_mat.index[
    (corr_mat[target] > 0.5) | (corr_mat[target] < -0.5)
].values
candidates = candidates[candidates != target]
print('Correlated to', target, ': ', candidates)

# K-Means (Without PCA)

In [None]:
#import libraries
from sklearn.metrics import f1_score
from sklearn.cluster import KMeans

In this model, the entire dataset is used as training data. <br>
The elbow method is then used to find an optimal number of clusters, “K”.
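
The quantity plotted by the elbow method is the within-cluster sum of squares (WCSS), which scikit-learn exposes as `inertia_`. The sketch below shows what that number measures on a toy example; the helper `wcss_from_labels` and the four points are assumptions made for illustration only.

```python
import numpy as np

def wcss_from_labels(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# toy 2-D points forming two obvious groups (illustrative only)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = wcss_from_labels(points, [0, 0, 0, 0])
two_clusters = wcss_from_labels(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

WCSS always drops as more clusters are added, so the elbow method looks for the point where the drop starts to level off rather than for the minimum.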

In [None]:
#try to find optimal k using the elbow method
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i,init='k-means++',max_iter=300, n_init=12, random_state=0)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
f3, ax = plt.subplots(figsize=(8, 6))
plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

A “K” value of 2 will be used, as the curve in the graph above bends at around 2, which is our elbow. <br> <br>
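
Reading the elbow off a plot is subjective. One simple heuristic for a rough cross-check is to pick the K where the WCSS curve bends most sharply, i.e. where its second difference is largest. The function name and the WCSS values below are hypothetical and do not come from this notebook's run; in practice the `wcss` list built in the cell above would be passed in instead.

```python
import numpy as np

def elbow_by_second_difference(ks, wcss_values):
    """Return the k at which the WCSS curve has the largest second difference."""
    ks = np.asarray(ks)
    wcss_values = np.asarray(wcss_values, dtype=float)
    # second difference w[i-1] - 2*w[i] + w[i+1], defined for interior k only
    curvature = np.diff(wcss_values, n=2)
    return int(ks[1:-1][np.argmax(curvature)])

# hypothetical WCSS values for k = 1..6 (made up for illustration)
ks = [1, 2, 3, 4, 5, 6]
example_wcss = [1000.0, 600.0, 520.0, 460.0, 410.0, 370.0]

best_k = elbow_by_second_difference(ks, example_wcss)
print(best_k)
```

This heuristic can never return the first or last K tried, and it is sensitive to noise in the curve, so it is best treated as a sanity check on the visual choice.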

First, clustering will be performed with K-Means on the dataset without applying principal component analysis (PCA).
Note that the dataset has 11 feature dimensions.

In [None]:
#Applying kmeans to the dataset, set k=2
#fix the seed so that repeated runs give the same clusters
kmeans = KMeans(n_clusters = 2, random_state=0)
start_time = time.time()
clusters = kmeans.fit_predict(X_scaled)
print("--- %s seconds ---" % (time.time() - start_time))
labels = kmeans.labels_

Training Time – 0.072 seconds

## Data Visualization of Clustering

In [None]:
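#2D plot of the first two standardized features, coloured by cluster
colors = 'rgbkcmy'
for i in np.unique(clusters):
    plt.scatter(X_scaled[clusters==i,0],
               X_scaled[clusters==i,1],
               color=colors[i], label='Cluster ' + str(i+1))
plt.xlabel('fixed acidity (standardized)')
plt.ylabel('volatile acidity (standardized)')
plt.legend()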

It can be seen that the clusters are not well separated when plotted against the first two features. Some members of Cluster 2 appear among the points of Cluster 1 and vice versa.

## Data Visualization of Clustering in 3D Plot (Fixed Acidity, Residual Sugar, Alcohol)

In [None]:
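# Visualise the clusters considering fixed acidity, residual sugar, and alcohol
fig = plt.figure(figsize=(20, 15))
# add_subplot attaches the 3D axes to the figure (a bare Axes3D(fig) is not drawn in newer Matplotlib)
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=15, azim=40)

# colour each point by its assigned cluster rather than by wine quality
ax.scatter(X_scaled[:,0], X_scaled[:,3], X_scaled[:,10], c=clusters, edgecolor='k')
ax.set_xlabel('Fixed acidity')
ax.set_ylabel('Residual sugar')
ax.set_zlabel('Alcohol')
ax.set_title('K=2: Fixed Acidity, Residual Sugar, Alcohol', size=22)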

Now, the silhouette score of the model will be measured. The silhouette score ranges from -1 to +1. <br>
A high silhouette score indicates that objects are well matched to their own cluster and poorly matched to neighbouring clusters. <br>
(The higher the silhouette score, the better the clustering.)
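
As a rough illustration of what the score measures, the sketch below computes it from scratch on a toy example. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the point's score is `(b - a) / max(a, b)`. The helper `silhouette_mean` and the toy points are assumptions for illustration; the notebook itself relies on `metrics.silhouette_score`.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette score with Euclidean distance (needs at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # full matrix of pairwise Euclidean distances
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # usual convention for a single-member cluster
            continue
        # mean distance to own cluster (the distance to itself is 0, so divide by n - 1)
        a = dist[i, same].sum() / (n_same - 1)
        # mean distance to the nearest other cluster
        b = min(dist[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# toy points: two tight, well-separated groups (illustrative only)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

good = silhouette_mean(points, [0, 0, 1, 1])   # labels follow the real groups
bad = silhouette_mean(points, [0, 1, 0, 1])    # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

A grouping that follows the real structure scores close to +1, while one that cuts across it scores below 0.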

In [None]:
#evaluate model
from sklearn import metrics
from sklearn.metrics import pairwise_distances
metrics.silhouette_score(X_scaled, labels, metric='euclidean')

The silhouette score obtained is considered low. It means clusters are neither dense nor well separated. <br>
Next, let’s measure the inertia value.

In [None]:
kmeans.inertia_

An extremely high inertia value of 14330.119 was obtained, which may be indicative of the “curse of dimensionality”. <br>
We are using 11 dimensions of data in this model. <br>
We will therefore explore the model again using PCA (principal component analysis).
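
Inertia is not a normalized metric, so its raw size depends on the number of rows and features. One way to put the value in context, assuming the features were standardized with `StandardScaler` as above, is to compare it with the total sum of squares, which is the inertia a single cluster (K=1) would have. The short calculation below uses only the figures already reported in this notebook.

```python
# figures reported earlier in the notebook
n_samples, n_features = 1599, 11   # rows, and columns excluding 'quality'
inertia_k2 = 14330.119             # inertia obtained with K=2

# after standardization each feature has mean 0 and (population) variance 1,
# so the total sum of squares around the overall mean is n_samples * n_features
total_ss = n_samples * n_features

# share of the total variation that remains inside the two clusters
fraction_within = inertia_k2 / total_ss
print(total_ss)
print(round(fraction_within, 3))
```

A fraction close to 1 means that splitting the data into two clusters explains little of the overall variation.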

# K-Means (With PCA)

Our purpose in applying principal component analysis is to reduce dimensionality. <br>
In the PCA step above, the 11-dimensional data was reduced to 6 dimensions.
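
The number of components to keep is often chosen from the cumulative explained variance. The sketch below shows that calculation with plain NumPy. It runs on a synthetic stand-in matrix rather than the wine data, and the 90% threshold is an assumed example value, not the criterion used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for a standardized feature matrix (NOT the wine data):
# 8 observed features driven by 3 hidden factors plus a little noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 8))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# eigenvalues of the covariance matrix are the variances along each principal component
eigvals = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]  # largest first
ratio = eigvals / eigvals.sum()        # same quantity as explained_variance_ratio_
cumulative = np.cumsum(ratio)

threshold = 0.90                       # assumed example threshold
# smallest number of components whose cumulative ratio reaches the threshold
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, np.round(cumulative, 3))
```

On the wine data, the same cumulative sum can be taken directly from `pca.explained_variance_ratio_` after fitting PCA with all 11 components.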

In [None]:
#Applying kmeans to the PCA-transformed data, set k=2
#fix the seed so that repeated runs give the same clusters
kmeans = KMeans(n_clusters = 2, random_state=0)
start_time = time.time()
clusters = kmeans.fit_predict(pc_X)
print("--- %s seconds ---" % (time.time() - start_time))
labels = kmeans.labels_

Training time – 0.062 seconds <br>
Training time is observed to have reduced slightly.

## Data Visualization of Clustering

In [None]:
#2D plot of the first two principal components, coloured by cluster
colors = 'rgbkcmy'
for i in np.unique(clusters):
    plt.scatter(pc_X[clusters==i,0],
               pc_X[clusters==i,1],
               color=colors[i], label='Cluster ' + str(i+1))
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.legend()

After applying PCA, the clusters appear better separated in the plot, so a higher silhouette score is expected.

In [None]:
#evaluate model
metrics.silhouette_score(pc_X, labels, metric='euclidean')

As expected, the silhouette score has improved with PCA. However, it is still considered low, which means that some clusters are still overlapping or incorrectly grouped.

In [None]:
kmeans.inertia_

The inertia value has also decreased but is still extremely high.

K-means clustering gives poor results on this high-dimensional data. Even with PCA, the silhouette score improves only to some extent and is still considered low. The inertia value also remains extremely high, whereas ideally it should be as low as possible. Hence, we conclude that this model is not a good fit for the data.