Note: This red wine analysis, which uses both supervised and unsupervised learning models, is a group project that I completed together with my teammates.

In [None]:
#import libraries 

#structures
import numpy as np
import pandas as pd

#visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from mpl_toolkits.mplot3d import Axes3D

#get model duration
import time
from datetime import date

#analysis
from sklearn.metrics import confusion_matrix, accuracy_score

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Description of Data

In [None]:
#load dataset
data = '../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv'
dataset = pd.read_csv(data)
dataset.shape

The red wine data consists of 1599 rows and 12 columns.

In [None]:
dataset.dtypes

In [None]:
dataset.describe()

# Data Cleaning

In [None]:
#check for missing data
dataset.isnull().any().any()

In [None]:
#check for unreasonable data
dataset.applymap(np.isreal)

# Data visualisation

In [None]:
sns_plot = sns.pairplot(dataset)

In [None]:
sns_plot = sns.distplot(dataset['quality'])

# Pre-processing

In [None]:
#create new column; "quality_class"
dataset['quality_class'] = dataset['quality'].apply(lambda value: 1 if value < 5 else 2 if value < 7 else 3)

In [None]:
#set x and y
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = dataset.iloc[:,0:11]
y = dataset['quality_class']

#standardize data
X_scaled = StandardScaler().fit_transform(X)

#get feature names
X_columns = dataset.columns[:11]

#split train and test data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

In [None]:
dataset.head()

# Feature Engineering

1. Feature extraction: Principal component analysis
2. Feature selection: Pearson's correlation

# 1. Principal component analysis

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
pca = PCA(n_components=6)
pc_X = pca.fit_transform(X_scaled)
pc_columns = ['pc1','pc2','pc3','pc4','pc5','pc6']
print(pca.explained_variance_ratio_.sum())

In [None]:
print(pca.explained_variance_ratio_)

In [None]:
#split train and test data for pca
Xpc_train, Xpc_test, ypc_train, ypc_test = train_test_split(pc_X, y, random_state=42)

# 2. Pearson's Correlation

In [None]:
#get correlation map
corr_mat=dataset.corr()

In [None]:
#visualise data
plt.figure(figsize=(13,5))
sns_plot=sns.heatmap(data=corr_mat, annot=True, cmap='GnBu')
plt.show()

#save file
#sns_plot.get_figure().savefig('corr_mat.jpg')

Using a correlation of 0.6 to -0.5 as the benchmark, a correlation matrix was created to sieve out features that are highly correlated with the quality of red wine. Our results show that all features are within the acceptable range of 0.6 to -0.5.

From the heatmap, it can be seen that most features are weakly correlated with the quality of wine, with the exception of alcohol (0.48), which is moderately correlated.

**Direction of relationship** <br>
Volatile acidity (-0.39), chlorides (-0.13), free sulfur dioxide (-0.051), total sulfur dioxide (-0.19), density (-0.17) and pH (-0.058) are negatively correlated with the quality of wine; as these variables decrease, the quality of wine tends to increase, and vice versa. <br> <br>


Conversely, fixed acidity (0.12), citric acid, residual sugar (0.014), sulphates (0.25) and alcohol (0.48) are positively correlated with the quality of wine; as these variables increase, the quality of wine tends to improve.
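
The ranking of features by their correlation with the target can be sketched in a few lines. The example below uses a small synthetic table rather than the wine data; the `demo` frame, its values and the 0.5 threshold are illustrative assumptions, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine table (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(size=n)
volatile_acidity = rng.normal(size=n)
demo = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "chlorides": rng.normal(size=n),
    "quality": 0.5 * alcohol - 0.4 * volatile_acidity + rng.normal(size=n),
})

# Pearson correlation of every feature with the target.
corr_with_target = demo.corr()["quality"].drop("quality")

# Order by absolute strength, keeping the sign so the direction stays visible.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)

# Features whose absolute correlation exceeds the chosen threshold.
threshold = 0.5
strong = ranked[ranked.abs() > threshold].index.tolist()
print("Above threshold:", strong)
```

Sorting on the absolute value keeps strongly negative features (such as the synthetic "volatile acidity" here) near the top instead of at the bottom of the list.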

In [None]:
#check for highly correlated values to be removed
target = 'quality'
candidates = corr_mat.index[
    (corr_mat[target] > 0.5) | (corr_mat[target] < -0.5)
].values
candidates = candidates[candidates != target]
print('Correlated to', target, ': ', candidates)

# Supervised Machine Learning

1. Regression Model <br>
1.1 Linear Regression <br>
2. Classification Models <br>
2.1 Logistic Regression <br>
2.2 K-NN <br>
2.3 Decision Tree <br>
2.4 Neural Network

## 1. Regression Model

## 1.1 Linear Regression

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [None]:
# import model
from sklearn.linear_model import LinearRegression

#instantiate
linReg = LinearRegression()

start_time = time.time()
# fit our linear model to the training data
linReg_model = linReg.fit(X_train, y_train)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
#get coefficient values
coeff_df = pd.DataFrame(linReg.coef_, X_columns, columns=['Coefficient'])  
coeff_df

All features seem to have little effect on the wine quality. <br> <br>

Because the features were standardized, the coefficients suggest that an increase of one standard deviation in any feature changes the wine [“quality_class”] by less than 0.12 units. Similarly, for wine [“quality”], although the coefficients are higher, they remain low, with alcohol having the highest coefficient of 0.3.

In [None]:
#validate model
y_pred = linReg.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(10)

In [None]:
df1.plot(kind='bar',figsize=(5,5))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The mean squared error for the model (0.30) is rather low, which suggests reasonable prediction accuracy. However, it should be compared with the error on the training data before overfitting can be ruled out. <br> <br>

The root mean squared error (0.55) is slightly less than 10% of the mean wine quality (5.63), which suggests that the model can make reasonable, although not entirely accurate, predictions. <br> <br>

However, we have to keep in mind that the correlations between the features and wine quality are rather low.
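
The relationship between the figures quoted above is simple arithmetic and can be checked directly. The snippet only reuses the two numbers from the text (an MSE of 0.30 and a mean quality of 5.63); it does not rerun the model.

```python
import numpy as np

# Figures quoted in the text above.
mse = 0.30
mean_quality = 5.63

# RMSE is the square root of MSE, so it is on the same scale as the target.
rmse = np.sqrt(mse)
relative_rmse = rmse / mean_quality

print(f"RMSE: {rmse:.2f}")                          # RMSE: 0.55
print(f"RMSE / mean quality: {relative_rmse:.1%}")  # RMSE / mean quality: 9.7%
```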

In [None]:
# print the intercept and coefficients
print('Intercept: ',linReg.intercept_)
print('r2 score: ',linReg.score(X_train, y_train))

The R2 score on the training set (0.22) is small (i.e. the residuals are large); only 22% of the variance in wine quality is explained by the features.
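
To make the "share of variance explained" reading concrete, the sketch below computes R2 by hand as 1 - SS_res / SS_tot and compares it with `LinearRegression.score`. It runs on synthetic data (the `X_demo`, `y_demo` arrays and the helper `r2_manual` are assumptions made for illustration), and it reports the score on both the training and the test split, since the two can differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a weak signal (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=400)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)
reg = LinearRegression().fit(Xd_train, yd_train)

def r2_manual(y_true, y_hat):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_train = r2_manual(yd_train, reg.predict(Xd_train))
r2_test = r2_manual(yd_test, reg.predict(Xd_test))
print("train R2:", r2_train)
print("test R2: ", r2_test)
```

Reporting the test-set value alongside the training value gives a quick indication of whether the fit generalises.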

## 2. Classification Models

In [None]:
sns_plot = sns.distplot(dataset['quality'])

In [None]:
#the dataset contains 6 unique values.
len(dataset['quality'].unique())

## 2.1 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logReg=LogisticRegression(solver='lbfgs', multi_class='multinomial', random_state=42)

start_time = time.time()
# Building a Logistic Regression Model
logReg.fit(X_train, y_train)

#print duration of model
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# Calculate Accuracy Score
y_pred = logReg.predict(X_test)
print('Accuracy score: ', accuracy_score(y_test, y_pred))

In [None]:
#Calculate Confusion Matrix
print('confusion matrix: ','\n',confusion_matrix(y_test,y_pred, labels=[1,2,3]))

## 2.1.1 Logistic Regression with PCA

In [None]:
#apply pca
start_time = time.time()

# Building a Logistic Regression Model
logReg.fit(Xpc_train, ypc_train)

#print duration of model
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# Calculate Accuracy Score
y_pred = logReg.predict(Xpc_test)
print('Accuracy score with PCA applied: ', accuracy_score(ypc_test, y_pred))

In [None]:
# Calculate Confusion Matrix
print('confusion matrix: ','\n',confusion_matrix(ypc_test,y_pred, labels=[1,2,3]))

An accuracy score of 84.5% is reasonably good for Logistic Regression as a classification technique. Out of the 400 test samples, 338 were predicted correctly and 62 were misclassified. When PCA was applied to reduce the dimensions of the dataset, the accuracy score did not improve; it decreased marginally to 82.75%, with 69 classification errors on the test data. <br> <br>
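
The accuracy figures follow directly from the error counts, and in general accuracy is the diagonal of the confusion matrix divided by its total. The snippet below checks the quoted numbers and then shows the general relationship on a tiny made-up label set (`y_true` and `y_hat` are hypothetical, not predictions from this notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Counts quoted in the text above.
n_test = 400
errors_plain, errors_pca = 62, 69
acc_plain = (n_test - errors_plain) / n_test
acc_pca = (n_test - errors_pca) / n_test
print(acc_plain, acc_pca)  # 0.845 0.8275

# General relationship, shown on a small hypothetical example.
y_true = np.array([1, 2, 2, 2, 3, 3, 2, 1])
y_hat = np.array([2, 2, 2, 2, 3, 2, 2, 2])
cm = confusion_matrix(y_true, y_hat, labels=[1, 2, 3])

# Correct predictions sit on the diagonal of the confusion matrix.
acc_from_cm = np.trace(cm) / cm.sum()
print(cm)
print(acc_from_cm)
```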

The advantages of Logistic Regression are that it is efficient, does not require much computational resources, is highly interpretable, and can produce predicted probabilities for the possible outcomes. <br> <br>

Feature engineering is important when applying Logistic Regression. Each sample must belong to exactly one category, so the categories must be mutually exclusive. There must be no missing values in the dataset. For Logistic Regression to work well, the independent variables should not be correlated with each other (i.e. no multicollinearity). <br> <br>

The disadvantages of Logistic Regression are that it requires a large number of samples and that it cannot be used to predict continuous values; it can only predict a categorical outcome.
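
The predicted probabilities mentioned above are available through `predict_proba`. The sketch below uses a synthetic three-class problem from `make_classification` as a stand-in for the wine classes; the data, the variable names and the `max_iter` setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data (hypothetical stand-in for quality_class).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_demo = StandardScaler().fit_transform(X_demo)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

clf = LogisticRegression(max_iter=1000, random_state=42).fit(Xd_train, yd_train)

# One row per sample, one column per class (ordered as in clf.classes_).
proba = clf.predict_proba(Xd_test[:5])
print(clf.classes_)
print(np.round(proba, 3))

# The predicted label is the class with the highest probability.
predicted = clf.classes_[np.argmax(proba, axis=1)]
print(predicted)
```

Looking at the probabilities rather than only the predicted label shows how confident the model is about borderline samples.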

## 2.2 K-NN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
k_array = np.arange(1, 17, 2)
for k in k_array:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    y_pred=knn.predict(X_test)
    ac = accuracy_score(y_test, y_pred)
    print('n_neighbours: ',k)
    print('accuracy score: ',ac)
    print('confusion matrix: ','\n',confusion_matrix(y_test, y_pred))
    print('-------------------------------')

## 2.2.1 K-NN with PCA

In [None]:
#apply pca
k_array = np.arange(1, 17, 2)
for k in k_array:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(Xpc_train, ypc_train)
    y_pred=knn.predict(Xpc_test)
    ac = accuracy_score(ypc_test, y_pred)
    print('n_neighbours: ',k)
    print('accuracy score: ',ac)
    print('confusion matrix: ','\n',confusion_matrix(ypc_test, y_pred))
    print('-------------------------------')

## 2.3 Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()

In [None]:
#train model
start_time = time.time()
dt.fit(X_train,y_train)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# Calculate Accuracy Score
dt_predict = dt.predict(X_test)
dt_acc_score = accuracy_score(y_test, dt_predict)
print(dt_acc_score)

In [None]:
# Calculate Confusion Matrix
dt_conf_matrix = confusion_matrix(y_test, dt_predict)
print('confusion matrix: ','\n',dt_conf_matrix)

In [None]:
#training with Gini
def decTreeScore2(crit = 'gini',  maxDepth = 2, minSamples = 1, minSplit = 2):
    dect = DecisionTreeClassifier(criterion = crit, max_depth = maxDepth, min_samples_leaf = minSamples, 
                                 min_samples_split = minSplit, random_state= 42)
    dect.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, dect.predict(X_test))
    print(accuracy)
    return accuracy

In [None]:
start_time=time.time()
decTreeScore2()
today=date.today()
print("---%s seconds---"% (time.time()-start_time))

In [None]:
decTreeScore2(crit = 'entropy')
#if we use entropy to calculate information gain instead of the Gini score, the accuracy drops

In [None]:
# find the max allowed depth for the decision tree
for i in np.arange(1, 15, 1):
    decTreeScore2(maxDepth = i)

In [None]:
# find min_samples_leaf of the tree
for i in np.arange(1, 10, 1):
    decTreeScore2(minSamples = i)

In [None]:
# find minimum_samples_split of the tree
for i in np.arange(2, 10,1):
    decTreeScore2(minSplit = i)

In [None]:
# decision tree model
# import graphviz and sklearn.tree
from sklearn import tree
import graphviz
from graphviz import Source

In [None]:
dot_data = tree.export_graphviz(dt, out_file=None, max_depth=2,class_names=True,feature_names= X_columns, filled=True, rounded=True)
graph = graphviz.Source(dot_data) 
graph

## 2.3.1 Decision Tree with PCA

In [None]:
#apply pca
dt = tree.DecisionTreeClassifier(max_depth=2)
dt.fit(Xpc_train, ypc_train)

In [None]:
#training with Gini
def decTreeScore2(crit = 'gini',  maxDepth = 2, minSamples = 1, minSplit = 2):
    dect = DecisionTreeClassifier(criterion = crit, max_depth = maxDepth, min_samples_leaf = minSamples, 
                                 min_samples_split = minSplit, random_state= 42)
    dect.fit(Xpc_train, ypc_train)
    accuracy = accuracy_score(ypc_test, dect.predict(Xpc_test))
    print(accuracy)
    return accuracy

In [None]:
start_time=time.time()
decTreeScore2()
today=date.today()
print("---%s seconds---"% (time.time()-start_time))

In [None]:
decTreeScore2(crit = 'entropy')
#if we use entropy to calculate information gain instead of the Gini score, the accuracy drops

In [None]:
# use different maximum depth of the tree
for i in np.arange(1, 15, 1):
    decTreeScore2(maxDepth = i)

In [None]:
# use different min_samples_leaf values of the tree
for i in np.arange(1, 10, 1):
    decTreeScore2(minSamples = i)

In [None]:
dot_data = tree.export_graphviz(dt, out_file=None, max_depth=2,class_names=True, filled=True, rounded=True)
graph = graphviz.Source(dot_data) 
graph

## 2.4 Neural Network

A neural network is a type of machine learning model made up of layers of connected units that learn relationships in a given data set; networks with many layers are referred to as deep learning. For this assignment we will use the Keras library to construct a neural network that estimates the quality of wine in our chosen data set.

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

In [None]:
#In this step we will be importing and preparing dataset that is to be analyzed, in this case we will be using
#‘winequality-red.csv’ dataset. 
#dataset = pd.read_csv('winequality-red.csv',sep=';')
dataset['quality_class'] = dataset['quality'].apply(lambda value: 1 if value < 5 else 2 if value < 7 else 3)
dataset['quality_class'] = pd.Categorical(dataset['quality_class'], categories=[1,2,3])
dataset['quality_class'] = dataset['quality_class'].astype(int)
dataset.head()

In [None]:
quality_label_sums= dataset['quality_class'].value_counts()
#share of each class as a percentage of all rows
quality_label_percentage = quality_label_sums / len(dataset) * 100
print(quality_label_sums)
print(quality_label_percentage)

In [None]:
#visualize quality_class
j = sns.countplot(x='quality_class', data=dataset)
plt.show()

In [None]:
dataset['quality_class'] = dataset['quality_class'].astype(int)
dataset = pd.get_dummies(dataset, columns=['quality_class'])
dataset.head()

Next, we determine the input and output variables of our dataset. We also scale the features using StandardScaler() from the sklearn library so that each feature has a mean of 0 and a standard deviation of 1.
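
The effect of StandardScaler can be checked on a small example. The sketch below uses a deliberately skewed synthetic feature (an assumption for illustration) to show that scaling sets the mean and standard deviation but does not change the shape of the distribution. It also fits the scaler on the training split only, which is a common variant of the approach used in this notebook, where the scaler is fitted before the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A skewed (exponential) feature: hypothetical data, not the wine table.
rng = np.random.default_rng(0)
X_demo = rng.exponential(scale=2.0, size=(1000, 1))
Xd_train, Xd_test = train_test_split(X_demo, random_state=42)

# Learn mean and standard deviation from the training split only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(Xd_train)
Xd_train_s = scaler.transform(Xd_train)
Xd_test_s = scaler.transform(Xd_test)

def skewness(a):
    """Sample skewness: mean of the cubed standardized values."""
    a = a.ravel()
    return np.mean(((a - a.mean()) / a.std()) ** 3)

print("train mean/std:", Xd_train_s.mean(), Xd_train_s.std())
print("test mean/std: ", Xd_test_s.mean(), Xd_test_s.std())
print("skewness before/after:", skewness(Xd_train), skewness(Xd_train_s))
```

The test split ends up close to, but not exactly at, mean 0 and standard deviation 1, because its statistics were not used to fit the scaler.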

In [None]:
Xn = dataset.iloc[:,0:11].values
Yn = dataset.iloc[:,12:].values

Xn = StandardScaler().fit_transform(Xn)

Xn_train, Xn_test, Yn_train, Yn_test = train_test_split(Xn, Yn,random_state=42)

print(Xn_train.shape, Yn_train.shape, Xn_test.shape, Yn_test.shape)

After preparing our dataset, we move on to create our neural network model with the Keras library, which will be used to determine wine quality.

We use the Sequential class from keras.models, which lets us build the model as a linear stack of layers added one at a time.
We use Dense from keras.layers to add fully connected layers to the model.

In [None]:
model = Sequential()
model.add(Dense(30, input_dim=11, activation='sigmoid'))
model.add(Dense(50, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(100, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))

In [None]:
model.summary()

In [None]:
model.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics = ["accuracy"])

start_time = time.time()
#train model
history = model.fit(x = Xn_train, y = Yn_train,batch_size=128, epochs = 800,verbose=1,validation_data=(Xn_test, Yn_test))

#get model training duration
today= date.today()
print('---%s seconds---'%(time.time()-start_time))

In [None]:
# Calculation of Loss and Accuracy metrics
loss, accuracy = model.evaluate(Xn_test, Yn_test)
print('loss: ', loss, ', accuracy: ', accuracy)

In [None]:
predictions = model.predict(Xn_test)
print('\nPrediction:')
for i in np.arange(len(predictions)):
    print('Actual: ', Yn_test[i], ', Predicted: ', predictions[i])

predictions=np.argmax(predictions, axis=1)
Yn_test = np.argmax(Yn_test, axis=1)

In [None]:
# Training History - Model Accuracy
print(history.history.keys())
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
# Training History - Model Loss
print(history.history.keys())
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
#Calculation of confusion matrix
#from sklearn.metrics import confusion_matrix
confusion_matrix(Yn_test, predictions)

## 2.4.1 Neural Network with PCA

In [None]:
Y = dataset.iloc[:,12:].values

X_train, X_test, Y_train, Y_test = train_test_split(pc_X, Y, random_state=42)

print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

In [None]:
model = Sequential()
model.add(Dense(30, input_dim=6, activation='sigmoid'))
model.add(Dense(50, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(100, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))

In [None]:

model.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics = ["accuracy"])

start_time = time.time()
history = model.fit(x = X_train, y = Y_train,batch_size=128, epochs = 800,verbose=1,validation_data=(X_test, Y_test))

today= date.today()
print('---%s seconds---'%(time.time()-start_time))

In [None]:
# Calculation of Loss and Accuracy metrics
loss, accuracy = model.evaluate(X_test, Y_test)
print('loss: ', loss, ', accuracy: ', accuracy)

In [None]:
predictions = model.predict(X_test)
print('\nPrediction:')
for i in np.arange(len(predictions)):
    print('Actual: ', Y_test[i], ', Predicted: ', predictions[i])
    
predictions=np.argmax(predictions, axis=1)
Y_test = np.argmax(Y_test, axis=1)

In [None]:
# Training History - Model Accuracy
print(history.history.keys())
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
# Training History - Model Loss
print(history.history.keys())
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
#Calculation of confusion matrix
confusion_matrix(Y_test, predictions)

# Unsupervised Machine Learning

We will apply three clustering ML models to the dataset to try to uncover possible clusters.

1. K-Means (centroid based)
2. Hierarchical Agglomerative Clustering (similarity based)
3. DBSCAN (density based)

## 1. K-Means

In [None]:
#import libraries
from sklearn.metrics import f1_score
from sklearn.cluster import KMeans

In this model, the entire dataset is used as training data. <br>
The elbow method is then used to find an optimal number of clusters, “K”.

In [None]:
#try to find optimal k using the elbow method
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i,init='k-means++',max_iter=300, n_init=12, random_state=0)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
f3, ax = plt.subplots(figsize=(8, 6))
plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

A “K” value of 2 will be used, as the elbow in the graph above appears at around 2. <br> <br> <br>


First, clustering will be performed with K-Means on the dataset without applying principal component analysis (PCA). Note that the dataset has 11 dimensions.

In [None]:
#Applying kmeans to the dataset, set k=2
kmeans = KMeans(n_clusters = 2)
start_time = time.time()
clusters = kmeans.fit_predict(X_scaled)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))
labels = kmeans.labels_

Training Time – 0.062 seconds

In [None]:
#2D plot
colors = 'rgbkcmy'
for i in np.unique(clusters):
    plt.scatter(X_scaled[clusters==i,0],
               X_scaled[clusters==i,1],
               color=colors[i], label='Cluster' + str(i+1))
plt.legend()

It can be seen that clusters are not well separated. Some members of Cluster 2 can be seen in Cluster 1 and vice versa.

In [None]:
# Visualise the clusters considering fixed acidity, residual sugar, and alcohol
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=15, azim=40)

#colour each point by its K-Means cluster
ax.scatter(X_scaled[:,0], X_scaled[:,3], X_scaled[:,10],c=clusters, edgecolor='k')
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('K=2: Acidity, Sugar, Alcohol', size=22)

Now the silhouette score of the model will be measured. The silhouette score ranges from -1 to +1. <br>
A high silhouette score indicates that objects are well matched to their own cluster and poorly matched to neighbouring clusters. <br>
(The higher the silhouette score, the better the clustering.)
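
To give a feel for the scale, the sketch below compares the silhouette score of K-Means on two synthetic data sets generated with `make_blobs`: one with tight, well-separated groups and one with wide, overlapping groups. The data and the helper `kmeans_silhouette` are illustrative assumptions and are unrelated to the wine results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def kmeans_silhouette(cluster_std):
    """Silhouette score of a 2-cluster K-Means fit on two synthetic blobs."""
    X_demo, _ = make_blobs(
        n_samples=300, centers=[[-5, 0], [5, 0]],
        cluster_std=cluster_std, random_state=0,
    )
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
    return silhouette_score(X_demo, labels, metric="euclidean")

tight = kmeans_silhouette(0.5)   # compact, well-separated groups
loose = kmeans_silhouette(5.0)   # wide, overlapping groups

print("well separated:", tight)
print("overlapping:   ", loose)
```

The well-separated case scores close to +1, while the overlapping case scores much lower even though K-Means still returns two clusters.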

In [None]:
#evaluate model
from sklearn.metrics import pairwise_distances
metrics.silhouette_score(X_scaled, labels, metric='euclidean')

The silhouette score obtained is considered low. It means clusters are neither dense nor well separated. <br>
Next, let’s measure the inertia value.

In [None]:
kmeans.inertia_

An extremely high inertia value of 14330.119 was obtained. This may be indicative of the “curse of dimensionality”. <br>
We are using 11 dimensions of data in this model. <br>
In this case, we will explore the model again using PCA (principal component analysis).
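
Inertia is a sum of squared distances, so its size depends on the number of samples and features. One way to put it in context is to compare it with the total sum of squares around the overall mean. The sketch below does this on random stand-in data of the same shape (1599 x 11); it assumes the inputs were standardized with StandardScaler, in which case the total sum of squares equals rows x columns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in with the same shape as the wine features (hypothetical data).
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(1599, 11)))

# Total sum of squares around the mean: the inertia of a single cluster.
total_ss = np.sum((X_demo - X_demo.mean(axis=0)) ** 2)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
within_share = km.inertia_ / total_ss
print("total sum of squares:", total_ss)
print("share left within clusters:", within_share)

# The same ratio for the inertia quoted above, under the stated assumption.
quoted_ratio = 14330.119 / (1599 * 11)
print("quoted inertia / total:", round(quoted_ratio, 3))  # 0.815
```

Under that assumption, the inertia quoted above corresponds to roughly 81% of the total sum of squares remaining within the two clusters, which is a more comparable figure than the raw value.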

## 1.1 K-Means with PCA

Our purpose of applying principal component analysis is to reduce dimension. <br>
In this dataset, we reduced the 11-dimensional data to 6-dimensional data during PCA.
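
One common way to choose the number of components is to look at the cumulative explained variance and keep the smallest number that reaches a target share. The sketch below illustrates this on synthetic 11-dimensional data; the data, the 0.85 target and the variable names are assumptions for illustration and do not reproduce the choice of 6 components made here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 11-feature data driven by 4 hidden factors plus noise (hypothetical).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 11))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(500, 11))
X_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target = 0.85
n_keep = int(np.searchsorted(cumulative, target) + 1)

print(np.round(cumulative, 3))
print("components needed for", target, ":", n_keep)

X_reduced = PCA(n_components=n_keep).fit_transform(X_demo)
print(X_reduced.shape)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which selects the number of components by explained variance in a similar way.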

In [None]:
#Applying kmeans to the dataset, set k=2
kmeans = KMeans(n_clusters = 2)
start_time = time.time()
clusters = kmeans.fit_predict(pc_X)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))
labels = kmeans.labels_

Training time – 0.05 seconds. Training time is observed to have reduced slightly.

In [None]:
#2D plot
colors = 'rgbkcmy'
for i in np.unique(clusters):
    plt.scatter(pc_X[clusters==i,0],
               pc_X[clusters==i,1],
               color=colors[i], label='Cluster' + str(i+1))
plt.legend()

After implementing PCA, it can be seen that clustering is improved. So it is expected to see a higher silhouette score.

In [None]:
#evaluate model
metrics.silhouette_score(pc_X, labels, metric='euclidean')

As expected, we can see an improvement in the silhouette score. However, it is still considered low, which means that the clusters still overlap or that some points are incorrectly grouped.

In [None]:
kmeans.inertia_

The inertia value has also decreased but is still extremely high. <br> <br>

K-Means gives a poor clustering result for this high-dimensional data. Even with PCA, the silhouette score improves only to some extent and remains low. The inertia value is also observed to be extremely high; ideally, it should be as low as possible. Hence, we can conclude that this model is not a good fit for the data.

## 2. Agglomerative Clustering

To pick the best number of clusters for agglomerative clustering, we need to draw the dendrogram.

In [None]:
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
#plot dendrogram to determine number of clusters
plt.figure(figsize=(25, 10))
plt.title('Dendrogram')
plt.xlabel('Wine Details')
plt.ylabel('Euclidean distances')

dendrogram (
    linkage(X_scaled, 'ward')  # generate the linkage matrix
    ,leaf_font_size=8 # font size for the x axis labels
)
plt.axhline(y=8)
plt.show()

From the dendrogram above, we can see that the clusters after the 3rd branch are very similar to each other (i.e. shorter in height). The dataset should optimally have 3 clusters, where the distance between the clusters is the highest.
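
The same reading of the dendrogram can be done numerically from the linkage matrix: the merge heights are stored in its third column, and the largest jump between consecutive heights suggests where to cut. The sketch below illustrates this on three synthetic blobs (hypothetical data); `suggested_k` is a heuristic for illustration and not the procedure used above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups (hypothetical data).
X_demo, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=1.0, random_state=0,
)

# Ward linkage; each row of Z describes one merge, column 2 is its height.
Z = linkage(X_demo, method="ward")
heights = Z[:, 2]
print("last merge heights:", np.round(heights[-5:], 2))

# Cutting inside the largest gap between consecutive merges:
# after merge i (0-based) there are n - (i + 1) clusters left.
gaps = np.diff(heights)
i = int(np.argmax(gaps))
suggested_k = len(X_demo) - (i + 1)
print("suggested number of clusters:", suggested_k)

# Flat cluster labels (1..k) for the chosen number of clusters.
labels_k = fcluster(Z, t=suggested_k, criterion="maxclust")
print("cluster sizes:", np.bincount(labels_k)[1:])
```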

In [None]:
clustering = AgglomerativeClustering(linkage="ward", n_clusters=3)
#train model
start_time = time.time()
clustering.fit(X_scaled)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
#visualize clustering
colors = 'rgbkcmy'

for i in np.unique(clustering.labels_):
    plt.scatter(X_scaled[clustering.labels_ == i, 0], X_scaled[clustering.labels_ == i, 1],
                color=colors[i], label='Cluster ' + str(i + 1))

plt.legend()
plt.title('Hierarchical Clustering')
plt.xlabel(X_columns[0])
plt.ylabel(X_columns[1])
plt.show()

From the graph above, we can tell that the clusters are not clearly defined. Let's explore Agglomerative Clustering again with PCA.

In [None]:
#evaluate model
labels = clustering.labels_
metrics.silhouette_score(X_scaled, labels, metric='euclidean')

## 2.1 Agglomerative Clustering with PCA

In [None]:
clustering = AgglomerativeClustering(linkage="ward", n_clusters=3)
start_time = time.time()
clustering.fit(pc_X)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
#visualize clustering
colors = 'rgbkcmy'

for i in np.unique(clustering.labels_):
    plt.scatter(pc_X[clustering.labels_ == i, 0], 
                pc_X[clustering.labels_ == i, 1],
                color=colors[i], label='Cluster ' + str(i + 1))

plt.legend()

plt.title('Hierarchical Clustering')
plt.xlabel(pc_columns[0])
plt.ylabel(pc_columns[1])
plt.show()

Although the clusters are not entirely segregated, they appear to be clearer after applying PCA.

In [None]:
#evaluate model
labels = clustering.labels_
metrics.silhouette_score(pc_X, labels, metric='euclidean')

## 3. DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

A higher min_samples or a lower eps means that a higher density is necessary to form a cluster.
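
The effect of the two parameters can be seen by counting clusters and noise points for a few settings. The sketch below does this on synthetic blobs (hypothetical data); the `summarize` helper and the parameter values are illustrative assumptions, not tuned settings for the wine data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data with three groups (hypothetical).
X_demo, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# Smaller eps demands denser neighbourhoods, so more points become noise.
results = {}
for eps in (0.1, 0.3, 1.0):
    results[eps] = summarize(eps, min_samples=7)
    print("eps =", eps, "-> clusters, noise:", results[eps])

# Larger min_samples has the same effect at a fixed eps.
few = summarize(0.3, min_samples=3)
many = summarize(0.3, min_samples=20)
print("min_samples=3:", few, " min_samples=20:", many)
```

Scanning a small grid like this before fixing eps and min_samples makes it easier to see when the settings collapse everything into one cluster or label most points as noise.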

In [None]:
dbscan = DBSCAN(eps=2, min_samples=7)
start_time = time.time()
clusters= dbscan.fit_predict(X_scaled)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
np.unique(clusters)

In [None]:
colors = 'rgbkcmy'
ax = plt.axes(projection='3d')

for i in np.unique(clusters):
    label = 'Outlier' if i == -1 else 'Cluster ' + str(i + 1)
    ax.scatter3D(X_scaled[clusters==i,0], X_scaled[clusters==i,1],X_scaled[clusters==i,4],
                #color=colors[i], 
                 label=label)

plt.legend()
plt.show()

In [None]:
#evaluate model
labels = dbscan.labels_
metrics.silhouette_score(X_scaled, labels, metric='euclidean')

## 3.1 DBSCAN with PCA

In [None]:
dbscan = DBSCAN(eps=2, min_samples=7)
start_time = time.time()
clusters= dbscan.fit_predict(pc_X)
today = date.today()
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
np.unique(clusters)

In [None]:
ax = plt.axes(projection='3d')

for i in np.unique(clusters):
    label = 'Outlier' if i == -1 else 'Cluster ' + str(i + 1)
    ax.scatter3D(pc_X[clusters==i,0], 
                 pc_X[clusters==i,1],
                 pc_X[clusters==i,2],
                 label=label)

plt.legend()
plt.show()

In [None]:
#evaluate model
labels = dbscan.labels_
metrics.silhouette_score(pc_X, labels, metric='euclidean')