******
- Wine DataSet Clustering
- author:"Xavier Martinez Bartra"
- date: "November 2020"
******

In [None]:
from plotly.offline import init_notebook_mode, iplot_mpl, download_plotlyjs, plot, iplot
import plotly.express as px
import plotly.figure_factory as ff
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
init_notebook_mode(connected=True)
import pandas_profiling
import statsmodels.formula.api as smf  # distinct alias: statsmodels.api is imported as sm below
import missingno as msno
from sklearn.preprocessing import LabelEncoder
from statsmodels.compat import lzip
import statsmodels.api as sm
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

**Python was used to study the DataSet. The Pandas library was used for data manipulation and processing, and the Plotly Express library for data visualization. The activity studies the nature of the variables and the correlations between them. It explores whether K-means can generate significant clusters of the wine instances, using the Elbow method and the mean Silhouette score. A basic interpretation of the generated clusters and several visualizations with PCA are also included. Given the Silhouette scores obtained with K-means (and with the other two algorithms), we can conclude that the Wine Quality DataSet is not well suited to generating cohesive, highly significant clusters.**

This dataset is also available in the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality.

For more information see [Cortez et al., 2009].

## Attributes (based on physicochemical tests):

 - 1 - fixed acidity
 - 2 - volatile acidity
 - 3 - citric acid
 - 4 - residual sugar
 - 5 - chlorides
 - 6 - free sulfur dioxide
 - 7 - total sulfur dioxide
 - 8 - density
 - 9 - pH
 - 10 - sulphates
 - 11 - alcohol

## Target (based on sensory data):
- 12 - quality (score between 0 and 10)

## 1. DataSet Loading and Examination

In [None]:
# Load the data
data=pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

In [None]:
df=data.copy()

In [None]:
df.info()

We note that the DataSet has neither missing nor null values. All variables are numeric. The quality variable is an integer.

In [None]:
pandas_profiling.ProfileReport(df)

With Pandas Profiling we can examine the distributions of the variables.

We observe that the vast majority of wines have ratings of 5 and 6.

In [None]:
df.describe()

The maximum value of quality is 8. The minimum is 3. The average of the scores is 5.636023.

In [None]:
df.quality.value_counts(normalize=True)

The vast majority of wines have scores of 5 (42.59%) and 6 (39.90%).

In [None]:
fig = px.imshow(df.corr(),x=list(df.corr().columns),y=list(df.corr().columns),width=900, 
                height=700,title='Correlation Matrix', color_continuous_scale=px.colors.diverging.Tropic).update_layout( paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')
fig.show()

The correlation matrix offers us a summary of the correlations between the variables.

Quality correlates positively mainly with alcohol, sulphates and citric acid; its correlations with most of the remaining attributes are negative or close to zero.


## 2. K-means Clustering

For the clustering activity we use only the attributes of the DataSet, leaving out the target variable quality. Our objective is to generate groups that are relatively cohesive and, at the same time, clearly different from each other. In other words, we aim for groupings with high intragroup similarity and low intergroup similarity.

In [None]:
# Build the attribute matrix without the target
X=df.iloc[:,:-1]

In [None]:
X.head()

We standardize the variables.

In [None]:
X = StandardScaler().fit_transform(X)
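
As a minimal sketch of what this standardization step does, the block below reproduces the z-score by hand on a small toy matrix and compares it with StandardScaler. The values in `toy` are hypothetical and only stand in for the wine attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix standing in for a few wine attributes (hypothetical values).
toy = np.array([[7.4, 0.70, 9.4],
                [7.8, 0.88, 9.8],
                [11.2, 0.28, 9.8],
                [7.4, 0.66, 10.0]])

# Column-wise z-score: subtract the mean, divide by the population std (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

# Both routes should agree, and each column ends up with mean 0 and std 1.
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After this step every attribute contributes on the same scale to the Euclidean distances that K-means relies on.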

## 3. Selecting the optimal number of clusters k

In this section, we will examine the two most common methods for choosing the optimal number of clusters in K-means.

1) Elbow method
2) Silhouette score

To perform the Elbow method, we run k-means once for each value of k, increasing k at each iteration, and record the inertia (the sum of squared distances from each point to its centroid). In the figure we observe that the inertia decreases as k increases: as more centroids are included, the distance from each point to its closest centroid shrinks.

We have to find the point where the inertia curve begins to bend, and choose a value of k that strikes a reasonable balance between a low inertia and a suitable number of clusters.
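
To make the definition of inertia concrete, here is a small sketch that recomputes it by hand and compares it with the `inertia_` attribute of a fitted KMeans model. The data come from `make_blobs` and are purely synthetic; `X_demo`, `n_init=10` and `random_state=0` are assumptions made for the illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the standardized wine attributes.
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# Centroid of the cluster each point was assigned to.
assigned_centroids = km.cluster_centers_[km.labels_]

# Inertia: sum of squared Euclidean distances from each point to its centroid.
manual_inertia = ((X_demo - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two numbers should match up to floating-point error, which is what the elbow plot tracks as k grows.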

### Elbow Method

In [None]:
# Compute the inertia for k = 1 to 14
inertia = []
K = range(1,15)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    inertia.append(kmeanModel.inertia_)

In [None]:
# Collect the inertias in a DataFrame indexed by k
inertia = pd.DataFrame({'inertia': inertia}, index=pd.RangeIndex(start=1, stop=15))
inertia['K'] = inertia.index

In [None]:
fig = px.line(inertia, x="K", y="inertia", title='Elbow plot K-clusters vs Inertia').update_layout( paper_bgcolor='rgb(243, 243, 243)',                                                                                             
    plot_bgcolor='rgb(243, 243, 243)')
fig.show()

Inertia falls rapidly until k = 5; from there it decreases more slowly and each additional cluster brings a less significant reduction. Having 4 or 5 clusters seems like a good compromise between interpretable clusters and a relatively low level of inertia.

### Silhouette_score

The silhouette coefficient is a metric of the cohesion and separation of the groups. It measures how well an observation fits its cluster based on two factors:

- How close it is to the other observations in its cluster.
- How far it is from the observations of the other clusters.

The silhouette coefficient values vary between -1 and 1.

- A coefficient close to +1 means that the instance is within its own cluster and away from other groups.
- A coefficient close to 0 means that it is close to another cluster.
- A coefficient close to -1 means that the instance may have been assigned to the wrong cluster.

In the scikit-learn implementation, silhouette_score returns the mean of the silhouette coefficients of all the observations.

The silhouette coefficient of each observation = $$ (b - a) / \max(a, b) $$
where a is the mean distance to the other observations of the same cluster (that is, the mean intra-group distance) and b is the mean distance to the observations of the closest other cluster.
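
As a worked sketch of this definition, the block below computes the coefficient (b - a) / max(a, b) by hand for six hypothetical one-dimensional observations and checks the result against silhouette_samples. The toy `points` and `labels` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Six hypothetical 1-D observations split into two clusters.
points = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i):
    """Silhouette coefficient of observation i: (b - a) / max(a, b)."""
    dists = np.abs(points - points[i]).ravel()
    same = labels == labels[i]
    same[i] = False  # exclude the observation itself
    a = dists[same].mean()
    # Mean distance to each other cluster; b is the smallest of them.
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(points))])
print(manual.round(3))
print(np.allclose(manual, silhouette_samples(points, labels)))
```

For the first observation, a = 1.5 and b = 9, which gives (9 - 1.5) / 9 ≈ 0.833.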

In [None]:
# A list to collect the silhouette coefficient for each k
silhouette_coefficients = []

for k in range(2, 15):
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    silhouette_coefficients.append(silhouette_score(X, labels=kmeanModel.labels_))

In [None]:
#create new df 
silhouette_coefficients = pd.DataFrame({'silhouette_coefficients':silhouette_coefficients})

In [None]:
silhouette_coefficients.index=pd.RangeIndex(start=2, stop=15, step=1, dtype=None, copy=False, name=None)

In [None]:
silhouette_coefficients['K']=silhouette_coefficients.index

In [None]:
fig = px.line(silhouette_coefficients, x="K", y="silhouette_coefficients",
              title='Silhouette Score per n=k clusters').update_layout( paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')
fig.show()

- This visualization gives us much more information than inertia. The best option appears to be k = 2, since it has the highest score. k = 4 would also be a good option, given its silhouette score. We will select k = 4, since it gives us more to examine than only two clusters.


- The scores of all the models are relatively poor, which suggests that the data set has little inherent structure from which to generate highly cohesive clusters.

In [None]:
k = 4
kmeans = KMeans(n_clusters=k,random_state=30)
y_pred = kmeans.fit_predict(X)

In [None]:
df['silhouette_samples']=silhouette_samples(X, labels=kmeans.labels_, metric='euclidean')
df['clusters']=y_pred.astype(str)

## 4. Interpretation of clusters

We group the observations by cluster and compute the mean of each variable, in order to interpret the characteristics of each cluster.

In [None]:
df['clusters']=y_pred.astype(str)
# Mean of every variable per cluster (.mean() avoids the warning newer pandas raises for aggregate(np.mean))
df1=df.groupby('clusters').mean()

In [None]:
#We transpose the data to have a better visualization of it.
df1.T

We standardize the grouped means so that variables with different units and scales can be compared directly.

In [None]:
#standardize variables
gc = StandardScaler().fit_transform(df1)

In [None]:
gc=pd.DataFrame(data=gc, index=df1.index,columns=df1.columns)

In [None]:
gc.T.style.background_gradient(cmap='Reds')

In [None]:
gc=gc.T
gc['atributes']=gc.index
gc.columns=['cluster_0','cluster_1','cluster_2','cluster_3','atributes']

We can visualize the average standardized values per attribute for each of the clusters.

In [None]:
fig = px.scatter(gc, x='atributes',y=['cluster_0', 'cluster_1', 'cluster_2', 'cluster_3'],
        title="Standardized mean values per attribute and cluster",
                 color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1,
        marker_size=16,marker_line_color="black",mode='markers+lines')).update_layout( 
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)').add_shape(type='line', x0=-0.5,y0=0, x1=12.5,y1=0,line=dict(color='Black',
    dash="dot"),xref='x',yref='y')

fig.show()

In [None]:
fig = px.bar(gc, x="atributes", y=['cluster_0', 'cluster_1', 'cluster_2', 'cluster_3'],
              title="Bar Plot - Standardized mean values per attribute and cluster",
              color_discrete_sequence=px.colors.qualitative.Pastel).update_layout( 
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)').add_shape(type='line', x0=-0.5, y0=0,x1=12.5,y1=0,
                                                 line=dict(color='Black', dash="dot"),xref='x',yref='y')
fig.show()

- **Cluster 0 (High pH-low density):** It has the lowest levels of density, fixed_acidity and citric_acid, and high pH and volatile_acidity. In quality it is the second best valued cluster, very close to the average.


- **Cluster 1 (Low Dioxides-High Density and Alcohol):** It has notably lower values of free_sulfur_dioxide and total_sulfur_dioxide, and the highest levels of alcohol, fixed acidity and density. In quality it is the best valued cluster, standing 1.65 standard deviations above the quality mean.


- **Cluster 2 (High chlorides and sulphates):** It has the highest average values of citric_acid, chlorides and sulphates and the lowest pH. In quality it is the second worst valued. It is the most cohesive cluster, given its silhouette score.


- **Cluster 3 (High Dioxides and Residual Sugar):** It has notably higher values of residual_sugar, free_sulfur_dioxide and total_sulfur_dioxide. In quality it is the worst valued.

## 5. Silhouette scores of the instances and clusters

Another metric worth analyzing is the distribution of the silhouette samples by cluster: the higher the silhouette scores of a cluster, the more cohesive (less dispersed) it is compared to the other clusters.

In general terms:

- 0.76-1.0: Strong cohesion.

- 0.50-0.75: Moderate cohesion.

- 0.25-0.49: Weak cohesion.

- <0.25: Low cohesion.
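
The bands above can be turned into a small helper for labelling scores. This is only a sketch: `cohesion_label` is a hypothetical function, and since the listed ranges leave small gaps (for example between 0.75 and 0.76), the exact cut-offs used here are an assumption.

```python
def cohesion_label(score):
    """Map a silhouette score to the rough cohesion bands listed above."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("silhouette scores lie in [-1, 1]")
    if score > 0.75:      # assumed boundary for the 0.76-1.0 band
        return "strong"
    if score >= 0.50:
        return "moderate"
    if score >= 0.25:
        return "weak"
    return "low"

# Scores quoted elsewhere in this notebook.
for s in (0.58, 0.12, 0.097):
    print(s, cohesion_label(s))
```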

In [None]:
df['index']=df.index.astype(str)

In [None]:
fig = px.scatter(df, x='silhouette_samples',y='index', title="Silhouette samples per cluster",
                 color='clusters',color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1,marker_line_color="black")).update_layout( paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')

fig.add_shape(type='line',
                x0=0,
                y0=0,
                x1=0,
                y1=1600,
                line=dict(color='Red', dash="dot"),
                xref='x',
                yref='y')

fig.show()

We observe that the instances with the highest silhouette scores belong to cluster 2 (high chlorides and sulphates), and that the vast majority of negative values (outliers) belong to cluster 3. We can therefore conclude that this last cluster has very poor cohesion.

In [None]:
fig = px.box(df, x='silhouette_samples', title="Silhouette samples Box Plots per cluster",
                 color='clusters',orientation='h',
             color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1,
          marker_line_color="black")).update_layout( paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')

fig.show()

- We observe that, indeed, only cluster 2 (high chlorides and sulphates) has moderate cohesion.

- The cohesion of the rest of the clusters is poor or very poor (the median of cluster 3 is 0.097).

- Since these clusters are not very significant, we could rule out clustering as a way of working with this data set.
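
The per-cluster figures read off the box plots can also be tabulated. The sketch below does this on synthetic `make_blobs` data rather than on the wine DataSet, so the numbers it prints are not the ones discussed above; the column names in `summary` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic, deliberately overlapping blobs (an assumption for illustration).
X_demo, _ = make_blobs(n_samples=400, centers=4, cluster_std=2.5, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X_demo)

# Median silhouette, share of negative silhouettes and size of each cluster.
summary = (
    pd.DataFrame({"cluster": labels, "silhouette": silhouette_samples(X_demo, labels)})
    .groupby("cluster")["silhouette"]
    .agg(median="median", share_negative=lambda s: (s < 0).mean(), size="size")
)
print(summary)
```

A cluster with a low median and a large share of negative values would be a candidate for the "very poor cohesion" reading given above.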

## 6. Visualizing the clusters with PCA

In [None]:
# Scale the matrix by the mean (μ) and standard deviation (σ) of each column (X is already standardized, so this is effectively a no-op)
df_=scale(X)

In [None]:
# Import PCA from sklearn, set the number of principal components and call fit() on the scaled matrix
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(df_)

In [None]:
# Now we can transform our DataSet
df_2d = pca.transform(df_)

In [None]:
# We generate a new Dataframe with our two components as variables
df_2d = pd.DataFrame(df_2d)
df_2d.columns = ['PC1','PC2']

In [None]:
df_2d['clusters']=y_pred.astype('str')
df_2d['silhouette_samples']=df['silhouette_samples']

In [None]:
fig=px.scatter(df_2d, x='PC1',y='PC2',title="PC1 vs PC2 per cluster",
                    color='clusters',color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1, marker_line_color="black")).update_layout( paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')

fig.show()

Visualization with the principal components shows a relatively good separation of the clusters.

- Cluster 2 (high chlorides, citric acid and sulphates) seems to have observations with values very close to those of cluster 1 (Low Dioxides, best valued).

- All the highest values of PC2 are found in cluster 3, the worst valued.
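
When reading a two-component projection like this one, it can help to check how much of the total variance the two components retain. The following is a sketch on hypothetical correlated data with 11 columns (the same number as the wine attributes); `explained_variance_ratio_` is the scikit-learn attribute that reports the share of variance per component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical correlated data: 3 latent factors mixed into 11 noisy columns.
latent = rng.normal(size=(500, 3))
X_demo = latent @ rng.normal(size=(3, 11)) + 0.5 * rng.normal(size=(500, 11))
X_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo)

# Share of the total variance kept by PC1 and PC2, and their sum.
ratios = pca_demo.explained_variance_ratio_
retained = float(ratios.sum())
print(ratios.round(3), round(retained, 3))
```

If the retained share is low, clusters that overlap in the PC1 vs PC2 plot may still be separated along the discarded components, and vice versa.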

In [None]:
fig=px.scatter_3d(df_2d, x='PC1',y='PC2',z='silhouette_samples',title="PC1 vs PC2 vs Silhouette Score per cluster",
                    color='clusters',color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1, marker_line_color="black")).update_layout( paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')
fig.show()

In 3 dimensions (with the silhouette scores as the height dimension), we observe that almost all of the observations with the best relative silhouette scores belong to clusters 0 and 1.

## 7. Agglomerative hierarchical clustering

We will now look at another clustering technique, Agglomerative Hierarchical Clustering.

This algorithm uses a bottom-up approach: each observation starts in its own group, and pairs of groups are merged as one moves up the hierarchy. In general, the merges are determined greedily.

For the AgglomerativeClustering class we set two parameters:

- n_clusters: the number of clusters to find.

- linkage: which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize this criterion. We will use average.
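
To illustrate what the average criterion measures, here is a minimal sketch that computes the average-linkage distance between two groups as the mean of all pairwise Euclidean distances. The function `average_linkage` and the toy groups `A`, `B` and `C` are hypothetical and not part of scikit-learn.

```python
import numpy as np

def average_linkage(A, B):
    """Mean Euclidean distance over all pairs (a, b) with a in A and b in B."""
    diffs = A[:, None, :] - B[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()

A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 4.0]])
C = np.array([[10.0, 10.0]])

d_ab = average_linkage(A, B)
d_ac = average_linkage(A, C)
print(round(d_ab, 4), round(d_ac, 4))
```

Under average linkage, A would be merged with B before C, because their average pairwise distance is smaller.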

In [None]:
from sklearn.cluster import AgglomerativeClustering 

agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(X)

In [None]:
# A list to collect the silhouette coefficient for each k
silhouette_coefficients_ac = []

# The silhouette coefficient needs at least 2 clusters, so k starts at 2
for k in range(2, 10):
    agglom = AgglomerativeClustering(n_clusters=k, linkage='average').fit(X)
    silhouette_coefficients_ac.append(silhouette_score(X, labels=agglom.labels_))

In [None]:
#create new df 
silhouette_coefficients_ac = pd.DataFrame({'silhouette_coefficients':silhouette_coefficients_ac})

In [None]:
silhouette_coefficients_ac.index=pd.RangeIndex(start=2, stop=10, step=1, dtype=None, copy=False, name=None)

In [None]:
silhouette_coefficients_ac['K']=silhouette_coefficients_ac.index

In [None]:
fig = px.line(silhouette_coefficients_ac, x="K", y="silhouette_coefficients",
              title='Silhouette Score per n=k clusters in  Agglomerative clustering ').update_layout( paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')

fig.show()

We observe that we have higher values of the Silhouette score than with K-means.

This time we will choose the k with the highest score, k = 2, with a silhouette score of 0.58.

In [None]:
agglom = AgglomerativeClustering(n_clusters=2, linkage='average').fit(X)

df_2d['clusters_ac'] = agglom.labels_.astype('str')

In [None]:
fig = px.scatter(df_2d, x='PC1',y='PC2', title="PC1 vs PC2 scatter plot Agglomerative Clustering 2K",
                 color='clusters_ac',color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1,marker_line_color="black")).update_layout(paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')
fig.show()

Although PCA shows a few outlying observations assigned to cluster 0, the two clusters are not clearly separated in the projection.

## 8. MeanShift

MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to remove near-duplicates and form the final set of centroids.

The algorithm sets the number of clusters automatically. What it relies on instead is the bandwidth parameter, which dictates the size of the region to search. This parameter can be set manually, or it can be estimated with the provided estimate_bandwidth function, which is called if the bandwidth is not set.
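
The centroid update described above can be sketched in a few lines. This is a simplified illustration with a flat kernel (every point within the bandwidth counts equally), not the scikit-learn implementation; `shift_once`, `mean_shift_mode` and the toy `points` are hypothetical.

```python
import numpy as np

def shift_once(candidate, points, bandwidth):
    """Move a candidate centroid to the mean of the points within `bandwidth` of it."""
    dists = np.linalg.norm(points - candidate, axis=1)
    return points[dists <= bandwidth].mean(axis=0)

def mean_shift_mode(start, points, bandwidth, tol=1e-6, max_iter=100):
    """Repeat the update until the candidate stops moving."""
    candidate = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        updated = shift_once(candidate, points, bandwidth)
        if np.linalg.norm(updated - candidate) < tol:
            break
        candidate = updated
    return candidate

# Two small, well separated groups of hypothetical points.
points = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0], [5.2, 5.0]])

mode_a = mean_shift_mode(points[0], points, bandwidth=1.0)
mode_b = mean_shift_mode(points[3], points, bandwidth=1.0)
print(mode_a, mode_b)
```

Candidates started from points of the same group converge to the same location, which is why near-duplicates are merged afterwards; a smaller bandwidth tends to produce more distinct modes and therefore more clusters.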

In [None]:
from sklearn.cluster import MeanShift, estimate_bandwidth

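# Estimate the bandwidth from the data instead of setting it manually
bandwidth = estimate_bandwidth(X, quantile=0.1)
# Pass bandwidth by keyword: recent scikit-learn versions reject it as a positional argument
ms = MeanShift(bandwidth=bandwidth).fit(X)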

df_2d['clusters_ms'] = ms.labels_.astype('str')

In [None]:
fig = px.scatter(df_2d, x='PC1',y='PC2', title="PC1 vs PC2 scatter plot",
                 color='clusters_ms',color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1,marker_line_color="black")).update_layout(paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')
fig.show()

In [None]:
silhouette_score(X, labels=ms.labels_)

In [None]:
len(df_2d['clusters_ms'].unique())

The algorithm appears to perform very poorly on this data set. It has generated 26 clusters and has a silhouette score of only 0.12, so we can conclude that the clusters generated are hardly significant or cohesive.

## 9. Conclusions
Examining the silhouette scores gives a fairly good idea of the relative quality of each cluster under a specific clustering algorithm. The comparison of mean silhouette scores (the average over all the clustered objects), obtained either from different clustering algorithms or from the same algorithm with a varying number of clusters k, is therefore commonly used to help decide which grouping is more meaningful.
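
As a closing sketch of this kind of comparison, the block below scores three algorithms on the same synthetic data. The `make_blobs` data, the choice of three clusters and the dictionary keys are assumptions made for the illustration; the scores it prints say nothing about the wine DataSet.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption), not the wine attributes.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

models = {
    "kmeans_k3": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative_k3": AgglomerativeClustering(n_clusters=3, linkage="average"),
    "meanshift": MeanShift(),  # bandwidth estimated automatically
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    # silhouette_score is only defined for at least two distinct clusters.
    if len(set(labels)) > 1:
        scores[name] = silhouette_score(X_demo, labels)

# Highest mean silhouette first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```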