## Similar taste? Different taste?

I love whisky, and I want to enjoy a variety of whisky tastes. It is disappointing to order several kinds of whisky and find that they all taste almost the same. So I decided to identify the differences between whisky distilleries before getting tipsy.

#### Strategy
- Group distilleries by the flavor of their whisky using k-means clustering.
- Identify the characteristics of the whisky (distilleries) in each class using a decision tree.
- Map the locations of the distilleries in each class.

#### Reference
Sebastian Raschka and Vahid Mirjalili, Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition. (Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn, Chapter 11: Working with Unlabeled Data - Clustering Analysis)

## 1. Data preparation

#### Load libraries.

In [None]:
!pip install pydotplus

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image, display_png
from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz
from sklearn import manifold
import folium
from pyproj import Proj, transform

#### Load data and check the attributes.

In [None]:
df = pd.read_csv('../input/whisky.csv')
df.head()

The attributes related to taste (body, sweetness, smoky, medicinal, tobacco, honey, spicy, winey, nutty, malty, fruity, floral) are the ones we can use to measure how similar two whiskies are.

In [None]:
df.info()

#### Check the statistics

In [None]:
df.describe()

## 2. k-means clustering

Here we group distilleries based on the flavors of their whisky using k-means clustering.

#### Determine the number of clusters

First, we identify an appropriate number of clusters using the elbow method. We run k-means clustering while varying the number of clusters (k), then look for the k at which the inertia stops decreasing rapidly and the curve flattens out (the "elbow").
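For reference, the inertia that scikit-learn reports as `inertia_` is the sum of squared distances from each sample to its closest cluster center. In the notation below (the symbols are introduced here and are not from the notebook), $x_i$ is the flavor vector of distillery $i$, $n$ is the number of distilleries, and $\mu_j$ is the center of cluster $j$:

$$
\mathrm{inertia}(k) = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2
$$

Adding clusters can only bring samples closer to some center, so this value tends to fall as k grows. That is why we look for the bend in the curve rather than its minimum.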

In [None]:
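dist = []

# Fit k-means for k = 2..19 and record the inertia (within-cluster sum of squares)
for i in range(2, 20):
    km = KMeans(n_clusters=i, n_init=10, max_iter=500, random_state=0)
    km.fit(df.iloc[:, 2:-3])  # flavor columns only
    dist.append(km.inertia_)

plt.plot(range(2, 20), dist, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()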

The curve has no pronounced elbow, so the choice of k is a judgment call. We use k=5 here.
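When the elbow is hard to see by eye, one option is to print how much the inertia falls at each step. The sketch below is only an illustration of that idea: it uses synthetic data from `make_blobs` as a stand-in for the flavor table, and the names `X`, `ks` and `rel_drop` are my own, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the flavor table (assumption: 80 rows, 12 features)
X, _ = make_blobs(n_samples=80, n_features=12, centers=4, random_state=0)

ks = list(range(2, 11))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)

inertias = np.array(inertias)

# Relative drop in inertia when going from k to k+1 clusters
rel_drop = (inertias[:-1] - inertias[1:]) / inertias[:-1]

for k, drop in zip(ks[:-1], rel_drop):
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.1%}")
```

A k after which the relative drop stays small is a reasonable candidate. With the whisky data, `X` would be `df.iloc[:, 2:-3]`.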

#### Conduct k-means clustering

Here we run k-means clustering with k=5 and store the resulting cluster label of each distillery as its class.

In [None]:
km = KMeans(n_clusters = 5, n_init=10, max_iter = 300, random_state =0)
df['class'] = km.fit_predict(df.iloc[:, 2:-3])
df['class'].values

#### Map the distilleries onto a 2D plane

Here we map the distilleries, together with their class information, onto a 2D plane that reflects their "distance" in flavor. We use the MDS (Multidimensional Scaling) implementation in scikit-learn.

In [None]:
# Project the 12 flavor attributes onto 2 dimensions with MDS
mds = manifold.MDS(n_components=2, dissimilarity="euclidean", random_state=0)
pos = mds.fit_transform(df.iloc[:, 2:-4])

col = ['orange', 'green', 'blue', 'purple', 'red']
chars = "^<>vo"
labels = df['Distillery']

plt.figure(figsize=(20, 20), dpi=50)
plt.rcParams["font.size"] = 15

# Draw one scatter series per class so that every class gets exactly one legend entry
for c in sorted(df['class'].unique()):
    mask = (df['class'] == c).values
    plt.scatter(pos[mask, 0], pos[mask, 1], c=col[c], marker=chars[c], s=100, label="Class " + str(c + 1))

# Annotate each point with the distillery name
for label, x, y in zip(labels, pos[:, 0], pos[:, 1]):
    plt.annotate(label, xy=(x, y))

plt.legend(loc='upper right')
plt.show()

Here we can see that distilleries in the same class are positioned close to each other.

Let's see how close the whisky tastes are within the same class. Here we compare GlenSpey and Miltonduff, both in Class 1.

In [None]:
df.query('Distillery == "GlenSpey" or Distillery == "Miltonduff"')

The sum of the absolute differences between the attributes of these two distilleries is 7. Miltonduff seems to be a bit sweeter, with more honey and fruit, than GlenSpey.

Next, let's compare the tastes of whiskies in different classes. Here we look at GlenSpey in Class 1 and Glendronach in Class 4.

In [None]:
df.query('Distillery == "GlenSpey" or Distillery == "Glendronach"')

The sum of the absolute differences between the attributes of these two distilleries is 15, which is much larger than in the comparison between GlenSpey and Miltonduff. So I believe that Glendronach is much fuller-bodied and more honeyed, winey and fruity than GlenSpey.
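The two comparisons above boil down to summing the absolute differences across all flavor attributes. A minimal sketch of that calculation follows. The scores and the names `A`, `B`, `C` are hypothetical values invented for illustration, not rows from the dataset, and `flavor_distance` is a helper I made up for this example.

```python
import pandas as pd

# Hypothetical flavor scores (NOT taken from the dataset), for illustration only
scores = pd.DataFrame(
    {
        "Body": [1, 2, 4],
        "Sweetness": [3, 2, 2],
        "Honey": [1, 2, 3],
        "Fruity": [2, 2, 3],
    },
    index=["A", "B", "C"],
)


def flavor_distance(table, name_a, name_b):
    """Sum of absolute differences across all flavor columns."""
    return int((table.loc[name_a] - table.loc[name_b]).abs().sum())


print(flavor_distance(scores, "A", "B"))  # 1 + 1 + 1 + 0 = 3
print(flavor_distance(scores, "A", "C"))  # 3 + 1 + 2 + 1 = 7
```

In the notebook, `table` would be the flavor columns of `df` indexed by `Distillery`.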

## 3. Identify the characteristics of tastes using decision tree

Here we use a decision tree to identify the characteristics of the whisky tastes in each class.

#### Fit the decision tree

In [None]:
tree = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state =1, min_samples_leaf=5)

X_train = df.iloc[:, 2:-4]
y_train = df['class']

tree.fit(X_train, y_train)

#### Visualize the derived decision tree

In [None]:
dot_data = export_graphviz(tree, filled = True, rounded = True, class_names = ['Class 1','Class 2', 'Class 3', 'Class 4', 'Class 5'],
                          feature_names = df.columns[2:-4].values, out_file = None)

graph = graph_from_dot_data(dot_data)
graph.write_png('tree.png')
display_png(Image('tree.png'))

#### Interpretation of the tree

From the above decision tree, we can see the characteristics of whisky tastes in each class.

- Class 1: Most of them are floral and less winey.
- Class 2: Less floral, more body.
- Class 3: Floral, winey and full-bodied (strong tastes?).
- Class 4: There are only five of them. Three of them are less floral, more full-bodied and less medicinal. The remaining two are close to Class 3.
- Class 5: Most of them are less floral and less body (lighter tastes?).
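If the rendered tree image is hard to read, the same rules can be printed as plain text with `export_text` from `sklearn.tree`. The sketch below is a toy illustration only: the data frame `toy`, its values, and the classifier name `clf` are invented here and are not the notebook's data or fitted tree.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data invented for illustration (two flavor attributes, two classes)
toy = pd.DataFrame(
    {
        "Floral": [3, 3, 2, 0, 0, 1, 3, 0],
        "Body": [1, 2, 1, 3, 4, 3, 1, 4],
        "class": [0, 0, 0, 1, 1, 1, 0, 1],
    }
)
features = ["Floral", "Body"]

clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=1)
clf.fit(toy[features], toy["class"])

# Print the learned split rules as indented text
rules = export_text(clf, feature_names=features)
print(rules)
```

Each indented line is one split threshold, and each leaf line shows the predicted class, so the list above can be checked against the rules directly.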


## 4. Map the locations of distilleries

Here we mark the locations of the distilleries on a map, using the position information (latitude and longitude) in the dataset. We also color each marker by its class.

Note that the values in the Latitude and Longitude columns are not in degrees. They are British National Grid coordinates (EPSG:27700). Therefore, we need to transform them into the World Geodetic System, WGS 84 (EPSG:4326).

In [None]:
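from pyproj import Transformer

map_whisky = folium.Map(location=[57.499520, -2.776390], zoom_start=9)

# Convert British National Grid (EPSG:27700) to WGS 84 (EPSG:4326).
# always_xy=True fixes the axis order to (easting, northing) -> (longitude, latitude).
transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

# Despite their names, the 'Latitude' and 'Longitude' columns hold the easting and northing.
for label, easting, northing, c in zip(labels, df['Latitude'], df['Longitude'], df['class']):
    lon, lat = transformer.transform(easting, northing)
    # folium expects [latitude, longitude]
    folium.Marker([lat, lon], popup=label, icon=folium.Icon(color=col[c])).add_to(map_whisky)

map_whisky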

Now we can see the locations of the distilleries in each class. Click a marker to see the name of the distillery.

## 5. Conclusion

We identified similarities and differences in whisky tastes using k-means clustering and decision tree analysis. I think this result is quite helpful for choosing a whisky you will like. But my analysis might have some flaws. The only way to make sure it is correct is to visit Scotland and taste all of them myself ;-).