# Wine Data - Clustering

This notebook clusters the wine dataset with K-Means and then checks which features best separate the resulting clusters.

1. Import the data
2. Analyze the data
3. Data Visualization
4. K-Means Clustering (choosing K with the Elbow Method and the Silhouette Method)
5. Best Features

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


### 1. Import the data

In [None]:
wine = pd.read_csv("/kaggle/input/wine-customer-segmentation/Wine.csv")
wine.shape

* The shape is (178, 14): 178 rows and 14 columns.

### 2. Analyze the data

In [None]:
# view the top 5 records
wine.head()

* The first five rows show every column with sample values; most of them look numeric.

In [None]:
# checking for the data type
wine.info()

* There are no missing records and all features are numeric.
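Both statements can also be checked explicitly instead of being read off the `info()` output. A minimal sketch, assuming a small made-up DataFrame `demo` in place of `wine` (the values and the deliberately missing entry are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `wine`; values are made up for illustration.
demo = pd.DataFrame({
    "Alcohol": [14.23, 13.20, np.nan],
    "Ash": [2.43, 2.14, 2.67],
})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()

# Columns whose dtype is not numeric (an empty list if all are numeric).
non_numeric = demo.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric)
```

On the real data, the same two calls on `wine` should report zero missing values per column and an empty list if the observation above holds.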

In [None]:
# summary of the distribution for the numeric columns
wine.describe()

* All the features are continuous except Customer_Segment, which looks like the grouping we want the clusters to reproduce.

In [None]:
# Customer_Segment already holds a grouping, so we work on a copy of the dataset without that column
wine1 = wine.drop("Customer_Segment", axis=1).copy()

### 3. Data Visualization

In [None]:
# importing data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# pair each feature against the other feature and visualize the relationships
sns.pairplot(wine1)

In [None]:
# create a heat map to display correlation
plt.figure(figsize=(12,12))
sns.heatmap(wine1.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()

* Flavanoids is highly correlated with Total_Phenols, so the two carry largely redundant information. We drop Flavanoids and keep Total_Phenols.

In [None]:
# dropping Flavanoids column
wine1.drop("Flavanoids", axis=1, inplace=True)
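Strongly correlated pairs can also be listed programmatically rather than read off the heat map. The sketch below is one way to do it; the helper name `correlated_pairs`, the 0.8 threshold, and the `demo` data are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in strong.items()]

# Made-up data: "b" is an exact multiple of "a", "c" is uncorrelated with both.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, -1.0, 1.0],
})
pairs = correlated_pairs(demo)
print(pairs)
```

Calling the helper on the feature table before any column is dropped would give the candidate pairs; which column of a pair to drop remains a judgment call.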

In [None]:
# create a copy of wine1
wine2 = wine1.copy()

In [None]:
# standardize the features (zero mean, unit variance): K-Means relies on Euclidean
# distances, so features with large ranges would otherwise dominate the clustering

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

wine2 = pd.DataFrame(scaler.fit_transform(wine2), columns=wine2.columns)

wine2.head()
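For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. A minimal sketch on a made-up matrix `X` (not the wine data) that reproduces the transform by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-feature matrix with very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Manual z-score: subtract the column mean and divide by the
# population standard deviation (ddof=0, NumPy's default).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Check that the two results agree.
print(np.allclose(manual, scaled))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance.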

### 4. K-Means Clustering

In [None]:
# import KMeans library
from sklearn.cluster import KMeans

#### Elbow Method

In [None]:
k = np.arange(1,11)

inertia = []

for i in k:
    kmeans = KMeans(n_clusters=i, random_state=511)
    kmeans.fit(wine2)
    inertia.append(kmeans.inertia_)

plt.plot(k,inertia,"o-")
plt.xticks(k)
plt.xlabel("K Value")
plt.ylabel("Inertia")
plt.title("Finding the value of K - Elbow Method")
plt.show()

* In the plot above, inertia drops sharply up to K = 3 and only slowly after that, which puts the elbow at K = 3.
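Inertia, the quantity plotted above, is the sum of squared distances from each sample to its assigned cluster centre. A minimal sketch on made-up 2-D points (not the wine data) that recomputes it by hand and compares it with `inertia_`; `n_init=10` and `random_state=0` are arbitrary choices for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia by hand: squared distance from each point to its own centroid.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(((X - assigned_centroids) ** 2).sum())

print(manual_inertia, km.inertia_)
```

Because adding clusters can only lower this sum, inertia always decreases with K, which is why the elbow method looks for the point where the decrease levels off rather than for a minimum.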

#### Silhouette Method

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
n = np.arange(2,11)

score = []

for i in n:
    kmeans = KMeans(n_clusters=i, random_state=511)
    kmeans.fit(wine2)
    score.append(silhouette_score(wine2,kmeans.labels_))


In [None]:
plt.plot(n,score,"*-")
plt.xlabel("K Value")
plt.ylabel("Silhouette Score")
plt.title("Finding the value of K - Silhouette Method")
plt.xticks(n)
plt.show()

* The Silhouette Score is highest at K = 3, which agrees with the Elbow Method. Hence, we will fit K-Means with 3 clusters.
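For a single sample, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster; `silhouette_score` averages this over all samples. The sketch below recomputes it by hand on made-up points. The helper `mean_silhouette` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def mean_silhouette(X, labels):
    """Average silhouette coefficient; assumes each cluster has >= 2 points."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()
        # Mean distance to each other cluster; keep the smallest.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Made-up 2-D points in two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

manual = mean_silhouette(X, labels)
reference = silhouette_score(X, labels)
print(manual, reference)
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 indicate overlapping ones.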

In [None]:
model = KMeans(n_clusters = 3, random_state= 511)
model.fit(wine2)

In [None]:
labels = model.labels_
centroids = model.cluster_centers_
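The centroids live in the standardized feature space, so they are easier to interpret once mapped back to the original units with the scaler's `inverse_transform`. A minimal sketch, assuming made-up data in `demo` and fresh `demo_scaler` and `km` objects rather than the notebook's fitted `scaler` and `model`:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data with two obvious groups (values are illustrative only).
demo = pd.DataFrame({
    "Alcohol": [12.0, 12.2, 12.4, 13.8, 14.0, 14.2],
    "Proline": [500.0, 520.0, 540.0, 1000.0, 1050.0, 1100.0],
})

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(demo)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map the centroids from standardized units back to the original units.
centers = pd.DataFrame(
    demo_scaler.inverse_transform(km.cluster_centers_),
    columns=demo.columns,
)
print(centers)
```

In the notebook, the analogous call would be `scaler.inverse_transform(centroids)` with `wine2.columns` as the column names, assuming `scaler` is still the object fitted on `wine2`.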

In [None]:
wine1["Cluster"] = labels

wine1.head()
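Since the original Customer_Segment column was set aside earlier, one way to see how closely the clusters match it is the adjusted Rand index, which ignores how the cluster numbers are named. A minimal sketch with made-up label lists (not the notebook's results):

```python
from sklearn.metrics import adjusted_rand_score

# Made-up labels: the same grouping, but with different cluster numbers.
true_segments = [1, 1, 2, 2, 3, 3]
cluster_labels = [2, 2, 0, 0, 1, 1]
ari_same = adjusted_rand_score(true_segments, cluster_labels)

# Same example with one sample assigned to a different cluster.
cluster_labels_noisy = [2, 2, 0, 1, 1, 1]
ari_noisy = adjusted_rand_score(true_segments, cluster_labels_noisy)

print(ari_same, ari_noisy)
```

In the notebook, the comparison would be `adjusted_rand_score(wine["Customer_Segment"], labels)`; a value of 1 means the two groupings are identical up to renaming, and values near 0 mean agreement no better than chance.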

### 5. Best Features

In [None]:
# rank the features by how well they separate the K-Means clusters:
# fit a random forest to predict the cluster label and read its feature importances
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=511)
wine3 = wine1.drop("Cluster", axis=1)
rfc.fit(wine3, wine1["Cluster"])

# collect the importances in a table, largest first
fi = pd.DataFrame({"Features": wine3.columns, "Values": rfc.feature_importances_})
fi = fi.sort_values("Values", ascending=False).reset_index(drop=True)
print(fi)

# bar chart of the importances
plt.figure(figsize=(20, 5))
ax = sns.barplot(x="Features", y="Values", data=fi, palette="Blues_d")
plt.show()