## About
This notebook contains a very fast fundamental k-means clustering example in Python.

This work is part of a series called [Machine learning in minutes - very fast fundamental examples in Python](https://www.kaggle.com/jamiemorales/machine-learning-in-minutes-very-fast-examples).

The approach is designed to help grasp the applied machine learning lifecycle in minutes. It is not an alternative to actually taking the time to learn. Its aim is to help someone get started fast and gain an intuitive understanding of the typical steps early on.

## Step 0: Understand the problem
What we're trying to do here is find strong and interesting patterns or similarities in a red wine dataset.

## Step 1: Set-up and understand data
This step helps uncover issues that we will want to address in the next step and take into account when building and evaluating our model. We also want to find interesting relationships or patterns that we can possibly leverage in solving the problem we specified.

In [None]:
# Set-up libraries
import os
import pandas as pd
import seaborn as sns
import numpy as np
sns.set()
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D

In [None]:
# Check input data source
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read-in data
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

In [None]:
# Look at some details
df.info()

In [None]:
# Look at some records
df.head()

In [None]:
# Check for missing values
df.isna().sum()

In [None]:
# Check for duplicate values
df.duplicated().sum()

In [None]:
# Look at breakdown of quality
print(df['quality'].value_counts())
sns.countplot(x='quality', data=df)

In [None]:
# Explore data visually with boxplots
f, ax = plt.subplots(2, 2, figsize=(16, 12))
sns.boxplot(x='quality', y='alcohol', data=df, ax=ax[0, 0])
sns.boxplot(x='quality', y='sulphates', data=df, ax=ax[0, 1])
sns.boxplot(x='quality', y='volatile acidity', data=df, ax=ax[1, 0])
sns.boxplot(x='quality', y='citric acid', data=df, ax=ax[1, 1])

In [None]:
# Explore correlation of other features to quality
df.corr()['quality'].sort_values(ascending=False)

In [None]:
# Summarise
df.describe()

## Step 2: Preprocess data and understand some more
This step typically takes the most time in the cycle but for our purposes, most of the datasets chosen in this series are clean.

Real-world datasets are noisy and incomplete. The choices we make in this step to address data issues can impact downstream steps and the result itself. For example, it can be tricky to address missing data when we don't know why it's missing. Is it missing completely at random or not? It can also be tricky to address outliers if we do not understand the domain and problem context enough.
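
As a small illustration of how such choices change the data, the sketch below uses a made-up `alcohol` column (not the wine dataset) with one missing value and one extreme value. The values, the median imputation, and the 1.5 × IQR rule are assumptions chosen for illustration; they are not steps this notebook applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one missing value and one extreme value
toy = pd.DataFrame({'alcohol': [9.4, 9.8, np.nan, 10.0, 9.5, 30.0]})

# Option 1: drop rows with missing values
dropped = toy.dropna()

# Option 2: fill missing values with the column median
imputed = toy.fillna(toy['alcohol'].median())

# Flag outliers with the 1.5 * IQR rule (one common convention, not the only one)
q1, q3 = toy['alcohol'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy['alcohol'] < q1 - 1.5 * iqr) | (toy['alcohol'] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print('Rows after dropping:', len(dropped))
print('Mean after dropping:', dropped['alcohol'].mean())
print('Mean after imputing:', imputed['alcohol'].mean())
print(outliers)
```

The two means differ, which is the point: each way of handling a gap or an extreme value produces a slightly different dataset for the model to learn from.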

In [None]:
# Get the features for input
X = df.drop('quality', axis=1)

In [None]:
# Scale the values
X_scaled = StandardScaler().fit_transform(X)
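
To see what the scaling step does, here is a minimal sketch on a made-up four-row array (the values are illustrative assumptions, not taken from the dataset). `StandardScaler` subtracts each column's mean and divides by its population standard deviation, so every feature ends up with mean 0 and standard deviation 1. That matters for k-means because it relies on Euclidean distance, and a feature with a large numeric range would otherwise dominate.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = np.array([[7.4, 34.0],
                [7.8, 67.0],
                [11.2, 60.0],
                [7.9, 40.0]])

scaled = StandardScaler().fit_transform(toy)

# The same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print('Matches manual z-score:', np.allclose(scaled, manual))
print('Column means:', scaled.mean(axis=0).round(6))
print('Column stds: ', scaled.std(axis=0).round(6))
```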

## Step 3: Model and evaluate

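We need to create a number of models with different k values, measure the performance of each model, and choose a k for our final model. Here, performance is the within-cluster sum of squares (WCSS), which generally falls as k grows, so we look for the point where the improvement levels off rather than for the lowest value.

Where the ground truth is available, we compare the generated clusters with it.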
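
The performance measure used in the next cell is scikit-learn's `inertia_`, the within-cluster sum of squares. As a rough sketch of what that number means, the example below recomputes it by hand on synthetic two-blob data (the data, seed, and `n_init` value are assumptions for illustration only).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated blobs of 50 points each
points = np.vstack([rng.normal(0, 1, size=(50, 2)),
                    rng.normal(8, 1, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# WCSS: squared distance from every point to the centre of its assigned cluster
assigned_centres = model.cluster_centers_[model.labels_]
wcss = ((points - assigned_centres) ** 2).sum()

print('Manual WCSS:   ', wcss)
print('model.inertia_:', model.inertia_)
```

A lower value means points sit closer to their cluster centres, but adding clusters almost always lowers it, which is why the curve is read for its bend rather than its minimum.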

In [None]:
# Build and fit one model per candidate k, recording the WCSS (inertia) of each
wcss_scores = []
iterations = list(range(1,10))  # candidate k values, 1 to 9

for k in iterations:
    model = KMeans(n_clusters=k)
    model.fit(X_scaled)
    wcss_scores.append(model.inertia_)

In [None]:
# Plot performances
plt.figure(figsize=(12,6))
sns.lineplot(x=iterations, y=wcss_scores)

There are dips at 2, 5, and 7. Let's plot some features and see what's going on.
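
Reading bends off a WCSS curve is subjective. Before turning to the plots, one complementary check worth knowing about is the silhouette score, which is not used in this notebook. The sketch below applies it to synthetic three-blob data (the data, seed, and range of k are assumptions for illustration); higher scores suggest tighter, better-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated blobs of 60 points each
centres = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([rng.normal(c, 1.0, size=(60, 2)) for c in centres])

# Silhouette score needs at least two clusters, so start at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('k with the highest silhouette score:', best_k)
```

On the wine data the same loop could be run over `X_scaled`, although real data rarely gives as clear-cut an answer as well-separated synthetic blobs.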

In [None]:
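# Visualise the clusters considering fixed acidity, residual sugar, and alcohol
fig = plt.figure(figsize=(20, 15))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=15, azim=40)
fig.add_axes(ax)  # register the axes explicitly; newer Matplotlib versions no longer do this automatically
model = KMeans(n_clusters=2)
model.fit(X)  # note: fitted on the unscaled features, unlike the models in the WCSS loop above
labels = model.labels_
ax.scatter(X['fixed acidity'], X['residual sugar'], X['alcohol'], c=labels.astype(float), edgecolor='k')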
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('K=2: Acidity, Sugar, Alcohol', size=22)

We can somewhat see a dark clump at the bottom there, all with relatively lower fixed acidity, residual sugar, and alcohol. This can be interesting, or not at all. But let's see the others first. Will increasing the number of clusters shed more light?

In [None]:
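# Visualise the clusters for k=5
fig = plt.figure(figsize=(20, 15))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=15, azim=40)
fig.add_axes(ax)  # register the axes explicitly; newer Matplotlib versions no longer do this automatically
model = KMeans(n_clusters=5)
model.fit(X)
labels = model.labels_
ax.scatter(X['fixed acidity'], X['residual sugar'], X['alcohol'], c=labels.astype(float), edgecolor='k')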
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('K=5: Acidity, Sugar, Alcohol', size=22)

In [None]:
# Visualise the clusters for k=7
fig = plt.figure(figsize=(20, 15))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=15, azim=40)
fig.add_axes(ax)  # register the axes explicitly; newer Matplotlib versions no longer do this automatically
model = KMeans(n_clusters=7)
model.fit(X)
labels = model.labels_
ax.scatter(X['fixed acidity'], X['residual sugar'], X['alcohol'], c=labels.astype(float), edgecolor='k')
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('K=7: Acidity, Sugar, Alcohol', size=22)

Let's go back to K=2 as it's more curious than the last two k's. How does it compare to the ground truth?

In [None]:
# Compare clusters generated to the ground truth
fig = plt.figure(figsize=(20, 15))
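ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=15, azim=40)
fig.add_axes(ax)  # register the axes explicitly; newer Matplotlib versions no longer do this automatically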
ax.scatter(df['fixed acidity'], df['residual sugar'], df['alcohol'],c=df['quality'], edgecolor='k')
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('Ground truth', size=22)

In [None]:
# Visualise the k=2 clusters again, for comparison with the ground truth
fig = plt.figure(figsize=(20, 18))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=15, azim=40)
fig.add_axes(ax)  # register the axes explicitly; newer Matplotlib versions no longer do this automatically
model = KMeans(n_clusters=2)
model.fit(X)
labels = model.labels_
ax.scatter(X['fixed acidity'], X['residual sugar'], X['alcohol'], c=labels.astype(float), edgecolor='k')
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('K=2: Acidity, Sugar, Alcohol', size=22)

In [None]:
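# Ground truth: wines with a quality rating below 6
fig = plt.figure(figsize=(20, 18))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=15, azim=40)
fig.add_axes(ax)  # register the axes explicitly; newer Matplotlib versions no longer do this automatically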
ax.scatter(df['fixed acidity'][df['quality']<6], 
           df['residual sugar'][df['quality']<6], 
           df['alcohol'][df['quality']<6],
           edgecolor='k')
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('Ground truth: Low to average quality wine', size=22)

In [None]:
# Ground truth: wines with a quality rating above 6
fig = plt.figure(figsize=(20, 18))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=15, azim=40)
fig.add_axes(ax)  # register the axes explicitly; newer Matplotlib versions no longer do this automatically
ax.scatter(df['fixed acidity'][df['quality']>6], 
           df['residual sugar'][df['quality']>6], 
           df['alcohol'][df['quality']>6],
           edgecolor='k')
ax.set_xlabel('Acidity')
ax.set_ylabel('Sugar')
ax.set_zlabel('Alcohol')
ax.set_title('Ground truth: High quality wine', size=22)

It turns out there are two curious clusters of wine quality based on fixed acidity, residual sugar, and alcohol level:

* Low to average quality wine, which may be lower in fixed acidity, residual sugar, and alcohol level.
* High quality wine, which may have higher fixed acidity, residual sugar, and alcohol level.
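
A visual comparison like this can be backed up with numbers. The sketch below is one possible way to do so, using a cross-tabulation and the adjusted Rand index on made-up labels for eight wines (the quality scores and cluster assignments are hypothetical, not results from this notebook).

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical quality scores and k=2 cluster labels for eight wines
quality = pd.Series([5, 5, 4, 7, 8, 7, 7, 5], name='quality')
clusters = pd.Series([0, 0, 0, 1, 1, 1, 0, 1], name='cluster')

# Ground-truth band: 1 for wines rated above 6, 0 otherwise
is_high = (quality > 6).astype(int).rename('high quality')

# Count how many wines of each band fall into each cluster
table = pd.crosstab(clusters, is_high)

# Adjusted Rand index: 1.0 is perfect agreement, values near 0 are chance level
ari = adjusted_rand_score(is_high, clusters)

print(table)
print('Adjusted Rand index:', round(ari, 3))
```

With the real data, `clusters` would be `model.labels_` and `quality` would be `df['quality']`. A score well below 1 would suggest the clusters only loosely track quality.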


## Learn more
If you found this example interesting, you may also want to check out:

* [Machine learning in minutes - very fast fundamental examples in Python](https://www.kaggle.com/jamiemorales/machine-learning-in-minutes-very-fast-examples)
* [List of machine learning methods & datasets](https://www.kaggle.com/jamiemorales/list-of-machine-learning-methods-datasets)

Thanks for reading. Don't forget to upvote.