## About
This notebook contains a very fast fundamental k-means clustering example in Python.

This work is part of a series called [Machine learning in minutes - very fast fundamental examples in Python](https://www.kaggle.com/jamiemorales/machine-learning-in-minutes-very-fast-examples).

The approach is designed to help grasp the applied machine learning lifecycle in minutes. It is not an alternative to actually taking the time to learn. What it aims to do is help someone get started fast and gain intuitive understanding of the typical steps early on.

## Step 0: Understand the problem
What we're trying to do here is to find strong and interesting patterns or similarities from the mall's data about its customers.

## Step 1: Set-up and understand data
This step helps uncover issues that we will want to address in the next step and take into account when building and evaluating our model. We also want to find interesting relationships or patterns that we can possibly leverage in solving the problem we specified.

In [None]:
# Set-up libraries
import os
import pandas as pd
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

In [None]:
# Check input data source
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read-in data
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

In [None]:
# Look at some details
df.info()

In [None]:
# Look at some records
df.head()

In [None]:
# Check for missing values
df.isna().sum()

In [None]:
# Check for duplicate values
df.duplicated().sum()

In [None]:
# Look at breakdown of Gender
# (print the counts explicitly, since only a cell's last expression is displayed)
print(df['Gender'].value_counts())
sns.countplot(x=df['Gender'])

In [None]:
# Look at distribution of Age
# (histplot with kde=True replaces distplot, which recent seaborn versions deprecate)
sns.histplot(df['Age'], kde=True)

In [None]:
# Look at distribution of Annual Income
# (histplot with kde=True replaces distplot, which recent seaborn versions deprecate)
sns.histplot(df['Annual Income (k$)'], kde=True)

In [None]:
# Look at distribution of Spending
# (histplot with kde=True replaces distplot, which recent seaborn versions deprecate)
sns.histplot(df['Spending Score (1-100)'], kde=True)

In [None]:
# Explore data visually with scatter plots
sns.pairplot(df)

In [None]:
# Explore data visually with kernel density estimations
g = sns.PairGrid(df)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);

In [None]:
# Summarise
df.describe()

## Step 2: Preprocess data and understand some more
This step typically takes the most time in the cycle but for our purposes, most of the datasets chosen in this series are clean.

Real-world datasets are noisy and incomplete. The choices we make in this step to address data issues can impact downstream steps and the result itself. For example, it can be tricky to address missing data when we don't know why it's missing. Is it missing completely at random or not? It can also be tricky to address outliers if we do not understand the domain and problem context enough.

In [None]:
# Rename columns for easier handling
df = df.rename(columns={'Annual Income (k$)': 'Income',
                  'Spending Score (1-100)': 'Spending'
                  })
df.head()

In [None]:
# Transform categorical features to numeric
le = LabelEncoder()
le.fit(df['Gender'].drop_duplicates())
df['Gender'] = le.transform(df['Gender'])
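
To make the encoding step less of a black box, here is a minimal plain-Python sketch of the mapping that `LabelEncoder` learns: the sorted unique values are numbered from 0. The sample values below are hypothetical and assume the `Gender` column holds the strings `'Male'` and `'Female'`.

```python
# Sketch of the mapping LabelEncoder learns: sorted unique values -> 0, 1, 2, ...
genders = ['Male', 'Female', 'Female', 'Male']  # hypothetical sample values

classes = sorted(set(genders))                       # ['Female', 'Male']
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[g] for g in genders]

print(mapping)   # {'Female': 0, 'Male': 1}
print(encoded)   # [1, 0, 0, 1]
```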

In [None]:
# Look at breakdown of feature Gender
df['Gender'].value_counts()

In [None]:
# Get the features for input
X = df.drop('CustomerID', axis=1)
X.head()

## Step 3: Model and evaluate

We need to create a number of models with different k values, measure the performance of each model, and use the k with the best performance in our final model. 
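
For intuition about what each of these models does when fitted, here is a minimal sketch of Lloyd's algorithm, the classic k-means procedure: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is an illustration in plain Python, not scikit-learn's implementation. The toy points and the explicitly supplied starting centroids are assumptions chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        dims = len(members[0])
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
        )
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two obvious groups
toy_points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
labels, centroids = kmeans(toy_points, centroids=[(0.0, 0.0), (5.0, 5.0)])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.0, 1.5), (9.5, 9.0)]
```

Real libraries also choose the starting centroids for you and usually try several starts, because the result depends on where the centroids begin.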

Where ground truth labels are available, we can also compare the generated clusters against them.
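
When ground truth labels do exist, one simple comparison is the Rand index: the fraction of point pairs on which two labelings agree, where agreeing means both put the pair in the same group or both keep it apart. The function and label lists below are a hypothetical plain-Python illustration. scikit-learn ships related metrics such as `adjusted_rand_score` in `sklearn.metrics`.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical labels: cluster numbers differ, but the grouping is identical
truth = [0, 0, 1, 1]
clusters = [1, 1, 0, 0]
print(rand_index(truth, clusters))  # 1.0
```

Because it only looks at which points end up together, the measure is unaffected by the arbitrary numbering of clusters.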

In [None]:
# Build and fit models
wcss_scores = []
iterations = list(range(1,10))

for k in iterations:
    model = KMeans(n_clusters=k)
    model.fit(X)
    # inertia_ is the within-cluster sum of squares (WCSS) for this k
    wcss_scores.append(model.inertia_)
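
The `inertia_` value collected above is the within-cluster sum of squares (WCSS): the sum of squared distances from each sample to the centre of its cluster. A minimal plain-Python sketch of that quantity follows, using hypothetical toy points, labels, and centroids.

```python
def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )

# Hypothetical toy data: two points in cluster 0, one point in cluster 1
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
toy_labels = [0, 0, 1]
toy_centroids = [(0.0, 1.0), (10.0, 10.0)]

score = wcss(toy_points, toy_labels, toy_centroids)
print(score)  # 2.0
```

WCSS can only fall as k grows, which is why we look at how quickly it falls rather than simply picking the smallest value.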

In [None]:
# Plot performances
plt.figure(figsize=(12,6))
# Pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.lineplot(x=iterations, y=wcss_scores)

The curve drops noticeably at k = 2, 3, 4, and 5. Let's plot some features and see what's going on.
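
One way to read such a plot is to look at how much WCSS falls with each extra cluster: large drops suggest the extra cluster captures real structure, while small drops suggest diminishing returns. The sketch below uses made-up WCSS values purely for illustration. They are not the scores from this notebook.

```python
# Hypothetical WCSS values for k = 1..6, made up purely for illustration
ks = [1, 2, 3, 4, 5, 6]
scores = [1000.0, 600.0, 400.0, 300.0, 200.0, 180.0]

# Drop in WCSS gained by moving from k-1 clusters to k clusters
drops = {k: prev - cur for k, prev, cur in zip(ks[1:], scores, scores[1:])}

for k, drop in drops.items():
    print(f'k={k}: WCSS drops by {drop:.1f}')
```

With these invented numbers the drop shrinks sharply after k=5, which is the kind of bend the elbow heuristic looks for.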

In [None]:
# Visualise the clusters, considering Income and Spending
plt.figure(figsize=(27,27))

# Fit on the original features only, so a previous 'labels' column is never used as input
features = X.drop(columns='labels', errors='ignore')

for i, k in enumerate([2, 3, 4, 5], start=1):
    plt.subplot(3, 2, i)
    plt.title(f'K = {k}: Income vs Spending', fontsize=22)
    # Axis labels match the scatter call: Spending on x, Income on y
    plt.xlabel('Spending')
    plt.ylabel('Income')
    model = KMeans(n_clusters=k)
    X['labels'] = model.fit_predict(features)
    for label in range(k):
        plt.scatter(X.Spending[X.labels == label], X.Income[X.labels == label])

k=5 stands out, as we can easily see five groupings in these plots. Let's take a closer look at k=5.

In [None]:
# Visualise most interesting clusters
plt.figure(figsize=(24,12))

plt.title('K = 5: Income vs Spending', fontsize=22)
# Axis labels match the scatter call: Spending on x, Income on y
plt.xlabel('Spending')
plt.ylabel('Income')
model = KMeans(n_clusters=5)
# Fit on the original features only, never on a previous 'labels' column
X['labels'] = model.fit_predict(X.drop(columns='labels', errors='ignore'))
for label in range(5):
    plt.scatter(X.Spending[X.labels == label], X.Income[X.labels == label])

It turns out there are five interesting clusters based on a customer's annual income and spending score.

**Low-income, low-spenders**. The first cluster consists of customers who have low income and low spending.

**Low-income, high-spenders**. The second cluster consists of customers who have low income and high spending.

**Middle-income, middle-spenders**. The third cluster consists of customers who have middle income and middle spending.

**High-income, low-spenders**. The fourth cluster consists of customers who have high income and low spending.

**High-income, high-spenders**. The fifth cluster consists of customers who have high income and high spending.
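
Cluster numbers returned by k-means are arbitrary, so names like the ones above come from inspecting each cluster's typical income and spending. A minimal plain-Python sketch of that summary step follows. The rows are hypothetical and only three clusters are shown to keep it short.

```python
from collections import defaultdict

# Hypothetical (income, spending, label) rows, made up for illustration
rows = [
    (20, 15, 0), (25, 10, 0),   # low income, low spending
    (22, 80, 1), (18, 90, 1),   # low income, high spending
    (85, 85, 2), (95, 75, 2),   # high income, high spending
]

# Group the rows by cluster label
groups = defaultdict(list)
for income, spending, label in rows:
    groups[label].append((income, spending))

# Mean income and mean spending per cluster
summary = {
    label: (
        sum(i for i, _ in members) / len(members),
        sum(s for _, s in members) / len(members),
    )
    for label, members in groups.items()
}

for label, (mean_income, mean_spending) in sorted(summary.items()):
    print(f'cluster {label}: mean income {mean_income}, mean spending {mean_spending}')
```

With the notebook's DataFrame, a similar summary could be obtained by grouping on the `labels` column and taking the mean of `Income` and `Spending`.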


In [None]:
# Visualise the clusters, considering Age and Spending
plt.figure(figsize=(27,27))

# Fit on the original features only, so a previous 'labels' column is never used as input
features = X.drop(columns='labels', errors='ignore')

for i, k in enumerate([2, 3, 4, 5], start=1):
    plt.subplot(3, 2, i)
    plt.title(f'K = {k}: Age vs Spending', fontsize=22)
    # Axis labels match the scatter call: Spending on x, Age on y
    plt.xlabel('Spending')
    plt.ylabel('Age')
    model = KMeans(n_clusters=k)
    X['labels'] = model.fit_predict(features)
    for label in range(k):
        plt.scatter(X.Spending[X.labels == label], X.Age[X.labels == label])

k=2 stands out, as we can somewhat see two groupings in these plots. Let's take a closer look at k=2.

In [None]:
# Visualise interesting clusters
plt.figure(figsize=(24,12))

plt.title('K = 2: Age vs Spending', fontsize=22)
# Axis labels match the scatter call: Spending on x, Age on y
plt.xlabel('Spending')
plt.ylabel('Age')
model = KMeans(n_clusters=2)
# Fit on the original features only, never on a previous 'labels' column
X['labels'] = model.fit_predict(X.drop(columns='labels', errors='ignore'))
for label in range(2):
    plt.scatter(X.Spending[X.labels == label], X.Age[X.labels == label])

It turns out there are two interesting clusters based on a customer's age and spending score.

**Young to middle age high spenders**. One cluster consists of young to middle-aged customers who have high spending.

**Low to middle spenders**. The other cluster consists of all other customers, who have low to middle spending.

The young to middle age high spenders may be a valuable demographic to target.

## Learn more
If you found this example interesting, you may also want to check out:

* [Machine learning in minutes - very fast fundamental examples in Python](https://www.kaggle.com/jamiemorales/machine-learning-in-minutes-very-fast-examples)
* [List of machine learning methods & datasets](https://www.kaggle.com/jamiemorales/list-of-machine-learning-methods-datasets)

Thanks for reading. Don't forget to upvote.