## Clustering

Clustering is a type of __unsupervised machine learning__ in which data points are grouped into two or more clusters. Data points in the same cluster are more similar to each other than to those in other clusters. This __similarity__ is measured in some specified way (for example, by a distance metric), and the strength of similarity between data points is used to assign each data point to its cluster.

There are __hard clustering__ and __soft clustering__ methods. In hard clustering, each data point belongs to exactly one cluster. In soft clustering, each data point can belong to more than one cluster with some probability. The number of clusters can be defined by the user. However, in some cases even the user does not know how many clusters the data should be grouped into. Therefore, figuring out the best number of clusters is also part of the clustering task.
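
To make the distinction concrete, here is a minimal sketch on synthetic data. It assumes `KMeans` as the hard method and `GaussianMixture` as the soft method; the latter is not used elsewhere in this notebook and is only one of several possible soft clustering choices. The data from `make_blobs` and the names `X_demo`, `hard_labels` and `soft_probs` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# synthetic 2-D data with two groups (illustrative only, not the bank data)
X_demo, _ = make_blobs(n_samples=200, centers=2, random_state=0)

# hard clustering: every point receives exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# soft clustering: every point receives a membership probability per cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_demo)
soft_probs = gmm.predict_proba(X_demo)

print(hard_labels[:5])          # one integer label per point
print(soft_probs[:5].round(3))  # one row of probabilities per point
```

Each row of `soft_probs` sums to 1, whereas `hard_labels` holds a single cluster index per point.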

### K-Means Clustering
* Hard clustering method.
* A centroid-based clustering method. 
* Given a cluster, its __centroid__ is its center, computed as the mean of the data points in that cluster.
* A centroid can be real (an actual data point) or imaginary (a location that does not coincide with any data point).
* K-Means employs an iterative algorithm that measures similarity by the distance of each data point from the centroid of a cluster.
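
As a small illustration of an "imaginary" centroid, the sketch below takes the mean of four made-up points (the array `points` is hypothetical) and checks whether that mean coincides with any of them.

```python
import numpy as np

# four made-up points forming the corners of a square
points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])

# the centroid is the coordinate-wise mean of the points
centroid = points.mean(axis=0)

# check whether the centroid coincides with one of the actual points
is_real_point = any(np.array_equal(centroid, p) for p in points)

print(centroid, is_real_point)
```

Here the centroid lies in the middle of the square, where there is no data point.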

Let's begin by downloading a small sample (version 2) of the marketing campaign dataset of a Portuguese banking institution, available on [OpenML](https://www.openml.org/). The data is related to direct marketing campaigns that used phone calls to subscribe clients to a bank term deposit. A detailed description of the dataset is available [here](https://www.openml.org/d/1461).

We will use the `fetch_openml` function from the `datasets` module of sklearn. The function provides easy access to the OpenML API to download available datasets.

In [None]:
from sklearn.datasets import fetch_openml

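# fetch by using data name and version
# as_frame=True returns the data as a pandas DataFrame, which the cells below rely on
bank_marketing = fetch_openml(name='bank-marketing', version=2, as_frame=True) # try version 1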

data = bank_marketing.data 

Let's take a quick look at the data.

In [None]:
data.head()

Each row belongs to a bank client. Each column provides additional details related to that client, including data from the last contact of the current campaign. The columns do not have descriptive names, so let's add them to the dataset.

In [None]:
data.columns = ['age', 'job', 'marital_status', 'education', 'credit_default', 'balance', 'housing', 'loan', 
             'lastcontact_type', 'lastcontact_dayofmonth', 'lastcontact_month', 'lastcontact_duration', 
             'n_contacts', 'days_since_lastcontact', 'previous_n_contacts', 'previous_outcome']
data.head()

Our task is to figure out how to devise a marketing campaign that optimizes client subscriptions to a bank term deposit.

The data is not ready to feed into an ML algorithm. Therefore, we need to perform various __data cleaning__ or __data pre-processing__ steps first. Let's start by creating the feature matrix using selected columns.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
X = data[['job', 'education',  'credit_default', 'balance', 'housing', 'loan']]
X.head()

In [None]:
X['education'].value_counts()

In [None]:
X.info()

There are no missing values in this dataset. However, many columns have the `categorical` data type. These categorical columns need to be converted into numeric form so that ML algorithm implementations can accept them.

Note that some of these categorical columns have "unknown" values and could be treated the same way as missing values. 

For now, we will use all categories available including "unknown". However, I encourage you to look at these categories closely and find new ways of dealing with them such that model performance can perhaps be improved.
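
As one possible starting point for that exploration, the sketch below treats "unknown" as a missing value on a small made-up frame (`toy` is hypothetical, not the bank data) and then fills the gaps with the most frequent category. Mode imputation is only an assumption here; other strategies may suit the data better.

```python
import numpy as np
import pandas as pd

# made-up frame that mimics two of the categorical columns
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'technician', 'unknown'],
    'education': ['primary', 'secondary', 'unknown', 'tertiary'],
})

# treat "unknown" as a missing value
toy_nan = toy.replace('unknown', np.nan)
print(toy_nan.isna().sum())

# one option among many: fill missing entries with the most frequent category
toy_filled = toy_nan.fillna(toy_nan.mode().iloc[0])
print(toy_filled)
```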

To convert categorical data into numeric form, we can use the [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) object from the `preprocessing` module of sklearn. The encoder follows the `fit_transform` framework used by many other sklearn objects. Given a dataset, the encoder finds the unique categories of each feature and turns each category into a new column, which holds a value of 1 if the row belongs to that category and 0 otherwise. This process is also known as __one-hot encoding__, a form of __vector representation__.

__Note__ that `pandas` also offers a function called `get_dummies`, which converts categorical variables into dummy variables in much the same way.
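
For comparison, here is a minimal sketch of `get_dummies` on a small made-up frame (`toy` is hypothetical and only mimics two of the columns). The `dtype=int` argument is an illustrative choice that makes the output show 0/1 values.

```python
import pandas as pd

# made-up frame with two categorical columns
toy = pd.DataFrame({
    'housing': ['yes', 'no', 'yes'],
    'loan': ['no', 'no', 'yes'],
})

# each category becomes its own 0/1 column, named <column>_<category>
toy_dummies = pd.get_dummies(toy, columns=['housing', 'loan'], dtype=int)
print(toy_dummies)
```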

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# make a list of categorical columns and isolate these features
cat_feat = ['job', 'education', 'credit_default', 'housing', 'loan']
X_cat = X[cat_feat]
X_cat.head()

In [None]:
# create an instance of OneHotEncoder
enc = OneHotEncoder()

# apply fit_transform on dataframe with categorical features only 
X_cat_ohe = enc.fit_transform(X_cat)

# convert result into numpy array
X_cat_ohe = X_cat_ohe.toarray()

# convert result into pandas dataframe
X_cat_ohe = pd.DataFrame(X_cat_ohe)
X_cat_ohe.head()

In [None]:
# rename columns (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
X_cat_ohe.columns = enc.get_feature_names_out(cat_feat)
X_cat_ohe.head()

The categorical data is now in an acceptable format. Let's drop the original columns from the feature matrix and add these columns instead.

In [None]:
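# drop the original categorical columns and append the one-hot encoded ones
# (avoids an in-place drop on a slice of `data`, which can trigger a SettingWithCopyWarning)
X = pd.concat([X.drop(columns=cat_feat), X_cat_ohe], axis=1)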
X.head()

Now we can use the `.fit` framework of sklearn to apply the [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering algorithm to this data. This framework trains the model on the provided data; the fitted model can then be used to obtain predictions.

In [None]:
from sklearn.cluster import KMeans

In [None]:
k_means = KMeans(n_clusters=2, init='random', random_state=0)
k_means.fit(X)

The parameter `n_clusters` sets the number of clusters we wish to have. Here we have asked for the data to be grouped into two clusters. The parameter `init` sets the method used to initialize the centroids. We also specify the `random_state` parameter so that the result can be replicated in future runs. Through the `algorithm` parameter, we can also choose between Lloyd's and Elkan's algorithms.
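
The following sketch shows how these parameters might be varied. It runs on synthetic data from `make_blobs` rather than the bank data, and the specific values (`init='k-means++'`, `n_init=10`, `algorithm='elkan'`) are illustrative assumptions, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic 2-D data with three groups (illustrative only)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km_alt = KMeans(
    n_clusters=3,        # number of clusters to form
    init='k-means++',    # alternative to 'random' initialization
    n_init=10,           # number of initializations; the best run is kept
    algorithm='elkan',   # Elkan's variant instead of Lloyd's
    random_state=0,      # makes the run reproducible
)
km_alt.fit(X_demo)

print(km_alt.cluster_centers_)
print(km_alt.n_iter_)
```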

The `KMeans` object has attributes such as `cluster_centers_`, `labels_`, `inertia_` and `n_iter_`.

In [None]:
print(k_means.cluster_centers_)

In [None]:
print(k_means.labels_)

In [None]:
np.unique(k_means.labels_)

In [None]:
print(k_means.inertia_)

In [None]:
print(k_means.n_iter_)

### Steps in K-Means Algorithm

1. Choose the number of clusters *k*
2. Randomly initialize *k* centroids
3. Assign each point to its closest centroid
4. Compute mean of each cluster and call it the new centroid
5. Repeat steps 3 and 4 until the centroid positions do not change
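
The steps above can be sketched in plain NumPy. This is a simplified illustration, not sklearn's implementation: the function name `kmeans_sketch`, the iteration cap `max_iter`, the choice of initial centroids from the data points, and the handling of empty clusters are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Simplified K-Means following the five steps listed above."""
    rng = np.random.default_rng(seed)
    # step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 3: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: the mean of each cluster becomes its new centroid
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # step 5: stop once the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# made-up data: two groups of 50 points each
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# step 1: choose the number of clusters k
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
```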

In [None]:
k_means_3k = KMeans(n_clusters=3, init='random', random_state=0)
k_means_3k.fit(X)
print(k_means_3k.cluster_centers_)
print(k_means_3k.inertia_)
print(k_means_3k.n_iter_)

In [None]:
print(k_means_3k.labels_)

In [None]:
np.unique(k_means_3k.labels_)

### Evaluation

#### Silhouette coefficient 
* A measure of cluster cohesion and separation. 
* Quantifies how well a data point fits into its assigned cluster based on two factors:
    * How close the data point is to other points in the __same__ cluster
    * How far away the data point is from points in __other__ clusters
* Values range between -1 and 1; larger numbers indicate that samples are closer to their assigned clusters than they are to other clusters.

The [`silhouette_score` function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) is available in sklearn's `metrics` module.
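
To see what the score measures, the sketch below computes it by hand on six made-up points and compares the result with sklearn. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the points of the nearest other cluster, and the point's coefficient is `(b - a) / max(a, b)`. The arrays `X_demo` and `labels_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# six made-up points in two obvious groups
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X_demo)  # Euclidean distances between all pairs
scores = []
for i in range(len(X_demo)):
    same = labels_demo == labels_demo[i]
    same[i] = False  # exclude the point itself
    a = D[i, same].mean()  # cohesion: mean distance within its own cluster
    b = min(D[i, labels_demo == c].mean()  # separation: nearest other cluster
            for c in np.unique(labels_demo) if c != labels_demo[i])
    scores.append((b - a) / max(a, b))

manual = float(np.mean(scores))
print(round(manual, 3), round(float(silhouette_score(X_demo, labels_demo)), 3))
```

The two printed values should agree, since `silhouette_score` returns the mean coefficient over all samples.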

#### Elbow Method
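* A technique to estimate the best number of clusters *k*.
* Run K-Means on the same data with multiple values of *k* and plot the sum of squared errors (`.inertia_`) against *k*. Because this quantity generally keeps decreasing as *k* grows, choose the *k* at the "elbow" of the curve, beyond which adding more clusters yields only small improvements.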

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
silhouette = []
inertia = []
krange = range(2, 11)
for i in krange:
    print(i)
    kmeans = KMeans(n_clusters=i, init='random', random_state=0)
    kmeans.fit(X)
    sscore = round(silhouette_score(X, kmeans.labels_),2)
    silhouette.append(sscore)
    inertia.append(kmeans.inertia_)

In [None]:
# plot
fig, ax = plt.subplots(figsize=(10, 8))
plt.plot(krange, inertia, marker='o')
for i, txt in enumerate(silhouette):
    plt.annotate('S='+str(txt), (krange[i], inertia[i]))
plt.xlabel('Number of clusters')
plt.ylabel('Sum of Squared Errors')
plt.title('Elbow method for optimal k')
plt.show()