# Unsupervised Learning

Unsupervised learning is the training of an artificial intelligence (AI) algorithm using information that is neither classified nor labeled and allowing the algorithm to act on that information without guidance.

In unsupervised learning, an AI system is presented with unlabeled, uncategorised data and the system’s algorithms act on the data without prior training. The output is dependent upon the coded algorithms. Subjecting a system to unsupervised learning is one way of testing AI.

Unsupervised learning algorithms can tackle tasks that supervised learning cannot, such as discovering structure in data for which no labels exist. However, their results can be more unpredictable than those of supervised models. While an unsupervised learning AI system might, for example, figure out on its own how to sort cats from dogs, it might also add unforeseen and undesired categories to deal with unusual breeds, creating clutter instead of order.

## Clustering

Clustering is the task of dividing a population or set of data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. It is essentially a grouping of objects on the basis of the similarity and dissimilarity between them.

For example, the data points that lie close together in the graph below can be assigned to a single group. We can distinguish three clusters in the picture.

Some commonly used clustering algorithms:
1. K-means
2. DBSCAN
3. Mean-Shift
4. Hierarchical

![Image](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/merge3cluster.jpg)
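
As a rough illustration of the algorithms listed above, the sketch below runs scikit-learn's implementations on synthetic data. The blob centers, `eps`, `min_samples`, and the variable name `X_toy` are assumptions chosen for this toy example, not recommended settings.

```python
from sklearn.cluster import KMeans, DBSCAN, MeanShift, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs (the centers are arbitrary choices).
centers = [[0, 0], [10, 10], [-10, 10]]
X_toy, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

models = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "Mean-Shift": MeanShift(),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
}

labels_by_model = {}
for name, model in models.items():
    labels = model.fit_predict(X_toy)
    labels_by_model[name] = labels
    # DBSCAN marks noise points with the label -1, so exclude it from the count.
    n_found = len(set(labels) - {-1})
    print(f"{name}: {n_found} clusters")
```

K-means and hierarchical clustering are told how many clusters to produce, whereas DBSCAN and Mean-Shift infer the number from the density of the data.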


## Clustering using K-means

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.

![Image](https://www.jeremyjordan.me/content/images/2016/12/kmeans.gif)

## How K-means works

The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:

1. Data assignment step:

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is a centroid in the set of centroids $C$, then each data point $x$ is assigned to the cluster whose centroid is given by

$$\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2$$

where dist( · ) is the standard (L2) Euclidean distance. Let $S_i$ be the set of data points assigned to the $i$-th cluster centroid.

2. Centroid update step:

In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster.

$$c_i=\frac{1}{|S_i|}\sum_{x \in S_i} x$$

The algorithm iterates between steps one and two until a stopping criterion is met (for example, no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
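
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any library: the function name `kmeans_sketch`, the toy data, and the choice to keep an old centroid when a cluster becomes empty are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-means following the two steps above (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points picked at random from the data set.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    inertia_history = []

    for _ in range(max_iter):
        # Data assignment step: squared Euclidean distance to every centroid.
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        inertia_history.append(sq_dist[np.arange(len(X)), new_labels].sum())

        # Stopping criterion: no data point changed cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Centroid update step: mean of the points assigned to each cluster.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[i] = members.mean(axis=0)

    return centroids, labels, inertia_history


rng = np.random.default_rng(42)
# Toy data: two Gaussian blobs around (0, 0) and (5, 5).
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels, history = kmeans_sketch(X_demo, k=2)
print(centroids)
print(history)
```

The recorded sum of squared distances never increases from one iteration to the next, which is why the loop is certain to stop.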

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not necessarily the best possible outcome), meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
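
To see why several runs can help, the sketch below fits scikit-learn's `KMeans` ten times from different random starting centroids and compares the resulting inertia (the sum of squared distances to the assigned centroids). The synthetic data and the seed range are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with five blobs (an arbitrary choice for illustration).
X_toy, _ = make_blobs(n_samples=500, centers=5, random_state=1)

# One K-means run per seed, each starting from different random centroids.
inertias = []
for seed in range(10):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X_toy)
    inertias.append(km.inertia_)

best_seed = min(range(10), key=lambda s: inertias[s])
print("Inertia per run:", [round(v, 1) for v in inertias])
print("Best run: seed", best_seed)
```

In practice, the `n_init` parameter of `KMeans` does this automatically: it runs the algorithm that many times and keeps the run with the lowest inertia.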

## Choosing K

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but a reasonable estimate can be obtained using the following techniques.

One of the metrics that is commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.
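
For reference, a closely related metric is the within-cluster sum of squares (WCSS), a squared-distance variant of the mean distance described above. Using the notation from the assignment and update steps, it can be written as:

$$\mathrm{WCSS}(K)=\sum_{i=1}^{K}\sum_{x \in S_i} dist(c_i,x)^2$$

Dividing by the number of data points gives the mean squared distance to the assigned centroid. scikit-learn exposes this sum as the `inertia_` attribute of a fitted `KMeans` model.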

A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.
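
As one example of these techniques, the silhouette method can be sketched with scikit-learn's `silhouette_score`. The synthetic blobs and the range of K values tried are assumptions for illustration; on real data the scores are usually less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (the centers are arbitrary choices).
centers = [[0, 0], [10, 10], [-10, 10]]
X_toy, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Higher is better: points sit close to their own cluster and far from the others.
best_k = max(scores, key=scores.get)
print(scores)
print("Best K by silhouette:", best_k)
```

Unlike the mean distance to the centroid, the silhouette score does not keep improving as K grows, so its maximum can be used directly as a candidate for K.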

![Image](https://media.giphy.com/media/42dsvcMDP3diU/giphy.gif)

# Problem Statement

The data provider @arjunbhasin2013 says: 
> This case requires to develop a customer segmentation to define marketing strategy. The sample Dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables.


In [None]:
#Importing all the necessary packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer

In [None]:
# A first step in analysing any dataset is to check whether it has any missing values, so let's do that first

data = pd.read_csv("../input/CC GENERAL.csv")
missing = data.isna().sum()
print(missing)

In [None]:
# Aha! Minimum payments and credit limit have missing values. We can fill in those missing values with either the mean or the median of the respective column.

data['MINIMUM_PAYMENTS'] = data['MINIMUM_PAYMENTS'].fillna(data['MINIMUM_PAYMENTS'].median())
data['CREDIT_LIMIT'] = data['CREDIT_LIMIT'].fillna(data['CREDIT_LIMIT'].median())
# The customer ID is only an identifier, not a behavioral feature, so drop it
data = data.drop(['CUST_ID'],axis=1)

In [None]:
# Let's take a look at how our data looks
data.head()

In [None]:
# It's always good practice to remove unnecessary information from your dataset. This helps the algorithm converge better.
# To find out which features are important, let's take a look at their variance. Variance provides a quick overview of how spread out a feature is.
# The lower the variance, the less important the feature is.

for j in list(data.columns.values):
    print("Feature: {0}, Variance: {1}".format(j,data[j].var()))

In [None]:
# Let's get rid of the features with low variance

low_variance_cols = ["BALANCE_FREQUENCY", "PURCHASES_FREQUENCY", "ONEOFF_PURCHASES_FREQUENCY",
                     "PURCHASES_INSTALLMENTS_FREQUENCY", "CASH_ADVANCE_FREQUENCY", "CASH_ADVANCE_TRX",
                     "PURCHASES_TRX", "PRC_FULL_PAYMENT", "TENURE"]
data = data.drop(low_variance_cols, axis=1)
for j in list(data.columns.values):
    print("Feature: {0}, Variance: {1}".format(j,data[j].var()))

In [None]:
# The next step is to check whether we are dealing with a lot of outliers. Having a lot of them can cause K-means to perform poorly.
plt.figure(figsize=(20,10))
for j in list(data.columns.values):
    # Label each series so that plt.legend() has entries to show
    plt.scatter(x=range(len(data[j])), y=data[j], s=20, label=j)
plt.legend()

In [None]:
# Luckily, this dataset has very few outliers, which we can ignore for now. However, one problem that is evident in the graph above is the difference in scale between the features.
# K-means relies on distances, so features with large values would dominate the clustering if left unscaled. So let's fix that.

X = PowerTransformer(method='yeo-johnson').fit_transform(data)
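
As a quick sanity check of what this transformation does, the sketch below applies the same `PowerTransformer` to a single synthetic, right-skewed column. The column name `amount` and the log-normal data are hypothetical stand-ins, not taken from the credit card dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature standing in for a column such as a balance.
skewed = pd.DataFrame({"amount": rng.lognormal(mean=3.0, sigma=1.0, size=1000)})

transformed = PowerTransformer(method="yeo-johnson").fit_transform(skewed)
after = pd.DataFrame(transformed, columns=skewed.columns)

print("skew before:", skewed["amount"].skew())
print("skew after: ", after["amount"].skew())
print("mean, std after:", after["amount"].mean(), after["amount"].std(ddof=0))
```

Because `PowerTransformer` standardizes its output by default (`standardize=True`), the transformed column ends up with zero mean and unit variance in addition to being less skewed.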

In [None]:
# Now that our data is ready, we can run our K-means algorithm over it. K-means needs the number of clusters up front,
# so we fit it for a range of K values and look for the "elbow" in the WCSS curve. This is called the elbow technique.

k_values = range(1, 30)
wcss = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, max_iter=300)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot WCSS against the actual K values so that the x-axis starts at 1 rather than 0
plt.plot(k_values, wcss, 'ro-', label="WCSS")
plt.title("Computing WCSS for KMeans++")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()

In [None]:
# Somewhere near 7, the slope starts to flatten significantly. So 7 is a good number of clusters to start with.


kmeans = KMeans(n_clusters=7, init="k-means++", n_init=10, max_iter=300) 
y_pred = kmeans.fit_predict(X)

In [None]:
# Now that we have our clusters, the only thing left to do is to find the common traits among the members of each cluster.
# A pairplot is the simplest way to visualize this relationship.

data["cluster"] = y_pred
# pairplot draws its own legend for the hue variable, so no separate plt.legend() call is needed
ss = sns.pairplot(data, hue="cluster")
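
A numeric summary per cluster can complement the pairplot when comparing groups. The sketch below uses a tiny hand-made frame; the values and the names `toy` and `profile` are hypothetical and only illustrate the idea of averaging each feature within a cluster and counting its members.

```python
import pandas as pd

# Hand-made toy data: five customers already assigned to two clusters.
toy = pd.DataFrame({
    "PURCHASES": [100.0, 120.0, 5000.0, 5200.0, 90.0],
    "CREDIT_LIMIT": [1000.0, 1200.0, 9000.0, 11000.0, 800.0],
    "cluster": [0, 0, 1, 1, 0],
})

# Mean of every feature within each cluster, plus the number of customers in it.
profile = toy.groupby("cluster").mean()
profile["n_customers"] = toy.groupby("cluster").size()
print(profile)
```

The same two lines could be applied to a frame that holds a `cluster` column to read off which groups spend the most or have the highest credit limits.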

The goal was to segment the customers in order to define a marketing strategy. Unfortunately, the colors of the plots change each time this kernel is rerun, but here are some thoughts:

* **Big Spenders with large Payments** - they make expensive purchases and have a credit limit that is between average and high. This is only a small group of customers.
* **Cash Advances with large Payments** - this group takes the most cash advances. They make large payments, but this appears to be a small group of customers.
* **Medium Spenders with third highest Payments** - the second highest Purchases group (after the Big Spenders).
* **Highest Credit Limit but Frugal** - this group doesn't make a lot of purchases. It looks like the 3rd largest group of customers.
* **Cash Advances with Small Payments** - this group likes taking cash advances, but makes only small payments.
* **Small Spenders and Low Credit Limit** - they have the smallest Balances after the Smallest Spenders, and their Credit Limit is in the bottom 3 groups. This is the second largest group of customers.
* **Smallest Spenders and Lowest Credit Limit** - this is the group with the lowest credit limit but they don't appear to buy much. Unfortunately this appears to be the largest group of customers.
* **Highest Min Payments** - this group has the highest minimum payments (which presumably refers to "Min Payment Due" on the monthly statement). This might be a reflection of the fact that they have the second lowest Credit Limit of the groups, so it looks like the bank has identified them as higher risk.

So a marketing strategy that targeted the first five groups might be effective. 