__________________________
## 2. Unsupervised Learning
__________________________

 
### ____________________________ 2.1 Clustering _____________________________

##### Clustering is mainly used for customer segmentation and data mining.

#### Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. In other words, it groups objects on the basis of the similarity and dissimilarity between them.

#### Types of Clustering

##### 1. Hard Clustering: In hard clustering, each data point belongs to exactly one cluster. A data point cannot be shared between two or more clusters.

##### 2. Soft Clustering: In soft clustering, instead of assigning each data point to a single cluster, each data point is given a probability or likelihood of belonging to each cluster. Each of these values lies between 0 and 1.
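
The difference between the two types can be seen on a tiny example. The following is a minimal sketch, not taken from the original notebook: the toy array `X_toy` is made up, `KMeans` stands in for a hard clustering method, and `GaussianMixture` stands in for a soft one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy 1-D data (made up for illustration): two groups plus one point in between.
X_toy = np.array([[0.0], [0.2], [0.4], [2.0], [4.6], [4.8], [5.0]])

# Hard clustering: every point receives exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

# Soft clustering: every point receives one membership probability per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_toy)
soft_probs = gmm.predict_proba(X_toy)

print(hard_labels)          # one integer label per point
print(soft_probs.round(3))  # one row per point; each row sums to 1
```

Each row of `soft_probs` holds values between 0 and 1 that sum to 1, whereas `hard_labels` holds a single cluster index per point.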

#### Clustering Algorithms

##### 1. K-Means Clustering: K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean (the centroid), which serves as a prototype of the cluster.

##### 2. Hierarchical Clustering: Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. The hierarchy can be drawn as a tree (a dendrogram) and cut at different levels to obtain different numbers of clusters.

##### 3. Density-Based Clustering: Density-based clustering is a family of algorithms, such as DBSCAN, that define clusters in terms of point density. The main idea is that a cluster is a region of high density, separated from other such regions by regions of low density.

##### 4. Grid-Based Clustering: Grid-based clustering divides the data space into a grid of cells and assigns each data point to the cell it falls in. Clusters are then formed from the cells, typically by merging adjacent cells that contain enough points, rather than from the individual data points.

##### 5. Model-Based Clustering: Model-based clustering is a clustering technique that uses a probabilistic model to cluster the data. The model-based clustering technique is based on the assumption that the data points are generated from a mixture of distributions.

##### 6. Clustering Based on Similarity: Clustering based on similarity groups the data points according to how similar they are to each other. The similarity between two data points is calculated using a similarity measure, such as Euclidean distance or cosine similarity.
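
The choice of similarity measure changes which points count as "close". The sketch below is an illustration only: the three points are made up, and SciPy's `pdist` is used as one possible way to compute pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up 2-D points: a and b point in the same direction, c is orthogonal to both.
points = np.array([
    [3.0, 4.0],   # a
    [6.0, 8.0],   # b
    [4.0, -3.0],  # c
])

# Pairwise Euclidean distances as a square matrix.
euclid = squareform(pdist(points, metric="euclidean"))

# Pairwise cosine distances (1 - cosine similarity) as a square matrix.
cosine = squareform(pdist(points, metric="cosine"))

print(euclid.round(2))
print(cosine.round(2))
```

Under Euclidean distance, a and b are 5 units apart; under cosine distance they are identical (distance 0) because they point in the same direction. A clustering algorithm may therefore group them differently depending on the measure.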

## 2.1.1 K-Means Clustering

#### K-Means partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean. That mean serves as the prototype of the cluster.
#### The ‘means’ in K-Means refers to averaging the data points in a cluster; that is, finding the centroid.

#### The K-Means algorithm works as follows:

#### Step-1: Choose the number of clusters, K.

#### Step-2: Select K random points as the initial centroids. (They do not have to be points from the input dataset.)

#### Step-3: Assign each data point to its closest centroid. This forms K clusters.

#### Step-4: Recompute the centroid of each cluster as the mean of the data points assigned to it.

#### Step-5: Repeat Step-3; that is, reassign each data point to its new closest centroid.

#### Step-6: If any data point changed cluster, go back to Step-4; otherwise, stop.

#### Step-7: The model is ready.
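
The steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the demo array `X_demo` are hypothetical, and the initial centroids are drawn from the data points for simplicity.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Hypothetical helper: plain K-Means on a float array X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop once no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Made-up demo data: two well-separated squares of four points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
print(labels)
```

On data this well separated, the two centroids end up at the centers of the two squares. On real data the result depends on the random initialization, which is why library implementations usually run several initializations and keep the best one.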

In [3]:
# K-Means

from sklearn.cluster import KMeans

K_means = KMeans(init="k-means++", n_clusters=4, n_init=12)


# Constructor parameters:
# init="k-means++" : selects the initial cluster centers in a way that spreads them out, which usually speeds up convergence.
# max_iter : maximum number of iterations of the k-means algorithm for a single run.
# n_clusters : the number of clusters to form, which is also the number of centroids to generate.
# n_init : number of times the k-means algorithm is run with different centroid seeds. The final result is the best of the n_init runs in terms of inertia.
# random_state : controls the random number generation for centroid initialization. Pass an integer to make the results reproducible; with the default (None), NumPy's global random state is used.
# n_jobs, precompute_distances : accepted by older scikit-learn releases only; both were deprecated in version 0.23 and later removed, so do not pass them to current versions.
#
# Method:
# predict(X) : returns the index of the closest cluster center for each sample in X (call it after fit).
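
The cell above only constructs the estimator. As a rough sketch of how it might then be used, the example below fits the same configuration on synthetic data. The data from `make_blobs`, the variable names, and the fixed `random_state` are assumptions made for illustration; they are not part of the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 4 blobs, used here only for illustration.
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Same settings as above, plus a fixed seed so the run is reproducible.
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12, random_state=0)
k_means.fit(X_blobs)

print(k_means.cluster_centers_.shape)  # (4, 2): one centroid per cluster
print(k_means.labels_[:10])            # cluster index of the first 10 samples
print(k_means.inertia_)                # within-cluster sum of squares of the best run

# Assign previously unseen points to the nearest learned centroid.
new_labels = k_means.predict([[0.0, 0.0], [2.0, 1.0]])
print(new_labels)
```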



### Elbow Method

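#### The elbow method uses the within-cluster sum of squares (WCSS) to choose the number of clusters. WCSS values are plotted on the Y-axis against the number of clusters on the X-axis, so we calculate the WCSS for k values ranging from 1 to 10.

#### The WCSS value is calculated by adding up the squared distances between each data point and its assigned centroid. For a fixed k, a lower WCSS means tighter clusters. However, WCSS generally keeps decreasing as k grows, so the lowest value is not the target: a good k is at the 'elbow' of the curve, where adding more clusters stops reducing the WCSS substantially.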
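
To make the WCSS definition concrete, the sketch below computes it by hand and compares it with the `inertia_` attribute that scikit-learn's `KMeans` reports. The synthetic data from `make_blobs` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used here only for illustration.
X_wcss, _ = make_blobs(n_samples=200, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_wcss)

# Look up, for every point, the centroid of the cluster it was assigned to.
assigned_centroids = km.cluster_centers_[km.labels_]

# WCSS: sum of squared Euclidean distances between points and their centroids.
wcss = np.sum((X_wcss - assigned_centroids) ** 2)

print(wcss)
print(km.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why the elbow code reads `inertia_` directly.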

In [None]:
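# elbow method

import matplotlib.pyplot as plt

# X: feature matrix of shape (n_samples, n_features), assumed to be defined earlier in the notebook

distortions = []

# fit K-Means for k = 1..10 and record the WCSS (inertia_) of each fit
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion (WCSS)')
plt.show()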

## 2.1.2 Hierarchical Clustering

#### Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters, usually visualized as a dendrogram. It is a type of unsupervised learning.

#### The most common variant, agglomerative clustering, starts with every data point in a cluster of its own. Then the two closest clusters are merged into a single cluster, and this is repeated until only one cluster is left.

#### There are two types of hierarchical clustering:

##### 1. Agglomerative (bottom-up): Each data point is treated as a single cluster in the beginning. Then, the two closest clusters are merged into a single cluster. This process is repeated until there is only a single cluster left.

##### 2. Divisive (top-down): All the data points are assigned to a single cluster in the beginning. Then, a cluster is split into two smaller clusters. This process is repeated until each data point is in its own cluster.
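
The agglomerative merge order can be inspected directly. The following sketch uses five made-up 1-D points and SciPy's `linkage` and `fcluster`; the data and the choice of three flat clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five made-up 1-D points: two tight pairs and one far-away point.
pts = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

# Each row of Z records one merge:
# [index of cluster a, index of cluster b, merge distance, size of the new cluster].
Z = linkage(pts, method="complete", metric="euclidean")
print(Z)

# Cut the hierarchy so that at most 3 flat clusters remain.
flat = fcluster(Z, t=3, criterion="maxclust")
print(flat)
```

With complete linkage, the two tight pairs merge first (distance 1 each), then the pairs merge with each other (distance 6), and the far-away point joins last (distance 20). Cutting at three clusters keeps the two pairs and the isolated point separate.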



In [None]:
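# hierarchical clustering

import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# X: feature matrix of shape (n_samples, n_features), assumed to be defined earlier in the notebook

# complete linkage on the condensed matrix of pairwise Euclidean distances
row_clusters = linkage(pdist(X, metric='euclidean'), method='complete')

# no labels argument: the leaves are labelled with the row indices of X
row_dendr = dendrogram(row_clusters)

plt.tight_layout()
plt.ylabel('Euclidean distance')
plt.show()

# agglomerative clustering

from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so it is not passed explicitly.
# (The keyword for it was renamed from affinity to metric in recent scikit-learn releases.)
ac = AgglomerativeClustering(n_clusters=2, linkage='complete')

labels = ac.fit_predict(X)

print('Cluster labels: %s' % labels)

# DBSCAN (density-based clustering, not hierarchical)

from sklearn.cluster import DBSCAN

# eps: neighbourhood radius; min_samples: minimum number of points within eps (including the point itself) for a core point
db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')

# label -1 marks noise points that belong to no cluster
y_db = db.fit_predict(X)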
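
As a rough sketch of how the DBSCAN output can be read, the example below runs the same settings on a synthetic two-moons dataset. The data from `make_moons` and the variable names are assumptions for illustration; the original notebook does not say what `X` contains.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic half-moon data, used here only for illustration.
X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5, metric="euclidean")
y_db = db.fit_predict(X_moons)

# DBSCAN chooses the number of clusters itself; label -1 marks noise points.
n_clusters = len(set(y_db)) - (1 if -1 in y_db else 0)
n_noise = int(np.sum(y_db == -1))

print("clusters found:", n_clusters)
print("noise points:", n_noise)
```

Unlike K-Means, DBSCAN does not take the number of clusters as an input. The result is sensitive to `eps` and `min_samples`, so both usually need tuning for a given dataset.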
