# 1. Introduction
Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups. In simple words, the aim is to segregate data points with similar traits and assign them to clusters.

Let’s understand this with an example. Suppose you are the head of a retail store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. What you can do instead is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.

# 2. Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:

* **Hard Clustering:** In hard clustering, each data point belongs to exactly one cluster. In the example above, each customer is put into one of the 10 groups.
* **Soft Clustering:** In soft clustering, instead of putting each data point into a single cluster, each data point is assigned a probability or likelihood of belonging to each cluster. In the example above, each customer is assigned a probability of being in each of the 10 clusters of the retail store.
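
To make the distinction concrete, here is a minimal sketch on synthetic data (not the retail example). `KMeans` returns one label per point (hard), while `GaussianMixture.predict_proba` returns one probability per cluster (soft). The blob data, the choice of three clusters and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Hard clustering: every point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: every point gets a probability for each cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)

print(hard_labels[:5])          # one integer label per point
print(soft_probs[:5].round(3))  # one row of probabilities per point
```

Each row of `soft_probs` sums to 1, so a hard assignment can always be recovered from a soft one by taking the most probable cluster.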

# 3. Types of clustering algorithms
Since the task of clustering is subjective, there are many ways to achieve it. Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, more than 100 clustering algorithms are known, but only a few of them are in common use. Let’s look at them in detail:

> * **Connectivity models:** As the name suggests, these models are based on the notion that data points closer together in data space are more similar to each other than data points lying farther apart. These models can follow two approaches. In the first (agglomerative) approach, every data point starts in its own cluster and clusters are merged step by step, closest pairs first. In the second (divisive) approach, all data points start in a single cluster, which is then split step by step. The choice of distance function is subjective. These models are very easy to interpret but lack the scalability to handle big datasets. Examples are the hierarchical clustering algorithm and its variants.
>
> * **Centroid models:** These are iterative clustering algorithms in which similarity is defined by how close a data point is to the centroid of a cluster. K-means is a popular algorithm in this category. In these models, the number of clusters has to be specified beforehand, which makes prior knowledge of the dataset important. These models run iteratively until they reach a local optimum.
>
> * **Distribution models:** These clustering models are based on how probable it is that all data points in a cluster belong to the same distribution (for example, the normal, i.e. Gaussian, distribution). These models often suffer from overfitting. A popular example is the Expectation-Maximization algorithm applied to a mixture of multivariate normal distributions.
>
> * **Density models:** These models search the data space for regions of differing density. They isolate the dense regions and assign the data points within each region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.

Now I will take you through two of the most popular clustering algorithms in detail: K-means clustering and hierarchical clustering. Let’s begin.
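
Before moving on, the four families above can be compared on a toy dataset. This is a rough sketch using scikit-learn estimators as stand-ins for each family. The two-moons data, the hyperparameters (`eps=0.3`, single linkage, two clusters) and the use of the adjusted Rand index are all assumptions chosen for illustration, not recommendations.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Two interleaved half-circles: a shape on which the families behave differently.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

models = {
    "connectivity (single-linkage agglomerative)": AgglomerativeClustering(n_clusters=2, linkage="single"),
    "centroid (k-means)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "distribution (Gaussian mixture fitted with EM)": GaussianMixture(n_components=2, random_state=0),
    "density (DBSCAN)": DBSCAN(eps=0.3, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)  # all four estimators expose fit_predict
    # Agreement with the true moon membership (1.0 means a perfect match).
    results[name] = adjusted_rand_score(y_true, labels)
    print(f"{name}: adjusted Rand index = {results[name]:.2f}")
```

On non-convex shapes like these, centroid and Gaussian models tend to cut the moons with a straight boundary, which is one reason the choice of family matters.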

# K-means clustering

## Importing modules

In [None]:
#Basic imports
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.decomposition import PCA #Principal Component Analysis
from sklearn.manifold import TSNE #T-Distributed Stochastic Neighbor Embedding
from sklearn.cluster import KMeans #K-Means Clustering
from sklearn.preprocessing import StandardScaler #used for 'Feature Scaling'

#plotly imports
import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True) #enable inline rendering for iplot

#other import 
import seaborn as sns
import matplotlib.pyplot as plt


We will write our code in functions so that it can later be reused in a .py module as well as in extensions of this notebook.

## Function for loading the data

In [None]:
def load_function():
    return pd.read_csv("../input/forest-cover-type-dataset/covtype.csv") #returning the dataframe

# Basic Analysis

In [None]:
df = load_function()
df.head()



**Remarks:**
* The dataset is large in terms of features.
* It contains 55 columns in total.

### Number of Records

In [None]:
len(df)

### Description

In [None]:
df.describe()

## Information/summary

In [None]:
df.info()

In [None]:
#There are no null values; uncomment the next line to verify
#df.isna().sum()

## New feature

In [None]:
df["Distance_To_Hydrology"] =( (df["Horizontal_Distance_To_Hydrology"] ** 2) + (df["Vertical_Distance_To_Hydrology"] ** 2) ) ** (0.5)
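
The new feature is the straight-line (Euclidean) distance to the nearest water source, i.e. the square root of the sum of the squared horizontal and vertical distances. The sketch below checks the formula on made-up values and shows that `np.hypot` gives the same result. The toy frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up distances for three cells (illustrative values only).
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3.0, 30.0, 0.0],
    "Vertical_Distance_To_Hydrology": [4.0, -40.0, 7.0],
})

# Same formula as in the cell above.
manual = (toy["Horizontal_Distance_To_Hydrology"] ** 2
          + toy["Vertical_Distance_To_Hydrology"] ** 2) ** 0.5

# Equivalent NumPy helper for the hypotenuse.
with_hypot = np.hypot(toy["Horizontal_Distance_To_Hydrology"],
                      toy["Vertical_Distance_To_Hydrology"])

print(manual.tolist())  # [5.0, 50.0, 7.0]
```

Note that the sign of the vertical distance is lost in the combined feature, which is fine for a distance but discards whether the water is above or below the cell.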

In [None]:
df.drop(["Horizontal_Distance_To_Hydrology","Vertical_Distance_To_Hydrology"], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
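#Map the integer class codes to readable names; assign back, since in-place replace on a selected column may not update df in recent pandas
df['Cover_Type'] = df['Cover_Type'].replace({1:'Spruce/Fir', 2:'Lodgepole Pine', 3:'Ponderosa Pine', 4:'Cottonwood/Willow', 5:'Aspen', 6:'Douglas-fir', 7:'Krummholz'})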

# One-hot-encoded dataframe

In [None]:
def one_hot_encoder(data):
    return pd.get_dummies(data)

df = one_hot_encoder(df)
df.head()
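
As a small illustration of what `pd.get_dummies` does here, the sketch below encodes a made-up three-row frame. Numeric columns pass through unchanged, and each category of a text column becomes its own indicator column. The toy values are assumptions. `dtype=int` is passed only to make the printed output 0/1, because recent pandas versions return boolean indicators by default.

```python
import pandas as pd

# Tiny illustrative frame: one numeric and one categorical column.
toy = pd.DataFrame({
    "Elevation": [2596, 3100, 2804],
    "Cover_Type": ["Aspen", "Krummholz", "Aspen"],
})

# Numeric columns pass through; each category becomes its own indicator column.
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
# ['Elevation', 'Cover_Type_Aspen', 'Cover_Type_Krummholz']
print(encoded["Cover_Type_Aspen"].tolist())  # [1, 0, 1]
```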

In [None]:
numerical_dataframe =  df[["Elevation","Aspect","Slope","Horizontal_Distance_To_Roadways","Hillshade_9am","Hillshade_Noon","Hillshade_3pm","Horizontal_Distance_To_Fire_Points","Distance_To_Hydrology"]]

In [None]:
categorical_dataframe = df[["Wilderness_Area1","Wilderness_Area2","Wilderness_Area3","Wilderness_Area4","Soil_Type1","Soil_Type2","Soil_Type3","Soil_Type4","Soil_Type5","Soil_Type6","Soil_Type7","Soil_Type8","Soil_Type9","Soil_Type10","Soil_Type11","Soil_Type12","Soil_Type13","Soil_Type14","Soil_Type15","Soil_Type16","Soil_Type17","Soil_Type18","Soil_Type19","Soil_Type20","Soil_Type21","Soil_Type22","Soil_Type23","Soil_Type24","Soil_Type25","Soil_Type26","Soil_Type27","Soil_Type28","Soil_Type29","Soil_Type30","Soil_Type31","Soil_Type32","Soil_Type33","Soil_Type34","Soil_Type35","Soil_Type36","Soil_Type37","Soil_Type38","Soil_Type39","Soil_Type40","Cover_Type_Aspen","Cover_Type_Cottonwood/Willow","Cover_Type_Douglas-fir","Cover_Type_Krummholz","Cover_Type_Lodgepole Pine","Cover_Type_Ponderosa Pine","Cover_Type_Spruce/Fir"]]

In [None]:
def standard_numerical_dataframe(data):
    scaler = StandardScaler()
    return pd.DataFrame(scaler.fit_transform(data))

numerical_dataframe = standard_numerical_dataframe(numerical_dataframe)
numerical_dataframe.head()
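
Scaling matters for k-means because it relies on Euclidean distances, so a feature measured in thousands would otherwise dominate one measured in tens. The sketch below shows, on a made-up matrix, that `StandardScaler` is the column-wise z-score. The values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales.
X = np.array([[2000.0, 10.0],
              [2500.0, 20.0],
              [3000.0, 60.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))   # True
print(scaled.mean(axis=0).round(6))  # approximately 0 for each column
print(scaled.std(axis=0).round(6))   # approximately 1 for each column
```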

In [None]:
numerical_dataframe.columns = ["Elevation_Scaled","Aspect_Scaled","Slope_Scaled","Horizontal_Distance_To_Roadways_Scaled","Hillshade_9am_Scaled","Hillshade_Noon_Scaled","Hillshade_3pm_Scaled","Horizontal_Distance_To_Fire_Points_Scaled","Distance_To_Hydrology_Scaled"]

In [None]:
numerical_dataframe.head()

In [None]:
df = pd.concat([numerical_dataframe, categorical_dataframe], axis=1, join='inner')

In [None]:
df.head()

## Optimal number of clusters

In [None]:

def elbow_plot(data):
    ks = np.arange(2, 12)
    #fit one k-means model per candidate k and record its inertia
    scores = [KMeans(n_clusters=k).fit(data).inertia_ 
          for k in ks]
    sns.lineplot(x=ks, y=scores) #recent seaborn versions require keyword x/y
    plt.xlabel('Number of clusters')
    plt.ylabel("Inertia")
    plt.title("Inertia of k-Means versus number of clusters")

elbow_plot(df)


**Remarks:**

* The elbow in the inertia curve suggests that about 5 clusters is a reasonable choice.
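
For readers new to the elbow method: inertia is the sum of squared distances from each point to its assigned centroid, and it keeps falling as k grows, so one looks for the k after which the improvement flattens. The sketch below illustrates this on synthetic blobs. The data, the range of k and the seeds are assumptions, and it does not reproduce the plot above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    fitted = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(fitted.inertia_)

# Inertia recomputed by hand for k=4: squared distance to the assigned centroid.
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
assigned = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned) ** 2).sum()

# Drop in inertia gained by each extra cluster; the elbow is where this flattens.
drops = -np.diff(inertias)
for k, drop in zip(ks[1:], drops):
    print(f"k={k}: inertia drops by {drop:.1f}")
```

Reading the elbow is a judgment call, so it is worth cross-checking with another criterion such as the silhouette score.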


In [None]:
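def cluster(data,k=5):
    kmeans = KMeans(n_clusters=k) #Initialize our model
    clusters = kmeans.fit_predict(data) #fit and assign each row to its nearest centroid
    clustered = data.copy() #copy so the caller's dataframe does not gain a "Cluster" column
    clustered["Cluster"] = clusters
    return clustered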

In [None]:
clustered_df=cluster(df)

In [None]:
clustered_df["Cluster"].value_counts() #number of rows in each cluster

# Visualization of Clustered Dataset

## 1. Principal Component Analysis (PCA)

## Sample the dataset

In [None]:
def sample_dataset(data,data_point):
    #draw data_point random rows and reset the index so later concat calls align row by row
    return data.sample(data_point).reset_index(drop=True)

In [None]:
sample = sample_dataset(clustered_df,10000)

# Principal component datasets

In [None]:
def pca_dataset(n_component,data,cluster_label):
    pca = PCA(n_components=n_component)
    pca_df = pd.DataFrame(pca.fit_transform(data.drop([cluster_label],axis=1)))
    return pca_df
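
When projecting to one, two or three components for plotting, it helps to know how much of the variance the projection keeps. The sketch below reads `explained_variance_ratio_` from a fitted `PCA` on synthetic data. The data-generating setup is an assumption for illustration and says nothing about the forest cover dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 informative directions embedded in 10 features plus small noise.
latent = rng.normal(size=(1000, 3)) * np.array([5.0, 2.0, 1.0])
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 10))

pca = PCA(n_components=3).fit(X)
ratios = pca.explained_variance_ratio_

print(ratios.round(3))         # share of variance per component, largest first
print(round(ratios.sum(), 3))  # share kept by the 3-D projection
```

If the summed ratio is low for the real data, clusters that look overlapping in the PCA plots may still be well separated in the full feature space.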

In [None]:
PCs_1d = pca_dataset(1,sample,"Cluster")
PCs_2d = pca_dataset(2,sample,"Cluster")
PCs_3d = pca_dataset(3,sample,"Cluster")

In [None]:
PCs_1d.columns = ["PC1_1d"]
PCs_2d.columns = ["PC1_2d", "PC2_2d"]
PCs_3d.columns = ["PC1_3d", "PC2_3d", "PC3_3d"]

In [None]:
plot_df = pd.concat([sample,PCs_1d,PCs_2d,PCs_3d], axis=1, join='inner')

In [None]:
plot_df["dummy"] = 0 #1-D visualization

In [None]:
cluster0 = plot_df[plot_df["Cluster"] == 0]
cluster1 = plot_df[plot_df["Cluster"] == 1]
cluster2 = plot_df[plot_df["Cluster"] == 2]
cluster3 = plot_df[plot_df["Cluster"] == 3]
cluster4 = plot_df[plot_df["Cluster"] == 4]


In [None]:
def pca_visualization(num_of_cluster=5,dimension=1):
    #note: num_of_cluster and dimension are not used yet; the plot is always 1-D with five clusters
    trace1 = go.Scatter(
                    x = cluster0["PC1_1d"],
                    y = cluster0["dummy"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
    trace2 = go.Scatter(
                    x = cluster1["PC1_1d"],
                    y = cluster1["dummy"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
    trace3 = go.Scatter(
                    x = cluster2["PC1_1d"],
                    y = cluster2["dummy"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
    trace4 = go.Scatter(
                    x = cluster3["PC1_1d"],
                    y = cluster3["dummy"],
                    mode = "markers",
                    name = "Cluster 3",
                    marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                    text = None)
    trace5 = go.Scatter(
                    x = cluster4["PC1_1d"],
                    y = cluster4["dummy"],
                    mode = "markers",
                    name = "Cluster 4",
                    marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                    text = None)
    data = [trace1, trace2, trace3, trace4, trace5]

    title = "Visualizing Clusters in One Dimension Using PCA"

    layout = dict(title = title,
              xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= '',ticklen= 5,zeroline= False)
             )

    fig = dict(data = data, layout = layout)

    iplot(fig)
    

In [None]:
pca_visualization()

In [None]:
#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["PC1_2d"],
                    y = cluster0["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["PC1_2d"],
                    y = cluster1["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["PC1_2d"],
                    y = cluster2["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
trace4 = go.Scatter(
                x = cluster3["PC1_2d"],
                y = cluster3["PC2_2d"],
                mode = "markers",
                name = "Cluster 3",
                marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                text = None)
trace5 = go.Scatter(
                x = cluster4["PC1_2d"],
                y = cluster4["PC2_2d"],
                mode = "markers",
                name = "Cluster 4",
                marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                text = None)
data = [trace1, trace2, trace3,trace4,trace5]

title = "Visualizing Clusters in Two Dimensions Using PCA"

layout = dict(title = title,
              xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'PC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

In [None]:
#Instructions for building the 3-D plot

#trace1 is for 'Cluster 0'
trace1 = go.Scatter3d(
                    x = cluster0["PC1_3d"],
                    y = cluster0["PC2_3d"],
                    z = cluster0["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter3d(
                    x = cluster1["PC1_3d"],
                    y = cluster1["PC2_3d"],
                    z = cluster1["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
#trace3 is for 'Cluster 2'
trace3 = go.Scatter3d(
                    x = cluster2["PC1_3d"],
                    y = cluster2["PC2_3d"],
                    z = cluster2["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
#trace4 is for 'Cluster 3'
trace4 = go.Scatter3d(
                    x = cluster3["PC1_3d"],
                    y = cluster3["PC2_3d"],
                    z = cluster3["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 3",
                    marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                    text = None)
#trace5 is for 'Cluster 4'
trace5 = go.Scatter3d(
                    x = cluster4["PC1_3d"],
                    y = cluster4["PC2_3d"],
                    z = cluster4["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 4",
                    marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                    text = None)

data = [trace1, trace2, trace3,trace4, trace5]

title = "Visualizing Clusters in Three Dimensions Using PCA"

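layout = dict(title = title,
              scene = dict( #3-D axes are configured under layout.scene
                  xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
                  yaxis= dict(title= 'PC2',ticklen= 5,zeroline= False),
                  zaxis= dict(title= 'PC3',ticklen= 5,zeroline= False))
             )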

fig = dict(data = data, layout = layout)

iplot(fig)
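
The plotting cells above repeat the same trace definition once per cluster. One possible way to shorten them is a small helper that builds the traces in a loop. This is a sketch: `build_cluster_traces` and `CLUSTER_COLORS` are hypothetical names, and the helper returns plain dictionaries (which plotly accepts as figure data) rather than `go.Scatter` objects.

```python
import pandas as pd

# Same colours as the cells above, indexed by cluster id.
CLUSTER_COLORS = [
    'rgba(255, 128, 255, 0.8)',
    'rgba(255, 128, 2, 0.8)',
    'rgba(0, 255, 200, 0.8)',
    'rgba(255, 25, 200, 0.8)',
    'rgba(0, 255, 2, 0.8)',
]

def build_cluster_traces(frame, x, y, z=None, label="Cluster"):
    """Hypothetical helper: one marker trace per cluster, as plain dicts."""
    traces = []
    for cluster_id in sorted(frame[label].unique()):
        subset = frame[frame[label] == cluster_id]
        trace = dict(
            type="scatter" if z is None else "scatter3d",
            x=subset[x].tolist(),
            y=subset[y].tolist(),
            mode="markers",
            name=f"Cluster {int(cluster_id)}",
            marker=dict(color=CLUSTER_COLORS[int(cluster_id) % len(CLUSTER_COLORS)]),
        )
        if z is not None:
            trace["z"] = subset[z].tolist()
        traces.append(trace)
    return traces

# Tiny made-up frame standing in for plot_df.
toy = pd.DataFrame({
    "PC1_2d": [0.1, 0.2, 1.5, 1.7],
    "PC2_2d": [0.0, 0.3, 2.0, 2.2],
    "Cluster": [0, 0, 1, 1],
})
traces = build_cluster_traces(toy, "PC1_2d", "PC2_2d")
print([t["name"] for t in traces])  # ['Cluster 0', 'Cluster 1']
```

In the notebook, something like `iplot(dict(data=build_cluster_traces(plot_df, "PC1_2d", "PC2_2d"), layout=layout))` should draw the same kind of figure, though this has not been run against the full dataset.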

## 2. t-SNE

In [None]:
sample = sample_dataset(clustered_df,10000)
#Set our perplexity
perplexity = 50

In [None]:
def tsne_dataset(n_component,data,cluster_label,perplexity):
    tsne = TSNE(n_components=n_component,perplexity=perplexity)
    tsne_df = pd.DataFrame(tsne.fit_transform(data.drop([cluster_label],axis=1)))
    return tsne_df
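
Two practical points about t-SNE are worth keeping in mind: the result depends on the random initialisation unless a seed is fixed, and recent scikit-learn versions require the perplexity to be smaller than the number of samples. The sketch below runs `TSNE` on a small synthetic dataset. The data, `perplexity=30`, `init="pca"` and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two made-up groups of 60 points each in 8 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 8)),
               rng.normal(6.0, 1.0, size=(60, 8))])

# Fixing random_state makes the embedding reproducible between runs.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (120, 2)
```

Unlike PCA, t-SNE has no `transform` for new points, which is why the notebook embeds one fixed sample per plot.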

In [None]:
TCs_1d = tsne_dataset(1,sample,"Cluster",perplexity)
TCs_2d = tsne_dataset(2,sample,"Cluster",perplexity)
TCs_3d = tsne_dataset(3,sample,"Cluster",perplexity)

In [None]:
TCs_1d.columns = ["TC1_1d"]

TCs_2d.columns = ["TC1_2d","TC2_2d"]

TCs_3d.columns = ["TC1_3d","TC2_3d","TC3_3d"]

In [None]:
tsne_data = pd.concat([sample,TCs_1d,TCs_2d,TCs_3d], axis=1, join='inner')

In [None]:
tsne_data["dummy"] = 0

In [None]:
cluster0 = tsne_data[tsne_data["Cluster"] == 0]
cluster1 = tsne_data[tsne_data["Cluster"] == 1]
cluster2 = tsne_data[tsne_data["Cluster"] == 2]
cluster3 = tsne_data[tsne_data["Cluster"] == 3]
cluster4 = tsne_data[tsne_data["Cluster"] == 4]

In [None]:
trace1 = go.Scatter(
                    x = cluster0["TC1_1d"],
                    y = cluster0["dummy"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
trace2 = go.Scatter(
                    x = cluster1["TC1_1d"],
                    y = cluster1["dummy"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
trace3 = go.Scatter(
                    x = cluster2["TC1_1d"],
                    y = cluster2["dummy"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
trace4 = go.Scatter(
                    x = cluster3["TC1_1d"],
                    y = cluster3["dummy"],
                    mode = "markers",
                    name = "Cluster 3",
                    marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                    text = None)
trace5 = go.Scatter(
                    x = cluster4["TC1_1d"],
                    y = cluster4["dummy"],
                    mode = "markers",
                    name = "Cluster 4",
                    marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                    text = None)
data = [trace1, trace2, trace3,trace4, trace5]

title = "Visualizing Clusters in One Dimension Using T-SNE (perplexity=" + str(perplexity) + ")"

layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= '',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

In [None]:
#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["TC1_2d"],
                    y = cluster0["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["TC1_2d"],
                    y = cluster1["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["TC1_2d"],
                    y = cluster2["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
trace4 = go.Scatter(
                x = cluster3["TC1_2d"],
                y = cluster3["TC2_2d"],
                mode = "markers",
                name = "Cluster 3",
                marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                text = None)
trace5 = go.Scatter(
                x = cluster4["TC1_2d"],
                y = cluster4["TC2_2d"],
                mode = "markers",
                name = "Cluster 4",
                marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                text = None)

data = [trace1, trace2, trace3,trace4,trace5]

title = "Visualizing Clusters in Two Dimensions Using TSNE"

layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

In [None]:
#trace1 is for 'Cluster 0'
trace1 = go.Scatter3d(
                    x = cluster0["TC1_3d"],
                    y = cluster0["TC2_3d"],
                    z = cluster0["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter3d(
                    x = cluster1["TC1_3d"],
                    y = cluster1["TC2_3d"],
                    z = cluster1["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
trace3 = go.Scatter3d(
                    x = cluster2["TC1_3d"],
                    y = cluster2["TC2_3d"],
                    z = cluster2["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
#trace4 is for 'Cluster 3'
trace4 = go.Scatter3d(
                    x = cluster3["TC1_3d"],
                    y = cluster3["TC2_3d"],
                    z = cluster3["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 3",
                    marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                    text = None)
#trace5 is for 'Cluster 4'
trace5 = go.Scatter3d(
                    x = cluster4["TC1_3d"],
                    y = cluster4["TC2_3d"],
                    z = cluster4["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 4",
                    marker = dict(color = 'rgba(0, 0, 200, 0.8)'),
                    text = None)

data = [trace1, trace2, trace3,trace4, trace5]

title = "Visualizing Clusters in Three Dimensions Using TSNE"

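layout = dict(title = title,
              scene = dict( #3-D axes are configured under layout.scene
                  xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
                  yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False),
                  zaxis= dict(title= 'TC3',ticklen= 5,zeroline= False))
             )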

fig = dict(data = data, layout = layout)

iplot(fig)

# Mini-batch K-means algorithm

The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

The algorithm iterates between two major steps, similar to vanilla k-means. In the first step, a fixed number of samples (the batch size) is drawn randomly from the dataset to form a mini-batch. These are then assigned to the nearest centroid. In the second step, the centroids are updated. In contrast to k-means, this is done on a per-sample basis. For each sample in the mini-batch, the assigned centroid is updated by taking the streaming average of the sample and all previous samples assigned to that centroid. This has the effect of decreasing the rate of change for a centroid over time. These steps are performed until convergence or until a predetermined number of iterations is reached.
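
The two steps described above can be sketched in plain NumPy. This is a toy illustration of the per-sample streaming-average update, not scikit-learn's implementation. The function name, the random initialisation, the fixed iteration count and the synthetic data are all assumptions.

```python
import numpy as np

def mini_batch_kmeans(X, k, batch_size=32, n_iter=100, seed=0):
    """Toy mini-batch k-means using the per-sample streaming-average update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct random rows of X.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of samples assigned to each centroid so far

    for _ in range(n_iter):
        # Step 1: draw a random mini-batch and assign it to the nearest centroids.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        distances = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = distances.argmin(axis=1)

        # Step 2: per-sample update with a learning rate of 1 / count,
        # which keeps each centroid equal to the mean of its assigned samples.
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]
            centroids[c] = (1.0 - eta) * centroids[c] + eta * x
    return centroids

# Two made-up groups centred near (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(200, 2)),
               rng.normal(5.0, 0.5, size=(200, 2))])
centroids = mini_batch_kmeans(X, k=2)
print(centroids.round(2))
```

Because the learning rate shrinks as `counts` grows, each new sample moves its centroid less than the previous one did, which is the decreasing rate of change mentioned above.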

MiniBatchKMeans converges faster than KMeans, but the quality of the results is usually slightly lower, i.e. the final inertia is somewhat higher.
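
The trade-off can be measured directly by timing both estimators on the same data and comparing their final inertia. The sketch below does this on synthetic blobs. The dataset size, `batch_size=1024` and `n_init=3` are assumptions, and the actual timings depend on the machine and the scikit-learn version.

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for a larger dataset (illustrative only).
X, _ = make_blobs(n_samples=20000, centers=5, n_features=10, random_state=0)

results = {}
for name, model in [
    ("KMeans", KMeans(n_clusters=5, n_init=3, random_state=0)),
    ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=0)),
]:
    start = time.perf_counter()
    model.fit(X)
    # Store (fit time in seconds, final inertia on the full data).
    results[name] = (time.perf_counter() - start, model.inertia_)
    print(f"{name}: {results[name][0]:.2f}s, inertia={results[name][1]:.0f}")
```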

In [None]:
from sklearn.cluster import MiniBatchKMeans

In [None]:
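def mini_cluster(data,k=5):
    features = data.drop(columns=["Cluster"], errors="ignore") #never cluster on labels left over from an earlier run
    minikmeans = MiniBatchKMeans(n_clusters=k) #Initialize our model
    clustered = features.copy() #leave the caller's dataframe untouched
    clustered["Cluster"] = minikmeans.fit_predict(features)
    return clustered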

Reference: https://www.kaggle.com/minc33/visualizing-high-dimensional-clusters


In [None]:
mini_cluster_data = mini_cluster(df)
mini_cluster_data.head()

In [None]:
sample = sample_dataset(mini_cluster_data,5000)

In [None]:
PCs_1d = pca_dataset(1,sample,"Cluster")
PCs_2d = pca_dataset(2,sample,"Cluster")
PCs_3d = pca_dataset(3,sample,"Cluster")

In [None]:
PCs_1d.columns = ["PC1_1d"]
PCs_2d.columns = ["PC1_2d", "PC2_2d"]
PCs_3d.columns = ["PC1_3d", "PC2_3d", "PC3_3d"]

In [None]:
plot_df = pd.concat([sample,PCs_1d,PCs_2d,PCs_3d], axis=1, join='inner')
plot_df["dummy"] = 0 #1-D visualization

In [None]:
cluster0 = plot_df[plot_df["Cluster"] == 0]
cluster1 = plot_df[plot_df["Cluster"] == 1]
cluster2 = plot_df[plot_df["Cluster"] == 2]
cluster3 = plot_df[plot_df["Cluster"] == 3]
cluster4 = plot_df[plot_df["Cluster"] == 4]

In [None]:
#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["PC1_2d"],
                    y = cluster0["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["PC1_2d"],
                    y = cluster1["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["PC1_2d"],
                    y = cluster2["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
trace4 = go.Scatter(
                x = cluster3["PC1_2d"],
                y = cluster3["PC2_2d"],
                mode = "markers",
                name = "Cluster 3",
                marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                text = None)
trace5 = go.Scatter(
                x = cluster4["PC1_2d"],
                y = cluster4["PC2_2d"],
                mode = "markers",
                name = "Cluster 4",
                marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                text = None)
data = [trace1, trace2, trace3,trace4,trace5]

title = "Visualizing Clusters in Two Dimensions Using PCA"

layout = dict(title = title,
              xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'PC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

In [None]:
TCs_1d = tsne_dataset(1,sample,"Cluster",perplexity=50)
TCs_2d = tsne_dataset(2,sample,"Cluster",perplexity=50)
TCs_3d = tsne_dataset(3,sample,"Cluster",perplexity=50)

In [None]:
TCs_1d.columns = ["TC1_1d"]

TCs_2d.columns = ["TC1_2d","TC2_2d"]

TCs_3d.columns = ["TC1_3d","TC2_3d","TC3_3d"]

In [None]:
tsne_data = pd.concat([sample,TCs_1d,TCs_2d,TCs_3d], axis=1, join='inner')

In [None]:
tsne_data["dummy"] = 0

In [None]:
cluster0 = tsne_data[tsne_data["Cluster"] == 0]
cluster1 = tsne_data[tsne_data["Cluster"] == 1]
cluster2 = tsne_data[tsne_data["Cluster"] == 2]
cluster3 = tsne_data[tsne_data["Cluster"] == 3]
cluster4 = tsne_data[tsne_data["Cluster"] == 4]

In [None]:
#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["TC1_2d"],
                    y = cluster0["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["TC1_2d"],
                    y = cluster1["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["TC1_2d"],
                    y = cluster2["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
trace4 = go.Scatter(
                x = cluster3["TC1_2d"],
                y = cluster3["TC2_2d"],
                mode = "markers",
                name = "Cluster 3",
                marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                text = None)
trace5 = go.Scatter(
                x = cluster4["TC1_2d"],
                y = cluster4["TC2_2d"],
                mode = "markers",
                name = "Cluster 4",
                marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                text = None)

data = [trace1, trace2, trace3,trace4,trace5]

title = "Visualizing Clusters in Two Dimensions Using TSNE"

layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

# K-means with Spark

In [None]:
!pip install pyspark

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName("Kmeans").getOrCreate()

In [None]:
df.head()

In [None]:
spark_data = spark.createDataFrame(df.drop(columns=["Cluster"], errors="ignore")) #exclude labels added by the earlier k-means runs
spark_data.show(2)

In [None]:
spark_data.printSchema()

In [None]:
from pyspark.ml.feature import VectorAssembler
columns=spark_data.columns

In [None]:
assemble=VectorAssembler(inputCols=columns,outputCol="features")
assembled_data = assemble.transform(spark_data)
assembled_data.show(2)

In [None]:
#note: this import shadows sklearn's KMeans imported earlier in the notebook
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
silhouette_score=[]
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='features',
                                metricName='silhouette', distanceMeasure='squaredEuclidean')
for i in range(2,10):
    KMeans_algo=KMeans(featuresCol='features', k=i)
    KMeans_fit=KMeans_algo.fit(assembled_data)
    output=KMeans_fit.transform(assembled_data) #adds a 'prediction' column
    score=evaluator.evaluate(output) #silhouette score for this k
    silhouette_score.append(score)
    print("k =", i, "Silhouette Score:", score)

In [None]:
#Visualizing the silhouette scores in a plot
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1, figsize =(8,6))
ax.plot(range(2,10),silhouette_score)
ax.set_xlabel('k')
ax.set_ylabel('silhouette score')
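
The silhouette score is higher when clusters are compact and well separated, so a common heuristic is to pick the k with the highest score. The sketch below shows that selection with scikit-learn on synthetic data where the number of groups is known. The data and the range of k are assumptions, and scikit-learn's default Euclidean distance differs from the squared Euclidean measure used by the Spark evaluator above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated made-up groups, so the expected answer is known.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

ks = list(range(2, 7))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Pick the candidate k with the highest silhouette score.
best_k = ks[int(np.argmax(scores))]
print(best_k)  # 3
```

In this notebook the name `silhouette_score` refers to the list of Spark scores, so the equivalent selection there would be `list(range(2, 10))[int(np.argmax(silhouette_score))]`.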

In [None]:
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(assembled_data)
centers = model.clusterCenters()

print("Cluster Centers: ")
for center in centers:
    print(center)

In [None]:
transformed = model.transform(assembled_data)
#rows = transformed.collect().
#print(rows[:3])
#print(type(rows))


In [None]:
df_spark = transformed.toPandas()
df_spark.head()

In [None]:
df_spark["Cluster"]=df_spark["prediction"]
df_spark.drop(["prediction"],axis=1,inplace=True)
df_spark.head()

In [None]:
df_spark.drop("features",axis=1,inplace=True)

In [None]:
sample = sample_dataset(df_spark,10000)

In [None]:
PCs_2d = pca_dataset(2,sample,"Cluster")
PCs_2d.columns = ["PC1_2d", "PC2_2d"]
plot_df = pd.concat([sample,PCs_2d], axis=1, join='inner')

In [None]:
cluster0 = plot_df[plot_df["Cluster"] == 0]
cluster1 = plot_df[plot_df["Cluster"] == 1]
cluster2 = plot_df[plot_df["Cluster"] == 2]
cluster3 = plot_df[plot_df["Cluster"] == 3]
cluster4 = plot_df[plot_df["Cluster"] == 4]
cluster5 = plot_df[plot_df["Cluster"] == 5]

In [None]:
cluster3.head()

In [None]:
#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["PC1_2d"],
                    y = cluster0["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["PC1_2d"],
                    y = cluster1["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)
#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["PC1_2d"],
                    y = cluster2["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)
trace4 = go.Scatter(
                x = cluster3["PC1_2d"],
                y = cluster3["PC2_2d"],
                mode = "markers",
                name = "Cluster 3",
                marker = dict(color = 'rgba(255, 25, 200, 0.8)'),
                text = None)
trace5 = go.Scatter(
                x = cluster4["PC1_2d"],
                y = cluster4["PC2_2d"],
                mode = "markers",
                name = "Cluster 4",
                marker = dict(color = 'rgba(0, 255, 2, 0.8)'),
                text = None)
trace6 = go.Scatter(
                x = cluster5["PC1_2d"],
                y = cluster5["PC2_2d"],
                mode = "markers",
                name = "Cluster 5",
                marker = dict(color = 'rgba(100, 100, 100, 0.8)'),
                text = None)
data = [trace1, trace2, trace3,trace4,trace5,trace6]
title = "Visualizing Clusters in Two Dimensions Using PCA"
layout = dict(title = title,
              xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'PC2',ticklen= 5,zeroline= False)
             )
fig = dict(data = data, layout = layout)
iplot(fig)
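
The six traces above differ only in cluster label and colour, so they can also be built in a loop. The sketch below is one possible way to do it, using plain dictionaries as trace specifications, which plotly accepts in place of `go.Scatter` objects. The helper name `build_cluster_traces` is hypothetical and is not part of the notebook or of plotly.

```python
from collections import defaultdict


def build_cluster_traces(labels, xs, ys, colors):
    """Group points by cluster label and return one scatter-trace dict per cluster."""
    grouped = defaultdict(lambda: {"x": [], "y": []})
    for label, x, y in zip(labels, xs, ys):
        grouped[int(label)]["x"].append(x)
        grouped[int(label)]["y"].append(y)

    traces = []
    for label in sorted(grouped):
        traces.append(dict(
            type="scatter",
            x=grouped[label]["x"],
            y=grouped[label]["y"],
            mode="markers",
            name=f"Cluster {label}",  # name always matches the label
            marker=dict(color=colors[label % len(colors)]),
        ))
    return traces


# Same colours as the hand-written traces above.
colors = [
    "rgba(255, 128, 255, 0.8)",
    "rgba(255, 128, 2, 0.8)",
    "rgba(0, 255, 200, 0.8)",
    "rgba(255, 25, 200, 0.8)",
    "rgba(0, 255, 2, 0.8)",
    "rgba(100, 100, 100, 0.8)",
]

# Tiny made-up input so the block runs on its own.
demo_traces = build_cluster_traces(
    labels=[0, 1, 0, 2],
    xs=[0.1, 1.0, 0.2, 2.0],
    ys=[0.5, 1.5, 0.6, 2.5],
    colors=colors,
)
print([trace["name"] for trace in demo_traces])
```

In the notebook, the call would presumably be `data = build_cluster_traces(plot_df["Cluster"], plot_df["PC1_2d"], plot_df["PC2_2d"], colors)`, with `layout` and `iplot` used as before. Deriving each name from the label avoids copy-paste slips in the legend.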

In [None]:
PCs_2d = pca_dataset(2,sample,"Cluster")
PCs_2d.columns = ["PC1_2d", "PC2_2d"]
plot_df = pd.concat([sample,PCs_2d], axis=1, join='inner')