# Introduction
<img src = 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlOl_zV8SaOnu8WzqLsoCVaGio8heZ2SjCDQ&usqp=CAU' height = '600px' width = '600px'>

<b><i>K-means</i> </b>clustering is one of the simplest and most popular unsupervised machine learning algorithms. It is used to find similar data points in an unlabelled dataset and group them together. A <b>cluster</b> is a group of data points that are similar to one another. The K-Means algorithm works iteratively to assign each data point to one of k clusters based on the features provided, so data points are clustered by feature similarity. K-Means tries to decrease the intra-cluster distance while increasing the inter-cluster distance.

## Applications of K-Means
1. Image segmentation

2. Customer segmentation

3. Document clustering

4. Clustering languages

5. Anomaly Detection

## Some common terms in K-means

<i><b>Centroid:</b></i> A centroid is the point at the centre of a cluster. In k-means clustering, each cluster is represented by its centroid, so the number of centroids equals the value of k. Each data point is assigned to its nearest centroid.

<i><b>K: </b></i> K is the total number of clusters. Choosing an appropriate value of k is very important because it is one of the main factors that determine how well the algorithm performs. Each cluster has a centre called its centroid, and data points are assigned to the cluster with the nearest centroid.

## How does K-Means work?
1. Choose a value for k and pick k random data points as the initial cluster centroids.

2. Calculate the Euclidean distance from each data point to every centroid and assign each data point to the nearest cluster.

3. Recalculate each cluster centroid as the mean of the data points assigned to it.

4. Repeat Steps 2 and 3 until the clusters do not change.
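
The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation that scikit-learn uses: the function name `kmeans_sketch`, its parameters and the toy data are all made up for this example, and the loop stops when the centroids stop moving, which amounts to the clusters no longer changing.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal k-means following the four steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid with no points keeps its previous position)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up toy data: two well-separated groups of 2-D points
X_toy = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.9],
                  [8.0, 1.1], [8.5, 1.0], [9.0, 0.8]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)     # cluster index of each point
print(centroids)  # one row per cluster centre
```

Which group ends up labelled 0 and which 1 depends on the random starting points, so only the grouping itself is meaningful. In practice scikit-learn's `KMeans` is the better choice, since by default it starts from smarter initial centroids (k-means++).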


# Choosing the value of K
Choosing an appropriate value of K is the main objective of this kernel. Here, I am going to show 3 ways to choose the value of k, with examples.

The performance of the K-Means algorithm depends on the value of K, so we should choose the value that gives the best clustering. We can choose the value of k using:

1. Elbow method

2. Silhouette score

3. Dendrogram

## Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Let's load data

In [None]:

data = pd.read_csv("../input/mall-customers/Mall_Customers.csv")

In [None]:
# let's look through our data
data.head()

In [None]:
# display shape of data
print("Shape:",data.shape)

In [None]:
# check for any null value
data.isna().sum()

In [None]:
# let's rename some columns for our convenience
data = data.rename(columns = {'CustomerID':'ID','Annual Income (k$)':'Income','Spending Score (1-100)': 'Spending','Genre':'Gender'} )
data.head()

In [None]:
# get information about data
data.info()

In [None]:
# let's encode Male as 1 and Female as 0
data['Gender'] = data.Gender.map({'Female':0, 'Male':1})
data.head()

## Analysis

In [None]:
# do males spend more than females on average?
average_spending_for_male = data[data['Gender'] == 1].Spending.mean()

average_spending_for_female = data[data['Gender'] == 0].Spending.mean()
print('Male:', average_spending_for_male)
print('Female:', average_spending_for_female)

In [None]:
# Income comparison on the basis of Gender
data.groupby('Gender').Income.describe()

In [None]:
# Spending comparison on the basis of Gender
data.groupby('Gender').Spending.describe()

In [None]:
# Age comparison on the basis of Gender
data.groupby('Gender').Age.describe()

## Data Visualization

In [None]:
# pass the column as a keyword argument; recent seaborn versions do not accept it positionally
sns.countplot(x = 'Gender', data = data)
plt.show()

In [None]:
plt.hist(data.Spending, 10)
plt.xlabel('Spending')
plt.ylabel('Number of people')
plt.show()

In [None]:
plt.hist(data.Income, 5)
plt.xlabel('Income')
plt.ylabel('Number of people')
plt.show()

In [None]:
# histogram of Age; the title and bin edges now match the column being plotted
plt.title("Age",fontsize = 22)
bins = np.linspace(data.Age.min(), data.Age.max(),5)
plt.hist(data.Age, bins)
plt.show()

In [None]:
# let's visualize Income and Spending score in a scatter plot
# I am going to cluster on these two features so that the clusters can be visualized in 2D
plt.scatter(data.Spending, data.Income, marker = 'o', color = 'red')
plt.xlabel('Spending')
plt.ylabel('Income')
plt.show()
# here the data points appear to fall into 5 different groups

### Preparing Data For Training

In [None]:
# use the columns from position 3 onward (Income and Spending) as features
X = data.iloc[:,3:].values
X

# 1. Elbow method
In cluster analysis, the elbow method is a heuristic for determining the number of clusters in a data set. It runs k-means clustering on the dataset for a range of values of k and computes a score for each value. The score used here is the distortion, also called the within-cluster sum of squares (WCSS): the sum of squared distances from each point to its assigned centre. When these scores are plotted against k, it is possible to visually determine a good value for k. If the line chart looks like an arm, then the “elbow” (the point where the curve bends most sharply and starts to flatten) is the best value of k.

<img src = 'https://www.oreilly.com/library/view/numerical-computing-with/9781789953633/assets/f54e236e-f441-43d6-80a9-07feef4f6ef4.png'  >

<br>
<br>
<br>

The elbow method helps to choose the optimum value of 'k' (number of clusters) by fitting the model with a range of values of 'k'. Here we use a 2-dimensional data set, but the elbow method works for any multivariate data set.
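
To make the score concrete, here is a small sketch of how the sum of squared distances can be computed by hand. The helper name `wcss` and the toy points are made up for this illustration. scikit-learn exposes the same quantity as the `inertia_` attribute of a fitted `KMeans` model.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Made-up toy data: four points forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: every point belongs to a single cluster centred on the overall mean
labels_k1 = np.array([0, 0, 0, 0])
centroids_k1 = np.array([[5.0, 1.0]])
wcss_k1 = wcss(points, labels_k1, centroids_k1)

# k = 2: one cluster per group, each centred on the mean of its group
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])
wcss_k2 = wcss(points, labels_k2, centroids_k2)

print(wcss_k1)  # 104.0
print(wcss_k2)  # 4.0
```

The score drops sharply from k = 1 to k = 2 because two clusters match the structure of the toy data. Adding more clusters after that can only shave off a little more, and that flattening is what produces the elbow.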

In [None]:
from sklearn.cluster import KMeans

In [None]:
clusters = 10
wccs_array = []
# fit K-Means for k = 1 to 9 and record the within-cluster sum of squares (inertia)
for i in range(1, clusters):
    model = KMeans(n_clusters =  i, random_state = 42 )
    model.fit(X)
    wccs_array.append(model.inertia_)

In [None]:
# plot WCSS against k to find the elbow
plt.plot(range(1,clusters), wccs_array, 'o-', color = 'red')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS Score')
plt.title("Elbow Method to find the number of clusters")
plt.show()
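
Reading the elbow off the plot by eye is subjective, so a rough numeric check can help. The sketch below uses one common heuristic, which is my assumption and not part of the elbow method itself: pick the point that lies farthest from the straight line joining the first and last points of the curve. The function name `find_elbow` is hypothetical, and the scores are made-up numbers, not results from the mall dataset.

```python
import numpy as np

def find_elbow(ks, scores):
    """Return the k whose point lies farthest from the straight line
    joining the first and last points of the (k, score) curve."""
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # rescale both axes to [0, 1] so that neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (scores - scores.min()) / (scores.max() - scores.min())
    # perpendicular distance from each point to the line through the end points
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distances = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(distances)])

# made-up scores that fall steeply up to k = 5 and then flatten
toy_ks = range(1, 10)
toy_scores = [1000, 800, 600, 400, 200, 180, 165, 155, 150]
print(find_elbow(toy_ks, toy_scores))  # 5
```

With the variables from the cells above, the call would be `find_elbow(range(1, clusters), wccs_array)`. Treat the result as a second opinion next to the plot rather than a replacement for it.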

In [None]:
# here we can see that the optimum value for k is 5. So, let's build the model using 5 clusters
KMeans_model = KMeans( n_clusters = 5, random_state = 44)
# fit the model to the data and predict the cluster of each data point
predict = KMeans_model.fit_predict(X)

In [None]:
## Let's visualize each cluster and its centre
plt.figure(figsize = (10,10))
colors = ['red','lime','maroon','green','blue']
for i in range(5):
    plt.scatter(X[predict == i,0], X[predict == i, 1], s = 90, c = colors[i])
# plot the centroids as black stars
plt.scatter(KMeans_model.cluster_centers_[:,0], KMeans_model.cluster_centers_[:,1], color = 'Black',marker = '*',
           s = 500)
plt.xlabel('Income')
plt.ylabel('Spending')
plt.show()

# To be continued...
## Thanks for your visit