# K-means Clustering

This is an unsupervised learning technique.

The objective is to partition your data into groups (clusters) based on some continuous features.

It uses distance calculations to find points that are closely related to each other.

In real-world applications of clustering, we have no ground-truth category information (information provided as empirical evidence rather than inferred) about the samples; if we did, the task would be supervised learning. Our goal is therefore to group the samples by feature similarity. The k-means algorithm does this in four steps:

1. Randomly pick k centroids from the sample points as initial cluster centers.
2. Assign each sample to the nearest centroid μ^(j), j ∈ {1, …, k}.
3. Move each centroid to the center (mean) of the samples assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments no longer change, or until a user-defined tolerance or maximum number of iterations is reached.
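
The four steps can be sketched in plain NumPy. This is a minimal illustration and not the scikit-learn implementation used later in this notebook; the function name `kmeans`, its parameters, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct sample points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids move less than the tolerance.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Toy data: two well-separated groups (values made up for illustration).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, the cluster numbers themselves are arbitrary: a different seed may swap which group is called 0 and which is called 1.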

The next question is how to measure similarity between objects. We can define similarity as the opposite of distance; a commonly used distance for clustering samples with continuous features is the squared Euclidean distance between two points x and y in m-dimensional space:

![Euclidean distance](https://www.gstatic.com/education/formulas2/397133473/en/euclidean_distance.svg)
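
As a small worked example, the squared Euclidean distance is the sum of the squared coordinate differences, and its square root is the ordinary Euclidean distance. The helper name `squared_euclidean` and the two points are made up for illustration.

```python
import numpy as np

def squared_euclidean(x, y):
    """Sum of squared coordinate differences between two m-dimensional points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(((x - y) ** 2).sum())

# Two points in 2-dimensional space (values made up for illustration).
d2 = squared_euclidean([1.0, 2.0], [4.0, 6.0])
print(d2)         # 25.0, i.e. 3**2 + 4**2
print(d2 ** 0.5)  # 5.0, the ordinary Euclidean distance
```

Squaring does not change which centroid is nearest, so k-means can skip the square root when assigning samples.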


# Grouping Stock performance based on Volume of share traded and Close price


The objective is to answer the following questions:

1) What is the general trend in stock performance: frequent trading, or low buy and sell activity?

2) How common is it to see a low close price and yet a very high volume of trade? Is it a dominant trend?

3) Can we identify what proportion of the historical data shows high-volume trading?


In [11]:
import pandas as pd

import plotly.express as pe  # bundled with plotly; supersedes the standalone plotly_express package

from sklearn.cluster import KMeans


df=pd.read_csv("/home/harshit/Desktop/PythonDA/Day7/healthcare-dataset-stroke-data.csv")
df.sample(5)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
1005,3437,Female,26.0,0,0,No,Private,Urban,82.61,28.5,smokes,0
4378,5654,Female,11.0,0,0,No,children,Rural,94.77,22.7,Unknown,0
4123,32523,Male,68.0,0,1,Yes,Private,Urban,217.74,25.5,Unknown,0
3566,65507,Male,33.0,0,0,Yes,Private,Rural,55.72,38.2,never smoked,0
881,49928,Female,59.0,0,0,Yes,Govt_job,Rural,111.99,35.5,formerly smoked,0


# Check for missing values

In [12]:
df.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

# Rename columns

In [13]:
# Title-case every column name, e.g. avg_glucose_level -> Avg_Glucose_Level
df.columns = [col.title() for col in df.columns]
df

Unnamed: 0,Id,Gender,Age,Hypertension,Heart_Disease,Ever_Married,Work_Type,Residence_Type,Avg_Glucose_Level,Bmi,Smoking_Status,Stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


# Select Features

In [15]:
features = df[["Age","Avg_Glucose_Level"]].copy()

features

Unnamed: 0,Age,Avg_Glucose_Level
0,67.0,228.69
1,61.0,202.21
2,80.0,105.92
3,49.0,171.23
4,79.0,174.12
...,...,...
5105,80.0,83.75
5106,81.0,125.20
5107,35.0,82.99
5108,51.0,166.29


# The elbow method

## How would we know the actual number of clusters, to begin with?

For the k-means clustering method, the most common approach to answering this question is the so-called elbow method. It involves running the algorithm in a loop with an increasing number of clusters and then plotting a clustering score as a function of the number of clusters. The "elbow", the point where the curve stops dropping steeply, suggests a reasonable number of clusters.

![Graph](https://miro.medium.com/max/700/1*8wV1j-klQA1xFvfaNXuVzg.png)

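The score is, in general, a measure of how well the input data fit the k-means objective function, i.e. some form of intra-cluster distance relative to inter-cluster distance. The score used below is scikit-learn's inertia: the sum of squared distances from each sample to its closest cluster center.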
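
To make the inertia score concrete, the sketch below recomputes it by hand and compares it with the value scikit-learn reports. The synthetic data and the variable names are assumptions made for the example; `n_init` and `random_state` are set only so the result is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic data set: two Gaussian blobs (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up, for every sample, the centroid of the cluster it was assigned to.
own_centroid = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances from each sample to its own centroid.
manual_inertia = ((X - own_centroid) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point error. Inertia always shrinks as more clusters are added, which is why the elbow method looks for the bend in the curve instead of its minimum.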

In [16]:
# Elbow method: fit k-means for k = 2..19 and record the inertia of each fit

ans = []
for num in range(2, 20):
    model = KMeans(n_clusters=num).fit(features)
    # inertia_ is the sum of squared distances from each sample to its closest centroid
    ans.append(model.inertia_)

In [17]:
print(ans)  # list of inertia (distortion) scores, one per value of k

[4760153.859228624, 3196953.823958565, 2243432.67559758, 1869945.3953674235, 1568292.5669498993, 1357878.6882983833, 1212984.9345418853, 1085872.230127081, 979810.3266053167, 896441.3844155322, 824439.9638736472, 753303.5508730611, 697836.8113989097, 651978.401069224, 612321.106998681, 575524.2910210658, 552213.3057048596, 523129.1249019606]


In [24]:
fig = pe.line(x=range(2,20),y=ans,markers=True)

fig.update_layout(
    xaxis_title = 'Number of Clusters',
    yaxis_title = 'Inertia',
)


# Kmeans Steps

# Select the number of clusters to create

In [25]:
model = KMeans(n_clusters=4)  # k chosen from the elbow plot

# Make predictions for the cluster

In [26]:
ans = model.fit_predict(features)  # fit the model and return the cluster index of every row

In [27]:
predicted=pd.DataFrame(ans,columns=['CLUSTER PREDICTED'])
predicted

Unnamed: 0,CLUSTER PREDICTED
0,1
1,1
2,3
3,1
4,1
...,...
5105,3
5106,2
5107,0
5108,2


# Concatenate features and predicted clusters into a result frame

In [28]:
result=pd.concat([features,predicted],axis=1)
result

Unnamed: 0,Age,Avg_Glucose_Level,CLUSTER PREDICTED
0,67.0,228.69,1
1,61.0,202.21,1
2,80.0,105.92,3
3,49.0,171.23,1
4,79.0,174.12,1
...,...,...,...
5105,80.0,83.75,3
5106,81.0,125.20,2
5107,35.0,82.99,0
5108,51.0,166.29,2


# Let's add the predicted cluster as a label to the original dataframe

In [29]:
final = pd.concat([df, predicted],axis=1)
final

Unnamed: 0,Id,Gender,Age,Hypertension,Heart_Disease,Ever_Married,Work_Type,Residence_Type,Avg_Glucose_Level,Bmi,Smoking_Status,Stroke,CLUSTER PREDICTED
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,3
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0,3
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0,2
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0,2


In [34]:
centers=model.cluster_centers_ #extract cluster centres from the model

centers

array([[ 20.6117407 ,  81.92035414],
       [ 60.55309033, 211.00652932],
       [ 38.49318321, 124.00317298],
       [ 60.31154684,  82.5921024 ]])

In [31]:
final = final.astype({'CLUSTER PREDICTED': 'category'})  # make this column categorical so plotly assigns one discrete color per cluster

In [36]:
fig = pe.scatter(x="Age", y="Avg_Glucose_Level", data_frame=final, color="CLUSTER PREDICTED")  # one color per cluster

# Overlay the cluster centers; mode="markers" stops plotly from joining them with lines
fig.add_scatter(
    x=centers[:, 0],
    y=centers[:, 1],
    mode="markers",
    marker=dict(size=15, color="Black"), name="Centers"
)
# plt.scatter(centers[:,0], centers[:,1], c='black', s=100, alpha=0.5);

# Division of data points in these clusters

In [37]:
final['CLUSTER PREDICTED'].value_counts()

3    1838
0    1666
2     975
1     631
Name: CLUSTER PREDICTED, dtype: int64
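
The counts are easier to compare as shares of the whole data set. The sketch below rebuilds the counts from the output above and converts them to fractions; the variable names `counts` and `shares` are made up for the example.

```python
import pandas as pd

# Cluster sizes copied from the value_counts() output above.
counts = pd.Series({3: 1838, 0: 1666, 2: 975, 1: 631}, name="CLUSTER PREDICTED")

# Fraction of all rows that falls into each cluster, rounded to 3 decimals.
shares = (counts / counts.sum()).round(3)
print(shares)
```

On the notebook's dataframe, `final['CLUSTER PREDICTED'].value_counts(normalize=True)` should give the same fractions directly.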