# Day 91 - KMeans, AgglomerativeClustering & DBSCAN

1. Load the data(91).csv file into a DataFrame. <br>
Then implement the K-Means algorithm to split the data into two clusters. Compute the centroid of each cluster and print its coordinates to the console, rounding each coordinate to three decimal places.

In [1]:
import numpy as np
from numpy.linalg import norm
import pandas as pd
import random
 
 
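# The initial centroids are drawn with Python's `random` module,
# so that generator has to be seeded for a repeatable run.
random.seed(42)
np.random.seed(42)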
df = pd.read_csv('data(91).csv')
 
x1_min = df.x1.min()
x1_max = df.x1.max()
 
x2_min = df.x2.min()
x2_max = df.x2.max()
 
centroid_1 = np.array(
    [
        random.uniform(x1_min, x1_max),
        random.uniform(x2_min, x2_max),
    ]
)
centroid_2 = np.array(
    [
        random.uniform(x1_min, x1_max),
        random.uniform(x2_min, x2_max),
    ]
)
 
data = df.values
 
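# Run ten assignment/update iterations.
for i in range(10):
    clusters = []
    for point in data:
        # Assign the point to the nearer centroid (ties go to cluster 1).
        centroid_1_dist = norm(centroid_1 - point)
        centroid_2_dist = norm(centroid_2 - point)
        cluster = 1
        if centroid_1_dist > centroid_2_dist:
            cluster = 2
        clusters.append(cluster)
 
    df['cluster'] = clusters
 
    # Move each centroid to the mean of the points assigned to it.
    centroid_1 = [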
        round(df[df.cluster == 1].x1.mean(), 3),
        round(df[df.cluster == 1].x2.mean(), 3),
    ]
    centroid_2 = [
        round(df[df.cluster == 2].x1.mean(), 3),
        round(df[df.cluster == 2].x2.mean(), 3),
    ]
 
print(centroid_1)
print(centroid_2)

[0.352, 2.502]
[2.663, -3.083]


2. Load the clusters(91).csv file into the DataFrame. <br>
Using the KMeans class from scikit-learn, split the data into three clusters. Set the following arguments: <br>
<br>
max_iter=1000 <br>
random_state=42 <br>
<br>
In response, print the coordinates of the centroid of each cluster as shown below.

In [2]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
 
np.random.seed(42)
df = pd.read_csv('clusters(91).csv')
kmeans = KMeans(n_clusters=3, max_iter=1000, random_state=42)
kmeans.fit(df)
 
print(kmeans.cluster_centers_)

[[-0.55537629 -0.32971364]
 [ 4.86661316  0.42352176]
 [-2.15656147 -4.30478556]]
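
The row order of `cluster_centers_` follows the cluster indices, which carry no meaning of their own and can change with a different initialization. As a small sketch (the sort-by-x1 convention is an assumption chosen for illustration, not part of the task), the centroids printed above can be put into a stable order before comparing runs:

```python
import numpy as np

# Centroids copied from the output printed above.
centers = np.array([
    [-0.55537629, -0.32971364],
    [4.86661316, 0.42352176],
    [-2.15656147, -4.30478556],
])

# Order the rows by their x1 coordinate, then round for display.
order = np.argsort(centers[:, 0])
ordered = np.round(centers[order], 3)
print(ordered)
```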


3. Load the clusters(91).csv file into the DataFrame. <br>
A model has been created with the KMeans class from scikit-learn. Use this model (kmeans) to predict a cluster number for each sample and store the result in the 'y_kmeans' column of the df DataFrame. <br>
In response, print the first ten rows of the df DataFrame to the console.

In [3]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
 
np.random.seed(42)
df = pd.read_csv('clusters(91).csv')
kmeans = KMeans(n_clusters=3, max_iter=1000, random_state=42)
kmeans.fit(df)
 
y_kmeans = kmeans.predict(df)
df['y_kmeans'] = y_kmeans
print(df.head(10))

         x1        x2  y_kmeans
0 -2.776333 -4.166641         2
1 -1.335879 -1.083934         0
2  6.507272 -0.158773         1
3 -0.956622  0.235036         0
4 -1.558383 -3.969630         2
5 -0.652304 -1.332604         0
6  5.560753  1.517069         1
7 -0.891052 -3.455786         2
8  6.391479  3.597473         1
9  5.812508 -0.845526         1
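
For a fitted K-Means model, prediction amounts to picking the nearest centroid. The sketch below checks this by hand with NumPy only, using the three centroids printed in exercise 2 and the ten rows printed above; the variable names are illustrative.

```python
import numpy as np

# Centroids as printed in exercise 2 (rows are clusters 0, 1, 2).
centers = np.array([
    [-0.55537629, -0.32971364],
    [4.86661316, 0.42352176],
    [-2.15656147, -4.30478556],
])

# The first ten samples as printed above (x1, x2).
points = np.array([
    [-2.776333, -4.166641],
    [-1.335879, -1.083934],
    [6.507272, -0.158773],
    [-0.956622, 0.235036],
    [-1.558383, -3.969630],
    [-0.652304, -1.332604],
    [5.560753, 1.517069],
    [-0.891052, -3.455786],
    [6.391479, 3.597473],
    [5.812508, -0.845526],
])

# Euclidean distance from every sample to every centroid, then the closest one.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print(nearest.tolist())
```

The resulting indices agree with the `y_kmeans` column shown in the output above.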


4. Load the clusters.csv file into the DataFrame. <br>
Using the KMeans class from scikit-learn (set random_state=42), create a list of WCSS (Within-Cluster Sum of Squares) values for the number of clusters from 2 to 9 inclusive. Round the WCSS values to two decimal places and print the list to the console.

In [4]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
 
 
np.random.seed(42)
df = pd.read_csv('clusters.csv')
 
wcss = []

for i in range(2, 10):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(df)
    wcss.append(round(kmeans.inertia_, 2))
print(wcss)

[957.78, 779.97, 618.5, 510.3, 425.63, 377.54, 333.38, 304.72]
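
scikit-learn documents `inertia_` as the sum of squared distances of samples to their closest cluster center. To make that quantity concrete, here is a minimal NumPy sketch that computes it by hand, assuming each cluster's center is the mean of its points; the helper name `wcss` and the toy data are made up for illustration.

```python
import numpy as np


def wcss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


# Hypothetical toy data: two clusters of two points each.
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
toy_labels = np.array([0, 0, 1, 1])

# Cluster 0 contributes 1 + 1 = 2, cluster 1 contributes 4 + 4 = 8.
print(wcss(toy_points, toy_labels))
```

Adding clusters can only keep this value the same or lower it at the optimum, which is why the printed list decreases and the "elbow" of the curve is used to choose the number of clusters.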


5. Load the clusters.csv file into the DataFrame. <br>
Using the DBSCAN class from scikit-learn, create a model to split the given dataset into clusters. Set the following arguments: <br>
<br>
eps=0.6 <br>
min_samples=7 <br>
<br>
Fit the model and add a new column 'cluster' that stores the cluster label of each sample in the df DataFrame. <br>
In response, print the first ten rows of the df DataFrame.

In [5]:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
 
df = pd.read_csv('clusters.csv')
cluster = DBSCAN(eps=0.6, min_samples=7)
cluster.fit(df)
df['cluster'] = cluster.labels_
print(df.head(10))

         x1        x2  cluster
0 -2.486532  7.025770        0
1 -3.522549  8.578303        0
2 -2.982040  7.998514        0
3 -2.135276  6.255888        0
4  2.762504  4.210918       -1
5 -3.541472  8.489106        0
6  1.240259  0.781640       -1
7  0.053390  8.966770       -1
8 -0.827918  6.742253        0
9  3.291716  1.296751        1
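
In the output above, the label -1 is the one scikit-learn's DBSCAN gives to noisy samples, so it should not be counted as a cluster. A short sketch of how the labels might be summarized, using only the ten labels printed above (the variable names are illustrative):

```python
import numpy as np

# Labels of the first ten rows, copied from the output above.
labels = np.array([0, 0, 0, 0, -1, 0, -1, -1, 0, 1])

# Separate noise from clustered samples before counting.
is_noise = labels == -1
n_noise = int(is_noise.sum())
n_clusters = len(set(labels[~is_noise].tolist()))

print(n_clusters, n_noise)
```

On the full DataFrame, the same idea applies to `df['cluster']` in place of the hard-coded array.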
