<h3 data-start="132" data-end="177">Clustering with K-Means (Iris Dataset)</h3>
<p data-start="181" data-end="390"><strong data-start="181" data-end="206">What clustering does:</strong><br data-start="206" data-end="209">Clustering partitions data points into groups <em data-start="275" data-end="297">without using labels</em>. The goal is to find structure in the data: points that are similar to one another end up in the same group.</p>
<p data-start="394" data-end="418"><strong data-start="394" data-end="416">How K-Means works:</strong></p>
<ul>
<li>
<ul data-start="421" data-end="780">
<li data-start="421" data-end="463">
<p data-start="423" data-end="463">Choose the number of clusters (<strong data-start="454" data-end="459">k</strong>).</p>
</li>
<li data-start="466" data-end="552">
<p data-start="468" data-end="552">Initialize <strong data-start="488" data-end="493">k</strong> centers at random. The <em data-start="508" data-end="519">k-means++</em> option used here favors starting centers that are far apart from one another.</p>
</li>
<li data-start="555" data-end="599">
<p data-start="557" data-end="599">Assign each point to the nearest center.</p>
</li>
<li data-start="602" data-end="661">
<p data-start="604" data-end="661">Recompute centers as the mean of their assigned points.</p>
</li>
<li data-start="664" data-end="780">
<p data-start="666" data-end="780">Repeat the assignment and update steps until the assignments stop changing (or change very little).<br data-start="709" data-end="712">The result is a cluster label for each point and the learned cluster centers.</p>
</li>
</ul>
</li>
</ul>
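<p>The steps above can be sketched in plain Python. This is a minimal illustration, not the code that the <code>Clusterer</code> block runs internally; the function names (<code>squared_distance</code>, <code>kmeans_pp_init</code>, <code>kmeans</code>), the default <code>max_iter</code>, and the toy data are assumptions made for the example.</p>

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: the first center is picked uniformly at random,
    each later one with probability proportional to its squared distance
    from the nearest center chosen so far."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and update steps until labels stop changing."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data (made up for illustration): two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

<p>Which group receives label 0 and which receives label 1 depends on the random seed; only the grouping itself is meaningful.</p>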
<p data-start="784" data-end="814"><strong data-start="784" data-end="812">What this cell does:</strong></p>
<ul>
<li>
<ul data-start="817" data-end="1431">
<li data-start="817" data-end="925">
<p data-start="819" data-end="925"><strong data-start="819" data-end="836">Load dataset:</strong> Import the Iris measurements and target labels (though clustering ignores the labels).</p>
</li>
<li data-start="928" data-end="1022">
<p data-start="930" data-end="1022"><strong data-start="930" data-end="949">Split the data:</strong> Use 60% for training the clusterer and 40% for evaluating/visualizing.</p>
</li>
<li data-start="1025" data-end="1116">
<p data-start="1027" data-end="1116"><strong data-start="1027" data-end="1041">Add noise:</strong> Add a small amount of uniform noise (between 0.0 and 0.1) so the clusters are less cleanly separable.</p>
</li>
<li data-start="1119" data-end="1206">
<p data-start="1121" data-end="1206"><strong data-start="1121" data-end="1142">Cluster the data:</strong> Run <strong data-start="1147" data-end="1158">K-Means</strong> with 3 clusters, one for each of the 3 iris species.</p>
</li>
<li data-start="1209" data-end="1314">
<p data-start="1211" data-end="1314"><strong data-start="1211" data-end="1225">Visualize:</strong> Plot two features (sepal length vs. sepal width) and color points by assigned cluster.</p>
</li>
<li data-start="1317" data-end="1431">
<p data-start="1319" data-end="1431"><strong data-start="1319" data-end="1339">Inspect results:</strong> Print the learned cluster centers, number of clusters, and the cluster label of each point.</p>
</li>
</ul>
</li>
</ul>
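<p>To make the split and noise steps concrete, here is a rough standard-library sketch of what they amount to. It is not the implementation behind the <code>Splitter</code> and <code>Noise</code> blocks: the helper names are hypothetical, and it assumes that the noise is added independently to every numeric value.</p>

```python
import random


def train_test_split_rows(rows, test_size=0.4, seed=42):
    """Shuffle the row indices and hold out a `test_size` fraction."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


def add_uniform_noise(rows, low=0.0, high=0.1, seed=42):
    """Return a copy of `rows` with uniform noise added to each value."""
    rng = random.Random(seed)
    return [[value + rng.uniform(low, high) for value in row] for row in rows]


# Stand-in data with the same number of rows as Iris (150); values are made up.
rows = [[float(i), float(i) / 2] for i in range(150)]
train, test = train_test_split_rows(rows, test_size=0.4, seed=42)
noisy_train = add_uniform_noise(train, low=0.0, high=0.1)
print(len(train), len(test))  # 90 60
```

<p>With 150 rows, a 60/40 split leaves 90 rows for the clusterer, which matches the 90 cluster labels printed by the cell.</p>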

In [1]:
from pyy.dataset import Dataset
from pyy.splitter import Splitter
from pyy.noise import Noise
from pyy.clusterer import Clusterer
from pyy.plot import ScatterPlot

# VISIBLE_CODE_START
#✏️ Clustering
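# Load the Iris measurements; the "target" column holds the species labels.
Dataset(file_name='iris.csv', target=["target"], block_id='~_p$V{M*._ecyw6RG?]a')
# Hold out 40% of the rows; random_state=42 makes the split repeatable.
Splitter(test_size=0.4, random_state=42, block_id='?cVJbHG+;p/tc3C6+#3L')
# Add uniform noise between 0.0 and 0.1.
Noise(noise_type='uniform', columns=[], min=0.0, max=0.1, block_id='1b*,)VdiK{%R:PcCmPi:')
# Fit K-Means with k-means++ initialization and 3 clusters.
my_clusterer = Clusterer(name='K-Means', init='k-means++', n_clusters=3, block_id='1;bR;WypE@G9sVrSs0T?')
# Scatter plot of sepal length vs. sepal width, colored by cluster.
ScatterPlot(columns=["sepal length (cm)", "sepal width (cm)"], fig_size='6.4 x 4.8', title='(auto)', show_color_bar=True, cmap='tab10', block_id='*0Z[cvtF[iIA:e@CAh}s')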
print('Centers:' + str(my_clusterer.get_centers()))
print('number of clusters:' + str(my_clusterer.get_n_clusters()))
print('cluster labels:' + str(my_clusterer.get_labels()))
# VISIBLE_CODE_END

0

Centers:[[5.86898223 2.80163597 4.42287817 1.46343324]
 [5.00203282 3.4388319  1.48727886 0.27805347]
 [6.94803244 3.14554389 5.73042322 2.06250656]]
number of clusters:3
cluster labels:[1 1 1 0 1 1 0 0 0 1 2 1 2 2 0 0 1 1 0 0 0 1 2 1 1 2 0 0 0 0 0 1 0 2 1 1 1
 2 1 1 0 1 0 0 0 2 0 2 0 2 1 1 0 1 1 0 0 0 1 0 1 2 0 1 0 1 2 2 2 0 1 0 1 1
 1 1 2 2 0 0 2 2 0 2 0 2 1 0 0 2]
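
One way to read the printed centers is to ask which center a given flower is closest to, which is the same rule K-Means uses in its assignment step. The sketch below is an illustration under assumptions: the centers are copied from the output above and rounded to four decimals, the feature order is assumed to be sepal length, sepal width, petal length, petal width, and `nearest_center` and the sample measurement are hypothetical.

```python
import math

# Centers copied from the printed output above, rounded to 4 decimals.
# Assumed feature order: sepal length, sepal width, petal length, petal width.
centers = [
    [5.8690, 2.8016, 4.4229, 1.4634],
    [5.0020, 3.4388, 1.4873, 0.2781],
    [6.9480, 3.1455, 5.7304, 2.0625],
]


def nearest_center(point, centers):
    """Index of the center with the smallest Euclidean distance to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


sample = [5.1, 3.5, 1.4, 0.2]  # hypothetical measurement with short petals
print(nearest_center(sample, centers))  # prints 1
```

Cluster indices are arbitrary: index 1 here simply refers to the second row of the printed centers, not to a particular species.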