# Initialize the K Starting Centroids in Practice
Before creating the model, import the required libraries and load the iris dataset into a DataFrame.

In [None]:
# Initial imports
import pandas as pd
import plotly.express as px
import hvplot.pandas
from sklearn.cluster import KMeans

In [None]:
# Load Dataset
file_path = r"C:\Users\Stephen\Desktop\Class Projects\Cryptocurrencies\iris.csv"
# Use the name df_iris, which the rest of the notebook refers to
df_iris = pd.read_csv(file_path)
df_iris.head()

![image.png](attachment:image.png)

## Initialize the K Starting Centroids
After data has been loaded, create an instance of the K-means algorithm and initialize it with the desired number of clusters (K).

### IMPORTANT
We're working with data that has a known number of clusters. Often, you won't know how many clusters to use, so you'll have to determine the number by trial and error. In the next section, we'll learn an approach that makes this trial and error more systematic.

For this example, we know that there are three different classes of iris plants, so we'll use K = 3:



In [None]:
# Initializing model with K = 3 (since we already know there are 3 classes of iris plants)
model = KMeans(n_clusters=3, random_state=5)
model

![image.png](attachment:image.png)
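The `random_state` argument seeds the random choice of starting centroids, so two models created with the same seed should arrive at the same result. The sketch below illustrates this, assuming scikit-learn's bundled copy of the iris data (`load_iris`) as a stand-in for `df_iris`. The names `model_a` and `model_b` are hypothetical, and `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: the bundled iris features stand in for the CSV loaded above
X = load_iris().data

# Two separate models with the same K and the same seed
model_a = KMeans(n_clusters=3, random_state=5, n_init=10).fit(X)
model_b = KMeans(n_clusters=3, random_state=5, n_init=10).fit(X)

# With identical seeds, the fitted centroids should match
same_centroids = np.allclose(model_a.cluster_centers_, model_b.cluster_centers_)
print(same_centroids)

# One centroid per cluster, one coordinate per feature
print(model_a.cluster_centers_.shape)
```

Changing `random_state` may produce different starting centroids, which can change which cluster receives which numeric label.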

## Data Points Assigned to Nearest Centroid
Once the model instance is created, our next step is to fit the model to the unlabeled data. Fitting should be familiar from supervised learning; however, you'll notice that the data is not split into training and test sets. While the model is being trained (fit to the data), the K-means algorithm iteratively looks for the best centroid for each of the K clusters:

In [None]:
# Fitting model
model.fit(df_iris)
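To make the iterative search more concrete, here is a simplified NumPy sketch of the two steps that K-means repeats: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is an illustration only, not scikit-learn's implementation. The function name `kmeans_sketch` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Simplified K-means loop for illustration (not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: three blobs of 50 two-dimensional points each
rng = np.random.default_rng(42)
blob_centers = ([0, 0], [5, 5], [0, 5])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in blob_centers])

centroids, labels = kmeans_sketch(X, k=3)
print(centroids.shape)
print(np.unique(labels))
```

The result depends on the starting centroids, which is one reason the model above was created with a fixed `random_state`.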

## Group Data Points
After the model is fit, the corresponding cluster for every iris plant in the dataset can be found using the predict() method:

In [None]:
# Get the predictions
predictions = model.predict(df_iris)
print(predictions)

![image.png](attachment:image.png)

### IMPORTANT
As you can see, there are three clusters, labeled 0, 1, and 2. These numbers are not the means of the centroids, but just arbitrary label names. Giving the clusters meaningful names is the job of a subject matter expert, or whoever performs the analysis, such as yourself. Note that the K-means algorithm does not discover how many clusters are in the data: it groups the data into the K clusters we asked for and labels them with numbers.

After we have the class for each data point, we can add a new column to the DataFrame with the predicted classes:

In [None]:
# Add a new class column to the df_iris
df_iris["class"] = model.labels_
df_iris.head()

![image.png](attachment:image.png)
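Because the labels 0, 1, and 2 are arbitrary, one way to interpret them is to cross-tabulate them against known species names when those happen to be available. The sketch below assumes scikit-learn's bundled iris data (`load_iris`), which includes the species, as a stand-in for the CSV. The names `features`, `species`, `cluster`, and `comparison` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: the bundled iris data stands in for the CSV used above
iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
species = pd.Series(iris.target_names[iris.target], name="species")

# Fit the model and get one cluster label per row
model = KMeans(n_clusters=3, random_state=5, n_init=10)
cluster = pd.Series(model.fit_predict(features), name="class")

# Rows are cluster labels, columns are species, cells are counts of plants
comparison = pd.crosstab(cluster, species)
print(comparison)
```

Each row of the table shows which species dominate a given cluster, which is the kind of evidence an analyst could use to name the clusters.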

## Visualize the Results
Visualizing the clusters helps to graphically understand how they are arranged. In this case, we actually have too many features to represent visually, but we can select a few of them and plot the clusters.

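For our visualizations, we'll use hvPlot for a 2D scatter plot and Plotly Express for a 3D scatter plot. Both are graphing libraries that produce interactive charts, which allow deeper exploration of the data.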



In [None]:
# Import Plotly Express and hvPlot (already imported at the top; repeated here for reference)
import plotly.express as px
import hvplot.pandas

First, look at the data with two features. The hvPlot library makes it easy to create scatter plots directly from a Pandas DataFrame. After our DataFrame has been loaded in from the CSV, we can create a scatter plot with one line of code. We pass in the arguments for the x- and y-axis and color them by class:

In [None]:
# Create a scatterplot of df_iris with two features
df_iris.hvplot.scatter(x="sepal_length", y="sepal_width", by="class")

![image.png](attachment:image.png)

In the results, it appears that some of the clusters overlap and don't quite form three distinct groups as we had hoped. Before jumping to the conclusion that our model didn't do what we wanted, remember that the clusters are defined by more features than the two shown here. A 2D graph has only two axes, so three features (such as petal_width, sepal_length, and petal_length) can't be displayed properly on it.

Plotting in 3D takes a few more arguments and allows us to visualize more features. We now have an x-, y-, and z-axis that take three of our features as coordinates. We pass in the class column to determine the color and symbol of the points. The size of the points is determined by sepal_width.

Finally, we'll update the figure layout to position the legend by passing a dictionary with x and y coordinates:

In [None]:
# Plotting the clusters with three features
fig = px.scatter_3d(
    df_iris,
    x="petal_width",
    y="sepal_length",
    z="petal_length",
    color="class",
    symbol="class",
    size="sepal_width",
    width=800,
)
fig.update_layout(legend=dict(x=0,y=1))
fig.show()

The 3D scatter plot can be rotated by clicking and dragging with the mouse and zoomed using the scroll wheel.

![image.png](attachment:image.png)

Here, you can see that our model did do what we wanted! There are now three distinct groups that correspond to the three clusters that we expect the model to break the data into.