# Lesson 1: Understanding Clustering with k-Means Algorithm Basics

## Introduction
Welcome to **Unsupervised Learning** and **Clustering**. In this course, we'll explore the popular **k-Means clustering algorithm**, a simple yet powerful clustering method. Clustering might sound technical, but if you've ever sorted your clothes into piles by color or type, you've already performed a form of clustering: grouping similar items into categories, or clusters. Let's get started!

---

## Understanding Clustering
- **Supervised learning** is like learning with a teacher. The computer is provided labeled data (input and correct answers) to find patterns and make predictions.
- **Unsupervised learning**, however, is like learning on your own. The algorithm is given data without explicit directions, exploring to find patterns. No correct answers or teacher guidance are involved.

**Clustering** groups objects so that items in the same group (a cluster) are more similar to each other than to those in other clusters.

Example: If you have a list of fruits with their corresponding weight and volume, clustering can segment the data into groups. Although we don’t know what the fruits are, we could infer that data points in the same cluster belong to the same fruit type.

Given a new fruit measurement, you could assign it to a group by finding the closest cluster center.
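
As a minimal sketch of that idea, the snippet below assigns a new measurement to the nearest of two cluster centers. The center values, the `fruit_centers` name, and the `nearest_center` helper are hypothetical and chosen only for illustration; real centers would come from a clustering run.

```python
# Hypothetical cluster centers as (weight, volume) pairs; illustrative values only
fruit_centers = [(150.0, 160.0), (10.0, 9.0)]

def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    distances = [
        ((point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2) ** 0.5
        for c in centers
    ]
    return distances.index(min(distances))

new_fruit = (140.0, 150.0)  # hypothetical new measurement
print(nearest_center(new_fruit, fruit_centers))  # prints 0 (the first center is closer)
```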

This lesson will focus on the widely used **k-Means clustering** method.

---

## k-Means Clustering
The **k-Means clustering algorithm** partitions observations into **k clusters**, where each observation belongs to the cluster whose centroid (the mean of the cluster's points) is nearest. The steps involved are:

1. **Initialization**: Randomly initialize k centroids.
2. **Assignment**: Allocate each data point to the closest centroid.
3. **Update**: Update each centroid by computing the mean of all points in its cluster.

Steps 2 and 3 repeat until the centroids no longer change significantly. In this lesson, we set **k**, the number of clusters, manually.
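
To make the assignment and update steps concrete, here is a rough sketch of a single pass over four hand-picked points. The points, the starting centroids (fixed by hand here rather than drawn at random), and the `squared_distance` helper are assumptions made for illustration.

```python
# Four illustrative 2D points and two starting centroids (fixed by hand, not random)
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(1.0, 1.0), (9.0, 8.0)]

def squared_distance(p, q):
    # Squared Euclidean distance; enough for comparing which centroid is closer
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# Assignment step: put each point in the group of its closest centroid
groups = [[] for _ in centroids]
for p in points:
    nearest = min(range(len(centroids)),
                  key=lambda j: squared_distance(p, centroids[j]))
    groups[nearest].append(p)

# Update step: move each centroid to the mean of its group
updated = [
    (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
    for g in groups
]

print(groups)   # [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
print(updated)  # [(1.0, 1.5), (8.5, 8.0)]
```

Repeating these two steps with `updated` as the new centroids would leave the groups unchanged, which is the stopping condition described above.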

---

## Implementing k-Means: Setup
Let's translate this algorithm into Python code using a simple dataset of 2D points.

```python
import random

# Toy dataset with 2D points
data = [(2, 3), (5, 3.4), (1.3, 1), (3, 4), (2, 3.5), (7, 5)]

# k-Means settings
k = 2  # number of clusters
centers = random.sample(data, k)  # pick k distinct data points as initial centers
```

Next, we create a **distance()** function to calculate the Euclidean distance between two points, which is essential to the k-Means algorithm.

```python
# Definition of Euclidean distance
def distance(point1, point2):
    return ((point1[0]-point2[0])**2 + (point1[1]-point2[1])**2)**0.5
```
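
A quick sanity check can help confirm the function behaves as expected. The sketch below repeats the definition so it runs on its own and compares it with `math.dist` from the standard library (available in Python 3.8 and later); the 3-4-5 triangle is just a convenient test case.

```python
import math

def distance(point1, point2):
    # Same Euclidean distance as defined in the lesson
    return ((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2) ** 0.5

# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5
print(distance((0, 0), (3, 4)))   # 5.0
print(math.dist((0, 0), (3, 4)))  # 5.0
```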

---

## Implementing k-Means: Algorithm
Now, we proceed to implement the k-Means algorithm.

```python
# k-Means algorithm
def k_means(data, centers, k):
    while True:
        clusters = [[] for _ in range(k)] 

        # Assign data points to the closest center
        for point in data:
            distances = [distance(point, center) for center in centers]
            index = distances.index(min(distances)) 
            clusters[index].append(point)

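        # Update centers to be the mean of points in a cluster
        new_centers = []
        for old_center, cluster in zip(centers, clusters):
            if not cluster:
                # An empty cluster has no mean; keep its previous center
                new_centers.append(old_center)
                continue
            center = (sum(point[0] for point in cluster) / len(cluster),
                      sum(point[1] for point in cluster) / len(cluster))
            new_centers.append(center)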

        # Break loop if centers don't change significantly
        if max([distance(new, old) for new, old in zip(new_centers, centers)]) < 0.0001:
            break
        else:
            centers = new_centers
    return clusters, centers
```

This code sets up clusters, assigns each data point to the nearest centroid, and updates the centroids based on the mean of the points in each cluster.

---

## Implementing k-Means: Run
Finally, we run the k-Means algorithm.

```python
clusters, centers = k_means(data, centers, k)

# Print cluster centers
for i, center in enumerate(centers):
    print(f"Cluster{i+1} center is : {center}")
# Example output (varies with the random initial centers; values shown rounded):
# Cluster1 center is : (2.66, 2.98)
# Cluster2 center is : (7.0, 5.0)

# Print clusters
for i, cluster in enumerate(clusters):
    print(f"Cluster{i+1} points are : {cluster}")
# Example output for the same run:
# Cluster1 points are : [(2, 3), (5, 3.4), (1.3, 1), (3, 4), (2, 3.5)]
# Cluster2 points are : [(7, 5)]
```
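
Because the initial centers are drawn at random, separate runs can start from different centers and may end in different clusterings. One possible way to make runs repeatable, sketched here under the assumption that a fixed seed is acceptable, is to call `random.seed` before sampling. The seed value 42 is arbitrary.

```python
import random

data = [(2, 3), (5, 3.4), (1.3, 1), (3, 4), (2, 3.5), (7, 5)]

# Seeding the generator makes random.sample return the same picks each time
random.seed(42)  # arbitrary seed value
first = random.sample(data, 2)

random.seed(42)  # reset to the same seed
second = random.sample(data, 2)

print(first == second)  # True
```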

---

## Implementing k-Means: Visualization
To visualize the clusters, we can plot the points.

```python
import matplotlib.pyplot as plt

colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()

# Plot points
for i, cluster in enumerate(clusters):
    for point in cluster:
        ax.scatter(*point, color=colors[i])

# Plot centers as large black crosses
for center in centers:
    ax.scatter(*center, color='black', marker='x', s=300)

ax.set_title('Clusters and their centers')
plt.show()
```

Crosses represent the cluster centers, and points of different colors belong to different clusters.

---

## Lesson Summary and Practice
Congratulations on successfully navigating the core aspects of clustering and implementing the **k-Means algorithm**! Practice exercises are available to help solidify these concepts. I look forward to seeing you in the next lesson!

