## Library

In [60]:
import pandas as pd
import numpy as np

## K-means

To implement the k-means clustering algorithm from scratch, follow these steps:

1. Initialize centroids by selecting random data points from the DataFrame.
2. Assign each data point to the nearest centroid.
3. Update the centroids based on the mean of data points assigned to them.
4. Repeat steps 2 and 3 until convergence or a set number of iterations is reached.
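
As a small illustration of steps 2 and 3, the sketch below runs a single assignment-and-update pass on four made-up points with two hand-picked starting centroids. The names `points`, `centroids`, `dists`, and `new_centroids`, and the data values, are assumptions chosen for this example and do not come from the notebook itself.

```python
import numpy as np

# Hypothetical toy data: four 2-D points and two hand-picked starting centroids.
points = np.array([[1.0, 1.0], [2.0, 2.0], [9.0, 9.0], [10.0, 10.0]])
centroids = np.array([[0.0, 0.0], [8.0, 8.0]])

# Step 2: Euclidean distance from every point to every centroid, shape (4, 2),
# then the index of the closest centroid for each point.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# Step 3: move each centroid to the mean of the points assigned to it.
new_centroids = np.array([points[assignments == i].mean(axis=0) for i in range(2)])

print(assignments)    # [0 0 1 1]
print(new_centroids)  # rows [1.5 1.5] and [9.5 9.5]
```

Repeating these two steps until the centroids stop moving is what step 4 describes.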

Here is the outline of a Python function that computes k-means clustering with $k = 2$ for a pandas DataFrame with two columns:

```python
import pandas as pd
import numpy as np

def distance(p1, p2):
    """Compute Euclidean distance between two points."""
    return np.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

def kmeans_2clusters(df, max_iterations=100, tolerance=1e-4):
    """
    Compute 2-means clustering on a DataFrame with two columns.
    
    Args:
    - df (pd.DataFrame): Input DataFrame with two columns.
    - max_iterations (int): Maximum number of iterations.
    - tolerance (float): Convergence tolerance.
    
    Returns:
    - centroids (np.array): Updated centroids.
    - assignments (list): Cluster assignments for each data point.
    """
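    # 1. Initialize centroids by selecting two random data points.
    # 2. Assign each data point to the nearest centroid.
    # 3. Update centroids.
    # 4. Repeat steps 2 and 3 until convergence or max_iterations is reached.
    # Outline only: `centroids` and `assignments` are computed in the full implementation.

    return centroids, assignments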
```

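The full implementation in the next cell fills in these steps. It picks two random rows of the DataFrame as the initial centroids, then repeatedly assigns each data point to its nearest centroid and recomputes each centroid as the mean of its assigned points. The loop stops once no centroid moves by more than `tolerance`, or after `max_iterations` passes.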

In [None]:
def distance(p1, p2):
    """Compute Euclidean distance between two points."""
    return np.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

def kmeans_2clusters(df, max_iterations=100, tolerance=1e-4):
    """
    Compute 2-means clustering on a DataFrame with two columns.

    Args:
    - df (pd.DataFrame): Input DataFrame with two columns.
    - max_iterations (int): Maximum number of iterations.
    - tolerance (float): Convergence tolerance.

    Returns:
    - centroids (np.array): Updated centroids.
    - assignments (list): Cluster assignments for each data point.
    """
    # 1. Initialize centroids by selecting two random data points.
    centroids = df.sample(2).values

    for _ in range(max_iterations):
        # 2. Assign each data point to the nearest centroid.
        assignments = []
        for _, row in df.iterrows():
            distances = [distance(row.values, centroid) for centroid in centroids]
            cluster = np.argmin(distances)
            assignments.append(cluster)

        # Convert assignments to a NumPy array for boolean indexing
        assignments = np.array(assignments)

        # 3. Update centroids.
        new_centroids = []
        for i in range(2):
            cluster_points = df[assignments == i].values
            if len(cluster_points) > 0:
                new_centroid = cluster_points.mean(axis=0)
                new_centroids.append(new_centroid)
            else:
                # if no points were assigned to the centroid, reinitialize it
                new_centroids.append(df.sample(1).values[0])

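        new_centroids = np.array(new_centroids)

        # Check for convergence: measure how far each centroid moved,
        # then adopt the new centroids before deciding whether to stop,
        # so the returned centroids are always the most recent ones.
        shifts = [distance(centroids[i], new_centroids[i]) for i in range(2)]
        centroids = new_centroids
        if max(shifts) < tolerance:
            break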

    return centroids, assignments

In [61]:
# Example
df = pd.DataFrame({
    'x': [1, 2, 5, 6, 9, 10],
    'y': [1, 2, 5, 6, 9, 10]
})

centroids, assignments = kmeans_2clusters(df)
print('Centroids:', centroids)
print('Assignments:', assignments)

Centroids: [[7.5 7.5]
 [1.5 1.5]]
Assignments: [1 1 0 0 0 0]
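
Because the initial centroids are drawn at random, repeated runs can swap the labels 0 and 1 or settle on a different split of the points. A minimal sketch of a reproducible variant follows. It assumes a hypothetical `seed` argument and a hypothetical function name `kmeans_2clusters_seeded`, neither of which appears in the notebook above. It also swaps the row-by-row loop for NumPy broadcasting, and keeps the old centroid if a cluster is empty instead of resampling; both are design choices of this sketch, not the original method.

```python
import numpy as np
import pandas as pd

def kmeans_2clusters_seeded(df, max_iterations=100, tolerance=1e-4, seed=0):
    """Hypothetical seeded, vectorized variant of 2-means (illustrative only)."""
    rng = np.random.default_rng(seed)
    points = df.to_numpy(dtype=float)

    # Pick two distinct rows as the starting centroids.
    centroids = points[rng.choice(len(points), size=2, replace=False)]

    for _ in range(max_iterations):
        # Distances from every point to both centroids, shape (n_points, 2).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)

        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = centroids.copy()
        for i in range(2):
            members = points[assignments == i]
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)

        # Largest distance any centroid moved in this pass.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tolerance:
            break

    return centroids, assignments

df = pd.DataFrame({'x': [1, 2, 5, 6, 9, 10], 'y': [1, 2, 5, 6, 9, 10]})
first = kmeans_2clusters_seeded(df, seed=0)
second = kmeans_2clusters_seeded(df, seed=0)
```

Calling the function twice with the same `seed` returns identical centroids and assignments. A different seed may change which cluster is labeled 0.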
