In [None]:
When clustering sequences of numbers that represent trajectories of colored nodes on a graph, the choice of distance metric determines which similarities the clusters capture. Here are a few distance metrics commonly used for clustering sequences:

Hamming Distance:

Suitable when sequences have equal length and hold categorical data (such as color categories). It counts the positions at which the corresponding elements of two sequences differ.

Levenshtein Distance (or Edit Distance):

Measures the minimum number of single-element edits (insertions, deletions, substitutions) required to change one sequence into the other. It is useful when sequences have different lengths and need alignment.

Jaccard Distance:

Computes the dissimilarity between sets of elements. It is often used when sequences represent sets of items (e.g., nodes visited in a graph) and the order of items matters less than their presence or absence.

Euclidean Distance:

Typically used when sequences can be represented as vectors (e.g., using one-hot encoding for categorical data). It measures the straight-line distance between two points in a multidimensional space.

Cosine Distance:

Defined as one minus the cosine of the angle between two vectors in a multidimensional space. It is useful for comparing sequences represented as vectors when only direction matters, because it ignores the vectors' magnitudes.
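
As a rough illustration of how three of these metrics behave on short color sequences, here is a minimal pure-Python sketch. The function names and the sample sequences are assumptions made for this example, not part of any library.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the visited elements; order is ignored."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


# Hypothetical color sequences
seq_a = [1, 7, 9, 10]
seq_b = [1, 8, 9, 10]
seq_c = [1, 7, 9]

print(hamming_distance(seq_a, seq_b))      # 1
print(levenshtein_distance(seq_a, seq_c))  # 1
print(jaccard_distance(seq_a, seq_b))      # 0.4
```

Hamming and Levenshtein respect element order, while Jaccard only looks at which colors occur, so the same pair of sequences can rank quite differently under each metric.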

Extracting preferences from clusters involves analyzing the clustered sequences to identify common patterns and frequent elements. Here’s a step-by-step approach to extract and interpret the color preferences from the clustered sequences:

Step-by-Step Approach

	1.	Cluster Analysis:
	•	Once the sequences are clustered, examine the sequences within each cluster to identify common patterns.
	•	Calculate the frequency of each color within the clusters to determine which colors are preferred.
	2.	Frequency Analysis:
	•	Count the occurrences of each color in each cluster.
	•	Identify the most frequent colors in each cluster as the preferred colors.
	3.	Pattern Identification:
	•	Look for common subsequences within each cluster.
	•	Identify any repeating patterns or common transitions between colors.
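
The frequency and pattern steps above can be sketched as follows, assuming the sequences of one cluster are already collected in a list. `transition_counts` and `cluster_sequences` are hypothetical names used only for this illustration.

```python
from collections import Counter


def transition_counts(sequences):
    """Count consecutive (from_color, to_color) pairs across a group of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # pairs of adjacent elements
    return counts


# Hypothetical contents of a single cluster
cluster_sequences = [
    [1, 7, 9, 10],
    [1, 7, 9, 11],
    [2, 7, 9, 11],
]

# Frequency analysis: how often each color appears in the cluster
color_counts = Counter(color for seq in cluster_sequences for color in seq)

# Pattern identification: which color-to-color transitions repeat
transitions = transition_counts(cluster_sequences)

print(color_counts.most_common(2))
print(transitions.most_common(1))  # [((7, 9), 3)]
```

Counting adjacent pairs is only one simple notion of a "pattern"; longer n-grams follow the same idea by zipping more shifted copies of the sequence.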

Implementation Example

Here’s an example to demonstrate how to extract preferences from clustered sequences:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from difflib import SequenceMatcher
from collections import Counter
import pandas as pd

# Example sequences of colors (numbers)
sequences = [
    [1, 7, 9, 10],
    [1, 8, 9, 10],
    [2, 7, 9, 11],
    [1, 7, 9, 11],
    [1, 7, 8, 10]
]

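# Function to compute a distance from the longest common contiguous block.
# Note: find_longest_match returns the longest matching *contiguous* run,
# not the longest common subsequence, despite the function's name.
def lcs_distance(seq1, seq2):
    match = SequenceMatcher(None, seq1, seq2)
    lcs_length = match.find_longest_match(0, len(seq1), 0, len(seq2)).size
    return 1 - (lcs_length / min(len(seq1), len(seq2)))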

# Compute pairwise distance matrix
n = len(sequences)
distance_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        distance = lcs_distance(sequences[i], sequences[j])
        distance_matrix[i, j] = distance
        distance_matrix[j, i] = distance

# Clustering using Agglomerative Hierarchical Clustering
# (scikit-learn >= 1.2 names this argument `metric`; older releases called it `affinity`)
clustering = AgglomerativeClustering(metric='precomputed', linkage='average', n_clusters=2)
labels = clustering.fit_predict(distance_matrix)

print("Cluster labels:", labels)

# Assign sequences to clusters
clustered_sequences = {}
for label, seq in zip(labels, sequences):
    if label not in clustered_sequences:
        clustered_sequences[label] = []
    clustered_sequences[label].append(seq)

# Extract preferences
preferences = {}
for label, seqs in clustered_sequences.items():
    all_colors = [color for seq in seqs for color in seq]
    color_counts = Counter(all_colors)
    preferences[label] = color_counts

# Display preferences
for label, color_counts in preferences.items():
    print(f"Cluster {label}:")
    print(pd.DataFrame(color_counts.most_common(), columns=["Color", "Frequency"]))
    print()

# Identify common patterns in clusters
for label, seqs in clustered_sequences.items():
    print(f"Cluster {label} sequences:")
    for seq in seqs:
        print(seq)
    print()

Explanation

	1.	Cluster Analysis:
	•	After clustering, sequences are grouped based on their similarity.
	2.	Frequency Analysis:
	•	For each cluster, the frequency of each color is counted using Counter from the collections module.
	•	The most_common method of Counter helps list colors by their frequency.
	3.	Pattern Identification:
	•	The sequences in each cluster are printed so that patterns can be inspected visually; the code does not extract common subsequences automatically.
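
One caveat about the distance function: `SequenceMatcher.find_longest_match` returns the longest contiguous matching block, which is not the same as the longest common subsequence. If a true LCS-based distance is wanted, a standard dynamic-programming version could look like the sketch below. `lcs_length` and `true_lcs_distance` are hypothetical names, and the normalization simply mirrors the one used in the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (elements need not be adjacent)."""
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def true_lcs_distance(a, b):
    """1 - LCS length / shorter length, following the example's normalization."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        # Two empty sequences are identical; empty vs non-empty is maximally distant
        return 0.0 if len(a) == len(b) else 1.0
    return 1 - lcs_length(a, b) / shorter


print(lcs_length([1, 7, 9, 10], [1, 8, 9, 10]))         # 3  (the subsequence 1, 9, 10)
print(true_lcs_distance([1, 7, 9, 10], [1, 8, 9, 10]))  # 0.25
```

For the same pair, the contiguous-block version finds only `[9, 10]` (length 2) and returns 0.5, so the two definitions can lead to different clusterings.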

Example Output Interpretation

The output has the following form. The counts shown here are illustrative; the actual values depend on which sequences end up in each cluster:

Cluster 0:
   Color  Frequency
0      1          3
1      7          3
2      9          3
3     10          2
4      8          1

Cluster 1:
   Color  Frequency
0      7          2
1      9          2
2     11          2
3      2          1
4      1          1

	•	Cluster 0: The most preferred colors are 1, 7, and 9, followed by 10.
	•	Cluster 1: The most preferred colors are 7, 9, and 11.
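
Because clusters can contain different numbers of sequences, raw counts are not directly comparable between clusters. One possible refinement, sketched here under the assumption that each cluster is a list of color sequences, is to convert counts into shares. `color_shares` is a hypothetical helper, not part of the code above.

```python
from collections import Counter


def color_shares(sequences):
    """Return each color's share of all color occurrences in a group of sequences."""
    counts = Counter(color for seq in sequences for color in seq)
    total = sum(counts.values())
    # most_common() keeps the colors ordered from most to least frequent
    return {color: count / total for color, count in counts.most_common()}


# Hypothetical cluster with two sequences
shares = color_shares([[2, 7, 9, 11], [1, 7, 9, 11]])
for color, share in shares.items():
    print(f"color {color}: {share:.3f}")
```

A color that makes up a quarter of one small cluster can then be compared with the same color's share in a much larger cluster.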

Summary

	1.	Cluster Analysis: Group sequences into clusters based on similarity.
	2.	Frequency Analysis: Count the frequency of each color in each cluster.
	3.	Pattern Identification: Identify common patterns and sequences within each cluster.
	4.	Interpretation: Determine the preferred colors based on their frequency and commonality in the clusters.

This approach helps you derive meaningful insights into color preferences from the clustered trajectory data.

In [None]:
# Chosen method

In [None]:
In the provided code snippet, the goal is to perform hierarchical clustering on a dataset using different linkage methods and a Jaccard distance metric. Here's an explanation of how the distance matrix (`distance_matrix_condensed`) is generated and used in hierarchical clustering:

### Step-by-Step Explanation:

1. **Dataset Loading and Preprocessing:**
   ```python
   # Load the dataset
   df = filtered_df.copy()
   
   # Convert paths to sets of nodes
   df['Path_Set'] = df['Path'].apply(lambda x: set(x))
   ```
   - Here, `df['Path']` likely contains sequences of nodes (or paths), which are converted into sets of nodes (`Path_Set`). The Jaccard distance is defined on sets, so this conversion is required, but it discards both the order of the nodes and any repeated visits.

2. **Jaccard Distance Calculation:**
   ```python
   # Calculate the Jaccard distance matrix
   def jaccard_distance(set1, set2):
       if len(set1.union(set2)) == 0:
           return 0
       return 1 - len(set1.intersection(set2)) / len(set1.union(set2))
   ```
   - The `jaccard_distance` function computes the Jaccard distance between two sets (`set1` and `set2`). Jaccard distance measures dissimilarity between sets based on their intersection and union.

3. **Distance Matrix Initialization:**
   ```python
   # Generate the distance matrix
   n = len(df)
   distance_matrix = np.zeros((n, n))
   for i in range(n):
       for j in range(i + 1, n):
           distance_matrix[i, j] = jaccard_distance(df.iloc[i]['Path_Set'], df.iloc[j]['Path_Set'])
           distance_matrix[j, i] = distance_matrix[i, j]
   ```
   - `distance_matrix` is initialized as a symmetric matrix (`n x n`), where `n` is the number of paths in `df`. Each element `distance_matrix[i, j]` represents the Jaccard distance between the sets `df.iloc[i]['Path_Set']` and `df.iloc[j]['Path_Set']`.

4. **Convert to Condensed Format:**
   ```python
   # Convert the distance matrix to a format suitable for linkage
   distance_matrix_condensed = squareform(distance_matrix)
   ```
   - `squareform` converts the symmetric `distance_matrix` into a condensed format suitable for hierarchical clustering algorithms like `linkage`. The condensed format is a one-dimensional array that preserves the upper triangular part of the original distance matrix.

5. **Hierarchical Clustering:**
   ```python
   # Define a function to perform clustering with different parameters
   def hierarchical_clustering(distance_matrix, linkage_method='complete', n_clusters=3):
       # Perform hierarchical clustering
       Z = linkage(distance_matrix, method=linkage_method)
       
       # Cut the dendrogram so that at most n_clusters flat clusters are formed
       clusters = fcluster(Z, n_clusters, criterion='maxclust')
       
       return Z, clusters
   ```
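   - The `hierarchical_clustering` function performs hierarchical clustering with the given `linkage_method`. Despite its name, the `distance_matrix` argument must be the condensed (one-dimensional) distance vector, because `linkage` treats a two-dimensional input as raw observations. The function returns the linkage matrix `Z` and the cluster labels (`clusters`) obtained by cutting the tree into at most `n_clusters` clusters.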

6. **Experimenting with Different Parameters:**
   ```python
   # Experiment with different hyperparameters
   linkage_methods = ['single', 'complete', 'average', 'ward']
   n_clusters = 3
   
   for linkage_method in linkage_methods:
       Z, clusters = hierarchical_clustering(distance_matrix_condensed, linkage_method, n_clusters)
       
       # Assign cluster labels to the original dataframe
       df['Cluster'] = clusters
       
       # Further processing and visualization (color mapping, cluster analysis, dendrogram visualization) are performed iteratively for each `linkage_method`.
   ```
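   - The script iterates through the `linkage_methods`, performs hierarchical clustering on `distance_matrix_condensed`, and assigns the cluster labels to `df`; the per-method color analysis and dendrogram plotting are summarized in the final comment. Note that SciPy documents `ward` (like `centroid` and `median`) as correctly defined only for Euclidean distances, so its results on Jaccard distances should be treated as heuristic.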
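
The steps above can be sketched end to end on toy data. Everything named `toy_df`, `path_sets`, `square`, and `condensed` below is a hypothetical stand-in for the notebook's `filtered_df` and its derived objects, and the paths are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical stand-in for filtered_df: each path is a list of node ids
toy_df = pd.DataFrame({"Path": [[1, 2, 3], [1, 2, 4], [7, 8, 9], [7, 8, 10]]})
path_sets = [set(path) for path in toy_df["Path"]]


def jaccard_distance(set1, set2):
    """1 - |intersection| / |union|; two empty sets are treated as identical."""
    union = set1 | set2
    if not union:
        return 0.0
    return 1 - len(set1 & set2) / len(union)


# Symmetric n x n matrix with a zero diagonal
n = len(path_sets)
square = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        square[i, j] = square[j, i] = jaccard_distance(path_sets[i], path_sets[j])

# Condensed form: the upper triangle flattened to length n * (n - 1) / 2
condensed = squareform(square)

# Average linkage on the condensed distances, then cut into at most 2 clusters
Z = linkage(condensed, method="average")
toy_df["Cluster"] = fcluster(Z, t=2, criterion="maxclust")

print(condensed.round(2))
print(toy_df)
```

In this toy data the first two paths share two of their nodes, as do the last two, so they should end up paired in separate clusters.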

### Summary:

The `distance_matrix_condensed` is a one-dimensional array derived from the original `distance_matrix` using `squareform`, which is the input format the `linkage` function expects for precomputed distances. This approach compares paths by their Jaccard distances, which measure how little two sets of nodes overlap. Iterating over different linkage methods shows how the clustering results vary with the merging strategy.

In [None]:
# import pandas as pd
# from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
# import matplotlib.pyplot as plt
# from scipy.spatial.distance import pdist, squareform
# import numpy as np

# # Load the dataset
# df = filtered_df.copy()

# # Convert paths to sets of nodes
# df['Path_Set'] = df['Path'].apply(lambda x: set(x))

# # Calculate the Jaccard distance matrix
# def jaccard_distance(set1, set2):
#     if len(set1.union(set2)) == 0:
#         return 0
#     return 1 - len(set1.intersection(set2)) / len(set1.union(set2))

# # Generate the distance matrix
# n = len(df)
# distance_matrix = np.zeros((n, n))
# for i in range(n):
#     for j in range(i + 1, n):
#         distance_matrix[i, j] = jaccard_distance(df.iloc[i]['Path_Set'], df.iloc[j]['Path_Set'])
#         distance_matrix[j, i] = distance_matrix[i, j]

# # Convert the distance matrix to a format suitable for linkage
# distance_matrix_condensed = squareform(distance_matrix)

# # Define a function to perform clustering with different parameters
# def hierarchical_clustering(distance_matrix, linkage_method='complete', n_clusters=3):
#     # Perform hierarchical clustering
#     Z = linkage(distance_matrix, method=linkage_method)
    
#     # Cut the dendrogram at a specific level to form clusters
#     clusters = fcluster(Z, n_clusters, criterion='maxclust')
    
#     return Z, clusters

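# # Experiment with different hyperparameters
# # Note: 'ward' assumes Euclidean distances, so its results on Jaccard distances are only heuristic
# linkage_methods = ['single', 'complete', 'average', 'ward']
# # distance_metrics is not used below: the distance matrix is precomputed with Jaccard
# distance_metrics = ['euclidean', 'cityblock', 'cosine', 'jaccard']
# n_clusters = 2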

# for linkage_method in linkage_methods:
#     print(f"Linkage Method: {linkage_method}")
#     Z, clusters = hierarchical_clustering(distance_matrix_condensed, linkage_method, n_clusters)
    
#     # Assign cluster labels to the original dataframe
#     df['Cluster'] = clusters

#     # Color mapping dictionary: `color_mapping` (node id -> color) must already be defined earlier in the notebook

#     # Extract color information for each path
#     def extract_colors(path):
#         return [color_mapping[int(node)] for node in path if int(node) in color_mapping]

#     df['Colors'] = df['Path'].apply(extract_colors)

#     # Count color occurrences in each cluster
#     color_counts_per_cluster = df.groupby('Cluster')['Colors'].apply(lambda colors: pd.Series([color for sublist in colors for color in sublist]).value_counts())

#     # Create a DataFrame to display the color distributions in each cluster
#     color_distribution_df = color_counts_per_cluster.unstack().fillna(0).astype(int)

#     print("Color Distributions in Each Cluster:")
#     print(color_distribution_df)
#     print("\n")

#     # Optionally, visualize the clusters
#     plt.figure(figsize=(10, 6))
#     plt.title(f'Hierarchical Clustering Dendrogram - {linkage_method}')
#     plt.xlabel('Sample index')
#     plt.ylabel('Distance')
#     dendrogram(Z, leaf_rotation=90., leaf_font_size=8., labels=df.index.to_list())
#     plt.show()
