# Data Mining Project 4: Clustering on a Given Dataset
- e-mail: niejy20@lzu.edu.cn
- date: June 13th

# 1. Introduction

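## 1.1 Dataset Overview

### GSE235508 Transcriptome Dataset

- **Data source**: Blood transcriptome data from patients with rheumatoid arthritis (RA) or systemic lupus erythematosus (SLE) and from healthy pregnant women, collected to analyze gene expression differences in immune regulation during pregnancy.
- **Classification task**: Assign samples to groups (e.g. `HEALTHY`, `SPRA`, `SLE`), which is a multi-class problem; the exact classes depend on the values of `samplegroup:ch1`.
- **Data size**: **335 samples**, each with **60,218 gene expression features** (CPM values), a typical high-dimensional, small-sample dataset.
- **Feature characteristics**: The features are gene expression levels, which need normalization (e.g. a log transform). Many genes are zero or have low variance, so feature filtering is required.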
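
As a small illustration of the CPM scale, log transform, and variance filtering mentioned above, the sketch below converts a toy count matrix to counts per million, applies `log1p`, and flags genes with no variance. The matrix values and the helper name `counts_to_cpm` are invented for illustration and are not taken from GSE235508.

```python
import numpy as np

def counts_to_cpm(counts):
    """Scale each sample (row) so that its values sum to one million."""
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=1, keepdims=True)
    return counts / library_size * 1e6

# Toy matrix: 3 samples (rows) x 4 genes (columns); values are invented.
toy_counts = np.array([
    [10, 0, 90, 900],
    [20, 0, 180, 1800],
    [5, 0, 45, 950],
])

cpm = counts_to_cpm(toy_counts)
log_cpm = np.log1p(cpm)  # log(1 + x) maps zero expression to zero

# Genes with (near-)zero variance across samples carry no signal for clustering.
gene_variance = log_cpm.var(axis=0)
keep = gene_variance > 1e-12
print("Genes kept:", keep.tolist())
```

The second gene is zero in every sample, so it is the kind of feature a variance filter would drop before clustering.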

---

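### Dataset Comparison
- In the first assignment I used Breast Cancer Wisconsin, a classic dataset in data mining that, like this assignment's dataset, comes from the medical domain. The two datasets differ as follows:

| Aspect                    | GSE235508 transcriptome dataset                 | Breast Cancer Wisconsin dataset          |
|---------------------------|-------------------------------------------------|------------------------------------------|
| **Number of samples**     | 335 samples                                     | 569 samples                              |
| **Number of features**    | 60,218 gene expression features                 | 30 morphological features                |
| **Application domain**    | Autoimmune disease research                     | Breast cancer diagnosis                  |
| **Classification task**   | Multi-class (disease status groups)             | Binary (benign vs malignant)             |
| **Data challenges**       | High dimensionality, small sample, sparse features | Small sample, highly interpretable features |
| **Typical preprocessing** | Normalization, feature selection, dimensionality reduction | Normalization, feature correlation analysis |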

---

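### Comparative Analysis

1. **Difference in dimensionality**  
   - GSE235508 has far more features (60k+) than Breast Cancer (30), so strategies such as **PCA, t-SNE, or LASSO feature selection** are needed to avoid the curse of dimensionality.
   - Compared with the Breast Cancer dataset, GSE235508 has fewer samples but far higher feature dimensionality, which makes models prone to overfitting.

2. **Domain specificity**  
   - The gene expression features in **medical transcriptome data** are biologically meaningful, but pathway analysis (e.g. GSEA) is needed to improve interpretability.
   - Unlike the physics signals in MiniBooNE, gene expression data usually require a **log transform** and **batch effect correction**.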
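
To make the dimensionality point concrete, here is a minimal sketch on synthetic data (random numbers, not the real expression matrix). It shows that PCA on a matrix with far more features than samples can return at most `n_samples - 1` informative components. The 50 x 2000 shape is an arbitrary stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 50, 2000  # many more features than samples
X_toy = rng.normal(size=(n_samples, n_features))

X_scaled = StandardScaler().fit_transform(X_toy)

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance.
pca = PCA(n_components=0.95, random_state=0)
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)
print("Explained variance:", pca.explained_variance_ratio_.sum())
```

Because centering removes one degree of freedom, 2000 features collapse to fewer than 50 components without losing the requested share of variance.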

---

The comparison shows that the **high-dimensional, small-sample nature** of GSE235508 poses challenges for data mining.

## 1.2 Overview of Clustering Algorithms

This is the content of Section 1.2.

# 2. Clustering (10 algorithms)

In [1]:
# Cell 1: Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, SpectralClustering, Birch
from sklearn.manifold import TSNE
from sklearn.feature_selection import VarianceThreshold
from scipy.cluster.hierarchy import dendrogram, linkage
import hdbscan
import warnings
from sklearn.model_selection import ParameterGrid

# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Set plotting style
sns.set(style="whitegrid", palette="muted")
plt.rcParams['figure.figsize'] = (12, 8)

In [2]:
# Cell 2: Enhanced Data Loader
class ClusteringAnalyzer:
    def __init__(self, expr_path, meta_path):
        self.expr_path = expr_path
        self.meta_path = meta_path
        self.X = None
        self.meta = None
        self.labels_group = None
        self.labels_das28 = None
        
    def load_and_preprocess_data(self, log_transform=True, var_threshold=0.1, pca_components=0.95):
        # Load expression data
        expr = pd.read_csv(self.expr_path, sep='\t', comment='!', index_col=0, encoding='utf-8').T
        
        # Load metadata
        meta = pd.read_csv(self.meta_path, sep='\t', quotechar='"', dtype=str)
        meta = meta[['geo_accession', 'samplegroup:ch1', 'das28:ch1']]
        meta.columns = ['sample_id', 'group', 'das28']
        
        # Merge data
        expr.index.name = 'sample_id'
        merged = expr.merge(meta, left_index=True, right_on='sample_id').set_index('sample_id')
        
        # Handle missing values: 'NA' or non-numeric DAS28 becomes 0,
        # so samples without a score fall into the "Low" bin below
        merged['das28'] = pd.to_numeric(merged['das28'].replace('NA', np.nan), errors='coerce').fillna(0)
        
        # Create target labels
        le_group = LabelEncoder()
        self.labels_group = le_group.fit_transform(merged['group'])
        
        conditions = [
            merged['das28'] < 1,
            (merged['das28'] >= 1) & (merged['das28'] < 3),
            merged['das28'] >= 3
        ]
        choices = [0, 1, 2]  # Low, Medium, High
        self.labels_das28 = np.select(conditions, choices, default=0)
        
        # Feature matrix
        X = merged.drop(['group', 'das28'], axis=1).astype(float)
        
        # Log transformation
        if log_transform:
            X = np.log1p(X)
            
        # Filter low variance genes
        selector = VarianceThreshold(threshold=var_threshold)
        X = selector.fit_transform(X)
        
        # Standardize data
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
        
        # Dimensionality reduction
        pca = PCA(n_components=pca_components, random_state=42)
        self.X = pca.fit_transform(X)
        
        print(f"Data preprocessing complete. Final dimensions: {self.X.shape}")
        print(f"PCA explained variance: {np.sum(pca.explained_variance_ratio_):.2f}")
        
        return self.X, self.labels_group, self.labels_das28

In [6]:
# Cell 3: Data Loading
# Update paths according to your environment
expr_path = "./Data/GSE235508_mRNA_counts.txt"
meta_path = "./Data/GSE235508.meta.txt"

analyzer = ClusteringAnalyzer(expr_path, meta_path)
X, labels_group, labels_das28 = analyzer.load_and_preprocess_data(
    log_transform=True,
    var_threshold=0.5,  # Increased variance threshold
    pca_components=0.99  # Retain more variance
)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 169863: invalid start byte
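
The traceback above suggests that one of the input files is not valid UTF-8. One hedged way to handle this is to try several encodings in order. The sketch below demonstrates the idea on an in-memory byte string rather than the real files. The helper `decode_with_fallback` and the candidate encoding list are assumptions for illustration; the actual encoding of the GSE235508 files cannot be determined from this notebook.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes `raw`.

    latin-1 maps every byte to a character, so placing it last guarantees
    a result, although non-ASCII characters may be decoded incorrectly.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")

# Hypothetical tab-separated line containing byte 0xa1, which is not valid UTF-8 here.
raw_line = b"GSM0000001\tHEALTHY\t\xa1\xa1\n"

text, used = decode_with_fallback(raw_line)
print("Decoded with:", used)
```

With pandas, the encoding found this way could then be passed to `pd.read_csv(..., encoding=used)`. If only ASCII columns such as the sample ID and group are needed, a lossy fallback is usually acceptable.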

In [None]:
# Cell 4: Clustering Evaluation Helper
def evaluate_clustering(X, labels_true, algorithm, params):
    """
    Evaluate clustering algorithm with given parameters
    """
    try:
        model = algorithm(**params)
        
        if hasattr(model, 'fit_predict'):
            clusters = model.fit_predict(X)
        else:
            model.fit(X)
            clusters = model.labels_
            
        # Treat DBSCAN/HDBSCAN noise points (label -1) as one extra cluster
        if -1 in clusters:
            clusters[clusters == -1] = clusters.max() + 1
            
        ari = adjusted_rand_score(labels_true, clusters)
        sil_score = silhouette_score(X, clusters) if len(np.unique(clusters)) > 1 else -1
        
        return {
            'params': params,
            'clusters': clusters,
            'ari': ari,
            'silhouette': sil_score,
            'n_clusters': len(np.unique(clusters))
        }
    
    except Exception as e:
        print(f"Error with {algorithm.__name__} and params {params}: {str(e)}")
        return None
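
The helper scores each clustering with the adjusted Rand index (agreement with the known groups) and the silhouette coefficient (internal cohesion and separation). A small sketch on synthetic blobs, not the real data, illustrates two properties worth keeping in mind: ARI does not care how cluster IDs are numbered, and the silhouette score needs no ground truth. The blob centers are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# ARI is invariant to relabeling: swapping cluster IDs still counts as perfect agreement.
ari_swapped = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print("ARI with swapped IDs:", ari_swapped)

# Synthetic, well-separated blobs (illustrative only).
X_toy, y_toy = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                          cluster_std=0.5, random_state=0)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)

ari = adjusted_rand_score(y_toy, pred)   # uses the true labels
sil = silhouette_score(X_toy, pred)      # uses only the data and the predicted clusters
print(f"ARI: {ari:.3f}, Silhouette: {sil:.3f}")
```

On real expression data the two metrics can disagree: a partition may be geometrically tight (high silhouette) yet unrelated to the disease groups (ARI near 0).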

In [None]:
# Cell 5: KMeans Clustering with Tuning
kmeans_results = []
param_grid = {
    'n_clusters': [2, 3, 4, 5],
    'init': ['k-means++', 'random'],
    'n_init': [10, 20],
    'random_state': [42]
}

for params in ParameterGrid(param_grid):
    result = evaluate_clustering(X, labels_group, KMeans, params)
    if result:
        kmeans_results.append(result)

# Find best result
best_kmeans = max(kmeans_results, key=lambda x: x['ari'])
print(f"Best KMeans - ARI: {best_kmeans['ari']:.4f}, Silhouette: {best_kmeans['silhouette']:.4f}")

In [None]:
# Cell 6: Hierarchical Clustering with Tuning
hierarchical_results = []
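# `metric` replaces the `affinity` argument (deprecated in scikit-learn 1.2, removed in 1.4).
# Ward linkage only supports Euclidean distance, so it gets its own sub-grid.
param_grid = [
    {'n_clusters': [2, 3, 4, 5], 'linkage': ['ward'], 'metric': ['euclidean']},
    {'n_clusters': [2, 3, 4, 5], 'linkage': ['complete', 'average'], 'metric': ['euclidean', 'cosine']},
]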

for params in ParameterGrid(param_grid):
    result = evaluate_clustering(X, labels_group, AgglomerativeClustering, params)
    if result:
        hierarchical_results.append(result)

best_hierarchical = max(hierarchical_results, key=lambda x: x['ari'])
print(f"Best Hierarchical - ARI: {best_hierarchical['ari']:.4f}")

In [None]:
# Cell 7: DBSCAN Clustering with Tuning
dbscan_results = []
param_grid = {
    'eps': [0.3, 0.5, 0.7, 1.0, 1.5],
    'min_samples': [3, 5, 10]
}

for params in ParameterGrid(param_grid):
    result = evaluate_clustering(X, labels_group, DBSCAN, params)
    if result:
        dbscan_results.append(result)

if dbscan_results:
    best_dbscan = max(dbscan_results, key=lambda x: x['ari'])
    print(f"Best DBSCAN - ARI: {best_dbscan['ari']:.4f}")
else:
    print("No valid DBSCAN parameters found")
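
The `eps` values in the grid above are fixed guesses. In a standardized PCA space with many dimensions, nearest-neighbor distances can be much larger than 1.5, in which case DBSCAN may label every sample as noise. A common heuristic is to inspect the distance from each point to its k-th nearest neighbor and pick `eps` near the elbow or a high percentile of that curve. The sketch below uses synthetic data and an arbitrary 90th percentile; both are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for a PCA-reduced matrix (20 dimensions, 3 groups).
X_toy, _ = make_blobs(n_samples=200, centers=3, n_features=20,
                      cluster_std=1.0, random_state=0)

min_samples = 5
# Querying the training data returns each point itself as its first neighbor
# (distance 0), which matches DBSCAN's convention of counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
k_dist = np.sort(distances[:, -1])

eps_guess = float(np.percentile(k_dist, 90))  # arbitrary percentile, tune by eye
labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X_toy)

n_found = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(f"eps guess: {eps_guess:.2f}, clusters found: {n_found}, noise points: {n_noise}")
```

Plotting `k_dist` and reading off the elbow is the usual manual version of this step; the percentile is only a shortcut.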

In [None]:
# Cell 8: Gaussian Mixture Model with Tuning
from sklearn.mixture import GaussianMixture

gmm_results = []
param_grid = {
    'n_components': [2, 3, 4, 5],
    'covariance_type': ['full', 'tied', 'diag', 'spherical'],
    'random_state': [42]
}

for params in ParameterGrid(param_grid):
    try:
        model = GaussianMixture(**params)
        model.fit(X)
        clusters = model.predict(X)
        
        ari = adjusted_rand_score(labels_group, clusters)
        sil_score = silhouette_score(X, clusters)
        
        gmm_results.append({
            'params': params,
            'clusters': clusters,
            'ari': ari,
            'silhouette': sil_score,
            'n_clusters': params['n_components']
        })
    except Exception as e:
        print(f"Error with GMM and params {params}: {str(e)}")

best_gmm = max(gmm_results, key=lambda x: x['ari'])
print(f"Best GMM - ARI: {best_gmm['ari']:.4f}, Silhouette: {best_gmm['silhouette']:.4f}")
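
The grid above selects the GMM with the highest ARI, which relies on the known group labels. When labels are unavailable, a commonly used alternative is the Bayesian information criterion exposed by `GaussianMixture.bic` (lower is better). This is not what the notebook does; it is a sketch on synthetic data with arbitrary blob centers.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic groups (illustrative only).
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                      cluster_std=1.0, random_state=0)

bic_by_k = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X_toy)
    bic_by_k[k] = gmm.bic(X_toy)  # penalizes extra components

best_k = min(bic_by_k, key=bic_by_k.get)
print("BIC per k:", {k: round(v, 1) for k, v in bic_by_k.items()})
print("k with lowest BIC:", best_k)
```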

In [None]:
# Cell 9: Visualization of Best Results
def visualize_results(X, true_labels, cluster_labels, algorithm_name):
    """Visualize clustering results using t-SNE"""
    tsne = TSNE(n_components=2, random_state=42)
    X_tsne = tsne.fit_transform(X)
    
    plt.figure(figsize=(15, 6))
    
    plt.subplot(1, 2, 1)
    sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=true_labels, palette='viridis', s=50, alpha=0.8)
    plt.title(f'True Groups ({algorithm_name})', fontsize=14)
    plt.xlabel('t-SNE 1')
    plt.ylabel('t-SNE 2')
    
    plt.subplot(1, 2, 2)
    sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=cluster_labels, palette='viridis', s=50, alpha=0.8)
    plt.title(f'{algorithm_name} Clustering Results', fontsize=14)
    plt.xlabel('t-SNE 1')
    plt.ylabel('t-SNE 2')
    
    plt.tight_layout()
    plt.savefig(f'{algorithm_name}_clustering.png', dpi=300)
    plt.show()

# Visualize best algorithms
visualize_results(X, labels_group, best_kmeans['clusters'], 'KMeans')
visualize_results(X, labels_group, best_hierarchical['clusters'], 'Hierarchical')
visualize_results(X, labels_group, best_gmm['clusters'], 'GMM')

# 3. Summary

## 3.1 Comparison of Clustering Model Performance
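
One way to build such a comparison is to collect each algorithm's result into a single table. The sketch below does so on synthetic data with three algorithms. The column names mirror the keys returned by `evaluate_clustering`, while the data, the `summarize` helper, and the parameter choices are illustrative assumptions, so the printed numbers say nothing about GSE235508.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the preprocessed feature matrix and group labels.
X_toy, y_toy = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                          cluster_std=1.0, random_state=0)

def summarize(name, clusters):
    """Build one table row with the same metrics used in the notebook."""
    return {
        "algorithm": name,
        "ari": adjusted_rand_score(y_toy, clusters),
        "silhouette": silhouette_score(X_toy, clusters),
        "n_clusters": len(set(clusters)),
    }

rows = [
    summarize("KMeans", KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)),
    summarize("Hierarchical", AgglomerativeClustering(n_clusters=3).fit_predict(X_toy)),
    summarize("GMM", GaussianMixture(n_components=3, random_state=0).fit_predict(X_toy)),
]

# Sort so the algorithm that best matches the reference labels comes first.
summary = pd.DataFrame(rows).sort_values("ari", ascending=False).reset_index(drop=True)
print(summary.to_string(index=False))
```

In the notebook itself, the rows could instead be built from `best_kmeans`, `best_hierarchical`, `best_dbscan`, and `best_gmm` once the data loads successfully.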

## 3.2 Assignment Summary