# **Cluster**

## Objectives

* Fit and evaluate a cluster model to group similar data
* Analyse the clusters against the diagnostic (malignant or benign)
* Understand the profile for each cluster

## Inputs

* Outputs/datasets/collection/breast-cancer.csv
* Instructions on which variables to use for data cleaning and feature engineering, which are found in their respective notebooks.

## Outputs

* Cluster Pipeline
* Train Set
* Most important features to define a cluster plot
* Clusters Profile Description
* Cluster Silhouette

## Additional Comments

* This notebook was written based on the guidelines provided in the walkthrough project 2: 'Churnometer'.
* This notebook relates to the Modelling and Evaluation step of the CRISP-DM methodology.
* This notebook and the following will represent the learning outcome after following the Code Institute - Predictive Analytics and Machine Learning module.


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd() 
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/breast-cancer.csv")
    .drop(['id', 'diagnosis'], axis=1)
    )
print(df.shape)
df.head(3)

# Cluster Pipeline with all data

## ML Pipeline for Cluster

* Our objective is to cluster similar data points and then analyse the clusters against the diagnostic (malignant or benign).

* As a result, we will use only the thirty features (all variables except id and diagnosis) to fit the cluster pipeline.

In [None]:
from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### PCA
from sklearn.decomposition import PCA

### ML algorithm
from sklearn.cluster import KMeans

def PipelineCluster():
    pipeline_base = Pipeline([
        
        ("scaler", StandardScaler()),
        # n_components and n_clusters are placeholders; both are tuned in the sections below
        ("PCA", PCA(n_components=50, random_state=0)),
        ("model", KMeans(n_clusters=50, random_state=0)),
    ])

    return pipeline_base

PipelineCluster()

## Principal Component Analysis (PCA)

* We want to find the most suitable n_components and then update that value in the ML Pipeline for Cluster

* To reach that, we will create an object based on PipelineCluster(), then remove the last two steps (PCA and model): .steps[:-2]

* Finally, the pipeline_pca scales the data, so we can apply PCA afterwards

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_pca = Pipeline(pipeline_cluster.steps[:-2])
df_pca = pipeline_pca.fit_transform(df)

print(df_pca.shape,'\n', type(df_pca))

* Next, we apply PCA separately to the scaled data

In [None]:
%matplotlib inline
# This line is used to display matplotlib plots inline within the Jupyter Notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

n_components = 16


def pca_components_analysis(df_pca, n_components):
    pca = PCA(n_components=n_components).fit(df_pca)
    x_PCA = pca.transform(df_pca)  # array with transformed PCA

    ComponentsList = ["Component " + str(number)
                    for number in range(n_components)]
    dfExplVarRatio = pd.DataFrame(
        data=np.round(100 * pca.explained_variance_ratio_, 3),
        index=ComponentsList,
        columns=['Explained Variance Ratio (%)'])

    dfExplVarRatio['Accumulated Variance'] = dfExplVarRatio['Explained Variance Ratio (%)'].cumsum(
    )

    PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum(
    )

    print(
        f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
    plt.figure(figsize=(9, 6))
    sns.lineplot(data=dfExplVarRatio,  marker="o")
    plt.xticks(rotation=90)
    plt.yticks(np.arange(0, 110, 10))
    plt.show()


pca_components_analysis(df_pca=df_pca, n_components=n_components)

* With seven components we can achieve a bit more than 90% of data variance judging by the above figure
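
The same reading can be taken numerically instead of from the figure. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV (the two may not be identical), and `threshold` and `n_needed` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled dataset, not the notebook's CSV
X_demo = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratio
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.90
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(f"{n_needed} components explain {100 * cumulative[n_needed - 1]:.2f}% of the variance")
```

Working from the cumulative sum avoids refitting PCA once per candidate value of n_components.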

In [None]:
n_components = 7

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca)

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,3),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

* We rewrite the PipelineCluster(), updating n_components to 7

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([

        ("scaler", StandardScaler()),
        ("PCA", PCA(n_components=7, random_state=0)),
        ("model", KMeans(n_clusters=50, random_state=0)),
    ])

    return pipeline_base

PipelineCluster()

## Elbow Method and Silhouette Score

* We will combine 2 techniques (Elbow Method and Silhouette Score) to find the optimal value for the number of clusters

* We will transform the data up to the point that it will hit the model, for Elbow Method and Silhouette score.
  * Therefore we remove the last step (.steps[:-1]) and fit_transform pipeline_analysis to the data.

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_analysis = Pipeline(pipeline_cluster.steps[:-1])
df_analysis = pipeline_analysis.fit_transform(df)

print(df_analysis.shape,'\n', type(df_analysis))

### Elbow Method

* We use KElbowVisualizer() from Yellowbrick to implement the Elbow Method

* We pass in as arguments the algorithm we want (KMeans) and the range for the number of clusters we want to try

In [None]:
from yellowbrick.cluster import KElbowVisualizer

plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']

visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,11)) # 11 is not inclusive
visualizer.fit(df_analysis)
visualizer.show()

From the visualizer we can deduce -

* The plot suggests 3 clusters
* Between 2 and 5, the values have a sharp and steep falloff.
* Outside this range, it does not fall off in a similar manner.
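
To see the numbers behind an elbow curve, a minimal sketch can fit KMeans for each candidate k and record its inertia (the within-cluster sum of squared distances), which is closely related to the distortion score Yellowbrick plots by default. The data here is synthetic (`make_blobs`), an assumption made so the example runs on its own, and `n_init=10` is set explicitly only to keep the behaviour stable across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit KMeans for k = 1..10 and keep the inertia of each fit
k_values = list(range(1, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in k_values
]

# The elbow is the k after which the inertia stops dropping sharply
for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  inertia={inertia:10.2f}")
```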

### Silhouette score

* By Silhouette score, we can interpret and validate the consistency within clusters, which is based on the mean intra-cluster distance and mean nearest-cluster distance for each data point.

The silhouette score range is from -1 to +1.

* A score close to “+1” means that a clustered data point sits well inside its own cluster and is properly separated from other clusters.
* A score close to 0 means the clustered data point is overlapping with another cluster.
* A negative score means that the clustered data point may be wrong; it may even belong to another cluster.
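
As a hedged illustration of these definitions, the sketch below computes the per-point silhouette values and their average with `sklearn.metrics`. It runs on synthetic data (`make_blobs`, an assumption), not on this notebook's dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# One silhouette value per data point, each between -1 and +1
per_point = silhouette_samples(X_demo, labels)

# The average silhouette score is the mean of the per-point values
average = silhouette_score(X_demo, labels)

print(f"average silhouette: {average:.3f}")
print(f"points with a negative silhouette: {(per_point < 0).sum()}")
```

The silhouette plots drawn further down show the per-point values grouped by cluster, while the average score is the single number used to compare different values of k.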

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer

plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']
print("=== Average Silhouette Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(2,5), metric='silhouette')
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()
print("\n")

for n_clusters in np.arange(start=2,stop=11):

    print(f"=== Silhouette plot for {n_clusters} Clusters ===")
    visualizer = SilhouetteVisualizer(estimator = KMeans(n_clusters=n_clusters, random_state=0),
                                        colors = 'yellowbrick')
    visualizer.fit(df_analysis)
    visualizer.show()
    print("\n")

**Optimal Value for clusters -**

* Elbow Method says 3.
* The average Silhouette Score says 2, but the Silhouette Plot from 3 clusters is better than for 2 clusters.
* As a result, we will pick 3, since the Elbow Method and Silhouette Plot both support that decision.

* **We rewrite the PipelineCluster(), updating n_clusters to 3**

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([

        ("scaler", StandardScaler()),
        ("PCA", PCA(n_components=7, random_state=0)),
        ("model", KMeans(n_clusters=3, random_state=0)),
    ])

    return pipeline_base

PipelineCluster()

## Fit Cluster Pipeline

* Quick recap of our data for training cluster pipeline

In [None]:
X = df.copy()
print(X.shape)
X.head(3)

* Fit Cluster pipeline

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

## Add cluster predictions to dataset

* **We add a column "Clusters" (with the Cluster Pipeline predictions) to X**

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_
print(X.shape)
X.head(3)

* We are interested to know the cluster frequency

In [None]:
print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar')
plt.show()

* There are three clusters, and the counting starts from 0.

* We note that the algorithm found that the majority of the data (62%) belongs to cluster number 0.

* Clusters 1 and 2 contain 21% and 17% of the data, respectively.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df_analysis[:, 0], y=df_analysis[:, 1],
                hue=X['Clusters'], palette='Set1', alpha=0.6)
plt.scatter(x=pipeline_cluster['model'].cluster_centers_[:, 0], y=pipeline_cluster['model'].cluster_centers_[:, 1],
            marker="x", s=169, linewidths=3, color="black")
plt.xlabel("PCA Component 0")
plt.ylabel("PCA Component 1")
plt.title("PCA Components colored by Clusters")
plt.show()

From the scatterplot above -

* Cluster 0 has the highest density of data.
* There is very little overlap between the clusters, which is a good sign for the cluster analysis.

* We save the cluster predictions from this pipeline to use in the future.

In [None]:
cluster_predictions_with_all_variables = X['Clusters']
cluster_predictions_with_all_variables

## Fit a classifier, where the target is cluster predictions and features remaining variables

* Our new dataset has Clusters as a variable.
* We use a technique where Clusters will be the target for a classifier, and the remaining variables will be features for that target.

* We copy X to a DataFrame df_clf

In [None]:
df_clf = X.copy()
print(df_clf.shape)
df_clf.tail(3)

* Split Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_clf.drop(['Clusters'], axis=1),
    df_clf['Clusters'],
    test_size=0.2,
    random_state=0
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

* Create classifier pipeline steps.
* We are considering a tree-based model that typically offers good results and whose feature importance can be assessed with .feature_importances_.
* We are using GradientBoostingClassifier since it had good performance in the previous notebook.
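
For context on the feature selection step: with a tree-based estimator and no explicit `threshold`, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below illustrates this on synthetic data (`make_classification`, an assumption); `n_estimators=50` is chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_classes=3, n_clusters_per_class=1, random_state=0)

# Fit the selector; the wrapped estimator is fitted internally
selector = SelectFromModel(GradientBoostingClassifier(n_estimators=50, random_state=0))
selector.fit(X_demo, y_demo)

# Importances of the internal estimator and the boolean mask of kept features
importances = selector.estimator_.feature_importances_
support = selector.get_support()

print(f"threshold (mean importance): {selector.threshold_:.4f}")
print(f"kept feature indices: {np.flatnonzero(support).tolist()}")
```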

In [None]:
# Feat Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithm
from sklearn.ensemble import GradientBoostingClassifier

def PipelineClf2ExplainClusters():
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(GradientBoostingClassifier(random_state=0)) ),
        ("model", GradientBoostingClassifier(random_state=0)),
    ])

    return pipeline_base

PipelineClf2ExplainClusters()

* Fit the classifier to the training data

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster.fit(X_train, y_train)

## Evaluate classifier performance on Train and Test Sets

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

In [None]:
print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))

## Assess the most important Features that define a cluster

In [None]:
pipeline_clf_cluster

In [None]:
# Take the pipeline steps up to, but not including, the classifier
pipeline_feat_select = Pipeline(pipeline_clf_cluster.steps[:-1])

# Fit-transform the scaling and feature selection steps
X_train_selected = pipeline_feat_select.fit_transform(X_train, y_train)

# The columns that survive the selection
selected_columns = X_train.columns[pipeline_clf_cluster['feat_selection'].get_support()]

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame({
    'Feature': selected_columns,
    'Importance': pipeline_clf_cluster['model'].feature_importances_})
                         .sort_values(by='Importance', ascending=False)
)

# assign best features in importance order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
    f"The model was trained on them: \n{best_features} \n")
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

These are the 4 most important features in descending order -

* ['concavity_mean', 'area_worst', 'fractal_dimension_worst', 'perimeter_worst']

In [None]:
best_features_pipeline_all_variables = best_features
best_features_pipeline_all_variables

## Cluster Analysis

* We define a function that builds a table with a description for all Clusters
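
The core idea of that table is to summarise each numerical variable per cluster as its interquartile range, shown as `Q1 -- Q3`. A minimal sketch of that idea with a plain `groupby`, on a toy DataFrame (the `area` column and its values are made up for illustration):

```python
import pandas as pd

# Toy data (assumption): two clusters and one numerical variable
toy = pd.DataFrame({
    "Clusters": [0, 0, 0, 0, 1, 1, 1, 1],
    "area": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Q1 and Q3 per cluster, one row per cluster and one column per quantile
quartiles = toy.groupby("Clusters")["area"].quantile([0.25, 0.75]).unstack()

# Format each row as "Q1 -- Q3"
profile = quartiles.apply(
    lambda row: f"{round(row[0.25], 3)} -- {round(row[0.75], 3)}", axis=1)
print(profile)
```

The functions defined next follow the same logic but also handle categorical variables by listing their most frequent levels.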

In [None]:
def DescriptionAllClusters(df, decimal_points=3):

  DescriptionAllClusters = pd.DataFrame(columns=df.drop(['Clusters'],axis=1).columns)
  # iterate on each cluster , calls Clusters_IndividualDescription()
  for cluster in df.sort_values(by='Clusters')['Clusters'].unique():
    
      EDA_ClusterSubset = df.query(f"Clusters == {cluster}").drop(['Clusters'],axis=1)
      ClusterDescription = Clusters_IndividualDescription(EDA_ClusterSubset,cluster,decimal_points)
      DescriptionAllClusters = pd.concat([DescriptionAllClusters, ClusterDescription], ignore_index=True)

  
  DescriptionAllClusters.set_index(['Cluster'],inplace=True)
  return DescriptionAllClusters


def Clusters_IndividualDescription(EDA_Cluster,cluster, decimal_points):

  ClustersDescription = pd.DataFrame(columns=EDA_Cluster.columns)
  # for a given cluster, iterate in all columns
  # if the variable is numerical, calculate the IQR: display as Q1 -- Q3.
    # That will show the range for the most common values for the numerical variable
  # if the variable is categorical, count the frequencies and display the top 3 most frequent
    # That will show the most common levels for the category

  for col in EDA_Cluster.columns:
    
    try:  # eventually a given cluster will have only missing data for a given variable
      
      if EDA_Cluster[col].dtypes == 'object':
        
        top_frequencies = EDA_Cluster.dropna(subset=[col])[[col]].value_counts(normalize=True).nlargest(n=3)
        Description = ''
        
        for x in range(len(top_frequencies)):
          freq = top_frequencies.iloc[x]
          category = top_frequencies.index[x][0]
          CategoryPercentage = int(round(freq*100,0))
          statement =  f"'{category}': {CategoryPercentage}% , "  
          Description = Description + statement
        
        ClustersDescription.at[0,col] = Description[:-2]


      
      elif EDA_Cluster[col].dtypes in ['float', 'int']:
        DescStats = EDA_Cluster.dropna(subset=[col])[[col]].describe()
        Q1 = round(DescStats.iloc[4,0], decimal_points)
        Q3 = round(DescStats.iloc[6,0], decimal_points)
        Description = f"{Q1} -- {Q3}"
        ClustersDescription.at[0,col] = Description
    
    
    except Exception as e:
      ClustersDescription.at[0,col] = 'Not available'
      print(f"** Error Exception: {e} - cluster {cluster}, variable {col}")
  
  ClustersDescription['Cluster'] = str(cluster)
  
  return ClustersDescription


* We define a custom function to plot the cluster distribution per variable (absolute and relative levels)
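
The two views can also be inspected as plain tables before plotting. The sketch below uses `pd.crosstab` on a toy DataFrame (the values are made up for illustration) to get the absolute counts and the relative percentage of each diagnosis level within a cluster.

```python
import pandas as pd

# Toy data (assumption): cluster labels and a binary diagnosis
toy = pd.DataFrame({
    "Clusters":  [0, 0, 0, 0, 1, 1, 2, 2, 2, 2],
    "diagnosis": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
})

# Absolute counts of each diagnosis level per cluster
counts = pd.crosstab(toy["Clusters"], toy["diagnosis"])

# Row-normalised view: percentage of each diagnosis level within a cluster
relative = pd.crosstab(toy["Clusters"], toy["diagnosis"], normalize="index") * 100

print(counts)
print(relative.round(1))
```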

In [None]:
import plotly.express as px


def cluster_distribution_per_variable(df, target):
    """
    The data should have 2 variables, the cluster predictions and
    the variable you want to analyze with, in this case we call "target".
    We use plotly express to create 2 plots:
    Cluster distribution across the target.
    Relative presence of the target level in each cluster.
    """
    df_bar_plot = df.groupby(['Clusters', target]).size().reset_index(name='Count')
    df_bar_plot.columns = ['Clusters', target, 'Count']
    df_bar_plot[target] = df_bar_plot[target].astype('object')

    print(f"Clusters distribution across {target} levels")
    fig = px.bar(df_bar_plot, x='Clusters', y='Count',
                 color=target, width=800, height=500)
    fig.update_layout(xaxis=dict(tickmode='array',
                      tickvals=df['Clusters'].unique()))
    fig.show(renderer='jupyterlab')

    df_relative = (df
                   .groupby(["Clusters", target])
                   .size()
                   .unstack(fill_value=0)
                   .apply(lambda x: 100 * x / x.sum(), axis=1)
                   .stack()
                   .reset_index(name='Relative Percentage (%)')
                   .sort_values(by=['Clusters', target])
                   )

    print(f"Relative Percentage (%) of {target} in each cluster")
    fig = px.line(df_relative, x='Clusters', y='Relative Percentage (%)',
                  color=target, width=800, height=500)
    fig.update_layout(xaxis=dict(tickmode='array',
                      tickvals=df['Clusters'].unique()))
    fig.update_traces(mode='markers+lines')
    fig.show(renderer='jupyterlab')

* We create a DataFrame that contains best features and Clusters Predictions since we want to analyse the patterns for each cluster

In [None]:
df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=best_features + ['Clusters'], axis=1)
print(df_cluster_profile.shape)
df_cluster_profile.tail(3)

* We want to analyse diagnosis levels also, so we load its data

In [None]:
df_diagnosis = pd.read_csv("outputs/datasets/collection/breast-cancer.csv").filter(['diagnosis'])
df_diagnosis['diagnosis'] = df_diagnosis['diagnosis'].astype('object')
df_diagnosis.head(3)

### Cluster profile based on the best features

In [None]:
pd.set_option('display.max_colwidth', None)
clusters_profile = pd.concat([df_cluster_profile, df_diagnosis], axis=1)
clusters_profile

In [None]:
pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(df=pd.concat([df_cluster_profile,df_diagnosis], axis=1),
                                          decimal_points=3)
clusters_profile

## Cluster analysis from each profile

Since we encoded Benign (B) as '0' and Malignant (M) as '1', the profiling above lets us describe each cluster as follows -

* **Cluster 0 Profile**

    - In Cluster 0, mean concavity values range from 0.02 to 0.057, indicating smooth tumor contours. The worst area measurements, ranging from 468.9 to 706.1, suggest smaller tumor sizes, while the worst fractal dimension falls between 0.069 and 0.082, indicating low structural chaos. Perimeter values of 79.79 to 99.08 indicate compact shapes. Notably, 91% of diagnoses in this cluster are benign, with only 9% malignant, aligning with the low-risk features.

* **Cluster 1 Profile**

    - Cluster 1 shows mean concavity from 0.127 to 0.22, indicating irregular tumor margins. The worst area is significantly larger, ranging from 1437.0 to 2009.0, while the worst fractal dimension (0.076 to 0.092) suggests moderate disorder. The worst perimeter ranges from 143.7 to 171.1, reflecting aggressive tumor growth. Notably, 100% of the diagnoses in this cluster are malignant (M), consistent with a high-risk profile.

* **Cluster 2 Profile**

    - In Cluster 2, mean concavity ranges from 0.104 to 0.186, indicating moderately irregular shapes. The worst area spans from 580.6 to 975.2, and the worst perimeter varies between 94.22 and 122.1, suggesting mid-range tumor size. The worst fractal dimension, from 0.096 to 0.12, is the highest among all clusters, showing chaotic tumor structures. Diagnoses are mixed, with 64% malignant (M) and 36% benign (B), indicating possible early-stage malignancies or borderline cases needing further examination.

### Clusters distribution across diagnosis levels & Relative Percentage of diagnosis in each cluster

In [None]:
df_cluster_vs_diagnosis = df_diagnosis.copy()
df_cluster_vs_diagnosis['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_diagnosis, target='diagnosis')

## Fit New Cluster Pipeline with most important features

* In order to reduce feature space, we will study the trade-off between the previous Cluster Pipeline (fitted with all variables) and Pipeline using the variables that are most important to define the clusters from the previous pipeline

In [None]:
best_features_pipeline_all_variables

### Define trade-off and metrics to compare new and previous Cluster Pipeline

To evaluate this trade-off we will -

1. Conduct an elbow method and silhouette analysis and check if the same number of clusters is suggested
2. Fit new cluster pipeline and compare if the predictions from this pipeline are "equivalent" to the predictions from the previous pipeline
3. Fit a classifier to explain cluster, and check if performance on Train and Test sets is similar to the previous pipeline
4. Check if the most important features for the classifier are the same from the previous pipeline
5. Compare if the cluster profile from both pipelines are "equivalent"

* If these criteria are met, we can adopt a cluster pipeline that uses only the features most important for defining the clusters in the previous pipeline.

### Subset data with the most relevant variables

In [None]:
df_reduced = df.filter(best_features_pipeline_all_variables)
print(df_reduced.shape)
df_reduced.head(3)

### Rewrite Cluster Pipeline

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KMeans(n_clusters=3, random_state=0)),
    ])

    return pipeline_base

PipelineCluster()

### Apply Elbow Method and Silhouette analysis

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_analysis = Pipeline(pipeline_cluster.steps[:-1])
df_analysis = pipeline_analysis.fit_transform(df_reduced)

print(df_analysis.shape,'\n', type(df_analysis))

#### Elbow Analysis

In [None]:
from yellowbrick.cluster import KElbowVisualizer
print("=== Average Elbow Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,11))
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()

From the visualizer we can deduce -

* The plot suggests 3 clusters
* Between 2 and 5, the values have a sharp and steep falloff.
* Outside this range, it does not fall off in a similar manner.

#### Silhouette Score

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer

n_cluster_start, n_cluster_stop = 2, 5

print("=== Average Silhouette Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(
    n_cluster_start, n_cluster_stop), metric='silhouette')
visualizer.fit(df_analysis)
visualizer.show()
plt.show()
print("\n")


for n_clusters in np.arange(start=n_cluster_start, stop=n_cluster_stop):

    print(f"=== Silhouette plot for {n_clusters} Clusters ===")
    visualizer = SilhouetteVisualizer(estimator=KMeans(n_clusters=n_clusters, random_state=0),
                                      colors='yellowbrick')
    visualizer.fit(df_analysis)
    visualizer.show()
    plt.show()
    print("\n")

**Optimal Value for clusters with best features -**

* Elbow Method says 3.
* The average Silhouette Score says 2, but the Silhouette Plot from 3 clusters is better than for 2 clusters.
* As a result, we will pick 3, since the Elbow Method and Silhouette Plot both support that decision.

## Fit New Cluster Pipeline

* We set X as our training set for the cluster. It is a copy of df_reduced

In [None]:
X = df_reduced.copy()
print(X.shape)
X.head(3)

### Fit Cluster pipeline

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

### Add cluster predictions to dataset

* We add a column "Clusters" (with the cluster pipeline predictions) to the dataset

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_
print(X.shape)
X.head(3)

In [None]:
print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar')
plt.show()

## Compare current cluster predictions to previous cluster predictions

We just fitted a new cluster pipeline and want to compare if its predictions are "equivalent" to the previous cluster.

* These are the predictions from the previous cluster pipeline - trained with all variables

In [None]:
cluster_predictions_with_all_variables

* And these are the predictions from current cluster pipeline, trained with ['concavity_mean', 'area_worst', 'fractal_dimension_worst', 'perimeter_worst']

In [None]:
cluster_predictions_with_best_features = X['Clusters'] 
cluster_predictions_with_best_features

#### We use a confusion matrix to evaluate if the predictions of both pipelines are "equivalent"

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(cluster_predictions_with_all_variables, cluster_predictions_with_best_features))

* We can see that the cluster labels line up between the two pipelines and the data spread is similar. The main difference is that cluster 2 is heavier on the Malignant side in the new pipeline.
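
Because KMeans numbers its clusters arbitrarily, a confusion matrix only reads cleanly when both pipelines happen to use the same numbering. As a complementary check, the sketch below uses `adjusted_rand_score`, which is not used elsewhere in this notebook and is swapped in here as a label-permutation-invariant measure of agreement. The label arrays are toy values, not this notebook's predictions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy labels (assumption): same grouping with different numbering, one point moved
labels_a = np.array([0, 0, 0, 1, 1, 2, 2, 2])
labels_b = np.array([2, 2, 2, 0, 0, 1, 1, 0])

# Cross-tabulation: rows are the first labelling, columns the second
table = pd.crosstab(pd.Series(labels_a, name="all variables"),
                    pd.Series(labels_b, name="best features"))
print(table)

# 1.0 means identical groupings regardless of how the clusters are numbered
ari = adjusted_rand_score(labels_a, labels_b)
print(f"adjusted Rand index: {ari:.3f}")

# A pure renumbering of the same grouping scores exactly 1.0
ari_renumbered = adjusted_rand_score(labels_a, (labels_a + 1) % 3)
```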

#### Fit a classifier, where the target is cluster predictions and features remaining variables

In [None]:
df_clf = X.copy()
print(df_clf.shape)
df_clf.head(3)

* Split Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_clf.drop(['Clusters'], axis=1),
    df_clf['Clusters'],
    test_size=0.2,
    random_state=0
)

print(X_train.shape, X_test.shape)

* Rewrite pipeline to explain clusters

In [None]:
def PipelineClf2ExplainClusters():
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        # we don't consider feature selection step, since we know which features to consider
        ("model", GradientBoostingClassifier(random_state=0)),

    ])
    return pipeline_base


PipelineClf2ExplainClusters()

### Fit a classifier, where the target is cluster labels and features remaining variables

* We create and fit a classifier pipeline to learn the feature importance when defining a cluster

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster.fit(X_train,y_train)

## Evaluate classifier performance on Train and Test Sets

In [None]:
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

In [None]:

print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))

## Assess Most Important Features

In [None]:
# since we don't have a feature selection step in this pipeline, best_features is the X_train columns
best_features = X_train.columns.to_list()

# create a DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': best_features,
    'Importance': pipeline_clf_cluster['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

## Cluster Analysis

* We create a DataFrame that contains the best features and Clusters Predictions: we want to analyse the patterns for each cluster.

In [None]:
df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=best_features + ['Clusters'], axis=1)
df_cluster_profile.head(3)

* We analyse diagnosis levels

In [None]:
df_diagnosis = pd.read_csv("outputs/datasets/collection/breast-cancer.csv").filter(['diagnosis'])
df_diagnosis['diagnosis'] = df_diagnosis['diagnosis'].astype('object')
df_diagnosis.head(3)

### Cluster profile on most important features

In [None]:
pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(df= pd.concat([df_cluster_profile,df_diagnosis], axis=1))
clusters_profile

### Clusters distribution across diagnosis levels & Relative Percentage of diagnosis in each cluster

In [None]:
df_cluster_vs_diagnosis = df_diagnosis.copy()
df_cluster_vs_diagnosis['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_diagnosis, target='diagnosis')

### Comparative analysis of the cluster profiles between the original pipeline and the new pipeline

The cluster profiles of the original pipeline and the new pipeline compare as follows:

* **Cluster 0 Comparison**

    - The new pipeline retains Cluster 0's low-risk characteristics, with slightly broader ranges for all features. The range for `perimeter_worst` is now 79.848 to 99.285 (vs. 79.79 to 99.08), while `concavity_mean` is 0.021 to 0.06 (vs. 0.02 to 0.057). The proportion of benign diagnoses dropped from 91% to 90%, with malignant cases rising from 9% to 10%. The consistency in `fractal_dimension_worst` (0.069 to 0.083 vs. 0.069 to 0.082) confirms reliable identification of benign tumors despite using fewer features.

* **Cluster 1 Comparison**

    - Cluster 1 continues to show 100% malignancy in both pipelines. The ranges for `concavity_mean` (0.133 to 0.22 vs. 0.127 to 0.22) and `fractal_dimension_worst` (0.076 to 0.093 vs. 0.076 to 0.092) remain nearly identical. The boundaries for `perimeter_worst` (145.2 to 171.325 vs. 143.7 to 171.1) and `area_worst` (1436.75 to 2009.25 vs. 1437.0 to 2009.0) are also nearly unchanged, indicating these features effectively capture this high-risk profile.

* **Cluster 2 Comparison**

    - The new pipeline shows a higher share of malignant diagnoses in Cluster 2 (72% vs. 64%), suggesting better capture of malignant characteristics. All ranges shifted slightly upward: `perimeter_worst` (99.175 to 122.95 vs. 94.22 to 122.1), `concavity_mean` (0.114 to 0.201 vs. 0.104 to 0.186), and `area_worst` (639.2 to 981.05 vs. 580.6 to 975.2). Notably, `fractal_dimension_worst` increased to 0.103 to 0.124 (vs. 0.096 to 0.12), reinforcing the link between high fractal dimension and malignancy. The benign proportion dropped from 36% to 28%, so fewer benign cases are grouped into this predominantly malignant cluster.

Overall, the four-feature model aligns well with the original 30-feature version, particularly for clear benign or malignant cases. Cluster 2's increased malignancy percentage suggests these features are effective in identifying high-risk tumors, while the broader ranges indicate slightly less precision in borderline cases. The consistency across Cluster 0 and Cluster 1 further confirms the reliability of the four-feature cluster pipeline.

---

# Conclusion and Pushing files to Repo

## Which pipeline should I deploy?

We recap the criteria we consider to evaluate the trade-off -

* *Conduct an elbow method and silhouette analysis and check if the same number of clusters is suggested.*
* *Fit a new cluster pipeline and compare if the predictions from this pipeline are "equivalent" to the predictions from the previous pipeline.*
* *Fit a classifier to explain cluster and check if performance on Train and Test sets is similar to the previous pipeline.*
* *Check if the most important features for the classifier are the same from the previous pipeline.*
* *Compare if the cluster profile from both pipelines is "equivalent".*

**To conclude** -

* Following a comprehensive comparative analysis of the existing and newly developed pipelines, I have determined that the newer pipeline, which incorporates only four features identified as optimal for cluster analysis, is more advantageous.

* This model retains diagnostic accuracy while enhancing computational efficiency.

* The streamlined approach ensures a clear distinction between benign and malignant clusters (indicated as 0 and 1), and it improves the detection of malignancies in borderline cases (designated as Cluster 2).

* Furthermore, the dimensionality reduction enhances its scalability for clinical implementation without compromising essential diagnostic performance.

In [None]:
pipeline_cluster # selected pipeline

## Push files to Repo

We will generate the following files

* Cluster Pipeline
* Train Set
* Feature importance plot
* Clusters Description
* Cluster Silhouette

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/cluster_analysis/{version}'

try:
    os.makedirs(name=file_path)
except Exception as e:
    print(e)

## Cluster pipeline

In [None]:
pipeline_cluster

In [None]:
joblib.dump(value=pipeline_cluster, filename=f"{file_path}/cluster_pipeline.pkl")
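
A saved pipeline is only useful if it can be loaded back and produce the same cluster assignments. The round-trip sketch below checks this on a stand-in pipeline fitted to synthetic data (`make_blobs`, an assumption) and writes to a temporary directory so it does not touch the `outputs` folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline (assumption): same step names, fitted on synthetic data
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=3, n_init=10, random_state=0)),
]).fit(X_demo)

# Save to a temporary folder, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cluster_pipeline.pkl")
    joblib.dump(value=pipe, filename=path)
    restored = joblib.load(path)

# The restored pipeline should assign every point to the same cluster
same_labels = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print(f"identical predictions after reload: {same_labels}")
```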

## Train Set

In [None]:
print(df_reduced.shape)
df_reduced.head(3)

In [None]:
df_reduced.to_csv(f"{file_path}/TrainSet.csv", index=False)

## Most important features plot

* These are the features that define a cluster

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance', figsize=(8,4))
plt.show()

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance', figsize=(8,4))
plt.savefig(f"{file_path}/features_define_cluster.png", bbox_inches='tight', dpi=150)

## Cluster Profile

In [None]:
clusters_profile

In [None]:
clusters_profile.to_csv(f"{file_path}/clusters_profile.csv")

## Cluster silhouette plot

In [None]:
visualizer = SilhouetteVisualizer(pipeline_cluster['model'], colors='yellowbrick')
visualizer.fit(df_analysis)
visualizer.show()
plt.show()

In [None]:
fig, axes = plt.subplots(figsize=(7,5))
visualizer = SilhouetteVisualizer(pipeline_cluster['model'], colors='yellowbrick', ax=axes)
visualizer.fit(df_analysis)

plt.savefig(f"{file_path}/clusters_silhouette.png", bbox_inches='tight',dpi=150)

---