# **Cluster**

## Objectives

* Fit and evaluate a cluster model to group similar data
* Analyse the clusters against the diagnostic (malignant or benign)
* Understand the profile for each cluster

## Inputs

* outputs/datasets/collection/breast-cancer.csv
* Instructions on which variables to use for data cleaning and feature engineering, which are found in their respective notebooks.

## Outputs

* Cluster Pipeline
* Train Set
* Most important features to define a cluster plot
* Clusters Profile Description
* Cluster Silhouette

## Additional Comments

* This notebook was written based on the guidelines provided in the walkthrough project 2: 'Churnometer'.
* This notebook relates to the Data Understanding step of the CRISP-DM methodology.
* This notebook and the following ones represent the learning outcome of the Code Institute Predictive Analytics and Machine Learning module.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
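
Rerunning the `os.chdir()` cell moves the working directory one level further up each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder called `jupyter_notebooks` (a hypothetical name, not stated in this notebook; the helper `resolve_project_root` is also illustrative):

```python
import os

# Hypothetical folder name; adjust it to wherever the notebooks are stored.
NOTEBOOKS_DIRNAME = "jupyter_notebooks"


def resolve_project_root(cwd, notebooks_dirname=NOTEBOOKS_DIRNAME):
    """Return the parent of cwd only when cwd is the notebooks folder."""
    normalised = os.path.normpath(cwd)
    if os.path.basename(normalised) == notebooks_dirname:
        return os.path.dirname(normalised)
    return cwd


# Safe to run more than once: the directory only changes on the first call.
project_root = resolve_project_root(os.getcwd())
os.chdir(project_root)
print(f"Current directory: {project_root}")
```

With this guard, running the cell a second time leaves the working directory where it is.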

# Load Data

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/breast-cancer.csv")
    .drop(['id', 'diagnosis'], axis=1)
    )
print(df.shape)
df.head(3)

# Cluster Pipeline with all data

## ML Pipeline for Cluster

* Our objective is to cluster similar data points and then analyse the clusters against the diagnosis (malignant or benign).

* As a result, we will use only the thirty features (all variables except `id` and `diagnosis`) to fit the cluster pipeline.

In [None]:
from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### PCA
from sklearn.decomposition import PCA

### ML algorithm
from sklearn.cluster import KMeans

def PipelineCluster():
    pipeline_base = Pipeline([

        ("scaler", StandardScaler()),
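        # Placeholder values: both are tuned in the sections below.
        # n_components=50 exceeds the 30 features, so this version is not fitted as-is.
        ("PCA", PCA(n_components=50, random_state=0)),
        ("model", KMeans(n_clusters=50, random_state=0)),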
    ])

    return pipeline_base

PipelineCluster()

## Principal Component Analysis (PCA)

* We want to find the most suitable n_components, then update the value in the ML Pipeline for Cluster

* To do that, we create an object based on PipelineCluster(), then remove the last two steps (PCA and model): .steps[:-2]

* The resulting pipeline_pca only scales the data, so we can apply PCA afterwards

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_pca = Pipeline(pipeline_cluster.steps[:-2])
df_pca = pipeline_pca.fit_transform(df)

print(df_pca.shape,'\n', type(df_pca))
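
As a small illustration of what slicing `.steps[:-2]` leaves behind, the sketch below rebuilds a three-step pipeline on synthetic data (the array `X_demo` and the small `n_components` and `n_clusters` values are made up for the example) and checks that only the scaler remains.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature table (100 rows, 5 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 5))

full_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("PCA", PCA(n_components=2, random_state=0)),
    ("model", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# Dropping the last two steps leaves only the scaler.
scaler_only = Pipeline(full_pipeline.steps[:-2])
step_names = [name for name, _ in scaler_only.steps]
X_scaled = scaler_only.fit_transform(X_demo)

print(step_names)
print(X_scaled.shape)
# After standard scaling each column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

The output keeps the original number of columns, because no dimensionality reduction has happened yet.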

* Next, we apply PCA separately to the scaled data

In [None]:
%matplotlib inline
# This line is used to display matplotlib plots inline within the Jupyter Notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

n_components = 30


def pca_components_analysis(df_pca, n_components):
    pca = PCA(n_components=n_components).fit(df_pca)
    x_PCA = pca.transform(df_pca)  # array with transformed PCA

    ComponentsList = ["Component " + str(number)
                    for number in range(n_components)]
    dfExplVarRatio = pd.DataFrame(
        data=np.round(100 * pca.explained_variance_ratio_, 3),
        index=ComponentsList,
        columns=['Explained Variance Ratio (%)'])

    dfExplVarRatio['Accumulated Variance'] = dfExplVarRatio['Explained Variance Ratio (%)'].cumsum(
    )

    PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum(
    )

    print(
        f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data variance \n")
    plt.figure(figsize=(9, 6))
    sns.lineplot(data=dfExplVarRatio,  marker="o")
    plt.xticks(rotation=90)
    plt.yticks(np.arange(0, 110, 10))
    plt.show()


pca_components_analysis(df_pca=df_pca, n_components=n_components)

* Judging by the figure above, seven components explain slightly more than 90% of the variance in the data
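
Reading the number of components off the plot can also be done programmatically. The sketch below is one possible approach on synthetic data (`X_demo` and the 90% threshold are illustrative assumptions): it accumulates `explained_variance_ratio_` and picks the first count that reaches the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features built from 3 latent signals plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_scaled)  # keep every component

cumulative = np.cumsum(pca.explained_variance_ratio_) * 100
threshold = 90.0  # target share of variance, in percent

# First position where the cumulative curve reaches the threshold (1-based count).
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(f"{n_keep} components explain {cumulative[n_keep - 1]:.2f}% of the variance")
```

The same calculation applied to the scaled notebook data would give the smallest component count that reaches the chosen threshold.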

In [None]:
n_components = 7

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca)

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,3),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data variance \n")
print(dfExplVarRatio)

* We rewrite the PipelineCluster(), updating n_components to 7

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([

        ("scaler", StandardScaler()),
        ("PCA", PCA(n_components=7, random_state=0)),
        ("model", KMeans(n_clusters=50, random_state=0)),
    ])

    return pipeline_base

PipelineCluster()

## Elbow Method and Silhouette Score

* We will combine two techniques (Elbow Method and Silhouette Score) to find the optimal number of clusters

* For both techniques, we transform the data up to the point where it would reach the model.
  * Therefore we remove the last step (.steps[:-1]) and call fit_transform on pipeline_analysis with the data.

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_analysis = Pipeline(pipeline_cluster.steps[:-1])
df_analysis = pipeline_analysis.fit_transform(df)

print(df_analysis.shape,'\n', type(df_analysis))

### Elbow Method

* We use KElbowVisualizer() from Yellowbrick to implement the Elbow Method

* We pass in as arguments the algorithm we want (KMeans) and the range for the number of clusters we want to try

In [None]:
from yellowbrick.cluster import KElbowVisualizer
import warnings

warnings.filterwarnings("ignore", message="findfont.*") # Ignore font warnings
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,11)) # 11 is not inclusive
visualizer.fit(df_analysis)
visualizer.show()

From the visualizer we can deduce:

* The plot suggests three clusters
* Between 2 and 5 clusters, the score falls off steeply.
* Outside this range, the decline is much more gradual.
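
The curve behind an elbow plot can also be computed by hand. The sketch below is an illustration on synthetic blobs (not the notebook data) and uses scikit-learn's `inertia_`, the within-cluster sum of squared distances, which is assumed here to be closely related to the score Yellowbrick plots by default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)  # sum of squared distances to the closest centre

# The "elbow" is the k after which extra clusters stop reducing inertia by much.
for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  inertia={inertia:10.2f}")
```

With a single cluster the inertia equals the total squared distance of all points to the overall mean, which gives a useful sanity check for the first value.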

### Silhouette score

* The Silhouette score lets us interpret and validate the consistency within clusters. For each data point, it is based on the mean intra-cluster distance and the mean nearest-cluster distance.

The silhouette score range is from -1 to +1.

* “+1” means that a clustered data point sits in a dense cluster and is properly separated from other clusters.
* A score close to 0 means the clustered data point overlaps with another cluster.
* A negative score means that the clustered data point may be in the wrong cluster; it may belong to another cluster.
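
For a single point, with `a` the mean distance to the other points in its own cluster and `b` the mean distance to the points in the nearest other cluster, the silhouette value is `(b - a) / max(a, b)`. The sketch below, on synthetic blobs rather than the notebook data, uses scikit-learn's `silhouette_samples` and `silhouette_score` to show the per-point values and their average.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with three groups, clustered with KMeans.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

per_point = silhouette_samples(X_demo, labels)  # one value per data point
average = silhouette_score(X_demo, labels)      # mean of the per-point values

print(f"average silhouette: {average:.3f}")
print(f"points with a negative score: {(per_point < 0).sum()}")
```

The average is the single number reported per candidate cluster count, while the per-point values are what a silhouette plot draws for each cluster.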

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer

print("=== Average Silhouette Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(2,7), metric='silhouette')
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()
print("\n")

for n_clusters in np.arange(start=2,stop=11):

    print(f"=== Silhouette plot for {n_clusters} Clusters ===")
    visualizer = SilhouetteVisualizer(estimator = KMeans(n_clusters=n_clusters, random_state=0),
                                        colors = 'yellowbrick')
    visualizer.fit(df_analysis)
    visualizer.show()
    print("\n")

**Optimal number of clusters:**

* The Elbow Method says three.
* The average Silhouette Score says two, but the Silhouette Plot for three clusters is better than the one for two clusters.
* As a result, we will pick three, since the Elbow Method and Silhouette Plot both support that decision.

* **We rewrite the PipelineCluster(), updating n_clusters to 3**

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([

        ("scaler", StandardScaler()),
        ("PCA", PCA(n_components=7, random_state=0)),
        ("model", KMeans(n_clusters=3, random_state=0)),
    ])

    return pipeline_base

PipelineCluster()

## Fit Cluster Pipeline

* Quick recap of our data for training cluster pipeline

In [None]:
X = df.copy()
print(X.shape)
X.head(3)

* Fit Cluster pipeline

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

## Add cluster predictions to dataset

* **We add a column "Clusters" (with the Cluster Pipeline predictions) to X**

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_
print(X.shape)
X.head(3)

* We check how frequently each cluster occurs

In [None]:
print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar')
plt.show()
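
One objective of this notebook is to analyse the clusters against the diagnosis. Because `diagnosis` was dropped when the data was loaded, it would have to be read back in before such a comparison. A minimal sketch of the comparison itself, using `pd.crosstab` on made-up labels (the values below are illustrative only, not results from the dataset):

```python
import pandas as pd

# Made-up labels for illustration only; not results from the dataset.
toy = pd.DataFrame({
    "Clusters": [0, 0, 1, 1, 2, 2, 2, 0],
    "diagnosis": ["B", "B", "M", "M", "B", "M", "B", "B"],
})

# Absolute counts of each diagnosis per cluster.
counts = pd.crosstab(toy["Clusters"], toy["diagnosis"])

# Row-wise shares: how each cluster splits between benign and malignant.
shares = pd.crosstab(toy["Clusters"], toy["diagnosis"], normalize="index").round(2)

print(counts)
print(shares)
```

A cluster whose row is dominated by one diagnosis is a cluster that lines up well with that diagnosis.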

---

# Load and Inspect Kaggle Data

### Convert 'diagnosis' values

* We will convert `diagnosis` values from `M` and `B` to `1` and `0` respectively. The saved dataset will then already hold the target variable in a numeric format, so any ML model can consume it directly without extra preprocessing later.
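
A minimal sketch of that conversion with pandas `Series.map`, shown on a toy column rather than the real dataset:

```python
import pandas as pd

# Toy column standing in for the real `diagnosis` values.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Map malignant to 1 and benign to 0; any other value would become NaN.
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})
print(toy["diagnosis"].tolist())
```

Checking for missing values after the mapping is a quick way to spot unexpected labels in the column.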

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.