# Scikit-learn - Unit 08 - Cluster

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand how to group similar data using the KMeans clustering algorithm
* Explain cluster profiles



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 08 - Cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Welcome to the world of unsupervised learning! It differs from supervised learning in one key aspect: there is no target variable. **The algorithm is left on its own to look for patterns in the data**
* The ML task we will study is called **cluster**, a type of unsupervised learning that looks to group the data by similarity
* The workflow for cluster is also slightly different from the one we used for regression and classification tasks. You will still create pipeline steps, fit the pipeline to your data and evaluate the pipeline, but each of these is done in a different way




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Scikit-learn offers multiple clustering algorithms. You may go to this [link](https://scikit-learn.org/stable/modules/clustering.html) and look for algorithms to learn and use over your career.
* We will study **KMeans** in this course, since it is a good starting point and does not add much complexity to what we have studied so far. In case you want to refresh the concepts of KMeans, you may go back to Module 2 - ML Essentials

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Introduction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In practical terms, we don't know for sure how good a cluster model's performance is, because there is no target variable to compare the predictions against
* The exception is when you gather separate data and find a way to discover the actual group of each data point, so you can compare it to the cluster prediction. This is not trivial in practice, and in this course we will not consider this alternative.
* That being said, you will not know for sure, for example, whether a pipeline with 4 clusters is in reality better than a pipeline with 7 clusters. However, there are approaches you can use to frame the project and reach more conclusive results that help you understand the patterns in your data.

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Expectation and Pipeline Objective

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> This notebook is dense and covers many concepts. But always remember that its core concept is simple:
* **Fit a Cluster Pipeline that groups similar data and explain each Cluster profile**

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Major ideas

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
"> The major ideas we consider in this notebook are:
* 1: **Create Cluster pipeline**. Before fitting the pipeline, we need to define the number of PCA components and the number of clusters.
* 2: **Fit the Cluster Pipeline**
* 3: **Understand the Cluster profiles**. We will use a classifier, where the target is the cluster prediction, to identify the most important variables that define a cluster
* 4: **Cluster analysis**: explain each cluster profile in terms of the most important variables. In addition, in case your dataset has a separate variable you want to study and that you didn't include in the cluster pipeline, you can study how this variable correlates to the clusters. In our case, we will analyze the clusters against the diagnostic (malignant or benign)

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Practical Workflow

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> However, the practical workflow we will code is longer, and it is based on the major ideas we saw above. In particular, in this notebook we will:
* 1 - **Create a Cluster Pipeline** that contains the following steps: data cleaning, feature engineering, feature scaling, PCA and Cluster Model (KMeans). Note: this pipeline has parameters for PCA and Cluster that we will need to update throughout the notebook.
* 2 - Conduct an analysis to determine the number of components in a PCA. We will update that in the Cluster Pipeline
* 3 - Apply the **Elbow Method and evaluate the Silhouette score** to define the number of clusters in the Cluster Pipeline
* 4 - **Fit** the Cluster pipeline
* 5 - Add cluster predictions to the data
* 6 - Create a separate **Classifier Pipeline**, where the target variable is the cluster predictions and the features are the remaining variables
* 7 - Fit this classifier, evaluate its performance and assess the most important features. These features are the most **important features to define the cluster predictions**
* 8 - **Cluster analysis**: explain each cluster profile in terms of the most important features from the previous step. In addition, in case your dataset has a separate variable you want to study and that you didn't include in the cluster pipeline, you can study how this variable correlates to the clusters
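
The last step above, studying a variable that was left out of the cluster pipeline, can be sketched with a cross-tabulation. This is only an illustration on hypothetical toy data: the `toy` DataFrame and its values are made up, and the notebook's own cluster analysis may use a different approach.

```python
import pandas as pd

# Hypothetical toy data: cluster labels from a fitted model and a
# variable that was deliberately left out of the cluster pipeline.
toy = pd.DataFrame({
    "Clusters":  [0, 0, 0, 1, 1, 1, 2, 2],
    "diagnosis": ["benign", "benign", "malignant",
                  "malignant", "malignant", "malignant",
                  "benign", "benign"],
})

# Row-normalised cross-tabulation: share of each diagnosis within each cluster.
cluster_vs_diagnosis = pd.crosstab(
    toy["Clusters"], toy["diagnosis"], normalize="index"
).round(2)
print(cluster_vs_diagnosis)
```

Reading the table row by row shows whether a cluster is dominated by one diagnosis, which is the kind of insight the cluster analysis step looks for.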

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load Data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's load the breast cancer data from sklearn. Each record describes a breast mass sample and has a diagnosis indicating whether it is malignant or benign, where 0 is malignant and 1 is benign.
* **Our objective is to cluster similar data points and then analyze the clusters against the diagnostic (malignant or benign)**
  * As a result, **we will use only the 30 features** (all variables but the diagnostic) to fit the cluster pipeline.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We know in advance that this dataset has only numerical features and no missing data.
* We are adding missing data (`np.nan`) on purpose in the first 10 rows of 'mean smoothness' using `.iloc[:10,4]`, just to better simulate the datasets you will likely face in the workplace

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data,columns=data.feature_names)
df.iloc[:10,4] = np.nan  # np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0
print(df.shape)
df.head()
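
As a quick sanity check, the sketch below rebuilds the same DataFrame under a separate, hypothetical name (`df_check`) and confirms which column received the missing values. It also prints the target names, which show the order behind the 0 and 1 codes.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df_check = pd.DataFrame(data.data, columns=data.feature_names)
df_check.iloc[:10, 4] = np.nan  # same injection as in the cell above

# Count missing values per column and keep only the columns that have any.
missing = df_check.isna().sum()
missing = missing[missing > 0]
print(missing)

# Position 0 and 1 in this array correspond to target codes 0 and 1.
print(data.target_names)
```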

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> ML Pipeline for Cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Cluster Pipeline is made of data cleaning (median imputation on `mean smoothness`), feature scaling, PCA and model (KMeans) steps
* Note: the `n_components` of PCA and `n_clusters` of KMeans will be updated afterwards. For now we leave an arbitrary value of 50 (it could be any number, since the PCA and model steps are not fitted yet)

from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.imputation import MeanMedianImputer

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### PCA
from sklearn.decomposition import PCA

### ML algorithm
from sklearn.cluster import KMeans

def PipelineCluster():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("PCA",  PCA(n_components=50, random_state=0)), 

      ("model", KMeans(n_clusters=50, random_state=0)  ), 
  ])
  return pipeline_base

PipelineCluster()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Principal Component Analysis (PCA)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Principal Component Analysis, or PCA, is a transformation of your data that finds the directions (components), built as combinations of the original features, that explain the most variance in your data.
* PCA reduces the number of variables while preserving as much information as possible. The transformation creates a set of components, where each component contains relevant information from the original variables.
* **This is useful in a Cluster pipeline since it reduces the feature space and provides the model with data in a better format for grouping similar data**.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in finding the most suitable `n_components`, then we update the value in the ML Pipeline for Cluster
* To do that, we create an object based on `PipelineCluster()`, then remove the last 2 steps (PCA and model) with `.steps[:-2]`
* As a result, `pipeline_pca` cleans and scales the data, so we can apply PCA afterwards

pipeline_cluster = PipelineCluster()
pipeline_pca = Pipeline(pipeline_cluster.steps[:-2])
df_pca = pipeline_pca.fit_transform(df)

print(df_pca.shape,'\n', type(df_pca))

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next we apply PCA separately to the scaled data, similarly to what we did in the previous unit notebook.
* We want to define the number of components for the PCA step. We first set the number of components to the number of columns in the scaled data, in this case 30. That is useful for understanding the explained variance of each component.
* The interpretation is similar to the previous notebook
  * The first three components are more significant than the others and together account for 72.47% of the data variance. That is okay. It is a good sign when a few components, like 3 or 4, capture more than 80% of your data variance. So you could select three as the number of components, which is good progress since you had 30 features and now have three components.
  * But in this exercise, for learning purposes, we will aim for more than 90% of the data variance and use seven components, since we get more data variance with a relatively small increase in components.

n_components = 30 # set the number of components as all columns in the data

pca = PCA(n_components=n_components).fit(df_pca)  # set PCA object and fit to the data
x_PCA = pca.transform(df_pca) # array with transformed PCA


# the PCA object has .explained_variance_ratio_ attribute, which tells 
# how much information (variance) each component has 
# We store that to a DataFrame relating each component to its variance explanation
ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,3),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

# prints how much of the dataset these components explain (naturally in this case will be 100%)
PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the next cell we just copied the code from the cell above and changed `n_components` to 7.
  * With 7 components we achieved a bit more than 90% of data variance

n_components = 7

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca)

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,3),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)
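
Instead of reading the table by eye, the number of components needed to reach a target variance can be derived from the cumulative explained variance. The sketch below is one possible way to do it, not the notebook's own method. It assumes a 90% target and, for brevity, scales the raw dataset without the injected missing values, so its figures may differ slightly from the ones above. The names `X_scaled`, `pca_full`, `threshold` and `n_needed` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the 30 raw features (no missing values injected here, for brevity).
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with as many components as there are columns.
pca_full = PCA(n_components=X_scaled.shape[1]).fit(X_scaled)

# Running total of the explained variance ratio, component by component.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.90  # assumed target share of variance
# Position of the first cumulative value that reaches the threshold, plus one.
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(f"{n_needed} components explain "
      f"{100 * cumulative[n_needed - 1]:.2f}% of the variance")
```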

Next we rewrite `PipelineCluster()`, updating `n_components` to 7
* Note: in a real project, you don't necessarily have to rewrite the pipeline in the cell below. You could scroll up to the cell where we previously defined the pipeline and update it there. But for learning purposes, we rewrite the pipeline in the cell below

def PipelineCluster():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("PCA",  PCA(n_components=7, random_state=0)),  ##### we update the n_components to 7

      ("model", KMeans(n_clusters=30, random_state=0)  ), 
  ])
  return pipeline_base

PipelineCluster()
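
Another option, shown here only as a sketch, is to update a parameter on an existing pipeline object with `set_params()` instead of redefining the function. To keep the example dependent on scikit-learn alone, `SimpleImputer` stands in for `MeanMedianImputer`; that swap and the name `pipeline_sketch` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# SimpleImputer is a stand-in for feature_engine's MeanMedianImputer,
# used only so that this sketch needs nothing beyond scikit-learn.
pipeline_sketch = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("PCA", PCA(n_components=50, random_state=0)),
    ("model", KMeans(n_clusters=50, random_state=0)),
])

# Parameters of a step are addressed as "<step name>__<parameter>".
pipeline_sketch.set_params(PCA__n_components=7)

print(pipeline_sketch["PCA"].n_components)
```

The same notation would work for the model step, for example `model__n_clusters`, once the number of clusters is decided.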

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Elbow Method and Silhouette Score

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are now interested in finding the most suitable `n_clusters`, then we update the value in the ML Pipeline for Cluster
* But how do you know the optimal number of clusters for your data?
* **We will combine 2 techniques (Elbow Method and Silhouette Score) to find the optimal value for the number of clusters**. Both will suggest values and we will use them in conjunction to decide on the optimal number of clusters


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 We will first explain and apply Elbow. Then we will explain and apply Silhouette score.





<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">  There is a technique called the Elbow Method. According to the [Yellowbrick documentation](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html) (Yellowbrick is an ML visualization library), the elbow method runs k-means clustering on the dataset for a range of values for k and then for each value of k computes an average score for all clusters. By default, the distortion score is computed: the sum of squared distances from each point to its assigned center

* That is plotted as a line chart, with the number of clusters on the x axis and the distortion score on the y axis. The line chart resembles an arm, and you pick the point of inflection (the elbow) as the optimal number of clusters.
  * According to [Wikipedia](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), using the "elbow" or "knee of a curve" as a cutoff point is a common heuristic in mathematical optimization to choose a point where diminishing returns are no longer worth the additional cost. In clustering, this means one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
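
The distortion score described above can be reproduced without a visualization library, since a fitted `KMeans` exposes it as `inertia_`. The sketch below uses synthetic data with three well-separated groups; the blob centres, the names `X_blobs` and `distortion`, and the range of k are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (assumed for illustration).
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.0,
    random_state=0,
)

distortion = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    # inertia_ is the sum of squared distances of samples to their closest centre.
    distortion[k] = model.inertia_

for k, score in distortion.items():
    print(f"k={k:2d}  distortion={score:10.1f}")
```

On data like this the score drops sharply up to the true number of groups and flattens afterwards, which is the "elbow" the heuristic looks for.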

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Prepare data for analysis
  * For the Elbow Method and Silhouette score, you need to transform your data up to the point where it would hit the model.
    * Therefore we remove the last step (`.steps[:-1]`) and fit_transform `pipeline_analysis` to the data
    * Note the data has 7 columns, since it passed through the PCA step

pipeline_cluster = PipelineCluster()
pipeline_analysis = Pipeline(pipeline_cluster.steps[:-1])
df_analysis = pipeline_analysis.fit_transform(df)

print(df_analysis.shape,'\n', type(df_analysis))

Next we use [`KElbowVisualizer()`](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html) from Yellowbrick to implement the Elbow Method
* We pass the algorithm we want (KMeans) and the range of number of clusters we want to try, in this case from 1 to 10, so we pass a tuple of (1,11), where the last value is not inclusive.
* There is no fixed recipe here; you have to try a few ranges of number of clusters. Typically you may initially try a range of 1 to 10, or 1 to 15, and refine accordingly.
* Then we fit this object to `df_analysis` (the data that passed through data cleaning, feature scaling and PCA)
  * **Note the plot suggests 3 clusters!**

from yellowbrick.cluster import KElbowVisualizer

visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,11))
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
"> There is also the **Silhouette score**, which helps us to define the number of clusters. You can go back to Module 2, where we presented the concept in the performance metric video.

* The silhouette score **interprets and validates the consistency within clusters**. It is based on the mean intra-cluster distance and the mean nearest-cluster distance for each data point.
  * The mean intra-cluster distance is the average distance between the data point and all other data points in the same cluster. Essentially, it tells how close the data point is to the rest of its own cluster.
  * The mean nearest-cluster distance, on the other hand, is the average distance between the data point and all data points of the nearest neighbouring cluster. In other words, it tells how far the data point is from the closest cluster it does not belong to.
 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The silhouette score ranges from -1 to +1, where:
  * A score close to +1 means the data point sits in a dense cluster that is properly separated from other clusters.
  * A score close to 0 means the data point is on the boundary between its cluster and another one, so the clusters overlap.
  * A negative score means that the data point may be in the wrong cluster; it may belong to another cluster.
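
To make the definition concrete, the sketch below computes the silhouette value of each data point by hand as `(b - a) / max(a, b)`, where `a` is the mean intra-cluster distance and `b` is the mean nearest-cluster distance. The helper `silhouette_values` and the toy data are hypothetical, written only for illustration; in practice you would rely on a library implementation.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette value s = (b - a) / max(a, b) for every data point."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # a cluster with a single point keeps a score of 0
        a = dist[i, same].mean()  # mean intra-cluster distance
        # Mean distance to each other cluster; keep the nearest one.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two tight groups that are far apart.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

good = silhouette_values(X_toy, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_values(X_toy, [0, 1, 0, 1])   # deliberately wrong grouping
print(good.round(3))
print(bad.round(3))
```

The sensible grouping gives values close to +1, while the deliberately wrong grouping gives negative values, matching the interpretation listed above.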




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 The silhouette score for each data point allows you to build a Silhouette plot, showing the silhouette score of every data point across all clusters.
* You can then calculate an **average silhouette score** for the plot. This average helps to (1) compare models with different numbers of clusters and (2) define a performance metric for a given cluster model. A rule of thumb in the industry is that an average silhouette score greater than 0.5 means the clusters are nicely separated. There may be cases where the optimal number of clusters for your dataset leads to an average lower than 0.5. This is also fine; it just means that you found the best way to cluster that particular dataset, even though it doesn't have a great silhouette score.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To evaluate the clusters' silhouette we need the data in the format it has just before it hits the model. We have done this already, and the result is stored in `df_analysis`
* We will use [SilhouetteVisualizer](https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html) and [KElbowVisualizer](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html) from Yellowbrick


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The code logic has 2 parts
* **First you calculate the average silhouette score for different numbers of clusters** using `KElbowVisualizer()`, setting `KMeans()` as the algorithm, the range of number of clusters from 2 to 11, where the last value is not inclusive (it doesn't accept 1 cluster), and `metric='silhouette'`. Then you fit it to the transformed data (`df_analysis`) and show the results
  * You will evaluate which number of clusters produces the highest average silhouette score
* Then you iterate over the **silhouette plot for models with different numbers of clusters**, in this case from 2 to 10 (`np.arange(start=2,stop=11)`). You use `SilhouetteVisualizer()` and set the estimator as `KMeans()`. Then you fit it to the transformed data (`df_analysis`) and show the results
  * You will evaluate whether the silhouette values vary too much within a cluster, whether there are too many values lower than the average silhouette, and whether there are too many negative silhouette values.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note the following:
* Average Silhouette Score: the best result is with 2 clusters, but 3 is not far behind. We will pay more attention to the Silhouette Plot for these two options.
* Silhouette Plot:
  * 2 clusters: one cluster is dominant (the blue), since it is more frequent and the majority of its values are greater than the average score (the red dotted line). The other cluster (green) has a few data points with negative scores (these may belong to another cluster) and almost no data point above the average score.

  * 3 clusters: one cluster is dominant (the blue), since it is more frequent and the majority of its values are greater than the average score (the red dotted line). The other 2 clusters look to have similar frequency. The blue cluster has few data points greater than the average and few negative silhouette values. However, the green cluster has a few data points with negative scores (these may belong to another cluster) and almost no data point above the average score.

from yellowbrick.cluster import SilhouetteVisualizer

print("=== Average Silhouette Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(2,11), metric='silhouette')  # 2 to 10 clusters, same range as the loop below
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()
print("\n")

for n_clusters in np.arange(start=2,stop=11):
  
  print(f"=== Silhouette plot for {n_clusters} Clusters ===")
  visualizer = SilhouetteVisualizer(estimator = KMeans(n_clusters=n_clusters, random_state=0),
                                    colors = 'yellowbrick')
  visualizer.fit(df_analysis)
  visualizer.show()
  plt.show()
  print("\n")
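
If Yellowbrick is not available, the average silhouette score per number of clusters can also be computed directly with `sklearn.metrics.silhouette_score`. The sketch below does so on synthetic data with three well-separated groups; the data and the names `X_blobs`, `avg_silhouette` and `best_k` are assumptions for illustration, and it does not replace the per-cluster reading of the silhouette plots.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (assumed for illustration).
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.0,
    random_state=0,
)

avg_silhouette = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs)
    avg_silhouette[k] = silhouette_score(X_blobs, labels)

# Number of clusters with the highest average silhouette score.
best_k = max(avg_silhouette, key=avg_silhouette.get)

for k, score in avg_silhouette.items():
    print(f"k={k}  average silhouette={score:.3f}")
print(f"Highest average silhouette at k={best_k}")
```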

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> What is the number of clusters then?
* The Elbow Method says 3
* The Average Silhouette Score says 2, but the Silhouette Plot for 3 clusters is better than the one for 2 clusters.
* As a result, we will pick 3, since the Elbow Method and the Silhouette Plot support that decision

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next we rewrite `PipelineCluster()`, updating `n_clusters` to 3
* Note: in a real project, you don't necessarily have to rewrite the pipeline in the cell below. You could scroll up to the cell where we previously defined the pipeline and update it there. But for learning purposes, we rewrite the pipeline in the cell below

def PipelineCluster():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("PCA",  PCA(n_components=7, random_state=0)), 

      ("model", KMeans(n_clusters=3, random_state=0)  ),  ##### update n_clusters to 3 
  ])
  return pipeline_base

PipelineCluster()

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> Notice the additional effort and steps we take in clustering compared to the workflow we have for Classification and Regression. Only now are we ready to train the pipeline.
  * Note we have only 1 pipeline and we are not doing hyperparameter optimization when training the model
  * We did a kind of hyperparameter optimization in the previous sections, since we tried different options for the number of PCA components and the number of clusters for KMeans().
  * Let's fit the pipeline then!

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit Cluster Pipeline

We don't need to split our data. All available data is used for training. 
* For training purposes, we create a DataFrame `X` that is a copy of your data.

X = df.copy()
print(X.shape)
X.head(3)

Then we fit the Cluster pipeline to the training data (`X`)

pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Add cluster predictions to dataset

We add a column "`Clusters`" (with the Cluster Pipeline predictions) to X
* Scroll to the right and check the last variable. It holds the cluster prediction for each data point of your dataset
* The model predictions are stored in the attribute `.labels_`
* Since the model is in a pipeline, you access it using the notation `pipeline_cluster['model'].labels_`

X['Clusters'] = pipeline_cluster['model'].labels_
print(X.shape)
X.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next we are interested in the cluster frequency
* Note there are 3 clusters, and the counting starts from 0.
* The majority of the data (63%) belongs to cluster number 2, and the remaining data points are split equally between the other 2 clusters.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">  **But what is the profile of each cluster?**


print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar')
plt.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit a classifier, where the target is the cluster predictions and the features are the remaining variables

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are at a point where we have cluster predictions made by the cluster pipeline, but we can't interpret the clusters yet.


We are **interested in learning the profile of each cluster**, based on the most relevant dataset variables.
* Our new dataset has `Clusters` as a variable. We are using a technique where `Clusters` will be the **target for a classifier**, and the remaining variables will be the features for that target.
  * We will assume that the most relevant features for this classifier are the most relevant variables that define a cluster.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 To do that, we will use the traditional workflow we covered in the previous notebooks: 
 * 1 - split the data into train and test sets
 * 2 - create the classifier pipeline
 * 3 - fit the classifier to training data
 * 4 - evaluate pipeline performance
 * 5 - and (most important for our analysis) **assess feature importance**.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png "> Note: If you need, pause for a second and reflect on which step of the "Major ideas" section we are at. That may help you to better understand our goal, where we are now and the next step to move on to.


We start by copying `X` to a DataFrame `df_clf`

df_clf = X.copy()
print(df_clf.shape)
df_clf.head(3)

Next we split train and test sets, where the target variable is `'Clusters'`

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['Clusters'],axis=1),
                                    df_clf['Clusters'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Create a classifier pipeline 
* We should use the **data cleaning and feature engineering** steps from the Cluster Pipeline.
* Then we add the conventional steps for supervised learning: **feature scaling, feature selection and modelling**
* We want a model that typically offers good results and whose feature importance can be assessed with `.feature_importances_`, which tree-based algorithms provide. We are using GradientBoostingClassifier since it typically performs well and is fast to train.
  * We could conduct a detailed hyperparameter optimization to find the best tree-based model, but here we are most interested in finding a pipeline that can explain the relationship between the target (Clusters) and the features, so we can assess the feature importance afterwards.

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithm
from sklearn.ensemble import GradientBoostingClassifier 

def PipelineClf2ExplainClusters():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("feat_selection", SelectFromModel(GradientBoostingClassifier(random_state=0)) ), 

      ("model",  GradientBoostingClassifier(random_state=0) ), 
  ])
  return pipeline_base

  
PipelineClf2ExplainClusters()

We fit the classifier to the training data
* Note again that we are not doing a detailed hyperparameter optimization here. This classification pipeline is useful only to identify the features that look more important for predicting the Clusters. We are not deploying this model, so fitting with default hyperparameters is typically fine for this task

pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster.fit(X_train, y_train)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Evaluate classifier performance on Train and Test Sets

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 In theory, we expect good performance: the Clusters were generated by KMeans, which assigns labels following a consistent logic, so the classifier (GradientBoosting) should be able to map the relationship between the features and the Clusters. Let's check that.
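
The idea can be sketched on synthetic data: KMeans labels are a deterministic function of the features, so a classifier trained on the same features should recover them well. Everything in this sketch (the `make_blobs` data, the variable names, the choice of 3 clusters) is an assumption for illustration and is not the notebook's data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic data, used for illustration only
blobs_X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

# step 1: KMeans assigns a cluster label to every row
blob_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(blobs_X)

# step 2: a classifier tries to predict those labels from the same features
bX_train, bX_test, by_train, by_test = train_test_split(
    blobs_X, blob_clusters, test_size=0.2, random_state=0)
blob_clf = GradientBoostingClassifier(random_state=0).fit(bX_train, by_train)

test_accuracy = accuracy_score(by_test, blob_clf.predict(bX_test))
print(f"Test accuracy when predicting KMeans labels: {test_accuracy:.2f}")
```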



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Then evaluate the performance on the Train set using `classification_report()`
* It looks like the classifier learned the relationships well enough to ace all predictions in the train set

from sklearn.metrics import classification_report
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> And finally we evaluate on the test set
* It looks like the classifier learned the relationship between the target and the features well enough to generalize to the test set, since the performance is not far from that on the train set.

print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))
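
To compare train and test performance as numbers rather than by reading two printed reports, `classification_report()` accepts `output_dict=True`. The sketch below runs on made-up data from `make_classification`; the data and the variable names are assumptions, so only the pattern carries over to `pipeline_clf_cluster`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# hypothetical data standing in for the features and the cluster labels
demo_X, demo_y = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
dX_train, dX_test, dy_train, dy_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0)
demo_clf = GradientBoostingClassifier(random_state=0).fit(dX_train, dy_train)

# output_dict=True returns the report as a nested dictionary instead of text
report_train = classification_report(dy_train, demo_clf.predict(dX_train), output_dict=True)
report_test = classification_report(dy_test, demo_clf.predict(dX_test), output_dict=True)

gap = report_train["accuracy"] - report_test["accuracy"]
print(f"train accuracy: {report_train['accuracy']:.2f}")
print(f"test accuracy:  {report_test['accuracy']:.2f}")
print(f"gap:            {gap:.2f}")
```

A small gap suggests the classifier generalizes; how small is "small enough" depends on your business problem.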

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Assess Most Important Features that define a cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Now we assess the feature importance from the pipeline. First we need to know how many data cleaning and feature engineering steps the pipeline has
* It has only 1 step: median imputation

pipeline_clf_cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use the same code we saw in the previous unit notebook, where we grab the selected features from the feature selection step and their importance from the model step, and store them in a DataFrame
* The plot shows that these are the 4 most important features in descending order: `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> **We consider these the most important variables that define a Cluster. They will be used to understand the Cluster Profile**

# after data cleaning and feat engineering, the feature space changes

data_cleaning_feat_eng_steps = 1 # how many data cleaning and feature engineering steps does your pipeline have?
columns_after_data_cleaning_feat_eng = (Pipeline(pipeline_clf_cluster.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support()].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Feature': columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support()],
          'Importance': pipeline_clf_cluster['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

best_features = df_feature_importance['Feature'].to_list() # reassign best features in importance order

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features} \n")
df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.show()
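
The key move in the code above is using the boolean mask from `get_support()` to index the column names. A minimal sketch on made-up data (the column names `f0` to `f4` and the data itself are assumptions for illustration) shows the mechanics in isolation:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# hypothetical data: 5 named columns, of which the first 2 are informative
sel_X, sel_y = make_classification(n_samples=200, n_features=5, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
sel_X = pd.DataFrame(sel_X, columns=["f0", "f1", "f2", "f3", "f4"])

selector = SelectFromModel(GradientBoostingClassifier(random_state=0)).fit(sel_X, sel_y)

mask = selector.get_support()                # boolean array, one entry per input column
kept_columns = sel_X.columns[mask].to_list()  # names of the columns that were kept

print(mask)
print(kept_columns)
```

The mask only lines up with the column names if both refer to the same feature space, which is why the notebook first rebuilds the columns as they look after the data cleaning and feature engineering steps.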

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Cluster Analysis

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Bravo! You know which variables to consider now to explain each cluster!
* Let's create a custom function that explains the cluster profile in terms of `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`. For each cluster, we want to know the most common values for each variable.

* Go through the code and check the pseudo code to understand its logic. It may take a while, but the focus is on understanding the function and applying it to our business problem

# df contains the most important features and the clusters
# Note: your DataFrame needs to have a variable called 'Clusters' which will
# contain the cluster prediction from the pipeline

# It outputs a table showing, for each cluster, the most common values for a given variable

def DescriptionAllClusters(df, decimal_points=3):

  cluster_descriptions = []
  # iterate on each cluster, calling Clusters_IndividualDescription()
  for cluster in df.sort_values(by='Clusters')['Clusters'].unique():
    
      EDA_ClusterSubset = df.query(f"Clusters == {cluster}").drop(['Clusters'],axis=1)
      ClusterDescription = Clusters_IndividualDescription(EDA_ClusterSubset,cluster,decimal_points)
      cluster_descriptions.append(ClusterDescription)

  # DataFrame.append() was removed in pandas 2.0, so the rows are combined with pd.concat()
  DescriptionAllClusters = pd.concat(cluster_descriptions)
  DescriptionAllClusters.set_index(['Cluster'],inplace=True)
  return DescriptionAllClusters


def Clusters_IndividualDescription(EDA_Cluster,cluster, decimal_points):

  ClustersDescription = pd.DataFrame(columns=EDA_Cluster.columns)
  # for a given cluster, iterate in all columns
  # if the variable is numerical, calculate the IQR: display as Q1 -- Q3.
    # That will show the range for the most common values for the numerical variable
  # if the variable is categorical, count the frequencies and displays the top 3 most frequent
    # That will show the most common levels for the category

  for col in EDA_Cluster.columns:
    
    try:  # a given cluster may have only missing data for a given variable
      
      if EDA_Cluster[col].dtypes == 'object':
        
        top_frequencies = EDA_Cluster.dropna(subset=[col])[[col]].value_counts(normalize=True).nlargest(n=3)
        Description = ''
        
        for x in range(len(top_frequencies)):
          freq = top_frequencies.iloc[x]
          category = top_frequencies.index[x][0]
          CategoryPercentage = int(round(freq*100,0))
          statement =  f"'{category}': {CategoryPercentage}% , "  
          Description = Description + statement
        
        ClustersDescription.at[0,col] = Description[:-2]


      
      elif EDA_Cluster[col].dtypes in ['float', 'int']:
        DescStats = EDA_Cluster.dropna(subset=[col])[[col]].describe()
        Q1 = round(DescStats.iloc[4,0], decimal_points)
        Q3 = round(DescStats.iloc[6,0], decimal_points)
        Description = f"{Q1} -- {Q3}"
        ClustersDescription.at[0,col] = Description
    
    
    except Exception as e:
      ClustersDescription.at[0,col] = 'Not available'
      print(f"** Error Exception: {e} - cluster {cluster}, variable {col}")
  
  ClustersDescription['Cluster'] = str(cluster)
  
  return ClustersDescription
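
The numerical branch of the function above reports the interquartile range as `Q1 -- Q3`. As a compact sketch of the same idea, the quartiles can be computed per cluster with `groupby()` and `quantile()`. The `toy_profile` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# hypothetical toy data with one numerical variable and cluster labels
toy_profile = pd.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
    "Clusters": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# first and third quartiles of 'value' inside each cluster
quartiles = toy_profile.groupby("Clusters")["value"].quantile([0.25, 0.75]).unstack()

# same "Q1 -- Q3" text format used by the function above
iqr_text = quartiles.apply(
    lambda row: f"{round(float(row[0.25]), 2)} -- {round(float(row[0.75]), 2)}", axis=1)
print(iqr_text)
```

Half of the values in each cluster fall inside this range, which is why it is read as "the most common values" for that variable.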




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The next custom function is called `cluster_distribution_per_variable()` and is used to analyze the Clusters against a given variable. In our case it will evaluate **Clusters x Diagnostic**.
* It will show the absolute and relative levels of Diagnostic (Malignant and Benign) per cluster
* Go through the code and check the pseudo code to understand its logic. It may take a while, but the focus is on understanding the function and applying it to our business problem

import plotly.express as px
def cluster_distribution_per_variable(df,target):

  # the data should have 2 variables, the cluster predictions and
  # the variable you want to analyze with, in this case we call "target"
  
  # we use plotly express to create 2 plots
  # cluster distribution across the target
  # relative presence of the target level in each cluster
  
   
  df_bar_plot = df.value_counts(["Clusters", target]).reset_index() 
  df_bar_plot.columns = ['Clusters',target,'Count']
  df_bar_plot[target] = df_bar_plot[target].astype('object')

  print(f"Clusters distribution across {target} levels")
  fig = px.bar(df_bar_plot, x='Clusters',y='Count',color=target,width=800, height=500)
  fig.update_layout(xaxis=dict(tickmode= 'array',tickvals= df['Clusters'].unique()))
  fig.show()


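  # group_keys=False keeps the original (Clusters, target) index after apply();
  # pandas 2.x would otherwise add a second 'Clusters' level and reset_index() would fail
  df_relative = (df
                 .groupby(["Clusters", target])
                 .size()
                 .groupby(level=0, group_keys=False)
                 .apply(lambda x:  100*x / x.sum())
                 .reset_index()
                 .sort_values(by=['Clusters'])
                 )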
  df_relative.columns = ['Clusters',target,'Relative Percentage (%)']
 

  print(f"Relative Percentage (%) of {target} in each cluster")
  fig = px.line(df_relative, x='Clusters',y='Relative Percentage (%)',color=target,width=800, height=500)
  fig.update_layout(xaxis=dict(tickmode= 'array',tickvals= df['Clusters'].unique()))
  fig.update_traces(mode='markers+lines')
  fig.show()
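
The two tables behind these plots (absolute counts and relative percentages per cluster) can also be sketched with `pd.crosstab()`, which may be easier to inspect before plotting. The `toy_dist` DataFrame below is made up for illustration.

```python
import pandas as pd

# hypothetical toy data: a cluster label and a categorical variable per row
toy_dist = pd.DataFrame({
    "Clusters":   [0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
    "diagnostic": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# absolute counts of each diagnostic level per cluster
counts = pd.crosstab(toy_dist["Clusters"], toy_dist["diagnostic"])

# relative percentage of each diagnostic level inside each cluster (each row sums to 100)
percentages = pd.crosstab(toy_dist["Clusters"], toy_dist["diagnostic"], normalize="index") * 100

print(counts)
print(percentages.round(1))
```

In this toy data, cluster 1 holds 3 rows with diagnostic 0 and 1 row with diagnostic 1, so its relative percentages are 75% and 25%.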
 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To start the analysis we want a DataFrame that contains the best features and the Cluster predictions, since we want to analyze the patterns for each cluster
* We copy the `df_clf` DataFrame (since it has all features and the Cluster predictions) and keep only `best_features` plus `['Clusters']`.


df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=best_features + ['Clusters'], axis=1)
print(df_cluster_profile.shape)
df_cluster_profile.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We also want to analyze Diagnostic levels
* In this exercise, we get it from `data.target` and create a DataFrame.
* We know in advance that Diagnostic represents a categorical variable even though it is stored as an integer. Therefore we change its data type to `'object'`.

df_diagnostic = pd.DataFrame(data.target, columns=['diagnostic'])
df_diagnostic['diagnostic'] = df_diagnostic['diagnostic'].astype('object')
df_diagnostic.head(3)
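
If you prefer readable labels over 0 and 1, the integer codes can be mapped to names. The sketch below assumes that `data` was loaded with scikit-learn's `load_breast_cancer()`, which this section does not show; `cancer`, `label_map` and `diagnostic_labels` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# assumption: the notebook's `data` object comes from this loader
cancer = load_breast_cancer()

# target_names is ordered so that code 0 is 'malignant' and code 1 is 'benign'
label_map = {code: str(name) for code, name in enumerate(cancer.target_names)}

diagnostic_labels = pd.Series(cancer.target, name="diagnostic").map(label_map)
print(diagnostic_labels.value_counts())
```

The notebook keeps the 0/1 codes, so this mapping is optional and only affects how tables and plot legends read.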

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Cluster profile on most important features

We call `DescriptionAllClusters()` and pass it a concatenated DataFrame made from `df_cluster_profile` and `df_diagnostic`. Before passing it, let's display this concatenated data so it is clear what it contains.
* It has the best features `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`, Cluster Predictions and Diagnostic (where 0 is malignant, 1 is benign)



pd.concat([df_cluster_profile,df_diagnostic], axis=1).head(4)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Finally, we call `DescriptionAllClusters()`, passing the concatenated DataFrame. It outputs a table showing, for each cluster, the most common values for a given variable, including the diagnostic level (where 0 is malignant, 1 is benign). You also pass the number of decimal points to display for numerical variables; depending on the range of the variable you may need more decimal points. In our case, 2 decimal points are fine, but you can re-run the function afterwards with different values, like 0 and 4.




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
"> Recall that we found the most important variables that help to define a cluster are: `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`
* Note that for Cluster 0, the most common values of mean concavity are between 0.13 -- 0.22, worst perimeter between 145.7 -- 174.18, worst fractal dimension between 0.08 -- 0.09 and mean perimeter between 120.88 -- 136.88. Also, every diagnostic in cluster 0 is 0 (malignant). **This is the profile of cluster 0!**
  * Repeat this analysis for the remaining clusters. Note that we start giving meaning to each cluster.
* Note also our analyzed variable (diagnostic). It shows that cluster 0 has only malignant cases, cluster 1 is a mix of malignant and benign where malignant is dominant, and cluster 2 has only benign cases.
  * Think for a moment how cool that is. The algorithm found patterns that split the data into 3 groups: one malignant, one mixed and one benign. Now think how this analysis could be applied to solve other business problems.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note the major differences/patterns between clusters across variables, like:
  * The ranges of mean concavity look smaller when diagnostic is benign (1) and look to increase when diagnostic tends to be 0 (malignant)
  * The values of worst perimeter in clusters where malignant is predominant tend to be higher than in the benign cluster.
    * Note we keep adding meaning on how the clusters interact based on the analysis between a given variable (mean concavity for example) and diagnostic
  * Repeat the same analysis for other variables (worst fractal and mean perimeter)

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 Typically you will notice differences in ranges across the clusters and across the levels of your analyzed variable (diagnostic). This difference is typically the pattern we are interested in discovering.

pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(df=pd.concat([df_cluster_profile,df_diagnostic], axis=1),
                                          decimal_points=2)
clusters_profile

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Clusters distribution across Diagnostic levels & Relative Percentage of Diagnostic in each cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> This analysis shows the Clusters distribution across Diagnostic. This information is already revealed in the previous table, but now we can make it more visual for stakeholders. It has 2 plots
* The first is a bar plot: the x axis shows the clusters, the bar height shows how many data points are in that cluster, and the bars are colored by the level of diagnostic (where 0 is malignant, 1 is benign)
* The second plot gives a complementary view to the first. In the first we saw the absolute values (the counts). Now we see the relative values (the percentages)

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
"> Let's analyze the plots
* The first plot shows that cluster 0 has malignant cases only, cluster 1 has a mix of both with malignant predominance, and the last cluster is predominantly benign (however, there are a few malignant cases; if required, you could analyze these malignant cases later).
* The second plot quickly reveals the percentage presence of Diagnostic (malignant and benign) in each cluster.


df_cluster_vs_diagnostic = df_diagnostic.copy()
df_cluster_vs_diagnostic['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_diagnostic, target='diagnostic')

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> What should I do now?

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> You could deploy the Cluster Pipeline as it is. However, it would **need all 30 variables** to predict the cluster for a new breast sample, even though you used 4 variables to describe the profile of each cluster.
* In a real system, we should consider the number of input variables we want to manage
* Therefore we would consider an additional step: trying to **refit the cluster pipeline with the most important variables**. We say "trying" since we need to conduct a tradeoff analysis to validate whether the pipeline with all variables and the pipeline with only the "best features" produce "equivalent" results.
  * If they produce "equivalent" results, you can deploy a pipeline with fewer variables that delivers similar performance.
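
One possible way to put a number on "equivalent" is to compare the cluster labels from both fits with the Adjusted Rand Index (`adjusted_rand_score`), a metric this notebook does not use and that is suggested here only as an option. The sketch is a simplified stand-in for the Cluster Pipeline: it assumes the data comes from `load_breast_cancer()`, uses only scaling plus KMeans with 3 clusters, and `fit_clusters` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# assumption: the notebook's features come from this loader
all_features = load_breast_cancer(as_frame=True).data
top_features = ['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']

def fit_clusters(features):
    # hypothetical helper: scale the features, then cluster with KMeans (3 clusters)
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ])
    return pipe.fit_predict(features)

labels_all = fit_clusters(all_features)                 # clusters from all 30 variables
labels_top = fit_clusters(all_features[top_features])   # clusters from the 4 best features

# ARI is 1.0 for identical groupings and close to 0 for unrelated ones;
# it ignores how the cluster numbers themselves are named
ari = adjusted_rand_score(labels_all, labels_top)
print(f"Adjusted Rand Index between the two clusterings: {ari:.2f}")
```

What value counts as "equivalent" is a judgment call that depends on the business problem; the score only summarizes how often the two pipelines group the same samples together.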




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> However, we will study this approach in our second walkthrough project
* For the moment, what really matters is to understand that we can **group similar data points into clusters, explain the profile of each cluster, and analyze the clusters against another variable** (in our case, clusters vs diagnostic)

---