# Scikit-learn - Unit 07 - PCA (Principal Component Analysis)

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand what PCA (Principal Component Analysis) is and how it can be used in your project



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We will install scikit-learn, xgboost, feature-engine and yellowbrick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 07 - PCA (Principal Component Analysis)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Principal Component Analysis, or PCA, is a transformation of your data that looks for the combinations of features that explain the most variance in your data.

* It reduces the number of variables while preserving as much information as possible. Therefore it is also referred to as "dimensionality reduction".
* After the transformation, it creates a set of components, where each component contains the relevant information from the original variables.
  * Each component explains a certain part of the variance of the whole dataset, and the components are independent of (uncorrelated with) each other.
  * The drawback of PCA is that it is not easy to understand what each of these components represents, since they don't relate one to one to a specific variable; instead, each component corresponds to a combination of the original variables.
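
The two properties above (components are uncorrelated, and each explains a share of the variance) can be checked on a small example. The sketch below is only an illustration: the toy data and the names `X_toy`, `pca_toy` and `components_toy` are assumptions made for this example, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (an assumption for illustration): 200 rows, 3 features,
# where the first two features are strongly correlated with each other
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

pca_toy = PCA(n_components=3).fit(X_toy)
components_toy = pca_toy.transform(X_toy)

# Correlation matrix of the components: off-diagonal values are close to 0
corr = np.corrcoef(components_toy, rowvar=False)
print(np.round(corr, 3))

# Share of the total variance held by each component (the shares sum to 1)
print(np.round(pca_toy.explained_variance_ratio_, 3))
```

Because all three components are kept here, no information is lost; the reduction happens when you keep only the first few components.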




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **We will not focus** on the mathematical study of PCA; instead, we will discuss the idea behind it and how to use PCA in practical terms in your data science project.
* It will take time and experience to understand how the PCA algorithm works. For now, the central aspect is to understand what PCA is and why it will help you in predictive modelling.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **Why and when should I consider using PCA?**


* Imagine your data has a lot of variables (or dimensions).

  <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png">
  You want to **visualize** your data to discover patterns; however, it is unfeasible to visualize all of your data in a single plot. You can use PCA to reduce your dataset to 2 or 3 components and visualize it. We will explore that in this notebook.

  <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png">
  In **predictive modelling**, we are concerned about which variables are more relevant for modelling. PCA transforms your data into a smaller set of components that retain most of the variance. Every original variable still contributes to those components, so the model can keep learning the patterns in the data.
  * In supervised learning, you can use PCA as a feature-extraction step for your ML model. Instead of selecting features with, for example, `SelectFromModel()`, you use PCA to transform your features into relevant components that can help to predict your target variable. We will explore this technique in the Walkthrough Project 02.
  * In addition, in unsupervised learning, you can use PCA as a step to reduce dimensionality, so your cluster algorithm can better understand how to group similar data. We will explore this technique in the next lesson and also in Walkthrough Project 02.
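
As a rough illustration of the supervised case, PCA can sit in a pipeline between scaling and the model. This is a minimal sketch, not the approach used in the Walkthrough Project: the choice of `LogisticRegression`, the value `n_components=5`, the train/test split and the names `X_demo`, `y_demo` and `pipeline_demo` are all assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Breast cancer data from sklearn: 30 numerical features, binary target
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scale, reduce to a handful of components, then fit a classifier on them
pipeline_demo = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("pca", PCA(n_components=5, random_state=0)),  # 5 is an arbitrary choice
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline_demo.fit(X_train, y_train)

# The model receives 5 components instead of the 30 original features
n_model_inputs = pipeline_demo.named_steps["model"].coef_.shape[1]
print(n_model_inputs)
print(pipeline_demo.score(X_test, y_test))
```

Placing PCA inside the pipeline means it is fitted on the training data only, and the same transformation is then applied to unseen data.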


You can import PCA using the command below

from sklearn.decomposition import PCA

In the next cells we are going to:
* Load a dataset and define the pipeline steps to prepare the data for PCA
* Transform the data using PCA and understand how many components to consider
* Visualize the data after the PCA transformation

---

### Load Data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's load the breast cancer data from sklearn and apply PCA
* Each record describes a breast mass sample together with a diagnosis indicating whether it is malignant or benign, where 0 is malignant and 1 is benign.
* The target variable is 'diagnostic' and the features are the remaining variables.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">  We know in advance this dataset has only numerical features and no missing data.
* We are adding missing data (`np.nan`) on purpose in the first 10 rows of 'mean smoothness' using `.iloc[:10,4]`, just to better simulate the datasets you will likely face in the workplace.

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df_clf = pd.DataFrame(data.data,columns=data.feature_names)
df_clf['diagnostic'] = pd.Series(data.target)
df_clf = df_clf.sample(frac=0.6, random_state=101)
df_clf.iloc[:10,4] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions

print(df_clf.shape)
df_clf.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in applying PCA to the features only (not the diagnostic)
* We create 2 distinct DataFrames, `X` which is the features, and `df_target` that contains the diagnostic (benign or malignant). 
  * Note, there are 30 features in `X`
  * We will use `X` to apply PCA, and `df_target` at a later stage when we visualize the data 


df_target = df_clf[['diagnostic']]
X = df_clf.drop(['diagnostic'], axis=1)
print(X.shape)
X.head(3)

---

### Create pipeline steps

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To apply PCA, we should scale the data. Therefore we create our pipeline that is responsible for data cleaning, feature engineering and feature scaling.
* In our case, it will perform data cleaning (median imputation) and feature scaling.

from sklearn.pipeline import Pipeline
### Data Cleaning
from feature_engine.imputation import MeanMedianImputer
### Feat Scaling
from sklearn.preprocessing import StandardScaler


def PipelineDataCleaningFeatEngFeatScaling():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median') ),

      ( 'feature_scaling', StandardScaler() ),
  ])

  return pipeline_base

PipelineDataCleaningFeatEngFeatScaling()
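
On why scaling comes before PCA: PCA looks for directions of maximum variance, so a feature measured on a much larger scale can dominate the components. The sketch below illustrates this with two made-up, independent features; the data and the names `X_scales`, `ratio_raw` and `ratio_scaled` are assumptions for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two independent toy features on very different scales (illustrative data)
rng = np.random.default_rng(1)
small = rng.normal(loc=0, scale=1, size=500)
large = rng.normal(loc=0, scale=1000, size=500)
X_scales = np.column_stack([small, large])

# Without scaling, the first component is almost entirely the large-scale feature
ratio_raw = PCA(n_components=2).fit(X_scales).explained_variance_ratio_

# After standardizing, both features contribute on comparable terms
X_scales_std = StandardScaler().fit_transform(X_scales)
ratio_scaled = PCA(n_components=2).fit(X_scales_std).explained_variance_ratio_

print(np.round(ratio_raw, 4))
print(np.round(ratio_scaled, 4))
```

In the unscaled case, nearly all the "explained variance" reflects the unit of measurement and not any real structure in the data.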

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We fit and transform the data to the pipeline.
* The result is a NumPy array. Note it still has the same number of rows and columns (341, 30); the point is that the data type is now an array because of the feature scaling transformation.

pipeline_pca = PipelineDataCleaningFeatEngFeatScaling()
df_pca = pipeline_pca.fit_transform(X)
print(df_pca.shape,'\n', type(df_pca))

Just for learning purposes, let's check `df_pca`.
* As we expect, it is the familiar NumPy array we covered in previous sections. Note also it is a 2D array.

df_pca

---

### PCA transformation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Now that the data is scaled, we can apply PCA.
* We are not adding PCA to a pipeline in this lesson; we will do that at a later stage. The idea here is to understand how the process works.
* **A quick recap**: PCA reduces the number of variables while preserving as much information as possible. After the transformation, it creates a set of components, where each component contains the relevant information from the original variables.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png
"> The first question is: 
* **How many components should I consider?** That depends. Let's start by setting the number of components to the number of columns in the scaled data, in this case 30. That is useful for understanding the explained variance of each component.
* Read the code below to understand its logic. Once you run the cell, you will notice that:
  * The first three components are more significant than the others and, together, they account for 72.47% of the data variance. It is a good sign when a few components, like 3 or 4, capture more than 80% of your data variance. Here three components fall somewhat short of that, but that is okay. You could still select three as the number of components, which is good progress since you had 30 features and now have three components.
  * But in this exercise, for learning purposes, we will aim for more than 90% of the data variance and use seven components, since we gain more data variance with a relatively small increase in components. To recap: we started with 30 features holding all the data variance, moved to 3 components with 72% of the data variance, and now use seven components with more than 90% of the data variance.
  

import numpy as np
from sklearn.decomposition import PCA # import PCA from sklearn

n_components = 30 # set the number of components as all columns in the data

pca = PCA(n_components=n_components).fit(df_pca)  # set PCA object and fit to the data
x_PCA = pca.transform(df_pca) # array with transformed PCA


# the PCA object has .explained_variance_ratio_ attribute, which tells 
# how much information (variance) each component has 
# We store that to a DataFrame relating each component to its variance explanation
ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,2),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

# prints how much of the dataset these components explain (naturally, in this case it will be 100%)
PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

In the next cell we just copied the code from the cell above and changed n_components to 7. 
* With 7 components we achieved a bit more than 91% of data variance

n_components = 7

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca) # array with transformed PCA

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,2),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)
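
Instead of trying values of `n_components` by hand, the same decision can be sketched with a cumulative sum over the explained variance ratios. This is only an illustration under stated assumptions: it uses the full breast cancer dataset (not the sampled `df_clf` with imputed values), so its numbers may differ from the cells above, and the 0.90 threshold and the names `X_scaled_full`, `pca_all` and `pca_auto` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Full dataset, scaled (an assumption: no sampling or imputation here)
X_full, _ = load_breast_cancer(return_X_y=True)
X_scaled_full = StandardScaler().fit_transform(X_full)

threshold = 0.90  # target share of the data variance (illustrative choice)

# Fit PCA with all components and accumulate the explained variance
pca_all = PCA().fit(X_scaled_full)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

# First position where the cumulative variance reaches the threshold, plus 1
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# scikit-learn can make the same choice when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_scaled_full)

print(n_needed, pca_auto.n_components_)
```

Passing a float between 0 and 1 as `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction.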

Note that the data is transformed and stored in `x_PCA`. Let's check its content.
* You will notice it is a NumPy array with dimensions 341 x 7, where 341 is the number of rows and 7 is the number of components we defined earlier.
* Imagine now that this data would be fed to a model. For this particular dataset, the ML task would be a classification.
* Also note that PCA helped reduce 30 features to 7 components, and these seven components contain more than 90% of the information (data variance).

print(x_PCA.shape)
x_PCA

---

### Visualize data after PCA transformation

Imagine you want to visualize your data, before and after applying PCA.
* If you had to visualize the 30 features, you could do a correlation analysis and look for features that are correlated among themselves or, in this particular dataset, features that are correlated to the target.
* So let's suppose you want to visualize the relationship between "mean concavity" and "mean concave points" and the target. Since the features are numerical, you can do a scatter plot with them and color by the target.
  * You can imagine a frontier between the blue and orange dots. That is good, since malignant and benign look separable. At the same time, a few data points look mingled along this frontier.
  * However, what about the remaining variables? When you consider this dataset as a whole, is that informative enough to separate these classes?

var1, var2 = 'mean concavity' , 'mean concave points'
sns.scatterplot(x=X[var1], y=X[var2], hue=df_target['diagnostic'])
plt.xlabel(var1)
plt.ylabel(var2)
plt.show()

We can plot the PCA components to evaluate, from another perspective, how the data behaves.
* We know `x_PCA` holds the data after transformation and has 7 components. We will plot the most representative components, components 0 and 1, in a scatter plot.

sns.scatterplot(x=x_PCA[:,0], y=x_PCA[:,1])
plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.show()

We know that these 2 components hold by themselves 62% of the information (data variance).
* This is powerful because, with just 2 variables (2 components), we get a clearer view of whether the dataset has enough information to separate malignant and benign.
* We now color the plot by diagnostic using `df_target` as the hue argument.
  * Note we see a clearer border between 0 and 1.
  * In a nutshell, we have the same data, showing the same information. The difference now is that the data was reduced to its major components
  * The drawback is that we lose the interpretation, since component 0 is made of a combination of the original variables

sns.scatterplot(x=x_PCA[:,0], y=x_PCA[:,1], hue=df_target['diagnostic'], alpha=0.8)
plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.show()
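
On the interpretation drawback mentioned above: a fitted PCA object exposes the weights of each original variable in each component through its `components_` attribute, which gives at least a partial view of what a component is made of. The sketch below is an illustration only; it fits PCA on the full scaled dataset (without the sampling and imputation used in this notebook), so the weights may differ slightly, and the names `pca_demo`, `loadings` and `top_features` are assumptions for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data_demo = load_breast_cancer()
X_scaled_demo = StandardScaler().fit_transform(data_demo.data)
pca_demo = PCA(n_components=2).fit(X_scaled_demo)

# components_ has shape (n_components, n_features): one row of weights per component
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=data_demo.feature_names,
    columns=["Component 0", "Component 1"],
)

# Original variables with the largest absolute weight in Component 0
top_features = loadings["Component 0"].abs().sort_values(ascending=False).head(5)
print(top_features)
```

A large absolute weight means that variable contributes strongly to the component; the sign on its own is arbitrary, since flipping all signs of a component describes the same direction.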

Naturally, we can plot more components. In this exercise, we plot 3 components in a 3D scatter plot using Plotly Express.
  * Move around the 3D plot and try to visualize whether you could draw a surface that would separate the dots. The surface you imagine is an ML model.
  * Note again that these 3 components alone hold 72% of all the information from the dataset to diagnose malignant or benign.

import plotly.express as px
fig = px.scatter_3d(x=x_PCA[:,0], y=x_PCA[:,1], z= x_PCA[:,2] , color=df_target['diagnostic'],
                    labels=dict(x="Component 0", y="Component 1", z='Component 2'),
                    color_continuous_scale='spectral',
                    width=750, height=500)
fig.update_traces(marker_size=5)
fig.show()

---