<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width = 750, align = "center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 3: Feature Engineering

## Principal Component Analysis (PCA) in Feature Engineering

### Lesson Objectives

After completing this lesson, you should be able to: 

- Understand what Principal Component Analysis (PCA) is
-	Understand PCA's role in feature engineering 



## PCA: Definition 

-	PCA is a dimension reduction technique. It is an unsupervised machine learning method with many uses; in this video we only cover its use for feature engineering


## PCA: How It Works

-	The first Principal Component (PC) is defined as the linear combination of the predictors that captures the most variability of all possible linear combinations.
-	Then, subsequent PCs are derived such that these linear combinations capture the most remaining variability while also being uncorrelated with all previous PCs.
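
The two statements above can be written compactly. The notation is an assumption made for this sketch and is not taken from the lesson: $X$ is the centered predictor matrix, $\Sigma$ is its covariance matrix, and $w_k$ is the weight vector of the $k$-th PC.

```latex
% First PC: the unit-length weight vector whose linear combination has maximal variance
w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw)
    = \arg\max_{\lVert w \rVert = 1} w^\top \Sigma\, w

% k-th PC: maximal remaining variance, uncorrelated with all previous PCs
w_k = \arg\max_{\substack{\lVert w \rVert = 1 \\ w^\top \Sigma\, w_j = 0,\; j < k}} w^\top \Sigma\, w

% The solutions are the eigenvectors of \Sigma, ordered by decreasing eigenvalue
\Sigma\, w_k = \lambda_k w_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge 0, \qquad \operatorname{Var}(X w_k) = \lambda_k
```

Under this formulation, the variance captured by the $k$-th PC is the eigenvalue $\lambda_k$, which is why the share of variance explained by each component can be read off the eigenvalues.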


## Feature Engineering 

-	"Feature Engineering" is a practice where predictors are created and refined to maximize model performance
-	It can take quite some time to identify and prepare relevant features


## Feature Engineering with PCA

-	Basic idea: generate a smaller set of variables that capture most of the information in the original variables
-	The new predictors are functions of the original predictors; all the original predictors are still needed to create the surrogate variables

## Dataset: Predict US Crimes

-	We want to predict the rate of violent crimes per 100k population in different locations in the US
-	More than 100 predictors. Examples:
  -	`householdsize`: mean people per household
  - `PctLess9thGrade`: percentage of people 25 and over with less than a 9th grade education
  - `pctWWage`: percentage of households with wage or salary income in 1989
-	For a description of these variables, see the UCI repository (Communities and Crime data set)
-	Let's assume that we don't want to work with those 100+ predictors. Why?
  -	Some will be collinear (i.e., highly correlated)
  -	It's hard to see relationships in a high-dimensional space
-	How do we use PCA to get down to 10 dimensions?

In [1]:
import sys.process._
// Download the dataset with wget; an exit code of 0 means success
"wget https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0105EN/data/UScrime2-colsLotsOfNAremoved.csv".!

--2020-06-12 12:18:32--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0105EN/data/UScrime2-colsLotsOfNAremoved.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 971758 (949K) [text/csv]
Saving to: ‘UScrime2-colsLotsOfNAremoved.csv’




0

In [2]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.{VectorAssembler, PCA}

spark = org.apache.spark.sql.SparkSession@790059d4


org.apache.spark.sql.SparkSession@790059d4

In [3]:
val crimes = spark.read.
    format("csv").
    option("delimiter", ",").
    option("header", "true").
    option("inferSchema", "true").
    load("UScrime2-colsLotsOfNAremoved.csv")

crimes = [community: string, population: double ... 99 more fields]


[community: string, population: double ... 99 more fields]

In [4]:
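// Exclude `community` and `otherpercap`; assemble all remaining columns into one feature vector
val excluded = List("community", "otherpercap")
val assembler = new VectorAssembler()
  .setInputCols(crimes.columns.filterNot(name => excluded.contains(name.toLowerCase)))
  .setOutputCol("features")
// Note: this differs from the code shown in the video
val featuresDF = assembler.transform(crimes).select("features")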

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(10)
  .fit(featuresDF)

val result = pca.transform(featuresDF).select("pcaFeatures")
result.show(false)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|pcaFeatures                                                                                                                                                                                              |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1.2138889196984592,0.5645677593367362,-0.022284837106876964,-0.3093435457824009,-0.8195107213520451,-0.42466623894186817,-0.41000046615093594,0.6418064693954159,0.19967782074511403,0.507495196949215] |
|[0.6279851901950964,1.1668941486566553,-0.5141643066488679,-0.2483413354182063,-0.624463455060983,-0.11825739720679951,-0.4505024369417832,1.2455920970597112,0.02530068973793365,0.683

assembler = vecAssembler_8b84712b0a74
featuresDF = [features: vector]
pca = pca_774333d2c7a5
result = [pcaFeatures: vector]


[pcaFeatures: vector]

-	The principal components are stored in a local dense matrix (`pca.pc` on the fitted model).
-	The transformed data now has only 10 dimensions, yet it represents the variability 'almost as well' as the original 100 or so dimensions
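
As a rough sketch of how a fitted model can be inspected, the example below fits PCA on a tiny made-up dataset and reads the component matrix and the share of variance each component explains. `pc` and `explainedVariance` are members of Spark's `PCAModel`; the toy data, the local master setting, and the names `InspectPca` and `fitToyPca` are assumptions made for illustration only.

```scala
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object InspectPca {
  // Fit PCA with k components on a made-up dataset of 4 rows and 5 features.
  def fitToyPca(spark: SparkSession, k: Int): PCAModel = {
    val rows = Seq(
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(1.0, 1.0, 0.5, 2.0, 1.0),
      Vectors.dense(3.0, 2.0, 1.0, 5.0, 4.0)
    )
    val df = spark.createDataFrame(rows.map(Tuple1.apply)).toDF("features")
    new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(k)
      .fit(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally, outside the course notebook
      .appName("InspectPca")
      .getOrCreate()

    val model = fitToyPca(spark, k = 2)

    // One row per original feature, one column per principal component
    println(s"pc is ${model.pc.numRows} x ${model.pc.numCols}")
    // Proportion of the total variance explained by each retained component
    println(s"explained variance: ${model.explainedVariance}")

    spark.stop()
  }
}
```

On the crimes data, the same two members of the fitted `pca` value would give the component matrix for the 10 retained components and the share of variance each one explains.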


## Pros I

-	Interpretability (!)
-	PCA creates components that are uncorrelated, and some predictive models prefer little to no collinearity (for example, linear regression)
-	Helps avoid the 'curse of dimensionality': classifiers tend to overfit the training data in high-dimensional spaces, so reducing the number of dimensions may help
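
To make the "uncorrelated components" point concrete, here is a minimal plain-Scala sketch for two predictors only. It uses the closed-form rotation angle that diagonalizes a 2x2 covariance matrix rather than Spark's implementation, and the data values and the names `TwoDimPca` and `pcaScores` are invented for illustration.

```scala
object TwoDimPca {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Sample covariance (divides by n - 1)
  def cov(xs: Seq[Double], ys: Seq[Double]): Double = {
    val (mx, my) = (mean(xs), mean(ys))
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.length - 1)
  }

  // Pearson correlation
  def corr(xs: Seq[Double], ys: Seq[Double]): Double =
    cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

  // Scores of each observation on PC1 and PC2 for two-column data.
  // theta is the direction of maximal variance for a 2x2 covariance matrix.
  def pcaScores(xs: Seq[Double], ys: Seq[Double]): (Seq[Double], Seq[Double]) = {
    val (mx, my) = (mean(xs), mean(ys))
    val theta = 0.5 * math.atan2(2 * cov(xs, ys), cov(xs, xs) - cov(ys, ys))
    val (c, s) = (math.cos(theta), math.sin(theta))
    val centered = xs.zip(ys).map { case (x, y) => (x - mx, y - my) }
    val pc1 = centered.map { case (x, y) => c * x + s * y }
    val pc2 = centered.map { case (x, y) => -s * x + c * y }
    (pc1, pc2)
  }

  def main(args: Array[String]): Unit = {
    // Two strongly correlated, made-up predictors
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val ys = Seq(2.1, 3.9, 6.2, 7.8, 10.1)
    val (pc1, pc2) = pcaScores(xs, ys)

    println(f"correlation of original predictors: ${corr(xs, ys)}%.4f")
    println(f"covariance of PC1 and PC2 scores:   ${cov(pc1, pc2)}%.2e")
    println(f"variance on PC1: ${cov(pc1, pc1)}%.4f, on PC2: ${cov(pc2, pc2)}%.4f")
  }
}
```

The original predictors are almost perfectly correlated, while the two component scores have essentially zero covariance and nearly all of the variance sits on the first component.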


## Pros II 

- Performance. In subsequent modeling, the computational effort often depends on the number of variables. PCA gives you far fewer variables, which may make further processing faster
-	For classification problems PCA can show potential separation of classes (if there is a separation).


## Cons

-	The computational effort of PCA itself depends greatly on the number of variables and the number of data records.
-	Because PCA seeks linear combinations of predictors that maximize variability, it is naturally drawn first to the predictors with the most variation in the data, whether or not they are relevant to the response.


## How Many Principal Components to Use?

-	There is no simple answer to this question
-	But there are heuristics:
  -	Find the elbow on the graph of variance explained by number of dimensions
  -	Set a 'variance explained' threshold (for example, take as many principal components as needed to explain 95% of the variance)
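
The threshold heuristic can be sketched in a few lines of plain Scala. The helper name `componentsNeeded` and the variance ratios are hypothetical; with Spark, the ratios could come from the fitted model's `explainedVariance`, provided the model was fit with a large enough `k`.

```scala
object VarianceThreshold {
  // Smallest number of leading components whose cumulative explained-variance
  // ratio reaches `threshold`; falls back to all components if it is never reached.
  def componentsNeeded(explained: Seq[Double], threshold: Double): Int = {
    val cumulative = explained.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(_ >= threshold)
    if (idx >= 0) idx + 1 else explained.length
  }

  def main(args: Array[String]): Unit = {
    // Made-up explained-variance ratios, sorted from the first PC onwards
    val explained = Seq(0.55, 0.25, 0.10, 0.06, 0.04)
    println(componentsNeeded(explained, threshold = 0.95))
  }
}
```

With these made-up ratios, the first three components explain about 90% of the variance, so a 95% threshold needs a fourth component.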



## Tip for Best Practice

-	Always center and scale the predictors prior to performing PCA (see previous course). Otherwise the predictors that have more variation will dominate the top principal components
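
A minimal plain-Scala sketch of what centering and scaling does to a single predictor is shown below. The `standardize` helper and the sample values are made up for illustration; in Spark ML, the `StandardScaler` transformer with `setWithMean(true)` and `setWithStd(true)` plays this role on a vector column.

```scala
object Standardize {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Sample variance (divides by n - 1)
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(v => (v - m) * (v - m)).sum / (xs.length - 1)
  }

  // Center a column to mean 0 and scale it to unit standard deviation.
  // A constant column has zero spread, so it is only centered.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val m = mean(xs)
    val sd = math.sqrt(variance(xs))
    if (sd == 0.0) xs.map(_ => 0.0) else xs.map(v => (v - m) / sd)
  }

  def main(args: Array[String]): Unit = {
    // Made-up predictors on very different scales
    val income = Seq(32000.0, 45000.0, 51000.0, 78000.0, 120000.0)
    val householdSize = Seq(2.1, 2.4, 2.6, 3.0, 3.3)

    println(s"raw variances:    ${variance(income)} vs ${variance(householdSize)}")
    println(s"scaled variances: ${variance(standardize(income))} vs ${variance(standardize(householdSize))}")
  }
}
```

Before scaling, the income column's variance is many orders of magnitude larger, so it would dominate the first component purely because of its units. After scaling, both predictors contribute on an equal footing.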


## Lesson Summary

Having completed this lesson, you should be able to: 

-	Apply PCA in Spark
-	Use PCA to fix datasets with correlated predictors that could otherwise trip your models!

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.