## <span style="color:#fa04d9"><center>PRINCIPAL COMPONENT ANALYSIS USING APACHE SPARK</center></span>

### Suppose we have the following simple dataset composed of 5 data points with (x, y) coordinates: A(-4, -1), B(-2, 0), C(0, 1), D(2, 2) and E(4, 3).

![figure1](https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/pictures/figure1.jpg)

### It is easy to see that these 5 data points lie on a straight line (with slope equal to 0.5) and could therefore be described along a single dimension, using a vector such as V(2,1) or any nonzero multiple of it, as shown in the picture below.<br><br>
![figure2](https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/pictures/figure2.jpg)

### The figure above represents the straight line (in light blue) passing through points A, B, C, D and E, and two example vectors, V of coordinates (2,1) (in red) and U of coordinates (-3, -1.5) (in green), either of which can serve as a basis for a one-dimensional set of coordinates for the points on this line.
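
The claims above can be checked with a few lines of plain Python. This is only an illustrative sketch: the dictionary `points` and the helper `cross` are names introduced here, not part of the notebook. It computes the slope between consecutive points and tests whether V and U are parallel.

```python
# The five data points from the text, as (x, y) pairs.
points = {"A": (-4, -1), "B": (-2, 0), "C": (0, 1), "D": (2, 2), "E": (4, 3)}

# Slope between each pair of consecutive points; one shared value means collinear.
names = sorted(points)
slopes = [
    (points[q][1] - points[p][1]) / (points[q][0] - points[p][0])
    for p, q in zip(names, names[1:])
]
print(slopes)  # [0.5, 0.5, 0.5, 0.5]


def cross(u, v):
    """2D cross product; it is zero exactly when u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]


V = (2, 1)
U = (-3, -1.5)
print(cross(V, U))  # 0.0
```

A zero cross product means U is a multiple of V (here U = -1.5 x V), so both vectors point along the same line.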

### Let us now suppose that the fact that our data set can be described by a single dimension was not obvious (this is typically the case in more complex scenarios, where the dataset can be described fairly well with fewer dimensions than initially given), and work on "discovering" this property. This can be done using the mathematical method known as "Principal Component Analysis", which is implemented in Apache Spark as "PCA".

# <span style="color:#fa04d9">**Step 1: Define the Spark context variable**</span>

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# <span style="color:#fa04d9"> Step 2: Create the dataset corresponding to the five data points described above as a dataframe.</span>

Virtually all Spark machine learning implementations take Vectors of features as input (rather than individual columns), so we will build our data frame as a set of 5 rows, where each row represents one of the data points from above. 

In [2]:
from pyspark.ml.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0]),), 
        (Vectors.dense([2.0, 2.0]),),
        (Vectors.dense([4.0, 3.0]),),
        (Vectors.dense([-2.0, 0.0]),),
        (Vectors.dense([-4.0, -1.0]),)
        ]

df = spark.createDataFrame(data, ["features"])

### Verify the correctness of the data frame by displaying it.

In [3]:
df.show()

+-----------+
|   features|
+-----------+
|  [0.0,1.0]|
|  [2.0,2.0]|
|  [4.0,3.0]|
| [-2.0,0.0]|
|[-4.0,-1.0]|
+-----------+



### <span style="color:green">Remark: If you are wondering about the syntax used above to create the data frame, where each Vector seems to be "wrapped" in an extra set of parentheses with an extra comma, you will find a quick clarification below.</span>

<span style="color:green">The reason for this is that SparkSession.createDataFrame(), which is used here, requires an RDD or list of Row, tuple or list objects, or a pandas.DataFrame, unless a schema given as a DataType is provided. This is better explained with an example.<br> Consider trying to do the following to create a data frame in Python:<br><br>data = ['a', 'b', 'c']<br>df = spark.createDataFrame(data, ["features"])<br><br>
The code above will fail with a message such as: <span style="color:blue">TypeError: Can not infer schema for type: &lt;type 'str'&gt;</span>, because a bare string is not a row-like object from which column types can be inferred.<br><br>
You can then consider a couple of ways to fix this.

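### <span style="color:green">Method 1: Declare the element type by passing a DataType (here StringType) as the schema directly in the invocation of createDataFrame.
<span style="color:green">from pyspark.sql.types import StringType<br> 
df = spark.createDataFrame(data, StringType()).toDF("features")<br><br>When the schema is a plain DataType, the single resulting column is named "value" by default, which is why toDF is used here to rename it to "features".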

### <span style="color:green">Method 2: You can alternatively feed tuples into createDataFrame (rather than single values). This is where the Python syntax to create single element tuples is used. In <span style="color:green">Python, (a,) represents a one-element tuple containing just a.
<span style="color:green">data = [('a',), ('b',), ('c',)]<br>
df = spark.createDataFrame(data, ["features"])<br><br>This code should now be successful.
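
The one-element tuple syntax can be tried in plain Python, independently of Spark. The variable names below are illustrative only.

```python
# Parentheses alone only group an expression; the trailing comma makes the tuple.
not_a_tuple = ('a')
one_tuple = ('a',)

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple
print(len(one_tuple))              # 1

# Wrapping each value of a flat list into a one-element tuple, row by row.
values = ['a', 'b', 'c']
rows = [(v,) for v in values]
print(rows)  # [('a',), ('b',), ('c',)]
```

The list comprehension is a convenient way to turn an existing flat list into the list of one-element tuples used in Method 2.
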
### <span style="color:green">Closing the remark and resuming with the PCA tutorial.

# <span style="color:#fa04d9">Step 3: Instantiate and train the PCA model in Spark. </span>

First, we will instantiate a standard scaler. The original data will be transformed to have a mean of 0, which is a very common data preparation technique in data science. In our case, this means that our set of 5 data points will be centered around the origin, so that the sum of all X values and the sum of all Y values are both 0.

In [4]:
from pyspark.ml.feature import StandardScaler
scaler_definition = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)

In [5]:
# As implied by the names used for the variables, scaler_definition is only a definition of the standard scaler. In order to obtain an actual instance, we need to apply the "fit" method to the definition,
# passing in the actual data.
scaler_instance_trained = scaler_definition.fit(df)

In [6]:
# The trained instance of the scaler can now be used with the "transform" method, to take in the original dataset (a dataframe of feature vectors) and produce as output a 
# new dataframe where the data now has a 0 mean.
scaler_output_df = scaler_instance_trained.transform(df)

In [7]:
# Verify that the newly produced data is indeed scaled to have a 0 mean.
scaler_output_df.show()

+-----------+--------------+
|   features|scaledFeatures|
+-----------+--------------+
|  [0.0,1.0]|     [0.0,0.0]|
|  [2.0,2.0]|     [2.0,1.0]|
|  [4.0,3.0]|     [4.0,2.0]|
| [-2.0,0.0]|   [-2.0,-1.0]|
|[-4.0,-1.0]|   [-4.0,-2.0]|
+-----------+--------------+
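
The effect of the scaler on this small dataset can be reproduced by hand. The sketch below uses NumPy rather than Spark (an assumption made purely for illustration) and subtracts the per-column mean, which is what `withMean=True` combined with `withStd=False` amounts to.

```python
import numpy as np

# The five original points, in the same row order as the data frame above.
points = np.array([[0.0, 1.0], [2.0, 2.0], [4.0, 3.0], [-2.0, 0.0], [-4.0, -1.0]])

# Subtract the per-column mean; there is no division by the standard deviation.
column_means = points.mean(axis=0)
centered = points - column_means

print(column_means.tolist())          # [0.0, 1.0]
print(centered.tolist())
print(centered.sum(axis=0).tolist())  # [0.0, 0.0]
```

Only the Y coordinates change here, because the X values already average to 0.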



Now that we have our scaledFeatures, we can create our PCA model which will be expecting an input column named "scaledFeatures" and will produce as output a column (of vectors) named "pcaFeatures". This notebook will detail the meaning of each.

## Important Remark:<br>
The parameter k=2 (in the cell below) indicates that we want our PCA model to produce a description of our data in TWO dimensions. This may seem counterintuitive as we 'already' know that our dataset is a straight line and therefore unidimensional. Remember that we are working with a very simple example where we are pretending not to be aware of the end result. In general, when starting with an N dimensional dataset, it is common to run PCA with k=N, since the "meaningful" number of final dimensions is not known in advance. As we will see shortly, once PCA has produced a new set of dimensions describing the dataset, it also reports the amount of 'variance' captured by each one of those dimensions. <span style="color:red">**It is then up to the data scientist to decide how many dimensions to keep in order to get the best *bang for the buck*.**</span>

In [8]:
from pyspark.ml.feature import PCA
pca_model_definition = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures") # k=2 tells PCA how many dimensions to evaluate (see above)

Using the same approach as with the standard scaler a few cells above, we first create a PCA object (think of it as a definition capable of producing an actual model once its 'fit' method has been invoked on a dataset), and then turn that definition into a *real* fitted model by fitting it to the output of the standard scaler produced above.

In [9]:
pca_model_instance = pca_model_definition.fit(scaler_output_df)

The variable "pca_model_instance" is the actual PCA model which was obtained using our simple 5-point dataset. **The Spark implementation class type of this object is <span style="color:red">"PCAModel"</span>**

The Spark documentation lists the attributes and methods available on the pca_model_instance that we just obtained. One attribute of interest is "pc", which stands for "principal components". It returns a matrix which we will look at more closely after displaying it first.

In [10]:
pca_model_instance.pc # pc = Principal Components

DenseMatrix(2, 2, [-0.8944, -0.4472, -0.4472, 0.8944], 0)

We can see that this matrix has two rows (the first '2' in the output above) and two columns (the second '2'). Each column is one principal component vector, and each vector has two coordinates, one per row. The values are listed column by column. We can also print this matrix with slightly better formatting, as in the cell below.

In [11]:
print (pca_model_instance.pc)

DenseMatrix([[-0.89442719, -0.4472136 ],
             [-0.4472136 ,  0.89442719]])
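
To see how the flat list of values in the first output maps onto the 2x2 matrix, the sketch below rebuilds the matrix with NumPy (used here only for illustration). It assumes the column-major layout that Spark's DenseMatrix uses when its last argument (isTransposed) is false.

```python
import numpy as np

# Values exactly as listed in the DenseMatrix(2, 2, [...], 0) output above.
num_rows, num_cols = 2, 2
values = [-0.8944, -0.4472, -0.4472, 0.8944]

# order="F" fills the matrix column by column (column-major).
pc = np.array(values).reshape((num_rows, num_cols), order="F")

first_component = pc[:, 0]   # first column of the matrix
second_component = pc[:, 1]  # second column of the matrix
print(first_component.tolist())   # [-0.8944, -0.4472]
print(second_component.tolist())  # [-0.4472, 0.8944]
```

This particular matrix happens to be symmetric, so reading the values row by row would give the same result. With a non-symmetric matrix the layout matters.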


# <span style="color:#fa04d9">Step 4: Interpreting the output of the PCA implementation in Spark.</span>

## &nbsp;&nbsp;&nbsp;<span style="color:#fa04d9">Step 4.1: Output of the method "pc" from the Spark PCAModel class: The Eigen Vectors. </span>
We can represent this matrix as two vectors W and X (or PC1 and PC2, for Principal Component 1 and Principal Component 2 returned by the Spark PCA algorithm) with the corresponding coordinates as can be seen below (coordinates rounded to two decimal places).

![eigen vectors](https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/pictures/eigen_vectors.jpg)

The figure below also provides a visual representation of these two dimensions (W and X) returned by PCA, in the context of the original dataset with the five data points A, B, C, D and E and the original vector V(2,1), all represented at a larger scale than the original picture at the beginning of this notebook. (Note that vector U has been left out to avoid overloading the diagram.)
![figure3](https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/pictures/figure3.jpg)

### <span style="color:blue">Some important remarks about the results obtained so far:</span><br>
1- The dimensions W and X returned by PCA are known mathematically as the <span style="color:red">**"Eigen vectors"**</span> of the <span style="color:red">**covariance matrix**</span> of the original dimensions which were provided. **This notebook will not address the mathematical aspects and background of the PCA method, interesting as they are. Further reading on this topic is widely available online.**<br><br>
2- The first vector / dimension returned by PCA, "W" has an exact slope of 0.5 and corresponds to the direction of the vectors U and V which we had identified "intuitively" at the beginning of this notebook. For added clarity, vector V has been added to the drawing above to highlight the fact that it has the exact same slope as W (but a different size and direction).<br><br>
3- The vector, or dimension X is exactly orthogonal to W. This is a property of the PCA method: all the returned dimensions are orthogonal (even when there are more than two).<br><br>
4- The lengths of vectors W and X were not chosen by the Spark PCA at random (in principle, 'longer' or 'shorter' vectors along the same directions would describe the same axes). If you calculate the norm (size) of W or X (given by **sqrt(x^2 + y^2)**), you will notice that each one of those vectors has a norm of 1, making them <span style="color:blue">unit</span> vectors. The <span style="color:blue">basis</span> formed by W and X is therefore <span style="color:blue">orthonormal</span>.<br><br>
5- We can arbitrarily multiply any one of the PCA dimensions by -1, which would only flip its direction by 180 degrees and does not change the orthonormal property.<br><br>
6- In the <span style="color:red">red</span> orthonormal basis above formed by W and X, we can see that each one of our original datapoints will have a different abscissa (aka X coordinate), but the ordinate (aka Y coordinate) will be rigorously identical for all datapoints. As a matter of fact, if the dataset is initially centered around the origin --imagine shifting the dataset downward in a vertical motion so that datapoint C is at the origin-- (this was performed by the 0 mean transformation of our standard scaler), then we will have the case where the ordinate for each datapoint is <span style="color:red">0</span>.<br><br>
7- <span style="color:red">We can conclude from point 6 just above that the second dimension X is not useful at all, and all datapoints can be identified by their abscissa (or X coordinate) along direction W alone. As a matter of fact, if a dataset is to be used as input to a predictive algorithm and one dimension presents exactly the same value for all datapoints, then that dimension can be dropped as it does not affect the result at all</span>. How much information each PCA dimension provides is measured by the <span style="color:blue">**Eigen values**</span> of the same covariance matrix discussed in the first point above. The Spark PCA model exposes these **Eigen values**, normalized so that they sum to 1 (i.e. as the proportion of variance associated with each **Eigen vector**, W and X here), through the attribute **'explainedVariance'** which we will look at below.<br>
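
Remarks 2 to 5 above can be verified numerically. The sketch below is illustrative only: it copies the coordinates of W and X from the printed matrix and uses NumPy rather than Spark to check the norms, the orthogonality and the slope.

```python
import numpy as np

# Principal components as printed by Spark (the two columns of the pc matrix).
W = np.array([-0.89442719, -0.4472136])
X = np.array([-0.4472136, 0.89442719])

# Remark 4: both vectors have norm 1, up to the rounding of the printed values.
print(np.isclose(np.linalg.norm(W), 1.0))  # True
print(np.isclose(np.linalg.norm(X), 1.0))  # True

# Remark 3: a zero dot product means W and X are orthogonal.
print(np.isclose(W @ X, 0.0))  # True

# Remark 2: W has the same slope as V(2, 1).
print(np.isclose(W[1] / W[0], 0.5))  # True

# Remark 5: flipping the sign of W keeps the basis orthonormal.
flipped = -W
print(np.isclose(np.linalg.norm(flipped), 1.0) and np.isclose(flipped @ X, 0.0))  # True
```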

## &nbsp;&nbsp;&nbsp;<span style="color:#fa04d9">Step 4.2: The "explainedVariance" attribute and the Eigen values:</span>

Let us print the proportion of variance attached to each of the two dimensions W and X studied so far, i.e. their Eigen values divided by the sum of the Eigen values. This is done using the "explainedVariance" attribute:

In [12]:
pca_model_instance.explainedVariance

DenseVector([1.0, 0.0])

**If you have followed the analysis of the PCA dimensions (W and X) described a couple of cells above, you should hopefully not be surprised by the output you are seeing. This is telling us that:**<br>
1- The first principal component, or dimension (Eigen vector W in our case) captures <span style="color:red">**100%**</span> of the variance or information in our dataset.<br>
2- The second principal component, or dimension (Eigen vector X in our case) captures <span style="color:red">**0%**</span> of the variance or information in our dataset.<br>
3- <span style="color:red">**Consequently, we can infer that our dataset is unidimensional and can be fully described with the single FIRST dimension returned by PCA.**</span>
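
For readers curious about where these numbers come from, the sketch below recomputes them outside Spark with NumPy. It is a minimal illustration, not a description of Spark's internal implementation: it builds the covariance matrix of the zero-mean points, extracts its Eigen values and Eigen vectors, and normalizes the Eigen values so that they sum to 1.

```python
import numpy as np

# Zero-mean coordinates, as produced by the standard scaler.
centered = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 2.0], [-2.0, -1.0], [-4.0, -2.0]])

# Sample covariance matrix of the two original dimensions (columns are variables).
covariance = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so both outputs are reversed to put the largest eigenvalue first.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

# Each eigenvalue divided by the total is the share of variance along that direction.
explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 4))
print(eigenvectors[:, 0])  # direction of the first component, defined up to sign
```

The first Eigen vector found this way has a slope of 0.5, like W. Its sign may differ from the one returned by Spark, which is harmless, as noted in remark 5 earlier.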

## &nbsp;&nbsp;&nbsp;<span style="color:#fa04d9"> Step 4.3: Coordinates of datapoints in the new orthonormal basis returned by PCA </span>

We now have two different bases in which we can describe our datapoints A, B, C, D and E. The original basis where all points had an X value and a Y value (two dimensions), and the new (red) basis returned by PCA where we have determined that only one vector is going to be sufficient to describe our dataset. The natural subsequent question then becomes: <span style="color:red">**So how do we find the coordinates of datapoints A, B, C, D and E in the new basis which was returned by PCA (represented in red in the figure higher up)?**</span><br><br>
Since PCA is a linear transformation of the original dimensions (**the details are not discussed in this notebook, but are widely available online**), we can obtain the new coordinates of our datapoints in the new basis by multiplying their (original) coordinates by the matrix of Eigen vectors which was discussed a few cells above. The picture below shows an example of applying the PCA transformation to the original datapoints A, B, C, D and E.

![pca_coords](https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/pictures/pca_coords.jpg)

### <span style="color:blue">Some remarks about the matrix multiplication above:</span><br>
1- The original datapoints coordinates are collected in the first matrix above, in a transposed (row) format. So point A's coordinates are shown as the first row, point B's coordinates are shown as the second row, and so on...<br><br>
2- The PCA matrix with the principal components, as it was returned by Spark and discussed above, is represented with a more precise version of the coordinates of the Eigen vectors W and X. Note that the coordinates of W and X are laid out "vertically", one vector per column.<br><br>
3- The intermediate resulting matrix shows the details of the multiplication operations which take place, and the final matrix (underneath) shows, in each row, the coordinates of the same datapoints from A to E in the target PCA basis.<br><br>
4- <span style="color:blue">**As expected, notice how all the datapoints have the same ordinate (Y coordinate) in the new basis, making it therefore useless from a predictive point of view.**</span><br><br>
5- <span style="color:blue">**One more detail needs to be addressed.**</span> Remember that the coordinates of points A to E were transformed by the standard scaler to a 0 mean set of values. So the initial matrix of coordinates for the datapoints is actually the one returned as the "scaledFeatures" earlier in this notebook rather than the one shown in the picture above. Here it is below as a reminder:<br>
![scaled_original_coords](https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/pictures/scaled_original_coords.jpg)

6- If we pass these coordinates through the same matrix multiplication as shown above, we get the following results for the PCA coordinates of points A through E:
![scaled_pca_coords](https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/pictures/scaled_pca_coords.jpg)
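
The same multiplication can be reproduced in a few lines. This sketch uses NumPy purely for illustration, with the scaled coordinates listed in data frame order (C, D, E, B, A) and the principal components copied from the printed matrix.

```python
import numpy as np

# Rows: zero-mean coordinates of C, D, E, B, A (data frame order).
scaled = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 2.0], [-2.0, -1.0], [-4.0, -2.0]])

# Columns: the principal components W and X as printed by Spark.
pc = np.array([[-0.89442719, -0.4472136],
               [-0.4472136, 0.89442719]])

# One matrix product projects every point onto both components at once.
pca_coords = scaled @ pc
print(np.round(pca_coords, 4))
```

Each row of `pca_coords` holds the coordinates of one point along W (first column) and along X (second column). The second column is 0 up to rounding of the printed components.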

### As it turns out, you do not need to perform these matrix multiplications manually. Recall the definition of our PCA model at the beginning of this notebook:
pca_model_definition = PCA(k=2, inputCol="scaledFeatures", <span style="color:blue">**outputCol="pcaFeatures"**</span>)<br><br>
Spark will actually compute the coordinates of each datapoint in the PCA basis and display them in the output column **pcaFeatures**. Let's take a look at how this works with our example.

Our fitted PCA model is available as "pca_model_instance". We have already looked at "pc" and "explainedVariance" on this model. As with all Spark models, we can also apply it to the input dataframe of datapoints (using the "transform" method) and obtain, as a result, an output dataframe to which the new coordinates in the orthonormal basis of PCA dimensions have been appended.

In [13]:
pca_output = pca_model_instance.transform(scaler_output_df)

In [14]:
pca_output.collect()

[Row(features=DenseVector([0.0, 1.0]), scaledFeatures=DenseVector([0.0, 0.0]), pcaFeatures=DenseVector([0.0, 0.0])),
 Row(features=DenseVector([2.0, 2.0]), scaledFeatures=DenseVector([2.0, 1.0]), pcaFeatures=DenseVector([-2.2361, 0.0])),
 Row(features=DenseVector([4.0, 3.0]), scaledFeatures=DenseVector([4.0, 2.0]), pcaFeatures=DenseVector([-4.4721, 0.0])),
 Row(features=DenseVector([-2.0, 0.0]), scaledFeatures=DenseVector([-2.0, -1.0]), pcaFeatures=DenseVector([2.2361, 0.0])),
 Row(features=DenseVector([-4.0, -1.0]), scaledFeatures=DenseVector([-4.0, -2.0]), pcaFeatures=DenseVector([4.4721, 0.0]))]

### We can see in the cell above:
- Each one of the original 5 datapoints as a row in the dataframe.
- Both the original and scaled coordinates before the PCA transformation.
- The PCA coordinates.
- <span style="color:blue">We can also compare for correctness (and better understanding) the results in the pcaFeatures column with the results obtained from the manual matrix multiplication a few cells higher (Scaled PCA Coords).</span>

# <span style="color:#fa04d9">Step 5: Visual representations using the Python based Brunel library.</span>

Now that our understanding of the meaning of the pcaFeatures column is solidifying, we can go one step further and use Python capabilities to plot some data. In order to do this, we will use the Brunel library which is extensively described online. If you are interested in learning more about Brunel, one good starting point would be this PDF document:  http://brunel.mybluemix.net/docs/Brunel%20Documentation.pdf

We will be plotting simple two dimensional graphs, where datapoints have X and Y coordinates. One simple input which Brunel can take in order to produce the desired graph is a Pandas dataframe. (If you are not familiar with the Python Pandas library, you can also find several tutorials online).<br><br> The first step will therefore consist of taking the pcaFeatures output provided by Spark right above and extracting it as a Pandas dataframe consisting of two columns "x" and "y".

The Spark transformation below is not the most efficient, but represents a straightforward approach to extracting the desired data from the Spark dataframe and creating a Pandas dataframe named "my_pandas_data"

In [15]:
my_pandas_data = pca_output.select(["pcaFeatures"]).rdd.map(lambda row: (float(row[0][0]), float(row[0][1]))).toDF(["x", "y"]).toPandas()

In [16]:
my_pandas_data

|   | x | y |
|---|---|---|
| 0 | 0.0 | 0 |
| 1 | -2.236068 | 0 |
| 2 | -4.472136 | 0 |
| 3 | 2.236068 | 0 |
| 4 | 4.472136 | 0 |


In [17]:
# Import the brunel library
import brunel

In [18]:
# Plot the 5 datapoints
%brunel data('my_pandas_data') point x(x) y(y) color(#selection):: width=800, height=300


# Conclusion:<br> 
We have seen in this first simple example how PCA can be used to reduce the dimensionality of a problem / dataset. In the example above, if the initial dataset were to be used as input to a predictive algorithm, we could reduce the input to a single dimension through PCA.

# <span style="color:#fa04d9">Step 6: 2D exercise.</span>

<span style="color:blue">**Insert a few blank cells below in this notebook, modify your datapoints in such a way that the resulting dataset is no longer aligned, and then rerun the principal component analysis as shown above (including the Brunel visualizations):**
- Which significant change do you expect to observe in comparison with the example covered so far?
- How many dimensions do you need to describe your new dataset?
- Would you still keep one single dimension if your data were being used for predictive purposes? Why or why not?
- Using Brunel, plot your original data and the transformation into the PCA dimension(s).
- Discuss your results with your neighbor or instructor, as you prefer.
- **Stretch goal: Keep modifying your dataset in specific ways, such as giving it a particular shape (elongated in one particular direction, etc.), keep running PCA, and check how the algorithm systematically extracts the first dimension as the direction in which your dataset has the most information.**</span>

Here is one suggested modified dataset to get started.

In [20]:
data = [(Vectors.dense([-4.8, -3.6]),),
        (Vectors.dense([-1.1, 0.4]),),
        (Vectors.dense([1.0, 0.6]),),
        (Vectors.dense([1.2, -1.8]),),
        (Vectors.dense([2.9, 0.1]),),
        (Vectors.dense([4.5, 2.2]),)
        ]

df = spark.createDataFrame(data, ["features"])

# Your exercise space starts here. Please insert additional blank cells as needed...

In [21]:
#from pyspark.ml.feature import StandardScaler
scaler_definition = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)

In [22]:
scaler_instance_trained = scaler_definition.fit(df)

In [23]:
scaler_output_df = scaler_instance_trained.transform(df)

In [24]:
scaler_output_df.show()

+-----------+--------------------+
|   features|      scaledFeatures|
+-----------+--------------------+
|[-4.8,-3.6]|[-5.4166666666666...|
| [-1.1,0.4]|[-1.7166666666666...|
|  [1.0,0.6]|[0.38333333333333...|
| [1.2,-1.8]|[0.58333333333333...|
|  [2.9,0.1]|[2.28333333333333...|
|  [4.5,2.2]|[3.88333333333333...|
+-----------+--------------------+



In [25]:
#from pyspark.ml.feature import PCA
pca_model_definition = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures") # k=2 tells PCA how many dimensions to evaluate (see above)

In [26]:
pca_model_instance = pca_model_definition.fit(scaler_output_df)

In [27]:
pca_model_instance.pc # pc = Principal Components

DenseMatrix(2, 2, [-0.8706, -0.492, -0.492, 0.8706], 0)

In [28]:
print (pca_model_instance.pc)

DenseMatrix([[-0.87058954, -0.49201002],
             [-0.49201002,  0.87058954]])


In [29]:
pca_model_instance.explainedVariance

DenseVector([0.9228, 0.0772])
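
One common way to turn such proportions into a decision about how many dimensions to keep is to accumulate them until a chosen share of the variance is reached. The sketch below illustrates this with NumPy. The 90% threshold is a hypothetical value chosen for the example, not a rule taken from this notebook.

```python
import numpy as np

# Proportions reported by explainedVariance for the modified dataset above.
explained_variance = np.array([0.9228, 0.0772])

# Hypothetical target: keep enough components to retain 90% of the variance.
threshold = 0.90

cumulative = np.cumsum(explained_variance)
# Position of the first cumulative value reaching the threshold, plus one.
components_to_keep = int(np.argmax(cumulative >= threshold)) + 1

print(components_to_keep)  # 1
```

With a stricter threshold such as 0.95, the same code would keep both components, which shows that the decision depends on how much information loss is acceptable for the task at hand.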

In [30]:
pca_output = pca_model_instance.transform(scaler_output_df)

In [31]:
pca_output.collect()

[Row(features=DenseVector([-4.8, -3.6]), scaledFeatures=DenseVector([-5.4167, -3.25]), pcaFeatures=DenseVector([6.3147, -0.1644])),
 Row(features=DenseVector([-1.1, 0.4]), scaledFeatures=DenseVector([-1.7167, 0.75]), pcaFeatures=DenseVector([1.1255, 1.4976])),
 Row(features=DenseVector([1.0, 0.6]), scaledFeatures=DenseVector([0.3833, 0.95]), pcaFeatures=DenseVector([-0.8011, 0.6385])),
 Row(features=DenseVector([1.2, -1.8]), scaledFeatures=DenseVector([0.5833, -1.45]), pcaFeatures=DenseVector([0.2056, -1.5494])),
 Row(features=DenseVector([2.9, 0.1]), scaledFeatures=DenseVector([2.2833, 0.45]), pcaFeatures=DenseVector([-2.2093, -0.7317])),
 Row(features=DenseVector([4.5, 2.2]), scaledFeatures=DenseVector([3.8833, 2.55]), pcaFeatures=DenseVector([-4.6354, 0.3094]))]

In [32]:
my_pandas_data = pca_output.select(["pcaFeatures"]).rdd.map(lambda row: (float(row[0][0]), float(row[0][1]))).toDF(["x", "y"]).toPandas()

In [59]:
my_pandas_data
#type(my_pandas_data)

|   | x | y |
|---|---|---|
| 0 | 6.314726 | -0.164362 |
| 1 | 1.125505 | 1.497559 |
| 2 | -0.801136 | 0.638456 |
| 3 | 0.205571 | -1.549361 |
| 4 | -2.209251 | -0.731658 |
| 5 | -4.635415 | 0.309364 |


In [34]:
#import brunel

In [44]:
%brunel data('my_pandas_data') point x(x) y(y) color(#selection):: width=800, height=240


# <span style="color:#fa04d9">Step 7: 3D exercise.</span>

**We have so far worked in this notebook with basic two dimensional datasets in order to get a grasp of PCA. However, as one might suspect, PCA provides added value when working with higher numbers of dimensions, although it becomes more difficult to represent things visually as the number of dimensions increases. In the exercise suggested below, it is proposed to download a three dimensional dataset and perform a PCA analysis on it to determine whether it may be simplified and how.**<br><br>
## <span style="color:red">Suggested exercise steps:</span>
<span style="color:blue">1- Download the three-dimensional (threed) dataset (code provided in the cell below).<br>
2- Define a vector assembler.<br>
3- Define a standard scaler. <br>
4- Build a pipeline with those two stages, in this order: VectorAssembler, then StandardScaler.<br>
5- Define an instance of PCA and run it on your dataset.<br>
6- Determine how many dimensions you can drop and how many you would keep, and explain why.<br>
7- What can you say about the "shape" of the dataset based on the Eigen values you are seeing?<br>
8- Confirm your conclusions from the step above by displaying the result of your PCA using Brunel.<br></span>

This cell below will download a three dimensional dataset from github (deletes any existing local version of the file before downloading)

In [45]:
#Delete the file if it exists, download a new copy from GitHub and load it into a dataframe
!rm threed_data.csv* -f
!wget https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/data/threed_data.csv

threed_data = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load('./threed_data.csv')

# Take a look at a few elements of the dataset.
threed_data.take(5)

--2017-08-31 13:55:54--  https://raw.githubusercontent.com/DScienceAtScale/SparkPCA/master/data/threed_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.180.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.180.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 559 [text/plain]
Saving to: ‘threed_data.csv’


2017-08-31 13:55:54 (151 MB/s) - ‘threed_data.csv’ saved [559/559]



[Row(Name=u'P1', X=0.0, Y=0.0, Z=2.0),
 Row(Name=u'P2', X=0.0, Y=-2.0, Z=0.0),
 Row(Name=u'P3', X=2.0, Y=0.0, Z=0.0),
 Row(Name=u'P4', X=-0.47, Y=-0.63, Z=1.84),
 Row(Name=u'P5', X=0.63, Y=0.47, Z=1.84)]

# Your exercise space starts here:
## Some hints are provided below, covering SparkML concepts mentioned in the exercise steps above, but not covered so far in this notebook, such as <span style="color:blue">VectorAssembler</span> and <span style="color:blue">Pipeline</span>.

### <span style="color:blue">Step 2</span> of the recommended steps above suggests creating a <span style="color:blue">VectorAssembler.</span> VectorAssemblers are standard SparkML transformers which have not been previously covered in this notebook. If you are not familiar with this concept, the hints below help you define a vector assembler that will take the three coordinates X, Y, Z of each point in your dataset and concatenate them into a single vector. <br><br>The first hint below is a generic example to help you understand how VectorAssembler works.

<div class="panel-group" id="accordion-14">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-14" href="#collapse1-14">
        Hint1: Click on this link to expand this cell, then copy and paste the code which will appear in a new cell just below, and execute that cell to see how VectorAssembler works. (You may subsequently delete that new cell and proceed with the exercise).</a>
      </h4>
    </div>
    <div id="collapse1-14" class="panel-collapse collapse">
      <div class="panel-body">
from pyspark.ml.linalg import Vectors <br>
from pyspark.ml.feature import VectorAssembler <br>
<br>
dataset = spark.createDataFrame( <br>
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)], <br>
    ["id", "hour", "mobile", "userFeatures", "clicked"]) <br>
<br>
assembler = VectorAssembler( <br>
    inputCols=["hour", "mobile", "userFeatures"], <br>
    outputCol="features") <br>
<br>
output = assembler.transform(dataset) <br>
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'") <br>
output.select("features", "clicked").show(truncate=False) <br>
      </div>
    </div>
  </div>

In [46]:
# If you elect to run the code from Hint1 above, you can paste it in this cell right under this comment

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
[(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
inputCols=["hour", "mobile", "userFeatures"],
outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False) 

Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+-----------------------+-------+
|features               |clicked|
+-----------------------+-------+
|[18.0,1.0,0.0,10.0,0.5]|1.0    |
+-----------------------+-------+



### <span style="color:red">Spoiler warning</span> if you want to take a crack at this first !!! This second hint contains the code you need to write to define the VectorAssembler.

<div class="panel-group" id="accordion-1">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-1">
        Hint2: If you desire additional help creating a vector assembler for your current dataset, click here to unveil the code to define a VectorAssembler, which you can copy/paste in a new cell below.</a>
      </h4>
    </div>
    <div id="collapse1-1" class="panel-collapse collapse">
      <div class="panel-body">
      # Import VectorAssembler, we haven't used it yet.<br>
      from pyspark.ml.feature import VectorAssembler <br>
      featureCols = ['X', 'Y', 'Z'] <br>
      assembler_definition = VectorAssembler(inputCols=featureCols , outputCol="features")<br>
      </div>
    </div>
  </div>

In [49]:
# If you decide to use the code from Hint2 above, you can paste it in this cell right under this comment. Otherwise, write your own code here...
# Import VectorAssembler (harmless to repeat if the Hint1 cell was already run).
from pyspark.ml.feature import VectorAssembler
featureCols = ['X', 'Y', 'Z']
assembler_definition = VectorAssembler(inputCols=featureCols, outputCol="features")

### <span style="color:blue">Step 3</span> of the recommended steps above suggests creating a <span style="color:blue">StandardScaler.</span> There are such examples higher in this notebook which you can duplicate here.

### <span style="color:red">Spoiler warning</span> if you want to take a crack at this first !!! This hint below contains the code you need to write to define the StandardScaler.

<div class="panel-group" id="accordion-2">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-2" href="#collapse1-2">
        Hint3: If you desire additional help creating a Standard Scaler for your current dataset, click here to unveil the code to define it, which you can copy/paste in a new cell below.</a>
      </h4>
    </div>
    <div id="collapse1-2" class="panel-collapse collapse">
      <div class="panel-body">
   from pyspark.ml.feature import StandardScaler <br>
   scaler_definition = StandardScaler(inputCol="features", outputCol="scaledFeatures",<br>
                        withStd=False, withMean=True) <br>
      </div>
    </div>
  </div>
</div>


In [50]:
# If you decide to use the code from Hint3 above, you can paste it in this cell right under this comment. Otherwise, write your own code here...
from pyspark.ml.feature import StandardScaler
scaler_definition = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                                   withStd=False, withMean=True)
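
With `withMean=True` and `withStd=False`, the scaler only subtracts each column's mean and leaves the spread of the data untouched. The plain-Python sketch below illustrates that arithmetic on the five 2D points from the start of this notebook. It is an illustration only, not Spark's implementation, and the variable names are made up for this example.

```python
# The five (x, y) points A..E from the beginning of the notebook.
points = [(-4, -1), (-2, 0), (0, 1), (2, 2), (4, 3)]
n = len(points)
dims = len(points[0])

# Column means: this is what a mean-centering scaler learns during fit().
means = [sum(p[i] for p in points) / n for i in range(dims)]

# Centering: this is what transform() applies to every row.
centered = [tuple(p[i] - means[i] for i in range(dims)) for p in points]

# After centering, every column averages to zero.
centered_means = [sum(p[i] for p in centered) / n for i in range(dims)]

print("means:         ", means)
print("centered:      ", centered)
print("centered means:", centered_means)
```

Centering moves the cloud of points so that it sits around the origin, which is what PCA expects. The distances between points are unchanged, because no division by the standard deviation takes place.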

### <span style="color:blue">Step 4</span> of the recommended steps above suggests creating a <span style="color:blue">Pipeline.</span> If you are not familiar with this concept, the hint below will help you define a pipeline combining the VectorAssembler and StandardScaler (just defined above), which will take your input data and produce a scaled vector ready to be fed into the PCA logic.

### <span style="color:red">Spoiler warning</span> if you want to take a crack at this first !!! This hint below contains the code you need to write to define the Pipeline and transform the data by using the pipeline.

<div class="panel-group" id="accordion-4">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-4" href="#collapse1-4">
        Hint4: If you desire additional help creating a Pipeline combining your VectorAssembler and Standard Scaler, click here to unveil the code to define it, which you can copy/paste in a new cell below.</a>
      </h4>
    </div>
    <div id="collapse1-4" class="panel-collapse collapse">
      <div class="panel-body">
       from pyspark.ml import Pipeline<br>
       <br>
       pipeline_data_prep_definition = Pipeline(stages=[assembler_definition, scaler_definition]) <br>
       pipeline_data_prep_instance = pipeline_data_prep_definition.fit(threed_data) <br>
       prepped_data_for_pca = pipeline_data_prep_instance.transform(threed_data)<br>
      </div>
    </div>
  </div>
</div>

In [51]:
# If you decide to use the code from Hint4 above, you can paste it in this cell right under this comment. Otherwise, write your own code here...
from pyspark.ml import Pipeline

pipeline_data_prep_definition = Pipeline(stages=[assembler_definition, scaler_definition])
pipeline_data_prep_instance = pipeline_data_prep_definition.fit(threed_data)
prepped_data_for_pca = pipeline_data_prep_instance.transform(threed_data)

### If you completed the previous steps of this exercise through <span style="color:blue">step 4</span>, you should now have a Spark dataframe whose data has been 'VectorAssembled' and 'Scaled' by the pipeline you just executed above. This data is now prepped for PCA (you can use the PySpark command <span style="color:blue">dataframeName.take(5)</span> to look at its first 5 rows; use the actual name of the dataframe, not 'dataframeName'). <br><br>The remaining steps in this exercise are very similar to examples previously covered in this notebook. Feel free to include additional blank cells below for the rest of your work if needed.
### Please continue using the <span style="color:blue">"suggested steps"</span> at the beginning of this exercise for guidance. They are repeated below for your convenience, with blank cells for your code.

### The few cells below are a detailed breakdown of <span style="color:blue">Step 5</span>. <br><br>Define an instance of PCA in the same way as was done earlier (Warning, what should be the value of "k" this time ?)

In [52]:
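from pyspark.ml.feature import PCA  # harmless if PCA was already imported earlier in the notebook
pca_model_definition = PCA(k=3, inputCol="scaledFeatures", outputCol="pcaFeatures") # k=3: the data has three dimensions, so evaluate all three principal components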

### Fit the PCA instance to your prepped data to obtain an actual PCA model, in the same way as was done earlier.

In [53]:
pca_model_instance = pca_model_definition.fit(prepped_data_for_pca)
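
To get a feel for what fitting a PCA model involves, the plain-Python sketch below works through the textbook recipe (center the data, build the covariance matrix, find its eigenvalues and eigenvectors) on the five 2D points from the start of this notebook. It relies on the closed-form solution for a symmetric 2x2 matrix, so it is only a small-scale illustration under that assumption and not a description of how Spark computes the result internally.

```python
import math

# The five (x, y) points A..E from the beginning of the notebook.
points = [(-4, -1), (-2, 0), (0, 1), (2, 2), (4, 3)]
n = len(points)

# Center the data on its mean.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]] (divide by n - 1).
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
trace = sxx + syy
det = sxx * syy - sxy ** 2
disc = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
lambda1 = trace / 2 + disc  # largest eigenvalue
lambda2 = trace / 2 - disc  # smallest eigenvalue

# Eigenvector for lambda1; this form is valid because sxy is non-zero here.
vx, vy = sxy, lambda1 - sxx
norm = math.hypot(vx, vy)
direction = (vx / norm, vy / norm)

# Share of the total variance carried by each component.
explained = [lambda1 / trace, lambda2 / trace]

print("eigenvalues:       ", lambda1, lambda2)
print("first component:   ", direction)
print("explained variance:", explained)
```

The first component points along the direction (2, 1), the vector V from the opening example, and the second eigenvalue is zero because the five points lie exactly on a line.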

### Now that you have an actual PCA model which was trained on your 3D data, print out the matrix of eigenvectors.

In [54]:
pca_model_instance.pc

DenseMatrix(3, 3, [-0.59, -0.7835, -0.1949, 0.5646, -0.2279, -0.7933, 0.5771, -0.5781, 0.5768], 0)
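
The flat list in this output is stored in column-major order, so each consecutive group of three values is one column of the matrix, and each column is one principal component. The plain-Python sketch below splits the printed values into columns and checks their lengths and pairwise dot products. The numbers are copied from the output above; the helper `dot` is made up for this illustration.

```python
import math

# Values copied from the DenseMatrix(3, 3, [...], 0) output above.
num_rows, num_cols = 3, 3
values = [-0.59, -0.7835, -0.1949,
          0.5646, -0.2279, -0.7933,
          0.5771, -0.5781, 0.5768]

# Column-major storage: each consecutive run of num_rows values is one column.
columns = [values[c * num_rows:(c + 1) * num_rows] for c in range(num_cols)]


def dot(u, v):
    """Dot product of two equal-length sequences (illustrative helper)."""
    return sum(a * b for a, b in zip(u, v))


# Length of each principal component.
for i, col in enumerate(columns, start=1):
    print(f"PC{i}: {col}  length = {math.sqrt(dot(col, col)):.3f}")

# Dot product of every pair of distinct principal components.
for i in range(num_cols):
    for j in range(i + 1, num_cols):
        print(f"PC{i + 1} . PC{j + 1} = {dot(columns[i], columns[j]):.3f}")
```

Lengths close to 1 and dot products close to 0 indicate that the principal components form an orthonormal set of axes, up to the rounding of the printed values.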

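### Print out the Eigen values. Note that Spark's <span style="color:blue">explainedVariance</span> reports them as proportions of the total variance, one per principal component, rather than as raw Eigen values.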

In [56]:
print(pca_model_instance.pc)
pca_model_instance.explainedVariance

DenseMatrix([[-0.59004926,  0.56460426,  0.57711689],
             [-0.78348585, -0.22785468, -0.57812816],
             [-0.19491484, -0.79328701,  0.57680493]])


DenseVector([0.5415, 0.4585, 0.0])
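
Spoiler warning if you want to answer Step 6 on your own first: one common way to read the explained variance is to accumulate it and see how many components are needed to reach a chosen share of the total. The plain-Python sketch below does this with the values printed above. The 0.99 threshold is an arbitrary assumption made for this illustration, not a rule from Spark or from this notebook.

```python
from itertools import accumulate

# Proportions copied from the DenseVector output above.
explained_variance = [0.5415, 0.4585, 0.0]

# Running total of the variance captured by the first 1, 2, 3 components.
cumulative = list(accumulate(explained_variance))

# Hypothetical cut-off: keep enough components to capture 99% of the variance.
threshold = 0.99
components_needed = next(
    i + 1 for i, total in enumerate(cumulative) if total >= threshold
)

for i, total in enumerate(cumulative, start=1):
    print(f"first {i} component(s): {total:.4f} of the variance")
print("components needed:", components_needed)
```

A component whose proportion is zero carries no variation at all, so dropping it loses no information about the dataset.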

### <span style="color:blue">Step 6:</span> This is not a coding step. Feel free to record your thoughts below. How many dimensions can you drop (if any ?) and how many should you keep ? Why ?

In [65]:
#type(prepped_data_for_pca)
prepped_data_for_pca.show()

+----+-----+-----+-----+------------------+--------------------+
|Name|    X|    Y|    Z|          features|      scaledFeatures|
+----+-----+-----+-----+------------------+--------------------+
|  P1|  0.0|  0.0|  2.0|     [0.0,0.0,2.0]|[-0.6996428571428...|
|  P2|  0.0| -2.0|  0.0|    [0.0,-2.0,0.0]|[-0.6996428571428...|
|  P3|  2.0|  0.0|  0.0|     [2.0,0.0,0.0]|[1.30035714285714...|
|  P4|-0.47|-0.63| 1.84|[-0.47,-0.63,1.84]|[-1.1696428571428...|
|  P5| 0.63| 0.47| 1.84|  [0.63,0.47,1.84]|[-0.0696428571428...|
|  P6|-0.63|-1.05| 1.58|[-0.63,-1.05,1.58]|[-1.3296428571428...|
|  P7|-0.67|-1.39| 1.28|[-0.67,-1.39,1.28]|[-1.3696428571428...|
|  P8| 1.44| 0.66| 1.22|  [1.44,0.66,1.22]|[0.74035714285714...|
|  P9|-0.62|-1.61| 1.01|[-0.62,-1.61,1.01]|[-1.3196428571428...|
| P10| 1.62| 0.62|  1.0|   [1.62,0.62,1.0]|[0.92035714285714...|
| P11|-0.53|-1.77| 0.75|[-0.53,-1.77,0.75]|[-1.2296428571428...|
| P12|  1.8| 0.51| 0.72|   [1.8,0.51,0.72]|[1.10035714285714...|
| P13|-0.39| -1.9| 0.49| 

### <span style="color:blue">Step 7:</span> This is not a coding step. Feel free to record your thoughts below. What can you say about the general "shape" of the mysterious dataset based on the Eigen values that you are seeing ? Why ?

In [None]:
%brunel data('my_pandas_data') point x(x) y(y) color(#selection):: width=800, height=240

### <span style="color:blue">Step 8:</span> Confirm your conclusions from <span style="color:blue">steps 6 & 7</span> above by displaying the resulting dataset. Reuse code from previous sections of this notebook to do this. If you need more space than just the single blank cell below, feel free to add more...

### The cell below can be the last one in your notebook. It deletes the data file which was downloaded at the beginning of the exercise.

In [None]:
# Remove the local file (not needed anymore)
!rm -f threed_data.csv*

**For questions or feedback, please contact:<br>
Mokhtar Kandil.<br>
mkandil@ca.ibm.com**<br>
IBM WW Big Data and Data Science<br>
August 2017.